
I've tried using sed for this. I've tried putting the lines of interest in variables as well.

I have two examples I want to achieve for now. Let's say I have thousands of URLs in a file called links.txt; here are the first few:

EDIT: I added another site, with a subdomain, as a more realistic example.

https://site.com/category/feed/
https://site2.org/feed/
https://site3.net/science/astronomy/feed/
https://feed.site4.info/market/feed/news.xml

and I paste these variables in the terminal:

TAG='<outline type="rss" title= text= version="RSS" xmlUrl= htmlUrl=/>'
NAMES=$(sed "s/https\:\/\///g;s/[\/].*//;s/\..*//g" links.txt)
XMLS=$(sed "s/.*/xmlUrl=\"&\"/" links.txt)
HTMLS=$(sed "s/.*/htmlUrl=\"&\"/g" links.txt)

How can I use this kind of strategy: while IFS= read -r line; do echo "$line" ... to take the "stream" of lines in links.txt, pair each line with the corresponding line of the variables above, produce the same number of lines of populated $TAG (one per link), and append the result to COMBINED.txt with >>?
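To illustrate the pairing I have in mind, here is a self-contained bash sketch with two sample links inlined (paste zips the streams; process substitution is bash-only, and the sed expressions are just one way to derive the names):

```shell
# Bash-only sketch: zip per-line "streams" with paste, then read the
# paired fields back one line at a time. Sample links are inlined so
# this runs standalone; in practice they'd come from links.txt.
links='https://site.com/category/feed/
https://site2.org/feed/'
NAMES=$(printf '%s\n' "$links" | sed 's|https://||; s|/.*||; s|\..*||')
XMLS=$(printf '%s\n' "$links" | sed 's|.*|xmlUrl="&"|')
out=$(
  while IFS=$'\t' read -r name xml; do
    printf '<outline type="rss" title="%s" text="%s" version="RSS" %s/>\n' \
      "$name" "$name" "$xml"
  done < <(paste <(printf '%s\n' "$NAMES") <(printf '%s\n' "$XMLS"))
)
printf '%s\n' "$out"
```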

This is the result I want:

<outline type="rss" title="site" text="site" version="RSS" xmlUrl="https://site.com/category/feed/" htmlUrl="https://site.com/"/>
<outline type="rss" title="site2" text="site2" version="RSS" xmlUrl="https://site2.org/feed/" htmlUrl="https://site2.org/"/>
<outline type="rss" title="site3" text="site3" version="RSS" xmlUrl="https://site3.net/science/astronomy/feed/" htmlUrl="https://site3.net/"/>
<outline type="rss" title="site4" text="site4" version="RSS" xmlUrl="https://feed.site4.info/market/feed/news.xml" htmlUrl="https://site4.info/"/>

I've tried several attempts, things like this and many others

TAG1='<outline type="rss"'
TAG2='version="RSS"'
TAG3='\/>'
echo "$XMLS" >> xmlurls.txt
sed "s/.*/$TAG1 & $TAG2 /" < xmlurls.txt | sed "s/[ \t]*$/$TAG3/" >> COMBINED.txt

I've tried modifying the variables and escaping slashes, but I frequently get an "unterminated `s' command" error.

Here's another example of what I want to do with this kind of strategy: one file has a couple of dozen lines; here are the first few:

llvm-cfi-verify
llvm-config
llvm-cov
llvm-cvtres
llvm-cxxdump
llvm-cxxfilt
llvm-diff
llvm-dis
llvm-dlltool
llvm-dwarfdump
llvm-dwp

I applied the following:

while IFS= read -r line; do echo "$line" | sed "s/.*/ --slave \/usr\/bin\/"$line"\\t\\t\\t\\t& \\t\\t\\t\\t\/usr\/bin\/"$line"-\${version}/g"; done < llvm.txt >> COMBINED 

Result:

 --slave /usr/bin/llvm-cfi-verify llvm-cfi-verify /usr/bin/llvm-cfi-verify-${version}
 --slave /usr/bin/llvm-config llvm-config /usr/bin/llvm-config-${version}
 --slave /usr/bin/llvm-cov llvm-cov /usr/bin/llvm-cov-${version}
 --slave /usr/bin/llvm-cvtres llvm-cvtres /usr/bin/llvm-cvtres-${version}
 --slave /usr/bin/llvm-cxxdump llvm-cxxdump /usr/bin/llvm-cxxdump-${version}
 --slave /usr/bin/llvm-cxxfilt llvm-cxxfilt /usr/bin/llvm-cxxfilt-${version}
 --slave /usr/bin/llvm-diff llvm-diff /usr/bin/llvm-diff-${version}
 --slave /usr/bin/llvm-dis llvm-dis /usr/bin/llvm-dis-${version}
 --slave /usr/bin/llvm-dlltool llvm-dlltool /usr/bin/llvm-dlltool-${version}
 --slave /usr/bin/llvm-dwarfdump llvm-dwarfdump /usr/bin/llvm-dwarfdump-${version}
 --slave /usr/bin/llvm-dwp llvm-dwp /usr/bin/llvm-dwp-${version}

The tab spacing wasn't pretty, but at least it's very close to what I want. However, here I am only applying one modification to one file; my goal is to chain multiple files/variable "streams" into a combined file.

I'm on Debian 13 with GNU sed, but I can also test this on Alpine, Fedora, Void, openSUSE, etc., if needed.

Thanks for reading. I tried reading the following, but it was difficult to find this kind of question; maybe the keywords I use in Google are incorrect.

EDIT: @markp-fuso, I've been able to add/remove the "https://", or add an "s" to "http://", and create new text files without that prefix, but in the end I want it kept as-is. Thanks everyone, all useful information. I guess I need to learn Perl basics now.

  • It would help if you could provide the easiest way to reproduce the problem. This works for me: echo 'xmlUrl="https://site.com/category/feed/"' | sed 's/.*/<outline type="rss" & version="RSS" /' | sed 's/[ \t]*$/\/>/' Commented Nov 21 at 4:06
  • all 3 input lines contain domains with 2 dot-delimited strings (eg, site.com); in your real world data can the domain contain more than 2 such strings (eg, a.nother.site.com)? can your real world data include ip addresses and/or ports (eg, 5.6.7.8, 4.5.6.7:260)? if any of these are possible then please update the question to a) show such examples and b) explain what should be assigned to the title and text attributes in the expected result Commented Nov 21 at 15:55
  • Your fourth entry makes the way you choose the title & text more confusing. At first I thought it was "everything except the top level domain" (left most part of the hostname) now it looks like "highest level domain except for top" (2nd from the right most part of the hostname). Commented Nov 21 at 18:32
  • Don't use sed. Since you know the structure of the files, set IFS=$'\012'. Read each file into a bash array, and iterate through the arrays to make new lines and echo each line to a new file. (Bash fiend here.) Commented Nov 22 at 16:48
  • @Wastrel Please read why-is-using-a-shell-loop-to-process-text-considered-bad-practice then, if you still think it's a good idea, you could post your comment as an answer so it can be voted/commented on appropriately to help the OP and others reading this question in future decide if they should use that approach or not. Commented Nov 24 at 18:17
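For what it's worth, the array approach suggested in the comments might be sketched like this (bash-only; the sample data below is made up for illustration):

```shell
# Read each "stream" into its own bash array, then iterate by index so
# corresponding lines stay paired. mapfile is bash-specific.
mapfile -t names <<'EOF'
site
site2
EOF
mapfile -t urls <<'EOF'
https://site.com/category/feed/
https://site2.org/feed/
EOF
out=$(
  for i in "${!names[@]}"; do
    printf '%s -> %s\n' "${names[i]}" "${urls[i]}"
  done
)
printf '%s\n' "$out"
```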

6 Answers

Answer (score 7)

how can I use this kind of strategy: while IFS= read -r line; do echo "$line"

Don't. That's not how shells are meant to be used.

For most text processing, I'd use perl which has made sed/awk obsolete since the late 80s.

$ perl -lne 'print qq{<outline type="rss" title="$1" text="$1" version="RSS" xmlUrl="$_" htmlUrl="$&/"/>} if m{^https://(?:[^/]*\.)?([^./]+)\.[^/]*}' < links.txt
<outline type="rss" title="site" text="site" version="RSS" xmlUrl="https://site.com/category/feed/" htmlUrl="https://site.com/"/>
<outline type="rss" title="site2" text="site2" version="RSS" xmlUrl="https://site2.org/feed/" htmlUrl="https://site2.org/"/>
<outline type="rss" title="site3" text="site3" version="RSS" xmlUrl="https://site3.net/science/astronomy/feed/" htmlUrl="https://site3.net/"/>
<outline type="rss" title="site4" text="site4" version="RSS" xmlUrl="https://feed.site4.info/market/feed/news.xml" htmlUrl="https://feed.site4.info/"/>

Where:

  • -ln is the sed -n mode where the expression passed to -e is evaluated for each line of input (with the current line (stripped of its delimiter with -l like in sed) in the $_ variable).
  • qq{...} is another form of "..." which like "..." allows $var expansions within (like in shells) but makes it easier to embed "s within the quoted string (see also q{...} for '...' for hard quotes).
  • m{...} similarly is like /.../ (as in awk), except it makes it easier to embed /s within. It matches $_ against a regexp (it's short for $_ =~ m{...}, like awk's /.../ is short for $0 ~ /.../). Here the regexp matches https:// at the beginning (^), optionally (?) followed by a sequence of non-/ characters and a . (which we ignore), then any number of characters other than . and / (captured into $1 thanks to the (...)), then a literal . and any number of characters other than / (the .net/.com... TLD parts).
  • In what we print if the regexp matches, $_ is the whole line as said above, $& is what is matched by the whole regexp, $1 by the first capture group.

If you add the -MEnglish option, you can replace $_ with $ARG and $& with $MATCH, or you can make it even more explicit by naming the capture groups:

perl -lne '
  print qq{<outline type="rss" title="$+{title}" text="$+{title}" version="RSS" xmlUrl="$+{feed}" htmlUrl="$+{site}/"/>}
    if m{^(?<feed>(?<site>https://(?:[^/]*\.)?(?<title>[^./]+)\.[^/]*).*)}' < links.txt

Where %+ is the associative array that maps capture group names to what they matched, each element accessed with $+{key}.

You can learn about

  • the special variables with perldoc -v '$&' or perldoc -v '%+', etc.
  • functions (like print) or operators (like m) with perldoc -f print / perldoc -f m
  • how to invoke perl with perldoc perlrun (for those -l, -n, -e options).
  • the syntax with perldoc perlsyn (perldoc -f if will point you to that).
  • Modules (such as English) with perldoc with the name of the module as argument (perldoc English).

Beware however that some systems don't come with the perl documentation installed by default. You may need an apt install perl-doc or equivalent, or read the documentation online (see links above).

Your second example would be trivial:

perl -lpe '$_ = qq{ --slave /usr/bin/$_ $_ /usr/bin/$_-\${version}}' < input 

-lp is the sed without -n mode, so same as with -n except that $_ is -printed after the -expression has been evaluated.

  • The 4th output line <outline type="rss" title="site4" text="site4" version="RSS" xmlUrl="https://feed.site4.info/market/feed/news.xml" htmlUrl="https://feed.site4.info/"/> doesn't match the expected output <outline type="rss" title="site4" text="site4" version="RSS" xmlUrl="https://feed.site4.info/market/feed/news.xml" htmlUrl="https://site4.info/"/>. Note the feed. in the actual htmlUrl="https://feed.site4.info/" vs the expected htmlUrl="https://site4.info/". Commented Nov 24 at 17:25
Answer (score 6)

What I would do: use a template engine, here Perl's tpage from the Template::Toolkit module; it's the clean and maintainable way.

Template:

cat rss.tmpl
<outline type="rss" title="[% title %]" text="[% title %]" version="RSS" xmlUrl="[% xmlUrl %]" htmlUrl="[% htmlUrl %]"/>

Input:

cat input
site.com/category/feed/
site2.org/feed/
site3.net/science/astronomy/feed/

Shell code:

while read url; do
    url=${url%/}
    domain=${url%%/*}
    title=${domain%%.*}
    htmlurl=$domain/
    tpage --define title=$title --define xmlUrl=$url/ --define htmlUrl=$htmlurl rss.tmpl
done < input

Output:

<outline type="rss" title="site" text="site" version="RSS" xmlUrl="site.com/category/feed/" htmlUrl="site.com/"/>
<outline type="rss" title="site2" text="site2" version="RSS" xmlUrl="site2.org/feed/" htmlUrl="site2.org/"/>
<outline type="rss" title="site3" text="site3" version="RSS" xmlUrl="site3.net/science/astronomy/feed/" htmlUrl="site3.net/"/>

Install via system package, example for Debian and derivatives:

apt install libtemplate-perl 

Or with Perl's cpan utility:

cpan Template 

If you prefer Python's Jinja, check jinja2-cli

cat rss.j2
<outline type="rss" title="{{ title }}" text="{{ title }}" version="RSS" xmlUrl="{{ xmlUrl }}" htmlUrl="{{ htmlUrl }}"/>

Code:

while read url; do
    url=${url%/}
    domain=${url%%/*}
    title=${domain%%.*}
    htmlurl=$domain/
    jinja2 --format=env rss.j2 <<EOF
title=$title
xmlUrl=$url/
htmlUrl=$htmlurl
EOF
done < input

Install Python's module:

pip install jinja2-cli 
Answer (score 4)

For the 1st data set ...

Assumptions/understandings:

  • the title, text and htmlUrl attributes are derived from the last 2 dot-delimited strings in the domain (eg, for feed.site4.info we're interested in site4 and site4.info)
  • we do not need to worry about ip addresses and/or ports; otherwise OP will need to provide details on how to parse said addresses (eg, 5.6.7.8, 4.5.6.7:260) into the title and text attributes
  • input will be from a pipe

One awk idea:

$ cat links.awk
BEGIN { FS = "/" }                              # use "/" as input field delimiter
{
  gsub(/^[[:space:]]*|[[:space:]]*$/,"")        # strip leading/trailing white space
  n = split($3, a, /[.]/)                       # split 3rd field (the domain) on periods and place results in array a[]
  printf "<outline type=\"rss\" title=\"%s\" text=\"%s\" version=\"RSS\" xmlUrl=\"%s\" htmlUrl=\"%s\"/>\n",
         a[n-1], a[n-1], $0, $1 FS $2 FS a[n-1] "." a[n] FS
}

NOTE: the gsub(...) line is effectively a 'no-op' if there is no leading and/or trailing white space and could be removed (or commented out with a leading #) if OP is 100% sure there's no need to worry about leading/trailing white space

Taking for a test drive:

$ cat links.txt | awk -f links.awk 

This generates:

<outline type="rss" title="site" text="site" version="RSS" xmlUrl="https://site.com/category/feed/" htmlUrl="https://site.com/"/>
<outline type="rss" title="site2" text="site2" version="RSS" xmlUrl="https://site2.org/feed/" htmlUrl="https://site2.org/"/>
<outline type="rss" title="site3" text="site3" version="RSS" xmlUrl="https://site3.net/science/astronomy/feed/" htmlUrl="https://site3.net/"/>
<outline type="rss" title="site4" text="site4" version="RSS" xmlUrl="https://feed.site4.info/market/feed/news.xml" htmlUrl="https://site4.info/"/>

For the 2nd data set ...

Assumptions:

  • the intention is to generate visually aligned columns
  • input will be from a pipe

One idea using sed and column:

$ cat llvm.sed
s/(.*)/--slave \/usr\/bin\/\1 & \/usr\/bin\/\1-\${version}/g

Taking for a test drive:

$ cat llvm.txt | sed -E -f llvm.sed | column -t 

This generates:

--slave  /usr/bin/llvm-cfi-verify  llvm-cfi-verify  /usr/bin/llvm-cfi-verify-${version}
--slave  /usr/bin/llvm-config      llvm-config      /usr/bin/llvm-config-${version}
--slave  /usr/bin/llvm-cov         llvm-cov         /usr/bin/llvm-cov-${version}
--slave  /usr/bin/llvm-cvtres      llvm-cvtres      /usr/bin/llvm-cvtres-${version}
--slave  /usr/bin/llvm-cxxdump     llvm-cxxdump     /usr/bin/llvm-cxxdump-${version}
--slave  /usr/bin/llvm-cxxfilt     llvm-cxxfilt     /usr/bin/llvm-cxxfilt-${version}
--slave  /usr/bin/llvm-diff        llvm-diff        /usr/bin/llvm-diff-${version}
--slave  /usr/bin/llvm-dis         llvm-dis         /usr/bin/llvm-dis-${version}
--slave  /usr/bin/llvm-dlltool     llvm-dlltool     /usr/bin/llvm-dlltool-${version}
--slave  /usr/bin/llvm-dwarfdump   llvm-dwarfdump   /usr/bin/llvm-dwarfdump-${version}
--slave  /usr/bin/llvm-dwp         llvm-dwp         /usr/bin/llvm-dwp-${version}

NOTE: this assumes the input does not contain white space (column's default field delimiter); otherwise we could modify the sed and column calls to use a different character as the field delimiter

Answer (score 2)

Not sure about sed. But you can do it using awk and paste and some temporary files:

$ awk -F/ '{print $3}' links.txt | tee long | awk -F. '{print $1}' > short
$ paste links.txt long short > final
$ awk '{print "<outline type=\"rss\" title=\""$3"\" text=\""$3"\" version=\"RSS\" xmlUrl=\""$1"\" htmlUrl=\""$2"\"/>"}' final
<outline type="rss" title="site" text="site" version="RSS" xmlUrl="https://site.com/category/feed/" htmlUrl="site.com"/>
<outline type="rss" title="site2" text="site2" version="RSS" xmlUrl="https://site2.org/feed/" htmlUrl="site2.org"/>
<outline type="rss" title="site3" text="site3" version="RSS" xmlUrl="https://site3.net/science/astronomy/feed/" htmlUrl="site3.net"/>
$ rm long short final
Answer (score 2)

Regarding your shell variable assignments and splitting parts into different files, there's no need for any of that:

sed -E 's@(https://([^.]+\.)*([^.]+)\.[^./]+/).*@<outline type="rss" title="\3" text="\3" version="RSS" xmlUrl="&" htmlUrl="\1"/>@' links.txt 

Points of what may be new to you from the above:

  • s/// can use a different character besides /. I use @ above.
  • (), in the pattern, captures. You can use \1, \2, etc. to insert what you captured into the replacement. The captures are numbered by the position of the left parenthesis, even if they're nested.
  • & inserts into the replacement the match of the entire pattern.
  • -E activates Extended Regular Expressions, so that we don't need to escape + in the pattern, etc.
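Those four points can all be seen in a throwaway one-liner (the input string is made up, just for demonstration):

```shell
# @ as the s/// delimiter, ( ) capturing into \1, & for the whole
# match, and -E so + needs no backslash escaping.
out=$(echo 'abc123def' | sed -E 's@([0-9]+)@[&/\1]@')
printf '%s\n' "$out"
```

Note the unescaped / inside the replacement: with @ as the delimiter it needs no special treatment.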

Regarding:

while IFS= read -r line; do echo "$line" | sed "..."; done < llvm.txt >> COMBINED 

sed is able to work with inputs of multiple lines. Why are you spoon-feeding it line-by-line with the shell? You can remove the while:

sed "..." llvm.txt >> COMBINED 
Answer (score 0)

Your second example (the llvm stuff) is a bjillion times easier than your first example.

THING=llvm-cfi-verify
echo "--slave /usr/bin/${THING} ${THING} /usr/bin/${THING}-\${version}"

This is because you're not doing anything to process the input, just sticking it in the output unchanged.
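A minimal loop in that spirit (only two tool names inlined here, purely for illustration):

```shell
# Pure substitution: each input name is dropped into a fixed template.
# \${version} stays literal so the output can be pasted into a script.
out=$(
  for THING in llvm-cov llvm-dwp; do
    echo "--slave /usr/bin/${THING} ${THING} /usr/bin/${THING}-\${version}"
  done
)
printf '%s\n' "$out"
```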

You may have heard the saying "make a tool that does one thing and does it well." If you look at that from the point of view of a user of those tools, that means, "if your tool isn't awesome for this job, there's probably a better one out there."

Instead of mastering sed, you should get to be familiar with a wide range of tools and which situations they're good for. You can look up how to do things with a tool you know about, but it's harder to realize you need a tool you've never heard of.

I would use totally different approaches to the two problems you've shown us. I would probably use Perl or Python for the first problem and a three-line bash script for the 2nd.
