
I am trying to normalise a data file using the number of lines in a previous version of the data file. After reading these questions, I thought this could work:

awk -v num=$(wc -l my_first_file.bed) '{print $1, $2, $3, $4/num}' my_other_file.bed 

but it throws this error:

awk: cmd. line:1: my_first_file.bed
awk: cmd. line:1: ^ syntax error

Protecting the . with a backslash doesn't change anything, nor does using backticks instead of $().

How can I use the output of wc -l as an awk variable? This will all be happening inside a snakemake pipeline, so I am somewhat limited in terms of flexibility.

Contents of my_other_file.bed:

chrUn_KI270548v1 0 50 0.00000
chrUn_KI270548v1 50 192 1.00000
chrUn_KI270548v1 192 497 0.00000
chrUn_KI270548v1 497 639 1.00000
chrUn_KI270548v1 639 723 0.00000
chrUn_KI270548v1 723 860 1.00000
chrUn_KI270548v1 860 865 2.00000
chrUn_KI270548v1 865 879 1.00000
chrUn_KI270548v1 879 991 2.00000
chrUn_KI270548v1 991 1002 3.00000
chrUn_KI270548v1 1002 1021 2.00000
chrUn_KI270548v1 1021 1093 1.00000
chrUn_KI270548v1 1093 1133 2.00000
chrUn_KI270548v1 1133 1222 1.00000
chrUn_KI270548v1 1222 1235 2.00000
chrUn_KI270548v1 1235 1364 1.00000
chrUn_KI270590v1 0 16 4.00000
chrUn_KI270590v1 16 46 5.00000
chrUn_KI270590v1 46 48 6.00000
chrUn_KI270590v1 48 95 7.00000
chrUn_KI270590v1 95 117 8.00000
chrUn_KI270590v1 117 130 9.00000
chrUn_KI270590v1 130 136 8.00000
chrUn_KI270590v1 136 138 7.00000
chrUn_KI270590v1 138 139 6.00000

3 Answers

6

wc -l filename will output a line containing two columns; the number of lines and the filename:

$ wc -l .profile
27 .profile

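Because the command substitution in your command is unquoted, the shell splits this output into two words. Only the first becomes the value of num; the filename is then taken as the awk program, which is what produces the syntax error.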

If you redirect the contents of the file into wc -l, then the wc utility will not be able to output the file's name and will only output the number of newlines in the file:

$ wc -l <.profile
27

So, change your code to this:

awk -v num=$(wc -l <my_first_file.bed) '{print $1, $2, $3, $4/num}' my_other_file.bed 
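To check the fix end to end, here is a small self-contained sketch. The file names first.bed and other.bed and their contents are made up for illustration, and the double quotes around the expansion are an addition on my part so that awk is guaranteed to receive a single num=... argument.

```shell
# Illustrative data only: file names and contents are made up for this sketch.
tmpdir=$(mktemp -d)
printf 'a\nb\nc\nd\n' > "$tmpdir/first.bed"                      # 4 lines
printf 'chr1 0 50 2.0\nchr1 50 90 8.0\n' > "$tmpdir/other.bed"

# Redirecting the file into wc keeps the filename out of its output.
num=$(wc -l < "$tmpdir/first.bed")

# Quote the expansion so awk sees exactly one "num=..." argument.
result=$(awk -v num="$num" '{ print $1, $2, $3, $4/num }' "$tmpdir/other.bed")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

With four lines in the first file, the fourth column of each output line should be the original value divided by 4.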

Alternatively, let awk do the counting:

awk 'FNR == NR { lines++; next } { print $1, $2, $3, $4/lines }' my_first_file.bed my_other_file.bed 

or,

awk 'FNR == NR { lines++; next } { $4 /= lines; print }' my_first_file.bed my_other_file.bed 

Here, we give awk both files to work with, but while reading the first file, all we do is increment the lines variable. When starting to read the second file, the FNR == NR condition is no longer true (the number of records read from the current file is no longer the same as the number of records read overall), and we start executing the second block instead.

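This assumes that the first file is never empty. If it is, FNR == NR stays true while awk reads the second file, so its lines are counted rather than printed and nothing is output.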
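To see how the two counters behave, the following sketch prints NR and FNR for every record of two small files. The file names first.txt and second.txt and their contents are hypothetical, chosen only for the demonstration.

```shell
# Illustrative files; names and contents are made up for this sketch.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'x\ny\nz\n' > "$tmpdir/second.txt"

# NR keeps counting across files; FNR restarts at 1 for each new file.
counters=$(awk '{ print NR, FNR, (FNR == NR ? "first file" : "later file") }' \
    "$tmpdir/first.txt" "$tmpdir/second.txt")
printf '%s\n' "$counters"

rm -rf "$tmpdir"
```

The two counters agree only on the lines of the first file, which is what the FNR == NR test relies on.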

If you want the output to be tab-delimited, then don't forget to set OFS="\t" for awk.
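A minimal sketch of the tab-delimited variant, assuming a hypothetical line count of 4 passed in as num and a single made-up input line. Setting OFS with -v works because awk interprets the \t escape in the assignment.

```shell
# num=4 and the input line are placeholders for this sketch.
# Assigning to $4 makes awk rebuild the record using OFS as the separator.
tabbed=$(printf 'chr1 0 50 2.0\n' |
    awk -v OFS='\t' -v num=4 '{ $4 /= num; print }')
printf '%s\n' "$tabbed"
```

The same -v OFS='\t' option should work with the print $1, $2, $3, $4/lines form too, since the commas in print are replaced by OFS.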

  • Or awk 'lines=(NR-FNR){$4 /= lines; print}' my_first_file.bed my_other_file.bed. Commented Jun 20, 2024 at 12:00
  • Or even shorter: awk 'NR>FNR{$4/=(NR-FNR); print}' ... Commented Jun 20, 2024 at 13:21
  • Just realised that I probably do want to use the num=$(wc -l <my_first_file.bed) because I'm getting the other data from a pipe... any way to use the FNR trick with piped data? How do I tell awk which data to read first? Commented Jul 10, 2024 at 14:02
  • @Whitehot If you want to pipe the data to the awk command then just replace the name of the file whose data you're piping with /dev/stdin on awk's command line. Commented Jul 10, 2024 at 16:00
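The /dev/stdin suggestion from the last comment could look roughly like this. The file name first.bed and the piped lines are made up for the example; awk reads its operands in the order given, so the file to count comes first and the pipe second.

```shell
# Illustrative data only: the file name and contents are made up.
tmpdir=$(mktemp -d)
printf 'a\nb\nc\nd\n' > "$tmpdir/first.bed"      # 4 lines to count

# The data to normalise arrives on a pipe; /dev/stdin stands in for its filename.
piped=$(printf 'chr1 0 50 2.0\nchr1 50 90 8.0\n' |
    awk 'NR == FNR { lines++; next } { print $1, $2, $3, $4/lines }' \
        "$tmpdir/first.bed" /dev/stdin)
printf '%s\n' "$piped"

rm -rf "$tmpdir"
```

On systems without a /dev/stdin device, a lone - as the file operand should do the same job, since awk treats it as standard input.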
3

Try running just wc -l my_first_file.bed:

$ wc -l my_first_file.bed
24 my_first_file.bed

So, your command gets expanded by the shell to

awk -v num=24 my_first_file.bed '{print $1, $2, $3, $4/num}' my_other_file.bed

which makes my_first_file.bed your Awk command, which of course is not valid Awk syntax.
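You can watch the shell do this splitting with a small helper that prints the arguments it receives. The function name show_args and the file first.bed are hypothetical, introduced only for this sketch.

```shell
# show_args is a made-up helper: it prints the argument count, then each argument in <>.
show_args() { printf '%d:' "$#"; printf ' <%s>' "$@"; printf '\n'; }

tmpdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmpdir/first.bed"

# Unquoted: the output of wc is split on whitespace into separate arguments.
unquoted=$(show_args num=$(wc -l "$tmpdir/first.bed"))

# Quoted: the whole output of wc stays inside one argument.
quoted=$(show_args num="$(wc -l "$tmpdir/first.bed")")

printf '%s\n%s\n' "$unquoted" "$quoted"
rm -rf "$tmpdir"
```

In the unquoted case the filename arrives as its own argument, which is the one awk then tries to parse as a program. Quoting alone is not a fix here, because num would then contain the filename as well as the count.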

One way you could solve this would be to change your wc -l my_first_file.bed command to only output the first column. For example, something like this:

awk -v num=$(wc -l my_first_file.bed | cut -d' ' -f1) '{print $1, $2, $3, $4/num}' my_other_file.bed 

This makes cut split the output of wc on spaces and keep only the first field, so that just the number of lines is passed to your variable.
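One caveat: some wc implementations (macOS and the BSDs, for example) pad the count with leading spaces, in which case the first space-delimited field is empty. A hedged alternative is to let awk pick the first field, since its default field splitting ignores leading blanks. The file name first.bed is made up for this sketch.

```shell
# Illustrative file; name and contents are made up.
tmpdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmpdir/first.bed"

# awk's default field splitting skips leading blanks, unlike cut -d' '.
count=$(wc -l "$tmpdir/first.bed" | awk '{ print $1 }')
printf '%s\n' "$count"

rm -rf "$tmpdir"
```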

1

You can do this completely within awk by passing two files as operands:

awk 'NR==FNR{lines++;next} {print $1,$2,$3,$4/lines}' my_first_file.bed my_other_file.bed 

This will do the following:

  • While processing the first file (indicated by NR, the global line counter, being equal to FNR, the per-file line counter), we simply increment the line count and then skip to the next line.
  • While processing the second file, we print all columns but divide the fourth by the value of lines, which is no longer incremented because NR is now larger than FNR.

Note that this will not work if my_first_file.bed is empty.
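If an empty first file is a real possibility, one way to guard against it is to compare FILENAME with ARGV[1] instead of relying on NR==FNR, and to bail out when nothing was counted. This is only a sketch under those assumptions: the file names are made up, and writing to /dev/stderr relies on the awk implementation or the operating system providing that name.

```shell
# Illustrative files: first.bed is deliberately empty.
tmpdir=$(mktemp -d)
: > "$tmpdir/first.bed"
printf 'chr1 0 50 2.0\n' > "$tmpdir/other.bed"

guarded=$(awk '
    FILENAME == ARGV[1] { lines++; next }    # true only while reading the first operand
    lines == 0 { print "first file is empty" > "/dev/stderr"; exit 1 }
    { print $1, $2, $3, $4/lines }
' "$tmpdir/first.bed" "$tmpdir/other.bed" 2>/dev/null)
status=$?                                    # exit status of the awk command above

printf 'status=%s output=%s\n' "$status" "$guarded"
rm -rf "$tmpdir"
```

Here awk exits with a non-zero status instead of silently printing nothing, which is probably easier to catch in a pipeline.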
