I have a very large file (snippet below). I need to remove any lines where the number in the first column does not increase consecutively from the line above.

For example, I want to keep the first line from my snippet, where the identifier in the first column is "40812." Then I want to preserve the row where "40813" is in the first column (line 3 in my example) and then the row that starts with "40814," and so on. I want to delete any lines that violate this succession, such as the second row.

I have looked here at previous questions/answers for possible solutions and so far have had no success. A solution that has appeared in several questions is:

awk -F',' '!seen[$1]++' myFile

I adapted another solution I saw as:

sort -t':' -k 1,1 -u myFile 

If anyone could please tell me where I'm going wrong, I would be very grateful. I'm not very experienced with file manipulation.

40812 20406.000000 0.843859468 1083.209050130 -994.562279080 -993.349611938 22.120868921
40829 20414.500000 0.891283743 1144.084593627 -994.539001565 -993.349739827 21.177788019
40813 20406.500000 0.829362077 1064.599666089 -994.546948121 -993.348764740 22.087239027
40830 20415.000000 0.889606427 1141.931529727 -994.537943593 -993.350242614 21.282490969
40814 20407.000000 0.822524589 1055.822814442 -994.540118434 -993.348757318 22.083606005
40831 20415.500000 0.875230513 1123.478077086 -994.523844766 -993.350421831 20.606467962
40815 20407.500000 0.823511602 1057.089780943 -994.541681744 -993.349315083 22.432111979
40832 20416.000000 0.846150258 1086.149592126 -994.494220141 -993.349798791 22.309054136
40816 20408.000000 0.824550451 1058.423286012 -994.543159511 -993.349731194 22.481428146
40833 20416.500000 0.811604775 1041.805740021 -994.458563132 -993.348626225 21.118428946
40834 20417.000000 0.787796672 1011.244783236 -994.434062658 -993.347887110 20.963790894
40817 20408.500000 0.819160081 1051.504008955 -994.537767061 -993.349702160 22.268819809
40835 20417.500000 0.784857495 1007.471947645 -994.431441227 -993.348167742 20.731789112
40818 20409.000000 0.807571275 1036.628191427 -994.525675417 -993.349169067 22.332761049
40836 20418.000000 0.799208319 1025.893192994 -994.446595759 -993.348938468 21.268665075
40819 20409.500000 0.797104599 1023.192780242 -994.514563564 -993.348491176 22.622548103
40837 20418.500000 0.819797939 1052.322786256 -994.467698852 -993.349417295 21.013041973
40820 20410.000000 0.796605925 1022.552664951 -994.513928312 -993.348319789 22.193170071
1 Answer

This is exactly the sort of thing that awk excels at:

$ awk '{ if(NR==1 || $1 == last+1){print; last=$1}}' file
40812 20406.000000 0.843859468 1083.209050130 -994.562279080 -993.349611938 22.120868921
40813 20406.500000 0.829362077 1064.599666089 -994.546948121 -993.348764740 22.087239027
40814 20407.000000 0.822524589 1055.822814442 -994.540118434 -993.348757318 22.083606005
40815 20407.500000 0.823511602 1057.089780943 -994.541681744 -993.349315083 22.432111979
40816 20408.000000 0.824550451 1058.423286012 -994.543159511 -993.349731194 22.481428146
40817 20408.500000 0.819160081 1051.504008955 -994.537767061 -993.349702160 22.268819809
40818 20409.000000 0.807571275 1036.628191427 -994.525675417 -993.349169067 22.332761049
40819 20409.500000 0.797104599 1023.192780242 -994.514563564 -993.348491176 22.622548103
40820 20410.000000 0.796605925 1022.552664951 -994.513928312 -993.348319789 22.193170071

Or, a little golfed:

$ awk '(NR==1 || $1 == last+1) && last=$1' file
40812 20406.000000 0.843859468 1083.209050130 -994.562279080 -993.349611938 22.120868921
40813 20406.500000 0.829362077 1064.599666089 -994.546948121 -993.348764740 22.087239027
40814 20407.000000 0.822524589 1055.822814442 -994.540118434 -993.348757318 22.083606005
40815 20407.500000 0.823511602 1057.089780943 -994.541681744 -993.349315083 22.432111979
40816 20408.000000 0.824550451 1058.423286012 -994.543159511 -993.349731194 22.481428146
40817 20408.500000 0.819160081 1051.504008955 -994.537767061 -993.349702160 22.268819809
40818 20409.000000 0.807571275 1036.628191427 -994.525675417 -993.349169067 22.332761049
40819 20409.500000 0.797104599 1023.192780242 -994.514563564 -993.348491176 22.622548103
40820 20410.000000 0.796605925 1022.552664951 -994.513928312 -993.348319789 22.193170071

Explanation

  • if(NR==1 || $1 == last+1) : NR is the current line number, so NR==1 is only true while reading the first line of the file. We need this so that the first line is always printed. Then, $1 == last+1 is true if the first field of the line ($1) equals the value stored in the variable last plus 1. Taken together, this means "if this is the first line or if the first field is equal to last + 1", which defines your target lines.
  • print; last=$1 : If either of the two conditions explained above is true, print the line and set the value of last to be the first field of this line so we can process the next.
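
To make the two bullets above concrete, here is a minimal sketch that wraps the same test in a shell function and runs it on a five-line sample. The function name keep_consecutive and the sample rows are made up for illustration; only the awk condition comes from the commands above.

```shell
# keep_consecutive: hypothetical helper name, not a standard tool.
# Reads the named files (or stdin if none are given) and prints the first
# line plus every later line whose first field is exactly one more than the
# first field of the most recently printed line.
keep_consecutive() {
    awk '
        NR == 1 || $1 == last + 1 {
            print        # keep this line
            last = $1    # remember the identifier we just kept
        }
    ' "$@"
}

# Sample input: two interleaved series, as in the question (values shortened).
printf '%s\n' '40812 a' '40829 b' '40813 c' '40830 d' '40814 e' | keep_consecutive
# prints:
# 40812 a
# 40813 c
# 40814 e
```

Note that last is only updated when a line is printed, so a rejected line such as 40829 never disturbs the sequence being followed.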
  • Thank you very much, terdon, for this detailed explanation. Apologies in advance for what is sure to be a newbie issue... I tried your first solution and nothing seemed to have changed. The second solution (which, if I understand correctly, is just a shortened version of the first?) did lead to the correct screen output. However, the file is unchanged. Commented Nov 28, 2019 at 15:12
  • @user3292696 never apologize for being a newbie! None of us were born knowing this stuff! Now, I don't know why the first wouldn't work, they do the same thing with the same logic. But neither will change the original file, they just print the output to the terminal. What you want is to save that output as a new file: awk '(NR==1 || $1 == last+1) && last=$1' file > newfile Commented Nov 28, 2019 at 15:14
  • Well, I know we're not supposed to spam by saying "thank you," but I hope it's OK to make an exception in this case. Thank you for your patience, terdon. Really appreciate it. Awk is very challenging for me. Your fix worked. Commented Nov 28, 2019 at 15:30
  • With the magic of Linux, you can use the same awk command to pluck out the alternating series too. You just need to hide the first line of the file, like: tail -n +2 file | Terdon'sAwk. Then it sees 40829 as its first line, and makes a series based on that. Commented Nov 28, 2019 at 16:33
  • That's great to know, @Paul_Pedant. I would rather preserve both series. I have this problem because I was sloppy and submitted one simulation twice to the local supercomputer. It is not easy to rerun simulations, so I want to use whatever output I can. Commented Nov 28, 2019 at 16:58
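
Putting the last few comments together, both series can be saved to separate files without touching the original. This is only a sketch: the names keep_consecutive and split_series and the output file names are placeholders, and it assumes the second series starts on line 2 of the input, as it does in the snippet from the question.

```shell
# Same filter as in the answer, wrapped in a function (hypothetical name).
keep_consecutive() {
    awk 'NR == 1 || $1 == last + 1 { print; last = $1 }' "$@"
}

# split_series INPUT OUT1 OUT2 (hypothetical helper)
#   OUT1 gets the series that starts on line 1 of INPUT.
#   OUT2 gets the series that starts on line 2 of INPUT.
# INPUT itself is never modified.
split_series() {
    keep_consecutive "$1" > "$2"
    # Hide the first line so awk treats line 2 as the start of the series.
    tail -n +2 "$1" | keep_consecutive > "$3"
}

# Example call (file names are placeholders):
# split_series myFile series1.txt series2.txt
```

Each output file then holds one run of consecutively numbered rows. If either series has a gap, the filter stops following it at the gap, so it is worth checking the line counts of the two outputs against the input.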
