edited Apr 5, 2019 at 12:50

You can do this very easily with awk:

    $ awk 'NR==FNR{a[$1]=$2; next}{$1=a[$1]; print}' file2 file1
    GCF_000014165.1_ASM1416v1_protein.faa WP_011558474.1 1155234 1156286 44173
    GCF_000014165.1_ASM1416v1_protein.faa WP_011558475.1 1156298 1156807 12
    GCF_000014165.1_ASM1416v1_protein.faa WP_011558476.1 1156804 1157820 -3
    GCF_000015405.1_ASM1540v1_protein.faa WP_011558474.1 1159543 1160595 42748
    GCF_000015405.1_ASM1540v1_protein.faa WP_011558475.1 1160607 1161116 12
    GCF_000015405.1_ASM1540v1_protein.faa WP_011558476.1 1161113 1162129 -3
    GCF_000016005.1_ASM1600v1_protein.faa WP_011559727.1 2481079 2481633 8
    GCF_000016005.1_ASM1600v1_protein.faa WP_011854835.1 1163068 1164120 42559
    GCF_000016005.1_ASM1600v1_protein.faa WP_011854836.1 1164127 1164636 7

Or, since that looks like a tab-separated file:

    $ awk -vOFS="\t" 'NR==FNR{a[$1]=$2; next}{$1=a[$1]; print}' file2 file1
    GCF_000014165.1_ASM1416v1_protein.faa	WP_011558474.1	1155234	1156286	44173
    GCF_000014165.1_ASM1416v1_protein.faa	WP_011558475.1	1156298	1156807	12
    GCF_000014165.1_ASM1416v1_protein.faa	WP_011558476.1	1156804	1157820	-3
    GCF_000015405.1_ASM1540v1_protein.faa	WP_011558474.1	1159543	1160595	42748
    GCF_000015405.1_ASM1540v1_protein.faa	WP_011558475.1	1160607	1161116	12
    GCF_000015405.1_ASM1540v1_protein.faa	WP_011558476.1	1161113	1162129	-3
    GCF_000016005.1_ASM1600v1_protein.faa	WP_011559727.1	2481079	2481633	8
    GCF_000016005.1_ASM1600v1_protein.faa	WP_011854835.1	1163068	1164120	42559
    GCF_000016005.1_ASM1600v1_protein.faa	WP_011854836.1	1164127	1164636	7

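This assumes that every RefSeq (NC_*) ID in file1 has a corresponding entry in file2. An ID with no entry would end up with an empty 1st field in the output, because a[$1] is then an empty string.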
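
If some IDs might be missing from file2, one possible variation (a sketch only, not something the question asks for) is to leave the 1st field untouched when there is no mapping. The file names map.tsv and data.tsv and their contents below are made-up stand-ins for file2 and file1:

```shell
# Hypothetical sample data standing in for file2 (the mapping) and file1.
tmpdir=$(mktemp -d)
printf 'NC_1\tGCF_A.faa\nNC_2\tGCF_B.faa\n' > "$tmpdir/map.tsv"
printf 'NC_1\tWP_1\t10\nNC_9\tWP_2\t20\n' > "$tmpdir/data.tsv"

# Replace the 1st field only when its key was seen in the mapping file.
# "($1 in a)" tests for the key without creating an empty array entry.
result=$(awk -v OFS='\t' 'NR==FNR{a[$1]=$2; next} ($1 in a){$1=a[$1]} {print}' \
    "$tmpdir/map.tsv" "$tmpdir/data.tsv")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With this sample data, the NC_1 line gets its 1st field replaced by GCF_A.faa, while the NC_9 line, which has no mapping, is printed unchanged.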

Explanation

NR==FNR : NR is the number of lines read so far across all input files, FNR is the line number within the current file. The two will be identical only while the 1st file (here, file2) is being read, provided that file is not empty.

a[$1]=$2; next: if this is the first file (see above), save the 2nd field in an array whose key is the 1st field. Then, move on to the next line. This ensures the next block isn't executed for the 1st file.

{$1=a[$1]; print} : now, in the second file, set the 1st field to whatever value was saved in the array a for the 1st field (so, the associated value from file2) and print the resulting line.
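
To see why NR==FNR singles out the first file, a small sketch can print both counters for every line. The two temporary files (first, second) and their contents are invented purely for illustration:

```shell
# Two tiny throwaway files: 2 lines in the first, 3 in the second.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first"
printf 'c\nd\ne\n' > "$tmpdir/second"

# For each line, print the file name, NR (running total) and FNR (per-file counter).
counters=$(cd "$tmpdir" && awk '{print FILENAME, NR, FNR}' first second)

printf '%s\n' "$counters"
rm -rf "$tmpdir"
```

For the first file the two numbers match (1 1, then 2 2). Once awk moves on to the second file, NR keeps counting (3, 4, 5) while FNR restarts at 1, so NR==FNR is no longer true.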


answered Apr 5, 2019 at 12:38

terdon ♦
