Compare two files

Question

I have two files in UNIX. 1st file is Entity and Second File is References. 1st File has only one column named Entity ID and 2nd file has two columns Entity ID | Person ID.

I want to produce a output file where entity id's are matching in both the files.

Entity File

624197 624252 624264 624276 624280 624309 624317

Reference File

624252|624346 624264|1070122 624264|624346 624276|624588 624280|624346 624280|624582 624298|624588 624319|333008 624330|624588

Output File

624252|624346 624264|1070122 624264|624346 624276|624588 624280|624346 624280|624582

Entity Files has 90K Records and Reference file has 200K Records. Is there an efficient way to produce the third file ? Any solution is appreciated.

The two files are sorted?

enzotib
– enzotib

2011-12-06 22:01:55 +00:00
Commented Dec 6, 2011 at 22:01 — enzotib
– enzotib, Commented Dec 6, 2011 at 22:01

Kevin · Accepted Answer · 2011-12-06 22:13:22Z

Assuming your files are both sorted:

join -j1 -t\| entity.txt reference.txt

If they're not sorted, sort them:

sort entity.txt -o entity-sorted.txt sort reference.txt -o reference-sorted.txt join -j1 -t\| entity-sorted.txt reference-sorted.txt

Dalker · Accepted Answer · 2011-12-06 22:24:33Z

0

You can do that with a bash / zsh one-liner. Assuming your data is contained in files named entity and reference, just type:

for i in $(cat entity); do grep ^$i reference; done

in a console.

Also, you can redirect the whole output to an output file, as

for i in $(cat entity); do grep ^$i reference; done > output

edited Dec 6, 2011 at 22:24

answered Dec 6, 2011 at 22:17

Dalker

1,6821 gold badge12 silver badges17 bronze badges

1

He wants the first column to match, you'd need to change the grep to ^$i

Kevin
– Kevin

2011-12-06 22:21:37 +00:00
Commented Dec 6, 2011 at 22:21
@Kevin: you're absolutely right! Edited in consequence.

Dalker
– Dalker

2011-12-06 22:25:12 +00:00
Commented Dec 6, 2011 at 22:25
for i in $(cat entity) is a bad idea due to the way that $( handles word splitting. Use while IFS= read -r instead.

Chris Down
– Chris Down

2011-12-06 23:15:31 +00:00
Commented Dec 6, 2011 at 23:15
@ChrisDown, Dalker: sed -e 's/^/^/' -e 's/$/|/' | grep -F - reference >output

Gilles 'SO- stop being evil'
– Gilles 'SO- stop being evil'

2011-12-06 23:24:09 +00:00
Commented Dec 6, 2011 at 23:24

Add a comment |

Birei · Accepted Answer · 2011-12-06 22:24:55Z

A solution using Perl:

Content of entity.txt:

$ cat entity.txt 624197 624252 624264 624276 624280 624309 624317

Content of reference.txt:

$ cat reference.txt 624252|624346 624264|1070122 624264|624346 624276|624588 624280|624346 624280|624582 624298|624588 624319|333008 624330|624588

Content of Perl script:

$ cat script.pl use warnings; use strict; ## Check arguments. @ARGV == 2 or die qq[Usage: perl $0 <entity-file> <reference-file>\n]; ## File in process. my $process_file = 1; ## Hash to save entities. my %entity; while ( <> ) { ## Process file of entities. Remove leading and trailing spaces, and save the ## number to a hash. if ( $process_file == 1 ) { s/\A\s*//; s/\s*\z//; if ( defined $_ ) { $entity{ $_ } = 1 } next; } ## Process file of references. Get first field and search it in the hash. ## If found, print the line. my @f = split /\|/, $_, 2; if ( exists $entity{ $f[0] } ) { print; } } continue { ## Increment number when end processing first file. if ( eof ) { ++$process_file } }

Running the script without arguments:

$ perl script.pl Usage: perl script.pl <entity-file> <reference-file>

Running the script with arguments and result:

$ perl script.pl entity.txt reference.txt 624252|624346 624264|1070122 624264|624346 624276|624588 624280|624346 624280|624582

Jim · Accepted Answer · 2012-01-07 07:17:43Z

Hmm, maybe I'm missing something? Please correct me if I'm wrong:

$ while read id;do grep $id reference ;done <identity 624252|624346 624264|1070122 624264|624346 624276|624588 624280|624346 624280|624582

Looking at your source files, they were already sorted, but my solution should work regardless of whether they are sorted or not, I believe.

And to output to another file:

$ while read id;do grep $id reference ;done < identity > newoutput.out

Stack Exchange Network

Compare two files

4 Answers 4

You must log in to answer this question.

Hot Network Questions

Compare two files

4 Answers 4

You must log in to answer this question.

Related

Hot Network Questions