2

I have two files in UNIX. 1st file is Entity and Second File is References. 1st File has only one column named Entity ID and 2nd file has two columns Entity ID | Person ID.

I want to produce a output file where entity id's are matching in both the files.

Entity File

624197 624252 624264 624276 624280 624309 624317 

Reference File

624252|624346 624264|1070122 624264|624346 624276|624588 624280|624346 624280|624582 624298|624588 624319|333008 624330|624588 

Output File

624252|624346 624264|1070122 624264|624346 624276|624588 624280|624346 624280|624582 

Entity Files has 90K Records and Reference file has 200K Records. Is there an efficient way to produce the third file ? Any solution is appreciated.

1
  • The two files are sorted? Commented Dec 6, 2011 at 22:01

4 Answers 4

2

Assuming your files are both sorted:

join -j1 -t\| entity.txt reference.txt 

If they're not sorted, sort them:

sort entity.txt -o entity-sorted.txt sort reference.txt -o reference-sorted.txt join -j1 -t\| entity-sorted.txt reference-sorted.txt 
0

You can do that with a bash / zsh one-liner. Assuming your data is contained in files named entity and reference, just type:

for i in $(cat entity); do grep ^$i reference; done

in a console.

Also, you can redirect the whole output to an output file, as

for i in $(cat entity); do grep ^$i reference; done > output

4
  • 1
    He wants the first column to match, you'd need to change the grep to ^$i Commented Dec 6, 2011 at 22:21
  • @Kevin: you're absolutely right! Edited in consequence. Commented Dec 6, 2011 at 22:25
  • for i in $(cat entity) is a bad idea due to the way that $( handles word splitting. Use while IFS= read -r instead. Commented Dec 6, 2011 at 23:15
  • @ChrisDown, Dalker: sed -e 's/^/^/' -e 's/$/|/' | grep -F - reference >output Commented Dec 6, 2011 at 23:24
0

A solution using Perl:

Content of entity.txt:

$ cat entity.txt 624197 624252 624264 624276 624280 624309 624317 

Content of reference.txt:

$ cat reference.txt 624252|624346 624264|1070122 624264|624346 624276|624588 624280|624346 624280|624582 624298|624588 624319|333008 624330|624588 

Content of Perl script:

$ cat script.pl use warnings; use strict; ## Check arguments. @ARGV == 2 or die qq[Usage: perl $0 <entity-file> <reference-file>\n]; ## File in process. my $process_file = 1; ## Hash to save entities. my %entity; while ( <> ) { ## Process file of entities. Remove leading and trailing spaces, and save the ## number to a hash. if ( $process_file == 1 ) { s/\A\s*//; s/\s*\z//; if ( defined $_ ) { $entity{ $_ } = 1 } next; } ## Process file of references. Get first field and search it in the hash. ## If found, print the line. my @f = split /\|/, $_, 2; if ( exists $entity{ $f[0] } ) { print; } } continue { ## Increment number when end processing first file. if ( eof ) { ++$process_file } } 

Running the script without arguments:

$ perl script.pl Usage: perl script.pl <entity-file> <reference-file> 

Running the script with arguments and result:

$ perl script.pl entity.txt reference.txt 624252|624346 624264|1070122 624264|624346 624276|624588 624280|624346 624280|624582 
0

Hmm, maybe I'm missing something? Please correct me if I'm wrong:

$ while read id;do grep $id reference ;done <identity 624252|624346 624264|1070122 624264|624346 624276|624588 624280|624346 624280|624582 

Looking at your source files, they were already sorted, but my solution should work regardless of whether they are sorted or not, I believe.

And to output to another file:

$ while read id;do grep $id reference ;done < identity > newoutput.out 

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.