JJoao

If you want to remove "lines that contain a string found in another file" (and not "lines that contain a string that matches a regexp in another file"), try:

grep -vFf file1 file2 > file3 

"grep -F" does not look for regexp matches but for simple fixed-string matches, which is much faster.

or even better

grep -vwFf file1 file2 # respect word boundaries
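To make the difference between the three variants concrete, here is a minimal sketch. The sample contents of file1 and file2 are made up for illustration, and it runs in a scratch directory created with mktemp so no real files are touched:

```shell
# Work in a scratch directory so nothing real is overwritten.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical sample data: file1 holds the strings to remove.
printf '%s\n' 'a.c' '12' > file1
printf '%s\n' 'abc' 'a.c' '12' '123' 'x 12 y' 'keep' > file2

# Regexp matching: "a.c" also matches "abc", and "12" matches inside "123".
regex_out=$(grep -vf file1 file2)

# Fixed strings: "a.c" only matches a literal "a.c", but "12" still hits "123".
fixed_out=$(grep -vFf file1 file2)

# Fixed strings plus word boundaries: "123" now survives.
word_out=$(grep -vwFf file1 file2)

printf 'regexp:\n%s\n\nfixed:\n%s\n\nfixed+word:\n%s\n' \
    "$regex_out" "$fixed_out" "$word_out"
```

With this data, only the -w variant keeps the line "123", because "12" is not a whole word there.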

### Just a quick timing comparison test:

  1. Build an example file2 with 100 000 random lines:

    seq 1000000 | shuf -n 100000 > file2

  2. Build an example file1 with 10 000 random lines (the strings to remove):

    seq 1000000 | shuf -n 10000 > file1

  3. Using grep -F: time grep -vwFf file1 file2 > file31

    real 0m0.111s
    user 0m0.100s
    sys  0m0.008s

  4. Without -F: time grep -vwf file1 file2 > file32

... over an hour!

Run time of grep -vwf as file1 grows (quadratic in the length of file1?):

  • 300 lines: 0.327s (very fast)
  • 600 lines: 8.326s
  • 900 lines: 35.334s
  • 1200 lines: 1m31.433s
  • 10000 lines: 1h03m53.983s
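The comparison only makes sense if both commands produce the same result on this data. A scaled-down sanity check might look like the sketch below. The sizes (10 000 and 200 lines) and the output file names are arbitrary assumptions chosen so that it finishes quickly, and it assumes GNU shuf is available, as in the test above:

```shell
# Scaled-down sanity check; the sizes below are arbitrary assumptions.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

seq 100000 | shuf -n 10000 > file2   # lines to filter
seq 100000 | shuf -n 200   > file1   # strings to remove

grep -vwFf file1 file2 > out_fixed   # fixed-string matching
grep -vwf  file1 file2 > out_regex   # regexp matching

# The patterns are plain digits, so both outputs should be byte-identical.
if cmp -s out_fixed out_regex; then
    same=yes
else
    same=no
fi
echo "identical output: $same"
echo "lines kept: $(wc -l < out_fixed) of $(wc -l < file2)"
```

Prefixing either grep call with time, and increasing the size of file1, should reproduce the kind of measurements listed above, although the exact numbers depend on the grep version and the machine.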

Conclusion of the test:

  • grep -vFf file1 file2 is much faster than grep -vf

  • grep -vFf file1 file2 has no problems with big file1 files

  • grep -vf file1 file2 is badly affected by the growth of file1 (this only becomes visible for sizes > 500 lines or > 4 kbytes)
