edited Jul 6, 2017 at 15:40

If you want to remove "lines that contain a string found in another file" (and not "lines that match a regexp found in another file"), try:

    grep -vFf file1 file2 > file3

"grep -F" looks for plain string matches rather than regexp matches (much faster).

Or, even better:

    grep -vwFf file1 file2   # respect word boundaries
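To make the difference between the three variants concrete, here is a minimal sketch on made-up data. The sample strings and the temporary directory are assumptions chosen for illustration only; they are not part of the original test.

```shell
# Hypothetical sample data, only for illustration.
tmp=$(mktemp -d)
printf '%s\n' '1.5' 'cat' > "$tmp/file1"
printf '%s\n' 'price 1.5' 'price 125' 'cat food' 'concatenate' 'dog' > "$tmp/file2"

# Regexp patterns: "1.5" also matches "125", because "." matches any character.
regex_out=$(grep -vf "$tmp/file1" "$tmp/file2")

# Fixed strings: "1.5" only matches a literal "1.5",
# but "cat" still matches inside "concatenate".
fixed_out=$(grep -vFf "$tmp/file1" "$tmp/file2")

# Fixed strings plus -w: "cat" must be a whole word, so "concatenate" is kept.
word_out=$(grep -vwFf "$tmp/file1" "$tmp/file2")

printf 'regexp:\n%s\n\nfixed:\n%s\n\nfixed + word:\n%s\n' \
    "$regex_out" "$fixed_out" "$word_out"
rm -r "$tmp"
```

With this input, the regexp run keeps only "dog", the -F run also keeps "price 125", and the -wF run additionally keeps "concatenate".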

### Just a quick timing comparison test:

Build a random 100,000-line example file2:

    seq 1000000 | shuf -n 100000 > file2

Build a random 10,000-line example file1 (strings to remove):

    seq 1000000 | shuf -n 10000 > file1

Using grep -F:

    time grep -vwFf file1 file2 > file31

    real 0m0.111s
    user 0m0.100s
    sys  0m0.008s

Without -F:

    time grep -vwf file1 file2 > file32

... hours!

Run time without -F, depending on the number of lines in file1:

- 300 lines: 0.327s (very fast)
- 600 lines: 8.326s
- 900 lines: 35.334s
- 1200 lines: 1m31.433s (quadratic in the length of file1?)
- 10000 lines: 1h03m53.983s (updated after the run finished)
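The scaling measurements above could be reproduced with a loop along these lines. This is only a sketch: the `SIZES` variable, the temporary file names and the use of GNU `date +%s%N` for millisecond timing (instead of `time`) are my assumptions, and the numbers will depend heavily on the grep version and the machine. The default sizes are kept small so the loop finishes quickly; raise them to see the slowdown.

```shell
# Sketch: time grep with and without -F for growing pattern files.
# SIZES and the temp file names are illustrative assumptions.
sizes=${SIZES:-"100 200 300"}
tmp=$(mktemp -d)

# Same kind of data as in the test above: 100,000 random numbers.
seq 1000000 | shuf -n 100000 > "$tmp/file2"

mismatches=0
for n in $sizes; do
    seq 1000000 | shuf -n "$n" > "$tmp/file1"

    # date +%s%N (nanoseconds since the epoch) is a GNU date extension.
    start=$(date +%s%N)
    grep -vwFf "$tmp/file1" "$tmp/file2" > "$tmp/out_fixed"
    end=$(date +%s%N)
    fixed_ms=$(( (end - start) / 1000000 ))

    start=$(date +%s%N)
    grep -vwf "$tmp/file1" "$tmp/file2" > "$tmp/out_regex"
    end=$(date +%s%N)
    regex_ms=$(( (end - start) / 1000000 ))

    echo "$n patterns: with -F ${fixed_ms} ms, without -F ${regex_ms} ms"

    # The patterns are plain digits, so both runs should give identical output.
    cmp -s "$tmp/out_fixed" "$tmp/out_regex" || mismatches=$((mismatches + 1))
done

echo "runs with differing output: $mismatches"
rm -r "$tmp"
```

For example, `SIZES="300 600 900 1200" sh script.sh` (with `script.sh` being whatever name you save it under) would cover the same sizes as the list above.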

Conclusion of the test:

- grep -vFf file1 file2 is much faster than grep -vf
- grep -vFf file1 file2 has no problems with large file1 files
- grep -vf file1 file2 slows down badly as file1 grows (this is only visible for sizes > 500 lines or > 4 kbytes)

edited Jul 5, 2017 at 9:56

JJoao

edited Jul 5, 2017 at 9:19

JJoao

edited Jul 5, 2017 at 9:10

JJoao

answered Jul 5, 2017 at 8:48

JJoao
