
A 79 MByte grep pattern file is going to be painful to work with. Are the lines of B.txt really regular expressions, or are they fixed identical strings? If they are fixed strings, do they appear identically in A.gz as whole lines? How many lines in the uncompressed A.gz are expected to be matched by lines in B.txt?

If you're dealing with fixed full-line strings in B.txt and the uncompressed A.gz contains a comparatively small amount (say, on the order of 100 MB) of matching lines, you may be better served by writing a program to pre-process A.gz:

  • You could hash each line of B.txt and remember the hashes
  • You then check whether any line in the uncompressed A.gz hashes to the same thing as any of your remembered hashes. If so, you print out the line (e.g. into C.txt) ready for further processing
  • You then do a final pass where you rigorously check whether each line of B.txt is in C.txt (or vice versa, depending on which file is smaller)

Some code to do the initial approximate filtering could look like this:

# Do a quick APPROXIMATE filter of lines in FILENEEDLES that are also in
# FILEHAYSTACK
import sys

def main():
    if len(sys.argv) < 3:
        print("usage: %s FILENEEDLES FILEHAYSTACK" % sys.argv[0])
        exit(1)
    first_filename = sys.argv[1]
    second_filename = sys.argv[2]
    line_hashes = set()
    with open(first_filename, "r") as f:
        for line in f:
            line_hashes.add(hash(line))
    with open(second_filename, "r") as f:
        for line in f:
            if hash(line) in line_hashes:
                sys.stdout.write(line)

if __name__ == "__main__":
    main()

For example:

$ echo -e '1\n2\n3' > B.txt
$ echo -e '2\n3\n4\n5' | gzip > A.gz
$ ./approxfilter.py B.txt <(gzip -dc A.gz) > candidates.txt
$ cat candidates.txt
2
3

You would now need to check candidates.txt to see whether the lines output EXACTLY match those of B.txt (but this is hopefully a smaller and far easier problem, and you could even modify the program above to do it all if the number of candidate lines is "small" and well within what can be held in memory). (The questioner later clarified in comments that they are not working with full-line strings, so this approach won't work.)
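The final exact pass described above could be sketched as follows. This is a minimal sketch, not part of the original program: the `exact_filter` name is mine, and it assumes the smaller file (B.txt) fits comfortably in memory as a set of full lines.

```python
# Exact final pass: keep only candidate lines that literally appear in
# B.txt, eliminating any hash-collision false positives from the
# approximate filter.
import sys

def exact_filter(needles_path, candidates_path, out=sys.stdout):
    with open(needles_path, "r") as f:
        needles = set(f)  # store full lines this time, not just hashes
    with open(candidates_path, "r") as f:
        for line in f:
            if line in needles:
                out.write(line)

if __name__ == "__main__" and len(sys.argv) >= 3:
    exact_filter(sys.argv[1], sys.argv[2])
```

Because the candidate file is hopefully tiny compared to the uncompressed A.gz, holding B.txt's lines verbatim in a set is affordable here even though it wasn't during the first pass.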

If lines in B.txt really are regular expressions or substrings of lines in A.gz, this approach clearly won't work, and you may be forced to use something like HyperScan, which is designed to deal with huge sets of regexes. If you have the disk space you could decompress A.gz and just let HyperScan get to work on that (you may even be able to let the shell decompress on the fly while HyperScan searches through it). Another alternative to try is ripgrep.
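The decompress-on-the-fly idea can be illustrated like this. Plain `grep -E -f` stands in here for HyperScan or ripgrep (with ripgrep the equivalent would be `rg -f B.txt`); with a 79 MB pattern file you would swap in one of those engines, but the plumbing is the same:

```shell
# Search the uncompressed stream of A.gz without ever writing the
# uncompressed file to disk.
printf '1\n2\n3\n' > B.txt
printf 'x2y\n3\n44\n' | gzip > A.gz
# Each line of B.txt is treated as a regex; matching lines are printed.
gzip -dc A.gz | grep -E -f B.txt
```

Here the pipe means only the compressed file occupies disk space; the decompressed data exists only transiently in the pipe buffer while the search runs.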
