
I would like to use GNU Parallel to process a huge .gz or .bz2 file.

I know I can do:

bzcat huge.bz2 | parallel --pipe ... 

But it would be nice if there was a way similar to --pipe-part that can read multiple parts of the file in parallel. One option is to decompress the file:

bzcat huge.bz2 > huge
parallel --pipe-part -a huge ...

but huge.bz2 is huge, and I would much rather decompress it multiple times than store it uncompressed.
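One crude workaround in that spirit (a sketch, not a GNU Parallel feature): decompress the stream once per chunk and let each job keep only its own byte range with tail/head. The file name demo.gz, the chunk count N, and writing the slices to part files are all placeholders; a real run would replace the redirection with the actual per-chunk processing. Note this splits on byte boundaries, not record boundaries, unlike --pipe-part with --recend:

```shell
#!/bin/sh
# Sketch: process a .gz in N chunks without ever storing it uncompressed.
# Every job re-decompresses the whole stream and discards all but its slice,
# trading N-1 extra decompressions for zero uncompressed disk usage.
FILE=demo.gz
N=4
seq 1 100000 | gzip > "$FILE"              # demo data; use your real file
total=$(gzip -dc "$FILE" | wc -c)          # uncompressed size (one extra pass)
chunk=$(( (total + N - 1) / N ))           # bytes per chunk, rounded up
i=0
while [ "$i" -lt "$N" ]; do
    # tail -c +K starts at (1-indexed) byte K; head -c keeps one chunk.
    gzip -dc "$FILE" | tail -c +$(( i * chunk + 1 )) | head -c "$chunk" \
        > "part$i" &
    i=$(( i + 1 ))
done
wait
```

The CPU cost is quadratic-ish in the number of chunks (chunk i decompresses and throws away everything before it), so it only pays off when the per-chunk work dominates decompression.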

  • Isn't it --pipe-part (three minuses)? (the man page says it is, but I think of the man page as a proxy for your authority on GNU parallel) Commented Mar 28 at 12:03
  • For gzip files, relevant: Parallel decompression of gzip-compressed files and random access to DNA sequences and other interesting links courtesy of (also very relevant here) gztool. Commented Mar 28 at 17:11
  • gztool also mentions the Perl module Gzip::RandomAccess in case you want to add that into GNU parallel. Some more modern compression formats make random access easier. Commented Mar 28 at 17:22
  • Can you clarify what kind of solution you're looking for? Presumably, this being you, you aren't asking for a GNU parallel option, so I assume you want some sort of workaround that allows you to pipe parts of the file? How about splitting the file and then processing the pieces in parallel? Commented Mar 28 at 18:00
  • @MarcusMüller (I forgot the middle '-' so many times that I simply decided that all middle '-' can be left out for all options). Commented Mar 28 at 18:30

1 Answer


I think some simple tools could help you:

  • pigz for .gz
  • pixz for .xz (lzma)
  • lbzip2 for .bz2

Something that usually works fine:

lbzip2 -n $(nproc) -d huge.bz2 

or for gzip

pigz -p $(nproc) -d huge.gz 

...

And you can use them with tar too:

tar -I<compression_program> -xf huge.tar.bz2 

For example:

tar -Ipigz -xf several_terabytes_archive.tar.gz 
  • This is not what I am looking for. Maybe it is easier to see what I am looking for if you imagine that I do not have disk space for the uncompressed file? I want to be able to access the last 1 MB of huge.gz without having to decompress everything first. Commented Apr 2 at 11:10
  • OK, I understand better now (sorry for the answer)! I think it would not be possible without creating huge.gz specially for partial decompression, for example using zlib and inserting sync marks that can later be used as decompression points. From the zlib manual, the function gzread(gzFile file, voidp buf, unsigned len) takes a 'len' argument, and the manual says: "Read and decompress up to len uncompressed bytes from file into buf", so I think it would do the job you want. If it helps in your case, let me know and I will update my previous answer. Commented Apr 2 at 16:58
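To illustrate the "last 1 MB of huge.gz" case from the comments with plain shell tools: gzip cannot seek, so the whole stream is still decompressed once, but only the final window is kept and nothing uncompressed ever touches the disk. This is a sketch; demo.gz stands in for the real file, and true random access would need an index (gztool, Gzip::RandomAccess) or a compression format designed for it:

```shell
#!/bin/sh
# Sketch: extract the last 1 MB of a .gz without storing it uncompressed.
# The full stream is decompressed and discarded up to the final window,
# so this is sequential under the hood, just disk-free.
FILE=demo.gz
seq 1 200000 | gzip > "$FILE"              # demo data; use your real file
gzip -dc "$FILE" | tail -c 1048576 > last_mb
```

The same shape works for bzip2 with `bzcat`, or faster with the multi-threaded decompressors from the answer (`pigz -dc`, `lbzip2 -dc`).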
