GitHub - hantaniold/parallel-gzip: parallelized gzip implementation using pthreads

hantaniold / parallel-gzip Public

Notifications You must be signed in to change notification settings
Fork 1
Star 8

parallelized gzip implementation using pthreads

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
gzip-src		gzip-src
pgzip-src		pgzip-src
README		README

Repository files navigation

For the curious, gzip-src contains our initial modifications using OpenMP. The real meat is in pgzip-src, containing a parallel implementation of gzip using p-threads. The below describes the contents of the pgzip-src folder.

To test out the suite, navigate to parallel-gzip/pgzip-src, and run ./test_suite.sh . This will get you all the binaries and test it on a bunch of files. The output times and othe rinfo will go to tso.txt.

The test suite only does filesizes of 20, 100 and 250 MB, though if you wish you can also do them for 500 MB, 750, 1000, 2000 (or any size you desire) by commenting out/adding the desired lines in test_suite.sh .

Last modified: 12/2/11.

----------------------------------------------------------------------------

This is a set of various implementations, both sequential and parallel of the gzip compression program.
There are four folders: gzip, pigz, quickzip, and pgzip. The folder gzip is the standard implementation
of gzip found at www.gzip.org. There is a script called init_script.sh which can be run and will download
the file, build it, and copy the executable to the directory. The folder pigz is the standard implementation
of parallel gzip found at www.zlib.net/pigz/. There is a script called init_script.sh which can be run
and will download the file, build it, and copy the executable to the directory. These are the standard
two implementations most people use.

The next folder quickzip, implements pseudo gzip parallelism, by simply splitting a file into chunks (unix
split command) and then compresses each in a background process, we could join them back by removing the last
two bytes in each chunk and then cat the compressed files and add a crc byte and a length byte. Furthermore
we would need to remove the end block codes of each seperate chunk except the last one to comply with gzip file
format. However we do not join in this manner because we wanted a clean implemenation and did not want to modify
end chunk codes, which would need to be done in the gzip src (see LINE 653 in deflate.c source from gzip). That
is why we call it pseudo gzip, since it is technically not RFC Gzip, but does the same job.

The next folder pgzip, implements true gzip parallelism. It follows a similar approach as pigz, but absolutely no
source code is borrowed or taken as inspiration from that project. It is developed purely from the gzip source files,
and proceeds to create a threadpool and then send chunks of data to free threads, and then takes finished work and
flushes to a file appropriately. For the time being it only supports compression.

----------------------------------------------------------------------------

Build Details:

gzip: simply go to directory gzip and run init_script.sh
pigz: simply go to directory pigz and run init_script.sh
pgzip: simply go to directory pgzip and run make
quickzip: depends on having gzip and python, but otherwise you're set

----------------------------------------------------------------------------

Contributors:

https://github.com/a-wild-tigger
https://github.com/SeanHogan
https://github.com/norseboar

----------------------------------------------------------------------------