Saturday, February 17, 2007

Compressing the Human Genome

Compressing the human genome is a worthwhile exercise, because a compressor that does better has captured more of the patterns and structure of the genome. Here are some results with popular open-source text compressors.

genome.tar 3124449280 bytes

gzip 1.3.5
real 12m35.780s
user 11m35.300s
sys 0m8.410s
genome.tar.gz 861173407 bytes

pbzip2 0.9.6, with 4 processors
real 5m3.966s
user 18m16.750s
sys 0m35.780s
Output Size: 793150501 bytes

With four processors, pbzip2 finishes in less than half the wall-clock time of single-processor gzip (about 5 minutes versus 12.5), although it uses more total CPU time.
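The gzip and pbzip2 timings above can be turned into a rough speed-up and throughput comparison. The following is only a sketch: the helper name parse_time and the dictionary layout are hypothetical, and the figures are copied from the two runs listed above.

```python
import re


def parse_time(text):
    """Convert a `time`-style string such as '12m35.780s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognised time string: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


INPUT_BYTES = 3124449280  # size of genome.tar

# Timings copied from the gzip and pbzip2 runs above.
runs = {
    "gzip": {"real": "12m35.780s", "user": "11m35.300s", "sys": "0m8.410s"},
    "pbzip2": {"real": "5m3.966s", "user": "18m16.750s", "sys": "0m35.780s"},
}

stats = {}
for name, t in runs.items():
    real = parse_time(t["real"])
    cpu = parse_time(t["user"]) + parse_time(t["sys"])
    stats[name] = {
        "real_s": real,
        "mb_per_s": INPUT_BYTES / real / 1e6,  # input throughput, decimal MB
        "busy_cores": cpu / real,  # average number of processors in use
    }

# Wall-clock speed-up of pbzip2 relative to gzip.
speedup = stats["gzip"]["real_s"] / stats["pbzip2"]["real_s"]

for name, s in stats.items():
    print(f"{name}: {s['mb_per_s']:.1f} MB/s, about {s['busy_cores']:.1f} cores busy")
print(f"pbzip2 wall-clock speed-up over gzip: {speedup:.1f}x")
```

On these numbers pbzip2 finishes roughly 2.5 times sooner than gzip in wall-clock terms, while keeping most of the four processors busy.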

7za a -m0=LZMA:d28 -mx=9 genome genome.tar
7-Zip (A) 4.44 beta Copyright (c) 1999-2007 Igor Pavlov 2007-01-20
p7zip Version 4.44 (locale=C,Utf16=off,HugeFiles=on,2 CPUs)
real 245m55.993s
user 347m56.284s
sys 1m28.416s
(Note: this run used 2.7 GB of memory and was timed on a different machine from the gzip and pbzip2 runs, so the timings are useful only as an order-of-magnitude comparison.)
genome.7z 683829237 bytes

With the option LZMA:d27 the output is about 4 MB larger, i.e. 687816321 bytes, and it uses 1.3 GB of memory (real 229m55.132s, user 316m36.679s, sys 0m26.043s).

With 7-Zip it fits onto one CD!
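To make the size comparison concrete, here is a small sketch that computes each compression ratio and checks the CD claim. The byte counts are the ones reported above; the CD capacity of 700 MiB is an assumption (actual usable capacity depends on the disc and filesystem), and the variable names are only illustrative.

```python
ORIGINAL_BYTES = 3124449280  # genome.tar

# Output sizes reported above.
compressed_bytes = {
    "gzip 1.3.5": 861173407,
    "pbzip2 0.9.6": 793150501,
    "7za LZMA:d28": 683829237,
    "7za LZMA:d27": 687816321,
}

# Assumption: a "700 MB" CD holds 700 MiB of data.
CD_BYTES = 700 * 1024 * 1024

ratios = {}
fits_on_cd = {}
for name, size in compressed_bytes.items():
    ratios[name] = size / ORIGINAL_BYTES  # fraction of the original size
    fits_on_cd[name] = size <= CD_BYTES
    print(f"{name}: {ratios[name]:.1%} of original, fits on CD: {fits_on_cd[name]}")
```

Under that capacity assumption only the two 7za outputs fit on a single disc, each at about 22% of the original size.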

1 comment:

Anonymous said...

Interesting use for pbzip2. I had never tried compressing the human genome before. I guess there is quite a bit of redundant "data" in there. ;-)