Sunday, January 20, 2013

The deflate algorithm

The deflate algorithm, the compression approach used in the original Pkzip 2.0 version, is a fascinating read defined in the RFC1951 standard.  In order to understand the spec, it's helpful to work through the "hello" examples discussed in this article.

Section 3.2.2. in the RFC defines the basic approach to creating Huffman trees, but this article provides a solid explanation about how it works and shows how it's implemented.   The key insight is that Huffman codes are generated by counting the lengths of the variable-length codes (For more background/context, see Stanford lectures on algebraic error control codes and/or the textbook used in the class).

Gzip is essentially using the deflate algorithm but adds a header/trailer to the end.  RFC1952 provides a breakdown of how the header is generated, but you can also test for yourself:

$echo "hello" > hello
$gzip hello
$python
>>> data = open("hello.gz", "r").read()
>>> data[0:2]
'\x1f\x8b'
>>> data[3:4]
'\x08'
>>> [hex(ord(a)) for a in data[4:8]]
['0x9e', '0x2e', '0xfa', '0x50']
>>> 0x50fa2e9e
1358573214
>>> datetime.datetime.fromtimestamp(1358573214)
datetime.datetime(2013, 1, 18, 21, 26, 54)

>>> data[10:16]
'hello\x00'

More related articles too:

http://www.zlib.net/manual.html
http://www.adayinthelifeof.nl/2010/06/02/deflating-the-universe/
http://www.w3.org/Graphics/PNG/RFC-1951#codes
http://www.commandlinefanatic.com/cgi-bin/showarticle.cgi?article=art007

No comments:

Post a Comment