Section 3.2.2. in the RFC defines the basic approach to creating Huffman trees, but this article provides a solid explanation about how it works and shows how it's implemented. The key insight is that Huffman codes are generated by counting the lengths of the variable-length codes (For more background/context, see Stanford lectures on algebraic error control codes and/or the textbook used in the class).
Gzip is essentially using the deflate algorithm but adds a header/trailer to the end. RFC1952 provides a breakdown of how the header is generated, but you can also test for yourself:
$echo "hello" > hello
$gzip hello
$python
>>> data = open("hello.gz", "r").read()
>>> data[0:2]
'\x1f\x8b'
>>> data[3:4]
'\x08'
>>> [hex(ord(a)) for a in data[4:8]]
['0x9e', '0x2e', '0xfa', '0x50']
>>> 0x50fa2e9e
1358573214
>>> datetime.datetime.fromtimestamp(1358573214)
datetime.datetime(2013, 1, 18, 21, 26, 54)
>>> data[10:16]
'hello\x00'
More related articles too:
http://www.adayinthelifeof.nl/2010/06/02/deflating-the-universe/
http://www.w3.org/Graphics/PNG/RFC-1951#codes
http://www.commandlinefanatic.com/cgi-bin/showarticle.cgi?article=art007
No comments:
Post a Comment