Re: Compression and dictionaries

From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-15 23:29:28
On 15 Aug 2006 04:33:03 -0400, linux@horizon.com <linux@horizon.com> wrote:
> Just to inform this discussion, a brief description of the zlib
> "deflate" algorithm.  A full description is in RFC1951.
>
> This is LZ77 compression, meaning that what is actually encoded is
> a series of instruction to either
> - Insert a specified byte, or
> - Copy a run of n bytes, starting m bytes ago.

Shawn and I did some testing overnight. A simple 4KB preload table
yielded a 4% size reduction in my 850MB pack. I'm still hoping for 10%
using an optimal 32K table.

Paper on computing the optimal preload table:
http://www.eecs.harvard.edu/~michaelm/postscripts/dcc2001a.pdf

If anyone is interested in doing some research, it would be
interesting to build a pack engine based on full-text search
technology. For example cLucene,
http://clucene.sourceforge.net/index.php/Main_Page The full-text
search app contains the code for extracting the dictionary and then
encoding all of the blobs using it.

My prediction is that a large global dictionary based scheme will
compress significantly better than zlib is capable of.

-- 
Jon Smirl
jonsmirl@gmail.com
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Received on Tue Aug 15 23:30:15 2006

This archive was generated by hypermail 2.1.8 : 2006-08-15 23:30:55 EST