Re: Mercurial 0.4b vs git patchbomb benchmark

From: Adam J. Richter <adam@yggdrasil.com>
Date: 2005-05-01 00:44:17
On 2005-04-30, Andrea Arcangeli wrote:
>On a bit more technical side, one thing I'm wondering about is the
>compression. If I change mercurial like this:
>
>--- revlog.py.~1~       2005-04-29 01:33:14.000000000 +0200
>+++ revlog.py   2005-04-30 03:54:12.000000000 +0200
>@@ -11,9 +11,11 @@
> import zlib, struct, mdiff, sha, binascii, os, tempfile
> 
> def compress(text):
>+    return text
>     return zlib.compress(text)
> 
> def decompress(bin):
>+    return bin
>     return zlib.decompress(bin)
> 
> def hash(text):
>
>
>the .hg directory sizes changes from 167M to 302M _BUT_ the _compressed_
>size of the .hg directory (i.e. like in a full network transfer with
>rsync -z or a tar.gz backup) changes from 55M to 38M:
>
>andrea@opteron:~/devel/kernel> du -sm hg-orig hg-aa hg-orig.tar.bz2 hg-aa.tar.bz2 
>167     hg-orig
>302     hg-aa
>55      hg-orig.tar.bz2
>38      hg-aa.tar.bz2
>^^^^^^^^^^^^^^^^^^^^^ 38M backup and network transfer is what I want
>
>So I don't really see a huge benefit in compression, other than to
>slow down the checkins measurably [i.e. what Linus doesn't want] (the
>time of compression is a lot higher than the time of python runtime during
>checkin, so it's hard to believe your 100% boost with psyco in the hg file,
>sometimes psyco doesn't make any difference in fact, I'd rather prefer people to
>work on the real thing of generating native bytecode at compile time, rather
>than at runtime, like some haskell compiler can do).
>
>mercurial is already good at decreasing the entropy by using an efficient
>storage format, it doesn't need to cheat by putting compression on each blob
>that can only lead to bad ratios when doing backups and while transferring
>more than one blob through the network.
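
	The effect described in the quoted text can be reproduced
in miniature.  The following is only a sketch with made-up data
(the blob contents and the helper names per_blob_size and
whole_stream_size are hypothetical, not anything from mercurial):
it compares compressing each blob on its own with compressing
all of the blobs as one stream.

```python
import zlib

def per_blob_size(blobs):
    """Total size when every blob is compressed on its own."""
    return sum(len(zlib.compress(b)) for b in blobs)

def whole_stream_size(blobs):
    """Size when all blobs are concatenated and compressed once."""
    return len(zlib.compress(b"".join(blobs)))

# Hypothetical data: 50 small, nearly identical "revisions" of one file.
base = b"static int foo(struct bar *b)\n{\n\treturn b->baz;\n}\n" * 4
blobs = [base + b"/* revision %d */\n" % i for i in range(50)]

# The single stream can reuse matches across blobs; the per-blob
# streams each have to start from scratch.
print(per_blob_size(blobs), whole_stream_size(blobs))
```

	On data like this the single stream should come out much
smaller, because each separately compressed blob cannot refer
to anything outside itself.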

	I'd like to mention a couple of possible optimizations
for both the compressed and the uncompressed approaches.

	If you remove the zlib compression, then I imagine you could
do much of the IO of checking out files via sendfile, without
ever copying data to program space or even changing the program's
memory map.  There apparently exists a Python sendfile module.
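
	As a rough sketch of the sendfile idea, assuming a Python
whose standard library provides os.sendfile (3.3 and later) and
a kernel that accepts a regular file as the destination (Linux
does since 2.6.33): the function name checkout_via_sendfile and
the fallback to an ordinary copy are my own assumptions, not
anything in mercurial.

```python
import os
import shutil

def checkout_via_sendfile(src_path, dst_path):
    """Copy src_path to dst_path, letting the kernel move the bytes."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        offset = 0
        try:
            while remaining > 0:
                # Data goes from file to file inside the kernel;
                # nothing is read into this process.
                sent = os.sendfile(dst.fileno(), src.fileno(),
                                   offset, remaining)
                if sent == 0:
                    break
                offset += sent
                remaining -= sent
        except (AttributeError, OSError):
            # Assumption: if sendfile is missing or refuses these
            # descriptors, finish with a plain userspace copy.
            src.seek(offset)
            dst.seek(offset)
            shutil.copyfileobj(src, dst)
```

	This only applies when the stored bytes are exactly the
bytes of the checked-out file, which is why it depends on
dropping the compression pass.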

	If Mercurial were written in C, much of the rest of
the IO could be optimized with mmap (to reduce copies) and writev
in the absence of a compression pass.  I don't know enough about
Python to know if these optimizations are available.
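
	For what it is worth, current Python does expose both
calls on Unix, as the mmap module and os.writev.  A minimal
sketch, assuming uncompressed storage and a non-empty source
file; the function write_ranges and its (offset, length) range
list are hypothetical names for illustration only:

```python
import mmap
import os

def write_ranges(src_path, dst_path, ranges):
    """Write (offset, length) byte ranges of src_path to dst_path in order."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Map the (non-empty) source read-only instead of read()ing it.
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            view = memoryview(mapped)
            pieces = []
            try:
                for offset, length in ranges:
                    # Slicing a memoryview does not copy the bytes.
                    pieces.append(view[offset:offset + length])
                expected = sum(len(p) for p in pieces)
                # One gather-write system call for all the pieces.
                written = os.writev(dst.fileno(), pieces)
                if written != expected:
                    raise OSError("short write: %d of %d bytes"
                                  % (written, expected))
            finally:
                # The mapping cannot be closed while views exist.
                for piece in pieces:
                    piece.release()
                view.release()
```

	A real implementation would also have to retry short
writes and respect the system limit on the number of buffers
per writev call; this sketch simply raises.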

	On the other hand, if you recognize that there is a
duplication of the work of matching common substrings in
attempting to store files as differences and in most compression
algorithms, including zlib and bzip2, then you might want to
consider storing the files in a format like zdelta or vcdiff, where
differential storage and compression are combined by describing
a file in terms of copy operations both from other files and
_earlier byte ranges of itself_.
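
	To make the idea concrete, here is a toy decoder for such
a format.  It is a sketch under my own assumptions and is not
the actual zdelta or vcdiff encoding: the ("add", bytes) and
("copy", address, length) tuples and the name apply_delta are
invented for illustration.  Addresses below len(source) refer
to the other file; addresses at or above it refer to output
already produced, so a copy may overlap the bytes it is writing.

```python
def apply_delta(source, ops):
    """Rebuild a target from `source` plus add/copy instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "add":
            # Literal bytes that were not found anywhere else.
            out += op[1]
        elif op[0] == "copy":
            _, addr, length = op
            # Byte by byte, so a self-copy may read bytes that this
            # same instruction has just appended.
            for i in range(addr, addr + length):
                if i < len(source):
                    out.append(source[i])                # other file
                else:
                    out.append(out[i - len(source)])     # earlier output
        else:
            raise ValueError("unknown instruction: %r" % (op[0],))
    return bytes(out)

source = b"hello world\n"
ops = [("copy", 0, 6), ("add", b"there\n"), ("copy", 12, 6)]
print(apply_delta(source, ops))
```

	The first copy takes "hello " from the source, and the
last one takes the same six bytes from the start of the output
itself, so one instruction set covers both differencing and
compression.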

	zdelta is a modification of zlib for this purpose, but
I see no permission grants associated with the author's copyright,
and I thought that zlib only looked at the previous 32kB of data.
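
	The 32kB limit is easy to observe with stock zlib.  The
sketch below does not use zdelta; it swaps in zlib's preset
dictionary feature (the zdict argument in Python's zlib module)
as a crude stand-in for compressing against a previous revision.
The helper names and the random test data are assumptions made
for the demonstration.

```python
import os
import zlib

def compress_against(previous, text):
    """Deflate `text` with `previous` preloaded as a preset dictionary."""
    comp = zlib.compressobj(zdict=previous)
    return comp.compress(text) + comp.flush()

def decompress_against(previous, blob):
    """Inverse of compress_against; needs the same `previous` bytes."""
    decomp = zlib.decompressobj(zdict=previous)
    return decomp.decompress(blob) + decomp.flush()

def delta_size(revision_bytes):
    """Size of 'old + one line' compressed against 'old' (random data)."""
    old = os.urandom(revision_bytes)   # hypothetical incompressible revision
    new = old + b"one appended line\n"
    blob = compress_against(old, new)
    assert decompress_against(old, blob) == new
    return len(blob)

small = delta_size(16 * 1024)   # previous revision fits in the window
large = delta_size(64 * 1024)   # matching bytes are more than 32kB back
print(small, large)
```

	With a 16kB revision the result should be a few hundred
bytes; with a 64kB revision the matching data lies outside the
window and the output stays roughly as large as the input.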

	Also, if you go this route, you might want to skip the
last phases of these compressors where they convert individual
characters into a more compact representation, which I think
would defeat inter-file pattern matching if you try to make
a compressed tar of the repository, and would preclude the
sendfile/mmap optimization (although they might not be worth
it at this level of granularity).  Then again, since you're
naming your files by sha1 hashes, it follows that related files
will now be farther apart as the repository grows, so the
compression opportunities for larger repositories might be
less anyhow.

                    __     ______________
Adam J. Richter        \ /
adam@yggdrasil.com      | g g d r a s i l
Received on Sun May 01 01:56:26 2005
