Re: [PATCH] improved delta support for git

From: Dan Holmsand <holmsand@gmail.com>
Date: 2005-05-18 07:43:47
Nicolas (and others),

I've been trying out your delta stuff as well. It was a bit
disappointing at first, but some tweaking payed off in the end...

First, I tried the entire bkcvs history for 2.6, but storing only the
"fs" directory tree in git (hoping that would be representative
enough, since the entire tree gets *big*). I got 4678 commits.

In its original form, it looks like this (first size is "network
size", the last one disk size on ext3. Average size per object in
bytes):

trees:  16M  (15684 files)  avg: 1119, disk:  61M
blobs: 121M  (17200 files)  avg: 7414, disk: 157M
Total: 139M  (37562 files)  avg: 3883, disk: 237M

Using your code, with unlimited delta depth:

trees:  16M  (15684 files)  avg: 1119, disk:  61M
blobs:   9M   (2333 files)  avg: 4491, disk:  15M
deltas: 30M  (14867 files)  avg: 2147, disk:  71M
Total:  83M  (37562 files)  avg: 2334, disk: 188M

Same thing, with a maximum delta depth of 2:

trees:  16M  (15684 files)  avg: 1119, disk:  61M
blobs:  45M   (6940 files)  avg: 6906, disk:  60M
deltas: 20M  (10260 files)  avg: 2086, disk:  48M
Total:  83M  (37562 files)  avg: 2334, disk: 188M

So, total size from a network perspective went from 139M to 83M, which
seemed a little disappointing to me.

I think there are too reasons, as shown by these statistics:

1) Too many deltas get too big and/or compress badly.

2) Trees take up a big chunk of total space.


Therefore, I tried some other approaches. This one seemed to work
best:

1) I limit the maximum size of any delta to 10% of the size of the new
version. That guarantees a big saving, as long as any delta is
produced.

2) If the "previous" version of a blob is a delta, I produce the new
delta form the old deltas base version. This works surprisingly well.
I'm guessing the reason for this is that most changes are really
small, and they tend to be in the same area as a previous change (as
in "Commit new feature. Commit bugfix for new feature. Commit fix for
bugfix of new feature. Delete new feature as it doesn't work...").

3) I use the same method for all tree objects.

This method of "opportunistic delta compression" has some other
advantages: No risk of long delta chains (as the maximum delta depth
is one). It should be disk cache friendly, as many deltas are produced
against the same base version. And this method could easily be used
incrementally, or "on the fly", as forward deltas are used.

Using these tweaks helped a lot, size wise:

trees:   3M   (9746 files)  avg:  380, disk:  38M
blobs:  13M   (3301 files)  avg: 4208, disk:  20M
deltas: 11M  (19837 files)  avg:  586, disk:  78M
Total:  28M  (37562 files)  avg:  799, disk: 155M

As this method turned 139M worth of git repository into 28M, I decided
to try the same method on the entire bkcvs history (28203 commits).

Plain vanilla git looks like this:

trees:  246M  (156812 files)  avg: 1647, disk:  699M
blobs: 1171M  (185458 files)  avg: 6623, disk: 1573M
Total: 1422M  (370473 files)  avg: 4025, disk: 2382M

The delta compressed approach outlined above yields:

trees:   47M   (73519 files)  avg:  672, disk:  289M
blobs:  156M   (49857 files)  avg: 3285, disk:  281M
deltas: 107M  (218894 files)  avg:  515, disk:  863M
Total:  315M  (370473 files)  avg:  892, disk: 1544M

So, 1.4G became 315M. Not too bad, IMHO. Disk size is still big,
of course, but disks are apparently cheap these days.

It could probably be even better, if git didn't produce quite as many
tree objects. Some sort of chunking together of tree objects would
help delta compression a lot (and improve disk size quite a bit in the
process).

Attached is a patch (against current cogito). It is basically the same
as yours, Nicolas, except for some hackery to make the above possible.
I'm sure I've made lots of stupid mistakes in it (and the 10% limit is
hardcoded right now; I'm lazy).

/dan


-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Received on Wed May 18 08:58:09 2005

This archive was generated by hypermail 2.1.8 : 2005-05-18 08:58:10 EST