Nicolas (and others),

I've been trying out your delta stuff as well. It was a bit
disappointing at first, but some tweaking paid off in the end...

First, I tried the entire bkcvs history for 2.6, but storing only the
"fs" directory tree in git (hoping that would be representative enough,
since the entire tree gets *big*). I got 4678 commits.

In its original form, it looks like this (the first size is "network
size", the last one is disk size on ext3; avg is the average size per
object in bytes):

  trees:   16M (15684 files) avg: 1119, disk:  61M
  blobs:  121M (17200 files) avg: 7414, disk: 157M
  Total:  139M (37562 files) avg: 3883, disk: 237M

Using your code, with unlimited delta depth:

  trees:   16M (15684 files) avg: 1119, disk:  61M
  blobs:    9M  (2333 files) avg: 4491, disk:  15M
  deltas:  30M (14867 files) avg: 2147, disk:  71M
  Total:   83M (37562 files) avg: 2334, disk: 188M

Same thing, with a maximum delta depth of 2:

  trees:   16M (15684 files) avg: 1119, disk:  61M
  blobs:   45M  (6940 files) avg: 6906, disk:  60M
  deltas:  20M (10260 files) avg: 2086, disk:  48M
  Total:   83M (37562 files) avg: 2334, disk: 188M

So, total size from a network perspective went from 139M to 83M, which
seemed a little disappointing to me. I think there are two reasons, as
shown by these statistics:

1) Too many deltas get too big and/or compress badly.
2) Trees take up a big chunk of total space.

Therefore, I tried some other approaches. This one seemed to work best:

1) I limit the maximum size of any delta to 10% of the size of the new
   version. That guarantees a big saving whenever a delta is produced
   at all.
2) If the "previous" version of a blob is a delta, I produce the new
   delta from the old delta's base version. This works surprisingly
   well. I'm guessing the reason for this is that most changes are
   really small, and they tend to be in the same area as a previous
   change (as in "Commit new feature. Commit bugfix for new feature.
   Commit fix for bugfix of new feature. Delete new feature as it
   doesn't work...").
3) I use the same method for all tree objects.

This method of "opportunistic delta compression" has some other
advantages: No risk of long delta chains (as the maximum delta depth is
one). It should be disk cache friendly, as many deltas are produced
against the same base version. And this method could easily be used
incrementally, or "on the fly", as forward deltas are used.

Using these tweaks helped a lot, size wise:

  trees:    3M  (9746 files) avg:  380, disk:  38M
  blobs:   13M  (3301 files) avg: 4208, disk:  20M
  deltas:  11M (19837 files) avg:  586, disk:  78M
  Total:   28M (37562 files) avg:  799, disk: 155M

As this method turned 139M worth of git repository into 28M, I decided
to try the same method on the entire bkcvs history (28203 commits).

Plain vanilla git looks like this:

  trees:   246M (156812 files) avg: 1647, disk:  699M
  blobs:  1171M (185458 files) avg: 6623, disk: 1573M
  Total:  1422M (370473 files) avg: 4025, disk: 2382M

The delta compressed approach outlined above yields:

  trees:    47M  (73519 files) avg:  672, disk:  289M
  blobs:   156M  (49857 files) avg: 3285, disk:  281M
  deltas:  107M (218894 files) avg:  515, disk:  863M
  Total:   315M (370473 files) avg:  892, disk: 1544M

So, 1.4G became 315M. Not too bad, IMHO. Disk size is still big, of
course, but disks are apparently cheap these days.

It could probably be even better, if git didn't produce quite as many
tree objects. Some sort of chunking together of tree objects would help
delta compression a lot (and improve disk size quite a bit in the
process).

Attached is a patch (against current cogito). It is basically the same
as yours, Nicolas, except for some hackery to make the above possible.
I'm sure I've made lots of stupid mistakes in it (and the 10% limit is
hardcoded right now; I'm lazy).

/dan
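
For readers without the patch at hand, here is a rough Python sketch of the "opportunistic" scheme described above: a delta is kept only if it is at most 10% of the new version's size, and a new delta is always taken against a full object (the base of the previous version), so the delta depth never exceeds one. This is only an illustration under stated assumptions, not the code from the patch: the `Store` class, the toy copy/insert delta format, and the per-op size estimate in `delta_size` are all made up for the example, and `difflib` stands in for the real delta generator.

```python
import difflib

MAX_DELTA_RATIO = 0.10  # the 10% cap from the mail; everything else is assumed


def make_delta(base: bytes, new: bytes):
    """Toy copy/insert delta (a stand-in, not the format used by the patch)."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif j2 > j1:  # 'replace' or 'insert': carry the new bytes literally
            ops.append(("insert", new[j1:j2]))
        # 'delete' needs no op: those base bytes are simply never copied
    return ops


def delta_size(ops) -> int:
    """Assumed encoding cost: 8 bytes per copy op, 4 + payload per insert."""
    return sum(8 if op[0] == "copy" else 4 + len(op[1]) for op in ops)


def apply_delta(base: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)


class Store:
    """Holds each version as a full object or as a depth-one delta."""

    def __init__(self):
        # name -> ("full", data) or ("delta", base_name, ops)
        self.objects = {}

    def base_of(self, name):
        entry = self.objects[name]
        return name if entry[0] == "full" else entry[1]

    def read(self, name) -> bytes:
        entry = self.objects[name]
        if entry[0] == "full":
            return entry[1]
        # A delta's base is always a full object, so one lookup suffices.
        return apply_delta(self.objects[entry[1]][1], entry[2])

    def add(self, name, data: bytes, previous=None):
        if previous is not None:
            # If the previous version is itself a delta, diff against its base.
            base_name = self.base_of(previous)
            ops = make_delta(self.objects[base_name][1], data)
            if delta_size(ops) <= MAX_DELTA_RATIO * len(data):
                self.objects[name] = ("delta", base_name, ops)
                return
        # No previous version, or the delta was too big: store in full.
        self.objects[name] = ("full", data)
```

Because `add` never chains onto an existing delta, `read` needs at most one base lookup, which is the "no long delta chains" property mentioned above. A version that differs too much from the shared base simply falls back to a full object and becomes a candidate base for later versions.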