Re: RFC: adding xdelta compression to git

From: Linus Torvalds <torvalds@osdl.org>
Date: 2005-05-03 14:52:42

On Tue, 3 May 2005, Alon Ziv wrote:
> 
> 1. Add a git-deltify command, which will take two trees and replace the second 
> tree's blobs with delta-blobs referring to the first tree.

If you do something like this, you want such a delta-blob to be named by 
the sha1 of the result, so that things that refer to it can transparently 
see either the original blob _or_ the "deltified" one, and will never 
care.

It seems that is your plan:

> from the outside it looks like any other blob, but internally it
> contains another blob reference + an xdelta.

Yes. git doesn't much care, as long as the objects unpack to the right 
format. That's all hidden away.

> The only function which would need to understand the new format would be
> unpack_sha1_file.

Yes. EXCEPT for one thing. fsck. I'd _really_ like fsck to be able to know
something about any xdelta objects, if only because if/when things go
wrong, it's really nasty to suddenly see a million "blob" objects not work
any more, with no indication of _why_ they don't work. The core reason may
be that one original object (that just got used as a base for tons of
other objects through deltas) is corrupt or missing. And then you want to
show that _one_ object.
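
A minimal sketch of that kind of reporting, assuming a hypothetical in-memory
"struct object" with a "base" pointer and a "corrupt" flag (neither is git's
real data structure). It walks each delta chain down to the deepest broken
object, so fsck could name that one object and count its victims:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an object as fsck might see it. */
struct object {
    const char *name;            /* illustrative short name, not a real sha1 */
    const struct object *base;   /* non-NULL if this is a delta against base */
    int corrupt;                 /* nonzero if this object is corrupt or missing */
};

/*
 * Follow the delta chain starting at o and return the deepest object
 * that is itself broken, or NULL if the whole chain is intact.
 */
const struct object *find_broken_root(const struct object *o)
{
    const struct object *broken = NULL;

    for (; o; o = o->base)
        if (o->corrupt)
            broken = o;
    return broken;
}

/* Count the objects in objs[] that are unusable only because of root. */
size_t count_victims(const struct object *objs, size_t nr,
                     const struct object *root)
{
    size_t i, n = 0;

    assert(root != NULL);
    for (i = 0; i < nr; i++)
        if (&objs[i] != root && find_broken_root(&objs[i]) == root)
            n++;
    return n;
}
```

With this, a report can say "object X is corrupt, and N deltas depend on it"
instead of listing N unrelated-looking blob failures.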

> Cons:
> * Changes the repository format.

It wouldn't necessarily. You should be able to do this with _zero_ changes 
to existing objects whatsoever.

What you do is introduce an "xdelta" object, which has a reference to a 
blob object and the delta. The git object model already names all objects 
by a simple ascii name, so adding a new object type in _no_ way changes 
any existing objects.

So you can just make "unpack_sha1_file()" notice that it unpacked an xdelta 
object, and then do the proper delta application, and nobody will ever be 
the wiser.
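
A minimal sketch of that dispatch, under loud assumptions: the objects live in
memory rather than in sha1-named files, the delta is a toy list of copy/insert
instructions rather than the real xdelta encoding, and "unpack_object",
"struct object" and "struct delta_op" are hypothetical names, not git's API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory stand-ins; these are not git's real structures. */
enum obj_type { OBJ_BLOB, OBJ_XDELTA };

/* Toy delta instruction, NOT the real xdelta encoding. */
struct delta_op {
    int from_base;      /* 1: copy from the base, 0: insert literal bytes */
    size_t off;         /* copy only: start offset in the base */
    size_t len;         /* number of bytes to copy or insert */
    const char *lit;    /* insert only: the literal bytes */
};

struct object {
    enum obj_type type;
    const char *data;              /* OBJ_BLOB: contents */
    size_t size;                   /* OBJ_BLOB: length of data */
    const struct object *base;     /* OBJ_XDELTA: the referenced object */
    const struct delta_op *ops;    /* OBJ_XDELTA: the delta itself */
    size_t nr_ops;
};

/*
 * Return a malloc'd, NUL-terminated copy of the full blob contents, or
 * NULL on allocation failure or a delta that reads outside its base.
 * Callers never learn whether the object was stored as a delta.
 */
char *unpack_object(const struct object *o, size_t *size)
{
    char *base, *out;
    size_t base_size, total = 0, pos = 0, i;

    if (o->type == OBJ_BLOB) {
        out = malloc(o->size + 1);
        if (!out)
            return NULL;
        memcpy(out, o->data, o->size);
        out[o->size] = '\0';
        *size = o->size;
        return out;
    }

    assert(o->type == OBJ_XDELTA);

    /* The base may itself be a delta, so unpack it the same way. */
    base = unpack_object(o->base, &base_size);
    if (!base)
        return NULL;

    /* First pass: validate copy ranges and compute the result size. */
    for (i = 0; i < o->nr_ops; i++) {
        const struct delta_op *op = &o->ops[i];

        if (op->from_base &&
            (op->off > base_size || op->len > base_size - op->off)) {
            free(base);
            return NULL;
        }
        total += op->len;
    }

    out = malloc(total + 1);
    if (!out) {
        free(base);
        return NULL;
    }

    /* Second pass: apply the delta. */
    for (i = 0; i < o->nr_ops; i++) {
        const struct delta_op *op = &o->ops[i];

        memcpy(out + pos, op->from_base ? base + op->off : op->lit, op->len);
        pos += op->len;
    }
    out[pos] = '\0';
    free(base);
    *size = total;
    return out;
}
```

The caller gets the same bytes either way, which is the whole point: only this
one function has to know that deltas exist.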

> * Some performance impact (probably quite small).

If you limit the depth of deltas, probably not too bad.
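
A minimal sketch of such a limit, assuming a hypothetical "struct object" with
a "base" pointer and an arbitrary cap of 10 (the number is an illustration,
not something proposed in this thread). Unpacking an object costs one delta
application per link, so the cap bounds the worst case:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DELTA_DEPTH 10   /* assumed cap, chosen only for illustration */

/* Hypothetical stand-in: base is NULL for a plain blob. */
struct object {
    const struct object *base;
};

/* Number of delta applications needed to reconstruct o. */
size_t delta_depth(const struct object *o)
{
    size_t depth = 0;

    assert(o != NULL);
    while (o->base) {
        depth++;
        o = o->base;
    }
    return depth;
}

/* Would a new delta against base stay within the depth limit? */
int can_deltify_against(const struct object *base)
{
    return delta_depth(base) + 1 <= MAX_DELTA_DEPTH;
}
```

A deltifier would check can_deltify_against() before choosing a base, and
store the object as a plain blob when the chain is already too long.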

> * Same blob may have different representation in two repositories (one 
> compressed, one deltified). [I am not sure this is really a bad thing...]

THIS, I think, is the real issue. fsck-cache, pull, etc., which need to
know about references to other objects, would have to be able to see the
xdelta object, so that they can build up the reference graph. So you'd
need to basically make a "raw_unpack_sha1_file()" interface (the current
regular unpack_sha1_file()) for that.
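
A minimal sketch of the two views, assuming a hypothetical "struct stored"
that records what is really on disk. The helper names below are made up for
illustration; only the split between a "raw" and a transparent interface comes
from the text above:

```c
#include <assert.h>
#include <stddef.h>

enum obj_type { OBJ_BLOB, OBJ_XDELTA };

/* Hypothetical record of what is actually stored for one object. */
struct stored {
    enum obj_type type;
    const char *base_name;   /* OBJ_XDELTA: name of the base object, else NULL */
};

/* Raw view, for fsck-cache and pull: report the stored representation. */
enum obj_type raw_type(const struct stored *s)
{
    return s->type;
}

/* Transparent view, for everybody else: a delta just looks like a blob. */
enum obj_type logical_type(const struct stored *s)
{
    return s->type == OBJ_XDELTA ? OBJ_BLOB : s->type;
}

/*
 * The extra edge in the reference graph: the object that must also be
 * present for s to be usable, or NULL if s stands on its own.
 */
const char *required_object(const struct stored *s)
{
    assert(s->type != OBJ_XDELTA || s->base_name != NULL);
    return s->type == OBJ_XDELTA ? s->base_name : NULL;
}
```

A pull that copies a delta without also fetching required_object() would
leave the receiving repository with a blob it cannot unpack.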

Also, the fact is, since git saves things as separate files, you'd not win
as much as you would with some other backing store. So the second step is
to start packing the objects etc. I think there is actually a very steep
complexity edge here - not because any of the individual steps necessarily
add a whole lot, but because they all lead to the "next step".
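
One plausible way to see the separate-files point, sketched under the
assumption that the filesystem allocates whole fixed-size blocks per file (the
4096-byte block size and the example sizes are illustrative, not measurements):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes actually allocated for a file of the given size, assuming the
 * filesystem hands out whole blocks and does no tail packing.
 */
size_t allocated_bytes(size_t size, size_t block_size)
{
    assert(block_size > 0);
    return (size + block_size - 1) / block_size * block_size;
}
```

Under that assumption a 3000-byte compressed blob and a 200-byte delta both
occupy one 4096-byte block, so the delta saves nothing on disk until objects
are packed together.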

I personally clearly feel that simplicity (and the resulting robustness)
is worth a _lot_ of disk-space.

So I think that what you suggest is likely to actually be pretty easy, but 
I'm not entirely convinced it's worth the slide into complexity.

		Linus
Received on Tue May 03 14:51:25 2005
