Re: Index/hash order

From: Linus Torvalds <torvalds@osdl.org>
Date: 2005-04-13 08:25:10
On Tue, 12 Apr 2005, H. Peter Anvin wrote:
>
> I was a bit surprised to see that the hash used is the hash of a 
> compressed blob, instead of hashing the raw contents.  This makes the 
> hash sensitive to compression algorithm, or even, for that matter, the 
> compression level used.

Yes. It's a downside, but the upsides make it worth it.

Also, it's being somewhat controlled by the fact that zlib is a pretty
stable library _and_ I set the compression level explicitly.

If it turns out that zlib changes can change the hash or that some
versions of it can "seed" the compression thing differently - and thus
generate different hashes - we may have to import one particular version
of zlib into the system, but I sincerely hope (and believe) that's not
going to be needed.
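
(To make the scheme concrete, here's a rough Python sketch of the idea. It
is not git's actual C code, and the exact header layout is only for
illustration, but it shows why pinning the compression level matters: the
name is the SHA1 of the _compressed_ bytes.)

        import hashlib
        import zlib

        def object_name(data: bytes, obj_type: bytes = b"blob") -> str:
            # Illustrative header: "<type> <size>\0" followed by the payload.
            header = obj_type + b" " + str(len(data)).encode() + b"\0"
            # Compress at an explicitly pinned level, then name the object
            # by the SHA1 of the compressed bytes.  Change the level (or
            # zlib's behaviour) and the name changes too.
            compressed = zlib.compress(header + data, 9)
            return hashlib.sha1(compressed).hexdigest()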

> If the hash instead were made from the uncompressed contents, then 
> supporting a new compression algorithm would just be a matter of adding 
> it to zlib, and compression strength could be made user-settable and 
> even changed on the fly.

No.

And I'll tell you why I'm opposed to that.

I really think that the _distribution_ is a very fundamental part of git, 
and it was actually one of the fundamental guiding goals behind the whole 
design. In fact, pretty much everything in git really stems from two 
goals:
 - totally distributed (and replicated) data
 - performance stays constant regardless of history size

And part of that "it should distribute cleanly" is the fact that a git 
archive should never EVER have any "merge conflicts" on any other level, 
and we should never need any locking.

For example, say that you run git on a truly distributed and not
necessarily fully connected filesystem. One of the _primary_ goals in the
whole design of git is that distributed nature. And that means that a
database object really has to look the same to everybody. You can't have
one user that has a lower compression level because he has a slow CPU,
because when that user re-connects with another user, they now have two
different objects.
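
(Concretely, assuming the naming sketched above, the divergence is easy to
see: zlib's output differs between levels, so the SHA1s differ even though
the logical content is identical.)

        import hashlib
        import zlib

        data = b"blob 12\0hello, world"
        # Same logical content, two compression levels: the compressed
        # bytes differ, so the names differ too.
        print(hashlib.sha1(zlib.compress(data, 1)).hexdigest())
        print(hashlib.sha1(zlib.compress(data, 9)).hexdigest())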

Can you merge it by just saying that the objects are the same? Yes. Git
wouldn't care. So if you have git-specific merge knowledge, you can
resolve things fine. But I really want to have the possibility of running
a git database over a non-coherent but general-purpose filesystem. Git
should do the right thing. 

Think of a silly mirroring system, for example. It should never need to
worry its brain about whether it can connect to multiple mirror masters.  
Which implies that all objects have to be identical (at least if you want
to just run standard mirroring software like "rsync").

So that really means that a database object has to _look_ exactly the same
to everybody. It's not ok to have two files that are different, but have
the same name and the same "git contents". That would destroy one of the
fundamental goals of git, as far as I'm concerned. It would mean that
something would have to "merge" the object clash.

Think "shared object directory over crappy network filesystem". I claim 
that if two people import the same patch at roughly the same time, they 
SHOULD NOT CARE that their filesystem may not be entirely coherent. They 
will create the _exact_ same blob, and regardless of whether they end up 
mixing caches in strange ways (including getting some page cache pages 
from their own writes, and some from the other client's writes), it should 
still work.

(Btw, before you say anything, I know that git currently doesn't create
new objects in a totally safe way in a network environment. It should
generate a new object first to a temp-file, and then rename that to the
final file. I was lazy. But the _design_ allows for git to do the right
thing).
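
(The fix is the standard write-then-rename dance; roughly, in Python, with a
made-up objects_dir layout:)

        import os
        import tempfile

        def write_object(objects_dir: str, name: str, compressed: bytes) -> None:
            # Objects live in a fan-out directory named by the first two
            # hex digits of the hash, like .git/objects/01/c92f26...
            obj_dir = os.path.join(objects_dir, name[:2])
            os.makedirs(obj_dir, exist_ok=True)
            # Write to a temp file in the same directory, then rename it
            # into place: rename() is atomic, so readers never see a
            # half-written object, and two racing writers just produce
            # identical bytes.
            fd, tmp = tempfile.mkstemp(dir=obj_dir)
            with os.fdopen(fd, "wb") as f:
                f.write(compressed)
            os.rename(tmp, os.path.join(obj_dir, name[2:]))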

So I do agree that it's strange to do the SHA1 on the compressed object.  
But that was very much by design, and I think I have a really good reason
for it.

Btw - it does allow for variations in compression. The way git works, if
two different people use different compression, they'll get different
object names even though the blob those object names describe may be
identical. That's a waste of space, but it's not something that git will 
be fundamentally unhappy about - things will still work, and a merge 
algorithm would have to be pretty damn stupid to not be able to merge two 
identical files, so they will "join" well at a git level eventually.

So this approach _does_ allow for changing the compression algorithm (as
long as we make sure that everybody uses a new git version that can read
both, so that you can mix old and new objects), or even an algorithm that
isn't entirely stable (I'm pretty sure zlib is), but it does waste a lot
of space in that case.

IOW, the only _really_ important thing is that a single hash must uniquely
identify one and only one object. If two otherwise identical objects get
two different hashes, that's a "bad thing", but it's something git can 
handle. It comes with a (pretty high) cost of diskspace, though.

In fact, git has already handled exactly that: if you look at the changes
I did when I changed the "tree" object from being the full directory
structure to being just a single directory, that change causes effectively
exactly this: two identical trees can generate a number of different object 
combinations depending on how you encoded them. Now, fsck will complain 
about all but the "canonical" format, but the fact is, it all _works_ even 
if you mix non-canonical objects.

I dunno. I feel pretty strongly that I made a good choice, but on the 
other hand, I've felt that way about things I ended up being totally wrong 
about, so don't let that stop you from arguing. Maybe my reasons aren't as 
good as I thought they were, so feel free to shoot holes in my parade ;)

Side note: there are a couple of nice advantages to the way git does things:

One is just the funny trivial verification you can do without any 
git-specific knowledge at all:

	torvalds@ppc970:~> sha1sum git/.git/objects/01/c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 
	01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7  git/.git/objects/01/c92f2620a8e13e7cb7fd98ee644c6b65eeccb7

see the pattern ;)

The other is that it means that it's _really_ hard to fake a git object. 
Not only does the sha1 checksum have to match, it needs to uncompress 
cleanly too, and the resulting header needs to be sensible and match the 
result (it encodes the size of the uncompressed object in the header). So 
you can't just replace one "blob" object with another that has the same 
sha1, even if you could create such a thing. You'd have to work at it. 

(This is mainly an issue for "blob"s, since the other objects have
internal structure that git can check, so you can't just replace them with
another object anyway. Think of the "compression" phase to be an internal
structure that "fsck-cache" can - and does - verify even for an object
that you otherwise can't "test" for correctness).
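
(A rough sketch of the kind of check this allows, again in Python and again
with the illustrative header layout from above, not fsck-cache itself:)

        import hashlib
        import zlib

        def verify_object(name: str, compressed: bytes) -> bool:
            # 1. The SHA1 of the file contents must match the object name.
            if hashlib.sha1(compressed).hexdigest() != name:
                return False
            # 2. It must uncompress cleanly.
            try:
                raw = zlib.decompress(compressed)
            except zlib.error:
                return False
            # 3. The header must be sensible, and the size it declares
            #    must match the actual uncompressed payload.
            header, sep, body = raw.partition(b"\0")
            fields = header.split(b" ")
            if sep != b"\0" or len(fields) != 2 or not fields[1].isdigit():
                return False
            return fields[0] in (b"blob", b"tree", b"commit") and int(fields[1]) == len(body)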

		Linus