On Tue, 12 Apr 2005, H. Peter Anvin wrote: > > I was a bit surprised to see that the hash used is the hash of a > compressed blob, instead of hashing the raw contents. This makes the > hash sensitive to compression algorithm, or even, for that matter, the > compression level used. Yes. It's a downside, but the upsides make it worth it. Also, it's being somewhat controlled by the fact that zlib is a pretty stable library _and_ I set the compression level explicitly. If it turns out that zlib changes can change the hash or that some versions of it can "seed" the compression thing differently - and thus generate different hashes - we may have to import one particular version of zlib into the system, but I sincerely hope (and believe) that's not going to be needed. > If the hash instead were made from the uncompressed contents, then > supporting a new compression algorithm would just be a matter of adding > it to zlib, and compression strength could be made user-settable and > even changed on the fly. No. And I'll tell you why I'm opposed to that. I really think that the _distribution_ is a very fundamental part of git, and it was actually one of the fundamental guiding goals behind the whole design. In fact, pretty much everything in git really stems from two goals: - totally distributed (and replicated) data - performance stays constant regardless of history size And part of that "it should distribute cleanly" is the fact that a git archive should never EVER have any "merge conflicts" on any other level, and we should never need any locking. For example, say that you run git on a truly distributed and not necessarily fully connected filesystem. One of the _primary_ goals in the whole design of git is that distributed nature. And that means that a database object really has to look the same to everybody. You can't have one user that has a lower compression level because he has a slow CPU, because when that user re-connects with another user, they now have two different objects. Can you merge it by just saying that the objects are the same? Yes. Git woudln't care. So if you have git-specific merge knowledge, you can resolve things fine. But I really want to have the possibility of running a git database over a non-coherent but general-purpose filesystem. Git should do the right thing. Think of a silly mirroring system, for example. It should never need to worry its brain about whether it can connect to multiple mirror masters. Which implies that all objects have to be identical (at least if you want to just run standard mirroring software like "rsync"). So that really mesn that a database object has to _look_ exactly the same to everybody. It's not ok to have two files that are different, but have the same name and the same "git contents". That would destroy one of the fundamental goals of git, as far as I'm concerned. It would mean that something would have to "merge" the object clash. Think "shared object directory over crappy network filesystem". I claim that if two people import the same patch at roughly the same time, they SHOULD NOT CARE that their filesystem may not be entirely coherent. They will create the _exact_ same blob, and regardless of whether they end up mixing caches in strange ways (including getting some page cache pages from their own writes, and some from the other clients writes), it should still work. (Btw, before you say anything, I know that git currently doesn't create new objects in a totally safe way in a network environment. It should generate a new object first to a temp-file, and then rename that to the final file. I was lazy. But the _design_ allows for git to do the right thing). So I do agree that it's strange to do the SHA1 on the compressed object. But that was very much by design, and I think I have a really good reason for it. Btw - it does allow for variations in compression. The way git works, if two different people use different compression, they'll get different object names even though the blob those object names describe may be identical. That's a waste of space, but it's not something that git will be fundamentally unhappy about - things will still work, and a merge algorithm would have to be pretty damn stupid to not be able to merge two identical files, so they will "join" well at a git level eventually. So this approach _does_ allow for changing the compression algorithm (as long as we make sure that everybody uses a new git version that can read both, so that you can mix old and new objects), or even an algorithm that isn't entirely stable (I'm pretty sure zlib is), but it does waste a lot of space in that case. IOW, the only _really_ important thing is that a single hash must uniquely identify one and only one object. If two otherwise identical objects get two different hashes, that's a "bad thing", but it's something git can handle. It comes with a (pretty high) cost of diskspace, though. In fact, git has already handled exactly that: if you look at the changes I did when I changed the "tree" object from being the full directory structure, to be just a single directory, that causes effectively exactly this thing: two identical trees can generate a number of different object combinations depending on how you encoded them. Now, fsck will complain about all but the "canonical" format, but the fact is, it all _works_ even if you mix non-canonical objects. I dunno. I feel pretty strongly that I made a good choice, but on the other hand, I've felt that way about things I ended up being totally wrong about, so don't let that stop you from arguing. Maybe my reasons aren't as good as I thought they were, so feel free to shoot holes in my parade ;) Side note: there's a couple of nice advantages to the way git does thing: One is just the funny trivial verification you can do without any git-specific knowledge at all: torvalds@ppc970:~> sha1sum git/.git/objects/01/c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 git/.git/objects/01/c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 see the pattern ;) The other is that it means that it's _really_ hard to fake a git object. Not only does the sha1 checksum have to match, it needs to uncompress cleanly too, and the resulting header needs to be sensible and match the result (it encodes the size of the uncompressed object in the header). So you can't just replace one "blob" object with another that has the same sha1, even if you could create such a thing. You'd have to work at it. (This is mainly an issue for "blob"s, since the other objects have internal structure that git can check, so you can't just replace them with another object anyway. Think of the "compression" phase to be an internal structure that "fsck-cache" can - and does - verify even for an object that you otherwise can't "test" for correctness). Linus - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.htmlReceived on Tue Apr 12 15:30:52 2005
This archive was generated by hypermail 2.1.8 : 2005-04-15 12:56:42 EST