Re: [PATCH] multi item packed files

From: Linus Torvalds <torvalds@osdl.org>
Date: 2005-04-23 05:43:15

On Fri, 22 Apr 2005, Chris Mason wrote:
> 
> The problem I see for git is that once you have enough data, it should degrade 
> over and over again somewhat quickly.

I really doubt that.

There's a more or less constant amount of new data added all the time: the 
number of changes does _not_ grow with history. The number of changes 
grows with the amount of changes going on in the tree, and while that 
isn't exactly constant, it definitely is not something that grows very 
fast. 

Btw, this is how git is able to be so fast in the first place. Git is fast 
because it knows that the "size of the change" is a lot smaller than the 
"size of the repository", so it fundamentally at all points tries to make 
sure that it only ever bothers with stuff that has changed.

Stuff that hasn't changed, it ignores very _very_ efficiently. 

That's really the whole point of the index file: it's a way to quickly
ignore the stuff that hasn't changed - both for simple operations like
"show-diff", but also for complex operations like "merge these three
trees".
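
As a rough illustration of the idea (a sketch only, not git's actual code or
on-disk format): keep cached stat data per file, and treat any file whose
stat data still matches as unchanged without reading its contents. The names
`snapshot` and `changed_paths` are made up for this example, and comparing
only mtime and size is a simplifying assumption.

```python
import os

def snapshot(paths):
    """Record (mtime_ns, size) per path; a stand-in for cached stat data."""
    index = {}
    for p in paths:
        st = os.stat(p)
        index[p] = (st.st_mtime_ns, st.st_size)
    return index

def changed_paths(index):
    """Return the paths whose stat data no longer matches the snapshot."""
    changed = []
    for p, cached in index.items():
        try:
            st = os.stat(p)
        except FileNotFoundError:
            # A file that disappeared counts as changed.
            changed.append(p)
            continue
        if (st.st_mtime_ns, st.st_size) != cached:
            changed.append(p)
    return changed
```

The cost of the check is one stat() per tracked file and no content reads, so
only the files that show up as changed need any further work.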

And it works exactly because the number of changes does _not_ grow at all 
linearly with the history of the project. In fact, in most projects, the 
rate of change goes _down_ when the project grows, because the project 
matures and generally gets more complicated and thus harder to change.

(The kernel _really_ is pretty special. I am willing to bet that there are
not a lot of big projects that have been able to continue to take changes
at the kind of pace that the kernel does. But we've had to work at it a
lot, including obviously using SCM tools that are very much geared towards
scaling. Why do you think the kernel puts more pressure on SCMs than
other projects? It's exactly because we're trying to scale our change
acceptance to bigger numbers).

So saying "once you have enough data, it will degrade quickly" 
ignores the fact that the growth in the rate of change (the "second 
derivative of the size of the project in time") really isn't that high. 

> I grabbed Ingo's tarball of 28,000 patches since 2.4.0 and applied them all 
> into git on ext3 (htree).  It only took ~2.5 hrs to apply.

Ok, I'd actually wish it took even less, but that's still a pretty
impressive average of three patches a second.
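
For reference, the average works out as follows, taking the quoted figures
(28,000 patches, roughly 2.5 hours) at face value:

```python
patches = 28_000
seconds = 2.5 * 60 * 60          # 2.5 hours = 9000 seconds
rate = patches / seconds         # average patches applied per second
print(f"{rate:.2f} patches/second")
```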

> Anyway, I ended up with a 2.6GB .git directory.  Then I:
> 
> rm .git/index
> umount ; mount again
> time read-tree `tree-id` (24.45s)
> time checkout-cache --prefix=../checkout/ -a -f (4m30s)
> 
> --prefix is neat ;)

That sounds pretty acceptable. Four minutes is a long time, but I assume
that the whole point of the exercise was to try to test worst-case
behaviour.  We can certainly make sure that real usage gets lower numbers
than that (in particular, my "real usage" ends up being 100% in the disk
cache ;)

			Linus
Received on Sat Apr 23 05:41:41 2005