Re: [PATCH] multi item packed files

From: Chris Mason <mason@suse.com>
Date: 2005-04-23 04:58:34
On Friday 22 April 2005 12:22, Linus Torvalds wrote:
> On Thu, 21 Apr 2005, Chris Mason wrote:
> > We can sort by the files before reading them in, but even if we order
> > things perfectly, we're spreading the io out too much across the drive.
>
> No we don't.
>
> It's easy to just copy the repository in a way where this just isn't true:
> you sort the objects by how far they are from the current HEAD, and you
> just copy the repository in that order ("furthest" objects first - commits
> last).
>
> That's what I meant by defragmentation - you can actually do this on your
> own, even if your filesystem doesn't support it.

This certainly can help.  Based on some ideas from andrea I made a poor man's 
defrag script last year that was similar.  It worked by copying files into a 
flat dir in the order you expected to read them in, deleting the original, 
then hard linking them into their original name.

Copying in order straight into a new git tree doesn't help much when the 
filesystem is using the subdirectory as a hint to block allocation.  So 
you'll probably have to copy them all into a flat directory and then hard 
link back into the git tree (the flat dir can then be deleted of course).

The problem I see for git is that once you have enough data, it should degrade 
over and over again somewhat quickly.  My own guess is that you'll need to 
run the script at least monthly.  If we're designing the thing now and say 
'wow, that's going to be really slow without help', it doesn't hurt to look 
at alternatives.

I grabbed Ingo's tarball of 28,000 patches since 2.4.0 and applied them all 
into git on ext3 (htree).  It only took ~2.5 hrs to apply.  I did use my  
write-tree patch where you had to give write-tree a list of directories to 
search, but I don't think this helped much since the operation was mostly 
disk write bound.

Anyway, I ended up with a 2.6GB .git directory.  Then I:

rm .git/index
umount ; mount again
time read-tree `tree-id` (24.45s)
time checkout-cache --prefix=../checkout/ -a -f (4m30s)

--prefix is neat ;)

The tree that ended up in checkout was 239456k, giving us an effective io rate 
for checkout-cache of 885k/s.  (this drive gets 24MB/s sequential reads).

I'll have numbers for the packed files later on today.  No, I don't really 
expect the numbers will convince you to implement some kind of packing ;)  
But it's still a good data point to have, and generating them here is just 
poking the box every 2 hours or so.

-chris
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Received on Sat Apr 23 04:58:51 2005

This archive was generated by hypermail 2.1.8 : 2005-04-23 04:58:51 EST