Re: fast-import and unique objects.

From: Shawn Pearce <spearce@spearce.org>
Date: 2006-08-07 04:03:24

Jon Smirl <jonsmirl@gmail.com> wrote:
> On 8/6/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> >This model has a lot of object duplication. I generated 949,305
> >revisions, but only 754,165 are unique. I'll modify my code to build a
> >hash of the objects it has seen and then not send the duplicates to
> >fast-import. Those 195,140 duplicated objects may be what is tripping
> >index-pack up.
> 
> New run is finished with duplicate removal.
> 
> Time to run is unchanged, still 2hrs. Run time is IO bound not CPU.
> Pack file is 845MB instead of 934MB.
> git-index-pack works now, it takes 4 CPU minutes to run.
> Index file is 18MB.

I'm attaching a new version of fast-import.c which generates the
index and does duplicate removal.  However, I think that it might
be slightly faster for you to do the duplicate removal in Python,
as it saves the user-kernel-user copy of the file data.  Even so,
this new version should save you those 4 CPU minutes as the index
will be generated from the in-memory SHA1s rather than needing to
recompute them.
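
For the Python side, the duplicate check could look roughly like the
sketch below.  It is only an illustration: blob_sha1, UniqueBlobFilter
and the send callable are names made up here, not anything from your
converter or from fast-import.c.  The one thing it relies on is that
Git names a blob by the SHA-1 of "blob <size>\0" followed by the file
contents.

```python
import hashlib


def blob_sha1(data: bytes) -> bytes:
    """Return the 20-byte Git object name for a blob with these contents."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).digest()


class UniqueBlobFilter:
    """Remember which blobs were already sent and skip the repeats."""

    def __init__(self, send):
        self.send = send      # hypothetical callable that writes to fast-import
        self.seen = set()     # 20-byte SHA-1 digests of blobs already sent
        self.duplicates = 0   # how many blobs were skipped as repeats

    def add(self, data: bytes) -> bytes:
        """Send data unless an identical blob was sent before; return its SHA-1."""
        sha = blob_sha1(data)
        if sha in self.seen:
            self.duplicates += 1
        else:
            self.seen.add(sha)
            self.send(data)
        return sha
```

Keeping the raw 20-byte digests rather than hex strings in the set
roughly halves the memory the filter needs per object.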

I've changed the calling convention:

  - It now takes the pack's base name as its first parameter. It
    appends ".pack" and ".idx" to form the actual file names it is
    writing to.

  - It expects an estimated object count as its second parameter.
    In your case this would be something around 760000.  This tells
    it how large an object table to allocate, with each entry
    being 24 bytes + 1 pointer (28 or 32 bytes in total).  If the
    actual object count exceeds this number it will degrade to
    allocating one overflow entry at a time from malloc.
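
To make the two parameters concrete, here is a rough Python model of
what they control.  It is not the C code from the attachment:
pack_names and ObjectTable are hypothetical names, and the table only
counts entries rather than storing them.

```python
def pack_names(base: str):
    """Derive the two output file names from the pack's base name."""
    return base + ".pack", base + ".idx"


class ObjectTable:
    """Model of a table sized up front from an estimated object count."""

    def __init__(self, estimated_objects: int):
        self.capacity = estimated_objects  # entries allocated in one block
        self.used = 0                      # entries taken from that block
        self.overflow_allocs = 0           # entries allocated one at a time

    def new_entry(self) -> None:
        """Account for one more object in the pack."""
        if self.used < self.capacity:
            self.used += 1
        else:
            # Estimate exceeded: fall back to one allocation per entry.
            self.overflow_allocs += 1
```

An estimate that is a little too high only wastes a few unused
entries, so rounding up is the safer choice.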

So the new version should take about 20 MB of memory and should
produce a valid pack and index in the same time it currently takes
to produce only the pack.  Plus it won't generate duplicates.
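
The numbers can be checked with a few lines of Python.  The table
figure uses the entry size given above.  The index figure assumes the
original (version 1) index layout of a 256-entry fan-out table, one
4-byte offset plus 20-byte SHA-1 per object, and two trailing SHA-1
checksums.  The function names are made up for this example.

```python
ENTRY_FIXED_BYTES = 24    # per-entry size quoted above, excluding the pointer
IDX_ENTRY_BYTES = 4 + 20  # assumed v1 index entry: offset + SHA-1


def table_bytes(estimated_objects: int, pointer_bytes: int) -> int:
    """Memory for the preallocated object table."""
    return estimated_objects * (ENTRY_FIXED_BYTES + pointer_bytes)


def idx_v1_bytes(objects: int) -> int:
    """Size of a version 1 index file holding this many objects."""
    fanout = 256 * 4      # 256 four-byte cumulative counts
    trailer = 20 + 20     # pack checksum + index checksum
    return fanout + objects * IDX_ENTRY_BYTES + trailer


if __name__ == "__main__":
    for pointer_bytes in (4, 8):
        print(pointer_bytes, table_bytes(760000, pointer_bytes))
    print(idx_v1_bytes(754165))
```

For 760000 entries that is 21,280,000 bytes with 4-byte pointers and
24,320,000 bytes with 8-byte pointers.  For your 754,165 unique
objects the index comes to 18,101,024 bytes, which agrees with the
18MB file you saw.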
 
> So it looks like the first stage code is working. Next I need to
> modify cvs2svn to keep track of the sha-1 through its sorting process
> instead of file:revision.

When you get down to tree writing and commit writing we might want
to do something similar with the trees and commits.  I can modify
fast-import to also store those off into a pack.

-- 
Shawn.


Received on Mon Aug 07 04:04:12 2006
