Jon Smirl <jonsmirl@gmail.com> wrote:
> On 8/6/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> >This model has a lot of object duplication. I generated 949,305
> >revisions, but only 754,165 are unique. I'll modify my code to build a
> >hash of the objects it has seen and then not send the duplicates to
> >fast-import. Those 195,140 duplicated objects may be what is tripping
> >index-pack up.
>
> New run is finished with duplicate removal.
>
> Time to run is unchanged, still 2hrs. Run time is IO bound not CPU.
> Pack file is 845MB instead of 934MB.
> git-index-pack works now, it takes 4 CPU minutes to run.
> Index file is 18MB.

I'm attaching a new version of fast-import.c which generates the
index and does duplicate removal. However, I think it might be
slightly faster for you to do the duplicate removal in Python, as
that saves the user-kernel-user copy of the file data. Even so, this
new version should save you those 4 CPU minutes, as the index will be
generated from the in-memory SHA1s rather than needing to recompute
them.

I've changed the calling convention:

- It now takes the pack's base name as its first parameter. It
  appends ".pack" and ".idx" to form the actual file names it's
  writing to.

- It expects an estimated object count as its second parameter.
  In your case this would be something around 760000. This tells
  it how large an object table to allocate, with each entry
  being 24 bytes + 1 pointer (28 or 32 bytes). Overshooting this
  number (that is, creating more objects than estimated) will
  cause it to degrade by allocating one overflow entry at a time
  from malloc.

So the new version should take about 20 MB of memory and should
produce a valid pack and index in the same time as it takes to
produce only the pack now. Plus it won't generate duplicates.

> So it looks like the first stage code is working. Next I need to
> modify cvs2svn to keep track of the sha-1 through it's sorting process
> instead of file:revision.
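
To make the Python-side duplicate removal and the object table sizing
above a bit more concrete, here is a rough sketch. It is not what
cvs2svn or the attached fast-import.c actually do. The names
DedupingSender, send and object_table_bytes are hypothetical, and the
only things taken as given are git's blob naming (SHA-1 over a
"blob <size>\0" header plus the content) and the 24 bytes + 1 pointer
per entry figure quoted above.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the object name git gives a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class DedupingSender:
    """Forward each distinct revision once and skip repeats.

    `send` is a hypothetical callback standing in for whatever writes
    the revision data to fast-import.
    """

    def __init__(self, send):
        self._send = send
        self._seen = set()      # SHA-1s already forwarded
        self.duplicates = 0     # revisions skipped as repeats

    def add(self, data: bytes) -> str:
        sha1 = git_blob_sha1(data)
        if sha1 in self._seen:
            self.duplicates += 1
        else:
            self._seen.add(sha1)
            self._send(data)
        # Always return the SHA-1 so the caller can track it in place
        # of file:revision.
        return sha1


def object_table_bytes(estimated_objects: int, pointer_size: int = 4) -> int:
    """Estimate table size: 24 bytes plus one pointer per entry."""
    return estimated_objects * (24 + pointer_size)
```

With 4-byte pointers, object_table_bytes(760000) comes to 21,280,000
bytes, which is in line with the "about 20 MB" figure. With 8-byte
pointers it is 24,320,000 bytes.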

When you get down to tree writing and commit writing we might want
to do something similar with the trees and commits. I can modify
fast-import to also store those off into a pack.

-- 
Shawn.