Re: fast-import and unique objects.

From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-08 00:37:30
On 8/7/06, Shawn Pearce <spearce@spearce.org> wrote:
> > I'm staring at the cvs2svn code now trying to figure out how to modify
> > it without rewriting everything. I may just leave it all alone and
> > build a table with cvs_file:rev to sha-1 mappings. It would be much
> > more efficient to carry sha-1 throughout the stages but that may
> > require significant rework.
>
> Does it matter?  How long does the cvs2svn processing take,
> excluding the GIT blob processing that's now known to take 2 hours?
> What's your target for an acceptable conversion time on the system
> you are working on?

As is, it takes the code about a week to import MozCVS into
Subversion. But I've already addressed the core of why that was taking
so long. The original code forks off a copy of cvs for each revision
to extract the text. Doing that 1M times takes about two days. The
version with fast-import takes two hours.

At the end of the process cvs2svn forks off svn 250K times to import
the change sets. That takes about four days to finish. Doing a
fast-import backend should fix that.

> Any thoughts yet on how you might want to feed trees and commits
> to a fast pack writer?  I was thinking about doing a stream into
> fast-import such as:

The data I have produces output that indicates add/change/delete
for each file name. Each add/change should carry the sha-1 of the
new revision. cvs/svn have no concept of trees.

How about sending out a stream of add/change/delete operations
interspersed with commits? That would let fast-import track the tree
and only generate tree nodes when they change.
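
To make the idea concrete, here is a rough sketch in Python of the
bookkeeping the importer would have to do. It is only an illustration
under assumptions: the TreeTracker class and its method names are made
up, every file is treated as a plain blob (mode 100644), and empty
directories are not pruned after a delete. The point is that a commit
only regenerates the directories touched since the previous commit.

```python
import hashlib


class TreeTracker:
    """Hypothetical helper that tracks the current tree between commits."""

    def __init__(self):
        self.root = {}     # name -> 40-char hex sha-1 (file) or dict (directory)
        self.cache = {}    # directory path -> tree sha-1 from the last commit
        self.dirty = {""}  # directory paths whose tree must be regenerated

    def _mark_dirty(self, parts):
        # A file change dirties its directory and every ancestor up to the root.
        for i in range(len(parts)):
            self.dirty.add("/".join(parts[:i]))

    def change(self, path, sha1):
        """Apply an add or change operation for one file."""
        parts = path.split("/")
        node = self.root
        for name in parts[:-1]:
            node = node.setdefault(name, {})
        node[parts[-1]] = sha1
        self._mark_dirty(parts)

    def delete(self, path):
        """Apply a delete operation for one file."""
        parts = path.split("/")
        node = self.root
        for name in parts[:-1]:
            node = node[name]
        del node[parts[-1]]
        self._mark_dirty(parts)

    def _write_tree(self, node, path):
        # Reuse the cached id when nothing below this directory changed.
        if path not in self.dirty and path in self.cache:
            return self.cache[path], 0
        written = 1
        body = b""
        # Git sorts tree entries as if directory names ended in "/".
        names = sorted(
            node, key=lambda n: n + "/" if isinstance(node[n], dict) else n
        )
        for name in names:
            child = node[name]
            if isinstance(child, dict):
                child_path = f"{path}/{name}" if path else name
                sha, count = self._write_tree(child, child_path)
                written += count
                mode = b"40000"
            else:
                sha, mode = child, b"100644"
            body += mode + b" " + name.encode() + b"\0" + bytes.fromhex(sha)
        sha = hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()
        self.cache[path] = sha
        return sha, written

    def commit(self):
        """Return (root tree sha-1, number of tree objects regenerated)."""
        sha, written = self._write_tree(self.root, "")
        self.dirty.clear()
        return sha, written
```

Feeding it three adds under src/module/ and the root and then calling
commit() regenerates three trees (root, src, src/module). A following
change to a file in the root and another commit() regenerates only the
root tree.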

The protocol may need some thought. I need to be able to handle
branches and labels too.


>         <4 byte length of commit><commit><treeent>*<null>
>
> where <commit> is the raw commit minus the first "tree nnn\n" line, and
> <treeent> is:
>
>         <type><sp><sha1><sp><path><null>
>
> where <type> is one of 'B' (normal blob), 'L' (symlink), 'X'
> (executable blob), <sha1> is the 40 byte hex, <path> is the file from
> the root of the repository ("src/module/foo.c"), and <sp> and <null>
> are the obvious values.  You would feed all tree entries and the pack
> writer would split the stream up into the individual tree objects.
>
> fast-import would generate the tree(s) delta'ing them against the
> prior tree of the same path, prefix "tree nnn\n" to the commit
> blob you supplied, generate the commit, and print out its ID.
> By working from the first commit up to the most recent each tree
> deltas would be using the older tree as the base which may not be
> ideal if a large number of items get added to a tree but should be
> effective enough to generate a reasonably sized initial pack.
>
> It would however mean you need to monitor the output pipe from
> fast-import to get back the commit id so you can use it to prep
> the next commit's parent(s) as you can't produce that in Python.
>
> --
> Shawn.
>
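
For comparison, the framing quoted above could be produced from Python
roughly as follows. This is a sketch, not a settled format: the
function names are hypothetical, the byte order of the 4 byte length is
not specified in the proposal so big-endian is assumed here, and paths
are assumed to be encoded as UTF-8.

```python
import re
import struct

VALID_TYPES = {"B", "L", "X"}  # normal blob, symlink, executable blob
HEX_SHA1 = re.compile(r"[0-9a-f]{40}")


def encode_treeent(kind, sha1, path):
    """Encode one <type><sp><sha1><sp><path><null> entry."""
    if kind not in VALID_TYPES:
        raise ValueError(f"unknown entry type: {kind!r}")
    if not HEX_SHA1.fullmatch(sha1):
        raise ValueError(f"not a 40 character hex sha-1: {sha1!r}")
    # Assumption: paths are sent as UTF-8 bytes.
    return f"{kind} {sha1} {path}".encode("utf-8") + b"\0"


def encode_commit(commit, entries):
    """Encode <4 byte length of commit><commit><treeent>*<null>.

    `commit` is the raw commit as bytes, without the leading
    "tree nnn\\n" line. `entries` is a list of (type, sha1, path) tuples.
    """
    # Assumption: the length is an unsigned big-endian 32-bit integer.
    record = struct.pack(">I", len(commit)) + commit
    for kind, sha1, path in entries:
        record += encode_treeent(kind, sha1, path)
    # A lone <null> terminates the list of tree entries.
    return record + b"\0"
```

The record would be written to fast-import's stdin. Reading the commit
id back from its output pipe is not shown, since the reply format has
not been defined yet.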


-- 
Jon Smirl
jonsmirl@gmail.com
Received on Tue Aug 08 00:38:14 2006
