Re: [Census] So who uses git?

From: Junio C Hamano <junkio@cox.net>
Date: 2006-01-30 07:15:08
Morten Welinder <mwelinder@gmail.com> writes:

> If I understand this right, that means that for a log file (in this
> case a ChangeLog file) that is appended to linearly as a
> function of revision number, we have...
>
> cvs: O(n) archive size
> git: O(n*n) archive size
>
> At least that is what we get if revision N is always deltad over
> revision N-1.  A good deal could be saved if instead of dumping
> a full copy every 10 revisions, that revision would instead be
> deltad off an earlier revision, but I think it'll still be O(n*n).

I have not counted O()rders, but it is not as simple as that,
because we do not really compare "versions".  If version N
reverts a change version N-1 introduced since version N-2, we
would not even store a copy for version N and version N-2
separately.  We just store a single copy, which may be delta
information against version N-1 (or the other way around and N-1
might be delta against N).

For the sake of math, let's say this project keeps only one
file, append only ChangeLog, with a straight line of development
without branches ("single strand of pearls"), and has revisions
1..N.

In RCS, you would have a full copy of the revision N, and
revision J is recorded as delta from revision J+1 for 1 <= J < N.
This delta is similar to "ed" script, and going backwards in the
history for the ChangeLog example means only line deletion is
involved, so what was removed is not recorded.  It records how
many lines are removed from where.  This is _very_ efficient and
compact.

In git, we would have a full copy of version N (because we favor
keeping larger blob associated with newer commits as a full
copy), and essentially the same thing as RCS happens.  The only
difference is that our "delta" is binary delta, but in this
case, it just records "copy N bytes from here to here" which
results in about the same amount of information to represent
each delta.  As you say, if (10 < N), we would have a full copy
every once in a while.  You could use depth other than the
default to make this chaining longer and if you did so, your
repository would be *very* compactly compressed.

However, retrieving cost of version 1 is quite different.  RCS
format is O(n) -- you start from the tip, extract and interpret
(N-1) deltas and apply them in turn to get to what you want.

The cost of extracting an arbitrary version is bounded in git
packfile, because you need to do such an "extract, interpret and
apply" at most $depth cycles.  This is primarily because we do
not store "versions" but individual objects, and do not apply
"newer revisions are far more likely to be accessed often"
heuristics, which RCS format is designed for.




-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Received on Mon Jan 30 07:15:45 2006

This archive was generated by hypermail 2.1.8 : 2006-01-30 07:15:57 EST