Morten Welinder <mwelinder@gmail.com> writes: > If I understand this right, that means that for a log file (in this > case a ChangeLog file) that is appended to linearly as a > function of revision number, we have... > > cvs: O(n) archive size > git: O(n*n) archive size > > At least that is what we get if revision N is always deltad over > revision N-1. A good deal could be saved if instead of dumping > a full copy every 10 revisions, that revision would instead be > deltad off an earlier revision, but I think it'll still be O(n*n). I have not counted O()rders, but it is not as simple as that, because we do not really compare "versions". If version N reverts a change version N-1 introduced since version N-2, we would not even store a copy for version N and version N-2 separately. We just store a single copy, which may be delta information against version N-1 (or the other way around and N-1 might be delta against N). For the sake of math, let's say this project keeps only one file, append only ChangeLog, with a straight line of development without branches ("single strand of pearls"), and has revisions 1..N. In RCS, you would have a full copy of the revision N, and revision J is recorded as delta from revision J+1 for 1 <= J < N. This delta is similar to "ed" script, and going backwards in the history for the ChangeLog example means only line deletion is involved, so what was removed is not recorded. It records how many lines are removed from where. This is _very_ efficient and compact. In git, we would have a full copy of version N (because we favor keeping larger blob associated with newer commits as a full copy), and essentially the same thing as RCS happens. The only difference is that our "delta" is binary delta, but in this case, it just records "copy N bytes from here to here" which results in about the same amount of information to represent each delta. As you say, if (10 < N), we would have a full copy every once in a while. You could use depth other than the default to make this chaining longer and if you did so, your repository would be *very* compactly compressed. However, retrieving cost of version 1 is quite different. RCS format is O(n) -- you start from the tip, extract and interpret (N-1) deltas and apply them in turn to get to what you want. The cost of extracting an arbitrary version is bounded in git packfile, because you need to do such an "extract, interpret and apply" at most $depth cycles. This is primarily because we do not store "versions" but individual objects, and do not apply "newer revisions are far more likely to be accessed often" heuristics, which RCS format is designed for. - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.htmlReceived on Mon Jan 30 07:15:45 2006
This archive was generated by hypermail 2.1.8 : 2006-01-30 07:15:57 EST