Re: Fix up diffcore-rename scoring

From: Rutger Nijlunsing <rutger@nospam.com>
Date: 2006-03-14 11:49:39
From: Rutger Nijlunsing <rutger@nospam.com>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Junio C Hamano <junkio@cox.net>, git@vger.kernel.org
Bcc: 
Subject: Re: Fix up diffcore-rename scoring
Reply-To: git@wingding.demon.nl
In-Reply-To: <Pine.LNX.4.64.0603130727350.3618@g5.osdl.org>
Organization: M38c

On Mon, Mar 13, 2006 at 07:38:53AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 13 Mar 2006, Junio C Hamano wrote:
> > 
> > By the way, the reason the diffcore-delta code in "next" does
> > not do every-eight-bytes hash on the source material is to
> > somewhat alleviate the problem that comes from not detecting
> > copying of consecutive byte ranges.
> 
> Yes. However, there are better ways to do that in practice.
> 
> The most effective way that is generally used is to not use a fixed 
> chunk-size, but use a terminating character, together with a 
> minimum/maximum chunksize.
> 
> There's a pretty natural terminating character that works well for 
> sources: '\n'.
> 
> So the natural way to do similarity detection when most of the code is 
> line-based is to do the hashing on chunks that follow the rule "minimum of 
> <n> bytes, maximum of <2*n> bytes, try to begin/end at a \n".
> 
> So if you don't see any '\n' at all (or the only such one is less than <n> 
> bytes into your current window), do the hash over a <2n>-byte chunk (this 
> takes care of binaries and/or long lines).
> 
> This - for source code - allows you to ignore trivial byte offset things, 
> because you have a character that is used for synchronization. So you 
> don't need to do hashing at every byte in both files - you end up doing 
> the hashing only at line boundaries in practice. And it still _works_ for 
> binary files, although you effectively need bigger identical chunk-sizes 
> to find similarities (for text-files, it finds similarities of size <n>, 
> for binaries the similarities need to effectively be of size 3*n, because 
> you chunk it up at ~2*n, and only generate the hash at certain offsets in 
> the source binary).

This looks like something I did last year as an experiment in the
pre-git times. The idea was to generate a patch-with-renames from two
(large) source trees.

Algorithm:
  - determine md5sum for each file (same idea as git's SHA1 sum)
    if changed since last run
  - only look at md5sums which do not match
  - pool files into types, which might depend on extension and/or MIME type.
    This is an optimisation.
  - Only compare filepair _within_ one pool.
  - The filepair order in one pool is determined by filename-similarity.
    So pair [include/asm-ppc/ioctl.h, include/asm-powerpc/ioctl.h]
    is inspected before pair
       [include/asm-ppc/ioctl.h, arch/arm/plat-omap/clock.h] .
  - For each file, create a hash from String line -> Integer occuranced .
    Similarities are calculated by comparing two hashes.
  - Keep as a rename-match all files which:
    - have at most 50% new lines;
    - have at most 25% lines deleted from them.

I ran the code against v2.6.12 and v2.6.14 to be able to compare it
with the current contenders. Hopefully some ideas are harvestable...

Algorithm differences:
  - '\n' is used as boundary, independant on line length.
    This is bad for binary files, and maybe even bad for text files.
    So don't harvest :)
  - don't look at the intersection percentage, but look at two values:
    - percentage of lines added (default: max. 50%)
    - percentage of lines removed (default: max. 25%)
    This assumes files get bigger during development (at most 50%), and
    not too much code is deleted (at most 25%).
    Disadvantages:
      - Two magic numbers instead of one.
      - It's non-symmetrical. Diff A->B will find different renames from
        diff B->A. This scares me, actually.
  - to speed up the detection:
    - don't start comparing files at random. Start comparing files which
      have the same 'names' in it. So when v2.6.12 has a files called
      arch/arm/mach-omap/clock.c, start comparing with files which have
      most words the same. Currently, '-', '.', '_' and '/' are used
      as word separators.
      Advantage: don't match on the first match just above the
        match-threshold.
    (next heuristics are all optional:)
    - only compare files with the same extension. This splits up all files
      into groups, which makes it much faster.
      In general, there's no reason to compare a .h with a .c file.
    - only compare files with the same MIME type. Same as above, but also
      works for files without extensions (so don't compare README with
      Makefile)

Ok, the result:

$ shpatch.rb -d linux-2.6.12,linux-2.6.14 | wc -l
104   <-- That's bad. We're missing some renames here.

$ shpatch.rb -d linux-2.6.12,linux-2.6.14 | sort -k 1.10

+ 0% -23% arch/arm/configs/omnimeter_defconfig -> arch/arm/configs/collie_defconfig
+ 5% - 9% arch/arm/mach-omap/board-generic.c -> arch/arm/mach-omap1/board-generic.c
+ 0% - 8% arch/arm/mach-omap/board-h2.c -> arch/arm/mach-omap1/board-h2.c
+ 0% - 5% arch/arm/mach-omap/board-h3.c -> arch/arm/mach-omap1/board-h3.c
+ 0% - 3% arch/arm/mach-omap/board-innovator.c -> arch/arm/mach-omap1/board-innovator.c
+ 0% - 9% arch/arm/mach-omap/board-netstar.c -> arch/arm/mach-omap1/board-netstar.c
+ 9% -10% arch/arm/mach-omap/board-osk.c -> arch/arm/mach-omap1/board-osk.c
+ 0% - 6% arch/arm/mach-omap/board-perseus2.c -> arch/arm/mach-omap1/board-perseus2.c
+ 3% - 8% arch/arm/mach-omap/board-voiceblue.c -> arch/arm/mach-omap1/board-voiceblue.c
+ 7% - 4% arch/arm/mach-omap/clock.c -> arch/arm/plat-omap/clock.c
+ 0% - 0% arch/arm/mach-omap/clock.h -> arch/arm/plat-omap/clock.h
+ 0% - 5% arch/arm/mach-omap/common.h -> include/asm-arm/arch-omap/common.h
+ 2% - 1% arch/arm/mach-omap/dma.c -> arch/arm/plat-omap/dma.c
+ 0% - 1% arch/arm/mach-omap/fpga.c -> arch/arm/mach-omap1/fpga.c
+11% -11% arch/arm/mach-omap/gpio.c -> arch/arm/plat-omap/gpio.c
+ 2% - 2% arch/arm/mach-omap/irq.c -> arch/arm/mach-omap1/irq.c
+ 0% - 4% arch/arm/mach-omap/leds.c -> arch/arm/mach-omap1/leds.c
+ 0% - 0% arch/arm/mach-omap/leds-h2p2-debug.c -> arch/arm/mach-omap1/leds-h2p2-debug.c
+ 0% - 0% arch/arm/mach-omap/leds-innovator.c -> arch/arm/mach-omap1/leds-innovator.c
+ 0% - 4% arch/arm/mach-omap/leds-osk.c -> arch/arm/mach-omap1/leds-osk.c
+ 0% -25% arch/arm/mach-omap/Makefile.boot -> arch/arm/mach-omap1/Makefile.boot
+ 1% - 2% arch/arm/mach-omap/mcbsp.c -> arch/arm/plat-omap/mcbsp.c
+ 0% - 6% arch/arm/mach-omap/mux.c -> arch/arm/plat-omap/mux.c
+ 0% - 0% arch/arm/mach-omap/ocpi.c -> arch/arm/plat-omap/ocpi.c
+ 1% -18% arch/arm/mach-omap/pm.c -> arch/arm/plat-omap/pm.c
+ 0% -11% arch/arm/mach-omap/sleep.S -> arch/arm/plat-omap/sleep.S
+ 6% - 4% arch/arm/mach-omap/time.c -> arch/arm/mach-omap1/time.c
+ 0% - 1% arch/arm/mach-omap/usb.c -> arch/arm/plat-omap/usb.c
+ 2% - 1% arch/ia64/sn/include/pci/pcibr_provider.h -> include/asm-ia64/sn/pcibr_provider.h
+ 0% - 2% arch/ia64/sn/include/pci/pic.h -> include/asm-ia64/sn/pic.h
+ 0% - 0% arch/ia64/sn/include/pci/tiocp.h -> include/asm-ia64/sn/tiocp.h
+ 3% -23% arch/m68knommu/platform/68VZ328/de2/config.c -> arch/m68knommu/platform/68VZ328/config.c
+ 1% -18% arch/mips/configs/osprey_defconfig -> arch/mips/configs/qemu_defconfig
+ 0% -12% arch/mips/vr41xx/zao-capcella/setup.c -> arch/mips/vr41xx/common/type.c
+ 0% - 0% arch/ppc64/oprofile/op_impl.h -> include/asm-ppc64/oprofile_impl.h
+ 3% -23% arch/ppc/configs/ash_defconfig -> arch/ppc64/configs/bpa_defconfig
+ 2% -21% arch/ppc/configs/beech_defconfig -> arch/ppc/configs/ev64360_defconfig
+ 5% -20% arch/ppc/configs/cedar_defconfig -> arch/ppc/configs/mpc8548_cds_defconfig
+ 9% -17% arch/ppc/configs/k2_defconfig -> arch/ppc/configs/bamboo_defconfig
+ 3% -25% arch/ppc/configs/mcpn765_defconfig -> arch/xtensa/configs/common_defconfig
+ 2% -23% arch/ppc/configs/oak_defconfig -> arch/frv/defconfig
+ 3% -16% arch/ppc/configs/SM850_defconfig -> arch/ppc/configs/mpc86x_ads_defconfig
+ 3% -13% arch/ppc/configs/SPD823TS_defconfig -> arch/ppc/configs/mpc885ads_defconfig
+19% -15% arch/um/kernel/tempfile.c -> arch/um/os-Linux/mem.c
+ 0% - 5% arch/x86_64/kernel/semaphore.c -> lib/semaphore-sleepers.c
+ 0% - 6% drivers/i2c/chips/adm1021.c -> drivers/hwmon/adm1021.c
+ 0% - 4% drivers/i2c/chips/adm1025.c -> drivers/hwmon/adm1025.c
+ 0% -17% drivers/i2c/chips/adm1026.c -> drivers/hwmon/adm1026.c
+ 0% - 3% drivers/i2c/chips/adm1031.c -> drivers/hwmon/adm1031.c
+ 0% - 4% drivers/i2c/chips/asb100.c -> drivers/hwmon/asb100.c
+ 1% - 4% drivers/i2c/chips/ds1621.c -> drivers/hwmon/ds1621.c
+ 0% - 1% drivers/i2c/chips/fscher.c -> drivers/hwmon/fscher.c
+ 0% - 2% drivers/i2c/chips/fscpos.c -> drivers/hwmon/fscpos.c
+ 0% - 2% drivers/i2c/chips/gl518sm.c -> drivers/hwmon/gl518sm.c
+ 0% - 2% drivers/i2c/chips/gl520sm.c -> drivers/hwmon/gl520sm.c
+ 3% -19% drivers/i2c/chips/it87.c -> drivers/hwmon/it87.c
+ 4% -22% drivers/i2c/chips/lm63.c -> drivers/hwmon/lm63.c
+ 0% - 6% drivers/i2c/chips/lm75.c -> drivers/hwmon/lm75.c
+ 0% - 2% drivers/i2c/chips/lm75.h -> drivers/hwmon/lm75.h
+ 0% - 3% drivers/i2c/chips/lm77.c -> drivers/hwmon/lm77.c
+ 2% - 5% drivers/i2c/chips/lm78.c -> drivers/hwmon/lm78.c
+ 0% - 3% drivers/i2c/chips/lm80.c -> drivers/hwmon/lm80.c
+ 2% -21% drivers/i2c/chips/lm83.c -> drivers/hwmon/lm83.c
+ 0% - 3% drivers/i2c/chips/lm85.c -> drivers/hwmon/lm85.c
+ 0% - 4% drivers/i2c/chips/lm87.c -> drivers/hwmon/lm87.c
+ 4% -20% drivers/i2c/chips/lm90.c -> drivers/hwmon/lm90.c
+ 0% - 3% drivers/i2c/chips/lm92.c -> drivers/hwmon/lm92.c
+ 0% - 3% drivers/i2c/chips/max1619.c -> drivers/hwmon/max1619.c
+ 0% - 7% drivers/i2c/chips/sis5595.c -> drivers/hwmon/sis5595.c
+ 0% -11% drivers/i2c/chips/smsc47b397.c -> drivers/hwmon/smsc47b397.c
+ 0% - 9% drivers/i2c/chips/smsc47m1.c -> drivers/hwmon/smsc47m1.c
+ 0% -23% drivers/i2c/chips/via686a.c -> drivers/hwmon/via686a.c
+ 0% - 4% drivers/i2c/chips/w83627hf.c -> drivers/hwmon/w83627hf.c
+ 1% - 5% drivers/i2c/chips/w83781d.c -> drivers/hwmon/w83781d.c
+ 1% - 3% drivers/i2c/chips/w83l785ts.c -> drivers/hwmon/w83l785ts.c
+14% -17% drivers/i2c/i2c-sensor-vid.c -> drivers/hwmon/hwmon-vid.c
+ 0% - 0% drivers/infiniband/include/ib_cache.h -> include/rdma/ib_cache.h
+ 0% - 3% drivers/infiniband/include/ib_fmr_pool.h -> include/rdma/ib_fmr_pool.h
+ 9% - 7% drivers/infiniband/include/ib_mad.h -> include/rdma/ib_mad.h
+ 0% - 0% drivers/infiniband/include/ib_pack.h -> include/rdma/ib_pack.h
+ 1% - 6% drivers/infiniband/include/ib_sa.h -> include/rdma/ib_sa.h
+ 0% -11% drivers/infiniband/include/ib_smi.h -> include/rdma/ib_smi.h
+ 3% - 6% drivers/infiniband/include/ib_user_mad.h -> include/rdma/ib_user_mad.h
+ 4% - 2% drivers/infiniband/include/ib_verbs.h -> include/rdma/ib_verbs.h
+ 0% -16% include/asm-ppc64/ioctl.h -> include/asm-powerpc/ioctl.h
+ 0% - 9% include/asm-ppc64/ioctls.h -> include/asm-powerpc/ioctls.h
+ 5% - 9% include/asm-ppc64/mc146818rtc.h -> include/asm-powerpc/mc146818rtc.h
+ 0% - 5% include/asm-ppc64/mman.h -> include/asm-powerpc/mman.h
+ 2% -25% include/asm-ppc64/sembuf.h -> include/asm-powerpc/sembuf.h
+ 3% -13% include/asm-ppc64/shmbuf.h -> include/asm-powerpc/shmbuf.h
+ 0% -15% include/asm-ppc64/sockios.h -> include/asm-powerpc/sockios.h
+ 1% - 5% include/asm-ppc64/topology.h -> include/asm-powerpc/topology.h
+ 0% -15% include/asm-ppc64/user.h -> include/asm-powerpc/user.h
+ 0% -21% include/asm-ppc/agp.h -> include/asm-powerpc/agp.h
+12% -16% include/asm-ppc/msgbuf.h -> include/asm-xtensa/msgbuf.h
+ 5% -25% include/asm-ppc/namei.h -> include/asm-powerpc/namei.h
+ 4% -18% include/asm-ppc/param.h -> include/asm-powerpc/param.h
+ 0% -13% include/asm-ppc/poll.h -> include/asm-powerpc/poll.h
+ 0% -24% include/asm-ppc/shmbuf.h -> include/asm-xtensa/shmbuf.h
+ 1% -17% include/asm-ppc/socket.h -> include/asm-powerpc/socket.h
+ 0% - 9% include/asm-ppc/string.h -> include/asm-powerpc/string.h
+ 1% -10% include/asm-ppc/termbits.h -> include/asm-powerpc/termbits.h
+ 0% - 3% include/asm-ppc/termios.h -> include/asm-powerpc/termios.h
+ 5% -22% include/asm-ppc/unaligned.h -> include/asm-powerpc/unaligned.h

Regards,
Rutger.

-- 
Rutger Nijlunsing ---------------------------------- eludias ed dse.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Received on Tue Mar 14 11:50:19 2006

This archive was generated by hypermail 2.1.8 : 2006-03-14 11:50:30 EST