From: Rutger Nijlunsing <rutger@nospam.com> To: Linus Torvalds <torvalds@osdl.org> Cc: Junio C Hamano <junkio@cox.net>, git@vger.kernel.org Bcc: Subject: Re: Fix up diffcore-rename scoring Reply-To: git@wingding.demon.nl In-Reply-To: <Pine.LNX.4.64.0603130727350.3618@g5.osdl.org> Organization: M38c On Mon, Mar 13, 2006 at 07:38:53AM -0800, Linus Torvalds wrote: > > > On Mon, 13 Mar 2006, Junio C Hamano wrote: > > > > By the way, the reason the diffcore-delta code in "next" does > > not do every-eight-bytes hash on the source material is to > > somewhat alleviate the problem that comes from not detecting > > copying of consecutive byte ranges. > > Yes. However, there are better ways to do that in practice. > > The most effective way that is generally used is to not use a fixed > chunk-size, but use a terminating character, together with a > minimum/maximum chunksize. > > There's a pretty natural terminating character that works well for > sources: '\n'. > > So the natural way to do similarity detection when most of the code is > line-based is to do the hashing on chunks that follow the rule "minimum of > <n> bytes, maximum of <2*n> bytes, try to begin/end at a \n". > > So if you don't see any '\n' at all (or the only such one is less than <n> > bytes into your current window), do the hash over a <2n>-byte chunk (this > takes care of binaries and/or long lines). > > This - for source code - allows you to ignore trivial byte offset things, > because you have a character that is used for synchronization. So you > don't need to do hashing at every byte in both files - you end up doing > the hashing only at line boundaries in practice. And it still _works_ for > binary files, although you effectively need bigger identical chunk-sizes > to find similarities (for text-files, it finds similarities of size <n>, > for binaries the similarities need to effectively be of size 3*n, because > you chunk it up at ~2*n, and only generate the hash at certain offsets in > the source binary). This looks like something I did last year as an experiment in the pre-git times. The idea was to generate a patch-with-renames from two (large) source trees. Algorithm: - determine md5sum for each file (same idea as git's SHA1 sum) if changed since last run - only look at md5sums which do not match - pool files into types, which might depend on extension and/or MIME type. This is an optimisation. - Only compare filepair _within_ one pool. - The filepair order in one pool is determined by filename-similarity. So pair [include/asm-ppc/ioctl.h, include/asm-powerpc/ioctl.h] is inspected before pair [include/asm-ppc/ioctl.h, arch/arm/plat-omap/clock.h] . - For each file, create a hash from String line -> Integer occuranced . Similarities are calculated by comparing two hashes. - Keep as a rename-match all files which: - have at most 50% new lines; - have at most 25% lines deleted from them. I ran the code against v2.6.12 and v2.6.14 to be able to compare it with the current contenders. Hopefully some ideas are harvestable... Algorithm differences: - '\n' is used as boundary, independant on line length. This is bad for binary files, and maybe even bad for text files. So don't harvest :) - don't look at the intersection percentage, but look at two values: - percentage of lines added (default: max. 50%) - percentage of lines removed (default: max. 25%) This assumes files get bigger during development (at most 50%), and not too much code is deleted (at most 25%). Disadvantages: - Two magic numbers instead of one. - It's non-symmetrical. Diff A->B will find different renames from diff B->A. This scares me, actually. - to speed up the detection: - don't start comparing files at random. Start comparing files which have the same 'names' in it. So when v2.6.12 has a files called arch/arm/mach-omap/clock.c, start comparing with files which have most words the same. Currently, '-', '.', '_' and '/' are used as word separators. Advantage: don't match on the first match just above the match-threshold. (next heuristics are all optional:) - only compare files with the same extension. This splits up all files into groups, which makes it much faster. In general, there's no reason to compare a .h with a .c file. - only compare files with the same MIME type. Same as above, but also works for files without extensions (so don't compare README with Makefile) Ok, the result: $ shpatch.rb -d linux-2.6.12,linux-2.6.14 | wc -l 104 <-- That's bad. We're missing some renames here. $ shpatch.rb -d linux-2.6.12,linux-2.6.14 | sort -k 1.10 + 0% -23% arch/arm/configs/omnimeter_defconfig -> arch/arm/configs/collie_defconfig + 5% - 9% arch/arm/mach-omap/board-generic.c -> arch/arm/mach-omap1/board-generic.c + 0% - 8% arch/arm/mach-omap/board-h2.c -> arch/arm/mach-omap1/board-h2.c + 0% - 5% arch/arm/mach-omap/board-h3.c -> arch/arm/mach-omap1/board-h3.c + 0% - 3% arch/arm/mach-omap/board-innovator.c -> arch/arm/mach-omap1/board-innovator.c + 0% - 9% arch/arm/mach-omap/board-netstar.c -> arch/arm/mach-omap1/board-netstar.c + 9% -10% arch/arm/mach-omap/board-osk.c -> arch/arm/mach-omap1/board-osk.c + 0% - 6% arch/arm/mach-omap/board-perseus2.c -> arch/arm/mach-omap1/board-perseus2.c + 3% - 8% arch/arm/mach-omap/board-voiceblue.c -> arch/arm/mach-omap1/board-voiceblue.c + 7% - 4% arch/arm/mach-omap/clock.c -> arch/arm/plat-omap/clock.c + 0% - 0% arch/arm/mach-omap/clock.h -> arch/arm/plat-omap/clock.h + 0% - 5% arch/arm/mach-omap/common.h -> include/asm-arm/arch-omap/common.h + 2% - 1% arch/arm/mach-omap/dma.c -> arch/arm/plat-omap/dma.c + 0% - 1% arch/arm/mach-omap/fpga.c -> arch/arm/mach-omap1/fpga.c +11% -11% arch/arm/mach-omap/gpio.c -> arch/arm/plat-omap/gpio.c + 2% - 2% arch/arm/mach-omap/irq.c -> arch/arm/mach-omap1/irq.c + 0% - 4% arch/arm/mach-omap/leds.c -> arch/arm/mach-omap1/leds.c + 0% - 0% arch/arm/mach-omap/leds-h2p2-debug.c -> arch/arm/mach-omap1/leds-h2p2-debug.c + 0% - 0% arch/arm/mach-omap/leds-innovator.c -> arch/arm/mach-omap1/leds-innovator.c + 0% - 4% arch/arm/mach-omap/leds-osk.c -> arch/arm/mach-omap1/leds-osk.c + 0% -25% arch/arm/mach-omap/Makefile.boot -> arch/arm/mach-omap1/Makefile.boot + 1% - 2% arch/arm/mach-omap/mcbsp.c -> arch/arm/plat-omap/mcbsp.c + 0% - 6% arch/arm/mach-omap/mux.c -> arch/arm/plat-omap/mux.c + 0% - 0% arch/arm/mach-omap/ocpi.c -> arch/arm/plat-omap/ocpi.c + 1% -18% arch/arm/mach-omap/pm.c -> arch/arm/plat-omap/pm.c + 0% -11% arch/arm/mach-omap/sleep.S -> arch/arm/plat-omap/sleep.S + 6% - 4% arch/arm/mach-omap/time.c -> arch/arm/mach-omap1/time.c + 0% - 1% arch/arm/mach-omap/usb.c -> arch/arm/plat-omap/usb.c + 2% - 1% arch/ia64/sn/include/pci/pcibr_provider.h -> include/asm-ia64/sn/pcibr_provider.h + 0% - 2% arch/ia64/sn/include/pci/pic.h -> include/asm-ia64/sn/pic.h + 0% - 0% arch/ia64/sn/include/pci/tiocp.h -> include/asm-ia64/sn/tiocp.h + 3% -23% arch/m68knommu/platform/68VZ328/de2/config.c -> arch/m68knommu/platform/68VZ328/config.c + 1% -18% arch/mips/configs/osprey_defconfig -> arch/mips/configs/qemu_defconfig + 0% -12% arch/mips/vr41xx/zao-capcella/setup.c -> arch/mips/vr41xx/common/type.c + 0% - 0% arch/ppc64/oprofile/op_impl.h -> include/asm-ppc64/oprofile_impl.h + 3% -23% arch/ppc/configs/ash_defconfig -> arch/ppc64/configs/bpa_defconfig + 2% -21% arch/ppc/configs/beech_defconfig -> arch/ppc/configs/ev64360_defconfig + 5% -20% arch/ppc/configs/cedar_defconfig -> arch/ppc/configs/mpc8548_cds_defconfig + 9% -17% arch/ppc/configs/k2_defconfig -> arch/ppc/configs/bamboo_defconfig + 3% -25% arch/ppc/configs/mcpn765_defconfig -> arch/xtensa/configs/common_defconfig + 2% -23% arch/ppc/configs/oak_defconfig -> arch/frv/defconfig + 3% -16% arch/ppc/configs/SM850_defconfig -> arch/ppc/configs/mpc86x_ads_defconfig + 3% -13% arch/ppc/configs/SPD823TS_defconfig -> arch/ppc/configs/mpc885ads_defconfig +19% -15% arch/um/kernel/tempfile.c -> arch/um/os-Linux/mem.c + 0% - 5% arch/x86_64/kernel/semaphore.c -> lib/semaphore-sleepers.c + 0% - 6% drivers/i2c/chips/adm1021.c -> drivers/hwmon/adm1021.c + 0% - 4% drivers/i2c/chips/adm1025.c -> drivers/hwmon/adm1025.c + 0% -17% drivers/i2c/chips/adm1026.c -> drivers/hwmon/adm1026.c + 0% - 3% drivers/i2c/chips/adm1031.c -> drivers/hwmon/adm1031.c + 0% - 4% drivers/i2c/chips/asb100.c -> drivers/hwmon/asb100.c + 1% - 4% drivers/i2c/chips/ds1621.c -> drivers/hwmon/ds1621.c + 0% - 1% drivers/i2c/chips/fscher.c -> drivers/hwmon/fscher.c + 0% - 2% drivers/i2c/chips/fscpos.c -> drivers/hwmon/fscpos.c + 0% - 2% drivers/i2c/chips/gl518sm.c -> drivers/hwmon/gl518sm.c + 0% - 2% drivers/i2c/chips/gl520sm.c -> drivers/hwmon/gl520sm.c + 3% -19% drivers/i2c/chips/it87.c -> drivers/hwmon/it87.c + 4% -22% drivers/i2c/chips/lm63.c -> drivers/hwmon/lm63.c + 0% - 6% drivers/i2c/chips/lm75.c -> drivers/hwmon/lm75.c + 0% - 2% drivers/i2c/chips/lm75.h -> drivers/hwmon/lm75.h + 0% - 3% drivers/i2c/chips/lm77.c -> drivers/hwmon/lm77.c + 2% - 5% drivers/i2c/chips/lm78.c -> drivers/hwmon/lm78.c + 0% - 3% drivers/i2c/chips/lm80.c -> drivers/hwmon/lm80.c + 2% -21% drivers/i2c/chips/lm83.c -> drivers/hwmon/lm83.c + 0% - 3% drivers/i2c/chips/lm85.c -> drivers/hwmon/lm85.c + 0% - 4% drivers/i2c/chips/lm87.c -> drivers/hwmon/lm87.c + 4% -20% drivers/i2c/chips/lm90.c -> drivers/hwmon/lm90.c + 0% - 3% drivers/i2c/chips/lm92.c -> drivers/hwmon/lm92.c + 0% - 3% drivers/i2c/chips/max1619.c -> drivers/hwmon/max1619.c + 0% - 7% drivers/i2c/chips/sis5595.c -> drivers/hwmon/sis5595.c + 0% -11% drivers/i2c/chips/smsc47b397.c -> drivers/hwmon/smsc47b397.c + 0% - 9% drivers/i2c/chips/smsc47m1.c -> drivers/hwmon/smsc47m1.c + 0% -23% drivers/i2c/chips/via686a.c -> drivers/hwmon/via686a.c + 0% - 4% drivers/i2c/chips/w83627hf.c -> drivers/hwmon/w83627hf.c + 1% - 5% drivers/i2c/chips/w83781d.c -> drivers/hwmon/w83781d.c + 1% - 3% drivers/i2c/chips/w83l785ts.c -> drivers/hwmon/w83l785ts.c +14% -17% drivers/i2c/i2c-sensor-vid.c -> drivers/hwmon/hwmon-vid.c + 0% - 0% drivers/infiniband/include/ib_cache.h -> include/rdma/ib_cache.h + 0% - 3% drivers/infiniband/include/ib_fmr_pool.h -> include/rdma/ib_fmr_pool.h + 9% - 7% drivers/infiniband/include/ib_mad.h -> include/rdma/ib_mad.h + 0% - 0% drivers/infiniband/include/ib_pack.h -> include/rdma/ib_pack.h + 1% - 6% drivers/infiniband/include/ib_sa.h -> include/rdma/ib_sa.h + 0% -11% drivers/infiniband/include/ib_smi.h -> include/rdma/ib_smi.h + 3% - 6% drivers/infiniband/include/ib_user_mad.h -> include/rdma/ib_user_mad.h + 4% - 2% drivers/infiniband/include/ib_verbs.h -> include/rdma/ib_verbs.h + 0% -16% include/asm-ppc64/ioctl.h -> include/asm-powerpc/ioctl.h + 0% - 9% include/asm-ppc64/ioctls.h -> include/asm-powerpc/ioctls.h + 5% - 9% include/asm-ppc64/mc146818rtc.h -> include/asm-powerpc/mc146818rtc.h + 0% - 5% include/asm-ppc64/mman.h -> include/asm-powerpc/mman.h + 2% -25% include/asm-ppc64/sembuf.h -> include/asm-powerpc/sembuf.h + 3% -13% include/asm-ppc64/shmbuf.h -> include/asm-powerpc/shmbuf.h + 0% -15% include/asm-ppc64/sockios.h -> include/asm-powerpc/sockios.h + 1% - 5% include/asm-ppc64/topology.h -> include/asm-powerpc/topology.h + 0% -15% include/asm-ppc64/user.h -> include/asm-powerpc/user.h + 0% -21% include/asm-ppc/agp.h -> include/asm-powerpc/agp.h +12% -16% include/asm-ppc/msgbuf.h -> include/asm-xtensa/msgbuf.h + 5% -25% include/asm-ppc/namei.h -> include/asm-powerpc/namei.h + 4% -18% include/asm-ppc/param.h -> include/asm-powerpc/param.h + 0% -13% include/asm-ppc/poll.h -> include/asm-powerpc/poll.h + 0% -24% include/asm-ppc/shmbuf.h -> include/asm-xtensa/shmbuf.h + 1% -17% include/asm-ppc/socket.h -> include/asm-powerpc/socket.h + 0% - 9% include/asm-ppc/string.h -> include/asm-powerpc/string.h + 1% -10% include/asm-ppc/termbits.h -> include/asm-powerpc/termbits.h + 0% - 3% include/asm-ppc/termios.h -> include/asm-powerpc/termios.h + 5% -22% include/asm-ppc/unaligned.h -> include/asm-powerpc/unaligned.h Regards, Rutger. -- Rutger Nijlunsing ---------------------------------- eludias ed dse.nl never attribute to a conspiracy which can be explained by incompetence ---------------------------------------------------------------------- - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
This archive was generated by hypermail 2.1.8 : 2006-03-14 11:50:30 EST