linux@horizon.com writes:

> I was working on the same thing, but hindered by lack of access to PPC
> hardware.  I notice that you also took advantage of the unaligned load
> support and native byte order to do the hash straight from the source.

Yes. :)  In previous experiments (in the context of trying different
ways to do memcpy) I found that doing unaligned word loads is faster
than doing aligned loads plus extra rotate and mask instructions to
get the bytes you want together.

> But I came up with a few additional refinements:
>
> - You are using three temporaries (%r0, %r6, and RT(x)) for your
>   round functions.  You only need one temporary (%r0) for all the functions.
>   (Plus %r15 for k)

The reason I used more than one temporary is that I was trying to put
dependent instructions as far apart as reasonably possible, to
minimize the chances of pipeline stalls.  Given that the 970 does
register renaming and out-of-order execution, I don't know how
essential that is, but it can't hurt.

> All are three logical instructions on PPC.  The second form
> lets you add it into the accumulator e in two pieces:

A sequence of adds into a single register is going to incur the
2-cycle latency between generation and use of a value; i.e. the adds
will only issue on every second cycle.  I think we are better off
making the dataflow more like a tree than a linear chain where
possible.

> And the last function, majority(x,y,z), can be written as:
> f3(x,y,z) = (x & y) | (y & z) | (z & x)
>           = (x & y) | z & (x | y)
>           = (x & y) | z & (x ^ y)
>           = (x & y) + z & (x ^ y)

That's cute, I hadn't thought of that.

> - You don't need to decrement %r1 before saving registers.
>   The PPC calling convention defines a "red zone" below the
>   current stack pointer that is guaranteed never to be touched
>   by signal handlers or the like.  This is specifically for
>   leaf procedure optimization, and is at least 224 bytes.

Not in the ppc32 ELF ABI - you are not supposed to touch memory below
the stack pointer.  The kernel is more forgiving than that, and in
fact you can currently use the red zone without anything bad
happening, but you really shouldn't.

> - Is that many stw/lwz instructions faster than stmw/lmw?
>   The latter is at least more cache-friendly.

I believe the stw/lwz and the stmw/lmw will actually execute at the
same speed on the 970, but I have seen lwz/stw go faster than lmw/stmw
on other machines.  In any case we aren't executing the prolog and
epilog as often as the instructions in the main loop, hopefully.

> - You can avoid saving and restoring %r15 by recycling %r5 for that
>   purpose; it's not used after the mtctr %r5.

True.

> - The above changes actually save enough registers to cache the whole hash[5]
>   in registers as well, eliminating *all* unnecessary load/store traffic.

That's cool.

> With all of the above changes, your sha1ppc.S file turns into:

I added a stwu and an addi to make a stack frame, and changed %r15 to
%r5 as you mentioned in another message.  I tried it in a little test
program I have that calls SHA1_Update 256,000 times with a buffer of
4096 zero bytes, i.e. it processes 1000MB.  Your version seems to be
about 2% faster; it took 4.53 seconds compared to 4.62 for mine.  But
it also gives the wrong answer; I haven't investigated why.

Thanks,
Paul.

Received on Sun Apr 24 14:41:13 2005
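
For anyone who wants to convince themselves of the majority identity quoted above, here is a small C sketch (not taken from sha1ppc.S; the function names are made up for illustration). It relies on the observation that (x & y) and (z & (x ^ y)) can never have the same bit set, so OR and ADD give the same result, which is what would let the value be added into e in two pieces. Note that the parentheses around the second term are needed in C, where + binds tighter than &. Whether two dependent adds are actually a win on a given CPU is a separate scheduling question, as discussed in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Textbook majority: a result bit is set when at least two inputs have it set. */
uint32_t maj_plain(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (y & z) | (z & x);
}

/* Rewritten form: (x & y) needs x == y in that bit, (x ^ y) needs x != y,
 * so the two terms are bit-disjoint and + behaves exactly like |. */
uint32_t maj_add(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) + (z & (x ^ y));
}

/* Hypothetical round fragment: e += maj(x, y, z), done as two separate adds. */
uint32_t add_maj_two_pieces(uint32_t e, uint32_t x, uint32_t y, uint32_t z)
{
    e += x & y;
    e += z & (x ^ y);
    return e;
}

/* These three constants put every one of the 8 possible (x, y, z) bit
 * combinations into each byte, so one comparison covers the truth table. */
int check_maj_forms(void)
{
    const uint32_t x = 0xF0F0F0F0u, y = 0xCCCCCCCCu, z = 0xAAAAAAAAu;
    const uint32_t e = 0xFFFFFFFFu;   /* forces wraparound in the adds */

    assert(maj_plain(x, y, z) == 0xE8E8E8E8u);
    assert(maj_add(x, y, z) == maj_plain(x, y, z));
    assert(add_maj_two_pieces(e, x, y, z) == e + maj_plain(x, y, z));
    return 1;
}
```

Calling check_maj_forms() from a test program returns 1 if all three forms agree; the arithmetic is modulo 2^32, so the two-piece version matches even when e wraps.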