Keith Owens wrote:

>On Mon, 24 Jan 2005 14:44:22 +0100,
>Christian Hildner <christian.hildner@hob.de> wrote:
>
>>Keith Owens wrote:
>>
>>>When jiffies is within 22 bit range of __gp, the linker writes the
>>>sequence as
>>>
>>>  addl r20=offset_of(jiffies,__gp),r1;;
>>>  mov r16=r20;;
>>>  ld8.acq r23=[r16]  // value of jiffies
>>>
>>Is there a restriction to not rewrite to
>>
>>  addl r16=offset_of(jiffies,__gp),r1;;
>>  ld8.acq r23=[r16]  // value of jiffies
>>  nop.i 0
>>
>>because that would save at least one cycle and would make bundling
>>easier (depending on additional instructions, of course).
>
>The code snippet was a simplification of what gcc actually does.  If
>you look at some object code, you will find that the 3 instructions are
>already spread over multiple bundles.  Moving the final ld8 upwards
>cannot save any cycles, you still have to execute the same number of
>bundles.

But it is one instruction group less. And that corresponds to at least (here exactly) one cycle.

>A real example from kernel/sched.o
>
>    4830:  09 50 20 42 00 21  [MMI]  adds r10=8,r33
>                   4832: LTOFF22X  jiffies
>    4836:  20 81 84 00 42 c0         adds r18=16,r33
>    483c:  01 08 00 90               addl r14=0,r1;;
>    4840:  08 00 08 1e d8 19  [MMI]  stf.spill [r15]=f2
>                   4841: LDXMOV    jiffies
>                   4842: LTOFF22X  __per_cpu_offset
>    4846:  b0 00 38 30 20 40         ld8 r11=[r14]
>    484c:  03 08 00 90               addl r26=0,r1
>    4850:  08 a0 00 02 00 24  [MMI]  addl r20=0,r1
>                   4850: LTOFF22X  .data.percpu+0x440
>    4856:  90 00 01 20 40 e0         shladd r9=r32,1,r0
>    485c:  02 00 59 00               sxt4 r23=r32
>    4860:  08 40 00 14 18 10  [MMI]  ld8 r8=[r10]
>    4866:  10 01 48 30 20 e0         ld8 r17=[r18]
>    486c:  04 00 c4 00               mov r39=b0
>    4870:  05 00 00 00 01 40  [MLX]  nop.m 0x0
>    4876:  10 00 00 00 00 60         movl r27=0x10624dd3;;
>    487c:  33 55 6c 62
>    4880:  10 00 00 00 01 00  [MIB]  nop.m 0x0
>    4886:  f0 40 e0 f0 29 00         shl r15=r8,7
>    488c:  00 00 00 20               nop.b 0x0
>    4890:  09 c0 00 34 18 10  [MMI]  ld8 r24=[r26]
>                   4890: LDXMOV    __per_cpu_offset
>    4896:  30 00 2c 70 21 40         ld8.acq r3=[r11]
>
>The LDXMOV relocation is designed to make it simple to convert the
>instruction from ld8 r11=[r14] to mov r11=r14, it is easy to do in
>place.

Ok, simplicity is an argument.

>Moving an entire slot around is a lot messier, for no
>performance gain.

You still have one memory unit wasted on the mov, which is logically a nop. So, depending on the CPU implementation, there is a possible loss of one cycle, especially for memory-intensive code fragments/instruction groups. In the example, the LDXMOV instruction group uses seven memory units. And what if the CPU implements only six of them?

But I see the complexity of changing that. It would require another optimizer step. A linker optimizer?

Christian

Received on Tue Jan 25 02:32:04 2005
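
For reference, the count of seven in the reply above can be reproduced from the bundle templates in the sched.o listing. This is only a minimal sketch: it assumes the instruction group runs from the bundle at 4840 (after the stop at 483c) to the stop after the movl at 4876, and that every M slot in a template counts as one memory unit, including the nop.m. The names `GROUP_TEMPLATES` and `count_m_slots` are made up for this illustration.

```python
# Bundle templates of the instruction group that contains the LDXMOV-marked
# ld8, read off the listing (bundles at 4840, 4850, 4860 and 4870).
# Assumption: the group ends at the stop (;;) after movl in the MLX bundle.
GROUP_TEMPLATES = ["MMI", "MMI", "MMI", "MLX"]


def count_m_slots(templates):
    """Count the slots that the bundle templates assign to M-type units."""
    return sum(template.count("M") for template in templates)


if __name__ == "__main__":
    print("M slots in the group:", count_m_slots(GROUP_TEMPLATES))
```

Whether a seventh M slot actually costs a cycle depends on the CPU implementation, which this count says nothing about.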