This page documents additional information learnt while scheduling and profiling Itanium assembly code - particularly system code - some of which is documented poorly or not at all.
All information is for Itanium 2 unless otherwise stated. The microarchitecture reference guide is prerequisite reading.
If you think something here is incorrect, or you can add something, please edit this page or let me (MatthewChapman) know.
Special register accesses
Here we consider special registers as all non-general, non-FP registers. These can be divided into groups by function and latency (these latencies are documented in the microarchitecture manual):
Group | Unit | Registers | Read Latency
branch | I0 | ip, BRs, ar.pfs, ar.lc, ar.ec | 2 cycles*
predicates | I0 | pr | 2 cycles*
interruption | M2 | cr.iva, cr.iip, cr.iipa, cr.isr, cr.iim | 2 cycles*
cache | M2 (L2 OzQ) | ar.ccv | 11 cycles
MMU | M2 (DCS) | cr.pta, cr.gpta, cr.ifa, cr.itir, cr.iha, RRs, PKRs | 5 cycles
NAT | M2 (DCS) | ar.rnat, ar.unat | 5 cycles
PSR | M2 (DCS) | psr, cr.ipsr, cr.dcr | 12 cycles
RSE | M2 (DCS) | ar.rsc, ar.bspstore, ar.bsp, cr.ifs | 12 cycles
KRs | M2 (DCS) | ar.k0-k7 | 12 cycles
slow | M2 (DCS) | LSAPIC (including ar.itc), PMU, CPUID, MSRs | 36 cycles#
* These reads are allowed the full EXE cycle and are then bypassed from DET to REG (see diagram in Fetzer paper)
# Only one can issue from DCS buffer every 6 cycles
Special register accesses, like memory accesses, are non-blocking and do not occupy the execution unit for more than one cycle. While a read is outstanding, a read or write to the target general register will cause a scoreboard stall, accounted for in BE_EXE_BUBBLE.GRALL.
The groups labelled DCS are accessed via the DCS subsystem. The acronym DCS appears in the microarchitecture manual (as in BE_L1D_FPU_BUBBLE.L1D_DCS) but I haven't been able to find a definition; my guess is that it is something like Data Communication Subsystem. From a programming point of view the important part is what I will refer to as the DCS buffer: a 7-entry FIFO which queues DCS requests waiting to be serviced. Filling up this FIFO will result in a stall.
To determine whether a stall will occur, note that the lifetime of a request in the DCS buffer is 2 cycles less than the read latencies given above for reads, and 5 cycles (?) less for writes. For example, a KR read effectively occupies an entry in the DCS buffer for 10 cycles. On the 7th request there is a one-cycle stall (BE_L1D_FPU_BUBBLE.L1D_DCS); this may be to prevent a second unqueuable request entering a critical part of the pipeline. On the 8th request the pipeline stalls until two entries have drained (BE_L1D_FPU_BUBBLE.L1D_DCS, with BE_L1D_FPU_BUBBLE.L1D_DCURECIR every 2nd cycle). There is a "tricky" case when the 7th and 8th requests are exactly two cycles apart, in which the pipelining produces results that I do not quite understand, but knowledge of this is not necessary for avoiding DCS stalls.
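
The occupancy rule above can be checked mechanically for a planned instruction sequence. The following is only a rough sketch of that bookkeeping, not a model of the real hardware: it assumes a read occupies an entry from its issue cycle for (read latency - 2) cycles, ignores writes (whose lifetime is uncertain) and ignores the stalls themselves. The names `dcs_occupancy` and `first_stalling_request` are made up for illustration.

```python
# Simplified occupancy model of the 7-entry DCS buffer.
# Assumption: a read holds an entry for (read latency - 2) cycles starting at
# its issue cycle; writes and stall feedback are not modelled.

DCS_ENTRIES = 7  # FIFO depth given in the text

READ_LATENCY = {  # read latencies of the DCS groups from the table above
    "MMU": 5, "NAT": 5, "PSR": 12, "RSE": 12, "KR": 12,
}

def dcs_occupancy(requests):
    """requests: list of (issue_cycle, group) DCS reads, sorted by cycle.
    Returns the number of live entries right after each request is queued."""
    release = []  # cycle at which each queued entry frees up
    seen = []
    for cycle, group in requests:
        # Drop entries whose lifetime has ended by this cycle.
        release = [r for r in release if r > cycle]
        release.append(cycle + READ_LATENCY[group] - 2)
        seen.append(len(release))
    return seen

def first_stalling_request(requests):
    """Index of the first request that becomes the 7th live entry, or None."""
    for i, live in enumerate(dcs_occupancy(requests)):
        if live >= DCS_ENTRIES:
            return i
    return None
```

Under these assumptions, KR reads issued on consecutive cycles reach the 7th live entry on the seventh read, whereas MMU reads issued at the same rate never hold more than three entries.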
The DCS subsystem can only write back one result per cycle. Additionally, these results compete for the load units (M0/M1) with results from caches. If DCS data delivery coincides with L1D data delivery (on both units), DCU recirculation bubbles will occur (BE_L1D_FPU_BUBBLE.L1D_DCURECIR).
RSE latencies
The following table shows measured latencies from systematic instruction testing. Actual latencies may vary slightly.
RSE_AR: mov to or from ar.rsc/ar.bspstore/ar.bsp/ar.rnat
FP_OP: any F unit operation (including nop.f)
USE: use of target general register
ANY: any instruction (stall inevitable)
From | To | Latency | Stall accounted to
mov ar.rsc=reg | RSE_AR | 12 cycles | BE_RSE_BUBBLE.AR_DEP
mov ar.rsc=imm | RSE_AR | 2 cycles | BE_RSE_BUBBLE.AR_DEP
mov ar.bspstore= | RSE_AR | 5 cycles | BE_RSE_BUBBLE.AR_DEP
mov =ar.bspstore | mov ar.rnat= | 8 cycles | BE_EXE_BUBBLE.ARCR
mov =ar.bsp | mov ar.rnat= | 8 cycles | BE_EXE_BUBBLE.ARCR
mov =ar.rnat/ar.unat | mov ar.rnat/ar.unat= | 6 cycles | BE_EXE_BUBBLE.ARCR
mov ar.rnat/ar.unat= | mov =ar.rnat/ar.unat | 6 cycles | BE_EXE_BUBBLE.ARCR
mov =ar.unat | FP_OP | 6 cycles | BE_EXE_BUBBLE.ARCR
mov ar.bspstore= | flushrs | 13 cycles min# | BE_RSE_BUBBLE.OVERFLOW
mov ar.rnat= | flushrs | 2 cycles min^ | BE_RSE_BUBBLE.OVERFLOW
ANY | flushrs | 2 cycles min | BE_RSE_BUBBLE.OVERFLOW
mov ar.rsc= | loadrs | 13 cycles min% | BE_RSE_BUBBLE.LOADRS
mov ar.bspstore= | loadrs | 13 cycles min | BE_RSE_BUBBLE.LOADRS
mov =ar.bspstore | loadrs | 3 cycles min | BE_RSE_BUBBLE.LOADRS
loadrs | loadrs | 9 cycles min | BE_RSE_BUBBLE.LOADRS
ANY | loadrs | 2 cycles min | BE_RSE_BUBBLE.LOADRS
# microarchitecture manual quotes 14 cycles - probably inclusive of flushrs instruction
^ microarchitecture manual quotes 3 cycles - probably inclusive of flushrs instruction
% microarchitecture manual quotes 14 cycles - probably inclusive of loadrs instruction
All other combinations of RSE_AR/flushrs/loadrs/alloc were measured as having single cycle latencies.
System instruction latencies
Again, these latencies were obtained through systematic measurement, and actual latencies may vary slightly.
From | To | Latency | Stall accounted to
epc | ANY | 1 cycle | -
bsw | ANY | 6 cycles% | BE_RSE_BUBBLE.BANK_SWITCH
rfi | ANY | 13 cycles^ | BE_FLUSH_BUBBLE.BRU (1), BE_FLUSH_BUBBLE.XPN (8), BACK_END_BUBBLE.FE (3)
srlz.d | ANY | 1 cycle | -
srlz.i | ANY | 12 cycles | BE_FLUSH_BUBBLE.XPN (8), BACK_END_BUBBLE.FE (3)
sum/rum/mov psr.um= | ANY | 5 cycles* | BE_EXE_BUBBLE.ARCR
sum/rum/mov psr.um= | srlz | 10 cycles | BE_EXE_BUBBLE.ARCR
ssm/rsm/mov psr.l= | srlz | 5 cycles# | BE_EXE_BUBBLE.ARCR
mov =psr.um/psr | srlz | 2 cycles | BE_EXE_BUBBLE.ARCR
mov pkr/rr= | srlz/sync/fwb/mf/invala_M0 | 14 cycles | BE_EXE_BUBBLE.ARCR
itc | srlz | 11 cycles | BE_EXE_BUBBLE.ARCR
probe/tpa/tak/thash/ttag$ | USE | 5 cycles | BE_EXE_BUBBLE.GRALL
* measured value consistent with the microarchitecture manual
# microarchitecture manual quotes 6 cycles - probably inclusive of srlz.d
% assuming bank switch necessary, otherwise no-op
^ no extra cycles for bank switch
$ note that these instructions are equivalent to MMU register reads
Most M unit instructions (except ALU, nop.m, invala.e) should not be scheduled exactly 5 cycles after mov pkr/rr=, otherwise a DCU recirculate bubble will occur.
M unit dispersal rules
The explanation in the microarchitecture manual is confusing, and one of the examples is incorrect. The general principle seems to be that load subtype instructions are allocated to units first, then the remaining slots are allocated sequentially to remaining units, taking into account constraints when applicable. There is an unusual case when an M0-only instruction is issued in the second slot of either bundle (issue splits and then dispersal seems to stall for an additional cycle).
Special split issue cases
Aside from the cases mentioned in the microarchitecture reference, the Itanium 2 processor always splits issue before mf and after srlz, sync and mov =ar.unat. The processor also ensures that issue splits between mov =ar.bsp and a branch unit instruction - for cases within the same bundle, the split is after the M slot, else it is between the two bundles. Similarly issue splits between any M unit instruction and fwb. All of these cases are accounted for in SYLL_NOT_DISPERSED.IMPL.
L1D alias avoidance
Since L1 cache entries are tagged with the L1 TLB entry rather than the physical address (in the load case), virtual aliases could present problems. This is dealt with by ensuring there are no aliases in the L1DTLB: at insert time, any existing entry with the same physical address is evicted (and hence the corresponding cache lines are evicted too).
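
The insert-time rule can be illustrated with a toy model. This is only a sketch of the behaviour described above, under the assumption that translations and cached lines can be tracked per virtual page; the class `L1DTLB` and its fields are hypothetical and say nothing about the real structure sizes or replacement policy.

```python
class L1DTLB:
    """Toy L1DTLB in which cache lines are tagged by TLB entry (illustrative only)."""

    def __init__(self):
        self.entries = {}        # virtual page -> physical page
        self.cached_lines = {}   # virtual page -> lines cached under that entry

    def insert(self, vpage, ppage):
        # Evict every existing entry that maps the same physical page (or that
        # is being replaced), together with the cache lines tagged by it.
        stale = [v for v, p in self.entries.items() if p == ppage or v == vpage]
        for v in stale:
            del self.entries[v]
            del self.cached_lines[v]
        self.entries[vpage] = ppage
        self.cached_lines[vpage] = set()

    def load(self, vpage, line):
        # Record that a line was brought into L1D under this TLB entry.
        self.cached_lines[vpage].add(line)
```

Inserting a second virtual page that maps to an already-present physical page removes the first translation and its lines, so two aliases never coexist in the model.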
L2 cache replacement
The L2 cache replacement algorithm is described as NRU (Not Recently Used). This is implemented by maintaining one usage bit for each way within a line. This bit is set when that way is accessed. If all of the usage bits within a line become set, they are cleared. Replacement targets the first way that has its usage bit clear.
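
A minimal sketch of this policy follows. It takes the description literally (set the bit on access, clear every bit once all are set, replace the first way with a clear bit) and reads "line" as one cache set, which is my interpretation. The class name `NRUSet` and the way count passed in are illustrative, not taken from any manual.

```python
class NRUSet:
    """One cache set with a usage bit per way, following the description above."""

    def __init__(self, ways):
        self.used = [False] * ways

    def touch(self, way):
        # Mark the way as used; once every bit is set, clear them all.
        self.used[way] = True
        if all(self.used):
            self.used = [False] * len(self.used)

    def victim(self):
        # Replacement targets the first way whose usage bit is clear.
        return self.used.index(False)
```

For example, with 4 ways, touching ways 0 and 1 makes way 2 the next victim; touching ways 2 and 3 as well clears all the bits, and the victim is way 0 again.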
Long VHPT hash functions
index = (HPN ^ RID) & mask
tag = HPN ^ (RID << 39)
HPN is at most 49 bits (61-12), so in the tag the bottom 10 bits of RID overlap with HPN and the top 14 bits are intact. The VHPT must be at least 32KB, so the mask preserves at least 10 bits of the index, allowing the bottom 10 bits of the RID to be recovered. Thus the reverse functions are:
RID = (tag{62:49} << 10) | (index ^ tag){9:0}
HPN = tag ^ (RID << 39)
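
The round trip can be checked with a few lines of Python. This is a sketch that assumes a 24-bit RID (14 + 10 bits, as implied above), an HPN of at most 49 bits, and a mask that keeps at least the low 10 index bits; the function names are made up for illustration.

```python
def long_vhpt_hash(hpn, rid, mask):
    """Forward functions: returns (index, tag)."""
    index = (hpn ^ rid) & mask
    tag = hpn ^ (rid << 39)
    return index, tag

def long_vhpt_unhash(index, tag):
    """Reverse functions: recovers (hpn, rid) from (index, tag)."""
    rid_high = (tag >> 49) & 0x3FFF   # tag{62:49}: top 14 bits of RID, untouched by HPN
    rid_low = (index ^ tag) & 0x3FF   # low 10 bits: (HPN ^ RID) ^ HPN = RID
    rid = (rid_high << 10) | rid_low
    hpn = tag ^ (rid << 39)           # undo the XOR applied when forming the tag
    return hpn, rid
```

The low 10 bits of the tag are plain HPN bits because `RID << 39` does not reach them, which is why XORing them with the index yields the low RID bits.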
