This page documents additional information, learnt while scheduling and profiling Itanium assembly code (particularly system code), which is poorly documented elsewhere or not documented at all.

All information is for Itanium 2 unless otherwise stated. The microarchitecture reference guide is prerequisite reading.

If you think something here is incorrect, or you can add something, please edit this page or let me (MatthewChapman) know.

Special register accesses

Here we consider special registers to be all non-general, non-FP registers. These can be divided into groups by function and latency (the latencies here are documented in the microarchitecture manual):

||Group||Unit||Registers||Read Latency||
||branch||I0||ip, BRs, ar.pfs, ar.lc, ar.ec||2 cycles*||
||predicates||I0||pr||2 cycles*||
||interruption||M2||cr.iva, cr.iip, cr.iipa, cr.isr, cr.iim||2 cycles*||
||cache||M2 (L2 OzQ)||ar.ccv||11 cycles||
||MMU||M2 (DCS)||cr.pta, cr.gpta, cr.ifa, cr.itir, cr.iha, RRs, PKRs||5 cycles||
||NAT||M2 (DCS)||ar.rnat, ar.unat||5 cycles||
||PSR||M2 (DCS)||psr, cr.ipsr, cr.dcr||12 cycles||
||RSE||M2 (DCS)||ar.rsc, ar.bspstore, ar.bsp, cr.ifs||12 cycles||
||KRs||M2 (DCS)||ar.k0-k7||12 cycles||
||slow||M2 (DCS)||LSAPIC (including ar.itc), PMU, CPUID, MSRs||36 cycles#||

* These reads are allowed the full EXE cycle and are then bypassed from DET to REG (see diagram in Fetzer paper)
# Only one can issue from DCS buffer every 6 cycles

Special register accesses, like memory accesses, are non-blocking and do not occupy the execution unit for more than one cycle. While a read is outstanding, a read or write to the target general register will cause a scoreboard stall, accounted for in BE_EXE_BUBBLE.GRALL.

The groups labelled DCS are accessed via the DCS subsystem. The acronym DCS appears in the microarchitecture manual (as in BE_L1D_FPU_BUBBLE.L1D_DCS) but I have not been able to find a definition; my guess is that it stands for something like Data Communication Subsystem. From a programming point of view the important part is what I will call the DCS buffer - a 7-entry FIFO which queues DCS requests waiting to be serviced. Filling this FIFO results in a stall.

To determine whether a stall will occur, note that the lifetime of a request in the DCS buffer is 2 cycles less than the read latencies given above for reads, and 5 cycles (?) less for writes. For example, a KR read effectively occupies an entry in the DCS buffer for 10 cycles. On the 7th outstanding request there is a one-cycle stall (BE_L1D_FPU_BUBBLE.L1D_DCS); this may be to prevent a second unqueuable request entering a critical part of the pipeline. On the 8th request the pipeline stalls until two entries drain (BE_L1D_FPU_BUBBLE.L1D_DCS, then BE_L1D_FPU_BUBBLE.DCURECIR every 2nd cycle). There is a tricky case when the 7th and 8th requests are exactly two cycles apart, in which the pipelining produces results that I do not quite understand, but knowledge of this is not necessary for avoiding DCS stalls.
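The occupancy rule above can be sketched as a back-of-envelope model (my reading of the behaviour, not a cycle-accurate simulation; it only counts occupancy up to the first stall). Each read is assumed to occupy a DCS buffer entry for (read latency - 2) cycles, and one request issues per cycle:

```c
#include <assert.h>

/* Illustrative sketch: with one request issued per cycle, count how
   many earlier requests still occupy a DCS buffer entry when request
   n issues, to predict when the 7-entry FIFO would be full. */
static int dcs_in_flight(int n, int occupancy)
{
    /* request i issues at cycle i and frees its entry at i + occupancy */
    int count = 0;
    for (int i = 0; i < n; i++)
        if (i + occupancy > n)
            count++;
    return count;
}
```

For a back-to-back stream of KR reads (occupancy 10 cycles each), the 7th request (n = 6) finds 6 entries in flight and fills the buffer, matching the one-cycle stall described above; the 8th (n = 7) would find all 7 occupied.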

The DCS subsystem can only write back one result per cycle. Additionally, these results compete for the load units (M0/M1) with results from caches. If DCS data delivery coincides with L1D data delivery (on both units), DCU recirculation bubbles will occur (BE_L1D_FPU_BUBBLE.L1D_DCURECIR).

RSE latencies

The following table shows measured latencies from systematic instruction testing. Actual latencies may vary slightly.

RSE_AR: mov to or from ar.rsc/ar.bspstore/ar.bsp/ar.rnat
FP_OP: any F unit operation (including nop.f)
USE: use of target general register
ANY: any instruction (stall inevitable)

||From||To||Latency||Stall accounted to||
||mov ar.rsc=reg||RSE_AR||12 cycles||BE_RSE_BUBBLE.AR_DEP||
||mov ar.rsc=imm||RSE_AR||2 cycles||BE_RSE_BUBBLE.AR_DEP||
||mov ar.bspstore=||RSE_AR||5 cycles||BE_RSE_BUBBLE.AR_DEP||
||mov =ar.bspstore||mov ar.rnat=||8 cycles||BE_EXE_BUBBLE.ARCR||
||mov =ar.bsp||mov ar.rnat=||8 cycles||BE_EXE_BUBBLE.ARCR||
||mov =ar.rnat/ar.unat||mov ar.rnat/ar.unat=||6 cycles||BE_EXE_BUBBLE.ARCR||
||mov ar.rnat/ar.unat=||mov =ar.rnat/ar.unat||6 cycles||BE_EXE_BUBBLE.ARCR||
||mov =ar.unat||FP_OP||6 cycles||BE_EXE_BUBBLE.ARCR||
||mov ar.bspstore=||flushrs||13 cycles min#||BE_RSE_BUBBLE.OVERFLOW||
||mov ar.rnat=||flushrs||2 cycles min^||BE_RSE_BUBBLE.OVERFLOW||
||ANY||flushrs||2 cycles min||BE_RSE_BUBBLE.OVERFLOW||
||mov ar.rsc=||loadrs||13 cycles min%||BE_RSE_BUBBLE.LOADRS||
||mov ar.bspstore=||loadrs||13 cycles min||BE_RSE_BUBBLE.LOADRS||
||mov =ar.bspstore||loadrs||3 cycles min||BE_RSE_BUBBLE.LOADRS||
||loadrs||loadrs||9 cycles min||BE_RSE_BUBBLE.LOADRS||
||ANY||loadrs||2 cycles min||BE_RSE_BUBBLE.LOADRS||

# microarchitecture manual quotes 14 cycles - probably inclusive of flushrs instruction
^ microarchitecture manual quotes 3 cycles - probably inclusive of flushrs instruction
% microarchitecture manual quotes 14 cycles - probably inclusive of loadrs instruction

All other combinations of RSE_AR/flushrs/loadrs/alloc were measured as having single cycle latencies.

System instruction latencies

Again, these latencies were obtained through systematic measurement, and actual latencies may vary slightly.

||From||To||Latency||Stall accounted to||
||epc||ANY||1 cycle||-||
||bsw||ANY||6 cycles%||BE_RSE_BUBBLE.BANK_SWITCH||
||rfi||ANY||13 cycles^||BE_FLUSH_BUBBLE.BRU (1), BE_FLUSH_BUBBLE.XPN (8), BACK_END_BUBBLE.FE (3)||
||srlz.d||ANY||1 cycle||-||
||srlz.i||ANY||12 cycles||BE_FLUSH_BUBBLE.XPN (8), BACK_END_BUBBLE.FE (3)||
||sum/rum/mov psr.um=||ANY||5 cycles*||BE_EXE_BUBBLE.ARCR||
||sum/rum/mov psr.um=||srlz||10 cycles||BE_EXE_BUBBLE.ARCR||
||ssm/rsm/mov psr.l=||srlz||5 cycles#||BE_EXE_BUBBLE.ARCR||
||mov =psr.um/psr||srlz||2 cycles||BE_EXE_BUBBLE.ARCR||
||mov pkr/rr=||srlz/sync/fwb/mf/invala_M0||14 cycles||BE_EXE_BUBBLE.ARCR||
||itc||srlz||11 cycles||BE_EXE_BUBBLE.ARCR||
||probe/tpa/tak/thash/ttag$||USE||5 cycles||BE_EXE_BUBBLE.GRALL||

* measured value consistent with the microarchitecture manual
# microarchitecture manual quotes 6 cycles - probably inclusive of srlz.d
% assuming bank switch necessary, otherwise no-op
^ no extra cycles for bank switch
$ note that these instructions are equivalent to MMU register reads

Most M unit instructions (except ALU, nop.m, invala.e) should not be scheduled exactly 5 cycles after mov pkr/rr or a DCU recirculate bubble will occur.

M unit dispersal rules

The explanation in the microarchitecture manual is confusing, and one of the examples is incorrect. The general principle seems to be that load-subtype instructions are allocated to units first, then the remaining slots are allocated sequentially to the remaining units, taking unit constraints into account where applicable. There is an unusual case when an M0-only instruction is issued in the second slot of either bundle (issue splits, and then dispersal seems to stall for an additional cycle).
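The principle above can be sketched in code. This is a guess at the allocation order only (port count, the load-capable ports M0/M1, and the two-kind classification are assumptions for illustration; real dispersal has more instruction classes and constraints):

```c
#include <assert.h>

/* Sketch of the guessed dispersal principle: load-subtype M syllables
   claim the load-capable ports (assumed M0/M1) first, then remaining
   syllables take remaining ports in order. */
enum mkind { M_LOAD, M_OTHER };
#define NPORTS 4   /* Itanium 2 has four M ports, M0-M3 */

/* Assign ports to n M syllables in program order; port[i] receives the
   port index for syllable i, or -1 if no port is left. */
static void disperse(const enum mkind *kind, int n, int *port)
{
    int taken[NPORTS] = {0};
    for (int i = 0; i < n; i++) {      /* pass 1: loads claim M0/M1 */
        port[i] = -1;
        if (kind[i] == M_LOAD)
            for (int p = 0; p < 2; p++)
                if (!taken[p]) { taken[p] = 1; port[i] = p; break; }
    }
    for (int i = 0; i < n; i++) {      /* pass 2: the rest, in order */
        if (port[i] != -1) continue;
        for (int p = 0; p < NPORTS; p++)
            if (!taken[p]) { taken[p] = 1; port[i] = p; break; }
    }
}
```

For example, a non-load followed by two loads would come out as ports M2, M0, M1 under this guess: the loads claim M0 and M1 in pass 1, and the non-load then takes the first free port.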

Special split issue cases

Aside from the cases mentioned in the microarchitecture reference, the Itanium 2 processor always splits issue before mf and after srlz, sync and mov =ar.unat. The processor also ensures that issue splits between mov =ar.bsp and a branch unit instruction - for cases within the same bundle, the split is after the M slot, else it is between the two bundles. Similarly issue splits between any M unit instruction and fwb. All of these cases are accounted for in SYLL_NOT_DISPERSED.IMPL.

L1D alias avoidance

Since L1 cache entries are tagged with the L1 TLB entry rather than the physical address (in the load case), this could present problems with virtual aliases. This is dealt with by ensuring there are no aliases in the L1DTLB - at insert time, any existing entry with the same physical address is evicted (and hence the corresponding cache lines are evicted too).

L2 cache replacement

The L2 cache replacement algorithm is described as NRU (Not Recently Used). It is implemented by maintaining one usage bit for each way of a set. A way's bit is set when that way is accessed; if all of the usage bits in a set become set, they are all cleared. Replacement targets the first way whose usage bit is clear.
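A minimal sketch of this policy for one set (the way count matches the L2's 8 ways, but the interface is illustrative, not the hardware's):

```c
#include <assert.h>
#include <string.h>

#define WAYS 8

struct nru_set {
    unsigned char used[WAYS];   /* one usage bit per way */
};

/* Mark a way as recently used; if that leaves every bit set,
   clear them all, as described above. */
static void nru_touch(struct nru_set *s, int way)
{
    s->used[way] = 1;
    for (int i = 0; i < WAYS; i++)
        if (!s->used[i])
            return;                          /* some bit still clear */
    memset(s->used, 0, sizeof s->used);      /* all set: clear all */
}

/* Victim is the first way whose usage bit is clear. */
static int nru_victim(const struct nru_set *s)
{
    for (int i = 0; i < WAYS; i++)
        if (!s->used[i])
            return i;
    return 0;   /* unreachable: nru_touch never leaves all bits set */
}
```

Touching ways 0-6 of a fresh set leaves way 7 as the victim; touching way 7 then sets all the bits, which clears them, making way 0 the next victim.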

Long VHPT hash functions

index = (HPN ^ RID) & mask
tag = HPN ^ (RID << 39)

HPN is at most 49 bits (61-12), so in the tag the bottom 10 bits of the RID overlap with HPN while the top 14 bits are intact. The VHPT must be at least 32KB, and long-format entries are 32 bytes, so the table has at least 1024 entries and the mask preserves at least 10 bits of index; since the bottom 10 bits of the tag are just the bottom 10 bits of HPN, XORing index and tag recovers the bottom 10 bits of the RID. Thus the reverse functions are:

RID = (tag{62:49} << 10) | (index ^ tag){9:0}
HPN = tag ^ (RID << 39)

IA64wiki: ItaniumInternals (last edited 2009-12-10 03:14:07 by localhost)
