Most documents explain the Itanium memory consistency model in terms of the visibility of memory operations to programs running on different CPUs. While this is certainly a useful view, the model can be easier to understand as a combination of two separate layers: coherency, which operates between processors, and ordering, which is an intra-processor concern.
In Itanium 2 processors, the L2 cache is where these two layers converge. At and below the L2 level, cache coherency is provided by a standard MESI scheme, in which each cache line is Modified (held exclusively and dirty), Exclusive (held exclusively and clean), Shared (readable, potentially by several caches), or Invalid. This is very similar to how a single-writer multiple-reader DSM system would be implemented in software. Thus, if requests were issued to the L2 layer individually and in program order, sequential consistency would result: all memory operations would appear to execute in a single total order consistent with each processor's program order, so in particular all processors would observe writes in the same order.
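The coherency layer can be pictured as a small state machine per cache line. The sketch below models MESI transitions for one line as seen by one cache; the event names and the simplifying assumption that a miss on a read fetches the line Shared (rather than Exclusive when there are no other sharers) are illustrative, not a description of the actual Itanium 2 protocol logic.

```python
# Minimal MESI state machine for a single cache line, viewed from one cache.
# Events are local reads/writes and snooped remote reads/writes.
TRANSITIONS = {
    ("I", "local_read"):   "S",   # simplification: fetch Shared (could be E if no sharers)
    ("I", "local_write"):  "M",   # fetch the line exclusively, then modify it
    ("S", "local_read"):   "S",
    ("S", "local_write"):  "M",   # upgrade: other sharers are invalidated first
    ("E", "local_read"):   "E",
    ("E", "local_write"):  "M",   # silent upgrade, line already held exclusively
    ("M", "local_read"):   "M",
    ("M", "local_write"):  "M",
    ("I", "remote_read"):  "I",
    ("I", "remote_write"): "I",
    ("S", "remote_read"):  "S",
    ("S", "remote_write"): "I",   # another cache wants exclusivity
    ("E", "remote_read"):  "S",
    ("E", "remote_write"): "I",
    ("M", "remote_read"):  "S",   # write back dirty data, keep a shared copy
    ("M", "remote_write"): "I",   # write back dirty data, then invalidate
}

def step(state, event):
    return TRANSITIONS[(state, event)]

# A line we write is fetched exclusively; a remote read demotes it to Shared;
# a remote write invalidates our copy entirely.
s = "I"
for ev in ["local_write", "remote_read", "local_write", "remote_write"]:
    s = step(s, ev)
print(s)  # -> I
```

Because only one cache can hold a line in Modified or Exclusive state at a time, a write is globally ordered at the moment it gains exclusivity, which is what makes the single-writer multiple-reader analogy apt.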
Of course, issuing memory operations synchronously in this way is not efficient. The first optimisation made by modern processors is to avoid blocking on writes, which would otherwise require fetching the target cache line exclusively before the write can be applied; write buffers make this possible. However, to preserve program semantics, a local read that follows a local write to the same address must return the data written by that write, even if the write has not yet been applied to the global shared memory. This is known as local bypassing, and it results in a slightly weaker form of consistency known as processor consistency: a processor may observe its own writes before other processors do, but all other processors observe that processor's writes in the same order.
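The effect of local bypassing can be sketched with per-CPU store buffers in front of a shared memory. The structure below is illustrative (the buffer layout and function names are assumptions, not the Itanium 2 implementation): a load consults its own CPU's buffered stores before falling back to global memory, so each CPU sees its own writes early.

```python
# Sketch: per-CPU write buffers with local bypassing.
memory = {"x": 0}            # global shared memory
buffers = {0: [], 1: []}     # per-CPU FIFO of pending (addr, value) stores

def store(cpu, addr, value):
    buffers[cpu].append((addr, value))   # buffered, not yet globally visible

def load(cpu, addr):
    # Local bypassing: the youngest matching buffered store wins.
    for a, v in reversed(buffers[cpu]):
        if a == addr:
            return v
    return memory[addr]

def drain(cpu):
    # Apply buffered stores to global memory in FIFO (program) order.
    for a, v in buffers[cpu]:
        memory[a] = v
    buffers[cpu].clear()

store(0, "x", 1)
print(load(0, "x"))  # -> 1: CPU 0 sees its own store via the buffer
print(load(1, "x"))  # -> 0: CPU 1 does not, until CPU 0's buffer drains
drain(0)
print(load(1, "x"))  # -> 1
```

Because each buffer drains in FIFO order into a coherent memory, all other CPUs still observe a given CPU's writes in the same order, which is exactly the processor-consistency guarantee described above.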
Furthermore, modern processors allow multiple memory operations (and hence cache fetches) to be outstanding. While processor consistency can be maintained by controlling the order in which requests access cache data and retire, the Itanium consistency model relaxes this requirement even further, and allows requests to access cache data out-of-order as soon as the cache line is available. In the Itanium 2 processor, this is controlled by the L2 OzQ (out-of-order queue).
In other words, while the underlying global shared memory can provide sequential consistency, local memory operations of any processor are re-ordered before they access the shared memory, and thus the illusion of a weaker consistency model results.
To allow the programmer to limit the extent to which local memory operations are re-ordered before they access the L2 cache and thus the global shared memory, the Itanium architecture provides acquire and release annotations on load and store operations (respectively).
Consider an L2 OzQ that might look like this:
OP1 | OP2 | ST.REL | OP3 | OP4 | ...
Here, all earlier operations (OP1 and OP2) must access the cache before the ST.REL is allowed to access the cache. Note that there are no restrictions on OP3 and OP4; they may access the cache before the ST.REL if their cache lines are available.
OP1 | OP2 | LD.ACQ | OP3 | OP4 | ...
Here, the LD.ACQ must access the cache before any later operations (OP3, OP4, etc.) are allowed to access the cache. Note that it is not guaranteed that OP1 and OP2 complete before the LD.ACQ; thus if one of those operations is a store, the LD.ACQ can receive bypassed data. A memory fence would be necessary to prevent this.
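The two rules above can be captured as a predicate over the queue: a ST.REL may access the cache only once everything before it has, and nothing may access the cache past an earlier LD.ACQ that has not. This is a sketch of the ordering constraints only; the function, its `line_ready` flag, and the representation are assumptions for illustration, not the real OzQ logic.

```python
# Sketch: may a queued operation access the cache yet?
# queue: list of op kinds ("op", "st.rel", "ld.acq");
# done:  set of queue indices that have already accessed the cache;
# line_ready: whether the operation's cache line is available.
def may_access(queue, idx, done, line_ready):
    if not line_ready:
        return False
    earlier = range(idx)
    # A st.rel waits until every earlier operation has accessed the cache.
    if queue[idx] == "st.rel" and any(i not in done for i in earlier):
        return False
    # Nothing may access the cache before an earlier ld.acq has.
    if any(queue[i] == "ld.acq" and i not in done for i in earlier):
        return False
    return True

q = ["op", "op", "st.rel", "op", "op"]
print(may_access(q, 2, done=set(), line_ready=True))   # False: ST.REL blocked
print(may_access(q, 3, done=set(), line_ready=True))   # True: OP3 may go early
print(may_access(q, 2, done={0, 1}, line_ready=True))  # True: ST.REL may go now

q = ["op", "op", "ld.acq", "op", "op"]
print(may_access(q, 3, done=set(), line_ready=True))   # False: behind the LD.ACQ
print(may_access(q, 3, done={2}, line_ready=True))     # True: LD.ACQ is done
print(may_access(q, 2, done=set(), line_ready=True))   # True: LD.ACQ itself is free
```

Note that in the second queue the LD.ACQ itself is unconstrained by OP1 and OP2, matching the observation above that it may receive bypassed data from an earlier store.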
The memory fence instruction (MF) is equivalent to both an acquire and a release: all earlier operations must access the cache before the MF does, and no later operation may access the cache until the MF has done so.
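In the same illustrative style as before (assumed function names, not real hardware logic), the fence combines both constraints into one queue entry:

```python
# Sketch: an MF combines the release rule and the acquire rule.
def mf_may_access(queue, idx, done):
    # Release side: the MF waits for every earlier operation.
    return all(i in done for i in range(idx))

def blocked_by_mf(queue, idx, done):
    # Acquire side: everything after an MF waits until the MF is done.
    return any(queue[i] == "mf" and i not in done for i in range(idx))

q = ["op", "mf", "op"]
print(mf_may_access(q, 1, done=set()))   # -> False: OP1 has not accessed the cache
print(mf_may_access(q, 1, done={0}))     # -> True
print(blocked_by_mf(q, 2, done={0}))     # -> True: the MF itself is not done
print(blocked_by_mf(q, 2, done={0, 1}))  # -> False
```

Unlike the LD.ACQ case, the release side of the MF rules out bypassing: no later load can observe an earlier buffered store around the fence, because the store must access the cache first.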