~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~ [ freetext search ] ~ [ file search ] ~

Linux Cross Reference
Linux-2.6.17/Documentation/memory-barriers.txt

Version: ~ [ 2.6.16 ] ~ [ 2.6.17 ] ~
Architecture: ~ [ ia64 ] ~ [ i386 ] ~ [ arm ] ~ [ ppc ] ~ [ sparc64 ] ~

  1                          ============================
  2                          LINUX KERNEL MEMORY BARRIERS
  3                          ============================
  4 
  5 By: David Howells <dhowells@redhat.com>
  6 
  7 Contents:
  8 
  9  (*) Abstract memory access model.
 10 
 11      - Device operations.
 12      - Guarantees.
 13 
 14  (*) What are memory barriers?
 15 
 16      - Varieties of memory barrier.
 17      - What may not be assumed about memory barriers?
 18      - Data dependency barriers.
 19      - Control dependencies.
 20      - SMP barrier pairing.
 21      - Examples of memory barrier sequences.
 22      - Read memory barriers vs load speculation.
 23 
 24  (*) Explicit kernel barriers.
 25 
 26      - Compiler barrier.
 27      - The CPU memory barriers.
 28      - MMIO write barrier.
 29 
 30  (*) Implicit kernel memory barriers.
 31 
 32      - Locking functions.
 33      - Interrupt disabling functions.
 34      - Miscellaneous functions.
 35 
 36  (*) Inter-CPU locking barrier effects.
 37 
 38      - Locks vs memory accesses.
 39      - Locks vs I/O accesses.
 40 
 41  (*) Where are memory barriers needed?
 42 
 43      - Interprocessor interaction.
 44      - Atomic operations.
 45      - Accessing devices.
 46      - Interrupts.
 47 
 48  (*) Kernel I/O barrier effects.
 49 
 50  (*) Assumed minimum execution ordering model.
 51 
 52  (*) The effects of the cpu cache.
 53 
 54      - Cache coherency.
 55      - Cache coherency vs DMA.
 56      - Cache coherency vs MMIO.
 57 
 58  (*) The things CPUs get up to.
 59 
 60      - And then there's the Alpha.
 61 
 62  (*) References.
 63 
 64 
 65 ============================
 66 ABSTRACT MEMORY ACCESS MODEL
 67 ============================
 68 
 69 Consider the following abstract model of the system:
 70 
 71                             :                :
 72                             :                :
 73                             :                :
 74                 +-------+   :   +--------+   :   +-------+
 75                 |       |   :   |        |   :   |       |
 76                 |       |   :   |        |   :   |       |
 77                 | CPU 1 |<----->| Memory |<----->| CPU 2 |
 78                 |       |   :   |        |   :   |       |
 79                 |       |   :   |        |   :   |       |
 80                 +-------+   :   +--------+   :   +-------+
 81                     ^       :       ^        :       ^
 82                     |       :       |        :       |
 83                     |       :       |        :       |
 84                     |       :       v        :       |
 85                     |       :   +--------+   :       |
 86                     |       :   |        |   :       |
 87                     |       :   |        |   :       |
 88                     +---------->| Device |<----------+
 89                             :   |        |   :
 90                             :   |        |   :
 91                             :   +--------+   :
 92                             :                :
 93 
 94 Each CPU executes a program that generates memory access operations.  In the
 95 abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
 96 perform the memory operations in any order it likes, provided program causality
 97 appears to be maintained.  Similarly, the compiler may also arrange the
 98 instructions it emits in any order it likes, provided it doesn't affect the
 99 apparent operation of the program.
100 
101 So in the above diagram, the effects of the memory operations performed by a
102 CPU are perceived by the rest of the system as the operations cross the
103 interface between the CPU and rest of the system (the dotted lines).
104 
105 
106 For example, consider the following sequence of events:
107 
108         CPU 1           CPU 2
109         =============== ===============
110         { A == 1; B == 2 }
111         A = 3;          x = A;
112         B = 4;          y = B;
113 
114 The set of accesses as seen by the memory system in the middle can be arranged
115 in 24 different combinations:
116 
117         STORE A=3,      STORE B=4,      x=LOAD A->3,    y=LOAD B->4
118         STORE A=3,      STORE B=4,      y=LOAD B->4,    x=LOAD A->3
119         STORE A=3,      x=LOAD A->3,    STORE B=4,      y=LOAD B->4
120         STORE A=3,      x=LOAD A->3,    y=LOAD B->2,    STORE B=4
121         STORE A=3,      y=LOAD B->2,    STORE B=4,      x=LOAD A->3
122         STORE A=3,      y=LOAD B->2,    x=LOAD A->3,    STORE B=4
123         STORE B=4,      STORE A=3,      x=LOAD A->3,    y=LOAD B->4
124         STORE B=4, ...
125         ...
126 
127 and can thus result in four different combinations of values:
128 
129         x == 1, y == 2
130         x == 1, y == 4
131         x == 3, y == 2
132         x == 3, y == 4
133 
134 
135 Furthermore, the stores committed by a CPU to the memory system may not be
136 perceived by the loads made by another CPU in the same order as the stores were
137 committed.
138 
139 
140 As a further example, consider this sequence of events:
141 
142         CPU 1           CPU 2
143         =============== ===============
144         { A == 1, B == 2, C = 3, P == &A, Q == &C }
145         B = 4;          Q = P;
146         P = &B          D = *Q;
147 
148 There is an obvious data dependency here, as the value loaded into D depends on
149 the address retrieved from P by CPU 2.  At the end of the sequence, any of the
150 following results are possible:
151 
152         (Q == &A) and (D == 1)
153         (Q == &B) and (D == 2)
154         (Q == &B) and (D == 4)
155 
156 Note that CPU 2 will never try and load C into D because the CPU will load P
157 into Q before issuing the load of *Q.
158 
159 
160 DEVICE OPERATIONS
161 -----------------
162 
163 Some devices present their control interfaces as collections of memory
164 locations, but the order in which the control registers are accessed is very
165 important.  For instance, imagine an ethernet card with a set of internal
166 registers that are accessed through an address port register (A) and a data
167 port register (D).  To read internal register 5, the following code might then
168 be used:
169 
170         *A = 5;
171         x = *D;
172 
173 but this might show up as either of the following two sequences:
174 
175         STORE *A = 5, x = LOAD *D
176         x = LOAD *D, STORE *A = 5
177 
178 the second of which will almost certainly result in a malfunction, since it set
179 the address _after_ attempting to read the register.
180 
181 
182 GUARANTEES
183 ----------
184 
185 There are some minimal guarantees that may be expected of a CPU:
186 
187  (*) On any given CPU, dependent memory accesses will be issued in order, with
188      respect to itself.  This means that for:
189 
190         Q = P; D = *Q;
191 
192      the CPU will issue the following memory operations:
193 
194         Q = LOAD P, D = LOAD *Q
195 
196      and always in that order.
197 
198  (*) Overlapping loads and stores within a particular CPU will appear to be
199      ordered within that CPU.  This means that for:
200 
201         a = *X; *X = b;
202 
203      the CPU will only issue the following sequence of memory operations:
204 
205         a = LOAD *X, STORE *X = b
206 
207      And for:
208 
209         *X = c; d = *X;
210 
211      the CPU will only issue:
212 
213         STORE *X = c, d = LOAD *X
214 
215      (Loads and stores overlap if they are targetted at overlapping pieces of
216      memory).
217 
218 And there are a number of things that _must_ or _must_not_ be assumed:
219 
220  (*) It _must_not_ be assumed that independent loads and stores will be issued
221      in the order given.  This means that for:
222 
223         X = *A; Y = *B; *D = Z;
224 
225      we may get any of the following sequences:
226 
227         X = LOAD *A,  Y = LOAD *B,  STORE *D = Z
228         X = LOAD *A,  STORE *D = Z, Y = LOAD *B
229         Y = LOAD *B,  X = LOAD *A,  STORE *D = Z
230         Y = LOAD *B,  STORE *D = Z, X = LOAD *A
231         STORE *D = Z, X = LOAD *A,  Y = LOAD *B
232         STORE *D = Z, Y = LOAD *B,  X = LOAD *A
233 
234  (*) It _must_ be assumed that overlapping memory accesses may be merged or
235      discarded.  This means that for:
236 
237         X = *A; Y = *(A + 4);
238 
239      we may get any one of the following sequences:
240 
241         X = LOAD *A; Y = LOAD *(A + 4);
242         Y = LOAD *(A + 4); X = LOAD *A;
243         {X, Y} = LOAD {*A, *(A + 4) };
244 
245      And for:
246 
247         *A = X; Y = *A;
248 
249      we may get either of:
250 
251         STORE *A = X; Y = LOAD *A;
252         STORE *A = Y = X;
253 
254 
255 =========================
256 WHAT ARE MEMORY BARRIERS?
257 =========================
258 
259 As can be seen above, independent memory operations are effectively performed
260 in random order, but this can be a problem for CPU-CPU interaction and for I/O.
261 What is required is some way of intervening to instruct the compiler and the
262 CPU to restrict the order.
263 
264 Memory barriers are such interventions.  They impose a perceived partial
265 ordering between the memory operations specified on either side of the barrier.
266 They request that the sequence of memory events generated appears to other
267 parts of the system as if the barrier is effective on that CPU.
268 
269 
270 VARIETIES OF MEMORY BARRIER
271 ---------------------------
272 
273 Memory barriers come in four basic varieties:
274 
275  (1) Write (or store) memory barriers.
276 
277      A write memory barrier gives a guarantee that all the STORE operations
278      specified before the barrier will appear to happen before all the STORE
279      operations specified after the barrier with respect to the other
280      components of the system.
281 
282      A write barrier is a partial ordering on stores only; it is not required
283      to have any effect on loads.
284 
285      A CPU can be viewed as as commiting a sequence of store operations to the
286      memory system as time progresses.  All stores before a write barrier will
287      occur in the sequence _before_ all the stores after the write barrier.
288 
289      [!] Note that write barriers should normally be paired with read or data
290      dependency barriers; see the "SMP barrier pairing" subsection.
291 
292 
293  (2) Data dependency barriers.
294 
295      A data dependency barrier is a weaker form of read barrier.  In the case
296      where two loads are performed such that the second depends on the result
297      of the first (eg: the first load retrieves the address to which the second
298      load will be directed), a data dependency barrier would be required to
299      make sure that the target of the second load is updated before the address
300      obtained by the first load is accessed.
301 
302      A data dependency barrier is a partial ordering on interdependent loads
303      only; it is not required to have any effect on stores, independent loads
304      or overlapping loads.
305 
306      As mentioned in (1), the other CPUs in the system can be viewed as
307      committing sequences of stores to the memory system that the CPU being
308      considered can then perceive.  A data dependency barrier issued by the CPU
309      under consideration guarantees that for any load preceding it, if that
310      load touches one of a sequence of stores from another CPU, then by the
311      time the barrier completes, the effects of all the stores prior to that
312      touched by the load will be perceptible to any loads issued after the data
313      dependency barrier.
314 
315      See the "Examples of memory barrier sequences" subsection for diagrams
316      showing the ordering constraints.
317 
318      [!] Note that the first load really has to have a _data_ dependency and
319      not a control dependency.  If the address for the second load is dependent
320      on the first load, but the dependency is through a conditional rather than
321      actually loading the address itself, then it's a _control_ dependency and
322      a full read barrier or better is required.  See the "Control dependencies"
323      subsection for more information.
324 
325      [!] Note that data dependency barriers should normally be paired with
326      write barriers; see the "SMP barrier pairing" subsection.
327 
328 
329  (3) Read (or load) memory barriers.
330 
331      A read barrier is a data dependency barrier plus a guarantee that all the
332      LOAD operations specified before the barrier will appear to happen before
333      all the LOAD operations specified after the barrier with respect to the
334      other components of the system.
335 
336      A read barrier is a partial ordering on loads only; it is not required to
337      have any effect on stores.
338 
339      Read memory barriers imply data dependency barriers, and so can substitute
340      for them.
341 
342      [!] Note that read barriers should normally be paired with write barriers;
343      see the "SMP barrier pairing" subsection.
344 
345 
346  (4) General memory barriers.
347 
348      A general memory barrier gives a guarantee that all the LOAD and STORE
349      operations specified before the barrier will appear to happen before all
350      the LOAD and STORE operations specified after the barrier with respect to
351      the other components of the system.
352 
353      A general memory barrier is a partial ordering over both loads and stores.
354 
355      General memory barriers imply both read and write memory barriers, and so
356      can substitute for either.
357 
358 
359 And a couple of implicit varieties:
360 
361  (5) LOCK operations.
362 
363      This acts as a one-way permeable barrier.  It guarantees that all memory
364      operations after the LOCK operation will appear to happen after the LOCK
365      operation with respect to the other components of the system.
366 
367      Memory operations that occur before a LOCK operation may appear to happen
368      after it completes.
369 
370      A LOCK operation should almost always be paired with an UNLOCK operation.
371 
372 
373  (6) UNLOCK operations.
374 
375      This also acts as a one-way permeable barrier.  It guarantees that all
376      memory operations before the UNLOCK operation will appear to happen before
377      the UNLOCK operation with respect to the other components of the system.
378 
379      Memory operations that occur after an UNLOCK operation may appear to
380      happen before it completes.
381 
382      LOCK and UNLOCK operations are guaranteed to appear with respect to each
383      other strictly in the order specified.
384 
385      The use of LOCK and UNLOCK operations generally precludes the need for
386      other sorts of memory barrier (but note the exceptions mentioned in the
387      subsection "MMIO write barrier").
388 
389 
390 Memory barriers are only required where there's a possibility of interaction
391 between two CPUs or between a CPU and a device.  If it can be guaranteed that
392 there won't be any such interaction in any particular piece of code, then
393 memory barriers are unnecessary in that piece of code.
394 
395 
396 Note that these are the _minimum_ guarantees.  Different architectures may give
397 more substantial guarantees, but they may _not_ be relied upon outside of arch
398 specific code.
399 
400 
401 WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
402 ----------------------------------------------
403 
404 There are certain things that the Linux kernel memory barriers do not guarantee:
405 
406  (*) There is no guarantee that any of the memory accesses specified before a
407      memory barrier will be _complete_ by the completion of a memory barrier
408      instruction; the barrier can be considered to draw a line in that CPU's
409      access queue that accesses of the appropriate type may not cross.
410 
411  (*) There is no guarantee that issuing a memory barrier on one CPU will have
412      any direct effect on another CPU or any other hardware in the system.  The
413      indirect effect will be the order in which the second CPU sees the effects
414      of the first CPU's accesses occur, but see the next point:
415 
416  (*) There is no guarantee that the a CPU will see the correct order of effects
417      from a second CPU's accesses, even _if_ the second CPU uses a memory
418      barrier, unless the first CPU _also_ uses a matching memory barrier (see
419      the subsection on "SMP Barrier Pairing").
420 
421  (*) There is no guarantee that some intervening piece of off-the-CPU
422      hardware[*] will not reorder the memory accesses.  CPU cache coherency
423      mechanisms should propagate the indirect effects of a memory barrier
424      between CPUs, but might not do so in order.
425 
426         [*] For information on bus mastering DMA and coherency please read:
427 
428             Documentation/pci.txt
429             Documentation/DMA-mapping.txt
430             Documentation/DMA-API.txt
431 
432 
433 DATA DEPENDENCY BARRIERS
434 ------------------------
435 
436 The usage requirements of data dependency barriers are a little subtle, and
437 it's not always obvious that they're needed.  To illustrate, consider the
438 following sequence of events:
439 
440         CPU 1           CPU 2
441         =============== ===============
442         { A == 1, B == 2, C = 3, P == &A, Q == &C }
443         B = 4;
444         <write barrier>
445         P = &B
446                         Q = P;
447                         D = *Q;
448 
449 There's a clear data dependency here, and it would seem that by the end of the
450 sequence, Q must be either &A or &B, and that:
451 
452         (Q == &A) implies (D == 1)
453         (Q == &B) implies (D == 4)
454 
455 But! CPU 2's perception of P may be updated _before_ its perception of B, thus
456 leading to the following situation:
457 
458         (Q == &B) and (D == 2) ????
459 
460 Whilst this may seem like a failure of coherency or causality maintenance, it
461 isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
462 Alpha).
463 
464 To deal with this, a data dependency barrier must be inserted between the
465 address load and the data load:
466 
467         CPU 1           CPU 2
468         =============== ===============
469         { A == 1, B == 2, C = 3, P == &A, Q == &C }
470         B = 4;
471         <write barrier>
472         P = &B
473                         Q = P;
474                         <data dependency barrier>
475                         D = *Q;
476 
477 This enforces the occurrence of one of the two implications, and prevents the
478 third possibility from arising.
479 
480 [!] Note that this extremely counterintuitive situation arises most easily on
481 machines with split caches, so that, for example, one cache bank processes
482 even-numbered cache lines and the other bank processes odd-numbered cache
483 lines.  The pointer P might be stored in an odd-numbered cache line, and the
484 variable B might be stored in an even-numbered cache line.  Then, if the
485 even-numbered bank of the reading CPU's cache is extremely busy while the
486 odd-numbered bank is idle, one can see the new value of the pointer P (&B),
487 but the old value of the variable B (1).
488 
489 
490 Another example of where data dependency barriers might by required is where a
491 number is read from memory and then used to calculate the index for an array
492 access:
493 
494         CPU 1           CPU 2
495         =============== ===============
496         { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
497         M[1] = 4;
498         <write barrier>
499         P = 1
500                         Q = P;
501                         <data dependency barrier>
502                         D = M[Q];
503 
504 
505 The data dependency barrier is very important to the RCU system, for example.
506 See rcu_dereference() in include/linux/rcupdate.h.  This permits the current
507 target of an RCU'd pointer to be replaced with a new modified target, without
508 the replacement target appearing to be incompletely initialised.
509 
510 See also the subsection on "Cache Coherency" for a more thorough example.
511 
512 
513 CONTROL DEPENDENCIES
514 --------------------
515 
516 A control dependency requires a full read memory barrier, not simply a data
517 dependency barrier to make it work correctly.  Consider the following bit of
518 code:
519 
520         q = &a;
521         if (p)
522                 q = &b;
523         <data dependency barrier>
524         x = *q;
525 
526 This will not have the desired effect because there is no actual data
527 dependency, but rather a control dependency that the CPU may short-circuit by
528 attempting to predict the outcome in advance.  In such a case what's actually
529 required is:
530 
531         q = &a;
532         if (p)
533                 q = &b;
534         <read barrier>
535         x = *q;
536 
537 
538 SMP BARRIER PAIRING
539 -------------------
540 
541 When dealing with CPU-CPU interactions, certain types of memory barrier should
542 always be paired.  A lack of appropriate pairing is almost certainly an error.
543 
544 A write barrier should always be paired with a data dependency barrier or read
545 barrier, though a general barrier would also be viable.  Similarly a read
546 barrier or a data dependency barrier should always be paired with at least an
547 write barrier, though, again, a general barrier is viable:
548 
549         CPU 1           CPU 2
550         =============== ===============
551         a = 1;
552         <write barrier>
553         b = 2;          x = b;
554                         <read barrier>
555                         y = a;
556 
557 Or:
558 
559         CPU 1           CPU 2
560         =============== ===============================
561         a = 1;
562         <write barrier>
563         b = &a;         x = b;
564                         <data dependency barrier>
565                         y = *x;
566 
567 Basically, the read barrier always has to be there, even though it can be of
568 the "weaker" type.
569 
570 [!] Note that the stores before the write barrier would normally be expected to
571 match the loads after the read barrier or data dependency barrier, and vice
572 versa:
573 
574         CPU 1                           CPU 2
575         ===============                 ===============
576         a = 1;           }----   --->{  v = c
577         b = 2;           }    \ /    {  w = d
578         <write barrier>        \        <read barrier>
579         c = 3;           }    / \    {  x = a;
580         d = 4;           }----   --->{  y = b;
581 
582 
583 EXAMPLES OF MEMORY BARRIER SEQUENCES
584 ------------------------------------
585 
586 Firstly, write barriers act as a partial orderings on store operations.
587 Consider the following sequence of events:
588 
589         CPU 1
590         =======================
591         STORE A = 1
592         STORE B = 2
593         STORE C = 3
594         <write barrier>
595         STORE D = 4
596         STORE E = 5
597 
598 This sequence of events is committed to the memory coherence system in an order
599 that the rest of the system might perceive as the unordered set of { STORE A,
600 STORE B, STORE C } all occuring before the unordered set of { STORE D, STORE E
601 }:
602 
603         +-------+       :      :
604         |       |       +------+
605         |       |------>| C=3  |     }     /\
606         |       |  :    +------+     }-----  \  -----> Events perceptible
607         |       |  :    | A=1  |     }        \/       to rest of system
608         |       |  :    +------+     }
609         | CPU 1 |  :    | B=2  |     }
610         |       |       +------+     }
611         |       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
612         |       |       +------+     }        requires all stores prior to the
613         |       |  :    | E=5  |     }        barrier to be committed before
614         |       |  :    +------+     }        further stores may be take place.
615         |       |------>| D=4  |     }
616         |       |       +------+
617         +-------+       :      :
618                            |
619                            | Sequence in which stores are committed to the
620                            | memory system by CPU 1
621                            V
622 
623 
624 Secondly, data dependency barriers act as a partial orderings on data-dependent
625 loads.  Consider the following sequence of events:
626 
627         CPU 1                   CPU 2
628         ======================= =======================
629                 { B = 7; X = 9; Y = 8; C = &Y }
630         STORE A = 1
631         STORE B = 2
632         <write barrier>
633         STORE C = &B            LOAD X
634         STORE D = 4             LOAD C (gets &B)
635                                 LOAD *C (reads B)
636 
637 Without intervention, CPU 2 may perceive the events on CPU 1 in some
638 effectively random order, despite the write barrier issued by CPU 1:
639 
640         +-------+       :      :                :       :
641         |       |       +------+                +-------+  | Sequence of update
642         |       |------>| B=2  |-----       --->| Y->8  |  | of perception on
643         |       |  :    +------+     \          +-------+  | CPU 2
644         | CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
645         |       |       +------+       |        +-------+
646         |       |   wwwwwwwwwwwwwwww   |        :       :
647         |       |       +------+       |        :       :
648         |       |  :    | C=&B |---    |        :       :       +-------+
649         |       |  :    +------+   \   |        +-------+       |       |
650         |       |------>| D=4  |    ----------->| C->&B |------>|       |
651         |       |       +------+       |        +-------+       |       |
652         +-------+       :      :       |        :       :       |       |
653                                        |        :       :       |       |
654                                        |        :       :       | CPU 2 |
655                                        |        +-------+       |       |
656             Apparently incorrect --->  |        | B->7  |------>|       |
657             perception of B (!)        |        +-------+       |       |
658                                        |        :       :       |       |
659                                        |        +-------+       |       |
660             The load of X holds --->    \       | X->9  |------>|       |
661             up the maintenance           \      +-------+       |       |
662             of coherence of B             ----->| B->2  |       +-------+
663                                                 +-------+
664                                                 :       :
665 
666 
667 In the above example, CPU 2 perceives that B is 7, despite the load of *C
668 (which would be B) coming after the the LOAD of C.
669 
670 If, however, a data dependency barrier were to be placed between the load of C
671 and the load of *C (ie: B) on CPU 2:
672 
673         CPU 1                   CPU 2
674         ======================= =======================
675                 { B = 7; X = 9; Y = 8; C = &Y }
676         STORE A = 1
677         STORE B = 2
678         <write barrier>
679         STORE C = &B            LOAD X
680         STORE D = 4             LOAD C (gets &B)
681                                 <data dependency barrier>
682                                 LOAD *C (reads B)
683 
684 then the following will occur:
685 
686         +-------+       :      :                :       :
687         |       |       +------+                +-------+
688         |       |------>| B=2  |-----       --->| Y->8  |
689         |       |  :    +------+     \          +-------+
690         | CPU 1 |  :    | A=1  |      \     --->| C->&Y |
691         |       |       +------+       |        +-------+
692         |       |   wwwwwwwwwwwwwwww   |        :       :
693         |       |       +------+       |        :       :
694         |       |  :    | C=&B |---    |        :       :       +-------+
695         |       |  :    +------+   \   |        +-------+       |       |
696         |       |------>| D=4  |    ----------->| C->&B |------>|       |
697         |       |       +------+       |        +-------+       |       |
698         +-------+       :      :       |        :       :       |       |
699                                        |        :       :       |       |
700                                        |        :       :       | CPU 2 |
701                                        |        +-------+       |       |
702                                        |        | X->9  |------>|       |
703                                        |        +-------+       |       |
704           Makes sure all effects --->   \   ddddddddddddddddd   |       |
705           prior to the store of C        \      +-------+       |       |
706           are perceptible to              ----->| B->2  |------>|       |
707           subsequent loads                      +-------+       |       |
708                                                 :       :       +-------+
709 
710 
711 And thirdly, a read barrier acts as a partial order on loads.  Consider the
712 following sequence of events:
713 
714         CPU 1                   CPU 2
715         ======================= =======================
716                 { A = 0, B = 9 }
717         STORE A=1
718         <write barrier>
719         STORE B=2
720                                 LOAD B
721                                 LOAD A
722 
723 Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
724 some effectively random order, despite the write barrier issued by CPU 1:
725 
726         +-------+       :      :                :       :
727         |       |       +------+                +-------+
728         |       |------>| A=1  |------      --->| A->0  |
729         |       |       +------+      \         +-------+
730         | CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
731         |       |       +------+        |       +-------+
732         |       |------>| B=2  |---     |       :       :
733         |       |       +------+   \    |       :       :       +-------+
734         +-------+       :      :    \   |       +-------+       |       |
735                                      ---------->| B->2  |------>|       |
736                                         |       +-------+       | CPU 2 |
737                                         |       | A->0  |------>|       |
738                                         |       +-------+       |       |
739                                         |       :       :       +-------+
740                                          \      :       :
741                                           \     +-------+
742                                            ---->| A->1  |
743                                                 +-------+
744                                                 :       :
745 
746 
747 If, however, a read barrier were to be placed between the load of E and the
748 load of A on CPU 2:
749 
750         CPU 1                   CPU 2
751         ======================= =======================
752                 { A = 0, B = 9 }
753         STORE A=1
754         <write barrier>
755         STORE B=2
756                                 LOAD B
757                                 <read barrier>
758                                 LOAD A
759 
760 then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
761 2:
762 
763         +-------+       :      :                :       :
764         |       |       +------+                +-------+
765         |       |------>| A=1  |------      --->| A->0  |
766         |       |       +------+      \         +-------+
767         | CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
768         |       |       +------+        |       +-------+
769         |       |------>| B=2  |---     |       :       :
770         |       |       +------+   \    |       :       :       +-------+
771         +-------+       :      :    \   |       +-------+       |       |
772                                      ---------->| B->2  |------>|       |
773                                         |       +-------+       | CPU 2 |
774                                         |       :       :       |       |
775                                         |       :       :       |       |
776           At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
777           barrier causes all effects      \     +-------+       |       |
778           prior to the storage of B        ---->| A->1  |------>|       |
779           to be perceptible to CPU 2            +-------+       |       |
780                                                 :       :       +-------+
781 
782 
783 To illustrate this more completely, consider what could happen if the code
784 contained a load of A either side of the read barrier:
785 
786         CPU 1                   CPU 2
787         ======================= =======================
788                 { A = 0, B = 9 }
789         STORE A=1
790         <write barrier>
791         STORE B=2
792                                 LOAD B
793                                 LOAD A [first load of A]
794                                 <read barrier>
795                                 LOAD A [second load of A]
796 
797 Even though the two loads of A both occur after the load of B, they may both
798 come up with different values:
799 
800         +-------+       :      :                :       :
801         |       |       +------+                +-------+
802         |       |------>| A=1  |------      --->| A->0  |
803         |       |       +------+      \         +-------+
804         | CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
805         |       |       +------+        |       +-------+
806         |       |------>| B=2  |---     |       :       :
807         |       |       +------+   \    |       :       :       +-------+
808         +-------+       :      :    \   |       +-------+       |       |
809                                      ---------->| B->2  |------>|       |
810                                         |       +-------+       | CPU 2 |
811                                         |       :       :       |       |
812                                         |       :       :       |       |
813                                         |       +-------+       |       |
814                                         |       | A->0  |------>| 1st   |
815                                         |       +-------+       |       |
816           At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
817           barrier causes all effects      \     +-------+       |       |
818           prior to the storage of B        ---->| A->1  |------>| 2nd   |
819           to be perceptible to CPU 2            +-------+       |       |
820                                                 :       :       +-------+
821 
822 
823 But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
824 before the read barrier completes anyway:
825 
826         +-------+       :      :                :       :
827         |       |       +------+                +-------+
828         |       |------>| A=1  |------      --->| A->0  |
829         |       |       +------+      \         +-------+
830         | CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
831         |       |       +------+        |       +-------+
832         |       |------>| B=2  |---     |       :       :
833         |       |       +------+   \    |       :       :       +-------+
834         +-------+       :      :    \   |       +-------+       |       |
835                                      ---------->| B->2  |------>|       |
836                                         |       +-------+       | CPU 2 |
837                                         |       :       :       |       |
838                                          \      :       :       |       |
839                                           \     +-------+       |       |
840                                            ---->| A->1  |------>| 1st   |
841                                                 +-------+       |       |
842                                             rrrrrrrrrrrrrrrrr   |       |
843                                                 +-------+       |       |
844                                                 | A->1  |------>| 2nd   |
845                                                 +-------+       |       |
846                                                 :       :       +-------+
847 
848 
849 The guarantee is that the second load will always come up with A == 1 if the
850 load of B came up with B == 2.  No such guarantee exists for the first load of
851 A; that may come up with either A == 0 or A == 1.
852 
853 
854 READ MEMORY BARRIERS VS LOAD SPECULATION
855 ----------------------------------------
856 
857 Many CPUs speculate with loads: that is they see that they will need to load an
858 item from memory, and they find a time where they're not using the bus for any
859 other loads, and so do the load in advance - even though they haven't actually
860 got to that point in the instruction execution flow yet.  This permits the
861 actual load instruction to potentially complete immediately because the CPU
862 already has the value to hand.
863 
864 It may turn out that the CPU didn't actually need the value - perhaps because a
865 branch circumvented the load - in which case it can discard the value or just
866 cache it for later use.
867 
868 Consider:
869 
870         CPU 1                   CPU 2
871         ======================= =======================
872                                 LOAD B
873                                 DIVIDE          } Divide instructions generally
874                                 DIVIDE          } take a long time to perform
875                                 LOAD A
876 
877 Which might appear as this:
878 
879                                                 :       :       +-------+
880                                                 +-------+       |       |
881                                             --->| B->2  |------>|       |
882                                                 +-------+       | CPU 2 |
883                                                 :       :DIVIDE |       |
884                                                 +-------+       |       |
885         The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
886         division speculates on the              +-------+   ~   |       |
887         LOAD of A                               :       :   ~   |       |
888                                                 :       :DIVIDE |       |
889                                                 :       :   ~   |       |
890         Once the divisions are complete -->     :       :   ~-->|       |
891         the CPU can then perform the            :       :       |       |
892         LOAD with immediate effect              :       :       +-------+
893 
894 
895 Placing a read barrier or a data dependency barrier just before the second
896 load:
897 
898         CPU 1                   CPU 2
899         ======================= =======================
900                                 LOAD B
901                                 DIVIDE
902                                 DIVIDE
903                                 <read barrier>
904                                 LOAD A
905 
906 will force any value speculatively obtained to be reconsidered to an extent
907 dependent on the type of barrier used.  If there was no change made to the
908 speculated memory location, then the speculated value will just be used:
909 
910                                                 :       :       +-------+
911                                                 +-------+       |       |
912                                             --->| B->2  |------>|       |
913                                                 +-------+       | CPU 2 |
914                                                 :       :DIVIDE |       |
915                                                 +-------+       |       |
916         The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
917         division speculates on the              +-------+   ~   |       |
918         LOAD of A                               :       :   ~   |       |
919                                                 :       :DIVIDE |       |
920                                                 :       :   ~   |       |
921                                                 :       :   ~   |       |
922                                             rrrrrrrrrrrrrrrr~   |       |
923                                                 :       :   ~   |       |
924                                                 :       :   ~-->|       |
925                                                 :       :       |       |
926                                                 :       :       +-------+
927 
928 
929 but if there was an update or an invalidation from another CPU pending, then
930 the speculation will be cancelled and the value reloaded:
931 
932                                                 :       :       +-------+
933                                                 +-------+       |       |
934                                             --->| B->2  |------>|       |
935                                                 +-------+       | CPU 2 |
936                                                 :       :DIVIDE |       |
937                                                 +-------+       |       |
938         The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
939         division speculates on the              +-------+   ~   |       |
940         LOAD of A                               :       :   ~   |       |
941                                                 :       :DIVIDE |       |
942                                                 :       :   ~   |       |
943                                                 :       :   ~   |       |
944                                             rrrrrrrrrrrrrrrrr   |       |
945                                                 +-------+       |       |
946         The speculation is discarded --->   --->| A->1  |------>|       |
947         and an updated value is                 +-------+       |       |
948         retrieved                               :       :       +-------+
949 
950 
951 ========================
952 EXPLICIT KERNEL BARRIERS
953 ========================
954 
955 The Linux kernel has a variety of different barriers that act at different
956 levels:
957 
958   (*) Compiler barrier.
959 
960   (*) CPU memory barriers.
961 
962   (*) MMIO write barrier.
963 
964 
965 COMPILER BARRIER
966 ----------------
967 
968 The Linux kernel has an explicit compiler barrier function that prevents the
969 compiler from moving the memory accesses either side of it to the other side:
970 
971         barrier();
972 
973 This a general barrier - lesser varieties of compiler barrier do not exist.
974 
975 The compiler barrier has no direct effect on the CPU, which may then reorder
976 things however it wishes.
977 
978 
979 CPU MEMORY BARRIERS
980 -------------------
981 
982 The Linux kernel has eight basic CPU memory barriers:
983 
984         TYPE            MANDATORY               SMP CONDITIONAL
985         =============== ======================= ===========================
986         GENERAL         mb()                    smp_mb()
987         WRITE           wmb()                   smp_wmb()
988         READ            rmb()                   smp_rmb()
989         DATA DEPENDENCY read_barrier_depends()  smp_read_barrier_depends()
990 
991 
992 All CPU memory barriers unconditionally imply compiler barriers.
993 
994 SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
995 systems because it is assumed that a CPU will be appear to be self-consistent,
996 and will order overlapping accesses correctly with respect to itself.
997 
998 [!] Note that SMP memory barriers _must_ be used to control the ordering of
999 references to shared memory on SMP systems, though the use of locking instead
1000 is sufficient.
1001 
1002 Mandatory barriers should not be used to control SMP effects, since mandatory
1003 barriers unnecessarily impose overhead on UP systems. They may, however, be
1004 used to control MMIO effects on accesses through relaxed memory I/O windows.
1005 These are required even on non-SMP systems as they affect the order in which
1006 memory operations appear to a device by prohibiting both the compiler and the
1007 CPU from reordering them.
1008 
1009 
1010 There are some more advanced barrier functions:
1011 
1012  (*) set_mb(var, value)
1013  (*) set_wmb(var, value)
1014 
1015      These assign the value to the variable and then insert at least a write
1016      barrier after it, depending on the function.  They aren't guaranteed to
1017      insert anything more than a compiler barrier in a UP compilation.
1018 
1019 
1020  (*) smp_mb__before_atomic_dec();
1021  (*) smp_mb__after_atomic_dec();
1022  (*) smp_mb__before_atomic_inc();
1023  (*) smp_mb__after_atomic_inc();
1024 
1025      These are for use with atomic add, subtract, increment and decrement
1026      functions that don't return a value, especially when used for reference
1027      counting.  These functions do not imply memory barriers.
1028 
1029      As an example, consider a piece of code that marks an object as being dead
1030      and then decrements the object's reference count:
1031 
1032         obj->dead = 1;
1033         smp_mb__before_atomic_dec();
1034         atomic_dec(&obj->ref_count);
1035 
1036      This makes sure that the death mark on the object is perceived to be set
1037      *before* the reference counter is decremented.
1038 
1039      See Documentation/atomic_ops.txt for more information.  See the "Atomic
1040      operations" subsection for information on where to use these.
1041 
1042 
1043  (*) smp_mb__before_clear_bit(void);
1044  (*) smp_mb__after_clear_bit(void);
1045 
1046      These are for use similar to the atomic inc/dec barriers.  These are
1047      typically used for bitwise unlocking operations, so care must be taken as
1048      there are no implicit memory barriers here either.
1049 
1050      Consider implementing an unlock operation of some nature by clearing a
1051      locking bit.  The clear_bit() would then need to be barriered like this:
1052 
1053         smp_mb__before_clear_bit();
1054         clear_bit( ... );
1055 
1056      This prevents memory operations before the clear leaking to after it.  See
1057      the subsection on "Locking Functions" with reference to UNLOCK operation
1058      implications.
1059 
1060      See Documentation/atomic_ops.txt for more information.  See the "Atomic
1061      operations" subsection for information on where to use these.
1062 
1063 
1064 MMIO WRITE BARRIER
1065 ------------------
1066 
1067 The Linux kernel also has a special barrier for use with memory-mapped I/O
1068 writes:
1069 
1070         mmiowb();
1071 
1072 This is a variation on the mandatory write barrier that causes writes to weakly
1073 ordered I/O regions to be partially ordered.  Its effects may go beyond the
1074 CPU->Hardware interface and actually affect the hardware at some level.
1075 
1076 See the subsection "Locks vs I/O accesses" for more information.
1077 
1078 
1079 ===============================
1080 IMPLICIT KERNEL MEMORY BARRIERS
1081 ===============================
1082 
1083 Some of the other functions in the linux kernel imply memory barriers, amongst
1084 which are locking and scheduling functions.
1085 
1086 This specification is a _minimum_ guarantee; any particular architecture may
1087 provide more substantial guarantees, but these may not be relied upon outside
1088 of arch specific code.
1089 
1090 
1091 LOCKING FUNCTIONS
1092 -----------------
1093 
1094 The Linux kernel has a number of locking constructs:
1095 
1096  (*) spin locks
1097  (*) R/W spin locks
1098  (*) mutexes
1099  (*) semaphores
1100  (*) R/W semaphores
1101  (*) RCU
1102 
1103 In all cases there are variants on "LOCK" operations and "UNLOCK" operations
1104 for each construct.  These operations all imply certain barriers:
1105 
1106  (1) LOCK operation implication:
1107 
1108      Memory operations issued after the LOCK will be completed after the LOCK
1109      operation has completed.
1110 
1111      Memory operations issued before the LOCK may be completed after the LOCK
1112      operation has completed.
1113 
1114  (2) UNLOCK operation implication:
1115 
1116      Memory operations issued before the UNLOCK will be completed before the
1117      UNLOCK operation has completed.
1118 
1119      Memory operations issued after the UNLOCK may be completed before the
1120      UNLOCK operation has completed.
1121 
1122  (3) LOCK vs LOCK implication: