Micro-State Accounting

Current Work is on the /ToDo page.

A paper was published in AUUG2003: details on our publications page.

See also the /DownLoad page.

In a standard Linux system, the only information about where your time went is held in user, system and real times. These times are sampled statistically --- each time a clock tick fires, a counter is updated in the currently running task.

For the purposes of tuning the system, it'd be nice not only to get more precise indications of how much time was spent in the kernel and in user mode, but also to see where the rest of the time goes.

Time is stolen from tasks to handle interrupts. Such time shows up (if at all) as slightly greater system time.

Process States

Threads move from state to state according to the diagram here

http://www.gelato.unsw.edu.au/images/thread-state.png

The states reported by ps are a subset of the states we want to account for; in particular, the ps states do not distinguish between three runnable states: on-cpu, on-runqueue, and interrupted.

The interrupted state is an interesting one: usually when an interrupt arrives in Linux it steals time from the currently running task. When the interrupt routine finishes, it returns the interrupted task to exactly the state it was in before the interrupt happens. However, the interrupt routine may wake up other tasks, and if the newly woken task is at higher priority than the interrupted task, the return-from-interrupt code arranges to call schedule().

While the interrupt is running, timer ticks accrue to the interrupted task.

Moreover, Interrupts may be nested --- in particular, timeslices can expire when in interrupt code. So there's need for special code to transfer the `interrupted' state from the previous to the next task at schedule time, if necessary.

Accounting

The idea is to have a single clock that says when you last entered a state; a clock for each microstate to be tracked, and a pointer to the current microstate clock.

On state change, the current time minus state-change-time is added to the current clock; current clock is set to the new state's clock, and state-change-time is set to the current time.

In addition, a timer is set up per interrupt, so time can be measured per IRQ.

If interrupts are shared, this doesn't give as much information as one would like, but for sane architectures there should be enough interrupt lines that sharing is unnecessary.

Other Implementations

Solaris

Solaris allows you to peek at the states of LWP (light weight processes), or what linux would term threads. Microstate accounting is separate to the lrusage type structures, and must be enabled via a /proc control. The microstates measured by Solaris are in /usr/include/sys/msacct.h

LMS_USER

User mode execution time.

LMS_SYSTEM

System (kernel) mode execution time.

LMS_TRAP

Other system trap time (other meaning aside from system call traps, which are accounted for in pr_stime).

LMS_TFAULT

Time sleeping while waiting for a page fault on a text page to be resolved.

LMS_DFAULT

Time sleeping while waiting for a page fault on a data page to be resolved.

LMS_KFAULT

Time sleeping while waiting for a kernel page fault to be resolved.

LMS_USER_LOCK

Time sleeping while waiting for a user lock (e.g. user-defined mutex lock).

LMS_SLEEP

Any sleep time not accounted for in the above fields.

LMS_WAIT_CPU

Time spent waiting for an available CPU.

LMS_STOPPED

Time spent stopped (e.g. job control or debug).

Calls are made at several points in the kernel to switch states.

Solaris offers a toolkits which seem to be the primary way of poking into the microstate accounting data?

Our implementation

There is a patch at http://marc.theaimsgroup.com/?l=linux-kernel&m=105662514529349&w=2 that:

  1. Adds timers for basic states
  2. Updates those timers when called from schedule(), and at other times when the state changes.
  3. Adds /proc/pid/msa to read timers from;

  4. Adds a new system call sys_msa that can obtain either timers for self or for children.

Care was taken not to grab the time too often; when two threads change state simultaneously (e.g., when schedule() is called) there is a single call to get the time.

Unfortunately, one cannot call do_gettimeofday() from an interrupt handler, so there are some problems on NUMA and large SMP machines, and on machines with aggressive power management (e.g., laptops):

States

Queues

The tricky bit is that when the ACTIVE queue is empty, the roles of active and expired queues are flipped. So the MSA machine needs to track when a queue is flipped, and make sure that the appropriate timings happen.

To avoid reading the clock more than necessary, when the queues are flipped, a pre-thread flag is set in the thread that is next to be scduled onto the processor. At the end of schedule() where the timers are updated, the current time is recorded as the last queueflip time; when threads are about to be put onto the processor apparently from the EXPIRED queue, their EXPIRED time is updated to the time in the last queueflip, and the time since last queueflip accounted as ACTIVE time.

SMP

However, it blows out the size of a struct task_struct immensely. Still have to work out something.

Disksleep

Some observations

(by PeterChubb)

This being so, we can get rid of most of the calls to msa_set_timer.

Using Microstate Accoounting

There is a new system call,

typedef __u64 clk_t; 
 
int msa(int ntimers, int which, clk_t *timers) 

The argument which is one of:

#define MSA_SELF 0              /* sum of times for all threads */ 
#define MSA_CHILDREN 1          /* sum of times for waited-for children */ 
#define MSA_THREAD   2          /* times for current thread only */ 

The timers array must be allocated by the caller; it contains ntimers timers. These are filled in by the kernel to give times in nanoseconds.

The times currently tracked are:

timer

Function

ONCPU_USR

Time spent running in user mode

ONCPU_SYS

In-kernel time spent in system calls

ONCPU_TRAP

In-kernel time handling exact exceptions, e.g., IA64 unaligned access traps, or minor page faults

INTERRUPTIBLE_SLEEP

Time spent in slow sleeps, e.g., caused by the sleep(2) system call

UNINTERRUPTIBLE_SLEEP

Time spent in fast sleeps, e.g., waiting for Disc I/O

ONACTIVEQUEUE

Time waiting for CPU, when the thread(s) have a timeslice

ONEXPIREDQUEUE

Time waiting for a new timeslice

ZOMBIE

The time between exiting and the thread's parent calling pthread_join() or wait()

STOPPED

Time in T state

INTERRUPTED

The amount of time stolen from your thread's timeslice to handle interrupts

PAGING_SLEEP

The amount of time sleeping waiting for pages from disc

FUTEX_SLEEP

The amount of time spent sleeping in futexes (e.g., for pthread_mutex operations)

POLL_SLEEP

The time spent sleeping in poll, epoll or select

Implementation Results

On a 2.5GHz P4, I see these timings for getting all the states:

1115.65 nsec per MSA_SELF call
1006.71 nsec per MSA_THREAD call
860.611 nsec per MSA_CHILDREN call
638.323 nsec per gettimeofday() call
708.212 nsec per getrusage() call
524.985 nsec per getpid() call

For getting just the first three states, I see:

785.475 nsec per MSA_SELF call
727.679 nsec per MSA_THREAD call
662.208 nsec per MSA_CHILDREN call
636.934 nsec per gettimeofday() call
708.476 nsec per getrusage() call
524.835 nsec per getpid() call

LMBench results for IA64:

                 L M B E N C H  2 . 0   S U M M A R Y
                 ------------------------------------

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
                                 null     null                       open    signal   signal    fork    execve  /bin/sh
kernel                           call      I/O     stat    fstat    close   install   handle  process  process  process
-----------------------------  -------  -------  -------  -------  -------  -------  -------  -------  -------  -------
2.5.72                           0.366  0.47936    1.481    0.631    2.727    0.614    3.110    250.9    876.9   3541.5
  s.d. (5 runs)                  0.003  0.00215    0.013    0.002    0.014    0.000    0.021      0.0      7.9     15.6
2.5.72msa                        0.364  0.48099    1.543    0.638    2.749    0.615    3.135    249.9    864.9   3560.5
  s.d. (5 runs)                  0.004  0.00467    0.005    0.001    0.008    0.000    0.033      0.0      8.1     12.5
2.5.72msa-uni                    0.347  0.45322    1.335    0.602    2.361    0.578    2.936    238.6    828.7   3426.9
  s.d. (5 runs)                  0.003  0.00177    0.002    0.003    0.005    0.004    0.008      0.0      8.1     24.8
2.5.72uni                        0.348  0.45638    1.329    0.607    2.445    0.584    2.888    236.2    823.5   3429.3
  s.d. (5 runs)                  0.000  0.00298    0.001    0.003    0.009    0.004    0.012      0.0     13.0      4.3

File select - times in microseconds - smaller is better
-------------------------------------------------------
                                select   select   select   select   select   select   select   select
kernel                           10 fd   100 fd   250 fd   500 fd   10 tcp  100 tcp  250 tcp  500 tcp
-----------------------------  -------  -------  -------  -------  -------  -------  -------  -------
2.5.72                           2.064   10.597   24.806   48.351    2.926  18.9793  45.7354  90.1770
  s.d.                           0.006    0.011    0.010    0.025    0.007  0.01876  0.02360  0.03915
2.5.72msa                        2.078   10.571   24.776   48.325    2.953  18.9672  45.7242  90.2188
  s.d.                           0.007    0.003    0.027    0.027    0.010  0.00768  0.02830  0.03659
2.5.72msa-uni                    1.665    7.878   18.045   34.908    2.570  16.2263  38.9363  76.6805
  s.d.                           0.031    0.004    0.008    0.021    0.005  0.00199  0.01046  0.00979
2.5.72uni                        1.701    7.935   18.232   34.869    2.594  16.2680  38.9774  77.0298
  s.d.                           0.005    0.148    0.363    0.004    0.004  0.00536  0.01355  0.74997

Context switching with 0K - times in microseconds - smaller is better
---------------------------------------------------------------------
                                2proc/0k   4proc/0k   8proc/0k  16proc/0k  32proc/0k  64proc/0k  96proc/0k
kernel                         ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch
-----------------------------  ---------  ---------  ---------  ---------  ---------  ---------  ---------
2.5.72                             1.582      1.650      1.830      1.976      2.326      5.714      7.192
  s.d.                             0.154      0.037      0.043      0.024      0.196      0.201      0.100
2.5.72msa                          1.664      1.688      1.850      1.992      2.480      5.812      7.224
  s.d.                             0.089      0.026      0.053      0.040      0.416      0.091      0.254
2.5.72msa-uni                      1.538      1.524      1.608      1.704      2.038      5.896      7.080
  s.d.                             0.038      0.021      0.030      0.033      0.244      0.258      0.104
2.5.72uni                          1.368      1.362      1.424      1.560      1.820      5.010      6.366
  s.d.                             0.018      0.050      0.071      0.078      0.209      0.161      0.121

Context switching with 4K - times in microseconds - smaller is better
---------------------------------------------------------------------
                                2proc/4k   4proc/4k   8proc/4k  16proc/4k  32proc/4k  64proc/4k  96proc/4k
kernel                         ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch
-----------------------------  ---------  ---------  ---------  ---------  ---------  ---------  ---------
2.5.72                             2.174      2.222      2.490      2.770      4.992     10.474     12.228
  s.d.                             0.100      0.075      0.026      0.032      0.488      0.384      0.232
2.5.72msa                          2.340      2.374      2.586      2.912      5.006     10.438     12.260
  s.d.                             0.047      0.023      0.025      0.241      0.271      0.338      0.256
2.5.72msa-uni                      2.154      2.220      2.460      2.624      5.398     10.304     12.252
  s.d.                             0.093      0.040      0.060      0.067      0.523      0.352      0.120
2.5.72uni                          1.978      1.990      2.228      2.520      4.514      9.530     11.476
  s.d.                             0.072      0.059      0.072      0.227      0.269      0.184      0.246

Context switching with 8K - times in microseconds - smaller is better
---------------------------------------------------------------------
                                2proc/8k   4proc/8k   8proc/8k  16proc/8k  32proc/8k  64proc/8k  96proc/8k
kernel                         ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch
-----------------------------  ---------  ---------  ---------  ---------  ---------  ---------  ---------
2.5.72                             2.686      2.864      3.218      3.636      7.440     14.114     16.694
  s.d.                             0.076      0.117      0.148      0.340      1.474      1.690      0.315
2.5.72msa                          2.836      2.966      3.172      3.758      7.182     13.608     16.594
  s.d.                             0.061      0.066      0.164      0.261      1.934      1.294      0.496
2.5.72msa-uni                      2.602      2.648      2.942      3.304      6.366     12.698     16.334
  s.d.                             0.022      0.040      0.136      0.015      0.617      0.259      0.116
2.5.72uni                          2.388      2.492      2.802      3.178      5.486     12.222     15.526
  s.d.                             0.082      0.079      0.104      0.177      0.271      0.559      0.303

Context switching with 16K - times in microseconds - smaller is better
----------------------------------------------------------------------
                               2proc/16k  4proc/16k  8proc/16k  16prc/16k  32prc/16k  64prc/16k  96prc/16k
kernel                         ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch
-----------------------------  ---------  ---------  ---------  ---------  ---------  ---------  ---------
2.5.72                             3.594      3.764      4.104      5.000      9.280     18.318     24.024
  s.d.                             0.133      0.108      0.058      0.197      0.452      1.299      0.711
2.5.72msa                          3.750      3.816      4.124      5.062      9.244     18.716     24.530
  s.d.                             0.047      0.029      0.043      0.508      0.367      2.294      0.991
2.5.72msa-uni                      3.528      3.578      3.902      4.702      9.592     17.664     24.508
  s.d.                             0.090      0.026      0.069      0.171      0.673      1.520      0.292
2.5.72uni                          3.328      3.392      3.702      4.524      8.324     16.982     23.696
  s.d.                             0.075      0.033      0.011      0.207      0.167      0.603      0.822

Context switching with 32K - times in microseconds - smaller is better
----------------------------------------------------------------------
                               2proc/32k  4proc/32k  8proc/32k  16prc/32k  32prc/32k  64prc/32k  96prc/32k
kernel                         ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch
-----------------------------  ---------  ---------  ---------  ---------  ---------  ---------  ---------
2.5.72                             5.382      5.778      7.016      9.378     25.350     43.272     47.668
  s.d.                             0.087      0.064      0.023      0.552      3.138      1.435      0.395
2.5.72msa                          5.498      5.756      7.226     10.680     25.412     43.954     47.604
  s.d.                             0.033      0.042      0.260      3.308      6.214      1.347      0.488
2.5.72msa-uni                      5.250      5.594      6.796      8.360     22.616     45.014     47.936
  s.d.                             0.072      0.081      0.011      1.376      1.668      0.947      0.048
2.5.72uni                          5.082      5.370      6.714      8.792     20.200     44.040     47.044
  s.d.                             0.072      0.025      0.074      1.342      3.342      1.388      0.232

Context switching with 64K - times in microseconds - smaller is better
----------------------------------------------------------------------
                               2proc/64k  4proc/64k  8proc/64k  16prc/64k  32prc/64k  64prc/64k  96prc/64k
kernel                         ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch  ctx swtch
-----------------------------  ---------  ---------  ---------  ---------  ---------  ---------  ---------
2.5.72                             8.976     11.876     13.638     23.206     79.748     85.344     83.982
  s.d.                             0.211      0.146      0.281      3.665      2.682      0.390      0.644
2.5.72msa                          8.974     12.124     13.728     28.410     78.374     85.362     83.666
  s.d.                             0.056      0.095      0.218      7.974      2.087      0.438      0.570
2.5.72msa-uni                      8.938     11.794     13.318     23.628     78.354     85.444     85.632
  s.d.                             0.293      0.123      0.087      2.785      4.147      0.036      0.065
2.5.72uni                          8.598     11.434     13.112     22.652     76.888     84.560     84.874
  s.d.                             0.046      0.075      0.068      3.482      3.091      0.183      0.072

File create/delete and VM system latencies in microseconds - smaller is better
----------------------------------------------------------------------------
                                 0K       0K       1K       1K       4K       4K      10K      10K     Mmap     Prot    Page
kernel                         Create   Delete   Create   Delete   Create   Delete   Create   Delete   Latency  Fault   Fault
------------------------------ -------  -------  -------  -------  -------  -------  -------  -------  -------  ------  ------
2.5.72                           22.98    17.09    43.17    29.33    43.37    29.32    62.27    31.42    377.8   0.269    2.40
  s.d.                            0.40     0.13     0.82     0.05     0.86     0.03     1.04     0.04      2.4   0.071    0.55
2.5.72msa                        23.05    17.08    42.39    29.37    43.10    29.37    61.87    31.43    378.0   0.217    3.00
  s.d.                            0.07     0.11     1.33     0.07     1.60     0.07     1.71     0.04      1.2   0.066    0.00
2.5.72msa-uni                    22.21    15.65    39.51    26.77    39.57    26.89    57.30    28.83    314.0   0.395    2.00
  s.d.                            0.04     0.09     0.91     0.38     0.91     0.04     2.05     0.07      0.7   0.027    0.00
2.5.72uni                        22.11    15.43    40.06    26.64    40.14    26.68    57.74    28.62    303.0   0.475    2.00
  s.d.                            0.06     0.08     1.24     0.03     1.22     0.08     2.10     0.03      0.7   0.067    0.00

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
kernel                           Pipe   AF/Unix     UDP   RPC/UDP     TCP   RPC/TCP  TCPconn
-----------------------------  -------  -------  -------  -------  -------  -------  -------
2.5.72                           6.554    9.925  17.6191  29.3789  22.9711  37.1813   55.731
  s.d.                           0.249    0.171  0.06422  0.21368  0.24221  0.26545    9.929
2.5.72msa                        6.738   10.062  17.6407    33.98  28.3487  42.5031   56.453
  s.d.                           0.270    0.225  0.16372  9.40735  11.6790  10.8331    9.815
2.5.72msa-uni                    5.588    8.862  15.8027  27.1261  21.1931  34.4985   65.535
  s.d.                           0.216    0.230  0.17825  0.21333  0.13276  0.17592    0.243
2.5.72uni                        5.617    8.549  15.7438  26.6860  20.5699  34.0435   64.453
  s.d.                           0.238    0.029  0.05222  0.18369  0.18819  0.20809    0.364

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
                                                             File     Mmap    Bcopy    Bcopy   Memory   Memory
kernel                           Pipe   AF/Unix    TCP     reread   reread   (libc)   (hand)     read    write
-----------------------------  -------  -------  -------  -------  -------  -------  -------  -------  -------
2.5.72                         2124.75  1638.22   803.74  1092.20   590.50   665.70   388.60   590.40   563.52
  s.d.                            8.74     7.37   390.78     0.94     0.12     2.47     0.20     0.18     0.69
2.5.72msa                      2047.26  1636.50   667.02  1094.64   590.05   665.73   388.34   590.11   561.00
  s.d.                           45.28    12.16   171.21     0.49     0.10     1.94     0.30     0.17     1.59
2.5.72msa-uni                  2209.01  1733.66   762.21  1098.71   590.58   661.02   388.57   590.73   561.90
  s.d.                           40.74    22.60    13.05     1.42     0.27     1.83     0.29     0.11     3.85
2.5.72uni                      2296.26  1816.51   778.47  1097.82   590.70   665.05   388.70   590.89   560.51
  s.d.                            5.85    87.26    16.86     2.08     0.24     1.94     0.48     0.19     4.61

*Local* More Communication bandwidths in MB/s - bigger is better
----------------------------------------------------------------
                                  File     Mmap  Aligned  Partial  Partial  Partial  Partial  
OS                                open     open    Bcopy    Bcopy     Mmap     Mmap     Mmap    Bzero
                                 close    close   (libc)   (hand)     read    write   rd/wrt     copy     HTTP
-----------------------------  -------  -------  -------  -------  -------  -------  -------  -------  -------
2.5.72                         1099.39   550.74   667.12   683.64   773.95  1618.59   474.37  2383.52   17.554
  s.d.                            1.72     0.41     1.95     3.86     0.24     4.70     0.11     7.21    0.232
2.5.72msa                      1102.10   551.19   666.40   683.22   773.59  1599.19   474.07  2368.85   17.502
  s.d.                            0.66     0.12     0.77     3.03     0.17    14.88     0.26     7.57    0.106
2.5.72msa-uni                  1105.21   555.96   661.72   678.13   774.29  1597.93   474.42  2375.95   18.912
  s.d.                            1.54     0.60     2.90     4.96     0.12    37.92     0.30    11.10    0.089
2.5.72uni                      1104.45   557.32   666.72   683.38   774.64  1568.61   474.16  2370.79   19.120
  s.d.                            1.11     0.53     3.00     3.34     0.25    50.60     0.63    16.35    0.126

Memory latencies in nanoseconds - smaller is better
---------------------------------------------------
kernel                          Mhz     L1 $     L2 $    Main mem
-----------------------------  -----  -------  -------  ---------
2.5.72                           900    2.227    6.684     122.69
  s.d.                             0    0.232    0.232       0.23
2.5.72msa                        900    2.227    6.686     123.25
  s.d.                             0    0.106    0.106       0.11
2.5.72msa-uni                    900    2.226    6.682     122.59
  s.d.                             0    0.089    0.089       0.09
2.5.72uni                        900    2.225    6.680     122.54
  s.d.                             0    0.126    0.126       0.13

IA64wiki: MicroStateAccounting (last edited 2009-12-10 03:13:56 by localhost)

Gelato@UNSW is sponsored by
the University of New South Wales National ICT Australia The Gelato Federation Hewlett-Packard Company Australian Research Council
Please contact us with any questions or comments.