Micro-State Accounting
Contents
Current Work is on the /ToDo page.
A paper was published in AUUG2003: details on our publications page.
See also the /DownLoad page.
In a standard Linux system, the only information about where your time went is held in user, system and real times. These times are sampled statistically --- each time a clock tick fires, a counter is updated in the currently running task.
For the purposes of tuning the system, it'd be nice not only to get more precise indications of how much time was spent in the kernel and in user mode, but also to see where the rest of the time goes.
Time is stolen from tasks to handle interrupts. Such time shows up (if at all) as slightly greater system time.
Process States
Threads move from state to state according to the diagram here
The states reported by ps are a subset of the states we want to account for; in particular, the ps states do not distinguish between three runnable states: on-cpu, on-runqueue, and interrupted.
The interrupted state is an interesting one: usually when an interrupt arrives in Linux it steals time from the currently running task. When the interrupt routine finishes, it returns the interrupted task to exactly the state it was in before the interrupt happens. However, the interrupt routine may wake up other tasks, and if the newly woken task is at higher priority than the interrupted task, the return-from-interrupt code arranges to call schedule().
While the interrupt is running, timer ticks accrue to the interrupted task.
Moreover, Interrupts may be nested --- in particular, timeslices can expire when in interrupt code. So there's need for special code to transfer the `interrupted' state from the previous to the next task at schedule time, if necessary.
Accounting
The idea is to have a single clock that says when you last entered a state; a clock for each microstate to be tracked, and a pointer to the current microstate clock.
On state change, the current time minus state-change-time is added to the current clock; current clock is set to the new state's clock, and state-change-time is set to the current time.
In addition, a timer is set up per interrupt, so time can be measured per IRQ.
If interrupts are shared, this doesn't give as much information as one would like, but for sane architectures there should be enough interrupt lines that sharing is unnecessary.
Other Implementations
Solaris
Solaris allows you to peek at the states of LWP (light weight processes), or what linux would term threads. Microstate accounting is separate to the lrusage type structures, and must be enabled via a /proc control. The microstates measured by Solaris are in /usr/include/sys/msacct.h
LMS_USER |
User mode execution time. |
LMS_SYSTEM |
System (kernel) mode execution time. |
LMS_TRAP |
Other system trap time (other meaning aside from system call traps, which are accounted for in pr_stime). |
LMS_TFAULT |
Time sleeping while waiting for a page fault on a text page to be resolved. |
LMS_DFAULT |
Time sleeping while waiting for a page fault on a data page to be resolved. |
LMS_KFAULT |
Time sleeping while waiting for a kernel page fault to be resolved. |
LMS_USER_LOCK |
Time sleeping while waiting for a user lock (e.g. user-defined mutex lock). |
LMS_SLEEP |
Any sleep time not accounted for in the above fields. |
LMS_WAIT_CPU |
Time spent waiting for an available CPU. |
LMS_STOPPED |
Time spent stopped (e.g. job control or debug). |
Calls are made at several points in the kernel to switch states.
Solaris offers a toolkits which seem to be the primary way of poking into the microstate accounting data?
Our implementation
There is a patch at http://marc.theaimsgroup.com/?l=linux-kernel&m=105662514529349&w=2 that:
- Adds timers for basic states
- Updates those timers when called from schedule(), and at other times when the state changes.
Adds /proc/pid/msa to read timers from;
Adds a new system call sys_msa that can obtain either timers for self or for children.
Care was taken not to grab the time too often; when two threads change state simultaneously (e.g., when schedule() is called) there is a single call to get the time.
Unfortunately, one cannot call do_gettimeofday() from an interrupt handler, so there are some problems on NUMA and large SMP machines, and on machines with aggressive power management (e.g., laptops):
- The code assumes that all on-processor timer registers have the same frequency, and are synchronised. This may not be true for SMP or NUMA machines.
- The code assumes that on-process timers have a constant frequency. This isn't true in the presence of power saving measures that adjust processor clock speed.
States
- ONCPU : updated when the thread gets put onto the CPU
- ONRUNQUEUE : updated when the thread is put into the ACTIVE queue
- ONEXPIRED : updated when the process is put onto the EXPIRED queue.
- INTERRUPTED : updated when the thread is interrupted
- UNINTERRUPTIBLE_SLEEP : updated when the thread goes to sleep on a disk i/o
- INTERRUPTIBLE_SLEEP : updated when the thread gets put to sleep
- ZOMBIE : updated when the thread is a zombie (zombie tasks need to be kept on the runqueue so they can be reaped, but they can't actually do anything)
Queues
A process is either running, on the run queue, but the run queue consists of two queues, active and expired. A process gets put on the expired queue if it is still running at the end of its timeslice. This lowers its chance of running again, unless it is an interactive process where it gets put back on the run queue. So keeping track of which queue the process is on might be handy.
The tricky bit is that when the ACTIVE queue is empty, the roles of active and expired queues are flipped. So the MSA machine needs to track when a queue is flipped, and make sure that the appropriate timings happen.
To avoid reading the clock more than necessary, when the queues are flipped, a pre-thread flag is set in the thread that is next to be scduled onto the processor. At the end of schedule() where the timers are updated, the current time is recorded as the last queueflip time; when threads are about to be put onto the processor apparently from the EXPIRED queue, their EXPIRED time is updated to the time in the last queueflip, and the time since last queueflip accounted as ACTIVE time.
SMP
- ITC is not the same across processors. What probably needs to happen is the accounting is kept in a per-processor array; this idea fits in ok because the scheduler keeps a close eye on processor affinity.
However, it blows out the size of a struct task_struct immensely. Still have to work out something.
Disksleep
- the scheduler doesn't have an explicit record of what processes is waiting on a disk.
- io_schedule() increments the i/o wait counter for the runqueue but then calls schedule -- we need to keep track so we don't just blindly change its status.
- So what we do is ignore the actual wait cause, and just record UNINTERRUPTIBLE sleep. The user can then correlate that with other statistics (such as number of block I/Os from vmstat) to work out if it is indeed sleeping on disk.
Some observations
(by PeterChubb)
- D state corresponds to TASK_UNINTERRUPTIBLE (DISKSLEEP)
- S state corresponds to TASK_INTERRUPTIBLE (USERSLEEP)
- Other important states are ONCPU, ONACTIVE, ONEXPIRED, ZOMBIE, STOPPED, INTERRUPTED.
- We'll eventually want to distinguish substates of TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE, based on the equivalent of wchan, but we don't do that now.
This being so, we can get rid of most of the calls to msa_set_timer.
we need a hook in __activate_task() where a task is put onto the active queue, to flip from DISKSLEEP or USERSLEEP to ONACTIVE.
we need two hooks in schedule(), one for a plain context switch (one thread goes ONACTIVE->ONCPU, another thread goes ONCPU->ONACTIVE or ONCPU->ONEXPIRED),the other when the ACTIVE and EXPIRED queues are swopped.
Using Microstate Accoounting
There is a new system call,
typedef __u64 clk_t; int msa(int ntimers, int which, clk_t *timers)
The argument which is one of:
#define MSA_SELF 0 /* sum of times for all threads */ #define MSA_CHILDREN 1 /* sum of times for waited-for children */ #define MSA_THREAD 2 /* times for current thread only */
The timers array must be allocated by the caller; it contains ntimers timers. These are filled in by the kernel to give times in nanoseconds.
The times currently tracked are:
timer |
Function |
ONCPU_USR |
Time spent running in user mode |
ONCPU_SYS |
In-kernel time spent in system calls |
ONCPU_TRAP |
In-kernel time handling exact exceptions, e.g., IA64 unaligned access traps, or minor page faults |
INTERRUPTIBLE_SLEEP |
Time spent in slow sleeps, e.g., caused by the sleep(2) system call |
UNINTERRUPTIBLE_SLEEP |
Time spent in fast sleeps, e.g., waiting for Disc I/O |
ONACTIVEQUEUE |
Time waiting for CPU, when the thread(s) have a timeslice |
ONEXPIREDQUEUE |
Time waiting for a new timeslice |
ZOMBIE |
The time between exiting and the thread's parent calling pthread_join() or wait() |
STOPPED |
Time in T state |
INTERRUPTED |
The amount of time stolen from your thread's timeslice to handle interrupts |
PAGING_SLEEP |
The amount of time sleeping waiting for pages from disc |
FUTEX_SLEEP |
The amount of time spent sleeping in futexes (e.g., for pthread_mutex operations) |
POLL_SLEEP |
The time spent sleeping in poll, epoll or select |
Implementation Results
On a 2.5GHz P4, I see these timings for getting all the states:
1115.65 nsec per MSA_SELF call 1006.71 nsec per MSA_THREAD call 860.611 nsec per MSA_CHILDREN call 638.323 nsec per gettimeofday() call 708.212 nsec per getrusage() call 524.985 nsec per getpid() call
For getting just the first three states, I see:
785.475 nsec per MSA_SELF call 727.679 nsec per MSA_THREAD call 662.208 nsec per MSA_CHILDREN call 636.934 nsec per gettimeofday() call 708.476 nsec per getrusage() call 524.835 nsec per getpid() call
LMBench results for IA64:
L M B E N C H 2 . 0 S U M M A R Y
------------------------------------
Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
null null open signal signal fork execve /bin/sh
kernel call I/O stat fstat close install handle process process process
----------------------------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
2.5.72 0.366 0.47936 1.481 0.631 2.727 0.614 3.110 250.9 876.9 3541.5
s.d. (5 runs) 0.003 0.00215 0.013 0.002 0.014 0.000 0.021 0.0 7.9 15.6
2.5.72msa 0.364 0.48099 1.543 0.638 2.749 0.615 3.135 249.9 864.9 3560.5
s.d. (5 runs) 0.004 0.00467 0.005 0.001 0.008 0.000 0.033 0.0 8.1 12.5
2.5.72msa-uni 0.347 0.45322 1.335 0.602 2.361 0.578 2.936 238.6 828.7 3426.9
s.d. (5 runs) 0.003 0.00177 0.002 0.003 0.005 0.004 0.008 0.0 8.1 24.8
2.5.72uni 0.348 0.45638 1.329 0.607 2.445 0.584 2.888 236.2 823.5 3429.3
s.d. (5 runs) 0.000 0.00298 0.001 0.003 0.009 0.004 0.012 0.0 13.0 4.3
File select - times in microseconds - smaller is better
-------------------------------------------------------
select select select select select select select select
kernel 10 fd 100 fd 250 fd 500 fd 10 tcp 100 tcp 250 tcp 500 tcp
----------------------------- ------- ------- ------- ------- ------- ------- ------- -------
2.5.72 2.064 10.597 24.806 48.351 2.926 18.9793 45.7354 90.1770
s.d. 0.006 0.011 0.010 0.025 0.007 0.01876 0.02360 0.03915
2.5.72msa 2.078 10.571 24.776 48.325 2.953 18.9672 45.7242 90.2188
s.d. 0.007 0.003 0.027 0.027 0.010 0.00768 0.02830 0.03659
2.5.72msa-uni 1.665 7.878 18.045 34.908 2.570 16.2263 38.9363 76.6805
s.d. 0.031 0.004 0.008 0.021 0.005 0.00199 0.01046 0.00979
2.5.72uni 1.701 7.935 18.232 34.869 2.594 16.2680 38.9774 77.0298
s.d. 0.005 0.148 0.363 0.004 0.004 0.00536 0.01355 0.74997
Context switching with 0K - times in microseconds - smaller is better
---------------------------------------------------------------------
2proc/0k 4proc/0k 8proc/0k 16proc/0k 32proc/0k 64proc/0k 96proc/0k
kernel ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
----------------------------- --------- --------- --------- --------- --------- --------- ---------
2.5.72 1.582 1.650 1.830 1.976 2.326 5.714 7.192
s.d. 0.154 0.037 0.043 0.024 0.196 0.201 0.100
2.5.72msa 1.664 1.688 1.850 1.992 2.480 5.812 7.224
s.d. 0.089 0.026 0.053 0.040 0.416 0.091 0.254
2.5.72msa-uni 1.538 1.524 1.608 1.704 2.038 5.896 7.080
s.d. 0.038 0.021 0.030 0.033 0.244 0.258 0.104
2.5.72uni 1.368 1.362 1.424 1.560 1.820 5.010 6.366
s.d. 0.018 0.050 0.071 0.078 0.209 0.161 0.121
Context switching with 4K - times in microseconds - smaller is better
---------------------------------------------------------------------
2proc/4k 4proc/4k 8proc/4k 16proc/4k 32proc/4k 64proc/4k 96proc/4k
kernel ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
----------------------------- --------- --------- --------- --------- --------- --------- ---------
2.5.72 2.174 2.222 2.490 2.770 4.992 10.474 12.228
s.d. 0.100 0.075 0.026 0.032 0.488 0.384 0.232
2.5.72msa 2.340 2.374 2.586 2.912 5.006 10.438 12.260
s.d. 0.047 0.023 0.025 0.241 0.271 0.338 0.256
2.5.72msa-uni 2.154 2.220 2.460 2.624 5.398 10.304 12.252
s.d. 0.093 0.040 0.060 0.067 0.523 0.352 0.120
2.5.72uni 1.978 1.990 2.228 2.520 4.514 9.530 11.476
s.d. 0.072 0.059 0.072 0.227 0.269 0.184 0.246
Context switching with 8K - times in microseconds - smaller is better
---------------------------------------------------------------------
2proc/8k 4proc/8k 8proc/8k 16proc/8k 32proc/8k 64proc/8k 96proc/8k
kernel ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
----------------------------- --------- --------- --------- --------- --------- --------- ---------
2.5.72 2.686 2.864 3.218 3.636 7.440 14.114 16.694
s.d. 0.076 0.117 0.148 0.340 1.474 1.690 0.315
2.5.72msa 2.836 2.966 3.172 3.758 7.182 13.608 16.594
s.d. 0.061 0.066 0.164 0.261 1.934 1.294 0.496
2.5.72msa-uni 2.602 2.648 2.942 3.304 6.366 12.698 16.334
s.d. 0.022 0.040 0.136 0.015 0.617 0.259 0.116
2.5.72uni 2.388 2.492 2.802 3.178 5.486 12.222 15.526
s.d. 0.082 0.079 0.104 0.177 0.271 0.559 0.303
Context switching with 16K - times in microseconds - smaller is better
----------------------------------------------------------------------
2proc/16k 4proc/16k 8proc/16k 16prc/16k 32prc/16k 64prc/16k 96prc/16k
kernel ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
----------------------------- --------- --------- --------- --------- --------- --------- ---------
2.5.72 3.594 3.764 4.104 5.000 9.280 18.318 24.024
s.d. 0.133 0.108 0.058 0.197 0.452 1.299 0.711
2.5.72msa 3.750 3.816 4.124 5.062 9.244 18.716 24.530
s.d. 0.047 0.029 0.043 0.508 0.367 2.294 0.991
2.5.72msa-uni 3.528 3.578 3.902 4.702 9.592 17.664 24.508
s.d. 0.090 0.026 0.069 0.171 0.673 1.520 0.292
2.5.72uni 3.328 3.392 3.702 4.524 8.324 16.982 23.696
s.d. 0.075 0.033 0.011 0.207 0.167 0.603 0.822
Context switching with 32K - times in microseconds - smaller is better
----------------------------------------------------------------------
2proc/32k 4proc/32k 8proc/32k 16prc/32k 32prc/32k 64prc/32k 96prc/32k
kernel ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
----------------------------- --------- --------- --------- --------- --------- --------- ---------
2.5.72 5.382 5.778 7.016 9.378 25.350 43.272 47.668
s.d. 0.087 0.064 0.023 0.552 3.138 1.435 0.395
2.5.72msa 5.498 5.756 7.226 10.680 25.412 43.954 47.604
s.d. 0.033 0.042 0.260 3.308 6.214 1.347 0.488
2.5.72msa-uni 5.250 5.594 6.796 8.360 22.616 45.014 47.936
s.d. 0.072 0.081 0.011 1.376 1.668 0.947 0.048
2.5.72uni 5.082 5.370 6.714 8.792 20.200 44.040 47.044
s.d. 0.072 0.025 0.074 1.342 3.342 1.388 0.232
Context switching with 64K - times in microseconds - smaller is better
----------------------------------------------------------------------
2proc/64k 4proc/64k 8proc/64k 16prc/64k 32prc/64k 64prc/64k 96prc/64k
kernel ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
----------------------------- --------- --------- --------- --------- --------- --------- ---------
2.5.72 8.976 11.876 13.638 23.206 79.748 85.344 83.982
s.d. 0.211 0.146 0.281 3.665 2.682 0.390 0.644
2.5.72msa 8.974 12.124 13.728 28.410 78.374 85.362 83.666
s.d. 0.056 0.095 0.218 7.974 2.087 0.438 0.570
2.5.72msa-uni 8.938 11.794 13.318 23.628 78.354 85.444 85.632
s.d. 0.293 0.123 0.087 2.785 4.147 0.036 0.065
2.5.72uni 8.598 11.434 13.112 22.652 76.888 84.560 84.874
s.d. 0.046 0.075 0.068 3.482 3.091 0.183 0.072
File create/delete and VM system latencies in microseconds - smaller is better
----------------------------------------------------------------------------
0K 0K 1K 1K 4K 4K 10K 10K Mmap Prot Page
kernel Create Delete Create Delete Create Delete Create Delete Latency Fault Fault
------------------------------ ------- ------- ------- ------- ------- ------- ------- ------- ------- ------ ------
2.5.72 22.98 17.09 43.17 29.33 43.37 29.32 62.27 31.42 377.8 0.269 2.40
s.d. 0.40 0.13 0.82 0.05 0.86 0.03 1.04 0.04 2.4 0.071 0.55
2.5.72msa 23.05 17.08 42.39 29.37 43.10 29.37 61.87 31.43 378.0 0.217 3.00
s.d. 0.07 0.11 1.33 0.07 1.60 0.07 1.71 0.04 1.2 0.066 0.00
2.5.72msa-uni 22.21 15.65 39.51 26.77 39.57 26.89 57.30 28.83 314.0 0.395 2.00
s.d. 0.04 0.09 0.91 0.38 0.91 0.04 2.05 0.07 0.7 0.027 0.00
2.5.72uni 22.11 15.43 40.06 26.64 40.14 26.68 57.74 28.62 303.0 0.475 2.00
s.d. 0.06 0.08 1.24 0.03 1.22 0.08 2.10 0.03 0.7 0.067 0.00
*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
kernel Pipe AF/Unix UDP RPC/UDP TCP RPC/TCP TCPconn
----------------------------- ------- ------- ------- ------- ------- ------- -------
2.5.72 6.554 9.925 17.6191 29.3789 22.9711 37.1813 55.731
s.d. 0.249 0.171 0.06422 0.21368 0.24221 0.26545 9.929
2.5.72msa 6.738 10.062 17.6407 33.98 28.3487 42.5031 56.453
s.d. 0.270 0.225 0.16372 9.40735 11.6790 10.8331 9.815
2.5.72msa-uni 5.588 8.862 15.8027 27.1261 21.1931 34.4985 65.535
s.d. 0.216 0.230 0.17825 0.21333 0.13276 0.17592 0.243
2.5.72uni 5.617 8.549 15.7438 26.6860 20.5699 34.0435 64.453
s.d. 0.238 0.029 0.05222 0.18369 0.18819 0.20809 0.364
*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
File Mmap Bcopy Bcopy Memory Memory
kernel Pipe AF/Unix TCP reread reread (libc) (hand) read write
----------------------------- ------- ------- ------- ------- ------- ------- ------- ------- -------
2.5.72 2124.75 1638.22 803.74 1092.20 590.50 665.70 388.60 590.40 563.52
s.d. 8.74 7.37 390.78 0.94 0.12 2.47 0.20 0.18 0.69
2.5.72msa 2047.26 1636.50 667.02 1094.64 590.05 665.73 388.34 590.11 561.00
s.d. 45.28 12.16 171.21 0.49 0.10 1.94 0.30 0.17 1.59
2.5.72msa-uni 2209.01 1733.66 762.21 1098.71 590.58 661.02 388.57 590.73 561.90
s.d. 40.74 22.60 13.05 1.42 0.27 1.83 0.29 0.11 3.85
2.5.72uni 2296.26 1816.51 778.47 1097.82 590.70 665.05 388.70 590.89 560.51
s.d. 5.85 87.26 16.86 2.08 0.24 1.94 0.48 0.19 4.61
*Local* More Communication bandwidths in MB/s - bigger is better
----------------------------------------------------------------
File Mmap Aligned Partial Partial Partial Partial
OS open open Bcopy Bcopy Mmap Mmap Mmap Bzero
close close (libc) (hand) read write rd/wrt copy HTTP
----------------------------- ------- ------- ------- ------- ------- ------- ------- ------- -------
2.5.72 1099.39 550.74 667.12 683.64 773.95 1618.59 474.37 2383.52 17.554
s.d. 1.72 0.41 1.95 3.86 0.24 4.70 0.11 7.21 0.232
2.5.72msa 1102.10 551.19 666.40 683.22 773.59 1599.19 474.07 2368.85 17.502
s.d. 0.66 0.12 0.77 3.03 0.17 14.88 0.26 7.57 0.106
2.5.72msa-uni 1105.21 555.96 661.72 678.13 774.29 1597.93 474.42 2375.95 18.912
s.d. 1.54 0.60 2.90 4.96 0.12 37.92 0.30 11.10 0.089
2.5.72uni 1104.45 557.32 666.72 683.38 774.64 1568.61 474.16 2370.79 19.120
s.d. 1.11 0.53 3.00 3.34 0.25 50.60 0.63 16.35 0.126
Memory latencies in nanoseconds - smaller is better
---------------------------------------------------
kernel Mhz L1 $ L2 $ Main mem
----------------------------- ----- ------- ------- ---------
2.5.72 900 2.227 6.684 122.69
s.d. 0 0.232 0.232 0.23
2.5.72msa 900 2.227 6.686 123.25
s.d. 0 0.106 0.106 0.11
2.5.72msa-uni 900 2.226 6.682 122.59
s.d. 0 0.089 0.089 0.09
2.5.72uni 900 2.225 6.680 122.54
s.d. 0 0.126 0.126 0.13

