Benchmarks of the NPTL library on IA64
Update
As at September 2004 these results are mostly of historical interest; the libraries and kernel have undergone significant changes that will alter performance.
Currently the people at Bull Open Source are doing some scalability and performance testing on a wide range of hardware.
Their results are available in their forums.
Notes
- Tests are run by setting up an alarm timer for 5 seconds and measuring how many operations we can perform in that time.
- Tests were run on a single-processor 900MHz Itanium 2 (rx2600) and a 2.53GHz Pentium 4.
- Itanium tests were run with kernel 2.5.64, libc cvs and NPTL 0.34.
The latest patch is against libc cvs @ 2003-04-22 and NPTL 0.36, plus the fastsyscall patch (libc cvs part, nptl part). This patch is not maintained -- fast syscall handling has changed.
A Debian package of a glibc built with these patches is available. Just add
deb http://www.gelato.unsw.edu.au/~ianw/libc-nptl/ ./
to your /etc/apt/sources.list and run apt-get update, then apt-get install libc6.1 libc6.1-dev. You will need to be using unstable and a kernel >= 2.5.67 or so for this to work.
- Pentium tests were run under Red Hat 9 (closest to NPTL 0.29, I believe).
Source for these tests available from http://www.gelato.unsw.edu.au/patches/pthreadbench/pthreadbench.tar.gz
- These tests are only meant to be indicative.
Ideas for more tests, criticism, etc most welcome <ianw AT NOSPAM gelato DOT unsw DOT edu DOT au>
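The measurement approach described in the first note (arm a timer, repeat an operation until it fires, then divide the count by the elapsed time) can be sketched roughly as follows. This is not the code from the pthreadbench tarball: the names `ops_per_second` and `example_op` are hypothetical, and `setitimer` is used in place of a plain `alarm(5)` only so that the window can be shorter than one second when trying it out.

```c
#define _XOPEN_SOURCE 600
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/time.h>

/* Set by the SIGALRM handler when the measurement window closes. */
static volatile sig_atomic_t time_up;

static void on_alarm(int signo)
{
    (void)signo;
    time_up = 1;
}

/* Hypothetical operation under test: bump a counter passed by the caller. */
void example_op(void *arg)
{
    ++*(unsigned long *)arg;
}

/*
 * Run op(arg) repeatedly for roughly `ms` milliseconds and return the
 * achieved rate in operations per second, or -1.0 if setup fails.
 */
double ops_per_second(void (*op)(void *), void *arg, long ms)
{
    struct sigaction sa;
    struct itimerval timer;
    struct timeval start, end;
    unsigned long count = 0;
    double elapsed;

    if (ms <= 0)
        return -1.0;            /* a zero timer value would never fire */

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1.0;

    memset(&timer, 0, sizeof timer);    /* one-shot: no repeat interval */
    timer.it_value.tv_sec = ms / 1000;
    timer.it_value.tv_usec = (ms % 1000) * 1000;

    time_up = 0;
    gettimeofday(&start, NULL);
    if (setitimer(ITIMER_REAL, &timer, NULL) != 0)
        return -1.0;

    while (!time_up) {
        op(arg);
        count++;
    }
    gettimeofday(&end, NULL);

    elapsed = (end.tv_sec - start.tv_sec)
            + (end.tv_usec - start.tv_usec) / 1e6;
    return count / elapsed;
}
```

Calling `ops_per_second(example_op, &counter, 5000)` would give a 5 second window like the one used for the results on this page.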
Updated 14/04/03 : added uncontested test, cleaned up code, added a few 'back of the envelope' calculations
Updated 22/04/03 : update patch to NPTL 0.36
Updated 23/04/03 : add page size comparisons, changed formatting.
Updated 05/05/03 : updated debian packages to 0.37 -- 0.36 packages were missing include files (fixed now)
Updated 14/08/03 : updated debian packages to 0.56 -- careful with these, they may crash your kernel. Fast syscall handling has changed, so my patches have been removed.
Updated 09/09/03 : I presented a paper on these results to the AUUG 2003 Winter conference (slides). This has slightly updated results over those listed below; we are still interested to hear if anyone else is looking at the performance of the new libraries.
Updated 29/09/03 : I have created an online-viewable cvs repository of NPTL so far at http://www.gelato.unsw.edu.au/cgi-bin/viewcvs.cgi/ianw/nptl/
Updated 16/12/03 : Much of this is redundant -- NPTL is now in glibc CVS, and Debian has full NPTL support. The current .deb's in the archive given above contain a glibc with David Mosberger's fast system call patch.
Tests
Lifecycle
Test the thread life cycle by creating a thread and joining it as many times as we can in the specified time.
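A minimal sketch of the create/join loop at the heart of this test is shown below. The function name `lifecycle_cycles` is hypothetical, and the loop runs a fixed number of times rather than until a timer fires, so it is only an illustration of the operation being counted, not the benchmark source itself.

```c
#include <pthread.h>
#include <stddef.h>

/* Thread body: do nothing, so only creation and teardown are measured. */
static void *empty_thread(void *arg)
{
    return arg;
}

/*
 * Create and join `n` threads one after another.
 * Returns the number of complete create/join cycles (n on success).
 */
unsigned long lifecycle_cycles(unsigned long n)
{
    unsigned long done;
    pthread_t tid;

    for (done = 0; done < n; done++) {
        if (pthread_create(&tid, NULL, empty_thread, NULL) != 0)
            break;
        if (pthread_join(tid, NULL) != 0)
            break;
    }
    return done;
}
```

Because each thread is joined before the next is created, at most one extra thread exists at any time, so the figure reflects creation and teardown cost rather than scheduling many threads at once.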
Results

| Library | Average/second | Compared to NPTL (IA-64) |
|---|---|---|
| Linux Threads (Pentium) | 25573 | 71% fewer threads created |
| NPTL (Pentium) | 229619 | 158% more threads created |
| Linux Threads (IA-64) | 12330 | 86% fewer threads created |
| NPTL (IA-64) | 88899 | - |
Context
Two threads fight over a variable protected by mutexes. This just makes the two threads switch between each other as often as they possibly can.
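One way to force two threads to alternate like this is sketched below. The description above mentions only mutexes; this sketch swaps in a mutex plus a condition variable and a `turn` flag to get strict alternation, which is an assumption and not necessarily how the benchmark does it. All names (`pingpong`, `player`, `run_pingpong`) are hypothetical, and the run stops after a fixed number of hand-overs instead of a timer.

```c
#include <pthread.h>
#include <stddef.h>

struct pingpong {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    int             turn;      /* id of the thread allowed to run next */
    unsigned long   switches;  /* hand-overs completed so far */
    unsigned long   limit;     /* stop after this many hand-overs */
};

struct player {
    struct pingpong *game;
    int id;                    /* 0 or 1 */
};

static void *play(void *arg)
{
    struct player *me = arg;
    struct pingpong *g = me->game;

    pthread_mutex_lock(&g->lock);
    for (;;) {
        /* Sleep until it is our turn or the run is over. */
        while (g->turn != me->id && g->switches < g->limit)
            pthread_cond_wait(&g->changed, &g->lock);
        if (g->switches >= g->limit)
            break;
        g->turn = 1 - me->id;  /* hand the variable to the other thread */
        g->switches++;
        pthread_cond_signal(&g->changed);
    }
    pthread_mutex_unlock(&g->lock);
    return NULL;
}

/* Returns the number of hand-overs performed (limit on success). */
unsigned long run_pingpong(unsigned long limit)
{
    struct pingpong g;
    struct player p[2];
    pthread_t tid[2];
    int started, i;

    pthread_mutex_init(&g.lock, NULL);
    pthread_cond_init(&g.changed, NULL);
    g.turn = 0;
    g.switches = 0;
    g.limit = limit;

    for (started = 0; started < 2; started++) {
        p[started].game = &g;
        p[started].id = started;
        if (pthread_create(&tid[started], NULL, play, &p[started]) != 0) {
            /* Stop any thread already running, then give up. */
            pthread_mutex_lock(&g.lock);
            g.limit = 0;
            pthread_cond_broadcast(&g.changed);
            pthread_mutex_unlock(&g.lock);
            break;
        }
    }
    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    pthread_cond_destroy(&g.changed);
    pthread_mutex_destroy(&g.lock);
    return g.switches;
}
```

Each hand-over puts the current thread to sleep and wakes the other, so on a single processor every counted hand-over corresponds to a context switch.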
Results

| Library | Average/second | Compared to NPTL (IA-64) |
|---|---|---|
| Linux Threads (Pentium) | 145010 | 52% fewer context switches |
| NPTL (Pentium) | 342365 | 14% more context switches |
| Linux Threads (IA-64) | 182826 | 39% fewer context switches |
| NPTL (IA-64) | 301656 | - |

| Library | Average Time / Switch - (Locking Overhead) | Compared to Machine Cycles |
|---|---|---|
| Linux Threads (Pentium) | 6896ns - 132ns = 6764ns | 6764ns * (2.5 cycles per ns) = 16910 machine cycles |
| NPTL (Pentium) | 2920ns - 70.5ns = 2849ns | 2849ns * (2.5 cycles per ns) = 7122 machine cycles |
| Linux Threads (IA-64) | 5470ns - 158ns = 5312ns | 5312ns * (0.9 cycles per ns) = 4781 machine cycles |
| NPTL (IA-64) | 3315ns - 49ns = 3266ns | 3266ns * (0.9 cycles per ns) = 2939 machine cycles |

Locking overhead is taken as an estimate from the uncontested benchmark.
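The back-of-the-envelope conversion used in this table (rate to nanoseconds per switch, minus the locking overhead, times the clock rate) can be written as a small helper. The function name `switch_cost_cycles` is hypothetical; the arithmetic is simply the one shown in the rows above.

```c
/*
 * Estimate the cost of one context switch in machine cycles.
 *   switches_per_sec  measured rate from the benchmark
 *   lock_overhead_ns  locking overhead estimate to subtract, in ns
 *   cycles_per_ns     clock rate (2.5 for the Pentium 4, 0.9 for the Itanium 2)
 */
double switch_cost_cycles(double switches_per_sec,
                          double lock_overhead_ns,
                          double cycles_per_ns)
{
    double ns_per_switch = 1e9 / switches_per_sec;

    return (ns_per_switch - lock_overhead_ns) * cycles_per_ns;
}
```

For the Linux Threads (Pentium) row, `switch_cost_cycles(145010, 132, 2.5)` gives roughly 16910 cycles.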
Wakeup
Have 10 worker threads with one master thread. The workers wait on a condition variable for a "queue" that the master thread fills. When the queue is full, the master does a signal wake, and the woken worker thread processes the queue and returns. This shows how quickly condition variables respond.
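The master/worker pattern described here might look roughly like the sketch below. It is an illustration under stated assumptions, not the benchmark source: the queue is reduced to an item counter, `QUEUE_SIZE` is an arbitrary value, a second condition variable (`drained`) is added so the master knows when to refill, and the run ends after a fixed number of rounds. All names are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 10
#define QUEUE_SIZE  16          /* arbitrary size, chosen for illustration */

struct work_queue {
    pthread_mutex_t lock;
    pthread_cond_t  full;       /* signalled by the master when the queue is full */
    pthread_cond_t  drained;    /* signalled by a worker after emptying the queue */
    int             items;
    int             done;
    unsigned long   wakeups;    /* times a worker was woken and drained the queue */
};

static void *worker(void *arg)
{
    struct work_queue *q = arg;

    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (q->items < QUEUE_SIZE && !q->done)
            pthread_cond_wait(&q->full, &q->lock);
        if (q->done)
            break;
        q->items = 0;           /* "process" the queue */
        q->wakeups++;
        pthread_cond_signal(&q->drained);
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Fill the queue `rounds` times; returns the number of worker wake-ups. */
unsigned long run_wakeups(unsigned long rounds)
{
    struct work_queue q;
    pthread_t tid[NUM_WORKERS];
    unsigned long r;
    int started, i;

    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.full, NULL);
    pthread_cond_init(&q.drained, NULL);
    q.items = 0;
    q.done = 0;
    q.wakeups = 0;

    for (started = 0; started < NUM_WORKERS; started++)
        if (pthread_create(&tid[started], NULL, worker, &q) != 0)
            break;

    pthread_mutex_lock(&q.lock);
    for (r = 0; started > 0 && r < rounds; r++) {
        q.items = QUEUE_SIZE;           /* master fills the queue */
        pthread_cond_signal(&q.full);   /* wake one worker */
        while (q.items != 0)
            pthread_cond_wait(&q.drained, &q.lock);
    }
    q.done = 1;                         /* tell every worker to exit */
    pthread_cond_broadcast(&q.full);
    pthread_mutex_unlock(&q.lock);

    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    pthread_cond_destroy(&q.drained);
    pthread_cond_destroy(&q.full);
    pthread_mutex_destroy(&q.lock);
    return q.wakeups;
}
```

Both waits sit inside `while` loops that recheck their condition, so a spurious wake-up or a signal sent before a worker starts waiting does not lose a round.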
Results

| Library | Average/second | Compared to NPTL (IA-64) |
|---|---|---|
| Linux Threads (Pentium) | 97771 | 94% more wake ups |
| NPTL (Pentium) | 122761 | 143% more wake ups |
| Linux Threads (IA-64) | 43483 | 14% fewer wake ups |
| NPTL (IA-64) | 50331 | - |
Uncontested
See how many times a thread can get/release an uncontested lock.
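The operation being counted here is a single lock/unlock pair on a mutex that no other thread touches. A minimal sketch, with the hypothetical name `uncontested_pairs` and a fixed iteration count in place of the timer:

```c
#include <pthread.h>
#include <stddef.h>

/*
 * Take and release a mutex `n` times from a single thread.
 * Returns the number of completed lock/unlock pairs (n on success).
 */
unsigned long uncontested_pairs(unsigned long n)
{
    pthread_mutex_t lock;
    unsigned long done;

    if (pthread_mutex_init(&lock, NULL) != 0)
        return 0;

    for (done = 0; done < n; done++) {
        if (pthread_mutex_lock(&lock) != 0)
            break;
        if (pthread_mutex_unlock(&lock) != 0)
            break;
    }

    pthread_mutex_destroy(&lock);
    return done;
}
```

Since nothing else contends for the lock, this mostly measures the fast path of the mutex implementation, typically the atomic operations themselves.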
Results

| Library | Average/second | Compared to NPTL (IA-64) |
|---|---|---|
| Linux Threads (Pentium) | 7571516 | 26% fewer uncontested locks taken |
| NPTL (Pentium) | 7060900 | 31% fewer uncontested locks taken |
| Linux Threads (IA-64) | 6328766 | 38% fewer uncontested locks taken |
| NPTL (IA-64) | 10212532 | - |

| Library | Average Time / Operation | Compared to Machine Cycles |
|---|---|---|
| Linux Threads (Pentium) | 132ns / 2 = 66ns | 66ns * (2.5 cycles per ns) = 165 machine cycles |
| Linux Threads (IA-64) | 158ns / 2 = 79ns | 79ns * (0.9 cycles per ns) = 71 machine cycles |
| NPTL (Pentium) | 141ns / 2 = 70.5ns | 70.5ns * (2.5 cycles per ns) = 176 machine cycles |
| NPTL (IA-64) | 98ns / 2 = 49ns | 49ns * (0.9 cycles per ns) = 44 machine cycles |

- Possible explanations for the increased time on Pentium NPTL? PeterC suggested that atomic operations cause pipeline flushes on Pentium; Itanium is not penalised like this.
Effect of page size
IA64 Linux allows page sizes of 4K, 8K, 16K or 64K. 2.5.67 kernels were built with configurations that differed only in page size, and the tests were run with libc cvs @ 2003-04-22 + NPTL 0.36 on the aforementioned Itanium 2 machine.
Context Switching

| Page Size (KB) | 4 | 8 | 16 | 64 |
|---|---|---|---|---|
| Linux Threads | 179183 | 179863 | 177651 | 177508 |
| NPTL | 396513 | 401157 | 411063 | 376556 |
| %GAIN | 54.81% | 55.16% | 56.78% | 52.86% |

Life Cycle

| Page Size (KB) | 4 | 8 | 16 | 64 |
|---|---|---|---|---|
| Linux Threads | 18502 | 16515 | 13727 | 6605 |
| NPTL | 101776 | 106560 | 106073 | 99582 |
| %GAIN | 81.82% | 84.50% | 87.06% | 93.37% |

Wake Up

| Page Size (KB) | 4 | 8 | 16 | 64 |
|---|---|---|---|---|
| Linux Threads | 69118 | 68690 | 67065 | 68069 |
| NPTL | 118380 | 110928 | 111702 | 105223 |
| %GAIN | 41.61% | 38.08% | 39.96% | 35.31% |

Uncontested

| Page Size (KB) | 4 | 8 | 16 | 64 |
|---|---|---|---|---|
| Linux Threads | 6323734 | 6325072 | 6324960 | 6325262 |
| NPTL | 10201658 | 10206947 | 10206628 | 10206287 |
| %GAIN | 38.01% | 38.03% | 38.03% | 38.03% |
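
The %GAIN rows in these tables are consistent with expressing the difference between the two libraries as a fraction of the NPTL figure. That reading is inferred from the numbers rather than stated on this page, and the helper name `gain_percent` is hypothetical:

```c
/*
 * Gain of NPTL over Linux Threads, as a percentage of the NPTL rate.
 * Both arguments are operations per second from the same test.
 */
double gain_percent(double linuxthreads_rate, double nptl_rate)
{
    return (nptl_rate - linuxthreads_rate) / nptl_rate * 100.0;
}
```

For the 4KB context switching column, `gain_percent(179183, 396513)` evaluates to about 54.81, matching the table.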
Test Result Data
Lifecycle
Linux Threads (Pentium)
128340 threads created in 4.99537 sec = 25691.8 per second
129766 threads created in 4.99872 sec = 25959.9 per second
121461 threads created in 4.99888 sec = 24297.7 per second
130323 threads created in 4.99892 sec = 26070.2 per second
129206 threads created in 4.99882 sec = 25847.3 per second
Linux Threads (IA-64)
61478 threads created in 4.99969 sec = 12296.4 per second
61646 threads created in 4.99982 sec = 12329.6 per second
62016 threads created in 4.99966 sec = 12404 per second
61528 threads created in 4.99976 sec = 12306.2 per second
61585 threads created in 4.99959 sec = 12318 per second
NPTL (Pentium)
1146950 threads created in 4.99332 sec = 229697 per second
1155268 threads created in 4.9982 sec = 231137 per second
1147274 threads created in 4.99876 sec = 229511 per second
1140484 threads created in 4.99872 sec = 228155 per second
1147698 threads created in 4.9987 sec = 229599 per second
NPTL (IA-64)
453504 threads created in 4.99994 sec = 90702 per second
442196 threads created in 4.99971 sec = 88444.4 per second
442380 threads created in 4.99951 sec = 88484.7 per second
442508 threads created in 4.99948 sec = 88510.7 per second
441764 threads created in 4.99974 sec = 88357.4 per second
Context
Linux Threads (Pentium)
681148 context switches in 5.08959 sec = 133832 per second
973348 context switches in 4.9987 sec = 194720 per second
627962 context switches in 5.08881 sec = 123401 per second
738299 context switches in 5.04877 sec = 146233 per second
654499 context switches in 5.159 sec = 126866 per second
Linux Threads (IA-64)
1009192 context switches in 5.07046 sec = 199034 per second
1092900 context switches in 5.15004 sec = 212212 per second
881415 context switches in 4.99969 sec = 176294 per second
813726 context switches in 5.18222 sec = 157023 per second
914534 context switches in 5.39332 sec = 169568 per second
NPTL (Pentium)
1648943 context switches in 5.11928 sec = 322104 per second
1482676 context switches in 4.99866 sec = 296615 per second
2425918 context switches in 5.08865 sec = 476731 per second
1558364 context switches in 5.1187 sec = 304445 per second
1646595 context switches in 5.27868 sec = 311933 per second
NPTL (IA-64)
1904623 context switches in 5.06151 sec = 376296 per second
1523602 context switches in 4.99972 sec = 304738 per second
1527689 context switches in 5.00027 sec = 305521 per second
1559780 context switches in 4.99951 sec = 311986 per second
1630905 context switches in 5.26538 sec = 309741 per second
Wakeup
Linux Threads (Pentium)
428249 wakes ups in 4.99544 sec = 85727.9 per second
500555 wakes ups in 4.99804 sec = 100150 per second
505355 wakes ups in 4.99886 sec = 101094 per second
506447 wakes ups in 4.99872 sec = 101315 per second
502745 wakes ups in 4.99881 sec = 100573 per second
Linux Threads (IA-64)
250795 wakes ups in 4.99994 sec = 50159.6 per second
247568 wakes ups in 4.99947 sec = 49518.8 per second
246061 wakes ups in 5.00033 sec = 49208.9 per second
246656 wakes ups in 4.99965 sec = 49334.6 per second
245981 wakes ups in 4.99955 sec = 49200.6 per second
NPTL (Pentium)
594859 wakes ups in 4.99864 sec = 119004 per second
672069 wakes ups in 4.99914 sec = 134437 per second
601190 wakes ups in 4.99806 sec = 120285 per second
601265 wakes ups in 4.99857 sec = 120288 per second
598793 wakes ups in 4.99861 sec = 119792 per second
NPTL (IA-64)
260518 wakes ups in 4.99969 sec = 52106.8 per second
255846 wakes ups in 4.99955 sec = 51173.8 per second
240073 wakes ups in 4.99973 sec = 48017.2 per second
249059 wakes ups in 4.99959 sec = 49815.9 per second
252732 wakes ups in 5.00013 sec = 50545.1 per second
Uncontested
Linux Threads (Pentium)
37948774 uncontested locks taken in 4.9984 sec = 7592178.223 per second
37786013 uncontested locks taken in 4.99857 sec = 7559370.627 per second
37935855 uncontested locks taken in 4.99858 sec = 7589326.369 per second
37752282 uncontested locks taken in 4.99852 sec = 7552690.486 per second
37809587 uncontested locks taken in 4.99861 sec = 7564020.198 per second
Linux Threads (IA-64)
31645158 uncontested locks taken in 4.99992 sec = 6329131.6 per second
31640567 uncontested locks taken in 4.99929 sec = 6329017.184 per second
31640113 uncontested locks taken in 4.99933 sec = 6328866.871 per second
31637056 uncontested locks taken in 4.99935 sec = 6328236.402 per second
31639168 uncontested locks taken in 4.99941 sec = 6328584.17 per second
NPTL (Pentium)
35295501 uncontested locks taken in 4.99789 sec = 7062087.463 per second
35313675 uncontested locks taken in 4.99866 sec = 7064624.08 per second
35175797 uncontested locks taken in 4.99865 sec = 7037057.998 per second
35316222 uncontested locks taken in 4.99857 sec = 7065259.412 per second
35367709 uncontested locks taken in 4.99863 sec = 7075476.235 per second
NPTL (IA-64)
51058878 uncontested locks taken in 4.99938 sec = 10213037.93 per second
51055216 uncontested locks taken in 4.99937 sec = 10212338.12 per second
51050637 uncontested locks taken in 4.99938 sec = 10211397.7 per second
51064154 uncontested locks taken in 4.99992 sec = 10212992.17 per second
51057906 uncontested locks taken in 4.99935 sec = 10212898.66 per second
