IA64 Linux TCP/IP Stack
This page details some performance measurements we have done, and are about to do, on the Linux TCP/IP stack on IA64.
What are we testing?
The network stack has multiple parts, each adding to the delay in processing traffic. At a minimum, we need to separate NIC driver problems from IP problems from TCP problems.
One way to isolate NIC performance is to use the lo (loopback) device and compare against real drivers.
The idea is to see:
- Is there any performance issue with the network under IA64 Linux?
- Where is the kernel spending time when under network load?
- What is the kernel's behaviour under various kinds of overload?
  - Network saturation (throughput problems)
  - Lots of half-closed connections (time spent closing a connection when there are many outstanding closes; a performance issue for web servers)
  - DoS attacks such as SYN flooding
Connection round trip performance
- Send a few bytes (say 100), wait for a small reply, destroy connection.
- As above but do not destroy connection each time. Compare and contrast for connection establishment time.
- Ignoring connection time, measure differences between UDP and TCP implementations of above schemes.
- TCP close performance: a normal close requires a full FIN/ACK exchange in each direction; how much can an abortive release (RST) improve performance (Stevens, p. 247)? When is that appropriate?
- Create multiple connections and see how increasing the connection count degrades performance.
- Application layer (userver is good for this): what parameters govern accepting connections (aggressiveness, the order in which they are accepted)?
- Send a large number of bytes and time the transfer, averaged over several runs. Can it fill the underlying network link? What saturates first? Use oprofile or qprof to see where the kernel is spending its time.
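The round-trip schemes above can be sketched roughly as follows. This is a minimal loopback sketch, not our actual test rig: the echo server, message size, and iteration counts are assumptions for illustration. The abortive close uses SO_LINGER with a zero timeout, which makes the kernel send an RST on close instead of the normal FIN exchange.

```python
import socket
import struct
import threading
import time

def serve(listener):
    # Sequentially accept connections and echo whatever arrives until close.
    while True:
        try:
            conn, _ = listener.accept()
        except OSError:
            return  # listener closed
        with conn:
            while True:
                try:
                    data = conn.recv(4096)
                except ConnectionResetError:
                    break  # client did an abortive (RST) close
                if not data:
                    break
                conn.sendall(data)

def timed_round_trips(addr, n, reuse, abortive=False):
    """Time n 100-byte request/reply exchanges.

    reuse=False creates and destroys a connection per exchange;
    abortive=True closes with RST (SO_LINGER, timeout 0) instead of FIN.
    """
    payload = b"x" * 100
    start = time.perf_counter()
    conns = 1 if reuse else n
    per_conn = n if reuse else 1
    for _ in range(conns):
        with socket.create_connection(addr) as s:
            if abortive:
                s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                             struct.pack("ii", 1, 0))
            for _ in range(per_conn):
                s.sendall(payload)
                reply = b""
                while len(reply) < len(payload):
                    reply += s.recv(4096)
                assert reply == payload
    return time.perf_counter() - start

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(64)
addr = listener.getsockname()
threading.Thread(target=serve, args=(listener,), daemon=True).start()

n = 200
t_fresh = timed_round_trips(addr, n, reuse=False)
t_persist = timed_round_trips(addr, n, reuse=True)
print(f"fresh connection per message: {t_fresh:.3f}s; persistent: {t_persist:.3f}s")
```

Comparing t_fresh against t_persist isolates the connection establishment and teardown cost from the per-message cost, which is the comparison the second bullet above asks for.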
The four major parameters for TCP/IP are:
- file size (how much data we are sending)
- send buffer size (how much data we provide for the kernel to pass on)
- MTU (frame size on the link -- 1500 bytes for Ethernet, of which 20 bytes are TCP header overhead and 20 bytes IP header overhead)
- window size (how much outstanding data)
Does varying these result in large performance differences?
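Two of these knobs can be varied per-socket. A quick sketch of requesting a send buffer size and checking what the kernel actually granted (the size is illustrative; on Linux the kernel doubles the requested value for bookkeeping and caps it at net.core.wmem_max, so the granted value will usually differ from the request):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask for a 256 KiB send buffer; the kernel may round, double, or cap this.
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 256 * 1024)
granted = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print("effective send buffer:", granted, "bytes")
s.close()
```

Checking the granted value matters for the sweep: if wmem_max caps the buffer, two nominally different test configurations may in fact be identical.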
The tcp_max_syn_backlog sysctl might get in the way.
Think about throughput -- two links are not necessarily the same speed.
Serialization delay = frame size (bits) / link bandwidth (bits/second). This is the time taken to put a message "on the line", and needs to be added into the round-trip delay.
Maximum throughput = (window size (bytes) * 8 bits/byte) / round-trip delay (seconds); equivalently, window * 8 / (2 * one-way delay).
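A worked example with the two formulas above (the link speed, window size, and delay values are illustrative):

```python
# Serialization delay for one MTU-sized Ethernet frame on a 100 Mbit/s link.
frame_bits = 1500 * 8
link_bps = 100e6
ser_delay = frame_bits / link_bps          # 0.00012 s = 120 microseconds

# Maximum throughput for the largest unscaled TCP window over a 1 ms round trip.
window_bytes = 65535
rtt = 0.001
max_throughput_bps = window_bytes * 8 / rtt

print(f"serialization delay: {ser_delay * 1e6:.0f} us")
print(f"maximum throughput: {max_throughput_bps / 1e6:.1f} Mbit/s")
```

Note that with a 65535-byte window and a 1 ms round trip, the window formula already caps throughput at roughly 524 Mbit/s regardless of link speed, which is why the window size is one of the four major parameters.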
John L. Wood, Christopher D. Selvaggi and John Q. Walker. "Testing the Performance of Multiple TCP/IP Stacks", Proceedings of CMG97, December 7-12, 1997, volume 1, pages 626-638.
W. Richard Stevens. TCP/IP Illustrated, Volume 1. Addison-Wesley, 1994.
Tim Brecht, Michal Ostrowski. "Exploring the Performance of Select-based Internet Servers", HPL-2001-314, 2001.
httperf with userver makes for an ideal real-world-style network test: httperf generates large amounts of HTTP traffic against the high-performance test web server userver.
As a first test, we have an IA64 server running 2.5.6-test5 with userver 0.3.3 listening for client requests. For a single client, we ran a test sending between 100 and 2000 messages/sec, in increments of 100 messages/sec, for one minute at each rate (i.e. 2000 * 60 = 120,000 total messages at the top rate). For two clients, the rate was divided in half to give the same total number of messages. These are not particularly high loads; previous work shows even a modest x86 server should scale linearly to around 4000 requests/sec.
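One step of such a sweep might be invoked roughly as follows (the server name and URI are placeholders, not our actual setup; the flags are httperf's standard options):

```shell
# 100 connections/sec for 60 seconds = 6000 connections, one request each.
httperf --server ia64srv --port 80 --uri /index.html \
        --rate 100 --num-conns 6000 --num-calls 1
```

Stepping --rate from 100 to 2000 and scaling --num-conns to match keeps the run length at one minute per step.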
Refer to the two graphs below; the 2.4 kernel scales as we expect (linearly), however the 2.5 kernel appears to hit a point beyond which no more progress is made. This is repeatable with our setup; we are currently looking into what is going on. The effect does not appear on PowerPC, at least: running a 2.6.0-test7 kernel on a 700MHz G3 produced a linearly scaling graph as expected. Interestingly, in this test we were using the IA64 box as the traffic generator, so it has no problem outputting the required packets.
Update: I have got to the bottom of this particular problem -- I was sure I had turned off Network Packet Filtering, but apparently had not. It should be easy to spot problems with this code, since it printks warnings -- unless, that is, you have done something like 'dmesg -n 4' to suppress the unaligned access warnings, as I had. I also upgraded to a 2.6.0-test8 kernel, since this stops another annoying (bogus) warning.
However, this does raise the question of why the connection tracking table fills up so quickly on IA64, whereas even an underpowered PowerPC doesn't seem to have any problems. Additionally, since turning the packet filtering off, I haven't seen any more unaligned access faults (yet).
I am now testing again, and noticing some strange effects. Testing between two IA64 boxes, the load-generating machine seems to be pumping out packets at an incredible rate -- around 10,000 per second, seemingly without problems. I'm not sure this is correct, but there certainly don't seem to be any errors. Eventually the server gets to a point where it starts logging 'drop open request from 10.0.0.3' thousands of times (syslog suppresses most of them). I have yet to discover the cause; userver doesn't report any problems, presumably because the packets are dropped before userver has a chance to log the connection. Once the kernel gets into this state of dropping requests, it doesn't recover until I stop and restart userver. All this is yet to be confirmed, however.