Barriers to High Performance Computing on Linux
Linux has come a long way since it was first released. It now runs on an extremely wide variety of hardware, from embedded devices to clusters of 2000+ processors. However, merely running is not good enough: running well in an HPC environment requires certain infrastructure changes, and new or improved open-source software.
Note that just because there's something here doesn't mean that the Gelato@UNSW team is going to do anything about it.
Also note that, in the benchmarks I've (PeterChubb) run, kernel time itself is not significant. That might change as the number of processors is increased (I have access to at most 4, which isn't very many). It's more likely that scalability problems will be found in other areas (e.g., increased memory latency because of large NUMA memories, or slower-than-desirable operation because of TLB or cache contention) where kernel algorithms have an impact, but the cost is not directly attributable to time spent in the kernel.
- Support for large memory. The current hierarchical page table structure is inefficient for mapping very large physical memory with holes (as is common on ccNUMA machines). TLB coverage is also a problem (a sketch of one mitigation follows this list).
- Better Support for NUMA. The kernel's scheduler does now have some NUMA support, but there's no provision (yet) for replicated almost-constant data, or even for replicated constant data or text segments.
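To make the TLB-coverage point concrete, here is a minimal user-space sketch of mapping a working set with huge pages. Whether MAP_HUGETLB (or a mounted hugetlbfs) is available is an assumption about the kernel and architecture, so the sketch falls back to ordinary pages:

    /* Sketch: map a region with huge pages to improve TLB coverage.
     * Assumes a kernel that supports MAP_HUGETLB (or a mounted
     * hugetlbfs); falls back to ordinary pages if the mapping fails. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGETLB
    #define MAP_HUGETLB 0x40000   /* common value, but architecture-dependent */
    #endif

    int main(void)
    {
        size_t len = 256UL << 20;   /* 256 MB working set */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            /* Huge pages not configured: fall back to base pages,
             * i.e. one TLB entry per small page of the region. */
            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
            }
            puts("using base pages");
        } else {
            puts("using huge pages: far fewer TLB entries for the same region");
        }
        /* ... touch the memory and run the computation here ... */
        munmap(p, len);
        return 0;
    }

Even this only helps TLB coverage; it does nothing about the page-table space wasted on holes in the physical map.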
- Pluggable scheduler modules --- scheduler classes (a rough sketch of such an interface follows this list)
- Gang scheduling
- Proportional fair-share scheduling (need job/user/accounting-entity accounting for that) based either on PID feedback controller or lottery
- Isochronous scheduler
- Soft-real time scheduler
- Batch scheduler
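For the pluggable-scheduler item above, the following is a purely hypothetical sketch of what a scheduler-class interface might look like; none of these names exist in the kernel, they just show the shape of the idea:

    /* Hypothetical scheduler-class interface.  The names are invented
     * for illustration; the point is that a gang, fair-share, lottery
     * or batch policy would each supply one ops table. */
    struct task_struct;
    struct runqueue;

    struct sched_class_ops {
        const char *name;   /* "gang", "fairshare", "isochronous", ... */
        void (*enqueue)(struct runqueue *rq, struct task_struct *p);
        void (*dequeue)(struct runqueue *rq, struct task_struct *p);
        struct task_struct *(*pick_next)(struct runqueue *rq);
        void (*tick)(struct runqueue *rq, struct task_struct *p);
    };

    /* A policy module registers its table; machines that never load
     * such a module keep the default scheduler untouched. */
    int register_sched_class(const struct sched_class_ops *ops);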
More NUMA awareness: CPU sets or other ways to group memory-sharing threads onto a single RAD, or a small set of them (RAD == resource affinity domain: a group of processors and memory that are near each other).
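From user space, grouping a job onto one RAD looks roughly like the following today, using libnuma; the library calls and the node number here are assumptions about the environment, not part of the kernel work being asked for:

    /* Sketch: confine the current process to the CPUs of one node and
     * take its memory from the same node, so that memory-sharing
     * threads stay within a single RAD.  Requires libnuma (-lnuma). */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        int node = 0;               /* target resource affinity domain */
        size_t len = 64UL << 20;    /* 64 MB shared region */

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this kernel\n");
            return 1;
        }

        /* Restrict execution to the CPUs of this node... */
        if (numa_run_on_node(node) < 0)
            fprintf(stderr, "numa_run_on_node failed\n");

        /* ...and allocate the shared data from the same node's memory. */
        void *buf = numa_alloc_onnode(len, node);
        if (buf == NULL) {
            fprintf(stderr, "numa_alloc_onnode failed\n");
            return 1;
        }

        /* ... start the memory-sharing threads here ... */

        numa_free(buf, len);
        return 0;
    }

What is missing is a kernel-enforced way to express this for a whole job rather than per process.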
Performance Measurement and Control
- Per-process CPU usage is currently inaccurate.
Many of the fields returned by getrusage() are empty (see the probe after this list).
- There are no disk I/O metrics per process.
- CPU Queue Length doesn't appear to be available per CPU.
- Page Faults are misaccounted
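For reference, a minimal probe of what getrusage(2) actually fills in; the field names are the standard struct rusage ones, and which of them come back as zero depends on the kernel version:

    /* Sketch: print a few getrusage() fields to see which ones the
     * running kernel actually maintains. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage ru;

        if (getrusage(RUSAGE_SELF, &ru) != 0) {
            perror("getrusage");
            return 1;
        }
        printf("user time    : %ld.%06ld s\n",
               (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
        printf("system time  : %ld.%06ld s\n",
               (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
        printf("max RSS      : %ld\n", ru.ru_maxrss);  /* one of the often-empty fields */
        printf("minor faults : %ld\n", ru.ru_minflt);
        printf("major faults : %ld\n", ru.ru_majflt);
        printf("block in/out : %ld / %ld\n", ru.ru_inblock, ru.ru_oublock);
        return 0;
    }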
/proc doesn't scale well to thousands of threads. We need a better way to get at per-process data, one that doesn't involve so many system calls: currently every thread needs at least three system calls (open, read, close), not to mention all the getdents() calls needed to find the entries in the first place.
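To make the cost concrete, this is roughly the per-thread round trip a monitoring tool pays today; the path layout is the existing /proc one, and only the helper name is made up:

    /* Sketch: the open/read/close cycle a monitor performs for *each*
     * task when pulling per-thread data out of /proc. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* Read /proc/<pid>/stat for one task; returns bytes read or -1. */
    static ssize_t read_task_stat(pid_t pid, char *buf, size_t len)
    {
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);

        int fd = open(path, O_RDONLY);      /* system call 1 */
        if (fd < 0)
            return -1;
        ssize_t n = read(fd, buf, len - 1); /* system call 2 */
        close(fd);                          /* system call 3 */
        if (n >= 0)
            buf[n] = '\0';
        return n;
    }

    int main(void)
    {
        char buf[1024];
        if (read_task_stat(getpid(), buf, sizeof(buf)) > 0)
            printf("%s\n", buf);
        return 0;
    }

Multiply those three calls (plus the getdents() traffic) by a few thousand threads, several times a second, and the overhead of the monitoring itself becomes visible.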
ClusterFileSystems (e.g., Lustre or CXFS) are desirable.
- SAN? Not sure what the issues are here.
- Hierarchical storage management --- use of discs as caches for slower, more permanent storage (e.g., tape silos)
Want to be able to characterise a job as needing n processors, m gigabytes of memory, and access to drives /dev/sdX and /dev/sdY, and to schedule memory and processors accordingly.
Platform LSF? GNU Queue? NQS? SGI's Miser? --- Queue is immature and lacks features; the others are not open source. An open-source NQS is available, but it is old software and may need updating for current environments and requirements. PBS is supported by BII, a Gelato member; it appears to be a descendant of NQS.
- Gang scheduler? Isochronous schedulers? Job schedulers?
- Pluggable schedulers would allow insertion of gang, fair-share, or lottery schedulers (etc.) without affecting small machines.
NUMA-aware spinlocks: http://lse.sourceforge.net/numa/locking
- Replacement of spinlocks with seq_locks where appropriate
- Use of atomic ops instead of locks where possible

There are general problems with fine-grained locking. To be successful, developers need to be very careful to document:
- What each lock covers
- For each variable, what lock (if any) covers it
- For each function, what locks are expected to be held on entry and exit
Otherwise, it's very easy to multiply locks to no good purpose (it's often easier to create a new lock than to understand what's already there), which can lead to deadlocks.
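A kernel-flavoured sketch of the documentation discipline meant here; the structure, field and function names are invented purely for illustration:

    /* Sketch of the locking-documentation conventions described above.
     * The data structure is invented; the point is the comments. */
    #include <linux/spinlock.h>
    #include <linux/list.h>

    struct job {
        struct list_head node;      /* linkage on job_queue.jobs; covered by job_queue.lock */
    };

    struct job_queue {
        spinlock_t       lock;      /* protects nr_jobs and jobs below */
        unsigned int     nr_jobs;   /* covered by lock */
        struct list_head jobs;      /* covered by lock */
        unsigned long    id;        /* read-only after init: no lock needed */
    };

    /*
     * job_queue_add - append a job to the queue.
     * Caller must NOT hold q->lock; it is taken and released here.
     */
    void job_queue_add(struct job_queue *q, struct job *j)
    {
        spin_lock(&q->lock);
        list_add_tail(&j->node, &q->jobs);
        q->nr_jobs++;
        spin_unlock(&q->lock);
    }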
Lock placement also needs to be carefully addressed. There seem to be two schools of thought:
- Put the lock into the same cacheline as the variables it protects. Then when you have the lock, you have the variables too. This is good for lightly contended locks.
- Put the lock on its own in a separate cacheline. Then when the lock is contended, the current holder of the lock can proceed at full speed until it does the write to release the lock, whereupon the cacheline containing the variables is transferred to the other processor.
As far as I know, little or no analysis has yet been done on the placement of locks in the Linux kernel.
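In code, the two placements look roughly like the following. The ____cacheline_aligned idiom is the kernel's existing way of padding to a cacheline, while the structures themselves are invented for illustration; measuring which layout wins for a given lock is exactly the analysis that hasn't been done:

    /* Sketch of the two lock-placement strategies discussed above. */
    #include <linux/cache.h>
    #include <linux/spinlock.h>

    /* 1. Lock in the same cacheline as the data it protects: taking the
     *    lock also pulls the data into this CPU's cache.  Good for
     *    lightly contended locks. */
    struct counter_inline {
        spinlock_t    lock;
        unsigned long count;                        /* shares a line with lock */
    };

    /* 2. Lock in a cacheline of its own: waiters spinning on the lock do
     *    not keep stealing the line that holds the data, so the current
     *    holder runs at full speed until it releases the lock. */
    struct counter_split {
        spinlock_t    lock ____cacheline_aligned;
        unsigned long count ____cacheline_aligned;  /* separate line */
    };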
- The new POSIX threading library (based on futexes and a faster clone(2) system call) is a big improvement. It's unclear whether it will scale well with lots of processors.
- Can MPI be improved?
- There are still issues of thread placement in ccNUMA that need addressing.
See Rik van Riel's page http://surriel.com/research_wanted/
Also Larry McVoy's talks on SMP Clusters (sounds a bit like Cellular Disco?)
And the ADEOS work http://www.opersys.com/adeos/practical-smp-clusters/