Linux I/O schedulers
Contents
A disk scheduler is the part of the Linux kernel that reorders, delays, and merges requests for disk I/O to achieve better throughput and lower latency than would happen if all the requests were sent straight to disk. The scheduler aims to minimise disk head movement as far as possible. You can think of the scheduler as a queue of requests waiting for service at the disk.
Linux 2.6 has four different disk schedulers. You can select the available schedulers at kernel build time; the schedulers can either be built-in or modules.
Changing Schedulers
To use a particular scheduler for a particular block device, write its name into /sys/block/devicename/queue/scheduler
For example:
# cat /sys/block/hda/queue/scheduler noop [anticipatory] deadline
says that there are three schedulers available (the noop, anticipatory and deadline schedulers) and that the anticipatory scheduler is the one currently in use.
# echo deadline > /sys/block/hda/queue/scheduler # cat /sys/block/hda/queue/scheduler noop anticipatory [deadline]
Each scheduler has parameters that can be set. These are visible to userspace in /sys/block/devicename/queue/iosched/
General issues
A scheduler can only work with what it's given. If there is nothing queued in the operating system for the disk, the scheduler does nothing at all.
Modern disk subsystems do queuing and scheduling themselves. For example, the CCISS P600 can queue up to 512 requests; and even commonly available SATA drives can generally queue 16 or 32 requests. With a deep on-device queue, the operating system's queuing and scheduling has less scope for operation: most requests are queued at the disk instead.
NOTE TO EDITOR: put diagram here
If there is a queue on disk (TCQ or NCQ, or write-back caching on the disk) then the goals of the operating system's scheduler may be perturbed. In particular, attempts to bound request latency, or provide fairness, or to give reads priority over writes, can all be subverted by a deep queue that uses a different queuing discipline.
NOOP
The (misnamed) noop scheduler in principle does not reorder requests. However, if a new request is immediately after the previous request, the new one will be merged into the existing one in the queue.
Under most workloads this does nothing, as the block layer and file systems merge requests as much as they can anyway.
noop has no tunables.
Deadline
The deadline scheduler approximates SCAN, augmented by deadlines for avoiding starvation. As each request comes in, a deadline is attached to it. The scheduler sets up batches of requests in ascending sector order. After each batch, it checks to see if any deadlines have expired, and starts a new batch at the request with the expired deadline if so. Otherwise, it tries to create a new batch in SCAN order from the last request. When switching from reads to writes, the new batch starts with the earliest deadline.
deadline treats read and write requests separately, and (by default) tries to service reads twice as often as writes. It can do both front-merging and tail-merging; front merging can be turned off by a parameter.
tunable |
default value |
Purpose |
fifo_batch |
16 |
Number of requests in a batch --- not really fifo |
read_expire |
500 |
milliseconds after a read request arrives before its deadline |
write_expire |
5000 |
milliseconds deadline for write requests |
writes_starved |
2 |
ratio of reads to writes |
front_merges |
1 |
(on or off) whether to front-merge requests |
Anticipatory
The Anticipatory scheduler (AS) has all the functionality of deadline, but also adds anticipation. If a program is processing a file sequentially, then it's likely to do another disk operation very close to the last one it did. Therefore, if the scheduler delays a short while after a request is completed, another request will probably be on its way that is close to the previous request.
AS has a number of other differences in implementation from Deadline:
- Requests within a batch are in SCAN order, and need not be contiguous. Also, backward seeks up to 1 MiB are allowed, but are costed at twice the cost of an equivalent forward seek.
AS does not treat reads and writes separately, but does treat synchronous and asynchronous requests separately. synchronous requests are all reads, and writes with O_DIRECT or O_SYNC.
Parameters are:
tunable |
default value (ms) |
Purpose |
|
antic_expire |
6 |
Maximum anticipation time |
|
est_time |
(output var) |
|
|
read_expire |
125 |
deadline for synchronous requests |
|
read_batch_expire |
500 |
How long to issue synch. requests before switching to asynch |
|
write_expire |
250 |
deadline for asynchronous requests |
|
write_batch_expire |
125 |
How long to issue asynch. requests before switching to sync. |
|
If you have a fast disk, it's generally worth reducing antic_expire. It should be around or a bit less than the average seek cost --- there's no point in holding the disk idle for longer than necessary, and if it's going to be idle longer than a seek, you may as well do the seek anyway.
Completely Fair Queuing
The CFQ scheduler aims to distribute disk time fairly amongst processes competing for access to the disk. In addition, it uses anticipation and deadlines to improve performance, and attempt to bound latency.
Each process issuing I/O gets a time slice during which it has exclusive access for synchronous requests. The time slice is bounded by two parameters: slice_sync adjusted by the process's I/O priority gives the length in milliseconds of each slice; and quantum gives the number of requests that can be issued.
All processes share a set of 17 queues for asynchronous I/O, one for each effective I/O priority. Each queue gets a time slice based on slice_async adjusted by priority), and the queues are treated round-robin.
Within each slice, requests are merged (both front and back), and issued according to SCAN, with backward seeks up to back_seek_max permitted, but biassed by back_seek_penalty compared with forward seeks. CFQ will wait up to slice_idle ms within a slice for a process to issue more I/O. This allows anticipation, and also improves fairness when a process issues bursty I/O; but in general it reduces performance.
Parameter summary:
Tunable |
Default value |
Purpose |
back_seek_max |
16384 |
(KiB) Max. backward seek |
back_seek_penalty |
2 |
reverse seek bias |
fifo_expire_sync |
125 |
(ms) deadline for synchronous requests |
fifo_expire_async |
250 |
(ms) deadline for asynch requests |
quantum |
4 |
Maximum requests to service in each batch in each round |
slice_async |
40 |
(ms) base time slice for asynch. requests |
slice_sync |
100 |
(ms) base time slice for sync. requests |
slice_async_rq |
2 |
base number of async requests per round |
slice_idle |
8 |
(ms) maximum idle time between requests |
In general we've found major performance improvement by setting slice_idle to zero.
