Linux I/O schedulers

A disk scheduler is the part of the Linux kernel that reorders, delays, and merges requests for disk I/O to achieve better throughput and lower latency than would happen if all the requests were sent straight to disk. The scheduler aims to minimise disk head movement as far as possible. You can think of the scheduler as a queue of requests waiting for service at the disk.

Linux 2.6 has four different disk schedulers. Which schedulers are available is chosen at kernel build time; each scheduler can be built in or built as a module.
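The system-wide default scheduler can also be chosen at boot time with the elevator= kernel parameter. A sketch of a GRUB kernel line doing this --- the kernel version and root device are examples only:

```shell
# GRUB kernel line selecting deadline as the default I/O scheduler
# (vmlinuz version and root= device are illustrative)
kernel /boot/vmlinuz-2.6.18 ro root=/dev/sda1 elevator=deadline
```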

Changing Schedulers

To use a particular scheduler for a particular block device, write its name into /sys/block/devicename/queue/scheduler.

For example:

# cat /sys/block/hda/queue/scheduler
noop [anticipatory] deadline

says that there are three schedulers available (the noop, anticipatory and deadline schedulers) and that the anticipatory scheduler is the one currently in use.
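The bracketed entry can be picked out with standard tools; a minimal sketch, using a sample line in place of a real sysfs read:

```shell
# Pick the active scheduler (the bracketed entry) out of a
# /sys/block/<dev>/queue/scheduler line.  The sample line below
# stands in for reading the real sysfs file.
line='noop [anticipatory] deadline'
current=$(printf '%s\n' "$line" | sed -n 's/.*\[\(.*\)\].*/\1/p')
echo "$current"    # anticipatory
```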

# echo deadline > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop anticipatory [deadline]

Each scheduler has parameters that can be set. These are visible to userspace in /sys/block/devicename/queue/iosched/
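For example, to see the current settings --- hda here is just an example device name, and the listing depends on which scheduler is active:

```shell
# Print each tunable of the active scheduler alongside its value
# (adjust "hda" to a real block device on your system)
for f in /sys/block/hda/queue/iosched/*; do
    printf '%s = %s\n' "$(basename "$f")" "$(cat "$f")"
done
```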

General issues

A scheduler can only work with what it's given. If there is nothing queued in the operating system for the disk, the scheduler does nothing at all.

Modern disk subsystems do queuing and scheduling themselves. For example, the CCISS P600 can queue up to 512 requests; and even commonly available SATA drives can generally queue 16 or 32 requests. With a deep on-device queue, the operating system's queuing and scheduling has less scope for operation: most requests are queued at the disk instead.

If there is a queue on disk (TCQ or NCQ, or write-back caching on the disk) then the goals of the operating system's scheduler may be perturbed. In particular, attempts to bound request latency, or provide fairness, or to give reads priority over writes, can all be subverted by a deep queue that uses a different queuing discipline.

NOOP

The (misnamed) noop scheduler in principle does not reorder requests. However, if a new request starts immediately after an existing request ends, the new one will be merged into the existing one in the queue.

Under most workloads this merging has little extra effect, as the block layer and file systems already merge requests as much as they can anyway.

noop has no tunables.

Deadline

The deadline scheduler approximates SCAN, augmented by deadlines for avoiding starvation. As each request comes in, a deadline is attached to it. The scheduler sets up batches of requests in ascending sector order. After each batch, it checks to see if any deadlines have expired, and starts a new batch at the request with the expired deadline if so. Otherwise, it tries to create a new batch in SCAN order from the last request. When switching from reads to writes, the new batch starts with the earliest deadline.

deadline treats read and write requests separately, and (by default) tries to service reads twice as often as writes. It can do both front-merging and tail-merging; front merging can be turned off by a parameter.
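For instance, front merging can be switched off, or reads favoured for longer, by writing to the tunables --- the device name and values below are illustrative, and this assumes hda is using the deadline scheduler:

```shell
# Run as root; "hda" is an example device on the deadline scheduler
echo 0 > /sys/block/hda/queue/iosched/front_merges    # disable front merging
echo 4 > /sys/block/hda/queue/iosched/writes_starved  # favour reads for longer
```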

tunable          default value   Purpose
fifo_batch       16              number of requests in a batch --- not really a FIFO despite the name
read_expire      500             (ms) time after a read request arrives before its deadline
write_expire     5000            (ms) deadline for write requests
writes_starved   2               how many read batches may run before writes get a batch
front_merges     1               (0 or 1) whether to attempt front-merging

Anticipatory

The Anticipatory scheduler (AS) has all the functionality of deadline, but also adds anticipation. If a program is processing a file sequentially, its next disk operation is likely to be very close to its last one. Therefore, if the scheduler idles briefly after a request completes, another request close to the previous one will probably arrive, and can be serviced without a long seek.

AS has a number of other differences in implementation from Deadline:

  1. Requests within a batch are in SCAN order, and need not be contiguous. Also, backward seeks of up to 1 MiB are allowed, but are charged at twice the cost of an equivalent forward seek.
  2. AS does not treat reads and writes separately; instead it treats synchronous and asynchronous requests separately. Synchronous requests are all reads, plus writes issued with O_DIRECT or O_SYNC.

Parameters are:

tunable              default value (ms)   Purpose
antic_expire         6                    maximum anticipation time
est_time             (output var)         read-only output, not a tunable
read_expire          125                  deadline for synchronous requests
read_batch_expire    500                  how long to issue synchronous requests before switching to asynchronous
write_expire         250                  deadline for asynchronous requests
write_batch_expire   125                  how long to issue asynchronous requests before switching to synchronous

If you have a fast disk, it's generally worth reducing antic_expire. It should be around or a bit less than the average seek cost --- there's no point in holding the disk idle for longer than necessary, and if it's going to be idle longer than a seek, you may as well do the seek anyway.
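For example, on a disk with average seeks around 4 ms, one might cap anticipation just below that --- the device name and value are illustrative, and assume hda is using the anticipatory scheduler:

```shell
# Run as root; "hda" is an example device on the anticipatory scheduler
echo 3 > /sys/block/hda/queue/iosched/antic_expire
```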

Completely Fair Queuing

The CFQ scheduler aims to distribute disk time fairly amongst the processes competing for access to the disk. In addition, it uses anticipation and deadlines to improve performance and to attempt to bound latency.

Each process issuing I/O gets a time slice during which it has exclusive access to the disk for synchronous requests. The slice is bounded by two parameters: slice_sync, adjusted by the process's I/O priority, gives the length of each slice in milliseconds; and quantum gives the number of requests that can be issued within it.

All processes share a set of 17 queues for asynchronous I/O, one for each effective I/O priority. Each queue gets a time slice (based on slice_async adjusted by priority), and the queues are serviced round-robin.

Within each slice, requests are merged (both front and back) and issued according to SCAN, with backward seeks up to back_seek_max permitted but biased by back_seek_penalty compared with forward seeks. CFQ will wait up to slice_idle ms within a slice for a process to issue more I/O. This allows anticipation, and also improves fairness when a process issues bursty I/O; but in general it reduces performance.

Parameter summary:

Tunable             Default value   Purpose
back_seek_max       16384           (KiB) maximum backward seek
back_seek_penalty   2               bias applied to backward seeks relative to forward seeks
fifo_expire_sync    125             (ms) deadline for synchronous requests
fifo_expire_async   250             (ms) deadline for asynchronous requests
quantum             4               maximum requests to service in each batch in each round
slice_async         40              (ms) base time slice for asynchronous requests
slice_sync          100             (ms) base time slice for synchronous requests
slice_async_rq      2               base number of asynchronous requests per round
slice_idle          8               (ms) maximum idle time between requests within a slice

In general we've found a major performance improvement from setting slice_idle to zero.
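A sketch of applying that across every device currently using CFQ --- run as root, and check the effect on your own workload:

```shell
# Set slice_idle to 0 on every queue whose active scheduler is cfq
for q in /sys/block/*/queue; do
    grep -q '\[cfq\]' "$q/scheduler" 2>/dev/null &&
        echo 0 > "$q/iosched/slice_idle"
done
```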

IA64wiki: LinuxIOSchedulers (last edited 2009-12-10 03:14:06 by localhost)
