Large Block Devices

It seems a little silly that, on a processor with an 18 petabyte address space, Linux was limited to less than 2TB of disc per device, especially as RAID arrays in the 4 to 8 TB range are readily available. The main limitation (still present in 2.4 kernels) was that the size of a disc was held in a 32-bit signed int, in 1k units.
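The arithmetic behind that limit can be checked with a short sketch. The names `KB_UNIT`, `max_device_bytes` and `fits_in_signed_1k_count` are illustrative and are not kernel identifiers; the only facts taken from the text are the signed 32-bit counter and the 1k unit.

```c
#include <stdint.h>

/* Illustrative constant: the unit in which the disc size was counted. */
#define KB_UNIT 1024ULL

/* Largest device size, in bytes, representable by a signed 32-bit
 * count of 1k units. INT32_MAX is 2^31 - 1. */
static uint64_t max_device_bytes(void)
{
    return (uint64_t)INT32_MAX * KB_UNIT;
}

/* Returns 1 if a device of `bytes` bytes fits in such a counter. */
static int fits_in_signed_1k_count(uint64_t bytes)
{
    return bytes / KB_UNIT <= (uint64_t)INT32_MAX;
}
```

The result is 1k short of 2^41 bytes, which is why the limit is "less than 2TB" rather than exactly 2TB.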

Work was carried out here on the 2.5 kernels to remove the limitation. This happened at about the same time that Al Viro reorganised the gendisk interface in the kernel, which simplified it and made many of the changes that had been started unnecessary.

The main change was to add a new type, sector_t, which is an unsigned long by default but which 32-bit platforms can override with a 64-bit type when CONFIG_LBD is set.
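A minimal sketch of the idea, written as user-space C so that it compiles outside the kernel. The macro name CONFIG_LBD comes from the text; the helpers `sector_to_byte_offset` and `truncate_to_32`, and the fixed 512-byte sector, are assumptions made for illustration. The exact kernel definition may differ.

```c
#include <stdint.h>

/* With CONFIG_LBD defined, sector numbers are 64 bits wide even on a
 * 32-bit platform; otherwise they follow the native unsigned long. */
#ifdef CONFIG_LBD
typedef uint64_t sector_t;
#else
typedef unsigned long sector_t;
#endif

/* Hypothetical helper: byte offset of a sector, assuming 512-byte sectors. */
static uint64_t sector_to_byte_offset(sector_t sector)
{
    return (uint64_t)sector << 9;
}

/* What happens when a sector number is squeezed into 32 bits:
 * the high bits are silently discarded. */
static uint32_t truncate_to_32(uint64_t sector)
{
    return (uint32_t)sector;
}
```

Sector number 2^32 + 1 truncates to 1, so a read aimed past the 2TB mark would quietly land near the start of the disc. Using one type name throughout lets the compiler carry the full width wherever a sector number travels.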

Motivation

Looks like we'll have tens of terabytes in a single package by 2012. You can buy a RAID controller that looks like a single disc with 10TB now. Or build one for under $5k/TB. Could Linux work with such a device?

The Problems

Although many of the filesystems in common use can cope with large files and large disc volumes, the Linux kernel limits the size of a filesystem, and the maximum size of a file, in various places.

In effect, under 2.4 and early 2.5 (2.5.x<28) Linux, the maximum file system size is about 1TB, and the maximum file size is just under 2TB (actually, 0x1fffffff000 bytes) on a 32-bit machine with 4k pages. It turns out that this limitation extends to 64-bit platforms as well, through the use of ints where there should be longs.

Partitions

The size of a partition is limited to 2^31 blocks for most partitioning schemes, and to 2^32 blocks for those schemes (e.g., ultrix) that use unsigned 32-bit numbers. A block is 512 bytes. That value is embedded in the interfaces to read_dev_sector() and friends as a literal number, not even as a symbol! This makes it problematic to use larger blocks as a way around a 32-bit block number.
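To put numbers on those limits, here is a small sketch. The function name `max_partition_bytes` is hypothetical; the 512-byte block and the 2^31 and 2^32 block counts come from the text, and the 4096-byte case only shows what a larger block would buy if the interfaces allowed it.

```c
#include <stdint.h>

/* Largest partition, in bytes, when the block count is limited to
 * 2^bits blocks (31 for a signed 32-bit count, 32 for an unsigned one)
 * and each block is block_size bytes. */
static uint64_t max_partition_bytes(unsigned bits, uint32_t block_size)
{
    return (1ULL << bits) * (uint64_t)block_size;
}
```

With 512-byte blocks this gives 2^40 bytes (1TB) for 31 bits and 2^41 bytes (2TB) for 32 bits. A 4096-byte block with a 31-bit count would reach 2^43 bytes (8TB), which is the workaround that the hard-coded 512 makes awkward.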

See LBDFileSystems for more details.

SCSI

For large attached devices, SCSI-3 provides the READ16 and WRITE16 commands. Support for these has been in 2.5 since December 2002.
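The reason the 16-byte commands matter is that the older 10-byte commands carry only a 32-bit block address, whereas READ16 carries a 64-bit one. The sketch below fills in a READ16 command block as I understand the layout from the SCSI block command set; the helper name `build_read16_cdb` is hypothetical and this is not the kernel's sd driver code.

```c
#include <stdint.h>
#include <string.h>

#define READ_16_OPCODE 0x88  /* SCSI READ(16) operation code */

/* Fill a 16-byte CDB for READ(16): 64-bit logical block address in
 * bytes 2..9 and 32-bit transfer length (in blocks) in bytes 10..13,
 * both big-endian. Flag, group and control bytes are left at zero. */
static void build_read16_cdb(uint8_t cdb[16], uint64_t lba, uint32_t nblocks)
{
    int i;

    memset(cdb, 0, 16);
    cdb[0] = READ_16_OPCODE;
    for (i = 0; i < 8; i++)
        cdb[2 + i] = (uint8_t)(lba >> (56 - 8 * i));
    for (i = 0; i < 4; i++)
        cdb[10 + i] = (uint8_t)(nblocks >> (24 - 8 * i));
}
```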

Current Status

Linux 2.5 Kernel Status

The large block device patch is in the 2.5 kernels, and will therefore be in Linux 2.6.

Linux 2.4 Kernel Status

There is a backport of the LBD patch for Linux 2.4.20 available from our downloads area. It hasn't been tested particularly well yet. Things I'm sure won't work are:

  1. LVM uses unsigned long as the type for communicating with user space. As such, allowing it to work with large block devices will require changing the user interface, and all the user utilities. I decided not to bother, as LVM has been obsoleted for 2.5.

  2. RAID uses 64-bit division. You can try adding `gcc -print-libgcc-file-name` to the link line for the md devices, but there may be other problems. In addition, the user-level code to handle large partitions just isn't there. See LBDRaid for more information.
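Some background on the RAID item: on a 32-bit target, gcc compiles the `/` and `%` operators on 64-bit operands into calls to helper routines in libgcc, and the kernel is not normally linked against libgcc, hence the link-line trick. The alternative is to avoid those operators; the kernel has a do_div() helper for that purpose. The sketch below shows the general technique as plain shift-and-subtract long division. It is an illustration only, the name `div64_by_32` is hypothetical, and it is not the kernel's implementation.

```c
#include <stdint.h>

/* Divide a 64-bit value by a 32-bit divisor using shift-and-subtract
 * long division, without the / or % operators. Returns the quotient
 * and, if rem is non-NULL, stores the remainder there.
 * The divisor must be non-zero. */
static uint64_t div64_by_32(uint64_t n, uint32_t d, uint32_t *rem)
{
    uint64_t q = 0;
    uint64_t r = 0;   /* 64 bits wide so the shift below cannot overflow */
    int i;

    for (i = 63; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);   /* bring down the next bit */
        if (r >= d) {
            r -= d;
            q |= 1ULL << i;
        }
    }
    if (rem)
        *rem = (uint32_t)r;
    return q;
}
```

When the divisor is known to be a power of two (a chunk size, say), a plain shift is simpler and faster than either approach.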

Tools status

  1. XFS works well.
  2. ext[23] need to be created with `mke2fs -b 4096`; otherwise they get the size wrong. These filesystems start by creating a bitmap, with one bit per block, so you'd better have enough physical memory plus swap to allow this and have some left over. You need several hundred megabytes of memory per terabyte of disc. You can of course use swap space to extend this, but the result is that mkfs and fsck are very slow.
  3. Fixes for the latest JFS tools are in the standard tree; JFS appears to work well on 32-bit platforms, but not on IA64 if your page size is not 4k.
  4. Reiserfs version 3.2 seems to work.
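For the ext[23] memory requirement mentioned in the list above, the size of a single one-bit-per-block bitmap is easy to estimate. The function name `block_bitmap_bytes` is hypothetical. The sketch covers one bitmap only; the tools keep more than one such structure in memory, so it gives a lower bound and does not try to reproduce the "several hundred megabytes" figure.

```c
#include <stdint.h>

/* Bytes needed for a bitmap holding one bit per filesystem block.
 * block_size must be non-zero. */
static uint64_t block_bitmap_bytes(uint64_t fs_bytes, uint32_t block_size)
{
    uint64_t blocks = fs_bytes / block_size;

    return (blocks + 7) / 8;   /* round up to whole bytes */
}
```

For a filesystem of 2^40 bytes, 4096-byte blocks need 32MB per bitmap, while 1024-byte blocks need 128MB, so the block size chosen at mkfs time has a direct effect on how much memory the tools want.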

Test Program

exercise.c is a simple test program that seeks to a set of random locations in a file, writes random data at those locations, then reads the data back and compares it with what was written.

You can use this program either on a block device directly, or by creating a whole heap of (large) files on a mounted filesystem, then running instances of it on each.

The program writes a line of characters, scaled to the width of your screen, which maps onto the test file or device. As more tests are run the characters get denser and denser, until a line of # characters implies at least 12 writes to each section.

Mailing List

There's a mailing list hosted at http://www.gelato.unsw.edu.au/mailman/listinfo/lbd for discussion of experiences and issues with the large block device patch.

Related Work (Links)

Large File work: http://www.suse.de/~aj/linux_lfs.html

IA64wiki: LargeBlockDevices (last edited 2009-12-10 03:13:56 by localhost)
