Large File System support in Linux 2.5.x

This page is no longer maintained!
Its contents are now on the Wiki

Large block support is now in Linux 2.5! If you want it for Linux 2.4, see the Gelato@UNSW download page.

Motivation

It looks like we'll have tens of terabytes in a single package by 2012. You can already buy a RAID controller that presents 10TB as a single disc, or build one for under $5k/TB. Could Linux work with such a device?

The problems

Although many of the filesystems in common use can cope with large files and large disc volumes, the Linux kernel limits the size of a filesystem, and the maximum size of a file, in various places. In effect, under current Linux (2.5.x, x < 28), the maximum filesystem size is about 1TB, and the maximum file size is just under 2TB (actually, 0x1fffffff000 bytes) on a 32-bit machine with 4k pages. It turns out that this limitation extends to 64-bit platforms as well, because ints are used where there should be longs.
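The two figures can be checked with a little arithmetic. The sketch below assumes the 1TB figure corresponds to a signed 32-bit count of 512-byte blocks, and simply confirms that 2TB minus one 4k page is the hex value quoted; the function and macro names are made up for illustration and are not kernel symbols.

```c
#include <stdint.h>

#define SECTOR_BYTES 512ULL   /* block size used by the gendisk layer */
#define PAGE_BYTES_4K 4096ULL /* page size assumed in the text */

/* Bytes addressable with a signed 32-bit count of 512-byte blocks. */
uint64_t max_fs_bytes_signed32(void)
{
    return (1ULL << 31) * SECTOR_BYTES; /* 2^40 bytes, about 1TB */
}

/* 2TB minus one 4k page: the file-size figure quoted in the text. */
uint64_t max_file_bytes_quoted(void)
{
    return (1ULL << 41) - PAGE_BYTES_4K;
}
```

The second function only reproduces the quoted number; it makes no claim about which kernel structure imposes the limit.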

Gendisk layer

The size of a partition is limited to 2^31 blocks for most partitioning schemes, and to 2^32 for those schemes (e.g., ultrix) that use unsigned 32-bit numbers. A block is 512 bytes, a value embedded in the interfaces to read_dev_sector() etc. as a bare number, not even as a symbol! This makes using large blocks to get around a 32-bit block number somewhat problematic.
A different type is needed for block numbers/offsets.
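To illustrate why the type matters, here is a minimal user-space sketch of the wrap-around that occurs when block numbers and byte offsets are held in 32 bits. The type name big_sector_t and the helper functions are hypothetical, chosen only for this example.

```c
#include <stdint.h>

/* Hypothetical 64-bit block-number type for illustration. */
typedef uint64_t big_sector_t;

/* Byte offset computed in 32-bit arithmetic: wraps at 4GB. */
uint32_t byte_offset_32(uint32_t sector)
{
    return sector * 512u;
}

/* The same computation carried out in 64 bits. */
uint64_t byte_offset_64(big_sector_t sector)
{
    return sector * 512u;
}

/* Can a device of nr_sectors 512-byte blocks be described by a signed int? */
int fits_signed32(big_sector_t nr_sectors)
{
    return nr_sectors <= (big_sector_t)INT32_MAX;
}
```

For example, a 10TB device has 19531250000 blocks of 512 bytes, so fits_signed32() returns 0 for it, and byte_offset_32(0x00800000) wraps to 0 where the 64-bit version gives 4294967296.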

We're going to need new partitioning schemes for large multi-terabyte discs. None of the existing schemes will work well for multi-terabyte discs with small physical block sizes, but the EFI partitioning scheme may be OK if it's adopted widely.

The blkpg interface measures sizes in bytes and holds them in a long long, limiting the size of a disc to 2^63 bytes (9 EB), which should be adequate :-)

Internally almost everything seems to be measured in 512-byte or 1k units. The maximum logical blocksize is the same as the page size.
Note this may have implications for Lucy's work on multiple-page sizes.

LVM

Linux LVM copes with up to 1EB (with 16G physical extent size). However, the generic kernel limits will apply.

SCSI

SCSI-3 allows 64-bit logical block addresses. These are not yet used by Linux.

struct scsi_disk: the capacity field, which holds the size in blocks, needs to be unsigned long or uint64_t (it is currently unsigned int). Raw SCSI-2 uses a 32-bit unsigned number of blocks and a 32-bit logical sector size.

It may be worth starting to use the read16/write16 commands for big discs, which allow up to 9 EB.
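The difference between the 10-byte and 16-byte read commands comes down to the width of the logical block address field in the command descriptor block (CDB). The following is a sketch of how the two CDBs could be filled in from user space, using the standard opcodes 0x28 for READ(10) and 0x88 for READ(16); the function names are invented for this example and are not the kernel's own helpers.

```c
#include <stdint.h>
#include <string.h>

/* Fill a READ(10) CDB. Returns -1 if the LBA needs more than 32 bits. */
int build_read10(uint8_t cdb[10], uint64_t lba, uint16_t nblocks)
{
    if (lba > UINT32_MAX)
        return -1;
    memset(cdb, 0, 10);
    cdb[0] = 0x28;                     /* READ(10) opcode */
    cdb[2] = (uint8_t)(lba >> 24);     /* 32-bit LBA, big-endian */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(nblocks >> 8);  /* 16-bit transfer length */
    cdb[8] = (uint8_t)nblocks;
    return 0;
}

/* Fill a READ(16) CDB, which carries a 64-bit LBA. */
void build_read16(uint8_t cdb[16], uint64_t lba, uint32_t nblocks)
{
    int i;

    memset(cdb, 0, 16);
    cdb[0] = 0x88;                     /* READ(16) opcode */
    for (i = 0; i < 8; i++)            /* bytes 2..9: LBA, big-endian */
        cdb[2 + i] = (uint8_t)(lba >> (56 - 8 * i));
    for (i = 0; i < 4; i++)            /* bytes 10..13: transfer length */
        cdb[10 + i] = (uint8_t)(nblocks >> (24 - 8 * i));
}
```

A caller would try build_read10() first and fall back to build_read16() only when it returns -1, so that older devices that do not implement the 16-byte commands keep working.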

ATA disc subsystem

TODO.

Filesystems

The VFS seems to use unsigned long where appropriate, so there are no 32-bit limitations on 64-bit machines (or if there are, they're bugs).

Most of the interfaces are in terms of loff_t which is 64-bit on all platforms.
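The user-space counterpart of loff_t is off_t. As a sketch, assuming glibc's _FILE_OFFSET_BITS feature macro (which this page does not itself mention), a program can ask for a 64-bit off_t even on a 32-bit machine and then seek to offsets beyond 2GB. The helper names are illustrative only.

```c
#define _FILE_OFFSET_BITS 64 /* must precede every #include */
#include <sys/types.h>
#include <unistd.h>

/* Width of off_t in bits, as seen by this translation unit. */
int off_t_bits(void)
{
    return (int)(sizeof(off_t) * 8);
}

/* Seek to an absolute byte offset; returns the new offset, or -1 on error. */
off_t seek_to(int fd, unsigned long long byte_offset)
{
    return lseek(fd, (off_t)byte_offset, SEEK_SET);
}
```

Whether a seek or write at a large offset then succeeds depends on the filesystem underneath, which is the subject of the sections that follow.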

Because fsck would take so long, it's unlikely that a non-journalled filesystem would be used on a large partition/logical volume.

NFS

NFS version 2 uses a 32-bit field for file sizes and offsets; NFS version 3 can use 64-bit sizes and offsets, so use NFSv3 for large file system work.

ext[23]

With an 8k block size, a partition can be up to 32TB and a file up to 2TB. The standard maximum block size is 4k; it can't be bigger than PAGE_CACHE_SIZE (currently 4k on most 32-bit platforms; 16k on Itanium by default, but it can go to 64k on the McKinley architecture).

ReiserFS

ReiserFS version 3.x is limited to a 17.6TB partition, as it uses 4k blocks and 32-bit block numbers. The maximum file offset supported is 2TB minus one block (0x1fffffff000). However, a bug (where?) allows seeks beyond this (to 0x7FFFFFFFFFFFFFFF), just not writes. It also turns out that there is a limit on the amount of memory that can be allocated in-kernel, and large ReiserFS filesystems will overflow it.

jfs

JFS supports up to 4PB with 4k blocks, or 512TB with 512-byte blocks.
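The partition limits quoted for ext[23], ReiserFS and JFS all have the form (number of addressable blocks) x (block size). A small sketch of that calculation follows; the 32-bit block-number width for ext[23] and ReiserFS and the 40-bit width for JFS are inferred here from the figures quoted on this page rather than taken from the filesystem sources, and max_fs_bytes() is a made-up helper.

```c
#include <stdint.h>

/*
 * Largest filesystem, in bytes, for a given block-number width and
 * block size. Returns 0 if the result would not fit in 64 bits.
 */
uint64_t max_fs_bytes(unsigned block_nr_bits, uint64_t block_size)
{
    uint64_t nblocks;

    if (block_nr_bits >= 64)
        return 0;
    nblocks = (uint64_t)1 << block_nr_bits;
    if (block_size != 0 && nblocks > UINT64_MAX / block_size)
        return 0; /* product would overflow */
    return nblocks * block_size;
}
```

With these assumptions, max_fs_bytes(32, 8192) gives 2^45 bytes (32TB, the ext[23] figure), max_fs_bytes(32, 4096) gives 17592186044416 bytes (the 17.6TB ReiserFS figure), and max_fs_bytes(40, 4096) and max_fs_bytes(40, 512) give 2^52 and 2^49 bytes (the 4PB and 512TB JFS figures).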

xfs

XFS uses the full 64-bit space and is limited to 9EB filesystems, but files are limited to 0x1FFFFFFFFFFF bytes (32TB). At present XFS looks like the most appropriate filesystem for large file work (but check out JFS, ReiserFS version 4, and ext3 with large block sizes).

The Patch

A patch is available here that does the following:
Peter Chubb
Last modified: Fri Nov 22 13:58:47 EST 2002