Linux AIO (sometimes known as KAIO or libaio) is something of a black art where experienced practitioners know the pitfalls but for some reason it’s taboo to tell someone about gotchas they don’t already know. From scratching around on the web and experience I’ve come up with a few examples where Linux’s asynchronous I/O submission via `io_submit()` may become (silently) synchronous, thereby turning it into a blocking (i.e. no longer fast) call:
- You’re submitting buffered (aka non-direct) I/O. You’re at the mercy of Linux’s caching and your submit can go synchronous when:
  - What you’re reading isn’t already in the “read cache”.
  - The “write cache” is full and the new write request can’t be accepted until some existing writeback has been completed.
- You’re asking for direct I/O to a file in a filesystem but for whatever reason the filesystem decides to ignore the `O_DIRECT` “hint” (e.g. how you submitted the I/O didn’t meet `O_DIRECT` alignment constraints, or the filesystem or the particular filesystem’s configuration doesn’t support `O_DIRECT`) and it chooses to silently perform buffered I/O instead, resulting in the case above. (See the sketch just after this list for an aligned `O_DIRECT` submission.)
- You’re doing direct I/O to a file in a filesystem but the filesystem has to do a synchronous operation (such as reading metadata/updating metadata via writeback) in order to fulfill your I/O. A common example of this is issuing an “allocating write” (e.g. because you’re appending to/extending the end of a file or filling in an unallocated hole) and this sounds like what the questioner is doing (“appended to the file”). Some filesystems such as XFS try harder to provide good AIO behaviour, but even there a user has to be careful to avoid sending certain operations to the filesystem in parallel, otherwise `io_submit()` will again turn into a blocking call while the other operation completes. The Seastar framework contains a small lookup table of filesystem-specific cases.
- You’re submitting too much outstanding I/O. Your disk/disk controller will have a maximum number of I/O requests that can be processed at the same time and there are maximum request queue sizes for each specific device within the kernel (see the `/sys/block/[disk]/queue/nr_requests` documentation and the un(der)documented `/sys/block/[disk]/device/queue_depth`). Making I/O requests back up and exceed the size of the kernel queues leads to blocking.
  - If you submit I/Os that are “too large” (e.g. bigger than `/sys/block/[disk]/queue/max_sectors_kb`, but the true limit may be something smaller like 512 KiB) they will be split up within the block layer and go on to chew up more than one request.
  - The system-global maximum number of concurrent AIO requests (see the `/proc/sys/fs/aio-max-nr` documentation) can also have an impact, but the result will be seen in `io_setup()` rather than `io_submit()`.
- A layer in the Linux block device stack between the request and the submission to the disk has to block. For example, things like Linux software RAID (md) can make I/O requests passing through it stall while updating RAID 1 metadata on individual disks.
- Your submission causes the kernel to wait because:
  - It needs to take a particular lock (e.g. `i_rwsem`) that is in use.
  - It needs to allocate some extra memory or page something in.
- You’re submitting I/O to a file descriptor that’s not a “regular” file or a block device (e.g. your descriptor is a pipe or a socket).
The list above is not exhaustive.
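To make the `O_DIRECT` alignment point concrete, here’s a minimal sketch of submitting one aligned direct read through libaio (compile with `-laio`). The file name `datafile` and the 4096‑byte alignment are illustrative assumptions — the real required alignment depends on the filesystem and the device’s logical block size (newer kernels can report it via `statx()`):

```c
/* Sketch: an aligned O_DIRECT read submitted via Linux AIO (libaio).
 * Assumes 4096-byte alignment is acceptable to the device/filesystem. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx = 0;
    int ret = io_setup(32, &ctx);          /* may fail if fs.aio-max-nr is exhausted */
    if (ret < 0) {
        fprintf(stderr, "io_setup: %s\n", strerror(-ret));
        return 1;
    }

    int fd = open("datafile", O_RDONLY | O_DIRECT);   /* hypothetical file */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Buffer, offset and length must all respect the O_DIRECT alignment,
     * otherwise some filesystems silently fall back to buffered I/O. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) {
        perror("posix_memalign");
        return 1;
    }

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);  /* read 4096 bytes at offset 0 */

    ret = io_submit(ctx, 1, cbs);          /* the call this answer is about */
    if (ret != 1) {
        fprintf(stderr, "io_submit: %s\n", ret < 0 ? strerror(-ret) : "short submit");
        return 1;
    }

    struct io_event ev;
    if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
        printf("read returned %ld\n", (long)ev.res);

    free(buf);
    close(fd);
    io_destroy(ctx);
    return 0;
}
```

Even when this submits without blocking, the earlier caveats (allocating writes, queue limits, metadata reads) still apply — the sketch only shows the mechanics, not a guarantee of asynchrony.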
With >= 4.14 kernels the `RWF_NOWAIT` flag can be used to make some of the blocking scenarios above noisy. For example, when using buffering and trying to read data not yet in the page cache, the `RWF_NOWAIT` flag will cause submission to fail with `EAGAIN` when blocking would otherwise occur. Obviously you still a) need a 4.14 (or later) kernel that supports this flag and b) have to be aware of the cases it doesn’t cover. I notice there are patches that have been accepted or are being proposed to return `EAGAIN` in more scenarios that would otherwise block, but at the time of writing (2019) `RWF_NOWAIT` is not supported for buffered filesystem writes.
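As a rough sketch of how the flag is applied (assuming a libaio new enough to expose the `aio_rw_flags` field on `struct iocb`, roughly 0.3.111+, and `RWF_NOWAIT` from `<linux/fs.h>`): set it per-iocb before submission. Depending on kernel version the `EAGAIN` can surface either from `io_submit()` itself or in the completion event’s `res` field, so check both:

```c
/* Sketch only: per-iocb RWF_NOWAIT with libaio on a >= 4.14 kernel. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <libaio.h>
#include <linux/fs.h>      /* RWF_NOWAIT */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx = 0;
    char buf[4096];
    int ret = io_setup(8, &ctx);
    if (ret < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-ret)); return 1; }

    int fd = open("datafile", O_RDONLY);   /* hypothetical file, buffered I/O */
    if (fd < 0) { perror("open"); return 1; }

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, sizeof(buf), 0);
    cb.aio_rw_flags = RWF_NOWAIT;          /* fail with EAGAIN rather than block */

    ret = io_submit(ctx, 1, cbs);
    if (ret == -EINVAL) {
        fprintf(stderr, "kernel/filesystem doesn't support RWF_NOWAIT here\n");
    } else if (ret == -EAGAIN) {
        fprintf(stderr, "submission would have blocked\n");
    } else if (ret == 1) {
        struct io_event ev;
        if (io_getevents(ctx, 1, 1, &ev, NULL) == 1) {
            if ((long)ev.res == -EAGAIN)
                fprintf(stderr, "data not ready; doing this I/O would block\n");
            else
                printf("read returned %ld\n", (long)ev.res);
        }
    }

    close(fd);
    io_destroy(ctx);
    return 0;
}
```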
Alternatives
If your kernel is >= 5.1, you could try using `io_uring` which does far better at not blocking on submission (it’s an entirely different interface and was new in 2019).
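For a flavour of the interface, here’s a minimal read through liburing (an assumption on my part that you have liburing installed; link with `-luring`, file name hypothetical). It uses `IORING_OP_READV` which has been there since 5.1:

```c
/* Sketch: one read submitted and reaped through io_uring via liburing. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    int ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    int fd = open("datafile", O_RDONLY);        /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);   /* IORING_OP_READV at offset 0 */
    io_uring_submit(&ring);                     /* submission doesn't wait for the I/O */

    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {  /* reap the completion */
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```

This is only a sketch of the interface shape; see the liburing man pages for real usage.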
References
- The AIOUserGuide has a “Performance considerations” section that warns about some `io_submit()` blocking/slowness situations.
- A good list of Linux AIO pitfalls is given in the “Performance issues” section of the README for the ggaoed AoE target.
- The “sleeps and waits during io_submit” XFS mailing list thread hints at some AIO queue constraints.
- The “io_submit() blocks for writes for substantial amount of time” XFS mailing list thread has a warning from Dave Chinner that when an XFS filesystem becomes more than 85-90% full, the chances of unpredictable filesystem delays increase the closer you get to `ENOSPC` due to lack of large amounts of contiguous free space.
- The “[PATCH 1/1 linux-next] ext4: add compatibility flag check to the patch” LKML thread has a reply from Ext4 lead dev Ted Ts’o talking about how filesystems can fall back to buffered I/O for `O_DIRECT` rather than failing the `open()` call.
- In the “ubifs: Allow O_DIRECT” LKML thread Btrfs lead developer Chris Mason states Btrfs resorts to buffered I/O when `O_DIRECT` is requested on compressed files.
- ZFS on Linux 0.8.0 changed ZoL’s behaviour from erroring on `O_DIRECT` to “accepting” it by falling back to buffered I/O (see point 3 in the commit message). There’s further discussion from the lead-up to the commit in the ZFS on Linux “Direct IO” GitHub issue. In the “NVMe Read Performance Issues with ZFS (submit_bio to io_schedule)” issue someone suggests they are getting closer to submitting a change that enables a proper zero-copy `O_DIRECT`. If such a change were accepted, it would end up in some future version of ZoL greater than 0.8.2.
- The Ext4 wiki has a warning that certain Linux implementations (Which?) fall back to buffered I/O when doing `O_DIRECT` allocating writes.
- The 2004 Linux Scalability Effort page titled “Kernel Asynchronous I/O (AIO) Support” has a list of things that worked and things that did not work with Linux AIO (a bit old but a quick reference).
- In the “io_submit() taking a long time to return ?” linux-aio mailing list thread Zach Brown explains that, for a file in a filesystem, finding out which blocks to issue async I/O against requires a synchronous read of metadata (but that data is usually already cached).
- The cover letter “[PATCH 0/10 v13] merge request: No wait AIO” on the LKML lists reasons for `io_submit()` delays. There’s also an LWN article talking about an earlier version of the no-wait AIO patch set and some of the cases it doesn’t cover (but note that buffered reads were covered by it in the end).
- “io_submit blocking” search on the linux-aio mailing list
Related:
- Linux AIO: Poor Scaling
- io_submit() blocks until a previous operation will be completed
- buffered asynchronous file I/O on linux (but stick to the bits explicitly talking about Linux kernel AIO)
Hopefully this post helps someone (and if it does help you, could you upvote it? Thanks!).