When a process reads from a socket or writes to a disk, the visible simplicity of a read() or write() syscall conceals an intricate machinery operating beneath the surface. The Linux kernel's I/O subsystem is one of the most sophisticated components of any operating system, coordinating hardware interrupts, memory-mapped buffers, scheduler interactions, and multiple layers of abstraction -- all while trying to minimize latency and maximize throughput simultaneously.
This article explores kernel-level I/O events from the ground up: how they originate, how the kernel tracks and dispatches them, the evolution from blocking I/O to fully asynchronous interfaces, and the internals that make high-performance I/O possible.
The Anatomy of an I/O Request
Every I/O operation, regardless of how it is initiated from userspace, eventually becomes a request that travels down a well-defined path inside the kernel.
VFS: The Universal Entry Point
The Virtual Filesystem Switch (VFS) is the first abstraction layer every I/O request encounters. When a process calls read(), control transfers via the syscall table to ksys_read(), which resolves the file descriptor to a struct file, then invokes the file's f_op->read_iter() method. The VFS makes no assumptions about what lies below -- the backing store could be an ext4 partition, a network socket, a pipe, or a character device. Each inode type registers its own file_operations, and the VFS dispatches accordingly.
At this layer, the kernel also performs permission checks, updates access timestamps, handles O_APPEND semantics atomically, and manages the file position. The actual I/O has not yet started.
The Page Cache
For regular files, the kernel does not issue a direct read to the block device. Instead it consults the page cache -- an XArray of 4 KB (or huge) pages indexed by file offset, anchored to the file's address_space structure. (Prior to Linux 4.20, this was implemented as a radix tree; the XArray replaced it with a cleaner, internally-locked API.) If the page is already in cache and up to date, the data is copied directly to userspace and the syscall returns. This is a purely memory operation; no I/O event in the hardware sense occurs at all.
A cache miss triggers the readahead path. The kernel allocates pages, locks them, and submits a bio (block I/O) structure to the block layer -- typically reading ahead of the requested offset to amortize future sequential access. The process is then put to sleep on a wait queue associated with the page until the I/O completes.
Dirty pages flow the other direction: a write() modifies pages in the cache and marks them dirty, returning to userspace almost immediately. The kernel's per-device writeback flusher threads (successor to the older pdflush pool, redesigned in Linux 2.6.32 by Jens Axboe) later flush dirty pages in the background according to vm.dirty_ratio and vm.dirty_background_ratio tunables. The physical I/O is decoupled from the userspace call.
The Block Layer
When a read miss or explicit writeback requires actual device I/O, the kernel constructs a bio structure defined in <linux/bio.h>. A bio represents a single I/O operation -- a list of memory pages, a block device, a sector offset, and a direction (read or write). Importantly, a bio knows nothing about files or filesystems; it speaks only in sectors.
Request Queues and I/O Schedulers
Historically, bios were merged into struct request objects and placed onto a per-device request_queue. I/O schedulers -- CFQ, Deadline, NOOP -- reordered these requests to minimize disk head seek time. This model made sense for rotational drives where seek latency is a dominant cost.
With NVMe SSDs and other low-latency devices, the single-queue model became a bottleneck because all CPU cores contended on a single spinlock protecting the queue. Linux 3.13 introduced blk-mq (Multi-Queue Block I/O). Under blk-mq, each CPU or set of CPUs has its own submission queue, eliminating the global lock. Requests are dispatched to per-hardware-queue dispatch queues and submitted to the driver with minimal synchronization overhead. Modern NVMe drives expose 64 or more hardware queues, and blk-mq maps software queues to hardware queues to fully exploit this parallelism.
The scheduler under blk-mq (mq-deadline, BFQ, or none) runs per-queue and can still reorder requests, but the lock granularity is far finer and scalability is dramatically improved. For NVMe devices, none is often the best choice since the device itself handles request ordering internally.
Completion Events and Interrupts
When the device finishes an I/O, it raises a hardware interrupt. The interrupt handler does minimal work: it acknowledges the interrupt and schedules a softirq or calls blk_mq_complete_request(). The completion handler invokes the bio's bi_end_io callback (chained bios propagate completion up the chain). For page cache reads, this callback marks the page uptodate, clears the locked bit, and wakes all processes sleeping on the page's wait queue. Those processes are moved to the run queue and will eventually be scheduled to copy the data to their userspace buffers and return from read().
This interrupt-to-wakeup path is the fundamental unit of an I/O event at the kernel level.
I/O Notification Interfaces
The blocking model is efficient for simple sequential access but inadequate for servers managing thousands of concurrent connections. The kernel provides progressively more powerful interfaces for monitoring I/O readiness across many file descriptors simultaneously.
select and poll
select() and poll() are the oldest multiplexing interfaces. Both accept a set of file descriptors and block until at least one becomes ready for the requested operation. The kernel iterates through every file descriptor in the set, calling each file's poll operation (defined in file_operations) to check readiness and register a wait entry with the file's internal wait queue. If no descriptor is ready, the process sleeps. When any descriptor becomes ready, it wakes the process, which must then re-scan the entire set to find which one.
The fundamental problem is O(n) complexity per wakeup. For a server with 10,000 connections, select scans 10,000 descriptors on every call, and the entire fd_set must be copied between kernel and userspace each time. This is the "C10K problem" that drove the development of epoll.
epoll: Edge and Level Triggering
epoll maintains a kernel-side data structure (a red-black tree) of registered file descriptors and an internal wait queue. Registration via epoll_ctl(EPOLL_CTL_ADD) is a one-time O(log n) operation. The kernel attaches a callback to each registered file's internal wait queue. When readiness changes -- a socket receives data, a pipe becomes writable -- the kernel's wait queue infrastructure invokes the callback, which adds the event to a separate ready list. epoll_wait() simply harvests this ready list, achieving O(1) event retrieval regardless of the total number of monitored descriptors.
The edge-triggered mode (EPOLLET) deserves particular attention. In level-triggered mode (the default), the kernel reports readiness as long as the condition persists. If a socket has 1 KB of data and the application reads only 512 bytes, the next epoll_wait call will immediately report the socket as readable again. In edge-triggered mode, the kernel reports an event only when the readiness state transitions -- from not-readable to readable. The application must drain the socket completely (reading until EAGAIN) on each notification, but this eliminates spurious wakeups and is substantially more efficient in high-throughput scenarios.
```c
// Create epoll instance
int epfd = epoll_create1(0);

// Register a socket with edge-triggered mode
struct epoll_event ev;
ev.events = EPOLLIN | EPOLLET;
ev.data.fd = sockfd;
epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);

// Event loop
struct epoll_event events[MAX_EVENTS];
int nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);
for (int i = 0; i < nfds; i++) {
    // Edge-triggered: must drain completely (until EAGAIN)
    while (read(events[i].data.fd, buf, sizeof(buf)) > 0)
        process(buf);
}
```
EPOLLONESHOT takes this further: after an event fires, the kernel automatically removes the descriptor from the interest set until it is explicitly re-armed with EPOLL_CTL_MOD. This is useful in multi-threaded servers where you do not want multiple threads woken for the same descriptor simultaneously.
EPOLLERR and EPOLLHUP are always monitored regardless of the flags you specify. Internally, epoll stores each registered descriptor as an epitem structure in the red-black tree, keyed by (fd, file*) pair, allowing the same file to be registered in multiple epoll instances.
signalfd, timerfd, eventfd
Linux unifies disparate event sources into file descriptors via the *fd family. timerfd wraps a POSIX-style timer into a file descriptor that becomes readable when the timer fires, monitorable by epoll exactly like a network socket. signalfd turns signal delivery into readable events. eventfd provides a lightweight 64-bit counter that processes can increment and epoll can monitor, serving as a cheap inter-thread or inter-process notification channel. These interfaces allow a single epoll loop to monitor timers, signals, sockets, files, and inter-thread events with a unified code path.
io_uring: True Asynchronous I/O
The interfaces described above are readiness-based: they tell you when a file descriptor is ready, after which you perform a normally-blocking syscall that is expected to return immediately. io_uring, introduced in Linux 5.1 by Jens Axboe, takes a fundamentally different approach: completion-based asynchronous I/O where you submit operations and later receive completions, with no guarantee or requirement that the calling thread blocks at all.
Ring Buffer Architecture
io_uring creates two lock-free ring buffers shared between kernel and userspace via mmap:
The Submission Queue (SQ) is written by userspace and read by the kernel. Each entry (struct io_uring_sqe) describes an operation -- its opcode, file descriptor, buffer address, offset, and flags. The ring's head and tail are atomic variables; userspace advances the tail after writing entries, and the kernel advances the head after consuming them.
The Completion Queue (CQ) is written by the kernel and read by userspace. Each entry (struct io_uring_cqe) contains a result code and a user-supplied 64-bit identifier correlating the completion to the original submission. The kernel advances the tail; userspace advances the head.
Crucially, for many operations the kernel can complete the I/O and post a CQE without requiring a syscall for the completion path. The application can poll the CQ tail in a tight loop entirely in userspace.
```c
/* Submission side (userspace) */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, len, offset);
sqe->user_data = MY_REQUEST_ID;
io_uring_submit(&ring);

/* Completion side (userspace) */
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
int result = cqe->res;         // bytes read or error
__u64 id = cqe->user_data;     // correlates to submission
io_uring_cqe_seen(&ring, cqe); // advance CQ head
```
Submission Paths and Kernel Threads
io_uring_enter() is the syscall that notifies the kernel of new submissions and optionally waits for completions. However, two features can eliminate this syscall entirely.
IORING_SETUP_SQPOLL spawns a dedicated kernel thread that continuously polls the SQ for new entries. The application writes to the SQ ring and the kernel thread picks up entries without any syscall crossing. This is extremely valuable for NVMe I/O where the round-trip through io_uring_enter() would be a significant fraction of total latency.
IORING_SETUP_IOPOLL enables polling on the completion side for block I/O, bypassing interrupt-driven completion entirely. Instead of the device raising an interrupt and the kernel dispatching it through the softirq path, the kernel thread polls the NVMe completion queue directly. For very low-latency NVMe devices, this can reduce I/O latency by eliminating interrupt handling overhead, at the cost of consuming CPU cycles.
Fixed Resources and Zero-Copy
io_uring allows pre-registration of buffers and file descriptors via the io_uring_register() syscall (IORING_REGISTER_BUFFERS and IORING_REGISTER_FILES). Registered buffers are pinned in memory and their virtual-to-physical page mappings are cached in the kernel, eliminating the per-I/O cost of page-pinning and DMA setup. Registered file descriptors avoid the file table lookup on each operation. Together these features amortize per-operation overhead across many I/Os, a significant optimization for small, high-frequency operations.
io_uring also supports fixed buffers with IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED, where the kernel performs DMA directly into or from the registered buffer without intermediate copies.
Linked Operations and Multi-Shot
Submission queue entries can be linked with IOSQE_IO_LINK: the kernel will not start a linked entry until the previous one completes successfully. This allows expressing dependency chains -- open a file, read from it, close it -- as a single batch submission with no round-trips between kernel and userspace.
Multi-shot operations (introduced for accept and recv in later kernel versions) allow a single SQE to produce multiple CQEs. A multi-shot accept will repeatedly accept connections and generate completions until explicitly cancelled, removing the need to resubmit after each accepted connection.
Direct I/O and DMA
When the page cache is explicitly bypassed with O_DIRECT, the kernel maps userspace buffers directly for DMA. The I/O must meet alignment constraints: the buffer address, file offset, and transfer length must all be aligned to the block device's logical block size (typically 512 bytes, often 4096 bytes for 4K-native drives).
Under O_DIRECT, read_iter() bypasses the page cache and constructs a bio directly from the userspace buffer pages (which must be locked into memory for the DMA duration). Completion still flows through the bi_end_io callback, but instead of marking a page cache page uptodate, it wakes the submitting process directly. Because the page cache is not involved, there is no readahead, no writeback coalescing, and no implicit caching -- the application takes on these responsibilities itself.
Databases like PostgreSQL and MySQL use O_DIRECT to implement their own buffer pool management without double-buffering through the page cache. If you use O_DIRECT, you are responsible for your own caching, alignment, and readahead strategies. Getting the alignment wrong will result in EINVAL errors.
Wait Queues and the Scheduler Interface
The lowest-level primitive underlying all I/O event notification is the wait queue, implemented in <linux/wait.h>. A wait queue is a linked list of wait_queue_entry_t structures, each containing a callback function and a pointer to a task. When a resource is unavailable, a task sets its state to TASK_INTERRUPTIBLE (or TASK_UNINTERRUPTIBLE), calls add_wait_queue(), and then schedule(), which removes it from the run queue. When the resource becomes available, another context -- an interrupt handler, a softirq, another process -- calls wake_up() on the wait queue, which iterates the list and calls each entry's callback.
The default callback calls try_to_wake_up(), which adds the sleeping task back to a run queue. epoll replaces this default callback with its own function that adds the relevant epitem to the ready list and wakes the process sleeping in epoll_wait(). This is how the kernel's generic wakeup mechanism is extended to implement epoll's event model.
wake_up_all() wakes every waiter, suitable for events where all waiters can proceed. wake_up() wakes all non-exclusive waiters but stops after the first waiter flagged WQ_FLAG_EXCLUSIVE, implementing a form of thundering-herd prevention for scenarios where only one task should handle the event.
io_uring vs. epoll: Architectural Comparison
These two interfaces represent different philosophies. epoll is readiness-based and integrates naturally with blocking syscalls. It is mature, well-understood, and adequate for many server workloads. Its overhead per event is low but nonzero: at minimum a syscall to epoll_wait, a syscall to read/write, and the associated context switches when the waiting thread must sleep.
io_uring is completion-based and can approach zero syscall overhead with SQPOLL. For workloads dominated by many small, frequent I/O operations -- a key-value store, a message broker, a database write-ahead log -- io_uring with registered buffers and SQ polling can achieve substantially higher throughput and lower latency than epoll. The tradeoff is complexity: the programming model is harder to reason about, especially with linked operations, cancellation, and multi-shot entries.
For network I/O, the gap between epoll and io_uring is narrower because socket operations rarely require SQ polling. The syscall cost of epoll_wait plus recv is small relative to network RTT. io_uring's advantages are most pronounced for storage I/O where device latency can be as low as tens of microseconds on modern NVMe hardware and kernel overhead becomes a meaningful fraction of total cost.
Observability: Tracing I/O Events
The kernel exposes I/O events through several tracing mechanisms.
Tracepoints embedded in the block layer (block:block_rq_issue, block:block_rq_complete) fire on every I/O request submission and completion. Tools like blktrace and bpftrace attach to these points and can reconstruct the full lifecycle of every block I/O on the system.
```shell
# Histogram of block I/O completion sizes (in sectors)
$ bpftrace -e 'tracepoint:block:block_rq_complete { @sectors = hist(args->nr_sector); }'

# Watch real-time I/O with blktrace
$ blktrace -d /dev/nvme0n1 -o - | blkparse -i -

# Per-process I/O latency histograms with biolatency
$ biolatency-bpfcc -D

# io_uring submission/completion tracing
$ bpftrace -e 'tracepoint:io_uring:io_uring_complete { printf("%d %d\n", args->res, args->user_data); }'
```
perf can profile I/O latency distributions using perf record with hardware PMU events or software tracepoints. iostat and /proc/diskstats expose aggregated statistics: reads/writes completed, sectors transferred, time spent in I/O, and the queue depth.
eBPF programs attached to block-layer tracepoints (or to kprobes such as blk_account_io_start and blk_account_io_done, on kernels that expose them) can compute per-process, per-device latency histograms with microsecond resolution entirely in kernel space, with negligible overhead compared to traditional sampling.
For io_uring specifically, io_uring tracepoints (io_uring:io_uring_submit_sqe, io_uring:io_uring_complete) allow reconstruction of the submission and completion timeline for every operation.
Conclusion
Linux's I/O event infrastructure has evolved from a simple interrupt-driven model to a sophisticated, multi-layered system that can approach the theoretical limits of modern hardware. The VFS provides uniform abstraction; the page cache amortizes device access over time; blk-mq parallelizes submission across CPU cores; epoll scales event monitoring to millions of descriptors; and io_uring collapses the kernel/userspace boundary for I/O-intensive workloads through shared ring buffers and kernel-side polling threads.
Understanding this machinery is not merely academic. The choice between epoll and io_uring, between page cache and O_DIRECT, between level and edge triggering -- these are decisions with measurable performance consequences in production systems.
The kernel's I/O path is one of the most carefully optimized pieces of software ever written, and the depth of that optimization is only visible when you trace an operation from the syscall entry all the way down to the DMA engine and back.