There is a boundary inside every Linux system that separates the ordinary from the extraordinary. On one side sits your browser, your text editor, your terminal — all of it running in user space, isolated from the hardware, protected from each other by the kernel itself. On the other side sits the kernel: a codebase that passed 40 million lines of code in early 2025, running in ring 0, touching every byte of RAM, every clock cycle, every hardware interrupt. Cross that boundary carelessly and the entire system goes down. Cross it with skill and understanding, and you can extend the most sophisticated operating system ever created without rebooting even once.

That is precisely what a Linux kernel module does. It is code you write in C, compile into a relocatable object file with a .ko extension, and inject into a live, running kernel. When loaded, your module executes with full kernel privileges. No sandbox. No memory isolation. No second chances. A rogue pointer does not throw a segmentation fault — it silently corrupts data structures, scrambles page tables, or causes a kernel panic that drops you into a wall of red text. This is not a world for the careless, but it is absolutely a world worth entering, because the knowledge you gain from it changes how you think about every piece of software you will ever write.

This guide walks you through everything from the conceptual architecture to a working, loadable module with a /proc interface. Nothing is glossed over. Every macro, every header, every kernel API call gets an honest explanation of what it does under the hood.

Why This Matters: The Case for Kernel Module Development

Before writing a single line of code, it is worth understanding why the kernel module system was architected the way it was — because the design decision embedded in the Loadable Kernel Module (LKM) framework reflects a deep and fascinating engineering trade-off.

Linux is, at its heart, a monolithic kernel. That means the entire operating system core — the scheduler, the memory manager, the filesystem layer, the device driver framework, the network stack — all runs in the same address space at the same privilege level. Compare this to a microkernel design like GNU Hurd or early versions of Minix, where operating system services run as isolated user-space processes that communicate via message passing. Monolithic kernels are faster because there is no context-switching overhead for kernel services. Microkernels are theoretically safer because a bug in a driver cannot crash the whole system.

Linux took a pragmatic middle ground. The kernel is monolithic for performance, but it supports dynamic loading of modules — separate compiled units that can be inserted into and removed from the running kernel without a reboot. This gives you the performance of a monolithic design with some of the flexibility of a modular one. The catch is that a module becomes part of the kernel the moment it loads. It shares address space with everything. It can call any exported kernel function. It can read and write kernel memory. There is no enforced isolation whatsoever.

Mental Model: Process Isolation vs Kernel Address Space

In user space, the kernel protects each process from every other process. Your process has its own virtual address space, and a bad pointer causes a segfault that kills only you. The MMU enforces the boundary. When you write kernel module code, you have crossed that boundary and the MMU no longer protects you from yourself. Every kernel module shares the same address space. A rogue pointer in your module does not segfault and recover — it silently writes to kernel data structures, corrupts the page tables of another process, or overwrites a function pointer in a driver running on a different CPU. The analogy is not a program crashing; it is a program silently modifying another program's memory while both run. That is why every pointer dereference in kernel code deserves more scrutiny than it would in user-space code. This connects directly to why copy_to_user/copy_from_user exist and why they are non-negotiable — see The Copy Barrier.

This architecture is why nearly everything you interact with in Linux is implemented as a kernel module: USB drivers, filesystem drivers (ext4, btrfs, NTFS), network protocols, even some security modules. When you plug in a USB device and it just works, what you are witnessing is the kernel dynamically loading the appropriate driver module, initializing it, and routing I/O through it — all while the system is running, all without you doing a thing.

What You Will Need

Before getting started, confirm your environment has the following:

A Linux system. Not WSL (Windows Subsystem for Linux) — while WSL 2 runs a real Linux kernel, module loading is typically disabled or restricted. Use a native Linux install or a virtual machine running Ubuntu 22.04 LTS or 24.04 LTS, Debian 12, or Fedora 40+. If you are concerned about crashing your daily driver, a VM is the right choice. As the Linux Kernel Module Programming Guide notes, if you mess anything up you can easily reboot or restore the virtual machine. That is good advice.

Kernel headers for your running kernel. The build system needs the header files that match your exact kernel version to compile a module correctly.

Debian/Ubuntu
$ sudo apt-get install linux-headers-$(uname -r) build-essential
Fedora/RHEL
$ sudo dnf install kernel-devel kernel-headers gcc make

Know your kernel version. Write it down.

$ uname -r

Every .ko file you build is tied to this specific kernel version through a mechanism called vermagic, which we will examine in detail shortly.

Note: Lockdown and Secure Boot

Two additional gatekeepers exist that are easy to overlook. First, the kernel has a /proc/sys/kernel/modules_disabled sysctl — once written to 1, module loading is permanently disabled for the lifetime of that boot (the value cannot be reset to 0 without a reboot). Some hardened distributions enable this after early boot. If insmod returns EPERM and you are sure you have root, check this sysctl. Second, systems booted with Secure Boot enabled and the kernel's lockdown mode active (CONFIG_SECURITY_LOCKDOWN_LSM, enabled by default on Ubuntu, Fedora, and RHEL since roughly 2020) block loading of any module not signed with a key in the kernel's trusted keyring. You will see Operation not permitted and a kernel log line about lockdown. On your development VM, disable Secure Boot in the VM firmware settings — this is the simplest solution and the standard practice for kernel module development.

The Anatomy of a Kernel Module: What the Compiler Actually Produces

When you compile a kernel module, the output is a .ko file. A .ko file is an ELF (Executable and Linkable Format) relocatable object — specifically, Type: REL in ELF nomenclature, not EXEC (executable) or DYN (shared library). You can verify this yourself:

$ file hello.ko

Because it is a relocatable object and not a fully linked executable, its symbol references are unresolved at compile time. There is no entry point address. There are no program headers. What it has instead is a section header table describing every section in the file: .text (code), .data (initialized data), .bss (uninitialized data), .modinfo (module metadata), __versions (symbol version checksums), and .gnu.linkonce.this_module (the module's struct module anchor).

You can read every section name directly with readelf:

inspect .ko sections
$ readelf -S hello.ko | grep -E '\[|Name'
# Look for: .text .data .bss .modinfo __versions .gnu.linkonce.this_module
# Also: .rela.text (relocation table), .symtab (symbol table), .strtab (string table)

The .rela.text section is what makes relocation possible. It contains one entry for every external symbol reference in your module's code — each entry records the offset within .text that needs to be patched, the symbol it references, and the relocation type (e.g., R_X86_64_PC32 for a 32-bit PC-relative reference). When load_module() calls apply_relocations(), it walks this table and writes the runtime address of each symbol into the corresponding slot in the loaded code. Without this step your module is just a blob of code with holes where all the function calls should be.

One section that is rarely discussed in beginner guides is .gnu.linkonce.this_module. It contains a partially initialized struct module — the kernel's internal representation of a loaded module, defined in include/linux/module.h. The build system writes the module's name, its init and exit function pointers, and several bookkeeping fields into this struct at compile time. The kernel patches in the remaining fields (state, reference count, memory layout, list linkage) after loading. This struct is what /proc/modules and lsmod iterate over when they show you the loaded module list.

Note: struct module_memory — A Quiet Kernel 6.4 Change

In kernel 6.4, Song Liu replaced the flat memory layout fields in struct module with an array of struct module_memory entries — one per memory region type (MOD_TEXT, MOD_DATA, MOD_RODATA, MOD_RO_AFTER_INIT, MOD_INIT_TEXT, MOD_INIT_DATA, and others). If you are writing tooling that inspects loaded module memory layouts by walking struct module directly (as some BPF programs and out-of-tree debuggers do), the old layout fields — core_layout and init_layout on kernels 4.5 through 6.3, and the still older module_core, module_init, core_size, and init_size before that — no longer exist. You must index into the mem[] array instead. This change does not affect ordinary module code — it is only relevant if you are parsing kernel internals programmatically.

Kernel Pop Quiz
During module loading, the kernel calls apply_relocations(). What does this function actually do, and why can it not be done at compile time?
Incorrect
Signature verification is a separate step that happens earlier in load_module(), controlled by CONFIG_MODULE_SIG. apply_relocations() is specifically about resolving symbol addresses. The two mechanisms serve different purposes: signature verification is a trust check, relocation is a mechanical address-patching step that must occur for any module to be executable regardless of whether signatures are in use.
Correct
The .ko file is a relocatable ELF object — its symbol references are intentionally left as holes. At compile time the linker has no way to know where printk, kmalloc, or any other kernel function will be in memory when the module actually loads. On kernels with KASLR (Kernel Address Space Layout Randomization, CONFIG_RANDOMIZE_BASE=y), the kernel image itself is loaded at a random offset on every boot, making compile-time addresses impossible even in principle. apply_relocations() walks the .rela.text table entry by entry, looks up each symbol in the running kernel's export table (a view of which is exposed as /proc/kallsyms), and writes the live address into the appropriate byte offset in the module's loaded code.
see relocation entries before and after loading
# Before loading: holes appear as address 0 or small offsets
$ readelf -r hello.ko | grep printk
# After loading: resolved addresses visible in /proc/kallsyms
$ grep printk /proc/kallsyms | head -3
# KASLR check: kernel base shifts on each boot (run as root;
# with kptr_restrict set, non-root reads show zeroed addresses)
$ sudo grep ' _text$' /proc/kallsyms
Incorrect
The copy from user space into kernel memory happens earlier — in copy_module_from_user() inside load_module(). By the time apply_relocations() is called, the module's ELF sections are already in kernel memory. Relocation is specifically the step of writing resolved symbol addresses into the code that is already there. The distinction matters because apply_relocations() is what allows the module's code to actually call kernel functions — without it the instructions still have placeholder zeroes where the call targets should be.

When you run insmod, it passes the binary to the kernel via the finit_module() system call. This system call was added in kernel 3.8 — the original init_module() syscall required loading the entire module image into a user-space buffer first, then passing a pointer to it. finit_module() takes a file descriptor instead, which allows the kernel to determine module authenticity from the file's location in the filesystem (useful for systems that restrict module loading to a verified read-only root partition). Since kernel 5.17, finit_module() also accepts the MODULE_INIT_COMPRESSED_FILE flag, enabling the kernel to decompress a .ko.zst or .ko.xz module internally rather than requiring userspace decompression first. The kernel then performs the following operations:

  1. Reads the ELF sections and validates structure integrity.
  2. Checks the .modinfo section for the vermagic string and compares it against the running kernel's own vermagic.
  3. If CONFIG_MODVERSIONS is enabled, CRC-checks every symbol the module uses against the checksum embedded in the module's __versions section.
  4. Allocates kernel memory for the module's code and data sections using vmalloc or architecture-specific module memory allocators.
  5. Performs relocation: resolves all the module's undefined symbol references against the kernel's exported symbol table (a live view of which appears in /proc/kallsyms).
  6. Calls do_init_module(), which invokes the module's init function, transitions the module state to MODULE_STATE_LIVE, and notifies any registered module notifier callbacks.
Note

The kernel allocates module memory from a dedicated region — on x86-64 this is the module memory arena at high virtual addresses, separate from the main kernel text. The exact allocator is module_alloc(), which on older kernels called vmalloc_exec() to get memory that is both writable (for relocation patching) and executable (for running the code); on current kernels this is handled through architecture-specific mechanisms under execmem_alloc(). Either way, after relocation is complete, the kernel calls set_memory_ro() on the .text section to make it read-only — a hardening step that prevents a compromised module from modifying its own code at runtime.

The vermagic string embedded in every module looks like this:

vermagic example
vermagic: 6.5.0-44-generic SMP preempt mod_unload modversions

It encodes the kernel release string, whether SMP support is compiled in, the preemption model, whether forced unloading is permitted, and whether symbol versioning is active. If even one field mismatches, the kernel refuses to load the module. This is not pedantry — it is a safety mechanism. An interface that changed between kernel versions could cause undefined behavior, memory corruption, or a crash if the mismatch were silently ignored.

You can read the vermagic string from a compiled module without loading it:

$ modinfo -F vermagic hello.ko

And compare it directly against the running kernel's string:

$ awk '{print $3}' /proc/version
What Would Happen If
What would happen if you forced the kernel to skip the vermagic check — for example, using modprobe --force — and loaded a module compiled against a different kernel version?

Userspace allows this with modprobe --force (which sets the MODULE_INIT_IGNORE_VERMAGIC and MODULE_INIT_IGNORE_MODVERSIONS flags in the finit_module() call), and the result is genuinely unpredictable — which is precisely why the check exists. The safest outcome is an immediate kernel oops during module initialization because a function signature changed between versions and your compiled call sites pass arguments in the wrong order or with the wrong types. The more dangerous outcome is silent data corruption: a struct that grew a new field between kernel versions will have your module writing to offsets that now overlap with something else, with no error and no warning.

The kernel marks itself tainted when a force-loaded module is inserted (the F flag in /proc/sys/kernel/tainted). Kernel developers routinely disregard bug reports from tainted kernels because the taint means the system is in an undefined state. From a security standpoint, --force loading a module from a different kernel version is also a vector for bypassing security features like SMEP, SMAP, or kASLR, since the mismatched module may be built without knowing their addresses. Never use --force in production or on any system where reliability matters.

Writing Your First Module: Hello, Kernel

Create a directory for your work:

terminal
$ mkdir ~/kernel_modules && cd ~/kernel_modules
$ mkdir hello && cd hello

Create hello.c:

hello.c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Your Name");
MODULE_DESCRIPTION("A minimal kernel module");
MODULE_VERSION("1.0");

static int __init hello_init(void)
{
    pr_info("hello: module loaded\n");
    return 0;
}

static void __exit hello_exit(void)
{
    pr_info("hello: module unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);

Now create the Makefile:

Makefile
obj-m += hello.o
PWD := $(CURDIR)

all:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

Build it, then load, observe, and unload:

terminal
$ make
$ sudo insmod hello.ko
$ dmesg | tail -5
$ sudo rmmod hello
$ dmesg | tail -5
Note

If you do not see messages on your console, the console_loglevel may be set below KERN_INFO (level 6). The messages are still in the kernel ring buffer — dmesg shows them regardless of console_loglevel — but sudo dmesg -n 7 raises the console level so that everything up to KERN_DEBUG is also echoed to the console as it is logged.

Breaking Down Every Line: Nothing Is Accidental

The Headers

linux/init.h provides the __init and __exit macros. These are not cosmetic. When you mark a function with __init, the compiler places it in a special ELF section called .init.text. After the module successfully initializes, the kernel frees this section. The memory it occupied is returned to the system. This is why you cannot call an __init-marked function after initialization completes — the code is literally gone from memory. Similarly, __exit places a function in .exit.text, which on modules built into the kernel is discarded entirely because built-in code never unloads.

Two related annotations are worth knowing. __initdata does the same thing for data — variables that are only needed during initialization (lookup tables, temporary buffers used to set up hardware) can be tagged __initdata to have their memory reclaimed after init. And __ro_after_init, added in kernel 4.6, marks data that must be writable during __init but should become read-only once initialization completes — the kernel calls mark_readonly() to flip the page protection after all __init sections have run. This is used for security-sensitive configuration that should be immutable once set: hardware capability flags, cryptographic keys embedded in drivers, and similar one-time-writable state. Using __ro_after_init for your own module's post-init constants is a small but meaningful hardening step.
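As a sketch of how these annotations sit together in one module — this is kernel code, so it builds only against the kernel build system, and the names demo_init, demo_exit, and setup_table are invented for illustration:

```c
/* Fragment (kernel code): the init-time annotations in one place. */
#include <linux/init.h>
#include <linux/module.h>

/* Reclaimed after init: this 4 KiB table exists only during setup */
static u8 setup_table[4096] __initdata;

/* Writable during init, flipped read-only once init completes */
static int max_channels __ro_after_init;

static int __init demo_init(void)
{
    setup_table[0] = 0xff;   /* fine: .init.data is still mapped    */
    max_channels = 8;        /* fine: page not yet marked read-only */
    return 0;
}

static void __exit demo_exit(void)
{
    /* Touching setup_table here, or writing max_channels, would be a
       bug: .init.data was freed after init, and __ro_after_init data
       now lives on read-only pages. */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```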

linux/module.h is the master header for module infrastructure. It pulls in the MODULE_LICENSE, MODULE_AUTHOR, MODULE_DESCRIPTION, and MODULE_VERSION macros. These macros write key-value pairs into the .modinfo ELF section. You can read them without loading the module:

$ modinfo hello.ko

MODULE_LICENSE: Not Just Formality

MODULE_LICENSE("GPL") is one of the most consequential lines in your module. The Linux kernel exports some of its functions as "GPL-only," marked with EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL(). If your module declares a non-GPL license, attempting to use a GPL-only symbol will fail at load time with an "Unknown symbol" error. Many of the most useful kernel interfaces — in areas like ftrace, kprobes, and certain crypto APIs — are GPL-only.

This also affects the "tainted" state of the kernel. Loading a proprietary module taints the kernel with the P flag, recorded in kernel crash dumps and oops reports. Taint flags are visible in:

$ cat /proc/sys/kernel/tainted

Acceptable license strings include "GPL", "GPL v2", "GPL and additional rights", "Dual BSD/GPL", "Dual MIT/GPL", and "Dual MPL/GPL".

pr_info() vs printk(): Understanding the Difference

pr_info("message\n") is a macro that expands to printk(KERN_INFO "message\n"). The pr_* family of macros (pr_err, pr_warn, pr_info, pr_debug) are the modern, preferred way to emit kernel log messages. The official kernel documentation describes printk() as writing to the kernel log buffer, a ring buffer exported to userspace through /dev/kmsg and read with dmesg.

One pattern you will see throughout real kernel drivers but almost never in tutorial code is the pr_fmt macro. Defining it before including linux/printk.h (or linux/module.h, which pulls it in) causes all subsequent pr_* calls to automatically prefix their output with the string you specify — typically the module name. Without it, your log messages are easy to lose in a noisy dmesg stream:

pr_fmt prefix pattern
/* Define BEFORE any #include that pulls in linux/printk.h */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/module.h>
#include <linux/kernel.h>

/* Now pr_info("loaded\n") emits: "mymodule: loaded" */
/* KBUILD_MODNAME is set to "mymodule" by the build system */

This is the same mechanism used by dev_info() and dev_err() in driver code — those functions embed the device name automatically. For a standalone module without an associated device, pr_fmt with KBUILD_MODNAME achieves the same result.

The log levels available are (from most to least severe):

Macro          Level   Meaning
KERN_EMERG     0       System is unusable
KERN_ALERT     1       Action must be taken immediately
KERN_CRIT      2       Critical conditions
KERN_ERR       3       Error conditions
KERN_WARNING   4       Warning conditions
KERN_NOTICE    5       Normal but significant condition
KERN_INFO      6       Informational
KERN_DEBUG     7       Debug-level messages
Warning

printk() can be called from interrupt context, atomic context, and process context. Since kernel 5.10 the log ring buffer itself is lockless, but the console output path still takes locks internally, which is why calling printk() while holding certain other locks can trigger lockdep warnings or, in pathological cases, deadlocks.

Kernel Pop Quiz
A module loads successfully but immediately produces an "Unknown symbol in module" error when you try to use register_kprobe(). Your module declares MODULE_LICENSE("Proprietary"). What is the most likely cause?
Incorrect
If kprobes were absent from the kernel, you would get the same error — but "Proprietary" is a direct cause here. The kprobes symbols are available on all standard distribution kernels with CONFIG_KPROBES=y. The issue is not availability but access. A non-GPL module license blocks access to any symbol exported with EXPORT_SYMBOL_GPL(), and register_kprobe() is one of them. Changing the license declaration to "GPL" resolves it immediately without any kernel rebuild.
check kprobe symbol export type
# Confirm register_kprobe is GPL-only
$ grep 'register_kprobe' /proc/kallsyms
# Then check Module.symvers to see export type:
$ grep 'register_kprobe' /usr/src/linux-headers-$(uname -r)/Module.symvers
# Output will show EXPORT_SYMBOL_GPL — confirming GPL-only status
Correct
register_kprobe() is exported with EXPORT_SYMBOL_GPL(). At module load time the kernel's symbol lookup checks a GPL-only flag on the export table entry. If MODULE_LICENSE() does not declare a GPL-compatible string, the lookup fails — even though the symbol exists in /proc/kallsyms. This is a compile-time-invisible, load-time-enforced access control mechanism. The fix is to declare MODULE_LICENSE("GPL"). This also avoids tainting the kernel with the P flag in the first place — note that taint is sticky, so once a flag is set it stays until reboot.
correct license for GPL-only symbols
MODULE_LICENSE("GPL");          /* grants access to EXPORT_SYMBOL_GPL symbols */
MODULE_LICENSE("Proprietary");  /* blocks them — "Unknown symbol" at insmod */

/* Valid GPL-compatible strings:
   "GPL", "GPL v2", "GPL and additional rights",
   "Dual BSD/GPL", "Dual MIT/GPL", "Dual MPL/GPL" */
Incorrect
A CRC mismatch from CONFIG_MODVERSIONS produces a different error message — typically "disagrees about version of symbol" — not "Unknown symbol". The "Unknown symbol" message specifically means the loader searched the kernel's export table and found nothing under that name accessible to your module. With a "Proprietary" license, EXPORT_SYMBOL_GPL() symbols are filtered out of the lookup, making them appear to not exist even though they are present in the running kernel.
distinguishing the two errors
# CRC / version mismatch (CONFIG_MODVERSIONS):
# mymodule: disagrees about version of symbol register_kprobe

# GPL license violation (EXPORT_SYMBOL_GPL):
# mymodule: Unknown symbol register_kprobe (err 0)

# The version mismatch message is unmistakable.
# The GPL denial looks identical to a plain missing symbol —
# both print "Unknown symbol". The only way to confirm it's
# a license issue is to check Module.symvers for _GPL suffix.

A More Useful Module: Exposing a /proc Interface

The /proc filesystem is a virtual filesystem — no data is ever written to disk. Every file in /proc is synthesized on the fly by kernel code. Your module can create its own entries, making them readable and optionally writable from user space.

Mental Model: Virtual Files vs Real Files

When you cat /proc/cpuinfo, nothing is read from a disk or a block device. The VFS layer calls the read handler registered for that entry, which generates the output string in kernel memory and hands it to the calling process. The inode exists only in RAM, synthesized when the /proc entry is created. Reading the same /proc file twice can return different data if the underlying kernel state changed between reads — something that never happens with a static file. This is the correct mental model for any /proc entry you write: it is not a file, it is a syscall that happens to look like a file read. The implications for your read handler are significant: you must handle the offset parameter correctly, or a tool that calls read() in multiple chunks will get corrupted data. This connects to the copy_to_user requirement covered in The Copy Barrier.

Create procmod.c:

procmod.c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/proc_fs.h>
#include <linux/uaccess.h>
#include <linux/string.h>   /* strlen() */

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Your Name");
MODULE_DESCRIPTION("Proc filesystem demo module");
MODULE_VERSION("1.0");

#define PROC_ENTRY_NAME "mymodule_info"
#define BUF_SIZE 256

static struct proc_dir_entry *proc_entry;
static char message[BUF_SIZE] = "Hello from kernel space!\n";
static size_t message_len;

static ssize_t proc_read(struct file *file,
                          char __user *buffer,
                          size_t count,
                          loff_t *offset)
{
    /* Use message_len, not strlen() -- after a write, message_len
     * reflects the actual current content length. Calling strlen()
     * here would return the old length if the new message is shorter. */
    if (*offset >= (loff_t)message_len)
        return 0;

    if (count > message_len - *offset)
        count = message_len - *offset;

    if (copy_to_user(buffer, message + *offset, count))
        return -EFAULT;

    *offset += count;
    return count;
}

static ssize_t proc_write(struct file *file,
                           const char __user *buffer,
                           size_t count,
                           loff_t *offset)
{
    if (count >= BUF_SIZE)
        count = BUF_SIZE - 1;

    if (copy_from_user(message, buffer, count))
        return -EFAULT;

    message[count] = '\0';
    message_len = count;   /* track exact length for proc_read */

    pr_info("procmod: received %zu bytes from user space\n", count);
    return count;
}

static const struct proc_ops proc_file_ops = {
    .proc_read   = proc_read,
    .proc_write  = proc_write,
    .proc_lseek  = default_llseek,
};

static int __init procmod_init(void)
{
    message_len = strlen(message);   /* initialize from the static default */

    proc_entry = proc_create(PROC_ENTRY_NAME, 0666, NULL, &proc_file_ops);
    if (!proc_entry) {
        pr_err("procmod: failed to create /proc/%s\n", PROC_ENTRY_NAME);
        return -ENOMEM;
    }

    pr_info("procmod: loaded. Read/write /proc/%s\n", PROC_ENTRY_NAME);
    return 0;
}

static void __exit procmod_exit(void)
{
    proc_remove(proc_entry);
    pr_info("procmod: unloaded\n");
}

module_init(procmod_init);
module_exit(procmod_exit);

Update your Makefile to use procmod.o instead of hello.o, rebuild, and load. Then test it:

terminal
$ cat /proc/mymodule_info
Hello from kernel space!

$ echo "Modified by user" | sudo tee /proc/mymodule_info
$ cat /proc/mymodule_info
Modified by user

You have just created a bidirectional communication channel between user space and kernel space.

Caution: World-Writable /proc Entries

The proc_create() call here uses permission 0666 — world-readable and world-writable — so you can test with echo and cat without sudo. Never ship a production module with a world-writable /proc entry unless that is a deliberate and well-reasoned design choice. Any unprivileged process on the system can write to it. For real driver interfaces, 0444 (read-only for all) or 0644 (root-writable, world-readable) are the typical choices. If write access must be root-only, use 0200 or 0600.

Note: seq_file and why it is not used here

You will often see #include <linux/seq_file.h> in proc interface examples alongside proc_fs.h. The seq_file API is a higher-level layer built on top of raw proc_ops: it manages the offset bookkeeping, handles multi-call reads automatically, and is the right choice for any /proc entry whose output may exceed a single page or that needs to iterate over a data structure. For a simple fixed-length message buffer like this module, raw proc_ops with explicit copy_to_user is cleaner and more transparent — it makes the offset arithmetic visible so you can see exactly what is happening. For production proc interfaces that emit variable-length output (e.g., a list of registered devices, a statistics table), prefer the seq_file approach. See seq_open(), seq_printf(), and single_open() in linux/seq_file.h.
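For reference, here is a sketch of what this module's read side would look like rewritten with the single_open variant of seq_file. This is kernel code; mymodule_show is an invented name, and message is the buffer from the module above:

```c
/* Fragment (kernel code): the seq_file single_open pattern. Offset
   bookkeeping and multi-call reads are handled by seq_read/seq_lseek. */
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

static int mymodule_show(struct seq_file *m, void *v)
{
    /* seq_printf buffers output; no copy_to_user, no offset math */
    seq_printf(m, "message: %s", message);
    return 0;
}

static int mymodule_open(struct inode *inode, struct file *file)
{
    return single_open(file, mymodule_show, NULL);
}

static const struct proc_ops seq_proc_ops = {
    .proc_open    = mymodule_open,
    .proc_read    = seq_read,
    .proc_lseek   = seq_lseek,
    .proc_release = single_release,
};

/* In init: proc_create(PROC_ENTRY_NAME, 0444, NULL, &seq_proc_ops); */
```

Note that this variant is read-only as written; adding write support means supplying a .proc_write handler alongside the seq machinery.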

Kernel Pop Quiz
Your proc_write handler receives a const char __user *buffer pointer. A colleague suggests skipping copy_from_user() and directly doing *buffer to read the first byte for a quick sanity check. What is the primary risk?
Incorrect
The __user annotation is invisible to the C compiler — it compiles away to nothing and produces no runtime warning. It is only meaningful to the sparse static analysis tool (make C=1). The real danger is not a warning but a silent and potentially catastrophic runtime failure. The kernel's sparse tool exists specifically because the compiler cannot catch this class of bug on its own.
running sparse to catch __user violations
# sparse catches direct dereferences of __user pointers
$ make C=1 M=$(PWD) modules
# warning: dereference of noderef expression
# The compiler itself produces no such warning.
Correct
In user space, an invalid pointer dereference causes a recoverable segfault. In kernel space, the same fault has no recovery path — it generates an oops that may crash the machine. copy_from_user() uses a kernel exception table mechanism: if a page fault fires mid-copy, the fault handler consults the fixup table and returns the number of bytes not copied rather than panicking. Direct dereference bypasses this entirely. There is also a second risk: a malicious process could pass a kernel-space address, causing your handler to read privileged kernel memory.
correct pattern vs dangerous pattern
/* DANGEROUS — direct dereference, no recovery path */
char first = *buffer;   /* oops if page swapped out */

/* CORRECT — copy_from_user handles faults gracefully */
char first;
if (copy_from_user(&first, buffer, 1))
    return -EFAULT;       /* clean error, no panic */
Incorrect
User-space virtual addresses are not inherently invalid from kernel context — on x86-64, user and kernel space share the same virtual address space, just in different ranges (user below ~128TB, kernel in the upper half). The kernel can physically access user-space addresses if the page is present in memory. The actual problem is reliability and security: the page may be swapped out (triggering an unrecoverable fault), or the pointer could be crafted to point into kernel address space, enabling a privilege escalation attack.
what access_ok() actually checks
/* access_ok() verifies the range is in user space */
/* It does NOT guarantee the page is mapped in memory */
/* copy_from_user() handles the unmapped-page case via */
/* the kernel exception fixup table                    */

/* TASK_SIZE_MAX on x86-64: 0x7ffffffff000 */
/* Kernel text starts above: 0xffffffff80000000 */

The Copy Barrier: Why copy_to_user and copy_from_user Are Non-Negotiable

In kernel space, you are running with full memory access. User space pointers that come in through system calls are virtual addresses in the calling process's address space, not kernel space. On x86-64 with 4-level paging, user space occupies the lower canonical range (addresses below 0x0000800000000000), while kernel space occupies the upper canonical range. With 5-level paging that boundary extends further, but the principle is the same: user and kernel space are separated by the canonical address hole.

Directly dereferencing a user-space pointer has two catastrophic failure modes. First, the pointer might be invalid, null, or point to unmapped memory — in kernel space this generates an oops or panic rather than a recoverable segfault. Second, a malicious process could pass a pointer to a kernel-space address, potentially causing your kernel code to read or overwrite protected kernel data. This is the foundation of an entire class of kernel privilege escalation attacks.

copy_to_user(dst_user, src_kernel, count) and copy_from_user(dst_kernel, src_user, count) solve both problems. They verify that the user-space pointer actually falls within the user-space address range via access_ok(). They handle page faults gracefully — if the user-space page is swapped out, the fault is handled and the copy proceeds after the page is brought back in. They return the number of bytes that could not be copied (0 on success).

What access_ok() actually does is worth understanding. On x86-64 it performs a simple arithmetic check: it verifies that addr + size does not wrap around and that the entire range lies below TASK_SIZE_MAX (the upper bound of user address space, typically 0x7ffffffff000 on 64-bit systems). This check costs almost nothing — it is a single comparison — but it catches the null pointer case, the kernel-address case, and the integer-overflow-in-size case in one shot. It does not, however, guarantee the page is mapped. That guarantee comes from the fixup table mechanism: both copy functions are assembled with exception-handling annotations so that if a page fault fires mid-copy, the kernel's fault handler consults the exception table, recovers, and returns the number of bytes not yet copied rather than panicking.

There is a subtler issue here that experienced kernel developers have to reason about carefully: TOCTOU (Time-Of-Check-Time-Of-Use) races. The access_ok() check and the actual copy happen at different moments. A multi-threaded user process can, in principle, unmap or remap the page between those two moments. This is why some kernel subsystems — particularly those implementing system call argument validation — copy the entire user buffer into kernel memory before inspecting any field, rather than reading fields directly from user space one at a time. The copy_from_user() call produces a consistent kernel-side snapshot; reading user memory piecemeal does not.

Kernel Pop Quiz
A kernel module's write handler reads a user-supplied struct directly from the user-space pointer without calling copy_from_user() — but it does call access_ok() first to verify the range. Under what realistic scenario does this still fail, even though access_ok() returned true?
Incorrect
access_ok() performs only a range check — it verifies the address is within the user-space address range. It does not check whether the page is mapped, whether it will remain mapped, or whether it will retain its contents. The kernel documentation for access_ok() explicitly states it does not check memory accessibility. It is one layer of a multi-layer safety system, not a complete guarantee.
Correct
This is a classic TOCTOU (Time-Of-Check-Time-Of-Use) race. The check (access_ok()) and the use (the direct dereference) are not atomic. In a multi-threaded process, thread B can call munmap() or mmap() to replace the page between those two operations. If the page is unmapped, the kernel dereference triggers a fault that cannot be gracefully recovered — instead of copy_from_user()'s exception fixup mechanism, you get an unhandled page fault in kernel context, which typically escalates to a kernel oops. If the page is replaced with different data, the kernel silently reads attacker-controlled content. copy_from_user() does not eliminate the race entirely, but it takes one consistent snapshot into kernel memory with exception handling in place throughout, so a concurrent unmap yields a clean -EFAULT instead of an oops.
correct pattern: copy first, inspect the kernel-side copy
struct my_cmd cmd;
/* Copy entire struct into kernel memory first */
if (copy_from_user(&cmd, user_ptr, sizeof(cmd)))
    return -EFAULT;
/* All subsequent reads operate on the kernel-side copy.
   The user process can mutate user_ptr all it wants now —
   cmd is a stable snapshot in kernel memory. */
if (cmd.length > MAX_LEN)
    return -EINVAL;
Incorrect
The TOCTOU race exists on uniprocessor systems too, because the user process can be preempted between the check and the dereference, and a different thread in that process can run and unmap the page before the kernel resumes. On SMP the window is narrower but the race is fundamentally the same mechanism. The key point is that access_ok() is a point-in-time snapshot of address validity, not a lock that holds that validity stable. On SMP, concurrent modification is simply more likely — the race does not require a context switch.

For copying a single integer or pointer (not a buffer), the kernel provides get_user(x, ptr) and put_user(x, ptr). These are optimized macros that expand to type-appropriate single-word transfers with inline access_ok() checks. They generate more efficient code than routing a 4-byte copy through copy_from_user():

get_user / put_user example
int val;
int __user *uptr = (int __user *)arg;  /* from ioctl handler */

/* Read one int from user space */
if (get_user(val, uptr))
    return -EFAULT;

/* Write one int back to user space */
if (put_user(val * 2, uptr))
    return -EFAULT;
Caution

The __user annotation in function signatures (const char __user *buffer) is invisible to the compiler — it compiles away to nothing. But the sparse static analysis tool, run with make C=1, uses it to track user-space pointers through the entire call graph and flag any direct dereferences. If you write kernel code professionally, run sparse on every non-trivial module. It catches classes of bugs that code review misses.

Module Parameters: Making Your Module Configurable at Load Time

The module_param() macro exposes a module-level variable as a configurable parameter that can be set at load time without recompiling.

module parameter example
#include <linux/moduleparam.h>

static int repeat_count = 3;
module_param(repeat_count, int, 0644);
MODULE_PARM_DESC(repeat_count, "Number of times to print the message (default: 3)");

After compiling, load with:

$ sudo insmod mymodule.ko repeat_count=10

The third argument to module_param() is a permission mask for the corresponding /sys/module/<name>/parameters/<param> sysfs entry. A value of 0 hides the parameter from sysfs entirely. A value of 0644 makes it readable by everyone and writable by root, even while the module is loaded. Supported types include bool, charp, int, uint, long, ulong, short, ushort, and array variants of most of these.
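The octal mask uses the same owner/group/other semantics as ordinary file permissions, which you can verify on any regular file. An illustrative shell session (GNU coreutils stat assumed; the file is a stand-in for the sysfs entry):

```shell
# The same 0644 mask the module_param() example uses
f=$(mktemp)
chmod 0644 "$f"
stat -c '%a %A' "$f"    # 644 -rw-r--r--  (world-readable, owner-writable)

# A mask of 0 is the "hidden from sysfs" case
chmod 0 "$f"
stat -c '%a %A' "$f"    # 0 ----------
rm -f "$f"
```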

When you need to validate or act on a parameter change while the module is running — not just store a new value — the standard module_param() is insufficient. The solution is module_param_cb(), which lets you register a custom set and get callback pair. The set callback runs synchronously whenever root writes to the sysfs parameter entry, allowing you to validate the new value, reject it with an error code, or trigger a side effect like resizing a buffer or restarting a timer:

module_param_cb example
static int timeout_ms = 100;

static int timeout_set(const char *val, const struct kernel_param *kp)
{
    int n, ret;
    ret = kstrtoint(val, 10, &n);
    if (ret)
        return ret;
    if (n < 1 || n > 10000)
        return -EINVAL;  /* reject out-of-range values */
    timeout_ms = n;
    restart_timer();   /* act on the change immediately */
    return 0;
}

static const struct kernel_param_ops timeout_ops = {
    .set = timeout_set,
    .get = param_get_int,
};

module_param_cb(timeout_ms, &timeout_ops, &timeout_ms, 0644);
MODULE_PARM_DESC(timeout_ms, "Timer interval in milliseconds (1–10000, default: 100)");

For parameters that hold a list of values, module_param_array() exposes a fixed-size C array. The kernel parses a comma-separated string from the command line or sysfs and fills each element. An auxiliary variable receives the actual number of elements written, which lets you distinguish an empty array from one that was never set:

module_param_array example
static int irq_list[4] = {0};
static int irq_count = 0;

module_param_array(irq_list, int, &irq_count, 0444);
MODULE_PARM_DESC(irq_list, "Up to 4 IRQ numbers to watch (comma-separated)");

/* Load: insmod mymod.ko irq_list=10,11,14
   irq_count will be set to 3 after loading */
What Would Happen If
What would happen if you set a module_param() permission to 0222 (write-only, no read) and a user then ran cat /sys/module/mymodule/parameters/myparam?

The cat command would receive EACCES (Permission denied). The sysfs layer enforces the permission bits you specified — 0222 means the entry is writable but not readable by anyone. This is a valid use case: some module parameters control write-only operations (trigger actions, reset counters) where exposing the current value would either be meaningless or leak internal state.

A subtler consequence: if you set permissions to 0, the parameter disappears from sysfs entirely, but it can still be set at load time via insmod. It just cannot be inspected or changed after the module is running. This is the correct choice for security-sensitive parameters — initialization secrets, internal capacity limits, hardening thresholds — that you want to be configurable at deployment time but not readable or writable from userspace afterward. It is also how driver authors lock down parameters that only make sense at probe time and would cause undefined behavior if changed mid-operation.

Symbol Visibility and Module Stacking

When your module is loaded, its exported symbols become visible to the entire kernel. Other modules can call your functions. This is the mechanism behind module stacking — the ability to build layers of modules where higher-level modules depend on lower-level ones.

EXPORT_SYMBOL(my_function) makes my_function available to any module loaded after yours. EXPORT_SYMBOL_GPL(my_function) restricts it to GPL-licensed modules only. After your module loads, its symbols appear in /proc/kallsyms.

Before your module can be unloaded with rmmod, the kernel checks whether any other loaded module holds a reference to your exported symbols. If another module is actively using your function, rmmod will fail with "Module is in use." You can inspect dependencies with:

$ lsmod

The number in the Used by column tells you how many other modules or kernel subsystems currently reference this module. A count of 0 means it is safe to unload.
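For scripting, that count is the third whitespace-separated field of lsmod's output. A sketch with a hypothetical lsmod line (module names invented for illustration):

```shell
# Fields: Module  Size  Used-by-count  Users
line="modA 12288 2 modB,modC"

echo "$line" | awk '{print $3}'    # 2          (reference count)
echo "$line" | awk '{print $4}'    # modB,modC  (who holds the references)
```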

Warning: try_module_get() and the Unload Race

There is a race condition that beginners do not expect: a caller can enter your module's exported function at the same moment that another thread is calling rmmod. If rmmod wins, your function's code is freed while it is still executing, causing an immediate crash. The kernel prevents this with try_module_get(THIS_MODULE) and module_put(THIS_MODULE). try_module_get() atomically increments the module's reference count and returns false if the module is already being unloaded (state MODULE_STATE_GOING). If it returns true, the module is pinned in memory until you call module_put(). Device drivers registered through the kernel's driver model have this done automatically by the VFS and device core. For modules that export functions called directly from other code paths — without going through a kernel subsystem that manages reference counting — you must call try_module_get()/module_put() yourself around each use. Forgetting this on a production driver is a latent crash waiting for a specific timing window.

Kernel Pop Quiz
Module A exports my_utility_fn() using EXPORT_SYMBOL(). Module B, which declares MODULE_LICENSE("GPL"), calls my_utility_fn(). Module C, which declares MODULE_LICENSE("Proprietary"), also calls my_utility_fn(). Both B and C are currently loaded. You try to unload module A with rmmod modA. What happens?
Incorrect
The license check happens at load time, not at unload time. EXPORT_SYMBOL() (without the _GPL suffix) makes my_utility_fn() available to any module — GPL or not — so module C was able to load and call it. Once loaded, both B and C hold valid references that increment module A's use count. The kernel's reference counting mechanism does not distinguish between GPL and non-GPL callers. License enforcement at load time does not undo live usage references.
Correct
Module A's use count is 2 because both B and C hold active symbol references to it. rmmod checks the use count before attempting to unload — a non-zero count produces "rmmod: ERROR: Module modA is in use by: modB modC". To unload A you must first unload B and C. The key distinction here is that EXPORT_SYMBOL() (not EXPORT_SYMBOL_GPL()) was used — this makes the symbol accessible to proprietary modules at load time, which is why C was able to load at all. If EXPORT_SYMBOL_GPL() had been used, module C would have failed to load in the first place.
checking and understanding use counts
# lsmod output: Module / Size / Used by
$ lsmod | grep modA
# modA   12288  2 modB,modC
#                ^ use count = 2

# Correct unload sequence:
$ sudo rmmod modC
$ sudo rmmod modB
$ sudo rmmod modA  # now use count = 0, succeeds
Incorrect
The kernel never automatically unloads dependent modules — that would violate the explicit unloading contract that module developers rely on. rmmod is a deliberate, manual operation. Automatically cascading an unload through dependent modules could leave other modules or user-space processes in inconsistent state. The equivalent of cascading unload is modprobe -r modA, which will unload A and its unused dependents, but only if those dependents are not themselves referenced by anything else. And even then, it will refuse if any module in the dependency chain has a non-zero use count from outside the chain.

Kernel Memory Allocation: You Are Not in malloc() Anymore

The kernel offers two primary allocation paths, but the difference between them is more nuanced than "small vs. large."

kmalloc() — allocates from a slab cache implemented by the SLUB allocator. SLUB became the default with kernel 2.6.23 and has been the sole allocator since kernel 6.8 (SLOB was removed in 6.4, the legacy SLAB allocator was deprecated in 6.5, and fully removed in 6.8). SLUB organizes memory into per-CPU caches of fixed-size objects. When you call kmalloc(256, GFP_KERNEL), SLUB looks up the nearest size class (in this case 256 bytes, an exact match), pops an object from the current CPU's free list, and returns it in nanoseconds if the cache is warm. The physical pages backing these objects are contiguous — a 256-byte object will never cross a page boundary — which matters for DMA if your device requires a contiguous physical buffer. The practical size limit before you should switch to vmalloc is around 4MB on most architectures, though the theoretical limit is KMALLOC_MAX_SIZE.

kmalloc example
char *buf = kmalloc(256, GFP_KERNEL);
if (!buf)
    return -ENOMEM;
/* ... use buf ... */
kfree(buf);
Warning: Never Compute Array Sizes Manually

A subtle and historically significant class of kernel security bugs comes from computing the allocation size for an array with a bare multiplication: kmalloc(count * sizeof(struct foo), GFP_KERNEL). If count is controlled by user input and is large enough, the multiplication overflows a 32-bit integer silently, producing a small allocation — while the code that follows writes count full-sized elements into it. This is a heap overflow. The kernel provides kmalloc_array(count, sizeof(struct foo), GFP_KERNEL) and kcalloc(count, sizeof(struct foo), GFP_KERNEL) precisely to prevent this: both check for overflow internally and return NULL rather than allocating a truncated buffer. Use them unconditionally for array allocations. The raw kmalloc(n * size, ...) pattern should never appear in new code.

The GFP flags argument determines the entire behavior of the allocator when memory is scarce:

GFP_KERNEL
    Context: process context only
    Behavior: may sleep; triggers page reclaim, writeback, and swap. Safe for most module code.
GFP_ATOMIC
    Context: interrupt handler, spinlock held, or other atomic context
    Behavior: never sleeps; draws from the emergency reserve pool. Higher failure rate. Use sparingly.
GFP_NOWAIT
    Context: process context
    Behavior: never sleeps, but does not use the emergency reserve. Fails faster than GFP_ATOMIC.
GFP_DMA
    Context: driver code needing legacy DMA
    Behavior: allocates from the DMA-capable zone (below 16MB on x86). Rarely needed with a modern IOMMU.
GFP_KERNEL | __GFP_ZERO
    Context: process context
    Behavior: like GFP_KERNEL, but zeroes the allocation before returning. Equivalent to kzalloc().

vmalloc() — maps arbitrary physical pages into a contiguous virtual address range in the vmalloc area (VMALLOC_START to VMALLOC_END). Each physical page is individually allocated with the page allocator, so there is no physical contiguity requirement. The cost is two-fold: the allocation itself is slower (it must set up page table entries for every page), and accesses are slower because the non-contiguous physical layout produces more TLB pressure. Use vmalloc when you need a large buffer (several megabytes) that does not need to be physically contiguous — firmware loading, large ring buffers, or driver workspace.

Mental Model: vmalloc TLB Pressure vs mmap in User Space

The performance penalty of vmalloc() in the kernel mirrors a well-known user-space pattern: a large mmap() over many non-contiguous physical pages. The virtual address space looks contiguous to the reader, but every unique physical page mapping burns a TLB entry. When those entries are exhausted, every subsequent access causes a TLB miss, a page-table walk, and a fill — the same overhead that makes NUMA-unaware memory allocation slow on multi-socket servers. In kernel code this penalty is concentrated: the kernel TLB is shared across all processes and interrupt handlers, so vmalloc-heavy code creates TLB pressure that affects performance globally, not just for the calling thread. This is why large-buffer device drivers that need physically contiguous memory use dma_alloc_coherent() or __get_free_pages() rather than vmalloc().

What Would Happen If
What would happen if you used vmalloc() to allocate a DMA receive buffer for a network driver, then passed that virtual address to a PCI device's DMA engine?

The DMA engine would write to the wrong memory — or, depending on IOMMU configuration, the write would be rejected with a DMA fault. DMA engines operate on physical addresses (or IOVA addresses when an IOMMU is present). vmalloc() gives you a virtually contiguous range built from non-contiguous physical pages. The physical address of byte N in a vmalloc region is not physical_start + N; it is wherever the page allocator happened to place each constituent page. If you pass the physical address corresponding to page 0 of the buffer, the DMA engine will correctly write the first page, then continue writing to whatever physical memory follows that page — which is not the second page of your buffer.

The correct API for driver DMA buffers is dma_alloc_coherent(), which allocates physically contiguous memory (or sets up IOMMU mappings to create a contiguous IOVA window over non-contiguous pages), returns both the kernel virtual address and the DMA address you pass to the hardware, and handles cache coherency for the architecture. On systems without an IOMMU, it guarantees physical contiguity. On systems with one, it programs the IOMMU so the device sees contiguous addresses even if physical memory is scattered.

A pattern worth knowing for driver code is devm_kmalloc() — the managed allocation variant. Allocations made with devm_* functions are tracked by the device core and automatically freed when the associated device is unbound or the driver is removed. This eliminates an entire class of cleanup path bugs where an error return forgets to free allocated memory. It is not available for pure kernel modules (which have no associated device), but it is standard practice in device driver code:

devm_kmalloc pattern (driver context)
/* In a platform_driver probe() function */
struct my_priv *priv = devm_kzalloc(&pdev->dev,
                                     sizeof(*priv),
                                     GFP_KERNEL);
if (!priv)
    return -ENOMEM;
/* No kfree() needed -- devm frees on driver unbind */
Caution

A memory leak in a kernel module is different in kind from a user-space leak. There is no garbage collector, no process exit to clean up after you. When your module unloads, any memory you allocated and did not free is gone forever — stolen from the system until the next reboot. Always pair every kmalloc with a kfree in the error path and in the __exit function.

Kernel Pop Quiz
You are writing an interrupt handler that needs to allocate a small buffer to queue an incoming hardware event. Which kmalloc flag is correct, and why?
Incorrect
GFP_KERNEL is allowed to sleep — it can trigger page reclaim, writeback, and swap when memory is tight. An interrupt handler runs with interrupts disabled on at least one CPU and cannot be scheduled away. If GFP_KERNEL tries to sleep inside an IRQ handler, the kernel will either BUG immediately (with CONFIG_DEBUG_ATOMIC_SLEEP) or silently corrupt the scheduling state, leading to a deadlock or hang that is very difficult to diagnose.
what happens with GFP_KERNEL in IRQ context
/* With CONFIG_DEBUG_ATOMIC_SLEEP=y, GFP_KERNEL in IRQ triggers: */
/* BUG: sleeping function called from invalid context            */
/* in_atomic(): 1, irqs_disabled(): 1                           */

/* The might_sleep() call inside the allocator catches this */
static irqreturn_t my_irq(int irq, void *dev_id) {
    /* WRONG, triggers the BUG above:
       void *buf = kmalloc(size, GFP_KERNEL);   (may sleep) */
    /* CORRECT, never sleeps: */
    void *buf = kmalloc(size, GFP_ATOMIC);
    return IRQ_HANDLED;
}
Correct
GFP_ATOMIC instructs the allocator to never sleep under any circumstances. It draws from a reserved emergency memory pool, which means it has a higher failure rate than GFP_KERNEL under memory pressure, but it is the only safe choice from interrupt context, softirqs, tasklets, or anywhere else where sleeping is prohibited. Always check the return value — GFP_ATOMIC allocations fail more readily. The preferred pattern for interrupt-driven data is to allocate buffers at probe time with GFP_KERNEL and recycle them in the IRQ handler, avoiding runtime allocation entirely.
correct interrupt handler allocation
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    /* GFP_ATOMIC: never blocks, uses emergency reserve */
    struct event *ev = kmalloc(sizeof(*ev), GFP_ATOMIC);
    if (!ev)
        return IRQ_HANDLED;  /* drop event under pressure */

    /* populate ev, enqueue it, schedule work */
    schedule_work(&process_work);
    return IRQ_HANDLED;
}
Incorrect
The allocator does not auto-detect context and silently switch behavior. The GFP flag is an explicit contract you make with the allocator about what it is allowed to do. If you pass GFP_KERNEL, the allocator will attempt to sleep when memory is tight, regardless of whether your code is in an interrupt handler. The kernel's might_sleep() mechanism and lockdep will catch this in debug builds, but in a production kernel the result is silent state corruption. You own the responsibility of choosing the right flag for your execution context.
GFP flag selection by context
/* Process context, can sleep:      */ kmalloc(n, GFP_KERNEL);
/* IRQ / atomic / spinlock held:    */ kmalloc(n, GFP_ATOMIC);
/* Process context, must not sleep: */ kmalloc(n, GFP_NOWAIT);
/* Zeroed process-context alloc:    */ kzalloc(n, GFP_KERNEL);

/* in_interrupt() returns true inside hardirq and softirq */
/* in_atomic()    returns true when preemption is disabled */

Debugging Your Module

The primary debugging tool in kernel development is printk(). Messages land in the kernel log ring buffer, which is exported to userspace through /dev/kmsg; dmesg is the standard way to read it. That covers basic tracing. But when things get serious, you need a different set of tools.

Dynamic Debug

Scattering pr_debug() calls throughout your module and then recompiling to enable them is painful. Dynamic debug solves this. If your kernel is built with CONFIG_DYNAMIC_DEBUG=y (the default on all major distros), you can enable pr_debug() output for specific files, functions, or line ranges at runtime without recompiling:

dynamic debug control
# Enable all pr_debug() in a specific source file
# (the control file is root-writable, hence sudo tee)
$ echo 'file procmod.c +p' | sudo tee /sys/kernel/debug/dynamic_debug/control

# Enable just one function's debug output
$ echo 'func proc_write +p' | sudo tee /sys/kernel/debug/dynamic_debug/control

# Disable after debugging
$ echo 'file procmod.c -p' | sudo tee /sys/kernel/debug/dynamic_debug/control

ftrace: Function-Level Tracing

For structured, high-throughput tracing, the kernel offers ftrace and trace_printk(). The ftrace function tracer can record every function call in the kernel, with timestamps, at very low overhead:

ftrace setup
$ sudo -i                  # tracefs access requires root
# cd /sys/kernel/debug/tracing
# echo function_graph > current_tracer
# echo 1 > tracing_on
# insmod mymodule.ko
# echo 0 > tracing_on
# cat trace

You can narrow the trace to just your module's functions by setting a filter before enabling:

$ echo 'proc_read proc_write procmod_init' | sudo tee /sys/kernel/debug/tracing/set_ftrace_filter

KASAN: Kernel Address Sanitizer

If your distribution kernel or a custom kernel is built with CONFIG_KASAN=y, the Kernel Address Sanitizer is active. KASAN catches use-after-free, out-of-bounds reads and writes, and use-before-initialization bugs in kernel memory — the same class of bugs that AddressSanitizer catches in user-space code. When KASAN detects a violation it prints a detailed report to dmesg including the bad access address, the stack trace of the access, and the stack trace of the original allocation:

example KASAN report (truncated)
==================================================================
BUG: KASAN: slab-out-of-bounds in proc_write+0x4a/0x80 [procmod]
Write of size 1 at addr ffff888107a3c100 by task insmod/1847

Call Trace:
 proc_write+0x4a/0x80 [procmod]
 proc_reg_write+0x6c/0xa0
 vfs_write+0x121/0x390

Lockdep: The Lock Validator

Deadlocks in kernel code are among the hardest bugs to reproduce and diagnose. The kernel's lockdep subsystem (CONFIG_PROVE_LOCKING=y) tracks lock acquisition order at runtime and detects potential deadlocks before they actually occur. It does this by building a directed graph of all lock acquisitions the kernel has ever seen — if it observes a new acquisition order that would form a cycle in that graph, it immediately prints a lockdep warning with both stack traces, even if no actual deadlock has happened yet.

Note

Lockdep is enabled by default in many distro debug kernels (packages like linux-image-*-dbg on Debian). If you are developing a module that acquires multiple locks, test it under a lockdep-enabled kernel before releasing. A lockdep warning means you have a potential deadlock; it does not mean your current test triggered one.

KFENCE: Production-Safe Memory Error Detection

KASAN is powerful, but its memory overhead (roughly 2x RAM for shadow memory) and performance cost make it impractical in production systems. Kernel 5.12 introduced KFENCE (Kernel Electric-Fence, CONFIG_KFENCE=y) as a lightweight, sampling-based alternative that can run safely in production.

Rather than instrumenting every allocation, KFENCE periodically replaces a small fraction of kmalloc allocations with specially guarded objects — each placed adjacent to a guard page. An out-of-bounds read or write that crosses into the guard page triggers an immediate fault, and KFENCE reports the offending access with a stack trace. Use-after-free is detected because freed KFENCE objects are poisoned and their guard pages remain mapped but inaccessible. The sampling rate defaults to one guarded allocation per 100ms and is tunable via /sys/module/kfence/parameters/sample_interval:

KFENCE tuning
# Check if KFENCE is active on your running kernel
$ cat /sys/module/kfence/parameters/sample_interval
100  # milliseconds between guarded allocations

# Increase sampling rate for intensive testing (lower = more samples)
$ echo 10 | sudo tee /sys/module/kfence/parameters/sample_interval

# View cumulative KFENCE statistics
$ cat /sys/kernel/debug/kfence/stats

Ubuntu 22.04+ and Fedora 34+ ship production kernels with CONFIG_KFENCE=y enabled. If your module has a subtle out-of-bounds or use-after-free bug that KASAN would catch immediately under test conditions but that survives to production, KFENCE will eventually catch it there. The key tradeoff: KFENCE catches real bugs on the allocations it happens to guard during that run; KASAN instruments every allocation but cannot be run in production.

Decoding Oops Stack Traces

For deliberate sanity checks, the kernel provides BUG_ON(condition) and WARN_ON(condition). BUG_ON() triggers a kernel panic if the condition is true — use this when reaching that code path represents a state from which the system cannot safely recover. WARN_ON() prints a stack trace to dmesg but allows the system to continue.

When a module causes a kernel oops, the kernel dumps a stack trace with hexadecimal addresses and typically reports the module name and the offset within it — for example, proc_write+0x4a/0x80 [procmod]. The offset (0x4a here) is what you pass to addr2line against the compiled module object:

$ addr2line -e mymodule.ko 0x4a

Use the .ko file, not the intermediate .o file — the .ko is the final linked object that retains the DWARF debug information needed for line number resolution. If the kernel printed an absolute address rather than a module-relative offset, subtract the module's load base address (visible in /proc/modules or reported in the oops header) to get the offset first. Building with CONFIG_DEBUG_INFO=y (or adding ccflags-y += -g to your Makefile) ensures the .ko retains full debug symbols.
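The subtraction is plain hex arithmetic. A sketch with hypothetical values (an invented oops address, and a module base as it would appear in the third column of /proc/modules):

```shell
# Absolute RIP from the oops line, module base from /proc/modules
oops_addr=0xffffffffc0a4204a
mod_base=0xffffffffc0a42000

# Module-relative offset to feed addr2line
printf '0x%x\n' $(( oops_addr - mod_base ))   # 0x4a

# Then: addr2line -e mymodule.ko 0x4a
```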

Warning: Compressed Modules on Modern Distros

On Ubuntu 24.04, Fedora 38+, Arch Linux, and other current distributions, installed kernel modules are compressed and have a .ko.zst (Zstd) or .ko.xz (XZ) extension rather than plain .ko. This has been supported since kernel 5.13 for Zstd. You cannot pass a compressed module directly to addr2line or modinfo — both tools require the uncompressed ELF. Decompress first with zstd -d mymodule.ko.zst or xz -d mymodule.ko.xz, then run your analysis tool against the resulting .ko. Note also that when you develop your own out-of-tree module, the build system produces an uncompressed .ko directly — compression only applies after make modules_install on a kernel configured with CONFIG_MODULE_COMPRESS_ZSTD or CONFIG_MODULE_COMPRESS_XZ.

On systems with debug symbols installed, the kernel ships a more capable script:

decode_stacktrace.sh syntax
# Syntax: decode_stacktrace.sh vmlinux [source-tree-path] < oops.txt
$ scripts/decode_stacktrace.sh vmlinux /path/to/kernel/source/ < oops.txt
# The second argument is the kernel source tree root — required for
# file:line resolution. Without it, function names are resolved but
# source file paths are not.

This script resolves not just addresses but also inline functions, which addr2line alone cannot always reconstruct accurately at higher optimization levels.

Tip: objtool — Static Binary Validation

For out-of-tree modules that use unusual control flow — non-standard calling conventions, hand-written assembly stubs, or __attribute__((noreturn)) functions that the compiler did not model correctly — the kernel build system may emit warnings from objtool, the kernel's static binary analysis tool introduced in kernel 4.6 and significantly expanded since. Objtool validates ORC (Oops Rewind Capability) stack unwind annotations, which are what the kernel uses to produce reliable stack traces from oopses and NMIs. A module with incorrect or missing ORC annotations will produce garbled stack traces that are nearly impossible to diagnose. If you see warnings like objtool: myfunction()+0x18: call without frame pointer save/restore, address them before shipping. The build system runs objtool automatically when CONFIG_STACK_VALIDATION=y, which is on by default in most distribution kernels since kernel 5.10.

A Mental Model Shift: Thinking Like the Kernel

The transition from user-space to kernel-space thinking requires internalizing a set of constraints that feel artificial until you understand why they exist. None of them are arbitrary. Each reflects a real physical or architectural truth about how the hardware works.

No Floating-Point — and What That Actually Means

The x87/SSE/AVX register state is part of per-process context and is fully saved and restored on every context switch. Since kernel 4.6, eager FPU mode has been the default on all x86 systems (commit 58122bf1d856), and lazy mode was fully removed in kernel 4.9 — the lazy FPU path had grown so rarely exercised that it was more of a maintenance burden than a real option. The complete FPU state is saved and restored on every context switch regardless of whether the task has used FPU instructions, because modern XSAVE-capable hardware can detect untouched state and skip those saves in hardware — making eager mode essentially free on any remotely recent CPU. If kernel code touches the FPU registers without explicitly saving and restoring state via kernel_fpu_begin()/kernel_fpu_end(), it will corrupt the floating-point context of whatever user process was last scheduled, causing mysterious intermittent errors in completely unrelated processes.

The kernel provides kernel_fpu_begin() and kernel_fpu_end() for the rare cases — hardware crypto acceleration, video codec drivers, certain checksumming routines — where FPU access in kernel context is genuinely necessary. These functions save the user-process FPU state, enable the FPU for kernel use, and restore it afterward. They are expensive relative to normal kernel operations and must never be called from interrupt context. If you find yourself reaching for floating-point in module code, stop and ask whether the algorithm can be reformulated using integer arithmetic. In most cases, it can.

Stack Depth: The 8KB Wall

A typical kernel thread has 8KB or 16KB of stack depending on architecture and kernel configuration, compared to the megabytes available to user processes (which can grow further via demand paging). This is not a tunable parameter you can increase without rebuilding the kernel. Deep recursion in kernel code causes a stack overflow — detected on CONFIG_VMAP_STACK kernels as a guard page fault, but still immediately fatal to the system. On 32-bit x86 with 4KB stacks, even moderate call chain depth can overflow.

The kernel enforces stack usage discipline through the checkstack.pl script, which analyzes compiled objects and reports functions that allocate large stack frames. Run it against your module after building:

check module stack usage
$ objdump -d mymodule.ko | perl /path/to/linux/scripts/checkstack.pl x86_64
# Output: sorted list of functions by stack frame size
# Any function using >200 bytes of stack deserves scrutiny

The practical rule: never declare large arrays or structs on the kernel stack. If you need a 4KB buffer, allocate it with kmalloc() and free it before returning. A stack-allocated char buf[4096] in a kernel function is a serious bug waiting to happen.

The Preemption Count: Reading Context from a Single Integer

The kernel tracks what kind of execution context it is currently in via an integer called preempt_count — kept in the thread_info structure on most architectures, and in a per-CPU variable on x86. Each bit range encodes different information:

Bits    Field                   Meaning when non-zero
0–7     Preempt disable count   Preemption explicitly disabled — e.g., a spinlock is held. Sleeping here is a bug.
8–15    Softirq count           Inside a softirq or tasklet handler. Cannot sleep. Bottom-half locks apply.
16–19   Hardirq count           Inside a hardware interrupt handler. Cannot sleep. Only spinlocks and GFP_ATOMIC allocations.
20      NMI bit                 Inside a Non-Maskable Interrupt handler. Extremely restricted — almost no kernel APIs are safe here.

The macros in_interrupt(), in_atomic(), in_serving_softirq(), and preemptible() all query this counter. If you call a function that might sleep (like kmalloc(GFP_KERNEL), mutex_lock(), or schedule()) while in_atomic() returns true, the kernel will issue a might_sleep() warning — or, with CONFIG_DEBUG_ATOMIC_SLEEP enabled, a BUG with a full stack trace. This is one of the most common mistakes new kernel developers make: calling a sleeping function from an interrupt handler or while holding a spinlock.

Concurrency: Everything Runs at Once

Your module's functions may be called simultaneously from multiple CPU cores, from hardware interrupt handlers, from softirqs, from workqueues, and from multiple user processes making concurrent system calls into your /proc entry or character device. There is no single-threaded safety net.

The locking primitives form a hierarchy: spinlocks for short atomic sections that cannot sleep, mutexes for longer sections that can, read-write spinlocks and rwsem for read-heavy data, and RCU (Read-Copy-Update) for the highest-performance read-mostly structures. RCU is the mechanism the kernel uses for its own most performance-critical shared data — the network routing table, the VFS dcache, the task list. In RCU, readers run without any lock at all, writers make a copy, update it, and publish the new version with a single pointer swap. Old versions are freed after a "grace period" during which every CPU has passed through a quiescent state, guaranteeing no reader still holds a reference. Using RCU correctly in a module requires understanding rcu_read_lock(), rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu() — a topic that warrants its own guide.

Tip

When you are unsure which locking primitive to use, start with a mutex. It is the safest default: it sleeps rather than spinning (saving CPU cycles under contention), its ownership is tracked by lockdep, and it will produce a clear warning if you accidentally try to acquire it from interrupt context. Optimize to spinlocks or RCU only after profiling shows the mutex is actually a bottleneck.

Mental Model: Per-CPU Variables — Lockless Counters at Scale

One concurrency pattern that almost no beginner guide mentions is per-CPU variables, declared with DEFINE_PER_CPU(type, name) and accessed via get_cpu_var(name) / put_cpu_var(name). Per-CPU variables eliminate false sharing and lock contention for data that is logically per-CPU by nature — statistics counters, scheduler state, network receive queues, SLUB allocator caches. Each CPU has its own private copy; there is no need for a lock because no other CPU touches your copy during a non-preemptible access window. get_cpu_var() disables preemption for you (preventing migration to a different CPU mid-access) and returns a direct reference to the current CPU's copy; put_cpu_var() re-enables it. The cost: each CPU holds a complete copy of the data, so per-CPU variables are inappropriate for large structures or data that CPUs genuinely need to share. For a module that accumulates per-CPU event counters that are only summed for reporting, this is the right tool and it will outperform any spinlock-protected global counter under load.

Kernel Pop Quiz
You write a kernel function that declares char buf[8192]; on the stack to hold a temporary work buffer. On a typical x86-64 Linux kernel, what is the problem with this?
Incorrect
Kernel thread stacks on x86-64 are typically 8KB or 16KB — not 64KB. The default has been 16KB (THREAD_SIZE_ORDER of 2, i.e. four pages) since kernel 3.15, but 8KB was the standard for years and is still found on older kernels and other architectures. Either way, declaring an 8KB array on the stack leaves almost nothing for the rest of the call chain. Unlike user-space stacks, kernel stacks cannot grow via demand paging. A stack overflow in the kernel is immediately fatal.
check your kernel's stack size
# On x86-64 the stack size is compiled in, not a /boot/config option:
# THREAD_SIZE_ORDER is defined in arch/x86/include/asm/page_64_types.h
# (order 2 → 16KB since kernel 3.15; order 1 → 8KB before that)
$ grep -n 'THREAD_SIZE_ORDER' \
    /usr/src/linux-headers-$(uname -r)/arch/x86/include/asm/page_64_types.h

# check per-function stack usage after building
$ objdump -d mymodule.ko | perl scripts/checkstack.pl x86_64
Correct
Kernel stacks on x86-64 are 8KB or 16KB depending on the kernel configuration — a fraction of the megabytes available to user-space processes. Crucially, they cannot grow: there is no demand paging for kernel stacks. An 8KB on-stack array immediately exhausts the available space and any further function call or interrupt pushes past the end. On kernels built with CONFIG_VMAP_STACK=y (the default since 4.9), a guard page catches the overflow and generates a clear fault message, but the system is still lost. The fix is simple: allocate large buffers with kmalloc().
stack-safe buffer handling
/* DANGEROUS — 8KB on a stack that may only have 8KB total */
char buf[8192];  /* stack overflow risk */

/* CORRECT — allocate from the heap */
char *buf = kmalloc(8192, GFP_KERNEL);
if (!buf)
    return -ENOMEM;
/* ... use buf ... */
kfree(buf);

/* Rule of thumb: no single stack variable > ~200 bytes */
Incorrect
The compiler places no restriction on stack-allocated arrays inside kernel functions — they are valid C and compile without error. The kernel does provide the checkstack.pl script that can warn about large stack frames during a build audit, but this is a manual tool, not an enforced compiler error. The danger is entirely a runtime concern: there is no compile-time enforcement of stack depth limits in kernel code. This is why kernel developers run checkstack.pl deliberately and treat any function using more than a couple hundred bytes of stack as something worth scrutinizing.
auditing stack usage with checkstack.pl
# No compiler error for large stack frames in kernel code.
# Use this after building to find offenders:
$ objdump -d mymodule.ko | perl /path/to/linux/scripts/checkstack.pl x86_64
# Output: sorted by stack frame size, largest first
# Functions over ~200 bytes deserve a closer look
# Functions over 1KB are a serious concern

EXPORT_SYMBOL Internals: How Symbol Resolution Actually Works

When your module calls printk(), how does the CPU know where that function's code lives? In user space, dynamic linking resolves this at program load time via the dynamic linker and the PLT/GOT mechanism. In the kernel, the module loader itself performs relocation — there is no separate dynamic linker process.

The kernel maintains an in-memory symbol table built from all EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL() declarations in every compiled-in subsystem and every currently loaded module. This table is exposed to userspace as /proc/kallsyms. You can query it directly to verify that a symbol you intend to use is actually exported:

query kallsyms
# Check whether a symbol is exported. Run as root: with kptr_restrict
# enabled, unprivileged reads show all-zero addresses.
$ grep ' printk\b' /proc/kallsyms
ffffffff81234abc T printk
# On kernels 5.15 and later, printk() is a macro and the exported
# symbol is _printk — grep for that instead.

# kallsyms does not indicate GPL-only status. Check Module.symvers
# from the matching kernel build instead:
$ grep 'printk' /usr/src/linux-headers-$(uname -r)/Module.symvers
# Output: CRC   symbol_name   vmlinux   EXPORT_SYMBOL or EXPORT_SYMBOL_GPL

If a symbol your module references is not found in the kernel's export table, the load fails immediately:

terminal output
insmod: ERROR: could not insert module mymodule.ko: Unknown symbol in module

The dmesg log will identify the specific symbol. This is not a compiler error — it is a runtime linker error caught by the kernel's module loader. The distinction matters: the module compiled successfully because the header declaring the function exists on your build system. The kernel refuses it at load time because the running kernel either does not export that symbol at all, or exports it only to GPL modules and yours declares a different license.

The EXPORT_SYMBOL_GPL() enforcement mechanism works at load time, not compile time. The kernel's symbol lookup function checks a flag on the export table entry indicating GPL-only status. If your MODULE_LICENSE() declaration does not include a GPL-compatible string, the lookup fails with "Unknown symbol" even though the symbol is present in /proc/kallsyms. This trips up developers who are used to thinking of symbol visibility as purely a linker concern.

Since Linux 5.4, the kernel also supports symbol namespaces — a mechanism for grouping exported symbols and requiring modules to explicitly import a namespace before using symbols from it. If a driver subsystem exports internal helpers under a namespace like USB_STORAGE, your module must declare:

symbol namespace import
MODULE_IMPORT_NS(USB_STORAGE);       /* kernels before 6.13 */
/* Since kernel 6.13 the macro takes a string literal instead:
 * MODULE_IMPORT_NS("USB_STORAGE");  */

Without this declaration, the module will fail to load even if the symbol is visible in kallsyms. Symbol namespaces are the kernel's answer to the problem of subsystems accidentally exposing internal helpers as public API — they provide a layer of intent verification that neither EXPORT_SYMBOL nor EXPORT_SYMBOL_GPL alone can express.

The CONFIG_MODVERSIONS mechanism adds a further layer of protection. When a kernel is compiled with this option, every exported symbol gets a CRC checksum computed from its full prototype — return type, parameter types, and the types of any struct fields that the function touches. When modules are built, those same checksums are calculated for every kernel function the module calls, and the results are embedded in the module's __versions section. At load time the kernel compares checksums; a mismatch prevents loading. This is the mechanism that lets distributors like Debian maintain kernel ABI stability across point releases without forcing users to rebuild every third-party module after each update.

Kernel Version Compatibility: Writing Modules That Survive Upgrades

A module compiled for kernel 6.2 will not load on kernel 6.5 without recompilation. The kernel provides a macro-based compatibility mechanism for modules that must support multiple versions. The LINUX_VERSION_CODE macro contains the numeric encoding of the kernel version as (major << 16) | (minor << 8) | patch:

version-conditional compilation
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
    /* Use the new proc_ops structure */
    static const struct proc_ops my_fops = { ... };
#else
    /* Use the older file_operations structure */
    static const struct file_operations my_fops = { ... };
#endif

The proc_ops structure is a good example of an API change: it replaced file_operations for /proc entries in kernel 5.6.0. Any module targeting both 5.5 and 5.6+ must use exactly this conditional compilation approach. The DKMS (Dynamic Kernel Module Support) system automates recompilation of modules when the kernel is upgraded, but it cannot handle API changes — that is your job.

What Comes Next: The Landscape Ahead

A first kernel module that prints to dmesg and exposes a /proc entry is a beginning, not a destination. The territory beyond it is vast, and each area is its own discipline.

Character Device Drivers

Character device drivers expose a /dev entry and implement the full file_operations structure: open, read, write, ioctl, poll, mmap, and release. They are the mechanism behind /dev/random, /dev/null, serial ports, and audio interfaces. Writing one requires registering a major/minor device number pair with alloc_chrdev_region(), initializing a struct cdev, and creating the /dev entry via device_create() with a class. The ioctl interface is particularly important to understand: it is the standard channel for out-of-band commands that do not fit the read/write byte-stream model, and it requires careful handling of both 32-bit and 64-bit userspace callers (compat_ioctl).

Workqueues and Deferred Work

Much real driver work cannot be done in the interrupt handler — it takes too long, or it needs to sleep to allocate memory or acquire a mutex. The standard pattern is to do minimal work in the hardirq handler (acknowledge the interrupt, read a status register, enqueue data) and then schedule the rest on a workqueue. The kernel provides the system workqueue via schedule_work(), or you can create a dedicated workqueue with alloc_workqueue() for work that needs its own thread or concurrency guarantees:

deferred work pattern
static void my_work_handler(struct work_struct *work);  /* forward declaration */
static DECLARE_WORK(my_work, my_work_handler);

/* Called from interrupt context -- fast, no sleeping */
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    /* Acknowledge interrupt, snapshot hardware state */
    schedule_work(&my_work);   /* defer the rest */
    return IRQ_HANDLED;
}

/* Called from process context in a kernel thread -- can sleep */
static void my_work_handler(struct work_struct *work)
{
    /* Allocate memory, acquire mutexes, do I/O */
}

kprobes: Instrumenting the Running Kernel

Kprobes is one of the most powerful and underused facilities available to module authors. It lets you dynamically insert a breakpoint at any kernel instruction address and run your own handler when that address is hit — without modifying any source file or recompiling the kernel. You can attach a pre-handler (runs before the probed instruction) and a post-handler (runs after), giving you the ability to log function arguments, return values, timing, or any register state. Kretprobes extend this to intercept function returns specifically. Because kprobes symbols are GPL-only, your module needs MODULE_LICENSE("GPL") to use them:

minimal kprobe example
#include <linux/kprobes.h>
#include <linux/module.h>

/* do_sys_openat2(int dfd, const char __user *filename, struct open_how *how)
 * x86-64 calling convention: di=arg1, si=arg2, dx=arg3 */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
    pr_info("do_sys_openat2: dfd=%ld filename_ptr=0x%lx\n",
            (long)regs->di, regs->si);
    return 0;
}

static struct kprobe kp = {
    .symbol_name = "do_sys_openat2",
    .pre_handler = handler_pre,
};

static int __init kprobe_init(void)
{
    return register_kprobe(&kp);
}

static void __exit kprobe_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(kprobe_init);
module_exit(kprobe_exit);
MODULE_LICENSE("GPL");   /* required: kprobes symbols are GPL-only */

The pt_regs structure captures the full CPU register state at the probe point. On x86-64, function arguments follow the System V ABI: the first argument is in di, the second in si, the third in dx, and so on. You must know the prototype of the function you are probing to read the right registers — which is one reason the Bootlin Elixir cross-reference is so useful during kprobe development.

Note that regs->si in the example above contains the raw virtual address of the filename string in user space — a pointer, not the string itself. To read the actual filename inside the pre-handler, use strncpy_from_user_nofault() rather than the ordinary strncpy_from_user(): kprobe handlers run with preemption disabled, and the ordinary variant can sleep while faulting in a user page.

reading a user-space string in a kprobe handler
char fname[256];
long ret = strncpy_from_user_nofault(fname,
                                     (const char __user *)regs->si,
                                     sizeof(fname));
if (ret > 0)
    pr_info("open: %s\n", fname);
/* strncpy_from_user_nofault() (kernel 5.8+; formerly named
 * strncpy_from_unsafe_user()) never sleeps: it returns -EFAULT
 * instead of faulting in an absent page, which makes it safe
 * from atomic contexts such as kprobe handlers. */

Kprobes is what tools like BCC (BPF Compiler Collection) and SystemTap use under the hood to provide their dynamic tracing capabilities. Understanding kprobes at the module level gives you a clearer picture of what those higher-level tools are actually doing.

The Memory Management Subsystem

Beneath all of the above lies the memory management subsystem: page tables, TLB shootdowns, NUMA topology, huge pages, the buddy allocator, the slab/SLUB allocator, per-CPU page magazines, cgroup memory accounting, memory compaction, and the OOM killer. It is one of the most complex subsystems in the kernel and one of the most complex pieces of software in existence. A ramdisk driver that implements the block device layer from scratch — allocating pages, servicing bio requests, managing the page cache interaction — is the canonical exercise that touches most of this machinery in a controlled way. If you can write a correct ramdisk driver, you understand the storage stack from the filesystem layer to physical memory.

Notifier Chains: Reacting to Kernel Events

A pattern that does not get nearly enough coverage in introductory material is the notifier chain — the kernel's publish-subscribe mechanism for system-level events. Your module can register a callback that fires when specific kernel events occur, without modifying any kernel source. Useful notifier chains include the reboot notifier (do cleanup before the system shuts down), the panic notifier (log state before a crash, or attempt partial recovery), the CPU hotplug notifier (adjust per-CPU data structures when CPUs are brought online or offline), and the network device notifier (react when interfaces come up or go down):

reboot notifier pattern
#include <linux/reboot.h>

static int my_reboot_handler(struct notifier_block *nb,
                              unsigned long action,
                              void *data)
{
    if (action == SYS_RESTART || action == SYS_HALT) {
        pr_info("my_module: flushing state before reboot\n");
        flush_work(&my_flush_work);
    }
    return NOTIFY_DONE;  /* allow other handlers to run */
}

static struct notifier_block my_reboot_nb = {
    .notifier_call = my_reboot_handler,
    .priority = 0,   /* higher priority runs first; range INT_MIN to INT_MAX */
};

/* In your __init: */
register_reboot_notifier(&my_reboot_nb);

/* In your __exit: */
unregister_reboot_notifier(&my_reboot_nb);

The priority field deserves attention: notifiers with higher priority values run before those with lower values. The kernel's own emergency sync handler registers at priority 255; most drivers should register at 0 or below to let the kernel finish its own cleanup first. Forgetting to call unregister_reboot_notifier() in your exit function is a use-after-free waiting for the next system shutdown — one of the subtler cleanup bugs in module development.

"The best documentation for the kernel is the kernel itself."

When a header changes behavior between kernel versions, when a function's prototype shifts, when a new locking requirement appears — none of that is reflected in any tutorial. The source at elixir.bootlin.com is fully cross-referenced, searchable by symbol across every kernel version, and always more accurate than any external documentation. Develop the habit of reading it. Every hour you spend reading kernel source pays dividends across your entire career in systems programming.

How to Write and Load a Linux Kernel Module

Step 1: Install kernel headers and build tools

Run sudo apt-get install linux-headers-$(uname -r) build-essential on Debian/Ubuntu, or sudo dnf install kernel-devel kernel-headers gcc make on Fedora/RHEL. Confirm your kernel version with uname -r. Every .ko file you build is tied to that exact version through the vermagic mechanism.

Step 2: Write the module source and Makefile

Create a .c file that includes linux/init.h, linux/module.h, and linux/kernel.h. Define an __init function that returns 0 on success and an __exit function for cleanup. Register them with module_init() and module_exit(). Declare MODULE_LICENSE("GPL") to allow access to GPL-only exported kernel symbols. Write a Makefile that invokes the kernel build system with make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules.
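For reference, a minimal kbuild Makefile for this step can be as short as the sketch below (mymodule.o is a placeholder and must match the name of your .c file; the tab-indented recipe lines are mandatory Makefile syntax):

```make
# Tell kbuild which object to build as a module (mymodule.c -> mymodule.ko)
obj-m += mymodule.o

# Delegate to the build tree of the running kernel, pointing it back
# at this directory with M=$(PWD)
all:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
```

The two-level make invocation is the defining pattern of out-of-tree module builds: your Makefile does almost nothing itself, it only re-enters the kernel's own build system.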

Step 3: Build and inspect the module

Run make in your module directory to produce a .ko ELF relocatable object. Inspect it with modinfo to view the embedded vermagic string, license, description, and any declared parameters. Use file to confirm it is an ELF 64-bit LSB relocatable.

Step 4: Load, test, and unload the module

Load the module with sudo insmod yourmodule.ko, then check dmesg for your init message. Use lsmod to confirm it is listed and check the use count. Remove it with sudo rmmod yourmodule and verify the exit message in dmesg. If the module creates a /proc entry, test read and write with cat and echo before unloading.

Frequently Asked Questions

What is a Linux kernel module and how does it differ from a regular program?

A Linux kernel module is a compiled C object file (.ko) that gets loaded directly into the running kernel, giving your code full ring-0 privileges with no memory isolation or sandbox. Unlike a regular user-space program, a module shares address space with the entire kernel and can call any exported kernel function. A bug does not produce a segfault; it can corrupt kernel data structures or trigger a kernel panic.

What is vermagic and why does the kernel check it?

Vermagic is a string embedded in every .ko file that encodes the kernel release version, SMP support, preemption model, and whether module versioning is active. The kernel compares this string against its own compiled-in vermagic at load time. If any field mismatches, the module is refused. This prevents undefined behavior that would result from loading a module compiled against a different kernel ABI.

Why must kernel modules use copy_to_user and copy_from_user instead of accessing user-space pointers directly?

Directly dereferencing a user-space pointer in kernel code has two catastrophic failure modes: the pointer may be invalid or unmapped, causing a kernel oops instead of a recoverable segfault; or a malicious process could pass a kernel-space address, causing your code to overwrite protected kernel memory. copy_to_user and copy_from_user validate that the pointer falls within the user-space address range, handle page faults gracefully, and return the number of bytes not copied so you can detect partial transfers.

How do I expose a configurable parameter to a kernel module at load time?

Use the module_param() macro, which takes the variable name, its type, and a permission mask for the corresponding sysfs entry under /sys/module/. Load the module with insmod mymodule.ko param_name=value. A permission of 0644 makes the parameter readable by everyone and writable by root even while the module is running. A permission of 0 hides it from sysfs entirely.

Why does insmod return EPERM even when I am root?

Two separate mechanisms can block module loading even with root privileges. First, check /proc/sys/kernel/modules_disabled — if it reads 1, module loading has been permanently disabled for this boot session and the only fix is a reboot. Second, if your system is running with Secure Boot and kernel lockdown mode active (CONFIG_SECURITY_LOCKDOWN_LSM), the kernel will block unsigned modules. You can check lockdown status with cat /sys/kernel/security/lockdown. For development, disable Secure Boot in your VM firmware settings, or sign your module with a development key using scripts/sign-file from the kernel source tree.

Why do installed modules have a .ko.zst or .ko.xz extension on my distro?

Modern distributions compress installed kernel modules to reduce disk and initramfs size. Ubuntu 24.04 and Fedora 38+ use Zstd compression (.ko.zst); some distros use XZ (.ko.xz). The kernel has supported in-kernel decompression since kernel 5.17 via the MODULE_INIT_COMPRESSED_FILE flag. When you are developing your own out-of-tree module, make produces an uncompressed .ko directly — compression only happens during make modules_install on kernels configured with CONFIG_MODULE_COMPRESS_ZSTD. If you need to run analysis tools like addr2line or modinfo against an installed compressed module, decompress it first with zstd -d mymodule.ko.zst.

Sources and References

The technical claims in this guide are verifiable against primary sources. The following references were used in research and can be checked directly against the authoritative upstream documentation.

- Linux Kernel Module Programming Guide (LKMPG), sysprog21.github.io/lkmpg: LKM lifecycle, proc_ops vs file_operations change in 5.6, module init/exit patterns, namespace scoping
- kernel.org printk documentation, kernel.org/doc/html/latest/core-api/printk-basics.html: printk ring buffer behavior, log levels, /dev/kmsg export
- kernel.org SLUB allocator documentation, kernel.org/doc/html/latest/admin-guide/mm/slub.html: kmalloc, kfree, GFP flags, SLUB debug options, per-CPU cache behavior
- kernel.org Dynamic Debug, kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html: CONFIG_DYNAMIC_DEBUG, pr_debug(), runtime debug control via debugfs
- kernel.org KASAN documentation, kernel.org/doc/html/latest/dev-tools/kasan.html: Kernel Address Sanitizer use-after-free and out-of-bounds detection, shadow memory overhead
- man7.org Linux man-pages: copy_to_user, man7.org/linux/man-pages/man9/copy_to_user.9.html: copy_to_user / copy_from_user semantics, access_ok(), return value on partial copy
- Bootlin Elixir Linux kernel cross-reference, elixir.bootlin.com/linux/latest/source/include/linux/module.h: struct module definition, EXPORT_SYMBOL macro implementation, __versions section, vermagic embedding
- LWN.net: What's next for the SLUB allocator, lwn.net/Articles/974138: SLOB removal in kernel 6.4; SLAB deprecated in 6.5; SLAB removal in 6.8; SLUB as sole allocator in current kernels
- commandlinux.com Linux kernel release statistics, commandlinux.com/statistics/linux-kernel-release-frequency-statistics: Kernel 6.19 released February 8, 2026 (final 6.x release; Linux 7.0 expected April 2026); kernel source at 40+ million lines as of January 2025
- LWN.net: Loading modules from file descriptors, lwn.net/Articles/519010: finit_module() system call origin and rationale; introduced in kernel 3.8; file-descriptor-based loading for verified filesystems; MODULE_INIT_COMPRESSED_FILE added in 5.17
- Linux Kernel Documentation: KFENCE, kernel.org/doc/html/latest/dev-tools/kfence.html: Kernel Electric-Fence sampling-based memory error detection; production-safe alternative to KASAN; introduced in kernel 5.12; sample_interval tuning; Ubuntu/Fedora default enablement
- man7.org Linux man-pages: init_module / finit_module, man7.org/linux/man-pages/man2/init_module.2.html: finit_module() flags including MODULE_INIT_IGNORE_VERMAGIC, MODULE_INIT_IGNORE_MODVERSIONS, MODULE_INIT_COMPRESSED_FILE; EPERM causes including modules_disabled and lockdown
- Phoronix: Linux 5.13 Zstd compressed modules, phoronix.com/news/Linux-5.13-Zstd-Modules: Zstd compression support for .ko.zst modules added in kernel 5.13; kmod 28 userspace support; compressed module suffix conventions