The Problem Nobody Tells You About

There is a category of kernel bug so subtle that it can exist in a codebase for years, only manifesting under specific CPU load patterns on specific hardware. The type of bug that passes every unit test, every integration test, and slips through code review because it looks perfectly correct. It is the kind of bug that emerges not from logic errors, but from a fundamental mismatch between what programmers assume about memory and what modern CPUs actually do.

That mismatch has a name: weak memory ordering. And the Linux kernel's answer to it -- after decades of informal documentation, tribal knowledge, and edge-case discoveries -- is the Linux Kernel Memory Consistency Model (LKMM).

This article is your thorough introduction to why the LKMM exists, what it actually says, how its key mechanisms work, and why this knowledge is essential for anyone writing concurrent kernel code. We will go from first principles to formal axioms, with real primitives, real examples, and real bugs that were caught because the model existed.

Background

The LKMM was formally merged into the mainline Linux kernel during the Linux 4.17 merge window in April 2018 (the 4.17 release itself shipped in June 2018), in the tools/memory-model directory. The academic paper accompanying it -- "Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel" by Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and Alan Stern -- was presented at ASPLOS 2018.

What Is a Memory Model, Really?

A memory consistency model is a contract between hardware and software. It answers one deceptively simple question: given a piece of concurrent code running across multiple CPUs, what values can the load instructions legally return?

On a uniprocessor, the answer is trivial: each load must return the value written by the most recent store to that memory location. But on multiprocessor systems with shared memory, things fracture. Different CPU architectures implement different memory models, and those models vary significantly in how permissive they are about reordering operations.

The most intuitive memory model is Sequential Consistency (SC), formalized by Leslie Lamport in 1979. Under SC, all operations across all CPUs appear to execute in some single global order, and each CPU's operations appear in that order in the sequence they were programmed. If CPU0 stores to buf and then stores to flag, no other CPU will ever see flag updated before buf.

The trouble is that no modern high-performance CPU actually implements full Sequential Consistency. Hardware designers discovered that enforcing SC costs significant performance -- CPUs must flush store buffers, stall pipelines, and synchronize caches more aggressively than necessary for most code. So they relaxed the model. x86 processors implement what is known as Total Store Order (TSO), which permits some store-load reorderings. ARM and POWER architectures are even more relaxed, permitting many more reorderings that would be impossible under SC.

The Linux kernel supports more than 30 CPU architectures. Its memory model must be permissive enough to allow any behavior that any supported architecture permits, while still being restrictive enough to allow developers to write correct concurrent code.

"The Linux-kernel memory consistency model (LKMM) is rather complex and obscure. This is particularly evident if you read through the linux-kernel.bell and linux-kernel.cat files that make up the formal version of the model; they are extremely terse and their meanings are far from clear."

-- Alan Stern, LKMM explanation.txt, Linux kernel source tree (tools/memory-model/Documentation/explanation.txt)

This candid admission from one of the model's own authors tells you something important: even the people who designed it acknowledge it is not easy. But it is learnable. And understanding its structure makes the intimidating complexity reveal itself as carefully-reasoned necessity.

Events, Relations, and Program Order

The LKMM does not reason directly about C source statements. Instead, it works with an abstract notion of events -- the individual memory accesses and synchronization operations that occur when code executes. Each read from a shared variable is a read event; each write is a write event; each memory barrier is a fence event.

The most fundamental relation between events is program order (po). We say that event X is po-before event Y (written X ->po Y) if X appears before Y in the sequence of instructions presented to a CPU's execution unit after branches have been resolved and loops unrolled. This is the order in which a programmer reads the code.

But here is the critical insight that the LKMM builds everything on: program order does not equal execution order. CPUs and compilers can reorder operations that are independent of each other. A CPU with an out-of-order execution unit might execute a later independent load before an earlier store completes. A compiler with optimization enabled might hoist a load out of a loop or merge consecutive stores to the same variable. The LKMM's entire purpose is to precisely describe which reorderings are legal and which are not.

Alongside program order, the model defines several other fundamental relations:

Relation  Meaning
--------  -------
po        Program order: the sequential order in which a single CPU's instructions appear in the source code
rf        Reads-from: links a write event W to a read event R when R obtains its value from W
co        Coherence order: the total order of all stores to a single memory location, i.e. which write happened "last"
fr        From-reads: links a read R to a write W when W comes coherence-after the write that R read from
hb        Happens-before: a derived relation requiring that certain events execute in a specific order
pb        Propagates-before: links events when a store propagates to every CPU and to RAM before a later event executes

The model defines correctness in terms of cycles in these relations. If a set of rules requires event X to come before event Y, and Y before Z, and Z before X, that is a cycle -- and cycles indicate impossibility. The LKMM's axioms are fundamentally cycle-prohibition rules: they state that certain combinations of relations must not form cycles.

Cache Coherence and the Four Coherency Rules

Any reasonable memory model starts with cache coherence: the guarantee that all CPUs will eventually agree on the order of writes to any single memory location. The LKMM enforces this through four coherency rules that together prevent contradictions in the observed order of stores to a given variable.

Coherence order is required to be consistent with program order. This means it must not be possible for CPU1 to observe a sequence of writes to variable x that contradicts the order in which those writes occurred in the source code of each respective CPU. The LKMM's "coherence" axiom expresses this by requiring the union of the coherence relations not to contain any cycles.

Cache coherence protocols -- MESI being the canonical example on x86 systems -- handle this at the hardware level by ensuring that when a CPU takes ownership of a cache line to write it, all other cached copies are invalidated. But cache coherence alone is not sufficient for writing correct concurrent code. It tells you that CPUs agree on the final order of writes to each individual variable; it does not tell you anything about the relationship between writes to different variables.

Common Misconception

Developers often assume that if CPU0 stores to variable A and then stores to variable B, any CPU that observes the store to B must already see the store to A. This is true under TSO (x86), but it is NOT guaranteed under ARM or POWER architectures, which have weaker memory models. Kernel code that relies on this assumption is architecture-specific and therefore incorrect in a multi-architecture context.

This is the fundamental challenge the LKMM addresses: how do you write portable concurrent code that is correct on all of Linux's supported architectures, from x86's strong TSO model to ARM's weaker model to POWER's even more relaxed semantics?

The answer involves deliberately adding synchronization primitives that constrain the permitted reorderings -- and the LKMM tells you exactly which primitives are needed for which guarantees.

Memory Barriers: Enforcing Order Between CPUs

Memory barriers are instructions that prevent certain reorderings. The Linux kernel provides a hierarchy of barrier primitives, ranging from compiler-only barriers to full hardware barriers that flush CPU pipelines and store buffers.

The Barrier Hierarchy

kernel barrier primitives
/* Compiler-only barriers -- no CPU instruction emitted */
barrier()          // prevents compiler from reordering across this point

/* SMP-conditional barriers -- reduce to compiler barriers on UP */
smp_mb()           // full memory barrier: orders ALL prior loads/stores before ALL later ones
smp_rmb()          // read memory barrier: orders prior LOADs before later LOADs
smp_wmb()          // write memory barrier: orders prior STOREs before later STOREs

/* Acquire/Release semantics */
smp_load_acquire(p)  // load with acquire semantics -- no later access can be reordered before it
smp_store_release(p, v) // store with release semantics -- no prior access can be reordered after it

/* Mandatory barriers -- active even on UP, used for MMIO */
mb()               // mandatory full barrier
rmb()              // mandatory read barrier
wmb()              // mandatory write barrier

The SMP barriers are where the real action happens for most kernel code. smp_mb() is the "nuclear option" -- it guarantees that all memory accesses preceding it in program order propagate to all CPUs before any memory accesses following it begin. This is as expensive as it sounds; a full barrier on modern hardware costs tens of CPU cycles because it typically requires flushing the CPU's store buffer.

The kernel documentation describes the fundamental pairing rule clearly: when dealing with CPU-to-CPU interactions, memory barriers must be paired. An smp_wmb() on the writing side must be matched with an smp_rmb() on the reading side, or a full smp_mb() on both sides, to guarantee ordering.

Barrier Pairing Rule

SMP memory barriers _must_ be used to control the ordering of references to shared memory on SMP systems, though the use of locking instead is sufficient. Mandatory barriers should not be used to control SMP effects, since mandatory barriers impose unnecessary overhead on both SMP and UP systems.

Source: Linux kernel Documentation/memory-barriers.txt

The Message Passing (MP) Pattern

The classic litmus test for understanding why barriers are necessary is the Message Passing (MP) pattern. CPU0 writes a payload to a buffer, then sets a flag indicating the buffer is ready. CPU1 reads the flag, and if set, reads from the buffer.

message-passing-pattern.c
/* CPU 0: writer */
WRITE_ONCE(buf, 42);         // step 1: write payload
smp_wmb();                    // step 2: ensure step 1 propagates before step 3
WRITE_ONCE(flag, 1);          // step 3: signal ready

/* CPU 1: reader */
if (READ_ONCE(flag)) {         // step 4: check ready signal
    smp_rmb();                // step 5: ensure step 4 completes before step 6
    r1 = READ_ONCE(buf);      // step 6: read payload
}

Without the barriers, both the CPU and the compiler are free to reorder steps 1 and 3 (on the writer side) and steps 4 and 6 (on the reader side). On ARM or POWER hardware, this is not a theoretical concern -- it will actually happen. CPU1 could read flag == 1 but then read the old value of buf because the store to buf has not yet propagated to CPU1's cache.

The smp_wmb() on the writer side ensures that the store to buf is visible to all CPUs before the store to flag becomes visible. The smp_rmb() on the reader side ensures that the load of flag completes before the load of buf executes. Together, they create the necessary ordering guarantee.

Alternatively, and more idiomatically in modern kernel code, the same guarantee can be achieved with acquire/release semantics:

acquire-release-pattern.c
/* CPU 0: writer -- release store */
WRITE_ONCE(buf, 42);
smp_store_release(&flag, 1);   // release: all prior stores visible before this

/* CPU 1: reader -- acquire load */
if (smp_load_acquire(&flag)) { // acquire: no later access can be reordered before it
    r1 = READ_ONCE(buf);
}

The acquire/release pattern creates a happens-before edge between the two CPUs: the release store on CPU0 "happens before" the acquire load on CPU1, meaning all memory accesses by CPU0 before the release are guaranteed to be visible to CPU1 after the acquire.

READ_ONCE and WRITE_ONCE: Taming the Compiler

Beyond hardware reordering, there is a second adversary: the optimizing compiler. The C language's memory model, introduced in C11, treats concurrent conflicting accesses to a non-atomic variable -- two accesses to the same location from different threads, at least one of them a write -- as a data race with undefined behavior. This means the compiler is free to assume that if a variable is not modified by the current thread, it will not change between two consecutive reads -- and so it may legitimately transform two reads into one.

This is the "load fusing" problem. Consider a loop that polls a flag waiting for a condition:

compiler-optimization-hazard.c
/* Dangerous: compiler may hoist the load out of the loop */
while (!done)
    cpu_relax();

/* Safe: READ_ONCE prevents compiler from caching the value */
while (!READ_ONCE(done))
    cpu_relax();

/* READ_ONCE and WRITE_ONCE are implemented as volatile casts */
/* They prevent the compiler from merging, splitting, or 
   reordering accesses to the annotated variable             */

Without READ_ONCE(), a sufficiently aggressive optimizer may legally transform that loop into a single check followed by an infinite loop -- reading done once, caching it in a register, and never re-reading it from memory. The compiled code would spin forever even after another CPU sets done = 1.

READ_ONCE() and WRITE_ONCE() solve this by implementing the access as a volatile cast, which forces the compiler to emit an actual memory access instruction at that program point without any caching, merging, or reordering across that instruction. They do not add any hardware memory barrier instructions -- they purely constrain the compiler.

The LKMM distinguishes sharply between marked accesses (those using READ_ONCE(), WRITE_ONCE(), atomic operations, or explicit barriers) and plain accesses (bare C variable reads and writes). Concurrent plain accesses to the same shared variable constitute data races, which are undefined behavior under the C standard. The LKMM requires that all potentially concurrent accesses to shared variables be marked -- accesses fully serialized by a lock may remain plain -- and this is not merely stylistic; it is a correctness requirement.

"Without the READ_ONCE(), the compiler might combine the load from 'a' with other loads from 'a'. Without the WRITE_ONCE(), the compiler might combine the store to 'b' with other stores to 'b'. Either can result in highly counterintuitive effects on ordering."

-- Linux kernel Documentation/memory-barriers.txt (kernel.org)

The Six Formal Axioms of the LKMM

The formal version of the LKMM is defined by six requirements, each expressed as an acyclicity constraint on a specific relation or combination of relations. Understanding these axioms gives you a complete picture of what the model actually guarantees.

1. Sequential Consistency Per Variable (Coherence)

This axiom requires that the system obey the four coherency rules: the coherence order of stores to any single variable must be consistent with program order. No CPU should observe stores to a variable in a different order than they occurred relative to each other. The LKMM's "coherence" axiom expresses this by requiring that the union of the coherence-related relations contains no cycles.

2. Atomicity

This axiom requires that atomic read-modify-write (RMW) operations -- things like cmpxchg(), xchg(), atomic_add_return() -- are truly atomic: no other store can "sneak in" between the read and write components of the operation. Formally, whenever a read event R and write event W compose an atomic RMW, and W' is the write event that R reads from, there must not be any store that comes between W' and W in the coherence order. Note that for conditional RMW operations such as cmpxchg(), this atomicity guarantee (and the associated ordering) applies only when the comparison succeeds and the operation actually performs its write.

3. Happens-Before

This is perhaps the most important axiom for developers to internalize. It requires that certain instructions execute in a specific globally-agreed order. The happens-before (hb) relation links memory access events in the order they must execute, and the axiom states that hb must not contain any cycles. This is the formal expression of the intuition that "if X happens before Y and Y happens before Z, then X must happen before Z."

4. Propagation

This axiom requires that certain stores propagate to CPUs and to RAM in a specific order. The propagates-before (pb) relation is built on top of strong fences (full memory barriers and RCU grace periods) and captures the guarantee that these fences provide about the global visibility order of stores.

5. RCU (Grace Period Guarantee)

This axiom formalizes the fundamental guarantee of Read-Copy-Update synchronization: that any RCU read-side critical section that begins before a grace period starts must complete before that grace period ends. Formally, the RCU-before (rb) relation must not contain a cycle. This axiom is what makes it safe for RCU to reclaim memory after a grace period -- it proves that no reader still active before the grace period can access the memory after it is reclaimed.

6. Plain-Coherence

This final axiom requires that plain memory accesses -- those not using READ_ONCE(), WRITE_ONCE(), or any other marking -- must still obey the operational model's cache coherence rules. Even unguarded C variable reads and writes must not observe stores in an order that violates cache coherence. This axiom primarily constrains the model's behavior with respect to data races rather than providing a guarantee developers can rely on.

Formal Verification

The LKMM is expressed in the "cat" language, used by the herd7 tool (part of the diy toolsuite developed by Jade Alglave, Luc Maranget, and collaborators). This makes it an executable formal model -- you can write litmus tests in a subset of C supplemented with Linux kernel constructs and run them against the model to verify whether specific outcomes are permitted or forbidden. This is how the model's developers discovered bugs in kernel locking primitives before the model was even merged.
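For a taste of the format, here is the message-passing litmus test with a release/acquire pair, modeled on the tests shipped in tools/memory-model/litmus-tests/ (treat the exact file name and spelling as a sketch). herd7 answers whether the "exists" clause is reachable; under the LKMM this outcome is forbidden:

```
C MP+pooncerelease+poacquireonce

{}

P0(int *buf, int *flag)
{
	WRITE_ONCE(*buf, 1);
	smp_store_release(flag, 1);
}

P1(int *buf, int *flag)
{
	int r0;
	int r1;

	r0 = smp_load_acquire(flag);
	r1 = READ_ONCE(*buf);
}

exists (1:r0=1 /\ 1:r1=0)
```

The "exists" clause asks: can P1 see the flag set (r0=1) yet read the stale buffer (r1=0)? Running the test through herd7 against the LKMM reports the outcome as never reached -- the formal proof of the acquire/release pattern shown earlier.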

RCU: Read-Copy-Update and the Memory Model

No discussion of the Linux kernel memory model is complete without a thorough treatment of Read-Copy-Update (RCU). RCU is the kernel's most sophisticated synchronization mechanism and, as the dedicated axiom in the LKMM indicates, it required formal treatment beyond what simpler barrier-based models could capture.

RCU was introduced into the Linux kernel in version 2.5.43 in October 2002, primarily by Paul McKenney, with Dipankar Sarma as a key contributor. The underlying synchronization concept dates to earlier academic work by McKenney and Jack Slingwine (US Patent 5,442,758, 1995). Its core insight is elegant: readers can proceed without any locking at all, while writers create new versions of data structures and rely on a "grace period" mechanism to safely reclaim old versions.

The key property that makes this work is the Grace Period Guarantee: any RCU read-side critical section that is in progress when a grace period begins must complete before that grace period ends. A grace period is any time period during which each CPU has passed through at least one "quiescent state" -- a state where it is not executing in an RCU read-side critical section.

rcu-update-pattern.c
/* Writer: update a pointer safely under RCU */
struct mydata *old_ptr, *new_ptr;

new_ptr = kmalloc(sizeof(*new_ptr), GFP_KERNEL);
/* ... populate new_ptr ... */

old_ptr = rcu_dereference_protected(global_ptr, lockdep_is_held(&my_lock));
rcu_assign_pointer(global_ptr, new_ptr);  // publish the new pointer with release semantics

synchronize_rcu();   // wait for a full grace period to elapse
kfree(old_ptr);      // now safe: all readers that could see old_ptr have finished

/* Reader: traverse the pointer safely */
rcu_read_lock();
ptr = rcu_dereference(global_ptr);  // load with address dependency / consume semantics
/* ... use ptr ... */
rcu_read_unlock();

The LKMM formalizes RCU by defining a family of relations: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb (RCU-before). The RCU axiom then requires that the rb relation contains no cycles. This is the formal statement of the Grace Period Guarantee.

One subtle but important point: rcu_read_lock() and rcu_read_unlock() are not free operations from a memory ordering perspective. They are modeled as fence events in the LKMM's formal language (as seen in the linux-kernel.bell file in the kernel source). The interaction between RCU fences and other synchronization primitives is carefully specified to ensure that the Grace Period Guarantee interacts correctly with happens-before and propagation ordering.

"RCU's novel properties include support for concurrent forward progress for readers and writers as well as highly optimized inter-CPU synchronization."

-- "RCU Usage In the Linux Kernel: Eighteen Years Later," ACM SIGOPS Operating Systems Review, 2020

In practice, the read-side overhead of RCU in a non-preemptible (TREE_RCU) server-class Linux kernel build is exactly zero -- the rcu_read_lock() and rcu_read_unlock() calls compile away completely. On kernels built with CONFIG_PREEMPT_RCU (common in desktop and real-time configurations), rcu_read_lock() increments a per-task preemption counter and the overhead is nonzero. It is the non-preemptible server configuration that makes RCU so valuable for high-frequency read paths like the virtual filesystem (VFS), network routing tables, and process credential lookups.

Data Races: When Plain Accesses Go Wrong

The LKMM draws a precise technical distinction between a data race and a race condition. A race condition is a high-level logical bug where the outcome of a concurrent program depends on the timing of operations in a way the developer did not intend. A data race is a specific, lower-level violation: two memory accesses to the same location that conflict (at least one is a write), happen concurrently in different threads, and at least one is a plain (unmarked) access.

Data races are particularly dangerous because the C11 standard declares them to invoke undefined behavior. This means a sufficiently aggressive compiler, upon detecting that two accesses might race, is legally permitted to compile the code in any way it chooses -- including optimizations that make the program behave in completely unexpected ways. The Linux kernel, unlike a normal C program, explicitly rejects the C11 undefined behavior model for data races. The LKMM defines a more constrained behavior for kernel code, but it still requires that shared variables be properly marked to avoid data races.

The practical consequence: any shared variable that could be accessed concurrently from multiple CPUs must use READ_ONCE() for reads and WRITE_ONCE() for writes, or must be protected by appropriate locking, RCU, or atomic operations.

Control Dependencies: A Subtle Trap

One of the most frequently misunderstood aspects of the LKMM concerns control dependencies. A control dependency exists when a conditional load (reading a value and branching based on it) appears to constrain the ordering of a subsequent store. Many developers assume that "if I read X and based on that decide to write Y, then my write to Y must be ordered after my read of X." But this reasoning is dangerously incomplete.

control-dependency-trap.c
/* CPU 0 */
r1 = READ_ONCE(x);
if (r1)
    WRITE_ONCE(y, 1);   // ordered after the load of x at the hardware
                        // level -- but ONLY if the compiler keeps the branch

/* CPUs never make speculative stores visible, so hardware does honor  */
/* load-to-store control dependencies. The dependency provides nothing */
/* else: no load-to-load ordering (later loads can be speculated past  */
/* the branch), no transitivity -- and the compiler may delete or      */
/* transform the conditional entirely, taking the ordering with it.    */

/* Robust fix: use an explicit barrier or a release store */
r1 = READ_ONCE(x);
if (r1) {
    smp_mb();             // orders ALL prior accesses before the store
    WRITE_ONCE(y, 1);
}

The kernel documentation's memory-barriers.txt devotes considerable space to control dependencies precisely because the hazard is so non-obvious. Control dependencies do not provide multicopy atomicity. They do not apply to code following the if-statement. And critically, compilers do not understand control dependencies -- if the compiler optimizes away the conditional, the ordering disappears with it. READ_ONCE() helps prevent the compiler from eliminating the branch, but developers must be vigilant.

KCSAN: The Kernel Concurrency Sanitizer

Formal models are only as useful as your ability to check real code against them. The Linux kernel addresses this with KCSAN -- the Kernel Concurrency Sanitizer -- a dynamic data race detector developed primarily by Marco Elver (Google) and merged into the Linux kernel in version 5.8 (released August 2020).

KCSAN is aware of the LKMM. It understands marked atomic operations (READ_ONCE, WRITE_ONCE, atomic_*), and when the kernel is built with CONFIG_KCSAN_WEAK_MEMORY=y it additionally models weak-memory reordering (store and load buffering) so that it can detect missing smp_mb(), smp_wmb(), smp_rmb(), and smp_store_release() calls.

The mechanism is clever. KCSAN relies on compiler instrumentation of plain memory accesses. For each instrumented access, it either checks for an existing "watchpoint" (and fires a race report if it finds a conflicting one) or sets up a new watchpoint and deliberately stalls the current thread for a randomized delay. If another thread accesses the same location during that window, the watchpoint fires and the race is reported.

Critical Advice

"Do NOT respond to KCSAN reports by mindlessly adding READ_ONCE(), data_race(), and WRITE_ONCE()." Adding annotations without understanding the underlying race condition is a band-aid over a concurrency design flaw. The correct response is to understand what ordering guarantee is actually needed and apply the appropriate synchronization primitive.

Source: LWN.net, "Concurrency bugs should fear the big bad data-race detector"

Since its introduction, KCSAN has found a large number of concurrency bugs in the Linux kernel, many of which had existed for years without detection. A notable example is CVE-2024-26861, where syzbot running KCSAN discovered a data race in WireGuard's packet receive path: concurrent reads and writes to keypair->receiving_counter.counter from a softirq NAPI poll context and a workqueue decryption worker, without proper annotations. The fix used READ_ONCE()/WRITE_ONCE() to annotate the access as intentionally racy -- a deliberate choice by the WireGuard maintainers, who judged that the race was benign given the algorithm's logic but that it needed to be made explicit and well-defined under the LKMM.

Another example is a data race in the Linux kernel's bonding driver, where fields tracking last-received timestamps on bond slaves (notably slave->last_rx) were being read and written locklessly from concurrent interrupt handlers on different CPUs. KCSAN and syzbot flagged the issue; the fix annotated those accesses with READ_ONCE() and WRITE_ONCE(). The maintainers' preference for annotations over heavier locking reflected a deliberate design choice: adding locks to high-frequency packet receive paths introduces unacceptable latency.

This illustrates a nuance that the LKMM enables: not all data races are bugs in the traditional sense. Some concurrent accesses are intentionally racy -- the program logic can tolerate occasional stale reads, or the access is protected at a higher level. The data_race() annotation allows kernel developers to explicitly mark such accesses as intentionally racing, telling KCSAN to ignore them without hiding the fact that a race exists. This is honest engineering: acknowledging the race, documenting why it is acceptable, and preventing future developers from silently breaking the intentional design.

Atomic Operations and Their Ordering Guarantees

Atomic operations -- atomic_read(), atomic_set(), atomic_add(), atomic_cmpxchg(), and dozens of variants -- are the backbone of lock-free synchronization in the kernel. But their ordering guarantees are not uniform, and confusing them is a common source of subtle bugs.

The general rule in the LKMM is: any read-modify-write (RMW) atomic operation that returns a value (old or new) and whose name does not end in _acquire, _release, or _relaxed implies a full SMP-conditional memory barrier (smp_mb()) on each side of the actual operation. This means operations like atomic_xchg(), atomic_add_return(), and atomic_dec_and_test() are "full barrier" atomics -- they provide complete ordering of all memory accesses before and after the operation. One important precision: for conditional RMW operations such as atomic_cmpxchg(), the full ordering guarantee applies only when the operation succeeds; a failing compare-exchange provides no ordering guarantee.

However, non-value-returning atomics like atomic_add() and atomic_inc() provide atomicity but no ordering guarantees relative to surrounding non-atomic accesses -- the LKMM does not constrain how the CPU or compiler may reorder surrounding plain accesses around these operations. Developers who need ordering around these operations must explicitly add smp_mb__before_atomic() or smp_mb__after_atomic(). Note that atomic_set() is not an RMW operation but a plain assignment through the atomic type wrapper; it carries neither atomicity in the RMW sense nor any ordering guarantee.

atomic-ordering.c
/* Full barrier semantics -- safe, ordering guaranteed */
old = atomic_xchg(&counter, new_val);

/* cmpxchg: full ordering only on SUCCESS; failing CAS provides no ordering */
old = atomic_cmpxchg(&counter, expected, new_val);

/* Relaxed -- NO ordering guarantee around surrounding accesses */
atomic_add(1, &counter);
atomic_inc(&counter);

/* atomic_set() is a plain assignment through the atomic wrapper -- NOT an RMW */
/* It provides neither RMW atomicity nor any ordering guarantee                 */
atomic_set(&counter, 0);

/* Adding explicit ordering around a relaxed atomic */
WRITE_ONCE(shared_data, value);
smp_mb__before_atomic();
atomic_add(1, &counter);       // now ordered after the WRITE_ONCE above

/* Value-returning atomics (no _relaxed/_acquire/_release suffix) provide full barrier */
/* No additional smp_mb() needed here:                                                  */
result = atomic_add_return(1, &counter);
r1 = READ_ONCE(shared_data);  // guaranteed to see stores before the atomic

The LKMM also defines _relaxed, _acquire, and _release variants for many atomic operations (atomic_cmpxchg_relaxed(), atomic_xchg_acquire(), etc.). These map to the C11 memory ordering model and allow developers to use the most efficient ordering level that still provides correctness, rather than always paying the full cost of a bidirectional barrier.

Why This Matters Beyond the Kernel

The LKMM is not merely an academic exercise. It has had direct, measurable impact on the quality and correctness of Linux kernel code. When Ingo Molnar sent the LKMM pull request into the Linux 4.17 merge window on April 2, 2018, the roughly five thousand lines of code and documentation came with a track record: litmus-test-based analysis had already uncovered bugs in kernel locking primitives that existed in the codebase undetected, and KCSAN (developed later, merged in 5.8) has since found a large number of additional races guided by the model's formal definitions.

For cybersecurity practitioners, the LKMM is directly relevant to vulnerability analysis. Race conditions in kernel code are among the most dangerous vulnerability classes -- they can lead to use-after-free bugs, privilege escalation, and kernel panics. Understanding the LKMM gives you the conceptual vocabulary to read CVEs, understand what primitives were missing, and evaluate whether a proposed fix actually closes the race or merely makes it less likely.

For kernel developers and systems programmers, the LKMM provides something invaluable: a way to be formally right rather than probably right. Writing concurrent kernel code without understanding the memory model is like writing multi-threaded code without knowing what a mutex is -- you may get away with it most of the time, on most hardware, under most workloads. But the bugs that emerge from weak memory ordering are the bugs that appear in production at 3 AM on the highest-traffic day of the year, on the specific hardware architecture that your tests do not cover.

The LKMM's tooling -- the herd7 model checker, KCSAN, and the litmus test suite -- gives developers the ability to ask and answer questions that were previously only answerable through painful empirical discovery: "Is this specific sequence of operations with these specific barriers correct under all architectures the kernel supports?" The model gives you a definitive answer.

Learning Path

The official learning path recommended by Paul McKenney includes: (1) Read tools/memory-model/Documentation/explanation.txt in the kernel source tree -- it explains the model's design in accessible English. (2) Work through the litmus tests in tools/memory-model/litmus-tests/. (3) Install the herd7 tool and run the model interactively. (4) Study the memory-barriers.txt documentation for practical barrier usage patterns. The model's documentation explicitly notes that playing with the actual model tool is "probably key to gaining a full understanding of LKMM."
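To make steps (2) and (3) concrete, here is the classic message-passing litmus test in LKMM's C-like litmus format, reproduced from memory (the shipped file MP+pooncerelease+poacquireonce.litmus may differ in minor details such as variable names), which herd7 reports as forbidden -- the "exists" clause can never be satisfied:

```
C MP+pooncerelease+poacquireonce

{}

P0(int *buf, int *flag)
{
	WRITE_ONCE(*buf, 1);
	smp_store_release(flag, 1);
}

P1(int *buf, int *flag)
{
	int r0;
	int r1;

	r0 = smp_load_acquire(flag);
	r1 = READ_ONCE(*buf);
}

exists (1:r0=1 /\ 1:r1=0)
```

Running it from tools/memory-model/ with something like herd7 -conf linux-kernel.cfg litmus-tests/MP+pooncerelease+poacquireonce.litmus prints the set of permitted outcomes; "Never" against the exists clause means the release/acquire pairing rules out observing the flag without the buffer.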

Conclusion: Formal Correctness Is Not Optional

The Linux Kernel Memory Model is the kernel community's formal answer to a problem that has existed since the first SMP Linux system: how do you write concurrent code that is provably correct across dozens of CPU architectures with different memory ordering semantics?

The LKMM defines its guarantees through six axioms -- coherence, atomicity, happens-before, propagation, RCU, and plain-coherence -- each expressed as a prohibition on cycles in specific relations between memory events. It distinguishes between marked accesses (those using READ_ONCE(), WRITE_ONCE(), atomics, and barriers) and plain accesses (bare C variable reads and writes), and it requires that shared state use marked accesses to avoid undefined behavior from data races.

The practical toolkit the model enables includes: a comprehensive hierarchy of memory barrier primitives from compiler-only barrier() to full hardware smp_mb(); the acquire/release idioms (smp_load_acquire() / smp_store_release()) for efficient ordered synchronization; RCU for high-frequency read paths that cannot afford locking overhead; and KCSAN for dynamic runtime detection of data races in real kernel code.

Memory ordering bugs are among the most subtle defects in systems software. They are architecture-dependent, timing-dependent, and often invisible until production load on specific hardware surfaces them. The LKMM does not eliminate that complexity -- nothing can. What it does is give developers the formal vocabulary and tooling to reason about it rigorously, to write code that is correct by construction rather than by hope, and to catch the bugs that informal reasoning misses.

If you write kernel code, driver code, or any concurrent systems code that targets Linux, the LKMM is not optional reading. It is the rulebook for the game you are playing. And the game is played on hardware that is, as the ASPLOS 2018 paper memorably titled it, genuinely "frightening" in its concurrency behavior -- unless you know the rules.