There is a moment every new Linux user knows. You have just booted into your first terminal. The prompt blinks at you like a patient predator. You type something. Nothing happens the way you expected. You type something else. An error message appears, cryptic and indifferent. You Google it, paste a command you do not understand, and something either works or completely breaks. You are the bambi -- the fresh spawn -- wandering into an environment that was not designed to hold your hand.

This guide is not the five-minute crash course. It is not a list of commands to memorize. It is an attempt to explain what Linux is, from the kernel outward, in enough depth that you stop feeling like you are operating blind. Because the dirty truth about Linux is that once you understand what is happening underneath, the whole system stops being mysterious. It becomes obvious. Almost elegant.


The Thing Running Everything: The Kernel

Before you learn a single command, you need to understand what you are talking to.

Linux is not an operating system. That is the first misconception to dissolve. Linux is a kernel -- the core piece of software that mediates between your hardware and everything else. When someone says "I run Linux," they mean they run an operating system whose kernel is Linux. The operating system itself -- the distribution -- is built around that kernel, but the kernel is the engine.

Note

This distinction is the subject of a long-running naming debate in the open-source community. The Free Software Foundation, led by Richard Stallman, argues that because distributions rely heavily on GNU project tools -- the C library, the compiler, the shell, the coreutils -- the operating system should properly be called GNU/Linux, not just "Linux." The argument has merit: without GNU tools, the kernel alone cannot function as a usable operating system. However, the broader industry, Linus Torvalds himself, and the overwhelming majority of distributions use "Linux" as shorthand for the complete system. Some distributions, such as Alpine Linux and Android, use the Linux kernel with non-GNU user-space tools (musl and Bionic, respectively), which further complicates the naming question. This guide uses "Linux" in the way the industry does -- to refer to the family of operating systems built around the Linux kernel -- while acknowledging that the GNU project's contribution to what people call "Linux" is substantial and historically essential.

I'm doing a (free) operating system (just a hobby, won't be big and professional...)

-- Linus Torvalds, comp.os.minix newsgroup, August 25, 1991

Linus Torvalds, a Finnish computer science student at the University of Helsinki, began writing the Linux kernel in April 1991. On August 25, he posted a now-famous message to the comp.os.minix Usenet newsgroup announcing a free operating system he dismissed as a casual side project that would never rival the GNU project in scope. That message is one of the most consequential understatements in the history of computing. The kernel Torvalds started now runs the majority of the world's servers, all of the world's top 500 supercomputers (a fact that has been true since the November 2017 TOP500 list and has held every list since), all Android phones, and many of the systems you interact with daily without knowing it.

The Linux kernel today is maintained by thousands of contributors. Kernel 6.18, released in November 2025, set a new record with 2,134 individual developers contributing code to a single release, surpassing the previous record of 2,090 set during kernel 6.2. The kernel manages everything from processes and memory to filesystems and device hardware, and it is open source, meaning anyone can read, modify, or improve the code. The source tree is available at kernel.org and surpassed 40 million lines of code in January 2025 with kernel 6.14 (counting all source, comments, blank lines, and documentation -- the figure varies depending on methodology, with tools like cloc reporting lower numbers when counting only executable code), doubling in size from approximately 19 million lines in 2015. As of April 2026, the latest stable release is kernel 7.0, which Linus Torvalds released on April 12, 2026 -- the first major version bump since kernel 6.0 in October 2022. The jump to 7.0 carries no special technical significance; as Torvalds explained in the 6.19 release announcement, he bumps the major version number when the minor numbers get large enough to cause confusion. Reading the source is not required to use Linux well. But understanding its architecture changes how you see every command you type.

The kernel operates in what is called kernel space, a region of memory that is entirely separate from user space, where your applications run. When you run a program, it runs in user space. When that program needs to do something that requires direct hardware access -- writing to disk, allocating memory, sending a network packet -- it makes a system call. The kernel handles it, then hands control back. This boundary is not just organizational; it is enforced by the processor hardware. On x86 processors, this is implemented through privilege rings: your terminal emulator lives in ring 3 (the least privileged mode), and the kernel lives in ring 0 (the highest privilege). On ARM processors, the equivalent concept is called Exception Levels (EL0 for user space, EL1 for the kernel). The terminology differs by architecture, but the principle is universal: your user-space code literally cannot reach the hardware directly, by hardware design. It must ask the kernel.

Note

This is why Linux is considered robust. A misbehaving application in user space cannot corrupt kernel memory. It can crash itself. It cannot bring down the system (unless there is a kernel bug, which does occasionally happen). This design decision, inherited from Unix in the 1970s, is why Linux can run for years without a reboot on production servers.

The Process: You Are Never Just Running One Thing

Before moving on, test your understanding: If a user-space program like Firefox tries to write data to disk, what prevents it from accessing the disk hardware directly?

The processor hardware itself enforces the boundary. On x86, Firefox runs in ring 3 (user mode) while the kernel runs in ring 0. Instructions that directly access hardware are privileged -- ring 3 code cannot execute them. Firefox must issue a write() system call, which triggers a controlled transition to ring 0 where the kernel handles the actual disk I/O, then returns control to ring 3. This is why a misbehaving application can crash itself but cannot corrupt the kernel or other processes -- the separation is enforced at the silicon level, not by convention.

When you type a command in a terminal, you are creating a process. A process is a running instance of a program, but that definition undersells what it is. Every process in Linux is represented in the kernel by a structure called task_struct, defined in include/linux/sched.h. This structure contains everything the kernel needs to know about a running program: its memory map, its open file descriptors, its state, its relationships to parent and child processes, its scheduling information, its credentials, and much more. On a 64-bit system with a typical distribution kernel configuration, a single task_struct is approximately 7 KB in size (around 7,168 bytes, spanning 112 CPU cache lines and containing over 150 fields), which you can verify on your own system by reading /sys/kernel/slab/task_struct/object_size. That 7 KB is allocated per task, meaning per thread -- a Java application with 200 threads has 200 separate task_struct instances in kernel memory, each independently tracked by the scheduler. In kernel 2.4 (circa 2001), the same structure was 960 bytes. The growth reflects two decades of added capabilities: namespaces, cgroups, security modules, and the scheduler metadata that EEVDF requires.

There are no true threads in Linux at the kernel level -- there are only tasks. This statement tends to provoke debate, and it deserves nuance. From the perspective of user-space applications, Linux absolutely supports POSIX threads through the NPTL (Native POSIX Threads Library), which provides full threading semantics including shared memory, proper signal handling, and thread-local storage. The statement is about what the kernel sees: internally, every thread is a task created with the clone() system call using flags like CLONE_VM (share virtual memory), CLONE_FS (share filesystem information), CLONE_FILES (share file descriptor table), and CLONE_SIGHAND (share signal handlers). When all of these flags are passed together, the child task shares the parent's address space, file descriptors, and signal handlers -- and that is what makes them "threads" to user space. The glibc pthread_create() function is a wrapper that calls clone() with exactly this combination of flags. Two tasks sharing the same mm_struct (memory descriptor) is the kernel-level reality behind the user-space abstraction. The distinction matters when you start debugging performance or understanding how processes interact, because the scheduler treats threads and processes identically -- they are all just entries in the run queue, each with their own task_struct, each independently scheduled. In other words, the kernel does not have a separate concept of "thread" distinct from "process" -- it has tasks with varying degrees of shared resources, and the degree of sharing is what user space calls a "thread" versus a "process."

Every process has a Process ID, or PID. PID 1 is special. It is the init process -- the first process the kernel spawns after booting. On the large majority of modern Linux systems, PID 1 is systemd -- this includes Ubuntu, Fedora, Debian, Arch, RHEL, SUSE, and their derivatives, which collectively represent the vast majority of installed Linux systems. However, it is worth noting that not all distributions use systemd. Void Linux uses runit, Devuan uses sysvinit, Alpine Linux uses OpenRC, and Gentoo allows users to choose their init system. The existence of these alternatives is a point of active discussion in the Linux community, and some users feel strongly that systemd's scope has grown beyond what an init system should be. Regardless of which init system is running, the PID 1 contract is the same: every other process on the system is a descendant of PID 1. When a process finishes, its parent receives a signal and is expected to clean up the child's resources. When a parent dies before its children, those orphaned processes are re-parented to PID 1. This is why killing PID 1 would bring down the entire user space.

You can see the full process tree at any moment by running pstree. You can see every process's task_struct-derived information by looking in /proc. If a process has PID 4821, then /proc/4821/ is a directory containing virtual files that represent its current state:

process inspection
# View the full process tree
$ pstree

# Examine a process with PID 4821
$ cat /proc/4821/status    # state and memory usage
$ cat /proc/4821/maps      # virtual memory layout
$ ls /proc/4821/fd/        # open file descriptors

The /proc filesystem is what the Linux man-pages project (proc(5)) describes as a pseudo-filesystem: it provides a file-based interface to kernel data structures and can be used both to read system information and to modify certain kernel parameters at runtime. This is not a metaphor. When you read /proc/4821/status, the kernel is not serving you a file from disk. It is generating that data on the fly from its internal task_struct. The file has no size on disk because it does not exist on disk.
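You can see this for yourself. A minimal sketch (Linux-only; /proc/self is a symlink the kernel resolves to the reading process's own PID directory):

```shell
# /proc files report size 0: there are no disk blocks behind them
stat -c 'size: %s bytes, blocks: %b' /proc/self/status

# Yet reading one produces live data generated from the task_struct
head -n 3 /proc/self/status
```

The size is zero not because the file is empty, but because there is nothing on disk to measure -- the content is synthesized at read time.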

The fork-exec process lifecycle: the shell (parent) calls fork(), producing a child copy of itself; the child calls exec() to replace its image with the new program; the program runs; on exit, the parent collects its status with wait().
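You can watch this lifecycle from the shell itself. A sketch (bash-specific: $$ expands to the shell's own PID):

```shell
echo "parent shell PID: $$"

# Running an external command: the shell fork()s a child, the child
# exec()s the program, and the parent wait()s for it to finish
bash -c 'echo "child PID: $$ (different from the parent)"'

# exec replaces the current process image without forking -- done in a
# subshell here so the outer shell survives the replacement
( exec echo "echo has replaced the subshell" )
```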

How the Scheduler Decides What Runs

Right now, your computer is running dozens or hundreds of processes. Your CPU likely has a handful of cores. The kernel has to continuously decide which process runs on which core, for how long, and in what order. This is the scheduler's job, and the scheduler is one of the most intellectually interesting parts of the entire kernel.

For a significant stretch of Linux's history, the primary scheduler for normal processes was the Completely Fair Scheduler, or CFS, written by Ingo Molnar and merged into kernel version 2.6.23 in October 2007. The kernel's own CFS documentation (sched-design-CFS) summarizes the core idea in a single sentence: CFS models an ideal, precise multi-tasking CPU on real hardware. If two processes are running, each should get exactly 50% of the CPU. If four are running, each gets 25%. Since a single CPU core can only run one thing at a time, CFS approximates this ideal through the concept of virtual runtime.

Every process has a vruntime value -- a counter that tracks how much CPU time the process has consumed, weighted by its priority. At every scheduling point, the process's vruntime is incremented based on how long it ran, but scaled by its weight: a process with default priority (nice 0) has vruntime increase at the rate of wall-clock time, a high-priority process (negative nice) has vruntime increase more slowly, and a low-priority process (positive nice) has vruntime increase faster. This means high-priority processes accumulate vruntime slowly and remain on the left side of the tree longer, getting more CPU time in proportion to their weight. Whenever a timer interrupt or context-switch happens, the scheduler always picks the runnable task with the lowest vruntime. The process that has received the least weighted CPU time runs next. This elegantly handles fairness without the complex heuristics that plagued earlier schedulers like the O(1) scheduler it replaced.
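The weighting can be sketched as a toy simulation in shell arithmetic. The weights come from the kernel's sched_prio_to_weight[] table (nice 0 = 1024, nice 5 = 335); the 1 ms tick and the two-task setup are illustrative assumptions, not how the kernel literally iterates:

```shell
# Each tick, run the task with the lowest vruntime and charge it
# tick_ns * 1024 / weight. Task A is nice 0, task B is nice 5.
va=0; vb=0; ra=0; rb=0
for tick in $(seq 1 100); do
  if [ "$va" -le "$vb" ]; then
    va=$(( va + 1000000 * 1024 / 1024 )); ra=$(( ra + 1 ))   # nice 0
  else
    vb=$(( vb + 1000000 * 1024 / 335 )); rb=$(( rb + 1 ))    # nice 5
  fi
done
echo "nice 0 ran $ra ticks, nice 5 ran $rb ticks"   # roughly a 3:1 split
```

The split mirrors the 1024:335 weight ratio: the lower-priority task is charged about three times as much vruntime per tick, so it is picked about a third as often.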

The data structure CFS uses to track all runnable processes is a red-black tree -- a self-balancing binary search tree with efficient insert and remove operations that execute in O(log N) time, where N is the number of nodes in the tree. The leftmost node of the tree is always the process with the lowest vruntime, which is therefore the next process to run. Finding the next task is O(1) because the scheduler caches that leftmost pointer.

Pro Tip

This design has a beautiful side effect for interactive applications. An I/O-bound process -- one that spends time waiting for keyboard input or network data -- does not accumulate vruntime while sleeping. When it wakes up, its vruntime is low relative to CPU-bound processes, so it gets scheduled quickly. This is why your terminal feels responsive even while a compilation job is hammering your CPU cores.

How nice values affect scheduling: imagine firefox, make -j8, and rsync all starting with vruntime 50 at nice 0 -- perfectly equal. Raise the nice value of make -j8 and its vruntime advances faster with every tick it runs, so it drifts toward the right of the tree while the other two are picked more often.

As of Linux kernel version 6.6, released on October 29, 2023, CFS has been replaced as the default by the EEVDF scheduler (Earliest Eligible Virtual Deadline First). EEVDF is based on a 1995 research paper by Ion Stoica and Hussein Abdel-Wahab, and was implemented for the Linux kernel by veteran scheduler developer Peter Zijlstra. Rather than simply picking the task with the lowest vruntime, EEVDF calculates a virtual deadline for each eligible task and selects the task whose deadline comes earliest. The eligibility check is based on a value called lag: tasks with lag greater than or equal to zero -- those owed CPU time or at exactly their fair share -- are considered eligible, while tasks with negative lag (those that have exceeded their fair share) are deprioritized. This is a cleaner mechanism than CFS, which relied on accumulating heuristics over its sixteen-year history. EEVDF also handles sleeping tasks through a deferred dequeue mechanism: when a task sleeps, it remains on the run queue but is marked for deferred removal, allowing its lag to decay over virtual runtime. This prevents tasks from gaming the scheduler by sleeping briefly to reset their accumulated negative lag. Tasks can also request specific time slices via the sched_setattr() system call, giving latency-sensitive applications finer control without the ad-hoc heuristics CFS required. If you are running a kernel older than 6.6, you are still on CFS. If you are on kernel 6.12 or newer, the legacy CFS code has been fully removed -- EEVDF is the sole fair scheduler.

Note

Kernel 6.12, released on November 17, 2024, also merged sched_ext -- a framework that allows developers to write custom scheduling policies as eBPF programs loaded from user space. (sched_ext was originally planned for kernel 6.11, but was deferred after Linus Torvalds decided more preparation was needed; the necessary scheduler changes landed in the 6.12 merge window, and Torvalds pulled the sched_ext code on September 21, 2024.) This means that for specialized workloads, you can now prototype and deploy custom schedulers without modifying or recompiling the kernel itself. sched_ext is maintained by Meta engineer Tejun Heo and kernel developer David Vernet. It does not replace EEVDF for general workloads, but it opens scheduling to a level of experimentation that was previously impossible without out-of-tree patches.

The practical implication for users: when you run nice -n 19 make -j8 to compile something at the lowest priority, you are adjusting the weight of your process in the scheduler's fairness calculation. The nice value range spans from -20 (highest priority, heaviest weight, reserved for root) to 19 (lowest priority, lightest weight, available to any user). The scheduler will still run your compilation -- it will not starve -- but every other process will get proportionally more time. On a system with EEVDF, that low-priority compilation will also have a later virtual deadline because the deadline is computed as the task's eligible time plus the ratio of its requested time slice to its weight. A lighter-weight process produces a larger quotient and therefore a more distant deadline, meaning latency-sensitive processes like your text editor or terminal will consistently preempt it.
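You can read the nice value back from the scheduler's point of view. A sketch (ps's NI column reports what the kernel sees; renice requires root only when lowering nice below its current value):

```shell
# Launch a command at reduced priority and read the nice value back
nice -n 10 sh -c 'ps -o pid,ni,comm -p $$'

# For an already-running process, renice adjusts it in place
sleep 30 & pid=$!
renice -n 19 -p "$pid"
ps -o pid,ni,comm -p "$pid"
kill "$pid"
```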

The EEVDF Deadline Formula

The virtual deadline calculation in the kernel simplifies to: VD = eligible_time + (time_slice / weight). The eligible time is the task's current vruntime (the point at which it has received its fair share). A task with nice 0 has a weight of 1024 (the kernel uses a precomputed sched_prio_to_weight[] table where nice 0 maps to 1024, nice -20 maps to 88,761, and nice 19 maps to 15). The base time slice defaults to 3 ms on machines with 4 or more cores (it was quadrupled from the original 0.75 ms during EEVDF's initial tuning after early benchmarks showed regressions against CFS). Tasks can override this default via sched_setattr() by passing a desired slice in the sched_runtime field, with the kernel accepting values from 100 microseconds to 100 milliseconds. A shorter requested slice produces an earlier deadline and more frequent scheduling at the cost of more context switches; this is how latency-sensitive audio or UI threads get prioritized without needing real-time scheduling classes.
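The effect of weight on the deadline can be checked with shell arithmetic, using the weights quoted above (nice 0 = 1024, nice 19 = 15) and the 3 ms default slice. The offset is the slice scaled into virtual time, per the VD formula:

```shell
slice=3000000    # 3 ms default slice, in nanoseconds
w0=1024          # weight at nice 0
w19=15           # weight at nice 19

# Deadline offset beyond the eligible time: slice scaled by 1024/weight
echo "nice 0:  $(( slice * 1024 / w0 )) ns"     # 3000000 ns  = 3 ms
echo "nice 19: $(( slice * 1024 / w19 )) ns"    # 204800000 ns = 204.8 ms
```

A nice 19 task's virtual deadline lands roughly 68 times further out than a nice 0 task's, which is exactly why your editor preempts the background compilation.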

The Filesystem: Everything Is a File

One of Linux's fundamental design principles, inherited directly from Unix, is that everything is a file. Your hard drive is a file (/dev/sda). Your keyboard is a file (/dev/input/event0). Running processes expose themselves as files (/proc/PID/). Hardware settings are files (/sys/). This is not a convenience abstraction. It is a design philosophy with deep implications.

A common source of confusion

The phrase "everything is a file" is a simplification that generates persistent debate online. Critics correctly point out that not everything in Linux is literally a file in the filesystem: network interfaces like eth0 have no /dev entry, sockets do not appear as named files unless you specifically create Unix domain sockets, and some kernel interfaces require ioctl() calls that go beyond simple read/write semantics. Torvalds himself has clarified the principle on LKML: the point is not that everything has a filename, but that everything is a file descriptor -- an integer handle that supports a uniform interface of open/read/write/close operations. Sockets and pipes are file descriptors even though they do not have names in /dev. The more accurate (but less catchy) version of the principle is "everything is a file descriptor," and once you internalize that, the design makes considerably more sense.

The Filesystem Hierarchy Standard (FHS), version 3.0, was originally released on June 3, 2015 by the Linux Foundation. On November 6, 2025, FreeDesktop.org adopted and republished the specification, assuming ongoing maintenance. In the FHS, all files and directories appear under the root directory /, even if they are stored on different physical or virtual devices. Understanding this hierarchy is understanding how Linux thinks about resources.

Here is what is happening in each major directory:

$ tree / --dirsfirst -L 2 (annotated)

/
├── bin/
│   ├── ls              -- list directory contents
│   ├── cat             -- concatenate and print files
│   ├── bash            -- the Bourne Again Shell
│   └── grep            -- search text with patterns
├── etc/
│   ├── passwd          -- user account database
│   ├── shadow          -- encrypted password hashes (root only)
│   ├── fstab           -- filesystem mount table
│   └── ssh/sshd_config -- daemon configuration
├── proc/
│   ├── cpuinfo         -- CPU specs as the kernel sees them
│   ├── meminfo         -- memory statistics
│   ├── self/           -- symlink to current process PID dir
│   └── sys/net/ipv4/ip_forward -- enable/disable IP forwarding
├── dev/
│   ├── sda             -- first SATA/SCSI block device
│   ├── null            -- bit bucket: writes vanish, reads return EOF
│   ├── urandom         -- cryptographic random byte source
│   └── tty             -- current controlling terminal
├── var/log/
│   ├── syslog          -- main system log
│   └── auth.log        -- authentication events
├── home/
│   └── alice/          -- files, configs, and dotfiles
├── boot/               -- kernel images (vmlinuz) and bootloader
├── tmp/                -- temporary files, cleared on reboot (sticky bit set)
├── opt/                -- optional third-party software
└── sys/                -- virtual: device hierarchy (sysfs)
Linux filesystem directories and their purposes
Directory Purpose
/ The root. Everything hangs from here. There is no C: drive or D: drive. There is one tree.
/bin, /usr/bin Compiled executables any user can run. On modern distros these are often the same directory (with /bin symlinked to /usr/bin).
/sbin, /usr/sbin System binaries requiring root for most operations: iptables, fdisk, reboot, fsck.
/etc Configuration. Every service stores its config here. SSH: /etc/ssh/sshd_config. DNS resolvers: /etc/resolv.conf. Small and easy to back up.
/var Variable data. Logs in /var/log, mail spools in /var/mail, package databases in /var/lib/dpkg or /var/lib/rpm.
/proc Virtual filesystem (not on disk). CPU info, memory stats, running process data -- all generated on the fly by the kernel.
/sys Virtual filesystem (sysfs). Structured device hierarchies mirroring actual bus topology. You can dim your screen by writing to /sys/class/backlight/.
/dev Device files. /dev/sda = first SATA disk. /dev/null = bit bucket. /dev/urandom = cryptographic random bytes.
/home User directories. ~ is shorthand for yours. Often on its own partition so filling it does not destabilize the system.
/boot Bootloader and kernel images. vmlinuz is the compressed kernel binary. initramfs.img loads drivers needed to access the root filesystem.
/tmp Temporary files, cleared on reboot. /var/tmp persists across reboots.
/opt Optional third-party software. Packages that do not follow the /usr/bin pattern install here to stay self-contained.

Inodes: The Metadata That Makes Files Work

Every file you create is two things: the data itself, stored in data blocks on disk, and a metadata structure called an inode. When you look at a file, you are looking at a directory entry that maps a filename to an inode number. The inode contains everything about the file except its name and its data: size, permissions, owner, group, link count, and three classic timestamps -- atime (last access), mtime (last data modification), and ctime (last inode change, which is often confused with "creation time" but is not -- it updates whenever permissions, ownership, or link count change). ext4 adds a fourth timestamp, crtime, which is the actual creation time, sometimes called birth time. The inode also contains pointers to the data blocks where the file's content lives.

On ext4 (the common Linux filesystem), small files can have their data stored directly in the inode itself when the inline_data feature is enabled, while larger files use an extent tree -- a hierarchical structure of block pointers that maps logical file offsets to physical disk locations.

Reserved Inodes: What the Numbers Mean

There is no inode 0 -- the kernel uses zero as a sentinel value meaning "no inode" (directory entries with inode 0 are treated as deleted). On ext4, inodes 1 through 10 are reserved by the filesystem for internal use. Inode 1 is the bad blocks inode, which tracks defective disk sectors. Inode 2 is always the root directory (/) -- this is hardcoded and is how the kernel finds the root of the filesystem after mounting. Inode 3 is the ACL index, inode 4 is the ACL data, inode 5 is the boot loader inode, inode 6 is the undelete directory, inode 7 is the resize inode (used internally by resize2fs to grow the filesystem online), inode 8 is the journal, and inodes 9-10 are reserved for future use. Inode 11 is typically lost+found. The first user-created file on a fresh ext4 filesystem will receive inode 12. Virtual filesystems like /proc and /sys use a different scheme: because they have no real disk backing, the kernel assigns synthetic inode numbers generated on the fly, so the numbers ls -i reports there say nothing about disk layout. You can verify the reserved inodes with stat / (inode 2) and sudo debugfs -R 'stat <7>' /dev/sda1 to inspect the resize inode.

This architecture explains a number of things that confuse new users:

Hard links are not copies. When you run ln file1 file2, you are creating a second directory entry that points to the same inode. Both file1 and file2 are names for the same data. Deleting one does not delete the data -- it only removes that directory entry and decrements the inode's link count. The data is only freed when the link count reaches zero and no processes have the file open. This is why rm is technically "unlink" -- it removes a directory entry, not necessarily a file.

Symbolic links are different: they are their own files, containing a path to another file as their data. If you delete the target, the symlink breaks.
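A quick demonstration of both behaviors, in a scratch directory:

```shell
cd "$(mktemp -d)"
echo hello > file1
ln file1 file2            # hard link: a second name for the same inode
ln -s file1 file3         # symlink: its own inode whose data is a path
ls -li file1 file2 file3  # file1 and file2 share an inode number

stat -c '%h' file1        # link count: 2
rm file1                  # "unlink": removes one name, decrements the count
cat file2                 # data survives: prints "hello"
cat file3                 # symlink now dangles: "No such file or directory"
```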

$ ls -i filename
$ stat filename

You can see the inode number of any file with ls -i. You can examine the inode directly with stat filename, which shows the inode number, size in bytes, blocks allocated, the three timestamps (access, modify, change), and the link count. On ext4, stat also shows a "Birth" field for the creation time.

Permissions: The Security Model You Need to Understand

Linux permissions exist at the inode level. Every file has an owner (a user), a group owner, and a set of permission bits that determine what the owner, group members, and everyone else can do.

The permission bits are a 12-bit field. Nine of those bits are the familiar rwx-rwx-rwx -- three bits each for read, write, and execute, for the owner, group, and others respectively.

When you see -rwxr-xr--, that means: regular file, owner can read/write/execute, group can read/execute, others can only read. The octal representation is 754 -- seven (4+2+1) for owner, five (4+1) for group, four (4) for others.
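A quick round trip between octal and symbolic modes, on a scratch file:

```shell
d=$(mktemp -d)
touch "$d/demo"
chmod 754 "$d/demo"
stat -c '%A %a' "$d/demo"      # -rwxr-xr-- 754

chmod u=rw,g=r,o= "$d/demo"    # symbolic spelling of 640
stat -c '%A %a' "$d/demo"      # -rw-r----- 640
rm -r "$d"
```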

SUID, SGID, and the Sticky Bit

The three bits beyond the standard rwx are where the model becomes genuinely interesting:

The SUID (Set User ID) bit on an executable means it runs as its owner, not as the user executing it. passwd -- the program that changes your password -- has SUID set and is owned by root. This is how an unprivileged user can change their own password in /etc/shadow, which root owns and no one else can write. passwd runs as root, checks that you are trying to change your own password, and makes the change on your behalf. This is a controlled privilege escalation by design.

The SGID (Set Group ID) bit works similarly but for groups. On an executable, it runs with the group privileges of the file's group owner. On a directory, SGID is even more useful: any file created inside a SGID directory inherits the directory's group ownership rather than the creating user's primary group. This is how shared project directories work -- set the directory to SGID and the project group, and every file anyone creates inside it will automatically belong to the project group.

The sticky bit on a directory means that even if a directory is world-writable, users can only delete their own files within it. /tmp has the sticky bit set (visible as the t in drwxrwxrwt). This is why you can write files to /tmp but cannot delete other users' files there.
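The special bits occupy a fourth, leading octal digit: SUID = 4, SGID = 2, sticky = 1. A sketch on a scratch directory (the /usr/bin/passwd path assumes a standard distro layout):

```shell
# SUID appears as the 's' in the owner triplet: -rwsr-xr-x
ls -l /usr/bin/passwd

# Sticky bit is the trailing 't' on /tmp: drwxrwxrwt, octal 1777
ls -ld /tmp

# Set SGID (2), then SGID + sticky (3), on a scratch directory
d=$(mktemp -d)
chmod 2775 "$d"; stat -c '%A %a' "$d"   # drwxrwsr-x 2775
chmod 3775 "$d"; stat -c '%A %a' "$d"   # drwxrwsr-t 3775
rmdir "$d"
```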

Warning

Linux access control decisions are made at operation time, based on effective identity and mode bits on the target inode. When you execute chmod 644 file.txt, you are telling the kernel to set the permission bits in that file's inode to 110-100-100 in binary -- rw-r--r-- in symbolic notation. The chmod command itself is just a syscall wrapper around the kernel's chmod() system call, which modifies the inode directly.

Permission bit arithmetic: read = 4, write = 2, execute = 1, summed separately for each of the three triplets (owner, group, others). Octal 644 gives the owner rw- (4+2) and both group and others r-- (4): symbolic rw-r--r--, applied with chmod 644 filename.

Understanding /proc/sys gives you another layer of the permission model: kernel parameters. cat /proc/sys/net/ipv4/ip_forward tells you whether your system is forwarding IP packets. Writing 1 to that file enables routing. These are not configuration files in the traditional sense. They are direct windows into running kernel variables. The sysctl command (sysctl -w net.ipv4.ip_forward=1) is just a convenience wrapper for reading and writing to /proc/sys/.
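Reading the same kernel variable through both interfaces (read-only here; writing requires root):

```shell
# Two views of the same kernel variable
cat /proc/sys/net/ipv4/ip_forward
sysctl -n net.ipv4.ip_forward

# Another live kernel value: the running kernel's release string
cat /proc/sys/kernel/osrelease
uname -r    # same value, obtained via the uname() system call
```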

Signals: How Processes Talk to Each Other (and Die)

Every Linux user has run kill PID. The name is misleading. kill sends a signal to a process. By default, that signal is SIGTERM (signal 15), which politely asks the process to terminate. A well-written program catches SIGTERM and shuts down cleanly. A program that ignores it can be sent SIGKILL (signal 9) with kill -9 PID, which cannot be caught, blocked, or ignored -- the kernel forcibly removes the process from the scheduler and cleans up its resources.
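The difference is easy to demonstrate with bash's trap builtin. A sketch (the trailing ':' keeps bash from exec-ing sleep directly, which would discard the trap; 137 = 128 + 9 is the shell's convention for "killed by signal 9"):

```shell
# A child that ignores SIGTERM
bash -c 'trap "" TERM; sleep 30; :' &
pid=$!
sleep 0.2

kill -TERM "$pid"          # politely ask; this child ignores it
sleep 0.2
kill -0 "$pid" && echo "still alive after SIGTERM"

kill -KILL "$pid"          # cannot be caught, blocked, or ignored
wait "$pid"
echo "exit status: $?"     # 137 = 128 + 9
```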

When SIGKILL Does Not Work

There are exactly two situations where kill -9 has no effect, and they confuse even experienced administrators. The first is a zombie process (state Z in ps): a zombie has already terminated but its parent has not called wait() to collect its exit status, so the kernel retains its PID and task_struct entry. You cannot kill what is already dead -- the fix is to kill or signal the parent, which forces cleanup. The second is a process in uninterruptible sleep (state D): this means the process is blocked in kernel space waiting for I/O to complete (typically disk or network), and the kernel will not deliver any signals -- including SIGKILL -- until the I/O operation finishes or times out. This is by design: interrupting a process mid-write to a disk could corrupt the filesystem. The classic cause is a hung NFS mount or a failing storage device. If you see a D-state process that will not die, your problem is the hardware or the driver, not the process. You can inspect what kernel function a D-state process is stuck in by reading /proc/PID/wchan, or trigger a full kernel stack dump of all blocked processes with echo w > /proc/sysrq-trigger (as root). Since kernel 2.6.25, a third state called TASK_KILLABLE exists as a compromise: processes in this state are uninterruptible to normal signals but will respond to fatal ones like SIGKILL, which is now used widely for NFS and other filesystem operations to avoid the "unkillable process" problem.

Common Linux signals, their numbers, and behaviors

Signal    Number  Behavior
SIGHUP    1       Originally "terminal hung up." Now conventionally tells daemons to re-read their configuration without restarting.
SIGINT    2       What Ctrl+C sends. Requests an interrupt.
SIGKILL   9       Cannot be caught, blocked, or ignored. The kernel forcibly removes the process.
SIGSEGV   11      Segmentation fault -- the kernel sends this to a process that tried to access memory it does not own.
SIGTERM   15      Default signal sent by kill. Politely asks the process to terminate.
SIGCONT   18      Resumes a stopped process.
SIGSTOP   19      Suspends a process. Cannot be caught or ignored.

You can see what signals a process has registered handlers for in /proc/PID/status. The fields SigBlk, SigIgn, and SigCgt are bitmasks representing blocked, ignored, and caught signals respectively. (A note on signal numbers: SIGHUP through SIGTERM have the same numbers across all Linux architectures, but SIGCONT and SIGSTOP numbers vary by architecture. The numbers in the table above are for x86 and ARM, which is what you will encounter on the vast majority of hardware. On MIPS, for example, SIGSTOP is 23 rather than 19. When in doubt, check kill -l on your system.)
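For a live example, dump your shell's own masks (each value is a 64-bit hex bitmask: bit N-1 set means signal N):

```shell
# Which signals the current shell blocks, ignores, and catches.
grep -E '^Sig(Blk|Ign|Cgt)' /proc/$$/status
```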

?
Think Like the Kernel
A sysadmin runs kill -9 3847 but the process will not die. Running ps aux | grep 3847 shows the process in state D. What is happening, and what should the sysadmin investigate?
A The process has caught SIGKILL with a custom signal handler and is ignoring it. The sysadmin should try SIGTERM instead.
B The process is a zombie. The sysadmin needs to kill the parent process so init can reap the zombie's exit status.
C The process is in uninterruptible sleep, blocked on a kernel-space I/O operation. No signal -- including SIGKILL -- will be delivered until the I/O completes. The sysadmin should investigate the storage subsystem or NFS mount.
D The process has elevated privileges and SIGKILL requires root. The sysadmin should use sudo kill -9 3847.
$ explanation
State D means uninterruptible sleep -- the process is waiting for a kernel I/O operation (typically disk or NFS). The kernel deliberately does not deliver signals during this state because interrupting a mid-flight disk write could corrupt the filesystem. SIGKILL cannot be caught or ignored by user-space code (ruling out option A), the process is not a zombie (state Z, ruling out B), and signal delivery is not a privilege issue (ruling out D). The sysadmin should check /proc/3847/wchan to see which kernel function the process is stuck in, and investigate the underlying storage device or network mount. Since kernel 2.6.25, the TASK_KILLABLE state exists as a compromise for some I/O paths, but many drivers still use the traditional uninterruptible state.

The Shell Is Not Linux

This deserves its own section because conflating the two is a frequent conceptual error made by beginners.

The shell -- bash, zsh, fish, dash -- is a user-space program. It is just an application. It reads your input, interprets it, and calls programs. When you type ls -la, the shell parses that into the program name ls and the arguments -la, looks up ls in the directories listed in your PATH environment variable, calls the fork() system call to create a copy of itself, then calls exec() to replace that copy with the ls program. ls runs, writes output to standard output (which is connected to your terminal), and exits. The shell then displays the next prompt.

This fork-exec pattern is fundamental to how Unix-derived systems create processes. Every process except PID 1 was created by another process calling fork(), which duplicates the parent process, followed by exec(), which loads a new program into that duplicate. The child inherits the parent's file descriptors, environment variables, and signal handlers. This is how shell pipelines work: ls | grep foo creates two processes with the standard output of ls connected by a pipe to the standard input of grep. A kernel pipe is just an in-memory buffer with two ends: one process writes into one end while the other reads from the other.
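You can watch the fork happen from bash itself: $$ always reports the PID of the original shell (it is not re-expanded in subshells), while the bash-specific BASHPID variable reports the PID of the current process, so the two diverge inside ( ... ):

```shell
# Requires bash; BASHPID is a bash-specific variable.
echo "shell PID:    $$"
( echo "subshell PID: $BASHPID  (\$\$ still expands to $$)" )
```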

Why fork() Is Not as Expensive as It Sounds

The word "duplicates" in the previous paragraph is misleading if you take it literally. Modern Linux does not actually copy the parent's memory when fork() is called. Instead, it uses copy-on-write (COW): both parent and child initially share the same physical memory pages, with those pages marked read-only in both processes' page tables. Only when either process tries to write to a page does the kernel trap the page fault, allocate a new physical page, copy the content, and remap the writer's page table entry to the new copy. Since fork() is almost always followed immediately by exec() -- which discards the entire address space and loads a new program -- the parent's memory pages are never actually copied at all. This is why launching a new process on Linux is fast even from a shell with a large memory footprint: fork() only needs to duplicate the page table entries and a few kernel structures (roughly 50 microseconds on modern hardware), not the actual memory content.

The vfork() system call goes even further by not copying the page tables at all, but it requires that the child immediately call exec() or _exit() without touching any memory, which is why fork() with COW is the safer and more commonly used approach. The posix_spawn() function, reimplemented in glibc 2.24 to use clone() with CLONE_VM and CLONE_VFORK flags, combines fork and exec into a single operation and is measurably faster when spawning thousands of short-lived child processes.

Note

Understanding this also explains why export VARIABLE=value in a subshell does not affect the parent shell: environment variables are inherited by children, never passed back to parents. The fork creates a copy of the parent's environment, and changes to the copy do not propagate back.
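A quick demonstration from any POSIX shell (DEMO_VAR is just an illustrative name):

```shell
# The subshell created by ( ... ) is a forked child; its copy of
# the environment dies with it.
unset DEMO_VAR
( export DEMO_VAR="set in the child" )
echo "${DEMO_VAR:-unset in the parent}"   # prints: unset in the parent
```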

When you type cat /var/log/syslog | grep error | wc -l, how many new processes does the shell create, and how are they connected?

The shell creates three child processes -- one for each command in the pipeline. Before forking, the shell creates two kernel pipes (in-memory buffers with a read end and a write end). Process 1 (cat) has its stdout connected to the write end of pipe 1. Process 2 (grep) has its stdin connected to the read end of pipe 1 and its stdout connected to the write end of pipe 2. Process 3 (wc) has its stdin connected to the read end of pipe 2. All three processes run concurrently -- they are independently scheduled tasks. The shell fork()s three times, sets up file descriptor redirections with dup2(), then calls exec() in each child. The parent shell calls wait() on the last process in the pipeline and displays the prompt when it exits.
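That concurrency is easy to verify from the shell: two 2-second sleeps joined by a pipe complete in roughly 2 seconds, not 4, because both processes are scheduled simultaneously:

```shell
start=$(date +%s)
sleep 2 | sleep 2                  # both stages run at the same time
echo "elapsed: $(( $(date +%s) - start ))s"
```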

Package Management: Where Software Comes From

Every distribution has a package manager, and understanding what it is doing removes a lot of the mystery around software installation on Linux.

A package is a compressed archive containing compiled binaries, configuration files, documentation, and metadata. The metadata describes the package's name, version, dependencies (other packages it needs to function), and file lists. The package manager -- apt on Debian/Ubuntu, dnf on Fedora/RHEL, pacman on Arch -- downloads packages from repositories, checks cryptographic signatures to verify authenticity, resolves the dependency graph, and extracts files to the correct locations.

package installation
# What happens when you run apt install nginx:
$ sudo apt install nginx

# 1. Consults local repository metadata cache
# 2. Determines nginx version and its dependencies
# 3. Downloads the .deb package + all uninstalled dependencies
# 4. Verifies cryptographic signatures
# 5. Extracts files to their correct locations:
#      /usr/sbin/nginx
#      /etc/nginx/nginx.conf
#      /lib/systemd/system/nginx.service
# 6. Runs post-install scripts to configure the service

The /etc/apt/sources.list file (and files in /etc/apt/sources.list.d/) tells apt where its repositories are. These are HTTP servers hosting structured directories of packages and metadata files. The package manager downloads the metadata (containing package names, versions, and SHA256 checksums of package files), and this is what allows apt search to work without downloading every package.

Caution

This infrastructure is why "just download a binary from the internet" is less common on Linux than on Windows. The package manager is a curated, cryptographically verified software distribution network. When you add a third-party repository with add-apt-repository, you are extending trust to that party's signing key. The security of the entire package installation chain rests on the integrity of those keys.

Systemd and the Init System: Booting the User Space

When the kernel finishes initializing hardware, it hands control to PID 1. On the vast majority of modern Linux distributions, that is systemd. Systemd's job is to bring the entire user space to life -- starting all the daemons, mounting filesystems, configuring network interfaces, and managing services throughout the system's runtime.

A word on the systemd debate

Systemd is one of the more divisive topics in the Linux community. Its critics argue that it violates the Unix philosophy of "do one thing well" by absorbing functionality that was previously handled by separate tools -- logging (journald), network configuration (networkd), DNS resolution (resolved), and more. Its supporters argue that the old SysVinit approach was fragile, shell-script-dependent, and lacked reliable dependency management between services. Both sides have legitimate technical points. This guide covers systemd because it is the init system you will encounter on the overwhelming majority of production Linux systems, not because it is the only valid choice. If you use a distribution with runit, OpenRC, or another init system, the underlying concepts of PID 1, service management, and process supervision remain the same -- only the tools and configuration syntax differ.

Systemd uses unit files -- configuration files in /lib/systemd/system/ and /etc/systemd/system/ -- to describe services, mount points, devices, timers, and sockets. An nginx.service unit file tells systemd how to start, stop, and restart the nginx web server, what it depends on, whether it should restart on failure, what user it should run as, and how to control its resource usage via cgroups.
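As a concrete sketch, a minimal service unit might look like this -- myapp.service and its paths are hypothetical, but every directive shown is standard systemd syntax:

```ini
[Unit]
Description=Example application daemon
After=network.target

[Service]
ExecStart=/usr/local/bin/myapp --config /etc/myapp.conf
User=myapp
Restart=on-failure
# A cgroup memory limit, enforced by the kernel
MemoryMax=512M

[Install]
WantedBy=multi-user.target
```

Saved to /etc/systemd/system/myapp.service, it becomes manageable with systemctl daemon-reload followed by systemctl enable --now myapp.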

The systemctl command is the interface to systemd:

systemd service management
# Check service status, PID, memory, recent logs
$ systemctl status nginx

# Start, stop, restart a service
$ sudo systemctl start nginx
$ sudo systemctl stop nginx
$ sudo systemctl restart nginx

# Enable on boot (creates a symbolic link)
$ sudo systemctl enable nginx

# Query the system journal for nginx logs
$ journalctl -u nginx

# Show the cgroup hierarchy (service tree)
$ systemd-cgls

Cgroups -- control groups -- are a kernel mechanism that systemd uses extensively. Originally introduced in kernel 2.6.24 (January 2008) as cgroups v1, the mechanism was redesigned as cgroups v2 and merged in kernel 4.5 (March 2016). They allow the kernel to group processes and apply resource limits, accounting, and isolation to the group. A cgroup can be told "this group gets at most 2 CPU cores and 4GB of RAM." Modern Linux distributions use cgroups v2, which provides a unified hierarchy where each process belongs to exactly one cgroup -- unlike cgroups v1, which allowed multiple independent hierarchies that could conflict with each other.

When systemd starts a service, it creates a cgroup for it. This is what makes systemd-cgls so revealing: it shows the entire cgroup hierarchy, which corresponds to the service hierarchy. When you stop a service, systemd kills the entire cgroup -- preventing the service from spawning processes that outlive it. This is a significant improvement over older init systems like SysVinit, where a daemon could fork children that escaped cleanup.
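You can inspect cgroup membership without any systemd tooling -- every process's cgroup is visible in /proc (on a cgroups-v2 system the file is a single line beginning 0::):

```shell
# The cgroup of the current shell; under systemd this is typically a
# slice/scope path such as /user.slice/user-1000.slice/session-1.scope.
cat /proc/self/cgroup

# The cgroup systemd created for a service (nginx is just an example;
# the command fails harmlessly if systemd or the service is absent).
systemctl show -p ControlGroup nginx 2>/dev/null || true
```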

?
Think Like the Kernel
A web service running under systemd crashes and restarts repeatedly. After 5 rapid restarts, systemd stops restarting it and marks it as failed. You fix the underlying bug and want to restart the service. Running sudo systemctl start myservice returns an error. What needs to happen first?
A Reboot the system. Systemd's failure state is only cleared on restart.
B Run sudo systemctl reset-failed myservice to clear the failure counter, then start the service.
C Edit the unit file to set Restart=always and run systemctl daemon-reload.
D Manually launch the service binary outside of systemd and then register it with systemctl enable.
$ explanation
Systemd tracks restart attempts per service via a rate limiter (configured by StartLimitIntervalSec and StartLimitBurst in the unit file). Once the limit is hit, the service enters a failed state that blocks further start attempts. The reset-failed command clears this counter without rebooting (ruling out A). Option C would change the restart policy but would not clear the existing failure state. Option D would bypass systemd entirely, losing cgroup isolation, logging, and dependency management. The correct workflow is: fix the bug, reset-failed, then start.

How to Start Exploring the Linux Kernel from the Command Line

Step 1: Inspect your CPU and kernel through /proc

Run cat /proc/cpuinfo and read it. Each logical core is listed separately. model name is your processor. cpu MHz is its current frequency. flags is a list of CPU features the kernel has detected -- vmx means Intel VT-x (virtualization), avx2 means the processor supports AVX2 SIMD instructions, aes means hardware AES acceleration.
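A few one-liners for pulling specific facts out of /proc/cpuinfo (the model name and flags fields shown here are x86 conventions; ARM uses different field names):

```shell
grep -m1 'model name' /proc/cpuinfo     # processor model string
grep -c '^processor' /proc/cpuinfo      # number of logical cores
# Check for a specific feature flag, e.g. hardware AES:
grep -qw aes /proc/cpuinfo && echo "hardware AES supported"
```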

Step 2: Examine your own process through /proc/self

Run ls -ld /proc/self and notice that /proc/self is a symlink that resolves to the PID directory of whichever process reads it. That means ls sees its own PID there, not your shell's; to inspect your shell, use /proc/$$, which the shell expands to its own PID before forking ls. The fd/ subdirectory lists open file descriptors: 0 is stdin, 1 is stdout, 2 is stderr.
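Because /proc/self resolves to whichever process reads it, a child can inspect the descriptors it inherited from the shell. Shell redirections open descriptors before the fork, so they show up in the child (fd 3 and the /tmp/extra path are illustrative):

```shell
# The shell opens fd 3 on /tmp/extra, forks, and execs ls; the child
# inherits fd 3 and lists it among its own descriptors.
ls -l /proc/self/fd 3> /tmp/extra
```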

Step 3: Trace system calls with strace

Run strace ls /tmp and watch what system calls ls makes. You will see it calling openat() to open the directory, getdents64() to read directory entries, fstat() to get inode information on each entry, and write() to produce output. Strace shows you the interface between user space and kernel space.

Step 4: Read the kernel ring buffer and explore tunable parameters

Run dmesg | head -50 to see the kernel ring buffer -- the log messages from the kernel itself, starting from boot. You will see hardware detection, driver loading, filesystem mounting, and any kernel warnings. Then run find /proc/sys/net/ipv4 -type f | head -20 to see a fraction of the IPv4 networking parameters the kernel exposes for runtime tuning.

Linux rewards curiosity in a way that few systems do. The source is open. The documentation is thorough, if dense. The man pages alone are worth days of study. The kernel documentation at docs.kernel.org includes the original design documents written by the engineers who built each subsystem.

The CFS documentation itself is a case in point. Ingo Molnar's design document distills the scheduler's purpose into one core insight: that the goal is to simulate a perfectly fair CPU that gives each process its exact share. That single framing explains the majority of the scheduler's design. When you read documentation like that -- terse, precise, and written by the engineer who built the thing -- you start to see why the open-source model produces systems worth understanding at this level. As of kernel 6.12 (November 2024), the CFS code itself has been removed from the kernel in favor of EEVDF, but the design document remains in the kernel tree as a historical reference.

You spawned in as a bambi. The map is enormous and the environment is hostile to ignorance. But unlike many games, everything in this world is documented, and once you understand the rules -- processes, files, permissions, signals, the scheduler -- the environment stops fighting you. You start to see why it was built the way it was. And that understanding is the skill. Commands are trivial to look up. Knowing why a command works is what separates someone who uses Linux from someone who understands it.

You run sudo systemctl restart nginx. Trace what happens at every layer of the system -- from the shell, through the kernel, to the scheduler and back.

Shell layer: Your shell parses the command, resolves sudo via PATH, fork()s, and exec()s sudo. Sudo is SUID root, so it runs with root privileges. It authenticates you (checking /etc/sudoers), then fork()/exec()s systemctl.

Systemd layer: systemctl communicates with the systemd daemon (PID 1) over a D-Bus socket. Systemd receives the restart command, sends SIGTERM to the nginx cgroup (giving nginx a chance to shut down gracefully), waits for the configured timeout, then sends SIGKILL if the process has not exited.

Kernel layer: The SIGTERM signal is delivered by the kernel to the nginx master process. Nginx's signal handler triggers a graceful shutdown -- closing sockets (releasing file descriptors), flushing logs (write() syscalls to /var/log), and exiting. The kernel updates the task_struct state, reclaims memory, and releases the process's remaining references to its open files (dropping in-memory reference counts; the on-disk inode link counts are unaffected).

Restart: Systemd fork()/exec()s a new nginx process within a fresh cgroup. The scheduler adds the new task_struct to its run queue with a fresh vruntime. Nginx binds to port 80/443 (a privileged operation handled via kernel capabilities), and the service is live. Systemd logs the entire sequence to the journal.

Frequently Asked Questions

What is the difference between Linux and a Linux distribution?

Linux is a kernel -- the core software that mediates between hardware and everything else. A distribution (distro) is a complete operating system built around the Linux kernel, bundling a package manager, init system, shell, and user-space tools. When someone says they run Linux, they mean they run a distribution whose kernel is Linux.

What replaced the CFS scheduler in the Linux kernel?

As of Linux kernel version 6.6, released on October 29, 2023, the Completely Fair Scheduler (CFS) was replaced by the EEVDF (Earliest Eligible Virtual Deadline First) scheduler as the default. EEVDF is based on a 1995 research paper by Ion Stoica and Hussein Abdel-Wahab, and was implemented for the Linux kernel by Peter Zijlstra. It calculates a virtual deadline for each eligible task -- those with lag greater than or equal to zero -- and picks the one whose deadline comes earliest. By kernel 6.12 (November 2024) the legacy CFS code was fully removed, making EEVDF the sole fair scheduler.

Why does Linux use the everything is a file design principle?

The everything-is-a-file design, inherited from Unix, means hardware devices, running processes, and kernel parameters are all exposed through a uniform file descriptor interface. This allows administrators and programs to interact with wildly different resources using the same standard read/write system calls, making the system composable and scriptable without requiring specialized APIs for each resource type. The phrase is a simplification -- not literally everything has a filename in the filesystem (network interfaces and sockets being notable exceptions) -- but the core principle is that the kernel exposes resources through file descriptors, which support a consistent open/read/write/close interface.

What is PID 1 and why is it important?

PID 1 is the init process -- the first process the kernel spawns after booting. On the vast majority of modern Linux systems, PID 1 is systemd, though some distributions use alternatives like runit (Void Linux), sysvinit (Devuan), or OpenRC (Alpine). Regardless of which init system is running, every other process on the system is a descendant of PID 1. When a parent process dies before its children, those orphaned processes are re-parented to PID 1. Killing PID 1 would bring down the entire user space.

What is sched_ext and how does it work?

sched_ext is a kernel framework, merged in Linux 6.12 (November 2024), that allows developers to write custom CPU scheduling policies as eBPF programs loaded from user space. Before sched_ext, changing the scheduler required modifying the kernel source and recompiling. With sched_ext, you can prototype and deploy custom schedulers at runtime without kernel patches. It was developed by Meta engineer Tejun Heo and kernel developer David Vernet, and it sits alongside EEVDF as a scheduling class rather than replacing it. Example schedulers include scx_lavd for gaming interactivity and scx_bpfland for minimizing response latency.

Sources and References

Technical details in this guide are drawn from official documentation, original source material, and verified references.