Most engineers who work with Linux networking know the basics: ip addr, ip route, maybe iptables. But there's an entire layer of the Linux network stack that gets almost no attention -- the traffic control subsystem, accessed through the tc command. This is where you control not just what packets travel through your machine, but how: their priority, their rate, their burst characteristics, and their fate under congestion.
The reason tc is underused isn't that sysadmins don't need it. It's that the documentation is genuinely terrible. The man page for tc reads like a graduate-level networking paper with the explanations removed. This guide exists to fix that -- with real examples, accurate mental models, and the context needed to understand what you're doing instead of just cargo-culting commands from Stack Overflow.
The Layer Nobody Talks About
When a packet leaves your machine, it doesn't just fly out the NIC. It passes through a queuing discipline -- a qdisc -- that decides when and how it gets transmitted. By default, Linux uses a simple FIFO queue. Packets go in, packets come out, in order, with no priority and no rate control. This is fine for a single-user workstation. It is not fine for a server carrying mixed traffic -- interactive SSH sessions competing with bulk backup transfers, or VoIP packets sitting behind a video stream that's saturating your uplink.
The queuing discipline is Linux's answer to the fundamental question all networks face: what do you do when you have more data to send than bandwidth to send it with? The answer is not random. It is a deliberate policy choice that you make -- or, if you do nothing, the kernel makes for you.
-- Adapted from "Linux Advanced Routing and Traffic Control," Bert Hubert et al., lartc.org
That policy choice is what tc gives you control over. The LARTC guide (the canonical reference for this material, freely available at lartc.org) describes the Linux traffic control system in detail, but it was written in the early 2000s and the learning curve is steep. The fundamentals haven't changed -- the concepts map directly to what modern Linux kernels implement -- but the guide assumes familiarity with queuing theory that most practitioners simply don't have.
Here's the conceptual model you need: every network interface has exactly one root qdisc attached to it. That qdisc may be simple (a FIFO) or complex (a hierarchy of classes with their own sub-qdiscs). Packets flow in from the kernel, get classified, get queued according to the discipline's rules, and eventually get dequeued for transmission. The entire system is a tree, and tc is how you build and manage that tree.
Three Things That Always Confuse People (Cleared Up)
Before touching any commands, there are three conceptual muddles that trip up almost everyone who tries to learn tc.
1. Ingress vs. Egress
tc primarily controls egress traffic -- packets leaving your machine. This confuses people who want to limit download speed. You cannot directly shape incoming packets with the same tools you use for outgoing ones; the kernel receives packets before your qdisc gets a chance to intervene. To rate-limit ingress traffic, you either shape it on the sending end (if you control it), or use the ingress qdisc combined with tc-police to drop excess packets on arrival. A more sophisticated option is to redirect ingress traffic to an IFB (Intermediate Functional Block) virtual device using tc-mirred, which allows a full egress-style qdisc to be applied to what is logically incoming traffic -- but that adds meaningful complexity. Dropping is blunt. Real shaping -- which implies buffering -- only happens cleanly on egress, whether on the actual egress interface or via the IFB workaround.
This is why traffic shaping on a home router is so effective: the router shapes traffic as it goes out toward your WAN interface, which is the bottleneck. It doesn't need to shape ingress because the WAN is already the limiting factor.
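For completeness, the IFB redirection described above can be sketched as follows. This is a sketch, not a drop-in configuration: it needs root, the ifb kernel module, and the device name (ifb0) and rate (20mbit) are illustrative placeholders.

```shell
# Sketch: shape "ingress" on eth0 by redirecting it through an IFB device
modprobe ifb numifbs=1
ip link set dev ifb0 up

# Attach the special ingress qdisc to eth0
tc qdisc add dev eth0 handle ffff: ingress

# Redirect all incoming IP traffic to ifb0 ("match u32 0 0" matches everything)
tc filter add dev eth0 parent ffff: protocol ip u32 \
    match u32 0 0 \
    action mirred egress redirect dev ifb0

# A normal egress qdisc on ifb0 now shapes what is logically ingress traffic
tc qdisc add dev ifb0 root tbf rate 20mbit burst 32k latency 50ms
```

The cost of this trick is an extra device to monitor and one more redirection step on every incoming packet, which is why the simpler tc-police approach is often preferred when plain dropping is acceptable.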
2. Classful vs. Classless Qdiscs
Qdiscs come in two flavors. Classless qdiscs (like pfifo_fast, fq_codel, tbf) are standalone -- they have no sub-classes, no ability to subdivide traffic into categories. Classful qdiscs (like htb, hfsc, prio) can contain classes, and each class can have its own child qdisc. The power of classful qdiscs is that you can make policy decisions -- this traffic gets guaranteed bandwidth, that traffic only gets leftover capacity.
The mistake people make is attaching a classless qdisc when they need a classful one, or the reverse. If you want to prioritize SSH over BitTorrent, you need a classful qdisc so the kernel has somewhere to put the different priority levels. If you just want basic rate limiting of all traffic on an interface, a classless qdisc like tbf (token bucket filter) is simpler and more appropriate.
3. Handles Are Addresses, Not Magic Numbers
Every qdisc and class has a handle. These look like 1:0 or 1:1 or 10: and they terrify newcomers who assume there's some deep significance to the numbers. There isn't. A handle is just an address -- a way for one part of the tc configuration to reference another. The convention is that qdiscs use X:0 (where 0 is implied and often written as just X:), and classes use X:Y where X matches the parent qdisc's major number. You choose these numbers. Consistency matters; the actual values do not.
The kernel documentation for traffic control lives at Documentation/networking/ in the kernel source tree, and online at kernel.org/doc/html/latest/networking/. The tc-htb, tc-fq_codel, and tc-tbf man pages (from the iproute2 package) are the authoritative references for each qdisc type.
HTB: Hierarchical Token Bucket
HTB is the qdisc you'll use for the majority of real-world traffic shaping tasks. It was written by Martin Devera and introduced to the Linux kernel in 2002. The design goals were explicit: a classful scheduler that's easy to configure, that supports bandwidth guarantees, burst allowances, and borrowing of unused capacity between classes. HTB has shipped with mainline Linux since 2.4.20 and is configured through the iproute2 tc tool; Devera's own HTB user guide (now archived) remains a good reference, though the original domain is no longer active.
The mental model for HTB is a water analogy that actually works: each class is a pipe with a guaranteed flow rate (rate) and a maximum possible flow rate (ceil). When a class isn't using its guaranteed rate, the unused capacity flows up to the parent and can be borrowed by sibling classes that need more than their guarantee. No class ever exceeds its ceil. The burst parameter allows a class to temporarily exceed its rate for a short time (measured in bytes), which smooths out the choppy behavior you'd otherwise see with TCP slow start.
A Real HTB Setup: Prioritizing Interactive Traffic
Suppose you have a server with a 100 Mbit uplink. You want SSH and DNS to always have headroom, HTTP/HTTPS to get the bulk of capacity, and everything else (backups, bulk transfers) to use only whatever's left. Here's how that looks:
#!/bin/bash
# HTB traffic shaping -- 100Mbit uplink example
# Adjust IFACE and rates to match your environment

IFACE=eth0
UPLINK=100mbit

# Step 1: Remove any existing qdisc on the interface
tc qdisc del dev $IFACE root 2>/dev/null

# Step 2: Add the root HTB qdisc
# default 30 means unclassified traffic falls into class 1:30
tc qdisc add dev $IFACE root handle 1: htb default 30

# Step 3: Root class -- the total bandwidth ceiling
tc class add dev $IFACE parent 1: classid 1:1 htb rate $UPLINK ceil $UPLINK

# Step 4: Leaf classes
# 1:10 -- High priority: SSH (port 22), DNS (port 53)
# Guaranteed 20mbit, can burst to full uplink
tc class add dev $IFACE parent 1:1 classid 1:10 htb \
    rate 20mbit ceil 100mbit burst 32k prio 1

# 1:20 -- Normal: HTTP/HTTPS traffic
# Guaranteed 70mbit, can burst to full uplink
tc class add dev $IFACE parent 1:1 classid 1:20 htb \
    rate 70mbit ceil 100mbit burst 64k prio 2

# 1:30 -- Bulk/default: everything else
# Only 10mbit guaranteed, can borrow if others idle
tc class add dev $IFACE parent 1:1 classid 1:30 htb \
    rate 10mbit ceil 100mbit burst 16k prio 3

# Step 5: Attach leaf qdiscs to each class
# fq_codel handles AQM (active queue management) within each class
tc qdisc add dev $IFACE parent 1:10 handle 10: fq_codel
tc qdisc add dev $IFACE parent 1:20 handle 20: fq_codel
tc qdisc add dev $IFACE parent 1:30 handle 30: fq_codel

# Step 6: Filters -- classify traffic into classes
# u32 filters match on packet fields

# SSH traffic (sport or dport 22) -> high priority
tc filter add dev $IFACE parent 1: protocol ip u32 \
    match ip dport 22 0xffff flowid 1:10
tc filter add dev $IFACE parent 1: protocol ip u32 \
    match ip sport 22 0xffff flowid 1:10

# DNS (port 53, UDP) -> high priority
tc filter add dev $IFACE parent 1: protocol ip u32 \
    match ip dport 53 0xffff flowid 1:10

# HTTP/HTTPS -> normal
tc filter add dev $IFACE parent 1: protocol ip u32 \
    match ip dport 80 0xffff flowid 1:20
tc filter add dev $IFACE parent 1: protocol ip u32 \
    match ip dport 443 0xffff flowid 1:20

# Everything else falls to default class 1:30 automatically
echo "Traffic shaping applied to $IFACE"
This setup is doing several things at once that are worth understanding individually. The root class 1:1 is a ceiling -- no traffic can exceed 100mbit total. Each leaf class has a rate (the guarantee) and a ceil (the maximum including borrowed bandwidth). The prio values control which class gets to borrow first when bandwidth is available. Lower priority number means higher priority in HTB.
The leaf qdiscs attached to each class are fq_codel -- Fair Queue CoDel. This is important and frequently omitted in examples. Without a leaf qdisc, HTB uses a simple FIFO within each class, which means a single greedy flow can still monopolize a class. fq_codel provides per-flow fairness within each class and active queue management to reduce bufferbloat. This is the combination that actually delivers good latency characteristics.
The burst parameter in HTB specifies how many bytes a class can send above its rate in a single burst before the kernel starts enforcing the rate. Too small and TCP connections see choppy throughput. Too large and you defeat the purpose of rate limiting. A practical minimum is rate / (kernel HZ), where HZ is typically 250 or 1000 depending on your kernel configuration. For a 20mbit class (2,500 bytes/ms) at HZ=250 (4ms tick), the minimum is roughly 10KB. The 32k shown above provides comfortable headroom; tune upward if you see TCP stalls.
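That arithmetic is easy to sanity-check with a few lines of shell. The 20mbit rate and HZ=250 below are the example's values, not universal constants; substitute your own class rate and kernel CONFIG_HZ.

```shell
# Estimate a practical minimum HTB burst: bytes-per-second divided by HZ ticks
rate_mbit=20      # class rate from the example above
hz=250            # kernel timer frequency (CONFIG_HZ)

bytes_per_sec=$(( rate_mbit * 1000000 / 8 ))
min_burst=$(( bytes_per_sec / hz ))
echo "minimum burst ~= ${min_burst} bytes"
```

At 20 Mbit/s that comes out to 10,000 bytes, which is why the 32k used in the example is comfortable rather than tight.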
Filters: The Classification Engine
Filters are how packets get assigned to classes. Without filters, everything falls to the default class and your carefully constructed hierarchy does nothing. The two filter types you'll encounter most often are u32 (universal 32-bit match) and fw (firewall mark match). There's also flower (flow classifier, newer and more expressive) and matchall (match everything).
u32 Filters
u32 filters match on arbitrary fields in the packet header. The syntax is verbose but predictable. Each match clause looks like match [protocol] [field] [value] [mask]. The mask is applied before comparison -- 0xffff means match exactly, 0x0000 means match anything.
# Match by destination IP address (e.g., your mail server's IP)
tc filter add dev eth0 parent 1: protocol ip u32 \
    match ip dst 192.168.1.50/32 flowid 1:10

# Match TCP traffic to a bitmask-aligned port range (8000-8015)
# u32 masks work by bitwise AND: port & mask == value & mask
# mask 0xfff0 fixes the top 12 bits, leaving 4 bits free -- covers 16 ports
# A single u32 mask can only express ranges that are a power-of-2 in size
# AND start at a power-of-2 boundary (i.e., value & mask == value)
# 8000 = 0x1f40; 0x1f40 & 0xfff0 == 0x1f40 -- valid alignment
# Ranges that don't meet both conditions require multiple filter rules
tc filter add dev eth0 parent 1: protocol ip u32 \
    match ip dport 8000 0xfff0 flowid 1:20

# Match by DSCP value (Differentiated Services Code Point)
# EF (Expedited Forwarding): DSCP value = 46 (0x2e)
# The ToS byte places DSCP in the top 6 bits: 0x2e << 2 = 0xb8
# Mask 0xfc (11111100) covers those 6 DSCP bits, ignoring the 2 ECN bits
# The u32 match operates on the full ToS byte, not the raw DSCP value
tc filter add dev eth0 parent 1: protocol ip u32 \
    match ip tos 0xb8 0xfc flowid 1:10

# Match by source IP address -- limit a particular host's uploads
tc filter add dev eth0 parent 1: protocol ip u32 \
    match ip src 10.0.0.100/32 flowid 1:30

# IPv6 traffic -- note the protocol change
tc filter add dev eth0 parent 1: protocol ipv6 u32 \
    match ip6 dport 443 0xffff flowid 1:20
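The alignment condition for port-range masks can be checked with plain shell arithmetic before writing a filter. The ports and mask below are the example's values: 8000 satisfies start & mask == start, while a start like 8008 does not and would need multiple rules.

```shell
# u32 range expressibility check: a range start is valid only if start & mask == start
port=8000
mask=$(( 0xfff0 ))            # fixes the top 12 bits of the 16-bit port
masked=$(( port & mask ))
printf 'port 8000 -> masked 0x%x\n' "$masked"      # 0x1f40 == 8000: aligned

# A start that is not on the boundary fails the check
bad=8008
bad_masked=$(( bad & mask ))
printf 'port 8008 -> masked 0x%x\n' "$bad_masked"  # 0x1f40 != 8008: misaligned
```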
The Better Approach: DSCP + fw Marks
Here's a dot that rarely gets connected: the u32 filter approach works for simple rules, but it doesn't scale. When you have dozens of classification rules, u32 filters are evaluated linearly, and complex matches (multiple fields, multiple protocols) become hard to maintain. The cleaner architecture uses iptables to mark packets and then tc filter on those marks using the fw classifier.
This separates concerns cleanly: iptables handles the complex classification logic (it has a richer matching syntax, conntrack support, and stateful rules), and tc handles the queuing policy. The marks are just integers passed between the two subsystems via the kernel's skb->mark field.
# Part 1: iptables marks packets by type
# Mark 10 = high priority, Mark 20 = normal, Mark 30 = bulk

# High priority: SSH, DNS, established interactive connections
iptables -t mangle -A OUTPUT -p tcp --dport 22 -j MARK --set-mark 10
iptables -t mangle -A OUTPUT -p udp --dport 53 -j MARK --set-mark 10
iptables -t mangle -A OUTPUT -m conntrack --ctstate ESTABLISHED \
    -m length --length 0:512 -j MARK --set-mark 10

# Normal: HTTP/HTTPS
iptables -t mangle -A OUTPUT -p tcp --dport 80 -j MARK --set-mark 20
iptables -t mangle -A OUTPUT -p tcp --dport 443 -j MARK --set-mark 20

# Part 2: tc fw filters read the marks
# (Assumes HTB structure from previous example is already applied)
tc filter add dev eth0 parent 1: handle 10 fw flowid 1:10
tc filter add dev eth0 parent 1: handle 20 fw flowid 1:20
tc filter add dev eth0 parent 1: handle 30 fw flowid 1:30
The -m length --length 0:512 match in the iptables rule above is a particularly useful pattern: it promotes small, established-connection packets (ACKs, interactive keystrokes) to high priority. This is a reasonable heuristic because bulk TCP transfers generate large packets, while interactive sessions generate small ones. It's not perfect -- a file download will send small ACKs back that also match -- but it materially improves interactive latency on a busy link.
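One practical advantage of this split is that each half can be verified independently: iptables keeps per-rule packet and byte counters, and tc keeps per-filter statistics. A quick check looks like:

```shell
# Per-rule packet/byte counters show which mangle rules are matching
iptables -t mangle -L OUTPUT -v -n

# And the tc side confirms the fw filters are installed and seeing traffic
tc -s filter show dev eth0
```

If the iptables counters climb but the tc class counters don't, the marks are being set but the fw filters aren't wired to the right classes.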
fq_codel and Bufferbloat
One of the most consequential advances in Linux networking over the past fifteen years is the widespread adoption of CoDel (Controlled Delay) and its companion fq_codel. To understand why this matters, you have to understand bufferbloat.
Bufferbloat is the phenomenon where excessive buffering in network devices causes high latency and latency jitter. When your router, NIC, or kernel has a large buffer and that buffer fills up, packets queued at the back of the buffer experience enormous delays -- sometimes hundreds of milliseconds. This destroys interactive application performance even when throughput appears fine.
Bufferbloat is a problem that has been hiding in plain sight for years, primarily because it only manifests under load -- and when it manifests, people blame the application, the server, or the ISP rather than the buffer.
-- Paraphrased from Jim Gettys's writing on bufferbloat.net. Gettys named and publicized the bufferbloat problem beginning around 2010, prompting the research that produced CoDel and fq_codel.
CoDel, developed by Kathleen Nichols and Van Jacobson, attacks this problem by measuring the sojourn time of packets in the queue -- how long each packet waited -- rather than the queue length. If packets consistently wait more than a target delay (5ms by default, evaluated over a 100ms interval), CoDel signals congestion by dropping packets, triggering TCP's congestion control to back off. This keeps the queue short and latency bounded without starving throughput.
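CoDel's target delay, and the interval over which it is evaluated, are both exposed as tc parameters on fq_codel. A sketch with the upstream defaults (5ms target, 100ms interval) spelled out explicitly:

```shell
# fq_codel with CoDel's two knobs written out -- these are the defaults,
# so this behaves identically to a bare "fq_codel"
tc qdisc add dev eth0 root fq_codel target 5ms interval 100ms
```

The defaults are tuned for Internet-path RTTs; on very low-latency links (data center east-west traffic) lowering both values is a common adjustment.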
fq_codel adds per-flow fairness on top of this: it maintains separate queues for each flow (identified by source/destination IP and port), servicing them in a round-robin fashion. A single heavy flow cannot monopolize the queue and starve interactive flows. This is the property that makes it so effective as a leaf qdisc inside HTB classes.
fq_codel was merged into the mainline kernel in Linux 3.5, alongside CoDel itself. The net.core.default_qdisc sysctl, which controls which qdisc new interfaces receive, arrived in Linux 3.12. The actual default varies by distribution -- some have shipped with fq_codel as the default for years, while others still default to pfifo_fast. You can check and set the system default with:
# Check current default qdisc
sysctl net.core.default_qdisc

# Set fq_codel as the system default (runtime only; see below for persistence)
sysctl -w net.core.default_qdisc=fq_codel

# Also consider enabling BBR congestion control alongside fq_codel
sysctl -w net.ipv4.tcp_congestion_control=bbr

# To persist across reboots, add to /etc/sysctl.d/99-network.conf:
#   net.core.default_qdisc = fq_codel
#   net.ipv4.tcp_congestion_control = bbr
The combination of fq_codel as qdisc and BBR (Bottleneck Bandwidth and Round-trip propagation time) as the TCP congestion control algorithm is one of those configuration changes that pays immediate dividends on busy servers. BBR, developed at Google and merged into Linux 4.9, estimates available bandwidth and RTT directly rather than inferring congestion from packet loss, which allows it to maintain higher throughput and lower latency simultaneously. The two work well together because BBR doesn't cause the aggressive packet loss that CoDel would then have to manage.
TBF: When You Just Need Rate Limiting
Not every situation calls for the full HTB hierarchy. Sometimes you simply want to cap the rate of traffic on an interface or to a specific destination. For that, the Token Bucket Filter (TBF) is the right tool -- simpler, well-understood, and effective.
The token bucket model is this: imagine a bucket that fills with tokens at a fixed rate (your desired rate). Each byte of traffic consumes one token. When the bucket is full, excess tokens overflow and are discarded. If the bucket is empty and traffic arrives, the traffic must wait (or be dropped, depending on configuration). The burst parameter is the bucket size -- the amount of data that can be sent in a single burst before the rate limit kicks in.
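The model is simple enough to simulate in a few lines of shell. This toy version uses arbitrary units (not real tc parameters): a bucket of capacity 40 refilled with 10 tokens per tick, fed a sequence of per-tick demands.

```shell
# Toy token-bucket simulation: capacity 40, refill 10 tokens per tick
bucket=40; cap=40; rate=10; sent=0

for demand in 50 5 5 50 0 50; do
    # refill, capped at bucket capacity
    bucket=$(( bucket + rate ))
    [ "$bucket" -gt "$cap" ] && bucket=$cap
    # transmit as much of the demand as the bucket allows
    tx=$demand
    [ "$tx" -gt "$bucket" ] && tx=$bucket
    bucket=$(( bucket - tx ))
    sent=$(( sent + tx ))
done

echo "sent $sent units"
```

The first tick drains the full bucket at once -- that is the burst -- and every later tick is capped by the refill rate, which is exactly the behavior the rate and burst parameters express in tbf.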
# Limit all egress traffic on eth0 to 50mbit
# burst: 10KB bucket, latency: max time a packet can wait in queue
tc qdisc add dev eth0 root tbf rate 50mbit burst 10kb latency 50ms

# For a VPN or tunnel interface where you want hard rate limiting:
tc qdisc add dev tun0 root tbf rate 10mbit burst 4kb latency 70ms

# Remove a TBF qdisc:
tc qdisc del dev eth0 root

# View current qdiscs on all interfaces:
tc qdisc show

# View statistics (packet counts, drops, overlimits):
tc -s qdisc show dev eth0
The latency parameter in TBF is underappreciated. It defines the maximum time a packet is allowed to wait in the queue before being dropped. This prevents the queue from growing unboundedly when traffic exceeds the rate. If you set it too small, you'll see legitimate bursts getting dropped. If you set it too large, you reintroduce bufferbloat. For interactive traffic, 50-100ms is reasonable. For bulk transfers where you care more about throughput than latency, you can push it to 200ms or more.
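The latency value implies a maximum queue depth in bytes -- roughly rate times latency -- which is a useful sanity check when choosing it. Using the 50mbit / 50ms values from the example above:

```shell
# Implied tbf queue depth: bytes per second of rate, times seconds of allowed waiting
rate_mbit=50
latency_ms=50

queue_bytes=$(( rate_mbit * 1000000 / 8 * latency_ms / 1000 ))
echo "queue holds about ${queue_bytes} bytes"
```

Around 312KB of queued data at 50mbit -- if that sounds like more delay than your interactive traffic can tolerate, lower the latency value.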
Monitoring and Debugging tc
The single most useful debugging tool for tc is tc -s, which shows statistics for qdiscs and classes. The numbers you're looking for are dropped (packets dropped due to queue full) and overlimits (packets that exceeded the rate and had to wait or be dropped).
# Show all qdiscs with statistics
tc -s qdisc show dev eth0

# Show HTB class statistics (rates, byte counts, drops)
tc -s class show dev eth0

# Show filters
tc filter show dev eth0

# Watch class statistics in real time (refresh every 2 seconds)
watch -n2 'tc -s class show dev eth0'

# More detailed stats including fq_codel internal state
tc -s -d qdisc show dev eth0

# Verify a filter is matching correctly: add a counter action
tc filter add dev eth0 parent 1: protocol ip u32 \
    match ip dst 8.8.8.8/32 \
    action gact pass
tc -s filter show dev eth0
# The 'sent' counter on the filter shows how many packets matched
If you're seeing high drop counts on a class, it usually means one of three things: the rate is set too low for the actual traffic demand, the burst value is too small and bursts are being penalized, or traffic isn't being classified as expected and is falling to a lower-capacity default class. The filter counter trick above (adding a dummy filter with action gact pass to count matches) is useful for the third case -- it tells you definitively whether your classification rules are matching the traffic you think they are.
The ss -tin command (socket statistics, TCP info) shows congestion control state, RTT estimates, and retransmit counts per socket. When you're trying to understand whether your traffic shaping is helping interactive flows, check ss -tin before and after applying your qdisc configuration. RTT and retransmit counts will tell you immediately whether you're reducing latency or just moving the bottleneck.
Making tc Configuration Persist
tc rules do not survive a reboot. They are applied to the kernel's in-memory network state, and that state is reset when the interface goes down or the system restarts. This is the biggest operational headache with tc -- there's no single canonical place to put persistent configuration.
The three practical approaches, in order of increasing robustness:
1. NetworkManager dispatcher scripts. Place your tc script in /etc/NetworkManager/dispatcher.d/ and it will be called whenever an interface comes up. The script receives the interface name and event type as arguments.
#!/bin/bash
# Called by NetworkManager when interface state changes
# $1 = interface name, $2 = event (up, down, etc.)

IFACE="$1"
EVENT="$2"

if [ "$IFACE" = "eth0" ] && [ "$EVENT" = "up" ]; then
    # Apply your tc configuration here
    /usr/local/sbin/apply-tc-shaping.sh
fi
2. systemd service. Write a oneshot systemd service that runs your tc script after the network target. This is the cleanest approach on systemd-based systems because it integrates with dependency management and gives you clear logging via journald.
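A minimal unit of that shape might look like the following. The unit name, script path, and interface are placeholders, not a standard -- adjust them to your environment.

```ini
# /etc/systemd/system/tc-shaping.service (hypothetical name and paths)
[Unit]
Description=Apply tc traffic shaping
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/apply-tc-shaping.sh
# Removing the root qdisc on stop restores the kernel default
ExecStop=/sbin/tc qdisc del dev eth0 root

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now tc-shaping.service; failures and script output then show up in journalctl -u tc-shaping.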
3. ifupdown post-up hooks. On Debian/Ubuntu systems not using NetworkManager, add a post-up directive to /etc/network/interfaces that calls your script when the interface comes up. Simple and reliable for static network configurations.
Always test your tc configuration before adding it to any persistence mechanism. A broken qdisc configuration can silently drop all traffic on an interface, potentially locking you out of a remote server. Test by applying the configuration on a non-critical interface first, or by having a second SSH session open before you modify a production interface. You can always remove a qdisc entirely with tc qdisc del dev eth0 root, which restores the kernel's default behavior.
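A common safety pattern when shaping a remote box over SSH -- sketched here with a hypothetical script path, not a standard tool -- is to schedule the rollback before applying the change:

```shell
# Schedule an automatic rollback, then apply the new shaping.
# If the configuration cuts off your session, shaping disappears in 120s.
(sleep 120 && tc qdisc del dev eth0 root) &
rollback_pid=$!

/usr/local/sbin/apply-tc-shaping.sh   # hypothetical shaping script

# Still connected? Cancel the pending rollback and keep the configuration.
kill "$rollback_pid"
```

If the new qdisc kills your connection, you wait two minutes and the interface reverts to the default FIFO on its own.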
The Bigger Picture: Where tc Fits
There's a connection between tc and Quality of Service in data center networking that doesn't get discussed enough. When you deploy workloads on a server that shares a physical NIC -- which is essentially every virtualized environment -- the traffic shaping problem you're solving locally is the same problem that data center fabric engineers solve at the switch level with technologies like DCQCN (Data Center Quantized Congestion Notification) and PFC (Priority Flow Control). The concepts are isomorphic: classify traffic by type, assign priorities, manage queues to bound latency, prevent starvation of high-priority flows.
Linux's traffic control subsystem predates modern data center networking by over a decade, and many of the algorithms -- weighted fair queuing, token buckets, active queue management -- were borrowed from or inspired by research that also informs switch ASICs today. Understanding tc at the software level gives you an intuition for what's happening at the hardware level in your cloud provider's switching fabric, even if you can't configure it directly.
Similarly, the problems that fq_codel solves -- bufferbloat, per-flow fairness, latency under load -- are the same problems that motivated Google's development of QUIC and BBR. When you run tc qdisc add dev eth0 root fq_codel on a server, you are implementing, in software, a decades-long conversation in the networking research community about what happens when buffers get too big and why TCP's loss-based congestion control breaks down on modern high-speed links.
The fundamental tradeoff in queuing is always between throughput and latency. You can have very short queues (low latency, potentially wasted bandwidth) or very long queues (high throughput, high latency). AQM algorithms like CoDel represent the state of the art in navigating this tradeoff adaptively, rather than by static configuration.
-- Kathleen Nichols and Van Jacobson, "Controlling Queue Delay," ACM Queue, 2012. doi:10.1145/2208917.2209336
That paper -- Nichols and Jacobson's original CoDel proposal -- is worth reading in full. It's short, clear, and explains in precise terms why the naive approach (big buffers plus drop-tail) produces the latency problems that fq_codel is designed to fix. Understanding the theory makes the configuration choices in tc feel less arbitrary and more like informed decisions.
What to Actually Do
If you take nothing else from this guide, take these four things.
First: set your system default qdisc to fq_codel and your TCP congestion control to bbr. These are low-risk, high-reward changes that improve latency and throughput on almost any server with significant TCP traffic. They require no per-interface configuration.
Second: if you have mixed traffic types on a server -- anything where interactive latency matters alongside bulk transfers -- build an HTB hierarchy with fq_codel as the leaf qdisc inside each class. Use iptables marks for classification if your rules are complex; use u32 filters for simple port-based classification.
Third: monitor with tc -s class show dev eth0 and pay attention to the dropped counter. Drops tell you where your configuration is either too restrictive or where traffic is landing in the wrong class.
Fourth: read the LARTC guide at lartc.org when you need to go deeper. It covers HFSC (a more complex alternative to HTB with better support for latency guarantees), netem (network emulation, useful for testing), and the full taxonomy of classifiers and actions. The tc system has far more capability than this guide covers -- but the pieces described here are the ones you'll use in practice, and understanding them well is the foundation for everything else.