If you have been writing iptables rules for any length of time, you know the friction. Four separate binaries for four protocol families. Sequential rule evaluation that degrades as rulesets grow. No native way to group addresses or ports without bolting on ipset as a separate tool. And every rule change requires the kernel to reload the entire chain, creating a brief window where packets can slip through unfiltered.
nftables was built from scratch to solve these structural problems. Patrick McHardy of the Netfilter Core Team first presented the project at the Netfilter Workshop 2008. After the project stalled in alpha, Pablo Neira Ayuso of the University of Seville revived it, submitting the core pull request to the Linux kernel mainline tree on October 16, 2013. It was merged on January 19, 2014 with the release of Linux kernel 3.13. Every major distribution now ships it as the default packet filtering backend -- Debian since version 10, Ubuntu since 20.10, RHEL since version 8, and Fedora since version 32. The iptables command you run on a modern system is, in many cases, already iptables-nft: a compatibility shim that translates legacy syntax into nftables rules behind the scenes.
This article walks through what changed at the architectural level and how those changes affect the way you design, write, and maintain firewall rulesets.
The Problem with the Old Model
To understand why nftables exists, you need to understand where iptables hit its limits. The iptables framework was not one tool -- it was four: iptables for IPv4, ip6tables for IPv6, arptables for ARP, and ebtables for Ethernet bridging. Each had its own binary, its own kernel module, and its own syntax quirks. Administrators running dual-stack networks routinely duplicated entire rulesets across iptables and ip6tables, and any divergence between the two was a bug waiting to happen.
The kernel side mirrored this fragmentation. Protocol awareness was embedded so deeply into the iptables code that the filtering logic had to be replicated four times. Adding support for a new protocol family meant duplicating the entire engine. Extending functionality meant writing kernel modules -- match extensions and target extensions -- that could not be combined or reused across protocol boundaries.
Performance was the other pressure point. iptables evaluates rules sequentially, top to bottom. A chain with 500 rules means every packet walks through up to 500 comparisons before reaching a verdict. On a host with thousands of services -- a Kubernetes node, for instance -- this linear evaluation becomes a measurable bottleneck. Kubernetes benchmarks published by Dan Winship (Red Hat) in February 2025 showed that at 5,000 and 10,000 services, the average latency for nftables matched the best-case latency for iptables. At 30,000 services, the worst-case latency for nftables beat the best-case latency for iptables by a few microseconds. The nftables kube-proxy mode reached General Availability in Kubernetes 1.33 in April 2025.
In benchmarks published on kubernetes.io in February 2025, Dan Winship (Red Hat) reported that the nftables kube-proxy mode delivers the same performance as IPVS — and described it as having "the same performance as IPVS mode, without any of the downsides."
-- Dan Winship (Red Hat), kubernetes.io, February 2025 (paraphrased; fragment quoted directly)
The iptables command still works on modern systems, but it is now a compatibility layer. On Debian 11+, RHEL 9+, and Ubuntu 22.04+, the binary at /usr/sbin/iptables is iptables-nft, which converts iptables syntax into nftables kernel calls. Legacy iptables-legacy binaries exist but are no longer the default.
The Kernel Virtual Machine
The single biggest architectural change in nftables is the introduction of a virtual machine inside the kernel. Instead of hardcoding protocol-specific matching logic into the kernel the way iptables did, nftables runs a small, general-purpose bytecode interpreter in kernel space.
The nft userspace tool acts as both compiler and decompiler. When you write a rule, nft compiles it into VM bytecode and pushes it to the kernel through the Netlink API. When you list your ruleset, the kernel sends the bytecode back and nft decompiles it into the human-readable form you see on screen.
The VM's instruction set is intentionally minimal. It can extract data from the packet payload and from associated metadata such as the inbound interface or connection tracking state. It supports arithmetic, bitwise, and comparison operators. And critically, it can perform lookups against in-kernel data structures like sets and maps -- replacing what would be dozens of sequential comparisons in iptables with a single hash table lookup.
# Show the compiled bytecode for a rule
$ nft --debug=netlink add rule inet filter input ip daddr 10.1.2.3 counter accept
inet filter input
  [ meta load nfproto => reg 1 ]
  [ cmp eq reg 1 0x00000002 ]
  [ payload load 4b @ network header + 16 => reg 1 ]
  [ cmp eq reg 1 0x0302010a ]
  [ counter pkts 0 bytes 0 ]
  [ immediate reg 0 accept ]
That output reveals what the kernel is doing when it evaluates the rule. First it loads the network protocol family into a register and confirms it is IPv4. Then it extracts four bytes from the network header at offset 16 -- the destination address field -- and compares it against the target value. If both checks pass, it increments a counter and accepts the packet. Every nftables rule compiles down to a sequence of these register-based operations.
Walking through the bytecode instruction by instruction:

1. [ meta load nfproto => reg 1 ] loads the protocol family -- nfproto, the network protocol -- into register 1. Registers are temporary storage slots the VM uses to pass values between instructions. This is the first check because the same rule may be evaluated against both IPv4 and IPv6 packets in an inet table.

2. [ cmp eq reg 1 0x00000002 ] compares register 1 against 0x02, which is the numeric constant for NFPROTO_IPV4. If the comparison fails -- meaning this is an IPv6 packet -- evaluation stops here and the rule is skipped. The packet continues to the next rule in the chain.

3. [ payload load 4b @ network header + 16 => reg 1 ] extracts four bytes (4b) from the packet's network header starting at offset 16. In an IPv4 header, bytes 16-19 are the destination address field. The extracted value is stored in register 1, overwriting the previous family check result. This is the raw in-kernel equivalent of checking ip daddr.

4. [ cmp eq reg 1 0x0302010a ] compares the loaded address against the target value: 0x0302010a is 10.1.2.3 in little-endian hex. If the values do not match, evaluation stops and the rule is skipped. This single comparison replaces what would be an entire iptables -m iprange or similar construct.

5. [ counter pkts 0 bytes 0 ] -- the counter statement is itself a VM instruction. It increments two atomic counters stored alongside the rule object in kernel memory: a packet count and a byte count. These are what you see when you run nft list ruleset and observe match statistics. Counters are optional -- rules without them skip this instruction entirely, which is why nftables rules without explicit counters have slightly lower overhead than equivalent iptables rules, which always counted.

6. [ immediate reg 0 accept ] writes the final verdict into register 0. Loading accept into it tells the Netfilter hook to release the packet and allow it to continue through the networking stack. The kernel reads this verdict after the instruction sequence ends and acts on it. Other possible values include drop, jump (to another chain), goto, or return.

This design is what makes nftables extensible without kernel changes. New features can often be delivered by updating the userspace nft tool to emit new combinations of existing VM instructions, rather than requiring a kernel patch and a reboot.
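The odd-looking constant in step 4 is just byte order. A short Python sketch (purely illustrative, not part of nft) shows why 10.1.2.3 appears as 0x0302010a in the debug dump:

```python
import socket

def register_constant(dotted_quad: str) -> int:
    """Encode an IPv4 address the way it shows up in nft's netlink
    debug output: the raw network-order bytes of the address, read
    back as a 32-bit integer on a little-endian host."""
    raw = socket.inet_aton(dotted_quad)   # b'\x0a\x01\x02\x03' for 10.1.2.3
    return int.from_bytes(raw, "little")

print(hex(register_constant("10.1.2.3")))   # 0x302010a
```

The kernel never converts the address at match time; the compiler bakes the host-order constant into the cmp instruction so the comparison is a single integer equality test.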
The reason this works is structural: the kernel does not have a fixed understanding of what a "rule" means at the semantic level. It only understands register loads, comparisons, and lookups. Semantic meaning lives entirely in the bytecode sequence that the userspace compiler emits. A new expression type in nft syntax just needs a new code-generation path in the compiler -- as long as it can be expressed using the existing VM opcodes, the kernel does not need to change.
The instruction set also includes a dedicated lookup opcode that takes a value from a register and tests it against a named set or map in a single instruction. This is what makes set-based matching O(1): instead of generating a sequence of cmp instructions for each element, the compiler emits one lookup instruction referencing the hash table. The hash table lives in kernel memory and is updated independently of the rule that references it -- which is why you can add elements to a named set at runtime without touching the rule itself.
In their Netdev 0.1 paper, McHardy and Neira Ayuso described nftables as "a new packet classification framework based on lessons learnt" from the legacy tools — one that reuses the existing Netfilter building blocks rather than replacing them.
-- Patrick McHardy and Pablo Neira Ayuso, Netdev 0.1 Conference, February 2015 (paraphrased; fragment quoted directly)
Unified Address Families
In iptables, handling IPv4 and IPv6 traffic required maintaining two entirely separate rulesets. nftables introduces the inet address family, which processes both IPv4 and IPv6 packets in a single table with a single set of chains and rules.
# One table handles both IPv4 and IPv6
# nft add table inet firewall
# nft add chain inet firewall input { type filter hook input priority 0 \; policy drop \; }

# This single rule allows SSH over both IPv4 and IPv6
# nft add rule inet firewall input tcp dport 22 accept
That rule applies to both protocol versions without duplication. The kernel VM inserts an implicit check on the protocol family when needed -- as visible in the bytecode output above -- but the administrator never has to think about it. The reason a single rule can cover both families is that the VM has no hardcoded understanding of TCP or IP. It treats packet data as raw bytes at specific offsets. When the compiler encounters an ip daddr expression in an inet table, it emits a family-guard instruction first, then the payload extraction. An IPv6 packet fails the family guard at step one and moves to the next rule without the destination address ever being read. The protocol-specific logic is in the emitted bytecode, not in the kernel itself.
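The family-guard behavior can be sketched as a tiny rule evaluator. This is plain Python for illustration, not kernel code; the NFPROTO values are the kernel's actual enum constants, but the packet and rule representations are invented for the example:

```python
# NFPROTO constants from the kernel's netfilter headers
NFPROTO_IPV4, NFPROTO_IPV6 = 2, 10

def eval_rule(packet, rule):
    """Evaluate a rule as an ordered list of checks; the first
    failed check skips the rule, mirroring how an IPv6 packet
    fails the family guard before the payload is ever read."""
    for check in rule:
        if not check(packet):
            return "continue"   # rule skipped, move to next rule
    return "accept"

# Rough equivalent of: ip daddr 10.1.2.3 accept (in an inet table)
rule = [
    lambda p: p["family"] == NFPROTO_IPV4,   # implicit family guard
    lambda p: p["daddr"] == "10.1.2.3",      # payload comparison
]

v4 = {"family": NFPROTO_IPV4, "daddr": "10.1.2.3"}
v6 = {"family": NFPROTO_IPV6, "daddr": "::1"}
print(eval_rule(v4, rule))   # accept
print(eval_rule(v6, rule))   # continue -- fails the family guard
```

The IPv6 packet never reaches the second check, which is exactly why a rule with IPv4-specific selectors is harmless in an inet table.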
The full set of address families in nftables covers every layer that iptables required a separate binary for: ip (IPv4 only), ip6 (IPv6 only), inet (IPv4 + IPv6 combined), arp, bridge, and netdev (for ingress/egress filtering on a specific interface). The inet family is the recommended default for nearly all production use.
Use the inet family unless you have a specific reason to filter only one protocol version. It eliminates the entire class of bugs where IPv4 rules are updated but the matching IPv6 rules are forgotten.
The ingress Hook: Before the Network Stack
nftables introduced one hook that has no iptables equivalent at all: the ingress hook. It fires on a packet after the NIC driver delivers it but before the network stack processes it -- before any routing decision, before connection tracking, before prerouting. A chain attached at ingress runs earlier in the pipeline than any iptables rule ever could.
Ingress chains use the netdev address family and are bound to a specific physical interface rather than a protocol family. This is a deliberate design: the ingress hook is for per-device policy, not for general host firewalling. The reason for the per-device binding is that the ingress hook sits at the netdev layer, below the point where the kernel's networking stack has demultiplexed the packet into a protocol family. At ingress time, there is no IPv4 or IPv6 context yet -- the kernel only knows which device the frame arrived on. Binding to a device is therefore the only available scope at that point in the pipeline. Because it runs before the network stack has parsed the packet into protocol layers, you can filter on raw Layer 2 fields including Ethernet headers, VLAN tags, and even raw byte offsets in the packet payload.
# Create a netdev table bound to a specific interface
# nft add table netdev ingress_filter

# Attach a base chain to the ingress hook on eth0
# nft add chain netdev ingress_filter eth0_ingress { type filter hook ingress device eth0 priority 0 \; policy accept \; }

# Drop a specific IP before the network stack ever sees it
# nft add rule netdev ingress_filter eth0_ingress ip saddr 203.0.113.50 drop

# Drop invalid TCP flag combinations before conntrack processes them
# nft add rule netdev ingress_filter eth0_ingress tcp flags & (fin|syn|rst|ack) == fin|syn drop
The primary production use for the ingress hook is early DDoS mitigation. Packets dropped at ingress consume almost no CPU relative to packets that reach the IP layer -- conntrack is never consulted, no routing decisions are made, and the packet is discarded before the network stack has begun processing it. On a host receiving a volumetric flood, this is the difference between a system that saturates and one that continues serving legitimate traffic while shedding the attack at line rate.
Linux kernel 5.16 added the symmetric egress hook, which fires on outbound packets immediately before the NIC driver transmits them. The egress hook uses the same netdev family and per-interface syntax as ingress. It is useful for enforcing outbound rate limiting and for implementing QoS marks at the last possible point before a packet leaves the host.
The ingress hook fires before the network stack has parsed the packet, which means conntrack state is not available and routing decisions have not been made. If you need ct state matching or NAT, your rules belong at prerouting or later hooks, not at ingress. Use ingress specifically for stateless early filtering where speed is the priority and protocol context is not required.
When flowtables reference the ingress hook in their definition -- covered later in the flowtable section -- they are using this same pre-stack insertion point to intercept established flows as early as possible, enabling the fastpath to make forwarding decisions before the full Netfilter pipeline is entered.
No Default Tables or Chains
One of the more disorienting changes for administrators coming from iptables is that nftables ships with no predefined tables or chains. In iptables, the filter, nat, and mangle tables with their INPUT, FORWARD, and OUTPUT chains existed before you wrote a single rule. Every packet traversed those chains whether you wanted it to or not.
In nftables, you create everything yourself. If you add a table with no chains, no packets are affected. If you add a chain but do not attach it to a Netfilter hook, it sits idle until another chain explicitly jumps to it. This means nftables has zero overhead on packet paths you are not using.
The deeper reason for this design is that nftables was built for a multi-component environment, not for a single sysadmin controlling everything. In iptables, all components shared the same predefined tables and chains. Docker, Kubernetes, fail2ban, and libvirt all wrote to the same filter and nat tables, and the only way to keep their rules separate was naming conventions and insertion ordering -- fragile mechanisms that broke under load. By removing predefined tables entirely, nftables forces each component to declare its own namespace. Docker creates a docker table. Kubernetes creates a kube-proxy table. Your own rules live in whatever table you name. The kernel processes them all as equal participants in the Netfilter hook system, and no component's operations can block another's.
# Create a table -- it does nothing until you add chains
# nft add table inet firewall

# Base chain: attaches to the input hook
# nft add chain inet firewall input { type filter hook input priority 0 \; policy drop \; }

# Regular chain: not attached to any hook, used as a jump target
# nft add chain inet firewall tcp_allowed

# Rules in the base chain can jump to the regular chain
# nft add rule inet firewall input ip protocol tcp jump tcp_allowed
This clean-slate approach also eliminates the global lock contention that plagued iptables. In iptables, every component that modified firewall rules -- Docker, Kubernetes, fail2ban, libvirt -- competed for a single global lock on the shared tables. In nftables, each component can own its own table, and operations on one table do not block operations on another.
Sets and Maps: The Performance Model
If the kernel VM is the foundation of nftables, sets and maps are the data structures that make it fast. They replace the pattern of writing one rule per IP address or one rule per port with a single rule that references a collection of values, looked up through a hash table in O(1) time regardless of how many elements the collection contains.
The reason iptables is O(n) for address matching is structural: iptables chains are ordered linked lists of kernel structs. There is no indexing. The kernel walks each entry in sequence, comparing the packet against each rule's match criteria until it finds a hit or reaches the default policy. Adding more rules makes every packet that does not match early pay a higher cost. Sets in nftables are backed by hash tables. A hash table computes a position in a fixed array from the lookup key in constant time, regardless of how many elements the table contains. Whether the set has 10 elements or 100,000, the number of operations to determine membership is the same.
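The contrast can be sketched in a few lines of Python. This is an analogy, not kernel code: a list scan stands in for the iptables rule walk, and a Python set stands in for the nftables hash-backed named set:

```python
# Hypothetical blocklist of 254 addresses for illustration
blocklist = [f"203.0.113.{i}" for i in range(1, 255)]

def linear_match(addr, rules):
    """iptables-style: compare against every rule until one hits."""
    for rule_addr in rules:          # O(n) comparisons in the worst case
        if addr == rule_addr:
            return "drop"
    return "accept"

blocked_set = set(blocklist)          # nftables-style named set

def set_match(addr, s):
    """nftables-style: one hash lookup regardless of set size."""
    return "drop" if addr in s else "accept"   # O(1) on average

assert linear_match("203.0.113.50", blocklist) == "drop"
assert set_match("203.0.113.50", blocked_set) == "drop"
```

Both functions return the same verdicts; only the cost per non-matching packet differs, and the gap widens linearly as the blocklist grows.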
Named Sets
A named set is a collection of values of the same type -- IP addresses, port numbers, interface names, or other selectors -- that you define once and reference from any rule in the same table.
# Define a set of blocked IP addresses
# nft add set inet firewall blocked_hosts { type ipv4_addr \; }

# Populate it
# nft add element inet firewall blocked_hosts { 203.0.113.50, 198.51.100.77 }

# Reference it in a rule -- one rule handles the entire blocklist
# nft add rule inet firewall input ip saddr @blocked_hosts drop

# Add new addresses at any time without touching the rule
# nft add element inet firewall blocked_hosts { 192.0.2.44 }
With iptables, blocking 10,000 IP addresses meant either 10,000 individual rules (evaluated sequentially) or bolting on the ipset utility as a separate kernel module. In nftables, the same task is a single set with 10,000 elements and a single rule referencing it. The hash table lookup remains constant-time whether the set contains 10 elements or 10,000.
Sets also support the interval flag for CIDR ranges and the timeout flag for elements that expire automatically -- useful for temporary bans or rate-limiting windows.
Verdict Maps
Maps take the set concept further by associating each element with a value. A verdict map associates match criteria directly with an action, collapsing what would be many separate rules in iptables into a single lookup.
# Anonymous verdict map: port-to-action mapping inline
# nft add rule inet firewall input tcp dport vmap { 22 : accept, 80 : accept, 443 : accept, 23 : drop }

# Named verdict map: update without replacing the rule
# nft add map inet firewall port_policy { type inet_service : verdict \; }
# nft add element inet firewall port_policy { 22 : accept, 80 : accept, 443 : accept, 23 : drop }
# nft add rule inet firewall input tcp dport vmap @port_policy
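Semantically, a verdict map is a dictionary lookup with a fallback to the chain policy. A Python sketch of that behavior (illustrative only; port_policy mirrors the map defined above):

```python
# Mirror of the named verdict map from the example above
port_policy = {22: "accept", 80: "accept", 443: "accept", 23: "drop"}

def input_chain(dport, policy="drop"):
    """One lookup replaces four sequential port-comparison rules;
    unmapped ports fall through to the chain's default policy."""
    return port_policy.get(dport, policy)

print(input_chain(443))    # accept
print(input_chain(8080))   # drop -- falls through to the chain policy
```

Updating the named map at runtime with nft add element is the analog of mutating the dictionary: the rule that performs the lookup never changes.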
Data maps work on the same principle but return a value instead of a verdict. A common use case is DNAT port redirection -- mapping destination ports to backend server addresses in a single rule rather than a chain of separate DNAT rules:
# Three iptables DNAT rules become one nftables rule
# Use the inet family to handle both IPv4 and IPv6 in one table;
# the address family is named explicitly in the statement ("dnat ip to")
# nft add rule inet nat prerouting dnat ip to tcp dport map { 80 : 192.168.1.100, 8888 : 192.168.1.101, 9090 : 192.168.1.102 }

# Note: this maps destination ports to new IP addresses.
# The original destination port is preserved on each connection.
# To remap the port as well, use addr . port concatenation:
#   80 : 192.168.1.100 . 8080
The ability to express complex routing decisions as data lookups rather than chains of individual rules is one of the reasons nftables scales so much better at high rule counts.
Atomic Ruleset Updates
In iptables, modifying a ruleset was not an atomic operation. Adding or deleting a rule required the kernel to lock the table, make the change, and release the lock. During that window, other components waiting for the lock were blocked, and in high-churn environments -- think Kubernetes nodes where service endpoints change constantly -- this serialization created visible latency spikes. The reload path was worse: iptables-restore operated by fetching the entire existing ruleset from the kernel, applying the new rules to that in-memory copy, and then writing the complete blob back -- a cycle that could leave the kernel in a half-updated state if the process was interrupted, and that blocked all other Netfilter operations for the duration of the write.
nftables solves this with atomic transactions over Netlink. When you load a ruleset from a file with nft -f, the entire contents are submitted as a single Netlink batch. The kernel applies all changes at once: a packet being processed will see either the old ruleset in its entirety or the new one in its entirety, never a half-updated mix.
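The all-or-nothing semantics can be modeled in a few lines. This is a conceptual sketch in Python, not the kernel's transaction code: the batch is staged on a copy, and the live ruleset is only swapped in if every operation succeeds:

```python
def apply_batch(ruleset, batch):
    """Stage every operation on a copy; commit in one step only if
    nothing failed. A bad operation aborts the whole batch and the
    original ruleset is left untouched -- the nft -f behavior."""
    staged = list(ruleset)                 # work on a private copy
    for op, rule in batch:
        if op == "add":
            staged.append(rule)
        elif op == "flush":
            staged.clear()
        else:
            raise ValueError(f"bad op {op!r}")   # whole batch rejected
    ruleset[:] = staged                    # atomic commit
    return ruleset

rules = ["tcp dport 22 accept"]
try:
    # A flush followed by a broken statement: nothing is committed
    apply_batch(rules, [("flush", None), ("bogus", "x")])
except ValueError:
    pass
print(rules)   # ['tcp dport 22 accept'] -- old ruleset intact
```

Readers of the live ruleset never observe the staged copy, which is the property that prevents the half-updated windows the iptables reload path suffered from.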
#!/usr/sbin/nft -f

# Flush existing ruleset and replace atomically
flush ruleset

table inet firewall {
    set trusted_nets {
        type ipv4_addr
        flags interval
        elements = { 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 }
    }

    set allowed_ports {
        type inet_service
        elements = { 22, 80, 443 }
    }

    chain input {
        type filter hook input priority 0; policy drop;

        # Connection tracking
        ct state established,related accept
        ct state invalid drop

        # Loopback
        iifname "lo" accept

        # ICMP
        ip protocol icmp accept
        ip6 nexthdr icmpv6 accept

        # Allowed services
        tcp dport @allowed_ports accept

        # Trusted networks get full access
        ip saddr @trusted_nets accept

        # Log and reject everything else
        log prefix "nft-reject: " counter reject
    }

    chain forward {
        type filter hook forward priority 0; policy drop;
    }

    chain output {
        type filter hook output priority 0; policy accept;
    }
}
Loading this file with nft -f /etc/nftables.conf flushes the old ruleset and installs the new one in a single transaction. There is no window where the system is unprotected. Enable the nftables.service systemd unit to restore this configuration automatically at boot.
Persisting Rulesets Across Reboots
Loading a ruleset with nft -f applies it to the running kernel immediately, but it does not survive a reboot. On systemd-based distributions, the nftables.service unit handles persistence. The service reads /etc/nftables.conf at boot and loads it with nft -f. Enabling it is a single command:
# Enable the service and start it immediately
# systemctl enable --now nftables.service

# Reload the ruleset from /etc/nftables.conf without rebooting
# systemctl reload nftables.service

# Check the service loaded the ruleset without errors
# systemctl status nftables.service

# Verify what is loaded in the running kernel
# nft list ruleset
The default /etc/nftables.conf on most distributions is a minimal template. You write your complete ruleset into this file -- including flush ruleset at the top -- and the service loads it atomically. Because the load is atomic, a failed reload (due to a syntax error in the file) leaves the current running ruleset intact. The service reports the failure through systemctl status and the kernel retains the previous ruleset unchanged.
The distinction between systemctl reload and systemctl restart matters on a remote host. reload sends a signal to the running service process, which re-executes nft -f /etc/nftables.conf without stopping the service. Rules are replaced atomically and SSH connectivity is maintained throughout. restart stops the service -- which flushes all rules, including the rules that permit your SSH connection -- before restarting it. On a remote host, that stop phase is a brief window where all traffic is unfiltered. Use reload for live rule changes when connected over SSH.
On Debian and Ubuntu, the service unit file at /lib/systemd/system/nftables.service passes /etc/nftables.conf to nft -f. RHEL and Fedora use /etc/sysconfig/nftables.conf as the primary file with an include mechanism pointing to /etc/nftables/ for per-service fragment files. You can inspect the exact path by running systemctl cat nftables.service. Always verify which file your distribution's service unit is loading before editing.
The flush ruleset directive at the top of /etc/nftables.conf is worth understanding precisely. It removes every table, chain, rule, and set from the kernel in a single operation before applying the new configuration. Combined with the atomicity of nft -f, this guarantees the result is exactly and only what is in your configuration file -- no residual rules from a previous load, no orphaned sets. The atomicity guarantee means that if flush ruleset succeeds but a subsequent statement in the file fails, the entire batch is rolled back and the pre-flush ruleset is restored.
If you run Docker on the same host, be aware that flush ruleset in your nftables configuration file removes Docker's NAT and forwarding rules along with your own. Docker writes its rules at startup and does not automatically re-add them when your nftables service reloads. After a systemctl reload nftables.service on a Docker host, container networking will be broken until Docker is restarted or sends its rules again. If you use Docker alongside native nftables rules, either manage Docker's rules explicitly in your configuration file, use DOCKER-USER chain conventions, or use a tool like firewalld that coordinates with Docker's table namespace.
If you use Docker, be aware that Docker manipulates nftables/iptables rules directly to expose container ports. These NAT rules are processed before your custom filter rules, which means Docker can bypass your firewall. Audit your effective ruleset with nft list ruleset after starting containers to verify what Docker has added.
Built-in Tracing and Debugging
iptables had no native equivalent to packet tracing. Debugging a blocked connection meant adding LOG targets, reading syslog, and guessing which rule in which chain was responsible. nftables includes a purpose-built tracing system that lets you follow a packet through the entire rule evaluation path in real time.
The process has two parts: a temporary rule that marks the traffic you want to trace, and the nft monitor trace command that displays the kernel's evaluation of each rule against matching packets.
# Insert a trace rule at the top of the chain for a specific source
# nft insert rule inet firewall input ip saddr 10.10.1.15 tcp dport 22 meta nftrace set 1

# In another terminal, watch the trace output
# nft monitor trace

# You will see output like:
trace id a3b2c1 inet firewall input rule ip saddr 10.10.1.0/24 tcp dport 22 drop verdict drop

# Remove the trace rule when done
# nft delete rule inet firewall input handle <handle_number>
The trace output shows you the exact chain, the exact rule, and the exact verdict applied to the packet. No guesswork, no log parsing. This alone can cut firewall debugging time from hours to minutes on complex rulesets.
Reading the output requires knowing what each field means. The trace id is a per-packet identifier that groups all trace events for the same packet together -- useful when multiple packets are being traced simultaneously. Each event shows the table name, the chain name, and the rule that matched. A line showing only verdict with no rule handle indicates the chain's default policy was applied rather than any explicit rule. The packet descriptor lines show the 5-tuple (source address, destination address, protocol, source port, destination port) so you can confirm which packet triggered each trace event.
# Annotated example of nft monitor trace output

# [1] The trace ID groups all events for a single packet
trace id a3b2c1d4 inet firewall input packet: iif "eth0" ip saddr 10.10.1.15 ip daddr 192.168.1.1 ip protocol tcp tcp dport 22

# [2] The trace rule matched first: the nftrace=1 rule itself
trace id a3b2c1d4 inet firewall input rule ip saddr 10.10.1.15 tcp dport 22 meta nftrace set 1 (verdict continue)

# [3] Next rule evaluated: the established/related accept rule (no match)
trace id a3b2c1d4 inet firewall input rule ct state established,related accept (verdict continue)

# [4] Next rule: the blocking rule -- this one matched
trace id a3b2c1d4 inet firewall input rule ip saddr 10.10.1.0/24 tcp dport 22 drop (verdict drop)

# [5] Final verdict: the packet was dropped at this rule
trace id a3b2c1d4 inet firewall input verdict drop

# If the chain policy were reached instead, you would see:
# trace id ... inet firewall input policy drop
The sequence of verdict continue entries shows every rule the packet passed through before the final verdict. When a chain's default policy fires, the output shows policy drop (or policy accept) rather than a rule reference -- which immediately tells you no explicit rule matched and the packet reached the end of the chain.
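When tracing several packets at once, grouping events by trace id is the first step in making the output readable. A small parser sketch (the line format follows the annotated example above; the grouping logic is an illustration, not an official tool):

```python
from collections import defaultdict

def group_trace(lines):
    """Bucket 'trace id <ID> ...' lines by their trace id so all
    events for one packet can be read as a sequence."""
    events = defaultdict(list)
    for line in lines:
        parts = line.split()
        if parts[:2] == ["trace", "id"] and len(parts) > 2:
            events[parts[2]].append(" ".join(parts[3:]))
    return dict(events)

sample = [
    'trace id a3b2c1d4 inet firewall input packet: iif "eth0" ip saddr 10.10.1.15',
    'trace id a3b2c1d4 inet firewall input rule ip saddr 10.10.1.0/24 tcp dport 22 drop (verdict drop)',
    'trace id a3b2c1d4 inet firewall input verdict drop',
]
grouped = group_trace(sample)
print(len(grouped["a3b2c1d4"]))   # 3
```

Piping `nft monitor trace` output through a filter like this is a practical way to isolate one connection's evaluation path on a busy host.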
Tracing generates kernel output for every matching packet. On a high-traffic host, a broadly scoped trace rule -- like one matching all traffic from a busy subnet -- can flood the trace buffer and degrade performance. Always scope trace rules as narrowly as possible, and remove them immediately after debugging.
Flowtables: The Forwarding Fastpath
One of the most underexplored capabilities in nftables is the flowtable, a forwarding fastpath that bypasses the full Netfilter hook traversal for established connections. It is not a filtering feature -- it is an acceleration feature, and it matters significantly on any system that forwards large volumes of sustained TCP traffic.
The way Netfilter normally works, every packet traverses the full pipeline: prerouting, forward, and postrouting hooks, connection tracking lookups at each step. For an established TCP connection, this means repeating the same checks thousands of times across the life of the connection. A flowtable eliminates that repetition. Once a connection is classified and offloaded to the flowtable, all subsequent packets in that flow are forwarded directly via neigh_xmit(), bypassing the classic IP forwarding path entirely. The TTL is still decremented and NAT mappings are still applied -- the flowtable stores the NAT configuration alongside the cached routing decision -- but the full Netfilter pipeline is skipped.
# Define a flowtable on the forwarding interfaces
table inet router {
    flowtable fastpath {
        hook ingress priority 0;
        devices = { eth0, eth1 };
    }

    chain forward {
        type filter hook forward priority 0; policy drop;

        # Offload established TCP and UDP flows to the fastpath.
        # meta l4proto covers both IPv4 and IPv6 in an inet table;
        # ip protocol { tcp, udp } would only match IPv4 here.
        meta l4proto { tcp, udp } flow add @fastpath

        # Stateful fallback for non-offloaded traffic
        ct state established,related accept
        ct state invalid drop

        # Accept traffic from trusted interfaces
        iifname "eth0" accept
    }
}
The flow add statement in the forward chain is what populates the flowtable. It fires after the first packet of a connection has gone through the full forwarding path and connection tracking has confirmed bidirectional traffic. From the second packet onward, the flowtable takes over. The reason the flowtable can skip the conntrack confirm step for subsequent packets is that confirmation only needs to happen once: the conntrack entry is created and committed to the tracking table during the first packet's traversal of the full pipeline. Subsequent packets in the same flow carry no new information that needs to be recorded -- they just need to be forwarded. The flowtable caches everything it needs (the routing decision, the NAT mapping, the neighbor table entry) from the initial classification, making the per-packet work for established flows a simple hashtable lookup and a direct transmit via neigh_xmit(). On a forwarding host under sustained load, this can produce measurable reductions in CPU overhead compared to full pipeline traversal.
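The slow-path-once, fast-path-forever pattern is easy to sketch. In this illustrative Python model (not kernel code), the 5-tuple key and the cached "decision" stand in for the real conntrack entry and cached routing/NAT state:

```python
flow_cache = {}       # stand-in for the in-kernel flowtable
slowpath_hits = 0     # counts full-pipeline traversals

def forward(pkt):
    """First packet of a flow takes the slow path and seeds the
    cache; every later packet is a single dictionary lookup."""
    global slowpath_hits
    key = (pkt["saddr"], pkt["daddr"], pkt["proto"],
           pkt["sport"], pkt["dport"])
    if key in flow_cache:
        return flow_cache[key]           # fastpath
    slowpath_hits += 1                   # full pipeline runs once per flow
    decision = ("eth1", pkt["daddr"])    # pretend routing decision
    flow_cache[key] = decision
    return decision

pkt = {"saddr": "10.0.0.1", "daddr": "8.8.8.8",
       "proto": "tcp", "sport": 40000, "dport": 443}
for _ in range(1000):
    forward(pkt)
print(slowpath_hits)   # 1
```

A thousand packets in the same flow cost one slow-path traversal; the remaining 999 are hash lookups, which is the shape of the CPU savings on a forwarding host.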
Fragmented IP packets cannot be offloaded to a flowtable. The flowtable lookup requires transport-layer selectors -- source and destination port -- which are missing from fragments beyond the first. Fragmented packets automatically fall back to the classic forwarding path. This is expected behavior documented in the kernel's flowtable infrastructure and does not require any ruleset adjustment.
Software flowtable offload requires kernel 4.16 or later. Hardware offload to a supporting NIC (such as Mellanox/NVIDIA ConnectX) requires kernel 5.13 or later and nftables 0.9.9 or later, and is enabled by adding the flags offload directive to the flowtable definition. You can identify offloaded flows in the connection tracking table by the [OFFLOAD] tag in conntrack -L output.
Flowtables are invisible in the sense that nft list ruleset will still show zero rule hits for the forward chain rules on offloaded packets -- the counter is only incremented for the first two packets before a flow is offloaded. This is expected behavior, not a bug. If you need accurate byte and packet counts for offloaded flows, add the counter keyword to your flowtable definition.
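Both options attach to the flowtable definition itself. A minimal sketch combining them, assuming a kernel of 5.13 or later and a NIC driver that supports flow offload (interface names are placeholders):

```
# Sketch: flowtable with hardware offload and counters enabled.
# Requires kernel 5.13+, nftables 0.9.9+, and NIC driver support.
table inet router {
    flowtable fastpath {
        hook ingress priority 0
        devices = { eth0, eth1 }
        flags offload        # push flows down to the NIC where supported
        counter              # keep byte/packet counts for offloaded flows
    }
}
```

If the NIC rejects a flow, it stays software-offloaded; the counter keyword applies either way.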
Migrating from iptables
The transition from iptables to nftables does not have to be disruptive. The Netfilter project provides iptables-translate, a tool that converts existing iptables rules into equivalent nftables syntax. It is not perfect -- complex rules with uncommon match extensions may need manual adjustment -- but it handles the majority of common patterns.
# Translate a single iptables rule
$ iptables-translate -A INPUT -p tcp --dport 22 -j ACCEPT
nft add rule ip filter INPUT tcp dport 22 counter accept

# Translate an entire saved ruleset (iptables-restore-translate reads a file via -f)
$ iptables-save > ruleset.v4
$ iptables-restore-translate -f ruleset.v4

# Save the translated ruleset to a file
$ iptables-restore-translate -f ruleset.v4 > /etc/nftables-migrated.conf
A practical migration path looks like this: export your current iptables rules with iptables-save, translate them, load the translated ruleset alongside your existing rules on a test system, validate behavior, then cut over. On the compatibility layer, both sets of rules coexist in the same kernel -- they are all nftables rules internally -- so you can run mixed configurations during the transition.
After migrating, resist the urge to simply replicate your old iptables structure in nftables syntax. The real gains come from redesigning your ruleset to use sets, maps, and the inet family. A 200-rule iptables configuration can often be reduced to 20-30 nftables rules that are both faster and easier to maintain. The reason the reduction is this dramatic is structural: iptables rules are one match criterion per rule -- each rule handles one IP address, one port, or one combination. Every addition is a new line in the kernel's linked list. In nftables, one rule referencing a named set handles arbitrarily many elements, and one verdict map dispatches to different actions for an entire class of matches in a single lookup. What iptables requires 50 rules to express, nftables can express as one rule pointing at a map with 50 elements.
Red Hat's RHEL 9 release notes formally declared that "iptables is deprecated in RHEL 9" and signalled it may be removed in a future major version — a concrete distribution-level signal that the migration to nftables is not optional indefinitely.
-- Red Hat, RHEL 9 Release Notes, 2022
Testing Safely Without Locking Yourself Out
The single most common operational hazard when working with firewall rules remotely is applying a configuration that drops your own SSH connection. nftables provides two safeguards worth building into your workflow.
The first is nft -c, the syntax check flag. It parses and compiles your configuration file without loading it into the kernel. It catches syntax errors and references to undefined sets or chains before any rules change. This does not validate that the policy is correct -- it only validates that the syntax is valid -- but it eliminates the class of outage caused by a typo or an unclosed brace.

There is an important limitation to understand: nft -c performs its validation entirely in userspace, using the nft compiler's own logic. It does not query the running kernel. This means it will not catch errors caused by missing kernel modules. For example, if your configuration uses conntrack rate-limiting features and the nft_connlimit module is not compiled into your kernel, nft -c reports no error but the actual load will fail at the kernel boundary. On mainstream distributions this distinction rarely matters because all nf_tables modules are included, but on custom kernels or embedded systems it can surprise you:
# Check syntax without loading -- no kernel changes made
# nft -c -f /etc/nftables.conf
Error: ... (syntax errors appear here, nothing was loaded)

# Only load if the syntax check passes
# nft -c -f /etc/nftables.conf && nft -f /etc/nftables.conf

# Schedule an automatic rollback in 2 minutes using at(1)
# If you can't confirm the new rules work, the job restores the old ones
# nft list ruleset > /tmp/nftables-backup.conf
# echo "nft -f /tmp/nftables-backup.conf" | at now + 2 minutes
# nft -f /etc/nftables.conf

# If the new ruleset is correct and SSH works, cancel the rollback job:
# atrm <job_number>
The at-based rollback is the standard technique for testing firewall changes on remote hosts. You schedule the restore job before loading the new rules. If the new rules break connectivity, you simply wait for the at job to fire and restore the previous ruleset. If everything works, you cancel the job with atrm. The timeout can be set to whatever interval gives you enough time to verify the change -- two minutes is a common default for quick tests, five or more for complex changes where full verification takes longer.
Always capture your current working ruleset with nft list ruleset > /tmp/nftables-backup.conf before loading experimental rules. The backup file is valid nftables syntax and can be reloaded directly with nft -f. On a system running Docker or Kubernetes, this backup will include the rules those tools have written, giving you a complete restore point rather than just your own configuration.
Inspecting Hook Priority Ordering
Once your ruleset coexists with rules from Docker, libvirt, or Kubernetes, the processing order across all registered hooks matters. The nft list hooks command, introduced in nftables 1.0.8, displays every Netfilter hook currently registered on the system, showing the address family, hook name, priority value, and the component that registered it. The command was added specifically in response to operational confusion in multi-component environments: each tool that writes nftables rules registers its chains silently, and previously there was no unified way to see the effective hook processing order without parsing the outputs of multiple different tools -- or reading the kernel's internal hook arrays directly. nft list hooks gives that view in one command.
# Show all hooks registered on the system across all families
# nft list hooks

# Filter to a specific family and hook
# nft list hooks inet

# Example output fragment showing priority ordering
family ip hook input {
    +0000000000 chain inet firewall input [nf_tables]
    +0000000000 chain ip filter INPUT [nf_tables]   # iptables-nft compat, same priority
}
This output is particularly valuable for verifying that your custom chains run at a lower priority number (higher precedence) than any iptables-compatibility or third-party chains, and for diagnosing unexpected policy interactions in container environments.
What nftables Does Not Do
nftables is not a drop-in replacement for every iptables feature. The most notable gap is the absence of deep packet inspection. iptables offered a string match module that could inspect packet payloads for arbitrary byte sequences -- used for tasks like SNI filtering on TLS connections. nftables has no equivalent. This is a deliberate design constraint, not an oversight: the nftables VM instruction set is intentionally generic and stateless, operating on fixed-offset byte patterns in the packet headers. Deep inspection of variable-length application layer protocols requires stateful reassembly across multiple packets and full protocol parsing -- operations that don't fit the register-and-compare model the VM is built around. If you depend on string matching for application-layer filtering, you will need to use a separate tool like a transparent proxy or an application-layer gateway.
There is also a behavioral difference in how accept verdicts work. In iptables, a -j ACCEPT in the filter table effectively guaranteed the packet was accepted, because iptables had a monolithic table model where the filter table had final authority. In nftables, an accept verdict ends processing in the current table's chain, but nftables tables are independent namespaces that each contribute their own hooks to the Netfilter pipeline. Another table registered at a later priority on the same hook can still evaluate the packet and drop it. This matters in environments like Kubernetes where multiple components write to the kernel's packet filtering rules and need to coordinate across table boundaries rather than relying on the older finality of iptables acceptance.
Some uncommon iptables match extensions and target extensions also lack direct nftables counterparts. The iptables-translate tool will flag these with a comment when it encounters rules it cannot convert. In practice, the affected extensions tend to be niche -- the vast majority of production rulesets translate cleanly.
nftables changes the model, not the mission. The goal is still the same: inspect packets, make decisions, enforce policy. The difference is that the machinery underneath is now faster, more consistent, and built for the scale of modern infrastructure.
nftables and eBPF: Different Tools, Different Layers
A question that comes up consistently among anyone working at the kernel networking level is how nftables relates to eBPF and XDP. Both run code in the kernel in response to packet events. Both can drop packets at very early points in the pipeline. The answer is that they solve different problems at different layers of abstraction, and in practice they coexist rather than compete.
eBPF is a general-purpose kernel programmability substrate. An eBPF program is arbitrary C-like bytecode that the kernel verifies for safety and then attaches to a hook point. It can read and modify packet data, call a restricted set of kernel helpers, and write to shared maps for coordination with userspace. XDP (eXpress Data Path) is a specific eBPF hook point that fires inside the NIC driver, before the kernel has even allocated a socket buffer for the packet. The only processing that happens earlier than XDP is in the NIC firmware itself.
nftables operates at the Netfilter layer, which sits above the network stack's core packet processing infrastructure. The nftables VM runs bytecode compiled from nft rule syntax, but that bytecode is interpreted by a purpose-built interpreter with a fixed instruction set -- not the general eBPF verifier. nftables chains have direct access to conntrack state, NAT tables, and the full Netfilter hook topology, none of which are easily accessible from a raw XDP program. The reason XDP can't readily access conntrack state is that conntrack state is established by the Netfilter conntrack hook, which fires at the prerouting priority (-200). XDP runs before the kernel has allocated a socket buffer for the packet, which means it runs before prerouting ever fires. From XDP's perspective, the conntrack table is simply uninitialized for that packet -- there is no entry to look up yet.
XDP and eBPF tc programs are appropriate for stateless line-rate filtering, custom protocol parsers, and load balancing scenarios where you need to process millions of packets per second and the Netfilter overhead is measurable. nftables is appropriate for stateful firewall policy, NAT, conntrack-based rules, and any scenario where you need set-based lookups, atomic ruleset updates, and readable configuration syntax. The two are architecturally complementary: a host can run XDP for DDoS scrubbing at the ingress driver level and nftables for stateful filtering at the Netfilter layer simultaneously.
The nftables ingress hook (discussed in the ingress section above) occupies a position in the pipeline after XDP but before the standard Netfilter prerouting hook. This means there is a three-tier early-drop stack available on a modern Linux host: XDP inside the NIC driver, nftables at the netdev ingress hook, and nftables at the Netfilter prerouting hook. Traffic that survives the first two layers reaches the third having already been through two stages of stateless filtering.
The practical implication for firewall design: if you are running nftables on a host that is not under volumetric attack, there is no reason to also run XDP filtering. The nftables ingress hook at the netdev layer is fast enough for all but the highest-traffic DDoS scrubbing scenarios, and it integrates cleanly with the rest of your nftables ruleset. XDP programs become relevant when you need sub-microsecond decision latency or when you are implementing custom forwarding logic that Netfilter simply cannot express.
firewalld and ufw are userspace management tools that write nftables (or iptables) rules on your behalf. On Fedora, RHEL, and derivatives, firewalld defaults to using nftables as its backend since RHEL 8. On Debian and Ubuntu, ufw still uses the iptables-nft compatibility layer by default as of Ubuntu 24.04. If you are writing nftables rules directly while firewalld or ufw is also running, you need to understand which tables those tools own. firewalld creates its own firewalld table; ufw creates rules in the iptables-nft shim, which appear in nftables as a separate table. Running nft list ruleset after enabling either frontend will show you exactly what they have written. Direct-managed nftables rules and frontend-managed rules can coexist as long as you understand the priority ordering between them.
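A quick way to see which component owns which table is to list the tables themselves; the output below is illustrative of what a host running a frontend alongside direct rules might show:

```
# List every table from every component, with its family
# nft list tables
table inet firewalld     # written by firewalld
table ip filter          # iptables-nft shim (e.g. populated by ufw)
table ip nat             # iptables-nft shim
table inet my_ruleset    # your directly managed rules
```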
Internals Worth Knowing
Several nftables behaviors are not covered in introductory documentation but become relevant quickly in production environments.
Dynamic Sets and Automatic Expiry
Named sets in nftables support a timeout flag that causes elements to expire automatically after a specified duration. This replaces the pattern of running a cron job to flush blocklists or rate-limit entries. A set element added with a timeout is evicted from kernel memory without any userspace intervention. Combined with the update @set statement, this forms the basis of stateful per-source rate limiting directly in the ruleset:
# Dynamic set: elements expire automatically after 60 seconds of inactivity.
# size is mandatory for sets populated from the packet path (prevents memory exhaustion).
set ratelimit_offenders {
    type ipv4_addr
    flags dynamic,timeout
    timeout 60s
    size 65536
}

chain input {
    type filter hook input priority 0; policy drop;

    # Drop immediately if source IP is already in the offender set
    ip saddr @ratelimit_offenders drop

    # Track per-source rate; add to offender set and drop if exceeded.
    # update @set { key limit rate over N/unit } is the correct nftables
    # idiom for combining per-key rate tracking with dynamic set population.
    # tcp flags & (fin|syn|rst|ack) == syn matches pure SYN packets only,
    # ensuring only connection-initiating packets are counted.
    ct state new tcp flags & (fin|syn|rst|ack) == syn \
        update @ratelimit_offenders { ip saddr limit rate over 10/minute } drop
}
The update @set { key limit rate over N/unit } idiom is the correct nftables pattern for combining per-key rate tracking with dynamic set population. Each unique source IP gets its own rate bucket inside the kernel. When the rate limit is exceeded the update expression returns true, the source address is written into ratelimit_offenders, and the drop verdict fires. All subsequent packets from that address are dropped immediately by the first rule -- without re-evaluating the rate expression -- until the 60-second timeout expires and the element is evicted. The size 65536 field is mandatory whenever a set is populated from the packet path, and the reason is memory safety: the kernel processes packets in interrupt context at high speed with no natural backpressure, so without a declared maximum, a flood of unique source addresses could grow the set without bound and exhaust kernel memory. When the set is full, new addresses are simply not added and evaluation falls through to the next rule. This approach is meaningfully more efficient than the equivalent iptables technique of combining hashlimit with ipset.
Conntrack Zones and Asymmetric Routing
Connection tracking in nftables operates through the standard Netfilter conntrack subsystem, but nftables gives you direct control over conntrack zones. A conntrack zone is a numeric identifier that scopes connection tracking entries -- two connections with the same 5-tuple but different zone IDs are tracked independently. This is essential on hosts that perform asymmetric routing, where traffic may arrive and depart on different interfaces.
The reason asymmetric routing breaks standard conntrack is structural: conntrack matches flows by their 5-tuple (source IP, destination IP, source port, destination port, protocol) and expects to see both directions of a connection through the same Netfilter hook path. When routing is asymmetric -- the forward path comes in on eth0 but the reply goes out on eth1, for example -- conntrack may register the initial SYN on eth0 but never see the SYN-ACK reply, because the reply arrived through a path that doesn't traverse the same hooks. The forward-direction packets are then classified INVALID because conntrack cannot find a matching established entry. Zones solve this by giving each interface its own isolated tracking namespace: a connection arriving on eth0 is only matched against entries in zone 1, not entries from eth1 traffic in zone 2. Each direction is tracked independently and correctly.
# Assign conntrack zone 1 to traffic entering on eth0
# Assign conntrack zone 2 to traffic entering on eth1
# Allows asymmetric routing without INVALID state misclassification
chain prerouting {
    type filter hook prerouting priority -300;
    iifname "eth0" ct zone set 1
    iifname "eth1" ct zone set 2
}
This technique is used in ECMP (Equal-Cost Multi-Path) environments and on hosts with policy routing tables where the kernel intentionally routes reply traffic through a different interface than the one that received the request.
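Zones can be matched as well as set, so later chains can apply zone-specific policy. A sketch of a forward chain keyed on the zones assigned above (chain layout is illustrative):

```
# Sketch: match on the conntrack zone assigned at prerouting
chain forward {
    type filter hook forward priority 0; policy drop;

    # Established/related flows are valid within their own zone
    ct zone 1 ct state established,related accept
    ct zone 2 ct state established,related accept

    # New connections only from the inside interface
    ct state new iifname "eth0" accept
}
```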
How Netlink Batch Atomicity Works
When you run nft -f /etc/nftables.conf, the nft binary parses the entire file, compiles each statement into a Netlink message, and wraps all messages in a single NFNL_BATCH_BEGIN / NFNL_BATCH_END envelope before sending the batch to the kernel over a Netlink socket. The kernel processes the entire batch as one transaction, committed in a single step by nf_tables_commit(). If any message in the batch fails -- for instance, if a referenced set does not exist -- the entire batch is rolled back and the ruleset remains unchanged. This is fundamentally different from applying rules with individual iptables commands, which changed the kernel one rule at a time and could leave the system in a partially updated state if an error occurred mid-stream; even iptables-restore, which swaps in a whole table atomically, could not span multiple tables in a single transaction.
The practical implication is that you can safely use flush ruleset at the top of your configuration file. If anything in the file fails to parse or apply, the kernel rolls back the entire transaction and your previous ruleset remains in place. You are never left with an empty ruleset and an open firewall.
The Priority Number System
Every base chain in nftables is registered at a specific priority value on a specific Netfilter hook. Lower numbers run first. The Netfilter kernel code defines several named priority constants that map to specific integer values:

NF_IP_PRI_CONNTRACK_DEFRAG  (-400)
NF_IP_PRI_RAW               (-300)
NF_IP_PRI_CONNTRACK         (-200)
NF_IP_PRI_MANGLE            (-150)
NF_IP_PRI_NAT_DST           (-100)
NF_IP_PRI_FILTER               (0)
NF_IP_PRI_SECURITY            (50)
NF_IP_PRI_NAT_SRC            (100)

The nft tool accepts both the numeric values and symbolic names like filter, raw, mangle, and dstnat/srcnat.
The reason conntrack and destination NAT run at negative priorities -- below zero, before your filter chain -- is a hard ordering dependency. Your filter chain almost certainly contains rules like ct state established,related accept, which requires conntrack to have already classified the packet before the filter chain evaluates it. If conntrack ran at the same priority as filter, or after it, there would be no ct state value available at rule evaluation time. Similarly, DNAT at -100 must rewrite the destination address before the routing decision so the kernel sends the packet to the right place. If DNAT ran after routing, the kernel would have already made a forwarding decision based on the original destination, and the NAT rewrite would have no effect on where the packet was sent.
The reason this matters in practice: if you write a filter chain at priority 0 and also have Docker running on the same host, Docker registers its NAT rules at priority -100 (dstnat). Docker's DNAT rules run before your filter rules. A packet destined for a container port has its destination address rewritten by Docker before your filter chain sees it -- which means if your filter chain blocks traffic to the container's backend address and port, the block will work, but a filter rule written to block the original published port may not behave as expected. Running nft list hooks reveals exactly this ordering and removes the guesswork.
The nftables wiki maintains a full priority mapping table at wiki.nftables.org/wiki-nftables/index.php/Netfilter_hooks. The kernel source authoritative reference is include/uapi/linux/netfilter_ipv4.h and its IPv6 equivalent. When in doubt, use symbolic names (filter, dstnat, srcnat) rather than bare integers -- they are more readable and immune to priority value changes across kernel versions.
nftables and eBPF: Where the Boundary Sits
A question that comes up frequently in Kubernetes and high-performance networking contexts is how nftables relates to eBPF. They are not competing technologies -- they operate at different points in the stack with different tradeoffs. nftables works through the Netfilter hook system, which means it participates in the full kernel networking stack including conntrack, NAT, and routing decisions. eBPF programs attached at the TC (Traffic Control) or XDP (eXpress Data Path) layers can run before the Netfilter hooks and can make forwarding decisions without ever entering the network stack proper. Tools like Cilium bypass kube-proxy and nftables entirely for service routing, using eBPF maps directly.
The practical division: nftables is the right tool for stateful host firewalling, NAT, and transparent packet classification on general-purpose Linux systems. eBPF is the right tool for extremely high-throughput forwarding, XDP-level DDoS mitigation, and kernel-bypass networking in environments where every microsecond of stack traversal matters. On a Kubernetes node running Cilium with eBPF, nftables may still be present and active for host-level firewall rules while Cilium handles all service routing entirely outside the Netfilter path.
The nft_compat Extension Layer: How iptables-nft Actually Works
When iptables-nft translates an iptables rule into nftables, it does not convert every match and target into native nftables expressions. Instead, it relies on a kernel module called nft_compat that acts as an adapter between the nftables rule engine and the legacy xtables extension API. Understanding this layer explains several behaviors that administrators encounter when running mixed iptables-nft and native nftables configurations.
The xtables infrastructure -- which backs iptables -- expresses each match and target as a separate kernel module exposing a C struct of function pointers: one to parse the userspace argument, one to match a packet, one to print. nftables has no concept of this struct natively. The nft_compat module wraps xtables extensions in a special nftables expression type called xt, which calls the legacy extension's match function directly from inside the nftables VM evaluation path. When you run nft list ruleset and see output like xt match "tcp" or xt target "CONNMARK", you are seeing the nft_compat wrapper at work.
# An iptables-nft rule like this:
$ iptables-nft -A INPUT -m conntrack --ctstate ESTABLISHED -j ACCEPT

# Appears in nft list ruleset as an xt expression:
table ip filter {
    chain INPUT {
        # xt match wrapping the legacy conntrack extension
        xt match "conntrack"
        xt target "ACCEPT"
    }
}

# A native nftables equivalent uses no xt wrapper at all:
table inet filter {
    chain input {
        ct state established accept
    }
}
This distinction has a practical consequence: xt-wrapped expressions cannot participate in the nftables set and map lookup system. You cannot reference an xtables match result as a set key, use it in a verdict map, or combine it with the flow add flowtable action. The reason is architectural: the nftables set and flowtable infrastructure works by extracting a typed value from a register and using it as an index into a kernel data structure. It requires the match criterion to be expressible as a concrete, typed value -- an IP address, a port number, a protocol identifier. An xt-wrapped extension is an opaque function call; the kernel has no way to extract a typed key from it, so it cannot be indexed. The nft_compat wrapper gives you compatibility, not integration. For any rule where you want the full performance characteristics of native nftables -- set lookups, flowtable offload, atomic set element updates -- the rule must be rewritten in native nft syntax.
The nft_compat module is also the reason why running nft flush ruleset flushes rules written by iptables-nft: they both live in the same kernel data structure. This is intentional -- the iptables-nft compatibility layer is not a separate firewall, it is a translation interface into the same nf_tables kernel subsystem. Knowing this, you can audit the full effective rule state of any system -- regardless of which tools created the rules -- with a single nft list ruleset command.
On systems where both binaries are present, iptables-legacy writes to the old x_tables kernel infrastructure, which is entirely separate from nf_tables. Rules written by iptables-legacy do not appear in nft list ruleset and do not interact with nftables rules in any way. Mixing the two -- running iptables-legacy and native nftables on the same host -- is a configuration error that produces undefined behavior and double-filtering on some packets. Modern distributions set the default iptables binary to the nft variant precisely to prevent this.
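You can check which backend a given iptables binary uses with --version; the backend is printed in parentheses:

```
# Identify the backend behind the iptables command
$ iptables --version
iptables v1.8.9 (nf_tables)   # the nft-based shim

$ iptables-legacy --version
iptables v1.8.9 (legacy)      # the old x_tables backend

# On Debian/Ubuntu, the active variant is selected via alternatives:
$ update-alternatives --display iptables
```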
Flowtable Garbage Collection: How Entries Get Evicted
A detail that rarely appears in flowtable documentation but matters in production is exactly how flowtable entries are removed when a connection closes. The answer reveals something important about the relationship between flowtables and the connection tracking subsystem.
Once a flow is offloaded to a flowtable, subsequent packets in that flow bypass the Netfilter hooks entirely -- including the hook at which the conntrack confirm function runs. This means the conntrack entry for an offloaded flow does not receive normal per-packet updates. Instead, the flowtable infrastructure maintains a workqueue-based garbage collector that periodically walks the flowtable entries and checks whether the underlying conntrack entry is still alive. The workqueue fires approximately every 500 milliseconds by default (controlled by nf_flowtable_work_interval in the kernel source).
When a TCP connection reaches a FIN or RST state, or when a UDP flow ages out of the conntrack table, the GC workqueue detects that the conntrack entry is either in CLOSE or TIME_WAIT state and begins the eviction process. The flowtable entry is marked for removal. Packets arriving during the eviction window -- between when the entry is marked and when it is removed -- fall through to the classic forwarding path (a flowtable miss), are evaluated by the full Netfilter pipeline, and the connection state is updated there. This fallback is transparent to both endpoints.
# List conntrack entries -- offloaded flows are tagged [OFFLOAD]
# conntrack -L -p tcp

# Active software-offloaded entry
tcp 6 src=10.0.0.1 dst=10.0.0.2 sport=52100 dport=443 [OFFLOAD] mark=0 use=2

# Hardware-offloaded entry (requires NIC support + kernel 5.13+)
tcp 6 src=10.0.0.1 dst=10.0.0.2 sport=52101 dport=443 [HW_OFFLOAD] mark=0 use=2

# Entry no longer offloaded (GC eviction in progress, back on slowpath)
tcp 6 src=10.0.0.1 dst=10.0.0.2 sport=52102 dport=443 TIME_WAIT mark=0 use=1
The GC design has one important implication for NAT: the flowtable entry caches the NAT mapping alongside the routing decision when a flow is first offloaded. If the NAT mapping changes for a flow that is already in the flowtable (for instance, because a SNAT rule is updated while the flow is active), the cached mapping in the flowtable entry is not updated. The flow continues using the old mapping until the GC evicts the entry and it is re-evaluated through the full Netfilter pipeline. For production environments where NAT mappings are static this is never an issue.

For environments where NAT rules change dynamically, this has a concrete failure mode: Kubernetes service endpoint churn is the primary example. In a Kubernetes cluster, when a pod backing a service is terminated and replaced, the service IP maps to a new pod IP. If a flowtable entry was caching the old pod's address as the DNAT target, packets for that active connection continue being forwarded to the terminated pod's IP until the GC runs and evicts the stale entry -- typically within half a second for a TCP close, but during that window the flow is effectively broken. This is the primary reason the nftables kube-proxy backend avoids long-lived flowtable entries for service traffic and uses incremental named set updates instead: set element updates take effect immediately for all new and existing rule evaluations, with no stale cache problem.
nf_tables is a new in-kernel rule engine that comes with full atomic rule replacement.
-- Pablo Neira Ayuso, nf_tables kernel pull request description, October 2013
nftables as an Attack Surface
Understanding the nftables architecture has a security dimension that goes beyond writing better firewall rules. Because nftables is implemented as a kernel subsystem with a Netlink-based API accessible to unprivileged users in certain configurations, it has attracted sustained attention from vulnerability researchers -- and has been the location of a recurring pattern of local privilege escalation flaws. These are not abstract concerns: several nftables CVEs have been actively exploited in the wild and added to the CISA Known Exploited Vulnerabilities catalog.
The common thread across many of these vulnerabilities is the interaction between nftables and unprivileged user namespaces. On Linux, user namespaces allow a process to operate with what appears to be root privileges inside an isolated namespace without having root on the host. The mechanism that makes this relevant to nftables is namespace-scoped capability checking: Linux capabilities including CAP_NET_ADMIN are checked relative to the namespace a process is operating in. A process inside a user namespace where it has UID 0 holds full capabilities within that namespace, and the nf_tables Netlink socket only checks for CAP_NET_ADMIN -- it does not verify that the caller is operating from the initial (host-level) network namespace. This means unprivileged user namespace creation is sufficient to satisfy the nf_tables access check. On distributions that enable unprivileged user namespace creation by default -- which includes Debian and Ubuntu -- this means any local user can reach the nf_tables code paths that have historically contained vulnerability classes like double-free conditions, use-after-free, and type confusion.
The CVE Pattern
The following Netfilter CVEs -- most in nf_tables proper, one in the legacy x_tables code -- have either been actively exploited or received significant security attention:
CVE-2024-1086 (CVSS 7.8, actively exploited in ransomware campaigns): A use-after-free in the nft_verdict_init() function. The function allowed positive values as a drop error within a hook verdict, causing nf_hook_slow() to trigger a double-free when NF_DROP was issued with a drop error resembling NF_ACCEPT. The vulnerability was present in kernel versions from 3.15 to 6.8-rc1 -- meaning it existed in kernels for roughly a decade before being identified. CISA added the flaw to its Known Exploited Vulnerabilities catalog in May 2024, and in October 2025 issued a notification confirming it is known to be used in ransomware campaigns. Sysdig Threat Research Team analysis identified RansomHub and Akira as operators using the exploit for post-compromise privilege escalation: gain initial access through stolen credentials or a vulnerable service, then exploit CVE-2024-1086 to escalate from a limited user account to root, after which security tools are disabled, data is exfiltrated, and encryption payloads are deployed. Public proof-of-concept code became available in March 2024, which substantially lowered the barrier to exploitation. Fixed in kernel versions v5.15.149+, v6.1.76+, and v6.6.15+.
CVE-2024-26809 (double-free in nft_pipapo_destroy()): The pipapo set implementation -- which handles multi-dimensional packet classification for large rulesets -- contains a flaw in its destruction path. When a set is marked as "dirty" (modified in a transaction that has not yet been committed), the destroy function fails to check whether elements exist in both the live match structure and the backup clone, leading to repeated deallocations of the same memory. Unlike earlier nftables LPE flaws, CVE-2024-26809 does not require CAP_NET_ADMIN or an unprivileged namespace as a prerequisite, which makes it accessible from any low-privilege account or compromised container on an affected host. Ready-to-use exploit code has been published.
CVE-2024-42070 (type confusion, privilege escalation and information disclosure): Insufficient validation of user-supplied data types in nftables allows a type confusion condition where data is processed as though it belongs to a different type than expected. Exploiting the condition can lead to sensitive information disclosure and, when combined with other techniques, arbitrary code execution in kernel context. Systems running affected kernels in multi-user or shared hosting environments are at elevated risk.
CVE-2021-22555 (heap out-of-bounds write, CVSS 8.3, CISA KEV): A flaw in xt_compat_target_from_user() in net/netfilter/x_tables.c where the allocation size for 32-to-64-bit structure conversion did not account for padding, allowing a local attacker to write a small number of bytes out-of-bounds on the kernel heap, enabling full kernel compromise via heap corruption. The vulnerability is reachable through an unprivileged user namespace. It was added to the CISA KEV catalog in October 2025 after evidence of active exploitation. Affected kernels: 2.6.19-rc1 through 5.12. Fixed in kernel 5.13.
CVE-2022-34918 (type confusion, Metasploit module available): A type confusion bug in nft_set_elem_init() allowed an unprivileged user to escalate privileges to ring zero via a heap buffer overflow. An attacker needed an unprivileged user namespace to obtain CAP_NET_ADMIN access. Multiple public proof-of-concept exploits were released, including a Metasploit module targeting specific kernel versions.
CVE-2023-35001 (out-of-bounds read/write, Pwn2Own Vancouver 2023): An out-of-bounds read/write in nft_byteorder_eval() caused by an incorrect element stride calculation when processing 16-bit values -- the loop iterates using 32-bit union-sized steps rather than 16-bit steps, allowing reads and writes past the end of the VM register array. Demonstrated by Synacktiv at Pwn2Own Vancouver 2023 to achieve local privilege escalation on Ubuntu. Required CAP_NET_ADMIN or an unprivileged user namespace. Reached root on affected systems.
CVE-2025-21826 (pipapo width mismatch, disclosure March 2025): A logic error in pipapo_init() where register-based arithmetic allows a mismatch between the set key length and the sum of per-field lengths in a concatenated key. For example, a set key length of 10 and field description [5, 4] produces a pipapo rule width of 12, creating an out-of-bounds condition during multi-dimensional packet classification. The issue was introduced alongside the pipapo implementation and affects Debian bookworm, Ubuntu (medium priority), and RHEL 9. The fix validates that the sum of field lengths matches the declared set key length before pipapo initialization proceeds. While no public proof-of-concept exploit had been confirmed at the time of writing, the pipapo subsystem has been the source of multiple prior LPE vulnerabilities, and the pattern of incorrect width calculations creates a consistent attack surface.
CVE-2026-23231 (CVSS 7.1 High, use-after-free in nf_tables_addchain(), disclosure March 2026): A race-condition use-after-free in the chain creation error path. When a newly-created chain is published into the table's chain list via RCU-style insertion before hooks are fully registered, and registration subsequently fails, the error path removes the chain and frees its memory without waiting for the RCU grace period to expire. A data-path race specific to NFPROTO_INET families makes this particularly reachable: the IPv4 hook may be transiently installed even if IPv6 hook registration fails. Packets entering via the transient hook can execute nft_do_chain() and dereference blob_gen_X data from the already-freed chain structure. The fix inserts a synchronize_rcu() call after nft_chain_del() to ensure all RCU readers complete before destruction. Affects kernels prior to 6.19.6; fixed via commit 71e99ee20fc3 and stable backports.
CVE-2026-23392 (flowtable use-after-free in error path, disclosure March 2026): A use-after-free in the nf_tables flowtable error path where a flowtable object remains accessible to both the packet path and the nfnetlink_hook control plane after teardown, because an RCU grace period is not respected before the object is freed. The fix adds a synchronize_rcu() call after hook unregistration in the failure path. The vulnerability is triggered during a failed flowtable setup or failed hardware offload transition rather than on the normal path, which narrows exploitation to environments actively using flowtable offloading. CVSS scoring was pending at publication time, but the nature of the flaw -- a UAF in an actively traversed forwarding path -- places it in the same class as other nf_tables UAF flaws that have been weaponized.
The pattern that emerges from this CVE history is consistent: the nf_tables code is complex memory management logic executed in kernel space with minimal overhead per packet. That same complexity and the proximity to raw memory operations are what produce this vulnerability class. An additional structural factor sharpened the problem in 2024: the Linux kernel team became a CVE Numbering Authority, which substantially accelerated formal CVE assignments for kernel bugs -- the 2024 total reached over 3,500 kernel CVEs, and 2025 continued at a similar pace. For security teams, this volume makes numeric severity alone an unreliable triage signal; nftables CVEs specifically warrant prompt treatment because the exploit path to root from a low-privilege user is well-documented and tooling exists. This is not a reason to avoid nftables -- iptables has its own history of related flaws -- but it is a reason to treat kernel patching as directly relevant to firewall infrastructure, not as a separate concern.
Hardening the nftables Attack Surface
Several concrete mitigations reduce exposure to nftables-based local privilege escalation:
Restrict unprivileged user namespaces. On systems where containers or per-user namespaces are not required, the attack surface is substantially reduced by disabling unprivileged user namespace creation. This removes the vector through which low-privilege users can reach the nf_tables code with elevated capabilities:
# Disable immediately (reverts on reboot)
sysctl -w kernel.unprivileged_userns_clone=0

# Persist across reboots
echo "kernel.unprivileged_userns_clone=0" \
    > /etc/sysctl.d/99-disable-unpriv-userns.conf
sysctl -p /etc/sysctl.d/99-disable-unpriv-userns.conf

# Verify the setting is in effect
sysctl kernel.unprivileged_userns_clone
kernel.unprivileged_userns_clone = 0

# Note: kernel.unprivileged_userns_clone is a Debian/Ubuntu kernel patch;
# on other distributions, user.max_user_namespaces=0 is the closest
# upstream equivalent.
Disabling unprivileged user namespaces will break rootless container runtimes (rootless Docker, rootless Podman, and similar tools) and some sandboxed browser configurations that rely on user namespace isolation. Audit your workload requirements before applying this sysctl in environments where rootless containers are in use.
Keep the kernel current. Every nftables LPE CVE listed above was fixed in a specific kernel version. The fix for CVE-2024-1086, for example, landed in v5.15.149+, v6.1.76+, and v6.6.15+. The fix for CVE-2021-22555 landed in 5.13. Because exploit code for several of these vulnerabilities is publicly available, an unpatched kernel on a multi-user system or a Kubernetes node is a known and documented risk. On RHEL and derivatives, dnf updateinfo list security will list kernel security advisories pending application. On Debian and Ubuntu, apt-cache policy linux-image-$(uname -r) shows the installed version against the available candidate.
Enable audit logging for nf_tables Netlink operations. The auditd framework can be configured to log Netlink socket activity, giving visibility into attempts to manipulate nftables rules from low-privilege accounts. Combined with a SIEM, this produces a detection signal for exploitation attempts before the session-level consequences become visible:
# Log all nf_tables Netlink socket creations
# AF_NETLINK = 16, NETLINK_NETFILTER = 12
auditctl -a always,exit -F arch=b64 \
    -S socket -F a0=16 -F a2=12 \
    -k nftables_access

# View triggered events
ausearch -k nftables_access
Apply SELinux or AppArmor confinement. MAC frameworks can restrict which processes are permitted to issue Netlink commands to the nf_tables subsystem, even if those processes have been granted user namespace capabilities. This addresses a limitation of the user namespace sysctl: disabling unprivileged user namespaces entirely is a coarse control that breaks rootless containers and browser sandboxes. SELinux and AppArmor policies can allow user namespace creation for specific process labels while still denying the specific network netlink operations that reach nf_tables -- a surgical control the sysctl cannot provide. On RHEL 9 and derivatives, the default targeted SELinux policy already restricts arbitrary Netlink access from confined domains. On Ubuntu, AppArmor profiles for container runtimes can be tuned to deny network netlink access where it is not operationally required.
Enable kernel lockdown where applicable. Linux kernel lockdown mode (introduced in kernel 5.4 as a formal LSM) restricts operations that could undermine kernel integrity. Its relevance to nftables exploitation is specific: achieving initial kernel code execution through a UAF or double-free vulnerability does not, by itself, produce a usable root shell. Most weaponized exploit chains require additional post-exploitation steps -- reading credential structures from kernel memory, overwriting security metadata, or writing to privileged disk locations. Lockdown mode restricts several of these post-exploitation primitives, including direct kernel memory access via /dev/mem and certain module loading paths, raising the cost of turning a kernel bug into a reliable root shell. This doesn't eliminate the vulnerability, but it increases the exploitation complexity and reduces the window in which a generic public PoC produces an immediate root shell without further adaptation. On systems using UEFI Secure Boot with an enrolled shim, lockdown is often enforced automatically.
From a defensive perspective, nft list ruleset is also an artifact worth examining during incident response on a Linux host. Attackers who have achieved local privilege escalation sometimes modify nftables rules to suppress logging, allow persistent backdoor connections, or redirect traffic. Any rules added outside of your configuration management baseline are worth investigating. Comparing live ruleset output against a known-good snapshot is a useful triage step on a potentially compromised host.
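During triage, the comparison itself needs nothing more than diff against the stored snapshot. The sketch below uses hypothetical stand-in snapshot contents; in practice, both files would come from nft list ruleset run as root:

```shell
# baseline.nft: snapshot saved after the last approved change
printf 'tcp dport 22 accept\n' > baseline.nft
# live.nft: snapshot taken on the suspect host, with an unexpected rule
printf 'tcp dport 22 accept\ntcp dport 4444 accept\n' > live.nft

# Any rule outside the baseline shows up as drift to investigate
if diff -u baseline.nft live.nft; then
    echo "ruleset matches baseline"
else
    echo "ruleset has drifted from baseline"
fi
```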
Wrapping Up
The shift from iptables to nftables is not a cosmetic rebrand. It is a fundamental redesign of how Linux classifies and filters network traffic. The kernel virtual machine replaces hardcoded protocol logic with generic bytecode. The inet family eliminates dual-stack duplication. Sets and maps turn linear rule evaluation into constant-time lookups. Atomic transactions close the gap between old and new rulesets. Flowtables bypass the full Netfilter pipeline for established connections on forwarding hosts. And built-in tracing gives administrators visibility into rule evaluation that iptables never had.
The Kubernetes project's move to an nftables kube-proxy backend, which reached General Availability in Kubernetes 1.33 (April 2025), is a clear signal of where production infrastructure is heading. The signal sharpened further in December 2025, when Kubernetes 1.35 officially deprecated IPVS mode and named nftables the recommended kube-proxy backend for Linux nodes. Worth noting: nftables is the recommended backend, but iptables remains the default mode -- the Kubernetes project has kept iptables as the default for compatibility reasons, so enabling nftables requires explicit configuration, and nftables mode may not be compatible with all network plugins, so consult your CNI documentation before migrating. Amazon EKS documentation confirms IPVS will be removed entirely in Kubernetes 1.36. Every major distribution has already made the switch at the backend level -- RHEL 8 in May 2019, Debian 10 in July 2019, Fedora 32 in April 2020, and Ubuntu 20.10 in October 2020 -- meaning the transition at the infrastructure layer has been complete for several years. The iptables compatibility layer will keep legacy scripts working for years to come, but learning to write native nftables rulesets -- using sets, maps, the inet family, and the declarative file-based configuration model -- is the path toward firewalls that are simpler, faster, and easier to debug at any scale.
How to Build a Production nftables Ruleset
Step 1: Create a table and base chain
Use the nft add table command -- for example, nft add table inet firewall -- to create a table in the inet family, which handles both IPv4 and IPv6 in a single ruleset. The table name is arbitrary; it becomes the namespace for all your chains and sets. Then attach a base chain with a type, hook, and priority to connect it to the kernel networking stack.
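In file form, the result of this step is a minimal skeleton like the following (table and chain names are illustrative):

```
table inet firewall {
    chain input {
        type filter hook input priority 0; policy drop;
    }
}
```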
Step 2: Define named sets and verdict maps
Create named sets to group IP addresses, port numbers, or other selectors for efficient O(1) lookups. Use verdict maps to associate match criteria directly with actions, reducing dozens of individual rules into a single compact rule referencing a map.
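A sketch combining a named set and a verdict map (names, addresses, and ports are hypothetical):

```
table inet firewall {
    set admin_hosts {
        type ipv4_addr
        elements = { 192.0.2.10, 192.0.2.11 }
    }
    map service_verdicts {
        type inet_service : verdict
        elements = { 22 : accept, 443 : accept, 23 : drop }
    }
    chain input {
        type filter hook input priority 0; policy drop;
        ip saddr @admin_hosts accept
        tcp dport vmap @service_verdicts
    }
}
```

The single vmap rule replaces one rule per port, and set lookups stay constant-time as the element count grows.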
Step 3: Load the ruleset atomically from a file
Write your complete ruleset to a configuration file such as /etc/nftables.conf. Use the nft -f flag to load the entire file in a single atomic transaction. Enable the nftables systemd service to restore the ruleset automatically at boot.
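A minimal /etc/nftables.conf along these lines might look like this sketch (rules illustrative, not a complete production policy):

```
#!/usr/sbin/nft -f
# Loaded atomically with: nft -f /etc/nftables.conf
# (validate first with: nft -c -f /etc/nftables.conf)
flush ruleset

table inet firewall {
    chain input {
        type filter hook input priority 0; policy drop;
        iifname "lo" accept
        ct state established,related accept
        tcp dport 22 accept
    }
}
```

Enabling the distribution's nftables systemd unit (systemctl enable nftables) restores this file at boot on most distributions.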
Step 4: Trace and debug rules with nftrace
Insert a temporary rule that sets meta nftrace to 1 for the traffic you want to inspect. Run the nft monitor trace command to watch the kernel evaluate each rule against matching packets in real time, identifying exactly which rule produces the final verdict.
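In ruleset form, the temporary trace flag is just another statement placed ahead of the rules you want to observe (table, chain, and port are illustrative):

```
table inet firewall {
    chain input {
        type filter hook input priority 0; policy drop;
        # flag matching packets for tracing; remove when done
        tcp dport 22 meta nftrace set 1
        tcp dport 22 accept
    }
}
# In a second terminal, watch evaluation live:
#   nft monitor trace
```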
Step 5: Accelerate forwarding with flowtables
Define a flowtable attached to the ingress hook of your forwarding interfaces. Add a flow offload rule in your forward chain to offload established TCP connections to the flowtable fastpath, allowing subsequent packets in those flows to bypass the full Netfilter hook traversal and be forwarded with significantly lower CPU overhead.
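A sketch of both pieces (device names and table name are hypothetical):

```
table inet firewall {
    flowtable fastpath {
        hook ingress priority 0
        devices = { eth0, eth1 }
    }
    chain forward {
        type filter hook forward priority 0; policy drop;
        # offload established TCP flows to the fastpath
        ip protocol tcp flow add @fastpath
        ct state established,related accept
    }
}
```

Only packets arriving after the offload decision skip the hook traversal; the first packets of every flow still traverse the full ruleset.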
Frequently Asked Questions
Can I still use iptables syntax on a system that runs nftables?
Yes. Modern distributions ship iptables-nft, a compatibility layer that translates iptables commands into nftables rules behind the scenes. The iptables binary you run on Debian 11+, RHEL 9+, and Ubuntu 22.04+ is already iptables-nft. This lets you keep existing scripts working while the kernel processes everything through the nftables engine.
What kernel version do I need for nftables?
The nftables framework was merged into Linux kernel 3.13 in January 2014. However, many features used in production rulesets today -- including the inet family, sets with interval support, and flowtable offloading -- require kernel 5.x or later. For full compatibility with current tooling and distribution defaults, kernel 5.13 or newer is recommended.
How do nftables sets differ from ipset?
ipset was a separate userspace tool and kernel module used alongside iptables for bulk matching against large collections of IP addresses or ports. nftables replaces ipset entirely by building set support directly into the framework. Named sets in nftables use hash tables internally, giving O(1) lookup performance regardless of element count, and they can hold combinations of data types such as address-port pairs. Unlike ipset, nftables sets are managed through the same nft tool and syntax used for all other ruleset operations.
What are nftables flowtables and when should I use them?
A flowtable is a forwarding fastpath in nftables that allows established connections to bypass the full Netfilter hook traversal. Once a connection is tracked and offloaded to a flowtable, subsequent packets in that flow are forwarded directly without re-evaluating firewall rules. This is most useful on routers or forwarding hosts carrying high volumes of sustained TCP traffic. Flowtable support requires kernel 4.16 or later for software offload; hardware offload to supporting NICs requires kernel 5.13 or later and nftables 0.9.9 or later. Flowtables are defined with a hook ingress priority and a list of devices, and are populated from the forward chain using the flow add statement.
Should I use nftables or IPVS for Kubernetes kube-proxy?
As of Kubernetes 1.33 (April 2025), the nftables kube-proxy backend reached General Availability. In Kubernetes 1.35 (December 2025), the Kubernetes project officially deprecated IPVS mode and named nftables the recommended kube-proxy mode for Linux nodes. One important operational detail: nftables is recommended but is not yet the default -- the Kubernetes project has explicitly stated it will keep iptables as the default mode for compatibility reasons, so you must set mode: nftables explicitly in your kube-proxy configuration to use it. The Kubernetes project notes that nftables mode may not be compatible with all network plugins -- consult your CNI documentation before migrating. Amazon EKS documentation confirms IPVS will be removed entirely in Kubernetes 1.36. Unlike IPVS, nftables integrates natively with the kernel's Netfilter infrastructure, supports incremental ruleset updates proportional only to the number of changed endpoints rather than the total cluster size, and eliminates global lock contention by giving each component its own private table. For new Linux-based clusters running kernel 5.13 or later, nftables is the recommended backend.
How do I test a new nftables ruleset without locking myself out over SSH?
Use two complementary techniques. First, run nft -c -f /etc/nftables.conf before loading. The -c flag performs a full syntax and semantic check in userspace -- catching undefined set references, type mismatches, and syntax errors -- without making any kernel changes. Important caveat: nft -c validates in userspace only and does not query the running kernel, so it will not catch errors caused by missing kernel modules; on mainstream distributions this rarely matters, but on custom or embedded kernels a clean -c pass does not guarantee a successful load. Second, before loading any untested rules, save your current working ruleset with nft list ruleset to a backup file, then schedule an automatic restore using the at command: echo "nft -f /tmp/backup.conf" | at now + 2 minutes. If the new rules block your SSH connection, the at job fires after the timeout and restores the previous ruleset. If the new rules work correctly, cancel the restore job with atrm before it fires. This workflow is the standard approach for testing firewall changes on remote hosts where loss of SSH access is operationally significant.
How do I read nft monitor trace output?
Each trace event begins with a trace ID -- a per-packet identifier that groups all events for the same packet -- followed by the table, chain, and rule that was evaluated. A line ending in "verdict continue" means that rule was evaluated but did not produce a final verdict; the packet continued to the next rule. A line ending in "verdict drop" or "verdict accept" identifies the exact rule that terminated evaluation. If you see "policy drop" or "policy accept" instead of a rule reference, no explicit rule matched and the packet reached the default chain policy. The packet descriptor line at the start of each trace group shows the full 5-tuple (addresses, protocol, ports), letting you confirm which packet triggered the trace.
Should I use nftables directly or through a frontend like firewalld or ufw?
It depends on what you are building. For servers where you want full control over rule structure, atomic updates, and custom sets and maps, writing nftables rules directly gives you the complete model and avoids a translation layer you do not need. For workstations and systems managed with Ansible or other configuration management tools, firewalld's zone model provides a cleaner abstraction and handles interface changes automatically. The critical point is that both firewalld (on RHEL 8+ and Fedora with the nftables backend) and direct nftables are writing to the same kernel subsystem. If you mix them, run nft list ruleset to see everything that is active, use nft list hooks to verify priority ordering, and ensure you understand which tables each tool owns before writing rules that depend on precedence between them.
How does the nftables limit statement work for rate limiting?
The limit statement controls how many packets or bytes per time interval a rule will match. It is simpler than per-source tracking and is suitable for host-level rate limiting where you do not need per-IP buckets. The syntax is limit rate N/second or limit rate N/minute (and byte equivalents with limit rate Nbytes/second). Adding the over keyword inverts the match: limit rate over 100/second matches packets that exceed the rate, which is the more common usage for drop rules. The limit state is maintained as a token bucket in kernel memory. A burst parameter controls the maximum burst above the rate before tokens are exhausted. For per-source-address rate limiting -- where each client IP gets its own independent bucket -- use the update @set { key limit rate over N/unit } idiom with a dynamic named set, since limit maintains a single global token bucket shared across all traffic matching the rule.
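An illustrative sketch of both forms (table name, set name, and thresholds are hypothetical):

```
table inet ratelimit {
    set ssh_meter {
        type ipv4_addr
        flags dynamic
        timeout 5m
    }
    chain input {
        type filter hook input priority 0; policy accept;
        # single global bucket shared by all matching packets
        icmp type echo-request limit rate over 10/second drop
        # one independent bucket per client address
        tcp dport 22 ct state new \
            add @ssh_meter { ip saddr limit rate over 10/minute } drop
    }
}
```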
How does the inet family handle dual-stack traffic?
When you create a table in the inet family, nftables registers your chains on both the IPv4 and IPv6 hooks simultaneously. A rule written without a protocol-specific prefix -- such as tcp dport 22 accept -- applies to both address families. A rule prefixed with ip or ip6 applies only to that family. The kernel VM inserts an implicit family check before protocol-specific field accesses, which is why you can safely mix ip saddr and ip6 saddr rules in the same chain without errors -- each family check short-circuits correctly. Sets in an inet table can hold values from both families if declared with a combined type. For named sets that hold only IPv4 or only IPv6 addresses, declare them with type ipv4_addr or type ipv6_addr explicitly and reference them from rules that include the appropriate family qualifier. ICMP and ICMPv6 require separate rules: ip protocol icmp accept for IPv4 and ip6 nexthdr icmpv6 accept for IPv6 -- there is no single statement that covers both because the protocol numbers differ at the IP layer.
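An illustrative inet chain mixing family-neutral and family-specific rules (addresses are documentation prefixes):

```
table inet dualstack {
    chain input {
        type filter hook input priority 0; policy drop;
        tcp dport 22 accept                  # matches IPv4 and IPv6
        ip saddr 192.0.2.0/24 accept         # IPv4 only
        ip6 saddr 2001:db8::/32 accept       # IPv6 only
        ip protocol icmp accept              # IPv4 ICMP
        ip6 nexthdr ipv6-icmp accept         # ICMPv6
    }
}
```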
What is nft_compat and does it affect performance?
The nft_compat kernel module is the bridge between iptables-nft and the native nftables rule engine. When you run a rule through iptables-nft, the command translates iptables syntax into a Netlink message that includes a special xt expression wrapping the legacy xtables extension. This expression type calls the original xtables extension match function directly from inside the nftables VM evaluation path. The performance cost is roughly equivalent to iptables: the xtables extension code runs the same comparisons it always has. What you do not get with nft_compat expressions is the ability to use them as set keys, verdict map selectors, or flowtable offload triggers -- those features require native nftables expressions. Rules visible in nft list ruleset as xt match "..." or xt target "..." are using this compatibility layer. Rewriting them in native nft syntax is the only way to access the full nftables feature set for those match conditions.
What is the relationship between nftables and Linux kernel privilege escalation CVEs?
nftables has been the source of a recurring pattern of local privilege escalation vulnerabilities because the nf_tables subsystem implements complex memory management logic -- set lifecycle, transaction atomicity, pipapo multi-dimensional lookup structures -- in kernel space, where any memory corruption is in the highest-privilege context. Many of these flaws (CVE-2024-1086, CVE-2022-34918, CVE-2023-35001) require the attacker to first access nf_tables through an unprivileged user namespace, which grants CAP_NET_ADMIN within the namespace scope. On distributions that enable unprivileged user namespace creation by default -- including Debian and Ubuntu -- this means any local user can reach the vulnerable code paths. CVE-2024-1086 was confirmed by CISA as actively exploited in ransomware campaigns. The primary mitigations are: keep the kernel patched to the version that includes the fix for each relevant CVE, and where rootless containers are not required, disable unprivileged user namespace creation via sysctl kernel.unprivileged_userns_clone=0.
Sources
The following primary sources were used in the research and verification of this guide:
- Pablo Neira Ayuso, The Netfilter.org nftables project page, netfilter.org
- Patrick McHardy and Pablo Neira Ayuso, nftables: a new packet classification framework, Netdev 0.1 Conference, February 2015
- Pablo Neira Ayuso, kernel mainline pull request, submitted October 16, 2013, merged January 19, 2014 (Linux kernel 3.13)
- Red Hat Developer, What comes after iptables? Its successor, of course: nftables, developers.redhat.com, 2016
- Dan Winship (Red Hat), NFTables mode for kube-proxy, kubernetes.io, February 28, 2025
- Kubernetes v1.33 Release Notes, Kubernetes v1.33: Octarine, kubernetes.io, April 23, 2025
- Kubernetes v1.35 Sneak Peek, Kubernetes v1.35: IPVS deprecation and nftables recommendation, kubernetes.io, November 26, 2025
- Linux kernel documentation, Netfilter's flowtable infrastructure, docs.kernel.org
- nftables wiki, Flowtables, wiki.nftables.org
- nftables 0.9.9 release announcement (flowtable hardware offload), LWN.net, 2021
- LWN.net, Nftables reaches 1.0, August 2021
- Kubernetes SIG Network, KEP-3866: nftables kube-proxy backend, github.com/kubernetes/enhancements
- Amazon EKS documentation, Review release notes for Kubernetes versions on standard support, docs.amazonaws.cn (confirming IPVS removal in 1.36)
- Red Hat, RHEL 9.0 Release Notes: Deprecated functionality, access.redhat.com, 2022
- Linux kernel source, net/netfilter/nft_compat.c and net/netfilter/nf_flow_table_core.c -- nft_compat wrapper and flowtable GC workqueue implementation
- Pablo Neira Ayuso, nf_tables: initial commit, Linux kernel git, kernel.org, October 2013
- NVD, CVE-2024-1086 -- use-after-free in nft_verdict_init(), nvd.nist.gov; Sysdig TRT, active exploitation in ransomware campaigns confirmed by CISA, October 2025
- NVD, CVE-2024-26809 -- double-free in nft_pipapo_destroy(), nvd.nist.gov; Google security-research PoC, github.com/google/security-research
- NVD, CVE-2024-42070 -- type confusion in nftables subsystem, nvd.nist.gov
- CISA Known Exploited Vulnerabilities Catalog, CVE-2021-22555 -- heap out-of-bounds write in xt_compat_target_from_user(), added 2025; LinuxSecurity, Linux Kernel Vulnerabilities Exploited in 2025: CISA KEV Insights
- NVD, CVE-2023-35001 -- out-of-bounds read/write in nft_byteorder_eval() due to incorrect element stride (Pwn2Own Vancouver 2023, Synacktiv), nvd.nist.gov