Every Linux server ships with kernel defaults designed for general-purpose workloads. These defaults are conservative by design -- they need to work reasonably well on a Raspberry Pi and a 128-core bare-metal machine alike. But when your server is fielding tens of thousands of concurrent connections, serving high-bandwidth transfers, or backing a busy API gateway, those defaults become the bottleneck. The kernel itself has all the machinery for extreme throughput -- it just needs to be told to use it.
This guide walks through the specific sysctl parameters, network stack settings, and memory management configurations that matter for high-traffic Linux servers. Every recommendation is grounded in kernel documentation, upstream maintainer guidance, and production-validated configurations from organizations like ESnet, Red Hat, and Google.
As the ESnet tuning guide warns, many of the settings described here will actually decrease performance on hosts connected at rates less than 1 Gbps. These tunings are specifically for servers on 1 Gbps or faster links handling substantial concurrent load. Measure before and after. Change one thing at a time.
How sysctl Works
The sysctl interface exposes kernel tunables through the /proc/sys virtual filesystem. You can read any parameter with sysctl <parameter> and write temporary changes with sysctl -w. Temporary changes are lost on reboot.
For persistent configuration, add parameters to /etc/sysctl.conf or create a dedicated file in /etc/sysctl.d/, such as /etc/sysctl.d/99-high-traffic.conf. Apply all configured files with sysctl --system, or a single file with sysctl -p /etc/sysctl.d/99-high-traffic.conf.
Before making any changes, establish baselines. You need data to compare against:
# CPU and load
$ uptime && mpstat 1 10

# Memory
$ free -h && vmstat 1 10

# Disk I/O
$ iostat -x 1 10

# Network
$ sar -n DEV 1 10

# Connection state overview
$ ss -s
TCP and Network Stack Tuning
The network stack is where high-traffic servers hit their first ceilings. The kernel manages connection queues, socket buffers, port allocation, and congestion control -- all of which have defaults sized for modest workloads.
The Listen Backlog: somaxconn and tcp_max_syn_backlog
When a client initiates a TCP connection, the kernel processes it through two queues. The SYN queue (half-open queue) holds connections that have received a SYN but haven't completed the three-way handshake. The accept queue holds fully established connections waiting for the application to call accept(). These two queues are governed by net.ipv4.tcp_max_syn_backlog and net.core.somaxconn, respectively.
The default somaxconn value was 128 until kernel 5.4, which raised it to 4096. On any modern distribution (kernel 5.4+) the default is therefore already 4096, so setting 4096 explicitly is a no-op. For high-traffic servers, you should raise it further: on extremely busy load balancers or reverse proxies, values of 16384 or even 65535 are common. As the listen(2) man page specifies, if an application passes a backlog value larger than somaxconn, the kernel silently truncates it. This means your nginx or HAProxy instance might think it has a backlog of 16384 while the kernel caps it at 4096 unless you raise this value.
# Default is 4096 on kernel 5.4+; raise for high-concurrency workloads
net.core.somaxconn = 16384
net.ipv4.tcp_max_syn_backlog = 16384
On RHEL 8+ kernels, somaxconn supports 32-bit values (up to 2,147,483,647), after a kernel commit by Eric Dumazet expanded the sk_max_ack_backlog field from 16-bit to 32-bit (Red Hat Customer Portal).
You must also configure your application to match. For nginx:
listen 80 backlog=16384;
Monitor whether your queues are overflowing:
# Check accept queue status per port
$ ss -ntl '( sport = :443 )'

# Check for SYN queue overflows
$ nstat -az TcpExtListenDrops TcpExtListenOverflows
If the Recv-Q column approaches the Send-Q value, your backlog is too small.
The Netdev Backlog
Before packets even reach the TCP stack, the kernel buffers incoming frames in a per-CPU backlog queue. The net.core.netdev_max_backlog parameter controls this queue's size (default: 1000). On servers with 10 Gbps or faster NICs, the NIC can deliver packets faster than a single CPU core can process them. If this queue fills, packets are silently dropped. You can detect this by checking /proc/net/softnet_stat -- the second column indicates drops due to a full backlog (Red Hat Documentation).
net.core.netdev_max_backlog = 5000
For 10 Gbps and above, values of 10000 to 30000 are used in production. The NGINX Plus AMI on AWS historically shipped with netdev_max_backlog = 30000.
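The drop counter in /proc/net/softnet_stat is hex-encoded, which makes it easy to misread. A minimal sketch of decoding the second column, demonstrated here on a made-up sample row (on a live host, feed it the real file):

```shell
# Each row of /proc/net/softnet_stat is one CPU; the second hex column counts
# packets dropped because the per-CPU backlog queue was full.
row="00015c73 00000012 00000000 00000000"   # sample row, not real data
drops_hex=$(echo "$row" | awk '{print $2}')
printf 'drops: %d\n' "0x$drops_hex"         # prints "drops: 18"
```

Any nonzero value that keeps growing under load means netdev_max_backlog is too small or packet processing is CPU-bound.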
TCP Buffer Sizes
TCP socket buffers determine how much data can be in-flight between sender and receiver. The kernel auto-tunes these, but the maximum values constrain what auto-tuning can achieve. The required buffer size depends on bandwidth and latency via the bandwidth-delay product (BDP): Buffer = Bandwidth (bytes/sec) x RTT (seconds).
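As a quick sanity check, the formula is easy to evaluate in the shell; the values below assume the 10 Gbps / 100 ms example used in this section:

```shell
# BDP = bandwidth (bytes/sec) x RTT (sec), in integer arithmetic
bits_per_sec=10000000000   # 10 Gbps
rtt_ms=100                 # 100 ms round-trip time
bdp_bytes=$(( bits_per_sec / 8 * rtt_ms / 1000 ))
echo "BDP: ${bdp_bytes} bytes (~$(( bdp_bytes / 1048576 )) MB)"
# prints "BDP: 125000000 bytes (~119 MB)"
```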
A 10 Gbps link with 100 ms RTT requires approximately 120 MB of buffering for a single flow. ESnet's guidance for 10G hosts optimized for paths up to 100 ms RTT recommends:
# Global max receive/send buffer
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864

# Per-socket TCP buffers: min / default / max
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
For web servers handling many concurrent short-lived connections rather than bulk data transfers, smaller per-connection buffers prevent a few greedy connections from monopolizing memory. Values of 16777216 (16 MB) for the max are appropriate.
ESnet explicitly advises leaving net.ipv4.tcp_mem at its defaults, as the kernel manages total TCP memory well on its own. Some guides also recommend disabling tcp_timestamps and tcp_sack to reduce CPU load -- ESnet strongly advises against this, as it breaks performance in the vast majority of cases.
Local Port Range and TIME_WAIT
Every outbound connection consumes a local ephemeral port. The default range on many systems is 32768 60999, giving roughly 28,000 ports. For servers that proxy many connections upstream, this can be exhausted quickly.
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
tcp_tw_reuse only affects outbound connections -- it allows reuse of TIME_WAIT sockets when creating new connections as a client. For a web server that primarily accepts inbound connections, this setting has no effect on its listening sockets. It matters for reverse proxies and load balancers that make upstream connections to backends. Also note that starting ip_local_port_range at 1024 may conflict with services bound to IANA registered ports (1024-49151). If you run services on specific ports in that range, a safer starting point is 10240 65535 or 15000 65535.
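The ephemeral-port arithmetic behind these ranges is worth making explicit (both ends of each range are inclusive):

```shell
echo "default: $(( 60999 - 32768 + 1 )) ports"   # 32768-60999 -> 28232
echo "widened: $(( 65535 - 1024 + 1 )) ports"    # 1024-65535  -> 64512
echo "safer:   $(( 65535 - 10240 + 1 )) ports"   # 10240-65535 -> 55296
```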
The now-removed tcp_tw_recycle parameter was dropped from the kernel in version 4.12 because it broke connections from clients behind NAT. As community commentary has consistently noted, it caused connectivity failures for users sharing a public IP address. Never use it, and disregard any legacy guide that recommends it.
TCP Fast Open and SYN Cookies
TCP Fast Open (TFO) allows data to be sent in the initial SYN packet, eliminating one full round-trip from connection establishment. Setting the value to 3 enables TFO for both client and server roles. SYN cookies (tcp_syncookies = 1) provide protection against SYN flood attacks -- this is typically enabled by default and should remain on.
net.ipv4.tcp_fastopen = 3
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_slow_start_after_idle = 0
Disabling tcp_slow_start_after_idle prevents the kernel from resetting the congestion window on idle connections, which is critical for keepalive connections to backends.
Congestion Control: Why BBR Matters
The congestion control algorithm determines how aggressively TCP sends data and how it reacts to network conditions. The default on Linux is CUBIC, a loss-based algorithm: it increases its sending rate until it detects packet loss, interprets that loss as congestion, and backs off. This approach has fundamental limitations. In networks with shallow buffers, packets can be dropped from brief bursts even when the link isn't congested. In networks with deep buffers, CUBIC fills them before detecting loss, creating the "bufferbloat" problem.
BBR: A Model-Based Approach
In 2016, Google introduced BBR (Bottleneck Bandwidth and Round-trip propagation time), merging it into Linux kernel 4.9. Instead of reacting to packet loss, BBR continuously estimates the bottleneck bandwidth and minimum RTT, then paces its sending rate to match the available capacity.
The results have been significant. AWS CloudFront reported performance gains of up to 22% in aggregate throughput after deploying BBR in 2019. In Google's own testing, comparing BBR to CUBIC on an emulated 10 Gbps link with 100 ms RTT and 1% packet loss, CUBIC achieved only about 3.3 Mbps while BBR achieved over 9,100 Mbps -- nearly three orders of magnitude higher (Google Cloud Blog). These numbers reflect BBR's fundamental advantage: because it doesn't interpret packet loss as congestion, it maintains throughput in lossy environments where loss-based algorithms collapse.
BBR is a sender-side algorithm -- it only needs to be enabled on your server, not on the client or anywhere in the network path:
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
The fq (fair queueing) qdisc is essential for BBR to pace packets correctly. ESnet notes that while both fq and fq_codel support pacing, fq is specifically recommended by the BBR team at Google for use with BBR.
BBR isn't universally better. Research from APNIC found that under small BDP and deep buffers, CUBIC actually achieves higher throughput. BBR excels under large BDP and shallow buffer conditions -- which describes the vast majority of internet-facing server traffic.
BBRv1 has known fairness issues: it can starve CUBIC flows when sharing a bottleneck, particularly with small buffers. BBRv2 (and the subsequent BBRv3 work) addresses these problems with improved inter-protocol fairness and ECN support, though neither has landed in mainline kernels -- they ship in Google's out-of-tree branch and some patched vendor kernels. If your kernel exposes a newer version (check sysctl net.ipv4.tcp_available_congestion_control), it is the better choice for production environments where mixed congestion control algorithms coexist.
Verify your configuration is active:
# Check current algorithm
$ sysctl net.ipv4.tcp_congestion_control

# List available algorithms
$ sysctl net.ipv4.tcp_available_congestion_control

# Verify BBR is active on connections
$ ss -tin | grep bbr
ESnet also notes they no longer recommend HTCP as a congestion control algorithm, stating that with newer kernel versions there is no longer a performance advantage over the default CUBIC.
Memory Management Tuning
Memory management tuning affects how the kernel balances RAM between application memory, filesystem caches, and swap. For high-traffic servers, incorrect defaults here can cause latency spikes, I/O storms, or premature swapping.
Swappiness
The vm.swappiness parameter controls how aggressively the kernel moves inactive pages from RAM to swap. The default is 60. The kernel uses this along with a "distress" value and the mapped memory ratio to calculate a "swap tendency" -- when this exceeds 100, the kernel begins swapping application memory (Red Hat Customer Portal).
For servers with ample RAM (8 GB+), a value of 60 is far too aggressive. The kernel will swap out application memory to make room for file cache pages even when plenty of RAM is available. This introduces disk I/O latency on what should be memory-speed operations.
vm.swappiness = 10
A value of 10 tells the kernel to strongly prefer reclaiming file cache pages over swapping out application memory. Some database workloads run with vm.swappiness = 1, though setting it to 0 carries OOM kill risk under extreme memory pressure on older kernels.
Dirty Page Ratios
The kernel caches writes in "dirty pages" before flushing them to disk. Two parameters control this: vm.dirty_background_ratio (when background writeback starts) and vm.dirty_ratio (when processes are forced to do synchronous I/O).
The defaults (10% and 20%) can cause problems on large-memory servers. On a 128 GB machine, dirty_ratio = 20 means the kernel might accumulate 25 GB of dirty pages before forcing a flush. The Gluster documentation describes this clearly: high dirty ratios on large-memory systems can trigger massive pagecache flushes to disk, causing huge wait times and decreasing overall responsiveness.
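The arithmetic behind that figure is straightforward:

```shell
# Dirty-page ceiling implied by vm.dirty_ratio on a large-memory host
ram_gb=128
dirty_ratio=20   # percent
echo "$(( ram_gb * dirty_ratio / 100 )) GB of dirty pages before forced writeback"
# prints "25 GB of dirty pages before forced writeback"
```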
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.vfs_cache_pressure = 50
Lower values cause more frequent but smaller flushes, providing more consistent I/O latency. The vfs_cache_pressure value of 50 (default: 100) tells the kernel to prefer retaining directory and inode caches, which reduces filesystem lookup latency on file-serving workloads.
Transparent Huge Pages (THP)
Standard Linux memory pages are 4 KB. Transparent Huge Pages allows the kernel to automatically use 2 MB pages, reducing TLB misses. However, THP is a source of serious latency problems for many server workloads. The kernel's khugepaged daemon runs in the background compacting memory to create 2 MB pages, causing unpredictable latency spikes.
For JVM-based applications (Elasticsearch, Kafka), databases (Redis, MongoDB, PostgreSQL), and latency-sensitive web applications, disabling THP is standard practice:
$ echo never > /sys/kernel/mm/transparent_hugepage/enabled
$ echo never > /sys/kernel/mm/transparent_hugepage/defrag
Alternatively, setting THP to madvise instead of never provides a middle ground: the kernel will not automatically use huge pages, but applications that explicitly request them via madvise(MADV_HUGEPAGE) can still benefit. This is useful for JVMs or other runtimes that are configured to opt in to huge pages while keeping the rest of the system free from khugepaged compaction latency.
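When you read the sysfs file, the active mode is the bracketed entry. A small sketch of extracting it, run here against a sample string (on a live host, substitute the output of cat /sys/kernel/mm/transparent_hugepage/enabled):

```shell
# sysfs reports all modes and brackets the active one, e.g. "always [madvise] never"
state="always [madvise] never"   # sample value, not read from a live system
echo "$state" | grep -o '\[[a-z]*\]' | tr -d '[]'   # prints "madvise"
```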
To make this persistent across reboots, create a systemd unit:
[Unit]
Description=Disable Transparent Huge Pages

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
ExecStart=/bin/sh -c "echo never > /sys/kernel/mm/transparent_hugepage/defrag"

[Install]
WantedBy=multi-user.target
File Descriptors and Connection Tracking
Every open socket, file handle, and pipe consumes a file descriptor. The system-wide limit is controlled by fs.file-max. Per-process limits are configured via /etc/security/limits.conf or systemd's LimitNOFILE.
fs.file-max = 2097152
* soft nofile 1048576
* hard nofile 1048576
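To verify what a process actually gets, check the shell's own soft limit (or /proc/<pid>/limits for a running service -- note that systemd services take LimitNOFILE from the unit file, not from limits.conf):

```shell
# Soft file-descriptor limit inherited by processes started from this shell
ulimit -n
```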
Connection Tracking (conntrack)
If your server uses iptables/nftables with stateful rules, the nf_conntrack module maintains a table of every connection flowing through the firewall. The default table size (often 65536) is rapidly exhausted under high-concurrency workloads. When it fills, new connections are dropped and dmesg shows nf_conntrack: table full, dropping packet.
net.netfilter.nf_conntrack_max = 262144
Each entry consumes approximately 300-400 bytes of kernel memory, so 262144 entries require roughly 80-100 MB. If your server doesn't require stateful firewalling (for example, behind a dedicated hardware firewall or cloud security group), the most performant option is to disable conntrack entirely by unloading the module.
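The memory estimate is simple arithmetic, assuming roughly 350 bytes per entry (the midpoint of the 300-400 byte range above):

```shell
entries=262144
bytes_per_entry=350   # approximate; actual entry size varies by kernel version
echo "~$(( entries * bytes_per_entry / 1048576 )) MB of kernel memory"
# prints "~87 MB of kernel memory"
```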
IRQ Affinity and Packet Pacing
Kernel parameter tuning alone cannot solve every bottleneck. On multi-queue NICs (standard on 10 Gbps+ adapters), packet processing can be distributed across CPU cores using Receive Side Scaling (RSS) and IRQ affinity.
By default, the kernel may assign all NIC interrupts to a single CPU core, creating a processing bottleneck while other cores sit idle. Configuring RSS distributes network processing across cores on the same NUMA node as the NIC, minimizing memory access latency (DigitalOcean).
# Check current IRQ distribution
$ cat /proc/interrupts | grep eth

# Set RSS queue count to match available cores
$ ethtool -L eth0 combined 8

# Enable RPS on systems without hardware RSS
$ echo "ff" > /sys/class/net/eth0/queues/rx-0/rps_cpus
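The rps_cpus value is a hexadecimal CPU bitmask, so "ff" selects CPUs 0-7. A one-liner for building the mask covering the first N cores:

```shell
# Hex bitmask selecting CPUs 0 through N-1 for rps_cpus
cpus=8
printf '%x\n' $(( (1 << cpus) - 1 ))   # prints "ff"
```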
Packet Pacing
Bursty traffic patterns -- where the NIC sends packets in rapid micro-bursts rather than at a steady rate -- can overflow switch buffers and receive host buffers, causing packet loss distinct from sustained congestion. ESnet strongly recommends using packet pacing:
# For a 10G host -- cap slightly below line rate
$ tc qdisc add dev eth0 root fq maxrate 8gbit

# For a 10G host using 4 parallel streams
$ tc qdisc add dev eth0 root fq maxrate 2gbit
NUMA Awareness
On multi-socket servers, memory access latency depends on which NUMA node the memory is allocated from. If your application runs on CPU cores attached to NUMA node 0 but its network buffers are allocated on node 1, every packet involves a cross-node memory access penalty. Use numactl --cpunodebind=0 --membind=0 to pin latency-sensitive services to the same NUMA node as the NIC. Verify NIC-to-NUMA mapping with cat /sys/class/net/eth0/device/numa_node. For high-traffic servers, NUMA-aware placement can matter as much as the sysctl tuning described in this guide.
A Note on Window Scaling
Some legacy tuning guides recommend disabling net.ipv4.tcp_window_scaling. Do not do this. Window scaling (RFC 1323) is required for TCP windows larger than 64 KB, which is essential for any high-throughput connection. It is enabled by default and should stay that way.
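To see why this matters: without scaling, the receive window tops out at 65,535 bytes, and window size divided by RTT bounds single-flow throughput no matter how fast the link is. For a 100 ms path:

```shell
window_bytes=65535   # maximum TCP window without window scaling
rtt_ms=100
echo "$(( window_bytes * 8 * 1000 / rtt_ms )) bits/sec max"
# prints "5242800 bits/sec max" -- about 5 Mbps on any link
```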
The Complete Configuration
Here is a complete, production-ready /etc/sysctl.d/99-high-traffic.conf for a web-facing server on a 10 Gbps link. Every parameter has been covered in the sections above:
#
# Network: Connection Handling
# (default is 4096 on kernel 5.4+; raise for high-concurrency)
#
net.core.somaxconn = 16384
net.ipv4.tcp_max_syn_backlog = 16384
net.core.netdev_max_backlog = 5000

#
# Network: TCP Buffers
#
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432

#
# Network: Congestion Control
#
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

#
# Network: Connection Reuse and Ports
#
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_fastopen = 3

#
# Network: Security
#
net.ipv4.tcp_syncookies = 1

#
# Memory Management
#
vm.swappiness = 10
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.vfs_cache_pressure = 50

#
# File Descriptors
#
fs.file-max = 2097152

#
# Connection Tracking (if using stateful firewall)
#
net.netfilter.nf_conntrack_max = 262144
Monitoring and Validation
After applying changes, validate that they're taking effect and measure the impact.
# Watch for listen queue overflows
$ nstat -az TcpExtListenDrops TcpExtListenOverflows

# Monitor accept queue per port
$ ss -ntl

# Check for softnet backlog drops (second column; strtonum requires gawk)
$ awk '{for (i = 1; i <= NF; i++) printf "%d%s", strtonum("0x" $i), (i == NF ? "\n" : " ")}' /proc/net/softnet_stat

# Verify BBR is active
$ ss -tin | head -5

# Watch swap and dirty pages under load
$ vmstat 1
$ grep -E "Dirty|Writeback" /proc/meminfo

# Load test
$ wrk -t12 -c1000 -d30s http://localhost/
Final Guidance
Kernel tuning is not a one-time task. Newer kernels consistently improve defaults -- modern kernels (6.6+) ship with the EEVDF scheduler (replacing CFS), better auto-tuning of TCP buffers, and improved memory management. What was essential tuning on a 4.x kernel may be unnecessary or even counterproductive on a 6.x kernel.
The community around Linux networking tuning has a recurring theme. As the Linux Network Performance Parameters project notes: many people search for sysctl values that promise high throughput with no trade-offs in every situation, but that's unrealistic. The newer kernel versions are well-tuned by default, and uninformed changes can hurt performance rather than help it.
The process is always the same: measure, change one thing, measure again, and document what you changed and why.
Use tools like perf, bpftrace, and sar for deep analysis. Consider eBPF-based observability for kernel-level insights that go beyond what sysctl can tell you. And revisit your tuning after every major kernel upgrade -- the defaults may have shifted in your favor.