GPU acceleration for running AI workloads on Linux comes down to two software stacks: NVIDIA's CUDA and AMD's ROCm. They solve the same problem -- giving frameworks like PyTorch direct access to GPU compute -- but they take different architectural approaches, have dramatically different ecosystem maturity, and require different amounts of setup effort on Linux. Understanding the practical difference between them determines how much of your time goes into OS maintenance versus actual AI work. If you are still deciding on hardware, the Linux GPU tiers guide ranks AMD, NVIDIA, and Intel cards by driver support and AI compute readiness.

The comparison is more nuanced in 2026 than it has been in previous years. ROCm 7.1.1 delivered up to 5x inference performance improvement over ROCm 6.4.4 across key AI models, per AMD Performance Labs testing (source: AMD ROCm What's New). ROCm 7.2.0 shipped January 21, 2026, and ROCm 7.2.1 -- released March 25, 2026 -- is the current production release as of April 2026 (source: Phoronix, AMD ROCm 7.2.1 coverage).

ROCm 7.2.1 adds Ubuntu 24.04.4 support (with both the 6.8 GA kernel and 6.17 HWE kernel), marks Ubuntu 24.04.3 end-of-life for ROCm, delivers improved hipBLASLt performance for MXFP8 and MXFP4 GEMM kernels, enables JAX 0.8.2, and discontinues the ROCm Offline Installer Creator in favor of the self-extracting Runfile Installer (source: AMD ROCm 7.2.1 Release Notes). Consumer GPU support expanded significantly with 7.2.0: the RX 7700 (non-XT), RX 9060 XT LP, and Radeon AI PRO R9600D were added to the official Radeon support matrix, joining the RX 9070 XT, RX 9070, RX 9070 GRE, RX 7900 series, RX 7700 XT, RX 7800 XT, and RX 9060 XT that were already listed (source: AMD ROCm Linux System Requirements).

The gap between the two stacks has genuinely narrowed. At the same time, CUDA's ecosystem maturity still represents a real and significant advantage, particularly for specialized libraries and the long tail of AI tooling that assumes NVIDIA hardware.

What Each Stack Actually Is

CUDA (Compute Unified Device Architecture) is NVIDIA's proprietary parallel computing platform, launched in 2006. From a Linux perspective, it consists of: a kernel-mode driver loaded at boot, a userspace runtime library, a compiler (nvcc), and an extensive collection of domain libraries -- cuDNN for deep learning primitives, cuBLAS for linear algebra, cuFFT for Fourier transforms, TensorRT for inference optimization, and dozens more. The CUDA SDK has been continuously refined for nearly two decades, and that accumulated optimization is why compute-intensive workloads tend to run faster on NVIDIA hardware even when the underlying FLOPS count looks comparable to an AMD alternative. For a full comparison of how Linux GPU drivers differ across vendors, including open vs. proprietary kernel modules, see the GPU drivers guide.

If you are hardening the system this GPU is running on, be aware that both CUDA and ROCm install kernel-mode drivers that run at ring 0 -- the same privilege level exploited in vulnerabilities like CVE-2026-3888 on Ubuntu 24.04. Keeping drivers and the kernel patched is not optional in any production AI environment, and hardening SSH access to the machine running your GPU stack should happen before you expose any AI service port. The guide to disabling root SSH login on Ubuntu covers the baseline config changes that close the most commonly exploited SSH attack vectors on Ubuntu 24.04.

MITRE T1068 (Exploitation for Privilege Escalation) · MITRE T1547.006 (Kernel Modules and Extensions) · MITRE T1652 (Device Driver Discovery) · NIST SP 800-53 SI-2 (Flaw Remediation) · NIST SP 800-53 CM-6 (Configuration Settings) · NIST SP 800-123 (Server Security Guide)

ROCm (Radeon Open Compute) is AMD's open-source GPU compute stack. It provides a kernel-mode driver (amdgpu-dkms), a compute runtime, and its own library ecosystem: rocBLAS, MIOpen (deep learning primitives), rocFFT, and others. The key architectural element is HIP (Heterogeneous-compute Interface for Portability), a C++ programming model designed to be close enough to the CUDA API that code can be compiled for either AMD or NVIDIA hardware with minimal changes. PyTorch uses HIP internally for its ROCm support, which is why the Python API is identical across both backends -- torch.cuda.is_available() returns True on a working ROCm system just as it does on CUDA.

// check your understanding
Why does torch.cuda.is_available() return True on a ROCm system? Is that expected behavior or a quirk?

Expected behavior. PyTorch on ROCm uses HIP internally as its compute backend. AMD designed HIP to be API-compatible with CUDA, and PyTorch exposes that through the same torch.cuda namespace on both stacks. It is intentional: existing CUDA PyTorch code runs on ROCm without Python-level changes. The definitive confirmation you have the ROCm wheel (not the CUDA or CPU build) is torch.version.hip -- if it returns a version string rather than None, you have the ROCm build. The shared API surface also means that environment variable injection attacks targeting CUDA PyTorch deployments (MITRE ATT&CK T1574 — Hijack Execution Flow) apply equally to ROCm environments.
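The build-detection logic can be captured in a few lines. This is an illustrative sketch, not a PyTorch API: the detect_backend helper is hypothetical, but the torch.version.hip and torch.version.cuda fields it mirrors are real attributes of the installed wheel.

```python
def detect_backend(hip_version, cuda_version):
    """Mirror how torch.version.hip / torch.version.cuda identify the build.

    hip_version / cuda_version stand in for torch.version.hip and
    torch.version.cuda; each is a version string or None.
    """
    if hip_version:       # ROCm wheels set torch.version.hip to a string
        return "rocm"
    if cuda_version:      # CUDA wheels set torch.version.cuda instead
        return "cuda"
    return "cpu-only"     # the CPU-only wheel leaves both as None

# On a real system you would call: detect_backend(torch.version.hip, torch.version.cuda)
print(detect_backend("6.2.41133", None))  # → rocm
print(detect_backend(None, "12.4"))       # → cuda
print(detect_backend(None, None))         # → cpu-only
```

The point is that torch.cuda.is_available() alone cannot tell you which wheel you installed; the version metadata can.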

Side-by-Side Comparison

| Factor | CUDA (NVIDIA) | ROCm (AMD) |
|---|---|---|
| License | Proprietary | Open source (MIT / Apache) |
| Linux driver setup | Automated via ubuntu-drivers or DKMS | Manual via amdgpu-dkms + group membership |
| PyTorch support | Full, first-class, all versions | Full for stable ROCm releases; nightly for latest |
| Specialized libraries | cuDNN, cuBLAS, TensorRT, Flash Attention 2 | MIOpen, rocBLAS; Flash Attention 2 limited |
| Consumer GPU support | GTX 1060+ fully supported | RX 7800 XT, RX 7700 XT, RX 7900 series, RX 9070/9070 XT/9070 GRE, RX 9060 XT officially supported in ROCm 7.2.1; older cards via HSA override |
| Silent fallback risk | Low (driver mismatch gives clear errors) | Higher (unsupported GPU silently uses CPU) |
| Performance vs CUDA | Baseline | 10-30% behind in optimized workloads; narrowing |
| Hardware cost | Premium | 15-40% lower at comparable tiers |
| Portability (HIP) | Code runs NVIDIA-only | HIP code compiles for AMD and NVIDIA |

CUDA: Driver Setup and Verification

NVIDIA's driver story on Linux has improved substantially -- for full context on the open kernel module transition and driver versioning, see NVIDIA Linux Drivers: Open Modules, GSP Firmware, and the Road to Blackwell. Ubuntu's ubuntu-drivers tool detects the recommended driver and installs it with a single command. DKMS handles kernel module rebuilds after kernel updates automatically.

terminal
# Show recommended driver for your GPU
$ ubuntu-drivers devices

# Install automatically
$ sudo ubuntu-drivers autoinstall

# Or install a specific version manually (use the version shown by ubuntu-drivers devices)
$ sudo apt install nvidia-driver-570

# Reboot and verify
$ sudo reboot
$ nvidia-smi

The nvidia-smi output is your primary verification tool. The table it prints shows the driver version, the maximum CUDA version that driver supports (top-right corner), GPU name, temperature, power draw, memory usage, and any running GPU processes. The CUDA version shown here is the ceiling for PyTorch wheel compatibility -- you cannot install a PyTorch wheel targeting a higher CUDA version than your driver supports and expect GPU acceleration to work.

terminal
# Real-time GPU utilization monitor
$ nvidia-smi dmon -s u

# Confirm PyTorch sees the GPU
$ python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

# Check which CUDA version PyTorch was built against
$ python3 -c "import torch; print('PyTorch CUDA:', torch.version.cuda)"
# This must be <= the CUDA Version shown by nvidia-smi
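The ceiling rule in the comment above can be made explicit. A minimal sketch -- wheel_compatible is a hypothetical helper, not a PyTorch or NVIDIA API -- comparing major.minor version pairs:

```python
def wheel_compatible(driver_cuda_ceiling, wheel_cuda):
    """True if the PyTorch wheel's CUDA version fits under the driver's ceiling.

    driver_cuda_ceiling: the "CUDA Version" nvidia-smi prints (the driver's max).
    wheel_cuda: torch.version.cuda from the installed wheel.
    """
    def parse(v):
        # Compare only major.minor, e.g. "12.4" -> (12, 4)
        return tuple(int(x) for x in v.split(".")[:2])
    return parse(wheel_cuda) <= parse(driver_cuda_ceiling)

print(wheel_compatible("12.4", "12.1"))  # True: wheel is under the ceiling
print(wheel_compatible("12.1", "12.4"))  # False: wheel needs a newer driver
```

If the check fails, upgrade the driver or install an older PyTorch wheel; installing a newer wheel over an old driver produces a CPU-only runtime.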

ROCm: Driver Setup and Verification

ROCm 7.2.1 -- released March 25, 2026 -- is the current production release (source: AMD ROCm Release Notes). Beyond the Ubuntu 24.04.4 support headline, 7.2.1 fixes two bugs that directly affect AI workloads: a doubling of runtime latency in the hipStreamCreate API that was introduced in ROCm 7.2.0 (GitHub issue #5978, which impacted RCCL collective benchmarks but not PyTorch application-level performance), and a memory coherency issue specific to gfx1201 GPUs (the RX 9070 / 9070 XT / 9070 GRE architecture). If you installed ROCm 7.2.0 on an RX 9070 series card, the 7.2.1 upgrade is not optional: the coherency fix addresses data corruption under certain workloads. The AMD_DIRECT_DISPATCH environment variable is also deprecated in 7.2.1's HIP runtime; if it is set in any .bashrc, service file, or container config, remove it before upgrading.

AMD's recommended installation method as of ROCm 7.2.1 is the native package manager -- AMD has removed the amdgpu-install documentation from the primary install path and flags it as a legacy process (source: ROCm Installation Overview). The amdgpu-install .deb package still works and is used in the quick-start guide, but the package manager path via the APT repository is now the canonical method.

Two post-install steps are easy to miss regardless of method: adding your user to the render and video groups, and rebooting before verification. Missing either will produce confusing errors. In multi-user environments, be deliberate about which accounts get render group membership -- it grants direct GPU compute access; the Linux user permissions audit guide covers how to review group memberships and identify overprivileged accounts.

Ubuntu version matters: ROCm 7.2.1 supports Ubuntu 24.04.4 only -- 24.04.3 is now end-of-life for ROCm. Note also that AMD SMI (amd-smi) is replacing rocm-smi as the primary GPU management utility; AMD has flagged rocm-smi for phase-out, so new setups should use amd-smi for health checks and monitoring.

terminal (Ubuntu 24.04.4 -- recommended package manager method)
# Step 1: Install prerequisites
$ sudo apt update
$ sudo apt install python3-setuptools python3-wheel wget

# Step 2: Download and import the AMD ROCm GPG signing key
$ sudo mkdir --parents --mode=0755 /etc/apt/keyrings
$ wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
  gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null

# Step 3: Add ROCm 7.2.1 repositories (Ubuntu 24.04 / Noble)
$ sudo tee /etc/apt/sources.list.d/rocm.list << EOF
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/7.2.1 noble main
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/graphics/7.2.1/ubuntu noble main
EOF

# Step 4: Pin the ROCm repo to avoid conflicts
$ sudo tee /etc/apt/preferences.d/rocm-pin-600 << EOF
Package: *
Pin: release o=repo.radeon.com
Pin-Priority: 600
EOF

$ sudo apt update

# Step 5: Add user to render and video groups (critical -- skip this and GPU access fails)
$ sudo usermod -a -G render,video $LOGNAME

# Step 6: Install ROCm
$ sudo apt install rocm

# Step 7: Reboot before verification
$ sudo reboot

Note: Legacy amdgpu-install method

AMD still provides the amdgpu-install .deb package as a legacy install path, but as of ROCm 7.2.1 the documentation for it has been removed from the primary install flow. If you prefer the legacy method, download the package for Ubuntu 24.04.4 with:

terminal
$ wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb

The same group membership and reboot steps apply. For the Runfile Installer (offline installs), see the ROCm installation documentation.

Note: iGPU Conflicts and the ROCR_VISIBLE_DEVICES Workaround

ROCm does not currently support AMD integrated graphics for compute workloads. If your system has an AMD iGPU alongside a discrete GPU, disable the iGPU in BIOS before installing ROCm. If the runtime detects the iGPU, it may crash even when you attempt to exclude it. The issue is that HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES operate at different layers of the ROCm stack. HIP_VISIBLE_DEVICES controls which devices are visible to the HIP runtime layer; ROCR_VISIBLE_DEVICES controls visibility at the lower ROCm runtime (ROCR) layer, which is where iGPU enumeration happens before HIP ever sees it. On systems where BIOS disabling is not available — laptops, mini-PCs, and Ryzen systems where the iGPU drives display output — set ROCR_VISIBLE_DEVICES to the PCI device ID of only the discrete GPU to prevent the iGPU from being enumerated at the ROCR layer entirely:

terminal
# Find the discrete GPU's device ID in rocminfo output, then hard-code it --
# do not script the lookup, since enumeration order can change across boots
$ rocminfo | grep -E "Marketing Name|Device Type|Uuid"
$ export ROCR_VISIBLE_DEVICES=0   # replace 0 with your discrete GPU's ID

Also confirm you are running Ubuntu 24.04.4 -- ROCm 7.2.1 marks Ubuntu 24.04.3 as end-of-life. AMD Ryzen processor users are specifically advised to use Ubuntu's inbox graphics drivers (not the separate AMDGPU stack) alongside ROCm 7.2.1 -- this is a separate path from the Radeon discrete GPU install described above (source: AMD ROCm 7.2.1 Radeon and Ryzen Release Notes).

Looking ahead: Ubuntu 26.04 native packaging and TheRock build system

Canonical announced in December 2025 an expanded collaboration with AMD to package and maintain ROCm directly in Ubuntu's official repositories, targeting Ubuntu 26.04 LTS (Resolute Raccoon, releasing April 2026). The goal is to make ROCm installable via a simple apt install rocm with up to 15 years of Ubuntu Pro support and automatic security updates -- eliminating the manual external-repo setup described above. As of early April 2026, the Ubuntu 26.04 archive carries ROCm 7.1 packages and the responsible Canonical engineer's upload rights were only approved on March 31, so whether ROCm 7.2 lands in the archive by launch day remains unresolved. Until those packages land and stabilize, the AMD upstream repository method documented here remains the correct path for ROCm 7.2.1 on Ubuntu 24.04.4 (sources: Canonical ROCm announcement; Phoronix, Ubuntu 26.04 ROCm state, March 2026).

Separately, AMD's TheRock build system — the next-generation ROCm build infrastructure — is currently in technology preview and is planned to replace the current production stream in mid-2026. TheRock introduces architecture-specific Python packages (targeting only your installed GPU architecture, rather than a fat multi-arch install), ManyLinux_2_28 compliance for better multi-distro portability, and a slimmed-down SDK focused on core compute. Nightly builds are available at rocm.nightlies.amd.com. For users who need a newer GPU target — such as RDNA 4 (gfx1200/gfx1201) support that postdates the current stable release — TheRock nightly builds are the practical path ahead of the next stable release, rather than the HSA override workaround. AMD has indicated that TheRock will begin an expanded hardware support cadence (approximately every six weeks) once the technology preview transitions to production.

Verifying the ROCm installation

terminal
# List agents -- look for your GPU with its marketing name and GFX version
$ rocminfo | grep -A2 "Marketing Name"

# Check kernel driver is loaded
$ lsmod | grep amdgpu

# Confirm group membership
$ groups $USER
# Must include both render and video

# ROCm version
$ cat /opt/rocm/.info/version

# GPU visible to ROCm compute layer
$ clinfo | grep "Device Type"

# Monitor GPU utilization during a workload
$ amdgpu_top   # or: radeontop

The HSA Override: When ROCm Silently Falls Back

This is the single most important practical difference between CUDA and ROCm on Linux for AI work, and it trips up a disproportionate number of AMD users: ROCm performs hardware compatibility checks using GFX version strings. If your installed GPU is not in ROCm's official support list, the runtime rejects it and falls back to CPU inference -- without printing an error message. torch.cuda.is_available() returns False. Ollama runs at 3 tokens per second. Nothing tells you why.

The fix is HSA_OVERRIDE_GFX_VERSION, an environment variable that bypasses the compatibility check by telling the ROCm runtime to treat your GPU as if it were a supported architecture. Set it to the GFX version of the nearest officially supported card.

terminal
# Find your GPU's GFX version
$ rocminfo | grep -E "Name.*gfx|gfx[0-9]"
# Example output: Name: gfx1031
# gfx1031 is an RX 6700 XT -- not officially supported
# Nearest supported: gfx1030 -- so use 10.3.0

# Test the override in the current shell session
$ export HSA_OVERRIDE_GFX_VERSION=10.3.0
$ python3 -c "import torch; print(torch.cuda.is_available())"
# Should now return True

To make the override persistent for Ollama, add it to the systemd service override file. For PyTorch scripts run manually, add it to your ~/.bashrc or ~/.profile:

/etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
# Adjust the version to match your card

Warning

The override bypasses compatibility checks but does not guarantee every GPU operation will succeed. Unsupported cards may fail on specific compute kernels, particularly in operations that rely on architecture-specific features introduced in the card the GFX version targets. For training-heavy workloads, test thoroughly before relying on an overridden card in production.

Two Distinct Override Failure Modes

The override resolves one specific failure: the ROCm runtime rejecting a GPU because its GFX version string is not in the supported list. It does not resolve a different failure that looks nearly identical: rocBLAS failing because it cannot find a matching Tensile kernel library entry for your GPU architecture. When this happens, you will see an error like Cannot read TensileLibrary.dat for GPU arch: gfx1032 even after the HSA override has passed the HIP compatibility check. torch.cuda.is_available() returns True, but the first matrix multiply crashes. These are separate problems with separate solutions.

The rocBLAS / TensileLibrary failure happens because rocBLAS ships pre-compiled Tensile kernels only for officially targeted architectures. When your GPU is masquerading as a different GFX version via the override, rocBLAS looks up kernels by the real hardware architecture string — not the override value — and finds no entry. The workaround for this class of error is either: (a) using community-built rocBLAS binaries compiled with your actual GFX target explicitly included (these exist for common unsupported architectures), or (b) building rocBLAS from source with -D GPU_TARGETS=gfx1032 (or your specific target) so that Tensile compiles real kernels for your hardware. Building from source is the more stable resolution for any unsupported card where you need reliable training rather than just basic inference.
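The lookup mismatch can be illustrated without a ROCm install. A sketch of the behavior described above -- the helper names are hypothetical; the key fact, that rocBLAS keys its kernel lookup on the real hardware architecture string rather than the override value, comes from the failure mode itself:

```python
import re

def tensile_targets(filenames):
    """Extract gfx architecture targets from Tensile library filenames."""
    found = set()
    for name in filenames:
        m = re.search(r"gfx[0-9a-f]+", name)
        if m:
            found.add(m.group(0))
    return found

def rocblas_has_kernels(filenames, real_arch, hsa_override=None):
    # rocBLAS looks up kernels by the REAL hardware arch string;
    # HSA_OVERRIDE_GFX_VERSION is irrelevant at this layer, so the
    # override argument is deliberately ignored.
    return real_arch in tensile_targets(filenames)

shipped = ["TensileLibrary_gfx1030.dat", "TensileLibrary_gfx1100.dat"]
# The override satisfies HIP's check, but rocBLAS still finds no gfx1032 entry:
print(rocblas_has_kernels(shipped, "gfx1032", hsa_override="10.3.0"))  # False
```

This is why torch.cuda.is_available() can return True while the first matrix multiply crashes: the two checks consult different layers.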

Multi-GPU Setups With Mixed Architectures

When two AMD GPUs of different GFX architectures are present — for example, an Instinct card alongside a consumer Radeon — HSA_OVERRIDE_GFX_VERSION applies globally to all devices. Setting a single global override that targets one card's nearest supported version will misrepresent the other card, producing either incorrect behavior or a crash. The documented solution is per-device indexed overrides:

terminal — per-device indexed HSA override (mixed-architecture multi-GPU)
# Single global override — applies to ALL devices (not suitable for mixed-arch setups)
$ export HSA_OVERRIDE_GFX_VERSION=10.3.0

# Per-device indexed overrides — device 0 and device 1 get separate values
$ export HSA_OVERRIDE_GFX_VERSION_0=10.3.0
$ export HSA_OVERRIDE_GFX_VERSION_1=11.0.0

# Verify which device index maps to which physical GPU
$ rocminfo | grep -E "Agent [0-9]|Marketing Name|Name.*gfx"

Device index assignment follows the order GPUs are enumerated by the ROCm HSA runtime, which may not match the order shown by amd-smi. Always verify the index-to-device mapping with rocminfo before writing per-device overrides into persistent service files or shell profiles.

GPU Hang and Reset Mitigation Flags

On certain older architectures — particularly RDNA 1 (gfx1010, gfx1012) and some Vega-family cards — running compute workloads under an HSA override can produce intermittent GPU hangs or driver-level resets even when the basic HIP check passes. Two additional environment variables can reduce or eliminate these:

terminal — stability flags for older or overridden hardware
# Disable SDMA (System DMA engine) -- resolves hangs on some RDNA 1 / Vega systems
$ export HSA_ENABLE_SDMA=0

# Disable MWAITX instruction use -- resolves certain spin-wait hangs
$ export HSA_ENABLE_MWAITX=0

# Use subquadratic cross-attention in ComfyUI / diffusion pipelines on older RDNA
# (more stable than split cross-attention, lower VRAM than full attention)
$ export PYTORCH_ATTENTION_MODE=sub_quadratic

Note: Building ROCm for a Specific Unsupported Target

For RDNA 1 GPUs (gfx1010, gfx1012, gfx1031) and other architectures where the HSA override produces mismatched instruction sets and the rocBLAS TensileLibrary errors persist, the cleanest resolution is compiling ROCm components with the actual GFX target explicitly specified. The community-maintained rocm_sdk_builder project provides scripts for this. When building rocBLAS specifically, the critical CMake flag is -D GPU_TARGETS=gfx1010 (substitute your actual target). This produces native Tensile kernels for your hardware rather than forcing gfx1030 kernels to run on a gfx1010 die. The result is lower peak throughput on some operations — the instruction set genuinely differs — but stable execution rather than random crashes, which is the appropriate trade-off for unsupported-hardware situations. AMD's TheRock build system (the next-generation ROCm build infrastructure, currently in technology preview through mid-2026) uses architecture-specific Python packages designed to reduce the overhead of multi-target builds and will make targeted unsupported-arch builds easier to maintain once it replaces the current production stream.

Threat Surface: Environment Variable Injection & Service File Persistence

When you write HSA_OVERRIDE_GFX_VERSION into a systemd service override file or ~/.bashrc, you are adding a persistent environment variable that any process inheriting that environment can read. In shared or multi-user AI lab setups, environment variable injection is a documented lateral movement and persistence technique -- an attacker who can modify a service unit file or a user's shell profile can redirect or manipulate the behavior of any process that trusts those variables.

Review who has write access to /etc/systemd/system/ overrides and keep your service unit file permissions tight. The unauthorized crontab modification guide covers the broader class of persistence via shell environment and scheduling files, including auditd detection rules that apply here.

MITRE T1574 (Hijack Execution Flow) · MITRE T1053.006 (Scheduled Task: Systemd Timers) · MITRE T1546.004 (Unix Shell Configuration Modification) · NIST SP 800-53 CM-7 (Least Functionality) · NIST SP 800-53 AC-6 (Least Privilege)

Multi-GPU systems: per-device override syntax

For systems with multiple AMD GPUs that have different GFX versions, use the per-device variable form: HSA_OVERRIDE_GFX_VERSION_0, HSA_OVERRIDE_GFX_VERSION_1, and so on. The index follows HSA agent enumeration order as reported by rocminfo. Ollama's official documentation uses 0-based indexing (e.g. HSA_OVERRIDE_GFX_VERSION_0 and HSA_OVERRIDE_GFX_VERSION_1), but on some systems the CPU occupies agent 0, shifting GPU indices so that the first GPU is agent 1 -- always verify your mapping with rocminfo | grep -E "Agent [0-9]|Marketing Name|Device Type" before writing persistent overrides into service files or shell profiles (sources: field notes on dual AMD GPU ROCm setups; Ollama GPU documentation).

One distinction worth knowing: ROCR_VISIBLE_DEVICES operates at the HSA/ROCr runtime layer, below HIP -- it restricts which physical GPUs the entire ROCm stack can see, before HIP enumerates them. HIP_VISIBLE_DEVICES operates at the HIP layer and filters the devices HIP presents to your application. The two variables can produce different device numbering. If you set both and the indices disagree, ROCR_VISIBLE_DEVICES takes precedence at the hardware layer -- which means HIP_VISIBLE_DEVICES=1 may refer to a different physical card than you expect if ROCR_VISIBLE_DEVICES has already filtered the list.
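The two-layer filtering can be modeled in a few lines. This is a sketch, not ROCm code -- visible_devices is a hypothetical function; the precedence it encodes (ROCR filters first, HIP then indexes into the already-filtered list) is the behavior described above:

```python
def visible_devices(physical, rocr_filter=None, hip_filter=None):
    """Model ROCR_VISIBLE_DEVICES filtering before HIP_VISIBLE_DEVICES.

    physical: list of physical GPUs in HSA enumeration order.
    rocr_filter / hip_filter: lists of device indices, as the env vars hold,
    or None when the variable is unset.
    """
    # ROCR layer filters the hardware list first, re-indexing from 0
    after_rocr = [physical[i] for i in rocr_filter] if rocr_filter is not None else list(physical)
    # HIP layer then indexes into the ALREADY-FILTERED list
    return [after_rocr[i] for i in hip_filter] if hip_filter is not None else after_rocr

gpus = ["card-A", "card-B", "card-C"]
# HIP index 1 no longer means "card-B" once ROCR has filtered the list:
print(visible_devices(gpus, rocr_filter=[1, 2], hip_filter=[1]))  # ['card-C']
```

Setting both variables with mismatched assumptions is how workloads end up pinned to the wrong physical card.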

Common GFX versions and their override values

| GPU | Actual GFX | Override GFX | Notes |
|---|---|---|---|
| RX 6600 XT | gfx1032 | 10.3.0 | RDNA 2, not officially listed; nearest supported is gfx1030 |
| RX 6700 XT | gfx1031 | 10.3.0 | RDNA 2, not officially listed; nearest supported is gfx1030 |
| RX 6800 / 6800 XT / 6900 XT | gfx1030 | Officially supported (Radeon Pro W6800 is listed; check your OS) | gfx1030 architecture; verify against current ROCm system requirements for your distro |
| RX 7600 | gfx1102 | 11.0.0 | RDNA 3, not officially listed; nearest supported is gfx1100 |
| RX 7700 (non-XT) | gfx1101 | Not needed -- officially supported (ROCm 7.2.0+) | RDNA 3, added in ROCm 7.2.0 |
| RX 7700 XT | gfx1101 | Not needed -- officially supported (ROCm 6.4.2+) | RDNA 3, added to Linux support in ROCm 6.4.2 |
| RX 7800 XT | gfx1101 | Not needed -- officially supported | RDNA 3, full official support |
| RX 7900 XTX / XT / GRE | gfx1100 | Not needed -- officially supported | RDNA 3, full official support |
| RX 9070 / 9070 XT / 9070 GRE | gfx1201 | Not needed -- officially supported (ROCm 7.0+) | RDNA 4, full official support |
| RX 9060 XT / 9060 XT LP | gfx1200 | Not needed -- officially supported (ROCm 7.2.0+) | RDNA 4, added in ROCm 7.2.0 |
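The mapping condenses into a small lookup. A sketch assuming only the cards covered in this article -- hsa_override_for is a hypothetical helper, not a ROCm API, and the tables inside it mirror the rows above rather than the full AMD support matrix:

```python
def hsa_override_for(gfx):
    """Return the HSA_OVERRIDE_GFX_VERSION value for a GFX string,
    None if the architecture is officially supported, or raise if unknown."""
    # Architectures officially in the ROCm 7.2.1 Radeon support matrix
    officially_supported = {"gfx1030", "gfx1100", "gfx1101", "gfx1200", "gfx1201"}
    # Unsupported consumer cards mapped to the nearest supported family
    nearest_supported = {
        "gfx1031": "10.3.0",  # RX 6700 XT -> treat as gfx1030
        "gfx1032": "10.3.0",  # RX 6600 XT -> treat as gfx1030
        "gfx1102": "11.0.0",  # RX 7600    -> treat as gfx1100
    }
    if gfx in officially_supported:
        return None  # no override needed
    if gfx in nearest_supported:
        return nearest_supported[gfx]
    raise ValueError(f"no mapping for {gfx}; check the ROCm support matrix")

print(hsa_override_for("gfx1031"))  # → 10.3.0
print(hsa_override_for("gfx1100"))  # → None (RX 7900 series needs no override)
```

Feed the GFX string from rocminfo into a helper like this before writing anything into a service file, and remember the warning above: a non-None result means you are outside the support matrix.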

PyTorch on CUDA vs ROCm: The Practical Reality

From the Python side, PyTorch on ROCm is designed to be identical to PyTorch on CUDA: the same torch.cuda API, the same tensor operations, the same training loop. The differences emerge deeper in the library stack, where MIOpen and rocBLAS stand in for cuDNN and cuBLAS, and coverage for specialized kernels such as Flash Attention 2 is more limited.

Known issues as of ROCm 7.2.1

Three real-world issues are worth knowing before committing to hardware.

First, the RX 9060 XT has active bug reports on GitHub (ROCm issue #5999) describing system crashes -- including black screens and session drops -- when running dockerized JAX or PyTorch workloads under ROCm 7.2.0 on Ubuntu 24.04.4. The card is officially supported from ROCm 7.2.0, but stability on this specific GPU should be verified before deploying it in production AI workflows.

Second, the Radeon AI Pro R9700 has a confirmed AMDGPU SMU driver interface version mismatch on ROCm 7.2.1 (issue #6101): the card firmware is four interface versions ahead of the AMDGPU driver, preventing fan control registers from being reached and causing thermal throttling at 109°C with the fan physically stationary. This is a serious safety concern on that specific card until a matched driver ships. Both issues are open and unresolved as of early April 2026.

Third, AMD Instinct MI300X users in CPX or NPS4 partition mode (38 CUs per partition) may see significantly longer GEMM runtimes under ROCm 7.2.1 because hipBLASLt cannot find pre-tuned kernels for those partition dimensions and falls back to an exhaustive search. The specific failure mode is matrix configurations such as 16384×16384 in TN mode, where search time dominates. AMD has a fix in the hipBLASLt develop branch (GitHub issue #6065) that will ship in a future release, but it is not in 7.2.1. If you are seeing unexpectedly slow transformer attention layers on MI300X, this is the likely cause.

AMD's stated goal for the Canonical ROCm collaboration, announced December 2025, is open, high-performance GPU acceleration for AI and HPC workloads on AMD hardware -- treated as a first-class workload in Ubuntu's official repositories.

When AMD ROCm Is the Right Choice

The hardware cost argument for AMD is real at the consumer tier. An RX 7900 XTX with 24 GB of VRAM costs substantially less than an RTX 4090 with 24 GB, and for memory-bound workloads like large-context LLM inference, VRAM capacity matters more than raw compute throughput. The Linux GPU tiers guide ranks AMD, NVIDIA, and Intel cards by driver support and compute performance -- and breaks down which cards offer the best VRAM-per-dollar at each price tier -- so you can map hardware choices to your budget before committing. AMD's Instinct line in the data center offers similar value advantages at scale. For verified pricing and positioning, see the AMD Radeon RX 7900 XTX product page.

ROCm also appeals on principle for anyone who cares about open-source software ecosystems. The entire ROCm stack is open source and auditable in a way that CUDA never will be. For anyone operating under a security policy that requires software supply chain review, that distinction is substantive -- CUDA's closed binary stack cannot be audited, and kernel-level code running at the highest privilege ring is a documented attack surface for which source visibility matters. HIP-based code can be compiled for both AMD and NVIDIA hardware, reducing vendor lock-in for teams building portable AI tooling. The broader picture on how Linux systems get compromised -- including supply chain vectors like trojanized binaries and BPF-based rootkits that are invisible to userspace -- is relevant context for any engineer responsible for a GPU server running AI workloads.

Canonical's December 2025 announcement that ROCm will be packaged natively in Ubuntu's official repositories starting with Ubuntu 26.04 LTS is a meaningful ecosystem signal -- it treats ROCm on equal footing with CUDA in terms of distro-level support and mirrors a parallel CUDA packaging announcement from September 2025. The promised apt install rocm experience with up to 15 years of Ubuntu Pro support would further reduce the setup barrier that has historically been ROCm's biggest practical disadvantage. The packages are not fully in place yet as of April 2026, but the direction is clear (source: Canonical blog).


Practically speaking, AMD ROCm on Linux is a good choice when: you have officially supported hardware (RX 7900 series, RX 9000 series, or Instinct cards); your primary workloads are LLM inference, standard PyTorch training without specialist libraries, or data pipelines; and you are comfortable with the additional driver setup complexity and the need to monitor for silent CPU fallback. If your interest in local AI also extends to understanding how AI workloads intersect with threat modeling and organizational security posture, kandibrian.com covers cybersecurity training and certification prep for engineers building in these environments.

CUDA remains the safer default when: you are working with specialized inference tools (TensorRT, CUDA-specific kernels), using quantization libraries where ROCm support is incomplete, building something for deployment on hardware you do not control, or when the documentation for your specific framework or tutorial assumes NVIDIA. For anyone deploying AI infrastructure on Linux with a security requirement -- whether that means audit logging, least-privilege service accounts, or network segmentation between the inference server and the rest of the stack -- the zero-trust Linux implementation guide covers the concrete configurations that apply regardless of which GPU stack you choose.

Unified Verification Workflow

Regardless of which stack you are using, this sequence confirms end-to-end GPU acceleration is working correctly:

terminal
# ── NVIDIA ──────────────────────────────────────────────
# Step 1: Driver and CUDA version
$ nvidia-smi

# Step 2: PyTorch GPU detection
$ python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# Step 3: Confirm GPU is active during inference (run in separate terminal)
$ nvidia-smi dmon -s u
# Run a model in another terminal and watch GPU utilization climb


# ── AMD ROCm ─────────────────────────────────────────────
# Step 1: ROCm stack and GPU agent detection
$ rocminfo | grep -A2 "Marketing Name"

# Step 2: Group membership check
$ groups $USER | grep -E "render|video"

# Step 3: PyTorch GPU detection (ROCm uses torch.cuda API)
$ python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# Step 4: Confirm GPU utilization during inference
$ amdgpu_top
# Non-zero GPU utilization confirms the GPU is active, not the CPU


# ── Both stacks: quick tensor smoke test ──────────────────
$ python3 -c "
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Device: {device}')
x = torch.rand(1000, 1000).to(device)
y = torch.mm(x, x)
print(f'Matrix multiply result shape: {y.shape}')
print(f'GPU used: {device == \"cuda\"}')
"

How to Verify GPU Acceleration is Working on Linux for AI Workloads

Step 1: Verify NVIDIA CUDA driver and GPU detection

Run nvidia-smi. A clean output table showing your GPU name, driver version, and CUDA Version confirms the driver is loaded. The CUDA Version shown is the ceiling for PyTorch wheel compatibility -- note it before installing any PyTorch CUDA wheel. If nvidia-smi fails or is not found, the driver is not installed or not loaded by the kernel.
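The same check can be scripted for automation. A minimal sketch, assuming `nvidia-smi` is on PATH; the helper names (`parse_smi_row`, `query_gpu`) are illustrative, not part of any NVIDIA tooling:

```python
import subprocess

def parse_smi_row(line: str) -> dict:
    """Parse one CSV row from `nvidia-smi --query-gpu=driver_version,name`."""
    driver, name = [field.strip() for field in line.split(",", 1)]
    return {"driver": driver, "gpu": name}

def query_gpu() -> dict:
    """Ask nvidia-smi for the driver version and GPU name of device 0."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version,name",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_smi_row(out.stdout.strip().splitlines()[0])

if __name__ == "__main__":
    try:
        print(query_gpu())
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi failed -- NVIDIA driver not installed or not loaded")
```

The CUDA Version ceiling itself appears only in the `nvidia-smi` banner, not in the `--query-gpu` fields, so note it from the table output before installing a PyTorch wheel.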

Step 2: Verify AMD ROCm stack and GPU detection

Run rocminfo | grep -A2 "Marketing Name". The output should list your GPU with its marketing name and GFX version string. Also confirm group membership with: groups $USER | grep -E 'render|video'. Missing group membership is a frequent cause of ROCm permission errors. If your GPU's GFX version is not in ROCm's supported list, you may need to set HSA_OVERRIDE_GFX_VERSION to the nearest supported GFX version.

Step 3: Verify PyTorch GPU access

For both CUDA and ROCm, run: python3 -c 'import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))'. ROCm uses the same torch.cuda API as CUDA. A result of True followed by your GPU name confirms end-to-end GPU access from Python. False means the driver, wheel version, group membership, or HSA override needs investigation. On ROCm specifically, also run python3 -c "import torch; print(torch.version.hip)" -- this prints the HIP version your PyTorch wheel was built against (e.g. 6.2.41134), which is the definitive confirmation that you have the ROCm PyTorch wheel and not the CPU or CUDA build accidentally installed. If torch.version.hip returns None, you have the wrong wheel regardless of what torch.cuda.is_available() says. AMD's own profiling documentation also notes that profiling PyTorch workloads on ROCm may fail because ROCm libraries are not on the default linker path -- if you encounter missing library errors when running with profilers, add export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH to your environment.
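The wheel-identification logic in this step can be captured as a small pure function: pass in `torch.version.hip` and `torch.version.cuda` and it names which build is installed. The function name and messages are illustrative, not a PyTorch API:

```python
def identify_torch_wheel(hip, cuda) -> str:
    """Classify a PyTorch install from its reported HIP and CUDA versions."""
    if hip is not None:
        # torch.version.hip is non-None only on ROCm builds
        return f"ROCm wheel (HIP {hip})"
    if cuda is not None:
        return f"CUDA wheel (CUDA {cuda})"
    return "CPU-only wheel -- reinstall with the ROCm or CUDA index URL"

# Usage on a live system:
# import torch
# print(identify_torch_wheel(torch.version.hip, torch.version.cuda))
```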

Step 4: Monitor GPU utilization during inference or training

Detection returning True is necessary but not sufficient -- confirm the GPU is actually doing work. For NVIDIA, run nvidia-smi dmon -s u in a separate terminal and watch the utilization percentage climb during a training loop or inference request. For AMD, run amd-smi monitor (the successor to the deprecated rocm-smi), or amdgpu_top. Near-zero utilization while a model is generating means the workload is on CPU despite positive detection results.

MIOpen Cold-Start Kernel Compilation

MIOpen does not ship pre-compiled kernels for every GPU and workload combination. On first run, it compiles optimized kernels on-device for your specific GPU architecture and the specific operation shapes it encounters. This first-run compilation can take three to ten minutes and manifests as what looks like a hang at the first training step or the first inference request -- no output, no progress indicator, just a frozen terminal. This is expected behavior, not a crash.

The compiled kernels are cached at ~/.cache/miopen/. On a standard persistent Linux installation, this means the cold-start only happens once per GPU and per workload type encountered. Subsequent runs use the cache and start immediately. The problem becomes critical in containerized or ephemeral environments where the home directory is not persisted between runs: every container start triggers a fresh cold-start compilation, adding minutes of latency before the first inference token or training step begins.
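A quick way to confirm the cache is being populated, and to estimate what a persistent volume needs to hold, is to walk `~/.cache/miopen`. A stdlib-only sketch; the path is MIOpen's default user cache location:

```python
from pathlib import Path

def cache_summary(cache_dir: Path):
    """Return (file_count, total_bytes) for a kernel cache directory tree."""
    files = [p for p in cache_dir.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)

if __name__ == "__main__":
    miopen = Path.home() / ".cache" / "miopen"
    if miopen.exists():
        count, size = cache_summary(miopen)
        print(f"{count} cached kernel files, {size / 1e6:.1f} MB")
    else:
        print("No MIOpen cache yet -- first run will trigger compilation")
```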

The correct solution for containerized ROCm deployments is to mount the MIOpen cache directory as a persistent volume:

Docker / Podman — persisting the MIOpen kernel cache
# Docker: mount the MIOpen cache as a named volume
$ docker run --device=/dev/kfd --device=/dev/dri \
  -v miopen-cache:/root/.cache/miopen \
  -v rocblas-cache:/root/.cache/rocblas \
  rocm/pytorch:latest

# docker-compose equivalent
# volumes:
#   miopen-cache:
# services:
#   inference:
#     volumes:
#       - miopen-cache:/root/.cache/miopen

# Pre-warm the cache before putting a container into production
# Run a small representative workload to trigger compilation for your model shapes
$ python3 -c "
import torch
device = 'cuda'
# Run a representative set of matrix shapes from your actual workload
shapes = [(1, 4096, 4096), (8, 2048, 2048), (32, 512, 512)]
for b, m, n in shapes:
    x = torch.rand(b, m, n).to(device)
    y = torch.rand(b, n, m).to(device)
    _ = torch.bmm(x, y)
print('MIOpen cache pre-warm complete')
"

For production deployments, include the pre-warm step as part of your container startup sequence before the inference server begins accepting requests. This prevents the first user request from triggering a three-to-ten-minute compilation delay. Run the pre-warm with shapes representative of your actual model's attention head dimensions and batch sizes -- MIOpen compiles kernels per shape, so a single generic matrix multiply is not sufficient if your model uses multiple distinct tensor dimensions.

Automated Silent CPU Fallback Detection

Manual monitoring with amd-smi monitor or nvidia-smi dmon is appropriate for interactive sessions, but it does not protect containerized workloads or services that run unattended. The correct approach for automated or production deployments is a startup assertion in your Python code that fails loudly if the GPU is not active, rather than silently running inference on CPU at 3 tokens per second.

torch.cuda.is_available() returning True is a necessary condition but not sufficient by itself -- it confirms the compute stack initialized, not that the GPU is processing tensors. On ROCm, the definitive confirmation is torch.version.hip being non-None. A timed smoke test catches the silent-CPU-fallback case that passes both checks but still runs on the processor:

gpu_assert.py — startup assertion for production deployments
import torch
import time
import sys

def assert_gpu_active(min_gflops: float = 500.0) -> None:
    """
    Raise RuntimeError if GPU compute is not active.
    min_gflops: minimum acceptable throughput for the smoke test (GFLOP/s).
    A CPU fallback will typically produce < 50 GFLOP/s on a matrix multiply
    of this size; a GPU will produce > 1000 GFLOP/s on modern hardware.
    Adjust the threshold to match your hardware if needed.
    """
    if not torch.cuda.is_available():
        raise RuntimeError(
            "GPU not detected: torch.cuda.is_available() returned False. "
            "Check driver installation, group membership (render/video for ROCm), "
            "and HSA_OVERRIDE_GFX_VERSION if using an unsupported AMD GPU."
        )

    # ROCm-specific: confirm we have the ROCm wheel, not the CPU or CUDA build
    if torch.version.hip is None and torch.version.cuda is None:
        raise RuntimeError(
            "PyTorch reports no CUDA or HIP version. This is likely the CPU-only wheel. "
            "Reinstall with the correct ROCm or CUDA wheel for your stack."
        )

    device = "cuda"
    size = 4096

    # Allocate operands and run one untimed warm-up multiply so first-call
    # overhead (context init, on-device kernel compilation) does not pollute
    # the timed measurement
    a = torch.rand(size, size, device=device, dtype=torch.float32)
    b = torch.rand(size, size, device=device, dtype=torch.float32)
    _ = torch.mm(a, b)
    torch.cuda.synchronize()

    # Timed matrix multiply smoke test
    start = time.perf_counter()
    c = torch.mm(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    # FLOPs for matrix multiply: 2 * N^3
    flops = 2 * (size ** 3)
    gflops = (flops / elapsed) / 1e9

    if gflops < min_gflops:
        raise RuntimeError(
            f"GPU smoke test failed: {gflops:.1f} GFLOP/s observed, "
            f"expected >= {min_gflops} GFLOP/s. "
            f"Workload may be running on CPU despite positive torch.cuda.is_available(). "
            f"Verify GPU utilization with amd-smi monitor or nvidia-smi dmon."
        )

    stack = f"ROCm {torch.version.hip}" if torch.version.hip else f"CUDA {torch.version.cuda}"
    print(f"GPU verified: {torch.cuda.get_device_name(0)} | {stack} | {gflops:.0f} GFLOP/s")

# Call at process startup before loading any model
assert_gpu_active()

The threshold of 500 GFLOP/s for the 4096x4096 float32 smoke test is deliberately conservative: a modern discrete GPU will deliver well above this figure, while a CPU fallback will produce 10–50 GFLOP/s on typical hardware. If you are running on lower-end or older GPU hardware, reduce the threshold accordingly -- but do not set it below 100 GFLOP/s or the assertion will not reliably distinguish GPU from CPU execution. Call assert_gpu_active() at the top of your inference server startup, before any model weights are loaded, so that a misconfigured stack fails fast with a clear error rather than silently degrading throughput.

Threat Model: Running a GPU Compute Stack on Linux

A GPU server running CUDA or ROCm has a substantially larger attack surface than a general-purpose Linux server. Both stacks install kernel-mode drivers that execute at ring 0. Both require group-level access (render, video) that grants direct GPU compute access to any process in that group. Many AI setups expose local model APIs (Ollama at port 11434, Jupyter at 8888, vLLM HTTP server) without authentication. And GPU servers frequently pull large binary blobs -- CUDA toolkit installers, pre-compiled model weights, Python wheels -- from external sources with varying supply chain integrity controls. Each of these elements maps directly to documented adversary techniques in the MITRE ATT&CK Enterprise framework.

// MITRE ATT&CK Techniques Relevant to GPU AI Servers on Linux
T1068 Exploitation for Privilege Escalation — kernel-mode GPU drivers run at ring 0; unpatched CVEs allow privilege escalation from user to root. Check installed driver versions and subscribe to vendor advisories before any production deployment.
audit
# AMD driver version
cat /opt/rocm/.info/version
apt-cache policy amdgpu-dkms
# NVIDIA driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Check which packages are eligible for unattended security updates
sudo unattended-upgrade --dry-run
AMD advisories: amd.com/en/resources/product-security.html — NVIDIA advisories: nvidia.com/en-us/security/
T1547.006 Boot or Logon Autostart: Kernel Modules and Extensions — malicious kernel modules load at boot alongside legitimate GPU drivers; DKMS rebuilds any registered module after kernel updates automatically. Audit the registry for unfamiliar entries and optionally enable module signature enforcement.
audit
# List all DKMS-registered modules — unexpected entries warrant investigation
dkms status
# Loaded GPU-related kernel modules
lsmod | grep -E "amdgpu|nvidia|nouveau"
# Enable module signature enforcement (locked-down systems only)
# Add module.sig_enforce=1 to GRUB_CMDLINE_LINUX in /etc/default/grub
# (unsigned modules will then refuse to load), then rebuild the boot config:
sudo update-grub
T1195.002 Compromise Software Supply Chain — a compromised GPG key or repository mirror delivers malicious drivers system-wide. The install steps above use scoped signed-by= keyrings. If you used older guides with apt-key add, the key is in the global trusted keyring and applies to all repos — migrate it.
audit
# Keys in legacy global keyring (should be empty on a clean setup)
apt-key list 2>/dev/null
# GPU repo sources missing signed-by= are trusting the global keyring
grep -rn "signed-by" /etc/apt/sources.list /etc/apt/sources.list.d/
# Verify PyTorch wheel checksums against pytorch.org/get-started/locally/
pip download torch --no-deps -d /tmp/wheels
sha256sum /tmp/wheels/*.whl
T1574 Hijack Execution Flow — environment variables like HSA_OVERRIDE_GFX_VERSION, HIP_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, CUDA_VISIBLE_DEVICES, and LD_LIBRARY_PATH are inherited by all child processes and redirect which GPU or library version loads. Audit what is set and lock service override files to root.
audit + harden
# Audit GPU env vars in service files and shell profiles
sudo systemctl show ollama --property=Environment
grep -rE "HSA_|HIP_|ROCR_|CUDA_|LD_LIBRARY" \
  ~/.bashrc ~/.profile ~/.bash_profile /etc/environment 2>/dev/null
# Lock service override file to root-only write
sudo chown root:root /etc/systemd/system/ollama.service.d/override.conf
sudo chmod 640 /etc/systemd/system/ollama.service.d/override.conf
T1546.004 Unix Shell Configuration Modification — writing GPU environment variables to ~/.bashrc or ~/.profile is structurally identical to shell profile persistence. Monitor with auditd and check for recent unauthorized changes.
/etc/audit/rules.d/gpu-hardening.rules
# Add these rules, then: sudo augenrules --load
-w /home -p wa -k shell_profile_mod
-w /etc/environment -p wa -k env_mod
-w /etc/systemd/system -p wa -k systemd_unit_mod
query
ausearch -k shell_profile_mod --start today
T1078.003 Valid Accounts: Local Accounts — the render and video groups grant direct GPU compute access via /dev/kfd and /dev/dri/renderD*. Scope membership to only what is explicitly required. Device files should be root:render at mode 660 — not world-readable.
audit + harden
# List all accounts in render and video groups
getent group render video
# Remove an account that does not need GPU compute access
sudo gpasswd -d USERNAME render
# Verify device file permissions (expect crw-rw---- root:render)
ls -la /dev/kfd /dev/dri/renderD*
# Audit rule: alert on any future group membership change
# Add to gpu-hardening.rules, then augenrules --load
-w /etc/group -p wa -k gpu_group_change
# Query alerts
ausearch -k gpu_group_change
T1046 / T1203 Network Service Discovery / Exploitation for Client Execution — AI inference servers (Ollama port 11434, vLLM HTTP, Jupyter 8888) exposed on non-localhost interfaces with no authentication are trivially discoverable. Confirm binding and restrict at the firewall.
audit + harden
# Should show 127.0.0.1:11434, not 0.0.0.0:11434
ss -tlnp | grep -E "11434|8888"
# Block inference ports from external reach regardless of service binding
sudo nft add rule inet filter input tcp dport 11434 ip saddr != 127.0.0.1 drop
sudo nft add rule inet filter input tcp dport 8888 ip saddr != 127.0.0.1 drop
# Persist nftables rules
sudo nft list ruleset > /etc/nftables.conf
Any legitimate remote access should go through WireGuard — not through direct port exposure.
T1652 Device Driver Discovery — adversaries enumerate lsmod output to identify GPU driver versions for CVE targeting or cryptomining deployment. Alert on lsmod invocations from non-administrative accounts.
/etc/audit/rules.d/gpu-hardening.rules
# Add to gpu-hardening.rules, then: sudo augenrules --load
-w /sbin/lsmod -p x -k lsmod_exec
query + optional lock
ausearch -k lsmod_exec
# Disable further module loading (locked-down systems only -- one-way until reboot)
echo "kernel.modules_disabled=1" | sudo tee /etc/sysctl.d/99-gpu-hardening.conf
sudo sysctl --system

Hardening Command Reference

The following code block consolidates the actionable commands from the threat entries above into a single runnable reference. Run the audit commands first to assess current state, then apply the hardening steps that are appropriate for your environment.

gpu-hardening.sh — audit and harden a Linux GPU AI server
#!/usr/bin/env bash
# GPU AI Server Hardening Reference
# Run audit steps first; apply hardening steps selectively for your environment.
# Requires root or sudo for most hardening steps.
# Tested on Ubuntu 24.04.4 with ROCm 7.2.1 and NVIDIA driver 560+.

## ── AUDIT: Current state ─────────────────────────────────────────────────

# T1068 — Check installed GPU driver versions
$ cat /opt/rocm/.info/version 2>/dev/null || echo "ROCm not found"
$ apt-cache policy amdgpu-dkms 2>/dev/null | grep Installed
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null || echo "nvidia-smi not found"

# T1547.006 — Audit DKMS registry for unexpected modules
$ dkms status
# Expected entries: amdgpu or nvidia. Any unrecognized module name warrants investigation.

# T1547.006 — List loaded GPU-related kernel modules
$ lsmod | grep -E "amdgpu|nvidia|nouveau"

# T1195.002 — Check for stale apt-key entries (should be empty on a clean setup)
$ apt-key list 2>/dev/null
# Any ROCm or NVIDIA key here is in the global keyring (insecure) — migrate it.

# T1195.002 — Verify all GPU repo sources use signed-by scoped keyrings
$ grep -rn "signed-by" /etc/apt/sources.list /etc/apt/sources.list.d/ 2>/dev/null
# Any ROCm or CUDA repository entry missing signed-by= is trusting the global keyring.

# T1574 — Audit GPU environment variables set in service files and shell profiles
$ sudo systemctl show ollama --property=Environment 2>/dev/null
$ grep -rE "HSA_|HIP_|ROCR_|CUDA_|LD_LIBRARY" \
    ~/.bashrc ~/.profile ~/.bash_profile /etc/environment 2>/dev/null

# T1078.003 — List all accounts in render and video groups
$ getent group render video
# Review: does every listed account genuinely need GPU compute access?

# T1078.003 — Verify GPU device file permissions
$ ls -la /dev/kfd /dev/dri/renderD* 2>/dev/null
# Expected: crw-rw---- root:render (660). World-readable (crw-rw-rw-) is a misconfiguration.

# T1046 — Check whether Ollama is binding to all interfaces (should be 127.0.0.1 only)
$ ss -tlnp | grep -E "11434|8888|8000"
# 0.0.0.0:11434 means Ollama is reachable from any host on your network.

## ── HARDENING: Apply where appropriate ──────────────────────────────────

# T1078.003 — Remove an account from the render group
$ sudo gpasswd -d USERNAME render

# T1546.004 / T1078.003 / T1652 — Deploy auditd rules for GPU server monitoring
# Write all rules to a dedicated rules file
$ sudo tee /etc/audit/rules.d/gpu-hardening.rules << 'EOF'
## GPU AI server hardening rules
# T1546.004: Alert on shell profile modifications (GPU env var persistence vector)
-w /home -p wa -k shell_profile_mod
-w /etc/environment -p wa -k env_mod
-w /etc/systemd/system -p wa -k systemd_unit_mod
# T1078.003: Alert on render/video group membership changes
-w /etc/group -p wa -k gpu_group_change
# T1652: Alert on lsmod execution by any process
-w /sbin/lsmod -p x -k lsmod_exec
EOF
$ sudo augenrules --load
$ sudo systemctl restart auditd

# T1546.004 — Check for recent shell profile modifications
$ ausearch -k shell_profile_mod --start today 2>/dev/null
$ ausearch -k gpu_group_change 2>/dev/null
$ ausearch -k lsmod_exec 2>/dev/null

# T1574 — Lock service override file to root-only write
$ sudo chown root:root /etc/systemd/system/ollama.service.d/override.conf
$ sudo chmod 640 /etc/systemd/system/ollama.service.d/override.conf

# T1046 — Restrict Ollama to localhost in its service override
$ sudo mkdir -p /etc/systemd/system/ollama.service.d/
$ sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
EOF
$ sudo systemctl daemon-reload && sudo systemctl restart ollama

# T1046 — Block inference ports at the host firewall regardless of service binding
$ sudo nft add rule inet filter input tcp dport 11434 ip saddr != 127.0.0.1 drop
$ sudo nft add rule inet filter input tcp dport 8888 ip saddr != 127.0.0.1 drop
# Make nftables rules persistent:
$ sudo nft list ruleset > /etc/nftables.conf

# T1547.006 — Enable kernel module signature enforcement (production-locked systems only)
# WARNING: This prevents any unsigned module from loading after this boot.
# Confirm all required GPU modules are already loaded before applying.
# Add to /etc/default/grub GRUB_CMDLINE_LINUX, then update-grub and reboot.
# module.sig_enforce=1

# T1652 — Disable further kernel module loading (fully locked-down systems only)
# WARNING: kernel.modules_disabled=1 is one-way until reboot; it blocks all further
# module loading but does not hide /proc/modules from existing readers.
$ echo "kernel.modules_disabled=1" | sudo tee /etc/sysctl.d/99-gpu-hardening.conf
$ sudo sysctl --system
// Applicable NIST Special Publications
SP 800-53 Rev 5.2 Security and Privacy Controls for Information Systems — the primary control catalog for hardening GPU AI servers. Relevant control families: CM (Configuration Management), AC (Access Control), SI (System and Information Integrity), SR (Supply Chain Risk Management), AU (Audit and Accountability). Rev 5.2 (2025) adds SA-24 (Design for Cyber Resiliency) directly applicable to AI inference infrastructure.
SP 800-123 Guide to General Server Security — baseline server hardening guidance applicable to any Linux GPU server: removing unnecessary services, configuring host firewalls, managing user accounts and group memberships (directly applicable to render/video group hygiene), and establishing patch management processes for third-party drivers.
SP 800-147 BIOS Protection Guidelines — GPU firmware (VBIOS) is a distinct attack surface from the OS-level driver. GPU firmware updates ship through vendor driver packages. Secure Boot enforcement and TPM-based attestation do not cover GPU VBIOS integrity. This publication provides context for why hardware-level firmware verification matters for systems with high-value compute resources.
SP 800-161 Rev 1 Cybersecurity Supply Chain Risk Management Practices — directly applicable to the evaluation of CUDA (proprietary, closed-source, NVIDIA-controlled supply chain) vs. ROCm (open-source, auditable, community-verifiable). The open-source nature of ROCm is a meaningful supply chain transparency advantage under C-SCRM guidance, particularly for environments operating under federal or regulated compliance requirements.
SP 800-190 Application Container Security Guide — if you run AI workloads in Docker or Podman containers with GPU passthrough (--gpus all or --device /dev/kfd), NIST SP 800-190 guidance on container privilege, image provenance, and runtime isolation applies. GPU-enabled containers with --privileged or full device access defeat most container security boundaries; understand what each flag actually grants.
// check your understanding
The render group grants GPU compute access. Why is over-provisioning this group a security risk, and what does MITRE ATT&CK say about it?

Adding users or service accounts to the render group beyond those that explicitly need GPU compute access violates the principle of least privilege. Any compromised account in the render group can access /dev/kfd and /dev/dri/renderD* -- the interfaces ROCm uses for GPU compute. This maps to MITRE ATT&CK T1078.003 (Valid Accounts: Local Accounts): an attacker who compromises a low-privilege account that was unnecessarily added to render gains GPU access without any additional privilege escalation step. Under NIST SP 800-53 AC-6 (Least Privilege), access to the render group should be scoped to only the specific users and service accounts that require GPU compute functionality, reviewed periodically, and audited for changes via auditd.
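The periodic review that AC-6 calls for is easy to partially automate. A sketch that parses `getent group` output and flags members not on an approved allow-list; the function names and the example allow-list are illustrative:

```python
def group_members(getent_line: str) -> list:
    """Parse one line of `getent group` output: name:passwd:gid:member,member."""
    members = getent_line.strip().split(":")[3]
    return [m for m in members.split(",") if m]

def unexpected_members(getent_line: str, allowed: set) -> list:
    """Return accounts in the group that are not on the approved allow-list."""
    return [m for m in group_members(getent_line) if m not in allowed]

# Example: flag anything in render beyond the inference service account
# unexpected_members("render:x:110:alice,svc-ollama", {"svc-ollama"}) -> ["alice"]
```

Feeding this the live output of `getent group render video` from a cron job, and alerting on a non-empty result, closes the gap between the auditd rule (which logs changes) and the review (which judges them).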


Frequently Asked Questions

Is ROCm good enough for serious AI and ML work in 2026?

Yes, for many workloads -- with caveats. PyTorch on ROCm is genuinely functional for training and inference, and ROCm 7.1.1 delivered significant performance improvements over prior releases. The gap versus CUDA has narrowed considerably. Where ROCm still lags is in the long tail of specialized libraries: tools that assume CUDA-specific APIs, the Flash Attention 2 kernel, TensorRT, and frameworks that have CUDA-optimized paths but no ROCm equivalent yet. For Ollama-based local inference and standard PyTorch training, ROCm is a practical and cost-effective choice on officially supported hardware.

What does nvidia-smi show and why does it matter?

nvidia-smi is the NVIDIA System Management Interface. It reports the installed driver version, the maximum CUDA version that driver supports, GPU name, temperature, power draw, memory usage, and running GPU processes. The CUDA Version shown in the top-right corner is the ceiling for PyTorch wheel compatibility -- you must install a PyTorch wheel targeting that version or lower. Running nvidia-smi and seeing a clean output table is the first verification step for any NVIDIA GPU AI setup on Linux.

What is HIP and does it let ROCm code run on NVIDIA GPUs?

HIP (Heterogeneous-compute Interface for Portability) is AMD's C++ programming model for GPU compute. It is deliberately designed to be close to the CUDA API, so that code written in HIP can be compiled for both AMD GPUs via ROCm and NVIDIA GPUs via CUDA with minimal changes. AMD provides hipcc, a compiler driver that targets either backend. PyTorch uses this internally for its ROCm support, which is why the Python API is identical across both hardware backends. In practice, HIP portability works well for standard compute kernels but requires more effort for code that uses vendor-specific libraries or features. One specific porting trap that catches engineers coming from CUDA: ROCm 7.0 made warpSize a non-constexpr variable, matching the CUDA specification more closely. HIP code that uses warpSize as a compile-time constant (for example, in template parameters or constexpr array bounds) will fail to compile on ROCm 7.0+. The fix is to query wavefront size at runtime via hipGetDeviceProperties or the in-kernel warpSize variable. AMD also removed the __AMDGCN_WAVEFRONT_SIZE and __AMDGCN_WAVEFRONT_SIZE__ macros entirely in this cycle -- any HIP kernel code that relied on them is now a compile error.

Why does ROCm silently fall back to CPU without an error?

ROCm performs hardware compatibility checks using GFX version strings. If the installed GPU does not match a version in ROCm's supported list, the runtime rejects it and falls back to CPU inference -- without printing an error to the terminal. The fix is the HSA_OVERRIDE_GFX_VERSION environment variable, which bypasses the compatibility check by telling the runtime to treat the GPU as if it were a supported architecture. The correct value is the GFX version of the nearest officially supported card, which can be found using rocminfo.
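The override value is derived from the GFX string by splitting its major, minor, and stepping digits, with the stepping read as a hex digit -- this is the commonly used convention (for example, gfx90c maps to 9.0.12 in many APU guides). A small parser illustrating it; remember the value you actually export should be the GFX version of the nearest officially supported card, not necessarily your own:

```python
import re

def gfx_to_override(gfx: str) -> str:
    """Convert a gfx target string (e.g. gfx1030) to an
    HSA_OVERRIDE_GFX_VERSION value (e.g. 10.3.0).
    The trailing stepping character is hexadecimal: gfx90c -> 9.0.12."""
    m = re.fullmatch(r"gfx(\d+)(\d)([0-9a-f])", gfx)
    if not m:
        raise ValueError(f"Unrecognized gfx string: {gfx}")
    major, minor, step = m.groups()
    return f"{int(major)}.{int(minor)}.{int(step, 16)}"

# gfx_to_override("gfx1030") -> "10.3.0"
# Then: export HSA_OVERRIDE_GFX_VERSION=10.3.0 before launching the workload
```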

Does Ubuntu version matter for ROCm installation in 2026?

Yes, and it matters more than it used to. ROCm 7.2.1 officially supports Ubuntu 24.04.4 -- AMD marked Ubuntu 24.04.3 as end-of-life for ROCm when 7.2.1 shipped on March 25, 2026. If you are on 24.04.3, upgrade to 24.04.4 before installing ROCm 7.2.1. Ubuntu 22.04.5 is still supported. The Radeon consumer GPU tier -- RX 7700 series, RX 7800 XT, RX 7900 series, and RX 9000 series -- is limited to Ubuntu 24.04.4 and Ubuntu 22.04.5; these GPUs are not supported on older Ubuntu point releases. (See: ROCm 7.2.1 Release Notes.)

What changed in the ROCm installation process in 2026?

AMD removed the amdgpu-install documentation from the primary ROCm install flow as of ROCm 7.2.1, repositioning it as a legacy option. The recommended install method is now via the native APT package manager using signed repository keys from repo.radeon.com. AMD also discontinued the ROCm Offline Installer Creator in 7.2.1, replacing it with a self-extracting Runfile Installer for offline and air-gapped deployments. The amdgpu-install .deb package still works and the quick-start guide still references it, but the detailed installation documentation leads with the package manager path. (See: ROCm Installation Overview.)

Sources

// References & Further Reading
[1] ROCm 7.2.1 Release Notes — AMD. Covers Ubuntu 24.04.4 support, Ubuntu 24.04.3 EOL, hipBLASLt MXFP8/MXFP4 improvements, JAX 0.8.2, Runfile Installer, ROCTracer deprecation, and known issues including gfx1201 coherency fix and RX 9060 XT instability.
[2] ROCm System Requirements (Linux) — AMD. Official supported GPU list for ROCm 7.2.1, including Instinct, Radeon RX 7000, and RX 9000 series hardware.
[3] AMD ROCm 7.2.1 Released With Ubuntu 24.04.4 LTS Support, Bug Fixes — Phoronix. Independent coverage of the ROCm 7.2.1 release on March 25, 2026.
[4] ROCm Installation Overview (Linux) — AMD. Documents the package manager as the primary install path as of ROCm 7.2.1, with amdgpu-install repositioned as legacy.
[5] ROCm Quick Start Installation Guide (Linux) — AMD. Step-by-step install reference for Ubuntu and RHEL-based systems.
[6] AMD ROCm 7.2.1 on Radeon and Ryzen for Linux Release Notes — AMD. Radeon consumer GPU tier and Ryzen iGPU-specific guidance, driver interface version notes, and known issues for the 7.2.1 release.
[7] Ollama AMD GPU Hardware Support — Ollama. Supported GPU list and HSA_OVERRIDE_GFX_VERSION guidance for running LLMs locally on AMD ROCm hardware.
[8] CUDA Documentation — NVIDIA. Official CUDA toolkit documentation covering driver installation, nvcc, cuDNN, cuBLAS, TensorRT, and the full CUDA SDK.
[9] PyTorch ROCm Support — PyTorch. Official wheel installation guide for CUDA and ROCm backends, including version matrix and verification steps.
[10] AMD ROCm 7.2 on Radeon and Ryzen for Linux Release Notes — AMD. ROCm 7.2.0 release coverage including RX 9060 XT, RX 7700 (non-XT), ROCm Optiq introduction, and consumer GPU support expansion.
[11] Canonical to Distribute AMD ROCm Libraries With Ubuntu 26.04 LTS — Canonical. December 2025 announcement of the Canonical–AMD collaboration to package ROCm natively in Ubuntu's official repositories with up to 15 years of Ubuntu Pro support.
[12] The Integrated ROCm Story For Ubuntu 26.04 Still Playing Out — Phoronix. March 2026 coverage of the ROCm packaging status in Ubuntu 26.04's archive ahead of launch.