PyTorch 2.11.0 is the current stable release as of March 23, 2026, according to PyPI. According to a Linux Foundation report cited by the Hugging Face blog, PyTorch holds a 63% adoption rate for model training, making it the most widely used framework in the field. The Hugging Face Hub lists over one million pretrained models, the overwhelming majority using PyTorch as their backend. If you're doing anything with machine learning on Linux, PyTorch is where you'll spend your time. If you're setting up a Linux system specifically for AI workloads for the first time, see the running AI workloads on Linux setup guide for an overview of the hardware and OS prerequisites before continuing here.

Getting it installed correctly is not difficult, but several things in 2026 can cause silent failures or confusing errors if you don't know about them. The biggest change affecting new installs: starting with PyTorch 2.11, running pip install torch without an --index-url now installs a CUDA 13.0 wheel from PyPI by default -- not the CPU-only build as in previous releases. CUDA 13.0 only supports Turing (SM 7.5) and newer GPU architectures. If your GPU is older or your driver doesn't support CUDA 13.0, the install will appear to succeed but PyTorch won't see your GPU. There's also a separate Volta/V100 issue: the CUDA 12.8, 12.9, and 13.0 builds in 2.11 dropped Volta support entirely to enable a CuDNN upgrade. This guide covers all of it.

Additional common failure modes: installing a wheel targeting a CUDA version your driver doesn't support, or polluting a system Python installation that later conflicts with another project. This guide avoids all of those. We'll use virtual environments, select the right wheel for your GPU backend and generation, and verify everything is working before writing any model code.

Examples are shown for Ubuntu 22.04 and 24.04. The pip commands translate directly to Fedora, Arch, and any other distribution -- only the package manager commands for system dependencies differ, and those are noted inline.

Prerequisites

Before installing PyTorch, confirm the basics are in place: a GPU driver that is visible to the system (nvidia-smi succeeds for NVIDIA, rocminfo for AMD), Python 3 with the venv module, and pip. Each of these is exercised in the steps below.

Note

As of PyTorch 2.6.0, the official conda channel no longer provides PyTorch builds -- this remains true in PyTorch 2.11. Even if you use conda for environment management, you install PyTorch via pip inside the conda environment. This guide uses Python's built-in venv, which is sufficient for the vast majority of PyTorch projects on Linux.

Running on WSL2?

If you are on Windows Subsystem for Linux 2 (WSL2), the pip install commands in this guide are identical. WSL2 exposes the host Windows NVIDIA driver to the Linux environment -- you do not install a separate Linux NVIDIA driver inside WSL2. Run nvidia-smi inside your WSL2 terminal to confirm the driver is visible before installing PyTorch. ROCm is not supported inside WSL2 for AMD GPUs; AMD users on WSL2 should use a native Linux installation or Docker. The steps from the venv section onward apply directly to WSL2 Ubuntu environments.

Setting Up a Virtual Environment

Installing PyTorch into a virtual environment rather than the system Python is non-negotiable on any serious project. A venv keeps your PyTorch installation isolated from other Python projects, makes the dependency set reproducible via requirements.txt, and means you can have different PyTorch versions for different projects on the same machine without conflict.

// mental model

Think of a virtual environment as a hermetically sealed cabinet for a single project. The system Python is the warehouse floor -- you never store project-specific tools on the warehouse floor. Each cabinet (venv) contains its own pip, its own site-packages, and its own binaries. When you activate one, your shell's PATH is prepended to point inside that cabinet first. Deactivating just removes that PATH prefix. Nothing is actually moved or deleted -- the cabinet is always there on disk, and you can walk back into it any time with source pytorch-env/bin/activate.

terminal
# Create a virtual environment in the current directory
$ python3 -m venv pytorch-env

# Activate it
$ source pytorch-env/bin/activate

# Your prompt will now show (pytorch-env)
# Upgrade pip before installing anything
(pytorch-env)$ python3 -m pip install --upgrade pip

# Confirm you're inside the venv
(pytorch-env)$ which python3
# Should output: /path/to/pytorch-env/bin/python3

To deactivate the environment when you're done, simply run deactivate. The environment persists on disk and can be reactivated any time.
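If you're ever unsure whether a script is actually running inside the venv, Python can tell you directly: inside a venv, sys.prefix points into the environment directory while sys.base_prefix still points at the base interpreter. A minimal check (the helper name is ours, not part of any standard API):

```python
import sys

def in_virtualenv() -> bool:
    """True when the running interpreter was launched from a venv.

    Inside a venv, sys.prefix points at the environment directory
    while sys.base_prefix still points at the base interpreter;
    outside a venv the two are identical.
    """
    return sys.prefix != sys.base_prefix

print(f"prefix:      {sys.prefix}")
print(f"base_prefix: {sys.base_prefix}")
print(f"inside venv: {in_virtualenv()}")
```

Dropping this check at the top of a training script catches the "installed into the wrong environment" failure mode before any import errors appear.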

Warning

Never run sudo pip install to install PyTorch or any Python packages. Installing as root modifies the system Python, which can break distribution tools that depend on specific package versions. Always use a virtual environment and install as a regular user.

// knowledge check

You run source pytorch-env/bin/activate and see (pytorch-env) in your prompt. You then run which python3. What path should you expect?

Choosing the Right CUDA Wheel (NVIDIA)

There are two things to get right for NVIDIA users in PyTorch 2.11. First, the CUDA version in the PyTorch wheel must match or be below the CUDA version your driver supports. Installing a wheel built for CUDA 12.8 when your driver only supports CUDA 12.4 will result in torch.cuda.is_available() returning False -- no error, just silent CPU fallback.

// mental model

The CUDA version number serves a dual role that trips people up: it appears both in nvidia-smi and in the PyTorch wheel name, but they mean different things. The number in nvidia-smi is a ceiling -- the highest API version your driver is capable of speaking. The number in the wheel name is a floor requirement -- the minimum driver capability the wheel expects. If the wheel requires CUDA 12.8 but your driver ceiling is 12.4, the wheel cannot run. The PyTorch CUDA runtime libraries are entirely bundled inside the wheel itself, so no separate CUDA Toolkit installation is involved. Your driver is the only external dependency.
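The ceiling/floor rule reduces to a numeric comparison. A sketch of that check (the function is illustrative, not a PyTorch API; note that versions must be compared as number tuples, not strings, so that a hypothetical 12.10 sorts above 12.9):

```python
def driver_supports_wheel(driver_cuda: str, wheel_cuda: str) -> bool:
    """Driver ceiling (from nvidia-smi) must be >= the wheel's floor.

    Compare numerically component by component -- string comparison
    would wrongly rank "12.10" below "12.9".
    """
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(driver_cuda) >= as_tuple(wheel_cuda)

# Driver ceiling 12.4 cannot run a cu128 (CUDA 12.8) wheel:
print(driver_supports_wheel("12.4", "12.8"))  # False -> silent CPU fallback
# Driver ceiling 12.8 can run a cu126 wheel:
print(driver_supports_wheel("12.8", "12.6"))  # True
```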

Second, and new as of PyTorch 2.11: running pip install torch without an --index-url now installs a CUDA 13.0 wheel from PyPI by default, not the CPU-only build as it did previously. According to the PyTorch 2.11.0 release notes, CUDA 13.0 only supports Turing (SM 7.5) and newer architectures on Linux x86_64 -- Maxwell, Pascal, and Volta are excluded. Additionally, PyTorch 2.11's CUDA 12.8 and 12.9 builds dropped Volta (V100) support to enable a CuDNN upgrade incompatible with Volta. Volta users must use the CUDA 12.6 index URL, which retains full Volta support. The table below maps GPU generations to the correct wheel index.

"Starting with PyTorch 2.11, pip install torch on PyPI installs CUDA 13.0 wheels by default for both Linux x86_64 and Linux aarch64."

-- PyTorch 2.11.0 Release Notes, github.com/pytorch/pytorch

Step 1: Find your maximum supported CUDA version and GPU generation

terminal
$ nvidia-smi
# Look at the top-right: "CUDA Version: 12.8"
# This is the MAXIMUM version your current driver supports
# You can install any PyTorch CUDA wheel at or below this number

In addition to your driver's CUDA version, your GPU's compute capability (architecture generation) determines which wheel you can use. PyTorch 2.11 introduced architecture restrictions that matter:

GPU Generation   | Examples                                                   | Compute Cap.   | Recommended Wheel (PyTorch 2.11)
Blackwell        | RTX 5090, 5080, 5070 (SM 12.0); B100/B200/GB200 (SM 10.0)  | SM 10.0 / 12.0 | cu128 or PyPI default (cu130)
Ada Lovelace     | RTX 4090, 4080, 4070                                       | SM 8.9         | cu128 or PyPI default (cu130)
Ampere           | RTX 3090, 3080, A100                                       | SM 8.0 / 8.6   | cu128 or PyPI default (cu130)
Turing           | RTX 2080, GTX 1660                                         | SM 7.5         | cu128 or PyPI default (cu130)
Volta            | V100, Titan V                                              | SM 7.0         | cu126 only -- dropped from cu128, cu129, cu130
Pascal / Maxwell | GTX 1080, GTX 980                                          | SM 6.x / 5.x   | cu126 only -- not supported in cu130
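The table's decision rule is a single threshold on compute capability, which at runtime is exactly what torch.cuda.get_device_capability(0) returns as a (major, minor) pair. A hypothetical helper encoding the table above:

```python
def recommended_wheel(sm_major: int, sm_minor: int) -> str:
    """Map a compute capability to a wheel index, per the table above.

    Illustrative helper, not a PyTorch API. The threshold follows the
    PyTorch 2.11 support matrix described here: Turing (SM 7.5) and
    newer can use cu128 or the PyPI default (cu130); Volta (SM 7.0)
    and older must stay on cu126.
    """
    if (sm_major, sm_minor) >= (7, 5):
        return "cu128 or PyPI default (cu130)"
    return "cu126 only"

print(recommended_wheel(8, 9))  # Ada Lovelace (e.g. RTX 4090)
print(recommended_wheel(7, 0))  # Volta (V100) -> cu126 only
```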

Step 2: Install PyTorch with the correct wheel index URL

Always install from a PyTorch wheel index rather than the bare PyPI default. As of PyTorch 2.11, omitting --index-url installs a CUDA 13.0 wheel from PyPI -- which works fine for Turing and newer, but will fail or fall back to CPU on older GPUs or systems whose driver does not yet support CUDA 13.0:

terminal
# CUDA 12.8 -- Turing and newer (Blackwell, Ada, Ampere, Turing)
# Volta (V100) is NOT supported in cu128 builds -- see cu126 below
(pytorch-env)$ pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu128

# CUDA 12.6 -- required for Volta (V100), Pascal, Maxwell
# Also the safe choice if your driver ceiling is below 12.8
(pytorch-env)$ pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu126

# CPU-only (no GPU, intentional)
(pytorch-env)$ pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cpu
PyTorch 2.11 + Volta / V100 Warning

The cu128, cu129, and cu130 builds in PyTorch 2.11 no longer support Volta GPUs (compute capability 7.0 -- V100, Titan V). Per the PyTorch 2.11.0 release notes, this change was required to enable the CuDNN upgrade in the 2.11 builds, which is incompatible with Volta. V100 users must install with --index-url https://download.pytorch.org/whl/cu126, which retains full Volta support and uses CuDNN 9.10.2.21. Alternatively, build from source with Volta in TORCH_CUDA_ARCH_LIST.

Tip

The current stable CUDA wheel index URLs and the PyTorch version they target are published at pytorch.org/get-started/locally/. Use the selector tool there to generate the exact command for your configuration.

// interactive tool
Install Command Generator

Select your GPU backend and generation to get the exact pip install command for PyTorch 2.11.

// generated command
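Under the hood, a generator like this is just string assembly over the wheel index URLs shown above. A minimal sketch, with backend keys of our own choosing:

```python
# Wheel index URLs from the install commands above; the dictionary
# keys are our own labels, not an official PyTorch naming scheme.
INDEX_URLS = {
    "cu128": "https://download.pytorch.org/whl/cu128",
    "cu126": "https://download.pytorch.org/whl/cu126",
    "cpu":   "https://download.pytorch.org/whl/cpu",
}

def install_command(backend: str) -> str:
    """Build the pip install line for a given backend key."""
    return ("pip install torch torchvision torchaudio "
            f"--index-url {INDEX_URLS[backend]}")

print(install_command("cu126"))
```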
Why install torchvision and torchaudio?

torchvision provides pretrained computer vision models (ResNet, EfficientNet, ViT), standard datasets (ImageNet, CIFAR-10), and image transformation utilities. torchaudio provides audio processing tools, waveform transforms, and pretrained speech models. Neither is required if you are only working with raw tensors or NLP. You can omit either from the install command and add them later if your project needs them. The key reason to install them in the same command as torch is that their versions are tightly coupled to the PyTorch version -- installing them separately at a later point can pull in a mismatched version. Always install all three together.

Step 3: Verify CUDA is detected

terminal
# Is CUDA available?
(pytorch-env)$ python3 -c "import torch; print(torch.cuda.is_available())"
# Expected: True

# What CUDA version did PyTorch build against?
(pytorch-env)$ python3 -c "import torch; print(torch.version.cuda)"
# Expected: 12.8 (or whichever version you installed)

# Which GPU is detected?
(pytorch-env)$ python3 -c "import torch; print(torch.cuda.get_device_name(0))"
# Expected: NVIDIA GeForce RTX 4080 (or your card)

# How much VRAM is available?
(pytorch-env)$ python3 -c "import torch; print(round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1), 'GB')"

If torch.cuda.is_available() returns False, run the CUDA version mismatch check in the troubleshooting section below before assuming a deeper problem.

// knowledge check

Your nvidia-smi shows CUDA Version: 12.6. You run pip install torch torchvision torchaudio with no --index-url. What wheel does pip install, and will PyTorch see your GPU?

Installing PyTorch with AMD ROCm

PyTorch uses the same torch.cuda API for both NVIDIA CUDA and AMD ROCm -- from the Python perspective, the interface is identical. The difference is in the wheel you install and the underlying hardware stack. ROCm-enabled PyTorch wheels are pip-installable on Linux and target specific ROCm versions.

// mental model

AMD's ROCm support is built on a principle of source-level translation, not runtime emulation. When the PyTorch wheel is compiled, a tool called HIPIFY rewrites CUDA source code into HIP source code at build time. By the time the wheel reaches your machine, there is no CUDA code left in the AMD build -- it is all HIP. From Python, torch.cuda is a unified namespace that routes to either the CUDA or HIP runtime depending on what hardware is present. The practical consequence: any PyTorch code that runs on NVIDIA also runs on AMD without modification, as long as you installed the ROCm wheel. The hardware difference is fully abstracted below the Python layer.

The current production ROCm release is 7.2.1. Per the official PyTorch release compatibility matrix, ROCm 7.2 is the CI-tested ROCm target for PyTorch 2.11. In practice, there are two install tracks: the pytorch.org nightly index at download.pytorch.org/whl/nightly/rocm7.2 carries PyTorch 2.11 builds (requires --pre), and AMD's separately validated wheels at repo.radeon.com currently pair ROCm 7.2.1 with PyTorch 2.9.1. Per AMD's own documentation, the repo.radeon.com wheels undergo more thorough AMD validation and are recommended for production use, even though they currently trail the upstream PyTorch version by one to two releases.

"PyTorch includes tooling that generates HIP source code from the CUDA backend."

-- AMD ROCm Documentation, rocm.docs.amd.com

This means the torch.cuda API works identically on both NVIDIA and AMD hardware from the Python layer -- no code changes needed when switching between backends, only the wheel you install.

Confirm your ROCm installation

terminal
# Check ROCm version
$ cat /opt/rocm/.info/version
# e.g. 7.2.1 (current production release) or 6.4.x

# Confirm your GPU is visible to ROCm
$ rocminfo | grep -E "Marketing Name|gfx"

# Check user group membership (required for GPU access)
$ groups $USER | grep -E "render|video"
# Must include both render and video

If your user isn't in the render and video groups, add them and log out and back in:

terminal
$ sudo usermod -a -G render,video $USER
# Log out and back in for group changes to take effect

Not all AMD GPUs are ROCm-supported

ROCm has a specific supported GPU list. Consumer Radeon cards (RX series) have varying support depending on the gfx architecture. If rocminfo lists your GPU but PyTorch still can't use it, check whether your card's gfx target (e.g. gfx1100 for RX 7900 XTX) is compiled into the PyTorch wheel. You can verify with:

terminal
TORCHDIR=$(dirname $(python3 -c 'import torch; print(torch.__file__)')); roc-obj-ls -v $TORCHDIR/lib/libtorch_hip.so | grep gfx

The officially supported consumer GPU list is maintained in the ROCm compatibility matrix. If your GPU is not in the list, the Docker image path is your safest option.

Install the ROCm PyTorch wheel

The ROCm install path for PyTorch 2.11 has two distinct tracks, and it is important to understand which one applies to your situation. The PyTorch release compatibility matrix lists ROCm 7.2 as the CI-tested ROCm target for PyTorch 2.11. However, as of April 2026 there is no stable rocm7.2 index on pytorch.org: the PyTorch 2.11 + ROCm 7.2 wheels are served from the nightly index (download.pytorch.org/whl/nightly/rocm7.2) and require the --pre flag, consistent with how the AMD ROCm installation documentation describes it. Separately, AMD's own validated production wheels at repo.radeon.com for ROCm 7.2.1 currently pair with PyTorch 2.9.1, not 2.11 -- AMD validates and releases these on its own cadence, which lags the upstream PyTorch release by one to two versions. If you specifically need PyTorch 2.11 with ROCm 7.2, use the pytorch.org nightly index. If you want AMD's most thoroughly tested configuration, use repo.radeon.com and accept PyTorch 2.9.1.

terminal
# PyTorch 2.11 + ROCm 7.2 -- pytorch.org nightly index (--pre required)
# This is the ROCm 7.2-targeted build for PyTorch 2.11 per the release compatibility matrix
(pytorch-env)$ pip install --pre torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/nightly/rocm7.2

# AMD repo.radeon.com wheels -- AMD's most thoroughly tested configuration
# NOTE: As of April 2026 these pair ROCm 7.2.1 with PyTorch 2.9.1, not 2.11
# Check repo.radeon.com/rocm/manylinux/ for the current release path and wheel filenames
(pytorch-env)$ pip install torch torchvision torchaudio \
  --index-url https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2.1/
Warning: Two separate ROCm tracks

There are two distinct ways to install PyTorch with ROCm 7.2 and they do not give you the same PyTorch version. The pytorch.org nightly index (download.pytorch.org/whl/nightly/rocm7.2) carries PyTorch 2.11 builds targeting ROCm 7.2, but requires the --pre flag. AMD's repo.radeon.com wheels are AMD's more thoroughly validated builds, but as of April 2026 they pair ROCm 7.2.1 with PyTorch 2.9.1. Always verify after installation with python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())" -- a version or library mismatch can silently install a CPU-only build.

Verify ROCm detection

terminal
# ROCm uses the same torch.cuda API as CUDA
(pytorch-env)$ python3 -c "import torch; print(torch.cuda.is_available())"
# Expected: True

# Device name shows AMD GPU
(pytorch-env)$ python3 -c "import torch; print(torch.cuda.get_device_name(0))"
# Expected: AMD Radeon RX 7900 XTX (or your card)

# Recommended: full environment report (PyTorch version, ROCm, HIP, MIOpen, OS)
(pytorch-env)$ python3 -m torch.utils.collect_env
# Outputs a complete diagnostic snapshot -- use this when filing bug reports

# Quick build config check (CUDA/ROCm version wheel was compiled against)
(pytorch-env)$ python3 -c "import torch; print(torch.__config__.show())"
# Look for ROCm version and HIP runtime version in the output

Running Your First Model

// interactive tool
VRAM Budget Reference

Enter your GPU's VRAM to see what model sizes and batch sizes are practical.

GB VRAM

With PyTorch installed and GPU access verified, the following example runs a complete forward pass using a pretrained ResNet-18 model. It downloads the model weights on first run, moves the model and a random input tensor to the GPU, and prints the output shape to confirm end-to-end GPU execution.

inference_check.py
import torch
import torchvision.models as models

# Select device: cuda if available, otherwise cpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")
if device.type == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1)} GB")

# Load a pretrained ResNet-18 (downloads ~45 MB on first run)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model = model.to(device)
model.eval()

# Create a random input tensor: batch of 1, RGB, 224x224
x = torch.rand(1, 3, 224, 224).to(device)

# Run inference
with torch.no_grad():
    output = model(x)

print(f"Input shape:  {x.shape}")
print(f"Output shape: {output.shape}")
# Output shape: torch.Size([1, 1000])
# 1000 classes = ImageNet classification output
terminal
(pytorch-env)$ python3 inference_check.py
# Running on: cuda
# GPU: NVIDIA GeForce RTX 4080
# VRAM: 16.0 GB
# Input shape:  torch.Size([1, 3, 224, 224])
# Output shape: torch.Size([1, 1000])

If Running on: cuda appears and the output shape prints correctly, your PyTorch installation is fully functional with GPU acceleration. If it shows Running on: cpu, the GPU detection step above failed and the troubleshooting section below applies.

// knowledge check

You load a model with model = models.resnet18(weights=...) and move it to the GPU with model = model.to(device). Then you create a tensor with x = torch.rand(1,3,224,224) -- without calling .to(device). What happens when you call model(x)?

Installing Common ML Packages

A complete ML development environment typically includes several packages alongside PyTorch. Install them into the same virtual environment:

terminal
# Data handling and numerical computing
(pytorch-env)$ pip install numpy pandas scikit-learn

# Visualization
(pytorch-env)$ pip install matplotlib seaborn

# Jupyter for interactive development
(pytorch-env)$ pip install jupyterlab

# Hugging Face ecosystem (transformers, datasets)
(pytorch-env)$ pip install transformers datasets accelerate

# TensorBoard for training visualization
(pytorch-env)$ pip install tensorboard

# Lock the environment for reproducibility
(pytorch-env)$ pip freeze > requirements.txt
Note

If you're using numpy alongside a ROCm PyTorch wheel, you may encounter a numpy 2.x incompatibility. If you see ImportError messages referencing numpy, downgrade with pip install "numpy<2.0". This is a known issue with some ROCm wheel builds.

Troubleshooting GPU Detection

// interactive diagnostic
torch.cuda.is_available() Returns False
Answer each question to identify the cause.
Does nvidia-smi run without errors and show your GPU?
What does python3 -c "import torch; print(torch.version.cuda)" return?
Is the version printed by torch.version.cuda higher than the CUDA version shown in nvidia-smi?
Is your GPU a Volta (V100, Titan V) or older generation (Pascal/Maxwell)?
Did you install PyTorch inside an active virtual environment?
Does pip show torch | grep Version show a version string ending in +cpu?
Run echo "$CUDA_VISIBLE_DEVICES". Does it print an empty line (variable is set to empty string)?
Run ldconfig -p | grep libcuda. Does it return at least one result pointing to an existing file?
Is the failure happening inside a DataLoader worker, a subprocess, or after calling multiprocessing.Process?
Are you on AMD ROCm? Run rocminfo | grep -i name -- does it list your GPU, but PyTorch still returns False?

This is almost always a CUDA version mismatch. Run the following to see what version your PyTorch wheel targets versus what your driver supports:

terminal
# CUDA version your driver supports -- extracts just the version number
$ nvidia-smi | grep -oP 'CUDA Version: \K[0-9.]+'

# CUDA version PyTorch was built against
(pytorch-env)$ python3 -c "import torch; print('PyTorch CUDA:', torch.version.cuda)"

# If torch.version.cuda is higher than your driver supports,
# reinstall PyTorch with a lower CUDA index URL

# Also confirm you didn't accidentally install the CPU-only wheel
(pytorch-env)$ pip show torch | grep -i "location\|version"
# If the version string ends in +cpu, you have the CPU-only build

# Full environment snapshot -- use this for comprehensive diagnosis
(pytorch-env)$ python3 -m torch.utils.collect_env
# Reports PyTorch version, CUDA, cuDNN, driver, OS, Python -- paste this when filing issues

torch.cuda.is_available() returns False on an AMD machine

The most common causes are: the ROCm wheel version doesn't match your installed ROCm version; the user isn't in the render and video groups; or the ROCm stack itself isn't detecting the GPU. Run rocminfo first to confirm the hardware is visible at the ROCm level before blaming PyTorch. If rocminfo shows the GPU but PyTorch doesn't see it, check whether the installed PyTorch wheel version targets the same ROCm major/minor version as your system installation.
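That last check -- system ROCm version versus the wheel's ROCm target -- is a string comparison you can script. A hypothetical helper, assuming the version file path and tag format shown earlier in this guide:

```python
def rocm_versions_match(system_rocm: str, torch_version: str) -> bool:
    """True when the wheel's +rocmX.Y tag matches the system's major.minor.

    system_rocm:   contents of /opt/rocm/.info/version, e.g. "7.2.1"
    torch_version: torch.__version__, e.g. "2.11.0+rocm7.2"
    Illustrative helper, not a PyTorch or ROCm API.
    """
    _, _, local = torch_version.partition("+")
    if not local.startswith("rocm"):
        return False  # CPU-only or CUDA wheel -- not a ROCm build at all
    wheel_mm = local[len("rocm"):]
    system_mm = ".".join(system_rocm.split(".")[:2])
    return wheel_mm == system_mm

print(rocm_versions_match("7.2.1", "2.11.0+rocm7.2"))  # True
print(rocm_versions_match("6.4.2", "2.11.0+rocm7.2"))  # False -- mismatch
```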

ImportError or segfault when importing torch

Usually caused by mixing pip and system packages, or by a broken venv. The fastest fix is to delete the venv and recreate it from scratch: deactivate, then rm -rf pytorch-env, then run the venv creation steps again. This eliminates any corrupted state from partial installs.

Out of memory errors during model loading

The model is larger than your available VRAM. Either use a smaller model, apply quantization (libraries like bitsandbytes make this straightforward for Hugging Face models), or enable CPU offloading if your use case permits slower inference.

Expert Gotchas and Things Hard to Find Elsewhere

pip shows the correct CUDA version but inference is still running on CPU

This is usually caused by calling model(x) without first moving the model to the GPU with model.to(device) or model.cuda(). PyTorch does not automatically move models or tensors to GPU -- every object must be explicitly placed on the device. The device-agnostic pattern used in the inference example above (model.to(device) and x.to(device)) is the correct approach and sidesteps this class of bug entirely.

nvidia-smi shows CUDA 12.8 but nvcc --version shows a different version

These two numbers mean different things and the confusion trips up a surprising number of people. The CUDA version shown in nvidia-smi is the maximum CUDA API version your installed driver supports -- it is determined by the driver, not by any toolkit you have installed. The version shown by nvcc --version is the CUDA compiler toolkit version that happens to be in your PATH. For pip-installed PyTorch, only the nvidia-smi version matters. The CUDA runtime libraries are bundled inside the PyTorch wheel and do not come from your system's nvcc installation. If these two numbers differ, that is normal and not a problem for standard PyTorch usage.

// mental model

Think of nvidia-smi as reporting your driver's capabilities -- the maximum API the driver can speak. Think of nvcc --version as reporting the compiler you happen to have installed -- a separate piece of software that generates GPU code at build time. For running a pre-built PyTorch wheel, only the driver capability matters. The wheel ships with its own bundled runtime; nvcc is only relevant if you are writing and compiling your own CUDA C++ kernels.

// knowledge check

nvidia-smi reports CUDA Version: 12.8. nvcc --version reports release 11.8. You install PyTorch with --index-url https://download.pytorch.org/whl/cu128. Will PyTorch be able to use your GPU?

torch.version.cuda vs torch.backends.cudnn.version()

Two useful diagnostic values that beginners rarely know exist: torch.version.cuda returns the CUDA version the PyTorch wheel was compiled against. torch.backends.cudnn.version() returns the cuDNN runtime version being used. For PyTorch 2.11, the expected cuDNN version is 9.x (9.15 or later). If cuDNN is not available, certain operations like convolutions will fall back to slower paths. You can check the full build configuration with print(torch.__config__.show()), which dumps a multi-line report including the compiler, CUDA version, cuDNN, MKL, and ROCm/HIP information depending on your wheel.

Understanding the ROCm HIPification layer

AMD's ROCm support in PyTorch does not use separate code paths written from scratch. Instead, PyTorch includes a toolchain called HIPIFY that automatically converts CUDA source code to HIP source code at build time. From the Python side, torch.cuda is a unified namespace that routes to either CUDA or HIP depending on what hardware is present -- which is why the verification commands are identical for both backends. When you call torch.cuda.is_available() on an AMD machine with a properly installed ROCm wheel, it returns True because the ROCm/HIP runtime is presenting itself through the torch.cuda interface. This is intentional design, not a quirk. One practical consequence: any code written against torch.cuda is hardware-agnostic between NVIDIA and AMD without modification, as long as the correct wheel is installed.

Decoding the pip version string

When you run pip show torch, the version string encodes what you installed. A string like 2.11.0+cu128 means PyTorch 2.11.0 compiled against CUDA 12.8. A string of 2.11.0+rocm7.2 means the ROCm 7.2 build for PyTorch 2.11. A string of just 2.11.0 (no suffix) is the PyPI default wheel, which as of 2.11 means CUDA 13.0. A string ending in +cpu is the CPU-only build. You can also check at runtime: if torch.version.cuda returns None, you have the CPU-only build regardless of what the version string shows.
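The decoding logic is simple enough to sketch in a few lines. The function name is ours; the tag meanings follow the rules described above:

```python
def decode_torch_version(version: str) -> str:
    """Explain what a torch version string from pip show torch means."""
    base, _, local = version.partition("+")
    if not local:
        return f"{base}: PyPI default wheel (CUDA 13.0 as of 2.11)"
    if local == "cpu":
        return f"{base}: CPU-only build"
    if local.startswith("cu"):
        digits = local[2:]  # "cu128" -> "128" -> CUDA 12.8
        return f"{base}: built against CUDA {digits[:-1]}.{digits[-1]}"
    if local.startswith("rocm"):
        return f"{base}: ROCm {local[len('rocm'):]} build"
    return f"{base}: unrecognized local tag '{local}'"

print(decode_torch_version("2.11.0+cu128"))
print(decode_torch_version("2.11.0+rocm7.2"))
print(decode_torch_version("2.11.0"))
```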

// interactive tool
pip Version String Decoder

Paste a torch version string from pip show torch to decode what it means.

Upgrading PyTorch to a new version

The correct way to upgrade PyTorch is to uninstall the current version first, then reinstall with the explicit --index-url for your hardware. Running pip install --upgrade torch without an index URL will pull from PyPI and as of 2.11 will install the CUDA 13.0 wheel -- which may not be what you want if you were previously on a specific CUDA version or on Volta. The safe upgrade pattern is:

terminal
# Step 1: uninstall the current build completely
(pytorch-env)$ pip uninstall torch torchvision torchaudio -y

# Step 2: reinstall with the explicit index URL for your GPU
# (same command you used originally -- just re-run it)
(pytorch-env)$ pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu128

# Step 3: verify the new version
(pytorch-env)$ python3 -c "import torch; print(torch.__version__)"

If you want to pin a specific PyTorch version for reproducibility (recommended for shared or production projects), specify the version explicitly in the install command. This ensures anyone running the same command later gets the same build:

terminal
# Pin all three coupled packages to exact matching versions (recommended for teams)
# torchvision and torchaudio versions are tightly coupled to the torch version --
# always pin all three together to guarantee compatible builds
(pytorch-env)$ pip install torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0 \
  --index-url https://download.pytorch.org/whl/cu128

# Then lock the full environment for exact reproducibility
(pytorch-env)$ pip freeze > requirements.txt

Using torch.compile() for faster inference and training

torch.compile() is PyTorch 2.x's graph compilation system and the most impactful single-line optimization available for most models. It traces your model's forward pass, compiles the graph with TorchInductor, and generates optimized GPU kernels. For many standard models it delivers a 20–50% throughput improvement with no code changes beyond the compile call. Using it is straightforward once your installation is verified:

compile_check.py
import torch
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a model as normal
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model = model.to(device).eval()

# Compile it -- first call takes 30-60s to trace and compile
# Subsequent calls use the cached compiled graph
model = torch.compile(model)

x = torch.rand(4, 3, 224, 224).to(device)
with torch.no_grad():
    output = model(x)
print(f"Output shape: {output.shape}")  # torch.Size([4, 1000])

torch.compile() requires Linux or macOS for full functionality; on Windows it currently falls back to eager mode. The first compiled run incurs a one-time 30–90 second compilation overhead while Triton generates GPU kernels. Subsequent runs use the cached result. If compile causes errors, the fastest diagnostic is torch.compile(model, backend="eager") -- if that works, the issue is in the Inductor backend rather than the tracing itself.

The PYTORCH_CUDA_ALLOC_CONF environment variable

When running large models you may hit GPU out-of-memory errors that seem to occur at lower utilization than expected. This is often memory fragmentation rather than true exhaustion. PyTorch's memory allocator can hold onto freed blocks in a way that prevents them from being reused for new large allocations. Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True enables a more flexible allocation strategy that significantly reduces fragmentation for large workloads:

terminal
# Set before launching any Python script that uses PyTorch
$ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Or set it for a single run inline
$ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 train.py

# Or add to ~/.bashrc to set permanently
$ echo 'export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True' >> ~/.bashrc

This setting is safe to leave on permanently and is recommended for any workload that loads models larger than roughly 4 GB or runs batch inference loops. It has no performance downside for small models.

Using multiple GPUs on a single machine

If your machine has more than one GPU, PyTorch can use them. torch.nn.DataParallel is the older API that splits each batch across GPUs from within a single process -- it works, but carries known inefficiencies, including GIL contention and load imbalance across devices. PyTorch's own documentation now recommends torch.nn.parallel.DistributedDataParallel (DDP) for all new multi-GPU code, including single-node setups. DDP launches one process per GPU via torchrun, communicates through NCCL, and has no GIL overhead. To confirm how many GPUs PyTorch can see and verify they all work:

terminal
# How many GPUs does PyTorch see?
(pytorch-env)$ python3 -c "import torch; print(torch.cuda.device_count())"

# List all GPU names
(pytorch-env)$ python3 -c "import torch; [print(i, torch.cuda.get_device_name(i)) for i in range(torch.cuda.device_count())]"

# DataParallel -- legacy API, included for reference only
# PyTorch docs recommend DistributedDataParallel (DDP) for all new code,
# including single-node multi-GPU. DataParallel has known inefficiencies
# (GIL contention, load imbalance) that DDP avoids entirely.
# import torch.nn as nn
# model = nn.DataParallel(model)  # legacy: avoid for new projects

# DDP (recommended) -- launch with torchrun for single-node multi-GPU:
# torchrun --nproc_per_node=NUM_GPUS train.py
# Inside train.py: torch.distributed.init_process_group("nccl")
# model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

To restrict which GPUs PyTorch sees without changing code, set the CUDA_VISIBLE_DEVICES environment variable before launching Python: CUDA_VISIBLE_DEVICES=0,2 python3 train.py makes only GPUs 0 and 2 visible, and PyTorch will index them as devices 0 and 1 within the process.
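The DDP launch pattern sketched in the comments above can be written out as a minimal runnable script. This is a sketch under stated assumptions: the filename train_ddp.py and the toy linear model are illustrative, and the env-var defaults at the top exist only so the script can also run standalone as a single CPU process (falling back to the gloo backend) when it is not launched via torchrun.

train_ddp.py
```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun --nproc_per_node=NUM_GPUS train_ddp.py sets these automatically.
# The defaults below only let the sketch run standalone as one CPU process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29507")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("LOCAL_RANK", "0")

local_rank = int(os.environ["LOCAL_RANK"])
use_cuda = torch.cuda.is_available()

# NCCL for GPUs, gloo as the CPU fallback
dist.init_process_group("nccl" if use_cuda else "gloo")

device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")
if use_cuda:
    torch.cuda.set_device(local_rank)

model = nn.Linear(10, 1).to(device)
# device_ids must be None for CPU modules
ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
x = torch.randn(8, 10, device=device)
loss = ddp_model(x).pow(2).mean()
loss.backward()   # DDP all-reduces gradients across processes here
optimizer.step()

dist.destroy_process_group()
print("DDP step complete")
```

Launched with torchrun --nproc_per_node=4 train_ddp.py, the same file runs four processes, one per GPU, with no code changes.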

Facts Even Experienced Users Are Unlikely to Know

These are precise, verifiable details sourced from PyTorch's internal documentation, source code, and official performance guides. Many are consequential for real workloads but rarely appear in installation guides or tutorials.

The caching allocator rounds all allocations twice — and the minimums are larger than you'd expect

When PyTorch allocates GPU memory, the caching allocator applies two rounding passes before touching the driver. The first rounds the requested size up to the nearest multiple of 512 bytes. The second applies when PyTorch needs to request a new block from CUDA itself — in that case, the minimum block it requests is 2 MB, regardless of how small your tensor is. Large allocations are requested in 2 MB increments. This means a tensor requiring 1.5 MB actually causes PyTorch to reserve a 2 MB block from the CUDA driver, and the remaining 0.5 MB sits as an inactive split inside that block. Per the PyTorch caching allocator internals documentation, this is intentional: requesting fewer, larger blocks from CUDA reduces driver call overhead and the implicit synchronization that cudaFree introduces.
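The two rounding passes are easy to reason about in plain Python. The sketch below illustrates the behavior described above -- it is not PyTorch's actual C++ implementation, and it assumes for simplicity that large driver requests round up to 2 MB multiples:

allocator_rounding.py
```python
# Illustration of the two documented rounding passes (not the real implementation)
ROUNDING = 512                  # all allocations round up to a 512-byte multiple
MIN_DRIVER_BLOCK = 2 * 1024**2  # minimum block requested from the CUDA driver

def rounded_size(nbytes):
    """First pass: round the request up to the nearest 512-byte multiple."""
    return -(-nbytes // ROUNDING) * ROUNDING

def driver_request(nbytes):
    """Second pass: a fresh block from CUDA is at least 2 MB,
    and larger requests are assumed here to round to 2 MB increments."""
    r = rounded_size(nbytes)
    return max(MIN_DRIVER_BLOCK, -(-r // MIN_DRIVER_BLOCK) * MIN_DRIVER_BLOCK)

print(rounded_size(4))            # 512 -- a single float32 still occupies 512 B
print(driver_request(4))          # 2097152 -- but a fresh block is a full 2 MB
print(driver_request(1_572_864))  # 2097152 -- a 1.5 MB tensor reserves 2 MB
```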

// mental model

The PyTorch caching allocator functions like a warehouse with a strict minimum pallet size. When you ask for a small item, the warehouse still issues a full 2 MB pallet and stores it under your name. If you later ask for a similarly small item, the allocator checks its rack of previously issued pallets first -- if it finds one with enough free space, it hands you a slice of that pallet instead of issuing a new one. The cost of going to the CUDA driver directly (asking for a new pallet) is high because cudaFree synchronizes the GPU, stalling the pipeline. Maintaining an internal cache of reusable blocks keeps the GPU busy and the driver calls minimal. The trade-off is that nvidia-smi shows all your reserved pallets as "used memory," even the ones sitting empty in the rack.
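The warehouse analogy can be captured in a few lines of Python. This toy allocator is purely illustrative -- it models only the reuse behavior described above, none of the real allocator's splitting, stream, or rounding logic:

toy_allocator.py
```python
MIN_BLOCK = 2 * 1024 * 1024  # the 2 MB "minimum pallet size"

class ToyAllocator:
    def __init__(self):
        self.driver_requests = 0   # times we went to the "driver" for a new block
        self.free_blocks = []      # cached blocks: remaining free bytes in each

    def alloc(self, nbytes):
        # Check the rack of previously issued blocks first
        for i, free in enumerate(self.free_blocks):
            if free >= nbytes:
                self.free_blocks[i] -= nbytes   # hand out a slice of that block
                return
        # No usable cached block: issue a new one at the minimum size
        self.driver_requests += 1
        self.free_blocks.append(max(nbytes, MIN_BLOCK) - nbytes)

a = ToyAllocator()
a.alloc(1024)             # 1 KB -> a new 2 MB block from the driver
a.alloc(1024)             # served from the cached block, no driver call
print(a.driver_requests)  # 1
```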

CUDA stream isolation is a hidden source of fragmentation

Every memory block in the PyTorch caching allocator is permanently bound to the CUDA stream it was allocated on. If that block is later split, all fragments remain bound to the original stream. According to the PyTorch memory management internals, programs using many CUDA streams accumulate fragmentation proportional to the number of streams, because free blocks on stream A cannot be reused by allocations on stream B without recording a cross-stream synchronization event. For most single-stream inference or training scripts this is invisible, but any code that explicitly creates multiple CUDA streams -- or any library doing so under the hood -- will see elevated reserved memory that grows with stream count.

The Triton kernel cache is invalidated by environment variable changes

When torch.compile() generates GPU kernels via TorchInductor, each compiled kernel is cached using a SHA-256 hash that includes: the Triton installation version hash, the kernel source code hash, the backend GPU architecture hash, compiler options, and a sorted hash of relevant environment variables. Per Red Hat's Triton cache analysis, this means changing any relevant environment variable -- including adding or removing PYTORCH_CUDA_ALLOC_CONF settings -- will invalidate all cached kernels and trigger full recompilation on the next run. The 30–90 second compilation cost on first run is therefore not strictly one-time: it repeats any time the cache key changes.
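The consequence is easy to demonstrate with a hash sketch. The function below is illustrative, not TorchInductor's actual key computation (the version string, source, and architecture shown are placeholders), but it exhibits the same property: any change to a hashed environment variable yields a different key, and therefore a cache miss:

cache_key_sketch.py
```python
import hashlib
import json

# Illustrative content-addressed cache key: the real TorchInductor key
# combines more components, but the invalidation property is the same.
def cache_key(triton_version, kernel_source, gpu_arch, env_vars):
    payload = json.dumps(
        [triton_version, kernel_source, gpu_arch, sorted(env_vars.items())]
    )
    return hashlib.sha256(payload.encode()).hexdigest()

base = cache_key("3.2.0", "def kernel(): ...", "sm_90", {})
with_alloc_conf = cache_key(
    "3.2.0", "def kernel(): ...", "sm_90",
    {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"},
)

print(base == with_alloc_conf)  # False -- a changed env var means a cache miss
```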

The Triton cache is wiped on reboot by default — unless you relocate it

When TORCHINDUCTOR_CACHE_DIR is not set, TorchInductor stores all compiled artifacts — FX graph cache, AOTAutograd cache, and the Triton kernel subdirectory — in a temp directory under /tmp/torchinductor_<username>, per the PyTorch compile caching configuration documentation. On Linux, /tmp is typically cleared on reboot, so the 30–90 second compilation overhead recurs every time the machine reboots. Setting TORCHINDUCTOR_CACHE_DIR to a persistent location fixes this for all Inductor caches at once. TRITON_CACHE_DIR only needs to be set separately if you want the Triton kernel cache in a different location than the rest of the Inductor cache:

terminal
# Recommended: persist all Inductor caches (FX graphs, AOTAutograd, Triton) in one place
# TORCHINDUCTOR_CACHE_DIR also sets the Triton subdirectory if TRITON_CACHE_DIR is unset
$ export TORCHINDUCTOR_CACHE_DIR=~/.cache/torchinductor

# Enable FX graph cache (stores compiled graphs across process restarts)
$ export TORCHINDUCTOR_FX_GRAPH_CACHE=1

# Also enable AOTAutograd cache (requires FX_GRAPH_CACHE=1; stores autograd artifacts)
$ export TORCHINDUCTOR_AUTOGRAD_CACHE=1

# Optional: set Triton cache to a separate location if needed
# (not required if TORCHINDUCTOR_CACHE_DIR is already set)
# export TRITON_CACHE_DIR=~/.cache/triton

# Add all three exports to ~/.bashrc to make them permanent
$ echo 'export TORCHINDUCTOR_CACHE_DIR=~/.cache/torchinductor' >> ~/.bashrc
$ echo 'export TORCHINDUCTOR_FX_GRAPH_CACHE=1' >> ~/.bashrc
$ echo 'export TORCHINDUCTOR_AUTOGRAD_CACHE=1' >> ~/.bashrc

With these set, a compiled model that took 60 seconds on first run will start in under 2 seconds on all subsequent runs on the same machine, as long as the kernel cache key has not changed.

expandable_segments reserves virtual address space up to 1⅛× your total GPU memory

When expandable_segments:True is set in PYTORCH_CUDA_ALLOC_CONF, the allocator uses CUDA's virtual memory APIs (cuMemAddressReserve, cuMemCreate, cuMemMap) to reserve a virtual address range up to 1.125× your total GPU VRAM on first allocation — without mapping physical memory. Physical pages are allocated on demand: 2 MB at a time for the small allocation pool, 20 MB at a time for the large allocation pool. When torch.cuda.empty_cache() is called (or during OOM recovery), unused physical pages can be individually unmapped and returned to CUDA via cuMemUnmap and cuMemRelease. This is fundamentally different from the default allocator, which can only release entire contiguous blocks. The practical effect: under expandable_segments, nvidia-smi shows significantly less reserved memory between model runs than without it, because physical pages that are no longer needed return to CUDA rather than sitting in PyTorch's cache. However, according to the PyTorch allocator source, tensors allocated under expandable_segments:True cannot be shared between processes — this includes DataLoader workers using num_workers > 0. For multi-process DataLoader setups, disable it in worker processes via torch.cuda.memory._set_allocator_settings('expandable_segments:False') inside the worker init function.
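A minimal sketch of the worker-side opt-out, using the private _set_allocator_settings call named above. The toy dataset and batch size are arbitrary, and the guard on torch.cuda.is_available() keeps the sketch runnable on CPU-only machines, where the setting is a no-op:

worker_init_sketch.py
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def disable_expandable_segments(worker_id):
    # Tensors allocated under expandable_segments:True cannot be shared
    # across processes, so each DataLoader worker turns the setting off.
    if torch.cuda.is_available():
        torch.cuda.memory._set_allocator_settings("expandable_segments:False")

ds = TensorDataset(torch.arange(8).float())
loader = DataLoader(ds, batch_size=4, num_workers=2,
                    worker_init_fn=disable_expandable_segments)

total = sum(batch[0].sum().item() for batch in loader)
print(total)  # 28.0 -- confirms the workers iterated the dataset normally
```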

nvidia-smi VRAM usage does not reflect what your model actually occupies

The "Memory-Usage" column in nvidia-smi shows how much GPU memory the PyTorch process has reserved from the CUDA driver, not how much is actually occupied by tensors. The difference is the caching allocator's internal free blocks -- memory that PyTorch is holding but not currently using, in anticipation of future allocations. On a 16 GB GPU running a 6 GB model, nvidia-smi might show 10 GB used while torch.cuda.memory_allocated() returns only 6 GB. The missing 4 GB is cached but inactive. To get a precise picture, compare the allocator's own counters:

terminal
(pytorch-env)$ python3 -c "
import torch
a = torch.cuda.memory_allocated(0) / 1024**3
r = torch.cuda.memory_reserved(0) / 1024**3
print(f'Allocated (tensors): {a:.2f} GB')
print(f'Reserved  (held by PyTorch): {r:.2f} GB')
print(f'Cached idle: {r - a:.2f} GB')
"

To generate a full visual memory snapshot that you can drag onto pytorch.org/memory_viz for interactive debugging of fragmentation and allocation history, per the PyTorch CUDA memory documentation:

memory_snapshot.py
import torch

# Start recording every allocator event
torch.cuda.memory._record_memory_history()

# ... run your model code here ...

# Save snapshot to disk
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
# Drag memory_snapshot.pickle to pytorch.org/memory_viz
# to see per-tensor allocation history and fragmentation timeline

CUDA_LAUNCH_BLOCKING=1 moves GPU errors to the right Python line

GPU kernels execute asynchronously. When a CUDA error occurs -- an illegal memory access, a device-side assert from an out-of-range index -- the Python exception is raised not at the offending line but at the next CUDA synchronization point, which may be several operations later. This makes GPU errors genuinely difficult to trace. Setting CUDA_LAUNCH_BLOCKING=1 forces every CUDA kernel to complete before Python moves to the next line. The error will now appear on the correct line. The cost is significant -- synchronous execution removes all GPU/CPU overlap and can slow training by 10× or more -- but it is indispensable for debugging. Use it only during debugging, never in production or benchmarking.

terminal
# Use only for debugging GPU errors -- causes 10x+ slowdown
$ CUDA_LAUNCH_BLOCKING=1 python3 your_script.py

# Errors will now appear on the exact Python line that caused them
# instead of at the next synchronization point

OMP_NUM_THREADS defaults to physical core count — set it to 1 for pure GPU workloads

PyTorch uses GNU OpenMP (libgomp) for CPU-side parallelism. By default, per the PyTorch performance tuning guide, OMP_NUM_THREADS is set to the number of available physical cores. On a machine with a GPU doing all the heavy computation, the default CPU thread count creates contention between PyTorch's OpenMP threads, DataLoader worker processes, and the GPU dispatch thread. For inference scripts where all math is on the GPU, setting OMP_NUM_THREADS=1 often reduces latency by eliminating CPU thread overhead. For CPU-heavy preprocessing pipelines, the default is appropriate.

terminal
# For GPU inference workloads: reduce CPU thread overhead
$ OMP_NUM_THREADS=1 python3 inference.py

# For DataLoader-heavy training: match workers to physical cores
# torch.set_num_threads(N) can also be called inside your script
$ OMP_NUM_THREADS=4 python3 train.py
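The torch.set_num_threads(N) call mentioned above is the in-script equivalent of the environment variable for PyTorch's intraop thread pool. A quick sketch; note that it does not affect DataLoader worker processes, only CPU math inside the main process:

thread_count.py
```python
import torch

print(torch.get_num_threads())  # defaults to the physical core count
torch.set_num_threads(1)        # pin to one thread for GPU-bound scripts
print(torch.get_num_threads())  # 1
```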

max_split_size_mb trades one fragmentation problem for another

The max_split_size_mb option in PYTORCH_CUDA_ALLOC_CONF prevents the allocator from splitting large cached blocks. When a block larger than the threshold is freed, it stays unsplit in the cache. This solves the classic fragmentation problem where a large block gets split into many small fragments that can never be recombined into a large block again. However, it introduces a different issue: small allocations can no longer borrow space from large freed blocks, so each small allocation must either find a matching small cached block or request a new one from CUDA. In practice, as documented by NVIDIA's CUDA Graph troubleshooting guide, the right value is workload-specific. A common starting point for large model inference is 512 MB, but it requires tuning against your actual allocation pattern. Running torch.cuda.memory_summary() and looking at the InactiveSplit row shows how much memory is stuck in split fragments — if that number is large relative to your total reserved memory, max_split_size_mb may help.

fragmentation_check.py
import torch

# Diagnose fragmentation: run after loading your model and warming it up
# ... load your model and run a forward pass here ...

print(torch.cuda.memory_summary())
# Look at the InactiveSplit row -- large values relative to total
# reserved memory indicate fragmentation

terminal
# If InactiveSplit is large, try max_split_size_mb
# Combine with expandable_segments for best results
$ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

How to Install PyTorch 2.11 on Linux with GPU Acceleration

Step 1: Create and activate a Python virtual environment

Run python3 -m venv pytorch-env to create an isolated environment in the current directory, then activate it with source pytorch-env/bin/activate. Always install PyTorch inside a virtual environment to avoid package conflicts with other projects. Upgrade pip immediately after activating: python3 -m pip install --upgrade pip.

Step 2: Identify your GPU backend, CUDA version, and GPU generation

For NVIDIA: run nvidia-smi and note the CUDA Version in the top-right corner. This is the maximum CUDA version your driver supports. Also check your GPU generation -- Volta (V100) and older must use the CUDA 12.6 wheel in PyTorch 2.11, as Volta was dropped from CUDA 12.8, 12.9, and newer builds. For AMD: confirm your ROCm version with cat /opt/rocm/.info/version. Then visit pytorch.org/get-started/locally/ and generate the correct pip install command.

Step 3: Install PyTorch with the correct wheel index URL

For NVIDIA CUDA 12.8 (Turing and newer): pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128. For NVIDIA CUDA 12.6 (Volta/V100, Pascal, Maxwell): pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126. For AMD ROCm 7.2 targeting PyTorch 2.11: pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.2. Always specify the index URL explicitly -- as of PyTorch 2.11, omitting it installs a CUDA 13.0 wheel from PyPI, which may fail on older drivers or GPU architectures.

Step 4: Verify GPU detection and run a tensor operation

Run python3 -c 'import torch; print(torch.cuda.is_available())'. A result of True confirms PyTorch has detected a CUDA or ROCm-capable GPU. Confirm the device name with python3 -c 'import torch; print(torch.cuda.get_device_name(0))'. Then run the inference_check.py script from the article to confirm a complete forward pass executes on GPU.
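The Step 4 checks can be combined into a single script. The filename gpu_check.py is illustrative; the script degrades gracefully on CPU-only installs rather than raising:

gpu_check.py
```python
import torch

print("PyTorch:", torch.__version__)
print("Built for CUDA:", torch.version.cuda)   # None on CPU-only wheels
print("GPU available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.rand(2, 3, device="cuda")
    print("Matmul OK:", (x @ x.T).shape)       # confirms a kernel actually ran
else:
    print("No GPU detected -- re-check the driver and wheel index URL")
```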

Frequently Asked Questions

Do I need to install the CUDA Toolkit separately to use PyTorch with an NVIDIA GPU on Linux?

For most use cases, no. When you install a CUDA-enabled PyTorch wheel via pip, the required CUDA runtime libraries are bundled inside the package. You do need the NVIDIA driver installed and functional (nvidia-smi must work), and the driver version must support the CUDA version that the PyTorch wheel was built against. A full system-wide CUDA Toolkit installation is only required if you are compiling custom CUDA C++ extensions or writing your own CUDA kernels alongside PyTorch.

How do I find which CUDA version to use for my PyTorch installation?

Run nvidia-smi and look at the CUDA Version shown in the top-right corner of the output. This is the maximum CUDA version your installed driver supports. Then visit pytorch.org/get-started/locally/ and select the CUDA version that matches or is below your driver's maximum. Also check your GPU generation -- Volta and older GPUs require the CUDA 12.6 wheel in PyTorch 2.11 regardless of what your driver reports.

torch.cuda.is_available() returns False even though I have an NVIDIA GPU. What is wrong?

The two most common causes in 2026 are: (1) a mismatch between the CUDA version PyTorch was built against and the version your driver supports, and (2) a GPU architecture that is not supported by the wheel you installed. Run nvidia-smi to confirm your driver is loaded and note the CUDA version shown. Then run python3 -c 'import torch; print(torch.version.cuda)' to see which CUDA version your wheel targets. If the wheel targets a newer CUDA version than your driver supports, reinstall with a matching wheel. If you have a Volta GPU (V100), note that PyTorch 2.11's CUDA 12.8 and 12.9 builds dropped Volta -- use the CUDA 12.6 index instead.

Should I use a virtual environment or conda for a PyTorch project on Linux?

Either works, but they have different strengths. Python venv is lightweight, fast to create, and sufficient for most PyTorch projects where you manage the driver stack yourself. Conda is useful when you need to manage non-Python dependencies alongside your packages, such as CUDA toolkit versions, FFmpeg, or libjpeg. As of PyTorch 2.6.0, the official conda channel no longer provides PyTorch builds -- this remains true in 2.11 -- so even in a conda environment you install PyTorch via pip. The key rule is to avoid mixing pip and conda installs carelessly in the same environment, as this can cause difficult-to-debug dependency conflicts.

What changed with pip install torch in PyTorch 2.11?

Starting with PyTorch 2.11, running pip install torch without an --index-url now installs a CUDA 13.0 wheel from PyPI by default -- not the CPU-only build as in earlier releases. Per the PyTorch 2.11.0 release notes, CUDA 13.0 only supports Turing (SM 7.5) and newer GPU architectures on Linux x86_64. Maxwell, Pascal, and Volta GPUs are excluded. Users on those architectures, or on systems whose driver does not yet support CUDA 13.0, must specify an explicit --index-url pointing to the CUDA 12.6 or 12.8 builds.

Does PyTorch 2.11 support NVIDIA Volta GPUs (V100)?

Not via the CUDA 12.8 or newer wheels. PyTorch 2.11 dropped Volta from its CUDA 12.8 and 12.9 builds to enable an upgrade to CuDNN 9.15.1. Volta users must install using the CUDA 12.6 wheel index: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126. The CUDA 12.6 builds retain full Volta (SM 7.0) support. Building from source with TORCH_CUDA_ARCH_LIST including 7.0 is the other option if you need a specific newer PyTorch version with Volta.

How do I upgrade PyTorch without accidentally installing the wrong wheel?

The safest upgrade pattern is to explicitly uninstall first, then reinstall with the same --index-url you used originally. Running pip install --upgrade torch without an index URL will fetch from PyPI, which as of PyTorch 2.11 installs a CUDA 13.0 wheel -- not appropriate for Volta GPUs or systems whose driver does not yet support CUDA 13.0. The correct sequence: pip uninstall torch torchvision torchaudio -y, then pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 (or whichever index your hardware requires). For team projects, pin the version explicitly (torch==2.11.0) and commit a requirements.txt produced by pip freeze so every developer gets the same build.

Does PyTorch work inside WSL2 on Windows?

Yes, and the pip install commands are identical to native Linux. WSL2 uses the host Windows NVIDIA driver directly -- there is no separate Linux driver to install inside WSL2. Run nvidia-smi inside your WSL2 terminal to confirm the driver is visible before proceeding. If nvidia-smi fails inside WSL2, your Windows NVIDIA driver needs updating rather than anything in the WSL2 environment. AMD ROCm is not supported inside WSL2; AMD GPU users should use a native Linux installation or the AMD ROCm Docker image instead. All steps in this guide from the venv section onward apply directly to WSL2 Ubuntu without modification.

What does torch.compile() do and should I use it?

torch.compile() is PyTorch 2.x's graph compilation system. It traces your model's forward pass and compiles it with TorchInductor, generating optimized GPU kernels via Triton. For standard models it typically delivers a 20–50% throughput improvement with a single line of code change: model = torch.compile(model). The first run after compile incurs a one-time compilation overhead of 30–90 seconds while kernels are generated; subsequent runs use the cached result. It works on both NVIDIA CUDA and AMD ROCm backends. If your compiled model raises errors that don't appear in eager mode, pass backend="eager" first as a diagnostic: if that succeeds, the issue is in TorchInductor rather than the model itself.

What are torchvision and torchaudio and do I need them?

torchvision provides pretrained computer vision models (ResNet, EfficientNet, ViT), standard datasets like ImageNet and CIFAR-10, and image transformation utilities. torchaudio provides audio processing tools, waveform transforms, and pretrained speech models. Neither is required if you are only working with raw tensors or NLP pipelines. You can omit either from the install command and add them later. The important constraint is version coupling: torchvision and torchaudio versions are tied to specific PyTorch versions. Always install all three in the same pip command to ensure compatible versions are selected. Installing them separately afterward can pull in mismatched versions that cause hard-to-diagnose import errors.

How do I use multiple GPUs with PyTorch on a single Linux machine?

The recommended approach for all new code is torch.nn.parallel.DistributedDataParallel (DDP), launched via torchrun --nproc_per_node=NUM_GPUS train.py. DDP runs one process per GPU, communicates through NCCL, and avoids the GIL contention and load-imbalance issues of the older API. torch.nn.DataParallel is still available and wraps a model with a single line (model = torch.nn.DataParallel(model)), but PyTorch's own documentation now classifies it as a legacy API with known inefficiencies -- avoid it for new projects. To check how many GPUs PyTorch can see: python3 -c "import torch; print(torch.cuda.device_count())". To limit which GPUs a script uses, set CUDA_VISIBLE_DEVICES=0,1 before launching Python -- PyTorch will then index those as devices 0 and 1 internally.