Running large language models locally on Linux has gone from an enthusiast experiment to a legitimate production workflow. API costs add up fast, and sensitive data shouldn't leave your machine; with Linux servers an active target for data-exfiltrating malware in 2026, keeping inference fully on-premises is increasingly a security decision, not just a cost one. Meanwhile, the open-weight model ecosystem (Llama 4, Qwen3, Gemma 3, Mistral, DeepSeek, and others) has reached a quality level where local inference is genuinely useful for everyday tasks. If you're just getting started with running AI workloads on Linux more broadly, it's worth surveying the toolchain landscape before diving into Ollama specifically.

Ollama makes the mechanics manageable. It wraps llama.cpp behind a clean command-line interface and REST API, handles model downloads, manages GPU memory allocation automatically, and runs as a systemd service on Linux. The hard part is getting GPU acceleration right -- especially on AMD hardware, where the ROCm software stack requires more manual attention than NVIDIA's comparatively frictionless CUDA path. As of Ollama v0.12.11, a Vulkan backend is also available as an experimental option for Intel GPUs and AMD cards not covered by ROCm.
This guide assumes you're on Ubuntu 22.04, 24.04, or a compatible Debian-based distribution. Most commands translate directly to Fedora and Arch with minor package manager substitutions noted where relevant.
"If the model fits on one GPU, Ollama loads it there."
-- Ollama official FAQ, docs.ollama.com -- this principle drives most of the VRAM sizing decisions in this guide; fitting entirely in VRAM means no PCIe bus transfers during inference
Prerequisites and Hardware Expectations
Before installing anything, it's worth being realistic about hardware requirements. Ollama will run on CPU-only hardware, but the experience is noticeably slower -- expect 2 to 8 tokens per second on a modern CPU with a 7B model, compared to 40 to 80 or more tokens per second on a mid-range GPU. For interactive use, a GPU matters.
General VRAM guidance for common model sizes at Q4_K_M quantization:
| Model Size | VRAM Required (Q4_K_M) | Example Models | Minimum GPU |
|---|---|---|---|
| 3B | ~2.5 GB | Llama 3.2 3B, Phi-4 Mini | Any modern GPU with 4 GB |
| 7-8B | ~5-6 GB | Llama 3.1 8B, Mistral 7B | RTX 3060 / RX 6600 XT |
| 13-14B | ~9-10 GB | Qwen2.5 14B, Phi-4 14B | RTX 3080 / RX 6800 XT |
| 32B | ~22-24 GB | Qwen 2.5 32B, DeepSeek 32B | RTX 4090 / RX 7900 XTX |
| 70B+ | 48 GB+ or CPU offload | Llama 3.1 70B, Qwen3 70B | Multi-GPU or high-RAM CPU |
These figures assume a default context window of 8K tokens. The KV cache grows linearly with context length. At 32K context, a 7-8B model at Q4_K_M will need roughly 4.5 GB of additional VRAM just for the cache on top of the model weights. If you need long contexts, account for this when sizing your hardware.
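The back-of-envelope arithmetic is straightforward. As a sketch, using architecture values typical of a Llama-3-class 8B model (32 layers, 8 KV heads, head dimension 128, f16 cache at 2 bytes per value -- these are assumptions; verify your model's metadata with ollama show):

```shell
# Estimate KV cache VRAM for a Llama-3-class 8B model with an f16 cache.
# Architecture values below are assumptions -- check your model's metadata.
layers=32; kv_heads=8; head_dim=128; bytes_per_val=2; ctx=32768

# Each token stores one K and one V vector per layer
per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_val ))
total=$(( per_token * ctx ))

echo "$(( per_token / 1024 )) KiB per token"
echo "$(( total / 1024 / 1024 / 1024 )) GiB of KV cache at ${ctx} tokens"
```

At f16 this lands around 4 GiB for a 32K context, consistent with the figure above; quantizing the cache (covered later in this guide) halves or quarters it.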
Mixture of Experts (MoE) models like Qwen3 30B-A3B, DeepSeek-V3, and Mixtral do not follow the dense model VRAM formula. A MoE model with 30B total parameters may activate only 3B parameters per forward pass (hence "30B-A3B" — 30B total, 3B active). The VRAM requirement is determined by total parameter count (all experts must fit in VRAM), but throughput and token generation speed reflect the active parameter count. This means Qwen3 30B-A3B requires approximately the same VRAM as a dense 30B model (~20 GB at Q4_K_M) but runs at speeds closer to a 3B dense model — a significant quality-per-second advantage over equivalently-sized dense models. If a model tag on ollama.com/library lists both a total and an active parameter count, it is a MoE architecture and deserves separate VRAM budgeting.
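To check how a tag you've already pulled is built, ollama show prints the model's metadata, including architecture, parameter count, context length, and quantization (the tag below is only an example; substitute one from your own ollama list output):

```shell
# Inspect a locally downloaded model's metadata
# (example tag -- use any tag shown by `ollama list`)
$ ollama show qwen3:30b
```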
"VRAM is a hard boundary, not a soft limit."
-- LocalLLM.in, Ollama VRAM Requirements Guide (2026) -- when a model overflows VRAM into system RAM, inference typically slows 5-20x
Installing Ollama on Linux
The official installer handles architecture detection, binary placement, and systemd service setup in a single step:
```bash
# Install Ollama
$ curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
$ ollama --version

# Check service status
$ systemctl status ollama
```
The installer creates a dedicated ollama system user, places the binary at /usr/local/bin/ollama, and registers a systemd service that starts on boot. When running as the systemd service, model files are stored under the ollama user's home at /usr/share/ollama/.ollama/models (a manually launched ollama serve uses ~/.ollama/models instead). If you want to store models on a different partition (common if your home directory is on a small SSD), set the OLLAMA_MODELS environment variable before starting the service:
```bash
# Create override directory and file
$ sudo mkdir -p /etc/systemd/system/ollama.service.d/
$ sudo tee /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_MODELS=/mnt/models"
EOF

# The ollama service user must be able to write to the new location
$ sudo mkdir -p /mnt/models
$ sudo chown -R ollama:ollama /mnt/models

$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama
```
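After the restart, a quick pull confirms new downloads land on the relocated path (the model tag here is just a small example):

```shell
# Pull a small model and confirm its blobs appear under the new path
$ ollama pull llama3.2:3b
$ ls /mnt/models/blobs | head
```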
Finding models and updating Ollama
The full model catalog is at ollama.com/library. Each model page shows available quantization tags, parameter counts, and context length. The CLI has no search subcommand, so browse the library in a web browser; managing what you've already downloaded happens in the terminal:

```bash
# List all locally downloaded models
$ ollama list

# Remove a model to free disk space
$ ollama rm llama3.1:8b
```
To update Ollama itself, re-run the official installer script. It detects the existing installation and upgrades the binary in place without removing your downloaded models or service configuration:
```bash
# Update Ollama to the latest release
$ curl -fsSL https://ollama.com/install.sh | sh

# Confirm the new version
$ ollama --version
```
Model files are stored separately from the Ollama binary and are not affected by updates. Re-running the installer will not delete downloaded models. If you pinned a specific model version with a quantization tag, it remains in the models directory after the upgrade.
NVIDIA GPU Setup (CUDA)
NVIDIA is the more straightforward path. Ollama detects CUDA automatically when the NVIDIA driver is installed and nvidia-smi is functional. Your GPU must have Compute Capability 5.0 or higher (Maxwell architecture or newer -- this includes GeForce GTX 750, GTX 900-series, and all RTX cards). The official minimum driver version is 531, though a recent stable driver in the 550+ range is a widely used baseline for current Ollama releases.
Verify driver status
```bash
# Check driver version and GPU detection
$ nvidia-smi
# Expected output includes driver version and GPU name
# If this fails, the driver is not installed or not loaded
```
If nvidia-smi fails, install the driver through Ubuntu's package manager rather than the NVIDIA runfile installer -- the package manager handles kernel module updates automatically:
```bash
# Check recommended drivers for your hardware
$ ubuntu-drivers devices

# Install the recommended driver automatically (preferred method)
$ sudo ubuntu-drivers autoinstall

# Or install a specific version -- check ubuntu-drivers devices for the
# current recommended version; the number below may not be current
$ sudo apt install nvidia-driver-570

# Reboot to load the kernel module
$ sudo reboot
```
A driver version mismatch between the host driver and the CUDA toolkit is the most common cause of Ollama failing to detect an NVIDIA GPU. If you're using Docker for any AI workloads, the NVIDIA Container Toolkit also needs to be installed separately -- the driver alone is not sufficient for containerized GPU access.
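For reference, with the toolkit installed and Docker configured to use it, a containerized Ollama with GPU access looks like this (the apt package assumes NVIDIA's repository is already added; see NVIDIA's Container Toolkit docs for the repo setup):

```shell
# Install the NVIDIA Container Toolkit (requires NVIDIA's apt repo)
$ sudo apt install nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker

# Run Ollama with all GPUs visible inside the container
$ docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
    --name ollama ollama/ollama
```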
If you run a desktop compositor (Wayland/X11) or other GPU-accelerated applications on the same machine, Ollama's scheduler will attempt to use all available VRAM and may starve those processes. Use OLLAMA_GPU_OVERHEAD to reserve a fixed number of bytes. For example, to reserve 1 GB: Environment="OLLAMA_GPU_OVERHEAD=1073741824" (value is in bytes). This tells the scheduler that 1 GB is off-limits and it will size models accordingly.
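A sketch of the full override, computing the byte value rather than hand-typing it (1 GiB here; adjust the reservation to your desktop's needs):

```shell
# Compute the byte value for a 1 GiB reservation
$ echo $(( 1 * 1024 * 1024 * 1024 ))   # 1073741824

# Persist it in a systemd drop-in (filename is illustrative)
$ sudo tee /etc/systemd/system/ollama.service.d/gpu-overhead.conf <<EOF
[Service]
Environment="OLLAMA_GPU_OVERHEAD=1073741824"
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama
```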
Confirm GPU usage after pulling a model
```bash
# Pull the 8B model (Q4_K_M quantization, ~4.7 GB download)
$ ollama pull llama3.1:8b

# Run it interactively
$ ollama run llama3.1:8b

# In a separate terminal, verify GPU usage
$ ollama ps
# PROCESSOR column shows: 100% GPU (fully on GPU), 100% CPU,
# or a split like 52%/48% GPU/CPU

# Cross-check with nvidia-smi
$ nvidia-smi
# Look for ollama runner or ollama_llama_server in the Processes section
# (process name varies by Ollama version; either confirms GPU usage)
```
The server log also records how many layers are offloaded to the GPU. A value of 0 means CPU-only inference is running despite a GPU being present:
```bash
# Watch the service log while a model loads (the systemd install logs to the journal)
$ journalctl -u ollama -f | grep -E "n_gpu_layers|model layers"
# Old engine format:  n_gpu_layers = 33  -> all layers on GPU (healthy)
# New engine format:  "model layers"=33 requested=-1
# Either format: a value of 0 means CPU-only; a partial value means VRAM pressure
```
AMD GPU Setup (ROCm)
AMD ROCm support on Linux has improved considerably, but it requires more deliberate setup than the NVIDIA path. Ollama requires ROCm v7 on Linux as of current releases (v0.12.11+). Windows AMD GPU acceleration through Ollama is not officially supported; the ROCm path is Linux-only. If you have an AMD GPU that is not on the official ROCm support list, see the HSA override section below, or consider the Vulkan backend as an alternative.
Installing ROCm
The recommended installation method for Ubuntu uses AMD's official package repository:
```bash
# Ubuntu 24.04 (noble) -- ROCm v7.2.1 (current production release)
# URL and filename verified against AMD's official Linux Drivers page on 2026-04-10
# Check https://www.amd.com/en/support/download/linux-drivers.html before running
$ wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
$ sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
$ sudo apt update

# Install required Python build dependencies
$ sudo apt install python3-setuptools python3-wheel

# Install ROCm runtime stack
$ sudo amdgpu-install --usecase=rocm

# Add your user to the render and video groups
$ sudo usermod -a -G render,video $LOGNAME

# Reboot to activate driver and group membership
$ sudo reboot

# Verify ROCm sees the GPU
$ rocminfo | grep -A2 "Agent"
```
The package URL above is for Ubuntu 24.04 (noble); for Ubuntu 22.04 (jammy), replace noble with jammy in the URL. Package versions change with each release, so confirm the current filename at AMD's Linux Drivers page or the ROCm quick-start documentation before running. Use sudo apt install ./filename.deb rather than sudo dpkg -i so dependencies are resolved automatically.
Supported AMD GPUs
Ollama's ROCm backend officially supports the following Radeon GPU families: RX 7000 series (RDNA 3), RX 6000 series (RDNA 2), Vega 64, and the Radeon PRO and Instinct data center lines. If your card is not on the official support list, that doesn't necessarily mean it won't work -- it means ROCm's compatibility checks will block it unless you override them.
The HSA_OVERRIDE_GFX_VERSION workaround for unsupported GPUs
ROCm hardware compatibility checks use a GFX version string to determine whether a GPU is supported. If your card is newer or older than the official support window, you can override this check by setting HSA_OVERRIDE_GFX_VERSION to the GFX version of the nearest officially supported card. The override bypasses the check entirely -- Ollama will attempt to use the GPU as if it were the target architecture.
To find your card's GFX version:
```bash
# List GPU agents and their GFX versions
$ rocminfo | grep "gfx"

# Example output for an RX 5500 XT (gfx1012, not officially supported):
#   Name: gfx1012
# The nearest supported version is gfx1030, so set:
#   HSA_OVERRIDE_GFX_VERSION=10.3.0
# For RDNA3 cards (gfx1100, gfx1101), the nearest supported is 11.0.0
```
Set the override in the Ollama systemd service override file so it persists across reboots:
```
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
```
If you have multiple AMD GPUs with different GFX versions, append the device index to set overrides individually:
```
# Per-device overrides for mixed AMD GPU systems
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION_0=10.3.0"
Environment="HSA_OVERRIDE_GFX_VERSION_1=11.0.0"
```
The Ollama documentation shows HSA_OVERRIDE_GFX_VERSION_0 and HSA_OVERRIDE_GFX_VERSION_1 (zero-indexed). However, confirmed real-world testing on mixed AMD GPU systems reports that the per-device variable is actually 1-indexed in practice — meaning the first GPU uses HSA_OVERRIDE_GFX_VERSION_1 and the second uses HSA_OVERRIDE_GFX_VERSION_2. If zero-indexed overrides have no effect on your second GPU, switch to 1-indexed numbering. This appears to be an undocumented behavioral quirk where HSA_OVERRIDE_GFX_VERSION (no suffix) and HSA_OVERRIDE_GFX_VERSION_1 both address the first device, making the effective device numbering start at 1. Verify by watching journalctl -u ollama -f immediately after a restart to see which override takes effect.
```bash
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama

# Verify the override is active in the service log
$ journalctl -u ollama -n 30 | grep -i "rocm\|gfx\|override"
```
Without HSA_OVERRIDE_GFX_VERSION, Ollama silently falls back to CPU inference if the GPU fails the ROCm compatibility check. There's no error message -- the model simply runs slower than expected. Always verify GPU usage with ollama ps after any driver or configuration change.
Adding your user to the render and video groups is required for ROCm GPU access, but group membership is a security surface worth auditing periodically on any multi-user machine. The Linux user permissions audit guide covers the commands to check which users belong to hardware-access groups and whether those memberships are still appropriate.
The HSA_OVERRIDE_GFX_VERSION variable maps to an LLVM target, not a hardware feature set. When you set it to 10.3.0, ROCm compiles kernels targeting the gfx1030 instruction set. Your actual card will run them if it is instruction-compatible — most minor-revision RDNA cards are. The risk is that shader model features introduced in your card's revision may not be exercised, occasionally producing lower performance than a native match. If the override works but generation speed is slower than expected, try the nearest higher supported GFX version instead of the nearest lower one.
Vulkan Backend (Intel and Unsupported AMD GPUs)
Since Ollama v0.12.11, a Vulkan GPU backend is available as an opt-in experimental feature. Vulkan covers hardware that neither CUDA nor ROCm can reach: Intel integrated and discrete GPUs, AMD GPUs outside the ROCm support matrix, and any GPU with working Vulkan 1.1+ drivers. When both ROCm and Vulkan are available, Ollama prioritizes the native vendor backend (ROCm for AMD) unless you override it.
To enable Vulkan, set OLLAMA_VULKAN=1 in the Ollama systemd service environment:
```
[Service]
Environment="OLLAMA_VULKAN=1"
```
```bash
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama

# Confirm the Vulkan GPU is detected
$ journalctl -u ollama -n 30 | grep -i "vulkan\|library"
# Look for: library=Vulkan name=Vulkan0
```
Installing Vulkan drivers on Linux
On Linux, Vulkan drivers are not always included by default. Most NVIDIA GPUs have Vulkan support bundled with the proprietary driver. AMD and Intel require an explicit install:
```bash
# AMD: Mesa Vulkan driver (open-source, covers most Radeon GPUs)
$ sudo apt install mesa-vulkan-drivers vulkan-tools

# Intel: Mesa Vulkan + media driver (iGPU and Arc discrete)
$ sudo apt install mesa-vulkan-drivers intel-media-va-driver vulkan-tools

# Verify a Vulkan device is visible
$ vulkaninfo --summary 2>/dev/null | grep deviceName
# If vulkaninfo is not found, install vulkan-tools first:
#   sudo apt install vulkan-tools
```
For AMD: AMD also publishes vendor-specific Vulkan packages through the amdgpu-install utility using --usecase=graphics. These may give better performance than Mesa on supported hardware. See AMD's amdgpu-install documentation for details.
For Intel: Intel Arc discrete GPUs and 12th-gen+ integrated graphics can use the Intel open-source drivers. See Intel's GPU driver documentation for the current setup steps for your distribution.
Vulkan is still marked experimental in Ollama. Performance is generally lower than CUDA or ROCm. Some hardware -- particularly Intel integrated graphics -- produces unreliable output with models larger than roughly 1B parameters. If inference output looks like garbled text, try adding OLLAMA_FLASH_ATTENTION=0 to the service environment, or fall back to CPU by setting OLLAMA_NUM_GPU=0. Vulkan is most reliable on dedicated AMD and NVIDIA cards.
Additionally, Vulkan requires extra Linux capabilities or running as root to expose accurate VRAM data to the Ollama scheduler. Without those capabilities, the scheduler uses approximate model sizes rather than real available VRAM data, which can result in suboptimal GPU placement decisions. To target a specific Vulkan device, use GGML_VK_VISIBLE_DEVICES with the numeric device ID shown in vulkaninfo --summary.
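For example, to pin Ollama to a specific Vulkan device, enumerate the IDs and set the variable in the service override (the device index below is illustrative; use the ID from your own vulkaninfo output):

```shell
# Enumerate Vulkan devices and their numeric IDs
$ vulkaninfo --summary | grep -E "GPU[0-9]|deviceName"

# Pin Ollama to device 1 (example index) in a systemd drop-in
$ sudo tee /etc/systemd/system/ollama.service.d/vulkan-device.conf <<EOF
[Service]
Environment="OLLAMA_VULKAN=1"
Environment="GGML_VK_VISIBLE_DEVICES=1"
EOF
$ sudo systemctl daemon-reload && sudo systemctl restart ollama
```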
Monitoring AMD GPU usage
```bash
# Install amdgpu_top (Rust-based, more detailed than radeontop)
# Not in Ubuntu apt repos -- download the .deb from the GitHub releases page:
#   https://github.com/Umio-Yasuno/amdgpu_top/releases
$ sudo apt install ./amdgpu_top_X.X.X_amd64.deb   # replace with actual filename

# Or install via Cargo if you have the Rust toolchain installed:
$ cargo install amdgpu_top

# Run while Ollama is generating to see VRAM activity
$ amdgpu_top

# radeontop is in the Ubuntu apt repos as a lighter-weight alternative
$ sudo apt install radeontop
$ radeontop
```
Choosing the Right Quantization
GGUF quantization determines how aggressively the model's weight values are compressed. Lower bit quantization means smaller VRAM footprint and faster loading, at the cost of some output quality. The tradeoff is gentler than it sounds -- the difference between Q8 and Q4_K_M is subtle for most use cases, while the VRAM savings are significant.
There are two structurally different quantization families in GGUF, and the distinction matters more than most guides acknowledge:
K-quants (Q4_K_M, Q5_K_M, Q6_K, etc.) use a two-level scheme: weights are grouped into 32-weight blocks, each with its own scale and zero-point. Those blocks are then grouped into 256-weight super-blocks with an additional scale applied on top. The mixed suffix denotes that different layer types get different bit depths -- attention and output layers receive more bits than feed-forward layers, following a fixed heuristic baked into the quantization preset. This is why Q4_K_M outperforms naive Q4_0 at the same bit depth: the layerwise allocation is more intelligent.
IQ-quants (IQ2_XS, IQ3_XS, IQ4_XS, IQ4_NL, etc.) go further. Instead of a fixed heuristic for which layers get more bits, they use an importance matrix -- a calibration file generated by running representative text through the model before quantization and measuring which weights most strongly influence output quality. Precision is then allocated to high-impact weights at the expense of low-impact ones. The result: IQ4_XS sits at roughly 4.3 bits per weight versus Q4_K_M's 4.5 bpw, saving approximately 400 MB on a 7B model at similar output quality. Blind preference testing on Mistral 7B places IQ4_XS between Q4_K_M and Q5_K_M in quality while being smaller than both.
IQ-quants require more GPU compute per weight to dequantize than K-quants. The lookup-table-based decoding of IQ formats is more demanding than the affine transforms used by K-quants. On a GPU doing full VRAM inference this is irrelevant — the compute overhead is negligible compared to matrix multiply. But on CPU-offloaded layers, the extra decode cost can make IQ-quants slower than K-quants of equivalent quality, despite being smaller. If you are running a model where even a fraction of layers spill to CPU, benchmark IQ versus K before committing to IQ. The smaller file size only helps throughput if you are bound by memory bandwidth, not compute.
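A quick way to run that benchmark is ollama run --verbose, which prints prompt and generation throughput after each response. The tags below are illustrative -- confirm which quantization tags your model actually publishes on its ollama.com/library page:

```shell
# Compare generation speed between a K-quant and an IQ-quant of the same model.
# Tags are examples -- check the model's library page for real ones.
$ ollama run llama3.1:8b-instruct-q4_K_M --verbose \
    "Summarize RAID levels in one paragraph."
# The timing summary after the response includes an "eval rate" line --
# that is the generation throughput in tokens/s.

$ ollama run llama3.1:8b-instruct-iq4_XS --verbose \
    "Summarize RAID levels in one paragraph."
```

Run each a few times and compare the eval rate lines; with CPU-offloaded layers, the K-quant often wins despite the larger file.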
| Quantization | Type | Quality | VRAM vs Q8 | Recommended For |
|---|---|---|---|---|
| Q2_K | K-quant | Poor | ~35% | Avoid unless severely VRAM-constrained |
| IQ3_XS | IQ-quant | Fair | ~40% | Better than Q3_K_M at same size; requires imatrix |
| Q4_K_S | K-quant | Good | ~55% | Budget GPUs, 4-6 GB VRAM |
| IQ4_XS | IQ-quant | Very good | ~54% | Slightly smaller than Q4_K_M at similar quality; full-GPU inference only |
| Q4_K_M | K-quant | Very good | ~57% | General use -- recommended default |
| Q5_K_M | K-quant | Excellent | ~68% | When quality matters and VRAM allows |
| Q6_K | K-quant | Near-lossless | ~80% | High-VRAM cards, quality-critical tasks |
| Q8_0 | Legacy (type-0) | Near-original | 100% | Benchmarking, maximum fidelity |
"Models that fit entirely in VRAM are dramatically faster."
-- dev.to/rosgluk, RTX 4080 benchmark (Ollama 0.15.2, 16 GB VRAM) -- the benchmark measured GPT-OSS 20B at ~140 tokens/sec fully in VRAM versus GPT-OSS 120B with heavy CPU offload at ~13 tokens/sec: roughly an 11x gap
To pull a specific quantization variant, append the tag to the model name:
```bash
# Default pull (Ollama chooses quantization for your hardware)
$ ollama pull llama3.1:8b

# Explicitly request Q4_K_M
$ ollama pull llama3.1:8b-instruct-q4_K_M

# Higher quality if VRAM allows
$ ollama pull llama3.1:8b-instruct-q5_K_M

# List downloaded model variants
$ ollama list
```
Customizing Models with Modelfiles
A Modelfile lets you create a named variant of any model with custom system prompts, temperature settings, context length, and other parameters. This is useful when you have a consistent use case -- a coding assistant that always uses concise output, or a research assistant with a specific persona.
Undocumented Modelfile parameters with real performance impact
The official Modelfile documentation covers the obvious parameters -- temperature, num_ctx, repeat_penalty. The following parameters are accepted by the parser and passed to the llama.cpp runner, but are either undocumented or poorly documented in the official Ollama docs. Each addresses a specific problem that the common parameters cannot.
num_gpu — controls exactly how many model layers to offload to GPU. Ollama's automatic detection is optimal when fitting the full model in VRAM, but num_gpu is the correct tool when you want deliberate, quantified partial offload. Benchmarked result on an RTX 4060 8 GB with Qwen3 8B Q4_K_M at 16K context: automatic full-GPU offload used 7.2 GB VRAM and ran at 40.58 tokens/sec. Setting PARAMETER num_gpu 25 reduced VRAM to 4.8 GB but dropped to 8.62 tokens/sec -- a 4.7× speed penalty for a 2.4 GB VRAM saving. Use this intentionally when you need to run two models simultaneously and must budget VRAM, not as a general optimization.
num_thread — controls the CPU thread count for CPU-offloaded layers and CPU-only inference. Defaults to auto (all logical cores). On hyperthreaded CPUs, logical core count is 2× physical core count. Due to cache contention between sibling threads sharing an L3 cache slice, setting num_thread to the physical core count rather than the logical count often improves CPU inference throughput by 10–20%. Find physical core count on Linux: lscpu | grep "Core(s) per socket".
num_batch — controls the prefill batch size: how many tokens are processed in parallel during the prompt ingestion phase. Default is 512. Increasing to 1024 or 2048 on high-VRAM cards can significantly speed up long-prompt processing (sending a 10,000-token document) at the cost of a VRAM spike during prefill. Reduce below 512 if you see OOM errors on long prompts that don't occur on shorter ones -- the prefill peak is where large batch sizes bite.
use_mlock — prevents the OS from paging model weights to swap under memory pressure. When set to true, the kernel is instructed to pin the model's memory pages and refuse to swap them out. This is critical on systems where other processes compete for RAM during long idle periods -- without it, the kernel can evict model pages to swap, causing the first request after an idle period to take several seconds longer than expected even though ollama ps still shows the model as "loaded." Note that mlock requires the process to have CAP_IPC_LOCK or sufficient ulimit -l -- the Ollama systemd service unit sets this up automatically on most Ubuntu installations.
use_mmap -- controls how the GGUF file is loaded from disk. When true (default), the file is memory-mapped: the OS handles page faults on demand, loading only the portions of the model file accessed during inference. On fast NVMe this is generally fine and reduces peak RAM usage during loading. On slow HDD, network-attached storage, or NFS mounts, mmap triggers expensive random-access page faults during inference at unpredictable times, causing stalls that don't show up in VRAM or CPU metrics. Setting PARAMETER use_mmap false forces the full model into RAM before inference begins -- startup is slower but inference latency becomes consistent.
```
FROM llama3.1:8b

# Explicit GPU layer count (25 of 33 layers on GPU; adjust to fit your VRAM budget)
# Remove this line to let Ollama auto-detect (recommended unless budgeting VRAM deliberately)
PARAMETER num_gpu 25

# Thread count for CPU layers -- use physical core count, not logical
# lscpu | grep "Core(s) per socket" to find the physical count
PARAMETER num_thread 8

# Prefill batch size -- increase for faster long-prompt ingestion on high-VRAM cards
PARAMETER num_batch 1024

# Pin model pages in RAM -- prevents swap eviction during idle periods
PARAMETER use_mlock true

# Disable mmap -- forces full load before inference; eliminates random page faults
# Recommended when the model is on HDD, NFS, or network-attached storage
PARAMETER use_mmap false

PARAMETER num_ctx 8192
PARAMETER temperature 0.3
```
When programmatically cycling through multiple models (e.g., a benchmark script that loads model A, queries it, then loads model B), Ollama's scheduler polls for VRAM recovery after each eviction using a 90% threshold -- it waits until 90% of the expected freed VRAM has been returned before loading the next model. VRAM fragmentation in the CUDA or ROCm memory allocator can cause the reported available VRAM to be lower than the true physical free VRAM, triggering CPU offload for models that previously fit. The fix is to send "keep_alive": 0 in the final request to each model before loading the next, and add a 1-2 second delay between loads to allow the allocator to consolidate. This is a known issue in sequential automation workflows that doesn't appear in normal interactive use.
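In an automation script, the eviction request is an ordinary generate call with an empty prompt and keep_alive set to 0 (the model names below are placeholders for whatever your script cycles through):

```shell
# Explicitly evict model A, give the allocator a moment, then load model B.
# Model names are placeholders.
$ curl -s http://localhost:11434/api/generate \
    -d '{"model": "llama3.1:8b", "keep_alive": 0}'
$ sleep 2
$ curl -s http://localhost:11434/api/generate \
    -d '{"model": "qwen2.5:14b", "prompt": "ping", "keep_alive": "5m"}'
```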
Key environment variables for performance tuning
Several environment variables control Ollama's behavior at the server level. Set these in the systemd service override file the same way as GPU-related variables:
| Variable | Default | Effect |
|---|---|---|
| OLLAMA_FLASH_ATTENTION | 0 | Set to 1 to enable flash attention, which reduces VRAM usage for the KV cache. Speeds up inference on most supported hardware. Not compatible with Vulkan on some Intel iGPUs. |
| OLLAMA_NUM_PARALLEL | auto | Number of parallel request processing slots. Each slot multiplies KV cache VRAM: a 2K context with 4 parallel slots allocates an effective 8K context worth of KV cache. Leave at default for single-user local use. |
| OLLAMA_KEEP_ALIVE | 5m | How long a loaded model stays in VRAM after the last request. Set to 0 to unload immediately, or -1 to keep loaded indefinitely. Individual API requests can override this with a per-request keep_alive field, and the per-request value always wins. |
| OLLAMA_MAX_LOADED_MODELS | 3 (GPU) / 1 (CPU) | Maximum number of concurrently loaded models. On GPU, each additional model must fit completely in remaining VRAM; models that require CPU offload cannot be loaded alongside another model. |
| OLLAMA_CONTEXT_LENGTH | auto (4K/32K/256K by VRAM) | Global default context length for all models. Without this, Ollama auto-selects based on available VRAM. Override when you want a specific context size regardless of available memory. |
| OLLAMA_GPU_OVERHEAD | 0 | Reserves a fixed number of bytes of VRAM for other processes. Ollama subtracts this from the VRAM available for model scheduling. Useful on machines running a desktop compositor or other GPU workloads alongside Ollama. |
| OLLAMA_SCHED_SPREAD | false | When true, forces the scheduler to spread a model across all available GPUs rather than packing it onto the fewest GPUs that can fit it. Useful when you want consistent multi-GPU utilization. |
| OLLAMA_MULTIUSER_CACHE | false | Optimizes KV cache prompt prefix sharing for multi-user scenarios. When multiple users send requests with an identical prefix (such as a shared system prompt), the computed KV cache for that prefix is reused across requests rather than recomputed. |
| OLLAMA_LOAD_TIMEOUT | 5m | Stall detection timeout during model loading. If loading makes no progress for this duration, Ollama considers it failed. Set to 0 for no timeout -- useful for very large models on slow storage. |
| OLLAMA_MODELS | ~/.ollama/models | Overrides the model storage path. Useful when the default location is on a small SSD. |
| OLLAMA_NUM_GPU | auto | Number of GPU layers to offload. Set to 0 to force CPU-only mode -- useful for troubleshooting or benchmarking CPU vs GPU inference speed. |
Enabling OLLAMA_FLASH_ATTENTION=1 is one of the highest-impact single changes you can make on supported hardware. It reduces KV cache VRAM requirements significantly, which directly increases the maximum context length you can run without overflow to system RAM. Set it globally in the service override file rather than per-model.
```
# Base model
FROM llama3.1:8b

# System prompt applied to every conversation
SYSTEM """
You are a Linux sysadmin assistant. Answer concisely.
Prefer shell commands over explanations when both would work.
Always specify the Linux distribution if the command differs by distro.
"""

# Lower temperature for more deterministic output
PARAMETER temperature 0.3

# Increase context window (uses more VRAM)
PARAMETER num_ctx 16384

# Reduce repetition
PARAMETER repeat_penalty 1.2
```
```bash
# Create the custom model from the Modelfile above
$ ollama create sysadmin-assistant -f ~/Modelfile

# Use it
$ ollama run sysadmin-assistant
```
Advanced Tuning: KV Cache Quantization and API Exposure
Quantizing the KV cache to reclaim VRAM
This is something most Ollama guides skip entirely. When flash attention is enabled, you can also quantize the KV (key-value) cache itself -- not just the model weights -- using the OLLAMA_KV_CACHE_TYPE environment variable. The default is f16 (16-bit float). Switching to q8_0 or q4_0 halves or quarters the VRAM used by the KV cache, at the cost of a small amount of generation quality at very long contexts.
In practice, the quality impact at q8_0 is barely measurable for typical conversational use. The practical effect is that you can run a longer context window on the same hardware, or load a larger model that would otherwise overflow:
```
[Service]
# Enable flash attention (required for KV cache quantization)
Environment="OLLAMA_FLASH_ATTENTION=1"
# Quantize KV cache: f16 (default), q8_0 (half the VRAM), q4_0 (quarter)
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
```
Combining OLLAMA_FLASH_ATTENTION=1 with OLLAMA_KV_CACHE_TYPE=q8_0 is particularly effective on cards in the 8-12 GB VRAM range. An RTX 3060 (12 GB) can run a 7B model at 32K context without overflow where it would otherwise spill to system RAM at 16K context or higher. This is one of the highest-leverage free optimizations available on consumer hardware.
Setting OLLAMA_KEEP_ALIVE=-1 in the service environment does not guarantee a model stays loaded. Any API client -- including Open WebUI and Continue -- can send "keep_alive": 0 in its request body, which evicts the model from VRAM immediately after the response, overriding your global setting. The per-request keep_alive field always wins. To verify a model is actually pinned rather than just showing a timer, query the API with curl http://localhost:11434/api/ps and check the expires_at field in the response. A value of "0001-01-01T00:00:00Z" (the Go null timestamp) means the model is pinned indefinitely. Any real timestamp means a scheduled eviction, regardless of your global setting.
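To script the pinned-versus-timed check rather than eyeballing curl output, a small parser works; the sample below mimics the shape of an /api/ps response (field subset, illustrative values):

```python
import json

def pinned_models(ps_json: str):
    """Parse /api/ps output: True if pinned indefinitely, False if a real
    eviction timestamp is scheduled. Go's null time starts with 0001-01-01."""
    models = json.loads(ps_json).get("models", [])
    return {m["name"]: m.get("expires_at", "").startswith("0001-01-01")
            for m in models}

# Illustrative payload shaped like a real /api/ps response:
sample = '''{"models": [
  {"name": "llama3.1:8b", "expires_at": "0001-01-01T00:00:00Z"},
  {"name": "qwen3:14b",  "expires_at": "2026-04-01T12:34:56Z"}
]}'''
print(pinned_models(sample))  # {'llama3.1:8b': True, 'qwen3:14b': False}
```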
When you set OLLAMA_NUM_PARALLEL=4, Ollama doesn't just process 4 requests at once -- it allocates KV cache for an effective context 4x your num_ctx value. A 2K context with 4 parallel slots consumes the same KV cache VRAM as a single 8K context. The default for OLLAMA_NUM_PARALLEL is memory-dependent: Ollama auto-selects 4 when VRAM is plentiful and drops to 1 on memory-constrained hardware. If you are on a tight VRAM budget and notice the model unexpectedly spilling to CPU, check your OLLAMA_NUM_PARALLEL setting before reducing model size -- Ollama may have silently auto-selected 4 parallel slots, multiplying your effective KV cache footprint. You can see which value Ollama selected by running OLLAMA_DEBUG=1 ollama serve and looking for the parallelism line in the startup output.
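The multiplication is simple, but it is easy to forget when budgeting VRAM; making it explicit:

```python
def effective_kv_context(num_ctx: int, num_parallel: int) -> int:
    """Each parallel slot gets its own num_ctx worth of KV cache,
    so the KV allocation behaves like one context of this size."""
    return num_ctx * num_parallel

# 2K context x 4 slots costs as much KV cache as a single 8K context:
print(effective_kv_context(2048, 4))  # 8192
```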
Exposing the Ollama API on your local network safely
By default, Ollama binds to 127.0.0.1:11434 -- localhost only. If you want to reach it from other machines on your LAN (a Raspberry Pi running Open WebUI, a laptop querying your desktop's GPU), you can bind to all interfaces. Do this only on a trusted private network and behind a firewall. The Ollama API has no built-in authentication. If you're running Ollama inside Docker, be aware that Docker's networking architecture can bypass UFW rules -- read that guide before opening ports in containerized deployments.
```ini
[Service]
# Bind to all interfaces -- exposes API on port 11434 with NO auth
Environment="OLLAMA_HOST=0.0.0.0:11434"
```
```shell
# Allow access only from your local subnet (adjust CIDR to match your network)
$ sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp

# Verify the rule was added (default UFW policy is already DENY for unmatched traffic)
$ sudo ufw status

# Reload UFW
$ sudo ufw reload

# Verify from another machine on the LAN
$ curl http://<your-server-ip>:11434/api/tags
```
UFW is convenient for simple subnet rules, but if you are hardening a production machine, nftables gives you finer-grained control: per-source rate limiting, dynamic sets for blocklists, and stateful connection tracking that UFW abstracts away. The example nftables rules guide covers building a default-deny ruleset from scratch, including the kind of port-specific allow rules you would use to restrict port 11434 to a specific subnet.
If you need authenticated external access -- for instance, reaching your home server from outside your LAN -- the cleanest approach is to put the machine behind a WireGuard VPN and access it over the tunnel rather than exposing port 11434 at all. If you do need a public-facing endpoint, put Nginx or Caddy in front with HTTP Basic Auth or mutual TLS. The Ollama project has noted that built-in auth is on the roadmap but not yet shipped as of v0.20. For a broader look at locking down a Linux server that runs services like this, the zero-trust Linux implementation guide covers nftables microsegmentation, systemd sandboxing, and continuous audit logging in one place.
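A minimal sketch of the reverse-proxy approach with Nginx and HTTP Basic Auth -- the hostname, certificate paths, and htpasswd file below are placeholders you would substitute for your own:

```nginx
server {
    listen 443 ssl;
    server_name ollama.example.com;           # placeholder
    ssl_certificate     /etc/ssl/ollama.crt;  # placeholder paths
    ssl_certificate_key /etc/ssl/ollama.key;

    location / {
        auth_basic           "Ollama API";
        auth_basic_user_file /etc/nginx/.htpasswd;  # create with: htpasswd -c
        proxy_pass           http://127.0.0.1:11434;
        proxy_buffering      off;   # keep streamed tokens flowing immediately
        proxy_read_timeout   300s;  # long generations exceed default timeouts
    }
}
```

Keep Ollama itself bound to 127.0.0.1 in this setup so the proxy is the only way in.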
Using the Ollama REST API
Ollama exposes a REST API at http://localhost:11434 that any application or script can call directly. The two endpoints you'll use most are /api/generate for single-turn completions and /api/chat for multi-turn conversations with message history. Both support streaming responses via server-sent events.
```shell
# Single-turn generation ("stream": false returns one JSON document;
# set it to true for token-by-token streaming output)
$ curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Explain KV cache in one paragraph.", "stream": false}'

# Multi-turn chat with message history
$ curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {"role": "user", "content": "What is ROCm?"}
    ]
  }'

# List all available local models
$ curl http://localhost:11434/api/tags

# Check which models are currently loaded in VRAM
$ curl http://localhost:11434/api/ps
```
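When stream is left at its default of true, these endpoints return newline-delimited JSON fragments rather than one document. A minimal sketch of reassembling a streamed /api/generate reply -- the sample lines are illustrative, shaped like Ollama's streaming format:

```python
import json

def accumulate_stream(ndjson_lines):
    """Join the 'response' fragments of a streaming /api/generate reply.

    Each line is one JSON object; the final one has "done": true and
    carries timing/token-count stats (eval_count, eval_duration, ...)."""
    text, stats = [], {}
    for line in ndjson_lines:
        obj = json.loads(line)
        text.append(obj.get("response", ""))
        if obj.get("done"):
            stats = obj
    return "".join(text), stats

# Illustrative fragments:
lines = [
    '{"response": "The KV cache ", "done": false}',
    '{"response": "stores attention keys and values.", "done": false}',
    '{"response": "", "done": true, "eval_count": 12}',
]
full, stats = accumulate_stream(lines)
print(full)
print(stats["eval_count"])
```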
The API is OpenAI-compatible at /v1/chat/completions, which means any tool or library built against the OpenAI SDK -- including Open WebUI, LangChain, and LlamaIndex -- can point at Ollama with a base URL change and no API key.
```shell
# OpenAI-compatible endpoint (no API key needed)
$ curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
Thinking mode: using reasoning models via the API and CLI
Reasoning models such as DeepSeek-R1, Qwen3, and QwQ support a thinking mode that exposes the model's internal chain-of-thought before producing a final answer. Introduced in Ollama v0.9.0, this is controlled via the think parameter in the API and the --think flag in the CLI. Note that suppression via --think=false and /set nothink depends on the model variant and the Ollama version -- some Qwen3 and DeepSeek variants have been reported to keep emitting reasoning tokens regardless of the flag. If thinking output is not suppressed as expected, pull the latest model version with ollama pull <model> and update Ollama to the current release.
When think is enabled, the response includes a separate thinking field alongside content. The thinking field contains the reasoning trace; the content field contains the final answer. The two are cleanly separated in both streaming and non-streaming responses.
```shell
# Enable thinking for a single run
$ ollama run deepseek-r1 --think "Explain why flash attention reduces VRAM usage"

# Disable thinking (for reasoning models where it is on by default)
$ ollama run qwen3 --think=false "Summarize this file" < notes.txt

# Think but hide the reasoning trace -- show only the final answer
$ ollama run deepseek-r1 --think --hidethinking "Is 9.9 bigger than 9.11?"

# Toggle thinking inside an interactive session
$ ollama run qwen3
# Inside the session:  /set think    (enable)
#                      /set nothink  (disable)
```
```shell
# Enable thinking via the chat API (non-streaming)
$ curl http://localhost:11434/api/chat \
  -d '{
    "model": "deepseek-r1",
    "messages": [{"role": "user", "content": "How does ROCm differ from CUDA?"}],
    "think": true,
    "stream": false
  }'
# Response includes: message.thinking (reasoning trace) + message.content (final answer)
```
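A small helper that splits the two fields from a non-streaming /api/chat reply; the sample payload is illustrative and uses only a subset of the real response fields:

```python
import json

def split_reasoning(chat_response: str):
    """Separate the reasoning trace from the final answer in an /api/chat reply."""
    msg = json.loads(chat_response)["message"]
    return msg.get("thinking", ""), msg.get("content", "")

# Illustrative non-streaming response body:
sample = json.dumps({"message": {
    "role": "assistant",
    "thinking": "ROCm targets AMD GPUs; CUDA targets NVIDIA hardware...",
    "content": "ROCm is AMD's GPU compute stack; CUDA is NVIDIA's."
}})
thinking, answer = split_reasoning(sample)
print(answer)  # final answer only, reasoning trace kept separate
```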
Qwen3 models support hybrid thinking: a single model checkpoint handles both thinking-on and thinking-off requests by switching the chat template, so you do not need to pull a separate variant. This differs from DeepSeek-R1 distills, which always generate reasoning tokens regardless of the think parameter (the parameter only controls whether those tokens are surfaced in the response). When thinking is enabled, reasoning tokens are generated before the answer -- on a 14B model this typically means 5,000-20,000 reasoning tokens, adding inference time proportional to their count. If you are using Qwen3 for tasks that do not benefit from reasoning (simple retrieval, formatting, summarization), setting think: false significantly reduces response latency at no quality cost for those task types. The reasoning tokens consume no extra VRAM beyond the KV cache entries that any generated token occupies -- they are simply additional decode steps using the same loaded model weights.
The ollama ps command shows loaded models and their VRAM usage, but it doesn't show instantaneous GPU compute utilization. For that you need GPU-specific tools running in a second terminal while inference is happening:
```shell
# NVIDIA: live GPU utilization, memory, clocks, temperature every 1 second
$ nvidia-smi dmon -s u -d 1

# NVIDIA: full process view with VRAM per-process
$ watch -n 1 nvidia-smi

# AMD: terminal dashboard (download .deb from github.com/Umio-Yasuno/amdgpu_top/releases)
$ amdgpu_top

# AMD: CLI metrics
$ rocm-smi --showuse --showmemuse

# Any GPU: nvtop gives an htop-style view for both NVIDIA and AMD
$ sudo apt install nvtop && nvtop
```
If GPU utilization reads near 0% while tokens are generating, but ollama ps shows the model on GPU, partial CPU offload is happening -- not all layers fit in VRAM. The fix is either a smaller model, lower quantization, reduced num_ctx, or as described above, enabling KV cache quantization to reclaim headroom.
Multi-GPU Setup
Ollama distributes model layers across all visible GPUs automatically when multiple are present. If the model fits entirely across the combined VRAM of all GPUs, inference runs at near-full GPU speed. If it still overflows, llama.cpp spills the remainder to system RAM as it would with a single GPU. Ollama selects GPUs by their compute capabilities and free VRAM, not by device index order.
To restrict Ollama to specific GPUs rather than using all available devices, use CUDA_VISIBLE_DEVICES for NVIDIA or ROCR_VISIBLE_DEVICES for AMD in the service override file:
```ini
[Service]
# NVIDIA: use only GPU 0 (device index from nvidia-smi)
Environment="CUDA_VISIBLE_DEVICES=0"

# NVIDIA: use GPUs 0 and 1, exclude GPU 2
# Environment="CUDA_VISIBLE_DEVICES=0,1"

# AMD: use only GPU 0 (device index from rocminfo)
# Environment="ROCR_VISIBLE_DEVICES=0"
```
```shell
# Check GPU device indices for NVIDIA
$ nvidia-smi -L

# Check GPU device indices for AMD
$ rocminfo | grep "Agent [0-9]"

# Verify Ollama sees all intended GPUs after reload
$ sudo systemctl daemon-reload && sudo systemctl restart ollama
$ journalctl -u ollama -n 40 | grep -i "gpu\|cuda\|rocm"
```
Ollama does not currently support running different models on different GPUs simultaneously. All loaded models share the full pool of visible GPU VRAM. If you need true multi-model GPU isolation, you would need to run separate Ollama instances on different ports, each with its own CUDA_VISIBLE_DEVICES restriction.
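A sketch of what a second, GPU-isolated instance could look like as its own systemd unit -- the unit name, paths, and port below are illustrative, not an official layout:

```ini
# /etc/systemd/system/ollama-gpu1.service  (illustrative name)
[Unit]
Description=Second Ollama instance pinned to GPU 1
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Environment="OLLAMA_HOST=127.0.0.1:11435"          # separate port from the main instance
Environment="CUDA_VISIBLE_DEVICES=1"               # this instance sees only GPU 1
Environment="OLLAMA_MODELS=/var/lib/ollama-gpu1"   # optional: separate model store

[Install]
WantedBy=multi-user.target
```

Clients then choose a model by choosing a port; each instance schedules only within its own visible VRAM.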
For AMD, setting ROCR_VISIBLE_DEVICES=-1 (an invalid device ID) is the canonical way to force CPU-only mode for testing, since there is no OLLAMA_NUM_GPU=0 equivalent that works cleanly on all ROCm setups. To see exactly what GPU libraries and devices Ollama detects at startup, set OLLAMA_DEBUG=1 and run ollama serve -- the startup log will list every discovered GPU, its VRAM, and which compute library was selected.
To control how many different models can be loaded into VRAM simultaneously (across all GPUs), set OLLAMA_MAX_LOADED_MODELS in the service override file. The default is 3× your GPU count for GPU inference, or 3 for CPU-only. When a new model request arrives and the limit is hit, Ollama automatically evicts the least recently used model to make room. Setting this to 1 on a memory-constrained machine prevents multiple models from competing for VRAM.
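A minimal override sketch for a memory-constrained single-GPU box (the values are illustrative starting points, not recommendations):

```ini
[Service]
# Keep only one model resident in VRAM; LRU eviction handles model switches
Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Avoid the auto-selected parallelism multiplying the KV cache footprint
Environment="OLLAMA_NUM_PARALLEL=1"
```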
Troubleshooting Common Problems
"They're building on a foundation they control."
-- Markaicode, Multi-GPU Ollama Setup Guide -- on why 70% of self-hosted LLM users cite data privacy, not cost savings, as their primary motivation
NVIDIA GPU disappears after suspend/resume
On Linux, after a suspend/resume cycle the NVIDIA UVM (Unified Virtual Memory) driver can fail to reinitialize correctly, causing Ollama to silently fall back to CPU inference on the next request even though nvidia-smi still shows the GPU. This is an acknowledged driver bug documented in the official Ollama GPU docs. The fix is to reload the UVM driver module:
```shell
# Reload the NVIDIA UVM driver after resume
$ sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
$ sudo systemctl restart ollama
```
To automate this after every resume, create a systemd service unit that triggers on the suspend.target:
```ini
[Unit]
Description=Reload NVIDIA UVM driver after resume
After=suspend.target hibernate.target hybrid-sleep.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c '/sbin/rmmod nvidia_uvm; /sbin/modprobe nvidia_uvm'
ExecStartPost=/bin/systemctl restart ollama

[Install]
WantedBy=suspend.target hibernate.target hybrid-sleep.target
```
```shell
# Save the unit above as /etc/systemd/system/nvidia-uvm-resume.service, then:
$ sudo systemctl enable nvidia-uvm-resume.service
```
Ollama is running on CPU despite a GPU being present
For NVIDIA: run nvidia-smi to confirm the driver is loaded and the GPU is visible. Check that the driver version is 531 or newer. If using Docker, the NVIDIA Container Toolkit is also required. For AMD: check the server log for ROCm-related messages with journalctl -u ollama -n 50. If you see a compatibility error or no ROCm output at all, the HSA_OVERRIDE_GFX_VERSION environment variable is likely missing or incorrect for your card.
NVIDIA driver not loading on Ubuntu with Secure Boot enabled
If nvidia-smi fails with "No devices were found" after installing the driver, and your system uses Secure Boot, the NVIDIA kernel module may be blocked because it is not signed with a key trusted by the firmware. Ubuntu's ubuntu-drivers autoinstall sets up the nvidia-dkms package and prompts you to enroll a Machine Owner Key (MOK) during installation -- this step is easy to miss. Use dmesg to check the status:
```shell
# Check if the NVIDIA module is blocked by Secure Boot
$ dmesg | grep -i "nvidia\|secure boot\|module"

# Check MOK enrollment status
$ mokutil --sb-state
$ mokutil --list-enrolled | grep -i nvidia

# Re-trigger MOK enrollment if needed
$ sudo mokutil --import /var/lib/shim-signed/mok/MOK.der
# Then reboot and select "Enroll MOK" in the UEFI blue screen
```
If you prefer to avoid the MOK process entirely, disabling Secure Boot in the UEFI firmware settings is a simpler path on a machine you control. On production or shared hardware, the MOK enrollment approach is the correct one.
Running Ollama in Docker on Linux
The official Ollama Docker image works on Linux and is a clean path for users who want containerized isolation or are already managing their stack with Compose. For NVIDIA GPU access inside Docker, the NVIDIA Container Toolkit must be installed on the host in addition to the driver. If you're running Ubuntu 24.04, it's also worth being aware of CVE-2026-3888, a local privilege escalation in Ubuntu 24.04's systemd interaction with snap-confine that affects any machine where untrusted local users exist -- patch before adding GPU-accessible services to the mix. Install the Container Toolkit first:
```shell
# Add the NVIDIA Container Toolkit repository
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt update && sudo apt install -y nvidia-container-toolkit

# Configure Docker to use the NVIDIA runtime
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker

# Verify the toolkit can see the GPU
$ docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```
```shell
# NVIDIA GPU passthrough
$ docker run -d --gpus=all \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    --name ollama \
    ollama/ollama

# AMD GPU passthrough (ROCm)
$ docker run -d \
    --device /dev/kfd \
    --device /dev/dri \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    --name ollama \
    ollama/ollama:rocm

# Pull and run a model inside the container
$ docker exec -it ollama ollama run llama3.1:8b
```
The host NVIDIA driver version must be compatible with the CUDA version used inside the container. If nvidia-smi works on the host but Ollama in Docker falls back to CPU, the Container Toolkit is either not installed or the Docker daemon has not been restarted after its installation. Run sudo systemctl restart docker after installing the toolkit.
Out-of-memory errors when loading a model
Either the model is too large for your available VRAM, or the KV cache for your configured context window is consuming more VRAM than expected. Reduce num_ctx in a Modelfile, switch to a lower quantization variant (Q4_K_S instead of Q4_K_M), or pull a smaller model. Partial GPU offloading is possible -- llama.cpp will split layers between GPU and CPU -- but performance degrades significantly once the model starts spilling to system RAM.
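To sanity-check whether a given model-plus-quantization can fit before pulling gigabytes, a rough weights-only estimate is useful. The bits-per-weight figures below are approximate llama.cpp averages (an assumption; exact sizes vary per model), and KV cache plus runtime overhead come on top:

```python
# Approximate average bits per weight for common GGUF quantizations (assumption).
BPW = {"Q4_K_S": 4.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def weight_vram_gib(params_billion: float, quant: str) -> float:
    """VRAM for model weights alone, in GiB (KV cache/overhead excluded)."""
    bits = params_billion * 1e9 * BPW[quant]
    return bits / 8 / 2**30

for q in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    print(f"8B @ {q}: ~{weight_vram_gib(8, q):.1f} GiB")
```

For an 8B model, Q4_K_M lands around 4.5 GiB of weights, which is consistent with the 5-6 GB total figure quoted elsewhere in this guide once KV cache and overhead are added.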
Slow generation despite GPU being detected
Confirm that n_gpu_layers (or "model layers" in newer Ollama log formats) in the server log is greater than 0. If it's low (say 10 out of 33 for an 8B model), partial offloading is happening due to VRAM pressure. Also check for thermal throttling on the GPU: high temperatures cause modern GPUs to reduce their clock speeds. For NVIDIA, nvidia-smi dmon shows real-time clock speeds and temperatures. For AMD, amdgpu_top provides equivalent data.
Model downloads fail or produce corrupted output
Large model downloads (models in the 20-40 GB range) can time out on unstable connections. Ollama supports resumable downloads -- run the same ollama pull command again and it will resume from where it stopped. If a model produces garbled output after a completed download, delete it with ollama rm model_name and re-pull.
How to Install Ollama and Run a Model on Linux with GPU Acceleration
1. Install Ollama on Linux

Run the official one-line installer: curl -fsSL https://ollama.com/install.sh | sh. The installer detects your CPU architecture, installs the Ollama binary, sets up a systemd service, and creates the ~/.ollama directory for model storage. Verify the installation with ollama --version and confirm the service is running with systemctl status ollama.

2. Configure GPU acceleration (NVIDIA, AMD ROCm, or Vulkan)

For NVIDIA: ensure the driver is version 531 or newer (the official minimum; 550+ is a widely used stable baseline), your GPU has Compute Capability 5.0+, and that nvidia-smi runs successfully. Ollama detects CUDA automatically with no additional configuration. For AMD: install ROCm v7 drivers using AMD's official amdgpu-install utility and add your user to the render and video groups. If your AMD GPU is not officially supported by ROCm, set HSA_OVERRIDE_GFX_VERSION to the nearest supported GFX version in the Ollama systemd service environment file. For Intel GPUs or AMD hardware that does not work with ROCm at all, enable the Vulkan backend by setting OLLAMA_VULKAN=1 in the service environment. Restart the Ollama service after any driver or environment changes.

3. Pull a model and verify GPU usage

Pull a model with ollama pull llama3.1:8b. Once downloaded, run ollama ps to confirm the processor column shows GPU. For NVIDIA, cross-check with nvidia-smi. Check ~/.ollama/logs/server.log for the n_gpu_layers value -- a value greater than 0 confirms GPU offload is active. A value of 0 means the model is running on CPU despite a GPU being present, indicating a driver or configuration issue.

4. Tune quantization and context length for your hardware

Choose a quantization level that fits your available VRAM. Q4_K_M is recommended for most hardware, requiring approximately 5-6 GB VRAM for a 7-8B model. If you see out-of-memory errors, reduce num_ctx in a Modelfile, or pull a lower quantization variant such as Q4_K_S. Use ollama ps while a model is running to monitor live VRAM usage and offload status. Once you have a baseline that works, consider creating a named Modelfile variant for your most common use cases.
Frequently Asked Questions
Do I need a GPU to run Ollama on Linux?
No, Ollama will fall back to CPU inference if no supported GPU is detected. CPU inference is functional but noticeably slower -- expect 2-8 tokens per second on a modern CPU with a 7B model, compared to 40-80+ tokens per second on a mid-range GPU. For interactive use, a GPU makes a significant practical difference.
How do I know if Ollama is actually using my GPU?
Run ollama ps while a model is loaded. The output includes a processor column that shows GPU or CPU. For NVIDIA, you can also run nvidia-smi and look for the ollama_llama_server process. For AMD, use radeontop or amdgpu_top and watch for VRAM activity. The server log at ~/.ollama/logs/server.log will also show n_gpu_layers, which tells you how many model layers are offloaded to the GPU -- 0 means CPU-only.
My AMD GPU is listed as unsupported by ROCm. Can I still use it with Ollama?
Often yes. Setting the environment variable HSA_OVERRIDE_GFX_VERSION to the GFX version of the nearest supported GPU bypasses ROCm's hardware compatibility checks. For example, an RX 5400 (gfx1034) would use HSA_OVERRIDE_GFX_VERSION=10.3.0 since gfx1030 is the nearest supported target. Add this to the Ollama systemd service environment and restart the service. Results vary by card and driver version. If ROCm still doesn't work, the Vulkan backend (OLLAMA_VULKAN=1) is a broader fallback option.
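Applied to the systemd service, the override looks like this (the 10.3.0 value matches the RX 5400 example above; substitute the GFX version nearest your card):

```ini
[Service]
# Bypass ROCm's hardware compatibility check for an officially unsupported card
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
```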
What quantization level should I use for local LLM inference on Linux?
Q4_K_M is the recommended starting point for most hardware. It offers a strong balance of output quality, VRAM efficiency, and generation speed. Q5_K_M gives marginally better quality at roughly 15-20% higher VRAM cost. Avoid Q2 and Q3 unless VRAM is extremely limited -- quality degradation becomes severe. Q8_0 approaches original model quality but requires nearly as much VRAM as full precision, so it is only practical on high-VRAM cards.
Does Ollama support Vulkan on Linux?
Yes, since v0.12.11. Vulkan is an opt-in experimental feature enabled by setting OLLAMA_VULKAN=1 in the Ollama service environment. It covers Intel GPUs and AMD hardware outside the ROCm support matrix. On Linux, you will likely need to install Mesa Vulkan drivers separately (sudo apt install mesa-vulkan-drivers). Vulkan performance is generally lower than native CUDA or ROCm, and Intel integrated graphics in particular may produce garbled output on models larger than ~1B parameters -- a known issue still being worked on upstream.
What AMD ROCm version does Ollama require on Linux?
Ollama requires ROCm v7 on Linux as of current releases. Install it via AMD's amdgpu-install utility. After installation, add your user to the render and video groups and reboot. See AMD's official ROCm documentation for the current package URL for your Ubuntu version.
What is the minimum NVIDIA driver version for Ollama on Linux?
Ollama requires NVIDIA GPUs with Compute Capability 5.0 or higher and a driver version of 531 or newer. The 531+ requirement is the official minimum; a recent stable driver (550+ is a widely used baseline) ensures the best compatibility with current Ollama releases. Install the driver through Ubuntu's package manager (ubuntu-drivers autoinstall) rather than the NVIDIA runfile installer to ensure kernel module updates are handled automatically.
How do I update Ollama on Linux?
Re-run the official install script: curl -fsSL https://ollama.com/install.sh | sh. The installer detects the existing installation and upgrades the binary in place. Your downloaded models, service configuration, and any systemd override files are not affected. Confirm the new version with ollama --version after the update.
Does Ollama support multiple GPUs on Linux?
Yes. Ollama automatically distributes model layers across all visible GPUs when more than one is present, using combined VRAM to fit larger models. To restrict Ollama to specific devices, set CUDA_VISIBLE_DEVICES (NVIDIA) or ROCR_VISIBLE_DEVICES (AMD) in the systemd service override file. Note that Ollama does not support running different models on different GPUs simultaneously within a single instance.
Can I use Ollama with Docker on Linux?
Yes. The official ollama/ollama Docker image supports NVIDIA GPU passthrough via the --gpus=all flag when the NVIDIA Container Toolkit is installed on the host. For AMD, use the ollama/ollama:rocm image with --device /dev/kfd --device /dev/dri. The API runs on port 11434 inside the container and is OpenAI-compatible, so any tool that supports a custom base URL can connect to it.
Sources and References
The technical details in this guide are drawn from official documentation and verified against current Ollama releases (v0.20.x, April 2026). Quotes are attributed inline to their original sources.
- Ollama GPU hardware support documentation -- official GPU requirements, ROCm v7 requirement, HSA_OVERRIDE_GFX_VERSION usage, Vulkan opt-in setup, ROCR_VISIBLE_DEVICES
- Ollama official FAQ -- multi-GPU scheduling, flash attention, KV cache quantization (OLLAMA_KV_CACHE_TYPE), keep_alive behavior
- Ollama official install script -- installer behavior, systemd service setup, user group configuration
- AMD ROCm Linux installation documentation -- amdgpu-install utility, ROCm v7 package repository
- llama.cpp (GitHub) -- underlying inference engine, quantization format reference, n_gpu_layers behavior
- Intel GPU driver documentation -- Linux driver setup for Intel Arc and integrated graphics with Vulkan
- LocalLLM.in, Ollama VRAM Requirements Guide (2026) -- VRAM boundary characterization, quantization impact data (quoted)
- dev.to/rosgluk, RTX 4080 benchmark (Ollama 0.15.2) -- 11x speed gap benchmark between full-VRAM and CPU-offload inference (quoted)
- Markaicode, Multi-GPU Ollama Setup Guide -- data privacy motivation statistics (quoted)