GPU Options for Local AI Inference: VRAM, Software Stacks, and What Actually Works (Mid-2026)

Running AI inference locally means making a hardware decision before you can make a software one. The choice of GPU determines which models fit in memory, how fast they generate, and – less obviously – which software stacks, container images, and deployment patterns are available to you. This overview covers the current hardware landscape for Linux-based inference, with the software constraints that make several apparently attractive options less attractive in practice.

Windows is not discussed. Driver instability, WSL2 overhead, the absence of first-class ROCm support, and the general friction of containerised workloads on Windows place it outside consideration for any repeatable inference pipeline.

VRAM: what fits where

Before hardware specifics, the memory arithmetic. A model’s parameter count and numeric precision set its minimum VRAM requirement: weights take roughly parameters × bytes per parameter, so a 7B-parameter model at 16-bit precision needs approximately 14GB, and at 4-bit quantisation (Q4) around 4GB once quantisation overhead is included. The practical floors for common workloads:

VRAM | LLMs at Q4 | Image generation
8GB | 7-8B | SD 1.5; Flux fp8 GGUF (slow)
12GB | 13B comfortably | Flux.1-dev fp8; practical Flux minimum
16GB | 13B Q8 / 14B Q4 | Full-precision Flux; Flux + ControlNet via fp8
24GB | 32B Q4; 70B with CPU offload | Flux FP16 + ControlNet; video models
48GB | 70B Q4 fully in VRAM | Large multi-model pipelines
80GB | 70B FP8 with headroom; multiple concurrent models | Heavy batch inference
96GB | 70B FP8 with context headroom; 100B+ at Q4 | Production-scale
128GB unified | 70B-235B MoE at Q4 | Full-precision large image models
192-512GB | 405B-671B at 4-bit | N/A as limiting factor

Note that 70B in FP16 requires approximately 140GB for weights alone, so serving a 70B model from a single card means quantising; that is the norm, not the exception. For image generation, the VRAM floor is hard: if the model and activations do not fit, the process fails or falls back to system RAM at speeds that make the workflow impractical.
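The weight arithmetic behind these figures can be sketched in a few lines. This is a minimal estimate for weights only, assuming 1GB = 10^9 bytes; the function name is illustrative, and real deployments need extra room for the KV cache, activations and quantisation metadata.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    examples = [
        ("7B FP16", 7, 16),
        ("7B Q4", 7, 4),
        ("70B FP16", 70, 16),
        ("70B Q4", 70, 4),
    ]
    for name, params, bits in examples:
        print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB for weights")
```

The 7B Q4 result (3.5GB) lands slightly under the "around 4GB" quoted above because practical Q4 formats store scaling data alongside the 4-bit values.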

NVIDIA

NVIDIA remains the default, primarily because the software story is the simplest. CUDA on Linux is universal: PyTorch ships pre-built CUDA wheels, the NVIDIA Container Toolkit exposes GPUs to Docker via --gpus all, and pre-built container images for vLLM, Ollama, ComfyUI and most major frameworks are available from Docker Hub and NVIDIA NGC without recompilation.

Consumer cards (Blackwell, RTX 50 series, launched January 2025):

Card | VRAM | Bandwidth
RTX 5090 | 32GB GDDR7 | 1.79 TB/s
RTX 5080 | 16GB GDDR7 | 960 GB/s
RTX 5070 Ti | 16GB GDDR7 | 896 GB/s
RTX 5070 | 12GB GDDR7 | 672 GB/s

No consumer RTX card exceeds 32GB. The RTX 4090 (24GB, prior generation) remains in strong demand and is widely available used. NVLink is absent from all RTX 40 and 50 consumer cards; multi-GPU inference runs over PCIe. The Ampere generation RTX 3090 retained NVLink, which makes a dual-3090 (48GB total) a well-established budget configuration for 70B serving.

Workstation tier:

The RTX 6000 Ada (48GB GDDR6 ECC) fits Llama 70B Q4 on a single card. The RTX PRO 6000 Blackwell (96GB GDDR7 ECC, ~$8,565) delivers roughly 34 tokens/sec on a 70B Q4 model and is the most capable single-card option currently available for on-premises large-model serving. The RTX PRO 5000 Blackwell sits at 48GB GDDR7.

Data-centre cards:

GPU | VRAM | Bandwidth | Approx. price / rental
A100 40GB | 40GB HBM2 | 1.6 TB/s | ~$7,800 used / ~$1.29/hr
A100 80GB | 80GB HBM2e | 2.0 TB/s | ~$18,900 used / ~$3.43/hr
H100 | 80GB HBM3 | 3.35 TB/s | ~$25-40k / ~$1.38-2.50/hr
H200 | 141GB HBM3e | 4.8 TB/s | ~$35-45k / ~$2+/hr
B200 | 180-192GB HBM3e | ~8 TB/s | DGX systems ~$350k+ / ~$2.12-6.03/hr

Buying new data-centre cards individually is not a realistic path: they move through enterprise sales channels, with long lead times and no retail warranty. Used A100 80GB cards are the most accessible route to 80GB of HBM; prices are softening as Blackwell capacity expands. Renting via Lambda Labs, Vast.ai, RunPod or Thunder Compute is the practical option for intermittent high-VRAM workloads.

Jetson AGX Thor and DGX Spark:

Both platforms provide 128GB of unified memory (CPU and GPU share a single pool) at 273 GB/s on a 256-bit bus. The Jetson AGX Thor (~$3,499) is ARM64-based running Ubuntu via JetPack 7.x; the DGX Spark (~$3,000-4,699, GB10 Grace Blackwell) targets developers wanting a CUDA-native desktop AI box. Both run CUDA 13 and support Docker with GPU access.

The unified memory figure is the headline, but 273 GB/s is approximately one-seventh of the bandwidth available on an RTX 5090. A 70B model fits in memory but generates tokens slowly relative to a discrete GDDR7 or HBM card. These platforms suit large-capacity, lower-throughput edge inference, multi-model concurrent serving, or MoE architectures where each active expert is modest in size.
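Why bandwidth matters this much can be illustrated with a common rule of thumb, used here as an assumption rather than a vendor figure: single-stream token generation is usually memory-bound, so the ceiling on tokens/sec is roughly memory bandwidth divided by the bytes of weights read per token. The sketch below ignores KV-cache traffic, compute time and kernel overhead, so real throughput will be lower; the function name is illustrative.

```python
def decode_tokens_per_sec_ceiling(
    bandwidth_gb_s: float,
    active_params_billion: float,
    bits_per_weight: float,
) -> float:
    """Rough upper bound on single-stream decode speed when memory-bound.

    Assumes each generated token reads every active weight once.
    For a dense model, active parameters equal total parameters;
    for an MoE model, pass only the parameters active per token.
    """
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # A 32B dense model at 4 bits per weight reads about 16 GB per token.
    for label, bandwidth in [("273 GB/s unified memory", 273),
                             ("RTX 5090, 1.79 TB/s", 1790)]:
        ceiling = decode_tokens_per_sec_ceiling(bandwidth, 32, 4)
        print(f"{label}: at most ~{ceiling:.0f} tokens/sec")
```

Under these assumptions the same 32B Q4 model tops out at about 17 tokens/sec on the unified-memory platforms and about 112 on the RTX 5090, which is the same ratio as the bandwidth gap. The MoE case follows from the same formula: fewer active parameters per token means fewer bytes read.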

The ARM64 architecture on Jetson introduces operational overhead that is easy to underestimate. The vast majority of Docker images for AI/ML frameworks are built for x86-64. Running them on Jetson via QEMU emulation is slow, and unreliable for compiled CUDA extensions. The correct approach is to use ARM64-native images from NVIDIA NGC or the dusty-nv/jetson-containers project, or to rebuild from source for linux/arm64. Many Python packages in the AI/ML ecosystem do not publish ARM64 wheels on PyPI and must be compiled from source or pulled from NVIDIA’s Jetson-specific index (pypi.jetson-ai-lab.io), which has coverage gaps that require fallback to public PyPI. PyTorch itself is not the standard pip install torch wheel on Jetson; it requires either NVIDIA’s BSP build or compilation from source against the ARM64 CUDA toolchain.

There is also a container flag difference that breaks orchestration tooling assuming standard NVIDIA syntax: Jetson Thor requires --runtime=nvidia rather than --gpus all. This is a known platform difference with no workaround.

AMD

AMD’s consumer options offer good VRAM per pound. The RX 7900 XTX (24GB GDDR6, ~£800) is the current UK leader on that metric. The workstation PRO line reaches 48GB with the W7900, and the Radeon AI PRO R9700 (32GB GDDR6, RDNA4) targets AI workloads specifically. The data-centre MI300X (192GB HBM3, 5.325 TB/s) is available for rent (~$1.71-3.41/hr on TensorWave, Vultr, RunPod) but not through accessible purchase channels for an SME. The MI325X and MI355X (256GB and 288GB HBM3e respectively) extend the MI series further.

The software situation on Linux x86-64 is now genuinely workable. ROCm Docker support runs via the AMD Container Runtime Toolkit (or manual device passthrough with --device /dev/kfd --device /dev/dri). Pre-built ROCm PyTorch images are available on Docker Hub as rocm/pytorch:latest. The experience is similar to NVIDIA on Linux (host driver, container toolkit, pull a pre-built image), though the image catalogue is thinner, the supported GPU list is narrower, and some frameworks (ComfyUI extensions, TensorRT-dependent tools) have no ROCm equivalent.

ROCm is x86-64 only. There is no ARM64 ROCm support; AMD GPUs are not usable for GPU-accelerated ML on ARM hosts. The officially supported consumer cards for ROCm 7.x are principally RDNA3 and RDNA4 (the RX 7900 XTX shares the gfx1100 target with the W7900). Older RDNA2 cards require the HSA_OVERRIDE_GFX_VERSION workaround, which is unsupported and may break between ROCm releases.

Apple Silicon

Apple M-series Macs with unified memory are a serious local inference option on capacity grounds alone.

Config | Unified Memory | Bandwidth
M3 Max | up to 128GB | 400 GB/s
M3 Ultra | up to 512GB | 819 GB/s
M4 Max | up to 128GB | 546 GB/s

The 512GB M3 Ultra can hold the 671B-parameter DeepSeek R1 at 4-bit quantisation (approximately 404GB consumed) on a single machine that costs less than most data-centre GPU cards. At roughly 17-18 tokens/sec under 200W, throughput is modest. For models below 32B parameters, a discrete NVIDIA or AMD GPU produces substantially higher tokens/sec; the Apple capacity advantage only becomes decisive above roughly 32B, where VRAM becomes the binding constraint.
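The 404GB figure is larger than a nominal 4 bits per weight would give (671B × 4 / 8 ≈ 335.5GB). A short back-calculation, assuming 1GB = 10^9 bytes and using an illustrative function name, shows the effective storage cost per weight that the quoted footprint implies:

```python
def effective_bits_per_weight(footprint_gb: float, params_billion: float) -> float:
    """Back out average bits stored per parameter from a memory footprint."""
    # gigabytes * 8 bits per byte / billions of parameters = bits per parameter
    return footprint_gb * 8 / params_billion


if __name__ == "__main__":
    bits = effective_bits_per_weight(404, 671)
    print(f"DeepSeek R1 at ~404 GB: ~{bits:.2f} bits per weight")
```

The result is roughly 4.8 bits per weight. The gap over a nominal 4 bits is typical of practical 4-bit formats, which store scaling data and often keep some tensors at higher precision, so sizing against exactly 4 bits per weight tends to underestimate the footprint.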

Apple’s GPU acceleration stack (Metal, MLX, MPS) is macOS-only and cannot be reproduced on Linux. Running Linux on Apple Silicon via Asahi Linux exposes the CPU but not the GPU for ML acceleration. MLX, Ollama’s Metal backend, and PyTorch MPS all require macOS.

Docker on macOS does not expose the Apple GPU to containers. Docker Desktop uses Apple’s mandatory Hypervisor.framework, which provides no virtual GPU surface; containers fall back to CPU for any workload that would use Metal on the host. Workarounds exist at the edges: Podman can route Vulkan API calls to the host GPU for Vulkan-capable applications, and Docker Model Runner runs GPU-accelerated inference as a native macOS service and exposes it as an API endpoint to containers. These cover specific use cases but are not general-purpose GPU passthrough.

The practical consequence is that an Apple Silicon Mac is architecturally incompatible with Docker-based inference pipelines intended to run the same image on cloud instances and on-premises hardware. Inference must run as native macOS processes.

Chinese GPUs

Vendor | Flagship | Memory | Status
Biren | BR100 | 64GB HBM2e | US Entity List (Oct 2023); enterprise-only
Moore Threads | MTT S4000 | 48GB GDDR6 | US Entity List (Oct 2023); China distribution only
Cambricon | Siyuan 590 | 80GB HBM | US Entity List (2022); enterprise/state
Hygon | DCU (various) | HBM (unverified) | ROCm/HIP-based; China-only
Huawei | Ascend 910B | 64GB HBM2e | Export-banned; enterprise/state only
Lisuan | LX 7G106 / 7G105 | 12/24GB GDDR6 | Consumer; China-only; launched June 2026

Biren, Moore Threads and Cambricon are on the US Entity List. Huawei is under export controls. None of these vendors sells through accessible channels to UK businesses. The only card that has appeared in consumer channels at all (Moore Threads MTT S80, 16GB) has immature, primarily Chinese-language drivers. Chinese GPU stacks are not a viable path for an SME or startup outside China, irrespective of the raw hardware specifications.

Software compatibility at a glance

Platform | OS | Docker GPU | ARM64 images
NVIDIA (x86-64) | Linux | --gpus all via nvidia-container-toolkit | x86-64 only
NVIDIA Jetson (ARM64) | Ubuntu/JetPack | --runtime=nvidia; ARM64 images required | Required
AMD ROCm (x86-64) | Linux x86-64 only | AMD Container Toolkit or device passthrough | x86-64 only
Intel Arc / oneAPI | Linux | Less mature; Vulkan via Docker Model Runner | x86-64 only
Apple Silicon | macOS only | No GPU passthrough in Docker | N/A
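Orchestration code that has to launch the same workload on more than one of these platforms needs to vary the GPU flags per host. A minimal sketch of that idea follows, using only the flags named in this overview. The platform keys, the function name and the example/inference:latest image are hypothetical placeholders, and the AMD entry uses manual device passthrough rather than the container toolkit.

```python
import shlex

# Docker GPU flags per platform, as described in this overview.
# The dictionary keys are hypothetical labels chosen for this example.
GPU_FLAGS = {
    "nvidia-x86_64": ["--gpus", "all"],
    "nvidia-jetson": ["--runtime=nvidia"],
    "amd-rocm": ["--device", "/dev/kfd", "--device", "/dev/dri"],
}


def docker_run_command(platform_key: str, image: str, *cmd: str) -> list[str]:
    """Build a `docker run` argument list with the right GPU flags."""
    if platform_key not in GPU_FLAGS:
        # Apple Silicon lands here: Docker on macOS has no GPU passthrough.
        raise ValueError(f"no Docker GPU passthrough known for {platform_key!r}")
    return ["docker", "run", "--rm", *GPU_FLAGS[platform_key], image, *cmd]


if __name__ == "__main__":
    # "example/inference:latest" is a placeholder image name.
    print(shlex.join(docker_run_command("nvidia-jetson", "example/inference:latest")))
    print(shlex.join(docker_run_command("amd-rocm", "rocm/pytorch:latest", "python3")))
```

Returning an argument list rather than a string keeps the result safe to pass to subprocess.run without shell quoting problems.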

No major inference framework is strictly CUDA-only as of mid-2026. CUDA receives new features first, has the broadest community, and is the least troubleshooting-intensive path. The framework coverage for NVIDIA, AMD and Apple:

Framework | NVIDIA/Linux | AMD ROCm/Linux | Apple/macOS
llama.cpp | CUDA | HIP | Metal
Ollama | CUDA | ROCm (solid) | Metal / MLX
vLLM | CUDA | First-class ROCm | vllm-metal plugin
HuggingFace Transformers | CUDA | ROCm | MPS
ComfyUI | CUDA | ROCm (RDNA3/4) | Metal
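For PyTorch-based frameworks, the backend in use can be checked at runtime. The sketch below is one possible way to do it, not a prescribed method: it relies on ROCm builds of PyTorch reporting through the torch.cuda API while setting torch.version.hip, and on torch.backends.mps for Apple Silicon. The function names are illustrative, and the decision logic is kept separate so it can be exercised without PyTorch installed.

```python
from typing import Optional


def pick_backend(cuda_available: bool, hip_version: Optional[str],
                 mps_available: bool) -> str:
    """Map PyTorch capability flags to a backend label."""
    if cuda_available:
        # ROCm builds also answer True here; torch.version.hip tells them apart.
        return "rocm" if hip_version else "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_backend() -> str:
    """Query the installed PyTorch build (requires torch)."""
    import torch  # imported lazily so pick_backend works without PyTorch

    mps = hasattr(torch.backends, "mps") and torch.backends.mps.is_available()
    return pick_backend(
        torch.cuda.is_available(),
        getattr(torch.version, "hip", None),
        mps,
    )


if __name__ == "__main__":
    try:
        print(f"PyTorch backend: {detect_backend()}")
    except ImportError:
        print("PyTorch is not installed")
```

Inside a Docker container on macOS this would be expected to report cpu, in line with the lack of GPU passthrough described above.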

What to buy

Starting point: used RTX 3090 (24GB, ~£600-800) on x86-64 Linux. Runs 32B Q4 LLMs and full-precision Flux. Two cards (48GB combined), over PCIe or an NVLink bridge, run 70B Q4 without CPU offload. Standard Docker tooling throughout.

Best value under ~£1,000 for new hardware: AMD RX 7900 XTX (24GB) on Linux with ROCm. Verify the specific SKU is on the ROCm 7.x supported list before purchasing.

Best single new card for most workloads: RTX 5090 (32GB). UK pricing remains elevated (~£3,599); a used RTX 4090 (24GB) is strong value at current prices.

Maximum single-card capacity without data-centre pricing: RTX PRO 6000 Blackwell (96GB, ~£7k). The right choice when 70B FP8 in-VRAM with context headroom is a hard requirement.

70B+ at full quality, or 100B+ models – three paths:

Rent before buying anything above £2k to validate throughput requirements against real workloads.
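A rough break-even estimate helps frame the rent-versus-buy decision. The sketch below divides a purchase price by an hourly rental rate, both in the same currency. It deliberately ignores electricity, hosting, resale value and price changes, so treat the result as a first-pass figure; the function name is illustrative.

```python
def break_even_hours(purchase_price: float, rental_per_hour: float) -> float:
    """Hours of rental that cost as much as buying the card outright."""
    if rental_per_hour <= 0:
        raise ValueError("rental_per_hour must be positive")
    return purchase_price / rental_per_hour


if __name__ == "__main__":
    # Figures from the data-centre table: used A100 80GB at ~$18,900,
    # rental at ~$3.43/hr.
    hours = break_even_hours(18_900, 3.43)
    print(f"Break-even after ~{hours:,.0f} GPU-hours "
          f"(~{hours / 24:,.0f} days of continuous use)")
```

With those table figures the break-even sits at roughly 5,500 GPU-hours, or about 230 days of continuous use, which is why intermittent workloads usually favour renting.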

A few caveats

GPU pricing is volatile. A late-2025 memory-supply shortage pushed RTX 50-series prices above MSRP in the UK; used data-centre card prices are softening as Blackwell capacity expands.

Unified-memory bandwidth is the binding constraint on Jetson Thor, DGX Spark and Apple Silicon. 128GB at 273 GB/s and 512GB at 819 GB/s both sit below the bandwidth of high-end discrete GDDR7 or HBM cards (1.79 TB/s on the RTX 5090, 3.35 TB/s on the H100). Large models load but generate slowly.

ROCm’s supported-GPU list changes between releases. Cards that work via override flags in one release may break in the next. Confirm support for a specific GPU SKU against the current ROCm release notes before committing.

This overview covers inference. Training and fine-tuning impose stricter VRAM requirements (gradients, optimiser states and stored activations sit alongside the weights) and benefit more strongly from NVLink; CUDA’s lead over other stacks is larger in those contexts.

