GPU Options for Local AI Inference: VRAM, Software Stacks, and What Actually Works (Mid-2026)

Running AI inference locally means making a hardware decision before you can make a software one. The choice of GPU determines which models fit in memory, how fast they generate, and, less obviously, which software stacks, container images, and deployment patterns are available to you. This overview covers the current hardware landscape for Linux-based inference, along with the software constraints that make several apparently attractive options less attractive in practice.

Windows is not discussed. Driver instability, WSL2 overhead, the absence of first-class ROCm support, and the general friction of containerised workloads on Windows place it outside consideration for any repeatable inference pipeline.

VRAM: what fits where

Before hardware specifics, the memory arithmetic. A model’s parameter count and numeric precision together set its minimum VRAM requirement: a 7B-parameter model at 16-bit precision needs approximately 14GB for weights (2 bytes per parameter); at 4-bit quantisation (Q4), around 4GB. The practical floors for common workloads:

VRAM	LLMs at Q4	Image generation
8GB	7-8B	SD 1.5; Flux fp8 GGUF (slow)
12GB	13B comfortably	Flux.1-dev fp8; practical Flux minimum
16GB	13B Q8 / 14B Q4	Flux fp8 + ControlNet; FP16 Flux only with offload
24GB	32B Q4; 70B with CPU offload	Flux FP16 + ControlNet; video models
48GB	70B Q4 fully in VRAM	Large multi-model pipelines
80GB	70B FP8 with headroom; multiple concurrent models	Heavy batch inference
96GB	70B FP8 with context headroom; 100B+ at Q4	Production-scale
128GB unified	70B-235B MoE at Q4	Full-precision large image models
192-512GB	405B-671B at 4-bit	N/A as limiting factor

Note that 70B in FP16 requires approximately 140GB for weights alone, so for single-card 70B serving quantisation is the norm, not the exception. For image generation, the VRAM floor is hard: if the model and activations do not fit, the process fails or falls back to system RAM at speeds that make the workflow impractical.
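The arithmetic behind these figures can be sketched in a few lines of Python. This is a rough estimate of weight memory only; the 4.5 bits-per-weight figure used for Q4 is an assumption (4-bit values plus per-block scale overhead), and real quantised formats vary. KV cache, activations and runtime overhead come on top.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal GB."""
    # Billions of parameters * bits per weight / 8 = billions of bytes = GB.
    return params_billion * bits_per_weight / 8


# Assumed effective bits per weight; the Q4 value is illustrative, not exact.
PRECISIONS = {"FP16": 16.0, "FP8": 8.0, "Q4 (assumed 4.5 bpw)": 4.5}

for size in (7, 13, 32, 70):
    row = ", ".join(
        f"{name}: {weight_memory_gb(size, bits):.1f} GB"
        for name, bits in PRECISIONS.items()
    )
    print(f"{size}B -> {row}")
```

For a 7B model this gives 14.0 GB at FP16 and just under 4 GB at the assumed Q4 density, which matches the figures quoted above.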

NVIDIA

NVIDIA remains the default, primarily because the software story is the simplest. CUDA on Linux is universal: PyTorch ships pre-built CUDA wheels, the NVIDIA Container Toolkit exposes GPUs to Docker via --gpus all, and pre-built container images for vLLM, Ollama, ComfyUI and most major frameworks are available from Docker Hub and NVIDIA NGC without recompilation.

Consumer cards (Blackwell, RTX 50 series, launched January 2025):

Card	VRAM	Bandwidth
RTX 5090	32GB GDDR7	1.79 TB/s
RTX 5080	16GB GDDR7	960 GB/s
RTX 5070 Ti	16GB GDDR7	896 GB/s
RTX 5070	12GB GDDR7	672 GB/s

No consumer RTX card exceeds 32GB. The RTX 4090 (24GB, prior generation) remains in strong demand and is widely available used. NVLink is absent from all RTX 40 and 50 consumer cards; multi-GPU inference runs over PCIe. The Ampere generation RTX 3090 retained NVLink, which makes a dual-3090 (48GB total) a well-established budget configuration for 70B serving.

Workstation tier:

The RTX 6000 Ada (48GB GDDR6 ECC) fits Llama 70B Q4 on a single card. The RTX PRO 6000 Blackwell (96GB GDDR7 ECC, ~$8,565) delivers roughly 34 tokens/sec on a 70B Q4 model and is the most capable single-card option currently available for on-premises large-model serving. The RTX PRO 5000 Blackwell sits at 48GB GDDR7.
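Whether a 70B Q4 model fits on a 48GB card in practice depends on context length as well as weights, because the KV cache grows linearly with the number of tokens held. The sketch below estimates that cache. The model shape (80 layers, 8 key/value heads, head dimension 128) is an assumption based on the published Llama 70B architecture, the 4.5 bits-per-weight Q4 density is also an assumption, and activations and runtime overhead are ignored.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV-cache size in decimal GB for a single sequence."""
    # Factor 2: one key vector and one value vector per layer per token.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


# Assumed: 70B parameters at ~4.5 bits per weight (Q4 with scale overhead).
WEIGHTS_GB = 70 * 4.5 / 8

for context in (8_192, 32_768, 131_072):
    cache = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, context_tokens=context)
    total = WEIGHTS_GB + cache
    print(f"{context:>7} tokens: cache {cache:5.1f} GB, "
          f"weights + cache {total:5.1f} GB, fits in 48 GB: {total <= 48}")
```

Under these assumptions an 8K context leaves some room on a 48GB card, while a 32K context with a 16-bit cache does not. That is the kind of headroom the 96GB tier buys.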

Data-centre cards:

GPU	VRAM	Bandwidth	Approx. purchase / rental price
A100 40GB	40GB HBM2	1.6 TB/s	~$7,800 used / ~$1.29/hr
A100 80GB	80GB HBM2e	2.0 TB/s	~$18,900 used / ~$3.43/hr
H100	80GB HBM3	3.35 TB/s	~$25-40k / ~$1.38-2.50/hr
H200	141GB HBM3e	4.8 TB/s	~$35-45k / ~$2+/hr
B200	180-192GB HBM3e	~8 TB/s	DGX systems ~$350k+ / ~$2.12-6.03/hr

Individual purchase of new data-centre cards is not a realistic path: enterprise sales channels, long lead times, and no retail warranty. Used A100 80GB cards represent the most accessible route to 80GB of HBM; prices are softening as Blackwell capacity expands. Renting via Lambda Labs, Vast.ai, RunPod or Thunder Compute is the practical option for intermittent high-VRAM workloads.
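A simple break-even calculation helps frame the rent-or-buy decision. The sketch below uses the approximate prices from the table above and deliberately ignores electricity, the host system, resale value and depreciation, so treat the result as a first-order indication only.

```python
def break_even_hours(purchase_price: float, hourly_rate: float) -> float:
    """GPU-hours of rental that cost the same as buying the card outright."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return purchase_price / hourly_rate


# Approximate USD figures taken from the data-centre table above.
CARDS = {
    "A100 40GB (used)": (7_800, 1.29),
    "A100 80GB (used)": (18_900, 3.43),
}

for name, (price, rate) in CARDS.items():
    hours = break_even_hours(price, rate)
    # Express the same figure as days of use at 8 hours per day.
    print(f"{name}: ~{hours:,.0f} rental hours, ~{hours / 8:,.0f} days at 8 h/day")
```

At these prices both cards need several thousand hours of use before ownership beats renting, which is why intermittent workloads favour rental.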

Jetson AGX Thor and DGX Spark:

Both platforms provide 128GB of unified memory (CPU and GPU share a single pool) at 273 GB/s on a 256-bit bus. The Jetson AGX Thor (~$3,499) is ARM64-based running Ubuntu via JetPack 7.x; the DGX Spark (~$3,000-4,699, GB10 Grace Blackwell) targets developers wanting a CUDA-native desktop AI box. Both run CUDA 13 and support Docker with GPU access.

The unified memory figure is the headline, but 273 GB/s is roughly 15% of the bandwidth available on an RTX 5090 (1.79 TB/s). A 70B model fits in memory but generates tokens slowly relative to a discrete GDDR7 or HBM card. These platforms suit large-capacity, lower-throughput edge inference, multi-model concurrent serving, or MoE architectures where each active expert is modest in size.
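A rough way to see why bandwidth matters: during single-stream decoding of a dense model, each generated token requires reading approximately all the weights from memory once, so bandwidth divided by weight size gives an upper bound on tokens/sec. The sketch below applies that rule of thumb to the bandwidth figures above. The 4.5 bits-per-weight density for Q4 is an assumption, and measured throughput will land below the ceiling.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound ceiling on single-stream tokens/sec for a dense model."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    # Rule of thumb: every weight is read once per generated token.
    return bandwidth_gb_s / weights_gb


ASSUMED_Q4_BITS = 4.5  # assumption: 4-bit weights plus per-block scale overhead


def q4_weights_gb(params_billion: float) -> float:
    """Weight size in GB at the assumed Q4 density."""
    return params_billion * ASSUMED_Q4_BITS / 8


# A 32B Q4 model fits on both platforms, so it isolates the bandwidth effect.
for label, bandwidth in (("273 GB/s unified memory", 273.0),
                         ("RTX 5090 at 1.79 TB/s", 1790.0)):
    ceiling = decode_ceiling_tok_s(bandwidth, q4_weights_gb(32))
    print(f"32B Q4 on {label}: at most ~{ceiling:.0f} tokens/sec")

# A 70B Q4 model fits in 128GB unified memory but not in the 5090's 32GB.
ceiling_70b = decode_ceiling_tok_s(273.0, q4_weights_gb(70))
print(f"70B Q4 at 273 GB/s: at most ~{ceiling_70b:.0f} tokens/sec")
```

MoE models read only the active experts per token, which is why they fare better on these platforms than the dense-model ceiling suggests.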

The ARM64 architecture on Jetson introduces operational overhead that is easy to underestimate. The vast majority of Docker images for AI/ML frameworks are built for x86-64. Running them on Jetson via QEMU emulation is both slow and unreliable for compiled CUDA extensions. The correct approach is to use ARM64-native images from NVIDIA NGC or the dusty-nv/jetson-containers project, or to rebuild from source for linux/arm64. Many Python packages in the AI/ML ecosystem do not publish ARM64 wheels on PyPI and must be compiled from source or sourced from NVIDIA’s Jetson-specific index (pypi.jetson-ai-lab.io), which has coverage gaps requiring fallback to public PyPI. PyTorch itself is not the standard pip install torch wheel on Jetson; it requires either NVIDIA’s BSP build or compilation from source against the ARM64 CUDA toolchain.
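One way to catch an architecture mismatch before pulling an image is to inspect its manifest list. The sketch below parses the JSON that docker manifest inspect prints for a multi-architecture image and checks it against the host. The sample manifest is hypothetical, and the function names are illustrative rather than part of any tool.

```python
import json
import platform
from typing import Optional, Set

# Map platform.machine() values to Docker architecture names.
MACHINE_TO_DOCKER_ARCH = {
    "x86_64": "amd64",
    "AMD64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def image_architectures(manifest_json: str) -> Set[str]:
    """Linux architectures listed in a multi-arch manifest list."""
    manifest = json.loads(manifest_json)
    return {
        entry["platform"]["architecture"]
        for entry in manifest.get("manifests", [])
        if entry.get("platform", {}).get("os") == "linux"
    }


def runs_natively(manifest_json: str, machine: Optional[str] = None) -> bool:
    """True if the image publishes a build for the given (or current) host."""
    arch = MACHINE_TO_DOCKER_ARCH.get(machine or platform.machine())
    return arch is not None and arch in image_architectures(manifest_json)


# Hypothetical manifest for an image that only ships an x86-64 build.
SAMPLE_MANIFEST = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"digest": "sha256:placeholder",
         "platform": {"architecture": "amd64", "os": "linux"}},
    ],
})

print(runs_natively(SAMPLE_MANIFEST, "x86_64"))   # host is an x86-64 PC
print(runs_natively(SAMPLE_MANIFEST, "aarch64"))  # host is a Jetson
```

Single-platform images return a plain image manifest without a manifests list, in which case this check returns an empty set and is inconclusive.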

There is also a container flag difference that breaks orchestration tooling assuming standard NVIDIA syntax: Jetson Thor requires --runtime=nvidia rather than --gpus all. This is a known platform difference with no workaround.

AMD

AMD’s consumer options offer good VRAM per pound. The RX 7900 XTX (24GB GDDR6, ~£800) is the current UK leader on that metric. The workstation PRO line reaches 48GB with the W7900, and the Radeon AI PRO R9700 (32GB GDDR6, RDNA4) targets AI workloads specifically. The data-centre MI300X (192GB HBM3, 5.325 TB/s) is available for rent (~$1.71-3.41/hr on TensorWave, Vultr, RunPod) but not through accessible purchase channels for an SME. The MI325X and MI355X (256GB and 288GB HBM3e respectively) extend the MI series further.

The software situation on Linux x86-64 is now genuinely workable. ROCm Docker support runs via the AMD Container Toolkit (or manual device passthrough with --device /dev/kfd --device /dev/dri). Pre-built ROCm PyTorch images are published on Docker Hub as rocm/pytorch (for example rocm/pytorch:latest). The experience is similar to NVIDIA on Linux (host driver, container toolkit, pull a pre-built image), though the image catalogue is thinner, the supported GPU list is narrower, and some frameworks (ComfyUI extensions, TensorRT-dependent tools) have no ROCm equivalent.

ROCm is x86-64 only. There is no ARM64 ROCm support; AMD GPUs are not usable for GPU-accelerated ML on ARM hosts. The officially supported consumer cards for ROCm 7.x are principally RDNA3 and RDNA4 (the RX 7900 XTX shares the gfx1100 target with the W7900). Older RDNA2 cards require the HSA_OVERRIDE_GFX_VERSION workaround, which is unsupported and may break between ROCm releases.

Apple Silicon

Apple M-series Macs with unified memory are a serious local inference option on capacity grounds alone.

Config	Unified Memory	Bandwidth
M3 Max	up to 128GB	400 GB/s
M3 Ultra	up to 512GB	819 GB/s
M4 Max	up to 128GB	546 GB/s

The 512GB M3 Ultra can hold the 671B-parameter DeepSeek R1 at 4-bit quantisation (approximately 404GB consumed) on a single machine that costs less than most data-centre GPU cards. At roughly 17-18 tokens/sec under 200W, throughput is modest. For models below 32B parameters, a discrete NVIDIA or AMD GPU produces substantially higher tokens/sec; the Apple capacity advantage only becomes decisive above roughly 32B, where VRAM becomes the binding constraint.
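The numbers in this paragraph can be cross-checked with two short calculations: the effective bits per weight implied by a 404GB footprint, and a bandwidth-bound ceiling on decode speed. The second calculation assumes roughly 37B parameters are active per token, a figure taken from DeepSeek's published model description rather than from this article, and it ignores KV-cache reads and compute limits.

```python
def effective_bits_per_weight(footprint_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter for a given in-memory footprint."""
    return footprint_gb * 8 / params_billion


def moe_decode_ceiling_tok_s(bandwidth_gb_s: float,
                             active_params_billion: float,
                             bits_per_weight: float) -> float:
    """Bandwidth-bound tokens/sec ceiling when only active experts are read."""
    active_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_gb


bits = effective_bits_per_weight(footprint_gb=404, params_billion=671)
headroom_gb = 512 - 404  # memory left for KV cache, OS and other processes

# Assumption: ~37B active parameters per token (not stated in this article).
moe_ceiling = moe_decode_ceiling_tok_s(819, 37, bits)
dense_ceiling = 819 / 404  # if all 404GB had to be read for every token

print(f"effective density: {bits:.2f} bits per weight")
print(f"headroom on a 512GB machine: {headroom_gb} GB")
print(f"ceiling if dense: ~{dense_ceiling:.0f} tokens/sec")
print(f"ceiling with ~37B active: ~{moe_ceiling:.0f} tokens/sec")
```

The observed 17-18 tokens/sec sits well above the dense ceiling and below the MoE one, which is consistent with a mixture-of-experts model running on bandwidth-limited hardware.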

Apple’s GPU acceleration stack (Metal, MLX, MPS) is macOS-only and cannot be reproduced on Linux. Running Linux on Apple Silicon via Asahi Linux exposes the CPU but not the GPU for ML acceleration. MLX, Ollama’s Metal backend, and PyTorch MPS all require macOS.

Docker on macOS does not expose the Apple GPU to containers. Docker Desktop uses Apple’s mandatory Hypervisor.framework, which provides no virtual GPU surface; containers fall back to CPU for any workload that would use Metal on the host. Workarounds exist at the edges: Podman can route Vulkan API calls to the host GPU for Vulkan-capable applications, and Docker Model Runner runs GPU-accelerated inference as a native macOS service and exposes it as an API endpoint to containers. These cover specific use cases but are not general-purpose GPU passthrough.

The practical consequence is that an Apple Silicon Mac is architecturally incompatible with Docker-based inference pipelines intended to run the same image on cloud instances and on-premises hardware. Inference must run as native macOS processes.

Chinese GPUs

Vendor	Flagship	Memory	Status
Biren	BR100	64GB HBM2e	US Entity List (Oct 2023); enterprise-only
Moore Threads	MTT S4000	48GB GDDR6	US Entity List (Oct 2023); China distribution only
Cambricon	Siyuan 590	80GB HBM	US Entity List (2022); enterprise/state
Hygon	DCU (various)	HBM (unverified)	ROCm/HIP-based; China-only
Huawei	Ascend 910B	64GB HBM2e	Export-banned; enterprise/state only
Lisuan	LX 7G106 / 7G105	12/24GB GDDR6	Consumer; China-only; launched June 2026

Biren, Moore Threads and Cambricon are on the US Entity List. Huawei is under export controls. None of these vendors sells through accessible channels to UK businesses. The only card that has appeared in consumer channels at all (Moore Threads MTT S80, 16GB) has immature, primarily Chinese-language drivers. Chinese GPU stacks are not a viable path for an SME or startup outside China, irrespective of the raw hardware specifications.

Software compatibility at a glance

Platform	OS	Docker GPU access	Container image architecture
NVIDIA (x86-64)	Linux	`--gpus all` via nvidia-container-toolkit	x86-64 only
NVIDIA Jetson (ARM64)	Ubuntu/JetPack	`--runtime=nvidia`	ARM64 (native images required)
AMD ROCm (x86-64)	Linux x86-64 only	AMD Container Toolkit or device passthrough	x86-64 only
Intel Arc / oneAPI	Linux	Less mature; Vulkan via Docker Model Runner	x86-64 only
Apple Silicon	macOS only	No GPU passthrough in Docker	N/A
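Orchestration code that has to target more than one of these platforms needs to choose the GPU flags per host rather than hard-coding --gpus all. A minimal sketch follows, using only the flags named in this overview. The platform labels are hypothetical identifiers chosen for illustration, and a real ROCm setup may need further options from AMD's documentation.

```python
from typing import List

# Flags as described in this overview; the dictionary keys are made-up labels.
_GPU_FLAGS = {
    "nvidia-x86_64": ["--gpus", "all"],
    "nvidia-jetson": ["--runtime=nvidia"],
    "amd-rocm": ["--device", "/dev/kfd", "--device", "/dev/dri"],
}


def docker_gpu_flags(platform_name: str) -> List[str]:
    """Return the docker run flags that expose the GPU on a given platform."""
    if platform_name == "apple-silicon":
        raise ValueError("Docker on macOS exposes no GPU to containers")
    if platform_name not in _GPU_FLAGS:
        raise ValueError(f"unknown platform: {platform_name!r}")
    return list(_GPU_FLAGS[platform_name])


def docker_run_command(platform_name: str, image: str, *args: str) -> List[str]:
    """Build an argument list suitable for subprocess.run()."""
    return ["docker", "run", "--rm", *docker_gpu_flags(platform_name), image, *args]


print(" ".join(docker_run_command("amd-rocm", "rocm/pytorch:latest")))
print(" ".join(docker_run_command("nvidia-jetson", "example/arm64-image:tag")))
```

The image name in the second call is a placeholder. Raising an error for Apple Silicon, rather than silently returning no flags, makes the lack of a container GPU path visible at deployment time.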

No major inference framework is strictly CUDA-only as of mid-2026, but CUDA still receives new features first, has the broadest community, and needs the least troubleshooting. The framework coverage for NVIDIA, AMD and Apple:

Framework	NVIDIA/Linux	AMD ROCm/Linux	Apple/macOS
llama.cpp	CUDA	HIP	Metal
Ollama	CUDA	ROCm (solid)	Metal / MLX
vLLM	CUDA	First-class ROCm	vllm-metal plugin
HuggingFace Transformers	CUDA	ROCm	MPS
ComfyUI	CUDA	ROCm (RDNA3/4)	Metal
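For the PyTorch-based entries in this table, it is worth confirming which backend an installed build actually uses. The sketch below is one way to do that. It relies on the fact that ROCm builds of PyTorch report through the torch.cuda API and set torch.version.hip; the function name and return labels are illustrative.

```python
def describe_backend(torch_module) -> str:
    """Classify the accelerator backend of an imported torch module."""
    if torch_module.cuda.is_available():
        # ROCm builds reuse the torch.cuda API; version.hip is set only there.
        return "rocm" if getattr(torch_module.version, "hip", None) else "cuda"
    mps = getattr(torch_module.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print(describe_backend(torch))
```

A result of "cpu" inside a container on a machine with a working GPU usually points to a missing container GPU flag or a CPU-only wheel rather than a hardware problem.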

What to buy

Starting point: used RTX 3090 (24GB, ~£600-800) on x86-64 Linux. Runs 32B Q4 LLMs and full-precision Flux. Two cards (48GB combined, over PCIe or the NVLink bridge the 3090 retains) run 70B Q4 without CPU offload. Standard Docker tooling throughout.

Best value under ~£1,000 for new hardware: AMD RX 7900 XTX (24GB) on Linux with ROCm. Verify the specific SKU is on the ROCm 7.x supported list before purchasing.

Best single new card for most workloads: RTX 5090 (32GB). UK pricing remains elevated (~£3,599); a used RTX 4090 (24GB) is strong value at current prices.

Maximum single-card capacity without data-centre pricing: RTX PRO 6000 Blackwell (96GB, ~£7k). The right choice when 70B FP8 in-VRAM with context headroom is a hard requirement.

70B+ at full quality, or 100B+ models – three paths:

Capacity with CUDA on Linux: Jetson AGX Thor or DGX Spark (128GB unified). Accept ARM64 complexity, ARM64-native container images, and ~273 GB/s bandwidth constraining throughput on very large models.
Capacity with macOS workflows: Apple M3 Ultra (128-512GB). Accept macOS as the permanent OS; native inference tools only; no Docker GPU path.
No architectural constraints: rent H100, H200, B200 or MI300X by the hour. (If you need GPU inference without owning the hardware, Marigold runs open-weight models on private AWS infrastructure and exposes them through a standard API – the same models, without the operational overhead of managing GPU instances.)

Rent before buying anything above £2k to validate throughput requirements against real workloads.

A few caveats

GPU pricing is volatile. A late-2025 memory-supply shortage pushed RTX 50-series prices above MSRP in the UK; used data-centre card prices are softening as Blackwell capacity expands.

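Memory bandwidth is the binding constraint on Jetson Thor, DGX Spark and Apple Silicon. The 273 GB/s of the 128GB NVIDIA platforms is well below any GDDR7 or HBM card in the tables above. The M3 Ultra’s 819 GB/s is comparable to a mid-range GDDR7 card (the RTX 5070 Ti manages 896 GB/s) but less than half the RTX 5090’s 1.79 TB/s and far below HBM. In both cases bandwidth is low relative to the capacity it serves, so large models load but generate slowly.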

ROCm’s supported-GPU list changes between releases. Cards that work via override flags in one release may break in the next. Confirm support for a specific GPU SKU against the current ROCm release notes before committing.

This overview covers inference. Training and fine-tuning impose stricter VRAM requirements (gradients, optimiser states, stored activations) and benefit more strongly from NVLink; CUDA’s lead over other stacks is larger in those contexts.