GPU Options for Local AI Inference: VRAM, Software Stacks, and What Actually Works (Mid-2026)
Running AI inference locally means making a hardware decision before you can make a software one. The choice of GPU determines which models fit in memory, how fast they generate, and – less obviously – which software stacks, container images, and deployment patterns are available to you. This overview covers the current hardware landscape for Linux-based inference, with the software constraints that make several apparently attractive options less attractive in practice.
Windows is not discussed. Driver instability, WSL2 overhead, the absence of first-class ROCm support, and the general friction of containerised workloads on Windows place it outside consideration for any repeatable inference pipeline.
VRAM: what fits where
Before hardware specifics, the memory arithmetic. A model’s parameter count and numeric precision together set its minimum VRAM requirement: a 7B parameter model in 16-bit precision needs approximately 14GB for weights (7 billion × 2 bytes); at 4-bit quantisation (Q4), around 4GB once quantisation metadata and runtime overhead are included. The practical floors for common workloads:
| VRAM | LLMs at Q4 | Image generation |
|---|---|---|
| 8GB | 7-8B | SD 1.5; Flux fp8 GGUF (slow) |
| 12GB | 13B comfortably | Flux.1-dev fp8; practical Flux minimum |
| 16GB | 13B Q8 / 14B Q4 | Full-precision Flux; Flux + ControlNet via fp8 |
| 24GB | 32B Q4; 70B with CPU offload | Flux FP16 + ControlNet; video models |
| 48GB | 70B Q4 fully in VRAM | Large multi-model pipelines |
| 80GB | 70B FP8 with headroom; multiple concurrent models | Heavy batch inference |
| 96GB | 70B FP8 with context headroom; 100B+ at Q4 | Production-scale |
| 128GB unified | 70B-235B MoE at Q4 | Full-precision large image models |
| 192-512GB | 405B-671B at 4-bit | N/A as limiting factor |
Note that 70B in FP16 requires approximately 140GB for weights alone, so quantisation is the norm, not the exception, for single-card 70B serving. For image generation, the VRAM floor is hard: if the model and activations do not fit, the process fails or falls back to system RAM at speeds that make the workflow impractical.
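The weight arithmetic above can be sketched in a few lines. This is a rough estimate only: the function names and the 20% overhead allowance (for KV cache, activations and runtime buffers) are assumptions made for illustration, and real Q4 formats store scale metadata that pushes the effective size slightly above 4 bits per weight.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal GB."""
    # (params_billion * 1e9) * (bits / 8) bytes, divided by 1e9 bytes per GB.
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, vram_gb: float,
         overhead_fraction: float = 0.2) -> bool:
    """True if weights plus an ASSUMED overhead allowance fit in vram_gb."""
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= vram_gb


if __name__ == "__main__":
    cases = [("7B FP16", 7, 16), ("7B Q4", 7, 4), ("70B FP16", 70, 16), ("70B Q4", 70, 4)]
    for name, params, bits in cases:
        print(f"{name}: {weight_memory_gb(params, bits):.1f} GB for weights")
    print("70B Q4 on 48GB:", fits(70, 4, 48))
    print("70B Q4 on 24GB:", fits(70, 4, 24))
```

Treat the result as a lower bound: long context windows and batch serving can add far more than the assumed allowance.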
NVIDIA
NVIDIA remains the default, primarily because the software story is the simplest. CUDA on Linux is universal: PyTorch ships pre-built CUDA wheels, the NVIDIA Container Toolkit exposes GPUs to Docker via --gpus all, and pre-built container images for vLLM, Ollama, ComfyUI and most major frameworks are available from Docker Hub and NVIDIA NGC without recompilation.
Consumer cards (Blackwell, RTX 50 series, launched January 2025):
| Card | VRAM | Bandwidth |
|---|---|---|
| RTX 5090 | 32GB GDDR7 | 1.79 TB/s |
| RTX 5080 | 16GB GDDR7 | 960 GB/s |
| RTX 5070 Ti | 16GB GDDR7 | 896 GB/s |
| RTX 5070 | 12GB GDDR7 | 672 GB/s |
No consumer RTX card exceeds 32GB. The RTX 4090 (24GB, prior generation) remains in strong demand and is widely available used. NVLink is absent from all RTX 40 and 50 consumer cards; multi-GPU inference runs over PCIe. The Ampere generation RTX 3090 retained NVLink, which makes a dual-3090 (48GB total) a well-established budget configuration for 70B serving.
Workstation tier:
The RTX 6000 Ada (48GB GDDR6 ECC) fits Llama 70B Q4 on a single card. The RTX PRO 6000 Blackwell (96GB GDDR7 ECC, ~$8,565) delivers roughly 34 tokens/sec on a 70B Q4 model and is the most capable single-card option currently available for on-premises large-model serving. The RTX PRO 5000 Blackwell sits at 48GB GDDR7.
Data-centre cards:
| GPU | VRAM | Bandwidth | Approximate purchase / rental price |
|---|---|---|---|
| A100 40GB | 40GB HBM2 | 1.6 TB/s | ~$7,800 used / ~$1.29/hr |
| A100 80GB | 80GB HBM2e | 2.0 TB/s | ~$18,900 used / ~$3.43/hr |
| H100 | 80GB HBM3 | 3.35 TB/s | ~$25-40k / ~$1.38-2.50/hr |
| H200 | 141GB HBM3e | 4.8 TB/s | ~$35-45k / ~$2+/hr |
| B200 | 180-192GB HBM3e | ~8 TB/s | DGX systems ~$350k+ / ~$2.12-6.03/hr |
Individual purchase of new data-centre cards is not a realistic path: enterprise sales channels, long lead times, and no retail warranty. Used A100 80GB cards represent the most accessible route to 80GB of HBM bandwidth; prices are softening as Blackwell capacity expands. Renting via Lambda Labs, Vast.ai, RunPod or Thunder Compute is the practical option for intermittent high-VRAM workloads.
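A quick way to frame the rent-or-buy question is the number of rental hours that would cost as much as the card. The sketch below uses the approximate A100 80GB figures from the table; the function name is illustrative, and the calculation deliberately ignores power, hosting, resale value and price changes, so it is a first-pass estimate only.

```python
def break_even_hours(purchase_price: float, hourly_rate: float) -> float:
    """Hours of rental that cost the same as buying outright."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return purchase_price / hourly_rate


if __name__ == "__main__":
    # Approximate figures quoted in the table: ~$18,900 used, ~$3.43/hr rented.
    hours = break_even_hours(18_900, 3.43)
    print(f"A100 80GB break-even: {hours:,.0f} hours "
          f"(about {hours / 24:.0f} days of continuous use)")
```

For intermittent workloads, the hours actually used per month matter more than the headline rate: a card used a few hours a week takes years to reach break-even.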
Jetson AGX Thor and DGX Spark:
Both platforms provide 128GB of unified memory (CPU and GPU share a single pool) at 273 GB/s on a 256-bit bus. The Jetson AGX Thor (~$3,499) is ARM64-based running Ubuntu via JetPack 7.x; the DGX Spark (~$3,000-4,699, GB10 Grace Blackwell) targets developers wanting a CUDA-native desktop AI box. Both run CUDA 13 and support Docker with GPU access.
The unified memory figure is the headline, but 273 GB/s is approximately one-seventh of the bandwidth available on an RTX 5090. A 70B model fits in memory but generates tokens slowly relative to a discrete GDDR7 or HBM card. These platforms suit large-capacity, lower-throughput edge inference, multi-model concurrent serving, or MoE architectures where each active expert is modest in size.
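The throughput gap can be estimated with a common rule of thumb: single-stream token generation is usually memory-bound, so an upper limit on tokens/sec is memory bandwidth divided by the bytes read per token. The sketch below assumes a dense model whose full weights are read once per token and ignores KV-cache traffic and compute limits; real throughput will be lower. The function name is illustrative.

```python
def decode_ceiling_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed, ASSUMING every weight is read once per token."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


WEIGHTS_70B_Q4_GB = 70 * 4 / 8  # 35 GB, weights only

if __name__ == "__main__":
    # Bandwidth figures as quoted in this overview.
    platforms = [("Jetson AGX Thor / DGX Spark", 273),
                 ("Apple M3 Ultra", 819),
                 ("H100", 3350)]
    for name, bandwidth in platforms:
        ceiling = decode_ceiling_tokens_per_sec(bandwidth, WEIGHTS_70B_Q4_GB)
        print(f"{name}: at most ~{ceiling:.1f} tokens/sec on 70B Q4")
    print(f"RTX 5090 vs Thor bandwidth ratio: {1790 / 273:.2f}x")
```

The same reasoning explains why MoE models suit these platforms: only the active experts are read for each token, so the bytes-per-token figure is much smaller than the total model size.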
The ARM64 architecture on Jetson introduces operational overhead that is easy to underestimate. The vast majority of Docker images for AI/ML frameworks are built for x86-64. Running them on Jetson via QEMU emulation is slow, and unreliable for compiled CUDA extensions. The correct approach is ARM64-native images from NVIDIA NGC or the dusty-nv/jetson-containers project, or rebuilding from source for linux/arm64. Many Python packages in the AI/ML ecosystem do not publish ARM64 wheels on PyPI and must be compiled from source or sourced from NVIDIA’s Jetson-specific index (pypi.jetson-ai-lab.io), which has coverage gaps requiring fallback to public PyPI. PyTorch itself is not the standard pip install torch wheel on Jetson; it requires either NVIDIA’s BSP build or compilation from source against the ARM64 CUDA toolchain.
There is also a container flag difference that breaks orchestration tooling assuming standard NVIDIA syntax: Jetson Thor requires --runtime=nvidia rather than --gpus all. This is a known platform difference with no workaround.
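Orchestration code that launches containers on both platform types has to branch on the flag. A minimal sketch follows, assuming the caller already knows whether the host is a Jetson; the function names and the image name are hypothetical, and no Jetson auto-detection is attempted here.

```python
from typing import List


def nvidia_gpu_args(is_jetson: bool) -> List[str]:
    """Docker arguments that expose the NVIDIA GPU on each platform type."""
    # Jetson uses the runtime flag; x86-64 hosts use --gpus all.
    return ["--runtime=nvidia"] if is_jetson else ["--gpus", "all"]


def docker_run_command(image: str, is_jetson: bool) -> List[str]:
    """Build (but do not execute) a docker run command for the given host type."""
    return ["docker", "run", "--rm", *nvidia_gpu_args(is_jetson), image]


if __name__ == "__main__":
    image = "example/inference:latest"  # hypothetical image name
    print(" ".join(docker_run_command(image, is_jetson=False)))
    print(" ".join(docker_run_command(image, is_jetson=True)))
```

Returning the command as a list keeps it safe to pass to subprocess.run without shell quoting.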
AMD
AMD’s consumer options offer good VRAM per pound. The RX 7900 XTX (24GB GDDR6, ~£800) is the current UK leader on that metric. The workstation PRO line reaches 48GB with the W7900, and the Radeon AI PRO R9700 (32GB GDDR7, RDNA4) targets AI workloads specifically. The data-centre MI300X (192GB HBM3, 5.325 TB/s) is available for rent (~$1.71-3.41/hr on TensorWave, Vultr, RunPod) but not through accessible purchase channels for an SME. The MI325X and MI355X (256GB and 288GB HBM3e respectively) extend the MI series further.
The software situation on Linux x86-64 is now genuinely workable. ROCm Docker support runs via the AMD Container Runtime Toolkit (or manual device passthrough with --device /dev/kfd --device /dev/dri). Pre-built ROCm PyTorch images are available from rocm/pytorch:latest. The experience is similar to NVIDIA on Linux – host driver, container toolkit, pull a pre-built image – though the image catalogue is thinner, the supported GPU list is narrower, and some frameworks (ComfyUI extensions, TensorRT-dependent tools) have no ROCm equivalent.
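For the manual device-passthrough route, a preflight check that the device nodes exist on the host saves a confusing container failure later. The sketch below only builds the command and checks the paths; the helper names are illustrative, and a real deployment may need additional flags (group membership, security options) depending on the host configuration, which are not covered here.

```python
import os
from typing import Callable, List

# Device nodes named in the passthrough flags above.
ROCM_DEVICE_NODES = ("/dev/kfd", "/dev/dri")


def missing_rocm_devices(exists: Callable[[str], bool] = os.path.exists) -> List[str]:
    """Return the ROCm device nodes that are absent on this host."""
    return [node for node in ROCM_DEVICE_NODES if not exists(node)]


def rocm_docker_command(image: str = "rocm/pytorch:latest") -> List[str]:
    """Build (but do not execute) a docker run command with device passthrough."""
    cmd = ["docker", "run", "--rm"]
    for node in ROCM_DEVICE_NODES:
        cmd += ["--device", node]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    missing = missing_rocm_devices()
    if missing:
        print("Missing device nodes (is the host driver installed?):", missing)
    else:
        print(" ".join(rocm_docker_command()))
```

The exists parameter is injectable so that the check can be exercised on machines without an AMD GPU.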
ROCm is x86-64 only. There is no ARM64 ROCm support; AMD GPUs are not usable for GPU-accelerated ML on ARM hosts. The officially supported consumer cards for ROCm 7.x are principally RDNA3 and RDNA4 (the RX 7900 XTX shares the gfx1100 target with the W7900). Older RDNA2 cards require the HSA_OVERRIDE_GFX_VERSION workaround, which is unsupported and may break between ROCm releases.
Apple Silicon
Apple M-series Macs with unified memory are a serious local inference option on capacity grounds alone.
| Config | Unified Memory | Bandwidth |
|---|---|---|
| M3 Max | up to 128GB | 400 GB/s |
| M3 Ultra | up to 512GB | 819 GB/s |
| M4 Max | up to 128GB | 546 GB/s |
The 512GB M3 Ultra can hold the 671B-parameter DeepSeek R1 at 4-bit quantisation (approximately 404GB consumed) on a single machine that costs less than most data-centre GPU cards. At roughly 17-18 tokens/sec under 200W, throughput is modest. For models below 32B parameters, a discrete NVIDIA or AMD GPU produces substantially higher tokens/sec; the Apple capacity advantage only becomes decisive above roughly 32B, where VRAM becomes the binding constraint.
Apple’s GPU acceleration stack (Metal, MLX, MPS) is macOS-only and cannot be reproduced on Linux. Running Linux on Apple Silicon via Asahi Linux exposes the CPU but not the GPU for ML acceleration. MLX, Ollama’s Metal backend, and PyTorch MPS all require macOS.
Docker on macOS does not expose the Apple GPU to containers. Docker Desktop uses Apple’s mandatory Hypervisor.framework, which provides no virtual GPU surface; containers fall back to CPU for any workload that would use Metal on the host. Workarounds exist at the edges: Podman can route Vulkan API calls to the host GPU for Vulkan-capable applications, and Docker Model Runner runs GPU-accelerated inference as a native macOS service and exposes it as an API endpoint to containers. These cover specific use cases but are not general-purpose GPU passthrough.
The practical consequence is that an Apple Silicon Mac is architecturally incompatible with Docker-based inference pipelines intended to run the same image on cloud instances and on-premises hardware. Inference must run as native macOS processes.
Chinese GPUs
| Vendor | Flagship | Memory | Status |
|---|---|---|---|
| Biren | BR100 | 64GB HBM2e | US Entity List (Oct 2023); enterprise-only |
| Moore Threads | MTT S4000 | 48GB GDDR6 | US Entity List (Oct 2023); China distribution only |
| Cambricon | Siyuan 590 | 80GB HBM | US Entity List (2022); enterprise/state |
| Hygon | DCU (various) | HBM (unverified) | ROCm/HIP-based; China-only |
| Huawei | Ascend 910B | 64GB HBM2e | Export-banned; enterprise/state only |
| Lisuan | LX 7G106 / 7G105 | 12/24GB GDDR6 | Consumer; China-only; launched June 2026 |
Biren, Moore Threads and Cambricon are on the US Entity List. Huawei is under export controls. None of these vendors sells through accessible channels to UK businesses. The only card that has appeared in consumer channels at all (Moore Threads MTT S80, 16GB) has immature, primarily Chinese-language drivers. Chinese GPU stacks are not a viable path for an SME or startup outside China, irrespective of the raw hardware specifications.
Software compatibility at a glance
| Platform | OS | Docker GPU | ARM64 images |
|---|---|---|---|
| NVIDIA (x86-64) | Linux | --gpus all via nvidia-container-toolkit | x86-64 only |
| NVIDIA Jetson (ARM64) | Ubuntu/JetPack | --runtime=nvidia; ARM64 images required | Required |
| AMD ROCm (x86-64) | Linux x86-64 only | AMD Container Toolkit or device passthrough | x86-64 only |
| Intel Arc / oneAPI | Linux | Less mature; Vulkan via Docker Model Runner | x86-64 only |
| Apple Silicon | macOS only | No GPU passthrough in Docker | N/A |
No major inference framework is strictly CUDA-only as of mid-2026. CUDA receives new features first, has the broadest community, and is the least troubleshooting-intensive path. The framework coverage for NVIDIA, AMD and Apple:
| Framework | NVIDIA/Linux | AMD ROCm/Linux | Apple/macOS |
|---|---|---|---|
| llama.cpp | CUDA | HIP | Metal |
| Ollama | CUDA | ROCm (solid) | Metal / MLX |
| vLLM | CUDA | First-class ROCm | vllm-metal plugin |
| HuggingFace Transformers | CUDA | ROCm | MPS |
| ComfyUI | CUDA | ROCm (RDNA3/4) | Metal |
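In PyTorch-based code, the three columns above collapse into a short device-selection routine. The sketch below relies on the fact that ROCm builds of PyTorch report through the torch.cuda API, so a single check covers both NVIDIA and AMD; the function name is illustrative, and the fallback to CPU when PyTorch is not installed is a choice made for this example.

```python
def pick_torch_device() -> str:
    """Return 'cuda', 'mps' or 'cpu' depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed: nothing to accelerate with.
        return "cpu"
    # True for CUDA builds on NVIDIA and for ROCm builds on AMD.
    if torch.cuda.is_available():
        return "cuda"
    # Apple's Metal backend; absent from older PyTorch releases.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print("Selected device:", pick_torch_device())
```

Inside a Docker container on macOS this returns "cpu", consistent with the lack of GPU passthrough described earlier.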
What to buy
Starting point – used RTX 3090 (24GB, ~£600-800) on x86-64 Linux. Runs 32B Q4 LLMs and full-precision Flux. Two over PCIe (48GB combined) run 70B Q4 without CPU offload. Standard Docker tooling throughout.
Best value under ~£1,000 for new hardware: AMD RX 7900 XTX (24GB) on Linux with ROCm. Verify the specific SKU is on the ROCm 7.x supported list before purchasing.
Best single new card for most workloads: RTX 5090 (32GB). UK pricing remains elevated (~£3,599); the RTX 4090 (24GB) used is strong value at current prices.
Maximum single-card capacity without data-centre pricing: RTX PRO 6000 Blackwell (96GB, ~£7k). The right choice when 70B FP8 in-VRAM with context headroom is a hard requirement.
70B+ at full quality, or 100B+ models – three paths:
- Capacity with CUDA on Linux: Jetson AGX Thor or DGX Spark (128GB unified). Accept ARM64 complexity, ARM64-native container images, and ~273 GB/s bandwidth constraining throughput on very large models.
- Capacity with macOS workflows: Apple M3 Ultra (up to 512GB). Accept macOS as the permanent OS; native inference tools only; no Docker GPU path.
- No architectural constraints: rent H100, H200, B200 or MI300X by the hour. (If you need GPU inference without owning the hardware, Marigold runs open-weight models on private AWS infrastructure and exposes them through a standard API – the same models, without the operational overhead of managing GPU instances.)
Rent before buying anything above £2k to validate throughput requirements against real workloads.
A few caveats
GPU pricing is volatile. A late-2025 memory-supply shortage pushed RTX 50-series prices above MSRP in the UK; used data-centre card prices are softening as Blackwell capacity expands.
Unified-memory bandwidth is the binding constraint on Jetson Thor, DGX Spark and Apple Silicon. 128GB at 273 GB/s is far below any current discrete GDDR7 card, and 512GB at 819 GB/s sits below the RTX 5080 (960 GB/s), at less than half the RTX 5090 (1.79 TB/s), and well below the HBM cards listed above. Large models load but generate slowly.
ROCm’s supported-GPU list changes between releases. Cards that work via override flags in one release may break in the next. Confirm support for a specific GPU SKU against the current ROCm release notes before committing.
This overview covers inference. Training and fine-tuning impose stricter VRAM requirements (optimiser states, gradient checkpoints) and benefit more strongly from NVLink; CUDA’s lead over other stacks is larger in those contexts.