
AMD Runs Trillion-Parameter Kimi K2.5 Locally on Four Ryzen AI Max+ Desktops

local-llm, amd, distributed-inference, llama-cpp, ryzen-ai

AMD has published a detailed technical guide demonstrating local inference of Moonshot AI’s trillion-parameter Kimi K2.5 model. The setup uses four Framework Desktop systems, each built around the Ryzen AI Max+ 395 APU with 128GB of unified memory. Through a Linux kernel TTM parameter override, each node exposes 120GB as GPU-addressable VRAM — well beyond the 96GB available through BIOS settings alone — putting 480GB of total GPU memory at the cluster’s disposal [1].

The model itself is the Q2_K_XL GGUF quantization of Kimi K2.5, weighing in at 375GB. Distributed inference is handled by llama.cpp’s built-in RPC mechanism over a 5Gbps Ethernet link, with one node acting as controller and the remaining three running lightweight RPC servers [1].

Cluster Architecture

One machine orchestrates tokenization, scheduling, and layer placement. The other three expose their GPU memory and compute via RPC endpoints. From the model’s perspective, layers are sharded across all four nodes at load time, and inference proceeds as though running on a single accelerator [1].
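
A minimal sketch of that topology, assuming llama.cpp’s stock rpc-server and llama-server binaries; the hostnames, the port 50052, and the model filename are illustrative placeholders, not values from AMD’s guide:

```shell
# On each of the three worker nodes, expose GPU memory and compute over RPC:
rpc-server --host 0.0.0.0 --port 50052

# On the controller node, shard the model across all four machines
# (local GPU plus the three remote RPC endpoints):
llama-server \
  --model Kimi-K2.5-Q2_K_XL.gguf \
  --rpc worker1:50052,worker2:50052,worker3:50052 \
  --ctx-size 32768 \
  --flash-attn
```

llama-server then exposes an OpenAI-compatible HTTP API on the controller, so existing client tooling can point at it unchanged.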

| Component | Specification |
| --- | --- |
| Nodes | 4× Framework Desktop |
| APU | AMD Ryzen AI Max+ 395 (gfx1151) |
| RAM per node | 128GB unified |
| VRAM per node (TTM) | 120GB |
| Total cluster VRAM | 480GB |
| Model | Kimi K2.5 (Q2_K_XL, 375GB) |
| Inference engine | llama.cpp with ROCm 7.0.2 |
| Interconnect | 5Gbps Ethernet |
| OS | Ubuntu 24.04.3 LTS |

The TTM Trick

The standard BIOS limit for dedicated VRAM on these systems is 96GB per node (384GB total) — not enough to fit the 375GB model with adequate headroom. AMD’s workaround uses the Linux Translation Table Manager kernel parameter (ttm.pages_limit) combined with amdgpu.gttsize to push the GPU-addressable allocation to 120GB per node. This is a boot-time kernel parameter change, not a runtime hack, and the allocation is verified via dmesg after reboot [1].
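
A sketch of what that boot-time change looks like on a GRUB-based Ubuntu install. The parameter values are derived here, not quoted from AMD’s guide: ttm.pages_limit counts 4KiB pages (120GiB = 31457280 pages) and amdgpu.gttsize is in MiB (120 × 1024 = 122880), so treat the exact numbers as assumptions to check against the original article:

```shell
# /etc/default/grub — append the override to the kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="ttm.pages_limit=31457280 amdgpu.gttsize=122880"

# Regenerate the GRUB config and reboot for the change to take effect:
sudo update-grub && sudo reboot

# After reboot, confirm the enlarged GTT allocation in the kernel log:
sudo dmesg | grep -i "amdgpu.*gtt"
```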

The math: 480GB of cluster VRAM for a 375GB model leaves roughly 105GB for KV cache and context. The guide demonstrates a 32,768-token context window with flash attention enabled.
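
The budget arithmetic above, spelled out as a back-of-envelope check (the figures are the ones stated in the article):

```python
# Cluster memory budget for four nodes after the TTM override.
nodes = 4
vram_per_node_gb = 120   # GPU-addressable memory per node with ttm.pages_limit raised
model_gb = 375           # Q2_K_XL GGUF quantization of Kimi K2.5

total_vram_gb = nodes * vram_per_node_gb    # total cluster VRAM
headroom_gb = total_vram_gb - model_gb      # left over for KV cache and context

print(total_vram_gb, headroom_gb)  # 480 105
```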

Software Stack

Two paths are documented: a pre-built binary route via the Lemonade SDK project (which ships nightly ROCm-enabled llama.cpp builds targeting gfx1151), and a manual source build using CMake with HIP, RPC, and rocWMMA flash attention flags [1]. Both produce the same llama-cli, llama-server, and rpc-server binaries. The server variant exposes an OpenAI-compatible HTTP API, making integration with existing tooling straightforward.
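
A sketch of the manual source-build path. The flag names follow llama.cpp’s current CMake options (GGML_HIP, GGML_RPC, GGML_HIP_ROCWMMA_FATTN) and may differ slightly from the exact invocation in AMD’s guide:

```shell
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B build \
  -DGGML_HIP=ON \
  -DGGML_RPC=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DAMDGPU_TARGETS=gfx1151
cmake --build build --config Release -j
```

The same build produces all three binaries, so a single artifact can be copied to every node regardless of its controller or worker role.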

What This Actually Means

This is not a datacenter story. Four desktop-class machines with consumer APUs — no discrete GPUs, no NVLink, no InfiniBand — running a trillion-parameter model over Ethernet. The practical takeaways:

  • Memory is the bottleneck, not compute. The Ryzen AI Max+ 395’s value proposition here is 128GB of unified memory per node, not raw FLOPS. Quantization (Q2_K_XL) does the heavy lifting on the parameter side.
  • llama.cpp RPC makes multi-node accessible. No custom distributed framework, no Kubernetes, no NCCL. One binary on each machine, a few command-line flags, and Ethernet.
  • Cost context matters. Four Framework Desktops with maxed-out RAM will run north of $10,000. A multi-GPU H100 cloud instance (a single 80GB H100 cannot hold a 375GB model) handles trillion-parameter inference without the cluster management. The value is in the “locally” part — air-gapped, no API dependency, full control.
  • The TTM override is Linux-only. Windows users are capped at 96GB per node via BIOS, which limits the model sizes that fit in a four-node cluster to roughly 370GB quantized.

For teams already running local inference on single Ryzen AI Max+ systems, this guide is a concrete blueprint for scaling out. For everyone else, it’s a benchmark for where consumer-grade AI hardware stands in 2026.

References


  [1] AMD, “How to Run a One Trillion-Parameter LLM Locally on an AMD Ryzen AI Max+ Cluster,” AMD Developer Resources, 2026. https://www.amd.com/en/developer/resources/technical-articles/2026/how-to-run-a-one-trillion-parameter-llm-locally-an-amd.html

Configuration details reflect a production environment at time of writing. Implementation specifics vary based on tooling versions, platform updates, and organizational requirements. Validate approaches against current documentation before deployment.