Skip to main content

AMD Strix Halo nodes (Linux / ROCm)

Skulk clusters are not Mac-only. An AMD Ryzen AI Max (Strix Halo, gfx1151) box running Linux can join a cluster as a worker that serves GGUF models through llama.cpp on its integrated Radeon GPU, alongside Apple Silicon nodes serving MLX models. The cluster is heterogeneous: each model is placed on the nodes that can actually run it.

This page covers what such a node needs, how to bring one up, and how the cluster decides what to run where.

What runs where

A Skulk node advertises the compute backends it can serve, written as <engine>-<compute> tags:

  • A macOS node advertises mlx and mlx-metal.
  • A Linux node with llama.cpp built for the Radeon GPU advertises llama_cpp and llama_cpp-vulkan.

Each model card declares which backends can run it. A GGUF model card lists llama.cpp backends as compatible; an MLX/safetensors model lists MLX. When you launch a model, the master places it only on nodes whose advertised backends include one the card accepts, and prefers the card's higher-ranked backend when several nodes qualify. So a GGUF model lands on the AMD node and an MLX model lands on the Macs, automatically.

The GPU path: Vulkan, not HIP

On gfx1151 the reliable, well-supported way to run llama.cpp on the GPU today is the Vulkan backend (Mesa's RADV driver), not the ROCm/HIP backend. ROCm is still installed for its runtime and driver stack, but Skulk's llama.cpp runner offloads through Vulkan. On a Ryzen AI Max+ 395 (Radeon 8060S) this fully offloads a 7B Q4_K_M model to the iGPU and decodes at interactive speed, which is what makes the box useful as a cluster node rather than a CPU-only fallback.

Validated configuration

ComponentVersion
HardwareAMD Ryzen AI Max+ 395 w/ Radeon 8060S (gfx1151)
OSUbuntu 26.04 LTS (kernel 7.0)
ROCm6.4
GPU computeVulkan via Mesa RADV (STRIX_HALO)
Python / uv3.13 / 0.11
llama-cpp-python0.3.30, built with the Vulkan backend

Bring-up

The full operational steps and a launcher template live in deployment/rocm/ in the repo. In outline:

  1. Install the GPU stack: ROCm plus a working Vulkan driver (RADV). Confirm with rocminfo | grep gfx (expect gfx1151) and vulkaninfo | grep deviceName (expect the Radeon device via RADV).
  2. Build the Skulk environment: git clone the repo and run uv sync. The Rust networking bindings compile here. No MLX is needed on a non-Mac node.
  3. Build llama-cpp-python with Vulkan: the default install is a CPU wheel, so reinstall it built from source with the Vulkan backend. --no-binary forces the source build, otherwise uv installs the prebuilt CPU wheel and CMAKE_ARGS is ignored:
    CMAKE_ARGS="-DGGML_VULKAN=on" uv pip install --force-reinstall \
    --no-cache-dir --no-binary llama-cpp-python \
    --python .venv/bin/python llama-cpp-python
    You build this once. The service entrypoint runs uv sync --inexact on a node that declares a GPU backend, so a routine sync no longer prunes this source-built wheel; rebuild it by hand only when bumping the llama.cpp version. As a safety net, if the wheel is ever replaced by a CPU-only one (no GPU offload compiled in), the node detects that and advertises llama_cpp-cpu instead of its GPU backend, so GPU work is never routed to a degraded build.
  4. Launch the node: declare its backend and point it at the rest of the cluster, using the launch-skulk.sh.example template (sets SKULK_LLAMA_CPP_BACKENDS=vulkan). Nodes on the same LAN segment find each other automatically (mDNS); if this node is on a different segment, set SKULK_BOOTSTRAP_PEERS to dial the existing nodes. On Linux there is no launchd, so start it detached so it survives an SSH disconnect:
    setsid bash -c 'exec ~/launch-skulk.sh > ~/skulk.log 2>&1' </dev/null >/dev/null 2>&1 &

A headless node needs no dashboard build: the API serves without the UI, and you reach the dashboard from any node that has it.

Serving a model on the AMD node

Launch a GGUF model (its card lists llama.cpp backends as compatible) and the master places it on the AMD node. From any node's API:

curl -s -X POST http://<any-node>:52415/place_instance \
-H 'Content-Type: application/json' \
-d '{"model_id":"<org>/<gguf-repo>","instance_meta":"MlxRing","min_nodes":1}'

Skulk downloads only the preferred quantization (not every quant in a multi-quant repo), loads it through llama.cpp on the Radeon GPU, and serves it through the same OpenAI-compatible endpoints as any other model.

What the AMD node can serve

A GGUF model on the AMD node uses the same OpenAI-compatible endpoints as any other Skulk model, with the same streaming behavior. It matches the MLX nodes on the generation capabilities llama.cpp supports:

  • Text generation, streamed token by token.
  • Vision / image input for a vision GGUF (with an mmproj projector): send images as OpenAI image_url content and the model describes or reasons over them. The projector loads through llama.cpp's multimodal chat handler.
  • Logprobs (opt-in): per-token logprobs and ranked alternatives (logprobs / top_logprobs). This needs the model loaded so it retains per-token logits, which pre-allocates a large buffer (context length x vocab), so it is off by default. Enable it per node with SKULK_LLAMA_CPP_LOGITS_ALL=1; doing so further caps the served context to a bounded window (SKULK_LLAMA_CPP_LOGITS_ALL_N_CTX, default 8192) so the buffer stays small. With it off, a logprobs request returns a clear error rather than silently omitting them. Either way the served context window is sized to the memory the cluster admitted for the instance, never the model's full trained context (a 128k-context model loaded at full context would otherwise allocate a KV cache far larger than placement reserved and could exhaust the node on load).
  • Tool calling: a request's tools are passed to the model. Whether the model returns a structured tool call or describes the call in prose depends on the model and its embedded chat template, which Skulk uses as-is; either way the request completes through the normal streaming path.

Vision / multimodal GGUF models are served. A vision GGUF that ships a separate mmproj projector (LLaVA / Qwen-VL style) runs on an AMD node: the projector downloads alongside the weights and the runner loads it through llama.cpp's multimodal chat handler, so image chat requests (OpenAI image_url content) work. A repo is recognized as a vision model from its config.json vision section, or, when it ships none, from the presence of the mmproj file.

One thing is deliberately not on the AMD path today:

  • Multi-token prediction (MTP) / speculative decoding is MLX-only. Those speedups come from MLX-specific drafting (model-native prediction heads or a sidecar drafter) and cross-rank coordination, which the llama.cpp engine does not implement, so an AMD node runs plain autoregressive decoding. GGUF models advertise no MTP capability, so it is never shown as available for them (there is no promise to fall short of); the AMD node serves at its native decode speed.

Scope

A single AMD node serves GGUF models on its own GPU. Sharding one model across an AMD node and a Mac is not supported (the two engines do not share a runtime); multi-node GGUF inference across several llama.cpp nodes is tracked separately. The interconnect doctrine still applies: the cluster fabric is trusted, so put untrusted segments behind your own network controls.