AMD Strix Halo nodes (Linux / ROCm)
Skulk clusters are not Mac-only. An AMD Ryzen AI Max (Strix Halo, gfx1151) box
running Linux can join a cluster as a worker that serves GGUF models through
llama.cpp on its integrated Radeon GPU, alongside Apple Silicon nodes serving
MLX models. The cluster is heterogeneous: each model is placed on the nodes that
can actually run it.
This page covers what such a node needs, how to bring one up, and how the cluster decides what to run where.
What runs where
A Skulk node advertises the compute backends it can serve, written as
<engine>-<compute> tags:
- A macOS node advertises
mlxandmlx-metal. - A Linux node with llama.cpp built for the Radeon GPU advertises
llama_cppandllama_cpp-vulkan.
Each model card declares which backends can run it. A GGUF model card lists llama.cpp backends as compatible; an MLX/safetensors model lists MLX. When you launch a model, the master places it only on nodes whose advertised backends include one the card accepts, and prefers the card's higher-ranked backend when several nodes qualify. So a GGUF model lands on the AMD node and an MLX model lands on the Macs, automatically.
The GPU path: Vulkan, not HIP
On gfx1151 the reliable, well-supported way to run llama.cpp on the GPU today
is the Vulkan backend (Mesa's RADV driver), not the ROCm/HIP backend. ROCm
is still installed for its runtime and driver stack, but Skulk's llama.cpp runner
offloads through Vulkan. On a Ryzen AI Max+ 395 (Radeon 8060S) this fully
offloads a 7B Q4_K_M model to the iGPU and decodes at interactive speed, which is
what makes the box useful as a cluster node rather than a CPU-only fallback.
Validated configuration
| Component | Version |
|---|---|
| Hardware | AMD Ryzen AI Max+ 395 w/ Radeon 8060S (gfx1151) |
| OS | Ubuntu 26.04 LTS (kernel 7.0) |
| ROCm | 6.4 |
| GPU compute | Vulkan via Mesa RADV (STRIX_HALO) |
| Python / uv | 3.13 / 0.11 |
| llama-cpp-python | 0.3.30, built with the Vulkan backend |
Bring-up
The full operational steps and a launcher template live in
deployment/rocm/
in the repo. In outline:
- Install the GPU stack: ROCm plus a working Vulkan driver (RADV). Confirm
with
rocminfo | grep gfx(expectgfx1151) andvulkaninfo | grep deviceName(expect the Radeon device via RADV). - Build the Skulk environment:
git clonethe repo and runuv sync. The Rust networking bindings compile here. No MLX is needed on a non-Mac node. - Build llama-cpp-python with Vulkan: the default install is a CPU wheel,
so reinstall it built from source with the Vulkan backend.
--no-binaryforces the source build, otherwise uv installs the prebuilt CPU wheel andCMAKE_ARGSis ignored:You build this once. The service entrypoint runsCMAKE_ARGS="-DGGML_VULKAN=on" uv pip install --force-reinstall \
--no-cache-dir --no-binary llama-cpp-python \
--python .venv/bin/python llama-cpp-pythonuv sync --inexacton a node that declares a GPU backend, so a routine sync no longer prunes this source-built wheel; rebuild it by hand only when bumping the llama.cpp version. As a safety net, if the wheel is ever replaced by a CPU-only one (no GPU offload compiled in), the node detects that and advertisesllama_cpp-cpuinstead of its GPU backend, so GPU work is never routed to a degraded build. - Launch the node: declare its backend and point it at the rest of the
cluster, using the
launch-skulk.sh.exampletemplate (setsSKULK_LLAMA_CPP_BACKENDS=vulkan). Nodes on the same LAN segment find each other automatically (mDNS); if this node is on a different segment, setSKULK_BOOTSTRAP_PEERSto dial the existing nodes. On Linux there is no launchd, so start it detached so it survives an SSH disconnect:setsid bash -c 'exec ~/launch-skulk.sh > ~/skulk.log 2>&1' </dev/null >/dev/null 2>&1 &
A headless node needs no dashboard build: the API serves without the UI, and you reach the dashboard from any node that has it.
Serving a model on the AMD node
Launch a GGUF model (its card lists llama.cpp backends as compatible) and the master places it on the AMD node. From any node's API:
curl -s -X POST http://<any-node>:52415/place_instance \
-H 'Content-Type: application/json' \
-d '{"model_id":"<org>/<gguf-repo>","instance_meta":"MlxRing","min_nodes":1}'
Skulk downloads only the preferred quantization (not every quant in a multi-quant repo), loads it through llama.cpp on the Radeon GPU, and serves it through the same OpenAI-compatible endpoints as any other model.
What the AMD node can serve
A GGUF model on the AMD node uses the same OpenAI-compatible endpoints as any other Skulk model, with the same streaming behavior. It matches the MLX nodes on the generation capabilities llama.cpp supports:
- Text generation, streamed token by token.
- Vision / image input for a vision GGUF (with an
mmprojprojector): send images as OpenAIimage_urlcontent and the model describes or reasons over them. The projector loads through llama.cpp's multimodal chat handler. - Logprobs (opt-in): per-token logprobs and ranked alternatives (
logprobs/top_logprobs). This needs the model loaded so it retains per-token logits, which pre-allocates a large buffer (context length x vocab), so it is off by default. Enable it per node withSKULK_LLAMA_CPP_LOGITS_ALL=1; doing so further caps the served context to a bounded window (SKULK_LLAMA_CPP_LOGITS_ALL_N_CTX, default 8192) so the buffer stays small. With it off, a logprobs request returns a clear error rather than silently omitting them. Either way the served context window is sized to the memory the cluster admitted for the instance, never the model's full trained context (a 128k-context model loaded at full context would otherwise allocate a KV cache far larger than placement reserved and could exhaust the node on load). - Tool calling: a request's
toolsare passed to the model. Whether the model returns a structured tool call or describes the call in prose depends on the model and its embedded chat template, which Skulk uses as-is; either way the request completes through the normal streaming path.
Vision / multimodal GGUF models are served. A vision GGUF that ships a
separate mmproj projector (LLaVA / Qwen-VL style) runs on an AMD node: the
projector downloads alongside the weights and the runner loads it through
llama.cpp's multimodal chat handler, so image chat requests (OpenAI image_url
content) work. A repo is recognized as a vision model from its config.json
vision section, or, when it ships none, from the presence of the mmproj file.
One thing is deliberately not on the AMD path today:
- Multi-token prediction (MTP) / speculative decoding is MLX-only. Those speedups come from MLX-specific drafting (model-native prediction heads or a sidecar drafter) and cross-rank coordination, which the llama.cpp engine does not implement, so an AMD node runs plain autoregressive decoding. GGUF models advertise no MTP capability, so it is never shown as available for them (there is no promise to fall short of); the AMD node serves at its native decode speed.
Scope
A single AMD node serves GGUF models on its own GPU. Sharding one model across an AMD node and a Mac is not supported (the two engines do not share a runtime); multi-node GGUF inference across several llama.cpp nodes is tracked separately. The interconnect doctrine still applies: the cluster fabric is trusted, so put untrusted segments behind your own network controls.