Release Notes 1.3.0
Release date: 2026-06-25
Skulk 1.3.0 turns the cluster into a heterogeneous inference fabric. Until now, a cluster consisted only of Apple Silicon nodes running MLX. In 1.3.0, an AMD or other Linux GPU node can join the same cluster and serve models through a second inference engine; the cluster's networking is split into purpose-built planes; and the model catalog grows to cover GGUF weights, vision models, and a centralized model store.
Highlights
- Heterogeneous clusters: AMD / Linux GPU nodes via a llama.cpp engine. A Linux GPU box (for example an AMD Strix Halo) can now join a Skulk cluster and serve GGUF models through a new llama.cpp inference engine on the Vulkan (Mesa RADV) backend, side by side with Apple Silicon MLX nodes in the same cluster. Each model is routed to a node that can run it: MLX weights to Apple Silicon, GGUF weights to a llama.cpp node. The llama.cpp engine reaches parity with MLX on tool calling and logprobs, honors cancellation, and surfaces clear errors. See the AMD / Strix Halo node guide.
- A centralized model store. One node can act as a store host that downloads model weights once and serves them to the rest of the cluster, so the other nodes do not each have to re-download them from Hugging Face. For GGUF repositories the store fetches only the quantization a model card pins (not every quant in the repo), and the store host advertises a routable address so downloads work on a Thunderbolt-meshed fleet. See the model store guide.
- Vision models on GGUF. Vision-language GGUF models run on the llama.cpp engine: the runner loads the model's multimodal projector and accepts image inputs through the standard chat API.
- Plane separation, with an optional Zenoh data plane. Cluster traffic is now split into three planes: a control plane (cluster decisions and task lifecycle), a telemetry plane (last-write-wins node readings), and a data plane that streams generated tokens straight to the owning API node. The data plane can ride an optional Eclipse Zenoh transport, which is soft default-on, with per-cluster namespace isolation. See cluster communication.
- Cross-engine reasoning. Reasoning models keep their chain-of-thought out of the answer on both engines. gpt-oss "harmony" output and plain <think>-delimited reasoning are parsed into a separate reasoning channel, so the visible content is the clean answer and the reasoning is available separately. On the llama.cpp engine this is done with a dependency-free parser so it runs on non-Mac GPU nodes.
- Richer telemetry and topology. Node readings (memory, system, disk, accelerator metrics, identities) are gossiped on the telemetry plane and shown in the dashboard, including a collector-agnostic accelerator block that covers both Apple and AMD GPUs, heterogeneous-node identity in the topology, and per-node health.
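To make the cross-engine reasoning item above concrete, here is a minimal sketch of splitting tag-delimited reasoning from the visible answer using only the Python standard library. It is an illustration, not Skulk's actual parser: the names `split_reasoning` and `ParsedOutput` are hypothetical, gpt-oss harmony output is not handled, and treating an unterminated block as reasoning is an assumption.

```python
from dataclasses import dataclass

OPEN_TAG = "<think>"
CLOSE_TAG = "</" + OPEN_TAG[1:]  # the matching closing tag


@dataclass
class ParsedOutput:
    reasoning: str  # text found inside the reasoning tags
    content: str    # the visible answer with reasoning removed


def split_reasoning(text: str) -> ParsedOutput:
    """Separate tag-delimited reasoning spans from the visible answer."""
    reasoning_parts = []
    content_parts = []
    pos = 0
    while True:
        start = text.find(OPEN_TAG, pos)
        if start == -1:
            # No more reasoning blocks: the remainder is visible content.
            content_parts.append(text[pos:])
            break
        content_parts.append(text[pos:start])
        body_start = start + len(OPEN_TAG)
        end = text.find(CLOSE_TAG, body_start)
        if end == -1:
            # Assumption: an unterminated block is all reasoning.
            reasoning_parts.append(text[body_start:])
            break
        reasoning_parts.append(text[body_start:end])
        pos = end + len(CLOSE_TAG)
    return ParsedOutput(
        reasoning="\n".join(p.strip() for p in reasoning_parts if p.strip()),
        content="".join(content_parts).strip(),
    )
```

Calling `split_reasoning` on a completion that opens with a reasoning block yields the reasoning and the answer as separate fields, which mirrors the separate reasoning channel described above.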
New and improved
- Heterogeneous clusters: AMD / Linux GPU nodes serving GGUF through the llama.cpp engine (Vulkan), with logprobs and tool-calling parity, cancellation, and structured errors.
- Centralized model store: selective single-quant GGUF download, routable-IP advertisement, and a pinned-quant download contract.
- Vision GGUF models on llama.cpp (the multimodal projector is downloaded and loaded automatically).
- Optional Eclipse Zenoh transport for the per-token data plane, key-addressed per owner, with namespace isolation and a transport-conditional reorder buffer.
- Control / telemetry / data plane separation: generation output and observational node readings move off the ordered event log.
- Cross-engine reasoning parsing (gpt-oss harmony and <think>), thinking-aware output on the llama.cpp engine, and recovery of empty content for auto-imported Qwen3 reasoning models.
- Context-aware admission for the llama.cpp engine, a named single-node engine capability, and a resolved-backend record stamped onto each placed shard.
- Flash Attention on by default for the llama.cpp engine.
- Node logs are bounded so they cannot fill the disk, and structured runner errors (context-length and others) surface on the API.
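The reorder buffer mentioned in the list above can be pictured with a small sketch. This is a generic sequence-number reorder buffer, not Skulk's implementation: the class name `ReorderBuffer`, the per-chunk sequence number, and the choice to drop duplicates silently are all assumptions made for illustration.

```python
from typing import Dict, List


class ReorderBuffer:
    """Release chunks in sequence order, holding back any that arrive early."""

    def __init__(self, first_seq: int = 0) -> None:
        self._next_seq = first_seq          # next sequence number to deliver
        self._pending: Dict[int, str] = {}  # early chunks keyed by sequence number

    def push(self, seq: int, chunk: str) -> List[str]:
        """Accept one chunk and return every chunk that is now deliverable."""
        if seq < self._next_seq or seq in self._pending:
            # Assumption: duplicates and already-delivered chunks are dropped.
            return []
        self._pending[seq] = chunk
        ready: List[str] = []
        # Drain the contiguous run starting at the next expected number.
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready
```

With this shape, a chunk that arrives ahead of its predecessor is held until the gap is filled, and then both are released together in order. A transport that already guarantees ordering could skip the buffer entirely, which is one plausible reading of "transport-conditional".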
Fixes
A large set of placement, failover, and stability fixes, including: tight
multi-node placements no longer silently vanishing; master failover no longer
killing a healthy serving instance or churning elections under connection churn;
GPU-offload and unified-memory nodes admitted against the right memory; large
GGUF loads and llama.cpp logprobs no longer OOM-killing a node; a placement right
after a teardown no longer spuriously refused; data-plane streams no longer
hanging on a dropped final chunk and multi-node output no longer silently
reordered; the source-built GPU llama.cpp wheel surviving uv sync; and the
macOS Local Network permission being diagnosable with a clearer denial warning.
See the CHANGELOG for the complete per-change list.
Upgrading
All nodes in a cluster must run the same Skulk version. Upgrade the whole fleet together; a mixed-version cluster is unsupported.
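Before and after an upgrade, it can help to confirm that no node was left behind. The sketch below assumes you have already collected each node's reported version into a dictionary by whatever means your fleet uses; the function name and the node names are hypothetical and not part of Skulk.

```python
from typing import Dict, List


def find_version_mismatches(node_versions: Dict[str, str], target: str) -> List[str]:
    """Return the sorted names of nodes that are not running `target`."""
    return sorted(
        name for name, version in node_versions.items() if version != target
    )


# Hypothetical fleet: one node has not been upgraded yet.
fleet = {"mac-studio-1": "1.3.0", "strix-halo-1": "1.2.4"}
stale = find_version_mismatches(fleet, "1.3.0")
```

An empty result means the fleet is uniform; any names returned are nodes to upgrade before the cluster is put back into service.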