Release Notes 1.3.0

Release date: 2026-06-25

Skulk 1.3.0 turns the cluster into a heterogeneous inference fabric. Until now, a cluster consisted of Apple Silicon nodes running MLX. In 1.3.0, an AMD or other Linux GPU node can join the same cluster and serve models through a second inference engine; the cluster's networking is split into purpose-built planes; and the model catalog grows to cover GGUF weights and vision models, backed by a centralized model store.

Highlights

Heterogeneous clusters: AMD / Linux GPU nodes via a llama.cpp engine. A Linux GPU box (for example an AMD Strix Halo) can now join a Skulk cluster and serve GGUF models through a new llama.cpp inference engine on the Vulkan (Mesa RADV) backend, side by side with Apple Silicon MLX nodes in the same cluster. Each model is routed to a node that can run it: MLX weights to Apple Silicon, GGUF weights to a llama.cpp node. The llama.cpp engine reaches parity with MLX on tool calling and logprobs, honors cancellation, and surfaces clear errors. See the AMD / Strix Halo node guide.
A centralized model store. One node can act as a store host that downloads model weights once and serves them to the rest of the cluster, so the other nodes do not each re-download them from Hugging Face. For GGUF repositories the store fetches only the quantization that a model card pins (not every quant in the repo), and the store host advertises a routable address so downloads work on a Thunderbolt-meshed fleet. See the model store guide.
Vision models on GGUF. Vision-language GGUF models run on the llama.cpp engine: the runner loads the model's multimodal projector and accepts image inputs through the standard chat API.
Plane separation, with an optional Zenoh data plane. Cluster traffic is now split into three planes: a control plane (cluster decisions and task lifecycle), a telemetry plane (last-write-wins node readings), and a data plane that streams generated tokens straight to the owning API node. The data plane can ride an optional Eclipse Zenoh transport, which is soft default-on, with per-cluster namespace isolation. See cluster communication.
Cross-engine reasoning. Reasoning models keep their chain-of-thought out of the answer on both engines. gpt-oss "harmony" output and plain <think>-delimited reasoning are parsed into a separate reasoning channel, so the visible content is the clean answer and the reasoning is available separately. On the llama.cpp engine this is done with a dependency-free parser so it runs on non-Mac GPU nodes.
Richer telemetry and topology. Node readings (memory, system, disk, accelerator metrics, identities) are gossiped on the telemetry plane and shown in the dashboard, including a collector-agnostic accelerator block that covers both Apple and AMD GPUs, heterogeneous-node identity in the topology, and per-node health.
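
To make the cross-engine reasoning item concrete, here is a minimal sketch of how plain <think>-delimited output could be split into a reasoning channel and a visible answer using only the Python standard library. It is an illustration under stated assumptions, not Skulk's parser: the names split_reasoning and ParsedOutput are hypothetical, gpt-oss "harmony" output is not handled, and treating an unterminated <think> as reasoning is a design choice made for this example.

```python
import re
from typing import NamedTuple

# Hypothetical helper for illustration only; not Skulk's actual parser.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
OPEN_TAG = "<think>"


class ParsedOutput(NamedTuple):
    reasoning: str  # chain-of-thought, kept out of the answer
    content: str    # the clean, user-visible answer


def split_reasoning(raw: str) -> ParsedOutput:
    """Separate <think>...</think> spans from the visible answer."""
    # Collect every complete reasoning block, then strip them from the text.
    parts = [block.strip() for block in THINK_BLOCK.findall(raw)]
    content = THINK_BLOCK.sub("", raw)

    # Assumption: an unterminated <think> (for example, a generation that
    # was cut off) is treated as reasoning all the way to the end.
    open_idx = content.find(OPEN_TAG)
    if open_idx != -1:
        parts.append(content[open_idx + len(OPEN_TAG):].strip())
        content = content[:open_idx]

    reasoning = "\n".join(part for part in parts if part)
    return ParsedOutput(reasoning=reasoning, content=content.strip())
```

Calling split_reasoning("<think>2+2 is 4</think>The answer is 4.") returns the reasoning "2+2 is 4" and the content "The answer is 4.", which mirrors the split between the reasoning channel and the visible content described above.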

New and improved

Heterogeneous clusters: AMD / Linux GPU nodes serving GGUF through the llama.cpp engine (Vulkan), with logprobs and tool-calling parity, cancellation, and structured errors.
Centralized model store: selective single-quant GGUF download, routable-IP advertisement, and a pinned-quant download contract.
Vision GGUF models on llama.cpp (the multimodal projector is downloaded and loaded automatically).
Optional Eclipse Zenoh transport for the per-token data plane, key-addressed per owner, with namespace isolation and a transport-conditional reorder buffer.
Control / telemetry / data plane separation: generation output and observational node readings move off the ordered event log.
Cross-engine reasoning parsing (gpt-oss harmony and <think>), thinking-aware output on the llama.cpp engine, and recovery of empty content for auto-imported Qwen3 reasoning models.
Context-aware admission for the llama.cpp engine, a named single-node engine capability, and a resolved-backend record stamped onto each placed shard.
Flash Attention on by default for the llama.cpp engine.
Node logs are bounded so they cannot fill the disk, and structured runner errors (context-length and others) surface on the API.
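
The reorder buffer mentioned for the per-token data plane can be pictured with a short sketch. It assumes each streamed chunk carries a monotonically increasing sequence number starting at 0, which these notes do not specify; the ReorderBuffer class and its push method are hypothetical names, and the real buffer (which is described as transport-conditional) may work differently.

```python
from typing import Dict, List


class ReorderBuffer:
    """Release chunks in sequence order even if they arrive out of order.

    Hypothetical illustration; the sequence-number scheme is an assumption.
    """

    def __init__(self) -> None:
        self._next_seq = 0                 # next sequence number to release
        self._pending: Dict[int, str] = {}  # chunks that arrived early

    def push(self, seq: int, chunk: str) -> List[str]:
        """Accept one chunk and return every chunk that is now deliverable."""
        if seq < self._next_seq:
            return []  # already released: drop the duplicate or stale chunk
        self._pending[seq] = chunk
        ready: List[str] = []
        # Drain the contiguous run that starts at the expected sequence number.
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready
```

With this sketch, pushing chunk 1 before chunk 0 returns an empty list, and the later push of chunk 0 returns both chunks in order. A transport that already guarantees ordered delivery would not need the buffer, which is one plausible reading of "transport-conditional".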

Fixes

A large set of placement, failover, and stability fixes, including:

Tight multi-node placements no longer silently vanish.
Master failover no longer kills a healthy serving instance or churns elections under connection churn.
GPU-offload and unified-memory nodes are admitted against the right memory.
Large GGUF loads and llama.cpp logprobs no longer OOM-kill a node.
A placement right after a teardown is no longer spuriously refused.
Data-plane streams no longer hang on a dropped final chunk, and multi-node output is no longer silently reordered.
The source-built GPU llama.cpp wheel survives uv sync.
The macOS Local Network permission is diagnosable, with a clearer denial warning.

See the CHANGELOG for the complete per-change list.

Upgrading

All nodes in a cluster must run the same Skulk version. Upgrade the whole fleet together; a mixed-version cluster is unsupported.
