Release Notes 1.3.0

Release date: 2026-06-25

Skulk 1.3.0 turns the cluster into a heterogeneous inference fabric. Until now, a cluster consisted of Apple Silicon nodes running MLX. In 1.3.0, an AMD or other Linux GPU node can join the same cluster and serve models through a second inference engine. The cluster's networking is also split into purpose-built planes, and the model catalog grows to cover GGUF weights, vision models, and a centralized model store.

Highlights

  • Heterogeneous clusters: AMD / Linux GPU nodes via a llama.cpp engine. A Linux GPU box (for example an AMD Strix Halo) can now join a Skulk cluster and serve GGUF models through a new llama.cpp inference engine on the Vulkan (Mesa RADV) backend, side by side with Apple Silicon MLX nodes in the same cluster. Each model is routed to a node that can run it: MLX weights to Apple Silicon, GGUF weights to a llama.cpp node. The llama.cpp engine reaches parity with MLX on tool calling and logprobs, honors cancellation, and surfaces clear errors. See the AMD / Strix Halo node guide.

  • A centralized model store. One node can act as a store host that downloads model weights once and serves them to the rest of the cluster, so the other nodes do not each re-download them from Hugging Face. For GGUF repositories the store fetches only the quantization a model card pins (not every quant in the repo), and the store host advertises a routable address so downloads work on a Thunderbolt-meshed fleet. See the model store guide.

  • Vision models on GGUF. Vision-language GGUF models run on the llama.cpp engine: the runner loads the model's multimodal projector and accepts image inputs through the standard chat API.

  • Plane separation, with an optional Zenoh data plane. Cluster traffic is now split into three planes: a control plane (cluster decisions and task lifecycle), a telemetry plane (last-write-wins node readings), and a data plane that streams generated tokens straight to the owning API node. The data plane can ride an optional Eclipse Zenoh transport, which is soft default-on, with per-cluster namespace isolation. See cluster communication.

  • Cross-engine reasoning. Reasoning models keep their chain-of-thought out of the answer on both engines. gpt-oss "harmony" output and plain <think>-delimited reasoning are parsed into a separate reasoning channel, so the visible content is the clean answer and the reasoning is available separately. On the llama.cpp engine this is done with a dependency-free parser so it runs on non-Mac GPU nodes.

  • Richer telemetry and topology. Node readings (memory, system, disk, accelerator metrics, identities) are gossiped on the telemetry plane and shown in the dashboard, including a collector-agnostic accelerator block that covers both Apple and AMD GPUs, heterogeneous-node identity in the topology, and per-node health.
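
The routing rule from the first highlight (MLX weights to Apple Silicon, GGUF weights to a llama.cpp node) can be pictured as a small matching step. The sketch below is illustrative only and is not Skulk's API: the `Node` and `ModelCard` types, their fields, and the `pick_node` helper are hypothetical names, and the "most free memory wins" tiebreak is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed mapping from weight format to the engine that can load it.
ENGINE_FOR_FORMAT = {"mlx": "mlx", "gguf": "llama.cpp"}


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a cluster node."""
    name: str
    engine: str            # "mlx" or "llama.cpp"
    free_memory_gb: float


@dataclass(frozen=True)
class ModelCard:
    """Hypothetical view of a model card."""
    model_id: str
    weight_format: str     # "mlx" or "gguf"
    required_memory_gb: float


def pick_node(card: ModelCard, nodes: list[Node]) -> Optional[Node]:
    """Return a node whose engine can run the model, or None if none fits."""
    engine = ENGINE_FOR_FORMAT.get(card.weight_format)
    candidates = [
        n for n in nodes
        if n.engine == engine and n.free_memory_gb >= card.required_memory_gb
    ]
    if not candidates:
        return None
    # Assumed tiebreak: prefer the node with the most free memory.
    return max(candidates, key=lambda n: n.free_memory_gb)
```

With one MLX node and one llama.cpp node in the list, a GGUF card resolves to the llama.cpp node, and a card whose format no node supports resolves to `None`.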

New and improved

  • Heterogeneous clusters: AMD / Linux GPU nodes serving GGUF through the llama.cpp engine (Vulkan), with logprobs and tool-calling parity, cancellation, and structured errors.
  • Centralized model store: selective single-quant GGUF download, routable-IP advertisement, and a pinned-quant download contract.
  • Vision GGUF models on llama.cpp (the multimodal projector is downloaded and loaded automatically).
  • Optional Eclipse Zenoh transport for the per-token data plane, key-addressed per owner, with namespace isolation and a transport-conditional reorder buffer.
  • Control / telemetry / data plane separation: generation output and observational node readings move off the ordered event log.
  • Cross-engine reasoning parsing (gpt-oss harmony and <think>), thinking-aware output on the llama.cpp engine, and recovery of empty content for auto-imported Qwen3 reasoning models.
  • Context-aware admission for the llama.cpp engine, a named single-node engine capability, and a resolved-backend record stamped onto each placed shard.
  • Flash Attention on by default for the llama.cpp engine.
  • Node logs are bounded so they cannot fill the disk, and structured runner errors (context-length and others) surface on the API.
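
As a rough picture of what parsing `<think>`-delimited reasoning into a separate channel means, the following dependency-free sketch splits a completed response into reasoning and visible content. It is not the parser shipped in Skulk: the `split_think` name is hypothetical, it works on a whole string rather than a token stream, it does not cover gpt-oss harmony output, and treating an unterminated `<think>` block as reasoning is an assumption.

```python
def split_think(text: str) -> tuple[str, str]:
    """Split text into (reasoning, content) around <think>...</think> blocks."""
    open_tag, close_tag = "<think>", "</think>"
    reasoning_parts: list[str] = []
    content_parts: list[str] = []
    pos = 0
    while True:
        start = text.find(open_tag, pos)
        if start == -1:
            # No further reasoning block: the rest is visible content.
            content_parts.append(text[pos:])
            break
        content_parts.append(text[pos:start])
        body_start = start + len(open_tag)
        end = text.find(close_tag, body_start)
        if end == -1:
            # Assumption: an unterminated block is all reasoning.
            reasoning_parts.append(text[body_start:])
            break
        reasoning_parts.append(text[body_start:end])
        pos = end + len(close_tag)
    return "".join(reasoning_parts).strip(), "".join(content_parts).strip()


reasoning, content = split_think("<think>2 + 2 is 4</think>The answer is 4.")
```

Here `content` holds only the clean answer and `reasoning` holds the text that was inside the tags, which mirrors the separation described above.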

Fixes

A large set of placement, failover, and stability fixes, including:

  • Tight multi-node placements no longer silently vanish.
  • Master failover no longer kills a healthy serving instance or churns elections under connection churn.
  • GPU-offload and unified-memory nodes are admitted against the right memory.
  • Large GGUF loads and llama.cpp logprobs no longer OOM-kill a node.
  • A placement right after a teardown is no longer spuriously refused.
  • Data-plane streams no longer hang on a dropped final chunk, and multi-node output is no longer silently reordered.
  • The source-built GPU llama.cpp wheel survives uv sync.
  • The macOS Local Network permission is diagnosable, with a clearer denial warning.
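
For readers unfamiliar with the idea behind the reordering fix, a reorder buffer holds chunks that arrive early and releases them only in sequence order. The sketch below shows the general technique under stated assumptions: it is not Skulk's implementation, the `ReorderBuffer` class and the per-chunk sequence number are hypothetical, and it does not model timeouts or recovery from a chunk that never arrives.

```python
class ReorderBuffer:
    """Release chunks strictly in sequence order, buffering early arrivals."""

    def __init__(self, first_seq: int = 0) -> None:
        self._next = first_seq              # next sequence number to release
        self._pending: dict[int, str] = {}  # early chunks keyed by sequence

    def push(self, seq: int, chunk: str) -> list[str]:
        """Accept one chunk and return every chunk that is now in order."""
        if seq < self._next or seq in self._pending:
            return []  # stale or duplicate chunk: ignore it
        self._pending[seq] = chunk
        ready: list[str] = []
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready


buf = ReorderBuffer()
early = buf.push(1, "world")    # held back: chunk 0 has not arrived yet
flushed = buf.push(0, "hello")  # releases chunks 0 and 1 together
```

Pushing chunk 1 before chunk 0 returns nothing at first; once chunk 0 arrives, both come out in order.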

See the CHANGELOG for the complete per-change list.

Upgrading

All nodes in a cluster must run the same Skulk version. Upgrade the whole fleet together; a mixed-version cluster is unsupported.
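
Before and after a fleet upgrade it can help to confirm that no node was left behind. The sketch below is a hypothetical helper, not a Skulk command: it assumes you have already collected each node's reported version into a dictionary by whatever means your fleet uses, and the `find_version_mismatches` name is made up for illustration.

```python
def find_version_mismatches(versions: dict[str, str], expected: str) -> dict[str, str]:
    """Return the nodes whose reported version differs from the expected one.

    `versions` maps a node name to the version string it reports; how that
    mapping is gathered is outside the scope of this sketch.
    """
    return {node: v for node, v in versions.items() if v != expected}


def is_uniform(versions: dict[str, str]) -> bool:
    """True when every node reports the same version (or there are no nodes)."""
    return len(set(versions.values())) <= 1
```

An empty result from `find_version_mismatches` means every listed node reports the expected version.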