Release Notes 1.3.1

Release date: 2026-07-01

Skulk 1.3.1 brings speculative decoding to AMD and other GPU nodes. 1.3.0 let a Linux GPU box join the cluster and serve GGUF models through an in-process llama.cpp engine; 1.3.1 adds a second, served engine that unlocks llama.cpp's native multi-token prediction, so the speculative-decoding speedups that were Apple-Silicon-only now run on a Strix Halo too. It also hardens the model store against a few download and re-provisioning edge cases that large-model deployments hit.

Highlights

Native MTP on GPU nodes (served llama_server engine). A new inference-engine class launches an external llama-server process and proxies its OpenAI-compatible API, coexisting with the in-process mlx and llama_cpp engines. It is the only path to llama.cpp's native multi-token-prediction (MTP) speculative decoding (--spec-type draft-mtp), which is not reachable from the in-process binding. Enable it on a node by pointing SKULK_LLAMA_SERVER_BIN at a llama-server binary, route a model to it with compatible_backends, and configure speculation with the model card's served_spec_type / served_spec_n_max runtime fields. Two MTP shapes are supported: GGUFs with baked-in MTP heads (Qwen3.5 / Qwen3.6), and a base model plus a separate --model-draft GGUF (Gemma 4 31B, co-fetched through the store). Measured speedup: 2.19x on a dense Qwen3.6-27B on a Strix Halo (Radeon / Vulkan). See the AMD / Strix Halo node guide and Speculative Decoding.
Store-delete evicts staged copies cluster-wide. Deleting a model from the store (DELETE /store/models/{model_id}) now removes the store host's canonical copy and evicts the model's staged copy from every node, instead of leaving each worker's disk to reclaim it lazily under memory pressure. See the model store guide.
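
As a rough illustration of the store-delete call, the sketch below builds a DELETE /store/models/{model_id} request with Python's standard library. Only the route comes from these notes; the host address, port, model ID, and the choice to percent-encode the ID while keeping "/" literal are assumptions made for the example, so check the model store guide for the actual conventions.

```python
from urllib.parse import quote
from urllib.request import Request


def build_store_delete_request(base_url: str, model_id: str) -> Request:
    """Build (but do not send) a DELETE /store/models/{model_id} request."""
    # Assumption: the model ID is percent-encoded with "/" kept literal.
    path = "/store/models/" + quote(model_id, safe="/")
    return Request(base_url.rstrip("/") + path, method="DELETE")


if __name__ == "__main__":
    # Hypothetical store host and model ID, for illustration only.
    req = build_store_delete_request(
        "http://store-host.example:8000", "example-org/example-model"
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=30)
```

Building the request separately from sending it makes it easy to inspect the URL before running a delete that, as of this release, also evicts staged copies on every node.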

Fixes

Large model downloads no longer time out mid-transfer. The store's file-body download applied a fixed 30-minute total timeout that capped a transfer by wall-clock time regardless of progress, so a multi-GB GGUF could fail partway through (a 17 GB model at ~7.5 MB/s failed at ~80%, with an empty error message). Large downloads are now bounded by read inactivity, not total time, so a slow-but-progressing transfer of any size completes, and a failed one resumes from its partial download on retry. The worker's wait for a store download is likewise progress-aware, so it no longer gives up on a live, still-progressing multi-hour download.
Store re-download after a delete no longer silently no-ops. A stale in-memory "complete" status could make a re-download short-circuit after a delete, leaving a model that reported complete while absent from the store and unstageable ("not found in store"). Re-provisioning a model after a store-delete now works.
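
The figures in the download-timeout fix can be checked with a little arithmetic, and the difference between the two timeout policies can be shown with a small simulation. This is a sketch, not Skulk's implementation: it assumes decimal units (1 GB = 1000 MB), and the 60-second idle limit and 10-second chunk spacing are hypothetical values chosen for the example.

```python
from typing import Sequence


def transfer_seconds(size_gb: float, rate_mb_s: float) -> float:
    """Time to move size_gb at a steady rate (assumes 1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_s


def fraction_done_at(cap_s: float, size_gb: float, rate_mb_s: float) -> float:
    """Fraction of the file transferred when a total-time cap fires."""
    return min(1.0, cap_s / transfer_seconds(size_gb, rate_mb_s))


def survives_total_timeout(chunk_times_s: Sequence[float], cap_s: float) -> bool:
    """Old policy: the whole transfer must finish within cap_s."""
    return not chunk_times_s or chunk_times_s[-1] <= cap_s


def survives_inactivity_timeout(
    chunk_times_s: Sequence[float], idle_limit_s: float
) -> bool:
    """New policy: fail only if no data arrives for idle_limit_s."""
    previous = 0.0
    for t in chunk_times_s:
        if t - previous > idle_limit_s:
            return False
        previous = t
    return True


if __name__ == "__main__":
    total = transfer_seconds(17, 7.5)            # about 2267 s, roughly 38 minutes
    done = fraction_done_at(30 * 60, 17, 7.5)    # about 0.79 of the file
    print(f"full transfer: {total / 60:.1f} min, done at 30 min: {done:.0%}")

    # A slow but steady transfer: one chunk every 10 s until the file completes.
    arrivals = [10.0 * i for i in range(1, 228)]
    print(survives_total_timeout(arrivals, 30 * 60))   # False
    print(survives_inactivity_timeout(arrivals, 60))   # True
```

At 7.5 MB/s a 17 GB file needs about 38 minutes, so a 30-minute cap fires at roughly 79% complete, which matches the ~80% failure point described above. An inactivity bound only trips when the stream actually stalls.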

Upgrading

All nodes in a cluster must run the same Skulk version. Upgrade the whole fleet together; mixed-version clusters are unsupported.
