Release Notes 1.3.1
Release date: 2026-07-01
Skulk 1.3.1 brings speculative decoding to AMD and other GPU nodes. 1.3.0 let a Linux GPU box join the cluster and serve GGUF models through an in-process llama.cpp engine; 1.3.1 adds a second, served engine that unlocks llama.cpp's native multi-token prediction, so the speculative-decoding speedups that were Apple-Silicon-only now run on a Strix Halo too. It also hardens the model store against a few download and re-provisioning edge cases that large-model deployments hit.
Highlights
- Native MTP on GPU nodes (served llama_server engine). A new inference-engine class launches an external llama-server process and proxies its OpenAI API, coexisting with the in-process mlx and llama_cpp engines. It is the only path to llama.cpp's native multi-token-prediction speculative decoding (--spec-type draft-mtp), which is not reachable from the in-process binding. Enable it on a node by pointing SKULK_LLAMA_SERVER_BIN at a llama-server binary; route a model to it with compatible_backends and configure speculation with the card's served_spec_type / served_spec_n_max runtime fields. Two MTP shapes are supported: GGUFs with baked-in MTP heads (Qwen3.5 / Qwen3.6) and a base plus a separate --model-draft GGUF (Gemma 4 31B, co-fetched through the store). Measured 2.19x on a dense Qwen3.6-27B on a Strix Halo (Radeon / Vulkan). See the AMD / Strix Halo node guide and Speculative Decoding.
- Store-delete evicts staged copies cluster-wide. Deleting a model from the store (DELETE /store/models/{model_id}) now removes the store host's canonical copy and evicts the model's staged copy from every node, instead of leaving each worker's disk to reclaim it lazily under memory pressure. See the model store guide.
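As a rough illustration of the two MTP shapes, the sketch below assembles a llama-server command line for each. It is not Skulk's launcher: the helper name, the model paths, and the port are hypothetical, and how Skulk maps served_spec_type and served_spec_n_max onto flags is not covered in these notes. Only SKULK_LLAMA_SERVER_BIN, --spec-type draft-mtp, and --model-draft come from the text above; --model and --port are standard llama-server options.

```python
import os
from typing import List, Mapping, Optional


def build_llama_server_argv(
    model_path: str,
    spec_type: Optional[str] = None,
    draft_model_path: Optional[str] = None,
    port: int = 8080,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Assemble an argv list for llama-server (hypothetical helper, not Skulk code)."""
    env = os.environ if env is None else env
    binary = env.get("SKULK_LLAMA_SERVER_BIN")
    if not binary:
        # Assumption: without the variable, the served engine is unavailable.
        raise RuntimeError("SKULK_LLAMA_SERVER_BIN is not set")

    argv = [binary, "--model", model_path, "--port", str(port)]
    if spec_type:
        # Flag and value as named in the release notes.
        argv += ["--spec-type", spec_type]
    if draft_model_path:
        # Second shape: a base model plus a separate draft GGUF.
        argv += ["--model-draft", draft_model_path]
    return argv


if __name__ == "__main__":
    demo_env = {"SKULK_LLAMA_SERVER_BIN": "/opt/llama.cpp/bin/llama-server"}
    # Shape 1: a GGUF with baked-in MTP heads (path is a placeholder).
    print(build_llama_server_argv("/models/base-with-mtp.gguf", "draft-mtp", env=demo_env))
    # Shape 2: base plus separate draft GGUF (paths are placeholders).
    print(build_llama_server_argv(
        "/models/base.gguf", "draft-mtp", "/models/draft.gguf", env=demo_env))
```

The resulting list could be handed to subprocess.Popen; keeping it as a list rather than a shell string avoids quoting problems with model paths.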
Fixes
- Large model downloads no longer time out mid-transfer. The store's file-body download applied a fixed 30-minute total timeout that capped a transfer by wall-clock regardless of progress, so a multi-GB GGUF could fail partway through (a 17 GB model at ~7.5 MB/s failed at ~80%, with an empty error). Large downloads are now bounded by read-inactivity, not total time, so a slow-but-progressing transfer of any size completes and resumes from its partial on retry. The worker's wait for a store download is likewise progress-aware, so it no longer gives up on a live, still-progressing multi-hour download.
- Store re-download after a delete no longer silently no-ops. A stale in-memory "complete" status could make a re-download short-circuit after a delete, leaving a model that reported complete while absent from the store and unstageable ("not found in store"). Re-provisioning a model after a store-delete now works.
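The ~80% figure in the download fix follows from the arithmetic: at 7.5 MB/s, 30 minutes moves about 13,500 MB, roughly 79% of a 17,000 MB file. The sketch below reproduces that number and contrasts the two timeout policies on a simulated timeline of read events. It is a model of the idea only, not the store's implementation; the function names are hypothetical, 17 GB is taken as 17,000 MB, and the 60-second inactivity limit is an assumed value that these notes do not specify.

```python
from typing import List, Optional


def fraction_completed(size_mb: float, rate_mb_s: float, total_timeout_s: float) -> float:
    """Fraction of the file transferred when a fixed wall-clock timeout fires."""
    return min(1.0, rate_mb_s * total_timeout_s / size_mb)


def first_stall(read_times_s: List[float], inactivity_limit_s: float) -> Optional[float]:
    """Time at which a read-inactivity limit would fire, or None if it never does.

    read_times_s holds the timestamps (seconds since the transfer began at t=0)
    at which data arrived, in increasing order.
    """
    last_read = 0.0
    for t in read_times_s:
        if t - last_read > inactivity_limit_s:
            # The gap since the previous read exceeded the limit.
            return last_read + inactivity_limit_s
        last_read = t
    return None


if __name__ == "__main__":
    # Fixed 30-minute total timeout on the transfer from the notes.
    print(round(fraction_completed(17_000, 7.5, 30 * 60), 3))

    # A slow but steady transfer: one read every 10 s for over an hour.
    steady = [10.0 * i for i in range(1, 400)]
    print(first_stall(steady, inactivity_limit_s=60))

    # A transfer that goes silent after 20 s.
    print(first_stall([10.0, 20.0, 200.0], inactivity_limit_s=60))
```

Under the inactivity policy the steady transfer is never interrupted however long it runs, while the silent one is cut off 60 seconds after its last read.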
Upgrading
All nodes in a cluster must run the same Skulk version. Upgrade the whole fleet together; mixed-version clusters are unsupported.