Release Notes 1.2.0
Release date: 2026-06-11
Highlights
- Speculative decoding is production-grade. Multi-token-prediction (MTP) drafting now runs on single-node, pipeline-sharded, and tensor-parallel placements (including heterogeneous multi-node rings) with measured per-model draft depths, sampled-decoding support, and a dashboard badge showing speculation status per instance. Measured speedups range from 1.16× to 2.2× across the supported model cards.
- A stability overhaul, proven live. A multi-week hardening arc (every fix validated end-to-end on a real 4-node cluster) closed the failure classes that previously required operator intervention or reboots: placements now survive master failover, runner crash loops trip a breaker instead of leaking wired GPU memory, oversized placements are refused before they can OOM a node, node telemetry no longer collides with MLX inference, and an event-storm defect that could churn elections and silently drop placements is fixed at five layers.
- Rational context-length control. Each placed instance now derives a usable context ceiling (the smaller of the model's advertised context length and what actually fits in memory beside the weights) and admits requests against it. Over-long requests get a clean OpenAI-style context_length_exceeded error instead of growing the KV cache until the runner dies with a Metal OOM.
- Multi-node networking hardened. Ring formation now runs under a hard deadline instead of hanging forever on a failed handshake, and interconnect selection follows operator intent: Thunderbolt first, LAN next, VPN (Tailscale) last. VPN exists for reachability, not as a preferred data path.
- The project is now Skulk all the way down. The exo package, env vars, config paths, and service identifiers are renamed; compatibility shims honor existing EXO_* environment variables and ~/.exo data directories.
Speculative Decoding (MTP)
The speculative-decoding stack was rebuilt across this release cycle:
- Bonus-driven rounds: the draft/verify loop was restructured so the verified bonus token chains directly into the next draft round, raising acceptance-adjusted throughput on every card measured.
- Depth-K chained drafting with per-card measured optima: draft depth is no longer a global constant but a tuned property of each model card.
- Sampled decoding support: speculation now applies under temperature sampling, not just greedy decoding.
- Multi-node speculation on pipeline and tensor placements. Cross-rank draft/accept decisions are explicitly broadcast (decider protocol), which makes speculation correct on heterogeneous hardware where ranks could otherwise diverge bit-wise and wedge the GPU. Multi-node speculation is on by default with a per-card speculative_multi_node opt-out.
- Gemma 4 assistant drafting and embedded-MTP support for Qwen-family cards.
- Performance repair: a scheduling defect that made production MTP run 20 to 46× slower than plain decode under concurrent load is fixed, and speculation now engages for models that were already resident in memory.
- The dashboard shows a ⚡ MTP badge with draft depth on each running instance.
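The draft/verify loop described above can be illustrated with a toy greedy sketch. It is not Skulk's implementation: the function names (`speculative_round`, `generate`, `toy_target`, `toy_draft`) and the integer "models" are hypothetical stand-ins, and the sketch covers greedy decoding only, not the sampled case. It shows depth-K chained drafting, acceptance of the longest matching prefix, and the extra target token (a correction or a bonus) that becomes the starting point for the next draft round.

```python
from typing import Callable, List

# Hypothetical model interface: maps a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_round(context: List[int], draft_next: NextToken,
                      target_next: NextToken, depth: int) -> List[int]:
    """Run one greedy draft/verify round and return the tokens to append."""
    # Draft phase: chain `depth` cheap predictions.
    drafted: List[int] = []
    for _ in range(depth):
        drafted.append(draft_next(context + drafted))

    # Verify phase: a real engine scores every position in one target pass;
    # this toy calls the target once per position instead.
    accepted: List[int] = []
    for token in drafted:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # target's token replaces the first mismatch
            return accepted
        accepted.append(token)

    # Every draft matched, so the target's next token is a "bonus" token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(prompt: List[int], draft_next: NextToken, target_next: NextToken,
             depth: int, max_new_tokens: int) -> List[int]:
    """Chain rounds; each round starts drafting from the previous bonus token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_round(out, draft_next, target_next, depth))
    return out[:len(prompt) + max_new_tokens]


def toy_target(ctx: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def plain_greedy(prompt: List[int], target_next: NextToken, n: int) -> List[int]:
    """Reference decode without speculation, one target call per token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out
```

Under greedy decoding the speculative output matches the plain decode token for token; only the number of target passes changes. A fully accepted round at depth K yields K + 1 tokens, which is why a higher acceptance rate justifies a deeper per-card draft depth.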
Stability and Resilience
Every item below was reproduced and re-verified live on a multi-node cluster:
- Placements survive master failover. A newly elected master seeds its session from the prior replicated state, so placed models keep serving through a master restart (resume measured at ~23 s) instead of silently becoming 404s. Open requests on the dead master fail fast with a clear error instead of hanging.
- Event-storm immunity. Abandoned client requests (short timeouts against a loading model) could previously ignite a self-sustaining event storm (~800 events/s) that drowned replicas, churned elections, and dropped placements. Fixed at five layers, from the runner-side mint to a master ingest cap that bounds any future emitter at zero amplification.
- Memory-safe placement. The placement fit-check and a worker-side pre-spawn guard share one memory model (GPU-wireable availability), so oversized placements are refused cleanly instead of OOM-aborting and leaking wired GPU memory until reboot.
- Crash containment. The crash circuit breaker is edge-triggered (one trip per crash loop), GPU-wedge runner deaths are never retried (each retry leaked ~5 GB of wired memory), a wedged warmup marks the instance failed loudly instead of silently disabling the node, and ring group-connect runs under a configurable deadline (SKULK_GROUP_CONNECT_DEADLINE_SECONDS, default 120 s) with a network diagnosis instead of hanging forever.
- Telemetry that cannot take down inference. The macmon-based system poller could collide with MLX and reboot the machine under load; node telemetry now uses mactop. Peer churn no longer crashes healthy bystander nodes, and the disk event log degrades gracefully on a full disk instead of killing the node.
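As an illustration of the group-connect deadline, here is a minimal sketch that reads SKULK_GROUP_CONNECT_DEADLINE_SECONDS and bounds a connect coroutine with `asyncio.wait_for`. Only the variable name and the 120 s default come from these notes; the helper names, the `GroupConnectTimeout` exception, and the fallback to the default on an invalid or non-positive value are assumptions made for the example.

```python
import asyncio
import os
from typing import Awaitable, Callable, Mapping, TypeVar

T = TypeVar("T")

DEFAULT_DEADLINE_SECONDS = 120.0  # default stated in the release notes


def group_connect_deadline(env: Mapping[str, str] = os.environ) -> float:
    """Read the deadline from the environment.

    Assumption: unparsable or non-positive values fall back to the default.
    """
    raw = env.get("SKULK_GROUP_CONNECT_DEADLINE_SECONDS")
    if raw is None:
        return DEFAULT_DEADLINE_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_DEADLINE_SECONDS
    return value if value > 0 else DEFAULT_DEADLINE_SECONDS


class GroupConnectTimeout(RuntimeError):
    """Hypothetical error raised when ring formation misses its deadline."""


async def connect_with_deadline(connect: Callable[[], Awaitable[T]],
                                deadline_seconds: float) -> T:
    """Await `connect()` but give up once the deadline passes."""
    try:
        return await asyncio.wait_for(connect(), timeout=deadline_seconds)
    except asyncio.TimeoutError as exc:
        raise GroupConnectTimeout(
            f"ring group-connect exceeded {deadline_seconds:g} s"
        ) from exc
```

A caller would catch the timeout and attach whatever network diagnosis is available, so a failed handshake surfaces as an error rather than an indefinite hang.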
Context-Length Control
- Requests are admitted against a per-instance context ceiling computed from the model card and the hosting nodes' memory. The ceiling is deterministic across the ranks of a multi-node placement.
- An explicit max_tokens that cannot fit is rejected with an OpenAI-style context_length_exceeded invalid-request error (HTTP 400 when detectable before dispatch); a window-filling prompt is rejected before prefill; an omitted max_tokens is clamped to the remaining window so generation ends with finish_reason: "length".
- See the API guide for the exact semantics.
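The admission rules above can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the function and exception names are hypothetical, and memory fit is modelled as a flat KV-cache cost per token (`kv_bytes_per_token`), which is a simplification. The API guide remains the reference for the exact semantics.

```python
from typing import Optional


def context_ceiling(advertised_context: int, free_bytes_after_weights: int,
                    kv_bytes_per_token: int) -> int:
    """Smaller of the advertised context and what fits beside the weights.

    Assumption: every token costs a fixed number of KV-cache bytes.
    """
    fits_in_memory = free_bytes_after_weights // kv_bytes_per_token
    return max(0, min(advertised_context, fits_in_memory))


class ContextLengthExceeded(ValueError):
    """Hypothetical stand-in for the context_length_exceeded API error."""
    code = "context_length_exceeded"


def admit(prompt_tokens: int, max_tokens: Optional[int], ceiling: int) -> int:
    """Return the generation budget for a request, or raise if it cannot fit."""
    remaining = ceiling - prompt_tokens
    if remaining <= 0:
        # Window-filling prompt: reject before any prefill work happens.
        raise ContextLengthExceeded("prompt fills the context window")
    if max_tokens is None:
        # Omitted max_tokens: clamp to the remaining window.
        return remaining
    if max_tokens > remaining:
        # Explicit max_tokens that cannot fit: reject the request.
        raise ContextLengthExceeded("prompt plus max_tokens exceeds the window")
    return max_tokens
```

For example, with a 32768-token advertised context, 4 GiB free beside the weights, and an assumed 256 KiB of KV cache per token, memory is the binding limit and the ceiling is 16384 tokens. A 1000-token prompt with no max_tokens then receives a budget of 15384 tokens.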
Networking
- Ring/tensor interconnect selection ranks observed links Thunderbolt > Ethernet > Wi-Fi > VPN, detecting Tailscale addresses (CGNAT and fd7a:115c:a1e0::/48) by address so a VPN path is only used when nothing better exists. Thunderbolt interface labels survive hardware-port classification.
- Multi-node placement failures are reported with actionable errors (the node and the memory arithmetic) instead of silent refusals.
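The link ranking can be illustrated with Python's standard `ipaddress` module. The sketch assumes each observed link is a `(kind, address)` pair; the `RANK` table and the function names are hypothetical. The two address ranges are the CGNAT block (100.64.0.0/10) and the IPv6 prefix named above.

```python
import ipaddress
from typing import List, Tuple

# Address ranges used to recognise Tailscale paths by address.
TAILSCALE_V4 = ipaddress.ip_network("100.64.0.0/10")      # CGNAT range
TAILSCALE_V6 = ipaddress.ip_network("fd7a:115c:a1e0::/48")

# Lower rank is preferred: Thunderbolt > Ethernet > Wi-Fi > VPN.
RANK = {"thunderbolt": 0, "ethernet": 1, "wifi": 2, "vpn": 3}

Link = Tuple[str, str]  # hypothetical shape: (interface kind, peer address)


def is_tailscale(address: str) -> bool:
    """True when the address falls inside a Tailscale range."""
    ip = ipaddress.ip_address(address)
    network = TAILSCALE_V4 if ip.version == 4 else TAILSCALE_V6
    return ip in network


def classify(kind: str, address: str) -> str:
    """A Tailscale address is treated as VPN whatever the interface label says."""
    return "vpn" if is_tailscale(address) else kind


def best_link(links: List[Link]) -> Link:
    """Pick the most preferred link among those observed."""
    return min(links, key=lambda link: RANK[classify(*link)])
```

Classifying by address rather than by interface name means a Tailscale path is demoted to last place even when the operating system reports it under a different interface type.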
Operations
- The macOS LaunchAgent can self-update on boot, an operator-editable environment file lives at ~/.skulk/skulk.env, and a separate skulk-vector agent ships logs without coupling log shipping to the inference service lifecycle. Installers for launchd and systemd set up both agents.
- Structured external logging (SKULK_LOGGING_EXTERNAL=1) with an operator guide at External logging.
- Staged model copies have a lifecycle with per-node disk reporting, and empty or malformed generation requests are rejected at the API with a 400 instead of crashing a runner.
Platform
- Renamed exo → skulk across the package tree, services, env vars (SKULK_* honored alongside legacy EXO_*), and data directories.
- Dependency ladder: mlx 0.31.2, mlx-vlm 0.6.1, FastAPI 0.136 / Starlette 1.x.
- Dashboard fixes: deep links and browser refresh no longer 404; the topology GPU bar renders at the correct scale.
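The environment-variable compatibility in the rename item can be pictured with a small lookup helper. This is a hypothetical sketch: the `read_setting` name is invented for illustration, and the precedence shown (a SKULK_* value wins over a legacy EXO_* value when both are set) is an assumption that these notes do not confirm.

```python
import os
from typing import Mapping, Optional


def read_setting(name: str, env: Mapping[str, str] = os.environ,
                 default: Optional[str] = None) -> Optional[str]:
    """Look up SKULK_<name>, falling back to legacy EXO_<name>.

    Assumption: the new prefix takes precedence when both are present.
    """
    for prefix in ("SKULK_", "EXO_"):
        value = env.get(prefix + name)
        if value is not None:
            return value
    return default
```

With this shape, an existing deployment that only sets EXO_* variables keeps working unchanged, and operators can migrate one variable at a time.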
For the exhaustive list of changes, see the CHANGELOG.