
Release Notes 1.2.0

Release date: 2026-06-11

Highlights

  • Speculative decoding is production-grade. Multi-token-prediction (MTP) drafting now runs on single-node, pipeline-sharded, and tensor-parallel placements — including heterogeneous multi-node rings — with measured per-model draft depths, sampled-decoding support, and a dashboard badge showing speculation status per instance. Measured speedups range from 1.16× to 2.2× across the supported model cards.
  • A stability overhaul, proven live. A multi-week hardening arc — every fix validated end-to-end on a real 4-node cluster — closed the failure classes that previously required operator intervention or reboots: placements now survive master failover, runner crash loops trip a breaker instead of leaking wired GPU memory, oversized placements are refused before they can OOM a node, node telemetry no longer collides with MLX inference, and an event-storm defect that could churn elections and silently drop placements is fixed at five layers.
  • Rational context-length control. Each placed instance now derives a usable context ceiling — the smaller of the model's advertised context length and what actually fits in memory beside the weights — and admits requests against it. Over-long requests get a clean OpenAI-style context_length_exceeded error instead of growing the KV cache until the runner dies with a Metal OOM.
  • Multi-node networking hardened. Ring formation now runs under a hard deadline instead of hanging forever on a failed handshake, and interconnect selection follows operator intent: Thunderbolt first, LAN next, VPN (Tailscale) last — VPN exists for reachability, not as a preferred data path.
  • The project is now Skulk all the way down. The exo package, env vars, config paths, and service identifiers are renamed; compatibility shims honor existing EXO_* environment variables and ~/.exo data directories.

Speculative Decoding (MTP)

The speculative-decoding stack was rebuilt across this release cycle:

  • Bonus-driven rounds: the draft/verify loop was restructured so the verified bonus token chains directly into the next draft round, raising acceptance-adjusted throughput on every card measured.
  • Depth-K chained drafting with per-card measured optima — draft depth is no longer a global constant but a tuned property of each model card.
  • Sampled decoding support: speculation now applies under temperature sampling, not just greedy decoding.
  • Multi-node speculation on pipeline and tensor placements. Cross-rank draft/accept decisions are explicitly broadcast (decider protocol), which makes speculation correct on heterogeneous hardware where ranks could otherwise diverge bit-wise and wedge the GPU. Multi-node speculation is on by default with a per-card speculative_multi_node opt-out.
  • Gemma 4 assistant drafting and embedded-MTP support for Qwen-family cards.
  • Performance repair: a scheduling defect that made production MTP run 20–46× slower than plain decode under concurrent load is fixed, and speculation now engages for models that were already resident in memory.
  • The dashboard shows a ⚡ MTP badge with draft depth on each running instance.
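
For readers new to the technique, the draft/verify loop described above can be sketched as follows. This is a textbook greedy speculative-decoding loop under stated assumptions, not Skulk's MTP implementation: `toy_target`, `toy_draft`, and `speculative_decode` are hypothetical names, both "models" are plain functions, and verification calls the target once per token where a real engine would use a single batched forward pass. It shows why the output matches plain greedy decoding and how the bonus token becomes the starting context for the next draft round.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the full model (deterministic)."""
    return (31 * sum(ctx) + len(ctx)) % 50


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical cheap drafter: agrees with the target except
    when the context length is a multiple of 4."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain greedy decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], n_new: int, depth: int) -> List[Token]:
    """Greedy draft/verify loop with a fixed draft depth."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        # Draft round: propose `depth` tokens, chaining each onto the last.
        proposal: List[Token] = []
        ctx = list(out)
        for _ in range(depth):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # Verify round: keep the longest prefix the target agrees with.
        accepted: List[Token] = []
        ctx = list(out)
        for tok in proposal:
            expected = target(ctx)
            if tok != expected:
                accepted.append(expected)  # correction replaces the bad draft
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Every draft was accepted: the target's next token is a free bonus.
            accepted.append(target(ctx))

        # The correction or bonus token is part of `out`, so the next
        # draft round starts from it directly.
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_new]
```

Every token appended to `out` is one the target would have chosen at that position, so the result is identical to `greedy_decode` at any depth; depth changes only how many target calls are saved when the drafter is right.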

Stability and Resilience

Every item below was reproduced and re-verified live on a multi-node cluster:

  • Placements survive master failover. A newly elected master seeds its session from the prior replicated state, so placed models keep serving through a master restart (resume measured at ~23 s) instead of silently becoming 404s. Open requests on the dead master fail fast with a clear error instead of hanging.
  • Event-storm immunity. Abandoned client requests (short timeouts against a loading model) could previously ignite a self-sustaining event storm (~800 events/s) that drowned replicas, churned elections, and dropped placements. Fixed at five layers, from the runner-side mint to a master ingest cap that bounds any future emitter at zero amplification.
  • Memory-safe placement. The placement fit-check and a worker-side pre-spawn guard share one memory model (GPU-wireable availability), so oversized placements are refused cleanly instead of OOM-aborting and leaking wired GPU memory until reboot.
  • Crash containment. The crash circuit breaker is edge-triggered (one trip per crash loop), GPU-wedge runner deaths are never retried (each retry leaked ~5 GB of wired memory), a wedged warmup marks the instance failed loudly instead of silently disabling the node, and ring group-connect runs under a configurable deadline (SKULK_GROUP_CONNECT_DEADLINE_SECONDS, default 120 s) with a network diagnosis instead of hanging forever.
  • Telemetry that cannot take down inference. The macmon-based system poller could collide with MLX and reboot the machine under load; node telemetry now uses mactop. Peer churn no longer crashes healthy bystander nodes, and the disk event log degrades gracefully on a full disk instead of killing the node.
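
The phrase "edge-triggered (one trip per crash loop)" can be illustrated with a minimal sketch. The class name `CrashBreaker`, the threshold of 3 crashes, and the 60-second window are all assumptions for illustration; the release notes do not state Skulk's actual thresholds or API. The point is that `record_crash` reports a trip only on the closed-to-open transition, so a sustained crash loop produces one trip rather than one per crash.

```python
from typing import List


class CrashBreaker:
    """Hypothetical edge-triggered breaker: trips once per crash loop."""

    def __init__(self, threshold: int = 3, window: float = 60.0) -> None:
        self.threshold = threshold      # crashes needed to open (assumed value)
        self.window = window            # seconds of history considered (assumed value)
        self.open = False               # True while the breaker is tripped
        self._crashes: List[float] = []

    def record_crash(self, now: float) -> bool:
        """Record a crash at time `now`; return True only when this
        crash causes the closed -> open transition."""
        # Drop crashes that have aged out of the window.
        self._crashes = [t for t in self._crashes if now - t < self.window]
        self._crashes.append(now)
        if not self.open and len(self._crashes) >= self.threshold:
            self.open = True
            return True   # the edge: reported exactly once
        return False      # below threshold, or already open

    def reset(self) -> None:
        """Close the breaker, e.g. after an operator clears the fault."""
        self._crashes.clear()
        self.open = False
```

A level-triggered design would return `True` on every crash while the breaker is open, which is the behavior the edge-triggered variant avoids.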

Context-Length Control

  • Requests are admitted against a per-instance context ceiling computed from the model card and the hosting nodes' memory. The ceiling is deterministic across the ranks of a multi-node placement.
  • An explicit max_tokens that cannot fit is rejected with an OpenAI-style context_length_exceeded invalid-request error (HTTP 400 when detectable before dispatch); a window-filling prompt is rejected before prefill; an omitted max_tokens is clamped to the remaining window so generation ends with finish_reason: "length".
  • See the API guide for the exact semantics.
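
The admission rules above can be sketched as follows. This is an illustration under assumptions, not the Skulk implementation: the function names, the `kv_bytes_per_token` parameter, and the single-node memory model are hypothetical, and the multi-rank case is not modeled. The ceiling is the smaller of the advertised context length and the number of KV-cache tokens that fit in the memory left beside the weights.

```python
from typing import Optional


class ContextLengthExceeded(ValueError):
    """Hypothetical exception carrying the OpenAI-style error code."""
    code = "context_length_exceeded"


def context_ceiling(advertised_tokens: int,
                    free_bytes_after_weights: int,
                    kv_bytes_per_token: int) -> int:
    """Smaller of the advertised context and what fits in memory."""
    fits = free_bytes_after_weights // kv_bytes_per_token
    return max(0, min(advertised_tokens, fits))


def admit(prompt_tokens: int, max_tokens: Optional[int], ceiling: int) -> int:
    """Return the effective max_tokens for a request, or raise."""
    remaining = ceiling - prompt_tokens
    if remaining <= 0:
        # Window-filling prompt: rejected before prefill.
        raise ContextLengthExceeded(
            f"prompt of {prompt_tokens} tokens fills the {ceiling}-token window")
    if max_tokens is None:
        # Omitted max_tokens: clamp to the remaining window, so generation
        # can end with finish_reason "length".
        return remaining
    if max_tokens > remaining:
        # Explicit max_tokens that cannot fit: rejected outright.
        raise ContextLengthExceeded(
            f"max_tokens={max_tokens} exceeds the {remaining} tokens remaining")
    return max_tokens
```

For example, with 8 GiB free beside the weights and an assumed 128 KiB of KV cache per token, a model advertising 131072 tokens gets a ceiling of 65536.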

Networking

  • Ring/tensor interconnect selection ranks observed links Thunderbolt > Ethernet > Wi-Fi > VPN. Tailscale links are identified by address range (the CGNAT block 100.64.0.0/10 and fd7a:115c:a1e0::/48), so a VPN path is used only when nothing better exists. Thunderbolt interface labels survive hardware-port classification.
  • Multi-node placement failures are reported with actionable errors (the node and the memory arithmetic) instead of silent refusals.
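
The link ranking above can be sketched with the standard-library `ipaddress` module. The function names and the `(address, interface_kind)` tuple shape are hypothetical, and how Skulk actually discovers interface kinds is not shown; only the ordering and the two Tailscale ranges come from these notes (100.64.0.0/10 is the standard CGNAT block).

```python
import ipaddress
from typing import Iterable, Tuple

# Address ranges that mark a link as Tailscale (VPN).
TAILSCALE_V4 = ipaddress.ip_network("100.64.0.0/10")      # CGNAT block
TAILSCALE_V6 = ipaddress.ip_network("fd7a:115c:a1e0::/48")

# Lower rank is preferred: Thunderbolt > Ethernet > Wi-Fi > VPN.
RANK = {"thunderbolt": 0, "ethernet": 1, "wifi": 2, "vpn": 3}

Link = Tuple[str, str]  # (peer address, interface kind); hypothetical shape


def is_tailscale(addr: str) -> bool:
    """True if the address falls in either Tailscale range."""
    ip = ipaddress.ip_address(addr)
    network = TAILSCALE_V4 if ip.version == 4 else TAILSCALE_V6
    return ip in network


def classify(link: Link) -> str:
    """A Tailscale address is a VPN link regardless of interface kind."""
    addr, kind = link
    return "vpn" if is_tailscale(addr) else kind


def best_link(links: Iterable[Link]) -> Link:
    """Pick the most preferred link among those observed."""
    return min(links, key=lambda link: RANK[classify(link)])
```

Classifying by address rather than by interface name means a Tailscale path is demoted to last place even when it rides on an otherwise fast interface.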

Operations

  • The macOS LaunchAgent can self-update on boot, an operator-editable environment file lives at ~/.skulk/skulk.env, and a separate skulk-vector agent ships logs without coupling log shipping to the inference service lifecycle. Installers for launchd and systemd set up both agents.
  • Structured external logging is enabled with SKULK_LOGGING_EXTERNAL=1; see the External logging operator guide.
  • Staged model copies have a lifecycle with per-node disk reporting, and empty or malformed generation requests are rejected at the API with a 400 instead of crashing a runner.

Platform

  • Renamed exo → skulk across the package tree, services, env vars (SKULK_* honored alongside legacy EXO_*), and data directories.
  • Dependency ladder: mlx 0.31.2, mlx-vlm 0.6.1, FastAPI 0.136 / Starlette 1.x.
  • Dashboard fixes: deep links and browser refresh no longer 404; the topology GPU bar renders at the correct scale.
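
One way a compatibility shim for the renamed environment variables could work is sketched below. The helper name `get_setting` is hypothetical, and the precedence (a SKULK_* value wins over its EXO_* counterpart when both are set) is an assumption; these notes say only that both prefixes are honored.

```python
import os
from typing import Mapping, Optional


def get_setting(name: str,
                env: Optional[Mapping[str, str]] = None,
                default: Optional[str] = None) -> Optional[str]:
    """Look up SKULK_<name>, falling back to legacy EXO_<name>.

    Precedence is an assumption: the new prefix is checked first.
    """
    env = os.environ if env is None else env
    for prefix in ("SKULK_", "EXO_"):
        value = env.get(prefix + name)
        if value is not None:
            return value
    return default
```

Passing `env` explicitly keeps the lookup testable without mutating the process environment.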

For the exhaustive list of changes, see the CHANGELOG.