Release Notes 1.0.3

Release date: 2026-05-02

Highlights

  • Per-placement node exclusion. Operators can now exclude specific nodes from a single placement decision without unplugging them from the cluster. Already-running instances on the listed nodes are not affected; exclusion is a per-launch hint to the master's planner, not a cluster-wide flag.
  • Observability consolidation. The dashboard's diagnostics surface is now one resizable panel with three tabs (Live / Node / Traces). The legacy standalone traces page is gone; everything lives inline.
  • Native trace waterfall. Saved-trace viewing no longer hands trace data off to a third-party hosted UI. The inline waterfall renders the same data inside the panel, with sub-pixel-event clustering so dense traces stay readable.
  • API trace janitor. Saved Chrome-trace JSON files are pruned after tracing.retention_days (default 3 days) so the cache directory stops growing without bound.
  • Snapshot bootstrap and bounded replay retention. Followers can hydrate cluster state from a master-published snapshot and replay only the retained tail; the master's live replay log no longer grows without limit.

Per-Placement Node Exclusion

POST /place_instance now accepts an optional excluded_nodes array of node IDs. The master's planner treats those nodes as if absent from the topology when scoring candidate cycles for that single placement; the rest of the cluster keeps participating in placements as usual.
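As a rough illustration of the request shape, the sketch below builds (but does not send) a POST /place_instance request carrying excluded_nodes. Only the endpoint path and the excluded_nodes field come from these notes; the base URL, the model_id field, and the helper name are hypothetical placeholders, so check the API reference for the real payload.

```python
import json
import urllib.request


def build_place_instance_request(base_url, payload, excluded_nodes=None):
    """Build (but do not send) a POST /place_instance request."""
    body = dict(payload)  # copy so the caller's dict is not mutated
    if excluded_nodes:
        # excluded_nodes is optional: omit it when nothing is excluded
        body["excluded_nodes"] = list(excluded_nodes)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/place_instance",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_place_instance_request(
    "http://localhost:8000",        # hypothetical API address
    {"model_id": "example-model"},  # hypothetical field, for illustration only
    excluded_nodes=["node-a", "node-b"],
)
```

Passing req to urllib.request.urlopen would send it. Leaving excluded_nodes out of the body keeps the request identical to a pre-1.0.3 placement.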

GET /instance/previews accepts a matching repeatable query parameter, excluded_node_ids, so dashboards can preview a placement against the post-exclusion topology before the operator commits.
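A repeatable query parameter means one key=value pair per excluded node. A minimal sketch of building such a preview URL follows; the base URL, the model_id parameter, and the helper name are assumptions for illustration, and only the path and excluded_node_ids come from these notes.

```python
from urllib.parse import urlencode


def build_preview_url(base_url, params, excluded_node_ids=()):
    """Return a GET /instance/previews URL with repeated excluded_node_ids."""
    query = list(params.items())
    # Repeatable parameter: emit one pair per excluded node ID
    query += [("excluded_node_ids", node_id) for node_id in excluded_node_ids]
    path = base_url.rstrip("/") + "/instance/previews"
    return path + ("?" + urlencode(query) if query else "")


url = build_preview_url(
    "http://localhost:8000",        # hypothetical API address
    {"model_id": "example-model"},  # hypothetical parameter
    ["node-a", "node-b"],
)
# url ends with "...&excluded_node_ids=node-a&excluded_node_ids=node-b"
```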

In the dashboard, the placement modal grows a click-to-toggle pill row under "Available Nodes". Each pill is a node; click to exclude it from this launch (strikethrough + transparent), click again to include. The cluster preview re-fetches and re-renders on every toggle so the visualization always reflects what would actually be placed.

Observability Surface

The right-side Observability panel hosts:

  • Live tab: cluster health header (master id, connectivity, hang counter, cluster-wide tracing toggle), per-runner synopsis grid with rank/phase/time-stuck, and a cross-rank flight-recorder feed driven by /v1/diagnostics/cluster/timeline.
  • Node tab: per-node diagnostics with a node selector at the top. Defaults to the master on first visit; ignores stale persisted node IDs that no longer exist in the topology.
  • Traces tab: saved-trace browser with inline filtering (kind, model, source node, free-text, tools-only) and per-row expansion that drops the trace's metadata, download affordance, and waterfall directly under the row.

The Perfetto popup integration is removed; saved traces stay in-cluster and render inline via a Skulk-native waterfall.

Trace Retention

Without pruning, saved trace files accumulate under SKULK_CACHE_HOME/traces/ indefinitely. An hourly janitor in the API process deletes files older than tracing.retention_days (default 3 days; 0 disables pruning). The first sweep runs 60 seconds after API startup.

Configure in skulk.yaml:

tracing:
  retention_days: 3 # default; set to 0 to disable pruning
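
The pruning rule can be sketched as a single sweep function. This is not the shipped janitor: it assumes file age is measured by modification time and that only regular files directly inside the traces directory are considered. The function name is hypothetical, and the hourly scheduling and 60-second startup delay are left out.

```python
import time
from pathlib import Path
from typing import Optional

SECONDS_PER_DAY = 86_400


def prune_traces(trace_dir: Path, retention_days: int,
                 now: Optional[float] = None) -> list:
    """Delete files older than retention_days; return the deleted paths."""
    if retention_days <= 0:
        return []  # 0 disables pruning
    if not trace_dir.is_dir():
        return []  # nothing saved yet
    now = time.time() if now is None else now
    cutoff = now - retention_days * SECONDS_PER_DAY
    deleted = []
    for path in trace_dir.iterdir():
        # Assumption: "older than" is judged by the file's mtime
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink(missing_ok=True)
            deleted.append(path)
    return deleted
```

Calling prune_traces(Path(cache_home) / "traces", 3) once per hour would approximate the described behavior, where cache_home stands for the value of SKULK_CACHE_HOME.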

Theme Tokens

New palette tokens for status callouts:

  • Solid-fill iconography (palette-independent): errorFill (#dc2626) + errorOnFill (#ffffff); warningFill (#ffcc33) + warningOnFill (#000000).
  • On-surface text (palette-aware): errorOnSurface and warningOnSurface for body-text callouts that should stay semantically color-coded in both modes without saturated fill.

Existing errorBg / warningBg tints stay around for subtler tinted-surface usage.

Internal: Dashboard State

The dashboard migrated from Zustand to Redux Toolkit + RTK Query. State shapes and persistence are unchanged (localStorage for theme, observability panel width, and durable chat; sessionStorage for everything else). RTK Query natively supplies request deduplication, polling, cache invalidation, and tag-based refetch on mutation. The visible effect of the migration: panels no longer flicker between loading and data states, because RTK Query keeps the previous payload visible until the next fetch lands.

Upgrade Guidance

  • The commands gossipsub topic ships with strict (extra="forbid") deserialization, so an older 1.0.2 master receiving a PlaceInstance with the new excluded_nodes field will fail to decode it. The 1.0.3 receive loop now catches that decode error, logs a warning, and drops the message instead of tearing down, but the launch silently fails on the older master. Upgrade every node before relying on per-placement exclusion; the planner runs on the master, so the master in particular must be on 1.0.3 for exclusions to take effect.
  • The trace janitor's first sweep runs 60s after startup. If your cluster has saved traces older than retention_days you didn't intend to lose, set tracing.retention_days: 0 (or a higher number) before restarting.
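
To make the first upgrade note concrete, here is a small stand-in for strict decoding plus a tolerant receive loop. It uses a plain dataclass to emulate extra="forbid" behavior rather than the project's real message schema. The class, its model_id field, and the function names are all hypothetical.

```python
import json
import logging
from dataclasses import dataclass, fields

logger = logging.getLogger("commands")


@dataclass(frozen=True)
class PlaceInstanceOld:
    """Hypothetical pre-1.0.3 command shape: no excluded_nodes field."""
    model_id: str


def strict_decode(cls, raw: bytes):
    """Decode JSON into cls, rejecting unknown keys (like extra="forbid")."""
    data = json.loads(raw)
    unknown = set(data) - {f.name for f in fields(cls)}
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return cls(**data)


def receive(messages, cls):
    """Decode each message; log and drop failures instead of raising."""
    decoded = []
    for raw in messages:
        try:
            decoded.append(strict_decode(cls, raw))
        except (ValueError, TypeError) as exc:
            logger.warning("dropping undecodable command: %s", exc)
    return decoded


old_style = json.dumps({"model_id": "m"}).encode()
new_style = json.dumps({"model_id": "m", "excluded_nodes": ["node-a"]}).encode()
result = receive([old_style, new_style], PlaceInstanceOld)
# Only the message without excluded_nodes is decoded; the other is dropped.
```

The dropped message is the reason a mixed-version cluster can appear healthy while a placement with exclusions never launches.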

Validation Snapshot

  • uv run basedpyright passed with 0 errors, 0 warnings, 0 notes
  • uv run ruff check passed
  • uv run pytest passed across src/skulk/shared, src/skulk/master, src/skulk/api
  • Dashboard npm run build and Docusaurus npm run build both clean