Release Notes 1.0.3
Release date: 2026-05-02
Highlights
- Per-placement node exclusion. Operators can now exclude specific nodes from a single placement decision without unplugging them from the cluster. Already-running instances on the listed nodes are not affected — exclusion is a per-launch hint to the master's planner, not a cluster-wide flag.
- Observability consolidation. The dashboard's diagnostics surface is now one resizable panel with three tabs (Live / Node / Traces). The legacy standalone traces page is gone; everything lives inline.
- Native trace waterfall. Saved-trace viewing no longer hands trace data off to a third-party hosted UI. An inline waterfall renders the same data inside the panel, with sub-pixel-event clustering so dense traces stay readable.
- API trace janitor. Saved Chrome-trace JSON files are pruned after tracing.retention_days (default 3 days) so the cache directory stops growing without bound.
- Snapshot bootstrap and bounded replay retention. Followers can hydrate cluster state from a master-published snapshot and replay only the retained tail; the master's live replay log no longer grows without limit.
Per-Placement Node Exclusion
POST /place_instance now accepts an optional excluded_nodes array of node IDs. The master's planner treats those nodes as if absent from the topology when scoring candidate cycles for that single placement; the rest of the cluster keeps participating in placements as usual.
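A minimal sketch of a client building such a request. Only the endpoint path and the excluded_nodes field come from these notes; the base URL, the model_id field, and the node IDs are hypothetical placeholders, and the rest of the request body is whatever your deployment already sends.

```python
import json
import urllib.request


def build_place_instance_request(base_url, body, excluded_nodes=()):
    """Build (but do not send) a POST /place_instance request.

    `body` holds the placement fields the caller already sends; their
    names are not specified in these notes.
    """
    payload = dict(body)
    if excluded_nodes:
        # The field is optional: leave it out when nothing is excluded.
        payload["excluded_nodes"] = list(excluded_nodes)
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/place_instance",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values, for illustration only.
req = build_place_instance_request(
    "http://localhost:8000",
    {"model_id": "example-model"},
    excluded_nodes=["node-a", "node-b"],
)
```

Passing req to urllib.request.urlopen would send it to a running cluster.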
GET /instance/previews accepts a matching excluded_node_ids repeatable query parameter so dashboards can preview a placement against the post-exclusion topology before the operator commits.
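"Repeatable" here means the parameter appears once per excluded node. A small sketch of building that URL follows; the base URL, the model_id parameter, and the node IDs are hypothetical, and only the path and the excluded_node_ids name come from these notes.

```python
from urllib.parse import urlencode


def preview_url(base_url, params, excluded_node_ids=()):
    """Return a GET /instance/previews URL with one excluded_node_ids
    pair per excluded node."""
    query = list(params.items())
    query += [("excluded_node_ids", node_id) for node_id in excluded_node_ids]
    url = f"{base_url.rstrip('/')}/instance/previews"
    return f"{url}?{urlencode(query)}" if query else url


# Hypothetical values, for illustration only.
url = preview_url(
    "http://localhost:8000",
    {"model_id": "example-model"},
    ["node-a", "node-b"],
)
```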
In the dashboard, the placement modal grows a click-to-toggle pill row under "Available Nodes". Each pill is a node; click to exclude it from this launch (strikethrough + transparent), click again to include. The cluster preview re-fetches and re-renders on every toggle so the visualization always reflects what would actually be placed.
Observability Surface
The right-side Observability panel hosts:
- Live tab: cluster health header (master id, connectivity, hang counter, cluster-wide tracing toggle), per-runner synopsis grid with rank/phase/time-stuck, and a cross-rank flight-recorder feed driven by /v1/diagnostics/cluster/timeline.
- Node tab: per-node diagnostics with a node selector at the top. Defaults to the master on first visit; ignores stale persisted node IDs that no longer exist in the topology.
- Traces tab: saved-trace browser with inline filtering (kind, model, source node, free-text, tools-only) and per-row expansion that drops the trace's metadata, download affordance, and waterfall directly under the row.
The Perfetto popup integration is removed — saved traces stay in-cluster and render inline via a Skulk-native waterfall.
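These notes do not describe how the waterfall clusters sub-pixel events. As a rough illustration of the general idea only, and not Skulk's actual algorithm, the sketch below folds consecutive events narrower than one pixel into a single cluster when they start in the same pixel column. The Span and Cluster types and the us_per_pixel parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start_us: float
    duration_us: float


@dataclass
class Cluster:
    start_us: float
    end_us: float
    count: int


def cluster_subpixel(spans, us_per_pixel):
    """Merge runs of sub-pixel spans that start in the same pixel column."""
    clusters = []
    open_column = None  # column of the trailing sub-pixel cluster, if any
    for span in sorted(spans, key=lambda s: s.start_us):
        end_us = span.start_us + span.duration_us
        column = int(span.start_us // us_per_pixel)
        if span.duration_us >= us_per_pixel:
            # Wide enough to draw on its own; it also ends any open run.
            clusters.append(Cluster(span.start_us, end_us, 1))
            open_column = None
        elif open_column == column:
            # Same column as the trailing cluster: fold the span in.
            last = clusters[-1]
            last.end_us = max(last.end_us, end_us)
            last.count += 1
        else:
            # First sub-pixel span seen in this column: start a new run.
            clusters.append(Cluster(span.start_us, end_us, 1))
            open_column = column
    return clusters


spans = [Span(0, 2), Span(3, 1), Span(5, 2), Span(10, 50), Span(61, 1)]
clusters = cluster_subpixel(spans, us_per_pixel=10)
```

A renderer could then draw each cluster as one bar and use count to signal how many events it hides.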
Trace Retention
Without pruning, saved trace files accumulate under SKULK_CACHE_HOME/traces/ indefinitely. An hourly janitor in the API process deletes files older than tracing.retention_days (default 3 days; 0 disables pruning). The first sweep runs 60 seconds after API startup.
Configure in skulk.yaml:
tracing:
  retention_days: 3  # default; set to 0 to disable pruning
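The pruning rule can be sketched as a single sweep. This is an illustration under stated assumptions, not the janitor's actual code: it assumes file age is judged by modification time and that traces are *.json files directly under the traces directory. The function name is hypothetical.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def prune_traces(traces_dir, retention_days, now=None):
    """Delete trace files older than retention_days; return deleted paths."""
    if retention_days <= 0:
        return []  # 0 disables pruning
    now = time.time() if now is None else now
    cutoff = now - retention_days * SECONDS_PER_DAY
    deleted = []
    for path in Path(traces_dir).glob("*.json"):
        # Assumption: age is measured by last-modified time.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path)
    return deleted
```

In the schedule described above, a sweep like this would first run 60 seconds after API startup and then once an hour.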
Theme Tokens
New palette tokens for status callouts:
- Solid-fill iconography (palette-independent): errorFill (#dc2626) + errorOnFill (#ffffff); warningFill (#ffcc33) + warningOnFill (#000000).
- On-surface text (palette-aware): errorOnSurface and warningOnSurface for body-text callouts that should stay semantically color-coded in both modes without saturated fill.
Existing errorBg / warningBg tints stay around for subtler tinted-surface usage.
Internal — Dashboard State
The dashboard migrated off Zustand to Redux Toolkit + RTK Query. State shapes and persistence are unchanged (localStorage for theme + observability panel width + durable chat; sessionStorage for everything else). RTK Query supplies request dedup, polling, cache invalidation, and tag-based refetch on mutation natively. The visible result of the migration: panels stop flickering between loading and data states because RTK Query keeps the previous payload visible until the next fetch lands.
Upgrade Guidance
- The commands gossipsub topic ships with strict (extra="forbid") deserialization, so an older 1.0.2 master receiving a PlaceInstance with the new excluded_nodes field will fail to decode it. The 1.0.3 receive loop now catches that decode error, logs a warning, and drops the message instead of tearing down, but the launch silently fails on the older master. Upgrade every node before relying on per-placement exclusion; the planner runs on the master, so the master in particular must be on 1.0.3 for exclusions to take effect.
- The trace janitor's first sweep runs 60s after startup. If your cluster has saved traces older than retention_days that you didn't intend to lose, set tracing.retention_days: 0 (or a higher number) before restarting.
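The catch, warn, and drop behaviour in the first item can be illustrated with a standard-library stand-in. This is not Skulk's receive loop: strict decoding is mimicked with a hand-rolled field check in place of an extra="forbid" model, and KNOWN_FIELDS, DecodeError, decode_strict, and receive are all hypothetical names.

```python
import json
import logging

logger = logging.getLogger("commands")

# Hypothetical field set understood by an older node; the real
# PlaceInstance schema is not shown in these notes.
KNOWN_FIELDS = {"model_id"}


class DecodeError(ValueError):
    """Raised when a message carries fields the decoder does not know."""


def decode_strict(raw):
    message = json.loads(raw)
    if not isinstance(message, dict):
        raise DecodeError("expected a JSON object")
    extra = set(message) - KNOWN_FIELDS
    if extra:
        raise DecodeError(f"unexpected fields: {sorted(extra)}")
    return message


def receive(raw_messages, handle):
    """Handle decodable messages; warn about and drop the rest."""
    dropped = 0
    for raw in raw_messages:
        try:
            command = decode_strict(raw)
        except (DecodeError, json.JSONDecodeError) as exc:
            # Drop the message and keep the loop alive.
            logger.warning("dropping undecodable command: %s", exc)
            dropped += 1
            continue
        handle(command)
    return dropped
```

The sender gets no signal from the drop, which is why a launch that carries excluded_nodes fails silently on a node that does not know the field.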
Validation Snapshot
- uv run basedpyright passed with 0 errors, 0 warnings, 0 notes
- uv run ruff check passed
- uv run pytest passed across src/skulk/shared, src/skulk/master, src/skulk/api
- Dashboard npm run build and Docusaurus npm run build both clean