Skulk API

Skulk serves an API at http://localhost:52415.

That API has two jobs:

  • compatibility endpoints for tools that already speak OpenAI, Claude, or Ollama-style APIs
  • Skulk-specific control endpoints for placement, downloads, config, tracing, and model-store workflows

A model must be placed and running before chat requests for it succeed; calling /v1/chat/completions for an unplaced model returns a 404 No instance found. The First Success Flow below walks from placement to first token.

First Success Flow

1. Start Skulk

uv run skulk

2. Preview placements

curl "http://localhost:52415/instance/previews?model_id=mlx-community/Llama-3.2-1B-Instruct-4bit"

This shows what Skulk can actually place on the current node or cluster.

3. Launch a placement

curl -X POST http://localhost:52415/place_instance \
-H 'Content-Type: application/json' \
-d '{
"model_id": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"sharding": "Pipeline",
"instance_meta": "MlxRing",
"min_nodes": 1
}'

4. Send a chat request

curl -X POST http://localhost:52415/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [{"role": "user", "content": "Hello from Skulk"}]
}'

If this fails with 404 No instance found for model ..., the placement is not ready yet or never launched successfully.
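Because a placement can take a moment to become ready, a client may want to retry the first chat request while the server still answers 404. The following is a minimal sketch of that idea, not part of Skulk: `post_chat` and `wait_for_first_reply` are hypothetical helper names, the retry counts are arbitrary, and error bodies are assumed to be JSON.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:52415"  # default address used on this page


def post_chat(payload: dict) -> tuple[int, dict]:
    """POST to /v1/chat/completions and return (status, parsed JSON body)."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # Assumption: error bodies are JSON; fall back to an empty dict otherwise.
        try:
            return err.code, json.load(err)
        except ValueError:
            return err.code, {}


def wait_for_first_reply(payload, send=post_chat, attempts=30, delay=2.0):
    """Retry while the server answers 404 (placement not ready yet)."""
    for _ in range(attempts):
        status, body = send(payload)
        if status != 404:
            return status, body
        time.sleep(delay)
    raise TimeoutError("still 404: the placement may never have launched")
```

If the helper times out, re-check `GET /instance/previews` and the placement request rather than retrying indefinitely.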

Endpoint Overview

Compatibility APIs

  • POST /v1/chat/completions
  • POST /v1/responses
  • POST /v1/messages
  • POST /ollama/api/chat
  • POST /ollama/api/generate
  • GET /ollama/api/tags
  • POST /ollama/api/show
  • GET /ollama/api/ps
  • GET /ollama/api/version

Skulk Control APIs

  • GET /v1/models
  • POST /v1/tools/web_search
  • POST /v1/tools/open_url
  • POST /v1/tools/extract_page
  • GET /models/search
  • POST /models/add
  • DELETE /models/custom/{model_id}
  • POST /place_instance
  • POST /instance
  • GET /instance/placement
  • GET /instance/previews
  • GET /instance/{instance_id}
  • DELETE /instance/{instance_id}
  • GET /state
  • GET /events
  • POST /download/start
  • DELETE /download/{node_id}/{model_id}
  • GET /config
  • PUT /config
  • GET /store/health
  • GET /store/registry
  • GET /store/downloads
  • POST /store/models/{model_id}/download
  • GET /store/models/{model_id}/download/status
  • DELETE /store/models/{model_id}
  • POST /store/purge-staging
  • GET /store/storage
  • POST /store/models/{model_id}/optimize
  • GET /store/models/{model_id}/optimize/status
  • GET /filesystem/browse
  • GET /node/identity
  • GET /v1/tracing
  • PUT /v1/tracing
  • GET /v1/traces
  • GET /v1/traces/cluster
  • POST /v1/traces/delete
  • GET /v1/traces/{task_id}
  • GET /v1/traces/{task_id}/stats
  • GET /v1/traces/{task_id}/raw
  • GET /v1/traces/cluster/{task_id}
  • GET /v1/traces/cluster/{task_id}/stats
  • GET /v1/traces/cluster/{task_id}/raw
  • GET /v1/diagnostics/node
  • POST /v1/diagnostics/node/capture
  • POST /v1/diagnostics/node/runners/{runner_id}/cancel
  • GET /v1/diagnostics/cluster
  • GET /v1/diagnostics/cluster/timeline
  • GET /v1/diagnostics/cluster/{node_id}
  • POST /v1/diagnostics/cluster/{node_id}/capture
  • POST /v1/diagnostics/cluster/{node_id}/runners/{runner_id}/cancel

For the full interactive reference with request/response schemas, see the API Reference.

OpenAI Chat Completions

POST /v1/chat/completions

This is the main chat-generation endpoint for both text-only and multimodal models.

Requests are validated before dispatch: an empty messages array or a non-positive max_tokens returns 400 Bad Request rather than being accepted and failing during generation. (This applies across the Claude, Ollama, and Responses wire formats too, which share the same dispatch path.)

Context-length limits

Every placed instance has a usable context limit: the smaller of the model's advertised context length and the number of KV-cache tokens that fit in memory next to the model weights on the hosting node(s). Requests are admitted against that limit instead of growing the KV cache until the node runs out of memory:

  • A max_tokens value that cannot fit in the limit at all returns 400 Bad Request immediately (context_length_exceeded: ...).
  • After tokenization on the serving instance, a prompt that fills the window, or a prompt plus an explicit max_tokens that exceeds the limit, is rejected with an OpenAI-style invalid_request_error whose message starts with context_length_exceeded:. For streaming requests this arrives as the first SSE data: event; for non-streaming requests the response body is the error envelope (the HTTP status is already committed when the rejection is computed on the serving node).
  • When max_tokens is omitted, the server default output budget is clamped to the remaining window, so generation ends with finish_reason: "length" instead of overrunning the context.
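The admission rules above can be mirrored on the client to predict how a request will be treated. This is an illustrative sketch only, not Skulk's implementation: the function names are hypothetical, the exact boundary conditions (`>` versus `>=`) are assumptions, and the 4096 default is the backend default documented under Common Request Fields.

```python
DEFAULT_MAX_OUTPUT_TOKENS = 4096  # backend default documented on this page


class ContextLengthExceeded(ValueError):
    """Stands in for a context_length_exceeded rejection."""


def usable_context_limit(advertised_tokens: int, kv_tokens_that_fit: int) -> int:
    """The usable limit is the smaller of the two bounds."""
    return min(advertised_tokens, kv_tokens_that_fit)


def output_budget(prompt_tokens, context_limit, max_tokens=None,
                  default_budget=DEFAULT_MAX_OUTPUT_TOKENS):
    """Return how many tokens may be generated, or raise if the request cannot fit."""
    if max_tokens is not None and max_tokens > context_limit:
        raise ContextLengthExceeded("max_tokens can never fit in the limit")
    remaining = context_limit - prompt_tokens
    if remaining <= 0:
        raise ContextLengthExceeded("prompt fills the context window")
    if max_tokens is None:
        # Omitted max_tokens: clamp the default budget to the remaining window.
        return min(default_budget, remaining)
    if max_tokens > remaining:
        raise ContextLengthExceeded("prompt plus max_tokens exceeds the limit")
    return max_tokens
```

For example, with a limit of 8192 tokens and a 7000-token prompt, an omitted `max_tokens` yields a budget of 1192 tokens, so generation would end with `finish_reason: "length"` at that point.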

OpenAI Python SDK Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:52415/v1",
    api_key="unused",
)

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Curl Example

curl -X POST http://localhost:52415/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [{"role": "user", "content": "Hello!"}]
}'

Streaming Example

stream = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta; skip those.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Common Request Fields

| Field | Type | Notes |
| --- | --- | --- |
| model | string | Required. Must match a placed and running model. |
| messages | array | Required. Supports system, user, assistant, developer, tool, function. |
| stream | boolean | Use true for SSE streaming. |
| temperature | number | Sampling temperature. |
| top_p | number | Nucleus sampling. |
| top_k | integer | Top-k sampling. |
| min_p | number | Minimum-probability threshold. |
| max_tokens | integer | Max generated tokens. When omitted, Skulk uses a backend default of 4096 generated tokens (DEFAULT_MAX_OUTPUT_TOKENS); operators can override that default with SKULK_MAX_OUTPUT_TOKENS (or the legacy SKULK_MAX_TOKENS). |
| stop | string or array | Stop sequences. |
| seed | integer | Reproducibility helper. |
| frequency_penalty | number | Frequency penalty. |
| presence_penalty | number | Presence penalty. |
| repetition_penalty | number | Repetition penalty. |
| repetition_context_size | integer | Context window for repetition handling. |
| logprobs | boolean | Return token logprobs when supported. |
| top_logprobs | integer | Number of top logprobs to include. |
| tools | array | OpenAI-style tool definitions. |
| tool_choice | string or object | auto, none, or a specific tool selection. |
| parallel_tool_calls | boolean | Accepted for compatibility. |
| enable_thinking | boolean | Skulk extension for reasoning-capable models. |
| reasoning_effort | string | Reasoning hint when supported. |
| response_format | object | Accepted for compatibility, not strictly enforced. |
| stream_options | object | Includes include_usage. |
| user | string | Optional caller identifier. |

Message Format

{
"role": "user",
"content": "hello"
}

Assistant messages may include tool_calls. Tool response messages should include tool_call_id.

User messages may also be sent as structured content parts. Skulk accepts OpenAI-style image inputs for vision-capable models:

{
"role": "user",
"content": [
{ "type": "text", "text": "What is in this image?" },
{
"type": "image_url",
"image_url": { "url": "data:image/png;base64,..." }
}
]
}

Notes:

  • inline data: URLs are supported for image inputs
  • Anthropic-compatible requests can also carry image content for multimodal models
  • image understanding depends on the selected model exposing the vision capability
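As a small illustration of the structured-content format above, the sketch below builds a user message with an inline `data:` URL from raw image bytes. The helper name `image_message` is hypothetical; only the message shape comes from this page.

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a user message with one text part and one inline image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{encoded}"},
            },
        ],
    }
```

The returned dict can be placed directly in the `messages` array of a chat request to a vision-capable model.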

Finish Reasons

| Value | Meaning |
| --- | --- |
| stop | Natural stop or stop sequence reached |
| length | max_tokens limit reached |
| tool_calls | Model is requesting a tool call |
| content_filter | Reserved for compatibility |
| function_call | Reserved for compatibility |
| error | Generation failed |

Tool Use

Skulk supports OpenAI-style function calling.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-4bit",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

Typical flow:

  1. Send messages and tool definitions.
  2. Inspect finish_reason.
  3. If it is tool_calls, execute the tool in your app.
  4. Send the tool result back as a tool message.
  5. Request the final model response.
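Steps 3 and 4 of the flow above might look like the following sketch. It assumes the assistant message follows the usual OpenAI tool-call shape (each call has an `id` and a `function` with a `name` and JSON-encoded `arguments`); `run_tool_calls` and the `registry` mapping are hypothetical names for illustration.

```python
import json


def run_tool_calls(assistant_message: dict, registry: dict) -> list[dict]:
    """Execute each requested tool locally and return the tool messages to send back."""
    tool_messages = []
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        # Assumption: arguments arrive as a JSON string, as in the OpenAI format.
        arguments = json.loads(function["arguments"] or "{}")
        result = registry[function["name"]](**arguments)
        tool_messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return tool_messages
```

To request the final response, append the assistant message and the returned tool messages to the conversation and call the chat endpoint again.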

Thinking / Reasoning

Skulk supports reasoning-aware chat for compatible models.

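response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-4bit",
    messages=[{"role": "user", "content": "What is 127 * 43?"}],
    # enable_thinking is a Skulk extension, not a named SDK parameter,
    # so it is passed through the SDK's extra_body argument.
    extra_body={"enable_thinking": True},
)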

msg = response.choices[0].message
print(msg.reasoning_content)
print(msg.content)

Notes:

  • enable_thinking is a Skulk extension.
  • Reasoning support depends on model capabilities.
  • Use GET /v1/models response data[].resolved_capabilities to decide whether a model supports thinking and whether clients should render a thinking toggle.
  • Treat resolved_capabilities as the default tool-free request path; request-specific options such as tools can change prompt rendering and related resolved values for mixed-mode model families.
  • Thinking-control semantics are model-aware:
    • if supports_thinking_toggle is true, send enable_thinking=true or false explicitly
    • reasoning_effort="none" disables thinking for toggleable models
    • if a model does not support toggleable thinking, Skulk ignores explicit toggle overrides but still preserves explicit non-disabled reasoning-effort hints when the model family supports them

Builtin Browser Tools

POST /v1/tools/web_search

Execute Skulk's generic web_search tool and return structured search results.

curl -X POST http://localhost:52415/v1/tools/web_search \
-H 'Content-Type: application/json' \
-d '{
"query": "foxlight skulk distributed inference",
"top_k": 5
}'

Request fields:

| Field | Type | Notes |
| --- | --- | --- |
| query | string | Required search query. |
| top_k | integer | Optional max results, 1 to 10, default 5. |

Response fields:

| Field | Type | Notes |
| --- | --- | --- |
| query | string | Original search query. |
| provider | string | Search backend identifier. |
| results | array | Ordered search results with title, url, and snippet. |

This endpoint is designed for client-executed tool loops: GPT-OSS can request web_search, and the client then calls this endpoint and sends the JSON result back as a tool message.

POST /v1/tools/open_url

Fetch one HTTP or HTTPS URL, follow redirects, and return structured metadata.

curl -X POST http://localhost:52415/v1/tools/open_url \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com/article"
}'

Request fields:

| Field | Type | Notes |
| --- | --- | --- |
| url | string | Required absolute http:// or https:// URL. |

Response fields:

| Field | Type | Notes |
| --- | --- | --- |
| url | string | Original requested URL. |
| final_url | string | Final URL after redirects. |
| title | string or null | Best-effort page title. |
| status_code | integer | Final HTTP status code. |
| content_type | string or null | Normalized response content type. |
| provider | string | Backend provider identifier. |

POST /v1/tools/extract_page

Fetch one HTTP or HTTPS URL and return bounded readable text extracted from the response body.

curl -X POST http://localhost:52415/v1/tools/extract_page \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com/article",
"max_chars": 12000
}'

Request fields:

| Field | Type | Notes |
| --- | --- | --- |
| url | string | Required absolute http:// or https:// URL. |
| max_chars | integer | Optional maximum characters, 500 to 50000, default 12000. |

Response fields:

| Field | Type | Notes |
| --- | --- | --- |
| url | string | Original requested URL. |
| final_url | string | Final URL after redirects. |
| title | string or null | Best-effort page title. |
| text | string | Readable extracted text. |
| truncated | boolean | Whether the text was clipped to max_chars. |
| provider | string | Backend provider identifier. |

These browser-tool endpoints are designed for client-executed tool loops. In dashboard chat, GPT-OSS can request web_search, open_url, or extract_page; the dashboard executes the endpoint call, then sends the JSON result back as a tool message.
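A client running its own tool loop needs to route each requested tool name to the matching endpoint. The sketch below shows one possible mapping that also checks the documented ranges and defaults before sending; `tool_request` is a hypothetical helper, and the client-side validation is an assumption layered on top of whatever the server enforces.

```python
TOOL_ENDPOINTS = {
    "web_search": "/v1/tools/web_search",
    "open_url": "/v1/tools/open_url",
    "extract_page": "/v1/tools/extract_page",
}


def tool_request(name: str, arguments: dict) -> tuple[str, dict]:
    """Map a model tool call to (endpoint path, JSON body), checking documented ranges."""
    if name not in TOOL_ENDPOINTS:
        raise ValueError(f"unknown builtin tool: {name}")
    body = dict(arguments)
    if name == "web_search":
        if not body.get("query"):
            raise ValueError("query is required")
        if not 1 <= body.setdefault("top_k", 5) <= 10:
            raise ValueError("top_k must be between 1 and 10")
    else:
        if not str(body.get("url", "")).startswith(("http://", "https://")):
            raise ValueError("url must be an absolute http:// or https:// URL")
        if name == "extract_page":
            if not 500 <= body.setdefault("max_chars", 12000) <= 50000:
                raise ValueError("max_chars must be between 500 and 50000")
    return TOOL_ENDPOINTS[name], body
```

The returned path is appended to the API base URL, and the body is sent as JSON with a POST request.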

Structured Output

response_format is accepted for compatibility, but Skulk does not currently enforce strict JSON mode or JSON schema validation.

response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-4bit",
    messages=[{"role": "user", "content": "Return valid JSON with three colors"}],
    response_format={"type": "json_object"},
)

For the best results, explicitly instruct the model to return valid JSON.
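Since the server does not enforce JSON output, a client may want to validate the reply itself. The following sketch parses the reply and tolerates a surrounding Markdown code fence, which some models add; `parse_json_reply` is a hypothetical helper, and the fence handling is an assumption about model behavior rather than anything Skulk guarantees.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_reply(text: str):
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with any language tag) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)
```

A `json.JSONDecodeError` from this helper is a reasonable signal to retry the request or to tighten the prompt.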

OpenAI Responses API

POST /v1/responses

Use this when your client expects the OpenAI Responses format instead of Chat Completions.

curl -X POST http://localhost:52415/v1/responses \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"input": "Hello from the Responses API"
}'

Claude Messages API

POST /v1/messages

Use this when your client expects Anthropic-style request and response shapes.

curl -X POST http://localhost:52415/v1/messages \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 512
}'

Ollama API

Skulk supports several Ollama-compatible endpoints so tools like OpenWebUI can connect with minimal glue code.

Chat

curl -X POST http://localhost:52415/ollama/api/chat \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [{"role": "user", "content": "Hello"}]
}'

Generate

curl -X POST http://localhost:52415/ollama/api/generate \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"prompt": "Write a haiku about foxes"
}'

List models

curl http://localhost:52415/ollama/api/tags

Show model details

curl -X POST http://localhost:52415/ollama/api/show \
-H 'Content-Type: application/json' \
-d '{"name": "mlx-community/Llama-3.2-1B-Instruct-4bit"}'

Model Discovery

List models

GET /v1/models

curl http://localhost:52415/v1/models

This returns known model cards, not just running instances.

Search Hugging Face

GET /models/search?query=...&limit=...

curl "http://localhost:52415/models/search?query=qwen3&limit=5"

Behavior note:

  • Skulk searches mlx-community first.
  • If that returns nothing, it falls back to a broader Hugging Face search.

Per-node storage breakdown

GET /store/storage

Returns the local node's storage picture: every staged model with its size, last-use time, and whether a live instance (or one of its companion repos: MTP sidecar, assistant, vision weights) currently depends on it, plus event-log usage and free disk on the models volume. Cluster-wide views query each node's API.

curl http://localhost:52415/store/storage

Staged copies are managed automatically when the model store is on: when an instance shuts down (and at node startup, which reconciles copies orphaned by a crash), not-in-use staged models are kept newest-first up to the staging_keep_recent_gb grace budget (default 40 GiB) and evicted beyond it. Set cleanup_on_deactivate: false in the staging config to keep every staged copy and manage cleanup manually via POST /store/purge-staging.
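To make the grace-budget behavior concrete, here is a rough sketch of a newest-first keep/evict decision. It is an interpretation, not Skulk's code: the dict keys are hypothetical, in-use models are assumed not to count against the budget, and everything older than the first model that overflows the budget is assumed to be evicted.

```python
GIB = 1024 ** 3


def plan_eviction(staged, keep_recent_gb=40):
    """Split staged models into (kept, evicted) lists of model IDs.

    `staged` is a list of dicts with hypothetical keys:
    model_id, size_bytes, last_used (larger means more recent), in_use.
    """
    budget = keep_recent_gb * GIB
    kept, evicted, used, over_budget = [], [], 0, False
    for model in sorted(staged, key=lambda m: m["last_used"], reverse=True):
        if model["in_use"]:
            kept.append(model["model_id"])  # a live instance depends on it
            continue
        if not over_budget and used + model["size_bytes"] <= budget:
            used += model["size_bytes"]
            kept.append(model["model_id"])
        else:
            over_budget = True
            evicted.append(model["model_id"])
    return kept, evicted
```

`GET /store/storage` reports the authoritative per-model state, so treat this only as a mental model for why a given copy was kept or removed.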

Placement and Instance Management

These endpoints are the heart of the Skulk control plane.

Quick launch

POST /place_instance

curl -X POST http://localhost:52415/place_instance \
-H 'Content-Type: application/json' \
-d '{
"model_id": "mlx-community/Qwen3.5-9B-4bit",
"sharding": "Pipeline",
"instance_meta": "MlxRing",
"min_nodes": 1,
"excluded_nodes": []
}'

| Field | Meaning |
| --- | --- |
| model_id | Hugging Face-style model ID |
| sharding | Pipeline or Tensor |
| instance_meta | MlxRing or MlxJaccl |
| min_nodes | Minimum nodes required for the placement |
| excluded_nodes | Optional. Node IDs the master should treat as if absent when scoring this placement. Already-running instances on those nodes are unaffected (exclusion is per-placement, not cluster-wide). Default: []. Note: node IDs are per-session, so they change when a cluster session restarts. |

The placement is validated against the current cluster state before the command is forwarded, so an impossible placement fails at the API instead of silently failing on the master:

  • 400 with the specific reason: no connected cycle of min_nodes nodes, exclusions removed every candidate, the model does not support Tensor sharding, or a node cannot fit its weight shard plus runtime headroom (the error names the node and the GB arithmetic).
  • 503 when cluster info is still being gossiped (a cluster that just formed): connection edges lag node identities by a few gossip rounds, and per-node memory info lags the edges. The request internally waits up to 15 seconds for the info to arrive before giving up, so retry shortly on 503.
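A caller can treat the two status codes differently: 400 is a definitive rejection, while 503 is worth retrying. The sketch below illustrates that split with an injected `post` callable so the HTTP client is left to you; the function name, attempt count, and delay are hypothetical choices.

```python
import time


def place_with_retry(post, payload, attempts=5, delay=3.0):
    """Call post(payload) -> (status, body); retry on 503, fail fast on 400."""
    for attempt in range(attempts):
        status, body = post(payload)
        if status == 503 and attempt < attempts - 1:
            time.sleep(delay)  # cluster info is still being gossiped
            continue
        if status == 400:
            # The body carries the specific reason the placement is impossible.
            raise ValueError(f"placement rejected: {body}")
        return status, body
```

If every attempt returns 503, the last `(503, body)` pair is returned so the caller can decide how to proceed.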

Memory fitting is checked per node, not summed across the cycle: Tensor sharding splits weights evenly, Pipeline allocates layers proportionally to each node's available memory, and every node must hold its share times a runtime-overhead factor (KV cache, activations, runner) on top of the raw weight bytes. A model that exactly equals a node's free memory is rejected, because that placement would thrash, not run.

Preview valid placements

GET /instance/previews?model_id=...

curl "http://localhost:52415/instance/previews?model_id=mlx-community/Qwen3.5-9B-4bit"

This is usually the best first Skulk-specific endpoint to call. It shows which combinations of sharding mode, networking mode, and node count are valid, and why invalid combinations fail.

| Query parameter | Meaning |
| --- | --- |
| model_id | Required. Hugging Face-style model ID. |
| node_ids | Optional, repeatable. Restricts previews to candidate cycles that contain all of these node IDs (subset matching). |
| excluded_node_ids | Optional, repeatable. Excludes the listed node IDs from candidate cycles for every previewed combination. Mirrors the excluded_nodes field on POST /place_instance so dashboards can render an accurate preview against the post-exclusion topology. |
# Preview with one node excluded:
curl "http://localhost:52415/instance/previews?model_id=mlx-community/Qwen3.5-9B-4bit&excluded_node_ids=12D3KooWAbc..."
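Repeatable query parameters are easy to get wrong when building URLs by hand. As a sketch, the standard library can encode them by repeating the key once per value; `previews_url` is a hypothetical helper and the node IDs in the usage note are placeholders.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:52415"  # default address used on this page


def previews_url(model_id, node_ids=(), excluded_node_ids=()):
    """Build a GET /instance/previews URL with repeatable query parameters."""
    query = [("model_id", model_id)]
    query += [("node_ids", node) for node in node_ids]
    query += [("excluded_node_ids", node) for node in excluded_node_ids]
    # safe="/" keeps the slash in Hugging Face-style IDs readable.
    return f"{BASE_URL}/instance/previews?{urlencode(query, safe='/')}"
```

For example, `previews_url("mlx-community/Qwen3.5-9B-4bit", excluded_node_ids=["node-a", "node-b"])` repeats `excluded_node_ids` twice in the query string.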

Build a placement manually

GET /instance/placement

Use this when you want a specific combination and want to inspect the exact instance shape before launch.

Create an instance from a fully specified placement

POST /instance

Use this when you already have an instance object and want exact control.

Inspect one instance

GET /instance/{instance_id}

Delete an instance

DELETE /instance/{instance_id}

Download Management

Start a node download

POST /download/start

Lower-level endpoint for explicit node download control.

Delete a node download

DELETE /download/{node_id}/{model_id}

Model Store Endpoints

These endpoints are available when the model store is configured.

If it is not configured, Skulk returns 503 Store not configured.

Store health

GET /store/health

Use this to confirm whether the store is configured and reachable.

Store registry

GET /store/registry

Use this to inspect which models the shared store knows about.

The dashboard combines registry results with GET /v1/models metadata so it can display derived tags such as vision, thinking, embedding, tensor, and optiq in the Store list.

Store downloads

GET /store/downloads

Use this to inspect in-progress shared-store download activity.

Request a store download

POST /store/models/{model_id}/download

Use this when you want the store host to fetch and register a model.

Optional JSON body {"gguf_file": "<repo-relative path>"} pins which GGUF quant the store fetches for a multi-quant GGUF repo (it downloads that file's shard group plus config.json). Omit the body to use the default quant preference. A pin naming a file not present in the repo falls back to the default.

Store download status

GET /store/models/{model_id}/download/status

Delete a model from the store

DELETE /store/models/{model_id}

Removes the model from the store host (registry + disk) and broadcasts a cluster-wide eviction so every node also drops its locally-staged copy, freeing disk fleet-wide instead of leaving worker copies until they age out under staging pressure. Returns 404 if the model is not registered in the store. (To clear staged copies without deleting the store copy, use POST /store/purge-staging.)

Purge staging caches

POST /store/purge-staging

Use this to remove staged model artifacts from nodes without deleting the store copy itself.

Start optimization

POST /store/models/{model_id}/optimize

Use this for workflows such as model optimization or alternate artifact generation.

Models Endpoint

List models

GET /v1/models

Returns the known model catalog, including downloaded models and catalog-backed entries.

Important fields:

| Field | Type | Meaning |
| --- | --- | --- |
| id | string | Canonical model ID |
| capabilities | array | Functional capabilities such as text, vision, thinking, code, or embedding |
| tags | array | UI-friendly derived labels such as vision, thinking, embedding, tensor, and optiq |
| supports_tensor | boolean | Whether tensor parallel launch is supported |
| base_model | string | Base family or upstream source model when known |
| runtime.mtp_sidecar_repo | string | Repo of this model's MTP sidecar (prediction heads), when it declares one |
| runtime.assistant_model_repo | string | Repo of this model's speculative-decoding assistant (drafter), when it declares one |
| runtime.served_spec_draft_repo | string | Repo of this model's separate served-engine draft GGUF, when it declares one |

The dashboard uses tags for compact badges and capabilities for filtering and richer tooltips. The three runtime.*_repo fields name a model's speculative-decoding companions (a draft model or an MTP-head sidecar). Those companion repos are downloaded and loaded automatically with their parent and are not independently placeable, so the dashboard marks any store entry matching one of these repos as a companion (a "Drafter" or "Sidecar" badge) rather than offering it launch, placement, or optimize actions.

Configuration Endpoints

Get config

GET /config

Returns the current cluster config and config path. Sensitive values (hf_token) are stripped from the response.

Update config

PUT /config

Updates cluster-wide config. Important behavior:

  • if you omit hf_token, Skulk preserves the existing value
  • if you omit logging, Skulk preserves the existing logging config
  • hf_token is not broadcast over gossipsub; it stays on the local node's skulk.yaml
  • logging changes (enable/disable) take effect immediately on all nodes
  • inference changes affect future launches
  • model-store location changes generally require restart

Filesystem browse

GET /filesystem/browse

Used by the dashboard to browse a safe subset of the filesystem when selecting config paths.

Node identity

GET /node/identity

Returns hostname, preferred IP, and node identity information used by the dashboard.

Restart a node

POST /admin/restart?node_id=<optional node id>

Gracefully restarts the Skulk process on this node or a remote node. When node_id is omitted or matches the local node, Skulk replaces the current process image in place via os.execv (same PID). When node_id targets a remote node, Skulk sends a RestartNode command via pub/sub.

  • GPU/Metal memory is released when the process image is replaced
  • the node rejoins the cluster automatically on startup
  • active inference is interrupted

Returns {"status": "restarting", "node_id": "..."} for local restarts, or {"status": "restart_sent", "node_id": "..."} for remote restarts. If a local restart is already scheduled, returns HTTP 409 with {"status": "restart_already_pending"}.

State, Events, and Tracing

Cluster state

GET /state

Returns the cluster state as Skulk currently sees it.

The response also carries a derived nodeHealth map (keyed by node id) so a problem on a node is visible rather than silent. Each entry is a level (ok, warn, or error) plus a list of reasons, where each reason has a code, a message describing what is wrong, and a remediation describing how to fix it. It is computed read-only from state already in the response (terminal download failures, low or full models-volume disk, and late heartbeats), so it adds no new polling. A node with no problems reports level: "ok" with an empty reasons list.
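A script can surface these problems by walking the nodeHealth map. The sketch below assumes the `GET /state` response has already been parsed into a dict with the field names described above; `unhealthy_nodes` is a hypothetical helper, and the reason code in the usage note is made up for illustration.

```python
def unhealthy_nodes(state: dict) -> list[str]:
    """Return one human-readable line per reason on every node that is not ok."""
    lines = []
    for node_id, health in sorted(state.get("nodeHealth", {}).items()):
        level = health.get("level", "ok")
        if level == "ok":
            continue
        for reason in health.get("reasons", []):
            lines.append(
                f"{node_id} [{level}] {reason['code']}: "
                f"{reason['message']} -> {reason['remediation']}"
            )
    return lines
```

For instance, a node entry with level `warn` and a single reason `{"code": "example_code", "message": "...", "remediation": "..."}` produces one line, while nodes reporting `ok` produce none.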

Operational note:

  • a follower may briefly report a local view that is behind the elected master while it is catching up
  • on newer builds, catch-up can start from a snapshot plus retained replay tail instead of always rebuilding from event 0
  • if your cluster is mixed-version during rollout, upgrade all nodes before you rely on bounded replay retention on the master; an older restarted node may not be able to fully resync after old history has been compacted away

Event log

GET /events

Returns stored events from the API-side event log.

Diagnostics

  • GET /v1/diagnostics/node
  • POST /v1/diagnostics/node/capture
  • POST /v1/diagnostics/node/runners/{runner_id}/cancel
  • GET /v1/diagnostics/cluster
  • GET /v1/diagnostics/cluster/timeline
  • GET /v1/diagnostics/cluster/{node_id}
  • POST /v1/diagnostics/cluster/{node_id}/capture
  • POST /v1/diagnostics/cluster/{node_id}/runners/{runner_id}/cancel

Use these endpoints when a node appears stuck loading, warming up, decoding, or shutting down and you need a read-only snapshot without SSHing into every node.

Behavior notes:

  • GET /v1/diagnostics/node returns the local node's runtime/config facts, resources, process tree, live runner-supervisor state, flight-recorder phase state, and placement analysis.
  • POST /v1/diagnostics/node/capture collects an on-demand local diagnostic bundle. Body fields are runnerId, taskId, includeProcessSamples, and sampleDurationSeconds; all are optional. When a runner/task is provided, the response includes that runner's bounded flight recorder, latest MLX memory snapshot, and best-effort macOS sample, vmmap -summary, and footprint -p output. Sampling failures are returned as structured partial failures instead of failing the bundle.
  • POST /v1/diagnostics/node/runners/{runner_id}/cancel requests cooperative cancellation for one task that the local runner supervisor still knows about.
  • GET /v1/diagnostics/cluster fans out to reachable peer APIs and returns partial results when some peers are unavailable.
  • GET /v1/diagnostics/cluster/timeline stitches every reachable node's runner-supervisor diagnostics into one cross-rank chronological view. The response carries a per-runner synopsis sorted by (modelId, deviceRank) and every flight-recorder entry across all ranks merged and sorted by at. Use this when debugging a distributed deadlock: the rank-disagreement signature ("rank 0 entered pipeline_last_eval_output at T while rank 1 is still in pipeline_first_recv_like") is invisible from any single node's local diagnostics but obvious top-to-bottom in the merged timeline. Unreachable peers are returned in unreachableNodes instead of failing the request.
  • GET /v1/diagnostics/cluster/{node_id} proxies one reachable peer bundle or returns the local bundle if node_id is the current API node.
  • POST /v1/diagnostics/cluster/{node_id}/capture proxies the same on-demand capture request to a reachable peer node.
  • POST /v1/diagnostics/cluster/{node_id}/runners/{runner_id}/cancel proxies the same cooperative live-runner cancellation request to a reachable peer.
  • Placement diagnostics explicitly include whether the current master is part of each model placement, which helps investigate hangs where the master is not one of the inference ranks.
  • The dashboard node-card bug icon uses these endpoints to open a live diagnostics drawer for any reachable node.
  • The diagnostics drawer prefers Capture bundle before cancellation so operators can collect phase, MLX memory, and process samples before changing the runner state.
  • Runner cancellation is best-effort only. A wedged native/MLX runner may ignore the request and still require stronger intervention.
  • Diagnostics endpoints do not currently kill or restart runners. Capture is read-only; the only mutating diagnostics action is the cooperative task-cancel request above.

Example:

curl http://localhost:52415/v1/diagnostics/node
curl http://localhost:52415/v1/diagnostics/cluster
curl http://localhost:52415/v1/diagnostics/cluster/timeline
curl http://localhost:52415/v1/diagnostics/cluster/<node_id>
curl -X POST http://localhost:52415/v1/diagnostics/node/capture \
-H 'content-type: application/json' \
-d '{"runnerId":"<runner_id>","taskId":"<task_id>"}'
curl -X POST http://localhost:52415/v1/diagnostics/cluster/<node_id>/capture \
-H 'content-type: application/json' \
-d '{"runnerId":"<runner_id>","includeProcessSamples":true}'
curl -X POST http://localhost:52415/v1/diagnostics/node/runners/<runner_id>/cancel \
-H 'content-type: application/json' \
-d '{"taskId":"<task_id>"}'
curl -X POST http://localhost:52415/v1/diagnostics/cluster/<node_id>/runners/<runner_id>/cancel \
-H 'content-type: application/json' \
-d '{"taskId":"<task_id>"}'

Traces

  • GET /v1/tracing
  • PUT /v1/tracing
  • GET /v1/traces
  • GET /v1/traces/cluster
  • POST /v1/traces/delete
  • GET /v1/traces/{task_id}
  • GET /v1/traces/{task_id}/stats
  • GET /v1/traces/{task_id}/raw
  • GET /v1/traces/cluster/{task_id}
  • GET /v1/traces/cluster/{task_id}/stats
  • GET /v1/traces/cluster/{task_id}/raw

Use these endpoints when you are debugging generation behavior, cluster execution, or performance.

Behavior notes:

  • GET /v1/tracing returns whether runtime tracing is currently enabled for new requests across the live cluster session.
  • PUT /v1/tracing toggles tracing cluster-wide for new requests only. It does not retroactively trace in-flight work.
  • GET /v1/traces* reads local trace artifacts stored on the current node.
  • GET /v1/traces/cluster* fans out to reachable peer APIs, deduplicates by task_id, and proxies read-only trace access from any reachable node.
  • POST /v1/traces/delete remains local-only in v1 even when cluster browsing is enabled.

Runtime tracing control

GET /v1/tracing

Returns the current cluster tracing state:

{"enabled": false}

PUT /v1/tracing

Enable or disable tracing for new requests across the current cluster session.

Request body:

{"enabled": true}

Response body:

{"enabled": true}

Operational notes:

  • this is a runtime toggle, not a restart-required config edit
  • it applies to new requests only
  • it does not retroactively trace work already in flight
  • the dashboard traces page uses this same API
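For scripts that prefer Python over curl, the toggle can be sketched with the standard library. The request and response bodies follow the shapes shown above; the helper names `tracing_request` and `set_tracing` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:52415"  # default address used on this page


def tracing_request(enabled: bool) -> urllib.request.Request:
    """Build the PUT /v1/tracing request without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}/v1/tracing",
        data=json.dumps({"enabled": enabled}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def set_tracing(enabled: bool) -> bool:
    """Send the toggle and return the tracing state the server reports back."""
    with urllib.request.urlopen(tracing_request(enabled)) as resp:
        return json.load(resp)["enabled"]
```

Calling `set_tracing(True)` before issuing a request and `set_tracing(False)` afterwards limits tracing to the work you want to inspect, since the toggle only affects new requests.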

Local trace endpoints

These endpoints operate on trace artifacts stored on the current node:

  • GET /v1/traces lists local trace artifacts with metadata such as task kind, model, source nodes, and tool-activity tags
  • GET /v1/traces/{task_id} returns structured trace events for one task
  • GET /v1/traces/{task_id}/stats returns aggregated timing summaries
  • GET /v1/traces/{task_id}/raw downloads Chrome-trace-compatible JSON
  • POST /v1/traces/delete deletes one or more local trace artifacts

Example:

curl http://localhost:52415/v1/traces
curl http://localhost:52415/v1/traces/<task_id>/stats
curl -OJ http://localhost:52415/v1/traces/<task_id>/raw

Cluster trace endpoints

These endpoints let a dashboard or script on any reachable node browse traces across the cluster:

  • GET /v1/traces/cluster
  • GET /v1/traces/cluster/{task_id}
  • GET /v1/traces/cluster/{task_id}/stats
  • GET /v1/traces/cluster/{task_id}/raw

Operational notes:

  • cluster browsing is read-only in v1
  • the API fans out to reachable peer APIs and deduplicates traces by task_id
  • if some peers are unreachable, cluster results may be partial
  • source node metadata in responses tells you which nodes contributed trace content

Example:

curl http://localhost:52415/v1/traces/cluster
curl http://localhost:52415/v1/traces/cluster/<task_id>/stats
curl -OJ http://localhost:52415/v1/traces/cluster/<task_id>/raw
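
To make the fan-out behavior concrete, the sketch below shows one way per-node listings could be merged by task_id while tracking unreachable peers. It is an illustration only, not Skulk's implementation: the function name, the "task_id" and "source_nodes" keys, and the use of None for an unreachable peer are all assumptions.

```python
def merge_cluster_traces(per_node):
    """Merge per-node trace listings into one list, deduplicated by task_id.

    per_node maps a node ID to that node's list of trace dicts, or to None
    when the peer could not be reached (hypothetical convention).
    Returns (merged_traces, unreachable_node_ids).
    """
    merged = {}
    unreachable = []
    for node_id, traces in per_node.items():
        if traces is None:
            # Peer did not answer, so the merged result is partial.
            unreachable.append(node_id)
            continue
        for trace in traces:
            # First occurrence of a task_id wins; later nodes only add themselves
            # to the list of contributing sources.
            entry = merged.setdefault(trace["task_id"], {**trace, "source_nodes": []})
            if node_id not in entry["source_nodes"]:
                entry["source_nodes"].append(node_id)
    return list(merged.values()), unreachable
```

A non-empty second return value corresponds to the partial-results case noted above.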

Connectivity Endpoints

Tailscale status

GET /v1/connectivity/tailscale
GET /v1/connectivity/tailscale?node_id=<id>

Returns whether tailscaled is running on a node and, if so, the node's Tailscale IP, hostname, DNS name, tailnet, and client version. All fields except running are null when tailscaled is not installed or not running.

Pass node_id to proxy the request to a specific cluster node. Omit it to query the local node directly. Returns 404 if the target node is not reachable.

Response fields:

Field     Type           Description
running   boolean        true when tailscaled reports BackendState == "Running"
selfIp    string | null  Node's Tailscale IPv4 address (100.x.x.x range)
hostname  string | null  Node hostname as registered in the tailnet
dnsName   string | null  Fully-qualified Tailscale MagicDNS name, e.g. my-node.tailnet-abc.ts.net
tailnet   string | null  Tailnet name derived from dnsName
version   string | null  Tailscale client version string

Example:

# Local node
curl http://localhost:52415/v1/connectivity/tailscale

# Specific cluster node
curl "http://localhost:52415/v1/connectivity/tailscale?node_id=<node-id>"
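
A client typically needs two things from this endpoint: the right URL for a local or proxied query, and a usable host once the response arrives. The sketch below covers both under stated assumptions. The function names are hypothetical, and preferring dnsName over selfIp is a choice made here for illustration; only the field names come from the table above.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:52415"  # default Skulk API address from this page


def tailscale_status_url(node_id=None, base_url=BASE_URL):
    """Build the status URL; pass node_id to target a specific cluster node."""
    url = f"{base_url}/v1/connectivity/tailscale"
    if node_id is not None:
        url += "?" + urlencode({"node_id": node_id})
    return url


def tailscale_host(status):
    """Return a reachable Tailscale host from a status response, or None.

    Every field except 'running' is null when tailscaled is not running,
    so 'running' is checked first.
    """
    if not status.get("running"):
        return None
    return status.get("dnsName") or status.get("selfIp")
```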

Remote access info

GET /v1/connectivity/remote-access

Returns aggregated remote access information for the local node: LAN address, Tailscale address, and a preferredUrl (Tailscale if running, otherwise LAN). When Tailscale is running, preferredUrl uses the node's MagicDNS name (my-node.tailnet-abc.ts.net) if available, falling back to the raw 100.x.x.x IP. operatorUrl appends /operator to preferredUrl (suitable for QR code generation so mobile users land directly on the operator panel).

Response fields:

Field              Type           Description
local.ip           string | null  Preferred LAN IPv4 address
local.port         integer        API/dashboard port
local.url          string | null  http://{ip}:{port}
tailscale.running  boolean        true when tailscaled is connected
tailscale.ip       string | null  Tailscale IPv4 address (100.x.x.x)
tailscale.dnsName  string | null  MagicDNS fully-qualified name, e.g. my-node.tailnet-abc.ts.net
tailscale.port     integer        API/dashboard port
tailscale.url      string | null  http://{dnsName or ip}:{port} if running
preferredUrl       string | null  MagicDNS URL if available, else Tailscale IP URL, else LAN URL
operatorUrl        string | null  preferredUrl + /operator

Example:

curl http://localhost:52415/v1/connectivity/remote-access | python3 -m json.tool

Example response when Tailscale is running with MagicDNS:

{
"local": { "ip": "192.168.1.5", "port": 52415, "url": "http://192.168.1.5:52415" },
"tailscale": {
"running": true,
"ip": "100.101.102.103",
"dnsName": "my-node.tailnet-abc.ts.net",
"port": 52415,
"url": "http://my-node.tailnet-abc.ts.net:52415"
},
"preferredUrl": "http://my-node.tailnet-abc.ts.net:52415",
"operatorUrl": "http://my-node.tailnet-abc.ts.net:52415/operator"
}
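
The selection rule for preferredUrl and operatorUrl can be restated as a small function. Clients should normally read both values straight from the response; this sketch only mirrors the documented order (MagicDNS name, then Tailscale IP, then LAN URL) so the rule is easy to check. The function name pick_urls is hypothetical.

```python
def pick_urls(info):
    """Recompute (preferredUrl, operatorUrl) from the local and tailscale sections."""
    ts = info.get("tailscale") or {}
    local = info.get("local") or {}

    preferred = None
    if ts.get("running"):
        # MagicDNS name wins; the raw 100.x.x.x address is the fallback.
        host = ts.get("dnsName") or ts.get("ip")
        if host:
            preferred = f"http://{host}:{ts['port']}"
    if preferred is None:
        # Tailscale is unavailable, so fall back to the LAN URL (may be None).
        preferred = local.get("url")

    operator = f"{preferred}/operator" if preferred else None
    return preferred, operator
```

Applied to the example response above, this returns the same preferredUrl and operatorUrl values that the response itself carries.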

Operator App Integration

The operator panel at /operator is designed for mobile access and can also be driven by a native app. The relevant API endpoints are:

Node and cluster state

Endpoint            Description
GET /v1/state       Full cluster state: nodes, instances, runners, memory, GPU
GET /node_id        Local node's ID
GET /node/identity  Node ID, hostname, and preferred LAN IP

Remote access and connectivity

Endpoint                                     Description
GET /v1/connectivity/remote-access           LAN + Tailscale addresses, preferred URL, operator URL for QR
GET /v1/connectivity/tailscale               Tailscale status for local node
GET /v1/connectivity/tailscale?node_id=<id>  Tailscale status for a specific peer node

Node management

Endpoint                          Description
POST /v1/nodes/{node_id}/restart  Send a restart command to any node in the cluster

Typical operator app workflow

  1. Call GET /v1/connectivity/remote-access on the initially discovered node to get the preferredUrl, then use that as the base URL for subsequent calls.
  2. Poll GET /v1/state every 5 seconds for node health (memory, GPU, temperature).
  3. Show per-node cards with restart buttons that call POST /v1/nodes/{node_id}/restart.
  4. On first launch or on the settings screen, show the operatorUrl as a QR code so users can hand it off to another device.
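
Steps 1 to 3 of the workflow above can be sketched as a few small helpers. This is an illustrative outline rather than a reference client: all function names are hypothetical, the fetch and sleep parameters exist only so the logic can be exercised without a network, and sending the restart POST with an empty body is an assumption because this page does not document a request body for it.

```python
import json
import time
import urllib.request


def get_json(url):
    """Fetch a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def resolve_base_url(discovered_url, fetch=get_json):
    """Step 1: ask the discovered node for preferredUrl and use it as the base URL.

    Falls back to the discovered URL when preferredUrl is null.
    """
    info = fetch(f"{discovered_url}/v1/connectivity/remote-access")
    return info.get("preferredUrl") or discovered_url


def poll_state(base_url, fetch=get_json, interval=5.0, max_polls=None, sleep=time.sleep):
    """Step 2: yield GET /v1/state snapshots, waiting `interval` seconds between polls."""
    polls = 0
    while max_polls is None or polls < max_polls:
        yield fetch(f"{base_url}/v1/state")
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval)


def restart_request(base_url, node_id):
    """Step 3: build (but do not send) the POST a restart button would issue.

    The empty body is an assumption; adjust if the endpoint expects a payload.
    """
    return urllib.request.Request(
        f"{base_url}/v1/nodes/{node_id}/restart", data=b"", method="POST"
    )
```

A UI would iterate over poll_state(base_url) on a background thread to refresh its node cards, and pass restart_request(base_url, node_id) to urllib.request.urlopen when a restart button is pressed.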

Helpful Next Docs