Skulk API
Skulk serves an API at http://localhost:52415.
That API has two jobs:
- compatibility endpoints for tools that already speak OpenAI, Claude, or Ollama-style APIs
- Skulk-specific control endpoints for placement, downloads, config, tracing, and model-store workflows
A model must be placed and running before chat requests for it succeed; calling
/v1/chat/completions for an unplaced model returns a 404 No instance found.
The First Success Flow below walks from placement to first
token.
Quick Navigation
- First working request: First Success Flow
- OpenAI-compatible chat: OpenAI Chat Completions
- OpenAI Responses format: OpenAI Responses API
- Claude format: Claude Messages API
- Ollama compatibility: Ollama API
- Placement and launch: Placement and Instance Management
- Store and config: Model Store Endpoints and Configuration Endpoints
- Debugging: State, Events, and Tracing
First Success Flow
1. Start Skulk
uv run skulk
2. Preview placements
curl "http://localhost:52415/instance/previews?model_id=mlx-community/Llama-3.2-1B-Instruct-4bit"
This shows what Skulk can actually place on the current node or cluster.
3. Launch a placement
curl -X POST http://localhost:52415/place_instance \
-H 'Content-Type: application/json' \
-d '{
"model_id": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"sharding": "Pipeline",
"instance_meta": "MlxRing",
"min_nodes": 1
}'
4. Send a chat request
curl -X POST http://localhost:52415/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [{"role": "user", "content": "Hello from Skulk"}]
}'
If this fails with 404 No instance found for model ..., the placement is not ready yet or never launched successfully.
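Because a placement takes time to become ready, a client may want to retry the first chat request while the 404 persists. The following is a minimal sketch that assumes only the 404 behavior described above; `send_chat`, `wait_until_ready`, and their parameters are hypothetical helper names, not part of Skulk.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:52415"  # default address used throughout this document


def send_chat(model: str, prompt: str) -> tuple[int, dict]:
    """POST one chat request and return (HTTP status, parsed body)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # Error bodies are not guaranteed to be JSON, so decode defensively.
        raw = err.read().decode("utf-8", errors="replace")
        try:
            return err.code, json.loads(raw)
        except json.JSONDecodeError:
            return err.code, {"raw": raw}


def wait_until_ready(send, attempts: int = 30, delay_s: float = 2.0,
                     sleep=time.sleep) -> dict:
    """Call send() until it stops returning 404, at most `attempts` times."""
    for attempt in range(attempts):
        status, payload = send()
        if status == 200:
            return payload
        if status != 404:
            raise RuntimeError(f"chat request failed with HTTP {status}: {payload}")
        if attempt < attempts - 1:
            sleep(delay_s)
    raise TimeoutError("model still not placed after retries")
```

Usage would look like `wait_until_ready(lambda: send_chat("mlx-community/Llama-3.2-1B-Instruct-4bit", "Hello from Skulk"))`. The retry count and delay are arbitrary starting points.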
Endpoint Overview
Compatibility APIs
- POST /v1/chat/completions
- POST /v1/responses
- POST /v1/messages
- POST /ollama/api/chat
- POST /ollama/api/generate
- GET /ollama/api/tags
- POST /ollama/api/show
- GET /ollama/api/ps
- GET /ollama/api/version
Skulk Control APIs
- GET /v1/models
- POST /v1/tools/web_search
- POST /v1/tools/open_url
- POST /v1/tools/extract_page
- GET /models/search
- POST /models/add
- DELETE /models/custom/{model_id}
- POST /place_instance
- POST /instance
- GET /instance/placement
- GET /instance/previews
- GET /instance/{instance_id}
- DELETE /instance/{instance_id}
- GET /state
- GET /events
- POST /download/start
- DELETE /download/{node_id}/{model_id}
- GET /config
- PUT /config
- GET /store/health
- GET /store/registry
- GET /store/downloads
- POST /store/models/{model_id}/download
- GET /store/models/{model_id}/download/status
- DELETE /store/models/{model_id}
- POST /store/purge-staging
- GET /store/storage
- POST /store/models/{model_id}/optimize
- GET /store/models/{model_id}/optimize/status
- GET /filesystem/browse
- GET /node/identity
- GET /v1/tracing
- PUT /v1/tracing
- GET /v1/traces
- GET /v1/traces/cluster
- POST /v1/traces/delete
- GET /v1/traces/{task_id}
- GET /v1/traces/{task_id}/stats
- GET /v1/traces/{task_id}/raw
- GET /v1/traces/cluster/{task_id}
- GET /v1/traces/cluster/{task_id}/stats
- GET /v1/traces/cluster/{task_id}/raw
- GET /v1/diagnostics/node
- POST /v1/diagnostics/node/capture
- POST /v1/diagnostics/node/runners/{runner_id}/cancel
- GET /v1/diagnostics/cluster
- GET /v1/diagnostics/cluster/timeline
- GET /v1/diagnostics/cluster/{node_id}
- POST /v1/diagnostics/cluster/{node_id}/capture
- POST /v1/diagnostics/cluster/{node_id}/runners/{runner_id}/cancel
For the full interactive reference with request/response schemas, see the API Reference.
OpenAI Chat Completions
POST /v1/chat/completions
This is the main chat-generation endpoint for both text-only and multimodal models.
Requests are validated before dispatch: an empty messages array or a
non-positive max_tokens returns 400 Bad Request rather than being
accepted and failing during generation. (This applies across the Claude,
Ollama, and Responses wire formats too, which share the same dispatch path.)
Context-length limits
Every placed instance has a usable context limit: the smaller of the model's advertised context length and the number of KV-cache tokens that fit in memory next to the model weights on the hosting node(s). Requests are admitted against that limit instead of growing the KV cache until the node runs out of memory:
- A max_tokens value that cannot fit in the limit at all returns 400 Bad Request immediately (context_length_exceeded: ...).
- After tokenization on the serving instance, a prompt that fills the window, or a prompt plus an explicit max_tokens that exceeds the limit, is rejected with an OpenAI-style invalid_request_error whose message starts with context_length_exceeded:. For streaming requests this arrives as the first SSE data: event; for non-streaming requests the response body is the error envelope (the HTTP status is already committed when the rejection is computed on the serving node).
- When max_tokens is omitted, the server default output budget is clamped to the remaining window, so generation ends with finish_reason: "length" instead of overrunning the context.
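Since a post-tokenization rejection can arrive in the response body rather than as an HTTP error status, clients may want to inspect the body itself. The sketch below assumes the envelope follows the usual OpenAI shape (`{"error": {"type": ..., "message": ...}}`) and that streams may end with the OpenAI-style `[DONE]` sentinel; both are assumptions, and the helper names are hypothetical.

```python
import json

PREFIX = "context_length_exceeded:"


def is_context_length_error(payload: dict) -> bool:
    """True when a response body is an error envelope for a context-length rejection."""
    error = payload.get("error")
    if not isinstance(error, dict):
        return False
    message = error.get("message")
    return isinstance(message, str) and message.startswith(PREFIX)


def first_sse_payload(lines):
    """Return the parsed JSON of the first SSE `data:` line, or None if there is none."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed OpenAI-style end-of-stream sentinel
            return None
        return json.loads(data)
    return None
```

For a non-streaming call, pass the decoded body to `is_context_length_error` regardless of the status code. For a streaming call, check `first_sse_payload` before treating the stream as tokens.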
OpenAI Python SDK Example
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:52415/v1",
api_key="unused",
)
response = client.chat.completions.create(
model="mlx-community/Llama-3.2-1B-Instruct-4bit",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Curl Example
curl -X POST http://localhost:52415/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Streaming Example
stream = client.chat.completions.create(
model="mlx-community/Llama-3.2-1B-Instruct-4bit",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True,
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta, so guard both.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Common Request Fields
| Field | Type | Notes |
|---|---|---|
model | string | Required. Must match a placed and running model. |
messages | array | Required. Supports system, user, assistant, developer, tool, function. |
stream | boolean | Use true for SSE streaming. |
temperature | number | Sampling temperature. |
top_p | number | Nucleus sampling. |
top_k | integer | Top-k sampling. |
min_p | number | Minimum-probability threshold. |
max_tokens | integer | Max generated tokens. When omitted, Skulk uses a backend default of 4096 generated tokens (DEFAULT_MAX_OUTPUT_TOKENS); operators can override that default with SKULK_MAX_OUTPUT_TOKENS (or the legacy SKULK_MAX_TOKENS). |
stop | string or array | Stop sequences. |
seed | integer | Reproducibility helper. |
frequency_penalty | number | Frequency penalty. |
presence_penalty | number | Presence penalty. |
repetition_penalty | number | Repetition penalty. |
repetition_context_size | integer | Context window for repetition handling. |
logprobs | boolean | Return token logprobs when supported. |
top_logprobs | integer | Number of top logprobs to include. |
tools | array | OpenAI-style tool definitions. |
tool_choice | string or object | auto, none, or a specific tool selection. |
parallel_tool_calls | boolean | Accepted for compatibility. |
enable_thinking | boolean | Skulk extension for reasoning-capable models. |
reasoning_effort | string | Reasoning hint when supported. |
response_format | object | Accepted for compatibility, not strictly enforced. |
stream_options | object | Includes include_usage. |
user | string | Optional caller identifier. |
Message Format
{
"role": "user",
"content": "hello"
}
Assistant messages may include tool_calls.
Tool response messages should include tool_call_id.
User messages may also be sent as structured content parts. Skulk accepts OpenAI-style image inputs for vision-capable models:
{
"role": "user",
"content": [
{ "type": "text", "text": "What is in this image?" },
{
"type": "image_url",
"image_url": { "url": "data:image/png;base64,..." }
}
]
}
Notes:
- inline data: URLs are supported for image inputs
- Anthropic-compatible requests can also carry image content for multimodal models
- image understanding depends on the selected model exposing the vision capability
Finish Reasons
| Value | Meaning |
|---|---|
stop | Natural stop or stop sequence reached |
length | max_tokens limit reached |
tool_calls | Model is requesting a tool call |
content_filter | Reserved for compatibility |
function_call | Reserved for compatibility |
error | Generation failed |
Tool Use
Skulk supports OpenAI-style function calling.
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}
}]
response = client.chat.completions.create(
model="mlx-community/Qwen3.5-9B-4bit",
messages=[{"role": "user", "content": "What is the weather in Paris?"}],
tools=tools,
tool_choice="auto",
)
Typical flow:
- Send messages and tool definitions.
- Inspect finish_reason.
- If it is tool_calls, execute the tool in your app.
- Send the tool result back as a tool message.
- Request the final model response.
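The middle steps of this flow can be sketched as a small dispatcher that turns an assistant message's tool_calls into tool messages. It assumes the standard OpenAI tool-call shape (an `id` plus a `function` object whose `arguments` is a JSON string); `get_weather`, `TOOL_REGISTRY`, and `run_tool_calls` are hypothetical names for illustration.

```python
import json


def get_weather(location: str) -> dict:
    """Hypothetical local tool; a real app would query a weather service."""
    return {"location": location, "forecast": "unknown"}


# Maps tool names from the request's tool definitions to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested tool call and build the matching tool messages."""
    tool_messages = []
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        handler = TOOL_REGISTRY[function["name"]]
        # OpenAI-style tool calls carry their arguments as a JSON-encoded string.
        arguments = json.loads(function["arguments"] or "{}")
        result = handler(**arguments)
        tool_messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return tool_messages
```

To finish the loop, append the assistant message and the returned tool messages to the conversation, then send another chat request with the same tools.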
Thinking / Reasoning
Skulk supports reasoning-aware chat for compatible models.
response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-4bit",
    messages=[{"role": "user", "content": "What is 127 * 43?"}],
    # enable_thinking is not a named OpenAI SDK parameter, so send it via extra_body.
    extra_body={"enable_thinking": True},
)
msg = response.choices[0].message
print(msg.reasoning_content)
print(msg.content)
Notes:
- enable_thinking is a Skulk extension.
- Reasoning support depends on model capabilities.
- Use GET /v1/models response data[].resolved_capabilities to decide whether a model supports thinking and whether clients should render a thinking toggle.
- Treat resolved_capabilities as the default tool-free request path; request-specific options such as tools can change prompt rendering and related resolved values for mixed-mode model families.
- Thinking-control semantics are model-aware:
  - if supports_thinking_toggle is true, send enable_thinking=true or false explicitly
  - reasoning_effort="none" disables thinking for toggleable models
  - if a model does not support toggleable thinking, Skulk ignores explicit toggle overrides but still preserves explicit non-disabled reasoning-effort hints when the model family supports them
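A client can apply these rules when building a request. The sketch below assumes each GET /v1/models entry exposes `resolved_capabilities` as an object containing a boolean `supports_thinking_toggle`; the exact nesting is an assumption, and `thinking_request_fields` is a hypothetical helper.

```python
def thinking_request_fields(model_entry: dict, want_thinking: bool) -> dict:
    """Choose thinking-related request fields for one GET /v1/models entry."""
    caps = model_entry.get("resolved_capabilities") or {}
    if caps.get("supports_thinking_toggle") is True:
        # Toggleable model: state the choice explicitly, whether on or off.
        return {"enable_thinking": want_thinking}
    # Non-toggleable model: Skulk ignores toggle overrides, so send nothing.
    return {}
```

With the OpenAI Python SDK, the returned dict can be passed through the `extra_body` parameter of `client.chat.completions.create`. The same check can decide whether a UI renders a thinking toggle.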
Builtin Browser Tools
POST /v1/tools/web_search
Execute Skulk's generic web_search tool and return structured search results.
curl -X POST http://localhost:52415/v1/tools/web_search \
-H 'Content-Type: application/json' \
-d '{
"query": "foxlight skulk distributed inference",
"top_k": 5
}'
Request fields:
| Field | Type | Notes |
|---|---|---|
query | string | Required search query. |
top_k | integer | Optional max results, 1 to 10, default 5. |
Response fields:
| Field | Type | Notes |
|---|---|---|
query | string | Original search query. |
provider | string | Search backend identifier. |
results | array | Ordered search results with title, url, and snippet. |
This endpoint is designed for client-executed tool loops. GPT-OSS can request
web_search; the client calls this endpoint and then sends the JSON result back
as a tool message.
POST /v1/tools/open_url
Fetch one HTTP or HTTPS URL, follow redirects, and return structured metadata.
curl -X POST http://localhost:52415/v1/tools/open_url \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com/article"
}'
Request fields:
| Field | Type | Notes |
|---|---|---|
url | string | Required absolute http:// or https:// URL. |
Response fields:
| Field | Type | Notes |
|---|---|---|
url | string | Original requested URL. |
final_url | string | Final URL after redirects. |
title | string or null | Best-effort page title. |
status_code | integer | Final HTTP status code. |
content_type | string or null | Normalized response content type. |
provider | string | Backend provider identifier. |
POST /v1/tools/extract_page
Fetch one HTTP or HTTPS URL and return bounded readable text extracted from the response body.
curl -X POST http://localhost:52415/v1/tools/extract_page \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com/article",
"max_chars": 12000
}'
Request fields:
| Field | Type | Notes |
|---|---|---|
url | string | Required absolute http:// or https:// URL. |
max_chars | integer | Optional maximum characters, 500 to 50000, default 12000. |
Response fields:
| Field | Type | Notes |
|---|---|---|
url | string | Original requested URL. |
final_url | string | Final URL after redirects. |
title | string or null | Best-effort page title. |
text | string | Readable extracted text. |
truncated | boolean | Whether the text was clipped to max_chars. |
provider | string | Backend provider identifier. |
These browser-tool endpoints are designed for client-executed tool loops. In
dashboard chat, GPT-OSS can request web_search, open_url, or
extract_page; the dashboard executes the endpoint call, then sends the JSON
result back as a tool message.
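A client outside the dashboard can run the same loop by mapping each requested tool name to its endpoint. This is a minimal sketch using only the three paths documented above; `BROWSER_TOOL_PATHS`, `build_tool_request`, and `call_browser_tool` are hypothetical names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:52415"

# Tool name requested by the model -> Skulk endpoint that executes it.
BROWSER_TOOL_PATHS = {
    "web_search": "/v1/tools/web_search",
    "open_url": "/v1/tools/open_url",
    "extract_page": "/v1/tools/extract_page",
}


def build_tool_request(name: str, arguments: dict) -> urllib.request.Request:
    """Build the POST request for one browser tool call."""
    if name not in BROWSER_TOOL_PATHS:
        raise ValueError(f"not a builtin browser tool: {name}")
    return urllib.request.Request(
        BASE_URL + BROWSER_TOOL_PATHS[name],
        data=json.dumps(arguments).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_browser_tool(name: str, arguments: dict) -> str:
    """Execute the tool and return the raw JSON text for the tool message content."""
    with urllib.request.urlopen(build_tool_request(name, arguments)) as resp:
        return resp.read().decode("utf-8")
```

The returned JSON text can be used directly as the `content` of the tool message that answers the model's call.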
Structured Output
response_format is accepted for compatibility, but Skulk does not currently enforce strict JSON mode or JSON schema validation.
response = client.chat.completions.create(
model="mlx-community/Qwen3.5-9B-4bit",
messages=[{"role": "user", "content": "Return valid JSON with three colors"}],
response_format={"type": "json_object"},
)
For the best results, explicitly instruct the model to return valid JSON.
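Because the server does not enforce JSON output, validating the reply on the client is a reasonable safeguard. The helper below is a hypothetical sketch: it parses the reply and tolerates a surrounding Markdown code fence, which some models add even when asked for bare JSON.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_reply(text: str):
    """Parse a model reply as JSON, stripping one surrounding Markdown fence if present."""
    candidate = text.strip()
    if candidate.startswith(FENCE):
        lines = candidate.splitlines()
        if lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence
        # Drop the opening fence line, which may carry a language tag.
        candidate = "\n".join(lines[1:]).strip()
    return json.loads(candidate)  # raises json.JSONDecodeError on invalid JSON
```

Call it on `response.choices[0].message.content` and retry or re-prompt when it raises `json.JSONDecodeError`.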
OpenAI Responses API
POST /v1/responses
Use this when your client expects the OpenAI Responses format instead of Chat Completions.
curl -X POST http://localhost:52415/v1/responses \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"input": "Hello from the Responses API"
}'
Claude Messages API
POST /v1/messages
Use this when your client expects Anthropic-style request and response shapes.
curl -X POST http://localhost:52415/v1/messages \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 512
}'
Ollama API
Skulk supports several Ollama-compatible endpoints so tools like OpenWebUI can connect with minimal glue code.
Chat
curl -X POST http://localhost:52415/ollama/api/chat \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [{"role": "user", "content": "Hello"}]
}'
Generate
curl -X POST http://localhost:52415/ollama/api/generate \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"prompt": "Write a haiku about foxes"
}'
List models
curl http://localhost:52415/ollama/api/tags
Show model details
curl -X POST http://localhost:52415/ollama/api/show \
-H 'Content-Type: application/json' \
-d '{"name": "mlx-community/Llama-3.2-1B-Instruct-4bit"}'
Model Discovery
List models
GET /v1/models
curl http://localhost:52415/v1/models
This returns known model cards, not just running instances.
Search Hugging Face
GET /models/search?query=...&limit=...
curl "http://localhost:52415/models/search?query=qwen3&limit=5"
Behavior note:
- Skulk searches mlx-community first.
- If that returns nothing, it falls back to a broader Hugging Face search.
Per-node storage breakdown
GET /store/storage
Returns the local node's storage picture: every staged model with its size, last-use time, and whether a live instance (or one of its companion repos: MTP sidecar, assistant, vision weights) currently depends on it, plus event-log usage and free disk on the models volume. Cluster-wide views query each node's API.
curl http://localhost:52415/store/storage
Staged copies are managed automatically when the model store is on: when an
instance shuts down (and at node startup, which reconciles copies orphaned
by a crash), not-in-use staged models are kept newest-first up to the
staging_keep_recent_gb grace budget (default 40 GiB) and evicted beyond
it. Set cleanup_on_deactivate: false in the staging config to keep every
staged copy and manage cleanup manually via POST /store/purge-staging.
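The keep-newest-first policy can be illustrated with a small planner. This is a sketch of the idea, not Skulk's implementation: the `StagedModel` fields are hypothetical, and it assumes that once one idle copy would push the kept total past the budget, that copy and every older idle copy are evicted.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class StagedModel:
    model_id: str
    size_bytes: int
    last_used: float  # Unix timestamp of last use
    in_use: bool      # True when a live instance or companion repo depends on it


def plan_eviction(models, keep_recent_gb: float = 40.0) -> list[str]:
    """Return the model IDs to evict, keeping idle copies newest-first within the budget."""
    budget = keep_recent_gb * GIB
    idle = sorted((m for m in models if not m.in_use),
                  key=lambda m: m.last_used, reverse=True)
    kept_bytes = 0
    over_budget = False
    evict = []
    for model in idle:
        if not over_budget and kept_bytes + model.size_bytes <= budget:
            kept_bytes += model.size_bytes
        else:
            over_budget = True  # assumed cutoff: everything older goes too
            evict.append(model.model_id)
    return evict
```

In-use copies never count against the budget here and are never evicted, matching the "not-in-use" wording above.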
Placement and Instance Management
These endpoints are the heart of the Skulk control plane.
Quick launch
POST /place_instance
curl -X POST http://localhost:52415/place_instance \
-H 'Content-Type: application/json' \
-d '{
"model_id": "mlx-community/Qwen3.5-9B-4bit",
"sharding": "Pipeline",
"instance_meta": "MlxRing",
"min_nodes": 1,
"excluded_nodes": []
}'
| Field | Meaning |
|---|---|
model_id | Hugging Face-style model ID |
sharding | Pipeline or Tensor |
instance_meta | MlxRing or MlxJaccl |
min_nodes | Minimum nodes required for the placement |
excluded_nodes | Optional. Node IDs the master should treat as if absent when scoring this placement. Already-running instances on those nodes are unaffected (exclusion is per-placement, not cluster-wide). Default: []. Note: node IDs are per-session, so they change when a cluster session restarts. |
The placement is validated against the current cluster state before the command is forwarded, so an impossible placement fails at the API instead of silently failing on the master:
- 400 with the specific reason: no connected cycle of min_nodes nodes, exclusions removed every candidate, the model does not support Tensor sharding, or a node cannot fit its weight shard plus runtime headroom (the error names the node and the GB arithmetic).
- 503 when cluster info is still being gossiped (a cluster that just formed): connection edges lag node identities by a few gossip rounds, and per-node memory info lags the edges. The request internally waits up to 15 seconds for the info to arrive before giving up, so retry shortly on 503.
Memory fitting is checked per node, not summed across the cycle: Tensor sharding splits weights evenly, Pipeline allocates layers proportionally to each node's available memory, and every node must hold its share times a runtime-overhead factor (KV cache, activations, runner) on top of the raw weight bytes. A model that exactly equals a node's free memory is rejected, because that placement would thrash, not run.
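The per-node arithmetic can be sketched as follows. The overhead factor of 1.2 is a made-up value for illustration (this document does not state Skulk's actual factor), the helper names are hypothetical, and real Pipeline placement allocates whole layers rather than fractional gigabytes.

```python
OVERHEAD_FACTOR = 1.2  # hypothetical value; Skulk's real factor is not given here


def shard_gb(model_gb: float, free_gb: list[float], sharding: str) -> list[float]:
    """Raw weight share per node for the given sharding mode."""
    if sharding == "Tensor":
        # Tensor sharding splits weights evenly across nodes.
        return [model_gb / len(free_gb)] * len(free_gb)
    if sharding == "Pipeline":
        # Pipeline sharding allocates in proportion to each node's free memory.
        total_free = sum(free_gb)
        return [model_gb * free / total_free for free in free_gb]
    raise ValueError(f"unknown sharding mode: {sharding}")


def first_unfit_node(model_gb, free_gb, sharding, overhead=OVERHEAD_FACTOR):
    """Index of the first node whose share times overhead exceeds its free memory, else None."""
    shares = shard_gb(model_gb, free_gb, sharding)
    for index, (share, free) in enumerate(zip(shares, free_gb)):
        if share * overhead > free:
            return index
    return None
```

With a 40 GB model on nodes with 48 GB and 16 GB free, Tensor sharding fails on the second node (20 GB x 1.2 = 24 GB > 16 GB) even though the summed memory would suffice, while Pipeline sharding fits (30 GB and 10 GB shares). A 10 GB model on a single node with exactly 10 GB free is rejected under any factor above 1.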
Preview valid placements
GET /instance/previews?model_id=...
curl "http://localhost:52415/instance/previews?model_id=mlx-community/Qwen3.5-9B-4bit"
This is usually the best first Skulk-specific endpoint to call. It shows which combinations of sharding mode, networking mode, and node count are valid, and why invalid combinations fail.
| Query parameter | Meaning |
|---|---|
model_id | Required. Hugging Face-style model ID. |
node_ids | Optional, repeatable. Restricts previews to candidate cycles that contain all of these node IDs (subset matching). |
excluded_node_ids | Optional, repeatable. Excludes the listed node IDs from candidate cycles for every previewed combination. Mirrors the excluded_nodes field on POST /place_instance so dashboards can render an accurate preview against the post-exclusion topology. |
# Preview with one node excluded:
curl "http://localhost:52415/instance/previews?model_id=mlx-community/Qwen3.5-9B-4bit&excluded_node_ids=12D3KooWAbc..."
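When building preview URLs in code, repeatable parameters are sent by repeating the key once per value. A minimal sketch with the standard library follows; `previews_url` is a hypothetical helper and the node IDs in the usage note are placeholders.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:52415"


def previews_url(model_id: str, node_ids=(), excluded_node_ids=(),
                 base_url: str = BASE_URL) -> str:
    """Build a GET /instance/previews URL, repeating keys for multi-value parameters."""
    params = [("model_id", model_id)]
    params += [("node_ids", node_id) for node_id in node_ids]
    params += [("excluded_node_ids", node_id) for node_id in excluded_node_ids]
    # safe="/" keeps the slash in Hugging Face-style IDs readable, as in the curl examples.
    return f"{base_url}/instance/previews?{urlencode(params, safe='/')}"
```

For example, `previews_url("mlx-community/Qwen3.5-9B-4bit", excluded_node_ids=["nodeA", "nodeB"])` yields a query string with two `excluded_node_ids` entries.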
Build a placement manually
GET /instance/placement
Use this when you want a specific combination and want to inspect the exact instance shape before launch.
Create an instance from a fully specified placement
POST /instance
Use this when you already have an instance object and want exact control.
Inspect one instance
GET /instance/{instance_id}
Delete an instance
DELETE /instance/{instance_id}
Download Management
Start a node download
POST /download/start
Lower-level endpoint for explicit node download control.
Delete a node download
DELETE /download/{node_id}/{model_id}
Model Store Endpoints
These endpoints are available when the model store is configured.
If it is not configured, Skulk returns 503 Store not configured.
Store health
GET /store/health
Use this to confirm whether the store is configured and reachable.
Store registry
GET /store/registry
Use this to inspect which models the shared store knows about.
The dashboard combines registry results with GET /v1/models metadata so it can
display derived tags such as vision, thinking, embedding, tensor, and
optiq in the Store list.
Store downloads
GET /store/downloads
Use this to inspect in-progress shared-store download activity.
Request a store download
POST /store/models/{model_id}/download
Use this when you want the store host to fetch and register a model.
Optional JSON body {"gguf_file": "<repo-relative path>"} pins which GGUF quant
the store fetches for a multi-quant GGUF repo (it downloads that file's shard
group plus config.json). Omit the body to use the default quant preference.
A pin naming a file not present in the repo falls back to the default.
Store download status
GET /store/models/{model_id}/download/status
Delete a model from the store
DELETE /store/models/{model_id}
Removes the model from the store host (registry + disk) and broadcasts a
cluster-wide eviction so every node also drops its locally-staged copy, freeing
disk fleet-wide instead of leaving worker copies until they age out under
staging pressure. Returns 404 if the model is not registered in the store. (To
clear staged copies without deleting the store copy, use
POST /store/purge-staging.)
Purge staging caches
POST /store/purge-staging
Use this to remove staged model artifacts from nodes without deleting the store copy itself.
Start optimization
POST /store/models/{model_id}/optimize
Use this for workflows such as model optimization or alternate artifact generation.
Models Endpoint
List models
GET /v1/models
Returns the known model catalog, including downloaded models and catalog-backed entries.
Important fields:
| Field | Type | Meaning |
|---|---|---|
id | string | Canonical model ID |
capabilities | array | Functional capabilities such as text, vision, thinking, code, or embedding |
tags | array | UI-friendly derived labels such as vision, thinking, embedding, tensor, and optiq |
supports_tensor | boolean | Whether tensor parallel launch is supported |
base_model | string | Base family or upstream source model when known |
runtime.mtp_sidecar_repo | string | Repo of this model's MTP sidecar (prediction heads), when it declares one |
runtime.assistant_model_repo | string | Repo of this model's speculative-decoding assistant (drafter), when it declares one |
runtime.served_spec_draft_repo | string | Repo of this model's separate served-engine draft GGUF, when it declares one |
The dashboard uses tags for compact badges and capabilities for filtering
and richer tooltips. The three runtime.*_repo fields name a model's
speculative-decoding companions (a draft model or an MTP-head sidecar). Those
companion repos are downloaded and loaded automatically with their parent and
are not independently placeable, so the dashboard marks any store entry matching
one of these repos as a companion (a "Drafter" or "Sidecar" badge) rather than
offering it launch, placement, or optimize actions.
Configuration Endpoints
Get config
GET /config
Returns the current cluster config and config path. Sensitive values (hf_token) are stripped from the response.
Update config
PUT /config
Updates cluster-wide config. Important behavior:
- if you omit hf_token, Skulk preserves the existing value
- if you omit logging, Skulk preserves the existing logging config
- hf_token is not broadcast over gossipsub; it stays on the local node's skulk.yaml
- logging changes (enable/disable) take effect immediately on all nodes
- inference changes affect future launches
- model-store location changes generally require restart
Filesystem browse
GET /filesystem/browse
Used by the dashboard to browse a safe subset of the filesystem when selecting config paths.
Node identity
GET /node/identity
Returns hostname, preferred IP, and node identity information used by the dashboard.
Restart a node
POST /admin/restart?node_id=<optional node id>
Gracefully restarts the Skulk process on this node or a remote node. When node_id is omitted or matches the local node, Skulk replaces the current process image in place via os.execv (same PID). When node_id targets a remote node, Skulk sends a RestartNode command via pub/sub.
- GPU/Metal memory is released when the process image is replaced
- the node rejoins the cluster automatically on startup
- active inference is interrupted
Returns {"status": "restarting", "node_id": "..."} for local restarts, or {"status": "restart_sent", "node_id": "..."} for remote restarts.
If a local restart is already scheduled, returns HTTP 409 with {"status": "restart_already_pending"}.
State, Events, and Tracing
Cluster state
GET /state
Returns the cluster state as Skulk currently sees it.
The response also carries a derived nodeHealth map (keyed by node id) so a
problem on a node is visible rather than silent. Each entry is a level
(ok, warn, or error) plus a list of reasons, where each reason has a
code, a message describing what is wrong, and a remediation describing how
to fix it. It is computed read-only from state already in the response (terminal
download failures, low or full models-volume disk, and late heartbeats), so it
adds no new polling. A node with no problems reports level: "ok" with an empty
reasons list.
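A script can turn the nodeHealth map into a short report. The sketch below relies only on the fields described above (`level`, `reasons`, and each reason's `code`, `message`, and `remediation`); `unhealthy_nodes` is a hypothetical helper.

```python
def unhealthy_nodes(state: dict) -> list[str]:
    """Return one report line per reason for every node whose level is not "ok"."""
    report = []
    for node_id, health in sorted((state.get("nodeHealth") or {}).items()):
        level = health.get("level")
        if level == "ok":
            continue
        for reason in health.get("reasons", []):
            code = reason.get("code")
            message = reason.get("message")
            remediation = reason.get("remediation")
            report.append(f"{node_id} [{level}] {code}: {message} -> {remediation}")
    return report
```

Feed it the parsed JSON body of GET /state; an empty list means every node reports level "ok".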
Operational note:
- a follower may briefly report a local view that is behind the elected master while it is catching up
- on newer builds, catch-up can start from a snapshot plus retained replay tail instead of always rebuilding from event 0
- if your cluster is mixed-version during rollout, upgrade all nodes before you rely on bounded replay retention on the master; an older restarted node may not be able to fully resync after old history has been compacted away
Event log
GET /events
Returns stored events from the API-side event log.
Diagnostics
- GET /v1/diagnostics/node
- POST /v1/diagnostics/node/capture
- POST /v1/diagnostics/node/runners/{runner_id}/cancel
- GET /v1/diagnostics/cluster
- GET /v1/diagnostics/cluster/timeline
- GET /v1/diagnostics/cluster/{node_id}
- POST /v1/diagnostics/cluster/{node_id}/capture
- POST /v1/diagnostics/cluster/{node_id}/runners/{runner_id}/cancel
Use these endpoints when a node appears stuck loading, warming up, decoding, or shutting down and you need a read-only snapshot without SSHing into every node.
Behavior notes:
- GET /v1/diagnostics/node returns the local node's runtime/config facts, resources, process tree, live runner-supervisor state, flight-recorder phase state, and placement analysis.
- POST /v1/diagnostics/node/capture collects an on-demand local diagnostic bundle. Body fields are runnerId, taskId, includeProcessSamples, and sampleDurationSeconds; all are optional. When a runner/task is provided, the response includes that runner's bounded flight recorder, latest MLX memory snapshot, and best-effort macOS sample, vmmap -summary, and footprint -p output. Sampling failures are returned as structured partial failures instead of failing the bundle.
- POST /v1/diagnostics/node/runners/{runner_id}/cancel requests cooperative cancellation for one task that the local runner supervisor still knows about.
- GET /v1/diagnostics/cluster fans out to reachable peer APIs and returns partial results when some peers are unavailable.
- GET /v1/diagnostics/cluster/timeline stitches every reachable node's runner-supervisor diagnostics into one cross-rank chronological view. The response carries a per-runner synopsis sorted by (modelId, deviceRank) and every flight-recorder entry across all ranks merged and sorted by the at field. Use this when debugging a distributed deadlock: the rank-disagreement signature ("rank 0 entered pipeline_last_eval_output at T while rank 1 is still in pipeline_first_recv_like") is invisible from any single node's local diagnostics but obvious top-to-bottom in the merged timeline. Unreachable peers are returned in unreachableNodes instead of failing the request.
- GET /v1/diagnostics/cluster/{node_id} proxies one reachable peer bundle or returns the local bundle if node_id is the current API node.
- POST /v1/diagnostics/cluster/{node_id}/capture proxies the same on-demand capture request to a reachable peer node.
- POST /v1/diagnostics/cluster/{node_id}/runners/{runner_id}/cancel proxies the same cooperative live-runner cancellation request to a reachable peer.
- Placement diagnostics explicitly include whether the current master is part of each model placement, which helps investigate hangs where the master is not one of the inference ranks.
- The dashboard node-card bug icon uses these endpoints to open a live diagnostics drawer for any reachable node.
- The diagnostics drawer prefers Capture bundle before cancellation so operators can collect phase, MLX memory, and process samples before changing the runner state.
- Runner cancellation is best-effort only. A wedged native/MLX runner may ignore the request and still require stronger intervention.
- Diagnostics endpoints do not currently kill or restart runners. Capture is read-only; the only mutating diagnostics action is the cooperative task-cancel request above.
Example:
curl http://localhost:52415/v1/diagnostics/node
curl http://localhost:52415/v1/diagnostics/cluster
curl http://localhost:52415/v1/diagnostics/cluster/timeline
curl http://localhost:52415/v1/diagnostics/cluster/<node_id>
curl -X POST http://localhost:52415/v1/diagnostics/node/capture \
-H 'content-type: application/json' \
-d '{"runnerId":"<runner_id>","taskId":"<task_id>"}'
curl -X POST http://localhost:52415/v1/diagnostics/cluster/<node_id>/capture \
-H 'content-type: application/json' \
-d '{"runnerId":"<runner_id>","includeProcessSamples":true}'
curl -X POST http://localhost:52415/v1/diagnostics/node/runners/<runner_id>/cancel \
-H 'content-type: application/json' \
-d '{"taskId":"<task_id>"}'
curl -X POST http://localhost:52415/v1/diagnostics/cluster/<node_id>/runners/<runner_id>/cancel \
-H 'content-type: application/json' \
-d '{"taskId":"<task_id>"}'
Traces
- GET /v1/tracing
- PUT /v1/tracing
- GET /v1/traces
- GET /v1/traces/cluster
- POST /v1/traces/delete
- GET /v1/traces/{task_id}
- GET /v1/traces/{task_id}/stats
- GET /v1/traces/{task_id}/raw
- GET /v1/traces/cluster/{task_id}
- GET /v1/traces/cluster/{task_id}/stats
- GET /v1/traces/cluster/{task_id}/raw
Use these endpoints when you are debugging generation behavior, cluster execution, or performance.
Behavior notes:
- GET /v1/tracing returns whether runtime tracing is currently enabled for new requests across the live cluster session.
- PUT /v1/tracing toggles tracing cluster-wide for new requests only. It does not retroactively trace in-flight work.
- GET /v1/traces* reads local trace artifacts stored on the current node.
- GET /v1/traces/cluster* fans out to reachable peer APIs, deduplicates by task_id, and proxies read-only trace access from any reachable node.
- POST /v1/traces/delete remains local-only in v1 even when cluster browsing is enabled.
Runtime tracing control
GET /v1/tracing
Returns the current cluster tracing state:
{"enabled": false}
PUT /v1/tracing
Enable or disable tracing for new requests across the current cluster session.
Request body:
{"enabled": true}
Response body:
{"enabled": true}
Operational notes:
- this is a runtime toggle, not a restart-required config edit
- it applies to new requests only
- it does not retroactively trace work already in flight
- the dashboard traces page uses this same API
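Because the toggle applies only to new requests, a script can enable tracing around the requests it wants traced and then restore the previous state. The sketch below uses the request and response bodies shown above; `get_tracing`, `set_tracing`, and `tracing_on` are hypothetical helper names.

```python
import json
import urllib.request
from contextlib import contextmanager

TRACING_URL = "http://localhost:52415/v1/tracing"


def get_tracing() -> bool:
    """Read the current cluster tracing state."""
    with urllib.request.urlopen(TRACING_URL) as resp:
        return bool(json.load(resp)["enabled"])


def set_tracing(enabled: bool) -> bool:
    """PUT the desired state and return the state the server reports back."""
    req = urllib.request.Request(
        TRACING_URL,
        data=json.dumps({"enabled": enabled}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return bool(json.load(resp)["enabled"])


@contextmanager
def tracing_on(get_state=get_tracing, set_state=set_tracing):
    """Enable tracing for requests started inside the block, then restore the prior state."""
    previous = get_state()
    set_state(True)
    try:
        yield
    finally:
        set_state(previous)
```

Requests issued inside `with tracing_on(): ...` are traced; work already in flight when the block starts is not.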
Local trace endpoints
These endpoints operate on trace artifacts stored on the current node:
- GET /v1/traces lists local trace artifacts with metadata such as task kind, model, source nodes, and tool-activity tags
- GET /v1/traces/{task_id} returns structured trace events for one task
- GET /v1/traces/{task_id}/stats returns aggregated timing summaries
- GET /v1/traces/{task_id}/raw downloads Chrome-trace-compatible JSON
- POST /v1/traces/delete deletes one or more local trace artifacts
Example:
curl http://localhost:52415/v1/traces
curl http://localhost:52415/v1/traces/<task_id>/stats
curl -OJ http://localhost:52415/v1/traces/<task_id>/raw
Cluster trace endpoints
These endpoints let a dashboard or script on any reachable node browse traces across the cluster:
- GET /v1/traces/cluster
- GET /v1/traces/cluster/{task_id}
- GET /v1/traces/cluster/{task_id}/stats
- GET /v1/traces/cluster/{task_id}/raw
Operational notes:
- cluster browsing is read-only in v1
- the API fans out to reachable peer APIs and deduplicates traces by task_id
- if some peers are unreachable, cluster results may be partial
- source node metadata in responses tells you which nodes contributed trace content
Example:
curl http://localhost:52415/v1/traces/cluster
curl http://localhost:52415/v1/traces/cluster/<task_id>/stats
curl -OJ http://localhost:52415/v1/traces/cluster/<task_id>/raw
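To make the fan-out and deduplication behavior concrete, here is an illustrative sketch of merging per-node trace listings by task_id. It is not Skulk's implementation: the function name, the "source_nodes" key, and the use of None to mark an unreachable peer are all assumptions chosen for the example.

```python
from typing import Iterable, Optional


def merge_traces_by_task_id(
    per_node: dict[str, Optional[Iterable[dict]]],
) -> tuple[list[dict], bool]:
    """Merge per-node trace listings, keeping one entry per task_id.

    per_node maps a node ID to that node's listing, or to None when the
    node could not be reached (an assumed convention for this sketch).
    Returns (merged_entries, partial); partial is True if any node was skipped.
    """
    merged: dict[str, dict] = {}
    partial = False
    for node_id, traces in per_node.items():
        if traces is None:
            # Unreachable peer: the merged view may be missing traces.
            partial = True
            continue
        for trace in traces:
            task_id = trace["task_id"]
            entry = merged.setdefault(task_id, {"task_id": task_id, "source_nodes": []})
            # Record every node that contributed content for this task.
            if node_id not in entry["source_nodes"]:
                entry["source_nodes"].append(node_id)
    return list(merged.values()), partial
```

A task that ran across two nodes appears once in the merged list, with both node IDs recorded as sources, which matches the role the notes above give to source node metadata.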
Connectivity Endpoints
Tailscale status
GET /v1/connectivity/tailscale
GET /v1/connectivity/tailscale?node_id=<id>
Returns whether tailscaled is running on a node and, if so, the node's Tailscale IP, hostname, DNS name, and tailnet. All fields except running are null when tailscaled is not installed or not running.
Pass node_id to proxy the request to a specific cluster node. Omit it to query the local node directly. Returns 404 if the target node is not reachable.
Response fields:
| Field | Type | Description |
|---|---|---|
| running | boolean | true when tailscaled reports BackendState == "Running" |
| selfIp | string \| null | Node's Tailscale IPv4 address (100.x.x.x range) |
| hostname | string \| null | Node hostname as registered in the tailnet |
| dnsName | string \| null | Fully-qualified Tailscale MagicDNS name, e.g. my-node.tailnet-abc.ts.net |
| tailnet | string \| null | Tailnet name derived from dnsName |
| version | string \| null | Tailscale client version string |
# Local node
curl http://localhost:52415/v1/connectivity/tailscale
# Specific cluster node
curl "http://localhost:52415/v1/connectivity/tailscale?node_id=<node-id>"
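The same two calls can be made from Python. This is a minimal standard-library sketch; tailscale_status_url and fetch_tailscale_status are hypothetical helper names, and mapping a 404 to None is just one way a client might represent "target node not reachable".

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:52415"  # default address used throughout this document


def tailscale_status_url(node_id: Optional[str] = None, base_url: str = BASE_URL) -> str:
    """Build the status URL, adding node_id only when a peer is targeted."""
    url = f"{base_url}/v1/connectivity/tailscale"
    if node_id is not None:
        # urlencode escapes characters that are unsafe in a query string.
        url += "?" + urllib.parse.urlencode({"node_id": node_id})
    return url


def fetch_tailscale_status(
    node_id: Optional[str] = None, base_url: str = BASE_URL, timeout: float = 5.0
) -> Optional[dict]:
    """Return the status object, or None when the API answers 404."""
    try:
        with urllib.request.urlopen(tailscale_status_url(node_id, base_url), timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
```

When reading the result, check running first: every other field may be null when tailscaled is not installed or not running.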
Remote access info
GET /v1/connectivity/remote-access
Returns aggregated remote access information for the local node: LAN address, Tailscale address, and a preferredUrl (Tailscale if running, otherwise LAN). When Tailscale is running, preferredUrl uses the node's MagicDNS name (my-node.tailnet-abc.ts.net) if available, falling back to the raw 100.x.x.x IP. operatorUrl appends /operator to preferredUrl (suitable for QR code generation so mobile users land directly on the operator panel).
Response fields:
| Field | Type | Description |
|---|---|---|
| local.ip | string \| null | Preferred LAN IPv4 address |
| local.port | integer | API/dashboard port |
| local.url | string \| null | http://{ip}:{port} |
| tailscale.running | boolean | true when tailscaled is connected |
| tailscale.ip | string \| null | Tailscale IPv4 address (100.x.x.x) |
| tailscale.dnsName | string \| null | MagicDNS fully-qualified name, e.g. my-node.tailnet-abc.ts.net |
| tailscale.port | integer | API/dashboard port |
| tailscale.url | string \| null | http://{dnsName or ip}:{port} if running |
| preferredUrl | string \| null | MagicDNS URL if available, else Tailscale IP URL, else LAN URL |
| operatorUrl | string \| null | preferredUrl + /operator |
curl http://localhost:52415/v1/connectivity/remote-access | python3 -m json.tool
Example response when Tailscale is running with MagicDNS:
{
"local": { "ip": "192.168.1.5", "port": 52415, "url": "http://192.168.1.5:52415" },
"tailscale": {
"running": true,
"ip": "100.101.102.103",
"dnsName": "my-node.tailnet-abc.ts.net",
"port": 52415,
"url": "http://my-node.tailnet-abc.ts.net:52415"
},
"preferredUrl": "http://my-node.tailnet-abc.ts.net:52415",
"operatorUrl": "http://my-node.tailnet-abc.ts.net:52415/operator"
}
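The preference order described above (MagicDNS name, then Tailscale IP, then LAN) can be written out as a small client-side function. This is only a sketch of the documented rule, not the server's code; pick_preferred_url and operator_url are hypothetical names, and a client would normally just read preferredUrl and operatorUrl from the response.

```python
from typing import Optional


def pick_preferred_url(info: dict) -> Optional[str]:
    """Mirror the documented order: MagicDNS, else Tailscale IP, else LAN URL."""
    tailscale = info.get("tailscale") or {}
    local = info.get("local") or {}
    if tailscale.get("running"):
        # Prefer the MagicDNS name; fall back to the raw 100.x.x.x address.
        host = tailscale.get("dnsName") or tailscale.get("ip")
        if host:
            return f"http://{host}:{tailscale['port']}"
    # Tailscale is not usable, so fall back to the LAN URL (may be None).
    return local.get("url")


def operator_url(preferred: Optional[str]) -> Optional[str]:
    """Append /operator, as described for operatorUrl."""
    return f"{preferred}/operator" if preferred else None
```

Applied to the example response above, pick_preferred_url yields the MagicDNS URL, which is the value shown in preferredUrl.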
Operator App Integration
The operator panel at /operator is designed for mobile access and can also be driven by a native app. The relevant API endpoints are:
Node and cluster state
| Endpoint | Description |
|---|---|
| GET /v1/state | Full cluster state: nodes, instances, runners, memory, GPU |
| GET /node_id | Local node's ID |
| GET /node/identity | Node ID, hostname, and preferred LAN IP |
Remote access and connectivity
| Endpoint | Description |
|---|---|
| GET /v1/connectivity/remote-access | LAN + Tailscale addresses, preferred URL, operator URL for QR |
| GET /v1/connectivity/tailscale | Tailscale status for local node |
| GET /v1/connectivity/tailscale?node_id=<id> | Tailscale status for a specific peer node |
Node management
| Endpoint | Description |
|---|---|
| POST /v1/nodes/{node_id}/restart | Send a restart command to any node in the cluster |
Typical operator app workflow
- Call GET /v1/connectivity/remote-access on the initially discovered node to get the preferredUrl, then use that as the base URL for subsequent calls.
- Poll GET /v1/state every 5 seconds for node health (memory, GPU, temperature).
- Show per-node cards with restart buttons that call POST /v1/nodes/{node_id}/restart.
- On first launch or settings screen, show the operatorUrl as a QR code so users can hand it off to another device.
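The first two steps of this workflow might look like the following in a script or app backend. This is a sketch under stated assumptions: poll_state and fetch_json are hypothetical names, falling back to the starting URL when preferredUrl is null is a choice made for the example, and the shape of the /v1/state payload is left opaque because it is not specified here.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator, Optional


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def poll_state(
    start_url: str,
    fetch: Callable[[str], dict] = fetch_json,
    sleep: Callable[[float], None] = time.sleep,
    interval: float = 5.0,
    max_polls: Optional[int] = None,
) -> Iterator[dict]:
    """Resolve the preferred base URL once, then yield /v1/state snapshots."""
    access = fetch(f"{start_url}/v1/connectivity/remote-access")
    # Assumption: keep using the starting URL if no preferredUrl is reported.
    base = access.get("preferredUrl") or start_url
    polls = 0
    while max_polls is None or polls < max_polls:
        yield fetch(f"{base}/v1/state")
        polls += 1
        # Wait between polls, but not after the final one.
        if max_polls is None or polls < max_polls:
            sleep(interval)
```

Passing fetch and sleep as parameters keeps the loop easy to exercise without a running cluster; a real app would consume the generator, for example with for state in poll_state("http://localhost:52415"), and update its per-node cards from each snapshot.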