Skulk API

Skulk serves an API at http://localhost:52415.

That API has two jobs:

compatibility endpoints for tools that already speak OpenAI, Claude, or Ollama-style APIs
Skulk-specific control endpoints for placement, downloads, config, tracing, and model-store workflows

A model must be placed and running before chat requests for it succeed; calling /v1/chat/completions for an unplaced model returns a 404 No instance found. The First Success Flow below walks from placement to first token.

First working request: First Success Flow
OpenAI-compatible chat: OpenAI Chat Completions
OpenAI Responses format: OpenAI Responses API
Claude format: Claude Messages API
Ollama compatibility: Ollama API
Placement and launch: Placement and Instance Management
Store and config: Model Store Endpoints and Configuration Endpoints
Debugging: State, Events, and Tracing

First Success Flow

1. Start Skulk

uv run skulk

2. Preview placements

curl "http://localhost:52415/instance/previews?model_id=mlx-community/Llama-3.2-1B-Instruct-4bit"

This shows what Skulk can actually place on the current node or cluster.

3. Launch a placement

curl -X POST http://localhost:52415/place_instance \
  -H 'Content-Type: application/json' \
  -d '{
    "model_id": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "sharding": "Pipeline",
    "instance_meta": "MlxRing",
    "min_nodes": 1
  }'

4. Send a chat request

curl -X POST http://localhost:52415/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello from Skulk"}]
  }'

If this fails with 404 No instance found for model ..., the placement is not ready yet or never launched successfully.
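Because a placement can take a moment to become ready, a client may want to retry step 4 while the API still answers 404. The following is a minimal sketch of that idea, not part of Skulk itself: `send_chat` is a hypothetical callable that performs the POST and returns `(status_code, body)`, and the attempt count and delay are illustrative values rather than Skulk defaults.

```python
import time
from typing import Callable, Tuple


def chat_when_ready(
    send_chat: Callable[[], Tuple[int, dict]],
    attempts: int = 30,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry a chat request while the API answers 404 (no instance yet)."""
    for attempt in range(attempts):
        status, body = send_chat()
        if status == 200:
            return body
        if status != 404:
            # Anything other than "not placed yet" is a real error; do not retry.
            raise RuntimeError(f"chat request failed with HTTP {status}: {body}")
        if attempt < attempts - 1:
            sleep(delay_s)
    raise TimeoutError("no instance became ready; check /instance/previews and /state")
```

If the loop times out, the placement most likely never launched, so inspecting GET /instance/previews and GET /state is a better next step than raising the retry count.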

Endpoint Overview

Compatibility APIs

POST /v1/chat/completions
POST /v1/responses
POST /v1/messages
POST /ollama/api/chat
POST /ollama/api/generate
GET /ollama/api/tags
POST /ollama/api/show
GET /ollama/api/ps
GET /ollama/api/version

Skulk Control APIs

GET /v1/models
POST /v1/tools/web_search
POST /v1/tools/open_url
POST /v1/tools/extract_page
GET /models/search
POST /models/add
DELETE /models/custom/{model_id}
POST /place_instance
POST /instance
GET /instance/placement
GET /instance/previews
GET /instance/{instance_id}
DELETE /instance/{instance_id}
GET /state
GET /events
POST /download/start
DELETE /download/{node_id}/{model_id}
GET /config
PUT /config
GET /store/health
GET /store/registry
GET /store/downloads
POST /store/models/{model_id}/download
GET /store/models/{model_id}/download/status
DELETE /store/models/{model_id}
POST /store/purge-staging
GET /store/storage
POST /store/models/{model_id}/optimize
GET /store/models/{model_id}/optimize/status
GET /filesystem/browse
GET /node/identity
GET /v1/tracing
PUT /v1/tracing
GET /v1/traces
GET /v1/traces/cluster
POST /v1/traces/delete
GET /v1/traces/{task_id}
GET /v1/traces/{task_id}/stats
GET /v1/traces/{task_id}/raw
GET /v1/traces/cluster/{task_id}
GET /v1/traces/cluster/{task_id}/stats
GET /v1/traces/cluster/{task_id}/raw
GET /v1/diagnostics/node
POST /v1/diagnostics/node/capture
POST /v1/diagnostics/node/runners/{runner_id}/cancel
GET /v1/diagnostics/cluster
GET /v1/diagnostics/cluster/timeline
GET /v1/diagnostics/cluster/{node_id}
POST /v1/diagnostics/cluster/{node_id}/capture
POST /v1/diagnostics/cluster/{node_id}/runners/{runner_id}/cancel

For the full interactive reference with request/response schemas, see the API Reference.

OpenAI Chat Completions

POST /v1/chat/completions

This is the main chat-generation endpoint for both text-only and multimodal models.

Requests are validated before dispatch: an empty messages array or a non-positive max_tokens returns 400 Bad Request rather than being accepted and failing during generation. (This applies across the Claude, Ollama, and Responses wire formats too, which share the same dispatch path.)

Context-length limits

Every placed instance has a usable context limit: the smaller of the model's advertised context length and the number of KV-cache tokens that fit in memory next to the model weights on the hosting node(s). Requests are admitted against that limit instead of growing the KV cache until the node runs out of memory:

A max_tokens value that cannot fit in the limit at all returns 400 Bad Request immediately (context_length_exceeded: ...).
After tokenization on the serving instance, a prompt that fills the window, or a prompt plus an explicit max_tokens that exceeds the limit, is rejected with an OpenAI-style invalid_request_error whose message starts with context_length_exceeded:. For streaming requests this arrives as the first SSE data: event; for non-streaming requests the response body is the error envelope (the HTTP status is already committed when the rejection is computed on the serving node).
When max_tokens is omitted, the server default output budget is clamped to the remaining window, so generation ends with finish_reason: "length" instead of overrunning the context.
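The admission rules above can be sketched as a small piece of arithmetic. This is an illustration under stated assumptions, not Skulk's implementation: the function names are hypothetical, the exact boundary comparisons (`>` versus `>=`) are guesses, and 4096 is the default output budget listed under Common Request Fields.

```python
from typing import Optional


def usable_context_limit(advertised_tokens: int, kv_tokens_that_fit: int) -> int:
    """The usable limit is the smaller of the two bounds."""
    return min(advertised_tokens, kv_tokens_that_fit)


def output_budget(
    prompt_tokens: int,
    max_tokens: Optional[int],
    limit: int,
    default_max_output: int = 4096,
) -> int:
    """Return how many tokens may be generated, or raise if not admissible."""
    if max_tokens is not None and max_tokens > limit:
        # Can be rejected before tokenization: no prompt could make this fit.
        raise ValueError("context_length_exceeded: max_tokens cannot fit in the limit")
    if prompt_tokens >= limit:
        raise ValueError("context_length_exceeded: prompt fills the window")
    remaining = limit - prompt_tokens
    if max_tokens is None:
        # Omitted max_tokens: clamp the server default to the remaining window.
        return min(default_max_output, remaining)
    if prompt_tokens + max_tokens > limit:
        raise ValueError("context_length_exceeded: prompt plus max_tokens exceeds the limit")
    return max_tokens
```

For example, with an advertised context of 131072 tokens but room for only 8192 KV-cache tokens, an 8000-token prompt without max_tokens would get a 192-token output budget and finish with finish_reason "length".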

OpenAI Python SDK Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:52415/v1",
    api_key="unused",
)

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Curl Example

curl -X POST http://localhost:52415/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Streaming Example

stream = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Common Request Fields

Field	Type	Notes
`model`	string	Required. Must match a placed and running model.
`messages`	array	Required. Supports `system`, `user`, `assistant`, `developer`, `tool`, `function`.
`stream`	boolean	Use `true` for SSE streaming.
`temperature`	number	Sampling temperature.
`top_p`	number	Nucleus sampling.
`top_k`	integer	Top-k sampling.
`min_p`	number	Minimum-probability threshold.
`max_tokens`	integer	Max generated tokens. When omitted, Skulk uses a backend default of 4096 generated tokens (`DEFAULT_MAX_OUTPUT_TOKENS`); operators can override that default with `SKULK_MAX_OUTPUT_TOKENS` (or the legacy `SKULK_MAX_TOKENS`).
`stop`	string or array	Stop sequences.
`seed`	integer	Reproducibility helper.
`frequency_penalty`	number	Frequency penalty.
`presence_penalty`	number	Presence penalty.
`repetition_penalty`	number	Repetition penalty.
`repetition_context_size`	integer	Context window for repetition handling.
`logprobs`	boolean	Return token logprobs when supported.
`top_logprobs`	integer	Number of top logprobs to include.
`tools`	array	OpenAI-style tool definitions.
`tool_choice`	string or object	`auto`, `none`, or a specific tool selection.
`parallel_tool_calls`	boolean	Accepted for compatibility.
`enable_thinking`	boolean	Skulk extension for reasoning-capable models.
`reasoning_effort`	string	Reasoning hint when supported.
`response_format`	object	Accepted for compatibility, not strictly enforced.
`stream_options`	object	Includes `include_usage`.
`user`	string	Optional caller identifier.

Message Format

{
  "role": "user",
  "content": "hello"
}

Assistant messages may include tool_calls. Tool response messages should include tool_call_id.

User messages may also be sent as structured content parts. Skulk accepts OpenAI-style image inputs for vision-capable models:

{
  "role": "user",
  "content": [
    { "type": "text", "text": "What is in this image?" },
    {
      "type": "image_url",
      "image_url": { "url": "data:image/png;base64,..." }
    }
  ]
}

Notes:

inline data: URLs are supported for image inputs
Anthropic-compatible requests can also carry image content for multimodal models
image understanding depends on the selected model exposing the vision capability

Finish Reasons

Value	Meaning
`stop`	Natural stop or stop sequence reached
`length`	`max_tokens` limit reached, or the output budget was clamped to the remaining context window
`tool_calls`	Model is requesting a tool call
`content_filter`	Reserved for compatibility
`function_call`	Reserved for compatibility
`error`	Generation failed

Tool Use

Skulk supports OpenAI-style function calling.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-4bit",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

Typical flow:

Send messages and tool definitions.
Inspect finish_reason.
If it is tool_calls, execute the tool in your app.
Send the tool result back as a tool message.
Request the final model response.
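The typical flow above can be sketched as a loop over plain OpenAI-style dictionaries. This is a hedged illustration: `create` is a hypothetical callable that sends the message list (plus your tool definitions) to /v1/chat/completions and returns the parsed JSON response, and `get_weather` is a stand-in tool rather than a real service.

```python
import json
from typing import Callable, Dict, List


def get_weather(location: str) -> dict:
    """Stand-in tool; a real app would call a weather service here."""
    return {"location": location, "forecast": "unknown"}


# Maps tool names from the tool definitions to local Python callables.
TOOLS: Dict[str, Callable[..., dict]] = {"get_weather": get_weather}


def run_tool_loop(
    create: Callable[[List[dict]], dict],
    messages: List[dict],
    max_rounds: int = 4,
) -> str:
    """Call the model, execute requested tools, and return the final text."""
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_rounds):
        choice = create(messages)["choices"][0]
        message = choice["message"]
        if choice["finish_reason"] != "tool_calls":
            return message.get("content") or ""
        # Keep the assistant turn that carries tool_calls in the history.
        messages.append(message)
        for call in message["tool_calls"]:
            args = json.loads(call["function"]["arguments"])
            result = TOOLS[call["function"]["name"]](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    raise RuntimeError("model kept requesting tools; giving up")
```

The `max_rounds` cap is a design choice of this sketch: it keeps a model that repeatedly asks for tools from looping forever.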

Thinking / Reasoning

Skulk supports reasoning-aware chat for compatible models.

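response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-4bit",
    messages=[{"role": "user", "content": "What is 127 * 43?"}],
    # enable_thinking is not a standard OpenAI parameter, so the SDK
    # only forwards it when it is passed through extra_body.
    extra_body={"enable_thinking": True},
)

msg = response.choices[0].message
# reasoning_content is a non-standard field and may be absent.
print(getattr(msg, "reasoning_content", None))
print(msg.content)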

Notes:

enable_thinking is a Skulk extension.
Reasoning support depends on model capabilities.
Use the data[].resolved_capabilities field of the GET /v1/models response to decide whether a model supports thinking and whether clients should render a thinking toggle.
Treat resolved_capabilities as the default tool-free request path; request-specific options such as tools can change prompt rendering and related resolved values for mixed-mode model families.
Thinking-control semantics are model-aware:
- if supports_thinking_toggle is true, send enable_thinking=true or false explicitly
- reasoning_effort="none" disables thinking for toggleable models
- if a model does not support toggleable thinking, Skulk ignores explicit toggle overrides but still preserves explicit non-disabled reasoning-effort hints when the model family supports them
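A client could turn these rules into a small helper that decides which thinking-related fields to send. This is a sketch under assumptions: it treats resolved_capabilities as a dictionary with a boolean supports_thinking_toggle key, and the helper name and its policy for conflicting inputs are hypothetical.

```python
from typing import Optional


def thinking_fields(
    resolved_capabilities: dict,
    want_thinking: bool,
    reasoning_effort: Optional[str] = None,
) -> dict:
    """Build the extra request fields for thinking control."""
    fields: dict = {}
    if resolved_capabilities.get("supports_thinking_toggle"):
        # Toggleable models: always state the choice explicitly.
        fields["enable_thinking"] = want_thinking
        if not want_thinking:
            # Thinking is off, so an effort hint would be contradictory.
            return fields
    # Non-toggleable models ignore the toggle, so only a non-disabled
    # effort hint is worth sending.
    if reasoning_effort not in (None, "none"):
        fields["reasoning_effort"] = reasoning_effort
    return fields
```

With the OpenAI Python SDK, the result can be passed as `extra_body=thinking_fields(caps, True)`, where `caps` is the resolved_capabilities value for the chosen model.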

Builtin Browser Tools

POST /v1/tools/web_search

Execute Skulk's generic web_search tool and return structured search results.

curl -X POST http://localhost:52415/v1/tools/web_search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "foxlight skulk distributed inference",
    "top_k": 5
  }'

Request fields:

Field	Type	Notes
`query`	string	Required search query.
`top_k`	integer	Optional max results, `1` to `10`, default `5`.

Response fields:

Field	Type	Notes
`query`	string	Original search query.
`provider`	string	Search backend identifier.
`results`	array	Ordered search results with `title`, `url`, and `snippet`.

This endpoint is designed for client-executed tool loops: GPT-OSS can request web_search, the client calls this endpoint, and the client then sends the JSON result back as a tool message.

POST /v1/tools/open_url

Fetch one HTTP or HTTPS URL, follow redirects, and return structured metadata.

curl -X POST http://localhost:52415/v1/tools/open_url \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/article"
  }'

Request fields:

Field	Type	Notes
`url`	string	Required absolute `http://` or `https://` URL.

Response fields:

Field	Type	Notes
`url`	string	Original requested URL.
`final_url`	string	Final URL after redirects.
`title`	string or null	Best-effort page title.
`status_code`	integer	Final HTTP status code.
`content_type`	string or null	Normalized response content type.
`provider`	string	Backend provider identifier.

POST /v1/tools/extract_page

Fetch one HTTP or HTTPS URL and return bounded readable text extracted from the response body.

curl -X POST http://localhost:52415/v1/tools/extract_page \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/article",
    "max_chars": 12000
  }'

Request fields:

Field	Type	Notes
`url`	string	Required absolute `http://` or `https://` URL.
`max_chars`	integer	Optional maximum characters, `500` to `50000`, default `12000`.

Response fields:

Field	Type	Notes
`url`	string	Original requested URL.
`final_url`	string	Final URL after redirects.
`title`	string or null	Best-effort page title.
`text`	string	Readable extracted text.
`truncated`	boolean	Whether the text was clipped to `max_chars`.
`provider`	string	Backend provider identifier.

These browser-tool endpoints are designed for client-executed tool loops. In dashboard chat, GPT-OSS can request web_search, open_url, or extract_page; the dashboard executes the endpoint call, then sends the JSON result back as a tool message.
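The client side of such a loop can be sketched as a dispatcher that maps a tool call to one of the three endpoints. This is an illustration, not the dashboard's code: `post_json` is a hypothetical callable that POSTs a JSON payload to a path on the Skulk API and returns the parsed response, and the sketch assumes the model's tool arguments already match the request fields documented above.

```python
import json
from typing import Callable

# Tool name requested by the model -> Skulk endpoint that executes it.
TOOL_PATHS = {
    "web_search": "/v1/tools/web_search",
    "open_url": "/v1/tools/open_url",
    "extract_page": "/v1/tools/extract_page",
}


def execute_browser_tool(
    tool_call: dict,
    post_json: Callable[[str, dict], dict],
) -> dict:
    """Run one OpenAI-style tool call and wrap the result as a tool message."""
    name = tool_call["function"]["name"]
    path = TOOL_PATHS[name]  # raises KeyError for tools this sketch does not know
    args = json.loads(tool_call["function"]["arguments"])
    result = post_json(path, args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }
```

The returned dictionary can be appended to the message list before requesting the next model turn.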

Structured Output

response_format is accepted for compatibility, but Skulk does not currently enforce strict JSON mode or JSON schema validation.

response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-4bit",
    messages=[{"role": "user", "content": "Return valid JSON with three colors"}],
    response_format={"type": "json_object"},
)

For the best results, explicitly instruct the model to return valid JSON.
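Since the server does not validate the output, clients that need JSON may want to validate it themselves. The sketch below is one possible approach, not a Skulk feature: the helper name is hypothetical, and the fallbacks (a fenced block, then the outermost braces) are heuristics for common model habits.

```python
import json
import re

# Built with string repetition so this example contains no literal fence.
_FENCE = "`" * 3
_FENCED_BLOCK = re.compile(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL)


def parse_json_reply(text: str):
    """Parse a model reply as JSON, tolerating fences and surrounding prose."""
    candidates = [text.strip()]
    candidates += [block.strip() for block in _FENCED_BLOCK.findall(text)]
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    raise ValueError("reply did not contain valid JSON")
```

A caller would pass `response.choices[0].message.content` and, on ValueError, retry the request or report the failure.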

OpenAI Responses API

POST /v1/responses

Use this when your client expects the OpenAI Responses format instead of Chat Completions.

curl -X POST http://localhost:52415/v1/responses \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "input": "Hello from the Responses API"
  }'

Claude Messages API

POST /v1/messages

Use this when your client expects Anthropic-style request and response shapes.

curl -X POST http://localhost:52415/v1/messages \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 512
  }'

Ollama API

Skulk supports several Ollama-compatible endpoints so tools like OpenWebUI can connect with minimal glue code.

Chat

curl -X POST http://localhost:52415/ollama/api/chat \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Generate

curl -X POST http://localhost:52415/ollama/api/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "prompt": "Write a haiku about foxes"
  }'

List models

curl http://localhost:52415/ollama/api/tags

Show model details

curl -X POST http://localhost:52415/ollama/api/show \
  -H 'Content-Type: application/json' \
  -d '{"name": "mlx-community/Llama-3.2-1B-Instruct-4bit"}'

Model Discovery

List models

GET /v1/models

curl http://localhost:52415/v1/models

This returns known model cards, not just running instances.

Search Hugging Face

GET /models/search?query=...&limit=...

curl "http://localhost:52415/models/search?query=qwen3&limit=5"

Behavior note:

Skulk searches mlx-community first.
If that returns nothing, it falls back to a broader Hugging Face search.

Per-node storage breakdown

GET /store/storage

Returns the local node's storage picture: every staged model with its size, last-use time, and whether a live instance (or one of its companion repos: MTP sidecar, assistant, vision weights) currently depends on it, plus event-log usage and free disk on the models volume. Cluster-wide views query each node's API.

curl http://localhost:52415/store/storage

When the model store is enabled, staged copies are managed automatically. Cleanup runs when an instance shuts down and at node startup (which reconciles copies orphaned by a crash): staged models that are not in use are kept newest-first up to the staging_keep_recent_gb grace budget (default 40 GiB), and anything beyond that budget is evicted. Set cleanup_on_deactivate: false in the staging config to keep every staged copy and manage cleanup manually via POST /store/purge-staging.
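The newest-first retention rule can be illustrated with a short sketch. It rests on assumptions the text does not settle: that in-use models do not count against the grace budget, and that once one idle model exceeds the budget every older idle model is evicted too. The `StagedModel` type and `plan_eviction` name are hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple

GIB = 1024 ** 3


class StagedModel(NamedTuple):
    model_id: str
    size_bytes: int
    last_used: float  # larger means more recently used
    in_use: bool


def plan_eviction(
    staged: Sequence[StagedModel],
    keep_recent_bytes: int = 40 * GIB,
) -> Tuple[List[str], List[str]]:
    """Return (keep, evict) model IDs under a newest-first grace budget."""
    keep = [m.model_id for m in staged if m.in_use]  # never evict live models
    evict: List[str] = []
    idle = sorted(
        (m for m in staged if not m.in_use),
        key=lambda m: m.last_used,
        reverse=True,
    )
    kept_bytes = 0
    over_budget = False
    for model in idle:
        if not over_budget and kept_bytes + model.size_bytes <= keep_recent_bytes:
            kept_bytes += model.size_bytes
            keep.append(model.model_id)
        else:
            over_budget = True
            evict.append(model.model_id)
    return keep, evict
```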

Placement and Instance Management

These endpoints are the heart of the Skulk control plane.

Quick launch

POST /place_instance

curl -X POST http://localhost:52415/place_instance \
  -H 'Content-Type: application/json' \
  -d '{
    "model_id": "mlx-community/Qwen3.5-9B-4bit",
    "sharding": "Pipeline",
    "instance_meta": "MlxRing",
    "min_nodes": 1,
    "excluded_nodes": []
  }'

Field	Meaning
`model_id`	Hugging Face-style model ID
`sharding`	`Pipeline` or `Tensor`
`instance_meta`	`MlxRing` or `MlxJaccl`
`min_nodes`	Minimum nodes required for the placement
`excluded_nodes`	Optional. Node IDs the master should treat as if absent when scoring this placement. Already-running instances on those nodes are unaffected (exclusion is per-placement, not cluster-wide). Default: `[]`. Note: node IDs are per-session, so they change when a cluster session restarts.

The placement is validated against the current cluster state before the command is forwarded, so an impossible placement fails at the API instead of silently failing on the master:

400 with the specific reason: no connected cycle of min_nodes nodes, exclusions removed every candidate, the model does not support Tensor sharding, or a node cannot fit its weight shard plus runtime headroom (the error names the node and the GB arithmetic).
503 when cluster info is still being gossiped (a cluster that just formed): connection edges lag node identities by a few gossip rounds, and per-node memory info lags the edges. The request internally waits up to 15 seconds for the info to arrive before giving up, so retry shortly on 503.

Memory fitting is checked per node, not summed across the cycle: Tensor sharding splits weights evenly, Pipeline allocates layers proportionally to each node's available memory, and every node must hold its share times a runtime-overhead factor (KV cache, activations, runner) on top of the raw weight bytes. A model that exactly equals a node's free memory is rejected, because that placement would thrash, not run.
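The per-node check can be sketched as follows. This is an approximation for illustration only: the overhead factor of 1.2 is a hypothetical value (the text does not state Skulk's factor), the function names are invented, and real Pipeline placement assigns whole layers rather than fractional byte shares.

```python
from typing import Dict, List

OVERHEAD = 1.2  # hypothetical runtime-overhead factor, not Skulk's actual value


def shard_bytes(model_bytes: int, free_bytes: Dict[str, int], sharding: str) -> Dict[str, float]:
    """Estimate the weight bytes each node would hold."""
    if not free_bytes:
        raise ValueError("no candidate nodes")
    if sharding == "Tensor":
        # Tensor sharding splits weights evenly across nodes.
        return {node: model_bytes / len(free_bytes) for node in free_bytes}
    if sharding == "Pipeline":
        # Pipeline sharding allocates in proportion to available memory.
        total = sum(free_bytes.values())
        return {node: model_bytes * free / total for node, free in free_bytes.items()}
    raise ValueError(f"unknown sharding mode: {sharding}")


def nodes_that_do_not_fit(
    model_bytes: int,
    free_bytes: Dict[str, int],
    sharding: str,
    overhead: float = OVERHEAD,
) -> List[str]:
    """List nodes whose share times the overhead factor exceeds free memory."""
    shares = shard_bytes(model_bytes, free_bytes, sharding)
    return [node for node, share in shares.items() if share * overhead > free_bytes[node]]
```

Under these assumptions, a 24 GB model on nodes with 32 GB and 8 GB free fits with Pipeline sharding but fails Tensor sharding on the smaller node, even though the summed free memory is sufficient in both cases.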

Preview valid placements

GET /instance/previews?model_id=...

curl "http://localhost:52415/instance/previews?model_id=mlx-community/Qwen3.5-9B-4bit"

This is usually the best first Skulk-specific endpoint to call. It shows which combinations of sharding mode, networking mode, and node count are valid, and why invalid combinations fail.

Query parameter	Meaning
`model_id`	Required. Hugging Face-style model ID.
`node_ids`	Optional, repeatable. Restricts previews to candidate cycles that contain all of these node IDs (subset matching).
`excluded_node_ids`	Optional, repeatable. Excludes the listed node IDs from candidate cycles for every previewed combination. Mirrors the `excluded_nodes` field on `POST /place_instance` so dashboards can render an accurate preview against the post-exclusion topology.

# Preview with one node excluded:
curl "http://localhost:52415/instance/previews?model_id=mlx-community/Qwen3.5-9B-4bit&excluded_node_ids=12D3KooWAbc..."
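When building this URL in code, repeatable parameters are usually expressed by repeating the key. The helper below is a small sketch using only the standard library; the function name is hypothetical, and it assumes the server accepts the percent-encoded form of the slash in model_id, which is how `urlencode` emits it.

```python
from typing import Iterable
from urllib.parse import urlencode


def previews_url(
    model_id: str,
    node_ids: Iterable[str] = (),
    excluded_node_ids: Iterable[str] = (),
    base: str = "http://localhost:52415",
) -> str:
    """Build a GET /instance/previews URL with repeated query keys."""
    params = [("model_id", model_id)]
    params += [("node_ids", node) for node in node_ids]
    params += [("excluded_node_ids", node) for node in excluded_node_ids]
    return f"{base}/instance/previews?{urlencode(params)}"
```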

Build a placement manually

GET /instance/placement

Use this when you want a specific combination and want to inspect the exact instance shape before launch.

Create an instance from a fully specified placement

POST /instance

Use this when you already have an instance object and want exact control.

Inspect one instance

GET /instance/{instance_id}

Delete an instance

DELETE /instance/{instance_id}

Download Management

Start a node download

POST /download/start

Lower-level endpoint for explicit node download control.

Delete a node download

DELETE /download/{node_id}/{model_id}

Model Store Endpoints

These endpoints are available when the model store is configured.

If it is not configured, Skulk returns 503 Store not configured.

Store health

GET /store/health

Use this to confirm whether the store is configured and reachable.

Store registry

GET /store/registry

Use this to inspect which models the shared store knows about.

The dashboard combines registry results with GET /v1/models metadata so it can display derived tags such as vision, thinking, embedding, tensor, and optiq in the Store list.

Store downloads

GET /store/downloads

Use this to inspect in-progress shared-store download activity.

Request a store download

POST /store/models/{model_id}/download

Use this when you want the store host to fetch and register a model.

Optional JSON body {"gguf_file": "<repo-relative path>"} pins which GGUF quant the store fetches for a multi-quant GGUF repo (it downloads that file's shard group plus config.json). Omit the body to use the default quant preference. A pin naming a file not present in the repo falls back to the default.

Store download status

GET /store/models/{model_id}/download/status

Delete a model from the store

DELETE /store/models/{model_id}

Removes the model from the store host (registry + disk) and broadcasts a cluster-wide eviction so every node also drops its locally-staged copy, freeing disk fleet-wide instead of leaving worker copies until they age out under staging pressure. Returns 404 if the model is not registered in the store. (To clear staged copies without deleting the store copy, use POST /store/purge-staging.)

Purge staging caches

POST /store/purge-staging

Use this to remove staged model artifacts from nodes without deleting the store copy itself.

Start optimization

POST /store/models/{model_id}/optimize

Use this for workflows such as model optimization or alternate artifact generation.

Models Endpoint

List models

GET /v1/models

Returns the known model catalog, including downloaded models and catalog-backed entries.

Important fields:

Field	Type	Meaning
`id`	string	Canonical model ID
`capabilities`	array	Functional capabilities such as `text`, `vision`, `thinking`, `code`, or `embedding`
`tags`	array	UI-friendly derived labels such as `vision`, `thinking`, `embedding`, `tensor`, and `optiq`
`supports_tensor`	boolean	Whether tensor parallel launch is supported
`base_model`	string	Base family or upstream source model when known
`runtime.mtp_sidecar_repo`	string	Repo of this model's MTP sidecar (prediction heads), when it declares one
`runtime.assistant_model_repo`	string	Repo of this model's speculative-decoding assistant (drafter), when it declares one
`runtime.served_spec_draft_repo`	string	Repo of this model's separate served-engine draft GGUF, when it declares one

The dashboard uses tags for compact badges and capabilities for filtering and richer tooltips. The three runtime.*_repo fields name a model's speculative-decoding companions (a draft model or an MTP-head sidecar). Those companion repos are downloaded and loaded automatically with their parent and are not independently placeable, so the dashboard marks any store entry matching one of these repos as a companion (a "Drafter" or "Sidecar" badge) rather than offering it launch, placement, or optimize actions.

Configuration Endpoints

Get config

GET /config

Returns the current cluster config and config path. Sensitive values (hf_token) are stripped from the response.

Update config

PUT /config

Updates cluster-wide config. Important behavior:

if you omit hf_token, Skulk preserves the existing value
if you omit logging, Skulk preserves the existing logging config
hf_token is not broadcast over gossipsub; it stays on the local node's skulk.yaml
logging changes (enable/disable) take effect immediately on all nodes
inference changes affect future launches
model-store location changes generally require restart

Filesystem browse

GET /filesystem/browse

Used by the dashboard to browse a safe subset of the filesystem when selecting config paths.

Node identity

GET /node/identity

Returns hostname, preferred IP, and node identity information used by the dashboard.

Restart a node

POST /admin/restart?node_id=<optional node id>

Gracefully restarts the Skulk process on this node or a remote node. When node_id is omitted or matches the local node, Skulk replaces the current process image in place via os.execv (same PID). When node_id targets a remote node, Skulk sends a RestartNode command via pub/sub.

GPU/Metal memory is released when the process image is replaced
the node rejoins the cluster automatically on startup
active inference is interrupted

Returns {"status": "restarting", "node_id": "..."} for local restarts, or {"status": "restart_sent", "node_id": "..."} for remote restarts. If a local restart is already scheduled, returns HTTP 409 with {"status": "restart_already_pending"}.

State, Events, and Tracing

Cluster state

GET /state

Returns the cluster state as Skulk currently sees it.

The response also carries a derived nodeHealth map (keyed by node id) so a problem on a node is visible rather than silent. Each entry is a level (ok, warn, or error) plus a list of reasons, where each reason has a code, a message describing what is wrong, and a remediation describing how to fix it. It is computed read-only from state already in the response (terminal download failures, low or full models-volume disk, and late heartbeats), so it adds no new polling. A node with no problems reports level: "ok" with an empty reasons list.
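A monitoring script could flatten nodeHealth into a list of actionable problems. The sketch below assumes the field names exactly as described above (nodeHealth, level, reasons, code, remediation); the function name is hypothetical, and any reason codes you see in practice come from Skulk, not from this example.

```python
from typing import List, Tuple


def unhealthy_nodes(state: dict) -> List[Tuple[str, str, str, str]]:
    """Return (node_id, level, code, remediation) for every reported problem."""
    problems = []
    for node_id, health in state.get("nodeHealth", {}).items():
        if health.get("level", "ok") == "ok":
            continue  # healthy nodes report level "ok" and no reasons
        for reason in health.get("reasons", []):
            problems.append(
                (node_id, health["level"], reason["code"], reason["remediation"])
            )
    return problems
```

The input would be the parsed JSON body of GET /state.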

Operational note:

a follower may briefly report a local view that is behind the elected master while it is catching up
on newer builds, catch-up can start from a snapshot plus retained replay tail instead of always rebuilding from event 0
if your cluster is mixed-version during rollout, upgrade all nodes before you rely on bounded replay retention on the master; an older restarted node may not be able to fully resync after old history has been compacted away

Event log

GET /events

Returns stored events from the API-side event log.

Diagnostics

GET /v1/diagnostics/node
POST /v1/diagnostics/node/capture
POST /v1/diagnostics/node/runners/{runner_id}/cancel
GET /v1/diagnostics/cluster
GET /v1/diagnostics/cluster/timeline
GET /v1/diagnostics/cluster/{node_id}
POST /v1/diagnostics/cluster/{node_id}/capture
POST /v1/diagnostics/cluster/{node_id}/runners/{runner_id}/cancel

Use these endpoints when a node appears stuck loading, warming up, decoding, or shutting down and you need a read-only snapshot without SSHing into every node.

Behavior notes:

GET /v1/diagnostics/node returns the local node's runtime/config facts, resources, process tree, live runner-supervisor state, flight-recorder phase state, and placement analysis.
POST /v1/diagnostics/node/capture collects an on-demand local diagnostic bundle. Body fields are runnerId, taskId, includeProcessSamples, and sampleDurationSeconds; all are optional. When a runner/task is provided, the response includes that runner's bounded flight recorder, latest MLX memory snapshot, and best-effort macOS sample, vmmap -summary, and footprint -p output. Sampling failures are returned as structured partial failures instead of failing the bundle.
POST /v1/diagnostics/node/runners/{runner_id}/cancel requests cooperative cancellation for one task that the local runner supervisor still knows about.
GET /v1/diagnostics/cluster fans out to reachable peer APIs and returns partial results when some peers are unavailable.
GET /v1/diagnostics/cluster/timeline stitches every reachable node's runner-supervisor diagnostics into one cross-rank chronological view. The response carries a per-runner synopsis sorted by (modelId, deviceRank) and every flight-recorder entry across all ranks merged and sorted by at. Use this when debugging a distributed deadlock: the rank-disagreement signature ("rank 0 entered pipeline_last_eval_output at T while rank 1 is still in pipeline_first_recv_like") is invisible from any single node's local diagnostics but obvious top-to-bottom in the merged timeline. Unreachable peers are returned in unreachableNodes instead of failing the request.
GET /v1/diagnostics/cluster/{node_id} proxies one reachable peer bundle or returns the local bundle if node_id is the current API node.
POST /v1/diagnostics/cluster/{node_id}/capture proxies the same on-demand capture request to a reachable peer node.
POST /v1/diagnostics/cluster/{node_id}/runners/{runner_id}/cancel proxies the same cooperative live-runner cancellation request to a reachable peer.
Placement diagnostics explicitly include whether the current master is part of each model placement, which helps investigate hangs where the master is not one of the inference ranks.
The dashboard node-card bug icon uses these endpoints to open a live diagnostics drawer for any reachable node.
The diagnostics drawer prefers Capture bundle before cancellation so operators can collect phase, MLX memory, and process samples before changing the runner state.
Runner cancellation is best-effort only. A wedged native/MLX runner may ignore the request and still require stronger intervention.
Diagnostics endpoints do not currently kill or restart runners. Capture is read-only; the only mutating diagnostics action is the cooperative task-cancel request above.

Example:

curl http://localhost:52415/v1/diagnostics/node
curl http://localhost:52415/v1/diagnostics/cluster
curl http://localhost:52415/v1/diagnostics/cluster/timeline
curl http://localhost:52415/v1/diagnostics/cluster/<node_id>
curl -X POST http://localhost:52415/v1/diagnostics/node/capture \
  -H 'content-type: application/json' \
  -d '{"runnerId":"<runner_id>","taskId":"<task_id>"}'
curl -X POST http://localhost:52415/v1/diagnostics/cluster/<node_id>/capture \
  -H 'content-type: application/json' \
  -d '{"runnerId":"<runner_id>","includeProcessSamples":true}'
curl -X POST http://localhost:52415/v1/diagnostics/node/runners/<runner_id>/cancel \
  -H 'content-type: application/json' \
  -d '{"taskId":"<task_id>"}'
curl -X POST http://localhost:52415/v1/diagnostics/cluster/<node_id>/runners/<runner_id>/cancel \
  -H 'content-type: application/json' \
  -d '{"taskId":"<task_id>"}'

Traces

GET /v1/tracing
PUT /v1/tracing
GET /v1/traces
GET /v1/traces/cluster
POST /v1/traces/delete
GET /v1/traces/{task_id}
GET /v1/traces/{task_id}/stats
GET /v1/traces/{task_id}/raw
GET /v1/traces/cluster/{task_id}
GET /v1/traces/cluster/{task_id}/stats
GET /v1/traces/cluster/{task_id}/raw

Use these endpoints when you are debugging generation behavior, cluster execution, or performance.

Behavior notes:

GET /v1/tracing returns whether runtime tracing is currently enabled for new requests across the live cluster session.
PUT /v1/tracing toggles tracing cluster-wide for new requests only. It does not retroactively trace in-flight work.
GET /v1/traces* reads local trace artifacts stored on the current node.
GET /v1/traces/cluster* fans out to reachable peer APIs, deduplicates by task_id, and proxies read-only trace access from any reachable node.
POST /v1/traces/delete remains local-only in v1 even when cluster browsing is enabled.

Runtime tracing control

GET /v1/tracing

Returns the current cluster tracing state:

{"enabled": false}

PUT /v1/tracing

Enable or disable tracing for new requests across the current cluster session.

Request body:

{"enabled": true}

Response body:

{"enabled": true}

Operational notes:

this is a runtime toggle, not a restart-required config edit
it applies to new requests only
it does not retroactively trace work already in flight
the dashboard traces page uses this same API

Local trace endpoints

These endpoints operate on trace artifacts stored on the current node:

GET /v1/traces lists local trace artifacts with metadata such as task kind, model, source nodes, and tool-activity tags
GET /v1/traces/{task_id} returns structured trace events for one task
GET /v1/traces/{task_id}/stats returns aggregated timing summaries
GET /v1/traces/{task_id}/raw downloads Chrome-trace-compatible JSON
POST /v1/traces/delete deletes one or more local trace artifacts

Example:

curl http://localhost:52415/v1/traces
curl http://localhost:52415/v1/traces/<task_id>/stats
curl -OJ http://localhost:52415/v1/traces/<task_id>/raw

Cluster trace endpoints

These endpoints let a dashboard or script on any reachable node browse traces across the cluster:

GET /v1/traces/cluster
GET /v1/traces/cluster/{task_id}
GET /v1/traces/cluster/{task_id}/stats
GET /v1/traces/cluster/{task_id}/raw

Operational notes:

cluster browsing is read-only in v1
the API fans out to reachable peer APIs and deduplicates traces by task_id
if some peers are unreachable, cluster results may be partial
source node metadata in responses tells you which nodes contributed trace content

Example:

curl http://localhost:52415/v1/traces/cluster
curl http://localhost:52415/v1/traces/cluster/<task_id>/stats
curl -OJ http://localhost:52415/v1/traces/cluster/<task_id>/raw
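
To make the fan-out behavior concrete, here is a client-side illustration in Python of merging per-node trace listings by task_id while tolerating unreachable peers. It is not Skulk's implementation: the function name `merge_peer_traces` and the record keys `task_id` and `source_nodes` are assumptions chosen for the sketch.

```python
from typing import Dict, List, Optional, Tuple


def merge_peer_traces(
    peer_results: Dict[str, Optional[List[dict]]],
) -> Tuple[List[dict], List[str]]:
    """Merge trace listings from several nodes.

    peer_results maps a node ID to that node's trace records, or to None
    when the node could not be reached. Returns (merged_records, unreachable).
    """
    merged: Dict[str, dict] = {}
    unreachable: List[str] = []
    for node_id, records in peer_results.items():
        if records is None:
            # An unreachable peer makes the result partial, not an error.
            unreachable.append(node_id)
            continue
        for record in records:
            # The first record seen for a task_id wins; later nodes only
            # add themselves to the list of contributing sources.
            entry = merged.setdefault(
                record["task_id"], {**record, "source_nodes": []}
            )
            if node_id not in entry["source_nodes"]:
                entry["source_nodes"].append(node_id)
    return list(merged.values()), unreachable
```

A non-empty `unreachable` list corresponds to the partial-results case noted above, and `source_nodes` plays the role of the source node metadata.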

Connectivity Endpoints

Tailscale status

GET /v1/connectivity/tailscale
GET /v1/connectivity/tailscale?node_id=<id>

Returns whether tailscaled is running on a node and, if so, the node's Tailscale IP, hostname, DNS name, tailnet, and client version. All fields except running are null when tailscaled is not installed or not running.

Pass node_id to proxy the request to a specific cluster node. Omit it to query the local node directly. Returns 404 if the target node is not reachable.

Response fields:

| Field | Type | Description |
| --- | --- | --- |
| `running` | boolean | `true` when tailscaled reports `BackendState == "Running"` |
| `selfIp` | string \| null | Node's Tailscale IPv4 address (100.x.x.x range) |
| `hostname` | string \| null | Node hostname as registered in the tailnet |
| `dnsName` | string \| null | Fully-qualified Tailscale MagicDNS name, e.g. `my-node.tailnet-abc.ts.net` |
| `tailnet` | string \| null | Tailnet name derived from `dnsName` |
| `version` | string \| null | Tailscale client version string |

# Local node
curl http://localhost:52415/v1/connectivity/tailscale

# Specific cluster node
curl "http://localhost:52415/v1/connectivity/tailscale?node_id=<node-id>"
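
For scripted use, the response fields can be mapped onto a small typed record. The sketch below is one possible shape in Python; the class `TailscaleStatus` and the helper `tailscale_status_url` are hypothetical names, and only the documented field names are taken from this page.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:52415"  # default Skulk API address from this page


@dataclass(frozen=True)
class TailscaleStatus:
    running: bool
    self_ip: Optional[str] = None
    hostname: Optional[str] = None
    dns_name: Optional[str] = None
    tailnet: Optional[str] = None
    version: Optional[str] = None

    @classmethod
    def from_json(cls, payload: dict) -> "TailscaleStatus":
        """Map the documented camelCase fields onto the dataclass."""
        return cls(
            running=bool(payload.get("running", False)),
            self_ip=payload.get("selfIp"),
            hostname=payload.get("hostname"),
            dns_name=payload.get("dnsName"),
            tailnet=payload.get("tailnet"),
            version=payload.get("version"),
        )


def tailscale_status_url(node_id: Optional[str] = None, base_url: str = BASE_URL) -> str:
    """URL for the local node, or for a specific node when node_id is given."""
    url = f"{base_url}/v1/connectivity/tailscale"
    if node_id is not None:
        url += "?" + urlencode({"node_id": node_id})
    return url
```

Because every field except `running` may be null, callers should check `running` before relying on `self_ip` or `dns_name`.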

Remote access info

GET /v1/connectivity/remote-access

Returns aggregated remote access information for the local node: its LAN address, its Tailscale address, and a preferredUrl. preferredUrl points at Tailscale when it is running and at the LAN address otherwise. When Tailscale is running, preferredUrl uses the node's MagicDNS name (e.g. my-node.tailnet-abc.ts.net) if available and falls back to the raw 100.x.x.x IP. operatorUrl appends /operator to preferredUrl, which makes it suitable for QR codes that land mobile users directly on the operator panel.

Response fields:

| Field | Type | Description |
| --- | --- | --- |
| `local.ip` | string \| null | Preferred LAN IPv4 address |
| `local.port` | integer | API/dashboard port |
| `local.url` | string \| null | `http://{ip}:{port}` |
| `tailscale.running` | boolean | `true` when tailscaled is connected |
| `tailscale.ip` | string \| null | Tailscale IPv4 address (100.x.x.x) |
| `tailscale.dnsName` | string \| null | MagicDNS fully-qualified name, e.g. `my-node.tailnet-abc.ts.net` |
| `tailscale.port` | integer | API/dashboard port |
| `tailscale.url` | string \| null | `http://{dnsName or ip}:{port}` if running |
| `preferredUrl` | string \| null | MagicDNS URL if available, else Tailscale IP URL, else LAN URL |
| `operatorUrl` | string \| null | `preferredUrl + /operator` |

curl http://localhost:52415/v1/connectivity/remote-access | python3 -m json.tool

Example response when Tailscale is running with MagicDNS:

{
  "local": { "ip": "192.168.1.5", "port": 52415, "url": "http://192.168.1.5:52415" },
  "tailscale": {
    "running": true,
    "ip": "100.101.102.103",
    "dnsName": "my-node.tailnet-abc.ts.net",
    "port": 52415,
    "url": "http://my-node.tailnet-abc.ts.net:52415"
  },
  "preferredUrl": "http://my-node.tailnet-abc.ts.net:52415",
  "operatorUrl": "http://my-node.tailnet-abc.ts.net:52415/operator"
}
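
The selection rule for preferredUrl can be re-derived on the client, for example to sanity-check a response. The Python sketch below follows the documented order (MagicDNS name, then Tailscale IP, then LAN); the function names are hypothetical, and falling back to LAN when Tailscale is running but reports no address is an assumption of this sketch. In normal use, read `preferredUrl` and `operatorUrl` straight from the response.

```python
from typing import Optional


def preferred_url(info: dict) -> Optional[str]:
    """Recompute preferredUrl from the local and tailscale objects."""
    tailscale = info.get("tailscale") or {}
    local = info.get("local") or {}
    if tailscale.get("running"):
        # Prefer the MagicDNS name, then the raw Tailscale IP.
        host = tailscale.get("dnsName") or tailscale.get("ip")
        if host:
            return f"http://{host}:{tailscale['port']}"
    if local.get("ip"):
        return f"http://{local['ip']}:{local['port']}"
    return None


def operator_url(info: dict) -> Optional[str]:
    """operatorUrl is preferredUrl with /operator appended."""
    base = preferred_url(info)
    return None if base is None else base + "/operator"
```

Applied to the example response above, `preferred_url` yields the MagicDNS URL and `operator_url` yields the same URL with `/operator` appended.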

Operator App Integration

The operator panel at /operator is designed for mobile access and can also be driven by a native app. The relevant API endpoints are:

Node and cluster state

| Endpoint | Description |
| --- | --- |
| `GET /v1/state` | Full cluster state: nodes, instances, runners, memory, GPU |
| `GET /node_id` | Local node's ID |
| `GET /node/identity` | Node ID, hostname, and preferred LAN IP |

Remote access and connectivity

| Endpoint | Description |
| --- | --- |
| `GET /v1/connectivity/remote-access` | LAN + Tailscale addresses, preferred URL, operator URL for QR |
| `GET /v1/connectivity/tailscale` | Tailscale status for local node |
| `GET /v1/connectivity/tailscale?node_id=<id>` | Tailscale status for a specific peer node |

Node management

| Endpoint | Description |
| --- | --- |
| `POST /v1/nodes/{node_id}/restart` | Send a restart command to any node in the cluster |

Typical operator app workflow

1. Call GET /v1/connectivity/remote-access on the initially discovered node to get the preferredUrl, then use that as the base URL for subsequent calls.
2. Poll GET /v1/state every 5 seconds for node health (memory, GPU, temperature).
3. Show per-node cards with restart buttons that call POST /v1/nodes/{node_id}/restart.
4. On first launch or settings screen, show the operatorUrl as a QR code so users can hand it off to another device.
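
The workflow above can be sketched in Python as follows. All function names are hypothetical, the `fetch` and `sleep` parameters exist only so the loop can be exercised without a live node, and the restart request is sent without a body because this page does not describe one.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator, Optional
from urllib.parse import quote


def get_json(url: str, timeout: float = 10.0) -> dict:
    """GET a URL and decode the JSON response."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def resolve_base_url(discovered_url: str, fetch: Callable[[str], dict] = get_json) -> str:
    """Ask the discovered node for preferredUrl; keep the original if it is null."""
    info = fetch(f"{discovered_url}/v1/connectivity/remote-access")
    return info.get("preferredUrl") or discovered_url


def poll_state(base_url: str, fetch: Callable[[str], dict] = get_json,
               interval: float = 5.0, max_polls: Optional[int] = None,
               sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Yield GET /v1/state snapshots, waiting `interval` seconds between polls."""
    polls = 0
    while max_polls is None or polls < max_polls:
        yield fetch(f"{base_url}/v1/state")
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval)


def build_restart_request(base_url: str, node_id: str) -> urllib.request.Request:
    """Build POST /v1/nodes/{node_id}/restart without sending it."""
    url = f"{base_url}/v1/nodes/{quote(node_id, safe='')}/restart"
    return urllib.request.Request(url, method="POST")
```

An app would call `resolve_base_url` once at startup, iterate over `poll_state` to refresh its node cards, and pass the result of `build_restart_request` to `urllib.request.urlopen` when a restart button is pressed.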

