CreateInstanceParams

instance object required

anyOf

MlxRingInstance
MlxJacclInstance

instanceId (string) required

shardAssignments object required

modelId (string) required

runnerToShard object required

property name* object

anyOf

PipelineShardMetadata
CfgShardMetadata
TensorShardMetadata

modelCard object required

The persisted, declarative metadata Skulk holds for one model.

This is the model-card interface: the single source of truth for how a model is sized, sharded, placed, and run. It is created once (from a Hugging Face repo or hand-authored), broadcast cluster-wide, and read by the planner (placement), the downloader (sizing and which files to fetch), and the worker runner (engine and behavior). As a CamelCaseModel it is camelCase on the wire and strict (extra="forbid"), so every node in a cluster must run the same Skulk version: a stale node rejects cards that carry newer fields.

Two layers live here: the card (this declarative metadata) and the normalized resolved capability profile derived from it plus family defaults (see capabilities.py and website/docs/model-capabilities.md). The optional reasoning / modalities / tooling / runtime / vision / placement sub-configs refine that resolution; when absent, conservative family defaults apply.
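The strictness described above can be illustrated with a small stdlib-only sketch. It is not Skulk's implementation (which, per the description, relies on a strict CamelCaseModel); `KNOWN_FIELDS`, `unknown_fields`, and `accept_card` are hypothetical names, and the field set is only a subset chosen for illustration.

```python
from typing import Any, Dict, FrozenSet, List

# Hypothetical: the card fields an older node happens to know about.
KNOWN_FIELDS: FrozenSet[str] = frozenset(
    {"modelId", "storageSize", "nLayers", "hiddenSize", "supportsTensor", "tasks"}
)


def unknown_fields(card: Dict[str, Any], known: FrozenSet[str] = KNOWN_FIELDS) -> List[str]:
    """Return the card keys this node does not recognize, sorted."""
    return sorted(set(card) - known)


def accept_card(card: Dict[str, Any], known: FrozenSet[str] = KNOWN_FIELDS) -> bool:
    """Strict acceptance: a single unrecognized key rejects the whole card."""
    return not unknown_fields(card, known)


newer_card = {"modelId": "example/model", "nLayers": 32, "servedSpecType": "draft_mtp"}
print(unknown_fields(newer_card))  # the field the stale node would reject
print(accept_card(newer_card))
```

A node whose known-field set lags behind the sender's rejects the card outright, which is why mixed-version clusters are not supported.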

modelId (string) required

The model's identifier (a Hugging Face repo id, e.g. mlx-community/Qwen3.5-9B-4bit, or a custom id). Slashes become -- in the on-disk store directory name.
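A minimal sketch of the slash-to-`--` mapping described above. The helper name `store_dir_name` is hypothetical, and the parent directory of the store is not specified here.

```python
def store_dir_name(model_id: str) -> str:
    """Map a model id to its store directory name by replacing '/' with '--'."""
    return model_id.replace("/", "--")


print(store_dir_name("mlx-community/Qwen3.5-9B-4bit"))
# prints: mlx-community--Qwen3.5-9B-4bit
```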

storageSize object required

On-disk size of the weights this card loads (for a GGUF card, just the selected quant's shard group, not every quant the repo hosts). The planner uses this for memory-fit and placement-width decisions.

inBytes (integer)

Default value: 0

nLayers (integer) required

Number of transformer layers. Drives pipeline sharding (how layers split across nodes) and KV-cache sizing.

Possible values: > 0

hiddenSize (integer) required

Model hidden dimension, used in memory/KV-cache estimates.

Possible values: > 0

supportsTensor (boolean) required

Whether the model may be served with tensor parallelism (Sharding.Tensor). GGUF/llama.cpp cards set this False (single-node engine).

numKeyValueHeads object

KV-head count for grouped-query attention, used in KV-cache sizing. None when unknown/not applicable.

anyOf

integer
null

integer

Possible values: > 0
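To show why the layer count and KV-head count matter for sizing, here is the textbook KV-cache estimate as a sketch. Skulk's actual estimate may differ; `head_dim` and `bytes_per_element` are assumptions supplied by the caller, not card fields.

```python
def kv_cache_bytes(
    n_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_element: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Estimate KV-cache size: keys and values for every layer, head, and token."""
    # The leading 2 accounts for storing both a key and a value per position.
    return 2 * n_layers * num_kv_heads * head_dim * context_tokens * bytes_per_element


size = kv_cache_bytes(n_layers=32, num_kv_heads=8, head_dim=64, context_tokens=4096)
print(size)            # 268435456
print(size / 2**20)    # 256.0 (MiB)
```

With grouped-query attention the KV-head count is smaller than the query-head count, so the cache shrinks proportionally.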

tasks ModelTask (string)[] required

The task types this model serves (TextGeneration, TextEmbedding, TextToImage, ImageToImage); selects which runner handles it.

Possible values: [TextGeneration, TextToImage, ImageToImage, TextEmbedding]

components object

For multi-component models (e.g. a diffusion stack), the per-component weight layout. None for a single-weights model.

anyOf

object[]
null

Array [

componentName (string) required

Logical name of this component (e.g. text_encoder, transformer).

componentPath (string) required

Repo-relative subdirectory holding this component's weights.

storageSize object required

On-disk size of this component's weights.

inBytes (integer)

Default value: 0

nLayers object

Layer count for this component when it is shardable; None otherwise.

anyOf

integer
null

integer

Possible values: > 0

canShard (boolean) required

Whether this component may be split across nodes (vs. loaded whole).

safetensorsIndexFilename object

The component's *.safetensors.index.json filename when sharded across files; None for a single-file component.

anyOf

string
null

string

]

family (string)

Model family token (e.g. qwen3, gemma4) used to pick family-specific defaults during capability resolution. Empty when not classified.

Default value: ""

quantization (string)

Human-readable quantization label (e.g. 4bit, Q4_K_M); informational only.

Default value: ""

baseModel (string)

The upstream base model id when this is a quant/finetune of another; empty if not applicable.

Default value: ""

ggufFile object

For GGUF (llama.cpp) models: the repo-relative path of the weights file the runner loads (the selected quant's first shard). Resolved once at card creation (preferring a quant over BF16) so the download fetches only that quant and the runner loads deterministically, instead of each layer re-globbing/guessing. None for non-GGUF (safetensors/MLX) cards.

anyOf

string
null

string

capabilities string[]

Free-form capability tags carried for compatibility/auxiliary use; the structured reasoning/modalities/tooling configs are authoritative for capability resolution.

Default value: []

contextLength (integer)

The model's advertised maximum context length in tokens (0 if unknown). The admission ceiling is the smaller of this and what fits in memory.

Default value: 0
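The admission ceiling rule above can be sketched as follows. Treating `contextLength == 0` as "no advertised cap" is an assumption about how the unknown case is handled, and `memory_fit_tokens` is a hypothetical input standing in for whatever the planner computes.

```python
def admission_ceiling(context_length: int, memory_fit_tokens: int) -> int:
    """Return the smaller of the advertised context length and what fits in memory."""
    if context_length <= 0:
        # Assumption: 0 means the advertised length is unknown, so only memory limits apply.
        return memory_fit_tokens
    return min(context_length, memory_fit_tokens)


print(admission_ceiling(32768, 20000))  # 20000: memory is the binding limit
print(admission_ceiling(8192, 20000))   # 8192: the advertised length is the binding limit
print(admission_ceiling(0, 20000))      # 20000: unknown advertised length
```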

usesCfg (boolean)

Whether the model uses classifier-free guidance (relevant to some image / diffusion models).

Default value: false

trustRemoteCode (boolean)

Passed to the model loader: whether to execute the repo's custom Python. Defaults to True to match upstream loaders; set False to refuse it.

Default value: true

isCustom (boolean)

Marks an operator-added custom card (not from the curated catalog). Excluded from the persisted card file so it is recomputed per environment.

Default value: false

vision object

Optional vision (image-input) configuration; None for text-only models.

anyOf

VisionCardConfig
null

imageTokenId object

Token id the model uses as the image placeholder in the prompt. Required by the MLX vision path (which splices image embeddings at this token); None is allowed for a llama.cpp-only vision GGUF, whose chat handler inserts image features itself and never reads this. MLX cards always set it (from config.json).

anyOf

integer
null

integer

modelType (string)

Vision model-type tag (from config.json's vision_config), selecting the image processor (MLX) or chat handler (llama.cpp). Empty when a bare GGUF repo only signals vision via its mmproj projector; the llama.cpp runner then falls back to its general multimodal handler.

Default value: ""

weightsRepo (string)

Repo holding the vision-tower weights when separate from the LM; empty if bundled with the main weights.

Default value: ""

imageToken object

The literal image placeholder string, when distinct from image_token_id.

anyOf

string
null

string

processorRepo object

Repo providing the image processor/preprocessor config, if not the main repo.

anyOf

string
null

string

boiTokenId object

Begin-of-image token id, for families that bracket image spans.

anyOf

integer
null

integer

eoiTokenId object

End-of-image token id, for families that bracket image spans.

anyOf

integer
null

integer

reasoning object

Optional reasoning/thinking configuration (toggle, budget, format, default effort); None falls back to family defaults.

anyOf

ReasoningCardConfig
null

supportsToggle object

Whether the model can have reasoning turned on/off per request.

anyOf

boolean
null

boolean

supportsBudget object

Whether the model accepts a reasoning-effort/budget control.

anyOf

boolean
null

boolean

format object

How reasoning is marked in the output stream: none, token_delimited (special tokens), or channel_delimited (a separate reasoning channel).

anyOf

ReasoningFormat
null

ReasoningFormat (string)

Reasoning marker formats used by model families.

Possible values: [none, token_delimited, channel_delimited]

defaultEffort object

Reasoning effort applied when the request does not specify one.

anyOf

string
null

string

Possible values: [none, minimal, low, medium, high, xhigh]

disabledEffort object

The effort value that means "reasoning off" for this model.

anyOf

string
null

string

Possible values: [none, minimal, low, medium, high, xhigh]

modalities object

Optional extra-modality flags (audio input, native multimodal); None falls back to family defaults.

anyOf

ModalitiesCardConfig
null

supportsAudioInput object

Whether the model accepts audio input.

anyOf

boolean
null

boolean

supportsNativeMultimodal object

Whether the model natively interleaves modalities (vs. a bolt-on adapter).

anyOf

boolean
null

boolean

tooling object

Optional tool-calling configuration (support, call format, builtin tools); None falls back to family defaults.

anyOf

ToolingCardConfig
null

supportsToolCalling object

Whether the model supports function/tool calling.

anyOf

boolean
null

boolean

toolCallFormat object

The wire format the model emits tool calls in (generic, gemma4, gpt_oss, dsml), selecting the output parser.

anyOf

ToolCallFormat
null

ToolCallFormat (string)

Tool-call output formats emitted by model families.

Possible values: [generic, gemma4, gpt_oss, dsml]

builtinTools object

Builtin tools Skulk advertises to this model (e.g. web_search, open_url, extract_page).

anyOf

BuiltinToolType (string)[]
null

Array [

BuiltinToolType (string)

Builtin tool contracts that Skulk can advertise to model families.

Possible values: [web_search, open_url, extract_page]

]

runtime object

Optional runtime-behavior configuration (prompt renderer, output parser, MTP/speculative-decoding sidecar, MLX knobs); None falls back to defaults.

anyOf

RuntimeCapabilityCardConfig
null

promptRenderer object

How prompts are rendered for this model (tokenizer chat template, gemma4, dsml); None uses the family default.

anyOf

PromptRendererType
null

PromptRendererType (string)

Prompt renderer strategies supported by the runtime.

Possible values: [tokenizer, gemma4, dsml]

outputParser object

How model output is parsed (generic, gemma4, gpt_oss, deepseek_v32), e.g. for reasoning/tool-call extraction; None uses the family default.

anyOf

OutputParserType
null

OutputParserType (string)

Output parser strategies supported by the runtime.

Possible values: [generic, gemma4, gpt_oss, deepseek_v32]

metalFastSynch object

Per-model override for the MLX MLX_METAL_FAST_SYNCH flag.

None means "no opinion" — fall through to the cluster default selected by the runner. Set explicitly to False for models that deadlock under FAST_SYNCH on the ring backend (e.g. gemma-4 with multimodal load: the Metal command queue wedges in pipeline_last_eval_output, transitively starves WindowServer, and trips the macOS kernel watchdog into a panic). Set explicitly to True for models that have been measured to benefit and are known to be safe under the deployment's collective backend.

anyOf

boolean
null

boolean
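The tri-state semantics (None, False, True) described for this flag can be sketched as below. `resolve_fast_synch` and `cluster_default` are hypothetical names; how the runner actually selects the cluster default is not covered here.

```python
from typing import Optional


def resolve_fast_synch(card_override: Optional[bool], cluster_default: bool) -> bool:
    """An explicit card value wins; None means 'no opinion' and defers to the cluster."""
    return cluster_default if card_override is None else card_override


print(resolve_fast_synch(None, True))    # True: falls through to the cluster default
print(resolve_fast_synch(False, True))   # False: the card opts out explicitly
print(resolve_fast_synch(True, False))   # True: the card opts in explicitly
```

Checking `is None` rather than truthiness matters here: an explicit False must not be mistaken for "unset".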

mtpHeads object

True when native MTP prediction heads are available via sidecar.

Set alongside mtp_sidecar_repo. When false or absent, the runner skips sidecar loading and uses standard autoregressive generation.

anyOf

boolean
null

boolean

mtpMaxDepth object

Maximum draft depth the MTP heads support.

Start at 1 for Apple Silicon. Deeper values can be evaluated via profiling but are unlikely to amortize on Metal due to near-linear verify-pass scaling.

anyOf

integer
null

integer

mtpSidecarRepo object

Hugging Face repo ID containing the published mtp.safetensors sidecar.

Example: "FoxlightAI/qwen3-5-7b-instruct-mtp-q4k". The sidecar is downloaded alongside the base model weights and loaded into the runner for speculative decoding. Produced by SWP.

anyOf

string
null

string

mtpNormConvention object

How the sidecar stores its RMSNorm weights.

"zero_centered" means deviation-from-1 (the raw Qwen3.5 checkpoint convention — the runner applies a +1.0 shift at load, mirroring what mlx-lm's sanitize() does for trunk weights). "actual_scale" means the stored value is the scale itself (DeepSeek convention). None falls through to the family default keyed off the detected sidecar layout. Override per card when a publisher changes conventions — getting this wrong measured 0% draft acceptance on Qwen3.5-2B (issue #192).

anyOf

string
null

string

Possible values: [zero_centered, actual_scale]
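The difference between the two conventions comes down to a +1.0 shift at load time. The sketch below uses plain Python lists in place of real tensors, and `effective_rmsnorm_scale` is a hypothetical helper, not the runner's loader.

```python
from typing import List


def effective_rmsnorm_scale(stored: List[float], convention: str) -> List[float]:
    """Convert stored RMSNorm weights into the scale actually applied."""
    if convention == "zero_centered":
        # Stored values are deviations from 1, so shift them back.
        return [w + 1.0 for w in stored]
    if convention == "actual_scale":
        # Stored values already are the scale.
        return list(stored)
    raise ValueError(f"unknown norm convention: {convention!r}")


print(effective_rmsnorm_scale([0.0, -0.25], "zero_centered"))  # [1.0, 0.75]
print(effective_rmsnorm_scale([1.0, 0.75], "actual_scale"))    # [1.0, 0.75]
```

Applying the wrong convention scales every normalized activation by roughly one unit too much or too little, which is consistent with the draft heads failing to agree with the trunk.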

mtpConcatOrder object

Concatenation order of the MTP fc projection input.

"embed_first" = fc(concat([enorm(embed(t_next)), hnorm(h)])) — verified for Qwen3.5 (72.4% offline agreement, issue #192). "hidden_first" is the inherited DeepSeek assumption (unverified). None falls through to the family default keyed off the detected sidecar layout.

anyOf

string
null

string

Possible values: [embed_first, hidden_first]
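A toy illustration of the two orders, with lists standing in for the already-normalized vectors. `fc_input` is a hypothetical helper; the real projection operates on tensors and is followed by the fc layer.

```python
from typing import List


def fc_input(embed_normed: List[float], hidden_normed: List[float], order: str) -> List[float]:
    """Concatenate the normalized embedding and hidden state in the configured order."""
    if order == "embed_first":
        return embed_normed + hidden_normed
    if order == "hidden_first":
        return hidden_normed + embed_normed
    raise ValueError(f"unknown concat order: {order!r}")


print(fc_input([1.0, 2.0], [9.0], "embed_first"))   # [1.0, 2.0, 9.0]
print(fc_input([1.0, 2.0], [9.0], "hidden_first"))  # [9.0, 1.0, 2.0]
```

Because the fc weights are trained against one specific layout, swapping the order feeds each weight column the wrong half of the input.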

speculativeMultiNode object

Whether speculation may run on multi-node placements of this model.

None (default) places no restriction. Set False for models where multi-node speculation is measured SLOWER than plain distributed decode: the 2026-06-06 benchmark matrix found gemma-4-26B-A4B (MoE) at 30.2 tok/s plain vs 28.2 with MTP on a 2-node pipeline (-7%), while single-node MTP on the same model measures 2.2x — fast sharded MoE decode plus modest acceptance makes the per-round draft+verify overhead net negative. Single-node speculation is unaffected by this knob. The decision is card-driven so every rank makes the same speculate-or-not choice (the distributed agreement collective requires rank symmetry).

anyOf

boolean
null

boolean

assistantModelRepo object

Hugging Face repo ID of a companion assistant (drafter) model.

Gemma 4 does speculative decoding differently from the Qwen3/DeepSeek mtp.* heads: instead of embedded prediction heads, it pairs the target with a separate small gemma4_assistant model (e.g. "mlx-community/gemma-4-26B-A4B-it-assistant-bf16") that cross-attends over the target's KV cache. When set, the assistant repo is downloaded alongside the base model. Mutually exclusive with the mtp_* fields.

NOTE: consuming the assistant for speculative generation requires the gemma4_assistant drafter from mlx-vlm >= 0.5.0 and is not yet wired into the runner — declaring it here only pre-downloads it. See the Gemma 4 MTP initiative in the foxlight-docs hub (Phase C).

anyOf

string
null

string

servedSpecType object

Speculative-decoding mode for the llama_server (served-backend) engine.

Maps to a llama-server --spec-type token in the runner (_SPEC_TYPE_FLAG): draft_mtp -> draft-mtp (the model's own built-in MTP heads, no draft model needed; Qwen3.6/DeepSeek/GLM/Kimi/Nemotron), draft_eagle3 -> draft-eagle3 (an EAGLE-3 head), draft_simple -> draft-simple (a separate draft model), ngram -> ngram-cache (prompt-lookup), and none/None -> plain decoding. Only the served engine reads this; the in-process mlx and llama_cpp engines ignore it (MLX speculation is the mtp_* / assistant_model_repo fields above).

anyOf

string
null

string

Possible values: [none, draft_mtp, draft_eagle3, draft_simple, ngram]
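The mapping listed above can be sketched as a lookup table. This is a reconstruction from the description, not the runner's actual `_SPEC_TYPE_FLAG`; `SPEC_TYPE_FLAG` and `spec_type_args` are hypothetical names, and emitting no flag at all for none/None is an assumption.

```python
from typing import Dict, List, Optional

# Card value -> --spec-type token, as listed in the field description.
SPEC_TYPE_FLAG: Dict[str, str] = {
    "draft_mtp": "draft-mtp",
    "draft_eagle3": "draft-eagle3",
    "draft_simple": "draft-simple",
    "ngram": "ngram-cache",
}


def spec_type_args(served_spec_type: Optional[str]) -> List[str]:
    """Build the command-line fragment for the configured speculative mode."""
    if served_spec_type is None or served_spec_type == "none":
        return []  # assumption: plain decoding passes no --spec-type flag
    if served_spec_type not in SPEC_TYPE_FLAG:
        raise ValueError(f"unsupported servedSpecType: {served_spec_type!r}")
    return ["--spec-type", SPEC_TYPE_FLAG[served_spec_type]]


print(spec_type_args("ngram"))      # ['--spec-type', 'ngram-cache']
print(spec_type_args("draft_mtp"))  # ['--spec-type', 'draft-mtp']
print(spec_type_args(None))         # []
```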

servedSpecNMax object

Max draft tokens per step for the served engine (--spec-draft-n-max).

Must be a positive integer (validated at card load so a bad value fails fast rather than producing an undefined --spec-draft-n-max at the server). None uses the llama-server default (3). Per-position acceptance drops with draft depth, so 2-3 is the usual sweet spot; tune per card from measured acceptance.

anyOf

integer
null

integer

Possible values: > 0

servedSpecDraftRepo object

Hugging Face repo of a separate draft GGUF for the served engine.

Some served speculative modes need a second model passed to llama-server via --model-draft, NOT built-in heads: draft_simple (a vocab-matched small draft model) and draft_eagle3 (an EAGLE-3 head) always require one, and Gemma 4 draft_mtp uses its assistant as a separate draft GGUF (llama.cpp PR #23398) rather than baking heads into the base. Qwen3.6/DeepSeek/GLM draft_mtp leave this unset (heads are in the base GGUF). When set, the draft GGUF is downloaded as a companion alongside the base and passed as --model-draft. Pairs with served_spec_draft_file.

anyOf

string
null

string

servedSpecDraftFile object

Repo-relative GGUF filename of the served draft model (in served_spec_draft_repo), e.g. "mtp-gemma-4-31B-it.gguf". Required when served_spec_draft_repo is set; selects the exact draft quant the runner passes to --model-draft.

anyOf

string
null

string
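The pairing rules for the two draft fields can be expressed as a small validator. This is a sketch of the constraints as described, not Skulk's card validation; `draft_config_errors` is a hypothetical name, and the Gemma 4 draft_mtp case is deliberately left unchecked because it depends on the model family.

```python
from typing import List, Optional

# Modes the description says always need a separate draft model.
MODES_REQUIRING_DRAFT = ("draft_simple", "draft_eagle3")


def draft_config_errors(
    spec_type: Optional[str],
    draft_repo: Optional[str],
    draft_file: Optional[str],
) -> List[str]:
    """Return human-readable problems with the served draft configuration."""
    errors: List[str] = []
    if spec_type in MODES_REQUIRING_DRAFT and not draft_repo:
        errors.append(f"{spec_type} requires servedSpecDraftRepo")
    if draft_repo and not draft_file:
        errors.append("servedSpecDraftFile is required when servedSpecDraftRepo is set")
    return errors


print(draft_config_errors("draft_simple", None, None))
print(draft_config_errors("draft_mtp", "example/draft-repo", None))
print(draft_config_errors("draft_mtp", None, None))  # []: heads live in the base GGUF
```

Returning a list of messages rather than raising on the first problem lets a card author fix everything in one pass.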

placement object

Where the model is allowed to run and which backend is preferred: the compatible_backends hard filter and backend_preference soft score the planner uses to route the model to suitable nodes.

compatibleBackends string[]

Hard constraint: only route to nodes whose advertised backends intersect this set. Making the implicit {"mlx"} explicit is what enables future heterogeneous (llama_cpp / rocm / cuda) routing.

Default value: ["mlx"]

minVramGib object

Hard constraint: planner gates on node available memory when set.

anyOf

number
null

number

maxContextTokens object

Soft: caps the placement-time KV budget check (see #145) when set.

anyOf

integer
null

integer

backendPreference string[]

Soft, ordered preference among the node's backend tags (e.g. ("llama_cpp-vulkan", "llama_cpp-rocm")).

Unlike compatible_backends (a hard filter on which nodes are eligible), this only ranks eligible nodes/devices: the planner prefers a node that advertises an earlier-listed tag, and the runner picks the earliest-listed backend the chosen node actually has. The same model runs on any compatible backend, but their performance differs per model, so this captures "fastest on Vulkan, ROCm is an acceptable fallback" while still degrading gracefully to a node that only offers the fallback. Order is significant and preserved; an empty tuple means no preference (use the node's default).

Default value: []
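The split between the hard filter and the soft ranking can be sketched as two steps. This is not the planner's code: the function names are hypothetical, tags are compared by exact string match (an assumption), and real placement also weighs memory and other constraints.

```python
from typing import Dict, List, Sequence, Set


def eligible_nodes(node_backends: Dict[str, Set[str]], compatible: Set[str]) -> List[str]:
    """Hard filter: keep nodes whose advertised backends intersect the compatible set."""
    return [node for node, tags in node_backends.items() if tags & compatible]


def preference_rank(tags: Set[str], preference: Sequence[str]) -> int:
    """Index of the earliest-listed preferred tag the node has; worst rank if none."""
    for index, tag in enumerate(preference):
        if tag in tags:
            return index
    return len(preference)


def rank_nodes(
    node_backends: Dict[str, Set[str]],
    compatible: Set[str],
    preference: Sequence[str] = (),
) -> List[str]:
    """Filter, then order eligible nodes by preference (stable, so ties keep input order)."""
    nodes = eligible_nodes(node_backends, compatible)
    return sorted(nodes, key=lambda node: preference_rank(node_backends[node], preference))


nodes = {
    "a": {"llama_cpp-rocm"},
    "b": {"llama_cpp-vulkan", "llama_cpp-rocm"},
    "c": {"mlx"},
}
tags = {"llama_cpp-vulkan", "llama_cpp-rocm"}
print(rank_nodes(nodes, tags, ("llama_cpp-vulkan", "llama_cpp-rocm")))  # ['b', 'a']
print(rank_nodes(nodes, tags))                                          # ['a', 'b']
```

Node `c` never appears because it fails the hard filter; node `a` stays eligible as the fallback even though it lacks the preferred backend.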

deviceRank (integer) required

worldSize (integer) required

resolvedBackend object

anyOf

string
null

string

immediateException (boolean)

Default value: false

shouldTimeout object

anyOf

number
null

number

startLayer (integer) required

Possible values: >= 0

endLayer (integer) required

Possible values: >= 0

nLayers (integer) required

Possible values: >= 0
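One way the layer-range fields could be derived for a pipeline placement is sketched below. It assumes `endLayer` is exclusive and that layers are split as evenly as possible with earlier ranks taking the remainder; the planner's real split policy is not documented here, and `split_layers` is a hypothetical helper.

```python
from typing import List, Tuple


def split_layers(n_layers: int, world_size: int) -> List[Tuple[int, int]]:
    """Split n_layers into world_size contiguous [start, end) ranges, one per rank."""
    if world_size <= 0 or n_layers < world_size:
        raise ValueError("need at least one layer per rank")
    base, remainder = divmod(n_layers, world_size)
    ranges: List[Tuple[int, int]] = []
    start = 0
    for rank in range(world_size):
        # Assumption: the first `remainder` ranks each take one extra layer.
        end = start + base + (1 if rank < remainder else 0)
        ranges.append((start, end))
        start = end
    return ranges


print(split_layers(32, 1))  # [(0, 32)]
print(split_layers(32, 3))  # [(0, 11), (11, 22), (22, 32)]
```

Each tuple would correspond to one shard's `startLayer` and `endLayer`, with the list index as its `deviceRank`.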

nodeToRunner object required

property name* string

contextTokenLimit object

anyOf

integer
null

integer

hostsByNode object required

property name* object[]

Array [

ip (string) required

port (integer) required

]

ephemeralPort (integer) required

CreateInstanceParams
{
  "instance": {
    "MlxRingInstance": {
      "ephemeralPort": 52416,
      "hostsByNode": {
        "node-1": []
      },
      "instanceId": "00000000-0000-0000-0000-000000000000",
      "shardAssignments": {
        "modelId": "mlx-community/Llama-3.2-1B-Instruct-4bit",
        "nodeToRunner": {
          "node-1": "runner-1"
        },
        "runnerToShard": {
          "runner-1": {
            "PipelineShardMetadata": {
              "deviceRank": 0,
              "endLayer": 32,
              "modelCard": {
                "hiddenSize": 2048,
                "modelId": "mlx-community/Llama-3.2-1B-Instruct-4bit",
                "nLayers": 32,
                "storageSize": {
                  "inBytes": 2147483648
                },
                "supportsTensor": false,
                "tasks": [
                  "TextGeneration"
                ]
              },
              "nLayers": 32,
              "startLayer": 0,
              "worldSize": 1
            }
          }
        }
      }
    }
  }
}
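The example payload above can also be assembled programmatically. The sketch below builds the same single-node shape with the standard library only; `single_node_ring_params` is a hypothetical helper, and it fills in only the required fields shown in the example.

```python
import json
from typing import Any, Dict


def single_node_ring_params(
    model_id: str,
    n_layers: int,
    hidden_size: int,
    size_bytes: int,
    ephemeral_port: int,
    instance_id: str = "00000000-0000-0000-0000-000000000000",
    node: str = "node-1",
    runner: str = "runner-1",
) -> Dict[str, Any]:
    """Build a CreateInstanceParams body for one node running every layer."""
    card = {
        "modelId": model_id,
        "storageSize": {"inBytes": size_bytes},
        "nLayers": n_layers,
        "hiddenSize": hidden_size,
        "supportsTensor": False,
        "tasks": ["TextGeneration"],
    }
    shard = {
        "PipelineShardMetadata": {
            "modelCard": card,
            "deviceRank": 0,
            "worldSize": 1,
            "startLayer": 0,
            "endLayer": n_layers,  # a single rank covers all layers
            "nLayers": n_layers,
        }
    }
    return {
        "instance": {
            "MlxRingInstance": {
                "instanceId": instance_id,
                "ephemeralPort": ephemeral_port,
                "hostsByNode": {node: []},
                "shardAssignments": {
                    "modelId": model_id,
                    "nodeToRunner": {node: runner},
                    "runnerToShard": {runner: shard},
                },
            }
        }
    }


params = single_node_ring_params(
    "mlx-community/Llama-3.2-1B-Instruct-4bit", 32, 2048, 2147483648, 52416
)
print(json.dumps(params, indent=2, sort_keys=True))
```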