CreateInstanceParams
instance object required
- MlxRingInstance
- MlxJacclInstance
shardAssignments object required
runnerToShard object required
property name* object
- PipelineShardMetadata
- CfgShardMetadata
- TensorShardMetadata
modelCard object required
The persisted, declarative metadata Skulk holds for one model.
This is the model-card interface: the single source of truth for how a
model is sized, sharded, placed, and run. It is created once (from a
HuggingFace repo or hand-authored), broadcast cluster-wide, and read by the
planner (placement), the downloader (sizing + which files to fetch), and the
worker runner (engine + behavior). As a CamelCaseModel it is camelCase on
the wire and strict (extra="forbid"), so every node in a cluster must run
the same Skulk version (a stale node rejects newer fields).
Two layers live here: the card (this declarative metadata) and the
normalized resolved capability profile derived from it plus family defaults
(see capabilities.py and website/docs/model-capabilities.md). The
optional reasoning / modalities / tooling / runtime /
vision / placement sub-configs refine that resolution; when absent,
conservative family defaults apply.
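The version-lockstep consequence of strict parsing can be illustrated with a small stand-in. This is only a sketch in plain Python: `parse_strict` and the field set are hypothetical, and the real card is a CamelCaseModel with extra="forbid", which this does not reproduce.

```python
# Hypothetical subset of fields known to an older node; not the real schema.
KNOWN_FIELDS_OLD_NODE = {"modelId", "storageSize", "nLayers", "hiddenSize"}


def parse_strict(payload: dict, known_fields: set[str]) -> dict:
    """Reject any key the receiving node does not know, like extra="forbid"."""
    unknown = sorted(set(payload) - known_fields)
    if unknown:
        raise ValueError(f"unknown fields rejected: {unknown}")
    return dict(payload)


card_from_newer_node = {
    "modelId": "mlx-community/Qwen3.5-9B-4bit",
    "nLayers": 32,
    "someNewField": True,  # hypothetical field added by a newer version
}

try:
    parse_strict(card_from_newer_node, KNOWN_FIELDS_OLD_NODE)
    rejected = False
except ValueError:
    rejected = True  # the stale node refuses the whole card
```

A lenient parser would silently drop the new field instead, so two nodes could disagree about the same card; strictness trades that risk for a hard failure.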
The model's identifier (a HuggingFace repo id, e.g.
mlx-community/Qwen3.5-9B-4bit, or a custom id). Slashes become -- in
the on-disk store directory name.
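The slash rule above amounts to a one-line mapping. The helper name `store_dir_name` is hypothetical, and the real store may apply further normalization that is not described here.

```python
def store_dir_name(model_id: str) -> str:
    """Map a model id to a directory name by replacing each slash with "--"."""
    return model_id.replace("/", "--")


print(store_dir_name("mlx-community/Qwen3.5-9B-4bit"))
# prints: mlx-community--Qwen3.5-9B-4bit
```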
storageSize object required
On-disk size of the weights this card loads (for a GGUF card, just the selected quant's shard group, not every quant the repo hosts). The planner uses this for memory-fit and placement-width decisions.
Default value: 0
Number of transformer layers. Drives pipeline sharding (how layers split across nodes) and KV-cache sizing.
Possible values: > 0
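To make "how layers split across nodes" concrete, here is a minimal sketch of an even, contiguous split. It is not the planner's actual algorithm, which may weight ranks by available memory; `split_layers` is a hypothetical helper, and the end-exclusive convention is inferred from the request example on this page (startLayer 0, endLayer 32 for 32 layers).

```python
def split_layers(n_layers: int, world_size: int) -> list[tuple[int, int]]:
    """Return one (start_layer, end_layer) pair per rank, end exclusive."""
    if n_layers <= 0 or world_size <= 0:
        raise ValueError("n_layers and world_size must be positive")
    if world_size > n_layers:
        raise ValueError("cannot give every rank at least one layer")
    base, extra = divmod(n_layers, world_size)
    ranges = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional layer each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


print(split_layers(32, 3))
# prints: [(0, 11), (11, 22), (22, 32)]
```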
Model hidden dimension, used in memory/KV-cache estimates.
Possible values: > 0
Whether the model may be served with tensor parallelism (Sharding.Tensor).
GGUF/llama.cpp cards set this False (single-node engine).
numKeyValueHeads object
KV-head count for grouped-query attention, used in KV-cache sizing. None
when unknown/not applicable.
- integer
- null
Possible values: > 0
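For orientation, the layer count and KV-head count enter a KV-cache estimate roughly as follows. This is the standard dense-attention formula, not necessarily the estimator Skulk uses; `head_dim` and `bytes_per_element` are assumptions, since neither is a field described on this page.

```python
def kv_cache_bytes(
    n_layers: int,
    num_kv_heads: int,
    head_dim: int,            # assumption: per-head dimension, not a card field
    context_tokens: int,
    bytes_per_element: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Estimate KV-cache size: one K and one V tensor per layer."""
    return 2 * n_layers * num_kv_heads * head_dim * context_tokens * bytes_per_element


size = kv_cache_bytes(n_layers=32, num_kv_heads=8, head_dim=128, context_tokens=4096)
print(size)
# prints: 536870912  (512 MiB)
```

Grouped-query attention shrinks the cache because `num_kv_heads` is smaller than the number of query heads; halving it halves the estimate.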
The task types this model serves (TextGeneration, TextEmbedding,
TextToImage, ImageToImage); selects which runner handles it.
Possible values: [TextGeneration, TextToImage, ImageToImage, TextEmbedding]
components object
For multi-component models (e.g. a diffusion stack), the per-component
weight layout. None for a single-weights model.
- object[]
- null
Logical name of this component (e.g. text_encoder, transformer).
Repo-relative subdirectory holding this component's weights.
storageSize object required
On-disk size of this component's weights.
Default value: 0
nLayers object
Layer count for this component when it is shardable; None otherwise.
- integer
- null
Possible values: > 0
Whether this component may be split across nodes (vs. loaded whole).
safetensorsIndexFilename object
The component's *.safetensors.index.json filename when sharded across
files; None for a single-file component.
- string
- null
Model family token (e.g. qwen3, gemma4) used to pick family-specific
defaults during capability resolution. Empty when not classified.
Human quantization label (e.g. 4bit, Q4_K_M); informational.
The upstream base model id when this is a quant/finetune of another; empty if not applicable.
ggufFile object
For GGUF (llama.cpp) models: the repo-relative path of the weights file the
runner loads (the selected quant's first shard). Resolved once at card creation
(preferring a quant over BF16) so the download fetches only that quant and the
runner loads deterministically, instead of each layer re-globbing/guessing.
None for non-GGUF (safetensors/MLX) cards.
- string
- null
Free-form capability tags carried for compatibility/auxiliary use; the
structured reasoning/modalities/tooling configs are authoritative
for capability resolution.
Default value: []
The model's advertised maximum context length in tokens (0 if unknown).
The admission ceiling is the smaller of this and what fits in memory.
Default value: 0
Whether the model uses classifier-free guidance (relevant to some image / diffusion models).
Default value: false
Passed to the model loader: whether to execute the repo's custom Python.
Defaults True to match upstream loaders; set False to refuse it.
Default value: true
Marks an operator-added custom card (not from the curated catalog). Excluded from the persisted card file so it is recomputed per environment.
Default value: false
vision object
Optional vision (image-input) configuration; None for text-only models.
- VisionCardConfig
- null
imageTokenId object
Token id the model uses as the image placeholder in the prompt. Required by
the MLX vision path (which splices image embeddings at this token); None
is allowed for a llama.cpp-only vision GGUF, whose chat handler inserts image
features itself and never reads this. MLX cards always set it (from
config.json).
- integer
- null
Vision model-type tag (from config.json's vision_config), selecting
the image processor (MLX) or chat handler (llama.cpp). Empty when a bare GGUF
repo only signals vision via its mmproj projector; the llama.cpp runner
then falls back to its general multimodal handler.
Repo holding the vision-tower weights when separate from the LM; empty if bundled with the main weights.
imageToken object
The literal image placeholder string, when distinct from image_token_id.
- string
- null
processorRepo object
Repo providing the image processor/preprocessor config, if not the main repo.
- string
- null
boiTokenId object
Begin-of-image token id, for families that bracket image spans.
- integer
- null
eoiTokenId object
End-of-image token id, for families that bracket image spans.
- integer
- null
reasoning object
Optional reasoning/thinking configuration (toggle, budget, format, default
effort); None falls back to family defaults.
- ReasoningCardConfig
- null
supportsToggle object
Whether the model can have reasoning turned on/off per request.
- boolean
- null
supportsBudget object
Whether the model accepts a reasoning-effort/budget control.
- boolean
- null
format object
How reasoning is marked in the output stream: none, token_delimited
(special tokens), or channel_delimited (a separate reasoning channel).
- ReasoningFormat
- null
Reasoning marker formats used by model families.
Possible values: [none, token_delimited, channel_delimited]
defaultEffort object
Reasoning effort applied when the request does not specify one.
- string
- null
Possible values: [none, minimal, low, medium, high, xhigh]
disabledEffort object
The effort value that means "reasoning off" for this model.
- string
- null
Possible values: [none, minimal, low, medium, high, xhigh]
modalities object
Optional extra-modality flags (audio input, native multimodal); None
falls back to family defaults.
- ModalitiesCardConfig
- null
supportsAudioInput object
Whether the model accepts audio input.
- boolean
- null
supportsNativeMultimodal object
Whether the model natively interleaves modalities (vs. a bolt-on adapter).
- boolean
- null
tooling object
Optional tool-calling configuration (support, call format, builtin tools);
None falls back to family defaults.
- ToolingCardConfig
- null
supportsToolCalling object
Whether the model supports function/tool calling.
- boolean
- null
toolCallFormat object
The wire format the model emits tool calls in (generic, gemma4,
gpt_oss, dsml), selecting the output parser.
- ToolCallFormat
- null
Tool-call output formats emitted by model families.
Possible values: [generic, gemma4, gpt_oss, dsml]
builtinTools object
Builtin tools Skulk advertises to this model (e.g. web_search,
open_url, extract_page).
- BuiltinToolType (string)[]
- null
Builtin tool contracts that Skulk can advertise to model families.
Possible values: [web_search, open_url, extract_page]
runtime object
Optional runtime-behavior configuration (prompt renderer, output parser,
MTP/speculative-decoding sidecar, MLX knobs); None falls back to defaults.
- RuntimeCapabilityCardConfig
- null
promptRenderer object
How prompts are rendered for this model (tokenizer chat template,
gemma4, dsml); None uses the family default.
- PromptRendererType
- null
Prompt renderer strategies supported by the runtime.
Possible values: [tokenizer, gemma4, dsml]
outputParser object
How model output is parsed (generic, gemma4, gpt_oss,
deepseek_v32), e.g. for reasoning/tool-call extraction; None uses the
family default.
- OutputParserType
- null
Output parser strategies supported by the runtime.
Possible values: [generic, gemma4, gpt_oss, deepseek_v32]
metalFastSynch object
Per-model override for the MLX MLX_METAL_FAST_SYNCH flag.
None means "no opinion" — fall through to the cluster default
selected by the runner. Set explicitly to False for models that
deadlock under FAST_SYNCH on the ring backend (e.g. gemma-4 with
multimodal load: the Metal command queue wedges in
pipeline_last_eval_output, transitively starves WindowServer,
and trips the macOS kernel watchdog into a panic). Set explicitly
to True for models that have been measured to benefit and are
known to be safe under the deployment's collective backend.
- boolean
- null
mtpHeads object
True when native MTP prediction heads are available via sidecar.
Set alongside mtp_sidecar_repo. When false or absent, the runner
skips sidecar loading and uses standard autoregressive generation.
- boolean
- null
mtpMaxDepth object
Maximum draft depth the MTP heads support.
Start at 1 for Apple Silicon. Deeper values can be evaluated via profiling but are unlikely to amortize on Metal due to near-linear verify-pass scaling.
- integer
- null
mtpSidecarRepo object
Hugging Face repo ID containing the published mtp.safetensors sidecar.
Example: "FoxlightAI/qwen3-5-7b-instruct-mtp-q4k"
The sidecar is downloaded alongside the base model weights and loaded
into the runner for speculative decoding. Produced by SWP.
- string
- null
mtpNormConvention object
How the sidecar stores its RMSNorm weights.
"zero_centered" means deviation-from-1 (the raw Qwen3.5 checkpoint
convention — the runner applies a +1.0 shift at load, mirroring what
mlx-lm's sanitize() does for trunk weights). "actual_scale" means
the stored value is the scale itself (DeepSeek convention). None
falls through to the family default keyed off the detected sidecar
layout. Override per card when a publisher changes conventions — getting
this wrong measured 0% draft acceptance on Qwen3.5-2B (issue #192).
- string
- null
Possible values: [zero_centered, actual_scale]
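The two conventions differ only in whether a +1.0 shift is applied at load. A minimal sketch, using plain lists in place of tensors; `effective_rmsnorm_scale` is a hypothetical name, and the None case (family default) is deliberately left out.

```python
def effective_rmsnorm_scale(stored: list[float], convention: str) -> list[float]:
    """Turn stored sidecar RMSNorm weights into the scale applied at run time."""
    if convention == "zero_centered":
        # Stored value is deviation-from-1, so shift by +1.0 at load.
        return [w + 1.0 for w in stored]
    if convention == "actual_scale":
        # Stored value already is the scale.
        return list(stored)
    raise ValueError(f"unknown norm convention: {convention!r}")


print(effective_rmsnorm_scale([0.0, -0.25, 0.5], "zero_centered"))
# prints: [1.0, 0.75, 1.5]
```

Reading zero-centered weights as actual scales would multiply activations by values near 0 rather than near 1, which is consistent with the collapse in draft acceptance noted above.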
mtpConcatOrder object
Concatenation order of the MTP fc projection input.
"embed_first" = fc(concat([enorm(embed(t_next)), hnorm(h)])) —
verified for Qwen3.5 (72.4% offline agreement, issue #192).
"hidden_first" is the inherited DeepSeek assumption (unverified).
None falls through to the family default keyed off the detected
sidecar layout.
- string
- null
Possible values: [embed_first, hidden_first]
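The two orders only change which half of the fc input comes first. A sketch with lists standing in for already-normalized vectors; `mtp_fc_input` is a hypothetical name, and the norms and the fc projection itself are not shown.

```python
def mtp_fc_input(
    normed_embed: list[float],   # stands in for enorm(embed(t_next))
    normed_hidden: list[float],  # stands in for hnorm(h)
    concat_order: str,
) -> list[float]:
    """Build the concatenated input that is fed to the MTP fc projection."""
    if concat_order == "embed_first":
        return list(normed_embed) + list(normed_hidden)
    if concat_order == "hidden_first":
        return list(normed_hidden) + list(normed_embed)
    raise ValueError(f"unknown concat order: {concat_order!r}")


print(mtp_fc_input([1.0, 2.0], [9.0, 8.0], "embed_first"))
# prints: [1.0, 2.0, 9.0, 8.0]
```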
speculativeMultiNode object
Whether speculation may run on multi-node placements of this model.
None (default) places no restriction. Set False for models
where multi-node speculation is measured SLOWER than plain distributed
decode: the 2026-06-06 benchmark matrix found gemma-4-26B-A4B (MoE)
at 30.2 tok/s plain vs 28.2 with MTP on a 2-node pipeline (-7%), while
single-node MTP on the same model measures 2.2x — fast sharded MoE
decode plus modest acceptance makes the per-round draft+verify
overhead net negative. Single-node speculation is unaffected by this
knob. The decision is card-driven so every rank makes the same
speculate-or-not choice (the distributed agreement collective requires
rank symmetry).
- boolean
- null
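Because the decision depends only on the card value and the placement width, every rank can compute it independently and reach the same answer. A sketch of that rule, assuming a hypothetical `may_speculate` helper and ignoring whether a drafter is configured at all:

```python
from typing import Optional


def may_speculate(world_size: int, speculative_multi_node: Optional[bool]) -> bool:
    """Return whether speculation is allowed for a placement of `world_size` nodes."""
    if world_size <= 1:
        return True  # single-node speculation is unaffected by this knob
    # None places no restriction; only an explicit False disables multi-node.
    return speculative_multi_node is not False


print(may_speculate(2, False), may_speculate(1, False), may_speculate(2, None))
# prints: False True True
```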
assistantModelRepo object
Hugging Face repo ID of a companion assistant (drafter) model.
Gemma 4 does speculative decoding differently from the Qwen3/DeepSeek
mtp.* heads: instead of embedded prediction heads, it pairs the target
with a separate small gemma4_assistant model (e.g.
"mlx-community/gemma-4-26B-A4B-it-assistant-bf16") that cross-attends
over the target's KV cache. When set, the assistant repo is downloaded
alongside the base model. Mutually exclusive with the mtp_* fields.
NOTE: consuming the assistant for speculative generation requires the
gemma4_assistant drafter from mlx-vlm >= 0.5.0 and is not yet wired into
the runner — declaring it here only pre-downloads it. See the Gemma 4 MTP
initiative in the foxlight-docs hub (Phase C).
- string
- null
servedSpecType object
Speculative-decoding mode for the llama_server (served-backend) engine.
Maps to a llama-server --spec-type token in the runner
(_SPEC_TYPE_FLAG): draft_mtp -> draft-mtp (the model's own built-in
MTP heads, no draft model needed; Qwen3.6/DeepSeek/GLM/Kimi/Nemotron),
draft_eagle3 -> draft-eagle3 (an EAGLE-3 head), draft_simple ->
draft-simple (a separate draft model), ngram -> ngram-cache
(prompt-lookup), none/None -> plain decoding. Only the served engine reads
this; the in-process mlx and llama_cpp engines ignore it (MLX
speculation is the mtp_* / assistant_model_repo fields above).
- string
- null
Possible values: [none, draft_mtp, draft_eagle3, draft_simple, ngram]
servedSpecNMax object
Max draft tokens per step for the served engine (--spec-draft-n-max).
Must be a positive integer (validated at card load so a bad value fails fast
rather than producing an undefined --spec-draft-n-max at the server).
None uses the llama-server default (3). Acceptance falls off with depth
(per-position acceptance drops), so 2-3 is the usual sweet spot; tune per
card from measured acceptance.
- integer
- null
Possible values: > 0
servedSpecDraftRepo object
Hugging Face repo of a separate draft GGUF for the served engine.
Some served speculative modes need a second model passed to llama-server via
--model-draft, NOT built-in heads: draft_simple (a vocab-matched small
draft model) and draft_eagle3 (an EAGLE-3 head) always require one, and
Gemma 4 draft_mtp uses its assistant as a separate draft GGUF (llama.cpp
PR #23398) rather than baking heads into the base. Qwen3.6/DeepSeek/GLM
draft_mtp leave this unset (heads are in the base GGUF). When set, the
draft GGUF is downloaded as a companion alongside the base and passed as
--model-draft. Pairs with served_spec_draft_file.
- string
- null
servedSpecDraftFile object
Repo-relative GGUF filename of the served draft model (in
served_spec_draft_repo), e.g. "mtp-gemma-4-31B-it.gguf". Required when
served_spec_draft_repo is set; selects the exact draft quant the runner
passes to --model-draft.
- string
- null
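Taken together, the served_spec_* fields could translate into llama-server arguments along these lines. This is a sketch, not the runner's code: `served_spec_args` is a hypothetical helper, the mapping is copied from the description of _SPEC_TYPE_FLAG above, and argument order and the draft-file path resolution are assumptions.

```python
from typing import Optional

# Card value -> --spec-type token, as listed in the servedSpecType description.
SPEC_TYPE_FLAG = {
    "draft_mtp": "draft-mtp",
    "draft_eagle3": "draft-eagle3",
    "draft_simple": "draft-simple",
    "ngram": "ngram-cache",
}


def served_spec_args(
    spec_type: Optional[str],
    n_max: Optional[int] = None,
    draft_path: Optional[str] = None,  # assumption: local path of the draft GGUF
) -> list[str]:
    """Build the speculative-decoding arguments for the served engine."""
    if spec_type is None or spec_type == "none":
        return []  # plain decoding
    if spec_type not in SPEC_TYPE_FLAG:
        raise ValueError(f"unknown served spec type: {spec_type!r}")
    args = ["--spec-type", SPEC_TYPE_FLAG[spec_type]]
    if n_max is not None:
        if n_max <= 0:
            raise ValueError("servedSpecNMax must be a positive integer")
        args += ["--spec-draft-n-max", str(n_max)]
    if spec_type in ("draft_simple", "draft_eagle3") and draft_path is None:
        raise ValueError(f"{spec_type} requires a separate draft model")
    if draft_path is not None:
        args += ["--model-draft", draft_path]
    return args


print(served_spec_args("draft_mtp", n_max=2))
# prints: ['--spec-type', 'draft-mtp', '--spec-draft-n-max', '2']
```

Leaving `n_max` as None emits no flag, so the server default applies, matching the None behavior described for servedSpecNMax.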
placement object
Where the model is allowed to run and which backend is preferred: the
compatible_backends hard filter and backend_preference soft score the
planner uses to route the model to suitable nodes.
Hard constraint: only route to nodes whose advertised backends intersect
this set. Making the implicit {"mlx"} explicit is what enables future
heterogeneous (llama_cpp / rocm / cuda) routing.
Default value: ["mlx"]
minVramGib object
Hard constraint: planner gates on node available memory when set.
- number
- null
maxContextTokens object
Soft: caps the placement-time KV budget check (see #145) when set.
- integer
- null
Soft, ordered preference among the node's backend tags (e.g.
("llama_cpp-vulkan", "llama_cpp-rocm")).
Unlike compatible_backends (a hard filter on which nodes are eligible),
this only ranks eligible nodes/devices: the planner prefers a node that
advertises an earlier-listed tag, and the runner picks the earliest-listed
backend the chosen node actually has. The same model runs on any compatible
backend, but their performance differs per model, so this captures "fastest
on Vulkan, ROCm is an acceptable fallback" while still degrading gracefully
to a node that only offers the fallback. Order is significant and preserved;
an empty tuple means no preference (use the node's default).
Default value: []
resolvedBackend object
- string
- null
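The difference between the compatible_backends hard filter and the backend_preference soft ordering described under placement can be sketched as below. The helper names and the example cluster are hypothetical, and treating a node's advertised backends and its backend tags as two separate sets is an assumption; the planner's real scoring is not shown on this page.

```python
from typing import Optional, Sequence


def is_eligible(node_backends: set[str], compatible_backends: Sequence[str]) -> bool:
    """Hard filter: the node must advertise at least one compatible backend."""
    return bool(node_backends & set(compatible_backends))


def preference_rank(node_tags: set[str], backend_preference: Sequence[str]) -> int:
    """Soft score: index of the earliest preferred tag the node has (lower wins)."""
    for index, tag in enumerate(backend_preference):
        if tag in node_tags:
            return index
    return len(backend_preference)  # no preferred tag: ranked last, still eligible


def pick_backend(node_tags: set[str], backend_preference: Sequence[str]) -> Optional[str]:
    """Runner side: earliest-listed tag the chosen node actually has, else None."""
    for tag in backend_preference:
        if tag in node_tags:
            return tag
    return None  # caller falls back to the node's default backend


# Hypothetical cluster: node name -> (advertised backends, backend tags).
nodes = {
    "node-a": ({"llama_cpp"}, {"llama_cpp-rocm"}),
    "node-b": ({"llama_cpp"}, {"llama_cpp-vulkan", "llama_cpp-rocm"}),
    "node-c": ({"mlx"}, {"mlx"}),
}
compatible_backends = ["llama_cpp"]
backend_preference = ("llama_cpp-vulkan", "llama_cpp-rocm")

candidates = [name for name, (backends, _) in nodes.items()
              if is_eligible(backends, compatible_backends)]
ranked = sorted(candidates,
                key=lambda name: preference_rank(nodes[name][1], backend_preference))
print(ranked)
# prints: ['node-b', 'node-a']
```

Here node-c is excluded outright, node-b wins because it offers Vulkan, and node-a remains a usable fallback on which `pick_backend` would select "llama_cpp-rocm".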
Default value: false
shouldTimeout object
- number
- null
Possible values: >= 0
Possible values: >= 0
Possible values: >= 0
nodeToRunner object required
contextTokenLimit object
- integer
- null
hostsByNode object required
property name* object[]
{
"instance": {
"MlxRingInstance": {
"ephemeralPort": 52416,
"hostsByNode": {
"node-1": []
},
"instanceId": "00000000-0000-0000-0000-000000000000",
"shardAssignments": {
"modelId": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"nodeToRunner": {
"node-1": "runner-1"
},
"runnerToShard": {
"runner-1": {
"PipelineShardMetadata": {
"deviceRank": 0,
"endLayer": 32,
"modelCard": {
"hiddenSize": 2048,
"modelId": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"nLayers": 32,
"storageSize": {
"inBytes": 2147483648
},
"supportsTensor": false,
"tasks": [
"TextGeneration"
]
},
"nLayers": 32,
"startLayer": 0,
"worldSize": 1
}
}
}
}
}
}
}