
TensorShardMetadata

modelCard object, required

The persisted, declarative metadata Skulk holds for one model.

This is the model-card interface: the single source of truth for how a model is sized, sharded, placed, and run. It is created once (from a HuggingFace repo or hand-authored), broadcast cluster-wide, and read by the planner (placement), the downloader (sizing + which files to fetch), and the worker runner (engine + behavior). As a CamelCaseModel it is camelCase on the wire and strict (extra="forbid"), so every node in a cluster must run the same Skulk version (a stale node rejects newer fields).
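The strictness described above can be illustrated with a small standard-library sketch. It is not Skulk's CamelCaseModel; the `ALLOWED` set is a hypothetical three-field subset and `parse_strict` is an illustrative helper. It shows only the observable behavior: camelCase keys on the wire, and rejection of any field the receiving node does not know.

```python
import re

# Hypothetical subset of wire fields known to this (older) node.
ALLOWED = {"modelId", "nLayers", "hiddenSize"}


def to_snake(name: str) -> str:
    """Convert a camelCase wire key to a snake_case attribute name."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def parse_strict(payload: dict) -> dict:
    """Reject unknown keys, mimicking the effect of extra="forbid"."""
    unknown = set(payload) - ALLOWED
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return {to_snake(key): value for key, value in payload.items()}


print(parse_strict({"modelId": "mlx-community/Qwen3.5-9B-4bit", "nLayers": 32}))

try:
    # A card broadcast by a newer node carries a field this node has never seen.
    parse_strict({"modelId": "x", "someNewerField": 1})
except ValueError as err:
    print("stale node rejects card:", err)
```

This is why mixed-version clusters fail loudly instead of silently dropping fields.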

Two layers live here: the card (this declarative metadata) and the normalized resolved capability profile derived from it plus family defaults (see capabilities.py and website/docs/model-capabilities.md). The optional reasoning / modalities / tooling / runtime / vision / placement sub-configs refine that resolution; when absent, conservative family defaults apply.

modelId (string), required

The model's identifier (a HuggingFace repo id, e.g. mlx-community/Qwen3.5-9B-4bit, or a custom id). Slashes become -- in the on-disk store directory name.
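As a minimal sketch of the directory-name rule above (the helper name `store_dir_name` is hypothetical, and the real store may apply further normalization):

```python
def store_dir_name(model_id: str) -> str:
    """Replace every slash in a model id with a double hyphen."""
    return model_id.replace("/", "--")


print(store_dir_name("mlx-community/Qwen3.5-9B-4bit"))
# prints: mlx-community--Qwen3.5-9B-4bit
```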

storageSize object, required

On-disk size of the weights this card loads (for a GGUF card, just the selected quant's shard group, not every quant the repo hosts). The planner uses this for memory-fit and placement-width decisions.

inBytes (integer)
Default value: 0
nLayers (integer), required

Number of transformer layers. Drives pipeline sharding (how layers split across nodes) and KV-cache sizing.

Possible values: > 0
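To make the pipeline-sharding role of nLayers concrete, here is a rough sketch of one way to divide layers across ranks. The even split with the remainder going to the lowest ranks, and the half-open `[start, end)` ranges, are assumptions for illustration; the planner's actual policy (for example, memory-weighted splits) is not described here.

```python
def split_layers(n_layers: int, world_size: int) -> list[tuple[int, int]]:
    """Return one half-open (start_layer, end_layer) range per rank."""
    if n_layers <= 0 or world_size <= 0:
        raise ValueError("n_layers and world_size must be positive")
    if world_size > n_layers:
        raise ValueError("cannot give every rank at least one layer")
    base, extra = divmod(n_layers, world_size)
    ranges = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional layer each.
        count = base + (1 if rank < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


print(split_layers(32, 3))
# prints: [(0, 11), (11, 22), (22, 32)]
```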

hiddenSize (integer), required

Model hidden dimension, used in memory/KV-cache estimates.

Possible values: > 0

supportsTensor (boolean), required

Whether the model may be served with tensor parallelism (Sharding.Tensor). GGUF/llama.cpp cards set this False (single-node engine).

numKeyValueHeads object

KV-head count for grouped-query attention, used in KV-cache sizing. None when unknown/not applicable.

anyOf
integer

Possible values: > 0
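For orientation, the textbook KV-cache size for standard attention is two tensors (keys and values) per layer, each of shape tokens x KV heads x head dimension. The sketch below applies that formula. It is not Skulk's estimator: `head_dim` and `bytes_per_element` are not card fields and are passed in as assumptions.

```python
def kv_cache_bytes(
    n_layers: int,
    num_key_value_heads: int,
    head_dim: int,           # assumption: not part of the card, supplied by the caller
    context_tokens: int,
    bytes_per_element: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Estimate KV-cache size: keys plus values for every layer and token."""
    return (
        2
        * n_layers
        * num_key_value_heads
        * head_dim
        * context_tokens
        * bytes_per_element
    )


size = kv_cache_bytes(n_layers=32, num_key_value_heads=8, head_dim=128, context_tokens=4096)
print(size, "bytes =", size // 2**20, "MiB")
# prints: 536870912 bytes = 512 MiB
```

Grouped-query attention shrinks the cache in proportion to the KV-head count, which is why the card records it separately from the hidden size.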

tasks ModelTask (string)[], required

The task types this model serves (TextGeneration, TextEmbedding, TextToImage, ImageToImage); selects which runner handles it.

Possible values: [TextGeneration, TextToImage, ImageToImage, TextEmbedding]

components object

For multi-component models (e.g. a diffusion stack), the per-component weight layout. None for a single-weights model.

anyOf
  • Array [
  • componentName (string), required

    Logical name of this component (e.g. text_encoder, transformer).

    componentPath (string), required

    Repo-relative subdirectory holding this component's weights.

    storageSize object, required

    On-disk size of this component's weights.

    inBytes (integer)
    Default value: 0
    nLayers object

    Layer count for this component when it is shardable; None otherwise.

    anyOf
    integer

    Possible values: > 0

    canShard (boolean), required

    Whether this component may be split across nodes (vs. loaded whole).

    safetensorsIndexFilename object

    The component's *.safetensors.index.json filename when sharded across files; None for a single-file component.

    anyOf
    string
  • ]
    family (string)

    Model family token (e.g. qwen3, gemma4) used to pick family-specific defaults during capability resolution. Empty when not classified.

    Default value: ""
    quantization (string)

    Human-readable quantization label (e.g. 4bit, Q4_K_M); informational.

    Default value: ""
    baseModel (string)

    The upstream base model id when this is a quant/finetune of another; empty if not applicable.

    Default value: ""
    ggufFile object

    For GGUF (llama.cpp) models: the repo-relative path of the weights file the runner loads (the selected quant's first shard). Resolved once at card creation (preferring a quant over BF16) so the download fetches only that quant and the runner loads deterministically, instead of each layer re-globbing/guessing. None for non-GGUF (safetensors/MLX) cards.

    anyOf
    string
    capabilities string[]

    Free-form capability tags carried for compatibility/auxiliary use; the structured reasoning/modalities/tooling configs are authoritative for capability resolution.

    Default value: []
    contextLength (integer)

    The model's advertised maximum context length in tokens (0 if unknown). The admission ceiling is the smaller of this and what fits in memory.

    Default value: 0
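The admission rule above amounts to a minimum of two limits. A minimal sketch follows, assuming the memory-derived limit is already computed elsewhere and that an unknown context length (0) leaves only the memory bound; the second point is an assumption, not something stated by the card.

```python
def admission_ceiling(context_length: int, memory_fit_tokens: int) -> int:
    """Return the largest context (in tokens) a request may be admitted with."""
    if context_length <= 0:
        # Assumption: 0 means "unknown", so only the memory bound applies.
        return memory_fit_tokens
    return min(context_length, memory_fit_tokens)


print(admission_ceiling(131072, 40000))  # memory-bound
print(admission_ceiling(8192, 40000))    # bound by the advertised length
```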
    usesCfg (boolean)

    Whether the model uses classifier-free guidance (relevant to some image / diffusion models).

    Default value: false
    trustRemoteCode (boolean)

    Passed to the model loader: whether to execute the repo's custom Python. Defaults to True to match upstream loaders; set False to refuse it.

    Default value: true
    isCustom (boolean)

    Marks an operator-added custom card (not from the curated catalog). Excluded from the persisted card file so it is recomputed per environment.

    Default value: false
    vision object

    Optional vision (image-input) configuration; None for text-only models.

    anyOf
    imageTokenId object

    Token id the model uses as the image placeholder in the prompt. Required by the MLX vision path (which splices image embeddings at this token); None is allowed for a llama.cpp-only vision GGUF, whose chat handler inserts image features itself and never reads this. MLX cards always set it (from config.json).

    anyOf
    integer
    modelType (string)

    Vision model-type tag (from config.json's vision_config), selecting the image processor (MLX) or chat handler (llama.cpp). Empty when a bare GGUF repo only signals vision via its mmproj projector; the llama.cpp runner then falls back to its general multimodal handler.

    Default value: ""
    weightsRepo (string)

    Repo holding the vision-tower weights when separate from the LM; empty if bundled with the main weights.

    Default value: ""
    imageToken object

    The literal image placeholder string, when distinct from image_token_id.

    anyOf
    string
    processorRepo object

    Repo providing the image processor/preprocessor config, if not the main repo.

    anyOf
    string
    boiTokenId object

    Begin-of-image token id, for families that bracket image spans.

    anyOf
    integer
    eoiTokenId object

    End-of-image token id, for families that bracket image spans.

    anyOf
    integer
    reasoning object

    Optional reasoning/thinking configuration (toggle, budget, format, default effort); None falls back to family defaults.

    anyOf
    supportsToggle object

    Whether the model can have reasoning turned on/off per request.

    anyOf
    boolean
    supportsBudget object

    Whether the model accepts a reasoning-effort/budget control.

    anyOf
    boolean
    format object

    How reasoning is marked in the output stream: none, token_delimited (special tokens), or channel_delimited (a separate reasoning channel).

    anyOf
    ReasoningFormat (string)

    Reasoning marker formats used by model families.

    Possible values: [none, token_delimited, channel_delimited]

    defaultEffort object

    Reasoning effort applied when the request does not specify one.

    anyOf
    string

    Possible values: [none, minimal, low, medium, high, xhigh]

    disabledEffort object

    The effort value that means "reasoning off" for this model.

    anyOf
    string

    Possible values: [none, minimal, low, medium, high, xhigh]

    modalities object

    Optional extra-modality flags (audio input, native multimodal); None falls back to family defaults.

    anyOf
    supportsAudioInput object

    Whether the model accepts audio input.

    anyOf
    boolean
    supportsNativeMultimodal object

    Whether the model natively interleaves modalities (vs. a bolt-on adapter).

    anyOf
    boolean
    tooling object

    Optional tool-calling configuration (support, call format, builtin tools); None falls back to family defaults.

    anyOf
    supportsToolCalling object

    Whether the model supports function/tool calling.

    anyOf
    boolean
    toolCallFormat object

    The wire format the model emits tool calls in (generic, gemma4, gpt_oss, dsml), selecting the output parser.

    anyOf
    ToolCallFormat (string)

    Tool-call output formats emitted by model families.

    Possible values: [generic, gemma4, gpt_oss, dsml]

    builtinTools object

    Builtin tools Skulk advertises to this model (e.g. web_search, open_url, extract_page).

    anyOf
  • Array [
  • BuiltinToolType (string)

    Builtin tool contracts that Skulk can advertise to model families.

    Possible values: [web_search, open_url, extract_page]

  • ]
    runtime object

    Optional runtime-behavior configuration (prompt renderer, output parser, MTP/speculative-decoding sidecar, MLX knobs); None falls back to defaults.

    anyOf
    promptRenderer object

    How prompts are rendered for this model (tokenizer chat template, gemma4, dsml); None uses the family default.

    anyOf
    PromptRendererType (string)

    Prompt renderer strategies supported by the runtime.

    Possible values: [tokenizer, gemma4, dsml]

    outputParser object

    How model output is parsed (generic, gemma4, gpt_oss, deepseek_v32), e.g. for reasoning/tool-call extraction; None uses the family default.

    anyOf
    OutputParserType (string)

    Output parser strategies supported by the runtime.

    Possible values: [generic, gemma4, gpt_oss, deepseek_v32]

    metalFastSynch object

    Per-model override for the MLX MLX_METAL_FAST_SYNCH flag.

    None means "no opinion" — fall through to the cluster default selected by the runner. Set explicitly to False for models that deadlock under FAST_SYNCH on the ring backend (e.g. gemma-4 with multimodal load: the Metal command queue wedges in pipeline_last_eval_output, transitively starves WindowServer, and trips the macOS kernel watchdog into a panic). Set explicitly to True for models that have been measured to benefit and are known to be safe under the deployment's collective backend.

    anyOf
    boolean
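The tri-state semantics of metalFastSynch (None, False, True) can be sketched as a simple fall-through. The function name and the `cluster_default` parameter are hypothetical; how the runner actually chooses its cluster default is not covered here.

```python
from typing import Optional


def resolve_fast_synch(card_value: Optional[bool], cluster_default: bool) -> bool:
    """An explicit card value wins; None defers to the cluster default."""
    return cluster_default if card_value is None else card_value


print(resolve_fast_synch(None, True))   # no opinion: cluster default applies
print(resolve_fast_synch(False, True))  # card forces the flag off
```

Testing against `None` rather than truthiness matters: an explicit False must override a True cluster default.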
    mtpHeads object

    True when native MTP prediction heads are available via sidecar.

    Set alongside mtp_sidecar_repo. When false or absent, the runner skips sidecar loading and uses standard autoregressive generation.

    anyOf
    boolean
    mtpMaxDepth object

    Maximum draft depth the MTP heads support.

    Start at 1 for Apple Silicon. Deeper values can be evaluated via profiling but are unlikely to amortize on Metal due to near-linear verify-pass scaling.

    anyOf
    integer
    mtpSidecarRepo object

    Hugging Face repo ID containing the published mtp.safetensors sidecar.

    Example: "FoxlightAI/qwen3-5-7b-instruct-mtp-q4k". The sidecar is downloaded alongside the base model weights and loaded into the runner for speculative decoding. Produced by SWP.

    anyOf
    string
    mtpNormConvention object

    How the sidecar stores its RMSNorm weights.

    "zero_centered" means deviation-from-1 (the raw Qwen3.5 checkpoint convention — the runner applies a +1.0 shift at load, mirroring what mlx-lm's sanitize() does for trunk weights). "actual_scale" means the stored value is the scale itself (DeepSeek convention). None falls through to the family default keyed off the detected sidecar layout. Override per card when a publisher changes conventions — getting this wrong measured 0% draft acceptance on Qwen3.5-2B (issue #192).

    anyOf
    string

    Possible values: [zero_centered, actual_scale]
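The two conventions differ only in whether a +1.0 shift is applied at load time. A minimal sketch on plain Python lists (the runner operates on real tensors; `load_norm_weight` is an illustrative name, not the runner's function):

```python
def load_norm_weight(stored: list[float], convention: str) -> list[float]:
    """Turn stored RMSNorm weights into the scale actually applied."""
    if convention == "zero_centered":
        # Stored values are deviations from 1, so shift them back.
        return [w + 1.0 for w in stored]
    if convention == "actual_scale":
        # Stored values are already the scale.
        return list(stored)
    raise ValueError(f"unknown norm convention: {convention!r}")


print(load_norm_weight([0.0, -0.25, 0.5], "zero_centered"))
# prints: [1.0, 0.75, 1.5]
```

Reading a zero-centered checkpoint as actual_scale would multiply activations by values near 0 instead of near 1, which is consistent with the draft-acceptance collapse described above.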

    mtpConcatOrder object

    Concatenation order of the MTP fc projection input.

    "embed_first" = fc(concat([enorm(embed(t_next)), hnorm(h)])) — verified for Qwen3.5 (72.4% offline agreement, issue #192). "hidden_first" is the inherited DeepSeek assumption (unverified). None falls through to the family default keyed off the detected sidecar layout.

    anyOf
    string

    Possible values: [embed_first, hidden_first]
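As a sketch of what the two orders mean for the input handed to the fc projection, using plain lists in place of tensors (the function name is illustrative, and both inputs are assumed to be already normalized):

```python
def fc_input(embed_normed: list[float], hidden_normed: list[float], order: str) -> list[float]:
    """Concatenate the normalized embedding and hidden state in the given order."""
    if order == "embed_first":
        return embed_normed + hidden_normed
    if order == "hidden_first":
        return hidden_normed + embed_normed
    raise ValueError(f"unknown concat order: {order!r}")


print(fc_input([1.0, 2.0], [9.0, 8.0], "embed_first"))
# prints: [1.0, 2.0, 9.0, 8.0]
```

Because the fc weights are trained against one fixed layout, swapping the halves feeds each half through the wrong columns of the projection.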

    speculativeMultiNode object

    Whether speculation may run on multi-node placements of this model.

    None (default) places no restriction. Set False for models where multi-node speculation is measured SLOWER than plain distributed decode: the 2026-06-06 benchmark matrix found gemma-4-26B-A4B (MoE) at 30.2 tok/s plain vs 28.2 with MTP on a 2-node pipeline (-7%), while single-node MTP on the same model measures 2.2x — fast sharded MoE decode plus modest acceptance makes the per-round draft+verify overhead net negative. Single-node speculation is unaffected by this knob. The decision is card-driven so every rank makes the same speculate-or-not choice (the distributed agreement collective requires rank symmetry).

    anyOf
    boolean
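The gating rule can be sketched as a pure function of card and placement data, which is what lets every rank reach the same answer without communication. The function name is hypothetical, and the sketch covers only this knob, not other conditions the runner may check.

```python
from typing import Optional


def may_speculate(world_size: int, speculative_multi_node: Optional[bool]) -> bool:
    """Decide whether this knob permits speculation for a placement."""
    if world_size <= 1:
        # Single-node speculation is unaffected by the knob.
        return True
    # None places no restriction; only an explicit False disables it.
    return speculative_multi_node is not False


print(may_speculate(1, False))  # True
print(may_speculate(2, False))  # False
print(may_speculate(2, None))   # True
```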
    assistantModelRepo object

    Hugging Face repo ID of a companion assistant (drafter) model.

    Gemma 4 does speculative decoding differently from the Qwen3/DeepSeek mtp.* heads: instead of embedded prediction heads, it pairs the target with a separate small gemma4_assistant model (e.g. "mlx-community/gemma-4-26B-A4B-it-assistant-bf16") that cross-attends over the target's KV cache. When set, the assistant repo is downloaded alongside the base model. Mutually exclusive with the mtp_* fields.

    NOTE: consuming the assistant for speculative generation requires the gemma4_assistant drafter from mlx-vlm >= 0.5.0 and is not yet wired into the runner — declaring it here only pre-downloads it. See the Gemma 4 MTP initiative in the foxlight-docs hub (Phase C).

    anyOf
    string
    servedSpecType object

    Speculative-decoding mode for the llama_server (served-backend) engine.

    Maps to a llama-server --spec-type token in the runner (_SPEC_TYPE_FLAG): draft_mtp -> draft-mtp (the model's own built-in MTP heads, no draft model needed; Qwen3.6/DeepSeek/GLM/Kimi/Nemotron), draft_eagle3 -> draft-eagle3 (an EAGLE-3 head), draft_simple -> draft-simple (a separate draft model), ngram -> ngram-cache (prompt-lookup), and none/None -> plain decoding. Only the served engine reads this; the in-process mlx and llama_cpp engines ignore it (MLX speculation is the mtp_* / assistant_model_repo fields above).

    anyOf
    string

    Possible values: [none, draft_mtp, draft_eagle3, draft_simple, ngram]

    servedSpecNMax object

    Max draft tokens per step for the served engine (--spec-draft-n-max).

    Must be a positive integer (validated at card load so a bad value fails fast rather than producing an undefined --spec-draft-n-max at the server). None uses the llama-server default (3). Acceptance falls off with depth (per-position acceptance drops), so 2-3 is the usual sweet spot; tune per card from measured acceptance.

    anyOf
    integer

    Possible values: > 0
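Putting servedSpecType and servedSpecNMax together, argument construction might look like the sketch below. The mapping values and flag names are taken from the descriptions above; the function itself is illustrative and is not the runner's code.

```python
from typing import Optional

# Card value -> --spec-type token, as listed in the field description.
SPEC_TYPE_FLAG = {
    "draft_mtp": "draft-mtp",
    "draft_eagle3": "draft-eagle3",
    "draft_simple": "draft-simple",
    "ngram": "ngram-cache",
}


def spec_args(served_spec_type: Optional[str], served_spec_n_max: Optional[int] = None) -> list[str]:
    """Build the speculative-decoding arguments for the served engine."""
    if served_spec_type in (None, "none"):
        return []  # plain decoding: no speculative flags
    if served_spec_type not in SPEC_TYPE_FLAG:
        raise ValueError(f"unknown served_spec_type: {served_spec_type!r}")
    args = ["--spec-type", SPEC_TYPE_FLAG[served_spec_type]]
    if served_spec_n_max is not None:
        if served_spec_n_max <= 0:
            raise ValueError("served_spec_n_max must be a positive integer")
        args += ["--spec-draft-n-max", str(served_spec_n_max)]
    return args


print(spec_args("draft_mtp", 3))
# prints: ['--spec-type', 'draft-mtp', '--spec-draft-n-max', '3']
```

Leaving `served_spec_n_max` unset omits the flag, so the server's own default applies.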

    servedSpecDraftRepo object

    Hugging Face repo of a separate draft GGUF for the served engine.

    Some served speculative modes need a second model passed to llama-server via --model-draft, NOT built-in heads: draft_simple (a vocab-matched small draft model) and draft_eagle3 (an EAGLE-3 head) always require one, and Gemma 4 draft_mtp uses its assistant as a separate draft GGUF (llama.cpp PR #23398) rather than baking heads into the base. Qwen3.6/DeepSeek/GLM draft_mtp leave this unset (heads are in the base GGUF). When set, the draft GGUF is downloaded as a companion alongside the base and passed as --model-draft. Pairs with served_spec_draft_file.

    anyOf
    string
    servedSpecDraftFile object

    Repo-relative GGUF filename of the served draft model (in served_spec_draft_repo), e.g. "mtp-gemma-4-31B-it.gguf". Required when served_spec_draft_repo is set; selects the exact draft quant the runner passes to --model-draft.

    anyOf
    string
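The pairing rules for the draft repo and file lend themselves to a load-time check. The sketch below encodes only the two rules stated above (draft_simple and draft_eagle3 need a draft repo; a draft repo needs a draft file). The function and its messages are hypothetical.

```python
from typing import Optional


def check_draft_config(
    served_spec_type: Optional[str],
    served_spec_draft_repo: Optional[str],
    served_spec_draft_file: Optional[str],
) -> list[str]:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    if served_spec_type in ("draft_simple", "draft_eagle3") and not served_spec_draft_repo:
        problems.append(f"{served_spec_type} requires served_spec_draft_repo")
    if served_spec_draft_repo and not served_spec_draft_file:
        problems.append("served_spec_draft_file is required when served_spec_draft_repo is set")
    return problems


print(check_draft_config("draft_simple", None, None))
print(check_draft_config("draft_mtp", None, None))  # heads in the base GGUF: nothing to check
```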
    placement object

    Where the model is allowed to run and which backend is preferred: the compatible_backends hard filter and backend_preference soft score the planner uses to route the model to suitable nodes.

    compatibleBackends string[]

    Hard constraint: only route to nodes whose advertised backends intersect this set. Making the implicit {"mlx"} explicit is what enables future heterogeneous (llama_cpp / rocm / cuda) routing.

    Default value: ["mlx"]
    minVramGib object

    Hard constraint: planner gates on node available memory when set.

    anyOf
    number
    maxContextTokens object

    Soft: caps the placement-time KV budget check (see #145) when set.

    anyOf
    integer
    backendPreference string[]

    Soft, ordered preference among the node's backend tags (e.g. ("llama_cpp-vulkan", "llama_cpp-rocm")).

    Unlike compatible_backends (a hard filter on which nodes are eligible), this only ranks eligible nodes/devices: the planner prefers a node that advertises an earlier-listed tag, and the runner picks the earliest-listed backend the chosen node actually has. The same model runs on any compatible backend, but their performance differs per model, so this captures "fastest on Vulkan, ROCm is an acceptable fallback" while still degrading gracefully to a node that only offers the fallback. Order is significant and preserved; an empty tuple means no preference (use the node's default).

    Default value: []
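The hard-filter-then-soft-rank behavior can be sketched as follows. The node representation (a name mapped to a pair of backend names and backend tags) and the node names are assumptions made for the example; the planner's real scoring likely weighs more than this one signal.

```python
def rank_nodes(
    nodes: dict[str, tuple[set[str], set[str]]],
    compatible_backends: list[str],
    backend_preference: list[str],
) -> list[str]:
    """Filter nodes by compatible backends, then order by preferred tag."""
    compatible = set(compatible_backends)
    # Hard filter: the node must advertise at least one compatible backend.
    eligible = [name for name, (backends, _tags) in nodes.items() if compatible & backends]

    def score(name: str) -> int:
        tags = nodes[name][1]
        for position, tag in enumerate(backend_preference):
            if tag in tags:
                return position  # earlier-listed tag means a better rank
        return len(backend_preference)  # no preferred tag: ranked last

    # sorted() is stable, so ties keep their original order.
    return sorted(eligible, key=score)


nodes = {
    "node-a": ({"mlx"}, set()),
    "node-b": ({"llama_cpp"}, {"llama_cpp-rocm"}),
    "node-c": ({"llama_cpp"}, {"llama_cpp-vulkan"}),
}
print(rank_nodes(nodes, ["llama_cpp"], ["llama_cpp-vulkan", "llama_cpp-rocm"]))
# prints: ['node-c', 'node-b']
```

With an empty preference every eligible node scores the same, so the result is just the filtered list in its original order.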
    deviceRank (integer), required
    worldSize (integer), required
    resolvedBackend object
    anyOf
    string
    immediateException (boolean)
    Default value: false
    shouldTimeout object
    anyOf
    number
    startLayer (integer), required

    Possible values: >= 0

    endLayer (integer), required

    Possible values: >= 0

    nLayers (integer), required

    Possible values: >= 0

    TensorShardMetadata
{
  "modelCard": {
    "modelId": "string",
    "storageSize": {
      "inBytes": 0
    },
    "nLayers": 0,
    "hiddenSize": 0,
    "supportsTensor": true,
    "numKeyValueHeads": 0,
    "tasks": [
      "TextGeneration"
    ],
    "components": [
      {
        "componentName": "string",
        "componentPath": "string",
        "storageSize": {
          "inBytes": 0
        },
        "nLayers": 0,
        "canShard": true,
        "safetensorsIndexFilename": "string"
      }
    ],
    "family": "",
    "quantization": "",
    "baseModel": "",
    "ggufFile": "string",
    "capabilities": [
      "string"
    ],
    "contextLength": 0,
    "usesCfg": false,
    "trustRemoteCode": true,
    "isCustom": false,
    "vision": {
      "imageTokenId": 0,
      "modelType": "",
      "weightsRepo": "",
      "imageToken": "string",
      "processorRepo": "string",
      "boiTokenId": 0,
      "eoiTokenId": 0
    },
    "reasoning": {
      "supportsToggle": true,
      "supportsBudget": true,
      "format": "none",
      "defaultEffort": "none",
      "disabledEffort": "none"
    },
    "modalities": {
      "supportsAudioInput": true,
      "supportsNativeMultimodal": true
    },
    "tooling": {
      "supportsToolCalling": true,
      "toolCallFormat": "generic",
      "builtinTools": [
        "web_search"
      ]
    },
    "runtime": {
      "promptRenderer": "tokenizer",
      "outputParser": "generic",
      "metalFastSynch": true,
      "mtpHeads": true,
      "mtpMaxDepth": 0,
      "mtpSidecarRepo": "string",
      "mtpNormConvention": "zero_centered",
      "mtpConcatOrder": "embed_first",
      "speculativeMultiNode": true,
      "assistantModelRepo": "string",
      "servedSpecType": "none",
      "servedSpecNMax": 0,
      "servedSpecDraftRepo": "string",
      "servedSpecDraftFile": "string"
    },
    "placement": {
      "compatibleBackends": [
        "string"
      ],
      "minVramGib": 0,
      "maxContextTokens": 0,
      "backendPreference": [
        "string"
      ]
    }
  },
  "deviceRank": 0,
  "worldSize": 0,
  "resolvedBackend": "string",
  "immediateException": false,
  "shouldTimeout": 0,
  "startLayer": 0,
  "endLayer": 0,
  "nLayers": 0
}