Model Cards
Model cards are Skulk's durable source of truth for model metadata.
They are how Skulk knows things like:
- what a model is called
- what task types it supports
- how large it is for placement and download planning
- whether it supports tensor sharding
- whether it has vision support
- whether it declares advanced model-specific behavior
What Model Cards Do
Model cards sit at the boundary between static metadata and runtime behavior.
They drive:
- placement and memory calculations
- model browsing in the API and dashboard
- custom model registration
- modality hints such as vision
- advanced capability declarations for model-specific runtime behavior
Where They Live
Built-in cards are shipped in:
Custom cards are stored under the user data directory and synced through the cluster event flow.
The card interface (source of truth)
The authoritative definition of the model-card interface is the ModelCard
type in
src/skulk/shared/models/model_cards.py.
Every field is documented in that model, and the exhaustive, always-current field
reference is the generated API schema (ModelCard and its nested
PlacementCardConfig / RuntimeCapabilityCardConfig / VisionCardConfig /
ReasoningCardConfig / ModalitiesCardConfig / ToolingCardConfig /
ComponentInfo) in the API reference. This page is the curated
narrative; when in doubt about an exact field, the schema is canonical.
Cards are camelCase on the wire and strict (unknown fields are rejected), so every node in a cluster must run the same Skulk version.
Core Fields
Identity and size
model_id- Hugging Face / MLX model identifier
storage_size- total model size used for store/download/placement planning
n_layers- number of transformer layers used for pipeline sharding
hidden_size- hidden dimension used for placement and compatibility checks
num_key_value_heads- optional KV head count for tensor compatibility decisions
gguf_file- for GGUF (llama.cpp) models only: the repo-relative weights file the runner loads (the selected quant's first shard), resolved once at card creation;
nullfor safetensors/MLX cards
- for GGUF (llama.cpp) models only: the repo-relative weights file the runner loads (the selected quant's first shard), resolved once at card creation;
components- for multi-component models (such as a diffusion stack): the per-component weight layout;
nullfor a single-weights model
- for multi-component models (such as a diffusion stack): the per-component weight layout;
Runtime and placement
supports_tensor- whether tensor-style placement is allowed (GGUF/llama.cpp cards set this
false)
- whether tensor-style placement is allowed (GGUF/llama.cpp cards set this
tasks- supported task families such as
TextGeneration,TextEmbedding, or image tasks
- supported task families such as
trust_remote_code- whether the loader may enable remote-code behavior for this model
uses_cfg- whether the model uses classifier-free guidance (relevant to some image/diffusion models)
Catalog metadata
family- coarse family label such as
gemma,qwen,deepseek
- coarse family label such as
quantization- human-facing quantization label
base_model- display-friendly base model name
context_length- advertised context length if known
capabilities- coarse capability list such as
text,vision,thinking,embedding
- coarse capability list such as
These coarse capabilities remain useful for browsing, badges, and basic compatibility, but they are not expressive enough for model-specific runtime behavior on their own.
Vision Section
[vision] is the existing structured section for multimodal text-generation models.
Fields include:
image_token_id- token used to represent image slots in prompt tokenization
model_type- MLX-VLM model family identifier such as
gemma4
- MLX-VLM model family identifier such as
weights_repo- optional alternate weights repository for the vision tower
image_token- optional literal image token string
processor_repo- optional alternate processor repository
boi_token_id- optional begin-of-image token id
eoi_token_id- optional end-of-image token id
Placement Section
[placement] declares where a model is allowed to run and which backend is
preferred. It is what the planner reads to route a model to suitable nodes, and
it is how a heterogeneous cluster keeps each model on hardware that can run it.
Backends are named <engine>-<compute> tags (for example mlx-metal,
llama_cpp-vulkan, llama_cpp-rocm, llama_cpp-cpu). The engine selects the
worker runner; the compute names the accelerator. A bare engine tag (e.g. mlx)
is also valid vocabulary: nodes advertise it alongside their compound tags, so a
card written against the original {"mlx"} set keeps matching.
compatible_backends- the hard filter: the set of backend tags this model may run on. The planner excludes any node whose advertised backends do not intersect this set. The default is
{"mlx"}(so an unannotated card stays on MLX nodes); a GGUF card lists the llama.cpp tags. This is what keeps an MLX model off an AMD node and a GGUF model off a Mac.
- the hard filter: the set of backend tags this model may run on. The planner excludes any node whose advertised backends do not intersect this set. The default is
backend_preference- the soft score: an ordered list of preferred tags. When several compatible nodes qualify, the planner prefers the node whose backend ranks earliest, with graceful fallback to the rest. This lets a card say "fastest on Vulkan, but ROCm is fine."
min_vram_gib- optional minimum accelerator memory a node must have to be eligible.
max_context_tokens- optional cap on the admission context for this model, independent of the model's trained context length.
Extended Capability Sections
Skulk now supports optional structured sections that declare refined model behavior.
Existing cards do not need these sections. If they are absent, Skulk falls back to generic behavior.
[reasoning]
Declares advanced reasoning behavior:
supports_toggle- whether thinking/reasoning can be explicitly enabled or disabled
supports_budget- whether the model supports a reasoning budget control
format- reasoning marker format such as
channel_delimitedortoken_delimited
- reasoning marker format such as
default_effort- reasoning effort used when thinking is enabled without an explicit effort
disabled_effort- reasoning effort used when thinking is explicitly disabled
[modalities]
Declares refined modality support:
supports_audio_input- whether the model supports audio input
supports_native_multimodal- whether the model uses a native multimodal path rather than generic text-only prompting
[tooling]
Declares tool-calling behavior:
supports_tool_calling- whether tool calling is supported
builtin_tools- optional list of builtin platform tool contracts such as
web_search,open_url, orextract_page
- optional list of builtin platform tool contracts such as
tool_call_format- expected tool-call output format such as
generic,gemma4,gpt_oss, ordsml
- expected tool-call output format such as
[runtime]
Declares runtime integration preferences:
prompt_renderer- prompt renderer to use, such as
tokenizer,gemma4, ordsml
- prompt renderer to use, such as
output_parser- output parser to use, such as
generic,gemma4,gpt_oss, ordeepseek_v32
- output parser to use, such as
metal_fast_synch- per-model override for the MLX
MLX_METAL_FAST_SYNCHflag; set tofalsefor models that deadlock under FAST_SYNCH on the ring backend
- per-model override for the MLX
mtp_heads- set to
truewhen the model has native MTP prediction heads available via sidecar; required alongsidemtp_sidecar_repoto enable speculative decoding
- set to
mtp_max_depth- maximum draft depth the MTP heads support; start at
1for Apple Silicon (deeper values rarely amortize on Metal due to near-linear verify-pass scaling)
- maximum draft depth the MTP heads support; start at
mtp_sidecar_repo- Hugging Face repo ID containing the published
mtp.safetensorssidecar (e.g."FoxlightAI/qwen3-5-9b-base-mtp"); produced by SWP (Skulk Weights Publisher) from the original BF16 checkpoint
- Hugging Face repo ID containing the published
mtp_norm_convention- how the MTP heads normalize hidden states (
zero_centeredoractual_scale); must match how the sidecar was produced
- how the MTP heads normalize hidden states (
mtp_concat_order- the order the MTP heads concatenate the embedding and hidden state (
embed_firstorhidden_first); must match the sidecar
- the order the MTP heads concatenate the embedding and hidden state (
speculative_multi_node- set to
falseto forbid speculative decoding when the model is sharded across multiple nodes (it stays single-node speculative); the runner and the generation loop read this to make the same rank-symmetric decision
- set to
assistant_model_repo- Hugging Face repo of a small companion model used as an external drafter for speculative decoding (the Gemma 4 path), as opposed to native MTP heads
MTP Speculative Decoding
Some models include native multi-token prediction (MTP) heads baked into their checkpoint weights. Skulk uses these heads for speculative decoding: the MTP heads draft candidate tokens cheaply, a full forward pass verifies them, and accepted tokens are emitted in bulk, substantially increasing throughput with no accuracy loss.
Why a sidecar
Standard quantization pipelines (including mlx-lm's sanitize()) strip mtp.* tensor keys at conversion time. The MLX-quantized checkpoint you download from Hugging Face typically does not contain MTP weights.
SWP solves this by re-extracting the mtp.* tensors from the original BF16 checkpoint, quantizing only those tensors, and publishing the result as mtp.safetensors to a dedicated sidecar repo on Hugging Face. This is the same pattern Skulk uses for vision encoder weights.
How it works
When a model card declares mtp_heads = true and mtp_sidecar_repo, Skulk:
- Downloads
mtp.safetensorsfrom the sidecar repo alongside the base model weights. - Loads the sidecar at model load time and makes the weights available to the runner.
- Uses the MTP heads during generation for speculative decoding (on runners that support it).
If the sidecar is declared but the file is not found locally, Skulk logs a warning and continues with standard autoregressive generation.
Models with native MTP heads
The shipped cards that declare an MTP sidecar are the Qwen3.5 and Qwen3.6 quantizations:
mlx-community/Qwen3.5-2B-4bitmlx-community/Qwen3.5-9B-MLX-4bitmlx-community/Qwen3.5-27B-4bitmlx-community/Qwen3.6-27B-4bit
Gemma 4 does not use native MTP heads; it uses an external drafter model (assistant_model_repo) for speculative decoding, which is a separate feature.
Adding MTP to a card
[runtime]
mtp_heads = true
mtp_max_depth = 1
mtp_sidecar_repo = "FoxlightAI/qwen3-5-9b-base-mtp"
The sidecar repo must be published by SWP before adding these fields. See the SWP documentation for how sidecars are produced and published.
Speculation on the served (GGUF) engine
The sidecar path above is MLX-only. GGUF cards that run under the served
llama-server engine declare speculation through a different set of [runtime]
fields: served_spec_type, served_spec_n_max, served_spec_draft_repo, and
served_spec_draft_file. See Speculative Decoding for
how those fields map to the served backend.
Declarative vs Resolved
The model card is the declarative capability source.
At runtime, Skulk resolves the card plus tokenizer/model-family facts into a normalized execution profile. That resolved profile is what prompt rendering, reasoning defaults, and output parsing consume.
That gives Skulk three good properties:
- old cards remain valid
- advanced cards unlock refined behavior
- runtime code can rely on normalized values instead of ad hoc optional checks
Resolution Precedence
When Skulk resolves a runtime capability profile, it uses this order:
- explicit advanced model-card declarations
- conservative family/model heuristics
- generic fallback behavior
That means a custom or built-in card can refine behavior without breaking old cards that only declare coarse metadata.
Extended Card Example
This is a minimal example of a custom card that opts into refined runtime behavior:
model_id = "custom/gemma-compatible"
n_layers = 10
hidden_size = 1024
supports_tensor = false
tasks = ["TextGeneration"]
family = "gemma"
capabilities = ["text", "vision", "thinking"]
[storage_size]
in_bytes = 1073741824
[reasoning]
supports_toggle = true
format = "channel_delimited"
default_effort = "medium"
disabled_effort = "none"
[modalities]
supports_native_multimodal = true
[tooling]
tool_call_format = "gemma4"
[runtime]
prompt_renderer = "gemma4"
output_parser = "gemma4"
The card stays declarative. Skulk still resolves it into a normalized runtime profile before execution code consumes it.
When to Extend a Card
Extend a card when:
- the model needs special prompt rendering
- the model uses a non-generic reasoning format
- the model supports modalities or controls that generic metadata cannot express
- the dashboard or API needs to expose richer behavior safely
For concrete examples, see Model Capabilities and the per-family notes in Model Behaviors.