Model Cards

Model cards are Skulk's durable source of truth for model metadata.

They are how Skulk knows things like:

what a model is called
what task types it supports
how large it is for placement and download planning
whether it supports tensor sharding
whether it has vision support
whether it declares advanced model-specific behavior

What Model Cards Do

Model cards sit at the boundary between static metadata and runtime behavior.

They drive:

placement and memory calculations
model browsing in the API and dashboard
custom model registration
modality hints such as vision
advanced capability declarations for model-specific runtime behavior

Where They Live

Built-in cards are shipped in:

Custom cards are stored under the user data directory and synced through the cluster event flow.

The card interface (source of truth)

The authoritative definition of the model-card interface is the ModelCard type in src/skulk/shared/models/model_cards.py. Every field is documented in that model, and the exhaustive, always-current field reference is the generated API schema (ModelCard and its nested PlacementCardConfig / RuntimeCapabilityCardConfig / VisionCardConfig / ReasoningCardConfig / ModalitiesCardConfig / ToolingCardConfig / ComponentInfo) in the API reference. This page is the curated narrative; when in doubt about an exact field, the schema is canonical.

Cards are camelCase on the wire and strict (unknown fields are rejected), so every node in a cluster must run the same Skulk version.

Core Fields

Identity and size

model_id
- Hugging Face / MLX model identifier
storage_size
- total model size used for store/download/placement planning
n_layers
- number of transformer layers used for pipeline sharding
hidden_size
- hidden dimension used for placement and compatibility checks
num_key_value_heads
- optional KV head count for tensor compatibility decisions
gguf_file
- for GGUF (llama.cpp) models only: the repo-relative weights file the runner loads (the selected quant's first shard), resolved once at card creation; null for safetensors/MLX cards
components
- for multi-component models (such as a diffusion stack): the per-component weight layout; null for a single-weights model

Runtime and placement

supports_tensor
- whether tensor-style placement is allowed (GGUF/llama.cpp cards set this false)
tasks
- supported task families such as TextGeneration, TextEmbedding, or image tasks
trust_remote_code
- whether the loader may enable remote-code behavior for this model
uses_cfg
- whether the model uses classifier-free guidance (relevant to some image/diffusion models)

Catalog metadata

family
- coarse family label such as gemma, qwen, deepseek
quantization
- human-facing quantization label
base_model
- display-friendly base model name
context_length
- advertised context length if known
capabilities
- coarse capability list such as text, vision, thinking, embedding

These coarse capabilities remain useful for browsing, badges, and basic compatibility, but they are not expressive enough for model-specific runtime behavior on their own.

Vision Section

[vision] is the existing structured section for multimodal text-generation models.

Fields include:

image_token_id
- token used to represent image slots in prompt tokenization
model_type
- MLX-VLM model family identifier such as gemma4
weights_repo
- optional alternate weights repository for the vision tower
image_token
- optional literal image token string
processor_repo
- optional alternate processor repository
boi_token_id
- optional begin-of-image token id
eoi_token_id
- optional end-of-image token id

Placement Section

[placement] declares where a model is allowed to run and which backend is preferred. It is what the planner reads to route a model to suitable nodes, and it is how a heterogeneous cluster keeps each model on hardware that can run it.

Backends are named <engine>-<compute> tags (for example mlx-metal, llama_cpp-vulkan, llama_cpp-rocm, llama_cpp-cpu). The engine selects the worker runner; the compute names the accelerator. A bare engine tag (e.g. mlx) is also valid vocabulary: nodes advertise it alongside their compound tags, so a card written against the original {"mlx"} set keeps matching.

compatible_backends
- the hard filter: the set of backend tags this model may run on. The planner excludes any node whose advertised backends do not intersect this set. The default is {"mlx"} (so an unannotated card stays on MLX nodes); a GGUF card lists the llama.cpp tags. This is what keeps an MLX model off an AMD node and a GGUF model off a Mac.
backend_preference
- the soft score: an ordered list of preferred tags. When several compatible nodes qualify, the planner prefers the node whose backend ranks earliest, with graceful fallback to the rest. This lets a card say "fastest on Vulkan, but ROCm is fine."
min_vram_gib
- optional minimum accelerator memory a node must have to be eligible.
max_context_tokens
- optional cap on the admission context for this model, independent of the model's trained context length.

Extended Capability Sections

Skulk now supports optional structured sections that declare refined model behavior.

Existing cards do not need these sections. If they are absent, Skulk falls back to generic behavior.

`[reasoning]`

Declares advanced reasoning behavior:

supports_toggle
- whether thinking/reasoning can be explicitly enabled or disabled
supports_budget
- whether the model supports a reasoning budget control
format
- reasoning marker format such as channel_delimited or token_delimited
default_effort
- reasoning effort used when thinking is enabled without an explicit effort
disabled_effort
- reasoning effort used when thinking is explicitly disabled

`[modalities]`

Declares refined modality support:

supports_audio_input
- whether the model supports audio input
supports_native_multimodal
- whether the model uses a native multimodal path rather than generic text-only prompting

`[tooling]`

Declares tool-calling behavior:

supports_tool_calling
- whether tool calling is supported
builtin_tools
- optional list of builtin platform tool contracts such as web_search, open_url, or extract_page
tool_call_format
- expected tool-call output format such as generic, gemma4, gpt_oss, or dsml

`[runtime]`

Declares runtime integration preferences:

prompt_renderer
- prompt renderer to use, such as tokenizer, gemma4, or dsml
output_parser
- output parser to use, such as generic, gemma4, gpt_oss, or deepseek_v32
metal_fast_synch
- per-model override for the MLX MLX_METAL_FAST_SYNCH flag; set to false for models that deadlock under FAST_SYNCH on the ring backend
mtp_heads
- set to true when the model has native MTP prediction heads available via sidecar; required alongside mtp_sidecar_repo to enable speculative decoding
mtp_max_depth
- maximum draft depth the MTP heads support; start at 1 for Apple Silicon (deeper values rarely amortize on Metal due to near-linear verify-pass scaling)
mtp_sidecar_repo
- Hugging Face repo ID containing the published mtp.safetensors sidecar (e.g. "FoxlightAI/qwen3-5-9b-base-mtp"); produced by SWP (Skulk Weights Publisher) from the original BF16 checkpoint
mtp_norm_convention
- how the MTP heads normalize hidden states (zero_centered or actual_scale); must match how the sidecar was produced
mtp_concat_order
- the order the MTP heads concatenate the embedding and hidden state (embed_first or hidden_first); must match the sidecar
speculative_multi_node
- set to false to forbid speculative decoding when the model is sharded across multiple nodes (it stays single-node speculative); the runner and the generation loop read this to make the same rank-symmetric decision
assistant_model_repo
- Hugging Face repo of a small companion model used as an external drafter for speculative decoding (the Gemma 4 path), as opposed to native MTP heads

MTP Speculative Decoding

Some models include native multi-token prediction (MTP) heads baked into their checkpoint weights. Skulk uses these heads for speculative decoding: the MTP heads draft candidate tokens cheaply, a full forward pass verifies them, and accepted tokens are emitted in bulk, substantially increasing throughput with no accuracy loss.

Why a sidecar

Standard quantization pipelines (including mlx-lm's sanitize()) strip mtp.* tensor keys at conversion time. The MLX-quantized checkpoint you download from Hugging Face typically does not contain MTP weights.

SWP solves this by re-extracting the mtp.* tensors from the original BF16 checkpoint, quantizing only those tensors, and publishing the result as mtp.safetensors to a dedicated sidecar repo on Hugging Face. This is the same pattern Skulk uses for vision encoder weights.

How it works

When a model card declares mtp_heads = true and mtp_sidecar_repo, Skulk:

Downloads mtp.safetensors from the sidecar repo alongside the base model weights.
Loads the sidecar at model load time and makes the weights available to the runner.
Uses the MTP heads during generation for speculative decoding (on runners that support it).

If the sidecar is declared but the file is not found locally, Skulk logs a warning and continues with standard autoregressive generation.

Models with native MTP heads

The shipped cards that declare an MTP sidecar are the Qwen3.5 and Qwen3.6 quantizations:

mlx-community/Qwen3.5-2B-4bit
mlx-community/Qwen3.5-9B-MLX-4bit
mlx-community/Qwen3.5-27B-4bit
mlx-community/Qwen3.6-27B-4bit

Gemma 4 does not use native MTP heads; it uses an external drafter model (assistant_model_repo) for speculative decoding, which is a separate feature.

Adding MTP to a card

[runtime]
mtp_heads = true
mtp_max_depth = 1
mtp_sidecar_repo = "FoxlightAI/qwen3-5-9b-base-mtp"

The sidecar repo must be published by SWP before adding these fields. See the SWP documentation for how sidecars are produced and published.

Speculation on the served (GGUF) engine

The sidecar path above is MLX-only. GGUF cards that run under the served llama-server engine declare speculation through a different set of [runtime] fields: served_spec_type, served_spec_n_max, served_spec_draft_repo, and served_spec_draft_file. See Speculative Decoding for how those fields map to the served backend.

Declarative vs Resolved

The model card is the declarative capability source.

At runtime, Skulk resolves the card plus tokenizer/model-family facts into a normalized execution profile. That resolved profile is what prompt rendering, reasoning defaults, and output parsing consume.

That gives Skulk three good properties:

old cards remain valid
advanced cards unlock refined behavior
runtime code can rely on normalized values instead of ad hoc optional checks

Resolution Precedence

When Skulk resolves a runtime capability profile, it uses this order:

explicit advanced model-card declarations
conservative family/model heuristics
generic fallback behavior

That means a custom or built-in card can refine behavior without breaking old cards that only declare coarse metadata.

Extended Card Example

This is a minimal example of a custom card that opts into refined runtime behavior:

model_id = "custom/gemma-compatible"
n_layers = 10
hidden_size = 1024
supports_tensor = false
tasks = ["TextGeneration"]
family = "gemma"
capabilities = ["text", "vision", "thinking"]

[storage_size]
in_bytes = 1073741824

[reasoning]
supports_toggle = true
format = "channel_delimited"
default_effort = "medium"
disabled_effort = "none"

[modalities]
supports_native_multimodal = true

[tooling]
tool_call_format = "gemma4"

[runtime]
prompt_renderer = "gemma4"
output_parser = "gemma4"

The card stays declarative. Skulk still resolves it into a normalized runtime profile before execution code consumes it.

When to Extend a Card

Extend a card when:

the model needs special prompt rendering
the model uses a non-generic reasoning format
the model supports modalities or controls that generic metadata cannot express
the dashboard or API needs to expose richer behavior safely

For concrete examples, see Model Capabilities and the per-family notes in Model Behaviors.

What Model Cards Do​

Where They Live​

The card interface (source of truth)​

Core Fields​

Identity and size​

Runtime and placement​

Catalog metadata​

Vision Section​

Placement Section​

Extended Capability Sections​

[reasoning]​

[modalities]​

[tooling]​

[runtime]​

MTP Speculative Decoding​

Why a sidecar​

How it works​

Models with native MTP heads​

Adding MTP to a card​

Speculation on the served (GGUF) engine​

Declarative vs Resolved​

Resolution Precedence​

Extended Card Example​

When to Extend a Card​