RuntimeCapabilityCardConfig
Optional runtime behavior hints for a model card.
promptRenderer object
How prompts are rendered for this model (tokenizer chat template,
gemma4, dsml); None uses the family default.
- PromptRendererType
- null
Prompt renderer strategies supported by the runtime.
Possible values: [tokenizer, gemma4, dsml]
outputParser object
How model output is parsed (generic, gemma4, gpt_oss,
deepseek_v32), e.g. for reasoning/tool-call extraction; None uses the
family default.
- OutputParserType
- null
Output parser strategies supported by the runtime.
Possible values: [generic, gemma4, gpt_oss, deepseek_v32]
metalFastSynch object
Per-model override for the MLX MLX_METAL_FAST_SYNCH flag.
None means "no opinion" — fall through to the cluster default
selected by the runner. Set explicitly to False for models that
deadlock under FAST_SYNCH on the ring backend (e.g. gemma-4 with
multimodal load: the Metal command queue wedges in
pipeline_last_eval_output, transitively starves WindowServer,
and trips the macOS kernel watchdog into a panic). Set explicitly
to True for models that have been measured to benefit and are
known to be safe under the deployment's collective backend.
- boolean
- null
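The tri-state behavior described above can be sketched as follows. This is a minimal illustration, not the runner's code: the function name resolve_fast_synch and the cluster_default parameter are hypothetical.

```python
from typing import Optional


def resolve_fast_synch(card_override: Optional[bool], cluster_default: bool) -> bool:
    """Return the effective FAST_SYNCH setting for one model.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    # None means "no opinion": fall through to the cluster default.
    if card_override is None:
        return cluster_default
    # An explicit True or False on the card always wins.
    return card_override
```

With this shape, a card that sets the field to False disables FAST_SYNCH even when the cluster default is True.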
mtpHeads object
True when native MTP prediction heads are available via sidecar.
Set alongside mtp_sidecar_repo. When false or absent, the runner
skips sidecar loading and uses standard autoregressive generation.
- boolean
- null
mtpMaxDepth object
Maximum draft depth the MTP heads support.
Start at 1 for Apple Silicon. Deeper values can be evaluated via profiling but are unlikely to amortize on Metal due to near-linear verify-pass scaling.
- integer
- null
mtpSidecarRepo object
Hugging Face repo ID containing the published mtp.safetensors sidecar.
Example: "FoxlightAI/qwen3-5-7b-instruct-mtp-q4k"
The sidecar is downloaded alongside the base model weights and loaded
into the runner for speculative decoding. Produced by SWP.
- string
- null
mtpNormConvention object
How the sidecar stores its RMSNorm weights.
"zero_centered" means deviation-from-1 (the raw Qwen3.5 checkpoint
convention — the runner applies a +1.0 shift at load, mirroring what
mlx-lm's sanitize() does for trunk weights). "actual_scale" means
the stored value is the scale itself (DeepSeek convention). None
falls through to the family default keyed off the detected sidecar
layout. Override per card when a publisher changes conventions — getting
this wrong measured 0% draft acceptance on Qwen3.5-2B (issue #192).
- string
- null
Possible values: [zero_centered, actual_scale]
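A minimal sketch of what the two conventions imply at load time, using plain Python lists in place of real weight tensors. The function name effective_rmsnorm_scale is hypothetical and the family-default fallback for None is deliberately left out.

```python
from typing import List


def effective_rmsnorm_scale(stored: List[float], convention: str) -> List[float]:
    """Convert stored RMSNorm weights into the scale applied at runtime.

    Hypothetical helper for illustration; real sidecars hold tensors,
    not Python lists.
    """
    if convention == "zero_centered":
        # Stored values are deviations from 1, so shift by +1.0 at load.
        return [w + 1.0 for w in stored]
    if convention == "actual_scale":
        # Stored values already are the scale; use them unchanged.
        return list(stored)
    raise ValueError(f"unknown norm convention: {convention!r}")
```

Applying the wrong branch leaves every norm scale off by 1.0, which is consistent with the total loss of draft acceptance reported above.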
mtpConcatOrder object
Concatenation order of the MTP fc projection input.
"embed_first" = fc(concat([enorm(embed(t_next)), hnorm(h)])) —
verified for Qwen3.5 (72.4% offline agreement, issue #192).
"hidden_first" is the inherited DeepSeek assumption (unverified).
None falls through to the family default keyed off the detected
sidecar layout.
- string
- null
Possible values: [embed_first, hidden_first]
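The two orderings differ only in which normalized vector comes first in the input to fc. A small sketch under simplifying assumptions: vectors are plain lists, the inputs are already normalized (enorm and hnorm applied), and the name mtp_fc_input is hypothetical.

```python
from typing import List


def mtp_fc_input(embed_normed: List[float], hidden_normed: List[float], order: str) -> List[float]:
    """Build the concatenated input for the MTP fc projection.

    Hypothetical helper: embed_normed stands for enorm(embed(t_next))
    and hidden_normed for hnorm(h).
    """
    if order == "embed_first":
        return embed_normed + hidden_normed
    if order == "hidden_first":
        return hidden_normed + embed_normed
    raise ValueError(f"unknown concat order: {order!r}")
```

Because fc is trained against one specific layout, swapping the halves feeds each weight column the wrong features even though the input shape is unchanged.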
speculativeMultiNode object
Whether speculation may run on multi-node placements of this model.
None (default) places no restriction. Set False for models
where multi-node speculation is measured SLOWER than plain distributed
decode: the 2026-06-06 benchmark matrix found gemma-4-26B-A4B (MoE)
at 30.2 tok/s plain vs 28.2 with MTP on a 2-node pipeline (-7%), while
single-node MTP on the same model measures 2.2x — fast sharded MoE
decode plus modest acceptance makes the per-round draft+verify
overhead net negative. Single-node speculation is unaffected by this
knob. The decision is card-driven so every rank makes the same
speculate-or-not choice (the distributed agreement collective requires
rank symmetry).
- boolean
- null
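One way to read the rule above is as a pure function of the card value and the placement size, which is what lets every rank reach the same answer without communicating. The sketch below is an assumption about shape only; speculation_permitted and node_count are hypothetical names, and other conditions for speculation are out of scope.

```python
from typing import Optional


def speculation_permitted(speculative_multi_node: Optional[bool], node_count: int) -> bool:
    """Return whether this knob permits speculation for a placement.

    Hypothetical helper: it depends only on card data and placement
    size, so every rank computes the same result.
    """
    # Single-node speculation is unaffected by this knob.
    if node_count <= 1:
        return True
    # None (default) and True place no restriction; only False blocks.
    return speculative_multi_node is not False
```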
assistantModelRepo object
Hugging Face repo ID of a companion assistant (drafter) model.
Gemma 4 does speculative decoding differently from the Qwen3/DeepSeek
mtp.* heads: instead of embedded prediction heads, it pairs the target
with a separate small gemma4_assistant model (e.g.
"mlx-community/gemma-4-26B-A4B-it-assistant-bf16") that cross-attends
over the target's KV cache. When set, the assistant repo is downloaded
alongside the base model. Mutually exclusive with the mtp_* fields.
NOTE: consuming the assistant for speculative generation requires the
gemma4_assistant drafter from mlx-vlm >= 0.5.0 and is not yet wired into
the runner — declaring it here only pre-downloads it. See the Gemma 4 MTP
initiative in the foxlight-docs hub (Phase C).
- string
- null
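The mutual-exclusion rule could be checked at card load along these lines. This is a sketch, not the runner's validator: the helper name is hypothetical, and treating any non-null MTP field (including an explicit false) as "set" is an assumption.

```python
from typing import Any, Dict, List

# JSON names of the MTP sidecar fields documented above.
MTP_FIELDS = (
    "mtpHeads",
    "mtpMaxDepth",
    "mtpSidecarRepo",
    "mtpNormConvention",
    "mtpConcatOrder",
)


def conflicting_mtp_fields(card: Dict[str, Any]) -> List[str]:
    """List MTP fields set alongside assistantModelRepo.

    Hypothetical helper: an empty result means the card respects the
    mutual-exclusion rule.
    """
    if card.get("assistantModelRepo") is None:
        return []
    return [name for name in MTP_FIELDS if card.get(name) is not None]
```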
servedSpecType object
Speculative-decoding mode for the llama_server (served-backend) engine.
Maps to a llama-server --spec-type token in the runner
(_SPEC_TYPE_FLAG): draft_mtp -> draft-mtp (the model's own built-in
MTP heads, no draft model needed; Qwen3.6/DeepSeek/GLM/Kimi/Nemotron),
draft_eagle3 -> draft-eagle3 (an EAGLE-3 head), draft_simple ->
draft-simple (a separate draft model), ngram -> ngram-cache
(prompt-lookup), none/None -> plain decoding. Only the served engine reads
this; the in-process mlx and llama_cpp engines ignore it (MLX
speculation is the mtp_* / assistant_model_repo fields above).
- string
- null
Possible values: [none, draft_mtp, draft_eagle3, draft_simple, ngram]
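The mapping above can be written out as a lookup table. The sketch below reconstructs it from this description only and is not the runner's _SPEC_TYPE_FLAG; the names SPEC_TYPE_TOKENS and spec_type_args are hypothetical, and omitting the flag entirely for none/None is an assumption.

```python
from typing import List, Optional

# Card value -> llama-server --spec-type token, as listed in the text above.
SPEC_TYPE_TOKENS = {
    "draft_mtp": "draft-mtp",
    "draft_eagle3": "draft-eagle3",
    "draft_simple": "draft-simple",
    "ngram": "ngram-cache",
}


def spec_type_args(served_spec_type: Optional[str]) -> List[str]:
    """Return the command-line arguments for a servedSpecType value."""
    # Assumption: plain decoding is expressed by passing no flag at all.
    if served_spec_type is None or served_spec_type == "none":
        return []
    if served_spec_type not in SPEC_TYPE_TOKENS:
        raise ValueError(f"unknown servedSpecType: {served_spec_type!r}")
    return ["--spec-type", SPEC_TYPE_TOKENS[served_spec_type]]
```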
servedSpecNMax object
Max draft tokens per step for the served engine (--spec-draft-n-max).
Must be a positive integer (validated at card load so a bad value fails fast
rather than producing an undefined --spec-draft-n-max at the server).
None uses the llama-server default (3). Per-position acceptance drops as
draft depth grows, so 2-3 is the usual sweet spot; tune per card from
measured acceptance.
- integer
- null
Possible values: > 0
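A fail-fast check of the kind described above might look like this. It is a sketch with a hypothetical name; rejecting booleans is an extra assumption, made because Python treats True as the integer 1.

```python
from typing import Optional


def validate_spec_n_max(value: Optional[int]) -> Optional[int]:
    """Validate servedSpecNMax at card load.

    Hypothetical helper: None is passed through so the server default
    applies; anything else must be a positive integer.
    """
    if value is None:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"servedSpecNMax must be a positive integer, got {value!r}")
    return value
```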
servedSpecDraftRepo object
Hugging Face repo of a separate draft GGUF for the served engine.
Some served speculative modes need a second model passed to llama-server via
--model-draft, NOT built-in heads: draft_simple (a vocab-matched small
draft model) and draft_eagle3 (an EAGLE-3 head) always require one, and
Gemma 4 draft_mtp uses its assistant as a separate draft GGUF (llama.cpp
PR #23398) rather than baking heads into the base. Qwen3.6/DeepSeek/GLM
draft_mtp leave this unset (heads are in the base GGUF). When set, the
draft GGUF is downloaded as a companion alongside the base and passed as
--model-draft. Pairs with served_spec_draft_file.
- string
- null
servedSpecDraftFile object
Repo-relative GGUF filename of the served draft model (in
served_spec_draft_repo), e.g. "mtp-gemma-4-31B-it.gguf". Required when
served_spec_draft_repo is set; selects the exact draft quant the runner
passes to --model-draft.
- string
- null
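The pairing between the draft repo and the draft filename can be sketched as one helper that either yields no arguments or a --model-draft argument. The helper name, the download_dir parameter, and the on-disk layout are all hypothetical; only the flag name and the "file required when repo is set" rule come from the descriptions above.

```python
from typing import List, Optional


def model_draft_args(repo: Optional[str], filename: Optional[str], download_dir: str) -> List[str]:
    """Return the --model-draft arguments for a served draft model.

    Hypothetical helper: download_dir is assumed to be the local
    directory where the companion repo was downloaded.
    """
    # No draft repo: the mode uses built-in heads or no speculation.
    if repo is None:
        return []
    # The filename is required whenever a draft repo is declared.
    if not filename:
        raise ValueError("servedSpecDraftFile is required when servedSpecDraftRepo is set")
    return ["--model-draft", f"{download_dir}/{filename}"]
```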
{
"promptRenderer": "tokenizer",
"outputParser": "generic",
"metalFastSynch": true,
"mtpHeads": true,
"mtpMaxDepth": 1,
"mtpSidecarRepo": "string",
"mtpNormConvention": "zero_centered",
"mtpConcatOrder": "embed_first",
"speculativeMultiNode": true,
"assistantModelRepo": "string",
"servedSpecType": "none",
"servedSpecNMax": 3,
"servedSpecDraftRepo": "string",
"servedSpecDraftFile": "string"
}