RuntimeCapabilityCardConfig
Optional runtime behavior hints for a model card.
promptRenderer object
How prompts are rendered for this model (tokenizer chat template,
gemma4, dsml); None uses the family default.
- PromptRendererType
- null
Prompt renderer strategies supported by the runtime.
Possible values: [tokenizer, gemma4, dsml]
outputParser object
How model output is parsed (generic, gemma4, gpt_oss,
deepseek_v32), e.g. for reasoning/tool-call extraction; None uses the
family default.
- OutputParserType
- null
Output parser strategies supported by the runtime.
Possible values: [generic, gemma4, gpt_oss, deepseek_v32]
metalFastSynch object
Per-model override for MLX's MLX_METAL_FAST_SYNCH flag.
None means "no opinion" — fall through to the cluster default
selected by the runner. Set explicitly to False for models that
deadlock under FAST_SYNCH on the ring backend (e.g. gemma-4 with
multimodal load: the Metal command queue wedges in
pipeline_last_eval_output, transitively starves WindowServer,
and trips the macOS kernel watchdog into a panic). Set explicitly
to True for models that have been measured to benefit and are
known to be safe under the deployment's collective backend.
- boolean
- null
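The three-state behavior described above (null, false, true) can be sketched as a small resolver. This is a minimal illustration, not the runner's actual code: the function name resolve_fast_synch and the cluster_default parameter are hypothetical stand-ins for whatever the runner uses internally.

```python
from typing import Optional


def resolve_fast_synch(card_value: Optional[bool], cluster_default: bool) -> bool:
    """Return the effective FAST_SYNCH setting for one model.

    card_value is the card's metalFastSynch field. cluster_default is a
    hypothetical stand-in for the default selected by the runner.
    """
    if card_value is None:
        # No opinion on the card: fall through to the cluster default.
        return cluster_default
    # An explicit True or False on the card overrides the default.
    return card_value
```

Under this sketch, a card that sets metalFastSynch to false stays off even when the cluster default is on, which is the case the deadlock note above is concerned with.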
mtpHeads object
True when native MTP prediction heads are available via sidecar.
Set alongside mtpSidecarRepo (mtp_sidecar_repo). When false or absent, the runner
skips sidecar loading and uses standard autoregressive generation.
- boolean
- null
mtpMaxDepth object
Maximum draft depth the MTP heads support.
Start at 1 for Apple Silicon. Deeper values can be evaluated via profiling but are unlikely to amortize on Metal due to near-linear verify-pass scaling.
- integer
- null
mtpSidecarRepo object
Hugging Face repo ID containing the published mtp.safetensors sidecar.
Example: "FoxlightAI/qwen3-5-7b-instruct-mtp-q4k"
The sidecar is downloaded alongside the base model weights and loaded
into the runner for speculative decoding. Produced by SWP.
- string
- null
mtpNormConvention object
How the sidecar stores its RMSNorm weights.
"zero_centered" means deviation-from-1 (the raw Qwen3.5 checkpoint
convention — the runner applies a +1.0 shift at load, mirroring what
mlx-lm's sanitize() does for trunk weights). "actual_scale" means
the stored value is the scale itself (DeepSeek convention). None
falls through to the family default keyed off the detected sidecar
layout. Override per card when a publisher changes conventions: the wrong
setting produced 0% draft acceptance on Qwen3.5-2B (issue #192).
- string
- null
Possible values: [zero_centered, actual_scale]
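The difference between the two conventions can be shown with plain Python lists. This is a sketch under stated assumptions: the function name effective_norm_scale and the family_default parameter are hypothetical, and the real runner operates on tensors and picks the default from the detected sidecar layout.

```python
from typing import List, Optional


def effective_norm_scale(
    stored: List[float],
    convention: Optional[str],
    family_default: str = "zero_centered",  # hypothetical default, for illustration only
) -> List[float]:
    """Convert stored RMSNorm weights into the scale actually applied."""
    resolved = convention if convention is not None else family_default
    if resolved == "zero_centered":
        # Stored value is the deviation from 1, so shift by +1.0 at load.
        return [w + 1.0 for w in stored]
    if resolved == "actual_scale":
        # Stored value is already the scale; use it unchanged.
        return list(stored)
    raise ValueError(f"unknown norm convention: {resolved!r}")
```

The same stored weights give scales that differ by exactly 1.0 under the two conventions, which is why a mismatch is large enough to break drafting rather than merely degrade it.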
mtpConcatOrder object
Concatenation order of the MTP fc projection input.
"embed_first" = fc(concat([enorm(embed(t_next)), hnorm(h)])) —
verified for Qwen3.5 (72.4% offline agreement, issue #192).
"hidden_first" is the inherited DeepSeek assumption (unverified).
None falls through to the family default keyed off the detected
sidecar layout.
- string
- null
Possible values: [embed_first, hidden_first]
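The two orders differ only in which half of the concatenated vector comes first. A minimal sketch with plain lists, assuming both inputs are already normalized (the results of enorm and hnorm in the formula above); the function name fc_input is hypothetical, and the hidden_first branch is inferred from the value's name rather than stated in the description.

```python
from typing import List


def fc_input(embed_normed: List[float], hidden_normed: List[float], order: str) -> List[float]:
    """Build the input vector for the MTP fc projection."""
    if order == "embed_first":
        # concat([enorm(embed(t_next)), hnorm(h)])
        return embed_normed + hidden_normed
    if order == "hidden_first":
        # Assumed mirror image: concat([hnorm(h), enorm(embed(t_next))])
        return hidden_normed + embed_normed
    raise ValueError(f"unknown concat order: {order!r}")
```

Because the fc weights were trained against one specific layout, swapping the halves feeds each weight column the wrong features even though the input shape is unchanged.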
speculativeMultiNode object
Whether speculation may run on multi-node placements of this model.
None (default) places no restriction. Set False for models where
multi-node speculation measures slower than plain distributed decode.
The 2026-06-06 benchmark matrix found gemma-4-26B-A4B (MoE) at
30.2 tok/s plain vs 28.2 tok/s with MTP on a 2-node pipeline (-7%), while
single-node MTP on the same model measures 2.2x: fast sharded MoE
decode plus modest acceptance makes the per-round draft+verify
overhead net negative. Single-node speculation is unaffected by this
knob. The decision is card-driven so every rank makes the same
speculate-or-not choice (the distributed agreement collective requires
rank symmetry).
- boolean
- null
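The gating rule described above reduces to a function of the card value and the placement size. A minimal sketch, assuming the node count is known identically on every rank; the function name may_speculate and the node_count parameter are hypothetical.

```python
from typing import Optional


def may_speculate(speculative_multi_node: Optional[bool], node_count: int) -> bool:
    """Decide whether speculation is allowed for this placement."""
    if node_count <= 1:
        # Single-node speculation is unaffected by this knob.
        return True
    # Multi-node: only an explicit False restricts speculation.
    # None (the default) and True both allow it.
    return speculative_multi_node is not False
```

Since the result depends only on the card and the placement, every rank that evaluates it reaches the same answer, which is the rank symmetry the description calls for. For reference, the quoted slowdown works out as (28.2 - 30.2) / 30.2, roughly -6.6%, rounded to -7% in the description.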
assistantModelRepo object
Hugging Face repo ID of a companion assistant (drafter) model.
Gemma 4 does speculative decoding differently from the Qwen3/DeepSeek
mtp.* heads: instead of embedded prediction heads, it pairs the target
with a separate small gemma4_assistant model (e.g.
"mlx-community/gemma-4-26B-A4B-it-assistant-bf16") that cross-attends
over the target's KV cache. When set, the assistant repo is downloaded
alongside the base model. Mutually exclusive with the mtp_* fields.
NOTE: consuming the assistant for speculative generation requires the
gemma4_assistant drafter from mlx-vlm >= 0.5.0 and is not yet wired into
the runner — declaring it here only pre-downloads it. See the Gemma 4 MTP
initiative in the foxlight-docs hub (Phase C).
- string
- null
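The mutual-exclusivity rule above can be checked before a card is published. This is a sketch, not the project's validator: the function name find_conflicts is hypothetical, and it assumes a field counts as set when it is present and not null.

```python
from typing import Any, Dict, List

# Card fields that belong to the sidecar-based MTP path.
MTP_FIELDS = (
    "mtpHeads",
    "mtpMaxDepth",
    "mtpSidecarRepo",
    "mtpNormConvention",
    "mtpConcatOrder",
)


def find_conflicts(card: Dict[str, Any]) -> List[str]:
    """Return the mtp* fields that conflict with assistantModelRepo."""
    if card.get("assistantModelRepo") is None:
        # No assistant declared, so the mtp* fields cannot conflict.
        return []
    # Assumption: "set" means present and not null.
    return [name for name in MTP_FIELDS if card.get(name) is not None]
```

An empty result means the card uses at most one of the two speculation paths.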
{
"promptRenderer": "tokenizer",
"outputParser": "generic",
"metalFastSynch": true,
"mtpHeads": true,
"mtpMaxDepth": 0,
"mtpSidecarRepo": "string",
"mtpNormConvention": "zero_centered",
"mtpConcatOrder": "embed_first",
"speculativeMultiNode": true,
"assistantModelRepo": "string"
}