RuntimeCapabilityCardConfig

Optional runtime behavior hints for a model card.

promptRenderer object

How prompts are rendered for this model (tokenizer chat template, gemma4, dsml); None uses the family default.

anyOf
PromptRendererType (string)

Prompt renderer strategies supported by the runtime.

Possible values: [tokenizer, gemma4, dsml]

outputParser object

How model output is parsed (generic, gemma4, gpt_oss, deepseek_v32), e.g. for reasoning/tool-call extraction; None uses the family default.

anyOf
OutputParserType (string)

Output parser strategies supported by the runtime.

Possible values: [generic, gemma4, gpt_oss, deepseek_v32]
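The two enumerations above can be checked before a card is accepted. The following is a minimal sketch, not the runtime's own validation: `check_choice` and the two set names are hypothetical, and only the allowed strings come from this page.

```python
from typing import Optional, Set

# Allowed values copied from the "Possible values" lists on this page.
PROMPT_RENDERERS: Set[str] = {"tokenizer", "gemma4", "dsml"}
OUTPUT_PARSERS: Set[str] = {"generic", "gemma4", "gpt_oss", "deepseek_v32"}


def check_choice(field: str, value: Optional[str], allowed: Set[str]) -> Optional[str]:
    """Return the value unchanged if it is None or an allowed string.

    None is passed through so the caller can apply the family default.
    """
    if value is None:
        return None
    if value not in allowed:
        raise ValueError(f"{field}: {value!r} is not one of {sorted(allowed)}")
    return value
```

For example, `check_choice("outputParser", "gpt_oss", OUTPUT_PARSERS)` returns the string unchanged, while an unknown value raises `ValueError`.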

metalFastSynch object

Per-model override for the MLX MLX_METAL_FAST_SYNCH flag.

None means "no opinion" — fall through to the cluster default selected by the runner. Set explicitly to False for models that deadlock under FAST_SYNCH on the ring backend (e.g. gemma-4 with multimodal load: the Metal command queue wedges in pipeline_last_eval_output, transitively starves WindowServer, and trips the macOS kernel watchdog into a panic). Set explicitly to True for models that have been measured to benefit and are known to be safe under the deployment's collective backend.

anyOf
boolean
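The tri-state behavior described above (None, False, True) might be resolved as in this sketch. The function name and the `cluster_default` parameter are hypothetical; how the runner actually selects its cluster default is not shown on this page.

```python
from typing import Optional


def resolve_fast_synch(card_value: Optional[bool], cluster_default: bool) -> bool:
    """Pick the effective FAST_SYNCH setting for one model.

    None means the card has no opinion, so the cluster default wins.
    An explicit True or False on the card always overrides the default.
    """
    if card_value is None:
        return cluster_default
    return card_value
```

Testing with `is None` rather than truthiness matters here: an explicit False must override a cluster default of True.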

mtpHeads object

True when native MTP prediction heads are available via sidecar.

Set alongside mtp_sidecar_repo. When false or absent, the runner skips sidecar loading and uses standard autoregressive generation.

anyOf
boolean

mtpMaxDepth object

Maximum draft depth the MTP heads support.

Start at 1 for Apple Silicon. Deeper values can be evaluated via profiling but are unlikely to amortize on Metal due to near-linear verify-pass scaling.

anyOf
integer
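Why deeper drafts may not pay off can be illustrated with a toy cost model. This is a sketch under stated assumptions, not a measurement: it assumes each draft token is accepted independently with probability `accept`, that one verify pass costs `1 + verify_slope * depth` plain decode steps (`verify_slope` near 1 is the near-linear case), and that each draft token costs `draft_cost` decode steps. All names and parameters are hypothetical.

```python
def expected_tokens(accept: float, depth: int) -> float:
    """Expected tokens emitted per draft+verify round.

    Uses the geometric sum 1 + a + a^2 + ... + a^depth, which assumes
    independent acceptance with probability `accept` at each position.
    """
    if accept >= 1.0:
        return float(depth + 1)
    return (1.0 - accept ** (depth + 1)) / (1.0 - accept)


def toy_speedup(accept: float, depth: int, verify_slope: float, draft_cost: float) -> float:
    """Tokens per unit cost relative to plain decode (1 token per unit)."""
    round_cost = 1.0 + verify_slope * depth + draft_cost * depth
    return expected_tokens(accept, depth) / round_cost


if __name__ == "__main__":
    for depth in (1, 2, 4):
        flat = toy_speedup(0.7, depth, verify_slope=0.0, draft_cost=0.05)
        linear = toy_speedup(0.7, depth, verify_slope=1.0, draft_cost=0.05)
        print(f"depth={depth} flat-verify={flat:.2f} linear-verify={linear:.2f}")
```

In this model, when the verify cost grows linearly with depth, the round cost grows at least as fast as the expected token count, so extra depth cannot help. Real behavior depends on measured acceptance and verify timings, which is why profiling is the recommended way to evaluate deeper values.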

mtpSidecarRepo object

Hugging Face repo ID containing the published mtp.safetensors sidecar.

Example: "FoxlightAI/qwen3-5-7b-instruct-mtp-q4k". The sidecar is downloaded alongside the base model weights and loaded into the runner for speculative decoding. Produced by SWP.

anyOf
string

mtpNormConvention object

How the sidecar stores its RMSNorm weights.

"zero_centered" means the stored value is the deviation from 1 (the raw Qwen3.5 checkpoint convention); the runner applies a +1.0 shift at load, mirroring what mlx-lm's sanitize() does for trunk weights. "actual_scale" means the stored value is the scale itself (DeepSeek convention). None falls through to the family default keyed off the detected sidecar layout. Override per card when a publisher changes conventions: the wrong convention measured 0% draft acceptance on Qwen3.5-2B (issue #192).

anyOf
string

Possible values: [zero_centered, actual_scale]
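The difference between the two conventions comes down to one shift at load time. A minimal sketch using plain Python lists in place of tensors; `load_rmsnorm_scale` is a hypothetical helper, not the runner's loader.

```python
from typing import List


def load_rmsnorm_scale(stored: List[float], convention: str) -> List[float]:
    """Convert stored RMSNorm weights into the scale used at inference.

    zero_centered: stored values are (scale - 1), so add 1.0 back.
    actual_scale:  stored values already are the scale.
    """
    if convention == "zero_centered":
        return [w + 1.0 for w in stored]
    if convention == "actual_scale":
        return list(stored)
    raise ValueError(f"unknown mtpNormConvention: {convention!r}")
```

Reading a zero-centered sidecar as "actual_scale" leaves every scale near 0 instead of near 1, which shrinks the normalized activations toward zero.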

mtpConcatOrder object

Concatenation order of the MTP fc projection input.

"embed_first" = fc(concat([enorm(embed(t_next)), hnorm(h)])) — verified for Qwen3.5 (72.4% offline agreement, issue #192). "hidden_first" is the inherited DeepSeek assumption (unverified). None falls through to the family default keyed off the detected sidecar layout.

anyOf
string

Possible values: [embed_first, hidden_first]
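The two orders differ only in which half of the fc input comes first. A minimal sketch with lists standing in for the already-normalized vectors enorm(embed(t_next)) and hnorm(h); `mtp_fc_input` is a hypothetical name.

```python
from typing import List


def mtp_fc_input(embed_normed: List[float], hidden_normed: List[float], order: str) -> List[float]:
    """Build the concatenated input to the MTP fc projection.

    embed_first:  [enorm(embed(t_next)), hnorm(h)]
    hidden_first: [hnorm(h), enorm(embed(t_next))]
    """
    if order == "embed_first":
        return embed_normed + hidden_normed
    if order == "hidden_first":
        return hidden_normed + embed_normed
    raise ValueError(f"unknown mtpConcatOrder: {order!r}")
```

Because the fc weights are trained against one fixed layout, swapping the halves feeds each weight column the wrong feature.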

speculativeMultiNode object

Whether speculation may run on multi-node placements of this model.

None (default) places no restriction. Set False for models where multi-node speculation measures slower than plain distributed decode. The 2026-06-06 benchmark matrix found gemma-4-26B-A4B (MoE) at 30.2 tok/s plain vs 28.2 tok/s with MTP on a 2-node pipeline (-7%), while single-node MTP on the same model measures 2.2x: fast sharded MoE decode plus modest acceptance makes the per-round draft+verify overhead net negative. Single-node speculation is unaffected by this knob. The decision is card-driven so that every rank makes the same speculate-or-not choice (the distributed agreement collective requires rank symmetry).

anyOf
boolean
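The gating rule above can be sketched as a pure function of the card value and the placement size, so every rank computes the same answer from the same inputs. `may_speculate`, `num_nodes`, and `relative_change_pct` are hypothetical names; the second helper only reproduces the quoted benchmark arithmetic.

```python
from typing import Optional


def may_speculate(speculative_multi_node: Optional[bool], num_nodes: int) -> bool:
    """Decide whether speculation is allowed for this placement.

    Single-node placements are never restricted by this field.
    On multi-node placements only an explicit False disables speculation.
    """
    if num_nodes <= 1:
        return True
    return speculative_multi_node is not False


def relative_change_pct(baseline: float, measured: float) -> float:
    """Percent change of measured throughput relative to baseline."""
    return 100.0 * (measured - baseline) / baseline


if __name__ == "__main__":
    # Figures quoted above: 30.2 tok/s plain vs 28.2 tok/s with MTP on 2 nodes.
    print(round(relative_change_pct(30.2, 28.2), 1))  # about -6.6, quoted as -7%
```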

assistantModelRepo object

Hugging Face repo ID of a companion assistant (drafter) model.

Gemma 4 does speculative decoding differently from the Qwen3/DeepSeek mtp.* heads: instead of embedded prediction heads, it pairs the target with a separate small gemma4_assistant model (e.g. "mlx-community/gemma-4-26B-A4B-it-assistant-bf16") that cross-attends over the target's KV cache. When set, the assistant repo is downloaded alongside the base model. Mutually exclusive with the mtp_* fields.

NOTE: consuming the assistant for speculative generation requires the gemma4_assistant drafter from mlx-vlm >= 0.5.0 and is not yet wired into the runner — declaring it here only pre-downloads it. See the Gemma 4 MTP initiative in the foxlight-docs hub (Phase C).

anyOf
string
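The mutual-exclusivity rule above could be enforced with a check like the following sketch. It operates on the JSON form of the card as a dict; `check_drafter_fields` is hypothetical, and treating `mtpHeads: false` as "not set" is an assumption made here, not something this page states.

```python
from typing import Any, Dict, List

# The mtp* field names documented on this page.
MTP_FIELDS = (
    "mtpHeads",
    "mtpMaxDepth",
    "mtpSidecarRepo",
    "mtpNormConvention",
    "mtpConcatOrder",
)


def check_drafter_fields(card: Dict[str, Any]) -> None:
    """Raise ValueError if a card declares both an assistant model and mtp* fields."""
    if card.get("assistantModelRepo") is None:
        return
    clashing: List[str] = [
        name
        for name in MTP_FIELDS
        # Assumption: None and an explicit False both count as "not set".
        if card.get(name) is not None and card.get(name) is not False
    ]
    if clashing:
        raise ValueError(
            "assistantModelRepo is mutually exclusive with: " + ", ".join(clashing)
        )
```

A card that sets only `assistantModelRepo`, or only the mtp* fields, passes without error.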

RuntimeCapabilityCardConfig
{
"promptRenderer": "tokenizer",
"outputParser": "generic",
"metalFastSynch": true,
"mtpHeads": true,
"mtpMaxDepth": 0,
"mtpSidecarRepo": "string",
"mtpNormConvention": "zero_centered",
"mtpConcatOrder": "embed_first",
"speculativeMultiNode": true,
"assistantModelRepo": "string"
}