RuntimeCapabilityCardConfig

Optional runtime behavior hints for a model card.

promptRenderer object

How prompts are rendered for this model (tokenizer chat template, gemma4, dsml); None uses the family default.

anyOf
PromptRendererType (string)

Prompt renderer strategies supported by the runtime.

Possible values: [tokenizer, gemma4, dsml]

outputParser object

How model output is parsed (generic, gemma4, gpt_oss, deepseek_v32), e.g. for reasoning/tool-call extraction; None uses the family default.

anyOf
OutputParserType (string)

Output parser strategies supported by the runtime.

Possible values: [generic, gemma4, gpt_oss, deepseek_v32]
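The "None uses the family default" rule for both fields can be sketched as follows. This is a minimal illustration, not the runtime's code: the enum classes mirror the documented values, while `resolve_renderer`, `resolve_parser`, and the `family_default` parameter are hypothetical names introduced here.

```python
from enum import Enum
from typing import Optional


class PromptRendererType(str, Enum):
    TOKENIZER = "tokenizer"
    GEMMA4 = "gemma4"
    DSML = "dsml"


class OutputParserType(str, Enum):
    GENERIC = "generic"
    GEMMA4 = "gemma4"
    GPT_OSS = "gpt_oss"
    DEEPSEEK_V32 = "deepseek_v32"


def resolve_renderer(card_value: Optional[str],
                     family_default: PromptRendererType) -> PromptRendererType:
    """Hypothetical helper: use the card value, else the family default."""
    if card_value is None:
        return family_default
    # Enum lookup raises ValueError for strings outside the documented set.
    return PromptRendererType(card_value)


def resolve_parser(card_value: Optional[str],
                   family_default: OutputParserType) -> OutputParserType:
    """Hypothetical helper: same fallback rule for the output parser."""
    if card_value is None:
        return family_default
    return OutputParserType(card_value)
```

Which default each model family actually uses is not stated on this page, so the sketch takes it as an argument rather than guessing.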

metalFastSynch object

Per-model override for the MLX MLX_METAL_FAST_SYNCH flag.

None means "no opinion" — fall through to the cluster default selected by the runner. Set explicitly to False for models that deadlock under FAST_SYNCH on the ring backend (e.g. gemma-4 with multimodal load: the Metal command queue wedges in pipeline_last_eval_output, transitively starves WindowServer, and trips the macOS kernel watchdog into a panic). Set explicitly to True for models that have been measured to benefit and are known to be safe under the deployment's collective backend.

anyOf
boolean
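The tri-state behavior described above (None, False, True) can be sketched as below. The function names and the "1"/"0" encoding of the environment variable are assumptions for illustration; only the MLX_METAL_FAST_SYNCH name comes from this page.

```python
from typing import Dict, Optional


def resolve_fast_synch(card_value: Optional[bool], cluster_default: bool) -> bool:
    """Hypothetical helper: a card opinion wins; None defers to the cluster."""
    if card_value is None:
        return cluster_default
    return card_value


def fast_synch_env(card_value: Optional[bool], cluster_default: bool) -> Dict[str, str]:
    """Build the environment override (the "1"/"0" encoding is an assumption)."""
    enabled = resolve_fast_synch(card_value, cluster_default)
    return {"MLX_METAL_FAST_SYNCH": "1" if enabled else "0"}
```

The explicit `is None` check matters: a shortcut such as `card_value or cluster_default` would silently discard an explicit False, which is exactly the value needed for models that deadlock.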

mtpHeads object

True when native MTP prediction heads are available via sidecar.

Set alongside mtp_sidecar_repo. When false or absent, the runner skips sidecar loading and uses standard autoregressive generation.

anyOf
boolean

mtpMaxDepth object

Maximum draft depth the MTP heads support.

Start at 1 for Apple Silicon. Deeper values can be evaluated via profiling but are unlikely to amortize on Metal due to near-linear verify-pass scaling.

anyOf
integer
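A rough way to see why deeper drafts may not amortize is the toy cost model below. It is a sketch under stated assumptions, not a measurement: it assumes each draft position is accepted independently with probability `p`, and that a verify pass over `depth` extra tokens costs `verify_slope * depth` plain decode steps (a slope near 1.0 stands in for "near-linear" scaling). All names are hypothetical.

```python
def expected_tokens_per_round(p: float, depth: int) -> float:
    """Expected tokens emitted per draft+verify round.

    One token always comes out of the verify pass; draft i is kept only if
    every earlier draft was accepted, which happens with probability p**i.
    """
    return sum(p ** i for i in range(depth + 1))


def estimated_speedup(p: float, depth: int, verify_slope: float,
                      draft_cost: float = 0.0) -> float:
    """Tokens per round divided by round cost, in units of one plain decode step."""
    round_cost = 1.0 + verify_slope * depth + draft_cost * depth
    return expected_tokens_per_round(p, depth) / round_cost


if __name__ == "__main__":
    for depth in (1, 2, 3):
        print(depth, estimated_speedup(p=0.8, depth=depth, verify_slope=1.0))
```

Under this model, with a fully linear verify cost the estimate falls as depth grows even at 80% per-position acceptance, while a flat verify cost (`verify_slope=0.0`) rewards depth. Profiling on the target hardware is what decides the real value.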

mtpSidecarRepo object

Hugging Face repo ID containing the published mtp.safetensors sidecar.

Example: "FoxlightAI/qwen3-5-7b-instruct-mtp-q4k". The sidecar is downloaded alongside the base model weights and loaded into the runner for speculative decoding. Produced by SWP.

anyOf
string
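How the heads flag and the sidecar repo interact can be sketched as follows. The helper name is hypothetical, and raising an error when the heads flag is set without a repo is an assumption; the page only says the two are set together.

```python
from typing import Optional

SIDECAR_FILENAME = "mtp.safetensors"  # file name taken from the field description


def sidecar_repo_to_load(mtp_heads: Optional[bool],
                         mtp_sidecar_repo: Optional[str]) -> Optional[str]:
    """Return the repo to fetch the sidecar from, or None for plain decoding."""
    if not mtp_heads:
        # False or absent: skip sidecar loading, use autoregressive generation.
        return None
    if not mtp_sidecar_repo:
        # Assumption: a card that enables heads without naming a repo is invalid.
        raise ValueError("mtp_heads is true but mtp_sidecar_repo is not set")
    return mtp_sidecar_repo
```

For example, `sidecar_repo_to_load(True, "FoxlightAI/qwen3-5-7b-instruct-mtp-q4k")` returns the repo ID, while `sidecar_repo_to_load(None, "FoxlightAI/qwen3-5-7b-instruct-mtp-q4k")` returns None because the heads flag is absent.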

mtpNormConvention object

How the sidecar stores its RMSNorm weights.

"zero_centered" means deviation-from-1 (the raw Qwen3.5 checkpoint convention — the runner applies a +1.0 shift at load, mirroring what mlx-lm's sanitize() does for trunk weights). "actual_scale" means the stored value is the scale itself (DeepSeek convention). None falls through to the family default keyed off the detected sidecar layout. Override per card when a publisher changes conventions — getting this wrong measured 0% draft acceptance on Qwen3.5-2B (issue #192).

anyOf
string

Possible values: [zero_centered, actual_scale]
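The two conventions differ only in whether 1.0 is added to the stored weights before use. The sketch below uses plain Python lists to stay dependency-free; `effective_norm_scale` and `rms_norm` are illustrative names, not the runner's functions.

```python
import math
from typing import List


def effective_norm_scale(stored: List[float], convention: str) -> List[float]:
    """Convert stored RMSNorm weights into the scale actually applied."""
    if convention == "zero_centered":
        return [w + 1.0 for w in stored]  # stored value is deviation from 1
    if convention == "actual_scale":
        return list(stored)               # stored value is already the scale
    raise ValueError(f"unknown norm convention: {convention!r}")


def rms_norm(x: List[float], scale: List[float], eps: float = 1e-6) -> List[float]:
    """Textbook RMSNorm: x / sqrt(mean(x^2) + eps) * scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * s for v, s in zip(x, scale)]
```

One plausible reading of the 0% acceptance result: zero-centered weights sit near 0, so treating them as `actual_scale` multiplies every normalized activation by roughly zero and the draft head sees almost no signal.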

mtpConcatOrder object

Concatenation order of the MTP fc projection input.

"embed_first" = fc(concat([enorm(embed(t_next)), hnorm(h)])) — verified for Qwen3.5 (72.4% offline agreement, issue #192). "hidden_first" is the inherited DeepSeek assumption (unverified). None falls through to the family default keyed off the detected sidecar layout.

anyOf
string

Possible values: [embed_first, hidden_first]
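The effect of the ordering can be shown with plain lists standing in for the already-normalized vectors. This is a sketch: the `hidden_first` layout is assumed to be the mirror image of the documented `embed_first` formula, and the function names are hypothetical.

```python
from typing import List


def mtp_fc_input(embed_normed: List[float], hidden_normed: List[float],
                 order: str) -> List[float]:
    """Concatenate enorm(embed(t_next)) and hnorm(h) in the card's order."""
    if order == "embed_first":
        return embed_normed + hidden_normed
    if order == "hidden_first":
        # Assumed to be the mirror of embed_first.
        return hidden_normed + embed_normed
    raise ValueError(f"unknown concat order: {order!r}")


def linear(weight: List[List[float]], x: List[float]) -> List[float]:
    """Minimal matrix-vector product standing in for the fc projection."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]
```

Because the fc weight columns are tied to input positions, swapping the order feeds each half of the matrix the wrong vector. The shapes still match, so nothing fails loudly; the draft quality just degrades.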

speculativeMultiNode object

Whether speculation may run on multi-node placements of this model.

None (default) places no restriction. Set False for models where multi-node speculation is measured slower than plain distributed decode. The 2026-06-06 benchmark matrix found gemma-4-26B-A4B (MoE) at 30.2 tok/s plain vs 28.2 tok/s with MTP on a 2-node pipeline (-7%), while single-node MTP on the same model measures 2.2x; fast sharded MoE decode plus modest acceptance makes the per-round draft+verify overhead net negative. Single-node speculation is unaffected by this knob. The decision is card-driven so that every rank makes the same speculate-or-not choice (the distributed agreement collective requires rank symmetry).

anyOf
boolean
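The gating rule and the quoted -7% figure can be reproduced with a few lines. The function names are hypothetical, and the check covers only this knob, not any other condition the runner may apply before speculating.

```python
from typing import Optional


def multi_node_speculation_allowed(speculative_multi_node: Optional[bool],
                                   num_nodes: int) -> bool:
    """Apply the card knob: only an explicit False blocks multi-node speculation."""
    if num_nodes <= 1:
        return True  # single-node placements are unaffected by this knob
    return speculative_multi_node is not False


def relative_change(baseline: float, candidate: float) -> float:
    """Fractional change of candidate throughput relative to baseline."""
    return (candidate - baseline) / baseline


if __name__ == "__main__":
    # 30.2 tok/s plain vs 28.2 tok/s with MTP, as quoted in the description.
    print(f"{relative_change(30.2, 28.2):+.1%}")
```

Since the inputs are just the card value and the node count, every rank that reads the same card reaches the same answer, which is the rank symmetry the description calls for.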

assistantModelRepo object

Hugging Face repo ID of a companion assistant (drafter) model.

Gemma 4 does speculative decoding differently from the Qwen3/DeepSeek mtp.* heads: instead of embedded prediction heads, it pairs the target with a separate small gemma4_assistant model (e.g. "mlx-community/gemma-4-26B-A4B-it-assistant-bf16") that cross-attends over the target's KV cache. When set, the assistant repo is downloaded alongside the base model. Mutually exclusive with the mtp_* fields.

NOTE: consuming the assistant for speculative generation requires the gemma4_assistant drafter from mlx-vlm >= 0.5.0 and is not yet wired into the runner — declaring it here only pre-downloads it. See the Gemma 4 MTP initiative in the foxlight-docs hub (Phase C).

anyOf
string
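The mutual-exclusion rule could be enforced at card load with a check like the one below. This is a sketch: the snake_case names other than `mtp_sidecar_repo` and `assistant_model_repo` are inferred from the camelCase schema keys, and treating any non-None mtp field (including an explicit False) as a conflict is an assumption.

```python
from typing import Any, Dict, List

# Field names inferred from the schema keys on this page (assumption).
MTP_FIELDS = (
    "mtp_heads",
    "mtp_max_depth",
    "mtp_sidecar_repo",
    "mtp_norm_convention",
    "mtp_concat_order",
)


def conflicting_mtp_fields(card: Dict[str, Any]) -> List[str]:
    """List mtp_* fields that are set while an assistant repo is declared."""
    if card.get("assistant_model_repo") is None:
        return []
    return [name for name in MTP_FIELDS if card.get(name) is not None]


def check_drafter_exclusivity(card: Dict[str, Any]) -> None:
    """Raise if the card mixes the assistant drafter with mtp_* settings."""
    conflicts = conflicting_mtp_fields(card)
    if conflicts:
        raise ValueError(
            "assistant_model_repo is mutually exclusive with: " + ", ".join(conflicts)
        )
```

A card carrying only `assistant_model_repo` passes; adding `mtp_heads` to the same card raises.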

servedSpecType object

Speculative-decoding mode for the llama_server (served-backend) engine.

Maps to a llama-server --spec-type token in the runner (_SPEC_TYPE_FLAG): draft_mtp -> draft-mtp (the model's own built-in MTP heads, no draft model needed; Qwen3.6/DeepSeek/GLM/Kimi/Nemotron); draft_eagle3 -> draft-eagle3 (an EAGLE-3 head); draft_simple -> draft-simple (a separate draft model); ngram -> ngram-cache (prompt-lookup); none/None -> plain decoding. Only the served engine reads this; the in-process mlx and llama_cpp engines ignore it (MLX speculation is configured by the mtp_* / assistant_model_repo fields above).

anyOf
string

Possible values: [none, draft_mtp, draft_eagle3, draft_simple, ngram]
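The mapping above translates directly into a lookup table. The sketch below reuses the `_SPEC_TYPE_FLAG` name from the description, but its contents are reconstructed from this page only, and omitting the flag entirely for plain decoding is an assumption.

```python
from typing import List, Optional

# Reconstructed from the field description; the runner's table may differ.
_SPEC_TYPE_FLAG = {
    "draft_mtp": "draft-mtp",
    "draft_eagle3": "draft-eagle3",
    "draft_simple": "draft-simple",
    "ngram": "ngram-cache",
}


def spec_type_args(served_spec_type: Optional[str]) -> List[str]:
    """Return the llama-server arguments for the card's speculative mode."""
    if served_spec_type is None or served_spec_type == "none":
        return []  # plain decoding: no --spec-type passed (assumption)
    try:
        token = _SPEC_TYPE_FLAG[served_spec_type]
    except KeyError:
        raise ValueError(f"unknown served_spec_type: {served_spec_type!r}") from None
    return ["--spec-type", token]
```

Note that `ngram` is the one entry whose token is not a simple underscore-to-hyphen rewrite, which is a reason to keep an explicit table rather than a string replace.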

servedSpecNMax object

Max draft tokens per step for the served engine (--spec-draft-n-max).

Must be a positive integer (validated at card load so that a bad value fails fast rather than producing an undefined --spec-draft-n-max at the server). None uses the llama-server default (3). Per-position acceptance drops with draft depth, so 2-3 is the usual sweet spot; tune per card from measured acceptance.

anyOf
integer

Possible values: > 0
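Fail-fast validation of this field might look like the following. The helper names are hypothetical; rejecting booleans is an extra precaution added here because Python treats `True` as the integer 1.

```python
from typing import List, Optional


def validate_spec_n_max(value: Optional[int]) -> Optional[int]:
    """Accept None or a positive integer; reject everything else at load time."""
    if value is None:
        return None  # defer to the llama-server default
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"served_spec_n_max must be a positive integer, got {value!r}")
    return value


def spec_n_max_args(value: Optional[int]) -> List[str]:
    """Emit --spec-draft-n-max only when the card sets a value."""
    n = validate_spec_n_max(value)
    return [] if n is None else ["--spec-draft-n-max", str(n)]
```

With this check, a card carrying 0 or a negative number is rejected when the card is parsed instead of reaching the server command line.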

servedSpecDraftRepo object

Hugging Face repo of a separate draft GGUF for the served engine.

Some served speculative modes need a second model passed to llama-server via --model-draft, NOT built-in heads: draft_simple (a vocab-matched small draft model) and draft_eagle3 (an EAGLE-3 head) always require one, and Gemma 4 draft_mtp uses its assistant as a separate draft GGUF (llama.cpp PR #23398) rather than baking heads into the base. Qwen3.6/DeepSeek/GLM draft_mtp leave this unset (heads are in the base GGUF). When set, the draft GGUF is downloaded as a companion alongside the base and passed as --model-draft. Pairs with served_spec_draft_file.

anyOf
string

servedSpecDraftFile object

Repo-relative GGUF filename of the served draft model (in served_spec_draft_repo), e.g. "mtp-gemma-4-31B-it.gguf". Required when served_spec_draft_repo is set; selects the exact draft quant the runner passes to --model-draft.

anyOf
string
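The pairing rules for the draft repo and draft file can be sketched as a load-time check. The function names and the `local_draft_path` parameter are hypothetical, and rejecting a file given without a repo is an assumption; the page only states that the file is required once the repo is set and that draft_simple and draft_eagle3 always need a draft model.

```python
from typing import List, Optional

MODES_REQUIRING_DRAFT = ("draft_simple", "draft_eagle3")


def validate_served_draft(spec_type: Optional[str],
                          draft_repo: Optional[str],
                          draft_file: Optional[str]) -> None:
    """Check that the draft repo/file pair is consistent with the spec mode."""
    if draft_repo is not None and draft_file is None:
        raise ValueError("served_spec_draft_file is required when the repo is set")
    if draft_file is not None and draft_repo is None:
        # Assumption: a file name without a repo cannot be resolved.
        raise ValueError("served_spec_draft_file needs served_spec_draft_repo")
    if spec_type in MODES_REQUIRING_DRAFT and draft_repo is None:
        raise ValueError(f"{spec_type} requires a separate draft GGUF")


def model_draft_args(local_draft_path: Optional[str]) -> List[str]:
    """Pass the downloaded draft GGUF to llama-server, if there is one."""
    return [] if local_draft_path is None else ["--model-draft", local_draft_path]
```

A draft_mtp card for a model with heads baked into the base GGUF leaves both fields unset and passes validation, so no `--model-draft` argument is emitted.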

RuntimeCapabilityCardConfig
{
  "promptRenderer": "tokenizer",
  "outputParser": "generic",
  "metalFastSynch": true,
  "mtpHeads": true,
  "mtpMaxDepth": 1,
  "mtpSidecarRepo": "string",
  "mtpNormConvention": "zero_centered",
  "mtpConcatOrder": "embed_first",
  "speculativeMultiNode": true,
  "assistantModelRepo": "string",
  "servedSpecType": "none",
  "servedSpecNMax": 3,
  "servedSpecDraftRepo": "string",
  "servedSpecDraftFile": "string"
}