Assistant Models
Skulk uses speculative decoding to accelerate inference. For speculative decoding to work, the serving node must have access to a smaller draft model that proposes token candidates for the target model to verify. Different model families ship their draft-model weights in different ways — SWP handles both.
Pattern 1: Embedded MTP heads (Qwen3, DeepSeek)
Qwen3 and DeepSeek V3/R1 embed multi-token prediction (MTP) heads directly as
mtp.* tensor keys inside the BF16 checkpoint. Standard quantization pipelines
(mlx-lm's sanitize()) strip these keys when converting the model. SWP
re-extracts them from the original BF16 checkpoint and publishes the result at
full precision (bf16, unquantized) as mtp.safetensors to a separate Hugging
Face repository. One bf16 sidecar serves every quantization of the base model,
so there is one sidecar per base model.
Catalog fields for this pattern:
mtp_source_repo: Qwen/Qwen3.5-9B # BF16 source to extract from
mtp_sidecar_repo: FoxlightAI/qwen3-5-9b-mtp # where mtp.safetensors lands
Detection: skulk-weights catalog add fetches model.safetensors.index.json
from the resolved BF16 base and counts keys that start with mtp. or contain
.mtp.. If any are found, MTP fields are written to the catalog entry.
See the MTP sidecar guide for full instructions.
Pattern 2: Companion assistant (Gemma 4)
Gemma 4 does not embed MTP tensors in the base checkpoint. Instead, Google publishes a separate assistant model alongside each instruction-tuned release. The assistant is already quantized and published on Hugging Face — no tensor extraction is needed.
The naming convention is {base_model}-assistant:
| Instruction-tuned model | Companion assistant |
|---|---|
google/gemma-4-31B-it | google/gemma-4-31B-it-assistant |
google/gemma-4-26B-A4B-it | google/gemma-4-26B-A4B-it-assistant |
Catalog field for this pattern:
assistant_model_repo: google/gemma-4-31B-it-assistant
assistant_model_repo is mutually exclusive with mtp_source_repo and
mtp_sidecar_repo. Setting both on the same entry is a validation error.
Detection: find_assistant_model checks three candidates in order via a HEAD
request against the HuggingFace API—{model}-assistant for the pasted model
itself, then {immediate_base}-assistant, then {resolved_base}-assistant for
the resolved BF16 root. The first candidate that exists (HTTP 200) is written to
the catalog entry. No download occurs.
When an assistant is found, the written entry uses assistant_model_repo and
omits the mtp_* fields—assistant presence takes priority at write time.
Choosing between the two patterns
You do not choose — SWP detects the correct pattern automatically. The decision tree is:
- Resolve the BF16 base model from the quantized model's
base_model:quantized:tag. - Fetch
model.safetensors.index.jsonfrom the BF16 base. Countmtp.*keys. - If
mtp.*keys are found → write MTP fields. - If no
mtp.*keys are found → check HuggingFace for a companion assistant ({base}-assistant). If found → writeassistant_model_repo. - If neither → no speculative-decoding fields are written.
Catalog schema summary
# Embedded MTP heads (Qwen3, DeepSeek) — sidecar ships bf16, one per base model
mtp_source_repo: owner/model-bf16
mtp_sidecar_repo: FoxlightAI/model-mtp
# Companion assistant (Gemma 4)
assistant_model_repo: google/gemma-4-31B-it-assistant
Both groups are optional and mutually exclusive. An entry with neither is a valid vindex-only entry with no speculative-decoding support.