Publishing Lifecycle
A vindex goes through a build-and-release lifecycle because it is the boundary between expensive GPU inference nodes and cheaper CPU/high-memory weight servers. The goal is a stable, published model representation that every Skulk node can agree on before runtime placement begins.
1. Describe The Vindex
The vindex starts as a catalog source entry. The built-in Foxlight entries
are packaged with the CLI, and operator entries can be added through
skulk-weights.yaml. Each entry names the source model, quantization, slice
mode, local .vindex directory, and target Hugging Face repository. The slice
mode is part of the runtime contract: it tells operators whether they are
publishing a complete vindex or a specialized expert-server shape for weight
serving.
2. Validate The Catalog
skulk-weights catalog validate
Validation catches duplicate keys, unsupported slice names, bad repository names, and output names that would not be safe to write.
3. Check The Runner
skulk-weights doctor --publish
The publishing runner needs Python, LARQL, writable scratch storage, network
access to Hugging Face, and HF_TOKEN. It does not have to be the eventual
runtime host; it is the machine that performs the expensive extraction and
upload.
4. Review The Plan
skulk-weights publish --model foxlight/gemma-3-4b-full-q4-k --dry-run
The dry-run prints the exact larql extract and larql publish commands. This
is the last cheap place to catch a wrong source model, path, slice mode, or
repository before disk-heavy extraction begins.
publish builds whichever artifacts you ask for through --artifact. The
values are:
vindex— the LARQL retrieval index (the default subject of this page).mtp— the Multi-Token Prediction sidecar (step 7).vision— the vision-encoder sidecar mirror (step 8).all— every artifact configured on the catalog entry.
5. Extract The Vindex
Real publication starts by running larql extract. This can use substantial
scratch disk because it creates a local vindex directory before anything is
uploaded.
6. Publish The Vindex
After extraction, the publisher runs larql publish and uploads the vindex to
the Hugging Face repository in the catalog entry.
7. Upload The Self-Describing Model Card
Every real publish—vindex, MTP, or vision—also uploads a README.md model card
to the published repository. The card is self-describing so the artifact carries
its own provenance instead of relying on external records.
The frontmatter sets base_model to the source repo, tags the artifact
([artifact_type, skulk, foxlight, quant]), and inherits the source model's
license unchanged (custom licenses also carry license_name/license_link). It
also embeds a foxlight: provenance block: artifact type, source repo, the
pinned source_revision commit SHA, target model, quant, catalog key, the tool
that extracted it, and a timestamp. The body explains what the artifact is, a
provenance table, usage, and a license note.
The source commit SHA and license are resolved best-effort from the Hub using
HF_TOKEN. Published artifacts inherit the source model's license unchanged and
are never re-licensed—everything published is for the community.
8. Extract And Publish The MTP Sidecar (optional)
For models that carry native Multi-Token Prediction heads—Qwen3, DeepSeek V3/R1,
and similar architectures—a second extraction pass pulls the mtp.* tensors from
the original BF16 checkpoint and uploads them at full precision (bf16,
unquantized) as mtp.safetensors to a separate sidecar repository — one sidecar
per base model, shared across every quantization of it.
This step is separate from vindex publication because the MTP weights must come from
the original PyTorch checkpoint. mlx-lm's sanitize() strips mtp.* keys during
conversion, so the mlx-converted source used for vindex extraction does not contain
them.
skulk-weights publish --model acme/qwen3-6b-full-q4-k --artifact mtp --dry-run
skulk-weights publish --model acme/qwen3-6b-full-q4-k --artifact mtp
The dry-run prints the source repo, sidecar repo, precision, and output path
before any download begins. Real execution downloads only the shards that contain
mtp.* keys (using the model's model.safetensors.index.json to identify them),
saves the tensors at full precision (bf16, unquantized) as a local
mtp.safetensors, and uploads it to the sidecar repo — alongside its own
self-describing model card (step 7). One bf16 sidecar serves every quantization
of the base model; if the sidecar already exists it is skipped (with a message
saying it already covers this model and all its quantizations) unless --force
is passed.
If mtp_source_repo and mtp_sidecar_repo are not set on the catalog entry,
--artifact mtp raises an error with a clear message rather than silently
skipping.
See the MTP Sidecar Guide for a complete walkthrough.
9. Mirror And Publish The Vision Sidecar (optional)
Some vision-language models ship an mlx-community checkpoint that omits the vision encoder—Kimi K2.5, for example, keeps its vision weights in a third-party repository. For those models SWP publishes a vision sidecar: a Foxlight-owned mirror so the cluster does not depend on a third party.
Unlike the MTP step, the vision sidecar performs no quantization and no dtype
conversion. It copies the vision_source_repo's weights and configs into
vision_sidecar_repo byte-for-byte, alongside its own self-describing model
card (step 7). It needs huggingface_hub but not mlx.
skulk-weights publish --model acme/kimi-k2-5-full-q4-k --artifact vision --dry-run
skulk-weights publish --model acme/kimi-k2-5-full-q4-k --artifact vision
If vision_source_repo and vision_sidecar_repo are not set on the catalog
entry, --artifact vision raises an error with a clear message rather than
silently skipping.
10. File Into A Collection
Each successful publish is filed into the Hugging Face collection for its
artifact type: Vindexes, MTP Sidecars, or Vision Sidecars. The sidecar
collections are resolved by title—created if missing, reused if they already
exist—so a delete-and-republish stays in the right collection. The vindex is
filed into the configured slug exactly (the catalog hf_collection or
SKULK_WEIGHTS_COLLECTION).
Collection filing is disabled when no collection is configured, or when
SKULK_WEIGHTS_COLLECTION is set to one of none, 0, false, no, off,
or disabled.
11. Use The Published Weights
Once published, the vindex and any sidecar have stable repository names. Skulk operators can use those names when assigning GPU nodes to the latency-sensitive inference path and CPU/high-memory LARQL servers to FFN or expert weight serving. Skulk loads the MTP sidecar at inference time when MTP is enabled for a given deployment.