Skip to main content

SWP: Skulk Weights Publisher

SWP: Skulk Weights Publisher publishes model weights for Skulk clusters. It handles three distinct artifact types:

LARQL vindexes — LARQL decompiles transformer weights into a queryable vindex format so Skulk does not have to keep every weight resident in expensive GPU memory. CPU/high-memory LARQL servers host the feed-forward and expert weights while GPU nodes handle the latency-sensitive attention path.

MTP sidecars — Models with native multi-token prediction heads (Qwen3, DeepSeek V3/R1, and others that expose mtp.* tensor keys) need those heads published separately. Standard quantization pipelines strip MTP tensors. SWP re-extracts them from the original BF16 checkpoint and uploads them at full precision (bf16, unquantized) as mtp.safetensors to a dedicated repo — one sidecar per base model, shared across every quantization of it — so Skulk can use speculative decoding.

Vision sidecars — mlx-community VLM quants frequently omit the vision encoder, leaving the vision weights to live in a third-party repository. SWP mirrors those encoder weights byte-for-byte into a Foxlight-owned repo so the multimodal path has no external dependency. See the Vision sidecar guide.

Not every model embeds MTP heads. Gemma 4, for example, ships a separate companion {model}-assistant drafter model for speculative decoding rather than mtp.* tensors. SWP detects this pattern and records the pairing in the catalog — no tensor extraction is needed. See the Gemma 4 assistant guide.

Every real publish also uploads a self-describing README.md model card: the source model's license is inherited unchanged, with a Foxlight provenance block pinning the exact source SHA, plus usage notes. Each artifact is filed into its per-artifact-type Hugging Face collection (Vindexes, MTP Sidecars, or Vision Sidecars).

SWP keeps the list of models Skulk wants, validates how each one should be extracted and sliced or quantized, shows the exact commands that will run, and executes the publish workflow from a configured runner.

The Foxlight catalog is built in. You can use it immediately, or add your own operator catalog with skulk-weights.yaml. The merged catalog uses namespaced keys such as foxlight/gemma-3-4b-full-q4-k and my-org/my-model-full-q4-k so shared Foxlight entries and local operator entries remain distinct.

New to Skulk and LARQL? Read How Skulk Works before the quickstart. It explains the cluster architecture, what vindexes are, what MTP sidecars are for, and why the publisher exists.

What is LARQL?

LARQL treats a model as a database. It decompiles transformer weights into a queryable format and exposes LQL, the Lazarus Query Language, for browsing, editing, running inference against, and recompiling model knowledge. In this project, we use LARQL through larql extract to create vindexes and larql publish to upload the full and expert-server forms that Skulk can place on the right machines.

What is a vindex?

A vindex, short for vector index, is LARQL's on-disk representation of a model. It is a directory of memory-mapped files where transformer weights have been reorganized for queryability: gate vectors become a nearest-neighbor index, embeddings become token lookups, and down projections become edge labels. Because the weights are in this structured form, LARQL can serve FFN and expert weights from CPU/high-memory nodes instead of requiring every weight-serving machine to be a GPU inference node.

What is an MTP sidecar?

Multi-token prediction is a training technique where a model learns to predict several future tokens simultaneously rather than one at a time. Models trained this way expose native MTP heads as tensors with mtp.* keys in their checkpoint. At inference time Skulk can use these heads for speculative decoding — drafting multiple tokens in one forward pass and verifying them — which substantially increases throughput with no accuracy loss.

Standard quantization pipelines strip MTP tensors to reduce download size. SWP's MTP pipeline re-extracts those tensors from the original BF16 checkpoint and publishes them at full precision (bf16, unquantized) as mtp.safetensors to a separate Hugging Face repo — one sidecar per base model, shared across every quantization of it. The vindex and the MTP sidecar are versioned independently so each can be updated or replaced without disturbing the other.

Why do this?

Weight publication is expensive and easy to get wrong. Bad commands can write hundreds of gigabytes to the wrong path, publish under the wrong Hugging Face repository name, or silently skip MTP heads that a model needs for speculative decoding. SWP makes publication repeatable: the catalog records the exact source, quantization, slice mode, and target for each artifact; --dry-run shows the full plan before anything is extracted or uploaded; and the GitHub Actions workflow validates every catalog entry on every pull request.