Gemma 4

Gemma 4 is one of the first model families in Skulk that required explicit, model-specific runtime handling.

That makes it a useful reference point for how the capability system is meant to work.

Why Gemma 4 Is Special

Gemma 4 is not just “a text model with vision.”

It brings together several distinct behaviors:

custom multimodal prompt structure
channel-delimited reasoning blocks
native multimodal execution paths
model-family-specific tool formatting considerations
variant-specific modality support, including audio on some smaller variants

Generic least-common-denominator handling was enough to get partial behavior, but not enough to get reliable, correct behavior.

Prompt Handling

Plain Gemma 4 requests use a dedicated renderer instead of the generic tokenizer chat template when Skulk needs exact control over the prompt shape.

That is especially important for:

multimodal user messages
assistant generation prefix handling
reasoning channel initialization

Reasoning Format

Gemma 4 reasoning uses a channel-delimited format rather than the simpler token-delimited approach used by some other models.

In practice, that means Skulk needs to:

render the correct thought-channel structure
parse the channel markers correctly
route reasoning text away from visible assistant content

This is exactly the kind of behavior the capability system is meant to describe explicitly.

Vision and Native Multimodality

Gemma 4 can use a native multimodal execution path.

That means model support is not just a matter of accepting image content parts. The runtime also needs to know:

whether native multimodal execution is expected
what processor/model type is used
how to interpret media token regions

Current Capability Declarations

The built-in Gemma 4 cards now declare advanced capability sections so the runtime does not have to infer everything from scattered family checks.

Today that includes declarations for:

reasoning toggle support
reasoning format
prompt renderer
output parser
native multimodal support
tool-call format family

Current Gaps

Current support does not mean Gemma 4 is fully feature-complete yet.

Some follow-up work is intentionally tracked separately, including:

reasoning budget support
audio input support for variants that expose it upstream
fuller Gemma 4 tool grammar support

Debugging Gemma 4 Stalls

If Gemma 4 appears to hang during warmup or prefill, enable the MLX hang-debug instrumentation before starting Skulk:

export SKULK_MLX_HANG_DEBUG=1
export SKULK_MLX_HANG_DEBUG_INTERVAL_SECONDS=10
uv run skulk

This causes the runner to log:

warmup and prefill phase boundaries
the selected prefill path
whether the first prefill token was ever produced
repeated Python stack traces while the stuck phase is still active

For distributed Gemma 4 pipeline warmup specifically, Skulk now forces a minimal synthetic prompt by design:

no synthetic instructions
a single user message with content hello

This is intentional. During debugging, richer synthetic warmup prompts were observed to trigger the distributed warmup hang path on multi-node pipeline setups. Pipeline models now stay on Skulk's explicit pipeline-prefill path even for short one-chunk prompts, while distributed warmup keeps the synthetic prompt minimal and suppresses unnecessary distributed progress polling for those tiny prefills.

Current Clustered Runtime Envelope

Gemma 4 currently has a narrower trusted clustered path than more generic text models.

Today, the boring path is:

Gemma 4-specific prompt rendering
Gemma 4-specific thinking-channel parsing
SequentialGenerator for distributed inference
explicit pipeline prefill during warmup
default KV cache as the baseline

This is conservative by design. It reflects the current support envelope that has proven stable in cluster debugging.

Two debug-only warmup shaping env vars exist:

SKULK_DEBUG_WARMUP_REPEAT_COUNT
SKULK_DEBUG_WARMUP_INCLUDE_INSTRUCTIONS

However, distributed pipeline warmup intentionally ignores them and stays on the minimal sanity-check prompt. They remain useful for single-node investigation.

If you need to bypass synthetic warmup entirely during diagnosis, you can use:

export SKULK_SKIP_LLM_WARMUP=1

That bypass should be treated as a temporary debugging escape hatch only. Distributed pipeline groups ignore it so one rank cannot skip warmup while peers are still participating in warmup coordination.

Why This Matters

Gemma 4 is the proof point for the model capability system.

If Skulk can express Gemma 4 behavior through model-card-backed capability declarations plus a resolved runtime profile, then future model-family support gets much cleaner:

less hidden coupling
fewer one-off patches
more accurate dashboard and API behavior

Why Gemma 4 Is Special​

Prompt Handling​

Reasoning Format​

Vision and Native Multimodality​

Current Capability Declarations​

Current Gaps​

Debugging Gemma 4 Stalls​

Current Clustered Runtime Envelope​

Why This Matters​

Related​