KV Cache Backends

Skulk includes several opt-in KV cache backends for MLX text generation. They are intended for long-context and memory-pressure experiments. Existing behavior is unchanged unless a backend is explicitly enabled.

Current Status

  • default: existing behavior (no cache quantization)
  • mlx_quantized: MLX LM built-in QuantizedKVCache
  • turboquant: correctness-first TurboQuant-inspired KV cache for standard KVCache layers
  • turboquant_adaptive: keeps outer KV layers in FP16 and applies TurboQuant to middle KV layers
  • optiq: rotation-based KV cache via mlx-optiq; uses randomized orthogonal rotations with Lloyd-Max quantization and rotated-space attention for superior long-context quality

If SKULK_KV_CACHE_BACKEND is unset, or is set to default, Skulk behaves as before.
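The fallback rule above can be sketched in a few lines of Python. This is an illustration only, not Skulk's actual code: `resolve_backend` is a hypothetical helper, and treating an empty value like an unset one and rejecting unknown names are both assumptions.

```python
import os

# Backend names taken from the "Current Status" list.
KNOWN_BACKENDS = {
    "default",
    "mlx_quantized",
    "turboquant",
    "turboquant_adaptive",
    "optiq",
}


def resolve_backend(env=None):
    """Return the selected backend name, falling back to "default".

    Hypothetical helper for illustration; not part of Skulk's API.
    """
    env = os.environ if env is None else env
    value = env.get("SKULK_KV_CACHE_BACKEND", "").strip()
    if not value:
        # Unset (and, by assumption, empty) means "behave as before".
        return "default"
    if value not in KNOWN_BACKENDS:
        # Assumption: unknown names are rejected rather than ignored.
        raise ValueError(f"unknown KV cache backend: {value!r}")
    return value
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without touching the real environment.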

mlx-optiq (best quality)

SKULK_KV_CACHE_BACKEND=optiq \
SKULK_OPTIQ_BITS=4 \
SKULK_OPTIQ_FP16_LAYERS=4 \
uv run skulk

The optiq backend uses mlx-optiq's rotation-based vector quantization. Its rotated-space attention avoids per-key rotation overhead at inference time. The first and last N KV layers stay in FP16, where N is the value of SKULK_OPTIQ_FP16_LAYERS (4 in the example above).

TurboQuant Adaptive (proven stable)

SKULK_KV_CACHE_BACKEND=turboquant_adaptive \
SKULK_TQ_K_BITS=3 \
SKULK_TQ_V_BITS=4 \
SKULK_TQ_FP16_LAYERS=4 \
uv run skulk

This mode keeps the first and last 4 KV layers (SKULK_TQ_FP16_LAYERS=4) in the standard FP16 cache and applies TurboQuant only to the middle KV layers. It has proven stable across most models.
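To make the edge/middle split concrete, here is a minimal sketch of how the layer indices could be partitioned. The function names are hypothetical. The sketch assumes the FP16 layer count applies to each end of the stack (as "first and last 4" suggests) and that a model too shallow to have a middle keeps every layer in FP16. Skulk's real handling of that corner case may differ.

```python
def fp16_layer_indices(num_kv_layers, edge_layers):
    """Indices of KV layers left in FP16: the first and last `edge_layers`."""
    if edge_layers < 0:
        raise ValueError("edge_layers must be non-negative")
    if 2 * edge_layers >= num_kv_layers:
        # Assumption: with no middle left, nothing is quantized.
        return set(range(num_kv_layers))
    head = range(edge_layers)
    tail = range(num_kv_layers - edge_layers, num_kv_layers)
    return set(head) | set(tail)


def quantized_layer_indices(num_kv_layers, edge_layers):
    """Indices of the middle KV layers that would be quantized."""
    everything = set(range(num_kv_layers))
    return everything - fp16_layer_indices(num_kv_layers, edge_layers)
```

Under these assumptions, a model with 32 KV layers and 4 edge layers keeps layers 0 to 3 and 28 to 31 in FP16 and quantizes layers 4 to 27.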

Available Environment Variables

| Variable | Backends | Default | Description |
|---|---|---|---|
| SKULK_KV_CACHE_BACKEND | all | default | Backend selection |
| SKULK_KV_CACHE_BITS | mlx_quantized | (required) | Bit width for MLX quantized cache |
| SKULK_OPTIQ_BITS | optiq | 4 | Bit width for mlx-optiq cache |
| SKULK_OPTIQ_FP16_LAYERS | optiq | 4 | Edge layers kept in FP16 |
| SKULK_TQ_K_BITS | turboquant, turboquant_adaptive | 3 | Key quantization bits |
| SKULK_TQ_V_BITS | turboquant, turboquant_adaptive | 4 | Value quantization bits |
| SKULK_TQ_FP16_LAYERS | turboquant_adaptive | 4 | Edge layers kept in FP16 |
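The defaults and the one required variable in the table above could be read roughly as follows. This is a sketch under assumptions, not Skulk's configuration code: `read_int` and `backend_settings` are hypothetical names, and raising an error when SKULK_KV_CACHE_BITS is missing is an assumed behavior.

```python
import os


def read_int(env, name, default=None):
    """Read an integer variable; a `default` of None marks it as required."""
    raw = env.get(name)
    if raw is None or raw == "":
        if default is None:
            raise ValueError(f"{name} must be set")
        return default
    return int(raw)


def backend_settings(env=None):
    """Collect the settings relevant to the selected backend (illustrative)."""
    env = os.environ if env is None else env
    backend = env.get("SKULK_KV_CACHE_BACKEND") or "default"
    if backend == "mlx_quantized":
        # No default is documented for this variable, so it is required here.
        return {"backend": backend, "bits": read_int(env, "SKULK_KV_CACHE_BITS")}
    if backend == "optiq":
        return {
            "backend": backend,
            "bits": read_int(env, "SKULK_OPTIQ_BITS", 4),
            "fp16_layers": read_int(env, "SKULK_OPTIQ_FP16_LAYERS", 4),
        }
    if backend in ("turboquant", "turboquant_adaptive"):
        settings = {
            "backend": backend,
            "k_bits": read_int(env, "SKULK_TQ_K_BITS", 3),
            "v_bits": read_int(env, "SKULK_TQ_V_BITS", 4),
        }
        if backend == "turboquant_adaptive":
            settings["fp16_layers"] = read_int(env, "SKULK_TQ_FP16_LAYERS", 4)
        return settings
    return {"backend": backend}
```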

Invocation Examples

Default behavior:

SKULK_KV_CACHE_BACKEND=default uv run skulk

mlx-optiq (rotation-based):

SKULK_KV_CACHE_BACKEND=optiq SKULK_OPTIQ_BITS=4 SKULK_OPTIQ_FP16_LAYERS=4 uv run skulk

MLX quantized KV cache:

SKULK_KV_CACHE_BACKEND=mlx_quantized SKULK_KV_CACHE_BITS=4 uv run skulk

TurboQuant adaptive:

SKULK_KV_CACHE_BACKEND=turboquant_adaptive SKULK_TQ_K_BITS=3 SKULK_TQ_V_BITS=4 SKULK_TQ_FP16_LAYERS=4 uv run skulk

Practical Expectations

| Backend | Memory | Quality | Speed | Notes |
|---|---|---|---|---|
| default | Highest | Baseline | Fastest | No quantization |
| optiq | Low | Best quantized | Near-baseline | Rotation-based, best long-context |
| turboquant_adaptive | Low | Good | Moderate | Proven stable, Hadamard-based |
| turboquant | Lowest | Variable | Moderate | Most aggressive compression |
| mlx_quantized | Low | Good | Moderate | MLX built-in quantization |
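For a rough sense of what the Memory column means in bytes, the cache payload scales with layers × KV heads × head dimension × sequence length × bits per element. The sketch below is back-of-the-envelope arithmetic only: it ignores quantization metadata (scales, codebooks) and any FP16 edge layers, and `kv_cache_bytes` and the model shape in the usage note are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   k_bits=16, v_bits=16):
    """Approximate KV cache payload in bytes, excluding quantization metadata."""
    # Number of stored elements for keys alone (values have the same count).
    elements = num_layers * num_kv_heads * head_dim * seq_len
    # Keys cost k_bits each and values cost v_bits each; 8 bits per byte.
    return elements * (k_bits + v_bits) // 8
```

For a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128 at 32,768 tokens, this gives 4 GiB in FP16 and 0.875 GiB with 3-bit keys and 4-bit values. Real savings will be smaller once metadata and FP16 edge layers are counted.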

Supported Cache Layouts

All quantized backends (optiq, turboquant, turboquant_adaptive, mlx_quantized) compress only standard KVCache entries and leave these cache types unchanged:

  • ArraysCache
  • RotatingKVCache

Mixed cache layouts are supported:

  • KVCache + ArraysCache
  • KVCache + RotatingKVCache
  • KVCache + ArraysCache + RotatingKVCache
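The selection rule for mixed layouts can be sketched as a filter over the per-layer cache list. The classes below are empty stand-ins named after the cache types mentioned above so the example runs without MLX installed. `QuantizedStandIn` and `wrap_compressible` are hypothetical and do not reflect Skulk's internals.

```python
class KVCache:
    """Stand-in for the standard KV cache type."""


class ArraysCache:
    """Stand-in for a cache type that must be preserved."""


class RotatingKVCache:
    """Stand-in for a cache type that must be preserved."""


class QuantizedStandIn:
    """Hypothetical wrapper representing a compressed cache entry."""

    def __init__(self, inner):
        self.inner = inner


def wrap_compressible(cache_entries, compress=QuantizedStandIn):
    """Compress only plain KVCache entries; pass everything else through."""
    # An exact type check (not isinstance) avoids accidentally matching
    # other cache classes that might subclass KVCache.
    return [
        compress(entry) if type(entry) is KVCache else entry
        for entry in cache_entries
    ]
```

Applied to a layout such as KVCache + ArraysCache + RotatingKVCache, only the first entry is wrapped and the other two are returned as the same objects.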

Current Limitations

  • All quantized KV cache backends force sequential generation (no batch/history mode)
  • The optiq backend requires mlx-optiq to be installed (pip install mlx-optiq)
  • The optiq backend's patch_attention() monkey-patches MLX's SDPA, so avoid switching between optiq and other backends within the same process; restart Skulk instead

About mlx-optiq

The optiq backend is powered by mlx-optiq, which provides:

  • Rotation-based vector quantization: Random orthogonal rotations + Lloyd-Max centroids
  • Rotated-space attention: Eliminates per-key rotation overhead (O(d²) fixed cost vs O(seq_len × d²))
  • Superior long-context quality: the project claims 100% needle retrieval with a 4-bit cache, versus 73% with FP16
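The cost argument behind rotated-space attention rests on a simple identity: for an orthogonal matrix R, q·k equals (Rq)·(Rk). If keys are stored in rotated form, the query can be rotated once per step (O(d²)) instead of un-rotating every cached key (O(seq_len × d²)). The pure-Python sketch below demonstrates only that identity. It omits quantization entirely, builds R with Gram-Schmidt as a stand-in for however mlx-optiq generates its rotations, and uses hypothetical function names.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rotate(matrix, vector):
    """Matrix-vector product with the matrix given as a list of rows."""
    return [dot(row, vector) for row in matrix]


def random_orthogonal(d, seed=0):
    """Random d x d orthogonal matrix via Gram-Schmidt on Gaussian vectors."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            proj = dot(v, u)
            v = [vi - proj * ui for vi, ui in zip(v, u)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-9:  # skip the (unlikely) degenerate draw
            rows.append([vi / norm for vi in v])
    return rows


def scores_unrotating_keys(R, q, rotated_keys):
    """Baseline: apply R^T to every cached key, then score against q."""
    R_t = [list(col) for col in zip(*R)]
    return [dot(q, rotate(R_t, k_rot)) for k_rot in rotated_keys]


def scores_rotated_space(R, q, rotated_keys):
    """Rotate the query once, then score directly against rotated keys."""
    q_rot = rotate(R, q)
    return [dot(q_rot, k_rot) for k_rot in rotated_keys]
```

Both functions return the same attention scores up to floating-point error, but the second performs a single d × d product per query regardless of how many keys are cached.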

mlx-optiq also provides mixed-precision weight quantization (per-layer sensitivity analysis via KL divergence), which Skulk plans to integrate as a model store feature in a future release.