ClusterDiagnostics

Read-only diagnostic bundle collected from reachable cluster nodes.

generatedAt (string) required

UTC timestamp when collection finished.

localNodeId (string) required

Node ID of the API serving this response.

masterNodeId object

Current master node ID, when known.

anyOf
string
nodes object[]

Local and reachable peer diagnostic results.

  • Array [
  • nodeId (string) required

    Node ID for this cluster diagnostics result.

    url object

    Peer API URL used to collect diagnostics, if remote.

    anyOf
    string
    ok (boolean) required

    Whether diagnostics were collected successfully.

    diagnostics object

    Collected node diagnostics when ok is true.

    anyOf
    generatedAt (string) required

    UTC timestamp when this bundle was built.

    runtime object required

    Runtime identity and config.

    nodeId (string) required

    Local node ID.

    hostname (string) required

    Local hostname.

    friendlyName object

    Friendly node name from gathered identity data.

    anyOf
    string
    isMaster (boolean) required

    Whether this node is the current master.

    masterNodeId object

    Current master node ID, when known.

    anyOf
    string
    cwd (string) required

    Current working directory of the API process.

    configPath (string) required

    Config path resolved by this API process.

    configFileExists (boolean) required

    Whether the resolved config path exists, as seen from this process's cwd.

    skulkVersion (string) required

    Installed Skulk package version.

    skulkCommit (string) required

    Git commit reported by node identity.

    libp2PNamespace object

    Configured libp2p namespace environment value, if set.

    anyOf
    string
    pythonUnbuffered (boolean) required

    Whether PYTHONUNBUFFERED is enabled for this process.

    tracingEnabled (boolean) required

    Current cluster runtime tracing state as seen by this API node.

    structuredLoggingConfigured (boolean) required

    Whether config enables centralized structured logging.

    loggingIngestUrl object

    Configured centralized logging ingest URL, when present.

    anyOf
    string
    identity object

    Last gathered node identity data.

    anyOf
    modelId (string)
    Default value: Unknown
    chipId (string)
    Default value: Unknown
    friendlyName (string)
    Default value: Unknown
    osVersion (string)
    Default value: Unknown
    osBuildVersion (string)
    Default value: Unknown
    skulkVersion (string)
    Default value: Unknown
    skulkCommit (string)
    Default value: Unknown
    resources object required

    Resource readings.

    gatheredMemory object

    Last event-sourced memory reading for this node.

    anyOf
    ramTotal object required
    inBytes (integer)
    Default value: 0
    ramAvailable object required
    inBytes (integer)
    Default value: 0
    swapTotal object required
    inBytes (integer)
    Default value: 0
    swapAvailable object required
    inBytes (integer)
    Default value: 0
    currentMemory object

    Live memory reading from the API process.

    anyOf
    ramTotal object required
    inBytes (integer)
    Default value: 0
    ramAvailable object required
    inBytes (integer)
    Default value: 0
    swapTotal object required
    inBytes (integer)
    Default value: 0
    swapAvailable object required
    inBytes (integer)
    Default value: 0
    currentWired object

    Live OS-level wired (unpageable) memory in use (macOS only). The value is read locally on this endpoint and is deliberately not carried on the gossiped MemoryUsage, whose schema rides events validated with extra=forbid. It is used to detect wired memory leaked after an abnormal Metal termination (#239).

    anyOf
    inBytes (integer)
    Default value: 0
    disk object

    Last event-sourced disk reading for this node.

    anyOf
    total object required
    inBytes (integer)
    Default value: 0
    available object required
    inBytes (integer)
    Default value: 0
    system object

    Last event-sourced system performance reading.

    anyOf
    gpuUsage (number)
    Default value: 0
    temp (number)
    Default value: 0
    sysPower (number)
    Default value: 0
    pcpuUsage (number)
    Default value: 0
    ecpuUsage (number)
    Default value: 0
    accelerator object
    anyOf
    vendor (string)

    Possible values: [apple, amd, nvidia, intel, cpu, unknown]

    Default value: unknown
    name (string)
    Default value: Unknown
    utilizationRatio object
    anyOf
    number
    vramTotalBytes object
    anyOf
    integer
    vramUsedBytes object
    anyOf
    integer
    gttTotalBytes object

    GPU-mappable host (GTT) memory, for unified-memory APUs (e.g. AMD Strix Halo). On such a node the GPU addresses system RAM beyond the BIOS VRAM carve-out through GTT, so the usable GPU pool is far larger than vramTotalBytes; placement uses this value to admit large models on a UMA node. Null on discrete GPUs and on collectors that do not report it.

    anyOf
    integer
    powerWatts object
    anyOf
    number
    temperatureCelsius object
    anyOf
    number
    clockMhz object
    anyOf
    integer
    network object

    Last event-sourced network interface reading.

    anyOf
    interfaces object[]
  • Array [
  • name (string) required
    ipAddress (string) required
    interfaceType (string)

    Possible values: [wifi, ethernet, maybe_ethernet, thunderbolt, unknown]

    Default value: unknown
  • ]
  • processes object[]

    Relevant local OS processes.

  • Array [
  • pid (integer) required

    Operating-system process ID.

    parentPid object

    Operating-system parent process ID, when visible.

    anyOf
    integer
    role (string) required

    Best-effort Skulk role inferred from process lineage and command line.

    Possible values: [skulk, runner, vector, python, other]

    command (string) required

    Joined process command line.

    status object

    Operating-system process status such as running or sleeping.

    anyOf
    string
    cpuPercent object

    Recent CPU percentage reported by psutil.

    anyOf
    number
    memoryPercent object

    Percent of physical memory used by this process.

    anyOf
    number
    rss object

    Resident set size for this process, when available.

    anyOf
    inBytes (integer)
    Default value: 0
    elapsedSeconds object

    Seconds since process creation, when available.

    anyOf
    number
    isChildOfSkulk (boolean)

    Whether this process is in the current Skulk API process tree.

    Default value: false
  • ]
  • supervisorRunners object[]

    Live local runner-supervisor diagnostics.

  • Array [
  • runnerId (string) required

    Runner ID.

    instanceId (string) required

    Instance ID.

    nodeId (string) required

    Node ID that owns this runner.

    modelId (string) required

    Model assigned to this runner.

    deviceRank (integer) required

    Distributed device rank.

    worldSize (integer) required

    Distributed world size.

    startLayer (integer) required

    Inclusive first model layer on this shard.

    endLayer (integer) required

    Exclusive final model layer on this shard.

    nLayers (integer) required

    Total number of model layers.

    pid object

    Runner subprocess PID, when started.

    anyOf
    integer
    processAlive (boolean) required

    Whether the runner subprocess is alive.

    exitCode object

    Runner subprocess exit code, when exited.

    anyOf
    integer
    statusKind (string) required

    Current runner status variant.

    statusSince (string) required

    UTC timestamp for the current status.

    secondsInStatus (number) required

    Wall-clock seconds spent in the current runner status.

    phase (string) required

    Last runner phase reported.

    Possible values: [created, idle, connect_group, load_model, warmup, task_submission, task_agreement, prompt_build, vision_preprocess, kv_cache_lookup, prefill_barrier, prefill_pipeline, prefill_stream, decode_barrier, decode_wait_first_token, decode_stream, parser, cancel_requested, cancel_observed, completion, error, shutdown_cleanup]

    phaseStartedAt (string) required

    UTC timestamp when the current phase started.

    secondsInPhase (number) required

    Wall-clock seconds spent in the current phase.

    lastProgressAt object

    UTC timestamp for the last flight-recorder update.

    anyOf
    string
    activeTaskId object

    Task ID associated with the current phase, when known.

    anyOf
    string
    activeCommandId object

    Command ID associated with the current phase, when known.

    anyOf
    string
    phaseDetail object

    Compact human-readable detail for the current phase.

    anyOf
    string
    lastMlxMemory object

    Most recent MLX memory snapshot reported by the runner.

    anyOf
    generatedAt (string) required

    UTC timestamp when the snapshot was taken.

    active object

    Currently active MLX memory, when the runtime exposes it.

    anyOf
    inBytes (integer)
    Default value: 0
    cache object

    MLX cache memory, when the runtime exposes it.

    anyOf
    inBytes (integer)
    Default value: 0
    peak object

    Peak MLX memory since the last reset, when available.

    anyOf
    inBytes (integer)
    Default value: 0
    wiredLimit object

    Configured MLX wired memory limit when known. Current MLX releases do not expose a getter on all platforms, so this may be null.

    anyOf
    inBytes (integer)
    Default value: 0
    source (string) required

    Runtime module that supplied the measurement, such as mlx.core.

    flightRecorder object[]

    Last 128 local-only runner diagnostic events.

  • Array [
  • at (string) required

    UTC timestamp when the runner emitted the update.

    phase (string) required

    Runner phase at this entry.

    Possible values: [created, idle, connect_group, load_model, warmup, task_submission, task_agreement, prompt_build, vision_preprocess, kv_cache_lookup, prefill_barrier, prefill_pipeline, prefill_stream, decode_barrier, decode_wait_first_token, decode_stream, parser, cancel_requested, cancel_observed, completion, error, shutdown_cleanup]

    event (string) required

    Short event name within the phase.

    detail object

    Compact human-readable detail for diagnostics.

    anyOf
    string
    attrs object

    Structured low-cardinality diagnostic attributes.

    property name* object
    anyOf
    string
    context object required

    Stable runner identity fields for this entry.

    nodeId (string) required

    Node ID that owns this runner.

    runnerId (string) required

    Runner ID.

    pid object

    Runner subprocess PID.

    anyOf
    integer
    instanceId (string) required

    Instance ID.

    modelId (string) required

    Model assigned to this runner.

    rank (integer) required

    Distributed rank for this runner.

    worldSize (integer) required

    Distributed world size.

    startLayer (integer) required

    Inclusive first layer on this shard.

    endLayer (integer) required

    Exclusive final layer on this shard.

    nLayers (integer) required

    Total model layers.

    taskId object

    Task ID associated with the entry, when known.

    anyOf
    string
    commandId object

    Command ID associated with the entry, when known.

    anyOf
    string
    mlxMemory object

    MLX memory snapshot captured with this entry, when present.

    anyOf
    generatedAt (string) required

    UTC timestamp when the snapshot was taken.

    active object

    Currently active MLX memory, when the runtime exposes it.

    anyOf
    inBytes (integer)
    Default value: 0
    cache object

    MLX cache memory, when the runtime exposes it.

    anyOf
    inBytes (integer)
    Default value: 0
    peak object

    Peak MLX memory since the last reset, when available.

    anyOf
    inBytes (integer)
    Default value: 0
    wiredLimit object

    Configured MLX wired memory limit when known. Current MLX releases do not expose a getter on all platforms, so this may be null.

    anyOf
    inBytes (integer)
    Default value: 0
    source (string) required

    Runtime module that supplied the measurement, such as mlx.core.

  • ]
  • pendingTaskIds string[]

    Tasks sent to the supervisor but not acknowledged by the runner.

    inProgressTasks object[]

    Tasks currently known as in progress by the supervisor.

  • Array [
  • taskId (string) required

    Skulk task ID.

    taskKind (string) required

    Concrete task model name.

    taskStatus (string) required

    Current event-sourced task status.

    instanceId (string) required

    Instance associated with the task.

    commandId object

    External command ID for user-facing inference tasks.

    anyOf
    string
    runnerId object

    Runner assigned to the task, if known.

    anyOf
    string
    modelId object

    Model associated with the task, if known.

    anyOf
    string
  • ]
  • completedTaskCount (integer) required

    Number of tasks completed by this supervisor.

    cancelledTaskIds string[]

    Task IDs cancelled through this supervisor.

    lastTaskSentAt object

    UTC timestamp for the last task submitted to the runner.

    anyOf
    string
    lastEventReceivedAt object

    UTC timestamp for the last event received from the runner.

    anyOf
    string
    lastEventType object

    Class name of the last event received from the runner.

    anyOf
    string
    milestones object[]

    Recent lifecycle milestones retained by the supervisor.

  • Array [
  • at (string) required

    UTC timestamp when the milestone was recorded.

    name (string) required

    Short milestone name.

    detail object

    Optional compact detail for the milestone.

    anyOf
    string
  • ]
  • ]
  • placements object[]

    Event-sourced placement analysis for current instances.

  • Array [
  • instanceId (string) required

    Instance ID.

    modelId (string) required

    Placed model ID.

    masterNodeId object

    Current master node ID, when known.

    anyOf
    string
    masterIsPlacementNode (boolean) required

    Whether the current master is part of this model placement.

    localNodeIsPlacementNode (boolean) required

    Whether the API node is part of this model placement.

    placementNodeIds string[]

    Node IDs participating in the placement.

    runners object[]

    Per-runner placement details.

  • Array [
  • runnerId (string) required

    Runner ID.

    nodeId (string) required

    Node ID assigned to this runner.

    friendlyName object

    Friendly node name, when known.

    anyOf
    string
    statusKind object

    Current event-sourced runner status variant.

    anyOf
    string
    deviceRank (integer) required

    Distributed device rank.

    worldSize (integer) required

    Distributed world size.

    startLayer (integer) required

    Inclusive first model layer on this shard.

    endLayer (integer) required

    Exclusive final model layer on this shard.

    nLayers (integer) required

    Total number of model layers.

    isLocal (boolean) required

    Whether this assignment is on the API node.

    isMaster (boolean) required

    Whether this assignment is on the master node.

    tasks object[]

    Event-sourced tasks associated with this runner assignment.

  • Array [
  • taskId (string) required

    Skulk task ID.

    taskKind (string) required

    Concrete task model name.

    taskStatus (string) required

    Current event-sourced task status.

    instanceId (string) required

    Instance associated with the task.

    commandId object

    External command ID for user-facing inference tasks.

    anyOf
    string
    runnerId object

    Runner assigned to the task, if known.

    anyOf
    string
    modelId object

    Model associated with the task, if known.

    anyOf
    string
  • ]
  • ]
  • warnings string[]

    Heuristic warnings that may help explain a stuck placement.

  • ]
  • warnings string[]

    Top-level diagnostic warnings for this node.

    error object

    Collection error when ok is false.

    anyOf
    string
  • ]
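The startLayer and endLayer fields describe a half-open range: startLayer is inclusive and endLayer is exclusive. The following is a minimal Python sketch, not part of Skulk, of checking that the runners of one placement cover every layer. It assumes pipeline-style sharding in which each layer belongs to exactly one runner; with other sharding strategies, overlapping ranges may be expected. The helper names `shard_layer_count` and `find_layer_gaps` and the sample runner IDs are hypothetical.

```python
from typing import Any


def shard_layer_count(runner: dict[str, Any]) -> int:
    """Number of layers on a shard; endLayer is exclusive, so no +1."""
    return runner["endLayer"] - runner["startLayer"]


def find_layer_gaps(runners: list[dict[str, Any]]) -> list[str]:
    """Return human-readable problems found in a placement's layer ranges."""
    if not runners:
        return ["placement has no runners"]
    problems: list[str] = []
    n_layers = runners[0]["nLayers"]
    expected_start = 0
    # Walk shards in layer order and compare each start with the previous end.
    for runner in sorted(runners, key=lambda r: r["startLayer"]):
        start, end = runner["startLayer"], runner["endLayer"]
        if start > expected_start:
            problems.append(f"layers {expected_start}..{start - 1} are unassigned")
        elif start < expected_start:
            problems.append(f"runner {runner['runnerId']} overlaps an earlier shard")
        expected_start = max(expected_start, end)
    if expected_start < n_layers:
        problems.append(f"layers {expected_start}..{n_layers - 1} are unassigned")
    return problems


# Hypothetical two-runner placement of a 32-layer model.
example_runners = [
    {"runnerId": "r0", "startLayer": 0, "endLayer": 16, "nLayers": 32},
    {"runnerId": "r1", "startLayer": 16, "endLayer": 32, "nLayers": 32},
]
print(shard_layer_count(example_runners[0]))
print(find_layer_gaps(example_runners))
```

Because endLayer is exclusive, adjacent shards share a boundary value (16 above) without sharing a layer.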
  • ClusterDiagnostics
    {
      "generatedAt": "string",
      "localNodeId": "string",
      "masterNodeId": "string",
      "nodes": [
        {
          "nodeId": "string",
          "url": "string",
          "ok": true,
          "diagnostics": {
            "generatedAt": "string",
            "runtime": {
              "nodeId": "string",
              "hostname": "string",
              "friendlyName": "string",
              "isMaster": true,
              "masterNodeId": "string",
              "cwd": "string",
              "configPath": "string",
              "configFileExists": true,
              "skulkVersion": "string",
              "skulkCommit": "string",
              "libp2PNamespace": "string",
              "pythonUnbuffered": true,
              "tracingEnabled": true,
              "structuredLoggingConfigured": true,
              "loggingIngestUrl": "string"
            },
            "identity": {
              "modelId": "Unknown",
              "chipId": "Unknown",
              "friendlyName": "Unknown",
              "osVersion": "Unknown",
              "osBuildVersion": "Unknown",
              "skulkVersion": "Unknown",
              "skulkCommit": "Unknown"
            },
            "resources": {
              "gatheredMemory": {
                "ramTotal": { "inBytes": 0 },
                "ramAvailable": { "inBytes": 0 },
                "swapTotal": { "inBytes": 0 },
                "swapAvailable": { "inBytes": 0 }
              },
              "currentMemory": {
                "ramTotal": { "inBytes": 0 },
                "ramAvailable": { "inBytes": 0 },
                "swapTotal": { "inBytes": 0 },
                "swapAvailable": { "inBytes": 0 }
              },
              "currentWired": { "inBytes": 0 },
              "disk": {
                "total": { "inBytes": 0 },
                "available": { "inBytes": 0 }
              },
              "system": {
                "gpuUsage": 0,
                "temp": 0,
                "sysPower": 0,
                "pcpuUsage": 0,
                "ecpuUsage": 0,
                "accelerator": {
                  "vendor": "unknown",
                  "name": "Unknown",
                  "utilizationRatio": 0,
                  "vramTotalBytes": 0,
                  "vramUsedBytes": 0,
                  "gttTotalBytes": 0,
                  "powerWatts": 0,
                  "temperatureCelsius": 0,
                  "clockMhz": 0
                }
              },
              "network": {
                "interfaces": [
                  {
                    "name": "string",
                    "ipAddress": "string",
                    "interfaceType": "unknown"
                  }
                ]
              }
            },
            "processes": [
              {
                "pid": 0,
                "parentPid": 0,
                "role": "skulk",
                "command": "string",
                "status": "string",
                "cpuPercent": 0,
                "memoryPercent": 0,
                "rss": { "inBytes": 0 },
                "elapsedSeconds": 0,
                "isChildOfSkulk": false
              }
            ],
            "supervisorRunners": [
              {
                "runnerId": "string",
                "instanceId": "string",
                "nodeId": "string",
                "modelId": "string",
                "deviceRank": 0,
                "worldSize": 0,
                "startLayer": 0,
                "endLayer": 0,
                "nLayers": 0,
                "pid": 0,
                "processAlive": true,
                "exitCode": 0,
                "statusKind": "string",
                "statusSince": "string",
                "secondsInStatus": 0,
                "phase": "created",
                "phaseStartedAt": "string",
                "secondsInPhase": 0,
                "lastProgressAt": "string",
                "activeTaskId": "string",
                "activeCommandId": "string",
                "phaseDetail": "string",
                "lastMlxMemory": {
                  "generatedAt": "string",
                  "active": { "inBytes": 0 },
                  "cache": { "inBytes": 0 },
                  "peak": { "inBytes": 0 },
                  "wiredLimit": { "inBytes": 0 },
                  "source": "string"
                },
                "flightRecorder": [
                  {
                    "at": "string",
                    "phase": "created",
                    "event": "string",
                    "detail": "string",
                    "attrs": {},
                    "context": {
                      "nodeId": "string",
                      "runnerId": "string",
                      "pid": 0,
                      "instanceId": "string",
                      "modelId": "string",
                      "rank": 0,
                      "worldSize": 0,
                      "startLayer": 0,
                      "endLayer": 0,
                      "nLayers": 0
                    },
                    "taskId": "string",
                    "commandId": "string",
                    "mlxMemory": {
                      "generatedAt": "string",
                      "active": { "inBytes": 0 },
                      "cache": { "inBytes": 0 },
                      "peak": { "inBytes": 0 },
                      "wiredLimit": { "inBytes": 0 },
                      "source": "string"
                    }
                  }
                ],
                "pendingTaskIds": [
                  "string"
                ],
                "inProgressTasks": [
                  {
                    "taskId": "string",
                    "taskKind": "string",
                    "taskStatus": "string",
                    "instanceId": "string",
                    "commandId": "string",
                    "runnerId": "string",
                    "modelId": "string"
                  }
                ],
                "completedTaskCount": 0,
                "cancelledTaskIds": [
                  "string"
                ],
                "lastTaskSentAt": "string",
                "lastEventReceivedAt": "string",
                "lastEventType": "string",
                "milestones": [
                  {
                    "at": "string",
                    "name": "string",
                    "detail": "string"
                  }
                ]
              }
            ],
            "placements": [
              {
                "instanceId": "string",
                "modelId": "string",
                "masterNodeId": "string",
                "masterIsPlacementNode": true,
                "localNodeIsPlacementNode": true,
                "placementNodeIds": [
                  "string"
                ],
                "runners": [
                  {
                    "runnerId": "string",
                    "nodeId": "string",
                    "friendlyName": "string",
                    "statusKind": "string",
                    "deviceRank": 0,
                    "worldSize": 0,
                    "startLayer": 0,
                    "endLayer": 0,
                    "nLayers": 0,
                    "isLocal": true,
                    "isMaster": true,
                    "tasks": [
                      {
                        "taskId": "string",
                        "taskKind": "string",
                        "taskStatus": "string",
                        "instanceId": "string",
                        "commandId": "string",
                        "runnerId": "string",
                        "modelId": "string"
                      }
                    ]
                  }
                ],
                "warnings": [
                  "string"
                ]
              }
            ],
            "warnings": [
              "string"
            ]
          },
          "error": "string"
        }
      ]
    }
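
As a rough illustration of consuming a response of this shape, the Python sketch below splits node results into successful and failed collections using ok, diagnostics, and error, and lists runners whose processAlive is false. It assumes the response body has already been parsed into a dict; how the response is fetched is not covered here. The helper name `summarize_bundle` and every value in the trimmed sample (node IDs, timestamp, error text) are hypothetical, and a real response carries more required fields.

```python
from typing import Any


def summarize_bundle(bundle: dict[str, Any]) -> dict[str, list[str]]:
    """Group node IDs by collection outcome and list dead runner IDs."""
    summary: dict[str, list[str]] = {"ok": [], "failed": [], "dead_runners": []}
    for node in bundle.get("nodes", []):
        diagnostics = node.get("diagnostics")
        if not node["ok"] or diagnostics is None:
            # error holds the collection error when ok is false.
            summary["failed"].append(f"{node['nodeId']}: {node.get('error')}")
            continue
        summary["ok"].append(node["nodeId"])
        for runner in diagnostics.get("supervisorRunners", []):
            if not runner["processAlive"]:
                summary["dead_runners"].append(runner["runnerId"])
    return summary


# Trimmed, hypothetical sample; not a complete ClusterDiagnostics document.
sample = {
    "generatedAt": "2024-01-01T00:00:00Z",
    "localNodeId": "node-a",
    "masterNodeId": "node-a",
    "nodes": [
        {
            "nodeId": "node-a",
            "ok": True,
            "diagnostics": {
                "supervisorRunners": [{"runnerId": "r0", "processAlive": False}]
            },
        },
        {"nodeId": "node-b", "ok": False, "error": "timeout"},
    ],
}
print(summarize_bundle(sample))
```

Checking both ok and the presence of diagnostics keeps the loop safe if a node reports success but omits the optional diagnostics object.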