
ClusterTimeline

Cross-rank chronological view of runner activity across the cluster.

Produced by stitching the per-node RunnerSupervisorDiagnostics from every reachable node into one unified shape. The runners list is a per-rank snapshot of where each rank is right now. The timeline list contains every flight-recorder entry across all ranks, merged and sorted by the at timestamp, so that the rank-disagreement signature of a distributed deadlock is visible at a glance.
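The merge described above can be sketched as follows. This is a minimal illustration, not the server's implementation: it assumes the at values are ISO 8601 strings, and the function names and sample data are hypothetical.

```python
from datetime import datetime
from typing import Any


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp (assumed format); 'Z' is rewritten for older Pythons."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def merge_timelines(per_node: dict[str, list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Flatten every node's entries and sort them by `at`, oldest first."""
    merged = [entry for entries in per_node.values() for entry in entries]
    # sorted() is stable, so entries with equal timestamps keep their input order.
    return sorted(merged, key=lambda entry: parse_utc(entry["at"]))


# Hypothetical per-node input: two ranks of one world, sample data only.
per_node = {
    "node-a": [
        {"at": "2024-01-01T00:00:01Z", "deviceRank": 0, "phase": "prefill_barrier"},
        {"at": "2024-01-01T00:00:04Z", "deviceRank": 0, "phase": "decode_barrier"},
    ],
    "node-b": [
        {"at": "2024-01-01T00:00:02Z", "deviceRank": 1, "phase": "prefill_barrier"},
    ],
}

merged = merge_timelines(per_node)
for entry in merged:
    print(entry["at"], entry["deviceRank"], entry["phase"])
```

In this sample, rank 0 moves on to decode_barrier while rank 1 has no entry after prefill_barrier, which is the kind of divergence the merged view is meant to expose.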

generatedAt (string) required

UTC timestamp when the timeline was built.

localNodeId (string) required

Node ID of the API serving this response.

masterNodeId object

Current master node ID, when known.

anyOf
string
runners object[]

Current synopsis for each runner, sorted by (modelId, deviceRank).

  • Array [
  • nodeId (string) required

    Node owning this runner.

    runnerId (string) required

    Runner ID.

    instanceId (string) required

    Instance ID.

    modelId (string) required

    Model assigned to this runner.

    deviceRank (integer) required

    Distributed device rank.

    worldSize (integer) required

    Distributed world size.

    pid object

    Runner subprocess PID, when started.

    anyOf
    integer
    processAlive (boolean) required

    Whether the runner subprocess is alive.

    statusKind (string) required

    Current runner status variant.

    phase (string) required

    Last runner phase reported.

    Possible values: [created, idle, connect_group, load_model, warmup, task_submission, task_agreement, prompt_build, vision_preprocess, kv_cache_lookup, prefill_barrier, prefill_pipeline, prefill_stream, decode_barrier, decode_wait_first_token, decode_stream, parser, cancel_requested, cancel_observed, completion, error, shutdown_cleanup]

    phaseDetail object

    Compact human-readable detail for the current phase.

    anyOf
    string
    secondsInPhase (number) required

    Wall-clock seconds spent in the current phase.

    lastProgressAt object

    UTC timestamp for the last flight-recorder update.

    anyOf
    string
    activeTaskId object

    Task ID associated with the current phase, when known.

    anyOf
    string
    activeCommandId object

    Command ID associated with the current phase, when known.

    anyOf
    string
    lastMlxMemory object

    Most recent MLX memory snapshot reported by the runner.

    anyOf
    generatedAt (string) required

    UTC timestamp when the snapshot was taken.

    active object

    Currently active MLX memory, when the runtime exposes it.

    anyOf
    inBytes (integer)
    Default value: 0
    cache object

    MLX cache memory, when the runtime exposes it.

    anyOf
    inBytes (integer)
    Default value: 0
    peak object

    Peak MLX memory since the last reset, when available.

    anyOf
    inBytes (integer)
    Default value: 0
    wiredLimit object

    Configured MLX wired memory limit when known. Current MLX releases do not expose a getter on all platforms, so this may be null.

    anyOf
    inBytes (integer)
    Default value: 0
    source (string) required

    Runtime module that supplied the measurement, such as mlx.core.

  • ]
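As a hedged sketch of how the runners snapshot might be used, the helper below groups runners and reports groups whose ranks are in different phases. Grouping by instanceId is an assumption (that ranks sharing an instanceId form one distributed group); the function name and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Any


def phase_disagreements(runners: list[dict[str, Any]]) -> dict[str, dict[int, str]]:
    """Return {instanceId: {deviceRank: phase}} for instances whose ranks differ."""
    by_instance: dict[str, dict[int, str]] = defaultdict(dict)
    for runner in runners:
        by_instance[runner["instanceId"]][runner["deviceRank"]] = runner["phase"]
    # Keep only instances where at least two ranks report different phases.
    return {
        instance_id: phases
        for instance_id, phases in by_instance.items()
        if len(set(phases.values())) > 1
    }


# Sample data only; real entries carry all of the fields listed above.
runners = [
    {"instanceId": "inst-1", "deviceRank": 0, "phase": "decode_barrier"},
    {"instanceId": "inst-1", "deviceRank": 1, "phase": "prefill_barrier"},
    {"instanceId": "inst-2", "deviceRank": 0, "phase": "idle"},
    {"instanceId": "inst-2", "deviceRank": 1, "phase": "idle"},
]

stuck = phase_disagreements(runners)
print(stuck)
```

Ranks can disagree briefly during normal operation, so a caller would probably combine this with secondsInPhase before treating a disagreement as a deadlock.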
  • timeline object[]

    Flight-recorder entries from all runners, merged and sorted by the at timestamp in ascending order.

  • Array [
  • at (string) required

    UTC timestamp when the runner emitted the update.

    nodeId (string) required

    Node owning the runner that emitted this entry.

    runnerId (string) required

    Runner ID that emitted this entry.

    deviceRank (integer) required

    Distributed device rank for this entry.

    worldSize (integer) required

    Distributed world size for this entry.

    phase (string) required

    Runner phase at this entry.

    Possible values: [created, idle, connect_group, load_model, warmup, task_submission, task_agreement, prompt_build, vision_preprocess, kv_cache_lookup, prefill_barrier, prefill_pipeline, prefill_stream, decode_barrier, decode_wait_first_token, decode_stream, parser, cancel_requested, cancel_observed, completion, error, shutdown_cleanup]

    event (string) required

    Short event name within the phase.

    detail object

    Compact human-readable detail for diagnostics.

    anyOf
    string
    attrs object

    Structured low-cardinality diagnostic attributes.

    property name* object
    anyOf
    string
    taskId object

    Task ID associated with the entry, when known.

    anyOf
    string
    commandId object

    Command ID associated with the entry, when known.

    anyOf
    string
    mlxMemory object

    MLX memory snapshot captured with this entry, when present.

    anyOf
    generatedAt (string) required

    UTC timestamp when the snapshot was taken.

    active object

    Currently active MLX memory, when the runtime exposes it.

    anyOf
    inBytes (integer)
    Default value: 0
    cache object

    MLX cache memory, when the runtime exposes it.

    anyOf
    inBytes (integer)
    Default value: 0
    peak object

    Peak MLX memory since the last reset, when available.

    anyOf
    inBytes (integer)
    Default value: 0
    wiredLimit object

    Configured MLX wired memory limit when known. Current MLX releases do not expose a getter on all platforms, so this may be null.

    anyOf
    inBytes (integer)
    Default value: 0
    source (string) required

    Runtime module that supplied the measurement, such as mlx.core.

  • ]
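Because the timeline is sorted by at ascending, the last entry seen for each runner is its most recent one. The sketch below relies on that ordering to find where each runner stopped for a given task. The function name, the "enter" event name, and the sample IDs are hypothetical.

```python
from typing import Any, Optional


def last_entry_per_runner(
    timeline: list[dict[str, Any]], task_id: Optional[str] = None
) -> dict[str, dict[str, Any]]:
    """Return {runnerId: most recent entry}, optionally restricted to one task."""
    latest: dict[str, dict[str, Any]] = {}
    for entry in timeline:
        if task_id is not None and entry.get("taskId") != task_id:
            continue
        # The input is ascending by `at`, so later entries overwrite earlier ones.
        latest[entry["runnerId"]] = entry
    return latest


# Sample data only; "enter" is a made-up event name.
timeline = [
    {"at": "2024-01-01T00:00:01Z", "runnerId": "r0", "deviceRank": 0,
     "phase": "prefill_barrier", "event": "enter", "taskId": "t1"},
    {"at": "2024-01-01T00:00:02Z", "runnerId": "r1", "deviceRank": 1,
     "phase": "prefill_barrier", "event": "enter", "taskId": "t1"},
    {"at": "2024-01-01T00:00:03Z", "runnerId": "r0", "deviceRank": 0,
     "phase": "decode_barrier", "event": "enter", "taskId": "t1"},
    {"at": "2024-01-01T00:00:04Z", "runnerId": "r1", "deviceRank": 1,
     "phase": "idle", "event": "enter", "taskId": None},
]

latest = last_entry_per_runner(timeline, task_id="t1")
for runner_id, entry in latest.items():
    print(runner_id, entry["deviceRank"], entry["phase"], entry["at"])
```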
  • unreachableNodes object[]

    Peer nodes that could not be reached for this timeline.

  • Array [
  • nodeId (string) required

    Node ID for the unreachable peer.

    url object

    Peer API URL that was attempted, if known.

    anyOf
    string
    error (string) required

    Reason the peer was unreachable.

  • ]
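A timeline with unreachable peers is incomplete, so it may help to check which ranks are absent before drawing conclusions. The sketch below assumes ranks are numbered 0 through worldSize - 1 within an instanceId, which this schema does not state. The helper names and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Any


def missing_ranks(response: dict[str, Any]) -> dict[str, list[int]]:
    """Return {instanceId: ranks with no runner in the response} (assumes ranks 0..worldSize-1)."""
    seen: dict[str, set[int]] = defaultdict(set)
    world: dict[str, int] = {}
    for runner in response.get("runners", []):
        seen[runner["instanceId"]].add(runner["deviceRank"])
        world[runner["instanceId"]] = runner["worldSize"]
    gaps = {iid: sorted(set(range(world[iid])) - ranks) for iid, ranks in seen.items()}
    return {iid: ranks for iid, ranks in gaps.items() if ranks}


def unreachable_summary(response: dict[str, Any]) -> list[str]:
    """One line per unreachable peer; url may be absent or null."""
    return [
        f"{node['nodeId']} ({node.get('url') or 'url unknown'}): {node['error']}"
        for node in response.get("unreachableNodes", [])
    ]


# Sample data only.
response = {
    "runners": [{"instanceId": "inst-1", "deviceRank": 0, "worldSize": 2}],
    "unreachableNodes": [{"nodeId": "node-b", "url": None, "error": "connection refused"}],
}

print(missing_ranks(response))
print(unreachable_summary(response))
```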
  • ClusterTimeline
    {
      "generatedAt": "string",
      "localNodeId": "string",
      "masterNodeId": "string",
      "runners": [
        {
          "nodeId": "string",
          "runnerId": "string",
          "instanceId": "string",
          "modelId": "string",
          "deviceRank": 0,
          "worldSize": 0,
          "pid": 0,
          "processAlive": true,
          "statusKind": "string",
          "phase": "created",
          "phaseDetail": "string",
          "secondsInPhase": 0,
          "lastProgressAt": "string",
          "activeTaskId": "string",
          "activeCommandId": "string",
          "lastMlxMemory": {
            "generatedAt": "string",
            "active": {
              "inBytes": 0
            },
            "cache": {
              "inBytes": 0
            },
            "peak": {
              "inBytes": 0
            },
            "wiredLimit": {
              "inBytes": 0
            },
            "source": "string"
          }
        }
      ],
      "timeline": [
        {
          "at": "string",
          "nodeId": "string",
          "runnerId": "string",
          "deviceRank": 0,
          "worldSize": 0,
          "phase": "created",
          "event": "string",
          "detail": "string",
          "attrs": {},
          "taskId": "string",
          "commandId": "string",
          "mlxMemory": {
            "generatedAt": "string",
            "active": {
              "inBytes": 0
            },
            "cache": {
              "inBytes": 0
            },
            "peak": {
              "inBytes": 0
            },
            "wiredLimit": {
              "inBytes": 0
            },
            "source": "string"
          }
        }
      ],
      "unreachableNodes": [
        {
          "nodeId": "string",
          "url": "string",
          "error": "string"
        }
      ]
    }
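The memory fields in the example above are nested objects that may be missing or null, so reading them benefits from a small guard. The helpers below are a hypothetical sketch for pulling a byte count out of a lastMlxMemory or mlxMemory snapshot and formatting it; they are not part of the API.

```python
from typing import Any, Optional


def mlx_bytes(snapshot: Optional[dict[str, Any]], field: str) -> Optional[int]:
    """Return snapshot[field]['inBytes'], or None when the snapshot or field is absent/null."""
    if not snapshot:
        return None
    value = snapshot.get(field)
    if value is None:
        return None
    # inBytes has a documented default of 0.
    return value.get("inBytes", 0)


def format_bytes(n: Optional[int]) -> str:
    """Render a byte count with binary units, or 'n/a' when unknown."""
    if n is None:
        return "n/a"
    size = float(n)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


# Sample snapshot only; wiredLimit is null here, as the schema allows.
snapshot = {
    "generatedAt": "2024-01-01T00:00:00Z",
    "active": {"inBytes": 3 * 1024**3},
    "cache": {"inBytes": 512},
    "peak": {},
    "wiredLimit": None,
    "source": "mlx.core",
}

for field in ("active", "cache", "peak", "wiredLimit"):
    print(field, format_bytes(mlx_bytes(snapshot, field)))
```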