ClusterTimeline
Cross-rank chronological view of runner activity across the cluster.
Produced by stitching the per-node RunnerSupervisorDiagnostics from
every reachable node into one unified shape. The runners list gives
a per-rank "where is each rank right now" snapshot; the timeline
list gives every flight-recorder entry across all ranks, merged and
sorted by at so a distributed deadlock's rank-disagreement signature
is visible at a glance.
UTC timestamp when the timeline was built.
Node ID of the API serving this response.
masterNodeId object
Current master node ID, when known.
- string
- null
runners object[]
Current synopsis for each runner, sorted by (model_id, rank).
Node owning this runner.
Runner ID.
Instance ID.
Model assigned to this runner.
Distributed device rank.
Distributed world size.
pid object
Runner subprocess PID, when started.
- integer
- null
Whether the runner subprocess is alive.
Current runner status variant.
Last runner phase reported.
Possible values: [created, idle, connect_group, load_model, warmup, task_submission, task_agreement, prompt_build, vision_preprocess, kv_cache_lookup, prefill_barrier, prefill_pipeline, prefill_stream, decode_barrier, decode_wait_first_token, decode_stream, parser, cancel_requested, cancel_observed, completion, error, shutdown_cleanup]
phaseDetail object
Compact human-readable detail for the current phase.
- string
- null
Wall-clock seconds spent in the current phase.
lastProgressAt object
UTC timestamp for the last flight-recorder update.
- string
- null
activeTaskId object
Task ID associated with the current phase, when known.
- string
- null
activeCommandId object
Command ID associated with the current phase, when known.
- string
- null
lastMlxMemory object
Most recent MLX memory snapshot reported by the runner.
- MlxMemorySnapshot
- null
UTC timestamp when the snapshot was taken.
active object
Currently active MLX memory, when the runtime exposes it.
- Memory
- null
0cache object
MLX cache memory, when the runtime exposes it.
- Memory
- null
0peak object
Peak MLX memory since the last reset, when available.
- Memory
- null
0wiredLimit object
Configured MLX wired memory limit when known. Current MLX releases do not expose a getter on all platforms, so this may be null.
- Memory
- null
0Runtime module that supplied the measurement, such as mlx.core.
timeline object[]
Flight-recorder entries from all runners, merged and sorted by at ascending.
UTC timestamp when the runner emitted the update.
Node owning the runner that emitted this entry.
Runner ID that emitted this entry.
Distributed device rank for this entry.
Distributed world size for this entry.
Runner phase at this entry.
Possible values: [created, idle, connect_group, load_model, warmup, task_submission, task_agreement, prompt_build, vision_preprocess, kv_cache_lookup, prefill_barrier, prefill_pipeline, prefill_stream, decode_barrier, decode_wait_first_token, decode_stream, parser, cancel_requested, cancel_observed, completion, error, shutdown_cleanup]
Short event name within the phase.
detail object
Compact human-readable detail for diagnostics.
- string
- null
attrs object
Structured low-cardinality diagnostic attributes.
property name* object
- string
- integer
- number
- boolean
- string[]
taskId object
Task ID associated with the entry, when known.
- string
- null
commandId object
Command ID associated with the entry, when known.
- string
- null
mlxMemory object
MLX memory snapshot captured with this entry, when present.
- MlxMemorySnapshot
- null
UTC timestamp when the snapshot was taken.
active object
Currently active MLX memory, when the runtime exposes it.
- Memory
- null
0cache object
MLX cache memory, when the runtime exposes it.
- Memory
- null
0peak object
Peak MLX memory since the last reset, when available.
- Memory
- null
0wiredLimit object
Configured MLX wired memory limit when known. Current MLX releases do not expose a getter on all platforms, so this may be null.
- Memory
- null
0Runtime module that supplied the measurement, such as mlx.core.
unreachableNodes object[]
Peer nodes that could not be reached for this timeline.
Node ID for the unreachable peer.
url object
Peer API URL that was attempted, if known.
- string
- null
Reason the peer was unreachable.
{
"generatedAt": "string",
"localNodeId": "string",
"masterNodeId": "string",
"runners": [
{
"nodeId": "string",
"runnerId": "string",
"instanceId": "string",
"modelId": "string",
"deviceRank": 0,
"worldSize": 0,
"pid": 0,
"processAlive": true,
"statusKind": "string",
"phase": "created",
"phaseDetail": "string",
"secondsInPhase": 0,
"lastProgressAt": "string",
"activeTaskId": "string",
"activeCommandId": "string",
"lastMlxMemory": {
"generatedAt": "string",
"active": {
"inBytes": 0
},
"cache": {
"inBytes": 0
},
"peak": {
"inBytes": 0
},
"wiredLimit": {
"inBytes": 0
},
"source": "string"
}
}
],
"timeline": [
{
"at": "string",
"nodeId": "string",
"runnerId": "string",
"deviceRank": 0,
"worldSize": 0,
"phase": "created",
"event": "string",
"detail": "string",
"attrs": {},
"taskId": "string",
"commandId": "string",
"mlxMemory": {
"generatedAt": "string",
"active": {
"inBytes": 0
},
"cache": {
"inBytes": 0
},
"peak": {
"inBytes": 0
},
"wiredLimit": {
"inBytes": 0
},
"source": "string"
}
}
],
"unreachableNodes": [
{
"nodeId": "string",
"url": "string",
"error": "string"
}
]
}