GPU Observability Agent

Automated Root-Cause Diagnosis for GPU Training

ApexData reads your training telemetry and produces a grounded root-cause verdict — what failed, why, and what to do — without instrumenting your code.

20+ failure modes
No SDK or code changes
Verdict in minutes
Slurm-scheduled clusters

Book a Demo · See sample verdict

ApexData Mission Control — live fleet view with AI verdicts

What slow root-cause analysis costs

From a day of guessing to a verdict in minutes

Manual, today

Up to 1 day

to trace a slow or failed training run by hand — across logs, metrics, and the scheduler.

≈ $10,000 of wasted GPU time on a 128-GPU cluster while you investigate.

With ApexData

30 minutes

to a grounded root-cause verdict — what failed, why, and what to do — cited to the exact metric and log.

Assumes ~$3 per GPU-hour on-demand: 128 GPUs × 24 h × $3 ≈ $9.2K, rounded to ≈ $10K of compute per day.
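The cost comparison above is simple arithmetic, worked through here as a short sketch. The cluster size and hourly rate are the page's own stated assumptions; the function name is illustrative only.

```python
# Worked arithmetic for the cost figures above.
GPUS = 128                      # cluster size used in the example
RATE_USD_PER_GPU_HOUR = 3.0     # assumed on-demand price per GPU-hour


def investigation_cost(hours: float) -> float:
    """Compute the spend on the whole allocation while a run is being diagnosed."""
    return GPUS * RATE_USD_PER_GPU_HOUR * hours


manual_cost = investigation_cost(24)    # up to one day by hand
apex_cost = investigation_cost(0.5)     # 30 minutes to a verdict

print(f"1 day:  ${manual_cost:,.0f}")   # 1 day:  $9,216
print(f"30 min: ${apex_cost:,.0f}")     # 30 min: $192
```

The one-day figure comes to $9,216, which the page rounds to ≈ $10K.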

How it works

Zero instrumentation. Evidence from every layer.

A lightweight on-node agent collects across the stack — no SDK, no code changes, no special logging sink.

📊 GPU telemetry

Memory & allocation curve, engine-active vs SM-active duty cycle, XID / ECC / thermal health.

⚡ CUDA & NCCL tracing (eBPF)

Kernel-launch counts, API durations, in-flight collectives — separates an idle hang from real compute.

📝 Job logs

Loss trajectory, framework progress, Python tracebacks — attributed by rank, node, and process.

🗄 Slurm scheduler records

Job & step lifecycle, per-node up/down/drain, exit codes — a node failure isn’t read as a code fault.

No single signal is trusted alone — the verdict is only as strong as the evidence that agrees across stores.
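To make the cross-evidence idea concrete, here is a minimal sketch of how signals from the four sources might be combined. It is not ApexData's actual logic: the `Window` fields, the labels, and the rules are hypothetical and chosen only to show a verdict that requires agreement and otherwise stays undecided.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Window:
    """Hypothetical evidence for one job over one observation window."""
    kernel_launches: int        # CUDA tracing: kernels launched in the window
    inflight_collectives: int   # NCCL tracing: collectives started but not finished
    new_log_lines: int          # job logs: lines written in the window
    nodes_up: bool              # scheduler records: every allocated node is up


def classify(w: Window) -> Tuple[str, List[str]]:
    """Return a label plus the evidence that supports it."""
    if not w.nodes_up:
        # A scheduler-reported node failure is an infrastructure event, not a code fault.
        return "infrastructure-event", ["scheduler reports a node down"]
    if w.kernel_launches > 0 and w.new_log_lines > 0:
        return "making-progress", ["kernels launching", "logs advancing"]
    if w.kernel_launches == 0 and w.new_log_lines == 0 and w.inflight_collectives > 0:
        # Three independent stores agree, so the hang verdict is supported.
        return "nccl-hang", ["no kernel launches", "no log progress", "collective in flight"]
    # Signals disagree or are too weak: say so rather than guess.
    return "undecided", ["signals do not agree"]


print(classify(Window(0, 1, 0, True))[0])      # nccl-hang
print(classify(Window(500, 1, 12, True))[0])   # making-progress
```

A window with no kernel launches but no in-flight collective falls through to "undecided", because idleness alone does not identify a cause.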

A real verdict

A verdict you can act on — with the evidence

confidence 97% · nccl-hang

“Job 2260 is currently hung inside an NCCL collective and will not recover without intervention.”

Every claim cited to a metric and log.
Recommended action, not just a diagnosis.
Names what it cannot decide — no guessing.

ML Investigation — grounded verdict with cited evidence

What we catch

We catch the runs other tools call “healthy”

The dangerous failures are “done-but-bad”: the job completes cleanly, so infrastructure tools read green. ApexData judges on positive evidence across seven families, four of which are shown here.

Throughput & utilization

GPU under-utilization, I/O save-storms, stragglers — one slow rank drags the whole multi-GPU allocation.

Training correctness

Loss divergence, silent resume-from-scratch (checkpoint never loaded), undertrained runs.

Cluster & scheduler

Node-down mid-run, requeue / preemption, timeout — a job fault vs. an infrastructure event.

Healthy baseline

Confirms a genuinely healthy run on positive evidence, not on the absence of errors.

+ 3 more families covered (Startup & config · GPU memory & hardware · Distributed coordination) — see capabilities PDF.
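The job-fault versus infrastructure-event distinction in the Cluster & scheduler family can be illustrated with Slurm's own job states. The state names below are standard Slurm accounting states; the grouping into categories and the `triage` helper are assumptions made for illustration, not ApexData's classification.

```python
# Illustrative grouping of Slurm job states (as reported by `sacct`).
INFRA_STATES = {"NODE_FAIL", "BOOT_FAIL", "PREEMPTED", "REQUEUED"}
JOB_FAULT_STATES = {"FAILED", "OUT_OF_MEMORY"}
LIMIT_STATES = {"TIMEOUT", "DEADLINE"}


def triage(state: str) -> str:
    """Map a raw Slurm state string to a coarse first-pass category."""
    tokens = state.split()
    if not tokens:
        return "unknown"
    # sacct may print "CANCELLED by <uid>" or append "+" when a field is truncated.
    base = tokens[0].rstrip("+").upper()
    if base in INFRA_STATES:
        return "infrastructure-event"
    if base in JOB_FAULT_STATES:
        return "job-fault"
    if base in LIMIT_STATES:
        return "time-limit"
    if base == "COMPLETED":
        # A clean exit is not proof of a good run; it still needs positive evidence.
        return "completed-unverified"
    return "unknown"


for s in ["NODE_FAIL", "FAILED", "TIMEOUT", "COMPLETED", "CANCELLED by 1001"]:
    print(f"{s:20s} -> {triage(s)}")
```

Note that COMPLETED maps to "completed-unverified" rather than "healthy", which mirrors the "done-but-bad" point: the scheduler state alone cannot confirm a good run.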

Framework support

Frameworks & training stacks we recognize

The agent reads loss, gradient norm, learning rate, and checkpoints directly from each job's standard output, via a registry of framework-specific parsers.

Hugging Face Transformers

Full ecosystem incl. the fine-tuning & alignment stack.

TRL
LLaMA-Factory
Axolotl
Unsloth

Megatron-LM

Megatron-Core support for large-scale tensor and pipeline-parallel workloads.

PyTorch Lightning

Full trainer lifecycle, step- and epoch-level telemetry parsing.

Keras / TensorFlow

Callback-level and fit-loop telemetry for TF2 and Keras training runs.

Coverage comes from parsers, not from SDKs the team has to install.
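As a rough sketch of what a registry of framework-specific parsers could look like, the example below registers two parsers and tries each against a line of standard output. Everything here is hypothetical: the parser names, the registry shape, and the log formats are assumptions (the first mimics the Python-dict style lines that Hugging Face Trainer commonly prints), not ApexData's implementation.

```python
import ast
import re
from typing import Callable, Dict, Optional, Tuple

Metrics = Dict[str, float]
Parser = Callable[[str], Optional[Metrics]]

PARSERS: Dict[str, Parser] = {}   # hypothetical registry: parser name -> function


def register(name: str) -> Callable[[Parser], Parser]:
    """Decorator that adds a parser to the registry under the given name."""
    def wrap(fn: Parser) -> Parser:
        PARSERS[name] = fn
        return fn
    return wrap


@register("dict-style")
def parse_dict_line(line: str) -> Optional[Metrics]:
    """Parse a line such as {'loss': 2.31, 'grad_norm': 0.87, 'learning_rate': 5e-05}."""
    line = line.strip()
    if not (line.startswith("{") and line.endswith("}")):
        return None
    try:
        record = ast.literal_eval(line)   # safe literal parsing, no code execution
    except (ValueError, SyntaxError):
        return None
    if not isinstance(record, dict) or "loss" not in record:
        return None
    return {k: float(v) for k, v in record.items() if isinstance(v, (int, float))}


_KV = re.compile(r"\b(loss|lr|grad_norm)\s*[=:]\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


@register("key-value")
def parse_kv_line(line: str) -> Optional[Metrics]:
    """Parse a line such as 'step 40 loss=1.25 lr=3e-4'."""
    found = {key: float(value) for key, value in _KV.findall(line)}
    return found if "loss" in found else None


def parse_line(line: str) -> Optional[Tuple[str, Metrics]]:
    """Try each registered parser in order; return the first match, else None."""
    for name, parser in PARSERS.items():
        metrics = parser(line)
        if metrics is not None:
            return name, metrics
    return None


print(parse_line("{'loss': 2.31, 'grad_norm': 0.87, 'learning_rate': 5e-05, 'epoch': 0.1}"))
print(parse_line("step 40 loss=1.25 lr=3e-4"))
print(parse_line("Traceback (most recent call last):"))
```

Supporting another framework in this design means registering one more function; the training code itself stays untouched, which is the point of parsing standard output instead of requiring an SDK.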

Book a Demo

30-minute demo on a sample cluster — we walk through a real failure report.

PCI DSS & GDPR
On-prem / air-gapped
SSO / SAML
Per-tenant isolation
Local LLMs — telemetry never leaves your environment