Automated Root-Cause Diagnosis for GPU Training

ApexData reads your training telemetry and produces a grounded root-cause verdict — what failed, why, and what to do — without instrumenting your code.

  • 20+ failure modes
  • No SDK or code changes
  • Verdict in minutes
  • Slurm-scheduled clusters

[Image: ApexData Mission Control, live fleet view with AI verdicts]

From a day of guessing to a verdict in minutes

Manual, today: up to 1 day to trace a slow or failed training run by hand, across logs, metrics, and the scheduler.

With ApexData: 30 minutes to a grounded root-cause verdict (what failed, why, and what to do), cited to the exact metric and log.

Assumes ~$3 / GPU-hour on-demand; a 128-GPU cluster is about $9.2K of compute per day (128 GPUs × 24 hours × $3), or roughly $10K.
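
The arithmetic behind that figure can be checked with a short sketch. The $3 rate and the 128-GPU size come from the line above; the function name and the 30-minute comparison are illustrative only.

```python
GPU_HOUR_USD = 3.0   # assumed on-demand rate, taken from the text above
GPUS = 128           # cluster size, taken from the text above


def compute_cost(gpus: int, hours: float, rate: float = GPU_HOUR_USD) -> float:
    """Return the on-demand compute cost in USD for `gpus` GPUs over `hours`."""
    return gpus * hours * rate


# Compute consumed while a run is being traced by hand for a full day.
day_cost = compute_cost(GPUS, 24)

# Compute consumed during a 30-minute diagnosis on the same cluster.
half_hour_cost = compute_cost(GPUS, 0.5)

print(f"1 day:      ${day_cost:,.0f}")
print(f"30 minutes: ${half_hour_cost:,.0f}")
```

This only prices the GPUs under the stated rate; it does not account for reserved pricing, storage, or engineer time.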

Zero instrumentation. Evidence from every layer.

A lightweight on-node agent collects across the stack — no SDK, no code changes, no special logging sink.

📊 GPU telemetry

Memory & allocation curve, engine-active vs SM-active duty cycle, XID / ECC / thermal health.

CUDA & NCCL tracing (eBPF)

Kernel-launch counts, API durations, in-flight collectives — separates an idle hang from real compute.
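
As a rough illustration of how these counters could separate a hang from real compute, the sketch below classifies a series of sampling windows. The `Window` schema, the 300-second stall threshold, and the label names are assumptions made for this example, not ApexData's actual data model or rules.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """One sampling window of tracing counters (hypothetical schema)."""
    kernel_launches: int         # CUDA kernel launches observed in the window
    inflight_collectives: int    # collectives entered but not yet returned
    oldest_collective_s: float   # age of the oldest in-flight collective


def classify(windows: list[Window], stall_s: float = 300.0) -> str:
    """Label recent activity as computing, idle, or a suspected hang."""
    if not windows:
        return "unknown"
    no_launches = all(w.kernel_launches == 0 for w in windows)
    last = windows[-1]
    stuck = last.inflight_collectives > 0 and last.oldest_collective_s >= stall_s
    if no_launches and stuck:
        # Nothing new is being launched and a collective has not returned.
        return "suspected-nccl-hang"
    if no_launches:
        return "idle"
    return "computing"


hung = [Window(0, 1, 200.0), Window(0, 1, 260.0), Window(0, 1, 320.0)]
busy = [Window(1800, 1, 0.2), Window(1750, 0, 0.0)]
print(classify(hung), classify(busy))
```

A single counter would not be enough here: zero launches alone also describes a job waiting on data, which is why the sketch requires a long-lived in-flight collective as well.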

📝 Job logs

Loss trajectory, framework progress, Python tracebacks — attributed by rank, node, and process.

🗄 Slurm scheduler records

Job & step lifecycle, per-node up/down/drain, exit codes — a node failure isn’t read as a code fault.
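
One way to picture this separation is a mapping from scheduler job states to a coarse attribution. The state names below (NODE_FAIL, BOOT_FAIL, PREEMPTED, TIMEOUT, DEADLINE, FAILED, OUT_OF_MEMORY, COMPLETED, CANCELLED) are ones Slurm accounting reports; the grouping, the function name, and the use of a plain integer exit code are assumptions for this sketch.

```python
# Hypothetical grouping of Slurm job states; the real rules may differ.
INFRA_STATES = {"NODE_FAIL", "BOOT_FAIL", "PREEMPTED"}
LIMIT_STATES = {"TIMEOUT", "DEADLINE"}
JOB_STATES = {"FAILED", "OUT_OF_MEMORY"}


def attribute(state: str, exit_code: int) -> str:
    """Attribute a finished job to infrastructure, a limit, or the job itself."""
    words = state.split()
    # Accounting may append detail, e.g. "CANCELLED by 1001"; keep the first word.
    base = words[0].upper() if words else ""
    if base in INFRA_STATES:
        return "infrastructure"
    if base in LIMIT_STATES:
        return "scheduler-limit"
    if base in JOB_STATES or (base == "COMPLETED" and exit_code != 0):
        return "job-fault"
    if base == "COMPLETED":
        # A clean exit is not proof of a healthy run; it only rules out a crash.
        return "clean-exit"
    return "undetermined"


for s, code in [("NODE_FAIL", 1), ("FAILED", 1), ("CANCELLED by 1001", 0)]:
    print(s, "->", attribute(s, code))
```

Returning "undetermined" for states such as CANCELLED keeps the sketch from guessing when the scheduler record alone cannot say who or what ended the job.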

No single signal is trusted alone — the verdict is only as strong as the evidence that agrees across stores.
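
The rule above can be sketched as a simple corroboration filter: a hypothesis is kept only when at least two distinct evidence stores support it. The store names, the `(store, hypothesis)` tuple shape, and the threshold of two are assumptions for illustration.

```python
from collections import defaultdict


def corroborated(findings: list[tuple[str, str]], min_stores: int = 2) -> list[str]:
    """Return hypotheses supported by at least `min_stores` distinct stores.

    `findings` is a list of (store, hypothesis) pairs (hypothetical shape).
    """
    stores_by_hypothesis: dict[str, set[str]] = defaultdict(set)
    for store, hypothesis in findings:
        stores_by_hypothesis[hypothesis].add(store)
    return sorted(
        h for h, stores in stores_by_hypothesis.items() if len(stores) >= min_stores
    )


findings = [
    ("gpu-telemetry", "nccl-hang"),
    ("cuda-trace", "nccl-hang"),
    ("job-logs", "oom"),
    ("job-logs", "oom"),   # repeated evidence from one store does not count twice
]
print(corroborated(findings))
```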

A verdict you can act on — with the evidence

nccl-hang · confidence 97%

“Job 2260 is currently hung inside an NCCL collective and will not recover without intervention.”

  • Every claim cited to a metric and log.
  • Recommended action, not just a diagnosis.
  • Names what it cannot decide — no guessing.

[Image: ML Investigation, grounded verdict with cited evidence]

We catch the runs other tools call “healthy”

The dangerous failures are “done-but-bad” — the job completes cleanly, so infrastructure tools read green. ApexData judges on positive evidence, across four families.

D · Throughput & utilization

GPU under-utilization, I/O save-storms, stragglers — one slow rank drags the whole multi-GPU allocation.
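
A minimal sketch of straggler detection, assuming per-rank compute time per step (the time before a rank enters the collective) is available: a rank is flagged when its median exceeds the fleet median by a fixed ratio. The input shape, the 1.2 ratio, and the function name are assumptions.

```python
from statistics import median


def find_stragglers(step_times: dict[int, list[float]], ratio: float = 1.2) -> list[int]:
    """Return ranks whose median compute time exceeds `ratio` x the fleet median."""
    per_rank = {rank: median(times) for rank, times in step_times.items() if times}
    if len(per_rank) < 2:
        return []  # nothing to compare against
    fleet = median(per_rank.values())
    return sorted(rank for rank, m in per_rank.items() if m > ratio * fleet)


# Seconds of compute per step for each rank (made-up numbers for illustration).
step_times = {
    0: [1.0, 1.0, 1.0],
    1: [1.0, 1.1, 1.0],
    2: [1.6, 1.7, 1.5],
    3: [1.0, 1.0, 1.0],
}
print(find_stragglers(step_times))
```

Medians are used rather than means so that a single slow step, such as one that coincides with a checkpoint save, does not flag an otherwise normal rank.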

E · Training correctness

Loss divergence, silent resume-from-scratch (checkpoint never loaded), undertrained runs.
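
Two of these checks can be sketched from the loss and step values alone. Both functions below are simplified illustrations under stated assumptions (positive losses, a known last checkpoint step); the thresholds and names are hypothetical.

```python
import math


def diverged(losses: list[float], window: int = 5, factor: float = 3.0) -> bool:
    """Flag divergence: non-finite loss, or recent mean far above the earlier best."""
    if any(not math.isfinite(x) for x in losses):
        return True
    if len(losses) <= window:
        return False  # not enough history to compare
    recent = sum(losses[-window:]) / window
    earlier_best = min(losses[:-window])
    # Assumes positive loss values, as with cross-entropy.
    return recent > factor * earlier_best


def restarted_before_checkpoint(last_checkpoint_step: int, first_step_after_restart: int) -> bool:
    """Flag a restart whose first logged step is earlier than the saved checkpoint."""
    return last_checkpoint_step > 0 and first_step_after_restart < last_checkpoint_step


healthy = [2.0, 1.5, 1.2, 1.0, 0.9, 0.8, 0.8, 0.7, 0.7, 0.7, 0.6]
blown_up = [2.0, 1.5, 1.2, 1.0, 0.9, 0.8, 3.0, 4.0, 5.0, 6.0, 7.0]
print(diverged(healthy), diverged(blown_up))
print(restarted_before_checkpoint(4000, 1))
```

Neither failure raises an error in the job itself, which is why the checks read the training signal rather than the exit status.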

F · Cluster & scheduler

Node-down mid-run, requeue / preemption, timeout — a job fault vs. an infrastructure event.

G · Healthy baseline

Confirms a genuinely healthy run on positive evidence, not on the absence of errors.

Three more families are covered (Startup & config · GPU memory & hardware · Distributed coordination); see the capabilities PDF.

Frameworks & training stacks we recognize

The agent reads loss, gradient norm, learning rate, and checkpoints directly from each job's standard output, via a registry of framework-specific parsers.
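
A parser registry of this kind might look like the sketch below. The registry, decorator, and parser names are hypothetical. The first parser targets a line shaped like the metrics dict that the Hugging Face Trainer typically prints (exact fields vary by version); the second handles a generic `key=value` format invented for this example.

```python
import ast
import re
from typing import Callable, Optional

Metrics = dict[str, float]
Parser = Callable[[str], Optional[Metrics]]

PARSERS: dict[str, Parser] = {}   # hypothetical registry: name -> parser


def register(name: str) -> Callable[[Parser], Parser]:
    """Decorator that adds a parser to the registry under `name`."""
    def wrap(fn: Parser) -> Parser:
        PARSERS[name] = fn
        return fn
    return wrap


@register("hf-dict")
def parse_hf_dict(line: str) -> Optional[Metrics]:
    """Parse a line like {'loss': 2.5, 'grad_norm': 1.25, 'learning_rate': 5e-05}."""
    line = line.strip()
    if not (line.startswith("{") and line.endswith("}")):
        return None
    try:
        data = ast.literal_eval(line)   # safe evaluation of literals only
    except (ValueError, SyntaxError):
        return None
    if not isinstance(data, dict) or "loss" not in data:
        return None
    return {k: float(v) for k, v in data.items() if isinstance(v, (int, float))}


_KV = re.compile(r"(\w+)=([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


@register("key-value")
def parse_key_value(line: str) -> Optional[Metrics]:
    """Parse a line like 'step=100 loss=2.31 lr=1e-4' (format assumed for the demo)."""
    pairs = {k: float(v) for k, v in _KV.findall(line)}
    return pairs if "loss" in pairs else None


def parse_line(line: str) -> Optional[tuple[str, Metrics]]:
    """Try each registered parser in order; return the first match, else None."""
    for name, parser in PARSERS.items():
        metrics = parser(line)
        if metrics is not None:
            return name, metrics
    return None


print(parse_line("{'loss': 2.5, 'grad_norm': 1.25, 'learning_rate': 5e-05, 'epoch': 0.5}"))
print(parse_line("step=100 loss=2.31 lr=1e-4"))
```

Supporting another framework in this design means registering one more function, with no change to the training job itself.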

Hugging Face Transformers

Full ecosystem incl. the fine-tuning & alignment stack.

  • TRL
  • LLaMA-Factory
  • Axolotl
  • Unsloth

Megatron-LM

Megatron-Core support for large-scale tensor and pipeline-parallel workloads.

PyTorch Lightning

Full trainer lifecycle, step- and epoch-level telemetry parsing.

Keras / TensorFlow

Callback-level and fit-loop telemetry for TF2 and Keras training runs.

Coverage comes from parsers, not from SDKs the team has to install.

Book a Demo

30-minute demo on a sample cluster — we walk through a real failure report.

  • PCI DSS & GDPR
  • On-prem / air-gapped
  • SSO / SAML
  • Per-tenant isolation
  • Local LLMs — telemetry never leaves your environment