Automated Root-Cause Diagnosis for GPU Training
ApexData reads your training telemetry and produces a grounded root-cause verdict — what failed, why, and what to do — without instrumenting your code.
- 20+ failure modes
- No SDK or code changes
- Verdict in minutes
- Slurm-scheduled clusters

From a day of guessing to a verdict in minutes
About a day to trace a slow or failed training run by hand, across logs, metrics, and the scheduler.
Minutes to a grounded root-cause verdict (what failed, why, and what to do), cited to the exact metric and log.
Assumes ~$3 / GPU-hour on-demand; a 128-GPU cluster is 128 × $3 × 24 h ≈ $9.2K, or roughly $10K, of compute per day.
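The footnote's arithmetic can be checked with a short sketch. Only the ~$3 rate and the 128-GPU cluster size come from the text above; the function name is an illustrative assumption, and the rate is treated as flat for simplicity.

```python
def compute_cost_usd(gpus: int, usd_per_gpu_hour: float, hours: float) -> float:
    """Cost of keeping `gpus` GPUs allocated for `hours` at a flat hourly rate."""
    return gpus * usd_per_gpu_hour * hours


# Figures from the footnote: 128 GPUs at ~$3 per GPU-hour for a full day.
daily = compute_cost_usd(128, 3.0, 24)
print(f"${daily:,.0f} per day")  # $9,216 per day
```

Under the same assumptions, every hour spent diagnosing by hand while the allocation sits idle costs about $384.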
Zero instrumentation. Evidence from every layer.
A lightweight on-node agent collects across the stack — no SDK, no code changes, no special logging sink.
📊 GPU telemetry
Memory & allocation curve, engine-active vs SM-active duty cycle, XID / ECC / thermal health.
⚡ CUDA & NCCL tracing (eBPF)
Kernel-launch counts, API durations, in-flight collectives — separates an idle hang from real compute.
📝 Job logs
Loss trajectory, framework progress, Python tracebacks — attributed by rank, node, and process.
🗄 Slurm scheduler records
Job & step lifecycle, per-node up/down/drain, exit codes — a node failure isn’t read as a code fault.
No single signal is trusted alone — the verdict is only as strong as the evidence that agrees across stores.
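As a rough illustration of the cross-store agreement described above, the sketch below only reports a hang when GPU telemetry, eBPF tracing, and job logs all point the same way, and it checks the scheduler first. This is not ApexData's implementation: the field names, the thresholds, and the verdict strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """One observation window for a job. Every field name is hypothetical."""
    sm_active_pct: float        # GPU telemetry: SM-active duty cycle
    kernel_launches: int        # eBPF tracing: CUDA kernel launches in the window
    inflight_collectives: int   # eBPF tracing: NCCL collectives started, not finished
    log_steps_advanced: int     # job logs: training steps reported in the window
    nodes_up: bool              # Slurm records: all allocated nodes still up


def diagnose(e: Evidence, min_agreeing: int = 3) -> str:
    # Scheduler evidence first: a node failure should not be read as a code fault.
    if not e.nodes_up:
        return "infrastructure: node down"

    # Each store casts one vote for "hung". Thresholds are assumed values.
    hang_votes = [
        e.sm_active_pct < 5.0,                                   # GPU is idle
        e.kernel_launches == 0 and e.inflight_collectives > 0,   # stuck in a collective
        e.log_steps_advanced == 0,                               # no training progress
    ]
    if sum(hang_votes) >= min_agreeing:
        return "hung in NCCL collective"
    if not any(hang_votes):
        return "making progress"
    # Partial agreement: name what cannot be decided instead of guessing.
    return "undetermined: signals disagree"


print(diagnose(Evidence(1.0, 0, 2, 0, True)))
print(diagnose(Evidence(1.0, 0, 0, 0, True)))
```

The first call has all three stores agreeing on a hang; the second has an idle GPU and no log progress but no in-flight collective, so the sketch declines to name a cause.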
A verdict you can act on — with the evidence
“Job 2260 is currently hung inside an NCCL collective and will not recover without intervention.”
- Every claim cited to a metric and log.
- Recommended action, not just a diagnosis.
- Names what it cannot decide — no guessing.

We catch the runs other tools call “healthy”
The dangerous failures are “done-but-bad”: the job completes cleanly, so infrastructure tools read green. ApexData judges on positive evidence; four of its seven families are shown here.
Throughput & utilization
GPU under-utilization, I/O save-storms, stragglers — one slow rank drags the whole multi-GPU allocation.
Training correctness
Loss divergence, silent resume-from-scratch (checkpoint never loaded), undertrained runs.
Cluster & scheduler
Node-down mid-run, requeue / preemption, timeout — a job fault vs. an infrastructure event.
Healthy baseline
Confirms a genuinely healthy run on positive evidence, not on the absence of errors.
+ 3 more families covered (Startup & config · GPU memory & hardware · Distributed coordination) — see capabilities PDF.
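To make the straggler case concrete, here is a minimal sketch of one way a slow rank could be flagged: compare each rank's median step time with the median across ranks. The input shape, the function name, and the 1.5× threshold are assumptions for illustration, and the timings are assumed to be per-rank compute time measured before the collective, since synchronized step times would look identical on every rank.

```python
from statistics import median


def find_stragglers(
    step_seconds_by_rank: dict[int, list[float]],
    slow_factor: float = 1.5,
) -> list[int]:
    """Return ranks whose median step time exceeds slow_factor x the cross-rank median."""
    # Median per rank is robust to an occasional slow step (e.g. a checkpoint save).
    per_rank = {
        rank: median(times)
        for rank, times in step_seconds_by_rank.items()
        if times
    }
    if len(per_rank) < 2:
        return []  # nothing to compare against
    baseline = median(per_rank.values())
    return sorted(r for r, t in per_rank.items() if t > slow_factor * baseline)


# Hypothetical timings: rank 3 is roughly twice as slow as its peers.
timings = {
    0: [1.00, 1.10, 1.00],
    1: [1.00, 1.00, 1.05],
    2: [1.02, 0.98, 1.00],
    3: [2.10, 2.00, 2.20],
}
print(find_stragglers(timings))  # [3]
```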
Frameworks & training stacks we recognize
The agent reads loss, gradient norm, learning rate, and checkpoints directly from each job's standard output, via a registry of framework-specific parsers.
Hugging Face Transformers
Full ecosystem, including the fine-tuning & alignment stack.
Megatron-LM
Megatron-Core support for large-scale tensor- and pipeline-parallel workloads.
PyTorch Lightning
Full trainer lifecycle, with step- and epoch-level telemetry parsing.
Keras / TensorFlow
Callback-level and fit-loop telemetry for TF2 and Keras training runs.
Coverage comes from parsers, not from SDKs the team has to install.
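A registry of framework-specific parsers could look something like the sketch below. It is not ApexData's code: the registry, the parser names, and the `key=value` format are hypothetical, and the first parser assumes a stdout line resembling the Python dict of metrics that a Hugging Face Trainer run typically prints.

```python
import ast
import re
from typing import Callable, Optional

Metrics = dict[str, float]
Parser = Callable[[str], Optional[Metrics]]

PARSERS: dict[str, Parser] = {}  # hypothetical registry: parser name -> function


def register(name: str) -> Callable[[Parser], Parser]:
    """Decorator that adds a parser to the registry under `name`."""
    def wrap(fn: Parser) -> Parser:
        PARSERS[name] = fn
        return fn
    return wrap


@register("hf_trainer")
def parse_hf_trainer(line: str) -> Optional[Metrics]:
    # Assumed format: {'loss': 2.31, 'grad_norm': 0.87, 'learning_rate': 5e-05}
    match = re.search(r"\{.*'loss'.*\}", line)
    if not match:
        return None
    try:
        record = ast.literal_eval(match.group(0))  # safe parse, no code execution
    except (ValueError, SyntaxError):
        return None
    if not isinstance(record, dict):
        return None
    return {k: float(v) for k, v in record.items() if isinstance(v, (int, float))}


@register("key_value")
def parse_key_value(line: str) -> Optional[Metrics]:
    # Hypothetical generic format: "step=10 loss=2.5 lr=0.001"
    pairs = re.findall(r"(\w+)=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)", line)
    metrics = {key: float(value) for key, value in pairs}
    return metrics if "loss" in metrics else None


def parse_line(line: str) -> Optional[tuple[str, Metrics]]:
    """Try each registered parser in order; return the first that recognizes the line."""
    for name, parser in PARSERS.items():
        metrics = parser(line)
        if metrics is not None:
            return name, metrics
    return None


print(parse_line("{'loss': 2.31, 'grad_norm': 0.87, 'learning_rate': 5e-05}"))
print(parse_line("step=10 loss=2.5 lr=0.001"))
```

Supporting another framework in this design means registering one more function; nothing changes in the training job itself.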
30-minute demo on a sample cluster — we walk through a real failure report.
- PCI DSS & GDPR
- On-prem / air-gapped
- SSO / SAML
- Per-tenant isolation
- Local LLMs — telemetry never leaves your environment