Web-based server hardware monitoring via IPMI and Redfish
IPMI System Event Log reference for NVIDIA DGX A100 AI systems.
| Platform: NVIDIA DGX A100 | BMC Version: 0.20+ |
The NVIDIA DGX A100 is a purpose-built AI system featuring 8x NVIDIA A100 GPUs. It runs custom BMC firmware that logs NVIDIA-specific event types prefixed with SEL_NV_.
Description: Indicates GPU power mode changes.
| Mode | Meaning | Use Case |
|---|---|---|
| MaxP | Maximum Performance | Full power, maximum GPU clock speeds |
| MaxQ | Maximum Efficiency | Power-optimized, reduced clocks |
Unknown SEL_NV_MAXP_MAXQ | info | | Asserted | [Power Mode Change]
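The sample entry above is pipe-delimited, so it can be split into named fields for scripting. The following is a minimal sketch, not the BMC's own parser: the field names (`sensor_type`, `detail`, `label`) are assumptions chosen for illustration, inferred only from the sample lines on this page.

```python
from dataclasses import dataclass


@dataclass
class SelEntry:
    sensor_type: str  # first token, "Unknown" in the sample (name is an assumption)
    event_type: str   # e.g. "SEL_NV_MAXP_MAXQ"
    severity: str     # e.g. "info"
    detail: str       # third field, empty in the sample (name is an assumption)
    state: str        # "Asserted" or "Deasserted"
    label: str        # bracketed text without the brackets, may be empty


def parse_sel_line(line: str) -> SelEntry:
    """Split one pipe-delimited entry into named fields."""
    fields = [f.strip() for f in line.split("|")]
    # The first field holds two space-separated tokens: "Unknown SEL_NV_...".
    sensor_type, _, event_type = fields[0].partition(" ")
    # Some entries omit the trailing bracketed label, so pad to five fields.
    fields += [""] * (5 - len(fields))
    return SelEntry(
        sensor_type=sensor_type,
        event_type=event_type.strip(),
        severity=fields[1],
        detail=fields[2],
        state=fields[3],
        label=fields[4].strip("[]"),
    )


entry = parse_sel_line(
    "Unknown SEL_NV_MAXP_MAXQ | info | | Asserted | [Power Mode Change]"
)
print(entry.event_type, entry.state, entry.label)
```

Padding the field list keeps the parser usable for entries that end after the state column.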
Interpretation:
Troubleshooting Excessive Mode Changes:
Description: Power-On Self Test errors during system boot.
Unknown SEL_NV_POST_ERR | info | | Asserted | [POST Error]
Common POST Error Causes:
Troubleshooting:
Description: BIOS/UEFI firmware events during system initialization.
Unknown SEL_NV_BIOS | info | | Asserted | [BIOS Event]
Common BIOS Events:
Note: These are typically informational and don’t require action.
Description: System boot and restart events.
Unknown SEL_NV_BOOT | info | | Asserted | [Boot Event]
Tracked Events:
Description: BMC security audit events.
Unknown SEL_NV_AUDIT | info | | Asserted | [Security Audit]
Tracked Activities:
Security Best Practice: Review these events regularly for unauthorized access attempts.
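A regular review is easier if the audit entries are pulled out of the full log first. The sketch below assumes the SEL has already been exported as text lines in the format shown on this page; the function names `count_event_types` and `audit_entries` are hypothetical helpers, not part of any NVIDIA or IPMI tool.

```python
import re
from collections import Counter

# Matches the NVIDIA-specific event type token, e.g. "SEL_NV_AUDIT".
EVENT_RE = re.compile(r"\bSEL_NV_[A-Z0-9_]+\b")


def count_event_types(lines):
    """Count how often each SEL_NV_* event type appears in the given lines."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


def audit_entries(lines):
    """Return only the lines that record a BMC security audit event."""
    return [line for line in lines if "SEL_NV_AUDIT" in line]


# Illustrative input built from the sample entries on this page.
sample = [
    "Unknown SEL_NV_BOOT | info | | Asserted | [Boot Event]",
    "Unknown SEL_NV_AUDIT | info | | Asserted | [Security Audit]",
    "Unknown SEL_NV_AUDIT | info | | Asserted | [Security Audit]",
]
print(count_event_types(sample))
print(len(audit_entries(sample)), "audit entries to review")
```

A sudden rise in the audit count between two reviews is a reasonable trigger for a closer look at the individual entries.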
Description: Chassis intrusion and enclosure events.
Unknown SEL_NV_CHASSIS | info | | Asserted
Events Include:
Description: Firmware-related events.
Unknown SEL_NV_FIRMWARE | info | | Asserted
Events Include:
| Sensor | Description |
|---|---|
| STATUS_GB_GPU | GPU baseboard presence/status |
| PWRGD_GB_GPU | GPU baseboard power good signal |
| Event | Meaning | Action |
|---|---|---|
| Asserted | GPU baseboard detected and healthy | Normal |
| Deasserted | GPU baseboard not detected | Check GPU installation |
| Sensor | Description |
|---|---|
| STATUS_SYS_PWR | Overall system power status |
| Event | State | Severity | Meaning |
|---|---|---|---|
| Power off/down | Asserted | Info | System powered down |
| Power off/down | Deasserted | Info | System powered on |
| AC lost | Asserted | Warning | AC power lost |
The DGX A100 has 4 or 6 power supplies for redundancy.
| Sensor | Description |
|---|---|
| STATUS_PSU0 | PSU 0 status |
| STATUS_PSU1 | PSU 1 status |
| STATUS_PSU2 | PSU 2 status |
| STATUS_PSU3 | PSU 3 status |
| Event | State | Severity | Cause | Action |
|---|---|---|---|---|
| Power Supply AC lost | Asserted | 🔴 Critical | PSU lost AC power | Check PDU, power cord |
| AC lost or out-of-range | Asserted | 🟡 Warning | AC voltage issue | Verify utility power |
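The PSU event table can be encoded as a lookup so that a monitoring script prints the suggested action next to each event. This is a sketch under the assumption that the event name and assertion state arrive as separate strings, matching the first two cells of each row; `triage_psu_event` is a hypothetical helper.

```python
# (event, state) -> (severity, cause, action), copied from the table above.
PSU_EVENTS = {
    ("Power Supply AC lost", "Asserted"): (
        "Critical", "PSU lost AC power", "Check PDU, power cord"),
    ("AC lost or out-of-range", "Asserted"): (
        "Warning", "AC voltage issue", "Verify utility power"),
}


def triage_psu_event(sensor, event, state):
    """Return a one-line triage hint, or None if the event is not in the table."""
    info = PSU_EVENTS.get((event, state))
    if info is None:
        return None
    severity, cause, action = info
    return f"{sensor}: {severity} - {cause}. Action: {action}"


print(triage_psu_event("STATUS_PSU2", "Power Supply AC lost", "Asserted"))
```

Returning `None` for unlisted events keeps the helper from guessing a severity for anything the table does not cover.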
Power Requirements:
| Sensor | Description |
|---|---|
| STATUS_CPU0 | CPU 0 presence and status |
| STATUS_CPU1 | CPU 1 presence and status |
The DGX A100 uses dual AMD EPYC processors as its host CPUs.
| Sensor | Description |
|---|---|
| STATUS_M.2_0 | M.2 NVMe drive 0 status |
| STATUS_M.2_1 | M.2 NVMe drive 1 status |
The DGX A100 includes system NVMe drives for OS and caching.
Symptoms:
Resolution:
Symptoms:
Resolution:
dcgmi diag -r 3
nvidia-smi topo -m
Symptoms:
Resolution:
nvidia-smi -q | grep -E "GPU|Temp|Power|Mem"
dcgmi diag -r 3 -j
nvidia-smi nvlink -s
nvidia-smi -q | grep -A5 "Xid Errors"
nvidia-smi -q | grep "Power Mode"
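Some of the commands above can be collected into one script so their output is captured in a single pass. The sketch below is an assumption about how one might wrap them, not an NVIDIA-provided tool: the check names and the `run_checks` and `filter_lines` helpers are hypothetical, and the `grep -E` filter is reimplemented in Python rather than run through a shell pipe.

```python
import re
import shutil
import subprocess

# name -> (command, optional regex filter). Commands are taken from the list above.
CHECKS = {
    "gpu_summary": (["nvidia-smi", "-q"], r"GPU|Temp|Power|Mem"),
    "dcgm_diag": (["dcgmi", "diag", "-r", "3", "-j"], None),
    "nvlink_status": (["nvidia-smi", "nvlink", "-s"], None),
}


def filter_lines(text, pattern):
    """Keep lines matching the regex, like `grep -E`; a None pattern keeps everything."""
    if pattern is None:
        return text
    rx = re.compile(pattern)
    return "\n".join(line for line in text.splitlines() if rx.search(line))


def run_checks(timeout=900):
    """Run each check; map its name to the output, or None if the tool is missing or fails."""
    results = {}
    for name, (cmd, pattern) in CHECKS.items():
        if shutil.which(cmd[0]) is None:
            results[name] = None  # tool not installed on this host
            continue
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
        except (subprocess.TimeoutExpired, OSError):
            results[name] = None
            continue
        results[name] = filter_lines(proc.stdout, pattern)
    return results


# The filter can be tried without any NVIDIA tools installed.
print(filter_lines("GPU 0\nFan Speed\nPower Draw", r"GPU|Temp|Power|Mem"))
```

On a DGX host, calling `run_checks()` returns a dictionary keyed by check name. The generous default timeout is an assumption made because a level 3 DCGM diagnostic can run for several minutes.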
GPU errors are reported as Xid codes in the kernel log:
| Xid | Severity | Description | Action |
|---|---|---|---|
| 13 | 🔴 Critical | Graphics Engine Exception | Check driver, may need GPU reset |
| 31 | 🔴 Critical | GPU memory page fault | Check application, may indicate HW issue |
| 43 | 🔴 Critical | GPU stopped processing | Reboot required, check thermals |
| 45 | 🔴 Critical | Preemptive cleanup | Check cooling, may be thermal |
| 48 | 🔴 Critical | Double bit ECC error | GPU memory hardware failure |
| 56 | 🟡 Warning | Display engine error | Usually recoverable |
| 57 | 🔴 Critical | Error programming video memory interface | Check driver version |
| 61 | 🟡 Warning | Internal micro-controller breakpoint/warning | Update driver |
| 62 | 🟡 Warning | Internal micro-controller halt | Update driver |
| 63 | 🟡 Warning | Row remapping event recorded (remap pending) | Monitor; the remap is applied at the next GPU reset |
| 64 | 🔴 Critical | Row remapping recording failure | GPU needs service |
| 74 | 🔴 Critical | NVLink error | Check NVLink bridges |
| 79 | 🔴 Critical | GPU fell off bus | Check PCIe seating |
| 94 | 🟡 Warning | Contained ECC error | Restart the affected application, monitor ECC errors |
| 95 | 🔴 Critical | Uncontained ECC error | Reset the GPU, monitor ECC errors |
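Xid codes can be extracted from kernel log text and grouped per GPU to see which device needs attention. The sketch below assumes log lines of the form `NVRM: Xid (PCI:0000:3b:00): 79, ...`; the sample lines are illustrative, not captured from a real system, and `CRITICAL_XIDS` is only a subset of the codes marked Critical in the table above.

```python
import re
from collections import defaultdict

# Assumed log format: "NVRM: Xid (PCI:0000:3b:00): 79, pid=..., ..."
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)")

# Subset of the codes marked Critical in the table above.
CRITICAL_XIDS = {13, 31, 43, 48, 74, 79}


def collect_xids(log_lines):
    """Map each PCI address to the list of Xid codes seen for it, in log order."""
    by_gpu = defaultdict(list)
    for line in log_lines:
        match = XID_RE.search(line)
        if match:
            by_gpu[match.group(1)].append(int(match.group(2)))
    return dict(by_gpu)


def gpus_needing_attention(by_gpu):
    """Return the sorted PCI addresses that logged at least one critical Xid."""
    return sorted(
        addr for addr, codes in by_gpu.items()
        if CRITICAL_XIDS.intersection(codes)
    )


# Illustrative lines only; real entries carry more detail after the code.
sample = [
    "[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242",
    "[1240.1] NVRM: Xid (PCI:0000:86:00): 56, pid=4300",
    "[1250.0] usb 1-1: new high-speed USB device",
]
by_gpu = collect_xids(sample)
print(by_gpu)
print(gpus_needing_attention(by_gpu))
```

Grouping by PCI address makes it easy to cross-check the affected device against the `nvidia-smi` output for the same bus ID.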
Last updated: December 2025