IPMI Monitor

Web-based server hardware monitoring via IPMI and Redfish

Project repository: cryptolabsza/ipmi-monitor on GitHub

NVIDIA DGX A100 SEL Reference

IPMI System Event Log reference for NVIDIA DGX A100 AI systems.

Platform: NVIDIA DGX A100
BMC Version: 0.20+

Overview

The NVIDIA DGX A100 is a purpose-built AI system featuring 8x NVIDIA A100 GPUs. It uses a custom BMC firmware with NVIDIA-specific event types prefixed with SEL_NV_.
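Because all NVIDIA-specific events share the SEL_NV_ prefix, they are easy to pick out of raw SEL output. The sketch below is illustrative: the pipe-delimited sample line is modeled on the event formats shown in this reference, not a guaranteed `ipmitool sel elist` output format.

```python
def is_nvidia_event(line: str) -> bool:
    """True if a SEL line contains an NVIDIA-specific SEL_NV_ event type."""
    return "SEL_NV_" in line

def event_type(line: str) -> str:
    """Extract the SEL_NV_* token from a SEL line, or '' if none is present."""
    for token in line.replace("|", " ").split():
        if token.startswith("SEL_NV_"):
            return token
    return ""

sample = "Unknown SEL_NV_MAXP_MAXQ | info |  | Asserted | [Power Mode Change]"
print(event_type(sample))  # SEL_NV_MAXP_MAXQ
```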


NVIDIA-Specific Event Types

SEL_NV_MAXP_MAXQ (Power Mode)

Description: Indicates GPU power mode changes.

| Mode | Meaning | Use Case |
|------|---------|----------|
| MaxP | Maximum Performance | Full power, maximum GPU clock speeds |
| MaxQ | Maximum Efficiency | Power-optimized, reduced clocks |

Event Format

Unknown SEL_NV_MAXP_MAXQ | info | | Asserted | [Power Mode Change]

Interpretation:

Troubleshooting Excessive Mode Changes:

  1. Check GPU temperatures
  2. Verify cooling system operation
  3. Review power supply capacity
  4. Check for software forcing power limits
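One hedged way to quantify "excessive" mode changes is to bucket SEL_NV_MAXP_MAXQ events per hour from exported SEL entries. The `MM/DD/YYYY HH:MM:SS` timestamp format is an assumption; adjust it to match your `ipmitool sel elist` output.

```python
from collections import Counter
from datetime import datetime

def mode_changes_per_hour(entries):
    """Count SEL_NV_MAXP_MAXQ events per hour bucket.

    `entries` is a list of (timestamp_str, sel_line) pairs; the
    timestamp format below is an assumption, not a BMC guarantee.
    """
    buckets = Counter()
    for ts, line in entries:
        if "SEL_NV_MAXP_MAXQ" not in line:
            continue
        t = datetime.strptime(ts, "%m/%d/%Y %H:%M:%S")
        buckets[t.strftime("%Y-%m-%d %H:00")] += 1
    return buckets

log = [
    ("01/15/2025 10:05:00", "Unknown SEL_NV_MAXP_MAXQ | info |  | Asserted | [Power Mode Change]"),
    ("01/15/2025 10:40:00", "Unknown SEL_NV_MAXP_MAXQ | info |  | Asserted | [Power Mode Change]"),
    ("01/15/2025 11:10:00", "Unknown SEL_NV_BOOT | info |  | Asserted | [Boot Event]"),
]
print(mode_changes_per_hour(log))  # Counter({'2025-01-15 10:00': 2})
```

A sustained count of several changes per hour would warrant the troubleshooting steps above.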

SEL_NV_POST_ERR (POST Errors)

Description: Power-On Self Test errors during system boot.

Unknown SEL_NV_POST_ERR | info | | Asserted | [POST Error]

Common POST Error Causes:

Troubleshooting:

  1. Check BMC console for detailed error messages
  2. Verify all GPUs are properly seated
  3. Check NVLink bridges
  4. Review memory configuration
  5. Contact NVIDIA Enterprise Support if persistent

SEL_NV_BIOS (BIOS Events)

Description: BIOS/UEFI firmware events during system initialization.

Unknown SEL_NV_BIOS | info | | Asserted | [BIOS Event]

Common BIOS Events:

Note: These are typically informational and don’t require action.


SEL_NV_BOOT (Boot Events)

Description: System boot and restart events.

Unknown SEL_NV_BOOT | info | | Asserted | [Boot Event]

Tracked Events:


SEL_NV_AUDIT (Security Audit)

Description: BMC security audit events.

Unknown SEL_NV_AUDIT | info | | Asserted | [Security Audit]

Tracked Activities:

Security Best Practice: Review these events regularly for unauthorized access attempts.
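For a regular review, the audit entries can be separated from the rest of the SEL. This is a minimal sketch; on some firmware revisions the event description also carries usernames and source addresses worth tracking (an assumption, verify against your BMC's output).

```python
def audit_events(sel_lines):
    """Return only BMC security-audit (SEL_NV_AUDIT) entries for manual review."""
    return [l for l in sel_lines if "SEL_NV_AUDIT" in l]

lines = [
    "Unknown SEL_NV_AUDIT | info |  | Asserted | [Security Audit]",
    "Unknown SEL_NV_BOOT | info |  | Asserted | [Boot Event]",
]
print(len(audit_events(lines)))  # 1
```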


SEL_NV_CHASSIS

Description: Chassis intrusion and enclosure events.

Unknown SEL_NV_CHASSIS | info | | Asserted

Events Include:


SEL_NV_FIRMWARE

Description: Firmware-related events.

Unknown SEL_NV_FIRMWARE | info | | Asserted

Events Include:


DGX-Specific Hardware Sensors

GPU Baseboard Status

| Sensor | Description |
|--------|-------------|
| STATUS_GB_GPU | GPU baseboard presence/status |
| PWRGD_GB_GPU | GPU baseboard power good signal |

GPU Baseboard Events

| Event | Meaning | Action |
|-------|---------|--------|
| Asserted | GPU baseboard detected and healthy | Normal |
| Deasserted | GPU baseboard not detected | Check GPU installation |

System Power Status

| Sensor | Description |
|--------|-------------|
| STATUS_SYS_PWR | Overall system power status |

System Power Events

| Event | Severity | Meaning |
|-------|----------|---------|
| Power off/down \| Asserted | Info | System powered down |
| Power off/down \| Deasserted | Info | System powered on |
| AC lost \| Asserted | Warning | AC power lost |

Multi-PSU Configuration

The DGX A100 has 4 or 6 power supplies for redundancy.

| Sensor | Description |
|--------|-------------|
| STATUS_PSU0 | PSU 0 status |
| STATUS_PSU1 | PSU 1 status |
| STATUS_PSU2 | PSU 2 status |
| STATUS_PSU3 | PSU 3 status |

PSU Events

| Event | Severity | Cause | Action |
|-------|----------|-------|--------|
| Power Supply AC lost \| Asserted | 🔴 Critical | PSU lost AC power | Check PDU, power cord |
| AC lost or out-of-range \| Asserted | 🟡 Warning | AC voltage issue | Verify utility power |
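The system-power and PSU event tables above can be folded into a single severity lookup for alerting. The event strings are taken from the tables; the fallback default is purely illustrative.

```python
# Severity lookup built from the system-power and PSU event tables.
# Keys are (event text, state); any unlisted event falls back to "Info"
# here as an illustrative default - tune this for real alerting.
SEVERITY = {
    ("Power off/down", "Asserted"): "Info",
    ("Power off/down", "Deasserted"): "Info",
    ("AC lost", "Asserted"): "Warning",
    ("Power Supply AC lost", "Asserted"): "Critical",
    ("AC lost or out-of-range", "Asserted"): "Warning",
}

def classify(event: str, state: str) -> str:
    """Map a (event, state) pair to its severity from the tables above."""
    return SEVERITY.get((event, state), "Info")

print(classify("Power Supply AC lost", "Asserted"))  # Critical
```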

Power Requirements:


Dual CPU Status

| Sensor | Description |
|--------|-------------|
| STATUS_CPU0 | CPU 0 presence and status |
| STATUS_CPU1 | CPU 1 presence and status |

The DGX A100 uses dual AMD EPYC 7742 processors as host CPUs.


NVMe Drive Status

| Sensor | Description |
|--------|-------------|
| STATUS_M.2_0 | M.2 NVMe drive 0 status |
| STATUS_M.2_1 | M.2 NVMe drive 1 status |

The DGX A100 includes system NVMe drives for OS and caching.


Common DGX A100 Issues

High GPU Temperatures

Symptoms:

Resolution:

  1. Verify datacenter cooling (18-27°C inlet)
  2. Check GPU fans via BMC
  3. Ensure proper airflow (front-to-back)
  4. Clean air filters monthly
  5. Review workload distribution
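A quick way to spot hot GPUs during these checks is to parse the CSV output of `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader`. The 85 °C threshold below is an illustrative value, not an NVIDIA limit; check your GPU's actual slowdown/shutdown temperatures with `nvidia-smi -q -d TEMPERATURE`.

```python
def hot_gpus(csv_output: str, limit_c: int = 85):
    """Flag (index, temp) pairs at or above a temperature threshold.

    `csv_output` is the text produced by:
      nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader
    The default 85 C limit is an assumption for illustration only.
    """
    hot = []
    for row in csv_output.strip().splitlines():
        idx, temp = [c.strip() for c in row.split(",")]
        if int(temp) >= limit_c:
            hot.append((int(idx), int(temp)))
    return hot

sample = "0, 62\n1, 91\n2, 70\n"
print(hot_gpus(sample))  # [(1, 91)]
```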

NVLink Errors

Symptoms:

Resolution:

  1. Check NVLink bridge connections
  2. Run NVIDIA diagnostics: dcgmi diag -r 3
  3. Review GPU topology: nvidia-smi topo -m
  4. Contact NVIDIA support if persistent

Power Supply Issues

Symptoms:

Resolution:

  1. Verify all power cords connected
  2. Check PDU breaker status
  3. Verify 200-240V supply
  4. Balance load across PDUs
  5. Replace failed PSU (hot-swap capable)
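When balancing load across PDUs, a back-of-the-envelope current check helps. The 6.5 kW figure is the published DGX A100 maximum system power; the even split across feeds is a simplifying assumption, so verify PDU ratings against your actual cabling.

```python
def worst_case_amps(max_watts: float = 6500.0, volts: float = 200.0,
                    feeds: int = 2) -> float:
    """Worst-case current per PDU feed, assuming the load splits evenly.

    6.5 kW is the published DGX A100 maximum system power; the even
    split across `feeds` is an assumption for this sketch.
    """
    return max_watts / volts / feeds

print(round(worst_case_amps(), 2))  # 16.25 A per feed at 200 V
```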

NVIDIA Diagnostics Commands

GPU Health Check

nvidia-smi -q | grep -E "GPU|Temp|Power|Mem"

Run DCGM Diagnostics

dcgmi diag -r 3 -j
nvidia-smi nvlink -s

View GPU Errors

nvidia-smi -q | grep -A5 "Xid Errors"

Check Power Mode

nvidia-smi -q | grep "Power Mode"

Xid Error Reference

GPU errors are reported as Xid codes in the kernel log:

| Xid | Severity | Description | Action |
|-----|----------|-------------|--------|
| 13 | 🔴 Critical | Graphics Engine Exception | Check driver, may need GPU reset |
| 31 | 🔴 Critical | GPU memory page fault | Check application, may indicate HW issue |
| 43 | 🔴 Critical | GPU stopped processing | Reboot required, check thermals |
| 45 | 🔴 Critical | Preemptive cleanup | Check cooling, may be thermal |
| 48 | 🔴 Critical | Double bit ECC error | GPU memory hardware failure |
| 56 | 🟡 Warning | Display engine error | Usually recoverable |
| 57 | 🔴 Critical | TCC/TPU error | Check driver version |
| 61 | 🟡 Warning | Internal µcode error | Update driver |
| 62 | 🟡 Warning | Internal µcode breakpoint | Update driver |
| 63 | 🔴 Critical | Row remapping failure | GPU needs service |
| 64 | 🟡 Warning | Row remapping pending | Monitor, will auto-remap |
| 74 | 🔴 Critical | NVLink error | Check NVLink bridges |
| 79 | 🔴 Critical | GPU fell off bus | Check PCIe seating |
| 94 | 🟡 Warning | Memory page retired | Monitor ECC errors |
| 95 | 🟡 Warning | Memory page retirement | Monitor ECC errors |
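The table above can be encoded as a lookup and applied to kernel log lines. The dmesg line shape below is an assumption modeled on common NVRM output ("NVRM: Xid (PCI:...): NN, ..."); adjust the pattern to your driver version. Only a subset of the table is shown.

```python
import re

# Subset of the Xid table above, encoded as {xid: (severity, note)}.
XID_TABLE = {
    48: ("Critical", "Double bit ECC error - GPU memory hardware failure"),
    63: ("Critical", "Row remapping failure - GPU needs service"),
    74: ("Critical", "NVLink error - check NVLink bridges"),
    79: ("Critical", "GPU fell off bus - check PCIe seating"),
    94: ("Warning", "Memory page retired - monitor ECC errors"),
}

# Assumed NVRM log shape; verify against your driver's dmesg output.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:]+)\): (\d+)")

def parse_xid(line: str):
    """Return (pci_addr, xid, severity, note), or None if no Xid found."""
    m = XID_RE.search(line)
    if not m:
        return None
    xid = int(m.group(2))
    sev, note = XID_TABLE.get(xid, ("Unknown", "See NVIDIA Xid documentation"))
    return m.group(1), xid, sev, note

line = "NVRM: Xid (PCI:0000:07:00): 79, GPU has fallen off the bus."
print(parse_xid(line))  # ('PCI:0000:07:00', 79, 'Critical', 'GPU fell off bus - check PCIe seating')
```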

Real-time Metrics

Daily Checks

Weekly



Last updated: December 2025