Web-based server hardware monitoring via IPMI and Redfish
Complete documentation for IPMI Monitor - a web-based server hardware monitoring tool.
| Version: v1.1.1 | Last Updated: 2026-01-24 |
IPMI Monitor is a web-based tool for monitoring server hardware via IPMI (Intelligent Platform Management Interface) and Redfish APIs. It provides real-time visibility into your server fleet’s health.
IPMI Monitor works with any server that has an IPMI-compliant BMC (Baseboard Management Controller).
The easiest way to get started is the interactive quickstart wizard:
# Install pipx (prerequisite)
sudo apt install pipx -y && pipx ensurepath
source ~/.bashrc
# Install the CLI tool
pipx install ipmi-monitor
# Run the wizard (use full path since pipx bin isn't in sudo PATH)
sudo ~/.local/bin/ipmi-monitor quickstart
The wizard will:
If DC Overview is already installed, the quickstart wizard automatically:
- Detects the existing /etc/dc-overview/ configuration
- Adds IPMI Monitor to the prometheus.yml scrape configuration
- Shares SSH keys via /etc/ipmi-monitor/ssh_keys/

This makes it easy to add IPMI monitoring to an existing GPU monitoring setup.
On a fresh installation, IPMI Monitor automatically performs an initial data collection:
A progress modal appears in the dashboard showing:
This ensures your dashboard has data immediately after setup.
- BMC IP address (e.g., 192.168.1.100)
- Server name (e.g., server-01)

If your servers use custom IPMI credentials:
Return to the Dashboard to see your servers. Click any server card to view detailed events, sensors, and inventory.
A dedicated processor on the server motherboard that operates independently of the main CPU. It allows remote monitoring and management even when the server is powered off or the OS has crashed.
| Feature | IPMI | Redfish |
|---|---|---|
| Protocol | Binary (port 623) | REST API (HTTPS 443) |
| Data Format | Binary | JSON |
| Support | Widely available | Modern BMCs |
| Detail Level | Basic | More detailed |
Recommendation: Use Auto protocol mode - IPMI Monitor will try Redfish first for more detailed data, then fall back to IPMI.
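If you're unsure whether a BMC supports Redfish, you can probe the standard DMTF service root yourself (a quick sketch; `192.168.1.100` is a placeholder BMC IP, and `-k` skips the BMC's usually self-signed certificate):

```bash
# Probe the Redfish service root; a JSON response means Redfish is available
curl -k https://192.168.1.100/redfish/v1/
```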
- BMC IP (e.g., 192.168.1.100)
- Server IP (e.g., 192.168.1.101)

The main dashboard shows all monitored servers in a grid view.
Each card displays:
| Status | Meaning |
|---|---|
| 🟢 Online | Server and BMC responding normally |
| 🟡 Warning | Warning events detected or partial connectivity |
| 🔴 Offline | BMC not reachable |
| ⚫ Unknown | Never successfully polled |
Data refreshes automatically every 60 seconds. Event collection runs every 5 minutes by default (configurable via POLL_INTERVAL environment variable).
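For example, a minimal `docker run` sketch that changes the event-collection interval (this assumes `POLL_INTERVAL` is given in seconds; volume mounts and other options are omitted, so adapt it to your deployment):

```bash
# Poll for events every 120 seconds instead of the 5-minute default
docker run -d --name ipmi-monitor \
  -p 5000:5000 \
  -e POLL_INTERVAL=120 \
  ghcr.io/cryptolabsza/ipmi-monitor:latest
```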
IPMI Monitor displays version information in the dashboard header and can check for updates.
The header shows:
IPMI Monitor
v1.6.0 (main@8d7150c, 2025-12-07 22:41 UTC) Last updated: 12:46:02 AM
Components:
When an update is available, a popup shows:
# Pull the latest image
docker pull ghcr.io/cryptolabsza/ipmi-monitor:latest
# Restart with docker-compose
docker-compose up -d --force-recreate ipmi-monitor
# Or with docker run
docker stop ipmi-monitor
docker rm ipmi-monitor
docker run -d ... ghcr.io/cryptolabsza/ipmi-monitor:latest
GET /api/version - Get current version and build info
GET /api/version/check - Check GitHub for newer releases
Example response from /api/version:
{
"version": "1.6.0",
"version_string": "v1.6.0 (main@8d7150c, 2025-12-07 22:41 UTC)",
"git_branch": "main",
"git_commit": "8d7150c",
"build_time": "2025-12-07 22:41 UTC"
}
💡 Note: Update checking requires network access to api.github.com. If your server can’t reach GitHub, the check will silently fail.
If you have servers in multiple datacenters or locations, you can deploy an IPMI Monitor instance at each site while using a single license.
Your Company (Single CryptoLabs Account)
├── NYC Datacenter
│ └── IPMI Monitor instance (50 servers)
│ Site Name: "NYC Datacenter"
├── London Office
│ └── IPMI Monitor instance (30 servers)
│ Site Name: "London Office"
└── Singapore Colo
└── IPMI Monitor instance (20 servers)
Site Name: "Singapore Colo"
Total: 100 servers, 1 license, 3 sites
Reset the BMC (Baseboard Management Controller) without affecting the running host OS. This is useful when the BMC becomes unresponsive but the server itself is still running.
| Scenario | Recommended Reset |
|---|---|
| BMC unresponsive to web/IPMI | Cold Reset |
| BMC slow but responding | Warm Reset |
| IPMI commands failing | Cold Reset |
| After firmware update | Cold Reset |
# Cold reset
ipmitool -I lanplus -H 192.168.1.100 -U admin -P password mc reset cold
# Warm reset
ipmitool -I lanplus -H 192.168.1.100 -U admin -P password mc reset warm
# Check BMC info
ipmitool -I lanplus -H 192.168.1.100 -U admin -P password mc info
Click any server card to view detailed information across three tabs.
Shows System Event Log (SEL) entries with:
| Event | Meaning | Action |
|---|---|---|
| Correctable ECC Error | Memory error detected and corrected | Monitor frequency; replace DIMM if recurring |
| Uncorrectable ECC Error | Memory error that couldn’t be fixed | Replace DIMM immediately |
| Temperature Threshold | Component exceeded temperature limit | Check cooling, clean dust, verify airflow |
| Fan Failure | Fan stopped or below speed threshold | Replace fan ASAP |
| Power Supply Failure | PSU issue detected | Check/replace PSU |
📖 See Also: IPMI SEL Reference Guide for detailed event code interpretation including hex data decoding.
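To cross-check the parsed events against the raw SEL, you can query the BMC directly with ipmitool (IP and credentials are placeholders):

```bash
# List raw SEL entries directly from the BMC
ipmitool -I lanplus -H 192.168.1.100 -U admin -P password sel list
# Extended list with sensor names resolved
ipmitool -I lanplus -H 192.168.1.100 -U admin -P password sel elist
```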
Real-time sensor readings including:
Click 🔄 Refresh Sensors to collect fresh data from the BMC. After refresh:
| Sensor | Normal | Warning | Critical |
|---|---|---|---|
| CPU Temperature | < 70°C | 70-85°C | > 85°C |
| Inlet Temperature | < 30°C | 30-40°C | > 40°C |
| DIMM Temperature | < 60°C | 60-75°C | > 75°C |
| Rail | Normal Range |
|---|---|
| 3.3V | 3.1V - 3.5V |
| 5V | 4.75V - 5.25V |
| 12V | 11.4V - 12.6V |
| VBAT | 2.8V - 3.3V |
⚠️ Low VBAT Warning: If VBAT drops below 2.5V, the CMOS battery needs replacement. This can cause BIOS settings to reset.
Hardware information collected via IPMI FRU, Redfish, and SSH:
| Source | Data Collected | Requirements |
|---|---|---|
| IPMI FRU | Manufacturer, model, serial | IPMI access |
| IPMI SDR | Sensor list, CPU/DIMM counts | IPMI access |
| Redfish API | Detailed CPU, memory, storage, GPU | Redfish-enabled BMC |
| SSH to OS | Exact CPU model, memory config, drives | SSH enabled + credentials |
💡 Tip: Enable SSH in Settings → SSH tab for the most detailed inventory data.
Admin and Read-Write users have access to the Diagnostics section for troubleshooting:
| Download | Description |
|---|---|
| Raw SEL Log | Unparsed IPMI SEL events directly from BMC |
| Raw Sensor Data | All sensors with thresholds in raw format |
| SSH Logs | dmesg, journalctl, GPU logs collected via SSH |
| Full Diagnostic Package | Everything bundled in a ZIP file |
Loading States: All download buttons show progress (e.g., “Collecting SEL…”) and are disabled during collection to prevent duplicate downloads.
Custom Commands: Admins can execute custom IPMI or SSH commands directly from the Diagnostics section.
IPMI Monitor detects NVIDIA GPU errors via SSH by parsing dmesg for Xid errors.
- Runs `dmesg | grep "NVRM.*Xid"` on the server

Technical Xid codes are hidden from the UI. Instead of “Xid 48”, you’ll see:
| What You See | Technical Code | Meaning |
|---|---|---|
| GPU Memory Error | Xid 48, 94, 95 | ECC or memory fault |
| GPU Not Responding | Xid 43 | GPU hang |
| GPU Disconnected | Xid 79 | GPU fell off PCIe bus |
| GPU Requires Recovery | Xid 154 | Driver requests recovery |
💡 Technical details are stored internally for admin debugging via the API.
GPU events appear in the Events tab with:
- Requires SSH with `dmesg` access

IPMI Monitor can collect system logs from your servers via SSH for centralized viewing and AI analysis.
| Source | Command | Purpose |
|---|---|---|
| Kernel Log | `dmesg` | Hardware errors, driver issues, boot messages |
| Journal | `journalctl` | Systemd service logs |
| Syslog | `/var/log/syslog` | System messages |
| MCE Log | `mcelog` | Machine check exceptions (ECC, CPU errors) |
| Auth Log | `/var/log/auth.log` | SSH login attempts, sudo usage |
| Secure Log | `/var/log/secure` | Security events (RHEL/CentOS) |
| Docker Daemon | `journalctl -u docker` | Docker service errors and warnings |
IPMI Monitor collects Docker daemon logs to help troubleshoot container issues common on GPU hosting servers.
Detected Issues:
- `storage-opt` errors (XFS pquota configuration)

How It Works:
- Reads `journalctl -u docker` or `/var/log/docker.log`

AI Integration: When you ask Docker-related questions in AI Chat, it automatically:
When IPMI Monitor is deployed via dc-overview quickstart with Vast.ai or RunPod exporters enabled, it automatically collects platform-specific daemon logs.
Vast.ai Daemon Logs:
- Collected from `/var/log/vastai/` and `journalctl -u vastai`

RunPod Agent Logs:
Auto-Configuration: When deployed via DC Overview with exporters enabled:
# These are automatically set by dc-overview quickstart
COLLECT_VASTAI_LOGS=true # When vast_exporter is enabled
COLLECT_RUNPOD_LOGS=true # When runpod_exporter is enabled
You don’t need to configure this manually - it’s automatically enabled based on your DC Overview configuration.
IPMI Monitor parses SSH authentication logs to detect:
| Event Type | Detection | Severity |
|---|---|---|
| Failed SSH Login | Multiple failed password/key attempts | Warning |
| Brute Force Attack | 5+ failed logins from same IP in 5 min | Critical |
| Successful Login | Successful authentication | Info |
| Invalid User | Login attempt with non-existent user | Warning |
| Root Login | Direct root login (if enabled) | Info |
During Quickstart: If you have servers with SSH configured, the wizard asks:
Step 5b: SSH Log Collection (Optional)
Collect system logs from servers via SSH (dmesg, syslog, GPU errors).
Useful for troubleshooting hardware issues.
? Enable SSH log collection? (y/N)
After Installation:
The setting is stored in the ENABLE_SSH_LOGS environment variable and persists across container restarts.
| Severity | Example Entries |
|---|---|
| Critical | Kernel panic, OOM killer, GPU fell off bus, hardware failure |
| Error | I/O errors, driver failures, service crashes |
| Warning | Correctable ECC errors, high temperature, failed logins |
| Info | Service started, successful logins, normal operations |
- Read access to `/var/log/` for log file collection

IPMI Monitor tracks server uptime and detects unexpected reboots.
- Reads `/proc/uptime` via SSH each collection cycle

| Event | Severity | Meaning |
|---|---|---|
| Unexpected server reboot | Warning | Server rebooted without system initiation |
Go to the server details page or use the API:
GET /api/uptime?server={bmc_ip}
Returns:
- `uptime_days` - Days since last boot
- `last_boot_time` - When the server last booted
- `reboot_count` - Total detected reboots
- `unexpected_reboot_count` - Reboots not initiated by the system

IPMI Monitor automatically creates maintenance tasks when error patterns indicate hardware issues.
| Pattern | Task Created |
|---|---|
| 3+ reboots in 24 hours | High severity maintenance required |
| 2+ power cycles in 24 hours | High severity maintenance required |
| 5+ GPU errors for same device in 24 hours | Critical maintenance required |
- Task type: `automated_maintenance`
- Severity: `medium`, `high`, or `critical`
- Status: `pending`, `scheduled`, `in_progress`, `completed`, or `cancelled`

View tasks at `/api/maintenance` or through the dashboard (when enabled).
Update task status via API:
PUT /api/maintenance/{id}
{
"status": "completed",
"notes": "Replaced GPU"
}
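The same update as a curl sketch (task ID `42` is a placeholder, and a session cookie from a prior login is assumed):

```bash
curl -b cookies.txt -X PUT http://ipmi-monitor:5000/api/maintenance/42 \
  -H "Content-Type: application/json" \
  -d '{"status": "completed", "notes": "Replaced GPU"}'
```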
Go to Settings → Manage Servers → Add New Server:
Click any server in the list to open the edit dialog:
Import servers from a YAML/JSON file:
# servers.yaml example
# Global defaults applied to all servers
defaults:
ipmi_user: admin
ipmi_pass: YourDefaultPassword
ssh_user: root
ssh_key_name: production # References a stored SSH key by name
servers:
# Minimal - just name and BMC IP (uses defaults)
- name: server-01
bmc_ip: 192.168.1.100
server_ip: 192.168.1.101
# Override specific credentials
- name: server-02
bmc_ip: 192.168.1.102 # Required: BMC/IPMI IP
server_ip: 10.0.0.102 # Optional: OS IP for SSH inventory
public_ip: 203.0.113.50 # Optional: External/public IP (documentation)
ipmi_user: custom_admin # Override default
ipmi_pass: secretpass # Override default
notes: Production database # Optional: Notes/description
Available fields:
| Field | Required | Description |
|---|---|---|
| `name` | ✅ Yes | Display name for the server |
| `bmc_ip` | ✅ Yes | BMC/IPMI management IP |
| `server_ip` | No | OS IP address (for SSH inventory) |
| `public_ip` | No | Public/external IP (for reference) |
| `ipmi_user` | No | IPMI username (uses default if not set) |
| `ipmi_pass` | No | IPMI password (uses default if not set) |
| `ssh_user` | No | SSH username (default: root) |
| `ssh_key_name` | No | Name of a stored SSH key to use |
| `ssh_pass` | No | SSH password (if not using key auth) |
| `notes` | No | Notes or description |
Set default credentials that apply to all servers unless overridden.
GET /api/settings/credentials/defaults
PUT /api/settings/credentials/defaults
Example:
{
"ipmi_user": "admin",
"ipmi_pass": "DefaultPassword",
"ssh_user": "root",
"ssh_port": 22,
"default_ssh_key_id": 1
}
Apply defaults to multiple servers at once:
POST /api/settings/credentials/apply
{
"server_ips": ["192.168.1.100", "192.168.1.101"],
"apply_ipmi": true,
"apply_ssh": true,
"overwrite": false
}
Set server_ips to "all" to apply to all servers.
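As a curl sketch (again assuming an authenticated session cookie):

```bash
# Apply stored defaults to every managed server without overwriting custom credentials
curl -b cookies.txt -X POST http://ipmi-monitor:5000/api/settings/credentials/apply \
  -H "Content-Type: application/json" \
  -d '{"server_ips": "all", "apply_ipmi": true, "apply_ssh": true, "overwrite": false}'
```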
SSH enables detailed inventory collection from the server’s OS. This is optional and supplements data from IPMI/Redfish.
IPMI Monitor collects data in this order, only filling gaps:
For SSH inventory collection to work, the target server’s OS needs these standard Linux tools:
| Tool | Package | Used For | Required |
|---|---|---|---|
| `lspci` | pciutils | GPU detection, NIC detection, PCIe health | ✅ Yes |
| `lsblk` | util-linux | Storage devices (NVMe, SSD, HDD) | ✅ Yes |
| `lscpu` | util-linux | CPU model, socket count, core count | ✅ Yes |
| `/proc/cpuinfo` | (kernel) | CPU fallback if `lscpu` unavailable | Built-in |
| `/proc/meminfo` | (kernel) | Total memory | Built-in |
| `/sys/class/net/` | (kernel) | Network interfaces, MACs, speeds | Built-in |
| `/sys/class/dmi/id/` | (kernel) | System manufacturer, product name | Built-in |
| `/sys/class/hwmon/` | (kernel) | Temperature sensors | Built-in |
| `dmidecode` | dmidecode | Memory DIMM details (needs root) | Optional |
| `virsh` | libvirt | KVM host device passthrough info | Optional |
| `setpci` | pciutils | Advanced PCIe diagnostics | Optional |
Typical installation (Debian/Ubuntu):
sudo apt install pciutils util-linux
Typical installation (RHEL/CentOS/Rocky):
sudo dnf install pciutils util-linux
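A quick way to verify the required tools before enabling SSH inventory (a sketch to run on each target server):

```bash
# Report any of the required tools that are missing on this host
for tool in lspci lsblk lscpu; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```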
⚠️ Note: SSH collection does NOT require
nvidia-smior any vendor drivers. GPU detection useslspciwhich sees all PCI devices including GPUs passed through to VMs.
💡 Note: No software installation is needed if you only use IPMI/Redfish for monitoring. SSH is purely supplemental.
When SSH is enabled, IPMI Monitor checks PCIe device health using lspci -vvv. This parses AER (Advanced Error Reporting) status registers.
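You can run the same check by hand for a single device (replace `3b:00.0` with your device's address from `lspci`):

```bash
# Show device status and AER error registers for one PCIe device
sudo lspci -vvv -s 3b:00.0 | grep -E 'DevSta|UESta|CESta'
```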
Device Status Flags:
| Flag | Severity | Description |
|---|---|---|
| `FatalError` | Critical | PCIe fatal error - device may be non-functional |
| `NonFatalError` | Warning | Recoverable PCIe error |
| `UnsupportedRequest` | Warning | Device received unsupported PCIe request |
Uncorrectable Errors (UESta) - Critical:
| Code | Description |
|---|---|
| `DLP` | Data Link Protocol Error |
| `SDES` | Surprise Down Error (device unexpectedly removed) |
| `TLP` | TLP Prefix Blocked |
| `FCP` | Flow Control Protocol Error |
| `CmpltTO` | Completion Timeout |
| `CmpltAbrt` | Completer Abort |
| `UnxCmplt` | Unexpected Completion |
| `RxOF` | Receiver Overflow |
| `MalfTLP` | Malformed TLP |
| `ECRC` | ECRC Error |
| `UnsupReq` | Unsupported Request |
Correctable Errors (CESta) - Warning:
| Code | Description |
|---|---|
| `RxErr` | Receiver Error |
| `BadTLP` | Bad TLP (recoverable) |
| `BadDLLP` | Bad DLLP (recoverable) |
| `Rollover` | Replay Number Rollover |
| `Timeout` | Replay Timer Timeout |
| `NonFatalErr` | Non-Fatal Error (Advisory) |
💡 Tip: Uncorrectable errors (UE) indicate serious hardware issues that may require replacement. Correctable errors (CE) are recovered automatically but frequent occurrences may indicate failing hardware.
The inventory page shows PCIe health status for GPUs and VGA devices. Devices with errors are highlighted and logged as warnings.
Store SSH keys centrally and assign them to servers:
💡 Keys should be in OpenSSH format, starting with
-----BEGIN OPENSSH PRIVATE KEY-----
When running ipmi-monitor quickstart, you have multiple options for SSH authentication:
| Option | Description |
|---|---|
| Select Detected Key | Auto-detects keys in ~/.ssh/ (id_rsa, id_ed25519, etc.) with fingerprint |
| Enter Path Manually | Specify a custom path to your private key |
| Paste Key Content | Paste the private key directly (saved to ~/.ssh/ipmi_monitor_pasted_key) |
| Generate New Key | Creates ED25519 key pair and prints public key with instructions |
When generating a new key, you’ll see:
✓ New SSH key generated!
Private key: /root/.ssh/ipmi_monitor_key
Fingerprint: SHA256:xxxxx... (ED25519)
━━━ PUBLIC KEY ━━━
ssh-ed25519 AAAAC3NzaC1... ipmi-monitor
━━━━━━━━━━━━━━━━━━
To allow SSH access, add this public key to your servers:
1. Copy the public key above
2. On each server, add it to: ~/.ssh/authorized_keys
3. Ensure permissions: chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
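If the target servers still allow password login, `ssh-copy-id` automates steps 1-3 (paths match the wizard output above):

```bash
# Install the generated public key into a server's authorized_keys
ssh-copy-id -i /root/.ssh/ipmi_monitor_key.pub root@192.168.1.101
```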
Each server can have custom SSH settings:
Control what recovery actions the AI agent can perform on your servers.
| Permission | Description | Risk Level |
|---|---|---|
| `allow_soft_reset` | PCI unbind/rebind, restart NVIDIA services | Low |
| `allow_clock_limit` | Reduce GPU clocks to stabilize | Low |
| `allow_kill_workload` | Stop containers using failed GPU | Medium |
| `allow_reboot` | Full server reboot | High |
| `allow_power_cycle` | BMC power cycle | High |
| `auto_maintenance_flag` | Auto-create maintenance tasks | Low |
System-wide defaults:
GET /api/recovery/permissions/default
PUT /api/recovery/permissions/default
Per-server overrides:
GET /api/recovery/permissions/server/{bmc_ip}
PUT /api/recovery/permissions/server/{bmc_ip}
Apply to multiple servers:
POST /api/recovery/permissions/apply
{
"server_ips": ["192.168.1.100", "192.168.1.101"],
"permissions": {
"allow_soft_reset": true,
"allow_reboot": false
}
}
⚠️ Warning: Enabling
allow_rebootandallow_power_cyclegives the AI agent permission to reboot servers automatically. Use with caution.
- Create a bot by messaging `@BotFather` on Telegram (send `/newbot`)
- Get your chat ID (e.g., from `@userinfobot`)

Configure SMTP settings for email notifications. Works with Gmail, SendGrid, or any SMTP server.
Send alerts to Slack, Discord, or custom endpoints. Webhooks receive JSON payloads with alert details.
| Role | Dashboard | Settings | Server Management | User Management | AI Features |
|---|---|---|---|---|---|
| Admin | ✅ | ✅ | ✅ | ✅ | ✅ |
| Read-Write | ✅ | ✅ | ✅ | ❌ | ✅ |
| Read-Only | ✅ | ❌ | ❌ | ❌ | View only |
Enable to allow viewing the dashboard without login. Anonymous users get read-only access.
⚠️ Security Note: Only enable anonymous access on trusted networks.
IPMI Monitor provides a built-in Prometheus exporter for integration with your existing monitoring stack.
Metrics are exposed at /metrics on the same port as the web interface (default: 5000):
http://ipmi-monitor:5000/metrics
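A quick spot-check that the exporter is serving data (metric names are listed in the table below):

```bash
# Should print one line per monitored server with value 1 or 0
curl -s http://ipmi-monitor:5000/metrics | grep ipmi_server_reachable
```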
Common target configurations:
- `ipmi-monitor:5000` - Docker network (using container name)
- `localhost:5000` - Same host as Prometheus
- `192.168.1.50:5000` - Remote IP address

| Metric | Type | Description |
|---|---|---|
| `ipmi_server_reachable` | Gauge | BMC reachable (1=yes, 0=no) |
| `ipmi_server_power_on` | Gauge | Power state (1=on, 0=off) |
| `ipmi_temperature_celsius` | Gauge | Temperature per sensor |
| `ipmi_fan_speed_rpm` | Gauge | Fan speed readings |
| `ipmi_voltage_volts` | Gauge | Voltage readings |
| `ipmi_power_watts` | Gauge | Power consumption |
| `ipmi_events_total` | Gauge | Total events per server |
| `ipmi_events_critical_24h` | Gauge | Critical events in 24h |
| `ipmi_events_warning_24h` | Gauge | Warning events in 24h |
| `ipmi_total_servers` | Gauge | Total monitored servers |
| `ipmi_reachable_servers` | Gauge | Reachable server count |
| `ipmi_alerts_total` | Gauge | Total fired alerts |
| `ipmi_alerts_unacknowledged` | Gauge | Unacknowledged alerts |
| `ipmi_last_collection_timestamp` | Gauge | Last collection time |
Add this to your prometheus.yml:
scrape_configs:
- job_name: 'ipmi-monitor'
static_configs:
- targets: ['ipmi-monitor:5000']
scrape_interval: 60s
scrape_timeout: 30s
metrics_path: /metrics
Target options:
- `ipmi-monitor:5000` - Docker network (container name)
- `localhost:5000` - Same host
- `192.168.1.50:5000` - Remote IP

We provide a ready-to-import Grafana dashboard with:
Import: Download from grafana/dashboards/ipmi-monitor.json
# High Temperature Alert
ipmi_temperature_celsius{sensor=~"CPU.*"} > 80
# Server Unreachable
ipmi_server_reachable == 0
# Critical Events Spike
increase(ipmi_events_critical_24h[1h]) > 5
# Multiple Servers Down
count(ipmi_server_reachable == 0) > 2
💡 Note: Scraping
/metricsreads cached data from the last collection cycle (default: every 5 minutes). Faster scrape intervals won’t give you fresher data - they’ll just read the same values repeatedly.
Premium AI features provide intelligent analysis of your server fleet. Access AI features directly from the IPMI Monitor dashboard via the AI Insights panel.
The AI Insights panel is displayed on the right side of your dashboard and contains the following tabs:
| Tab | Description |
|---|---|
| 📊 Summary | Fleet-wide health summary with critical issues, frequent errors, and trends |
| 🔧 Tasks | AI-generated maintenance tasks with specific component recommendations |
| 📈 Predictions | Failure predictions based on sensor trends and event patterns |
| 🔍 RCA | Root cause analysis for specific events or server issues |
| 💬 Chat | Interactive AI assistant for asking questions about your fleet |
| 📈 Usage | Token usage, subscription status, and billing information |
| 🤖 Agent | AI Recovery Agent configuration and monitoring |
The AI analyzes all your servers to generate a comprehensive health report:
The summary uses Agentic RAG (Retrieval Augmented Generation) to:
AI generates specific maintenance tasks with:
Example task:
Replace DIMM_A1 on BrickBox-40 (Critical)
47 correctable ECC errors in past 24 hours with increasing frequency.
Schedule memory replacement during next maintenance window.
AI predicts potential failures based on:
Predictions include confidence levels and recommended preventive actions.
Deep analysis for specific events or issues:
Filter RCA by:
Natural language interface for asking questions:
Example Questions:
Tips for Better Responses:
The Usage tab shows:
Ask questions about your servers in natural language:
AI analyzes events and sensors to generate maintenance work items.
| Priority | Meaning | Timeframe |
|---|---|---|
| 🔴 Critical | Immediate risk of outage | Today |
| 🟡 High | Component degrading | This week |
| 🔵 Medium | Needs attention | Next maintenance window |
| ⚪ Low | Monitor and plan | When convenient |
Each task includes:
The AI Recovery Agent autonomously handles GPU failures and other hardware issues with an intelligent escalation ladder.
Configure the agent’s behavior via the Agent tab in AI Insights:
| Mode | Description |
|---|---|
| ⏸️ Disabled | Agent does not monitor or take actions |
| 👁️ Monitoring Only | Agent monitors and reports issues but takes no automatic actions (default) |
| ⚡ Actions Enabled | Agent can automatically execute recovery actions on your servers |
⚠️ Warning: Actions Enabled mode allows the agent to automatically reboot servers or stop workloads. Enable with caution.
Recovery actions are grouped by risk level:
Low Risk Actions (GPU Only)

| Action | Description |
|---|---|
| Stop Workload | Gracefully stop containers using the failed GPU |
| GPU Soft Reset | PCI unbind/rebind to reset GPU without rebooting |

Medium Risk Actions

| Action | Description |
|---|---|
| Graceful Reboot | Shut down services cleanly, then reboot |
| Disk Cleanup | Clear temp files, logs, and cache if the disk is full |

High Risk Actions

| Action | Description |
|---|---|
| IPMI Power Cycle | Force power cycle via BMC (data loss risk) |
| Stage | Action | Description | Cooldown |
|---|---|---|---|
| 1 | Check Status | Verify GPU is actually failed | - |
| 2 | Soft Reset | PCI unbind/rebind, restart NVIDIA services | 5 min |
| 3 | Clock Limit | Reduce GPU clocks 20% to stabilize | 15 min |
| 4 | Kill Workload | Stop containers/VMs using the GPU | 30 min |
| 5 | Reboot | Full server reboot | 60 min |
| 6 | Power Cycle | BMC power cycle | 120 min |
| 7 | Maintenance | Flag for manual intervention | - |
Before enabling Actions mode, the agent checks:
All agent actions are logged as events:
| Event | Description |
|---|---|
| GPU Requires Recovery | GPU error detected |
| GPU Reset Attempted | Soft reset performed |
| GPU Clock Limited | Clock reduction applied |
| Server Rebooted | Reboot performed |
| Workload Stopped | Container(s) killed to free GPU |
| Power Cycle Executed | IPMI power cycle performed |
| Maintenance Required | Device flagged for manual intervention |
Manually trigger an Agentic RAG analysis of your entire fleet:
View recent recovery actions:
💡 Tip: Start with only low-risk actions enabled. Enable medium/high risk actions only after testing in your environment.
When a server recovers from an unreachable (“dark”) state, IPMI Monitor can investigate what happened during the downtime.
| Cause | Evidence | Confidence |
|---|---|---|
| Reboot | OS boot time during outage | High |
| Power Outage | SEL shows “AC Lost” events | Very High |
| BMC Reset | SEL shows reset events | High |
| Network Issue | Multiple servers offline simultaneously | High |
| BMC Unresponsive | No other evidence found | Medium |
Automatic: Investigation runs when alert resolves (if AI agent enabled)
Manual:
API:
curl -X POST http://ipmi-monitor:5000/api/server/192.168.1.100/investigate \
-H "Content-Type: application/json" \
-d '{"downtime_start": "2025-12-10T10:00:00Z"}'
With AI features enabled, the CryptoLabs AI service can send tasks to your IPMI Monitor for execution.
| Task | Description | Prerequisites |
|---|---|---|
| `power_cycle` | BMC power cycle | IPMI credentials |
| `power_reset` | Chassis reset | IPMI credentials |
| `bmc_reset` | BMC cold/warm reset | IPMI credentials |
| `collect_inventory` | SSH inventory collection | SSH credentials |
| `ssh_command` | Execute SSH command | SSH credentials |
| `check_connectivity` | Verify server reachability | - |
AI Service IPMI Monitor
│ │
│ Create task │
├─────────────────────────────►│ Poll for tasks
│ │
│◄─────────────────────────────┤ Claim task
│ │
│ │ Execute action
│ │
│◄─────────────────────────────┤ Report completion
│ │
The agent dashboard shows:
IPMI Monitor is a self-hosted application - there is no central server that can reset your password. Your data and credentials are stored locally in a SQLite database inside the Docker container.
Since you have root access to your server, you can reset your password directly:
Save this script as reset-ipmi-password.sh and run it:
#!/bin/bash
# IPMI Monitor Password Reset Script
# Usage: ./reset-ipmi-password.sh <new_password> [username]
NEW_PASSWORD="${1:-changeme}"
USERNAME="${2:-admin}"
# Find the container
CONTAINER=$(docker ps --format '{{.Names}}' | grep -E 'ipmi-monitor|ipmi_monitor' | head -1)
if [ -z "$CONTAINER" ]; then
echo "❌ IPMI Monitor container not found"
echo " Running containers: $(docker ps --format '')"
exit 1
fi
echo "🔧 Resetting password for user '$USERNAME' in container '$CONTAINER'..."
# Generate password hash and update database
docker exec -i "$CONTAINER" python3 << EOF
from werkzeug.security import generate_password_hash
import sqlite3
new_password = "$NEW_PASSWORD"
username = "$USERNAME"
password_hash = generate_password_hash(new_password)
conn = sqlite3.connect('/var/lib/ipmi-monitor/ipmi_events.db')
cursor = conn.cursor()
# Check if user exists
cursor.execute("SELECT id FROM user WHERE username = ?", (username,))
user = cursor.fetchone()
if user:
cursor.execute("UPDATE user SET password_hash = ? WHERE username = ?", (password_hash, username))
conn.commit()
print(f"✅ Password updated for user '{username}'")
else:
print(f"❌ User '{username}' not found")
cursor.execute("SELECT username FROM user")
users = cursor.fetchall()
if users:
print(f" Available users: {', '.join([u[0] for u in users])}")
conn.close()
EOF
echo ""
echo "🔐 You can now login with:"
echo " Username: $USERNAME"
echo " Password: $NEW_PASSWORD"
Usage:
chmod +x reset-ipmi-password.sh
# Reset admin password to 'newpassword123'
./reset-ipmi-password.sh newpassword123 admin
# Reset a different user
./reset-ipmi-password.sh mypassword myuser
# Enter the container
docker exec -it ipmi-monitor bash
# Use Python to update password
python3 << 'EOF'
from werkzeug.security import generate_password_hash
import sqlite3
new_password = "your_new_password"
username = "admin"
password_hash = generate_password_hash(new_password)
conn = sqlite3.connect('/var/lib/ipmi-monitor/ipmi_events.db')
cursor = conn.cursor()
cursor.execute("UPDATE user SET password_hash = ? WHERE username = ?", (password_hash, username))
conn.commit()
print(f"Password updated for {username}")
conn.close()
EOF
If starting fresh, set the admin password via environment variable:
environment:
- ADMIN_USER=admin
- ADMIN_PASS=your_new_password
⚠️ Note: The
ADMIN_PASSenvironment variable only sets the password on first run when the database is created. It does not reset existing passwords.
- Test basic connectivity: `ping 192.168.1.100`

| Error | Cause | Solution |
|---|---|---|
| “Permission denied” | Wrong credentials | Check SSH key or password |
| “Connection refused” | SSH not running | Verify SSH service, check port |
| “No route to host” | Network issue | Check IP address, firewall |
| “error in libcrypto” | Key format issue | Re-paste the key carefully |
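When the table above doesn't pinpoint the failure, a verbose SSH run from the IPMI Monitor host usually does (key path and IP are placeholders):

```bash
# -vvv prints the full handshake, showing exactly where auth or connectivity fails
ssh -vvv -i ~/.ssh/ipmi_monitor_key root@192.168.1.101 'echo ok'
```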
| Term | Definition |
|---|---|
| BMC | Baseboard Management Controller - dedicated processor for server management |
| IPMI | Intelligent Platform Management Interface - protocol for BMC communication |
| Redfish | Modern REST API alternative to IPMI |
| SEL | System Event Log - BMC’s record of hardware events |
| FRU | Field Replaceable Unit - hardware inventory data |
| SDR | Sensor Data Record - sensor configuration data |
| ECC | Error Correcting Code - memory error detection/correction |
| DIMM | Dual Inline Memory Module - RAM stick |
| PSU | Power Supply Unit |
| VBAT | Backup battery voltage (usually CR2032 for CMOS) |
| iDRAC | Dell’s BMC implementation |
| iLO | HP’s BMC implementation |
| Xid | NVIDIA GPU driver error code (hidden from users in UI) |
| PCI Unbind/Rebind | Soft GPU reset via Linux sysfs |
| Clock Limiting | Reducing GPU clock speeds to improve stability |
| Recovery Agent | AI system that autonomously handles GPU failures |
| Escalation Ladder | Progressive recovery actions (soft → hard) |
| Cooldown | Waiting period between recovery attempts |
IPMI Monitor provides a REST API for integration.
API endpoints require session authentication. Login via POST to /login.
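A login sketch with curl (the form field names `username` and `password` are assumptions; adjust them to match your login form):

```bash
# Log in once and store the session cookie
curl -c cookies.txt -X POST http://ipmi-monitor:5000/login \
  -d 'username=admin&password=yourpassword'
# Reuse the cookie for authenticated API calls
curl -b cookies.txt http://ipmi-monitor:5000/api/servers
```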
GET /api/servers - List all servers
GET /api/servers/managed - List managed servers
GET /api/server/{ip}/events - Get server events
GET /api/server/{ip}/sensors - Get sensor readings
GET /api/servers/{ip}/inventory - Get hardware inventory
POST /api/servers/{ip}/inventory - Collect inventory
GET /api/auth/status - Check auth status
POST /api/test/bmc - Test BMC connection
POST /api/test/ssh - Test SSH connection
GET /metrics - Prometheus metrics
GET /health - Health check
GET /api/version - Get current version and build info
GET /api/version/check - Check GitHub for newer releases
GET /api/maintenance - List maintenance tasks
PUT /api/maintenance/{id} - Update maintenance task
GET /api/recovery-logs - Get recovery action history
GET /api/uptime - Get server uptime information
GET /api/settings/credentials/defaults - Get global defaults
PUT /api/settings/credentials/defaults - Set global defaults
POST /api/settings/credentials/apply - Apply to multiple servers
GET /api/ssh-keys - List stored SSH keys
POST /api/ssh-keys - Add SSH key
DELETE /api/ssh-keys/{id} - Delete SSH key
GET /api/recovery/permissions/default - Get system defaults
PUT /api/recovery/permissions/default - Set system defaults
GET /api/recovery/permissions/server/{ip} - Get per-server overrides
PUT /api/recovery/permissions/server/{ip} - Set per-server overrides
POST /api/recovery/permissions/apply - Apply to multiple servers
For complete API documentation, see the GitHub repository.
Short Answer: Power readings require DCMI (Data Center Manageability Interface) support, which is an optional IPMI extension not all BMCs support.
Details: IPMI Monitor collects power consumption using the command:
ipmitool dcmi power reading
DCMI is primarily found on enterprise/server-grade BMCs. Many motherboards, especially consumer-grade or older server boards, don’t support it. Even servers of the same model can have different BMC firmware versions with varying DCMI support.
What you can do:
For comprehensive GPU and system metrics, consider using DC Overview which installs exporters directly on servers:
- `node_exporter` - CPU, memory, disk, network metrics
- `dc-exporter-rs` - GPU temperatures, power, utilization, memory, errors

BMC sensor support varies widely:
| BMC Type | Typical Sensors Available |
|---|---|
| Enterprise (Dell iDRAC, HP iLO) | CPU, inlet, outlet, DIMM, PSU, drive temps |
| NVIDIA DGX | Limited via IPMI - use Redfish or dc-exporter-rs |
| Supermicro IPMI | CPU, system temps, some have VRM temps |
| Consumer boards | Often only CPU package temp |
Solution: Enable Redfish in Settings if your BMC supports it - Redfish often exposes more sensors than IPMI.
This happens when dc-exporter-rs cannot communicate with NVIDIA’s NVML (NVIDIA Management Library):
| Error | Cause | Solution |
|---|---|---|
| “NVML failed to initialize” | Driver not loaded | Run nvidia-smi, reboot if needed |
| “Driver/library mismatch” | Kernel module ≠ userspace lib | Reinstall NVIDIA driver, reboot |
| “No NVIDIA GPU found” | No GPU or disabled | Check lspci \| grep NVIDIA |
| “Insufficient permissions” | Need root or nvidia group | Run exporter as root or add to nvidia group |
Quick fix attempt:
# Check driver status
nvidia-smi
# If mismatch, reinstall driver
sudo apt install --reinstall nvidia-driver-XXX
# Reboot to reload kernel module
sudo reboot
| Feature | IPMI Monitor | DC Overview |
|---|---|---|
| Data Source | BMC (out-of-band) | OS-level exporters (in-band) |
| Works when OS down? | ✅ Yes | ❌ No |
| GPU metrics depth | Basic (if BMC supports) | Comprehensive (NVML-based) |
| CPU/Memory/Disk | Limited to BMC sensors | Full via node_exporter |
| Setup complexity | Just need BMC IPs | Install exporters on each server |
| Power consumption | DCMI (if supported) | Per-GPU power via NVML |
| Hardware events (SEL) | ✅ Full SEL history | ❌ No |
| Remote power control | ✅ Yes | ❌ No |
Recommendation: Use both together for complete coverage:
Common causes:
IPMI Monitor mitigations:
Option 1: Via BMC (limited)
Option 2: DC Overview with dc-exporter-rs (recommended)
# On each GPU server
pipx install dc-overview
dc-overview quickstart
This installs dc-exporter-rs which provides 50+ GPU metrics:
The dashboard uses metrics that require:
- `ipmi_power_watts` - Requires DCMI support (see FAQ above)
- `ipmi_temperature_celsius{sensor_name=~"CPU.*"}` - Requires CPU temp sensors

If panels are empty:
IPMI Monitor supports pure-Redfish monitoring:
Redfish advantages:
Last updated: January 2026