Monitoring the Hardware System Event Log in Linux

Overview

Whilst it’s possible to capture events such as hardware failures directly from alerts on the BMC or other lights-out controller, it can be helpful to see these events from within the host OS, especially when the BMC or iLO hasn’t been fully configured on the network, or you don’t have its IP address readily available.

Fetching the System Event Log (SEL)

# Get the time on the IPMI controller:
ipmitool sel time get

# Get the SEL information:
ipmitool sel

# List the contents of the SEL:
ipmitool sel list

Example output on a node with some DIMM errors:

host:~# ipmitool sel
SEL Information
Version          : 1.5 (v1.5, v2 compliant)
Entries          : 6
Free Space       : 16254 bytes
Percent Used     : 0%
Last Add Time    : 12/19/2014 02:12:56
Last Del Time    : 12/10/2014 18:35:14
Overflow         : false
Supported Cmds   : 'Delete' 'Partial Add' 'Reserve' 'Get Alloc Info'
# of Alloc Units : 909
Alloc Unit Size  : 18
# Free Units     : 903
Largest Free Blk : 903
Max Record Size  : 1

host:~# ipmitool sel list
   1 | 12/10/2014 | 18:35:14 | Event Logging Disabled #0x8a | Log area reset/cleared | Asserted
   2 | 12/15/2014 | 16:56:50 | Memory #0x87 | Correctable ECC | Asserted
   3 | 12/18/2014 | 02:02:20 | Memory #0x87 | Correctable ECC | Asserted
   4 | 12/19/2014 | 01:31:12 | Memory #0x87 | Correctable ECC | Asserted
   5 | 12/19/2014 | 02:12:54 | Memory #0x87 | Correctable ECC | Asserted
   6 | 12/19/2014 | 02:12:56 | Memory #0x87 | Correctable ECC | Asserted
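
To summarise a listing like the one above, the entries can be counted per sensor. The following is a minimal sketch rather than a definitive tool: it assumes the pipe-separated layout shown in the example output, and count_ecc_events is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count "Correctable ECC" entries per sensor from `ipmitool sel list`
# output read on stdin. Field positions assume the pipe-separated layout
# shown in the example; count_ecc_events is a hypothetical helper name.
count_ecc_events() {
    awk -F'|' '
        $5 ~ /Correctable ECC/ {
            gsub(/^ +| +$/, "", $4)   # trim padding around the sensor field
            count[$4]++
        }
        END { for (s in count) printf "%s: %d\n", s, count[s] }
    '
}

# Usage on a live host (requires ipmitool and the IPMI kernel modules):
#   ipmitool sel list | count_ecc_events
```

A sensor whose count keeps rising between runs is a reasonable candidate for closer inspection.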

It is important that the BMC time is accurate, because the timestamps in the SEL come from the BMC clock. You can check it like this:

host:~# ipmitool sel time get
12/29/2014 17:39:33
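
The comparison with the host clock can also be scripted. This is a rough sketch under stated assumptions: it relies on GNU date to parse the MM/DD/YYYY timestamp, it assumes the BMC clock is kept in the same timezone as the host, and bmc_drift_seconds is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: compare a BMC timestamp (as printed by `ipmitool sel time get`)
# with a host time given as seconds since the epoch.
# Assumes GNU date and that the BMC clock uses the host's timezone;
# bmc_drift_seconds is a hypothetical helper name.
# Prints host time minus BMC time: positive means the BMC is behind.
bmc_drift_seconds() {
    local bmc_stamp=$1 host_epoch=$2 bmc_epoch
    bmc_epoch=$(date -d "$bmc_stamp" +%s) || return 1
    echo $(( host_epoch - bmc_epoch ))
}

# Usage on a live host:
#   bmc_drift_seconds "$(ipmitool sel time get)" "$(date +%s)"
```

If the BMC is set to UTC while the host uses local time, the result will be off by the timezone offset, so treat a large round-hour difference as a hint about configuration rather than drift.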

It’s also possible to clear the SEL in the same way, using ipmitool sel clear.
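
Because clearing the log discards its history, one cautious approach is to save a copy first. The sketch below assumes the ipmitool sel clear subcommand; archive_and_clear_sel and the IPMITOOL override variable are hypothetical names, and the override is only there so the function can be tried without a BMC.

```shell
#!/usr/bin/env bash
# Sketch: save the SEL to a file, then clear it only if the save succeeded.
# archive_and_clear_sel and the IPMITOOL override variable are hypothetical;
# the override exists only so the function can be exercised without a BMC.
archive_and_clear_sel() {
    local outfile=$1 tool=${IPMITOOL:-ipmitool}
    "$tool" sel list > "$outfile" || return 1   # keep a copy first
    "$tool" sel clear
}

# Usage on a live host (the path is only an example):
#   archive_and_clear_sel "/var/tmp/sel-$(date +%Y%m%d).log"
```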

Setup

If you haven’t already, you will need to install the ipmitool package.

For ipmitool to connect to the IPMI interface without using the network, the IPMI kernel modules need to be loaded. You can check whether they are loaded with:

host:~# lsmod | grep ipmi

If the modules are not loaded, ipmitool will fail with errors similar to this:

host:~# ipmitool sel
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory

If the lsmod command returns nothing, you can use the following procedure to load the relevant modules:

# Load the IPMI kernel modules
modprobe ipmi_devintf
modprobe ipmi_si
lsmod | grep ipmi

Example:

host:~# modprobe ipmi_devintf
host:~# modprobe ipmi_si
host:~# lsmod | grep ipmi
ipmi_si                53210  0
ipmi_devintf           17572  0
ipmi_msghandler        43745  2 ipmi_devintf,ipmi_si
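
The check and the module loading can be combined so that modprobe only runs when no device node is present. This is a minimal sketch: the default paths are the three named in the ipmitool error message, and ipmi_device_present is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: report whether any of the device nodes that ipmitool tries exists.
# With no arguments it checks the three paths from the ipmitool error message;
# ipmi_device_present is a hypothetical helper name.
ipmi_device_present() {
    local dev
    [ "$#" -gt 0 ] || set -- /dev/ipmi0 /dev/ipmi/0 /dev/ipmidev/0
    for dev in "$@"; do
        if [ -e "$dev" ]; then
            return 0
        fi
    done
    return 1
}

# Usage on a live host (as root): load the modules only when needed.
#   ipmi_device_present || { modprobe ipmi_devintf && modprobe ipmi_si; }
```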