System Event Log Troubleshooting Guide for PCSD Platforms

System Event Log Troubleshooting Guide for PCSD Platforms Based on
Intel® Xeon® Processor E5 4600/2600/2400/1600/1400 Product Families
Intel order number G90620-003
Revision 1.2
December 2013
Platform Collaboration and Systems Division – Marketing
Revision History
Date              Revision Number   Modifications
January 2013      1.0               Initial release
September 2013    1.1
- Added MIC Thermal Margin sensors C4 through C7.
- Added MIC Status sensors A2, A3, A6, and A7.
- Added voltage sensors EA, EB, EC, ED, and EF.
- Corrected typographical errors.
- Made corrections to Firmware Update Status table.
- Made corrections to Catastrophic Error Sensor table.
- Added support for S1400FP, S1400SP, S1600JP, and S4600LH.
- Corrected IPMI Watchdog and PEF Sensors Typical Characteristics tables.
- Clarified Channel designators for DIMM memory errors.
- Corrected ME Firmware Health Event Sensor – Next Steps table.
- Corrected DIMM Thermal Trip Typical Characteristics table.
- Clarified DIMM locations for memory errors.
- Made corrections to Firmware Update Status table.
- Updated Power Unit Status sensor next steps table.
December 2013     1.2
Disclaimers
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,
EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS
GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR
SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR
INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly,
in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION
CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES,
SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH,
HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS'
FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL
INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR
NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF
THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not
rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves
these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from
future changes to them. The information here is subject to change without notice. Do not finalize a design with this
information.
The products described in this document may contain design defects or errors known as errata which may cause the
product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your
product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may
be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.
Table of Contents
1. Introduction
1.1 Purpose
1.2 Industry Standard
1.2.1 Intelligent Platform Management Interface (IPMI)
1.2.2 Baseboard Management Controller (BMC)
1.2.3 Intel® Intelligent Power Node Manager Version 2.0
2. Basic Decoding of a SEL Record
2.1 Default Values in the SEL Records
2.2 Notes on SEL Logs and Collecting SEL Information
2.2.1 Examples of Decoding BIOS Timestamp Events
2.2.2 Example of Decoding a PCI Express* Correctable Error Events
2.2.3 Example of Decoding a Power Supply Predictive Failure Event
3. Sensor Cross Reference List
3.1 BMC owned Sensors (GID = 0020h)
3.2 BIOS POST owned Sensors (GID = 0001h)
3.3 BIOS SMI Handler owned Sensors (GID = 0033h)
3.4 Node Manager / ME Firmware owned Sensors (GID = 002Ch or 602Ch)
3.5 Microsoft* OS owned Events (GID = 0041)
3.6 Linux* Kernel Panic Events (GID = 0021)
4. Power Subsystems
4.1 Threshold-based Voltage Sensors
4.2 Voltage Regulator Watchdog Timer Sensor
4.2.1 Voltage Regulator Watchdog Timer Sensor – Next Steps
4.3 Power Unit
4.3.1 Power Unit Status Sensor
4.3.2 Power Unit Redundancy Sensor
4.3.3 Node Auto Shutdown Sensor
4.4 Power Supply
4.4.1 Power Supply Status Sensors
4.4.2 Power Supply Power In Sensors
4.4.3 Power Supply Current Out % Sensors
4.4.4 Power Supply Temperature Sensors
4.4.5 Power Supply Fan Tachometer Sensors
5. Cooling Subsystem
5.1 Fan Sensors
5.1.1 Fan Tachometer Sensors
5.1.2 Fan Presence and Redundancy Sensors
5.2 Temperature Sensors
5.2.1 Threshold-based Temperature Sensors
5.2.2 Thermal Margin Sensors
5.2.3 Processor Thermal Control Sensors
5.2.4 Processor DTS Thermal Margin Sensors
5.2.5 Discrete Thermal Sensors
5.2.6 DIMM Thermal Trip Sensors
5.3 System Air Flow Monitoring Sensor
6. Processor Subsystem
6.1 Processor Status Sensor
6.2 Catastrophic Error Sensor
6.3 CPU Missing Sensor
6.3.1 CPU Missing Sensor – Next Steps
6.4 Quick Path Interconnect Sensors
6.4.1 QPI Link Width Reduced Sensor
6.4.2 QPI Correctable Error Sensor
6.4.3 QPI Fatal Error and Fatal Error #2
6.5 Processor ERR2 Timeout Sensor
6.5.1 Processor ERR2 Timeout – Next Steps
6.6 Processor MSID Mismatch Sensor
6.6.1 Processor MSID Mismatch Sensor – Next Steps
7. Memory Subsystem
7.1 Memory RAS Configuration Status
7.2 Memory RAS Mode Select
7.3 Mirroring Redundancy State
7.3.1 Mirroring Redundancy State Sensor – Next Steps
7.4 Sparing Redundancy State
7.4.1 Sparing Redundancy State Sensor – Next Steps
7.5 ECC and Address Parity
7.5.1 Memory Correctable and Uncorrectable ECC Error
7.5.2 Memory Address Parity Error
8. PCI Express* and Legacy PCI Subsystem
8.1 PCI Express* Errors
8.1.1 Legacy PCI Errors
8.1.2 PCI Express* Fatal Errors and Fatal Error #2
8.1.3 PCI Express* Correctable Errors
9. System BIOS Events
9.1 System Events
9.1.1 System Boot
9.1.2 Timestamp Clock Synchronization
9.2 System Firmware Progress (Formerly Post Error)
9.2.1 System Firmware Progress (Formerly Post Error) – Next Steps
10. Chassis Subsystem
10.1 Physical Security
10.1.1 Chassis Intrusion
10.1.2 LAN Leash Lost
10.2 FP (NMI) Interrupt
10.2.1 FP (NMI) Interrupt – Next Steps
10.3 Button Sensor
11. Miscellaneous Events
11.1 IPMI Watchdog
11.2 SMI Timeout
11.2.1 SMI Timeout – Next Steps
11.3 System Event Log Cleared
11.4 System Event – PEF Action
11.4.1 System Event – PEF Action – Next Steps
11.5 BMC Watchdog Sensor
11.5.1 BMC Watchdog Sensor – Next Steps
11.6 BMC FW Health Sensor
11.6.1 BMC FW Health Sensor – Next Steps
11.7 Firmware Update Status Sensor
11.8 Add-In Module Presence Sensor
11.8.1 Add-In Module Presence – Next Steps
11.9 Intel® Xeon Phi™ Coprocessor Management Sensors
11.9.1 Intel® Xeon Phi™ Coprocessor (MIC) Thermal Margin Sensors
11.9.2 Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors
12. Hot-Swap Controller Backplane Events
12.1 HSC Backplane Temperature Sensor
12.2 Hard Disk Drive Monitoring Sensor
12.3 Hot-Swap Controller Health Sensor
12.3.1 HSC Health Sensor – Next Steps
13. Manageability Engine (ME) Events
13.1 ME Firmware Health Event
13.1.1 ME Firmware Health Event – Next Steps
13.2 Node Manager Exception Event
13.2.1 Node Manager Exception Event – Next Steps
13.3 Node Manager Health Event
13.3.1 Node Manager Health Event – Next Steps
13.4 Node Manager Operational Capabilities Change
13.4.1 Node Manager Operational Capabilities Change – Next Steps
13.5 Node Manager Alert Threshold Exceeded
13.5.1 Node Manager Alert Threshold Exceeded – Next Steps
14. Microsoft Windows* Records
14.1 Boot up Event Records
14.2 Shutdown Event Records
14.3 Bug Check / Blue Screen Event Records
15. Linux* Kernel Panic Records
List of Tables
Table 1: SEL Record Format
Table 2: Event Request Message Event Data Field Contents
Table 3: OEM SEL Record (Type C0h-DFh)
Table 4: OEM SEL Record (Type E0h-FFh)
Table 5: BMC owned Sensors
Table 6: BIOS POST owned Sensors
Table 7: BIOS SMI Handler owned Sensors
Table 8: Management Engine Firmware owned Sensors
Table 9: Microsoft* OS owned Events
Table 10: Linux* Kernel Panic Events
Table 11: Threshold-based Voltage Sensors Typical Characteristics
Table 12: Threshold-based Voltage Sensors Event Triggers – Description
Table 13: Threshold-based Voltage Sensors – Next Steps
Table 14: Voltage Regulator Watchdog Timer Sensor Typical Characteristics
Table 15: Power Unit Status Sensors Typical Characteristics
Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
Table 17: Power Unit Redundancy Sensors Typical Characteristics
Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps
Table 19: Node Auto Shutdown Sensor Typical Characteristics
Table 20: Power Supply Status Sensors Typical Characteristics
Table 21: Power Supply Status Sensor – Sensor Specific Offsets – Next Steps
Table 22: Power Supply Power In Sensors Typical Characteristics
Table 23: Power Supply Power In Sensor – Event Trigger Offset – Next Steps
Table 24: Power Supply Current Out % Sensors Typical Characteristics
Table 25: Power Supply Current Out % Sensor – Event Trigger Offset – Next Steps
Table 26: Power Supply Temperature Sensors Typical Characteristics
Table 27: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
Table 28: Power Supply Fan Tachometer Sensors Typical Characteristics
Table 29: Fan Tachometer Sensors Typical Characteristics
Table 30: Fan Tachometer Sensor – Event Trigger Offset – Next Steps
Table 31: Fan Presence Sensors Typical Characteristics
Table 32: Fan Presence Sensors – Event Trigger Offset – Next Steps
Table 33: Fan Redundancy Sensors Typical Characteristics
Table 34: Fan Redundancy Sensor – Event Trigger Offset – Next Steps
Table 35: Temperature Sensors Typical Characteristics
Table 36: Temperature Sensors Event Triggers – Description
Table 37: Temperature Sensors – Next Steps
Table 38: Thermal Margin Sensors Typical Characteristics
Table 39: Thermal Margin Sensors Event Triggers – Description
Table 40: Thermal Margin Sensors – Next Steps
Table 41: Processor Thermal Control Sensors Typical Characteristics
Table 42: Processor Thermal Control Sensors Event Triggers – Description
Table 43: Processor DTS Thermal Margin Sensors Typical Characteristics
Table 44: Discrete Thermal Sensors Typical Characteristics
Table 45: Discrete Thermal Sensors – Next Steps
Table 46: DIMM Thermal Trip Typical Characteristics
Table 47: Processor Status Sensors Typical Characteristics
Table 48: Processor Status Sensors – Next Steps
Table 49: Catastrophic Error Sensor Typical Characteristics
Table 50: Catastrophic Error Sensor – Event Data 2 Values – Next Steps
Table 51: CPU Missing Sensor Typical Characteristics
Table 52: QPI Link Width Reduced Sensor Typical Characteristics
Table 53: QPI Correctable Error Sensor Typical Characteristics
Table 54: QPI Fatal Error Sensor Typical Characteristics
Table 55: QPI Fatal #2 Error Sensor Typical Characteristics
Table 56: Processor ERR2 Timeout Sensor Typical Characteristics
Table 57: Processor MSID Mismatch Sensor Typical Characteristics
Table 58: Memory RAS Configuration Status Sensor Typical Characteristics
Table 59: Memory RAS Configuration Status Sensor – Event Trigger Offset – Next Steps
Table 60: Memory RAS Mode Select Sensor Typical Characteristics
Table 61: Mirroring Redundancy State Sensor Typical Characteristics
Table 62: Sparing Redundancy State Sensor Typical Characteristics
Table 63: Correctable and Uncorrectable ECC Error Sensor Typical Characteristics
Table 64: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps
Table 65: Address Parity Error Sensor Typical Characteristics
Table 66: Legacy PCI Error Sensor Typical Characteristics
Table 67: PCI Express* Fatal Error Sensor Typical Characteristics
Table 68: PCI Express* Fatal Error #2 Sensor Typical Characteristics
Table 69: PCI Express* Correctable Error Sensor Typical Characteristics
Table 70: System Event Sensor Typical Characteristics
Table 71: POST Error Sensor Typical Characteristics
Table 72: POST Error Codes
Table 73: Physical Security Sensor Typical Characteristics
Table 74: Physical Security Sensor Event Trigger Offset – Next Steps
Table 75: FP (NMI) Interrupt Sensor Typical Characteristics
Table 76: Button Sensor Typical Characteristics
Table 77: IPMI Watchdog Sensor Typical Characteristics
Table 78: IPMI Watchdog Sensor Event Trigger Offset – Next Steps
Table 79: SMI Timeout Sensor Typical Characteristics
Table 80: System Event Log Cleared Sensor Typical Characteristics
Table 81: System Event – PEF Action Sensor Typical Characteristics
Table 82: BMC Watchdog Sensor Typical Characteristics
Table 83: BMC FW Health Sensor Typical Characteristics
Table 84: Firmware Update Status Sensor Typical Characteristics
Table 85: Add-In Module Presence Sensor Typical Characteristics
Table 86: MIC Status Sensors – Typical Characteristics
Table 87: HSC Backplane Temperature Sensor Typical Characteristics
Table 88: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps
Table 89: Hard Disk Drive Monitoring Sensor Typical Characteristics
Table 90: Hard Disk Drive Monitoring Sensor – Event Trigger Offset – Next Steps
Table 91: HSC Health Sensor Typical Characteristics
Table 92: ME Firmware Health Event Sensor Typical Characteristics
Table 93: ME Firmware Health Event Sensor – Next Steps
Table 94: Node Manager Exception Sensor Typical Characteristics
Table 95: Node Manager Health Event Sensor Typical Characteristics
Table 96: Node Manager Operational Capabilities Change Sensor Typical Characteristics
Table 97: Node Manager Alert Threshold Exceeded Sensor Typical Characteristics
Table 98: Boot up Event Record Typical Characteristics
Table 99: Boot up OEM Event Record Typical Characteristics
Table 100: Shutdown Reason Code Event Record Typical Characteristics
Table 101: Shutdown Reason OEM Event Record Typical Characteristics
Table 102: Shutdown Comment OEM Event Record Typical Characteristics
Table 103: Bug Check / Blue Screen – OS Stop Event Record Typical Characteristics
Table 104: Bug Check / Blue Screen code OEM Event Record Typical Characteristics
Table 105: Linux* Kernel Panic Event Record Characteristics
Table 106: Linux* Kernel Panic String Extended Record Characteristics
1. Introduction
The server management hardware built into Intel® Server Boards and Intel® Server Platforms is a
vital part of the overall server management strategy. It provides essential information to the
system administrator and lets the administrator control the server remotely, even when the
operating system is not running.
The Intel® Server Boards and Intel® Server Platforms offer comprehensive hardware- and
software-based solutions. The server management features make the servers simple to manage
and provide alerting on system events. From entry-level to enterprise systems, good server
management is essential to reducing total cost of ownership.
This Troubleshooting Guide is intended to help users better understand the events that are
logged in the Baseboard Management Controller (BMC) System Event Log (SEL) on these
Intel® Server Boards.
A separate User’s Guide covers general server management and the server
management software offered on the Intel® Server Boards and Intel® Server Platforms.
Server boards currently supported by this document:
• Intel® S1400FP Server Boards
• Intel® S1400SP Server Boards
• Intel® S1600JP Server Boards
• Intel® S2400BB Server Boards
• Intel® S2400EP Server Boards
• Intel® S2400GP Server Boards
• Intel® S2400LP Server Boards
• Intel® S2400SC Server Boards
• Intel® S2600CO Server Boards
• Intel® S2600CP Server Boards
• Intel® S2600GZ/S2600GL Server Boards
• Intel® S2600IP Server Boards
• Intel® S2600JF Server Boards
• Intel® S2600WP Server Boards
• Intel® S4600LH Server Boards
• Intel® W2600CR Workstation Boards
1.1 Purpose
The purpose of this document is to list all possible events generated by the Intel platform.
Other sources (not under our control) may also generate events; those events are not
described in this document.
1.2 Industry Standard
1.2.1 Intelligent Platform Management Interface (IPMI)
The key characteristic of the Intelligent Platform Management Interface (IPMI) is that the
inventory, monitoring, logging, and recovery control functions are available independently of the
main processors, BIOS, and operating system. Platform management functions can also be
made available when the system is in a power-down state.
IPMI works by interfacing with the BMC, which extends management capabilities in the server
system and operates independently of the main processor by monitoring the on-board
instrumentation. Through the BMC, IPMI also allows administrators to control power to the
server, and remotely access BIOS configuration and operating system console information.
IPMI defines a common platform instrumentation interface to enable interoperability between:
• The baseboard management controller and chassis
• The baseboard management controller and systems management software
• Between servers
IPMI enables the following:
• Common access to platform management information, consisting of:
  - Local access from systems management software
  - Remote access from LAN
  - Inter-chassis access from Intelligent Chassis Management Bus
  - Access from LAN, serial/modem, IPMB, PCI SMBus*, or ICMB, available even if the
    processor is down
• IPMI interface isolates systems management software from hardware.
• Hardware advancements can be made without impacting the systems management
  software.
• IPMI facilitates cross-platform management software.
You can find more information on IPMI at the following URL:
http://www.intel.com/design/servers/ipmi
1.2.2 Baseboard Management Controller (BMC)
A baseboard management controller (BMC) is a specialized microcontroller embedded on most
Intel® Server Boards. The BMC is the heart of the IPMI architecture and provides the
intelligence behind intelligent platform management, that is, the autonomous monitoring and
recovery features implemented directly in platform management hardware and firmware.
Different types of sensors built into the computer system report to the BMC on parameters such
as temperature, cooling fan speeds, power mode, operating system status, and so on. The BMC
monitors the system for critical events by communicating with various sensors on the system
board; it sends alerts and logs events when certain parameters exceed their preset thresholds,
indicating a potential failure of the system. The administrator can also remotely communicate
with the BMC to take some corrective action such as resetting or power cycling the system to
get a hung OS running again. These abilities save on the total cost of ownership of a system.
For Intel® Server Boards and Intel® Server Platforms, the BMC supports the industry standard
IPMI 2.0 Specification, enabling you to configure, monitor, and recover systems remotely.
1.2.2.1 System Event Log (SEL)
The BMC provides a centralized, non-volatile repository for critical, warning, and informational
system events called the System Event Log (SEL). Because the BMC manages the SEL and
logging functions, “post-mortem” logging information remains available if a
failure occurs that disables the system processor(s).
The BMC allows access to the SEL through in-band and out-of-band mechanisms. Various
tools and utilities can be used to access the SEL, including the Intel® SELView utility and
multiple open source IPMI tools.
1.2.3 Intel® Intelligent Power Node Manager Version 2.0
Intel® Intelligent Power Node Manager Version 2.0 (NM) is a platform-resident technology that
enforces power and thermal policies for the platform. These policies are applied by exploiting
subsystem knobs (such as processor P and T states) that can be used to control power
consumption. Intel® Intelligent Power Node Manager enables data center power and thermal
management by exposing an external interface to management software through which platform
policies can be specified. It also enables specific data center power management usage models
such as power limiting.
The configuration and control commands are used by the external management software or
BMC to configure and control the Intel® Intelligent Power Node Manager feature. Because
Platform Services firmware does not have any external interface, external commands are first
received by the BMC over LAN and then relayed to the Platform Services firmware over IPMB
channel. The BMC acts as a relay and the transport conversion device for these commands. For
simplicity, the commands from the management console might be encapsulated in a generic
CONFIG packet format (configuration data length, configuration data blob) to the BMC so that
the BMC doesn’t even have to parse the actual configuration data.
The BMC provides the access point for remote commands from external management software and
generates alerts to it. Intel® Intelligent Power Node Manager on Intel® Manageability Engine
(Intel® ME) is an IPMI satellite controller. A mechanism exists to forward commands to Intel® ME
and then send the response back to the originator. Similarly, events from Intel® ME are sent as
alerts outside of the BMC.
2. Basic Decoding of a SEL Record
The System Event Log (SEL) record format is defined in the IPMI Specification. The following section provides a basic definition for
each of the fields in a SEL record. For more details, see the IPMI Specification.
The definitions for the standard SEL can be found in Table 1.
The definitions for the OEM defined event logs can be found in Table 3 and Table 4.
2.1 Default Values in the SEL Records
Unless otherwise noted in the event record descriptions, the following are the default values in all SEL entries.
• Byte [3] = Record Type (RT) = 02h = System event record
• Byte [9:8] = Generator ID = 0020h = BMC Firmware
• Byte [10] = Event Message Revision (ER) = 04h = IPMI 2.0
Table 1: SEL Record Format
Byte | Field | Description
1-2 | Record ID (RID) | ID used for SEL Record access.
3 | Record Type (RT) | [7:0] – Record Type
    02h = System event record
    C0h-DFh = OEM timestamped, bytes 8-16 OEM defined (See Table 3)
    E0h-FFh = OEM non-timestamped, bytes 4-16 OEM defined (See Table 4)
4-7 | Timestamp (TS) | Time when event was logged. LS byte first.
    Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
    Note: There are various websites that will convert the raw number to a date/time.
8-9 | Generator ID (GID) | RqSA and LUN if event was generated from IPMB. Software ID if event was generated from system software.
    Byte 1
    [7:1] – 7-bit I2C Slave Address, or 7-bit system software ID
    [0] – 0b = ID is IPMB Slave Address
          1b = System software ID
    Software ID values:
    • 0001h – BIOS POST for POST errors, RAS Configuration/State, Timestamp Synch, OS Boot events
    • 0033h – BIOS SMI Handler
    • 0020h – BMC Firmware
    • 002Ch – ME Firmware
    • 0041h – Server Management Software
    • 00C0h – HSC Firmware – HSBP A
    • 00C2h – HSC Firmware – HSBP B
    Byte 2
    [7:4] – Channel number. Channel that event message was received over. 0h if the event message was received from the system interface, primary IPMB, or internally generated by the BMC.
    [3:2] – Reserved. Write as 00b.
    [1:0] – IPMB device LUN if byte 1 holds Slave Address. 00b otherwise.
10 | EvM Rev (ER) | Event Message format version. 04h = IPMI v2.0; 03h = IPMI v1.0
11 | Sensor Type (ST) | Sensor Type Code for sensor that generated the event
12 | Sensor # (SN) | Number of sensor that generated the event (From SDR)
13 | Event Dir / Event Type (EDIR) | Event Dir
    [7] – 0b = Assertion event.
          1b = Deassertion event.
    Event Type
    Type of trigger for the event, for example, critical threshold going high, state asserted, and so on. Also indicates class of the event. For example, discrete, threshold, or OEM. The Event Type field is encoded using the Event/Reading Type Code.
    [6:0] – Event Type Codes
    01h = Threshold (States = 0x00-0x0b)
    02h-0Ch = Discrete
    6Fh = Sensor-Specific
    70h-7Fh = OEM
14 | Event Data 1 (ED1) | Per Table 2
15 | Event Data 2 (ED2) | Per Table 2
16 | Event Data 3 (ED3) | Per Table 2
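The byte layout in Table 1 can be illustrated with a short Python sketch. This is only a minimal example, not part of any Intel utility: the function name `decode_sel_record` and the sample record are hypothetical, and the sample is built around the timestamp example from Table 1. It assumes a standard system event record (Record Type 02h) with multi-byte fields stored LS byte first.

```python
import struct
from datetime import datetime, timezone

# Generator ID values listed in Table 1.
GENERATOR_NAMES = {
    0x0001: "BIOS POST",
    0x0033: "BIOS SMI Handler",
    0x0020: "BMC Firmware",
    0x002C: "ME Firmware",
    0x0041: "Server Management Software",
    0x00C0: "HSC Firmware - HSBP A",
    0x00C2: "HSC Firmware - HSBP B",
}


def decode_sel_record(raw: bytes) -> dict:
    """Split a 16-byte system event record (Record Type 02h) into named fields."""
    if len(raw) != 16:
        raise ValueError("a SEL record is exactly 16 bytes")
    # "<" selects little-endian (LS byte first) with no padding:
    # H = 2-byte field, B = 1-byte field, I = 4-byte field.
    (rid, rt, ts, gid, er, st, sn, edir, ed1, ed2, ed3) = struct.unpack(
        "<HBIHBBBBBBB", raw
    )
    if rt != 0x02:
        raise ValueError("only Record Type 02h is handled in this sketch")
    return {
        "record_id": rid,
        "timestamp": datetime.fromtimestamp(ts, tz=timezone.utc),
        "generator": GENERATOR_NAMES.get(gid, f"unknown ({gid:04X}h)"),
        "evm_rev": er,
        "sensor_type": st,
        "sensor_number": sn,
        "deassertion": bool(edir & 0x80),  # EDIR bit [7]
        "event_type": edir & 0x7F,         # EDIR bits [6:0]
        "event_data": (ed1, ed2, ed3),
    }


# Hypothetical record built around the Table 1 timestamp example.
sample = bytes.fromhex("0100" "02" "2976684C" "2000" "04" "12" "83" "6F" "05" "00" "FF")
record = decode_sel_record(sample)
print(record["timestamp"].isoformat())  # 2010-08-15T23:20:09+00:00
```

OEM record types (C0h-FFh) use a different layout for bytes 4 through 16, so the sketch rejects them rather than guessing at their contents.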
Table 2: Event Request Message Event Data Field Contents
Sensor Class | Event Data
Threshold | Event Data 1
    [7:6] – 00b = Unspecified Event Data 2
            01b = Trigger reading in Event Data 2
            10b = OEM code in Event Data 2
            11b = Sensor-specific event extension code in Event Data 2
    [5:4] – 00b = Unspecified Event Data 3
            01b = Trigger threshold value in Event Data 3
            10b = OEM code in Event Data 3
            11b = Sensor-specific event extension code in Event Data 3
    [3:0] – Offset from Event/Reading Code for threshold event.
    Event Data 2 – Reading that triggered event, FFh or not present if unspecified.
    Event Data 3 – Threshold value that triggered event, FFh or not present if unspecified. If present, Event Data 2 must be present.
Discrete | Event Data 1
    [7:6] – 00b = Unspecified Event Data 2
            01b = Previous state and/or severity in Event Data 2
            10b = OEM code in Event Data 2
            11b = Sensor-specific event extension code in Event Data 2
    [5:4] – 00b = Unspecified Event Data 3
            01b = Reserved
            10b = OEM code in Event Data 3
            11b = Sensor-specific event extension code in Event Data 3
    [3:0] – Offset from Event/Reading Code for discrete event state
    Event Data 2
    [7:4] – Optional offset from “Severity” Event/Reading Code (0Fh if unspecified).
    [3:0] – Optional offset from Event/Reading Type Code for previous discrete event state (0Fh if unspecified).
    Event Data 3 – Optional OEM code. FFh or not present if unspecified.
OEM | Event Data 1
    [7:6] – 00b = Unspecified Event Data 2
            01b = Previous state and/or severity in Event Data 2
            10b = OEM code in Event Data 2
            11b = Reserved
    [5:4] – 00b = Unspecified Event Data 3
            01b = Reserved
            10b = OEM code in Event Data 3
            11b = Reserved
    [3:0] – Offset from Event/Reading Type Code
    Event Data 2
    [7:4] – Optional OEM code bits or offset from “Severity” Event/Reading Type Code (0Fh if unspecified).
    [3:0] – Optional OEM code or offset from Event/Reading Type Code for previous event state (0Fh if unspecified).
    Event Data 3 – Optional OEM code. FFh or not present if unspecified.
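The Event Data 1 bit fields in Table 2 can be unpacked as in the following Python sketch. The names `sensor_class` and `decode_event_data1` are hypothetical. The sketch also assumes that the sensor-specific Event Type code (6Fh) belongs to the discrete class, which follows the IPMI Specification rather than anything stated in Table 2 itself.

```python
# What the selector bits of Event Data 1 say about Event Data 2 and 3,
# indexed by the 2-bit field value (00b, 01b, 10b, 11b), per Table 2.
ED2_CONTENTS = {
    "threshold": ["unspecified", "trigger reading", "OEM code",
                  "sensor-specific event extension code"],
    "discrete": ["unspecified", "previous state and/or severity", "OEM code",
                 "sensor-specific event extension code"],
    "oem": ["unspecified", "previous state and/or severity", "OEM code",
            "reserved"],
}
ED3_CONTENTS = {
    "threshold": ["unspecified", "trigger threshold value", "OEM code",
                  "sensor-specific event extension code"],
    "discrete": ["unspecified", "reserved", "OEM code",
                 "sensor-specific event extension code"],
    "oem": ["unspecified", "reserved", "OEM code", "reserved"],
}


def sensor_class(event_type: int) -> str:
    """Map the 7-bit Event Type code (EDIR bits [6:0]) to a Table 2 class."""
    if event_type == 0x01:
        return "threshold"
    # Assumption: sensor-specific (6Fh) is decoded like a discrete sensor.
    if 0x02 <= event_type <= 0x0C or event_type == 0x6F:
        return "discrete"
    if 0x70 <= event_type <= 0x7F:
        return "oem"
    raise ValueError(f"unhandled Event Type code {event_type:02X}h")


def decode_event_data1(ed1: int, event_type: int) -> dict:
    """Return what ED2 and ED3 hold, plus the event offset, for one record."""
    cls = sensor_class(event_type)
    return {
        "class": cls,
        "ed2_contents": ED2_CONTENTS[cls][(ed1 >> 6) & 0x03],  # bits [7:6]
        "ed3_contents": ED3_CONTENTS[cls][(ed1 >> 4) & 0x03],  # bits [5:4]
        "offset": ed1 & 0x0F,                                  # bits [3:0]
    }


print(decode_event_data1(0xA2, 0x6F))
```

For ED1 = A2h with Event Type 6Fh, the sketch reports OEM codes in both Event Data 2 and Event Data 3 and an event offset of 2h.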
Table 3: OEM SEL Record (Type C0h-DFh)
Byte | Field | Description
1-2 | Record ID (RID) | ID used for SEL Record access.
3 | Record Type (RT) | [7:0] – Record Type
    C0h-DFh = OEM timestamped, bytes 8-16 OEM defined
4-7 | Timestamp (TS) | Time when event was logged. LS byte first.
    Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
    Note: There are various websites that will convert the raw number to a date/time.
8-10 | Manufacturer ID | LS Byte first. The manufacturer ID is a 20-bit value that is derived from the IANA “Private Enterprise” ID.
    Most significant four bits = Reserved (0000b).
    000000h = Unspecified. 0FFFFFh = Reserved.
    This value is binary encoded.
    For example, the ID for the IPMI forum is 7154 decimal, which is 1BF2h, which will be stored in this record as F2h, 1Bh, 00h for bytes 8 through 10, respectively.
11-16 | OEM Defined | OEM Defined. This is defined according to the manufacturer identified by the Manufacturer ID field.
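The Manufacturer ID byte order described in Table 3 can be checked with a few lines of Python. The helper names below are hypothetical; the sketch reproduces only the IPMI forum example from the table.

```python
def decode_manufacturer_id(byte8: int, byte9: int, byte10: int) -> int:
    """Combine bytes 8-10 (LS byte first) into the 20-bit Manufacturer ID."""
    value = byte8 | (byte9 << 8) | (byte10 << 16)
    return value & 0x0FFFFF  # the most significant four bits are reserved


def encode_manufacturer_id(iana_id: int) -> bytes:
    """Return the three bytes as they would be stored in the record."""
    return (iana_id & 0x0FFFFF).to_bytes(3, "little")


print(decode_manufacturer_id(0xF2, 0x1B, 0x00))  # 7154
print(encode_manufacturer_id(7154).hex())        # f21b00
```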
Table 4: OEM SEL Record (Type E0h-FFh)
Byte | Field | Description
1-2 | Record ID (RID) | ID used for SEL Record access.
3 | Record Type (RT) | [7:0] – Record Type
    E0h-FFh = OEM system event record
4-16 | OEM | OEM Defined. This is defined by the system integrator.
2.2 Notes on SEL Logs and Collecting SEL Information
Whenever you capture the SEL, you should always collect both the text/human readable version and the hex version. Because
some of the data is OEM-specific, some utilities cannot decode the information correctly. In addition, with some OEM-specific data
there may be additional variables that are not decoded at all.
One example of information that is not fully decoded is the BIOS timestamp synchronization event. This event can be logged by the
BIOS during POST, or it can be logged by the BIOS SMI Handler when the operating system (OS) requests a shutdown or a restart.
See section 2.2.1 for examples. Most utilities report this as just a BIOS event and do not differentiate
between the two. The distinction is sometimes useful because it shows the sequence of events more clearly. For example, if there are
multiple sequences of timestamp synchronization events, was power lost after booting to the OS and the system then restarted, were
there multiple POST events, or was it a restart from the OS?
Other examples are the PCI Express* errors and some of the Power Supply events. For the PCI
Express* errors, the type of error and the PCI Bus, Device, and Function are all part of Event Data 1 through Event Data 3. See
section 2.2.2. For the Power Supply events, when there is a failure, predictive failure, or a configuration error, Event Data 2 and Event
Data 3 hold additional information that describes the power supply's PMBus* Command Registers and values for that particular
event. See section 2.2.3.
2.2.1 Examples of Decoding BIOS Timestamp Events
The following are some samples of BIOS timestamp events during POST and during an OS shutdown.
2.2.1.1 BIOS POST Timestamp Events
RID[19][01] RT[02] TS[57][49][6A][4E] GID[01][00] ER[04] ST[12] SN[83] EDIR[6F] ED1[05] ED2[00] ED3[FF]
RID (Record ID) = 0119h
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4E6A4957h
GID (Generator ID) = 0001h = BIOS POST
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
    [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 00h = First in pair
RID[1A][01] RT[02] TS[57][49][6A][4E] GID[01][00] ER[04] ST[12] SN[83] EDIR[6F] ED1[05] ED2[80] ED3[FF]
RID (Record ID) = 011Ah
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4E6A4957h
GID (Generator ID) = 0001h = BIOS POST
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
    [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 80h = Second in pair
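The bracketed hex notation used in these examples can be turned into numeric fields with a small parser. The following Python sketch is an illustration only. The function name `parse_bracketed_record` is hypothetical, and the sketch assumes every multi-byte field is written LS byte first, as in the records above.

```python
import re


def parse_bracketed_record(line: str) -> dict:
    """Parse text such as 'RID[19][01] RT[02] ...' into {field: integer}."""
    fields = {}
    # Each field is a label followed by one or more [hh] byte groups.
    for name, groups in re.findall(r"([A-Z0-9]+)((?:\[[0-9A-Fa-f]{2}\])+)", line):
        data = bytes.fromhex(groups.replace("[", "").replace("]", ""))
        # Multi-byte fields are listed LS byte first.
        fields[name] = int.from_bytes(data, "little")
    return fields


line = ("RID[19][01] RT[02] TS[57][49][6A][4E] GID[01][00] ER[04] "
        "ST[12] SN[83] EDIR[6F] ED1[05] ED2[00] ED3[FF]")
fields = parse_bracketed_record(line)
print(f"RID={fields['RID']:04X}h TS={fields['TS']:08X}h GID={fields['GID']:04X}h")
# RID=0119h TS=4E6A4957h GID=0001h
```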
2.2.1.2 BIOS SMI Handler Timestamp Events
RID[1F][00] RT[02] TS[C3][70][8D][4F] GID[33][00] ER[04] ST[12] SN[83] EDIR[6F] ED1[05] ED2[00] ED3[FF]
RID (Record ID) = 001Fh
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4F8D70C3h
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
    [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 00h = First in pair
RID[20][00] RT[02] TS[C4][70][8D][4F] GID[33][00] ER[04] ST[12] SN[83] EDIR[6F] ED1[05] ED2[80] ED3[FF]
RID (Record ID) = 0020h
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4F8D70C4h
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
    [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 80h = Second in pair
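Because many utilities report both sources simply as a BIOS event, it can help to label timestamp synchronization records by their Generator ID. The Python sketch below is one possible way to do that. The function name `describe_timestamp_sync` is hypothetical, and the sketch assumes that bit 7 of Event Data 2 distinguishes the first record of a pair (0) from the second (1), which matches the ED2 values 00h and 80h in the raw records above.

```python
# Generator IDs that log timestamp synchronization events (see Table 1).
SYNC_SOURCES = {0x0001: "BIOS POST", 0x0033: "BIOS SMI Handler"}


def describe_timestamp_sync(gid: int, sensor_type: int, ed1: int, ed2: int):
    """Label a Timestamp Clock Synchronization event, or return None."""
    # Sensor Type 12h (System Event) with event offset 05h.
    if sensor_type != 0x12 or (ed1 & 0x0F) != 0x05:
        return None
    source = SYNC_SOURCES.get(gid, f"generator {gid:04X}h")
    # Assumption: ED2 bit 7 is 0 for the first record and 1 for the second.
    position = "second in pair" if ed2 & 0x80 else "first in pair"
    return f"{source}: timestamp clock synchronization, {position}"


print(describe_timestamp_sync(0x0033, 0x12, 0x05, 0x00))
print(describe_timestamp_sync(0x0033, 0x12, 0x05, 0x80))
```

A run of pairs from BIOS POST with no SMI Handler pair between them would suggest repeated POST cycles, while an SMI Handler pair followed by a POST pair would suggest a restart requested from the OS.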
2.2.2 Example of Decoding a PCI Express* Correctable Error Event
The following is an example of decoding a PCI Express* correctable error event. This particular event records a receiver error
on Bus 0, Device 2, Function 2. Note that correctable errors are acceptable and normal at a low rate of occurrence.
RID[27][00] RT[02] TS[0A][9B][2E][50] GID[33][00] ER[04] ST[13] SN[05] EDIR[71] ED1[A0] ED2[00] ED3[12]
RID (Record ID) = 0027h
RT (Record Type) = 02h = system event record
TS (Timestamp) = 502E9B0Ah
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 13h = Critical Interrupt (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 05h
EDIR (Event Direction/Event Type) = 71h; [7] = 0 = Assertion Event
    [6:0] = 71h = OEM Specific for PCI Express* correctable errors
ED1 (Event Data 1) = A0h; [7:6] = 10b = OEM code in Event Data 2
    [5:4] = 10b = OEM code in Event Data 3
    [3:0] = Event Trigger Offset = 0h = Receiver Error
ED2 (Event Data 2) = 00h; PCI Bus number = 0
ED3 (Event Data 3) = 12h; [7:3] = PCI Device number = 02h
    [2:0] = PCI Function number = 2
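The bit split of Event Data 3 is easy to get wrong by hand, so a short Python sketch of the arithmetic may help. The function name `decode_pcie_location` is hypothetical, and the layout is the one shown in this example: bus number in Event Data 2, device number in ED3 bits [7:3], function number in ED3 bits [2:0].

```python
def decode_pcie_location(ed2: int, ed3: int) -> dict:
    """Extract the PCI bus, device, and function from Event Data 2 and 3."""
    return {
        "bus": ed2,                    # ED2 holds the bus number
        "device": (ed3 >> 3) & 0x1F,   # ED3 bits [7:3]
        "function": ed3 & 0x07,        # ED3 bits [2:0]
    }


loc = decode_pcie_location(0x00, 0x12)
print(f"{loc['bus']:02x}:{loc['device']:02x}.{loc['function']}")  # 00:02.2
```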
2.2.3 Example of Decoding a Power Supply Predictive Failure Event
The following is an example of decoding a Power Supply predictive failure event. In this example, power supply 1 saw an AC power
loss event with both the input under-voltage warning and fault events set. In most cases this means that the AC input dipped
below the minimum warning and fault thresholds for over 20 milliseconds but the system remained powered on. If these events
continue to occur, it is advisable to check your power source.
RID[5D][00] RT[02] TS[D3][B1][AE][4E] GID[20][00] ER[04] ST[08] SN[50] EDIR[6F] ED1[A2] ED2[06] ED3[30]
RID (Record ID) = 005Dh
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4EAEB1D3h
GID (Generator ID) = 0020h = BMC
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 08h = Power Supply (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 50h = Power Supply 1
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
    [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = A2h; [7:6] = 10b = OEM code in Event Data 2
    [5:4] = 10b = OEM code in Event Data 3
    [3:0] = Event Trigger Offset = 2h = Predictive Failure
ED2 (Event Data 2) = 06h = Input under-voltage warning
ED3 (Event Data 3) = 30h; From PMBus* Specification STATUS_INPUT command
    [5] = VIN_UV_WARNING (Input Under-voltage Warning) = 1
    [4] = VIN_UV_FAULT (Input Under-voltage Fault) = 1
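The STATUS_INPUT value carried in Event Data 3 can be expanded into bit names as in the following Python sketch. Only bits 5 and 4 are confirmed by the example above. The remaining bit names are an assumption taken from the PMBus* specification and should be verified against it before use. The function name `decode_status_input` is hypothetical.

```python
# Bit position -> flag name for the PMBus STATUS_INPUT byte.
# Bits 5 and 4 come from the example above; the others are assumed
# from the PMBus specification and should be verified.
STATUS_INPUT_BITS = {
    7: "VIN_OV_FAULT",
    6: "VIN_OV_WARNING",
    5: "VIN_UV_WARNING",
    4: "VIN_UV_FAULT",
    3: "UNIT_OFF_FOR_LOW_INPUT_VOLTAGE",
    2: "IIN_OC_FAULT",
    1: "IIN_OC_WARNING",
    0: "PIN_OP_WARNING",
}


def decode_status_input(value: int) -> list:
    """Return the names of all flags set in a STATUS_INPUT byte, MSB first."""
    return [name for bit, name in STATUS_INPUT_BITS.items() if value & (1 << bit)]


print(decode_status_input(0x30))  # ['VIN_UV_WARNING', 'VIN_UV_FAULT']
```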
3. Sensor Cross Reference List
This section contains a cross reference to help find details on any specific SEL entry.
3.1 BMC owned Sensors (GID = 0020h)
The following table can be used to find the details of sensors owned by the BMC.
Table 5: BMC owned Sensors
Sensor Number | Sensor Name | Details Section | Next Steps
01h | Power Unit Status (Pwr Unit Status) | Power Unit Status Sensor | Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
02h | Power Unit Redundancy (Pwr Unit Redund) | Power Unit Redundancy Sensor | Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps
03h | IPMI Watchdog (IPMI Watchdog) | IPMI Watchdog | Table 78: IPMI Watchdog Sensor Event Trigger Offset – Next Steps
04h | Physical Security (Physical Scrty) | Physical Security | Table 74: Physical Security Sensor Event Trigger Offset – Next Steps
05h | FP Interrupt (FP NMI Diag Int) | FP (NMI) Interrupt | FP (NMI) Interrupt – Next Steps
06h | SMI Timeout (SMI Timeout) | SMI Timeout | SMI Timeout – Next Steps
07h | System Event Log (System Event Log) | System Event Log Cleared | Not applicable
08h | System Event (System Event) | System Event – PEF Action | System Event – PEF Action – Next Steps
09h | Button Sensor (Button) | Button Sensor | Not applicable
0Ah | BMC Watchdog (BMC Watchdog) | BMC Watchdog Sensor | BMC Watchdog Sensor – Next Steps
0Bh | Voltage Regulator Watchdog (VR Watchdog) | Voltage Regulator Watchdog Timer Sensor | Voltage Regulator Watchdog Timer Sensor – Next Steps
0Ch | Fan Redundancy (Fan Redundancy) | Fan Presence and Redundancy Sensors | Table 34: Fan Redundancy Sensor – Event Trigger Offset – Next Steps
0Dh | SSB Thermal Trip (SSB Thermal Trip) | Discrete Thermal Sensors | Table 45: Discrete Thermal Sensors – Next Steps
0Eh | IO Module Presence (IO Mod Presence) | Add-In Module Presence Sensor | Add-In Module Presence – Next Steps
0Fh | SAS Module Presence (SAS Mod Presence) | Add-In Module Presence Sensor | Add-In Module Presence – Next Steps
10h | BMC Firmware Health (BMC FW Health) | BMC FW Health Sensor | BMC FW Health Sensor – Next Steps
11h | System Airflow (System Airflow) | System Air Flow Monitoring Sensor | Not applicable
12h | Firmware Update Status (FW Update Status) | Firmware Update Status Sensor | Not applicable
13h | IO Module2 Presence (IO Mod2 Presence) | Add-In Module Presence Sensor | Add-In Module Presence – Next Steps
14h | Baseboard Temperature 5 (Platform Specific) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
15h | Baseboard Temperature 6 (Platform Specific) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
16h | IO Module2 Temperature (I/O Mod2 Temp) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
17h | PCI Riser 3 Temperature (PCI Riser 3 Temp) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
18h | PCI Riser 4 Temperature (PCI Riser 4 Temp) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
19h | Baseboard +1.05V Processor3 Vccp (BB +1.05Vccp P3) | Threshold-based Voltage Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps
1Ah | Baseboard +1.05V Processor4 Vccp (BB +1.05Vccp P4) | Threshold-based Voltage Sensors | Table 13: Threshold-based Voltage Sensors – Next Steps
20h | Baseboard Temperature 1 (Platform Specific) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
21h | Front Panel Temperature (Front Panel Temp) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
22h | SSB Temperature (SSB Temp) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
23h | Baseboard Temperature 2 (Platform Specific) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
24h | Baseboard Temperature 3 (Platform Specific) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
25h | Baseboard Temperature 4 (Platform Specific) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
26h | IO Module Temperature (I/O Mod Temp) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
27h | PCI Riser 1 Temperature (PCI Riser 1 Temp) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
28h | IO Riser Temperature (IO Riser Temp) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
29h-2Bh | Hot-Swap Back Plane 1-3 Temperature (HSBP 1-3 Temp) | HSC Backplane Temperature Sensor | Table 88: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps
2Ch | PCI Riser 2 Temperature (PCI Riser 2 Temp) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
2Dh | SAS Module Temperature (SAS Mod Temp) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
2Eh | Exit Air Temperature (Exit Air Temp) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
2Fh | Network Interface Controller Temperature (LAN NIC Temp) | Threshold-based Temperature Sensors | Table 37: Temperature Sensors – Next Steps
30h-3Fh | Fan Tachometer Sensors (Chassis specific sensor names) | Fan Tachometer Sensors | Table 30: Fan Tachometer Sensor – Event Trigger Offset – Next Steps
40h-4Fh | Fan Present Sensors (Fan x Present) | Fan Presence and Redundancy Sensors | Table 32: Fan Presence Sensors – Event Trigger Offset – Next Steps
50h | Power Supply 1 Status (PS1 Status) | Power Supply Status Sensors | Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
51h | Power Supply 2 Status (PS2 Status) | Power Supply Status Sensors | Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
54h | Power Supply 1 AC Power Input (PS1 Power In) | Power Supply Power In Sensors | Table 23: Power Supply Power In Sensor – Event Trigger Offset – Next Steps
55h | Power Supply 2 AC Power Input (PS2 Power In) | Power Supply Power In Sensors | Table 23: Power Supply Power In Sensor – Event Trigger Offset – Next Steps
58h | Power Supply 1 +12V % of Maximum Current Output (PS1 Curr Out %) | Power Supply Current Out % Sensors | Table 25: Power Supply Current Out % Sensor – Event Trigger Offset – Next Steps
59h | Power Supply 2 +12V % of Maximum Current Output (PS2 Curr Out %) | Power Supply Current Out % Sensors | Table 25: Power Supply Current Out % Sensor – Event Trigger Offset – Next Steps
5Ch | Power Supply 1 Temperature (PS1 Temperature) | Power Supply Temperature Sensors | Table 27: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
5Dh | Power Supply 2 Temperature (PS2 Temperature) | Power Supply Temperature Sensors | Table 27: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
60h-68h | Hard Disk Drive 15-23 Status (HDD 15-23 Status) | Hard Disk Drive Monitoring Sensor | Table 90: Hard Disk Drive Monitoring Sensor – Event Trigger Offset – Next Steps
69h-6Bh | Hot-Swap Controller 1-3 Status (HSC1-3 Status) | Hot-Swap Controller Health Sensor | HSC Health Sensor – Next Steps
70h | Processor 1 Status (P1 Status) | Processor Status Sensor | Table 48: Processor Status Sensors – Next Steps
71h | Processor 2 Status (P2 Status) | Processor Status Sensor | Table 48: Processor Status Sensors – Next Steps
72h | Processor 3 Status (P3 Status) | Processor Status Sensor | Table 48: Processor Status Sensors – Next Steps
73h | Processor 4 Status (P4 Status) | Processor Status Sensor | Table 48: Processor Status Sensors – Next Steps
74h | Processor 1 Thermal Margin (P1 Therm Margin) | Thermal Margin Sensors | Table 40: Thermal Margin Sensors – Next Steps
75h | Processor 2 Thermal Margin (P2 Therm Margin) | Thermal Margin Sensors | Table 40: Thermal Margin Sensors – Next Steps
76h | Processor 3 Thermal Margin (P3 Therm Margin) | Thermal Margin Sensors | Table 40: Thermal Margin Sensors – Next Steps
77h | Processor 4 Thermal Margin (P4 Therm Margin) | Thermal Margin Sensors | Table 40: Thermal Margin Sensors – Next Steps
78h-7Bh | Processor 1-4 Thermal Control % (P1-P4 Therm Ctrl %) | Processor Thermal Control Sensors | Processor Thermal Control % Sensors – Next Steps
7Ch | Processor 1 ERR2 Timeout (P1 ERR2) | Processor ERR2 Timeout Sensor | Processor ERR2 Timeout – Next Steps
7Dh | Processor 2 ERR2 Timeout (P2 ERR2) | Processor ERR2 Timeout Sensor | Processor ERR2 Timeout – Next Steps
7Eh | Processor 3 ERR2 Timeout (P3 ERR2) | Processor ERR2 Timeout Sensor | Processor ERR2 Timeout – Next Steps
7Fh | Processor 4 ERR2 Timeout (P4 ERR2) | Processor ERR2 Timeout Sensor | Processor ERR2 Timeout – Next Steps
80h | Catastrophic Error (CATERR) | Catastrophic Error Sensor | Table 50: Catastrophic Error Sensor – Event Data 2 Values – Next Steps
81h | Processor 1 MSID Mismatch (P1 MSID Mismatch) | Processor MSID Mismatch Sensor | Processor MSID Mismatch Sensor – Next Steps
82h | Processor Population Fault (CPU Missing) | CPU Missing Sensor | CPU Missing Sensor – Next Steps
83h-86h | Processor 1-4 DTS Thermal Margin (P1-P4 DTS Therm Mgn) | Processor DTS Thermal Margin Sensors | Not applicable
87h | Processor 2 MSID Mismatch (P2 MSID Mismatch) | Processor MSID Mismatch Sensor | Processor MSID Mismatch Sensor – Next Steps
88h | Processor 3 MSID Mismatch (P3 MSID Mismatch) | Processor MSID Mismatch Sensor | Processor MSID Mismatch Sensor – Next Steps
89h | Processor 4 MSID Mismatch (P4 MSID Mismatch) | Processor MSID Mismatch Sensor | Processor MSID Mismatch Sensor – Next Steps
90h | Processor 1 VRD Temp (P1 VRD Hot) | Discrete Thermal Sensors | Table 45: Discrete Thermal Sensors – Next Steps
91h | Processor 2 VRD Temp (P2 VRD Hot) | Discrete Thermal Sensors | Table 45: Discrete Thermal Sensors – Next Steps
92h | Processor 3 VRD Temp (P3 VRD Hot) | Discrete Thermal Sensors | Table 45: Discrete Thermal Sensors – Next Steps
93h | Processor 4 VRD Temp (P4 VRD Hot) | Discrete Thermal Sensors | Table 45: Discrete Thermal Sensors – Next Steps
94h
Processor 1 Memory VRD Hot 0-1
(P1 Mem01 VRD Hot)
Discrete Thermal Sensors
Table 45: Discrete Thermal Sensors – Next Steps
95h
Processor 1 Memory VRD Hot 2-3
(P1 Mem23 VRD Hot)
Discrete Thermal Sensors
Table 45: Discrete Thermal Sensors – Next Steps
96h
Processor 2 Memory VRD Hot 0-1
(P2 Mem01 VRD Hot)
Discrete Thermal Sensors
Table 45: Discrete Thermal Sensors – Next Steps
97h
Processor 2 Memory VRD Hot 2-3
(P2 Mem23 VRD Hot)
Discrete Thermal Sensors
Table 45: Discrete Thermal Sensors – Next Steps
98h
Processor 3 Memory VRD Hot 0-1
(P3 Mem01 VRD Hot)
Discrete Thermal Sensors
Table 45: Discrete Thermal Sensors – Next Steps
99h
Processor 3 Memory VRD Hot 2-3
(P3 Mem23 VRD Hot)
Discrete Thermal Sensors
Table 45: Discrete Thermal Sensors – Next Steps
9Ah
Processor 4 Memory VRD Hot 0-1
(P4 Mem01 VRD Hot)
Discrete Thermal Sensors
Table 45: Discrete Thermal Sensors – Next Steps
9Bh
Processor 4 Memory VRD Hot 2-3
(P4 Mem23 VRD Hot)
Discrete Thermal Sensors
Table 45: Discrete Thermal Sensors – Next Steps
A0h
Power Supply 1 Fan Tachometer 1
(PS1 Fan Tach 1)
Power Supply Fan Tachometer
Sensors
Power Supply Fan Tachometer Sensors – Next Steps
A1h
Power Supply 1 Fan Tachometer 2
(PS1 Fan Tach 2)
Power Supply Fan Tachometer
Sensors
Power Supply Fan Tachometer Sensors – Next Steps
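A2h
Intel® Xeon Phi™ Coprocessor
Status 1
(MIC 1 Status)
Intel® Xeon Phi™ Coprocessor
(MIC) Status Sensors
Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors Next Steps
A3h
Intel® Xeon Phi™ Coprocessor
Status 2
(MIC 2 Status)
Intel® Xeon Phi™ Coprocessor
(MIC) Status Sensors
Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors Next Steps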
A4h
Power Supply 2 Fan Tachometer 1
(PS2 Fan Tach 1)
Power Supply Fan Tachometer
Sensors
Power Supply Fan Tachometer Sensors – Next Steps
A5h
Power Supply 2 Fan Tachometer 2
(PS2 Fan Tach 2)
Power Supply Fan Tachometer
Sensors
Power Supply Fan Tachometer Sensors – Next Steps
A6h
Intel® Xeon Phi™ Coprocessor
Status 3
(MIC 3 Status)
Intel® Xeon Phi™ Coprocessor
(MIC) Status Sensors
Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors Next Steps
A7h
Intel® Xeon Phi™ Coprocessor
Status 4
(MIC 4 Status)
Intel® Xeon Phi™ Coprocessor
(MIC) Status Sensors
Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors Next Steps
B0h
Processor 1 DIMM Aggregate
Thermal Margin 1
(P1 DIMM Thrm Mrgn1)
Thermal Margin Sensors
Table 40: Thermal Margin Sensors – Next Steps
B1h
Processor 1 DIMM Aggregate
Thermal Margin 2
(P1 DIMM Thrm Mrgn2)
Thermal Margin Sensors
Table 40: Thermal Margin Sensors – Next Steps
B2h
Processor 2 DIMM Aggregate
Thermal Margin 1
(P2 DIMM Thrm Mrgn1)
Thermal Margin Sensors
Table 40: Thermal Margin Sensors – Next Steps
B3h
Processor 2 DIMM Aggregate
Thermal Margin 2
(P2 DIMM Thrm Mrgn2)
Thermal Margin Sensors
Table 40: Thermal Margin Sensors – Next Steps
B4h
Processor 3 DIMM Aggregate
Thermal Margin 1
(P3 DIMM Thrm Mrgn1)
Thermal Margin Sensors
Table 40: Thermal Margin Sensors – Next Steps
B5h
Processor 3 DIMM Aggregate
Thermal Margin 2
(P3 DIMM Thrm Mrgn2)
Thermal Margin Sensors
Table 40: Thermal Margin Sensors – Next Steps
B6h
Processor 4 DIMM Aggregate
Thermal Margin 1
(P4 DIMM Thrm Mrgn1)
Thermal Margin Sensors
Table 40: Thermal Margin Sensors – Next Steps
B7h
Processor 4 DIMM Aggregate
Thermal Margin 2
(P4 DIMM Thrm Mrgn2)
Thermal Margin Sensors
Table 40: Thermal Margin Sensors – Next Steps
B8h
Node Auto-Shutdown Sensor
(Auto Shutdown)
Node Auto Shutdown Sensor
Node Auto Shutdown Sensor – Next Steps
BAh-BFh
Fan Tachometer Sensors
(Chassis specific sensor names)
Fan Tachometer Sensors
Table 30: Fan Tachometer Sensor – Event Trigger Offset – Next Steps
C0h-C3h
Processor 1-4 DIMM Thermal Trip
(P1-P4 Mem Thrm Trip)
DIMM Thermal Trip Sensors
DIMM Thermal Trip Sensors – Next Steps
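C4h
Intel® Xeon Phi™ Coprocessor
Thermal Margin 1
(MIC 1 Margin)
Intel® Xeon Phi™ Coprocessor
(MIC) Thermal Margin Sensors
Not applicable
C5h
Intel® Xeon Phi™ Coprocessor
Thermal Margin 2
(MIC 2 Margin)
Intel® Xeon Phi™ Coprocessor
(MIC) Thermal Margin Sensors
Not applicable
C6h
Intel® Xeon Phi™ Coprocessor
Thermal Margin 3
(MIC 3 Margin)
Intel® Xeon Phi™ Coprocessor
(MIC) Thermal Margin Sensors
Not applicable
C7h
Intel® Xeon Phi™ Coprocessor
Thermal Margin 4
(MIC 4 Margin)
Intel® Xeon Phi™ Coprocessor
(MIC) Thermal Margin Sensors
Not applicable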
C8h-CFh
Global Aggregate Temperature
Margin 1-8
(Agg Therm Mrgn 1-8)
Thermal Margin Sensors
Table 40: Thermal Margin Sensors – Next Steps
D0h
Baseboard +12V
(BB +12.0V)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
D1h
Baseboard +5V
(BB +5.0V)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
D2h
Baseboard +3.3V
(BB +3.3V)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
D3h
Baseboard +5V Stand-by
(BB +5.0V STBY)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
D4h
Baseboard +3.3V Auxiliary
(BB +3.3V AUX)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
D6h
Baseboard +1.05V Processor1
Vccp
(BB +1.05Vccp P1)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
D7h
Baseboard +1.05V Processor2
Vccp
(BB +1.05Vccp P2)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
D8h
Baseboard +1.5V P1 Memory AB
VDDQ
(BB +1.5 P1MEM AB)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
D9h
Baseboard +1.5V P1 Memory CD
VDDQ
(BB +1.5 P1MEM CD)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
DAh
Baseboard +1.5V P2 Memory AB
VDDQ
(BB +1.5 P2MEM AB)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
DBh
Baseboard +1.5V P2 Memory CD
VDDQ
(BB +1.5 P2MEM CD)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
DCh
Baseboard +1.8V Aux
(BB +1.8V AUX)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
DDh
Baseboard +1.1V Stand-by
(BB +1.1V STBY)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
DEh
Baseboard CMOS Battery
(BB +3.3V Vbat)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
E4h
Baseboard +1.35V P1 Low Voltage
Memory AB VDDQ
(BB +1.35 P1LV AB)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
E5h
Baseboard +1.35V P1 Low Voltage
Memory CD VDDQ
(BB +1.35 P1LV CD)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
E6h
Baseboard +1.35V P2 Low Voltage
Memory AB VDDQ
(BB +1.35 P2LV AB)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
E7h
Baseboard +1.35V P2 Low Voltage
Memory CD VDDQ
(BB +1.35 P2LV CD)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
EAh
Baseboard +3.3V Riser 1 Power
Good
(BB +3.3 RSR1 PGD)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
EBh
Baseboard +3.3V Riser 2 Power
Good
(BB +3.3 RSR2 PGD)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
ECh
Baseboard +0.9V
(BB 0.9V Core IB)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
EDh
Baseboard +1.8V
(BB 1.8V IB I/O)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
EEh
Baseboard +1.1V
(BB 1.1V PCH)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
EFh
Baseboard +1.2V
(BB +1.2V IB)
Threshold-based Voltage
Sensors
Table 13: Threshold-based Voltage Sensors – Next Steps
F0h-FEh
Hard Disk Drive 0-14 Status
(HDD 0-14 Status)
Hard Disk Drive Monitoring
Sensor
Table 90: Hard Disk Drive Monitoring Sensor – Event Trigger Offset – Next
Steps
3.2
BIOS POST owned Sensors (GID = 0001h)
The following table can be used to find the details of sensors owned by BIOS POST.
Table 6: BIOS POST owned Sensors
Sensor
Number
Sensor Name
Details Section
Next Steps
02h
Memory RAS Configuration Status
Memory RAS Configuration Status
Table 58: Memory RAS Configuration Status Sensor Typical
Characteristics
06h
POST Error
System Firmware Progress (Formerly Post
Error)
System Firmware Progress (Formerly Post Error) – Next Steps
09h
Intel® Quick Path Interface Link
Width Reduced
QPI Link Width Reduced Sensor
QPI Link Width Reduced Sensor – Next Steps
12h
Memory RAS Mode Select
Memory RAS Mode Select
Not applicable
83h
System Event
System Events
Not applicable
3.3
BIOS SMI Handler owned Sensors (GID = 0033h)
The following table can be used to find the details of sensors owned by BIOS SMI Handler.
Table 7: BIOS SMI Handler owned Sensors
Sensor
Number
Sensor Name
Details Section
Next Steps
01h
Mirroring Redundancy State
Mirroring Redundancy State
Mirroring Redundancy State Sensor – Next Steps
02h
Memory ECC Error
Memory Correctable and Uncorrectable
ECC Error
Table 64: Correctable and Uncorrectable ECC Error Sensor
Event Trigger Offset – Next Steps
03h
Legacy PCI Error
Legacy PCI Errors
Legacy PCI Error Sensor – Next Steps
04h
PCI Express* Fatal Error
PCI Express* Fatal Errors and Fatal Error #2
PCI Express* Fatal Error and Fatal Error #2 Sensor – Next
Steps
05h
PCI Express* Correctable Error
PCI Express* Correctable Errors
PCI Express* Correctable Error Sensor – Next Steps
06h
Intel® Quick Path Interface
Correctable Error
QPI Correctable Error Sensor
QPI Correctable Error Sensor – Next Steps
07h
Intel® Quick Path Interface Fatal Error
QPI Fatal Error and Fatal Error #2
QPI Fatal Error and Fatal Error #2 – Next Steps
11h
Sparing Redundancy State
Sparing Redundancy State
Sparing Redundancy State Sensor – Next Steps
13h
Memory Parity Error
Memory Address Parity Error
Memory Address Parity Error Sensor – Next Steps
14h
PCI Express* Fatal Error #2
(continuation of Sensor 04h)
PCI Express* Fatal Errors and Fatal Error #2
PCI Express* Fatal Error and Fatal Error #2 Sensor – Next
Steps
17h
Intel® Quick Path Interface Fatal Error
#2
(continuation of Sensor 07h)
QPI Fatal Error and Fatal Error #2
QPI Fatal Error and Fatal Error #2 – Next Steps
83h
System Event
System Events
Not applicable
3.4
Node Manager / ME Firmware owned Sensors (GID = 002Ch or 602Ch)
The following table can be used to find the details of sensors owned by the Node Manager / Management Engine (ME) firmware.
Table 8: Management Engine Firmware owned Sensors
Sensor
Number
Sensor Name
Details Section
Next Steps
17h
ME Firmware Health Events
ME Firmware Health Event
ME Firmware Health Event – Next Steps
18h
Node Manager Exception Events
Node Manager Exception Event
Node Manager Exception Event – Next Steps
19h
Node Manager Health Events
Node Manager Health Event
Node Manager Health Event – Next Steps
1Ah
Node Manager Operational Capabilities
Change Events
Node Manager Operational Capabilities
Change
Node Manager Operational Capabilities Change – Next
Steps
1Bh
Node Manager Alert Threshold
Exceeded Events
Node Manager Alert Threshold Exceeded
Node Manager Alert Threshold Exceeded – Next Steps
3.5
Microsoft* OS owned Events (GID = 0041)
The following table can be used to find the details of records that are owned by the Microsoft* Operating System (OS).
Table 9: Microsoft* OS owned Events
Sensor Name
Record Type
Sensor Type
Details Section
Next Steps
Boot Event
02h
1Fh = OS Boot
Table 98: Boot up Event Record Typical Characteristics
Not applicable
DCh
Not applicable
Table 99: Boot up OEM Event Record Typical Characteristics
Shutdown Event
02h
20h = OS Stop/Shutdown
Table 100: Shutdown Reason Code Event Record Typical Characteristics
Not applicable
DDh
Not applicable
Table 101: Shutdown Reason OEM Event Record Typical Characteristics
Table 102: Shutdown Comment OEM Event Record Typical Characteristics
Not applicable
Bug Check / Blue Screen
02h
20h = OS Stop/Shutdown
Table 103: Bug Check / Blue Screen – OS Stop Event Record Typical
Characteristics
Not applicable
DEh
Not applicable
Table 104: Bug Check / Blue Screen code OEM Event Record Typical
Characteristics
3.6
Linux* Kernel Panic Events (GID = 0021)
The following table can be used to find the details of records that can be generated when there is a Linux* Kernel panic.
Table 10: Linux* Kernel Panic Events
Sensor Name
Record Type
Sensor Type
Details Section
Next Steps
Linux* Kernel Panic
02h
20h = OS Stop/Shutdown
Table 105: Linux* Kernel Panic Event Record Characteristics
Not applicable
F0h
Not applicable
Table 106: Linux* Kernel Panic String Extended Record Characteristics
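The Generator IDs (GIDs) quoted in the headings of sections 3.2 through 3.6 determine which cross-reference table applies to a SEL record. The following Python sketch collects them into a lookup. It is illustrative only: the function and dictionary names are hypothetical, the values 0041 and 0021 are assumed to be hexadecimal like the other GIDs, and generator IDs described elsewhere in this guide are intentionally not listed.

```python
# Generator IDs transcribed from the headings of sections 3.2 to 3.6.
# Assumption: "0041" and "0021" are hexadecimal, like the other GIDs.
GENERATOR_OWNERS = {
    0x0001: "BIOS POST (Table 6)",
    0x0033: "BIOS SMI Handler (Table 7)",
    0x002C: "Node Manager / ME Firmware (Table 8)",
    0x602C: "Node Manager / ME Firmware (Table 8)",
    0x0041: "Microsoft* OS (Table 9)",
    0x0021: "Linux* Kernel Panic (Table 10)",
}


def owner_for_generator_id(gid: int) -> str:
    """Return the owner and cross-reference table for a generator ID.

    Hypothetical helper: IDs not covered by sections 3.2 to 3.6 are
    reported as unlisted rather than guessed.
    """
    return GENERATOR_OWNERS.get(gid, "Not listed in sections 3.2 to 3.6")
```

For example, `owner_for_generator_id(0x0033)` points to Table 7, so sensor number 02h in that record is the Memory ECC Error sensor and not the BIOS POST sensor with the same number.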
4.
Power Subsystems
The BMC monitors the power subsystem including power supplies, select onboard voltages, and related sensors.
4.1
Threshold-based Voltage Sensors
The BMC monitors the main voltage sources in the system, including the baseboard, memory, and processors, using IPMI-compliant
analog/threshold sensors. Some voltages are present only on specific platforms. For details, check your platform's Technical Product
Specification (TPS).
Note: A voltage error can be caused by the device supplying the voltage or by the device using the voltage. For each sensor, Table 13
notes which component supplies the voltage and which component uses it.
Table 11: Threshold-based Voltage Sensors Typical Characteristics
Byte
Field
Description
11
Sensor Type
02h = Voltage
12
Sensor Number
See Table 13
13
Event Direction and
Event Type
[7] Event direction
0b = Assertion Event
1b = Deassertion Event
[6:0] Event Type = 01h (Threshold)
14
Event Data 1
[7:6] – 01b = Trigger reading in Event Data 2
[5:4] – 01b = Trigger threshold in Event Data 3
[3:0] – Event Triggers as described in Table 12
15
Event Data 2
Reading that triggered event
16
Event Data 3
Threshold value that triggered event
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 12: Threshold-based Voltage Sensors Event Triggers – Description
Event Trigger (Hex)
Event Trigger (Description)
Assertion Severity
Deassertion Severity
Description
00h
Lower non-critical
going low
Degraded
OK
The voltage has dropped below its lower non-critical threshold.
02h
Lower critical
going low
non-fatal
Degraded
The voltage has dropped below its lower critical threshold.
07h
Upper non-critical
going high
Degraded
OK
The voltage has gone over its upper non-critical threshold.
09h
Upper critical
going high
non-fatal
Degraded
The voltage has gone over its upper critical threshold.
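The byte layout in Table 11 and the trigger offsets in Table 12 can be combined into a small decoder. The following Python sketch is illustrative only: the function name, argument names, and returned dictionary keys are hypothetical, and the reading and threshold are returned as raw byte values because converting them to volts requires sensor-specific conversion factors that are not covered in this section.

```python
# Trigger offsets and severities transcribed from Table 12.
# Tuple layout: (description, assertion severity, deassertion severity)
THRESHOLD_TRIGGERS = {
    0x00: ("Lower non-critical going low", "Degraded", "OK"),
    0x02: ("Lower critical going low", "non-fatal", "Degraded"),
    0x07: ("Upper non-critical going high", "Degraded", "OK"),
    0x09: ("Upper critical going high", "non-fatal", "Degraded"),
}


def decode_voltage_event(byte13: int, byte14: int, byte15: int, byte16: int) -> dict:
    """Decode bytes 13 to 16 of a threshold-based voltage SEL record.

    Hypothetical helper; byte numbering follows Table 11.
    """
    deassertion = bool(byte13 & 0x80)        # bit 7: event direction
    event_type = byte13 & 0x7F               # bits 6:0: event type
    if event_type != 0x01:
        raise ValueError("not a threshold event (Event Type != 01h)")

    offset = byte14 & 0x0F                   # bits 3:0: event trigger
    name, assert_sev, deassert_sev = THRESHOLD_TRIGGERS.get(
        offset, ("Unknown trigger", "Unknown", "Unknown"))

    return {
        "direction": "Deassertion" if deassertion else "Assertion",
        "trigger": name,
        "severity": deassert_sev if deassertion else assert_sev,
        # 01b in bits 7:6 means Event Data 2 holds the trigger reading
        "has_reading": ((byte14 >> 6) & 0x3) == 0x1,
        # 01b in bits 5:4 means Event Data 3 holds the trigger threshold
        "has_threshold": ((byte14 >> 4) & 0x3) == 0x1,
        "raw_reading": byte15,
        "raw_threshold": byte16,
    }
```

With Event Data 1 equal to 52h, bits [7:6] and [5:4] are both 01b and the trigger offset is 02h, so the record describes a lower critical threshold crossing with both the reading and the threshold present.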
Table 13: Threshold-based Voltage Sensors – Next Steps
Sensor
Number
Sensor Name
Next Steps
19h
Baseboard +1.05V Processor3 Vccp
(BB +1.05Vccp P3)
This 1.05V line is supplied by the main board.
This 1.05V line is used by processor 3.
1. Ensure all cables are connected correctly.
2. Check that the processor is seated properly.
3. Cross test the processors. If the issue remains with the processor socket, replace the main board;
otherwise, replace the processor.
1Ah
Baseboard +1.05V Processor4 Vccp
(BB +1.05Vccp P4)
This 1.05V line is supplied by the main board.
This 1.05V line is used by processor 4.
1. Ensure all cables are connected correctly.
2. Check that the processor is seated properly.
3. Cross test the processors. If the issue remains with the processor socket, replace the main board;
otherwise, replace the processor.
D0h
Baseboard +12V
(BB +12.0V)
+12V is supplied by the power supplies.
+12V is used by SATA drives, fans, and PCI cards. In addition, it is used to generate various processor
voltages.
1. Ensure all cables are connected correctly.
2. Check connections on the fans and HDDs.
3. If the issue follows the component, swap it; otherwise, replace the board.
4. If the issue remains, replace the power supplies.
D1h
Baseboard +5V
(BB +5.0V)
+5.0V is supplied by the power supplies for pedestal systems, and supplied by the main board on rack-optimized systems.
+5.0V is used by the PCI slots.
1. Ensure all cables are connected correctly.
2. Reseat any PCI cards.
3. Try PCI cards in other PCI slots.
4. If the issue follows the card, swap it; otherwise, replace the main board.
5. If the issue remains, replace the power supplies.
D2h
Baseboard +3.3V
(BB +3.3V)
+3.3V is supplied by the power supplies for pedestal systems, and supplied by the main board on rack-optimized systems.
+3.3V is used by the PCIe and PCI-X slots.
1. Ensure all cables are connected correctly.
2. Reseat any PCI cards.
3. Try PCI cards in other PCI slots.
4. If the issue follows the card, swap it; otherwise, replace the main board.
5. If the issue remains, replace the power supplies.
D3h
Baseboard +5V Stand-by
(BB +5.0V STBY)
+5.0V STBY is supplied by the power supplies for pedestal systems, and supplied by the main board on
rack-optimized systems.
+5.0V STBY is used to generate other standby voltages.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
3. If the issue remains, replace the power supplies.
D4h
Baseboard +3.3V Auxiliary
(BB +3.3V AUX)
+3.3V AUX is supplied by the main board.
+3.3V AUX is used by the BMC, clock chips, PCI-E Slot, on-board NIC, Intel® C600 series Chipset, and
ICH.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
3. If the issue remains, replace the power supplies.
D6h
Baseboard +1.05V Processor1 Vccp
(BB +1.05Vccp P1)
This 1.05V line is supplied by the main board.
This 1.05V line is used by processor 1.
1. Ensure all cables are connected correctly.
2. Check that the processor is seated properly.
3. Cross test the processors. If the issue remains with the processor socket, replace the main board;
otherwise, replace the processor.
D7h
Baseboard +1.05V Processor2 Vccp
(BB +1.05Vccp P2)
This 1.05V line is supplied by the main board.
This 1.05V line is used by processor 2.
1. Ensure all cables are connected correctly.
2. Check that the processor is seated properly.
3. Cross test the processors. If the issue remains with the processor socket, replace the main board;
otherwise, replace the processor.
D8h
Baseboard +1.5V P1 Memory AB
VDDQ
(BB +1.5 P1MEM AB)
This 1.5V line is supplied by the main board.
This 1.5V line is used by processor 1 memory slots A and B.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main
board; otherwise, replace the DIMM.
D9h
Baseboard +1.5V P1 Memory CD
VDDQ
(BB +1.5 P1MEM CD)
This 1.5V line is supplied by the main board.
This 1.5V line is used by processor 1 memory slots C and D.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main
board; otherwise, replace the DIMM.
DAh
Baseboard +1.5V P2 Memory AB
VDDQ
(BB +1.5 P2MEM AB)
This 1.5V line is supplied by the main board.
This 1.5V line is used by processor 2 memory slots A and B.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main
board; otherwise, replace the DIMM.
DBh
Baseboard +1.5V P2 Memory CD
VDDQ
(BB +1.5 P2MEM CD)
This 1.5V line is supplied by the main board.
This 1.5V line is used by processor 2 memory slots C and D.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main
board; otherwise, replace the DIMM.
DCh
Baseboard +1.8V Aux
(BB +1.8V AUX)
+1.8V AUX is supplied by the main board.
+1.8V AUX is used by the BMC and on-board NIC.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
3. If the issue remains, replace the power supplies.
DDh
Baseboard +1.1V Stand-by
(BB +1.1V STBY)
+1.1V STBY is supplied by the main board.
+1.1V STBY is used by the Intel® C600 series Chipset.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
3. If the issue remains, replace the power supplies.
DEh
Baseboard CMOS Battery
(BB +3.3V Vbat)
+3.3V Vbat is supplied by the CMOS battery when power is off and by the main board when power is on.
+3.3V Vbat is used by the CMOS and related circuits.
1. Replace the CMOS battery. Any battery of type CR2032 can be used.
2. If the error remains (unlikely), replace the board.
E4h
Baseboard +1.35V P1 Low Voltage
Memory AB VDDQ
(BB +1.35 P1LV AB)
This 1.35V line is supplied by the main board.
This 1.35V line is used by processor 1 memory slots A and B.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main
board; otherwise, replace the DIMM.
E5h
Baseboard +1.35V P1 Low Voltage
Memory CD VDDQ
(BB +1.35 P1LV CD)
This 1.35V line is supplied by the main board.
This 1.35V line is used by processor 1 memory slots C and D.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main
board; otherwise, replace the DIMM.
E6h
Baseboard +1.35V P2 Low Voltage
Memory AB VDDQ
(BB +1.35 P2LV AB)
This 1.35V line is supplied by the main board.
This 1.35V line is used by processor 2 memory slots A and B.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main
board; otherwise, replace the DIMM.
E7h
Baseboard +1.35V P2 Low Voltage
Memory CD VDDQ
(BB +1.35 P2LV CD)
This 1.35V line is supplied by the main board.
This 1.35V line is used by processor 2 memory slots C and D.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main
board; otherwise, replace the DIMM.
EAh
Baseboard +3.3V Riser 1 Power Good
(BB +3.3 RSR1 PGD)
+3.3V Riser 1 Power Good is supplied by Riser 1 on specific platforms.
+3.3V Riser 1 Power Good is an indication of the +3.3V on Riser 1.
1. Ensure that the riser is seated correctly.
2. If the issue remains, replace the riser.
3. If the issue remains, replace the main board.
4. If the issue remains, replace the power supplies.
EBh
Baseboard +3.3V Riser 2 Power Good
(BB +3.3 RSR2 PGD)
+3.3V Riser 2 Power Good is supplied by Riser 2 on specific platforms.
+3.3V Riser 2 Power Good is an indication of the +3.3V on Riser 2.
1. Ensure that the riser is seated correctly.
2. If the issue remains, replace the riser.
3. If the issue remains, replace the main board.
4. If the issue remains, replace the power supplies.
ECh
Baseboard +0.9V
(BB 0.9V Core IB)
+0.9V Core IB is supplied by the main board on specific platforms.
+0.9V Core IB is used by the on-board Infiniband* controller on those specific platforms.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
3. If the issue remains, replace the power supplies.
EDh
Baseboard +1.8V
(BB 1.8V IB I/O)
+1.8V IB I/O is supplied by the main board on specific platforms.
+1.8V IB I/O is used by the on-board Infiniband* controller on those specific platforms.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
3. If the issue remains, replace the power supplies.
EEh
Baseboard +1.1V
(BB 1.1V PCH)
This 1.1V line is supplied by the main board.
This 1.1V line is used by the Intel® C600 series Chipset.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
EFh
Baseboard +1.2V
(BB +1.2V IB)
+1.2V is supplied by the main board on specific platforms.
+1.2V is used by the on-board Infiniband* controller on those specific platforms.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
3. If the issue remains, replace the power supplies.
4.2
Voltage Regulator Watchdog Timer Sensor
The BMC FW verifies that the power sequence for the board VR controllers completes when a DC power-on is initiated.
Failure to complete the sequence indicates a board problem, in which case the FW powers down the system.
The sequence is as follows:
• BMC FW monitors the PowerSupplyPowerGood signal for assertion, indicating a DC power-on has been initiated, and starts a
timer (VR Watchdog Timer). For PCSD Platforms Based on Intel® Xeon® Processor E5 4600/2600/2400/1600 Product Families,
this timeout is 500 ms.
• If the SystemPowerGood signal has not asserted by the time the VR Watchdog Timer expires, the FW powers down the system,
logs a SEL entry, and emits a beep code (1-5-1-2). This failure is termed a VR Watchdog Timeout.
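The timing rule in the sequence above can be sketched as a simple check. The following Python fragment is a model for illustration only, not the BMC firmware implementation: the function and parameter names are hypothetical, timestamps are assumed to be in milliseconds, and a SystemPowerGood assertion exactly at the 500 ms mark is assumed to count as in time.

```python
from typing import Optional

VR_WATCHDOG_TIMEOUT_MS = 500  # timeout stated in the text above


def vr_watchdog_expired(ps_power_good_ms: int,
                        system_power_good_ms: Optional[int],
                        timeout_ms: int = VR_WATCHDOG_TIMEOUT_MS) -> bool:
    """Return True if the VR Watchdog Timer would expire.

    ps_power_good_ms: time at which PowerSupplyPowerGood asserted
        (this is when the timer starts).
    system_power_good_ms: time at which SystemPowerGood asserted,
        or None if it never asserted.
    """
    if system_power_good_ms is None:
        return True  # SystemPowerGood never asserted
    # Assumption: assertion exactly at the timeout is treated as in time.
    return (system_power_good_ms - ps_power_good_ms) > timeout_ms
```

When the function returns True, the behavior described above applies: the FW powers down the system, logs a SEL entry, and emits beep code 1-5-1-2.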
Table 14: Voltage Regulator Watchdog Timer Sensor Typical Characteristics
Byte
Field
Description
11
Sensor Type
02h = Voltage
12
Sensor Number
0Bh
13
Event Direction and
Event Type
[7] Event direction
0b = Assertion Event
1b = Deassertion Event
[6:0] Event Type = 03h (“digital” Discrete)
14
Event Data 1
[7:6] – 00b = Unspecified Event Data 2
[5:4] – 00b = Unspecified Event Data 3
[3:0] – Event Trigger Offset = 1h = State Asserted
15
Event Data 2
Not used
16
Event Data 3
Not used
4.2.1
Voltage Regulator Watchdog Timer Sensor – Next Steps
1. Ensure that all the connectors from the power supply are well seated.
2. Cross test the baseboard. If the issue remains with the baseboard, replace the baseboard.
4.3
Power Unit
The power unit monitors the power state of the system and logs the state changes in the SEL.
4.3.1
Power Unit Status Sensor
The power unit status sensor monitors the power state of the system and logs state changes. Expected power events such as DC
ON/OFF are logged, as are unexpected events such as AC loss and power good loss.
Table 15: Power Unit Status Sensors Typical Characteristics
Byte
Field
Description
11
Sensor Type
09h = Power Unit
12
Sensor Number
01h
13
Event Direction and
Event Type
[7] Event direction
0b = Assertion Event
1b = Deassertion Event
[6:0] Event Type = 6Fh (Sensor Specific)
14
Event Data 1
[7:6] – 00b = Unspecified Event Data 2
[5:4] – 00b = Unspecified Event Data 3
[3:0] = Sensor Specific offset as described in Table 16
15
Event Data 2
Not used
16
Event Data 3
Not used
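The byte layout in Table 15 can be decoded in a few lines. The following Python sketch is illustrative only: the function and dictionary names are hypothetical, and the offset descriptions are transcribed from Table 16.

```python
# Sensor specific offsets transcribed from Table 16.
POWER_UNIT_OFFSETS = {
    0x00: "Power down",
    0x02: "240 VA power down",
    0x04: "AC Lost",
    0x05: "Soft Power Control Failure",
    0x06: "Power Unit Failure",
}


def decode_power_unit_status(byte13: int, byte14: int) -> dict:
    """Decode bytes 13 and 14 of a Power Unit Status SEL record.

    Hypothetical helper; byte numbering follows Table 15.
    """
    if (byte13 & 0x7F) != 0x6F:              # bits 6:0: event type
        raise ValueError("not a sensor-specific event (Event Type != 6Fh)")
    offset = byte14 & 0x0F                   # bits 3:0: sensor specific offset
    return {
        # bit 7 of byte 13: 0b = assertion, 1b = deassertion
        "direction": "Deassertion" if byte13 & 0x80 else "Assertion",
        "offset": offset,
        "description": POWER_UNIT_OFFSETS.get(offset, "Offset not listed in Table 16"),
    }
```

Event Data 2 and Event Data 3 are not used by this sensor, so the sketch ignores bytes 15 and 16.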
Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
Sensor Specific Offset (Hex)
Sensor Specific Offset (Description)
Description
Next Steps
00h
Power down
System is powered down.
Informational Event
02h
240 VA power down
240 VA power limit was exceeded and the
hardware forced a power down.
This could have been caused by many things.
1. If you recently added hardware, try removing it.
2. Remove/replace any add-in adapters.
3. Remove/replace the power supply.
4. Remove/replace the processors, DIMM, and/or hard drives.
5. Remove/replace the boards in the system.
04h
AC Lost
AC power was removed.
Informational Event
05h
Soft Power Control
Failure
Asserted if the system fails to power on
due to the following power control
sources:
• Chassis Control command
• PEF action
• BMC Watchdog Timer
• Power State Retention
This could be caused by the power supply subsystem or system
components.
1. Verify all power cables and adapters are connected properly (AC
cables as well as the cables between the PSU and system
components).
2. Cross test the PSU if possible.
3. Replace the power subsystem.
06h
Power Unit Failure
Power subsystem experienced a failure.
Asserted for one of the following
conditions:
• Unexpected de-assertion of
system POWER_GOOD signal.
• System fails to respond to any
power control source’s attempt to
power down the system.
• System fails to respond to any
hardware power control source’s
attempt to power on the system.
• Power Distribution Board (PDB)
failure is detected (applies only
to systems that have a PDB).
Indicates a power supply failed.
1. Remove and reapply AC power.
2. Verify all power cables and adapters are connected properly (AC
cables as well as the cables between the PSU and system
components).
3. Cross test the PSU if possible.
4. If the power supply still fails, replace it.
5. If the problem still exists, replace the baseboard.
4.3.2
Power Unit Redundancy Sensor
This sensor is enabled on systems that support redundant power supplies. A message is logged in the SEL when AC is applied to the
system or when power supply redundancy is lost.
Table 17: Power Unit Redundancy Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     09h = Power Unit
12    Sensor Number                   02h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 0Bh (Generic Discrete)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 18
15    Event Data 2                    Not used
16    Event Data 3                    Not used
Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps
00h – Fully redundant
    Description: System is fully operational.
    Next Steps: Informational Event
01h – Redundancy lost
02h – Redundancy degraded
03h – Non-redundant, sufficient from redundant
04h – Non-redundant, sufficient from insufficient
05h – Non-redundant, insufficient
06h – Non-redundant, degraded from fully redundant
07h – Redundant, degraded from non-redundant
    Description (01h through 07h): System is not running in redundant power supply mode.
    Next Steps (01h through 07h): This event is accompanied by specific power supply errors (AC lost, PSU failure, and so on). Troubleshoot these events accordingly.
4.3.3
Node Auto Shutdown Sensor
The BMC supports a Node Auto Shutdown sensor that logs a SEL event when a node undergoes an emergency shutdown, caused either by
loss of power supply redundancy or by PSU CLST throttling in response to an over-current warning condition. This sensor applies only to
multi-node systems.
The sensor is rearmed on power-on (AC or DC power-on transitions).
This sensor is used only to trigger SEL entries that indicate assertion or deassertion of a node or power auto shutdown.
Table 19: Node Auto Shutdown Sensor Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     09h = Power Unit
12    Sensor Number                   B8h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 03h (“digital” discrete)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset: 1h = State Asserted
15    Event Data 2                    Not used
16    Event Data 3                    Not used
4.3.3.1
Node Auto Shutdown Sensor – Next Steps
This event is accompanied by specific power supply errors (AC lost, PSU failure, and so on) or other system events. Troubleshoot
these events accordingly.
4.4
Power Supply
The BMC monitors the power supply subsystem.
4.4.1
Power Supply Status Sensors
These sensors report the status of the power supplies in the system. An event can be logged when AC is first applied to or removed from
the system, and when a failure, predictive failure, or configuration error occurs.
Table 20: Power Supply Status Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     08h = Power Supply
12    Sensor Number                   50h = Power Supply 1 Status
                                      51h = Power Supply 2 Status
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – ED2 data in Table 21
                                      [5:4] – ED3 data in Table 21
                                      [3:0] – Sensor Specific offset as described in Table 21
15    Event Data 2                    As described in Table 21
16    Event Data 3                    As described in Table 21
Table 21: Power Supply Status Sensor – Sensor Specific Offsets – Next Steps
00h – Presence
    Description: Power supply detected.
    ED2: 00b = Unspecified Event Data 2
    ED3: 00b = Unspecified Event Data 3
    Next Steps: Informational Event
01h – Failure
    Description: Power supply failed. Check the data in ED2 and ED3 for more details.
    ED2: 10b = OEM code in Event Data 2
        • 01h – Output voltage fault
        • 02h – Output power fault
        • 03h – Output over-current fault
        • 04h – Over-temperature fault
        • 05h – Fan fault
    ED3: 10b = OEM code in Event Data 3
        Will have the contents of the associated PMBus* Status register. For example, Data 3 will have the contents of the
        VOLTAGE_STATUS register at the time an Output Voltage fault was detected. Refer to the PMBus* Specification for details
        on specific register contents.
    Next Steps: Indicates a power supply failed.
        1. Remove and reapply AC.
        2. If the power supply still fails, replace it.
02h – Predictive Failure
    Description: Check the data in ED2 and ED3 for more details.
    ED2: 10b = OEM code in Event Data 2
        • 01h – Output voltage warning
        • 02h – Output power warning
        • 03h – Output over-current warning
        • 04h – Over-temperature warning
        • 05h – Fan warning
        • 06h – Input under-voltage warning
        • 07h – Input over-current warning
        • 08h – Input over-power warning
    ED3: 10b = OEM code in Event Data 3
        Will have the contents of the associated PMBus* Status register. For example, Data 3 will have the contents of the
        VOLTAGE_STATUS register at the time an Output Voltage warning was detected. Refer to the PMBus* Specification for
        details on specific register contents.
    Next Steps: Depends on the warning event.
        1. Replace the power supply.
        2. Verify proper airflow to the system.
        3. Verify the power source.
        4. Replace the system boards.
03h – AC lost
    Description: AC removed.
    ED2: 00b = Unspecified Event Data 2
    ED3: 00b = Unspecified Event Data 3
    Next Steps: Informational Event.
06h – Configuration error
    Description: Power supply configuration is not supported. Check the data in ED2 for more details.
    ED2: 10b = OEM code in Event Data 2
        • 01h – The BMC cannot access the PMBus* device on the PSU but its FRU device is responding.
        • 02h – The PMBUS*_REVISION command returns a version number that is not supported (only version 1.1 and 1.2 are supported).
        • 03h – The PMBus* device does not successfully respond to the PMBUS*_REVISION command.
        • 04h – The PSU is incompatible with one or more PSUs that are present in the system.
        • 05h – The PSU FW is operating in a degraded mode (likely due to a failed firmware update).
    ED3: 00b = Unspecified Event Data 3
    Next Steps: Indicates that at least one of the supplies is not correct for your system configuration.
        1. Remove the power supply and verify compatibility.
        2. If the power supply is compatible, it may be faulty. Replace it.
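As a rough illustration of how the offsets and ED2 codes in Table 21 combine, the following Python sketch turns Event Data 1 and Event Data 2 into a short description. The function name describe_ps_status and the shortened description strings are hypothetical; the codes themselves are taken from Table 21. Event Data 3 is not interpreted here because its meaning depends on the PMBus* status register involved.

```python
# Sensor-specific offsets from Table 21 (Event Data 1, bits 3:0).
OFFSET_NAMES = {
    0x00: "Presence",
    0x01: "Failure",
    0x02: "Predictive Failure",
    0x03: "AC lost",
    0x06: "Configuration error",
}

# Event Data 2 OEM codes from Table 21, keyed by offset.
ED2_CODES = {
    0x01: {1: "Output voltage fault", 2: "Output power fault",
           3: "Output over-current fault", 4: "Over-temperature fault",
           5: "Fan fault"},
    0x02: {1: "Output voltage warning", 2: "Output power warning",
           3: "Output over-current warning", 4: "Over-temperature warning",
           5: "Fan warning", 6: "Input under-voltage warning",
           7: "Input over-current warning", 8: "Input over-power warning"},
    0x06: {1: "PMBus device not accessible, FRU device responding",
           2: "Unsupported PMBus revision",
           3: "No response to PMBus revision command",
           4: "PSU incompatible with another installed PSU",
           5: "PSU firmware in degraded mode"},
}


def describe_ps_status(event_data1: int, event_data2: int) -> str:
    """Summarize a Power Supply Status event (hypothetical helper)."""
    offset = event_data1 & 0x0F
    name = OFFSET_NAMES.get(offset)
    if name is None:
        return f"Unknown offset {offset:02X}h"
    # ED2 is only meaningful when Event Data 1 bits 7:6 are 10b (OEM code).
    ed2_is_oem = ((event_data1 >> 6) & 0x03) == 0b10
    codes = ED2_CODES.get(offset)
    if codes is not None and ed2_is_oem:
        detail = codes.get(event_data2, f"unlisted ED2 code {event_data2:02X}h")
        return f"{name}: {detail}"
    return name


print(describe_ps_status(0xA1, 0x05))
```

For a failure event with Event Data 1 = A1h and Event Data 2 = 05h, the sketch reports a fan fault, which points to the power supply's own fan rather than to the system fans.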
4.4.2
Power Supply Power In Sensors
These sensors log an event when a power supply in the system exceeds its AC input power threshold.
Table 22: Power Supply Power In Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     0Bh = Other Units
12    Sensor Number                   54h = Power Supply 1 Power In
                                      55h = Power Supply 2 Power In
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 01h (Threshold)
14    Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                      [5:4] – 01b = Trigger threshold in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 23
15    Event Data 2                    Reading that triggered event
16    Event Data 3                    Threshold value that triggered event
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 23: Power Supply Power In Sensor – Event Trigger Offset – Next Steps
Hex  Event Trigger Description      Assertion Severity  Deassert Severity
07h  Upper non-critical going high  Degraded            OK
09h  Upper critical going high      non-fatal           Degraded
Description (both offsets): PMBus* feature to monitor power supply power consumption.
Next Steps (both offsets): If you see this event, the system is pulling too much power on the input for the PSU rating.
    1. Verify the power budget is within the specified range.
    2. Check http://www.intel.com/p/en_US/support/ for the power budget tool for your system.
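A threshold event such as the ones above can be decoded along the following lines. This is a minimal Python sketch under stated assumptions: describe_threshold_event is a hypothetical name, and only offsets 07h and 09h are covered because those are the ones listed for this sensor. Event Data 2 and Event Data 3 are returned as raw values; converting them to engineering units requires the conversion factors from the sensor's SDR, which the sketch does not attempt.

```python
# Offset -> (description, assertion severity, deassertion severity).
THRESHOLD_OFFSETS = {
    0x07: ("Upper non-critical going high", "Degraded", "OK"),
    0x09: ("Upper critical going high", "non-fatal", "Degraded"),
}


def describe_threshold_event(byte13: int, ed1: int, ed2: int, ed3: int) -> dict:
    """Decode bytes 13-16 of a threshold event (hypothetical helper)."""
    if (byte13 & 0x7F) != 0x01:
        raise ValueError("not a threshold event (event type must be 01h)")
    offset = ed1 & 0x0F
    entry = THRESHOLD_OFFSETS.get(offset)
    if entry is None:
        raise ValueError(f"offset {offset:02X}h is not listed for this sensor")
    name, assert_severity, deassert_severity = entry
    deassertion = bool(byte13 & 0x80)
    result = {
        "trigger": name,
        "direction": "deassertion" if deassertion else "assertion",
        "severity": deassert_severity if deassertion else assert_severity,
    }
    # Bits 7:6 = 01b: Event Data 2 holds the raw trigger reading.
    if ((ed1 >> 6) & 0x03) == 0b01:
        result["raw_reading"] = ed2
    # Bits 5:4 = 01b: Event Data 3 holds the raw trigger threshold.
    if ((ed1 >> 4) & 0x03) == 0b01:
        result["raw_threshold"] = ed3
    return result


print(describe_threshold_event(0x01, 0x57, 0xC8, 0xC0))
```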
4.4.3
Power Supply Current Out % Sensors
PMBus*-compliant power supplies may monitor the current output of the main 12 V rail and report the current usage as a
percentage of the maximum power output for that rail.
Table 24: Power Supply Current Out % Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     03h = Current
12    Sensor Number                   58h = Power Supply 1 Current Out %
                                      59h = Power Supply 2 Current Out %
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 01h (Threshold)
14    Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                      [5:4] – 01b = Trigger threshold in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 25
15    Event Data 2                    Reading that triggered event
16    Event Data 3                    Threshold value that triggered event
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 25: Power Supply Current Out % Sensor – Event Trigger Offset – Next Steps
Hex  Event Trigger Description      Assertion Severity  Deassert Severity
07h  Upper non-critical going high  Degraded            OK
09h  Upper critical going high      non-fatal           Degraded
Description (both offsets): PMBus* feature to monitor power supply power consumption.
Next Steps (both offsets): If you see this event, the system is using too much power on the output for the PSU rating.
    1. Verify the power budget is within the specified range.
    2. Check http://www.intel.com/p/en_US/support/ for the power budget tool for your system.
4.4.4
Power Supply Temperature Sensors
The BMC monitors one or two power supply temperature sensors for each installed PMBus*-compliant power supply.
Table 26: Power Supply Temperature Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     01h = Temperature
12    Sensor Number                   5Ch = Power Supply 1 Temperature
                                      5Dh = Power Supply 2 Temperature
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 01h (Threshold)
14    Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                      [5:4] – 01b = Trigger threshold in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 27
15    Event Data 2                    Reading that triggered event
16    Event Data 3                    Threshold value that triggered event
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 27: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
Hex  Event Trigger Description      Assertion Severity  Deassert Severity
07h  Upper non-critical going high  Degraded            OK
09h  Upper critical going high      non-fatal           Degraded
Description (both offsets): An upper non-critical or critical temperature threshold has been crossed.
Next Steps (both offsets):
    1. Check for clear and unobstructed airflow into and out of the chassis.
    2. Ensure the SDR is programmed and correct chassis has been selected.
    3. Ensure there are no fan failures.
    4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
4.4.5
Power Supply Fan Tachometer Sensors
The BMC polls each installed power supply using the PMBus* fan status commands to check for failure conditions for the power
supply fans.
Table 28: Power Supply Fan Tachometer Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     04h = Fan
12    Sensor Number                   A0h = Power Supply 1 Fan Tachometer 1
                                      A1h = Power Supply 1 Fan Tachometer 2
                                      A4h = Power Supply 2 Fan Tachometer 1
                                      A5h = Power Supply 2 Fan Tachometer 2
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 03h (“digital” Discrete)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = 1h = State Asserted
15    Event Data 2                    Not used
16    Event Data 3                    Not used
4.4.5.1
Power Supply Fan Tachometer Sensors – Next Steps
These events are generated only in systems with PMBus*-capable power supplies, normally when airflow to the power supply is
obstructed:
1. Remove and then reinstall the power supply to see whether something might have temporarily caused the fan failure.
2. Swap the power supply with another one to see whether the problem stays with the location or follows the power supply.
3. Replace the power supply depending on the outcome of steps 1 and 2.
4. Ensure the latest FRUSDR update has been run and the correct chassis is detected or selected.
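Swapping supplies in step 2 depends on knowing which power supply logged the event. The following Python sketch collects the sensor numbers given in sections 4.4.1 through 4.4.5 into one lookup. The name locate_power_supply is hypothetical, and the label "Power In" for sensors 54h and 55h is taken from the title of section 4.4.2.

```python
# Sensor number -> (power supply number, sensor function).
POWER_SUPPLY_SENSORS = {
    0x50: (1, "Status"),
    0x51: (2, "Status"),
    0x54: (1, "Power In"),
    0x55: (2, "Power In"),
    0x58: (1, "Current Out %"),
    0x59: (2, "Current Out %"),
    0x5C: (1, "Temperature"),
    0x5D: (2, "Temperature"),
    0xA0: (1, "Fan Tachometer 1"),
    0xA1: (1, "Fan Tachometer 2"),
    0xA4: (2, "Fan Tachometer 1"),
    0xA5: (2, "Fan Tachometer 2"),
}


def locate_power_supply(sensor_number: int):
    """Return (supply number, function), or None if not a PSU sensor."""
    return POWER_SUPPLY_SENSORS.get(sensor_number)


supply, function = locate_power_supply(0xA4)
print(f"Power Supply {supply}: {function}")
```

If the sensor number changes from A4h to A0h after the swap, the problem followed the power supply; if it stays at A4h, the location is suspect.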
5.
Cooling Subsystem
5.1
Fan Sensors
There are three types of fan sensors that can be present on Intel® Server Systems: speed, presence, and redundancy. The last two
are only present in the systems with hot-swap redundant fans.
5.1.1
Fan Tachometer Sensors
Fan tachometer sensors monitor the rpm signal on the relevant fan headers on the platform. Fan speed sensors are threshold-based
sensors. Usually they only have lower (critical) thresholds set, so that a SEL entry is only generated if the fan spins too slowly.
Table 29: Fan Tachometer Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     04h = Fan
12    Sensor Number                   30h-3Fh (Chassis specific)
                                      BAh-BFh (Chassis specific)
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 01h (Threshold)
14    Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                      [5:4] – 01b = Trigger threshold in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 30
15    Event Data 2                    Reading that triggered event
16    Event Data 3                    Threshold value that triggered event
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 30: Fan Tachometer Sensor – Event Trigger Offset – Next Steps
Hex  Event Trigger Description     Assertion Severity  Deassert Severity  Description
00h  Lower non-critical going low  Degraded            OK                 The fan speed has dropped below its lower non-critical threshold.
02h  Lower critical going low      non-fatal           Degraded           The fan speed has dropped below its lower critical threshold.
Next Steps (both offsets): A fan speed error on a new system build is typically caused not by the fan spinning too slowly, but by the
fan being connected to the wrong header (the BMC expects fans on certain headers for each chassis and will log this event if there is
no fan on that header).
    1. Refer to the Quick Start Guide or the Service Guide to identify the correct fan headers to use.
    2. Ensure the latest FRUSDR update has been run and the correct chassis is detected or selected.
    3. If you are sure this was done, the event may be a sign of impending fan failure (although this normally applies only if the
       system has been in use for a while). Replace the fan.
5.1.2
Fan Presence and Redundancy Sensors
Fan presence sensors are only implemented for hot-swap fans, and require an additional pin on the fan header. Fan redundancy is
an aggregate of the fan presence sensors and will warn when redundancy is lost. Typically the redundancy mode on Intel® servers is
an n+1 redundancy (if one fan fails there are still sufficient fans to cool the system, but it is no longer redundant) although other
modes are also possible.
Table 31: Fan Presence Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     04h = Fan
12    Sensor Number                   40h-4Fh (Chassis specific)
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 08h (Generic “digital” Discrete)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 32
15    Event Data 2                    Not used
16    Event Data 3                    Not used
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 32: Fan Presence Sensors – Event Trigger Offset – Next Steps
01h – Device Present
    Assertion Severity: OK
    Deassert Severity: Degraded
    Description (assertion): A fan was inserted. This event may also get logged when the BMC initializes when AC is applied.
    Next Steps (assertion): Informational only
    Description (deassertion): A fan was removed, or was not present at the expected location when the BMC initialized.
    Next Steps (deassertion): These events only get generated in the systems with hot-swappable fans, and normally only when a fan
    is physically inserted or removed. If fans were not physically removed:
        1. Use the Quick Start Guide to check whether the right fan headers were used.
        2. Swap the fans round to see whether the problem stays with the location or follows the fan.
        3. Replace the fan or fan wiring/housing depending on the outcome of step 2.
        4. Ensure the latest FRUSDR update has been run and the correct chassis is detected or selected.
Table 33: Fan Redundancy Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     04h = Fan
12    Sensor Number                   0Ch
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 0Bh (Generic Discrete)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 34
15    Event Data 2                    Not used
16    Event Data 3                    Not used
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 34: Fan Redundancy Sensor – Event Trigger Offset – Next Steps
00h – Fully redundant
01h – Redundancy lost
02h – Redundancy degraded
03h – Non-redundant, sufficient from redundant
04h – Non-redundant, sufficient from insufficient
05h – Non-redundant, insufficient
    Description: The system has lost fans and may no longer be able to cool itself adequately. Overheating may occur if this
    situation remains for a longer period of time.
06h – Non-redundant, degraded from fully redundant
    Description: The system has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the
    system properly cooled, but fan speeds will boost.
07h – Redundant, degraded from non-redundant
    Description: The system has lost one or more fans and is running in a degraded mode, but still is redundant. There are enough
    fans to keep the system properly cooled.
Description (other non-redundant states): The system has lost one or more fans and is running in non-redundant mode. There are
enough fans to keep the system properly cooled, but fan speeds will boost.
Next Steps: Fan redundancy loss indicates failure of one or more fans. Look for lower (non-) critical fan errors, or fan removal errors
in the SEL, to indicate which fan is causing the problem, and follow the troubleshooting steps for these event types.
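When scanning a SEL for fan redundancy events, it can help to separate the one state that signals a cooling risk from the states where cooling is still sufficient. The following Python sketch does this using the offsets from Table 34. The function names are hypothetical, and treating only offset 05h as a cooling risk is an interpretation of the table's descriptions rather than documented BMC behavior.

```python
# Event trigger offsets from Table 34.
REDUNDANCY_OFFSETS = {
    0x00: "Fully redundant",
    0x01: "Redundancy lost",
    0x02: "Redundancy degraded",
    0x03: "Non-redundant, sufficient from redundant",
    0x04: "Non-redundant, sufficient from insufficient",
    0x05: "Non-redundant, insufficient",
    0x06: "Non-redundant, degraded from fully redundant",
    0x07: "Redundant, degraded from non-redundant",
}


def redundancy_state(offset: int) -> str:
    """Return the Table 34 description for an event trigger offset."""
    try:
        return REDUNDANCY_OFFSETS[offset]
    except KeyError:
        raise ValueError(f"offset {offset:02X}h is not a redundancy offset") from None


def cooling_at_risk(offset: int) -> bool:
    """True only for 05h, where the remaining fans may not cool the system."""
    redundancy_state(offset)  # validates the offset
    return offset == 0x05


for offset in (0x01, 0x05):
    print(f"{offset:02X}h {redundancy_state(offset)} risk={cooling_at_risk(offset)}")
```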
5.2
Temperature Sensors
A variety of temperature sensors can be implemented on Intel® Server Systems. They are split into the following types, each
with its own events that can be logged:
• Threshold-based Temperature
• Thermal Margin
• Processor Thermal Control %
• Processor DTS Thermal Margin (Monitor only)
• Discrete Thermal
• DIMM Thermal Trip
5.2.1
Threshold-based Temperature Sensors
Threshold-based temperature sensors are sensors that report an actual temperature. These are linear, threshold-based sensors. In
most Intel® Server Systems, multiple sensors are defined: front panel temperature and baseboard temperature. There are also
multiple other sensors that can be defined and are platform-specific. Most of these sensors typically have upper and lower thresholds
set – upper to warn in case of an over-temperature situation, lower to warn against sensor failure (temperature sensors typically read
out 0 if they stop working).
Table 35: Temperature Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     01h = Temperature
12    Sensor Number                   See Table 37
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 01h (Threshold)
14    Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                      [5:4] – 01b = Trigger threshold in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 36
15    Event Data 2                    Reading that triggered event
16    Event Data 3                    Threshold value that triggered event
Table 36: Temperature Sensors Event Triggers – Description
Hex  Event Trigger Description      Assertion Severity  Deassert Severity  Description
00h  Lower non-critical going low   Degraded            OK                 The temperature has dropped below its lower non-critical threshold.
02h  Lower critical going low       non-fatal           Degraded           The temperature has dropped below its lower critical threshold.
07h  Upper non-critical going high  Degraded            OK                 The temperature has gone over its upper non-critical threshold.
09h  Upper critical going high      non-fatal           Degraded           The temperature has gone over its upper critical threshold.
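The severities in Table 36, together with the earlier note that lower thresholds guard against sensor failure and upper thresholds against over-temperature, can be combined into a first-pass triage hint. The following Python sketch is one way to express that. The name temperature_event_hint and the hint strings are hypothetical and not part of the BMC firmware.

```python
from typing import Tuple

# Offset -> (description, assertion severity, deassertion severity).
TEMPERATURE_TRIGGERS = {
    0x00: ("Lower non-critical going low", "Degraded", "OK"),
    0x02: ("Lower critical going low", "non-fatal", "Degraded"),
    0x07: ("Upper non-critical going high", "Degraded", "OK"),
    0x09: ("Upper critical going high", "non-fatal", "Degraded"),
}


def temperature_event_hint(offset: int, deassertion: bool = False) -> Tuple[str, str]:
    """Return (severity, triage hint) for a threshold temperature event."""
    if offset not in TEMPERATURE_TRIGGERS:
        raise ValueError(f"offset {offset:02X}h is not listed in Table 36")
    _, assert_severity, deassert_severity = TEMPERATURE_TRIGGERS[offset]
    severity = deassert_severity if deassertion else assert_severity
    if offset in (0x00, 0x02):
        # Lower thresholds mainly catch a sensor that has stopped working.
        hint = "check for sensor failure"
    else:
        hint = "check for over-temperature"
    return severity, hint


print(temperature_event_hint(0x02))
print(temperature_event_hint(0x07, deassertion=True))
```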
Table 37: Temperature Sensors – Next Steps
Sensor Number  Sensor Name
21h            Front Panel Temp
    Next Steps: If the front panel temperature reads zero, check:
        1. It is connected properly.
        2. The SDR has been programmed correctly for your chassis.
    If the front panel temperature is too high:
        1. Check the cooling of your server room.
14h            Baseboard Temperature 5
15h            Baseboard Temperature 6
16h            I/O Mod2 Temp
17h            PCI Riser 5 Temp
18h            PCI Riser 4 Temp
20h            Baseboard Temperature 1
22h            SSB Temperature
23h            Baseboard Temperature 2
24h            Baseboard Temperature 3
25h            Baseboard Temperature 4
26h            I/O Mod Temp
27h            PCI Riser 1 Temp
28h            IO Riser Temp
2Ch            PCI Riser 2 Temp
2Dh            SAS Mod Temp
2Eh            Exit Air Temp
2Fh            LAN NIC Temp
    Next Steps (14h through 2Fh, except 21h):
        1. Check for clear and unobstructed airflow into and out of the chassis.
        2. Ensure the SDR is programmed and correct chassis has been selected.
        3. Ensure there are no fan failures.
        4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
5.2.2
Thermal Margin Sensors
Margin sensors are also linear sensors, but they typically report a negative value. The value is not an actual temperature; it is an
offset from a critical temperature, and it represents the number of degrees below the critical temperature for the particular component.
The BMC supports DIMM aggregate temperature margin IPMI sensors. The temperature readings from the physical temperature
sensors on each DIMM (such as Temperature Sensor on DIMM, or TSOD) are aggregated into IPMI temperature margin sensors for
groupings of DIMM slots. The partitioning of these groupings is platform/SKU specific and generally corresponds to fan domains.
The BMC supports global aggregate temperature margin IPMI sensors. There may be as many unique global aggregate sensors as
there are fan domains. Each sensor aggregates the readings of multiple other IPMI temperature sensors supported by the BMC FW.
The mapping of child sensors into each global aggregate sensor is SDR-configurable. The primary usage for these sensors is to
trigger turning off fans when a lower threshold is reached.
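To make the negative-value convention concrete, the following Python sketch converts a raw one-byte reading into a signed margin. It rests on an assumption that is not stated in this guide: that the sensor's SDR declares the reading as two's complement with one count per degree. The actual format and scaling come from the SDR, and the function names are hypothetical.

```python
def raw_to_margin(raw: int) -> int:
    """Interpret a raw byte as a signed margin (assumes two's complement, 1 count = 1 degree)."""
    if not 0 <= raw <= 0xFF:
        raise ValueError("raw reading must fit in one byte")
    return raw - 0x100 if raw & 0x80 else raw


def degrees_below_critical(raw: int) -> int:
    """Express the margin as degrees below the critical temperature."""
    return -raw_to_margin(raw)


# Example: a raw reading of F6h is a margin of -10,
# that is, 10 degrees below the critical temperature.
print(raw_to_margin(0xF6), degrees_below_critical(0xF6))
```

A margin approaching zero means the component is nearing its critical temperature, which is why the upper thresholds in the following tables are the ones that generate events.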
Table 38: Thermal Margin Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     01h = Temperature
12    Sensor Number                   See Table 40
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 01h (Threshold)
14    Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                      [5:4] – 01b = Trigger threshold in Event Data 3
                                      [3:0] – Event Triggers as described in Table 39
15    Event Data 2                    Reading that triggered event
16    Event Data 3                    Threshold value that triggered event
Table 39: Thermal Margin Sensors Event Triggers – Description
Hex  Event Trigger Description      Assertion Severity  Deassert Severity  Description
07h  Upper non-critical going high  Degraded            OK                 The thermal margin has gone over its upper non-critical threshold.
09h  Upper critical going high      non-fatal           Degraded           The thermal margin has gone over its upper critical threshold.
Table 40: Thermal Margin Sensors – Next Steps
Sensor Number  Sensor Name
74h            P1 Therm Margin
75h            P2 Therm Margin
76h            P3 Therm Margin
77h            P4 Therm Margin
    Next Steps (74h through 77h): Not a logged SEL event. Sensor is used for thermal management of the processor.
B0h            P1 DIMM Thrm Mrgn1
B1h            P1 DIMM Thrm Mrgn2
B2h            P2 DIMM Thrm Mrgn1
B3h            P2 DIMM Thrm Mrgn2
B4h            P3 DIMM Thrm Mrgn1
B5h            P3 DIMM Thrm Mrgn2
B6h            P4 DIMM Thrm Mrgn1
B7h            P4 DIMM Thrm Mrgn2
C8h            Agg Therm Mrgn 1
C9h            Agg Therm Mrgn 2
CAh            Agg Therm Mrgn 3
CBh            Agg Therm Mrgn 4
CCh            Agg Therm Mrgn 5
CDh            Agg Therm Mrgn 6
CEh            Agg Therm Mrgn 7
CFh            Agg Therm Mrgn 8
    Next Steps (B0h through CFh):
        1. Check for clear and unobstructed airflow into and out of the chassis.
        2. Ensure the SDR is programmed and correct chassis has been selected.
        3. Ensure there are no fan failures.
        4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
5.2.3
Processor Thermal Control Sensors
The BMC FW monitors the percentage of time that a processor has been operationally constrained over a given time window
(nominally six seconds) due to internal thermal management algorithms engaging to reduce the temperature of the device. This
monitoring is instantiated as one IPMI analog/threshold sensor per processor package.
If this is not addressed, the processor will overheat and shut down the system to protect itself from damage.
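The reported value is simply the share of the time window during which the processor was thermally constrained. The following Python sketch shows the arithmetic. The BMC computes this value internally; the function name and the idea of feeding it one Boolean sample per second are assumptions made for illustration only.

```python
from typing import Sequence


def thermal_control_percent(throttled_samples: Sequence[bool]) -> float:
    """Percentage of samples in the window where the processor was throttled."""
    if not throttled_samples:
        raise ValueError("at least one sample is required")
    return 100.0 * sum(throttled_samples) / len(throttled_samples)


# Example: a six-sample window (nominally six seconds) in which the
# processor was constrained during three of the samples.
window = [True, False, True, False, True, False]
print(thermal_control_percent(window))
```

A sustained high percentage means the processor spends much of its time throttled, which is the condition the upper thresholds on this sensor are meant to catch.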
Table 41: Processor Thermal Control Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     01h = Temperature
12    Sensor Number                   78h = Processor 1 Thermal Control %
                                      79h = Processor 2 Thermal Control %
                                      7Ah = Processor 3 Thermal Control %
                                      7Bh = Processor 4 Thermal Control %
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 01h (Threshold)
14    Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                      [5:4] – 01b = Trigger threshold in Event Data 3
                                      [3:0] – Event Triggers as described in Table 42
15    Event Data 2                    Reading that triggered event
16    Event Data 3                    Threshold value that triggered event
Table 42: Processor Thermal Control Sensors Event Triggers – Description
Hex  Event Trigger Description      Assertion Severity  Deassert Severity  Description
07h  Upper non-critical going high  Degraded            OK                 The thermal margin has gone over its upper non-critical threshold.
09h  Upper critical going high      non-fatal           Degraded           The thermal margin has gone over its upper critical threshold.
5.2.3.1
Processor Thermal Control % Sensors – Next Steps
These events normally occur due to failures of the thermal solution:
1. Verify heatsink is properly attached and has thermal grease.
2. If the system has a heatsink fan, ensure the fan is spinning.
3. Check all system fans are operating properly.
4. Check that the air used to cool the system is within limits (typically 35°C).
5.2.4
Processor DTS Thermal Margin Sensors
Intel® Xeon® processor E5-4600/2600/2400/1600 v2 product families incorporate a DTS-based thermal spec. This allows
much more accurate control of the thermal solution and enables lower fan speeds and lower fan power consumption. For Intel®
Xeon® processor E5-4600/2600/2400/1600 product families, this requires significant BMC FW calculations to derive the sensor value.
Intel® Xeon® processor E5-4600/2600/2400/1600 v2 product families are the follow-on processors to Intel® Xeon® processor
E5-4600/2600/2400/1600 product families. For Intel® Xeon® processor E5-4600/2600/2400/1600 v2 product families, the BMC’s
derivation of this value is greatly simplified because the majority of the calculations are performed within the processor itself.
The main usage of this sensor is as an input to the BMC’s fan control algorithms. The BMC implements this as a threshold sensor.
There is one DTS sensor for each installed physical processor package. Thresholds are not set and alert generation is not enabled
for these sensors.
Table 43: Processor DTS Thermal Margin Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     01h = Temperature
12    Sensor Number                   83h = Processor 1 DTS Thermal Margin
                                      84h = Processor 2 DTS Thermal Margin
                                      85h = Processor 3 DTS Thermal Margin
                                      86h = Processor 4 DTS Thermal Margin
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 01h (Threshold)
5.2.5
Discrete Thermal Sensors
Discrete thermal sensors do not report a temperature at all; instead, they report an overheating event of some kind. Examples are
VRD Hot (a voltage regulator is overheating) and processor Thermal Trip (the processor got so hot that its over-temperature protection
was triggered and the system was shut down to prevent damage).
Table 44: Discrete Thermal Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     01h = Temperature
12    Sensor Number                   See Table 45
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = See Table 45
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 45
15    Event Data 2                    Not used
16    Event Data 3                    Not used
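The bit fields in bytes 13 and 14 follow the same layout in every "Typical Characteristics" table in this guide, so they can be split with a pair of small helpers. The following is a minimal Python sketch, not part of any Intel tool; the function names and the dictionary keys are illustrative assumptions.

```python
def decode_event_dir_type(byte13: int) -> dict:
    """Split SEL record byte 13 into event direction and event type."""
    return {
        # Bit [7]: 0b = Assertion Event, 1b = Deassertion Event
        "deassertion": bool(byte13 & 0x80),
        # Bits [6:0]: Event Type code (for example 01h = Threshold)
        "event_type": byte13 & 0x7F,
    }


def decode_event_data1(byte14: int) -> dict:
    """Split SEL record byte 14 (Event Data 1) into its three fields."""
    return {
        # Bits [7:6]: what Event Data 2 carries (00b = unspecified, 10b = OEM code)
        "ed2_code": (byte14 >> 6) & 0x3,
        # Bits [5:4]: what Event Data 3 carries (00b = unspecified, 10b = OEM code)
        "ed3_code": (byte14 >> 4) & 0x3,
        # Bits [3:0]: Event Trigger Offset
        "offset": byte14 & 0x0F,
    }
```

For example, a byte 13 value of 85h decodes as a deassertion of Event Type 05h, and an Event Data 1 value of 01h decodes as Event Trigger Offset 1h with Event Data 2 and 3 unspecified.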
Table 45: Discrete Thermal Sensors – Next Steps
Sensor  Sensor Name       Event  Event Trigger Offset  Description
Number                    Type   Hex  Description
0Dh     SSB Thermal Trip  03h    01h  State Asserted   South Side Bridge (SSB) overheated
90h     P1 VRD Hot        05h    01h  Limit Exceeded   Processor 1 voltage regulator overheated
91h     P2 VRD Hot        05h    01h  Limit Exceeded   Processor 2 voltage regulator overheated
92h     P3 VRD Hot        05h    01h  Limit Exceeded   Processor 3 voltage regulator overheated
93h     P4 VRD Hot        05h    01h  Limit Exceeded   Processor 4 voltage regulator overheated
94h     P1 Mem01 VRD Hot  05h    01h  Limit Exceeded   Processor 1 Memory 0/1 voltage regulator overheated
95h     P1 Mem23 VRD Hot  05h    01h  Limit Exceeded   Processor 1 Memory 2/3 voltage regulator overheated
96h     P2 Mem01 VRD Hot  05h    01h  Limit Exceeded   Processor 2 Memory 0/1 voltage regulator overheated
97h     P2 Mem23 VRD Hot  05h    01h  Limit Exceeded   Processor 2 Memory 2/3 voltage regulator overheated
98h     P3 Mem01 VRD Hot  05h    01h  Limit Exceeded   Processor 3 Memory 0/1 voltage regulator overheated
99h     P3 Mem23 VRD Hot  05h    01h  Limit Exceeded   Processor 3 Memory 2/3 voltage regulator overheated
9Ah     P4 Mem01 VRD Hot  05h    01h  Limit Exceeded   Processor 4 Memory 0/1 voltage regulator overheated
9Bh     P4 Mem23 VRD Hot  05h    01h  Limit Exceeded   Processor 4 Memory 2/3 voltage regulator overheated
Next Steps for all sensors in this table:
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used for cooling the system is within the thermal specifications for the system (typically below 35°C).
5.2.6
DIMM Thermal Trip Sensors
The BMC supports DIMM Thermal Trip monitoring that is instantiated as one aggregate IPMI discrete sensor per CPU. When a
DIMM Thermal Trip occurs, the system hardware will automatically power down the server and the BMC will assert the sensor offset
and log an event.
Table 46: DIMM Thermal Trip Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   C0h = Processor 1 DIMM Thermal Trip
                                      C1h = Processor 2 DIMM Thermal Trip
                                      C2h = Processor 3 DIMM Thermal Trip
                                      C3h = Processor 4 DIMM Thermal Trip
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset = 0Ah = Critical over temperature
15    Event Data 2                    Not used
16    Event Data 3                    [7:5] – Socket ID
                                          0-3 = CPU1-4
                                      [4:3] – Channel
                                          0-3 = Channel A, B, C, D for CPU1
                                                Channel E, F, G, H for CPU2
                                                Channel J, K, L, M for CPU3
                                                Channel N, P, R, T for CPU4
                                      [2:0] – DIMM
                                          0-2 = DIMM 1-3 on Channel
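The Event Data 3 layout above (Socket ID, Channel, DIMM) can be decoded mechanically when working from the hex version of the SEL. The sketch below is illustrative only: the function name and the returned tuple format are assumptions, while the bit positions and channel letters are taken from Table 46.

```python
# Channel letters per socket, as listed in Table 46 (index 0 = CPU1).
CHANNEL_LETTERS = ("ABCD", "EFGH", "JKLM", "NPRT")


def decode_dimm_location(event_data3: int) -> tuple:
    """Return (CPU name, channel letter, DIMM number) from Event Data 3."""
    socket = (event_data3 >> 5) & 0x7   # bits [7:5], 0-3 = CPU1-4
    channel = (event_data3 >> 3) & 0x3  # bits [4:3], 0-3 = channel index
    dimm = event_data3 & 0x7            # bits [2:0], 0-2 = DIMM 1-3
    if socket > 3 or dimm > 2:
        raise ValueError(f"Event Data 3 value {event_data3:02X}h is out of range")
    return (f"CPU{socket + 1}", CHANNEL_LETTERS[socket][channel], dimm + 1)
```

For example, an Event Data 3 value of 31h (binary 001 10 001) identifies CPU2, Channel G, DIMM 2.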
5.2.6.1
DIMM Thermal Trip Sensors – Next Steps
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
5.3
System Air Flow Monitoring Sensor
The BMC provides an IPMI sensor to report the volumetric system airflow in CFM (cubic feet per minute). The airflow in CFM is
calculated based on the system fan PWM values. The specific Pulse Width Modulation (PWM or PWMs) used to determine the CFM
is SDR-configurable. The relationship between PWM and CFM is based on a lookup table in an OEM SDR.
The airflow data is used in the calculation for exit air temperature monitoring. It is exposed as an IPMI sensor to allow a data center
management application to access this data for use in rack-level thermal management.
This sensor is informational only and will not log events into the SEL.
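The PWM-to-CFM relationship described above can be pictured as a table lookup. The sketch below is a hypothetical illustration only: the table values are invented, the real table lives in an OEM SDR whose format is not described here, and the linear interpolation between table points is an assumption rather than documented BMC behavior.

```python
from bisect import bisect_right

# Hypothetical (PWM duty cycle %, airflow in CFM) pairs; real values come from the OEM SDR.
HYPOTHETICAL_PWM_TO_CFM = [(0, 0.0), (20, 15.0), (50, 40.0), (100, 90.0)]


def airflow_cfm(pwm_percent: float, table=HYPOTHETICAL_PWM_TO_CFM) -> float:
    """Estimate airflow in CFM for a PWM duty cycle by interpolating the table."""
    if not 0 <= pwm_percent <= 100:
        raise ValueError("PWM duty cycle must be between 0 and 100 percent")
    pwm_points = [pwm for pwm, _ in table]
    index = bisect_right(pwm_points, pwm_percent)
    if index == len(table):
        # At the top of the table there is nothing to interpolate toward.
        return table[-1][1]
    (pwm_low, cfm_low), (pwm_high, cfm_high) = table[index - 1], table[index]
    fraction = (pwm_percent - pwm_low) / (pwm_high - pwm_low)
    return cfm_low + (cfm_high - cfm_low) * fraction
```

With the invented table above, a 35% duty cycle falls between the 20% and 50% entries and yields 27.5 CFM.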
6.
Processor Subsystem
Intel® servers report multiple processor-centric sensors in the SEL.
6.1
Processor Status Sensor
The BMC provides an IPMI sensor of type processor for monitoring status information for each processor slot. If an event state
(sensor offset) has been asserted, it remains asserted until one of the following happens:
• A rearm Sensor Events command is executed for the processor status sensor.
• AC or DC power cycle, system reset, or system boot occurs.
CPU Presence status is not saved across AC power cycles and therefore will not generate a deassertion after cycling AC power.
Table 47: Processor Status Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     07h = Processor
12    Sensor Number                   70h = Processor 1 Status
                                      71h = Processor 2 Status
                                      72h = Processor 3 Status
                                      73h = Processor 4 Status
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 48
15    Event Data 2                    Not used
16    Event Data 3                    Not used
Table 48: Processor Status Sensors – Next Steps
Event Trigger  Processor Status
Offset
0h             Internal error (IERR)
               Next Steps:
               1. Cross test the processors.
               2. Replace the processors depending on the results of the test.
1h             Thermal trip
               Next Steps: This event normally only happens due to failures of the thermal solution:
               1. Verify heatsink is properly attached and has thermal grease.
               2. If the system has a heatsink fan, ensure the fan is spinning.
               3. Check all system fans are operating properly.
               4. Check that the air used to cool the system is within limits (typically 35°C).
2h             FRB1/BIST failure
3h             FRB2/Hang in POST failure
4h             FRB3/Processor startup/initialization failure (CPU fails to start)
5h             Configuration error (for DMI)
6h             SM BIOS uncorrectable CPU-complex error
               Next Steps (offsets 2h through 6h):
               1. Cross test the processors.
               2. Replace the processors depending on the results of the test.
7h             Processor presence detected
               Next Steps: Informational Event
8h             Processor disabled
9h             Terminator presence detected
               Next Steps (offsets 8h and 9h):
               1. Cross test the processors.
               2. Replace the processors depending on the results of the test.
6.2
Catastrophic Error Sensor
When the Catastrophic Error signal (CATERR#) stays asserted, it is a sign that something serious has gone wrong in the hardware.
The BMC monitors this signal and reports when it stays asserted.
Table 49: Catastrophic Error Sensor Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     07h = Processor
12    Sensor Number                   80h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 03h (Digital Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset = 1h (State Asserted)
15    Event Data 2                    Event Data 2 values as described in Table 50.
16    Event Data 3                    Bitmap of the CPU that causes the system CATERR.
                                          [0]: CPU1
                                          [1]: CPU2
                                          [2]: CPU3
                                          [3]: CPU4
                                      Note: If more than one bit is set, the BMC cannot
                                      determine the source of the CATERR.
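The Event Data 3 bitmap can be read as follows. This is a minimal Python sketch with an assumed function name; it applies the rule from the note above that the source is only identifiable when exactly one bit is set.

```python
def decode_caterr_cpus(event_data3: int) -> tuple:
    """Return (list of flagged CPUs, whether a single source is identified)."""
    # Bits [3:0] map to CPU1 through CPU4; the upper bits are ignored here.
    cpus = [f"CPU{bit + 1}" for bit in range(4) if event_data3 & (1 << bit)]
    # Only a single set bit identifies the source of the CATERR.
    source_identified = len(cpus) == 1
    return cpus, source_identified
```

For example, 02h flags CPU2 as the source, while 05h flags CPU1 and CPU3 and therefore does not identify a single source.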
Table 50: Catastrophic Error Sensor – Event Data 2 Values – Next Steps
ED2  Description     Next Steps
00h  Unknown         1. Cross test the processors.
                     2. Replace the processors depending on the results of the test.
01h  CATERR          This error is typically caused by other platform components.
                     1. Check for other errors near the time of the CATERR event.
                     2. Verify all peripherals are plugged in and operating correctly, particularly Hard Drives, Optical Drives, and I/O.
                     3. Update system firmware and drivers.
02h  CPU Core Error  1. Cross test the processors.
                     2. Replace the processors depending on the results of the test.
03h  MSID Mismatch   Verify the processor is supported by your baseboard. Check your board's Technical Product Specification (TPS).
6.3
CPU Missing Sensor
The CPU Missing sensor is a discrete sensor that reports that a processor is not installed. This event most commonly occurs when
a processor is populated in the incorrect socket.
Table 51: CPU Missing Sensor Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     07h = Processor
12    Sensor Number                   82h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = 1h (State Asserted)
15    Event Data 2                    Not used
16    Event Data 3                    Not used
6.3.1
CPU Missing Sensor – Next Steps
Verify the processor is installed in the correct slot.
6.4
Quick Path Interconnect Sensors
The Intel® Quick Path Interconnect (QPI) bus on Intel® PCSD Boards Based on Intel® Xeon® Processor E5-4600/2600/2400/1600/1400
Product Families is the interconnect between processors.
The QPI Link Width Reduced sensor is used by the BIOS POST to report when the link width has been reduced. Therefore the
Generator ID will be 01h.
The QPI Error sensors are reported by the BIOS SMI Handler to the BMC so the Generator ID will be 33h.
6.4.1
QPI Link Width Reduced Sensor
BIOS POST has reduced the QPI Link Width because of an error condition seen during initialization.
Table 52: QPI Link Width Reduced Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0001h = BIOS POST
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   09h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 77h (OEM Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset
                                          1h = Reduced to ½ width
                                          2h = Reduced to ¼ width
15    Event Data 2                    0-3 = CPU1-4
16    Event Data 3                    Not used
6.4.1.1
QPI Link Width Reduced Sensor – Next Steps
If the error continues:
1. Check the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross test the processor. If the issue remains with the processor socket, replace the main board, otherwise the processor.
6.4.2
QPI Correctable Error Sensor
The system detected an error and corrected it. This is an informational event.
Table 53: QPI Correctable Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   06h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 72h (OEM Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = Reserved
15    Event Data 2                    0-3 = CPU1-4
16    Event Data 3                    Not used
6.4.2.1
QPI Correctable Error Sensor – Next Steps
This is an Informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:
1. Check the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross test the processor. If the issue remains with the processor socket, replace the main board, otherwise the processor.
6.4.3
QPI Fatal Error and Fatal Error #2
The system detected a QPI fatal or non-recoverable error. This is a fatal error.
Table 54: QPI Fatal Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   07h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 73h (OEM Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset
                                          0h = Link Layer Uncorrectable ECC Error
                                          1h = Protocol Layer Poisoned Packet Reception Error
                                          2h = Link/PHY Init Failure with resultant degradation in link width
                                          3h = PHY Layer detected drift buffer alarm
                                          4h = PHY detected latency buffer rollover
                                          5h = PHY Init Failure
                                          6h = Link Layer generic control error (buffer overflow/underflow, credit underflow, and so on)
                                          7h = Parity error in link or PHY layer
                                          8h = Protocol layer timeout detected
                                          9h = Protocol layer failed response
                                          Ah = Protocol layer illegal packet field, target Node ID Error, and so on
                                          Bh = Protocol Layer Queue/table overflow/underflow
                                          Ch = Viral Error
                                          Dh = Protocol Layer parity error
                                          Eh = Routing Table Error
                                          Fh = Reserved (unused)
15    Event Data 2                    0-3 = CPU1-4
16    Event Data 3                    Not used
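When reading these events from the hex version of the SEL, the Event Trigger Offset and CPU number can be turned into text with a simple lookup. The sketch below is illustrative only: the function and dictionary names are assumptions, and the offset descriptions are copied from Table 54.

```python
# Event Trigger Offset descriptions from Table 54 (Event Data 1, bits [3:0]).
QPI_FATAL_OFFSETS = {
    0x0: "Link Layer Uncorrectable ECC Error",
    0x1: "Protocol Layer Poisoned Packet Reception Error",
    0x2: "Link/PHY Init Failure with resultant degradation in link width",
    0x3: "PHY Layer detected drift buffer alarm",
    0x4: "PHY detected latency buffer rollover",
    0x5: "PHY Init Failure",
    0x6: "Link Layer generic control error",
    0x7: "Parity error in link or PHY layer",
    0x8: "Protocol layer timeout detected",
    0x9: "Protocol layer failed response",
    0xA: "Protocol layer illegal packet field, target Node ID Error",
    0xB: "Protocol Layer Queue/table overflow/underflow",
    0xC: "Viral Error",
    0xD: "Protocol Layer parity error",
    0xE: "Routing Table Error",
}


def decode_qpi_fatal(event_data1: int, event_data2: int) -> tuple:
    """Return (CPU name, error description) for a QPI Fatal Error event."""
    if not 0 <= event_data2 <= 3:
        raise ValueError("Event Data 2 must be 0-3 (CPU1-4)")
    offset = event_data1 & 0x0F  # bits [3:0] of Event Data 1
    return (f"CPU{event_data2 + 1}", QPI_FATAL_OFFSETS.get(offset, "Reserved"))
```

For example, Event Data 1 = 8Ch with Event Data 2 = 01h decodes as a Viral Error reported for CPU2.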
The QPI Fatal Error #2 is a continuation of QPI Fatal Error.
Table 55: QPI Fatal #2 Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   17h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 74h (OEM Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset
                                          0h = Illegal inbound request
                                          1h = IIO Write Cache Uncorrectable Data ECC Error
                                          2h = IIO CSR crossing 32-bit boundary Error
                                          3h = IIO Received XPF physical/logical redirect interrupt inbound
                                          4h = IIO Illegal SAD or Illegal or non-existent address or memory
                                          5h = IIO Write Cache Coherency Violation
15    Event Data 2                    0-3 = CPU1-4
16    Event Data 3                    Not used
6.4.3.1
QPI Fatal Error and Fatal Error #2 – Next Steps
This is a fatal error. If the error continues:
1. Check the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross test the processor. If the issue remains with the processor socket, replace the main board, otherwise the processor.
6.5
Processor ERR2 Timeout Sensor
The BMC supports an ERR2 Timeout Sensor (one per CPU) that asserts if a CPU’s ERR[2] signal has been asserted for longer than a
fixed time period (> 90 seconds). ERR[2] is a processor signal that indicates when the IIO (Integrated I/O module in the processor)
has a fatal error that could not be communicated to the core to trigger an SMI. ERR[2] events are fatal error conditions, where the
BIOS and OS will attempt to handle the error gracefully, but may not always do so reliably. A continuously asserted ERR[2] signal is an
indication that the BIOS cannot service the condition that caused the error. This is usually because that condition prevents the BIOS
from running.
When an ERR2 timeout occurs, the BMC asserts/deasserts the ERR2 Timeout Sensor, and logs a SEL event for that sensor. The
default behavior for BMC core firmware is to initiate a system reset upon detection of an ERR2 timeout. The BIOS setup utility
provides an option to disable or enable system reset by the BMC on detection of this condition.
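The timeout condition described above can be modeled as a simple timer on the signal state. The following Python sketch is a hypothetical model for illustration, not BMC firmware: the class name, the sampling interface, and the behavior on signal release are assumptions. Only the 90-second limit comes from the text.

```python
ERR2_TIMEOUT_SECONDS = 90.0  # fixed time period stated above


class Err2TimeoutMonitor:
    """Hypothetical model: flag a timeout when ERR[2] stays asserted too long."""

    def __init__(self, timeout: float = ERR2_TIMEOUT_SECONDS):
        self.timeout = timeout
        self.asserted_since = None  # time at which ERR[2] was first seen asserted

    def sample(self, err2_asserted: bool, now: float) -> bool:
        """Record one sample of the signal; return True if the timeout has expired."""
        if not err2_asserted:
            # Signal released: restart the timer (assumed behavior).
            self.asserted_since = None
            return False
        if self.asserted_since is None:
            self.asserted_since = now
        # The sensor asserts only when the signal has been held longer than the limit.
        return (now - self.asserted_since) > self.timeout
```

In this model, a signal first seen asserted at t = 0 s reports a timeout at t = 91 s, but not at t = 90 s, because the condition is "longer than" the fixed period.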
Table 56: Processor ERR2 Timeout Sensor Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     07h = Processor
12    Sensor Number                   7Ch = Processor 1 ERR2 Timeout
                                      7Dh = Processor 2 ERR2 Timeout
                                      7Eh = Processor 3 ERR2 Timeout
                                      7Fh = Processor 4 ERR2 Timeout
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 03h (“digital” discrete)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = 1h (State Asserted)
15    Event Data 2                    Not used
16    Event Data 3                    Not used
6.5.1
Processor ERR2 Timeout – Next Steps
1. Check the SEL for any other events around the time of the failure.
2. Take note of all IPMI activity that was occurring around the time of the failure. Capture a System BMC Debug Log as soon as you
can after experiencing this failure. This log can be captured from the Integrated BMC Web Console or by using the Intel® Syscfg
utility (syscfg /sbmcdl private filename.zip). Send the log file to your system manufacturer or Intel representative for failure
analysis.
6.6
Processor MSID Mismatch Sensor
The BMC supports an MSID Mismatch sensor that monitors for the fault condition that occurs when there is a power rating
incompatibility between a baseboard and a processor.
The sensor is rearmed on power-on (AC or DC power-on transitions).
Table 57: Processor MSID Mismatch Sensor Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     07h = Processor
12    Sensor Number                   81h = Processor 1 MSID Mismatch
                                      87h = Processor 2 MSID Mismatch
                                      88h = Processor 3 MSID Mismatch
                                      89h = Processor 4 MSID Mismatch
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 03h (“digital” discrete)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = 1h (State Asserted)
15    Event Data 2                    Not used
16    Event Data 3                    Not used
6.6.1
Processor MSID Mismatch Sensor – Next Steps
Verify the processor is supported by your baseboard. Check your board's Technical Product Specification (TPS).
7.
Memory Subsystem
Intel® servers report memory errors, status, and configuration in the SEL.
7.1
Memory RAS Configuration Status
A Memory RAS Configuration Status event is logged after an AC power-on occurs, only if any RAS Mode is currently configured, and
only if RAS Mode is successfully initiated.
This is to make sure that there is a record in the SEL telling what the RAS Mode was at the time that the system started up. This is
only logged after AC power-on, not DC power-on.
The Memory RAS Configuration Status Sensor is also used to log an event during POST whenever there is a RAS configuration
error. This is a case where a RAS Mode has been selected but when the system boots, the memory configuration cannot support the
RAS Mode. The memory configuration fails, and operates in Independent Channel Mode.
In the SEL record logged, the ED1 Offset value is “RAS Configuration Disabled”, and ED3 contains the RAS Mode that is currently
selected but could not be configured. ED2 gives the reason for the RAS configuration failure – at present, only two “RAS
Configuration Error Type” values are implemented:
0 = None – This is used for an AC power-on log record when the RAS configuration is successfully configured.
3 = Invalid DIMM Configuration for RAS Mode – The installed DIMM configuration cannot support the currently selected RAS
Mode. This may be due to DIMMs that have failed or been disabled, so when this reason has been logged, the user
should check the preceding SEL events to see whether there are DIMM error events.
Table 58: Memory RAS Configuration Status Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0001h = BIOS POST
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   02h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 09h (Digital Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 59
15    Event Data 2                    RAS Configuration Error Type
                                      [7:4] = Reserved
                                      [3:0] = Configuration Error
                                          0 = None
                                          3 = Invalid DIMM Configuration for RAS Mode
                                          All other values are reserved.
16    Event Data 3                    RAS Mode Configured
                                      [7:4] = Reserved
                                      [3:0] = RAS Mode
                                          0h = None (Independent Channel Mode)
                                          1h = Mirroring Mode
                                          2h = Lockstep Mode
                                          4h = Rank Sparing Mode
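The three event data bytes of this record can be decoded together. The following is a minimal Python sketch under stated assumptions: the function name and the returned keys are illustrative, the error type and RAS Mode values come from Table 58, and the offset meanings (00h disabled, 01h enabled) come from Table 59.

```python
# Values from Table 58 (Event Data 2 and 3) and Table 59 (Event Trigger Offset).
RAS_OFFSETS = {0x0: "RAS configuration disabled", 0x1: "RAS configuration enabled"}
RAS_CONFIG_ERRORS = {0x0: "None", 0x3: "Invalid DIMM Configuration for RAS Mode"}
RAS_MODES = {
    0x0: "None (Independent Channel Mode)",
    0x1: "Mirroring Mode",
    0x2: "Lockstep Mode",
    0x4: "Rank Sparing Mode",
}


def decode_ras_config_status(ed1: int, ed2: int, ed3: int) -> dict:
    """Decode Event Data 1-3 of a Memory RAS Configuration Status event."""
    return {
        "state": RAS_OFFSETS.get(ed1 & 0x0F, "Reserved"),
        "error_type": RAS_CONFIG_ERRORS.get(ed2 & 0x0F, "Reserved"),
        "ras_mode": RAS_MODES.get(ed3 & 0x0F, "Reserved"),
    }
```

For example, Event Data bytes A0h, 03h, 01h describe a case where Mirroring Mode was selected but could not be configured because of an invalid DIMM configuration.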
Table 59: Memory RAS Configuration Status Sensor – Event Trigger Offset – Next Steps
Event Trigger Offset 01h: RAS configuration enabled.
  Description: User enabled mirrored channel mode in setup.
  Next Steps: Informational event only.
Event Trigger Offset 00h: RAS configuration disabled.
  Description: Mirrored channel mode is disabled (either in setup or due to unavailability of memory at POST, in which case POST
  error 8500 is also logged).
  Next Steps:
  1. If this event is accompanied by a POST error 8500, there was a problem applying the mirroring configuration to the memory.
     Check for other errors related to the memory and troubleshoot accordingly.
  2. If there is no POST error, mirror mode was simply disabled in BIOS setup and this should be considered informational only.
7.2
Memory RAS Mode Select
Memory RAS Mode Select events are logged to record changes in RAS Mode.
When a RAS Mode selection is made that changes the RAS Mode (including selecting a RAS Mode from or to Independent Channel
Mode), that change is logged to SEL in a Memory RAS Mode Select event message, which records the previous RAS Mode (from)
and the newly selected RAS Mode (to). The event also includes an Offset value in ED1 which indicates whether the mode change
left the system with a RAS Mode active (Enabled), or not (Disabled – Independent Channel Mode selected). This sensor provides the
Spare Channel mode RAS Configuration status. Memory RAS Mode Select is an informational event.
Table 60: Memory RAS Mode Select Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0001h = BIOS POST
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   12h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 09h (Digital Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset
                                          0h = RAS Configuration Disabled
                                          1h = RAS Configuration Enabled
15    Event Data 2                    Prior RAS Mode
                                      [7:4] = Reserved
                                      [3:0] = RAS Mode
                                          0h = None (Independent Channel Mode)
                                          1h = Mirroring Mode
                                          2h = Lockstep Mode
                                          4h = Rank Sparing Mode
16    Event Data 3                    Selected RAS Mode
                                      [7:4] = Reserved
                                      [3:0] = RAS Mode
                                          0h = None (Independent Channel Mode)
                                          1h = Mirroring Mode
                                          2h = Lockstep Mode
                                          4h = Rank Sparing Mode
7.3
Mirroring Redundancy State
Mirroring Mode protects memory data by full redundancy – keeping complete copies of all data on both channels of a Mirroring
Domain (channel pair). If an Uncorrectable Error, which is normally fatal, occurs on one channel of a pair, and the other channel is
still intact and operational, then the Uncorrectable Error is “demoted” to a Correctable Error, and the failed channel is disabled.
Because the Mirror Domain is no longer redundant, a Mirroring Redundancy State SEL Event is logged.
Table 61: Mirroring Redundancy State Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   01h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 0Bh (Generic Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset
                                          0h = Fully Redundant
                                          2h = Redundancy Degraded
15    Event Data 2                    Location
                                      [7:4] = Mirroring Domain
                                          0-1 = Channel Pair for Socket
                                      [3:2] = Reserved
                                      [1:0] = Rank on DIMM
                                          0-3 = Rank Number
16    Event Data 3                    Location
                                      [7:5] = Socket ID
                                          0-3 = CPU1-4
                                      [4:3] = Channel
                                          0-3 = Channel A, B, C, D for CPU1
                                                Channel E, F, G, H for CPU2
                                                Channel J, K, L, M for CPU3
                                                Channel N, P, R, T for CPU4
                                      [2:0] = DIMM
                                          0-2 = DIMM 1-3 on Channel
7.3.1
Mirroring Redundancy State Sensor – Next Steps
This event is accompanied by memory errors indicating the source of the issue. Troubleshoot accordingly (probably replace affected
DIMM).
For boards with DIMM Fault LEDs, the appropriate Fault LED is lit to indicate which DIMM was the source of the error triggering the
Mirroring Failover action, that is, the failing DIMM.
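When the hex version of the SEL is the only information available, the redundancy state and the Event Data 2 location fields from Table 61 can be extracted as sketched below. The function name and returned keys are illustrative assumptions; the bit positions and offset values are taken from the table.

```python
# Event Trigger Offsets from Table 61 (Event Data 1, bits [3:0]).
REDUNDANCY_OFFSETS = {0x0: "Fully Redundant", 0x2: "Redundancy Degraded"}


def decode_mirroring_event(event_data1: int, event_data2: int) -> dict:
    """Decode the state and the Event Data 2 location of a Mirroring Redundancy State event."""
    return {
        "state": REDUNDANCY_OFFSETS.get(event_data1 & 0x0F, "Reserved"),
        # Bits [7:4]: Mirroring Domain (channel pair for the socket)
        "mirroring_domain": (event_data2 >> 4) & 0x0F,
        # Bits [1:0]: rank number on the DIMM
        "rank": event_data2 & 0x03,
    }
```

For example, Event Data 1 = A2h with Event Data 2 = 12h reports Redundancy Degraded for Mirroring Domain 1, Rank 2.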
7.4
Sparing Redundancy State
Rank Sparing Mode is a Memory RAS configuration option that reserves one memory rank per channel as a “spare rank”. If any rank
on a given channel experiences enough Correctable ECC Errors to cross the Correctable Error Threshold, the data in that rank is
copied to the spare rank, and then the spare rank is mapped into the memory array to replace the failing rank.
Rank Sparing Mode protects memory data by reserving a “Spare Rank” on each channel that has memory installed on it. If a
Correctable Error Threshold event occurs, the data from the failing rank is copied to the Spare Rank on the same channel, and the
failing DIMM is disabled. Because the Sparing Domain is no longer redundant, a Sparing Redundancy State SEL Event is logged.
Table 62: Sparing Redundancy State Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   11h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 0Bh (Generic Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset
                                          0h = Fully Redundant
                                          2h = Redundancy Degraded
15    Event Data 2                    Location
                                      [7:4] = Sparing Domain
                                          0-3 = Channel A-D for Socket
                                      [3:2] = Reserved
                                      [1:0] = Rank on DIMM
                                          0-3 = Rank Number
16    Event Data 3                    Location
                                      [7:5] = Socket ID
                                          0-3 = CPU1-4
                                      [4:3] = Channel
                                          0-3 = Channel A, B, C, D for CPU1
                                                Channel E, F, G, H for CPU2
                                                Channel J, K, L, M for CPU3
                                                Channel N, P, R, T for CPU4
                                      [2:0] = DIMM
                                          0-2 = DIMM 1-3 on Channel
7.4.1
Sparing Redundancy State Sensor – Next Steps
This event is accompanied by memory errors indicating the source of the issue. Troubleshoot accordingly (probably replace affected
DIMM).
For boards with DIMM Fault LEDs, the appropriate Fault LED is lit to indicate which DIMM was the source of the error triggering the
Rank Sparing action, that is, the failing DIMM.
7.5
ECC and Address Parity
1. Memory data errors are logged as correctable or uncorrectable.
2. Uncorrectable errors are fatal.
3. Memory addresses are protected with parity bits and a parity error is logged. This is a fatal error.
7.5.1
Memory Correctable and Uncorrectable ECC Error
ECC errors are divided into Uncorrectable ECC Errors and Correctable ECC Errors. A “Correctable ECC Error” event actually represents a
threshold overflow: more Correctable Errors than the threshold allows were detected at the memory controller level for a given DIMM
within a given timeframe. In both cases, the error can be narrowed down to particular DIMM(s). The BIOS SMI error handler uses this
information to log the data to the BMC SEL and identify the failing DIMM module.
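The threshold behavior can be pictured as a per-DIMM counter. The sketch below is a hypothetical model, not the BIOS SMI handler: the class name, the DIMM labels, and the choice to report only on the error that reaches the threshold are assumptions. The default threshold of 10 follows the "10 or more" figure given in Table 64.

```python
from collections import Counter


class CorrectableEccTracker:
    """Hypothetical model of per-DIMM correctable ECC error counting since boot."""

    def __init__(self, threshold: int = 10):
        self.threshold = threshold
        self.counts = Counter()  # correctable error count per DIMM label

    def record_error(self, dimm: str) -> bool:
        """Count one correctable error; return True when the threshold is reached."""
        self.counts[dimm] += 1
        # Report once, on the error that reaches the threshold (assumed behavior).
        return self.counts[dimm] == self.threshold
```

In this model, the first nine correctable errors on a DIMM produce no event, and the tenth returns True, which corresponds to the point at which a "Correctable ECC Error threshold reached" event would be logged.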
Table 63: Correctable and Uncorrectable ECC Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   02h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 64
15    Event Data 2                    [7:2] – Reserved. Set to 0.
                                      [1:0] – Rank on DIMM
                                          0-3 = Rank number
16    Event Data 3                    [7:5] – Socket ID
                                          0-3 = CPU1-4
                                      [4:3] – Channel
                                          0-3 = Channel A, B, C, D for CPU1
                                                Channel E, F, G, H for CPU2
                                                Channel J, K, L, M for CPU3
                                                Channel N, P, R, T for CPU4
                                      [2:0] – DIMM
                                          0-2 = DIMM 1-3 on Channel
Table 64: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps
Event Trigger Offset 01h – Uncorrectable ECC Error
Description:
An uncorrectable (multi-bit) ECC error has occurred. This is a fatal issue that will typically lead to an OS crash
(unless memory has been configured in a RAS mode). The system will generate a CATERR# (catastrophic error)
and an MCE (Machine Check Exception Error).
While the error may be due to a failing DRAM chip on the DIMM, it can also be caused by incorrect seating or
improper contact between socket and DIMM, or by bent pins in the processor socket.
Next Steps:
1. If needed, decode DIMM location from hex version of SEL.
2. Verify the DIMM is seated properly.
3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.
Event Trigger Offset 00h – Correctable ECC Error threshold reached
Description:
There have been too many (10 or more) correctable ECC errors for this particular DIMM since last boot. This event
in itself does not pose any direct problems because the ECC errors are still being corrected. Depending on the
RAS configuration of the memory, the IMC may take the affected DIMM offline.
Next Steps:
Even though this event doesn't immediately lead to problems, it can indicate one of the DIMM modules is slowly
failing. If this error occurs more than once:
1. If needed, decode DIMM location from hex version of SEL.
2. Verify the DIMM is seated properly.
3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.
7.5.2
Memory Address Parity Error
Address Parity errors are errors detected in the memory addressing hardware. Because these affect the addressing of memory
contents, they can potentially lead to the same sort of failures as ECC errors. They are logged as a distinct type of error because
they affect memory addressing rather than memory contents, but otherwise they are treated exactly the same as Uncorrectable ECC
Errors. Address Parity errors are logged to the BMC SEL, with Event Data to identify the failing address by channel and DIMM to the
extent that it is possible to do so.
Table 65: Address Parity Error Sensor Typical Characteristics
Byte | Field                          | Description
8, 9 | Generator ID                   | 0033h = BIOS SMI Handler
11   | Sensor Type                    | 0Ch = Memory
12   | Sensor Number                  | 13h
13   | Event Direction and Event Type | [7] Event direction
     |                                |   0b = Assertion Event
     |                                |   1b = Deassertion Event
     |                                | [6:0] Event Type = 6Fh (Sensor Specific)
14   | Event Data 1                   | [7:6] – 10b = OEM code in Event Data 2
     |                                | [5:4] – 10b = OEM code in Event Data 3
     |                                | [3:0] – Event Trigger Offset = 2h
15   | Event Data 2                   | [7:5] – Reserved. Set to 0.
     |                                | [4] – Channel Information Validity Check:
     |                                |   0b = Channel Number in Event Data 3 Bits[4:3] is not valid
     |                                |   1b = Channel Number in Event Data 3 Bits[4:3] is valid
     |                                | [3] – DIMM Information Validity Check:
     |                                |   0b = DIMM Slot ID in Event Data 3 Bits[2:0] is not valid
     |                                |   1b = DIMM Slot ID in Event Data 3 Bits[2:0] is valid
     |                                | [2:0] – Error Type:
     |                                |   000b = Parity Error Type not known
     |                                |   001b = Data Parity Error (not used)
     |                                |   010b = Address Parity Error
     |                                |   All other values are reserved.
16   | Event Data 3                   | [7:5] – Indicates the Processor Socket to which the DDR3 DIMM having the ECC error is attached:
     |                                |   0-3 = CPU1-4
     |                                |   All other values are reserved.
     |                                | [4:3] – Channel Number (if valid) on which the Parity Error occurred. This value will be
     |                                |   indeterminate and should be ignored if ED2 Bit [4] is 0b.
     |                                |   0-3 = Channel A, B, C, D for CPU1
     |                                |         Channel E, F, G, H for CPU2
     |                                |         Channel J, K, L, M for CPU3
     |                                |         Channel N, P, R, T for CPU4
     |                                | [2:0] – DIMM Slot ID (if valid) of the specific DIMM that was involved in the transaction that
     |                                |   led to the parity error. This value will be indeterminate and should be ignored if ED2 Bit [3]
     |                                |   is 0b.
     |                                |   0-2 = DIMM 1-3 on Channel
     |                                |   All other values are reserved.
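Because the channel and DIMM fields of an Address Parity event are only meaningful when the validity bits in Event Data 2 are set, a decoder should check those bits first. The sketch below illustrates one way to do that; the function name and return shape are hypothetical, and fields flagged as not valid are returned as None rather than guessed.

```python
# Channel letters in socket order, per the Event Data 3 [4:3] rows of Table 65.
CHANNEL_LETTERS = "ABCDEFGHJKLMNPRT"

# Error Type values from Event Data 2 [2:0]; other values are reserved.
PARITY_ERROR_TYPES = {
    0b000: "Parity Error Type not known",
    0b001: "Data Parity Error (not used)",
    0b010: "Address Parity Error",
}


def decode_parity_event(ed2, ed3):
    """Decode Event Data 2-3 of a Memory Address Parity event (sensor 13h)."""
    channel_valid = bool(ed2 & 0x10)  # ED2 [4]
    dimm_valid = bool(ed2 & 0x08)     # ED2 [3]
    socket = (ed3 >> 5) & 0x07        # ED3 [7:5], 0-3 = CPU1-4
    if socket > 3:
        raise ValueError("socket value is reserved")
    channel = None
    if channel_valid:
        channel = CHANNEL_LETTERS[socket * 4 + ((ed3 >> 3) & 0x03)]
    dimm = None
    if dimm_valid:
        dimm = (ed3 & 0x07) + 1       # 0-2 = DIMM 1-3 on the channel
    return {
        "error_type": PARITY_ERROR_TYPES.get(ed2 & 0x07, "reserved"),
        "cpu": "CPU%d" % (socket + 1),
        "channel": channel,           # None when ED2 [4] is 0b
        "dimm": dimm,                 # None when ED2 [3] is 0b
    }


if __name__ == "__main__":
    # Hypothetical bytes: both validity bits set, address parity error.
    print(decode_parity_event(0x1A, 0x48))
```

Returning None for an invalid field keeps an indeterminate value from being mistaken for a real DIMM location during troubleshooting.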
7.5.2.1
Memory Address Parity Error Sensor – Next Steps
These are bit errors that are detected in the memory addressing hardware. An Address Parity Error implies that the memory address
transmitted to the DIMM addressing circuitry has been compromised, and data read or written is compromised in turn. An Address
Parity Error is logged as such in SEL but in all other ways is treated the same as an Uncorrectable ECC Error.
While the error may be due to a failing DRAM chip on the DIMM, it can also be caused by incorrect seating or improper contact
between the socket and DIMM, or by bent pins in the processor socket.
1. If needed, decode DIMM location from hex version of SEL.
2. Verify the DIMM is seated properly.
3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.
8.
PCI Express* and Legacy PCI Subsystem
The PCI Express* (PCIe) Specification defines standard error types under the Advanced Error Reporting (AER) capabilities. The
BIOS logs AER events into the SEL.
The Legacy PCI Specification error types are PERR and SERR. These errors are supported and logged into the SEL.
8.1
PCI Express* Errors
PCIe error events are either correctable (informational) or fatal. In both cases, information is logged to help identify the source
of the PCIe error: the bus, device, and function numbers are included in the extended data fields. The operating system maps PCIe
devices by bus, device, and function, and this combination uniquely identifies each device, so the device that reported the error
can be looked up in the operating system.
8.1.1
Legacy PCI Errors
Legacy PCI errors include PERR and SERR; both are fatal errors.
Table 66: Legacy PCI Error Sensor Typical Characteristics
Byte | Field                          | Description
8, 9 | Generator ID                   | 0033h = BIOS SMI Handler
11   | Sensor Type                    | 13h = Critical Interrupt
12   | Sensor Number                  | 03h
13   | Event Direction and Event Type | [7] Event direction
     |                                |   0b = Assertion Event
     |                                |   1b = Deassertion Event
     |                                | [6:0] Event Type = 6Fh (Sensor Specific)
14   | Event Data 1                   | [7:6] – 10b = OEM code in Event Data 2
     |                                | [5:4] – 10b = OEM code in Event Data 3
     |                                | [3:0] – Event Trigger Offset
     |                                |   4h = PCI PERR
     |                                |   5h = PCI SERR
15   | Event Data 2                   | PCI Bus number
16   | Event Data 3                   | [7:3] – PCI Device number
     |                                | [2:0] – PCI Function number
8.1.1.1
Legacy PCI Error Sensor – Next Steps
1. Decode the bus, device, and function to identify the card.
2. If this is an add-in card:
a. Verify the card is inserted properly.
b. Install the card in another slot and check whether the error follows the card or stays with the slot.
c. Update all firmware and drivers, including non-Intel components.
3. If this is an on-board device:
a. Update all BIOS, firmware, and drivers.
b. Replace the board.
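Step 1 above, decoding the bus, device, and function, follows directly from the Event Data 2 and Event Data 3 layout. The sketch below is illustrative only: the function names are hypothetical, and the bus:device.function string layout is an assumed convention (the one commonly printed by Linux PCI listing tools), not something this guide prescribes.

```python
from typing import Tuple


def decode_bdf(ed2: int, ed3: int) -> Tuple[int, int, int]:
    """Return (bus, device, function) from Event Data 2 and Event Data 3."""
    bus = ed2 & 0xFF              # Event Data 2: PCI Bus number
    device = (ed3 >> 3) & 0x1F    # Event Data 3 [7:3]: PCI Device number
    function = ed3 & 0x07         # Event Data 3 [2:0]: PCI Function number
    return bus, device, function


def format_bdf(ed2: int, ed3: int) -> str:
    """Format the address as bus:device.function in hexadecimal."""
    bus, device, function = decode_bdf(ed2, ed3)
    return "%02x:%02x.%x" % (bus, device, function)


if __name__ == "__main__":
    # Hypothetical event data bytes, for illustration only.
    print(format_bdf(0x03, 0xE1))
```

The resulting string can then be compared with the device list reported by the operating system to identify the card or on-board device.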
8.1.2
PCI Express* Fatal Errors and Fatal Error #2
When a PCI Express* fatal error is reported to the BIOS SMI handler, it will record the error using the following format.
Table 67: PCI Express* Fatal Error Sensor Typical Characteristics
Byte | Field                          | Description
8, 9 | Generator ID                   | 0033h = BIOS SMI Handler
11   | Sensor Type                    | 13h = Critical Interrupt
12   | Sensor Number                  | 04h
13   | Event Direction and Event Type | [7] Event direction
     |                                |   0b = Assertion Event
     |                                |   1b = Deassertion Event
     |                                | [6:0] Event Type = 70h (OEM Specific)
14   | Event Data 1                   | [7:6] – 10b = OEM code in Event Data 2
     |                                | [5:4] – 10b = OEM code in Event Data 3
     |                                | [3:0] – Event Trigger
     |                                |   0h = Data Link Layer Protocol Error
     |                                |   1h = Surprise Link Down Error
     |                                |   2h = Completer Abort
     |                                |   3h = Unsupported Request
     |                                |   4h = Poisoned TLP
     |                                |   5h = Flow Control Protocol
     |                                |   6h = Completion Timeout
     |                                |   7h = Receiver Buffer Overflow
     |                                |   8h = ACS Violation
     |                                |   9h = Malformed TLP
     |                                |   Ah = ECRC Error
     |                                |   Bh = Received Fatal Message From Downstream
     |                                |   Ch = Unexpected Completion
     |                                |   Dh = Received ERR_NONFATAL Message
     |                                |   Eh = Uncorrectable Internal
     |                                |   Fh = MC Blocked TLP
15   | Event Data 2                   | PCI Bus number
16   | Event Data 3                   | [7:3] – PCI Device number
     |                                | [2:0] – PCI Function number
The PCI Express* Fatal Error #2 is a continuation of the PCI Express* Fatal Error.
Table 68: PCI Express* Fatal Error #2 Sensor Typical Characteristics
Byte | Field                          | Description
8, 9 | Generator ID                   | 0033h = BIOS SMI Handler
11   | Sensor Type                    | 13h = Critical Interrupt
12   | Sensor Number                  | 14h
13   | Event Direction and Event Type | [7] Event direction
     |                                |   0b = Assertion Event
     |                                |   1b = Deassertion Event
     |                                | [6:0] Event Type = 76h (OEM Specific)
14   | Event Data 1                   | [7:6] – 10b = OEM code in Event Data 2
     |                                | [5:4] – 10b = OEM code in Event Data 3
     |                                | [3:0] – Event Trigger Offset
     |                                |   0h = Atomic Egress Blocked
     |                                |   1h = TLP Prefix Blocked
     |                                |   Fh = Unspecified Non-AER Fatal Error
15   | Event Data 2                   | PCI Bus number
16   | Event Data 3                   | [7:3] – PCI Device number
     |                                | [2:0] – PCI Function number
8.1.2.1
PCI Express* Fatal Error and Fatal Error #2 Sensor – Next Steps
1. Decode the bus, device, and function to identify the card.
2. If this is an add-in card:
a. Verify the card is inserted properly.
b. Install the card in another slot and check whether the error follows the card or stays with the slot.
c. Update all firmware and drivers, including non-Intel components.
3. If this is an on-board device:
a. Update all BIOS, firmware, and drivers.
b. Replace the board.
8.1.3
PCI Express* Correctable Errors
When a PCI Express* correctable error is reported to the BIOS SMI handler, it will record the error using the following format.
Table 69: PCI Express* Correctable Error Sensor Typical Characteristics
Byte | Field                          | Description
8, 9 | Generator ID                   | 0033h = BIOS SMI Handler
11   | Sensor Type                    | 13h = Critical Interrupt
12   | Sensor Number                  | 05h
13   | Event Direction and Event Type | [7] Event direction
     |                                |   0b = Assertion Event
     |                                |   1b = Deassertion Event
     |                                | [6:0] Event Type = 71h (OEM Specific)
14   | Event Data 1                   | [7:6] – 10b = OEM code in Event Data 2
     |                                | [5:4] – 10b = OEM code in Event Data 3
     |                                | [3:0] – Event Trigger Offset
     |                                |   0h = Receiver Error
     |                                |   1h = Bad DLLP
     |                                |   2h = Bad TLP
     |                                |   3h = Replay Num Rollover
     |                                |   4h = Replay Timer timeout
     |                                |   5h = Advisory Non-fatal
     |                                |   6h = Link BW Changed
     |                                |   7h = Correctable Internal
     |                                |   8h = Header Log Overflow
     |                                |   Fh = Unspecified Non-AER Correctable Error
15   | Event Data 2                   | PCI Bus number
16   | Event Data 3                   | [7:3] – PCI Device number
     |                                | [2:0] – PCI Function number
8.1.3.1
PCI Express* Correctable Error Sensor – Next Steps
This is an informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:
1. Decode the bus, device, and function to identify the card.
2. If this is an add-in card:
a. Verify the card is inserted properly.
b. Install the card in another slot and check whether the error follows the card or stays with the slot.
c. Update all firmware and drivers, including non-Intel components.
3. If this is an on-board device:
a. Update all BIOS, firmware, and drivers.
b. Replace the board.
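The three PCI Express* sensors (Tables 67, 68, and 69) reuse the same Event Trigger Offset values with different meanings, so the sensor number has to be considered together with the offset. The sketch below shows one possible lookup; the dictionary layout and function name are hypothetical, and the strings are copied from the tables.

```python
# Event Trigger Offset names, keyed by sensor number (SEL byte 12).
PCIE_EVENTS = {
    0x04: ("PCI Express Fatal Error", {
        0x0: "Data Link Layer Protocol Error", 0x1: "Surprise Link Down Error",
        0x2: "Completer Abort", 0x3: "Unsupported Request",
        0x4: "Poisoned TLP", 0x5: "Flow Control Protocol",
        0x6: "Completion Timeout", 0x7: "Receiver Buffer Overflow",
        0x8: "ACS Violation", 0x9: "Malformed TLP", 0xA: "ECRC Error",
        0xB: "Received Fatal Message From Downstream",
        0xC: "Unexpected Completion", 0xD: "Received ERR_NONFATAL Message",
        0xE: "Uncorrectable Internal", 0xF: "MC Blocked TLP",
    }),
    0x14: ("PCI Express Fatal Error #2", {
        0x0: "Atomic Egress Blocked", 0x1: "TLP Prefix Blocked",
        0xF: "Unspecified Non-AER Fatal Error",
    }),
    0x05: ("PCI Express Correctable Error", {
        0x0: "Receiver Error", 0x1: "Bad DLLP", 0x2: "Bad TLP",
        0x3: "Replay Num Rollover", 0x4: "Replay Timer timeout",
        0x5: "Advisory Non-fatal", 0x6: "Link BW Changed",
        0x7: "Correctable Internal", 0x8: "Header Log Overflow",
        0xF: "Unspecified Non-AER Correctable Error",
    }),
}


def describe_pcie_event(sensor_number, ed1):
    """Return 'sensor name: event name' for a PCI Express SEL event."""
    if sensor_number not in PCIE_EVENTS:
        raise ValueError("not a PCI Express error sensor: %02Xh" % sensor_number)
    sensor_name, offsets = PCIE_EVENTS[sensor_number]
    offset = ed1 & 0x0F  # Event Data 1 [3:0]
    event = offsets.get(offset, "reserved/unknown offset %Xh" % offset)
    return "%s: %s" % (sensor_name, event)


if __name__ == "__main__":
    print(describe_pcie_event(0x04, 0xA4))  # hypothetical sample bytes
```

Keeping one offset table per sensor avoids misreading, for example, offset 0h as a Data Link Layer Protocol Error when the event actually came from the correctable error sensor.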
9.
System BIOS Events
There are a number of events that are owned by the system BIOS. These events can occur during Power On Self Test (POST) or
when coming out of a sleep state. Not all of these events signify errors. Some events are described in other chapters in this
document (for example, memory events).
9.1
System Events
These events can occur during POST or when coming out of a sleep state. These are informational events only.
1. Events logged during BIOS POST use Generator ID 0001h.
2. Events logged by the BIOS SMI Handler use Generator ID 0033h.
9.1.1
System Boot
At the end of POST, just before the actual OS boot occurs, a System Boot Event is logged. This event marks the transition of control
from completed POST to the OS Loader. It is an informational-only event.
9.1.2
Timestamp Clock Synchronization
These events are logged when the time is synchronized between the BIOS and the BMC. Two events are logged. The BIOS logs the first
event when it sends the time synch message to the BMC. That event is stamped with the BMC's "before" time, so its timestamp in the
log can be anything.
The BIOS then sends a second time synch message to get a "baseline" entry with the correct timestamp in the log. That is the
"starting time".
For example, say that the time the BMC has is March 1, 2011 21:00. The BIOS time synch updates that to the same date, 21:20 (the
BMC was running behind). Without the second time synch message, you cannot tell that the log time jumped ahead, and the next log
message makes it look as if there was an unexplained 20-minute delay during boot.
In other words, without the second time synch message, the time span to the next logged message is indeterminate. With the second
time synch as a baseline, the following log timestamps are always determinate.
The timestamp clock synchronization is run, and the events are logged, by the BIOS POST every time the system boots. In addition,
during shutdown from some operating systems, the BIOS SMI Handler is called to run timestamp clock synchronization and log the
events.
Table 70: System Event Sensor Typical Characteristics
Byte | Field                          | Description
8, 9 | Generator ID                   | 0001h = BIOS POST
     |                                | 0033h = BIOS SMI Handler
11   | Sensor Type                    | 12h = System Event
12   | Sensor Number                  | 83h
13   | Event Direction and Event Type | [7] Event direction
     |                                |   0b = Assertion Event
     |                                |   1b = Deassertion Event
     |                                | [6:0] Event Type = 6Fh (Sensor Specific)
14   | Event Data 1                   | [7:6] – 00b = Unspecified Event Data 2
     |                                | [5:4] – 00b = Unspecified Event Data 3
     |                                | [3:0] – Event Trigger Offset
     |                                |   01h = System Boot
     |                                |   05h = Timestamp Clock Synchronization
15   | Event Data 2                   | For Event Trigger Offset 05h only (Timestamp Clock Synchronization)
     |                                |   00h = 1st in pair
     |                                |   80h = 2nd in pair
16   | Event Data 3                   | Not used
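The pairing of Timestamp Clock Synchronization events can be used to estimate how far the BMC clock was adjusted. The following sketch assumes the SEL entries have already been parsed into (timestamp, Event Data 1, Event Data 2) tuples, which is a hypothetical input shape, and it treats the difference between the two timestamps of a pair as the approximate clock jump (it also includes the short time between the two log writes).

```python
from datetime import datetime, timedelta

TIMESTAMP_SYNC_OFFSET = 0x05  # Event Trigger Offset from Table 70


def clock_adjustments(events):
    """Yield (before, baseline, jump) for each 1st/2nd synchronization pair.

    `events` is an iterable of (timestamp, ed1, ed2) tuples in log order.
    """
    first = None
    for timestamp, ed1, ed2 in events:
        if ed1 & 0x0F != TIMESTAMP_SYNC_OFFSET:
            continue                  # not a clock synchronization event
        if ed2 == 0x00:               # 1st in pair: carries the "before" time
            first = timestamp
        elif ed2 == 0x80 and first is not None:  # 2nd in pair: the baseline
            yield first, timestamp, timestamp - first
            first = None


if __name__ == "__main__":
    # Sample log built from the March 1, 2011 example in this section.
    log = [
        (datetime(2011, 3, 1, 21, 0, 0), 0x05, 0x00),
        (datetime(2011, 3, 1, 21, 20, 0), 0x05, 0x80),
        (datetime(2011, 3, 1, 21, 20, 5), 0x01, 0x00),
    ]
    for before, baseline, jump in clock_adjustments(log):
        print(before, baseline, jump)
```

Entries logged after the second event of a pair can then be compared against the baseline timestamp rather than against the first event.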
9.2
System Firmware Progress (Formerly Post Error)
The BIOS logs any POST errors to the SEL. The 2-byte POST code is logged in the ED2 and ED3 bytes of the SEL entry. This event is
logged every time a POST error is displayed. Even though this event indicates an error, it may not be a fatal error. If this is a
serious error, there will typically also be a corresponding SEL entry logged for whatever caused the error. That entry
may contain more information about what happened than the POST error event does.
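As a small worked example, the 2-byte POST error code can be rebuilt from the two event data bytes. The sketch assumes Event Data 2 carries the low byte and Event Data 3 the high byte, as the sensor characteristics table for this event specifies; the function names are hypothetical.

```python
def post_error_code(ed2, ed3):
    """Combine Event Data 2 (low byte) and Event Data 3 (high byte)."""
    return ((ed3 & 0xFF) << 8) | (ed2 & 0xFF)


def format_post_error_code(ed2, ed3):
    """Return the code as four hex digits, as printed in the POST code list."""
    return "%04X" % post_error_code(ed2, ed3)


if __name__ == "__main__":
    # Hypothetical SEL bytes: ED2 = F2h, ED3 = 84h.
    print(format_post_error_code(0xF2, 0x84))
```

With ED2 = F2h and ED3 = 84h the combined code is 84F2, which the POST Error Codes table lists as "Baseboard management controller failed to respond".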
Table 71: POST Error Sensor Typical Characteristics
Byte | Field                          | Description
8, 9 | Generator ID                   | 0001h = BIOS POST
11   | Sensor Type                    | 0Fh = System Firmware Progress (formerly POST Error)
12   | Sensor Number                  | 06h
13   | Event Direction and Event Type | [7] Event direction
     |                                |   0b = Assertion Event
     |                                |   1b = Deassertion Event
     |                                | [6:0] Event Type = 6Fh (Sensor Specific)
14   | Event Data 1                   | [7:6] – 10b = OEM code in Event Data 2
     |                                | [5:4] – 10b = OEM code in Event Data 3
     |                                | [3:0] – Event Trigger Offset = 0h
15   | Event Data 2                   | Low Byte of POST Error Code
16   | Event Data 3                   | High Byte of POST Error Code
9.2.1
System Firmware Progress (Formerly Post Error) – Next Steps
See the following table for POST Error Codes.
Table 72: POST Error Codes
Error Code
Error Message
Response
0012
System RTC date/time not set
Major
0048
Password check failed
Major
0140
PCI component encountered a PERR error
Major
0141
PCI resource conflict
Major
0146
PCI out of resources error
Major
0191
Processor core/thread count mismatch detected
Fatal
0192
Processor cache size mismatch detected
Fatal
0194
Processor family mismatch detected
Fatal
0195
Processor Intel(R) QPI link frequencies unable to synchronize
Fatal
0196
Processor model mismatch detected
Fatal
0197
Processor frequencies unable to synchronize
Fatal
5220
BIOS Settings reset to default settings
Major
5221
Passwords cleared by jumper
Major
5224
Password clear jumper is Set
Major
8130
Processor 01 disabled
Major
8131
Processor 02 disabled
Major
8132
Processor 03 disabled
Major
8133
Processor 04 disabled
Major
8160
Processor 01 unable to apply microcode update
Major
8161
Processor 02 unable to apply microcode update
Major
8162
Processor 03 unable to apply microcode update
Major
8163
Processor 04 unable to apply microcode update
Major
8170
Processor 01 failed Self Test (BIST)
Major
8171
Processor 02 failed Self Test (BIST)
Major
8172
Processor 03 failed Self Test (BIST)
Major
8173
Processor 04 failed Self Test (BIST)
Major
8180
Processor 01 microcode update not found
Minor
8181
Processor 02 microcode update not found
Minor
8182
Processor 03 microcode update not found
Minor
8183
Processor 04 microcode update not found
Minor
8190
Watchdog timer failed on last boot
Major
8198
OS boot watchdog timer failure
Major
8300
Baseboard management controller failed self test
Major
8305
Hot-Swap Controller failure
Major
83A0
Management Engine (ME) failed self test
Major
83A1
Management Engine (ME) Failed to respond.
Major
84F2
Baseboard management controller failed to respond
Major
84F3
Baseboard management controller in update mode
Major
84F4
Sensor data record empty
Major
84FF
System event log full
Minor
8500
Memory component could not be configured in the selected RAS mode
Major
8501
DIMM Population Error
Major
8520
DIMM_A1 failed test/initialization
Major
8521
DIMM_A2 failed test/initialization
Major
8522
DIMM_A3 failed test/initialization
Major
8523
DIMM_B1 failed test/initialization
Major
8524
DIMM_B2 failed test/initialization
Major
8525
DIMM_B3 failed test/initialization
Major
8526
DIMM_C1 failed test/initialization
Major
8527
DIMM_C2 failed test/initialization
Major
8528
DIMM_C3 failed test/initialization
Major
8529
DIMM_D1 failed test/initialization
Major
852A
DIMM_D2 failed test/initialization
Major
852B
DIMM_D3 failed test/initialization
Major
852C
DIMM_E1 failed test/initialization
Major
852D
DIMM_E2 failed test/initialization
Major
852E
DIMM_E3 failed test/initialization
Major
852F
DIMM_F1 failed test/initialization
Major
8530
DIMM_F2 failed test/initialization
Major
8531
DIMM_F3 failed test/initialization
Major
8532
DIMM_G1 failed test/initialization
Major
8533
DIMM_G2 failed test/initialization
Major
8534
DIMM_G3 failed test/initialization
Major
8535
DIMM_H1 failed test/initialization
Major
8536
DIMM_H2 failed test/initialization
Major
8537
DIMM_H3 failed test/initialization
Major
8538
DIMM_J1 failed test/initialization
Major
8539
DIMM_J2 failed test/initialization
Major
853A
DIMM_J3 failed test/initialization
Major
853B
DIMM_K1 failed test/initialization
Major
853C
DIMM_K2 failed test/initialization
Major
853D
DIMM_K3 failed test/initialization
Major
853E
DIMM_L1 failed test/initialization
Major
853F (Go to 85C0)
DIMM_L2 failed test/initialization
Major
8540
DIMM_A1 disabled
Major
8541
DIMM_A2 disabled
Major
8542
DIMM_A3 disabled
Major
8543
DIMM_B1 disabled
Major
8544
DIMM_B2 disabled
Major
8545
DIMM_B3 disabled
Major
8546
DIMM_C1 disabled
Major
8547
DIMM_C2 disabled
Major
8548
DIMM_C3 disabled
Major
8549
DIMM_D1 disabled
Major
854A
DIMM_D2 disabled
Major
854B
DIMM_D3 disabled
Major
854C
DIMM_E1 disabled
Major
854D
DIMM_E2 disabled
Major
854E
DIMM_E3 disabled
Major
854F
DIMM_F1 disabled
Major
8550
DIMM_F2 disabled
Major
8551
DIMM_F3 disabled
Major
8552
DIMM_G1 disabled
Major
8553
DIMM_G2 disabled
Major
8554
DIMM_G3 disabled
Major
8555
DIMM_H1 disabled
Major
8556
DIMM_H2 disabled
Major
8557
DIMM_H3 disabled
Major
8558
DIMM_J1 disabled
Major
8559
DIMM_J2 disabled
Major
855A
DIMM_J3 disabled
Major
855B
DIMM_K1 disabled
Major
855C
DIMM_K2 disabled
Major
855D
DIMM_K3 disabled
Major
855E
DIMM_L1 disabled
Major
855F (Go to 85D0)
DIMM_L2 disabled
Major
8560
DIMM_A1 encountered a Serial Presence Detection (SPD) failure
Major
8561
DIMM_A2 encountered a Serial Presence Detection (SPD) failure
Major
8562
DIMM_A3 encountered a Serial Presence Detection (SPD) failure
Major
8563
DIMM_B1 encountered a Serial Presence Detection (SPD) failure
Major
8564
DIMM_B2 encountered a Serial Presence Detection (SPD) failure
Major
8565
DIMM_B3 encountered a Serial Presence Detection (SPD) failure
Major
8566
DIMM_C1 encountered a Serial Presence Detection (SPD) failure
Major
8567
DIMM_C2 encountered a Serial Presence Detection (SPD) failure
Major
8568
DIMM_C3 encountered a Serial Presence Detection (SPD) failure
Major
8569
DIMM_D1 encountered a Serial Presence Detection (SPD) failure
Major
856A
DIMM_D2 encountered a Serial Presence Detection (SPD) failure
Major
856B
DIMM_D3 encountered a Serial Presence Detection (SPD) failure
Major
856C
DIMM_E1 encountered a Serial Presence Detection (SPD) failure
Major
856D
DIMM_E2 encountered a Serial Presence Detection (SPD) failure
Major
856E
DIMM_E3 encountered a Serial Presence Detection (SPD) failure
Major
856F
DIMM_F1 encountered a Serial Presence Detection (SPD) failure
Major
8570
DIMM_F2 encountered a Serial Presence Detection (SPD) failure
Major
8571
DIMM_F3 encountered a Serial Presence Detection (SPD) failure
Major
8572
DIMM_G1 encountered a Serial Presence Detection (SPD) failure
Major
8573
DIMM_G2 encountered a Serial Presence Detection (SPD) failure
Major
8574
DIMM_G3 encountered a Serial Presence Detection (SPD) failure
Major
8575
DIMM_H1 encountered a Serial Presence Detection (SPD) failure
Major
8576
DIMM_H2 encountered a Serial Presence Detection (SPD) failure
Major
8577
DIMM_H3 encountered a Serial Presence Detection (SPD) failure
Major
8578
DIMM_J1 encountered a Serial Presence Detection (SPD) failure
Major
8579
DIMM_J2 encountered a Serial Presence Detection (SPD) failure
Major
857A
DIMM_J3 encountered a Serial Presence Detection (SPD) failure
Major
857B
DIMM_K1 encountered a Serial Presence Detection (SPD) failure
Major
857C
DIMM_K2 encountered a Serial Presence Detection (SPD) failure
Major
857D
DIMM_K3 encountered a Serial Presence Detection (SPD) failure
Major
857E
DIMM_L1 encountered a Serial Presence Detection (SPD) failure
Major
857F (Go to 85E0)
DIMM_L2 encountered a Serial Presence Detection (SPD) failure
Major
85C0
DIMM_L3 failed test/initialization
Major
85C1
DIMM_M1 failed test/initialization
Major
85C2
DIMM_M2 failed test/initialization
Major
85C3
DIMM_M3 failed test/initialization
Major
85C4
DIMM_N1 failed test/initialization
Major
85C5
DIMM_N2 failed test/initialization
Major
85C6
DIMM_N3 failed test/initialization
Major
85C7
DIMM_P1 failed test/initialization
Major
85C8
DIMM_P2 failed test/initialization
Major
85C9
DIMM_P3 failed test/initialization
Major
85CA
DIMM_R1 failed test/initialization
Major
85CB
DIMM_R2 failed test/initialization
Major
85CC
DIMM_R3 failed test/initialization
Major
85CD
DIMM_T1 failed test/initialization
Major
85CE
DIMM_T2 failed test/initialization
Major
85CF
DIMM_T3 failed test/initialization
Major
85D0
DIMM_L3 disabled
Major
85D1
DIMM_M1 disabled
Major
85D2
DIMM_M2 disabled
Major
85D3
DIMM_M3 disabled
Major
85D4
DIMM_N1 disabled
Major
85D5
DIMM_N2 disabled
Major
85D6
DIMM_N3 disabled
Major
85D7
DIMM_P1 disabled
Major
85D8
DIMM_P2 disabled
Major
85D9
DIMM_P3 disabled
Major
85DA
DIMM_R1 disabled
Major
85DB
DIMM_R2 disabled
Major
85DC
DIMM_R3 disabled
Major
85DD
DIMM_T1 disabled
Major
85DE
DIMM_T2 disabled
Major
85DF
DIMM_T3 disabled
Major
85E0
DIMM_L3 encountered a Serial Presence Detection (SPD) failure
Major
85E1
DIMM_M1 encountered a Serial Presence Detection (SPD) failure
Major
85E2
DIMM_M2 encountered a Serial Presence Detection (SPD) failure
Major
85E3
DIMM_M3 encountered a Serial Presence Detection (SPD) failure
Major
85E4
DIMM_N1 encountered a Serial Presence Detection (SPD) failure
Major
85E5
DIMM_N2 encountered a Serial Presence Detection (SPD) failure
Major
85E6
DIMM_N3 encountered a Serial Presence Detection (SPD) failure
Major
85E7
DIMM_P1 encountered a Serial Presence Detection (SPD) failure
Major
85E8
DIMM_P2 encountered a Serial Presence Detection (SPD) failure
Major
85E9
DIMM_P3 encountered a Serial Presence Detection (SPD) failure
Major
85EA
DIMM_R1 encountered a Serial Presence Detection (SPD) failure
Major
85EB
DIMM_R2 encountered a Serial Presence Detection (SPD) failure
Major
85EC
DIMM_R3 encountered a Serial Presence Detection (SPD) failure
Major
85ED
DIMM_T1 encountered a Serial Presence Detection (SPD) failure
Major
85EE
DIMM_T2 encountered a Serial Presence Detection (SPD) failure
Major
85EF
DIMM_T3 encountered a Serial Presence Detection (SPD) failure
Major
8604
POST Reclaim of non-critical NVRAM variables
Minor
8605
BIOS Settings are corrupted
Major
8606
NVRAM variable space was corrupted and has been reinitialized
Major
92A3
Serial port component was not detected
Major
92A9
Serial port component encountered a resource conflict error
Major
A000
TPM device not detected.
Minor
A001
TPM device missing or not responding.
Minor
A002
TPM device failure.
Minor
A003
TPM device failed self test.
Minor
A100
BIOS ACM Error
Major
A421
PCI component encountered a SERR error
Fatal
A5A0
PCI Express* component encountered a PERR error
Minor
A5A1
PCI Express* component encountered an SERR error
Fatal
A6A0
DXE Boot Services driver: Not enough memory available to shadow a Legacy Option ROM.
Minor
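The DIMM-related POST error codes in Table 72 follow a regular pattern: each of the three error kinds uses one block of 32 codes (DIMM_A1 through DIMM_L2) and a continuation block of 16 codes (DIMM_L3 through DIMM_T3). The sketch below derives the DIMM name from the code by arithmetic instead of a full lookup table. It is an illustration under that observed pattern; the names used are hypothetical.

```python
# Channel letters in the order they appear in Table 72.
CHANNEL_LETTERS = "ABCDEFGHJKLMNPRT"

# (first block base, continuation block base, message text) per error kind.
DIMM_CODE_RANGES = [
    (0x8520, 0x85C0, "failed test/initialization"),
    (0x8540, 0x85D0, "disabled"),
    (0x8560, 0x85E0, "encountered a Serial Presence Detection (SPD) failure"),
]


def describe_dimm_post_code(code):
    """Return the Table 72 message for a DIMM POST error code, else None."""
    for low_base, high_base, message in DIMM_CODE_RANGES:
        if low_base <= code < low_base + 32:
            index = code - low_base        # DIMM_A1 .. DIMM_L2
        elif high_base <= code < high_base + 16:
            index = 32 + code - high_base  # DIMM_L3 .. DIMM_T3
        else:
            continue
        letter = CHANNEL_LETTERS[index // 3]  # three DIMM slots per channel
        return "DIMM_%s%d %s" % (letter, index % 3 + 1, message)
    return None  # not one of the per-DIMM codes


if __name__ == "__main__":
    for sample in (0x8523, 0x855F, 0x85C4, 0x8604):
        print("%04X" % sample, describe_dimm_post_code(sample))
```

Codes outside the six per-DIMM blocks, such as 8604, return None and still need to be looked up in the table directly.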
10. Chassis Subsystem
The BMC monitors several aspects of the chassis. In addition to logging when the power and reset buttons are pressed, the BMC
monitors chassis intrusion (if the chassis includes a chassis intrusion switch) and the network connections, logging an event
whenever the physical network link is lost.
10.1 Physical Security
Two sensors are included in the physical security subsystem: chassis intrusion and LAN leash lost.
10.1.1
Chassis Intrusion
Chassis Intrusion is monitored on supported chassis, and the BMC logs corresponding events when the chassis lid is opened and
closed.
10.1.2
LAN Leash Lost
The LAN Leash lost sensor monitors the physical connection on the on-board network ports. If a LAN Leash lost event is logged, this
means the network port lost its physical connection.
Table 73: Physical Security Sensor Typical Characteristics
Byte | Field                          | Description
11   | Sensor Type                    | 05h = Physical Security
12   | Sensor Number                  | 04h
13   | Event Direction and Event Type | [7] Event direction
     |                                |   0b = Assertion Event
     |                                |   1b = Deassertion Event
     |                                | [6:0] Event Type = 6Fh (Sensor Specific)
14   | Event Data 1                   | [7:6] – 00b = Unspecified Event Data 2
     |                                | [5:4] – 00b = Unspecified Event Data 3
     |                                | [3:0] – Event Trigger Offset as described in Table 74
15   | Event Data 2                   | Not used
16   | Event Data 3                   | Not used
Table 74: Physical Security Sensor Event Trigger Offset – Next Steps
Event Trigger Offset 00h – Chassis intrusion
Description:
Somebody has opened the chassis (or the chassis intrusion sensor is not connected).
Next Steps:
1. Use the Quick Start Guide and the Service Guide to determine whether the chassis intrusion switch is connected properly.
2. If this is the case, make sure it makes proper contact when the chassis is closed.
3. If this is also the case, someone has opened the chassis. Ensure nobody has access to the system that shouldn't.
Event Trigger Offset 04h – LAN leash lost
Description:
Someone has unplugged a LAN cable that was present when the BMC initialized. This event gets logged when the electrical
connection on the NIC connector gets lost.
Next Steps:
This is most likely due to unplugging the cable but can also happen if there is an issue with the cable or switch.
1. Check the LAN cable and connector for issues.
2. Investigate switch logs where possible.
3. Ensure nobody has access to the server that shouldn't.
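Physical Security events carry their meaning in byte 13 (event direction and type) and in the low nibble of Event Data 1. The sketch below decodes those two bytes; the function name and return shape are hypothetical, and only the two offsets listed in Table 74 are named.

```python
# Event Trigger Offsets from Table 74.
PHYSICAL_SECURITY_OFFSETS = {
    0x00: "Chassis intrusion",
    0x04: "LAN leash lost",
}

SENSOR_SPECIFIC_EVENT_TYPE = 0x6F


def decode_physical_security(byte13, ed1):
    """Return (direction, event name) for a Physical Security SEL event."""
    direction = "Deassertion" if byte13 & 0x80 else "Assertion"  # bit [7]
    event_type = byte13 & 0x7F                                   # bits [6:0]
    if event_type != SENSOR_SPECIFIC_EVENT_TYPE:
        raise ValueError("unexpected event type %02Xh" % event_type)
    offset = ed1 & 0x0F                                          # ED1 [3:0]
    name = PHYSICAL_SECURITY_OFFSETS.get(offset, "unknown offset %02Xh" % offset)
    return direction, name


if __name__ == "__main__":
    # Hypothetical bytes: assertion of the LAN leash lost offset.
    print(decode_physical_security(0x6F, 0x04))
```

The same split of byte 13 into a direction bit and a 7-bit event type applies to the other sensor tables in this guide, with the event type value differing per sensor.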
10.2 FP (NMI) Interrupt
The BMC supports an NMI sensor for logging an event when a diagnostic interrupt is generated for the following cases:
- The front panel diagnostic interrupt button is pressed.
- The BMC receives an IPMI Chassis Control command that requests this action.
The front panel interrupt button (also referred to as NMI button) is a recessed button on the front panel that allows the user to force a
critical interrupt which causes a crash error or kernel panic.
Table 75: FP (NMI) Interrupt Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 13h = Critical Interrupt
12 | Sensor Number | 05h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 0h
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
10.2.1 FP (NMI) Interrupt – Next Steps
The purpose of this button is for diagnosing software issues – when a critical interrupt is generated the OS typically saves a memory
dump. This allows for exact analysis of what is going on in system memory, which can be useful for software developers, or for
troubleshooting OS, software, and driver issues.
If this button was not actually pressed, you should ensure there is no physical fault with the front panel.
This event only gets logged if a user pressed the NMI button or sent an IPMI Chassis Control command requesting this action, and
although it causes the OS to crash, is not an error.
10.3 Button Sensor
The BMC logs when the front panel power and reset buttons get pressed. This is purely for informational purposes and these events
do not indicate errors.
Table 76: Button Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 14h = Button/Switch
12 | Sensor Number | 09h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset: 0h = Power Button, 2h = Reset Button
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
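The byte layout in Table 76 can be decoded mechanically. The following Python sketch is illustrative only: the function name `decode_button_event` and the dictionary keys are assumptions made for this example and are not part of any Intel utility. Only the bit positions and offset values are taken from the table.

```python
# Hypothetical helper; names are illustrative, bit layout follows Table 76.
BUTTON_OFFSETS = {0x0: "Power Button", 0x2: "Reset Button"}


def decode_button_event(event_dir_type: int, event_data1: int) -> dict:
    """Decode SEL record bytes 13 and 14 of a Button sensor event."""
    return {
        # Byte 13, bit 7: 0b = assertion, 1b = deassertion.
        "direction": "deassertion" if event_dir_type & 0x80 else "assertion",
        # Byte 13, bits 6:0: event type (6Fh = sensor specific).
        "event_type": event_dir_type & 0x7F,
        # Byte 14, bits 3:0: event trigger offset.
        "button": BUTTON_OFFSETS.get(event_data1 & 0x0F, "unknown"),
    }


if __name__ == "__main__":
    # Byte 13 = 6Fh (assertion, sensor specific), byte 14 = 02h (reset button).
    print(decode_button_event(0x6F, 0x02))
```

The same masking of byte 13 (direction bit and event type) applies to the other sensor tables in this guide; only the offset lookup changes per sensor.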
11. Miscellaneous Events
The miscellaneous events section addresses sensors not easily grouped with other sensor types.
11.1 IPMI Watchdog
PCSD server systems support an IPMI watchdog timer, which can check to see whether the OS is still responsive. The timer is
disabled by default, and has to be enabled manually. It then requires an IPMI-aware utility in the operating system that will reset the
timer before it expires. If the timer does expire, the BMC can take action if it is configured to do so (reset, power down, power cycle,
or generate a critical interrupt).
Table 77: IPMI Watchdog Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 23h = Watchdog 2
12 | Sensor Number | 03h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 11b = Sensor-specific event extension code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 78
15 | Event Data 2 | [7:4] – Interrupt type: 0h = None, 1h = SMI, 2h = NMI, 3h = Messaging Interrupt, Fh = Unspecified, all other = Reserved; [3:0] – Timer use at expiration: 0h = Reserved, 1h = BIOS FRB2, 2h = BIOS/POST, 3h = OS Load, 4h = SMS/OS, 5h = OEM, Fh = Unspecified, all other = Reserved
16 | Event Data 3 | Not used
Table 78: IPMI Watchdog Sensor Event Trigger Offset – Next Steps
Hex | Event Trigger Offset Description
00h | Timer expired, status only
01h | Hard reset
02h | Power down
03h | Power cycle
08h | Timer interrupt
Description (all offsets): Our server systems support a BMC watchdog timer, which can check to see whether the OS is still responsive. The timer is disabled by default, and has to be enabled manually. It then requires an IPMI-aware utility in the operating system that will reset the timer before it expires. If the timer does expire, the BMC can take action if it is configured to do so (reset, power down, power cycle, or generate a critical interrupt).
Next Steps (all offsets): If this event is being logged, it is because the BMC has been configured to check the watchdog timer.
1. Make sure you have support for this in your OS (typically using a third-party IPMI-aware utility such as ipmitool or ipmiutil along with the OpenIPMI driver).
2. If this is the case, it is likely your OS has hung, and you need to investigate OS event logs to determine what may have caused this.
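As an illustration of how Event Data 1 and Event Data 2 of a watchdog event combine, the following Python sketch maps the raw bytes to the names listed in Tables 77 and 78. The function name `decode_watchdog_event` and the result keys are assumptions for this example. Offsets that the tables do not list are reported as "Not listed" rather than guessed.

```python
# Hypothetical decoder; lookup values are copied from Tables 77 and 78.
WATCHDOG_ACTIONS = {
    0x0: "Timer expired, status only",
    0x1: "Hard reset",
    0x2: "Power down",
    0x3: "Power cycle",
    0x8: "Timer interrupt",
}
INTERRUPT_TYPES = {
    0x0: "None", 0x1: "SMI", 0x2: "NMI",
    0x3: "Messaging Interrupt", 0xF: "Unspecified",
}
TIMER_USES = {
    0x1: "BIOS FRB2", 0x2: "BIOS/POST", 0x3: "OS Load",
    0x4: "SMS/OS", 0x5: "OEM", 0xF: "Unspecified",
}


def decode_watchdog_event(event_data1: int, event_data2: int) -> dict:
    """Decode SEL record bytes 14 and 15 of an IPMI Watchdog event."""
    return {
        # Event Data 1, bits 3:0: event trigger offset (Table 78).
        "action": WATCHDOG_ACTIONS.get(event_data1 & 0x0F, "Not listed"),
        # Event Data 2, bits 7:4: interrupt type.
        "interrupt_type": INTERRUPT_TYPES.get((event_data2 >> 4) & 0x0F, "Reserved"),
        # Event Data 2, bits 3:0: timer use at expiration.
        "timer_use": TIMER_USES.get(event_data2 & 0x0F, "Reserved"),
    }


if __name__ == "__main__":
    # C1h: bits 7:6 = 11b, offset 1h; 04h: no interrupt, SMS/OS timer use.
    print(decode_watchdog_event(0xC1, 0x04))
```

A "Hard reset" result with timer use "SMS/OS" is the combination to expect when an OS-level watchdog daemon stopped resetting the timer, which matches the hung-OS case described in the next steps.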
11.2 SMI Timeout
SMI stands for system management interrupt and is an interrupt that gets generated so the processor can service server
management events (typically memory or PCI errors, or other forms of critical interrupts), in order to log them to the SEL. If this
interrupt times out, the system is frozen. The BMC will reset the system after logging the event.
Table 79: SMI Timeout Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | F3h = SMI Timeout
12 | Sensor Number | 06h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 03h (“digital” Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 1h = State Asserted
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
11.2.1 SMI Timeout – Next Steps
This event normally only occurs after another more critical event.
1. Check the SEL for any critical interrupts, memory errors, bus errors, PCI errors, or any other serious errors.
2. If these are not present, the system locked up before it was able to log the original issue. In this case, low level debug is normally
required.
11.3 System Event Log Cleared
The BMC logs a SEL clear event. This is only ever the first event in the SEL. The event is caused either by a manual SEL clear using
selview or some other IPMI-aware utility, or by the SEL clear performed in the factory as one of the last steps in the manufacturing process.
This is an informational event only.
Table 80: System Event Log Cleared Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 10h = Event Logging Disabled
12 | Sensor Number | 07h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 2h = Log area reset/cleared
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
11.4 System Event – PEF Action
The BMC is configurable to send alerts for events logged into the SEL. These alerts are called Platform Event Filters (PEF) and are
disabled by default. The user must configure and enable this feature. PEF events are logged if the BMC takes action due to a PEF
configuration. The BMC event triggering the PEF action will also be in the SEL.
This is functionality built into the BMC to allow it to send alerts (SNMP or other) for any event that gets logged to the SEL. PEF filters
are turned off by default and have to be enabled manually using Intel® deployment assistant, Intel® syscfg utility, or an IPMI-aware
utility.
Table 81: System Event – PEF Action Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 12h = System Event
12 | Sensor Number | 08h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 11b = Sensor-specific event extension code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 4h = PEF Action
15 | Event Data 2 | [7:6] – Reserved; [5] – 1b = Diagnostic Interrupt (NMI); [4] – 1b = OEM action; [3] – 1b = Power cycle; [2] – 1b = Reset; [1] – 1b = Power off; [0] – 1b = Alert
16 | Event Data 3 | Not used
11.4.1 System Event – PEF Action – Next Steps
This event gets logged if the BMC takes an action due to PEF configuration. Actions can be sending an alert, along with possibly
resetting, power cycling, or powering down the system. There will be another event that has led to the action so you need to
investigate the SEL and PEF settings to identify this event, and troubleshoot accordingly.
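Because Event Data 2 of a PEF Action event is a bit mask, several actions can be reported in one record. The following Python sketch lists the actions whose bits are set. The function name `decode_pef_actions` is an assumption for this example; the bit assignments are those of Table 81.

```python
# Hypothetical helper; bit assignments are copied from Table 81.
PEF_ACTION_BITS = {
    0: "Alert",
    1: "Power off",
    2: "Reset",
    3: "Power cycle",
    4: "OEM action",
    5: "Diagnostic Interrupt (NMI)",
}


def decode_pef_actions(event_data2: int) -> list:
    """Return the PEF actions flagged in Event Data 2, lowest bit first."""
    return [
        name
        for bit, name in sorted(PEF_ACTION_BITS.items())
        if event_data2 & (1 << bit)
    ]


if __name__ == "__main__":
    # 05h = bits 0 and 2 set: an alert was sent and the system was reset.
    print(decode_pef_actions(0x05))
```

Bits 7:6 are reserved in the table, so the sketch ignores them instead of reporting them.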
11.5 BMC Watchdog Sensor
The BMC supports an IPMI sensor to report that a BMC reset has occurred due to an action taken by the BMC Watchdog feature. A
SEL event will be logged whenever either the BMC FW stack is reset or the BMC CPU itself is reset.
Table 82: BMC Watchdog Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 28h = Management Subsystem Health
12 | Sensor Number | 0Ah
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 03h (“digital” Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 1h = State Asserted
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
11.5.1 BMC Watchdog Sensor – Next Steps
A SEL event will be logged whenever either the BMC FW stack is reset or the BMC CPU itself is reset.
1. Check the SEL for any other events around the time of the failure.
2. Take note of all IPMI activity that was occurring around the time of the failure. Capture a System BMC Debug Log as soon as you
can after experiencing this failure. This log can be captured from the Integrated BMC Web Console or by using the Intel® Syscfg
utility (syscfg /sbmcdl private filename.zip). Send the log file to your system manufacturer or Intel representative for failure
analysis.
11.6 BMC FW Health Sensor
The BMC tracks the health of each of its IPMI sensors and reports failures by providing a “BMC FW Health” sensor of the IPMI 2.0
sensor type Management Subsystem Health with support for the Sensor Failure offset. Only assertions will be logged into the SEL
for the Sensor Failure offset. The BMC Firmware Health sensor asserts for any sensor when 10 consecutive sensor errors are read.
These are not standard sensor events (that is, threshold crossings or discrete assertions). These are BMC Hardware Access Layer
(HAL) errors such as I2C NAKs or internal errors while attempting to read a register. If a successful sensor read is completed, the
counter resets to zero.
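The counting rule described above (ten consecutive read errors assert the sensor, and any successful read resets the counter) can be sketched as follows. This Python model is an illustration only: the class name `SensorReadTracker` and its method are assumptions, and the sketch does not reproduce the actual BMC firmware. In particular, it reports the failure only on the tenth consecutive error, because the text does not say what happens on later errors.

```python
# Illustrative model of the consecutive-error rule; not BMC firmware code.
FAILURE_THRESHOLD = 10  # consecutive read errors, as stated in the text


class SensorReadTracker:
    """Track consecutive HAL read errors per sensor number."""

    def __init__(self) -> None:
        self.consecutive_errors = {}

    def record(self, sensor_number: int, read_ok: bool) -> bool:
        """Record one read attempt.

        Returns True only when this attempt is the one that reaches the
        threshold, which is where a Sensor Failure assertion would be logged.
        """
        if read_ok:
            # A successful read resets the counter to zero.
            self.consecutive_errors[sensor_number] = 0
            return False
        count = self.consecutive_errors.get(sensor_number, 0) + 1
        self.consecutive_errors[sensor_number] = count
        return count == FAILURE_THRESHOLD


if __name__ == "__main__":
    tracker = SensorReadTracker()
    results = [tracker.record(0x20, read_ok=False) for _ in range(10)]
    print(results.index(True) + 1)  # position of the asserting read
```

A single good read between errors restarts the count, so intermittent I2C NAKs do not assert the sensor under this rule.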
Table 83: BMC FW Health Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 28h = Management Subsystem Health
12 | Sensor Number | 10h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 11b = Sensor-specific event extension code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 4h = Sensor failure
15 | Event Data 2 | Sensor number of the failed sensor
16 | Event Data 3 | Not used
11.6.1 BMC FW Health Sensor – Next Steps
1. Check the SEL for any other events around the time of the failure.
2. Take note of all IPMI activity that was occurring around the time of the failure. Capture a System BMC Debug Log as soon as you
can after experiencing this failure. This log can be captured from the Integrated BMC Web Console or by using the Intel® Syscfg
utility (syscfg /sbmcdl private filename.zip). Send the log file to your system manufacturer or Intel representative for failure
analysis.
3. If the failure continues around a specific sensor, replace the board with that sensor.
11.7 Firmware Update Status Sensor
The BMC FW supports a single Firmware Update Status sensor. This sensor is used to generate SEL events related to update of
embedded firmware on the platform. This includes updates to the BMC, BIOS, and ME FW.
This sensor is an event-only sensor that is not readable. Event generation is only enabled for assertion events.
Table 84: Firmware Update Status Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 2Bh (Version Change)
12 | Sensor Number | 12h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 70h = OEM defined
14 | Event Data 1 | Event Trigger Offset: 00h = Update started, 01h = Update completed successfully, 02h = Update failure
15 | Event Data 2 | [Bits 7:4] Target of update: 0000b = BMC, 0001b = BIOS, 0010b = ME, all other values are reserved; [Bits 3:1] Target instance (zero-based); [Bits 0:0] Reserved
16 | Event Data 3 | Not used
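The Event Data 2 packing in Table 84 (target in bits 7:4, instance in bits 3:1) is easy to misread, so a short Python sketch may help. The function name `decode_fw_update_event` is an assumption for this example. The sketch also assumes that the event trigger offset occupies bits 3:0 of Event Data 1, as in the other tables of this guide; Table 84 itself does not state the bit range.

```python
# Hypothetical decoder; values are copied from Table 84.
UPDATE_STATUS = {
    0x0: "Update started",
    0x1: "Update completed successfully",
    0x2: "Update failure",
}
UPDATE_TARGET = {0b0000: "BMC", 0b0001: "BIOS", 0b0010: "ME"}


def decode_fw_update_event(event_data1: int, event_data2: int) -> dict:
    """Decode SEL record bytes 14 and 15 of a Firmware Update Status event."""
    return {
        # Assumption: the offset is in bits 3:0 of Event Data 1.
        "status": UPDATE_STATUS.get(event_data1 & 0x0F, "unknown"),
        # Event Data 2, bits 7:4: target of the update.
        "target": UPDATE_TARGET.get((event_data2 >> 4) & 0x0F, "reserved"),
        # Event Data 2, bits 3:1: zero-based target instance.
        "instance": (event_data2 >> 1) & 0x07,
    }


if __name__ == "__main__":
    # 02h / 22h: update failure, target ME (0010b), instance 1.
    print(decode_fw_update_event(0x02, 0x22))
```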
11.8 Add-In Module Presence Sensor
Some server boards provide dedicated slots for add-in modules/boards (for example, SAS, IO, and PCIe-riser). For these boards the
BMC provides an individual presence sensor to indicate whether the module/board is installed.
Table 85: Add-In Module Presence Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 15h = Module/Board
12 | Sensor Number | 0Eh = IO Module Presence; 0Fh = SAS Module Presence; 13h = IO Module2 Presence
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 08h (“digital” discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset: 0h = Device Removed / Device Absent, 1h = Device Inserted / Device Present
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
11.8.1 Add-In Module Presence – Next Steps
If an unexpected device is removed or inserted, ensure that the module has been seated properly.
11.9 Intel® Xeon Phi™ Coprocessor Management Sensors
The Intel® Xeon® Processor E5 4600/2600/2400/1600 Product Families BMC supports limited manageability of the Intel® Xeon Phi™
Coprocessor adapter as described in this section. The Intel® Xeon Phi™ Coprocessor adapter uses the Many Integrated Core (MIC)
architecture and the sensors are referred to as MIC sensors.
For each manageable Intel® Xeon Phi™ Coprocessor adapter found in the system, the BMC automatically enables the associated
thermal margin sensors (0xC4-0xC7) and status sensors (0xA2, 0xA3, 0xA6, 0xA7).
11.9.1 Intel® Xeon Phi™ Coprocessor (MIC) Thermal Margin Sensors
The management controller FW of the Intel® Xeon Phi™ Coprocessor adapter provides an IPMI sensor that is read to get the
temperature data. The BMC then instantiates its own version of this sensor, which is used for fan speed control.
The thermal margin sensor is the difference between the Core Temp sensor value and the TControl value reported by the Intel® Xeon
Phi™ Coprocessor adapter.
This sensor will not log events into the SEL.
11.9.2 Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors
Every time DC power is turned on, the BMC checks for Intel® Xeon Phi™ Coprocessor adapters installed in the system. All compatible
cards will be enabled for management. The status sensor is a direct copy of the status sensor reported by the Intel® Xeon Phi™
Coprocessor adapter.
Table 86: MIC Status Sensors – Typical Characteristics
Byte | Field | Description
11 | Sensor Type | C0h = OEM defined
12 | Sensor Number | A2h = MIC 1 Status; A3h = MIC 2 Status; A6h = MIC 3 Status; A7h = MIC 4 Status
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 70h (OEM defined)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset: refer to the latest Intel® Xeon Phi™ Coprocessor Adapter specification
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
11.9.2.1 Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors Next Steps
Refer to the latest Intel® Xeon Phi™ Coprocessor Adapter specification for the next steps.
12. Hot-Swap Controller Backplane Events
All new PCSD Platforms Based on Intel® Xeon® Processor E5 4600/2600/2400/1600 Product Families backplanes follow a hybrid
architecture, in which the IPMI functionality previously supported in the HSC is integrated into the BMC FW.
12.1 HSC Backplane Temperature Sensor
There is a thermal sensor on the Hot-Swap Backplane to measure the ambient temperature.
Table 87: HSC Backplane Temperature Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 01h = Temperature
12 | Sensor Number | 29h = HSBP 1 Temp; 2Ah = HSBP 2 Temp; 2Bh = HSBP 3 Temp
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)
14 | Event Data 1 | [7:6] – 01b = Trigger reading in Event Data 2; [5:4] – 01b = Trigger threshold in Event Data 3; [3:0] – Event Trigger Offset as described in Table 88
15 | Event Data 2 | Reading that triggered event
16 | Event Data 3 | Threshold value that triggered event
Table 88: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps
Hex | Event Trigger Description | Assertion Severity | Deassert Severity | Description
00h | Lower non-critical going low | Degraded | OK | The temperature has dropped below its lower non-critical threshold.
02h | Lower critical going low | Non-fatal | Degraded | The temperature has dropped below its lower critical threshold.
07h | Upper non-critical going high | Degraded | OK | The temperature has gone over its upper non-critical threshold.
09h | Upper critical going high | Non-fatal | Degraded | The temperature has gone over its upper critical threshold.
Next Steps (all offsets):
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and the correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
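Tables 87 and 88 together determine how a backplane temperature event is interpreted: the offset selects the threshold that was crossed, and the event direction selects the assertion or deassertion severity. The Python sketch below illustrates this lookup. The function name `decode_hsbp_temp_event` and the result keys are assumptions for this example. The reading and threshold are returned as raw byte values; converting them to degrees requires the sensor's SDR conversion factors, which are outside the scope of this sketch.

```python
# Hypothetical decoder; offsets and severities are copied from Table 88.
THRESHOLD_OFFSETS = {
    0x0: ("Lower non-critical going low", "Degraded", "OK"),
    0x2: ("Lower critical going low", "Non-fatal", "Degraded"),
    0x7: ("Upper non-critical going high", "Degraded", "OK"),
    0x9: ("Upper critical going high", "Non-fatal", "Degraded"),
}


def decode_hsbp_temp_event(event_dir_type: int, event_data1: int,
                           event_data2: int, event_data3: int) -> dict:
    """Decode SEL record bytes 13 to 16 of an HSBP temperature event.

    Raises KeyError for offsets that Table 88 does not list.
    """
    description, assert_sev, deassert_sev = THRESHOLD_OFFSETS[event_data1 & 0x0F]
    asserted = not (event_dir_type & 0x80)  # byte 13, bit 7
    return {
        "description": description,
        "severity": assert_sev if asserted else deassert_sev,
        # Event Data 1 bits 7:6 = 01b: trigger reading is in Event Data 2.
        "raw_reading": event_data2 if (event_data1 >> 6) & 0b11 == 0b01 else None,
        # Event Data 1 bits 5:4 = 01b: trigger threshold is in Event Data 3.
        "raw_threshold": event_data3 if (event_data1 >> 4) & 0b11 == 0b01 else None,
    }


if __name__ == "__main__":
    # 01h = assertion, threshold type; 59h = reading and threshold present, offset 9h.
    print(decode_hsbp_temp_event(0x01, 0x59, 0x3C, 0x3A))
```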
12.2 Hard Disk Drive Monitoring Sensor
The new backplane design for PCSD Platforms Based on Intel® Xeon® Processor E5 4600/2600/2400/1600 Product Families moves
IPMI ownership of the HDD sensors to the BMC. Note that systems may have multiple storage backplanes. Hard Disk Drive status
monitoring is supported through disk status sensors owned by the BMC.
Table 89: Hard Disk Drive Monitoring Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 0Dh = Drive Slot (Bay)
12 | Sensor Number | 60h-68h = Hard Disk Drive 15-23 Status; F0h-FEh = Hard Disk Drive 0-14 Status
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 90
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
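The two sensor number ranges in Table 89 are not contiguous, so mapping a sensor number back to a drive index takes a small calculation. The Python sketch below assumes that numbering is linear within each range (F0h is drive 0, 60h is drive 15); the table gives only the range endpoints, so this is an inference. The function name `drive_index` is hypothetical.

```python
def drive_index(sensor_number: int) -> int:
    """Map a Hard Disk Drive status sensor number to a drive index.

    Assumes linear numbering inside each range listed in Table 89.
    """
    if 0xF0 <= sensor_number <= 0xFE:
        return sensor_number - 0xF0        # drives 0-14
    if 0x60 <= sensor_number <= 0x68:
        return sensor_number - 0x60 + 15   # drives 15-23
    raise ValueError(f"{sensor_number:#04x} is not an HDD status sensor")


if __name__ == "__main__":
    for number in (0xF0, 0xFE, 0x60, 0x68):
        print(f"{number:02X}h -> drive {drive_index(number)}")
```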
Table 90: Hard Disk Drive Monitoring Sensor – Event Trigger Offset – Next Steps
Hex | Event Trigger Description
00h | Drive Presence
01h | Drive Fault
07h | Rebuild/Remap in progress
Next Steps:
If during normal operation the state changes unexpectedly, ensure that the drive was seated properly and the drive carrier was properly latched. If that does not work, replace the drive.
If you have replaced a hard drive, this is expected.
If you have a hot spare and one of the drives failed, this is expected. Check logs for which drive has failed.
If this is seen unexpectedly, it could be an indication of a drive that is close to failing.
12.3 Hot-Swap Controller Health Sensor
The BMC supports an IPMI sensor to indicate the health of the Hot-Swap Controller (HSC). This sensor indicates that the
controller is offline when the BMC either cannot communicate with it or finds it stuck in a degraded state that the BMC
cannot recover to full operation through a firmware update.
Table 91: HSC Health Sensor Typical Characteristics
Byte | Field | Description
11 | Sensor Type | 16h = Microcontroller
12 | Sensor Number | 69h = Hot-Swap Controller 1 Status; 6Ah = Hot-Swap Controller 2 Status; 6Bh = Hot-Swap Controller 3 Status
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 0Ah (Discrete)
14 | Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 4h = Transition to offline
15 | Event Data 2 | Not used
16 | Event Data 3 | Not used
12.3.1 HSC Health Sensor – Next Steps
Ensure that all connections to the HSC are well seated.
Cross test with another HSC. If the issue remains with the HSC, replace the HSC, otherwise start cross testing all interconnections.
13. Manageability Engine (ME) Events
The Manageability Engine controls the PECI interface and also contains the Node Manager functionality.
13.1 ME Firmware Health Event
This sensor is used in Platform Event messages to the BMC containing health information including but not limited to firmware
upgrade and application errors.
Table 92: ME Firmware Health Event Sensor Typical Characteristics
Byte | Field | Description
8, 9 | Generator ID | 002Ch or 602Ch – ME Firmware
11 | Sensor Type | DCh = OEM
12 | Sensor Number | 17h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 75h (OEM)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3:0] – Health event type – 0h (Firmware Status)
15 | Event Data 2 | See Table 93
16 | Event Data 3 | See Table 93
13.1.1 ME Firmware Health Event – Next Steps
In the following table Event Data 3 is only noted for specific errors.
If the issue persists, provide the content of Event Data 3 to the Intel support team for interpretation. Event Data 3
codes are in general not documented, because their meaning varies, provides only some clues, and usually needs to be individually
interpreted.
Table 93: ME Firmware Health Event Sensor – Next Steps
ED2 | ED3 | Description | Next Steps
00h | | Recovery GPIO forced. Recovery Image loaded due to recovery MGPIO pin asserted. Pin number is configurable in factory presets. Default recovery pin is MGPIO1. | Deassert MGPIO1 and reset the Intel® ME.
01h | | Image execution failed. Recovery Image or backup operational image loaded because operational image is corrupted. This may be either caused by flash device corruption or failed upgrade procedure. | Either the flash device must be replaced (if error is persistent) or the upgrade procedure must be started again.
02h | | Flash erase error. Error during flash erasure procedure. | The flash device must be replaced.
03h | | Flash state information. | Check extended info byte in ED3 whether this is wear-out protection causing this event. If so just wait until wear-out protection expires, otherwise probably the flash device must be replaced (if error is persistent).
03h | 00h | Recovery bootloader image or factory presets image corrupted. |
03h | 01h | Flash erase limit has been reached. |
03h | 02h | Flash write limit has been reached; writing to flash has been disabled. |
03h | 03h | Writing to the flash has been enabled. |
04h | | Internal error. Error during firmware execution – FW Watchdog Timeout. | Operational image needs to be updated to other version or hardware board repair is needed (if error is persistent).
05h | | BMC did not respond to cold reset request and Intel® ME rebooted the platform. | Verify the Intel® Node Manager configuration.
06h | | Direct Flash update requested by the BIOS. Intel® ME firmware will switch to recovery mode to perform full update from the BIOS. | This is transient state. Intel® ME firmware will return to operational mode after successful image update performed by the BIOS.
07h | | Manufacturing error. Wrong manufacturing configuration detected by Intel® ME firmware. | The flash device must be replaced (if error is persistent).
07h | 04h | Intel® ME FW configuration is inconsistent or out of range. |
08h | | Persistent storage integrity error. Flash file system error detected. | If error is persistent, restore factory presets using “Force ME Recovery” IPMI command or by doing AC power cycle with Recovery jumper asserted.
09h | | Firmware Exception. | Restore factory presets using “Force ME Recovery” IPMI command or by doing AC power cycle with Recovery jumper asserted. If this does not clear the issue, reflash the SPI flash.
10h-FFh | | Reserved. |
13.2 Node Manager Exception Event
A Node Manager Exception Event is sent each time the power limit of a maintained policy is exceeded for longer than the Correction Time Limit.
Table 94: Node Manager Exception Sensor Typical Characteristics
Byte | Field | Description
8, 9 | Generator ID | 002Ch or 602Ch – ME Firmware
11 | Sensor Type | DCh = OEM
12 | Sensor Number | 18h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 72h (OEM)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3] – Node Manager Policy event: 0 – Reserved, 1 – Policy Correction Time Exceeded (policy did not meet the contract for the defined policy; the policy will continue to limit the power or shut down the platform based on the defined policy action); [2] – Reserved; [1:0] – 00b
15 | Event Data 2 | [7:4] – Reserved; [3:0] – Domain Id (currently supports only one domain, Domain 0)
16 | Event Data 3 | Policy Id
13.2.1 Node Manager Exception Event – Next Steps
This is an informational event. Next steps depend on the policy that was set. See the Node Manager Specification for more details.
13.3 Node Manager Health Event
A Node Manager Health Event message provides a runtime error indication about Intel® Intelligent Power Node Manager’s health.
Types of service that can send an error are defined as follows:
- Misconfigured policy
- Error reading power data
- Error reading inlet temperature
Table 95: Node Manager Health Event Sensor Typical Characteristics
Byte | Field | Description
8, 9 | Generator ID | 002Ch or 602Ch – ME Firmware
11 | Sensor Type | DCh = OEM
12 | Sensor Number | 19h
13 | Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 73h (OEM)
14 | Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3:0] – Health Event Type = 02h (Sensor Node Manager)
15 | Event Data 2 | [7:4] – Error type: 0-9 – Reserved, 10 – Policy Misconfiguration, 11 – Power Sensor Reading Failure, 12 – Inlet Temperature Reading Failure, 13 – Host Communication error, 14 – Real-time clock synchronization failure, 15 – Platform shutdown initiated by NM policy due to execution of action defined by Policy Exception Action; [3:0] – Domain Id
16 | Event Data 3 | If Error type = 10 or 15: <Policy Id>; if Error type = 11: <Power Sensor Address>; if Error type = 12: <Inlet Sensor Address>; otherwise set to 0
13.3.1 Node Manager Health Event – Next Steps
Misconfigured policy can happen if the max/min power consumption of the platform exceeds the values in policy due to hardware
reconfiguration.
The first occurrence of an unacknowledged event will be retransmitted no faster than every 300 milliseconds.
Real-time clock synchronization failure alert is sent when NM is enabled and capable of limiting power, but within 10 minutes the
firmware cannot obtain valid calendar time from the host side, so NM cannot handle suspend periods.
Next steps depend on the policy that was set. See the Node Manager Specification for more details.
13.4 Node Manager Operational Capabilities Change
This message provides a runtime error indication about Intel® Intelligent Power Node Manager’s operational capabilities. This applies
to all domains.
Assertion and deassertion of these events are supported.
Table 96: Node Manager Operational Capabilities Change Sensor Typical Characteristics
Byte   Field                  Description
8-9    Generator ID           002Ch or 602Ch – ME Firmware
11     Sensor Type            DCh = OEM
12     Sensor Number          1Ah
13     Event Direction and    [7] Event direction
       Event Type                 0b = Assertion Event
                                  1b = Deassertion Event
                              [6:0] Event Type = 74h (OEM)
14     Event Data 1           [7:6] – 00b = Unspecified Event Data 2
                              [5:4] – 00b = Unspecified Event Data 3
                              [3:0] – Current state of Operational Capabilities. Bit pattern:
                                  0 – Policy interface capability
                                      0 – Not Available
                                      1 – Available
                                  1 – Monitoring capability
                                      0 – Not Available
                                      1 – Available
                                  2 – Power limiting capability
                                      0 – Not Available
                                      1 – Available
15     Event Data 2           Not used
16     Event Data 3           Not used
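The capability bit pattern in Event Data 1 of Table 96 can be read as three independent flags. The sketch below shows one way to unpack them in Python; the function name and dictionary keys are hypothetical and chosen only for illustration.

```python
def decode_nm_capabilities(event_data1: int) -> dict:
    """Decode bits [2:0] of Event Data 1 (byte 14) from Table 96.

    Each bit reports whether a capability is available (1) or not (0).
    """
    return {
        "policy_interface": bool(event_data1 & 0x01),  # bit 0
        "monitoring": bool(event_data1 & 0x02),        # bit 1
        "power_limiting": bool(event_data1 & 0x04),    # bit 2
    }
```

A value of 03h, for instance, reports the policy interface and monitoring capabilities as available while power limiting is not.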
13.4.1 Node Manager Operational Capabilities Change – Next Steps
Policy interface available indicates that Intel® Intelligent Power Node Manager can respond to requests on the external interface to query and set Intel® Intelligent Power Node Manager policies. This is generally available as soon as the microcontroller is initialized.
Monitoring interface available indicates that Intel® Intelligent Power Node Manager has the capability to monitor power and temperature. This is generally available when firmware is operational.
Power limiting interface available indicates that Intel® Intelligent Power Node Manager can limit power, and implies that an ACPI-compliant OS is loaded (unless the OEM has indicated support for a non-ACPI-compliant OS).
The current value of an unacknowledged capability sensor is retransmitted no more often than every 300 milliseconds.
Next steps depend on the policy that was set. See the Node Manager Specification for more details.
13.5 Node Manager Alert Threshold Exceeded
A Policy Correction Time Exceeded event is sent each time the power limit maintained by a policy is exceeded for longer than the Correction Time Limit.
Table 97: Node Manager Alert Threshold Exceeded Sensor Typical Characteristics
Byte   Field                  Description
8-9    Generator ID           002Ch – ME Firmware
11     Sensor Type            DCh = OEM
12     Sensor Number          1Bh
13     Event Direction and    [7] Event direction
       Event Type                 0b = Assertion Event
                                  1b = Deassertion Event
                              [6:0] Event Type = 72h (OEM)
14     Event Data 1           [7:6] – 10b = OEM code in Event Data 2
                              [5:4] – 10b = OEM code in Event Data 3
                              [3] = Node Manager Policy event
                                  0 – Threshold exceeded
                                  1 – Policy Correction Time Exceeded – Policy did not meet the contract
                                      for the defined policy. The policy will continue to limit the power
                                      or shut down the platform based on the defined policy action.
                              [2] – Reserved
                              [1:0] – Threshold Number. Valid only if Event Data 1 bit [3] is set to 0.
                                  0 to 2 – Threshold index
15     Event Data 2           [7:4] – Reserved
                              [3:0] – Domain Id (Currently, supports only one domain, Domain 0)
16     Event Data 3           Policy ID
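As an illustration of Table 97, the sketch below separates the two event kinds carried in Event Data 1 and reports the threshold index only when it is valid. The function name decode_nm_alert and the dictionary keys are hypothetical.

```python
def decode_nm_alert(event_data1: int, event_data2: int, event_data3: int) -> dict:
    """Decode bytes 14-16 of a Node Manager Alert Threshold Exceeded record."""
    correction_time_exceeded = bool(event_data1 & 0x08)  # bit [3]
    result = {
        "event": ("Policy Correction Time Exceeded"
                  if correction_time_exceeded else "Threshold exceeded"),
        "domain_id": event_data2 & 0x0F,  # bits [3:0] of Event Data 2
        "policy_id": event_data3,
    }
    # The threshold number is valid only for a Threshold exceeded event.
    if not correction_time_exceeded:
        result["threshold_index"] = event_data1 & 0x03  # bits [1:0]
    return result
```

With Event Data 1 = A1h, the result is a Threshold exceeded event for threshold index 1; with A8h, it is a Policy Correction Time Exceeded event and no threshold index is reported.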
13.5.1 Node Manager Alert Threshold Exceeded – Next Steps
The first occurrence of an unacknowledged event is retransmitted no more often than every 300 milliseconds.
The first occurrence of a Threshold exceeded event assertion or deassertion is retransmitted no more often than every 300 milliseconds.
Next steps depend on the policy that was set. See the Node Manager Specification for more details.
14. Microsoft Windows* Records
With Microsoft Windows Server 2003* R2 and later versions, an Intelligent Platform Management Interface (IPMI) driver was added.
This added the capability of logging some OS events to the SEL. The driver can write multiple records to the SEL for the following events:
Boot-up
Shutdown
Bug Check / Blue Screen
14.1 Boot up Event Records
When the system boots into the Microsoft Windows* OS, two events can be logged. The first is a boot-up record and the second is an OEM event. Both records are informational only.
Table 98: Boot up Event Record Typical Characteristics
Byte   Field                  Description
8-9    Generator ID           0041h – System Software with an ID = 20h
11     Sensor Type            1Fh = OS Boot
12     Sensor Number          00h
13     Event Direction and    [7] Event direction
       Event Type                 0b = Assertion Event
                                  1b = Deassertion Event
                              [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1           [7:6] – 00b = Unspecified Event Data 2
                              [5:4] – 00b = Unspecified Event Data 3
                              [3:0] – Event Trigger Offset = 1h = C: boot completed
15     Event Data 2           Not used
16     Event Data 3           Not used
Table 99: Boot up OEM Event Record Typical Characteristics
Byte   Field                  Description
1-2    Record ID              ID used for SEL Record access
3      Record Type            [7:0] – DCh = OEM timestamped, bytes 8-16 OEM defined
4-7    Timestamp              Time when the event was logged. LS byte first.
8-10   IPMI Manufacturer ID   0137h (311d) = IANA enterprise number for Microsoft
11     Record ID              Sequential number reflecting the order in which the records are read.
                              The numbers start at 1 for the first entry in the SEL and continue
                              sequentially to n, the number of entries in the SEL.
12-15  Boot Time              Timestamp of when the system booted into the OS
16     Reserved               00h
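The 16-byte layout in Table 99 maps directly onto a fixed-offset parse. The following Python sketch uses only the standard library; the function name parse_boot_oem_record is hypothetical. Table 99 states the byte order only for the Timestamp field, so treating the Record ID, Manufacturer ID, and Boot Time as LS byte first is an assumption here.

```python
import struct


def parse_boot_oem_record(record: bytes) -> dict:
    """Split a 16-byte Boot up OEM SEL record into the fields of Table 99."""
    if len(record) != 16:
        raise ValueError("a SEL record is 16 bytes long")
    # Bytes 1-2 Record ID, byte 3 Record Type, bytes 4-7 Timestamp.
    record_id, record_type, timestamp = struct.unpack_from("<HBI", record, 0)
    return {
        "record_id": record_id,
        "record_type": record_type,
        "timestamp": timestamp,
        # Bytes 8-10: assumed LS byte first (not stated in Table 99).
        "manufacturer_id": int.from_bytes(record[7:10], "little"),
        # Byte 11: sequential record number.
        "sequence": record[10],
        # Bytes 12-15: assumed LS byte first (not stated in Table 99).
        "boot_time": int.from_bytes(record[11:15], "little"),
        "reserved": record[15],
    }
```

A record whose manufacturer_id decodes to 0137h and whose record_type is DCh matches the characteristics listed in Table 99.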
14.2 Shutdown Event Records
When the system shuts down from the Microsoft Windows* OS, multiple events can be logged. The first is an OS Stop/Shutdown Event Record; this can be followed by a shutdown reason code OEM record, and then zero or more shutdown comment OEM records. All of these records are informational only.
Table 100: Shutdown Reason Code Event Record Typical Characteristics
Byte   Field                  Description
8-9    Generator ID           0041h – System Software with an ID = 20h
11     Sensor Type            20h = OS Stop/Shutdown
12     Sensor Number          00h
13     Event Direction and    [7] Event direction
       Event Type                 0b = Assertion Event
                                  1b = Deassertion Event
                              [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1           [7:6] – 00b = Unspecified Event Data 2
                              [5:4] – 00b = Unspecified Event Data 3
                              [3:0] – Event Trigger Offset = 3h = OS Graceful Shutdown
15     Event Data 2           Not used
16     Event Data 3           Not used
Table 101: Shutdown Reason OEM Event Record Typical Characteristics
Byte   Field                  Description
1-2    Record ID              ID used for SEL Record access
3      Record Type            [7:0] – DDh = OEM timestamped, bytes 8-16 OEM defined
4-7    Timestamp              Time when the event was logged. LS byte first.
8-10   IPMI Manufacturer ID   0137h (311d) = IANA enterprise number for Microsoft
11     Record ID              Sequential number reflecting the order in which the records are read.
                              The numbers start at 1 for the first entry in the SEL and continue
                              sequentially to n, the number of entries in the SEL.
12-15  Shutdown Reason        Shutdown Reason code from the registry (LSB first):
                              HKLM/Software/Microsoft/Windows/CurrentVersion/Reliability/shutdown/ReasonCode
16     Reserved               00h
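The four Shutdown Reason bytes in Table 101 are stored LSB first, so they must be reassembled before the code can be interpreted. The sketch below assumes that the registry value uses the standard Windows system shutdown reason code layout (flags in the top bits, major reason in bits 16 to 23, minor reason in bits 0 to 15); this guide does not state that layout, and the function name is hypothetical.

```python
def decode_shutdown_reason(data: bytes) -> dict:
    """Decode bytes 12-15 of a Shutdown Reason OEM record (LSB first)."""
    if len(data) != 4:
        raise ValueError("the Shutdown Reason field is 4 bytes long")
    code = int.from_bytes(data, "little")
    return {
        "reason_code": code,
        # Assumed layout: Windows system shutdown reason code.
        "planned": bool(code & 0x80000000),       # SHTDN_REASON_FLAG_PLANNED
        "user_defined": bool(code & 0x40000000),  # SHTDN_REASON_FLAG_USER_DEFINED
        "major": (code >> 16) & 0xFF,
        "minor": code & 0xFFFF,
    }
```

For example, the bytes 03h 00h 02h 80h reassemble to 80020003h, which under this assumption is a planned shutdown with major reason 02h and minor reason 0003h.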
Table 102: Shutdown Comment OEM Event Record Typical Characteristics
Byte   Field                  Description
1-2    Record ID              ID used for SEL Record access
3      Record Type            [7:0] – DDh = OEM timestamped, bytes 8-16 OEM defined
4-7    Timestamp              Time when the event was logged. LS byte first.
8-10   IPMI Manufacturer ID   0137h (311d) = IANA enterprise number for Microsoft
                              0157h (343) = IANA enterprise number for Intel
                              The value logged depends on the Intelligent Management Bus Driver
                              (IMBDRV) that is loaded.
11     Record ID              Sequential number reflecting the order in which the records are read.
                              The numbers start at 1 for the first entry in the SEL and continue
                              sequentially to n, the number of entries in the SEL.
12-15  Shutdown Comment       Shutdown Comment from the registry (LSB first):
                              HKLM/Software/Microsoft/Windows/CurrentVersion/Reliability/shutdown/Comment
16     Reserved               00h
14.3 Bug Check / Blue Screen Event Records
When the system experiences a bug check (blue screen), multiple records will be written to the event log. The first is a Bug Check /
Blue Screen OS Stop/Shutdown Event Record; this can be followed by multiple Bug Check / Blue Screen code OEM records that will
contain the Bug Check / Blue Screen codes. This information can be used to determine what caused the failure.
Table 103: Bug Check / Blue Screen – OS Stop Event Record Typical Characteristics
Byte   Field                  Description
8-9    Generator ID           0041h – System Software with an ID = 20h
11     Sensor Type            20h = OS Stop/Shutdown
12     Sensor Number          00h
13     Event Direction and    [7] Event direction
       Event Type                 0b = Assertion Event
                                  1b = Deassertion Event
                              [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1           [7:6] – 00b = Unspecified Event Data 2
                              [5:4] – 00b = Unspecified Event Data 3
                              [3:0] – Event Trigger Offset = 1h = Runtime Critical Stop (that is, “core dump”, “blue screen”)
15     Event Data 2           Not used
16     Event Data 3           Not used
Table 104: Bug Check / Blue Screen code OEM Event Record Typical Characteristics
Byte   Field                    Description
1-2    Record ID                ID used for SEL Record access
3      Record Type              [7:0] – DEh = OEM timestamped, bytes 8-16 OEM defined
4-7    Timestamp                Time when the event was logged. LS byte first.
8-10   IPMI Manufacturer ID     0137h (311) = IANA enterprise number for Microsoft
                                0157h (343) = IANA enterprise number for Intel
                                The value logged depends on the Intelligent Management Bus Driver
                                (IMBDRV) that is loaded.
11     Sequence Number          Sequential number reflecting the order in which the records are read.
                                The numbers start at 1 for the first entry in the SEL and continue
                                sequentially to n, the number of entries in the SEL.
12-15  Bug Check / Blue Screen  The first record of this type contains the Bug Check / Blue Screen Stop
       Data                     code and is followed by the four Bug Check / Blue Screen parameters.
                                LSB first.
                                Note that each Bug Check / Blue Screen parameter requires two records.
                                The two records for a parameter have the same Record ID.
                                There are nine records in total.
16     Operating system type    00 = 32-bit OS
                                01 = 64-bit OS
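The nine-record sequence described in Table 104 (one Stop code record followed by two records for each of the four parameters) can be reassembled as sketched below. The table does not say how a parameter is split across its two records; this sketch assumes the first record holds the low 32 bits and the second the high 32 bits. The function name is hypothetical, and the input is the list of 4-byte data fields (bytes 12-15) in log order.

```python
def assemble_bug_check(data_fields):
    """Rebuild the Stop code and four parameters from nine 4-byte data fields."""
    if len(data_fields) != 9 or any(len(f) != 4 for f in data_fields):
        raise ValueError("expected nine 4-byte data fields")
    # Each field is stored LSB first.
    values = [int.from_bytes(f, "little") for f in data_fields]
    stop_code = values[0]
    # Assumption: low half in the first record of a pair, high half in the second.
    parameters = [values[i] | (values[i + 1] << 32) for i in range(1, 9, 2)]
    return {"stop_code": stop_code, "parameters": parameters}
```

The resulting Stop code and parameters correspond to the values shown on the blue screen and can be looked up in the Microsoft bug check documentation.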
15. Linux* Kernel Panic Records
The Open IPMI driver supports the ability to put semi-custom and custom events in the system event log if a panic occurs. If you
enable the “Generate a panic event to all BMCs on a panic” option, you will get one event on a panic in a standard IPMI event format.
If you enable the “Generate OEM events containing the panic string” option, you will also get a set of OEM events holding the panic
string.
Table 105: Linux* Kernel Panic Event Record Characteristics
Byte   Field                  Description
8-9    Generator ID           0021h – Kernel
10     EvM Rev                03h = IPMI 1.0 format
11     Sensor Type            20h = OS Stop/Shutdown
12     Sensor Number          The first byte of the panic string (0 if no panic string)
13     Event Direction and    [7] Event direction
       Event Type                 0b = Assertion Event
                                  1b = Deassertion Event
                              [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1           [7:6] – 10b = OEM code in Event Data 2
                              [5:4] – 10b = OEM code in Event Data 3
                              [3:0] – Event Trigger Offset = 1h = Runtime Critical Stop (a.k.a. “core dump”, “blue screen”)
15     Event Data 2           The second byte of the panic string
16     Event Data 3           The third byte of the panic string
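Because the Sensor Number, Event Data 2, and Event Data 3 fields in Table 105 carry the first three bytes of the panic string, a short prefix of the message can be recovered from the standard event alone. The sketch below is illustrative; the function name is hypothetical, and treating the bytes as ASCII terminated by a zero byte is an assumption.

```python
def panic_prefix(sensor_number: int, event_data2: int, event_data3: int) -> str:
    """Return the first (up to) three characters of the panic string."""
    raw = bytes([sensor_number, event_data2, event_data3])
    # Assumption: a zero byte marks the end of a short or missing string.
    raw = raw.split(b"\x00", 1)[0]
    return raw.decode("ascii", errors="replace")
```

For example, the bytes 4Bh, 65h, 72h yield the prefix "Ker", and a Sensor Number of 0 yields an empty string.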
Table 106: Linux* Kernel Panic String Extended Record Characteristics
Byte   Field                  Description
1-2    Record ID              ID used for SEL Record access
3      Record Type            [7:0] – F0h = OEM non-timestamped, bytes 4-16 OEM defined
4      Slave Address          The slave address of the card saving the panic
5      Sequence Number        A sequence number (starting at zero)
6-16   Kernel Panic Data      These bytes hold the panic string. If the panic string is longer than
                              11 bytes, multiple messages will be sent with increasing sequence
                              numbers.
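The extended records in Table 106 can be joined back into the full panic string by ordering them on the sequence number in byte 5 and concatenating the 11 data bytes of each record. The following Python sketch is illustrative; the function name is hypothetical, and it assumes that unused data bytes in the last record are zero-filled and that the string is ASCII.

```python
def assemble_panic_string(records) -> str:
    """Join the Kernel Panic Data of 16-byte F0h OEM records into one string."""
    chunks = {}
    for rec in records:
        # Skip anything that is not a panic string extended record.
        if len(rec) != 16 or rec[2] != 0xF0:
            continue
        sequence = rec[4]              # byte 5: sequence number, starting at zero
        chunks[sequence] = rec[5:16]   # bytes 6-16: 11 bytes of panic data
    data = b"".join(chunks[seq] for seq in sorted(chunks))
    # Assumption: the final record is padded with zero bytes.
    return data.rstrip(b"\x00").decode("ascii", errors="replace")
```

Sorting on the sequence number means the records do not have to be supplied in log order.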