h1.Fault Management for System Administrators

* [Solaris Fault Management Overview|#overview]
* [Fault Notification|#notification]
* [Displaying Faults|#DisplayingFaults]
** [fmadm faulty|#faulty]
** [fmdump|#dump]
* [Repairing Faults|#repair]
** [fmadm repaired|#repaired]
** [fmadm replaced|#replaced]
** [fmadm acquit|#acquit]
* [Fault Management Administration Example|#example]
* [Fault Management Log Files|#logfiles]
* [Fault Statistics|#statistics]
{anchor:overview}
h2.Solaris Fault Management Overview

Solaris Fault Management provides a new architecture for building resilient error handlers, structured error telemetry, automated diagnostic software, response agents, and structured messaging. Many parts of the Solaris software stack participate in Fault Management, including the CPU, memory and I/O subsystems, Solaris ZFS, an increasing set of device drivers, and other management stacks.
At a high level, the Fault Management stack consists of error detectors, diagnosis engines, and response agents. Error detectors, as the name suggests, detect errors in the system and perform any immediate, required handling. The error detectors issue well-defined error reports, or ereports, to a diagnosis engine. A diagnosis engine interprets ereports and determines whether a fault is present in the system. When such a determination is made, the diagnosis engine issues a suspect list that describes the field-replaceable unit (FRU) or set of FRUs that might be the cause of the problem. Suspect lists are interpreted by response agents. A response agent attempts to take some action based on the suspect list. Responses include logging messages, taking CPU strands offline, retiring memory pages, and retiring I/O devices.
The error detectors, diagnosis engines, and response agents are connected by the Fault Manager daemon, [{{fmd}}(1M)|http://docs.sun.com/app/docs/doc/819-2240/fmd-1m?a=view], which acts as a multiplexor between the various components.
{code}
  error                       Fault                       Response
detectors ------------------> Mgmt --------------------> Agents
             ereport          Daemon     suspect list
                              |   ^
                              |   |
                     ereport  |   |  suspect list
                              |   |
                              v   |
                            diagnosis
                             engines
{code}
The Fault Manager daemon is itself a service under SMF ([Service Management Facility|http://www.opensolaris.org/os/community/smf/]) control. The service is enabled by default and controlled just like any other SMF service, as the following example shows:
{code}
# svcs fmd
STATE          STIME    FMRI
online         11:25:44 svc:/system/fmd:default
# svcadm disable fmd
# svcs svc:/system/fmd:default
STATE          STIME    FMRI
disabled       15:27:45 svc:/system/fmd:default
# svcadm enable fmd
# svcs fmd
STATE          STIME    FMRI
online         15:27:51 svc:/system/fmd:default
{code}
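A check of the service state can be scripted. The following is a minimal sketch, assuming a POSIX shell and {{svcs}} output in the three-column form shown in the example. The helper name {{fmd_state_from_svcs}} is hypothetical. It reads {{svcs}} output on standard input so that it can be tried without a live system.

```shell
#!/bin/sh
# Hypothetical helper: print the STATE column from `svcs fmd` output on stdin.
fmd_state_from_svcs() {
    # Skip the header line (STATE STIME FMRI) and print the first field
    # of any line whose FMRI names the fmd service.
    awk 'NR > 1 && $3 ~ /system\/fmd/ { print $1 }'
}

# Sample text copied from the example output.
sample='STATE          STIME    FMRI
disabled       15:27:45 svc:/system/fmd:default'

state=$(printf '%s\n' "$sample" | fmd_state_from_svcs)
echo "fmd is $state"    # prints "fmd is disabled"
```

On a live system, the same function could be fed directly, for example {{svcs fmd | fmd_state_from_svcs}}.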
{anchor:notification}
h2.Fault Notification
Often, the first interaction with the Fault Manager is a system message indicating that a fault has been diagnosed. Messages are sent to both the console and the {{/var/adm/messages}} file. All messages from the Fault Manager use the following format:
{code}
1 SUNW-MSG-ID: AMD-8000-AV, TYPE: Fault, VER: 1, SEVERITY: Major
2 EVENT-TIME: Tue May 13 15:00:02 PDT 2008
3 PLATFORM: Sun Ultra 20 Workstation, CSN: 0604FK401F, HOSTNAME: hexterra
4 SOURCE: eft, REV: 1.16
5 EVENT-ID: 04837324-f221-e7dc-f6fa-dc7d9420ea76
6 DESC: The number of errors associated with this CPU has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-AV for more information.
7 AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
8 IMPACT: Performance of this system may be affected.
9 REC-ACTION: Schedule a repair procedure to replace the affected CPU. Use fmdump -v -u <EVENT_ID> to identify the module.
{code}
When notified of a diagnosed fault, always consult the recommended [knowledge article|http://sun.com/msg/] for additional details. See line 6 above for an example. The knowledge article might contain additional actions that you or a service provider should take beyond those listed on line 9.
{info:title=Note}For historical and backward compatibility reasons, the REC-ACTION (line 9) often refers to the [{{fmdump}}(1M)|http://docs.sun.com/app/docs/doc/819-2240/fmdump-1m?a=view] command. However, in OpenSolaris 2008.11, the preferred method to display fault information and determine the FRUs involved is the {{fmadm faulty}} command. The [{{fmadm}}(1M)|http://docs.sun.com/app/docs/doc/819-2240/fmadm-1m?a=view] command is discussed in [Displaying Faults|#DisplayingFaults] below. {info}
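Finding the knowledge article for a message can be automated. The following is a minimal sketch, assuming a POSIX shell and that the article URL follows the {{http://sun.com/msg/<MSG-ID>}} pattern seen in the DESC line of the sample message. The helper name {{knowledge_urls}} is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read Fault Manager message text on stdin and print
# one knowledge article URL for each SUNW-MSG-ID line found.
knowledge_urls() {
    sed -n 's|.*SUNW-MSG-ID: \([A-Za-z0-9-]*\),.*|http://sun.com/msg/\1|p'
}

# Sample text copied from the first two lines of the message format.
sample='SUNW-MSG-ID: AMD-8000-AV, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Tue May 13 15:00:02 PDT 2008'

printf '%s\n' "$sample" | knowledge_urls    # prints http://sun.com/msg/AMD-8000-AV
```

Because the pattern tolerates any prefix before {{SUNW-MSG-ID}}, the function could also be pointed at the messages file, for example {{knowledge_urls < /var/adm/messages}}.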
Fault Manager fault events can also be plugged into a Simple Network Management Protocol (SNMP) monitoring system. One of the response agents is the {{snmp-trapgen}} module. This module requires the System Management Agent (SMA), which is part of the Solaris freeware packages. Configuration of traps is straightforward, with typical modifications to the {{/etc/sma/snmp/snmp.conf}} file. Fault Management also provides a Management Information Base (MIB) plug-in for use with SMA, which resides in {{/usr/lib/fm/`isainfo -k`/libfmd_snmp.so.1}}. Refer to the following pages for details on SNMP and FMA setup:
* [Fun with the FMA and SNMP|http://blogs.sun.com/pmonday/entry/fun_with_the_fma_and], by Paul Monday
* [A louder voice for the fault manager|http://blogs.sun.com/wesolows/entry/a_louder_voice_for_the], by Keith Wesolowski
{anchor:DisplayingFaults}
h2.Displaying Faults
The preferred method to display fault information and determine the FRUs involved is the {{fmadm faulty}} command. However, the {{fmdump}} command is still supported.
{anchor:faulty}
h3.fmadm faulty
The {{fmadm faulty}} command is used to display any faulty components in the system, as shown in the following example:
{code}
1  # fmadm faulty
2  --------------- ------------------------------------ ----------- ---------
3  TIME            EVENT-ID                             MSG-ID      SEVERITY
4  --------------- ------------------------------------ ----------- ---------
5  May 13 15:00:02 04837324-f221-e7dc-f6fa-dc7d9420ea76 AMD-8000-AV Major
6
7  Fault class : fault.cpu.amd.dcachedata
8  Affects     : cpu:///cpuid=0
9                    degraded but still in service
10 FRU         : "CPU 0" (hc://:product-id=Sun-Ultra-20-Workstation:chassis-id=0604FK401F:server-id=hexterra/motherboard=0/chip=0)
11                   faulty
12
13 Description : The number of errors associated with this CPU has exceeded
14               acceptable levels. Refer to http://sun.com/msg/AMD-8000-AV
15               for more information.
16
17 Response    : An attempt will be made to remove this CPU from service.
18
19 Impact      : Performance of this system may be affected.
20
21 Action      : Schedule a repair procedure to replace the affected CPU.
22               Use fmdump -v -u <EVENT_ID> to identify the module.
{code}
Of primary interest is line 10, which shows the data for the impacted FRUs. The more human-readable location string is presented in quotation marks, "CPU 0" in the preceding example. The quoted value is intended to match the label on the physical hardware. The FRU is also represented in a Fault Management Resource Identifier (FMRI) format, which includes descriptive properties about the system containing the fault, such as its host name and chassis serial number. On platforms that support it, the part number and serial number of the FRU are also included in the FRU's FMRI.
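The quoted FRU label can be pulled out of the command output with a short filter. The following is a minimal sketch, assuming a POSIX shell and that each FRU line carries the label in quotation marks followed by an {{hc:}} FMRI in parentheses, as in the example. The helper name {{fru_labels}} is hypothetical, and the sample text omits the reference line numbers used in this article.

```shell
#!/bin/sh
# Hypothetical helper: print each quoted FRU label found in
# `fmadm faulty` output supplied on stdin.
fru_labels() {
    # Match a quoted string that is immediately followed by " (hc:".
    sed -n 's/.*"\([^"]*\)" (hc:.*/\1/p'
}

# Sample text copied from the Affects and FRU lines of the example.
sample='Affects     : cpu:///cpuid=0
                  degraded but still in service
FRU         : "CPU 0" (hc://:product-id=Sun-Ultra-20-Workstation:chassis-id=0604FK401F:server-id=hexterra/motherboard=0/chip=0)
                  faulty'

printf '%s\n' "$sample" | fru_labels    # prints CPU 0
```

On a live system the input would come from the command itself, for example {{fmadm faulty | fru_labels}}.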
The Affects line (lines 8 and 9) indicates the components that are impacted by the fault and their relative state. In this example, a single CPU strand is impacted. It is "degraded," which means it has not been taken offline by the system. This machine is a single-CPU system, and the last CPU in a system cannot be taken offline. Another reason that an attempt to offline a CPU might fail is if real-time threads are bound to the affected CPU.
If this were a multiprocessor system, one could expect the affected CPU to be taken offline by the operating system, as shown in the following example:
{code}
# psrinfo
0       faulted   since 05/13/2008 12:55:26
1       on-line   since 05/12/2008 11:47:26
{code}
The faulted state indicates that the processor has been taken offline by a Fault Management response agent.
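A script can pick out faulted processors from this output. The following is a minimal sketch, assuming a POSIX shell and the {{psrinfo}} layout shown above, where the first field is the processor ID and the second is its state. The helper name {{faulted_cpus}} is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the ID of every processor that psrinfo
# reports as "faulted". Reads psrinfo output on stdin.
faulted_cpus() {
    awk '$2 == "faulted" { print $1 }'
}

# Sample text copied from the psrinfo example.
sample='0       faulted   since 05/13/2008 12:55:26
1       on-line   since 05/12/2008 11:47:26'

printf '%s\n' "$sample" | faulted_cpus    # prints 0
```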
The {{fmadm faulty}} command output also combines some details from the console message into the output (lines 13-22), notably, the SEVERITY and the Action to take to address the fault.
Following the FRU description in the {{fmadm faulty}} command output, line 11 shows the state as "faulty." Other state values that you might see in other situations include "repair attempted" and "acquitted," as shown for SLOT 2 and SLOT 3 in the following example:
{code}
# fmadm faulty
--------------- ------------------------------------ -------------- -------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- -------
Sep 21 10:01:36 d482f935-5c8f-e9ab-9f25-d0aaafec1e6c PCIEX-8000-7J  Major

Fault class : fault.io.pciex.device_invreq
Affects     : dev:///pci@0,0/pci1022,7458@11/pci1000,3060@0
              dev:///pci@0,0/pci1022,7458@11/pci1000,3060@1
                  ok and in service
              dev:///pci@0,0/pci1022,7458@11/pci1000,3060@2
              dev:///pci@0,0/pci1022,7458@11/pci1000,3060@3
                  faulty and taken out of service
FRU         : "SLOT 2" (hc://.../pciexrc=3/pciexbus=4/pciexdev=0)
                  repair attempted
              "SLOT 3" (hc://.../pciexrc=3/pciexbus=4/pciexdev=1)
                  acquitted
              "SLOT 4" (hc://.../pciexrc=3/pciexbus=4/pciexdev=2)
                  not present
              "SLOT 5" (hc://.../pciexrc=3/pciexbus=4/pciexdev=3)
                  faulty

Description : A problem was detected for a PCIEX device.
              Refer to http://sun.com/msg/PCIEX-8000-7J for more information.

Response    : One or more device instances may be disabled

Impact      : Possible loss of services provided by the device instances
              associated with this fault

Action      : Schedule a repair procedure to replace the affected device.
{code}
{anchor:dump}
h3.fmdump
As mentioned above, some console messages and knowledge articles might instruct you to use the older {{fmdump -v -u UUID}} command to display fault information. While the {{fmadm faulty}} command is preferred, the {{fmdump}} command still operates, as shown in the following example:
{code}
1 # fmdump -v -u 04837324-f221-e7dc-f6fa-dc7d9420ea76
2 TIME                 UUID                                 SUNW-MSG-ID
3 May 13 15:00:02.2409 04837324-f221-e7dc-f6fa-dc7d9420ea76 AMD-8000-AV
4   100%  fault.cpu.amd.dcachedata
5
6         Problem in: hc://:product-id=Sun-Ultra-20-Workstation:chassis-id=0604FK401F:server-id=hexterra/motherboard=0/chip=0/cpu=0
7            Affects: cpu:///cpuid=0
8                FRU: hc://:product-id=Sun-Ultra-20-Workstation:chassis-id=0604FK401F:server-id=hexterra/motherboard=0/chip=0
9           Location: CPU 0
{code}
The information about the impacted FRUs is still present, although separated across two lines (lines 8 and 9). The Location string presents the human-readable FRU string, and the FRU line presents the formal FMRI. Note that the severity, descriptive text, and action are not shown with the {{fmdump}}(1M) command.
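The human-readable location can be extracted from {{fmdump -v}} output with a one-line filter. The following is a minimal sketch, assuming a POSIX shell and that the Location line is indented with spaces and has the form shown in the example. The helper name {{fmdump_location}} is hypothetical, and the sample text omits the reference line numbers used in this article.

```shell
#!/bin/sh
# Hypothetical helper: print the value of each "Location:" line found in
# `fmdump -v -u <uuid>` output supplied on stdin.
fmdump_location() {
    # Strip leading spaces, the "Location:" keyword, and following spaces.
    sed -n 's/^ *Location: *//p'
}

# Sample text copied from the Affects, FRU, and Location lines of the example.
sample='           Affects: cpu:///cpuid=0
               FRU: hc://:product-id=Sun-Ultra-20-Workstation:chassis-id=0604FK401F:server-id=hexterra/motherboard=0/chip=0
          Location: CPU 0'

printf '%s\n' "$sample" | fmdump_location    # prints CPU 0
```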
{anchor:repair}
h2.Repairing Faults
Once Fault Management has faulted a component in your system, you will want to repair it. A repair can happen in one of two ways: implicitly or explicitly.

An implicit repair can occur when the faulty component is replaced, provided the component has serial number information that the Fault Manager can track. On many of Sun's SPARC-based systems, serial number information is included in the FMRIs so that the Fault Manager can determine when components have been removed from operation, either through replacement or other means (for example, _blacklisting_). When such detections occur, the Fault Manager daemon will no longer display the affected resource in {{fmadm faulty}} output. The resource is maintained in the daemon's internal resource cache until the fault event is 30 days old, at which point it is purged.
Implicit repairs do not apply to all systems and are unlikely to occur on generic x86-based hardware. Note that in the previous fault example, while there is a {{chassis-id}} in the FMRIs, no FRU serial number information is available. The Fault Manager daemon would therefore not be able to detect a FRU replacement, so an explicit repair is required.
The {{fmadm repair}} command is used to explicitly mark a fault as repaired. It accepts a UUID, FMRI, or Location as an argument. For example:
{code}
# fmadm repair 04837324-f221-e7dc-f6fa-dc7d9420ea76
fmadm: recorded repair to 04837324-f221-e7dc-f6fa-dc7d9420ea76
{code}
In OpenSolaris 2008.11, the {{fmadm repair}} command has been replaced by the following four new {{fmadm}} commands:
{code}
fmadm repaired fmri | label
fmadm replaced fmri | label
fmadm acquit fmri | label
fmadm acquit uuid [ fmri | label ]
{code}
The {{fmadm repair}} command is retained as a synonym for the new {{fmadm repaired}} command for compatibility.
These four new commands behave in exactly the same way as {{fmadm repair}}, except that {{fmd}} remembers whether the FMRI or UUID has been repaired, replaced, or acquitted.
{info:title=Note}Although these four commands can take FMRIs and UUIDs as arguments, the preferred argument to use is the Label. If a FRU has multiple faults against it, you want to replace the FRU only one time. If you issue the {{fmadm replaced}} command against the Label, the FRU is marked as replaced in every outstanding case.{info}
{anchor:repaired}
h3.fmadm repaired
The {{fmadm repaired}} command should be used when some physical repair has been carried out that might resolve the problem. Examples of such repairs include reseating a card or straightening a bent pin.
{anchor:replaced}
h3.fmadm replaced
The {{fmadm replaced}} command should be used to indicate that the suspect FRU has been replaced.

If the system automatically discovers that a FRU has been replaced (the serial number has changed), then this discovery is treated in the same way as if {{fmadm replaced}} had been typed. The {{fmadm replaced}} command is not allowed if {{fmd}} can automatically confirm that the FRU has _not_ been replaced (the serial number has not changed).
If the system automatically discovers that a FRU has been removed but not replaced, then the current behaviour is unchanged: The suspect is displayed as "not present", but is not considered to be permanently removed until the {{rsrc.aged}} time has expired.
{anchor:acquit}
h3.fmadm acquit
Replacement takes precedence over repair and both replacement and repair take precedence over acquittal. Thus, you can acquit something and then subsequently repair it, but you cannot acquit something that has already been repaired.
A case is considered repaired (it moves into the FMD_CASE_REPAIRED state and a {{list.repaired}} event is generated) when either its UUID is acquitted or all suspects have been either repaired, replaced, removed, or acquitted.
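The rule for when a case counts as repaired can be expressed as a small predicate. The following is an illustrative model only, written for a POSIX shell. It is not {{fmd}} code, and the function name {{case_is_repaired}}, its argument convention, and the choice to treat an empty suspect list as not repaired are all assumptions made for the sketch.

```shell
#!/bin/sh
# Illustrative model only (not fmd code): decide whether a case counts as
# repaired. First argument: "yes" if the case UUID was acquitted, else "no".
# Remaining arguments: the current state of each suspect in the case.
case_is_repaired() {
    uuid_acquitted=$1
    shift
    # Acquitting the UUID resolves the case regardless of suspect states.
    [ "$uuid_acquitted" = "yes" ] && return 0
    # Assumption: with no suspects listed, report the case as not repaired.
    [ $# -gt 0 ] || return 1
    for state in "$@"; do
        case $state in
            repaired|replaced|removed|acquitted) ;;   # this suspect is resolved
            *) return 1 ;;                            # any other state keeps the case open
        esac
    done
    return 0
}

if case_is_repaired no replaced acquitted; then
    echo "case repaired"       # printed: every suspect is resolved
fi
if ! case_is_repaired no replaced faulty; then
    echo "case still open"     # printed: one suspect is still faulty
fi
```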
Typically you would only want to acquit by fmri/label if you determined that the resource was not guilty in all current cases in which it is a suspect. However, to allow a FRU to be manually acquitted in one case while remaining a suspect in all others, the following option allows you to specify both uuid and fmri/label:
{code}fmadm acquit uuid [ fmri | label ]{code}
{anchor:example}
h3.Fault Management Administration Example
The three most common uses of these four commands are expected to be the following:
{code}
fmadm acquit label
fmadm replaced label
fmadm repaired label
{code}
Consider the following case:
* The suspect list has two entries. One entry is for a FRU with label PCIE1. The other entry is for MB.
* The [knowledge article|http://sun.com/msg/] suggests replacing the FRU in PCIE1 but acquitting MB.
In this case, you should take the following actions:
# Replace the FRU in PCIE1.
# Enter the following commands:
{code}
fmadm replaced PCIE1
fmadm acquit MB
{code}
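When several FRUs are involved, it can help to generate the commands from a list of decisions and review them before running anything. The following is a minimal dry-run sketch, assuming a POSIX shell. The helper name {{fmadm_plan}} and its {{action:label}} argument format are hypothetical. The function only prints commands, and a label that contains spaces would need quoting before the printed command is run.

```shell
#!/bin/sh
# Hypothetical dry-run helper: turn "action:label" pairs into fmadm
# command lines. It only prints the commands; nothing is executed.
fmadm_plan() {
    for pair in "$@"; do
        action=${pair%%:*}    # text before the first colon
        label=${pair#*:}      # text after the first colon
        case $action in
            repaired|replaced|acquit)
                printf 'fmadm %s %s\n' "$action" "$label" ;;
            *)
                printf 'unknown action: %s\n' "$action" >&2
                return 1 ;;
        esac
    done
}

# The decisions from the case described above.
fmadm_plan replaced:PCIE1 acquit:MB
# prints:
#   fmadm replaced PCIE1
#   fmadm acquit MB
```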
{anchor:logfiles}
h2.Fault Management Log Files
An overview of the FMA log files is here:
* [Managing Fault Management Log Files|http://blogs.sun.com/sdaven/entry/fma_log_files], by Scott Davenport
{anchor:statistics}
h2.Fault Statistics
The Fault Manager daemon, {{fmd}}(1M), and many of its modules track statistics. The [{{fmstat}}(1M)|http://docs.sun.com/app/docs/doc/819-2240/fmstat-1m?a=view] command reports those statistics. Without options, {{fmstat}} gives a high-level overview of the events, processing times, and memory usage of the loaded modules. For example:
{code}
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            1       0  0.0  403.5   0   0     0     0   419b      0
disk-transport           0       0  0.0  500.6   0   0     0     0    32b      0
eft                      0       0  0.0    4.8   0   0     0     0   1.4M    43b
fmd-self-diagnosis       0       0  0.0    4.7   0   0     0     0      0      0
io-retire                0       0  0.0    4.5   0   0     0     0      0      0
snmp-trapgen             0       0  0.0    4.5   0   0     0     0    32b      0
sysevent-transport       0       0  0.0 1444.4   0   0     0     0      0      0
syslog-msgs              0       0  0.0    4.5   0   0     0     0      0      0
zfs-diagnosis            0       0  0.0    4.7   0   0     0     0      0      0
zfs-retire               0       0  0.0    4.5   0   0     0     0      0      0
{code}
The {{fmstat}}(1M) man page describes each column in this output. Note that the open and solve columns apply only to Fault Management cases, which are only created and solved by diagnosis engines. These columns are immaterial for other modules, such as response agents.
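On a system with many modules, a filter can narrow the overview to the modules that have received events. The following is a minimal sketch, assuming a POSIX shell and the column layout shown in the example, where the first field is the module name and the second is {{ev_recv}}. The helper name {{active_modules}} is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every module whose ev_recv
# count is greater than zero. Reads `fmstat` output on stdin.
active_modules() {
    # NR > 1 skips the header line; $2 is the ev_recv column.
    awk 'NR > 1 && $2 > 0 { print $1 }'
}

# Sample text copied from the first rows of the fmstat example.
sample='module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            1       0  0.0  403.5   0   0     0     0   419b      0
disk-transport           0       0  0.0  500.6   0   0     0     0    32b      0'

printf '%s\n' "$sample" | active_modules    # prints cpumem-retire
```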
Statistics on an individual module can also be displayed by using the {{-m _module_}} option. This syntax is commonly used with the {{-z}} option to suppress zero-valued statistics. For example:
{code}
# fmstat -z -m cpumem-retire
NAME     VALUE DESCRIPTION
cpu_flts     1 cpu faults resolved
{code}
This example shows that the {{cpumem-retire}} agent has successfully processed a request to take a CPU offline.