Fault Management

Fault Management for System Administrators

Solaris Fault Management Overview

Solaris Fault Management provides a new architecture for building resilient error handlers, structured error telemetry, automated diagnostic software, response agents, and structured messaging. Many parts of the Solaris software stack participate in Fault Management, including the CPU, memory and I/O subsystems, Solaris ZFS, an increasing set of device drivers, and other management stacks.

At a high level, the Fault Management stack consists of error detectors, diagnosis engines, and response agents. Error detectors, as the name suggests, detect errors in the system and perform any immediate, required handling. The error detectors issue well-defined error reports, or ereports, to a diagnosis engine. A diagnosis engine interprets ereports and determines whether a fault is present in the system. When such a determination is made, the diagnosis engine issues a suspect list that describes the field-replaceable unit (FRU) or set of FRUs that might be the cause of the problem. Suspect lists are interpreted by response agents. A response agent attempts to take some action based on the suspect list. Responses include logging messages, taking CPU strands offline, retiring memory pages, and retiring I/O devices.

The error detectors, diagnosis engines, and response agents are connected by the Fault Manager daemon, fmd(1M), which acts as a multiplexor between the various components.

error                           Fault                         Response
detectors ------------------>   Mgmt   -------------------->  Agents
              ereport           Daemon     suspect list
                                |    ^
                                |    |
                        ereport |    | suspect list
                                |    |
                                v    |
                                diagnosis
                                engines
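
The data flow in the diagram can be sketched as a toy model. The following Python is purely illustrative: the class names, the event dictionaries, the ereport class string, and the simple count threshold are all assumptions made for this sketch and do not reflect the real interfaces or diagnosis rules of fmd.

```python
from collections import Counter

class DiagnosisEngine:
    """Toy engine: reports a fault after `threshold` ereports for one resource."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = Counter()

    def receive(self, ereport):
        resource = ereport["resource"]
        self.counts[resource] += 1
        if self.counts[resource] == self.threshold:
            return {"suspects": [resource]}  # the suspect list
        return None

class FaultManager:
    """Toy multiplexor: routes ereports to engines, suspect lists to agents."""
    def __init__(self, engines, agents):
        self.engines = engines
        self.agents = agents

    def post_ereport(self, ereport):
        for engine in self.engines:
            suspect_list = engine.receive(ereport)
            if suspect_list is not None:
                for agent in self.agents:
                    agent(suspect_list)

actions = []

def logging_agent(suspect_list):
    """Toy response agent: records a message instead of acting on hardware."""
    actions.append("fault diagnosed: " + ", ".join(suspect_list["suspects"]))

fm = FaultManager([DiagnosisEngine(threshold=3)], [logging_agent])
for _ in range(3):
    # "ereport.cpu.example" is a made-up class name for illustration only.
    fm.post_ereport({"class": "ereport.cpu.example", "resource": "cpu:///cpuid=0"})
print(actions)
```

The point of the sketch is the separation of roles: detectors only report, engines only diagnose, and agents only respond, with the daemon in the middle passing events between them.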

The Fault Manager daemon is itself a service under SMF (Service Management Facility) control. The service is enabled by default and controlled just like any other SMF service, as the following example shows:

# svcs fmd 
STATE          STIME    FMRI 
online         11:25:44 svc:/system/fmd:default 
# svcadm disable fmd 
# svcs svc:/system/fmd:default 
STATE          STIME    FMRI 
disabled       15:27:45 svc:/system/fmd:default 
# svcadm enable fmd 
# svcs fmd 
STATE          STIME    FMRI 
online         15:27:51 svc:/system/fmd:default
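
For monitoring scripts, the svcs output shown above can be checked programmatically. The following is a minimal sketch that assumes the three-column layout from the example; the function name parse_svcs is hypothetical, and the sample text is copied from the transcript rather than read from a live system.

```python
def parse_svcs(output):
    """Parse default `svcs` output into dicts with state, stime, and fmri."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] == "STATE":
            continue  # skip blank lines and the header row
        rows.append({"state": fields[0], "stime": fields[1], "fmri": fields[2]})
    return rows

sample = """\
STATE          STIME    FMRI
online         11:25:44 svc:/system/fmd:default
"""

services = parse_svcs(sample)
fmd_online = any(
    s["fmri"] == "svc:/system/fmd:default" and s["state"] == "online"
    for s in services
)
print(fmd_online)
```

On a live system, the text passed to parse_svcs would come from running svcs fmd, for example through Python's subprocess module.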

Fault Notification

Often, the first interaction with the Fault Manager is a system message indicating that a fault has been diagnosed. Messages are sent to both the console and the /var/adm/messages file. All messages from the Fault Manager use the following format:

1    SUNW-MSG-ID: AMD-8000-AV, TYPE: Fault, VER: 1, SEVERITY: Major
2    EVENT-TIME: Tue May 13 15:00:02 PDT 2008
3    PLATFORM: Sun Ultra 20 Workstation, CSN: 0604FK401F, HOSTNAME: hexterra
4    SOURCE: eft, REV: 1.16
5    EVENT-ID: 04837324-f221-e7dc-f6fa-dc7d9420ea76
6    DESC: The number of errors associated with this CPU has exceeded
     acceptable levels.  Refer to http://sun.com/msg/AMD-8000-AV for more
     information.
7    AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
8    IMPACT: Performance of this system may be affected.
9    REC-ACTION: Schedule a repair procedure to replace the affected CPU.
     Use fmdump -v -u <EVENT_ID> to identify the module.

When notified of a diagnosed fault, always consult the recommended knowledge article for additional details. See line 6 above for an example. The knowledge article might contain additional actions that you or a service provider should take beyond those listed on line 9.

Note
For historical and backward compatibility reasons, the REC-ACTION (line 9) often refers to the fmdump(1M) command. However, in OpenSolaris 2008.11, the preferred method to display fault information and determine the FRUs involved is the fmadm faulty command. The fmadm(1M) command is discussed in Displaying Faults below.
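
Because the message format is regular, a script can split it into fields, for example to pull out the knowledge article URL. The sketch below assumes the field names seen in the sample message (other messages might carry different ones) and assumes the leading line numbers in the sample are annotations that do not appear in the real message. The function name parse_fault_message is hypothetical.

```python
import re

# Field names taken from the sample message; a real message may carry others.
KEYS = ["SUNW-MSG-ID", "TYPE", "VER", "SEVERITY", "EVENT-TIME", "PLATFORM",
        "CSN", "HOSTNAME", "SOURCE", "REV", "EVENT-ID", "DESC",
        "AUTO-RESPONSE", "IMPACT", "REC-ACTION"]
KEY_RE = re.compile(r"\b(" + "|".join(re.escape(k) for k in KEYS) + r"): ")

def parse_fault_message(text):
    """Split a fault message into a dict keyed by field name."""
    parts = KEY_RE.split(text)
    fields = {}
    # parts is [preamble, key, value, key, value, ...]
    for key, value in zip(parts[1::2], parts[2::2]):
        # Collapse wrapped lines and drop the comma that separates fields.
        fields[key] = " ".join(value.split()).rstrip(",")
    return fields

message = """\
SUNW-MSG-ID: AMD-8000-AV, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Tue May 13 15:00:02 PDT 2008
PLATFORM: Sun Ultra 20 Workstation, CSN: 0604FK401F, HOSTNAME: hexterra
SOURCE: eft, REV: 1.16
EVENT-ID: 04837324-f221-e7dc-f6fa-dc7d9420ea76
DESC: The number of errors associated with this CPU has exceeded
     acceptable levels.  Refer to http://sun.com/msg/AMD-8000-AV for more
     information.
AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU.
     Use fmdump -v -u <EVENT_ID> to identify the module.
"""

fields = parse_fault_message(message)
article = re.search(r"http://\S+", fields["DESC"]).group(0)
print(fields["SUNW-MSG-ID"], fields["SEVERITY"], article)
```

The EVENT-ID value extracted this way is the same UUID that appears in fmadm faulty output, so it can be used to correlate a console message with a diagnosed case.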

Fault Manager fault events can also be plugged into a Simple Network Management Protocol (SNMP) monitoring system. One of the response agents is the snmp-trapgen module. This module requires the System Management Agent (SMA), which is part of the Solaris freeware packages. Configuration of traps is straightforward, with typical modifications to the /etc/sma/snmp/snmp.conf file. Fault Management also provides a Management Information Base (MIB) plug-in for use with SMA, which resides in /usr/lib/fm/`isainfo -k`/libfmd_snmp.so.1. Refer to the following pages for details on SNMP and FMA setup:

Displaying Faults

The preferred method to display fault information and determine the FRUs involved is the fmadm faulty command. However, the fmdump command is still supported.

fmadm faulty

The fmadm faulty command is used to display any faulty components in the system, as shown in the following example:

 1    # fmadm faulty
 2    --------------- ------------------------------------ ----------- ---------
 3    TIME            EVENT-ID                             MSG-ID      SEVERITY
 4    --------------- ------------------------------------ ----------- ---------
 5    May 13 15:00:02 04837324-f221-e7dc-f6fa-dc7d9420ea76 AMD-8000-AV Major
 6
 7    Fault class : fault.cpu.amd.dcachedata 
 8    Affects     : cpu:///cpuid=0 
 9                    degraded but still in service
10    FRU         : "CPU 0" (hc://:product-id=Sun-Ultra-20-Workstation:
      chassis-id=0604FK401F:server-id=hexterra/motherboard=0/chip=0)
11                    faulty
12
13    Description : The number of errors associated with this CPU has exceeded
14                  acceptable levels.  Refer to http://sun.com/msg/AMD-8000-AV
15                  for more information.
16
17    Response    : An attempt will be made to remove this CPU from service.
18
19    Impact      : Performance of this system may be affected.
20
21    Action      : Schedule a repair procedure to replace the affected CPU.
22                  Use fmdump -v -u <EVENT_ID> to identify the module.

Of primary interest is line 10, which shows the data for the impacted FRUs. The more human-readable location string is presented in quotation marks, "CPU 0" in the preceding example. The quoted value is intended to match the label on the physical hardware. The FRU is also represented in a Fault Management Resource Identifier (FMRI) format, which includes descriptive properties about the system containing the fault, such as its host name and chassis serial number. On platforms that support it, the part number and serial number of the FRU are also included in the FRU's FMRI.
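
To make the structure of the FRU's FMRI concrete, the following sketch splits the hc-scheme FMRI from the example into its scheme, its system properties, and its hardware path. It assumes the layout visible in the example (colon-separated name=value properties, then slash-separated name=instance components). The function name parse_hc_fmri is hypothetical, and real FMRIs might contain elements this sketch does not handle.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI into (scheme, authority properties, path)."""
    scheme, rest = fmri.split("://", 1)
    authority_part, _, path_part = rest.partition("/")
    # Properties such as product-id and chassis-id are separated by colons.
    authority = dict(
        item.split("=", 1) for item in authority_part.split(":") if item
    )
    # The hardware path is a sequence of name=instance components.
    path = [tuple(comp.split("=", 1)) for comp in path_part.split("/") if comp]
    return scheme, authority, path

fmri = ("hc://:product-id=Sun-Ultra-20-Workstation:chassis-id=0604FK401F"
        ":server-id=hexterra/motherboard=0/chip=0")
scheme, authority, path = parse_hc_fmri(fmri)
print(scheme)
print(authority)
print(path)
```

In this example the authority holds the product, chassis serial number, and host name, while the path locates chip 0 on motherboard 0. No FRU serial number is present.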

The Affects line (lines 8 and 9) indicates the components that are impacted by the fault and their relative state. In this example, a single CPU strand is impacted. It is "degraded," which means it has not been taken offline by the system. In this example, this machine is a single CPU system, and the last CPU cannot be taken offline for obvious reasons. Another reason that an attempt to offline a CPU might fail is if real-time threads are bound to the affected CPU.

If this were a multiprocessor system, one could expect the affected CPU to be taken offline by the operating system, as shown in the following example:

# psrinfo 
0       faulted   since 05/13/2008 12:55:26 
1       on-line   since 05/12/2008 11:47:26 

The faulted state indicates that the processor has been taken offline by a Fault Management response agent.

The fmadm faulty command output also incorporates details from the console message: the SEVERITY (line 5) and the description, response, impact, and recommended action (lines 13-22).

Following the FRU description in the fmadm faulty command output, line 11 shows the state as "faulty." Other state values that you might see in other situations include "repair attempted" and "acquitted," as shown for SLOT 2 and SLOT 3 in the following example:

# fmadm faulty
 --------------- ------------------------------------  -------------- -------
 TIME            EVENT-ID                              MSG-ID         SEVERITY
 --------------- ------------------------------------  -------------- -------
 Sep 21 10:01:36 d482f935-5c8f-e9ab-9f25-d0aaafec1e6c  PCIEX-8000-7J  Major

 Fault class  : fault.io.pciex.device_invreq
 Affects      : dev:///pci@0,0/pci1022,7458@11/pci1000,3060@0
                dev:///pci@0,0/pci1022,7458@11/pci1000,3060@1
                  ok and in service
                dev:///pci@0,0/pci1022,7458@11/pci1000,3060@2
                dev:///pci@0,0/pci1022,7458@11/pci1000,3060@3
                  faulty and taken out of service
 FRU          : "SLOT 2" (hc://.../pciexrc=3/pciexbus=4/pciexdev=0)
                  repair attempted
                "SLOT 3" (hc://.../pciexrc=3/pciexbus=4/pciexdev=1)
                  acquitted
                "SLOT 4" (hc://.../pciexrc=3/pciexbus=4/pciexdev=2)
                  not present
                "SLOT 5" (hc://.../pciexrc=3/pciexbus=4/pciexdev=3)
                  faulty

 Description  : A problem was detected for a PCIEX device.
                Refer to http://sun.com/msg/PCIEX-8000-7J for more information.

 Response     : One or more device instances may be disabled

 Impact       : Possible loss of services provided by the device instances
                associated with this fault

 Action       : Schedule a repair procedure to replace the affected device.
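
When many FRUs are listed, it can help to extract each label and its state with a script. The following sketch assumes the layout of the example above, in which each quoted label sits on one line and its state on the next. It would need adjusting for output where the FMRI wraps onto a second line. The function name fru_states is hypothetical.

```python
import re

def fru_states(fmadm_output):
    """Map each quoted FRU label to the state printed on the line after it."""
    states = {}
    in_fru = False
    label = None
    for line in fmadm_output.splitlines():
        text = line.strip()
        if text.startswith("FRU"):
            in_fru = True
        elif not text or re.match(r"[A-Z][A-Za-z ]*:", text):
            in_fru = False  # a blank line or the next "Name :" field ends the list
        if not in_fru:
            continue
        quoted = re.search(r'"([^"]+)"', text)
        if quoted:
            label = quoted.group(1)
        elif label is not None:
            states[label] = text
            label = None
    return states

sample = """\
 Affects      : dev:///pci@0,0/pci1022,7458@11/pci1000,3060@0
                  ok and in service
 FRU          : "SLOT 2" (hc://.../pciexrc=3/pciexbus=4/pciexdev=0)
                  repair attempted
                "SLOT 3" (hc://.../pciexrc=3/pciexbus=4/pciexdev=1)
                  acquitted
                "SLOT 4" (hc://.../pciexrc=3/pciexbus=4/pciexdev=2)
                  not present
                "SLOT 5" (hc://.../pciexrc=3/pciexbus=4/pciexdev=3)
                  faulty

 Description  : A problem was detected for a PCIEX device.
"""

states = fru_states(sample)
still_faulty = [label for label, state in states.items() if state == "faulty"]
print(states)
print(still_faulty)
```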

fmdump

As mentioned above, some console messages and knowledge articles might instruct you to use the older fmdump -v -u UUID command to display fault information. While the fmadm faulty command is preferred, the fmdump command still operates, as shown in the following example:

1    # fmdump -v -u 04837324-f221-e7dc-f6fa-dc7d9420ea76
2    TIME                 UUID                                 SUNW-MSG-ID
3    May 13 15:00:02.2409 04837324-f221-e7dc-f6fa-dc7d9420ea76 AMD-8000-AV
4    100%  fault.cpu.amd.dcachedata 
5
6    Problem in:      hc://:product-id=Sun-Ultra-20-Workstation:
     chassis-id=0604FK401F:server-id=hexterra/motherboard=0/chip=0/cpu=0
7    Affects: cpu:///cpuid=0
8     FRU: hc://:product-id=Sun-Ultra-20-Workstation:
      chassis-id=0604FK401F:server-id=hexterra/motherboard=0/chip=0
9     Location: CPU 0
The information about the impacted FRUs is still present, although separated across two lines (lines 8 and 9). The Location string presents the human-readable FRU string, and the FRU line presents the formal FMRI. Note that the severity, descriptive text, and action are not shown with the fmdump(1M) command.

Repairing Faults

Once Fault Management has faulted a component in your system, you will want to repair it. A repair can happen in one of two ways: implicitly or explicitly.

An implicit repair can occur when the faulty component is replaced, provided the component has serial number information that the Fault Manager can track. On many of Sun's SPARC-based systems, serial number information is included in the FMRIs so that the Fault Manager can determine when components have been removed from operation, either through replacement or other means (for example, blacklisting). When such detections occur, the Fault Manager daemon no longer displays the affected resource in fmadm faulty output. The resource is maintained in the daemon's internal resource cache until the fault event is 30 days old, at which point it is purged.

Implicit repairs do not apply to all systems and are unlikely to occur on generic x86-based hardware. Note that in the previous fault example, while there is a chassis-id in the FMRIs, no FRU serial number information is available. The Fault Manager daemon would therefore be unable to detect a FRU replacement, and an explicit repair is required.

The fmadm repair command is used to explicitly mark a fault as repaired. It accepts a UUID, FMRI, or Location as an argument. For example:

# fmadm repair 04837324-f221-e7dc-f6fa-dc7d9420ea76 
fmadm: recorded repair to 04837324-f221-e7dc-f6fa-dc7d9420ea76 

In OpenSolaris 2008.11, the fmadm repair command has been replaced by the following four new fmadm commands:

fmadm repaired fmri | label
fmadm replaced fmri | label
fmadm acquit fmri | label
fmadm acquit uuid [ fmri | label ]

The fmadm repair command is retained as a synonym for the new fmadm repaired command for compatibility.

These four new commands behave in exactly the same way as fmadm repair, except that fmd remembers whether the FMRI or UUID has been repaired, replaced, or acquitted.

Note
Although these four commands can take FMRIs and UUIDs as arguments, the preferred argument to use is the Label. If a FRU has multiple faults against it, you want to replace the FRU only one time. If you issue the fmadm replaced command against the Label, the FRU is reflected as such in any outstanding cases.

fmadm repaired

The fmadm repaired command should be used when some physical repair has been carried out that might resolve the problem. Examples of such repairs include reseating a card or straightening a bent pin.

fmadm replaced

The fmadm replaced command should be used to indicate that the suspect FRU has been replaced.

If the system automatically discovers that a FRU has been replaced (the serial number has changed), then this discovery is treated in the same way as if fmadm replaced had been typed. The fmadm replaced command is not allowed if fmd can automatically confirm that the FRU has not been replaced (the serial number has not changed).

If the system automatically discovers that a FRU has been removed but not replaced, then the current behavior is unchanged: the suspect is displayed as "not present", but is not considered to be permanently removed until the rsrc.aged time has expired.

fmadm acquit

Replacement takes precedence over repair and both replacement and repair take precedence over acquittal. Thus, you can acquit something and then subsequently repair it, but you cannot acquit something that has already been repaired.

A case is considered repaired (it moves into the FMD_CASE_REPAIRED state and a list.repaired event is generated) when either its UUID is acquitted or all suspects have been either repaired, replaced, removed, or acquitted.

Typically you would only want to acquit by fmri/label if you determined that the resource was not guilty in all current cases in which it is a suspect. However, to allow a FRU to be manually acquitted in one case while remaining a suspect in all others, the following option allows you to specify both uuid and fmri/label:

fmadm acquit uuid [ fmri | label ]
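
The precedence rule and the condition for a case being considered repaired can be modelled in a few lines. The sketch below is an illustration of the rules as stated in this section, not of how fmd implements them. The state names, function names, and sample labels are assumptions, and a lower-precedence request is simply ignored here because this page does not say how fmadm reports such a request.

```python
# Higher rank wins; the ordering follows the precedence rule stated above.
RANK = {"faulty": 0, "acquitted": 1, "repaired": 2, "replaced": 3}

def apply_action(current, requested):
    """Return the suspect's new state, keeping the current one if it outranks the request."""
    return requested if RANK[requested] > RANK[current] else current

def case_is_repaired(suspect_states, uuid_acquitted=False):
    """A case is repaired if its UUID was acquitted or no suspect is still faulty."""
    resolved = {"repaired", "replaced", "removed", "acquitted"}
    return uuid_acquitted or all(s in resolved for s in suspect_states.values())

# Hypothetical case with two suspects, identified by sample labels.
suspects = {"SLOT 2": "faulty", "SLOT 5": "faulty"}

suspects["SLOT 2"] = apply_action(suspects["SLOT 2"], "acquitted")
suspects["SLOT 2"] = apply_action(suspects["SLOT 2"], "repaired")   # allowed
suspects["SLOT 2"] = apply_action(suspects["SLOT 2"], "acquitted")  # no effect
print(suspects["SLOT 2"], case_is_repaired(suspects))

suspects["SLOT 5"] = apply_action(suspects["SLOT 5"], "replaced")
print(suspects["SLOT 5"], case_is_repaired(suspects))
```

The "removed" state is deliberately absent from RANK because, as described earlier, removal is discovered by the system rather than requested through fmadm.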

Fault Management Administration Example

These commands are expected to be used most often in the following three forms:

fmadm acquit label
fmadm replaced label
fmadm repaired label

Consider the following case:

  • The suspect list has two entries. One entry is for a FRU with label PCIE1. The other entry is for MB.
  • The knowledge article suggests replacing the FRU in PCIE1 but acquitting MB.

In this case, you should take the following actions:

  1. Replace the FRU in PCIE1.
  2. Enter the following commands:
    fmadm replaced PCIE1
    fmadm acquit MB
    

Fault Management Log Files

An overview of the FMA log files is here:

Fault Statistics

The Fault Manager daemon, fmd(1M), and many of its modules track statistics. The fmstat(1M) command reports those statistics. Without options, fmstat gives a high-level overview of the events, processing times, and memory usage of the loaded modules. For example:

# fmstat 
module                  ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire                 1       0  0.0  403.5   0   0     0     0   419b      0 
disk-transport                0       0  0.0  500.6   0   0     0     0    32b      0 
eft                           0       0  0.0    4.8   0   0     0     0   1.4M    43b 
fmd-self-diagnosis            0       0  0.0    4.7   0   0     0     0      0      0 
io-retire                     0       0  0.0    4.5   0   0     0     0      0      0 
snmp-trapgen                  0       0  0.0    4.5   0   0     0     0    32b      0 
sysevent-transport            0       0  0.0 1444.4   0   0     0     0      0      0 
syslog-msgs                   0       0  0.0    4.5   0   0     0     0      0      0 
zfs-diagnosis                 0       0  0.0    4.7   0   0     0     0      0      0 
zfs-retire                    0       0  0.0    4.5   0   0     0     0      0      0 

The fmstat(1M) man page describes each column in this output. Note that the open and solve columns apply only to Fault Management cases, which are only created and solved by diagnosis engines. These columns are immaterial for other modules, such as response agents.
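
The tabular fmstat output is easy to post-process, for example to list only the modules that have received events. The following sketch assumes whitespace-separated columns as in the example above and uses a shortened copy of that output as sample data. The function name parse_fmstat is hypothetical.

```python
def parse_fmstat(output):
    """Turn fmstat's table into a list of dicts keyed by column name."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

sample = """\
module                  ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire                 1       0  0.0  403.5   0   0     0     0   419b      0
eft                           0       0  0.0    4.8   0   0     0     0   1.4M    43b
syslog-msgs                   0       0  0.0    4.5   0   0     0     0      0      0
"""

rows = parse_fmstat(sample)
# Values stay as strings because sizes such as "419b" and "1.4M" carry units.
active = [row["module"] for row in rows if int(row["ev_recv"]) > 0]
print(active)
```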

Statistics on an individual module can also be displayed by using the -m module option. This syntax is commonly used with the -z option to suppress zero-valued statistics. For example:

# fmstat -z -m cpumem-retire 
  NAME VALUE            DESCRIPTION 
  cpu_flts 1            cpu faults resolved

This example shows that the cpumem-retire agent has successfully processed a request to take a CPU offline.

Copyright 1994-2009 Sun Microsystems, Inc.