
h1.Fault Management for System Administrators
* [Solaris Fault Management Overview|#overview]
* [Fault Notification|#notification]
* [Displaying Faults|#DisplayingFaults]
** [fmadm faulty|#faulty]
** [fmdump|#dump]
* [Repairing Faults|#repair]
** [fmadm repaired|#repaired]
** [fmadm replaced|#replaced]
** [fmadm acquit|#acquit]
* [Fault Management Administration Example|#example]
* [Fault Management Log Files|#logfiles]
* [Fault Statistics|#statistics]

{anchor:overview}
h2.Solaris Fault Management Overview

Solaris Fault Management provides a new architecture for building resilient error handlers, structured error telemetry, automated diagnostic software, response agents, and structured messaging. Many parts of the Solaris software stack participate in Fault Management, including the CPU, memory and I/O subsystems, Solaris ZFS, an increasing set of device drivers, and other management stacks.

At a high level, the Fault Management stack consists of error detectors, diagnosis engines, and response agents. Error detectors, as the name suggests, detect errors in the system and perform any immediate, required handling. The error detectors issue well-defined error reports, or ereports, to a diagnosis engine. A diagnosis engine interprets ereports and determines whether a fault is present in the system. When such a determination is made, the diagnosis engine issues a suspect list that describes the field-replaceable unit (FRU) or set of FRUs that might be the cause of the problem. Suspect lists are interpreted by response agents. A response agent attempts to take some action based on the suspect list. Responses include logging messages, taking CPU strands offline, retiring memory pages, and retiring I/O devices.

The error detectors, diagnosis engines, and response agents are connected by the Fault Manager daemon, [{{fmd}}(1M)|http://docs.sun.com/app/docs/doc/819-2240/fmd-1m?a=view], which acts as a multiplexor between the various components.

{code}
  error                    Fault                      Response
detectors ------------->   Mgmt   ---------------->    Agents
             ereport      Daemon     suspect list
                           |  ^
                           |  |
                  ereport  |  |  suspect list
                           |  |
                           v  |
                        diagnosis
                         engines
{code}

The Fault Manager daemon is itself a service under SMF ([Service Management Facility|http://www.opensolaris.org/os/community/smf/]) control. The service is enabled by default and controlled just like any other SMF service, as the following example shows:

{code}
# svcs fmd
STATE STIME FMRI
online 11:25:44 svc:/system/fmd:default
# svcadm disable fmd
# svcs svc:/system/fmd:default
STATE STIME FMRI
disabled 15:27:45 svc:/system/fmd:default
# svcadm enable fmd
# svcs fmd
STATE STIME FMRI
online 15:27:51 svc:/system/fmd:default
{code}
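
Before troubleshooting faults, a script may want to confirm that the daemon is running. The following is a minimal sketch, assuming the state string comes from {{svcs}}. The helper names {{fmd_state_ok}} and {{fmd_state_report}} are hypothetical and are not part of Solaris.

```shell
# Sketch only: fmd_state_ok and fmd_state_report are hypothetical helpers.
# fmd_state_ok STATE: succeed only when the SMF state string is "online".
fmd_state_ok() {
    case "$1" in
        online) return 0 ;;
        *)      return 1 ;;
    esac
}

# fmd_state_report STATE: print a one-line summary for the given state.
fmd_state_report() {
    if fmd_state_ok "$1"; then
        echo "fmd is online"
    else
        echo "fmd is not online (state: $1); try: svcadm enable fmd"
    fi
}
```

One possible invocation is {{fmd_state_report "$(svcs -H -o state fmd)"}}, where {{-H}} omits the header line and {{-o state}} prints only the STATE column.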

{anchor:notification}
h2.Fault Notification

Often, the first interaction with the Fault Manager is a system message indicating that a fault has been diagnosed. Messages are sent to both the console and the {{/var/adm/messages}} file. All messages from the Fault Manager use the following format:

{code}
1 SUNW-MSG-ID: AMD-8000-AV, TYPE: Fault, VER: 1, SEVERITY: Major
2 EVENT-TIME: Tue May 13 15:00:02 PDT 2008
3 PLATFORM: Sun Ultra 20 Workstation, CSN: 0604FK401F, HOSTNAME: hexterra
4 SOURCE: eft, REV: 1.16
5 EVENT-ID: 04837324-f221-e7dc-f6fa-dc7d9420ea76
6 DESC: The number of errors associated with this CPU has exceeded
acceptable levels. Refer to http://sun.com/msg/AMD-8000-AV for more
information.
7 AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
8 IMPACT: Performance of this system may be affected.
9 REC-ACTION: Schedule a repair procedure to replace the affected CPU.
Use fmdump -v -u <EVENT_ID> to identify the module.
{code}

When notified of a diagnosed fault, always consult the recommended [knowledge article|http://sun.com/msg/] for additional details. See line 6 above for an example. The knowledge article might contain additional actions that you or a service provider should take beyond those listed on line 9.
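
Because the knowledge article address follows the pattern shown in the DESC text ({{http://sun.com/msg/}} followed by the SUNW-MSG-ID), the URL can be derived from a logged message. The following sketch assumes the message text arrives on standard input. The function name {{msg_article_url}} is hypothetical.

```shell
# Sketch only: msg_article_url is a hypothetical helper.
# Read Fault Manager message text on stdin and print one knowledge
# article URL for every line that carries a SUNW-MSG-ID field.
msg_article_url() {
    sed -n 's|.*SUNW-MSG-ID: \([A-Za-z0-9-]*\),.*|http://sun.com/msg/\1|p'
}
```

For example, {{grep SUNW-MSG-ID /var/adm/messages | msg_article_url}} would print one URL per logged fault message.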

{info:title=Note}For historical and backward compatibility reasons, the REC-ACTION (line 9) often refers to the [{{fmdump}}(1M)|http://docs.sun.com/app/docs/doc/819-2240/fmdump-1m?a=view] command. However, in OpenSolaris 2008.11, the preferred method to display fault information and determine the FRUs involved is the {{fmadm&nbsp;faulty}} command. The [{{fmadm}}(1M)|http://docs.sun.com/app/docs/doc/819-2240/fmadm-1m?a=view] command is discussed in [Displaying Faults|#DisplayingFaults] below.
{info}

Fault Manager fault events can also be plugged into a Simple Network Management Protocol (SNMP) monitoring system. One of the response agents is the {{snmp-trapgen}} module. This module requires the System Management Agent (SMA), which is part of the Solaris freeware packages. Configuration of traps is straightforward, with typical modifications to the {{/etc/sma/snmp/snmp.conf}} file. Fault Management also provides a Management Information Base (MIB) plug-in for use with SMA, which resides in {{/usr/lib/fm/`isainfo&nbsp;-k`/libfmd_snmp.so.1}}. Refer to the following pages for details on SNMP and FMA setup:

* [Fun with the FMA and SNMP|http://blogs.sun.com/pmonday/entry/fun_with_the_fma_and], by Paul Monday
* [A louder voice for the fault manager|http://blogs.sun.com/wesolows/entry/a_louder_voice_for_the], by Keith Wesolowski

{anchor:DisplayingFaults}
h2.Displaying Faults
The preferred method to display fault information and determine the FRUs involved is the {{fmadm&nbsp;faulty}} command. However, the {{fmdump}} command is still supported.

{anchor:faulty}
h3.fmadm faulty
The {{fmadm faulty}} command is used to display any faulty components in the system, as shown in the following example:

{code}
1 # fmadm faulty
2 --------------- ------------------------------------ ----------- ---------
3 TIME EVENT-ID MSG-ID SEVERITY
4 --------------- ------------------------------------ ----------- ---------
5 May 13 15:00:02 04837324-f221-e7dc-f6fa-dc7d9420ea76 AMD-8000-AV Major
6
7 Fault class : fault.cpu.amd.dcachedata
8 Affects : cpu:///cpuid=0
9 degraded but still in service
10 FRU : "CPU 0" (hc://:product-id=Sun-Ultra-20-Workstation:
chassis-id=0604FK401F:server-id=hexterra/motherboard=0/chip=0)
11 faulty
12
13 Description : The number of errors associated with this CPU has exceeded
14 acceptable levels. Refer to http://sun.com/msg/AMD-8000-AV
15 for more information.
16
17 Response : An attempt will be made to remove this CPU from service.
18
19 Impact : Performance of this system may be affected.
20
21 Action : Schedule a repair procedure to replace the affected CPU.
22 Use fmdump -v -u <EVENT_ID> to identify the module.
{code}

Of primary interest is line 10, which shows the data for the impacted FRUs. The more human-readable location string is presented in quotation marks, "CPU 0" in the preceding example. The quoted value is intended to match the label on the physical hardware. The FRU is also represented in a Fault Management Resource Identifier (FMRI) format, which includes descriptive properties about the system containing the fault, such as its host name and chassis serial number. On platforms that support it, the part number and serial number of the FRU are also included in the FRU's FMRI.
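
The quoted labels can be pulled out of the output for use in scripts. The following is a rough sketch that assumes each FRU entry has the form shown in this document, a quoted label followed by an FMRI in parentheses that begins with {{hc:}}. The function name {{fru_labels}} is hypothetical.

```shell
# Sketch only: fru_labels is a hypothetical helper.
# Read "fmadm faulty" output on stdin and print each quoted FRU label
# that is immediately followed by an hc-scheme FMRI in parentheses.
fru_labels() {
    sed -n 's/.*"\([^"]*\)" (hc:.*/\1/p'
}
```

A typical use would be {{fmadm faulty | fru_labels}}, which for the preceding example would be expected to print the single label CPU 0.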

The Affects line (lines 8 and 9) indicates the components that are impacted by the fault and their relative state. In this example, a single CPU strand is impacted. It is "degraded," which means that it has not been taken offline by the system. This machine is a single-CPU system, and the last remaining CPU cannot be taken offline because the system needs at least one processor to run. Another reason that an attempt to offline a CPU might fail is if real-time threads are bound to the affected CPU.

If this were a multiprocessor system, one could expect the affected CPU to be taken offline by the operating system, as shown in the following example:

{code}
# psrinfo
0 faulted since 05/13/2008 12:55:26
1 on-line since 05/12/2008 11:47:26
{code}

The faulted state indicates that the processor has been taken offline by a Fault Management response agent.
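
To list only the processors in the faulted state, the {{psrinfo}} output can be filtered on its second field. This is a small sketch that assumes the output layout shown above. The function name {{faulted_cpus}} is hypothetical.

```shell
# Sketch only: faulted_cpus is a hypothetical helper.
# Read psrinfo output on stdin and print the ID of every processor
# whose state (second field) is "faulted".
faulted_cpus() {
    awk '$2 == "faulted" { print $1 }'
}
```

For example, {{psrinfo | faulted_cpus}} prints one processor ID per line, or nothing if no processor is faulted.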

The {{fmadm&nbsp;faulty}} command output also incorporates details from the console message: the SEVERITY (line 5) and the Description, Response, Impact, and Action text (lines 13-22). The Action text describes what to do to address the fault.

Following the FRU description in the {{fmadm&nbsp;faulty}} command output, line 11 shows the state as "faulty." Other state values that you might see in other situations include "repair attempted," "acquitted," and "not present," as shown for SLOT 2, SLOT 3, and SLOT 4 in the following example:

{code}
# fmadm faulty
--------------- ------------------------------------ -------------- -------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- -------
Sep 21 10:01:36 d482f935-5c8f-e9ab-9f25-d0aaafec1e6c PCIEX-8000-7J Major

Fault class : fault.io.pciex.device_invreq
Affects : dev:///pci@0,0/pci1022,7458@11/pci1000,3060@0
dev:///pci@0,0/pci1022,7458@11/pci1000,3060@1
ok and in service
dev:///pci@0,0/pci1022,7458@11/pci1000,3060@2
dev:///pci@0,0/pci1022,7458@11/pci1000,3060@3
faulty and taken out of service
FRU : "SLOT 2" (hc://.../pciexrc=3/pciexbus=4/pciexdev=0)
repair attempted
"SLOT 3" (hc://.../pciexrc=3/pciexbus=4/pciexdev=1)
acquitted
"SLOT 4" (hc://.../pciexrc=3/pciexbus=4/pciexdev=2)
not present
"SLOT 5" (hc://.../pciexrc=3/pciexbus=4/pciexdev=3)
faulty

Description : A problem was detected for a PCIEX device.
Refer to http://sun.com/msg/PCIEX-8000-7J for more information.

Response : One or more device instances may be disabled

Impact : Possible loss of services provided by the device instances
associated with this fault

Action : Schedule a repair procedure to replace the affected device.
{code}

{anchor:dump}
h3.fmdump
As mentioned above, some console messages and knowledge articles might instruct you to use the older {{fmdump&nbsp;-v&nbsp;-u&nbsp;UUID}} command to display fault information. While the {{fmadm&nbsp;faulty}} command is preferred, the {{fmdump}} command still operates, as shown in the following example:

{code}
1 # fmdump -v -u 04837324-f221-e7dc-f6fa-dc7d9420ea76
2 TIME UUID SUNW-MSG-ID
3 May 13 15:00:02.2409 04837324-f221-e7dc-f6fa-dc7d9420ea76 AMD-8000-AV
4 100% fault.cpu.amd.dcachedata
5
6 Problem in: hc://:product-id=Sun-Ultra-20-Workstation:chassis-
id=0604FK401F:server-id=hexterra/motherboard=0/chip=0/cpu=0
7 Affects: cpu:///cpuid=0
8 FRU: hc://:product-id=Sun-Ultra-20-Workstation:chassis-
id=0604FK401F:server-id=hexterra/motherboard=0/chip=0
9 Location: CPU 0
{code}

The information about the impacted FRUs is still present, although separated across two lines (lines 8 and 9). The Location string presents the human-readable FRU string, and the FRU line presents the formal FMRI. Note that the severity, descriptive text, and action are not shown with the {{fmdump}}(1M) command.

{anchor:repair}
h2.Repairing Faults
Once Fault Management has faulted a component in your system, you will want to repair it. A repair can happen in one of two ways: implicitly or explicitly.

An implicit repair can occur when the faulty component is replaced, provided the component has serial number information that the Fault Manager can track. On many of Sun's SPARC-based systems, serial number information is included in the FMRIs so that the Fault Manager can determine when components have been removed from operation, either through replacement or other means (for example, _blacklisting_). When the Fault Manager detects such a removal, the daemon no longer displays the affected resource in {{fmadm&nbsp;faulty}} output. The resource is maintained in the daemon's internal resource cache until the fault event is 30 days old, at which point it is purged.

Implicit repairs do not apply to all systems and are unlikely to occur on generic x86-based hardware. Note that in the previous fault example, the FMRIs contain a {{chassis-id}} but no FRU serial number information. The Fault Manager daemon therefore cannot detect a FRU replacement, so an explicit repair is necessary.

The {{fmadm repair}} command is used to explicitly mark a fault as repaired. It accepts a UUID, FMRI, or Location as an argument. For example:

{code}
# fmadm repair 04837324-f221-e7dc-f6fa-dc7d9420ea76
fmadm: recorded repair to 04837324-f221-e7dc-f6fa-dc7d9420ea76
{code}

In OpenSolaris 2008.11, the {{fmadm repair}} command has been replaced by the following four new {{fmadm}} commands:

{code}
fmadm repaired fmri | label
fmadm replaced fmri | label
fmadm acquit fmri | label
fmadm acquit uuid [ fmri | label ]
{code}

The {{fmadm repair}} command is retained as a synonym for the new {{fmadm repaired}} command for compatibility.

These four new commands behave in exactly the same way as {{fmadm repair}}, except that {{fmd}} remembers whether the FMRI or UUID has been repaired, replaced, or acquitted.

{info:title=Note}Although these four commands can take FMRIs and UUIDs as arguments, the preferred argument to use is the Label. If a FRU has multiple faults against it, you want to replace the FRU only one time. If you issue the {{fmadm&nbsp;replaced}} command against the Label, the FRU is reflected as such in any outstanding cases.{info}

{anchor:repaired}
h3.fmadm repaired
The {{fmadm repaired}} command should be used when some physical repair has been carried out that might resolve the problem. Examples of such repairs include reseating a card or straightening a bent pin.

{anchor:replaced}
h3.fmadm replaced
The {{fmadm replaced}} command should be used to indicate that the suspect FRU has been replaced.

If the system automatically discovers that a FRU has been replaced (the serial number has changed), then this discovery is treated in the same way as if {{fmadm replaced}} had been typed. The {{fmadm replaced}} command is not allowed if {{fmd}} can automatically confirm that the FRU has _not_ been replaced (the serial number has not changed).

If the system automatically discovers that a FRU has been removed but not replaced, then the current behavior is unchanged: the suspect is displayed as "not present," but is not considered to be permanently removed until the {{rsrc.aged}} time has expired.

{anchor:acquit}
h3.fmadm acquit
Replacement takes precedence over repair and both replacement and repair take precedence over acquittal. Thus, you can acquit something and then subsequently repair it, but you cannot acquit something that has already been repaired.
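
The precedence rule can be pictured as a ranking of the three actions. The following sketch is an illustrative model only, not how {{fmd}} implements the rule. The function names {{action_rank}} and {{may_follow}} are hypothetical, and allowing the same action to be recorded twice is an assumption.

```shell
# Sketch only: an illustrative model, not fmd's implementation.
# action_rank ACTION: print the precedence of an action (higher wins);
# unknown or empty actions rank 0.
action_rank() {
    case "$1" in
        acquit)   echo 1 ;;
        repaired) echo 2 ;;
        replaced) echo 3 ;;
        *)        echo 0 ;;
    esac
}

# may_follow CURRENT NEW: succeed if NEW is a known action whose rank is
# not lower than CURRENT's. Treating equal ranks as allowed is an assumption.
may_follow() {
    [ "$(action_rank "$2")" -gt 0 ] &&
        [ "$(action_rank "$2")" -ge "$(action_rank "$1")" ]
}
```

In this model, {{may_follow acquit repaired}} succeeds, while {{may_follow repaired acquit}} fails, which mirrors the rule stated above.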

A case is considered repaired (it moves into the FMD_CASE_REPAIRED state and a {{list.repaired}} event is generated) when either its UUID is acquitted or all suspects have been repaired, replaced, removed, or acquitted.
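
The rule for when a case counts as repaired can be expressed as a short check. This is a sketch of the rule as stated here, not of the {{fmd}} source. The function name {{case_is_repaired}} and the one-word state tokens are hypothetical, and treating a case with no listed suspects as unresolved is an assumption.

```shell
# Sketch only: models the rule in the text, not fmd's implementation.
# case_is_repaired UUID_ACQUITTED STATE...
#   UUID_ACQUITTED is "yes" if the case UUID itself was acquitted.
#   Each STATE is one suspect's state: repaired, replaced, removed,
#   acquitted, or anything else (for example, faulty).
case_is_repaired() {
    uuid_acquitted=$1
    shift
    if [ "$uuid_acquitted" = "yes" ]; then
        return 0
    fi
    if [ "$#" -eq 0 ]; then
        return 1    # assumption: no suspects listed means not resolved
    fi
    for state in "$@"; do
        case "$state" in
            repaired|replaced|removed|acquitted) ;;
            *) return 1 ;;
        esac
    done
    return 0
}
```

For example, {{case_is_repaired no replaced acquitted}} succeeds, while {{case_is_repaired no replaced faulty}} fails because one suspect is still outstanding.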

Typically, you would acquit by FMRI or label only if you have determined that the resource is not guilty in all current cases in which it is a suspect. However, to allow a FRU to be manually acquitted in one case while remaining a suspect in all others, the following form allows you to specify both the UUID and the FMRI or label:

{code}fmadm acquit uuid [ fmri | label ]{code}

{anchor:example}
h2.Fault Management Administration Example
The following three forms are expected to be the most common uses of these commands:

{code}
fmadm acquit label
fmadm replaced label
fmadm repaired label
{code}

Consider the following case:
* The suspect list has two entries. One entry is for a FRU with label PCIE1. The other entry is for MB.
* The [knowledge article|http://sun.com/msg/] suggests replacing the FRU in PCIE1 but acquitting MB.

In this case, you should take the following actions:
# Replace the FRU in PCIE1.
# Enter the following commands:
{code}
fmadm replaced PCIE1
fmadm acquit MB
{code}
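
When several FRUs are involved, it can help to review the commands before running them. The following dry-run sketch reads one action and one label per line and only prints the corresponding {{fmadm}} command. The function name {{plan_fmadm}} and the input format are hypothetical.

```shell
# Sketch only: plan_fmadm is a hypothetical dry-run helper.
# Read "ACTION LABEL" lines on stdin and print, without executing, the
# fmadm command for each. The label is quoted so that labels containing
# spaces (for example, SLOT 2) remain a single argument.
plan_fmadm() {
    while read -r action label; do
        case "$action" in
            repaired|replaced|acquit)
                printf 'fmadm %s "%s"\n' "$action" "$label" ;;
            '')
                ;;  # ignore blank lines
            *)
                echo "skipping unknown action: $action" >&2 ;;
        esac
    done
}
```

For the case above, feeding the two lines {{replaced PCIE1}} and {{acquit MB}} to {{plan_fmadm}} prints the two commands for review. Nothing is recorded until you run them yourself.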

{anchor:logfiles}
h2.Fault Management Log Files
For an overview of the FMA log files, see:
* [Managing Fault Management Log Files|http://blogs.sun.com/sdaven/entry/fma_log_files], by Scott Davenport

{anchor:statistics}
h2.Fault Statistics
The Fault Manager daemon, {{fmd}}(1M), and many of its modules track statistics. The [{{fmstat}}(1M)|http://docs.sun.com/app/docs/doc/819-2240/fmstat-1m?a=view] command reports those statistics. Without options, {{fmstat}} gives a high-level overview of the events, processing times, and memory usage of the loaded modules. For example:

{code}
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            1       0  0.0  403.5   0   0     0     0   419b      0
disk-transport           0       0  0.0  500.6   0   0     0     0    32b      0
eft                      0       0  0.0    4.8   0   0     0     0   1.4M    43b
fmd-self-diagnosis       0       0  0.0    4.7   0   0     0     0      0      0
io-retire                0       0  0.0    4.5   0   0     0     0      0      0
snmp-trapgen             0       0  0.0    4.5   0   0     0     0    32b      0
sysevent-transport       0       0  0.0 1444.4   0   0     0     0      0      0
syslog-msgs              0       0  0.0    4.5   0   0     0     0      0      0
zfs-diagnosis            0       0  0.0    4.7   0   0     0     0      0      0
zfs-retire               0       0  0.0    4.5   0   0     0     0      0      0
{code}

The {{fmstat}}(1M) man page describes each column in this output. Note that the open and solve columns apply only to Fault Management cases, which are only created and solved by diagnosis engines. These columns are immaterial for other modules, such as response agents.
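
On a system with many modules, it can be useful to show only those that have received events or have open cases. The following sketch assumes the header names shown above ({{ev_recv}} and {{open}}) and looks the columns up by name instead of by position. The function name {{busy_modules}} is hypothetical.

```shell
# Sketch only: busy_modules is a hypothetical helper.
# Read fmstat output on stdin. The first line is taken as the header
# and used to find the ev_recv and open columns by name. Print the name
# of each module with a nonzero value in either column.
busy_modules() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        ($(col["ev_recv"]) + 0 > 0) || ($(col["open"]) + 0 > 0) { print $1 }
    '
}
```

For example, {{fmstat | busy_modules}} applied to the preceding output would print only {{cpumem-retire}}.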

Statistics on an individual module can also be displayed by using the {{-m&nbsp;_module_}} option. This syntax is commonly used with the {{-z}} option to suppress zero-valued statistics. For example:

{code}
# fmstat -z -m cpumem-retire
NAME VALUE DESCRIPTION
cpu_flts 1 cpu faults resolved
{code}

This example shows that the {{cpumem-retire}} agent has successfully processed a request to take a CPU offline.


Copyright 1994-2009 Sun Microsystems, Inc.