io Provider
The io provider makes available probes related to disk input and output. The io provider enables quick exploration of behavior observed through I/O monitoring tools such as iostat(1M). For example, using the io provider, you can understand I/O by device, by I/O type, by I/O size, by process, by application name, by file name, or by file offset.
Probes
The io probes are described in Table 27–1.
Table 27–1 io Probes
| Probe | Description |
| start | Probe that fires when an I/O request is about to be made either to a peripheral device or to an NFS server. The bufinfo_t corresponding to the I/O request is pointed to by args[0]. The devinfo_t of the device to which the I/O is being issued is pointed to by args[1]. The fileinfo_t of the file that corresponds to the I/O request is pointed to by args[2]. Note that file information availability depends on the filesystem making the I/O request. See fileinfo_t for more information. |
| done | Probe that fires after an I/O request has been fulfilled. The bufinfo_t corresponding to the I/O request is pointed to by args[0]. The done probe fires after the I/O completes, but before completion processing has been performed on the buffer. As a result, B_DONE is not set in b_flags at the time the done probe fires. The devinfo_t of the device to which the I/O was issued is pointed to by args[1]. The fileinfo_t of the file that corresponds to the I/O request is pointed to by args[2]. |
| wait-start | Probe that fires immediately before a thread begins to wait pending completion of a given I/O request. The bufinfo_t corresponding to the I/O request for which the thread will wait is pointed to by args[0]. The devinfo_t of the device to which the I/O was issued is pointed to by args[1]. The fileinfo_t of the file that corresponds to the I/O request is pointed to by args[2]. Some time after the wait-start probe fires, the wait-done probe will fire in the same thread. |
| wait-done | Probe that fires when a thread is done waiting for the completion of a given I/O request. The bufinfo_t corresponding to the I/O request for which the thread was waiting is pointed to by args[0]. The devinfo_t of the device to which the I/O was issued is pointed to by args[1]. The fileinfo_t of the file that corresponds to the I/O request is pointed to by args[2]. The wait-done probe fires only after the wait-start probe has fired in the same thread. |
Note that the io probes fire for all I/O requests to peripheral devices, and for all file read and file write requests to an NFS server. Requests for metadata from an NFS server, such as those caused by a readdir(3C) request, do not trigger io probes.
Arguments
The argument types for the io probes are listed in Table 27–2. The arguments are described in Table 27–1.
Table 27–2 io Probe Arguments
| Probe | args[0] | args[1] | args[2] |
| start | bufinfo_t * | devinfo_t * | fileinfo_t * |
| done | bufinfo_t * | devinfo_t * | fileinfo_t * |
| wait-start | bufinfo_t * | devinfo_t * | fileinfo_t * |
| wait-done | bufinfo_t * | devinfo_t * | fileinfo_t * |
Each io probe takes three arguments: a pointer to a bufinfo_t, which describes the underlying buf(9S) structure, a pointer to a devinfo_t, and a pointer to a fileinfo_t. These structures are described in greater detail in this section.
bufinfo_t structure
The bufinfo_t structure is the abstraction that describes an I/O request. The buffer corresponding to an I/O request is pointed to by args[0] in the start, done, wait-start, and wait-done probes. The bufinfo_t structure definition is as follows:
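A sketch of the definition, listing the members discussed below. The member types, and the b_blkno member (which this section does not otherwise describe), are assumptions based on the translator library shipped with Solaris in /usr/lib/dtrace/io.d; check that file for the authoritative definition on your system.

```dtrace
/* Sketch only: member types are assumptions, see /usr/lib/dtrace/io.d */
typedef struct bufinfo {
	int b_flags;		/* flags */
	size_t b_bcount;	/* number of bytes */
	caddr_t b_addr;		/* buffer address */
	uint64_t b_blkno;	/* expanded block # on device */
	uint64_t b_lblkno;	/* logical block # on device */
	size_t b_resid;		/* # of bytes not transferred */
	size_t b_bufsize;	/* size of allocated buffer */
	caddr_t b_iodone;	/* I/O completion routine */
	int b_error;		/* expanded error field */
	dev_t b_edev;		/* extended device */
} bufinfo_t;
```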
The b_flags member indicates the state of the I/O buffer, and consists of a bitwise-or of different state values. The valid state values are in Table 27–3.
Table 27–3 b_flags Values
| B_DONE | Indicates that the data transfer has completed. |
| B_ERROR | Indicates an I/O transfer error. It is set in conjunction with the b_error field. |
| B_PAGEIO | Indicates that the buffer is being used in a paged I/O request. See the description of the b_addr field for more information. |
| B_PHYS | Indicates that the buffer is being used for physical (direct) I/O to a user data area. |
| B_READ | Indicates that data is to be read from the peripheral device into main memory. |
| B_WRITE | Indicates that the data is to be transferred from main memory to the peripheral device. |
| B_ASYNC | The I/O request is asynchronous, and will not be waited upon. The wait-start and wait-done probes don't fire for asynchronous I/O requests. Note that some I/Os directed to be asynchronous might not have B_ASYNC set: the asynchronous I/O subsystem might implement the asynchronous request by having a separate worker thread perform a synchronous I/O operation. |
The b_bcount field is the number of bytes to be transferred as part of the I/O request.
The b_addr field is the virtual address of the I/O request, unless B_PAGEIO is set. The address is a kernel virtual address unless B_PHYS is set, in which case it is a user virtual address. If B_PAGEIO is set, the b_addr field contains kernel private data. At most one of B_PHYS and B_PAGEIO is set at any time: either exactly one of the two flags is set, or neither is.
The b_lblkno field identifies which logical block on the device is to be accessed. The mapping from a logical block to a physical block (such as the cylinder, track, and so on) is defined by the device.
The b_resid field is set to the number of bytes not transferred because of an error.
The b_bufsize field contains the size of the allocated buffer.
The b_iodone field identifies a specific routine in the kernel that is called when the I/O is complete.
The b_error field may hold an error code returned from the driver in the event of an I/O error. b_error is set in conjunction with the B_ERROR bit set in the b_flags member.
The b_edev field contains the major and minor device numbers of the device accessed. Consumers may use the D subroutines getmajor and getminor to extract the major and minor device numbers from the b_edev field.
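As a minimal sketch of these subroutines, the following script counts I/O requests by major and minor device number. The aggregation layout and column widths are arbitrary choices for illustration.

```dtrace
#pragma D option quiet

/* Count each I/O request under its (major, minor) device number pair. */
io:::start
{
	@[getmajor(args[0]->b_edev), getminor(args[0]->b_edev)] = count();
}

/* Print the totals when the script is interrupted. */
END
{
	printf("%8s %8s %8s\n", "MAJOR", "MINOR", "COUNT");
	printa("%8d %8d %8@d\n", @);
}
```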
devinfo_t
The devinfo_t structure provides information about a device. The devinfo_t structure corresponding to the destination device of an I/O is pointed to by args[1] in the start, done, wait-start, and wait-done probes. The members of devinfo_t are as follows:
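A sketch of the definition, listing the members described below. The member types are assumptions; the authoritative definition is in the D library files on your system.

```dtrace
/* Sketch only: member types are assumptions */
typedef struct devinfo {
	int dev_major;		/* major number */
	int dev_minor;		/* minor number */
	int dev_instance;	/* instance number */
	string dev_name;	/* name of device driver */
	string dev_statname;	/* name of device as reported by iostat */
	string dev_pathname;	/* full path of device */
} devinfo_t;
```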
The dev_major field is the major number of the device. See getmajor(9F) for more information.
The dev_minor field is the minor number of the device. See getminor(9F) for more information.
The dev_instance field is the instance number of the device. The instance of a device is different from the minor number. The minor number is an abstraction managed by the device driver. The instance number is a property of the device node. You can display device node instance numbers with prtconf(1M).
The dev_name field is the name of the device driver that manages the device. You can display device driver names with the -D option to prtconf(1M).
The dev_statname field is the name of the device as reported by iostat(1M). This name also corresponds to the name of a kernel statistic as reported by kstat(1M). This field is provided so that aberrant iostat or kstat output can be quickly correlated to actual I/O activity.
The dev_pathname field is the full path of the device. This path may be specified as an argument to prtconf(1M) to obtain detailed device information. The path specified by dev_pathname includes components expressing the device node, the instance number, and the minor node. However, all three of these elements aren't necessarily expressed in the statistics name. For some devices, the statistics name consists of the device name and the instance number. For other devices, the name consists of the device name and the number of the minor node. As a result, two devices that have the same dev_statname may differ in dev_pathname.
fileinfo_t
The fileinfo_t structure provides information about a file. The file to which an I/O corresponds is pointed to by args[2] in the start, done, wait-start, and wait-done probes. The presence of file information is contingent upon the filesystem providing this information when dispatching I/O requests. Some filesystems, especially third-party filesystems, might not provide this information. Also, a filesystem might issue I/O requests for which no file information exists. For example, any I/O to filesystem metadata will not be associated with any one file. Finally, some highly optimized filesystems might aggregate I/O from disjoint files into a single I/O request. In this case, the filesystem might provide the file information either for the file that represents the majority of the I/O or for the file that represents some of the I/O. Alternatively, the filesystem might provide no file information at all in this case.
The definition of the fileinfo_t structure is as follows:
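A sketch of the definition, limited to the members described below. The member types are assumptions, and a given release may define additional members.

```dtrace
/* Sketch only: member types are assumptions */
typedef struct fileinfo {
	string fi_name;		/* name (basename of fi_pathname) */
	string fi_dirname;	/* directory (dirname of fi_pathname) */
	string fi_pathname;	/* full pathname */
	offset_t fi_offset;	/* offset within file, or -1 */
} fileinfo_t;
```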
The fi_name field contains the name of the file but does not include any directory components. If no file information is associated with an I/O, the fi_name field will be set to the string <none>. In some rare cases, the pathname associated with a file might be unknown. In this case, the fi_name field will be set to the string <unknown>.
The fi_dirname field contains only the directory component of the file name. As with fi_name, this string may be set to <none> if no file information is present, or <unknown> if the pathname associated with the file is not known.
The fi_pathname field contains the full pathname to the file. As with fi_name, this string may be set to <none> if no file information is present, or <unknown> if the pathname associated with the file is not known.
The fi_offset field contains the offset within the file, or -1 if file information is not present or if the offset is otherwise unspecified by the filesystem.
Examples
The following example script displays pertinent information for every I/O as it's issued:
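A minimal sketch of such a script, using the dev_statname, fi_pathname, and b_flags members described earlier. The column widths are arbitrary choices.

```dtrace
#pragma D option quiet

/* Print a header line once, before any probe output. */
BEGIN
{
	printf("%10s %58s %2s\n", "DEVICE", "FILE", "RW");
}

/* For each I/O as it is issued: device, file, and read or write. */
io:::start
{
	printf("%10s %58s %2s\n", args[1]->dev_statname,
	    args[2]->fi_pathname, args[0]->b_flags & B_READ ? "R" : "W");
}
```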
The output of the example when cold-starting Acrobat Reader on an x86 laptop system resembles the following example:
The <none> entries in the output indicate that the I/O doesn't correspond to the data in any particular file: these I/Os are due to metadata of one form or another. The <unknown> entries in the output indicate that the pathname for the file is not known. This situation is relatively rare.
You could make the example script slightly more sophisticated by using an associative array to track the time spent on each I/O, as shown in the following example:
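A sketch of such a script, referred to later in this section as iotime.d. It assumes that the pair (b_edev, b_blkno) identifies an outstanding request, where b_blkno is a bufinfo_t member that this section does not otherwise describe. The MS column shows elapsed milliseconds, with the microsecond remainder after the decimal point.

```dtrace
#pragma D option quiet

BEGIN
{
	printf("%10s %58s %2s %7s\n", "DEVICE", "FILE", "RW", "MS");
}

/* Record the issue time, keyed by device and block number (assumed key). */
io:::start
{
	start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

/* On completion, report the elapsed time and release the array element. */
io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
	this->elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
	printf("%10s %58s %2s %3d.%03d\n", args[1]->dev_statname,
	    args[2]->fi_pathname, args[0]->b_flags & B_READ ? "R" : "W",
	    this->elapsed / 1000000, (this->elapsed / 1000) % 1000);
	start[args[0]->b_edev, args[0]->b_blkno] = 0;
}
```

Assigning zero to the array element at completion frees its storage, so the array does not grow with the total number of I/Os.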
The output of the above example while hot-plugging a USB storage device into an otherwise idle x86 laptop system is shown in the following example:
You can make several observations about the mechanics of the system based on this output. First, note the long time to perform the first several I/Os, which took about 25 milliseconds each. This time might have been due to the cmdk0 device having been power managed on the laptop. Second, observe the I/O due to the scsa2usb(7D) driver loading to deal with the USB Mass Storage device. Third, note the writes to /var/adm/messages as the device is reported. Finally, observe the reading of the device link generators (the files ending in link.so), which presumably deal with the new device.
The io provider enables in-depth understanding of iostat(1M) output. Assume you observe iostat output similar to the following example:
You can use the iotime.d script to see these I/Os as they happen, as shown in the following example:
This output appears to show that the file archives.tar is being read from cmdk0 (in /export/archives), and being written to device sd2 (in /mnt). The existence of two files named archives.tar that are being operated on separately in parallel seems unlikely. To investigate further, you can aggregate on device, application, process ID and bytes transferred, as shown in the following example:
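A minimal sketch of such an aggregation, summing b_bcount per device, application name, and process ID. The column layout is an arbitrary choice.

```dtrace
#pragma D option quiet

/* Sum the bytes requested, keyed by device, application, and process ID. */
io:::start
{
	@[args[1]->dev_statname, execname, pid] = sum(args[0]->b_bcount);
}

/* Print the totals when the script is interrupted. */
END
{
	printf("%10s %20s %10s %15s\n", "DEVICE", "APP", "PID", "BYTES");
	printa("%10s %20s %10d %15@d\n", @);
}
```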
Running this script for a few seconds results in output similar to the following example:
This output shows that this activity is a copy of the file archives.tar from one device to another. This conclusion leads to another natural question: is one of these devices faster than the other? Which device acts as the limiter on the copy? To answer these questions, you need to know the effective throughput of each device rather than the number of bytes per second each device is transferring. You can determine the throughput with the following example script:
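A sketch of one way to compute this, under the same assumption as before that (b_edev, b_blkno) identifies an outstanding request. Each completed I/O contributes its own throughput, in KB/sec, to a power-of-two distribution per device.

```dtrace
#pragma D option quiet

io:::start
{
	start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
	/*
	 * Throughput in KB/sec is (bytes / 1024) / (nanoseconds / 10^9).
	 * Integer arithmetic would lose precision in that form, so the
	 * fraction is restated as bytes * (10^9 / 1024) / nanoseconds,
	 * with 10^9 / 1024 truncated to 976562.
	 */
	this->elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
	@[args[1]->dev_statname, args[1]->dev_pathname] =
	    quantize((args[0]->b_bcount * 976562) / this->elapsed);
	start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

END
{
	printa("  %s (%s)\n%@d\n", @);
}
```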
Running the example script for several seconds yields the following output:
The output shows that sd2 is clearly the limiting device. The sd2 throughput is between 256 KB/sec and 512 KB/sec, while cmdk0 is delivering I/O at anywhere from 8 MB/sec to over 64 MB/sec. The script prints out both the name as seen in iostat, and the full path of the device. To find out more about the device, you could specify the device path to prtconf, as shown in the following example:
As the emphasized terms indicate, this device is a removable USB storage device.
The examples in this section have explored all I/O requests. However, you might only be interested in one type of request. The following example tracks the directories in which writes are occurring, along with the applications performing the writes:
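A minimal sketch of such a script, using a predicate on B_WRITE to select only write requests. The column widths are arbitrary choices.

```dtrace
#pragma D option quiet

/* Count write requests by application name and directory. */
io:::start
/args[0]->b_flags & B_WRITE/
{
	@[execname, args[2]->fi_dirname] = count();
}

/* Print the totals when the script is interrupted. */
END
{
	printf("%20s %51s %5s\n", "WHO", "WHERE", "COUNT");
	printa("%20s %51s %5@d\n", @);
}
```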
Running this example script on a desktop workload for a period of time yields some interesting results, as shown in the following example output:
As the output indicates, virtually all writes are associated with the Mozilla Firebird cache. The writes labeled <none> are likely due to writes associated with the UFS log, writes that are themselves induced by other writes in the filesystem. See ufs(7FS) for details on logging. This example shows how to use the io provider to discover a problem at a much higher layer of software. In this case, the script has revealed a configuration problem: the web browser would induce much less I/O (and quite likely none at all) if its cache were in a directory in a tmpfs(7FS) filesystem.
The previous examples have used only the start and done probes. You can use the wait-start and wait-done probes to understand why applications block for I/O – and for how long. The following example script uses both io probes and sched probes (see Chapter 26, sched Provider) to derive CPU time compared to I/O wait time for the StarOffice software:
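A sketch of such a script. The process name soffice.bin is an assumption about how the StarOffice binary appears in execname; substitute the name used on your system. On-CPU time is measured with vtimestamp, and I/O wait time with timestamp.

```dtrace
#pragma D option quiet

/* Assumed process name: soffice.bin */
sched:::on-cpu
/execname == "soffice.bin"/
{
	self->on = vtimestamp;
}

sched:::off-cpu
/self->on/
{
	@time["<on cpu>"] = sum(vtimestamp - self->on);
	self->on = 0;
}

io:::wait-start
/execname == "soffice.bin"/
{
	self->wait = timestamp;
}

/* Attribute the wait both to the overall total and to the file involved. */
io:::wait-done
/self->wait/
{
	@io[args[2]->fi_name] = sum(timestamp - self->wait);
	@time["<I/O wait>"] = sum(timestamp - self->wait);
	self->wait = 0;
}

END
{
	printf("Time breakdown (milliseconds):\n");
	normalize(@time, 1000000);
	printa("  %-50s %15@d\n", @time);

	printf("\nI/O wait breakdown (milliseconds):\n");
	normalize(@io, 1000000);
	printa("  %-50s %15@d\n", @io);
}
```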
Running the example script during a cold start of the StarOffice software yields the following output:
As this output shows, much of the cold StarOffice start time is due to waiting for I/O: 13.1 seconds waiting for I/O as opposed to 3.6 seconds on CPU. Running the script on a warm start of the StarOffice software reveals that page caching has eliminated the I/O time, as shown in the following example output:
The cold start output shows that the file applicat.rdb accounts for more I/O wait time than any other file. This result is presumably due to many I/Os to the file. To explore the I/Os performed to this file, you can use the following D script:
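A minimal sketch of such a script. As before, the process name soffice.bin is an assumption. Offsets are bucketed in units of 1024 * 1024 bytes, and I/Os with no known offset are placed in the -1 bucket.

```dtrace
/* Assumed process name: soffice.bin */
io:::start
/execname == "soffice.bin" && args[2]->fi_name == "applicat.rdb"/
{
	/* Linear distribution of file offsets, one bucket per megabyte. */
	@ = lquantize(args[2]->fi_offset != -1 ?
	    args[2]->fi_offset / (1024 * 1024) : -1, -1, 1000);
}
```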
This script uses the fi_offset field of the fileinfo_t structure to understand which parts of the file are being accessed, at the granularity of a megabyte. Running this script during a cold start of the StarOffice software results in output similar to the following example:
This output indicates that only the first six megabytes of the file are accessed, perhaps because the file is six megabytes in size. The output also indicates that not all of the file is accessed. If you wanted to improve the cold start time of StarOffice, you might want to understand the access pattern of the file. If the needed sections of the file were largely contiguous, one way to improve StarOffice cold start time might be to have a scout thread run ahead of the application, inducing the I/O to the file before it's needed. (This approach is particularly straightforward if the file is accessed using mmap(2).) However, the roughly 1.6 seconds that this strategy would gain in cold start time does not merit the additional complexity and maintenance burden in the application. Either way, the data gathered with the io provider allows a precise understanding of the benefit that such work could ultimately deliver.
Stability
The io provider uses DTrace's stability mechanism to describe its stabilities, as shown in the following table. For more information about the stability mechanism, see Chapter 39, Stability.
| Element | Name stability | Data stability | Dependency class |
| Provider | Evolving | Evolving | ISA |
| Module | Private | Private | Unknown |
| Function | Private | Private | Unknown |
| Name | Evolving | Evolving | ISA |
| Arguments | Evolving | Evolving | ISA |