Getting insights with DTrace - Part 1

Analyzing Oracle Logwriter w/ buffered vs. direct I/O



Introduction / Motivation

SAP and Oracle already provide powerful instrumentation to diagnose problems and performance issues. However, DTrace allows unparalleled deepdive into the system. Therefore I decided to write a series of articles showing use cases for DTrace in SAP/Oracle environments. Since I already wrote some articles about Oracle and I/O, this will be my starting point for DTrace:

Although it is really considered best practice to let Oracle access it's files through direct I/O rather than buffered I/O, there are still many installations not following this advice. Either customers are convinced, that buffered I/O is better / faster, or they are 'victims' of Oracle bug 5752399 affecting releases 10.2.0.2 and 10.2.0.3 (the latter one not being supported in SAP environments).

Background

While there are very few situations, where buffered I/O for the Oracle datafiles (sapdata filesystems) provides better performance than direct I/O, direct I/O for the online redolog filesystems should be set in any case. This article is supposed to give some more insights on this subject by using DTrace.
Note:
Since the test environment used Oracle 10.2.0.1 I used the 'forcedirectio' filesystem mount option. In most situations it is recommended to enable direct I/O at the Oracle level via parameter 'filesystemio_options'. This will allow other applications / utilities (e.g. backup tools) to benefit from filesystem cache.

Recently Updated
by joerg_meiners (29 Jan)
Adaptive Oracle Grid (SAPonSun)
by j.brosowski (28 Jan)
Understanding a SAP SD 2-tier benchmark result (SAPonSun)
by vwetter (19 Jan)
Open Dodecaeder for SAP on Sun (SAPonSun)
by j.brosowski (18 Jan)
Open Dodecaeder for SAP on Sun v. 16 (SAPonSun)

Approach

The test scenario uses an Oracle 10g database (10.2.0.1), specifically the imp utility with a dump file containing around 1,000,000 records. After each test, the corresponding table gets dropped and a checkpoint is triggered. By changing between direct and buffered I/O, the mountoption will be changed in between the tests (online by using remount). This procedure allows good control over commit frequency and commit size (via imp parfile settings) as well as very stable results (reproducable). We will not only measure total runtimes but look at I/O and ufs internals by using DTrace scripts.

buffer=1024
commit=Y
file=my_sales.dump
rows=y

This parameterfile to the imp-utility ensures frequent commits, each commit writing only small amounts of data.

Import with redologs mounted 'noforcedirectio':

Import terminated successfully without warnings.

real        2:24.41
user          11.04
sys            1.66


Import with redologs mounted 'forcedirectio':

Import terminated successfully without warnings.

real        1:48.09
user          11.03
sys            1.62

These results show an advantage of about 36 secs (~30%) when using 'forcedirectio'. First of all, let's have a look at I/O. The DTrace toolkit by Brendan Gregg provides a script called 'iosnoop.sh'. Tracing the Logwriter process (lgwr) gives us the following information:

forcedirectio:

./iosnoop.sh -aD -p 4162
STRTIME              DEVICE  MAJ MIN DELTA        UID   PID D    BLOCK   SIZE                     PATHNAME ARGS
2009 May 27 11:42:56 sd0      32   4 3772         215  4162 W 22293311    512                       <none> ora_lgwr_DEM\0
2009 May 27 11:42:56 sd0      32   4 3874         215  4162 W 22293312   2560                       <none> ora_lgwr_DEM\0
2009 May 27 11:42:56 sd0      32   4 3600         215  4162 W 22293317   3072                       <none> ora_lgwr_DEM\0
2009 May 27 11:42:56 sd0      32   4 3733         215  4162 W 22293323   2560                       <none> ora_lgwr_DEM\0
2009 May 27 11:42:56 sd0      32   4 3814         215  4162 W 22293328    512                       <none> ora_lgwr_DEM\0
2009 May 27 11:42:56 sd0      32   4 3322         215  4162 W 22293329   3072                       <none> ora_lgwr_DEM\0
2009 May 27 11:42:56 sd0      32   4 3193         215  4162 W 22293335   3072                       <none> ora_lgwr_DEM\0
2009 May 27 11:42:56 sd0      32   4 3730         215  4162 W 22293341   1536                       <none> ora_lgwr_DEM\0
2009 May 27 11:42:56 sd0      32   4 3813         215  4162 W 22293344   1536                       <none> ora_lgwr_DEM\0
2009 May 27 11:42:56 sd0      32   4 3615         215  4162 W 22293347   3072                       <none> ora_lgwr_DEM\0
2009 May 27 11:42:56 sd0      32   4 3556         215  4162 W 22293353   3072                       <none> ora_lgwr_DEM\0

The table being imported contains VARCHAR fields, so the recordlength may vary. The size of all I/Os is a multiple of 512 Byte and it is pretty small - allways less than 4 KB. Overall, no surprises: sequential writes of different sizes depending on the amount of data that has to be flushed from logbuffer to the redolog file. Let's have a look at Oracle's statistics:

Event		
Statistic				Total		per Second	per Trans
redo size 				48,438,904 	267,760.27 	2,734.50
redo synch time 			7,384 		40.82 		0.42
redo synch writes 			17,864 	        98.75 		1.01
redo wastage 				5,559,348 	30,730.93 	313.84
redo write time 			7,381 		40.80 		0.42
redo writer latching time 		0 		0.00 		0.00
redo writes 				17,734 	        98.03 		1.00

OK, Oracle states an average lgwr I/O of 2731 Bytes (redo size / redo writes) with an average servicetime of 4,2 ms. Remember: the times listed in v$sysstat is measured in cs rather than ms!
Now let's do the same with 'noforcedirectio'.
noforcedirectio:

./iosnoop.sh -aD -p 4162
STRTIME              DEVICE  MAJ MIN DELTA        UID   PID D    BLOCK   SIZE                     PATHNAME ARGS
2009 May 27 02:09:41 sd0      32   4 5848         215  4162 W 21946960  8192   /slice4/redolog/redo01.log ora_lgwr_DEM\0
2009 May 27 02:09:41 sd0      32   4 3667         215  4162 W 21946960  8192   /slice4/redolog/redo01.log ora_lgwr_DEM\0
2009 May 27 02:09:41 sd0      32   4 3690         215  4162 W 21946960  8192   /slice4/redolog/redo01.log ora_lgwr_DEM\0
2009 May 27 02:09:41 sd0      32   4 3650         215  4162 W 21946960  8192   /slice4/redolog/redo01.log ora_lgwr_DEM\0
2009 May 27 02:09:41 sd0      32   4 22498        215  4162 R 21949024 1048576 /slice4/redolog/redo01.log ora_lgwr_DEM\0
2009 May 27 02:09:41 sd0      32   4 23045        215  4162 W 21946976  8192   /slice4/redolog/redo01.log ora_lgwr_DEM\0
2009 May 27 02:09:41 sd0      32   4 3468         215  4162 W 21946976  8192   /slice4/redolog/redo01.log ora_lgwr_DEM\0
					:
					:
2009 May 27 02:09:44 sd0      32   4 5855         215  4162 W 21949008  8192   /slice4/redolog/redo01.log ora_lgwr_DEM\0
2009 May 27 02:09:44 sd0      32   4 3333         215  4162 W 21949008  8192   /slice4/redolog/redo01.log ora_lgwr_DEM\0
2009 May 27 02:09:44 sd0      32   4 3690         215  4162 W 21949008  8192   /slice4/redolog/redo01.log ora_lgwr_DEM\0
2009 May 27 02:09:44 sd0      32   4 22519        215  4162 R 21951072 1048576 /slice4/redolog/redo01.log ora_lgwr_DEM\0
2009 May 27 02:09:44 sd0      32   4 23070        215  4162 W 21949024  8192   /slice4/redolog/redo01.log ora_lgwr_DEM\0
2009 May 27 02:09:44 sd0      32   4 3675         215  4162 W 21949024  8192   /slice4/redolog/redo01.log ora_lgwr_DEM\0
					:
					:

Two interesting observations:
LGWR repeatedly writes to the same Block! We know, that in our test case Oracle does small commits. If buffered I/O is used, even though just 512 bytes changed, a full OS block has to be written at each commit. This leads to subsequent writes to the same block on disk.
Every now and then, we can see a large read (1 MB) to the logfile. Looking at the block adresses, it seems very much like a sort of prefetching. One 8 KB block before the end of the prefetched data (1 MB) is reached, another 1 MB read is issued. This somehow makes sense, because when being forced to write 8 KB blocks only, you have to know Byte 513-8192 if you want to change Bytes 1-512.
Let's look at the database internal perspective:

Event		
Statistic				Total		per Second	per Trans
redo size 				50,214,248 	293,724.43 	2,832.48
redo synch time 			10,993 	   	64.30 		0.62
redo synch writes 			17,870 		104.53 		1.01
redo wastage 				5,609,796 	32,814.08 	316.44
redo write time 			10,999 		64.34 		0.62
redo writer latching time 		0 		0.00 		0.00
redo writes 				17,728 		103.70 		1.00

One might have expected the 'redo size' being much greater due to the write overhead of buffered I/O we talked about earlier. But remember: this is the Oracle view of things. Padding the small chunks of data to full 8 KB filesystem blocks happens at the UFS layer, not within Oracle! Therefore we can observe almost exactly the same amount of redo data (redo size) and almost exactly the same number of redo writes. But we can see a significant difference in redo write time (73,8 s vs. 110 s), which corresponds to the difference in total runtime of both tests.
Now it would be great to get a better understanding of what's happening behind the scenes in both scenarios. Again, a small DTrace script doing kernel profiling can give us an impression:

#!/usr/sbin/dtrace -s
#pragma D option flowindent

fbt:ufs::entry
/pid == $1/
{
        self->ts[probefunc] = timestamp;
}

fbt:ufs::return 
/pid == $1 && self->ts[probefunc] != 0/
{ 
        self->duration[probefunc] = (timestamp - self->ts[probefunc])/1000;
} 

fbt:ufs::
/pid == $1/
{
        trace(self->duration[probefunc]);
        self->duration[probefunc] = 0;
}

There is one parameter passed to the script which is the pid of the Oracle LGWR process. The script uses the fbt (function boundary tracing) provider. fbt provides entry and return probes of most Solaris kernel functions. This time we are particularly interested in the UFS module. Basically we are tracking elapsed time between entry and return. In addition we are using the flowindent option to create more readable output providing a clear picture of the call hierarchy.
Direct I/O is being described as having a pretty short codepath. Well, we're not going into the code of each call, but simply the number of calls being made should give you an impression of what's going on. At the beginning, there's the call to 'ufs_write'.

forcedirectio (numbers in microsec):

root@ontario1:/opt/dtrace # ./ufs3.d 4162
 
  1          -> ufs_write                                     0
  1            -> ufs_check_rewrite                           0
  1              -> bmap_has_holes                            0
  1              <- bmap_has_holes                           28
  1            <- ufs_check_rewrite                          94
  1            -> ufs_lockfs_begin                            0
  1              -> ufs_lockfs_is_under_rawlockfs             0
  1              <- ufs_lockfs_is_under_rawlockfs            47
  1            <- ufs_lockfs_begin                          135
  1            -> ufs_directio_write                          0
  1              -> bmap_read                                 0
  1                -> findextent                              0
  1                <- findextent                             23
  1              <- bmap_read                                75
  1              -> directio_start                            0
  1              <- directio_start                           67
  1              -> directio_wait                             0
  1                -> directio_wait_one                       0
  1                <- directio_wait_one                    2914
  1              <- directio_wait                          2981
  1            <- ufs_directio_write                       3259
  1            -> ufs_itimes_nolock                           0
  1              -> ufs_imark                                 0
  1              <- ufs_imark                                21
  1            <- ufs_itimes_nolock                          78
  1            -> ufs_lockfs_end                              0
  1              -> ufs_lockfs_top_vop_return                 0
  1              <- ufs_lockfs_top_vop_return                42
  1            <- ufs_lockfs_end                            120
  1          <- ufs_write                                  3894


noforcedirectio (numbers in microsec):

  0  -> ufs_write                                             0
  0    -> ufs_check_rewrite                                   0
  0      -> bmap_has_holes                                    0
  0      <- bmap_has_holes                                   36
  0    <- ufs_check_rewrite                                 119
  0    -> ufs_lockfs_begin                                    0
  0      -> ufs_lockfs_is_under_rawlockfs                     0
  0      <- ufs_lockfs_is_under_rawlockfs                    59
  0    <- ufs_lockfs_begin                                  170
  0    -> ufs_itimes_nolock                                   0
  0      -> ufs_imark                                         0
  0      <- ufs_imark                                        29
  0    <- ufs_itimes_nolock                                 102
  0    -> wrip                                                0
  0      -> ufs_indirblk_sync                                 0
  0      <- ufs_indirblk_sync                                41
  0      -> ufs_getpage                                       0
  0        -> ufs_lockfs_begin_getpage                        0
  0        <- ufs_lockfs_begin_getpage                       52
  0        -> bmap_has_holes                                  0
  0        <- bmap_has_holes                                 36
  0      <- ufs_getpage                                     227
  0      -> ufs_putpage                                       0
  0      <- ufs_putpage                                      32
  0      -> ufs_putpages                                      0
  0        -> ufs_putapage                                    0
  0          -> bmap_read                                     0
  0            -> findextent                                  0
  0            <- findextent                                 34
  0          <- bmap_read                                    99
  0          -> lufs_write_strategy                           0
  0            -> deltamap_remove                             0
  0            <- deltamap_remove                            38
  0          <- lufs_write_strategy                         171
  0        <- ufs_putapage                                 4877
  0      <- ufs_putpages                                   4955
  0      -> ufs_check_rewrite                                 0
  0      <- ufs_check_rewrite                                41
  0      -> top_delta                                         0
  0        -> deltamap_add                                    0
  0        <- deltamap_add                                   33
  0      <- top_delta                                        98
  0      -> ufs_itimes_nolock                                 0
  0        -> ufs_imark                                       0
  0        <- ufs_imark                                      28
  0      <- ufs_itimes_nolock                               101
  0    <- wrip                                             5833
  0    -> ufs_lockfs_end                                      0
  0      -> ufs_lockfs_top_vop_return                         0
  0      <- ufs_lockfs_top_vop_return                        53
  0    <- ufs_lockfs_end                                    152
  0  <- ufs_write                                          6614

In case you have no access to Oracle statspack or AWR reports (or even have no license for Oracle Enterprise Manager diagnostics pack), you may run the following script to get an idea for the average redo write time of LGWR:

#!/usr/sbin/dtrace -s

fbt:ufs:ufs_write:entry 
/pid == $1/
{ 
        self->ts = timestamp; 
} 

fbt:ufs:ufs_write:return 
/self->ts != 0 && pid == $1/ 
{ 
        @time[execname] = avg(timestamp - self->ts); 
        self->ts = 0; 
} 

This script uses an aggregation (@time[]) to calculate the average elapsed time when entering the return probe.

Again: you have to pass the pid of Oracle LGWR which can be determined by

root@ontario1:/opt/dtrace # ps -ef | grep lgwr
  oradem  4162     1   0 02:01:29 ?           1:50 ora_lgwr_DEM
    root  9747  5417   0 15:53:39 pts/4       0:00 grep lgwr

Pressing CTRL-c after the testimport is finished will deliver the follwing results in ns:

forcedirectio

root@ontario1:/opt/dtrace # ./avg_redo_write.d 4162
dtrace: script './avg_redo_write.d' matched 2 probes
^C

  oracle                                                      3932139


noforcedirectio

root@ontario1:/opt/dtrace # ./avg_redo_write.d 4162
dtrace: script './avg_redo_write.d' matched 2 probes
^C

  oracle                                                      5893665

This script observed an average response time for LGWR ufs_write calls of ~ 4 ms in case of direct I/O and about 5,9 ms for buffered I/O. These results are in line with the numbers we've gotten from AWR reports.

Back to the original discussion about usage of buffered vs. direct I/O:
You may think now, that this testscenario is a little bit awkward since it assumes lots of tiny commits. OK, let's do larger commits by changing the buffersize in the imp-parfile to 16 MB.
The first run uses noforcedirectio

Import terminated successfully without warnings.

real          44.74
user          10.57
sys            0.60

iosnoop-Output:

STRTIME              DEVICE  MAJ MIN   UID   PID D    BLOCK   SIZE                     PATHNAME ARGS
2009 May 28 00:33:34 sd0      32   4   215  4162 W 36398016   8192   /slice4/redolog/redo03.log ora_lgwr_DEM\0
2009 May 28 00:33:34 sd0      32   4   215  4162 W 36398032   8192   /slice4/redolog/redo03.log ora_lgwr_DEM\0
2009 May 28 00:33:34 sd0      32   4   215  4162 W 36398048   8192   /slice4/redolog/redo03.log ora_lgwr_DEM\0
2009 May 28 00:33:34 sd0      32   4   215  4162 W 36398064   8192   /slice4/redolog/redo03.log ora_lgwr_DEM\0
2009 May 28 00:33:34 sd0      32   4   215  4162 W 36398080   8192   /slice4/redolog/redo03.log ora_lgwr_DEM\0

Again, we can see just 8 KB sequential writes, although Oracle should commit larger chunks of data! This can be confirmed by looking at Oracle stats:

Event		
Statistic				Total		per Second	per Trans
redo size 				39,506,404 	314,606.56 	564,377.20
redo write time 			3,222 		25.66 		46.03
redo writes 				99 		0.79 		1.41

This makes up for about 400 KB/redo write.

Let's now look at the same test using forcedirectio:

Import terminated successfully without warnings.

real          17.01
user          10.55
sys            0.59

That's about 163% faster than buffered I/O (17,01 sec vs. 44,74 sec). Again, let's check iosnoop:

STRTIME              DEVICE  MAJ MIN   UID   PID D    BLOCK   SIZE                     PATHNAME ARGS
2009 May 28 00:54:09 sd0      32   4   215  4162 W 27313955   6656                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:10 sd0      32   4   215  4162 W 27313968 1048576                      <none> ora_lgwr_DEM\0
2009 May 28 00:54:10 sd0      32   4   215  4162 W 27316019 242176                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:10 sd0      32   4   215  4162 W 27316492  34816                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:10 sd0      32   4   215  4162 W 34013328  93696                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:10 sd0      32   4   215  4162 W 34013511 1044992                      <none> ora_lgwr_DEM\0
2009 May 28 00:54:10 sd0      32   4   215  4162 W 34015552   3584                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:10 sd0      32   4   215  4162 W 34015559   3072                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:10 sd0      32   4   215  4162 W 34015565 361984                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:10 sd0      32   4   215  4162 W 34016272   5632                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:11 sd0      32   4   215  4162 W 34016283 133632                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:11 sd0      32   4   215  4162 W 34018064 857600                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:11 sd0      32   4   215  4162 W 34013216  57344                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:11 sd0      32   4   215  4162 W 34019739 371200                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:11 sd0      32   4   215  4162 W 34020464    512                       <none> ora_lgwr_DEM\0
2009 May 28 00:54:11 sd0      32   4   215  4162 W 34020465 1048064                      <none> ora_lgwr_DEM\0 

As you can see, I/Os occur at different sizes, values between 512 Byte and 1 MB can be observed. On average that certainly appears to be more efficient than small 8 KB writes.

Bottomline

I think the results shown by AWR and DTrace are pretty clear. You are still preferring UFS buffered I/O? Well, not my fault ......

Labels

sap sap Delete
oracle oracle Delete
performance performance Delete
dtrace dtrace Delete
solaris solaris Delete
logwriter logwriter Delete
lgwr lgwr Delete
direct direct Delete
io io Delete
favourite favourite Delete
Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.

Sign up or Log in to add a comment or watch this page.


The individuals who post here are part of the extended Sun Microsystems community and they might not be employed or in any way formally affiliated with Sun Microsystems. The opinions expressed here are their own, are not necessarily reviewed in advance by anyone but the individual authors, and neither Sun nor any other party necessarily agrees with them.

© 2010, Oracle Corporation and/or its affiliates
Powered by Atlassian Confluence
Oracle Social Media Participation Policy Privacy Policy Terms of Use Trademarks Site Map Employment Investor Relations Contact