Analyzing Oracle Logwriter w/ buffered vs. direct I/O
Introduction / MotivationSAP and Oracle already provide powerful instrumentation to diagnose problems and performance issues. However, DTrace allows unparalleled deepdive into the system. Therefore I decided to write a series of articles showing use cases for DTrace in SAP/Oracle environments. Since I already wrote some articles about Oracle and I/O, this will be my starting point for DTrace: Although it is really considered best practice to let Oracle access it's files through direct I/O rather than buffered I/O, there are still many installations not following this advice. Either customers are convinced, that buffered I/O is better / faster, or they are 'victims' of Oracle bug 5752399 affecting releases 10.2.0.2 and 10.2.0.3 (the latter one not being supported in SAP environments).
BackgroundWhile there are very few situations, where buffered I/O for the Oracle datafiles (sapdata filesystems) provides better performance than direct I/O, direct I/O for the online redolog filesystems should be set in any case. This article is supposed to give some more insights on this subject by using DTrace.
|
|
Approach
The test scenario uses an Oracle 10g database (10.2.0.1), specifically the imp utility with a dump file containing around 1,000,000 records. After each test, the corresponding table gets dropped and a checkpoint is triggered. By changing between direct and buffered I/O, the mountoption will be changed in between the tests (online by using remount). This procedure allows good control over commit frequency and commit size (via imp parfile settings) as well as very stable results (reproducable). We will not only measure total runtimes but look at I/O and ufs internals by using DTrace scripts.
buffer=1024 commit=Y file=my_sales.dump rows=y
This parameterfile to the imp-utility ensures frequent commits, each commit writing only small amounts of data.
Import with redologs mounted 'noforcedirectio':
Import terminated successfully without warnings. real 2:24.41 user 11.04 sys 1.66
Import with redologs mounted 'forcedirectio':
Import terminated successfully without warnings. real 1:48.09 user 11.03 sys 1.62
These results show an advantage of about 36 secs (~30%) when using 'forcedirectio'. First of all, let's have a look at I/O. The DTrace toolkit by Brendan Gregg provides a script called 'iosnoop.sh'. Tracing the Logwriter process (lgwr) gives us the following information:
forcedirectio:
./iosnoop.sh -aD -p 4162 STRTIME DEVICE MAJ MIN DELTA UID PID D BLOCK SIZE PATHNAME ARGS 2009 May 27 11:42:56 sd0 32 4 3772 215 4162 W 22293311 512 <none> ora_lgwr_DEM\0 2009 May 27 11:42:56 sd0 32 4 3874 215 4162 W 22293312 2560 <none> ora_lgwr_DEM\0 2009 May 27 11:42:56 sd0 32 4 3600 215 4162 W 22293317 3072 <none> ora_lgwr_DEM\0 2009 May 27 11:42:56 sd0 32 4 3733 215 4162 W 22293323 2560 <none> ora_lgwr_DEM\0 2009 May 27 11:42:56 sd0 32 4 3814 215 4162 W 22293328 512 <none> ora_lgwr_DEM\0 2009 May 27 11:42:56 sd0 32 4 3322 215 4162 W 22293329 3072 <none> ora_lgwr_DEM\0 2009 May 27 11:42:56 sd0 32 4 3193 215 4162 W 22293335 3072 <none> ora_lgwr_DEM\0 2009 May 27 11:42:56 sd0 32 4 3730 215 4162 W 22293341 1536 <none> ora_lgwr_DEM\0 2009 May 27 11:42:56 sd0 32 4 3813 215 4162 W 22293344 1536 <none> ora_lgwr_DEM\0 2009 May 27 11:42:56 sd0 32 4 3615 215 4162 W 22293347 3072 <none> ora_lgwr_DEM\0 2009 May 27 11:42:56 sd0 32 4 3556 215 4162 W 22293353 3072 <none> ora_lgwr_DEM\0
The table being imported contains VARCHAR fields, so the recordlength may vary. The size of all I/Os is a multiple of 512 Byte and it is pretty small - allways less than 4 KB. Overall, no surprises: sequential writes of different sizes depending on the amount of data that has to be flushed from logbuffer to the redolog file. Let's have a look at Oracle's statistics:
Event Statistic Total per Second per Trans redo size 48,438,904 267,760.27 2,734.50 redo synch time 7,384 40.82 0.42 redo synch writes 17,864 98.75 1.01 redo wastage 5,559,348 30,730.93 313.84 redo write time 7,381 40.80 0.42 redo writer latching time 0 0.00 0.00 redo writes 17,734 98.03 1.00
OK, Oracle states an average lgwr I/O of 2731 Bytes (redo size / redo writes) with an average servicetime of 4,2 ms. Remember: the times listed in v$sysstat is measured in cs rather than ms!
Now let's do the same with 'noforcedirectio'.
noforcedirectio:
./iosnoop.sh -aD -p 4162 STRTIME DEVICE MAJ MIN DELTA UID PID D BLOCK SIZE PATHNAME ARGS 2009 May 27 02:09:41 sd0 32 4 5848 215 4162 W 21946960 8192 /slice4/redolog/redo01.log ora_lgwr_DEM\0 2009 May 27 02:09:41 sd0 32 4 3667 215 4162 W 21946960 8192 /slice4/redolog/redo01.log ora_lgwr_DEM\0 2009 May 27 02:09:41 sd0 32 4 3690 215 4162 W 21946960 8192 /slice4/redolog/redo01.log ora_lgwr_DEM\0 2009 May 27 02:09:41 sd0 32 4 3650 215 4162 W 21946960 8192 /slice4/redolog/redo01.log ora_lgwr_DEM\0 2009 May 27 02:09:41 sd0 32 4 22498 215 4162 R 21949024 1048576 /slice4/redolog/redo01.log ora_lgwr_DEM\0 2009 May 27 02:09:41 sd0 32 4 23045 215 4162 W 21946976 8192 /slice4/redolog/redo01.log ora_lgwr_DEM\0 2009 May 27 02:09:41 sd0 32 4 3468 215 4162 W 21946976 8192 /slice4/redolog/redo01.log ora_lgwr_DEM\0 : : 2009 May 27 02:09:44 sd0 32 4 5855 215 4162 W 21949008 8192 /slice4/redolog/redo01.log ora_lgwr_DEM\0 2009 May 27 02:09:44 sd0 32 4 3333 215 4162 W 21949008 8192 /slice4/redolog/redo01.log ora_lgwr_DEM\0 2009 May 27 02:09:44 sd0 32 4 3690 215 4162 W 21949008 8192 /slice4/redolog/redo01.log ora_lgwr_DEM\0 2009 May 27 02:09:44 sd0 32 4 22519 215 4162 R 21951072 1048576 /slice4/redolog/redo01.log ora_lgwr_DEM\0 2009 May 27 02:09:44 sd0 32 4 23070 215 4162 W 21949024 8192 /slice4/redolog/redo01.log ora_lgwr_DEM\0 2009 May 27 02:09:44 sd0 32 4 3675 215 4162 W 21949024 8192 /slice4/redolog/redo01.log ora_lgwr_DEM\0 : :
Two interesting observations:
LGWR repeatedly writes to the same Block! We know, that in our test case Oracle does small commits. If buffered I/O is used, even though just 512 bytes changed, a full OS block has to be written at each commit. This leads to subsequent writes to the same block on disk.
Every now and then, we can see a large read (1 MB) to the logfile. Looking at the block adresses, it seems very much like a sort of prefetching. One 8 KB block before the end of the prefetched data (1 MB) is reached, another 1 MB read is issued. This somehow makes sense, because when being forced to write 8 KB blocks only, you have to know Byte 513-8192 if you want to change Bytes 1-512.
Let's look at the database internal perspective:
Event Statistic Total per Second per Trans redo size 50,214,248 293,724.43 2,832.48 redo synch time 10,993 64.30 0.62 redo synch writes 17,870 104.53 1.01 redo wastage 5,609,796 32,814.08 316.44 redo write time 10,999 64.34 0.62 redo writer latching time 0 0.00 0.00 redo writes 17,728 103.70 1.00
One might have expected the 'redo size' being much greater due to the write overhead of buffered I/O we talked about earlier. But remember: this is the Oracle view of things. Padding the small chunks of data to full 8 KB filesystem blocks happens at the UFS layer, not within Oracle! Therefore we can observe almost exactly the same amount of redo data (redo size) and almost exactly the same number of redo writes. But we can see a significant difference in redo write time (73,8 s vs. 110 s), which corresponds to the difference in total runtime of both tests.
Now it would be great to get a better understanding of what's happening behind the scenes in both scenarios. Again, a small DTrace script doing kernel profiling can give us an impression:
#!/usr/sbin/dtrace -s
#pragma D option flowindent
fbt:ufs::entry
/pid == $1/
{
self->ts[probefunc] = timestamp;
}
fbt:ufs::return
/pid == $1 && self->ts[probefunc] != 0/
{
self->duration[probefunc] = (timestamp - self->ts[probefunc])/1000;
}
fbt:ufs::
/pid == $1/
{
trace(self->duration[probefunc]);
self->duration[probefunc] = 0;
}
There is one parameter passed to the script which is the pid of the Oracle LGWR process. The script uses the fbt (function boundary tracing) provider. fbt provides entry and return probes of most Solaris kernel functions. This time we are particularly interested in the UFS module. Basically we are tracking elapsed time between entry and return. In addition we are using the flowindent option to create more readable output providing a clear picture of the call hierarchy.
Direct I/O is being described as having a pretty short codepath. Well, we're not going into the code of each call, but simply the number of calls being made should give you an impression of what's going on. At the beginning, there's the call to 'ufs_write'.
forcedirectio (numbers in microsec):
root@ontario1:/opt/dtrace # ./ufs3.d 4162 1 -> ufs_write 0 1 -> ufs_check_rewrite 0 1 -> bmap_has_holes 0 1 <- bmap_has_holes 28 1 <- ufs_check_rewrite 94 1 -> ufs_lockfs_begin 0 1 -> ufs_lockfs_is_under_rawlockfs 0 1 <- ufs_lockfs_is_under_rawlockfs 47 1 <- ufs_lockfs_begin 135 1 -> ufs_directio_write 0 1 -> bmap_read 0 1 -> findextent 0 1 <- findextent 23 1 <- bmap_read 75 1 -> directio_start 0 1 <- directio_start 67 1 -> directio_wait 0 1 -> directio_wait_one 0 1 <- directio_wait_one 2914 1 <- directio_wait 2981 1 <- ufs_directio_write 3259 1 -> ufs_itimes_nolock 0 1 -> ufs_imark 0 1 <- ufs_imark 21 1 <- ufs_itimes_nolock 78 1 -> ufs_lockfs_end 0 1 -> ufs_lockfs_top_vop_return 0 1 <- ufs_lockfs_top_vop_return 42 1 <- ufs_lockfs_end 120 1 <- ufs_write 3894
noforcedirectio (numbers in microsec):
0 -> ufs_write 0 0 -> ufs_check_rewrite 0 0 -> bmap_has_holes 0 0 <- bmap_has_holes 36 0 <- ufs_check_rewrite 119 0 -> ufs_lockfs_begin 0 0 -> ufs_lockfs_is_under_rawlockfs 0 0 <- ufs_lockfs_is_under_rawlockfs 59 0 <- ufs_lockfs_begin 170 0 -> ufs_itimes_nolock 0 0 -> ufs_imark 0 0 <- ufs_imark 29 0 <- ufs_itimes_nolock 102 0 -> wrip 0 0 -> ufs_indirblk_sync 0 0 <- ufs_indirblk_sync 41 0 -> ufs_getpage 0 0 -> ufs_lockfs_begin_getpage 0 0 <- ufs_lockfs_begin_getpage 52 0 -> bmap_has_holes 0 0 <- bmap_has_holes 36 0 <- ufs_getpage 227 0 -> ufs_putpage 0 0 <- ufs_putpage 32 0 -> ufs_putpages 0 0 -> ufs_putapage 0 0 -> bmap_read 0 0 -> findextent 0 0 <- findextent 34 0 <- bmap_read 99 0 -> lufs_write_strategy 0 0 -> deltamap_remove 0 0 <- deltamap_remove 38 0 <- lufs_write_strategy 171 0 <- ufs_putapage 4877 0 <- ufs_putpages 4955 0 -> ufs_check_rewrite 0 0 <- ufs_check_rewrite 41 0 -> top_delta 0 0 -> deltamap_add 0 0 <- deltamap_add 33 0 <- top_delta 98 0 -> ufs_itimes_nolock 0 0 -> ufs_imark 0 0 <- ufs_imark 28 0 <- ufs_itimes_nolock 101 0 <- wrip 5833 0 -> ufs_lockfs_end 0 0 -> ufs_lockfs_top_vop_return 0 0 <- ufs_lockfs_top_vop_return 53 0 <- ufs_lockfs_end 152 0 <- ufs_write 6614
In case you have no access to Oracle statspack or AWR reports (or even have no license for Oracle Enterprise Manager diagnostics pack), you may run the following script to get an idea for the average redo write time of LGWR:
#!/usr/sbin/dtrace -s
fbt:ufs:ufs_write:entry
/pid == $1/
{
self->ts = timestamp;
}
fbt:ufs:ufs_write:return
/self->ts != 0 && pid == $1/
{
@time[execname] = avg(timestamp - self->ts);
self->ts = 0;
}
This script uses an aggregation (@time[]) to calculate the average elapsed time when entering the return probe.
Again: you have to pass the pid of Oracle LGWR which can be determined by
root@ontario1:/opt/dtrace # ps -ef | grep lgwr
oradem 4162 1 0 02:01:29 ? 1:50 ora_lgwr_DEM
root 9747 5417 0 15:53:39 pts/4 0:00 grep lgwr
Pressing CTRL-c after the testimport is finished will deliver the follwing results in ns:
forcedirectio
root@ontario1:/opt/dtrace # ./avg_redo_write.d 4162 dtrace: script './avg_redo_write.d' matched 2 probes ^C oracle 3932139
noforcedirectio
root@ontario1:/opt/dtrace # ./avg_redo_write.d 4162 dtrace: script './avg_redo_write.d' matched 2 probes ^C oracle 5893665
This script observed an average response time for LGWR ufs_write calls of ~ 4 ms in case of direct I/O and about 5,9 ms for buffered I/O. These results are in line with the numbers we've gotten from AWR reports.
Back to the original discussion about usage of buffered vs. direct I/O:
You may think now, that this testscenario is a little bit awkward since it assumes lots of tiny commits. OK, let's do larger commits by changing the buffersize in the imp-parfile to 16 MB.
The first run uses noforcedirectio
Import terminated successfully without warnings. real 44.74 user 10.57 sys 0.60
iosnoop-Output:
STRTIME DEVICE MAJ MIN UID PID D BLOCK SIZE PATHNAME ARGS 2009 May 28 00:33:34 sd0 32 4 215 4162 W 36398016 8192 /slice4/redolog/redo03.log ora_lgwr_DEM\0 2009 May 28 00:33:34 sd0 32 4 215 4162 W 36398032 8192 /slice4/redolog/redo03.log ora_lgwr_DEM\0 2009 May 28 00:33:34 sd0 32 4 215 4162 W 36398048 8192 /slice4/redolog/redo03.log ora_lgwr_DEM\0 2009 May 28 00:33:34 sd0 32 4 215 4162 W 36398064 8192 /slice4/redolog/redo03.log ora_lgwr_DEM\0 2009 May 28 00:33:34 sd0 32 4 215 4162 W 36398080 8192 /slice4/redolog/redo03.log ora_lgwr_DEM\0
Again, we can see just 8 KB sequential writes, although Oracle should commit larger chunks of data! This can be confirmed by looking at Oracle stats:
Event Statistic Total per Second per Trans redo size 39,506,404 314,606.56 564,377.20 redo write time 3,222 25.66 46.03 redo writes 99 0.79 1.41
This makes up for about 400 KB/redo write.
Let's now look at the same test using forcedirectio:
Import terminated successfully without warnings. real 17.01 user 10.55 sys 0.59
That's about 163% faster than buffered I/O (17,01 sec vs. 44,74 sec). Again, let's check iosnoop:
STRTIME DEVICE MAJ MIN UID PID D BLOCK SIZE PATHNAME ARGS 2009 May 28 00:54:09 sd0 32 4 215 4162 W 27313955 6656 <none> ora_lgwr_DEM\0 2009 May 28 00:54:10 sd0 32 4 215 4162 W 27313968 1048576 <none> ora_lgwr_DEM\0 2009 May 28 00:54:10 sd0 32 4 215 4162 W 27316019 242176 <none> ora_lgwr_DEM\0 2009 May 28 00:54:10 sd0 32 4 215 4162 W 27316492 34816 <none> ora_lgwr_DEM\0 2009 May 28 00:54:10 sd0 32 4 215 4162 W 34013328 93696 <none> ora_lgwr_DEM\0 2009 May 28 00:54:10 sd0 32 4 215 4162 W 34013511 1044992 <none> ora_lgwr_DEM\0 2009 May 28 00:54:10 sd0 32 4 215 4162 W 34015552 3584 <none> ora_lgwr_DEM\0 2009 May 28 00:54:10 sd0 32 4 215 4162 W 34015559 3072 <none> ora_lgwr_DEM\0 2009 May 28 00:54:10 sd0 32 4 215 4162 W 34015565 361984 <none> ora_lgwr_DEM\0 2009 May 28 00:54:10 sd0 32 4 215 4162 W 34016272 5632 <none> ora_lgwr_DEM\0 2009 May 28 00:54:11 sd0 32 4 215 4162 W 34016283 133632 <none> ora_lgwr_DEM\0 2009 May 28 00:54:11 sd0 32 4 215 4162 W 34018064 857600 <none> ora_lgwr_DEM\0 2009 May 28 00:54:11 sd0 32 4 215 4162 W 34013216 57344 <none> ora_lgwr_DEM\0 2009 May 28 00:54:11 sd0 32 4 215 4162 W 34019739 371200 <none> ora_lgwr_DEM\0 2009 May 28 00:54:11 sd0 32 4 215 4162 W 34020464 512 <none> ora_lgwr_DEM\0 2009 May 28 00:54:11 sd0 32 4 215 4162 W 34020465 1048064 <none> ora_lgwr_DEM\0
As you can see, I/Os occur at different sizes, values between 512 Byte and 1 MB can be observed. On average that certainly appears to be more efficient than small 8 KB writes.
Bottomline
I think the results shown by AWR and DTrace are pretty clear. You are still preferring UFS buffered I/O? Well, not my fault ......