SAP on Oracle Performance in high latency Metrocluster Setup

What this article is about

Many customers are planning to extend their local High Availabilty implementation in order to protect themselves against full site failure.
Quite a lot of customers I talk to are considering to extend their existing local HA infrastructure towards a dual site (stretched) cluster configuration for obvious reasons:

  • almost no change in administration / operation
  • usually no additional software components
  • automatic failover in case of site failure

This article is supposed to give you a better understanding of the performance impact the new setup may have due to increased I/O latency. The information you will get is based on theory as well as practical measurements in a real world environment. Although this topic easily allows for writing dozens of pages, I'll try to keep this article as short as possible.

What this article is not about

I am not going into general discussions or pros & cons of:

  • SLAs
  • Recovery Point Objective (RPO)
  • Recovery Time Objective (RTO)
  • logical replication vs. blocklevel replication
  • High Availability vs. Disaster Recovery etc.
  • generic topics for dual room / dual site configs (e.g. quorum)
Recently Updated
by joerg_meiners (29 Jan)
Adaptive Oracle Grid (SAPonSun)
by j.brosowski (28 Jan)
Understanding a SAP SD 2-tier benchmark result (SAPonSun)
by vwetter (19 Jan)
Open Dodecaeder for SAP on Sun (SAPonSun)
by j.brosowski (18 Jan)
Open Dodecaeder for SAP on Sun v. 16 (SAPonSun)

Background

There are many theoretical explanations about increasing latency depending on distance. There are also different WAN protocols and technologies available. Nowadays the most commonly used is FC with Wavelength Division Multiplex technology (WDM), especially high bandwidth Dense Wavelength Division Multiplex (DWDM).
But a good understanding of the effective latency imposed by the long distance SAN is just the first step. The second step is to apply the effective latency numbers to your database workload and calculate the anticipated increase in database response time. In a third step, you may apply the increase in DB response time to your overall application response time. Sounds easy? Well, not really! There are many aspects that have significant impact to the overall calculation, like the specific storage volume Layout, the particular database workload / transaction profile as well as the specific infrastructure layout (do appservers also have to traverse the - just to name a few.
The real world measurements I was referring to earlier have been done on an exclusive DWDM based Metropolitan Network with about 45 km distance between both datacenters.

Effective I/O elongation

According to common theory, you should calculate about 5 microsec/km distance, which is about two thirds of the speed of light (a ray of light won't go straight through the fibre, it rather gets reflected over and over again). At 50 km distance you should calculate :
(50 km + 50 km) * 5 microsec/km = 500 microsec = 0,5 ms

There's one big assumption behind this formula: I/O only needs one roundtrip to complete. While this is in fact possible using special HW and/or software, it is by no means standard fibre channel behaviour. The fibre channel protocol for SCSI requires two roundtrips as shown in the figure below:

Source: http://www.cisco.com/en/US/prod/collateral/ps4159/ps6409/ps5989/ps6217/prod_white_paper0900aecd8024fd2b.html

As you can see, before being able to send the actual data, the host/initiator has to wait for the disk/target to send back an FCP_XFER_RDY signal. That's roundtrip number one. Sending the data itself plus receiving an FCP_RSP 'status good' signal makes up roundtrip number two. There are products / options available that eliminate this two way handshake by returning the FCP_XFER_RDY signal from the local switch:

Source: http://www.cisco.com/en/US/prod/collateral/ps4159/ps6409/ps5989/ps6217/prod_white_paper0900aecd8024fd2b.html

However, this functionality usually requires special hardware (e.g. specific Storage Services Modules) and/or software / licenses.

Volume Layout considerations

Let's start with the obvious: Many Volume Managers provide a way to concentrate all reads towards one plex (mirror). Instead of reading in a round-robin fashion (which usually is set as default), most workloads are likely to benefit from satisfying reads from the local plex. Consider though that the corresponding disks will have to serve twice the number of read I/Os. If all systems are distributed equally to both locations, this should however have no performance impact in case of consolidated storage: the storage subsystems in both locations should have their share of the total I/O workload!
Regarding Veritas Volume Manager it is quite common to define so-called 'Dirty Region Logs' (DRLs) in order to prevent full mirror resynchs after a host crash. In a metrocluster setup, you should also carefully plan for the location of the DRL (keep it local for 'normal' operation). Another option specific to remote mirrors are so-called DCO-Logs (Data Change Object). These logs avoid full mirror resynchs not only after a host crash (in case DRLs are embedded to the DCO-Logs) but also after a temporary outage of the DWDM links (stale mirrors). Persistent DCO-Logs do track changes of data within a bitmap on disk. These logs have to exist in both locations. However, this additional security comes at it's price: During our tests we figured out, that each application write creates 6 I/O's to disk, 3 to each mirror. Although the DCO Log Area is pretty small, you still have to deal with lots of IOPS. I would recommend not to use DCO-Logs on small volumes with high write activity which is the case with Oracle's online redologs. A full mirror resynch of these small volumes won't hurt performance for a long time therefore omitting DCO-Logs here seems to be no problem at all.

Some theory about Oracle

The Oracle Server uses different processes to perform different sorts of disk I/O. It is important to understand, that not every I/O has a direct impact on application response time. From an application point of view, a commit is a blocking event, the app may not continue to work before the commit has been acknowledged. On the other hand, a change to a data block (SQL Update) is a non blocking event. The change is being applied to the database block within the data buffer of the SGA and won't go to disk as long as no checkpoint is triggered. I'm trying to avoid the words 'asynchronous' and 'synchronous' because they tend to be a great source of confusion in the database space as described here, so let's stick with 'blocking' and 'non-blocking'.
Oracle internal statistics reflect this fact by using so-called 'events' which may either be 'idle events' (non-blocking) or 'non-idle events' (blocking).
Now let's look at the most important types of I/O and see which ones are idle resp. non-idle events:

  • Physical Reads (single or multiblock, data or index blocks):
    It is pretty obvious, that reading a block is a blocking operation, since the app usually can't continue working without actually having the requested data. Since the volume manager's read policy should prefer the local disks, there should be no impact imposed by the long distance SAN (see description above). The corresponding Oracle wait event is called 'db file sequential read' for single block reads resp. 'db file scattered read' for multiblock reads (aka full table scans).
  • Direct Path I/O:
    Direct Path I/O is somewhat rare in SAP environments. It occurs when an Oracle
    I've seen them in APO and BW/SEM environments when Oracle performs on-disk sorts.
    Direct path reads don't need to be observed closer, as they benefit from an appropriate volume manager read policy. Direct path writes are non-blocking, as long as the session doesn't require the I/O Queue being completely processed. In addition, direct writes are multiblock I/O's and the latency impact for a fixed amount of data is reduced compared to lots of small I/O's (less roundtrips). Putting it all together, you probably don't have to consider Direct Path I/O when trying to guesstimate the performance impact on your app - at least not for SAP environments.
  • Flushing of 'dirty blocks' by database writer process (DBWR):
    As mentioned earlier, this is a non-blocking action from the app point of view. At least as long as you don't run into 'checkpoint not complete' situations. However, this usually only happens, if your storage subsystem doesn't provide enough IOPS. Increased latency should by no means become an issue here. Just be careful when using DCO-Logs, as this will increase the IOPS rate significantly.
  • Writes to Oracle online redologs after commits:
    The Oracle wait event you will see here is 'log file sync'. To be precise: this event also contains the time required for communication between the user session and the log writer process (LGWR), like waking up LGWR, retrieving the blocks to be flushed to the log from the log buffer, waking up the user session after successful flush etc. However, most of this event's wait time is spent to perform the I/O to the redolog. 'Log file sync' is a fully blocking event, as the app has to be sure that it's changes are persisted to the database before being able to continue processing. Therefore you certainly have to keep an eye on 'log file sync'.

Systematic approach to estimate the impact on your application response time

The basic idea during my tests was to look at Oracle statistics

  • at different times (e.g. during runtime critical processing, during batch, during online user load)
  • for different systems (as not every system behaves the same)

After having spoken to application folks you know where and when to look. Your friend to analyze your existing systems is Oracle AWR or good old statspack reports. Check the Top 10 Wait events and watch out especially for 'log file sync'. If 'log file sync' is not a major contributor to your overall wait time, then - as a first guess - you are probably fine. But let's get back to the calculation:

  1. determine the number of 'log file sync' waits (find it in section 'Top 5 timed events' or 'events', usually the column next to 'event name')
    Example:
                                                                       Avg
                                                         Total Wait   wait    Waits
    Event                               Waits   Timeouts   Time (s)   (ms)     /txn
    ---------------------------- ------------ ---------- ---------- ------ --------
    db file sequential read        18,767,245          0    141,638      8     25.5
    buffer busy waits               1,596,964         22     18,337     11      2.2
    db file parallel write          1,158,487          0     17,030     15      1.6
    db file scattered read            933,386          0     15,495     17      1.3
    log file sync                     746,021         15      5,993      8      1.0
    

    746021 is the number you are looking for

  2. calculate the average commit size:
    Look at section 'Instance Activity Stats' for 'redo size' and 'redo writes'. Now divide 'redo size' by 'redo writes' and you've got the average I/O size for LGWR in Bytes. You're curious why the I/O size is important? Well, stay tuned ...
    Example:
    Statistic                                      Total     per Second    per Trans
    --------------------------------- ------------------ -------------- ------------
    redo size                             39,353,890,204    5,465,059.1     53,368.7
    redo writes                                  519,561           72.2          0.7
    

    The average I/O size LGWR flushes to disk is 74 KB

  3. Now you could add the additional latency by multiplying the theoretical latency of your DWDM network with the number of 'log file sync' waits. But there's one more thing to consider:

    As you can see, there's a roundtrip for every 32 KB of data. In case you have eliminated the two way handshake (e.g. by FC Write Acceleration), you'll have 3 roundtrips to perform a 74 KB I/O. Without FC-WA, you will end up with 4 roundtrips! That's the reason why we calculated the average LGWR I/O size. Makes sense?
    Example:
    Assuming we are talking about the 50 km distance, without having FC-WA, we can calculate 4 * 0,5ms = 2 ms.
    Now multiply that with the number of 'log file sync' events, you get
    746021 * 2 msec = 1492 sec (Christopher Columbus would have liked that result )
  4. Now the real question: 'Do I have to care about the additional wait time?'
    Oracle AWR reports provide a section 'Time Model statistics' which lists the total time being spent in database user calls. In case of statspack, the way I deal with it is to take the top event (from the 'Top 5 Timed Events'), and take it's wait time and it's 'total elapsed time' percentage to calculate the total wait time.
    Example:
    Top 5 Timed Events
    ~~~~~~~~~~~~~~~~~~                                                     % Total
    Event                                               Waits    Time (s) Ela Time
    -------------------------------------------- ------------ ----------- --------
    db file sequential read                        18,767,245     141,638    44.92
    

    In this case just calculate:
    141638 sec / 0,4492 = 315.311 sec
    Within the example above, we calculated an additional database wait time of 1492 sec compared to a total wait time of > 315000 secs. This means an increase of database response time of less than half a percent. That's probably acceptable for most customers. Especially considering the fact, that DB response time is just one part of SAP response time. You may check the proportions using transaction ST03(N). In our example above, you probaly will stop here and don't do the SAP math.
    But be aware, that there are databases having 'log file sync' as the top wait event. Especially if this is due to a high number of commits with relatively low average wait time. If that's the case, the high latency may become a real problem for your system.

Drawbacks

The method described above will only give you an idea with respect to average system behaviour. There might be transactions / jobs not fitting that average. Update intensive processes with frequent commits are certainly such candidates. If these processes are also runtime-sensitive, it might be a better idea to trace this transaction / job specifically in order to get an idea about it's commit-frequency.

Labels

rss rss Delete
sap sap Delete
oracle oracle Delete
performance performance Delete
metro metro Delete
cluster cluster Delete
dwdm dwdm Delete
fibre fibre Delete
channel channel Delete
distance distance Delete
latency latency Delete
Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.

Sign up or Log in to add a comment or watch this page.


The individuals who post here are part of the extended Sun Microsystems community and they might not be employed or in any way formally affiliated with Sun Microsystems. The opinions expressed here are their own, are not necessarily reviewed in advance by anyone but the individual authors, and neither Sun nor any other party necessarily agrees with them.

© 2010, Oracle Corporation and/or its affiliates
Powered by Atlassian Confluence
Oracle Social Media Participation Policy Privacy Policy Terms of Use Trademarks Site Map Employment Investor Relations Contact