{include:DSEE Banner}
{toc}

h1. Introduction

The information contained in this article is primarily targeted at the Sun Java System Directory Server 5.2. Caching strategies may change for future releases as the Directory Server itself changes. It is based on my own analysis, experimentation, and experience with customers running in production environments. Some portions may not apply to all deployments. Because cache sizing is a very subjective topic, wherever possible I have tried to explain the reasoning behind all of my recommendations to allow administrators to discern whether they may be applicable in their own circumstances. Any potential configuration changes that you may make as a result of information in this document should be thoroughly tested before being implemented in a production environment.

Properly sizing the Directory Server caches is one of the most important aspects in getting the best performance out of the server, but it can also be one of the most complicated. There are a few different approaches that can be taken, and the best one to use for any one environment can be based on a number of factors, including which platform and operating system you're using, the amount of memory you have in your systems, the size of your database, and the type of load that is placed on the server. I'll try to explain all the issues involved so that you can get a better understanding of how things work and are able to make informed decisions about the best configuration for your environment.

h1. Directory Server Caches

h2. Entry Cache

This is basically a big hashtable that allows us to store entries indexed by DN, entry ID, and a replication-specific unique ID. If the entry that we need is held in the entry cache and we know any one of those keys, then we can retrieve that entry extremely quickly and in a form that is ready for immediate use.
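To make the "one entry, three keys" idea concrete, here is a minimal Python sketch. It is not the server's implementation: the class and field names are hypothetical, and lowercasing the DN is a simplifying assumption that stands in for real DN normalization.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entry:
    """Hypothetical in-memory entry; the field names are illustrative only."""
    dn: str
    entry_id: int
    unique_id: str
    attributes: Dict[str, List[str]] = field(default_factory=dict)


class EntryCache:
    """Toy per-backend cache that indexes each entry under three keys."""

    def __init__(self) -> None:
        self._by_dn: Dict[str, Entry] = {}
        self._by_id: Dict[int, Entry] = {}
        self._by_unique_id: Dict[str, Entry] = {}

    def put(self, entry: Entry) -> None:
        # All three indexes reference the same object, so the entry is stored once.
        # Assumption: lowercasing is enough to normalize a DN for this sketch.
        self._by_dn[entry.dn.lower()] = entry
        self._by_id[entry.entry_id] = entry
        self._by_unique_id[entry.unique_id] = entry

    def get_by_dn(self, dn: str) -> Optional[Entry]:
        return self._by_dn.get(dn.lower())

    def get_by_id(self, entry_id: int) -> Optional[Entry]:
        return self._by_id.get(entry_id)

    def get_by_unique_id(self, unique_id: str) -> Optional[Entry]:
        return self._by_unique_id.get(unique_id)


cache = EntryCache()
cache.put(Entry("uid=jdoe,ou=People,dc=example,dc=com", 42, "a1b2c3d4-0001"))
hit = cache.get_by_id(42)
```

Knowing any one of the three keys is enough to reach the same ready-to-use object, which is the property the description above relies on.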
There is a separate entry cache per backend database, so if you've got multiple databases, then you've got multiple entry caches. When the Directory Server is first started, the entry caches are empty and therefore don't consume any memory. As it begins to handle requests and entries, the entry cache will start to be populated and the amount of memory that it consumes will begin to grow. Note that while the size of the entry cache can be constrained by a maximum number of entries, this is a legacy option remaining from the 4.x Directory Server that is generally no longer used, and it will not be discussed in this article.

h2. Database Cache

This is a pool of memory that is used to hold pages of the database files in memory. These are pages from any database files, so that can include contents of both the indexes and the main id2entry database that holds the entries themselves (in the past it has been a somewhat common misconception that the database cache was only used for holding index information). In the default configuration, the database cache is memory-mapped and therefore backed by files (this is used to allow multiple processes to access the database concurrently, for example generating an LDIF export while the server is running). The location of these backing files is controlled by the nsslapd-db-home-directory configuration attribute and, as I've pointed out before, you will want to make sure that they are moved to a tmpfs filesystem on Solaris (or an equivalent memory-backed filesystem on other platforms). These files are allocated when the Directory Server first starts up, and the amount of memory consumed by the database cache remains constant while the server is running.

h2. Filesystem Cache (buffer cache on UNIX systems)

This is used to hold pages of files in memory, and in that regard can play a role very similar to the one that the database cache already fills.
However, there are some very notable differences that can come into play, and therefore I'll describe the pros and cons of each in a later section. It is important to note, however, that the filesystem cache is provided by the underlying operating system and therefore is not directly tied to the Directory Server itself.

h2. Disk Cache

This is used to hold pages of files in the memory of the underlying storage device and therefore is independent of both the Directory Server and the operating system. It's usually quite small in relation to the main system memory and is primarily intended to help improve write performance. The disk cache is largely irrelevant to the remainder of this discussion and therefore won't be addressed further in this article.

h2. Import Cache

This memory is used to help provide temporary storage of information while the server is building a backend database as part of an [LDIF|http://en.wikipedia.org/wiki/LDIF] import. This LDIF import can be in the form of an ldif2db process launched with the server offline, an ldif2db.pl process launched as a task with the server online, or through online replica initialization. Because the latter two of these import types are performed with the server online, that should be taken into consideration so that the addition of the import cache doesn't push the total memory consumption beyond what's available on the system. This is generally not a problem, because the entry cache for the database being imported will be wiped before the import cache is allocated, and that will usually free up more than enough memory. Because this cache is only used while an LDIF import is in progress, and because LDIF imports are generally rare and the associated database is unavailable anyway, the import cache will also be ignored for the rest of this document.
You will get better import performance by increasing the import cache beyond the default size of 20MB, but there will be diminishing returns for anything above about 2GB, so that's usually where I set it.

h1. Cache Sizing

So now for the remainder of this document, we have three primary caches to focus on: the entry cache, the DB cache, and the filesystem cache. When we talk about sizing the caches, we are referring to not only the absolute sizes of these caches, but also their sizes relative to each other and to the amount of memory available on the system.

The main principles behind cache sizing are simple: accessing data in memory is much faster than going to disk, and retrieving an entry from the entry cache is notably faster than getting it from the database (even if it is in the DB cache). In this vein, if you have enough memory on the system to fully cache all entries in the entry cache and also to cache the entire database in the DB cache, then that is usually the best configuration and will yield the greatest performance. However, when there is not enough memory available to fully cache everything, or even in some cases when it may be possible but doing so would require extremely large amounts of memory (into the hundreds of gigabytes), then other tactics are necessary.

h2. 32-bit vs. 64-bit

In some cases, the Directory Server is available as either a 32-bit or a 64-bit application. In particular, it is available as a 64-bit process on Solaris (currently SPARC only, but x64 support is coming in the 6.0 release) and HP-UX. On other platforms (Solaris x86, Linux, AIX, and Windows), it is only available in 32-bit mode. Although there are a number of differences between 32-bit and 64-bit processes, the most significant is the amount of memory that the process can access.
A 64-bit process can theoretically address up to 2^64^ bytes of memory (about 17 billion gigabytes), although there is currently no system that can hold anywhere near that amount (and there may be other architectural limitations that come into play before that limit is reached as well). However, 32-bit processes can only address up to 2^32^ bytes, or four gigabytes. What's more, that total of four gigabytes must be split between user space and kernel space. User space is where all the memory actually consumed by the application resides, whereas kernel space is used for things like thread stack traces and memory linked in from shared libraries.

There is generally a well-defined split between user space and kernel space in 32-bit processes, although it varies by operating system. On Solaris x86, a 32-bit process can use up to 3GB for its own data while 1GB is reserved for the kernel. The same is usually true for Linux, although some distributions provide kernels compiled with a different split. On HP-UX and AIX, the split is down the middle so user space and kernel space get 2GB each. Windows systems also use a 2GB split for user and kernel space, although it is possible to use up to 3GB for user space if applications are compiled with a special header (note that the Directory Server is not compiled in this manner). This functionality is not available for systems that wish to use more than 16GB of total memory.

As if the above explanation of 32-bit process size limits was not complex enough, I should also point out that it does not cover the case in which a 32-bit application is running on top of 64-bit Solaris. In this case, the user-space portion of the 32-bit application has access to nearly the entire 4GB range. This applies to SPARC-based systems as well as most x64 systems, although there is a problem with some early Opteron CPUs that can limit 32-bit applications on those systems to only 3GB.
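The figures above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the per-platform table restates the user-space limits quoted in the text, while the function names and the 0.5GB headroom default are assumptions made for this example, not documented values.

```python
GIB = 2 ** 30  # bytes in one (binary) gigabyte


def addressable_gib(pointer_bits: int) -> int:
    """Return the theoretical address space, in gigabytes, for a pointer width."""
    return (2 ** pointer_bits) // GIB


# User-space ceilings for a 32-bit process, in GB, as quoted in the text above.
USER_SPACE_GIB_32BIT = {
    "Solaris x86": 3,
    "Linux (typical kernel)": 3,
    "HP-UX": 2,
    "AIX": 2,
    "Windows": 2,
}


def fits_in_32bit(platform: str, planned_gib: float, headroom_gib: float = 0.5) -> bool:
    """Check a planned process size against the platform ceiling minus a margin.

    headroom_gib is an illustrative safety buffer, not a documented value.
    """
    return planned_gib + headroom_gib <= USER_SPACE_GIB_32BIT[platform]


print(addressable_gib(32))  # 4
print(addressable_gib(64))  # 17179869184, i.e. about 17 billion gigabytes
```

A check like `fits_in_32bit("AIX", 1.8)` returns False under these assumptions, which shows how little room a 2GB user space leaves once any safety margin is applied.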
If the Directory Server is running as a 32-bit process and it reaches the upper limit of what the underlying operating system will allow, then any further attempt to allocate additional memory will be rejected. If this occurs, then the Directory Server cannot continue running and will try to shut down gracefully to avoid crashing. It will release a small amount of memory that has been pre-allocated in an attempt to have enough memory to complete the shutdown process, but if this is not sufficient then it will not be able to perform a normal shutdown and will have to run the database recovery process when it is restarted.

Although running as a 64-bit application does have a slight overhead in terms of both performance and memory consumption, this overhead is far outweighed by the ability to address more than 4GB of memory. Therefore, for platforms that support it, the server should generally be run as a 64-bit process. This will be the default mode when the server is installed, and there is generally no reason to change that. For servers that are small enough that the total memory consumption is less than the limit that would be imposed for a 32-bit process, the small difference in performance will be insignificant because directories that small will almost certainly not be pushing the performance boundary for the underlying system.

Besides the difference in the amount of memory that the application can address, the only other notable difference between a Directory Server operating in 32-bit mode versus one operating in 64-bit mode is that the underlying database will use a slightly different format. This means that if the Directory Server is initially installed as a 64-bit application, it cannot be run in 32-bit mode without first exporting the data to LDIF and then re-importing in 32-bit mode. This process is also necessary in order to transition from 32-bit mode to 64-bit mode.

h2. Filesystem Cache vs. Database Cache

As I mentioned near the beginning of this article, both the database cache and the filesystem cache are used to cache pages of files, and in that regard they perform similar functions. However, there are a number of important differences between them that can affect which one you should use to avoid having to go to disk to retrieve information from the database. Those differences include:

* The database cache is directly associated with the Directory Server and therefore will only cache pages of the database files. The filesystem cache is associated with the underlying operating system and can cache portions of files that are completely unrelated to the Directory Server, and can potentially page out contents of the database files in order to make room for something completely unrelated. In this way, if you are primarily using the filesystem cache instead of the database cache for caching pages of the database files, then it is important to note that the filesystem cache may get "polluted" with contents of other files if any other I/O is necessary. This can apply even for I/O related to the Directory Server, like performing backups.
* The database cache is directly associated with the Directory Server and therefore the memory that it consumes is also directly associated with the Directory Server and counts toward any limit that the OS may enforce on the process size (although because it is technically shared memory it will not be included in core files by default). The filesystem cache is not directly associated with the Directory Server and therefore if it happens to hold pages of the database files that will not be counted toward the total process size of the Directory Server. This is a very important distinction if the Directory Server is operating as a 32-bit process.
* The filesystem cache will automatically use free memory on the system to buffer pages that have been recently read or written.
It is dynamically sized, unlike the database cache which uses a fixed size configured by the Directory Server administrator.
* The contents of the database cache are only valid while the Directory Server is online. If the server is stopped and restarted, then it will treat the database cache as if it were empty. Because the filesystem cache is not directly associated with the Directory Server process, its contents are retained across Directory Server restarts, although they will be lost across system reboots or if the associated filesystem is unmounted.
* The database cache helps to defer writes to the actual database files between checkpoints (the updates will still be written to the transaction logs and therefore will not be lost even if the server or underlying system crashes or otherwise fails). Therefore, even if you decide to allow the filesystem cache to take the primary responsibility for caching the database files, it is important to have at least some amount of database cache in order to efficiently handle write operations.
* The filesystem cache can be manually "primed" by interacting with the underlying files. For example, {noformat}for FILE in db/*; do dd if="${FILE}" of=/dev/null; done{noformat}. This is not as easily accomplished with the database cache.
* Whenever the Directory Server performs a checkpoint, it walks through the database cache looking for dirty pages that need to be written to disk. This can cause a brief interruption in operations that need to access the underlying database. The interruption is negligible in most cases, but for very large caches (e.g., those nearing 100GB or more) it can be significant.
* Whenever the Directory Server requests information from the database, the database will first check its own cache to determine if it contains the requested information. If it is not there, then it will ask the OS to retrieve that information from the filesystem, which will cause it to check to see if it is held in the filesystem cache.
When the requested information comes back from the filesystem, the database cache may store it for future use, potentially paging some older information out to make room for it. This means that if the requested information is found in the database cache, then it will be slightly faster than if it is necessary to go to the underlying filesystem, and also that there may be an additional performance penalty for filesystem access if it is necessary to page out old data to make room for the new information. As noted, the filesystem cache is kind of like a black box. There isn't a whole lot of tuning that can be done to control how big it is, nor is there a supported method to determine what it might be holding. About the only type of configuration option that can be used is the "forcedirectio" mount option for UFS filesystems. If this option is enabled, then the contents of the associated filesystem will not be cached. If you are favoring the filesystem cache over the database cache, then it would probably be a very good idea to use the forcedirectio mount option on filesystems containing files that you don't want cached (e.g., those that will be used to hold the server logs or database backups). It is also important to consider that the two caches cannot really be used in conjunction with each other. That is, it isn't possible to use the database cache to hold a portion of the information and rely on the filesystem cache for the rest. The reason for this is that the two caches would always end up holding the same contents. The information in the database cache will reflect what has most recently been read from the filesystem and therefore the same information is also likely to be in the filesystem cache. This means that if you are going to rely on the filesystem cache rather than the database cache, it is best to make the database cache relatively small to minimize the overlap. But don't completely minimize it to avoid hurting write performance. 
In most cases, it should not be smaller than the 10MB that it was originally allotted.

Based on all the details discussed in this section, it is possible to put together a basic set of guidelines that can be used to determine whether to use the database cache or the filesystem cache for holding pages of the database files:

* If the Directory Server is running as a 32-bit application, then it may be best to use the filesystem cache because it is not directly associated with the Directory Server process and therefore not counted toward the maximum amount of memory that the server will be allowed to address. This means that you can use a small database cache and use most of the memory allowed by the 32-bit process size limits to go towards the entry cache.
* If the Directory Server is running as a 64-bit application, then it's generally best to use the database cache instead of the filesystem cache because it will be slightly faster and that memory will be guaranteed to be reserved for use by the Directory Server. However, if the database size is very large and scanning through it during checkpoints could cause a noticeable delay, then it may be better to go with the filesystem cache.
* If you will be frequently restarting the Directory Server for some reason but will not be rebooting or unmounting the associated filesystems, then it may be worthwhile to opt for the use of the filesystem cache over the database cache because it will be able to survive Directory Server restarts.
* If you do decide to use the filesystem cache over the database cache, then do not completely minimize the size of the database cache because that could have an adverse impact on write performance. Further, you should consider using the forcedirectio mount option on any filesystems containing files that you don't want to be cached.

Note that this discussion has primarily focused on the caching mechanism used by the UFS filesystem. Other filesystems may have different characteristics.
For example, the caching mechanism available with QFS is highly configurable in the way that it operates, as well as in the amount of memory that it will consume. Similarly, ZFS has its own caching mechanism, and the developers have expressed interest in implementing features like compressing the data in the cache just as they have the ability to compress the data on disk. If this does become available, then it may be possible to hold an even greater amount of data in memory in order to avoid disk I/O.

h2. The Entry Cache

The Directory Server entry cache can play a huge role in getting the best possible performance out of the Directory Server, but it is also responsible for a pretty fair number of headaches experienced by administrators. On one hand, it allows for very fast access to entries if they are in the cache, and this can be notably faster than if it is necessary to retrieve the information from the database cache and convert it to an entry in internal form for use in processing. But on the other, it will generally require significantly more memory to cache an entry in the entry cache than in the DB cache.

How much more? I can't really tell you that because it varies based on the type of information you have in your entries and the kinds of operations that have been performed on them, and perhaps a little bit on random chance. You may see some formulas floating around that try to estimate this, or some that may try to estimate the range of sizes that the entry cache may have, but I won't give any here because the truth is that it isn't something that can be boiled down to a simple formula, and most of them out there are based on either wild speculation or what was observed from a very small sample set. The best way to get the most accurate estimates possible is what I like to call the "try it and see" approach.
If you want to size the database cache, then it's relatively easy to do so based on the size of the database files on disk. The reason for this is that, as I've mentioned before, the database cache merely holds pages of the database files. Since you know how big the database files are, you can know how much memory will be needed to hold them. Further, this memory can be used in a very efficient manner because all the pages in the database files are exactly the same size, so if you need to throw an old page out of the cache in order to make room for a new one, then the new one will fit perfectly in the spot left behind by the old one with no wasted space. There's not even any need to allocate additional memory for the new page because it's already pre-allocated when the server starts.

These are all traits that do not apply to the entry cache. In particular:

* The amount of space that the entry consumes in the database (or in LDIF) has little to no direct relationship with the amount of space that will be required to store that entry in the internal form used by the entry cache. Therefore, you can't just look at the size of the database on disk and use that to create an accurate estimate of the ideal entry cache size.
* Not all entries are the same size. If the entry cache is full and you want to add another entry into it, then it will be necessary to evict one or more existing entries to make room for the new entry. It will generally be the case that the new entry doesn't consume exactly the same amount of space as the entries that were freed. This can lead to little "gaps" in the memory space associated with the server that are allocated but may be too small to put anything in.
* The memory associated with an entry in the entry cache can either be in a single contiguous block with pointers to different offsets within that block for the various components (which is the form that is normally used when an entry is first read into the cache from the database), or as lots of little bits of memory that are individually allocated (which is the form that it will have if it is modified after having been pulled into the cache). If the latter exploded form is used, then the underlying operating system may prefer to align each chunk of memory that it allocates on a particular type of boundary (e.g., four bytes or eight bytes), and if the number of bytes allocated for each of these chunks isn't a multiple of this alignment size, then there can be tiny gaps of free memory where nothing can be placed. All of this leads us to the basic conclusion that sizing the entry cache is hard. There is no simple formula that you can use to figure out how much memory you will need to fully cache everything. You would think that you might simply be able to do a big "(objectClass=*)" search to load all entries (or at least a really big chunk of them) into memory and then look at the monitor information to see how much space is being consumed, and in fact that is possible in a sense. This approach (or other, more efficient methods of priming the server) can be used to figure out what value you should use for the nsslapd-cacheMemSize attribute if you want to cache everything in the database. However, it is not sufficient for estimating the amount of memory that the entry cache will actually consume. The approach that the Directory Server takes when it tries to determine how much memory the entry cache consumes is pretty simple. Whenever the server puts an entry in the cache, it looks at all of the components of that entry and figures out how many bytes are needed to hold each of them. 
Then it adds up all of those values to get the total amount of memory needed to cache that entry. It keeps a running total for the cache, adding to it each time an entry is put in the cache, and subtracting from it every time that an entry is removed. In a perfect world, this estimate should match the amount of memory actually being consumed, but in reality this estimate can be wildly different as a result of all the gaps in memory due to fragmentation and alignment issues. As a result, the amount of memory that is actually taken up by the entry cache when you factor in all of this wasted space is a whole lot more than what the server thinks it is using.

This can create a real problem. If the Directory Server is running in 32-bit mode and the estimate is far enough off then it runs the risk of exceeding the process size limit imposed by the underlying operating system, which will force the server to shut down or potentially crash while trying to do so. If the server is running in 64-bit mode and it nears the limit of the total amount of memory available on the system, then it can cause the system to start to swap, which is an absolute performance nightmare. Both of these conditions are extremely undesirable and should be avoided at all costs.

For now, the best approach to prevent the amount of memory consumed by the entry cache from getting out of hand is to be conservative when setting the cache size. If you want to use the "try it and see" method in order to get a decent estimate, then you should do so using the following process:

# Restart the server so that it starts with an empty entry cache.
# Load the entry cache using only search operations. If you can fit all entries into the entry cache, then a simple "(objectClass=*)" search will accomplish this but will be quite slow.
If you have a list of all the DNs of the entries in the database, then it will be faster to split that list into several chunks and retrieve the entries using a multithreaded process (or perhaps even by searching from multiple client systems). If you can't fit all the entries into the cache, then keep reading entries even after the cache is full so that old entries need to be evicted to make room for new ones.
# Try to modify each entry, ideally in a somewhat random order, until all entries in the cache have been modified. This will cause the entries in the cache to be exploded from their optimized form, in which the associated memory is allocated as a single block, to the form in which the memory for each component of the entry needs to be allocated separately.

The total amount of memory consumed by the Directory Server at the end of this process should give you some idea of what you can expect with that configuration in a production environment, although you should still make sure to leave a healthy buffer to avoid exceeding the 32-bit size limit or forcing the system to start swapping.

The upcoming Directory Server 6 release will attempt to deal with the discrepancy between the actual and estimated entry cache sizes using a two-pronged attack. On one hand, it will work more closely in conjunction with the underlying memory manager library in order to get a more accurate estimate of how much space is actually being used. This will help ensure that the value it reports for the currentEntryCacheSize attribute in the monitor entry for the backend will more accurately reflect the amount of memory actually being consumed by the entry cache. On the other, it will monitor the total size of the Directory Server process and if it gets too far out of line with what would be expected based on the configured cache sizes, then it will start to free up memory by evicting entries from the cache to prevent it from growing too large.

h2. The Perils of Extremely Large Directory Server Processes

As a 64-bit application, the Directory Server can address huge amounts of memory. If you have a really big directory and a really big system with lots of memory to run it on, then making the Directory Server run as a very large process can be tempting. However, there are a few caveats that should be taken into consideration when making this determination. In some cases, it may be appropriate to scale back the use of the caches in order to avoid certain undesirable side effects.

First, let me clarify what I mean by "very large", because that is quite a subjective term. I would say that systems like the Sun Fire 25K, which can support up to 576GB of RAM, certainly fall into that category, and I would also include systems like the Sun Fire 6900, which can handle up to 192GB of memory. For a Directory Server installation that could potentially strain or even surpass the memory capacity of systems like this, there is certainly danger of sizing the Directory Server so large that it runs into the problems described in this section. Some of the problems that I mention will also be applicable to smaller systems (e.g., the Sun Fire 2900, which can support up to 96GB of RAM), but to a lesser extent.

h3. Core Files

The first issue that you need to be aware of is one that you will hopefully never encounter but should be prepared to handle nonetheless. If the Directory Server happens to crash for some reason, then the best way to figure out what went wrong is from a core file. A core file essentially contains a snapshot of the memory associated with a process at the time that it crashed, and using debugging tools it can be possible to figure out what went wrong so that the problem can be understood and potentially corrected.
If the process was really large at the time that it crashed, then the resulting core file will generally be really big, and it can also take a significant length of time to write to disk (e.g., if the system and underlying storage are able to achieve a sustained rate of 50 megabytes per second, then writing a 50 gigabyte core file would take over 16 minutes). During this time, the associated process is in a kind of "limbo" state where it is in the process of crashing and still has all of its associated memory, and you shouldn't try to start it up again until this completes.

It should be noted that by default, portions of the process size that are memory-mapped (which includes the database cache) will not be included in a core file, but the entry cache is contained in the process heap so it would be included. As such, the larger the entry cache, the larger a core file could be if the heap is included in the core. I say "if the heap is included in the core" because, even though it is included by default, the Solaris 10 coreadm utility provides the ability to configure what specifically is included in a core file, and the heap is something that can be excluded. This can be done by adding the following near the top of the start-slapd script:

{noformat}
coreadm -p core.%f.%p -P default-heap $$
{noformat}

Note that excluding the heap from the core file may hinder the ability to debug certain problems, but the resulting core files will be much smaller (and therefore easier to transport if necessary) and will take much less time to write to disk.

h3. Slow Process Map Traversal

Another problem that may arise with extremely large processes (and again, particularly those with very large entry caches) is that performing certain kinds of operations that might need to traverse the memory map associated with the Directory Server process can get expensive.
If this happens, then anything that needs to traverse this map could essentially freeze the process for some period of time until the operation is complete. In cases with very large Directory Server processes (or any process that is sufficiently large), issuing a command like "pmap" or even "ps" can cause the server to hang for seconds or even minutes. Attaching to the process via DTrace or similar tools can also cause this behavior to occur.

There is work underway to improve this problem in Solaris. In fact, one improvement was added into Solaris 10 just before it was released but after my last testing in this area, so it may already be markedly better than I have described. There are at least two other fixes in this area on the way. But even without these improvements, perhaps the best option that is available to administrators is to try increasing the default memory page size for applications. Using the largest available page size for the heap may help reduce the total number of pages that are needed and therefore reduce the impact for operations that need to traverse the process map.

To determine the default page size for a Solaris system, use the "pagesize" command with no arguments. To determine the set of all available page sizes for the system, issue the command "pagesize -a". In general, for very large processes it is best to use the largest supported page size for the heap.

In order to configure the Directory Server to use a larger page size, it is necessary to edit the start-slapd script so that it will use the ppgsz command to set the preferred page size. This is an option that won't be automatically inherited by sub-processes, so it must be directly applied to the command used to launch the Directory Server binary.
On the line immediately before the one that invokes the ./ns-slapd command, add the following new line:

ppgsz -o heap=4M \

Make sure to leave the trailing backslash so that the next line is considered a continuation of this line, and change the "4M" to be the largest supported page size for that system. See the ppgsz(1) man page and/or SunSolve InfoDoc 71311 for more details.

The degree to which increasing the page size helps may depend on a number of factors, including the overall size of the process and the type of system on which it is running. If increasing the page size does not appear to completely solve the problem, then unfortunately the only other option currently available is to avoid performing operations that require determining the process size or otherwise traversing the process map.

h3. Long Startup Times After Reboot

As mentioned earlier, the Directory Server database cache uses memory mapping such that the contents of the cache are actually backed by files on the filesystem, which makes it possible for multiple processes to safely have concurrent access to the database. For best performance, these backing files should be placed in a directory on a tmpfs filesystem (e.g., somewhere under "/tmp") using the nsslapd-db-home-directory configuration attribute.

Because the database cache backing files are on a tmpfs filesystem, this means that the information will be lost if the system is rebooted. This is not in itself a problem because the Directory Server doesn't make any attempt to use the contents of the database cache backing files between restarts, and also because the server will recreate the files automatically if they aren't already there (or if they are present but have the wrong size because of a change in the cache configuration). However, the process of recreating the backing files can take some time, and if the cache is particularly large, then it can be several minutes.
This will increase the length of time required for the Directory Server to start after the system is rebooted (or if the backing files need to be recreated for any other reason). There is nothing that can be done to avoid this, and the performance improvement that comes from placing the files on a tmpfs filesystem should far outweigh the cost of recreating them after a reboot, but it is a concern that should at least be noted. If the long startup time in this case is unacceptable, then that may be a reason to consider the use of the filesystem cache over the database cache. h3. Pauses During Database Checkpoints It was noted earlier in this article, but it is worth repeating here. When the Directory Server initiates a checkpoint, the database needs to traverse its cache to identify any dirty pages and ensure that they are flushed to disk. The larger the database cache, the longer this process can take. During this time, much of the database is locked, so any operation that needs to interact with the database in some way (which basically includes everything but base-level searches or binds targeting any entry that is already in the entry cache) may be blocked until the checkpoint completes. Although this problem may be addressed in future releases, there is little that can be done to avoid it currently. As such, if your database has a very large footprint and you would want to use a lot of memory to help cache it to avoid disk I/O, then reliance on the filesystem cache may be a better option than using the database cache. h2. Overall Recommendations for Caching Configurations With the information provided earlier in this document, you should have a fundamental understanding of the concepts behind cache sizing. At this point, we can make some generalizations about cache sizing that may be helpful in your deployment. 
Note that each deployment is a little different, but if you are aware of the pros and cons of each option, then you can make an informed decision about what the best approach might be. First, a few general statements that are almost always true in any environment: * Disk I/O is something to be avoided. If you see a lot of it, especially after the server has been online for long enough to "warm up", then it may be advisable to alter the cache configuration to place a greater reliance on the DB or filesystem cache. * Neither the entry cache nor the database cache should be completely minimized. If the database cache is too small, then the performance of write operations can suffer significantly. If the entry cache is too small, then there is a chance that the server can encounter a deadlock if an entry is evicted from the cache while an operation involving that entry is still in progress. In most cases, the cache sizes should not be reduced below their default sizes of 10MB each. * The number of entries that can fit in the entry cache for a given configured size is best determined by actually trying it -- you should not try to estimate it based on either the size of the entry in LDIF or the size of the database. Set the entry cache to what you would consider to be a large value and read entries until it is full or all entries have been read. Assuming that all the entries are about the same size, then you should be able to use extrapolation to determine how much space would be required for a larger or smaller amount. * With any configuration, you must ensure that the Directory Server process will not grow too large for the underlying system. For 32-bit versions, you should be careful to prevent it from reaching the OS-imposed process size limit so that it is not forced to shut itself down. For 64-bit versions, it is important to avoid growing so large that the system starts to use swap. 
If the system begins to swap, then performance will severely degrade and it will be difficult to recover without restarting. * The database cache may not be changed without restarting the Directory Server. The entry cache size may be altered while the server is running and that change will take effect immediately. However, you should never reduce the entry cache to a size below the amount of memory it is currently consuming. If this occurs, then the server will lock the cache until it can evict enough entries to free the necessary amount of space. During this time, the server will appear unresponsive because virtually all operations require interaction with the cache. * Remember to plan for the future as well as the present. If you have a reasonable degree of confidence that the database will grow as time goes by, then try to take that into account in your estimates. * You should always use performance testing to examine the caching configuration that you choose, including monitoring system metrics like memory consumption and disk I/O. Unless you have a good reason to believe that the Directory Server may be restarted frequently (which should not be the case in most environments), then you should ensure that at least most of the tests are run with the server fully primed. * If you do believe that you will be frequently restarting the Directory Server for some reason, then take that into account when deciding on the caching configuration. Remember that the entry cache and database cache need to "warm up" when the server is first started, so performance in those conditions will be worse than after it has been running for a while. In that case, increased reliance on the filesystem cache can be helpful because it will survive Directory Server restarts. If you expect to reboot the system frequently, then heavier use of the filesystem cache may still be a good idea but you may also want to consider manually priming the filesystem cache with utilities like dd. 
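The "try it and extrapolate" approach to entry cache sizing mentioned in the list above can be sketched in a few lines of Python. This is only a rough illustration under the same assumption stated there (that all entries are about the same size). The function names, the optional headroom parameter, and the sample numbers are my own inventions for the example and are not part of the Directory Server; the measured values would come from a test run in which you filled the cache and recorded how many bytes and entries it held.

```python
import math


def estimate_entry_cache_bytes(measured_bytes, measured_entries,
                               target_entries, headroom=0.0):
    """Extrapolate the entry cache size needed for target_entries.

    measured_bytes / measured_entries come from a test run; headroom is a
    hypothetical safety margin (0.25 means 25% extra).
    """
    if measured_entries <= 0:
        raise ValueError("need at least one measured entry")
    bytes_per_entry = measured_bytes / measured_entries
    return math.ceil(bytes_per_entry * target_entries * (1.0 + headroom))


def entries_that_fit(measured_bytes, measured_entries, cache_bytes):
    """Estimate how many similar entries fit in a cache of cache_bytes."""
    if measured_entries <= 0:
        raise ValueError("need at least one measured entry")
    bytes_per_entry = measured_bytes / measured_entries
    return int(cache_bytes // bytes_per_entry)


if __name__ == "__main__":
    # Made-up test run: 500,000 entries consumed 2,000,000,000 bytes.
    needed = estimate_entry_cache_bytes(2_000_000_000, 500_000, 2_000_000)
    print("bytes needed for 2,000,000 entries:", needed)
    print("entries in a 1,000,000,000 byte cache:",
          entries_that_fit(2_000_000_000, 500_000, 1_000_000_000))
```

If your entries vary widely in size, a single average like this will be misleading, and it is safer to measure with a representative sample of the real data.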
With this information in mind, we can begin to make some general recommendations for the most common scenarios in which the Directory Server may be deployed: * If you have enough memory on the system to safely cache all data in both the entry cache and either the DB cache or the filesystem cache, then do that. This will virtually always give you the best possible performance. * If you have enough memory on the system to safely cache all data in the entry cache and still have enough left over to use the DB or filesystem cache to handle all of the database except id2entry, then that will also yield very good performance. With all entries in the entry cache, then there will be little need to cache pages of the id2entry database file. However, you should remember that write operations will still need to interact with id2entry in order to store new or updated entries, so it is often better to allow for a little overhead to handle that. * You will virtually always want to ensure that the database or filesystem cache is able to hold at least the commonly-used indexes because they will be used for onelevel and subtree searches. This should include indexes required for internal operation, like the ancestorid and entrydn indexes. In a replicated environment, this should include the nsUniqueID index. If a lot of onelevel searches are performed, then this should also include the parentid index. * If the size of the database on disk is close to the amount of memory on the system, or if the size of the database is greater than the amount of available memory, then you will most likely want to use a very small entry cache and use the vast majority of the memory for caching pages of the database files with either the database or filesystem cache. If you are using the 32-bit version of the server, or if the system has more than around 96GB of memory, then the filesystem cache is probably a better choice than the database cache. 
* The most complicated case is the one in which there is more than enough memory on the system to fully cache the database in either the DB or filesystem cache but with only a relatively small amount left over for the entry cache. Using this particular approach will often provide good performance, and if it is acceptable then that might be the way to go. However, in this case performance will often be better by reducing the percentage of the database pages that are cached in order to give more memory to the entry cache. Finding the optimal mix is something of a balancing act because it would still be preferable to avoid having to go to disk for database access, but if an entry is in the entry cache then it is redundant and wasteful for it to also be in the database/filesystem cache when memory is at a premium. Configurations that fall into this realm are often best served by the "try it and see" approach, but a heavier preference toward caching database pages rather than caching entries will often allow performance to be more stable.

h2. Refining Cache Settings Based On Performance Monitoring

In some cases, Directory Server cache sizing may be a "set it and forget it" kind of operation. If a given configuration yields sufficient performance for the workload it needs to handle, then there is little reason to mess with it, even if it would be possible to allow the server to go even faster. However, if the workload changes or increases significantly, or if you are still in the early stages of determining what the best cache configuration should be, then you may be able to take advantage of a wide array of monitoring information to help you better understand where to focus your efforts. Before describing the information that is available, it is important to note a couple of caveats:

* You should be aware that most metrics will be skewed shortly after the server has been restarted, particularly for measurements around the entry cache or database cache.
These caches start off empty and then become populated as the server processes requests, and therefore there will be much higher miss rates than after the server has been running for a while. Unless the server or the underlying system needs to be restarted frequently for some reason, don't put much importance on any information gathered during these times.
* You should be aware of the kinds of workloads that the server will be asked to handle. In particular, if an application needs to interact with an entry multiple times as part of a single operation, then that can inflate hit ratios. For example, one of the most common types of authentications involves the client issuing a search to find the user's entry, and then binding as that user to verify the credentials. In this case, because the entry was just retrieved from the search, it is virtually guaranteed that it will still be in the entry cache (even for very small caches) when the bind request is received, and as such the entry cache hit ratio will always be at least 50% for those kinds of operations. Similarly, if a client tends to hit the same entry four times in a row, then three of them are virtually guaranteed hits and therefore the ratio should always be at least 75%. In these cases, the biggest factor in Directory Server performance will be the efficiency with which it can retrieve the entry the first time in that sequence, and you shouldn't be fooled by what might initially look like a high hit ratio.

The first tool that you can use is the OS-provided iostat utility (I generally use the command "iostat -x -n 2"), which can be used to determine how much disk I/O there might be. In particular, you will want to look out for reads against the database (this is most useful if the database has been isolated on its own disk subsystem, separate from the transaction logs, changelog, and access/error/audit logs).
If there is a high number of reads against the database, then that likely means that the number of pages cached in the database/filesystem cache isn't high enough. One of the easiest ways to determine if disk I/O is a problem is to look in the "%b" column, which shows the percentage of the time that the associated disk is busy doing something. If this is a high percentage for the database disk and primarily read operations are involved, then that should be a red flag that the database cache or filesystem cache isn't being as effective as it could be. You should also note that it is normal to see very high %b values primarily as a result of writes performed during a database checkpoint, but if you see a significant number of writes between checkpoints, then that is a very good sign that the database cache is too small (even if you are relying primarily on the filesystem cache).

The next set of metrics that you can use are the primary database cache statistics, which are available by retrieving the "cn=monitor,cn=ldbm database,cn=plugins,cn=config" entry. The attributes contained in this entry include:

* dbCacheTries -- the number of times the Directory Server has attempted to retrieve data from the database.
* dbCacheHits -- the number of times that attempts to retrieve information from the database were able to be processed using information in the database cache.
* dbCacheHitRatio -- the percentage of attempts to retrieve information from the database that could be processed using information in the database cache.
* dbCachePageIn -- the number of database pages that have been read into the cache due to reads from the database requiring information not already in the cache.
* dbCachePageOut -- the number of database pages that have been written to the cache backing files.
* dbCacheROEvict -- the number of "clean" pages (i.e., those that had not changed since being read into the cache) that were thrown out of the cache to make room for new information.
* dbCacheRWEvict -- the number of "dirty" pages (i.e., those that have changed since being read into the cache) that were thrown out of the cache to make room for new information, after first committing the changes that they contained to the main database files.

If you have chosen to primarily use the filesystem cache instead of the database cache, then the only real metric here that you should be worried about is the dbCacheRWEvict attribute. Any time that a dirty page must be evicted from the cache, it means that the cache was not large enough for the number of write operations that the server needed to handle between checkpoints. This can significantly limit write performance, and as a result you should either increase the database cache or decrease the checkpoint interval (increasing the cache is generally the preferred approach).

For configurations in which the database cache will be used instead of the filesystem cache, all metrics are important, but in particular you should look at the ratio of database cache hits to tries. Note that I didn't say the cache hit ratio, which may seem confusing at first, but the reason for this is that there will be a very large number of cache misses after the Directory Server has been restarted (because the DB cache will be empty) and all of those misses can skew the hit ratio. Instead, once the server has been up for a while, you should note the number of cache tries and cache hits. Then after a few hours, check them again. Subtract the old values from the new values, and compare the total number of hits for that period with the total number of tries in order to get the hit ratio for just that period. If a significant percentage of the DB operations involve cache misses, then either the server has not been up long enough to have been fully primed or the database cache is too small. If you're trying to cache the entire database in the DB cache, then you should also look at the number of read-only evicts.
If the database has had to throw out clean pages in order to make room for new data, then the cache is not large enough to hold everything. The Directory Server also provides information about the state of the entry cache. Since there is a separate entry cache per backend database, then there is also a separate set of entry cache statistics per backend database. This information is available in the {noformat}"cn=monitor,cn={dbname},cn=ldbm database,cn=plugins,cn=config"{noformat} entries. The most important attributes for understanding entry cache efficiency are: * entryCacheTries -- the total number of attempts made to retrieve an entry from the entry cache. * entryCacheHits -- the total number of times that a requested entry was found in the entry cache. * entryCacheHitRatio -- the overall percentage of the time that a requested entry was found in the entry cache. * dbEntryCount -- the total number of entries in the backend database. Note that this can include "hidden" entries like tombstones that are not normally visible to clients. * ldapEntryCount -- the total number of "regular" entries in the backend database that will generally be visible to clients (with sufficient access rights). * currentEntryCacheCount -- the total number of entries that are currently held in the entry cache. This can include "hidden" entries. * maxEntryCacheSize -- the maximum allowed size in bytes for the entry cache associated with the backend database. * currentEntryCacheSize -- the number of bytes that the Directory Server believes the entry cache is currently consuming. The important metrics for understanding the efficiency of the entry cache are all ratios. The ratio of the current entry cache size to the maximum size will allow you to determine how close to becoming full the cache is. The ratio between the ldap entry count and the current entry cache count will show you the percentage of entries that are currently in the cache. 
If the cache is not yet full and does not contain all of the entries, then it is either still being "warmed up" after a restart, or there could be a significant percentage of entries that are rarely or never accessed. If the entry cache has not yet been "warmed up", then attempts to judge the hit ratio are largely meaningless because there will likely be a lot of misses while the cache is still filling. If it has been primed, then the entry cache hit ratio can be determined in much the same way as for the database cache -- by comparing the number of hits to the number of tries over a period of time so that the results are not skewed by the large number of misses that occur just after startup. However, as mentioned earlier in this document, you will want to be aware of the real behavior of the applications that are hitting the server: if an application needs to interact with an entry multiple times in the course of a single operation, then that could artificially inflate the hit ratio, and you would want to take that into account. |
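The interval-based hit ratio described above (for both the database cache and the entry cache) amounts to a simple subtraction of two counter samples. Here is a minimal Python sketch of that arithmetic. The function name and the sample counter values are made up for illustration; in practice the numbers would be the tries and hits attributes read from the relevant monitor entry at two points in time, and how you retrieve them is up to you.

```python
def interval_hit_ratio(old_tries, old_hits, new_tries, new_hits):
    """Return the hit percentage for the period between two samples.

    Returns None if there were no tries during the period. Raises
    ValueError if a counter went backwards, which most likely means the
    server was restarted between the two samples.
    """
    tries = new_tries - old_tries
    hits = new_hits - old_hits
    if tries < 0 or hits < 0:
        raise ValueError("counters decreased; was the server restarted?")
    if tries == 0:
        return None
    return 100.0 * hits / tries


if __name__ == "__main__":
    # Made-up samples taken a few hours apart on a warmed-up server.
    first = {"tries": 1_000_000, "hits": 600_000}
    second = {"tries": 5_000_000, "hits": 4_560_000}

    period = interval_hit_ratio(first["tries"], first["hits"],
                                second["tries"], second["hits"])
    cumulative = 100.0 * second["hits"] / second["tries"]
    print("hit ratio for the period:", period)
    print("cumulative hit ratio since startup:", cumulative)
```

With these made-up numbers, the ratio for the period is noticeably higher than the cumulative figure, because the cumulative figure still carries all of the misses from just after startup.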
| Even if all is currently going well with Directory Server, you may want to periodically monitor these kinds of metrics so that you may be able to identify new trends that may emerge over time and deal with them before they become a problem. For example, as new entries are added and the database grows larger, you may start to see a few read-only evicts from the database cache or slightly higher miss rates. These may not be immediately accompanied by any noticeable drop in performance, but if the trend continues and the Directory Server reaches a critical tipping point (e.g., when the underlying disks can no longer keep up with all the random-access reads required to retrieve out-of-cache pages) then it may very quickly turn into a significant problem. |
Attribution: Neil Wilson. |