h1. Directory Server Monitoring
{toc:type=flat}
----
Directory Server installations differ, so treat the following table as the minimum set of data points to monitor. Values in the table, where they exist, are for guidance; they are not necessarily best practice.
||data point||sample rate||minimum useful data retention periods||nominal minimum value||nominal maximum value||actionable threshold||data retention period|| comparison to baseline values||
|CPU utilization|10m|
|Memory pressure|10m|
|swap space|10|
|disk space|3m|
|I/O pressure|3m|
|concurrent connections|3m|
|queued connections|5m|
|request queue backlog|5m|
|entry cache hit ratio|24h|
|database cache performance|24h|
|response times|5m|
|operations initiated vs completed|24h|
|replication latency|30m|
h1. System Resource Monitoring
h2. CPU
h3. CPU Pressure
CPU pressure occurs for a variety of reasons and is measured in different ways. One measure is CPU utilization, expressed as a percentage of time spent in user space and in system space, as shown below:
{noformat}
$ sar -u 1 10
17:48:22 %usr %sys %idle
17:48:23 3 11 86
17:48:24 1 3 96
17:48:25 4 1 95
17:48:26 3 0 97
{noformat}
where %usr is user space utilization and %sys is system space utilization.
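The sar samples above can be reduced to a single "busy" figure for trending. The following is a minimal sketch, assuming the three-column layout shown above (%usr, %sys, %idle); the function names {{parse_sar_cpu}} and {{mean_busy}} are hypothetical, and a sar build that prints additional columns would need the field positions adjusted.

```python
def parse_sar_cpu(text):
    """Return a list of (usr, sys, idle) tuples from `sar -u` output.

    Assumes the three-column layout shown in this page.
    """
    samples = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like "HH:MM:SS usr sys idle".
        # Skip blanks, the header row, and any summary row without a timestamp.
        if len(fields) < 4 or ":" not in fields[0] or not fields[1].isdigit():
            continue
        usr, sys_, idle = (int(f) for f in fields[1:4])
        samples.append((usr, sys_, idle))
    return samples


def mean_busy(samples):
    """Average of %usr + %sys across samples, or 0.0 when there are none."""
    if not samples:
        return 0.0
    return sum(usr + sys_ for usr, sys_, _ in samples) / len(samples)


# Sample rows copied from the sar output above.
SAMPLE = """\
17:48:22 %usr %sys %idle
17:48:23 3 11 86
17:48:24 1 3 96
17:48:25 4 1 95
17:48:26 3 0 97
"""

samples = parse_sar_cpu(SAMPLE)
print(mean_busy(samples))
```

Averaging hides short spikes, so a collector would probably keep the per-sample values as well.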
CPU pressure can also show up as a queue of processes that are ready to run but are waiting for a CPU. On a Solaris system, this run queue is the leftmost column (r) of vmstat output. In most cases, consistently non-zero values indicate a system under CPU pressure, and possibly on the road to CPU starvation. One could almost say that any system with a consistently non-zero run queue is CPU starved relative to the load placed on it.
A system that is not exhibiting CPU pressure:
{noformat}
solaris-devx / # vmstat 1
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd cd -- -- in sy cs us sy id
0 0 0 235776 80648 1 6 1 0 0 0 1 0 1 0 0 263 220 324 1 5 95
0 0 0 208744 54368 0 22 54 0 0 0 0 0 0 0 0 305 419 401 1 6 93
0 0 0 208736 54388 0 0 0 0 0 0 0 0 0 0 0 270 282 338 1 6 93
0 0 0 208736 54396 0 0 0 0 0 0 0 0 0 0 0 250 242 306 0 4 96
0 0 0 208736 54396 0 0 0 0 0 0 0 0 0 0 0 254 198 309 0 4 96
0 0 0 208736 54396 0 0 0 0 0 0 0 0 0 0 0 259 276 335 1 5 94
0 0 0 208736 54400 0 0 0 0 0 0 0 0 0 0 0 279 205 340 0 5 95
{noformat}
The same system comes under severe CPU pressure seconds later:
{noformat}
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd cd -- -- in sy cs us sy id
0 0 0 203200 47572 0 0 0 0 0 0 0 0 0 0 0 284 293 361 1 5 94
0 0 0 203200 47688 0 0 0 0 0 0 0 0 0 0 0 262 240 323 0 5 95
0 0 0 203200 47720 0 0 0 0 0 0 0 0 0 0 0 279 322 361 1 5 94
0 0 0 164092 34208 619 3411 0 0 0 0 0 0 0 0 0 296 12805 1733 27 64 9
32 0 0 170660 38920 2 1297 0 0 0 0 0 0 0 0 0 289 17470 3543 33 67 0
2 0 0 156364 27356 1 2914 0 0 0 0 0 7 0 0 0 375 18106 2519 27 73 0
110 0 0 145904 19976 6 2712 0 0 0 0 0 0 0 0 0 266 16687 2926 33 67 0
9 0 0 135384 11296 0 1855 4 0 36 0 14715 1 0 0 0 272 15540 2210 24 76 0
120 0 0 144496 19824 1 1811 12 0 0 0 0 2 0 0 0 291 18754 2828 36 64 0
108 0 0 137624 14340 0 897 0 0 28 0 10136 0 0 0 0 268 17041 2679 22 78 0
301 0 0 141616 19704 0 2064 0 0 0 0 0 0 24 0 0 396 20956 2741 26 74 0
90 0 0 155932 32980 0 1341 0 0 0 0 0 0 381 0 0 425 22025 3937 25 75 0
2 0 0 150372 28428 0 1600 0 0 0 0 0 0 0 0 0 279 21120 3810 29 71 0
{noformat}
The following table is from a suggestion by Adrian Cockcroft regarding CPU pressure and the run queue:
||run queue rule||level||action||
|runQueuePerCPU = 0|White|Idle|
|0 < runQueuePerCPU < 3|Green|No problem|
|3 <= runQueuePerCPU < 5|Orange|Busy (warning)|
|5 <= runQueuePerCPU|Red|Possible CPU starvation condition (alert)|
In this table, runQueuePerCPU is the first column of vmstat output divided by the number of CPUs.
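The run queue rule can be expressed as a small function. This is a sketch only: the function name {{classify_run_queue}} is hypothetical, it returns the action text from the table rather than a color, and it treats any per-CPU value of 5 or more as the alert band.

```python
def classify_run_queue(run_queue, cpu_count):
    """Map a vmstat run queue reading to the action text of the rule table.

    run_queue is the first column of vmstat output; cpu_count is the
    number of CPUs. Treating 5 or more per CPU as the alert band is an
    assumption of this sketch.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    per_cpu = run_queue / cpu_count
    if per_cpu == 0:
        return "Idle"
    if per_cpu < 3:
        return "No problem"
    if per_cpu < 5:
        return "Busy (warning)"
    return "Possible CPU starvation condition (alert)"


# Run queue values taken from the vmstat samples above (single CPU).
for reading in (0, 2, 32, 110):
    print(reading, classify_run_queue(reading, 1))
```

Because the rule is per CPU, a run queue of 12 on a four-CPU system falls in the warning band, while the same reading on a single CPU is an alert.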
h3. Processor Online
Check that processors are online with psrinfo:
{noformat}
solaris-devx / # psrinfo
0 on-line since 09/20/2007 10:35:34
solaris-devx / #
{noformat}
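A monitoring script can flag processors that are not on-line by scanning psrinfo output. The sketch below assumes each line begins with the processor ID followed by its state, as in the output above; the function name {{not_online}} is hypothetical, and the second processor in the sample text is invented for illustration.

```python
def not_online(psrinfo_text):
    """Return (processor_id, state) for every processor not reported on-line.

    Any state other than "on-line" is reported, which may be stricter
    than a given site needs.
    """
    flagged = []
    for line in psrinfo_text.splitlines():
        fields = line.split()
        # Skip shell prompts and anything not starting with a processor ID.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        if fields[1] != "on-line":
            flagged.append((int(fields[0]), fields[1]))
    return flagged


# First line is from the output above; the second line is a made-up
# example of a processor that has been taken off-line.
SAMPLE = """\
0 on-line since 09/20/2007 10:35:34
1 off-line since 09/20/2007 10:35:34
"""

print(not_online(SAMPLE))
```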
h2. Memory Pressure
Memory availability is difficult to measure. On Solaris, the amount of memory on the free list is easy to measure with System Activity Reporting (SAR) or with vmstat. One condition to avoid at all costs is paging. Swapping is even worse, but paging is bad enough.
The 12th column of vmstat output on Solaris, labeled "sr" (scan rate), indicates the level of paging activity. If the number in this column is non-zero, the page scanner is attempting to relieve memory pressure by moving pages of processes to backing store to satisfy requests from other processes. A system that pages consistently is under memory pressure, and the performance of processes on that system will suffer. Paging should be traced to its root cause and corrected immediately.
A system that is not paging at the moment, although there has been some history of paging:
{noformat}
solaris-devx / # vmstat 1
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd cd -- -- in sy cs us sy id
0 0 0 235696 80608 2 8 1 0 0 0 2 0 1 0 0 263 228 325 1 5 95
0 0 0 267604 105760 6 25 12 0 0 0 0 0 0 0 0 273 247 350 0 5 95
0 0 0 267604 105864 0 0 0 0 0 0 0 0 0 0 0 264 260 341 1 5 94
0 0 0 267604 106052 0 0 0 0 0 0 0 0 0 0 0 261 221 319 0 5 95
{noformat}
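The "sr" check can be automated by locating the column from the vmstat header instead of counting positions. The following is a minimal sketch: the function names and the {{min_run}} threshold of three consecutive samples are assumptions, not values from this page.

```python
def scan_rates(vmstat_text):
    """Return the "sr" value of each data row in Solaris vmstat output."""
    sr_index = None
    rates = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        # The column header row contains both "r" and "sr".
        if "sr" in fields and "r" in fields:
            sr_index = fields.index("sr")
            continue
        if sr_index is None or len(fields) <= sr_index:
            continue
        # Data rows are all numeric; this skips prompts and group headers.
        if not all(f.isdigit() for f in fields):
            continue
        rates.append(int(fields[sr_index]))
    return rates


def is_consistently_paging(rates, min_run=3):
    """True when at least min_run consecutive samples have a non-zero sr.

    min_run=3 is an arbitrary choice for this sketch.
    """
    run = 0
    for rate in rates:
        run = run + 1 if rate > 0 else 0
        if run >= min_run:
            return True
    return False


# Rows copied from the vmstat output above.
SAMPLE = """\
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd cd -- -- in sy cs us sy id
0 0 0 235696 80608 2 8 1 0 0 0 2 0 1 0 0 263 228 325 1 5 95
0 0 0 267604 105760 6 25 12 0 0 0 0 0 0 0 0 273 247 350 0 5 95
0 0 0 267604 105864 0 0 0 0 0 0 0 0 0 0 0 264 260 341 1 5 94
0 0 0 267604 106052 0 0 0 0 0 0 0 0 0 0 0 261 221 319 0 5 95
"""

rates = scan_rates(SAMPLE)
print(rates, is_consistently_paging(rates))
```

Requiring a run of consecutive non-zero samples keeps a single non-zero reading, such as the first row here, from raising an alert.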
h1. Directory Server
h2. readWaiters
The readWaiters attribute is maintained by Directory Server and indicates the number of connections that are pending but not currently assigned to, or serviced by, a thread. If this number is non-zero, Directory Server is unable to assign a thread to service a connection.
h2. request-que-backlog
The request-que-backlog attribute is maintained by Directory Server and indicates the number of requests that are waiting to be processed by a thread. This number should be zero, or nearly zero. If it is consistently non-zero, requests are being held in a queue, and LDAP clients will not see a response to a queued request until it has been processed. To correct this, load-balance LDAP clients, increase the number of threads available to Directory Server, or both.
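Both attributes can be pulled out of a monitoring entry once it has been retrieved as LDIF text. The sketch below only parses text. How the entry is fetched is left out, the function name {{parse_monitor_counters}} is hypothetical, and the sample entry, including its DN and values, is invented for illustration.

```python
def parse_monitor_counters(ldif_text,
                           names=("readWaiters", "request-que-backlog")):
    """Extract integer-valued attributes from LDIF text.

    Attribute names are compared case-insensitively and returned in
    lower case. Lines that are not "name: integer" are ignored.
    """
    wanted = {name.lower() for name in names}
    counters = {}
    for line in ldif_text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        key = name.strip().lower()
        value = value.strip()
        if key in wanted and value.isdigit():
            counters[key] = int(value)
    return counters


# Invented sample entry; real monitoring entries carry more attributes.
SAMPLE = """\
dn: cn=monitor
readWaiters: 0
request-que-backlog: 4
"""

counters = parse_monitor_counters(SAMPLE)
nonzero = sorted(key for key, value in counters.items() if value > 0)
print(counters, nonzero)
```

A single non-zero reading is not necessarily a problem; the guidance above concerns values that stay non-zero across successive samples.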
h2. Hit Ratio(s) in the Entry Cache(s)
h2. Database Cache Efficiency
h2. Response Time
h2. Log Analyzer Tool
h2. Replication
h3. Replication Delay
h3. Replication Conflict
h2. SNMP
h1. Links
* [Directory Server Monitoring|http://docs.sun.com/app/docs/doc/820-0382/fnyvg?a=view]
* [Directory Server Monitoring Attributes|http://docs.sun.com/app/docs/doc/820-0382/6nc4i7qts?a=view#indexterm-141]
* [Directory Proxy Server JVM|http://docs.sun.com/app/docs/doc/819-0995/6n3cq3b83?a=view]
h1. Contributors
{contributors-summary}