Poll Watch Monitoring

Name (Title)
Cloud Monitoring Using Poll Watch
Description Simple monitoring based on cloud provider "watch" service
Alias application monitoring, monitor by polling external service
Intent This pattern builds on a cloud "watch" service that collects measurements on a running virtual machine instance separately from the instance itself.
It allows an actor to invoke observation services that run below the guest OS in the virtualized environment.
There are two building blocks in this pattern:
  1. Watcher - Removes the need to instrument this observation point yourself. The monitor is either on or off. Turning it on starts data collection by a remote service that gathers all available performance data at a collection point.
  2. Poller - A script or agent polls the Watcher for specific samples at intervals that are appropriate for service level and cost control targets.

Motivation This pattern is needed for monitoring of application health in cloud services where no application monitoring is available.
The advantages are:
  • there is no resource overhead on the running instance,
  • it's very simple to implement - there is no deciding which measures to collect, because it collects them all,
  • it's easy to do aggregations across different dimensions, such as "all instances running this specific machine image" or "all instances running in availability zone A",
  • there are no ongoing maintenance implications from the node perspective, because no agent is exposed to the user.
Applicability This is essentially what Amazon's CloudWatch service does. Since this pattern does not provide real time event detection, it may not be suitable for applications with high availability requirements or with single points of failure in the architecture.
Structure (Click on sequence diagram at right)
The first sequence (top) just initiates the data collection: call the CloudWatch interface and say "monitor this instanceId".
Then, once data is being collected, you poll the service (bottom sequence). Each poll asks for stats on a specific measure (e.g., average and maximum CPU utilization), over a specific time interval, across a range of running instances defined by the Dimension. (Dimension is Amazon's term for an aggregation; it can be an image type, an availability zone, or just a specific instance.) The results are returned from the collection point, not from the instance running your application, so this is an asynchronous kind of monitoring.
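The two sequences could be sketched in Python roughly as follows. This is a minimal sketch, not the author's implementation. It assumes client objects shaped like the boto3 EC2 and CloudWatch clients (their monitor_instances and get_metric_statistics calls), and the helper names start_watching and poll_stats are hypothetical. Summarizing by averaging the per-period averages is a simplification chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def start_watching(ec2_client, instance_id):
    """Top sequence: ask the provider to start collecting data for one instance."""
    # Assumption: ec2_client behaves like a boto3 EC2 client.
    return ec2_client.monitor_instances(InstanceIds=[instance_id])


def poll_stats(cw_client, metric, dimension, value,
               minutes=60, period=300, now=None):
    """Bottom sequence: ask the collection point for stats on one measure."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    # Assumption: cw_client behaves like a boto3 CloudWatch client.
    response = cw_client.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": dimension, "Value": value}],
        StartTime=start,
        EndTime=end,
        Period=period,  # seconds per returned datapoint
        Statistics=["Average", "Maximum"],
    )
    points = response.get("Datapoints", [])
    if not points:
        return None  # nothing collected yet for this interval
    return {
        # Mean of the per-period averages: a simplification for this sketch.
        "average": sum(p["Average"] for p in points) / len(points),
        "maximum": max(p["Maximum"] for p in points),
    }
```

Because the clients are passed in, the same functions can be exercised against a stub before pointing them at a real account.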

Participants
  • Watcher Service - e.g., AWS CloudWatch
  • Poller - program or scripts that initiate requests to the Watcher Service
  • Rules - polling cycles, sample interval, and thresholds for actions specific to the different monitoring needs, e.g., SLA Management, Cost Control
  • Logs - history of polling results and triggered actions
  • Console - human readable view of logs and relevant summary. Ideally, the console would also provide dials for adjusting rules, initiating actions, and drill down views.
  • Cloud interface - command line and API interface for accessing Watcher Service
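As an illustration of how the Rules and Logs participants might fit together, here is a minimal sketch. The Rule fields, the evaluate and record helpers, and the result strings are all hypothetical names chosen for this example. The sample values (a five minute cycle over one hour, a two hour cycle over eight hours, and a 50% to 85% range) are the illustrative figures used under Known uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One monitoring rule: what to poll, how often, and the target range."""
    name: str
    metric: str
    polling_cycle_s: int    # how often the Poller asks the Watcher
    sample_interval_s: int  # how much history each poll covers
    low: float
    high: float


def evaluate(rule, value):
    """Compare one polled value with the rule's target range."""
    if value > rule.high:
        return "above_range"
    if value < rule.low:
        return "below_range"
    return "in_range"


def record(log, rule, value):
    """Append the polling result to the log so a console can show it later."""
    entry = {"rule": rule.name, "metric": rule.metric,
             "value": value, "result": evaluate(rule, value)}
    log.append(entry)
    return entry


# Hypothetical rules for the two monitoring needs named above.
sla_rule = Rule("sla", "CPUUtilization", 5 * 60, 60 * 60, 50.0, 85.0)
cost_rule = Rule("cost", "CPUUtilization", 2 * 3600, 8 * 3600, 50.0, 85.0)
```

Keeping rules as plain data is one way to let a console adjust them without touching the polling code.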
Collaboration If you combine this with some synthetic transactions, you should be able to get enough monitoring for a basic web application, without having to go the agent based route. But your cloud has to provide this kind of service.
Since monitoring can be used to trigger other management activity, this pattern potentially collaborates with provisioning and resource management patterns.
Consequences
  • In CloudWatch, you're paying 1.5¢/instance/hour (which includes the transmission of data to the remote CloudWatch service).
  • You don't have the rich set of data that you might get from other agent based systems - instead you get the basics - CPU utilization, disk iops and bytes in and out, and network ops and bytes in and out. No correlation of metrics is built in.
  • Since this pattern depends on a Watcher Service from the cloud provider it increases your exposure to risk of lock in.
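To put the per-hour charge in perspective, the arithmetic can be worked through in a few lines. This is only an illustration using the 1.5¢ rate quoted above, which may not match current pricing. The function name and the fleet size of 10 instances are assumptions made for the example.

```python
# Rate quoted in this pattern; an assumption for the example, not current pricing.
RATE_CENTS_PER_INSTANCE_HOUR = 1.5


def watch_cost_dollars(instances, hours):
    """Cost of watching a fleet, in dollars, at the quoted rate."""
    return instances * hours * RATE_CENTS_PER_INSTANCE_HOUR / 100


# Hypothetical fleet: 10 instances watched for a 30-day month.
print(watch_cost_dollars(10, 30 * 24))  # prints 108.0
```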
Implementation AWS CloudWatch appears to be the only commercially available service that enables this pattern.
Known uses (Click on flow diagram at right)
  1. Service Level Compliance One important use case for this pattern is to drive the auto scaling capability of the cloud. Start by polling for latency data (say once every five minutes), asking the collection point for the average and maximum latency on load balancer A over a sample time interval (say one hour). Log the maximum latency metric for trend analysis, and compare the average latency metric with your service level targets. If latency exceeds those targets, poll for utilization on the running instances to see whether CPU load is causing the service level problem. Get the CPU stats from the collection point for the aggregation of instances involved in the service, then compare them with your target utilization range (say between 50% and 85%). If the utilization range is exceeded, invoke a provisioning sequence similar to that described in the Application Static Image Provisioning pattern.
  2. Cost Control The converse of the provisioning decision in #1 above is to deprovision when utilization is below your target range. That decision would likely not be triggered by the service level compliance use case, because it starts by checking application latency (as shown in the flow diagram). A more practical use case for cost control is to poll for CPU utilization directly, typically on a less frequent cycle (say every two hours) over a longer sample interval (say eight hours), and then trigger provisioning and deprovisioning sequences accordingly.
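The two decision flows above might be sketched as follows. This is a hedged sketch under stated assumptions: poll is a hypothetical callable that takes a metric name and returns a dict with "average" and "maximum" keys, the metric names and returned action strings are made up for the example, and the "investigate" outcome (latency is high but CPU is not the cause) is an assumption the pattern does not spell out.

```python
def service_level_check(poll, latency_target_s, cpu_high=85.0):
    """Use 1: check latency first, and look at CPU only if latency misses target."""
    latency = poll("Latency")
    trend_log = [("max_latency", latency["maximum"])]  # kept for trend analysis
    if latency["average"] <= latency_target_s:
        return "ok", trend_log
    cpu = poll("CPUUtilization")
    trend_log.append(("avg_cpu", cpu["average"]))
    if cpu["average"] > cpu_high:
        return "provision", trend_log
    # Assumption: the pattern does not say what to do in this case.
    return "investigate", trend_log


def cost_control_check(poll, cpu_low=50.0, cpu_high=85.0):
    """Use 2: poll CPU directly and scale in either direction."""
    cpu = poll("CPUUtilization")
    if cpu["average"] > cpu_high:
        return "provision"
    if cpu["average"] < cpu_low:
        return "deprovision"
    return "hold"
```

Note that service_level_check never returns "deprovision", which mirrors the point made in use 2: a flow that starts from latency has no natural place to notice under-utilization.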

Related patterns
Author Scott Mattoon <scott.mattoon at sun dot com>
Reviewer  

Labels cloud_pattern, developer, provisioning, scalability, availability, cloud, monitoring, pattern, cloudcomputing, architecture



Copyright 1994-2009 Sun Microsystems, Inc.