Agenda
What's Here?
- Summary - Briefly describe the bundled cluster monitoring tools and related documentation pages
- Explore the thinking behind cluster monitoring
- Describe the use-cases for key monitoring tools included with the Continuent Tungsten Clustering software
- Examine the best practices for using each tool along with examples
Summary
The Short version
All businesses strive for maximum uptime, and monitoring is key to uptime - if you don’t know that something is broken, you won’t know to fix it!
This blog post shows you the thinking behind each included Tungsten Cluster monitoring tool, and when to use which tool.
Continuent provides multiple methods out of the box to monitor the cluster health.
The most popular is the suite of Nagios/NRPE scripts (cluster-home/bin/check_tungsten_*). We also have Zabbix scripts (cluster-home/bin/zabbix_tungsten_*). Additionally, there are standalone scripts available like tungsten_monitor and tungsten_health_check, based upon the shared Ruby-based tpm libraries. We also include a very old shell script called check_tungsten.sh, but it is obsolete.
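To give a sense of how the Nagios/NRPE suite plugs in, here is a minimal nrpe.cfg sketch that registers three of the checks as NRPE commands. The paths assume a default /opt/continuent installation; adjust them for your environment:

```
# /etc/nagios/nrpe.cfg sketch -- paths assume a default /opt/continuent install
command[check_tungsten_services]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_services
command[check_tungsten_online]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_online
command[check_tungsten_policy]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_policy
```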
Resources To Guide You
We have Nagios-specific documentation to assist with configuration.
In addition to this post, we have some other very descriptive blog posts about how to implement the Nagios-based cluster monitoring solutions:
- How to Integrate Tungsten Clustering Monitoring Tools with PagerDuty Alerts
- Continuent Blog: Global Multi-Primary MySQL Cluster Monitoring Using Nagios and NRPE
- Essential MySQL Cluster Monitoring Using Nagios and NRPE
The Thinking Behind Monitoring
Pay Attention To The Man Behind The Curtain
Why monitor?
- More Uptime - if you do not know it is broken, you cannot fix it
- Less Downtime - downtime costs money in terms of lost revenue, lost reputation, and lost time
- Better Reliability - the more you can watch, the faster you can react to problems, and potentially make improvements to prevent them from happening again
- Trending - be able to notice changes or trends in your system to predict issues
What things should I watch in my cluster?
- Manager
- Replicator
- Connector
- Database
- OS
- Hardware
- Network
What should I look for?
- Errors
- Delays
- Lack of operation
- Wrong states
- Unusual activity
- Threshold Exceeded (too high or low as compared to desired norm)
Exploring the Bundled Cluster Monitoring Tools
What tools are provided to monitor the cluster?
There are five available Nagios/NRPE-based check scripts, each described below, with full details in the online documentation:
- check_tungsten_services - verify that the specified services are running, i.e. via the `ps` command
- check_tungsten_online - verify that all services are in the ONLINE state, either for a single node (`-n`) or for all nodes (default); you may specify a service name using `-s` in case you have more than one
- check_tungsten_policy - verify that the dataservice policy for the cluster is AUTOMATIC
- check_tungsten_progress - verify that the Replicator sequence number is increasing within a specific time period, which you may specify using `-t` (default: 1 second)
- check_tungsten_latency - verify that the current replication latency is below the specified Warning (`-w`) and Critical (`-c`) levels in seconds
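As Nagios-style checks, these scripts communicate their result through the standard plugin exit codes, so you can also run them by hand when troubleshooting. A minimal sketch, assuming the default /opt/continuent installation path:

```
# Run a check manually on a cluster node and inspect the Nagios exit code
$ /opt/continuent/tungsten/cluster-home/bin/check_tungsten_online
$ echo $?   # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
```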
Some tools are designed to help without Nagios:
- tungsten_monitor - provides a mechanism for monitoring the cluster state when monitoring tools like Nagios aren't available. For example, here is a crontab entry to run the check once per hour:
10 * * * * /opt/continuent/tungsten/cluster-home/bin/tungsten_monitor --from=you@yourcompany.com --email=group@yourcompany.com >/dev/null 2>&1
- tungsten_health_check - checks the cluster against known best practices, typically used on a periodic basis manually to verify the cluster, often during a health check call with Continuent
Two of the tools are designed to be run all the time and alert every time they find an issue:
- check_tungsten_services - if the Java processes are not running, neither is the cluster node!
- check_tungsten_online - if the services are not in the ONLINE state, something is not as it should be and requires investigation
One of the tools is designed to be run all the time but alert only outside of planned maintenance:
- check_tungsten_policy - ensure the policy is AUTOMATIC, because the cluster cannot react to an outage otherwise (i.e. when in MAINTENANCE mode)
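If you drive this check from cron rather than from Nagios (where you would normally schedule downtime instead), a small wrapper can silence it during planned work. A minimal sketch, where /etc/tungsten/maintenance is a hypothetical flag file that your maintenance runbook would create before planned work and remove afterwards:

```bash
#!/bin/bash
# check_policy_wrapper.sh -- suppress policy alerts during planned maintenance.
# /etc/tungsten/maintenance is a hypothetical flag file; touch it before
# maintenance begins and remove it when maintenance is complete.
CHECK=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_policy

if [ -f /etc/tungsten/maintenance ]; then
    echo "OK: planned maintenance window, policy check suppressed"
    exit 0   # report OK so the monitoring system stays quiet
fi

exec "$CHECK"   # otherwise run the real check, passing its exit code through
```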
Two of the tools are designed to be tuned to match your environment:
- check_tungsten_progress - tune the time period using `-t` (default: 1 second). Perhaps your cluster does not have many updates, and so this check would signal an error condition when none existed. For example, wait for five seconds for a write to occur: `check_tungsten_progress -t 5`
- check_tungsten_latency - tune the specified Warning (`-w`) and Critical (`-c`) levels in seconds. In this case, both the warning and critical values are required. A well-conditioned cluster should show replication latencies under one second, so for a properly-running cluster, specify values that would indicate a real issue to limit false positives: `check_tungsten_latency -w 2 -c 4`
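To put the tuned values into service, you would typically bake them into the NRPE command definitions and call them from the Nagios server. A sketch, assuming the nrpe.cfg shown earlier, a standard check_nrpe command definition on the Nagios side, and a host named db1 (the host name and template are illustrative):

```
# nrpe.cfg sketch: register the tuned checks (path assumes /opt/continuent)
command[check_tungsten_progress]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress -t 5
command[check_tungsten_latency]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_latency -w 2 -c 4

# Nagios server-side sketch: one service per check, called via check_nrpe
define service {
    use                   generic-service
    host_name             db1
    service_description   Tungsten Latency
    check_command         check_nrpe!check_tungsten_latency
}
```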
Component | Test | Tool | Built-In | Order | Tunable
---|---|---|---|---|---
Manager | Running? | check_tungsten_services | Yes | 1 | No
Manager | Online? | check_tungsten_online | Yes | 2 | No
Manager | Policy automatic? | check_tungsten_policy | Yes | 5 | No
Replicator | Running? | check_tungsten_services | Yes | 1 | No
Replicator | Online? | check_tungsten_online | Yes | 2 | No
Replicator | Latency too high? | check_tungsten_latency | Yes | 4 | Yes
Replicator | Progressing? | check_tungsten_progress | Yes | 3 | Yes
Connector | Running? | check_tungsten_services | Yes | 1 | No
Connector | Listening/reachable? | check_mysql or the client application | No | Last | -
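For the Connector row, the stock check_mysql plugin (or any client that can open a connection) works well, since the Connector answers the MySQL protocol. A sketch, assuming the Connector listens on the default MySQL port and a dedicated monitoring user exists; the host name and credentials are illustrative:

```
# Probe the Connector port with the standard Monitoring Plugins check
check_mysql -H db1 -P 3306 -u monitor -p 'secret'
```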
What Other Things Should I Watch?
Component | Test
---|---
Database | Running?
Database | Listening/reachable?
Database | Errors?
Database | Resources?
OS | Running?
OS:CPU | Utilization too high?
OS:Memory | Enough free RAM?
OS:Network | I/O bandwidth free?
OS:Network | Errors?
OS:Network | Packet latency low enough?
OS:Disk | Enough free space?
OS:Disk | I/O bandwidth free?
OS | Other resources?
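Most of these rows map onto stock Monitoring Plugins checks that you likely already run elsewhere. A few illustrative examples; the thresholds, path, and host name are placeholders, not recommendations:

```
# Standard Nagios plugin examples for the OS-level rows above
check_disk -w 20% -c 10% -p /opt/continuent     # enough free disk space?
check_load -w 5,4,3 -c 10,8,6                   # CPU load averages (1/5/15 min)
check_ping -H db1 -w 100,20% -c 500,60%         # packet latency and loss
```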
The Wrap-Up
Continuent provides multiple methods out of the box to monitor the cluster health.
We have described the built-in tools that allow you to monitor cluster operations, and how to tune those tools to minimize false positives.
Our documentation is extensive; please use the many links provided to explore the depths of the utilities.
If you have questions or concerns, or need a hand implementing any of this in your environment, please reach out to Continuent Support and we will be happy to help!
Lastly, in our next post, we will cover the new Prometheus exporters included in version 7, due out later this year.