In a previous post we went into detail about how to implement Tungsten-specific checks. In this post, we will focus on the other standard Nagios checks that would help to monitor the health of your cluster nodes.
Your database cluster contains your most business-critical data. The Replica nodes must be online, healthy and in sync with the Primary in order to be viable failover candidates.
This means keeping a close watch on the health of the database nodes using many perspectives, from ensuring sufficient disk space to testing that replication traffic is flowing.
A robust monitoring setup is essential for cluster health and viability. If a Replica’s replicator goes offline and you do not know about it, that Replica becomes effectively useless as a failover candidate because its data is stale.
Nagios Checks
The Power of Persistence
One of the best (and also the worst) things about Nagios is the built-in nagging: it just keeps screaming until you pay attention to it.
The Nagios server uses services.cfg, which defines a service that calls the check_nrpe binary with at least one argument: the name of the check to execute on the remote host.
Once on the remote host, the NRPE daemon processes the request from the Nagios server, comparing the check name in the request with the list of commands defined in the /etc/nagios/nrpe.cfg file. If a match is found, the command is executed by the nrpe user. If different privileges are needed, then sudo must be employed.
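The lookup step described above can be sketched in a few lines of shell. This is only an illustration of the matching logic, not how the NRPE daemon is actually implemented; nrpe_lookup is a hypothetical helper name, and it assumes each command sits on one line in the command[name]=command line form.

```shell
#!/bin/sh
# Hypothetical helper (not part of NRPE): find "command[<name>]=<command line>"
# in an nrpe.cfg-style file and print the command line that would be run.
nrpe_lookup() {
    name=$1
    cfg=$2
    # Compare each line against the literal prefix command[<name>]=
    # and print whatever follows it on the first matching line.
    awk -v prefix="command[$name]=" '
        index($0, prefix) == 1 {
            print substr($0, length(prefix) + 1)
            found = 1
            exit
        }
        END { exit (found ? 0 : 1) }
    ' "$cfg"
}

# Example:
#   nrpe_lookup check_root /etc/nagios/nrpe.cfg
```

The function exits non-zero when no matching command is defined, which mirrors the case where the daemon has no command by that name to run.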
Prerequisites
Before You Can Use These Examples
This is NOT a Nagios tutorial as such, although we present configuration examples for the Nagios framework. You will need to already have the following:
- Nagios server installed and fully functional
- NRPE installed and fully functional on each cluster node you wish to monitor
Please note that installing and configuring Nagios and NRPE in your environment is not covered in this article.
Teach the Targets
Tell NRPE on the Database Nodes What to Do
The NRPE commands are defined in the /etc/nagios/nrpe.cfg file on each monitored database node. We will discuss three NRPE plugins called by the defined commands: check_disk, check_mysql and check_mysql_query.
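First, let's ensure sufficient disk space using the check_disk plugin. Thresholds given without a % sign are read as free space in the plugin's default units (MB), so -w 20 -c 10 warns below 20 MB free and goes critical below 10 MB free; append % (for example, -w 20% -c 10%) if you want percentage thresholds. Define two custom commands, each calling check_disk to monitor a different disk partition: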
command[check_root]=/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 -p /
command[check_disk_data]=/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 -p /volumes/data
Next, let's validate that we are able to log in to MySQL directly, bypassing the Connector by using port 13306. We use the check_mysql plugin and define a custom command that is also called check_mysql:
command[check_mysql]=/usr/lib64/nagios/plugins/check_mysql -H localhost -u nagios -p secret -P 13306
If there is a Connector proxy running on that node, you may wish to run the same test to validate that login works through the Connector on port 3306. In this example a dedicated nagios user is in use, so if the Connector is in proxy mode, make sure the user.map file has the nagios user defined properly. The same check_mysql plugin is used, this time specifying the Connector port and defining a custom command called check_mysql_connector:
command[check_mysql_connector]=/usr/lib64/nagios/plugins/check_mysql -H localhost -u nagios -p secret -P 3306
Finally, you may run any MySQL query you wish to validate further, normally via the local MySQL port 13306 to ensure that the check is testing the local host:
command[check_mysql_query]=/usr/lib64/nagios/plugins/check_mysql_query -q 'select mydatacolumn from nagios.test_data' -H localhost -u nagios -p secret -P 13306
Here are some other example commands you may define that are not Tungsten-specific:
command[check_total_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200
command[check_users]=/usr/lib64/nagios/plugins/check_users -w 15 -c 25
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 5,4,3 -c 6,5,4
command[check_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z
Additionally, there is no harm in defining NRPE agent commands that may not be called by the upstream Nagios server. This allows for simple administration: keep the golden copy in one place, push updates to all nodes as needed, and then restart nrpe on each node.
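As a rough sketch of that golden-copy workflow, the helper below prints (or runs) the copy and restart commands for each node. The function name push_nrpe_cfg, the DRY_RUN switch, passwordless ssh/scp access, and a systemd unit named nrpe are all assumptions made for illustration; adjust them to match how NRPE is installed and managed on your nodes.

```shell
#!/bin/sh
# Hypothetical helper: push one golden nrpe.cfg to every node, then restart NRPE.
# Assumptions: ssh/scp access as a user allowed to write /etc/nagios/nrpe.cfg,
# and a systemd service named "nrpe" (the name differs on some distributions).
push_nrpe_cfg() {
    golden=$1
    shift
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only print what would be executed
            echo "scp $golden $host:/etc/nagios/nrpe.cfg"
            echo "ssh $host sudo systemctl restart nrpe"
        else
            # Restart only if the copy succeeded
            scp "$golden" "$host:/etc/nagios/nrpe.cfg" &&
                ssh "$host" sudo systemctl restart nrpe
        fi
    done
}

# Example (dry run):
#   push_nrpe_cfg ./nrpe.cfg db1 db2 db3
```

Dry-run is the default here so that a typo in the host list cannot restart anything by accident; set DRY_RUN=0 to execute for real.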
Big Brother Sees You
Tell the Nagios server to begin watching
Here are the service check definitions for the /opt/local/etc/nagios/objects/services.cfg file:
# Service definition
define service{
    service_description    Root partition - Tungsten Clustering
    servicegroups          myclusters
    host_name              db1,db2,db3,db4,db5,db6,db7,db8,db9
    check_command          check_nrpe!check_root
    contact_groups         admin
    use                    generic-service
}
# Service definition
define service{
    service_description    Data partition - Tungsten Clustering
    servicegroups          myclusters
    host_name              db1,db2,db3,db4,db5,db6,db7,db8,db9
    check_command          check_nrpe!check_disk_data
    contact_groups         admin
    use                    generic-service
}
# Service definition
define service{
    service_description    mysql local login - Tungsten Clustering
    servicegroups          myclusters
    host_name              db1,db2,db3,db4,db5,db6,db7,db8,db9
    contact_groups         admin
    check_command          check_nrpe!check_mysql
    use                    generic-service
}
# Service definition
define service{
    service_description    mysql login via connector - Tungsten Clustering
    servicegroups          myclusters
    host_name              db1,db2,db3,db4,db5,db6,db7,db8,db9
    contact_groups         admin
    check_command          check_nrpe!check_mysql_connector
    use                    generic-service
}
# Service definition
define service{
    service_description    mysql local query - Tungsten Clustering
    servicegroups          myclusters
    host_name              db1,db2,db3,db4,db5,db6,db7,db8,db9
    contact_groups         admin
    check_command          check_nrpe!check_mysql_query
    use                    generic-service
}
The hosts named in these service definitions must also be defined in the /opt/local/etc/nagios/objects/hosts.cfg file.
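Since the service blocks differ only in their description and NRPE command name, one option is to generate them. The following is a minimal sketch, assuming the same host list, service group, contact group and template as the definitions above; service_block and HOSTS are hypothetical names introduced here.

```shell
#!/bin/sh
# Hypothetical generator for the repetitive service definitions.
# Values mirror the definitions in this post; change them for your site.
HOSTS="db1,db2,db3,db4,db5,db6,db7,db8,db9"

service_block() {
    description=$1      # e.g. "Root partition"
    nrpe_command=$2     # e.g. check_root, as defined in nrpe.cfg
    cat <<EOF
# Service definition
define service{
    service_description    $description - Tungsten Clustering
    servicegroups          myclusters
    host_name              $HOSTS
    check_command          check_nrpe!$nrpe_command
    contact_groups         admin
    use                    generic-service
}
EOF
}

# Example: print two of the blocks to standard output for review
#   service_block "Root partition" check_root
#   service_block "Data partition" check_disk_data
```

Reviewing the generated text before adding it to services.cfg keeps the hand-edited file as the source of truth while avoiding copy-and-paste slips.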
Let's Get Practical
How to Test the Remote NRPE Calls From the Command Line
The best way to ensure things are working well is to divide and conquer. My favorite approach is to use the check_nrpe binary on the command line from the Nagios server to make sure that the calls to the remote monitored nodes succeed long before I configure the Nagios server daemon and start getting those evil text messages and emails.
To test a remote NRPE client command from a Nagios server via the command line, use the check_nrpe command:
shell> /opt/local/libexec/nagios/check_nrpe -H db1 -c check_disk_data
DISK OK - free space: /volumes/data 40234 MB (78% inode=99%);| /volumes/data=10955MB;51170;51180;0;51190
The above command calls the NRPE daemon running on host db1 and executes the NRPE command "check_disk_data" as defined in the db1:/etc/nagios/nrpe.cfg file.
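When testing by hand it also helps to look at the exit code, because Nagios derives the service state from it: by plugin convention 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. Here is a small sketch that makes the state explicit; status_name and run_check are hypothetical helper names, not part of Nagios or NRPE.

```shell
#!/bin/sh
# Map a plugin exit code to its state name (standard plugin convention).
# Codes other than 0, 1 and 2 are reported here as UNKNOWN.
status_name() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Hypothetical wrapper: run any check command, then print its state and output.
run_check() {
    output=$("$@")
    rc=$?
    printf '%s: %s\n' "$(status_name "$rc")" "$output"
    return "$rc"
}

# Example, using the path and host from this post:
#   run_check /opt/local/libexec/nagios/check_nrpe -H db1 -c check_disk_data
```

Because run_check returns the plugin's own exit code, it can be dropped into a loop over hosts and checks without hiding failures.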
The Wrap-Up
Put It All Together and Sleep Better Knowing Your Tungsten Cluster Is Under Constant Surveillance
Once your tests are working and your Nagios server config files have been updated, just restart the Nagios server daemon and you are on your way!
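Before restarting, it is worth letting Nagios verify the configuration first; the nagios binary accepts -v followed by the main configuration file for that purpose. The sketch below only restarts when verification passes. The binary and nagios.cfg locations are assumptions inferred from the /opt/local paths used in this post, and the restart command is whatever your platform uses, passed in as arguments.

```shell
#!/bin/sh
# Assumed locations, inferred from the /opt/local paths above; adjust as needed.
NAGIOS_BIN=${NAGIOS_BIN:-/opt/local/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/opt/local/etc/nagios/nagios.cfg}

# Hypothetical helper: verify the configuration, then run the restart
# command given as arguments only if verification succeeded.
verify_and_restart() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        "$@"
    else
        echo "Config check failed; Nagios not restarted" >&2
        return 1
    fi
}

# Example (the restart command is an assumption; use your platform's own):
#   verify_and_restart sudo systemctl restart nagios
```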
Tuning the values in the nrpe.cfg file may be required for optimal performance. As always, YMMV.
To learn about Continuent solutions in general, check out our Products & Solutions.
For more information about monitoring Tungsten clusters, please visit our documentation.
Tungsten Clustering is the most flexible, performant global database layer available today — use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!
Want to learn more or run a POC? Contact us.