Introduction
Tungsten Clustering depends on a number of prerequisites and best practices to function optimally.
In this blog post, we explore a critical yet easily overlooked step when installing a Tungsten Cluster node: setting up start at boot, ideally under `systemd` control.
For a Tungsten Cluster to function properly, make sure that start-at-boot / stop-at-shutdown has been configured using `deployall`.
Tungsten Clustering relies on a voting quorum, so a node that is not configured to start at boot can seriously impair the cluster. If the Managers cannot form a majority of the quorum, then even failover is at risk. As an example, imagine a three-node cluster where NO start-at-boot support has been deployed. If the first node reboots, no Tungsten services will run on it after the reboot. If the second node then restarts, the cluster ends up in a shunned state, because the third node is no longer part of a quorum majority and will shun itself. If start-at-boot support is in place, at least 2 Managers will always be up and running, and failover can happen cleanly.
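The quorum arithmetic behind that example can be sketched in a few lines of shell. This is an illustration only, not Tungsten's actual voting code: the function name `has_majority` is hypothetical, and it assumes a simple strict-majority rule over the total number of Managers.

```shell
# has_majority TOTAL UP
# Prints "yes" when UP managers form a strict majority of TOTAL, else "no".
# Hypothetical helper for illustration; not part of Tungsten.
has_majority() {
  total=$1
  up=$2
  # Strict majority: more than half of all configured managers.
  if [ $((up * 2)) -gt "$total" ]; then
    echo yes
  else
    echo no
  fi
}

# Three-node cluster: one manager down keeps quorum, two down does not.
has_majority 3 2   # prints "yes"
has_majority 3 1   # prints "no"
```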
The Question
Recently a customer asked us:
“What caused the failover to hang for a long time after a GCP virtual power-off was invoked?”
Plug The Hole: Root Cause
Tungsten processes (specifically the Tungsten Manager) were NOT under `systemd` control.
Tell Me More
This is a corner case where the coordinator is the primary node, and the node is shut down.
When the Coordinator and the Primary are the same node and Tungsten is NOT stopped by `systemd` during the power-off sequence, the MySQL Server is stopped while the Tungsten Manager remains running. The Manager then invokes a failover before the power-down completes. When the power is finally cut, the failover never completes, because that node was the active Coordinator and it is now dead.
The Fine Print
There is a difference between a graceful power-down signal and an instant power-off/dirty fail.
Tungsten Cluster WILL fail over in the event of a Primary instant power fail even if it was the COORDINATOR because:
- the Manager acting as Coordinator would not have time to take any action, due to the instant power-off
- the other two Managers on the remaining nodes would notice the missing Coordinator and elect a replacement.
When a GCP virtual power-off is invoked, the Linux systemd power-down sequence gracefully shuts down processes in the reverse of the order in which they were started.
As a result, we would expect the Tungsten processes to be stopped BEFORE the MySQL Server process when under systemd control.
What happened to cause the long delay was that the Tungsten processes were NOT under systemd control, so they were NOT STOPPED as part of the systemd graceful power-down process.
This allowed the Manager, acting as Coordinator, to begin a failover that never got to complete, because the power-off stopped it midway.
The remaining Managers then had to wait out a lengthy timeout, because the Coordinator simply vanished when the power went down.
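Whether a unit will be stopped before MySQL during a graceful shutdown can be checked from its ordering dependencies, since systemd stops units in the reverse of their start order. The sketch below is an illustration under assumptions: `ordered_after` is a hypothetical helper, and the MySQL unit name (`mysqld.service` here) varies by distribution. It does not claim that the Tungsten units declare this ordering, so verify on your own system.

```shell
# ordered_after "AFTER_LIST" UNIT
# AFTER_LIST is the space-separated value of a unit's After= property.
# Succeeds (exit 0) when UNIT appears in that list, meaning the inspected
# unit starts after UNIT and is therefore stopped before it at shutdown.
# Hypothetical helper for illustration.
ordered_after() {
  for dep in $1; do
    if [ "$dep" = "$2" ]; then
      return 0
    fi
  done
  return 1
}

# Usage on a live node (unit names are assumptions; adjust to your system):
#   ordered_after "$(systemctl show -p After --value tmanager)" mysqld.service \
#     && echo "tmanager is ordered after mysqld" \
#     || echo "no explicit ordering against mysqld"
```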
Plug The Hole: Solutions
The solution is to make the Tungsten Cluster start at boot and stop at shutdown using `systemd` or `init` via the `deployall` tool.
The `deployall` script will automatically detect the initialization system in use (`systemd` or `init`) and prefer `systemd` when both are available.
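As a rough illustration of what such detection can look like, the sketch below checks for the systemd runtime directory. This is not the logic `deployall` itself uses; it relies on the common convention that `/run/systemd/system` exists only when systemd is the running init system. The function name `detect_init` and its optional directory argument are hypothetical.

```shell
# detect_init [RUNTIME_DIR]
# Prints "systemd" when the systemd runtime directory exists, else "init".
# RUNTIME_DIR defaults to /run/systemd/system and is overridable for testing.
# Hypothetical helper; not taken from deployall.
detect_init() {
  dir=${1:-/run/systemd/system}
  if [ -d "$dir" ]; then
    echo systemd
  else
    echo init
  fi
}

detect_init
```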
By default, the `deployall` script must be run manually to enable start-at-boot/stop-at-shutdown.
To automatically execute the `deployall` script at installation time, add the `install=true` tpm option to your configuration.
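To confirm that the option is actually present before running an install, a check along these lines may help. It is a sketch under assumptions: `has_install_true` is a hypothetical helper, the path `/etc/tungsten/tungsten.ini` in the usage comment is assumed and should be replaced with the location of your own INI file, and the match is a plain text match that does not parse INI sections.

```shell
# has_install_true FILE
# Succeeds when FILE contains an uncommented "install=true" line
# (optional whitespace around the key and the equals sign is allowed).
# Hypothetical helper for illustration.
has_install_true() {
  grep -Eq '^[[:space:]]*install[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$1"
}

# Usage (the path is an assumption; point it at your own tungsten.ini):
#   has_install_true /etc/tungsten/tungsten.ini \
#     && echo "install=true is set" \
#     || echo "install=true is missing; deployall must be run manually"
```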
The online documentation for `deployall` may be found here:
https://docs.continuent.com/tungsten-clustering-7.0/cmdline-tools-deployall.html
Java Environment
Since systemd will start services using sudo, Java needs to be accessible to the root user. Make sure that the `java` command resolves correctly under sudo.
If you downloaded and extracted a Java tarball somewhere, then you will need the following `update-alternatives --install` command to register the location. For example, if you extracted the tarball under the directory `/opt/jre1.8.0_312/`, then your command might look something like this:
shell> sudo update-alternatives --install /usr/bin/java java /opt/jre1.8.0_312/bin/java 20
Next, confirm that there is a selected java using `update-alternatives --config` like this:
shell> sudo update-alternatives --config java
There is 1 program that provides 'java'.
  Selection    Command
-----------------------------------------------
*+ 1           /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
Enter to keep the current selection[+], or type selection number:
Lastly, confirm the user environment is healthy for both root and the tungsten OS user:
tungsten@db7-demo:/home/tungsten # sudo which java
/usr/bin/java
tungsten@db7-demo:/home/tungsten # sudo java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
tungsten@db7-demo:/home/tungsten # which java
/usr/bin/java
tungsten@db7-demo:/home/tungsten # java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
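The same comparison can be scripted. In this sketch, `compare_java_paths` is a hypothetical helper that only compares two path strings; obtaining the paths is left to the caller, as in the commented usage line.

```shell
# compare_java_paths USER_PATH ROOT_PATH
# Prints "match" when both paths are non-empty and identical, else "mismatch".
# Hypothetical helper for illustration.
compare_java_paths() {
  if [ -n "$1" ] && [ "$1" = "$2" ]; then
    echo match
  else
    echo mismatch
  fi
}

# Usage, run as the tungsten OS user:
#   compare_java_paths "$(command -v java)" "$(sudo which java)"
```

A "mismatch" result, or an empty path on either side, suggests that root and the tungsten user do not resolve the same java binary.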
Cluster Start At Boot
When installing a new cluster, setting the tpm option `install=true` in tungsten.ini will automatically install the services and start them via systemd or init.
When updating a running cluster, the following steps are needed to properly install the services, depending on the method in use:
Using init
When using the older `init` method of configuring start-at-boot/stop-at-shutdown, there is just a single command to run:
shell> deployall
Using systemd
When using the modern `systemd` method of configuring start-at-boot/stop-at-shutdown, there are potentially multiple steps to run, especially if the cluster is already up and running.
For continuity-of-service reasons, the `deployall` script will NOT restart individual components that were already started by other methods.
For example:
shell> cctrl
cctrl> set policy maintenance
cctrl> exit
shell> deployall
shell> /opt/continuent/tungsten/tungsten-replicator/bin/replicator stop sysd
shell> sudo systemctl start treplicator
shell> /opt/continuent/tungsten/tungsten-manager/bin/manager stop sysd
shell> sudo systemctl start tmanager
shell> /opt/continuent/tungsten/tungsten-connector/bin/connector stop sysd
shell> sudo systemctl start tconnector
shell> cctrl
cctrl> set policy automatic
cctrl> exit
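Once the components have been handed over, it can be worth confirming that each unit is both enabled at boot and currently running under systemd. The following is a minimal sketch, assuming the unit names `treplicator`, `tmanager`, and `tconnector` used in the commands above. The function `check_tungsten_units` and the `SYSTEMCTL` override variable are hypothetical names introduced here, and a node that does not run all three components will need the unit list trimmed.

```shell
# Command used to query systemd; overridable so the check can be stubbed.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# check_tungsten_units
# Prints one status line per unit and returns non-zero if any unit is
# not enabled at boot or not currently active.
# Hypothetical helper; the unit list is an assumption based on this post.
check_tungsten_units() {
  rc=0
  for unit in treplicator tmanager tconnector; do
    enabled=$("$SYSTEMCTL" is-enabled "$unit" 2>/dev/null) || true
    active=$("$SYSTEMCTL" is-active "$unit" 2>/dev/null) || true
    echo "$unit: enabled=${enabled:-unknown} active=${active:-unknown}"
    if [ "$enabled" != "enabled" ] || [ "$active" != "active" ]; then
      rc=1
    fi
  done
  return "$rc"
}

# Usage on a cluster node:
#   check_tungsten_units || echo "at least one Tungsten unit needs attention"
```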
Removing Cluster Start At Boot
To remove the boot scripts from the system, use the `undeployall` command:
shell> undeployall
Wrap-Up
In this post we explored a critical yet easily overlooked step when installing a Tungsten Cluster node: setting up start at boot and stop at shutdown, under either `init` or `systemd` control.
For a Tungsten Cluster to function properly, make sure that start-at-boot / stop-at-shutdown has been configured using `deployall`.
Smooth sailing!