The Question
"What is the meaning of the SHUNNED state field value in the cctrl> ls
output?
|db1(slave:SHUNNED(MANUALLY-SHUNNED), |
|progress=31893503241, latency=0.777) |
|STATUS [SHUNNED] [2023/01/13 04:30:02 AM EST] |
The Answer
From the documentation page:
A SHUNNED datasource implies that the datasource is OFFLINE. Unlike the OFFLINE state, a SHUNNED datasource is not automatically recovered.
A datasource in a SHUNNED state is not connected or actively part of the dataservice. Individual services can be reconfigured and restarted. The operating system and any other maintenance to be performed can be carried out while a host is in the SHUNNED state without affecting the other members of the dataservice.
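As a sketch, a typical maintenance pass using the shun and welcome commands covered later in this post might look like the following (the node name db2 is illustrative):

```
cctrl> datasource db2 shun
-- perform OS patching or other maintenance on db2 --
cctrl> datasource db2 welcome
cctrl> datasource db2 online
```

The other members of the dataservice continue serving traffic throughout.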
Datasources can be shunned manually or automatically. The current reason for the SHUNNED state is indicated in the cctrl> ls output.
The DataSource STATE Field
SHUNNED is one of several STATE values that may appear in the cctrl output. The purpose of the STATE field is to provide standard values for the current status of a cluster node (datasource).
Here are the possible values for the datasource STATE field:
| DS State | Meaning |
|---|---|
| ONLINE | Operating normally; accepts requests from the Connectors for both reads and writes. |
| OFFLINE | Does not accept requests from the Connectors for either reads or writes. An OFFLINE datasource is automatically recovered to ONLINE when the policy is AUTOMATIC. |
| SHUNNED | Does not accept requests from the Connectors for either reads or writes. Unlike the OFFLINE state, a SHUNNED datasource is NOT automatically recovered, even when the policy is AUTOMATIC. |
| FAILED | Not operating normally; does not accept any Connector traffic. Requires cctrl> recover to heal. |
Examples of SHUNNED DataSources
Below are various examples of datasources in the SHUNNED state, along with possible associated solutions, where applicable.
- SHUNNED(DRAIN-CONNECTIONS)
- SHUNNED(FAILSAFE_SHUN)
- SHUNNED(MANUALLY-SHUNNED)
- SHUNNED(CONFLICTS-WITH-COMPOSITE-MASTER)
- SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure)
- SHUNNED(SUBSERVICE-SWITCH-FAILED)
- SHUNNED(FAILED-OVER-TO-db2)
- SHUNNED(SET-RELAY)
- SHUNNED(FAILOVER-ABORTED AFTER UNABLE TO COMPLETE FAILOVER…)
- SHUNNED(CANNOT-SYNC-WITH-HOME-SITE)
SHUNNED(DRAIN-CONNECTIONS)
The DRAIN-CONNECTIONS state means that the `datasource [NODE|CLUSTER] drain [timeout]` command has been successfully completed and the node or cluster is now SHUNNED as requested.
The datasource drain command prevents new connections to the specified datasource while leaving ongoing connections untouched. If a timeout (in seconds) is given, ongoing connections are severed after the timeout expires. The command returns immediately, whether or not a timeout is given. Under the hood, it puts the datasource into the SHUNNED state, with lastShunReason set to DRAIN-CONNECTIONS. This feature is available as of version 7.0.2.
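For example, draining one node with a 30-second timeout might look like this (the node name is illustrative, borrowed from the sample output below):

```
cctrl> datasource db16-demo.continuent.com drain 30
```

Existing connections are given 30 seconds to finish before being severed; omit the timeout to leave them untouched indefinitely.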
+---------------------------------------------------------------------------------+
|emea(composite master:ONLINE, global progress=21269, max latency=8.997) |
|STATUS [OK] [2023/01/17 09:11:36 PM UTC] |
+---------------------------------------------------------------------------------+
| emea(master:ONLINE, progress=21, max latency=8.997) |
| emea_from_usa(relay:ONLINE, progress=21248, max latency=3.000) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|usa(composite master:SHUNNED(DRAIN-CONNECTIONS), global progress=21, max |
|latency=2.217) |
|STATUS [SHUNNED] [2023/01/19 08:05:02 PM UTC] |
+---------------------------------------------------------------------------------+
| usa(master:SHUNNED, progress=-1, max latency=-1.000) |
| usa_from_emea(relay:ONLINE, progress=21, max latency=2.217) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db16-demo.continuent.com(master:SHUNNED(DRAIN-CONNECTIONS), progress=-1, THL |
|latency=-1.000) |
|STATUS [SHUNNED] [2023/01/19 08:05:02 PM UTC][SSL] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=OFFLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db17-demo.continuent.com(slave:SHUNNED(DRAIN-CONNECTIONS), progress=-1, |
|latency=-1.000) |
|STATUS [SHUNNED] [2023/01/19 08:05:03 PM UTC][SSL] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db16-demo.continuent.com, state=OFFLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db18-demo.continuent.com(slave:SHUNNED(DRAIN-CONNECTIONS), progress=-1, |
|latency=-1.000) |
|STATUS [SHUNNED] [2023/01/19 08:05:02 PM UTC][SSL] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db16-demo.continuent.com, state=OFFLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
To recover, welcome the datasource back:
cctrl> use world
cctrl> datasource usa welcome
SHUNNED(FAILSAFE_SHUN)
The FAILSAFE_SHUN state means that there was a complete network partition, so that none of the nodes were able to communicate with each other. Database writes are blocked to prevent a split-brain from occurring. See my previous blog post for more information about split-brain scenarios.
+----------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILSAFE_SHUN), progress=56747909871, THL |
|latency=12.157) |
|STATUS [OK] [2021/09/25 01:09:04 PM CDT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=374639937, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(slave:SHUNNED(FAILSAFE_SHUN), progress=-1, latency=-1.000) |
|STATUS [OK] [2021/09/15 11:58:05 PM CDT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=OFFLINE) |
| DATASERVER(state=STOPPED) |
| CONNECTIONS(created=70697946, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db3(slave:SHUNNED(FAILSAFE_SHUN), progress=56747909871, latency=12.267) |
|STATUS [OK] [2021/09/25 01:09:21 PM CDT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=168416988, active=0) |
+----------------------------------------------------------------------------+
To recover:
cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover
SHUNNED(MANUALLY-SHUNNED)
The MANUALLY-SHUNNED state means that an administrator has issued the `datasource {NODE|CLUSTER} shun` command via cctrl or the REST API, resulting in the specified node or cluster being SHUNNED.
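For example, assuming the cluster form of the command takes the service name (per the {NODE|CLUSTER} syntax above; both names here are illustrative):

```
cctrl> datasource db1 shun
cctrl> datasource usa shun
```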
|db1(slave:SHUNNED(MANUALLY-SHUNNED), |
|progress=31893503241, latency=0.777) |
|STATUS [SHUNNED] [2023/01/13 04:30:02 AM EST] |
cctrl> datasource db1 welcome
+----------------------------------------------------------------------------+
|db1(master:SHUNNED(MANUALLY-SHUNNED), progress=15969982, THL |
|latency=0.531) |
|STATUS [SHUNNED] [2014/01/17 02:57:19 PM MST] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=4204, active=23) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure), |
|progress=15969982, latency=0.000) |
|STATUS [OK] [2014/01/09 11:14:32 PM MST] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db3(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure), |
|progress=15969982, latency=0.000) |
|STATUS [OK] [2014/01/09 11:19:46 PM MST] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
To recover:
cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover
SHUNNED(CONFLICTS-WITH-COMPOSITE-MASTER)
The CONFLICTS-WITH-COMPOSITE-MASTER state means that the cluster already has an active primary, so this datasource cannot be brought online as a primary.
+----------------------------------------------------------------------------+
|db1(master:SHUNNED(CONFLICTS-WITH-COMPOSITE-MASTER), |
|progress=25475128064, THL latency=0.010) |
|STATUS [SHUNNED] [2015/04/11 02:35:24 PM PDT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=2568, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe |
|procedure), progress=25475128064, latency=0.089) |
|STATUS [OK] [2015/03/19 09:50:23 PM PDT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=455, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db3(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe |
|procedure), progress=25475128064, latency=0.049) |
|STATUS [OK] [2015/03/19 09:50:14 PM PDT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=457, active=0) |
+----------------------------------------------------------------------------+
SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure)
The “FAILSAFE AFTER Shunned by fail-safe procedure” state means that the Manager voting quorum encountered an unrecoverable problem and shut down database writes to prevent a split-brain situation.
+----------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILSAFE AFTER Shunned by fail-safe |
|procedure), progress=96723577, THL latency=0.779) |
|STATUS [OK] [2014/03/22 01:12:35 AM EDT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=135, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe |
|procedure), progress=96723575, latency=0.788) |
|STATUS [OK] [2014/03/31 04:52:39 PM EDT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=28, active=0) |
+----------------------------------------------------------------------------+
...
+----------------------------------------------------------------------------+
|db5(slave:SHUNNED:ARCHIVE (FAILSAFE AFTER Shunned by |
|fail-safe procedure), progress=96723581, latency=0.905) |
|STATUS [OK] [2014/03/22 01:13:58 AM EDT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=23, active=0) |
+----------------------------------------------------------------------------+
To recover:
cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover
SHUNNED(SUBSERVICE-SWITCH-FAILED)
The SUBSERVICE-SWITCH-FAILED state means that the cluster tried to switch the Primary role to another node in response to an admin request, but was unable to do so due to a failure at the sub-service level in a Composite Active-Active (CAA) cluster.
+---------------------------------------------------------------------------------+
|db1(relay:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6668586, |
|latency=1.197) |
|STATUS [SHUNNED] [2021/01/14 10:20:33 AM UTC][SSL] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=relay, master=db4, state=ONLINE) |
| DATASERVER(state=ONLINE) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6668586, |
|latency=1.239) |
|STATUS [SHUNNED] [2021/01/14 10:20:39 AM UTC][SSL] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6668591, |
|latency=0.501) |
|STATUS [SHUNNED] [2021/01/14 10:20:36 AM UTC][SSL] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=pip-db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
+---------------------------------------------------------------------------------+
To recover, first switch to the affected sub-service:
cctrl> use {SUBSERVICE-NAME-HERE}
cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover
SHUNNED(FAILED-OVER-TO-db2)
The FAILED-OVER-TO-{nodename} state means that the cluster automatically and successfully invoked a failover, in this example from node db1 to node db2. The fact that there appear to be two masters is completely normal after a failover; it indicates that the cluster should be manually recovered once the node that failed has been fixed.
+----------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILED-OVER-TO-db2), progress=248579111, |
|THL latency=0.296) |
|STATUS [SHUNNED] [2016/01/23 02:15:16 AM CST] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=108494736, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(master:ONLINE, progress=248777065, THL latency=0.650) |
|STATUS [OK] [2016/01/23 02:15:24 AM CST] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=3859635, active=591) |
+----------------------------------------------------------------------------+
To recover:
cctrl> recover
SHUNNED(SET-RELAY)
The SET-RELAY state means that the cluster was in the middle of a switch that failed to complete, in either a Composite Active/Passive (CAP) cluster or a Composite Active/Active (CAA) sub-service.
+---------------------------------------------------------------------------------+
|db1(relay:SHUNNED(SET-RELAY), progress=-1, latency=-1.000) |
|STATUS [SHUNNED] [2022/08/05 08:13:03 AM PDT] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=relay, master=db4, state=SUSPECT) |
| DATASERVER(state=ONLINE) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=14932, |
|latency=0.000) |
|STATUS [SHUNNED] [2022/08/05 06:13:36 AM PDT] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=14932, |
|latency=0.000) |
|STATUS [SHUNNED] [2022/08/05 06:13:38 AM PDT] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
+---------------------------------------------------------------------------------+
To recover, first switch to the affected passive service:
cctrl> use {PASSIVE-SERVICE-NAME-HERE}
cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover
SHUNNED(FAILOVER-ABORTED AFTER UNABLE TO COMPLETE FAILOVER …)
The “FAILOVER-ABORTED AFTER UNABLE TO COMPLETE FAILOVER” state means that the cluster tried to automatically fail over the Primary role to another node, but was unable to do so.
+----------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILOVER-ABORTED AFTER UNABLE TO COMPLETE FAILOVER |
| FOR DATASOURCE 'db1'. CHECK COORDINATOR MANAGER LOG), |
| progress=21179013, THL latency=4.580) |
|STATUS [SHUNNED] [2020/04/10 01:40:17 PM CDT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=294474815, active=0) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(slave:ONLINE, progress=21179013, latency=67.535) |
|STATUS [OK] [2020/04/02 09:42:42 AM CDT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=22139851, active=1) |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db3(slave:ONLINE, progress=21179013, latency=69.099) |
|STATUS [OK] [2020/04/07 10:20:20 AM CDT] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db1, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=66651718, active=7) |
+----------------------------------------------------------------------------+
SHUNNED(CANNOT-SYNC-WITH-HOME-SITE)
The CANNOT-SYNC-WITH-HOME-SITE state is a composite-level state meaning that the sites were unable to see each other at some point in time. This scenario may require a manual recovery at the composite level for the cluster to heal.
From the usa side:
emea(composite master:SHUNNED(CANNOT-SYNC-WITH-HOME-SITE)
From the emea side:
usa(composite master:SHUNNED(CANNOT-SYNC-WITH-HOME-SITE)
To recover, at the composite service level:
cctrl compositeSvc> recover
Wrap-Up
In this post we examined the purpose of the STATE field in the cctrl> ls output, what the different values mean, and how to recover the cluster from some of the errors shown.
Smooth sailing!