Overview
This blog post discusses the Tungsten Replicator and how latency is calculated when extracting events on a Primary, and when applying events to the Replicas. We also examine what latency is, and the two different types of latency that we track - how to interpret them and how they differ.
Sequence Numbers: The Heart of Replication
The Tungsten Replicator reads writes from the MySQL binary log and converts them into Transaction History Logs, or THL for short. During processing, each event or set of events written to the THL gets a sequence number, or seqno, which is a global transaction ID used to ensure consistent replication order on all Replicas. Events are applied on a Replica in sequence number order.
A single sequence number is a complete transaction that could represent one write or thousands of writes depending on how the original event was written to the Primary node.
You may view the contents of individual THL sequence numbers using the command: thl list -seqno {seqno_here}
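For example, to inspect the transaction behind seqno 78022 from the status output shown further below (the seqno value here is purely illustrative):
shell> thl list -seqno 78022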
The sequence numbers (seqno) on a Primary will increment when the Replicator has read the binary log, processed the entries and written the events to disk as THL.
The sequence numbers on a Replica will increment when the Replicator has processed and committed the events which are being applied to the target database server.
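A quick way to gauge how far a Replica is behind is to compare appliedLastSeqno on the Primary with the same value on the Replica. A minimal sketch, assuming hypothetical host names db1 (Primary) and db2 (Replica):
shell> trepctl -host db1 status | grep appliedLastSeqno
shell> trepctl -host db2 status | grep appliedLastSeqno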
Replication Latency: What is it anyway?
There are two different latencies tracked by the Tungsten Replicator: Applied (appliedLatency) and Relative (relativeLatency). You can see both by running the trepctl status command:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000211:0000000020094766;0
appliedLastSeqno : 78022
appliedLatency : 0.571
...
relativeLatency : 8.944
Finished status command...
Applied Latency
Applied latency indicates the time needed to complete the last Seqno, either the time to extract an event from the binary logs and create the THL from it on the Primary, or the time to apply the event to the target database server on a Replica.
Relative Latency
Relative latency indicates the time since the applied seqno last advanced, effectively how long ago the last operation was completed.
Comparison: Applied versus Relative
In an idle cluster with no write traffic, relative latency will climb, while the applied latency remains unchanged.
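As an illustration (the numbers below are invented for the example), two snapshots taken roughly a minute apart on an idle cluster might look like this:
shell> trepctl status | egrep 'appliedLatency|relativeLatency'
appliedLatency : 0.571
relativeLatency : 8.944
... about 60 seconds later, still with no writes ...
shell> trepctl status | egrep 'appliedLatency|relativeLatency'
appliedLatency : 0.571
relativeLatency : 68.944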
If Applied latency climbs in a busy system, it means:
- On a Primary: time to extract is taking longer - check disk performance
- On a Replica: time to apply is taking longer, usually an indication of a long-running event on the Replica - check target database indexing and disk performance
“Long-running transaction” in this context could be either a single statement that takes a long time to run or a transaction containing many smaller statements. It could also be a DDL/schema change that takes a long time to complete.
If Relative latency increases significantly in a busy system, it may be a sign that replication is stalled on either a Primary or a Replica. This is a good parameter to check in monitoring scripts.
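As a minimal monitoring sketch (the 60-second threshold and the plain echo alert are assumptions to adapt to your own environment), relativeLatency can be pulled straight out of trepctl status:
#!/bin/bash
# Alert if replication has not advanced for more than THRESHOLD seconds.
THRESHOLD=60
LATENCY=$(trepctl status | awk '/^relativeLatency/ {print $3}')
# Floating-point comparison via awk
if awk -v l="$LATENCY" -v t="$THRESHOLD" 'BEGIN {exit !(l > t)}'; then
    echo "WARNING: relativeLatency ${LATENCY}s exceeds ${THRESHOLD}s"
fi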
Here are the docs comparing the types of latency: https://docs.continuent.com/tungsten-clustering-7.1/connector-routing-latency.html#connector-routing-latency-comparison
Displaying Latency
The basic command to see latency is trepctl status:
shell> trepctl status
...
appliedLastSeqno : 406734
appliedLatency : 1.122
relativeLatency : 4.541
...
To monitor latency on a Replica, run the following command, which displays all stages of replication on the Replica with performance stats and refreshes every three seconds: trepctl perf -r 3.
If you look at the output, you can see the progress of the 3 different stages within the Replicator, for example:
shell> trepctl perf -r 3
Statistics since last put online 1095951.622s ago
Stage | Seqno | Latency | Events | Extraction | Filtering | Applying | Other | Total
remote-to-thl | 13998690459 | 0.726s | 217192525 | 1090162.006s | 19.635s | 6552.409s | 289.506s | 1097023.556s
Avg time per Event | 0.005s | 0.000s | 0.000s | 0.000s | 0.005s
thl-to-q | 13977405027 | 25186.709s | 195907346 | 1036875.129s | 20.078s | 59752.325s | 374.006s | 1097021.538s
Avg time per Event | 0.005s | 0.000s | 0.000s | 0.000s | 0.006s
q-to-dbms | 13977405028 | 25187.722s | 4040490 | 1064409.815s | 609.628s | 31527.774s | 474.324s | 1097021.541s
Avg time per Event | 0.263s | 0.000s | 0.000s | 0.008s | 0.272s
Filters in stage | mysqlsessions -> pkey
Key values to look at here are the seqno associated with the remote-to-thl stage and the q-to-dbms stage.
If we examine the output captured every 3 seconds, we can see that both of these values are increasing quite significantly.
The q-to-dbms value increases as the transactions are committed to the database, and the growth of the remote-to-thl is an indication of the volume of changes that are being made in the Primary.
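As a rough sketch (assuming the pipe-separated layout shown in the output above), the gap between the extracted and the applied seqno can be pulled out with a one-liner such as:
shell> trepctl perf | awk -F'|' '
  /^remote-to-thl/ {extracted = $2}
  /^q-to-dbms/     {applied = $2}
  END {printf "extracted: %d  applied: %d  gap: %d\n", extracted, applied, extracted - applied}'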
In one customer case, we saw that the difference between these two values in every 3-second sample was consistently in the region of 23 million, meaning the Replicas were roughly 23 million events behind; however, the gap was not increasing, which indicates the Replicator was applying properly.
In this case, the problem was that the gap was not really decreasing much either. The customer did not see the latency coming down, and in fact saw it going up, because of the way the calculations are done.
In short, this means that new changes were being made on the Primary faster than could be applied on the Replicas.
This comes down to the sheer volume of traffic on the cluster and the fact that the customer had only a single applier thread.
Parallel Apply
Typically, in this situation Continuent would suggest using Parallel Apply.
Latency is tracked in a special database schema called tungsten_{service_name}, in the table trep_commit_seqno. There is one row per channel, and so there is one channel/one row for standard replication and 2 or more rows when using Parallel Apply.
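As a hedged example (the service name alpha below is hypothetical; substitute your own), the tracking table can be queried directly on a Replica to see the per-channel rows:
shell> mysql -e "SELECT * FROM tungsten_alpha.trep_commit_seqno\G"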
By default, this only works with ROW-based replication.
We are currently testing a new feature for the next patch release that allows Parallel Apply to work with STATEMENT-based replication, provided a single transaction does not update multiple tables. This new feature will work best when the most active tables within the database are identified - these tables would then be assigned to their own Parallel Apply channels.
Bonus Tips
BONUS 1
To see just the critical information, use the egrep command to filter trepctl status:
shell> trepctl status | egrep 'relative|stat|appl|role'
Processing status command...
appliedLastEventId : mysql-bin.000103:0000000000000635;16
appliedLastGoodLatency : 1.122
appliedLastGoodSeqno : 406734
appliedLastSeqno : 406734
appliedLatency : 1.122
relativeLatency : 4.541
role : slave
state : ONLINE
Finished status command...
BONUS 2
To provide Replica latency information to Continuent support, run the following command and leave it running for approximately 1 minute, then hit Control+C to stop it. Upload the resulting status.log file to the case for us to review.
shell> trepctl perf -r 3 >> /tmp/status.log
Wrap-Up
This blog post discussed the Tungsten Replicator and how latency is calculated when extracting events on a Primary, and when applying events to the Replicas. We also examined what latency is, and the two different types of latency that we track - how to interpret them and how they differ.