Overview
This blog post discusses the Tungsten Replicator and how latency is calculated when extracting events on a Primary, and when applying events to the Replicas. We also examine what latency is, and the two different types of latency that we track - how to interpret them and how they differ.
Sequence Numbers: The Heart of Replication
The Tungsten Replicator reads writes from the MySQL binary log and converts them into Transaction History Logs, or THL for short. During processing, each event or set of events written to the THL gets a sequence number, or seqno, which is a global transaction ID used to ensure consistent replication order on all Replicas. Events are applied on a Replica in sequence number order.
A single sequence number is a complete transaction that could represent one write or thousands of writes depending on how the original event was written to the Primary node.
You may view the contents of individual THL sequence numbers using the command: thl list -seqno {seqno_here}
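For example, to inspect the transaction behind seqno 78022 from the status output shown further below (the seqno value here is purely illustrative):
shell> thl list -seqno 78022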
The sequence numbers (seqno) on a Primary will increment when the Replicator has read the binary log, processed the entries and written the events to disk as THL.
The sequence numbers on a Replica will increment when the Replicator has processed and committed the events which are being applied to the target database server.
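A quick way to gauge how far a Replica is behind is to compare appliedLastSeqno on the Primary with the same value on the Replica. A minimal sketch, assuming hypothetical host names db1 (Primary) and db2 (Replica):
shell> trepctl -host db1 status | grep appliedLastSeqno
shell> trepctl -host db2 status | grep appliedLastSeqno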
Replication Latency: What is it anyway?
There are two different latencies tracked by the Tungsten Replicator: Applied (appliedLatency) and Relative (relativeLatency). You can see both by running the trepctl status command:
shell> trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysql-bin.000211:0000000020094766;0
appliedLastSeqno : 78022
appliedLatency : 0.571
...
relativeLatency : 8.944
Finished status command...
Applied Latency
Applied latency indicates the time needed to complete the last Seqno, either the time to extract an event from the binary logs and create the THL from it on the Primary, or the time to apply the event to the target database server on a Replica.
Relative Latency
Relative latency indicates the time since the applied seqno last advanced, effectively how long ago the last operation was completed.
Comparison: Applied versus Relative
In an idle cluster with no write traffic, relative latency will climb, while the applied latency remains unchanged.
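As an illustration (the numbers below are invented for the example), two snapshots taken roughly a minute apart on an idle cluster might look like this:
shell> trepctl status | egrep 'appliedLatency|relativeLatency'
appliedLatency : 0.571
relativeLatency : 8.944
... about 60 seconds later, still with no writes ...
shell> trepctl status | egrep 'appliedLatency|relativeLatency'
appliedLatency : 0.571
relativeLatency : 68.944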
If Applied latency climbs in a busy system, it means:
- On a Primary: time to extract is taking longer - check disk performance
- On a Replica: time to apply is taking longer, usually an indication of a long-running event on the Replica - check target database indexing and disk performance
“Long-running transaction” in this context could be either a single statement that takes a long time to run or a transaction containing many smaller statements. It could also be a DDL/schema change that takes a long time to complete.
If Relative latency increases significantly in a busy system, it may be a sign that replication is stalled on either a Primary or a Replica. This is a good parameter to check in monitoring scripts.
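As a minimal monitoring sketch (the 60-second threshold and the plain echo alert are assumptions to adapt to your own environment), relativeLatency can be pulled straight out of trepctl status:
#!/bin/bash
# Alert if replication has not advanced for more than THRESHOLD seconds.
THRESHOLD=60
LATENCY=$(trepctl status | awk '/^relativeLatency/ {print $3}')
# Floating-point comparison via awk
if awk -v l="$LATENCY" -v t="$THRESHOLD" 'BEGIN {exit !(l > t)}'; then
    echo "WARNING: relativeLatency ${LATENCY}s exceeds ${THRESHOLD}s"
fi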
Here are the docs comparing the types of latency: https://docs.continuent.com/tungsten-clustering-7.1/connector-routing-latency.html#connector-routing-latency-comparison
Displaying Latency
The basic command to see latency is trepctl status:
shell> trepctl status
...
appliedLastSeqno : 406734
appliedLatency : 1.122
relativeLatency : 4.541
...
To monitor latency on a Replica, run the following command, which displays all stages of replication on the Replica with performance stats and refreshes every three seconds: trepctl perf -r 3.
If you look at the output, you can see the progress of the 3 different stages within the Replicator, for example:
shell> trepctl perf -r 3
Statistics since last put online 1095951.622s ago
Stage | Seqno | Latency | Events | Extraction | Filtering | Applying | Other | Total
remote-to-thl | 13998690459 | 0.726s | 217192525 | 1090162.006s | 19.635s | 6552.409s | 289.506s | 1097023.556s
Avg time per Event | 0.005s | 0.000s | 0.000s | 0.000s | 0.005s
thl-to-q | 13977405027 | 25186.709s | 195907346 | 1036875.129s | 20.078s | 59752.325s | 374.006s | 1097021.538s
Avg time per Event | 0.005s | 0.000s | 0.000s | 0.000s | 0.006s
q-to-dbms | 13977405028 | 25187.722s | 4040490 | 1064409.815s | 609.628s | 31527.774s | 474.324s | 1097021.541s
Avg time per Event | 0.263s | 0.000s | 0.000s | 0.008s | 0.272s
Filters in stage | mysqlsessions -> pkey
Key values to look at here are the seqno associated with the remote-to-thl stage and the q-to-dbms stage.
If we examine the output captured every 3 seconds, we can see that both of these values are increasing quite significantly.
The q-to-dbms value increases as the transactions are committed to the database, and the growth of the remote-to-thl is an indication of the volume of changes that are being made in the Primary.
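As a rough sketch (assuming the pipe-separated layout shown in the output above), the gap between the extracted and the applied seqno can be pulled out with a one-liner such as:
shell> trepctl perf | awk -F'|' '
  /^remote-to-thl/ {extracted = $2}
  /^q-to-dbms/     {applied = $2}
  END {printf "extracted: %d  applied: %d  gap: %d\n", extracted, applied, extracted - applied}'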
In one customer case, we saw that the difference between these two values in every 3-second sample was consistently in the region of 23 million, meaning the Replicas were roughly 23 million events behind; however, the gap was not increasing, which indicates the Replicator was applying properly.
In this case, the problem was that the gap was not really decreasing much either. The customer did not see the latency coming down, and in fact saw it going up, because of the way the calculations are done.
In short, this means that new changes were being made on the Primary faster than could be applied on the Replicas.
This comes down to the sheer volume of traffic on the cluster and the fact that the customer had only a single applier thread.
Parallel Apply
Typically, in this situation Continuent would suggest using Parallel Apply.
Latency is tracked in a special database schema called tungsten_{service_name}, in the table trep_commit_seqno. There is one row per channel, and so there is one channel/one row for standard replication and 2 or more rows when using Parallel Apply.
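As a hedged example (the service name alpha below is hypothetical; substitute your own), the tracking table can be queried directly on a Replica to see the per-channel rows:
shell> mysql -e "SELECT * FROM tungsten_alpha.trep_commit_seqno\G"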
By default, this only works with ROW-based replication.
We are currently testing a new feature for the next patch release that allows Parallel Apply to work with STATEMENT-based replication, provided a single transaction does not update multiple tables. This new feature will work best when the most active tables within the database are identified - these tables would then be assigned to their own Parallel Apply channels.
Bonus Tips
BONUS 1
To see just the critical information, use the egrep command to filter trepctl status:
shell> trepctl status | egrep 'relative|stat|appl|role'
Processing status command...
appliedLastEventId : mysql-bin.000103:0000000000000635;16
appliedLastGoodLatency : 1.122
appliedLastGoodSeqno : 406734
appliedLastSeqno : 406734
appliedLatency : 1.122
relativeLatency : 4.541
role : slave
state : ONLINE
Finished status command...
BONUS 2
To provide Replica latency information to Continuent support, run the following command and leave it running for approximately 1 minute, then hit Control+C to stop it. Upload the resulting status.log file to the case for us to review.
shell> trepctl perf -r 3 >> /tmp/status.log
Wrap-Up
This blog post discussed the Tungsten Replicator and how latency is calculated when extracting events on a Primary, and when applying events to the Replicas. We also examined what latency is, and the two different types of latency that we track - how to interpret them and how they differ.