The Question
Recently, a customer asked us:
After importing a new section of user data into our Tungsten cluster, we are seeing perpetually rising replication lag. We are sitting at an estimated 8.5 hours of convergence time after importing around 50 million rows, and the lag is still climbing. We are currently migrating some of our users from a NoSQL database into our Tungsten cluster. Our procedure translates the old data into columns, writes out a batch of CSV files, and then sends them one after another to the write master using the mysql client. Specifically, our import SQL is doing a LOAD DATA LOCAL INFILE that reads in a large CSV file to do the import. We have 20,000 records per CSV file and 12 workers which insert them in parallel.
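For context, each import worker is effectively issuing a statement along these lines (the table name, column list, and file path below are illustrative, not the customer's actual schema):

-- One ~20,000-row CSV file loaded as a single transaction
LOAD DATA LOCAL INFILE '/data/import/users_batch_0001.csv'
INTO TABLE users
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(user_id, email, created_at);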
Simple Overview
The Skinny
In cases like this, the slaves are struggling because the target database cannot keep up with the apply stage (q-to-dbms).
Technical Explanation
A Deep Dive
Each CSV file equates to a single transaction, which means one sequence number contains 20,000 rows.
Naturally the load is quick on the master, because LOAD DATA INFILE is a fast and efficient process, especially when you have 12 parallel threads running.
However, the replicators do not apply in parallel, so on the slave side each event amounts to a single load of 20,000 rows at a time.
Reducing the number of records per CSV and the number of parallel workers may reduce the latency, but with 50 million records (roughly 2,500 of these 20,000-row transactions) there is still significant overhead to process downstream on the slaves.
LOAD DATA is treated as unsafe, so when binlog_format=MIXED the statement is logged in ROW-based format.
In other words, both ROW and MIXED end up producing ROW events for this load, which means the slower apply method is used.
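You can confirm which format is in effect on the master with a quick check (a minimal sketch; the output naturally depends on your configuration):

-- Check the binary log format currently in effect
SHOW GLOBAL VARIABLES LIKE 'binlog_format';
-- With ROW, or with MIXED and an unsafe statement such as LOAD DATA,
-- the load is written to the binlog as individual row events.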
The Answer
Session-based Success!
For LOAD DATA INFILE operations, set binlog_format to STATEMENT for that specific LOAD DATA session only - this will generate less THL traffic to be processed.
It is critical that this setting is applied only at the session level, and only for the LOAD DATA process, not for anything else.
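As a sketch, the session-scoped change wraps around the load itself (file and table names are again illustrative; note that changing binlog_format at session level typically requires the SUPER privilege, or SYSTEM_VARIABLES_ADMIN / SESSION_VARIABLES_ADMIN on MySQL 8.0):

-- Switch only this session to statement-based logging
SET SESSION binlog_format = 'STATEMENT';

-- The load is now logged as one statement rather than 20,000 row events
LOAD DATA LOCAL INFILE '/data/import/users_batch_0001.csv'
INTO TABLE users
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(user_id, email, created_at);

-- Restore the session default (shown here as MIXED) or simply close the session
SET SESSION binlog_format = 'MIXED';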
Select a slave, and add the following property to your ini file:
optimize-row-events=false
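As a rough sketch, on the chosen slave that property would typically sit in the node's tungsten.ini (the [defaults] section is shown here as an assumption; place it wherever your configuration keeps replicator options), and a tpm update on that node then applies the change:

[defaults]
optimize-row-events=false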
Turning this optimize parameter off will create a single INSERT per row, which may not sound like it would be quicker, but it means the target database is processing one insert at a time rather than 20,000 rows in one statement.
As it stands, your 20,000-record CSV file is most likely being converted into a single INSERT covering all 20,000 rows. In many cases having optimize-row-events set to true is a good thing, as batching multiple inserts into a single statement gives quicker processing; however, 20,000 records per insert may be putting too much stress on the target database, hence it is slower to apply.
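Conceptually, the difference on the slave side looks like this (the table, columns, and values are made up purely for illustration):

-- optimize-row-events=true: one large multi-row statement per THL event
INSERT INTO users (user_id, email) VALUES
  (1, 'a@example.com'),
  (2, 'b@example.com'),
  -- ... up to 20,000 value lists in a single statement
  (20000, 'z@example.com');

-- optimize-row-events=false: one small statement per row
INSERT INTO users (user_id, email) VALUES (1, 'a@example.com');
INSERT INTO users (user_id, email) VALUES (2, 'b@example.com');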
Also, does the target table have indexes? Huge data loads into a heavily indexed table will increase the time the database takes to process these rows.
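If you are unsure, a quick check on the target shows what must be maintained on every insert, and a common bulk-load tactic is to drop non-essential secondary indexes before the import and recreate them afterwards (the table and index names below are hypothetical):

-- List the indexes the slave must update for every inserted row
SHOW INDEX FROM users;

-- Optionally drop a secondary index for the duration of the load ...
ALTER TABLE users DROP INDEX idx_users_email;
-- ... and rebuild it once the import has finished
ALTER TABLE users ADD INDEX idx_users_email (email);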
Summary
The Wrap-Up
In this blog post we discussed how to reduce replication lag when importing CSV data into a Tungsten Cluster.
To learn about Continuent solutions in general, check out https://www.continuent.com/solutions
The Library
Please read the docs!
For more information about monitoring Tungsten clusters, please visit https://docs.continuent.com.
Tungsten Clustering is the most flexible, performant global database layer available today - use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!
For more information, please visit https://www.continuent.com/solutions
Want to learn more or run a POC? Contact us.