The Question
Recently, a customer asked us:
After importing a new section of user data into our Tungsten cluster, we are seeing perpetually rising replication lag. We are sitting at an estimated 8.5 hours of convergence time after importing around 50 million rows, and the lag is still climbing. We are currently migrating some of our users from a NoSQL database into our Tungsten cluster. Our procedure translates the old data into columns, writes out a batch of CSV files, and then sends them one after another to the write master using the mysql client. Specifically, our import SQL is doing a LOAD DATA LOCAL INFILE that reads in a large CSV file to do the import. We have 20,000 records per CSV file and 12 workers which insert them in parallel.
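For context, each import worker is effectively issuing a statement along these lines (the table name, column list, and file path below are illustrative, not the customer's actual schema):

-- One ~20,000-row CSV file loaded as a single transaction
LOAD DATA LOCAL INFILE '/data/import/users_batch_0001.csv'
INTO TABLE users
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(user_id, email, created_at);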
Simple Overview
The Skinny
In cases like this, the slaves are struggling because the target database cannot keep up with the apply stage (q-to-dbms).
Technical Explanation
A Deep Dive
Each CSV file equates to a single transaction, which means one sequence number contains 20,000 rows.
Naturally the load is quick on the master, because LOAD DATA INFILE is a fast and efficient process, especially when you have 12 parallel threads running.
However, the replicators do not apply in parallel, so on the slave side each event amounts to a single load of 20,000 rows at a time.
Reducing the number of records per CSV and the number of parallel workers may reduce the latency, but with 50 million records (roughly 2,500 of these 20,000-row transactions) there is still significant overhead to process downstream on the slaves.
LOAD DATA is treated as unsafe, so when binlog_format=MIXED the statement is logged in ROW-based format.
In other words, both ROW and MIXED end up producing ROW events for this load, which means the slower apply method is used.
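You can confirm which format is in effect on the master with a quick check (a minimal sketch; the output naturally depends on your configuration):

-- Check the binary log format currently in effect
SHOW GLOBAL VARIABLES LIKE 'binlog_format';
-- With ROW, or with MIXED and an unsafe statement such as LOAD DATA,
-- the load is written to the binlog as individual row events.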
The Answer
Session-based Success!
For LOAD DATA INFILE operations, set binlog_format to STATEMENT for that specific LOAD DATA session only - this will generate less THL traffic to be processed.
It is critical that this setting is applied only at the session level, and only for the LOAD DATA process, not for anything else.
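As a sketch, the session-scoped change wraps around the load itself (file and table names are again illustrative; note that changing binlog_format at session level typically requires the SUPER privilege, or SYSTEM_VARIABLES_ADMIN / SESSION_VARIABLES_ADMIN on MySQL 8.0):

-- Switch only this session to statement-based logging
SET SESSION binlog_format = 'STATEMENT';

-- The load is now logged as one statement rather than 20,000 row events
LOAD DATA LOCAL INFILE '/data/import/users_batch_0001.csv'
INTO TABLE users
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(user_id, email, created_at);

-- Restore the session default (shown here as MIXED) or simply close the session
SET SESSION binlog_format = 'MIXED';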
Select a slave, and add the following property to your ini file:
optimize-row-events=false
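As a rough sketch, on the chosen slave that property would typically sit in the node's tungsten.ini (the [defaults] section is shown here as an assumption; place it wherever your configuration keeps replicator options), and a tpm update on that node then applies the change:

[defaults]
optimize-row-events=false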
Turning this optimize parameter off will create a single INSERT per row, which may not sound like it would be quicker, but it means the target database is processing one insert at a time rather than 20,000 rows in one statement.
As it stands, your 20,000-record CSV file is most likely being converted into a single INSERT covering all 20,000 rows. In many cases having optimize-row-events set to true is a good thing, as batching multiple inserts into a single statement gives quicker processing; however, 20,000 records per insert may be putting too much stress on the target database, hence it is slower to apply.
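Conceptually, the difference on the slave side looks like this (the table, columns, and values are made up purely for illustration):

-- optimize-row-events=true: one large multi-row statement per THL event
INSERT INTO users (user_id, email) VALUES
  (1, 'a@example.com'),
  (2, 'b@example.com'),
  -- ... up to 20,000 value lists in a single statement
  (20000, 'z@example.com');

-- optimize-row-events=false: one small statement per row
INSERT INTO users (user_id, email) VALUES (1, 'a@example.com');
INSERT INTO users (user_id, email) VALUES (2, 'b@example.com');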
Also, does the target table have indexes? Huge data loads into a heavily indexed table will increase the time the database takes to process these rows.
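If you are unsure, a quick check on the target shows what must be maintained on every insert, and a common bulk-load tactic is to drop non-essential secondary indexes before the import and recreate them afterwards (the table and index names below are hypothetical):

-- List the indexes the slave must update for every inserted row
SHOW INDEX FROM users;

-- Optionally drop a secondary index for the duration of the load ...
ALTER TABLE users DROP INDEX idx_users_email;
-- ... and rebuild it once the import has finished
ALTER TABLE users ADD INDEX idx_users_email (email);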
Summary
The Wrap-Up
In this blog post we discussed how to reduce replication lag when importing CSV data into a Tungsten Cluster.
To learn about Continuent solutions in general, check out https://www.continuent.com/solutions
The Library
Please read the docs!
For more information about monitoring Tungsten clusters, please visit https://docs.continuent.com.
Tungsten Clustering is the most flexible, performant global database layer available today - use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!
For more information, please visit https://www.continuent.com/solutions
Want to learn more or run a POC? Contact us.