Understanding MySQL Sharding Simplified 101

on Database Sharding • October 27th, 2021 • Write for Hevo

Most businesses today have started becoming data-oriented. This means that they rely on data-driven decision-making to plan future strategies pertaining to various departments such as Marketing, Growth, Business Development, Product, etc.

To perform an accurate analysis, businesses have to ensure that all the required data is collected in a form suitable for analysis and is easily accessible. Accurate analysis usually calls for a high volume of data, which can outgrow what a single machine is able to store or serve. This led to the concept of Database Sharding.

There are a large number of databases that businesses use today in order to perform their day-to-day operations. One of the most well-known databases is MySQL. This article will help you understand what Database Sharding is and how MySQL Sharding works.

Introduction to MySQL

MySQL is considered to be one of the most popular Open-Source Relational Database Management Systems (RDBMS). MySQL implements a simple Client-Server Model that helps its users manage Relational Databases i.e. data stored in the form of rows and columns across tables. It uses the well-known query language, Structured Query Language (SQL) that allows users to perform all required CRUD (Create, Read, Update, Delete) operations.

Development of MySQL began in 1994 at what became the Swedish company MySQL AB. Sun Microsystems acquired MySQL AB in 2008, and Oracle, the US tech giant, acquired Sun in 2010. Oracle is now responsible for the development and growth of MySQL. Even though MySQL is Open-Source and free of cost for everyone, Oracle offers some premium features only to paying customers.

Even though there is very high competition in the database market today, MySQL is considered to be the preferred database for more than 5000 companies including Uber, Netflix, Pinterest, Amazon, Airbnb, Twitter, etc.

Key Features of MySQL

Some of the key features of MySQL are as follows:

  • Robust Transactional Support: Implements the ACID (Atomicity, Consistency, Isolation, Durability) properties, which guard against data loss and inconsistency.
  • Ease of Use: Considering that it makes use of SQL for querying data, anyone with basic knowledge of SQL can perform the required tasks easily.
  • Security: Implements a complex data security layer that ensures that only authorized users can access sensitive data.
  • Scalable: Considered to be highly scalable due to support for multi-threading. 
  • Roll-back Support: MySQL supports roll-backs, commits, and crash recovery for all transactions.
  • High Performance: Houses various fast load utilities along with Table Index Partitioning and Distinct Memory Caches that can ensure high performance.

To know more about MySQL, visit this link.

Download the Ultimate Guide on Database Replication
Learn the 3 ways to replicate databases & which one you should prefer.

Simplify ETL Using Hevo’s No-code Data Pipeline

Hevo is a No-code Data Pipeline that offers a fully-managed solution to set up data integration from 100+ data sources including MySQL and will let you directly load data to a Data Warehouse or the destination of your choice. It will automate your data flow in minutes without writing any line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully-automated solution to manage data in real-time and always have analysis-ready data.

Hevo takes care of all the data preprocessing needed to set up the integration and lets you focus on key business activities and draw more powerful insights on how to generate more leads, retain customers, and take your business to new heights of profitability. It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination.

Get Started with Hevo for Free

Let’s Look at Some Salient Features of Hevo:

  • Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer. 
  • Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within pipelines.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Sign up here for a 14-Day Free Trial!

Introduction to Sharding

Businesses that rely on Monolithic Relational Database Management Systems (RDBMS) hit bottlenecks as the volume of data being stored increases. Response Time is bound to suffer because of limited CPU power, memory, storage capacity, and throughput. Although Vertical Scaling can be implemented in these situations, it has its own limitations and yields diminishing returns after a point.

Horizontal Scaling is a common way to deal with high volumes of data. If tables are partitioned horizontally and split across multiple Servers, the load on each Server decreases, and multiple Servers working on a query in parallel can reduce Response Time. This process of breaking up large tables into smaller chunks and storing them across multiple Servers is known as Sharding. Each chunk is known as a Shard.

A Shard can be seen as a horizontal data partition that contains a subset of the original data set. Hence, it is responsible for serving only a portion of the total workload. Sharding can also improve data availability in case of unplanned outages. If the Server of an Unsharded Database goes down for some reason, there will be zero data availability. In a Sharded Database, however, the data is spread across multiple Shards. This means that in case of an unplanned outage, only the data in the Shards that are down will be inaccessible. The Shards that are still active will be able to serve the requests for their data.

The following image shows how an Unsharded Database is split into 2 Shards:

What is MySQL Sharding?

As the data in a MySQL Database grows, a single server struggles to scale, particularly for writes. Companies address this problem with Horizontal Sharding, in which data from a single MySQL Database is partitioned across several MySQL Database servers with identical schemas. This spreads the workload from a single server across multiple servers, each of which is called a shard. This process is called MySQL Sharding.

Common Auto-Sharding Architectures

Different Sharding Architectures and implementations have been used to build large-scale systems. The three common Auto-Sharding Architectures are listed below:

1) Hash Sharding

Hash Sharding takes a row's shard key as input and outputs a hash value that determines which shard stores the row. A well-chosen hash function distributes the data evenly across all the servers, reducing the risk of hotspots. Because hashing scrambles key order, rows with adjacent shard keys are unlikely to be placed on the same shard. Hash Sharding therefore suits targeted data operations that look up rows by key rather than by range.
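
As a rough illustration of hash-based placement, the following Python sketch maps a shard key to a shard index. The shard count, the `customer-N` key format, and the function names are assumptions made for this example; they are not part of MySQL.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key to a shard index using a stable hash."""
    # hashlib produces the same digest in every process, unlike the
    # built-in hash(), which is randomized per run for strings.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribution(keys, num_shards: int = NUM_SHARDS):
    """Count how many keys land on each shard."""
    counts = [0] * num_shards
    for key in keys:
        counts[shard_for(key, num_shards)] += 1
    return counts


# Hypothetical keys: 10,000 customer identifiers.
counts = distribution(f"customer-{i}" for i in range(10_000))
print(counts)
```

With plain modulo placement like this, changing the shard count remaps most keys, which is one reason rebalancing needs careful planning.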

2) Range Sharding

In Range Sharding, the data is divided based on ranges of the shard key, or keyspaces: the closer two shard keys are, the more likely their rows fall in the same range and therefore on the same shard. Every shard has an identical schema taken from the original database. This makes it easy to run queries that read contiguous ranges of data. Choosing the shard keys poorly can result in database hotspots. That's why one should pick keys with large cardinality, low recurring frequency, and values that do not increase monotonically.
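
A minimal sketch of range-based routing, assuming numeric customer IDs and three hand-picked boundaries (both hypothetical). It also shows why contiguous reads are cheap: a range query only has to touch the shards whose ranges overlap it.

```python
from bisect import bisect_right

# Hypothetical range boundaries on a numeric shard key:
# shard 0 holds keys < 1000, shard 1 holds 1000-1999,
# shard 2 holds 2000-2999, shard 3 holds keys >= 3000.
BOUNDARIES = [1000, 2000, 3000]


def range_shard(customer_id: int) -> int:
    """Return the index of the shard whose range contains the key."""
    return bisect_right(BOUNDARIES, customer_id)


def shards_for_range(low: int, high: int) -> list:
    """Return the shards a query over [low, high] has to visit."""
    return list(range(range_shard(low), range_shard(high) + 1))


print(range_shard(1500))             # a single-key lookup hits one shard
print(shards_for_range(1500, 2500))  # a range scan hits adjacent shards
```

If the key grows monotonically (for example, an auto-increment ID), every new row lands on the last shard, which is the hotspot the text warns about.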

3) Geo-Partitioning

Geo-based sharding first partitions data by a user-specified column that maps each row to a location, so that shards can be pinned to specific regions and the nodes in them. Within each location, the data is then sharded using range or hash sharding.
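
The two-level scheme might be sketched as follows. The region codes and shard names are invented for illustration, and hash sharding is used inside each region; range sharding would work equally well at that second level.

```python
import hashlib

# Hypothetical mapping from a region column value to that region's shards.
REGION_SHARDS = {
    "eu": ["eu-shard-0", "eu-shard-1"],
    "us": ["us-shard-0", "us-shard-1", "us-shard-2"],
}


def geo_shard(region: str, key: str) -> str:
    """Pick the region first, then hash the key within that region."""
    shards = REGION_SHARDS[region]  # raises KeyError for unknown regions
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


print(geo_shard("eu", "customer-1"))
print(geo_shard("us", "customer-1"))
```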

The Challenges of Manual Sharding

Monolithic databases such as MySQL, Oracle, and PostgreSQL, as well as Amazon Aurora, do not support automatic sharding. As the data grows, handling MySQL sharding becomes an increasingly complex additional development task. A few more challenges of Manual Sharding are listed below:

  • It needs additional sharding logic to define how the data should be distributed and how to retrieve it. It also involves choosing the right approach for MySQL Sharding, the number of nodes needed, and how to rebalance the data across all the nodes for optimal performance.
  • Developers need to change the data sharding techniques whenever business needs change.
  • Manual sharding can allocate data unevenly, leading to unbalanced shards: some may be overloaded with data while others sit nearly empty.
  • A manual sharding process complicates operations such as maintenance and backups, which now span multiple data servers. Carrying out all these processes manually is a cumbersome task.

Understanding MySQL Sharding

Unfortunately, Monolithic Databases like PostgreSQL, Oracle, and MySQL, along with modern distributed SQL Databases such as Amazon Aurora, do not support Automatic Sharding. This means that MySQL Sharding has to be implemented manually in the Application Layer. That requires considerable Engineering bandwidth, since the entire Sharding logic that determines how data is distributed and how it is fetched has to be built.

Since the implementation has to be done manually, there are certain design decisions that have to be made. The decisions that need to be made are as follows:

1) Choosing MySQL Sharding Key

The MySQL Sharding Key will control how data is distributed across Shards. When implementing Sharding in MySQL, the MySQL Sharding Key should be chosen carefully as the wrong key might cause system inflexibility later. For example, Referential Integrity i.e. the parent/child relationship between tables that is maintained by the Relational Database Management System (RDBMS), won’t be maintained automatically if the parent and child rows are stored on separate Shards.

The two possible types of MySQL Sharding Keys are as follows:

  • Intelligent MySQL Sharding Keys: More suitable for avoiding joins across Shards, but considered more exposed to skew. For example, if a Customer table is Sharded on the attribute that denotes the Customer ID, then it makes sense to store all information related to a given customer, such as interactions, touchpoints, etc., on that customer's Shard to avoid any cross-Shard joins.
  • Hashed MySQL Sharding Keys: Ensure automatic distribution of data across Shards, so that the data is well spread and no single Shard experiences a very high load. For example, if the number of customers is expected to grow a lot, it makes more sense to use Hashed MySQL Sharding Keys to ensure proper distribution of data across Shards. However, useful analysis often requires complex join operations across multiple Shards, which MySQL does not support natively and which are costly to emulate in the application.
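
The trade-off between the two key types can be seen in a small sketch. The data is made up, and modulo stands in for a real hash function so that the placement is easy to follow. Routing child rows by the parent's Customer ID keeps each customer on one Shard, while routing them by their own ID scatters them.

```python
from collections import defaultdict

NUM_SHARDS = 3  # assumed shard count


def shard_for(key: int) -> int:
    # Modulo stands in for a real hash function to keep placement predictable.
    return key % NUM_SHARDS


# Hypothetical (interaction_id, customer_id) rows:
# 4 customers with 3 interactions each.
interactions = [(i, (i - 1) // 3 + 1) for i in range(1, 13)]


def shards_per_customer(route_by_customer: bool) -> dict:
    """Return, per customer, the set of shards holding their interactions."""
    placement = defaultdict(set)
    for interaction_id, customer_id in interactions:
        key = customer_id if route_by_customer else interaction_id
        placement[customer_id].add(shard_for(key))
    return dict(placement)


colocated = shards_per_customer(route_by_customer=True)
scattered = shards_per_customer(route_by_customer=False)
print(colocated)  # each customer's rows live on exactly one shard
print(scattered)  # each customer's rows are spread over several shards
```

In the co-located layout, a per-customer join stays on one Shard, but a very active customer makes that Shard heavier than the rest.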

2) Handling Schema Changes

MySQL allows users to change table Schemas in the database at any point in time after creation. If MySQL Sharding has been implemented, a Schema change has to be applied on all Shards before any application can leverage the new Schema. If the change is missing on even one of the Shards, it could lead to application errors or data inconsistency. Hence, users should either develop logic that ensures Schema changes are applied across all Shards or avoid Schema changes once MySQL Sharding is implemented.
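
One possible shape for such logic is sketched below. It uses in-memory SQLite databases from Python's standard library purely as stand-ins for MySQL Shards so that the example runs anywhere; the table, the column, and the helper names are hypothetical.

```python
import sqlite3


def make_shard() -> sqlite3.Connection:
    """Create a stand-in shard (in-memory SQLite, not MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    return conn


def columns(conn: sqlite3.Connection, table: str) -> list:
    """List the column names of a table on one shard."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def apply_to_all(shards, ddl: str) -> list:
    """Run a DDL statement on every shard; return indexes where it failed."""
    failed = []
    for index, conn in enumerate(shards):
        try:
            conn.execute(ddl)
            conn.commit()
        except sqlite3.Error:
            failed.append(index)
    return failed


shards = [make_shard() for _ in range(3)]
failed = apply_to_all(shards, "ALTER TABLE customers ADD COLUMN email TEXT")

# Only let the application use the new column once every shard has it.
schemas_match = all(
    columns(conn, "customers") == columns(shards[0], "customers")
    for conn in shards
)
print(failed, schemas_match)
```

In MySQL, DDL statements commit implicitly, so a change that fails partway cannot simply be rolled back across Shards. Tracking which Shards failed, as `apply_to_all` does, is what allows a retry.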

3) Mapping between Physical Servers, Shards, and MySQL Sharding Keys

MySQL Sharding cannot be implemented without a proper mapping between Physical Servers, Shards, and MySQL Sharding Keys. This mapping might have to be updated regularly as the number of Shards increases or the Schemas change. Also, this mapping has to be consulted for almost every query that is run on the database and hence should be stored in a location where fast lookups are possible. Therefore, it is considered ideal to store it in an in-memory database.
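
A minimal sketch of such a mapping, with plain Python dictionaries standing in for the in-memory store. The bucket count, shard names, and server addresses are all invented for illustration.

```python
import zlib

NUM_BUCKETS = 8  # assumed fixed number of logical buckets

# Key -> bucket is computed; bucket -> shard and shard -> server are looked up.
bucket_to_shard = {b: f"shard-{b % 2}" for b in range(NUM_BUCKETS)}
shard_to_server = {
    "shard-0": "db-a.example.internal:3306",  # hypothetical hosts
    "shard-1": "db-b.example.internal:3306",
}


def bucket_for(key: str) -> int:
    """Map a sharding key to a logical bucket with a stable checksum."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def locate(key: str) -> str:
    """Resolve a sharding key to the server that holds its data."""
    return shard_to_server[bucket_to_shard[bucket_for(key)]]


def move_bucket(bucket: int, shard: str, server: str) -> None:
    """Reassign one bucket to a (possibly new) shard and server."""
    shard_to_server[shard] = server
    bucket_to_shard[bucket] = shard


print(locate("customer-42"))
```

Routing through a fixed set of buckets means that adding a Shard only requires reassigning some buckets in the mapping, rather than rehashing every key.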

Limitations of MySQL Sharding

Implementation of MySQL Sharding has certain drawbacks because of which its implementation for most applications in production is not recommended. If Sharding is an absolute requirement, then it is advised to use NoSQL Databases such as MongoDB that support Automatic Sharding.

The limitations of implementing MySQL Sharding are as follows:

  • MySQL cannot natively join tables across multiple Shards. This is considered to be a huge disadvantage of MySQL Sharding since a lot of businesses use SQL for Analytics, which often relies on complex join operations. Any cross-Shard join has to be assembled in the application.
  • Another major advantage of using SQL Databases for Analytics is the ability to perform aggregations easily. If MySQL Sharding is implemented, a single query can no longer aggregate across multiple Shards, so the application has to combine per-Shard results itself.
  • ACID Compliance is considered to be one of the most significant advantages of using any SQL Database. Each Shard remains ACID Compliant on its own, but a transaction that spans multiple Shards no longer gets those guarantees automatically, so the possibility of inaccurate or inconsistent data increases significantly.
  • The MySQL Sharding process has to be implemented manually based on business and data requirements. This means that a high amount of resources and time would be required in developing and testing its implementation.
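
To make the aggregation limitation concrete, here is a sketch of the scatter-gather work an application has to do for a global average. In-memory SQLite databases again stand in for MySQL Shards, and the `orders` table and its values are made up.

```python
import sqlite3


def make_shard(amounts) -> sqlite3.Connection:
    """Create a stand-in shard (in-memory SQLite) holding some orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(a,) for a in amounts])
    return conn


def global_average(shards):
    """Combine per-shard SUM and COUNT into one overall average."""
    total, count = 0.0, 0
    for conn in shards:
        shard_sum, shard_count = conn.execute(
            "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM orders"
        ).fetchone()
        total += shard_sum
        count += shard_count
    return total / count if count else None


# Hypothetical data: unevenly filled shards, one of them empty.
shards = [make_shard([10.0, 20.0]), make_shard([30.0]), make_shard([])]
print(global_average(shards))
```

Averaging the per-Shard averages would give a different, incorrect answer here because the Shards hold different numbers of rows. This is why the merge step has to be designed per aggregate.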

Conclusion

This article provided you with an in-depth analysis of what MySQL is, how Sharding works and why it is required along with the design decisions that have to be made while implementing MySQL Sharding. It also provided you with a comprehensive understanding of why Sharding is considered to be a bad idea for most businesses.

Implementation of Sharding is considered to be more suitable for NoSQL Databases. Hence, if Sharding is of paramount importance for your business and data requirements, it would be ideal to transfer all the required data to a NoSQL Database and then implement Sharding on it. This migration of data from MySQL to a NoSQL Database can either be implemented manually or automated using platforms like Hevo. 

Visit our Website to Explore Hevo

Hevo helps you directly transfer data from a source of your choice to a Data Warehouse or desired destination in a fully automated and secure manner without having to write code or export data repeatedly. It will make your life easier and make data migration hassle-free. It is User-Friendly, Reliable, and Secure.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience of learning about MySQL Sharding in the comments section below!
