Fragmentation and Replication In Distributed Database

Most businesses today operate from multiple geographies. Each unit requires its own set of data, which is where Distributed Database Management Systems (DDBMS) come in. Distributing data across multiple sites provides high availability, easy sharing, and local query optimization. Replicating data automatically helps in data recovery if the database on any site gets damaged or malfunctions.

This comprehensive guide presents an overview of Distributed Database Management Systems. We discuss three essential processes: allocation, fragmentation and replication in distributed databases, with in-depth coverage of each.

What Is a Distributed Database?

A distributed database refers to a collection of databases scattered across multiple sites over a network. Think of this system as a Multinational Corporation whose offices and systems are distributed across the globe, yet they work harmoniously for a common purpose.
In a Distributed Database Management System (Distributed DBMS or DDBMS), the data, processes, and interface components of a system are divided across multiple locations so that each can function independently. Different databases are stored on different computers, and the processing workload is distributed accordingly. Compared to a Centralized Database Management System, this architecture can deliver better performance and reliability, because queries can run close to the data and a single site failure need not take the whole system down.

One of the main benefits of using Distributed Database Management Systems is that your applications need not know where your data resides (also called transparent access to data). When a user executes a query on a distributed database, multiple sites from various data centers can seamlessly collaborate to answer the user’s inquiry.
Unlike a Centralized Database Management System, a distributed database must coordinate work across sites, so it comes with distributed commit protocols, concurrency control techniques, and recovery methods that keep data consistent under heavy transaction loads and help the system recover from site failures.

Features of Distributed Database Management System

A Distributed Database Management System (DDBMS) carries the full functionality of a standard DBMS and provides transparent management of distributed and replicated data. It also:

Improves reliability & data availability through distributed transactions.
Provides an easier and more economical system for database expansion.
Maintains confidentiality and data integrity of databases.
Is hardware-independent.
Delivers quick responses to user queries through efficient distribution of data.
Gives users a simple interface to open, read/write records, and close files.
Comes with declarative query capabilities, transaction management, and integrity enforcement.

What Are Allocation, Fragmentation, and Replication in a Distributed Database?

This section introduces data allocation, fragmentation, and replication, the three processes that determine where data lives in a distributed database.

Data Allocation

Data Allocation is the intelligent distribution of pieces of your data (called data fragments) across sites to improve database performance and Data Availability for end-users. It aims to reduce the overall cost of transaction processing while still delivering accurate data quickly in your DDBMS.

Data Allocation is one of the key steps in building your Distributed Database Systems. There are two common strategies used in optimal Data Allocation: Data Fragmentation and Data Replication. In the following sections on fragmentation and replication in distributed databases, we discuss both of these techniques in greater detail.

Data Fragmentation

Data fragmentation refers to breaking data into smaller pieces, or fragments, for efficient storage and management across multiple locations. This process ensures optimal distribution and accessibility of data in a distributed database system.

Database Fragmentation has two basic types, horizontal and vertical, which can also be combined into a third type, hybrid fragmentation.

In horizontal fragmentation, each tuple of a relation r is assigned to one or more fragments.
In vertical fragmentation, the schema for a relation r is split into several smaller schemas that share a common candidate key or a special tuple-id attribute, so that the original relation can be rebuilt.

Fragmentation can also extend beyond traditional databases:

Application-Level Fragmentation: Data stored in unique formats by different software applications.
Physical Fragmentation: Data stored across different servers or storage devices.
Logical Fragmentation: Breaking a single file into smaller parts across the system.

Data Replication

Distributed Database Replication is the process of creating and maintaining multiple copies (redundancy) of data at different sites. Keeping a copy close to where it is used makes retrieval faster. Redundancy also eliminates single points of failure: if one site cannot serve a request, another site holding a copy can, which gives you and your teams a fault-tolerant system.

However, Distributed Database Replication also has disadvantages. To return correct responses to user queries, every copy must be kept up to date and synchronized with the others. Failure to do so creates inconsistencies in data, which can hamper business goals and the decisions of other teams.

Later sections of this guide cover the types and schemes of Data Replication in Distributed Database Systems in greater depth.

Methods of Data Fragmentation of a Table

This section discusses the two fundamental fragmentation strategies, horizontal and vertical. Distributed database management systems also allow fragments to be nested in a hybrid fashion, called hybrid fragmentation, which is covered as the third strategy.

Horizontal Fragmentation
Vertical Fragmentation
Hybrid Fragmentation

Horizontal Fragmentation (or Sharding)

A Horizontal Fragmentation strategy divides a table horizontally by selecting subsets of rows according to the values of one or more fields. After partitioning, these data fragments are assigned to different sites of a Distributed Database System. When a user requests the complete table, the fragments are combined using a union operation.
There are two versions of Horizontal Fragmentation: Primary Horizontal Fragmentation, which partitions a relation using predicates defined on that relation itself, and Derived Horizontal Fragmentation, which partitions a relation using predicates defined on another relation.
Horizontal fragmentation allows a relation to be processed in parallel. You can also allocate the tuples of a global table to the sites where they are most frequently accessed, for efficient data storage and better access.
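To make the row-wise split and the union-based reconstruction concrete, here is a minimal Python sketch. The customers table, the region field, and the site names are hypothetical and chosen only for illustration; a real DDBMS would do this inside the database engine rather than in application code.

```python
# Hypothetical relation; the rows and the "region" field are illustrative only.
customers = [
    {"id": 1, "name": "Ana", "region": "EU"},
    {"id": 2, "name": "Bob", "region": "US"},
    {"id": 3, "name": "Chen", "region": "APAC"},
    {"id": 4, "name": "Dee", "region": "US"},
]

# One predicate per (hypothetical) site: primary horizontal fragmentation.
predicates = {
    "site_eu": lambda row: row["region"] == "EU",
    "site_us": lambda row: row["region"] == "US",
    "site_apac": lambda row: row["region"] == "APAC",
}


def horizontal_fragment(rows, predicates):
    """Assign each row to the fragment of the first predicate it satisfies."""
    fragments = {site: [] for site in predicates}
    for row in rows:
        for site, predicate in predicates.items():
            if predicate(row):
                fragments[site].append(row)
                break
        else:
            # No predicate matched: the fragmentation would lose this row.
            raise ValueError(f"no fragment accepts row {row!r}")
    return fragments


def reconstruct(fragments):
    """Rebuild the table as the union of all fragments, ordered by id."""
    rows = [row for fragment in fragments.values() for row in fragment]
    return sorted(rows, key=lambda row: row["id"])


fragments = horizontal_fragment(customers, predicates)
```

Because the predicates here are disjoint and together cover every row, the union of the fragments gives back exactly the original table.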

Vertical Fragmentation

Vertical Fragmentation splits a table vertically by attributes or columns, often aligning well with columnar database architectures. In this case, data fragments keep only certain attributes of the original table. They are then assigned to different sites of a DDBMS.
Every fragment keeps the table’s primary key, which is required to restore the original table. The fragmentation is done in such a way that reconstructing the table only requires a normal JOIN on that key. When no suitable key exists, a special attribute called a tuple-id is added to the schema to serve the same purpose.
Vertical fragmentation is highly useful for cases where you want to enforce data privacy, because sensitive columns can be kept in a separate fragment at a more restricted site.
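The column-wise split and the key-based JOIN can be sketched in a few lines of Python. The employees table, its column names, and the choice of which columns count as private are assumptions made for this example.

```python
# Hypothetical relation; "emp_id" plays the role of the shared key.
employees = [
    {"emp_id": 1, "name": "Ana", "dept": "Sales", "salary": 5200},
    {"emp_id": 2, "name": "Bob", "dept": "Ops", "salary": 4800},
]


def vertical_fragment(rows, key, column_groups):
    """Project the rows onto each column group, always keeping the key."""
    return [
        [{col: row[col] for col in (key, *columns)} for row in rows]
        for columns in column_groups
    ]


def join_on_key(left, right, key):
    """Rebuild full rows by joining two fragments on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


# The salary column lives in its own fragment, which could sit at a
# more restricted site than the general-purpose fragment.
public_fragment, private_fragment = vertical_fragment(
    employees, "emp_id", [("name", "dept"), ("salary",)]
)
rebuilt = join_on_key(public_fragment, private_fragment, "emp_id")
```

Repeating the key in every fragment costs some storage, but without it the fragments could not be matched up again.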

Hybrid Fragmentation

Hybrid Fragmentation combines the two strategies above, Horizontal and Vertical Fragmentation.
Here a table is first fragmented in one form (horizontal or vertical), and the resulting fragments are then fragmented again in the other form. The fragments are assigned to sites according to how frequently each one is accessed. The original table can be reconstructed by applying union and natural JOIN operations in the appropriate order.
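As an illustration of the reconstruction order, the sketch below assumes a table that is first split horizontally by region and then vertically into a general and a payment fragment. The orders table and all field names are hypothetical. Rebuilding it means joining the vertical pieces of each horizontal fragment first, then taking the union.

```python
# Hypothetical relation used only for this example.
orders = [
    {"order_id": 1, "region": "EU", "item": "desk", "card_last4": "1111"},
    {"order_id": 2, "region": "US", "item": "lamp", "card_last4": "2222"},
    {"order_id": 3, "region": "EU", "item": "chair", "card_last4": "3333"},
]


def split_rows(rows, field):
    """Horizontal step: group rows by the value of one field."""
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).append(row)
    return groups


def split_columns(rows, key, columns):
    """Vertical step: keep the key plus the listed columns."""
    return [{col: row[col] for col in (key, *columns)} for row in rows]


# Horizontal first, then vertical inside each horizontal fragment.
hybrid = {
    region: {
        "general": split_columns(rows, "order_id", ("region", "item")),
        "payment": split_columns(rows, "order_id", ("card_last4",)),
    }
    for region, rows in split_rows(orders, "region").items()
}


def reconstruct(hybrid, key):
    rebuilt = []
    for parts in hybrid.values():
        payment_by_key = {row[key]: row for row in parts["payment"]}
        # Join the vertical pieces of one horizontal fragment...
        rebuilt.extend(
            {**row, **payment_by_key[row[key]]} for row in parts["general"]
        )
    # ...then the union across horizontal fragments, ordered by key.
    return sorted(rebuilt, key=lambda row: row[key])
```

If the table had been split vertically first, the order would be reversed: union within each vertical fragment, then join.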

Advantages and Disadvantages of Fragmentation

This section discusses the pros and cons of Database Fragmentation.

Advantages

Using Database Fragmentation, you and your teams can:

Concurrently execute a number of transactions.
Capitalize on the parallel processing of a single query.
Take advantage of increased system throughput.
Store data efficiently by saving frequently used data close to the site of usage.
Use local query optimization.
Preserve the security and privacy of your database systems.
Benefit from fault-tolerance architecture with better disaster recovery mechanisms.

Disadvantages

Database Fragmentation falls short in the following scenarios:

When application views are defined on more than one fragment, they can develop conflicting requirements.
When fragmentation is applied recursively, reconstructing the original table can become expensive.
When simple operations, such as checking for dependencies, require chasing data across several sites.
When data from several fragments is required, access times can be very high.

Data Replication In DBMS

Data Replication duplicates your database and creates multiple copies at two or more sites. This is useful for providing continuous system availability, performance, and scalability to fulfill your application requirements. It creates a system of distributed databases, which can in turn help your users access data relevant to their tasks without interfering with the work of other users. On a high level, there are three ways to perform Database Replication:

Transactional Replication
Snapshot Replication
Merge Replication

Transactional Replication

Transactional Replication makes a complete copy of your database, as well as copies of new data changes. In this type of Data Replication, changes to your database are synced in real-time and in the same order as they occur. This guarantees transactional consistency.

Transactional Database Replication is your go-to option when you have frequent changes that you wish to send to your subscribers. You can either batch transactions and send them at regular intervals, or send them as they occur. This type of replication improves performance and decreases latency while supporting a high volume of read, write, and delete activity.
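The core idea, an initial copy followed by changes applied in their original order, can be sketched as follows. The Publisher and Subscriber classes, the log format, and the sequence numbers are assumptions made for illustration and do not mirror any particular product's implementation.

```python
class Publisher:
    """Holds the source data and an ordered log of every change."""

    def __init__(self):
        self.data = {}
        self.log = []  # entries: (sequence_number, operation, key, value)

    def write(self, key, value):
        self.data[key] = value
        self.log.append((len(self.log) + 1, "put", key, value))

    def delete(self, key):
        self.data.pop(key, None)
        self.log.append((len(self.log) + 1, "delete", key, None))


class Subscriber:
    """Starts from a full copy, then replays changes in log order."""

    def __init__(self, snapshot, last_applied=0):
        self.data = dict(snapshot)
        self.last_applied = last_applied

    def apply(self, log):
        for seq, operation, key, value in log:
            if seq <= self.last_applied:
                continue  # already applied; replaying the log is safe
            if operation == "put":
                self.data[key] = value
            else:
                self.data.pop(key, None)
            self.last_applied = seq


pub = Publisher()
pub.write("a", 1)
# Initial full copy, taken after the first change.
sub = Subscriber(pub.data, last_applied=len(pub.log))
pub.write("b", 2)
pub.write("a", 3)
pub.delete("b")
sub.apply(pub.log)  # could also be called after every single change
```

Tracking the last applied sequence number lets the same log be sent in batches or one entry at a time without applying a change twice.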

Snapshot Replication

Snapshot Replication is perhaps the simplest type of Data Replication. It copies “snapshots” of your database: the state of the database exactly as it is at a specific point in time, without tracking any changes or updates made afterwards. This kind of replication is helpful when changes to your databases are infrequent.

Snapshot Replication is slower compared to Transactional Replication because it sends data in bulk as opposed to new changes. It is generally used in scenarios when your subscribers can function in read-only mode or when your subscribers don’t require updated data to operate.
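A small Python sketch shows the trade-off: the subscriber holds a full copy that stays frozen until the next snapshot is taken. The in-memory dictionary standing in for a database and the product data are hypothetical.

```python
import copy


def take_snapshot(database):
    """Copy the whole database as it is right now (a bulk copy)."""
    return copy.deepcopy(database)


# Hypothetical publisher database.
publisher = {"products": [{"sku": "A1", "price": 10}]}

subscriber = take_snapshot(publisher)

# A later change at the publisher is not propagated on its own...
publisher["products"][0]["price"] = 12
stale_price = subscriber["products"][0]["price"]

# ...it only arrives with the next full snapshot.
subscriber = take_snapshot(publisher)
fresh_price = subscriber["products"][0]["price"]
```

Each refresh copies everything, which is why this approach suits data that changes rarely or subscribers that can work with slightly outdated, read-only copies.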

Merge Replication

Merge Replication combines data from several databases into a single database. It tracks the data changes and schema modifications made at publishers and subscribers and synchronizes them using merge agents. Because the same data can be changed in more than one place, a well-structured database schema design is important for consistency and conflict resolution. A great advantage of Merge Replication is that it allows publishers and subscribers to modify the database independently.

Merge Replication is extremely useful for subscribers who can’t always connect to the publisher, yet need to carry a copy of the database that they can modify. They can make changes offline and later synchronize them with the Publisher and other Subscribers.
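The sketch below illustrates merging independent changes with one simple conflict rule, last writer wins. That rule, the integer timestamps, and the contact record are assumptions made for the example; real merge replication systems typically offer configurable conflict resolvers.

```python
def merge(replica_a, replica_b):
    """Merge two replicas whose values are (value, timestamp) pairs.

    Conflict rule assumed here: the newer timestamp wins. On a tie,
    the value from replica_a is kept.
    """
    merged = dict(replica_a)
    for key, (value, timestamp) in replica_b.items():
        if key not in merged or timestamp > merged[key][1]:
            merged[key] = (value, timestamp)
    return merged


# Hypothetical record, identical at publisher and subscriber at time 1.
publisher = {"phone": ("555-0100", 1), "city": ("Pune", 1)}
subscriber = dict(publisher)

# Both sides change data independently while the subscriber is offline.
publisher["city"] = ("Delhi", 2)
publisher["phone"] = ("555-0150", 2)
subscriber["phone"] = ("555-0199", 3)  # conflicting, more recent edit

synced = merge(publisher, subscriber)
```

Here the city change from the publisher and the later phone change from the subscriber both survive, while the publisher's older phone edit is overwritten.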

Pros and Cons of Data Replication in DBMS

Advantages of Data Replication

Data Reliability: Your databases continue to work even in case of a site failure. Using Distributed Database Replication, you can request and receive the same copy from a different site.
Scalability: As your systems grow geographically and in terms of the number of locations (and hence the number of access requests), replication provides a seamless way to handle this expansion without compromising on response times.
Quicker Response: Data Replication enables copies of data to be available close to their access sites. This method of localization delivers quick query processing and, consequently, fast response times.
Simpler Transactions: With Data Replication, user transactions become simple since they require fewer table joins and minimal coordination across the network.

Disadvantages of Data Replication

High Storage Requirements: If your databases are of a gigantic scale, creating and maintaining copies of those databases will demand a high storage capacity.
Increased Costs and Complexity: More copies mean more storage costs. With every update, your DDBMS system must ensure that new changes are reflected in all the copies of the data at all sites.
Undesirable Application – Database Coupling: Update mechanisms can leave copies of the data inconsistent with one another. Eliminating those inconsistencies requires complex coordination and careful investigation at the application level.

Distributed Database Replication Schemes

You can perform Data Replication in a DDBMS in either full or partial form. Full Replication is when you create complete copies of your existing database and store them on multiple sites. Partial Replication is when you copy and store only the frequently used fragments of your database, leaving the rest. No Replication is when each fragment is held at exactly one site without any duplication.

Full Replication
Partial Replication
No Replication

The following sections discuss these replication schemes in detail:

Full Replication

Full Replication is the process of cloning your database entirely. In Full Replication, you create a full copy of your database at every site of the distributed system. This improves system availability, since every site holds a complete copy, but updates are slow because each change must be applied to every copy.

Full Replication is appropriate for worldwide enterprises with teams that require complete data access. For example, if the servers used by a European team fail, their North American counterparts can take over the remaining work from their own full copy.

Advantages of Full Replication

Guaranteed High Availability.
Rapid retrieval of global queries.
Faster query execution.
Negligible communication costs.

Disadvantages of Full Replication

Concurrency control is difficult, because concurrent updates must be coordinated across all copies.
Slower update process, because every change is duplicated at every site.

Up next, in our fragmentation and replication in distributed database guide, we take a look at Partial Replication.

Partial Replication

Partial Replication is a process of replicating only frequently used fragments of your database. Here, each fragment is replicated based on its importance and user demand. The number of copies can range from one to many, depending on the total number of available nodes in the Distributed Database Management System.
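One way to picture this is as a replication plan that gives frequently accessed fragments more copies. The sketch below is a deliberately simple heuristic: the access counts, the threshold, and the all-or-one placement rule are assumptions for illustration, not a standard allocation algorithm.

```python
def plan_partial_replication(access_stats, sites, hot_threshold):
    """Decide where each fragment is stored.

    access_stats maps fragment name -> (home_site, access_count).
    Fragments accessed at least hot_threshold times are copied to
    every site; all others stay only at their home site.
    """
    plan = {}
    for fragment, (home_site, access_count) in access_stats.items():
        if access_count >= hot_threshold:
            plan[fragment] = list(sites)
        else:
            plan[fragment] = [home_site]
    return plan


# Hypothetical sites and access statistics.
sites = ["eu", "us", "apac"]
access_stats = {
    "orders_eu": ("eu", 900),
    "orders_us": ("us", 850),
    "audit_log": ("eu", 12),
}

plan = plan_partial_replication(access_stats, sites, hot_threshold=500)
```

A production planner would also weigh update frequency and storage cost, since every extra copy must be kept in sync.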

Partial Replication is a very helpful Distributed Database Replication scheme for mobile teams. They can carry relevant and important pieces of data on their laptops and mobile devices and periodically synchronize them with the main server once they are back online.

In addition, some businesses must keep confidential information inside their private network. With Partial Replication, teams can duplicate only the pieces they need and sync them back once their work is complete. Alternatively, they can work on their own datasets, leave the confidential data on the private network, and return to it later to make the necessary changes.

Advantages of Partial Replication

Better database performance.
Duplication of only relevant data fragments ensures better resource utilization and fewer costs.
Increased Data Availability.

Disadvantages of Partial Replication

Keeping all data current and relevant could be a challenge.
Performance may suffer when a user requests less used data fragments.

No Replication

When each database fragment is stored at exactly one site, you have a No Replication situation. Because no fragment has a copy elsewhere, failure of one site might result in users losing access to that site’s data.

Of all the distributed database replication schemes, No Replication is the fastest to set up, since nothing has to be copied or synchronized, but it provides lower data availability and is slow to execute queries when many users request the same data at the same time.
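The availability difference between the schemes can be checked with a small sketch. The allocation maps, fragment names, and site names below are hypothetical; the function simply reports which fragments still have at least one copy on a working site.

```python
def reachable_fragments(allocation, failed_sites):
    """Return the fragments that still have a copy on a working site.

    allocation maps fragment name -> list of sites holding a copy.
    """
    failed = set(failed_sites)
    return {
        fragment
        for fragment, sites in allocation.items()
        if set(sites) - failed
    }


# Hypothetical allocations of two fragments over two sites.
no_replication = {"F1": ["site_a"], "F2": ["site_b"]}
full_replication = {"F1": ["site_a", "site_b"], "F2": ["site_a", "site_b"]}

after_failure_none = reachable_fragments(no_replication, ["site_a"])
after_failure_full = reachable_fragments(full_replication, ["site_a"])
```

With no replication, losing site_a makes F1 unreachable, while under full replication both fragments remain available from site_b.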

Advantages of No Replication

Easy recovery of data.

Disadvantages of No Replication

Slow execution time in case of concurrent user requests.
High risk of data loss, because there are no replicas.

Fragmentation and Replication In Distributed Database: Rule of Thumb

At this point, you might ask: Should I replicate my database?

Fortunately, a rule of thumb for fragmentation and replication in distributed databases helps answer that question. For distributed database replication schemes, it says:

If (Read-only Queries)/(Update Queries) ≥ 1, Data Replication is advantageous.
If (Read-only Queries)/(Update Queries) < 1, Data Replication may cause problems.
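The rule can be written as a small helper. The function name and the returned strings are made up for this sketch, and treating a workload with no update queries as favourable to replication is an added assumption, since the ratio is undefined in that case.

```python
def replication_advice(read_only_queries, update_queries):
    """Apply the read/update ratio rule of thumb to query counts."""
    if update_queries == 0:
        # Assumption: with nothing to propagate, copies cannot diverge.
        return "advantageous"
    ratio = read_only_queries / update_queries
    return "advantageous" if ratio >= 1 else "may cause problems"


# Hypothetical workloads measured over the same period.
reporting_workload = replication_advice(9000, 1000)  # ratio 9.0
ingest_workload = replication_advice(200, 5000)      # ratio 0.04
```

The intuition is that every copy speeds up reads but adds work to each update, so read-heavy workloads benefit and update-heavy workloads pay.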

Conclusion

Today, distributing data has become essential. Cloud databases and distributed computing have accelerated business interest in Distributed Database Management Systems, and Data Allocation is becoming more useful than ever. Data Allocation, as we discussed, comprises Data Fragmentation and Data Replication techniques.

Although neither fragmentation nor replication in distributed databases is a simple process, both can, if applied judiciously, be an effective way to develop better applications and improve database performance.

Thank you for reading! We hope you found our guide to fragmentation and replication in distributed databases informative.

Frequently Asked Questions

1. What is replication in a distributed database?

Replication in a distributed database involves creating and maintaining copies of data across multiple database servers or nodes to ensure high availability, fault tolerance, and load balancing.

2. What is fragmentation in a database?

Fragmentation is the process of dividing a database into smaller, more manageable pieces called fragments or partitions.

3. What are the three types of data fragmentation?

a) Horizontal fragmentation
b) Vertical fragmentation
c) Hybrid fragmentation

Divyansh Sharma Marketing Research Analyst, Hevo Data

Divyansh is a Marketing Research Analyst at Hevo who specializes in data analysis. He is a BITS Pilani Alumnus and has collaborated with thought leaders in the data industry to write articles on diverse data-related topics, such as data integration and infrastructure. The contributions he makes through his content are instrumental in advancing the data industry.
