Keeping data available in real time across many complex networks is one of the biggest challenges that organizations running such networks face every day. Robust data warehouses with features such as easy data transfer and scalable cloud storage help address this challenge, but keeping up with evolving data needs is still no small feat.

Data replication is one technique that helps users achieve high data availability by maintaining multiple copies of the data across servers and sites, so that users can access it easily across networks.

This article aims to give you in-depth knowledge of data replication and the Cassandra replication architecture, along with a step-by-step guide to help you set up Cassandra Replication seamlessly.

What is Cassandra?

Cassandra is a robust open-source NoSQL database with a distributed, decentralized architecture that delivers high data availability and performance with nearly zero downtime.

Thanks to its distributed design and linear scalability, it lets users handle petabytes of data, store it across numerous cloud-based environments, and carry out hundreds of data operations in parallel. Its internal architecture takes inspiration from Amazon’s Dynamo and Google’s Bigtable data model, and its replication model therefore has no single point of failure.

Key Features of Cassandra

  • Zero-Downtime: Cassandra supports high, real-time data availability through its distributed design. It replicates data across multiple data centers and switches automatically to a healthy node if one fails, so data remains available to users.
  • Scalable: Cassandra scales horizontally, allowing users to add nodes as their data needs grow.
  • High Performance: Because it scales linearly, adding nodes increases throughput while keeping response times fast.
  • Query Support: Cassandra allows users to query data using the Cassandra Query Language (CQL). CQL is very similar to SQL, so users do not need to spend much time getting familiar with it.

What is Data Replication?

Data replication refers to the process of creating and storing multiple copies of data across various locations, thereby boosting the availability of data across a particular network. With replication in place, you can either replicate your entire database or a distinct portion of it, as per your data needs, either for an individual system or across numerous servers.

Cassandra, a peer-to-peer distributed system, allows users to leverage its robust architecture to handle large volumes of data and replicate their data across a complex set of networks.
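To make the idea concrete, here is a minimal Python sketch of replication in general. It is a toy model, not Cassandra's actual partitioner or storage engine: the ToyCluster class, its method names, and the node names are all hypothetical and exist only for illustration. Each key is copied to a fixed number of nodes, so a read can still succeed when one of those nodes is down.

```python
import hashlib


class ToyCluster:
    """Toy model of replication: every key is copied to several nodes."""

    def __init__(self, node_names, replication_factor):
        if not 1 <= replication_factor <= len(node_names):
            raise ValueError("replication_factor must be between 1 and the node count")
        self.node_names = list(node_names)
        self.replication_factor = replication_factor
        self.storage = {name: {} for name in self.node_names}
        self.down = set()  # names of nodes that are currently unreachable

    def replicas_for(self, key):
        # Hash the key to pick a starting node, then take the following
        # nodes in list order until we have `replication_factor` of them.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.node_names)
        return [
            self.node_names[(start + offset) % len(self.node_names)]
            for offset in range(self.replication_factor)
        ]

    def write(self, key, value):
        # Store a copy on every replica that is reachable.
        for name in self.replicas_for(key):
            if name not in self.down:
                self.storage[name][key] = value

    def read(self, key):
        # Return the value from the first reachable replica that has it.
        for name in self.replicas_for(key):
            if name not in self.down and key in self.storage[name]:
                return self.storage[name][key]
        raise KeyError(key)


if __name__ == "__main__":
    cluster = ToyCluster(["node1", "node2", "node3"], replication_factor=2)
    cluster.write("user:42", "Ada")
    first_replica = cluster.replicas_for("user:42")[0]
    cluster.down.add(first_replica)   # simulate a node failure
    print(cluster.read("user:42"))    # served by the remaining replica
```

With a replication factor of 2, the value survives the loss of one replica; losing both replicas of a key makes that key unreadable, which is why the replication factor is chosen with the expected number of failures in mind.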

What are the Components of Cassandra Replication?

Cassandra is a distributed NoSQL database that offers linear scalability, high performance, and high data availability. Its architecture lets it manage large volumes of data spread across multiple data centers.

It has no master nodes and no single point of failure, and hence can support a complex network topology made up of numerous data centers and nodes.

Figure: Cassandra architecture

The Cassandra replication architecture typically consists of the following components:

  • A node that stores the data.
  • A data center that acts as a collection of nodes spread across numerous locations.
  • A cluster that contains multiple data centers.
  • Commit log, memtable, SSTable, and Bloom filter, which support the read and write paths inside each node.
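The way these internal components cooperate can be sketched in a few lines of Python. This is a heavily simplified illustration under stated assumptions, not Cassandra's implementation: the BloomFilter and ToyNode classes, the memtable_limit parameter, and the flush rule are all hypothetical. A write is appended to the commit log, stored in the memtable, and flushed to an immutable SSTable once the memtable is full; a read checks the memtable first and uses each SSTable's Bloom filter to skip tables that cannot contain the key.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, size_bits=256, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ToyNode:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # recent writes held in memory
        self.sstables = []     # flushed tables as (bloom_filter, data) pairs
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Turn the memtable into a sorted table that is never modified again.
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(sorted(self.memtable.items()))))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Check the newest SSTable first; the Bloom filter avoids lookups
        # in tables that definitely do not hold the key.
        for bloom, table in reversed(self.sstables):
            if bloom.might_contain(key) and key in table:
                return table[key]
        return None


if __name__ == "__main__":
    node = ToyNode(memtable_limit=2)
    for key, value in [("a", 1), ("b", 2), ("c", 3)]:
        node.write(key, value)
    print(len(node.sstables), node.read("a"), node.read("c"))
```

The sketch leaves out compaction, tombstones, and commit log recycling, all of which a real node needs.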

An easier alternative to setting up Cassandra Replication: Hevo Data

Hevo Data, a No-code Data Pipeline, can help you replicate data from Cassandra (among 100+ sources) swiftly to a database/data warehouse of your choice. Hevo is fully managed and completely automates the process of monitoring and replicating the changes on the secondary database, so you do not have to write the code repeatedly. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Hevo provides you with a truly efficient and fully-automated solution to replicate and manage data in real-time and always have analysis-ready data in your desired destination. It allows you to focus on key business needs and perform insightful analysis using BI tools. 

Have a look at the amazing features of Hevo:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to export. 
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Completely Managed Platform: Hevo is fully managed. You need not invest time and effort to maintain or monitor the infrastructure involved in executing codes.

Prerequisites

  • Working knowledge of Cassandra.
  • A general idea about data replication.

What are the Steps to set up Cassandra Replication?

To set up Cassandra Replication, you need three separate nodes, each running Cassandra in its own terminal session, before you join them into a Cassandra cluster.

You can implement Cassandra Replication successfully using the following steps:

Step 1: Installing Java and Cassandra

To start replicating your Cassandra data across nodes, you first have to install Java on your system. To do this, you can use the following commands:

root@ubuntu:~# apt-get update
 
root@ubuntu:~# apt-get install default-jdk
Setting up default-jdk (2:1.8-56ubuntu2) ...
Setting up gconf-service-backend (3.2.6-3ubuntu6) ...
Setting up gconf2 (3.2.6-3ubuntu6) ...
Setting up libgnomevfs2-common (1:2.24.4-6.1ubuntu1) ...
Setting up libgnomevfs2-0:amd64 (1:2.24.4-6.1ubuntu1) ...
Setting up libgnome2-common (2.32.1-5ubuntu1) ...
Setting up libgnome-2-0:amd64 (2.32.1-5ubuntu1) ...
Processing triggers for libc-bin (2.23-0ubuntu3) ...
Processing triggers for ureadahead (0.100.0-19) ...
Processing triggers for systemd (229-4ubuntu4) ...
Processing triggers for ca-certificates (20160104ubuntu1) ...
Updating certificates in /etc/ssl/certs...
0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
 
done.
done.
root@ubuntu:~# java -version
openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-0ubuntu4~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

Once you’ve installed Java on your system, create a dedicated cassandra group and user that will own and run the Cassandra installation:

root@ubuntu:~# groupadd cassandra
root@ubuntu:~# useradd -d /home/cassandra -s /bin/bash -m -g cassandra cassandra
 
root@ubuntu:~# grep cassandra /etc/passwd
cassandra:x:1000:1000::/home/cassandra:/bin/bash

Next, download the Cassandra 3.6 binary tarball and extract it into the cassandra user's home directory. The output below lets you track the progress of the download:

root@ubuntu:/tmp# wget http://mirror.cc.columbia.edu/pub/software/apache/cassandra/3.6/apache-cassandra-3.6-bin.tar.gz
--2016-06-12 08:36:47-- http://mirror.cc.columbia.edu/pub/software/apache/cassandra/3.6/apache-cassandra-3.6-bin.tar.gz
Resolving mirror.cc.columbia.edu (mirror.cc.columbia.edu)... 128.59.59.71
Connecting to mirror.cc.columbia.edu (mirror.cc.columbia.edu)|128.59.59.71|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35552323 (34M) [application/x-gzip]
Saving to: ‘apache-cassandra-3.6-bin.tar.gz’
 
apache-cassandra-3.6-bin.tar.gz 100%[===================================================================>] 33.91M 6.43MB/s in 12s
 
2016-06-12 08:37:01 (2.93 MB/s) - ‘apache-cassandra-3.6-bin.tar.gz’ saved [35552323/35552323]
 
root@ubuntu:/tmp# tar -xvf apache-cassandra-3.6-bin.tar.gz -C /home/cassandra --strip-components=1

Step 2: Setting up Cassandra Nodes

With Cassandra now installed on your system, you can start a node for your Cassandra instance. To do this, run the following command as the cassandra user from its home directory:

cassandra@ubuntu:~$ sh bin/cassandra

You will now be able to see the following output on your screen indicating that your Cassandra server is up & running and can be associated with a cluster:

INFO 09:10:39 Cassandra version: 3.6
INFO 09:10:39 Thrift API version: 20.1.0
INFO 09:10:39 CQL supported versions: 3.4.2 (default: 3.4.2)
INFO 09:10:39 Initializing index summary manager with a memory pool size of 24 MB and a resize interval of 60 minutes
INFO 09:10:39 Starting Messaging Service on localhost/127.0.0.1:7000 (lo)
INFO 09:10:39 Loading persisted ring state
INFO 09:10:39 Starting up server gossip
INFO 09:10:39 Updating topology for localhost/127.0.0.1
INFO 09:10:39 Updating topology for localhost/127.0.0.1
INFO 09:10:39 Node localhost/127.0.0.1 state jump to NORMAL

Once the node is running, you can check the status of your cluster using the nodetool status command:

root@ubuntu:/home/cassandra# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 142.65 KiB 256 100.0% fc76be14-acde-47d4-a4a2-5d015804bb3c rack1

The two-letter prefix combines status and state: UN means the node is Up and Normal.
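If you want to check this from a script rather than by eye, the following Python sketch parses the text printed by nodetool status. The function names are hypothetical, and the parser assumes the column layout shown above, where each node line starts with a two-letter status/state code followed by the node's address.

```python
import re

# A node line starts with a status letter (U or D) and a state letter
# (N, L, J or M), followed by whitespace and the node address.
NODE_LINE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def parse_node_states(status_output):
    """Map each node address to its two-letter status/state code."""
    states = {}
    for line in status_output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            code, address = match.groups()
            states[address] = code
    return states


def all_up_and_normal(status_output, expected_nodes):
    """True when exactly `expected_nodes` nodes are listed and all are UN."""
    states = parse_node_states(status_output)
    return len(states) == expected_nodes and all(
        code == "UN" for code in states.values()
    )


if __name__ == "__main__":
    sample = """Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 142.65 KiB 256 100.0% fc76be14-acde-47d4-a4a2-5d015804bb3c rack1
"""
    print(parse_node_states(sample))
    print(all_up_and_normal(sample, expected_nodes=1))
```

In practice you would feed the function the captured output of the nodetool status command; the sample string here is copied from the listing above.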

Repeat the same steps on each of your servers until you have three running nodes; you need all three before you can set up replication for your Cassandra instance.

This is how you can create a node to set up Cassandra Replication.

Step 3: Building a Cluster in Cassandra

With your Cassandra nodes now set up, you need to join them into a cluster. To do this, first modify the configuration of all three nodes by editing the cassandra.yaml file as follows:

cluster_name: 'Test Cluster'

seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "your-server-ip,your-server-ip-2,your-server-ip-3"

listen_address: your-server-ip

rpc_address: your-server-ip
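
Because every node needs the same cluster name and seed list but its own listen_address and rpc_address, it can help to generate the fragment instead of typing it three times. The Python sketch below is one way to do that; the render_node_settings function and the 10.0.0.x addresses are hypothetical, and the output only covers the settings shown above, not a complete cassandra.yaml.

```python
def render_node_settings(cluster_name, seed_ips, node_ip):
    """Return the cassandra.yaml settings shown above for a single node."""
    if not seed_ips:
        raise ValueError("at least one seed address is required")
    if not node_ip:
        raise ValueError("node_ip must not be empty")
    seeds = ",".join(seed_ips)
    return (
        f"cluster_name: '{cluster_name}'\n"
        "\n"
        "seed_provider:\n"
        "  - class_name: org.apache.cassandra.locator.SimpleSeedProvider\n"
        "    parameters:\n"
        f"      - seeds: \"{seeds}\"\n"
        "\n"
        f"listen_address: {node_ip}\n"
        "\n"
        f"rpc_address: {node_ip}\n"
    )


if __name__ == "__main__":
    # Placeholder addresses; replace them with the IPs of your own servers.
    node_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    for ip in node_ips:
        print(f"# settings for {ip}")
        print(render_node_settings("Test Cluster", node_ips, ip))
```

The cluster name and the seed list are identical on every node; only the two address settings change from one node to the next.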

Once you’ve made the necessary changes in the cassandra.yaml file, switch the endpoint snitch to GossipingPropertyFileSnitch, turn off auto-bootstrap, and make sure the data center name in cassandra-rackdc.properties matches the name your nodes report (datacenter1 here). You can use the following commands for this:

sed -i 's/endpoint_snitch: SimpleSnitch/endpoint_snitch: GossipingPropertyFileSnitch/g' ~/conf/cassandra.yaml
 
echo 'auto_bootstrap: false' >> ~/conf/cassandra.yaml
 
sed -i 's/dc=dc1/dc=datacenter1/g' ~/conf/cassandra-rackdc.properties
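
For readers who prefer to script these edits, here is a rough Python equivalent that works on the file contents as strings. The function names are hypothetical. Unlike the echo command above, which appends a new line every time it runs, this sketch adds auto_bootstrap only when the setting is not already present, so running it twice is harmless.

```python
def apply_cluster_settings(yaml_text):
    """Apply the snitch and auto_bootstrap changes to cassandra.yaml text."""
    updated = yaml_text.replace(
        "endpoint_snitch: SimpleSnitch",
        "endpoint_snitch: GossipingPropertyFileSnitch",
    )
    # Only append auto_bootstrap when no uncommented setting exists yet.
    has_bootstrap = any(
        line.strip().startswith("auto_bootstrap:")
        for line in updated.splitlines()
    )
    if not has_bootstrap:
        if updated and not updated.endswith("\n"):
            updated += "\n"
        updated += "auto_bootstrap: false\n"
    return updated


def set_datacenter_name(rackdc_text, old="dc1", new="datacenter1"):
    """Rename the data center in cassandra-rackdc.properties text."""
    return rackdc_text.replace(f"dc={old}", f"dc={new}")


if __name__ == "__main__":
    yaml_text = "cluster_name: 'Test Cluster'\nendpoint_snitch: SimpleSnitch\n"
    print(apply_cluster_settings(yaml_text))
    print(set_datacenter_name("dc=dc1\nrack=rack1\n"))
```

To use it on a real node, read ~/conf/cassandra.yaml and ~/conf/cassandra-rackdc.properties into strings, pass them through these functions, and write the results back, ideally after taking a backup of both files.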

To bring the changes into effect, restart Cassandra on each node: stop the running Cassandra process, then start it again with the bin/cassandra command from Step 2.

Figure: Restarting nodes in Cassandra

Once you’ve restarted your Cassandra nodes, connect to a node from your console using the cqlsh command, passing the node's IP address and the CQL port (9042 by default):

cqlsh ip.addr.of.node 9042

Figure: Connecting to cluster nodes using the cqlsh command

Step 4: Setting up Cassandra Replication

Once you’re connected to the cluster, you can start replicating your Cassandra data. To do this, create a keyspace and choose its replication strategy: SimpleStrategy, which takes a single replication factor and suits a single data center, or NetworkTopologyStrategy, which sets a replica count per data center. The following statement creates a keyspace with three replicas in datacenter1:

CREATE KEYSPACE linoxide WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1' : 3};
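
If you create keyspaces from a script, a small helper can build the statement for either strategy. The Python sketch below only assembles the CQL text and does not talk to Cassandra; the keyspace_cql function and its parameters are hypothetical, and the name check is deliberately stricter than what CQL itself accepts.

```python
import re

# Simplifying assumption: names are plain identifiers (letters, digits, "_").
IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def keyspace_cql(name, replicas_per_dc=None, replication_factor=None):
    """Build a CREATE KEYSPACE statement for one of the two strategies."""
    if not IDENTIFIER.match(name):
        raise ValueError(f"unsupported keyspace name: {name!r}")
    if (replicas_per_dc is None) == (replication_factor is None):
        raise ValueError("pass exactly one of replicas_per_dc or replication_factor")

    if replicas_per_dc is not None:
        # NetworkTopologyStrategy: one replica count per data center.
        if not replicas_per_dc:
            raise ValueError("replicas_per_dc must not be empty")
        options = ["'class': 'NetworkTopologyStrategy'"]
        for dc_name, count in replicas_per_dc.items():
            if not IDENTIFIER.match(dc_name) or int(count) < 1:
                raise ValueError(f"invalid entry: {dc_name!r}: {count!r}")
            options.append(f"'{dc_name}': {int(count)}")
    else:
        # SimpleStrategy: a single replication factor for the whole cluster.
        if int(replication_factor) < 1:
            raise ValueError("replication_factor must be at least 1")
        options = [
            "'class': 'SimpleStrategy'",
            f"'replication_factor': {int(replication_factor)}",
        ]
    return f"CREATE KEYSPACE {name} WITH replication = {{{', '.join(options)}}};"


if __name__ == "__main__":
    print(keyspace_cql("linoxide", replicas_per_dc={"datacenter1": 3}))
    print(keyspace_cql("demo", replication_factor=1))
```

The first call produces a statement equivalent to the one above; the second shows the SimpleStrategy form for comparison. The keyspace name demo is a placeholder.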

With the keyspace created, you can verify its replication settings by querying the system_schema.keyspaces table. The SELECT statement only reads the settings; Cassandra places the replicas itself according to the keyspace definition:

SELECT * FROM system_schema.keyspaces;
 keyspace_name | durable_writes | replication

You will now be able to see the following output on your screen, showing that the linoxide keyspace keeps three replicas in datacenter1:

linoxide | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '3'}
system_auth | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
system_schema | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
system_distributed | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
system | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
system_traces | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}

Once the keyspace is in place, you can use the nodetool status command to check the status of the nodes in your data center as follows:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 45.33.77.49 250.7 KiB 256 100.0% 34689c1e-939c-4bd3-8774-ac4534880744 rack1
UN 45.56.109.42 188.02 KiB 256 100.0% 7542e062-d6d3-473a-b79c-4f5e11547c1f rack1
UN 45.33.69.15 236.58 KiB 256 100.0% 2f10690c-1e6e-4297-bda6-c3fb36279495 rack1
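
The 100.0% in the Owns (effective) column follows from the numbers used in this guide: with three replicas and three nodes, every node holds a copy of every row. As a rough rule, assuming a single data center and evenly distributed tokens, each node's share is the replication factor divided by the node count, capped at 100%. The effective_ownership function below is a hypothetical helper that captures just this arithmetic.

```python
def effective_ownership(replication_factor, node_count):
    """Approximate fraction of the data each node stores.

    Assumes one data center and evenly distributed tokens.
    """
    if replication_factor < 1 or node_count < 1:
        raise ValueError("replication_factor and node_count must be at least 1")
    return min(1.0, replication_factor / node_count)


if __name__ == "__main__":
    # The cluster from this guide: 3 replicas on 3 nodes.
    print(f"{effective_ownership(3, 3):.1%}")
    # The same keyspace on a hypothetical 6-node cluster.
    print(f"{effective_ownership(3, 6):.1%}")
```

Real clusters rarely balance perfectly, so the figures nodetool reports can differ somewhat from this estimate.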

This is how you can set up Cassandra Replication by adding replica nodes to your Cassandra cluster.

Conclusion

This article showed you how to set up Cassandra Replication and introduced the concepts behind it, so that you can use them to carry out data replication and recovery efficiently. These methods, however, can be challenging, especially for a beginner, and this is where Hevo saves the day.

Hevo Data, a No-code Data Pipeline, can help you replicate data in real time without writing any code. Hevo, being a fully managed system, provides a highly secure automated solution that helps you perform replication in just a few clicks using its interactive UI.

Give Hevo a shot! Sign Up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. Check out the pricing details to get a better understanding of which plan suits you the most.

Why don’t you share your experience of setting up Cassandra Replication in the comments? We would love to hear from you!

Aman Sharma
Freelance Technical Content Writer, Hevo Data

Driven by a problem-solving approach and guided by analytical thinking, Aman loves to help data practitioners solve problems related to data integration and analysis through his extensively researched content pieces.
