Apache Hadoop and its MapReduce programming model kicked off the big data revolution back in 2006. The core of the offering was a distributed file system abstraction built over commodity hardware and a framework for processing this distributed data with massively parallel processing. But there was one caveat: setting up a Hadoop cluster needed a large initial investment in infrastructure and maintenance. That changed when completely managed services based on distributed computing concepts came into the picture, which meant an organization could rent the infrastructure and software and pay only for what it used, without worrying about the large initial investment. In this article on Amazon Redshift vs Hadoop, we compare Apache Hadoop with one of the most popular completely managed service offerings in this space: Amazon Redshift. Redshift is a data warehouse offered as a cloud service with a Postgres-compatible querying layer.
Amazon Redshift Vs Hadoop: Features
Amazon Redshift is a completely managed data warehouse service offered by Amazon. All activities related to database maintenance and infrastructure management are handled by Amazon, so users are effectively renting Amazon’s software and hardware. Redshift provides a ‘pay as you go’ pricing model in which customers pay only for the storage space they use and the processing power they consume. Customers can choose between dense compute and dense storage instance types to get the best configuration that fits their budget.
Redshift offers a Postgres-based querying layer that returns results quickly even when a query spans millions of rows. Architecturally, Redshift is a cluster of nodes in which one acts as the leader node and the others act as compute nodes. The leader node manages client communication, creates execution plans for queries, and assigns tasks to the compute nodes. You can get more information on Redshift architecture in our detailed blog post here.
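The split of work between a leader node and compute nodes can be illustrated with a toy, in-memory sketch. This is not how Redshift is implemented internally; the function names and the round-robin row distribution are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute_rows(rows, num_compute_nodes):
    """Leader step: split rows across compute nodes round-robin (assumed policy)."""
    slices = [[] for _ in range(num_compute_nodes)]
    for index, row in enumerate(rows):
        slices[index % num_compute_nodes].append(row)
    return slices


def compute_node_partial_sum(node_rows):
    """Compute-node step: aggregate only the rows held by this node."""
    return sum(node_rows)


def leader_node_sum(rows, num_compute_nodes=3):
    """Leader step: fan the work out, then merge the partial results."""
    slices = distribute_rows(rows, num_compute_nodes)
    with ThreadPoolExecutor(max_workers=num_compute_nodes) as pool:
        partials = list(pool.map(compute_node_partial_sum, slices))
    return sum(partials)
```

Each compute node only ever sees its own slice, and the leader combines small partial results rather than raw rows, which is the general idea behind parallel aggregation in a cluster.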
Redshift can be scaled by adding more nodes, upgrading existing nodes or both. Redshift clusters based on newer generation nodes that come with elastic resize can scale in a matter of minutes with very short downtimes.
Redshift has a feature called Redshift Spectrum that lets customers use Redshift’s computing engine to process data stored outside the Redshift database, in Amazon S3.
Apache Hadoop consists of two frameworks that do the heavy lifting –
- The Hadoop Distributed File System (HDFS), which serves as an abstraction layer over the filesystems of the computers that are part of the cluster. It lets users handle them as a single file system.
- The MapReduce framework – A programming paradigm that processes distributed data using map and reduce functions. Map functions are typically transformations, and reduce functions express the aggregation logic.
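The map and reduce roles described above can be sketched with a single-process word count. This is only an illustration of the paradigm in plain Python, not Hadoop's API; the function names are hypothetical, and the explicit shuffle step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: transform one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by key between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

On a real cluster, the map calls run on the nodes that hold each block of input and the reduce calls run in parallel per key, but the shape of the logic is the same.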
Architecturally, Hadoop is a cluster of computers running five Hadoop daemon services that handle all activities related to keeping track of data and managing the cluster. The five daemons and their activities are listed below.
- NameNode – In a basic setup, only one node in the cluster runs this service. It holds all the metadata about data blocks, their locations, and the locations of their replicas.
- Secondary NameNode – This node keeps track of the checkpoints of the NameNode and, in the event of a NameNode failure, can be used to recreate it.
- DataNode – DataNodes are the worker nodes that actually store the data and execute read and write operations for clients.
- NodeManager – The NodeManager is part of YARN, which acts as the cluster manager for all the resources in the cluster. The NodeManager is responsible for the individual resource containers that execute the map and reduce tasks.
- ResourceManager – The ResourceManager is also part of YARN and acts as the master authority that schedules applications and manages their lifecycle. It is also responsible for resource allocation.
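To make the NameNode's bookkeeping role more concrete, here is a minimal sketch of the kind of block-to-node catalog it maintains. The defaults of 128 MB blocks and three replicas match common HDFS defaults, but the round-robin placement and the function name are illustrative assumptions; real HDFS placement is rack-aware.

```python
import math


def plan_block_placement(file_size_bytes, data_nodes,
                         block_size_bytes=128 * 1024 * 1024, replication=3):
    """Return {block_index: [data_node, ...]} as a NameNode-like catalog might record it.

    Round-robin placement is an illustrative assumption, not the HDFS policy.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor cannot exceed the number of data nodes")
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    placement = {}
    for block_index in range(num_blocks):
        # Each replica of a block lands on a different data node.
        placement[block_index] = [
            data_nodes[(block_index + replica) % len(data_nodes)]
            for replica in range(replication)
        ]
    return placement
```

With a catalog like this, losing one data node never loses a block, because every block still has replicas recorded on other nodes.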
Scaling in Hadoop is accomplished by adding more nodes or upgrading nodes – which in turn adds more processing power and more storage space.
What Hadoop brought to the ecosystem was the ability to handle very large volumes of data using commodity hardware, and this led to the development of many frameworks that build on HDFS and MapReduce. Many databases were built on Hadoop that use HDFS to store their data and provide a querying layer that converts SQL to MapReduce code. A notable one is Hive, which gained popularity as a reliable data warehouse and metadata management system. NoSQL databases like HBase were also built using HDFS as the storage layer. Processing frameworks like Spark use Hadoop only as the data storage mechanism and provide an alternative execution engine in place of the MapReduce framework. In short, Hadoop by itself provides just the file system and a processing mechanism; the applications built on these features provide the real ability to use it as a database or data warehouse.
Amazon Redshift Vs Hadoop: Scaling
Scaling in Hadoop is done by adding more nodes or upgrading nodes. Both actions increase the storage capacity and processing power of the cluster. Adding or upgrading nodes involves purchasing new hardware, installing the OS and required libraries, adding the nodes to the running cluster, changing configurations to reflect them, and then managing the rebalancing of data across the new nodes. This is not a simple process and needs an expert cluster administrator to execute.
Redshift can also scale by adding more nodes or upgrading nodes. In Redshift’s case, this takes a few clicks and some waiting while AWS redistributes the data to the new nodes. The downtime is minimal, in the range of minutes for a cluster that supports the elastic resize feature.
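The same resize can also be requested programmatically. The sketch below builds the arguments for the `resize_cluster` call of boto3's Redshift client; the cluster identifier used in the usage note is hypothetical, and actually sending the request assumes boto3 is installed and AWS credentials are configured.

```python
def build_resize_request(cluster_id, number_of_nodes, node_type=None, classic=False):
    """Build keyword arguments for boto3's Redshift `resize_cluster` call."""
    if number_of_nodes < 1:
        raise ValueError("number_of_nodes must be at least 1")
    request = {
        "ClusterIdentifier": cluster_id,
        "NumberOfNodes": number_of_nodes,
        "Classic": classic,  # False asks for an elastic resize where supported
    }
    if node_type is not None:
        request["NodeType"] = node_type
    return request


def resize_cluster(cluster_id, number_of_nodes, node_type=None):
    """Send the resize request; needs boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("redshift")
    return client.resize_cluster(
        **build_resize_request(cluster_id, number_of_nodes, node_type)
    )
```

For example, `resize_cluster("analytics-cluster", 4)` would ask AWS to resize a cluster with that (made-up) identifier to four nodes.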
Amazon Redshift Vs Hadoop: Storage Capacity
There are virtually no limits to scaling a Hadoop cluster if you have the right engineering skillset.
You can scale to any amount of storage space if you have the ability to keep adding hardware and manage the node integration process.
Redshift can only scale up to 2 PB, but it is very unlikely that customers will hit that theoretical limit.
Amazon Redshift Vs Hadoop: Data Replication
Copying data to and from the Hadoop cluster is done using the HDFS put and get shell commands. There is also a distributed copy tool (DistCp) that can be used to copy data from one HDFS cluster to another. In the usual case, Hadoop is used along with other big data applications built on HDFS, like Hive or HBase. In such cases, copying data is done through application-specific commands.
Copying to Redshift is done by first copying the data to S3 and then using the COPY command. This can get complex if the target Redshift table already has data; in such cases, a staging table needs to be used. Another way to accomplish this is to use AWS services like AWS Data Pipeline, which has built-in templates to handle data migration with Redshift as a target. The only caveat is that such services are mostly designed to fit in the AWS ecosystem and do not do a good job when the source or target is outside AWS. An elegant way to overcome this is to use an ETL tool like Hevo, which handles data migration across various source-target combinations and can provide a one-stop solution for data migration requirements.
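As a rough sketch, the shell commands mentioned above can be assembled and run from Python as follows. The helper names are hypothetical, and running the commands assumes a machine with the Hadoop client tools on its PATH.

```python
import subprocess


def hdfs_put_command(local_path, hdfs_path):
    """Command to copy a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]


def hdfs_get_command(hdfs_path, local_path):
    """Command to copy a file from HDFS to the local file system."""
    return ["hdfs", "dfs", "-get", hdfs_path, local_path]


def distcp_command(source_uri, target_uri):
    """Command to copy data between clusters with the distributed copy tool."""
    return ["hadoop", "distcp", source_uri, target_uri]


def run(command):
    """Execute a command; raises CalledProcessError if the command fails."""
    return subprocess.run(command, check=True)
```

Passing each command as a list of arguments, rather than one shell string, avoids quoting problems when paths contain spaces.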
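The staging-table approach usually follows a load, delete, insert pattern. The sketch below only generates the SQL statements; the table name, key column, S3 path and IAM role ARN in the usage note are hypothetical placeholders, and the CSV format is an assumption about the source files.

```python
def staging_upsert_statements(target_table, key_column, s3_path, iam_role_arn):
    """Return the SQL statements for loading S3 data through a staging table.

    Identifiers are interpolated directly, so only pass trusted values.
    """
    staging_table = f"{target_table}_staging"
    return [
        # 1. Create an empty staging table with the target's structure.
        f"CREATE TEMP TABLE {staging_table} (LIKE {target_table});",
        # 2. Bulk load the new data from S3 into the staging table.
        f"COPY {staging_table} FROM '{s3_path}' IAM_ROLE '{iam_role_arn}' FORMAT AS CSV;",
        # 3. Replace matching rows in the target inside one transaction.
        "BEGIN;",
        f"DELETE FROM {target_table} USING {staging_table} "
        f"WHERE {target_table}.{key_column} = {staging_table}.{key_column};",
        f"INSERT INTO {target_table} SELECT * FROM {staging_table};",
        "COMMIT;",
        # 4. Clean up.
        f"DROP TABLE {staging_table};",
    ]
```

For instance, `staging_upsert_statements("sales", "order_id", "s3://example-bucket/sales/", "arn:aws:iam::123456789012:role/ExampleRedshiftRole")` yields seven statements that can be executed in order through any Postgres-compatible client connected to the cluster.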
Hevo – The Easiest way to Move Data to Redshift
Hevo Data helps you move data from a wide array of data sources to Redshift in real-time. Hevo facilitates this over an intuitive point and click interface without you having to write a single line of code. This will ensure your data is reliably moved to Redshift in just a few minutes.
Hevo comes with power-packed features that let you do both ETL and ELT, map schema automatically, monitor your data load at a granular level and more.
Amazon Redshift Vs Hadoop: Pricing
The cost of a Hadoop cluster includes the cost of hardware and cost of the engineering skill set required in managing the cluster. Most of the Hadoop based software is typically open source and can be used free of cost if there is no need for external support to maintain them. It is very tough to attach a number to the cost of a Hadoop cluster since it would depend on the type of hardware and the capacity of the cluster that is required.
Redshift pricing depends on the instance type chosen by the customer. The lowest-priced instance type is a dense compute instance, which starts at $0.25 per hour and comes with SSDs. For those who need more storage capacity, Redshift offers dense storage instances backed by HDDs; the lowest-specification dense storage instance is charged at $0.85 per hour. Redshift users are eligible for one hour of free concurrency scaling usage for every 24 hours that a cluster stays active. Concurrency scaling is a mechanism that temporarily adds cluster capacity when workloads with higher performance needs are active and removes it afterwards. You can read more on Redshift Pricing here.
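A quick back-of-the-envelope calculation helps turn these rates into a monthly figure. The sketch below treats the two prices quoted above as per-node, per-hour on-demand rates; the dictionary keys and function names are hypothetical, and actual prices vary by region and change over time.

```python
from decimal import Decimal

# Entry-level rates quoted in this article, in USD per node per hour.
HOURLY_RATE_USD = {
    "dense_compute_entry": Decimal("0.25"),
    "dense_storage_entry": Decimal("0.85"),
}


def on_demand_cost(instance_class, node_count, hours):
    """Cost of running `node_count` nodes for `hours` hours."""
    return HOURLY_RATE_USD[instance_class] * node_count * hours


def free_concurrency_scaling_hours(cluster_hours):
    """One free concurrency scaling hour is earned per 24 hours of cluster uptime."""
    return cluster_hours // 24
```

Under these assumptions, a four-node entry-level dense compute cluster running for a 30-day month (720 hours) costs 0.25 × 4 × 720 = 720 USD and earns 30 free concurrency scaling hours.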
Amazon Redshift Vs Hadoop: Performance
Redshift is designed as a data warehouse and not as generic data storage. This means it performs very well on complex queries spanning millions of rows. That said, in cases where the data size or the range of data being queried approaches petabytes, Hadoop has a reputation for being the faster one.
Quantifying Hadoop’s performance is a daunting task since many kinds of applications use Hadoop as their storage layer. The combination of HDFS and Spark has the reputation of offering the best performance among the available applications. Since Spark also provides an SQL execution layer, it can very well act as a data warehouse on its own. In the well-known TeraSort benchmark, the HDFS-Spark combination sorted 100 TB of data in about 23 minutes.
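To put the benchmark figure in perspective, the implied aggregate throughput can be worked out directly. This is simple arithmetic on the numbers quoted above, assuming decimal units (1 TB = 1000 GB); the function name is hypothetical.

```python
def sort_throughput(terabytes, minutes):
    """Return (TB per minute, GB per second) for a sort of the given size and duration."""
    tb_per_minute = terabytes / minutes
    gb_per_second = terabytes * 1000 / (minutes * 60)
    return tb_per_minute, gb_per_second


tb_per_minute, gb_per_second = sort_throughput(100, 23)
print(f"{tb_per_minute:.2f} TB/min, {gb_per_second:.1f} GB/s")
```

That works out to a little over 4 TB sorted per minute across the whole cluster.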
Amazon Redshift Vs Hadoop: Data structure
Redshift is a columnar database optimized for working with complex queries that span millions of rows. Redshift arranges the data in a table format and supports most constructs conforming to the Postgres standard.
This is very different from Hadoop, which is just a storage layer and stores data only as files without considering any underlying structure in the data. There are data warehouse systems built over Hadoop that can do this. An example is Hive, which has a very comprehensive SQL layer and understands the data structure of files stored in HDFS.
Neither Redshift nor Hive enforces unique key constraints or referential integrity. Hence it is the responsibility of the source systems to ensure this.
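Since the warehouse will not reject duplicates or dangling references, checks like the following are often run on the source side before loading. This is a minimal in-memory sketch with hypothetical function names; it assumes rows are dictionaries and that the last row seen for a key is the one to keep.

```python
def deduplicate_by_key(rows, key):
    """Keep the last row seen for each key value, in first-seen key order."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())


def find_orphans(child_rows, foreign_key, parent_rows, parent_key):
    """Return child rows whose foreign key has no matching parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]
```

Running `deduplicate_by_key` before a load stands in for a unique key constraint, and an empty result from `find_orphans` stands in for a referential integrity check.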
Amazon Redshift Vs Hadoop: Use cases
Now that we have analyzed Amazon Redshift vs Hadoop on the basis of factors that determine their value in the ETL pipeline, let’s discuss some of the specific use cases, where choosing one over the other makes more business sense.
Amazon Redshift Use cases
- You want a data warehouse and not a general platform that can support multiple applications.
- You do not want to make an initial investment to set up the cluster.
- You are averse to maintaining a data warehouse system on your own and want to focus only on the business logic.
- Your data volume is not expected to grow into PBs and you are happy with great performance on terabytes of data.
- Your business does not demand multiple kinds of databases and if at all the need arises, you are fine with using another completely managed service for your further database needs.
Hadoop Use cases
- You want a complete big data platform and your application demands the use of multiple big data frameworks on the cluster.
- You have a strong infrastructure administration team that can handle all the maintenance activities related to the cluster.
- You are fine with renting only hardware from a cloud service provider or have enough financial bandwidth to set up your own on-premise clusters.
- You anticipate your data volume growing to multiple PBs and want fast performance even when queries range over PBs of data.
- You want a very big file system storage that can accommodate any kind of data.
- You do not want to pay for software and have strong software engineering skills to handle all the applications installed on the cluster and their updates.
Are there any other factors that you need us to compare between Amazon Redshift vs Hadoop? Let us know in the comments.