More data has been created in the past two years than in all of prior human history. With data volumes exploding, organizations are looking for data warehouse solutions that deliver on performance, cost, security, and durability. In response, many companies have released data warehousing products.
Amazon’s cloud computing platform, Amazon Web Services (AWS), launched Amazon Redshift, an enterprise-level, petabyte-scale, fully managed data warehousing service. Druid, on the other hand, is an open-source data warehouse used primarily for business intelligence and designed for queries on both historical and real-time data.
In this article on Amazon Redshift Vs Druid, we compare the two data warehouses in detail, covering differences in structure, performance, architecture, and scalability. This should help organizations decide which of the two is more suitable for their needs.
Because Redshift is a fully managed service from AWS, you can get started in just a few steps. Additionally, Redshift takes the maintenance burden off the user.
Since Druid is open source, setting up is a slightly longer process. You will have to take complete ownership of monitoring and maintaining the deployment.
Redshift is ANSI SQL compatible and works well with Business Intelligence tools.
In contrast, Druid has limited SQL capabilities and the query parser is based on Apache Calcite. As a result, Druid may not seamlessly integrate with your BI tool.
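As an illustration of how a client or BI tool would talk to Druid SQL, the sketch below builds an HTTP POST request for Druid's SQL endpoint (`/druid/v2/sql`), which accepts a JSON body with a `query` field. The host and port, the `wikipedia` datasource, and the `channel` column are assumptions made for the example; the request is only constructed, not sent.

```python
import json
import urllib.request

# Assumption: a Druid Router or Broker is reachable at this hypothetical address.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"


def build_druid_sql_request(sql: str, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST request carrying a Druid SQL query."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical datasource and column; __time is Druid's built-in timestamp column.
request = build_druid_sql_request(
    "SELECT channel, COUNT(*) AS edits "
    "FROM wikipedia "
    "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY "
    "GROUP BY channel ORDER BY edits DESC LIMIT 5"
)
# Against a live cluster, the query would be sent with:
#   urllib.request.urlopen(request).read()
```

A BI tool without a native Druid connector would need an equivalent HTTP or JDBC bridge, which is where the integration friction mentioned above tends to show up.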
Both Redshift and Druid are highly scalable and can scale to petabytes of data. Because Redshift is a managed service, scaling operations are supported out of the box through a UI or a CLI.
As a self-managed open-source database, Druid may demand a higher level of maintenance.
Both Redshift and Druid are columnar databases, which results in highly optimized storage, especially for wide rows.
Redshift stores individual rows of data, whereas Druid maintains measures aggregated over all combinations of dimensions, thereby losing the identity of individual rows. This difference makes Redshift the preferred choice when access to individual rows is required. For OLAP-style queries and cubes, however, Druid gives better performance.
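The aggregation described above can be sketched in a few lines. This is a simplified illustration of rolling up measures by dimension combination, not Druid's actual implementation, and the event data and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw ad events: (hour, publisher, country, clicks)
raw_events = [
    ("2024-01-01T10", "siteA", "US", 1),
    ("2024-01-01T10", "siteA", "US", 1),
    ("2024-01-01T10", "siteB", "DE", 1),
    ("2024-01-01T11", "siteA", "US", 1),
]


def roll_up(events):
    """Sum the measure for every distinct combination of dimension values."""
    totals = defaultdict(int)
    for hour, publisher, country, clicks in events:
        totals[(hour, publisher, country)] += clicks
    return dict(totals)


rolled = roll_up(raw_events)
for dims, clicks in sorted(rolled.items()):
    print(dims, clicks)
```

Four raw events collapse into three aggregated rows. The two events that share the same hour, publisher, and country can no longer be told apart, which is the trade-off between storage efficiency and row-level access.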
Redshift doesn’t support primary or secondary indexes. It relies on data partitioning, sorting and MPP (Massively Parallel Processing) to speed up query execution.
Druid, on the other hand, relies heavily on indexes to speed up queries.
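To make the indexing idea concrete, here is a minimal sketch of a bitmap index, the kind of structure Druid builds for dimension columns. It uses plain Python integers as bitmaps rather than the compressed bitmaps a real system would use, and the column values are hypothetical.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose set bits mark the matching row numbers."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def matching_rows(bitmap):
    """Return the row numbers whose bit is set in the bitmap."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


# Hypothetical dimension columns for six rows.
country = ["US", "DE", "US", "FR", "DE", "US"]
device = ["ios", "ios", "web", "web", "ios", "ios"]

country_index = build_bitmap_index(country)
device_index = build_bitmap_index(device)

# WHERE country = 'US' AND device = 'ios' becomes a bitwise AND of two bitmaps.
hits = matching_rows(country_index["US"] & device_index["ios"])
```

The filter is answered from the index alone, without scanning the column values, which is why index-heavy designs do well on selective filters.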
Real Time Data Ingestion
Unlike Druid, which is designed to query real-time data, Redshift does not support stream ingestion. The prescribed method to ingest data into Redshift is to load micro-batches using the COPY command. Alternatively, you can use Hevo to bring data from any source into Redshift in real time.
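A micro-batch loader might look roughly like the sketch below: incoming records are grouped into small batches, each batch is staged as a file, and one COPY statement is issued per file. The table name, bucket, and IAM role ARN are placeholders, and the steps of uploading files to S3 and executing the statements are deliberately left out.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield consecutive lists of at most batch_size records."""
    iterator = iter(records)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def copy_statement(table, s3_path, iam_role):
    """Build a Redshift COPY statement for one staged file of JSON records."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS JSON 'auto';"
    )


# Hypothetical names: the table, bucket, and role ARN are placeholders.
events = [{"id": i} for i in range(5)]
batches = list(micro_batches(events, batch_size=2))
statements = [
    copy_statement(
        "events",
        f"s3://example-bucket/staging/batch-{n:04d}.json",
        "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
    )
    for n, _ in enumerate(batches)
]
```

Smaller batches reduce latency but increase the number of COPY operations, so the batch size is a tuning decision rather than a fixed value.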
In Druid, data is stored in segments which are partitioned by time. Scaling up or down does not require any downtime.
On the other hand, Redshift partitions data through hashing. When scaling the cluster up or down, data will be re-hashed across the nodes and this will require some downtime.
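The difference between the two partitioning schemes can be illustrated with a small sketch. It assumes a simple hash-modulo placement, which is an illustration only and not Redshift's actual distribution algorithm; the keys and timestamps are hypothetical.

```python
import zlib


def node_for(key: str, node_count: int) -> int:
    """Assign a row to a node by hashing its distribution key (simple modulo scheme)."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def segment_for(timestamp: str) -> str:
    """Bucket an ISO-8601 timestamp into an hourly segment identifier."""
    return timestamp[:13]


keys = [f"user-{i}" for i in range(10_000)]

# Rows whose owning node changes when a hash-partitioned cluster grows from 4 to 5 nodes.
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
moved_fraction = moved / len(keys)

# A time-based segment identifier depends only on the timestamp, not on the
# number of nodes, so resizing the cluster does not regroup any rows.
segment = segment_for("2024-01-01T10:15:00")
```

Under this modulo scheme, most rows change owner when a node is added (in expectation, about four in five when going from 4 to 5 nodes), so data has to be redistributed. Time-partitioned segments stay intact and only need to be reassigned to nodes.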
Both Druid and Amazon Redshift boast a versatile clientele across industry verticals such as Finance, Healthcare, Media, and Technology. Philips, Nasdaq, Pinterest, Amazon, Coursera, and Soundcloud are a few customers who have tasted success with Amazon Redshift. Airbnb, Netflix, Appsflyer, and Alibaba are a few of the businesses powered by Druid.
In summary, both data warehouses have distinct sets of strengths and weaknesses.
Druid would be the right choice if your primary use case is evaluating time-series data. Given that Druid was developed by an advertising analytics company, it is also specially geared toward analyzing advertising events such as bid requests, impressions, and clicks over time. Druid stores only aggregated data and hence would not be the best choice if you need to analyze row-level data.
Redshift, on the other hand, is built to accommodate slicing and dicing of large data sets. Its strength lies in allowing users to perform complex data joins and aggregations.
Because Redshift is a managed service with the properties of a traditional relational database, it can be used for a wider range of use cases with minimal maintenance overhead.
Druid, on the other hand, is optimized only for OLAP-style query workloads and may require sophisticated maintenance, making it unsuitable for a wide variety of use cases.