In today’s world, where analytics is the backbone of business progress, many established technology players provide equally good Data Warehousing solutions. In a setup like this, choosing a warehouse without a deep comparison of features and architecture can be a tricky call. This blog aims to help you evaluate two of the most talked-about warehousing solutions currently on the market: Redshift vs Netezza.
This blog compares the two Data Warehouse solutions based on their architecture, use cases, performance capabilities and pricing. By the end of the article, you will have enough data points to choose the right solution for you.
Redshift vs Netezza: Brief Overview
Amazon Redshift is a solution based on the MPP architecture (massively parallel processing). It has a cluster-based architecture and employs a columnar data storage technique to get a high level of performance from the configured system.
- Amazon invested in ParAccel (a California-based company that built database management software for analytics and business intelligence) sometime in mid-2011. Eventually, Amazon went on to build an OLAP-as-a-Service offering on top of it, now called Redshift.
- AWS launched Redshift in 2012 as its initial offering for cloud-based analytics.
- It is also a petabyte-scale data warehouse and analytics solution.
Netezza TwinFin is the advanced analytics and warehousing solution provided by IBM. It has since been rebranded as IBM PureData System for Analytics (PDA).
- It was originally an offering from a company called Netezza, which launched in 1999 and was acquired by IBM in 2010. Ever since, it has been developed as a subsidiary of IBM.
- It is based on the AMPP (asymmetric massively parallel processing) architecture, which has an SMP front end that receives queries from the client and communicates with the MPP backend that does the processing.
- IBM Netezza Analytics’ advanced technology combines data warehousing and in-database analytics into a scalable, high-performance, massively parallel advanced analytic platform that is designed to work with petascale data volumes.
Redshift vs Netezza: Architecture Highlights
When comparing Redshift vs Netezza, one of the primary aspects to consider is the architectural strengths and weaknesses of each. Here is a quick overview.
Amazon Redshift Architecture:
Here are the core components of Redshift’s architecture:
- Redshift is designed to work in a cluster formation. The cluster is the core infrastructure component of AWS Redshift. It runs the Amazon Redshift engine and can contain one or more databases.
- A typical Redshift Cluster has two or more Compute Nodes which are coordinated through a Leader Node. All client applications communicate with the cluster only through the Leader Node.
- Leader Node: This node manages communication with the client applications and the compute nodes. It parses the query sent in by the client and creates a query execution plan to be carried out by the compute nodes.
- Compute Node: These nodes execute the compiled code sent by the leader node and then send back the results for aggregation by the leader node.
- Node Slices: These are the partitions of a compute node. Each slice is allocated a portion of the node's memory and disk space, where it processes its share of the workload assigned to the node. The slices work in parallel to complete an operation.
- Internal Network: Amazon Redshift uses high-bandwidth connections and close proximity to provide secure, high-speed network communication between the leader node and the compute nodes, and among the compute nodes themselves.
- Columnar Data Storage: Redshift stores data in a columnar manner. This drastically reduces the I/O on disks.
- Massively Parallel Processing (MPP): Amazon Redshift architecture allows it to use Massively parallel processing (MPP) for fast query processing. Redshift can process the most complex queries involving large data sets in very little time. In order to maximize parallel processing, many compute nodes execute the same query code on smaller portions of data.
You can read more about Redshift Architecture here.
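The leader node, compute node and slice roles described above can be pictured with a small sketch. This is only an illustration of the scatter-gather idea, not Redshift's actual implementation: the function names, the round-robin split and the use of threads are all assumptions made for the demo.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_partial_sum(rows):
    """Work done by one slice: aggregate only its own portion of the data."""
    return sum(rows)


def leader_sum(values, num_slices=4):
    """Leader role: split the data, run the slices in parallel, merge the results."""
    # Round-robin split across slices (an assumption for this demo only).
    portions = [values[i::num_slices] for i in range(num_slices)]
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        partials = list(pool.map(slice_partial_sum, portions))
    # The final aggregation of partial results happens on the leader.
    return sum(partials)


if __name__ == "__main__":
    sales = list(range(1, 101))
    print(leader_sum(sales))
```

Each slice runs the same code on a smaller portion of the data, and only the small partial results travel back to the leader for the final step.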
Netezza Architecture:
Here are the highlights of Netezza’s architecture:
- Netezza has an AMPP architecture, which pairs an SMP (symmetric multiprocessing) front end with a shared-nothing MPP (massively parallel processing) backend for query processing.
- Netezza architecture resembles Hadoop cluster design in many ways, e.g. distribution, active-passive nodes, data storing methods, replication, etc.
- Netezza is based on PostgreSQL and supports standard SQL, ODBC, JDBC and OLE DB interfaces.
- Netezza has a two-tiered system. The first tier is a simple Linux-based front end, called the SMP host. It mainly receives queries from the client application (often a BI/Analytics application), processes them, and divides them into subqueries or subtasks. These are in turn sent to the second tier, made up of multiple MPP backend units, for parallel processing.
Getting into more details and depth of Netezza would be out of the scope of this blog. You can read more on Netezza’s architecture here.
Redshift vs Netezza – Features that Boost Performance
Amazon Redshift Performance Boosters:
Amazon Redshift employs various techniques and features to improve the overall performance of the system:
- Massively Parallel Processing:
An MPP system processes queries and computations on multiple backend CPUs at once, improving the turnaround time and overall throughput of the system.
- Columnar Data Storage:
Instead of storing each complete row of a table together, Amazon Redshift keeps each column’s values together in their own storage blocks and maintains metadata for each column. That is why it is advised to name only the columns your query actually needs instead of doing a select *.
- Data Compression:
Data is stored in a compressed form, which in turn reduces the storage space and the disk I/O needed to store and retrieve the data.
- Query Optimizer:
Redshift’s Query Optimizer generates MPP-aware query plans that take advantage of Columnar Data Storage. It uses analyzed information about tables to generate efficient query plans for execution. The queries are optimised so that the amount of data that has to be redistributed between nodes is minimal.
- Result Caching:
BI tools often execute exactly the same query again and again, because the business needs the same results on a regular basis to generate a report. In such cases, Redshift returns the results from its cache instead of running the query again.
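To make the result caching idea concrete, here is a minimal sketch of a cache keyed on the exact query text. It is an illustration only: `CachedWarehouse`, `run_query` and `invalidate` are hypothetical names, and a real warehouse has to decide for itself when cached results are still valid, which this sketch reduces to a manual `invalidate()` call.

```python
class CachedWarehouse:
    """Toy wrapper that serves repeated, identical queries from a cache."""

    def __init__(self, run_query):
        self._run_query = run_query  # hypothetical callable that executes SQL
        self._cache = {}
        self.executions = 0  # how many times the query was really executed

    def query(self, sql):
        # Only an exact text match counts as "the same query" here.
        if sql in self._cache:
            return self._cache[sql]
        self.executions += 1
        result = self._run_query(sql)
        self._cache[sql] = result
        return result

    def invalidate(self):
        """Drop cached results, e.g. after the underlying data changes."""
        self._cache.clear()


if __name__ == "__main__":
    warehouse = CachedWarehouse(lambda sql: f"rows for: {sql}")
    report_sql = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    warehouse.query(report_sql)
    warehouse.query(report_sql)
    print(warehouse.executions)
```

A dashboard that refreshes the same report every few minutes would hit the cache on every run after the first, until the data changes.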
Netezza Performance Boosters:
Netezza supports 2000 simultaneous user connections and can process 2 TB of data per hour. NPS (Netezza Platform Software) also supports fast backups, at over 4 TB of data per hour. (Source)
To understand the next segment, you need to be familiar with Netezza’s Snippet Processing Unit, or SPU (learn more about SPUs here). In simple terms, SPUs are individual units that provide the CPU, memory, and processing power for the queries (snippets, as Netezza terms them) that run on Netezza. The following Netezza features contribute to high performance:
Zone maps
Netezza makes use of zone maps, which map column values to the extents (as blocks of stored data are called in Netezza) on each SPU that hold them, so that a scan can skip extents with no relevant data. In the latest releases, zone maps can be of 2 types:
- A column-oriented zone map, where the information for the same column is kept at the same memory location. This in turn improves data analysis turnaround time, as column-level analysis has a common address to hit to get the relevant data.
- A table-oriented zone map, where the mapping for the complete table, including all of its columns, is maintained at the same location. This helps data ingestion a lot, as the system only has to reference one memory location to store the metadata for the ingested data.
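The way a zone map helps a scan can be sketched in a few lines. This is a simplified model, assuming a zone map keeps the minimum and maximum value of one column for each extent; the function names and the tiny extents are made up for illustration and do not reflect Netezza's internal format.

```python
def build_zone_map(extents):
    """Record the (min, max) value of a column for each extent."""
    return [(min(extent), max(extent)) for extent in extents]


def scan_with_zone_map(extents, zone_map, low, high):
    """Return the matching values and the number of extents actually read."""
    matches, extents_read = [], 0
    for extent, (ext_min, ext_max) in zip(extents, zone_map):
        if ext_max < low or ext_min > high:
            continue  # This extent cannot contain a match, so skip it entirely.
        extents_read += 1
        matches.extend(v for v in extent if low <= v <= high)
    return matches, extents_read


if __name__ == "__main__":
    # Three hypothetical extents holding values of an order-date-like column.
    extents = [[1, 5, 9], [10, 14, 19], [20, 25, 29]]
    zone_map = build_zone_map(extents)
    print(scan_with_zone_map(extents, zone_map, 12, 21))
```

The more closely the data on disk follows the order of the filtered column, the narrower each extent's range becomes and the more extents a query can skip.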
Distribution keys
Netezza, like Redshift, has a concept of distribution keys, where we specify the columns on which the data should be distributed among the MPP backend SPUs. Unlike Redshift, where the distribution key is a single column, Netezza allows a distribution key of up to 4 columns.
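As a rough illustration of key-based distribution, the sketch below hashes the values of up to four key columns and uses the result to pick an SPU for each row. The hash function (CRC32), the function names and the sample rows are assumptions for the demo; Netezza's actual hashing is internal to the product.

```python
import zlib
from collections import defaultdict


def spu_for_row(row, key_columns, num_spus):
    """Pick an SPU for a row by hashing the values of its distribution key."""
    if not 1 <= len(key_columns) <= 4:
        raise ValueError("a distribution key uses between 1 and 4 columns")
    key = "|".join(str(row[column]) for column in key_columns)
    # CRC32 is used because it is stable across runs (illustrative choice).
    return zlib.crc32(key.encode("utf-8")) % num_spus


def distribute(rows, key_columns, num_spus):
    """Group rows by the SPU their distribution key hashes to."""
    placement = defaultdict(list)
    for row in rows:
        placement[spu_for_row(row, key_columns, num_spus)].append(row)
    return dict(placement)


if __name__ == "__main__":
    rows = [{"customer_id": i % 5, "region": "eu", "amount": i} for i in range(20)]
    for spu, placed in sorted(distribute(rows, ["customer_id", "region"], 4).items()):
        print(spu, len(placed))
```

Rows with the same key values always land on the same SPU, which is why a key with many distinct, evenly spread values tends to balance the work better than a key with only a few values.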
Data storage and compression
Data in Netezza, unlike Redshift, is stored in a row-ordered manner, and compression is based on similar values in the columns of a table.
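The difference between row-ordered and columnar layouts, and why repeated column values compress well, can be shown with a toy example. Run-length encoding is used here purely for illustration; it is not a claim about the exact compression scheme either product applies, and the sample table is invented.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot row-ordered records into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list of values."""
    return [value for value, count in pairs for _ in range(count)]


if __name__ == "__main__":
    # Row-ordered layout: every record carries all of its columns together.
    rows = [
        {"order_id": 1, "country": "US"},
        {"order_id": 2, "country": "US"},
        {"order_id": 3, "country": "DE"},
    ]
    columns = to_columns(rows)
    # A query on "country" only needs this one list in the columnar layout.
    print(run_length_encode(columns["country"]))
```

In the columnar form, a query that touches only `country` never reads `order_id`, and similar values sitting next to each other in a column shrink to a handful of pairs.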
Redshift vs Netezza Pricing
Amazon Redshift Pricing:
Redshift pricing depends on the number and type of nodes you choose for your Redshift infrastructure. There are mainly three ways to pay for Redshift services:
- On-Demand pricing: no upfront costs – you simply pay an hourly rate based on the type and number of nodes in your cluster.
- Amazon Redshift Spectrum pricing: enables you to run SQL queries directly against all of your data, out to exabytes, in Amazon S3 – you simply pay for the number of bytes scanned.
- Reserved Instance pricing: enables you to save up to 75% over On-Demand rates by committing to using Redshift for a 1 or 3-year term.
For more details on the pricing, you can visit: https://aws.amazon.com/redshift/pricing/
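The three pricing models above come down to simple arithmetic, sketched below. Every rate in this example is a hypothetical placeholder and not an actual AWS price, and the function names are made up for the illustration; plug in the current figures from the pricing page before drawing conclusions.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month (8760 / 12)
BYTES_PER_TB = 1024 ** 4


def on_demand_monthly_cost(nodes, hourly_rate_per_node, hours=HOURS_PER_MONTH):
    """On-Demand: pay an hourly rate for every node in the cluster."""
    return nodes * hourly_rate_per_node * hours


def reserved_monthly_cost(on_demand_cost, discount):
    """Reserved Instance: On-Demand cost reduced by a discount fraction."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be a fraction between 0 and 1")
    return on_demand_cost * (1 - discount)


def spectrum_query_cost(bytes_scanned, price_per_tb):
    """Spectrum: pay only for the number of bytes a query scans in S3."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    # Hypothetical inputs: 4 nodes at 1.00 per node-hour, 40% reserved discount.
    monthly = on_demand_monthly_cost(nodes=4, hourly_rate_per_node=1.00)
    print(monthly)
    print(reserved_monthly_cost(monthly, discount=0.40))
    # Hypothetical Spectrum rate of 5.00 per TB scanned, for a 2 TB scan.
    print(spectrum_query_cost(2 * BYTES_PER_TB, price_per_tb=5.00))
```

Running the numbers this way for your own node count and expected scan volume makes it easier to see when a reserved commitment starts to pay off.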
Netezza Pricing:
There are no explicit official sources for the pricing details of the Netezza software, but according to some unofficial statements, the Netezza appliance costs $2500 per user per TB, compared to the industry standard of $10000.
The Use Case for Redshift and Netezza:
So, should you choose Netezza’s on-premise system or Amazon’s cloud-only offering, Redshift?
- If your business systems are pretty much defined and are on-premise, it might make sense to opt for an on-premise Data Warehouse solution like Netezza. If your systems/applications are cloud-native, a better case can be built for a Cloud Data Warehouse like Redshift. When you try to integrate a cloud service with an on-premise system like Netezza, there might be lags due to a slow network or network discrepancies.
- Another way to look at this is from the Data Security perspective: data residing in an on-premise system is often considered more secure than data in cloud architectures and systems. However, Amazon Redshift offers a variety of strong security features. There are options like VPC for network isolation, various ways to handle access control, data encryption, etc.
Loading Data to Redshift or Netezza:
Given we are talking about large chunks of data being loaded into the warehouse for analysis, you should carefully evaluate your options to load data in a reliable manner to your warehouse. If not thought through, this can prove to be one of the biggest challenges in your warehousing project. There are two ways to approach this:
- Build Custom ETL solution – You would need to hire/deploy engineering resources to build data pipelines that move data into your warehouse. Unless you have an expert resource pool and a flexible project timeline, this might not be the best option to go for.
- Explore a Data Pipeline Solution – If you choose to go with Netezza, you can opt for a Data Integration tool like Talend. In case you choose to implement Redshift, you can seamlessly move data using a real-time Data Integration Platform like Hevo.
We hope this blog shared enough perspective on the considerations to weigh while choosing a Data Warehouse Solution. If you have not yet made up your mind on a warehouse solution, you should consider reading Redshift Vs BigQuery here and Snowflake Data Warehouse features.
How are you going to choose between Redshift and Netezza? Let us know in the comments.