In this post, you will explore the similarities and the differences between the two Data Warehouse choices – Redshift Vs Postgres, specifically in terms of factors that are critical to the life of a Data Engineer.
Understanding Amazon Redshift and PostgreSQL
- With the advent of completely managed Data Warehouse services, the ETL space is undergoing a paradigm shift with many businesses choosing to migrate from their on-premise database installations.
- The biggest advantage of such services is the relief from all the housekeeping activities related to maintaining a reliable and always available database. Redshift is a very popular choice for a completely managed Data Warehouse service which can scale up to petabytes of data.
- For companies using on-premise databases, Postgres has remained as one of the most favorites because of its performance, reliability, and features. Postgres has been in development for over 30 years and has a strong querying layer which makes it a great choice for OLTP workloads.
Redshift vs PostgreSQL – Features
- Amazon Redshift Features
- PostgreSQL Features
Amazon Redshift Features
- Redshift is a completely managed data warehouse as a service offered by Amazon. Redshift has a query layer very similar to PostgreSQL query standard but lacks many features that standard PostgreSQL querying layer has.
- Redshift can store petabytes of data and is designed for running complex analytical queries spanning over millions of rows.
- Under the hood, Redshift has a cluster of nodes, of which one is designated as a leader node.
- The leader node is responsible for all client communications, query optimization, and planning, delegating tasks to member nodes, etc. Member nodes are the ones storing the data and execute the query plan provided by the leader node.
- Redshift allows the end-users to scale by increasing the number of nodes to upgrading the node configuration or both.
- Scaling usually happens in minutes for the newer generation nodes with elastic resizing support, but could take hours for certain older generation nodes.
- Redshift also allows the customers to choose from different types of nodes according to their performance requirements and budget.
- All the administrative tasks related to maintaining the cluster is automated and there is little intervention needed from the users for keeping the Redshift cluster running.
- Redshift is a paid offering from Amazon and the pricing is inclusive of the software, computing resources, and storage resources.
- Amazon allows the customers to pay only for their actual use and enables them to scale whenever required seamlessly.
- This closer control of the budget is possible in case of querying performance also by adjusting querying slots and AWS workload management queues accordingly.
PostgreSQL Features
- PostgreSQL is a free and open-source database that is best suited for transactional workloads. It is a row-oriented database and is known for its stability with little need for maintenance activities.
- PostgreSQL offers all the typical features expected of an enterprise transactional database like unique primary keys, foreign key referential integrity, user-defined types, multi-version concurrency control, etc.
- The latest version of PostgreSQL makes it possible to have tables that can store 100s of TBs of data and in practice is only limited by disk space constraints.
- Internally PostgreSQL works based on a single server architecture with no concept of clusters. PostgreSQL runs as a single process and support for multi-core in case of single queries is possible from version 9.6 onwards.
- The biggest advantage when it comes to PostgreSQL is that it is completely free and the end-user only needs to pay for the infrastructure on which it is deployed.
- Upgrading the storage capacity in PostgreSQL can be a pain since the end-user will have to manage a lot of data backing up and copying using the tools provided.
- We now have a basic idea of the features of both Redshift and PostgreSQL. It is now time to compare these two databases to see which one can be the best database for your use case.
Redshift vs PostgreSQL Comparison
Redshift Vs PostgreSQL: Scaling
- Redshift was designed with seamless scaling as a basic criterion. Redshift can be scaled in a matter of minutes for the new generation nodes with support for elastic resize.
- The accepted way of scaling Redshift is to add nodes, upgrade node configuration or through a combination of both.
- All this is managed by the AWS console and the end-user does not have to intervene other than initiating the scaling. Read more about Amazon Redshift scaling here.
- Since PostgreSQL is an on-premise database based on a single server, scaling is not one of its strong points.
- If storage capacity needs to be upgraded, the customer will have to first copy the data to new disk drives for a seamless transition. That said if PostgreSQL is now available as a completely managed service from Amazon through their RDS offering. In such cases, scaling is relatively easy since it is managed by AWS.
Performance
- Redshift is optimized for complex analytical workloads. It exhibits better performance for queries with a large scan range. In the case of simpler queries with short data scan ranges, Redshift does poorly when compared to traditional databases.
- This is because Redshift has a sophisticated query optimizer and planner which at times takes more time than the actual execution time.
- This means, if there is nothing much to optimize, this query planner will actually pull Redshift’s performance.
- Redshift also has a mechanism called concurrency scaling which allows it to support practically unlimited concurrent users and concurrent queries. Read more about Redshift performance.
- PostgreSQL generally exhibits good performance when it comes to queries that do not test its limits. PostgreSQL is optimized for online transactional workloads and does very well until the queries have to scan millions of rows.
- In such cases, the time to execution can go up drastically leading to never-ending query processes. Read more about PostgreSQL performance and performance tips.
Data Replication
- With PostgreSQL, loading data can be accomplished using the
COPY FROM
command which loads the rows from a file to table. This command is optimized for loading a large number of rows parallel.
- Redshift also has a
COPY
command which can be used to load data from S3 to Redshift tables. As a best practice, it is better to use this command with several evenly sized files to take advantage of Redshift’s massive parallelism.
- Things to note when moving data to Redshift or PostgreSQL:
- Given you are trying to move data for into one of these systems for key business use cases – it is important to have reliable, accurate and consistent data. While building custom scripts seem like a straightforward option to move data, there are many limitations, caveats and challenges that you will uncover in your journey.
Data Structure
When it comes to data types, Redshift only supports hand full of them. Given below is a list of data types supported by Redshift.
Data Type | Aliases | Description |
SMALLINT | INT2 | Signed two-byte integer |
INTEGER | INT, INT4 | Signed four-byte integer |
BIGINT | INT8 | Signed eight-byte integer |
DECIMAL | NUMERIC | Exact numeric of selectable precision |
REAL | FLOAT4 | Single precision floating-point number |
DOUBLE PRECISION | FLOAT8, FLOAT | Double precision floating-point number |
BOOLEAN | BOOL | Logical Boolean (true/false) |
CHAR | CHARACTER, NCHAR, BPCHAR | Fixed-length character string |
VARCHAR | CHARACTER VARYING, NVARCHAR, TEXT | Variable-length character string with a user-defined limit |
DATE | | Calendar date (year, month, day) |
TIMESTAMP | TIMESTAMP WITHOUT TIME ZONE | Date and time (without time zone) |
TIMESTAMPTZ | TIMESTAMP WITH TIME ZONE | Date and time (with time zone) |
- PostgreSQL, on the other hand, supports all the above data types and many other data types. Some of the notable PostgreSQL data types not supported by Redshift are the Datetime data types, arrays, and JSON.
- There are many more data types that are unsupported in Redshift, but to summarize, PostgreSQL has much more comprehensive data type support when compared to Redshift.
Maintenance
- Both Redshift and PostgreSQL requires some administrative tasks to be executed by the end-user to keep the service running smoothly.
- Both of these databases use delete markers while executing
UPDATE
and DELETE
statements and hence need to recover the disk space later through a manual command called VACUUM.
- Recent versions of both the databases have support for auto executing these commands based on a threshold number of rows changing, but still, it is recommended to manually do this in case the user has a definite knowledge of many updates or deletes in a table.
- Another such command is the
ANALYZE
command which helps the query planner keep its statistics up to date.
- Since these statistics can directly influence the query processing performance, it is recommended manually do this once in a while.
Pricing
- Redshift is a paid service and the lowest current generation dense compute dc2.large starts at .25 $ per hour.
- There is also another type of instance called dense storage that comes with higher storage capacity using HDDs instead of SSDs. It costs more than dense compute nodes and starts at .85 $.
- PostgreSQL is completely free and open source. Customers do not need to pay anything for the software, but they will have to purchase the hardware themselves based on their budget.
Data Security
- Redshift comes with the complete suite of security by default. Customers can choose to isolate Redshift clusters using amazon’s virtual private clouds.
- Amazon’s IAM console allows fine-grained control of which user has access to what options. Data is encrypted at rest and in motion and SSL support is also present. Read more about Redshift security.
- PostgreSQL also provides encryption, SSL support, user access management, etc, but it is the user’s responsibility to manage everything and ensure sufficient security by making use of these features.
- Network isolation will need to be done for the actual hardware server separately. Read more about PostgreSQL security.
- With the comparison out of the way, it is now time to enumerate some specific use cases whether each of these databases will be a sensible choice.
SQL Statements
Let’s discuss some SQL command difference between Redshift and Postgres.
- CREATE TABLE:
Amazon Redshift doesn’t support tablespace, inheritance, table partition, etc. Redshift enables you to define the sort and distribution algorithm for your table to optimise your parallel processing in it.
- ALTER TABLE:
As Redshift doesn’t support ALTER COLUMN, ADD COLUMN supports adding only one column in each ALTER TABLE command.
- COPY:
The COPY command of Redshift can load data from Amazon S3 buckets and Amazon DynamoDB tables and is capable of facilitating automatic compression.
- INSERT, UPDATE, DELETE:
Redshift doesn’t support the with clause.
- VACUUM:
The VACUUM operation in Redshift is VACUUM FULL that reclaims full disk space and resorts all rows whereas VACUUM operation in PostgreSQL reclaims space and make it available for use.
Use cases
When to use Amazon Redshift?
- Your data volume can go up to petabytes
- You want the offload all of your maintenance and infrastructure activities and focus only on the core business logic.
- Your workload is mainly analytical and reporting type and you anticipate them interfering with OLTP workloads if handled using the same database.
- You are willing to take responsibility of ensuring unique key constraints and referential integrity.
- You do not mind having to design the data structure using SORT KEYS AND DIST KEYS to get the maximum performance out of your data warehouse
- Your workload needs a lot of column processing and the number of columns is in the range of hundreds or thousands.
When to use PostgreSQL?
- Your main use case is online transaction processing and does not anticipate high load reporting or analytical workload
- The data volume is in TBs and a large increase in data to be processed is not expected in the near future.
- You are particular that your data should stay inside your network completely under your control
- You have enough resources to maintain a highly available and reliable database.
- Your queries are simpler and do not have very large data scan ranges.
- Your workload is mainly dealing with complete rows and the number of columns in your tables is not in hundreds or thousands.
Conclusions
- The final call to go for Redshift or Postgres completely depends on your unique set of business requirements.
- We hope this detailed blog was able to provide you with perspectives on important parameters that matter.
Vivek Sinha is a seasoned product leader with over 10 years of expertise in revolutionizing real-time analytics and cloud-native technologies. He specializes in enhancing Apache Pinot, focusing on query processing and data mutability. Vivek is renowned for his strategic vision and ability to deliver cutting-edge solutions that empower businesses to harness the full potential of their data.