Redshift is a completely managed data warehouse offered as a cloud service by Amazon. It can scale up to petabytes of data and offers great performance even for complex queries. Its columnar nature with Postgres as the querying standard makes it very popular for analytical and reporting use cases. These days it is very common to use Redshift as the backbone data warehouse for highly reliable ETL or ELT systems. In this post, we will look at ways of setting up a robust Redshift ETL.
Here is a snapshot of what this post covers:
- Redshift ETL – An Introduction
- How to Perform ETL when Source Data is Inside AWS
- Source Data is Outside AWS
- Real-time Data Extraction
- Data Transformation in Redshift ETL
- Redshift ETL Best Practices
Redshift ETL – The Data Extraction
Whether it is an ETL or ELT system, extraction from multiple sources of data is the first step. If what you have in mind is an ETL system, the extraction will involve loading the data to intermediate filesystem storage like S3 or HDFS. In ELT’s case, there is no need for intermediate storage, and the data is directly pushed to Redshift, and transformations are done on Redshift tables.
Finalizing the extraction strategy will depend a lot on the type of use case and the target system one envisions. At a broad level, the extraction strategy will depend on the following factors
- The use case can be handled using batch processing and source data is in the AWS ecosystem itself.
- The use case can be handled using the batch process and the source data is in an on-premise database.
- The use case needs real-time processing and the source is streaming data.
Lets first consider the case where the data is already part of the AWS ecosystem.
Source Data is Inside AWS
This assumes that the source data is already in S3 or some kind of an AWS database service like RDS. In such cases, AWS itself provides us with multiple options for extracting and loading data to Redshift.
- AWS Data Pipeline – AWS data pipeline can integrate with all the available AWS services and provides templates with Redshift or S3 as a target. If you are imagining an ELT system, it would be worthwhile to use the Redshift target template directly so that the data is loaded directly to Redshift. That said, internally even this will use S3 as intermediate storage, but the end-user does not have to worry about that. One issue with AWS data pipeline is that it normally works in the batch mode as periodically scheduled jobs and this does not do a good job in case the data extraction needs to be in real-time or near-real-time. More details on this can be found here.
- AWS RDS Sync with Redshift – If your source data is in RDS and wants to sync the RDS directly to Redshift in real-time, AWS offers partner solutions like Attunity and Flydata which works based on bin log streaming.
- Custom Pipeline Implementation – If you are unwilling to use a managed service for data load, it can be manually implemented by coding each step. Extracting the data first to AWS S3 and then using the copy command. More details on such an approach can be found here.
- Hevo, a No-code Data Pipeline – AWS data pipeline can be a bit overwhelming to set up considering the simple use case it serves. A No-code Data Pipeline like Hevo can do everything AWS data pipeline does in just a few clicks, abstracting away all the complexity and intermediate steps involved. Hevo can also accomplish a near-real-time Amazon Redshift ETL and can act as a one-stop tool that can do many things for you.
Source Data is Outside AWS
In cases where the source data is outside the AWS ecosystem, the options are less.
- AWS Data Migration Service – This is a completely managed AWS service, which can integrate with many databases like Oracle, Microsoft SQL Server, etc. and migrate data to Redshift or intermediate storage like S3. This is mainly used for one-off migrations and lacks integration with many widely used databases like MongoDB.
- Custom Pipeline Implementation – The custom pipeline mentioned in the above section can be used here as well. The first step is to extract data from the on-premise source to S3 and use the copy command. You can look at an example of how this can be achieved for MySQL to Redshift ETL. In a similar fashion, the rest of the sources can be connected to data load.
- Hevo, a No-code Data Pipeline – Hevo natively integrates with a multitude of data sources. Databases, Cloud Applications, Custom Sources like FTP, SDKs, and more. It can accomplish data extractions to Redshift in a matter of few clicks, without writing any code and can hide away much of the complexity. Explore its complete list of Data Sources here.
Real-time Data Extraction
In the case of real-time streams, whether or not the source is in AWS does not really matter since the data anyway has to be streamed to the system specified endpoints. Let’s explore different ways of accomplishing a real-time stream load to AWS Redshift.
- AWS Kinesis Firehose – If here is a real-time streaming source of data that needs to be put directly into Redshift, Kinesis Firehose is the option you can use. Kinesis Firehose uses the COPY command of Redshift in a fault-tolerant way. This is also a completely managed service and charges only for actual use. Kinesis is not to be confused as a method for real-time duplication of data between two databases. That is a different use case and is accomplished using RDS sync or products like Attunity. Kinesis is for streaming data in the form of events. An example would be AWS cloud watch log events or S3 data update events.
- Kafka and Kafka Connect – In case, using a managed service is not preferable and you want close control over the streaming infrastructure, the next best thing to use is a Kafka cluster and the open-source JDBC Kafka connector to load data in real-time.
Redshift ETL – Data Transformation
In the case of an ELT system, transformation is generally done on Redshift itself and the transformed results are loaded to different Redshift tables for analysis. For an ETL system, transformation is usually done on intermediate storage like S3 or HDFS, or real-time as and when the data is streamed. Regardless of this, all the tools mentioned below can be used for performing transformations. Similar to extraction strategies, the transformation method to be used will depend on whether the use case demands batch/real-time transformations.
- AWS Glue – AWS glue is a completely managed Apache Spark-based ETL service that can execute transformations using an elastic Spark backend. The end-user is spared of all the activities involved in setting up and maintaining the spark infrastructure and is charged only for the actual usage. AWS Glue also provides automatic data catalogs using its crawlers and works with a wide variety of sources including S3. It can auto-generate scala or python code for running the jobs. Custom jobs are also possible along with the use of third party libraries. You can read more about this here.
- AWS Data Pipeline – AWS Data Pipeline can also be used to perform scheduled transform jobs. It differs from Glue in the sense that more close control of the infrastructure of jobs is possible. AWS Data pipeline allows selecting your EC2 clusters and frameworks for running the jobs. Jobs can be run using map-reduce, hive or pig scripts or even using Spark or Spark SQL.
- Kinesis Firehose – Kinesis firehose has an option of invoking a custom Lambda function which can transform the streaming data in real-time and deliver the transformed data to destinations.
- Kafka streaming – Kafka streams have functionalities like map, filter, flatmap, etc to do stateless transformations on the streaming data.
- Hevo Data Integration Platform– Hevo excels in its capability to run transformations on the fly. It provides a python based user interface where custom python code can be written to clean, enrich, and transform the data. Users can even test the scripts on sample data before deploying the same. Hevo does all of this on the fly and in real-time. This takes away the complexity of reliably transforming the data for analysis.
Redshift ETL Best Practices
Now that we understand the details of Amazon Redshift ETL, let’s learn about some of the best practices that can be useful while working with Redshift as your data warehouse.
- While using the COPY command of Redshift, it is always better to use it on multiple source files rather than one big file. This helps in parallelly loading them and can save a lot of time. Using a manifest file is recommended in case of lading from multiple files.
- In cases where data needs to be deleted after a specific time, it is better to use the time series tables. This involves creating different tables for different time windows and dropping them when that data is not needed. Dropping a table is way faster than deleting rows from one master table.
- Execute ’VACUUM’ command at regular intervals and after the large update or delete operations. Also, use ‘ANALYZE’ command frequently to ensure that database statistics are updated.
- AWS workload management queues enable the user to prioritize different kinds of Redshift loads by creating separate queues for each type of job. It is recommended to create a separate queue for all the ETL jobs and a separate one for reporting jobs.
- For transformations that span across multiple SQL statements, it is recommended to execute ‘commit’ command after the complete group is executed rather than committing after each statement. For example, let’s say there are two INSERT statements in one of your ETL steps. It is better to use the COMMIT statement after both the statements than using a COMMIT after each statement.
- While using intermediate tables and transferring data between an intermediate table and master table, it is better to use ‘ALTER table APPEND’ command to insert data from the temporary table to the target table. This command is generally faster than using “CREATE TABLE AS” or “INSERT INTO” statements. It is to be noted that ‘ALTER table APPEND’ command empties the source table.
- If there is a need to extract a large amount of data from Redshift and save to S3 or other storage, it is better to use ‘UNLOAD’ command rather than ‘SELECT’ command since the former command will be executed parallelly by all the nodes saving a lot of time.
Redshift is a great choice to be used as a data warehouse because of its columnar nature and availability of multiple tools to cover the entire spectrum of Amazon Redshift ETL activities. AWS itself provides multiple tools, but the choice of what to use in different scenarios can cause a great deal of confusion for a first time user. The AWS tools also have some limitations in supporting source data from outside the AWS ecosystem.
All the above drawbacks can be addressed by using a fuss-free ETL tool like Hevo, which can implement a complete ETL system in a few clicks without any code. Hevo’s fault-tolerant architecture will ensure that your data from any data source is moved to Redshift in a reliable, consistent, and secure manner.