Setting up Redshift Data Lake Export: Made Easy 101

Last Modified: December 29th, 2022

Amazon Redshift Data Lake Export

AWS (Amazon Web Services) is one of the leading providers of Cloud Services. It offers services such as Amazon Redshift and Amazon S3 for data storage. Extract, Transform, Load (ETL) comprises three key steps in Data Warehousing and Databases, so extracting data from and loading data into these storage services is a central part of any ETL process.

In a typical ETL process, you load raw data into Amazon Redshift. After Transformation and Schema Management, Amazon Redshift holds valuable, analysis-ready data. Often, however, you don't want to run every query against Amazon Redshift directly; instead, you want to move this data to another platform such as Amazon S3 and query it there. This is where unloading data from Amazon Redshift to S3 helps. Amazon S3 is an AWS storage repository that integrates with almost all AWS data and analytics services. The Redshift Data Lake Export feature lets you unload the result of a Redshift query to your S3 data lake in Apache Parquet format.

In this article, you will be introduced to Amazon Redshift and Amazon S3 along with their key features. You will then learn about the Amazon Redshift Data Lake Export feature and the steps involved in setting it up.


Introduction to Amazon Redshift


Amazon Web Services (AWS) is the Amazon subsidiary responsible for providing a cloud computing platform and APIs to individuals, corporations, and enterprises. AWS offers high computing power, efficient content delivery, and flexible, scalable, reliable database storage at relatively inexpensive prices.

Amazon Redshift, a part of AWS, is a Cloud-based Data Warehouse service designed by Amazon to handle large data and make it easy to discover new insights from them. Its operations enable you to query and combine exabytes of structured and semi-structured data across various Data Warehouses, Operational Databases, and Data Lakes.

Amazon Redshift is built on industry-standard SQL with functionalities to manage large datasets, support high-performance analysis, provide reports, and perform large-scaled database migrations. Amazon Redshift also lets you save queried results to your S3 Data Lake using open formats like Apache Parquet from which additional analysis can be done on your data from other Amazon Web Services such as EMR, Athena, and SageMaker.

For further information on Amazon Redshift, you can follow the Official Documentation.

Key Features of Amazon Redshift

The key features of Amazon Redshift are as follows:

1) Massively Parallel Processing (MPP)

Massively Parallel Processing (MPP) is a distributed design approach in which the divide and conquer strategy is applied by several processors to large data jobs. A large processing job is broken down into smaller jobs which are then distributed among a cluster of Compute Nodes. These Nodes perform their computations parallelly rather than sequentially. As a result, there is a considerable reduction in the amount of time Redshift requires to complete a single, massive job.
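As a rough illustration of this divide-and-conquer idea, here is a conceptual sketch in Python (not Redshift's actual implementation): a large aggregation is split into per-node chunks, each "node" computes a partial result, and a leader combines the partials.

```python
# Conceptual MPP sketch: divide a big aggregation across compute nodes,
# then combine the partial results on a leader node.

def mpp_sum(values, num_nodes=4):
    """Split `values` across `num_nodes` nodes and combine their partial sums."""
    chunk_size = (len(values) + num_nodes - 1) // num_nodes
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    # Each "node" computes its partial sum; in a real cluster these run in parallel.
    partials = [sum(chunk) for chunk in chunks]
    # The leader node combines the partials into the final answer.
    return sum(partials)

print(mpp_sum(list(range(1, 101))))  # same answer as a single-node sum: 5050
```

The result is identical to a sequential sum; the benefit in a real cluster is that the partials are computed concurrently.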

2) Fault Tolerance

Data Accessibility and Reliability are of paramount importance for any user of a database or a Data Warehouse. Amazon Redshift monitors its Clusters and Nodes around the clock. When any Node or Cluster fails, Amazon Redshift automatically replicates all data to healthy Nodes or Clusters.

3) Redshift ML

Amazon Redshift houses a functionality called Redshift ML that gives data analysts and database developers the ability to create, train, and deploy Amazon SageMaker models using SQL seamlessly.

4) Column-Oriented Design

Amazon Redshift is a Column-oriented Data Warehouse. This makes it a simple and cost-effective solution for businesses to analyze all their data using their existing Business Intelligence tools. Amazon Redshift achieves optimum query performance and efficient storage by leveraging Massively Parallel Processing (MPP), Columnar Data Storage, along with efficient and targeted Data Compression Encoding schemes.
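A minimal sketch of why a columnar layout helps analytics (illustrative only; the table and values are made up, and real columnar engines add compression and encoding on top): an aggregate over one column only needs to scan that column, not every full row.

```python
# Toy comparison of row-oriented vs. column-oriented layouts.

# Row store: every query touches whole rows.
row_store = [
    {"order_id": 1, "product": "S18_1749", "price": 136.00},
    {"order_id": 2, "product": "S18_2248", "price": 55.09},
    {"order_id": 3, "product": "S18_4409", "price": 75.46},
]

# Column store: the same table laid out column by column.
column_store = {
    "order_id": [1, 2, 3],
    "product": ["S18_1749", "S18_2248", "S18_4409"],
    "price": [136.00, 55.09, 75.46],
}

# SELECT sum(price): the column store scans one contiguous list.
# Similar values stored together also compress far better.
total = sum(column_store["price"])
print(round(total, 2))  # 266.55
```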

Introduction to Amazon S3

Amazon Redshift Data Lake Export - Amazon S3

Amazon S3, or Simple Storage Service, is a scalable, low-cost, high-speed data storage web service provided by Amazon. It is designed for online backup and archiving of data and application programs. Amazon S3 stores data in the form of objects, each consisting of a file with an associated key and metadata, stored within your AWS Region.

Amazon S3 houses simple yet intuitive data management functionality that allows you to configure finely-tuned access controls and organize your data to meet your specific organizational, business, and compliance requirements. It has been designed as a robust and durable framework that stores data for millions of applications for companies across the world.

Amazon S3 allows you to upload, store, and download any type of file up to 5 TB in size. Subscribers get access to the same storage infrastructure that Amazon uses to run its own websites, and Amazon S3 is designed to give the subscriber full control over the accessibility of their data.
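Since S3 is a flat key-value store, "folders" are really just shared key prefixes. Here is a tiny sketch (with hypothetical bucket contents) of how a prefix-filtered listing behaves, similar in spirit to the ListObjectsV2 API's Prefix parameter:

```python
# Model a bucket as a flat mapping from object key to object bytes.
bucket = {
    "orders/2022/part-0000.parquet": b"...",
    "orders/2022/part-0001.parquet": b"...",
    "logs/app.log": b"...",
}

def list_objects(bucket, prefix):
    """Emulate a prefix-filtered listing: 'folders' are just key prefixes."""
    return sorted(key for key in bucket if key.startswith(prefix))

print(list_objects(bucket, "orders/"))
```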

The architecture of Amazon S3 is shown below.

Redshift Data Lake Export - Amazon S3 Architecture

For further information about Amazon S3, you can follow the Official Documentation.

Key Features of Amazon S3

Amazon S3 offers various features that you can use to manage and organize your data. Data is stored as objects in containers called ‘Buckets’. These features include configuring and enforcing data access controls and monitoring data at the Object and Bucket levels, among many others. Here is a list of key Amazon S3 features:

1) Intelligent Tiering

Intelligent Tiering allows you to optimize storage costs by automatically moving objects between four access tiers whenever access patterns change. Two opt-in archive access tiers are designed for asynchronous data access and offer savings of up to 95% for rarely accessed objects. The other two are low-latency tiers optimized for frequent and infrequent access, which can save you up to 40% on storage costs.

Objects that are uploaded or transitioned to Intelligent Tiering are automatically stored in the Frequent Access Tier. The service monitors access patterns and moves objects that have not been accessed for 30 consecutive days to the Infrequent Access Tier.

Once one or both archive access tiers have been activated, Intelligent Tiering also moves objects to the Archive Access Tier after 90 consecutive days without access, and to the Deep Archive Access Tier after 180 consecutive days without access.
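The tiering rules can be sketched as a simple decision function. This is a simplification: the thresholds below follow the 30/90/180-day behavior described in this section, while the real service tracks access patterns per object automatically.

```python
# Simplified sketch of the Intelligent Tiering decision for one object,
# based on how many days have passed since it was last accessed.

def choose_tier(days_since_last_access, archive_tiers_enabled=False):
    """Return the access tier an object would land in under these rules."""
    if archive_tiers_enabled and days_since_last_access >= 180:
        return "DEEP_ARCHIVE_ACCESS"
    if archive_tiers_enabled and days_since_last_access >= 90:
        return "ARCHIVE_ACCESS"
    if days_since_last_access >= 30:
        return "INFREQUENT_ACCESS"
    return "FREQUENT_ACCESS"

print(choose_tier(10))                               # FREQUENT_ACCESS
print(choose_tier(45))                               # INFREQUENT_ACCESS
print(choose_tier(200, archive_tiers_enabled=True))  # DEEP_ARCHIVE_ACCESS
```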

Amazon Redshift Data Lake Export - S3 Intelligent Tiering Diagram

2) Storage Lens

Storage Lens offers organization-wide visibility into activity trends, object storage usage, and provides actionable recommendations to apply the best practices for data protection and improve cost efficiency. This is the first Cloud Storage Analytics solution to provide a unified view of object storage usage and activity across hundreds of accounts in an organization. Storage Lens also offers drill-downs to generate insights at the bucket, account, or prefix level. 

It can analyze organization-wide metrics to deliver contextual recommendations and find ways to reduce storage costs and improve efficiency in the process. 

Amazon Redshift Data Lake Export - S3 Storage Lens Diagram

3) Access Points

Access Points simplify Data Access Management at scale for applications using shared datasets in S3. Access Points are defined as unique hostnames created by the customers to enforce network controls and distinct permissions for any request made through the access point.

Customers with shared datasets such as Media Archives, user-generated content, and Data Lakes can easily scale access for hundreds of applications by creating individualized access points, each with a name and permissions tailored to a specific application. Any access point can be restricted to a Virtual Private Cloud (VPC) to firewall Amazon S3 data access within a customer's private network, and AWS Service Control Policies can be used to ensure that all access points are VPC-restricted.
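As a sketch of what an access point identifies, the helpers below assemble an access point's ARN and hostname following the documented naming patterns; the region, account ID, and access point name are hypothetical examples.

```python
# Build the identifiers for a hypothetical S3 access point.

def access_point_arn(region, account_id, name):
    """ARN pattern: arn:aws:s3:<region>:<account-id>:accesspoint/<name>."""
    return f"arn:aws:s3:{region}:{account_id}:accesspoint/{name}"

def access_point_hostname(name, account_id, region):
    """Requests through an access point use its own unique hostname."""
    return f"{name}-{account_id}.s3-accesspoint.{region}.amazonaws.com"

print(access_point_arn("us-east-1", "123456789012", "analytics-app"))
print(access_point_hostname("analytics-app", "123456789012", "us-east-1"))
```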

Amazon Redshift Data Lake Export - S3 Access Points Diagram

4) Object Lambda

Object Lambda allows you to add your own code to Amazon S3 GET requests to modify and process data as it is returned to an application. Custom code can modify the data returned by standard GET requests to dynamically resize images, redact confidential data, filter rows, and much more.

S3 Object Lambda is powered by AWS Lambda functions, so your code runs on AWS-managed infrastructure. This eliminates the need to create and store derivative copies of your data, with no changes required to existing applications. You can configure a Lambda function and attach it to an Object Lambda Access Point in just a few clicks; Amazon S3 then automatically invokes your function to process any data retrieved through that access point and returns the transformed result to the application.
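A hedged sketch of such a handler is shown below, redacting digits before an object is returned. The `redact` transform is a made-up example; the handler wiring follows the WriteGetObjectResponse flow and only actually runs inside AWS Lambda.

```python
# Sketch of an S3 Object Lambda handler that redacts digits from a text
# object before returning it to the caller (assumed transform; illustration only).
import re
import urllib.request

def redact(text):
    """Replace every digit so e.g. card or order numbers never reach the caller."""
    return re.sub(r"\d", "#", text)

def handler(event, context):
    # boto3 is preinstalled in the AWS Lambda Python runtime.
    import boto3
    ctx = event["getObjectContext"]
    # Fetch the original object via the presigned URL S3 supplies in the event.
    original = urllib.request.urlopen(ctx["inputS3Url"]).read().decode("utf-8")
    # Return the transformed body through WriteGetObjectResponse.
    boto3.client("s3").write_get_object_response(
        Body=redact(original),
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
    )
    return {"status_code": 200}

print(redact("order 10100 card 4111-1111"))  # order ##### card ####-####
```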

Amazon Redshift Data Lake Export - S3 Object Lambda Diagram

5) Batch Operations

Batch Operations is an Amazon S3 data management feature that helps you manage billions of objects at scale through a single API request or a few clicks in the Amazon S3 Management Console. To carry out work with Batch Operations, you first create a job, which consists of a list of objects, a set of parameters, and the action to be performed on those objects. You can create and run multiple jobs at a time, and use job priorities to define the precedence of each job so that the most critical work happens first.

S3 Batch Operations also tracks progress, manages retries, generates reports, sends completion notifications, and delivers events to AWS CloudTrail for all tasks executed and changes made.
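One common way to tell a Batch Operations job which objects to act on is a CSV manifest with one `bucket,key` line per object. Below is a small sketch of building such a manifest; the bucket and keys are hypothetical, and a real job definition would reference the manifest's S3 location and ETag.

```python
# Build a CSV manifest listing the objects a Batch Operations job should process.
import csv
import io

def build_manifest(bucket, keys):
    """Return CSV manifest text with one 'bucket,key' row per object."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for key in keys:
        writer.writerow([bucket, key])
    return buf.getvalue()

manifest = build_manifest(
    "dojo-lake",
    ["orders/part-0000.parquet", "orders/part-0001.parquet"],
)
print(manifest)
```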

Amazon Redshift Data Lake Export - S3 Batch Operations Diagram

6) Block Public Access

Block Public Access provides access controls across an entire AWS account or at the individual Amazon S3 Bucket level, ensuring that objects never become publicly accessible. Public access can be granted to objects and buckets through Bucket Policies, Access Control Lists (ACLs), or both; to block access to all your S3 Objects and Buckets, you can block all public access at the account level.

The S3 Block Public Access settings override Amazon S3 permissions that would otherwise permit public access. This makes it easy for an account administrator to enforce centralized control and prevent variations in security configuration, regardless of how an object is added or a bucket is created.
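The feature is driven by four settings. The sketch below shows them as the configuration dict you would pass to the PutPublicAccessBlock API (for example via boto3's `put_public_access_block`); no AWS call is made here, and the bucket name is a placeholder.

```python
# The four S3 Block Public Access settings, all enabled.
public_access_block = {
    "BlockPublicAcls": True,        # reject requests that set public ACLs
    "IgnorePublicAcls": True,       # ignore any existing public ACLs
    "BlockPublicPolicy": True,      # reject bucket policies granting public access
    "RestrictPublicBuckets": True,  # restrict access to buckets with public policies
}

# e.g. boto3.client("s3").put_public_access_block(
#          Bucket="dojo-lake",
#          PublicAccessBlockConfiguration=public_access_block)

assert all(public_access_block.values())  # everything public is blocked
print("all public access blocked")
```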

Amazon Redshift Data Lake Export - S3 Block Public Access Diagram

Simplify Redshift ETL and Analysis with Hevo’s No-code Data Pipeline

A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate and load data from 100+ different sources (including 30+ free sources) to a Data Warehouse such as Amazon Redshift or a Destination of your choice in real-time in an effortless manner. Hevo, with its minimal learning curve, can be set up in just a few minutes, allowing users to load data without having to compromise performance. Its strong integration with umpteen sources allows users to bring in data of different kinds smoothly without having to write a single line of code.

Get Started with Hevo for Free

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
  • Connectors: Hevo supports 100+ integrations to SaaS platforms, files, Databases, analytics, and BI tools. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL Databases to name a few.  
  • Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (including 30+ free sources) that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Understanding Amazon Redshift Data Lake Export

Amazon Redshift is the Cloud Data Warehouse in AWS, and it integrates seamlessly with other storage services like Amazon S3. This enables a very cost-effective Data Warehouse architecture in which warm data is kept in Amazon Redshift storage while cold data is kept in S3, with the S3 data accessible from Redshift in the same way as data in Redshift's own storage.

Redshift provides the Redshift Data Lake Export feature, which unloads the result of a Redshift query to the S3 Data Lake in Apache Parquet format, an efficient open columnar storage format optimized for analytics. This way, Redshift can offload cold data to S3 storage.

Together, these two services, Amazon Redshift and Amazon S3, can optimize both storage cost and performance.

Steps to Set Up Amazon Redshift Data Lake Export

The steps to set up Amazon Redshift Data Lake Export are as follows:

Step 1: Create S3 Bucket

  • Log in to the AWS Console and choose your appropriate region.
  • Go to the S3 Management Console, create an S3 Bucket with a globally unique name, and create a folder in the bucket with an appropriate name.

In the picture given below, the name of the S3 Bucket is dojo-lake and the name of the folder is orders.

Amazon Redshift Data Lake Export - S3 Bucket

Step 2: Create IAM Role

An AWS Identity and Access Management (IAM) role is used to give Redshift write access to your S3 bucket.

  • Go to the IAM (Identity and Access Management) Console.
  • From the left navigation pane, click on the “Roles” option and then click on the “create role” button.
Amazon Redshift Data Lake Export - Roles
  • Now, select the “Redshift-Customizable” option from the list of use cases. Then, click on the “Next: Permissions” button.
Amazon Redshift Data Lake Export - Use Cases
  • Now, in the next window, select “PowerUserAccess” as the policy. Then, click on the “Next: Tags” button.
Amazon Redshift Data Lake Export - Policy
  • Now, click on the “Next: Review” button.
  • In the next window, fill in the details such as the Role name and Role description, and click on the “Create role” button.

In this case, the role name is given as dojoredshiftrole.

Amazon Redshift Data Lake Export - Create role

The role is created immediately. Make a note of the Role ARN, as you will need it later in the process.
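Behind the “Redshift-Customizable” use case is a trust policy that lets the Redshift service assume the role on your cluster's behalf. The console generates this for you automatically; the sketch below only illustrates what that trust relationship looks like.

```python
# The trust policy attached to a Redshift-assumable IAM role.
import json

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # The Redshift service is the principal allowed to assume the role.
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```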

Step 3: Launch Redshift Cluster

Next, launch a Redshift Cluster, which will be used to unload the query data to the S3 bucket.

  • Go to Redshift Management Console, and click on the “create cluster” button.
Amazon Redshift Data Lake Export - Create cluster
  • In the next window, give a unique name as the Cluster identifier and select what you want to use the cluster for.

In this case, the name of the cluster identifier is given as dojoredshift.

Amazon Redshift Data Lake Export - Cluster Configuration
  • Fill in the Database Configurations details.
Amazon Redshift Data Lake Export - Database details
  • Now, expand the “Cluster permissions (optional)” section. Select the appropriate name from the list of IAM Roles, then click on the “Associate IAM Role” button to add the role to the cluster. Finally, click on the “Create cluster” button at the bottom of the screen.

In this case, dojoredshiftrole is selected as the IAM Role.

Amazon Redshift Data Lake Export - Cluster permissions
  • The Redshift cluster will take some time to build. Wait until the cluster’s status changes to Available.
Amazon Redshift Data Lake Export - Availability

Make a note of the following details, as they will be needed later in the process:

  • Cluster Identifier
  • Username
  • Password
  • Database

Step 4: Unload Data from Redshift Database

Now, you can unload query results in Redshift to the S3 bucket.

  • In the Redshift Management Console, click the “EDITOR” menu in the left navigation pane. It will bring up the “Connect to database” dialog box.
  • Choose the option to create a new connection. Give the name of the cluster, database, username, and password.
  • Now, to connect to the cluster, click the “Connect to database” button.
Amazon Redshift Data Lake Export - Connect to database
  • On the next window, run the following SQL statement to create the required table.

In this case, the “orders” table is created.

Amazon Redshift Data Lake Export - create table
CREATE TABLE orders (
  orderNumber int,
  productCode varchar(15),
  quantityOrdered int,
  priceEach decimal(10,2),
  orderLineNumber int
);
  • Now, insert values into the table using the following SQL statement. Note that priceEach is declared as decimal(10,2) so the cents in the inserted prices are preserved.
Amazon Redshift Data Lake Export - Insert records
INSERT INTO orders (orderNumber, productCode, quantityOrdered, priceEach, orderLineNumber) VALUES
(10100,'S18_1749',30,'136.00',3),
(10100,'S18_2248',50,'55.09',2),
(10100,'S18_4409',22,'75.46',4),
(10100,'S24_3969',49,'35.29',1),
(10101,'S18_2325',25,'108.06',4),
(10101,'S18_2795',26,'167.06',1),
(10101,'S24_1937',45,'32.53',3),
(10101,'S24_2022',46,'44.35',2),
(10102,'S18_1342',39,'95.55',2),
(10102,'S18_1367',41,'43.13',1),
(10103,'S10_1949',26,'214.30',11);
  • The general syntax of the UNLOAD command is shown below. Run the UNLOAD query that follows it to unload the query result into the S3 bucket. If you created an S3 bucket with a different name, replace dojo-lake with that name, and replace {REDSHIFT-ROLE} with the ARN of the Redshift IAM Role you created previously.
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
[ option [ ... ] ]

where option is
{ [ FORMAT [ AS ] ] CSV | PARQUET
| PARTITION BY ( column_name [, ... ] ) [ INCLUDE ]
| MANIFEST [ VERBOSE ] 
| HEADER           
| DELIMITER [ AS ] 'delimiter-char' 
| FIXEDWIDTH [ AS ] 'fixedwidth-spec'   
| ENCRYPTED [ AUTO ]
| BZIP2  
| GZIP 
| ZSTD
| ADDQUOTES 
| NULL [ AS ] 'null-string'
| ESCAPE
| ALLOWOVERWRITE
| PARALLEL [ { ON | TRUE } | { OFF | FALSE } ]
| MAXFILESIZE [AS] max-size [ MB | GB ] 
| REGION [AS] 'aws-region' }
Amazon Redshift Data Lake Export - Unload
UNLOAD ('SELECT * FROM orders')
TO 's3://dojo-lake/orders/'
FORMAT AS PARQUET
PARTITION BY (orderLineNumber)
CREDENTIALS 'aws_iam_role={REDSHIFT-ROLE}';
  • The above query unloads the result of “SELECT * FROM orders” to the S3 bucket, partitioning the data files by the orderLineNumber field. You can view the partitioned files under the orders folder in the S3 bucket.
Amazon Redshift Data Lake Export - order1
Amazon Redshift Data Lake Export - Order 1
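The UNLOAD statement above can also be assembled programmatically so the bucket, role ARN, and partition column stay configurable. This sketch uses the IAM_ROLE clause, an equivalent form of the CREDENTIALS 'aws_iam_role=...' syntax used above; the account ID in the role ARN is a placeholder.

```python
# Assemble an UNLOAD-to-Parquet statement from its configurable parts.

def build_unload(query, bucket, prefix, iam_role_arn, partition_by=None):
    """Return an UNLOAD statement writing Parquet to s3://bucket/prefix/."""
    parts = [
        f"UNLOAD ('{query}')",
        f"TO 's3://{bucket}/{prefix}/'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS PARQUET",
    ]
    if partition_by:
        parts.append(f"PARTITION BY ({partition_by})")
    return "\n".join(parts) + ";"

sql = build_unload(
    "SELECT * FROM orders",
    bucket="dojo-lake",
    prefix="orders",
    iam_role_arn="arn:aws:iam::123456789012:role/dojoredshiftrole",
    partition_by="orderLineNumber",
)
print(sql)
```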

Step 5: Clean Up (Optional)

To avoid incurring additional costs, clean up the resources if you no longer need them:

  • Delete dojoredshift Redshift Cluster.
  • Delete dojo-lake S3 Bucket.
  • Delete dojoredshiftrole IAM Role.

For further information about the steps involved in setting up Amazon Redshift Data Lake Export, refer to the official AWS documentation.

Conclusion

In this article, you learned about Amazon Redshift, Amazon S3, and their key features. It also covered the Amazon Redshift Data Lake Export feature and the steps involved in setting it up.

Hevo Data, a No-code Data Pipeline, provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of desired destinations with a few clicks.

Visit our Website to Explore Hevo

Hevo Data with its strong integration with 100+ sources (including 30+ free sources) allows you to not only export data from your desired data sources & load it to the destination of your choice such as Amazon Redshift, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools. 

Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You may also have a look at our pricing, which will assist you in selecting the best plan for your requirements.

Share your experience of understanding the Amazon Redshift Data Lake Export feature in the comment section below! We would love to hear your thoughts.

Former Research Analyst, Hevo Data

Manisha is a data analyst with experience in diverse data tools like Snowflake, Google BigQuery, SQL, and Looker. She has written more than 100 articles on diverse topics related to data industry.
