AWS (Amazon Web Services) is one of the leading providers of Cloud Services. It provides Cloud services like Amazon Redshift, Amazon S3, and many others for Data Storage. Extract, Transform, Load are 3 important steps performed in the field of Data Warehousing and Databases.
So, extracting and loading data from these data storage is one of the important steps in the ETL (Extract, Transform, Load) process. In this article, you will be introduced to Amazon Redshift and Amazon S3. You will also learn about the key features of these AWS Tools.
Moreover, you will also be introduced to Amazon Redshift Data Lake Export feature. You will also know about the steps involved in setting up Amazon Redshift Data Lake Export. Read along to learn more about Amazon Redshift Data Lake Export.
Introduction to Amazon Redshift
Amazon Web Services (AWS) is a subsidiary of Amazon saddled with the responsibility of providing a cloud computing platform and APIs to individuals, corporations, and enterprises. AWS offers high computing power, efficient content delivery, database storage with increased flexibility, scalability, reliability, and relatively inexpensive cloud computing services.
Amazon Redshift, a part of AWS, is a Cloud-based Data Warehouse service designed by Amazon to handle large data and make it easy to discover new insights from them. Its operations enable you to query and combine exabytes of structured and semi-structured data across various Data Warehouses, Operational Databases, and Data Lakes.
Amazon Redshift is built on industry-standard SQL with functionalities to manage large datasets, support high-performance analysis, provide reports, and perform large-scaled database migrations. Amazon Redshift also lets you save queried results to your S3 Data Lake using open formats like Apache Parquet from which additional analysis can be done on your data from other Amazon Web Services such as EMR, Athena, and SageMaker.
Key Features of Amazon Redshift
The key features of Amazon Redshift are as follows:
1) Massively Parallel Processing (MPP)
2) Fault Tolerance
3) Redshift ML
4) Column-Oriented Design
Introduction to AWS S3 Amazon
Amazon S3 houses simple yet intuitive data management functionalities that allow you to configure finely-tuned access controls and organize your data to meet your specific organizational, business, and compliance requirements. AWS S3 Amazon has been designed to provide a robust and durable framework that can store data for millions of applications for companies across the world.
Amazon S3 or Simple Storage Service is a scalable, low-cost, high-speed data storage web service provided by Amazon. Amazon S3 is designed for online backup and archiving of data and application programs. Amazon S3 stores data in the form of objects. Each object consists of a file with an associated ID and metadata. These files work like records and directories to store data within your AWS region.
Amazon S3 allows you to upload, store, and download any type of file up to 5 TB in size. All of its subscribers can access the same storage capabilities that Amazon uses on its website too. Amazon S3 is designed to give the subscriber total control over the accessibility of data.
The architecture of Amazon S3 is shown below.
To have further information about Amazon S3.
Key Features of Amazon S3
AWS S3 Amazon offers various features that you can utilize to manage and organize your data. Data is stored here as objects in something called ‘Buckets’. Amazon S3 features include capabilities like configuring and enforcing data access controls, monitoring data at the Object and Bucket levels among so many others. Here is a list of features of AWS S3 Amazon:
1) Intelligent Tiering
- Intelligent Tiering allows you to optimize storage costs by automatically moving objects between the four access tiers whenever there’s a change in the access patterns. Two opt-in archive access tiers are designed for asynchronous data access with cost savings for objects that are rarely accessed up to 95%. The other two access tiers are low latency access tiers optimized for infrequent and frequent access to help you save up to 40% on storage costs.
- The Objects that are transitioned or uploaded to Intelligent Tiering are automatically stored in the Frequent Access Tier. This tiering system works by monitoring access patterns and then shifting the objects that have not been accessed for 30 consecutive days to the Infrequent Access Tier.
- Once one or both of the archive access tiers have been activated, Intelligent Tiering will move the objects to the Archive Access Tier or the Deep Archive Access Tier. The objects that haven’t been accessed in the last 90 days move to the Archive Access Tier and Deep Archive Access Tier otherwise.
2) Storage Lens
- Storage Lens offers organization-wide visibility into activity trends, object storage usage, and provides actionable recommendations to apply the best practices for data protection and improve cost efficiency.
- This is the first Cloud Storage Analytics solution to provide a unified view of object storage usage and activity across hundreds of accounts in an organization. Storage Lens also offers drill-downs to generate insights at the bucket, account, or prefix level.
- It can analyze organization-wide metrics to deliver contextual recommendations and find ways to reduce storage costs and improve efficiency in the process.
3) Access Points
- Access Points simplify Data Access Management at scale for applications using shared datasets in S3. Access Points are defined as unique hostnames created by the customers to enforce network controls and distinct permissions for any request made through the access point.
- Customers with user-generated content and shared datasets like Media Archives and Data Lakes can easily scale for hundreds of applications. This is done by creating individualized access points with permissions and names tailored for each application.
- You can restrict any access point to a Virtual Private Cloud (VPC) to firewall AWS S3 Amazon data access within private networks of customers. You can use AWS Service Control Policies to ensure that all access points are VPC restricted.
4) Object Lambda
- Object Lambda allows you to add your code to AWS S3 Amazon GET requests to modify and process data as it returns to an application. For the first time, custom code can be used to modify the data returned by standard GET requests to dynamically resize images, redact confidential data, filter rows, and much more.
- S3 Object Lambda is powered by AWS Lambda functions, which means that your code runs on AWS-managed infrastructure. This eliminates the need to store and create derivative copies of your data or to run expensive copies, all with no changes required for applications.
- You can configure a Lambda function and attach it to an Object Lambda Access Point in just a few clicks. Amazon S3 can then automatically call your Lambda function to process any data retrieved through the S3 Object Lambda Access Point. It returns a transformed result to the application.
5) Batch Operations
- Batch Operation is an AWS S3 Amazon Data Management feature that helps you manage billions of objects at scale through a single API request or the Amazon S3 Management Console in just a few clicks. To carry out work in an Batch operation, you need to first create a job.
- The job consists of the list of the objects, set of parameters, and the actions to be performed for that type of operation. You can create and run multiple jobs at a time or use job priorities in Batch Operations to define the precedence of each job. This ensures that the most critical work happens first.
- S3 Batch Operations also tracks progress, manages retries, generates reports, sends completion notifications, and delivers events to AWS CloudTrail for all tasks executed and changes made.
6) Block Public Access
- Block Public Access provides control access across an entire AWS account or at the individual AWS S3 Amazon Bucket level. This ensures that objects never have public access at any point in time. Public Access can be granted to objects and buckets through Bucket Policies or Access Control Lists (ACLs) or both. To block access to all your S3 Objects and Buckets, you can block all public access at the account level.
- The S3 Block Public Access settings override Amazon S3 permissions that permit public access. This makes it easy for the account administrator to ensure a centralized control and prevent any variations in the security configuration. This is carried out regardless of how an object is added or a bucket is created.
Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integration with 150+ Data Sources (40+ free sources), we help you not only export data from sources & load data to the destinations but also transform & enrich your data, & make it analysis-ready.
Start for free now!
Get Started with Hevo for Free
Understanding Amazon Redshift Data Lake Export
- Amazon Redshift is the Cloud Data Warehouse in AWS. Amazon Redshift provides seamless integration with other storages like Amazon S3. It enables a very cost-effective Data Warehouse solution where the warm data can be kept in Amazon Redshift storage while the cold data can be kept in the S3 storage. The user can access the S3 data from Redshift in the same way, the data is accessed from the Redshift storage itself. Amazon Redshift enables S3 data access from Amazon Redshift.
- Redshift uses the Redshift Data Lake Export feature which allows unloading the result of a Redshift query to the S3 Data Lake in Apache Parquet format,an efficient open columnar storage format optimized for analytics. This way, Redshift can offload the cold data to the S3 storage.
- These 2 services i.e, Amazon Redshift and Amazon S3 can come together to optimize the storage cost as well as the performance.
Steps to Set Up Amazon Redshift Data Lake Export
The steps to set up Amazon Redshift Data Lake Export are as follows:
Step 1: Create S3 Bucket
- Log in to the AWS Console and choose your appropriate region.
- Now, go to S3 Management Studio. Then, create an S3 Bucket and select a name from the available names. And create a folder in the bucket with an appropriate name.
In the picture given below, the name of the S3 Bucket is dojo-lake and the name of the folder is orders.
Step 2: Create IAM Role
To give Redshift write access to your S3 bucket, AWS Identity and Access Management (IAM) role is used.
- Go to the IAM (Identity and Access Management) Console.
- From the left navigation pane, click on the “Roles” option and then click on the “create role” button.
- Now, select the “Redshift-Customizable” option from the list of use cases. Then, click on the “Next: Permissions” button.
- Now, in the next window, select “PowerUserCase” as the policy. Then, click on the “Next: Tags” button.
- Now, click on the “Next: Review” button.
- In the next window, fill the details such as Role name and Role Description and click on the “create role” button.
In this case, the role name is given as dojoredshiftrole
Now the role is created immediately. Please make note of the Role ARN as you will need it later in the process.
Step 3: Launch Redshift Cluster
You launch a Redshift Cluster which is used to unload the query data to the S3 bucket.
- Go to Redshift Management Console, and click on the “create cluster” button.
- In the next window, give a unique name for the cluster identifier and select the option for which you want to use the cluster for.
In this case, the name of the cluster identifier is given as dojoredshift.
- Fill in the Database Configurations details.
- Now, expand the “Cluster permissions (optional)” section. Select the appropriate name from the list of IAM Roles. Then click on the “Associate IAM Role” button to add the role to the cluster. Finally, click on the “Create cluster” button in the bottom of the screen.
In this case, dojoredshiftrole is selected as the IAM Role.
- The Redshift cluster will take some time to build. Wait until the cluster’s status changes to Available.
Few things need to be remembered that will be used in the process. Those are:
- Cluster Identifier
- Username
- Password
- Database
Step 4: Unload Data from Redshift Database
Now, you can unload query results in Redshift to the S3 bucket.
- In the Redshift Management Console, Click the “EDITOR” menu on the left navigation pane. It will bring up the Connect to database dialogue box.
- Choose the option to create a new connection. Give the name of the cluster, database, username, and password.
- Now, to connect to the cluster, click the “Connect to database” button.
- On the next window, run the following SQL statement to create the required table.
In this case, the “orders” table is created.
CREATE TABLE orders (
orderNumber int,
productCode varchar(15),
quantityOrdered int,
priceEach decimal,
orderLineNumber int
);
- Now, you can insert values into the table using the appropriate SQL statement.
insert into orders(orderNumber,productCode,quantityOrdered,priceEach,orderLineNumber) values
(10100,'S18_1749',30,'136.00',3),
(10100,'S18_2248',50,'55.09',2),
(10100,'S18_4409',22,'75.46',4),
(10100,'S24_3969',49,'35.29',1),
(10101,'S18_2325',25,'108.06',4),
(10101,'S18_2795',26,'167.06',1),
(10101,'S24_1937',45,'32.53',3),
(10101,'S24_2022',46,'44.35',2),
(10102,'S18_1342',39,'95.55',2),
(10102,'S18_1367',41,'43.13',1),
(10103,'S10_1949',26,'214.30',11);
- Perform the following query to unload a query result into the S3 bucket. If you created an S3 bucket with a different name, replace dojo-lake with that in the query. Replace REDSHIFT-ROLE with the ARN of the Redshift IAM Role you created previously.
UNLOAD ('select-statement')
TO 'S3://object-path/name-prefix'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
[ option [ ... ] ]
where option is
{ [ FORMAT [ AS ] ] CSV | PARQUET
| PARTITION BY ( column_name [, ... ] ) [ INCLUDE ]
| MANIFEST [ VERBOSE ]
| HEADER
| DELIMITER [ AS ] 'delimiter-char'
| FIXEDWIDTH [ AS ] 'fixedwidth-spec'
| ENCRYPTED [ AUTO ]
| BZIP2
| GZIP
| ZSTD
| ADDQUOTES
| NULL [ AS ] 'null-string'
| ESCAPE
| ALLOWOVERWRITE
| PARALLEL [ { ON | TRUE } | { OFF | FALSE } ]
| MAXFILESIZE [AS] max-size [ MB | GB ]
| REGION [AS] 'aws-region' }
UNLOAD ('SELECT * FROM orders')
TO 's3://dojo-lake/orders/'
FORMAT AS PARQUET
PARTITION BY (orderLineNumber)
CREDENTIALS 'aws_iam_role={REDSHIFT-ROLE}';
- In the above query, you attempted to unload the results of the query “SELECT * FROM orders” to the S3 bucket. You further partitioned the data files on the basis of the orderLineNumber field. You can view the query results in partitioned files under the orders folder by going to the S3 bucket.
Step 5: Clean Up (Optional)
If you don’t want to incur any additional cost then you can clean up the resources in case you don’t require them.
- Delete dojoredshift Redshift Cluster.
- Delete dojo-lake S3 Bucket.
- Delete dojoredshiftrole IAM Role.
Learn More About:
Conclusion
In this article, you have learned about the Amazon Redshift, Amazon S3 and their key features. This article also provided information on Amazon Redshift Data Lake export features and the steps in setting up Amazon Redshift Data Lake Export.
Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations with a few clicks.
Hevo Data with its strong integration with 150+ sources (including 30+ free sources) allows you to not only export data from your desired data sources & load it to the destination of your choice such as Amazon Redshift, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools.
Want to take Hevo for a spin? Sign Up or a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also checkout our unbeatable pricing to choose the best plan for your organization.
Share your experience of understanding the Amazon Redshift Data Lake Export feature in the comment section below! We would love to hear your thoughts.
Manisha Jena is a data analyst with over three years of experience in the data industry and is well-versed with advanced data tools such as Snowflake, Looker Studio, and Google BigQuery. She is an alumna of NIT Rourkela and excels in extracting critical insights from complex databases and enhancing data visualization through comprehensive dashboards. Manisha has authored over a hundred articles on diverse topics related to data engineering, and loves breaking down complex topics to help data practitioners solve their doubts related to data engineering.