Amazon Redshift Parquet: 2 Easy Methods

on Amazon Redshift, Data Warehouses, Tutorials • September 3rd, 2021 • Write for Hevo

Redshift Parquet - Featured image

Scanning a lot of data is easy but not necessarily quick. In this dynamic and ever-growing world, it is a necessity for every business to scan and analyze their data speedily and effectively. Every decision made after analyzing your customer data decides the future of your business. Therefore, it is important to use the right tools and strategies. This is where Amazon Redshift Parquet Integration comes into play.

Parquet is based on the Columnar Model and is available to all the projects in the Hadoop Ecosystem. It has made it easier to process complex data in bulk since the time businesses have started using NoSQL data in abundance. But to analyze your data you need a bigger platform that can provide you relevant tools. Amazon Redshift is one of the best options for that. Amazon Redshift is a Data Warehousing Solution from Amazon Web Services and is one of the biggest in the market. It helps you to transfer your data into a centralized location for easy access and analysis.

The article introduces you to Amazon Redshift and Parquet. It also provides a comprehensive step-by-step guide to loading your Parquet file to Amazon Redshift Data Warehouse manually. It also provides the limitation of Amazon Redshift Parquet Integration that you might face while working with Parquet files.

Table of Contents

What is Amazon Redshift?

redshift parquet - amazon redshift logo | Hevo Data
Image Source

Amazon Redshift is a Data Warehousing Solution from Amazon Web Services (AWS). A data warehouse is a central repository where raw data, metadata, and aggregated data are stored for easy access. The data is taken from operational systems, transformed i.e. cleaned and integrated, and then stored in a format suitable for easy access.

Amazon Redshift works on Shared-Nothing Massively Parallel Processing (MPP) architecture and stores data in the Structural format. In this type of architecture, the Nodes work independently and do not share the same memory or storage. Different data is processed in parallel in different Nodes which makes processing data more effective and efficient.

However, getting raw data from multiple sources and loading it to your Amazon Redshift Data Warehouse in an analysis-ready form is a time-consuming task. A portion of your engineering bandwidth always is engaged in continuously monitoring and maintaining the ever-evolving data connectors. A more economical and effortless approach is opting for a Cloud-based ETL Tool like Hevo Data that simply the data migration from 100+ sources to your Redshift Data Warehouse.

Key Features of Amazon Redshift

Listed below are some of the key features of Amazon Redshift:

  • As soon as data is stored or created in Amazon Redshift, a copy of that data is made, and through an encrypted SSL Connection, a snapshot of it is stored in Amazon S3. This allows you to recover your data in times of crisis.
  • Amazon Redshift allows both Horizontal and Vertical Scaling. It also provides Concurrency Scaling to handle sudden spikes in workloads by elastically Scaling the resources as needed. It automatically adds temporary Clusters when queries start to get backlogged because of high traffic and automatically removes them as soon as activity reduces.
  • Amazon Redshift provides a Data API that simplifies data access and ingestion.

For more information on Amazon Redshift, click here.

Simplify Redshift ETL Using Hevo’s No-code Data Pipeline

Hevo Data, an Automated No Code Data Pipeline can help you ETL your data swiftly from a multitude of sources to Redshift in real-time without any assistance from your side. You can set up the Redshift Destination on the fly, as part of the Pipeline creation process, or independently.

Hevo’s fault-tolerant ETL Pipeline offers you a secure option to unify data from 100+ other sources (including 40+ free sources) and store it in Redshift or any other Data Warehouse of your choice without writing a single line of code. You can entrust us with your data transfer process and enjoy a hassle-free experience. This way, you can focus more on Data Analysis, instead of data consolidation.

ETL your Data to Redshift for Free

What is Parquet?

redshift parquet - Parquet logo | Hevo Data
Image Source

Parquet follows the Columnar Storage Model and is available to any project in the Hadoop Ecosystem. Unlike the traditional Sequential Storage Model where data is written in sequence, the Columnar Storage Model stores column values together. Although the Sequential Storage Model has advantages in processing transactions, it is not suitable for running Analytical Queries on Big Data.

Parquet uses the Record Shredding and Assembly Algorithm along with a Columnar Storage Model which makes it more efficient to run any Analytical Queries. With Parquet, you can work with complex data more easily. It has features that provide optimized solutions for efficient Data Compression.

Key Features of Parquet

Listed below are some of the key features of Parquet:

  • It works on the Columnar Storage Model that allows you to skip over non-relevant data easily and quickly. This increases the query processing speed of Parquet and minimizes the time to access your data.
  • Parquet supports advanced, nested, and complex data structures. This allows you to store relational as well as non-relational data easily.
  • Parquet easily integrates with other platforms like Amazon Redshift, Google BigQuery, AWS Athena, etc. This enhances the functionalities of Parquet.

For more information on Parquet, click here.

Methods to Connect Amazon Redshift Parquet

redshift parquet - Redshift to Parquet Integration image | Hevo Data
Image Source

There are mainly 2 methods that you can use to connect Amazon Redshift Parquet. Each of these methods is suited for different use cases. You can choose to go with any one of these methods based on your business requirements. Listed below are the methods to connect Amazon Redshift Parquet:

Let’s dive into these methods one by one.

1) Amazon Redshift Parquet: Using Amazon Redshift’s COPY Command

This is one of the simplest methods for Amazon Redshift Parquet Integration. Amazon Redshift supports COPY from 6 file formats namely ACRO, CSV, Parquet, ORC, JSON, and TXT.

  • You can use the following COPY command syntax to connect Amazon Redshift Parquet and copy Parquet files to Amazon Redshift:
COPY table-name 
[ column-list ]
FROM data_source
authorization
[ [ FORMAT ] [ AS ] data_format ] 
[ parameter [ argument ] [, ... ] ]
  • Suppose you want to connect Amazon Redshift Parquet and load the Parquet files inside the “parquet” folder at the Amazon S3 location “s3://mybucket/data/listings/parquet/”, then use the following command:
COPY listing
FROM 's3://mybucket/data/listings/parquet/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyAmazon RedshiftRole'
FORMAT AS PARQUET;

The above code will connect Amazon Redshift Parquet and will copy the data from Parquet to Amazon Redshift for Analytical purposes. Although this is the easiest method of all, it has its limitations. This method works well when you want to copy your data only once. By default, the COPY command appends the new data to the existing table. Moreover, the COPY command does not give you the ability to automate this whole process.

That’s it. You have successfully copied or loaded your Parquet data to Amazon Redshift using the COPY command.

Use FILLRECORD while loading Parquet data from Amazon S3

With the COPY command, you can also use FILL RECORD given that the Parquet file has fewer fields entered than the Amazon Redshift target table. To explain the importance of FILLRECORD, this Redshift Parquet example first shows loading the file into the table without specifying the FILLRECORD parameter with the COPY command & then using the FILLRECORD parameter with the COPY command.

To carry out this Redshift Parquet example, you need to have the following components:

  • Amazon Redshift cluster with database, public schema, aws user as an administrator user, and AWS Identity and Access Management (IAM) role used to execute COPY commands Load files from Amazon S3 connected to an Amazon Redshift cluster.
  • An Amazon Redshift table named call_center_parquet.
  • Parquet files that have already been uploaded to your S3 bucket. From there, the files are copied to your Amazon Redshift cluster.
  • Step 1: To connect Redshift Parquet for copying, you can start by defining the call_center_parquet table with 31 columns as follows:
DROP TABLE IF EXISTS public.call_center_parquet;
CREATE TABLE IF NOT EXISTS public.call_center_parquet
(
	cc_call_center_sk INTEGER NOT NULL ENCODE az64
	,cc_call_center_id varchar(100) NOT NULL  ENCODE lzo
	,cc_rec_start_date VARCHAR(50)   ENCODE lzo
	,cc_rec_end_date VARCHAR(50)   ENCODE lzo
	,cc_closed_date_sk varchar (100)   ENCODE lzo
	,cc_open_date_sk INTEGER   ENCODE az64
	,cc_name VARCHAR(50)   ENCODE lzo
	,cc_class VARCHAR(50)   ENCODE lzo
	,cc_employees INTEGER   ENCODE az64
	,cc_sq_ft INTEGER   ENCODE az64
	,cc_hours VARCHAR(20)   ENCODE lzo
	,cc_manager VARCHAR(40)   ENCODE lzo
	,cc_mkt_id INTEGER   ENCODE az64
	,cc_mkt_class VARCHAR(50)   ENCODE lzo
	,cc_mkt_desc VARCHAR(100)   ENCODE lzo
	,cc_market_manager VARCHAR(40)   ENCODE lzo
	,cc_division INTEGER   ENCODE az64
	,cc_division_name VARCHAR(50)   ENCODE lzo
	,cc_company INTEGER   ENCODE az64
	,cc_company_name VARCHAR(50)   ENCODE lzo
	,cc_street_number INTEGER   ENCODE az64
	,cc_street_name VARCHAR(60)   ENCODE lzo
	,cc_street_type VARCHAR(15)   ENCODE lzo
	,cc_suite_number VARCHAR(10)   ENCODE lzo
	,cc_city VARCHAR(60)   ENCODE lzo
	,cc_county VARCHAR(30)   ENCODE lzo
	,cc_state CHAR(2)   ENCODE lzo
	,cc_zip INTEGER   ENCODE az64
	,cc_country VARCHAR(20)   ENCODE lzo
	,cc_gmt_offset NUMERIC(5,2)   ENCODE az64
	,cc_tax_percentage NUMERIC(5,2)   ENCODE az64
)
DISTSTYLE ALL
;
ALTER TABLE public.call_center_parquet OWNER TO awsuser;

The Parquet file has 29 columns and doesn’t contain any value for the cc_gmt_offset and cc_tax_percentage fields. To connect Redshift Parquet for copying, the schema definition for the Parquet file located in Amazon S3, which is loaded into Amazon Redshift is shown below.

redshift parquet - Parquet File Schema 
| Hevo Data
Image Source
  • Step 2: Connect Redshift Parquet for copying without the FILLRECORD parameter.
COPY call_center_parquet
FROM 's3://*****************/parquet/part-00000-d9a3ab22-9d7d-439a-b607-2ddc2d39c5b0-c000.snappy.parquet'
iam_role 'arn:aws:iam::**********:role/RedshiftAttachedRole'
FORMAT PARQUET;

In the Redshift Parquet output, it generated an error.

redshift parquet - Without Fillrecord error | Hevo Data
Image Source
  • Step 3: Connect Redshift Parquet for copying with the FILLRECORD parameter.
COPY call_center_parquet
FROM 's3://*****************/parquet/part-00000-d9a3ab22-9d7d-439a-b607-2ddc2d39c5b0-c000.snappy.parquet'
iam_role 'arn:aws:iam::**********:role/RedshiftAttachedRole'
FORMAT PARQUET FILLRECORD;

The Redshift Parquet copy operation was performed successfully and NULL was entered into the cc_gmt_offset and cc_tax_percentage columns.

redshift parquet - With Fillrecord output | Hevo Data
Image Source

What makes Hevo’s Data Ingestion Experience Unique?

Ingesting data can be a tiresome task without the right set of tools. Hevo’s Data Ingestion platform empowers you with everything you need to have a smooth Data Ingestion, Processing, and integration experience for Redshift Warehouse. Our platform has the following in store for you!

  • Smooth Schema Mapping: Fully-managed Automated Schema Management for incoming data with the desired destination.
  • Blazing-fast Setup: Straightforward interface for new customers to work on, with minimal setup time.
  • Built to Scale: Exceptional Horizontal Scalability with Minimal Latency for Modern-data Needs.
  • Data Transformations: Best-in-class & Native Support for Complex Data Transformation at fingertips. Code & No-code Flexibility ~ designed for everyone.
  • Built-in Connectors: Support for 100+ Data Sources, including Databases, SaaS Platforms, Files & More to destinations like Amazon Redshift. Native Webhooks & REST API Connector available for Custom Sources.
  • Exceptional Security: A Fault-tolerant Architecture that ensures Zero Data Loss.
Sign up here for a 14-Day Free Trial!

2) Amazon Redshift Parquet: Using Amazon Redshift Data Pipeline

This is a complex and time-consuming method to connect Amazon Redshift Parquet. Unfortunately, this method does not provide you with an option to copy your data directly from Parquet. You need to have your data inside your Amazon S3 Bucket first and only from there, you can import it to your Amazon Redshift Data Warehouse using Amazon Redshift Data Pipeline. This method is majorly divided into 2 steps as follows:

Step 1: Upload the Parquet File to your Amazon S3 Bucket

Follow the steps below to create an Amazon S3 Bucket and upload the Parquet file to that bucket:

  • Sign in to your AWS Management Console using this link.
  • Under “Storage & Content Delivery”, select “S3” as shown in the image below.
redshift parquet - AWS Services Page image | Hevo Data
Image Source
  • You will be redirected to Amazon S3 Console Dashboard. Click on “Create Bucket”.
  • Under the “Bucket Name” textbox, type the name for your Bucket as shown in the image below.
redshift parquet - create bucket dialog box | Hevo Data
Image Source
  • Under “Region”, choose the region which is closest to your place. This will optimize latency and minimize cost.
    Note: If you are living in India, then choose “Mumbai” under “Region”.
  • Once you are done, click on “Create”. This will take some time and the console will display your empty Bucket in the “Buckets” window.
  • Click on the Bucket that you just created and click on Upload > Add Files as shown in the image below.
redshift parquet - upload button | Hevo Data
Image Source
  • Now, find the file you want to upload and then click on Open > Start Upload. This will start the upload process which you can watch in the “Transfer” window.
redshift parquet - Start Upload button 
 | Hevo Data
Image Source

Step 2: Copy Data from Amazon S3 Bucket to Amazon Redshift Data Warehouse

Following are the steps to copy the data that you stored in the last step from Amazon S3 Bucket to Amazon Redshift Data Warehouse. Hence, establishing an Amazon Redshift Parquet connection.

  • Create an Amazon Redshift Data Pipeline from the AWS Management Console using this link.
  • Once you have successfully created an Amazon Redshift Data Pipeline, select the “Load Data from S3 into Amazon Redshift” template under the “Build using a template” option. This will copy data from your Amazon S3 Bucket into an Amazon Amazon Redshift table. It also gives you the option to write a query and create a new table rather than just copying the data into an existing table.
    Note: If you are copying the data into an existing file then the table should have the same schema as the data in Amazon S3 Bucket. 
  • Once you are done with the above step, you can use the following COPY command to make the process of copying data from Amazon S3 Bucket to your Amazon Redshift table automated. For easier understanding, let’s assume the file name as “attendance.parquet”
copy TABLENAME 
from 
's3://<s3bucket>/<s3folder>/attendance.parquet' 
iam_role 'arn:aws:iam::<actid>:role/<rolenm>' 
format as parquet;
  • This will give authorization to the COPY command to access your Amazon S3 Bucket via AWS Identity and Access Management (IAM) Role. For more information on AWS Identity and Access Management (IAM) Role, click here.

That’s it. You have successfully copied or loaded your Parquet data to Amazon Redshift using the Amazon Redshift Data Pipeline.

Limitations of Amazon Redshift Parquet Integration

Now that you have a basic understanding of Amazon Redshift and Parquet and have established the Amazon Redshift Parquet connection, let’s look into the challenges or limitations you might face while working with Parquet files. Listed below are the limitations of Amazon Redshift Parquet Integration:

  • Difficult to Modify or Update Data Type: A Parquet file houses the schema information in the file metadata. Therefore, it does not allow you to modify or update the data type of any column in the Amazon Redshift table. Therefore the schema of the Parquet table and the Amazon Redshift table should always be the same.
  • Difficult to Debug: It is difficult to debug any error in a Parquet file as Amazon Redshift does not provide the option to preview a Parquet file. You need an external tool to preview the file and rectify the errors.
  • Immutable in Nature: Unlike a CSV file, a Parquet file is immutable to any changes.

Conclusion

This article introduced you to Amazon Redshift and Apache Parquet. It also provided 2 main methods that you can use to establish the Amazon Redshift Parquet Integration. Furthermore, it also provided the limitations or challenges you might face with Amazon Redshift Parquet Integration. Amazon Redshift provides its methods to establish this integration but truth be told, these are not easy to implement.

Both the methods discussed in this article demand high technical knowledge and experience. Moreover, loading data from any source to Redshift manually is a tough nut to crack. You will have to write a complex custom script from scratch and invest a lot of time and resources. Furthermore, such a method will require high maintenance and regular debugging.

Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. You can leverage Hevo to seamlessly transfer data from various sources to Redshift in real-time without writing a single line of code. Hevo’s Data Pipeline enriches your data and manages the transfer process in a fully automated and secure manner. Hevo caters to 100+ data sources (including 40+ free sources) and can directly transfer data to Data Warehouses, Business Intelligence Tools, or any other destination of your choice in a hassle-free manner. It will make your life easier and make data migration hassle-free.

Visit our Website to Explore Hevo

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. Do check out the pricing details to understand which plan fulfills all your business needs.

Share your experience of setting up Redshift Parquet Integration in the comment section below!

No-code Data Pipeline for Amazon Redshift