The popularity of the Cloud as a Data Storage Platform has grown manifolds as individuals and businesses have recognized the benefits of storing their data in the Cloud Storage Systems. For this reason, most companies have now moved their data from On-Premise Databases to Cloud Storage Systems. The choice of the Data Carrier is of crucial importance for any organization, as it determines the speed at which queries are processed and the costs involved. Cloud Storage Systems are cheaper for data storage as they don’t burden users with the responsibility of Server Maintenance.
Amazon Redshift is one of the prominent names in the field of Cloud Storage Services. It offers fully managed systems that can service the Data Warehousing and Database needs of businesses. Upon a complete walkthrough of this article, you will gain a decent understanding of Amazon Redshift and Redshift Spectrum along with the unique features that it offers. You will also learn about how to create Spectrum Table using Redshift Create External Table Command.
Table of Contents
What is Amazon Redshift?
Amazon Redshift is a fully managed, Cloud-based Petabyte scale Data Warehouse developed by Amazon for the storage and analysis of mammoth scales of datasets. Amazon Redshift is based on an MPP (Massively Parallel Processing) Columnar Architecture and was developed to connect with numerous SQL-based Clients, Business Intelligence, and Data Visualization tools and to make data available to users in real-time. Based on PostgreSQL 8, Amazon Redshift offers significantly enhanced performance and more efficient queries compared to all other Data Warehouses. This helps teams make data-driven business decisions and Analysis. More than 15,000 companies worldwide now use Amazon Redshift, including large companies such as Pfizer, McDonald’s, Facebook, etc.
Understanding Amazon Redshift Spectrum
Image Source
Amazon Redshift Spectrum is a component of Amazon Web Services that provides a common platform for extracting/viewing data from both a Hot Data Store and a Cold Data Store (Legacy data) without having to switch between software tools. Hot Live Data is stored in a more expensive Redshift Data Warehouse, whereas Cold Legacy Data that is accessed sparingly is stored in a less expensive Data Lake, such as an Amazon S3 bucket.
Redshift Spectrum assists many organizations in lowering storage costs by moving infrequently accessed data away from main storage and retrieving quarantined data at a reasonable speed without requiring much effort. Amazon Redshift Spectrum is an exceptional tool that offers seamless execution of complex SQL Queries against the data stored in Amazon S3.
Key Features of Redshift Spectrum
Image Source
Some of the amazing features offered by Redshift Spectrum are as follows:
- High Performance: Amazon Redshift Spectrum performs admirably when querying data in the location where it is stored. It is roughly 10x faster than other Data Warehouses. Redshift Spectrum also allows you to increase data retrieval speed by configuring catching sizes based on your data requirements.
- Cost-Effective: Amazon Redshift Spectrum is a cost-effective solution. It is roughly ten times less expensive than a traditional Data Warehouse. Redshift Spectrum is similar to Amazon Athena, and it is charged based on the volume of the data scanned.
- Scalability: Redshift Spectrum is a fully-managed platform in which Amazon handles all scaling operations based on the amount of data scanned and queried by the user.
A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate data from 100+ data sources (including 40+ Free Data Sources) to a destination of your choice such as Amazon Redshift in real-time in an effortless manner. Hevo with its minimal learning curve can be set up in just a few minutes allowing the users to load data without having to compromise performance. Its strong integration with umpteenth sources provides users with the flexibility to bring in data of different kinds, in a smooth fashion without having to code a single line.
Get Started with Hevo for Free
Check Out Some of the Cool Features of Hevo:
- Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
- Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
- 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
- Scalable Infrastructure: Hevo has in-built integrations for 100+ sources that can help you scale your data infrastructure as required.
- Connectors: Hevo supports 100+ data sources and integrations to SaaS platforms, files, databases, analytics, and BI tools. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake, Firebolt Data Warehouses; Amazon S3 Data Lakes; and MySQL, MongoDB, TokuDB, DynamoDB, PostgreSQL databases to name a few.
- 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!
How to use the Redshift Create External Table Command?
You can create an External/Spectrum Table based on the Column Definition from a query and write the results of that query to Amazon S3 by leveraging the CREATE EXTERNAL TABLE command. The output is in either Apache Parquet or Delimited Text format. If the Spectrum Table contains a Partition Key or Keys, Redshift automatically partitions new files and registers new partitions in the external catalog.
A) Syntax
CREATE EXTERNAL TABLE
external_schema.table_name
(column_name data_type [, …] )
[ PARTITIONED BY (col_name data_type [, … ] )]
[ { ROW FORMAT DELIMITED row_format |
ROW FORMAT SERDE 'serde_name'
[ WITH SERDEPROPERTIES ( 'property_name' = 'property_value' [, ...] ) ] } ]
STORED AS file_format
LOCATION { 's3://bucket/folder/' | 's3://bucket/manifest_file' }
[ TABLE PROPERTIES ( 'property_name'='property_value' [, ...] ) ]
B) Parameters
The parameters involved in the Create External Table command are as follows:
- External_schema.table_name represents the name of the table that needs to be created. The table name can occupy a maximum size of up to 127 bytes. In case, the size of the table name exceeds 127 bytes, the table name is truncated. UTF-8 multibyte characters can be up to four bytes long. Amazon Redshift enforces a Cluster Limit of 9,900 tables, which includes user-defined temporary tables as well as temporary tables created by Amazon Redshift during query processing or system maintenance. You can also optionally qualify the table name with the database name.
- column_name data_type represents the name and type of data stored in each column. The column name can occupy a maximum size of up to 127 bytes. In case, the size of the column name exceeds 127 bytes, the column name is truncated. UTF-8 multibyte characters can be up to four bytes long. Amazon Redshift generates External Tables by default, with the pseudo columns $path and $size. Setting the Spectrum enable_pseudo columns_configuration parameter to false disables the creation of pseudo columns for a session. The maximum number of columns you can define in a single table if pseudo columns are enabled is 1,598. The maximum number of columns you can define in a single table if pseudo columns are not enabled is 1,600.
- PARTITIONED BY (col_name data_type [, … ] ) represents a partitioned table clause with one or more partition columns. In some cases, a separate data directory is used for each specified combination, which can improve query performance. Partitioned columns do not exist in the table data. If you enter a value for a column name that is the same as the name of a table column, you will receive an error.
- ROW FORMAT DELIMITED rowformat represents a clause specifying the format of the underlying data. Given below are the possible values for rowformat:
- LINES TERMINATED BY ‘delimiter‘
- FIELDS TERMINATED BY ‘delimiter‘
C) Usage Notes for using Create External Table Command
Some of the key pointers that you need to consider while using the Create External Table Command are as follows:
- At times, you might want to use the Create External Table command on an AWS Glue Data Catalog, or an Apache Hive Metastore. In such cases, you create the External Schema using an AWS Identity and Access Management (IAM) role. On Amazon S3, this IAM role must have both read and write permissions.
- If you use a Lake Formation catalog, the IAM role must have the ability to create catalog tables. It must also have the Data Lake location permission on the target Amazon S3 path in this case. This IAM role gains ownership of the new AWS Lake Formation table.
- Amazon Redshift, by default, uses the following format for the name of each file uploaded to Amazon S3 to ensure that file names are unique.
<date>_<time>_<microseconds>_<query_id>_<slice-number>_part_<part-number>.<format>.
An example is 20200303_004509_810669_1007_0001_part_00.parquet.
D) Conceptual Example: Creating Spectrum Table using Create External Table Command
If you want to create a Spectrum Table, the first thing you would need to do is create an External Schema. You can use the following snippet of code to create an External Schema named users_data:
create external schema users_data
from data catalog
database 'users'
iam_role 'arn:aws:iam::282924389374:role/SpectrumRole'
create external database if not exists;
Once you have created an External Schema, you can create a Spectrum Table by retrieving the data from Amazon S3. The following code will create an External Table called mytable with rows like Gender, First Name, Last Name, Id Name, and Id Value:
create external table users_data.mytable(
id_name varchar(32),
id_value varchar(64),
gender varchar(16),
name_title varchar(32),
name_first varchar(64),
name_last varchar(64)
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://users-data-282924389374';
Conclusion
In this article, you learned about Amazon Redshift Spectrum and the salient features that it offers. You also learned about how to create Spectrum Tables using Redshift Create External Table Command. With Amazon Redshift live and running, you’ll need to extract data from multiple platforms to carry out your analysis. However, integrating and analyzing your data from a diverse set of data sources can be challenging and this is where Hevo Data comes into the picture.
Visit our Website to Explore Hevo
Hevo Data is a No-code Data Pipeline and has awesome 100+ pre-built integrations that you can choose from. Hevo can help you integrate data from numerous sources and load it into a Data Warehouse/Database like Amazon Redshift to analyze real-time data with a BI tool and create your Dashboards. It will make your life easier and make data migration hassle-free.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
Share with us your experience of creating Spectrum Tables in the comments below!
Rakesh is a Cloud Engineer with a passion for data, software architecture, and writing technical content. He has experience writing articles on various topics related to data integration and infrastructure.