Amazon Redshift Data Types

This blog explains Redshift data types in depth. Before getting to that, here is a brief introduction to Redshift itself.

What is Amazon Redshift – A Brief Introduction

Amazon Redshift is a petabyte-scale cloud data warehouse service built around clusters, each of which is a collection of nodes. It is fully managed: it handles the work of setting up, operating, and scaling the data warehouse. This makes it a simple and cost-effective way to analyze all of a company’s data using standard SQL and existing Business Intelligence (BI) tools.

Redshift uses columnar data storage, and its multi-node architecture helps it achieve high throughput. Users can analyze petabytes of structured data with both simple and complex queries, which are executed using sophisticated query optimization, massively parallel query execution, and columnar storage on high-performance local disks.

This blog covers the following aspects:

Amazon Redshift Data Types
Redshift Data Types – Compatibility
Challenges While Dealing with Redshift Data Types

Amazon Redshift Data Types

Redshift supports ingesting many different data types into the warehouse, which lets a user or a system handle a wide range of use cases.

A column’s data type also restricts which values can be ingested, which preserves the integrity of the data. Each value stored in or retrieved from an Amazon Redshift table has a data type with a fixed set of associated properties and constraints.

Data types are declared when tables are created. They can be changed later if required, subject to the compatibility constraints discussed in the coming sections. A data type constrains the set of values that a column or argument can contain.

Redshift tables support a modest range of data types, listed below:

Redshift Data Types	Aliases	Description
SMALLINT	INT2	Signed two-byte integer
INTEGER	INT, INT4	Signed four-byte integer
BIGINT	INT8	Signed eight-byte integer
DECIMAL	NUMERIC	Exact numeric of selectable precision
REAL	FLOAT4	Single precision floating-point number
DOUBLE PRECISION	FLOAT8, FLOAT	Double-precision floating-point number
BOOLEAN	BOOL	Logical Boolean (true/false)
CHAR	CHARACTER, NCHAR, BPCHAR	Fixed-length character string
VARCHAR	CHARACTER VARYING, NVARCHAR, TEXT	Variable-length character string with a user-defined limit
DATE		Calendar date (year, month, day)
TIMESTAMP	TIMESTAMP WITHOUT TIME ZONE	Date and time (without time zone)
TIMESTAMPTZ	TIMESTAMP WITH TIME ZONE	Date and time (with time zone)

Source: Amazon Redshift official documentation

Each of the data types above belongs to one of the following groups:

Numeric types
- Integer: Integer types occupy 2 to 8 bytes of storage depending on the type used: SMALLINT (2 bytes), INTEGER (4 bytes), or BIGINT (8 bytes)
- Decimal: Decimal values occupy a variable amount of storage depending on the user-defined precision, up to 128-bit signed integers with up to 38 digits of precision
- Float/real: The REAL and DOUBLE PRECISION types store numeric values with variable precision. REAL occupies 4 bytes with 6 significant digits of precision, and DOUBLE PRECISION occupies 8 bytes with 15 significant digits of precision
Character types
- Char: CHAR columns store fixed-length sequences of single-byte characters, up to 4096 bytes
- Varchar: VARCHAR columns store variable-length strings and support multibyte characters of up to 4 bytes each, with a maximum of 65535 bytes
- BPCHAR, TEXT, NCHAR, and NVARCHAR are also accepted as character types in Redshift, but they are implicitly stored as CHAR or VARCHAR
Datetime types
- Date: DATE stores only the calendar date, without time or time zone information. It occupies 4 bytes of storage
- Timestamp: TIMESTAMP stores date and time without time zone information. It occupies 8 bytes of storage
- Timestamp with time zone (TIMESTAMPTZ): TIMESTAMPTZ stores date and time together with time zone information. When an input value includes a time zone, the value is converted to UTC (Coordinated Universal Time) and the converted value is stored
Note: To see the supported time zones, run this command in Redshift:
```
select pg_timezone_names();
```
Boolean type
- Boolean: The BOOLEAN data type stores true and false values in a single-byte column. Regardless of the input string, a Boolean column always stores and displays ‘t’ for true and ‘f’ for false
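
The storage sizes and the UTC behaviour listed above can be checked with a little arithmetic. The following is a minimal Python sketch, not Redshift code: it derives the signed range implied by each integer size, counts the UTF-8 bytes a string would need against a VARCHAR limit, and normalizes a time-zone-aware value to UTC in the way described for TIMESTAMPTZ. The function names (`signed_range`, `smallest_integer_type`, `utf8_length`, `to_utc`) are hypothetical helpers introduced only for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Storage sizes in bytes, as listed for the integer types above.
INTEGER_TYPES = {"SMALLINT": 2, "INTEGER": 4, "BIGINT": 8}


def signed_range(num_bytes):
    """Return (min, max) for a two's-complement signed integer of num_bytes."""
    bits = num_bytes * 8
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_integer_type(value):
    """Pick the narrowest listed integer type whose range holds value, or None."""
    for name, size in INTEGER_TYPES.items():
        low, high = signed_range(size)
        if low <= value <= high:
            return name
    return None


def utf8_length(text):
    """Length in bytes (not characters) of text when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def to_utc(value):
    """Convert a time-zone-aware datetime to the equivalent UTC datetime."""
    return value.astimezone(timezone.utc)


print(smallest_integer_type(32767))   # largest value that fits in 2 bytes
print(smallest_integer_type(32768))   # one more needs the 4-byte type
print(utf8_length("naïve"))           # 5 characters, one of them multibyte

# Assumed sample input: noon at a +05:30 offset.
offset = timezone(timedelta(hours=5, minutes=30))
print(to_utc(datetime(2024, 1, 15, 12, 0, tzinfo=offset)))
```

The byte count matters because the CHAR and VARCHAR limits above are expressed in bytes, so a string with multibyte characters can exceed a limit that its character count would appear to satisfy.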

Redshift Data Types – Compatibility

Redshift allows the data types above to be stored in its tables. However, query operations and comparisons across columns do not require the columns’ data types to be identical. This is where compatibility and conversion (implicit or explicit) come into the picture in Redshift.

Data type compatibility, which includes matching data types to constants and literal values, is checked during various database operations, including the following:

Data manipulation language (DML) operations on tables
UNION, INTERSECT, and EXCEPT queries
CASE expressions
Evaluation of predicates, such as LIKE and IN
Evaluation of SQL functions that do comparisons or extractions of data
Comparisons with mathematical operators

The results of these operations depend on type conversion rules and Redshift data type compatibility. Compatibility among Redshift data types is defined implicitly; where two data types are incompatible, the user can apply an explicit conversion to make them compatible.

The direction of conversion also matters. For example, a numeric value can be implicitly converted to a CHAR or VARCHAR value in a comparison, but the reverse requires an explicit conversion such as CAST or CONVERT.

Data types within the same group, such as the numeric types, are compatible with one another and eligible for implicit conversion.

There are two types of implicit conversions in Redshift:

Implicit conversions in assignments, such as setting values in INSERT or UPDATE commands
Implicit conversions in expressions, such as performing comparisons in the WHERE clause

Implicit and explicit conversions among the data types in Redshift follow the rules in the table below:

From Type	To Type(s)
BIGINT (INT8)	BOOLEAN, CHAR, DECIMAL (NUMERIC), DOUBLE PRECISION (FLOAT8), INTEGER (INT, INT4), REAL (FLOAT4), SMALLINT (INT2), VARCHAR
CHAR	VARCHAR
DATE	CHAR, VARCHAR, TIMESTAMP, TIMESTAMPTZ
DECIMAL (NUMERIC)	BIGINT (INT8), CHAR, DOUBLE PRECISION (FLOAT8), INTEGER (INT, INT4), REAL (FLOAT4), SMALLINT (INT2), VARCHAR
DOUBLE PRECISION (FLOAT8)	BIGINT (INT8), CHAR, DECIMAL (NUMERIC), INTEGER (INT, INT4), REAL (FLOAT4), SMALLINT (INT2), VARCHAR
INTEGER (INT, INT4)	BIGINT (INT8), BOOLEAN, CHAR, DECIMAL (NUMERIC), DOUBLE PRECISION (FLOAT8), REAL (FLOAT4), SMALLINT (INT2), VARCHAR
REAL (FLOAT4)	BIGINT (INT8), CHAR, DECIMAL (NUMERIC), INTEGER (INT, INT4), SMALLINT (INT2), VARCHAR
SMALLINT (INT2)	BIGINT (INT8), BOOLEAN, CHAR, DECIMAL (NUMERIC), DOUBLE PRECISION (FLOAT8), INTEGER (INT, INT4), REAL (FLOAT4), VARCHAR
TIMESTAMP	CHAR, DATE, VARCHAR, TIMESTAMPTZ
TIMESTAMPTZ	CHAR, DATE, VARCHAR, TIMESTAMP
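
The table above can be encoded as a lookup so that a pipeline can check a planned conversion before running it. The sketch below is plain Python that only transcribes the rows and aliases given in this article; it does not query Redshift, and `canonical` and `is_listed_conversion` are hypothetical helper names. A pair that is absent from the table returns `False`, which here means only "not listed above".

```python
# From-type -> to-types, transcribed from the conversion table above.
CONVERSIONS = {
    "BIGINT": {"BOOLEAN", "CHAR", "DECIMAL", "DOUBLE PRECISION",
               "INTEGER", "REAL", "SMALLINT", "VARCHAR"},
    "CHAR": {"VARCHAR"},
    "DATE": {"CHAR", "VARCHAR", "TIMESTAMP", "TIMESTAMPTZ"},
    "DECIMAL": {"BIGINT", "CHAR", "DOUBLE PRECISION", "INTEGER",
                "REAL", "SMALLINT", "VARCHAR"},
    "DOUBLE PRECISION": {"BIGINT", "CHAR", "DECIMAL", "INTEGER",
                         "REAL", "SMALLINT", "VARCHAR"},
    "INTEGER": {"BIGINT", "BOOLEAN", "CHAR", "DECIMAL",
                "DOUBLE PRECISION", "REAL", "SMALLINT", "VARCHAR"},
    "REAL": {"BIGINT", "CHAR", "DECIMAL", "INTEGER", "SMALLINT", "VARCHAR"},
    "SMALLINT": {"BIGINT", "BOOLEAN", "CHAR", "DECIMAL",
                 "DOUBLE PRECISION", "INTEGER", "REAL", "VARCHAR"},
    "TIMESTAMP": {"CHAR", "DATE", "VARCHAR", "TIMESTAMPTZ"},
    "TIMESTAMPTZ": {"CHAR", "DATE", "VARCHAR", "TIMESTAMP"},
}

# Aliases taken from the data type table earlier in the article.
ALIASES = {
    "INT2": "SMALLINT", "INT": "INTEGER", "INT4": "INTEGER",
    "INT8": "BIGINT", "NUMERIC": "DECIMAL", "FLOAT4": "REAL",
    "FLOAT8": "DOUBLE PRECISION", "FLOAT": "DOUBLE PRECISION",
    "BOOL": "BOOLEAN",
}


def canonical(name):
    """Upper-case a type name, collapse whitespace, and resolve aliases."""
    cleaned = " ".join(name.upper().split())
    return ALIASES.get(cleaned, cleaned)


def is_listed_conversion(source, target):
    """True if source and target match or the table lists source -> target."""
    src, dst = canonical(source), canonical(target)
    return src == dst or dst in CONVERSIONS.get(src, set())


print(is_listed_conversion("int", "varchar"))      # numeric to character
print(is_listed_conversion("varchar", "integer"))  # reverse direction
print(is_listed_conversion("date", "timestamptz"))
```

The second call illustrates the point about direction: VARCHAR has no row of its own in the table, so going from character data back to a numeric type is the case where an explicit CAST or CONVERT is needed.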

Important things to keep in mind while defining the data type for a column in Redshift:

Keep the column size as small as possible so that query execution time and the space required for temporary tables are kept to a minimum
Store data in a column of its corresponding data type. For example, DO NOT keep date-time information as varchar in your table, because Redshift stores and optimises each data type based on the values it expects. Storing date-time information in date-type columns therefore improves query execution performance
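
To make the date-as-varchar point concrete, here is a small Python sketch (an illustration on assumed sample strings, not Redshift behaviour). A text column would accept every value in `raw_values`, while a typed date rejects the ones that are not real calendar dates. `parse_iso_date` is a hypothetical helper name.

```python
from datetime import date


def parse_iso_date(text):
    """Return a date for valid YYYY-MM-DD text, or None if it is not a date."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None


# Assumed sample input: one valid date, one impossible date, one other format.
raw_values = ["2024-01-15", "2024-02-30", "15/01/2024"]
for raw in raw_values:
    print(raw, "->", parse_iso_date(raw))

# As text, an ISO date needs one byte per character; DATE is listed as 4 bytes.
print(len("2024-01-15".encode("utf-8")), "bytes as text, 4 bytes as DATE")
```

Running the same kind of check over a sample of source data before loading shows how many values would fail to fit a DATE column.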

Challenges While Dealing with Redshift Data Types

Every system and feature comes with its advantages, but each also poses some challenges for its users. The same goes for Redshift. The following are some of the challenges one might face while dealing with Redshift data types:

A few data types found in some traditional data warehousing/RDBMS solutions, such as LOBs or LONG, are absent
Data types need to be selected very carefully to avoid failures while ingesting data into warehouse tables or running ETL processes. A complete understanding of the data is required to design tables effectively and efficiently and to avoid ingestion issues

One way to keep this from becoming a breakpoint is to use an automated Data Integration platform like Hevo Data to move data from any source into Redshift. Hevo helps you bring data from 150+ sources to Redshift in real time, without having to write a single line of code.

Let’s look at some key features of Hevo:

Simple: Hevo offers a simple and intuitive user interface. It has a minimal learning curve.
Fully Automated: Hevo can be set up in a few minutes and requires zero maintenance.
Fault-Tolerant: Hevo automatically detects anomalies and notifies you instantly. All affected rows are kept aside for correction.
Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
Scalability: Hevo is built to handle millions of records per minute without any latency.
Live Support: The Hevo team is available round the clock to support customers through chat, email, and support calls.

Hevo automatically picks up the schema of the source and maps each source data type to the closest Redshift data type. This removes the hassle and makes data ingestion a breeze.

A company might have to (and most do) keep a staging or initial layer into which data coming from the source is dumped directly. These staging tables have no data type constraints defined and mostly use ‘varchar’ fields to avoid dropping any source data. A subsequent layer of tables is then designed so that each column holds data under a relevant and efficient data type. This leads to more space consumption, because the data is duplicated to avoid failure or data loss during initial ingestion, creating a redundant data layer that could otherwise be avoided.
Data types need to be selected very carefully so that query performance is not negatively affected. Avoid making columns unnecessarily large or storing the wrong type of data in a column (relying on implicit or explicit conversion).
A wrong choice of data type can also cause partial data loss without any system failure. For example, if decimal data is stored in an integer column, the fractional part is silently lost, which can cause serious problems. This kind of issue might not be noticed in the early stages of development, so data profiling is a very important step before selecting data types in Redshift.
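
The data profiling step mentioned above can start very simply. The following Python sketch takes a sample of raw string values, such as those held in a varchar staging column, and reports which ones would lose information in an integer column. It is a rough illustration under the assumption that the sample represents the full column; `profile_numeric_column` and its `suggested` labels are hypothetical, not a Redshift or Hevo feature.

```python
from decimal import Decimal, InvalidOperation


def profile_numeric_column(values):
    """Summarise raw string values to guide the choice of a numeric type."""
    parsed = []
    unparseable = []
    for raw in values:
        try:
            number = Decimal(raw.strip())
        except InvalidOperation:
            unparseable.append(raw)
            continue
        if not number.is_finite():
            # NaN and Infinity parse as Decimal but are not plain numbers.
            unparseable.append(raw)
            continue
        parsed.append(number)

    # Values whose fractional part would be lost in an integer column.
    fractional = [n for n in parsed if n != n.to_integral_value()]
    # Largest number of digits after the decimal point seen in the sample.
    max_scale = max((-n.as_tuple().exponent for n in parsed), default=0)
    max_scale = max(max_scale, 0)

    if unparseable:
        suggested = "VARCHAR"
    elif fractional:
        suggested = "DECIMAL"
    else:
        suggested = "INTEGER"
    return {
        "rows": len(values),
        "unparseable": unparseable,
        "fractional": [str(n) for n in fractional],
        "max_scale": max_scale,
        "suggested": suggested,
    }


# Assumed sample of prices as they might arrive in a staging table.
print(profile_numeric_column(["10", "19.99", "7.5"]))
print(profile_numeric_column(["1", "2", "n/a"]))
```

A non-empty `fractional` list is the early warning for the silent loss described above, and `max_scale` gives a starting point for the scale of a DECIMAL column.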

Conclusion

We hope this blog has given a detailed overview of Redshift data types, their variants, and their shortcomings.

You can overcome these shortcomings by using Hevo. Hevo is a No-code Data Pipeline. It supports pre-built data integrations from 150+ data sources at a reasonable price. It will automate your data flow easily.

Do you have any questions for us? Let us know in the comments.

Antariksh Goyal Freelance Technical Content Writer, Hevo Data

Antariksh has a flair for writing and loves to combine that with his problem-solving approach to help data teams solve complex business problems.
