This blog aims to explain the details on Redshift data types in-depth. Before we get to that, let us understand some details on Redshift first.
What is Amazon Redshift – A Brief Introduction
Amazon Redshift is a petabyte-scale data warehouse service which works on the concept of clusters – a collection of nodes. It is a fully managed and fast cloud data warehouse which in turn makes a simple and cost-effective solution for analyzing all the company’s data. The data can be analyzed using existing Business Intelligence (BI) tools and standard SQL. Redshift is a fully managed service. It manages all of the work from setting up to operating and scaling the data warehouse.
Redshift utilizes a columnar data storage method. Its multi-node architecture helps to achieve an impeccable throughput time. Redshift allows its user to analyze petabytes of structured data using complex queries. Complex (and simple too) queries are executed using sophisticated query optimization, massively parallel query execution and columnar storage on high-performance local disks.
Amazon Redshift Data Types:
Redshift supports ingestion/inserting of many different data types onto the warehouse. This, in turn, allows a user or a system to handle a wide range of use cases.
Type of data also imposes a restriction on the dataset which can be ingested in a system, which maintains the sanctity of the data. Each value stored and retrieved from an Amazon Redshift table(s) has a data type which has a fixed set of associated properties and constraints.
Data types are declared when tables are created, but can surely be changed in the future if required but with some set of constraints around compatibility. We will discuss in the coming sections. A data type constrains or limits the set of values that a column or argument can contain.
Redshift tables support a modest range of data types. Below is the list for the same:
|Redshift Data Types||Aliases||Description|
|SMALLINT||INT2||Signed two-byte integer|
|INTEGER||INT, INT4||Signed four-byte integer|
|BIGINT||INT8||Signed eight-byte integer|
|DECIMAL||NUMERIC||Exact numeric of selectable precision|
|REAL||FLOAT4||Single precision floating-point number|
|DOUBLE PRECISION||FLOAT8, FLOAT||Double-precision floating-point number|
|BOOLEAN||BOOL||Logical Boolean (true/false)|
|CHAR||CHARACTER, NCHAR, BPCHAR||Fixed-length character string|
|VARCHAR||CHARACTER VARYING, NVARCHAR, TEXT||Variable-length character string with a user-defined limit|
|DATE||Calendar date (year, month, day)|
|TIMESTAMP||TIMESTAMP WITHOUT TIME ZONE||Date and time (without time zone)|
|TIMESTAMPTZ||TIMESTAMP WITH TIME ZONE||Date and time (with time zone)|
Source: Amazon Redshift official documentation
Each of the above-mentioned data types belongs to a specific group of data type group namely:
- Numeric types
- Integer: Integer data types occupy storage from 2 bytes to 8 bytes depending on the type of integer we use like smallint, int or bigint
- Decimal: They occupy a variable amount of storage up to 128-bit signed integers with up to 38 digits of precision, depending on the user-defined precision
- Float/real: They use the REAL and DOUBLE PRECISION data types to store numeric values with variable precision. They occupy 4 bytes for real values with 6 significant digits of precision and 8 bytes for double-precision values having 15 significant digits of precision
- Character types
- Char: They can store fixed-length 1-byte character sequences. Additionally, they can store up to 4096 bytes
- Varchar: They can store multibyte characters having a space requirement of 4 bytes or less for each character. They can store up to 65535 bytes
- We also have BPCHAR, TEXT, NCHAR, NVARCHAR under the character types in Redshift but they are implicitly stored as a char or varchar type only
- Datetime types
- Date: This data type stores only the simple date information without time and timezone information. They occupy 4 bytes of storage
- Timestamp: They contain date and time information without the timezone information. They occupy 8 bytes of storage
- Timestamp with timezone information (timestampz): They contain the date and time information having time zone information along with it. When an input value includes a time zone, the value is updated to UTC (Coordinated Universal Time) and the changed value is storedNote: To see the available/ supported timezones one can execute this command in redshift:
-> select pg_timezone_names();
- Boolean Type
- Boolean: The BOOLEAN data type stores true and false value. This is stored in a single-byte column. Regardless of the input string, a Boolean will always contain and show ‘t’ for True and ‘f’ for False values
Redshift Data Types – Compatibility
Redshift allows the above-mentioned data types to be stored in its table. However, performing various query operations and comparisons among the column does not need the column data type to be exactly similar. This is where the compatibility and conversion (implicit or explicit) comes into the picture in Redshift.
Data type compatibility which includes the matching of data types to constants and literal values occurs during various database operations, including the following:
- Data manipulation language (DML) operations on tables
- UNION, INTERSECT, and EXCEPT queries
- CASE expressions
- Evaluation of predicates, such as LIKE and IN
- Evaluation of SQL functions that do comparisons or extractions of data
- Comparisons with mathematical operators
The results of these operations depend on type conversion rules and redshift data type compatibility. Compatibility in among Redshift data types is defined implicitly and if there are data types which are incompatible the user can take the help of explicit conversion to make data types compatible.
Order of data types also matters for eligibility of conversion, for example, a numeric data type can be implicitly converted to char/varchar value in comparison but for opposite to hold true one has to take the help of explicit conversion like CAST/CONVERT.
The data type under the same group category of data like numeric data types are eligible for implicit conversion and are compatible.
You can have a more detailed look on the compatible and conversion rules here on the redshift official documentation page:
There are two types of implicit conversions in Redshift:
- Implicit conversions in assignments, such as setting values in INSERT or UPDATE commands
- Implicit conversions in expressions, such as performing comparisons in the WHERE clause
Implicit conversion (and explicit) conversion among the data types in Redshift can be done as per rules mentioned in the below table:
|From Type||To Type(s)|
|BIGINT (INT8)||BOOLEAN, CHAR, DECIMAL (NUMERIC), DOUBLE PRECISION (FLOAT8), INTEGER (INT, INT4), REAL (FLOAT4), SMALLINT (INT2), VARCHAR|
|DATE||CHAR, VARCHAR, TIMESTAMP, TIMESTAMPTZ|
|DECIMAL (NUMERIC)||BIGINT (INT8), CHAR, DOUBLE PRECISION (FLOAT8), INTEGER (INT, INT4), REAL (FLOAT4), SMALLINT (INT2), VARCHAR|
|DOUBLE PRECISION (FLOAT8)||BIGINT (INT8), CHAR, DECIMAL (NUMERIC), INTEGER (INT, INT4), REAL (FLOAT4), SMALLINT (INT2), VARCHAR|
|INTEGER (INT, INT4)||BIGINT (INT8), BOOLEAN, CHAR, DECIMAL (NUMERIC), DOUBLE PRECISION (FLOAT8), REAL (FLOAT4), SMALLINT (INT2), VARCHAR|
|REAL (FLOAT4)||BIGINT (INT8), CHAR, DECIMAL (NUMERIC), INTEGER (INT, INT4), SMALLINT (INT2), VARCHAR|
|SMALLINT (INT2)||BIGINT (INT8), BOOLEAN, CHAR, DECIMAL (NUMERIC), DOUBLE PRECISION (FLOAT8), INTEGER (INT, INT4), REAL (FLOAT4), VARCHAR|
|TIMESTAMP||CHAR, DATE, VARCHAR, TIMESTAMPTZ|
|TIMESTAMPTZ||CHAR, DATE, VARCHAR, TIMESTAMP|
Important things while defining the data type for a column in Redshift:
- Try to keep the column size as minimum as possible so that the query execution time and space requirements for temporary tables are optimised as much as possible
- Try to store the data with a specific type in their corresponding data type column. For example, DO NOT keep date-time information as varchar in your table, as Redshift optimises and stores each data type more efficiently based on the value it is expecting. So storing date-time information in date types’ columns will increase the query execution performance
Challenges While Dealing with Redshift Data Types:
Every system and its feature comes with many delightful pros. However, they also pose some challenges for the users of the product. Same goes for Redshift and following are some of the challenges that one might have to face while dealing with Redshift data types:
- Absence of a few data types present in some traditional data warehousing/ RDBMS solutions like LOBs or LONG
- The data types need to be selected very carefully to avoid any failure while ingesting data to the warehouse tables or running any ETL process. One requires a complete understanding of the data to design table effectively and efficiently, to avoid any ingestion issues
Easy Way to Handle Redshift Data Types While Ingestion
A foolproof way to ensure this is not a breakpoint is by using an automated Data Integration platform like Hevo Data to move data from any source into Redshift.
Hevo helps you bring data from any source to Redshift, in real-time, without having to write a single line of code.
Hevo automatically picks the schema of the source, maps the source data type to the closest Redshift data type. This removes all the hassle and makes data ingestion a breeze. Click here for a 14-day free trial of Hevo.
3. A company might (or mostly have) have to keep a staging or initial layer where one directly dumps the data coming from the source. These staging tables do not have data type constraints defined and mostly have ‘varchar’ fields to avoid any drop in the source data. A subsequent layer of data or tables is designed in a manner where data is kept under a column of a table with a relevant and effective data type. This eventually leads to more space consumption as one is duplicating the data to avoid any failure or loss of data in the initial level ingestion and therefore having a redundant data layer which can be avoided
4. The data types need to be selected very carefully so that the query performance is not affected in a negative manner. One should not keep the column size too large unnecessary or strong the wrong type of data under a column (by the usage of implicit/explicit conversion)
5. Wrong selection of data types can also lead to a partial loss of data without any failure of the system, like if we keep decimal data under an integer column then the data post the decimal point will be silently neglected causing a huge problem. This kind of issue might not get noticed at a very early stage of development. So data profiling becomes a very important step before selecting data types in Redshift
Hope this blog gives a detailed overview of Redshift data types, the variants and shortcomings. Do you have any questions for us? Let us know in the comments.