Redshift MD5 Functions: Comprehensive Guide

• March 17th, 2022

redshift md5 - featured image

Redshift is a warehousing solution provided by Amazon. Being a warehouse, it is optimized for analytical queries rather than transactional queries. Therefore, it organizes data into columns instead of rows to facilitate parallel query execution. Like other data warehouses, it uses SQL for working with data. In this article, we will be covering a specific hash function part of the Redshift SQL: Redshift MD5.

Table of Contents

What is Redshift?

Redshift is a database that applies the concepts of Online Analytical Processing (OLAP) and follows a columnar structure. It is based on the PostgreSQL 8.0.2 version which makes it usable with regular SQL queries. It also applies the Massively Parallel processing technology which enables faster query response times. The MPP utilizes multiple computer processes which work parallelly to provide faster execution times and faster calculations. This technology was developed by ParAccel. MPP also works on processors that are spread access different locations to collectively process resource-intensive tasks. The redshift comes out to be a very cost-effective option, that is it provides more benefits at about 1/5th the price of competitors.

Simplify Redshifts ETL with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 150+ data sources (including 40+ free data sources) and is a 3-step process by just selecting the data source, providing valid credentials, and choosing the destination like Redshift. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

GET STARTED WITH HEVO FOR FREE[/hevoButton]

Its completely automated pipeline offers data to be delivered in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensure that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
SIGN UP HERE FOR A 14-DAY FREE TRIAL

What is Redshift MD5?

Redshift MD5 (MD stands for message-digest) algorithm is a widely used hash function. However, it is cryptographically broken. It doesn’t fulfill the basic requirement of a cryptographic hash function, that two distinct messages shouldn’t output the same value when the hash function is run on them. In other words, it is possible that you get the same hash output for two different messages (this is referred to as a collision).

Nevertheless, this remains a popular function. It is often used as a checksum to validate data integrity. There are other non-cryptographic applications as well. For example, if you have a database containing multiple columns, and want a single key to represent the unique combinations of multiple columns, you can simply return the Redshift MD5 hash of the concatenation of the columns (with the assumption that the chances of collisions within your database will be slim).

The Redshift MD5 function always outputs a 128-bit hash value. Redshift outputs a string corresponding to the hexadecimal equivalent of this 128-bit value. This string is 32 characters long.

Understanding the Redshift MD5

Prerequisites

You will need a basic understanding of SQL to understand this article. If you wish to try out the SQL statements shown here, you will need to set up a Redshift cluster within an AWS account. If you haven’t created a Redshift cluster in the past, you can get two months of the free trial.

Syntax

The syntax for Redshift MD5 is very simple and straightforward, as you can see below:

md5(string_whose_hash_needs_to_be_computed)

The simplest example of using this in a query is given below:

SELECT md5('hevodata') as hash

This query will return:

eaa4e7243d02a5ae18af5b53790a1a3f

As you can see, this is a 32 character hexadecimal. Even with a very long string, the output hash will still be 32 characters long. For example:

SELECT md5('this is a very long string input for computing the md5 hash output')

The output is:

c9d0b5cb4c898b36444669b7577d6d4c 

Examples

Now that we have seen how to compute the Redshift md5 function in a standalone manner, let’s see how to use it with a table. We will use the sample tables that are available to you whenever you create a redshift cluster.

redshift MD5: tables in redshift
Image Source: Self

If you explore the users table, you will see several columns pertaining to the likes and dislikes of the users (likesports, liketheatre, etc.), as you can see in the image below.

redshift MD5: viewing the users table
Image Source: Self

Now, suppose that you don’t want to deal with so many columns, rather, group them all into a single column that represents the likes of a user. The Redshift md5 function can help here.

First, let’s create this additional column (you can give it a better name; you can also restrict it to 32 characters, i.e. use char(32) data type instead of varchar):

ALTER TABLE users
ADD COLUMN md5_hash varchar

For the sake of brevity, we will consider only the first 5 columns pertaining to the likes and dislikes. We see that the columns are of type bool, with several null values as well. We’ll handle null values using the NVL function and replace them with integer 2. Non-null values will also be converted to int. Later the ints will be converted to varchar, and concatenated.

In fact, you can see the output of this in action:

SELECT nvl(likesports::int,2)::varchar + nvl(liketheatre::int,2)::varchar + nvl(likeconcerts::int,2)::varchar + nvl(likejazz::int,2)::varchar + nvl(likeclassical::int,2)::varchar as likings from users limit 10
redshift MD5: result of md5 on the users table
Image Source: Self

As you can see, this itself can serve as the single column that gives us information about the likes of a person. However, it reveals a lot. While it doesn’t reveal the names of the columns from which it was constructed, it reveals that there are 5 items being concatenated, and there are 3 possible values for each item (0, 1, and 2). A person having some context can gather a lot of insights from this data. If user data protection is important, the Redshift md5 function can help here.

Here’s how the same data can be stored in a separate column in the hashed format using the Redshift md5 function (it can also be stored in a separate table altogether with another column like userid, if original data isolation is important).

UPDATE users
SET md5_hash = md5(nvl(likesports::int,2)::varchar + nvl(liketheatre::int,2)::varchar + nvl(likeconcerts::int,2)::varchar + nvl(likejazz::int,2)::varchar + nvl(likeclassical::int,2)::varchar)

Now, here’s the same data as above, but it hardly reveals anything:

redshift MD5: hashing the data
Image Source: Self

This hash column greatly simplifies querying for analytics. For example, if you need to find all the users who share similar interests to a specific user, it will be much easier to do that using this single-column, than with 5 different columns. All you need to do is figure out the rows with matching hash values.

For example, if you want to get a list of all users who share similar interests with the user with userid = 2, your query will be:

SELECT * from users
WHERE md5_hash = (SELECT md5_hash from users where userid = 2)

If you know the interesting pattern you are looking for beforehand, you can use the Redshift md5 function in the query as well. For example:

SELECT * from users
WHERE md5_hash = md5('21100')

Conclusion

We got an introduction to the Redshift md5 function and its utility in general. Then we understood how the Redshift md5 function is used and saw its use cases through examples. I’d reiterate that the Redshift md5 function is cryptographically broken, and it should be used with caution for your applications. If you need to use a function specifically for cryptographic applications, you can use SHA2 (its predecessor, SHA1 is also cryptographically broken). 

Redshift is a trusted data warehouse that a lot of companies use and store data as it provides many benefits but transferring data into it is a hectic task. The Automated data pipeline helps in solving this issue and this is where Hevo comes into the picture. Hevo Data is a No-code Data Pipeline and has awesome 100+ pre-built Integrations that you can choose from.

visit our website to explore hevo

Hevo can help you integrate your data from numerous sources and load them into a destination to Analyze real-time data with a BI tool such as Tableau. It will make your life easier and data migration hassle-free. It is user-friendly, reliable, and secure.

SIGN UP for a 14-day free trial and see the difference!

Share your experience of learning about the Redshift MD5 in the comments section below.

No-code Data Pipeline For Redshift