Working with AWS DataBrew: 3 Easy Steps


Data is one of the essential requirements for every business to grow in a competitive environment, and access to high volumes of high-quality data gives companies an edge. Cleaning, preparing, and managing raw data coming in from multiple data sources is a cumbersome and time-consuming task. These steps convert low-quality data into high-quality data that can then be used for Data Analysis, fed to Machine Learning models, loaded into BI tools, and so on.

Users can automate this data preparation process with AWS DataBrew, a visual data preparation tool that helps Data Scientists and Data Analysts clean and normalize raw data quickly using pre-built transformations and automation. With the help of AWS Glue DataBrew, users spend less time preparing data and more time analyzing it.

In this article, you will learn what AWS DataBrew is and how it works. You will also go through the steps to create a sample Dataset using DataBrew and see how it cleans and normalizes data quickly.


What is AWS DataBrew?

AWS DataBrew Cover Image
Image Source

AWS DataBrew is a Data preparation tool that is part of AWS (Amazon Web Services). It allows companies, Data Scientists, and Data Analysts to clean and normalize data without writing any code. Users choose AWS DataBrew because it reduces the time they spend cleaning and preparing data before running Analytics on it. With the help of AWS Glue DataBrew, users can cut data preparation time by up to 80% by leveraging Machine Learning.

AWS DataBrew comes with over 250 pre-built transformations that help users automate the data transformation tasks involved in data preparation. These transformations include filtering anomalies, correcting invalid values, and converting data to a simple standard format.

AWS DataBrew offers interactive interfaces that help users easily discover, visualize, and clean the raw data. It is a serverless tool and enables users to explore terabytes of raw data.

Key Features of AWS DataBrew

Some of the main features of AWS DataBrew are listed below:

  • Visual Interface: AWS DataBrew comes with an interactive visual interface that makes it easier for users to clean and prepare raw data using point-and-click operations.
  • Automation: AWS DataBrew lets users save transformations as reusable recipes that can be applied directly to clean and normalize incoming data.
  • Map Data Lineage: With the help of AWS Glue DataBrew, users can visually map the lineage of data to understand its data sources and the transformation steps applied to it.

To know more about AWS DataBrew, click here.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps you load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. It supports 100+ data sources and takes just 3 steps to set up: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

Get Started with Hevo for Free

Its completely automated pipeline delivers data in real time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

What is AWS Glue?


Creating a workflow that efficiently accomplishes these data preparation and integration operations can take a long time. This is where AWS Glue comes into play. It's a fully managed ETL solution built to deal with massive amounts of data. Its task is to harvest data from a variety of AWS services and integrate it into data lakes and warehouses. Since it offers both code-based and visual interfaces, Glue is extremely versatile and simple to use.

AWS DataBrew is a popular and recent addition to the Glue family. With DataBrew, data can be cleaned, normalized, and even augmented without writing any code, while Glue Elastic Views makes merging and replicating data across disparate data stores with SQL a breeze.

Working of AWS DataBrew

Working of AWS DataBrew
Image Source

Let's first understand how AWS DataBrew works and how it helps in quickly preparing data. The steps to prepare data using AWS DataBrew are given below:

  • You can connect one or more Datasets from AWS S3, AWS Redshift, RDS, and the AWS Glue Data Catalog, or upload a file (CSV, JSON, Parquet, or .XLSX format) from your local system through the AWS DataBrew Console (a minimal API sketch of this step follows this list).
  • Then, create a project so that you can visually explore, understand, analyze, clean, and normalize the data from the Dataset.
  • By visualizing all the information you can quickly identify patterns in the data and spot anomalies.
  • After that, you can generate a data profile of the Dataset with over 40 statistics by running a job in the “Profile View”.
  • AWS DataBrew will give you recommendations to improve the data quality. It can help you with cleaning data using more than 250 pre-built transformations.
  • You can save, publish and version recipes. After that, you can automate the data preparation tasks by applying recipes to all incoming data.
  • You can visually track and explore the connectivity of Datasets with projects, recipes, and jobs.
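
If you prefer to script these steps instead of using the console, the same workflow is available through the AWS SDK. Below is a minimal sketch using Python and boto3 that registers a CSV file already stored in S3 as a DataBrew Dataset; the bucket name, key, and dataset name are illustrative placeholders.

import boto3

# DataBrew client; credentials and region come from your AWS configuration.
databrew = boto3.client("databrew")

# Register a CSV file that already sits in S3 as a DataBrew Dataset.
# The bucket, key, and dataset name below are placeholders for this sketch.
response = databrew.create_dataset(
    Name="comments",
    Format="CSV",
    FormatOptions={"Csv": {"Delimiter": ",", "HeaderRow": True}},
    Input={
        "S3InputDefinition": {
            "Bucket": "my-databrew-input-bucket",
            "Key": "raw/Comments.csv",
        }
    },
)
print("Created dataset:", response["Name"])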

Preparing a Sample Dataset with AWS DataBrew

Now that you have understood what AWS DataBrew is, in this section you will go through the process of creating a sample Dataset using AWS DataBrew. The simple steps for using AWS DataBrew are listed below:

Step 1: Connecting to a Dataset

  • Log in to your AWS Glue DataBrew Console here.
  • Create a new project by navigating to the "Projects" tab and then clicking on the "Create project" option.
  • You can enter any name for the project. In this tutorial, the project name is "Comments", as shown in the image below.
Creating New AWS DataBrew Project
Image Source
  • Along with this, a new recipe will also be created; it will be automatically updated with the data transformations that you apply later in this tutorial.
  • Then, select the "New Dataset" option and name it as you like, e.g., "Comments", as shown in the image below.
Selecting New Dataset
Image Source
  • Now, a new dialog box will ask you to connect to a Dataset. Here, you need to upload the Dataset that you want to use.
  • For this tutorial, the "Comments.csv" file is used as the Dataset. While working with real Datasets, you would typically connect AWS DataBrew to AWS S3 or to an existing source in the AWS Glue Data Catalog.
  • Next, for the “Comments.csv” file, you can provide the AWS S3 destination to store the Dataset file, as shown in the image below.
Connecting to Dataset
Image Source
  • In the "Comments.csv" file, the first row contains the column names and the remaining rows contain the values: comment, category, customer_id, and item_id are text, rating is numerical, and comment_sentiment and support_needed are additional text and Boolean fields.
  • The sample data is shown below.
customer_id,item_id,category,rating,comment,comment_sentiment,support_needed
234,2345,"Electronics;Computer", 5,"I love this!",Positive,False
321,5432,"Home;Furniture",1,"I can't make this work... Help, please!!!",negative,true
123,3245,"Electronics;Photography",3,"It works. But I'd like to do more",,True
543,2345,"Electronics;Computer",4,"Very nice, it's going well",Positive,False
786,4536,"Home;Kitchen",5,"I really love it!",positive,false
567,5432,"Home;Furniture",1,"I doesn't work :-(",negative,true
897,4536,"Home;Kitchen",3,"It seems OK...",,True
476,3245,"Electronics;Photography",4,"Let me say this is nice!",positive,false
  • Now, you will be asked to choose the permissions that allow AWS DataBrew to access your data.
  • In the "Access permissions" section, select the AWS Identity and Access Management (IAM) role that grants DataBrew access to the data in AWS S3.
  • Here, select "AWSRoleForDataBrew" from the "Role name" drop-down menu, as shown in the image below (a minimal API sketch of this step follows the image).
Accessing Permissions
Image Source
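
The same project setup can be scripted. Below is a minimal sketch with boto3, assuming the "comments" Dataset registered earlier and a placeholder IAM role ARN; as in the console, the recipe named here is where the project's transformations get recorded (if your workflow requires the recipe to exist beforehand, it can be created with create_recipe).

import boto3

databrew = boto3.client("databrew")

# Create a project attached to the Dataset; the recipe name is where the
# project's transformation steps are recorded. The role ARN is a placeholder.
databrew.create_project(
    Name="Comments",
    DatasetName="comments",
    RecipeName="Comments-recipe",
    RoleArn="arn:aws:iam::123456789012:role/AWSRoleForDataBrew",
)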

Step 2: Exploring the Dataset

  • If the dataset you are using is big, you can use the "Sampling" option to limit the number of rows.
  • You will make use of projects to create recipes and then jobs to apply these recipes to all the data.
  • You can use "Tagging" to manage, filter, and search the resources that you create with AWS DataBrew, as shown in the image below.
Exploring the Data in AWS DataBrew
Image Source
  • After setting all these configurations, the project will start preparing and exploring the dataset.
  • Once the data preparation is done, you will see the default Grid View, which shows a summary of each column, including its range of values and, for numerical columns, its statistical distribution.
  • You can change the view by selecting Schema View to drill down into the schema, or Profile View to collect statistical summaries about the data, which is useful for large datasets.
  • Once the profile job is done, you can perform exploratory Data Analysis on the data by identifying correlations between columns, gathering summaries, and checking how many rows and columns are valid. AWS DataBrew presents all this information visually to help you easily identify patterns in the data, as shown in the image below (a sketch of creating a profile job through the API follows this list).
Gathering Summary and Description of the data
Image Source
  • You can even drill down into a specific column after selecting it.
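
If you want to generate the same profile without the console, a profile job can also be created and started through the API. A minimal sketch, assuming the "comments" Dataset, a placeholder output bucket, and a placeholder IAM role ARN:

import boto3

databrew = boto3.client("databrew")

# Create a profile job that writes its statistics report to S3.
databrew.create_profile_job(
    Name="comments-profile",
    DatasetName="comments",
    OutputLocation={"Bucket": "my-databrew-output-bucket", "Key": "profiles/"},
    RoleArn="arn:aws:iam::123456789012:role/AWSRoleForDataBrew",
)

# Start the job; the returned RunId can be used to track its progress.
run = databrew.start_job_run(Name="comments-profile")
print("Profile job started, run id:", run["RunId"])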

Step 3: Applying Transformations Using AWS DataBrew

  • Now let’s explore the “category” column. Here, it contains two pieces of information, separated by semicolons. 
  • For example, you can view the data in the first row as "Electronics;Computer".
  • You can split the "category" column into two different columns by selecting the "category" column and then clicking the three-dot option to explore more options.
  • Here, select the “Split Columns” option and then further select the “On a single delimiter” option, as shown in the image below. 
Splitting the Column Using a Single Delimiter
Image Source
  • This will divide the “category” column into two separate columns with column names “category_1” and “category_2”. You can rename these columns as per your choice. For example – renaming them to “category” and “subcategory”.
  • Similar changes will be applied to the rest of the category values in the column as well.
  • Next, let's move to the "rating" column, which contains numeric values between 1 and 5. You can rescale these values using "min-max normalization" so that they all fall between 0 and 1 (many other advanced scaling techniques are also available); a new column named "rating_normalized" will be added.
  • Similarly, you can apply other transformation techniques to different columns to optimize the data for Analysis using Machine Learning models. A few common techniques are one-hot encoding, handling missing values, word tokenization, etc., depending on the type of data.
  • Once your data is ready, it can be published and used in a recurring job processing similar data, as shown in the image below.
Publishing the Job
Image Source
  • The recipe contains the list of all the transformations to be applied, and when the job runs, the output is stored in AWS S3 (a programmatic sketch of this flow follows the image below).
  • The output data is then available for Data Analytics, feeding Machine Learning models, building reports, and loading into BI tools for Visualization, as shown in the image below.
Creating a Recipe Job
Image Source
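
For reference, publishing the recipe and running it as a recipe job can also be done through the API. The sketch below is a minimal illustration: it assumes the project's recipe is named "Comments-recipe", that this is its first published version, and that the bucket and role ARN are placeholders.

import boto3

databrew = boto3.client("databrew")

# Publish the recipe built in the project so it gets a fixed version
# that jobs can reference.
databrew.publish_recipe(
    Name="Comments-recipe",
    Description="Split category, normalize rating",
)

# Create a recipe job that applies the published recipe to the Dataset
# and stores the transformed output in S3 as CSV.
databrew.create_recipe_job(
    Name="comments-transform",
    DatasetName="comments",
    RecipeReference={"Name": "Comments-recipe", "RecipeVersion": "1.0"},  # assumes first published version
    Outputs=[
        {
            "Location": {"Bucket": "my-databrew-output-bucket", "Key": "clean/"},
            "Format": "CSV",
        }
    ],
    RoleArn="arn:aws:iam::123456789012:role/AWSRoleForDataBrew",
)

# Run the job; the same job can later be re-run on new incoming data.
databrew.start_job_run(Name="comments-transform")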

That's it! You have completed the process of creating a sample Dataset using AWS DataBrew, explored different transformation techniques, and seen how AWS Glue DataBrew helps you quickly prepare, clean, and normalize data.

AWS DataBrew vs SageMaker DataWrangler

Since both services were released within a short period of time and served a similar purpose, many customers (mainly Data Scientists) were left with a decision to make: should they utilize AWS DataBrew or SageMaker DataWrangler to deal with data?

This question has no single correct answer because it depends on the user's needs. Advanced users, particularly Data Scientists, would no doubt point out that DataWrangler allows you to create custom transformations on the fly and use them to transform your data. It can also perform high-level data analysis quickly, including building quick Machine Learning models to surface information like feature importance.

On the other hand, AWS DataBrew's simplicity cannot be overlooked. With as many built-in transformations as it offers, many users may find that it meets all of their needs. Working with AWS DataBrew also requires far less expertise and can be used by people with no technical background.

Overall, the target audiences for these two services are distinct. DataWrangler is designed for data scientists, offering them the flexibility they want when processing data for machine learning models. DataBrew, on the other hand, strives to keep things as easy as possible. It gives you less choice, yet it covers practically everything a typical user may desire. Advanced users may find the service’s capabilities limiting, although they are not the service’s primary demographic.

AWS DataBrew Example

Step 1: Creating a Source of Data

The creation of an S3 bucket is a step in this example that isn’t directly related to DataBrew. Go to the AWS S3 Management Console and click “Create bucket” to create an S3 bucket.



Make a new bucket called "edlitera-databrew-bucket" and leave all other options at their defaults.
The bucket will appear in your AWS S3 interface once you've created it.
You're ready to work with DataBrew now that you've created a bucket. Click on the "Datasets" tab on the DataBrew page, then "Connect new dataset." When connecting a new dataset, there are a few factors to consider:

  • Name of the dataset
  • Source of the data
  • Tags for the output destination (optional)

You can call this dataset "wine-reviews" and choose "File upload." With file upload, you select a dataset on your local system and instruct AWS DataBrew to store it in the empty bucket you created earlier. The new dataset should be ready to use now.

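For reference, the bucket creation and file upload in this step are ordinary S3 operations that can be scripted as well. A minimal sketch with boto3, using the bucket name from the example, a placeholder region, and a placeholder local file path:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create the bucket used by the example (bucket names must be globally unique).
s3.create_bucket(Bucket="edlitera-databrew-bucket")

# Upload the local dataset file so DataBrew can read it from S3.
s3.upload_file(
    Filename="winemag-data.csv",          # placeholder local path
    Bucket="edlitera-databrew-bucket",
    Key="wine-reviews/winemag-data.csv",
)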

Step 2: Data Analysis

Let's start with some basic data analysis now that you've defined the dataset you'll be using. AWS DataBrew includes a dataset profiling feature, which can be very valuable when working with data that is unfamiliar to you. To create a profile job, you must go to the "Jobs" menu, where three options will be presented to you:

  • Recipe jobs
  • Profile jobs
  • Schedules

At this time, you'd like to build a profile of your dataset in order to get a better understanding of what your data looks like. Select the "Profile jobs" tab and then click "Create job."

You’ll need to enter values for the following parameters when defining the job:

  • Job name
  • Job type
  • Job input
  • Job output settings
  • Permissions 
  • Optional settings

Your job name will be "wine-review-profile." You'll pick "Profile job" as the job type, then select your dataset, and use the bucket you created earlier for the output.



Finally, you must set up a role. Because you don't have one to choose from yet, you'll create a new role called "edlitera-profiling-job."
Once you've defined everything, you only need to click "Create and run job," and AWS DataBrew will begin profiling your dataset.
When the job is finished, you can go to the upper right corner and click “View profile.”
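
If you start the profile job through the API instead of the console, you can poll its status rather than watching the job page. A minimal sketch, assuming the "wine-review-profile" job has already been created:

import time
import boto3

databrew = boto3.client("databrew")

# Start the profile job and wait for it to finish.
run_id = databrew.start_job_run(Name="wine-review-profile")["RunId"]

while True:
    state = databrew.describe_job_run(Name="wine-review-profile", RunId=run_id)["State"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)  # poll every 30 seconds

print("Profile job finished with state:", state)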

The following sections make up a dataset profile:

  • Dataset preview
  • Data profile overview
  • Column statistics
  • Data lineage

The "Dataset preview" section shows the dataset together with details like the dataset name, data size, and where the data is stored, among other things.


The "Data profile overview" section shows details about:

  • Number of rows
  • Number of columns
  • Data types of columns
  • Missing data
  • Duplicate data
  • Correlation matrix

There are no duplicates in your dataset, although some values are missing. Because the correlation matrix shows only three values while the dataset has fourteen columns in total, you may assume that a number of columns hold categorical data, which is also corroborated by the data types section.

The following information is displayed when you click on "Column statistics":

  • Column data type
  • Percentage of missing data in column
  • Cardinality
  • Value distribution graph
  • Skewness factor
  • Kurtosis
  • Top ten most frequent unique values 
  • The correlation coefficient between columns

Finally, the "Data lineage" section displays a visual representation of your data's lineage.


Step 3: Data Transformation

As previously said, this is most likely AWS DataBrew’s most significant feature. A transformation recipe, or a series of transformations defined in a reusable format, is used to transform a dataset. You’ll create a DataBrew project and define a DataBrew transformation recipe to demonstrate some of the features DataBrew has to offer.

To do so, go to the “Projects” menu and click on “Create project.”
You must define values for the following variables in order to create a project:

  • Project name
  • Recipe name 
  • Dataset
  • Permissions
  • Sampling and tags (optional)

Your project will be called “wine-reviews-transformation,” and your new recipe will be called “wine-reviews-transformation-recipe.” After that, you’ll select “wine-reviews” as the dataset you wish to work with.

You'll leave the "Sampling" option at its default, which means you'll look at a sample of 500 rows, which is adequate for showing how recipes are made. To complete the project definition, you'll use the same role as before: the "AWSGlueDataBrewServiceRole-wine-reviews" role (a minimal API sketch of this configuration is shown below).

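The project configuration described above, including the 500-row sample, maps to a single API call. A minimal sketch with boto3, assuming the "wine-reviews" Dataset and a placeholder account ID in the role ARN; as in the console, the recipe named here is created alongside the project.

import boto3

databrew = boto3.client("databrew")

# Create the project with a first-N sample of 500 rows, matching the default.
databrew.create_project(
    Name="wine-reviews-transformation",
    DatasetName="wine-reviews",
    RecipeName="wine-reviews-transformation-recipe",
    Sample={"Type": "FIRST_N", "Size": 500},
    RoleArn="arn:aws:iam::123456789012:role/AWSGlueDataBrewServiceRole-wine-reviews",
)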

AWS DataBrew will then begin preparing a session, which will take some time. Your data can be shown as a grid or a schema; you'll use the grid view for this demonstration.

It's time to get to work on your recipe. When you pick "Add step" from the drop-down menu, you can choose a transformation to apply to your dataset. In the toolbar above your dataset, you can see the many transformations you can execute. They serve a variety of purposes.


AWS DataBrew: Pros & Cons

  • Connect data straight from your data lake, data warehouses, and databases, and evaluate the quality of your data by profiling it to understand data patterns and find abnormalities.
  • With an interactive, point-and-click visual interface, choose from over 250 built-in transformations to visualize, clean, and normalize your data.
  • Visualize your data's lineage to better understand the numerous data sources and transformation procedures it has gone through.
  • Apply saved transformations immediately to new data as it enters your source system to automate data cleaning and normalization procedures.
  • Only your dataset's first 20,000 rows are used to create a data profile (in this example, the profile job took 2 minutes and 2 seconds to complete). Anomalies that appear after row 20,000 will not show up in the report, so it is a cursory examination rather than a comprehensive analysis.
  • Correlations are only offered for numerical data, which isn't particularly helpful if your columns are largely strings.

AWS DataBrew: Best Practices

One-Hot encode a categorical column

When preparing data for Machine Learning, creating One-Hot columns from a multi-category column is a routine task. You can perform this operation with one of several Python modules, but it was even easier to do through this interface. There appear to be some limits to this feature that it may not be able to overcome (for the sample dataset, there were more than ten categories).


Press the Apply button. New columns appear out of nowhere. That’s all there is to it.

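Conceptually, this One-Hot step produces the same result as a standard encoding in Python. The snippet below is not DataBrew's implementation, just an illustration of the transformation using pandas on a toy column:

import pandas as pd

# Toy data standing in for a categorical column of the wine-reviews dataset.
df = pd.DataFrame({"country": ["Italy", "France", "Italy", "Spain"]})

# One-Hot encoding: one new 0/1 column per distinct category value.
encoded = pd.get_dummies(df, columns=["country"], prefix="country")
print(encoded)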

Publish your recipe

Publish your recipe so that others can benefit from it and so that it can be reused by recipe jobs.


AWS DataBrew: Pricing

Cost is always a factor to consider. Each interactive session costs $1.00. Transformation jobs cost $0.48 per node hour. The coolest aspect is that it’s completely pay-per-use. As a result, even the casual enthusiast will find the tool highly accessible and affordable.
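For example, at these rates a transformation job that runs on 5 nodes for 30 minutes costs 5 × 0.5 × $0.48 = $1.20, plus $1.00 for each interactive session used while building the recipe.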

Conclusion

In this article, you learned about AWS Glue DataBrew and how this tool helps automate data preparation tasks and save time. You also went through the process of creating a sample Dataset using AWS DataBrew and saw how to use AWS Glue DataBrew for cleaning, normalizing, and preparing data with its 250+ pre-built transformations, automation, and visual interface that helps you quickly identify patterns in the data.

Visit our Website to Explore Hevo

Amazon Redshift and Amazon S3 store data from multiple sources, and every source follows a different schema. Instead of manually writing scripts for every source to perform the ETL (Extract, Transform, Load) process, you can automate the whole process. Hevo Data is a No-code Data Pipeline solution that can help you transfer data from 100+ sources to Amazon Redshift, S3, or another Data Warehouse of your choice. Its fault-tolerant and user-friendly architecture fully automates the process of loading and transforming data to the destination without writing a single line of code.

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about the AWS DataBrew in the comments section below!

Former Research Analyst, Hevo Data

Aditya has a keen interest in data science and is passionate about data, software architecture, and writing technical content. He has experience writing around 100 articles on data science.
