AWS Glue – A Comprehensive Overview


Are you looking to learn more about AWS Glue and whether it fits your ETL needs? If yes, you have landed on the right post. This blog helps you decode the key aspects to consider while evaluating AWS Glue (or any ETL solution, for that matter): Glue’s features, pricing, use cases, and limitations. Before we dive in, let us try to understand a little about the ETL process itself. This will set some context and help you appreciate the features of Glue better.

What is ETL?

ETL stands for Extract, Transform, and Load. Just as the name implies, an ETL process involves performing the following steps:  

  1. Extract data from one or many data sources – these could be databases, business applications used by your sales, marketing, and support teams, etc.
  2. Transform the data as per the needs of the business (e.g., combine First Name and Last Name into Full Name, convert currency from INR to USD, and more).
  3. Finally, load the transformed data into a target data store. (A simple Python sketch of these steps follows.)
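
To make the three steps concrete, here is a minimal, purely illustrative Python sketch; the rows, exchange rate, and in-memory “target” are stand-ins for real systems:

```python
# A minimal, hypothetical sketch of the three ETL steps in plain Python.
INR_TO_USD = 0.012  # assumed exchange rate, for illustration only

# 1. Extract: rows pulled from a source system
source_rows = [
    {"first_name": "Asha", "last_name": "Rao", "amount_inr": 5000},
    {"first_name": "Vik", "last_name": "Mehta", "amount_inr": 12000},
]

# 2. Transform: combine names, convert currency
transformed = [
    {
        "full_name": f"{r['first_name']} {r['last_name']}",
        "amount_usd": round(r["amount_inr"] * INR_TO_USD, 2),
    }
    for r in source_rows
]

# 3. Load: write to a target store (a list stands in for a warehouse here)
target = []
target.extend(transformed)
print(target)
```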

Based on the needs of the business, this movement of data can be scheduled (e.g., once every few hours) or triggered by events (e.g., each time a record is updated in the database).

Businesses can either build custom code to perform ETL or implement an ETL tool that does the same. AWS Glue is one of the preferred ETL platforms, especially when the data sources are hosted on AWS.

What is AWS Glue?

AWS Glue is a fully managed service provided by Amazon for deploying ETL jobs. It reduces the cost, complexity, and time spent creating ETL jobs. For companies that are price-sensitive but need a tool that can handle a variety of ETL use cases, AWS Glue is a decent choice to consider.

Launched in August 2017, Glue has come a long way in adding value to its users. Here are some noteworthy pointers about Glue:

  • Glue is a “serverless” service, so you do not need to provision or manage any resources or services.
  • With AWS Glue, you only pay for resources when Glue is actively running. 
  • AWS Glue comes with “crawlers” that create metadata about the data stored in S3. This metadata comes in very handy while authoring ETL jobs.
  • With generated Python scripts, Glue can translate data from one source format to another.
  • Glue allows you to create a development endpoint, which gives you the power to construct and test your ETL scripts swiftly and easily.


AWS Glue Features

Since AWS Glue is completely managed by AWS, deployment and maintenance are super simple. Below are some important features of Glue:

  1. Integrated Data Catalog

    The Data Catalog is a persistent metadata store for all kinds of data assets in your AWS account; each account gets one Glue Data Catalog per region. This is the place where multiple disjoint systems can store their metadata and, in turn, use that metadata to query and transform the data. The catalog can store table definitions, job definitions, and other control information that helps manage the ETL environment inside Glue.
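
    As a hedged sketch, metadata in the Data Catalog can be created and read programmatically with boto3; the database and table names below are hypothetical, and credentials/region are assumed to be configured in your environment:

```python
# Sketch: working with Data Catalog metadata via boto3 (names are placeholders).
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a database (a logical namespace) in the Data Catalog
glue.create_database(DatabaseInput={"Name": "sales_db"})

# List the tables that crawlers or jobs have registered in it
for table in glue.get_tables(DatabaseName="sales_db")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
```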

  2. Automatic Schema Discovery

    AWS Glue allows you to set up crawlers that connect to your different data sources. A crawler classifies the data, infers the schema, and automatically stores that information in the Data Catalog. ETL jobs can then use this information to manage their operations.
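
    For illustration, a crawler can also be defined and started from code; the crawler name, IAM role, database, and S3 path below are placeholders:

```python
# Sketch: defining and starting a crawler with boto3 (all names are placeholders).
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # assumed role
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
)

# Run it once; the inferred schemas land in the Data Catalog
glue.start_crawler(Name="sales-crawler")
```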

  3. Code Generation

    AWS Glue comes with an exceptional feature that automatically generates code to extract, transform, and load your data. The only input Glue needs is the path/location where the data is stored. From there, Glue creates ETL scripts by itself to transform, flatten, and enrich the data. Normally, the generated code is Scala or Python for Apache Spark.
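
    For reference, the PySpark scripts Glue generates typically resemble the sketch below; the catalog database/table and S3 output path are placeholders:

```python
# Sketch of the kind of PySpark script Glue generates for a job.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table discovered by a crawler
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Rename/cast columns: (source col, source type, target col, target type)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "double", "amount_usd", "double")],
)

# Write the result to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)
job.commit()
```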

  4. Developer Endpoints

    This is one of the best features of AWS Glue, as it helps you interactively develop ETL code. When Glue automatically generates code for you, you will need to debug, edit, and test it; developer endpoints provide the environment for this. Using them, custom readers, writers, or transformations can be created, which can then be imported into Glue ETL jobs as custom libraries.
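
    A development endpoint can be provisioned from code as well; in this hedged sketch, the endpoint name, IAM role, and DPU count are illustrative:

```python
# Sketch: provisioning a development endpoint with boto3 (names are placeholders).
import boto3

glue = boto3.client("glue")

glue.create_dev_endpoint(
    EndpointName="etl-dev-endpoint",
    RoleArn="arn:aws:iam::123456789012:role/GlueDevRole",  # assumed role
    NumberOfNodes=5,  # DPUs provisioned for the endpoint
)
```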

  5. Flexible Job Scheduler

    One of the most important features of Glue is that jobs can be invoked on a schedule, on demand, or on an event trigger. You can also start multiple jobs in parallel and, using the scheduler, build complex ETL pipelines by specifying dependencies across jobs. Glue retries failed jobs automatically (up to a configured limit) and helps filter out bad data. All kinds of inter-job dependencies are handled by Glue.
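
    As a sketch of how scheduling and job dependencies can be wired up with boto3 (job and trigger names are hypothetical):

```python
# Sketch: a scheduled trigger plus a dependent, conditional trigger.
import boto3

glue = boto3.client("glue")

# Run the extract job every hour
glue.create_trigger(
    Name="hourly-extract",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "extract-orders"}],
    StartOnCreation=True,
)

# Start the load job only after the extract job succeeds
glue.create_trigger(
    Name="load-after-extract",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "extract-orders",
        "State": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "load-orders"}],
    StartOnCreation=True,
)
```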

AWS Glue Pricing 

AWS Glue charges an hourly rate, billed by the second, for crawlers (which discover the data) and ETL jobs (which process and load your data). In addition, a simple monthly fee applies to store and access metadata in the Data Catalog; under Glue’s free tier, Amazon does not charge for the first million objects stored and the first million access requests. If you create a development endpoint to interactively develop your ETL code, you will also pay an hourly rate, billed per second. Yes, the pricing structure can be slightly overwhelming, so let us look at some scenarios to understand it better:

  1. ETL Jobs – For this example, consider an Apache Spark Glue job that runs for 10 minutes and consumes 6 DPUs. The price of one Data Processing Unit (DPU)-hour is $0.44. Since your job ran for 10 minutes and consumed 6 DPUs, you will be billed for 6 DPUs × 1/6 hour at $0.44 per DPU-hour, or $0.44.
  2. Development Endpoint – Now assume you provision a development endpoint to connect directly to your machine and interactively develop your ETL code, with 5 DPUs provisioned for the endpoint. If you run this development endpoint for 24 minutes, you will be charged for 5 DPUs × 0.4 hours at $0.44 per DPU-hour, or $0.88.
  3. AWS Glue Data Catalog billing example – In the Glue Data Catalog, the first 1 million objects stored and the first 1 million access requests are free; you are charged only beyond those limits. Assume your crawlers run for 30 minutes and use 2 DPUs, the number of objects stored is under 1 million (so the storage cost is $0), and your access requests exceed 1 million, incurring a $1 charge. Crawlers are billed at $0.44 per DPU-hour, so you pay for 2 DPUs × 0.5 hours at $0.44, or $0.44. The monthly total is therefore $1.44. (The short calculator below reproduces this arithmetic.)
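
To sanity-check these numbers yourself, here is a small Python helper that reproduces the arithmetic above, using the $0.44 per DPU-hour rate quoted in this post (check the current AWS Glue pricing page before relying on it):

```python
# Reproduces the DPU billing arithmetic from the examples above.
DPU_HOUR_RATE = 0.44  # rate quoted in this post; verify against current pricing

def dpu_cost(dpus: int, minutes: float, rate: float = DPU_HOUR_RATE) -> float:
    """Cost of a run that uses `dpus` DPUs for `minutes` minutes."""
    return round(dpus * (minutes / 60) * rate, 2)

print(dpu_cost(6, 10))   # ETL job example       -> 0.44
print(dpu_cost(5, 24))   # dev endpoint example  -> 0.88
print(dpu_cost(2, 30))   # crawler example       -> 0.44
```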

AWS Glue Use Cases

This section highlights the most common use cases of Glue. You can use Glue with some of the popular tools and platforms listed below:

  1. AWS Glue with Athena

    In Athena, you can easily use the AWS Glue Data Catalog to create databases and tables, which can later be queried. Conversely, tables defined through Athena are registered in the Glue Data Catalog and are available to Glue ETL jobs.
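
    As a small illustration, a Glue Catalog table can be queried from Athena with boto3; the database, table, and results bucket below are placeholders:

```python
# Sketch: querying a Glue Catalog table from Athena (names are placeholders).
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="SELECT order_id, amount_usd FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```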

  2. AWS Glue for Non-native JDBC Data Sources

    AWS Glue has native connectors for data stores that can be reached via JDBC, whether they are hosted on AWS or anywhere else, as long as they are reachable over IP. AWS Glue natively supports the following data stores: Amazon Redshift and Amazon RDS (Amazon Aurora, MariaDB, Microsoft SQL Server, MySQL, Oracle, PostgreSQL).
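
    A hedged sketch of reading a JDBC source inside a Glue job is shown below; the connection details are placeholders, and in practice credentials would live in a Glue Connection or AWS Secrets Manager:

```python
# Sketch: reading a MySQL table over JDBC in a Glue job (details are placeholders).
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

orders = glue_context.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options={
        "url": "jdbc:mysql://db.example.com:3306/sales",
        "dbtable": "orders",
        "user": "etl_user",
        "password": "etl_password",  # use Secrets Manager in real jobs
    },
)
print(orders.count())
```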

  3. AWS Glue integrated with AWS Data Lake

    AWS Glue can be integrated with a data lake built on AWS. ETL processes can then be run to ingest, clean, transform, and structure the data that is important to you.

  4. Snowflake with AWS Glue

    Snowflake has plugins that gel seamlessly with AWS Glue, so Snowflake data warehouse customers can manage their programmatic data integration process without physically maintaining servers or Spark clusters. This lets you get the benefits of Snowflake’s query pushdown and translation of Spark SQL into Snowflake SQL, alongside Glue’s managed Spark workloads.
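
    As an illustrative sketch, a Glue/Spark job can write to Snowflake via the Snowflake Spark connector (the connector and JDBC driver JARs must be supplied to the job; all connection options below are placeholders):

```python
# Sketch: writing to Snowflake from a Spark job via the Snowflake connector.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sf_options = {  # all values are placeholders
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "etl_password",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

df = spark.read.parquet("s3://my-bucket/curated/orders/")
(df.write
   .format("net.snowflake.spark.snowflake")
   .options(**sf_options)
   .option("dbtable", "ORDERS")
   .mode("append")
   .save())
```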

AWS Glue Limitations and Challenges

While there are many noteworthy features of AWS Glue, there are some serious limitations as well.

  1. In comparison to the other ETL options available today, Glue has only a few pre-built components. Also, since it is developed by and for the AWS ecosystem, it is not designed to fit all kinds of environments.
  2. Glue works well only for ETL from JDBC and S3 (CSV) data sources. If you are looking to load data from other cloud applications, file storage services, and the like, Glue will not be able to support them.
  3. With Glue, all data is first staged on S3, and there is no option for an incremental sync from your data source. This can be limiting if you are looking to ETL data in real time.
  4. Glue is a managed AWS service for Apache Spark, not a full-fledged ETL solution. Significant additional work is required to optimize PySpark and Scala code for Glue.
  5. Glue does not give you any control over individual table jobs; ETL applies to the complete database.
  6. While Glue supports writing transformations in Scala and Python, it does not provide an environment to test them. You are forced to run your transformations on parts of real data, making the process slow and painful.
  7. Glue does not have good support for traditional relational-database-style queries. Only SQL-type queries are supported, and even those go through a complicated virtual table.
  8. The learning curve for Glue is steep. If you are looking to use Glue for your ETL needs, you will have to ensure your team includes engineering resources with strong knowledge of Spark concepts.
  9. The soft limit on concurrent jobs is only 3, though it can be increased by building a queue for handling limits. You will also have to write a script that smartly auto-adjusts DPUs to the input data size.

Hevo – A Simpler Alternative to AWS Glue

Hevo is a simple-to-use Data Pipeline Platform that helps you load data from any source to any destination in real time, without having to write a single line of code. Hevo helps you overcome the limitations presented by Glue. Here is why Hevo is the right ETL partner for you:

  • Minimal Setup Time, No Learning Curve: Hevo has a point-and-click visual interface that lets you connect your data source and destination in a jiffy. No ETL scripts, cron jobs, or technical knowledge are needed to get started. Your data will be moved to the destination in minutes, in real time.
  • Automatic Schema Mapping: Once you have connected your data source, Hevo automatically detects the schema of the incoming data and maps it to the destination tables. With its AI-powered algorithm, it automatically takes care of data type mapping and adjustments – even when the schema changes at a later point.
  • Mature Data Transformation Capability: Hevo allows you to enrich, transform, and clean data on the fly using an easy Python interface. What’s more, Hevo also comes with an environment where you can test transformations on a sample data set before loading to the destination.
  • Secure and Reliable Data Integration: Hevo has a fault-tolerant architecture that ensures that the data is moved from the data source to destination in a secure, consistent and dependable manner with zero data loss. 
  • Unlimited Integrations: Hevo has a large integration list for Databases, Data Warehouses, SDKs & Streaming, Cloud Storage, Cloud Applications, Analytics, Marketing, and BI tools. This, in turn, makes Hevo the right partner for the ETL needs of your growing organization.

While you are evaluating your options for a seamless ETL platform, do try out Hevo by signing up for a 14-day free trial here.
