AWS offers serverless, fully managed services that let organizations integrate data faster and at scale, which helps teams that want to streamline their processes. One such offering is AWS Glue Workflow.
Organizations that use AWS Glue Workflow do more than save the time once spent updating data manually; they also build capabilities that help them in the long run as they scale.
The problem statement: Organizations typically update their data on a fixed schedule, such as once a week, and the ETL processes run only after that update. But what about the ad hoc data updates that happen far more often? How can we account for these as well?
Because new information can arrive at any time and affect existing datasets, the need for Workflows in AWS Glue becomes evident. In this tutorial, we’ll discuss how AWS Glue and Workflows work and how to create and build a Workflow. Let’s begin.
What is AWS Glue?
- AWS Glue is a serverless data integration and ETL service that simplifies discovering, preparing, and combining data for analytics, machine learning, and application development. It provides both visual and code-based tools for building data integration processes.
- AWS Glue is made up of three parts: the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a customizable scheduler that handles dependency resolution, job monitoring, and retries.
- AWS Glue brings together the data integration tools you need, so you can start analyzing your data and acting on the results in minutes rather than months.
Some key features to be aware of:
- Drag & Drop Job Editor: You can define the ETL process using a drag-and-drop job editor, and AWS Glue generates the code to extract, transform, and load the data.
- Automatic Schema Discovery: You can build crawlers that connect to different data sources. A crawler classifies the data, extracts schema-related information, and stores it in the Data Catalog. ETL jobs can then use this metadata to read from and write to those data stores.
- Job Scheduling: Glue jobs can run on a schedule, on demand, or in response to an event. You can also use the scheduler to build elaborate ETL pipelines by defining dependencies between jobs.
- Elastic Views: Glue Elastic Views makes it simple to build materialized views that combine and replicate data across several data stores without writing custom code.
- Built-in Machine Learning: Glue has a Machine Learning tool called “FindMatches.” It finds and deduplicates records that are imperfect copies of one another.
- Developer Endpoints: If you want to develop your ETL code interactively, Glue provides developer endpoints where you can edit, debug, and test the code it has generated.
AWS Glue Workflow — An Overview of How Things Work!
- It lets you design and visualize complex extract, transform, and load (ETL) operations that involve multiple crawlers, jobs, and triggers. Each Workflow manages the execution and monitoring of all of its jobs and crawlers.
- As a Workflow runs each component, it records the execution progress and status. This gives you an overview of the overall task as well as the details of each step. The AWS Glue console also displays a Workflow as a graph.
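The recorded status described above can also be read programmatically. The following is a minimal sketch, assuming you have fetched a run with the boto3 Glue client's `get_workflow_run` call and `IncludeGraph=True`. The helper name `summarize_workflow_run` and the `NOT_STARTED` label are illustrative choices, not part of AWS Glue.

```python
def summarize_workflow_run(run):
    """Return (overall_status, {node_name: latest_state}) for one workflow run.

    `run` is the "Run" dictionary of a Glue GetWorkflowRun response that was
    requested with IncludeGraph=True.
    """
    states = {}
    for node in run.get("Graph", {}).get("Nodes", []):
        if node["Type"] == "JOB":
            attempts = node.get("JobDetails", {}).get("JobRuns", [])
            # Assumption: the last entry in the list is the most recent attempt.
            # "NOT_STARTED" is a label made up for this sketch.
            states[node["Name"]] = attempts[-1]["JobRunState"] if attempts else "NOT_STARTED"
        elif node["Type"] == "CRAWLER":
            crawls = node.get("CrawlerDetails", {}).get("Crawls", [])
            states[node["Name"]] = crawls[-1]["State"] if crawls else "NOT_STARTED"
        # TRIGGER nodes are skipped: they carry no run state of their own.
    return run.get("Status"), states


# Usage sketch (needs boto3 and AWS credentials; names are placeholders):
#   import boto3
#   glue = boto3.client("glue")
#   resp = glue.get_workflow_run(Name="my-workflow", RunId=run_id, IncludeGraph=True)
#   status, states = summarize_workflow_run(resp["Run"])
```

Keeping the summary logic separate from the API call makes it easy to try out on a saved response before pointing it at a live account.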
But, before moving forward, let’s talk about Triggers in AWS Glue Workflow.
In an AWS Glue Workflow, Triggers can start both jobs and crawlers, and they can also fire when jobs or crawlers complete. Using Triggers, you can build large chains of interdependent jobs and crawlers. A Workflow can be started by one of three types of trigger:
- Schedule Trigger: The Workflow starts according to a schedule that you define. The schedule can be daily, weekly, or monthly, or it can be a custom schedule based on a cron expression.
- On-demand Trigger: The Workflow is started manually from the AWS Glue console, API, or AWS CLI.
- EventBridge Event Trigger: The Workflow starts when a single Amazon EventBridge event or a batch of Amazon EventBridge events occurs. With this trigger type, AWS Glue can act as an event consumer in an event-driven architecture. Any EventBridge event type can start a Workflow. A typical use case is the arrival of a new object in an Amazon S3 bucket (the S3 PutObject operation).
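For the S3 PutObject use case, the EventBridge side needs a rule whose event pattern matches the upload and whose target is the Workflow. The sketch below only builds those two values. It assumes CloudTrail data events are enabled for the bucket so that PutObject calls reach EventBridge, and the function names, bucket name, and account ID are placeholders.

```python
import json


def s3_put_object_pattern(bucket_name):
    """Event pattern matching S3 PutObject calls recorded by CloudTrail."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["PutObject"],
            "requestParameters": {"bucketName": [bucket_name]},
        },
    }


def workflow_arn(region, account_id, workflow_name):
    """ARN of a Glue workflow, used as the EventBridge rule target."""
    return f"arn:aws:glue:{region}:{account_id}:workflow/{workflow_name}"


pattern_json = json.dumps(s3_put_object_pattern("my-landing-bucket"))

# Usage sketch (needs boto3, credentials, and an IAM role that EventBridge
# can assume to notify Glue; all names are placeholders):
#   import boto3
#   events = boto3.client("events")
#   events.put_rule(Name="new-object-rule", EventPattern=pattern_json)
#   events.put_targets(Rule="new-object-rule", Targets=[{
#       "Id": "glue-workflow",
#       "Arn": workflow_arn("us-east-1", "123456789012", "my-workflow"),
#       "RoleArn": role_arn,
#   }])
```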
Creating & Building AWS Glue Workflow
In this section of the blog post, we’ll learn how to create an AWS Glue Workflow manually, one node at a time.
Before you create a Workflow manually, you must already have created the jobs and crawlers that the Workflow is to include. You can add new triggers to your Workflow as you build it, or you can clone existing triggers into the Workflow. When you clone a trigger, all the catalog objects connected with it are added to the Workflow: the jobs or crawlers that fire it and the jobs or crawlers that it starts.
Let’s now see how to create it manually.
Step 1: Create the Workflow
- Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/.
- In the navigation pane, under ETL, select Workflows.
- Select Add Workflow and fill out the Add a new ETL workflow form.
- Select Add Workflow. The new Workflow appears in the list on the Workflows page.
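The same step can be scripted instead of clicked through. This is a minimal sketch, assuming a boto3 Glue client is passed in. The wrapper function and the workflow name are illustrative, while `create_workflow` with `Name` and `Description` is the boto3 call it relies on.

```python
def create_workflow(glue_client, name, description=None):
    """Create an empty Glue workflow and return the name Glue reports back."""
    kwargs = {"Name": name}
    if description:
        # Description is optional, so it is only sent when provided.
        kwargs["Description"] = description
    response = glue_client.create_workflow(**kwargs)
    return response["Name"]


# Usage sketch (needs boto3 and AWS credentials; the name is a placeholder):
#   import boto3
#   glue = boto3.client("glue")
#   create_workflow(glue, "weekly-refresh", "Refreshes datasets after each update")
```

Passing the client in as a parameter keeps the function easy to exercise with a stand-in object before running it against a real account.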
Step 2: Add a Start Trigger
- Select your new Workflow on the Workflows page. Then, at the bottom of the page, make sure the Graph tab is selected.
- Select Add trigger, and then do one of the following in the Add trigger dialog box:
  - Select Clone existing and choose a trigger to clone. Then choose Add. The trigger appears on the graph, together with the jobs and crawlers that it watches and the jobs and crawlers that it starts. If you accidentally chose the wrong trigger, select it on the graph and then choose Remove.
  - Select Add new, and then complete the Add trigger form:
    - Select Schedule, On-demand, or EventBridge event as the Trigger type. For trigger type Schedule, select one of the Frequency options; to enter a cron expression, choose Custom. For trigger type EventBridge event, enter the Number of events (batch size) and, optionally, the Time delay (batch window). If you leave the Time delay field blank, the batch window defaults to 15 minutes.
    - Choose Add. The trigger appears on the graph, along with a placeholder node (labeled Add node). In the example, the start trigger is a schedule trigger named Month-close1. At this point, the trigger is not yet saved.
- If you added a new trigger, complete the following steps:
  - Do one of the following:
    - Select the placeholder node (Add node).
    - Make sure the start trigger is selected, and then choose Add jobs/crawlers to trigger from the Action menu above the graph.
  - In the Add job(s) and crawler(s) to trigger dialog box, select one or more jobs or crawlers, and then choose Add. The trigger is saved, and the selected jobs or crawlers appear on the graph with connectors from the trigger. If you added the wrong jobs or crawlers by mistake, select either the trigger or a connector and choose Remove.
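The choices in the Add trigger form map onto the parameters of the Glue `create_trigger` API. The sketch below builds the request for each start trigger type without calling AWS. The function name, workflow and job names, and the cron expression are hypothetical; the keys (`Type`, `Schedule`, `EventBatchingCondition`, `Actions`) are the ones the boto3 call accepts.

```python
def start_trigger_request(name, workflow, trigger_type, job_names=(),
                          crawler_names=(), schedule=None,
                          batch_size=1, batch_window_seconds=None):
    """Build keyword arguments for glue.create_trigger for a start trigger."""
    actions = [{"JobName": j} for j in job_names]
    actions += [{"CrawlerName": c} for c in crawler_names]
    request = {"Name": name, "WorkflowName": workflow,
               "Type": trigger_type, "Actions": actions}

    if trigger_type == "SCHEDULED":
        if not schedule:
            raise ValueError("a SCHEDULED trigger needs a cron expression")
        request["Schedule"] = schedule
    elif trigger_type == "EVENT":
        batching = {"BatchSize": batch_size}
        if batch_window_seconds is not None:
            # When omitted, Glue applies its default window of 900 seconds
            # (the 15 minutes mentioned above).
            batching["BatchWindow"] = batch_window_seconds
        request["EventBatchingCondition"] = batching
    elif trigger_type != "ON_DEMAND":
        raise ValueError(f"not a start trigger type: {trigger_type}")
    return request


# A schedule trigger that fires at 06:00 UTC on the first day of each month.
month_close = start_trigger_request(
    "Month-close1", "my-workflow", "SCHEDULED",
    job_names=["close-books"], schedule="cron(0 6 1 * ? *)")

# Usage sketch: boto3.client("glue").create_trigger(**month_close)
```

Depending on the trigger type, a trigger created through the API may still need to be activated, for example with `StartOnCreation=True` or a separate `start_trigger` call, so check its state after creation.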
Step 3: Add more triggers
Continue to build out your Workflow by adding more triggers of type Event. Use the icons to the right of the graph to zoom in or out, or to enlarge the graph canvas. Complete the following steps for each trigger you want to add:
- Do one of the following:
  - To clone an existing trigger, make sure no node on the graph is selected, and then choose Add trigger from the Action menu.
  - To create a new trigger that watches a particular job or crawler on the graph, first select the job or crawler node, and then choose the Add trigger placeholder node. You can add more jobs or crawlers for this trigger to watch in a later step.
- Do one of the following in the Add trigger dialog box:
  - Select Add new and fill out the Add trigger form. Then choose Add. The trigger appears on the graph. You complete the trigger in a later step.
  - Select Clone existing and choose a trigger to clone. Then choose Add. The trigger appears on the graph, together with the jobs and crawlers that it watches and the jobs and crawlers that it starts. If you chose the wrong trigger, select it on the graph and then choose Remove.
- If you added a new trigger, complete the following steps:
  - Select the new trigger. In the example, the trigger De-dupe/fix succeeded is selected, and placeholder nodes appear for (1) events to watch and (2) actions.
  - Select the events-to-watch placeholder node, and then select one or more jobs or crawlers in the Add job(s) and crawler(s) to watch dialog box. Select an event to watch (SUCCEEDED, FAILED, etc.) and choose Add.
  - Make sure the trigger is selected, and then select the actions placeholder node.
  - In the Add job(s) and crawler(s) to trigger dialog box, select one or more jobs or crawlers, and then choose Add. The graph displays the selected jobs and crawlers, together with connectors from the trigger.
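An Event trigger of this kind corresponds to a `CONDITIONAL` trigger in the Glue API: the watched jobs and state go into `Predicate`, and the jobs to start go into `Actions`. The following is a minimal sketch under that mapping, with a hypothetical function name and job names, covering watched jobs only (crawlers would use `CrawlerName` and `CrawlState` instead).

```python
def conditional_trigger_request(name, workflow, watched_jobs, next_jobs,
                                state="SUCCEEDED", logical="AND"):
    """Build keyword arguments for glue.create_trigger for an Event trigger.

    The trigger fires when the watched jobs reach `state`, then starts
    `next_jobs`. `logical` is "AND" (all conditions) or "ANY" (at least one).
    """
    if not watched_jobs:
        raise ValueError("at least one job to watch is required")
    conditions = [
        {"LogicalOperator": "EQUALS", "JobName": job, "State": state}
        for job in watched_jobs
    ]
    predicate = {"Conditions": conditions}
    if len(conditions) > 1:
        # The combining operator only matters when several events are watched.
        predicate["Logical"] = logical
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": predicate,
        "Actions": [{"JobName": job} for job in next_jobs],
        "StartOnCreation": True,  # activate the trigger as soon as it exists
    }


request = conditional_trigger_request(
    "dedupe-succeeded", "my-workflow",
    watched_jobs=["de-dupe", "fix"], next_jobs=["load-warehouse"])

# Usage sketch: boto3.client("glue").create_trigger(**request)
```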
Conclusion
In this tutorial article, we created an AWS Glue Workflow to automate tasks that need to run after every data update, whether weekly or ad hoc. We also covered the details of how Workflows and Triggers operate in the overview section of the article.
If you want to learn more about the subject, any of these three AWS documentation pages can help:
- Overview of Workflows in AWS Glue
- Running and Monitoring a Workflow in AWS Glue
- Creating a Workflow from a Blueprint in AWS Glue
Yash is a Content Marketing professional with over three years of experience in data-driven marketing campaigns. He has expertise in strategic thinking, integrated marketing, and customer acquisition. Through comprehensive marketing communications and innovative digital strategies, he has driven growth for startups and established brands.