Google DataFlow allows teams to focus on programming instead of handling server cluster management, because, DataFlow’s serverless approach gets rid of operational overhead from Data Engineering workloads. Apart from this, Google DataFlow also offers the virtually limitless capacity to tackle your spiky and seasonal workloads without burning a hole in your pocket.
This blog gives a detailed account of the salient aspects of Google DataFlow that make it an integral part of any Data Pipeline. This includes its salient features and benefits followed by a deep dive into the different Google DataFlow Pipeline Lifecycle stages, resource management, and autotuning features to name a few.
Table of Contents
What is Google DataFlow?
Google Dataflow is a fully-managed data processing service by Google that follows a pay-as-you-go pricing model. It helps organizations leverage the robust functionality of ETL tools without putting much effort into building and maintaining the ETL tools while providing various provisions. Google DataFlow also takes care of all the resources required to carry out data processing operations.
Built using Apache Beam SDK, Google DataFlow supports both batch and stream data processing. It also allows users to set up commonly used source-target patterns using their open-source templates with ease.
Key Features of Google DataFlow
Here are a few key features of Google DataFlow that make it an indispensable tool for your pipeline:
- Inline Monitoring: With Google DataFlow Inline Monitoring, you can directly access job metrics to help with troubleshooting streaming and batch pipelines. You can even access monitoring charts at both the worker and step level visibility while setting alerts for conditions such as high system latency and stale data.
- Real-Time Change Data Capture: You can replicate or synchronize data reliably or with minimal latency across heterogeneous data sources to bolster power streaming analytics. With Google DataFlow’s extensive templates, you can easily integrate with Google DataStream to replicate data from Cloud storage into PostgreSQL, Google BigQuery, or Cloud Spanner.
- DataFlow SQL: With DataFlow SQL, you can use your SQL skills to create streaming DataFlow pipelines from the Google BigQuery web user interface itself. You can even join streaming data from Pub/Sub with files within Google BigQuery tables or files in Cloud Storage. It also allows you to build real-time dashboards by leveraging Google Sheets or other Business Intelligence tools.
- Smart Diagnostics: Google DataFlow offers a suite of new features such as Job Visualization capabilities and SLO-based Pipeline Management that provide users a visual way to inspect their job graph and pinpoint bottlenecks. With Smart Diagnostics, you also get a slew of automatic recommendations to locate and tune availability and performance issues.
- Customer-Managed Encryption Keys: You can easily create a streaming or batch pipeline that is protected with a Customer-Managed Encryption Key (CMEK) or look at CMEK-protected data within sinks and sources.
Hevo Data, a Fully-managed Data Pipeline platform, can help you automate, simplify & enrich your data replication process in a few clicks. With Hevo’s wide variety of connectors and blazing-fast Data Pipelines, you can extract & load data from 100+ Data Sources straight into your Data Warehouse or any Databases. To further streamline and prepare your data for analysis, you can process and enrich raw granular data using Hevo’s robust & built-in Transformation Layer without writing a single line of code!
Get Started with Hevo for Free
Hevo is the fastest, easiest, and most reliable data replication platform that will save your engineering bandwidth and time multifold. Try our 14-day full access free trial today to experience an entirely automated hassle-free Data Replication!
Understanding the DataFlow Pipeline Lifecycle
When you decide to run your DataFlow Pipeline, DataFlow will generate an execution graph from the code that develops your Pipeline object, including all of the transforms and their related processing functions. This phase is known as Graph Construction Time and runs locally on the computer where the pipeline is run.
During graph construction, Apache Beam can execute the code locally from the primary entry point of the pipeline code, stopping at the calls to a source, sink or transform step, and converting these calls into the nodes of a graph. As a result, a piece of code in a pipeline’s entry point is locally executed on the machine that runs the pipeline, while the same piece of code declared in a method of a DoFn object gets executed in the DataFlow workers.
DataFlow Pipeline Lifecycle Stages: Distribution and Parallelization
Google DataFlow service can automatically distribute and parallelize the processing logic within your pipeline to the workers you’ve allotted to carry out your job. It leverages the abstractions in the programming model to depict parallel processing functions, for instance, your ParDo transforms might cause Google DataFlow to automatically distribute your processing code to multiple workers to be run simultaneously.
When structuring your ParDo transforms and generating your DoFns, you can keep the following guidelines in mind:
- The Google DataFlow service doesn’t guarantee how many times a DoFn will be invoked.
- Google DataFlow service guarantees that every element within your input PCollection can be processed by a DoFn exactly once.
- It doesn’t guarantee the exact number of DoFn instances that will be generated for a pipeline.
- Apart from this, the Google DataFlow service is also fault-tolerant and may retry your code various times in case of worker issues. The DataFlow service may generate backup copies of your code and can have issues with manual side effects.
DataFlow Pipeline Lifecycle Stages: Execution Graph
Google DataFlow builds a graph of steps that depicts your pipeline, based on the transforms and data you used when building your Pipeline object. This is known as the DataFlow Pipeline Execution Graph. For instance, the WordCount pipeline included with Apache Beam SDKs contains an array of transforms to extract, read, count, format, and write the individual words within a collection of text, along with the occurrence count for every word.
In the following diagram, you can see how the transforms in the WordCount pipeline can be expanded into an execution graph:
DataFlow Pipeline Lifecycle Stages: Combine Optimization
Combining operations are an essential concept in large-scale Data Processing operations. It helps bring together data that might be conceptually distant, which makes it pivotal for correlation. The Google DataFlow programming model depicts aggregation operations as the CoGroupByKey, GroupByKey, and Combine transforms.
Google DataFlow’s aggregation operations collate data across the entire data set, including data that might be spread across multiple workers. During such operations, you should try to combine as much data locally as you can, before aggregating data across instances.
When you execute multi-level or partial combining, the Google DataFlow service can make different decisions based on whether your pipeline is tackling streaming or batch data. For bounded data, the service leans towards efficiency and carries out as much local combining as possible. For unbounded data, however, it favors lower latency, and might not execute partial combining (since it might increase latency).
DataFlow Pipeline Lifecycle Stages: Fusion Optimization
Now that the JSON form of your pipeline’s Execution Graph has been validated, the Google DataFlow service might modify the graph to execute optimizations. Such optimizations can include combining multiple transforms or steps in your pipeline’s Execution Graph into single steps. Combining steps allows the Google DataFlow service to materialize every intermediate PCollection within your DataFlow pipeline, which might be costly in terms of processing and memory overhead.
While all the transforms you’ve specified within your DataFlow pipeline construction are carried out on the service, they may be performed in a different order, or as part of a larger fused platform to maximize the efficiency of execution of your pipeline.
In the following diagram, you can see how the Execution Graph from the WordCount example included with the Apache Beam SDK for Java might be optimized and combined with the Google DataFlow service for efficient execution:
Providing a high-quality ETL solution can be a difficult task if you have a large volume of data. Hevo’s automated, No-code platform empowers you with everything you need to have for a smooth data replication experience.
Check out what makes Hevo amazing:
Sign up here for a 14-Day Free Trial!
- Fully Managed: Hevo requires no management and maintenance as it is a fully automated platform.
- Data Transformation: Hevo provides a simple interface to perfect, modify, and enrich the data you want to transfer.
- Faster Insight Generation: Hevo offers near real-time data replication so you have access to real-time insight generation and faster decision making.
- Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
- Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (with 40+ free sources) that can help you scale your data infrastructure as required.
Understanding Autotuning Features to Optimize Google DataFlow Jobs
Google DataFlow service also offers an array of autotuning features that can further dynamically optimize your job while it’s running. These features include Horizontal Autoscaling, Vertical Autoscaling, and Dynamic Work Rebalancing.
DataFlow Pipeline Autotuning Features: Vertical Autoscaling for Python Streaming Pipelines
Vertical Autoscaling is a pivotal feature that allows Google DataFlow to dynamically scale up or down the memory accessible to workers to fit the needs of the job. The feature is designed to make jobs resilient to out-of-memory errors and to maximize the efficiency of your DataFlow pipeline. DataFlow Prime monitors your pipeline, detects situations where the workers exceed or lack available memory, and then replaces them with new workers with less or more memory.
Here are a few limitations of vertical autoscaling for your data pipeline:
- You can carry out vertical scaling only for workers’ memory and Python streaming jobs.
- By default, memory scaling has an upper limit of 16 GiB (26 GiB when leveraging GPUs) and a lower limit of 6 GiB (12 GiB while using GPUs). Offering a resource hint can considerably shift both the lower and upper limits.
DataFlow Pipeline Autotuning Features: Horizontal Autoscaling
By enabling horizontal autoscaling, the Google DataFlow service can automatically choose the appropriate number of worker instances needed to run your job. The service might even dynamically reallocate fewer or more workers during runtime to account for the characteristics of your job. Some parts of your DataFlow pipeline might be computationally heavier than others, and the Google DataFlow service might automatically spin up additional workers during these phases of your job (and shut them down when they’re no longer needed).
For batch pipelines, Google DataFlow automatically selects the number of workers based on the estimated total amount of work in each stage of your pipeline, which is reliant on both the current throughput and the input size. Google DataFlow will re-evaluate the amount of work based on the progress of execution every 30 seconds and dynamically scales down or up the number of workers to match the estimated total amount of work.
If any of the following conditions occur, to save idle resources, Google DataFlow can either maintain or reduce the number of workers:
- Parallelism is limited due to unparallelizable work, such as un-splittable data like data processed by I/O modules that don’t split or compressed files.
- Average worker CPU usage tends to be lower than 5%.
If your DataFlow pipeline leverages a custom data source that you’ve implemented, there are a few methods you can execute that offer more information to the Google DataFlow service’s Horizontal Autoscaling algorithm and effectively bolster performance.
DataFlow Pipeline Autotuning Features: Dynamic Work Rebalancing
The Dynamic Work Rebalancing feature allows the Google DataFlow service to dynamically bisect work based on runtime conditions. These conditions might include the following:
- Workers take longer than usual to finish up their tasks.
- Imbalances in work assignments.
- Workers finish faster than expected.
In terms of limitations, dynamic work rebalancing only takes place when the Google DataFlow service is processing some data in parallel: when reading data from an external input source, when working with a materialized intermediate PCollection, or when dabbling with the result of aggregation like GroupByKey. If a large number of steps in your job are aggregated, there are fewer intermediate PCollections in your job and dynamic work rebalancing will be limited to the number of elements in the source materialized PCollection.
Resource and Usage Management for Google DataFlow Pipeline
Google DataFlow fully manages resources available in Google Cloud on a per-job basis. This consists of spinning up and shutting down Compute Engine instances (which might be referred to as VMs or workers) and accessing your project’s Cloud Storage buckets for both temporary file staging and I/O. However, if your pipeline tackles Google Cloud storage technologies like Pub/Sub or Google BigQuery, you need to manage the resources and quota for those services.
Google DataFlow currently allows a maximum of 1000 Compute Engine Instances/jobs. For batch jobs, the default machine type is n1-standard-1. For streaming jobs, the default machine type for Streaming Engine-enabled jobs is n1-standard-2. When leveraging the default machine types, the Google DataFlow service can allocate upto 4000 cores/jobs. You can opt for a larger machine type if you need more cores for your job.
You can run upto 25 parallel Google DataFlow jobs with ease per Google Cloud project, however, this limit can be pushed by getting in touch with Google Cloud Support.
The service is limited to processing JSON job requests that are 20 MB in size or smaller at the moment. The size of the job request is specifically tied to the JSON representation of your pipeline. Therefore, a larger pipeline means a larger request. Apart from this, your job’s graph size should not exceed 10 MB either.
To have an understanding of the size of your pipeline’s JSOn request, you can run the pipeline with the following snippet:
--dataflowJobFile=< path to output file >
This command will write a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request, the actual size might be slightly larger due to some additional information included in the request.
The DataFlow Pipeline runner executes the steps of your streaming pipeline entirely on worker virtual machines while consuming memory, worker CPU, and Persistent Disk Storage. Google DataFlow’s Streaming Engine moves pipeline execution out of the worker VMs and moves it into the Google DataFlow backend.
Here are a few key benefits of leveraging Streaming Engine for your use case:
- More responsive horizontal autoscaling in response to variations in incoming data volume. Streaming Engine provides more granular and smoother scaling of workers for your convenience.
- A reduction in consumed CPU, memory, and Persistent Disk Storage resources on the worker VMs. Streaming Engine is at its peak when paired with smaller worker machine types and doesn’t need a Persistent Disk beyond a small worker boot disk, leading to less quota and resource consumption.
- Improved supportability, since you do not need to redeploy your pipelines to apply service updates.
Google DataFlow Shuffle is the base operation behind transforms like CoGroupByKey, GroupByKey, and Combine. The DataFlow Shuffle operation bisects and groups data by key in a scalable, fault-tolerant, and efficient manner. Batch jobs can leverage DataFloe Shuffle by default.
Here are a few key benefits of leveraging DataFlow Shuffle:
- Better fault-tolerance; an unhealthy VM holding Google DataFlow Shuffle data will not cause the entire job to fail, as might happen if not using this feature.
- It guarantees a faster execution time of batch pipelines for the majority of pipeline job types.
- A reduction can be seen in consumed memory, CPU, and Persistent Disk Storage resources on the worker VMs.
This blog talks about the different aspects of the Google DataFlow Pipeline in detail, discussing key aspects like Usage and Resource Management, Autotuning features, and the stages of the lifecycle after a brief introduction of the key features of Google DataFlow.
Visit our Website to Explore Hevo
Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. Hevo caters to 100+ data sources (including 40+ free sources) and can seamlessly ETL your data to a destination of your choice within minutes. Hevo’s Data Pipeline enriches your data and manages the transfer process in a fully automated and secure manner without having to write any code. It will make your life easier and make data migration hassle-free.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. Do check out the pricing details to understand which plan fulfills all your business needs.
Share your experience of this blog in understanding DataFlow Pipeline in the comments section!