From coding to testing and deployment, software development is a multi-stage process. Because many teams are involved across the development lifecycle, rolling out code updates smoothly becomes difficult.

Developers can use CI/CD processes to automate each stage of the application development lifecycle and reduce these problems. With a CI/CD workflow, developers write and integrate code, run test cases, and release changes to the product in real time, which makes software delivery automated and dependable.

This article explains how to set up a GCP CI/CD Pipeline for data processing using Google Cloud managed products and CI/CD technologies. Data scientists and analysts can use GCP CI/CD approaches to help ensure that data processes and workflows are of high quality, maintainable, and adaptable.

What is CI/CD?


CI and CD stand for Continuous Integration and Continuous Delivery (or Continuous Deployment), respectively. In short, Continuous Integration (CI) is a software development practice in which developers merge small, incremental code changes frequently and consistently.

CI/CD is a set of best practices for ensuring that product updates are sent to your Web Application on a regular and consistent basis. Your Web Application is bound to grow with each sprint that enters a new release cycle.

DevOps personnel benefit from CI/CD since it helps them to work more efficiently and effectively. It allows DevOps teams to be more imaginative in their software development by reducing time-consuming and arduous manual development tasks, as well as archaic approval processes.

However, as your team grows, there will be more points of interaction, and the chances of making a mistake will rise as you try to migrate all of the code changes from one staging environment to another. The CI/CD pipeline comes into play at this point.

It’s difficult to imagine a web application that is scalable in terms of speed and consistency without following CI/CD best practices in today’s world.

Key Features of CI/CD

  • Stable Testing Environment: Test the code in a clone of the production environment so that test results are stable and representative.
  • Maximum Exposure: Each Developer should have access to the most recent executables and should be able to observe any repository modifications.
  • Predictable Deployment: Deployments are routine and low-risk, so the team should feel comfortable carrying them out at any time.
  • Centralized Repository: All of the files and scripts needed to create builds are stored in Source Code Management (SCM).
  • Frequent Check-ins to the Main Branch: Trunk-based development means integrating code into your trunk, mainline, or master branch early and often.
  • Frequent Iteration: Frequent small commits to the repository limit the number of places where conflicts can hide, allowing for more frequent iteration.

What is Google Cloud Platform?


Google Cloud Platform (GCP), like Amazon Web Services (AWS) and Microsoft Azure, is a public cloud provider. Through GCP, customers can use computing resources located in Google’s data centers around the world, either for free or on a pay-per-use basis.

GCP provides a range of computing services, including GCP Cost Management, Data Management, delivery of web and video content, and AI and Machine Learning tools.

Key Features of Google Cloud Platform

The following are some key features of the Google Cloud Platform:

  • On-demand Services: It provides Web-based tools in an automated environment. As a result, there is no need for human interaction to gain access to resources.
  • Broad Network: The resources and information can be accessed from anywhere on the network.
  • Resource Pooling: It refers to the provision of a shared pool of computing resources to consumers on demand.
  • Rapid Elasticity: The capacity to add extra resources as needed.
  • Measured Service: The pay-as-you-go functionality allows consumers to pay only for the services they use.

Understanding GCP CI/CD Pipeline

The GCP CI/CD pipeline is made up of the following steps at a high level:

  • Using the Maven builder, Cloud Build turns the WordCount sample into a self-running Java Archive (JAR) file. The Maven builder is a container that contains Maven. Maven runs the tasks when a build step is configured to use the Maven builder.
  • The JAR file is uploaded to Cloud Storage using Cloud Build.
  • Cloud Build publishes the data-processing workflow code to Cloud Composer after running unit tests on it.
  • The JAR file is picked up by Cloud Composer, which then starts the data-processing task on Dataflow.
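
The ordering above matters: the workflow code is only deployed after its unit tests pass. The following is a minimal sketch of that gating idea in plain Python, not the Cloud Build implementation. The step names and the placeholder actions are hypothetical and only stand in for the real build work.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failing step.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, action in steps:
        if not action():
            break  # later steps (such as deployment) never run
        completed.append(name)
    return completed


# Hypothetical step names; the lambdas are placeholders for real work.
steps = [
    ("package_jar", lambda: True),
    ("upload_jar", lambda: True),
    ("unit_test_workflow", lambda: False),  # simulate a failing unit test
    ("deploy_workflow", lambda: True),
]

print(run_pipeline(steps))
```

Because the simulated unit test fails, the deployment step is never reached, which is the behavior you want from a build that protects the Cloud Composer environment from untested workflow code.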

The following diagram shows a detailed view of the GCP CI/CD pipeline steps.

GCP CI/CD pipelines

Data Processing Workflow of GCP CI/CD Pipeline

A Directed Acyclic Graph (DAG) written in Python contains the instructions for how Cloud Composer runs the data-processing workflow. All of the steps in the data-processing workflow, as well as their dependencies, are defined in the DAG.

In each build, the GCP CI/CD Pipeline deploys the DAG definition from Cloud Source Repositories to Cloud Composer automatically. Without requiring any human participation, this approach ensures that Cloud Composer is always up to date with the most recent workflow specification.

In addition to the data-processing workflow, an end-to-end test phase is defined in the DAG description for the test environment. The test phase ensures that the data-processing workflow is working properly.

The data-processing procedure is depicted in the diagram below.

GCP CI/CD: Data Processing

The steps in the data-processing workflow are as follows:

  • Step 1: In Dataflow, run the WordCount data process.
  • Step 2: Download the WordCount process’s output files. The WordCount process generates three files.
  • Step 3: Download the download_ref_string reference file.
  • Step 4: Compare the result to the reference file. This integration test aggregates all three output files and compares them to the reference file.

Managing the data-processing workflow with a task-orchestration framework like Cloud Composer reduces the workflow’s code complexity.
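
To make the idea of steps plus dependencies concrete, here is a small sketch that models the four workflow steps as a dependency graph and derives a valid execution order with Python's standard graphlib module. It is not the Airflow DAG definition used by Cloud Composer. The task names download_ref_string and do_comparison come from this article; run_wordcount and download_results are hypothetical names, and the exact dependency edges are an assumption.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
# run_wordcount and download_results are hypothetical names;
# the edges below are an assumed reading of Steps 1 to 4.
dependencies = {
    "run_wordcount": set(),
    "download_results": {"run_wordcount"},
    "download_ref_string": set(),
    "do_comparison": {"download_results", "download_ref_string"},
}


def execution_order(deps):
    """Return one valid order in which the tasks can run.

    TopologicalSorter raises graphlib.CycleError if the graph
    contains a cycle, which is what "acyclic" rules out.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Any order the sorter returns runs the comparison last, after both the results and the reference file are available. An orchestrator such as Cloud Composer applies the same principle, so the workflow code only has to declare dependencies rather than sequence the calls by hand.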

How to Create GCP CI/CD Pipelines?

Granting Access

Cloud Build deploys Cloud Composer DAGs and triggers workflows once you grant additional access to the Cloud Build service account. See the access control documentation for more information on the various roles available when working with Cloud Composer.

  • To allow the Cloud Build task to specify Airflow variables in Cloud Composer, add the composer.admin role to the Cloud Build service account in Cloud Shell:
gcloud projects add-iam-policy-binding $GCP_PROJECT_ID 
  • To allow the Cloud Build job to initiate the data workflow in Cloud Composer, add the composer.worker role to the Cloud Build service account:
gcloud projects add-iam-policy-binding $GCP_PROJECT_ID 
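
The two commands above are shown without their member and role arguments. As a hedged sketch, the following Python assembles what the full invocations could look like. The --member and --role flags are standard for gcloud projects add-iam-policy-binding, and the role names follow the text above. The project ID, the project number, and the service account address format (the legacy PROJECT_NUMBER@cloudbuild.gserviceaccount.com form) are assumptions; check which service account your Cloud Build setup actually uses.

```python
import shlex


def iam_binding_command(project_id: str, member: str, role: str) -> list:
    """Build the argument list for one IAM policy binding."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member={member}",
        f"--role={role}",
    ]


# Hypothetical values for illustration only.
project_id = "my-project"
project_number = "123456789012"
# Assumption: the legacy Cloud Build service account address format.
member = f"serviceAccount:{project_number}@cloudbuild.gserviceaccount.com"

binding_commands = [
    iam_binding_command(project_id, member, role)
    for role in ("roles/composer.admin", "roles/composer.worker")
]

for command in binding_commands:
    # shlex.join quotes each argument so the line is safe to paste into a shell.
    print(shlex.join(command))
```

Printing the commands instead of running them lets you review the member and role before granting access.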

Creating GCP CI/CD Test Pipelines

Step 1: Creating and Building GCP CI/CD Test Pipelines

The YAML configuration file specifies the Build and Test pipeline phases. To perform the tasks in each build step in this tutorial, you use prebuilt builder images for git, maven, gsutil, and gcloud. At build time, you use configuration variable substitutions to create the environment settings. Variable substitutions define the location of the source code repository and the locations of the Cloud Storage buckets. The build needs this information to deploy the JAR file, test files, and DAG definition.
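
As an illustration of how substitutions are passed, the sketch below formats a dictionary into the comma-separated KEY=VALUE list that the --substitutions flag of gcloud builds submit expects. Cloud Build requires user-defined substitution keys to start with an underscore and to use uppercase letters, digits, and underscores. The keys and values shown are hypothetical and are not taken from the tutorial's configuration file. Rejecting commas in values is a simplification made for this sketch.

```python
import re

# User-defined substitution keys: leading underscore, then
# uppercase letters, digits, or underscores.
_KEY_PATTERN = re.compile(r"_[A-Z0-9_]+")


def format_substitutions(values: dict) -> str:
    """Return a comma-separated KEY=VALUE string for --substitutions."""
    for key, value in values.items():
        if not _KEY_PATTERN.fullmatch(key):
            raise ValueError(f"invalid substitution key: {key!r}")
        if "," in str(value):
            # Simplification for this sketch: commas would split the list.
            raise ValueError(f"value for {key!r} must not contain a comma")
    return ",".join(f"{key}={value}" for key, value in values.items())


# Hypothetical keys and values for illustration only.
substitutions = {
    "_COMPOSER_ENV_NAME": "my-composer-env",
    "_COMPOSER_REGION": "us-central1",
}

flag = "--substitutions=" + format_substitutions(substitutions)
print(flag)
```

Validating the keys before submitting the build surfaces naming mistakes locally instead of as a failed build.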

To create the GCP CI/CD pipeline in Cloud Build, submit the Build Pipeline Configuration file in Cloud Shell:

cd ~/ci-cd-for-data-processing-workflow/source-code/build-pipeline
gcloud builds submit --config=build_deploy_test.yaml --substitutions=

This command instructs Cloud Build to run a build with the following steps:

  • Step A: Build and deploy a self-executing JAR file for WordCount.
    • Check out the source code.
    • Create a self-executing JAR file from the WordCount Beam source code.
    • Place the JAR file on Cloud Storage so that Cloud Composer can use it to run the WordCount processing job.
  • Step B: Deploy and set up the data-processing workflow in Cloud Composer.
    • Run the unit test on the workflow DAG’s custom-operator code.
    • Use Cloud Storage to store the test input and reference files. The WordCount processing task uses the test input file as its input. The test reference file is used as a check to ensure that the WordCount processing job’s output is correct.
    • Set the Cloud Composer variables to point to the JAR file that was just created.
    • In the Cloud Composer environment, deploy the workflow DAG definition.
  • Step C: To initiate the test-processing workflow, run the data-processing workflow in the test environment.

Step 2: Verifying the Test Pipeline

After submitting the build file, verify the build steps.

  • Step A: Go to the Build History page in the Cloud Console to get a list of all prior and current builds.
  • Step B: Select the currently running build by clicking on it.
  • Step C: Verify that the build steps on the Build Details page match the steps given earlier.
GCP CI/CD: Test Pipelines step 2

When the build is finished, the Status field on the Build details page indicates Build successful.

  • Step D: In Cloud Shell, verify that the WordCount sample JAR file was copied to the correct bucket:
gsutil ls gs://$DATAFLOW_JAR_BUCKET_TEST/dataflow_deployment*.jar

The following is an example of the output:

  • Step E: Get the URL of your Cloud Composer web interface. The URL appears in the airflowUri field of the output. Take note of it, as it will be needed in the next step.
gcloud composer environments describe $COMPOSER_ENV_NAME \
    --location $COMPOSER_REGION
  • Step F: To validate a successful DAG run, navigate to the Cloud Composer UI using the URL from the previous step. Wait a few minutes and reload the page if the Dag Runs column does not display any information.
    • Hold the pointer over the light-green circle below DAG Runs and check that it says Running to ensure that the data-processing workflow DAG test_word_count is deployed and running.
    • Click the light-green circle, then Dag Id: test_word_count on the Dag Runs page to observe the running data-processing workflow as a graph.
    • Refresh the Graph View page to see the current state of the DAG run. The workflow typically takes three to five minutes to complete. Hold the pointer over each task and check that the tooltip says State: success to ensure that the DAG runs correctly. The integration test, named do_comparison, is the final task, and it compares the process output to the reference file.
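
The comparison that do_comparison performs can be pictured with a short sketch. This is not the tutorial's operator code. It assumes the three output files hold one result per line and that line order does not matter, so it combines the parts, normalizes the lines, and compares them to the reference. The sample strings are made up for illustration.

```python
from typing import Iterable, List


def normalized_lines(text: str) -> List[str]:
    """Return the non-empty, stripped lines of text in sorted order."""
    return sorted(line.strip() for line in text.splitlines() if line.strip())


def matches_reference(result_parts: Iterable[str], reference: str) -> bool:
    """Combine all output parts and compare them to the reference text."""
    combined = "\n".join(result_parts)
    return normalized_lines(combined) == normalized_lines(reference)


# Made-up contents standing in for the three output files and the reference.
parts = ["the: 2\n", "quick: 1\n", "fox: 1\n"]
reference = "fox: 1\nquick: 1\nthe: 2\n"

print(matches_reference(parts, reference))
```

Sorting before comparing keeps the check independent of how the output happens to be split across files, which is an assumption of this sketch rather than a documented property of the WordCount job.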

Creating GCP CI/CD Production Pipelines

Step 1: Creating GCP CI/CD Production Pipelines

You can promote the current version of the workflow to production after the test processing workflow runs successfully. The workflow can be deployed to production in a number of ways:

  • Manually.
  • Automatically, triggered when all of the tests in the test or staging environments pass.
  • Automatically, triggered by a scheduled job.

This guide does not cover the automatic approaches.

In this guide, you will use the Cloud Build production deployment build to do a manual deployment to production. The steps for the production deployment build are as follows:

  • Step A: From the test bucket, copy the WordCount JAR file to the production bucket.
  • Step B: Set the production workflow’s Cloud Composer variables to point to the newly promoted JAR file.
  • Step C: Deploy the production workflow DAG definition and run the workflow in the Cloud Composer environment.
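
Steps A and B above can be sketched as two command lines assembled in Python. This is an illustration under stated assumptions, not the contents of the production build file: the bucket names, the JAR filename, the environment values, and the variable name dataflow_jar_file_prod are all hypothetical, and the variables set form simply mirrors the variables get form used later in this article. Step C is left out of the sketch.

```python
import shlex
from typing import List


def promotion_commands(
    jar_name: str,
    test_bucket: str,
    prod_bucket: str,
    env_name: str,
    region: str,
    variable_name: str,
) -> List[List[str]]:
    """Build the commands that copy the JAR and point a variable at it."""
    copy_jar = [
        "gsutil", "cp",
        f"gs://{test_bucket}/{jar_name}",
        f"gs://{prod_bucket}/{jar_name}",
    ]
    # Assumption: mirrors the "variables get -- NAME" form used in this article.
    set_variable = [
        "gcloud", "composer", "environments", "run", env_name,
        "--location", region,
        "variables", "set", "--", variable_name, jar_name,
    ]
    return [copy_jar, set_variable]


# Hypothetical values for illustration only.
commands = promotion_commands(
    jar_name="example.jar",
    test_bucket="my-test-jar-bucket",
    prod_bucket="my-prod-jar-bucket",
    env_name="my-composer-env",
    region="us-central1",
    variable_name="dataflow_jar_file_prod",
)

for command in commands:
    print(shlex.join(command))
```

Copying the already tested JAR, rather than rebuilding it, means production runs exactly the artifact that passed the test workflow.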

Variable substitutions define the name of the most recent JAR file deployed to production, along with the Cloud Storage buckets used by the production processing workflow. Complete the following steps to create the Cloud Build pipeline that deploys the production Airflow workflow:

  • Step A: In Cloud Shell, get the filename of the most recent JAR file by printing the Cloud Composer variable for the JAR filename:
export DATAFLOW_JAR_FILE_LATEST=$(gcloud composer environments run $COMPOSER_ENV_NAME \
    --location $COMPOSER_REGION variables get -- \
    dataflow_jar_file_test 2>&1 | grep -i '.jar')
  • Step B: Create the GCP CI/CD pipeline in Cloud Build using the deploy_prod.yaml build pipeline configuration file:
cd ~/ci-cd-for-data-processing-workflow/source-code/build-pipeline
gcloud builds submit --config=deploy_prod.yaml --substitutions=
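
The filename extraction in Step A relies on grep to pick the JAR name out of the combined command output. A rough Python equivalent is sketched below. It is slightly stricter than the shell filter, because it only accepts lines that end in .jar, whereas the dot in the grep pattern matches any character. The sample text is a placeholder and is not real gcloud output.

```python
from typing import Optional


def latest_jar_name(cli_output: str) -> Optional[str]:
    """Return the first line that ends with .jar, or None if there is none."""
    for line in cli_output.splitlines():
        candidate = line.strip()
        if candidate.lower().endswith(".jar"):
            return candidate
    return None


# Placeholder text for illustration; not real command output.
sample_output = "some informational line\nexample.jar\n"

print(latest_jar_name(sample_output))
```

Returning None when nothing matches makes a missing variable easy to detect before the production build is submitted with an empty filename.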

Step 2: Verifying the Data-Processing flows

  • Step A: Obtain the URL for your Cloud Composer user interface:
gcloud composer environments describe $COMPOSER_ENV_NAME \
    --location $COMPOSER_REGION
  • Step B: Go to the URL you retrieved in the previous step and look for the prod_word_count DAG in the list of DAGs to confirm that the production data-processing workflow DAG is active.
    • Click Trigger Dag in the prod_word_count row on the DAGs page.
    • Click Confirm in the confirmation dialogue.
  • Step C: To see the current state of the DAG run, reload the page. Hold the pointer over the light-green circle below DAG Runs and check that it says Running to ensure that the production data-processing workflow DAG is deployed and running.
  • Step D: After the run completes, hold the pointer over the dark-green circle below DAG Runs and check that it says Success.
  • Step E: In Cloud Shell, list the result files in the Cloud Storage bucket:
gsutil ls gs://$RESULT_BUCKET_PROD

The following is an example of the output:



In this article, you saw how to implement the Test and Production GCP CI/CD pipelines and walked through each step of the process. If you want to export data from a source of your choice into your desired database or destination, Hevo Data is the right choice for you!


Hevo Data, a No-code Data Pipeline, provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of destinations with a few clicks. With its strong integration with 100+ sources (including 40+ free sources), Hevo Data allows you to not only export data from your desired data sources and load it to the destination of your choice, but also transform and enrich your data to make it analysis-ready, so that you can focus on your key business needs and perform insightful analysis using BI tools. Hevo provides native support for various GCP platforms such as BigQuery, Cloud Storage, etc., and offers a wholesome experience.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience of learning about the GCP CI/CD pipelines! Let us know in the comments section below!

Harsh Varshney
Research Analyst, Hevo Data

Harsh is a data enthusiast with over 2.5 years of experience in research analysis and software development. He is passionate about translating complex technical concepts into clear and engaging content. His expertise in data integration and infrastructure shines through his 100+ published articles, helping data practitioners solve challenges related to data engineering.
