From coding to testing and deployment, software development is a multi-stage process. Because numerous teams are involved throughout the development lifecycle, shipping seamless code updates to improve the software becomes difficult.
CI/CD processes address this by automating each stage of the application development lifecycle. With a CI/CD workflow, developers can write and integrate code, run test cases, and release changes to the product in real time, giving them an automated and dependable software delivery process.
This article explains how to set up a GCP CI/CD Pipeline for data processing using Google Cloud managed products and CI/CD technologies. Data scientists and analysts can use GCP CI/CD approaches to help ensure that data processes and workflows are of high quality, maintainable, and adaptable.
What is CI/CD?
CI and CD stand for Continuous Integration and Continuous Delivery/Continuous Deployment, respectively. In a nutshell, Continuous Integration (CI) is a modern software development practice in which code changes are made frequently and in small, consistent increments.
CI/CD is a set of best practices for ensuring that product updates are delivered to your web application regularly and consistently. Your web application is bound to grow with each sprint that enters a new release cycle.
DevOps teams benefit from CI/CD because it helps them work more efficiently and effectively. By reducing time-consuming, arduous manual development tasks and archaic approval processes, it frees teams to be more creative in their software development. Scheduled triggers (such as Azure DevOps scheduled triggers) further enhance CI/CD workflows by automating builds and deployments at set intervals, ensuring timely delivery and reducing manual intervention.
However, as your team grows, there will be more points of interaction, and the chances of making a mistake will rise as you try to migrate all of the code changes from one staging environment to another. The CI/CD pipeline comes into play at this point.
It’s difficult to imagine a web application that is scalable in terms of speed and consistency without following CI/CD best practices in today’s world. Integrating Snowflake into your CI/CD pipeline ensures seamless data operations, automated testing, and streamlined deployment, fostering greater agility and consistency in data workflows.
What is Google Cloud Platform?
Google Cloud Platform (GCP), like Amazon Web Services (AWS) and Microsoft Azure, is a public cloud provider. Through GCP, customers can use computing resources located in Google’s data centers around the world for free or on a pay-per-use basis.
GCP provides a range of computing services, including cost management, data management, web and video delivery, and AI and Machine Learning tools. With GCP’s Pub/Sub, you can easily deploy message queues that scale automatically, ensuring reliable data delivery across applications in real time.
Hevo’s no-code platform is designed for quick and easy integration between 150+ sources, such as Oracle, to a destination of your choice. Check out the cool features of Hevo:
- Schema Management: Hevo Data automatically maps the source schema to perform analysis without worrying about the changing schema.
- Real-Time: Hevo Data works on the batch as well as real-time data transfer so that your data is analysis-ready always.
- Live Support: With 24/5 support, Hevo provides customer-centric solutions to your business use case.
Understanding GCP CI/CD Pipeline
The GCP CI/CD pipeline is made up of the following steps at a high level:
- Using the Maven builder, Cloud Build packages the WordCount sample into a self-executing Java Archive (JAR) file. The Maven builder is a container that contains Maven. When a build step is configured to use the Maven builder, Maven runs the tasks.
- The JAR file is uploaded to Cloud Storage using Cloud Build (an illustrative sketch of this upload follows the list).
- Cloud Build publishes the data-processing workflow code to Cloud Composer after running unit tests on it.
- The JAR file is picked up by Cloud Composer, which then starts the data-processing task on Dataflow.
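In the tutorial, this upload is handled by a gsutil build step inside Cloud Build. Purely as an illustration of what that step does, the sketch below performs the same upload with the Cloud Storage Python client; the bucket name, local JAR path, and destination object name are placeholders, not values from the tutorial.

# Illustrative sketch only: upload a built JAR to a Cloud Storage bucket.
# The bucket and file names below are placeholders, not the tutorial's values.
from google.cloud import storage

def upload_jar(bucket_name, local_jar_path, destination_name):
    client = storage.Client()  # uses Application Default Credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_name)
    blob.upload_from_filename(local_jar_path)  # copies the local JAR into the bucket
    print(f"Uploaded {local_jar_path} to gs://{bucket_name}/{destination_name}")

upload_jar("example-dataflow-jar-bucket-test", "target/wordcount-bundled.jar",
           "dataflow_deployment_wordcount.jar")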
The following diagram shows a detailed view of the GCP CI/CD pipeline steps.
Data Processing Workflow of GCP CI/CD Pipeline
A Directed Acyclic Graph (DAG) developed in Python contains the instructions for how Cloud Composer conducts the data-processing workflow. All of the steps in the data-processing workflow, as well as their dependencies, are defined in the DAG.
In each build, the GCP CI/CD pipeline automatically deploys the DAG definition from Cloud Source Repositories to Cloud Composer. This approach ensures that Cloud Composer is always up to date with the most recent workflow specification, without requiring any manual intervention.
In addition to the data-processing workflow, an end-to-end test phase is defined in the DAG description for the test environment. The test phase ensures that the data-processing workflow is working properly.
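As a rough sketch of what such a DAG definition could look like, the snippet below uses Airflow 2-style imports with simple BashOperator placeholders rather than the tutorial’s actual operators; only the DAG ID test_word_count and the task name do_comparison come from this guide, and the task names loosely mirror the workflow steps listed below.

# Illustrative sketch of a Cloud Composer (Airflow 2) DAG for this workflow.
# The operators and bash commands are placeholders; only the overall shape matters.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="test_word_count",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered by the CI/CD pipeline, not on a schedule
    catchup=False,
) as dag:
    run_wordcount = BashOperator(
        task_id="run_wordcount",
        bash_command="echo 'submit the WordCount JAR to Dataflow here'",
    )
    download_results = BashOperator(
        task_id="download_result_files",
        bash_command="echo 'download the three output files from Cloud Storage here'",
    )
    do_comparison = BashOperator(
        task_id="do_comparison",
        bash_command="echo 'compare the combined output with the reference file here'",
    )

    # The dependency chain encodes the order of the workflow steps.
    run_wordcount >> download_results >> do_comparison

In the real workflow, the Dataflow submission and the comparison would be implemented with purpose-built or custom operators rather than echo placeholders.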
The data-processing procedure is depicted in the diagram below.
The steps in the data-processing workflow are as follows:
- Step 1: In Dataflow, run the WordCount data process.
- Step 2: Download the WordCount process’s output files. Three files are generated by the WordCount process:
download_result_1
download_result_2
download_result_3
- Step 3: Download the download_ref_string reference file.
- Step 4: Compare the outcome to the reference file. This integration test combines the three result files and compares the combined output to the reference file.
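As a minimal sketch of what that comparison could look like in Python (assuming the word: count output format of the Beam WordCount example; this is not the tutorial’s actual do_comparison code), consider:

# Illustrative sketch of the integration check: merge the three WordCount
# output files and compare the word counts with the reference file.
# The parsing logic is an assumption based on the "word: count" output format
# of the Beam WordCount example; file names follow the workflow steps above.
def load_counts(paths):
    counts = {}
    for path in paths:
        with open(path) as handle:
            for line in handle:
                word, _, count = line.rpartition(":")
                if word:
                    counts[word.strip()] = int(count)
    return counts

results = load_counts(["download_result_1", "download_result_2", "download_result_3"])
reference = load_counts(["download_ref_string"])

if results == reference:
    print("Integration test passed: output matches the reference file.")
else:
    print("Integration test failed: output differs from the reference file.")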
Managing the data-processing workflow with a task-orchestration framework like Cloud Composer reduces the workflow’s code complexity.
How to Create GCP CI/CD Pipelines?
Granting Access
You grant additional access to the Cloud Build service account so that Cloud Build can deploy Cloud Composer DAGs and trigger workflows. See the access control documentation for more information on the roles available when working with Cloud Composer.
- To allow the Cloud Build task to specify Airflow variables in Cloud Composer, add the composer.admin role to the Cloud Build service account in Cloud Shell:
gcloud projects add-iam-policy-binding $GCP_PROJECT_ID \
    --member=serviceAccount:$PROJECT_NUMBER@cloudbuild.gserviceaccount.com \
    --role=roles/composer.admin
- To allow the Cloud Build job to initiate the data workflow in Cloud Composer, add the composer.worker role to the Cloud Build service account:
gcloud projects add-iam-policy-binding $GCP_PROJECT_ID \
    --member=serviceAccount:$PROJECT_NUMBER@cloudbuild.gserviceaccount.com \
    --role=roles/composer.worker
Creating GCP CI/CD Test Pipelines
Step 1: Creating and Building GCP CI/CD Test Pipelines
The YAML configuration file specifies the build and test pipeline phases. To perform the tasks in each build step in this tutorial, you use prebuilt builder images for git, maven, gsutil, and gcloud. At build time, you use configuration variable substitutions to define the environment settings. The variable substitutions define the location of the source code repository and the locations of the Cloud Storage buckets. The build needs this information to deploy the JAR file, the test files, and the DAG definition.
To create the GCP CI/CD pipeline in Cloud Build, submit the Build Pipeline Configuration file in Cloud Shell:
cd ~/ci-cd-for-data-processing-workflow/source-code/build-pipeline
gcloud builds submit --config=build_deploy_test.yaml --substitutions=\
REPO_NAME=$SOURCE_CODE_REPO,\
_DATAFLOW_JAR_BUCKET=$DATAFLOW_JAR_BUCKET_TEST,\
_COMPOSER_INPUT_BUCKET=$INPUT_BUCKET_TEST,\
_COMPOSER_REF_BUCKET=$REF_BUCKET_TEST,\
_COMPOSER_DAG_BUCKET=$COMPOSER_DAG_BUCKET,\
_COMPOSER_ENV_NAME=$COMPOSER_ENV_NAME,\
_COMPOSER_REGION=$COMPOSER_REGION,\
_COMPOSER_DAG_NAME_TEST=$COMPOSER_DAG_NAME_TEST
- Step A: Create and deploy a self-executing JAR file for WordCount.
- Take a look at the source code.
- Create a self-executing JAR file from the WordCount Beam source code.
- Place the JAR file on Cloud Storage so that Cloud Composer may use it to perform the WordCount processing job.
- Step B: Install Cloud Composer and set up the data-processing workflow.
- Run the unit test on the workflow DAG’s custom-operator code.
- Use Cloud Storage to store the test input and reference files. The WordCount processing task uses the test input file as its input. The test reference file is used as a check to ensure that the WordCount processing job’s output is correct.
- Set the Cloud Composer variables to point to the JAR file that was just created (the sketch after this list shows how the DAG can read such a variable).
- In the Cloud Composer environment, deploy the workflow DAG definition.
- Step C: To initiate the test-processing workflow, run the data-processing workflow in the test environment.
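To illustrate how the deployed DAG can pick up the variable set in the previous step, here is a minimal Airflow sketch; the variable name dataflow_jar_file_test appears later in this guide, while the default value and the print statement are placeholders.

# Illustrative sketch: read the Cloud Composer (Airflow) variable that the
# build pipeline sets, so the DAG knows which JAR file to hand to Dataflow.
from airflow.models import Variable

# The variable name matches the one used later in this guide; the default value
# is a placeholder returned only if the variable has not been set yet.
dataflow_jar_file = Variable.get(
    "dataflow_jar_file_test",
    default_var="dataflow_deployment_placeholder.jar",
)

print(f"The workflow would submit {dataflow_jar_file} to Dataflow.")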
Step 2: Verifying the Test Pipeline
Verify the build procedures after submitting the build file.
- Step A: Go to the Build History page in the Cloud Console to get a list of all prior and current builds.
- Step B: Select the currently running build by clicking on it.
- Step C: Verify that the build steps on the Build Details page match the steps given earlier.
When the build is finished, the Status field on the Build details page indicates Build successful.
- Step D: Verify that the WordCount sample JAR file was copied to the relevant bucket in Cloud Shell:
gsutil ls gs://$DATAFLOW_JAR_BUCKET_TEST/dataflow_deployment*.jar
The following is an example of the output:
gs://…-composer-dataflow-source-test/dataflow_deployment_e88be61e-50a6-4aa0-beac-38d75871757e.jar
- Step E: Get the URL of the web interface for your Cloud Composer environment. Take note of the URL, as it will be needed in the next step:
gcloud composer environments describe $COMPOSER_ENV_NAME \
    --location $COMPOSER_REGION \
    --format="get(config.airflowUri)"
- Step F: To validate a successful DAG run, navigate to the Cloud Composer UI using the URL from the previous step. Wait a few minutes and reload the page if the Dag Runs column does not display any information.
- Hold the pointer over the light-green circle below DAG Runs and check that it says Running to ensure that the data-processing workflow DAG test_word_count is deployed and running.
- Click the light-green circle, then Dag Id: test_word_count on the Dag Runs page to observe the running data-processing workflow as a graph.
- Refresh the Graph View page to see the current state of the DAG run. The workflow typically takes three to five minutes to complete. Hold the pointer over each job and check that the tooltip says State: success to ensure that the DAG runs correctly. The integration test, named do_comparison, is the final task, and it compares the process output to the reference file.
Creating GCP CI/CD Production Pipelines
Step 1: Creating GCP CI/CD Production Pipelines
You can promote the current version of the workflow to production after the test processing workflow runs successfully. The workflow can be deployed to production in a number of ways:
- Manually.
- Automatically, triggered when all of the tests pass in the test or staging environments.
- Automatically, initiated by a scheduled job.
This guide does not cover the automatic approaches.
In this guide, you will use the Cloud Build production deployment build to do a manual deployment to production. The steps for the production deployment build are as follows:
- Step A: Copy the WordCount JAR file from the test bucket to the production bucket (see the sketch after this list).
- Step B: Set the production workflow’s Cloud Composer variables to point to the newly promoted JAR file.
- Step C: Deploy the production workflow DAG definition and run the workflow in the Cloud Composer environment.
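In the tutorial, this promotion is performed by a Cloud Build step (typically with gsutil). Purely as an illustration of Step A, the sketch below does the same server-side copy with the Cloud Storage Python client; the bucket and object names are placeholders.

# Illustrative sketch: promote the tested JAR by copying it from the test
# bucket to the production bucket. Bucket and object names are placeholders.
from google.cloud import storage

def promote_jar(test_bucket_name, prod_bucket_name, jar_name):
    client = storage.Client()
    test_bucket = client.bucket(test_bucket_name)
    prod_bucket = client.bucket(prod_bucket_name)

    jar_blob = test_bucket.blob(jar_name)
    # Server-side copy; the JAR never leaves Cloud Storage.
    test_bucket.copy_blob(jar_blob, prod_bucket, new_name=jar_name)
    print(f"Promoted gs://{test_bucket_name}/{jar_name} "
          f"to gs://{prod_bucket_name}/{jar_name}")

promote_jar("example-dataflow-jar-bucket-test", "example-dataflow-jar-bucket-prod",
            "dataflow_deployment_wordcount.jar")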
Variable substitutions define the name of the most recent JAR file delivered to production, along with the Cloud Storage buckets used by the production processing workflow. Complete the following steps to create the Cloud Build pipeline that deploys the production Airflow workflow:
- Step A: Print the Cloud Composer variable for the JAR filename in Cloud Shell to get the filename of the most recent JAR file:
export DATAFLOW_JAR_FILE_LATEST=$(gcloud composer environments run $COMPOSER_ENV_NAME \
    --location $COMPOSER_REGION variables get -- \
    dataflow_jar_file_test 2>&1 | grep -i '.jar')
- Step B: Create the GCP CI/CD pipeline in Cloud Build using the deploy_prod.yaml build pipeline configuration file:
cd ~/ci-cd-for-data-processing-workflow/source-code/build-pipeline
gcloud builds submit --config=deploy_prod.yaml --substitutions=\
REPO_NAME=$SOURCE_CODE_REPO,\
_DATAFLOW_JAR_BUCKET_TEST=$DATAFLOW_JAR_BUCKET_TEST,\
_DATAFLOW_JAR_FILE_LATEST=$DATAFLOW_JAR_FILE_LATEST,\
_DATAFLOW_JAR_BUCKET_PROD=$DATAFLOW_JAR_BUCKET_PROD,\
_COMPOSER_INPUT_BUCKET=$INPUT_BUCKET_PROD,\
_COMPOSER_ENV_NAME=$COMPOSER_ENV_NAME,\
_COMPOSER_REGION=$COMPOSER_REGION,\
_COMPOSER_DAG_BUCKET=$COMPOSER_DAG_BUCKET,\
_COMPOSER_DAG_NAME_PROD=$COMPOSER_DAG_NAME_PROD
Step 2: Verifying the Data-Processing Workflow
- Step A: Obtain the URL for your Cloud Composer user interface:
gcloud composer environments describe $COMPOSER_ENV_NAME \
    --location $COMPOSER_REGION \
    --format="get(config.airflowUri)"
- Step B: Go to the URL you retrieved in the previous step and look for the prod_word_count DAG in the list of DAGs to confirm that the production data-processing workflow DAG is active.
- Click Trigger Dag in the prod_word_count row on the DAGs page.
- Click Confirm in the confirmation dialogue.
- Step C: To see the current state of the DAG run, reload the page. Hold the pointer over the light-green circle below DAG Runs and check that it says Running to ensure that the production data-processing workflow DAG is deployed and running.
- Step D: Hold the pointer over the dark-green circle below the DAG runs column after the run completes and check that it shows Success.
- Step E: List the result files in the Cloud Storage bucket in Cloud Shell:
gsutil ls gs://$RESULT_BUCKET_PROD
The following is an example of the output:
gs://…-composer-result-prod/output-00000-of-00003
gs://…-composer-result-prod/output-00001-of-00003
gs://…-composer-result-prod/output-00002-of-00003
Conclusion
In this article, you saw how to implement the test and production GCP CI/CD pipelines and walked through each step of the process. In case you want to export data from a source of your choice into your desired Database/destination, then Hevo Data is the right choice for you!
Hevo Data, a No-code Data Pipeline, provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations with a few clicks. Hevo Data, with its strong integration with 150+ sources (including 60+ free sources), allows you to not only export data from your desired data sources & load it to the destination of your choice but also transform & enrich your data to make it analysis-ready.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
Share your experience of learning about the GCP CI/CD pipelines! Let us know in the comments section below!
FAQs
1. What is CI/CD in GCP?
In GCP, CI/CD stands for Continuous Integration and Continuous Deployment. It involves automating code testing, integration, and deployment with tools such as Cloud Build and Cloud Deploy to speed up development and boost productivity.
2. What is GCP in DevOps?
GCP is Google Cloud Platform, a collection of cloud computing services from Google.
3. What CI/CD tool is used in Google?
Google Cloud uses Cloud Build for CI/CD, which automates the build, test, and deployment processes.
Harsh is a data enthusiast with over 2.5 years of experience in research analysis and software development. He is passionate about translating complex technical concepts into clear and engaging content. His expertise in data integration and infrastructure shines through his 100+ published articles, helping data practitioners solve challenges related to data engineering.