Most businesses today rely on data-driven decision-making. They collect large volumes of data that can be analyzed to understand their customers and products, allowing them to plan future growth, product, and marketing strategies accordingly.

In order to perform a proper analysis, the first step is to create a Single Source of Truth for all their data. This means that data has to be extracted from all platforms they use and stored in a centralized database. Considering the volume of data most businesses collect today, this becomes a complicated task.

In this article, you will gain information about setting up ETL using Python. You will also gain a holistic understanding of Python, its key features, different methods to set up ETL using Python scripts, the limitations of manually setting up ETL using Python, and the top 10 ETL using Python tools. Read along to find out in-depth information about setting up ETL automation using Python.

Setting Up ETL Using Python Script

This section will help you understand how you can set up a simple data pipeline that extracts data from MySQL, Microsoft SQL Server, and Firebird databases and loads it into a Microsoft SQL Server database. It will also give you a basic idea of how easy it is to set up ETL Using Python.

The same process can also be used to implement a custom script based on your requirements by making changes to the databases being used and queries accordingly. In order to set up ETL using a Python script, the following steps have to be implemented:

ETL Using Python Step 1: Installing Required Modules

The following modules are required to set up ETL using Python for the above-mentioned data sources:

  • mysql-connector-python: to connect to the MySQL source databases.
  • pyodbc: to connect to Microsoft SQL Server (used here as both a source and the target).
  • fdb: to connect to the Firebird source databases.

These modules can be installed using pip by running the following commands in the Command Prompt:

pip install mysql-connector-python
pip install pyodbc
pip install fdb

ETL Using Python Step 2: Setting Up ETL Directory

The following files should be created in your project directory in order to set up ETL Using Python:

  • db_credentials.py: Should contain all the information required to establish connections with all databases, e.g., database passwords, port numbers, etc.
  • sql_queries.py: Should contain all the commonly used database queries to extract and load data, in string format.
  • etl.py: Performs all the necessary operations to connect to the databases and run the required queries.
  • main.py: Responsible for maintaining the flow of operations and performing the necessary operations in a specific order.

A) db_credentials.py

This file is required for setting up all Source and Target Database Connection Strings in the process to set up ETL using Python. This file should have all the required information to access the appropriate database in a list format so that it can be iterated easily when required.

The format for this file is as follows:

datawarehouse_name = 'your_datawarehouse_name'

# sql-server (target db, datawarehouse)
datawarehouse_db_config = {
  'Trusted_Connection': 'yes',
  'driver': '{SQL Server}',
  'server': 'datawarehouse_sql_server',
  'database': '{}'.format(datawarehouse_name),
  'user': 'your_db_username',
  'password': 'your_db_password',
  'autocommit': True,
}

# sql-server (source db)
sqlserver_db_config = [
  {
    'Trusted_Connection': 'yes',
    'driver': '{SQL Server}',
    'server': 'your_sql_server',
    'database': 'db1',
    'user': 'your_db_username',
    'password': 'your_db_password',
    'autocommit': True,
  }
]

# mysql (source db)
mysql_db_config = [
  {
    'user': 'your_user_1',
    'password': 'your_password_1',
    'host': 'db_connection_string_1',
    'database': 'db_1',
  },
  {
    'user': 'your_user_2',
    'password': 'your_password_2',
    'host': 'db_connection_string_2',
    'database': 'db_2',
  },
]

# firebird (source db)
fdb_db_config = [
  {
    'dsn': "/your/path/to/source.db",
    'user': "your_username",
    'password': "your_password",
  }
]

B) sql_queries.py

This file contains queries that can be used to perform the required operations to extract data from the Source Databases and load it into the Target Database in the process to set up ETL using Python.

The format for this file is as follows:

# example queries; these will differ across database platforms
firebird_extract = ('''
  SELECT fbd_column_1, fbd_column_2, fbd_column_3
  FROM fbd_table;
''')

firebird_insert = ('''
  INSERT INTO table (column_1, column_2, column_3)
  VALUES (?, ?, ?)  
''')

firebird_extract_2 = ('''
  SELECT fbd_column_1, fbd_column_2, fbd_column_3
  FROM fbd_table_2;
''')

firebird_insert_2 = ('''
  INSERT INTO table_2 (column_1, column_2, column_3)
  VALUES (?, ?, ?)  
''')

sqlserver_extract = ('''
  SELECT sqlserver_column_1, sqlserver_column_2, sqlserver_column_3
  FROM sqlserver_table
''')

sqlserver_insert = ('''
  INSERT INTO table (column_1, column_2, column_3)
  VALUES (?, ?, ?)  
''')

mysql_extract = ('''
  SELECT mysql_column_1, mysql_column_2, mysql_column_3
  FROM mysql_table
''')

mysql_insert = ('''
  INSERT INTO table (column_1, column_2, column_3)
  VALUES (?, ?, ?)  
''')

# exporting queries
class SqlQuery:
  def __init__(self, extract_query, load_query):
    self.extract_query = extract_query
    self.load_query = load_query
    
# create instances for SqlQuery class
fbd_query = SqlQuery(firebird_extract, firebird_insert)
fbd_query_2 = SqlQuery(firebird_extract_2, firebird_insert_2)
sqlserver_query = SqlQuery(sqlserver_extract, sqlserver_insert)
mysql_query = SqlQuery(mysql_extract, mysql_insert)

# store as list for iteration
fbd_queries = [fbd_query, fbd_query_2]
sqlserver_queries = [sqlserver_query]
mysql_queries = [mysql_query]

C) etl.py

This file should contain all the code that helps establish connections among the correct databases and run the required queries in order to set up ETL using Python.

The format for this file is as follows:

# python modules
import mysql.connector
import pyodbc
import fdb

# variables
from db_credentials import datawarehouse_name

def etl(query, source_cnx, target_cnx):
  # extract data from source db
  source_cursor = source_cnx.cursor()
  source_cursor.execute(query.extract_query)
  data = source_cursor.fetchall()
  source_cursor.close()

  # load data into warehouse db
  if data:
    target_cursor = target_cnx.cursor()
    target_cursor.execute("USE {}".format(datawarehouse_name))
    target_cursor.executemany(query.load_query, data)
    print('data loaded to warehouse db')
    target_cursor.close()
  else:
    print('data is empty')

def etl_process(queries, target_cnx, source_db_config, db_platform):
  # establish source db connection
  if db_platform == 'mysql':
    source_cnx = mysql.connector.connect(**source_db_config)
  elif db_platform == 'sqlserver':
    source_cnx = pyodbc.connect(**source_db_config)
  elif db_platform == 'firebird':
    source_cnx = fdb.connect(**source_db_config)
  else:
    return 'Error! unrecognised db platform'
  
  # loop through sql queries
  for query in queries:
    etl(query, source_cnx, target_cnx)
    
  # close the source db connection
  source_cnx.close()

D) main.py

The code in this file is responsible for iterating through the credentials, connecting to each database, and performing the required ETL using Python operations.

The format for this file is as follows:

# modules
import pyodbc

# variables
from db_credentials import datawarehouse_db_config, sqlserver_db_config, mysql_db_config, fdb_db_config
from sql_queries import fbd_queries, sqlserver_queries, mysql_queries

# methods
from etl import etl_process

def main():
  print('starting etl')
	
  # establish connection for target database (sql-server)
  target_cnx = pyodbc.connect(**datawarehouse_db_config)
	
  # loop through credentials

  # mysql
  for config in mysql_db_config: 
    try:
      print("loading db: " + config['database'])
      etl_process(mysql_queries, target_cnx, config, 'mysql')
    except Exception as error:
      print("etl for {} has error".format(config['database']))
      print('error message: {}'.format(error))
      continue
	
  # sql-server
  for config in sqlserver_db_config: 
    try:
      print("loading db: " + config['database'])
      etl_process(sqlserver_queries, target_cnx, config, 'sqlserver')
    except Exception as error:
      print("etl for {} has error".format(config['database']))
      print('error message: {}'.format(error))
      continue

  # firebird
  for config in fdb_db_config: 
    try:
      print("loading db: " + config['dsn'])
      etl_process(fbd_queries, target_cnx, config, 'firebird')
    except Exception as error:
      print("etl for {} has error".format(config['dsn']))
      print('error message: {}'.format(error))
      continue
	
  target_cnx.close()

if __name__ == "__main__":
  main()
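
With all four files in place and the credentials filled in, the pipeline can be run from the project directory with a single command:

python main.py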

Python ETL Example

ETL is the process of extracting a huge amount of data from a wide array of sources and formats and then converting & consolidating it into a single format before storing it in a database or writing it to a destination file.

In this example, some of the data is stored in CSV files, some in JSON files, and some in XML files. All of this data has to be consolidated into a single format and then stored in a unified file location.

Step 1: Import the modules and functions

In this ETL using Python example, first, you need to import the required modules and functions.

import glob 
import pandas as pd 
import xml.etree.ElementTree as ET 
from datetime import datetime

  • The dealership_data file contains CSV, JSON, and XML files for used car data. The features incorporated here are car_model, year_of_manufacture, price, and fuel. You need to extract the data from these raw files, transform it into the target format, and then load it into the output file.

Download the source file:

!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/datasource.zip

Extracting the zip file:

unzip datasource.zip -d dealership_data

Setting the path for Target files:

tmpfile    = "dealership_temp.tmp"               # store all extracted data

logfile    = "dealership_logfile.txt"            # all event logs will be stored

targetfile = "dealership_transformed_data.csv"   # transformed data is stored

Step 2: Extract

The extract functions in this ETL using Python example read the raw data from the source files into Pandas DataFrames. A separate function handles each file format.

CSV Extract Function:

def extract_from_csv(file_to_process): 
    dataframe = pd.read_csv(file_to_process) 
    return dataframe

JSON Extract Function

def extract_from_json(file_to_process):
    dataframe = pd.read_json(file_to_process,lines=True)
    return dataframe

XML Extract Function

def extract_from_xml(file_to_process):
    rows = []
    tree = ET.parse(file_to_process)
    root = tree.getroot()
    for person in root:
        car_model = person.find("car_model").text
        year_of_manufacture = int(person.find("year_of_manufacture").text)
        price = float(person.find("price").text)
        fuel = person.find("fuel").text
        rows.append({"car_model": car_model, "year_of_manufacture": year_of_manufacture, "price": price, "fuel": fuel})
    return pd.DataFrame(rows, columns=['car_model', 'year_of_manufacture', 'price', 'fuel'])

Calling the extract() Function

def extract():
    extracted_data = pd.DataFrame(columns=['car_model', 'year_of_manufacture', 'price', 'fuel'])
    # for csv files
    for csvfile in glob.glob("dealership_data/*.csv"):
        extracted_data = pd.concat([extracted_data, extract_from_csv(csvfile)], ignore_index=True)
    # for json files
    for jsonfile in glob.glob("dealership_data/*.json"):
        extracted_data = pd.concat([extracted_data, extract_from_json(jsonfile)], ignore_index=True)
    # for xml files
    for xmlfile in glob.glob("dealership_data/*.xml"):
        extracted_data = pd.concat([extracted_data, extract_from_xml(xmlfile)], ignore_index=True)
    return extracted_data

Step 3: Transform

Using the transform function, you can convert the data into whatever shape your target requires. In this example, it simply rounds the price column to two decimal places.

def transform(data):
    data['price'] = round(data.price, 2)
    return data
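
The transform step can grow with your requirements. As a purely illustrative sketch (the extra cleaning rules below are assumptions, not part of the original lab), the same function could also normalize text columns and drop incomplete rows:

def transform(data):
    # round prices to two decimal places (as above)
    data['price'] = round(data.price, 2)
    # assumed extra rules: normalize the fuel column and drop incomplete rows
    data['fuel'] = data['fuel'].str.strip().str.lower()
    data = data.dropna(subset=['car_model', 'year_of_manufacture', 'price'])
    return data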

Step 4: Loading and Logging

In this step, the data is loaded to the destination file. A logging entry needs to be established before loading. Usually, the data can be loaded into:

  • Data warehouses
  • Databases
  • Flat files
  • Data lakes

load function()

def load(targetfile,data_to_load):
    data_to_load.to_csv(targetfile)
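
Flat files are only one possible target. As a minimal sketch of loading the same DataFrame into a relational database instead (the SQLite file name and table name are assumptions, and SQLAlchemy must be installed), Pandas' to_sql can write it directly:

from sqlalchemy import create_engine

def load_to_db(data_to_load):
    # assumed destination: a local SQLite file with a 'used_cars' table
    engine = create_engine("sqlite:///dealership.db")
    data_to_load.to_sql("used_cars", engine, if_exists="replace", index=False)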

log function()

def log(message):
    timestamp_format = '%H:%M:%S-%b-%d-%Y'  # Hour:Minute:Second-MonthName-Day-Year
    now = datetime.now()  # get current timestamp
    timestamp = now.strftime(timestamp_format)
    with open("dealership_logfile.txt", "a") as f:
        f.write(timestamp + ',' + message + '\n')
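
For anything beyond a quick demo, Python's built-in logging module is usually a better fit than hand-written file appends, since it handles timestamps, levels, and handlers for you. A minimal sketch that keeps the same call signature (the log file name mirrors the one above):

import logging

logging.basicConfig(
    filename="dealership_logfile.txt",
    level=logging.INFO,
    format="%(asctime)s,%(message)s",
)

def log(message):
    logging.info(message)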

Step 5: Running ETL Process

The log indicates that you have started the ETL process.

log("ETL Job Started")

The log indicates that you have started and ended the Extract phase.

log("Extract phase Started")
extracted_data = extract() 
log("Extract phase Ended")

The log indicates that you have started and ended the Transform phase.

log("Transform phase Started")
transformed_data = transform(extracted_data)
log("Transform phase Ended")

The log indicates that you have started and ended the Load phase.

log("Load phase Started")
load(targetfile,transformed_data)
log("Load phase Ended")

The log indicates that the ETL process has ended.

log("ETL Job Ended")

Using Python for ETL Processes 

With Python’s programming capabilities, you can create ETL pipelines that manage data and transform it in accordance with business requirements.

Because of Python's versatility, data engineers and developers can write a program for almost any ETL process, including data aggregation. Python's built-in data structures, such as dictionaries and lists, efficiently handle the indexing and lookup work that ETL operations rely on.

You can detect and filter out null values in your data with the built-in math module (for example, math.isnan), or drop them in bulk with a library such as Pandas. In addition, many Python ETL tools combine pure Python code with externally defined functions and libraries. These tools work alongside other Python libraries to extract, transform, and load tables of data from multiple sources into data warehouses.
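
As a small sketch of what null filtering looks like in practice (the sample values are made up for illustration):

import math

prices = [19.99, float("nan"), 42.5, None]

# keep only real numeric values, dropping None and NaN entries
clean_prices = [p for p in prices if p is not None and not math.isnan(p)]
print(clean_prices)  # [19.99, 42.5]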

Limitations of Manually Setting Up ETL Using Python

The examples in the previous sections perform only basic Extract and Load operations with, at most, a trivial transformation. In a real-life situation, the operations would be far more complex and dynamic, requiring transformations such as mathematical calculations, denormalization, etc.

Manually programming each ETL process and workflow whenever you wish to set up ETL using Python would require immense engineering bandwidth. A large number of tools can make this process considerably easier than a fully manual implementation. To automate the process of setting up ETL using Python, an automated no-code data pipeline such as Hevo Data can load data from your desired sources in a hassle-free manner.

There are also a large number of Python ETL tools that will help you automate your ETL processes and workflows, making your experience seamless.

Top 10 ETL Using Python Tools

Some of the best tools that can make ETL Using Python easier are as follows:

1) Hevo

Hevo allows you to replicate data in near real-time from 150+ sources to the destination of your choice, including Snowflake, BigQuery, Redshift, Databricks, and Firebolt, without writing a single line of code. Finding patterns and opportunities is easier when you don’t have to worry about maintaining the pipelines. So, with Hevo as your data pipeline platform, maintenance is one less thing to worry about.

For the rare times things do go wrong, Hevo ensures zero data loss. To find the root cause of an issue, Hevo also lets you monitor your workflow so that you can address the issue before it derails the entire workflow. Add 24*7 customer support to the list, and you get a reliable tool that puts you at the wheel with greater visibility. Check Hevo’s in-depth documentation to learn more.

If you don’t want SaaS tools with unclear pricing that burn a hole in your pocket, opt for a tool that offers a simple, transparent pricing model. Hevo has three usage-based pricing plans starting with a free tier, where you can ingest up to 1 million records.

Hevo was the most mature Extract and Load solution available, along with Fivetran and Stitch but it had better customer service and attractive pricing. Switching to a Modern Data Stack with Hevo as our go-to pipeline solution has allowed us to boost team collaboration and improve data reliability, and with that, the trust of our stakeholders on the data we serve.

– Juan Ramos, Analytics Engineer, Ebury

Check out how Hevo empowered Ebury to build reliable data products here.

Sign up here for a 14-Day Free Trial!

2) Petl

Petl (Python ETL) is one of the simplest tools that allows its users to set up ETL Using Python. It can be used to import data from numerous data sources such as CSV, XML, JSON, XLS, etc. It also houses support for simple transformations such as Row Operations, Joining, Aggregations, Sorting, etc.

However, Petl is not capable of performing any sort of Data Analytics and experiences performance issues with large datasets. Hence, it is considered to be suitable for only simple ETL Using Python operations that do not require complex transformations or analysis.
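
As a rough sketch of what a Petl pipeline looks like (the file names and column names below are assumptions, not part of an official example):

import petl as etl

# extract: read a CSV source into a petl table
table = etl.fromcsv("used_cars.csv")

# transform: keep a few columns and filter rows
table = etl.cut(table, "car_model", "year_of_manufacture", "price")
table = etl.select(table, lambda row: float(row["price"]) > 1000)

# load: write the result to a new CSV file
etl.tocsv(table, "used_cars_clean.csv")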

More information on Petl can be found here.

3) Pandas

Pandas is considered to be one of the most popular Python libraries for Data Manipulation and Analysis. Pandas makes use of data frames to hold the required data in memory. It allows users to write simple scripts that can help perform all the required ETL Using Python operations.

The biggest drawback of using Pandas is that it was designed primarily as a Data Analysis tool and hence, stores all data in memory to perform the required operations. This results in performance issues as the size of the dataset increases and is not considered to be suitable for Big Data applications.
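
A minimal sketch of an in-memory ETL flow with Pandas (the file and column names are assumptions):

import pandas as pd

# extract: read the source data into a DataFrame
df = pd.read_csv("used_cars.csv")

# transform: drop rows without a price and round the remaining prices
df = df.dropna(subset=["price"])
df["price"] = df["price"].round(2)

# load: write the result to the target file
df.to_csv("used_cars_clean.csv", index=False)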

More information on Pandas can be found here.

4) Apache Airflow

Apache Airflow is a Python-based Open-Source Workflow Management and Automation tool that was developed by Airbnb. Even though it is not an ETL tool itself, it can be used to set up ETL Using Python. It can be seen as an orchestration tool that can help users create, schedule, and monitor workflows. 

This means Apache Airflow can be used to create a data pipeline by consolidating various modules of your ETL Using Python process. 

Apache Airflow implements the concept of Directed Acyclic Graphs (DAGs). It also houses a browser-based dashboard that allows users to visualize workflows and track the execution of multiple workflows.

Apache Airflow is a good choice if you need to build a complex ETL workflow by consolidating various existing, independent modules, but it does not make much sense to use it for simple ETL using Python operations.
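
As a rough sketch (assuming Airflow 2.x; the DAG id, schedule, and task bodies are placeholders), an ETL workflow is expressed as a DAG of Python callables:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extract step")

def transform():
    print("transform step")

def load():
    print("load step")

with DAG(
    dag_id="dealership_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task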

More information on Apache Airflow can be found here.

5) PySpark

PySpark houses robust features that allow users to set up ETL using Python along with support for various other functionalities such as Data Streaming (Spark Streaming), Machine Learning (MLlib), SQL (Spark SQL), and Graph Processing (GraphX). 

One of the most significant advantages of using PySpark is the ability to process large volumes of data with ease. Hence, if your ETL requirements include creating a pipeline that can process Big Data easily and quickly, then PySpark is one of the best options available.
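
A minimal sketch of a PySpark batch job (the paths and column names are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dealership_etl").getOrCreate()

# extract: read a (potentially very large) set of CSV files
df = spark.read.option("header", True).csv("dealership_data/*.csv")

# transform: cast the price column to a number and round it
df = df.withColumn("price", F.round(F.col("price").cast("double"), 2))

# load: write the result out as Parquet
df.write.mode("overwrite").parquet("output/used_cars")

spark.stop()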

More information on PySpark can be found here.

6) Luigi

Luigi is an Open-Source Python-based ETL tool that was created by Spotify to handle workflows that process terabytes of data every day. It is considered to be one of the most sophisticated tools and houses various powerful features for creating complex ETL data pipelines. It also comes with a web dashboard that allows users to track all ETL jobs. Luigi is considered to be suitable for creating Enterprise-Level ETL pipelines.
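
A minimal sketch of a Luigi task chain (the file names and the transformation are assumptions):

import luigi
import pandas as pd

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("extracted.csv")

    def run(self):
        df = pd.read_csv("dealership_data/used_cars.csv")
        df.to_csv(self.output().path, index=False)

class Transform(luigi.Task):
    def requires(self):
        return Extract()

    def output(self):
        return luigi.LocalTarget("transformed.csv")

    def run(self):
        df = pd.read_csv(self.input().path)
        df["price"] = df["price"].round(2)
        df.to_csv(self.output().path, index=False)

if __name__ == "__main__":
    luigi.build([Transform()], local_scheduler=True)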

More information on Luigi can be found here.

7) Bonobo

Bonobo is a simple and lightweight ETL tool. Pipelines can be deployed quickly and executed in parallel. It can extract data from a variety of sources in formats such as CSV, JSON, XML, XLS, SQL, and others. Its transformations follow atomic, UNIX-style principles. One of the best aspects of Bonobo is that new users do not need to learn a new API; it is especially simple to use if you have prior experience with Python. It is also capable of handling semi-complex schemas. One of its most significant advantages is that it is open source and scalable.
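
A minimal sketch of a Bonobo graph, where each node is a plain Python generator or function (the sample rows are made up for illustration):

import bonobo

def extract():
    # yield rows one at a time into the pipeline
    yield "alto", 4253.728
    yield "swift", 7313.431

def transform(car_model, price):
    yield car_model, round(price, 2)

def load(car_model, price):
    print(car_model, price)

graph = bonobo.Graph(extract, transform, load)

if __name__ == "__main__":
    bonobo.run(graph)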

8) Beautiful Soup

Beautiful Soup is a well-known web scraping and parsing library used for data extraction. It provides tools for parsing hierarchical formats such as HTML and XML documents found on the web. It integrates with your preferred parser to provide idiomatic methods of navigating, searching, and modifying the parse tree. It frequently saves programmers hours or even days of work.
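
A minimal sketch of pulling tabular data out of an HTML page with Beautiful Soup (the URL and markup structure are assumptions):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/used-cars").text
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

print(rows)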

9) Odo

Odo is a Python tool that converts data from one format to another and offers high performance when loading large datasets into different destinations. It supports in-memory structures such as NumPy arrays, DataFrames, lists, and so on. Users should consider Odo if they want to create simple pipelines but need to load large CSV datasets. It also accepts data from sources other than Python, such as CSV/JSON/HDF5 files, SQL databases, data from remote machines, and the Hadoop File System.
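
Odo's core API is a single odo(source, target) call. A rough sketch (note that Odo is no longer actively maintained, and the file name and connection URI below are assumptions):

import pandas as pd
from odo import odo

# convert a CSV file into a pandas DataFrame
df = odo("used_cars.csv", pd.DataFrame)

# push the same data into a SQLite table in one call
odo(df, "sqlite:///dealership.db::used_cars")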

10) Pygrametl

Pygrametl is a Python framework for creating Extract-Transform-Load (ETL) processes. It is open-source and distributed under the terms of a two-clause BSD license. The ETL process is coded in Python by the developer when using Pygrametl. This is much more efficient than drawing the process in a graphical user interface (GUI) like Pentaho Data Integration.
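
A rough sketch of the Pygrametl style (assuming the dimension and fact tables already exist in the target database; the table and column names are made up for illustration):

import sqlite3
import pygrametl
from pygrametl.datasources import CSVSource
from pygrametl.tables import Dimension, FactTable

# wrap a PEP 249 connection to the data warehouse
conn = sqlite3.connect("dw.db")
connection = pygrametl.ConnectionWrapper(connection=conn)

car_dimension = Dimension(name="car", key="carid",
                          attributes=["car_model", "fuel"],
                          lookupatts=["car_model"])

sales_fact = FactTable(name="sales", keyrefs=["carid"],
                       measures=["price"])

for row in CSVSource(open("used_cars.csv"), delimiter=","):
    row["carid"] = car_dimension.ensure(row)
    sales_fact.insert(row)

connection.commit()
connection.close()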

Alternative Programming Languages for ETL

As we saw, Python is a very feasible choice for designing ETL tasks, but there are other languages that developers also use for ETL processes such as data ingestion and loading.

The languages are as follows:

1) Java

Java is a popular programming language, particularly for developing client-server web applications. Java has influenced other programming languages, including Python, and has spawned a number of branches, including Scala. Java serves as the foundation for several other big data tools, including Hadoop and Spark. The Java ecosystem also includes a library collection comparable to that of Python.

This programming language is designed so that developers can write code once and run it anywhere, regardless of the underlying computer architecture. This is known as write once, run anywhere (WORA).

2) Ruby

Ruby, like Python, is a scripting language that allows developers to create ETL pipelines, but there are few ETL-specific Ruby frameworks available to make the task easier. However, several libraries are currently in development, including Nokogiri, Kiba, and Square’s ETL package.

3) Go

Go, also known as Golang, is a programming language that is syntactically similar to C and is increasingly used for big data and distributed-systems applications. It was created to fill gaps in C++ and Java discovered while working with Google’s servers and distributed systems. Go includes several machine learning libraries, including support for Google’s TensorFlow, data pipeline libraries such as Apache Beam, and two ETL toolkits, Crunch and Pachyderm.

Conclusion

In this article, you have learned about setting up ETL using Python. This article also provided information on Python, its key features, different methods to set up ETL using Python scripts, the limitations of manually setting up ETL using Python, and the top 10 ETL using Python tools.

However, most businesses today have an extremely high volume of data with a very dynamic structure. Creating an ETL pipeline for such data from scratch is a complex process, since businesses will have to utilize a large amount of resources to build the pipeline and then ensure that it can keep up with the high data volume and schema variations. Businesses can instead use automated platforms like Hevo.

Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations with a few clicks.

Want to give Hevo a try? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You may also have a look at Hevo’s pricing, which will assist you in selecting the best plan for your requirements.

Share your experience of understanding setting up ETL using Python in the comment section below! We would love to hear your thoughts.

Former Research Analyst, Hevo Data

Manik has a keen interest in data and software architecture, and has a flair for writing highly technical content. He has experience writing articles on diverse topics related to data integration and infrastructure.
