It is common practice for most businesses today to rely on data-driven decision-making. Businesses collect a large volume of data that can be used to perform in-depth analysis of their customers and products, allowing them to plan future growth, product, and marketing strategies accordingly.
In order to perform a proper analysis, the first step is to create a Single Source of Truth for all their data. This means that data has to be extracted from all platforms they use and stored in a centralized database. Considering the volume of data most businesses collect today, this becomes a complicated task.
In this article, you will gain information about setting up ETL using Python. You will also gain a holistic understanding of Python, its key features, different methods to set up ETL using a Python script, the limitations of manually setting up ETL using Python, and the top 10 Python ETL tools. Read along to find out in-depth information about setting up ETL using Python.
What is Python?
Python is one of the most popular general-purpose programming languages. It was created by Guido van Rossum and first released in 1991. It can be used for a wide variety of applications such as Server-side Web Development, System Scripting, Data Science and Analytics, Software Development, etc.
Python is an Interactive, Interpreted, Object-Oriented programming language that incorporates Exceptions, Modules, Dynamic Typing, Dynamic Binding, Classes, High-level Dynamic Data Types, etc. It can also be used to make system calls to almost all well-known Operating Systems.
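As a brief illustration of these characteristics (the names below are arbitrary examples), the following snippet shows dynamic typing, a small class, and exception handling in action:
# a variable can be rebound to values of different types (dynamic typing)
value = 42
value = "now a string"

# a minimal class with a method
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return "Hello, {}!".format(self.name)

# exceptions are handled with try/except blocks
try:
    print(Greeter("Python").greet())
    print(1 / 0)
except ZeroDivisionError as error:
    print("caught: {}".format(error))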
Key Features of Python
Some of the most well-known features of Python are as follows:
- Free and Open-Source: Python is available free of cost for everyone and can easily be downloaded and installed from the official website. Open-Source means that the source code is openly available. This gives users with enough knowledge the ability to make changes to the code as per business use cases and product requirements.
- Easy to Code and Read: Python is considered to be a very beginner-friendly language and hence, most people with basic programming knowledge can easily learn the Python syntax in a few hours.
- High-Level: While using Python, developers do not need to have any information on the System Architecture or manage memory usage manually. All this is automatically handled by the Python Interpreter.
- Portable: A Python code written on one system can easily be transferred to another system and can run without any issues.
- Interpreted: Python code is processed by the Interpreter at runtime. This means that users do not need to compile the code before running it, as they would with languages such as Java or C/C++.
- Object-Oriented: Python also has support for the Object-Oriented Programming Paradigm which allows users to write readable and reusable code.
What is ETL?
ETL stands for Extract, Transform and Load. It can be defined as the process that allows businesses to create a Single Source of Truth of their data for Online Analytical Processing. Businesses use multiple platforms to perform their day-to-day operations, which means their data is spread across the databases of all the platforms they use. These platforms could be Customer Relationship Management (CRM) systems such as HubSpot, Salesforce, etc., or Digital Marketing Platforms such as Facebook Ads, Instagram Ads, etc. You can check our article about Salesforce ETL tools.
For a useful analysis to be performed, the data from all these platforms first has to be integrated and stored in a centralized location. This process of extracting data from all these platforms, transforming it into a form suitable for analysis, and then loading it into a Data Warehouse or desired destination is called ETL (Extract, Transform, Load). Today, numerous ETL platforms are available in the market. Hevo Data, a No-Code Data Pipeline, is one such ETL tool that will automate your ETL process in a matter of minutes.
Setting Up ETL Using Python Script
This section will help you understand how you can set up a simple data pipeline that extracts data from MySQL, Microsoft SQL Server, and Firebird databases and loads it into a Microsoft SQL Server database. It will also give you a basic idea of how easy it is to set up ETL Using Python.
The same process can also be used to implement a custom script based on your requirements by making changes to the databases being used and queries accordingly. In order to set up ETL using a Python script, the following steps have to be implemented:
ETL Using Python Step 1: Installing Required Modules
Three modules are required to set up ETL using Python for the above-mentioned data sources: mysql-connector-python for MySQL, pyodbc for Microsoft SQL Server, and fdb for Firebird. They can be installed with pip by running the following commands in the Command Prompt:
pip install mysql-connector-python
pip install pyodbc
pip install fdb
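Optionally, you can verify that all three connectors were installed correctly with a quick import check (not part of the original steps, just a sanity test):
python -c "import mysql.connector, pyodbc, fdb; print('All required modules are installed')"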
ETL Using Python Step 2: Setting Up ETL Directory
The following files should be created in your project directory in order to set up ETL Using Python:
- db_credentials.py: Should have all data that is required to establish connections with all databases. For example, Database Password, Port Number, etc.
- sql_queries.py: Should have all the commonly used database queries to extract and load data, stored in string format.
- etl.py: Performs all necessary operations to connect to the databases and run the required queries.
- main.py: Responsible for maintaining the flow of the operations and performing the necessary operations in a specific order.
A) db_credentials.py
This file is required for setting up all Source and Target Database Connection Strings in the process to set up ETL using Python. This file should have all the required information to access the appropriate database in a list format so that it can be iterated easily when required.
The format for this file is as follows:
datawarehouse_name = 'your_datawarehouse_name'

# sql-server (target db, datawarehouse)
datawarehouse_db_config = {
    'Trusted_Connection': 'yes',
    'driver': '{SQL Server}',
    'server': 'datawarehouse_sql_server',
    'database': '{}'.format(datawarehouse_name),
    'user': 'your_db_username',
    'password': 'your_db_password',
    'autocommit': True,
}

# sql-server (source db)
sqlserver_db_config = [
    {
        'Trusted_Connection': 'yes',
        'driver': '{SQL Server}',
        'server': 'your_sql_server',
        'database': 'db1',
        'user': 'your_db_username',
        'password': 'your_db_password',
        'autocommit': True,
    }
]

# mysql (source db)
mysql_db_config = [
    {
        'user': 'your_user_1',
        'password': 'your_password_1',
        'host': 'db_connection_string_1',
        'database': 'db_1',
    },
    {
        'user': 'your_user_2',
        'password': 'your_password_2',
        'host': 'db_connection_string_2',
        'database': 'db_2',
    },
]

# firebird (source db)
fdb_db_config = [
    {
        'dsn': "/your/path/to/source.db",
        'user': "your_username",
        'password': "your_password",
    }
]
B) sql_queries.py
This file contains queries that can be used to perform the required operations to extract data from the Source Databases and load it into the Target Database in the process to set up ETL using Python.
The format for this file is as follows:
# example queries; these will differ across database platforms
firebird_extract = ('''
SELECT fbd_column_1, fbd_column_2, fbd_column_3
FROM fbd_table;
''')
firebird_insert = ('''
INSERT INTO table (column_1, column_2, column_3)
VALUES (?, ?, ?)
''')
firebird_extract_2 = ('''
SELECT fbd_column_1, fbd_column_2, fbd_column_3
FROM fbd_table_2;
''')
firebird_insert_2 = ('''
INSERT INTO table_2 (column_1, column_2, column_3)
VALUES (?, ?, ?)
''')
sqlserver_extract = ('''
SELECT sqlserver_column_1, sqlserver_column_2, sqlserver_column_3
FROM sqlserver_table
''')
sqlserver_insert = ('''
INSERT INTO table (column_1, column_2, column_3)
VALUES (?, ?, ?)
''')
mysql_extract = ('''
SELECT mysql_column_1, mysql_column_2, mysql_column_3
FROM mysql_table
''')
mysql_insert = ('''
INSERT INTO table (column_1, column_2, column_3)
VALUES (?, ?, ?)
''')
# exporting queries
class SqlQuery:
    def __init__(self, extract_query, load_query):
        self.extract_query = extract_query
        self.load_query = load_query

# create instances of the SqlQuery class
fbd_query = SqlQuery(firebird_extract, firebird_insert)
fbd_query_2 = SqlQuery(firebird_extract_2, firebird_insert_2)
sqlserver_query = SqlQuery(sqlserver_extract, sqlserver_insert)
mysql_query = SqlQuery(mysql_extract, mysql_insert)

# store as lists for iteration
fbd_queries = [fbd_query, fbd_query_2]
sqlserver_queries = [sqlserver_query]
mysql_queries = [mysql_query]
C) etl.py
This file should contain all the code that establishes connections to the correct databases and runs the required queries in order to set up ETL using Python.
The format for this file is as follows:
# python modules
import mysql.connector
import pyodbc
import fdb

from db_credentials import datawarehouse_name

def etl(query, source_cnx, target_cnx):
    # extract data from source db
    source_cursor = source_cnx.cursor()
    source_cursor.execute(query.extract_query)
    data = source_cursor.fetchall()
    source_cursor.close()

    # load data into warehouse db
    if data:
        target_cursor = target_cnx.cursor()
        target_cursor.execute("USE {}".format(datawarehouse_name))
        target_cursor.executemany(query.load_query, data)
        print('data loaded to warehouse db')
        target_cursor.close()
    else:
        print('data is empty')

def etl_process(queries, target_cnx, source_db_config, db_platform):
    # establish source db connection
    if db_platform == 'mysql':
        source_cnx = mysql.connector.connect(**source_db_config)
    elif db_platform == 'sqlserver':
        source_cnx = pyodbc.connect(**source_db_config)
    elif db_platform == 'firebird':
        source_cnx = fdb.connect(**source_db_config)
    else:
        raise ValueError('Error! unrecognised db platform: {}'.format(db_platform))

    # loop through sql queries
    for query in queries:
        etl(query, source_cnx, target_cnx)

    # close the source db connection
    source_cnx.close()
D) main.py
The code in this file is responsible for iterating through the credentials, connecting to each database, and performing the required ETL using Python operations.
The format for this file is as follows:
# variables
from db_credentials import datawarehouse_db_config, sqlserver_db_config, mysql_db_config, fdb_db_config
from sql_queries import fbd_queries, sqlserver_queries, mysql_queries

# methods
from etl import etl_process
import pyodbc

def main():
    print('starting etl')

    # establish connection to the target database (sql-server)
    target_cnx = pyodbc.connect(**datawarehouse_db_config)

    # loop through credentials
    # mysql
    for config in mysql_db_config:
        try:
            print("loading db: " + config['database'])
            etl_process(mysql_queries, target_cnx, config, 'mysql')
        except Exception as error:
            print("etl for {} has error".format(config['database']))
            print('error message: {}'.format(error))
            continue

    # sql-server
    for config in sqlserver_db_config:
        try:
            print("loading db: " + config['database'])
            etl_process(sqlserver_queries, target_cnx, config, 'sqlserver')
        except Exception as error:
            print("etl for {} has error".format(config['database']))
            print('error message: {}'.format(error))
            continue

    # firebird
    for config in fdb_db_config:
        try:
            print("loading db: " + config['dsn'])
            etl_process(fbd_queries, target_cnx, config, 'firebird')
        except Exception as error:
            print("etl for {} has error".format(config['dsn']))
            print('error message: {}'.format(error))
            continue

    target_cnx.close()

if __name__ == "__main__":
    main()
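With all four files in the same directory and the credentials and queries filled in for your environment, the pipeline can then be run as an ordinary Python script:
python main.py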
Python ETL Example
ETL is the process of extracting a huge amount of data from a wide array of sources and formats and then converting & consolidating it into a single format before storing it in a database or writing it to a destination file.
In this example, some of the data is stored in CSV files while others are in JSON files. All of this data has to be consolidated into a single format and then stored in a unified file location.
Step 1: Import the modules and functions
In this ETL using Python example, first, you need to import the required modules and functions.
import glob
import pandas as pd
import xml.etree.ElementTree as ET
from datetime import datetime
- The dealership_data directory contains CSV, JSON, and XML files of used car data. The features incorporated here are car_model, year_of_manufacture, price, and fuel. So you need to extract the data from the raw files, transform it, and then load it into the target output file.
Download the source file:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/datasource.zip
Extracting the zip file:
!unzip datasource.zip -d dealership_data
Setting the path for Target files:
tmpfile = "dealership_temp.tmp" # store all extracted data
logfile = "dealership_logfile.txt" # all event logs will be stored
targetfile = "dealership_transformed_data.csv" # transformed data is stored
Step 2: Extract
The Extract function in this ETL using Python example is used to extract a huge amount of data in batches. This data is extracted from numerous sources.
CSV Extract Function:
def extract_from_csv(file_to_process):
    dataframe = pd.read_csv(file_to_process)
    return dataframe
JSON Extract Function
def extract_from_json(file_to_process):
    dataframe = pd.read_json(file_to_process, lines=True)
    return dataframe
XML Extract Function
def extract_from_xml(file_to_process):
    rows = []
    tree = ET.parse(file_to_process)
    root = tree.getroot()
    for person in root:
        rows.append({
            "car_model": person.find("car_model").text,
            "year_of_manufacture": int(person.find("year_of_manufacture").text),
            "price": float(person.find("price").text),
            "fuel": person.find("fuel").text,
        })
    # build the dataframe in one step (DataFrame.append was removed in recent pandas versions)
    dataframe = pd.DataFrame(rows, columns=['car_model', 'year_of_manufacture', 'price', 'fuel'])
    return dataframe
Calling the Extract Function:
def extract():
    frames = []

    # for csv files
    for csvfile in glob.glob("dealership_data/*.csv"):
        frames.append(extract_from_csv(csvfile))

    # for json files
    for jsonfile in glob.glob("dealership_data/*.json"):
        frames.append(extract_from_json(jsonfile))

    # for xml files
    for xmlfile in glob.glob("dealership_data/*.xml"):
        frames.append(extract_from_xml(xmlfile))

    # combine all the extracted data into a single dataframe
    extracted_data = pd.concat(frames, ignore_index=True)
    return extracted_data
Step 3: Transform
Using the transform function, you can convert the extracted data into the format your analysis needs. Here, the price column is simply rounded to two decimal places.
def transform(data):
    data['price'] = round(data.price, 2)
    return data
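In practice, the transform step usually does more than rounding. The hedged sketch below (which assumes the same car dataset and a made-up USD-to-EUR conversion rate) shows a few common operations: normalizing text, deriving a new column, and dropping duplicates:
def transform(data):
    # round prices to two decimal places
    data['price'] = round(data.price, 2)
    # normalize text values, e.g. 'Petrol ' and 'petrol' both become 'petrol'
    data['fuel'] = data['fuel'].str.strip().str.lower()
    # derive a new column (hypothetical conversion rate of 1 USD = 0.9 EUR)
    data['price_eur'] = data['price'] * 0.9
    # remove exact duplicate rows
    data = data.drop_duplicates()
    return data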
Step 4: Loading and Logging
In this step, the data is loaded into the destination file. A logging function is also defined so that each phase of the ETL job can be recorded.
Load Function:
def load(targetfile, data_to_load):
    data_to_load.to_csv(targetfile)
Log Function:
def log(message):
    timestamp_format = '%H:%M:%S-%b-%d-%Y'  # Hour:Minute:Second-MonthName-Day-Year
    now = datetime.now()  # get current timestamp
    timestamp = now.strftime(timestamp_format)
    with open("dealership_logfile.txt", "a") as f:
        f.write(timestamp + ',' + message + '\n')
Step 5: Running ETL Process
The log indicates that you have started the ETL process.
log("ETL Job Started")
The log indicates that you have started and ended the Extract phase.
log("Extract phase Started")
extracted_data = extract()
log("Extract phase Ended")
The log indicates that you have started and ended the Transform phase.
log("Transform phase Started")
transformed_data = transform(extracted_data)
log("Transform phase Ended")
The log indicates that you have started and ended the Load phase.
log("Load phase Started")
load(targetfile,transformed_data)
log("Load phase Ended")
The log indicates that the ETL process has ended.
log("ETL Job Ended")
Limitations of Manually Setting Up ETL Using Python
The examples in the previous sections perform only basic Extract and Load operations with, at most, a trivial transformation. In a real-life situation, the operations would be far more complex and dynamic, requiring transformations such as mathematical calculations, denormalization, etc.
Manually programming each of these ETL processes and workflows whenever you wish to set up ETL using Python would require immense engineering bandwidth. A large number of tools can make this process considerably easier than manual implementation. To automate the process of setting up ETL using Python, Hevo Data, an automated No-Code Data Pipeline, can help you load data from your desired sources in a hassle-free manner.
There are a large number of Python ETL tools that will help you automate your ETL processes and workflows, thus making your experience seamless.
Some of the best tools that can make ETL Using Python easier are as follows:
1) Hevo
Hevo allows you to replicate data in near real-time from 150+ sources to the destination of your choice, including Snowflake, BigQuery, Redshift, Databricks, and Firebolt, without writing a single line of code. Finding patterns and opportunities is easier when you don’t have to worry about maintaining the pipelines. So, with Hevo as your data pipeline platform, maintenance is one less thing to worry about.
For the rare times things do go wrong, Hevo ensures zero data loss. To find the root cause of an issue, Hevo also lets you monitor your workflow so that you can address the issue before it derails the entire workflow. Add 24*7 customer support to the list, and you get a reliable tool that puts you at the wheel with greater visibility. Check Hevo’s in-depth documentation to learn more.
If you don’t want SaaS tools with unclear pricing that burn a hole in your pocket, opt for a tool that offers a simple, transparent pricing model. Hevo has 3 usage-based pricing plans starting with a free tier, where you can ingest up to 1 million records.
Hevo was the most mature Extract and Load solution available, along with Fivetran and Stitch but it had better customer service and attractive pricing. Switching to a Modern Data Stack with Hevo as our go-to pipeline solution has allowed us to boost team collaboration and improve data reliability, and with that, the trust of our stakeholders on the data we serve.
– Juan Ramos, Analytics Engineer, Ebury
Check out how Hevo empowered Ebury to build reliable data products here.
Sign up here for a 14-Day Free Trial!
2) Petl
Petl (Python ETL) is one of the simplest tools that allows its users to set up ETL Using Python. It can be used to import data from numerous data sources such as CSV, XML, JSON, XLS, etc. It also houses support for simple transformations such as Row Operations, Joining, Aggregations, Sorting, etc.
However, Petl is not capable of performing any sort of Data Analytics and experiences performance issues with large datasets. Hence, it is considered to be suitable for only simple ETL Using Python operations that do not require complex transformations or analysis.
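As a rough illustration of the Petl style (the file and column names below are placeholders), a simple pipeline reads a CSV file, converts a column, filters rows, and writes the result back out:
import petl as etl

# read a source CSV file (placeholder file name)
table = etl.fromcsv('sales.csv')

# convert the 'amount' column to float and keep only large orders
table = etl.convert(table, 'amount', float)
table = etl.select(table, lambda row: row.amount > 100)

# write the transformed table to a new CSV file
etl.tocsv(table, 'large_sales.csv')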
More information on Petl can be found here.
3) Pandas
Pandas is considered to be one of the most popular Python libraries for Data Manipulation and Analysis. Pandas makes use of data frames to hold the required data in memory. It allows users to write simple scripts that can help perform all the required ETL Using Python operations.
The biggest drawback of using Pandas is that it was designed primarily as a Data Analysis tool and hence, stores all data in memory to perform the required operations. This results in performance issues as the size of the dataset increases and is not considered to be suitable for Big Data applications.
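A minimal sketch of the Pandas approach (file, column, and table names below are placeholders) reads the data into a DataFrame, applies transformations, and writes it to a destination:
import pandas as pd
from sqlalchemy import create_engine

# extract: read a source CSV file into memory (placeholder file name)
df = pd.read_csv('orders.csv')

# transform: clean a column and derive a new one
df['customer'] = df['customer'].str.strip()
df['total'] = df['quantity'] * df['unit_price']

# load: write the result to a SQLite database (placeholder connection string)
engine = create_engine('sqlite:///warehouse.db')
df.to_sql('orders_clean', engine, if_exists='replace', index=False)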
More information on Pandas can be found here.
4) Apache Airflow
Apache Airflow is a Python-based Open-Source Workflow Management and Automation tool that was developed by Airbnb. Even though it is not an ETL tool itself, it can be used to set up ETL Using Python. It can be seen as an orchestration tool that can help users create, schedule, and monitor workflows.
This means Apache Airflow can be used to create a data pipeline by consolidating various modules of your ETL Using Python process.
Apache Airflow implements the concept of Directed Acyclic Graph (DAG). It also houses a browser-based dashboard that allows users to visualize workflows and track the execution of multiple workflows.
Apache Airflow is a good choice if you need to create a complex ETL workflow by consolidating various existing and independent modules, but it does not make much sense to use it for simple ETL using Python operations.
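The hedged sketch below shows what a minimal DAG might look like on Airflow 2.x (the dag_id, schedule, and task are placeholders, and parameter names vary slightly across Airflow versions); the placeholder task simply calls a Python function standing in for your ETL logic:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_etl():
    # placeholder for your extract/transform/load logic
    print("running ETL")

with DAG(
    dag_id="simple_etl_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    etl_task = PythonOperator(
        task_id="run_etl",
        python_callable=run_etl,
    )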
More information on Apache Airflow can be found here.
5) PySpark
PySpark houses robust features that allow users to set up ETL using Python along with support for various other functionalities such as Data Streaming (Spark Streaming), Machine Learning (MLlib), SQL (Spark SQL), and Graph Processing (GraphX).
One of the most significant advantages of using PySpark is the ability to process large volumes of data with ease. Hence, if your ETL requirements include creating a pipeline that can process Big Data easily and quickly, then PySpark is one of the best options available.
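A minimal PySpark sketch (paths and column names below are placeholders) that reads a large CSV file, applies a transformation, and writes the result as Parquet might look like this:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("simple_etl").getOrCreate()

# extract: read a (potentially very large) CSV file
df = spark.read.csv("s3://your-bucket/raw/orders.csv", header=True, inferSchema=True)

# transform: filter rows and derive a column (hypothetical conversion rate)
df = df.filter(F.col("amount") > 0).withColumn("amount_eur", F.col("amount") * 0.9)

# load: write the result in a columnar format
df.write.mode("overwrite").parquet("s3://your-bucket/clean/orders/")

spark.stop()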
More information on PySpark can be found here.
6) Luigi
Luigi is an Open-Source Python-based ETL tool that was created by Spotify to handle workflows that process terabytes of data every day. It is considered to be one of the most sophisticated tools and houses various powerful features for creating complex ETL data pipelines. It also comes with a web dashboard that allows users to track all ETL jobs. Luigi is considered suitable for creating Enterprise-Level ETL pipelines.
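A rough sketch of the Luigi model is shown below (file names and task logic are placeholders); each step is a Task with explicit dependencies and outputs, and Luigi re-runs only the tasks whose outputs are missing:
import luigi

class ExtractData(luigi.Task):
    def output(self):
        return luigi.LocalTarget("raw_data.csv")

    def run(self):
        # placeholder extract step: write some raw data
        with self.output().open("w") as f:
            f.write("id,value\n1,10\n2,20\n")

class TransformData(luigi.Task):
    def requires(self):
        return ExtractData()

    def output(self):
        return luigi.LocalTarget("clean_data.csv")

    def run(self):
        # placeholder transform step: lowercase every line
        with self.input().open() as src, self.output().open("w") as dst:
            for line in src:
                dst.write(line.lower())

if __name__ == "__main__":
    luigi.build([TransformData()], local_scheduler=True)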
More information on Luigi can be found here.
7) Bonobo
Bonobo is a simple and lightweight ETL tool. Pipelines can be deployed quickly and run in parallel. It can extract data from a variety of sources in formats such as CSV, JSON, XML, XLS, SQL, and others. Each transformation adheres to atomic UNIX principles. One of the best aspects of Bonobo is that new users do not need to learn a new API; it is especially simple to use if you have prior experience with Python. It is also capable of handling semi-complex schemas. Another significant advantage is that it is open-source and scalable.
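A heavily simplified sketch of the Bonobo style (the functions and data below are placeholders; real projects usually add services and configuration) chains plain Python functions into a graph:
import bonobo

def extract():
    # yield values one at a time (placeholder data)
    yield "alice"
    yield "bob"

def transform(name):
    yield name.title()

def load(name):
    print(name)

graph = bonobo.Graph(extract, transform, load)

if __name__ == "__main__":
    bonobo.run(graph)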
8) Beautiful Soup
Beautiful Soup is a well-known web scraping and parsing library for data extraction. It provides tools for parsing hierarchical formats, such as HTML and XML documents found on the web. It integrates with your preferred parser to provide idiomatic methods of navigating, searching, and modifying the parse tree, and it frequently saves programmers hours or even days of work.
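For instance, a small hedged sketch (with a made-up HTML fragment) that pulls table cells out of a page might look like this:
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>car_model</td><td>price</td></tr>
  <tr><td>alto 800</td><td>4253.73</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# collect the text of every table cell, row by row
rows = [[cell.get_text() for cell in tr.find_all("td")] for tr in soup.find_all("tr")]
print(rows)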
9) Odo
Odo is a Python tool that converts data from one format to another and offers high performance when loading large datasets. It works with in-memory structures such as NumPy arrays, pandas DataFrames, lists, and so on. Users should consider Odo if they want to create simple pipelines but need to load large CSV datasets. It also accepts data from sources outside Python, such as CSV/JSON/HDF5 files, SQL databases, data on remote machines, and the Hadoop File System.
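Odo’s interface is essentially a single function, odo(source, target). The hedged sketch below uses placeholder file and table names; note that the project has not been actively developed for some time, so compatibility with recent library versions may vary:
import pandas as pd
from odo import odo

# convert a CSV file directly into a pandas DataFrame (placeholder file name)
df = odo('accounts.csv', pd.DataFrame)

# or load the same CSV straight into a SQL table (placeholder SQLite URI)
odo('accounts.csv', 'sqlite:///warehouse.db::accounts')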
10) Pygrametl
Pygrametl is a Python framework for creating Extract-Transform-Load (ETL) processes. It is open-source and distributed under the terms of a two-clause BSD license. The ETL process is coded in Python by the developer when using Pygrametl. This is much more efficient than drawing the process in a graphical user interface (GUI) like Pentaho Data Integration.
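The sketch below hints at the Pygrametl style under several assumptions: a SQLite target for simplicity, placeholder table and column names, and dimension/fact tables that already exist in the target schema. Treat it as illustrative rather than exact:
import sqlite3
import pygrametl
from pygrametl.tables import Dimension, FactTable

# wrap a PEP 249 connection so pygrametl can use it (SQLite used here for simplicity)
conn = sqlite3.connect("dw.db")
connection = pygrametl.ConnectionWrapper(connection=conn)

# declare a dimension and a fact table that already exist in the target schema
product_dim = Dimension(name="product", key="product_id", attributes=["name", "category"])
sales_fact = FactTable(name="sales", keyrefs=["product_id"], measures=["amount"])

# ensure() inserts the dimension row only if it is not already present and returns its key
row = {"name": "widget", "category": "tools", "amount": 42}
row["product_id"] = product_dim.ensure(row)
sales_fact.insert(row)

connection.commit()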
Alternative Programming Languages for ETL
As we have seen, Python is a very feasible choice for designing ETL tasks, but there are other languages that developers also use for ETL processes such as data ingestion and loading.
The languages are as follows:
1) Java
Java is a popular programming language, particularly for developing client-server web applications. Java has influenced other programming languages, including Python, and has spawned a number of branches, including Scala. Java serves as the foundation for several other big data tools, including Hadoop and Spark. The Java ecosystem also includes a library collection comparable to that of Python.
This programming language is designed in such a way that developers can write code anywhere and run it anywhere, regardless of the underlying computer architecture. It is also known as write once, run anywhere (WORA).
2) Ruby
Ruby, like Python, is a scripting language that allows developers to create ETL pipelines, but there are few ETL-specific Ruby frameworks available to make the task easier. However, several libraries are currently in development, including Nokogiri, Kiba, and Square’s ETL package.
3) Go
Go, also known as Golang, is a compiled programming language with C-like syntax that is increasingly used for data analysis and big data applications. It was created to fill gaps in C++ and Java discovered while working with Google’s servers and distributed systems. Go includes several machine learning libraries, support for Google’s TensorFlow, data pipeline libraries such as Apache Beam, and ETL toolkits such as Crunch and Pachyderm.
Conclusion
In this article, you have learned about setting up ETL using Python. This article also provided information on Python, its key features, different methods to set up ETL using a Python script, the limitations of manually setting up ETL using Python, and the top 10 Python ETL tools.
Most businesses today however have an extremely high volume of data with a very dynamic structure. Creating an ETL pipeline for such data from scratch is a complex process since businesses will have to utilize a high amount of resources in creating this pipeline and then ensure that it is able to keep up with the high data volume and Schema variations. Businesses can instead use automated platforms like Hevo.
Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations with a few clicks.
Visit our Website to Explore Hevo
Hevo Data with its strong integration with 150+ Data Sources (including 50+ Free Sources) allows you to not only export data from your desired data sources & load it to the destination of your choice but also transform & enrich your data to make it analysis-ready. Hevo also allows integrating data from non-native sources using Hevo’s in-built REST API & Webhooks Connector. You can then focus on your key business needs and perform insightful analysis using BI tools.
Want to give Hevo a try? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You may also have a look at our pricing, which will assist you in selecting the best plan for your requirements.
Share your experience of understanding setting up ETL using Python in the comment section below! We would love to hear your thoughts.