Want to connect APIs, SQL, and flat files in one ETL script?

If yes, keep reading!

While standard ETL pipelines are effective, Python becomes crucial when data teams require more control, extensibility, and customization.

With Python’s extensive library ecosystem, you can build end-to-end ETL pipelines to extract data from REST APIs, relational databases, and flat files. It helps you build a unified workflow that scales from simple prototypes to production-grade systems.

In this article, we’ll discuss Python’s role in ETL workflows, its use cases, and best practices for building ETL using Python. We have also listed key tools and libraries for Python-led ETL to accelerate development.

Let’s get started!

What is ETL?

[Image: The ETL process]

ETL with Python refers to using Python as the primary language to extract data from diverse sources, transform it into analytics-ready data, and load it into the preferred target destination.

A typical ETL workflow with Python looks like this:

  • Extract data from sources using requests for APIs, pymysql for SQL databases, and pandas for CSV files.
  • Clean, standardize, and restructure the raw data using Python’s data manipulation libraries.
  • Load the data into the target destination (a SQL data warehouse or MongoDB) using connectors such as SQLAlchemy and pymongo, as sketched below.
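To make this concrete, here is a minimal end-to-end sketch. It assumes a hypothetical REST endpoint (https://api.example.com/orders) that returns a JSON list of records and loads the result into a local SQLite database; swap in your own sources, credentials, and destination.

import pandas as pd
import requests
from sqlalchemy import create_engine

# Extract: pull JSON records from a (hypothetical) REST endpoint
response = requests.get("https://api.example.com/orders")
df = pd.DataFrame(response.json())

# Transform: drop incomplete rows and normalize column names
df = df.dropna()
df.columns = [c.lower().strip() for c in df.columns]

# Load: write to a local SQLite database via SQLAlchemy
engine = create_engine("sqlite:///warehouse.db")
df.to_sql("orders", engine, if_exists="replace", index=False)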

    Why Should You Use Python for ETL Processes?

    Python has become a go-to choice for data engineers because of its broad range of functionalities. Let’s explore how it is benefiting organizations:

    1. Vast ecosystem

As discussed earlier, Python provides extensive libraries to expedite ETL pipelines, such as:

• NumPy and pandas for data manipulation.
• SQLAlchemy or sqlite3 for database connectivity.
    • requests for API interaction.

    Due to its vast ecosystem, data engineers are able to simplify complex data aggregation and integration tasks, boosting pipeline efficiency.

    2. Readability

    Python syntax is designed to be intuitive and close to natural language, which makes the code easy to write, read, and maintain even for new users. This reduces the barrier to entry:

    • The easy syntax helps analysts quickly identify logic errors and modify pipelines.
    • The syntax simplicity fosters effective collaboration on major ETL projects.
    • Emphasis on clean, readable code simplifies maintenance with evolving technical requirements.

    The end benefit is that enhanced readability enables quick prototyping and iterative development for faster optimization of ETL logic.

    3. Customization

    Unlike traditional ETL, Python-driven ETL workflows can be modularized to meet specific business requirements. Given its high customizability, Python scripts can be implemented for:

    • Non-standard data sources
    • Complex business logic
    • Intricate data relationships
    • Additional processing steps

This enhanced flexibility allows you to add new elements without re-architecting the entire pipeline.

    4. Automation & Orchestration

You can automate Python-based ETL jobs and orchestrate their execution order with scheduling tools, as sketched below.

• Schedule jobs with cron or with Python libraries such as schedule and subprocess.
• Trigger event-based workflows based on timestamps and upstream job status.
• Add monitoring, logging, and alerting using Python’s logging module.
    • Integrate pipelines with orchestrators like Apache Airflow, Prefect, or Luigi.
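As a simple illustration, the third-party schedule library can trigger an ETL function on a fixed cadence. The run_etl() callable below is a placeholder for your own extract-transform-load steps; once dependencies grow, an orchestrator like Airflow is the better fit.

import time
import schedule  # third-party: pip install schedule

def run_etl():
    # Placeholder for the extract -> transform -> load steps
    print("ETL job executed")

# Run the job every day at 02:00; adjust the cadence to your data volume
schedule.every().day.at("02:00").do(run_etl)

while True:
    schedule.run_pending()
    time.sleep(60)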

    5. Strong community & documentation

    Python’s ETL ecosystem is backed by an active programming community. The key benefits of the community-driven support include:

    • Many ETL libraries (like pandas, Airflow, boto3, etc.) are open-source with active contributors and frequent updates.
• Developers contribute via GitHub by adding new features, fixing bugs, and creating integrations.
    • Python-based ETL tools provide detailed documentation and user guides for smooth operation.

    An added advantage is the multilingual support and networking opportunities via chat forums and online communities.


    What are the Steps to Use Python Script for ETL?

    Integrating Python with ETL workflows requires a series of technical steps to develop a scalable system. Here’s a step-by-step guide to using a Python script for ETL:

    1. Define the data source and data warehouse

    • Data sources: Identify the location of the data you want to extract. The location or the data source can include SQL/NoSQL databases, JSON, XML, and APIs.
    • Data destination: Figure out where you want to load the processed data. The destination can include data warehouses (like Snowflake or Redshift), relational databases (PostgreSQL), or data lakes.

    2. Design the data flow

    • Map the data journey: Clearly list all data sources and target destinations. Document the data route from each source to the destination throughout the ETL process to clarify dependencies.
    • Performance: Based on the data volume, decide if data will be processed in batches or as a real-time stream. Identify opportunities for parallel extraction and transformation to accelerate the pipeline.
    • Documentation: Use flowcharts with input, transform, and output blocks to visualize the ETL process. Document every step and transformation logic as a reference for testing and future design upgrades.

    3. Deploy Python

• Install Python: Install the latest Python version from the official Python website or a trusted package manager, such as choco for Windows, brew for macOS, or apt for Linux.
    • Connector installation: Install common Python connectors using pip, like:
      • MySQL connector: pip install mysql-connector-python
      • Microsoft SQL Server Connector: pip install pyodbc
      • Firebird Connector: pip install fdb
    • Directory: Structure the project for clarity and maintenance. A typical structure looks like this:
      • config/: Configuration files (e.g., database credentials, API keys)
      • src/: Source code and ETL pipeline scripts
      • utils/: Utility functions (e.g., connectors, transformations)
      • test/: Unit and integration tests
      • run.py: Main entry point to execute pipelines

      4. Extract data from sources

      With Python, you can establish a connection with various databases, execute queries, and extract raw data. Here are the types of data sources it supports:

      • Relational databases: Like PostgreSQL, SQL Server, and Oracle
      • NoSQL databases: Such as MongoDB, Cassandra, or Redis
      • Web APIs: RESTful or SOAP APIs using HTTP requests
      • Flat files: Common examples are CSV, Excel, JSON, and XML
      • Cloud storage: AWS S3, Google Cloud Storage, or Azure Blob Storage

      For example, extract data from:

• CSV files:

import pandas as pd
data = pd.read_csv('your_csv_file.csv')

• APIs:

import requests
response = requests.get('https://api.example.com/data')
data = response.json()
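• Relational databases (a sketch with SQLAlchemy; the connection string and table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; requires a driver such as psycopg2
engine = create_engine('postgresql://user:password@localhost:5432/sales_db')
data = pd.read_sql('SELECT * FROM orders', engine)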

      5. Transform the data

      Convert the raw data into a structured, clean, and usable format before loading it into the destination.

      • Standardization: Remove null values, eliminate duplicate records, and convert data types using:
        • pandas.DataFrame.dropna()
        • pandas.DataFrame.drop_duplicates()
        • pd.to_datetime()
      • Filtering: Apply conditions to filter out rows and aggregate functions to summarize data by categories. For example, filter out rows where sales are below a specified target.
      • Mapping: Combine data from multiple sources using merge() or join() operations and apply custom logic to align datasets with specific requirements.
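Putting these steps together, here is a small illustrative sketch on a hypothetical sales DataFrame (the column names and sales target are made up for the example):

import pandas as pd

# Hypothetical raw sales data
df = pd.DataFrame({
    'region': ['North', 'North', 'South', None],
    'sales': ['100.5', '100.5', '80', '95'],
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-03'],
})

df = df.dropna()                               # standardization: remove nulls
df = df.drop_duplicates()                      # eliminate duplicate records
df['sales'] = pd.to_numeric(df['sales'])       # convert data types
df['date'] = pd.to_datetime(df['date'])
df = df[df['sales'] >= 90]                     # filtering: keep rows meeting the target
summary = df.groupby('region')['sales'].sum()  # aggregation by category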

      6. Load the data

      Load the processed data into the target destination for further analytics.

      • SQL databases: Utilize libraries such as SQLAlchemy or pyodbc to establish a connection to the database. The pandas.DataFrame.to_sql() method allows you to write a DataFrame directly to a SQL table.
      • NoSQL database: Use suitable libraries, such as pymongo for MongoDB, to insert records.
      • Cloud storage: Use cloud SDKs like boto3 (AWS), google-cloud-storage (GCP), or azure-storage-blob (Azure) to upload files.
      • Flat files: Use pandas to write DataFrames to files.
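For example, writing a DataFrame to a SQL table with SQLAlchemy takes only a few lines; the connection string and table name below are placeholders:

import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({'order_id': [1, 2], 'amount': [120.5, 80.0]})  # stand-in for transformed data

# Hypothetical PostgreSQL target; replace with your own connection details
engine = create_engine('postgresql://user:password@localhost:5432/warehouse')
df.to_sql('clean_orders', engine, if_exists='append', index=False)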

      7. Orchestrate the pipeline

• Automation: Use cron on Linux or workflow orchestrators like Apache Airflow to automate ETL pipelines (see the sketch after this list).
      • Management: Schedule ETL jobs in the correct order as per the data volume and monitor dependencies throughout the ETL workflow.
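A minimal Airflow sketch of this wiring (assuming Airflow 2.x; the callables are placeholders for your own stage functions) could look like this:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...   # placeholder stage functions
def transform(): ...
def load(): ...

with DAG(
    dag_id='daily_etl',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id='extract', python_callable=extract)
    t2 = PythonOperator(task_id='transform', python_callable=transform)
    t3 = PythonOperator(task_id='load', python_callable=load)
    t1 >> t2 >> t3  # run the stages in order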

      8. Launch the ETL pipeline

• Maintenance: Continuously monitor pipeline performance and update the ETL scripts and libraries to align with evolving business requirements.
• Deploy: Transfer the ETL scripts to a production environment and establish tracking mechanisms (e.g., Python’s logging module) to ensure pipeline health.

      What Is An Example of Python ETL?

      Now we’ll understand the workflow of a Python ETL script with an example. Let’s take a sample use case:

We have to process a mixed-format dataset of used car listings (CSV, JSON, and XML files), apply transformations to standardize the raw data, and generate a filtered, consolidated CSV file. We’ll track execution with a log file throughout the process.

      Let’s start!

      Step 1: Import libraries and configure data paths

We use the following code to import the necessary libraries and define file paths for input (raw data), output (filtered CSV), and logging:

      import glob
      import pandas as pd
      import xml.etree.ElementTree as ET
      from datetime import datetime

      What’s happening:

• glob: Finds files matching a pattern
• pandas: For tabular data processing
• xml.etree.ElementTree: Parses XML data
• datetime: Adds timestamps to logs

Next, configure the data paths:

data_dir = "dealership_data"
target_file = "dealership_transformed_data.csv"
log_file = "dealership_logfile.txt"

      Explanation:

      • data_dir: Source directory containing raw input files.
      • target_file: The path where transformed data is written.
      • log_file: Captures ETL activity with timestamps for monitoring.

      Step 2: Data extraction

      Here we define an extraction function for each data type: CSV, JSON, and XML, to retrieve consistent data formats for downstream processing.

      For CSV:

def extract_from_csv(path):
    return pd.read_csv(path)

• Reads the CSV file using pandas.
• Returns a DataFrame for downstream processing.

      For JSON:

def extract_from_json(path):
    return pd.read_json(path, lines=True)

• Reads the JSON file using pandas.
• Used when each record of the JSON file is stored on a separate line (lines=True).

For XML:

def extract_from_xml(path):
    cols = ['car_model', 'year_of_manufacture', 'price', 'fuel']
    rows = []
    tree = ET.parse(path)
    for elem in tree.getroot():
        rows.append({
            'car_model': elem.find("car_model").text,
            'year_of_manufacture': int(elem.find("year_of_manufacture").text),
            'price': float(elem.find("price").text),
            'fuel': elem.find("fuel").text
        })
    return pd.DataFrame(rows, columns=cols)
      • Parses the XML file, navigates its structure, and extracts specific fields.
      • Constructs a list of dictionaries and converts it to a DataFrame.

      Here, the consistent column names are car_model, year_of_manufacture, price, and fuel.

      Step 3: Combine data from all files

The following code loops through the files in the dealership data directory, applies the appropriate extractor, and creates a unified DataFrame with all the raw data.

def extract_all(data_dir):
    df = pd.DataFrame(columns=['car_model', 'year_of_manufacture', 'price', 'fuel'])
    for pattern, func in [
        ("*.csv", extract_from_csv),
        ("*.json", extract_from_json),
        ("*.xml", extract_from_xml)
    ]:
        for filepath in glob.glob(f"{data_dir}/{pattern}"):
            df = pd.concat([df, func(filepath)], ignore_index=True)
    return df

      What’s happening:

• glob.glob() finds the files matching each pattern in the directory.
• The matching extractor runs on each file, and the results are concatenated into one DataFrame.

      Step 4: Data transformation

      Here we clean and prepare the dataset by removing null values, deduplicating to avoid redundancy, and formatting numerical columns.

def transform(df):
    df = df.dropna(subset=['price', 'car_model'])
    df = df.drop_duplicates()
    df['price'] = df['price'].round(2)
    df['year_of_manufacture'] = pd.to_numeric(df['year_of_manufacture'], errors='coerce')
    return df

      Explanation:

• Removes nulls in the price and car_model columns.
• Converts year_of_manufacture to numeric values.
• Rounds the price values to two decimals with df['price'].round(2).

      Step 5: Data loading and log events

      Now, we’ll load the data and implement log events.

      For loading,

def load_to_csv(df, target_file):
    df.to_csv(target_file, index=False)

      What’s happening:

      • load_to_csv saves the transformed DataFrame to a CSV file.
      • index=False prevents row indices from being written as an extra column.

      For logging,

def log_event(message):
    ts = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    with open(log_file, "a") as f:
        f.write(f"{ts} - {message}\n")

      Explanation:

• log_event() writes a timestamped message to the log file for transparency and debugging.
• The log helps identify failure points, monitor each stage, and track how long each phase takes.

      Step 6: Orchestration of the ETL pipeline

      Now we run the ETL process step-by-step:

      log_event("ETL Job Started")

      This logs the beginning of the ETL process.

        For extraction,

        log_event("Extract phase Started")
        df = extract_all(data_dir)
        log_event("Extract phase Ended")
• Logs the start and end of the extraction phase.
• extract_all(data_dir) reads the files and merges the results into a single DataFrame.

        For transformation,

        log_event("Transform phase Started")<br>df = transform(df)<br>log_event("Transform phase Ended")

        • Drops rows with missing prices or car models.
        • Removes duplicates and ensures correct representation of year_of_manufacture.

        For loading,

        log_event("Load phase Started")
        load_to_csv(df, target_file)
        log_event("Load phase Ended")
        • load_to_csv() writes the cleansed dataset to the target file.
        • dealership_transformed_data.csv is the target file that stores datasets for analytics and reporting.

        To end the ETL job,

        log_event("ETL Job Ended")

The closing log entry confirms that the pipeline ran to completion.

        Real-World Use Cases of Python ETL

        Here are some detailed use cases of Python ETL:

        1. Real-time data streaming

        Use case: Logistics companies require real-time shipment status to optimize routes and provide delivery updates.

Python processes live data streams using libraries like kafka-python for stream ingestion. ETL scripts extract data from APIs and IoT devices, enrich it with delivery routes, and load it into a database for real-time analytics.

        The ability to manage streaming data and integrate with distributed systems enables instant notification and route optimization.
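A stripped-down consumer loop with kafka-python gives a feel for the ingestion side; the topic name, broker address, and event fields below are hypothetical:

import json
from kafka import KafkaConsumer  # third-party: pip install kafka-python

# Hypothetical topic and broker; replace with your cluster details
consumer = KafkaConsumer(
    'shipment_status',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    event = message.value
    # Enrich the event (e.g., attach route data) and write it to the analytics store here
    print(event.get('shipment_id'), event.get('status'))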

        2. Cross-language interoperability

        Use case: Companies integrate data from systems written in distinct languages (Java, C++) and diverse platforms (Windows, macOS).

You can leverage Python’s interoperability layer to connect with:

        • Java via JPype or Py4J
        • C/C++ using ctypes, cffi, or SWIG
        • R via rpy2
        • .NET applications through pythonnet

This makes Python a “glue language” that acts as a universal orchestrator in cross-platform environments. Companies can execute external programs, integrate with native modules, and operate across OS boundaries to span legacy systems and modern cloud services within a unified Python pipeline.
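As a tiny illustration of the glue role, the standard-library ctypes module can call straight into a native C library; the sketch below assumes a Linux system where the C standard library is available as libc.so.6:

import ctypes

# Load the C standard library (path assumes Linux; use ctypes.util.find_library('c') for portability)
libc = ctypes.CDLL('libc.so.6')
libc.printf(b'Row count received from a native module: %d\n', 42)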

        3. Lightweight and scriptable ETL

        Use case: Teams quickly ingest data from new data sources without setting up a full ETL infrastructure.

        Data engineers can craft quick scripts using pandas to read CSV and JSON files, filter datasets, and load them into a destination in a few lines of code. These scripts are version-controlled and are ideal for ad hoc tasks, proofs-of-concept, or environments where agility is more important than orchestration. 

        This highlights rapid scripting capabilities appropriate for isolated tasks.

        4. Customer 360-degree view

        Use case: Brands want a comprehensive view of customer behavior by integrating with CRMs, transactional databases, and analytical platforms.

        As discussed earlier, Python scripts extract data from various sources, including:

• CRM APIs like Salesforce and HubSpot, via requests or platform SDKs.
        • Campaign data from Mailchimp and Google Ads.
        • Transaction records from SQL databases through sqlite3 or psycopg2.

The gathered data is normalized into unified datetime formats, deduplicated based on customer ID or email, and merged into a consolidated DataFrame, as sketched below. Lastly, it is loaded into a data warehouse or exported to BI tools for dashboards and visual reports.
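A condensed sketch of the consolidation step might look like this; the CRM and transaction DataFrames are hypothetical stand-ins for the extracted sources:

import pandas as pd

# Hypothetical extracts from a CRM API and a transactions database
crm = pd.DataFrame({'email': ['a@x.com', ' A@x.com'], 'name': ['Ann', 'Ann']})
txns = pd.DataFrame({'email': ['a@x.com'], 'last_purchase': ['2024-05-01']})

# Normalize keys, deduplicate, and merge into a single customer view
crm['email'] = crm['email'].str.strip().str.lower()
crm = crm.drop_duplicates(subset='email')
txns['last_purchase'] = pd.to_datetime(txns['last_purchase'])

customer_360 = crm.merge(txns, on='email', how='left')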

        5. Automated data analysis

        Use case: Healthcare institutions have to generate automated patient reports for operational efficiency.

Python scripts are scheduled with cron or orchestrators like Airflow or Luigi to pull patient data from electronic health record (EHR) systems via SQL connectors (psycopg2 or pyodbc) and pandas.

        The raw data is standardized and enriched with derived metrics like admission rates and occupied beds. Finally, the processed data is exported as a PDF or Excel for report distribution.

        6. E-commerce inventory management

        Use case: Retailers synchronize inventory data across multiple streams to maintain accurate stock levels.

        The data extracted from supplier APIs and internal databases is transformed to reconcile discrepancies, standardize units, and calculate inventory in real-time. The processed data is transferred to a centralized repository like PostgreSQL to drive stock alerts and update dashboards with real-time data feeds.

          What Tools and Libraries Can You Use for Python ETL?

          1. pandas: A powerful library for data manipulation and analysis in Python, commonly used during the transform stage of ETL processes to clean, restructure, and analyze data efficiently.
          2. sqlalchemy: A SQL toolkit and Object Relational Mapper (ORM) that enables seamless interaction with SQL databases, widely used for extracting and loading data in Python-based ETL pipelines.
          3. pyodbc: A Python library that allows easy connection to various databases, including SQL Server, making it a popular choice for extracting and loading data in ETL workflows.
          4. pyspark: A library for distributed data processing, ideal for handling large-scale datasets in parallel ETL pipelines, commonly used in big data processing environments.
          5. boto3: The AWS SDK for Python, providing simple methods for interacting with AWS services, especially Amazon S3, during the extract or load phases of ETL processes.

          Best Practices When Building an ETL Pipeline in Python

          The best practices to build a resilient and scalable ETL pipeline in Python are:

          1. Modular design

          Focus on breaking down your pipeline into smaller modules to handle specific tasks. This enhances code reusability, expedites debugging, and simplifies future modifications in the pipeline.

          • Isolate extract(), transform(), and load() to develop, test, and maintain separately.
          • Design modules to reuse across various ETL projects and pipelines.
          • Use configuration files like YAML and JSON to manage different data environments with minimal coding.

          A modular design enables teams to maintain individual components and independent testing of modules through unit testing and mocking.

          2. Fault tolerance

Deploy comprehensive error-handling mechanisms at every stage of the pipeline to promote self-recovery and pipeline continuity. You can use Python’s built-in logging module to record errors and trace failing dependencies.

          Ensure that re-executing the ETL script doesn’t result in duplicate records and data inconsistencies. For fault tolerance:

          • Use structured logging and context-rich error messages to facilitate debugging.
          • Leverage orchestration frameworks to apply built-in retry logic and failure handling.
          • Isolate rows that fail validation by writing them to a separate table for later inspection.
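A minimal sketch combining these ideas, with a simple retry wrapper and a quarantine list for rows that fail validation (all names are illustrative):

import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def with_retries(func, attempts=3, delay=5):
    # Re-run a flaky stage a few times before giving up
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            logging.exception("Attempt %d of %d failed", attempt, attempts)
            if attempt == attempts:
                raise
            time.sleep(delay)

def extract():
    # Placeholder for a network or database call that may fail transiently
    return [{"id": 1, "price": "19.99"}, {"id": 2, "price": "not available"}]

rows = with_retries(extract)
valid, rejected = [], []
for row in rows:
    try:
        row["price"] = float(row["price"])
        valid.append(row)
    except ValueError:
        rejected.append(row)  # quarantine failed rows for later inspection instead of failing the run

logging.info("Loaded %d rows, quarantined %d", len(valid), len(rejected))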

          3. Pipeline testing

          Testing is necessary to validate the dataset and ensure ETL pipeline accuracy.

• Run unit tests with Python testing frameworks like pytest or unittest to validate outputs, spot regressions when logic changes, and mock external dependencies (see the sketch after this list).
• Run integration tests to validate end-to-end execution, data flow between modules, and compatibility between sources and destinations.
• Deploy schema and data-quality checks to validate DataFrames.
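For example, a small pytest unit test for the transform() function from the earlier example could look like the sketch below; the src.transform import path is an assumption about your project layout:

import pandas as pd
import pytest
from src.transform import transform  # assumed module path for the transform() shown earlier

def test_transform_drops_nulls_and_duplicates_and_rounds():
    raw = pd.DataFrame({
        'car_model': ['ritz', 'ritz', None],
        'year_of_manufacture': ['2014', '2014', '2016'],
        'price': [5000.123, 5000.123, 7000.0],
        'fuel': ['Petrol', 'Petrol', 'Diesel'],
    })
    result = transform(raw)
    assert len(result) == 1                                   # null and duplicate rows removed
    assert result['price'].iloc[0] == pytest.approx(5000.12)  # prices rounded to two decimals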

              4. Scalability

              Prepare the pipeline to work with vast datasets for effective scalability. Implement features like:

              • Incremental loading with CDC to avoid full reloads.
• Parallel processing with Python libraries like concurrent.futures, multiprocessing, or frameworks like dask for handling large datasets (sketched after this list).
• Distributed execution using Spark to deal with high-throughput and multi-source data pipelines.
• Version control (Git) and CI/CD practices to track changes and automate deployment.
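For instance, extraction from several independent sources can be parallelized with concurrent.futures; the file list below is hypothetical, and the same pattern works for API or database reads:

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

# Hypothetical list of independent source files
files = ['sales_jan.csv', 'sales_feb.csv', 'sales_mar.csv']

# Read the files concurrently, then combine the results
with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(pd.read_csv, files))

df = pd.concat(frames, ignore_index=True)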

              5. Big data integration

              Integrate with big data platforms and services to handle large datasets.

• Cloud SDKs: Python offers SDKs such as boto3 to connect with cloud-based data lakes and storage services (see the example below). These SDKs enable secure, scalable, and code-driven data ingestion and distribution pipelines.
              • Serverless platforms: Deploy Python ETL scripts on serverless platforms like Google Cloud Functions, AWS Lambda, and Azure Data Factory to process large volumes with minimum management.

Moreover, Python supports common big data formats (such as Parquet and Avro), so you can ingest data flexibly while staying compatible with diverse data ecosystems.
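For example, pushing a transformed file to S3 with boto3 takes a single call; the bucket and key below are placeholders, and credentials are assumed to be configured in the environment:

import boto3

# Credentials come from the environment or ~/.aws; bucket and key names are placeholders
s3 = boto3.client('s3')
s3.upload_file(
    Filename='dealership_transformed_data.csv',
    Bucket='my-data-lake',
    Key='raw/dealership/dealership_transformed_data.csv',
)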

              What are the Alternative Programming Languages for ETL?

              The languages are as follows:

              1) Java

              Java is a popular programming language, particularly for developing client-server web applications. Java has influenced other programming languages, including Python, and spawned several branches, including Scala. Java is the foundation for several other big data tools, including Hadoop and Spark. The Java ecosystem also includes a library collection comparable to that of Python.

This programming language is designed so developers can write code once and run it anywhere, regardless of the underlying computer architecture, a principle known as write once, run anywhere (WORA).

                2) Ruby

Ruby, like Python, is a scripting language that allows developers to build ETL pipelines, but fewer ETL-specific Ruby frameworks are available to make the task easier. However, several libraries are in active development, including Nokogiri, Kiba, and Square’s ETL package.

                  3) Go

Go, also known as Golang, is a programming language with C-like syntax that is increasingly used for data analysis and big data applications. It was created to fill gaps in C++ and Java discovered while working with Google’s servers and distributed systems. Go includes several machine learning libraries, support for Google’s TensorFlow, data pipeline libraries such as Apache Beam, and ETL toolkits such as Crunch and Pachyderm.

                    Conclusion

This article covered Python’s role in ETL workflows, the steps to build an ETL pipeline with Python scripts, a worked example, real-world use cases, key libraries, and best practices for production-ready pipelines.

                    If you’re looking for a no-code solution to simplify your ETL process, Hevo is an excellent choice. Sign up for a 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.

                      Frequently Asked Questions

                      1. Can Python be used for ETL?

                      Yes, Python can be used for ETL (Extract, Transform, Load) processes.

2. What is an example of real-time ETL?

                      Real-time ETL involves continuously extracting, transforming, and loading data as soon as it is available rather than processing it in batches. This is critical for applications that require up-to-the-minute data, such as fraud detection, real-time analytics, and monitoring systems. Example of Real-Time ETL: Fraud Detection in Financial Transactions

                      3. Which is the best ETL tool?

                      The best ETL tool depends on your specific requirements, such as the complexity of your data workflows, your budget, your team’s expertise, and the systems you need to integrate. Here are some of the top ETL tools: Hevo, Apache NiFi, Informatica, Fivetran, etc.

                      4. Is SQL an ETL tool?

                      No, SQL (Structured Query Language) itself is not an ETL tool, but it is a powerful language used within ETL processes. SQL is primarily used for querying and manipulating data within relational databases.

                      Manik Chhabra
                      Research Analyst, Hevo Data

                      Manik is a passionate data enthusiast with extensive experience in data engineering and infrastructure. He excels in writing highly technical content, drawing from his background in data science and big data. Manik's problem-solving skills and analytical thinking drive him to create impactful content for data professionals, helping them navigate their day-to-day challenges. He holds a Bachelor's degree in Computers and Communication, with a minor in Big Data, from Manipal Institute of Technology.