How to Load Data into Python: A Comprehensive Guide 101

|

How to load data into python FI

As Python has become the go-to language for working with data, it is essential to know how to load data in Python. Based on your requirements, Python has several ways to load data. Using Python, you can load information from the SQL server, CSV, and binary files. To implement data loading in Python, you can either use built-in functions of different libraries or write your own custom function. 

In this article we will learn how to load data into python. Let’s dive in.

Loading Data in Python

Data science projects start with loading data into Python for cleaning data, obtaining insights, and building models. Ensuring that you know how to load data in Python correctly is essential for a seamless workflow.

Here are the top 10 ways on how to load data into Python. 

Read_CSV Function

CSV (comma separated values) is a popular file format among data scientists to store data. As the name suggests, data in CSV files are separated by a comma. This simplifies the structure of the file, making it lightweight. When the data is stored in other popular formats like workbooks, it retains the underlying file structure. Although this makes it easy for humans to handle data, CSV files are computationally efficient. 

  • The read_csv() is often used to get the data into a pandas dataframe, which is a two-dimensional tabular data format. Usually, large data files are stored in CSV format as workbooks and MS Excel files do not support millions of rows. 
  • To read data using read_csv() function, you must import the pandas library using the following command:
import pandas as pd
  • Here, the pandas library is imported as pd, which will be used for further reference.
  • Now, read the data from a CSV file into a pandas’ dataframe using the following command:
data = pd.read_csv(‘filename.csv’)
  • In the aforementioned code, the data stored in the filename.csv is stored in the variable data, which is a pandas’ dataframe.
  • It is a simple technique to load data into Python. But, you can further customize the import by using numerous parameters supported by read_csv().

Manual Function

In-built methods aside, you can also load data to Python by writing a custom function. However, implementing a manual function is more challenging than using existing functions. Nevertheless, it allows you to be more flexible as it can support a wide range of file formats like CSV, Excel, JSON, HTML, and more.

A manual function starts by reading a file (CSV file in this case) with the open() function. Next, you extract the first row of the file since it consists of column names. After collecting the names of columns, you will have to read the entire file line by line to extract the row values. 

Since values are separated by commas, you will have to split at every ‘,’ to identify the data points and store them as Python list items. Once both row values and column names are gathered in two different lists, you can focus on creating a pandas’ dataframe. You can use the Python list comprehension technique to generate a datafame. 

with open(‘sample.csv', 'r') as f:
    
    # Read the column names
    column_names = f.readline().strip().split(',')
    
    # Initialize a list to store the data
    # It will hold row values
    data = []
    
    # Loop over the remaining lines in the file
    for j in f:

        j = j.strip()
        values = j.split(',')
        
        # Append the values to the data list
        data.append(values)
        
# Create a dictionary with a list comprehension to hold the data
df = {column_names[i]: [row[i] for row in data] for i in range(len(column_names))}

# Create a dataframe from the dictionary
df = pd.DataFrame(df)

In the aforementioned code,

  • strip(): Is used to remove leading or trailing spaces.
  • split(): This is used to split the at a specified parameter. In this case, it was ‘,’ since CSV files store data separated by a comma.
  • readline(): Is used for returning the first line.

Loadtxt Function

Loadtxt loads data from a simple text file into a NumPy array. NumPy is a Python library used for implementing mathematical operations on data. With loadtxt function, you can quickly load big data. But, you might struggle to load data that have different data types or missing rows. 

You can use loadtxt function to load data in Python by importing the NumPy library as follows.

import numpy as np

Here, np is used to refer to the NumPy library. 

The syntax for using loadtxt function is as follows:

data = np.loadtxt(filename)

For example,

data = np.loadtxt(‘sample.csv’, delimiter = ‘,’)
  • sample.csv: It refers to the file you are trying to load.
  • delimiter: This is an optional parameter. But, since you are trying to read a CSV file, you must provide the delimiter type.

Genfromtxt Function

Genfromtxt also comes as an inbuilt function with the NumPy library. With genfromtxt, you can import data while working with files that have missing values. You can fill in the missing values by specifying the parameter while importing. 

import numpy as np
data = np.genfromtxt(‘sample.csv’, delimiter = ‘,’ , names = True, dtype = None)
  • names: Is used to set the first row as column names. 
  • delimiter: This is used to specify how values are separated in the file we are importing.
  • sample.csv: It is the name of the file.
  • dtype: This is used to avoid converting all the data into a single data type. This helps you avoid NaN while importing string values. 

Open() Function

Open() is a bit complex way to load files but is used for better performance. With Open(), you can close files without explicitly mentioning them to manage the memory efficiently. Ideally, Open() is used to perform read or write operations and then close the file automatically. 

Method 1: Close the File with a close() after Loading the Data 

In this method, you will open the file to write “Hello, world!” and close it using the close function.

file = open("sample.txt", "w") 
file.write("Hello, world!") 
file.close()
  • sample.txt: This refers to the file name. 
  • w: It specifies the mode of the file. It could be read (r), write (w), append (a), and more. 
  • file.write(“Hello, world!”): The write() function is used to add data into the file. Here we are adding “hello world” text.
  • file.close(): It is used to close the file. 

Note: read (r) is the default mode in the open() function. 

Method 2: Automatically Close the File after Loading the Data

In this method, you will close the file after reading the file sample.csv using the with statement.

with open(sample.csv') as f:
   lines = f.readlines()
  • with: You can use with to close the file automatically.
  • readlines(): It returns all the lines of the file.

Pickle

Pickle is yet another technique to load data in Python to serialize (also known as pickling) objects. It’s a process of converting a Python hierarchy into a byte stream. And unpickling is the inverse operation—converts byte stream to Python hierarchy. Generally, functions like load() and dump() are used to serialize objects and de-serialize the byte stream.

To use Pickle, import the Pickle module with the following code:

import pickle

Create a Pickle File

Create a sample.pkl file and store the data from the Python dataframe df into the pickle file

with open(‘sample.pkl’ , ‘wb’) as f:
pickle.dump(df, f) 

In the aforementioned code,

  • wb: This is used to open the sample.pkl file in a write-binary mode.
  • dump: Is used to store data in the opened file. Here, the dataframe is dumped into a sample.pkl file.

Read from a Pickle File

To load the pickle file in a dataframe, open it in a read-binary mode and then use load() to load data in the dataframe.  

with open(‘sample.pkl’ , ‘rb’) as f:
	data = pickle.load(f)
  • rb: This is used for reading the file in read-binary mode.
  • load: Is used to store the data in a dataframe.

Load from an SQL Server

Working with an SQL server is one of the repetitive tasks for fetching new information. However, you can use pyodbc library to load data in Python from an SQL server seamlessly.

Install and import the pyodbc library using the following command:

!pip install pyodbc
Import pyodbc

Next, establish the connection with the SQL server by providing necessary details like the database name, server name, and more. 

server_name = 'your_server_name'
db_name = 'your_database_name'
cnxn = pyodbc.connect('DRIVER={SQL Server}; SERVER=' + server_name + '; DATABASE=' + db_name + '; Trusted_Connection=yes;')

For authentication with username and password, replace the code with the following command:

cnxn = pyodbc.connect('DRIVER={SQL Server}; SERVER=' + server_name + '; DATABASE=' + db_name + '; UID=' + username + '; PWD=' + password)

After a successful connection, use read_sql() method to query from the database.

sql_query = 'SELECT * FROM your_table_name'
df = pd.read_sql(sql_query, cnxn)

After loading the data into a dataframe, close the connection using the following command:

cnxn.close()

Load Data From a Supabase

Supabase is a backend-as-a-service built on top of PostgreSQL. It is used to create and manage databases with an intuitive user interface quickly. You can use Python with Supabase to load data for your analytics projects.

Install the Supabase library using the following command:

pip3 install supabase

Next, import the module and establish a connection with the Supabase. Ensure to get the Supabase URL and KEY from the API section.

import os
from supabase import create_client, Client

url: str = os.environ.get("SUPABASE_URL")
key: str = os.environ.get("SUPABASE_KEY")
supabase: Client = create_client(url, key)

In the aforementioned case, a valid connection is created between Python and Supabase with an API. 

Now, fetch data from Supabase using the following command:

response = supabase.table('countries').select("*").execute()

The aforementioned code will fetch all the data from the table named “countries.” 

Check the output:

How to load data into python: api response
Image Source

Load Data from TXT File

TXT files are a common file format used for storing plain text. You can read .txt files with a simple read_table(). This function uses tab character (\t) by default to specify sep parameter. However, ensure sep is set to the actual character used in your txt file.  

df = pd.read_table("test_data.txt", sep=" ")

In the aforementioned code, you are loading the test_data.txt file with a space (“ “) as a separator. 

Load Data from Excel File

Excel is yet another widely used file in organizations to record data. You can load excel data into a dataframe with read_excel().

df = pd.read_excel("test_data.xlsx", sheet_name="test_sheet1", header=0, index_col=0)

Here, you read the .xlsx file into a dataframe while providing values for other parameters like index_col, sheet_name, and header

  • sheet_name: This is used to import the specific sheet from the .xslx file.
  • index_col: It refers to the first column, which is often an index for rows
  • header: It is used to specify the position of the columns (first row) in the excel sheet.

Conclusion

With the techniques mentioned above, data loading can be a straightforward technique. However, based on the complexity of the data you are importing, you might need to work with optional parameters of every in-built function. Optional parameters enable you to handle the data efficiently while loading data into Python. 

Instead of spending months developing and maintaining such data integrations, you can enjoy a smooth ride with Hevo’s 150+ plug-and-play integrations (including 40+ free sources).

Visit our Website to Explore Hevo

Saving countless hours of manual data cleaning & standardizing, Hevo’s pre-load data transformations get it done in minutes via a simple drag n drop interface or your custom python scripts. No need to go to your data warehouse for post-load transformations. You can simply run complex SQL transformations from the comfort of Hevo’s interface and get your data in the final analysis-ready form. 

Want to take Hevo for a ride? Sign Up for a 14-day free trial and simplify yo

Manjiri Gaikwad
Freelance Technical Content Writer, Hevo Data

Manjiri loves data science and produces insightful content on AI, ML, and data science. She applies her flair for writing for simplifying the complexities of data integration and analysis for solving problems faced by data professionals businesses in the data industry.

No-Code Data Pipeline for Your Data Warehouse