As Python has become the go-to language for working with data, knowing how to load data in Python is essential. Depending on your requirements, Python offers several ways to load data: from SQL servers, CSV files, binary files, and more. To load data in Python, you can either use the built-in functions of various libraries or write your own custom function.
In this article, we will learn how to load data into Python. Let’s dive in.
Python Overview
Python is a versatile, high-level programming language known for its simplicity and readability. It supports multiple programming paradigms, including object-oriented, procedural, and functional programming. Widely used in web development, data analysis, artificial intelligence, and automation, Python offers an extensive library ecosystem for various applications. Its user-friendly syntax makes it ideal for beginners and professionals alike.
Real-Life Use Cases of Data Loading in Python
- Data Analysis: Import datasets for exploratory analysis, statistical computations, and insights generation.
- Machine Learning: Load training and testing data for building predictive models.
- ETL Processes: Extract, transform, and load data into data warehouses for business intelligence.
- Web Scraping: Parse and load data from websites into structured formats for further analysis.
- Log Analysis: Load server logs to monitor system performance and identify issues.
- Financial Reporting: Import transactional data for budgeting, forecasting, and reporting.
- IoT Analytics: Process real-time data from IoT devices for trend analysis and monitoring.
Loading Data in Python
Data science projects typically begin with loading data into Python, as it’s crucial for data cleaning, analysis, and model development. Knowing how to efficiently load data ensures smooth workflows and accurate results. By using appropriate libraries and functions, you can handle various data formats such as CSV, JSON, Excel, and SQL databases. A solid understanding of these methods is key to optimizing your data processing, ensuring that the data is ready for analysis or machine learning tasks without delays or errors.
Here are the top 10 ways to load data into Python.
Read_CSV Function
CSV (comma-separated values) is a popular file format among data scientists for storing data. As the name suggests, the values in a CSV file are separated by commas, which keeps the file structure simple and lightweight. Formats like Excel workbooks retain extra structure and formatting alongside the data; that makes them convenient for humans to work with, but plain CSV files are more computationally efficient to read.
- The read_csv() function is often used to load data into a pandas dataframe, a two-dimensional tabular data structure. Large data files are usually stored as CSV because workbook formats like MS Excel cap out at just over a million rows.
- To read data using read_csv() function, you must import the pandas library using the following command:
import pandas as pd
- Here, the pandas library is imported as pd, which will be used for further reference.
- Now, read the data from a CSV file into a pandas dataframe using the following command:
data = pd.read_csv('filename.csv')
- In the aforementioned code, the data stored in filename.csv is loaded into the variable data, which is a pandas dataframe.
- This is the simplest technique to load data into Python. You can further customize the import with the numerous optional parameters supported by read_csv(), as shown below.
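For instance, here is a minimal sketch of a customized import; the file name and column names (filename.csv, col1, col2) are hypothetical stand-ins for your own data:
import pandas as pd

# Read only two columns, treat the first row as the header,
# stop after 1,000 data rows, and force col1 to be read as strings
data = pd.read_csv(
    'filename.csv',              # hypothetical file name
    usecols=['col1', 'col2'],    # hypothetical column names
    header=0,                    # first row holds the column names
    nrows=1000,                  # limit the number of rows read
    dtype={'col1': str},         # explicit dtype for col1
)
print(data.head())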
Manual Function
Built-in methods aside, you can also load data into Python by writing a custom function. Implementing a manual function is more challenging than using existing functions, but it gives you more flexibility, since you can adapt the parsing logic to a wide range of text-based formats like CSV, JSON, HTML, and more.
A manual function starts by reading a file (CSV file in this case) with the open() function. Next, you extract the first row of the file since it consists of column names. After collecting the names of columns, you will have to read the entire file line by line to extract the row values.
Since values are separated by commas, you will have to split each line at every ',' to identify the data points and store them as items in a Python list. Once row values and column names are gathered in two separate lists, you can create a pandas dataframe using a list comprehension, as in the following code.
import pandas as pd

with open('sample.csv', 'r') as f:
    # Read the column names from the first line
    column_names = f.readline().strip().split(',')
    # Initialize a list to store the row values
    data = []
    # Loop over the remaining lines in the file
    for line in f:
        line = line.strip()
        values = line.split(',')
        # Append the values to the data list
        data.append(values)

# Create a dictionary with a list comprehension to hold the data
df_dict = {column_names[i]: [row[i] for row in data] for i in range(len(column_names))}
# Create a dataframe from the dictionary
df = pd.DataFrame(df_dict)
In the aforementioned code,
- strip(): Removes leading and trailing whitespace from a line.
- split(): Splits a string at the specified separator. In this case, it is ',' since CSV files store comma-separated values.
- readline(): Returns the next line of the file; here, the first line, which holds the column names.
Loadtxt Function
The loadtxt() function loads data from a simple text file into a NumPy array. NumPy is a Python library for performing mathematical operations on arrays of data. With loadtxt(), you can quickly load large files, but you might struggle with data that mixes types or has missing rows.
To use loadtxt(), first import the NumPy library as follows:
import numpy as np
Here, np is used to refer to the NumPy library.
The syntax for using loadtxt function is as follows:
data = np.loadtxt(filename)
For example,
data = np.loadtxt('sample.csv', delimiter=',')
- sample.csv: It refers to the file you are trying to load.
- delimiter: This is an optional parameter (whitespace by default). Since you are reading a CSV file, you must set it to a comma. A fuller sketch with a header row follows below.
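As a minimal sketch, assuming a hypothetical sample.csv whose first line is a header row followed by purely numeric values, you can skip the header with skiprows:
import numpy as np

# Skip the header row so loadtxt only sees numeric values
data = np.loadtxt('sample.csv', delimiter=',', skiprows=1)
print(data.shape)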
Genfromtxt Function
Genfromtxt also comes as a built-in function of the NumPy library. With genfromtxt(), you can import data from files that have missing values, and you can fill in those gaps by specifying a parameter while importing (see the sketch after the parameter list below).
import numpy as np
data = np.genfromtxt('sample.csv', delimiter=',', names=True, dtype=None)
- names: Is used to set the first row as column names.
- delimiter: This is used to specify how values are separated in the file we are importing.
- sample.csv: It is the name of the file.
- dtype: Setting this to None avoids converting all the data into a single data type, letting NumPy infer a type per column. This helps you avoid string values turning into NaN while importing.
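genfromtxt() also accepts a filling_values parameter for plugging gaps. Here is a minimal sketch, assuming the same hypothetical sample.csv; the fill value of 0 is an arbitrary choice:
import numpy as np

data = np.genfromtxt(
    'sample.csv',
    delimiter=',',
    names=True,          # use the first row as column names
    dtype=None,          # infer a dtype per column
    encoding='utf-8',    # decode text fields explicitly
    filling_values=0,    # substitute 0 for any missing entry
)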
Open() Function
The open() function is a lower-level way to load files, but it offers fine-grained control and good performance. It returns a file object you can read from or write to; when combined with a with statement, the file is closed automatically, keeping memory management tidy.
Method 1: Close the File with a close() after Loading the Data
In this method, you open the file, write “Hello, world!” to it, and close it explicitly with the close() method.
file = open("sample.txt", "w")
file.write("Hello, world!")
file.close()
- sample.txt: This refers to the file name.
- w: It specifies the mode of the file. It could be read (r), write (w), append (a), and more.
- file.write(“Hello, world!”): The write() method adds data to the file. Here, we are writing the “Hello, world!” text.
- file.close(): It is used to close the file.
Note: read (r) is the default mode in the open() function.
Method 2: Automatically Close the File after Loading the Data
In this method, you read the file sample.csv inside a with statement, which closes the file automatically when the block exits.
with open('sample.csv') as f:
    lines = f.readlines()
- with: You can use with to close the file automatically.
- readlines(): It returns all the lines of the file as a list of strings (see the note below).
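Keep in mind that readlines() pulls the entire file into memory. For large files, a minimal alternative sketch (again assuming a hypothetical sample.csv) is to iterate over the file object, which reads one line at a time:
with open('sample.csv') as f:
    for line in f:
        # Only one line is held in memory at a time
        print(line.strip())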
Pickle
Pickle is yet another technique for loading data in Python. It serializes Python objects, a process known as pickling, by converting a Python object hierarchy into a byte stream. Unpickling is the inverse operation: it converts a byte stream back into a Python object hierarchy. The dump() and load() functions serialize objects and deserialize byte streams, respectively.
To use Pickle, import the Pickle module with the following code:
import pickle
Create a Pickle File
Create a sample.pkl file and store the data from the pandas dataframe df into it.
with open('sample.pkl', 'wb') as f:
    pickle.dump(df, f)
In the aforementioned code,
- wb: This is used to open the sample.pkl file in a write-binary mode.
- dump: Is used to store data in the opened file. Here, the dataframe is dumped into a sample.pkl file.
Read from a Pickle File
To load the pickle file in a dataframe, open it in a read-binary mode and then use load() to load data in the dataframe.
with open('sample.pkl', 'rb') as f:
    data = pickle.load(f)
- rb: This is used for reading the file in read-binary mode.
- load: Reads the pickled object back from the file; here, it returns the original dataframe.
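As an aside, pandas wraps the same mechanism in its own helpers. A minimal sketch, reusing the df and sample.pkl from above:
import pandas as pd

# Write the dataframe to a pickle file and read it back
df.to_pickle('sample.pkl')
data = pd.read_pickle('sample.pkl')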
Load from an SQL Server
Fetching fresh information from a SQL server is a common, repetitive task. The pyodbc library lets you load data into Python from a SQL server seamlessly.
Install and import the pyodbc library using the following command:
!pip install pyodbc
import pyodbc
Next, establish the connection with the SQL server by providing necessary details like the database name, server name, and more.
server_name = 'your_server_name'
db_name = 'your_database_name'
cnxn = pyodbc.connect('DRIVER={SQL Server}; SERVER=' + server_name + '; DATABASE=' + db_name + '; Trusted_Connection=yes;')
For authentication with username and password, replace the code with the following command:
cnxn = pyodbc.connect('DRIVER={SQL Server}; SERVER=' + server_name + '; DATABASE=' + db_name + '; UID=' + username + '; PWD=' + password)
After a successful connection, use the pandas read_sql() function to run a query against the database.
sql_query = 'SELECT * FROM your_table_name'
df = pd.read_sql(sql_query, cnxn)
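If the query depends on a runtime value, it is safer to parameterize it than to build the SQL string by hand. A minimal sketch, where the orders table and customer_id column are hypothetical:
# The driver fills the ? placeholder safely, avoiding SQL injection
sql_query = 'SELECT * FROM orders WHERE customer_id = ?'
df = pd.read_sql(sql_query, cnxn, params=[42])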
After loading the data into a dataframe, close the connection using the following command:
cnxn.close()
Load Data from Supabase
Supabase is a backend-as-a-service built on top of PostgreSQL. It lets you quickly create and manage databases through an intuitive user interface. You can use Python with Supabase to load data for your analytics projects.
Install the Supabase library using the following command:
pip3 install supabase
Next, import the module and establish a connection with Supabase. Make sure to get the Supabase URL and KEY from the API section of your project.
import os
from supabase import create_client, Client
url: str = os.environ.get("SUPABASE_URL")
key: str = os.environ.get("SUPABASE_KEY")
supabase: Client = create_client(url, key)
In the aforementioned code, a client connection is created between Python and Supabase using the project’s API credentials.
Now, fetch data from Supabase using the following command:
response = supabase.table('countries').select("*").execute()
The aforementioned code will fetch all the data from the table named “countries.”
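The rows come back as a list of dictionaries on response.data. As a minimal sketch, you can convert them into a pandas dataframe for analysis (assuming the countries table from above):
import pandas as pd

# response.data holds the fetched rows as a list of dicts
df = pd.DataFrame(response.data)
print(df.head())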
Load Data from TXT File
TXT files are a common file format for storing plain text. You can read .txt files with a simple read_table() call. The sep parameter of this function defaults to the tab character (\t), so ensure sep is set to the character actually used in your txt file.
df = pd.read_table("test_data.txt", sep=" ")
In the aforementioned code, you load the test_data.txt file with a space (" ") as the separator.
Load Data from Excel File
Excel is yet another file format widely used in organizations to record data. You can load Excel data into a dataframe with read_excel().
df = pd.read_excel("test_data.xlsx", sheet_name="test_sheet1", header=0, index_col=0)
Here, you read the .xlsx file into a dataframe while providing values for other parameters like index_col, sheet_name, and header.
- sheet_name: This is used to import a specific sheet from the .xlsx file.
- index_col: The column to use as the row labels; here, the first column (position 0).
- header: The row to use as the column names; here, the first row (position 0).
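If you need every sheet at once, read_excel() accepts sheet_name=None and returns a dictionary of dataframes. A minimal sketch, reusing the hypothetical test_data.xlsx:
import pandas as pd

# Maps each sheet name to its own dataframe
sheets = pd.read_excel('test_data.xlsx', sheet_name=None)
for name, sheet_df in sheets.items():
    print(name, sheet_df.shape)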
Common Challenges While Loading Data in Python
- File Path Errors: Incorrect or relative file paths often lead to file not found errors.
- Handling Large Datasets: Loading massive files in one go can exhaust memory and slow down processing (see the chunked-reading sketch after this list).
- Missing Data: Incomplete datasets require special handling to avoid errors or incorrect analysis.
- Inconsistent Formats: Different file formats like CSV, JSON, or Excel can complicate the loading process.
- Encoding Issues: Files with non-standard encodings can lead to decoding errors during import.
- Data Type Mismatch: Incorrect data types may cause computation errors or inefficient memory usage.
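For the large-dataset challenge above, one common mitigation is to stream a CSV in chunks rather than loading it whole. A minimal sketch, assuming a hypothetical big_data.csv:
import pandas as pd

# Process the file 100,000 rows at a time
total_rows = 0
for chunk in pd.read_csv('big_data.csv', chunksize=100_000):
    total_rows += len(chunk)   # replace with your own per-chunk work
print(total_rows)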
Conclusion
With the techniques mentioned above, loading data into Python can be straightforward. However, depending on the complexity of the data you are importing, you may need to work with the optional parameters of each built-in function. These parameters enable you to handle the data efficiently while loading it into Python.
Discover how to use pandas to load JSON and streamline your data processing workflows with our step-by-step guide.
Instead of spending months developing and maintaining such data integrations, you can enjoy a smooth ride with Hevo’s 150+ plug-and-play integrations (including 60+ free sources). Saving countless hours of manual data cleaning & standardizing, Hevo’s pre-load data transformations get it done in minutes via a simple drag-and-drop interface or your custom Python scripts. Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand.
FAQs
1. How do I load data into Python?
You can load data into Python using libraries like pandas or csv.
2. How do you input data into a Python program?
Use the input() function to take user input as a string. If needed, you can convert the input to other data types using functions like int() or float().
3. How do you output data to a file in Python?
You can output data to a file in Python using the open() function along with the write() or writelines() methods.
Manjiri is a proficient technical writer and a data science enthusiast. She holds an M.Tech degree and leverages the knowledge acquired through that to write insightful content on AI, ML, and data engineering concepts. She enjoys breaking down the complex topics of data integration and other challenges in data engineering to help data professionals solve their everyday problems.