PostgreSQL is a database management system based on relational and object-oriented databases. To manage and modify complex data sets, it uses advanced SQL queries. Apache Parquet is a free, open-source column-oriented data file format that is compatible with the majority of data processing frameworks in the Hadoop environment. It has a striking resemblance to other column-oriented data storage formats available in Hadoop like ORC and RCfile.

This blog tells you about steps you can follow for Parquet to Postgres data migration. In addition to giving an overview on PostgreSQL and Parquet, it also sheds light on why Parquet is better than other Data Formats.

Understanding the Basics of Parquet and Postgresql

  • PostgreSQL: PostgreSQL is an open-source relational database management system that is popular for its reliability and scalability. It has the capabilities in managing structured data and is opted by many companies.
  • Parquet: Parquet is an open-source columnar storage file format that can store and process big datasets effectively. It will be useful for analytical workloads and is a popular choice for data warehousing and data lakes.
  • DuckDB: DuckDB is an open-source project which mainly focuses on analytical query processing. This can manage high-performance analytical workloads, and work with complex queries.
  • DuckDB FDW: This is a foreign data wrapper extension to sync PostgreSQL to DuckDB databases.

What Types of Data are Imported in PostgreSQL?

PostgreSQL has an Import/Export tool built into dbForge Studio that supports file transfer between most frequently used data formats. It allows you to save templates for repetitive export and import tasks. With this, you can easily migrate data from other servers, customize import and export jobs, save templates for recurring scenarios, and populate new tables with data.

Data formats that are compatible with PostgreSQL are:

  • Text
  • HTML
  • MS Excel
  • XML
  • CSV
  • JSON 
  • Parquet

How To Import Parquet to PostgreSQL?

Here are a few ways you can integrate PostgreSQL Parquet:

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. 

GET STARTED WITH HEVO FOR FREE

Check Out Some of the Cool Features of Hevo:

  • Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of source data and maps it to the destination schema.

Parquet to PostgreSQL Integration: Querying Parquet Data as a PostgreSQL Database

An ODBC driver makes use of Microsoft’s Open Database Connectivity (ODBC) interface, which allows applications to connect to data in database management systems using SQL as a standard. SQL Gateway is a  secure ODBC proxy for remote data access. With SQL Gateway, your ODBC data sources behave like a standard SQL Server or MySQL database.

Using the SQL Gateway, the ODBC Driver for Parquet, and the MySQL foreign data wrapper from EnterpriseDB we can access Parquet data as a PostgreSQL database on Windows.

Steps to import Parquet to PostgreSQL:

  • Configure Connection to Parquet: Specify the connection properties in an ODBC DSN (data source name). Microsoft ODBC Data Source Administrator can be used to create and configure ODBC DSNs. Then, Connect to your local Parquet files by setting the URI (Uniform resource identifier)connection properly to the location of the Parquet file.
  • Start the Remoting Device: The MySQL remoting service is a daemon process that waits for clients’ incoming MySQL connections. You can set up and configure the MySQL remoting device in the SQL Gateway.
  • Build the MySQL Foreign data wrapper: A Foreign Data Wrapper is a library that can communicate with an external data source while concealing the details of connecting to it and retrieving data from it. It can be installed as an extension to PostgreSQL without recompiling PostgreSQL. Building a Foreign Data Wrapper is essential for importing data from Parquet to PostgreSQL.

If you’re using PostgreSQL on a Unix-based system, you can install the mysql fdw using the PostgreSQL Extension Network (PGXN). Compile the extension if you’re using PostgreSQL on Windows to make sure you’re using the most recent version.

 Building Extension from Visual Studio

Obtain Prerequisites
  • Step 1: Step Install PostgreSQL.
  • Step 2: If you’re using a 64-bit PostgreSQL installation, get libintl.h from the PostgreSQL source as libintl.h is not currently included in the PostgreSQL 64-bit installer. 
  • Step 3: Get the source code for the mysql fdw foreign data wrapper from Enterprise DB.
  • Step 4: Install the MySQL Connector C.
Configure a Project 

You’re now ready to compile the extension with Visual Studio as you’ve obtained the necessary software and source code.

Follow the steps to create a project using mysql_fdw source:

  • Step 1: Create a new empty C++ project in Microsoft Visual studio.
  • Step 2:Right-click Source Files and click on Add -> Existing Item In the Solution Explorer. Select all of the .c and .h files from the mysql_fdw source In the file explorer.

To configure your project follow the steps:

  • Step 2.1: If you’re creating a 64-bit system, go to Build -> Configuration Manager and select x64 under Active Solution Platform.
  • Step 2.2: Right-click your project and select Properties from the drop-down menu.
  • Step 2.3: Select All Configurations from the Configuration drop-down menu.
  • Step 2.4: Dynamic Library should be selected under Configuration Properties -> General -> Configuration Type.
  • Step 2.5: Enable C++ Exceptions is set to No in Configuration Properties -> C/C++ -> Code Generation -> Enable C++ Exceptions.
  • Step 2.6: Select Compile as C Code in Configuration Properties -> C/C++ -> Advanced -> Compile As.
  • Step 2.7: Select No from Linker -> Manifest File -> Generate Manifest.
Adding Required Dependencies
  • Step 1:Select Edit in Linker -> Input -> Additional Dependencies and type the following:  
parquet to postgres: additional dependencies
 Image Source

Also, make sure that Inherit From Parent or Project Defaults are both checked.

  • Step 2:Select Edit and add the path to the lib folder in your PostgreSQL installation in Linker -> General -> Additional Library Directories.
  • Step 3:Select No from Linker -> General -> Link Library Dependencies.
  • Step 4: Add the following include to complete your project’s configuration: Add the following folders in the following order to C/C++ -> General -> Additional Include Directories:

Configure mysql_fdw for Windows

  • Step 1:Add these defines to mysql fdw.c:  
parquet to postgres : adding defines to fdw
ImageSource
  • Step 2:Delete the following line from the mysql load library definition:
parquet to postgres : deleting line from load library
ImageSource
  • Step 3:Replace the assignment of mysql dll handle with the following line in the mysql load library definition for a Windows build:
parquet to postgres : replacing assignment handle
ImageSource
  • Step 4:To export the function from the DLL, prefix the call to the mysql fdw handler function with the __declspec(dllexport) keyword:
parquet to postgres :export function from dll
ImageSource
  • Step 5:To export the function from the DLL, add the __declspec(dllexport) keyword to the declaration of the mysql fdw validator function in option.c:
parquet to postgres :export function from dll Step 5
ImageSource

You can now build and choose the Release configuration.

Install the Extension 

Follow the steps below to install the extension after you’ve compiled the DLL:

  • Step 1: In the PATH environment variable of the PostgreSQL server, add the path to the MySQL Connector C lib folder.
  • Step 2:Copy the DLL from your project’s Release folder to your PostgreSQL installation’s lib subfolder.
  • Step 3:Copy mysql fdw—1.0.sql and mysql fdw.control from the folder containing the mysql fdw csource files to the extension folder in your PostgreSQL installation’s share folder. C: Program FilesPostgreSQL9.4 share extension is an example of a location.

Query Parquet Data as a PostgreSQL Database

After you’ve installed the extension, you can begin running queries against Parquet data to import data from Parquet to PostgreSQL by following the steps below:

  •  Step 1:Go to your PostgreSQL database and log in.
parquet to postgres : open database
Image Source
  • Step 2:Install the database extension.
Parquet to postgres: install database extension
Image Source
  • Step 3:Create a Parquet data server object:
parquet to postgres :creating server object
Image Source
  • Step 4:Make a user mapping for a MySQL remoting service user’s username and password. In the sample service configuration, the user’s credentials are as follows:
Parquet to postgres:user mapping
Image Source
  • Step 5: Make a local schema.
parquet to postgres:creating local schema
Image Source
  • Step 6: Import all of the tables from your Parquet database.

You can now run SELECT commands to import data from Parquet to PostgreSQL.

Limitations of Using ODBC Driver for Parquet to PostgreSQL Import

There are certain limitations of using ODBC driver for Parquet to PostgreSQL Import, such as:

  • Limited Functionality of ODBC: Both aggregate and non-aggregate functions cannot be used together as an argument for a single SQL statement. Also, it supports scalar functions in canonical form only 
  • Performance: ODBC driver adds an extra layer of translation, leading to slower import speed as compared to Parquet or Postgres.

Parquet to PostgreSQL Integration: Using Spark Postgres Library

Spark-Postgres is intended for dependable and performant ETL in big-data workloads, and it includes read/write/scd capabilities to better connect Spark and Postgres. Spark SQL supports both reading and writing Parquet files, preserving the schema of the original data automatically.

It enables you to load multiple data files of parquet to PostgreSQL in a single spark command:

  •  spark 
  • .read.format(“parquet”)// specifies the read file format is parquet
  •  .load(parquetFilesPath)   // reads the parquet files
  •  .write.format(“postgres”) //specifies the write file format is parquet
  • .option(“host”,”yourHost”)// specifies the host
  • .option(“partitions”, 4) // 4 threads 
  • .option(“table”,”theTable”)//specifies the table
  •  .option(“user”,”theUser”) //specifies the user
  • .option(“database”,”thePgDatabase”)//specifies the database
  •  .option(“schema”,”thePgSchema”)// specifies the schema
  • .loada             // bulk load into postgres
  • Step 1: Parquet is fetched through a query into a Spark data frame, this fetches the query with 4 threads and produces a spark data frame. It produces 5 * 4 = 20 files to be read one by one (multiline).
  • Step 2: We can use an optional JDBC URL, a JDBC URL identifies a database so that the appropriate driver can recognize it and connect to it. It copies the spark data frame into Postgres with 4 threads. Also, it disables the indexes and re-indexes the table afterward.
parquet to postgres: code for spark
 Image Source
parquet to postgres: code for spark 2
 Image Source
parquet to postgres: code for spark 3
 Image Source

With this simple step-by-step guide you can easily load Parquet to Postgres.

Limitations of Using Spark Postgres Library

Here are some limitations of using Spark Postgres library for Parquet to Postgres import:

  • It can be resource-intensive as Spark consumes high memory, and you may be required to invest in expensive RAM storage. 
  • Spark has limited real-time processing capabilities, which can prove to be inefficient. 
  • It cannot deal with large volumes of files as its processing capabilities become slow.

Use Cases to Transfer Your Parquet Data to Postgres

Migrating data from Parquet to PostgreSQL offers various advantages. Here are a few uses:

  1. Advanced Analytics: Postgres’ robust data processing capabilities allow you to do complicated queries and data analysis on your Parquet data, yielding insights that would not be available with Parquet alone.
  2. Data Consolidation: If you’re using numerous sources in addition to Parquet, synchronizing to Postgres allows you to centralize your data for a more complete picture of your operations and build up a change data collection process to ensure that there are no data discrepancies again.
  3. Historical Data Analysis: Parquet has limitations with historical data. Syncing data to Postgres enables long-term data retention and the study of historical patterns over time.
  4. Data Security and Compliance: Postgres includes advanced data security capabilities. Syncing Parquet data to Postgres secures your data and enables enhanced data governance and compliance monitoring.
  5. Scalability: Postgres can manage massive amounts of data without compromising performance, making it a great alternative for growing enterprises with expanding Parquet data.
  6. Data Science and Machine Learning: By storing Parquet data in Postgres, you may use machine learning models to perform predictive analytics, consumer segmentation, and other tasks.

Conclusion

This blog talks about the steps to follow to import data from Parquet to PostgreSQL in an easy way. It also gives us a basic understanding of Parquet and PostgreSQL.

Extracting complex data from a diverse set of data sources to carry out an insightful analysis can be challenging, and this is where Hevo saves the day! Hevo offers a faster way to move data from 150+ Data Sources including Databases or SaaS applications into a destination of your choice or a Data Warehouse to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.

Some more useful resources related to this source and destination:

  1. Best PostgreSQL ETL tools
  2. Loop in PostgreSQL
  3. Redshift Parquet Integration
VISIT OUR WEBSITE TO EXPLORE HEVO

Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations, with a few clicks. Hevo Data with its strong integration with 150+ data sources (including 40+ free sources) allows you to not only export data from your desired data sources & load it to the destination of your choice, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Harshitha Balasankula
Former Marketing Content Analyst, Hevo Data

Harshita is a data analysis enthusiast with a keen interest for data, software architecture, and writing technical content. Her passion towards contributing to the field drives her in creating in-depth articles on diverse topics related to the data industry.

No-code Data Pipeline for PostgreSQL