PostgreSQL, although primarily an OLTP database, is also a popular tool for analyzing data at scale. Its architecture, reliability at scale, robust feature set, and extensibility give it an advantage over many other databases.
However, as data volume and velocity grow, considerable effort and expertise are needed to tune PostgreSQL so that analytics queries finish within the desired time. Heavy analytical queries can also affect your transactional workloads, directly or indirectly. This is why most organizations prefer a separate system for analytical workloads. If that system lives in the cloud, additional benefits such as elastic scaling and minimal upfront cost come into play.
There are a number of well-known cloud-based solutions designed specifically for analytical workloads. BigQuery, offered by Google Cloud Platform, is a leading contender in this domain. It is well worth considering as a critical part of your next analytical data pipeline: it differs from other cloud-based solutions in many respects and is regarded as a truly serverless, high-performing data warehouse.
Benefits of BigQuery
Here are some points to consider when comparing BigQuery with alternative solutions such as Redshift. A complete comparison between Redshift and BigQuery is also available.
- BigQuery automatically scales both up and down based on data volume and query complexity. Cluster sizing and scaling are handled entirely by BigQuery, which is what makes it truly serverless.
- There is no cost for infrastructure. You pay for storing data ($0.020 per GB per month) and for querying data ($5 per TB of data scanned). BigQuery stores data in a columnar format, so the volume of data scanned by a query is usually much smaller than the total table size. Refer to their website for current pricing details.
- Google BigQuery supports streaming data into the destination table without impacting query performance.
- All data in BigQuery is encrypted by default; no user configuration is required.
- BigQuery supports semi-structured data such as JSON. You do not have to convert it to relational form first.
- The Google Cloud team has added ML modules on top of BigQuery. These ML models can be created and trained using basic SQL.
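Using the on-demand prices quoted above, estimating a monthly bill is simple arithmetic. The sketch below is illustrative only; actual BigQuery pricing varies by region and changes over time, so check the official pricing page before relying on these numbers:

```python
# Illustrative cost estimate using the on-demand prices quoted above.
# These constants are assumptions taken from this article, not live pricing.

STORAGE_PRICE_PER_GB = 0.020   # $ per GB stored per month
QUERY_PRICE_PER_TB = 5.0       # $ per TB of data scanned by queries

def monthly_cost(storage_gb, scanned_tb):
    """Return the estimated monthly bill in dollars."""
    return storage_gb * STORAGE_PRICE_PER_GB + scanned_tb * QUERY_PRICE_PER_TB

# e.g. 500 GB stored and 10 TB scanned by queries in a month:
print(monthly_cost(500, 10))  # 60.0
```

Because of the columnar storage noted above, the scanned volume is typically a fraction of total table size, which is what keeps the query side of this bill low.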
PostgreSQL to BigQuery Replication
Broadly, there are two ways to perform Postgres to BigQuery replication:
Method 1: Use a ready-to-use cloud ETL tool like the Hevo Data Integration Platform (7-day free trial), which can easily bring in data not just from PostgreSQL but from many other custom sources as well.
Method 2: Write custom ETL code to move data from PostgreSQL to BigQuery.
For the scope of this blog, we will focus on Method 2 and detail the steps and challenges. Towards the end, we will also briefly compare the two methods so that you have the right details to make a choice.
Method to Perform PostgreSQL to BigQuery Migration Using Custom Code:
The COPY command is used to move data between PostgreSQL tables and standard file-system files. It can be used as a normal SQL statement, or inside SQL functions or PL/pgSQL procedures, which gives a lot of flexibility to extract data either as a full dump or incrementally.
At a high level, these are the steps to migrate Postgres to BigQuery:
- Extract data from PostgreSQL using the COPY TO command
- Optionally Clean and Transform data
- Upload to Google Cloud Storage(GCS)
- Upload to the BigQuery table from GCS using the bq command-line tool, the console, or any cloud SDK
Let’s take a detailed look at each step.
Step 1: Extract data From PostgreSQL
The COPY TO command is the well-accepted method to extract data from a PostgreSQL table. You can also consider the pg_dump utility for the initial full data extraction. This blog post covers both methods.
a. Data Extraction – COPY command
The COPY command is the most efficient way to move data between PostgreSQL tables and the standard file system. COPY TO copies the contents of a table, or the results of a SELECT query, to a file. The general form of the command is:
COPY table_name_or_query TO file_name WITH options;
COPY persons TO 'C:\tmp\persons_db.csv' WITH DELIMITER ',' CSV HEADER;
COPY (SELECT * FROM persons WHERE age > 25) TO 'C:\tmp\persons_db.csv' WITH DELIMITER ',' CSV HEADER;
Some frequently used options:
- FORMAT – Data format to be read or written: text, csv, or binary. The default is text.
- DELIMITER – The character that separates columns within each row of the file.
- NULL – The string that represents a null value. By default, it is an unquoted empty string in CSV format and \N (backslash-N) in text format.
- HEADER – Whether the first line of the output file contains the column names from the table.
- QUOTE – The quoting character to be used when a data value is quoted. A double quote is used by default.
- ESCAPE – The character that should appear before a data character that matches the QUOTE value.
- ENCODING – The encoding of the output file. If this option is not specified, the current client encoding is used.
Now, let's look at how the COPY command can be used to extract data from multiple tables with a PL/pgSQL procedure. The table list_of_tables contains the details of the tables to be exported.
CREATE OR REPLACE FUNCTION db_to_csv(path TEXT) RETURNS void AS $$
DECLARE
   tables RECORD;
   statement TEXT;
BEGIN
   FOR tables IN
      SELECT (table_schema || '.' || table_name) AS schema_table
      FROM list_of_tables
   LOOP
      statement := 'COPY ' || tables.schema_table || ' TO ''' || path || '/' || tables.schema_table || '.csv'' DELIMITER '';'' CSV HEADER';
      EXECUTE statement;
   END LOOP;
   RETURN;
END;
$$ LANGUAGE plpgsql;

SELECT db_to_csv('/home/user/dir'); -- This will create one CSV file per table, in /home/user/dir/
If you want to extract data incrementally, add more metadata like the timestamp of last data extraction to table list_of_tables and use that information while creating copy command to extract data changed after that timestamp.
For example, say you have a column last_success_pull_time for each table in the table list_of_tables, storing the time of the last successful data pull. On each run, only the rows modified after that timestamp have to be pulled. The body of the loop in the procedure then changes as shown below.
Here, dynamic SQL is built with a predicate that compares last_modified_time_stamp from the table being extracted against last_success_pull_time from list_of_tables.
BEGIN
   FOR tables IN
      SELECT (table_schema || '.' || table_name) AS schema_table,
             last_success_pull_time AS lt
      FROM list_of_tables
   LOOP
      statement := 'COPY (SELECT * FROM ' || tables.schema_table
                || ' WHERE last_modified_time_stamp > ''' || tables.lt || ''') TO '''
                || path || '/' || tables.schema_table || '.csv'' DELIMITER '';'' CSV HEADER';
      EXECUTE statement;
   END LOOP;
   RETURN;
END;
b. Extract data using pg_dump
pg_dump is a utility for backing up a PostgreSQL database. It can also be used to extract data from individual tables.
pg_dump --column-inserts --data-only --table=<table> <database> > my_dump.sql
Here, the output file my_dump.sql will contain insert statements of the form:
INSERT INTO table_name (column1, column2, column3, ...) VALUES (value1, value2, value3, ...);
This output has to be converted into a CSV file. You have to write a small script to perform this.
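A minimal sketch of such a conversion script is shown below. It assumes each statement produced by `pg_dump --column-inserts` fits on a single line in the form shown above, and that values contain no embedded commas or quotes; anything beyond that would need a real SQL parser. The table and column names here are illustrative:

```python
import csv
import io
import re

# Matches single-line INSERT statements produced by
# `pg_dump --column-inserts --data-only` (a simplifying assumption).
INSERT_RE = re.compile(r"INSERT INTO \S+ \([^)]*\) VALUES \((.*)\);")

def insert_values(line):
    """Return the list of values from one INSERT line, or None if it is not one."""
    m = INSERT_RE.match(line.strip())
    if not m:
        return None
    values = []
    # Naive split: assumes no commas or quotes inside the values themselves.
    for v in m.group(1).split(", "):
        v = v.strip()
        if v == "NULL":
            values.append("")                # represent SQL NULL as empty field
        elif v.startswith("'") and v.endswith("'"):
            values.append(v[1:-1])           # strip single quotes from strings
        else:
            values.append(v)                 # numbers and other literals as-is
    return values

def dump_to_csv(dump_lines, out_file):
    """Write every INSERT line in dump_lines as one CSV row to out_file."""
    writer = csv.writer(out_file)
    for line in dump_lines:
        row = insert_values(line)
        if row is not None:
            writer.writerow(row)

# Example:
buf = io.StringIO()
dump_to_csv(["INSERT INTO persons (id, name, age) VALUES (1, 'Alice', 30);"], buf)
print(buf.getvalue().strip())  # 1,Alice,30
```

Non-INSERT lines (comments, SET statements, and so on) are silently skipped, so the whole dump file can be streamed through `dump_to_csv` line by line.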
Step 2: Transforming Data
Apart from transformations to accommodate business logic, there are some basic things to keep in mind while preparing data for BigQuery.
- BigQuery expects CSV data to be UTF-8 encoded. If your data is encoded in ISO-8859-1 (Latin-1), specify the encoding explicitly while loading it to BigQuery.
- BigQuery does not enforce primary key or unique key constraints. Your ETL process has to take care of that.
- Column types differ slightly between Postgres and BigQuery. Most types have either an equivalent or a convertible type. The following table lists common data types.
|PostgreSQL Data Type|BigQuery Data Type|
|---|---|
|ENUM|No ENUM type; use any type that can represent the ENUM values (typically STRING)|
|TIME [WITHOUT TIME ZONE]|TIME|
|TIMESTAMP [WITHOUT TIME ZONE]|TIMESTAMP|
You can visit their official page to know more about BigQuery data types.
- DATE values must be dash (-) separated and in the form YYYY-MM-DD (year-month-day).
Fortunately, the default date format in Postgres is the same: YYYY-MM-DD. So if you are simply selecting date columns, they will already be in the correct format. If a date is stored as a string in the table for any reason, it can be converted while selecting the data.
The TO_DATE function in PostgreSQL helps in converting string values into dates.
Syntax: TO_DATE(str, format)
Example: SELECT TO_DATE('31,12,1999', 'DD,MM,YYYY');
Result: 1999-12-31
- In TIMESTAMP type, the hh:mm:ss (hour-minute-second) portion must use a colon (:) separator.
Similar to DATE, the TO_TIMESTAMP function in PostgreSQL is used to convert strings into timestamps.
Syntax: TO_TIMESTAMP(str, format)
Example: SELECT TO_TIMESTAMP('2017-03-31 9:30:20', 'YYYY-MM-DD HH:MI:SS');
Result: 2017-03-31 09:30:20-07
- Make sure text columns are quoted if they can potentially contain delimiter characters.
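The encoding and quoting requirements above can be handled with Python's csv module, which quotes any field containing the delimiter, a quote character, or a newline. This is a minimal sketch that re-writes a Latin-1 file as UTF-8 with safe quoting; the function name and file paths are placeholders:

```python
import csv

def reencode_csv(src_path, dst_path):
    """Re-write a Latin-1 CSV file as UTF-8, quoting any field that
    contains the delimiter, quotes, or newlines (csv.QUOTE_MINIMAL)."""
    with open(src_path, newline="", encoding="iso-8859-1") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)
        for row in csv.reader(src):
            writer.writerow(row)
```

With the output already UTF-8, there is no need to pass an explicit encoding flag at load time.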
Step 3: Upload to Google Cloud Storage(GCS)
There are multiple ways in which you can achieve this:
gsutil is the standard tool from GCP for handling objects and buckets in GCS. It has options to upload a file from your local machine to GCS.
To copy a file to GCS :
gsutil cp table_data.csv gs://my-bucket/path/to/folder/
To copy an entire folder :
gsutil cp -r dir gs://my-bucket/path/to/parent/
If the files to be copied are already in S3, the same command can be used to transfer them to GCS. Note that your AWS credentials must be present in the default path for this to work.
gsutil cp -R s3://bucketname/source/path gs://bucketname/destination/path
Storage Transfer Service:
Storage Transfer Service from Google Cloud Platform is another alternative for getting files into GCS from S3 or other online data sources such as HTTP/HTTPS locations, provided the destination (sink) is a Cloud Storage bucket. A GCS bucket can also be the source.
When it comes to moving data to GCS, this service is extremely useful. Here are the top reasons to use Storage Transfer Service:
- Using Storage Transfer service, you can schedule both repeating and one-time transfers.
- If no corresponding source object is present, the service can delete existing objects in the destination. This is very helpful for keeping source and destination in sync.
- After transferring data, you can delete the source object.
- The service supports source and sink synchronization with advanced filters like file creation dates, file-name, etc.
Upload Data from Web Console:
If you are uploading data from your local machine, you can use the web console UI to upload files to GCS. The steps are listed below:
1. First, log in to your GCP account. In the left sidebar, click Storage and go to Browser.
2. Select the GCS bucket you want to upload the file to. Here the bucket is named test-data-hevo. Click on the bucket.
3. On the bucket details page, click the Upload files button and select the relevant files from your system.
4. Once the upload is completed, the file will be listed in the bucket.
Step 4: Upload to the BigQuery table from GCS
You can use the web console UI or the command-line tool bq to load data into a BigQuery table. First, let's look at how to do this using the web console:
- Go to BigQuery console from the left side panel.
- Create Dataset if not present already.
- Now click on the created dataset on the left side. A Create table option will appear below the Query editor.
- The Create table button opens a wizard with options to specify the input source and other specifications for the table.
There are options to specify the schema manually or auto-detect it, along with many more options. Feel free to choose them as per your need.
To interact with BigQuery from the command line, you can use bq. Using this tool, loading data into BigQuery is straightforward: use the bq load command and specify CSV as the source_format.
The syntax of bq load:
bq --location=[LOCATION] load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]
- [LOCATION] is an optional parameter representing your location.
- [FORMAT] is to be set to CSV.
- [DATASET] represents an existing dataset.
- [TABLE] is the table name into which you're loading data.
- [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI.
- [SCHEMA] is a valid schema, either a local JSON file or inline. Note that the --autodetect flag can be used instead of supplying a schema definition.
There are a bunch of options specific to CSV data load:
|CSV option|bq command flag|Description|
|---|---|---|
|Field delimiter|-F or --field_delimiter|Field delimiter: comma, tab, pipe, other|
|Header rows|--skip_leading_rows|Header rows to skip|
|Number of bad records allowed|--max_bad_records|Number of errors allowed|
|Newline characters|--allow_quoted_newlines|Allow quoted newlines|
|Custom null values|--null_marker|String that represents a null value in the CSV file|
|Trailing optional columns|--allow_jagged_rows|Allow jagged rows|
|Unknown values|--ignore_unknown_values|Ignore unknown values|
|Quote|--quote|The value used to quote data sections in the CSV file|
|Encoding|-E or --encoding|The character encoding of the data|
Refer to the BigQuery documentation to explore more about loading CSV data from Cloud Storage.
For Schema Specification using a JSON file:
bq --location=US load --source_format=CSV your_dataset.your_table gs://your_bucket/your_data.csv ./your_schema.json
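The schema file referenced above (`your_schema.json`, a placeholder name) is a JSON array of field definitions. A minimal example, with hypothetical field names, might look like this:

```json
[
  {"name": "id",         "type": "INTEGER",   "mode": "REQUIRED"},
  {"name": "name",       "type": "STRING",    "mode": "NULLABLE"},
  {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"}
]
```

Each entry specifies the column name, its BigQuery type, and whether it may be null.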
For schema auto-detection from the file itself:
bq --location=US load --autodetect --source_format=CSV your_dataset.your_table gs://mybucket/data.csv
BigQuery offers 3 options to write to an existing table:
- Write if empty
- Append to the table
- Overwrite table.
Additionally, while uploading data it is possible to add new fields to the table. Here is an example.
Overwriting existing table:
bq --location=US load --autodetect --replace --source_format=CSV your_dataset.your_table gs://mybucket/your_data.csv
Appending data to the existing table:
bq --location=US load --autodetect --noreplace --source_format=CSV your_dataset.your_table gs://mybucket/your_data.csv
Adding a new field to the table. Here, a new schema file with the extra field is supplied:
bq --location=asia-northeast1 load --noreplace --schema_update_option=ALLOW_FIELD_ADDITION --source_format=CSV your_dataset.your_table gs://mybucket/your_data.csv ./your_schema.json
Step 5: Update target table in BigQuery
Upload the newly extracted data into a staging table before upserting it into the final BigQuery table. This needs to be a full load of the extracted data; refer to the full data load section above for details. Let's call this staging table intermediate_delta_table. You can follow either of the two approaches below to load data into the target table:
a. Update the values of existing records in the final table, then insert new rows from the delta table that are not yet in the final table.
UPDATE data_set.final_target_table t SET t.value = s.value FROM data_set.intermediate_delta_table s WHERE t.id = s.id;
INSERT data_set.final_target_table (id, value) SELECT id, value FROM data_set.intermediate_delta_table WHERE id NOT IN (SELECT id FROM data_set.final_target_table);
b. Delete rows from the final table that are present in the delta table, then insert all rows from the delta table into the final table.
DELETE data_set.final_target_table f WHERE f.id IN (SELECT id from data_set.intermediate_delta_table);
INSERT data_set.final_target_table (id, value) SELECT id, value FROM data_set.intermediate_delta_table;
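Both variants above can also be expressed as a single atomic statement using BigQuery's MERGE, which avoids the window between the two statements. A sketch, assuming the same id and value columns as above:

```sql
MERGE data_set.final_target_table t
USING data_set.intermediate_delta_table s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET value = s.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (s.id, s.value);
```

Matched rows are updated in place and unmatched delta rows are inserted, all in one pass over the staging table.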
Limitations of PostgreSQL to BigQuery Migration Using Custom Code / ETL Scripts
In the modern era of data-driven business, getting value from data should be quick, with minimal friction. Data warehouses like BigQuery play an extremely important role in this. However, moving data into these cloud-based solutions takes multiple steps, such as extracting, cleaning/transforming, and uploading the data, and it requires significant engineering resources and time to ensure data correctness and consistency across the process.
Easier Way to move data from PostgreSQL to BigQuery
There is a faster way to achieve the same outcome and scale quickly. Hevo, a cloud data integration platform, can help you consolidate data from any data source into BigQuery in a matter of minutes. All of this can be achieved through a simple point-and-click interface, without writing any ETL scripts. Here are the steps to replicate PostgreSQL to BigQuery using Hevo:
- Connect to your PostgreSQL database.
- Select the replication mode: (i) load selected PostgreSQL tables, (ii) load data via a custom SQL query, or (iii) load data through Postgres logical replication.
- Configure the BigQuery destination where it should be copied.
Hevo offers a 7-day free trial for you to explore hassle-free data migration from PostgreSQL to BigQuery.
We recommend that you explore both methods to move data from PostgreSQL to BigQuery and make the right choice based on your requirements.