SFTP/FTP to BigQuery: How to Transfer Data Seamlessly

Data Integration • February 11th, 2020

Many businesses generate data and store it in files. However, data stored in files cannot be used for analysis as it is. Because data is often described as the new oil, businesses need a way to move it into a database or data warehouse, where a SQL-like language can answer their key questions quickly. This article explains how to load data stored in files on an FTP server into the BigQuery data warehouse.

Introduction to FTP

FTP, short for File Transfer Protocol, is a standard protocol for transferring files from one machine to another over a network such as the internet. If a download page has ever offered you the option to fetch a file from a specific server, for example when downloading an mp3 from the browser, you may have seen FTP in action.

FTP is based on a client-server architecture and uses two communication channels to operate: 

  1. A command channel that contains the details of the request
  2. A data channel that transmits the actual file between the devices

Using FTP, a client can upload, download, delete, rename, move and copy files on a server. For example, businesses like Adobe offer their software downloads via FTP.
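As a rough illustration of these operations, the sketch below uses Python's standard ftplib module. The function name fetch_then_archive and the file names are hypothetical, and the ftp argument is assumed to be an already connected, logged-in ftplib.FTP object. It shows one download (which uses both channels) followed by one rename (which needs only the command channel).

```python
import io
from ftplib import FTP


def fetch_then_archive(ftp: FTP, remote_name: str, archived_name: str) -> bytes:
    """Download a file into memory, then rename it on the server.

    `ftp` is assumed to be a connected, logged-in ftplib.FTP object
    (or anything exposing the same two methods).
    """
    buffer = io.BytesIO()
    # The RETR request goes out on the command channel; the file bytes
    # arrive on a separate data channel and are handed to the callback.
    ftp.retrbinary(f"RETR {remote_name}", buffer.write)
    # Renaming transfers no file content, so only the command channel is used.
    ftp.rename(remote_name, archived_name)
    return buffer.getvalue()


# Hypothetical usage; the host, credentials and file names are placeholders.
# with FTP("ftp.example.com") as ftp:
#     ftp.login("user", "password")
#     data = fetch_then_archive(ftp, "sales.csv", "sales.loaded.csv")
```

Passing the connection in as an argument keeps the function easy to try against a stand-in object before pointing it at a real server.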

Introduction to Google BigQuery

BigQuery is a NoOps (no operations) data warehouse as a service from Google that lets customers process petabytes of data quickly using SQL. BigQuery is a cost-effective, fully managed, serverless and highly available service.

Since BigQuery is fully managed, it takes the burden of implementation and management off users, making it easier for them to focus on deriving insights from their data.

You can read more about the features of BigQuery here.

Moving Data from FTP Server to Google BigQuery

There are two ways of moving data from FTP Server to BigQuery:

Method 1: Building Custom Code to Move Data

To achieve this, you would need to understand how the interfaces of both FTP and BigQuery work, and hand-code custom scripts to extract, transform and load data from FTP to BigQuery. This requires you to commit engineering resources.

Method 2: Using a Ready-Made Data Integration Platform That Works Out of the Box

The same can be achieved using a no-code data integration product like Hevo Data. Hevo is fully managed and can load data in real-time from FTP to BigQuery. This will allow you to stop worrying about data and focus only on deriving insights from it.

This blog covers both approaches in detail. It also highlights the pros and cons of both approaches so that you can decide on the one that suits your use case best.

FTP Server to BigQuery: ETL Using Custom Code

The steps involved in loading data from FTP Server to BigQuery are as follows:

  1. Connect to a Google Compute Engine VM instance using WinSCP
  2. Copy files from your FTP server to the Compute Engine VM instance
  3. Load data into BigQuery using the bq load command

Step 1: Establish a connection between your machine and the Compute Engine VM instance using the FTP protocol.

  1. Download the WinSCP tool for your device.
  2. Open the WinSCP application to connect to the Compute Engine instance.
  3. In the Session section, select ‘FTP’ as the file protocol.
  4. Paste the VM’s external IP address in Host Name.
  5. Use the key comment as the user name, then click Login.

Step 2: Copy files from your machine to the Compute Engine VM instance.

On successful login, copy the file to the VM.
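If you would rather script the copy than use the WinSCP interface, a minimal sketch along the following lines could run on the VM and pull files directly from the FTP server. This is a swapped-in alternative to the manual step, not part of the WinSCP workflow. It uses Python's standard ftplib module; the function name download_csv_files is hypothetical, and the ftp argument is assumed to be a connected, logged-in ftplib.FTP object already positioned in the directory that holds the files.

```python
import os
from ftplib import FTP


def download_csv_files(ftp: FTP, local_dir: str) -> list:
    """Copy every .csv file in the server's current directory to local_dir.

    Returns the local paths that were written, in listing order.
    """
    os.makedirs(local_dir, exist_ok=True)
    saved = []
    for name in ftp.nlst():  # names in the server's current directory
        if not name.lower().endswith(".csv"):
            continue  # skip anything that is not a CSV file
        local_path = os.path.join(local_dir, os.path.basename(name))
        with open(local_path, "wb") as out:
            # Stream the remote file to disk block by block.
            ftp.retrbinary(f"RETR {name}", out.write)
        saved.append(local_path)
    return saved


# Hypothetical usage; the host, credentials and directory are placeholders.
# with FTP("ftp.example.com") as ftp:
#     ftp.login("user", "password")
#     ftp.cwd("/exports")
#     print(download_csv_files(ftp, "incoming"))
```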

Step 3: Load the file into the BigQuery table using the bq load command.

(In this article, we load a “.CSV” file.)

1. SSH into your Compute Engine VM instance and go to the directory to which you copied the file.

2. Execute the command below:

bq load --autodetect --source_format=CSV test.mytable testfile.csv

For more bq options, see Google’s documentation for the bq load command.
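When the load has to be repeated for several files, it can help to build the command in a script. The sketch below is one way to do that in Python, assuming the bq tool is installed and authenticated on the VM. The helper names build_bq_load_command and run_bq_load are hypothetical; with default arguments the builder reproduces the command shown above, and the optional arguments map to the bq load flags --skip_leading_rows and --field_delimiter.

```python
import subprocess


def build_bq_load_command(table, source_file, skip_header=False, delimiter=","):
    """Return the argument list for a `bq load` call on a CSV file."""
    cmd = ["bq", "load", "--autodetect", "--source_format=CSV"]
    if skip_header:
        # Tell bq to ignore the first line of the file.
        cmd.append("--skip_leading_rows=1")
    if delimiter != ",":
        # bq accepts a single-character field delimiter.
        cmd.append(f"--field_delimiter={delimiter}")
    cmd += [table, source_file]
    return cmd


def run_bq_load(table, source_file, **options):
    """Run the load; raises CalledProcessError if bq exits with an error."""
    subprocess.run(build_bq_load_command(table, source_file, **options), check=True)


# Hypothetical usage with the table and file from this article:
# run_bq_load("test.mytable", "testfile.csv")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.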

3. Verify the data load by opening the BigQuery UI and selecting data from the “test.mytable” table.
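The check can also be done from the VM by comparing row counts. The following sketch is an assumption about how you might do this rather than a required step: it counts the data rows in the local CSV file and builds a bq query command that counts the rows in the table, so the two numbers can be compared. The helper names are hypothetical, and blank lines in the file would be counted as rows.

```python
import csv


def count_csv_data_rows(path, has_header=True):
    """Count the rows in a local CSV file, excluding the header if present."""
    with open(path, newline="") as f:
        rows = sum(1 for _ in csv.reader(f))
    return max(rows - 1, 0) if has_header else rows


def build_count_query_command(table):
    """Return a `bq query` argument list that counts the rows in `table`."""
    sql = f"SELECT COUNT(*) AS row_count FROM `{table}`"
    return ["bq", "query", "--use_legacy_sql=false", sql]


# Hypothetical usage with the table and file from this article:
# import subprocess
# print(count_csv_data_rows("testfile.csv"))
# subprocess.run(build_count_query_command("test.mytable"), check=True)
```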

Thus, we have successfully loaded data into a BigQuery table using FTP.

FTP to BigQuery: Limitations of Loading Data using Custom Code

  1. The entire process has to be set up manually. Additionally, once the infrastructure is up, you need to provision engineering resources to monitor FTP server failures, load failures and more, so that accurate data is available in BigQuery.
  2. This method works only for a one-time load. If your use case requires change data capture, this approach will not work as it stands.
  3. Loading data in UPSERT mode requires writing additional code.
  4. If the file contains special or unexpected characters, the data load may fail.
  5. Currently, bq load supports only a single-character delimiter. If you need to load files delimited by multiple characters, this process will not work.
  6. Since this process uses multiple applications, tracing a failure back to its cause becomes difficult if any step aborts.
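To give a sense of the extra code that limitations 2 and 5 imply, here is a minimal Python sketch of two helpers; both names are hypothetical. The first picks out files that have not been loaded yet, which is a simple file-level stand-in for change data capture and assumes you keep your own record of loaded file names. The second rewrites a file that uses a multi-character delimiter (assumed here to be "||") as standard comma-separated CSV, and assumes the delimiter never occurs inside a field value.

```python
import csv


def files_not_yet_loaded(remote_names, loaded_names):
    """Return the remote file names that have not been loaded, keeping order."""
    seen = set(loaded_names)
    return [name for name in remote_names if name not in seen]


def convert_delimiter(src_path, dst_path, delimiter="||"):
    """Rewrite a multi-character-delimited file as standard CSV.

    Returns the number of lines converted.
    """
    count = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)  # quotes fields that contain commas or quotes
        for line in src:
            writer.writerow(line.rstrip("\r\n").split(delimiter))
            count += 1
    return count


# Hypothetical usage; file names are placeholders.
# todo = files_not_yet_loaded(["a.csv", "b.csv"], ["a.csv"])
# convert_delimiter("export.txt", "export.csv")
```

The converted file can then be loaded with bq load using the default comma delimiter. Monitoring, retries and UPSERT handling would still need to be written separately.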

A much more efficient and elegant way is to use a ready-made platform like Hevo (14-day free trial) to load data from FTP (and many other data sources) into BigQuery.

Hevo can help you bring data from FTP to BigQuery in two simple steps:

Step 1: Connect to your FTP location by providing the necessary connection settings.

Step 2: Authenticate and point to the BigQuery Table where the data needs to be loaded.

That is all. Hevo will ensure that your FTP data is loaded to BigQuery in real-time without any hassles. Here are some of the advantages of using Hevo:

  1. Easy Setup and Implementation – Your data integration project can take off in just a few minutes with Hevo.
  2. Complete Monitoring and Management – In case the FTP server or BigQuery data warehouse is not reachable, Hevo will re-attempt data loads at set intervals, ensuring that you always have accurate data in your data warehouse.
  3. Change Data Capture – Hevo can automatically detect new files on the FTP location and load them to BigQuery without any manual intervention.
  4. Hundreds of Additional Data Sources – In addition to FTP, Hevo can bring data from hundreds of other data sources into BigQuery in real-time. This makes Hevo a good companion for your business’s growing data integration needs.
  5. 24×7 Support – Hevo has a dedicated support team available at all times to swiftly resolve any queries and unblock your data integration project.

Before spending days of development effort loading data from FTP to BigQuery manually, give Hevo a shot by exploring the 14-day free trial here.
