SFTP/FTP to BigQuery: 2 Easy Methods

on Data Integration • February 11th, 2020 • Write for Hevo

Many businesses generate data and store it in the form of a file. However, the data stored in these files can not be used as it is for analysis. Given data is now the new oil, businesses need a way to move data into a database or data warehouse so that they can leverage the power of a SQL-like language to answer their key questions in a matter of seconds. This article talks about loading the data stored in files on FTP to BigQuery Data Warehouse. 

Table of Contents

Introduction to FTP

FTP stands for File Transfer Protocol, which is the standard protocol used to transfer files from one machine to another machine over the internet. When downloading an mp3 from the browser or watching movies online, have you encountered a situation where you are provided with an option to download the file from a specific server? This is FTP in action. 

FTP is based on a client-server architecture and uses two communication channels to operate: 

  1. A command channel that contains the details of the request
  2. A data channel that transmits the actual file between the devices

Using FTP, a client can upload, download, delete, rename, move and copy files on a server. For example, businesses like Adobe offer their software downloads via FTP.

Introduction to Google BigQuery

Bigquery is a NoOps (No operations) data warehouse as a service provided by Google to their customers to process over petabytes of data in seconds using SQL as a programming language. BigQuery is a cost-effective, fully managed, serverless, and highly available service.

Since Bigquery is fully managed, it takes the burden of implementation and management off the user, making it super easy for them to focus on deriving insights from their data. 

You can read more about the features of BigQuery here.

Moving Data from FTP Server To Google BigQuery

There are two ways of moving data from FTP Server to BigQuery:

Method 1: Using Custom ETL Scripts to Move Data from FTP to BigQuery

To be able to achieve this, you would need to understand how the interfaces of both FTP and BigQuery work, hand-code custom scripts to extract, transform and load data from FTP to BigQuery. This would need you to deploy tech resources.

Method 2: Using Hevo Data to Move Data from FTP to BigQuery

The same can be achieved using a no-code data integration product like Hevo Data. Hevo is fully managed and can load data in real-time from FTP to BigQuery. This will allow you to stop worrying about data and focus only on deriving insights from it.

Get Started with Hevo for Free

This blog covers both approaches in detail. It also highlights the pros and cons of both approaches so that you can decide on the one that suits your use case best.

Methods to Move Data from FTP to BigQuery

These are the methods you can use to move data from FTP to BigQuery in a seamless fashion:

Method 1: Using Custom ETL Scripts to Move Data from FTP to BigQuery

The steps involved in loading data from FTP Server to BigQuery using Custom ETL Scripts are as follows:

Step 1: Connect to BigQuery Compute Engine

  1. Download the WINSCP tool for your device.
  2. Open WinSCP application to connect to the Compute Engine instance.
  3. In the session, the section select ‘FTP’ as a file protocol. 
  4. Paste external IP in Host Name. 
  5. Use key-comment as a user name. Lastly, click on the login option.

Step 2: Copy Files from Your FTP Server

On successful login, copy the file to VM.

Step 3: Load Data into BigQuery using BQ Load Utility

(In this article we are loading a “.CSV” file)

1. SSH into your compute engine VM instance, go to the directory in which you have copied the file.

2. Execute the below command

bq load --autodetect --source_format=CSV test.mytable testfile.csv

For more bq options please read the bq load CLI command google documentation.

3. Now verify the data load by selecting data from the “test.mytable” table by opening the BigQuery UI.

Thus we have successfully loaded data in the BigQuery table using FTP.

Limitations of Using Custom ETL Scripts to Move Data from FTP to BigQuery

Here are the limitations of using Custom ETL Scripts to move data from FTP to BigQuery:

  • The entire process would have to be set up manually. Additionally, once the infrastructure is up, you would need to provide engineering resources to monitor FTP server failure, load failure, and more so that accurate data is available in BigQuery. 
  • This method works only for a one-time load. If your use case is to do a change data capture, this approach will fail.
  • For loading data in UPSERT mode will need to write extra lines of code to achieve this functionality.
  • If the file contains any special character or unexpected character data load will fail.
  • Currently, bq load supports only a single character delimiter, if we have a requirement of loading multiple characters delimited files, this process will not work.
  • Since in this process, we are using multiple applications, so in case of any process, abortion backtracking will become difficult.

Method 2: Using Hevo Data to Move Data from FTP to BigQuery

Hevo Logo
Image Source

A much more efficient and elegant way would be to use a ready platform like Hevo (14-day free trial) to load data from FTP (and a bunch of other data sources) into BigQuery. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Sign up here for a 14-Day Free Trial!

Hevo takes care of all your data preprocessing to set up migration from FTP Data to BigQuery and lets you focus on key business activities and draw a much powerful insight on how to generate more leads, retain customers, and take your business to new heights of profitability. It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination. 

Hevo can help you bring data from FTP to BigQuery in two simple steps:

Configure Source: Connect Hevo Data with SFTP/FTP by providing a unique name for your Pipeline, Type, Host, Port, Username, File Format, Path Prefix, Password.

FTP to BigQuery: Source Configuration
Image Source

Configure Destination: Connect to your BigQuery account and start moving your data from FTP to BigQuery by providing the project ID, dataset ID, Data Warehouse name, GCS bucket.

FTP to BigQuery Destination Configuration
Image Source

Step 2: Authenticate and point to the BigQuery Table where the data needs to be loaded.

That is all. Hevo will ensure that your FTP data is loaded to BigQuery in real-time without any hassles. Here are some of the advantages of using Hevo
:

  1. Easy Setup and Implementation – Your data integration project can take off in just a few mins with Hevo. 
  2. Complete Monitoring and Management – In case the FTP server or BigQuery data warehouse is not reachable, Hevo will re-attempt data loads in a set instance ensuring that you always have accurate data in your data warehouse.
  3. Transformations – Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
  4. Connectors – Hevo supports 100+ integrations to SaaS platforms, files, databases, analytics, and BI tools. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, MongoDB, TokuDB, DynamoDB, PostgreSQL databases to name a few. 
  5. Change Data Capture – Hevo can automatically detect new files on the FTP location and load them to BigQuery without any manual intervention
  6. 100’s of additional Data Sources – In addition to FTP, Hevo can bring data from 100’s other data sources into BigQuery in real-time. This will ensure that Hevo is the perfect companion for your businesses’ growing data integration needs
  7. 24×7 Support – Hevo has a dedicated support team available at all points to swiftly resolve any queries and unblock your data integration project.

Conclusion

This blog talks about the two methods you can implement to move data from FTP to BigQuery in a seamless fashion.

Extracting complex data from a diverse set of data sources can be a challenging task and this is where Hevo saves the day! 

Visit our Website to Explore Hevo

Hevo offers a faster way to move data from Databases or SaaS applications like FTP into your Data Warehouse like Google BigQuery to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code. 

Sign Up for a 14-day free trial to try Hevo for free. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

No-code Data Pipeline for BigQuery