Amazon Redshift is a fully managed data warehouse service offered by AWS on a pay-as-you-go model. Using a fully managed service like Redshift relieves organizations of the effort involved in building and maintaining a highly available data warehouse. This flexibility has made Redshift a very popular option for organizations building their Extract, Transform, Load (ETL) pipelines.

Even though they are old protocols, FTP and SFTP are still widely used mechanisms for letting systems access large data files over the internet. Many use cases – legacy as well as modern – still involve FTP servers as the access point for data. This post walks through two methods for loading data from SFTP or FTP to Redshift, describing each in detail. It also covers the pros and cons of each method so that you can weigh both options and make an informed call based on your use case.

Introduction to Amazon Redshift

Redshift excels over other data warehouse services with its ability to return super-fast results for queries spanning millions of rows. This is made possible through a massively parallel processing architecture built on a cluster of nodes. One of these nodes is designated as the leader node and handles client connections, query planning, and assigning work to the compute nodes. Redshift offers two kinds of nodes – Dense Storage and Dense Compute. Customers can choose the Redshift cluster configuration based on their budget and use cases. A detailed note on Redshift’s underlying architecture can be found here.

What makes fully managed services special is their ability to scale seamlessly. Redshift is no exception and can scale by adding more nodes or upgrading the existing nodes. Redshift’s concurrency scaling allows scaling only during high-load periods, with predefined boundaries for cost and resource usage. Note that this feature is charged extra. Redshift Spectrum is another unique feature that enables customers to use the processing power of Redshift over data stored in S3.

Introduction to SFTP/FTP

SFTP stands for SSH File Transfer Protocol, often referred to as Secure File Transfer Protocol. Think of it as an encrypted, more secure version of standard FTP. Businesses and organizations primarily use SFTP because it is safer: data remains encrypted while in transit.

FTP stands for File Transfer Protocol – the most widely used protocol for downloading or uploading files from a server. FTP can be used for transferring files between different computers: a client machine (the user’s computer) connects to a remote host running an FTP server over an internet connection, allowing for seamless file transfer.

Both are network protocols that facilitate data transfer between client and server systems. The basic difference between the two is the added security layer in SFTP, which establishes an encrypted connection between the client and the server and authenticates the client (for example, with a username/password pair or an SSH key).

Loading data from SFTP or FTP to Redshift

There are two ways of moving data from SFTP/FTP to Redshift:

Method 1: Using Custom Code to Connect FTP to Redshift

This is a simple 4-step process that starts with downloading the data files from an FTP server to the local client machine. This method has considerable advantages but also a few limitations, which are covered later in the article.

Method 2: Using Hevo Data to Connect FTP to Redshift

If you are looking to instantly move data from FTP to Redshift, then this is a better and faster approach. Hevo can help you move data from SFTP/FTP to Redshift through a point-and-click interface, without having to write any code. Sign up here to try Hevo for free.

Understanding the Methods to Connect FTP to Redshift

There are two methods that you can leverage to establish a connection from FTP to Redshift in a seamless fashion. These methods are as follows:

Method 1: Using Custom Code to Connect FTP to Redshift

Redshift’s answer to almost everything related to loading data is its COPY command, which can interface with a number of data sources – DynamoDB, AWS S3, or even remote hosts that support SSH. The challenge here is that most FTP/SFTP servers do not have an SSH server running, so loading data into Redshift requires a separate set of steps: download the data first and stage it in S3. Downloading data from FTP can be done using the curl utility. The sections below detail the steps involved in loading data from FTP/SFTP:

Step 1: Downloading SFTP/FTP to a Local Server

Use the curl utility to download FTP/SFTP data to a local system.

curl -u user:password 'ftp://<ftp_url>/data/products.csv' -o ~/Downloads/products.csv

If the server supports SFTP, just change the protocol name in the URL as below.

curl -u user:password 'sftp://<ftp_url>/data/products.csv' -o ~/Downloads/products.csv
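
If the SFTP server uses key-based authentication rather than a password, curl can typically supply a private key instead. A minimal sketch, assuming a key at ~/.ssh/id_rsa (the key path and username are placeholders; depending on the curl build, a matching --pubkey option may also be required):

curl -u user: --key ~/.ssh/id_rsa 'sftp://<ftp_url>/data/products.csv' -o ~/Downloads/products.csv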

Step 2: Using AWS CLI to Load the File to S3

Once the file is in the local file system, use the AWS CLI to load the file to S3. Assuming AWS CLI is already installed and the credentials are configured in CLI, use the below command.

aws s3 cp products.csv s3://my-data-bucket/products/
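
To confirm the upload succeeded before moving on, you can list the target prefix:

aws s3 ls s3://my-data-bucket/products/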

Step 3: Connecting to Redshift

Connect to the Redshift instance using the below command:

psql -h mycluster.12345672922.us-west-2.redshift.amazonaws.com -U user -d dbname -p 5439

At the password prompt enter the password for the specific user.

Step 4: Loading Data from S3 to Redshift

Use the Redshift COPY command to load the data from S3 to Redshift.

copy products from 's3://my-data-bucket/products/products.csv'  credentials 'aws_access_key_id=<AWS-Access-Key-ID>;aws_secret_access_key=<AWS-Secret-Access-Key>' csv;

The COPY command needs the target table to be created beforehand. The example above also assumes that the data file has fields in the same order as the Redshift table. If that is not the case, you will need to specify an explicit column list in the COPY command (or, for JSON data, a JSONPaths file).
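
As an illustration, here is a minimal sketch assuming a hypothetical products table with sku, name, and price columns – the table definition and the column list are placeholders to adapt to your own file layout:

-- create the target table before running COPY
create table products (
    sku   varchar(50),
    name  varchar(200),
    price decimal(10,2)
);

-- the column list maps CSV fields to table columns explicitly
copy products (sku, name, price)
from 's3://my-data-bucket/products/products.csv'
credentials 'aws_access_key_id=<AWS-Access-Key-ID>;aws_secret_access_key=<AWS-Secret-Access-Key>'
csv;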

COPY command by default tries to map all the fields it sees through its implicit data type mapping.

A detailed account of how to effectively use the COPY command can be found here.

In the above example, we used an intermediate S3 bucket to stage the data before loading it into Redshift. Another option is to use the SSH parameter of Redshift’s COPY command to fetch the data directly from the local filesystem over a remote SSH connection, which removes the need to stage the file in S3. Instead of Steps 3 and 4, create a manifest file that describes the connection to the host where the file was downloaded in Step 1.

{    "entries": [       {"endpoint":"<ssh_endpoint_or_IP>",         "command": "cat /user/Downloads/products.csv",        "mandatory":true,         "publickey": "<host_public_key>",         "username": "<host_user_name>"}     ]  }

Please note that this manifest file needs to be uploaded to S3 for the COPY command to use it.

copy products from 's3://mybucket/ssh_manifest' credentials 'aws_access_key_id=<AWS-Access-Key-ID>;aws_secret_access_key=<AWS-Secret-Access-Key>' csv ssh;

If the FTP/SFTP server already has an SSH server running on it, pointing the manifest at that server lets the SSH mode of the COPY command pull the file directly, avoiding the separate download in Step 1 altogether.

So that was a very simplified explanation of how to copy data from an FTP or SFTP server to Redshift.

Limitations of using Custom Code to Connect FTP to Redshift

In reality, there are a number of challenges involved in connecting FTP to Redshift. The typical ones are listed below:

  • The above method of connecting FTP to Redshift can only be used for a first-time bulk load. The COPY command appends rows to the existing table by default, so additional logic is needed when incremental data loads are required (see the staging-table sketch after this list).
  • Redshift’s implicit data type mapping can silently corrupt data if the developer is not careful. To be specific, Redshift will in many cases implicitly convert column values to the target table’s data type without throwing an error. For example, a decimal value will be truncated to an integer if the target column is defined as an integer. You can read more about these implicit conversions and Redshift data types here.
  • We used two staging locations in the above approach – a local file system to hold the data downloaded from FTP and then an S3 bucket to load the data into Redshift. In reality, having access to a local system that can hold a large amount of data is not always practical.
  • In most cases, this process needs to run periodically and skip files that have already been processed. A custom script with progress awareness and restart support will need to be implemented to use this in production.
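
For the incremental-load limitation above, a common workaround is to COPY each new file into a staging table and then merge it into the target table. A minimal sketch, assuming the hypothetical products table from earlier with sku as the key and a placeholder file name:

-- stage the new file in a temporary table with the same structure as the target
create temp table products_staging (like products);

copy products_staging
from 's3://my-data-bucket/products/<new_file>.csv'
credentials 'aws_access_key_id=<AWS-Access-Key-ID>;aws_secret_access_key=<AWS-Secret-Access-Key>'
csv;

-- replace existing rows and insert the new ones in a single transaction
begin;
delete from products using products_staging where products.sku = products_staging.sku;
insert into products select * from products_staging;
end;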

A solution to all these challenges is to use an automated data pipeline platform like Hevo, which can accomplish this in a matter of a few clicks.

Hevo provides a point-and-click interface with support for most of the commonly used source and target database combinations.

Method 2: Using Hevo Data to Connect FTP to Redshift

Hevo is an automated data pipeline that provides an easy-to-use interface with the ability to copy data from FTP to Redshift without writing any code. Hevo enables the lowest time to production for such copy operations, allowing developers to focus on their core business logic rather than waste time on the configuration nightmares involved in setting these up.

Sign up here for a 14-day free trial!

Hevo can help you bring data from SFTP or FTP to Redshift in two simple steps:

  • Configure Source: Connect Hevo Data with SFTP/FTP by providing a unique name for your Pipeline along with the connection Type, Host, Port, Username, Password, File Format, and Path Prefix.
  • Integrate Data: Load data from FTP to Redshift by providing your Redshift database credentials such as Database Port, Username, Password, Name, Schema, and Cluster Identifier, along with the Destination Name.

That is all. Hevo will ensure that your SFTP/FTP data is loaded reliably into Redshift in real-time. 

Here are some of the advantages of using Hevo:

  • Easy Setup and Implementation –  With Hevo, your data projects can take off in a jiffy as it will only take a few minutes to set up and configure your first data pipeline. 
  • Change Data Capture – Hevo can automatically detect new files on the SFTP or FTP location and load them to Redshift without any manual intervention.
  • Transformations – Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
  • Connectors – Hevo supports 100+ integrations to SaaS platforms, files, databases, analytics, and BI tools. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, MongoDB, TokuDB, DynamoDB, PostgreSQL databases to name a few. 
  • Zero Maintenance Overhead – Hevo automatically takes care of handling all the errors that may occur, ridding you of any data pipeline maintenance tasks. 
  • Additional Data Sources – In addition to SFTP/FTP, Hevo can bring data from 100+ other data sources into Redshift in real-time. This will ensure that Hevo is the perfect companion for your businesses’ growing data integration needs.
  • Live Support – Hevo has a dedicated product support team available at all points to swiftly resolve any queries and ensure you always have up-to-date data in your warehouse. 

Conclusion

This blog talks about the two methods you can use to set up a connection from FTP to Redshift: using custom ETL scripts and with the help of a third-party tool, Hevo. It also gives a brief overview of FTP and Redshift highlighting their key features and benefits before diving into the setup process.

Extracting complex data from a diverse set of data sources can be a challenging task, and this is where Hevo saves the day! Hevo offers a faster way to move data from databases, SaaS applications, or file sources such as FTP into your Data Warehouse, such as Redshift or Snowflake, to be visualized in a BI tool. Hevo is fully automated and hence does not require you to write any code.

Before spending days of development effort loading data from SFTP/FTP to Redshift manually, do give Hevo a shot by signing up for a 14-day free trial here. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Talha
Software Developer, Hevo Data

Talha is a seasoned Software Developer, currently driving advancements in data integration at Hevo Data, where he has been instrumental in shaping a cutting-edge data integration platform for the past four years. With a significant tenure at Flipkart prior to his current role, he brought innovative solutions to the space of data connectivity and software development.
