Unlock the full potential of your SFTP/FTP data by integrating it seamlessly with Redshift. With Hevo’s automated pipeline, you can get data flowing effortlessly.
AWS Redshift is a fully managed data warehouse service offered by Amazon on a pay-as-you-go model. Using a fully managed service like Redshift relieves organizations of the effort involved in building and maintaining a high-availability data warehouse. This flexibility has made Redshift a very popular choice for organizations building their extract, transform, load (ETL) pipelines.
Even though FTP and SFTP are old protocols, they remain widely used mechanisms for letting systems access large data files over the internet, and many use cases, legacy and modern alike, still rely on FTP servers to provide access to data. This post covers the steps involved in loading data from SFTP or FTP to Redshift, describing two methods in detail: a manual approach using custom code and an automated approach using Hevo. It also discusses the pros and cons of each method so that you can weigh both options and make an informed call based on your use case.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It enables fast query performance using columnar storage technology and parallel processing to handle large volumes of data efficiently.
Key Features of Amazon Redshift
- Data Integration: Easily integrates with AWS services like S3, DynamoDB, and data lakes, as well as other ETL tools, facilitating data ingestion and management.
- Scalable Performance: Redshift scales to handle petabytes of data with high performance, using massively parallel processing (MPP) and columnar storage.
- Managed Service: It is a fully managed service that automates administrative tasks like backups, patching, and monitoring, reducing operational overhead.
- SQL-Based Queries: Redshift supports standard SQL queries and integrates with popular BI tools and SQL clients, making it accessible for users familiar with SQL.
- Cost-Effective Pricing: Offers pay-as-you-go pricing with options for reserved instances to optimize costs based on your usage needs.
Take your data to the next level by seamlessly migrating from SFTP/FTP to Amazon Redshift. Hevo’s automated platform makes the process effortless.
Join 2000+ happy customers like Whatfix and Thoughtspot, who’ve streamlined their data operations. See why Hevo is the #1 choice for building modern data stacks.
Get Started with Hevo for Free
What Are SFTP and FTP, and How Do They Differ?
SFTP (Secure File Transfer Protocol) and FTP (File Transfer Protocol) are both protocols used for transferring files between systems over a network, but they differ significantly in terms of security, use cases, and underlying technology.
Understanding FTP
FTP stands for File Transfer Protocol, a basic standard protocol for transferring files over the Internet. It was created in the 1970s to push and pull large files between systems.
- Port: FTP uses port 21 for its command channel.
- Data Channels: FTP uses two channels:
- Command Channel: used to send commands and receive responses.
- Data Channel: used to transfer the actual file data.
- Authentication: A username and password are required to authenticate, but the credentials are transmitted in plain text.
- Data Encryption: FTP on its own does not encrypt the data being transferred, which means both the data and the login credentials can be intercepted.
Understanding SFTP
SFTP (Secure File Transfer Protocol) is a secure version of FTP that operates over the SSH (Secure Shell) protocol. It was designed to provide a secure method of transferring files.
- Single Channel: Unlike FTP, which uses separate command and data channels, SFTP uses a single channel for commands and data, simplifying firewall configurations and enhancing security.
- Port: SFTP typically operates on port 22, the same as SSH.
- Data Encryption: SFTP encrypts the data and the commands using SSH, making it secure against eavesdropping and man-in-the-middle attacks.
- Authentication: It uses the same mechanisms as SSH, including username/password or more secure methods like public key authentication.
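To make the difference concrete, below is a minimal sketch of opening a session with each protocol from the command line. The host ftp.example.com and the user account are hypothetical placeholders.
# Plain FTP: connects over port 21; credentials and data travel in clear text
ftp ftp.example.com
# SFTP: runs over SSH on port 22; the entire session is encrypted
sftp -P 22 user@ftp.example.com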
What is the Simplest Way To Connect FTP to Redshift?
Want to connect SFTP/FTP with Redshift but don’t have the technical expertise to write long lines of code? Don’t worry, Hevo has you covered. Use Hevo’s automated platform to connect SFTP/FTP with Redshift by creating a pipeline in just 2 simple steps:
Step 1: Configure SFTP/FTP as your source.
Step 2: Configure Redshift as your destination.
Manual Method: Using Custom Code to Connect FTP to Redshift
By default, Redshift’s answer to everything related to loading data is its COPY command, which can interface with a number of data sources: DynamoDB, AWS S3, or even remote hosts that support SSH. The challenge in our problem is that most FTP/SFTP servers will not have an SSH server running, so loading the data into Redshift requires a separate set of steps to first stage the data in S3. Downloading data from FTP can be done using the curl utility. The sections below detail the steps involved in loading data from FTP/SFTP:
Step 1: Downloading SFTP/FTP to a Local Server
Use the curl utility to download the FTP/SFTP data to a local system.
curl -u user:password 'ftp://<ftp_url>/data/products.csv' -o ~/Downloads/products.csv
If the server supports SFTP, just change the protocol name in the URL as below.
curl -u user:password 'sftp://<ftp_url>/data/products.csv' -o ~/Downloads/products.csv
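Note that curl handles sftp:// URLs only when it is built with SSH support, so it is worth checking your build first:
# List the protocols your curl build supports; look for sftp in the output
curl -V | grep -i protocols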
Step 2: Using AWS CLI to Load the File to S3
Once the file is in the local file system, use the AWS CLI to load the file to S3. Assuming AWS CLI is already installed and the credentials are configured in CLI, use the below command.
aws s3 cp ~/Downloads/products.csv s3://my-data-bucket/products/
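You can optionally confirm that the upload succeeded by listing the bucket prefix:
# Verify that the file landed in the bucket
aws s3 ls s3://my-data-bucket/products/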
Step 3: Connecting to Redshift
Connect to the Redshift instance using the below command:
psql -h mycluster.12345672922.us-west-2.redshift.amazonaws.com -U user -d dbname -p 5439
At the password prompt, enter the password for that user.
Step 4: Loading Data from S3 to Redshift
Use the Redshift COPY command to load the data from S3 to Redshift.
copy products from 's3://my-data-bucket/products/products.csv' credentials 'aws_access_key_id=<AWS-Access-Key-ID>;aws_secret_access_key=<AWS-Secret-Access-Key>' csv;
The COPY command needs the target table to be created beforehand. The example above assumes that the data file has its fields in the same order as the Redshift table. If that is not the case, you will need to specify an explicit column list in the COPY command.
By default, COPY maps all the fields it sees through its implicit data type conversion.
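For instance, here is a minimal sketch of a table definition and a COPY with an explicit column list; the products schema below is a hypothetical placeholder:
-- Hypothetical target table; create it before running COPY
create table products (
    product_id integer,
    name varchar(255),
    price decimal(10,2)
);
-- If the CSV field order differs from the table, list the columns explicitly
copy products (product_id, name, price) from 's3://my-data-bucket/products/products.csv' credentials 'aws_access_key_id=<AWS-Access-Key-ID>;aws_secret_access_key=<AWS-Secret-Access-Key>' csv;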
In the above example, we used an intermediate S3 bucket to store the data before loading it into Redshift. Another option is to use the SSH configuration of Redshift’s COPY command to fetch the data from the remote host’s filesystem over an SSH connection. Instead of uploading the data file to S3 (Step 2) and copying it from S3 (Step 4), create a manifest file that configures the connection to the remote host where the file was downloaded in Step 1.
{
  "entries": [
    {
      "endpoint": "<ssh_endpoint_or_IP>",
      "command": "cat /user/Downloads/products.csv",
      "mandatory": true,
      "publickey": "<host_public_key>",
      "username": "<host_user_name>"
    }
  ]
}
Note that this manifest file itself must be uploaded to S3 for the COPY command to use it.
copy products from 's3://mybucket/ssh_manifest' credentials 'aws_access_key_id=<AWS-Access-Key-ID>;aws_secret_access_key=<AWS-Secret-Access-Key>' csv ssh;
If the FTP/SFTP server already has an SSH server running on it, using the ssh mode of the COPY command as above can avoid the separate download in Step 1 entirely.
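One caveat: for the ssh mode to work, the remote host must first trust the Redshift cluster’s SSH key. A minimal sketch of that setup, where the cluster identifier and paths are placeholders:
# Fetch the cluster's SSH public key (also shown in the Redshift console)
aws redshift describe-clusters --cluster-identifier mycluster --query 'Clusters[0].ClusterPublicKey'
# On the remote host, append that key to the SSH user's authorized_keys
echo "<cluster_public_key>" >> /home/<host_user_name>/.ssh/authorized_keys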
So that was a very simplified explanation of how to copy data from an FTP or SFTP server to Redshift.
Limitations of using Custom Code to Connect FTP to Redshift
In reality, there are a number of challenges involved in connecting FTP to Redshift. The typical ones are listed below:
- The above method of connecting FTP to Redshift can only be used for a first-time bulk load. The COPY command appends rows to the existing rows by default, so additional logic, such as the staging-table upsert sketched after this list, is needed when an incremental data load is required.
- Redshift’s implicit data type conversion can wreak havoc, leading to corrupted data, if the developer is not careful. Specifically, in many cases Redshift will implicitly convert column values according to the target table’s data type specification without throwing an error. For example, a decimal value will be automatically truncated to an integer if the target column’s data type is an integer.
- We used two staging locations in the above approach: a local file system to hold the data downloaded from FTP, and then an S3 bucket to load the data into Redshift. In practice, having access to a local system that can hold large volumes of data is often impractical.
- In most cases, this process needs to run periodically, skipping files that have already been processed. A custom script with progress tracking and restart support will need to be implemented to use this approach in production.
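To work around the first limitation, the usual Redshift pattern is a staging-table upsert: load the incremental file into a temporary table, then merge it into the target inside a transaction. The sketch below reuses the hypothetical products table from earlier and assumes product_id is its key:
-- Stage the incremental data in a temporary table shaped like the target
create temp table products_staging (like products);
copy products_staging from 's3://my-data-bucket/products/products_incremental.csv' credentials 'aws_access_key_id=<AWS-Access-Key-ID>;aws_secret_access_key=<AWS-Secret-Access-Key>' csv;
-- Delete rows that are being replaced, then insert everything from staging
begin;
delete from products using products_staging where products.product_id = products_staging.product_id;
insert into products select * from products_staging;
commit;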
A solution to all these challenges is to use an automated data pipeline platform like Hevo, which can accomplish this in a matter of a few clicks.
Hevo provides a point-and-click interface with support for most of the commonly used source and target database combinations.
Conclusion
This blog talks about the two methods you can use to set up a connection from FTP to Redshift: using custom ETL scripts, and using a third-party tool, Hevo. It also gives a brief overview of FTP and Redshift, highlighting their key features and benefits, before diving into the setup process.
Extracting complex data from a diverse set of data sources can be a challenging task, and this is where Hevo saves the day! Hevo offers a faster way to move data from databases or SaaS applications, and sources such as FTP, into your data warehouse, such as Redshift, to be visualized in a BI tool. Hevo is fully automated and hence does not require you to write any code.
Before spending days of development effort loading data from SFTP/FTP to Redshift manually, give Hevo a shot by signing up for a 14-day free trial here. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
FAQ on SFTP/FTP to Redshift
How to connect FTP in AWS?
Connect FTP in AWS by using services like AWS Transfer Family, which supports FTP, FTPS, and SFTP for transferring files to and from AWS storage services like S3.
Does AWS support FTP?
Yes, AWS supports FTP through AWS Transfer Family, allowing secure file transfers to and from Amazon S3 using FTP, FTPS, or SFTP protocols.
What is an FTP port?
The default port for FTP is 21. FTPS uses port 990 for control and dynamic ports for data transfers.
How do I connect to Redshift from local?
You can connect to Redshift from a local machine using a SQL client (like pgAdmin) or a command-line tool (like psql) with the Redshift JDBC or ODBC driver and cluster endpoint details.
Talha is a Software Developer with over eight years of experience in the field. He is currently driving advancements in data integration at Hevo Data, where he has been instrumental in shaping a cutting-edge data integration platform for the past four years. Prior to this, he spent 4 years at Flipkart, where he played a key role in projects related to their data integration capabilities. Talha loves to explain complex information related to data engineering to his peers through writing. He has written many blogs related to data integration, data management aspects, and key challenges data practitioners face.