Are you trying to connect HubSpot to Redshift? Do you want to consolidate your HubSpot and Redshift data for better analysis? If yes, then you are in the right place. Cloud-based software as a service offerings spare organizations from investing time and effort in building custom software solutions and enables them to focus on their core business.
A challenge with using such services is that they limit the organization’s ability to carry out analysis over the data generated. In most cases, organizations will have their own data warehouse which needs to be kept up to date with the customer data present in these SaaS offerings.
Having all the data in one place will help organizations to extract the true capabilities of the Data Warehouse and derive meaningful insights from their data. Hubspot is a Software-as-a-service(SaaS) offering that provides a full suite of software for marketing, sales, and customer service. This post is about the methods to load the data from HubSpot to Redshit, one of the most popular Data Warehouse services.
Introduction to HubSpot
HubSpot is an American company that develops software products for inbound marketing, customer service, and sales. It will help your company to attract visitors, leads, and customers. They also provide various tools for hosting web and landing pages, creating blogs and emails, managing interaction with leads and customers, and analyzing the campaigns. HubSpot features such as drag and drop, analysis, speed, etc make it a must-have tool.
Introduction to Amazon Redshift
Amazon Redshift is a cloud-based fully managed data warehouse. It can be used for large data set storage and migration as it is based on Massive Parallel Processing (MPP). Redshift’s column-oriented database can connect to Business Intelligence (BI) tools and SQL-based clients. All your data is available in real-time. Redshift’s parallel processing and compressions reduce the execution time and deliver faster results. All these features are available at a reasonable price.
Methods to Move Data from HubSpot to Redshift
You can connect HubSpot to Redshift using the following methods:
Method 1: HubSpot to Redshift – Building Custom Code Using HubSpot APIs
Hubspot provides a rich set of APIs to retrieve the data present in customer accounts that can help you move your data from HubSpot to Redshift. It also provides an easy-to-follow authentication model to enable developers to build custom applications using their APIs.
In this tutorial, we will use the Contacts API that gives us a list of all the contacts present in Hubspot. We will then try to decipher the data and load it into Redshift using Redshift’s COPY command.
1. The first step is to go to the settings page in your Hubspot portal and create API keys. Please note this needs super admin permissions. In the settings panel, navigate to integrations > API key and click the Generate API key button. The keys have a lifetime of 6 months.
Using the API key, make the following request using the CURL utility
curl -o contacts.json https://api.hubapi.com/contacts/v1/lists/all/contacts/all?hapikey=<api_key>&count=2
The above command will populate the contacts.json file. Before going into the steps, there are some specifics of the API that we need to be familiar with. The count property of the API allows developers to specify the number of entries to be retrieved at a time.
The API at a time returns a maximum of 100 entries. If there are more than 100 contacts, it returns the data in a paginated manner. The API response has a parameter called vid-offset that can be used to retrieve the next set of entries after consuming the first page. The developer has to use the value of the vid-offset parameter in the next request with the vidOffset property.
2. Convert the JSON to a CSV using the jq command-line utility. For this tutorial, we will extract the vid, first name, company, and last name attribute of contacts to form a CSV. Use the below command to accomplish this.
cat contacts.json | jq -r ".contacts[] | [.vid,.properties.firstname.value, .properties.company.value,.properties.lastname.value] | @csv" >> contacts.csv
The output of the above command will be copied to a CSV file named “contacts”.
3. Copy the file to AWS S3. Assuming AWS CLI is already installed and configured, use the below command.
aws s3 cp /contacts/contacts.csv s3://contacts_bucket/
4. We will now use Redshift’s COPY command to load data. COPY command can load data from an S3 bucket, a remote host, an EMR cluster, or DynamoDB. In this case, we will be copying data from AWS S3. Use the below command to execute this.
copy contacts from 's3://contacts_bucket/contacts.csv' iam_role 'arn:aws:iam::0123456789012:role/User_RedshiftRole' csv;
Where iam_role is the user’s IAM role. This command requires a contacts table to be already created with an id,first_name, company, and last_name fields.
That concludes the step in copying the data from contacts API to Redshift. In reality, when implementing for a production environment, there are a lot more challenges that need to be solved to create a repeatable process. Let’s have a look at the typical problems that should be addressed in such a custom application.
Method 2: HubSpot to Redshift using Hevo Data
Hevo focuses on two simple steps to move your data from HubSpot to Redshift:
- Configure Source: Connect Hevo Data with HubSpot by providing a unique name for your Pipeline, along with details about your authorized Redshift account. You can also choose the historical sync duration for your HubSpot data.
- Integrate Data: Complete HubSpot to Redshift migration by providing your destination name, account name, region of your account, database username and password, database and schema name, and the Data Warehouse name.
HubSpot to Redshift: Limitations of Using Custom Code
Some of the limitations of using custom code to connect HubSpot to Amazon Redshift are listed below:
- Hubspot has API rate limits to prevent its infrastructure from being overloaded. Every response has rate limit specification headers which specify the maximum number of requests for a day, the maximum number of requests per window, size of the window, etc. The application has to read these headers and make API calls appropriately.
- The API returns a paginated response. So the application needs to be intelligent enough to keep track of the vid-offset value through which Hubspot manages pagination to get the full data.
- The above approach is limited to be used for a bulk load of data only. Redshift does not enforce any kind of primary key constraints and hence repeated loading of data fetched from the API will result in duplicate rows. If continuous sync (or real-time sync) between Hubspot and Redshift is required, additional logic has to be built.
- Redshift’s COPY command has a built-in implicit data type conversion mechanism which can play havoc with your data if not carefully mapped. COPY command can take a JSON configuration file to map the data type correctly if the developer can specify them.
If all of the above proves to be too much work, an alternative elegant solution will be to use a cloud-based ETL tool like Hevo which can abstract away all these details and set up this data load operation in a matter of few clicks.
Conclusion
In this blog, you learned about HubSpot, Redshift, and two different approaches to connect HubSpot to Redshift. You can either use HubSpot API to build custom scripts or use a fully automated data pipeline, Hevo to connect HubSpot to Redshift.
What is your preferred approach to move data from HubSpot to Redshift? Do let us know in the comments section below.
Talha is a Software Developer with over eight years of experience in the field. He is currently driving advancements in data integration at Hevo Data, where he has been instrumental in shaping a cutting-edge data integration platform for the past four years. Prior to this, he spent 4 years at Flipkart, where he played a key role in projects related to their data integration capabilities. Talha loves to explain complex information related to data engineering to his peers through writing. He has written many blogs related to data integration, data management aspects, and key challenges data practitioners face.