Building a Data Science Tech Stack: A Comprehensive Guide

on BI Tool, Data Integration, Data Warehouse, ETL • December 12th, 2020

Hevo, A Simpler Alternative to Integrate your Data for Analysis

Hevo offers a faster way to move data from databases or SaaS applications into your data warehouse to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Real-time Data Transfer: Hevo provides real-time data migration, so you always have analysis-ready data.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Security: Hevo is SOC II, GDPR, and HIPAA compliant. Hevo also enables top-grade security with end-to-end encryption, two-factor authentication, and more.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.

You can try Hevo for free by signing up for a 14-day free trial.

Introduction

The field of data science has evolved to a stage where no organization can ignore it while setting up their data science tech stack. Organizations use machine learning not only to serve their customers better but also to gather insights about their business for the senior management team.

Such close coupling of data science with business operations means that choosing the right stack for your data architecture is a make or break decision. Having the right set of tools has a positive impact on your time to market, development cost, infrastructure costs, and the overall stability of your platform.

A data science tech stack is not only about the framework used to create models or the runtime for inference jobs. It extends to your complete data engineering pipeline, business intelligence tools, and the way in which models are deployed. This post covers the critical factors that must be considered while building a data science tech stack.

Understanding Enterprise Data Science Tech Stack

In a typical enterprise architecture, data flows in from various on-premise and cloud sources into a data lake. A data lake is a heterogeneous data storage area where all kinds of data, including data originating from transactional databases, are stored irrespective of their structure or source. Data is then extracted, transformed, and loaded into a data warehouse where it can be analyzed.

Business analysts and data scientists work on the data warehouse and come up with reports and reusable analytics modules. Some of these modules are deployed with the data warehouse itself as their data source and produce actionable insights on a batch basis.

Another set of modules is closely integrated into the transactional systems and provides results in real time. Both kinds of models are typically served behind web interfaces to allow independent scaling and deployment. As is evident from the above, a data science stack is the combination of all the technologies involved in operating this complex flow.

Selecting the Components for your Data Science Tech Stack

Now that we understand the components, let us discuss the factors to consider when selecting the stack for each critical point in the flow.

Data Warehouse

The choice of data warehouse primarily depends on whether you want an on-premise solution or a cloud-based one. The obvious advantage of a cloud-based Software-as-a-Service (SaaS) solution is its maintenance-free nature and the ability to focus on the core analytics problem without getting distracted.

The most popular on-premise solution is an execution engine like Spark or Tez with a querying layer like Hive or Presto on top of it. The advantage is that you have complete control over your data. You can build analytics and machine learning modules directly with custom Spark code, as sketched below. Querying engines like Presto now have basic ML algorithms built into them.
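
If you go this route, model training can happen right where the data lives. Below is a minimal sketch, assuming a Hive-backed Spark setup; the table and column names are hypothetical.

```python
# A minimal sketch of training a model directly on warehouse data with
# Spark MLlib. The Hive table and column names here are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = (SparkSession.builder
         .appName("warehouse-ml")
         .enableHiveSupport()
         .getOrCreate())

# Read directly from a Hive table in the on-premise warehouse.
df = spark.sql("SELECT order_value, items, discount FROM sales.orders")

# Assemble raw columns into the single feature vector MLlib expects.
assembler = VectorAssembler(inputCols=["items", "discount"],
                            outputCol="features")
model = LinearRegression(featuresCol="features",
                         labelCol="order_value").fit(assembler.transform(df))

print(model.coefficients)
```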

If your organization does not have the development expertise to maintain such solutions and does not intend to acquire it, you may be better off using cloud-based services like Redshift, Azure Synapse (formerly SQL Data Warehouse), or BigQuery. These let you take advantage of the ML modules that are already part of the suite.

Redshift ML is the latest entrant in this space, while BigQuery ML has been around for a while. So if you are looking to create ML models directly from your cloud data warehouse, BigQuery ML and Azure's offering may be more stable choices than the AWS one.
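
To illustrate the in-warehouse approach, here is a hedged sketch of training a model with BigQuery ML from Python; the project, dataset, and column names are placeholders.

```python
# A hedged sketch of creating a model inside BigQuery with BigQuery ML.
# The project, dataset, and table names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# BigQuery ML trains the model where the data lives -- no data movement.
query = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT churned, tenure_months, monthly_spend
FROM `my_dataset.customers`
"""
client.query(query).result()  # blocks until training completes
```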

ETL Tool

Any analytics module or machine learning model is only as good as the features it takes as input, and the ETL tool is what creates these input features. If you are going for an on-premise solution, Spark-based transformations written as custom code or Spark SQL in Python or Scala are the popular choice (see the sketch below). This means you will have to build your own frameworks and schedulers to ensure the feature-building process is reliable. Alternatively, you can pick an open-source tool like Pentaho Data Integration, but such tools won't be as flexible as custom solutions.
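
As a rough illustration of the custom-code approach, the sketch below builds per-user features with PySpark; the source path, target table, and feature definitions are hypothetical.

```python
# A minimal sketch of a Spark-based feature-building ETL step. The lake
# path, warehouse table, and feature definitions are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-etl").getOrCreate()

events = spark.read.parquet("/data/lake/clickstream")  # hypothetical path

# Aggregate raw events into per-user features the models will consume.
features = (events
            .groupBy("user_id")
            .agg(F.count("*").alias("event_count"),
                 F.countDistinct("session_id").alias("sessions"),
                 F.avg("dwell_seconds").alias("avg_dwell")))

features.write.mode("overwrite").saveAsTable("warehouse.user_features")
```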

If you are open to cloud-based solutions, Google Cloud Dataflow, Azure Databricks, and AWS Glue provide excellent SaaS offerings. All of these allow automatic code generation based on visual interfaces and support data science modeling natively. A disadvantage is that they are aligned to their own stacks: Glue is better suited if you are on AWS, and Databricks if you are already on Azure. Support for external cloud-based data sources is also limited.
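
For a flavor of what these services generate, here is a hedged sketch of an AWS Glue job script; the catalog database, table, and S3 bucket names are hypothetical.

```python
# A hedged sketch of an AWS Glue job script; the catalog database, table,
# and S3 bucket names are hypothetical. Glue's visual UI generates
# similar code automatically.
from awsglue.transforms import ApplyMapping
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read from the Glue Data Catalog, rename/cast columns, write to S3.
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw", table_name="orders")
mapped = ApplyMapping.apply(frame=source, mappings=[
    ("order_id", "string", "order_id", "string"),
    ("amount", "string", "amount", "double"),
])
glue_context.write_dynamic_frame.from_options(
    frame=mapped, connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders"},
    format="parquet")
```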

An alternative is an independent cloud-based ETL tool like Hevo. Hevo supports code-free ETL and can achieve complex transformations in a few clicks, making it an excellent companion for your data science efforts.

Try Hevo here for a free trial!

Business Intelligence and Visualization Tools

Business intelligence and visualization tools are an important part of the data science tech stack puzzle since they play a central role in exploratory data analysis. Popular on-premise solutions are Tableau and Microsoft Power BI. If your development team wants custom code-based solutions, Python libraries like Seaborn and Matplotlib are good options for visualizing data.
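
As a quick illustration of code-based exploratory analysis, the snippet below uses Seaborn's bundled "tips" sample dataset.

```python
# A small example of exploratory analysis with Seaborn and Matplotlib,
# using Seaborn's bundled "tips" sample dataset.
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# Distribution of the target plus its relationship to one feature.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(tips["tip"], ax=axes[0])
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time", ax=axes[1])
plt.tight_layout()
plt.show()
```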

Amazon QuickSight, Google Data Studio, and Azure Data Explorer are excellent SaaS alternatives in this space. QuickSight also has basic machine learning capabilities to detect anomalies, forecast values, and even create automatic dashboards. As always, these services make the most sense if you are already on the provider's stack; they do not do a good job of integrating data from outside it.

ML and Analytics Implementation Frameworks

For custom code-based implementations, the de facto standard for machine learning and analytics has been Python for a while. For statistical analysis and modeling, scikit-learn and statsmodels are the popular choices. For statistical models, R also offers a rich set of functions and can be deployed in production.
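
The short sketch below contrasts the two Python libraries on the same synthetic data: scikit-learn for the predictive fit/score workflow, statsmodels when you need coefficient-level statistical inference.

```python
# scikit-learn for predictive modeling vs. statsmodels for inference,
# demonstrated on synthetic data.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ [1.5, -2.0, 0.5] + rng.normal(size=500) > 0).astype(int)

# scikit-learn: fit/predict workflow with a holdout split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))

# statsmodels: same model, but with coefficient p-values and intervals.
print(sm.Logit(y, sm.add_constant(X)).fit(disp=0).summary())
```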

For deep learning, TensorFlow, MXNet, PyTorch, etc. can be used. If you prefer Java, Deeplearning4j is a good choice. Community support is a big factor to consider here, since in most cases developers will need to do a lot of research before finalizing the model pipeline.
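
For reference, a minimal PyTorch sketch of a single training step looks like this; the network and the toy batch are arbitrary.

```python
# A tiny PyTorch sketch: define a network and run one training step.
# The architecture and the toy batch are arbitrary.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(64, 10), torch.randn(64, 1)  # toy batch

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()   # compute gradients
optimizer.step()  # update weights
print(loss.item())
```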

If your organization does not plan to hire ML expertise or develop custom models, most cloud service providers offer machine learning models and automated model building as a service.

Azure Machine Learning, Google Cloud AI, AWS machine learning services, etc. allow you to build models and intelligence without writing much code at all. All you need to do is prepare data in the specified format.
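
For example, a pre-trained service like Amazon Comprehend can be called with a few lines of boto3; this assumes AWS credentials are already configured in your environment.

```python
# "Intelligence without code": calling Amazon Comprehend's pre-trained
# sentiment model via boto3. AWS credentials are assumed to be
# configured in the environment; the region is an example choice.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")
result = comprehend.detect_sentiment(
    Text="The onboarding flow was smooth and the support team was great.",
    LanguageCode="en")
print(result["Sentiment"], result["SentimentScore"])
```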

Google Datalab, AWS SageMaker, and Azure ML Studio provide excellent platforms for data science development. A point to note is that your ETL tool is of critical importance here, since your effort in implementing machine learning is then limited to providing input features.

Deployment Stack

Once the models are built, the next step is to deploy them for real-time or batch inference. If you have an on-premise setup, the typical choice is to wrap the models in a web service framework like Flask or Django and create Docker containers for deployment (a minimal Flask sketch follows). You can then scale them horizontally using a container orchestration framework or load balancer.
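
A minimal sketch of such a wrapper, assuming a scikit-learn style model pickled to model.pkl (a hypothetical artifact), might look like this:

```python
# A minimal sketch of wrapping a trained model in Flask for real-time
# inference; model.pkl and the request format are hypothetical.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:  # model trained and pickled elsewhere
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # e.g. {"features": [[1.0, 2.0, 3.0]]}
    preds = model.predict(payload["features"])
    return jsonify(predictions=preds.tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Containerizing this service with Docker then lets a load balancer or an orchestrator such as Kubernetes scale replicas horizontally.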

The obvious deciding factor here is the effort involved and the expertise needed. Inference modules come with a lot of complexity and need careful application of concepts like batching and threading to extract the best performance. Typical ML frameworks like TensorFlow, MXNet, and PyTorch come with their own deployment functions, and it is better to exploit them rather than reinvent the wheel.
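
For instance, exporting a Keras model in the SavedModel format gives TensorFlow Serving something it can load directly; the model below is just a placeholder.

```python
# Using a framework's own deployment path instead of a hand-rolled
# server: export a Keras model as a SavedModel for TensorFlow Serving.
# The model itself is a placeholder.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# The versioned directory layout ("1/") is what TF Serving expects.
tf.saved_model.save(model, "serving/my_model/1")
```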

A way out of the complicated deployment process is to use the ML serving options provided by cloud services. AWS, GCP, and Azure have deployment mechanisms built into their machine learning services and also allow the deployment of custom models created outside their systems. The biggest advantage is that scaling is completely automated with such services.
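
As a hedged sketch, deploying a scikit-learn model with the SageMaker Python SDK might look like the following; the S3 artifact, IAM role, and entry script are all hypothetical.

```python
# A hedged sketch of deploying a model with the SageMaker Python SDK;
# the S3 artifact path, IAM role ARN, and entry script are hypothetical.
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data="s3://my-bucket/models/model.tar.gz",      # hypothetical
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical
    entry_point="inference.py",  # loads the model and handles requests
    framework_version="0.23-1",
)

# SageMaker provisions the endpoint and automates scaling behind it.
predictor = model.deploy(initial_instance_count=1,
                         instance_type="ml.m5.large")
print(predictor.endpoint_name)
```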

Conclusion

As is evident from the above, choosing the components of your analytics and data science stack is not an easy job. There are umpteen factors at play and a large number of combinations that can be tried. Broadly, this decision will be based on your answers to the following questions:

  1. Do you prefer on-premise or cloud-based services?
  2. Do you have the development expertise to create your own models and analytics functions?
  3. Are you already invested in one of the cloud service providers?
  4. Do you have a case for real-time data ingestion and analytics?

Whether you choose a completely cloud-based system or a combination of custom implementations and cloud-based services, the ETL tool is the primary link between all the entities in the data science tech stack. Hevo provides an excellent cloud-based ETL tool that makes it easy to create features for your business analysts and data scientists to work on. Hevo integrates smoothly with 100+ data sources.

Try Hevo for free by signing up for a 14-day free trial!

Share your thoughts on building a data science tech stack in the comments!
