Understanding the Data Analytics Stack


Introduction to a Data Analytics Stack:

Today, data guides most decisions in companies. These decisions could involve launching a new product, diversification, research and development, promotional offers, advertising campaigns, and more.

Business teams always need data at their disposal. The challenge they face is that the data is scattered across different sources, say MixPanel, Amazon Redshift, Google Analytics, Salesforce, and more.

Let me explain with a simple example –

Suppose you are a mobile network service provider. Your customer complaint data is stored in Zendesk and your CRM data in Salesforce. To discover real-time insights from both, it is important to merge the data for querying and processing. Bringing the scattered data together on one platform in order to provide a unified view is termed data integration.

Data integration begins with the extraction of data, is followed by data transformation, and ends with loading the data into a warehouse.

Performed by a data pipeline, this process is the core component of a data analytics stack. The three components of a data analytics stack are the data pipeline, the data warehouse, and data visualization.
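
To make the Zendesk and Salesforce example concrete, here is a minimal sketch of those three steps in Python. The helper functions, column names, and target table are all hypothetical stand-ins; this is an outline under those assumptions, not a production implementation.

```python
import pandas as pd

def extract_zendesk_tickets():
    # Placeholder for a real export from the Zendesk API.
    return pd.DataFrame({"customer_id": [101, 102], "ticket_id": ["T-1", "T-2"]})

def extract_salesforce_accounts():
    # Placeholder for a real export from Salesforce.
    return pd.DataFrame({"customer_id": [101, 102], "plan": ["prepaid", "postpaid"]})

def transform(tickets, accounts):
    # Merge the two sources on a shared customer identifier for a unified view.
    return tickets.merge(accounts, on="customer_id", how="left")

def load(unified, warehouse_conn):
    # Write the unified table into the warehouse (table name is illustrative).
    unified.to_sql("customer_360", warehouse_conn, if_exists="append", index=False)

def run_pipeline(warehouse_conn):
    tickets = extract_zendesk_tickets()
    accounts = extract_salesforce_accounts()
    load(transform(tickets, accounts), warehouse_conn)
```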

Let us understand more about the data analytics stack in this blog.

1. Data Pipeline

To fetch data from scattered sources such as MySQL, log files, and Google Analytics into a data warehouse, say Redshift, you require a data pipeline.

A data pipeline moves your data from one system to another while transforming it along the way. Blending data from different sources, existing in different formats, takes place in the pipeline.

There are three steps that take place in a data pipeline: Extract, Transform, and Load. Transform and Load can occur in either order, which gives rise to the abbreviations ETL and ELT.

Let us try to understand ETL and ELT with a simple use case.

A leading fashion retail company has a presence in the United States and Europe. Its operations take place in different time zones: PST, CST, and UTC.

To compute the sales figures per day, you can either transform the timestamps into a standard time zone and then load them into the data warehouse (ETL), or load the data as it is and perform the desired transformations at the end (ELT). ELT is the more feasible option here. It also gives the user the flexibility to customize their pivoted data.
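
For instance, the time-zone standardization could look like the following sketch, which converts illustrative order timestamps to UTC before loading (the ETL route). The column names and values are hypothetical.

```python
import pandas as pd

# Illustrative orders with local timestamps and their time zones.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "ordered_at": ["2017-09-01 10:15:00", "2017-09-01 12:30:00", "2017-09-01 18:45:00"],
    "tz": ["America/Los_Angeles", "America/Chicago", "UTC"],
})

# Standardize every timestamp to UTC before loading (the "T" happens before the "L").
orders["ordered_at_utc"] = pd.to_datetime(
    [pd.Timestamp(ts).tz_localize(tz) for ts, tz in zip(orders["ordered_at"], orders["tz"])],
    utc=True,
)

# Sales figures per day in a single, standard time zone.
daily_sales = orders.groupby(orders["ordered_at_utc"].dt.date)["order_id"].count()
print(daily_sales)
```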

Extraction, Transformation, and Loading are key components of the data centralization toolbox.

A. Extraction:

The extraction process deals with retrieving the desired data from different sources. It also includes identifying and selecting the relevant data, since a source usually contains unwanted data or data of little interest. The complexity of the extraction process varies with the type of data source; a simple extraction sketch follows the list of source types below.

The different sources could be of the following types:

Databases: MySQL, PostgreSQL, MongoDB, Cassandra

Cloud applications: Google Analytics, Salesforce, MixPanel, SendGrid

Data streams: Mobile application, Website, Sensors in machines, Smart home devices

Flat files: Log files, Excel files, Amazon S3
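
As an illustration of the extraction step, the sketch below pulls selected columns from a PostgreSQL database and reads a flat log export. The connection details, table, and file name are all hypothetical.

```python
import csv
import psycopg2

# Database source: pull only the columns and rows of interest, not the whole table.
conn = psycopg2.connect(
    host="db.example.com", dbname="shop", user="reader", password="secret"
)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT order_id, customer_id, amount, ordered_at "
        "FROM orders WHERE ordered_at >= %s",
        ("2017-09-01",),
    )
    orders = cur.fetchall()

# Flat-file source: read an exported log/CSV file row by row.
with open("clickstream_2017-09-01.csv", newline="") as f:
    clicks = list(csv.DictReader(f))

print(len(orders), "orders and", len(clicks), "click events extracted")
```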

B. Transformation:

The extracted data is cleaned, sorted, and formatted as per the requirements. This process is called data transformation.

Joining and merging data from different sources take place during the transformation phase. Data from some sources might require fewer transformations, whereas other sources may require multiple transformations, such as validation, summarization, etc.
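
A small sketch of these transformation steps, using pandas on two hypothetical extracts (complaint tickets and CRM accounts), might look like this:

```python
import pandas as pd

# Hypothetical extracts: complaint tickets (Zendesk) and CRM accounts (Salesforce).
tickets = pd.DataFrame({
    "customer_id": [101, 102, 102, None],
    "ticket_id": ["T-1", "T-2", "T-3", "T-4"],
    "opened_at": ["2017-09-01", "2017-09-02", "2017-09-03", "2017-09-03"],
})
accounts = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "plan": ["prepaid", "postpaid", "prepaid"],
})

# Validation: drop records without a customer reference and normalize types.
tickets = tickets.dropna(subset=["customer_id"])
tickets["customer_id"] = tickets["customer_id"].astype(int)
tickets["opened_at"] = pd.to_datetime(tickets["opened_at"])

# Joining: merge the two sources into a single, unified view.
unified = tickets.merge(accounts, on="customer_id", how="left")

# Summarization: complaints per plan, ready to be loaded into the warehouse.
print(unified.groupby("plan")["ticket_id"].count())
```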

C. Loading:

Loading refers to the physical movement of data into a data warehouse. Data warehouses are updated periodically, and records are often loaded into multiple tables in one go.
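
As a sketch of the loading step, the snippet below batch-inserts records into two warehouse tables inside a single transaction, assuming a warehouse that speaks the PostgreSQL wire protocol (as Amazon Redshift does). Hosts, tables, and rows are illustrative.

```python
import psycopg2

# Illustrative batches destined for two different warehouse tables.
fact_rows = [(1, 101, 49.99), (2, 102, 19.99)]    # (order_id, customer_id, amount)
dim_rows = [(101, "prepaid"), (102, "postpaid")]  # (customer_id, plan)

conn = psycopg2.connect(
    host="warehouse.example.com", dbname="analytics", user="loader", password="secret"
)
with conn, conn.cursor() as cur:
    # Load records into multiple tables within a single transaction.
    cur.executemany(
        "INSERT INTO fact_orders (order_id, customer_id, amount) VALUES (%s, %s, %s)",
        fact_rows,
    )
    cur.executemany(
        "INSERT INTO dim_customers (customer_id, plan) VALUES (%s, %s)",
        dim_rows,
    )
# Leaving the `with` block commits the transaction, so both loads become visible together.
```

For large batches, warehouses such as Redshift also offer bulk-load commands (for example, COPY from files staged in Amazon S3), which are usually faster than row-by-row inserts.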

2. Data Warehouse

A data warehouse is a large storage space used to consolidate data so that it is accessible to different departments in an organization.

If the data volume is small, it is feasible to operate using SQL databases like MySQL or PostgreSQL. But as the data grows in volume and complexity, data warehousing becomes a need.

Data warehousing models have evolved with changing business requirements. Some of the widely used modern data warehouses are Amazon Redshift (Amazon Web Services), BigQuery (Google), and Snowflake.

A data warehouse operates upon historical data. To avoid data complications, a data warehouse and an organization's operational databases are kept separate.

Once the data is loaded into a warehouse, it should be easy to interpret for analysis. At this juncture, data visualization tools play a pivotal part.

3. Data Visualization

It is easier to understand quarterly growth or sales figures from a pictorial view than from a heavy Excel file.

Visualization is important as it makes the data easier to comprehend. Dashboards provide a combined view of multiple analyses on a single platform.
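
As a minimal sketch, the snippet below charts quarterly sales figures that have already been queried from the warehouse. Matplotlib stands in here for a dashboarding tool, and the figures are illustrative.

```python
import matplotlib.pyplot as plt

# Illustrative quarterly sales figures, e.g. the result of a warehouse query.
quarters = ["Q1", "Q2", "Q3", "Q4"]
sales = [120000, 135000, 150000, 170000]

plt.bar(quarters, sales)
plt.title("Sales per quarter")
plt.ylabel("Sales (USD)")
plt.tight_layout()
plt.show()
```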

There are many commercial as well as open-source data visualization tools available. Paid tools like Periscope Data, Tableau, Looker, and QlikView allow SQL queries. More features of these widely used tools are listed in this blog.

Free tools like Metabase, Redash, and Superset also support SQL backends used for data analytics. All of them let you organize reports into dashboards.

Hevo as a Data Integration Platform:

Building a data pipeline is an expensive, monotonous, and time-consuming process. Time, energy, manpower, and resources could be put to better use if a data pipeline/integration platform were readily available.

The solution to this problem is automating the data integration process. Opting for a platform that aggregates your data in no time and ensures faster accessibility is the smart way forward.

We, at Hevo, help organizations overcome their data integration problems through seamless movement of data while ensuring scalability, flexibility, and data security. We guarantee that moving your data is effortless: from any source to any destination, and in any format.

If you are facing similar problems integrating your data, then drop us your queries at info@hevodata.com.
