11 Best ETL Tools for Data Warehousing in 2020

on Data Integration • January 9th, 2019

Finding the ETL tool that fits your use case like a glove can be hard. This detailed guide aims to give you a complete set of inputs on the ETL tools in the market: a broad classification, use cases, and an evaluation framework. The post also includes a detailed comparison of the best ETL tools for 2020. Taken together, this should help you pick the best ETL tool for your use case.

This is a fairly comprehensive blog. In case you are eager to just get to the point and discover the best ETL Tools, here is the list.

11 Best ETL Tools List for 2020:

  1. Informatica PowerCenter
  2. IBM InfoSphere DataStage
  3. Hevo Data
  4. Talend
  5. Pentaho
  6. AWS Glue
  7. StreamSets
  8. Blendo
  9. Google Cloud Dataflow
  10. Azure Data Factory
  11. Apache NiFi

Now that you have the names of some of the top ETL tools, read on to learn more about the use case where each of these tools will fit best.

Before you dive into understanding what the top ETL solutions in the market today are, it is important to briefly understand the ETL process itself. This will set you up better to appreciate the value provided by different ETL tools. 

What is ETL?

ETL is the process of moving raw data from one or more sources into a destination data warehouse in a more useful form. This is an essential process in making the data analysis-ready in order to have a seamless business intelligence system in place.

ETL simply stands for – Extract, Transform, and Load. Often, the process entails the following: Data is first Extracted from the source and maintained in a staging area. While in the staging area, depending on the use case of your business, the data is then Transformed into a format that’s more useful for analysis and more appropriate for the destination warehouse schema. It is then Loaded into the destination data warehouse.
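
The three stages described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular ETL tool is implemented; the record fields (`customer`, `amount`) and the in-memory list standing in for the warehouse are assumptions made for the example.

```python
def extract(source):
    """Extract: copy raw records out of the source into a staging area."""
    return [dict(record) for record in source]


def transform(staged):
    """Transform: clean up names and convert amounts to integer cents."""
    return [
        {
            "customer": r["customer"].strip().title(),
            "amount_cents": round(float(r["amount"]) * 100),
        }
        for r in staged
    ]


def load(rows, warehouse):
    """Load: append the transformed rows to the destination table."""
    warehouse.extend(rows)
    return len(rows)


# Hypothetical raw records, as they might arrive from a source application.
source = [{"customer": " alice ", "amount": "12.50"}, {"customer": "BOB", "amount": "3"}]
warehouse = []  # stand-in for a warehouse table
loaded = load(transform(extract(source)), warehouse)
```

Keeping each stage as its own function mirrors the staging-area idea: the transform step only ever sees a copy of the source data, never the source itself.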

For example, in order to derive deep insights on your marketing metrics, you would have to move raw events from Google Analytics, Google Ads, Salesforce, and other applications used by your marketing teams and load them in a SQL-like format into a data warehouse like Amazon Redshift.

Here is a real-life example of the need for ETL. Take an insurance company. Every day, data is created in the transactional database when people take out new policies or file claims. This data is extracted from the transactional database, aggregated, transformed, and loaded into a data warehouse. The company’s data scientists and business analysts dig through this transformed and aggregated data to come up with better policies, claim processes, and so on, which help the company reduce its costs or increase its revenue.

The company may also be advertising its policies through social networks and other internet channels. This data also ends up in the data warehouse through the ETL flow, since advertising data is the first place a company looks for insights about its target customers. The algorithms and strategies developed by analysts then become part of the same ETL process, helping employees on the ground deal with customers. This is why ETL is such an important part of improving and managing a business.
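
To make the insurance example concrete, here is a small sketch using Python's built-in `sqlite3` module. The table names (`claims`, `daily_claims`), the columns, and the sample rows are all hypothetical; two in-memory databases stand in for the transactional system and the warehouse.

```python
import sqlite3

# Hypothetical transactional database with one row per claim.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE claims (claim_id INTEGER, filed_on TEXT, amount REAL)")
oltp.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [(1, "2020-01-01", 500.0), (2, "2020-01-01", 250.0), (3, "2020-01-02", 100.0)],
)

# Extract and aggregate: number of claims and total amount per day.
daily = oltp.execute(
    "SELECT filed_on, COUNT(*), SUM(amount) FROM claims "
    "GROUP BY filed_on ORDER BY filed_on"
).fetchall()

# Load the aggregates into a (hypothetical) warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE daily_claims (filed_on TEXT, claim_count INTEGER, total_amount REAL)"
)
warehouse.executemany("INSERT INTO daily_claims VALUES (?, ?, ?)", daily)
warehouse.commit()
```

Analysts would then query `daily_claims` instead of the transactional table, so their workload does not compete with the systems that record new policies and claims.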

What are ETL Tools?

ETL tools are applications or platforms that enable users to execute ETL processes. In simple terms, these tools help businesses move data from one or many disparate data sources to a destination. They help make the data both digestible and accessible (and in turn analysis-ready) in the desired location – often a data warehouse.

The example shared in the previous section makes it clear that a modern ETL process consists of a large number of scheduled or triggered processes to move data between a wide variety of sources and target data stores. Coordinating, orchestrating, and executing these activities with strict SLAs is a critical task, and this is where ETL tools come into the picture. You can read more about an ETL tool here.
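
One small piece of that coordination problem, running jobs only after the jobs they depend on, can be sketched with the standard-library `graphlib` module (Python 3.9+). The job names below are hypothetical, and real ETL tools add scheduling, retries, and monitoring on top of this kind of ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each job maps to the set of jobs it depends on.
dependencies = {
    "extract_ads": set(),
    "extract_policies": set(),
    "transform_joined": {"extract_ads", "extract_policies"},
    "load_warehouse": {"transform_joined"},
}


def run_pipeline(deps, run_job):
    """Run every job only after all of its dependencies have run."""
    order = list(TopologicalSorter(deps).static_order())
    for job in order:
        run_job(job)
    return order


executed = []
order = run_pipeline(dependencies, executed.append)
```

Here `run_job` is just `executed.append`, so the example only records the order; in practice it would trigger the actual extract, transform, or load step.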

Selecting an ETL tool is a make-or-break decision for companies because, if you are not careful, it can become a large money pit. Good ETL tools automate most of these workflows without needing any human intervention and provide a highly available service.

Now that we know what an ETL tool is, let us look at the list of top ETL tools and go through a quick comparison of these.

Top 11 ETL Tools Comparison:

1. Hevo Data

Key Features of Hevo Data

  • Easy Setup and Highly Intuitive User Interface – Hevo has a minimal learning curve and can be set up in minutes. Once the user has configured and connected both the data source and the destination warehouse, Hevo moves data in real-time.
  • Fully Managed – No coding or pipeline maintenance is required from your team.
  • Unlimited Integrations – Hevo can provide connectivity to numerous cloud-based and on-site assets. Check out the complete list here: hevodata.com/integrations
  • Automatic Schema Mapping – Hevo automatically detects the schema of the incoming data and maps it to the destination schema. This feature frees you from the tedious job of manually configuring the schema.
  • Effortless Data Transformations – Hevo provides a simple Python interface to clean, transform, and enrich any data before moving it to the warehouse. Read more on Hevo’s Transformations here. Sign up for a 14-day free trial here and experience efficient and effective ETL.

Hevo Data Pricing:

Hevo’s basic plan starts at $149/month. You can explore the detailed pricing here.
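
The idea behind automatic schema detection, mentioned in the feature list above, can be illustrated with a generic sketch. This is not Hevo's implementation or API; the type names (`BIGINT`, `DOUBLE`, and so on) and the widening rules are assumptions chosen for the example.

```python
from datetime import datetime


def infer_type(value):
    """Map one Python value to a hypothetical warehouse column type."""
    if isinstance(value, bool):  # bool is a subclass of int, so check it first
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, datetime):
        return "TIMESTAMP"
    return "VARCHAR"


def infer_schema(records):
    """Infer one column type per field, widening when records disagree."""
    schema = {}
    for record in records:
        for field, value in record.items():
            if value is None:
                continue  # nulls carry no type information
            new = infer_type(value)
            old = schema.get(field, new)
            if old == new:
                schema[field] = new
            elif {old, new} == {"BIGINT", "DOUBLE"}:
                schema[field] = "DOUBLE"  # integers widen to floating point
            else:
                schema[field] = "VARCHAR"  # fall back to text on any other conflict
    return schema


# Hypothetical incoming events with slightly different shapes.
events = [
    {"id": 1, "price": 10, "active": True},
    {"id": 2, "price": 9.5, "note": "promo"},
]
schema = infer_schema(events)
```

The point of the sketch is the widening step: when a later record disagrees with an earlier one, the column type is relaxed rather than the record being rejected.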

2. Informatica PowerCenter

Key Features of Informatica PowerCenter

  • Informatica PowerCenter provides an on-premise ETL tool that can integrate with a number of traditional database systems.
  • It is an enterprise-grade solution with comprehensive support for data governance, monitoring, master data management, and data masking.
  • Informatica PowerCenter also has a cloud counterpart that can access repositories deployed inside the organization’s premises and execute transformation tasks in Informatica’s cloud.
  • Informatica PowerCenter is primarily batch-based.
  • Informatica connectors now support popular cloud data stores like Amazon DynamoDB and Amazon Redshift. It also supports a variety of data storage solutions and software-as-a-service offerings. You can find a comprehensive list of connectors here.

Informatica PowerCenter Use case

Informatica PowerCenter is better suited for organizations that need enterprise-grade security and data governance for their on-premise data because of mandatory compliance requirements. Even the cloud version of Informatica PowerCenter is geared towards on-premise data, with an emphasis on data security.

Informatica PowerCenter Pricing

Informatica PowerCenter cloud starts from $2,000 per month for its most basic version. You can find the details of Informatica pricing here. Pricing depends on parameters such as the data sources that you need to integrate with, the level of security needed, task flow orchestration requirements, and so on.

Informatica PowerCenter pricing is not transparent and depends a lot on the contract negotiated between the customer and Informatica. AWS and Azure provide Informatica PowerCenter on a pay-as-you-go pricing model.

3. IBM InfoSphere DataStage

Key Features of IBM InfoSphere DataStage

  • Like PowerCenter, this is an enterprise product aimed at bigger organizations with legacy data systems.
  • InfoSphere DataStage has a cloud version that can be hosted in the IBM cloud, but here too the focus is on on-premise databases, with the transformation tasks executed in the cloud.
  • It is primarily batch-based.
  • IBM DataStage has connectors to cloud-based data storage solutions like AWS S3 and Google Cloud Storage. Since it supports JDBC, software-as-a-service data warehouses like Redshift, which provide JDBC connectors, can also be integrated. Support for connectors is not as comprehensive as in Informatica PowerCenter, though.

IBM InfoSphere DataStage Use case

IBM InfoSphere DataStage is suited for enterprise-grade applications that primarily run on on-premise databases.

IBM InfoSphere DataStage Pricing

As in the case of PowerCenter, on-premise pricing for InfoSphere DataStage is not transparent and is negotiated via contracts. InfoSphere cloud pricing starts from $6,800 per month for the smallest cloud deployment. Read more on the pricing here.

4. Talend

Key Features of Talend

  • Talend has a large suite of products ranging from data integration and big data management to data preparation and more. If you are looking for a tool that can take on many of these tasks along with data integration, Talend might be suitable.
  • Talend Data Fabric is a collection of all the tools that come under the Talend umbrella, bundled with platinum customer support.
  • Talend Open Studio is open-source and can be used without paying if you do not use Talend Cloud.
  • Talend supports most cloud and on-premise databases and has connectors to software-as-a-service offerings as well.
  • Talend’s big bet is in the area of multi-cloud and hybrid cloud, where customers with extremely high data protection requirements hedge their risk by using more than one cloud provider along with on-premise systems.
  • Talend Studio provides a UI to design the flow and transformation logic, much like DataStage and PowerCenter.
  • Talend works on a code generation approach, and hence the code has to be rebuilt every time there is a change in logic.
  • Support for connectors is comprehensive, like that of PowerCenter.
  • Compared to PowerCenter and DataStage, Talend is a more recent tool in the same space.
  • Talend can operate both on-premise and in the cloud. It is mostly suited for batch processes.

Talend Use case

Talend is an enterprise-grade solution with a strict emphasis on data governance and hybrid cloud architecture. If you are a financial institution or an enterprise with strict compliance requirements to spread risks across multiple clouds, then Talend can be a good option.

Talend Pricing

Talend also offers varied pricing, based on the set of products and features opted for. Talend data integration basic plan starts at $12,000/year. Read more about Talend pricing here.

5. Pentaho

Key Features of Pentaho

  • Also known as Kettle, Pentaho has an open-source community and an enterprise edition.
  • Like PowerCenter and DataStage, Pentaho is also built to cater to on-premise, batch ETL use cases.
  • It offers data integration and data processing features from a diverse set of data sources. 
  • Pentaho also bets heavily on the hybrid cloud and multi cloud-based architectures. 
  • Pentaho works on the basis of the interpretation of ETL procedures stored in XML format. Since there is no code generation involved, Pentaho is better than Talend in case of ad-hoc analysis.
  • Pentaho does not disclose pricing upfront.
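
The difference between interpreting a stored pipeline definition and generating code from it can be shown with a toy interpreter. The XML layout below is invented for this example and is not Pentaho's actual file format; the step types (`filter`, `rename`) are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical pipeline definition; this is NOT Pentaho's real file format.
PIPELINE_XML = """
<pipeline>
  <step type="filter" field="country" equals="US"/>
  <step type="rename" field="amt" to="amount"/>
</pipeline>
"""


def apply_step(step, rows):
    """Interpret a single <step> element against a list of dict rows."""
    kind = step.get("type")
    if kind == "filter":
        return [r for r in rows if r.get(step.get("field")) == step.get("equals")]
    if kind == "rename":
        old, new = step.get("field"), step.get("to")
        return [{(new if k == old else k): v for k, v in r.items()} for r in rows]
    raise ValueError(f"unknown step type: {kind}")


def run(xml_text, rows):
    """Walk the steps in document order; nothing is generated or compiled."""
    for step in ET.fromstring(xml_text).findall("step"):
        rows = apply_step(step, rows)
    return rows


rows = [{"country": "US", "amt": 5}, {"country": "DE", "amt": 7}]
result = run(PIPELINE_XML, rows)
```

Because the definition is read at run time, editing the XML changes the behavior on the next run with no build step, which is the property that makes an interpreted design convenient for ad-hoc work.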

Pentaho Use case

Pentaho is normally used when companies go for open source ETL tools in an on-premise ecosystem. Unlike the tools mentioned above, Pentaho does not focus on its own cloud. The full Pentaho suite can be deployed on-premise or on any cloud provider. In that sense, it provides complete independence without being tied to any cloud provider.

Pentaho Pricing

The Pentaho community edition is free to use. The price of the Pentaho enterprise edition is not disclosed and is negotiated based on contracts. You can request a quote here.

6. AWS Glue

Key Features of AWS Glue

  • Glue is a cloud-based ETL tool provided by AWS on a pay-as-you-go model.
  • AWS Glue is primarily batch-oriented, but can also support near real-time use cases based on Lambda functions.
  • If most of the data sources that you are looking to ingest data from are on AWS, Glue provides easy methods to ETL the data.
  • Support for sources and destinations outside the AWS ecosystem is not great.
  • Glue has some noteworthy features – integrated data catalog, automatic schema discovery, and more. Read more about AWS Glue here.
  • AWS Glue combined with Lambda functions makes it possible to implement a full-fledged serverless ETL pipeline.

AWS Glue Use case

AWS Glue appeals to people who want to go completely serverless and are fine with staying within the AWS ecosystem, using only AWS services. It appeals especially to companies that do not want to spend money on infrastructure teams to closely monitor and manage their ETL system. The downside is that the data resides completely in the cloud, which may not be suitable for industries with high compliance requirements and hybrid cloud ambitions.

AWS Glue Pricing

AWS Glue has a pay-as-you-go pricing model. It charges an hourly rate, billed by the second. Pricing is expressed in data processing units (DPUs), which are charged at $0.44 per DPU-hour. Read more about AWS Glue pricing here.
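
As a rough worked example of per-second DPU billing, the sketch below estimates the cost of a single job run. It assumes the 0.44 per DPU-hour rate quoted in this article and deliberately ignores any minimum billing duration or regional price differences, so treat the result as an approximation and check the current AWS pricing page.

```python
RATE_PER_DPU_HOUR = 0.44  # rate quoted in this article; may differ by region and over time


def glue_job_cost(dpus, runtime_seconds, rate=RATE_PER_DPU_HOUR):
    """Estimate cost as DPUs x hours x rate, with per-second granularity.

    Simplification: ignores any minimum billing duration that may apply.
    """
    dpu_hours = dpus * runtime_seconds / 3600
    return round(dpu_hours * rate, 4)


# Hypothetical job: 10 DPUs running for 15 minutes.
cost = glue_job_cost(dpus=10, runtime_seconds=15 * 60)
```

For this hypothetical job, 10 DPUs for a quarter of an hour is 2.5 DPU-hours, which at the quoted rate comes to about 1.10.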

7. StreamSets

Key Features of StreamSets

  • StreamSets positions itself as a DataOps tool. It has data monitoring capabilities that stretch beyond traditional ETL.
  • It is a cloud-optimized, real-time ETL tool.
  • It utilizes a Spark-native execution engine to extract and transform data. Customers can build batch and real-time data pipelines with minimal coding.
  • StreamSets supports a large number of origin and destination combinations. A list of all supported origins and destinations can be found here.
  • Support for SaaS offerings is limited in StreamSets.
  • StreamSets comes with a data protector offering that complies with major data security guidelines like HIPAA and GDPR.

StreamSets Use case

StreamSets is a good option if the use case is completely real-time oriented and the organization does not want to be locked into a particular cloud provider. It allows companies to use their preferred on-premise or cloud provider and use StreamSets only for defining their real-time pipelines. If you are using a large number of SaaS offerings, StreamSets is not a preferred option since its SaaS connector support is not comprehensive.

StreamSets Pricing

StreamSets pricing is not disclosed and is based on negotiated contracts.

8. Blendo

Key Features of Blendo

  • Blendo is a real-time, cloud-native ETL tool.
  • Blendo focuses on the extraction and syncing of data, in the style of ELT. It extracts raw data from sources and loads it into destinations without performing transformations.
  • Blendo has over 50 data sources, mainly focusing on SaaS platforms and databases.
  • Blendo does not focus much on compliance and does not make any claims about data security compliance.

Blendo Use case

Blendo is a good option if the company wants an ETL tool with great support for SaaS offerings and does not have strict compliance requirements to maintain data on-premise.

Blendo Pricing

The Blendo base package starts at $150 per month. You can read more about Blendo pricing here.

9. Google Cloud Dataflow

Key Features of Google Cloud Dataflow

  • Google Cloud Dataflow is a fully managed ETL service provided by Google, based on Apache Beam.
  • It is tailor-made for Google Cloud ecosystem sources and destinations.
  • It works well for batch as well as real-time use cases.
  • Using Dataflow, it is possible to run a completely serverless ETL pipeline based on Google ecosystem components.
  • Google Cloud Platform complies with major data security guidelines like HIPAA and GDPR.
  • Since it is designed for the Google Cloud ecosystem, it does not fare well in multi-cloud and hybrid cloud architectures.

Google Cloud Dataflow Use case

Google Cloud Dataflow is a good alternative if the company does not mind being locked into the Google ecosystem and does not have strict compliance requirements with respect to on-premise data. Dataflow makes sense in scenarios where the customer is not interested in managing their own infrastructure and wants a serverless ETL model.

Google Cloud Dataflow Pricing

Google Cloud Dataflow is billed on a per-hour basis for CPU, memory, storage, and data processing units. You can find more details of its pricing here.

10. Azure Data Factory

Key Features of Azure Data Factory

  • Azure Data Factory is the Microsoft counterpart of AWS Glue and Google Cloud Dataflow.
  • It is a fully managed service focusing more on Azure-based destinations.
  • It supports both real-time and batch-based ETL flows.
  • Data Factory can run a completely serverless ETL pipeline using Azure components.
  • Like its AWS and Google Cloud counterparts, it complies with almost all data security guidelines.
  • Azure Data Factory is not suited for multi-cloud or hybrid cloud architectures.

Azure Data Factory Use case

Data Factory is a good alternative for teams that are well invested in the Azure ecosystem and do not mind being locked into it. Customers who are comfortable with data being on the Azure cloud and do not have multi-cloud or hybrid cloud requirements may prefer it.

Azure Data Factory Pricing

Azure Data Factory is priced based on the number of activity runs per month. You can find more details about the pricing here.

11. Apache NiFi

Key Features of Apache NiFi

  • Apache NiFi is open-source data flow automation software that can be used to execute ETL flows between various sources and destinations.
  • It is more suited for real-time processing, with rudimentary support for batch-based processing.
  • Apache NiFi does not ship with integrations for SaaS offerings. Since it is open-source, developers can build them as custom implementations.
  • Apache NiFi is not locked to any cloud provider and can run on-premise or on practically any cloud provider.
  • All compliance and data security becomes the responsibility of the infrastructure team when NiFi is deployed on-premise.

Apache NiFi Use case

Apache NiFi is a good option when the use case is primarily real-time, the organization does not want to be locked into any cloud provider, and the team is comfortable building its own SaaS integrations and owning compliance and data security.

Apache NiFi Pricing

Apache NiFi is open source and free to use.

Different Types of ETL Tools

ETL tools come in many different flavors depending on the kind of use case, the SLAs involved, the location of the data storage, and even their implementation philosophy and pricing. Let’s look at the three major classifications that exist in the ETL tools space, along with the use case and the need for each one. By the end, you will have the right lens to pick the best ETL tool for your business.

Classifying ETL Tools based on Data Load Frequency:

1. Batch ETL Tools

In the early days, bandwidth and computing power were very expensive and/or required rationing. The ETL process required (and still does, to some extent) significant portions of said resources. This led to a trade-off between performing daily business analytics and performing ETL.

The resulting compromise would be to run the ETL in batches during business off-hours, which became known as the “batch window”. Even now, after decades of development and evolution in computing and communication, the Batch ETL process is still used.

You’ll find this practiced in situations where:

  • Legacy systems are still used for vital business processes
  • The systems that are used to host the data source and/or data warehouse are on-site
  • The activity of data extraction would interrupt the functioning of business transaction systems
  • The volume of data is so large that it requires privileged use of available resources for efficient processing

This method is also used by many who may not be in such specific scenarios due to its benefits including simplicity, limited user involvement, and improved data quality.

What are some of the Best Batch ETL Tools?

Of the tools compared above, Informatica PowerCenter, IBM InfoSphere DataStage, Talend, and Pentaho are primarily batch-oriented.

2. Real-Time ETL Tools

Business intelligence is only as good as the data it’s based on, and, increasingly, decisions need to be based on what’s going on now. As such, the real-time availability of data is becoming a more fundamental issue. Fortunately, advancements in technology allow companies to get a full snapshot of their business activities in real-time (or near-real-time). This is done using a process called data streaming.

Event records can be moved immediately to the relevant destinations, almost regardless of volume, allowing analysts to glean useful information from the most up-to-date data sets. Some of the best ETL applications today are able to take full advantage of existing technologies and infrastructure to construct highly efficient data pipelines that move data from multiple data sources to the data warehouse in real-time.
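
The contrast between batch and streaming delivery can be sketched as two small functions over the same sequence of events. This is an illustration of the two delivery styles only; the event ids, the batch size of 3, and the `handle` callback are assumptions, and a production streaming pipeline would read from a message queue or change log rather than a Python list.

```python
from itertools import islice


def batches(records, size):
    """Batch style: yield lists of up to `size` records at a time."""
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk


def stream(records, handle):
    """Streaming style: hand each record to `handle` as soon as it arrives."""
    count = 0
    for record in records:
        handle(record)
        count += 1
    return count


events = list(range(1, 8))          # seven hypothetical event ids
batched = list(batches(events, 3))  # delivered in three bulk operations
delivered = []
streamed = stream(events, delivered.append)
```

In the batch version the destination sees nothing until a chunk is full or the input ends, while in the streaming version every event is available downstream as soon as it is handled.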

What are some of the Best Real-time ETL Tools?

Of the tools compared above, Hevo Data, StreamSets, Blendo, Google Cloud Dataflow, Azure Data Factory, and Apache NiFi support real-time data replication.

Classifying ETL Tools based on Deployment Type:

1. Cloud-Based ETL Tools

Increasingly, more applications are moving to the cloud. Naturally, users are shifting their resources to cloud-based services. Services like Google Analytics, HubSpot, Salesforce, Zendesk, Shopify, and more are being used by companies to operate.

Despite this change, companies still need to move their data from these cloud-based applications to their data warehouses for analysis and insights. 

Notably, companies like Google and Amazon are providing cloud-based solutions for data warehousing – Google BigQuery and Amazon Redshift. Another player, Snowflake, has emerged as one of the well-known names in the cloud data warehousing sector.

These services result in significant cost savings for organizations, as they are no longer required to spend large amounts of capital on building and maintaining infrastructure for their data warehousing needs. Companies also benefit from the colossal computing power provided by the vast IT infrastructure of behemoths like Google and Amazon. For a relatively small expense, customers can run queries on terabytes of data and receive results in mere seconds.

A natural consequence of this trend is that companies have developed Data Integration tools that provide robust ETL from various cloud-based applications to the warehouse.

What are some of the Top Cloud-based ETL Tools?

Of the tools compared above, Hevo Data, AWS Glue, Blendo, Google Cloud Dataflow, and Azure Data Factory are cloud-based.

2. Public Cloud and On-premise ETL Tools

On-premise deployment means the entire system is deployed in the customer’s own data centers, with the customer maintaining the infrastructure as well as the software. A public cloud-based deployment is when an organization uses a cloud service provider like AWS or Azure and deploys the tools of its choice on the provider’s virtual machines. In this case, only the infrastructure is provided by the cloud provider and the organization maintains the software.

What are some of the Top On-premise ETL Tools?

Of the tools compared above, Informatica PowerCenter, IBM InfoSphere DataStage, Talend, Pentaho, and Apache NiFi can be deployed on-premise.

Classifying ETL Tools based on Pricing:

1. Free and Open Source ETL Tools

Open source tools are free to use and modify. The possibility of customization is the biggest advantage of such tools. They usually come with a very active community. However, they require specialized skills to deploy and maintain as a highly reliable service. Typically there is no enterprise-level support for such tools. Another disadvantage is that such tools rise and fall in popularity according to the preferences of developers. At times, companies end up stuck for years with tools whose community support died down after their fall in popularity.

Of the tools mentioned here, Talend, Pentaho, and Apache NiFi are open source and free to use, as is Apache Airflow, which is not covered in this comparison. The enterprise versions of Pentaho and Talend are paid versions that come with dedicated support and assistance.

What are some of the Top Open Source ETL Tools?

Of the tools compared above, Talend Open Studio, the Pentaho community edition, and Apache NiFi are open source.

2. Completely Managed, Paid ETL Tools

Paid ETL tools are often completely managed and come with reliable support from the vendor. Some open-source tools also come with a paid enterprise edition that is supported by the product company. Paid tools come either as self-deployed software or as completely managed services. Completely managed services spare the customer from maintaining the infrastructure and run the load and transformation tasks in the vendor’s own cloud.

What are some of the Top Completely Managed ETL Tools?

Of the tools compared above, Hevo Data, AWS Glue, Google Cloud Dataflow, and Azure Data Factory are completely managed services.

How to choose the right ETL Solution for your organization?

Choosing an ETL tool depends on a lot of factors. Some of the most critical factors that affect this decision are explored below.

  1. Ability to Integrate with Various Data Sources

    Comprehensive connector support for everything from database systems to marketing and management software and various completely managed services should be the first factor to consider while choosing an ETL tool. Even if you are not using many SaaS providers or databases today, it makes sense to choose the tool with the most comprehensive support in order to future-proof the ETL pipeline.

  2. Real-time Data Streaming

    Gone are the days when organizations could manage with batch processes and service level agreements measured in hours and days. Modern business requires quick decision making, and even a second of delay could lead you to lose customers. Real-time data streaming and event-based processing are must-have features in an ETL tool.

  3. Robust Data Transformations

    Support for complex data transformations plays a critical role in the performance of any ETL tool. It affects not only the development time but also the execution time. As time goes by, data volume and the complexity of data structures increase, and the wrong choice of ETL tool can severely dent your ability to extract value from your data.

  4. Reliable Data Load

    An ETL tool generally has a close coupling with the data warehouse system that is being used. If this relationship is not perfect, the entire pipeline’s efficiency and reliability go down, leading to service level agreement violations.

  5. Ease of Use and Implementation

    An ETL implementation is a Herculean task, and most of the work involved is in defining data flows and deploying the jobs. An easy-to-use GUI to specify the data flow and a reliable deployment process can result in large cost savings.
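
One simple way to apply the five factors above is a weighted scoring sheet. The sketch below is only an illustration of that idea: the weights, the 0 to 5 scale, and the two candidates (`tool_a`, `tool_b`) are made-up placeholders, not ratings of any real product.

```python
# Hypothetical weights for the five factors above (they sum to 1.0).
WEIGHTS = {
    "connectors": 0.30,
    "real_time": 0.20,
    "transformations": 0.20,
    "reliable_load": 0.20,
    "ease_of_use": 0.10,
}


def weighted_score(scores, weights=WEIGHTS):
    """Combine per-factor scores (0 to 5) into a single weighted score."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return round(sum(weights[f] * scores[f] for f in weights), 2)


def rank(candidates):
    """Return candidate names ordered from highest to lowest score."""
    return sorted(candidates, key=lambda name: weighted_score(candidates[name]), reverse=True)


# Placeholder scores for two made-up candidates, not real product ratings.
candidates = {
    "tool_a": {"connectors": 5, "real_time": 2, "transformations": 4, "reliable_load": 4, "ease_of_use": 3},
    "tool_b": {"connectors": 3, "real_time": 5, "transformations": 3, "reliable_load": 3, "ease_of_use": 5},
}
ranking = rank(candidates)
```

Adjusting the weights to match your own priorities, for example raising `real_time` for a streaming-heavy workload, is the main design choice; the arithmetic itself stays the same.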

Naturally, in order to decide what ETL solution may be best for you, you would have to review your organization’s own use cases first.

If you are responsible for the data infrastructure of a long-established institution that has systems based on, or integrated with, legacy software, or you maintain systems that require batch processing of your data, then chances are you are wedded to an ETL solution that is compatible with the systems your organization has been using so far. If, on the other hand, your organization is taking steps to migrate some or all of its OLTP and OLAP assets to more modern solutions, then read on.

If the major concerns in your data integration needs are:

  1. Real-time availability of data
  2. Ability to move data from cloud-based platforms

A hassle-free, modern data integration platform like Hevo might suit your needs. Hevo brings data from 100s of disparate data sources into the Warehouse in Real-time – without writing a single line of code. 

What are your thoughts on the ETL tools shared? Which is the ETL tool of your choice? Let us know your thoughts in the comments.
