Azure Fault Tolerance: A Comprehensive Guide 101

• May 25th, 2022

Microsoft Azure is a public cloud computing platform that offers Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) solutions for services like Analytics, Virtual Computing, Storage, and Networking.

Fault tolerance is the ability of a system to continue operating despite hardware or software failures or malfunctions. Azure Fault Tolerance refers to the mechanisms Azure uses to keep services and applications running when such failures occur.

This article talks about the different aspects of Azure Fault Tolerance and concepts related to it. It also discusses Microsoft Azure briefly.

What is Microsoft Azure?

Microsoft Azure, also known as Azure, is a cloud computing service provided by Microsoft for the management of applications through Microsoft-managed data centers. It supports a wide range of programming languages, tools, and frameworks, including Microsoft-developed and third-party software.

Azure was first announced in October 2008 at Microsoft’s Professional Developers Conference (PDC) and was codenamed “Project Red Dog” internally. It was formally released in February 2010 as Windows Azure, before being renamed to Microsoft Azure on March 25, 2014.

Azure is billed on a pay-as-you-go basis, which means that subscribers receive a monthly bill that only includes the resources they have used.

Microsoft Azure has a wide range of applications due to its numerous service offerings. One of the most common uses for Microsoft Azure is to run virtual machines or containers in the cloud. These compute resources can be used to run infrastructure components like DNS servers, Windows Server Services like Internet Information Services (IIS), or third-party applications. Third-party operating systems, such as Linux, are also supported by Microsoft.

Azure is also frequently used as a cloud database hosting platform. Microsoft offers relational databases with serverless options, such as Azure SQL, as well as non-relational (NoSQL) databases. Furthermore, the platform is frequently used for disaster recovery and backup. To meet their long-term data retention requirements, many organizations use Azure storage as an archive.

However, Azure’s advantages go beyond cost savings. The combination of Azure and Office 365 can greatly simplify the task of administering certain technologies such as Windows Server, Active Directory, and SharePoint. This allows IT staff to focus on new projects rather than general system maintenance.

Microsoft is courting businesses to move their AI compute operations to Azure. At Microsoft’s Build 2019 developer conference, Project Brainwave–an FPGA-based Deep Learning system designed for real-time AI–was released as a preview to Azure. Microsoft also added features to Azure Machine Learning Service to help developers use “a no-code approach to model creation and deployment using a new visual machine-learning interface,” as well as Cognitive Services algorithms to generate insights from structured or unstructured content.

Key Features of Microsoft Azure

  • Infrastructure as a Service (IaaS): Microsoft Azure offers IaaS, which enables businesses to deploy and manage applications quickly and easily. Azure allows businesses to tailor cloud software to their specific needs.
  • Strong Support in Analytics: Microsoft Azure comes with built-in Data Analysis and key insight support. Cortana Analytics, Stream Analytics, Machine Learning, and SQL Services are all included in the service. These features will aid businesses in identifying new business opportunities, improving customer service, and making educated decisions.
  • Enhance Existing IT Support: One of the most appealing aspects of Azure is how well it integrates with the existing IT department. Hybrid Databases, Storage Solutions, and Secure Private Connections are all used to accomplish this. Microsoft Azure can coexist peacefully with your data center in your business environment. As a result, Azure is one of the most Cost-effective and User-friendly cloud services available.
  • Unique Storage System: Azure has a large number of delivery points and data centers compared with other cloud services. As a result, it can provide an optimal user experience and deliver content to your business environment more quickly. Users can store data in Azure in a fast and secure environment. Businesses can also share content among multiple virtual machines.
  • Enhanced Scalability: Microsoft Azure is a pay-as-you-go service that can be scaled up or down quickly to meet your business’s needs and environment. This makes it a practical solution for many businesses of varying sizes.
  • Enhanced Flexibility: Azure is adaptable; it allows your company to use any level of functionality it requires. It supports many of the same technologies that many developers and IT professionals already use. With almost no downtime, your company can quickly deploy web apps to Azure and update them.

Understanding Azure Fault Tolerance

To understand Azure Fault Tolerance, look at the following:

Azure Fault Tolerance: What is Fault Tolerance?

  • Microsoft Azure has several mechanisms in place to ensure that services and applications continue to function in the event of a failure. Examples of such failures include hardware failures, such as hard-drive crashes, and temporary availability issues in services such as storage or networking. Azure and its software-controlled infrastructure are designed to anticipate and manage outages like these.
  • Fault tolerance is the ability of a system to continue to function properly even if some of its components fail (or have one or more faults within them). In a fault-tolerant system, any decrease in operating quality is proportional to the severity of the failure, whereas in a naively designed system even a minor failure can cause a total breakdown. Fault tolerance is especially important in high-availability or life-critical systems.
  • For example, if your Azure application runs on two Virtual Machines, Microsoft ensures that those two VMs are allocated within the infrastructure so that a single system failure does not affect both of them.
  • Azure Fault tolerance is designed to deal with failure on a small scale, such as switching from an unhealthy virtual machine to a healthy one. However, much larger failures can occur on occasion. Natural disasters, for example, can have an impact on all resources in a region.
  • The Azure Infrastructure (the Fabric Controller) reacts quickly in the event of a failure to restore services and infrastructure. If a Virtual Machine (VM) fails on a physical host due to a hardware failure, the Fabric Controller moves the VM to another physical node, reusing the same virtual hard disk held in Azure storage. Azure is also capable of coordinating upgrades and updates so that service downtime is avoided.
  • Fault Domains and Upgrade Domains are the most fundamental concepts for enabling the high availability of computing resources (such as cloud services, traditional IaaS VMs, and VM scale sets). These have existed in Azure since the beginning.

Azure Fault Tolerance: The Azure Datacenter Architecture

  • A high-level view of how Azure data centers are structured can help you fully understand fault domains and upgrade domains. Microsoft refers to the architecture of its Azure datacenters as Quantum 10, which is designed to support higher throughput.
  • Its topology implements a full, non-blocking, meshed network that provides each Azure datacenter with a high-bandwidth aggregate backplane, as shown in the diagram below.
[Figure: Azure datacenter architecture (Quantum 10)]
  • Nodes are grouped into racks, and a group of racks forms a cluster. Each data center contains numerous clusters of various types. Some clusters are in charge of storage, while others are in charge of computing, SQL, and other tasks. The Top-Of-Rack (TOR) switch is the rack’s single point of failure.
  • All of the machines or nodes in the cluster are managed by the Fabric Controller, which coordinates deployments across cluster nodes. For fault tolerance, each cluster has multiple Fabric Controllers. The Fabric Controller must know the health of every node in the cluster. This knowledge lets it detect failures so that deployments can be automatically repaired by re-provisioning affected VMs on different physical nodes.
  • To help the Fabric Controller determine the health of a node, every machine in a cluster runs agents that continuously monitor node health and report it back to the Fabric Controller. The following components interact to achieve this:
    • Host OS: The operating system installed on the physical machine.
    • Host Agent: A process that runs on individual nodes and allows that machine to communicate with the Fabric Controller.
    • Guest OS: The operating system running within the virtual machine.
    • Guest Agent: Resides in the virtual machine and communicates with the host agent to monitor and maintain the virtual machine’s health.
[Figure: components of a physical node]
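The monitoring loop described above can be sketched in a few lines of Python. This is only an illustration, not how the Fabric Controller is actually implemented: the `Node` class, the 30-second heartbeat timeout, and the "least-loaded healthy node" placement rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a physical node reporting to a controller."""
    name: str
    last_heartbeat: float                 # time of the latest agent report, in seconds
    vms: list = field(default_factory=list)


def find_unhealthy(nodes, now, timeout=30.0):
    """Return nodes whose agent has not reported within `timeout` seconds."""
    return [n for n in nodes if now - n.last_heartbeat > timeout]


def reprovision(nodes, now, timeout=30.0):
    """Move VMs off unhealthy nodes onto the least-loaded healthy node.

    Returns a list of (vm, from_node, to_node) tuples describing each move.
    """
    unhealthy_names = {n.name for n in find_unhealthy(nodes, now, timeout)}
    healthy = [n for n in nodes if n.name not in unhealthy_names]
    moves = []
    for bad in (n for n in nodes if n.name in unhealthy_names):
        while bad.vms and healthy:
            vm = bad.vms.pop()
            target = min(healthy, key=lambda n: len(n.vms))
            target.vms.append(vm)
            moves.append((vm, bad.name, target.name))
    return moves


nodes = [
    Node("node-a", last_heartbeat=100.0, vms=["web1"]),
    Node("node-b", last_heartbeat=40.0, vms=["web2", "sql1"]),  # stale heartbeat
    Node("node-c", last_heartbeat=95.0),
]
moves = reprovision(nodes, now=105.0)
for vm, src, dst in moves:
    print(f"{vm}: {src} -> {dst}")
```

In this toy run, only node-b has missed its heartbeat window, so its two VMs are spread over the two healthy nodes.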

Azure Fault Tolerance: Fault Domains and Upgrade Domains

  • Fault domains and upgrade domains are central to understanding Azure Fault Tolerance. To ensure the high availability of any Platform-as-a-Service (PaaS) application, the Fabric Controller distributes the application’s instances across different fault domains and upgrade domains.
  • A fault domain is a logical unit of failure that is closely tied to the physical infrastructure of the data centers. In Azure, each rack of servers is treated as a fault domain. Azure guarantees that any PaaS application with more than one instance hosted on the platform will be spread across multiple fault domains. However, the Fabric Controller determines the total number of fault domains over which the application’s instances are spread, based on machine availability within the datacenter.
  • An upgrade domain is a logical unit that helps keep the application running when you push updates to the system. The number of upgrade domains is a user-definable setting for PaaS applications. By default, an Azure application uses five upgrade domains.
[Figure: upgrade domains]
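To make the two kinds of domains more concrete, here is a minimal Python sketch that assigns each instance a fault domain and an upgrade domain. The round-robin rule and the `assign_domains` function are assumptions for illustration only; the actual allocation is decided by the Fabric Controller based on machine availability.

```python
def assign_domains(instance_names, fault_domains=2, upgrade_domains=5):
    """Assign each instance a fault domain and an upgrade domain.

    Round-robin is a simplifying assumption, not Azure's real algorithm.
    """
    return {
        name: {
            "fault_domain": i % fault_domains,
            "upgrade_domain": i % upgrade_domains,
        }
        for i, name in enumerate(instance_names)
    }


placement = assign_domains(
    ["web0", "web1", "web2", "web3"], fault_domains=2, upgrade_domains=3
)
for name, domains in placement.items():
    print(name, domains)
```

With four instances, two fault domains, and three upgrade domains, no single fault domain or upgrade domain contains every instance, which is the property that keeps the application reachable.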

Fault Domains, Upgrade Domains, and IaaS VMs

  • Fault domains and upgrade domains also apply to IaaS VMs. Azure introduced the concept of Availability Sets to help spread Infrastructure-as-a-Service (IaaS) VMs across fault domains and upgrade domains. All instances in an availability set are split between two or more fault domains and given distinct upgrade domain values.
  • You won’t be able to get Service-Level Agreements (SLAs) for VMs unless you assign them to an availability set. This is crucial to understand because it determines how you can maintain high availability for your services and applications even if Azure datacenters experience failures or upgrades. Assigning your VMs to an availability set is the only way to avoid being affected by such events.
  • The following scenario illustrates why this matters. The Azure product team regularly pushes operating system updates to all data centers. To update an entire data center, both the host OS (on the physical machines) and the guest OS (on the VMs hosting PaaS applications or your own IaaS VMs) must be brought up to the latest version. To deploy the updates without disrupting application availability:
    • The host OS updates are carried out one fault domain at a time across the data center, for all available fault domains.
    • Every user application is updated one upgrade domain at a time, for all available upgrade domains.

As long as you run at least two instances per service or at least two VMs as part of an availability set (such as a load-balanced Web service, SQL Server AlwaysOn Availability Group nodes, and so on), Azure can push upgrades to its infrastructure while maintaining service availability.
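The "one domain at a time" rule can be checked with a small simulation. The sketch below is a simplified model under stated assumptions: the `min_online_during_rollout` function and the sample placements are hypothetical, and an update is modeled as taking every instance in one domain offline at once.

```python
def min_online_during_rollout(placement, domain_key):
    """Simulate updating one domain at a time.

    Returns the smallest number of instances that stay online at any step.
    `placement` maps instance name -> {"fault_domain": int, "upgrade_domain": int}.
    """
    domains = sorted({p[domain_key] for p in placement.values()})
    min_online = len(placement)
    for domain in domains:
        online = [n for n, p in placement.items() if p[domain_key] != domain]
        min_online = min(min_online, len(online))
    return min_online


# Two VMs in one availability set, spread over two fault and upgrade domains.
availability_set = {
    "vm1": {"fault_domain": 0, "upgrade_domain": 0},
    "vm2": {"fault_domain": 1, "upgrade_domain": 1},
}
# A single VM has nowhere to fail over to while its domain is updated.
single_vm = {"vm1": {"fault_domain": 0, "upgrade_domain": 0}}

print(min_online_during_rollout(availability_set, "fault_domain"))
print(min_online_during_rollout(availability_set, "upgrade_domain"))
print(min_online_during_rollout(single_vm, "fault_domain"))
```

The two-VM availability set always keeps one instance online during the rollout, while the single VM drops to zero, which is why at least two instances are needed.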

Azure Fault Tolerance: How Many Fault Domains?

  • Determining the number of fault domains is a critical aspect of Azure Fault Tolerance. For PaaS-like workloads, which are mostly stateless, fault and upgrade domains can help keep things running. This works well for stateless Web applications: the overall Web application or service remains available even if a subset of the nodes becomes unavailable during upgrade cycles or temporary downtimes.
  • When it comes to stateful infrastructures, such as database servers (be it RDBMS or NoSQL), the situation becomes more complicated. Knowing that your servers are distributed across multiple fault domains may not be sufficient in these situations.
  • When using IaaS, Azure guarantees that VMs in the same availability set will be deployed on at least two fault domains (and therefore two racks). VMs in an availability set may be deployed across more than two fault domains, but there is no guarantee. When deploying to North or West Europe, as well as several US regions, the project has always ended up with exactly two fault domains.
[Figure: number of fault domains]
  • The figure above shows the output of the Get-AzureVM command in Azure PowerShell, displayed in a GridView control. The VMs in the cluster are distributed across two fault domains and three upgrade domains, and they are all part of the same availability set (not visible in the grid).
  • In that sample deployment, the sql1 and sqlwitness nodes share a physical rack, which is connected to the rest of the data center through a single TOR switch. The sql2 node is housed in a different rack.
  • Being deployed across only two fault domains shouldn’t be a problem for stateless applications like Web APIs or Web applications, at least from an availability standpoint. It’s a different story for stateful workloads like database servers.
  • Depending on how a cluster operates, it may be necessary to know how many nodes can fail before the cluster’s health is jeopardized. If a cluster relies on quorum votes or majority votes for certain operations, such as electing new masters or confirming consistency for reading requests, the question of how many nodes can go down in a worst-case scenario becomes more critical.
  • Although the Fabric Controller recovers your VMs automatically after a fault domain failure, the question of how many nodes can fail at the same time is still relevant. A recovery operation may take some time, depending on how long the Fabric Controller takes to recover the VM and how long the database system running on it takes to come back.
  • This topic becomes even more important once you understand Azure’s internal behavior during upgrades. For IaaS VMs, upgrades to the host OS running on the hypervisor of each physical node that hosts your VMs are rolled out by fault domain, not by upgrade domain as many developers assume. Upgrade domains are only used to update PaaS VM-based applications. As a result, you’ll be affected by host OS upgrades, which customarily occur about once a quarter. If your cluster is spread across only two fault domains and relies on majority votes and the like, you might experience intermittent downtime.
  • Deploying a MongoDB replica set in Azure is a good example of this. Each MongoDB replica set requires one primary (master) node. If the primary fails, the remaining nodes vote for a new primary, and a majority of votes is required for the election to succeed. If there aren’t enough nodes up to vote for a new primary, the entire replica set is declared unhealthy and considered “down.”
  • The documentation for MongoDB specifies a fault tolerance per replica set size. In a replica set of three nodes, only one node can fail. As shown in the table below, in a replica set of five nodes, the maximum number of nodes that can fail without the replica set going down is two.
Number of Members    Majority Required to Elect a New Primary    Fault Tolerance
3                    2                                           1
4                    3                                           1
5                    3                                           2
6                    4                                           2
  • MongoDB provides the concept of an arbiter for cases where a replica set has an even number of database nodes (for example, two). An arbiter acts as a voting member in elections, but it does not manage the entire database stack, which saves costs and resources. So, if you have a MongoDB replica set with two database nodes, you’ll need a third node, the arbiter, to provide an extra vote for majority-based primary elections in the event of failures.
  • A SQL Server AlwaysOn Availability Group is similar: a majority of nodes is required to elect a new primary node. It also supports a voting-only member that operates on the same principle. In SQL Server, this member is called a witness (rather than an arbiter, as in MongoDB).
  • In the sample SQL Server deployment, sql1 and sqlwitness are on one fault domain (fault domain “0”), and sql2 is on another. If fault domain “0” fails, only sql2 is left. However, sql2 alone does not constitute a majority, so the cluster cannot elect a new primary. If fault domain “0” fails, the entire cluster becomes unhealthy.
  • If sql1 and sql2 ended up on the same fault domain, the situation would be much worse. Both database nodes would then be unavailable until the Fabric Controller recovered them from the failure or finished the host OS upgrade process.
  • A MongoDB replica set presents a similar scenario. According to the table from the official MongoDB documentation, only one node in a replica set of three nodes can fail if the replica set is to remain active and available. As a result, the distribution of your nodes across fault domains is critical, because it determines how both Azure host OS upgrades and potential failures affect you.
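The majority arithmetic behind these scenarios is easy to verify in code. The following Python sketch assumes that every node (including a witness or arbiter) has exactly one vote and that losing a fault domain takes down every node in it; the function names are hypothetical and chosen for this example.

```python
def majority(members):
    """Votes required to elect a new primary."""
    return members // 2 + 1


def fault_tolerance(members):
    """How many members can fail while a majority remains."""
    return members - majority(members)


def survives_any_fault_domain_loss(placement):
    """placement maps node name -> fault domain number.

    Returns True if a majority of voters stays up no matter which single
    fault domain goes down.
    """
    needed = majority(len(placement))
    for failed in set(placement.values()):
        remaining = sum(1 for fd in placement.values() if fd != failed)
        if remaining < needed:
            return False
    return True


# The sample deployment from this section: two fault domains.
sample = {"sql1": 0, "sqlwitness": 0, "sql2": 1}
# A hypothetical layout with one voter per fault domain.
three_domains = {"sql1": 0, "sql2": 1, "sqlwitness": 2}

print(fault_tolerance(3), fault_tolerance(5))
print(survives_any_fault_domain_loss(sample))
print(survives_any_fault_domain_loss(three_domains))
```

For the sample deployment, losing fault domain 0 leaves one voter out of three, which is below the required majority of two. With one voter per fault domain, any single fault domain can fail without losing quorum.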

Azure Fault Tolerance: High Availability in Stateful Services

  • Given that Azure mostly deploys VMs across two fault domains, a valid question is how you can achieve high availability. This question can be answered in two ways: for the medium term and for the short term.
  • In the medium term, the Azure product team is working hard to drastically improve the situation. Azure can deploy your workloads across a minimum of three fault domains when you use version 2 IaaS VMs (based on the new Azure Resource Manager API). This is a compelling reason to use Azure Resource Manager and version 2 VMs.
  • In the short term, or for as long as you rely on traditional Azure Service Management and version 1 IaaS VMs, it’s not that simple. You have two options to reduce the risk of downtimes, depending on your SLA, Recovery Time Objective (RTO), and Recovery Point Objective (RPO) targets. Both methods are based on a SQL Server AlwaysOn Availability Group, as shown in the diagram below.
[Figure: high availability in stateful services]
  • The goal is to minimize the consequences of both routine host OS upgrades and occasional failures. For a three-node cluster in a single data center, deploy two nodes within the same VM availability set and one node outside it.
  • With that approach, the effects of host OS upgrades are nearly completely mitigated. Upgrades to VMs within availability sets occur at a different time than single VM host OS upgrades. Host OSes with single VMs that do not use availability sets are typically upgraded a week before those with availability sets.
  • For fault domain failures, you can only reduce the likelihood of being affected. There’s always a chance that the node outside the availability set will land on the same fault domain as one of the VMs within the availability set. Much depends on the amount of resources consumed and available in the Azure data center.
  • There is no need for such a solution in an Active Directory Domain. For high availability, there’s only a primary and backup domain controller. That’s two nodes evenly distributed across two fault domains, as promised by Azure for version 1 IaaS VMs.
  • That still leaves you with the task of improving the placement of your SQL nodes. By deploying sqlwitness in a different data center, you don’t just lower the probability that it shares a fault domain with sql1 or sql2; you eliminate that risk.
  • Depending on your SLA, RPO, and RTO requirements, as well as the price you’re willing to pay for high availability, you have two options: a fully functional backup deployment in a second region, or just the arbiter/witness in the secondary region.
  • You can replicate your entire deployment in a secondary region with a fully functional secondary deployment. This includes your front-end and middle-tier applications and services as well. Customers could be redirected to the secondary region in the event of a major failure in the primary region.
  • These deployments are typically designed to meet demanding RPO/RTO goals. To achieve short RPOs and RTOs, database systems like MongoDB or SQL AlwaysOn typically span the replica set or SQL AG cluster across two regions, with ongoing replication enabled across those regions. Due to latency and performance issues, replication across regions will most likely be asynchronous; however, replication lag will range from milliseconds to minutes, rather than double-digit minutes or hours.
  • Running a single VM in the secondary region as a witness or arbiter, on the other hand, is a much cheaper option. When all you need to do is keep your primary cluster alive in the event of a fault domain failure, it’ll suffice. It does not allow you to failover to a secondary region without taking some additional steps, such as spinning up new nodes and virtual machines in the secondary region.
  • In the example, in the secondary region, you could run a full SQL node as the only node. It would have different upgrade cycles because it is run as a single VM. Because it runs in a different data center, the chances of it being upgraded or failing at the same time as the primary availability set’s nodes are slim.
  • It’s not easy to achieve high availability and Azure fault tolerance for your applications and services. It necessitates an understanding of and adjustment to basic concepts. Fault domains, upgrade domains, and availability sets must all be understood. When migrating to Azure, it’s especially important to understand the fault-tolerance requirements of the stateful systems you’re using in your infrastructure. In Azure, map those fault-tolerance requirements to the fault domain and upgrade domain behaviors.
  • For completely new IaaS deployments, use version 2 IaaS VMs with Azure Resource Manager and resource groups. This way, you’ll gain the fault tolerance of being deployed across at least three fault domains. The tips in this section can help you mitigate the effects of fault domain outages and maintenance events like host OS upgrades. You’ll be able to achieve high availability without any unpleasant surprises if you embrace and adapt the concepts outlined here.
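The witness-in-a-secondary-region option can be reasoned about with a short Python sketch. It is a rough model under assumptions: locations are written as hypothetical "region/fault-domain" strings, each node has one vote, and a failure removes every node whose location starts with the failed prefix.

```python
def has_quorum_after(placement, failed):
    """placement maps voter name -> location such as 'primary/fd0'.

    `failed` is a location prefix: 'primary/fd0' fails one fault domain,
    'primary' fails the whole region. Returns True if a majority survives.
    """
    needed = len(placement) // 2 + 1
    alive = [v for v, loc in placement.items() if not loc.startswith(failed)]
    return len(alive) >= needed


# Two SQL nodes in the primary region, witness alone in the secondary region.
layout = {
    "sql1": "primary/fd0",
    "sql2": "primary/fd1",
    "sqlwitness": "secondary/fd0",
}

for failure in ["primary/fd0", "primary/fd1", "secondary", "primary"]:
    print(failure, has_quorum_after(layout, failure))
```

In this model the cluster keeps its majority when any single fault domain or the secondary region fails, but not when the whole primary region goes down. That matches the point above: a witness-only secondary region keeps the primary cluster alive, while failing over to the secondary region takes additional steps.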


This blog explains Azure Fault Tolerance and how it is handled in detail. In addition to that, it gives an overview of Microsoft Azure and its Key Features.
