Fault tolerance refers to a system’s capacity to continue running normally even if one or more of its components fail. This holds whether the system is a single computer, a cloud cluster, a network, or something else. For a computer, fault tolerance describes how the operating system (OS) reacts to and accommodates software and hardware problems.

An OS’s capacity to recover from and tolerate faults without failing can be provided by hardware, software, or a hybrid solution that employs load balancers. To handle problems gracefully, some computer systems employ multiple duplicate fault tolerant networks.

In this article, you will learn about fault tolerance, how it works, and how to design a fault tolerant network.

What is Fault Tolerance?

The goal of designing a fault-tolerant network is to prevent disruptions from a single point of failure while ensuring high availability and business continuity for critical applications or systems. Backup components step in automatically when something fails, so your services keep running smoothly. Here’s how it works:

  • Hardware Backups: Fault tolerance for hardware involves having identical or equivalent systems as backups. For example, you can run two identical servers in parallel, mirroring all actions to a backup server. If one fails, the other takes over without missing a beat.
  • Software Backups: For software, additional instances ensure reliability. A customer database, for example, can be regularly copied to a backup system. If the main database fails, operations automatically switch to the backup.
  • Backup Power Sources: Many businesses use alternative power sources, like generators, to stay operational if the main power line goes down.

Using redundancy, you can eliminate single points of failure and make systems more reliable. For instance, a fault-tolerant network might have an identical backup network running in parallel, mirroring all operations for extra safety.
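As a rough illustration of the backup behaviour described above, the following Python sketch tries a primary server first and falls back to a backup when the primary fails. The `Server` class, the `serve_with_failover` function, and the server names are hypothetical and exist only for this example; real failover also involves state replication and health monitoring, which are not modelled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        # A failed component is modelled as a connection error.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve_with_failover(servers: List[Server], request: str) -> str:
    """Try each redundant server in order and return the first successful reply."""
    for server in servers:
        try:
            return server.handle(request)
        except ConnectionError:
            continue  # fall through to the next redundant server
    raise RuntimeError("all redundant servers failed")


primary = Server("primary")
backup = Server("backup")
pool = [primary, backup]

first_reply = serve_with_failover(pool, "GET /orders")
primary.healthy = False  # simulate a hardware fault on the primary
second_reply = serve_with_failover(pool, "GET /orders")
print(first_reply)
print(second_reply)
```

The caller never needs to know which server answered, which is the point of removing the single point of failure.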

Fault tolerance is also vital for disaster recovery. Cloud-based backup components can quickly restore critical systems if on-premise infrastructure is affected by a natural disaster or human error. With fault-tolerant designs, your systems stay secure, reliable, and ready to handle unexpected issues.

How does Fault Tolerance Work?

There are two basic fault-tolerance models:

Normal Functioning

This is when a fault tolerant network encounters a problem yet continues to function normally. The system’s performance metrics, such as throughput and response time, remain unchanged.

Graceful Degradation

In other fault tolerant networks, performance degrades gracefully when specific problems occur. That is, the impact on the system’s performance is proportional to the severity of the fault. A minor fault has a minor effect on performance rather than causing the entire system to fail or slow down drastically.
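The difference between the two models can be sketched with simple arithmetic. The example below assumes a hypothetical pool of identical workers that share load evenly; the function name and all numbers are illustrative, not measurements.

```python
def delivered_throughput(demand_rps: float, total_workers: int,
                         failed_workers: int, per_worker_rps: float) -> float:
    """Requests per second actually served after some workers have failed."""
    if not 0 <= failed_workers <= total_workers:
        raise ValueError("failed_workers must be between 0 and total_workers")
    capacity = (total_workers - failed_workers) * per_worker_rps
    return min(demand_rps, capacity)


# Four workers at 250 req/s each give 1000 req/s of capacity when healthy.

# Normal functioning: demand leaves headroom, so one failure changes nothing.
light_load = delivered_throughput(700, 4, 1, 250)

# Graceful degradation: at full load, each failure removes a proportional share.
heavy_one_fault = delivered_throughput(1000, 4, 1, 250)
heavy_two_faults = delivered_throughput(1000, 4, 2, 250)

print(light_load, heavy_one_fault, heavy_two_faults)
```

Under these assumptions, one fault costs a quarter of the throughput at full load and two faults cost half, while the lightly loaded system is unaffected by a single fault.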

Designing a Fault Tolerant Network

What is an ATCA Network?

ATCA stands for Advanced Telecommunications Computing Architecture. ATCA is built to work in high-availability (HA) environments, and the ATCA specification incorporates many HA elements like redundancy and fault tolerance. ATCA systems must be connected to external networks in such a way that the shelf’s HA principles apply to those external networks as well, because the availability of a system is determined by its connections to end users.

A simplified partial-network diagram of an ATCA network is shown below.

[Figure: simplified partial diagram of an ATCA network]

Fault Tolerant Network Design Principles and Guidelines

The principles and guidelines for fault-tolerant networks are covered in this section.

Basic Guidelines of Fault Tolerant Network

  • Redundant Cabling: To ensure high availability, you should use redundant cabling at both the board and link levels, regardless of the software used.
  • Backplane Systems: A backplane system is a set of electrical connectors arranged in parallel to form a computer bus, where each pin connects to the corresponding pin on other connectors. ATCA systems use a dual-star topology for backplane connectivity, so every node (a network endpoint) in the ATCA shelf connects to both switch blades. (This applies specifically to PICMG 3.0 R2.0 ECN-002 “cross-connect” enabled shelves for shelf managers.)
  • Extending Redundancy: Redundancy should also extend outside the ATCA shelf. External elements like nodes, switches, and routers should connect to both switch blades in the shelf. This extends the dual-star network topology beyond the shelf for greater fault tolerance.
  • Fault Tolerance with Cables: Multiple cables to each hub improve fault tolerance, while connecting to both switch blades prevents failures from affecting external links. Since external cables are among the most vulnerable parts of a high-availability system, having multiple links to each switch blade avoids or delays complete fail-overs in case of cable failure.

Channel Bonding: Fault Tolerant Network

  • Simplifying HA Networks: Channel bonding drivers combine multiple network interfaces into one virtual interface, simplifying high-availability (HA) network management. Each ATCA node has two network interfaces, connected to two hub blades, which can be treated as one using these drivers.
  • Choosing Ports: Channel bonding drivers offer different algorithms for deciding which port carries traffic. The active-standby algorithm prioritizes availability, making it the most suitable choice for HA and for ATCA networks.
  • Checking Port Usability:
      1. Link State Check: This method simply checks if the network interface’s physical link is active, but it’s limited as it only verifies the immediate connection.
      2. IP Availability Check: A more advanced method, it monitors the entire path between the port and the monitored element. If any link fails, the driver switches to the other interface, ensuring reliability inside and outside the ATCA shelf.
  • Driver Flexibility: Channel bonding drivers are topology, layer, and protocol agnostic. This means they perform well in both complex and simple networks, including Layer 2 switched and Layer 3 routed setups, making them adaptable and reliable.
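The active-standby decision and the two usability checks above can be sketched as follows. This is a simplified model rather than a real bonding driver: the `Interface` class, the two check functions, and the interface names are hypothetical, and the `path_ok` flag stands in for the result of probing a monitored IP address.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Interface:
    name: str
    link_up: bool = True   # physical link to the directly attached hub blade
    path_ok: bool = True   # stand-in for a probe of the monitored IP address


def link_state_check(iface: Interface) -> bool:
    """Only verifies the immediate connection."""
    return iface.link_up


def ip_availability_check(iface: Interface) -> bool:
    """Verifies the whole path to the monitored element."""
    return iface.link_up and iface.path_ok


def select_active(interfaces: List[Interface],
                  current: Optional[Interface],
                  usable: Callable[[Interface], bool]) -> Optional[Interface]:
    """Active-standby: keep the current port while usable, else pick a standby."""
    if current is not None and usable(current):
        return current
    for iface in interfaces:
        if usable(iface):
            return iface
    return None


eth0 = Interface("eth0")
eth1 = Interface("eth1")
bond = [eth0, eth1]

# A failure beyond the first hop: eth0's own link is still up.
eth0.path_ok = False

by_link = select_active(bond, eth0, link_state_check)
by_ip = select_active(bond, eth0, ip_availability_check)
print(by_link.name, by_ip.name)
```

In this scenario the link state check keeps traffic on the broken path, while the IP availability check moves it to the standby interface.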

Fault Tolerant Network Layer 2 Methods

[Figure: Layer 2 fault tolerant network]
  • Layer 2 switching is a popular choice for fault-tolerant networks due to its simplicity, speed, and cost-effectiveness.
  • However, a Layer 2 network must form a strict tree with no loops, which limits redundancy for high-availability (HA) systems.
  • By using VLANs and MSTP, we can overcome this limitation: redundant links can physically form loops while the active topology of each VLAN stays loop-free.
  • Key Concepts:
    • VLANs (Virtual Local Area Networks):
      • Divide a Layer 2 network into smaller segments for better traffic management.
      • Tagging traffic allows us to control which traffic passes through specific ports, preventing loops within each VLAN.
    • Spanning Tree Protocol (STP):
      • Detects and disables loop-causing links, turning the network into a tree structure to avoid loops.
    • Multiple Spanning Tree Protocol (MSTP):
      • VLAN-aware: Improves upon STP by respecting VLAN configurations, so a loop is resolved only within the VLANs it affects.
      • Provides faster reconfiguration when network changes occur, ensuring quick failover if a link fails.
  • Example of Layer 2 Fault-Tolerant Network:
    • RED Nodes: Three nodes that don’t require external network access.
    • BLUE Nodes: Three nodes that need external access and communicate with the RED nodes.
    • The RED and BLUE nodes are separated using VLANs, and MSTP ensures that loops in the BLUE VLAN won’t affect the RED VLAN.
  • Benefits:
    • Improved fault tolerance and redundancy for seamless network communication without disruptions.
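To make the idea of disabling loop-causing links concrete, here is a minimal sketch that builds a loop-free tree over a small looped topology and reports which link would be blocked. It is a deliberate simplification: real STP and MSTP elect the root bridge by bridge ID and choose paths by port cost, whereas this example uses a plain breadth-first search with equal costs. The node names and the `spanning_tree` function are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Set, Tuple

Link = Tuple[str, str]


def spanning_tree(links: List[Link], root: str) -> Tuple[Set[FrozenSet[str]], Set[FrozenSet[str]]]:
    """Return (forwarding, blocked) link sets for a loop-free tree rooted at `root`."""
    neighbours: Dict[str, List[str]] = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    visited = {root}
    forwarding: Set[FrozenSet[str]] = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbours.get(node, [])):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((node, peer)))
                queue.append(peer)

    # Any link not needed to reach a node would close a loop, so it is blocked.
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked


# Two hub blades and one external switch, cabled in a triangle (a loop).
links = [("hubA", "hubB"), ("hubA", "ext"), ("hubB", "ext")]
forwarding, blocked = spanning_tree(links, root="hubA")
print(sorted(sorted(link) for link in blocked))

# After the hubA-ext cable fails, the previously blocked link is put to use.
links_after_failure = [("hubA", "hubB"), ("hubB", "ext")]
forwarding_after, blocked_after = spanning_tree(links_after_failure, root="hubA")
```

The blocked link is what provides the redundancy: it carries no traffic until a failure elsewhere makes it part of the tree.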

Configuration of Layer 2 Fault Tolerant Network VLAN:

vlan database
vlan  101
vlan  202
exit
 
configure
 
!interswitch link needs to be in both VLANs
interface  0/2
!The port cost for MST instance 2 is set lower than normal here
!to signal that this port is preferred over others
spanning-tree mst 2 cost 1800
vlan participation exclude 1
vlan participation include 101
vlan tagging 101
vlan participation include 202
vlan tagging 202
exit
 
!node blade A is in 202
interface  0/3
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
exit
 
!node blade B is in 202
interface  0/4
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
exit
 
!node blade C is in 202 and 101
interface  0/5
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
vlan participation include 101
vlan tagging 101
exit
 
!node blade D is in 101
interface  0/6
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade E is in 101
interface  0/7
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade F is in 101
interface  0/8
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!external network
interface 0/20
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
exit
 
!MSTP is enabled such that 101 and 202 are in different
!MSTP instances
spanning-tree
spanning-tree configuration name "MSTPexample"
spanning-tree configuration revision 0
spanning-tree mst instance 1
spanning-tree mst vlan 1 1
spanning-tree mst vlan 1 101
spanning-tree mst instance 2
spanning-tree mst vlan 2 202
 
exit

Fault Tolerant Network Layer 3 Methods

  • Virtual Router Redundancy Protocol (VRRP) Overview
    VRRP allows multiple routers to appear as a single virtual router. This is useful in fault-tolerant networks, where redundancy is crucial. By using a single virtual IP, network devices on the Layer 3 side of a hub can connect to the network, while the VRRP nodes handle the routing duties.
  • Master and Backup Roles
    In VRRP, one router becomes the master and handles routing traffic. If this primary router fails, another router in the group automatically takes over, ensuring uninterrupted connectivity. This failover process keeps the network running smoothly.
  • Failover Monitoring with VRRP Tracking
    VRRP doesn’t rely only on its periodic advertisement packets to detect a router failure. It also uses “tracking” to monitor the status of individual links, routes, or remote IPs. If any of these monitored elements fail, VRRP initiates a failover without waiting for advertisements to be missed, speeding up recovery.
  • Layer 3 Routing vs Layer 2 Switching
    While Layer 2 switching is simpler and faster, Layer 3 routing tolerates redundant paths and therefore allows for better fault tolerance. In Layer 3 networks, loops are often expected. To connect the two, hub blades are used as gateways between internal Layer 2 networks and external Layer 3 networks.
  • Redundant Gateways with VRRP
    In a fault-tolerant network, redundant gateways are crucial for ensuring continuous connectivity. For example, in an ATCA system, two switch blades serve as the redundant gateways between node blades and the external network. VRRP ensures that even if one gateway fails, traffic can still flow seamlessly through the other gateway.
  • External Network and Subnet Setup
    In this setup, the external network is in one subnet (e.g., 22.50.1.x), and the node blades are in another (e.g., 12.55.67.x). VRRP is configured so that the node blades can continue to access the external network even if one of the switch blades goes down.
  • Enhanced Failover with VRRP Tracking
    To make failovers more reliable, VRRP tracking is added to monitor the status of key network elements. This helps ensure a quick response to failures, improving the overall availability of the network.
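The interaction between priorities and tracking can be sketched as a small model. The priorities (250 and 253) and the decrement (40) mirror the sample configuration in this article; everything else, including the `VrrpRouter` class, the `elect_master` function, and the tracked object name, is hypothetical. The sketch ignores advertisement timers and the tie-break on IP address that real VRRP applies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class VrrpRouter:
    name: str
    base_priority: int
    # tracked object name -> priority decrement applied while that object is down
    decrements: Dict[str, int] = field(default_factory=dict)
    down: Set[str] = field(default_factory=set)
    alive: bool = True

    def effective_priority(self) -> int:
        penalty = sum(self.decrements[obj] for obj in self.down)
        return max(self.base_priority - penalty, 1)


def elect_master(routers: List[VrrpRouter]) -> Optional[VrrpRouter]:
    """The live router with the highest effective priority owns the virtual IP."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.effective_priority())


switch_a = VrrpRouter("switch-a", 250, {"uplink": 40})
switch_b = VrrpRouter("switch-b", 253, {"uplink": 40})
group = [switch_a, switch_b]

master_before = elect_master(group)   # switch-b: 253 beats 250

# switch-b stays up, but its tracked uplink fails: 253 - 40 = 213.
switch_b.down.add("uplink")
master_after = elect_master(group)    # switch-a: 250 beats 213

print(master_before.name, master_after.name)
```

The decrement has to be larger than the gap between the two base priorities, otherwise a tracked failure would not change which router is master.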

An example VRRP network with VRRP tracking can be seen in the diagram below.

[Figure: Layer 3 fault tolerant network with VRRP tracking]

Configuration of Layer 3 Fault Tolerant Network VRRP:

vlan database
vlan  101
vlan routing 101
exit
 
configure
ip routing
ip vrrp
 
!interswitch link carries VLAN 101 (tagged)
interface  0/2
vlan participation exclude 1
vlan participation include 101
vlan tagging 101
exit
 
!node blade A is in 101
interface  0/3
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade B is in 101
interface  0/4
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade C is in 101
interface  0/5
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade D is in 101
interface  0/6
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade E is in 101
interface  0/7
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade F is in 101
interface  0/8
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!external network
interface 0/20
ip address  22.50.1.5  255.255.255.0
exit
 
interface  4/1
!the other switch is set to 12.55.67.19
ip address  12.55.67.18  255.255.255.0
ip vrrp 102
ip vrrp 102 ip 12.55.67.20
!the other switch is set to 253
ip vrrp 102 priority 250
ip vrrp 102 mode
exit
!
!track remote server
track 1  ip route 22.50.1.15/24 reachability
 
!track local link state; if any link goes down, fail over
track 2 interface 0/3 line-protocol
track 3 interface 0/4 line-protocol
track 4 interface 0/5 line-protocol
track 5 interface 0/6 line-protocol
track 6 interface 0/7 line-protocol
track 7 interface 0/8 line-protocol
track 8 interface 0/20 line-protocol
 
!track local route
track 9  interface 4/1 ip routing
 
!Assign values to each track and assign the tracks to the vrrp instance
vrrp 102 track 1 decrement 40
vrrp 102 track 2 decrement 40
vrrp 102 track 3 decrement 40
vrrp 102 track 4 decrement 40
vrrp 102 track 5 decrement 40
vrrp 102 track 6 decrement 40
vrrp 102 track 7 decrement 40
vrrp 102 track 8 decrement 40
vrrp 102 track 9 decrement 40
 
exit

Single Fault Tolerant Network Example

To achieve maximum availability, each of the fault tolerant network design methods given (channel bonding drivers, Layer 2 methods, and Layer 3 methods) should be employed simultaneously.
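A back-of-the-envelope calculation shows why layering redundancy pays off. The sketch below uses the standard formulas for components in series and in parallel, and it assumes failures are independent, which real networks only approximate. The availability figures are invented for illustration.

```python
from math import prod
from typing import Iterable


def series_availability(parts: Iterable[float]) -> float:
    """Every part is required, so the availabilities multiply."""
    return prod(parts)


def parallel_availability(paths: Iterable[float]) -> float:
    """Any one path is enough, so the system fails only if every path fails."""
    return 1 - prod(1 - a for a in paths)


# One path: an external cable (0.99) followed by a switch blade (0.999).
single_path = series_availability([0.99, 0.999])

# Two such paths in parallel, as in a dual-star layout.
dual_path = parallel_availability([single_path, single_path])

print(f"single path: {single_path:.5f}")
print(f"dual path:   {dual_path:.5f}")
```

With these made-up numbers, duplicating the path cuts expected unavailability from roughly 1.1% to roughly 0.012%.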

An example of all approaches merged into a single fault tolerant network setup is shown below.

[Figure: fault tolerant network combining channel bonding, Layer 2, and Layer 3 methods]

Configuring a Fault Tolerant Network With All Design Methods Integrated:

vlan database
vlan  101
vlan routing 101
vlan 202
vlan routing 202
exit
 
configure
ip routing
ip vrrp
 
!interswitch link needs to be in both VLANs
interface  0/2
!The port cost for MST instance 2 is set lower than normal here
!to signal that this port is preferred over others
spanning-tree mst 2 cost 1800
vlan participation exclude 1
vlan participation include 101
vlan tagging 101
vlan participation include 202
vlan tagging 202
exit
 
!node blade A is in 202
interface  0/3
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
exit
 
!node blade B is in 202
interface  0/4
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
exit
 
!node blade C is in 202 and 101
interface  0/5
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
vlan participation include 101
vlan tagging 101
exit
 
!node blade D is in 101
interface  0/6
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade E is in 101
interface  0/7
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade F is in 101
interface  0/8
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!external network
interface 0/20
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
exit
 
interface  4/1
!the other switch is set to 12.55.67.19
ip address  12.55.67.18  255.255.255.0
ip vrrp 102
ip vrrp 102 ip 12.55.67.20
!the other switch is set to 253
ip vrrp 102 priority 250
ip vrrp 102 mode
exit
 
interface  4/2
!the other switch is set to 22.50.1.4
ip address  22.50.1.3  255.255.255.0
ip vrrp 202
ip vrrp 202 ip 22.50.1.5
!the other switch is set to 253
ip vrrp 202 priority 250
ip vrrp 202 mode
exit
 
!
!track remote server
track 1  ip route 22.50.1.15/24 reachability
 
!track local link state; if any link goes down, fail over
track 2 interface 0/3 line-protocol
track 3 interface 0/4 line-protocol
track 4 interface 0/5 line-protocol
track 5 interface 0/6 line-protocol
track 6 interface 0/7 line-protocol
track 7 interface 0/8 line-protocol
track 8 interface 0/20 line-protocol
 
!track local route
track 9  interface 4/1 ip routing
track 10  interface 4/2 ip routing
 
!Assign values to each track and assign the tracks to the vrrp instance
vrrp 102 track 1 decrement 40
vrrp 102 track 2 decrement 40
vrrp 102 track 3 decrement 40
vrrp 102 track 4 decrement 40
vrrp 102 track 5 decrement 40
vrrp 102 track 6 decrement 40
vrrp 102 track 7 decrement 40
vrrp 102 track 8 decrement 40
vrrp 102 track 9 decrement 40
vrrp 102 track 10 decrement 40
vrrp 202 track 1 decrement 40
vrrp 202 track 2 decrement 40
vrrp 202 track 3 decrement 40
vrrp 202 track 4 decrement 40
vrrp 202 track 5 decrement 40
vrrp 202 track 6 decrement 40
vrrp 202 track 7 decrement 40
vrrp 202 track 8 decrement 40
vrrp 202 track 9 decrement 40
vrrp 202 track 10 decrement 40
 
!MSTP is enabled such that 101 and 202 are in different
!MSTP instances
spanning-tree
spanning-tree configuration name "MSTPexample"
spanning-tree configuration revision 0
spanning-tree mst instance 1
spanning-tree mst vlan 1 1
spanning-tree mst vlan 1 101
spanning-tree mst instance 2
spanning-tree mst vlan 2 202
 
exit
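Before applying a configuration like the one above, it can help to sanity-check the addressing, since a VRRP virtual IP is expected to sit in the same subnet as the interface addresses that back it. The following sketch does this with Python's standard `ipaddress` module, using the addresses from the sample configuration. The helper name `in_same_subnet` is hypothetical, and the check is an illustration rather than part of any switch tooling.

```python
import ipaddress


def in_same_subnet(interface_cidr: str, *addresses: str) -> bool:
    """True if every address falls inside the interface's subnet."""
    network = ipaddress.ip_interface(interface_cidr).network
    return all(ipaddress.ip_address(a) in network for a in addresses)


# VRRP group 102: this switch, the peer switch, and the virtual IP.
internal_ok = in_same_subnet("12.55.67.18/24", "12.55.67.19", "12.55.67.20")

# VRRP group 202: peer switch, virtual IP, and the tracked remote server.
external_ok = in_same_subnet("22.50.1.3/24", "22.50.1.4", "22.50.1.5", "22.50.1.15")

# A virtual IP from the wrong subnet is caught.
crossed = in_same_subnet("12.55.67.18/24", "22.50.1.5")

print(internal_ok, external_ok, crossed)
```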

Conclusion

In this article, you learned about fault tolerance, how it works, its components, and how to design a fault tolerant network using Layer 2 and Layer 3 methods. You also saw a sample configuration for each method.

FAQs

1. Which network topologies are fault-tolerant?

Fault-tolerant network topologies include mesh and hybrid topologies. In a mesh topology, devices are interconnected, providing multiple paths for data to travel, ensuring that a single failure won’t disrupt the network. Hybrid topologies combine multiple designs, offering flexibility and redundancy to improve fault tolerance.
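One way to test this property is to remove each link in turn and check that the network stays connected. The sketch below does that for a loop-free tree and for a full mesh of four nodes. The node names and helper functions are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Link = Tuple[str, str]


def is_connected(nodes: List[str], links: List[Link]) -> bool:
    """Depth-first search from the first node; connected if every node is reached."""
    if not nodes:
        return True
    neighbours: Dict[str, Set[str]] = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        node = stack.pop()
        for peer in neighbours[node]:
            if peer not in seen:
                seen.add(peer)
                stack.append(peer)
    return len(seen) == len(nodes)


def survives_single_link_failure(nodes: List[str], links: List[Link]) -> bool:
    """True if the network stays connected whichever single link is removed."""
    return all(
        is_connected(nodes, links[:i] + links[i + 1:])
        for i in range(len(links))
    )


nodes = ["A", "B", "C", "D"]
tree_links = [("A", "B"), ("A", "C"), ("A", "D")]
mesh_links = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]

tree_ok = survives_single_link_failure(nodes, tree_links)
mesh_ok = survives_single_link_failure(nodes, mesh_links)
print(tree_ok, mesh_ok)
```

The tree fails the test because every link is a single point of failure, while the mesh always has an alternative path.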

2. What does it mean that the internet is fault-tolerant?

The internet is fault-tolerant because it can continue functioning even when parts of the network fail. Its decentralized design and redundant paths ensure data can be rerouted through alternative routes, maintaining connectivity and reliability.

3. Is fault tolerance good or bad?

Fault tolerance is a good thing because it ensures systems remain operational even when components fail. It enhances reliability, minimizes downtime, and protects against data loss, which is crucial for critical applications and user satisfaction.

4. What is fault-tolerant communication?

Fault-tolerant communication ensures that messages or data can still be transmitted successfully, even if parts of the communication system fail. This is achieved through redundancy, error detection, and rerouting mechanisms to maintain reliability and minimize disruptions.
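The three mechanisms named above can be combined in a few lines. In this sketch, each message carries a CRC-32 checksum (error detection), and the sender tries several paths in turn (redundancy and rerouting) until one delivers an intact copy. The path functions simulate a dead link, a link that corrupts data, and a healthy link; all names are hypothetical.

```python
import zlib
from typing import Callable, List, Optional

Path = Callable[[bytes], Optional[bytes]]


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(data: bytes) -> Optional[bytes]:
    """Return the payload if the checksum matches, otherwise None."""
    if len(data) < 4:
        return None
    payload, checksum = data[:-4], data[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None
    return payload


def send_reliably(payload: bytes, paths: List[Path]) -> Optional[bytes]:
    """Try each path in order until one delivers an uncorrupted frame."""
    for path in paths:
        received = path(frame(payload))
        if received is None:
            continue  # path is down, reroute
        result = unframe(received)
        if result is not None:
            return result  # intact delivery
    return None


def dead_path(data: bytes) -> Optional[bytes]:
    return None  # nothing arrives


def noisy_path(data: bytes) -> Optional[bytes]:
    return bytes([data[0] ^ 0x01]) + data[1:]  # flips one bit in transit


def clean_path(data: bytes) -> Optional[bytes]:
    return data


delivered = send_reliably(b"hello", [dead_path, noisy_path, clean_path])
undelivered = send_reliably(b"hello", [dead_path, noisy_path])
print(delivered, undelivered)
```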

Sharon Rithika
Content Writer, Hevo Data

Sharon is a data science enthusiast with a hands-on approach to data integration and infrastructure. She leverages her technical background in computer science and her experience as a Marketing Content Analyst at Hevo Data to create informative content that bridges the gap between technical concepts and practical applications. Sharon's passion lies in using data to solve real-world problems and empower others with data literacy.