The Ultimate Guide to Designing a Fault Tolerant Network


Fault tolerance refers to a system’s capacity to continue operating normally even if one or more of its components fail. This holds whether the system is a computer, a cloud cluster, a network, or something else. In other words, fault tolerance describes how a computer’s operating system (OS) reacts to and accommodates software and hardware problems.

An OS’s capacity to recover from and tolerate faults without failing can be provided by hardware, software, or a hybrid solution employing load balancers. To handle problems gracefully, some computer systems employ multiple duplicate fault tolerant networks.

In this article, you will learn about fault tolerance, how it works, and designing a fault tolerant network.


What is Fault Tolerance?

The goal of designing a fault tolerant network is to avoid disruptions caused by a single point of failure while also ensuring the high availability and business continuity of mission-critical applications or systems.

Backup components are used in fault tolerant networks to automatically take the place of failed components, ensuring that no service is lost. These are some of the backup components:

  • Hardware systems backed up by identical or equivalent systems. A server, for example, can be made fault tolerant by operating two identical servers in parallel and mirroring all actions to the backup server.
  • Software systems backed up by additional instances. A database containing customer information, for example, can be replicated to another system on a regular basis. If the primary database fails, operations can be moved to the secondary database automatically.
  • Alternative power sources. Many firms keep backup power generators in case the main power line fails.
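
The database-mirroring pattern above can be sketched in a few lines. This is a minimal illustration, not a real client API: the `Database` class and its health flag are hypothetical stand-ins for a driver and a health check.

```python
# Minimal sketch of automatic failover between a primary and a
# replicated secondary database. Database and its "healthy" flag
# are hypothetical stand-ins for a real client and health check.

class Database:
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def query(self, sql):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}: result of {sql!r}"

class FaultTolerantDB:
    """Route queries to the primary; fail over to the replica on error."""
    def __init__(self, primary, replica):
        self.primary, self.replica = primary, replica

    def query(self, sql):
        try:
            return self.primary.query(sql)
        except ConnectionError:
            # Primary failed: transparently switch to the replica.
            return self.replica.query(sql)

primary, replica = Database("primary"), Database("replica")
db = FaultTolerantDB(primary, replica)
print(db.query("SELECT 1"))      # served by the primary
primary.healthy = False          # simulate a primary failure
print(db.query("SELECT 1"))      # automatically served by the replica
```

Real deployments also have to handle replication lag and failback, which this sketch omits.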

Redundancy can also make any system or component with a single point of failure fault tolerant. A network, for example, becomes fault tolerant when an identical network mirrors all of its operations and runs in parallel as a backup. Hardware redundancy of this kind can make any component or system significantly safer and more reliable by removing single points of failure.

Fault tolerance can also play a role in a disaster recovery strategy. Fault tolerant networks with cloud backup components, for example, can quickly restore mission-critical systems even if on-premises IT infrastructure is destroyed by a natural or human-caused disaster.

How does Fault Tolerance Work?

There are two basic fault-tolerance models:

Normal Functioning

This is when a fault tolerant network faces a problem yet continues to function normally. The system’s performance metrics, such as throughput and response time, remain unchanged.

Graceful Degradation

In other types of fault tolerant networks, performance degrades gracefully when specific problems occur. That is, the impact of a problem on the system’s performance is proportional to the severity of the fault. To put it another way, a minor fault will have a minor impact on the system’s performance rather than causing the entire system to fail or suffer major performance problems.
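
Graceful degradation can be illustrated with a small sketch, assuming a hypothetical lookup service that falls back to a possibly stale cache when its backend fails, so a fault degrades answer quality instead of causing an outage. All names here are illustrative.

```python
# Sketch of graceful degradation: serve fresh data normally, and fall
# back to a possibly stale cache when the backend fails. Backend and
# DegradingService are illustrative, not a real library API.

class Backend:
    def __init__(self):
        self.up = True

    def fetch(self, key):
        if not self.up:
            raise ConnectionError("backend unavailable")
        return {"key": key, "value": key.upper(), "fresh": True}

class DegradingService:
    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def get(self, key):
        try:
            result = self.backend.fetch(key)
            self.cache[key] = result          # remember for bad times
            return result
        except ConnectionError:
            # Degraded mode: serve stale data instead of an error.
            stale = dict(self.cache.get(key, {"key": key, "value": None}))
            stale["fresh"] = False
            return stale

backend = Backend()
svc = DegradingService(backend)
print(svc.get("users"))          # fresh result
backend.up = False               # simulate a backend outage
print(svc.get("users"))          # stale, but the service still answers
```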


Designing a Fault Tolerant Network

What is ATCA Network?

ATCA stands for Advanced Telecommunications Computing Architecture, and it is built to operate in high-availability (HA) environments. The ATCA specification incorporates many HA elements, such as redundancy and fault tolerance. ATCA systems must be connected to external networks in such a way that the shelf’s HA principles extend to those networks as well, since the availability of a system is determined by its connections to end users.

A simplified partial-network diagram of an ATCA network is shown below.

[Image: Simplified partial-network diagram of an ATCA network]

Fault Tolerant Network Design Principles and Guidelines

The principles and guidelines for fault-tolerant networks are covered in this section.

Basic Guidelines of Fault Tolerant Network

A system should be redundantly cabled, preferably at both the board and link level, regardless of the software employed to boost availability.

A backplane (or “backplane system”) is a collection of electrical connectors wired in parallel to form a computer bus, with each pin of each connector linked to the corresponding pin of every other connector. ATCA employs a dual-star topology for backplane connectivity, so every node (anything that is a network endpoint) inside the ATCA shelf is connected to both switchblades. (For shelf managers, this applies only to PICMG 3.0 R2.0 ECN-002 “cross-connect” enabled shelves.)

  • The redundancy should be repeated outside of the ATCA shelf.
  • Every external element (such as a node, switch, or router) should be linked to both switchblades on the ATCA shelf.
  • These redundant links extend the ATCA notion of a dual-star network beyond the shelf.
  • Wiring to both switchblades provides fault tolerance against external link failures, and running multiple cables to each hub improves it further.
  • External cables are among an HA system’s most vulnerable components. With multiple links to each switchblade, a complete fail-over can be avoided, or at least postponed, in some cases.

Channel Bonding: Fault Tolerant Network

Channel bonding drivers are the simplest and most powerful approach to constructing an HA network. Every ATCA node has at least two network interfaces because it is connected to two hub blades. While these interfaces can logically be handled as independent interfaces, it makes more sense to treat them as one. Channel bonding drivers provide this abstraction: higher-level software uses a single virtual network interface, while the channel bonding driver manages the difficult task of deciding which physical network interface to use.

Various decision methods are used by channel bonding drivers to determine which port to use at what moment. The majority of these algorithms prioritize bandwidth over availability. However, because the active-standby algorithm was created expressly for HA, it is the optimal algorithm to utilize in ATCA networks.

The essential strength of a channel bonding driver lies in its algorithm for determining whether a port is usable, not in its strategy for choosing between ports. Two algorithms are commonly used to determine whether a port is usable:

  • The first merely verifies the network interface’s link state. Because it simply checks the port’s immediate physical link, this technique has limited utility.
  • The second one monitors an IP’s availability. Because a whole path (including linkages and elements between the port and the element being watched) can be monitored, this can be a very powerful approach to monitoring a port. The channel bonding driver will fail over to the other network interface if any of the links between the elements fail. An ATCA node can monitor and respond to faults both inside and outside the ATCA shelf in this fashion.

One of the best things about channel bonding drivers is that they’re topology, layer, and protocol agnostic (decision algorithms are not topology independent, but the overall channel bonding driver is). They perform well in both complicated and simple networks, as well as Layer 2 switched and Layer 3 routed networks.
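
On Linux, for example, the active-standby algorithm with remote-IP monitoring described above maps onto the kernel bonding driver’s active-backup mode with ARP monitoring. A sketch with iproute2 follows; the interface names, addresses, and probe target are placeholders, not values from the ATCA example.

```shell
# Create a bond in active-backup mode that probes a remote IP via ARP
# every second (192.0.2.1 is a placeholder monitoring target).
ip link add bond0 type bond mode active-backup \
    arp_interval 1000 arp_ip_target 192.0.2.1

# Enslave the two physical interfaces (one per switchblade).
ip link set eth0 down
ip link set eth1 down
ip link set eth0 master bond0
ip link set eth1 master bond0

# Bring up the virtual interface; higher layers use only bond0.
ip link set bond0 up
ip addr add 12.55.67.30/24 dev bond0
```

If the ARP probe to the target stops being answered over the active slave, the driver fails over to the standby interface, which is exactly the whole-path monitoring behavior described above.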


Fault Tolerant Network Layer 2 Methods

Layer 2 switching is widely utilized for fault tolerant networks due to its simplicity. The simplicity of switching makes it fast and economical, but it restricts network architecture.

Without any extra protocols, a Layer 2 network must adhere to a strict tree structure: no loops are allowed in the network. This restriction runs counter to the core HA notion of redundancy. Running the VLAN and MSTP protocols on top of a Layer 2 switched network, however, overcomes this barrier and allows loops.

VLANs can be used to divide a switched network into smaller networks. Switches are configured to allow only traffic tagged for specific VLANs to pass through specific ports, and traffic receives tags as it flows through switches and nodes (which can also remove tags as needed). With these rules and tags, traffic can be regulated and controlled so that no loop occurs within any single VLAN.

STP (Spanning Tree Protocol) is a technology created specifically to deal with network loops. STP traverses a network, locates all loops, and disables the loop-causing links. It essentially converts the network graph into a tree (at most one path between any two nodes), hence the name.
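
What STP accomplishes can be illustrated with a small graph sketch. This mimics only the outcome (keep a spanning tree, disable loop-closing links), not the actual BPDU election protocol, and the node names are invented for the example.

```python
# Illustrative sketch of STP's outcome: given a network graph with
# redundant links, keep a spanning tree and disable the links that
# would form loops. This is not the real BPDU-based protocol.
from collections import deque

def spanning_tree_links(nodes, links, root):
    """Return (active, disabled) link sets via breadth-first search."""
    neighbors = {n: set() for n in nodes}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    visited, active = {root}, set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbors[node]):
            if peer not in visited:
                visited.add(peer)
                active.add(frozenset((node, peer)))
                queue.append(peer)
    disabled = {frozenset(link) for link in links} - active
    return active, disabled

# Two switches, each connected to the other and to the same node:
# one of the three links closes a loop and must be disabled.
nodes = ["sw1", "sw2", "nodeA"]
links = [("sw1", "sw2"), ("sw1", "nodeA"), ("sw2", "nodeA")]
active, disabled = spanning_tree_links(nodes, links, root="sw1")
print(len(active), len(disabled))  # 2 active links, 1 disabled
```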

Multiple Spanning Tree Protocol (MSTP) aims to improve on STP, and it is superior in two respects. First, MSTP is VLAN aware. Regular STP ignores VLAN settings, so even if a network loop is properly isolated with VLANs, STP will disable it; MSTP recognizes that the loop has been contained by the VLAN configuration and leaves the link enabled. Second, MSTP converges (reconfigures) faster than STP when the network changes. In any network loop, one of the two redundant links is active and the other idle; when the active link fails, the idle link should take over as quickly as possible.

A Layer 2 fault tolerant network with VLAN and MSTP is shown in the diagram below.

[Image: Layer 2 fault tolerant network with VLAN and MSTP]

This fault tolerant network has three node blades that do not need to communicate with the external network (RED) and three that do (BLUE). The RED nodes need to communicate with one of the BLUE nodes. The RED and BLUE nodes are logically separated via VLANs, and MSTP ensures that a loop in VLAN BLUE does not block the inter-switch link for VLAN RED.

Configuration of Layer 2 Fault Tolerant Network VLAN:

vlan database
vlan  101
vlan  202
exit
 
configure
 
!interswitch link needs to be in both VLANs
interface  0/2
!The port cost of MST 2 is set lower than normal here
!to signify that using this port is preferred over others
spanning-tree mst 2 cost 1800
vlan participation exclude 1
vlan participation include 101
vlan tagging 101
vlan participation include 202
vlan tagging 202
exit
 
!node blade A is in 202
interface  0/3
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
exit
 
!node blade B is in 202
interface  0/4
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
exit
 
!node blade C is in 202 and 101
interface  0/5
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
vlan participation include 101
vlan tagging 101
exit
 
!node blade D is in 101
interface  0/6
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade E is in 101
interface  0/7
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade F is in 101
interface  0/8
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!external network
interface 0/20
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
exit
 
!MSTP is enabled such that 101 and 202 are in different
!MSTP instances
spanning-tree
spanning-tree configuration name "MSTPexample"
spanning-tree configuration revision 0
spanning-tree mst instance 1
spanning-tree mst vlan 1 1
spanning-tree mst vlan 1 101
spanning-tree mst instance 2
spanning-tree mst vlan 2 202
 
exit

Fault Tolerant Network Layer 3 Methods

With VRRP (Virtual Router Redundancy Protocol), multiple routers can appear as a single virtual router. Nodes on the Layer 2 side of the hub can be configured with a single virtual gateway IP, while network elements on the Layer 3 side see two routes to the same subnet. One of the VRRP routers becomes the master for the virtual IP and performs the routing. If the master router fails, the backup router takes over in the fault tolerant network.

In traditional VRRP, a router failure is detected as a series of missed checkpoint (advertisement) packets exchanged between VRRP routers. VRRP tracking adds other fail-over conditions: multiple “tracks” can be set up to monitor the status of a link, a local route, or a remote IP. If any track fails, the router forces a fail-over without waiting for missed checkpoint packets.
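
The way tracking drives failover follows a simple arithmetic, sketched here with the numbers from the configurations in this article (base priorities 250 vs. 253, decrement 40 per failed track; the router names are illustrative): the router with the highest effective priority becomes master.

```python
# Sketch of VRRP priority arithmetic with tracking: each failed track
# decrements the base priority, and the router with the highest
# effective priority becomes master. Router names are illustrative.

def effective_priority(base, tracks, decrement=40):
    """Base priority minus one decrement per failed track (floor at 0)."""
    failed = sum(1 for ok in tracks.values() if not ok)
    return max(base - decrement * failed, 0)

def elect_master(routers):
    """Return the router with the highest effective priority."""
    return max(routers, key=lambda name: routers[name])

# Switchblade A (priority 253) is master until a track, say the
# external uplink, fails and drops it below B's 250.
a = effective_priority(253, {"uplink": True})
b = effective_priority(250, {"uplink": True})
print(elect_master({"switch_a": a, "switch_b": b}))  # switch_a

a = effective_priority(253, {"uplink": False})       # 253 - 40 = 213
print(elect_master({"switch_a": a, "switch_b": b}))  # switch_b
```

This is why the configurations decrement by 40: a single failed track is enough to push the master (253) below the backup (250) and trigger failover.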

Layer 3 routing is more complex than Layer 2 switching, but it is also more robust, and in Layer 3 fault tolerant networks loops are expected. In ATCA, hub blades serve as gateway routers: a hub blade connects an external Layer 3 network to the internal Layer 2 network, so the ATCA shelf has two gateways, one per hub blade. The node blades could manage two distinct gateways themselves, but the VRRP protocol provides a more elegant approach.

An example VRRP network with VRRP tracking can be seen in the diagram below.

[Image: Example VRRP network with VRRP tracking]

The two switchblades in this scenario serve as redundant gateways between the node blades and the external network. The external network is in the 22.50.1.x subnet, whereas the node blades are in the 12.55.67.x subnet. A VRRP instance is set up so that even if one of the switchblades fails, the nodes can still reach the outside network. VRRP tracks have been added to provide more reliable failover scenarios.

Configuration of Layer 3 Fault Tolerant Network VRRP:

vlan database
vlan  101
vlan routing 101
exit
 
configure
ip routing
ip vrrp
 
!interswitch link needs to be in both VLANs
interface  0/2
vlan participation exclude 1
vlan participation include 101
vlan tagging 101
exit
 
!node blade A is in 101
interface  0/3
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade B is in 101
interface  0/4
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade C is in 101
interface  0/5
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade D is in 101
interface  0/6
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade E is in 101
interface  0/7
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade F is in 101
interface  0/8
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!external network
interface 0/20
ip address  22.50.1.5  255.255.255.0
exit
 
interface  4/1
!the other switch is set to 12.55.67.19
ip address  12.55.67.18  255.255.255.0
ip vrrp 102
ip vrrp 102 ip 12.55.67.20
!the other switch is set to 253
ip vrrp 102 priority 250
ip vrrp 102 mode
exit
!
!track remote server
track 1  ip route 22.50.1.15/24 reachability
 
!track local link state If any link goes down, failover
track 2 interface 0/3 line-protocol
track 3 interface 0/4 line-protocol
track 4 interface 0/5 line-protocol
track 5 interface 0/6 line-protocol
track 6 interface 0/7 line-protocol
track 7 interface 0/8 line-protocol
track 8 interface 0/20 line-protocol
 
!track local route
track 9  interface 4/1 ip routing
 
!Assign values to each track and assign the tracks to the vrrp instance
vrrp 102 track 1 decrement 40
vrrp 102 track 2 decrement 40
vrrp 102 track 3 decrement 40
vrrp 102 track 5 decrement 40
vrrp 102 track 6 decrement 40
vrrp 102 track 7 decrement 40
vrrp 102 track 8 decrement 40
vrrp 102 track 9 decrement 40
 
exit

Single Fault Tolerant Network Example

To achieve maximum availability, each of the fault tolerant network design methods given (channel bonding drivers, Layer 2 methods, and Layer 3 methods) should be employed simultaneously.

An example of all approaches merged into a single fault tolerant network setup is shown below.

[Image: Fault tolerant network combining channel bonding, Layer 2, and Layer 3 methods]

Configuring a Fault Tolerant Network With All Design Methods Integrated:

vlan database
vlan  101
vlan routing 101
vlan 202
vlan routing 202
exit
 
configure
ip routing
ip vrrp
 
!interswitch link needs to be in both VLANs
interface  0/2
!The port cost of MST 2 is set lower than normal here
!to signify that using this port is preferred over others
spanning-tree mst 2 cost 1800
vlan participation exclude 1
vlan participation include 101
vlan tagging 101
vlan participation include 202
vlan tagging 202
exit
 
!node blade A is in 202
interface  0/3
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
exit
 
!node blade B is in 202
interface  0/4
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
exit
 
!node blade C is in 202 and 101
interface  0/5
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
vlan participation include 101
vlan tagging 101
exit
 
!node blade D is in 101
interface  0/6
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade E is in 101
interface  0/7
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!node blade F is in 101
interface  0/8
vlan participation exclude 1
vlan participation include 101
vlan pvid 101
exit
 
!external network
interface 0/20
vlan participation exclude 1
vlan participation include 202
vlan pvid 202
exit
 
interface  4/1
!the other switch is set to 12.55.67.19
ip address  12.55.67.18  255.255.255.0
ip vrrp 102
ip vrrp 102 ip 12.55.67.20
!the other switch is set to 253
ip vrrp 102 priority 250
ip vrrp 102 mode
exit
 
interface  4/2
!the other switch is set to 22.50.1.4
ip address  22.50.1.3  255.255.255.0
ip vrrp 202
ip vrrp 202 ip 22.50.1.5
!the other switch is set to 253
ip vrrp 202 priority 250
ip vrrp 202 mode
exit
 
!
!track remote server
track 1  ip route 22.50.1.15/24 reachability
 
!track local link state If any link goes down, failover
track 2 interface 0/3 line-protocol
track 3 interface 0/4 line-protocol
track 4 interface 0/5 line-protocol
track 5 interface 0/6 line-protocol
track 6 interface 0/7 line-protocol
track 7 interface 0/8 line-protocol
track 8 interface 0/20 line-protocol
 
!track local route
track 9  interface 4/1 ip routing
track 10  interface 4/2 ip routing
 
!Assign values to each track and assign the tracks to the vrrp instance
vrrp 102 track 1 decrement 40
vrrp 102 track 2 decrement 40
vrrp 102 track 3 decrement 40
vrrp 102 track 5 decrement 40
vrrp 102 track 6 decrement 40
vrrp 102 track 7 decrement 40
vrrp 102 track 8 decrement 40
vrrp 102 track 9 decrement 40
vrrp 102 track 10 decrement 40
vrrp 202 track 1 decrement 40
vrrp 202 track 2 decrement 40
vrrp 202 track 3 decrement 40
vrrp 202 track 5 decrement 40
vrrp 202 track 6 decrement 40
vrrp 202 track 7 decrement 40
vrrp 202 track 8 decrement 40
vrrp 202 track 9 decrement 40
vrrp 202 track 10 decrement 40
 
!MSTP is enabled such that 101 and 202 are in different
!MSTP instances
spanning-tree
spanning-tree configuration name "MSTPexample"
spanning-tree configuration revision 0
spanning-tree mst instance 1
spanning-tree mst vlan 1 1
spanning-tree mst vlan 1 101
spanning-tree mst instance 2
spanning-tree mst vlan 2 202
 
exit

Conclusion

In this article, you learned about fault tolerance, how it works, its components, and how to design a fault tolerant network using channel bonding, Layer 2, and Layer 3 methods. You also saw configuration examples of each method.

However, as a developer, extracting complex data from a diverse set of data sources like Databases, CRMs, Project Management Tools, Streaming Services, and Marketing Platforms to your Database can seem quite challenging. If you are from a non-technical background or are new to data warehousing and analytics, Hevo Data can help!

Visit our Website to Explore Hevo

Hevo Data will automate your data transfer process, allowing you to focus on other aspects of your business like Analytics, Customer Management, etc. The platform allows you to transfer data from 100+ sources to Cloud-based Data Warehouses like Snowflake, Google BigQuery, Amazon Redshift, etc. It will provide you with a hassle-free experience and make your work life much easier.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!

Former Content Writer, Hevo Data

Sharon is a data science enthusiast with a passion for data, software architecture, and writing technical content. She has experience writing articles on diverse topics related to data integration and infrastructure.
