written 8.7 years ago by | • modified 2.9 years ago |
written 8.7 years ago by |
Failure analysis involves analyzing the data center to identify systems that are susceptible to a single point of failure and implementing fault-tolerance mechanisms such as redundancy.
Single Point of Failure
A single point of failure refers to the failure of a component that can terminate the availability of the entire system or IT service. Following figure illustrates the possibility of a single point of failure in a system with various components: server, network, switch, and storage array. The figure depicts a system setup in which an application running on the server provides an interface to the client and performs I/O operations. The client is connected to the server through an IP network, the server is connected to the storage array through a FC connection, an HBA installed at the server sends or receives data to and from a storage array, and an FC switch connects the HBA to the storage port.
In this example, several single points of failure can be identified. The single HBA on the server, the server itself, the IP network, the FC switch, the storage array ports, or even the storage array could become potential single points of failure. To avoid single points of failure, it is essential to implement a fault-tolerant mechanism.
Fault Tolerance:
To mitigate a single point of failure, systems are designed with redundancy, such that the system will fail only if all the components in the redundancy group fail. This ensures that the failure of a single component does not affect data availability.
Data centers follow stringent guidelines to implement fault tolerance. Careful analysis is performed to eliminate every single point of failure. In the example shown in following figure, all enhancements in the infrastructure to mitigate single points of failures are emphasized:
1.Configuration of multiple HBAs to mitigate single HBA failure.
2.Configuration of multiple fabrics to account for a switch failure.
3.Configuration of multiple storage array ports to enhance the storage array’s availability.
4.RAID configuration to ensure continuous operation in the event of disk failure.
Implementing a storage array at a remote site to mitigate local site failure. Implementing server (host) clustering, a fault-tolerance mechanism whereby two or more servers in a cluster access the same set of volumes. Clustered servers exchange heartbeats to inform each other about their health. If one of the servers fails, the other server takes up the complete workload.