Short Note on: Fault tolerance
  • A system fails when it cannot meet its promises. An error is a part of the system's state that may lead to a failure, and the cause of an error is called a fault.
  • Faults are generally classified as transient, intermittent or permanent. A transient fault occurs once and then disappears, an intermittent fault keeps recurring, and a permanent fault persists until the faulty component is repaired.
  • Fault tolerance is closely related to dependability, which covers availability, reliability, safety and maintainability.
  • The ability of a system to continue functioning in the event of partial failure is known as fault tolerance.
  • An important goal in distributed system design is to construct a system that can automatically recover from partial failures without seriously affecting overall performance.

To improve fault tolerance in a distributed operating system (DOS):

1. Redundancy techniques:

  • The aim of redundancy techniques is to avoid a single point of failure by replicating critical hardware and software components, so that if one of them fails, another can take over and the system continues to work despite occasional partial failures.
  • Example:
  • A critical process can be run simultaneously on two nodes so that if one of the two copies fails, the execution can be completed on the other node.
  • Similarly, a critical file may be replicated on two or more storage devices for better reliability. However, replication adds system overhead.
  • Thus a DOS must be designed to balance the degree of reliability against the overhead incurred.
  • A system is k-fault tolerant if it continues to function when k components have failed. If a system is to tolerate k fail-stop failures, then k+1 replicas are needed.
  • If k replicas are lost due to failures, the one remaining replica can be used to keep the system functioning.
  • If a system is to tolerate k Byzantine failures, a minimum of 2k+1 replicas is needed, because a voting mechanism can then accept the answer given by the majority of k+1 correct replicas even when k replicas behave abnormally.
  • Another important redundancy technique is stable storage: a virtual storage device designed to withstand transient I/O faults and decay of the storage media.
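
The replica counts above can be illustrated with a short Python sketch. It is only an illustration under simplifying assumptions: the function names (`replicas_needed`, `first_live_result`, `majority_vote`) are made up for this note, a crashed replica is modelled as returning `None`, and replies are assumed to be directly comparable values.

```python
from collections import Counter


def replicas_needed(k, byzantine=False):
    """Minimum number of replicas to tolerate k faults (rule quoted above)."""
    return 2 * k + 1 if byzantine else k + 1


def first_live_result(results):
    """Fail-stop case: a failed replica gives None, and any surviving
    replica's answer can be trusted, so the first live answer is used."""
    for result in results:
        if result is not None:
            return result
    raise RuntimeError("all replicas failed")


def majority_vote(results, k):
    """Byzantine case: a faulty replica may return a wrong answer, so a
    value is accepted only if at least k+1 replicas agree on it."""
    value, count = Counter(results).most_common(1)[0]
    if count < k + 1:
        raise RuntimeError("fewer than k+1 replicas agree")
    return value


if __name__ == "__main__":
    k = 2
    print(replicas_needed(k))                   # fail-stop replicas
    print(replicas_needed(k, byzantine=True))   # Byzantine replicas
    print(first_live_result([None, None, 42]))  # two of three replicas crashed
    print(majority_vote([7, 7, 7, 9, 1], k))    # two of five replicas lie
```

With k = 2, three replicas are enough when failed replicas simply stop, while five are needed when up to two replicas may return wrong answers, since the three correct ones must outvote them.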

2. Distributed control:

  • For better reliability, the algorithms and protocols used in a DOS must employ distributed control mechanisms to avoid a single point of failure.
  • Example:

    ▪ A highly available distributed file system should have multiple, independent file servers controlling multiple, independent storage devices.

    ▪ Distributed control can also be used for name servers, scheduling algorithms and other execution control functions.
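
As a rough sketch of the file server example, the Python code below models two independent servers and a client that does not depend on any single one of them. Everything here is hypothetical: `FileServer`, `write_all` and `read_any` are names invented for this note, the servers are in-memory stand-ins, and replica consistency after a server recovers is deliberately ignored.

```python
class FileServer:
    """Hypothetical in-memory stand-in for one independent file server."""

    def __init__(self, name):
        self.name = name
        self.files = {}   # this server's own storage device
        self.up = True    # set to False to simulate a crash

    def write(self, path, data):
        if not self.up:
            raise ConnectionError(self.name + " is down")
        self.files[path] = data

    def read(self, path):
        if not self.up:
            raise ConnectionError(self.name + " is down")
        return self.files[path]


def write_all(servers, path, data):
    """Write to every reachable server; return how many copies were stored."""
    stored = 0
    for server in servers:
        try:
            server.write(path, data)
            stored += 1
        except ConnectionError:
            continue  # a failed server must not block the others
    return stored


def read_any(servers, path):
    """Return the file from the first server that can supply it."""
    for server in servers:
        try:
            return server.read(path)
        except (ConnectionError, KeyError):
            continue  # server is down or has no copy; try the next one
    raise ConnectionError("no server could supply " + path)


if __name__ == "__main__":
    servers = [FileServer("fs1"), FileServer("fs2")]
    write_all(servers, "/notes.txt", "fault tolerance")
    servers[0].up = False                    # one server crashes
    print(read_any(servers, "/notes.txt"))   # still served by fs2
```

Because no server acts as a central coordinator, losing one of them only reduces the number of copies; the file becomes unavailable only when every server holding it is down.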
