Fault tolerance of DFS.

written 8.7 years ago by

yashbeer ★ 11k

Fault tolerance is an important issue in designing a distributed file system.

Several types of faults can harm the integrity of the system. If a processor loses the contents of its main memory in a crash, operations may be left logically complete but physically incomplete: the operation finished from the program's point of view, but its results never reached the disk, so the stored data becomes inconsistent.
Decay of disk storage devices may also result in the loss or corruption of data stored by a file system. Decay means that a portion of the device becomes irretrievable.
The primary file properties that directly influence the ability of a distributed file system to tolerate faults are as follows:

i. Availability

This refers to the fraction of time for which the file is available for use. This property depends on the location of the clients and the location of the files.
Example: If a network is partitioned due to a communication failure, a file may be available to the clients on some nodes but not to the clients on other nodes. Replication is the primary mechanism for improving availability.
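
To see why replication improves availability, here is a minimal sketch. It assumes that each replica is reachable independently with the same probability, which is a simplification (a network partition can cut off several replicas at once). The function name and parameters are illustrative only.

```python
def replicated_availability(single_copy_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies is reachable.

    Assumption: copies fail independently and each one is reachable
    with probability `single_copy_availability`.
    """
    if not 0.0 <= single_copy_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("at least one replica is required")
    # The file is unavailable only if every copy is unavailable.
    return 1.0 - (1.0 - single_copy_availability) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, replicated_availability(0.9, n))
```

Under these assumptions, a second copy of a file that is reachable 90% of the time raises its availability to about 99%.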

ii. Robustness

This refers to the power of a file to survive crashes of the storage device and decay of the storage medium on which it is stored. Storage devices implemented using redundancy techniques are often used to store robust files.
A robust file may still be unavailable until the faulty component has been recovered, so robustness does not imply availability. Robustness is independent of both the location of the file and the location of the clients.

iii. Recoverability

This refers to the ability to roll back to the previous stable and consistent state when an operation on a file fails or is aborted by the client. Atomic update techniques such as transaction mechanisms are used to provide it.
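
As a rough single-machine illustration of an atomic update, the sketch below writes the new contents to a temporary file and then renames it over the original. This is not a full transaction mechanism; `atomic_write` is a hypothetical helper, and the approach relies on `os.replace` being atomic when both paths are on the same file system.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the contents of `path` with `data`, all or nothing."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final
    # rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new contents to the disk
        os.replace(tmp_path, path)  # commit: readers see old or new, never a mix
    except BaseException:
        # Abort: discard the partial update; the old file is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the write fails or is interrupted before the rename, the file at `path` still holds its previous consistent contents.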

The stable storage technique for fault tolerance of distributed file systems is given below:

Stable storage duplicates storage devices to implement a single stable storage device. The design ensures that the period during which only one of the two component devices is operational is significantly shorter than the mean time between failures of the stable device. In terms of crash resistance, storage is broadly classified into 3 types:

i. Volatile storage - This is storage such as RAM that cannot withstand power failures or machine crashes, i.e. the data stored is lost in these events.

ii. Non-volatile storage - This is storage such as a disk that can withstand CPU failures but cannot withstand transient I/O faults and decay of the storage media. Such storage media have complicated failure modes and prove to be insufficiently reliable for storing critical data.

iii. Stable storage - This storage can withstand even transient I/O faults and decay of the storage media.

The operations supported by stable storage are read and write. When these operations are carried out on each of the disks, they use retries to tolerate the effects of transient hardware faults.
A stable-storage system uses ordinary fallible disks and converts them into a reliable virtual device whose probability of failure is negligible.
This suits applications that require a high degree of fault tolerance, such as atomic transactions.
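
The read and write operations described above can be sketched as follows. This is a minimal in-memory model, not a real disk driver: the `Disk` and `StableStorage` classes, the `fail_next` fault-injection counter, and the use of a CRC32 checksum to detect decayed blocks are all assumptions made for illustration.

```python
import zlib


class TransientIOError(OSError):
    """Hypothetical error standing in for a transient hardware fault."""


class Disk:
    """In-memory stand-in for one ordinary, fallible disk."""

    def __init__(self):
        self.blocks = {}    # block number -> (data, checksum)
        self.fail_next = 0  # how many upcoming operations fail transiently

    def _maybe_fail(self):
        if self.fail_next > 0:
            self.fail_next -= 1
            raise TransientIOError("transient fault")

    def write(self, block_no, data):
        self._maybe_fail()
        self.blocks[block_no] = (data, zlib.crc32(data))

    def read(self, block_no):
        self._maybe_fail()
        data, checksum = self.blocks[block_no]
        if zlib.crc32(data) != checksum:
            raise OSError("decayed block")  # stored data no longer matches
        return data


def with_retries(operation, attempts=3):
    """Run `operation`, retrying only on transient faults."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientIOError:
            if attempt == attempts - 1:
                raise


class StableStorage:
    """Virtual device built from two fallible disks."""

    def __init__(self):
        self.primary, self.secondary = Disk(), Disk()

    def write(self, block_no, data):
        # Write the copies one after the other, so a crash in the
        # middle can damage at most one of them.
        with_retries(lambda: self.primary.write(block_no, data))
        with_retries(lambda: self.secondary.write(block_no, data))

    def read(self, block_no):
        try:
            return with_retries(lambda: self.primary.read(block_no))
        except (OSError, KeyError):
            # Primary copy is missing, decayed, or unreadable: use the other.
            return with_retries(lambda: self.secondary.read(block_no))
```

A short usage example: a write survives two transient faults through retries, and a read still succeeds after the primary copy decays.

```python
storage = StableStorage()
storage.primary.fail_next = 2           # first two write attempts fail
storage.write(0, b"hello")
checksum = storage.primary.blocks[0][1]
storage.primary.blocks[0] = (b"hxllo", checksum)  # simulate decay
print(storage.read(0))                  # served from the secondary copy
```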
