Discuss file caching for Distributed Algorithm

written 5.6 years ago by

yashbeer ★ 11k

A file-caching scheme for a distributed file system contributes to its scalability and reliability, because remotely located data can be cached on a client node. Every distributed file system uses some form of file caching.

The key design issues in a file-caching scheme are:

1.Cache Location

Cache location is the place where the cached data is stored. There are three possible cache locations:

i.Server’s main memory:

A cache located in the server’s main memory eliminates the disk access cost on a cache hit, which increases performance compared to no caching. The network access cost remains, because the client must still contact the server.

Reasons for locating the cache in the server’s main memory:

Easy to implement
Totally transparent to clients
Easy to keep the original file and the cached data consistent.

ii.Client’s disk:

If the cache is located on the client’s disk, it eliminates the network access cost but still incurs a disk access cost on a cache hit. This can be slower than having the cache in the server’s main memory, which is also simpler to implement.

Advantages:

Provides reliability: cached data is not lost when the client crashes.
Large storage capacity.
Contributes to scalability, since cache hits are served without contacting the server.

Disadvantages:

Does not work if the system is to support diskless workstations.
Access time is considerably higher, since every cache hit still requires a disk access.

iii.Client’s main memory:

A cache located in a client’s main memory eliminates both the network access cost and the disk access cost. However, a client’s disk cache is preferred when a large cache size and increased reliability of cached data are desired.

Advantages:

Maximum performance gain.
Permits workstations to be diskless.
Contributes to reliability and scalability.
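
The trade-off between the three cache locations can be sketched as a small cost model. This is only an illustration: the cost constants and the function name `effective_access_ms` are assumptions chosen for the example, not measurements of any real system, and a miss is simplified to one network round trip plus one server disk access.

```python
# Illustrative per-access costs in milliseconds. These numbers are
# assumptions for the example, not measurements of a real system.
NETWORK_MS = 2.0
DISK_MS = 8.0
MEMORY_MS = 0.001

# Cost of a cache hit at each cache location.
HIT_COST_MS = {
    "server_main_memory": NETWORK_MS + MEMORY_MS,  # network crossed, server disk avoided
    "client_disk": DISK_MS,                        # network avoided, local disk read
    "client_main_memory": MEMORY_MS,               # neither network nor disk
}

# Simplification: a miss always costs a network trip plus a server disk access.
MISS_COST_MS = NETWORK_MS + DISK_MS


def effective_access_ms(location: str, hit_ratio: float) -> float:
    """Average cost per access: hits pay the local cost, misses the full trip."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * HIT_COST_MS[location] + (1.0 - hit_ratio) * MISS_COST_MS


if __name__ == "__main__":
    for location in HIT_COST_MS:
        print(location, round(effective_access_ms(location, 0.9), 3))
```

Under these assumed costs the client’s main memory gives the lowest average access time, which matches the "maximum performance gain" listed above; different constants (for example, a very fast local disk) could change the ordering of the other two locations.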

2.Modification propagation

When the cache is located on client nodes, a file’s data may simultaneously be cached on multiple nodes. The caches can become inconsistent when one client changes the file data and the corresponding data cached at other nodes is neither updated nor discarded.

The modification propagation scheme used has a critical effect on the system’s performance and reliability.

Techniques used include:

i.Write-through scheme

When a cache entry is modified, the new value is immediately sent to the server for updating the master copy of the file.

Advantages:

High degree of reliability and suitability for UNIX-like semantics.
The risk of updated data getting lost in the event of a client crash is low.

Disadvantage:

Poor write performance, since every write access must wait until the master copy on the server has been updated.
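
The write-through scheme can be sketched as follows. This is a minimal in-process illustration, not a real file system client: `FileServer` and `WriteThroughCache` are hypothetical names, and the "network" is just a method call.

```python
class FileServer:
    """Hypothetical stand-in for the remote server holding the master copy."""

    def __init__(self):
        self.master = {}       # block id -> data
        self.write_count = 0   # number of write requests received

    def write(self, block_id, data):
        self.master[block_id] = data
        self.write_count += 1

    def read(self, block_id):
        return self.master.get(block_id)


class WriteThroughCache:
    """Client cache that forwards every modification to the server at once."""

    def __init__(self, server):
        self.server = server
        self.entries = {}      # block id -> locally cached data

    def read(self, block_id):
        if block_id not in self.entries:       # miss: fetch from the server
            self.entries[block_id] = self.server.read(block_id)
        return self.entries[block_id]

    def write(self, block_id, data):
        self.entries[block_id] = data          # update the cached copy
        self.server.write(block_id, data)      # and the master copy immediately


if __name__ == "__main__":
    server = FileServer()
    cache = WriteThroughCache(server)
    for i in range(3):
        cache.write("block0", f"v{i}")
    print(server.write_count)   # one server write per client write
```

Every call to `write` reaches the server, so the master copy is never behind the cache, but the client pays one server request per write.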

ii.Delayed-write scheme

The delayed-write scheme is used to reduce network traffic for writes. When an entry is modified, the new data value is written only to the cache, and all updated cache entries are sent to the server at a later time.

There are three commonly used delayed-write approaches:

Write on ejection from cache:

Modified data in the cache is sent to the server only when the cache-replacement policy decides to eject it from the client’s cache. This can result in good performance, but it creates a reliability problem, since some server data may remain outdated for a long time.

Periodic write:

The cache is scanned periodically and any cached data that has been modified since the last scan is sent to the server.

Write on close:

Modifications to cached data are sent to the server when the client closes the file. This does not help much in reducing network traffic for files that are open for very short periods or are rarely modified.

Advantage:

Write accesses complete more quickly, which results in a performance gain.

Disadvantage:

Reliability can be a problem, since modifications that have not yet been sent to the server are lost if the client crashes.
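
The three delayed-write triggers can be sketched with a small cache that tracks which entries are modified ("dirty"). This is an illustrative sketch under assumptions: `FileServer`, `DelayedWriteCache`, the LRU replacement policy, and the `flush` method are hypothetical; a real client would call `flush` from a timer (periodic write) or from its close handler (write on close).

```python
from collections import OrderedDict


class FileServer:
    """Hypothetical stand-in for the remote server holding the master copy."""

    def __init__(self):
        self.master = {}
        self.write_count = 0

    def write(self, block_id, data):
        self.master[block_id] = data
        self.write_count += 1


class DelayedWriteCache:
    """Client cache that keeps modifications local until a trigger fires."""

    def __init__(self, server, capacity=2):
        self.server = server
        self.capacity = capacity
        self.entries = OrderedDict()   # block id -> data, least recently used first
        self.dirty = set()             # block ids modified since last sent

    def write(self, block_id, data):
        self.entries[block_id] = data
        self.entries.move_to_end(block_id)     # mark as most recently used
        self.dirty.add(block_id)
        while len(self.entries) > self.capacity:
            self._eject()

    def _send(self, block_id):
        if block_id in self.dirty:             # clean entries need no traffic
            self.server.write(block_id, self.entries[block_id])
            self.dirty.discard(block_id)

    def _eject(self):
        # Write on ejection: the victim is sent only when it leaves the cache.
        victim = next(iter(self.entries))
        self._send(victim)
        del self.entries[victim]

    def flush(self):
        # Periodic write or write on close: send every modified entry.
        for block_id in list(self.entries):
            self._send(block_id)
```

Repeated writes to the same entry are coalesced into a single server write at flush time, which is where the traffic saving comes from; anything still in `dirty` when the client crashes is lost.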

3.Cache validation schemes

The modification propagation policy only specifies when the master copy of a file on the server node is updated after a cache entry is modified. It says nothing about when the file data residing in the caches of other nodes is updated. A file’s data may simultaneously reside in the caches of multiple nodes, and a client’s cache entry becomes stale as soon as some other client modifies the corresponding data in the master copy on the server.

It is therefore necessary to verify whether the data cached at a client node is consistent with the master copy. If it is not, the cached data must be invalidated and the updated version fetched again from the server.

There are two approaches to verify the validity of cached data:

i.Client-initiated approach

The client contacts the server and checks whether its locally cached data is consistent with the master copy.

Checking before every access: This defeats the purpose of caching, because the server must be contacted on every access.
Periodic checking: A check is initiated at fixed intervals of time.
Check on file open: The cache entry is validated on a file open operation.
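
The check-on-file-open variant can be sketched as below. The text does not say how the client and server compare copies, so this sketch assumes a per-file version counter kept by the server; `FileServer`, `CheckOnOpenClient`, and `version_of` are hypothetical names.

```python
class FileServer:
    """Hypothetical server that bumps a version number on every write."""

    def __init__(self):
        self.files = {}   # file name -> (version, data)

    def write(self, name, data):
        version = self.files.get(name, (0, None))[0] + 1
        self.files[name] = (version, data)

    def version_of(self, name):
        return self.files[name][0]

    def fetch(self, name):
        return self.files[name]


class CheckOnOpenClient:
    """Client that validates its cached copy only when a file is opened."""

    def __init__(self, server):
        self.server = server
        self.cache = {}    # file name -> (version, data)
        self.fetches = 0   # how many times full data was transferred

    def open(self, name):
        cached = self.cache.get(name)
        # Assumed validation step: compare the cached version with the master's.
        if cached is None or cached[0] != self.server.version_of(name):
            self.cache[name] = self.server.fetch(name)   # stale or missing: refetch
            self.fetches += 1
        return self.cache[name][1]
```

The validation message is small compared with transferring the file, so a valid cache entry still saves most of the network cost; between opens, however, the client may read stale data.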

ii.Server-initiated approach

A client informs the file server when opening a file, indicating whether the file is being opened for reading, writing, or both. The file server keeps a record of which client has which file open and in what mode. The server monitors the file usage modes of the different clients and reacts whenever it detects a potential for inconsistency. For example, if a file is open for reading, other clients may be allowed to open it for reading, but not for writing. Similarly, a new client cannot open a file in any mode if the file is already open for writing.
When a client closes a file, it notifies the server and sends along any modifications made to the file. The server then updates its record of which client has which file open in which mode.
When a new client requests to open an already open file and the server finds that the new open mode conflicts with the existing open mode, the server can deny the request, queue the request, or disable caching by asking all clients that have the file open to remove it from their caches.
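
The server’s bookkeeping can be sketched as a table of open modes. This is a minimal sketch that implements only the "deny the request" reaction; queuing and disabling caching are left out. The class name `OpenTable`, the method names, and the mode strings `"r"`, `"w"`, and `"rw"` are assumptions for illustration.

```python
class OpenTable:
    """Server-side record of which client has which file open in which mode."""

    def __init__(self):
        self.opens = {}   # file name -> {client id: mode}

    def try_open(self, client, name, mode):
        """Record the open and return True, or return False on a conflict."""
        if mode not in ("r", "w", "rw"):
            raise ValueError("mode must be 'r', 'w' or 'rw'")
        others = {c: m for c, m in self.opens.get(name, {}).items() if c != client}
        wants_write = "w" in mode
        someone_writes = any("w" in m for m in others.values())
        # Conflict if another client is writing, or if this client wants to
        # write while any other client has the file open.
        if someone_writes or (wants_write and others):
            return False   # this sketch simply denies the request
        self.opens.setdefault(name, {})[client] = mode
        return True

    def close(self, client, name):
        """Drop the client's entry; forget the file once nobody has it open."""
        holders = self.opens.get(name, {})
        holders.pop(client, None)
        if not holders:
            self.opens.pop(name, None)
```

For example, two clients can both call `try_open(..., "r")` on the same file, while a third client asking for `"w"` is refused until the readers have closed the file.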
