Article Summary


Summary: The researchers in this article propose a system called selective and lightweight versioning (SLEEVE) to harden the Hadoop Distributed File System (HDFS) against errors and bugs caused by software faults and memory corruption. SLEEVE acts as a lightweight subsystem, built on an approximate data structure, that monitors the functioning of other HDFS subsystems such as namespace management. This approach addresses the problem of fail-silent behavior in HDFS efficiently.
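
To make the monitoring idea concrete, here is a minimal sketch of a main data structure whose answers are cross-checked against a compact, approximate second copy. The summary above does not say which approximate structure is used, so the Bloom filter here is an assumption, and the names BloomFilter, CheckedNamespace, set_owner and get_owner are hypothetical and not taken from HDFS or from the article.

```python
import hashlib


class BloomFilter:
    """Compact, lossy set summary (assumed structure, for illustration only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True can be a false positive; False is always definitive.
        return all(self.bits[pos] for pos in self._positions(item))


class CheckedNamespace:
    """Hypothetical namespace map guarded by a lightweight second version."""

    def __init__(self):
        self.main = {}               # full state: path -> owner
        self.shadow = BloomFilter()  # compact summary of the same facts

    def set_owner(self, path, owner):
        # Simplification: old values are never removed from the summary.
        self.main[path] = owner
        self.shadow.add(f"{path}={owner}")

    def get_owner(self, path):
        owner = self.main[path]
        # Cross-check the main version's answer before returning it.
        if not self.shadow.might_contain(f"{path}={owner}"):
            raise RuntimeError(f"state mismatch detected for {path}")
        return owner
```

In this sketch, overwriting an entry of `main` directly (simulating memory corruption) makes the next `get_owner` call raise instead of silently returning the corrupted value, unless the corrupted value happens to collide in the filter. A real system would attempt recovery at that point rather than raise.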

Strengths: The proposed system handles 90% of the fail-silent errors caused by random memory corruption. It is also efficient because it does not require a complete reboot; instead it uses micro-recovery, which is faster than a complete reboot. The system can isolate faulty behavior within a single node. SLEEVE protects only selected parts of the system, which keeps the overhead low. Its main strength is the ability to detect silent errors that would otherwise remain hidden.

Weaknesses: SLEEVE has been applied only to HDFS, and it is unclear how the approach would carry over to other distributed systems. The proposed system does not guarantee that it won’t trigger a recovery action when none is required, because it incorporates lossy compression. SLEEVE itself is also subject to anomalous bit flips and errors.

Questions: How would the SLEEVE approach apply to distributed systems other than HDFS?

How much of the engineering effort would be saved if the proposed system were implemented in a programming language other than Java, in which it is currently implemented?