Reliable Data Structures

Critical to the operation of any fault-tolerant system are a set of fault tolerant data structures and well-defined methods on those structures. This project would design and implement some fault-tolerant data structures in the context of a cluster architecture. Thus, are free to use some assumptions about clusters (e.g., a non-byzantine failures). A short intro to some reliable data structure concepts for clusters can be found here. A more lengthy treatment can be found here.

In this project, designs should target tolerance to:

Scalability is important too, but absolute performance of any single operation can be slow. For example, a very fault tolerant hash table update that took 1 second to update might be OK if there was a scheme to make sure as load increased, the response time could be kept constant.

Cost is not an issue. Soon nodes will become very cheap (~$100 or less), so don't worry about minimizing any hardware costs. In fact, one purpose of this project is to investigate how we might trade $$ in hardware for more reliability. For example, today you can buy 4 times as much hardware and gett a 4X reliability improvement, but that would be an excellent property of a system.

Time for recovery is an important property. What happens when a software or hardware component fails? How fast can the system recover? What kinds of performance penalties are there during the recovery process? Ideally, the data structure should remain operational in the face of disasters and there should be little performance loss during recovers. I.e., there shouldn't be a bad "hiccup " (change in response time or throughput) when a component fails.

Class Projects

Implement a fault tolerant log. Any node should be able to append to the end of the log. How scalable is your log? How resistant is it to node failure?

Implement a fault tolerant hash table. How scalable is the hash table? How fast can you insert delete. How fast is the failure recovery?

Could you implement a fault tolerant array reduction; recall a reduction is some operation (sum or XOR) over the whole array. Assume your can have duplicates in the array. How would that change your algorithm?

Implement a fault tolerant array.

Define consistency semantics for your data structures. How do update occur? What are the implementation trade-offs?