Recent improvements in workstation and interconnection network performance have popularized clusters of off-the-shelf workstations as a platform for both high-performance and interactive applications. Such clusters now have the potential to approach the performance of supercomputers and the ease of use of mainframes with much cheaper hardware. However, two deficiencies of most current distributed operating systems remain the main obstacles to using these clusters as general-purpose computing platforms: their inadequate management of the available resources and their inability to present an integrated view that simplifies the utilization, programming, and management of the cluster as a whole.
In order to eliminate these problems and harness the computational power of large clusters of workstations, we propose Nomad, an efficient operating system for clusters of uniprocessors and/or multiprocessors. The main goal of Nomad is to efficiently support parallel, distributed, and sequential applications, whether high-performance or interactive. Nomad includes several characteristics that are important for modern cluster-oriented operating systems: scalability, efficient resource management across the cluster, efficient scheduling of parallel and distributed applications, distributed I/O, fault detection and recovery, protection, and backward compatibility.
The mechanisms used to implement these characteristics include unique cluster-wide process identifiers, process checkpointing and migration, co-scheduling of concurrent applications, and a distributed file system. Some of these mechanisms, such as process checkpointing and migration, can be found in previously proposed systems. However, Nomad is unique in the particular set of characteristics it includes, in its strategy for disseminating information across the cluster, and in that it manages all cluster resources while using neither extra messages nor centralized servers to implement its functionality. Nomad avoids sending extra messages by relying on the communication intrinsic to its distributed file system (which is needed for high disk I/O throughput and fault tolerance anyway). For instance, the load information needed to make process migration decisions is piggybacked on file access messages.
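The piggybacking idea can be illustrated with a minimal sketch. This is not Nomad's implementation: the class names, the single floating-point load metric, and the threshold-based choice of a migration target are all assumptions made purely for illustration. The sketch only shows how a node might learn its peers' loads as a side effect of file traffic that would be exchanged anyway, with no dedicated load messages and no central server.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FileMessage:
    """A file-access request or reply carrying a piggybacked load value."""
    sender: str
    payload: bytes
    sender_load: float  # hypothetical load metric, e.g. run-queue length


@dataclass
class Node:
    """A cluster node that learns peer loads only from file traffic."""
    name: str
    load: float = 0.0
    known_loads: Dict[str, float] = field(default_factory=dict)

    def make_message(self, payload: bytes) -> FileMessage:
        # Attach the current local load to a message that is sent anyway.
        return FileMessage(self.name, payload, self.load)

    def receive(self, msg: FileMessage) -> None:
        # Record the sender's load as a side effect of normal file access.
        self.known_loads[msg.sender] = msg.sender_load

    def migration_target(self, threshold: float = 1.0) -> Optional[str]:
        # Assumed policy: pick the least-loaded known peer, but only if it
        # is lighter than this node by more than `threshold`.
        if not self.known_loads:
            return None
        peer, peer_load = min(self.known_loads.items(), key=lambda kv: kv[1])
        return peer if self.load - peer_load > threshold else None


if __name__ == "__main__":
    a = Node("a", load=4.0)
    b = Node("b", load=0.5)
    c = Node("c", load=2.0)
    a.receive(b.make_message(b"read reply"))
    a.receive(c.make_message(b"write ack"))
    print(a.migration_target())
```

In this sketch a node's view of the cluster is only as fresh as its most recent file exchange with each peer, which is the trade-off that comes with sending no dedicated load messages.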
Nomad is now under development as a layer of software on top of standard Unix-like operating systems. The current prototype of the system already includes most but not all of Nomad's characteristics. The prototype supports the x86/Pentium family of microprocessors using Linux as the base operating system and the Alpha family of microprocessors using Digital Unix as the base operating system.
Our preliminary evaluation of the mechanisms that have already been implemented (checkpointing, migration, scheduling, and distributed I/O) shows that Nomad performs as well as or better than similar previously proposed systems. In addition, our preliminary evaluation of the load balancing aspect of Nomad shows that the pattern of file accesses in our distributed file system allows for efficient and scalable load balancing. Based on these results, our main conclusion is that the complete implementation of Nomad will most likely be efficient and will provide an interesting platform for future research on operating systems for clusters of workstations.