Scalable I/O Systems

I/O is a critical component of large-scale computer systems. These systems run databases, web servers and commerce servers. Current I/O systems suffer from a lack of scalability, availability and manageability (SAM). Ideally, these systems should scale to hundreds of terabytes of capacity and support thousands of simultaneous users. The I/O system must also provide 24x7 availability. In addition, the system should allow the administrator to accept reduced availability in exchange for increased capacity or performance. Finally, large-scale I/O systems must be tractable to operate. This project works toward an I/O subsystem designed from the SAM perspective. For example, one should be able to create a 24x7 multi-terabyte storage subsystem that is manageable by a part-time system administrator. The system should survive multiple disk failures before losing any data. If the system were really good, it might even order replacement disks online to help reconstruct itself.




Figure 1: Scalable I/O system model.

The figure shows the model used for a scalable I/O system. A fast interconnect fabric, such as Myrinet or Giganet, is used to connect I/O devices to the main computer. I/O processors in the system provide the main OS with logical devices; these logical devices have better performance and availability properties than real devices. For example, an I/O processor might present a logical block device that uses a RAID scheme to provide better availability. Likewise, a logical network device might use striping to provide better bandwidth and better tolerate link failures.



Figure 1 shows a model of a typical computer system. At the "top" of the system are the processors and memory system. Powerful technical and non-technical forces will keep the CPU and memory systems locked together by high-speed, but proprietary, interconnects. I/O devices, on the other hand, are at the "bottom" of the machine; they must survive behind stable, and thus slower, hardware and software interfaces over long periods of time. The goal of this project is to extend the bottom of the machine, the I/O device space.

At the core of this project are remote devices. A remote device is a device accessed over a scalable network such as Myrinet or Giganet. These networks make an interesting fabric for a scalable I/O system: they can scale to hundreds of nodes, and their performance is in the range of most I/O devices. They are therefore close to an ideal fabric on which to build a SAM I/O subsystem. The first step in constructing a SAM I/O system is thus simply to access a device over a SAN.

An idea for a prototype scalable I/O system is to use a full-blown computer to emulate a device. A fast messaging layer, such as VIA or GM, could be used as the transport. The basic method for creating remote devices would be to write a special device driver in the kernel that translates kernel calls into messages to the remote device and also handles the reverse path.
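The translation from driver calls to messages can be sketched at user level before any kernel work is done. The Python sketch below is only an illustration under stated assumptions: the header layout, the opcodes, the 512-byte block size and the class names RemoteDeviceEmulator and RemoteBlockDriver are all hypothetical, and the transport is a plain function call standing in for VIA or GM.

```python
import struct

# Hypothetical wire format (not taken from VIA or GM): opcode, request id,
# starting block number and block count, packed big-endian without padding.
HEADER = struct.Struct("!BIQI")
OP_READ, OP_WRITE = 1, 2
BLOCK_SIZE = 512  # assumed block size


def encode_request(op, req_id, block, count, payload=b""):
    """Pack one driver-style block request into a message."""
    if op == OP_WRITE and len(payload) != count * BLOCK_SIZE:
        raise ValueError("write payload must be count * BLOCK_SIZE bytes")
    return HEADER.pack(op, req_id, block, count) + payload


def decode_request(message):
    """Unpack a message into (op, req_id, block, count, payload)."""
    op, req_id, block, count = HEADER.unpack_from(message)
    return op, req_id, block, count, message[HEADER.size:]


class RemoteDeviceEmulator:
    """Stands in for the full computer that emulates the device."""

    def __init__(self, num_blocks):
        self.store = bytearray(num_blocks * BLOCK_SIZE)

    def handle(self, message):
        op, req_id, block, count, payload = decode_request(message)
        start, end = block * BLOCK_SIZE, (block + count) * BLOCK_SIZE
        reply = struct.pack("!I", req_id)  # echo the id so replies can be matched
        if op == OP_WRITE:
            self.store[start:end] = payload
            return reply
        return reply + bytes(self.store[start:end])


class RemoteBlockDriver:
    """Driver-side proxy: turns read/write calls into messages."""

    def __init__(self, send):
        self.send = send  # callable: request bytes -> reply bytes (the transport)
        self.next_id = 0

    def _call(self, op, block, count, payload=b""):
        self.next_id += 1
        reply = self.send(encode_request(op, self.next_id, block, count, payload))
        (reply_id,) = struct.unpack_from("!I", reply)
        if reply_id != self.next_id:
            raise RuntimeError("reply does not match request")
        return reply[4:]

    def read(self, block, count=1):
        return self._call(OP_READ, block, count)

    def write(self, block, data):
        self._call(OP_WRITE, block, len(data) // BLOCK_SIZE, data)
```

Wiring the two together with RemoteBlockDriver(RemoteDeviceEmulator(8).handle) gives a loopback device; replacing the send callable with a real messaging layer is where the actual project work would begin.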

Class Projects

Disk: Create an abstract block device driver to translate block driver calls to messages to a disk device. The remote disk device should be a process running on another computer. Compare the performance of your remote device to a real disk.
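One way to make the comparison concrete is a small timing harness that treats both the real disk and the remote device as a read_block(n) callable. This is a rough sketch, not a benchmarking methodology: the names FileBlockReader and measure, and the 512-byte block size, are assumptions made for illustration, and reads through a file path will be served from the OS buffer cache unless that is bypassed.

```python
import time

BLOCK_SIZE = 512  # assumed block size


class FileBlockReader:
    """read_block(n) callable backed by a file or raw device path."""

    def __init__(self, path, block_size=BLOCK_SIZE):
        self.f = open(path, "rb", buffering=0)
        self.block_size = block_size

    def __call__(self, n):
        self.f.seek(n * self.block_size)
        return self.f.read(self.block_size)

    def close(self):
        self.f.close()


def measure(read_block, num_blocks):
    """Time num_blocks sequential reads and report latency and bandwidth."""
    total = 0
    start = time.perf_counter()
    for n in range(num_blocks):
        total += len(read_block(n))
    elapsed = time.perf_counter() - start
    return {
        "bytes": total,
        "avg_latency_s": elapsed / num_blocks,
        "bytes_per_s": total / elapsed if elapsed > 0 else float("inf"),
    }
```

Running measure once with a FileBlockReader on the local disk and once with the remote device's read function gives two comparable result dictionaries. Sequential reads are only one access pattern; random reads and writes would need their own loops.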

Network: Hack network drivers to translate kernel driver calls into messages to your remote network device. Compare the performance of the remote network device to that of a local network device.

Build a (sort of) scalable router using the Solaris IP stacks and many (10?) remote devices. How many packets per second can you scale the router to? The reason to use Solaris is that the Solaris kernel supports a higher degree of parallelism than either Linux or NT.

Graphics: This one is a bit more challenging. Could you create a remote graphics device by altering an X server and mapping the remote framebuffer memory into a VIA memory map? Would it be more appropriate to just send X requests to the remote device directly?



Follow-on projects once the above basic infrastructure is working:

Autoconfiguration

When the device is first plugged into the system, have the system configure the device into the kernel automatically. How would the device announce itself? How will the system configure it, and how will it de-configure it on removal?
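One possible answer to these questions, sketched here purely as an assumption, is for each device to send periodic announcement messages and for the host to de-configure a device either on an explicit removal message or when announcements stop. The JSON message fields, the timeout and the DeviceRegistry name below are hypothetical.

```python
import json


def make_message(kind, device_id, device_class=None, **properties):
    """Build an 'announce' or 'remove' message (hypothetical format)."""
    return json.dumps({
        "type": kind,
        "id": device_id,
        "class": device_class,
        "properties": properties,
    }).encode()


class DeviceRegistry:
    """Host-side table of currently configured remote devices."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout  # seconds of silence before de-configuring
        self.devices = {}       # device id -> description and last_seen time

    def on_message(self, raw, now):
        msg = json.loads(raw)
        dev_id = msg["id"]
        if msg["type"] == "announce":
            is_new = dev_id not in self.devices
            self.devices[dev_id] = {
                "class": msg["class"],
                "properties": msg["properties"],
                "last_seen": now,
            }
            return "configured" if is_new else "refreshed"
        if msg["type"] == "remove" and dev_id in self.devices:
            del self.devices[dev_id]
            return "deconfigured"
        return "ignored"

    def expire(self, now):
        """De-configure devices whose announcements have stopped."""
        gone = [d for d, info in self.devices.items()
                if now - info["last_seen"] > self.timeout]
        for d in gone:
            del self.devices[d]
        return gone
```

The timeout path covers a device that is unplugged without warning, while the explicit remove message covers an orderly shutdown. A real system would also have to decide what happens to I/O still in flight at removal time.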

Error Recovery

Implement a simple chained declustering or mirroring scheme where the abstract driver uses at least 2 remote devices and can recover in case one fails. How might you add devices to the system to achieve better availability characteristics?
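The mirroring variant can be sketched in a few lines. The example below is a minimal illustration, assuming in-memory stand-ins for the remote devices; MemoryDevice, MirroredDevice and DeviceFailed are hypothetical names. It shows plain mirroring only, not chained declustering, and it does not resynchronize a replica that comes back after a failure.

```python
class DeviceFailed(Exception):
    """Raised when a device (or every replica) cannot serve a request."""


class MemoryDevice:
    """In-memory stand-in for one remote block device."""

    def __init__(self, num_blocks, block_size=512):
        self.blocks = [bytes(block_size)] * num_blocks
        self.failed = False  # set to True to simulate a failure

    def read(self, n):
        if self.failed:
            raise DeviceFailed("device is down")
        return self.blocks[n]

    def write(self, n, data):
        if self.failed:
            raise DeviceFailed("device is down")
        self.blocks[n] = data


class MirroredDevice:
    """Abstract driver: writes go to every live replica, reads fail over."""

    def __init__(self, replicas):
        if len(replicas) < 2:
            raise ValueError("mirroring needs at least 2 devices")
        self.replicas = list(replicas)
        self.alive = [True] * len(self.replicas)

    def write(self, n, data):
        written = 0
        for i, dev in enumerate(self.replicas):
            if not self.alive[i]:
                continue
            try:
                dev.write(n, data)
                written += 1
            except DeviceFailed:
                self.alive[i] = False  # stop using this replica
        if written == 0:
            raise DeviceFailed("all replicas failed")

    def read(self, n):
        for i, dev in enumerate(self.replicas):
            if not self.alive[i]:
                continue
            try:
                return dev.read(n)
            except DeviceFailed:
                self.alive[i] = False  # fall through to the next replica
        raise DeviceFailed("all replicas failed")
```

With N replicas this design survives N-1 failures, at the cost of N times the storage and write traffic. Chained declustering would spread the load of a failed device across its neighbors instead of doubling it on one partner.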

Data Splicing

Does your architecture allow shortcuts? For example, could X Window data be sent directly from the network to the graphics device? Could you modify the read/write interface to allow shortcutting? What about disk-to-disk shortcutting? That would certainly be useful for copying. How might you also specify a computation in the path of the shortcut?
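As a starting point for thinking about the interface, a shortcut can be modeled as a single call that names a source, a sink and an optional computation, so that the data never returns to the caller. The splice function below is a hypothetical sketch: it uses in-memory byte streams as stand-ins for devices and runs on one machine, whereas in the real system the loop would run on or between the remote devices.

```python
import io


def splice(source, sink, transform=None, chunk_size=4096):
    """Move data from source to sink, optionally transforming each chunk.

    The caller only sees the byte count, never the data itself.
    """
    moved = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        if transform is not None:
            chunk = transform(chunk)  # computation in the path of the shortcut
        sink.write(chunk)
        moved += len(chunk)
    return moved


# Disk-to-disk copy with a computation in the path (stand-in devices).
src = io.BytesIO(b"hello " * 1000)
dst = io.BytesIO()
copied = splice(src, dst, transform=bytes.upper)
```

Passing the computation as a per-chunk function keeps the interface small, but it limits the computation to operations that do not need to see chunk boundaries or the whole stream. Whether code can be shipped to a device at all is one of the open questions of this project.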

Caching

Develop a cache control interface between the application and your remote device. How could a web or NFS server take advantage of a cache in the network device? For example, a web server might command the network device to automatically return a popular page, thereby avoiding a full path of the request through the OS. What would the caching interface look like? What about a disk cache interface?
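To make the web server example concrete, here is one guess at what such an interface might look like. Everything in it is an assumption for illustration: the CachingNetworkDevice class, the pin and invalidate calls, and the oldest-first eviction policy are hypothetical, and the host_handler callable stands in for the full request path through the OS.

```python
from collections import OrderedDict


class CachingNetworkDevice:
    """Network device that answers pinned requests without involving the host."""

    def __init__(self, host_handler, capacity=2):
        self.host_handler = host_handler  # slow path: full trip through the OS
        self.capacity = capacity          # number of responses the device can hold
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def pin(self, key, response):
        """Cache control call: the application asks the device to serve key."""
        self.cache[key] = response
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the oldest pinned entry

    def invalidate(self, key):
        """Cache control call: the application withdraws a cached response."""
        self.cache.pop(key, None)

    def handle_request(self, key):
        """Called for each incoming request arriving at the device."""
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        return self.host_handler(key)
```

Here the application, not the device, decides what is popular and when a page is stale, which keeps the device simple. A disk cache interface might expose similar calls keyed by block number rather than by request.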