Goal: Store huge files (terabyte and larger) in a file system that is distributed across many (thousands) of machines.
Conventional file systems lay out an entire file system on one logical storage device (a disk or NAND flash memory). This means that the metadata and data for all files and directories is located in this one device. In some cases, the device might be a collection of disks (e.g., a RAID drive or several disks connected by a volume manager) but, to the operating system, it appears as one device – one collection of blocks. This places a restriction on each file system that it cannot span one machine. Conventional distributed file systems, used for network attached storage (NAS), basically use a file system access protocol but interact with one or more servers running these conventional file systems.
With parallel file systems, the metadata and data of files are separated onto different entities. The metadata of a file is the set of attributes about a file: its name, size, access permissions, time stamps, etc. The data of a file is, well, the file’s data. Metadata does not take up much space. It’s a small set of attributes and each attribute is generally only a few bytes in size. Data, however, can be huge: terabytes or more for large files.
The idea of a parallel file system is to have a system – or systems – manage the metadata. As part of the metadata for each file, the system will identify the systems and logical blocks holding the file’s data. By doing this, we gain several benefits:
A file’s data is no longer limited to the capacity of any one system and can be spread out across many systems.
Because data blocks are distributed among multiple systems, access is effectively load balanced across them. Moreover, multiple systems can process different file or different parts of a file with read operations taking place on different systems. This provides scalable bandwidth, comparable to striping in storage arrays.
Data blocks can be replicated on multiple systems, providing fault tolerance and the potential for greater read bandwidth since processes can read from any of the replicated data blocks.
This design approach is the basis for the Google File System (GFS), the Hadoop Distributed File System (HDFS, essentially a clone of GFS), and distributed file systems used in supercomputing clusters, such as Luster and GlusterFS. We will not look at these last two but the concepts are similar.
Google File System (GFS)
The Google File System is designed to support large data-intensive applications (the kind of algorithms Google uses for search and other services). The system is designed for an environment where there are many thousands of storage servers, some of which are expected to be down at any given point in time. The data that is managed comprises of billions of objects and many petabytes of data. GFS is not designed to be a general-purpose file system and is optimized for large files (rather than lots of small files), streaming reads (more common than writes), and atomic appends (which may be done concurrently by multiple clients).
Each GFS cluster contains one master file server. This is a faster and more reliable machine that manages file system metadata, such as names and attributes of files and directories. None of the actual file data resides on this system. Instead, it contains a mapping of the file contents to the chunks that hold the data. Chunks are fixed-size, 64 MB blocks and are stored in chunkservers. The chunkservers are less reliable than the master and are replicated so that each chunk is stored on typically three three separate chunkservers.
Clients contact the master to look up files. The master gives them a chunkserver name and chunk handles for the files requested.
To write to a file, the master grants a chunk lease to one of the replicated chunks. The chunkserver holding this is the primary replica chunkserver for that chunk. Writing is a two-stage process. First, the client writes to one replica. That replica then forwards the data to another replica chunk server, and so on until all replicas for that chunk receive the data. This relieves load from the client in that the client has to only write one copy. It also doesn’t put any one computer within GFS in the position of having to send out N copies of the data.
Once all replicas acknowledge receipt of the data, the second stage, writing, takes place. The client sends a write request to the primary chunkserver, identifying the data that was sent. The primary assigns a sequence to the write operations that take place on the data and sends write requests to all the replicas so that they will write data in the same sequence-number order. Note that the data flow (data transfer) from client is chained: it goes to the primary replica, then to secondary replica, then another secondary replica, etc. This pipelining optimizes the bandwidth on each link. Moreover, each system will try to pick the nearest system that has not yet been added to the chain to send data. The client has the choice of all the replicas and can pick the closest one. The control flow (write commands) is sent from the client to the primary and then goes from the primary directly to all of the secondaries (but is a much, much smaller message).
Hadoop Distributed File System
Apache’s Hadoop Distributed File System (HDFS) is incredibly similar to GFS and is designed with essentially the same goals. The key distinction is that it does not support modifying a file once it is created, but it does support appends. This avoids the need to manage leases or locks for the file.
Like GFS’s master, HDFS uses a separate server, called a NameNode, that is responsible for keeping track of the filesystem namespace and the location of replicated file blocks. File data is broken into 128 MB blocks (default, but configurable per file) and is stored on DataNodes (GFS’s chunkservers). Each block (GFS chunk) is replicated on multiple DataNodes (the default is three, as with GFS). As with GFS, all file system metadata, names and block maps, is stored in memory on the DataNode for performance. File writes from clients are pipelined through each of the replica DataNodes that will hold the data.
HDFS uses rack-aware logic for determining which computers should host replicas for a block. The first replica is targeted to a computer in the same rack as the requesting client (to minimize latency). Second and additional replicas come from other racks in the data center for fault tolerance in case the first rack fails. The process for writing a block is essentially the same as in GFS. Data writes are pipelined to get data onto the replicas and then an update command is sent by the primary replica to all the others to write the block (this is essentially a commit operation, as it makes the data permanent).
Dropbox: Cloud-based file synchronization
Goal: Provide an Internet service that synchronizes part of a user's file system to remote servers. Any changes are propagated back to any devices that a user uses for synchronization. In this manner, multiple user computers can keep their data synchronized.
Dropbox is one of the first of a set of popular services that provide “cloud-based file storage” – a service that allows you to store data on remote servers and access it from anywhere. Dropbox does this in a manner that is largely transparent to the user. The user designates a directory (aka folder) that will remain in sync with the server. Any local changes will be propagated to the server and any changes from the server will be propagated to the local computer. Multiple computers can thus keep data in sync, with changes on one computer being sent to dropbox servers and then synchronized on other computers.
The Dropbox service started in 2008 and grew rapidly in popularity and, hence, the amount of users and data. It serves as a good case study in scaling a system. Currently (2012), Dropbox handles over a 100 million users synchronizing a billion files each day. One way that Dropbox differs from other data access services such as Twitter, Facebook, Reddit, and others is that those services have an extremely high read to write ratio. Any content that is created is usually read many, many times. With Dropbox, in contrast, the read to write ratio is close to 1. Because the primary use case is file synchronization, remote data is rarely accessed except to synchronize it with other machines. As such, Dropbox has to deal with an inordinately high number of uploads.
Dropbox began life as one server, using a mySQL database to keep track of users and their files. As the server ran out of disk space, all file data was moved over to use Amazon’s S3 service (Simple Storage Service; a web services interface to store and retrieve objects where each object is identified with a unique key). Dropbox ran a database server that stored all the metadata (information about the file) and a server that provided a web portal and interacted with client software. At this time, each client ran a process that would look through a directory to identify changed files and send them to dropbox. In addition, the program polled the server periodically to see if there are any changes that need to be downloaded.
Tens of thousands of servers all polling a server to see if anything changed generates considerable load, so the next design change was to add a notification server. Instead of having a client ask periodically, the notification server sends a message telling the client that there are changes. Since clients may be behind firewalls and it may not be possible for a notification server to connect to them, the notification server relies on having the client establish a TCP connection to it. The single server that handled all requests was also split in two: a metadata server handled everything related to information about users and files while a blockserver, hosted at Amazon, handled everything having to do with reading and writing file data.
As the number of users grew even larger, a two-level hierarchy of notification servers was added since each server could manage only around one million connections. The metadata server, block server, and database were also replicated and load balanced.