Network Attached Storage

Distributed file systems

Paul Krzyzanowski

March 2021

Goal: Allow multiple clients to access files from file servers on a network.

We often need to access data that is stored on other computers. One way to do this is by logging in to that computer with a program like ssh or telnet and accessing it directly from that system. Another way is to download a copy of the files onto our local computer by using a protocol such as sftp or http. These techniques require us to go through an explicit process of accessing that data: we need to know exactly where to connect and need to run programs to connect to the computer and access the data. Ideally, we would like a degree of transparency so we can access these remote files like we access local files. That is the goal of network attached storage, or NAS. Such systems are also referred to as distributed file systems or remote file systems.

Design Considerations

To provide the same system call interface for supporting various local file systems as well as remote files, operating systems generally rely on a layer of abstraction that allows different file system-specific interfaces to coexist underneath the common system calls. On most Unix-derived systems (e.g., Linux, BSD, macOS), this is known as the VFS layer (Virtual File System). This gives us transparency: we can access remote files using the same system call interface as local files, which makes them visible in the local directory tree just as any other mounted file system.

In addition to transparency, we need to consider several other design issues for a remote file access protocol:

Consistency: How can the system ensure that everyone sees the latest version of the file? This is particularly important if files are shared among multiple users and it can be challenging to achieve when these shared files are modified. If files are replicated or cached for performance, the design of the protocol needs to be able to invalidate or update those copies.
Security: The operating system is the gatekeeper of what a process (and user) is or is not allowed to do. It authenticates the user upon logging in and then validates file access permissions for that user. With remote file systems, the local operating system is no longer in full control. Requests must be sent to a remote system over the network. This introduces a whole host of security issues. Should the remote system authenticate and trust the local system? Do users share the same user IDs or names across systems? Who administers file permissions? Can a malicious user on a local system impersonate other users and access their files? Are communications encrypted so message contents cannot be sniffed, modified or replayed?
Reliability: If my computer crashes, all the programs running on it die. If a remote file server crashes, my programs continue to run. The remote file access protocol needs to account for network and system failures.
State: Should the remote system keep track of what local processes are doing with those files? If so, how much information should it track? Any information stored about client activity is called state. Keeping no state is the goal of the design of many web services. All necessary information is sent with each request from the client. By not storing state, failover to another system is easy to implement. You don’t need to worry about losing any information if the server crashes and then recovers. However, not tracking state makes it impossible to coordinate activity between different clients. For instance, a server cannot tell a client that another client has a lock on a file – that’s state. If we do keep state on the server, there’s the question of how little or how much state to store. For example, we might only keep track of which clients have which files open. Or, we might track what chunks of file data each client has cached locally so we can send invalidations if any data gets modified. We might keep track of which clients have which files open in specific modes. For instance, if all clients have files open only for reading then we know there will not be modifications and there will be no need to invalidate cached data at the client.

Service models

There are two basic models for implementing distributed file systems: the download/upload model and the remote access model. In a download/upload model, the entire file is downloaded from the server when a process opens the file. After that, the process accesses the data just like a local file – because it is. When the process closes the file, this local copy is uploaded to the server if there were changes. It’s an attractive model because the file access protocol only needs to implement the open and close operations; everything else is local. There are challenges with this model. It is not efficient if a client only needs to access a small part of the file; the entire file will be downloaded regardless of how much of it is needed. The model doesn’t adapt well to small clients with limited storage space: how do you open a 200 GB database file when you only have 128 GB of space on your local file system? Most problematic is the concurrent access problem. If two or more users are modifying the same file, they are doing so to their private copies of it. Neither sees the other’s changes and the last one to close the file will upload the changes, overwriting the other user’s version.

Sequential semantics are what we commonly expect to see in file systems, where reads return the results of previous writes. Session semantics occur when an application owns the file for the entire access session, writing the contents of the file only upon close, thereby making the updates visible to others after the file is closed, and overwriting any modifications made by others prior to that. We are likely to see session semantics with this model.
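
The lost-update behavior of session semantics can be illustrated with a minimal Python sketch of the download/upload model. The FileServer and OpenFile classes are hypothetical names made up for this illustration; they do not correspond to any real protocol.

```python
class FileServer:
    """Toy server for the download/upload model: it only stores whole files."""
    def __init__(self):
        self.files = {}

    def download(self, name):
        return self.files.get(name, "")

    def upload(self, name, contents):
        self.files[name] = contents


class OpenFile:
    """A client's private copy, fetched on open and uploaded on close."""
    def __init__(self, server, name):
        self.server, self.name = server, name
        self.data = server.download(name)   # whole-file download on open
        self.dirty = False

    def append(self, text):
        self.data += text                   # purely local; server not involved
        self.dirty = True

    def close(self):
        if self.dirty:                      # upload only if modified
            self.server.upload(self.name, self.data)


server = FileServer()
server.upload("notes.txt", "base\n")

a = OpenFile(server, "notes.txt")
b = OpenFile(server, "notes.txt")
a.append("from A\n")
b.append("from B\n")
a.close()
b.close()   # last close wins: A's upload is overwritten

result = server.download("notes.txt")
```

Neither client saw the other's change while the file was open, and the final contents hold only B's modification.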

With a remote access model, the remote file service provides a functional interface to operations that can be performed on files. Such a service would provide commands to read a range of bytes from a file, write a range of bytes to a file, perhaps lock a region of a file, delete a file, and so on. The advantage of this model is that the client can ask for exactly what it needs and not more. It also gives the server the ability to get the latest updates from clients so it can present a consistent up-to-date view of all the latest changes. The downside of the remote access model is that clients must interact with the server throughout the duration of accessing the files. They also will incur the latency of making requests that must go through two operating systems and a network.
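
As a rough sketch of what a functional interface might look like, the following Python class exposes byte-range read and write operations. RemoteFileService and its method names are assumptions made for illustration; a real service would receive these requests over the network rather than as local calls.

```python
class RemoteFileService:
    """Toy remote-access interface: clients name a file and a byte range."""
    def __init__(self):
        self.files = {}   # file name -> bytearray

    def write(self, name, offset, data):
        buf = self.files.setdefault(name, bytearray())
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))   # pad any hole with zeros
        buf[offset:offset + len(data)] = data
        return len(data)

    def read(self, name, offset, count):
        return bytes(self.files[name][offset:offset + count])

    def remove(self, name):
        del self.files[name]


svc = RemoteFileService()
svc.write("report.txt", 0, b"hello world")
svc.write("report.txt", 6, b"there")      # overwrite only bytes 6..10
tail = svc.read("report.txt", 6, 5)       # fetch only the bytes that are needed
whole = svc.read("report.txt", 0, 11)
```

Every operation is a request to the server, which is why the server always holds the latest data and also why each access pays the network latency.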

Caching

Caching is used throughout computing as a technique for bringing frequently-used data into a place where it can be accessed more quickly. It’s also a place where data may be stored temporarily to enable more efficient transfers to its ultimate location. For example, the operating system maintains a buffer cache to hold disk blocks. This serves several purposes. First, it provides a place to read blocks of data from the disk (you cannot read individual bytes) even if the process only needs a few bytes. Secondly, it stores the data in memory so it can be accessed more quickly if it is needed repeatedly. Thirdly, it allows output data to be accumulated into block-size chunks and for the operating system to delay writes to the disk.

For network attached storage, file access protocols may manage caching in similar ways. Cache management options include:

write-through: To enable other clients to read data that a process modified, the system with the modified data needs to update the server. Write-through caching means that we can keep data cached but any modifications get sent to the server immediately. This on its own does not solve the consistency problem. Other clients will either need to check for modifications at the server before they can access their own cache (possibly negating the value of caching) or the server will need to keep state so that it can invalidate client caches if there’s an update.
delayed writes (write-behind): Sending modifications to the server immediately is great for keeping the version of the file at the server up to date. Unfortunately, it can be horribly inefficient. Think of the situation where a program appends to a file a byte at a time. We will be sending a stream of single byte updates to the server. It would be more efficient to wait a while and see if there are more updates in the hope of sending a single larger packet instead of many small ones. The tradeoff is that file access semantics can become ambiguous since clients that read data from the server will not always get the latest updates.
write on close: Propagating all changes when a process closes the file can be highly efficient. At this time, we have all the modifications to the file and can send them efficiently without the concern of sending repeated updates or small chunks. Doing this, however, is an admission that we are using session semantics – where the last process to close the file gets to overwrite any other changes to the file.
read-ahead (prefetch): Just like sending small chunks of data from the client to the server is inefficient, so is receiving small chunks from the server. The basic form of read-ahead is to request a larger block of data even if the process needs to read only a few bytes. The assumption is that there is a good chance the process will soon need to read more data (most processes read files from the first byte of the file to the last) and it is more efficient to send an entire block in one packet rather than sending multiple packets with smaller chunks of data. The more ambitious form of read-ahead is to request successive blocks from the file before they are needed, again under the assumption that most files are read sequentially.
centralized control: The most sophisticated form of caching is having the server keep track of how processes can access files (read-only, read-write, write-only), whether they have exclusive access to byte ranges of a file, and what data they have cached. This can be a significant amount of state to track, particularly in large environments, but it allows the server to send targeted invalidation messages whenever it gets updates of specific byte ranges of a file.
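
The cost difference between write-through and delayed writes for the byte-at-a-time case can be sketched in Python by counting messages to the server. The class names and the 4096-byte flush threshold are assumptions chosen for illustration; real systems typically also flush on a timer.

```python
class CountingServer:
    """Toy server that records how many update messages it received."""
    def __init__(self):
        self.data = bytearray()
        self.messages = 0

    def append(self, chunk):
        self.messages += 1
        self.data += chunk


class WriteThroughClient:
    def __init__(self, server):
        self.server = server

    def append(self, chunk):
        self.server.append(chunk)          # every modification is sent at once

    def flush(self):
        pass                               # nothing is ever pending


class WriteBehindClient:
    def __init__(self, server, threshold=4096):
        self.server, self.threshold = server, threshold
        self.pending = bytearray()

    def append(self, chunk):
        self.pending += chunk              # accumulate locally
        if len(self.pending) >= self.threshold:
            self.flush()

    def flush(self):
        if self.pending:
            self.server.append(bytes(self.pending))
            self.pending.clear()


wt_server, wb_server = CountingServer(), CountingServer()
wt, wb = WriteThroughClient(wt_server), WriteBehindClient(wb_server)
for _ in range(10_000):                    # append one byte at a time
    wt.append(b"x")
    wb.append(b"x")
wb.flush()                                 # e.g., when the file is closed
```

Both servers end up with the same 10,000 bytes, but the write-through client sent 10,000 messages while the write-behind client sent three. Until a flush happens, other clients reading from the write-behind server see stale data.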

NFS

NFS was designed as a stateless, RPC-based model implementing commands such as read bytes, write bytes, link files, create a directory, and remove a file. Since the server does not maintain any state, there is no need for remote open or close procedures: these only establish state on the client. NFS works well in faulty environments: there’s no state to restore if a client or server crashes. To improve performance, a client reads data a block (8 KB by default) at a time and performs read-ahead (fetching future blocks before they are needed). NFS suffers from ambiguous semantics because the server (or other clients) has no idea what blocks the client has cached and the client does not know whether its cached blocks are still valid. NFS uses validation, where the client compares modification times from server requests to the times of data that it cached. However, it does this only if there are file operations to the server. Otherwise, the client simply invalidates the blocks after a few seconds. In NFS’s original stateless design, file locking could not be supported since that would require the server to keep state. It was later added through a separate lock manager that maintained the state of locks.
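
The validation approach and the stale reads it permits can be sketched as follows. This is a simplified illustration, not the actual NFS client logic: the class names, the get_mtime call, and the 3-second freshness window are assumptions, and time is passed in explicitly so the example is deterministic.

```python
class Server:
    """Stateless toy server: it stores blocks and a per-file modification time."""
    def __init__(self):
        self.blocks = {}     # (name, block_no) -> bytes
        self.mtime = {}      # name -> modification time

    def write_block(self, name, block_no, data, now):
        self.blocks[(name, block_no)] = data
        self.mtime[name] = now

    def get_mtime(self, name):
        return self.mtime[name]

    def read_block(self, name, block_no):
        return self.blocks[(name, block_no)], self.mtime[name]


class CachingClient:
    def __init__(self, server, freshness=3.0):
        self.server, self.freshness = server, freshness
        self.cache = {}      # (name, block_no) -> (data, mtime, checked_at)

    def read_block(self, name, block_no, now):
        key = (name, block_no)
        if key in self.cache:
            data, mtime, checked_at = self.cache[key]
            if now - checked_at < self.freshness:
                return data                        # trusted without asking the server
            if self.server.get_mtime(name) == mtime:
                self.cache[key] = (data, mtime, now)   # revalidated
                return data
        data, mtime = self.server.read_block(name, block_no)
        self.cache[key] = (data, mtime, now)       # fetch a fresh copy
        return data


server = Server()
client = CachingClient(server)
server.write_block("f", 0, b"v1", now=0.0)
first = client.read_block("f", 0, now=1.0)
server.write_block("f", 0, b"v2", now=2.0)         # another client's update
stale = client.read_block("f", 0, now=2.5)         # still within the window
fresh = client.read_block("f", 0, now=5.0)         # window expired: revalidate
```

The read at time 2.5 returns old data even though the server already holds the update, which is the ambiguity described above.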

To facilitate larger deployments, NFS introduced the automounter. It was common to have an environment with many clients, each mounting many remote file systems. In such an environment, if all clients start up at approximately the same time, they can flood the server with mount requests. The automounter mounts remote directories only when they are first accessed. To make keeping track of mount points easier across many machines, automounter maps are configuration files that define what remote directories get mounted. These can be distributed across a set of clients.
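
The mount-on-first-access idea can be sketched in a few lines of Python. The Automounter class, the map contents, and the fake_mount function are hypothetical stand-ins; a real automounter hooks into the operating system and issues actual mount requests.

```python
class Automounter:
    """Mounts a remote directory only when a path under it is first accessed."""
    def __init__(self, automount_map, mount_fn):
        self.map = automount_map      # mount point -> remote location
        self.mount_fn = mount_fn
        self.mounted = {}

    def access(self, path):
        for mount_point, remote in self.map.items():
            if path == mount_point or path.startswith(mount_point + "/"):
                if mount_point not in self.mounted:
                    self.mounted[mount_point] = self.mount_fn(remote, mount_point)
                return self.mounted[mount_point]
        raise FileNotFoundError(path)


mount_calls = []

def fake_mount(remote, mount_point):
    """Stand-in for a real mount request; it just records the call."""
    mount_calls.append((remote, mount_point))
    return remote


auto_map = {
    "/home/alice": "server1:/export/home/alice",
    "/proj/data": "server2:/export/data",
}
am = Automounter(auto_map, fake_mount)
am.access("/home/alice/notes.txt")    # triggers the mount
am.access("/home/alice/todo.txt")     # already mounted: no new request
```

Only one mount request is issued, and /proj/data is never mounted because nothing under it was accessed.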

AFS

AFS was designed as an improvement over NFS to support file sharing on a massive scale. NFS suffered because clients would never cache data for a long time (not knowing if it would become obsolete) and had to frequently contact the server. AFS introduced the use of a partition on a client’s disk to cache large amounts of data for a long time: long-term whole-file caching. It supports a file download-upload model. The entire file is downloaded on first access (whole file download) and uploaded back to the server after a close only if it was modified. Because of this behavior, AFS provides session semantics: the last one to close a modified file wins and other changes (earlier closes) are lost.

During file access, the client need never bother the server: it already has the file. When a client first downloads a file, the server makes a callback promise: it maintains a list of each client that has downloaded a copy of a certain file. Whenever it gets an update for that file, the server goes through the list and sends a callback to each client that may have a cached copy so that it can be invalidated on the client. The next time the client opens that file, it will download it from the server. Files under AFS are shared in units called volumes. A volume is just a directory (with its subdirectories and files) on a file server that is assigned a unique ID among the cell of machines (remember cells from DCE RPC?). If an administrator decides to move the volume to another server, the old server can issue a referral to the new server. This allows the client to remain unaware of resource movement.
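
The callback promise can be sketched as a server-side list of clients per file. The class and method names here are made up for illustration and do not reflect the actual AFS interfaces.

```python
class AFSServer:
    def __init__(self):
        self.files = {}
        self.callbacks = {}   # file name -> set of clients promised a callback

    def fetch(self, client, name):
        self.callbacks.setdefault(name, set()).add(client)   # callback promise
        return self.files[name]

    def store(self, writer, name, contents):
        self.files[name] = contents
        for client in self.callbacks.get(name, set()):
            if client is not writer:
                client.break_callback(name)    # tell the client to invalidate
        self.callbacks[name] = {writer}        # only the writer's copy is current


class AFSClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}       # whole files cached locally
        self.fetches = 0

    def open(self, name):
        if name not in self.cache:             # download only if not cached
            self.cache[name] = self.server.fetch(self, name)
            self.fetches += 1
        return self.cache[name]

    def close(self, name, new_contents=None):
        if new_contents is not None:           # upload only if modified
            self.cache[name] = new_contents
            self.server.store(self, name, new_contents)

    def break_callback(self, name):
        self.cache.pop(name, None)


server = AFSServer()
server.files["doc"] = "v1"
a, b = AFSClient(server), AFSClient(server)
a.open("doc")
a.open("doc")                  # served from the local cache
fetches_before = a.fetches
b.open("doc")
b.close("doc", "v2")           # server sends a callback to a
after = a.open("doc")          # a must download the file again
```

Client a contacts the server only twice: once for the initial download and once after its cached copy was invalidated.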

Coda

Coda was built on top of AFS and focused on two things: supporting replicated servers and disconnected operation. To support replicated storage, AFS’s concept of a volume was expanded into that of a Volume Storage Group (VSG). Given that some volumes may be inaccessible at a particular point in time, the Accessible Volume Storage Group (AVSG) refers to the subset of the VSG that the client can currently access. Before a client accesses a file, it first looks up the replicated volume ID of the file to get the list of servers containing replicated volumes and the respective local volume IDs. While it can read files from any available server, it first checks the versions from all of them to ensure that one or more servers don’t have out-of-date files. If the client discovers that a server has an old version of a file, it initiates a resolution process by sending a message to that server, telling it to update its old versions. When a client closes a file, if there were any modifications, the changes are written out to all available replicated volumes.
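
The version check on open and the write to all available replicas on close might be sketched as follows. This is a heavily simplified illustration: replicas are plain dictionaries, a single integer stands in for Coda's version information, and the function names are assumptions.

```python
def open_replicated(name, avsg):
    """avsg: list of replicas, each a dict of file name -> (version, contents)."""
    latest = max(avsg, key=lambda replica: replica[name][0])
    version, contents = latest[name]
    stale = [r for r in avsg if r[name][0] < version]
    for replica in stale:
        # Stand-in for the resolution message telling a server to update.
        replica[name] = (version, contents)
    return contents, len(stale)


def close_replicated(name, contents, avsg):
    """Write a modified file to every replica that is currently accessible."""
    new_version = max(r[name][0] for r in avsg) + 1
    for replica in avsg:
        replica[name] = (new_version, contents)


s1 = {"f": (3, "new")}
s2 = {"f": (2, "old")}       # this replica missed an update
s3 = {"f": (3, "new")}
avsg = [s1, s2, s3]

contents, resolved = open_replicated("f", avsg)
close_replicated("f", "newer", avsg)
```

The open detects that one replica is out of date and brings it up to the latest version before the client uses the file.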

If no servers are available for access, the client goes into disconnected operation mode. In this mode, no attempt is made to contact the server and any file updates are logged instead in a client modification log (CML). Upon reconnection, the client plays back the log to send updated files to the servers and receive invalidations. If conflicts arise (e.g., the file may have been modified on the server while the client was disconnected) user intervention may be required.
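
Logging updates while disconnected and replaying them later can be sketched as shown here. The class names, the integer version check used to detect conflicts, and the rule of keeping only the latest update per file are simplifying assumptions rather than Coda's actual design.

```python
class CodaServer:
    def __init__(self):
        self.files = {}      # name -> (version, contents)

    def store(self, name, contents, base_version):
        version, _ = self.files.get(name, (0, None))
        if version != base_version:
            return False     # file changed on the server in the meantime
        self.files[name] = (version + 1, contents)
        return True


class DisconnectedClient:
    def __init__(self, cache):
        self.cache = dict(cache)   # name -> (version, contents)
        self.cml = []              # log of (name, contents, base_version)

    def write(self, name, contents):
        version, _ = self.cache.get(name, (0, None))
        self.cache[name] = (version, contents)
        # Keep only the latest update per file (a simplification).
        self.cml = [rec for rec in self.cml if rec[0] != name]
        self.cml.append((name, contents, version))

    def reintegrate(self, server):
        """Replay the log; return the files that need user intervention."""
        conflicts = []
        for name, contents, base_version in self.cml:
            if not server.store(name, contents, base_version):
                conflicts.append(name)
        self.cml.clear()
        return conflicts


server = CodaServer()
server.files = {"a.txt": (1, "a1"), "b.txt": (1, "b1")}
client = DisconnectedClient(server.files)

client.write("a.txt", "a2")                  # logged while disconnected
client.write("b.txt", "b2")
client.write("b.txt", "b3")
server.store("a.txt", "a-other", 1)          # someone else updates a.txt

conflicts = client.reintegrate(server)
```

The update to b.txt is applied cleanly, while a.txt is reported as a conflict because the server's copy changed during the disconnection.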

Because there’s a chance that users may need to access files that are not yet in the local cache, Coda supports hoarding, which is a term for user-directed caching. It provides a user interface that allows a user to look over what is already in the cache and bring additional files into the cache if needed. The hoard database is the list of which files are cached locally.

DFS

AFS evolved over the years. AFS version 3 was modified to become the recommended distributed file system in the Distributed Computing Environment (DCE). This was named the Distributed File System (DFS).

The primary design goal of this system was to avoid the unpredictable lost data problems of session semantics if multiple clients are modifying the same file. The concept of tokens was introduced. A token is permission given by the server to the client to perform certain operations on a file and cache a file’s data. The system has four classes of tokens: open, data, status, and lock tokens. An open token must be obtained to have permission to open a file. A read data token must be obtained for a byte range of a file to have permission to access that part of the file. Similarly, a write data token is needed to write the file. Status tokens tell the client that it may be able to cache file attributes. These tokens give the server control over who is doing what to a file. Tokens are granted and revoked by the server. For example, if one client needs to write to a file then any outstanding read and write data tokens that were issued to any clients for that byte range get revoked: those clients are now denied access until they get new tokens.
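
Byte-range data tokens and their revocation can be sketched with a small table on the server. TokenServer is a hypothetical class, and the conflict rule used here (overlapping ranges held by different clients where at least one side is a write) is an assumption made for this illustration.

```python
class TokenServer:
    def __init__(self):
        self.tokens = []   # (client, mode, start, end); end is exclusive

    def request(self, client, mode, start, end):
        """Grant a data token and return the tokens revoked to make room."""
        revoked, kept = [], []
        for tok in self.tokens:
            holder, held_mode, s, e = tok
            overlaps = s < end and start < e
            conflict = (overlaps and holder != client
                        and "write" in (mode, held_mode))
            (revoked if conflict else kept).append(tok)
        kept.append((client, mode, start, end))
        self.tokens = kept
        return revoked


server = TokenServer()
r1 = server.request("A", "read", 0, 100)
r2 = server.request("B", "read", 50, 150)     # readers can share a range
r3 = server.request("C", "write", 80, 120)    # conflicts with both readers
r4 = server.request("D", "write", 200, 300)   # disjoint range: no conflict
```

After C's request, clients A and B no longer hold tokens and would have to ask the server again before touching that part of the file.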

SMB

Microsoft’s Server Message Block protocol was designed as a connection-oriented, stateful file system with a priority on file consistency and support of locking rather than client caching and performance. While it does not use remote procedure calls, its access principle is the same: requests (message blocks) are functional messages, providing file access commands such as open, create, rename, read, write, and close.

With the advent of Windows NT 4.0 and an increasing need to provide improved performance via caching, Microsoft introduced the concept of opportunistic locks (oplocks) into the operating system. This is a modified form of DFS’s tokens. An oplock tells the client how it may cache data. It allows clients to cache information without worrying about changes to the file at the server. At any time, a client’s oplock may be revoked or changed by the server. The mechanism has been extended since Windows 7 and is generally referred to as leases. There are currently eight oplocks (including leases; do not memorize these but have an understanding of what they do):

A level 1 oplock (exclusive oplock) provides the client with exclusive access to the file (nobody else is reading or writing it), so it can cache lock information and file attributes and perform read-aheads and write-behinds.
A level 2 oplock (shared oplock) is granted in cases where multiple processes may read a file and no processes are writing to it.
A batch oplock is also exclusive and allows a client to keep a file open on the server even if the local process using it closed the file. This optimizes cases where a process repeatedly opens and closes the same files (e.g., batch script execution).
A filter oplock is exclusive and allows applications that hold the oplock to be preempted whenever other processes or clients need to access the file.
A read oplock (R) is a shared oplock that is essentially the same as a level 2 oplock. It supports read caching.
A read-handle oplock (RH) allows multiple readers and keeps the file open on the server even if the client process closes it. It is similar to the batch oplock but is shared and does not support file modifications. It supports read caching and handle caching.
A read-write oplock (RW) gives a client exclusive access to the file and supports read and write caching. It is essentially the same as the level 1, or exclusive, oplock.
A read-write-handle oplock (RWH) enables a client to keep a file open on the server even if the client process closes it. It is exclusive and similar to the batch oplock.

The last four oplocks have been added since Windows 7 and are somewhat redundant with earlier mechanisms.

Oplocks may be revoked (broken) by the server. For example, if Process A has a read-write oplock, it can cache all read and write operations, knowing that no other client is modifying or reading that file. If another client, Process B, opens the file for reading, a conflict is detected. Process B is suspended and Process A is informed of the oplock break. This gives Process A a chance to send any cached changes to the server before Process B resumes execution.
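
The oplock break sequence can be sketched for a single file as follows. The Server and Client classes are hypothetical, and the sketch assumes the break downgrades the holder to a shared read oplock; real SMB exchanges asynchronous break notifications and acknowledgments.

```python
class Server:
    """Toy SMB-like server for one file."""
    def __init__(self):
        self.contents = ""
        self.opened = []
        self.exclusive = None        # client holding a read-write oplock

    def open(self, client):
        if not self.opened:
            self.opened.append(client)
            self.exclusive = client
            return "RW"              # first opener gets exclusive caching rights
        if self.exclusive is not None:
            # The new open waits until the holder has handled the break.
            self.exclusive.oplock_break(self)
            self.exclusive = None
        self.opened.append(client)
        return "R"

    def write(self, text):
        self.contents += text


class Client:
    def __init__(self):
        self.level = None
        self.pending = []            # writes cached under an RW oplock

    def open(self, server):
        self.level = server.open(self)

    def write(self, server, text):
        if self.level == "RW":
            self.pending.append(text)    # cached: server is not contacted
        else:
            server.write(text)

    def oplock_break(self, server):
        for text in self.pending:        # flush cached changes first
            server.write(text)
        self.pending.clear()
        self.level = "R"                 # downgraded to a shared oplock


server = Server()
a, b = Client(), Client()
a.open(server)
a.write(server, "x")
a.write(server, "y")
before_break = server.contents       # the server has not seen a's writes yet
b.open(server)                       # triggers the break on a
after_break = server.contents
```

Client b's open completes only after a's cached writes have reached the server, so b reads up-to-date contents.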

DFS Namespaces

Microsoft added a separate component to SMB file sharing called DFS Namespaces. DFS here also stands for Distributed File System, but it is not related to the DCE DFS described earlier, although the two have concepts in common.

DFS Namespaces run on a Namespace Server that tracks which shared volumes live on which SMB server. The collection of shared volumes is arranged by an administrator to present a single hierarchical file system to all clients. Clients contact the DFS Namespace server and don’t need to be aware of the individual servers that serve the files. DFS Namespaces are not a separate remote file system protocol; files are accessed through SMB.

DFS namespaces provide location transparency. Users do not have to be aware of which servers they need to contact for resources. DFS also allows a single node in the directory tree to point to multiple replicated read-only volumes, similar to the mechanism in AFS. This is designed for fault tolerance and load balancing but does not support file modifications.
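
Conceptually, the namespace server maps a path in the unified tree to one or more SMB shares. The following Python sketch shows that lookup as a longest-prefix match; the server names, share names, and the resolve function are invented for this example and are not the actual DFS referral protocol.

```python
# Hypothetical namespace: a folder in the unified tree -> SMB shares holding it.
namespace = {
    r"\\corp\public\tools": [r"\\fs1\tools", r"\\fs2\tools"],   # replicated
    r"\\corp\public\docs": [r"\\fs3\docs"],
}


def resolve(path, namespace):
    """Return the SMB target paths for a namespace path (longest prefix wins)."""
    best = None
    for prefix in namespace:
        if path == prefix or path.startswith(prefix + "\\"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(path)
    return [target + path[len(best):] for target in namespace[best]]


targets = resolve(r"\\corp\public\tools\bin\make.exe", namespace)
docs = resolve(r"\\corp\public\docs\readme.txt", namespace)
```

The client only ever names the namespace path; which file server actually holds the data is decided by the lookup, and a replicated folder yields several candidate targets.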

SMB 2 and beyond

The SMB protocol was known to be chatty. Common tasks often required several round-trip messages. It was originally designed for LANs and did not perform optimally either on wide area networks (WANs) or on today’s high-speed LANs (1–100 Gbps). The SMB protocol was dramatically cleaned up with the introduction of Microsoft’s Windows Vista operating system (SMB 2), with minor changes added in Windows 7 (SMB 2.1) and even more in Windows 8 (SMB 3). Apple dropped its proprietary AFP protocol in favor of SMB 2 in OS X 10.9 (Mavericks). We will focus our discussion on the significant changes introduced in SMB 2.

SMB 2 made seven key changes to its design:

Reduced complexity: The set of SMB commands went from over 100 commands down to 19.
Pipelining: Pipelining is the ability to send additional commands before the response to a prior command is received. Traditionally, SMB (and any RPC-based system) would have to wait for one command to complete before sending an additional command. The goal of pipelining is to keep more data in flight and use more of the available network bandwidth.
Credit-based flow control: Credit-based flow control lets the server protect itself from an overly high rate of client requests. The server initially grants each client a small number of credits and later increases this number as needed. The client needs credits to send a message to the server; each time it sends a message, it decrements its credit balance. This allows the server to control the rate of messages from any client and avoid buffer overflow. Note that TCP implements congestion control, but this results in data loss and wild oscillations in traffic intensity (TCP keeps increasing its transmission window size until packet loss occurs; then it cuts the window size in half and starts increasing again).
Compounding: Compounding is similar to pipelining but now allows multiple commands to be sent in one message. It avoids the need to optimize the system by creating commands that combine common sets of operations. Instead, one can send an arbitrary set of commands in one request. For instance, instead of the old SMB RENAME command, the following set of commands is sent: CREATE (create new file or open existing); SET_INFO; CLOSE. Compounding reduces network time because multiple requests can be placed within one message.
Larger read/write sizes: Fast networks can handle larger packet sizes and hence transfer larger read and write buffers more efficiently.
Improved caching: SMB 2 improved its ability to cache folder and file properties. This avoids messages to the server to retrieve these properties.
Durable handles: Previously, if there was a temporary network disconnection, an SMB client would lose its connection to the server and have to reestablish all connections, remount all file systems, and reopen all files. With SMB 2, the connection has to be reestablished but all handles to open files will remain valid.
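
Credit-based flow control, combined with pipelining, can be sketched as a simple counter on the client. The class names, the initial grant of two credits, and the way credits are returned with each response are assumptions for illustration; the actual SMB 2 credit algorithm is more involved.

```python
class CreditServer:
    def __init__(self, initial_credits=2):
        self.initial_credits = initial_credits
        self.handled = 0

    def connect(self):
        return self.initial_credits      # small initial grant

    def handle(self, request, grant=1):
        self.handled += 1
        return grant                     # credits handed back with the response


class CreditClient:
    def __init__(self, server):
        self.server = server
        self.credits = server.connect()
        self.in_flight = []              # pipelined requests awaiting responses

    def send(self, request):
        if self.credits == 0:
            return False                 # must wait for a response first
        self.credits -= 1                # each message costs one credit
        self.in_flight.append(request)
        return True

    def receive_response(self, grant=1):
        request = self.in_flight.pop(0)
        self.credits += self.server.handle(request, grant)


server = CreditServer(initial_credits=2)
client = CreditClient(server)
sent = [client.send("r1"), client.send("r2"), client.send("r3")]
client.receive_response(grant=2)         # server raises the client's allowance
sent_after = [client.send("r3"), client.send("r4"), client.send("r5")]
```

Two requests are in flight before any response arrives (pipelining), but the third is held back until the server returns credits, so the server decides how fast each client may send.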

NFS version 4

While NFS version 3 is still widely used, NFS version 4 introduced significant changes and is a departure from the classic stateless design of NFS. The server is now stateful and is able to control client-side cache coherence better, allowing clients to cache data for a longer time. Servers can grant clients the ability to do specific actions on a file to enable more aggressive client caching. This is similar to SMB’s oplocks. Because the server is stateful, NFS has acquired open and close operations.

Callbacks were added so the server can notify a client when file or directory contents have changed.

Like SMB 2, NFS now supports compound RPC, where multiple operations can be grouped together into a single request. Sending one message instead of a series of messages reduces overall round-trip time significantly.
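
A back-of-the-envelope calculation shows why grouping operations helps. The function name and the numbers used (a 20 ms round trip, 1 ms of server work per operation, three operations) are illustrative assumptions, not measurements.

```python
def total_latency_ms(num_ops, rtt_ms, per_op_server_ms, compound):
    """Estimate the elapsed time for a sequence of dependent operations."""
    round_trips = 1 if compound else num_ops   # one message vs. one per operation
    return round_trips * rtt_ms + num_ops * per_op_server_ms


# Three operations, e.g. look up a name, open the file, read its first block.
separate = total_latency_ms(3, rtt_ms=20, per_op_server_ms=1, compound=False)
compounded = total_latency_ms(3, rtt_ms=20, per_op_server_ms=1, compound=True)
```

Under these assumptions the separate requests take 63 ms and the compound request takes 23 ms; the saving grows with the network round-trip time.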

NFSv4 also added strong authentication and encryption and supports file system replication and migration. This includes a mechanism of sending referrals similar to that used by AFS.

References

https://barreto.home.blog/2008/12/09/smb2-a-complete-redesign-of-the-main-remote-file-protocol-for-windows/