LCSR File System Risk Assessment

Updated October 14, 2003

NetApp File Server

The NetApp (koko) provides fast, reliable storage via RAID 4 technology. Errors on disks can be corrected "on the fly." Any single failed disk is replaced automatically from a spare disk pool, and the data on the lost disk is reconstructed from the parity information on the RAID volume. The volumes also employ a snapshot facility, which serves as an online backup for corrupted or deleted files.
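The parity reconstruction mentioned above can be illustrated with a minimal sketch. It assumes the usual RAID 4 scheme, in which a dedicated parity disk holds the bytewise XOR of the data disks. The disk contents and the helper name xor_blocks are made up for illustration and do not reflect how the NetApp lays out data internally.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, each holding one two-byte block of a stripe.
data_disks = [b"\x0f\xa0", b"\x33\x0c", b"\x55\x11"]

# The dedicated parity disk stores the XOR of all data blocks in the stripe.
parity = xor_blocks(data_disks)

# Simulate losing the second disk, then rebuild it from the survivors and parity.
survivors = [data_disks[0], data_disks[2]]
rebuilt = xor_blocks(survivors + [parity])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block. The same property explains why a second failure on the same volume during reconstruction cannot be recovered from parity alone.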

Risk Assessment

Risk User corrupts or deletes one or more files
Probability Moderate
Impact Low
Discussion As described in our file recovery description, snapshots are regularly taken of user filesystems. A file lost within the past two days can typically be restored to its state within 4 hours of the deletion. Within two weeks, the file can be recovered as it was within a day of its loss. For two months, the file can be restored to its state as of a weekly checkpoint.
No backups for general user files are kept for longer than about 2 months, so no restore is possible after that point.
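The recovery windows described above can be summarized as a small lookup. This is only a sketch of the schedule as stated in this document: the function name is hypothetical, "about 2 months" is approximated as 60 days, and the actual snapshot schedule on the NetApp may differ.

```python
from datetime import timedelta

def recovery_granularity(time_since_loss):
    """Return the approximate spacing of available snapshots for a file lost
    this long ago, or None if no restore is possible."""
    if time_since_loss <= timedelta(days=2):
        return timedelta(hours=4)   # typically within 4 hours of the deletion
    if time_since_loss <= timedelta(days=14):
        return timedelta(days=1)    # within a day of the loss
    if time_since_loss <= timedelta(days=60):  # assumption: "about 2 months"
        return timedelta(weeks=1)   # weekly checkpoints
    return None                     # no backups are kept this long
```

For example, a file deleted five days ago falls in the second window, so the restored copy may be up to a day older than the version that was lost.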

Risk A single disk fails on a data volume
Probability Low
Impact Low
Discussion On detecting a disk failure, the NetApp automatically replaces the dead disk with a working spare. It immediately begins reconstructing the new disk image from the parity information contained in the RAID volume. During the course of this reconstruction, access to the volume will still be available but will be slower.
At the same time, a notification is sent to support at Network Appliance, who will replace the failed disk within the next business day.

Risk Multiple disks fail simultaneously on a data volume
Probability Extremely low
Impact High
Discussion For this to occur, another disk on the same volume has to fail while a recovery from a single disk failure on that volume is taking place.
Should this occur, we would have to restore the entire volume from tape. This operation would take on the order of a day to complete. In this case, the volume would be restored to its state as of the last monthly dump to tape. Since tape dumps do not dump snapshots (the dumps themselves are actually done of a temporary snapshot), all snapshots of the recovered volume would be lost. Additionally, if the volume lost was the volume on which the system resides, the NetApp would be down until the volume could be restored.

Risk Hardware failure of NetApp itself
Probability Extremely low
Impact Extremely high
Discussion Network Appliance has built its reputation on reliability. The hardware is built with redundancy to avoid any single points of failure. (For example, there are two hardware paths to each disk.) Every week, diagnostics are automatically run on the hardware and disks. The results are reported to support at Network Appliance. Should any problems be indicated, Network Appliance will correct the problem before it progresses.
Should a hardware failure of the NetApp occur, none of the data on the NetApp would be available until the hardware is replaced. Our contract specifies that Network Appliance will have replacement hardware on site within 2 hours of a call. Additional downtime would be incurred in locating staff to diagnose the problem and report it to Network Appliance, and in installing the replacement hardware.
If the hardware failure corrupted one or more volumes, the multiple disk failure recovery procedure would have to be followed for each lost volume.