| Risk | User corrupts or deletes one or more files |
| Probability | Moderate |
| Impact | Low |
| Discussion |
As described in our
file recovery description,
snapshots of user filesystems are taken regularly.
A file lost within the past two days can typically be restored to its
state within 4 hours of the deletion.
A file lost within the past two weeks can be recovered as it was within
a day of its loss.
For up to two months, the file can be restored to its state as of a
weekly checkpoint.
No backups of general user files are kept for longer than about 2 months, so no restore is possible after that point. |
| Risk | A single disk fails on a data volume |
| Probability | Low |
| Impact | Low |
| Discussion |
On detecting a disk failure, the NetApp automatically replaces the
dead disk with a working spare.
It immediately begins reconstructing the new disk image from the
parity information contained in the RAID volume.
During this reconstruction, the volume remains accessible, but access
is slower.
At the same time, a notification is sent to support at Network Appliance, who will replace the failed disk within the next business day. |
| Risk | Multiple disks fail simultaneously on a data volume |
| Probability | Extremely low |
| Impact | High |
| Discussion |
For this to occur, another disk on the same volume has to fail while a
recovery from a single disk failure on that volume is taking place.
Should this occur, we would have to restore the entire volume from tape, an operation that would take on the order of a day to complete. The volume would be restored to its state as of the last monthly dump to tape. Because tape dumps do not include snapshots (each dump is itself taken from a temporary snapshot), all snapshots of the recovered volume would be lost. Additionally, if the lost volume were the one on which the system resides, the NetApp would be down until that volume could be restored. |
| Risk | Hardware failure of NetApp itself |
| Probability | Extremely low |
| Impact | Extremely high |
| Discussion |
Network Appliance
has built its reputation on reliability.
The hardware is built with redundancy to avoid any single points of
failure.
(For example, there are two hardware paths to each disk.)
Every week, diagnostics are automatically run on the hardware and
disks.
The results are reported to support at Network Appliance.
Should any problem be indicated, Network Appliance will correct it
before it progresses.
Should a hardware failure of the NetApp occur, no data on the NetApp would be available until the hardware is replaced. Our contract specifies that Network Appliance will have replacement hardware on site within 2 hours of a call. Additional downtime would be incurred in locating staff to diagnose the problem and report it to Network Appliance, and in installing the replacement hardware. If the hardware failure corrupted one or more volumes, the recovery procedure for a multiple disk failure would have to be followed for each lost volume. |
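
The snapshot retention tiers described under the first risk (a user corrupting or deleting files) can be summarized as a small lookup. The following is a minimal Python sketch for illustration only and is not part of the NetApp configuration. The function name `restore_granularity`, the tier table, the inclusive boundaries, and the 60-day cutoff standing in for "about 2 months" are all assumptions.

```python
from datetime import timedelta

# Retention tiers as (maximum time since the loss, how closely the restored
# copy matches the lost file). Values follow the discussion of the first risk;
# 60 days is an assumed stand-in for "about 2 months".
RETENTION_TIERS = [
    (timedelta(days=2), "within about 4 hours of the loss"),
    (timedelta(weeks=2), "within about a day of the loss"),
    (timedelta(days=60), "as of a weekly checkpoint"),
]


def restore_granularity(time_since_loss):
    """Return a description of the best available restore point,
    or None if the loss is older than the longest retention tier."""
    for max_age, granularity in RETENTION_TIERS:
        if time_since_loss <= max_age:
            return granularity
    return None  # older than about 2 months: no restore is possible


if __name__ == "__main__":
    for age in (timedelta(hours=5), timedelta(days=10),
                timedelta(days=45), timedelta(days=90)):
        print(age, "->", restore_granularity(age))
```

A result of `None` corresponds to the case where no backup of a general user file is kept any longer.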