LCSR Backup Policies and Procedures

Updated October 14, 2003

LCSR manages over xxx TB of disk space within DCS and for some associated departments (e.g., Math and DIMACS). Our goal is to provide a reasonable method of recovery for files, file systems, and machines should problems occur.

In general, we back up system filesystems on critical servers and designated user filesystems (e.g., user home directories and specially allocated filesystems for specific groups). For non-critical servers and desktop machines, we feel that reinstalling the machine is the easiest recovery method in case of disk failure. File space accessible to users on those machines, while good for high-speed local access, is regarded as temporary and is not backed up.

Backups to tape

Until 2003, virtually all our backups were done to tape. While we still back up some files to tape, we are in the process of moving many user files to our NetApp, which uses snapshots. (See below.)

Suns

Historically, backups on Suns have been done to tape. Once a week (the specific day varying between machines to balance load on the tape drives) a "full" backup is done of the backed-up filesystems. On each of the remaining days of the week, "incremental" dumps (of files changed since the last "full") are run. In case of disk failure, the filesystem could be restored to the state of the most recent "incremental" backup by restoring a "full" backup to an empty partition, then restoring the most recent "incremental" on top of it. User file restore requests could be performed from the most recent "full" backup if the file had not been modified recently, or from the most recent "incremental" if the file was being actively modified before it was corrupted or deleted.
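The restore rule above can be sketched in a few lines of Python. This is only an illustration, not the scripts actually used on the Suns: the function name dumps_to_restore and its parameters are hypothetical, and it assumes the dump for last_dump_day completed successfully.

```python
from datetime import date, timedelta

def dumps_to_restore(last_dump_day: date, full_weekday: int):
    """Pick the dumps needed to rebuild a filesystem after a disk failure.

    last_dump_day: date of the most recent completed dump (assumed to exist).
    full_weekday:  weekday on which this machine's "full" runs, using
                   Python's convention (Monday == 0 ... Sunday == 6).
    Returns (full_date, incremental_date); incremental_date is None when
    the most recent dump is itself the "full".
    """
    # Walk back to the most recent day on which a "full" ran.
    days_since_full = (last_dump_day.weekday() - full_weekday) % 7
    full_date = last_dump_day - timedelta(days=days_since_full)
    # Each "incremental" holds every file changed since the last "full",
    # so only the latest one has to be restored on top of it.
    incremental_date = last_dump_day if days_since_full > 0 else None
    return full_date, incremental_date

# Example: fulls run on Sunday, and the last dump ran on Tuesday 2003-10-14.
full, incremental = dumps_to_restore(date(2003, 10, 14), full_weekday=6)
```

Because every "incremental" is relative to the last "full", a restore never needs more than two dumps, which is what keeps the procedure simple.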

"Incremental" dumps are recycled each week. The "full" dumps are kept for longer. Originally, all "fulls" were kept for 1 month. The first "full" of a month was designated a "monthly." "Monthlies" were kept for 6 months. "Monthlies" done in July and January were designated "semi-annuals" and "annuals" and kept for the life of the tape (typically, several years).

With the advent of the new law regarding public access to government records, we have changed the period for which we keep backups. This law makes us responsible for searching for all copies of files relevant to requests made by the public. In consultation with University attorneys, other departments, and our users group, we have severely limited the number of backups we keep. Our policy is now to keep backups for a maximum of two months.

Linux machines

[Do we back up any user accessible Linux machines?]

PCs

DCS Staff PCs running Windows 2000 are backed up daily to a special volume on the NetApp. Current statistics on these backups are available online.

We do not back up user PCs or Macs. Users are expected to back up their own data if it's critical.

Backups by snapshots

Backups to tape have always been a time- and resource-consuming endeavor. As disk sizes grew while costs dropped, tape technology has always played "catch up." We have observed that the cost of time and tape hardware to support disk space has never scaled as easily as the disk sizes increased. Some sort of disk-to-disk backup scheme has therefore always seemed an attractive alternative.

Recently, several companies have implemented snapshotting filesystems (allowing for pictures of the filesystem to be frozen in time) on RAIDed disks (providing, among other things, highly reliable filesystems). We are now using a NetApp fileserver to provide backed-up filespace at a cost of about $55/GB. Three years ago, providing disk space on a Sun with backup to tape ran about $225/GB. So we're now providing faster file service on more reliable hardware with quicker file restore turnaround time for about 1/4 the cost.
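The "about 1/4" figure follows directly from the two per-GB costs quoted above. As a quick check (the variable names are illustrative only):

```python
# Per-GB costs quoted in the text, in dollars.
netapp_cost_per_gb = 55
sun_tape_cost_per_gb = 225

# Fraction of the old cost that backed-up space on the NetApp represents.
cost_ratio = netapp_cost_per_gb / sun_tape_cost_per_gb
print(f"NetApp cost is about {cost_ratio:.0%} of the Sun-plus-tape cost")
```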

NetApp filesystems

As of September, 2003, we have about 1.1 TB of useable disk storage on the NetApp. Using snapshots for backups instead of backing up to tape reduces the time spent doing tape backups for DCS while simultaneously reducing the time needed to restore files when requested by a user.

A regular schedule of snapshots increases the likelihood we will be able to restore a user's file to a useable state. Snapshots are done approximately every 4 hours. These are kept for about 2 days. Snapshots done at midnight are kept for 2 weeks, except for those done at Sunday-Monday midnight, which are kept for 2 months. Snapshots allow us to keep a better variety of potential backup copies of a file for the two months we keep them.
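The retention tiers above can be expressed as a small rule. The sketch below is an illustration only and does not use any NetApp interface: the function names are hypothetical, "2 months" is approximated as 60 days, and "Sunday-Monday midnight" is taken to mean 00:00 on Monday.

```python
from datetime import datetime, timedelta

def retention(taken: datetime) -> timedelta:
    """Return how long a snapshot taken at `taken` is kept."""
    if taken.hour == 0 and taken.minute == 0:
        if taken.weekday() == 0:       # Monday 00:00, i.e. Sunday-Monday midnight
            return timedelta(days=60)  # assumption: 2 months taken as 60 days
        return timedelta(days=14)      # other midnight snapshots: 2 weeks
    return timedelta(days=2)           # ordinary 4-hourly snapshots: about 2 days

def is_kept(taken: datetime, now: datetime) -> bool:
    """True if a snapshot taken at `taken` should still exist at `now`."""
    return timedelta(0) <= now - taken <= retention(taken)

# Example: three days later, the 04:00 snapshot is gone but the midnight one remains.
now = datetime(2003, 10, 17, 4)
four_hourly_kept = is_kept(datetime(2003, 10, 14, 4), now)
midnight_kept = is_kept(datetime(2003, 10, 14, 0), now)
```

The effect is that recent changes can be recovered at fine granularity, while older states remain available at daily and then weekly granularity.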

Every month volumes on the NetApp are backed up to tape. This is not for user file restore. It is only for emergency restore in case of a RAID failure on a volume of the NetApp. These backups do not contain snapshots, so while we can restore a volume on the NetApp to its state sometime in the previous month should a catastrophic failure occur, snapshots on the restored volume would be lost.

Backups to CD

Policies and Procedures for CD Backups

Policy: LCSR maintains CD backup facilities for users of our faculty computing systems. These backups act as a 'checkpoint of last resort' for users of our faculty systems.

Procedure: Users who wish it have their home directories written onto CD-ROM or Data-DVD (using the appropriate media based on total home directory size) every two months. This backup is an 'opt-in' mechanism; each user is asked whether they wish to opt in before each backup (users may permanently opt in or out if they wish). Backups are either delivered to the individual user or picked up by them, at their direction. Once the media are picked up or delivered, their storage is the responsibility of the user.
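The opt-in check and the media choice could be sketched as follows. This is a hypothetical illustration rather than the software actually used: the function names are made up, the capacities are the nominal 700 MB (CD) and 4.7 GB (single-layer DVD) figures, and the handling of users who give no answer, or whose home directories exceed one DVD, is an assumption because the procedure does not specify it.

```python
CD_CAPACITY = 700 * 10**6     # assumption: nominal 700 MB CD
DVD_CAPACITY = 4_700 * 10**6  # assumption: nominal 4.7 GB single-layer DVD

def wants_backup(user, permanent_choice, answers_this_round):
    """Decide whether `user` gets a backup in this two-month cycle.

    permanent_choice:   dict of user -> bool for permanent opt-in/opt-out.
    answers_this_round: dict of user -> bool from this cycle's poll.
    """
    if user in permanent_choice:
        return permanent_choice[user]
    # Assumption: opt-in means no answer is treated as "no backup".
    return answers_this_round.get(user, False)

def choose_media(home_dir_bytes):
    """Pick the smallest medium that holds the whole home directory."""
    if home_dir_bytes <= CD_CAPACITY:
        return "CD-ROM"
    if home_dir_bytes <= DVD_CAPACITY:
        return "Data-DVD"
    return None  # larger directories are not covered by the stated procedure

# Example: a 2 GB home directory needs a DVD.
media = choose_media(2 * 10**9)
```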

Risks: Risks fall into three categories:

  1. Backups are not done - This might occur due to procedural or software errors; a user's home directory might not be added to the list of directories to be backed up, or the software used to do backups might outright fail.

    Risk: Moderate

    Remedies: A checklist of directories is kept by the operator performing the backups; it is in turn checked by a supervisor; the operator watches the backups occur, and the backup software checks itself to see that the backup has occurred.

  2. Backups are done incorrectly - this is the most insidious possibility: backups which appear to be done, but actually are not, due to software error, hardware error, or operator error.

    Risk: Moderate

    Remedies: The backup software performs basic checks of backup integrity; backups are tested against 'live' directories after the backups are done.

  3. Backups are done, but lost - the backup disks might be accidentally destroyed or stolen.

    Risk: Low

    Remedies: The disks are kept by a supervisor until they are picked up or delivered (at the user's option); thereafter the security of the disks is the responsibility of the owner of the files, who has the most personal interest in the safety of the disks anyway. Disks which are not given to the owner are destroyed by a 'disk shredder'. Further, users are polled before the backups are done. Disks are not made for those who are not interested in having a backup of their files, thus precluding 'orphan' disks.
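The remedy of testing backups against 'live' directories (category 2 above) could look something like the following sketch. It is not the actual verification tool: the function names are hypothetical, and it assumes the backup disk is mounted as an ordinary directory tree and that the live directory has not changed since the backup was written.

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def tree_digests(root: Path) -> dict:
    """Map each file's path relative to `root` to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def compare_backup(live: Path, backup: Path):
    """Return (missing, changed): files absent from, or different in, the backup."""
    live_d, backup_d = tree_digests(live), tree_digests(backup)
    missing = sorted(set(live_d) - set(backup_d))
    changed = sorted(
        name for name in live_d.keys() & backup_d.keys()
        if live_d[name] != backup_d[name]
    )
    return missing, changed
```

Two empty lists mean every file in the live directory is present on the backup with identical contents; anything else points to a backup that only appears to have been done.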