Emergency powerdown system

Note: If a power failure is currently taking place, please proceed directly to the usage section below.

Overview

The emergency powerdown system is a set of scripts that eases the job of shutting down all the machines in the machine room in the event of an extended power failure. It currently runs only on Solaris machines. Machines configured to run this system check in periodically to see if an emergency powerdown should be performed and log the fact that they are still running. The person managing the powerdown declares a shutdown level to be in effect, and machines at that level will perform an orderly shutdown (with a warning to any logged-in users) and power themselves off if possible. A status script repeatedly reports which systems are down or up.

After the power has returned, the shutdown levels can be used as a guide for the order to reboot machines.

Configuration

To configure a machine to take part in the emergency powerdown system:

You can then start the remote script with the command
    /etc/init.d/powerfail start

It's also a good idea to ensure that you can rsh everywhere from farside, so that if you need to temporarily start the shutdown system on some non-participating machines, it will be easy. You can test this (on farside as yourself, not root) with the command

    /lcsr/master/power/emergency-run-powerdown-script -test-rsh

Usage

Bringing machines down

Hopefully, you are working from a hardcopy of this page. It and two copies of the shutdown order should be located in the front of the brown "disaster book" in the operator's office. The shutdown order is also online at
http://www.cs.rutgers.edu/~watrous/emergency-powerdown-order.html.

As soon as a power failure occurs that looks like it might last long enough to need the emergency powerdown system, you should, as root on ns-lcsr:

This will cause the remote scripts to check the shutdown level status every 120 seconds (2 minutes) instead of the default 900 seconds (15 minutes).

When you decide to start shutting down machines, begin with

    touch shutdown.level.1
The presence of this file indicates to all level 1 machines that they should shut themselves down. (Rick Crispin advises going to level 1 about 15 minutes into a power failure with no end in sight.)

Then we start up a remote script on machines not normally configured to run the emergency powerdown system. (Before this step, be sure that the file remote.suicide does not exist in /lcsr/master/power. remote.suicide is the way you signal the remote scripts on normally non-participating machines to kill themselves off.) If there are Solaris machines not running the automatic shutdown system, you can temporarily start it on those machines by running

    /lcsr/master/power/emergency-run-powerdown-script
on farside as yourself, not root. (Hopefully, you are set up to be able to rsh everywhere from farside.)

Next,

    ./status -m2 &
This will print out the status of most of the machines and equipment in the machine room which must be powered off. It also prints the status of machines which do not take part in the automatic shutdown and must be shut down manually. (Currently, the automatic shutdown system only runs on Solaris machines.)

When status indicates that a machine is no longer responding to ping, wait about 30 seconds, then go to the machine and power off both it and any peripherals it has (e.g., disks or tape drives). As machines are confirmed down and powered off, you can check them off on one copy of the shutdown order.

As each level is completed, touch the next shutdown.level file (e.g., shutdown.level.2) to escalate to the next shutdown level. (Note: shutdown.level.3 does not imply shutdown.level.2. Each level has its own file.)
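The point that each level has its own file can be illustrated with a small sketch. This is not the actual remote script: the function name level_in_effect and the POWER_DIR variable are hypothetical, and the sketch only assumes that a level is declared by the existence of its flag file in /lcsr/master/power.

```shell
# Hypothetical helper, for illustration only (not taken from the real
# remote script). POWER_DIR defaults to the directory named in this page.
POWER_DIR=${POWER_DIR:-/lcsr/master/power}

level_in_effect() {
    # $1 = a shutdown level number, e.g. 2
    # Succeeds only if the flag file for exactly this level exists;
    # a file for a higher level does not count.
    [ -f "$POWER_DIR/shutdown.level.$1" ]
}
```

With only shutdown.level.3 present, `level_in_effect 3` succeeds while `level_in_effect 2` fails, which is why every level must be touched explicitly as you escalate.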

Bringing machines back up

If you did not power off all machines which were temporarily running the remote script, you can have them kill themselves off by touching the file remote.suicide in /lcsr/master/power.

You should also remove all shutdown.level files, although as a safety precaution the remote script will not honor a shutdown.level file created before the machine was rebooted.
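One way such a safety precaution could work is to compare the flag file's modification time with a marker file that is touched when the script starts after a reboot. The sketch below is an assumption about the approach, not a copy of the real check; the function name flag_is_fresh and the marker file are hypothetical.

```shell
# Hypothetical staleness check, for illustration only.
flag_is_fresh() {
    # $1 = shutdown.level flag file
    # $2 = marker file touched when the script started after boot (assumed)
    # Succeeds only if the flag exists and was modified after the marker.
    [ -f "$1" ] && [ -f "$2" ] && [ -n "$(find "$1" -newer "$2" 2>/dev/null)" ]
}
```

A flag left over from before the reboot is older than the marker, so the check fails and the machine stays up.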

The command

    ./status -m2 -l -r &
will repeatedly print out the status of machines broken down by shutdown level (-l) and in reverse order (-r). This is useful to see that machines are coming back up in the proper order. You can keep track of what's in progress and/or done on the second copy of the shutdown order (which is not in reverse order).

When all is done, let Don know how things went and what improvements you might like.

Testing

It is possible to test the system on a particular machine by touching the file shutdown.level.n.hostname in /lcsr/master/power, where n is the machine's shutdown level as determined by /lcsr/master/power/configure and hostname is its hostname (minus the .rutgers.edu).
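The test file name can be built mechanically from the level and the hostname. The helper below is a hypothetical sketch (test_flag_name is not part of the system) that strips a trailing .rutgers.edu and prints the name to touch.

```shell
# Hypothetical helper, for illustration only.
test_flag_name() {
    # $1 = shutdown level, $2 = hostname (with or without .rutgers.edu)
    # ${2%.rutgers.edu} removes the suffix if it is present.
    printf 'shutdown.level.%s.%s\n' "$1" "${2%.rutgers.edu}"
}
```

For a made-up level 2 machine called examplehost.rutgers.edu, this prints shutdown.level.2.examplehost, which would then be touched in /lcsr/master/power.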

Miscellaneous

The script, remote, checks itself against the copy on /lcsr/master/power, restarting itself with a new copy should the central copy be newer. So if you edit remote on /lcsr/master/power, all machines will start running the new copy the next time they check in. This is a two-edged sword. While you can fix bugs on the fly (as I was able to do during the system's first acid test), you can also kill the entire system by making a mistake in editing. Be careful with this!
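Given that risk, a self-updating script might guard the swap with a cheap syntax check. The sketch below is not how remote is known to work: it assumes remote is a Bourne shell script, and the function name safe_to_adopt is hypothetical. It accepts the central copy only if it is newer than the running copy and passes `sh -n`, which parses the file without executing it.

```shell
# Hypothetical guard, for illustration only. Assumes the script being
# checked is written in Bourne shell.
safe_to_adopt() {
    central=$1   # central copy, e.g. on /lcsr/master/power
    running=$2   # copy currently being run
    [ -f "$central" ] && [ -f "$running" ] || return 1
    # Only consider the central copy if it is newer than the running one.
    [ -n "$(find "$central" -newer "$running" 2>/dev/null)" ] || return 1
    # Parse without executing; fails on syntax errors.
    sh -n "$central" 2>/dev/null
}
```

A syntax check catches only malformed edits, not logic mistakes, so the warning above still applies.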

remote also restarts itself if the configure script has been updated, so that any machines that would be reclassified by modifications to configure are in fact reclassified.

Rob Toth points out that Suns which power themselves down will power themselves back up if power goes off and back on. It is therefore advisable to power off (in the back) Suns which have powered themselves off if there is the possibility of the UPS failing completely.

[URL: file://localhost/lcsr/master/power/emergency-powerdown-system.html]

This page last updated February 19, 2004.