Note: If a power failure is currently taking place, please proceed directly to the usage section below.
After the power has returned, the shutdown levels can be used as a guide for the order to reboot machines.
/etc/init.d/powerfail start
It's also a good idea to ensure that you can rsh everywhere from farside, so that if you need to temporarily start the shutdown system on some non-participating machines, it will be easy. You can test this (on farside as yourself, not root) with the command:
/lcsr/master/power/emergency-run-powerdown-script -test-rsh
As soon as a power failure occurs which looks like it might last long enough to need the emergency powerdown system you should, as root on ns-lcsr:
When you decide to start shutting down machines, begin with
touch shutdown.level.1
The presence of this file indicates to all level 1 machines that they
should shut themselves down.
(Rick Crispin advises going to level 1 about 15 minutes into a power
failure with no end in sight.)
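The flag-file convention described above can be illustrated with a minimal sketch. It assumes the level files live in /lcsr/master/power, the directory this document names for remote.suicide. The POWER_DIR variable and the level_requested helper are hypothetical names used only for illustration, and the real remote script may work differently.

```shell
# Assumption: level flag files live in the same directory as remote.suicide.
POWER_DIR=${POWER_DIR:-/lcsr/master/power}

# Hypothetical helper: succeeds only if the flag file for level $1 exists.
level_requested() {
    [ -e "$POWER_DIR/shutdown.level.$1" ]
}

# Illustrative use on a level 1 machine (not the real remote script):
#   while ! level_requested 1; do sleep 60; done
#   echo "level 1 requested; this host would now shut itself down"
```

The point of the design is that nothing has to be pushed to each machine: creating one file is enough, and every machine at that level notices it on its own.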
Then we start up a remote script on machines not normally configured to run the emergency powerdown system. (Before this step, be sure that the file remote.suicide does not exist on /lcsr/master/power. remote.suicide is the way you signal the remote scripts on normally non-participating machines to kill themselves off.) If there are Solaris machines not running the automatic shutdown system, you can temporarily start it on those machines by running
/lcsr/master/power/emergency-run-powerdown-script
on farside as yourself, not root. (Hopefully, you are set up to be able to rsh everywhere from farside.)
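The check for remote.suicide before starting the temporary remote scripts is easy to forget under pressure, so it may help to wrap it in a small guard. The following is only a sketch: preflight_remote_start and POWER_DIR are hypothetical names, and the only facts taken from this document are the directory and the remote.suicide file name.

```shell
POWER_DIR=${POWER_DIR:-/lcsr/master/power}

# Hypothetical pre-flight check: refuse to proceed while remote.suicide exists,
# since that file tells the temporary remote scripts to kill themselves off.
preflight_remote_start() {
    if [ -e "$POWER_DIR/remote.suicide" ]; then
        echo "remote.suicide present in $POWER_DIR; remove it first" >&2
        return 1
    fi
    echo "ok: no remote.suicide in $POWER_DIR"
}

# Possible use on farside, as yourself (not root):
#   preflight_remote_start && /lcsr/master/power/emergency-run-powerdown-script
```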
Next,
./status -m2 &
This will print out the status of most of the machines and equipment
in the machine room which must be powered off.
It also prints the status of machines which do not take part in the
automatic shutdown and must be shut down manually.
(Currently, the automatic shutdown system only runs on Solaris
machines.)
When status indicates that a machine is no longer responding to ping, wait about 30 seconds, then go to the machine and power off both it and any peripherals it has (e.g., disks or tape drives). As machines are confirmed down and powered off, you can check them off on one copy of the shutdown order.
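For a single machine, the "stops answering ping, then wait about 30 seconds" rule can be sketched as a small loop. This does not replace status, and everything here is an assumption made for illustration: the wait_until_down name, the PROBE, POLL_INTERVAL and GRACE variables, and the Solaris-style ping arguments (host followed by a timeout in seconds). Adjust the probe command for other systems.

```shell
# Hypothetical helper: block until host $1 stops answering the probe,
# then wait a grace period before declaring it safe to power off.
wait_until_down() {
    host=$1
    # PROBE defaults to ping, invoked Solaris-style as: ping host timeout
    while ${PROBE:-ping} "$host" 5 >/dev/null 2>&1; do
        sleep "${POLL_INTERVAL:-10}"
    done
    # Grace period of about 30 seconds, as recommended above.
    sleep "${GRACE:-30}"
    echo "$host: no response; safe to power off by hand"
}

# Illustrative use (the host name is a placeholder):
#   wait_until_down somehost
```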
As each level is completed, touch the next shutdown.level file (e.g., shutdown.level.2) to escalate to the next shutdown level. (Note: shutdown.level.3 does not imply shutdown.level.2. Each level has its own file.)
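Because each level has its own file, it can be useful to see at a glance which levels are currently active. The sketch below uses hypothetical helper names (escalate_to, active_levels) and assumes the level files live in /lcsr/master/power. It requires a POSIX shell for the ${f##*.} expansion.

```shell
POWER_DIR=${POWER_DIR:-/lcsr/master/power}

# Activate exactly one level; lower levels are NOT created implicitly.
escalate_to() {
    touch "$POWER_DIR/shutdown.level.$1"
}

# Print the number of every level whose flag file currently exists.
active_levels() {
    for f in "$POWER_DIR"/shutdown.level.*; do
        if [ -e "$f" ]; then
            echo "${f##*.}"   # strip everything up to the last dot
        fi
    done
}
```

If active_levels prints 3 but not 2, level 2 machines have not been told to shut down, which is exactly the situation the note above warns about.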
You should also remove all shutdown.level files, although as a safety precaution the remote script will not honor a shutdown.level file created before the machine was rebooted.
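The cleanup and the safety precaution can both be sketched in a few lines. How the real remote script decides that a level file predates the last reboot is not described here, so the comparison below, which checks the flag file against a marker file assumed to be written at boot, is purely illustrative. The names clear_levels, flag_is_fresh and POWER_DIR are hypothetical.

```shell
POWER_DIR=${POWER_DIR:-/lcsr/master/power}

# Remove every level flag file once the incident is over.
clear_levels() {
    rm -f "$POWER_DIR"/shutdown.level.*
}

# Hypothetical staleness test: succeeds only if flag file $1 was modified
# more recently than marker file $2 (assumed to be created at boot time).
flag_is_fresh() {
    [ -n "$(find "$1" -newer "$2" 2>/dev/null)" ]
}
```

With a check of this kind, a leftover shutdown.level file cannot shut down a machine that has just been brought back up, but removing the files is still the cleaner state to leave behind.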
The command
./status -m2 -l -r &
will repetitively print out the status of machines broken down by shutdown level (-l) and in reverse order (-r). This is useful to see that machines are coming back up in the proper order. You can keep track of what's in progress and/or done on the second copy of the shutdown order (which is not in reverse order).
When all is done, let Don know how things went and what improvements you might like.
remote also restarts itself if the configure script has been updated, so that any machines which would be reclassified by the modifications to configure actually are reclassified.
Rob Toth points out that Suns which power themselves down will power themselves back up if power goes off and back on. It is therefore advisable to power off (in the back) Suns which have powered themselves off if there is the possibility of the UPS failing completely.