LCSR Operator manual.
General FAQs:
NEW: The Hill-3 printer 'hill3' (aka 'dcs') is now the Ricoh copier
in Hill 378. Here's how to service that
printer. Note: 'tray 1' and 'tray 3' currently have different roles
for printing and copying.
NEW: Webserver locks up
The web server occasionally becomes unresponsive, usually because the
machine has run completely out of memory. To check whether that is
what has happened, ping www3.srv.lcsr.rutgers.edu and try to open
a web page on www.cs.rutgers.edu. If the machine answers pings but
doesn't answer the web request, that's the case we're talking about.
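The check above can be sketched in Python. This is only an
illustration, not an existing operator tool: the function names, the
'http://' URL scheme, and the Linux-style 'ping -c 1' invocation are
all assumptions (Solaris ping takes different arguments).

```python
import http.client
import subprocess
import urllib.error
import urllib.request

PING_HOST = "www3.srv.lcsr.rutgers.edu"   # host named in the text above
WEB_URL = "http://www.cs.rutgers.edu/"    # scheme and path are assumptions


def classify(pings, serves_web):
    """Turn the two observations into one of three states."""
    if pings and not serves_web:
        return "catatonic"      # answers ping, not web: the case described above
    if pings and serves_web:
        return "ok"
    return "unreachable"        # no ping at all: a different problem


def host_pings(host):
    """Send one ping with the system ping command (Linux-style '-c 1' assumed)."""
    result = subprocess.run(["ping", "-c", "1", host],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def web_answers(url, timeout_s=10.0):
    """True if the server sends back any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s):
            return True
    except urllib.error.HTTPError:
        return True             # an error page is still an answer
    except (OSError, http.client.HTTPException):
        return False            # timeout, refused connection, garbage reply


def check_web_server():
    """Run both checks against the real hosts and classify the result."""
    return classify(host_pings(PING_HOST), web_answers(WEB_URL))
```

Calling check_web_server() returns one of the three strings;
'catatonic' is the situation this section is about.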
Now do a further check: go to the terminal server, open the window
for www3 (or www6), and see whether "out of memory" errors are appearing
on the screen. If they are (or if there is another, perhaps more
disturbing, error message), contact Doug Motto (cell phone)
or Kendall Blake (same) at once. If Doug doesn't answer, leave a
message and call Kendall. If you can't get hold of either of them,
call Charles.
If all of that fails, you'll have to hard-reboot the machine.
www3 is labeled, and sits with the machines next to the giant A/C box
and the UPS (battery backup) box. Take the cover off the one labeled
www3 and power cycle it. Put the cover back on, and check at the
terminal server that it is coming up. Try again to reach Doug,
Kendall, or Charles. Failing that, make sure you email all of them
to tell them what is happening.
Temps Watching
- who to contact - at some point, you will have questions
about some situation and be unsure what to do. You
should ask for help. There are a number of sysprogs
on the hall, and the easiest thing to do is ask any one
of them. If none are available, you should call one.
Which one you call depends on the sort of question.
Use the list below to guide you.
If none of the recommended people are reachable, and you
think the situation is crucial enough, KEEP TRYING
UNTIL YOU REACH SOMEONE. If you aren't sure whether a situation
is critical or not, call. In an emergency
involving crime or fire, or anything involving
your personal safety, we want you out of harm's way
first. You can always call the RU cops (732/932-7211
-- 911 from a voip phone) from a position of safety.
(Phone numbers are provided on a laminated card from Doris --
putting them online is a bit dangerous. There is also a
list in the operator office.)
crime, fire alarms, etc. - RU Police, then also call Rick
Crispin, failing that Rob Toth
hardware, power, networking, air conditioning - Rick Crispin,
then Rob Toth
faculty, research (aramis, athos, presidents, many more) - Don W.
netapp - Don W., then Rob Tuck
grad machines, cereal, web - Doug Motto
undergrad (remus/romulus/alfalfa/spanky) - Lars
handin - Lars
wireless (plus login and lawn-gw machines) - Hanz
email (plus dragon, spamfilter machines) - Hanz, then Don W.
math - Risa Hynes
dimacs - Walter Morris
room access - Rob Toth, then Rick Crispin, then any staffer
if none of the above, then Rick Thomas
If none of these 'primary' contacts works, then try Charles,
and then try other staff members until you get somebody.
- Machine room layout
- what's in what rack (to be completed)
- lcsr machines - dcs faculty, dcs staff, dcs grad, dcs
undergrad, 'cereal', 'oslab', backup machines, and the like
- telcom - routers and switches
- hanz - PCs (some running linux, some windows) and macs
- math - math department machines
- dimacs - dimacs machines
(We do backups for math and dimacs. Both departments have their own
regular system administration staff.)
- A/C units
There are 4 A/C units (three small, and one large - and loud) in
the main machine room, along with 1 in the math/dimacs machine
room. Check the alarm lights on these units every so often.
The lights to be concerned about are
high temp., high humidity, and low humidity. If any of these red
"alarm" lights are lit, look for the green "status" lights for
cooling, humidification, or dehumidification. If alarm lights
are lit, but the corresponding green lights are also lit,
everything is probably ok, and only needs to be monitored. In
any case, however, the best way to determine whether a problem exists
is to put a hand up to the blower area (the vent at the top of
each side of the a/c unit) and feel the air flow. If the air
is warm, or warmer than usual, monitor the situation for a
while. Monitoring can be done via the temperature watcher (see
'operator tools') and the nagios host watcher (see below).
If the machine room gets really warm, or downright
hot, call Physical Plant at 5-3741, and report that the 2nd floor
CORE bldg. computer room A/Cs are not functioning properly. If
this occurs after hours (later than 4:30pm or earlier than 8am
on weekdays, and all day Saturday and Sunday), please call the
University Police (upper left corner programmed button on the
phone), and ask them to call Physical Plant.
If the air is cool, periodic monitoring should continue.
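The after-hours rule for a hot machine room can be written down as a
small Python sketch. The function names are made up for illustration,
and treating exactly 8:00am and exactly 4:30pm as regular hours is an
assumption; the hours and the two contacts come from the text above.

```python
from datetime import datetime, time

DAY_START = time(8, 0)     # 8am
DAY_END = time(16, 30)     # 4:30pm


def is_after_hours(now):
    """True outside 8am-4:30pm on weekdays, and all day Saturday and Sunday."""
    if now.weekday() >= 5:              # Monday is 0, so 5 and 6 are the weekend
        return True
    return not (DAY_START <= now.time() <= DAY_END)


def who_to_call(now):
    """Pick the contact for reporting that the machine room A/Cs are failing."""
    if is_after_hours(now):
        return "University Police (ask them to call Physical Plant)"
    return "Physical Plant at 5-3741"
```

For instance, who_to_call(datetime.now()) gives the right contact for
the current moment.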
- power goes out, what do you do?
The UPS (uninterruptible power supply) system is a battery backup
system, in place to prevent loss of power to computers, and
anything else connected to it, during a city power
outage. Under maximum conditions, it should supply
approximately 2 hours of battery backup power. It will
automatically turn itself on when its sensors indicate a loss
of city power, and off when the outage is over.
We have an automatic powerdown
system that reacts to a power
failure by shutting off machines, allowing the UPS to last longer.
If a power failure occurs when none of the system staff are
around, call one of them. The regular
phone system is no longer directly connected to the telephone
company network. The phone should come back (it is hooked up
to our emergency generator) immediately after a power failure.
If it does not, you can plug into the telephone hunt
group using the phone in the phone rack -- in the com-rack row,
farthest from the operator office. Plug the phone into 2443.
You can now dial out, and people can dial in by calling
732-445-2443.
- circuit breaker boxes
Unless very specifically directed to do so (by Rick or Rob or
other supervisory personnel), you should not mess with these, or
even open them.
- red button
There are BIG RED BUTTONs (BRBs) near each exit in each of the
machine rooms. They are emergency power cutoff switches. In
the event of a disaster of some sort (fire, flood, etc.),
hitting these buttons will cut power to all equipment except
the ACs and lights. I believe these, when used, must be reset
by physical plant personnel. Unless very specifically
directed to do so (by Rick or Rob or other supervisory
personnel), you should never mess with these.
- what to do in case of fire or smoke
There are fire alarms located in both the operator office and
the machine room, at all exits. These are self-explanatory as
to their use. If you ever see smoke or flame, activate a fire
alarm, and get out of the building ASAP, via the nearest
exit. Please familiarize yourself with all building exits. DO
NOT USE ELEVATORS. There are stairwells in both the front and
rear of the building, on the side of the building where the
walkway to Hill is. From the operator office, the rear
stairwell is closest - out the op office door go right, down
hall and around short hall to corner of building.
There should be fire extinguishers at each exit of the machine
room - but there aren't. Why not?
- what to do if there is an unusual smell -- call the hardware
guys.
- console servers
- tape drives, tape procedures, tape library
The 'tape library' is located off the main computer room. The door
to the tape library, from the machine room, is not marked as the
tape library. The tape library is located near the entrance to
the math and DIMACS machine room. Nowadays, tapes are mostly
stored in the operator office, but the room is still referred
to as the 'tape library'.
Cleaning tapes should be used when a tape drive's
'cleaning' light goes on. This means that the drive has noticed
that there is junk on the read-write heads, and you should use a
cleaning tape. Insert the cleaning tape into the drive like
normal, and wait for the light to stop blinking, then remove the
tape as normal. If more than one run of a cleaning tape is
needed for a drive (or if the 'cleaning' light keeps lighting
up), it probably means there's something else wrong with the
drive. In any case, you should report it in your eos.
- Other Tape errors - the backup scripts might fail for
a variety of reasons. These can include an outright bad tape,
permissions problems, or problems on the tape server machines
themselves. Figuring out which is which is a matter of
experience, really, but here are some guidelines.
- permissions problems - these usually show up as a
message saying "Permission denied" or "Permission
administratively denied". These appear when the machine
trying to write to the tape server's tape drive doesn't
have permission to do so. (Most often this happens for a
new machine, but sometimes permissions files get misedited
or something.) Contact one of the systems staff.
- bad tape - this is sort of the answer-of-last-resort. If
everything else fails to fix the problem, this can be the
real problem. This most often happens if the backup
begins normally, but then has a write error in the middle
somewhere. If you believe the tape is bad (they can go
bad just from being used over and over again), try a
replacement tape. If it works, then the original tape is
bad. DO NOT THROW AWAY BAD TAPES (or, for that
matter, good tapes.) They must be handled
specially to erase them - contact a member of the system
staff to dispose of the tape.
- server problem. This most often appears as a tape write
error message, or a tape-busy message at the beginning of the
backup. What is happening is that the tape handling process
on the tape writing machine didn't die properly, and is still
trying to talk to the tape drive (and won't let any other
process do so.) Alternatively, it may be that there is a
valid backup happening already, and you (or the op before you)
forgot it was happening. (In that case, just wait for the
first backup to finish :-). The process is called 'rmt'
(usually two are run for each backup, so a machine with two
tape drives might have as many as four valid rmt processes.) If
you are not sure the process should be killed, ask a sysprog.
If you are sure (and you had better be very sure -- for machines
with multiple tape drives, if a valid backup is running, you
can't easily tell the difference between a 'good' rmt and a
'bad' rmt), then kill the processes with the kill command.
Again, be very careful about this. If you are not really,
really sure, ask.
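Before deciding anything about rmt processes, it helps to see exactly
which ones exist. The Python sketch below only lists them; it never
kills anything. The function names are invented for illustration, and
it assumes the tape server's ps accepts the POSIX options
'-e -o pid,args'. It cannot tell a 'good' rmt from a 'bad' one.

```python
import subprocess


def find_rmt(ps_output):
    """Return (pid, command line) for each process whose command is 'rmt'."""
    found = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip the header line and anything that doesn't start with a pid.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, args = int(parts[0]), parts[1].strip()
        command = args.split()[0]
        # Compare only the program name, so '/usr/sbin/rmt' matches too.
        if command.rsplit("/", 1)[-1] == "rmt":
            found.append((pid, args))
    return found


def list_rmt_now():
    """Run ps on this machine and report any rmt processes."""
    output = subprocess.run(["ps", "-e", "-o", "pid,args"],
                            capture_output=True, text=True,
                            check=True).stdout
    return find_rmt(output)
```

Run list_rmt_now() on the tape-writing machine and compare the count
with the number of backups you know are running (usually two rmt
processes per backup). If the numbers don't make sense, ask a sysprog.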
- bimonthly cd-dvd backups of faculty home directories
- logbooks
There are several logbooks that need to be kept up to date.
How to generate new logsheets is detailed here.
- Backup logs - this is the most important logbook. In
it you should record the completion of tape backups scheduled
for your shift. Each day's backups have an entry, describing
whether the backup is a full or daily, what sort of tape, and
spaces for you to mark that the backup was done and how long
it took (how long matters -- if a backup takes a very long time,
it might need to be reorganized). Also note if there were
errors in the backup, which you should also mention in
your end of shift report.
- Print logs - this used to be on paper, but is now an
Excel-ish spreadsheet (thanks Fahim!). Instead of writing it
on oh-so-last-millennium paper, use the spreadsheet
found in ~operator/PrintLogs. Here's a how-to on using
it.
cd ~operator/bin
./printlogs MM/YYY
then print the "printlogs-MMM.YYY" file it creates
... the comment you give should include your initials, the
date, and anything else you want to add (things like 'lp4 is on fire',
like that.) End the comment with a single period ('.') on a line
(followed by a carriage return, of course.) Printer_Log.ods will now
disappear, to be replaced by an updated 'Printer_Log.ods,v'. If you
find the .ods file in the directory when you first cd to the
directory, it means that the last op didn't check the file back in.
Go ahead and 'ci' the file, with a comment like 'last op was too lazy
to check this back in' (also mention it in your eos, but leave out the
'lazy' part), then 'co' the file and continue as above.
- operational duties
- answer phone
answer phone "this is lcsr operations, may I help you?"
Always log phone calls in the phone log. Check the phone
answering system (https://pbx.cs.rutgers.edu/recordings/)
when you've been away from the phone (for instance on print
runs), or at the beginning of your shift. Messages found on
the answering system should be logged in the help system as if
you had received the call yourself; after that the phone
message should be moved into the 'work' folder in the answering
system to show that someone has listened to it. To use the
phone itself to listen to messages, dial *98. Use the same
username (2443) and password (ask me) as in the web-based one.
- Check operator mail. While we hope the users will use
the ticket system (see below) by sending to 'help@cs', the
old way was for people to send to 'operator@cs', and many of
our users still do that. Also, some automatic processes
send to operator@cs. Note that this is different from
'ops@cs'. 'ops@cs' is for us to talk amongst ourselves, and
arrives as mail in your personal inbox. 'operator@cs' mail
arrives at the shared mailbox that appears in your cs mail
server list of folders as 'operator/INBOX', or something
similar. You should check this mailbox every so often,
just to make sure no mail has arrived there.
- Check the ticket system at
tickets.cs.rutgers.edu/admin for new tickets every hour.
Every hour. EVERY HOUR.
Here are the
ticket placement guidelines for the ticket system. Here's how to figure out what cluster a
machine is in (for ticket placement, for instance.)
- tape backups (tape log) (here's the difference between fulls
and dailies)
We do three kinds of backups here. One is to DLTIV tapes,
which you do with the "from tape" stuff described below. One is
of the netapp, which Don does himself. And one is to the
disk-to-disk server, known as 'hold', which happens
automatically every night. Each has a different
restore mechanism, described below. Generally,
user files (e.g. /ug/users) are on the netapp. System files are
on the disk-to-disk server
for machines that use the disk-to-disk system, and on tape for
machines that still do tape
backups. If you're not sure
where a restore request's files live, ask.
- tape restores
User Files (most home directories)
| structure | on machine | restore from |
| fac | faculty | netapp |
| grad | grad | netapp |
| ug | undergrad | netapp |
| ilab | cereal | netapp |
| staff | farside | farside or netapp |
Non-user files. Most of these (e.g. /fac/s1, various machines'
/etc, /ug/software and the like) are handled by the disk-to-disk
system.
- what to do if backups get missed - if, for some reason
(usually because an op couldn't make a shift, and no one could cover),
backups were not done, we need to make sure that the backups are
brought up to date as quickly as possible. Check back in the logbook
and see what hasn't been done. Do the backups in this order:
- fulls: if a full has not been done, do it first.
When it is finished, mark any dailies before or after that
in the book as not done (big "NOT DONE" is recommended),
and note in the logbook entry for that full that it was done,
and on what day. For instance,
let's say there was
a full to be done on Saturday, and dailies on Sunday and
Monday. You come in Monday to see that neither the Saturday
nor the Sunday backup was done. You should do the full at
once, marking in the logbook for the full, "done on Monday",
and for the Sunday and Monday dailies, "NOT DONE - full
done on Monday". (The Monday daily doesn't have to be done;
you've just done the full.)
- dailies: if only dailies have been missed, do the
most recent daily, and log the other dailies as not done. For
example,
let's say there were dailies to be done on
Saturday, Sunday, and Monday. You come in on Monday and do
the daily on the Monday tape, logging it as normal. The
Saturday and Sunday dailies you log as "NOT DONE - daily
done on Monday."
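The catch-up rule can be summarized as a short Python sketch. It is
an illustration only: the function name and the (day, kind) data
layout are invented here, and picking the most recent full when more
than one is outstanding is an assumption the text does not cover.

```python
def plan_catch_up(outstanding, today):
    """Decide which backup to run and what to write in the logbook.

    outstanding: list of (day, kind) tuples in chronological order,
                 including today's scheduled backup; kind is 'full'
                 or 'daily'.
    today:       name of the day the catch-up is being done.
    Returns (backup_to_run, {backup: logbook note}).
    """
    if not outstanding:
        return None, {}

    fulls = [b for b in outstanding if b[1] == "full"]
    if fulls:
        to_run = fulls[-1]          # a missed full always comes first
        kind_done = "full"
    else:
        to_run = outstanding[-1]    # otherwise only the most recent daily
        kind_done = "daily"

    notes = {}
    for backup in outstanding:
        if backup == to_run:
            notes[backup] = "done on " + today
        else:
            notes[backup] = "NOT DONE - {} done on {}".format(kind_done, today)
    return to_run, notes
```

Feeding it the two examples above (a Saturday full with Sunday and
Monday dailies, or three dailies) reproduces the logbook entries
described there.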
- check out of projectors
We maintain a collection of lcd projectors that are loaned
to various faculty members (or their representatives;
faculty will sometimes send TAs or students). When they
appear, you need to do two things:
- check that they have reserved the projector in the
calendar system. (
Here's a direct link --
there's also a link
to it on the main dcs web page, under resources.)
If they haven't,
and one of the projectors is as yet unreserved, make
a note in your EOS of when they picked it up, so we can keep track of what
is reserved and what isn't. If there is a
conflict, that is, if someone has made a reservation for
a projector and someone else just shows up looking
to pick one up, the one with the reservation gets
the projector.
- get their id, and use the checker system to record who
took the item.
Here's how to see
reservations (and set yourself up to see them), and
here's how to use the
check-out system.
- keep the whiteboard accurate. For users arriving in
person, it's just good manners to tell them whether you're around
to help them. Have a look at this for some guidance.
- watch for problems. These can include
connectivity problems (the outside world is
suddenly unreachable), several reports
of the same problem over a short amount of time, people
having problems logging in to multiple machines when, by all
appearances, they should be able to, or printers jamming
the same way repeatedly. When you see something like this,
create a ticket (a printer
queue ticket, for printer problems) so the appropriate people can have a look.
You're our eyes and ears.
- LDAP, kerberos, enigma - the university as a whole,
and DCS as well, are moving to an authentication system
known as LDAP ('lightweight directory access protocol').
DCS also uses 'kerberos', and you and the rest of the staff
have 'enigma' (or 'safeword', or whatever) one-time
cards. These all depend on a server to handle password
authentication. For instance, 'dragon', our mail server,
depends on the OIT LDAP server, ldap.rutgers.edu. If that
server fails, for whatever reason, suddenly users will start
being unable to login to the mail server. (The LDAP server
is supposed to have 'failover', that is, automatic backup
servers, but we've seen at least one case where that just
flat didn't work.) Anyway, when you see that happening, you
need to alert the staff. For the kerberos servers, which
we run, symptoms would be people not being able to log into
regular servers (remus, paul, etc.) or desktop machines.
For enigma servers (which OIT runs), symptoms would be that
you can't log in.
- reboots - You may be called upon to reboot a machine by a
member of the system staff. You should never reboot a
machine without permission of a member of the system staff.
If you think a reboot is called for, contact somebody on
the system staff and ask permission. Machines are often
running long-term jobs, or system updates that are not
obvious, and a reboot at an unexpected time might cause
serious difficulty -- for instance, a reboot in the middle
of a kernel update may cause the machine to not boot at all.
- end of shift reports
- 'door problems' - if people report that they cannot get into
a swipe-card (either type) door, this should be reported
to the hardware guys.
- How to fix
passwords, and other non-loginable conditions, which
might be password, or academic integrity-signing, or
other-things related.
- generally be the first line of support for our users
- privs - what they are for and what they are not
- no reading of files or mail without explicit permission
(you tell them 'I need your permission', they have to
give it, via email, or in person, provided they can present
ID proving they are who they say they are.)
- when you are not sure what is right to do, ask. This may
all sound silly and overly strict, but we have in the past
had an operator who read files, and even changed a user's
password, based on a phone call from someone who, well, we
never did find out who they were. Don't do that.
- some suggestions for
users who are having trouble with the wireless network (lawn).
- print runs (printrun log)
- check paper
- supply cabinets
- printer key stick
- toner replacement
- additional supplies
- paper from doris
- toner from Rick and Rob
- see Mark's howto video
- Ilab door and trash check
Since the ilab rooms have become more popular, they have also
become more misused. So, the first op of the day, on the first
print-run of the day, should stop by Hill 248-252 and
- make sure the doors are all actually closed. (People have begun
propping them open forever, which kind of makes them less secure.)
- look around and see if there is any obvious garbage lying around
and get rid of it. (This does not include mopping the floors or
anything, just empty bottles, wrappers, and the like.)
- Go around to each machine and check for a login
prompt. Report any machine that has neither a graphical
login prompt nor someone logged in.
Exam ('local') mode for cereal (ilab) machines.
The cereal (or ilab) machines can be put into a special 'local' mode
for exams (this way students can be tested on-line, without worrying
about them using google, or IM, or email to 'enhance' their scores).
The difficulty is that while users on the machines cannot get out,
neither can other users, who might be using the machines for some other
class's programming assignment, get in. They will complain to help@cs that the
cereal machines are all down -- which, from the 'outside', is true.
Faculty and TAs who are going to put the machines in this mode
(using a command you don't need to remember, 'mklocal') reserve
the time on a special reservation system. The url for these
reservations is
http://www.cs.rutgers.edu/resources/rooms_and_equipment/status/rooms.php
Here you will find a calendar for each room (selected near the top --
hill 248, 250, 252). These are times the room is reserved, usually
for a recitation but sometimes for exams. The machines may or may
not, during this time, be off the net for exams. I pick a couple of
representative machines (macaroni, frootloops), and if they are
unreachable (with ping, for instance), and the room is reserved, then
I can presume that the machines are being used for an exam, and, since
I know how long the reservation is for, I can advise users when the
machines are likely to be available again.
... of course, faculty or TAs may not reserve the time, but at least this way
you have some way of telling what is going on.
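The reasoning above boils down to a two-part test, sketched here in
Python. The function name and the dictionary of ping results are
invented for illustration; you still have to ping the machines and
read the room calendar yourself.

```python
def likely_in_exam_mode(ping_results, room_reserved):
    """Guess whether the cereal machines are in 'local' (exam) mode.

    ping_results:  dict mapping representative hostnames (for example
                   'macaroni', 'frootloops') to True if the host
                   answered a ping, False if it did not.
    room_reserved: True if the room calendar shows a reservation now.
    """
    if not ping_results:
        return False                      # nothing checked, nothing to conclude
    all_unreachable = not any(ping_results.values())
    return all_unreachable and room_reserved
```

If this comes back True, tell the user when the reservation ends. If
the machines are unreachable and the room is not reserved, this test
says nothing about why (an unreserved exam is still possible), so
treat it as something to look into.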
- Here's a map to the ilab
machines, in case you have to find one specific one.
- Unix commands
Most things you have to do are on unix (mostly solaris or
linux). Here is some documentation on what's what and how's how.
- operator-specific tools
- a beginning of
shift checklist, and an end of shift
checklist.
- lpqer
Printer control and monitoring is now done via the printserver web page.
In addition, we have a service called 'nagios' which will alert
you to printer problems. Here's a how-to to add it to your
firefox session (thanks to Anthony for the how-to).
1. Download it from "https://addons.mozilla.org/en-US/firefox/addon/3607".
2. Get to the configuration window. This is usually done by
right-clicking on the small Nagios icon on the bottom right of the
Firefox window and clicking 'Settings'.
3. Under the 'General' tab you should see a button labeled 'Add New'.
4. Type 'http://report.rutgers.edu/' under 'Nagios web interface url'.
5. Type 'http://report.rutgers.edu/nagios/cgi-bin/status.cgi' under
'Status Script Url/Located url'.
6. Click the 'standalone' button
7. You should be good to go!
- temperature
watcher. This is included in the uber-watching page,
on
farside. Note that if any of the frames doesn't display,
that's a failure condition, and the appropriate sysadmin should
be informed (printing - Hanz, others - Rick and Rob.)
- printer information -- this has been superseded by use of
printserver.cs.rutgers.edu to handle all printing. See
here for print handling.
- Additional printer
information
- slide
- reported network congestion
If you ever get calls about the network being slow, have a
look at http://speed.rutgers.edu/
This has, in addition to a speed-tester (which you can run
yourself anywhere, by the way), a graph of current network
traffic going into and out of the university. Our limit is
600M -- which you can see we reach daily from about noon to
about 6pm.
So anyway, if you get complaints about the network being slow,
check to see first if the user is trying to get out of the
university, and then check this graph. If it's at 600M, there
isn't much we can do, but at least we can tell the users what
is going on.
If the traffic is within Rutgers, or the traffic monitors look
ok, then you need to contact the hardware folk and report it
to them.
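The decision described above can be sketched in Python. The function
name and return strings are made up for illustration, and reading
'600M' as 600 megabits per second is an assumption. How close to the
limit counts as 'at the limit' is a judgment call the sketch does not
make for you.

```python
LINK_LIMIT = 600   # the '600M' limit from the text; unit assumed to be Mbit/s


def triage_slow_network(leaving_university, outside_traffic):
    """Decide what to do about a 'the network is slow' complaint.

    leaving_university: True if the user is trying to reach something
                        outside Rutgers.
    outside_traffic:    current reading from the traffic graph, in the
                        same unit as LINK_LIMIT.
    """
    if leaving_university and outside_traffic >= LINK_LIMIT:
        # The outside link is full; nothing we can fix.
        return "tell the user the outside link is saturated"
    # Traffic is internal, or the graph looks ok: a real problem.
    return "report to the hardware staff"
```

For example, triage_slow_network(True, 600) says to explain the
situation to the user, while a complaint about traffic inside Rutgers
always goes to the hardware staff.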
- see Mark's howto video
- SPECIAL CASE stuff. In here goes stuff that is 'one-off', or
that is not part of the regular routine but is
occasionally done.
- The first op in for the day should go to the console
server in the main machine room, select the aramis
console window, and attempt to log in with your regular
password and enigma/safeword card. If this succeeds,
then log out -- don't close the window itself --
and mention the success in your EOS,
in a line labeled "Aramis Console."
If it fails, please send mail to Doug (dmotto), telling
him it failed. Failure is typically that you can't type
anything in the window -- not even your username.
Do not attempt to fix this, just let Doug know. Don't
close the window, or anything.
- planetlab machines - As part of a research project
spanning the planet (hence the name), Rutgers has a
couple of machines in the 'research machine room' (C231)
that occasionally need starting or restarting. The
machines are in the first rack as you come in. Open the
door to the rack (right hand side), and you will see two
machines labeled 'planetlab1' and 'planetlab2'. They
each have a usb key either in the front or the back
(never remove them unless specifically directed to -- the
machines boot off of them). You start them by pressing
the power button. You restart them by pressing the power
button to turn it off, then pressing it again to turn it
on. That's it - no console, no login, no nothing.
Currently (November 2008) besides the staff, only
Dr. Nguyen (tdnguyen) and Chris Peery (peery) will be
telling you to do that. Most often this is done if
there's a power failure, or something like that. The
machines are actually administered remotely by the
planetlab staff (wherever they are.)
- Projects website http://project.cs.rutgers.edu/students
is the site for keeping track of your project tasks (as
opposed to operational tasks). This website (software
from the 'dotproject' project; this
is a good place for documentation) should be used in
lieu of email reports to Doug or others as to how your projects
are going. Please remember to update this after every
shift (in addition to your regular EOS report). Do
not leave your project session connected after your shift
ends, otherwise the next op might accidentally log
project info as you. (Also, remember that your eden/rci
password is used for the projects website, rather than
others.)
- A way to recover osterman (and other printers) if nothing else works
- restarting the
monitors in the lobby.
- speed tester, and
monitoring the rutgers-to-the-outside-world bandwidth usage.
- What to do when firefox
locks you out. Chris Eskow wrote a very good reply to
this problem.
- Hill and Core
lost and found(s).
- things to be done (by Charles)
for new ops
References
Don's FAQ
has one especially for
ops, but you have to run a web browser from one of the staff
machines to see it.
The old ops
page (Note - browser must be running on farside or other staff
machine to see these pages.)
I can't log in