LCSR Operator manual. General FAQs: NEW: Hill-3 printer 'hill3' (aka 'dcs') is now the Ricoh copier in Hill 378. Here's how to service that printer. Note - there are currently different roles for 'tray 1' and 'tray 3' for printing or copying.

NEW: Webserver locks up. The web server can occasionally become catatonic, usually because the machine has run out of memory entirely. To confirm that this is what has happened, ping www3.srv.lcsr.rutgers.edu and try to open a web page on www.cs.rutgers.edu. If the machine pings but doesn't answer the web request, that's the case we're talking about.
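The ping-versus-web check can be sketched in a few lines of Python. This is only an illustration, not an existing LCSR tool: the function names and timeouts are assumptions, and the ping call assumes a Linux/macOS-style 'ping -c'.

```python
import subprocess
import urllib.request

PING_HOST = "www3.srv.lcsr.rutgers.edu"   # host named in this manual
WEB_URL = "http://www.cs.rutgers.edu/"    # page named in this manual


def host_pings(host, timeout_s=5):
    """Return True if one ICMP echo request gets a reply."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def web_answers(url, timeout_s=10):
    """Return True if the web server returns a page within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except Exception:
        # Timeouts, refused connections and HTTP errors all count as "no answer".
        return False


def diagnose(pings, answers_web):
    """Classify the situation from the two observations."""
    if answers_web:
        return "ok"
    if pings:
        # Pings but no web answer: the lock-up case described above.
        return "locked-up"
    return "unreachable"
```

For example, `diagnose(host_pings(PING_HOST), web_answers(WEB_URL))` returning "locked-up" would be the cue to go on to the terminal-server check.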

Now do a further check -- go to the terminal server, open the window for www3 (or www6), and see if "out of memory" errors are appearing on the screen. If they are (or if there is another, perhaps more disturbing, error message), you should contact Doug Motto (cell phone) or Kendall Blake (same) at once. If Doug doesn't answer, leave a message and call Kendall. If you can't get hold of either of them, call Charles.

If all of that fails, you'll have to hard-reboot the machine. www3 is labeled, and sits with the machines next to the giant A/C box and the UPS (battery backup) box. Take the cover off the one labeled www3, and power cycle it. Put the cover back on, and check at the terminal server that it is coming up. Try again to reach Doug, Kendall, or Charles. Failing that, make sure you email all of them telling them what is happening.

Temps Watching

  1. who to contact - at some point, you will have questions about some situation, and be unsure what to do. You should ask for help. There are a bunch of sysprogs on the hall; the easiest thing to do is ask any one of them. If none are available, you should call one. Which one you call depends on what sort of question it is. Use the list below to guide you.
    If none of the recommended people are reachable, and you think the situation is crucial enough, KEEP TRYING UNTIL YOU REACH SOMEONE. If you aren't sure if a situation is critical or not, call. In an emergency involving crime or fire, anything involving your personal safety, we want you out of harm's way first. You can always call the RU cops (732/932-7211 -- 911 from a voip phone) from a position of safety.

    (phone numbers are provided on a laminated card from Doris -- putting them online is a bit dangerous. There is also a list in the operator office.)

    crime, fire alarms, etc. - RU Police, then also call Rick Crispin, failing that Rob Toth
    hardware, power, networking, airconditioning - Rick Crispin, then Rob Toth
    faculty, research (aramis, athos, presidents, many more) - Don W.
    netapp - Don W., then Rob Tuck
    grad machines, cereal, web - Doug Motto
    undergrad (remus/romulus/alfalfa/spanky)- Lars
    handin - Lars
    wireless (plus login and lawn-gw machines) - Hanz
    email (plus dragon, spamfilter machines) - Hanz, then Don W.
    math - Risa Hynes
    dimacs - Walter Morris
    room access - Rob Toth, then Rick Crispin, then any staffer if none of the above, then Rick Thomas
    If none of these 'primary' contacts works, then try Charles, and then try other staff members until you get somebody.
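As an illustration only, the contact list above can be written as a lookup table with Charles as the common fallback. The category keys and the function name are made up for this sketch, only some of the categories are shown, and the names are copied from the list above.

```python
# Ordered contacts per kind of problem, copied from the list above (partial).
CONTACTS = {
    "crime/fire": ["RU Police", "Rick Crispin", "Rob Toth"],
    "hardware": ["Rick Crispin", "Rob Toth"],
    "faculty/research": ["Don W."],
    "netapp": ["Don W.", "Rob Tuck"],
    "grad/cereal/web": ["Doug Motto"],
    "undergrad": ["Lars"],
    "handin": ["Lars"],
    "wireless": ["Hanz"],
    "email": ["Hanz", "Don W."],
}

# If none of the primary contacts works, try Charles, then any other staff.
FALLBACK = ["Charles"]


def call_order(category):
    """Return the people to try, in order, for a kind of problem."""
    primary = CONTACTS.get(category, [])
    return primary + [name for name in FALLBACK if name not in primary]
```

An unknown category simply yields the fallback, which matches the "when in doubt, call" advice above.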

  2. Machine room layout

  3. tape drives, tape procedures, tape library
    The 'tape library' is located off the main computer room. The door to the tape library, from the machine room, is not marked as the tape library. The tape library is located near the entrance to the math and DIMACS machine room. Nowadays, tapes are mostly stored in the operator office, but the room is still referred to as the 'tape library'.

    Cleaning tapes should be used when a tape drive's 'cleaning' light goes on. This means that the drive has noticed that there is junk on the read-write heads, and you should use a cleaning tape. Insert the cleaning tape into the drive like normal, and wait for the light to stop blinking, then remove the tape as normal. If more than one run of a cleaning tape is needed for a drive (or if the 'cleaning' light keeps lighting up), it probably means there's something else wrong with the drive. In any case, you should report it in your eos.

  4. Other Tape errors - the backup scripts might fail for a variety of reasons. These can include an outright bad tape, permissions problems, or problems on the tape server machines themselves. Figuring out which is which is a matter of experience, really, but here are some guidelines.

  5. bimonthly cd-dvd backups of faculty home directories
  6. logbooks
    There are several logbooks that need to be kept up to date. How to generate new logsheets is detailed here.

  7. operational duties
    • answer phone
      answer phone "this is lcsr operations, may I help you?" Always log phone calls in the phone log. Check the phone answering system (https://pbx.cs.rutgers.edu/recordings/) when you've been away from the phone (for instance on print runs), or at the beginning of your shift. Messages found on the answering system should be logged in the help system as if you had received the call yourself; after that the phone message should be moved into the 'work' folder in the answering system to show that someone has listened to it. To use the phone itself to listen to messages, dial *98. Use the same username (2443) and password (ask me) as in the web-based one.

    • Check operator mail. While we hope the users will use the ticket system (see below) by sending to 'help@cs', the old way was for people to send to 'operator@cs', and many of our users still do that. Also, some automatic processes send to operator@cs. Note that this is different from 'ops@cs'. 'ops@cs' is for us to talk amongst ourselves, and arrives as mail in your personal inbox. 'operator@cs' mail arrives at the shared mailbox that appears in your cs mail server list of folders as 'operator/INBOX', or something similar. You should check this mailbox every so often, just to make sure no mail has arrived there.

    • Check the ticket system at tickets.cs.rutgers.edu/admin for new help requests every hour. Every hour. EVERY HOUR. Here are the guidelines for ticket placement in the ticket system. Here's how to figure out what cluster a machine is in (for ticket placement, for instance.)

    • tape backups (tape log) (here's the difference between fulls and dailies) We do three kinds of backups here. One is to DLTIV tapes, which you do with the "from tape" stuff described below. One is of the netapp, which Don does himself. One is to the disk-to-disk server, known as 'hold', which happens automatically every night. Each has a different restore mechanism, described below. Generally, user files (e.g. /ug/users) are on the netapp, system files for machines that use the disk-to-disk system are on the disk-to-disk server, and files for machines that still do tape backups are on the tapes. If you're not sure where a restore request's files live, ask.

    • tape restores
      User Files (most home directories)
      structure   on machine   restore from
      fac         faculty      netapp
      grad        grad         netapp
      ug          undergrad    netapp
      ilab        cereal       netapp
      staff       farside      farside or netapp

      Non-user files. Most of these (e.g. /fac/s1, various machines' /etc, /ug/software and the like) are handled by the disk-to-disk system.
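The user-file table above amounts to a small lookup. The sketch below is only an illustration (the function name is made up, and it covers user home directories only, not the non-user files handled by the disk-to-disk system).

```python
# structure -> (machine the files are on, where to restore from),
# copied from the user-files table above.
USER_FILE_RESTORES = {
    "fac": ("faculty", "netapp"),
    "grad": ("grad", "netapp"),
    "ug": ("undergrad", "netapp"),
    "ilab": ("cereal", "netapp"),
    "staff": ("farside", "farside or netapp"),
}


def user_file_restore(structure):
    """Return (machine, restore source) for a user-file structure.

    Returns None when the structure is not in the table, in which case
    the advice above applies: ask.
    """
    return USER_FILE_RESTORES.get(structure)
```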

    • what to do if backups get missed - if, for some reason (usually because an op couldn't make a shift, and no one could cover), backups were not done, we need to make sure that the backups are brought up to date as quickly as possible. Check back in the logbook and see what hasn't been done. Do the backups in this order:
      • fulls: if a full has not been done, do it first. When it is finished, mark any dailies before or after that in the book as not done (big "NOT DONE" is recommended), and note in the logbook for that full, that it was done, and on what day. For instance:

        let's say there was a full to be done on Saturday, and dailies on Sunday and Monday. You come in Monday to see that neither the Saturday nor the Sunday backup was done. You should do the full at once, marking in the logbook for the full, "done on Monday", and for the Sunday and Monday dailies, "NOT DONE - full done on Monday". (The Monday daily doesn't have to be done, you've just done the full.)

      • dailies: if only dailies have been missed, do the most recent daily, and log the other dailies as not done. For example,

        let's say there were dailies to be done on Saturday, Sunday, and Monday. You come in on Monday, you do the daily on the Monday tape, logging it as normal. The Saturday and Sunday dailies, you log as "NOT DONE - daily done on Monday."
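The catch-up rules above can be written out as a short function. This is a sketch for illustration, not a script we actually run: the function name and the wording of the "run now" entries are made up, and it assumes at most one missed full and one scheduled backup per day.

```python
def plan_catchup(outstanding, today):
    """Decide what to run and what to log for backups not yet done.

    outstanding: list of (day, kind) in date order, kind is "full" or
                 "daily"; includes today's scheduled backup.
    today:       name of the day you are catching up on.
    Returns a dict mapping each day to its action or logbook entry.
    """
    plan = {}
    fulls = [day for day, kind in outstanding if kind == "full"]
    if fulls:
        # A missed full takes priority; every daily is marked NOT DONE.
        full_day = fulls[-1]
        for day, kind in outstanding:
            if day == full_day and kind == "full":
                plan[day] = f"run full now; log 'done on {today}'"
            else:
                plan[day] = f"NOT DONE - full done on {today}"
    elif outstanding:
        # Only dailies were missed: run the most recent one only.
        latest_day = outstanding[-1][0]
        for day, kind in outstanding:
            if day == latest_day:
                plan[day] = "run daily now; log as normal"
            else:
                plan[day] = f"NOT DONE - daily done on {today}"
    return plan
```

Feeding it the two examples above (a Saturday full plus Sunday and Monday dailies, or three dailies) reproduces the logbook entries described.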

    • check out of projectors. We maintain a collection of LCD projectors that are loaned to various faculty members (or their representatives; faculty will sometimes send TAs or students). When they appear, you need to do two things:
      1. check that they have reserved the projector in the calendar system (Here's a direct link -- there's also a link to it on the main dcs web page, under resources). If they haven't, and one of the projectors is as yet unreserved, make a note of when they picked it up in your EOS, so we can keep track of what is reserved and what isn't. If there is a conflict, that is, if someone has made a reservation for a projector and someone else just shows up looking to pick one up, the one with the reservation gets the projector.
      2. get their id, and use the checker system to record who took the item.

      Here's how to see reservations (and set yourself up to see them), and here's how to use the check-out system.

    • keep the whiteboard accurate. For users arriving in person, it's just good manners to tell them if you're around to help them. Have a look at this for some guidance.

    • watch for problems. These can include connectivity problems (the outside world is suddenly unreachable), several reports of the same problem over a short amount of time, people who by all appearances should be able to log in having problems logging in to multiple machines, or printers jamming the same way repeatedly. When you see any of these, you should create a printer queue ticket so the appropriate people can have a look. You're our eyes and ears.

    • LDAP, kerberos, enigma - the university as a whole, and DCS as well, are moving to an authentication system known as LDAP ('lightweight directory access protocol'). DCS also uses 'kerberos', and you and the rest of the staff have 'enigma' (or 'safeword', or whatever) one-time cards. These all depend on a server to handle password authentication. For instance, 'dragon', our mail server, depends on the OIT LDAP server, ldap.rutgers.edu. If that server fails, for whatever reason, users will suddenly be unable to log in to the mail server. (The LDAP server is supposed to have 'failover', that is, automatic backup servers, but we've seen at least one case where that just flat didn't work.) When you see that happening, you need to alert the staff. For the kerberos servers, which we run, the symptom would be people not being able to log into regular servers (remus, paul, etc.) or desktop machines. For the enigma servers (which OIT runs), the symptom would be that you can't log in.

    • reboots - You may be called upon to reboot a machine by a member of the system staff. You should never reboot a machine without permission of a member of the system staff. If you think a reboot is called for, contact somebody on the system staff and ask permission. Machines are often running long-term jobs, or system updates that are not obvious, and a reboot at an unexpected time might cause serious difficulty -- for instance, a reboot in the middle of a kernel update may cause the machine to not boot at all.
    • end of shift reports

    • 'door problems' - if people report that they cannot get into a swipe-card (either type) door, this should be reported to the hardware guys.

    • How to fix passwords, and other non-loginable conditions, which might be password, or academic integrity-signing, or other-things related.

    • generally be the first line of support for our users

    • privs - what they are for and what they are not
      • no reading of files or mail without explicit permission (you tell them 'I need your permission', they have to give it, via email, or in person, provided they can present ID proving they are who they say they are.)
      • when you are not sure what is right to do, ask. This may all sound silly and overly strict, but we have in the past had an operator who read files, and even changed a user's password, based on a phone call from someone who, well, we never did find out who they were. Don't do that.
    • some suggestions for users who are having trouble with the wireless network (lawn).
    • print runs (printrun log)
      • check paper
      • supply cabinets
      • printer key stick
      • toner replacement
      • additional supplies
        • paper from doris
        • toner from Rick and Rob
      • see Mark's howto video

    • Ilab door and trash check. Since the ilab rooms have become more popular, they have also become more misused. So, the first op of the day, on the first print-run of the day, should stop by Hill 248-252 and
      • make sure the doors are all actually closed. (People have begun propping them open forever, which kind of makes them less secure.)
      • look around and see if there is any obvious garbage lying around and get rid of it. (This does not include mopping the floors or anything, just empty bottles, wrappers, and the like.)
      • Go around to each machine and check for a login prompt. Report any machine that has neither a graphical login prompt nor someone logged in.
    Exam ('local') mode for cereal (ilab) machines.
    The cereal (or ilab) machines can be put into a special 'local' mode for exams (this way students can be tested on-line, without worrying about them using google, or IM, or email to 'enhance' their scores). The difficulty is that while users on the machines cannot get out, neither can other users get in, and they might be using the machines for some other class programming assignment. They will complain to help@cs that the cereal machines are all down -- which, from the 'outside', is true. Faculty and TAs who are going to put the machines in this mode (using a command you don't need to remember, 'mklocal') reserve the time on a special reservation system. The url for these reservations is
          
          http://www.cs.rutgers.edu/resources/rooms_and_equipment/status/rooms.php
    
    
    Here you will find a calendar for each room (selected near the top -- hill 248, 250, 252). These are times the room is reserved, usually for a recitation but sometimes for exams. The machines may or may not be off the net for exams during this time. I pick a couple of representative machines (macaroni, frootloops), and if they are unreachable (with ping, for instance) and the room is reserved, then I can presume that the machines are being used for an exam. Since I know how long the reservation is for, I can advise users when the machines are likely to be available again. Of course, faculty or TAs may not reserve the time, but at least this way you have some way of telling what is going on.
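The reasoning above (room reserved plus representative machines unreachable means probably an exam) can be sketched as follows. This is an illustration only: the function names and message wording are made up, and the reservation status and ping results are assumed to have been looked up by hand and passed in.

```python
# Representative cereal machines named above.
REPRESENTATIVES = ("macaroni", "frootloops")


def presume_exam_mode(room_reserved, reachable):
    """True when the room is reserved and no representative machine answers.

    reachable: dict mapping hostname -> True/False (e.g. from ping).
    A machine missing from the dict is treated as unreachable.
    """
    none_answer = all(not reachable.get(host, False) for host in REPRESENTATIVES)
    return room_reserved and none_answer


def advice_for_user(room_reserved, reachable, reservation_end):
    """Return a short line to tell a user who reports cereal as down."""
    if presume_exam_mode(room_reserved, reachable):
        return ("Machines are probably in exam mode; "
                f"likely available again after {reservation_end}.")
    if not any(reachable.get(host, False) for host in REPRESENTATIVES):
        # Unreachable but no reservation: could be an unreserved exam
        # or a real outage, so this cannot be decided from here.
        return "Machines unreachable and room not reserved; check with staff."
    return "Representative machines are reachable; look for another cause."
```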
  8. Here's a map to the ilab machines, in case you have to find one specific one.

  9. Unix commands Most things you have to do are on unix (mostly solaris or linux). Here is some documentation on what's what and how's how.
  10. operator-specific tools
  11. a beginning of shift checklist, and an end of shift checklist.
  12. lpqer
    printer control and monitoring is now done via the printserver web page

    In addition, we have a service called 'nagios' which will alert you to printer problems. Here's a how-to to add it to your firefox session (thanks to Anthony for the how-to).

         1. Download it from "https://addons.mozilla.org/en-US/firefox/addon/3607".
         2. Get to the configuration window. This is usually done by
            right-clicking on the small Nagios icon on the bottom right of the
            Firefox window and clicking 'Settings'.
         3. Under the 'General' tab you should see a button labeled 'Add New'.
         4. Type 'http://report.rutgers.edu/' under 'Nagios web interface url'.
         5. Type 'http://report.rutgers.edu/nagios/cgi-bin/status.cgi' under
            'Status Script Url/Located url'.
         6. Click the 'standalone' button
         7. You should be good to go!
         


  13. temperature watcher. This is included in the uber-watching page, on farside. Note that if any of the frames doesn't display, that's a failure condition, and the appropriate sysadmin should be informed (printing - Hanz, others - Rick and Rob.)
  14. printer information -- this has been superseded by use of printserver.cs.rutgers.edu to handle all printing. See here for print handling.
  15. Additional printer information
  16. slide
  17. reported network congestion
    If you ever get calls about the network being slow, have a look at http://speed.rutgers.edu/. This has, in addition to a speed-tester (which you can run yourself anywhere, by the way), a graph of current network traffic going into and out of the university. Our limit is 600M, which you can see we reach daily from about noon to about 6pm. So if you get complaints about the network being slow, check first whether the user is trying to get out of the university, and then check this graph. If it's at 600M, there isn't much we can do, but at least we can tell the users what is going on. If the traffic is within Rutgers, or the traffic monitors look ok, then you need to contact the hardware folk and report it to them.
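The decision above can be summed up in a few lines. This is only a sketch: the function name and return strings are made up, the traffic figure is assumed to be read off the graph by hand, and "at the limit" is taken to mean 600M or more.

```python
LINK_LIMIT_M = 600  # university limit quoted above, in the graph's units


def triage_slow_network(destination_outside_rutgers, graph_reading_m):
    """Decide what to do about a 'network is slow' report.

    destination_outside_rutgers: True if the user is trying to reach
        something outside the university.
    graph_reading_m: current traffic read from the speed.rutgers.edu graph.
    """
    if destination_outside_rutgers and graph_reading_m >= LINK_LIMIT_M:
        # The outside link is saturated; nothing we can fix locally.
        return "tell user: university link is at its limit"
    # Traffic inside Rutgers, or the graph looks ok: escalate.
    return "report to hardware staff"
```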

  18. see Mark's howto video

  19. SPECIAL CASE stuff. In here goes stuff that is 'one-off', or that is not part of the regular routine but is occasionally done.

    • If you are the first op in for the day, go to the console server in the main machine room, select the aramis console window, and attempt to log in with your regular password and enigma/safeword card. If this succeeds, then log out -- don't close the window itself -- and mention the success in your EOS, in a line labeled "Aramis Console."
      If it fails, please send mail to Doug (dmotto), telling him it failed. Failure is typically that you can't type anything in the window -- not even your username.

      Do not attempt to fix this, just let Doug know. Don't close the window, or anything.

    • planetlab machines - As part of a research project spanning the planet (hence the name), Rutgers has a couple of machines in the 'research machine room' (C231) that occasionally need starting or restarting. The machines are in the first rack as you come in. Open the door to the rack (right hand side), and you will see two machines labeled 'planetlab1' and 'planetlab2'. They each have a usb key either in the front or the back (never remove them unless specifically directed to -- the machines boot off of them). You start a machine by pressing its power button. You restart it by pressing the power button to turn it off, then pressing it again to turn it on. That's it - no console, no login, no nothing.

      Currently (November 2008) besides the staff, only Dr. Nguyen (tdnguyen) and Chris Peery (peery) will be telling you to do that. Most often this is done if there's a power failure, or something like that. The machines are actually administered remotely by the planetlab staff (wherever they are.)

    • Projects website. http://project.cs.rutgers.edu/students is the site for keeping track of your project tasks (as opposed to operational tasks). This website (software from the 'dotproject' project; this is a good place for documentation) should be used in lieu of email reports to Doug or others as to how your projects are going. Please remember to update this after every shift (in addition to your regular EOS report.) Do not leave your project session connected after your shift ends, otherwise the next op might accidentally log project info as you. (Also, remember that your eden/rci password is used for the projects website, rather than others.)

    • A way to recover osterman (and other printers) if nothing else works
    • restarting the monitors in the lobby.
    • speed tester, and monitoring the rutgers-to-the-outside-world bandwidth usage.
    • What to do when firefox locks you out. Chris Eskow wrote a very good reply to this problem.
    • Hill and Core lost and found(s).

  20. things to be done (by Charles) for new ops

References

    Don's FAQ has one specially for ops but you have to run a web browser from one of the staff machines to see it.

    The old ops page (Note - browser must be running on farside or other staff machine to see these pages.)

    I can't log in