The SLD Tape Management System Servers

The following document describes the basics of the SLD Tape Management System Servers and their associated watchdog, TEST_SLDTMS.

A fair amount of detail is given here, but the solution in most cases is killing the server from unix.

General information about the SLD Tape Management Servers

General information about the TEST_SLDTMS

Diagnosis and Orderly Shutdown

Most problems are found by the TEST_SLDTMS system, unless this system is itself broken. Regardless of whether problems are reported by TEST_SLDTMS or by users, the diagnosis and repair procedure is the same.

From any SLAC Central Vax SLD account, test both ports of the tape management system by issuing the following two commands:

  scc -p 7905 qtape qq1548
  scc -p 7906 qtape qq1548
Both commands should give the response:
  >       Tape:   QQ1548
  >      Owner:   PERL
  >    OwnerId:   PERL
  >  Type Tape:   PRODUCTION
  >   Location:   SILO
  > When Owned:   05-DEC-95
  >    Recycle:   UNKNOWN
  >  Last Used:   01-JAN-51
  > Times Used:   0
  > Write Lock:   NO
  >   Comments:   VMS PROD1 Minor Archive V_12_1 19951205.14:47:35
There may be some delay before the servers respond since they are NOT multi-threaded.
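
If a response has been saved as text, the check for a correct reply can be sketched as a small unix shell function. This is only an illustration: the function name qtape_response_ok is made up here, and it assumes that a healthy reply always contains a "Tape:" line naming the tape that was queried, as in the sample response above. The scc commands themselves are still issued from the VAX.

```shell
# Sketch: decide whether a saved qtape response looks healthy.
# Reads the response text on standard input; $1 is the tape name queried.
# Assumption: a healthy reply has a "Tape:" line ending in that tape name.
# The "Type Tape:" line does not match because it ends in the tape type.
qtape_response_ok() {
    tape=$1
    grep -i -q "Tape:[[:space:]]*${tape}[[:space:]]*\$"
}
```

For example, with the reply from port 7905 saved in a file (the file name is hypothetical):

  qtape_response_ok qq1548 < port7905.txt && echo "7905 OK"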

If both servers respond correctly, and the problem was reported by TEST_SLDTMS, the problem may have been that the server was just too slow to respond. TEST_SLDTMS waits ten minutes before giving up, so this is unlikely, but since the server is not multi-threaded, it is possible. Check whether the $USR disk is getting full. A full $USR disk can cause problems for TEST_SLDTMS.

If both servers respond correctly, and the problem was reported by a user, replace the user. Spares can be found at most good quality graduate schools.

If either of the servers fails to respond, it will be necessary to log on to unix and kill the server. The unix cron system will then start a fresh copy on the next five minute mark. See killing the server below.

If a server responds, but complains about Oracle problems, it may be possible to shut down the server from a VAX account. The unix cron system will then start a fresh copy on the next five minute mark. Even if this doesn't fix things, it will at least release a console log file which may then be useful.

The server recognizes a special command, SHUTDOWN, when it receives it from an authorized SLAC Central Vax account. The authorized account names are:

  PERL
  GRB
  RICHARD
  KAREN
  TONYJ
The command must be issued from an account with exactly one of these names; other accounts owned by the same users are not accepted. Accepting them would require the server to check account ownership, which in turn requires Oracle, which is probably not accessible or you wouldn't be trying to issue a shutdown in the first place.
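
The exact-name rule can be pictured with a short shell sketch. This is an illustration only, not the server's actual code: the function name is_authorized is made up, and it assumes account names arrive in upper case as listed above.

```shell
# Sketch of an exact-name check against the authorized list.
# Illustration only; the real check lives inside the server.
AUTHORIZED_ACCOUNTS="PERL GRB RICHARD KAREN TONYJ"

# Succeeds only when $1 is exactly one of the listed account names.
is_authorized() {
    for name in $AUTHORIZED_ACCOUNTS; do
        [ "$1" = "$name" ] && return 0
    done
    return 1
}
```

With this check, PERL is accepted, but a second account such as PERL2 owned by the same person is not, and no Oracle lookup is needed.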

If SHUTDOWN works, the unix cron system will then start a fresh copy on the next five minute mark. Even if this doesn't fix things, it will at least release a console log file which may then be useful.

TEST_SLDTMS uses this same SHUTDOWN command to shut down the servers once a day a little after 9 pm. This daily shutdown allows the unix cron system to close the server console logs. The logs are then mailed to the unix account sldtms.

The Big Fix: Killing the Servers from Unix

The fix for most problems is to kill the servers from unix. Even if this doesn't fix things, it will at least release a console log file which may then be useful.

Log on to AIXCRON under the account sldtms.

Issue the command:

  ps -ef | grep sldtms
If both servers are running, the output should be something like:
  sldtms@vesta01 $ ps -ef | grep sldtms
  sldtms 15028     1   0 16:43:19      -  0:00 perl5 /afs/slac.stanford.edu/package/scsutils/bin.common/TRSrun -c /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7906 
  sldtms 18618 17846   0   Jun 18  pts/0  0:04 -tcsh 
 richard 18754 16604   2 16:52:06  pts/1  0:00 grep sldtms 
  sldtms 19888     1   0 16:43:19      -  0:00 perl5 /afs/slac.stanford.edu/package/scsutils/bin.common/TRSrun -c /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7905 
  sldtms 20918 15028   0 16:43:19      -  0:01 /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7906 
  sldtms 21170 19888   0 16:43:19      -  0:00 /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7905 
If you do this exactly at the five minute mark, you may see extra processes: the new cron jobs starting up. These extra processes should terminate within a few tens of seconds, leaving a clear view of the processes that are really doing the work.

If you do not see any sldtms rxsql processes, check that the unix cron system is properly configured.
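
Picking the right process ID out of the listing by eye is easy to get wrong, so here is a small sketch that does it. The function name server_pids is made up, and the sketch assumes the listing looks like the example above: the owner is sldtms, the port is the last word on the line, and the TRSrun wrapper (the parent process) is not the one to pick.

```shell
# Sketch: print the PIDs of the rxsql server processes for one port.
# Reads "ps -ef" output on standard input; $1 is the port number.
# Assumptions: owner is sldtms, the port is the last word on the line,
# and lines for the TRSrun wrapper are skipped.
server_pids() {
    awk -v port="$1" '
        $1 == "sldtms" && $NF == port && /rxsql/ && !/TRSrun/ { print $2 }
    '
}
```

Typical use would be:

  ps -ef | server_pids 7905

With the listing shown above, this would print 21170.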

Kill the hung rxsql processes using the command kill -9. For example, to kill the server shown above running on port 7905:

  kill -9 21170
At the next five minute mark, cron should start a fresh server.
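
The trscrontab entries shown later in this document start the servers at minutes 3, 8, 13, and so on up to 58 past the hour, so the "five minute mark" is the next of those minutes rather than a multiple of five. A small sketch of the arithmetic follows. The function name next_start_minute is made up, and it assumes that a start scheduled for the current minute has already fired.

```shell
# Sketch: given the current minute (0-59), print the minute of the next
# scheduled server start on the 3,8,13,...,58 schedule.
# Assumes a start scheduled for the current minute has already fired.
next_start_minute() {
    m=${1#0}                    # drop one leading zero so "08" is not read as octal
    m=${m:-0}                   # a bare "0" becomes empty above; restore it
    r=$(( (m + 2) % 5 ))        # minutes elapsed since the last scheduled start
    echo $(( (m + 5 - r) % 60 ))
}
```

Typical use would be:

  next_start_minute "$(date +%M)"

For example, a kill issued at 16:52 should be followed by a fresh server at 16:53.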

If the new cron job doesn't fix things, check the console log file released by the kill command.

The Unix Cron System

The unix cron system is very robust and has never yet had a problem running sldtms.

The cron system is manipulated using the command trscrontab.

The cron file for trscrontab is ~sldtms/sldtms/sldtms_trs.cron. It is set for a 25 hour token lifetime. Note that the VMS TEST_SLDTMS job must be running to shut the server down once per day to get a fresh token. As things stand, the server cannot refresh its Oracle connection without a live token.

To see the current settings of the sldtms account's cron system, from the sldtms account, issue the command:

  trscrontab -l
The response should be:
  sldtms@vesta01 $ trscrontab -l
  farmhand;1500 3,8,13,18,23,28,33,38,43,48,53,58 * * * * /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7905
  farmhand;1500 3,8,13,18,23,28,33,38,43,48,53,58 * * * * /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7906

If the correct response is not given, try editing the crontab file via the command:
  trscrontab -e
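
Before editing, it may help to confirm which entry is actually missing. The sketch below reads a listing and prints the ports that have no entry. The function name missing_cron_ports is made up, and the sketch assumes each entry ends with the cmdserve path followed by the port number, as in the listing above.

```shell
# Sketch: report which server ports are missing from a crontab listing.
# Reads "trscrontab -l" output on standard input; prints each missing port.
# Assumption: every entry ends with ".../cmdserve <port>".
missing_cron_ports() {
    listing=$(cat)
    for port in 7905 7906; do
        printf '%s\n' "$listing" |
            grep -q "cmdserve ${port}[[:space:]]*\$" || echo "$port"
    done
}
```

Typical use would be:

  trscrontab -l | missing_cron_ports

No output means both entries are present.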

Studying the Tape Management Server Console Log Files

Each time the servers are shut down, crash, or are killed, the unix cron system mails their console logs to the unix account sldtms.

You can read the log files using any standard mail reader. Be sure to set the mail reader to "leave messages on server after retrieval" unless you really intend to archive the log files in your current area. Then set the mail server user name to sldtms and the incoming mail server name to popserve@slac.stanford.edu.

Every transaction begins in the log file with the account name from which it was received, the day and time and the received command. Every transaction ends with the line "FINISHED."

TEST_SLDTMS issues a shutdown command once a day a little after 9 pm, so if TEST_SLDTMS is working correctly, no log file should cover more than about 24 hours.

The most common cause of crashes is the "broken pipe" condition that occurs when the user kills their client process while the server is still trying to talk to them. These conditions can be seen as log files that end with the words "broken pipe." The system automatically recovers when the cron job comes around at the next five minute mark.
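
When there are many logs to look through, the way each one ended can be sorted out roughly with a sketch like the following. The function name log_ending is made up. It assumes the log has been saved as plain text, and it matches "FINISHED." and "broken pipe" without regard to case because the exact wording in the logs is not confirmed here.

```shell
# Sketch: classify how a console log ended, from its last non-blank line.
# Reads the log text on standard input and prints one of:
#   broken-pipe   last line mentions "broken pipe"
#   finished      last line is the end of a completed transaction
#   unfinished    anything else (the server stopped in mid-transaction)
log_ending() {
    last=$(awk 'NF { line = $0 } END { print line }')
    case $(printf '%s' "$last" | tr '[:upper:]' '[:lower:]') in
        *"broken pipe"*) echo "broken-pipe" ;;
        *"finished."*)   echo "finished" ;;
        *)               echo "unfinished" ;;
    esac
}
```

A result of "unfinished" is only a hint that the server was stopped in the middle of a transaction; the log itself still needs to be read.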

The most common cause of hangs is trouble in Unix or Oracle. While the system recovers on its own after some Oracle problems, other problems have been seen that leave the server hung, attached to a dead Oracle connection, after the rest of Oracle has recovered. The solution is the familiar one, killing the server.

There have been no other common causes of problems in the SLD Tape Management System Servers. For any other problems, check the log files and proceed from there.

Moving the Servers to a Different Host or Port

The current choice of host Farmhand and ports 7905 and 7906 is somewhat arbitrary. SLAC Computer Services has requested that cron jobs such as this be run on the machine Farmhand, but they can actually be run on any SLAC Central Unix host.

To run the servers on a different host or port, log on to the desired host under account sldtms and run the crontab setup described above, substituting in the desired port numbers.

Then modify the VMS client side code in SLD's DUCS distribution system as follows:

  modify PRODCART:CMDCLIENT.H to set SERVER_HOST and the port IDs appropriately
  toducs/section=prod/vms prodcart:cmdclient_main.c
  toducs/section=prod/vms prodcart:cmdclient_send.c
  toducs/section=prod/vms prodcart:ida_client.c
  toducs/section=prod/vms prodcart:cartshr.vec
  toducs/section=prod/vms prodcart:client.buildcom

Testing New Versions of the Server Code

The structure of the Tape Management System makes it easy to test new versions of the server code.

Using the account sldtms, log on to the Unix machine where the usual servers are running.

Run the new version of the server code using the port number 7907. For example, if the new version is named cmdserve_new, run:

  cmdserve_new 7907

If you then run VMS client commands directly from the SCC command (bypassing the normal VMS front-ends), you can specify the new port, as in:

  scc -p 7907 qtape rk2121
  or
  scc -p 7907 sldwho perl

Be sure to shut down your new version of the server the polite way, using the shutdown command, as in:

  scc -p 7907 shutdown
If you kill your new server any other way, you may leave dead processes in Oracle. This doesn't really hurt anything, and sometimes it can't be helped, but it irks the Oracle czars.
Joseph Perl
Last Modified 23 June 1999 RXD