The SLD Tape Management System Servers

The following document describes the basics of the SLD Tape Management System Servers and their associated watchdog, TEST_SLDTMS.

A fair amount of detail is given here, but the solution in most cases is killing the server from unix.

General information about the SLD Tape Management Servers

The SLD Tape Management Servers handle various SLD tape issues (write permissions, data catalogs, etc.) plus the SLDWHO function. Specifically, the servers handle the VMS commands:
- CANWRITE
- QTAPE
- READCAT (called by OPENTAPE)
- MYTAPES
- NEWCOM
- GIVE
- MAKEFREE
- TAPEUSE
- SLDWHO
A set of flowcharts is provided to diagram the tape management system.
The SLD Tape Management Servers run on SLAC Central Unix on the machine AIXCRON in the account sldtms. At the time of this writing (though not guaranteed forever), AIXCRON is synonymous with VESTA01.
An earlier document describes the VM implementation of those servers that was in place from 1994 to 1996.
The servers communicate with VMS commands via TCPIP Sockets.
All of the server code is contained in a single uni-REXX program, ~sldtms/sldtms/cmdserve
Two identical copies of the server are run, one listening to port 7905, the other listening to port 7906.
The only difference in their use is that the client code for READCAT and CANWRITE aims at port 7905, while the client code for all other commands aims at port 7906.
The server is NOT multi-threaded. Each copy of the server can process only one request at a time. The client side code contains wait-and-retry loops where needed. If speed becomes an issue, multi-threading may be the solution.
These two copies of cmdserve are started automatically by the unix cron system every five minutes on the five minute mark. If the server is already running on the specified port number, the new copy of cmdserve just terminates.
cmdserve communicates with unix Oracle through the RXS interface developed by Len Moss.
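
The start-up behaviour described above (a new copy of cmdserve terminates if a server is already running on its port) can be illustrated with a small sketch. This document does not say how cmdserve detects an existing copy. The Python below assumes one common approach, trying to bind the listening port and giving up if it is already taken. It is not the uni-REXX code, and the name claim_port is hypothetical.

```python
import socket


def claim_port(port, host="127.0.0.1"):
    """Try to bind a listening socket to the given port.

    Returns the listening socket, or None if the port is already in use
    (assumed here to mean that another copy of the server is running).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError:
        # Port is taken: the caller should simply terminate.
        sock.close()
        return None
    sock.listen(5)
    return sock
```

A cron-started copy would call claim_port(7905) or claim_port(7906) first and exit quietly on None, so starting a copy every five minutes is harmless while a server is up.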

General information about the TEST_SLDTMS

The TEST_SLDTMS process tests the SLD Tape Management Servers, paging maintainers if a problem occurs. It also issues the once-a-day shutdown command that causes the SLD Tape Management Servers to close their log files and mail them to the sldtms account.
TEST_SLDTMS runs on the SLAC Central Vax cluster as a batch job in the SLD_FAST queue.
The test code is a DCL command file, DISK$SLD_USR0:[PERL.TEST_SLDTMS]TEST_SLDTMS.COM
The first thing that TEST_SLDTMS.COM does is to resubmit itself to the batch queues to run again in twenty minutes.
The batch job tests the port 7905 tape management server by issuing a CANWRITE command.
The batch job tests the port 7906 tape management server by issuing a QTAPE command.
Before each test, TEST_SLDTMS spawns a child process that will get TEST_SLDTMS running again if it hangs for more than 10 minutes.
If either port fails to respond, or fails to give the correct response, the maintainer is paged using the telalert system maintained by Mike Wendling.
Once the system has paged a maintainer to report a problem, it continues to test the Tape Management Servers every 20 minutes, but it does not page the maintainers with further problems.
The system pages the maintainers again the next time it finds that everything is OK.
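
The paging rule described above (page once when a problem appears, stay quiet while it persists, page again on recovery) amounts to a two-state machine. The following Python sketch only illustrates that logic. The real test is a DCL command file, and the class and method names here are hypothetical.

```python
class PagingState:
    """Remembers whether a problem has already been paged."""

    def __init__(self):
        self.problem_reported = False

    def update(self, servers_ok):
        """Return the page to send for this test cycle, or None for no page."""
        if not servers_ok and not self.problem_reported:
            # First failure since the last all-clear: page once.
            self.problem_reported = True
            return "PROBLEM"
        if servers_ok and self.problem_reported:
            # First success after a reported problem: page the all-clear.
            self.problem_reported = False
            return "OK"
        # Either still healthy or still broken: stay quiet.
        return None
```

Fed one result per 20 minute cycle, the sequence ok, fail, fail, ok, ok produces pages only on the first failure and the first recovery.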

Diagnosis and Orderly Shutdown

Most problems are found by the TEST_SLDTMS system, unless this system is itself broken. Regardless of whether problems are reported by TEST_SLDTMS or by users, the diagnosis and repair procedure is the same.

From any SLAC Central Vax SLD account, test both ports of the tape management system by issuing the following two commands:

  scc -p 7905 qtape qq1548
  scc -p 7906 qtape qq1548

Both commands should give the response:

  >       Tape:   QQ1548
  >      Owner:   PERL
  >    OwnerId:   PERL
  >  Type Tape:   PRODUCTION
  >   Location:   SILO
  > When Owned:   05-DEC-95
  >    Recycle:   UNKNOWN
  >  Last Used:   01-JAN-51
  > Times Used:   0
  > Write Lock:   NO
  >   Comments:   VMS PROD1 Minor Archive V_12_1 19951205.14:47:35

There may be some delay before the servers respond since they are NOT multi-threaded.
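
Comparing the two replies by eye is easy, but the check can also be scripted. The Python sketch below assumes the reply has been captured as text in the "Label:   value" layout shown above, possibly with a leading ">" on each line. The function names are hypothetical, and how the text is captured from scc is left open.

```python
def parse_qtape(text):
    """Turn 'Label:   value' lines into a dict keyed by label."""
    fields = {}
    for raw in text.splitlines():
        # Drop surrounding blanks and an optional leading '>' marker.
        line = raw.strip().lstrip(">").strip()
        # Split on the first colon only, so times such as 14:47:35 survive.
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


def ports_agree(reply_7905, reply_7906):
    """True if both replies parse to the same non-empty set of fields."""
    first = parse_qtape(reply_7905)
    second = parse_qtape(reply_7906)
    return bool(first) and first == second
```

An empty or truncated reply from either port makes ports_agree return False, which corresponds to the "fails to respond" case discussed next.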

If both servers respond correctly, and the problem was reported by TEST_SLDTMS, the problem may have been that the server was just too slow to respond. TEST_SLDTMS waits ten minutes before giving up, so this is unlikely, but since the server is not multi-threaded, it is possible. Check whether the $USR disk is getting full. A full $USR disk can cause problems for TEST_SLDTMS.

If both servers respond correctly, and the problem was reported by a user, replace the user. Spares can be found at most good quality graduate schools.

If either of the servers fails to respond, it will be necessary to log on to unix and kill the server. The unix cron system will then start a fresh copy on the next five minute mark. See killing the server below.

If a server responds, but complains about Oracle problems, it may be possible to shut down the server from a VAX account. The unix cron system will then start a fresh copy on the next five minute mark. Even if this doesn't fix things, it will at least release a console log file which may then be useful.

The server recognizes a special command, SHUTDOWN, when it receives it from an authorized SLAC Central Vax account. The authorized account names are:

  PERL
  GRB
  RICHARD
  KAREN
  TONYJ

The command must be issued from an account with this exact name, rather than just any account owned by these users (allowing the command from any account owned by one of these users would require the server being able to check account ownership, which in turn requires Oracle, which is probably not accessible or you wouldn't be trying to issue a shutdown in the first place).
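
The exact-name rule means the authorization check needs nothing beyond a fixed list, so it keeps working when Oracle is down. A minimal Python sketch of such a check follows. It is not the uni-REXX code, the names are hypothetical, and it assumes the account name reaches the server in upper case.

```python
# Account names taken from the list above; the match is on the exact name.
AUTHORIZED_ACCOUNTS = frozenset({"PERL", "GRB", "RICHARD", "KAREN", "TONYJ"})


def may_shutdown(account):
    """True only if the requesting account name is on the fixed list."""
    return account in AUTHORIZED_ACCOUNTS
```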

If SHUTDOWN works, the unix cron system will then start a fresh copy on the next five minute mark. Even if this doesn't fix things, it will at least release a console log file which may then be useful.

TEST_SLDTMS uses this same SHUTDOWN command to shut down the servers once a day a little after 9 pm. This daily shutdown allows the unix cron system to close the server console logs. The logs are then mailed to the unix account sldtms.

The Big Fix: Killing the Servers from Unix

The fix for most problems is to kill the servers from unix. Even if this doesn't fix things, it will at least release a console log file which may then be useful.

Log on to AIXCRON under the account sldtms.

Issue the command:

  ps -ef | grep sldtms

If both servers are running, the output should be something like:

  sldtms@vesta01 $ ps -ef | grep sldtms
  sldtms 15028     1   0 16:43:19      -  0:00 perl5 /afs/slac.stanford.edu/package/scsutils/bin.common/TRSrun -c /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7906 
  sldtms 18618 17846   0   Jun 18  pts/0  0:04 -tcsh 
 richard 18754 16604   2 16:52:06  pts/1  0:00 grep sldtms 
  sldtms 19888     1   0 16:43:19      -  0:00 perl5 /afs/slac.stanford.edu/package/scsutils/bin.common/TRSrun -c /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7905 
  sldtms 20918 15028   0 16:43:19      -  0:01 /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7906 
  sldtms 21170 19888   0 16:43:19      -  0:00 /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7905

If you do this exactly at the five minute mark, you may see extra processes: the new cron jobs starting up. These extra processes should terminate within a few tens of seconds, leaving a clear view of the processes that are really doing the work.

If you do not see any sldtms rxsql processes, check that the unix cron system is properly configured.

Kill the hung rxsql processes using the command kill -9. For example, to kill the server shown above running on port 7905:

  kill -9 21170

At the next five minute mark, cron should start a fresh server.

If the new cron job doesn't fix things, check the console log file released by the kill command.
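
Picking the right process ID out of the ps listing is the step most prone to slips, because the TRSrun wrapper lines also mention rxsql and the port number. The Python sketch below selects only the rxsql processes for a given port from captured ps -ef text. It assumes the layout of the sample listing above (user, PID, ..., command ending in the port number), and the function name is hypothetical.

```python
def server_pids(ps_output, port):
    """Return the PIDs of rxsql server processes for the given port."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip prompts, headers and anything without a numeric PID column.
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        is_wrapper = any(f.endswith("TRSrun") for f in fields)
        runs_rxsql = any(f.endswith("/rxsql") for f in fields)
        # Keep rxsql lines for this port, ignoring the TRSrun wrapper lines.
        if runs_rxsql and not is_wrapper and fields[-1] == str(port):
            pids.append(int(fields[1]))
    return pids
```

The returned numbers are the ones to pass to kill -9. The sketch only reads text and does not kill anything itself.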

The Unix Cron System

The unix cron system is very robust and has never yet had a problem running sldtms.

The cron system is manipulated using the command crontab.

The cron file for trscrontab is ~sldtms/sldtms/sldtms_trs.cron. It is set for a 25 hour token lifetime. Note that the VMS TEST_SLDTMS job must be running to shut the server down once per day to get a fresh token. As things stand, the server cannot refresh its Oracle connection without a live token.

To see the current settings of the sldtms account's cron system, from the sldtms account, issue the command:

  trscrontab -l

The response should be:

  sldtms@vesta01 $ trscrontab -l
  farmhand;1500 3,8,13,18,23,28,33,38,43,48,53,58 * * * * /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7905
  farmhand;1500 3,8,13,18,23,28,33,38,43,48,53,58 * * * * /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7906

If the correct response is not given, try editing the crontab file via the command:

  trscrontab -e
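
Before editing, it helps to know which entry is actually missing. The Python sketch below scans captured trscrontab -l text for cmdserve entries and reports any expected port without one. It assumes each entry ends with the cmdserve path followed by the port number, as in the listing above, and the function names are hypothetical.

```python
def ports_scheduled(crontab_listing):
    """Return the set of port numbers that have a cmdserve entry."""
    ports = set()
    for line in crontab_listing.splitlines():
        fields = line.split()
        # An entry is recognised by '.../cmdserve <port>' at the end of the line.
        if (len(fields) >= 2 and fields[-2].endswith("/cmdserve")
                and fields[-1].isdigit()):
            ports.add(int(fields[-1]))
    return ports


def missing_ports(crontab_listing, expected=(7905, 7906)):
    """Return the expected ports that have no crontab entry, in order."""
    return sorted(set(expected) - ports_scheduled(crontab_listing))
```

An empty result means both servers are scheduled. A result such as [7906] points at the line that needs to be restored.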

Studying the Tape Management Server Console Log Files

Each time the servers are shut down, crash, or are killed, the unix cron system mails their console logs to the unix account sldtms.

You can read the log files using any standard mail reader. Be sure to set the mail reader to "leave messages on server after retrieval" unless you really intend to archive the log files in your current area. Then set the mail server user name to sldtms and the incoming mail server name to popserve@slac.stanford.edu.

Every transaction begins in the log file with the account name from which it was received, the day and time and the received command. Every transaction ends with the line "FINISHED."

TEST_SLDTMS issues a shutdown command once a day a little after 9 pm, so if TEST_SLDTMS is working correctly, no log file should cover more than about 24 hours.

The most common cause of crashes is the "broken pipe" condition that occurs when the user kills their client process while the server is still trying to talk to them. These conditions can be seen as log files that end with the words "broken pipe." The system automatically recovers when the cron job comes around at the next five minute mark.
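
When a log is long, a short script can find the interesting part. The Python sketch below splits log text into transactions on the closing FINISHED line and checks whether the log ends with a broken pipe message. The exact log layout is an assumption based only on the description above (the closing line is matched with or without its trailing period), and the function names are hypothetical.

```python
def split_transactions(log_text):
    """Split a log into (finished transactions, unfinished trailing lines)."""
    finished, current = [], []
    for line in log_text.splitlines():
        current.append(line)
        # A transaction is taken to end at a line reading FINISHED or FINISHED.
        if line.strip().rstrip(".") == "FINISHED":
            finished.append(current)
            current = []
    return finished, current


def ended_with_broken_pipe(log_text):
    """True if the last non-blank line of the log mentions 'broken pipe'."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    return bool(lines) and "broken pipe" in lines[-1].lower()
```

The unfinished trailing lines returned by split_transactions are the request the server was handling when it stopped, which is usually the place to start reading.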

The most common cause of hangs is trouble in Unix or Oracle. While the system recovers on its own after some Oracle problems, other problems have been seen that leave the server hung, attached to a dead Oracle connection, after the rest of Oracle has recovered. The solution is the familiar one, killing the server.

There have been no other common causes of problems in the SLD Tape Management System Servers. For any other problems, check the log files and proceed from there.

Logging directly onto UNIX SQLPLUS can be useful for checking the current state of Oracle.
Karen Heidenreich usually knows the current state of Oracle.
Ian MacGregor is the Unix Oracle expert.
Len Moss is the RXS REXX SQL interface expert.

Moving the Servers to a Different Host or Port

The current choice of host Farmhand and ports 7905 and 7906 is somewhat arbitrary. SLAC Computer Services has requested that cron jobs such as this be run on the machine Farmhand, but they can actually be run on any SLAC Central Unix host.

To run the servers on a different host or port, log on to the desired host under account SLDTMS and run the crontab setup described above, substituting in the desired port numbers.

Then modify the VMS client side code in SLD's DUCS distribution system as follows:

  modify PRODCART:CMDCLIENT.H to set SERVER_HOST and the port IDs appropriately
  toducs/section=prod/vms prodcart:cmdclient_main.c
  toducs/section=prod/vms prodcart:cmdclient_send.c
  toducs/section=prod/vms prodcart:ida_client.c
  toducs/section=prod/vms prodcart:cartshr.vec
  toducs/section=prod/vms prodcart:client.buildcom

Testing New Versions of the Server Code

The structure of the Tape Management System makes it easy to test new versions of the server code.

Using the account SLDTMS, log on to the Unix machine where the usual servers are running.

Run the new version of the server code using the port number 7907. For example, if the new version is named cmdserve_new, run:

  cmdserve_new 7907

If you then run VMS client commands directly from the SCC command (bypassing the normal VMS front-ends), you can specify the new port, as in:

  scc -p 7907 qtape rk2121
  or
  scc -p 7907 sldwho perl

Be sure to shut down your new version of the server the polite way, using the shutdown command, as in:

  scc -p 7907 shutdown

If you kill your new server any other way, you may leave dead processes in Oracle. This doesn't really hurt anything, and sometimes it can't be helped, but it irks the Oracle czars.

Joseph Perl
Last Modified 23 June 1999 RXD