A fair amount of detail is given here, but the solution in most cases is killing the server from unix.
From any SLAC Central Vax SLD account, test both ports of the tape management system by issuing the following two commands:
    scc -p 7905 qtape qq1548
    scc -p 7906 qtape qq1548
Both commands should give the response:
    > Tape: QQ1548
    > Owner: PERL
    > OwnerId: PERL
    > Type Tape: PRODUCTION
    > Location: SILO
    > When Owned: 05-DEC-95
    > Recycle: UNKNOWN
    > Last Used: 01-JAN-51
    > Times Used: 0
    > Write Lock: NO
    > Comments: VMS PROD1 Minor Archive V_12_1 19951205.14:47:35
There may be some delay before the servers respond since they are NOT multi-threaded.
If both servers respond correctly, and the problem was reported by TEST_SLDTMS, the problem may have been that the server was just too slow to respond. TEST_SLDTMS waits ten minutes before giving up, so this is unlikely, but since the server is not multi-threaded, it is possible. Check whether the $USR disk is getting full. A full $USR disk can cause problems for TEST_SLDTMS.
If both servers respond correctly, and the problem was reported by a user, replace the user. Spares can be found at most good quality graduate schools.
If either of the servers fails to respond, it will be necessary to log on to unix and kill the server. The unix cron system will then start a fresh copy on the next five minute mark. See killing the server below.
If a server responds, but complains about Oracle problems, it may be possible to shut down the server from a VAX account. The unix cron system will then start a fresh copy on the next five minute mark. Even if this doesn't fix things, it will at least release a console log file which may then be useful.
The server recognizes a special command, SHUTDOWN, when it receives it from an authorized SLAC Central Vax account. The authorized account names are:
    PERL GRB RICHARD KAREN TONYJ
The command must be issued from an account with this exact name, rather than just any account owned by these users (allowing the command from any account owned by one of these users would require the server being able to check account ownership, which in turn requires Oracle, which is probably not accessible or you wouldn't be trying to issue a shutdown in the first place).
If SHUTDOWN works, the unix cron system will then start a fresh copy on the next five minute mark. Even if this doesn't fix things, it will at least release a console log file which may then be useful.
TEST_SLDTMS uses this same SHUTDOWN command to shut down the servers once a day a little after 9 pm. This daily shutdown allows the unix cron system to close the server console logs. The logs are then mailed to the unix account sldtms.
Log on to AIXCRON under the account sldtms.
Issue the command:
    ps -ef | grep sldtms
If both servers are running, the output should be something like:
    sldtms@vesta01 $ ps -ef | grep sldtms
    sldtms 15028 1 0 16:43:19 - 0:00 perl5 /afs/slac.stanford.edu/package/scsutils/bin.common/TRSrun -c /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7906
    sldtms 18618 17846 0 Jun 18 pts/0 0:04 -tcsh
    richard 18754 16604 2 16:52:06 pts/1 0:00 grep sldtms
    sldtms 19888 1 0 16:43:19 - 0:00 perl5 /afs/slac.stanford.edu/package/scsutils/bin.common/TRSrun -c /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7905
    sldtms 20918 15028 0 16:43:19 - 0:01 /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7906
    sldtms 21170 19888 0 16:43:19 - 0:00 /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7905
If you do this exactly at the five minute mark, you may see extra processes, the new cron jobs coming on. These extra processes should terminate within a few tens of seconds, leaving a clear view of the processes that are really doing the work.
If you do not see any sldtms rxsql processes, check that the unix cron system is properly configured.
Kill the hung rxsql processes using the command kill -9. For example, to kill the rxsql server shown above running on port 7905 (process ID 21170 in that listing):
    kill -9 21170
At the next five minute mark, cron should start a fresh server.
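Picking the right process ID out of the ps listing can be scripted. The following is only a sketch, not part of the SLD tools: the function name rxsql_pids_for_port is made up for illustration, and it assumes ps -ef output laid out like the listing above, with the rxsql command line ending in "cmdserve <port>". It deliberately skips the perl5 TRSrun wrapper lines, which carry the same trailing arguments after a "-c" flag.

```shell
# Hypothetical helper (not part of the SLD distribution): print the PIDs of
# the rxsql cmdserve processes serving a given port.
# Reads `ps -ef` style output on stdin.
rxsql_pids_for_port() {
    port="$1"
    awk -v port="$port" '
        # Keep lines whose last three fields are: rxsql, cmdserve, <port>.
        # The TRSrun wrapper has "-c" just before rxsql, so it is excluded.
        $NF == port &&
        $(NF-1) ~ /\/cmdserve$/ &&
        $(NF-2) ~ /\/rxsql$/ &&
        $(NF-3) != "-c" { print $2 }
    '
}

# Usage, from the sldtms account on the server host:
#   ps -ef | rxsql_pids_for_port 7905          # inspect the PID first
#   kill -9 $(ps -ef | rxsql_pids_for_port 7905)
```

Looking at the PID before killing it is worth the extra step, since the match depends on the command line having exactly the form shown in the listing.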
If the new cron job doesn't fix things, check the console log file released by the kill command.
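The "five minute mark" is set by the cron schedule, which the trscrontab listing gives as minutes 3, 8, 13, ... 58 of each hour. As a rough sketch of how long to wait before expecting a fresh server, the function below (a made-up name, minutes_until_next_start) computes the number of minutes from a given minute of the hour to the next scheduled start. It assumes that schedule is still the one installed.

```shell
# Hypothetical helper: minutes until the next cron start, given the current
# minute of the hour (0-59). Assumes the schedule 3,8,13,...,58 from the
# trscrontab listing, i.e. every minute m with m % 5 == 3.
minutes_until_next_start() {
    m=${1#0}      # strip one leading zero so "08" is not read as octal
    m=${m:-0}     # "0" or "00" stripped down to nothing means minute 0
    echo $(( (3 - m % 5 + 5) % 5 ))
}

# Usage:
#   minutes_until_next_start "$(date +%M)"
```

A result of 0 means the current minute is itself a scheduled start, so the extra processes described above may still be visible in ps.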
The cron system is manipulated using the command crontab.
The cron file for trscrontab is ~sldtms/sldtms/sldtms_trs.cron. It is set for a 25 hour token lifetime. Note that the VMS TEST_SLDTMS job must be running to shut the server down once per day to get a fresh token. As things stand, the server cannot refresh its Oracle connection without a live token.
To see the current settings of the sldtms account's cron system, from the sldtms account, issue the command:
    trscrontab -l
The response should be:
    sldtms@vesta01 $ trscrontab -l
    farmhand;1500 3,8,13,18,23,28,33,38,43,48,53,58 * * * * /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7905
    farmhand;1500 3,8,13,18,23,28,33,38,43,48,53,58 * * * * /usr/local/bin/rxsql /u/ey/sldtms/sldtms/cmdserve 7906
If the correct response is not given, try editing the crontab file via the command:
You can read the log files using any standard mail reader. Be sure to set the mail reader to "leave messages on server after retrieval" unless you really intend to archive the log files in your current area. Then set the mail server user name to sldtms and the incoming mail server name to firstname.lastname@example.org.
Every transaction begins in the log file with the account name from which it was received, the day and time and the received command. Every transaction ends with the line "FINISHED."
TEST_SLDTMS issues a shutdown command once a day a little after 9 pm, so if TEST_SLDTMS is working correctly, no log file should cover more than about 24 hours.
The most common cause of crashes is the "broken pipe" condition that occurs when the user kills their client process while the server is still trying to talk to them. These conditions can be seen as log files that end with the words "broken pipe." The system automatically recovers when the cron job comes around at the next five minute mark.
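The log endings described above can be told apart mechanically once a log has been saved from the mail reader to a plain text file. This is a sketch under assumptions: the function name classify_log_ending is invented here, and it relies only on the two markers the text mentions, a final "FINISHED." line for a completed transaction and the words "broken pipe" at the end of a crashed log. Anything else is reported as incomplete, which may point to a hang or a kill.

```shell
# Hypothetical helper: classify how a server console log (read on stdin) ends.
# Assumes the markers described in the text: "FINISHED." closes a transaction,
# and a crash from a killed client leaves "broken pipe" at the end of the log.
classify_log_ending() {
    # Take the last non-blank line of the log.
    last=$(grep -v '^[[:space:]]*$' | tail -n 1)
    case "$last" in
        *FINISHED.*)             echo "clean" ;;
        *[Bb]roken\ [Pp]ipe*)    echo "broken-pipe" ;;
        *)                       echo "incomplete" ;;
    esac
}

# Usage (saved_log.txt is a placeholder file name):
#   classify_log_ending < saved_log.txt
```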
The most common cause of hangs is trouble in Unix or Oracle. While the system recovers on its own after some Oracle problems, other problems have been seen that leave the server hung, attached to a dead Oracle connection, after the rest of Oracle has recovered. The solution is the familiar one, killing the server.
There have been no other common causes of problems in the SLD Tape Management System Servers. For any other problems, check the log files and proceed from there.
The current choice of host Farmhand and ports 7905 and 7906 is somewhat arbitrary. SLAC Computer Services has requested that cron jobs such as this be run on the machine Farmhand, but they can actually be run on any SLAC Central Unix host.
To run the servers on a different host or port, log on to the desired host under the account SLDTMS and run the crontab setup described above, substituting the desired port numbers.
Then modify the VMS client side code in SLD's DUCS distribution system as follows:
    modify PRODCART:CMDCLIENT.H to set SERVER_HOST and the port IDs appropriately
    toducs/section=prod/vms prodcart:cmdclient_main.c
    toducs/section=prod/vms prodcart:cmdclient_send.c
    toducs/section=prod/vms prodcart:ida_client.c
    toducs/section=prod/vms prodcart:cartshr.vec
    toducs/section=prod/vms prodcart:client.buildcom
The structure of the Tape Management System makes it easy to test new versions of the server code.
Using the account SLDTMS, log on to the Unix machine where the usual servers are running.
Run the new version of the server code using the port number 7907. For example, if the new version is named cmdserve_new, run:
If you then run VMS client commands directly from the SCC command (bypassing the normal VMS front-ends), you can specify the new port, as in:
    scc -p 7907 qtape rk2121
or
    scc -p 7907 sldwho perl
Be sure to shut down your new version of the server the polite way, using the shutdown command, as in:
    scc -p 7907 shutdown
If you kill your new server any other way, you may leave dead processes in Oracle. This doesn't really hurt anything, and sometimes it can't be helped, but it irks the Oracle czars.