Operations Reference Manual for EPEM/RTH

I. Introduction

The intended audience of this document is not the EPEM/RTH user; rather, it is the poor sod who might have to debug a problem in the absence of the author. This will not be a guide to all possible problems. The great majority of the problems that occur with EPEM or RTH are due to changes in other systems on which they depend, e.g., the Offline Recon code, IDA, etc. Problems also often occur due to changes in their environment, e.g., SLDU44, the VAX that is their current home. Hopefully this guide will help one ascertain what the problem might be, but solving it will usually take an experienced programmer who can read the code. At best this document will only make it a little easier to navigate around the code.

Of course, the worst problem with documentation like this is that it immediately begins the process of becoming more and more incorrect. So be prepared for anything in here to be incorrect with higher probability as the document ages.

II. Architecture

EPEM:

On the SLDU44 VAX there is an event multiplexor which contains a sample of the events contained in the primary event multiplexor on the SLDACQ VAX. EPEM, as well as several other systems running on SLDU44, samples events from this local event multiplexor. EPEM runs the SLD reconstruction program (in IDA) on events from the multiplexor which pass the standard offline KZ0F Z filter. The reconstructed event is saved in $SCR:[SGI.EVENTS]LATEST_Z.JAZZDATA and replaces the previous event saved with the same name. This event file is the one that the SGI 3D display looks at when it is asked to display the latest reconstructed Z event. In addition, every N events the event is also saved as Rxxxxx_Eyyyyy.JAZZDATA in the same directory. N is an adjustable parameter, and xxxxx and yyyyy are the run number and event number of the event. Finally, there is a file called EVENTS.SUMMARY in the same directory which EPEM updates on every event with its run number, its event number, and the number of tracks and clusters found.
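
The file bookkeeping described above can be sketched as follows. This is an illustration in Python, not EPEM's IDA code. The function names, the five-digit zero padding (inferred only from the xxxxx and yyyyy placeholders), and the use of a simple running count of reconstructed Z events to decide "every N events" are all assumptions.

```python
def archive_name(run: int, event: int) -> str:
    """Build an Rxxxxx_Eyyyyy.JAZZDATA style name.

    The zero padding to five digits is an assumption, not taken from EPEM.
    """
    return f"R{run:05d}_E{event:05d}.JAZZDATA"


def files_to_write(z_count: int, run: int, event: int, n: int) -> list[str]:
    """Return the event files written for the z_count-th reconstructed Z.

    LATEST_Z.JAZZDATA is rewritten for every reconstructed Z event;
    every n-th event is additionally archived under its own name.
    """
    names = ["LATEST_Z.JAZZDATA"]
    if n > 0 and z_count % n == 0:
        names.append(archive_name(run, event))
    return names
```

With n set to 5, the 10th reconstructed Z would produce both the latest-event file and an archived copy, while the 11th would only replace the latest-event file.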

EPEM also produces histograms for display by RTH. EPEM collects on the order of 100 histograms of various quantities, both from reconstructed event data and from data taken from all events, including the majority non-Z events that it looks at from the multiplexor. These histograms are saved in $SCR:[RTH.EVENTS]EPEMD_HISTOS.HCOM after every reconstructed Z and after every 50th non-Z event. With each .HCOM file EPEM also writes out a file called NSAVE.DATA which contains information used by EPEM on a restart. When an EPEM process is started it reads in the latest .HCOM file and the latest NSAVE file so as to pick up where it left off when its last incarnation ended. The NSAVE file contains information used to create a normalization factor for each histogram when it is written out.
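
The save cadence just described (after every reconstructed Z, and after every 50th non-Z event) can be sketched as below. This is a hypothetical Python illustration, not the IDA code EPEM runs. The class and method names are invented, and whether the non-Z counter is reset when a Z triggers a save is not stated in this document; here it is assumed not to be.

```python
NON_Z_SAVE_INTERVAL = 50  # from the text: save after every 50th non-Z event


class SaveScheduler:
    """Decides when the histogram file should be rewritten."""

    def __init__(self) -> None:
        self.non_z_seen = 0  # running count of non-Z events looked at

    def should_save(self, is_reconstructed_z: bool) -> bool:
        if is_reconstructed_z:
            return True  # every reconstructed Z triggers a save
        self.non_z_seen += 1
        # Assumption: the non-Z count keeps running across Z events.
        return self.non_z_seen % NON_Z_SAVE_INTERVAL == 0
```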

RTH:

RTH is a standalone process of which multiple copies may run on any account (with the right environment) on any VAX on the SLD cluster. RTH displays the histograms EPEM saves; the details of its capabilities are found in its online help system. RTH can compare the current data from EPEM with other standard data sets such as Monte Carlo and a large sample of data from earlier in the 1993 run. These standard files are found in $USR1:[RTH.RTH]. RTH is built on the MIDAS system developed by Tony Johnson, and this document does not go into the details of MIDAS.

III. Operation

Starting and Stopping the systems:

The basic control of starting and stopping the multiplexor, EPEM and RTH is handled by buttons in a menu item on SLDU17 called Run_Time_OFFline. Although the relevant processes actually run on SLDU44, they are controlled from SLDU17, which resides in the control room. One button called STOP_ALL44 shuts down EPEM, RTH and the three display processes that get events from the event multiplexor on SLDU44. This button also stops the multiplexor, thus completely eliminating all elements of the system. The button START_POOL44 starts up the necessary processes on SLDU44 and SLDACQ to establish the multiplexor on SLDU44. The button START_EPEM44 starts EPEM and START_RTH44 starts RTH. The starting of the three display processes that draw events from the multiplexor is controlled by buttons on the SGI next to SLDU17. The three display processes are the 3D display on the SGI, the 2D Z display above the BOSS-BOSSSCP and the 2D Z display at MCC.

The details of how these buttons accomplish their tasks can be traced by bringing up the Run_Time_OFFline menu and following the succession of commands and command files used. One must be familiar with the workings of various multinet commands and with the ins and outs of workstation windows, in addition to things that are probably more familiar to the average experienced SLD user of the offline system on a VAX. In the case of RTH one must also possess considerable arcane knowledge of MIDAS, which at present writing is known only to the present author and Tony Johnson.

Normally on the SLDU17 where RTH and EPEM are displayed (remember their processes actually run on SLDU44) there is one other display, called Z'Beep. This display is created simply by taking a window on SLDU17 with a DCL prompt, $U17> , setting host to SLDACQ, logging on to the SGI account and giving the window the name Z'Beep. This window will display a message from EPEM every time a Z is reconstructed.

Relinking EPEM:

Occasionally it is necessary to relink EPEM due to changes in the SLD Offline software. This can be done by logging on to the RTH account on any SLD clustered VAX and typing

 
      @EPEM44 LINK

Before doing this, look in the $USR1:[RTH.EXE] directory and see if there is more than one copy of EPEM44.EXE and EPEM44.MAP. If so, delete the older versions to make room for a new version. It is a wise practice to always keep the most recent version that is already known to work. Thus if one decides the link one just did is no good, get rid of that version of .EXE and .MAP before trying again rather than getting rid of the version last known to work. This advice is particularly true of the .MAP file, as having a .MAP file that was known to work is sometimes quite useful in figuring out what may be the cause of an EPEM problem that has been introduced by changes to offline code. Due to the unique environment in which EPEM runs, it is OFTEN the first place problems with new code become manifest.

Crash dumps:

EPEM, RTH and the display processes run on SLDU44 but display on other workstations. As a result of this architecture, if a crash occurs the remote displays are terminated and there is no corpse left around to figure out what went wrong. Part of the setup of these processes therefore includes provisions for a crash log to be created. These files can be found in the $SCR:[RTH] and $SCR:[SGI] directories with names like process_CRASH.LOG where "process" is something like EPEM44. The crash log is created by the debugger, and a close inspection of the .COM file that creates a process will disclose a set of commands that control various aspects of the debugger environment and enable the creation of these dumps.

EPEM code issues:

The configuration of EPEM is controlled by a variety of .IDA files found in DUCSRTOFF. This is code that EPEM runs in IDA from the main IDA file, which is found in $USR1:[RTH]EPEM44.COM. Any experienced SLD offline user should be able to hack through this code if necessary. There are a couple of special methods EPEM uses to send messages to the STATUS DISPLAY system using SOFSGNL/OFFL. The message to Z'Beep is done with a .COM file.

A point of some subtlety worth a few words is the histogram normalization code. RTH can overlay histograms from multiple sources, for example, current data and Monte Carlo. If the current data histogram has data from, say, 1000 calls and the Monte Carlo histogram has data from 100,000 calls, the histograms need to be normalized so the overlays make for easy comparison. This is accomplished by having all histograms in RTH normalized so that the area under the histogram is 1 x (number of bins in the histogram).
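
A minimal sketch of this normalization, in Python rather than the IDA code EPEM actually uses. It assumes that "area" means the plain sum of the bin contents (i.e. unit bin width), which this document does not spell out; the function name and the sample numbers are made up for illustration.

```python
def normalize(contents: list[float]) -> list[float]:
    """Scale bin contents so that their sum equals the number of bins."""
    total = sum(contents)
    if total == 0:
        return list(contents)  # empty histogram: nothing to scale
    scale = len(contents) / total
    return [c * scale for c in contents]


# Two histograms with the same shape but very different statistics.
data = normalize([10.0, 30.0, 40.0, 20.0])                  # 100 entries
monte_carlo = normalize([1000.0, 3000.0, 4000.0, 2000.0])   # 10,000 entries
# After normalization both are roughly [0.4, 1.2, 1.6, 0.8] and overlay directly.
```

With this convention a flat histogram has every bin equal to 1, regardless of how many calls filled it.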

The NSAVE file was mentioned above under EPEM Architecture as containing some of this normalization information. What this file contains is the number of calls made to each histogram. The histogram file that EPEM writes out is normalized for RTH. The information in NSAVE is not used by RTH. It is only used by EPEM on a restart or when an RTHCLR is done to begin a new round of histogram collecting. In principle, the information in NSAVE is redundant and could be extracted from the .HCOM file, but the NSAVE implementation was more straightforward and allows easier debugging of problems.
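
One way to picture why a per-histogram call count is enough to resume collecting after a restart is sketched below in Python. This is a guess at the arithmetic, not a description of the EPEM code. It assumes that the saved histogram was normalized so that its bin contents sum to the number of bins, that every call added unit weight to exactly one in-range bin (no under/overflow, no weighted fills), and the function name is hypothetical.

```python
def denormalize(normalized: list[float], n_calls: int) -> list[float]:
    """Recover approximate raw bin contents from a normalized histogram.

    Assumes sum(normalized) == len(normalized) and that n_calls is the
    total number of unit-weight fills, as might be read back from NSAVE.
    """
    if not normalized:
        return []
    scale = n_calls / len(normalized)
    return [c * scale for c in normalized]


# A normalized 4-bin histogram that was filled by 100 calls.
raw = denormalize([0.4, 1.2, 1.6, 0.8], 100)
# raw is roughly [10, 30, 40, 20]; new events can be added to these
# counts and the result renormalized at the next save.
```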

Configuration of RTH:

The main startup file for RTH is DUCSRTH:RTH.DAT. As mentioned above, messing with MIDAS is arcane. The file which controls the definitions of the various windows of RTH is found in DUCSSLDMIDAS:HANDYPAK.UIL. Messing with the .UIL file is arcane**2.

Debugging EPEM problems:

If EPEM appears to be running but failing somewhere in its 1000s of lines of IDA code, there is a debugging switch that can be turned on to do state-of-the-art IDA debugging. Namely, setting the IDA variable DIAGNOSTIC to 1 or 2 in EPEM will cause statements to be printed indicating where it is in the IDA code that runs in EPEM. The output at setting 2 is a superset of what you get with setting 1.

Gary Bower July 2, 1993