Operations Reference Manual for EPEM/RTH
I. Introduction
This document is not intended for the EPEM/RTH user. Rather, it is for
the poor sod who might have to debug a problem in the absence of
the author. It is not a guide to all possible problems. The great
majority of the problems that occur with EPEM or RTH are due to changes
in other systems on which they depend, e.g., the Offline Recon code,
IDA, etc. Problems also often occur due to changes in EPEM's
environment, e.g., SLDU44, the VAX that is its current home. This guide
should help one better ascertain what the problem might be, but fixing
it will usually take an experienced programmer who can read the code.
At best this document will only make it a little easier to navigate
around the code.
Of course, the worst problem with documentation like this is that
it immediately begins the process of becoming more and more incorrect.
So be prepared for anything in here to be incorrect with higher
probability as the document ages.
II. Architecture
EPEM:
On the SLDU44 VAX there is an event multiplexor which contains a
sample of the events contained in the primary event multiplexor on the
SLDACQ VAX. EPEM as well as several other systems running on SLDU44
sample events from this local event multiplexor. EPEM runs the SLD
reconstruction program (in IDA) on events from the multiplexor which
pass the standard offline KZ0F Z filter. The reconstructed event is
saved in $SCR:[SGI.EVENTS]LATEST_Z.JAZZDATA and replaces the previous
event saved with the same name. This event file is the one that the
SGI 3D display looks at when it is asked to display the latest reconstructed
Z event. In addition, every N events the event is also saved as
Rxxxxx_Eyyyyy.JAZZDATA in the same directory. N is an adjustable parameter
and xxxxx and yyyyy are run number and event number of the event.
Finally, there is a file called EVENTS.SUMMARY in the same directory
which EPEM updates every event with its run number and event number and
the number of tracks and clusters found.
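The file-saving rules above can be sketched as follows. This is only an
illustration in Python, not the IDA code EPEM actually runs. The
function names, the zero-padded five-digit fields (guessed from the
xxxxx/yyyyy pattern), the use of a running count of reconstructed events
to decide "every N events", and the layout of the summary line are all
assumptions.

```python
def event_file_names(run, event, events_seen, n):
    """Return the file names written for one reconstructed Z event.

    events_seen is a hypothetical running count of reconstructed events;
    n is the adjustable "every N events" parameter.
    """
    # LATEST_Z.JAZZDATA is always written and replaces the previous copy.
    names = ["LATEST_Z.JAZZDATA"]
    # Every Nth event is also kept under a run/event-specific name.
    # Zero-padding to five digits is an assumption.
    if events_seen % n == 0:
        names.append(f"R{run:05d}_E{event:05d}.JAZZDATA")
    return names


def summary_line(run, event, n_tracks, n_clusters):
    """Build one EVENTS.SUMMARY entry (the field layout is hypothetical)."""
    return f"{run} {event} {n_tracks} {n_clusters}"
```

For example, with N = 5, the 10th reconstructed event of run 123,
event 45 would yield both LATEST_Z.JAZZDATA and a run/event-named
file, while the 11th would yield only LATEST_Z.JAZZDATA.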
EPEM also produces histograms for display by RTH. EPEM collects
on the order of 100 histograms of various quantities, both from
reconstructed event data and from data taken from all events, including
the majority non-Z events that it looks at from the multiplexor. These
histograms are saved in $SCR:[RTH.EVENTS]EPEMD_HISTOS.HCOM after every
reconstructed Z and after every 50th non-Z event. With each .HCOM file
EPEM also writes out a file called NSAVE.DATA which contains information
used by EPEM on a restart. When an EPEM process is started it reads in
the latest .HCOM file and the latest NSAVE file to pick up where it left
off when its last incarnation ended. The NSAVE file contains information
used to create a normalization factor for each histogram when it is
written out.
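The save cadence described above (after every reconstructed Z and after
every 50th non-Z event) can be sketched as below. This is a Python
illustration only. The class and method names are hypothetical, and it
assumes the non-Z counter keeps running across Z events rather than
being reset by them, which the text does not specify.

```python
NON_Z_SAVE_INTERVAL = 50  # "every 50th non-Z event", from the text


class SaveCadence:
    """Decide when the histogram (.HCOM) and NSAVE files get written."""

    def __init__(self):
        self.non_z_count = 0

    def should_save(self, is_reconstructed_z):
        # Every reconstructed Z triggers a save.
        if is_reconstructed_z:
            return True
        # Otherwise save only on every 50th non-Z event.
        # Assumption: this counter is not reset when a Z is saved.
        self.non_z_count += 1
        return self.non_z_count % NON_Z_SAVE_INTERVAL == 0
```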
RTH:
RTH is a standalone process of which multiple copies may run
on any account (with the right environment) on any VAX on the SLD
cluster. RTH displays the histograms EPEM saves and the details of
its capabilities are found in its online help system. RTH can
compare the current data from EPEM with other standard data sets such as
Monte Carlo and a large sample of data from earlier in the 1993 run.
These standard files are found in $USR1:[RTH.RTH]. RTH is built on
the MIDAS system developed by Tony Johnson and this document does not go
into the details of MIDAS.
III. Operation
Starting and Stopping the systems:
The starting and stopping of the multiplexor, EPEM and RTH is
handled at the basic level by buttons in a menu item on SLDU17 called
Run_Time_OFFline. Although the relevant processes actually run on
SLDU44, they are controlled from SLDU17, which resides in the control
room. One button called STOP_ALL44 shuts down EPEM, RTH and three
display processes that get events from the event multiplexor on
SLDU44. This button also stops the multiplexor, thus completely
shutting down all elements of the system. The button START_POOL44
starts up the necessary processes on SLDU44 and SLDACQ to establish
the multiplexor on SLDU44. The button START_EPEM44 starts EPEM and
START_RTH44 starts RTH. The starting of the three display processes
that draw events from the multiplexor is controlled by buttons on the
SGI next to SLDU17. The three display processes are the 3D display on
the SGI, the 2D Z display above the BOSS-BOSSSCP and the 2D Z display
at MCC.
The details by which these buttons accomplish their tasks can be
traced by bringing up the Run_Time_OFFline menu and following the
succession of commands and command files used. One must be familiar
with the workings of various multinet commands as well as the ins and
outs of workstation windows, in addition to things that are probably
more familiar to the average experienced SLD user of the offline
system on a VAX. As mentioned earlier, in the case of RTH one must
also possess considerable arcane knowledge of MIDAS, which at present
writing is known only to the present author and Tony Johnson.
Normally on SLDU17, where RTH and EPEM are displayed (remember
their processes actually run on SLDU44), there is one other display, called
Z'Beep. This display is created simply by taking a window on SLDU17 with
a DCL prompt, $U17> , setting host to SLDACQ, logging on to the
SGI account and giving the window the name Z'Beep. This window will display
a message from EPEM every time a Z is reconstructed.
Relinking EPEM:
Occasionally it is necessary to relink EPEM due to changes in the
SLD Offline software. This can be done by logging on to the RTH account
on any SLD clustered VAX and typing
@EPEM44 LINK
Before doing this look in the $USR1:[RTH.EXE] directory and see if there
is more than one copy of EPEM44.EXE and EPEM44.MAP. If so, delete the
older versions to make room for a new version. It is a wise practice
to always keep the most recent version that is already known to work.
Thus if one decides the link one just did is no good, get rid of that
version of .EXE and .MAP before trying again rather than getting rid
of the version known to work last. This advice is particularly true of
the .MAP file as having a .MAP file that was known to work is sometimes
quite useful in figuring out what may be the cause of an EPEM problem
that has been introduced by changes to offline code. Due to the unique
environment in which EPEM runs it is OFTEN the first place problems
with new code become manifest.
Crash dumps:
EPEM, RTH and the display processes run on SLDU44 but display on
other workstations. As a result of this architecture if a crash occurs
the remote displays are terminated and there is no corpse left around
to figure out what went wrong. Part of the setup of these processes
includes provisions for a crash log to be created. These files can
be found in the $SCR:[RTH] and $SCR:[SGI] directories with names
like process_CRASH.LOG where "process" is something like EPEM44.
The crash log is created by the debugger and a close inspection of the
.COM file that creates a process will disclose a set of commands
that control various aspects of the debugger environment which enable
the creation of these dumps.
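As a small aid when hunting for a crash log, the naming convention above
can be written out as a Python sketch. The helper name is hypothetical,
and the only inputs are the two directories and the process_CRASH.LOG
pattern given in the text. Whether a given process writes to the RTH or
the SGI directory is not stated here, so both candidates are returned.

```python
# Directories named in the text as holding crash logs.
CRASH_DIRS = ["$SCR:[RTH]", "$SCR:[SGI]"]


def crash_log_candidates(process):
    """Return the VMS file specs where a crash log for `process` may be."""
    return [f"{directory}{process}_CRASH.LOG" for directory in CRASH_DIRS]
```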
EPEM code issues:
The configuration of EPEM is controlled by a variety of .IDA
files found in DUCSRTOFF. This is code that EPEM runs in IDA
from the main IDA file which is found in $USR1:[RTH]EPEM44.COM.
Any experienced SLD offline user should be able to hack through
this code if necessary. There are a couple of special methods
EPEM uses to send messages to the STATUS DISPLAY system using
SOFSGNL/OFFL. The message to Z'Beep is done with a .COM file.
A point of some subtlety worth a few words is the histogram
normalization code. RTH can overlay histograms from multiple
sources, for example, current data and Monte Carlo. If the current
data histogram has data from, say, 1000 calls and the Monte Carlo
histogram has data from 100,000 calls the histograms need to
be normalized so the overlays make for easy comparison. This is
accomplished by having all histograms in RTH normalized so that
the area under the histogram equals the number of bins in the
histogram (1 x # bins), i.e., the average bin content is 1.
The NSAVE file was mentioned above under EPEM Architecture
as containing some of this normalization information. What this file
contains is the number of calls made to each histogram. The histogram
file that EPEM writes out is normalized for RTH. The information in
NSAVE is not used by RTH. It is only used by EPEM on a restart or
when an RTHCLR is done to begin a new round of histogram collecting.
In principle, the information in NSAVE is redundant and could be
extracted from the .HCOM file but the NSAVE implementation was more
straightforward and allows easier debugging of problems.
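The normalization rule can be illustrated with a short Python sketch.
This is not the EPEM code. It assumes that "area" means the plain sum of
the bin contents (unit bin width), and the function name is hypothetical.

```python
def normalize(contents):
    """Scale bin contents so that they sum to the number of bins."""
    total = sum(contents)
    if total == 0:
        # An empty histogram cannot be normalized; return it unchanged.
        return list(contents)
    scale = len(contents) / total
    return [c * scale for c in contents]


# Two histograms with the same shape but very different statistics,
# e.g. 1000 calls of current data versus 100,000 Monte Carlo calls.
data = normalize([100, 400, 500])
monte_carlo = normalize([10000, 40000, 50000])
```

After normalization both lists are approximately [0.3, 1.2, 1.5], so the
two histograms overlay directly regardless of how many calls went into
each.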
Configuration of RTH:
The main startup file for RTH is DUCSRTH:RTH.DAT. As mentioned
above, messing with MIDAS is arcane. The file which controls the
definitions of the various windows of RTH is found in
DUCSSLDMIDAS:HANDYPAK.UIL. Messing with the .UIL file is arcane**2.
Debugging EPEM problems:
If EPEM appears to be running but failing somewhere in its 1000s of
lines of IDA code, there is a debugging switch that can be turned on to
do state-of-the-art IDA debugging. Namely, setting the IDA variable
DIAGNOSTIC to 1 or 2 in EPEM will cause statements to be printed indicating
where it is in the IDA code that runs in EPEM. Setting 2 prints a superset
of what setting 1 prints.
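The relationship between the two DIAGNOSTIC settings can be pictured
with the Python sketch below. It only illustrates the idea of
level-gated trace statements. The real switch is an IDA variable, and
the function names and messages here are hypothetical.

```python
def make_tracer(diagnostic, sink):
    """Return a trace function gated by the DIAGNOSTIC setting.

    diagnostic: 0 (off), 1 or 2, mirroring the IDA variable.
    sink: a list that collects the printed statements.
    """
    def trace(level, message):
        # A statement tagged with level 1 prints at settings 1 and 2;
        # a statement tagged with level 2 prints only at setting 2.
        if diagnostic >= level:
            sink.append(message)
    return trace


def run_with(diagnostic):
    """Run the same hypothetical trace points under a given setting."""
    printed = []
    trace = make_tracer(diagnostic, printed)
    trace(1, "entering reconstruction")      # hypothetical message
    trace(2, "entering histogram filling")   # hypothetical message
    return printed
```

With this gating, everything printed at setting 1 is also printed at
setting 2, which is the superset behavior described above.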
Gary Bower July 2, 1993