Offline Shifts

This HELP file gives details on running offline shifts to monitor SLD production processing and data quality.
The goal of the offline shifts is to monitor the data quality promptly as the production processing wends its way from the online tape to full reconstruction. Data problems are to be reported ASAP to the processing overseer, currently Richard Dubois!
What follows may seem to be a checklist of numbers to compare to standards, with a simple binary decision on data quality. Rather, you should use the detailed questions to get yourself oriented and warmed up, then use your physics skills to think about the run output and look for problems. If data quality were just a matter of binary decisions based on some numbers, we would have the computer do it and we wouldn't need you!
All shift work is done using the web interface.
See the OFFSITE README file for the latest temporary kludges to the monitoring. It will be updated regularly as problems come and go.
Broadly stated, it is the monitor's responsibility to get feedback to either the processing overseer (Erez Etzion) or one of the production processing crew (Richard Dubois, Joe Perl, Gary Bower, and Karen Heidenreich) as appropriate.
You are responsible for any stages of run processing that finish during your shift period. The easiest way to determine this is to go to the special offline shifts page for Links to ACQCOPY, Filter and Recon Output. (A pointer to this page is also found on the SLD Offline Processing page.) In the date range on the "Links" page enter your shift days and it will list all processing that has finished during that range. Thus, if your shift runs from the 3rd thru the 5th you should have a final look on the 6th since the range is midnight to midnight California time. If some of the runs are 'catch-up' processing on older missed runs, concentrate on the runs from the day/previous day that are most time critical to report on.
You will receive information from the three processing stages.
Here is a fictitious sample report:
SLD Offline Processing Report
Prepared by: MASUDA
10 Feb 1993 10:54:03
--------Monitor Evaluation---------
Run #   Polar   ZSLD  #LUM  -RTH--  -Ev DSP-  -CDC HV-  -Overall-
-----------------------------------------------------------------
12345   0.2100     8    17    OK       OK        OK        OK
13635   0.2283     2     2    OK       OK        BAD       BAD
-----------------------------------------------------------------
                  10    19
-----------------------------------------------------------------
Comments:
12345 All OK
13635 30% hadrons with HAD trigger
Quantities that need to be verified are:
The last three items, plus an explicit overall evaluation, are asked for in the report. Other problems should be reported in the comments section.
Offline shifts are done in 3-day segments. The schedule can be viewed here.
Changes to the schedule, after it is drawn up, are the responsibility of the shiftee; i.e. it's up to you to arrange shift swaps, and then notify Erez Etzion (EREZ@SLAC) of the changes. The SLDPM server uses the file to know whom to notify about run information, so it is important that it be kept up to date.
The schedule can be viewed with WWW as the GUI.
There are three stages of processing by the SLDPM server.
Once the end-run button is pressed in the online, the acquisition (ACQ) data tape is released for further processing in the SLACVX cluster.
The SLDPM server on VM is notified that the tape is ready for processing. It copies the tape to disk, followed by running a PASS1 filter, followed by a full reconstruction and PASS2 filter.
Once a stage is completed, e-mail is sent to the monitor notifying him/her that this information is ready for viewing. It is expected that SLD will take a run about every four hours, so that is the kind of time interval to expect for bursts of SLDPM activity.
This stage performs two functions. It copies the input tape to disk for efficient access in subsequent steps. The second function is to monitor the trigger records for inconsistencies. The run monitor is notified if any are found.
The sorts of trigger errors that are checked for are listed below:
CONDITION: bits set
*** note: if MA_BEL exists, we use MA_BEL, else use BELHDR ***
0..........no BELHDR
1..........BELHDR.CONTRACT /= CONTRIBU
2..........BELHDR.ERR /= 0
4..........BELHDR.CONTRACT=VXDRAW & no VXDRAW
5..........BELHDR.CONTRACT=DCWSMHIT & NO DCWSMHIT
6..........BELHDR.CONTRACT=CRIDWSM & NO CRIDWSM
7..........BELHDR.CONTRACT=KTAG & NO KTAG
8..........BELHDR.CONTRACT=WCHS & no WCHS
9..........BELHDR.SLOTS = 0
10..........BELHDR.EVALUATR = 0
11..........SLOTS inconsistent with EVALUATOR
12..........BELHDR.EVALUATR /= VXDRAW.EVALUATR
13..........BELHDR.SLOTS /= VXDRAW.SLOTS
14..........BELHDR.EVALUATR /= DCWSMHIT.EVALUATR
15..........BELHDR.SLOTS /= DCWSMHIT.SLOTS
16..........BELHDR.EVALUATR /= CRIDWSM.EVALUATR
17..........BELHDR.SLOTS /= CRIDWSM.SLOTS
18..........BELHDR.EVALUATR /= KTAG.EVALUATR
19..........BELHDR.SLOTS /= KTAG.SLOTS
20..........VXDRAW.LENGREQ /= VXDRAW.LENGREAD
21..........DCWSMHIT.LENGTH /= DCWSMHIT.GATHERED
22..........CRIDWSM.LENGTH /= CRIDWSM.GATHERED
23..........KTAG.LENGTH /= KTAG.GATHERED
24..........BELHDR.TAG /= VXDRAW.CROSSING
25..........BELHDR.TAG /= DCWSMHIT.TAG
26..........BELHDR.TAG /= CRIDWSM.FBHEAD.TAG
27..........BELHDR.TAG /= KTAG.HEADER.TAG
28..........event repeat-same event twice in a row
29..........event repeat-separated by >=1 events
30..........bcnums not monotonically increasing
LENGTH: true event size (excluding WIC and BRT)
GATHERED: event size written out (excluding WIC and BRT)
Definition of terms:
o BELHDR............central bank containing trigger info
o VXDRAW, DCWSMHIT, CRIDWSM, KTAG, WCHS.................
..................raw data banks from each subsystem
o EVALUATR..........trigger evaluator: whether a given trigger
condition was set
o SLOTS.............whether that trigger actually fired
o TAG...............beam crossing number
o CONTRACT..........which subsystems the trigger thinks are
contributing to the event
The subsystem and trigger AEBs each have their own ideas as to what the trigger information and beam crossing number are, so matching them all up ensures the integrity of the assembled event.
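Since the hex condition values in the sample SDATCHK output later in this file match this bit numbering (e.g. 00000600 has bits 9 and 10 set), the condition word appears to be a 32-bit mask with one bit per check above. Decoding it is then straightforward; this is an illustrative sketch, not part of the production code (names and structure are made up):

```python
# Partial map from bit number to meaning, taken from the table above;
# the remaining bits follow the same pattern.
BIT_MEANING = {
    0: "no BELHDR",
    1: "BELHDR.CONTRACT /= CONTRIBU",
    2: "BELHDR.ERR /= 0",
    9: "BELHDR.SLOTS = 0",
    10: "BELHDR.EVALUATR = 0",
    30: "bcnums not monotonically increasing",
}

def decode_condition(cond):
    """Return the list of bit numbers set in a condition word."""
    return [b for b in range(31) if cond & (1 << b)]

# e.g. the condition 00000600 seen in the sample output decodes to
# bits 9 and 10: SLOTS = 0 and EVALUATR = 0, i.e. an invalid trigger.
bits = decode_condition(0x00000600)
```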
The following list shows the various conditions that generate BEL data integrity errors.
----------------error accounting--------------------------------------
ertype name test(condition) description
------ ------------- ------ -------------------
1. truncated event bit 2 nothing but BELHDR
2. runt event bit 1 contract /=contrib
3. inval event seq bit 30 bcnums not mono inc
4. inconsistent ev bit 24-27 bank tags/=hdr tag
5. truncated cntrib bits 20,21,22,23 gathered/=size
6. duplicate event bit 28 or 29 repeated tag, last 10
7. split event bit 28 or 29 AND 1 original veg-o-matic
8. invalid trigger bit 9 or 10 slot or eval = 0
9. inconsistent trg not in yet eval consistent w/slot
10. lying bastard bit 4-8 not 1 missing bank, in cont
11. no belhdr bit 0 no belhdr
12.-18. open for expansion
19. other cond/=0 all other cond/=0
20. total tests 1-19. tot evt with cond/=0
----------------------------------------------------------------------
Errors 3, 4, 6, 7, 8 and 10 should be reported as serious to the Run Coordinator. Errors 1, 2 and 5 are common and are said to be benign.
The only duty here is to determine whether trigger errors have been flagged, and notify the run coordinator if any have.
For each run there is a file, called ASrun# SDATCHK. Look at this file to see if any trigger errors have been flagged. These are the ones to notify the run coordinator of.
Also examine ASrun#.STATS to see if any subsystem data is missing.
Here is an example with lots of errors flagged.
------------------------------------------------------
Summary for run = 20130 number of events = 6022
------------------------------------------------------
time of first event to tape = 2-MAR-1993 14:40:15.32
------------------------------------------------------
run took 12366.1 seconds
------------------------------------------------------
size: total length = 468207440 bytes
total gathered = 467691292 bytes
------------------------------------------------------
average event size = 77749.4 bytes
event rate = 0.49 Hz
data rate = 37862.3 bytes/sec
------------------------------------------------------
error error type number
----- ----------------- ---------
1 truncated event.... 0
2 runt event........ 0
3 invalid evt seq.... 94
4 inconsistent evt... 0
5 truncated contrib.. 9
6 duplicate event.... 3
7 split event........ 0
8 invalid trigger.... 1206
9 inconsistent trg... N/A
10 lying bastard...... 0
11 no belhdr.......... 0
12 unused............. 0
13 unused............. 0
14 unused............. 0
15 unused............. 0
16 unused............. 0
17 unused............. 0
18 unused............. 0
19 other.............. 0
20 total.............. 1312
------------------------------------------------------
------------------------------------------------------
list of events (first 15)-------------------------
------------------------------------------------------
event = 8 error = 8 condition = 00000600
event = 13 error = 8 condition = 00000600
event = 14 error = 3 condition = 40000000
event = 18 error = 8 condition = 00000600
event = 19 error = 3 condition = 40000000
event = 23 error = 8 condition = 00000600
event = 26 error = 8 condition = 00000600
event = 27 error = 3 condition = 40000000
event = 38 error = 8 condition = 00000600
event = 39 error = 3 condition = 40000000
event = 47 error = 8 condition = 00000600
event = 76 error = 8 condition = 00000600
event = 98 error = 8 condition = 00000600
event = 104 error = 8 condition = 00000600
event = 118 error = 8 condition = 00000600
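The rate quantities in the summary above are simple ratios of the three raw totals (events, duration, and total size). A quick cross-check, using the run 20130 numbers; variable names are illustrative:

```python
# Raw totals from the sample SDATCHK summary for run 20130.
n_events = 6022
duration_s = 12366.1
total_bytes = 468207440

avg_event_size = total_bytes / n_events   # ~77749 bytes
event_rate = n_events / duration_s        # ~0.49 Hz
data_rate = total_bytes / duration_s      # ~37862 bytes/sec
```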
*** AIS33544.STATS ***
Number of events with Subsystem data
Pol VXD DC CRID KAL WIC
998 2613 2613 3261 3501 2614
RUN_NUMBER 33544
EVTS_SEEN 241
HADRONS 0
Filter Pass Rate
Energy **** WAB Muon Tau
197 0 0 0 0
Hadrons_&_HAD_trigger= 187
EIT_&_HAD_TRIGGER= 185
Date 817 JY_Z_day2 0 JY_CDC_day2 0
All_Bhabha= 617 Precise_Bhabha 352 Num_evts= 4203
A sample file is annotated here:
*** FLT33544.STATS ***
RUN_NUMBER 33544
EVTS_SEEN 4203
Evts with error bits in MA_BEL 52
SLCZ 0
SLC.SLDZ 0
POLAR 0.7649
Frac bad pol measures 0.2178
CDCon fraction for track/had candidates 0.9615
trk -0.9138
4.198
Physics/track candidates 3472
BHABHA candidates 617
Precise BHABHA candidates 352
Trigger Correlation matrix (norm to I/P events)
Rndm Energy LUM tr_cdc tau Had WAB Muon
0.162
0.000 0.080
0.000 0.000 0.237
0.000 0.032 0.000 0.232
0.000 0.028 0.026 0.005 0.100
0.000 0.050 0.001 0.034 0.016 0.085
0.000 0.030 0.000 0.014 0.010 0.022 0.030
0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.002
Filter Correlation matrix (norm to physics/track cands)
laser Rndm LUM trk tau Muon E_JY E_KZ0F
0.000
0.000 0.165
0.000 0.000 0.178
0.000 0.000 0.000 0.019
0.000 0.000 0.000 0.010 0.079
0.000 0.000 0.000 0.001 0.001 0.001
0.000 0.000 0.000 0.000 0.058 0.000 0.070
0.000 0.000 0.000 0.000 0.022 0.000 0.031 0.045
Frac of phy/track cands where KVM/MACH Pol bits disagree 0.00000
Most of these histograms are for diagnostics in the filter job itself. Of interest for monitoring are the polarization and CDC HV-on plots. We would like to see a nice Gaussian for POL and a spike at 10 layers for the CDC.
A sample file is annotated here:
*** REC33544.STATS ***
RUN_NUMBER 33544
EVTS_SEEN 364
HADRONS 187
L/R asy for phys/trk, HAD candidates
-0.132 +/- 0.052 -0.228 +/- 0.071
Pol error rate for phys/trk, HAD candidates
0.0000 0.0000
Ave Z at IP -6.7008E-02
Filter Pass Rate
Energy TaKal WAB Muon Tau
197 188 22 2 210
Filter Correlation matrix (norm to I/P events)
Energy TaKal WAB Muon Tau
0.541
0.516 0.516
0.000 0.000 0.060
0.000 0.000 0.000 0.005
0.511 0.508 0.025 0.005 0.577
Hadrons_&_HAD_trigger= 187
EIT_&_HAD_trigger= 185
Date 817 JY_Z_day2 0 JY_CDC_day2 0
Most of the RECON histograms are the same ones used for the RTH online monitoring system, and undergo the same statistical comparison to the 'expected' distributions. Histograms that fail this comparison are flagged for the monitor's further investigation. (Not implemented yet).
Both raw and reconstructed data from the KAL and DC systems are displayed for a sample of hadron candidates. The sample is selected from events that pass either hadron filter; it aims to randomly select about 5 events; if there are fewer, all are displayed.
Each event gets three views: the first is a 3-view with KAL cluster hits and CDC tracks shown; the 2nd is a blowup of the CDC with tracks and vectored hits; the 3rd shows a blowup of the VXD and CCD hits on tracks.
These are just there for visual verification that the detector is OK; no event classification is called for!
Look for lots of extra vectored hits in the tracking or missing layers. Make sure CCD hits are being associated to CDC tracks.
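The "about 5 events, or all if there are fewer" selection described above can be sketched in a few lines (this is illustrative, not the production selection code; `candidates` stands for the events passing either hadron filter):

```python
import random

def pick_display_events(candidates, target=5):
    """Randomly pick up to `target` events for the event display;
    if fewer candidates exist, display them all."""
    if len(candidates) <= target:
        return list(candidates)
    return random.sample(candidates, target)
```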
This stage runs ZXFIND on all the PASS2 output events for a given day. It looks at the primary vertex position, gamma conversions and K-shorts, and track multiplicities and VXD linking efficiency. Default parameters are used in ZXFIND.
Output diagnostics from this stage include statistics & histograms.
A file of statistics is produced for each DAY. The format is ZYyymmdd.STATS (eg for June 9, 1993 it would be ZY930609.STATS). A sample file is annotated here:
*** ZY960423.STATS ***
Summary for ZX Run for 401 Zs
--------------------------------------
... IP position ...
<x> = 0.0000 +/- 0.0000 sig(x) = 0.0300 +/- 0.0000
<y> = 0.0000 +/- 0.0000 sig(y) = 0.0300 +/- 0.0000
<z> = 0.0000 +/- 0.0000 sig(z) = 0.1000 +/- 0.0000
... Gamma Conversions ...
Seen/Z = 0.1322
G mass = 0.0100 +/- 0.0006 G width = 0.0044 +/- 0.0005
R(pipe/CDC) = 0.6061
G = 1.792
... Vees ...
Candidates Seen/Z = 0.1172
V fraction from fit = 8.5248E-02
V mass = 0.5013 +/- 0.0010 V width = 0.0053 +/- 0.0015
V = 2.246
... Trk multiplicity ...
Total
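The per-day file name follows the ZYyymmdd.STATS pattern described above; building it from a date is a one-liner (the function name here is illustrative):

```python
from datetime import date

def zx_stats_name(d):
    """Build the per-day ZX statistics file name, e.g. ZY930609.STATS."""
    return d.strftime("ZY%y%m%d.STATS")

# zx_stats_name(date(1993, 6, 9)) -> "ZY930609.STATS"
```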
There are three pages of the STATUS display devoted to the offline processing (pages 12, 14 & 15). These give some diagnostics and statistics from each stage of processing. Also given are the VM batch queues for each stage, the short term run monitor schedule and short term Z/day statistics. These pages are available from the web.