Offline Shifts

Overview

This HELP file gives details on running offline shifts to monitor SLD production processing and data quality.

This includes:

Description of the Offline Shift Process

The goal of the offline shifts is to monitor data quality promptly as the production processing wends its way from the online tape to full reconstruction. Data problems are to be reported as soon as possible to the processing overseer, currently Richard Dubois!

What follows may look like a checklist of numbers to compare against standards in order to make a simple binary decision on data quality. Instead, use the detailed questions to get yourself oriented and warmed up, then apply your physics skills to think about the run output and look for problems. If data quality were just a matter of binary decisions based on a few numbers, we would have the computer do it and we wouldn't need you!

All shift work is done using the web interface.

Readme file

See the OFFSITE README file for the latest temporary kludges to the monitoring. It will be updated regularly as problems come and go.

Shift Responsibilities

Broadly stated, it is the monitor's responsibility to get feedback to either the processing overseer (Erez Etzion) or one of the production processing crew (Richard Dubois, Joe Perl, Gary Bower, and Karen Heidenreich) as appropriate.

You are responsible for any stages of run processing that finish during your shift period. The easiest way to determine this is to go to the special offline shifts page for Links to ACQCOPY, Filter and Recon Output. (A pointer to this page is also found on the SLD Offline Processing page.) In the date range on the "Links" page enter your shift days and it will list all processing that has finished during that range. Thus, if your shift runs from the 3rd thru the 5th you should have a final look on the 6th since the range is midnight to midnight California time. If some of the runs are 'catch-up' processing on older missed runs, concentrate on the runs from the day/previous day that are most time critical to report on.
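
As a rough illustration of the midnight-to-midnight rule above, the sketch below computes the window covered by a shift and the day on which the final look is due. The function name and the example dates are hypothetical; times are taken as local California time with no time-zone handling.

```python
from datetime import date, datetime, time, timedelta


def shift_window(first_day, last_day):
    """Return (start, end, final_check_day) for a shift covering whole days.

    The range runs midnight to midnight, so the window closes at 00:00
    on the day after the last shift day; that day is when the final
    look should be taken.
    """
    if last_day < first_day:
        raise ValueError("last_day must not precede first_day")
    start = datetime.combine(first_day, time.min)
    end = datetime.combine(last_day + timedelta(days=1), time.min)
    return start, end, end.date()


# Hypothetical shift running from the 3rd through the 5th.
start, end, final_check = shift_window(date(1993, 2, 3), date(1993, 2, 5))
print(final_check)  # 1993-02-06
```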

Recipe

You will receive information from the three processing stages.

Report

Sample Report

Here is a fictitious sample report:

 
                     SLD Offline Processing Report
                         Prepared by:  MASUDA
                          10 Feb 1993 10:54:03
 
                                    --------Monitor Evaluation---------
 
    Run #   Polar  ZSLD  #LUM    -RTH-- -Ev DSP- -CDC HV- -Overall-
-------------------------------------------------------------------
                  
    12345 0.2100      8    17      OK     OK       OK        OK    
    13635 0.2283      2     2      OK     OK      BAD        BAD   
-------------------------------------------------------------------
                     10    19
-------------------------------------------------------------------
Comments:
 
 12345  All OK
 
 13635  30% hadrons with HAD trigger
 

Details

Quantities that need to be verified are:

The last three items, plus an explicit overall evaluation, are asked for in the report. Other problems should be reported in the comments section.

Shift Schedule

Offline shifts are done in 3-day segments. The schedule can be viewed here.

Changes to the schedule after it is drawn up are the responsibility of the shiftee; i.e., it is up to you to arrange shift swaps and then notify Erez Etzion (EREZ@SLAC) of the changes. The SLDPM server uses the schedule file to know whom to notify about run information, so it is important that it be kept up to date.

Accessing Information

The processing information can be viewed using the WWW as the GUI.

Processing Stages

There are three stages of processing by the SLDPM server.

Once the end-run button is pressed in the online, the acquisition (ACQ) data tape is released for further processing in the SLACVX cluster.

The SLDPM server on VM is notified that the tape is ready for processing. It copies the tape to disk, followed by running a PASS1 filter, followed by a full reconstruction and PASS2 filter.

Once a stage is completed, e-mail is sent to the monitor notifying him/her that this information is ready for viewing. It is expected that SLD will take a run about every four hours, so that is the kind of time interval to expect for bursts of SLDPM activity.

ACQCOPY

This stage performs two functions. First, it copies the input tape to disk for efficient access in subsequent steps. Second, it monitors the trigger records for inconsistencies; the run monitor is notified if any are found.

Trigger Errors

The sorts of trigger errors that are checked for are listed below:

 
  CONDITION:  bits set
       ***note: if MA_BEL exists, we use MA_BEL, else use BELHDR****
                      0..........no BELHDR
                      1..........BELHDR.CONTRACT /= CONTRIBU
                      2..........BELHDR.ERR /= 0
                      4..........BELHDR.CONTRACT=VXDRAW & no VXDRAW
                      5..........BELHDR.CONTRACT=DCWSMHIT & NO DCWSMHIT
                      6..........BELHDR.CONTRACT=CRIDWSM & NO CRIDWSM
                      7..........BELHDR.CONTRACT=KTAG & NO KTAG
                      8..........BELHDR.CONTRACT=WCHS & no WCHS
                      9..........BELHDR.SLOTS = 0
                     10..........BELHDR.EVALUATR = 0
                     11..........SLOTS inconsistent with EVALUATOR
                     12..........BELHDR.EVALUATR /= VXDRAW.EVALUATR
                     13..........BELHDR.SLOTS    /= VXDRAW.SLOTS
                     14..........BELHDR.EVALUATR /= DCWSMHIT.EVALUATR
                     15..........BELHDR.SLOTS    /= DCWSMHIT.SLOTS
                     16..........BELHDR.EVALUATR /= CRIDWSM.EVALUATR
                     17..........BELHDR.SLOTS    /= CRIDWSM.SLOTS
                     18..........BELHDR.EVALUATR /= KTAG.EVALUATR
                     19..........BELHDR.SLOTS    /= KTAG.SLOTS
                     20..........VXDRAW.LENGREQ /= VXDRAW.LENGREAD
                     21..........DCWSMHIT.LENGTH /= DCWSMHIT.GATHERED
                     22..........CRIDWSM.LENGTH /= CRIDWSM.GATHERED
                     23..........KTAG.LENGTH /= KTAG.GATHERED
                     24..........BELHDR.TAG      /= VXDRAW.CROSSING
                     25..........BELHDR.TAG      /= DCWSMHIT.TAG
                     26..........BELHDR.TAG      /= CRDWSM.FBHEAD.TAG
                     27..........BELHDR.TAG      /= KTAG.HEADER.TAG
                     28..........event repeat-same event twice in a row
                     29..........event repeat-separated by >=1 events
                     30..........bcnums not monotonically increasing
          LENGTH:  true event size (excluding WIC and BRT)
          GATHERED: event size written out (excluding WIC and BRT)
 
 
  Definition of terms:
 
    o BELHDR............central bank containing trigger info
    o VXDRAW, DCWSMHIT, CRIDWSM, KTAG, WCHS.................
      ..................raw data banks from each subsystem
    o EVALUATR..........trigger evaluator: whether a given trigger
                             condition was set
    o SLOTS.............whether that trigger actually fired
    o TAG...............beam crossing number
    o CONTRACT..........which subsystems the trigger thinks are
                             contributing to the event

The subsystem and trigger AEBs each have their own ideas as to what the trigger information and beam crossing number are, so matching them all up ensures the integrity of the assembled event.

The following list shows the various conditions that generate BEL data integrity errors.

 
 ----------------error accounting--------------------------------------
 
    ertype    name            test(condition)    description
    ------  -------------     ------             -------------------
       1.  truncated event     bit 2              nothing but BELHDR
       2.  runt event          bit 1              contract /=contrib
       3.  inval event seq     bit 30             bcnums not mono inc
       4.  inconsistent ev     bit 24-27          bank tags/=hdr tag
       5.  truncated cntrib    bits 20,21,22,23   gathered/=size
       6.  duplicate event     bit 28 or 29       repeated tag, last 10
       7.  split event         bit 28or29 AND 1   original veg-o-matic
       8.  invalid trigger     bit 9 or 10        slot or eval = 0
       9.  inconsisten trg     not in yet         eval consistnt w/slot
      10.  lying bastard       bit 4-8 not 1      missing bank, in cont
      11.  no belhdr           bit 0              no belhdr
      12.-18.  open for expansion
      19.  other               cond/=0            all other cond/=0
      20.  total               tests  1-19.       tot evt with cond/=0
 
 ----------------------------------------------------------------------

Errors 3, 4, 6, 7, 8 and 10 should be reported as serious to the Run Coordinator. Errors 1, 2 and 5 are common and are said to be benign.
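
To help read a condition word by hand, here is a minimal sketch that expands it into its set bits and applies the bit tests from the error-accounting table. It is not the production code. It assumes the condition is printed in hexadecimal, it reports every test that matches rather than picking one, and it reads "bit 4-8 not 1" as "any of bits 4 to 8 set while bit 1 is clear". The function names are made up for illustration.

```python
SERIOUS = {3, 4, 6, 7, 8, 10}   # report to the Run Coordinator
BENIGN = {1, 2, 5}              # common, said to be benign


def set_bits(condition):
    """Return the set of bit positions (0..31) that are set."""
    return {b for b in range(32) if (condition >> b) & 1}


def error_types(condition):
    """Map a condition word to error types, following the table above.

    Error 9 is left out because the table lists it as "not in yet".
    """
    bits = set_bits(condition)
    found = set()
    if 2 in bits:
        found.add(1)                      # truncated event
    if 1 in bits:
        found.add(2)                      # runt event
    if 30 in bits:
        found.add(3)                      # invalid event sequence
    if bits & {24, 25, 26, 27}:
        found.add(4)                      # inconsistent event
    if bits & {20, 21, 22, 23}:
        found.add(5)                      # truncated contribution
    if bits & {28, 29}:
        found.add(6)                      # duplicate event
        if 1 in bits:
            found.add(7)                  # split event
    if bits & {9, 10}:
        found.add(8)                      # invalid trigger
    if bits & {4, 5, 6, 7, 8} and 1 not in bits:
        found.add(10)                     # missing bank (assumed reading)
    if 0 in bits:
        found.add(11)                     # no BELHDR
    if condition and not found:
        found.add(19)                     # other
    return found


print(sorted(error_types(int("00000600", 16))))  # [8]
print(sorted(error_types(int("40000000", 16))))  # [3]
```

A result that intersects `SERIOUS` marks an event worth reporting.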

Duties

The only duty here is to determine whether trigger errors have been flagged, and notify the run coordinator if they have.

For each run there is a file called ASrun# SDATCHK. Look at this file to see whether any trigger errors have been flagged; these are the ones to report to the run coordinator.

Also examine ASrun#.STATS to see if any subsystem data is missing.

Example SDATCHK file

Here is an example with lots of errors flagged.

 
  ------------------------------------------------------
   Summary for run =  20130  number of events =    6022
  ------------------------------------------------------
   time of first event to tape =  2-MAR-1993 14:40:15.32
  ------------------------------------------------------
   run took         12366.1 seconds
  ------------------------------------------------------
     size:    total lenth =  468207440 bytes
           total gathered =  467691292 bytes
  ------------------------------------------------------
     average event size  =      77749.4 bytes
               event rate =         0.49 Hz
               data rate =      37862.3 bytes/sec
  ------------------------------------------------------
   error               error type         number
     -----           -----------------    ---------
     1            truncated event....      0
     2            runt  event........      0
     3            invalid evt seq....     94
     4            inconsistent evt...      0
     5            truncated contrib..      9
     6            duplicate event....      3
     7            split event........      0
     8            invalid trigger....   1206
     9            inconsistent trg...    N/A
    10            lying bastard......      0
    11            no belhdr..........      0
    12            unused.............      0
    13            unused.............      0
    14            unused.............      0
    15            unused.............      0
    16            unused.............      0
    17            unused.............      0
    18            unused.............      0
    19            other..............      0
    20            total..............   1312
  ------------------------------------------------------
 
 
  ------------------------------------------------------
   Summary for run =  20130  number of events =    6022
  ------------------------------------------------------
   time of first event to tape =  2-MAR-1993 14:40:15.32
  ------------------------------------------------------
   run took         12366.1 seconds
  ------------------------------------------------------
     size:    total lenth =  468207440 bytes
           total gathered =  467691292 bytes
  ------------------------------------------------------
     average event size  =      77749.4 bytes
               event rate =         0.49 Hz
               data rate =      37862.3 bytes/sec
  ------------------------------------------------------
  ------------------------------------------------------
   list of events (first 15)-------------------------
  ------------------------------------------------------
   event =      8       error =   8        condition = 00000600
   event =     13       error =   8        condition = 00000600
   event =     14       error =   3        condition = 40000000
   event =     18       error =   8        condition = 00000600
   event =     19       error =   3        condition = 40000000
   event =     23       error =   8        condition = 00000600
   event =     26       error =   8        condition = 00000600
   event =     27       error =   3        condition = 40000000
   event =     38       error =   8        condition = 00000600
   event =     39       error =   3        condition = 40000000
   event =     47       error =   8        condition = 00000600
   event =     76       error =   8        condition = 00000600
   event =     98       error =   8        condition = 00000600
   event =    104       error =   8        condition = 00000600
   event =    118       error =   8        condition = 00000600
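
The error summary in a file like the one above can also be scanned mechanically. The sketch below pulls the per-error counts out of the summary table and keeps only the error types listed as serious. It assumes the row layout shown in the example (number, dotted name, count); the regular expression and function name are illustrative, not part of the SLDPM tools.

```python
import re

SERIOUS = {3, 4, 6, 7, 8, 10}

# One summary row: error number, name padded with dots, then count or N/A.
ROW = re.compile(r"^\s*(\d+)\s+(\S.*?)\.{2,}\s+(\d+|N/A)\s*$")


def serious_errors(text):
    """Return {error number: (name, count)} for serious errors with count > 0."""
    flagged = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match or match.group(3) == "N/A":
            continue
        number = int(match.group(1))
        count = int(match.group(3))
        if number in SERIOUS and count > 0:
            flagged[number] = (match.group(2), count)
    return flagged


# A few rows copied from the example summary above.
sample = """\
     3            invalid evt seq....     94
     5            truncated contrib..      9
     6            duplicate event....      3
     8            invalid trigger....   1206
     9            inconsistent trg...    N/A
    20            total..............   1312
"""

for number, (name, count) in sorted(serious_errors(sample).items()):
    print(number, name, count)
```

For the example run this leaves errors 3, 6 and 8 to report; error 5 is dropped as benign.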

Example ACQCOPY STATS file

                   *** AIS33544.STATS ***

  Number of events with Subsystem data
    Pol     VXD     DC      CRID      KAL       WIC
     998  2613  2613  3261  3501  2614
  RUN_NUMBER       33544
  EVTS_SEEN         241
  HADRONS           0
  Filter Pass Rate
   Energy ****   WAB  Muon  Tau
     197     0     0     0     0
  Hadrons_&_HAD_trigger=          187
  EIT_&_HAD_TRIGGER=          185
  Date          817  JY_Z_day2            0 JY_CDC_day2            0
  All_Bhabha=         617 Precise_Bhabha         352 Num_evts=        4203
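
One simple way to check the subsystem line for missing data is sketched below. It treats "missing" as a count of exactly zero, which is an assumption; a count that is merely low compared with the other subsystems still needs a human look. The function name is hypothetical.

```python
def missing_subsystems(header_line, counts_line):
    """Return the subsystem names whose event count is zero."""
    names = header_line.split()
    counts = [int(value) for value in counts_line.split()]
    if len(names) != len(counts):
        raise ValueError("header and count columns do not line up")
    return [name for name, count in zip(names, counts) if count == 0]


# The two lines from the example STATS file above.
header = "    Pol     VXD     DC      CRID      KAL       WIC"
counts = "     998  2613  2613  3261  3501  2614"
print(missing_subsystems(header, counts))  # []
```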

Filter

Statistics

A sample file is annotated here:

 

                   *** FLT33544.STATS ***


  RUN_NUMBER       33544
  EVTS_SEEN        4203
  Evts with error bits in MA_BEL          52
  SLCZ           0
  SLC.SLDZ           0
  POLAR  0.7649
  Frac bad pol measures  0.2178
  CDCon fraction for track/had candidates  0.9615
   trk -0.9138
     4.198
  Physics/track candidates        3472
  BHABHA candidates         617
  Precise BHABHA candidates         352
  Trigger Correlation matrix (norm to I/P events)
   Rndm  Energy LUM  tr_cdc tau  Had  WAB  Muon
   0.162
   0.000 0.080
   0.000 0.000 0.237
   0.000 0.032 0.000 0.232
   0.000 0.028 0.026 0.005 0.100
   0.000 0.050 0.001 0.034 0.016 0.085
   0.000 0.030 0.000 0.014 0.010 0.022 0.030
   0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.002
  Filter Correlation matrix (norm to physics/track cands)
   laser  Rndm   LUM   trk   tau  Muon  E_JY  E_KZ0F
   0.000
   0.000 0.165
   0.000 0.000 0.178
   0.000 0.000 0.000 0.019
   0.000 0.000 0.000 0.010 0.079
   0.000 0.000 0.000 0.001 0.001 0.001
   0.000 0.000 0.000 0.000 0.058 0.000 0.070
   0.000 0.000 0.000 0.000 0.022 0.000 0.031 0.045
  Frac of phy/track cands where KVM/MACH Pol bits disagree 0.00000
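
The correlation matrices in the STATS file are printed as a lower triangle under a row of labels. A minimal sketch for reading one into a lookup keyed by label pair follows. It assumes that row i belongs to label i and that the matrix is symmetric; neither is stated in the file itself, and the function name is made up.

```python
def parse_triangular(labels_line, rows):
    """Read a lower-triangular matrix into a symmetric {(a, b): value} dict."""
    labels = labels_line.split()
    if len(rows) != len(labels):
        raise ValueError("expected one row per label")
    matrix = {}
    for i, row in enumerate(rows):
        values = [float(v) for v in row.split()]
        if len(values) != i + 1:
            raise ValueError("row %d should have %d entries" % (i, i + 1))
        for j, value in enumerate(values):
            matrix[(labels[i], labels[j])] = value
            matrix[(labels[j], labels[i])] = value
    return matrix


# Trigger correlation matrix from the sample file above.
labels = "Rndm  Energy LUM  tr_cdc tau  Had  WAB  Muon"
rows = [
    "0.162",
    "0.000 0.080",
    "0.000 0.000 0.237",
    "0.000 0.032 0.000 0.232",
    "0.000 0.028 0.026 0.005 0.100",
    "0.000 0.050 0.001 0.034 0.016 0.085",
    "0.000 0.030 0.000 0.014 0.010 0.022 0.030",
    "0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.002",
]
trigger = parse_triangular(labels, rows)
print(trigger[("Had", "Energy")])  # 0.05
```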

What to look at

Histograms

Most of these histograms are for diagnostics in the filter job itself. Of interest for monitoring are the polarization and CDC HV-on plots. We would like a nice gaussian for POL and a spike at 10 layers for the CDC.

What to look at

Recon

Statistics

A sample file is annotated here:

 
                   *** REC33544.STATS ***
 
  RUN_NUMBER       33544
  EVTS_SEEN         364
  HADRONS         187
  L/R asy for phys/trk, HAD candidates
    -0.132 +/-    0.052  -0.228 +/-    0.071
  Pol error rate for phys/trk, HAD candidates
     0.0000   0.0000
  Ave Z at IP -6.7008E-02
  Filter Pass Rate
   Energy TaKal  WAB  Muon  Tau
     197   188    22     2   210
  Filter Correlation matrix (norm to I/P events)
   Energy TaKal   WAB   Muon  Tau
   0.541
   0.516 0.516
   0.000 0.000 0.060
   0.000 0.000 0.000 0.005
   0.511 0.508 0.025 0.005 0.577
  Hadrons_&_HAD_trigger=          187
  EIT_&_HAD_trigger=          185
  Date          817  JY_Z_day2            0 JY_CDC_day2            0
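
As a sanity check on the L/R asymmetry line, the sketch below assumes the usual counting asymmetry A = (N_L - N_R)/(N_L + N_R) with binomial error sqrt((1 - A^2)/N). The file does not state the definition, so this is an assumption; with it, the quoted errors in the sample are reproduced from EVTS_SEEN (364) and HADRONS (187).

```python
import math


def asymmetry(n_left, n_right):
    """Counting asymmetry and its binomial error (assumed definition)."""
    total = n_left + n_right
    if total <= 0:
        raise ValueError("need at least one event")
    a = (n_left - n_right) / total
    return a, asymmetry_error(a, total)


def asymmetry_error(a, n):
    """Binomial error on an asymmetry a measured with n events."""
    return math.sqrt((1.0 - a * a) / n)


# Values taken from the sample REC33544.STATS file above.
print(round(asymmetry_error(-0.132, 364), 3))  # 0.052
print(round(asymmetry_error(-0.228, 187), 3))  # 0.071
```

An error that is far from this estimate for the listed event counts would be worth a closer look.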
 
What to look at

Histograms

Most of the RECON histograms are the same ones used for the RTH online monitoring system, and undergo the same statistical comparison to the 'expected' distributions. Histograms that fail this comparison are flagged for the monitor's further investigation. (Not implemented yet).

What to look at

Event Displays

Both raw and reconstructed data from the KAL and DC systems are displayed for a sample of hadron candidates. The sample is drawn from events that pass either hadron filter. The aim is to select about 5 events at random; if there are fewer, all are displayed.
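
The selection rule can be pictured with the short sketch below. It is only an illustration of "about 5 at random, or all if fewer": the real job may select events differently (for example with a per-event probability), and the function name and default target are assumptions.

```python
import random


def pick_display_sample(events, target=5, rng=None):
    """Pick up to `target` events at random; return all if there are fewer."""
    rng = rng or random.Random()
    events = list(events)
    if len(events) <= target:
        return events
    return rng.sample(events, target)


# Hypothetical event numbers for hadron candidates in one run.
candidates = [8, 13, 14, 18, 19, 23, 26, 27]
print(len(pick_display_sample(candidates)))  # 5
print(pick_display_sample([8, 13, 14]))      # [8, 13, 14]
```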

Each event gets three views: the first is a 3-view with KAL cluster hits and CDC tracks shown; the 2nd is a blowup of the CDC with tracks and vectored hits; the 3rd shows a blowup of the VXD and CCD hits on tracks.

These are just there for visual verification that the detector is OK; no event classification is called for!

Look for lots of extra vectored hits in the tracking or missing layers. Make sure CCD hits are being associated to CDC tracks.

ZCHKDAY

This stage runs ZXFIND on all the PASS2 output events for a given day. It looks at the primary vertex position, gamma conversions and K-shorts, track multiplicities, and VXD linking efficiency. Default parameters are used in ZXFIND.

Output diagnostics from this stage include statistics & histograms.

Statistics

A file of statistics is produced for each day. The name format is ZYyymmdd.STATS (e.g., for June 9, 1993 it would be ZY930609.STATS). A sample file is annotated here:

 
                   *** ZY960423.STATS ***
 
               Summary for ZX Run for          401 Zs
               --------------------------------------
    ... IP position ...
     =   0.0000  +/-   0.0000   sig(x) =   0.0300  +/-   0.0000
     =   0.0000  +/-   0.0000   sig(y) =   0.0300  +/-   0.0000
     =   0.0000  +/-   0.0000   sig(z) =   0.1000  +/-   0.0000
    ... Gamma Conversions ...
    Seen/Z =   0.1322
    G mass =   0.0100  +/-   0.0006  G width =   0.0044  +/-   0.0005
    R(pipe/CDC) =   0.6061
    G  = 1.792
    ... Vees ...
    Candidates Seen/Z =   0.1172
    V fraction from fit =  8.5248E-02
    V mass =   0.5013  +/-   0.0010  V width =   0.0053  +/-   0.0015
    V  = 2.246
    ... Trk multiplicity ...
    Total =   20.08
    Inner =  0.0000E+00  Outer =   2.322  Beyond =   7.273
    Inner trk frac WITHOUT VXD links =   1.000
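
The daily file name follows directly from the date. A small sketch, with a hypothetical helper name, that builds it in the ZYyymmdd.STATS format described above:

```python
from datetime import date


def zchkday_stats_name(day):
    """Build the daily ZCHKDAY statistics file name, ZYyymmdd.STATS."""
    return day.strftime("ZY%y%m%d.STATS")


print(zchkday_stats_name(date(1993, 6, 9)))   # ZY930609.STATS
print(zchkday_stats_name(date(1996, 4, 23)))  # ZY960423.STATS
```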

What to look at

Histograms

What to look at

Using STATUS

Pages 12, 14 and 15 of the STATUS display are devoted to the offline processing. They give some diagnostics and statistics from each stage of processing, along with the VM batch queues for each stage, the short-term run monitor schedule and short-term Z/day statistics. These pages are available from the web.

