Data Re-recon on VMS

Introduction

A facility has been constructed to re-reconstruct the PASS1 output on VMS. The VMS .COM and .TASK files were cloned from the MC Farm machinery, while the Rexx code was ported to UNIX using uni-Rexx.

The operator selects the run range to be re-reconstructed with a Rexx exec in UNIX and then submits jobs for those runs to VMS from UNIX. The HAD datasets are copied to disk; the re-recon job uses those files as input, writes recon and miniDST output to the staging-out areas, and then erases the input disk file.

Reconstruction Code

The same code is used for the re-recon as for the original PASS2. Hence all the event tagging is the same. Only the initialization and PASS2 code are needed from the PASS2 suite: R_VARINI.IDA and R_PASS2.IDA from DUCSSLD.

The re-recon differs from the original PASS2 in that the CRID recon is performed and the output dataset is augmented with CRDAEB. The job also writes out the miniDST as a separate dataset.

Bookkeeping

The processing jobs are run out of the SLDPM SLACVX cluster account; files are kept in the [SLDPM.RECON] directory.

As for the MC Farm, the re-recon is described by a .TASK file, which identifies the re-recon by name and specifies a run range, the IDA and SETUP files defining the job environment, the desired code version, and a comment. A typical .TASK file is shown below:


Task:		REC93V11
Rec_Ida:	disk$sld_usr0:[sldpm]RECV11
Setup_file:	disk$sld_usr0:[sldpm]REC93V11_SETUP
Run_Period:	15774 23000
Code_Version:	12.0
Comments:	Version 12 re-recon of '93 data
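As a purely illustrative sketch of the key/value layout above, a run range could be pulled out of such a file with a one-line awk. The file name and field layout follow the example shown, but this snippet is not part of the actual machinery:

```shell
# Illustrative only: extract the Run_Period from a .TASK-style file.
# Assumes the "Key:<whitespace>value" layout shown above.
cat > rec93v11.task <<'EOF'
Task: REC93V11
Run_Period: 15774 23000
Code_Version: 12.0
EOF
run_range=$(awk -F':[ \t]+' '$1 == "Run_Period" { print $2 }' rec93v11.task)
echo "$run_range"
```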

The output staging process maintains datacats in the [SLDPM] home directory. These datacats' names are based on the task name.

The task log file for the jobs run for a given task is stored in the [SLDPM] directory. Individual run log files are kept in $scr:[SLDPM].

Submitting Jobs

A difference between the MC Farm and the data re-recon operation is that the re-recon input files are densely packed on tape; often 150 runs at a time will fit on the 1 GB silo cartridges. Consequently some care must be taken to avoid having the recon jobs fight over the input tapes.

The current solution is to copy the input tape files to disk one tape at a time prior to a batch of runs being processed. When the recon jobs successfully complete, they erase their input file from disk. Currently the input disk pool is a 4 GB disk on SLDA6: disk$sld_rec_stg. When there is sufficient space on the disk, another batch of runs can be submitted.

Since a good deal of Oracle querying must be done, job submission is done on a platform with easy Oracle access. UNIX has been chosen for its Rexx Oracle interface and for the ease of NFS access to VMS disks and vice versa. All the relevant files are kept in /u/ey/sldpm/recon.

A Digression on Using Rexx in UNIX

The version of Rexx on UNIX is called uni-Rexx. The version that contains the Oracle access is in /afs/slac.stanford.edu/u/sf/renata/rxsql/rxsql/rxx. To connect to the SLACVM Oracle DB, one must set an environment variable:

setenv TWO_TASK  t:slacvm:,,idnc,iiuty
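The setenv line above is csh syntax; in a Bourne-type shell the equivalent would be:

```shell
# Bourne-shell equivalent of the csh "setenv TWO_TASK ..." line above
TWO_TASK='t:slacvm:,,idnc,iiuty'
export TWO_TASK
echo "$TWO_TASK"
```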

Back to Submitting Jobs

Jobs can be submitted in two modes: by tape (i.e., a tape's worth of files at a time) or by run range (intended for patchups of failed runs).

On UNIX

Use is made of the cron facility to regularly check the available space in the disk pool and submit another tape's worth of runs if there is room. This is intended to process the entire task without human intervention.

A file, 'task'.cron, tells cron what to run and at what interval. tcron.rex sets up the proper environment and calls tapesub.rex, which creates a file, 'task'.tapes, listing all the tapes containing the runs in the task. tapesub.rex then looks at the 'task'.report file to see which tape was submitted last and submits the next one (if there is space for it). As each tape-copy job is submitted, 'task'.report is updated to keep track of what has been done. When all tapes in the .report file match up with the .tapes file, the task is complete.
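The bookkeeping step — determining the next tape to submit from the .tapes and .report files — can be sketched as follows. This is a rough sketch under assumed file formats; the real tapesub.rex also checks the free space on the buffer disk before submitting:

```shell
# Sketch only: assumes one tape name per line in both files
# (the actual file formats are not documented here).
printf 'TAPE001\nTAPE002\nTAPE003\n' > rec93v11.tapes
printf 'TAPE001\n' > rec93v11.report   # tapes already submitted

last_done=$(tail -1 rec93v11.report)
# The next tape to submit is the entry following the last reported one.
next_tape=$(awk -v last="$last_done" '$0 == last { getline; print; exit }' rec93v11.tapes)
echo "$next_tape"
```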

The method for submitting the crontab file is to prepare a file 'task'.cron which looks like

0,30 * * * * /nfs/juno/u18/ey/sldpm/recon/tcron.rex "rec93v11" >> /nfs/juno/u18/ey/sldpm/recon/rec93v11.cronlog

where the task name appears twice. To submit the task to crontab,

   crontab 'task'.cron

Note that the cron job is specific to the node you submitted the crontab command on. Other nodes will not know about it.

Each time the timed job runs, it echoes into the file 'task'.cronlog. After the task is complete, disable the cron job with

crontab -r

Instructions for resubmitting failed jobs follow.

recsub.rex prepares a .COM file which is submitted to SLACVX to perform the tape copy onto the buffer disk and then submit the individual run-by-run recon jobs. The .COM file is left around as hadcopy'time'.com (to make the name unique).

Use

   rxx recsub.rex 'task' run_begin run_end [( NOCOPY VAX DEBUG]

to submit the individual runs to SLACVX. The NOCOPY option tells recsub that the input dataset is still on disk. The jobs are submitted by the tape copy job as each file gets copied to disk. The tape copy .COM file is placed in SLDPM's scratch area on SLACVX, along with another .COM file, release'time'.com, which contains all the job submission commands. Note that if you use any of the options, you'll need to enclose the entire argument string to recsub in single quotes, e.g. rxx recsub.rex 'rec93v11 15774 15800 ( NOCOPY' (UNIX gets annoyed at the unmatched open parenthesis).

recsub.rex submits the jobs to the VAX cluster via rsh. It will echo each run that it submits.

A utility, howmanytapes.rex, queries Oracle about the number of tapes occupied by a given run range:

 rxx howmanytapes.rex run_begin run_end
 
will tell you how many tapes that run range spans, and which ones.

On VMS

recsub.rex submits one RECSUPER.COM to batch on VMS for each run. This .COM file submits a recon job, RECJOB.COM, to do the actual reconstruction and then cleans up after successful completion (it uses [SLDPM]CHECKLOG.COM to verify that the recon job finished properly).
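The log check done by CHECKLOG.COM could look something like the following grep test. The success string and log file name here are assumptions for illustration, not the actual criterion used by the .COM file:

```shell
# Illustrative stand-in for a CHECKLOG-style test of a run log
# (the "terminated normally" marker is an assumed success string).
printf 'recon started\nrecon finished\nJob terminated normally\n' > run15774.log
if grep -q 'terminated normally' run15774.log; then
    status=OK
else
    status=FAILED
fi
echo "$status"
```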

Monitoring the Jobs

Monitoring the re-recon jobs is much like the MC Farm monitoring: one can use WWW or look at the task log files directly.

A log is kept of the job submission requests from UNIX in ~sldpm/recon/'task_name'.report. This file chronicles the time and run range for each submission.

Shareables and Frozen Production Areas

Again, just as for the MC Farm, the executable code is run from shareables and a frozen code area.


Richard Dubois Updated Dec 8 1994