In this section of the workbook, you will learn when to use batch mode instead of interactive mode, how to submit batch jobs, how to monitor their progress and how to retrieve their output.
Most of what you will learn in this section applies both to running
on the SLACVX cluster and on any other VMS system.
However, some aspects of how the queues work and how you monitor jobs
are different on the SLACVX cluster, since the SLACVX cluster contains
SLD extensions to the basic VMS batch system.
When to Use Batch versus Interactive
SLD users typically run in interactive mode when they want to study a
small number of events in great detail. You should run this way when you are
developing your algorithms, trying out event selection criteria
or testing a new piece of reconstruction code.
But once you have written your code and you want to run that same code
on a large number of events, it is time for you to run in batch mode.
Because batch jobs make more efficient use of computer resources such as processor power and tape drives, batch jobs are given higher priority for these resources.
For reasons discussed in the workbook section The Tape System, batch processing is the only way you are allowed to write tapes.
Batch mode is preferable for complicated jobs because it leaves a more organized record of your work. When it comes time to redo an analysis with minor changes, you will find the rework much easier if the original work was a batch job.
Finally, it is simply impolite to tie up resources in interactive mode that could better be shared with your collaborators by running in batch mode.
Don't be afraid to use interactive mode, but keep to the basic rule:
When the batch job runs:
If you do not already have a tape of your own, issue a GETFREE command to get one.
Create a new subdirectory of your login directory to keep all of
the batch files in. This is not absolutely necessary, but it provides a good
way of organizing your work.
SET DEFAULT SYS$LOGIN
CREATE/DIR [.BATCHTEST]
SET DEFAULT [.BATCHTEST]
Create a new file to store the commands that your batch job will execute. You can use any file name with the extension .COM. For this example, make it TESTONE.COM.
Type or paste the following code into TESTONE.COM:
$SET DEFAULT [.BATCHTEST]
$DUCS PROD ALL
$IDA
OPENTAPE READ REC94_MDST STAGE WAIT
OPENTAPE WRITE your_tape_id.1/FIVETRK
DEF EVANAL
NTRACKS=0
BANKLOOP PHCHRG
IF PHCHRG%(NHIT) > 40
NTRACKS=NTRACKS+1
ENDIF
ENDLOOP
IF NTRACKS > 5
WRITE USING INLIST
TYPE "Wrote Event " _IEVENTH%(EVENT)
ELSE
TYPE "Rejected Event" _IEVENTH%(EVENT)
ENDIF
ENDDEF
GO 200
QQUIT
With the exception of the dollar signs in the first three lines, this file contains exactly what you would type to run this IDA job interactively from a fresh login. But, as we have discussed, the OPENTAPE WRITE command would be rejected from an interactive job.
In a COM file, every line that is a DCL command is preceded by a dollar sign. Once the job has started IDA, the commands are IDA commands, not DCL commands, so they do not have dollar signs.
When your job is run, the result will be a new data tape that contains
only the events with more than five charged tracks with more than 40 hits each.
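The selection that EVANAL performs can be restated outside IDA. The following is only a minimal Python sketch of the same counting logic, not IDA code: the function names and the list of per-track hit counts are hypothetical stand-ins for the PHCHRG bank loop, and the thresholds follow the comparisons written in TESTONE.COM (NHIT > 40 and NTRACKS > 5).

```python
def count_good_tracks(hits_per_track, min_hits=40):
    """Count tracks with MORE than min_hits hits (mirrors IF PHCHRG%(NHIT) > 40)."""
    return sum(1 for nhit in hits_per_track if nhit > min_hits)


def keep_event(hits_per_track, min_hits=40, min_tracks=5):
    """Return True if the event would be written (mirrors IF NTRACKS > 5)."""
    return count_good_tracks(hits_per_track, min_hits) > min_tracks


# Hypothetical events, each given as a list of hit counts, one per charged track.
events = {
    1: [45, 50, 41, 60, 55, 48],   # six tracks above 40 hits: written
    2: [45, 50, 41, 60, 55],       # only five such tracks: rejected
    3: [40] * 10,                  # exactly 40 hits does not pass "> 40": rejected
}

for number, hits in events.items():
    verdict = "Wrote Event" if keep_event(hits) else "Rejected Event"
    print(verdict, number)
```

Both comparisons are strict, so an event with exactly five good tracks, or a track with exactly 40 hits, does not pass.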
How the Different Queues Work
You are almost ready to submit your job, but before you submit it you need
to know something about "Batch Queues."
Batch systems typically need to accommodate both very long jobs and very short jobs. Sharing limited computer resources between these two kinds of jobs is difficult. Some of the most important jobs tend to be long ones (such as preparing the Monte Carlo data set needed for an analysis to be presented at an upcoming conference), but shorter jobs take so little time to run that it doesn't make sense to have them wait until all the long, important jobs are finished. There can also be problems with different jobs competing for the same tapes or tape drives.
Batch systems typically handle these problems by designating different "batch queues" for different types of jobs. There are some queues for shorter jobs and some queues for longer jobs. The system then assigns some resources to each queue. If a given queue is empty, the system may try to allocate more resources to other queues (though for complicated but valid reasons, it sometimes makes sense to reserve resources for a queue that happens to be empty).
It is important to submit each job to the appropriate batch queue.
To see what batch queues are available,
type SHOW QUEUE host_name*
Batch queue names will have suffixes like "EXPRESS", "MEDIUM" or "CRUNCH."
Express queues typically handle short jobs. Crunch queues typically handle long jobs.
Only the Job Juggler has direct access to the actual SLACVX batch queues. Users submit jobs to the Job Juggler's virtual queues.
The Job Juggler queue names are as follows:
Queue         Time Limit
============  ===========================
SLD_EXPRESS   2 minutes
SLD_FAST      8 minutes
SLD_SHORT     20 minutes
SLD_MEDIUM    1 hour
SLD_LONG      3 hours
SLD_CRUNCH    Infinite
SLD_STAGE     Reserved for production use
SLD_MC        Reserved for production use
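If you have a rough estimate of how much CPU time your job needs, choosing a queue amounts to finding the first time limit that covers it. The sketch below is a hypothetical Python helper, not part of the Job Juggler. It copies the limits from the list above, leaves out the two queues reserved for production use, and assumes that a job whose CPU time exactly equals a limit still fits in that queue.

```python
# Time limits in seconds, in increasing order, copied from the list above.
# SLD_STAGE and SLD_MC are reserved for production use and are left out.
QUEUE_LIMITS_SECONDS = [
    ("SLD_EXPRESS", 2 * 60),
    ("SLD_FAST", 8 * 60),
    ("SLD_SHORT", 20 * 60),
    ("SLD_MEDIUM", 60 * 60),
    ("SLD_LONG", 3 * 60 * 60),
]


def pick_queue(estimated_cpu_seconds):
    """Return the shortest user queue whose limit covers the estimate."""
    for name, limit in QUEUE_LIMITS_SECONDS:
        if estimated_cpu_seconds <= limit:
            return name
    # Nothing shorter fits, so fall back to the queue with no time limit.
    return "SLD_CRUNCH"


print(pick_queue(90))      # a 90 second job
print(pick_queue(7200))    # a two hour job
```

An estimate that is too low means the job runs out of time, so it is safer to round your estimate up before choosing.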
To run the job, use the SUBMIT command.
type SUBMIT/QUEUE=SLD_EXPRESS/CHAR=CART TESTONE.COM
The extra option, CHAR=CART, tells the Job Juggler that it needs to reserve tape cartridge handling resources for your job. Without this option, your job will not be allowed to use tapes.
Two other common options are CHAR=VAX and CHAR=ALPHA. These options cause your job to be run only on the specified type of host. The SLACVX cluster contains both VAX and ALPHA machines. The code that is in DUCS will run on either type of machine, but when you get to writing your own code, you will want to run it on the same kind of machine that it was compiled on.
To use multiple CHAR options in the same SUBMIT, put them together within parentheses, separated by commas, as in:
SUBMIT/QUEUE=SLD_EXPRESS/CHAR=(CART,VAX) TESTONE.COM
For other SUBMIT options, see The SLACVX Batch System.
The system should respond with a message that your job is pending.
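If you submit many jobs, you may want to generate the SUBMIT line rather than type it each time. The following is a minimal Python sketch that only builds the command text in the two forms shown above (a single CHAR option, or several in parentheses separated by commas). The function name is hypothetical, and the sketch does not submit anything or check that the queue or options exist.

```python
def build_submit(com_file, queue, characteristics=()):
    """Build a SUBMIT command line as text; nothing is sent to the batch system."""
    chars = list(characteristics)
    command = f"SUBMIT/QUEUE={queue}"
    if len(chars) == 1:
        # A single option is written without parentheses.
        command += f"/CHAR={chars[0]}"
    elif len(chars) > 1:
        # Several options go inside parentheses, separated by commas.
        command += "/CHAR=(" + ",".join(chars) + ")"
    return f"{command} {com_file}"


print(build_submit("TESTONE.COM", "SLD_EXPRESS", ["CART"]))
print(build_submit("TESTONE.COM", "SLD_EXPRESS", ["CART", "VAX"]))
```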
Monitoring the Job via BATQ
You can monitor the progress of your job by using the BATQ command.
type BATQ/USER=your_userid
If you get the response "No jobs found," it means your job has completed.
BATQ with no options shows you all jobs in the batch system.
For other BATQ options, see The SLACVX Batch System.
On non-SLACVX systems, find your job and its entry number with SHOW QUEUE:
type SHOW QUEUE/ALL queue_name
To get full details on your job's progress, once you have the entry number,
type SHOW ENTRY/FULL entry_number
Getting the Output
When your job has completed, a record of the job will be left behind in your
login directory under the same name as your .COM file but with the
file extension .LOG. Thus for your example job, your log file will be
called TESTONE.LOG.
It contains everything that would have been typed to your terminal if this
had been an interactive session.
At the end of the log file, there is a block of information about the
amount of system resources that were consumed by your job.
Of particular interest is the "Charged CPU time." This is the time that is continually checked against the limits allowed by the particular batch queue that you are using. The units are days, followed after a space by hours, minutes, seconds and hundredths of a second.
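To compare the charged CPU time with a queue's time limit, it helps to convert it to seconds. The following is a minimal Python sketch, assuming the value appears exactly as described above: days, a space, then hours, minutes and seconds separated by colons, with hundredths of a second after a decimal point (for example "0 00:01:23.45"). The function name is hypothetical, and the sample strings are made up rather than taken from a real log file.

```python
def cpu_time_to_seconds(text):
    """Convert 'D HH:MM:SS.CC' (days, then hours:minutes:seconds.hundredths) to seconds."""
    days_part, clock_part = text.split()
    hours, minutes, seconds = clock_part.split(":")
    return (
        int(days_part) * 86400
        + int(hours) * 3600
        + int(minutes) * 60
        + float(seconds)
    )


# Made-up examples: a short job, and a job that ran for more than a day.
print(cpu_time_to_seconds("0 00:01:23.45"))
print(cpu_time_to_seconds("1 02:03:04.50"))
```

A job charged "0 00:01:23.45" used a little under a minute and a half of CPU time, which is within the 2 minute limit of SLD_EXPRESS.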
As soon as your job begins running, the log file will appear on disk.
You can type out the log file or view it in an editor at any time while
the job is running to check on the progress of your job.
By checking in this way, you can often detect a problem before the job
has finished running and can stop the job if it is not working correctly.
Deleting a Job
To stop a job that is in a queue or is running, you first need to get the
job's entry number by using BATQ (SLACVX cluster) or SHOW QUEUE
(non-SLACVX systems) as described above.
You can then stop the job by typing
DELETE/ENTRY=entry_number
The DELETE command gives no output unless it FAILS to find the job. To confirm that the DELETE has worked, do another BATQ or SHOW QUEUE.
Use DELETE only on jobs you directly submitted.
Do not use it to delete staging jobs (jobs automatically submitted by the
STAGE command). Doing this will cause the staging system to lose track
of what it has staged.
Check Your Results
If your job succeeded, you should now have a data set on your tape that
contains only the events with more than five charged tracks with more
than 40 hits each.
Check your tape now by opening it and writing out the event numbers.
You can do this from an interactive job by just using the IDA commands:
OPENTAPE READ your_tape_id.1/FIVETRK STAGE WAIT
DEF EVANAL
TYPE "Tape contains event" _IEVENTH%(EVENT)
ENDDEF
GO 0
CLOSE INFILE
You could try the same thing from a batch job if you prefer.
To complete this exercise, enter the tape in your private Datacat. If you do not have one already, follow the instructions on how to make your own private Datacat. Then create a nickname for your new data set.
Now go back to IDA and reopen your tape, this time using the nickname:
OPENTAPE READ your_nickname STAGE WAIT
DEF EVANAL
TYPE "Tape contains event" _IEVENTH%(EVENT)
ENDDEF
GO 0
CLOSE INFILE
For further details on running batch on the SLACVX cluster, see The SLACVX Batch System by Tony Johnson.