Batch Jobs
Introduction to Sun Grid Engine
Job Submission and Scheduling
Job scheduling and control on the Tornado cluster is handled by the Sun Grid Engine (SGE). The SGE schedules and runs jobs according to a fair-share policy: over the long term, the CPU time allotted to jobs is proportional to the share of the Tornado system owned by the institute with which each job owner is affiliated.
Currently, batch jobs can be submitted on Tornado from the login node or from any of the compute nodes.
Jobs are submitted to the SGE using qsub, either typed interactively or embedded in job scripts or regular scripts. Job scripts must be stored in files, and a job with a job-script stored in a file named mycalc.job is submitted to the SGE by the command:
qsub mycalc.job
Job scripts for the SGE consist of two sections: the first contains information for the SGE regarding the nature of the job (e.g., expected resource usage), and the second contains statements to be executed by a shell. For example, a job script that runs the program xplor (with input from dgsa.inp and interactive output written to dgsa.log), followed by the program runstats, could look as follows:
#-----------------------------------------------------------------------------
# SGE directives
#
#$ -N my_job
#$ -S /bin/sh
#$ -cwd
#
#-----------------------------------------------------------------------------
# user command section
#
xplor < dgsa.inp > dgsa.log
runstats dgsa.log
Assuming that the text shown above is stored in the file dgsa.job, the command-line:
qsub dgsa.job
will submit the corresponding job to the SGE.
A collection of useful SGE directives and their meanings will be provided further below. However, it is also valuable to understand the main principles of job-submission. In essence, each job script is read by two different programs in the course of running each job. The first program to read each job script is qsub, and it only cares about lines starting with the #$ combination of characters and ignores all other parts of job scripts. The second program to read each job script is the shell that the SGE starts on the compute node(s) to carry out the statements in the job script.
When adding comments or commenting out lines in the user part of the script, take care not to introduce any lines that begin with #$, because qsub would treat them as directives.
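A quick way to see a job script the way qsub does, as described above, is to list only the lines that begin with #$. The following is a minimal sketch; the function name audit_directives is hypothetical and not part of the SGE.

```shell
# Hypothetical helper: print every line of a job script that begins with "#$",
# together with its line number, so that unintended directives stand out.
audit_directives() {
    # In a basic regular expression, \$ matches a literal dollar sign.
    grep -n '^#\$' "$1"
}
```

Running audit_directives dgsa.job on the example script above would list its three directive lines. A line such as $HOME/bin/runstats dgsa.log that has been commented out with a single # would also show up, which is exactly the situation the warning above is about.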
Job-script examples
Below you will find examples (templates) of job scripts for serial jobs as well as for parallel jobs. As mentioned above, essential properties of jobs (e.g., resource requirements) can be communicated to the SGE through job-script directives. However, job-script directives (without the #$ prefix) may also be used as options to the qsub command, in order to provide information that is difficult or inconvenient to embed in job scripts.
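To illustrate the correspondence between directives and qsub options, the sketch below strips the #$ prefix from the directive lines of a job script and prints the result as a single option string. The function name directives_as_options is hypothetical; it only prints text and does not call qsub.

```shell
# Hypothetical helper: show the directives of a job script in the form
# they would take as command-line options to qsub.
directives_as_options() {
    grep '^#\$' "$1" |                  # keep only the directive lines
        sed 's/^#\$[[:space:]]*//' |    # drop the "#$" prefix and following blanks
        paste -s -d ' ' -               # join all lines into one
}
```

For the dgsa.job example, this prints -N my_job -S /bin/sh -cwd, so the same settings could be given as qsub -N my_job -S /bin/sh -cwd dgsa.job.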
A single-node sequential job
The convention of SGE is that in order to run parallel jobs, a so-called parallel environment must be requested by the job script (by using a #$ -pe directive). A job script that does not request a parallel environment, such as the example below, is therefore implicitly a non-parallel (a.k.a. serial) job:
#-----------------------------------------------------------------------------
#$ -S /bin/bash
#$ -l h_cpu=00:01:00
#$ -l h_vmem=1024M
#$ -o /scratch/wrkshr/myname
#$ -j y
#$ -N SNSJ
#$ -M myname@mymail.de
#$ -cwd
#-----------------------------------------------------------------------------
echo -n " Job started at: "
date
echo -n " Execution host: "
hostname
#-----------------------------------------------------------------------------
# actual user command(s) to run
#
command
#
#-----------------------------------------------------------------------------
echo -n " Job completed at: "
date
#-----------------------------------------------------------------------------
You may add a #$ -q serial directive for clarity.
A task parallel job
The SGE allows multiple tasks to be submitted in a single job. This is done with the option -t <first task>-<last task>, as in the following example:
#------------------------------------------------------------------------------
# Sun Grid Engine
#$ -S /usr/bin/ksh
#$ -N hurricans
#$ -j y
#$ -t 1-12
#$ -q serial
#------------------------------------------------------------------------------
# SGE starts counting at 1 and ksh/bash arrays at 0, so decrement
SGE_TASK_ID=$(($SGE_TASK_ID - 1))
At run time, the environment variable SGE_TASK_ID holds the task number. For example, the input files for a job can be stored in an array (here in ksh syntax; in bash, use files=($(ls ${DATA_DIR}/99*1979*.szip)) instead)
set -A files $(ls ${DATA_DIR}/99*1979*.szip)
and then a single file selected with
file=${files[$SGE_TASK_ID]}
so that each task runs with a different input file.
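The selection of one input file per task can also be sketched without arrays. The helper below is hypothetical (pick_input is not an SGE command); it uses the shell's positional parameters, which are numbered from 1 just like SGE task IDs, so it expects the original, undecremented value of SGE_TASK_ID.

```shell
# Hypothetical helper: pick_input TASK_ID FILE...
# Prints the file that belongs to the given 1-based task number.
pick_input() {
    task_id=$1
    shift                       # drop TASK_ID; "$@" is now the file list
    # Refuse task numbers outside 1..number-of-files.
    [ "$task_id" -ge 1 ] && [ "$task_id" -le "$#" ] || return 1
    shift "$((task_id - 1))"    # skip the files of earlier tasks
    printf '%s\n' "$1"
}
```

Inside a task-parallel job script this might be used as file=$(pick_input "$SGE_TASK_ID" "${DATA_DIR}"/99*1979*.szip), assuming the number of matching files is at least the last task number given to -t.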
Note that, in contrast to what is stated in the qsub man page, both the first and the last task number must be specified (separated by a dash).
An MPI parallel job
To submit and run an MPI parallel job, the job script for sequential jobs (above) must be extended to request a parallel environment and a number of processors (processor cores) to run on. The job-script directive #$ -pe requests both a parallel environment and a suitably sized collection of processors. Note that the number of processors requested for parallel MPI jobs must be an integer multiple of eight; otherwise, the job will wait indefinitely. The example below is a job script for an MPI job (using the orte8 OpenMPI parallel environment) that will run on 16 processors/cores:
#-----------------------------------------------------------------------------
#$ -S /bin/bash
#$ -l s_cpu=2:00:00
#$ -l s_vmem=1024M
#$ -j y
#$ -N SNSJ
#$ -M myname@mymail.de
#$ -pe orte8 16
#-----------------------------------------------------------------------------
echo -n " Job started at: "
date
echo -n " Execution host: "
hostname
#-----------------------------------------------------------------------------
# actual user command(s) to run
#
command
#
#-----------------------------------------------------------------------------
echo -n " Job completed at: "
date
#-----------------------------------------------------------------------------
In the #$ -pe directive, it is possible to specify a range of acceptable processor counts rather than a single number. For example, to request between 16 and 48 processors in the case above, change the #$ -pe directive line to read:
#$ -pe orte8 16-48
Note that in the case of MPI parallel jobs, the job-script itself is only executed on one processor. It is the mpiexec-command that starts instances of your program (command in the example above) on all processors of the processor collection allocated for the job.
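Because a job whose processor count is not a multiple of eight waits indefinitely, it can be worth checking the number before submitting. The following is a minimal sketch of such a check; the function name valid_slot_count is hypothetical and the check is purely arithmetic, independent of the SGE.

```shell
# Hypothetical pre-submission check: succeed only if the requested
# processor count is a positive integer multiple of eight.
valid_slot_count() {
    n=$1
    [ "$n" -gt 0 ] && [ "$((n % 8))" -eq 0 ]
}
```

For example, valid_slot_count 16 && qsub mycalc.job would submit only when the count passes the check; a count such as 12 is rejected.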
Parallel Environments
Currently only one parallel environment is available:
| PE | implementation | MPI-Standard |
| --- | --- | --- |
| orte8 | openmpi-1.3.2 | MPI-2 |
Job-script Directives
The job-script directives understood by the SGE are summarized in the table below. For further information, refer to the SGE manuals, which can be found at:
http://gridengine.sunsource.net/
| Directive | Purpose |
| --- | --- |
| -cwd | Run job with the same working directory as the qsub command with which it was submitted. |
| -hold_jid jids | The job submitted by qsub depends on the jobs denoted by jids (a list of job-id's). |
| -j y-or-n | merge stdout and stderr if y-or-n is y |
| -l h_cpu=time | Set (hard) CPU-time limit to time. |
| -l s_cpu=time | Set (soft) CPU-time limit to time. |
| -l h_vmem=mem | Set (hard) memory limit to mem. |
| -l s_vmem=mem | Set (soft) memory limit to mem. |
| -M addr | Send event notifications by e-mail to addr. |
| -m when | Send notifications for the events in when |
| -p prio | Set job priority to prio: -1024, ..., 1023. |
| -S pathname | Use pathname as shell command for this job. |
Note: Because home directories are not available on the compute nodes, start all your jobs from the /scratch filesystems and always use the -cwd directive.
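The note above can be turned into a small guard that is run before submitting. This is only a sketch: the function name in_scratch is hypothetical, and the test is a plain string comparison on the path, so symbolic links are not resolved.

```shell
# Hypothetical guard: succeed if the given directory (default: the current
# working directory) lies under /scratch.
in_scratch() {
    case ${1:-$PWD} in
        /scratch | /scratch/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A typical use would be in_scratch || echo "please change to a /scratch directory before calling qsub".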
Queue Limits
The following queues are available:
| Queue-Name | Description | Wallclock | CPU-Time | Memory | Comment |
| --- | --- | --- | --- | --- | --- |
| cluster (default) | multi-node jobs on all compute nodes | 8 hours | unlimited | 16 GB per node | number of PEs has to be a multiple of 8 |
| serial | single CPU jobs on a shared SMP node | 4 weeks | 4 hours | 32 GB per job | |
| smp | multi-CPU jobs on one dedicated SMP node | 8 hours | unlimited | 64 GB per job | |
SGE job monitoring
The list of currently running and submitted jobs is shown by the qstat command:
tlogin1% qstat
job-ID  prior    name        user     state  submit/start at      queue          slots  ja-task-ID
--------------------------------------------------------------------------------------------------
415584  0.50500  SK_MPI016   k202010  r      05/07/2009 09:15:41  cluster@tc008  16
415582  0.56214  OmpA-M3     k202062  r      05/07/2009 08:39:21  cluster@tc178  32
415583  0.56214  OmpX-200ps  k202062  r      05/07/2009 08:53:11  cluster@tc181  32
415579  0.60500  BbpEF(mdc)  k202062  r      05/07/2009 08:12:10  cluster@tc194  44
415577  0.56214  OmpX-M3     k202062  r      05/07/2009 08:00:15  cluster@tc223  32
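The qstat listing can also be summarized with standard text tools. The sketch below adds up the slots of one user's running jobs; the function name running_slots_for_user is hypothetical, and it assumes the column layout shown above (two header lines, job names without blanks, slots in the ninth column of running jobs).

```shell
# Hypothetical helper: read qstat-style output on standard input and print
# the total number of slots held by running jobs of the given user.
running_slots_for_user() {
    awk -v user="$1" '
        NR > 2 && $4 == user && $5 == "r" { total += $9 }   # skip the two header lines
        END { print total + 0 }                             # print 0 if nothing matched
    '
}
```

With the listing above, qstat | running_slots_for_user k202062 would print 140.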
Examining the files produced by jobs as they run is another (but admittedly not very sophisticated) means of monitoring their progress and behavior.
Each job also writes its standard output and standard error to files named after the job: the job name, followed by .o (standard output) or .e (standard error) and the job ID. For example, the OmpA-M3 job shown in the qstat output above would write its standard output to a file named OmpA-M3.o415582.
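The naming rule can be written down as a small helper that prints the two file names for a given job name and job ID. The function name job_output_files is hypothetical; it only builds the names described above and does not check that the files exist.

```shell
# Hypothetical helper: job_output_files JOB_NAME JOB_ID
# Prints the standard-output file name, then the standard-error file name.
job_output_files() {
    printf '%s.o%s\n%s.e%s\n' "$1" "$2" "$1" "$2"
}
```

For the job above, job_output_files OmpA-M3 415582 prints OmpA-M3.o415582 followed by OmpA-M3.e415582, so the output could be followed while the job runs with, for instance, tail -f OmpA-M3.o415582.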