Submitting strings of jobs on supercomputers | Learning | Beckstein Lab


Simulations require a lot of computing power, so they are often run on supercomputers such as ASU’s saguaro at A2C2 or the machines available through XSEDE. Typical simulations run for many days, but most supercomputers restrict job run times to a day or so. One way to run long simulations is to set up the job so that it restarts from a previous run, and then to submit a string of jobs that each depend on the completion of the preceding one.

The script below, qsub_dependents.py, automates the process of submitting strings of dependent jobs. One either tells it how many jobs to run (using the -N option) or lets the script calculate the number of jobs from the benchmarked performance in ns/d, the desired simulation run time in ns, and the walltime limit of the queue.
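The idea of a dependency chain can be illustrated with a minimal sketch. It is not the code of qsub_dependents.py: the function names `submit_chain` and `run_qsub` are hypothetical, and PBS/TORQUE syntax (`qsub -W depend=afterok:JOBID`) is assumed. Each job is submitted with a dependency on the job id returned for the previous submission.

```python
import subprocess


def run_qsub(cmd):
    """Run a submission command and return the job id it prints.

    Assumption: the scheduler prints only the new job id on standard
    output, as PBS/TORQUE qsub does.
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def submit_chain(script, n_jobs, after=None, submit=run_qsub):
    """Submit n_jobs copies of script, each waiting for the previous one.

    after  : optional id of an existing job the first job should wait for
    submit : callable that takes an argument list and returns a job id
             (hypothetical hook, handy for dry runs)
    """
    job_ids = []
    previous = after
    for _ in range(n_jobs):
        cmd = ["qsub"]
        if previous is not None:
            # start only after the preceding job finished successfully
            cmd += ["-W", "depend=afterok:" + previous]
        cmd.append(script)
        previous = submit(cmd)
        job_ids.append(previous)
    return job_ids
```

Passing a fake `submit` function that only records the commands makes it possible to check the chain without a queuing system at hand.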

Note that your queuing system submission script must contain all the logic to cleanly restart. This depends on your simulation code and how it handles restarts (and is not covered here).

Example

Example use on a machine with PBS/Torque:

qsub_dependents.py -p 18.166 kraken.pbs
-- Will run 6 jobs performing at 18.166 ns/d for desired run time 100 ns.
-- Expected real time (excluding waiting): 5.50479 days
>> qsub kraken.pbs
>> qsub -W depend=afterok:2653842.nid00016 kraken.pbs
>> qsub -W depend=afterok:2653843.nid00016 kraken.pbs
>> qsub -W depend=afterok:2653844.nid00016 kraken.pbs
>> qsub -W depend=afterok:2653845.nid00016 kraken.pbs
>> qsub -W depend=afterok:2653846.nid00016 kraken.pbs
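The numbers in this output can be reproduced with a short calculation. The sketch below is an assumption about how the job count is derived, not the script's actual code; the function names are hypothetical, and the defaults (100 ns run time, 24 h walltime) are taken from the option list.

```python
import math


def number_of_jobs(performance, runtime=100.0, walltime=24.0):
    """Jobs needed to simulate `runtime` ns at `performance` ns/d
    when each job may run for `walltime` hours."""
    ns_per_job = performance * walltime / 24.0
    return math.ceil(runtime / ns_per_job)


def expected_days(performance, runtime=100.0):
    """Pure computing time in days, excluding time spent waiting in the queue."""
    return runtime / performance


print(number_of_jobs(18.166))            # 6
print(round(expected_days(18.166), 5))   # 5.50479
```

At 18.166 ns/d a 24-hour job covers 18.166 ns, so 100 ns needs 100/18.166 ≈ 5.5 jobs, rounded up to 6.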

Usage

Usage: qsub_dependents.py [options] -- [qsub-options] FILE

Submit a string of dependent jobs through the PBS/TORQUE, Gridengine (GE) or SLURM queuing system. Either set --number, or provide the benchmarked performance (--performance), the projected run time (--runtime) and the walltime limit (--walltime), and the script will compute the number of jobs it needs to launch.

The walltime limit should agree with (or at least not exceed) the limit set in the script.

The syntax for PBS, GE, or SLURM is automatically chosen.
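The three queuing systems express job dependencies differently. The following sketch lists the standard command-line forms; the function `dependency_command` and the keys `"pbs"`, `"ge"` and `"slurm"` are hypothetical names for illustration and say nothing about how qsub_dependents.py detects or handles the queuing system internally.

```python
# submission command and dependency flags for each queuing system
SYNTAX = {
    "pbs": ("qsub", ["-W", "depend=afterok:{jobid}"]),
    "ge": ("qsub", ["-hold_jid", "{jobid}"]),
    "slurm": ("sbatch", ["--dependency=afterok:{jobid}"]),
}


def dependency_command(system, script, jobid=None):
    """Return the argument list that submits `script` on `system`,
    optionally holding it until job `jobid` has finished."""
    command, flags = SYNTAX[system]
    args = [command]
    if jobid is not None:
        args += [flag.format(jobid=jobid) for flag in flags]
    args.append(script)
    return args


print(" ".join(dependency_command("slurm", "run.slurm", "4711")))
# sbatch --dependency=afterok:4711 run.slurm
```

Note that Gridengine's -hold_jid releases the job once the earlier job has ended, whereas afterok in PBS and SLURM additionally requires that it ended successfully.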

Examples:

qsub_dependents.py -N 5  run.ge
qsub_dependents.py -w 12 -p 15.3 -r 100 -- -l walltime=12:00:00 run.pbs

Adding three more jobs after a running one with jobid 12345.nid000016:

qsub_dependents.py -N 3 -a 12345.nid000016 run.pbs

Options:

-h, --help
    show this help message and exit
-N N, --number=N
    run exactly N jobs in total [3]
-p PERF, --performance=PERF
    job was benchmarked to run at PERF ns/d
-r TIME, --runtime=TIME
    total run time for the simulation in ns [100.0]
-w TIME, --walltime=TIME
    walltime (in hours) allowed on the queue; must not be longer than the walltime set in the queuing script and really should be the same. NOTE: must be provided as a decimal number of hours. [24]
-a JOBID, --append=JOBID
    make the first job dependent on an already running job with job id JOBID. (Typically used in conjunction with --number.)
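Queuing scripts usually state the walltime as HH:MM:SS, whereas --walltime expects decimal hours. A small helper can do the conversion; this is a convenience sketch with a hypothetical function name, not part of qsub_dependents.py.

```python
def walltime_hours(hms):
    """Convert a walltime string 'HH:MM:SS' to a decimal number of hours."""
    hours, minutes, seconds = (int(field) for field in hms.split(":"))
    return hours + minutes / 60.0 + seconds / 3600.0


print(walltime_hours("12:00:00"))   # 12.0
print(walltime_hours("01:30:00"))   # 1.5
```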

qsub_dependents.py script

The script is published under the BSD 3-clause licence so you can copy and modify it as you like as long as you keep the copyright notice intact.

Download

Download the most recent version of the qsub_dependents.py script by right-clicking this link. This version recognizes the following queuing systems:

  • PBS
  • Gridengine
  • SLURM

Source code

The source code for qsub_dependents.py is publicly available in the git repository Becksteinlab/queuetools.
