Submitting strings of jobs on supercomputers
Simulations require a lot of computing power, so they are often run on supercomputers such as ASU’s saguaro at A2C2 or the machines available through XSEDE. Typical simulations run for many days, but most supercomputers restrict job run times to a day or so. One way to run long simulations is to set up the job so that it restarts from a previous run and then submit a string of jobs, each of which depends on the completion of the preceding one.
The script below, qsub_dependents.py, automates the process of submitting strings of dependent jobs. One either tells it how many jobs to run (using the -N option) or lets the script calculate the number of jobs from the benchmarked performance in ns/d and the desired simulation run time in ns.
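In the latter case the arithmetic is straightforward: one job produces at most performance × walltime/24 ns of trajectory, so the number of jobs is the desired run time divided by that amount, rounded up. A minimal sketch of the calculation (a hypothetical number_of_jobs helper for illustration, not the script’s actual code):

import math

def number_of_jobs(performance, runtime, walltime=24.0):
    # performance: benchmarked speed in ns/d
    # runtime:     desired total simulation time in ns
    # walltime:    wall time limit per job in hours
    ns_per_job = performance * walltime / 24.0
    return math.ceil(runtime / ns_per_job)

# reproduces the example below: 18.166 ns/d for 100 ns at a 24 h wall time limit
print(number_of_jobs(18.166, 100.0))   # -> 6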
Note that your queuing system submission script must contain all the logic to cleanly restart. This depends on your simulation code and how it handles restarts (and is not covered here).
Example
Example use on a machine with PBS/Torque:
qsub_dependents.py -p 18.166 kraken.pbs
-- Will run 6 jobs performing at 18.166 ns/d for desired run time 100 ns.
-- Expected real time (excluding waiting): 5.50479 days
>> qsub kraken.pbs
>> qsub -W depend=afterok:2653842.nid00016 kraken.pbs
>> qsub -W depend=afterok:2653843.nid00016 kraken.pbs
>> qsub -W depend=afterok:2653844.nid00016 kraken.pbs
>> qsub -W depend=afterok:2653845.nid00016 kraken.pbs
>> qsub -W depend=afterok:2653846.nid00016 kraken.pbs
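As the echoed commands show, each job after the first is submitted with an afterok dependency on the job id that the previous qsub call printed on standard output. For PBS/Torque the chaining mechanism can be sketched roughly as follows (a simplified illustration with a hypothetical submit_chain helper, not the actual implementation of qsub_dependents.py):

import subprocess

def submit_chain(script, njobs, after=None):
    # Submit njobs PBS/Torque jobs; each becomes eligible only after the
    # previous one has finished successfully (afterok dependency).
    jobid = after
    for _ in range(njobs):
        cmd = ["qsub"]
        if jobid is not None:
            cmd += ["-W", "depend=afterok:" + jobid]
        cmd.append(script)
        # qsub prints the new job id (e.g. 2653842.nid00016) on stdout
        jobid = subprocess.check_output(cmd).decode().strip()
    return jobid

submit_chain("kraken.pbs", 6)

The -a/--append option described below corresponds to passing the id of an already running job as the initial dependency (the after argument in this sketch).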
Usage
Usage: qsub_dependents.py [options] -- [qsub-options] FILE
Submit a string of dependent jobs through the PBS/TORQUE, Gridengine (GE), or SLURM queuing system. Either set --number or provide the benchmarked performance (--performance); together with the projected run time (--runtime) and the wall time limit (--walltime), the script will compute the number of jobs it needs to launch. The wall time limit given here should agree with (and must not exceed) the limit set in the queuing script. The syntax for PBS, GE, or SLURM is chosen automatically (the differences in dependency syntax are sketched after the option list).
Examples:
qsub_dependents.py -N 5 run.ge
qsub_dependents.py -w 12 -p 15.3 -r 100 -- -l walltime=12:00:00 run.pbs
Adding three more jobs after a running one with jobid 12345.nid000016:
qsub_dependents.py -N 3 -a 12345.nid000016 run.pbs
Options:
- -h, --help: show this help message and exit
- -N N, --number=N: run exactly N jobs in total [3]
- -p PERF, --performance=PERF: job was benchmarked to run at PERF ns/d
- -r TIME, --runtime=TIME: total run time for the simulation in ns [100.0]
- -w TIME, --walltime=TIME: wall time (in hours) allowed on the queue; must not be longer than the wall time set in the queuing script and really should be the same. NOTE: must be provided as a decimal number of hours. [24]
- -a JOBID, --append=JOBID: make the first job dependent on an already running job with job id JOBID. (Typically used in conjunction with --number.)
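The dependency flags the script has to emit differ between the supported queuing systems: PBS/Torque uses qsub -W depend=afterok:JOBID, Gridengine uses qsub -hold_jid JOBID (which waits for the referenced job to finish, without an exact equivalent of afterok), and SLURM uses sbatch --dependency=afterok:JOBID. A rough sketch of the kind of mapping involved (illustrative only, not the script’s actual implementation):

def submit_command(queuing_system, script, jobid=None):
    # Build the submit command for script, optionally dependent on jobid.
    submitters = {"PBS": "qsub", "GE": "qsub", "SLURM": "sbatch"}
    depend = {
        "PBS":   lambda j: ["-W", "depend=afterok:" + j],
        "GE":    lambda j: ["-hold_jid", j],
        "SLURM": lambda j: ["--dependency=afterok:" + j],
    }
    cmd = [submitters[queuing_system]]
    if jobid is not None:
        cmd += depend[queuing_system](jobid)
    return cmd + [script]

print(submit_command("SLURM", "run.slurm", jobid="12345"))
# ['sbatch', '--dependency=afterok:12345', 'run.slurm']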
qsub_dependents.py script
The script is published under the BSD 3-clause licence, so you can copy and modify it as you like, as long as you keep the copyright notice intact.
Download
Download the most recent version of the qsub_dependents.py script by right-clicking this link. This version recognizes the following queuing systems:
- PBS
- Gridengine
- SLURM
Source code
The source code for qsub_dependents.py is publicly available in the git repository Becksteinlab/queuetools.