Submit a Job

When you submit a job to the job scheduler, you must specify the resources it needs to run. You may also need to specify application licences and the quality of service.

Use the qsub command to submit a job to the job scheduler. You should not run intensive or long-lived processes directly on the login nodes.

You can control the behaviour of qsub using directives, which you can specify either on the command line or in a job submission file. For example:

> qsub <job submission file>

The qsub command returns a unique identifier (jobID) for the job. This is how the system refers to the job during its lifetime.
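As a minimal sketch of a job submission file (the script name myjob.sh, the program ./my_program and the specific resource values are placeholders, not site defaults):

#!/bin/bash
#PBS -l nodes=1:ppn=4        # four cores on a single node
#PBS -l walltime=02:00:00    # two hours of run time
#PBS -l pvmem=2gb            # 2GB of virtual memory per process

cd $PBS_O_WORKDIR            # run from the directory the job was submitted in
./my_program

Submit it with:

> qsub myjob.sh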

You may need to specify the following information when submitting a job:

Resources

The scheduler uses this resource information to find a suitable spare slot on the cluster. There are default values for each resource, and maximum values that can be requested. It is always a good idea to request the minimum amount of a particular resource that your job needs: the more resources requested, the longer the job is likely to be queued waiting for them to become free.

A number of different queues have been set up on the system, allowing fine-grained control of resource limits for jobs of different sizes. You generally do not need to choose a queue, since production jobs are routed automatically to the appropriate one. Jobs are prioritised according to the rules for Process limits, job limits and priorities.

When you submit a job, you may need to specify:

Processor cores

The cluster is built from a number of compute nodes (nodes), each with a number of processor cores (ppn).

The total number of processor cores requested is the product of these two values, expressed as nodes=N:ppn=P (where N and P are appropriate values). Alternatively, jobs can simply request a total number of cores with procs.

  • Serial jobs can only meaningfully request procs=1 or nodes=1:ppn=1
  • OpenMP/threaded jobs can request nodes=1:ppn=P or procs=P (where P is between 1 and 28 for standard compute nodes)
  • MPI jobs can make use of any combination of nodes and ppn values, within the limits dictated by the system size.

Best practice for MPI jobs is to minimise nodes and maximise ppn in order to get the greatest performance; however, this is likely to increase the time the job spends queued.

If you do not need optimal performance, you can instead request a number of cores with procs. The scheduler can then place the job's processes anywhere on the cluster where cores are free, so the job is likely to start sooner but will take longer to finish.
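For example, the following two requests (shown as command-line options; myjob.sh is a placeholder script) each ask for 56 cores, the first packed onto two whole standard nodes, the second placed wherever cores are free:

> qsub -l nodes=2:ppn=28 myjob.sh
> qsub -l procs=56 myjob.sh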

Memory

Each standard compute node has 128GB of physical memory and negligible swap available. Use pvmem to request virtual memory per process, or vmem to request virtual memory for the job as a whole.

  • For MPI parallel jobs, request pvmem
  • For threaded/OpenMP parallel jobs, request vmem
  • For serial jobs you can request either pvmem or vmem since they mean the same in this context

The current default value has been deliberately set very low, so in practice every job should set this resource explicitly.

You can express values for pvmem and vmem in a number of ways. The following are all valid, requesting 2MB, 200MB and 4GB respectively:

pvmem=2048kb
vmem=200mb
pvmem=4gb

pvmem is a per-process value. If a job is submitted with pvmem=20gb and ppn=8, it can only execute on the large memory nodes, as it requires 160GB on a single node (more than the 128GB available on a standard compute node).

You need to use different options for different types of job. The per-process limit of pvmem can be a problem for some threaded applications, such as Gaussian, which claim a large amount of shared memory on a single node. Conversely, vmem is meaningless for parallel jobs that span multiple nodes.
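As an illustration of the two styles (the values are arbitrary), an MPI job spanning two nodes would set a per-process limit:

#PBS -l nodes=2:ppn=28
#PBS -l pvmem=4gb

whereas a threaded job claiming a large amount of shared memory on a single node would set a whole-job limit:

#PBS -l nodes=1:ppn=28
#PBS -l vmem=100gb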

Large memory jobs

We define a large memory job as one that requests more than 100GB of vmem. A large memory job is likely to end up on one of the large memory nodes, and definitely will if >128GB is requested. There are eight large memory nodes each with 1TB of memory and 28 CPU cores. Any type of job (serial, MPI, threaded or array) can also be a large memory job.
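As a sketch (the values are illustrative), a threaded job needing half a terabyte of memory on one of these nodes might request:

#PBS -l nodes=1:ppn=28
#PBS -l vmem=500gb
#PBS -l walltime=100:00:00

Note that the walltime stays within the 168-hour limit that applies to large memory jobs (see Execution time below).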

Execution time

The walltime is the amount of time that the job should run for. Setting a value for walltime is effectively mandatory, as the default value has been deliberately set very low. It can be expressed as a number of seconds, or more usefully as hours, minutes and seconds:

walltime=hh:mm:ss

walltime is the maximum 'real' time that a job will execute for. Once the time is up, the job will be killed.
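For example, to request 48 hours on the command line (myjob.sh is a placeholder):

> qsub -l walltime=48:00:00 myjob.sh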

The maximum amount of time that a job can request depends on the queue used. For most jobs the maximum walltime allowed is 21 days (504 hours). Jobs requiring GPUs, or the special 'parallel' queue, can request a maximum of 7 days (168 hours).

If you submit a large memory job (one where vmem is larger than 100GB), the maximum walltime you can request is also 168 hours.

Beware of upcoming Service Days when submitting jobs. If the requested walltime cannot fit into the time left before the start of the next scheduled Service Day, the job will remain queued until after the Service Day. You can use the time2service command to return the number of seconds remaining:

> time2service
1886393

Use the -h or -d option with time2service to return the remaining time in hours or days respectively (rounded down to the nearest integer) until the next Service Day.

Bear in mind that the value returned by time2service shouldn't be treated as the maximum value that could be requested in walltime, as there is likely to be a delay while the job is queued before execution.
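As a sketch (the 48-hour walltime, the 24-hour queueing headroom and the script myjob.sh are illustrative assumptions), you could check the remaining time before submitting:

WALLTIME_SECS=$((48 * 3600))      # the walltime the job will request
HEADROOM_SECS=$((24 * 3600))      # allowance for time spent queued
if [ $(time2service) -gt $((WALLTIME_SECS + HEADROOM_SECS)) ]; then
    qsub -l walltime=48:00:00 myjob.sh
else
    echo "Too close to the next Service Day; the job would queue until after it"
fi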
