Differences in Job Submission between Jubail and Dalma

This section highlights some of the important changes you will need to make in your job submission scripts when moving from the Dalma HPC to the Jubail HPC.

We shall discuss each of the sections below:

  1. Access

  2. Partitions

  3. Number of Tasks

  4. Memory

  5. Bigmem nodes

  6. GPU nodes

  7. WallTime

  8. Preempt partition

  9. Default Quota

Summary: for each topic below, the Dalma behaviour is described first, followed by the corresponding Jubail behaviour.

Access

Dalma was accessed as follows:

ssh <NetID>@dalma.abudhabi.nyu.edu

Jubail can be accessed as follows:

ssh <NetID>@jubail.abudhabi.nyu.edu

Partitions

Dalma had serial and parallel partitions.

There are no serial and parallel partitions on Jubail.

There is only one generalized partition, called compute.

More info can be found here

Dalma-based job scripts with -p parallel or -p serial specified will still work on Jubail; these lines are filtered out automatically by the job scheduling wrapper.

#SBATCH -p compute is optional (as compute is the default partition).

On Dalma:

#SBATCH -p parallel

On Jubail:

#SBATCH -p compute
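
For illustration, below is a minimal sketch of a Jubail job script (the job name, output file, and executable are placeholders); the partition line can simply be omitted because compute is the default:

#!/bin/bash
#SBATCH -J test_job            # hypothetical job name
#SBATCH -o test_job.%j.out     # hypothetical output file
#SBATCH -n 4                   # small job; the scheduler may route it to the old Dalma nodes
#SBATCH -t 01:00:00            # example wall time
# #SBATCH -p compute is optional since compute is the default partition

srun ./my_program              # placeholder executable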

Maximum number of Tasks/CPUs

Maximum number of CPUs per node was 28 on Dalma.

Maximum number of CPUs per node is 128 on Jubail.

Small jobs (requiring fewer than 28 CPUs) will be sent to the old Dalma nodes by the scheduler.

Medium and large jobs will be prioritized and sent to the Jubail nodes.

Small jobs (fewer than 28 CPUs), mostly Python and R workloads, do not need any change and are supported on Jubail as well.

Medium and large jobs (MPI jobs) need an adjustment: the -n or --ntasks parameter that was a multiple of 28 on Dalma should now be a multiple of 128.

On Dalma, requesting 280 CPUs (28 x 10) required 10 nodes (a parallel job):

#SBATCH -p parallel
#SBATCH -n 280

On Jubail, requesting the nearest multiple of 128 cores to what was requested on Dalma, 256 CPUs (128 x 2), requires only 2 nodes:

#SBATCH -p compute
#SBATCH -n 256
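
As a sketch, a Dalma MPI script adjusted for Jubail might look like the following (the executable name and wall time are placeholders):

#!/bin/bash
#SBATCH -p compute
#SBATCH -n 256                 # a multiple of 128 on Jubail (previously a multiple of 28 on Dalma)
#SBATCH -t 12:00:00            # example wall time

srun ./mpi_application         # placeholder MPI executable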

Memory

Total memory per node was 112GB on Dalma.

Total memory per node is 480GB on Jubail.

Default memory assigned for a job was 4GB per CPU.

Default memory assigned for a job is 3.75GB per CPU.

Max allowed memory per node was 112GB.

Max allowed memory per node is 480GB.

On Dalma:

#SBATCH -p parallel
#SBATCH --mem=80G

On Jubail:

#SBATCH -p compute
#SBATCH --mem=200G
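
If you prefer to think in terms of the per-CPU default (3.75GB per CPU on Jubail), the same request can be expressed with Slurm's per-CPU memory option; a minimal sketch:

#SBATCH -p compute
#SBATCH -n 128                 # one full Jubail node
#SBATCH --mem-per-cpu=3840M    # 3.75GB per CPU, the Jubail default; 128 x 3.75GB = 480GB per node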

BigMem nodes

Large memory nodes were requested using the bigmem partition for memory greater than 112GB.

Large memory nodes are requested using the bigmem partition for memory greater than 480GB.

Dalma had three large memory nodes.

Jubail has four large memory nodes.

#SBATCH -p bigmem was optional on Dalma.

#SBATCH -p bigmem is mandatory on Jubail.

On Dalma, large memory nodes were requested when the required memory was greater than 112GB:

#SBATCH -p bigmem
#SBATCH --mem=200G

On Jubail, large memory nodes are requested ONLY when the required memory is greater than 480GB:

#SBATCH -p bigmem
#SBATCH --mem=700G
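
Putting this together, a minimal sketch of a Jubail big-memory job script (the executable and wall time are placeholders):

#!/bin/bash
#SBATCH -p bigmem              # mandatory on Jubail for jobs needing more than 480GB
#SBATCH --mem=700G
#SBATCH -n 1
#SBATCH -t 12:00:00            # example wall time

srun ./large_memory_app        # placeholder executable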

GPU nodes

Dalma had 14 GPU nodes: 12 nodes with 2 Nvidia V100 GPU cards each, and 2 nodes with 8 V100 GPU cards each.

In addition to the Dalma GPU nodes, Jubail has 24 GPU nodes with one A100 card on each of them.

Dalma had exclusive GPU nodes; hence, only GPU jobs ran on GPU nodes.

Jubail has both exclusive GPU nodes (the V100 nodes from Dalma) and non-exclusive, versatile GPU nodes (the A100 nodes). The A100 nodes can run normal CPU jobs when idle (when no GPU cards are needed), and GPU jobs have a higher priority on them.

Only Nvidia V100 cards were available on Dalma.

On Jubail, users have the option to choose between the Nvidia V100 and the new A100 cards.

By default, GPU jobs will be sent to the V100 GPU nodes.

The users can test the performance differences between the A100 and V100 nodes and decide accordingly.

Since the A100 GPU nodes are non-exclusive, a GPU job may have to wait in the queue for the non-GPU jobs (normal CPU jobs) running on those nodes to finish, although GPU jobs are given priority on them.

You can also specify in your job script that your job should be sent explicitly to the A100 nodes.

When requesting a single GPU on Dalma:

#SBATCH -p nvidia
#SBATCH --gres=gpu:1

The syntax on Jubail for requesting a single GPU is the same as on Dalma:

#SBATCH -p nvidia
#SBATCH --gres=gpu:1

When requesting a single GPU on the new A100 nodes:

#SBATCH -p nvidia
#SBATCH --gres=gpu:a100:1
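
Putting it together, a minimal sketch of a single-GPU job script targeting the A100 nodes (the executable and wall time are placeholders):

#!/bin/bash
#SBATCH -p nvidia
#SBATCH --gres=gpu:a100:1      # one A100 GPU; use --gres=gpu:1 to fall back to the default V100 nodes
#SBATCH -n 1
#SBATCH -t 04:00:00            # example wall time

srun ./gpu_application         # placeholder executable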

WallTime

The maximum wall time on Dalma was linked to the account a user belonged to (physics, students, engineering, etc.).

The maximum wall time on Jubail is linked to the type (size) of the job submitted, irrespective of the account the user belongs to.

The details of the job types and their respective limits can be found here
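
Whatever limit applies to your job type, the requested wall time is set with the standard Slurm time directive; the value below is only an example and must stay within the limit for your job type:

#SBATCH -p compute
#SBATCH -n 128
#SBATCH -t 24:00:00            # requested wall time (example value)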

Preempt partition

The preempt partition is used for quick testing with high job priority and is available to everyone.

The maximum wall time for preempt jobs was 30 minutes on Dalma.

The maximum wall time for preempt jobs on Jubail is 7 days.

More info on this can be found here
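
As a sketch, a quick test job can be sent to this partition as follows (assuming the partition name is preempt, as the heading suggests; the task count and wall time are placeholders):

#SBATCH -p preempt
#SBATCH -n 4
#SBATCH -t 00:30:00            # short test run; Jubail allows preempt jobs up to 7 days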

Default Quota

Dalma had 4 storage systems, $HOME, $SCRATCH, $WORK and $ARCHIVE, each with its own default quota.

Jubail has the same storage systems as Dalma and the same default quotas for all of them except $HOME.

The default $HOME quota on Dalma was 20GB in 400K files.

The default $HOME quota on Jubail is 20GB in 150K files.
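
To see how close you are to the $HOME limits, you can check your usage with standard tools (the cluster may also provide its own quota command; this is only a generic sketch):

du -sh $HOME                   # total space used in $HOME (limit: 20GB)
find $HOME -type f | wc -l     # number of files in $HOME (limit: 150K on Jubail)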