Using - Getting Started

This guide is for you if you have used HPC systems before and want to start using the Rocket HPC service.  If you need any more information, please consult the HPC Service pages or contact the HPC support team.

If you are new to HPC, please sign up to an introductory course at https://workshops.ncl.ac.uk/public/sage/ or contact the HPC support team for help in getting started.

User environment & application software

The operating system on Rocket's login and compute nodes is CentOS 7.

Text editors include emacs, nano and vi. Use the module command to access other software packages, including compilers.

Command          Purpose                                              Examples
module avail     List available modules (case-insensitive search).   module avail
                 Add --redirect if piping to another command.        module avail python
                                                                     module --redirect avail | grep -vi python
module spider    List available modules, including hidden items.     module spider
                 Display information about modules.                  module spider zlib
module load      Load module(s) - default version/build.             module load Python
                 Load module(s) - named version/build.               module load Python MATLAB
                                                                     module load Python/2.7.14-foss-2017b
module list      List currently loaded modules.                      module list
module unload    Unload module(s).                                   module unload Python
module purge     Unload all loaded modules.                          module purge

The module man page has further information.
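For example, a typical way to find and load a specific build is shown below (the Python version is taken from the examples above; the versions available on Rocket may differ):

module avail python
module load Python/2.7.14-foss-2017b
module list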

Most software on Rocket is installed centrally by NUIT, but some applications are the responsibility of other staff and access may be controlled.  You may also install software in your own directories or in shared project space.  It is your responsibility to abide by any licence terms and conditions of software that you install this way.  Please contact the HPC support team if you have software queries or requests.

Programming tools

Modules for programming tools include:

Compiler/tool suite                          Module name
Intel Parallel Studio XE Cluster Edition     intel
Intel VTune performance profiler             VTune
PGI (Portland) Professional Fortran/C/C++    PGI
OpenMPI                                      OpenMPI
GNU compilers                                GCC
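As an illustrative sketch only, an MPI program could be built by loading the relevant modules from the table above and calling the MPI compiler wrapper; the source and binary names here are placeholders:

module load GCC OpenMPI
mpicc -O2 my_mpi_program.c -o my_mpi_program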

Notes for MPI users:

1. Intel MPI on Rocket is configured to work with the SLURM 'srun' command rather than mpirun.  We recommend that you use srun for all Intel MPI batch jobs.  If you do need to use the Intel mpirun command, you will need to:

unset I_MPI_PMI_LIBRARY
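For instance, a minimal Intel MPI batch script using srun might look something like the sketch below; the partition, task count and program name are placeholders rather than recommended settings:

#!/bin/bash
#SBATCH -p defq
#SBATCH -n 44
module load intel
srun ./my_mpi_program
# only if you must use mpirun instead of srun:
#   unset I_MPI_PMI_LIBRARY
#   mpirun ./my_mpi_program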

2. PGI and GNU OpenMPI jobs will run with either mpirun or srun.  You may need to include the option --mpi=pmi2 in your srun command line, e.g.:

srun -n 2 --mpi=pmi2 a.out
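By comparison, a GNU/OpenMPI job can be launched with either command and needs no extra environment settings; a rough sketch (module names and task count are placeholders) is:

#!/bin/bash
#SBATCH -p defq
#SBATCH -n 44
module load GCC OpenMPI
mpirun ./a.out
# or: srun --mpi=pmi2 ./a.out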

 

Running jobs

Use the login nodes for lightweight tasks such as editing code, submitting jobs and managing files.  Intensive computations on the login nodes can have a serious impact on other people's work and may be killed; submit them instead to the compute nodes via the resource management system, SLURM.

SLURM has been set up with the following partitions (queues).  Jobs will be queued by default in the defq partition.

Partition (queue)   Nodes               Max concurrent   Time limit (wallclock)       Default time limit (wallclock)   Default memory per core
defq                standard            528 cores        2 days                       2 days                           2.5 GB
bigmem              medium, large, XL   2 nodes          2 days (*)                   2 days                           11 GB
short               all                 2 nodes          10 minutes                   1 minute                         2.5 GB
long                standard            2 nodes          30 days                      5 days                           2.5 GB
power (**)          power               1 node           2 days                       2 days                           2.5 GB
interactive         all                 1 node           1 day or 2 hours idle time   2 hours                          2.5 GB

(*) Contact the Rocket team if you need to run longer jobs on the bigmem partition.

(**) The single node in this partition is a GPU resource based on the POWER9 architecture.  Jobs in this partition should specify their GPU requirements using the SLURM directive --gres=gpu:<number>, where <number> = 0-4.
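For example, a job requesting two of the GPUs might start with directives like the following sketch; everything apart from the partition name and the --gres option is a placeholder:

#!/bin/bash
#SBATCH -p power
#SBATCH --gres=gpu:2
#SBATCH -t 02:00:00
module load my_application_module
srun ./my_gpu_program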

When you submit jobs through SLURM, you may:

  • run jobs on up to 528 cores concurrently 
  • have up to 10000 jobs in SLURM, either queued or running, at any one time
  • submit a job array with up to 10000 elements (a minimal job-array example is sketched below)
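For instance, assuming myscript.sh is a batch script, an array of 100 elements can be submitted as follows; inside each task the environment variable SLURM_ARRAY_TASK_ID identifies the element:

sbatch --array=1-100 myscript.sh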

A brief summary of SLURM commands is given below.  The SLURM project maintains a longer command summary page, and its Rosetta Stone page gives a set of translations between SLURM and PBS/Torque, SGE, LSF and LoadLeveler.

The sample job scripts page has examples of different types of job and common SLURM options.  Most SLURM commands have an extensive set of options, detailed on the man pages. 

Command    Purpose                                                      Example
sbatch     Submit batch job                                             sbatch myscript.sh
srun       Run interactive job.                                         srun --pty /bin/bash
           Run parallel job within existing SLURM allocation.           srun -c 22 my_parallel_program
                                                                        srun my_parallel_program
salloc     Allocate resources on which to run commands interactively    salloc -n 4 -N 1-1
squeue     List queued and running jobs.  See also sacct --allusers.    squeue
                                                                        squeue -u my_username
sinfo      Cluster status summary                                       sinfo
scontrol   Display configuration or job specification.                  scontrol show partition defq
           Modify specifications for queued job.                        scontrol update job job_ID part=bigmem
sstat      Display job status                                           sstat -j job_ID --allsteps
sacct      Display job accounting information and resource usage        sacct
                                                                        sacct -S month/day
                                                                        sacct -j job_ID -o cputime,usercpu
                                                                        sacct -A my_project --allusers
scancel    Cancel jobs                                                  scancel job_ID
                                                                        scancel -u my_username
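Putting these together, a simple batch script might look something like the sketch below (all directives and names are illustrative; see the sample job scripts page for maintained examples), submitted with sbatch myscript.sh:

#!/bin/bash
#SBATCH -p defq
#SBATCH -c 1
#SBATCH -t 01:00:00
module load Python
python my_analysis.py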

 

 

Storage space on Rocket

There are 3 areas, detailed below, where you may store files:

  • The Lustre filestore, /nobackup 
  • Your Rocket home directory
  • Temporary storage on each compute node, $TMPDIR

No user files on Rocket are backed up. It is your responsibility to back up important files.

The University filestore (RDW) provides secure, longer-term storage and is mounted on the Rocket login nodes (not compute nodes) as /rdw.
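For example, to stage data from RDW into project space on /nobackup you could run something like this on a login node (the RDW path and project code below are placeholders for your own):

rsync -av /rdw/my_rdw_share/input_data/ /nobackup/proj/project_code/input_data/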

While Rocket is configured with data privacy in mind, the security of your data is your responsibility.  Contact the Rocket team if you have particular concerns.

The University's Research Data Service has further information about Research Data Management and the handling of personal or sensitive data.

 

Fast storage on /nobackup

Rocket has a 500TB Lustre parallel filestore, mounted as /nobackup.  Each HPC project has a directory /nobackup/proj/project_code in which files can be shared between project members.

Each user also has a personal directory, /nobackup/user_name.

Your use of /nobackup is not limited by a quota. However, to keep overall use under control, we have some simple policies:

  • Any file that has not been accessed for 3 months will be deleted automatically (one way to check access times is sketched after this list)
  • You will be warned 3 weeks before deletion and again 1 week before your files are deleted
  • If /nobackup becomes too full, the HPC support team may remove some files belonging to users or projects whose use is excessive.  This may happen at short notice or immediately.
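As a quick way to see which of your files are approaching the limit, you could run something like this on a login node (the 90-day threshold approximates the 3-month policy; replace user_name with your own username):

find /nobackup/user_name -type f -atime +90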

Home space

Your Rocket home space is accessed via NFS and has a quota of 40 GB. Old files are not removed from your home directory; however, they are also not backed up.

Compute-node scratch storage

A job-specific directory, $TMPDIR, is created on allocated compute nodes at the start of a job and is deleted when the job ends.  Use this space e.g. for files that are needed only during a job’s execution.  Scratch space on a node is shared between jobs and cannot be reserved; consider allocating whole nodes for jobs with large scratch-space requirements.

Nodes       Scratch space
Standard    469 GB
Medium      1.1 TB
Large       7.2 TB
XL          8.7 TB
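As a sketch of the usual pattern (file and directory names are placeholders), a job script might copy its input to $TMPDIR, work there, and copy results back before the job ends:

cp /nobackup/proj/project_code/input_file $TMPDIR
cd $TMPDIR
./my_program input_file
cp results_file /nobackup/proj/project_code/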