Using - Getting Started
This guide is for you if you have used HPC systems before and want to start using the Rocket HPC service. If you need any more information, please consult the HPC Service pages or contact the HPC support team.
If you are new to HPC, please sign up to an introductory course at https://workshops.ncl.ac.uk/public/sage/ or contact the HPC support team for help in getting started.
User environment & application software
The operating system on Rocket's login and compute nodes is CentOS 7.
Text editors include emacs, nano and vi. Use the module command to access other software packages, including compilers.
Command | Purpose | Examples |
---|---|---|
module avail | List available modules (case-insensitive search). Add --redirect if piping to another command | module avail <br> module avail python <br> module --redirect avail \| grep -vi python |
module spider | List available modules, including hidden items; display information about modules | module spider <br> module spider zlib |
module load | Load module(s), either the default version/build or a named version/build | module load Python <br> module load Python MATLAB <br> module load Python/2.7.14-foss-2017b |
module list | List currently loaded modules | module list |
module unload | Unload module(s) | module unload Python |
module purge | Unload all loaded modules | module purge |
The module man page has further information.
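As a quick illustration, a typical workflow is to search for a package, load a specific version and then check what is loaded. The module names and version below are taken from the examples above; the builds actually installed on Rocket may differ.

```
# Search for available Python builds (case-insensitive)
module avail python

# Load a specific version/build rather than relying on the default
module load Python/2.7.14-foss-2017b

# Confirm what is currently loaded
module list
```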
Most software on Rocket is installed centrally by NUIT, but some applications are the responsibility of other staff and access may be controlled. You may also install software in your own directories or in shared project space. It is your responsibility to abide by any licence terms and conditions of software that you install this way. Please contact the HPC support team if you have software queries or requests.
Programming tools
Modules for programming tools include:
Compiler/tool suite | Module name |
---|---|
Intel Parallel Studio XE Cluster Edition | intel |
Intel VTune performance profiler | VTune |
PGI (Portland) Professional Fortran/C/C++ | PGI |
OpenMPI | OpenMPI |
GNU compilers | GCC |
Notes for MPI users:
1. Intel MPI on Rocket is configured to work with the SLURM 'srun' command rather than mpirun. We recommend that you use srun for all Intel MPI batch jobs. If you do need to use the Intel mpirun command, you will need to:
unset I_MPI_PMI_LIBRARY
2. PGI and GNU OpenMPI jobs will run with either mpirun or srun. You may need to include the option --mpi=pmi2 in your srun command line, e.g:
srun -n 2 --mpi=pmi2 a.out
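Putting the notes above together, a minimal batch script for an MPI job might look like the sketch below. The task count, time limit and program name (my_parallel_program) are illustrative placeholders; adjust them for your own build and code.

```
#!/bin/bash
#SBATCH -n 44                 # number of MPI tasks (placeholder value)
#SBATCH -t 01:00:00           # wallclock time limit

# Intel MPI build: load the intel module and launch with srun, as recommended above
module load intel
srun ./my_parallel_program

# For a PGI or GNU OpenMPI build, either mpirun or srun works; with srun you
# may need the pmi2 plugin:
#   srun --mpi=pmi2 ./my_parallel_program
```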
Running jobs
Use the login nodes for lightweight tasks such as editing code, submitting jobs and managing files. Intensive computations should instead be submitted to the compute nodes via the resource management system, SLURM; running them on the login nodes can seriously affect other users' work, and such processes may be killed.
SLURM has been set up with the following partitions (queues). Jobs are queued in the defq partition by default.
Partition (queue) | Nodes | Max concurrent use per user | Time limit (wallclock) | Default time limit (wallclock) | Default memory per core |
---|---|---|---|---|---|
defq | standard | 528 cores | 2 days | 2 days | 2.5 GB |
bigmem | medium,large,XL | 2 nodes | 2 days(*) | 2 days | 11 GB |
short | all | 2 nodes | 10 minutes | 1 minute | 2.5 GB |
long | standard | 2 nodes | 30 days | 5 days | 2.5 GB |
power(**) | power | 1 node | 2 days | 2 days | 2.5 GB |
interactive | all | 1 node | 1 day or 2 hours idle time | 2 hours | 2.5 GB |
(*) contact the Rocket team if you need to run longer jobs on the bigmem partition
(**) the single node in this partition is a GPU resource and is based on POWER9 architecture. Jobs in this partition should specify their GPU requirements using the SLURM directive --gres=gpu:<number> where <number>=0-4.
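For example, a batch script requesting two GPUs on the power partition might start like the sketch below; my_gpu_program is a placeholder, and any modules your code needs would be loaded before it runs.

```
#!/bin/bash
#SBATCH -p power              # the POWER9 GPU partition
#SBATCH --gres=gpu:2          # request 2 of the node's 4 GPUs
#SBATCH -t 12:00:00           # within the 2-day limit for this partition

./my_gpu_program              # placeholder for a GPU-enabled executable
```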
When you submit jobs through SLURM, you may:
- run jobs on up to 528 cores concurrently
- have up to 10000 jobs in SLURM, either queued or running, at any one time
- submit a job array with up to 10000 elements
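As an illustration of the job array limit, the sketch below submits a 100-element array; the program and input file naming are placeholders. Submit it with sbatch in the usual way and SLURM schedules each element as a separate task.

```
#!/bin/bash
#SBATCH --array=1-100         # 100 elements, well within the 10000-element limit
#SBATCH -t 00:30:00

# Each array element receives its own index in $SLURM_ARRAY_TASK_ID
./my_program input_${SLURM_ARRAY_TASK_ID}.dat
```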
A brief summary of SLURM commands is given below. The SLURM project maintains a longer command summary page, and its Rosetta Stone page gives a set of translations between SLURM and PBS/Torque, SGE, LSF and LoadLeveler.
The sample job scripts page has examples of different types of job and common SLURM options. Most SLURM commands have an extensive set of options, detailed on the man pages.
Command | Purpose | Example |
---|---|---|
sbatch | Submit batch job | sbatch myscript.sh |
srun | Run interactive job | srun --pty /bin/bash <br> srun -c 22 my_parallel_program <br> srun my_parallel_program |
salloc | Allocate resources on which to run commands interactively | salloc -n 4 -N 1-1 |
squeue | List queued and running jobs. See also sacct --allusers. | squeue <br> squeue -u my_username |
sinfo | Cluster status summary | sinfo |
scontrol | Display configuration or job specification; modify specifications for a queued job | scontrol show partition defq <br> scontrol update job job_ID part=bigmem |
sstat | Display job status | sstat -j job_ID --allsteps |
sacct | Display job accounting information and resource usage | sacct <br> sacct -S month/day <br> sacct -j job_ID -o cputime,usercpu <br> sacct -A my_project --allusers |
scancel | Cancel jobs | scancel job_ID <br> scancel -u my_username |
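For instance, a typical submit-and-monitor sequence using the commands above looks like the sketch below; myscript.sh, my_username and job_ID are placeholders.

```
# Submit a job script; sbatch prints the job ID
sbatch myscript.sh

# Check the job's state while it is queued or running
squeue -u my_username

# After it finishes, review the resources it used
sacct -j job_ID -o cputime,usercpu
```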
Storage space on Rocket
There are 3 areas, detailed below, where you may store files:
- The Lustre filestore, /nobackup
- Your Rocket home directory
- Temporary storage on each compute node, $TMPDIR
No user files on Rocket are backed up. It is your responsibility to back up important files. The University filestore (RDW) provides secure, longer-term storage and is mounted on the Rocket login nodes (not compute nodes) as /rdw. While Rocket is configured with data privacy in mind, the security of your data is your responsibility. Contact the Rocket team if you have particular concerns. The University's Research Data Service has further information about Research Data Management and the handling of personal or sensitive data.
Fast storage on /nobackup
Rocket has a 500TB Lustre parallel filestore, mounted as /nobackup. Each HPC project has a directory /nobackup/proj/project_code in which files can be shared between project members.
Each user also has a personal directory /nobackup/user_name.
Your use of /nobackup is not limited by a quota. However, to keep overall use under control, we have some simple policies:
- Any file that has not been accessed for 3 months will be deleted automatically
- You will be warned 3 weeks before deletion and again 1 week before your files are deleted
- If /nobackup becomes too full, the HPC support team may remove some files belonging to users or projects whose use is excessive. This may be at short notice or immediate.
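Because files on /nobackup are not backed up and are removed after 3 months without access, copy anything you need to keep to RDW from a login node. A sketch using rsync follows; the RDW path is a placeholder for your own project area.

```
# Run these on a login node: /rdw is not mounted on the compute nodes

# Copy results from shared project space on /nobackup to RDW for safe keeping
rsync -av /nobackup/proj/project_code/results/ /rdw/path/to/your/area/results/

# Stage data from RDW back onto /nobackup before running jobs against it
rsync -av /rdw/path/to/your/area/inputs/ /nobackup/proj/project_code/inputs/
```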
Home space
Your Rocket home space is accessed via NFS and has a quota of 40 GB. Old files are not removed from your home directory; however, they are not backed up.
Compute-node scratch storage
A job-specific directory, $TMPDIR, is created on each allocated compute node at the start of a job and is deleted when the job ends. Use this space for files that are needed only during a job's execution. Scratch space on a node is shared between jobs and cannot be reserved; consider allocating whole nodes for jobs with large scratch-space requirements.
Nodes | Scratch space |
---|---|
Standard | 469 GB |
Medium | 1.1 TB |
Large | 7.2 TB |
XL | 8.7 TB |
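A common pattern, sketched below, is to stage input into $TMPDIR at the start of the job, work against the fast node-local copy, and copy results back to /nobackup before the job ends and $TMPDIR is deleted. The file and program names are placeholders, and /nobackup/$USER assumes your personal /nobackup directory matches your username.

```
#!/bin/bash
#SBATCH -N 1                  # scratch space is per node, so keep this job on one node
#SBATCH -t 04:00:00

# Stage input into the job-specific scratch directory
cp /nobackup/$USER/input.dat $TMPDIR/
cd $TMPDIR

# Run against the local copy so temporary files stay off the shared filestore
/nobackup/$USER/my_program input.dat    # placeholder executable

# Copy results back before the job ends, when $TMPDIR is deleted
cp results.dat /nobackup/$USER/
```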