HPC quick-start guide
This guide is for you if you have used HPC systems before and want to start using the Rocket HPC service. If you need any more information, please consult the HPC Service pages or contact the HPC support team.
Getting an account
To access Rocket, you need to be a member of at least one registered HPC research project.
Rocket projects do not need to be funded to qualify for registration. If you would like to use HPC in your research and are a permanent member of staff, simply register your project online. You will be notified when your registration has been processed, normally within a few days.
Rocket accounts are generated automatically for the project PI and any secondary contact named on the registration form. These two people can also add and remove additional project members, whose Rocket accounts will then be created within 15 minutes. There are additional help pages on managing HPC projects.
Please read the Rocket Code of Conduct before you first log in.
Log in using your NUIT username and password.
Mac and Linux users do not need to install anything.
Only computers on the Newcastle University campus network can connect directly to Rocket. From anywhere else you will need to set up a 'tunnel'.
- Windows: from a classroom PC, use PuTTY. From office PCs, you may need to download an SSH client such as PuTTY first. In both cases, create an SSH connection to rocket.hpc.ncl.ac.uk.
- Linux/Mac desktops or the timesharing service 'aidan' (unix.ncl.ac.uk): in a terminal window, type 'ssh rocket.hpc'
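From off campus, one common way to tunnel is to jump through a campus host you can already reach. The sketch below assumes unix.ncl.ac.uk ('aidan') is reachable from off campus and accepts your NUIT credentials; check the HPC Service pages for the supported method.

```shell
# Sketch: off-campus access by jumping through a campus host (assumption:
# unix.ncl.ac.uk is reachable from off campus with your NUIT credentials).
ssh -J my_username@unix.ncl.ac.uk my_username@rocket.hpc.ncl.ac.uk

# Alternatively, keep it in ~/.ssh/config so that 'ssh rocket' just works:
#   Host rocket
#       HostName rocket.hpc.ncl.ac.uk
#       User my_username
#       ProxyJump my_username@unix.ncl.ac.uk
```

The `-J` (ProxyJump) option requires a reasonably recent OpenSSH client.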
User environment & application software
The operating system on Rocket's login and compute nodes is CentOS 7.
Text editors include emacs, nano and vi. Use the module command to access other software packages, including compilers.
|Command||Function||Examples|
|module avail||List available modules (case-insensitive search); add --redirect if piping to another command||module avail python ; module --redirect avail | grep -vi python|
|module spider||List available modules, including hidden items; display information about modules||module spider zlib|
|module load||Load module(s), default or named version/build||module load Python ; module load Python MATLAB ; module load Python/2.7.14-foss-2017b|
|module list||List currently loaded modules||module list|
|module unload||Unload module(s)||module unload Python|
|module purge||Unload all loaded modules||module purge|
The module man page has further information.
Most software on Rocket is installed centrally by NUIT, but some applications are the responsibility of other staff and access may be controlled. You may also install software in your own directories or in shared project space. It is your responsibility to abide by any licence terms and conditions of software that you install this way. Please contact the HPC support team if you have software queries or requests.
Modules for programming tools include:
|Compiler/tool suite||Module name|
|Intel Parallel Studio XE Cluster Edition||intel|
|Intel VTune performance profiler||VTune|
|PGI (Portland) Professional Fortran/C/C++||PGI|
Notes for MPI users:
1. Intel MPI on Rocket is configured to work with the SLURM 'srun' command rather than mpirun. We recommend that you use srun for all Intel MPI batch jobs. If you do need to use the Intel mpirun command, additional setup is required.
2. PGI and GNU OpenMPI jobs will run with either mpirun or srun. You may need to include the option --mpi=pmi2 on your srun command line, e.g.:
|srun -n 2 --mpi=pmi2 a.out|
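In a batch job, that srun line sits inside a script submitted with sbatch. A minimal sketch for an OpenMPI program follows; the module name, resource values and executable name are illustrative, not prescribed.

```shell
#!/bin/bash
# Minimal MPI batch-job sketch (job name, resources and module name are
# illustrative assumptions -- check 'module avail' for the real names).
#SBATCH --job-name=mpi_test
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --time=00:10:00

module load OpenMPI            # assumed module name; adjust to your toolchain

# PGI and GNU OpenMPI programs run under srun with the pmi2 interface:
srun -n 2 --mpi=pmi2 a.out
```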
Use the login nodes for lightweight tasks such as editing code, submitting jobs and managing files. Intensive computations on the login nodes can seriously affect other users' work and may be killed; submit them instead to the compute nodes via the resource management system SLURM.
SLURM has been set up with the following partitions (=queues). Jobs will be queued by default in the defq partition.
|Partition (queue)||Nodes||Max concurrent ||Time limit (wallclock)||Default time limit (wallclock)||Default memory per core|
|defq||standard||528 cores||2 days||2 days||2.5 GB|
|bigmem||medium,large,XL||2 nodes||2 days(*)||2 days||11 GB|
|short||all||2 nodes||10 minutes||1 minute||2.5 GB|
|long||standard||2 nodes||30 days||5 days||2.5 GB|
|power(**)||power||1 node||2 days||2 days||2.5 GB|
|interactive||all||1 node||1 day or 2 hours idle time||2 hours||2.5 GB|
(*) contact the Rocket team if you need to run longer jobs on the bigmem partition
(**) the single node in this partition is a GPU resource and is based on POWER9 architecture. Jobs in this partition should specify their GPU requirements using the SLURM directive --gres=gpu:<number> where <number>=0-4.
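A job for the power partition might look like the sketch below; the module name and executable are assumptions for illustration.

```shell
#!/bin/bash
# Sketch of a GPU job on the power partition (values are illustrative).
#SBATCH --partition=power
#SBATCH --gres=gpu:2           # request 2 of the node's 4 GPUs
#SBATCH --time=01:00:00

module load CUDA               # assumed module name; check 'module avail'

# Note: this node is POWER9, so executables must be built for that
# architecture, not copied from an x86 system.
./my_gpu_program               # placeholder for your own executable
```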
When you submit jobs through SLURM, you may:
- run jobs on up to 528 cores concurrently
- have up to 10000 jobs in SLURM, either queued or running, at any one time
- submit a job array with up to 10000 elements
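Job arrays are submitted with a single script that SLURM runs once per array element. A minimal sketch, with illustrative values:

```shell
#!/bin/bash
# Sketch of a job array (arrays of up to 10000 elements are allowed).
#SBATCH --partition=defq
#SBATCH --array=1-100          # elements 1..100; each runs this script once
#SBATCH --time=00:30:00

# SLURM_ARRAY_TASK_ID distinguishes the elements, e.g. one input file each
# (the input-file naming scheme here is an assumption for illustration):
./my_program input_${SLURM_ARRAY_TASK_ID}.dat
```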
A brief summary of SLURM commands is given below. The SLURM project maintains a longer command summary page, and its Rosetta Stone page gives translations between SLURM and PBS/Torque, SGE, LSF and LoadLeveler.
The sample job scripts page has examples of different types of job and common SLURM options. Most SLURM commands have an extensive set of options, detailed on the man pages.
|Command||Function||Examples|
|sbatch||Submit batch job||sbatch myscript.sh|
|srun||Run interactive job||srun -c 22 my_parallel_program|
|salloc||Allocate resources on which to run commands interactively||salloc -n 4 -N 1-1|
|squeue||List queued and running jobs; see also sacct --allusers||squeue ; squeue -u my_username|
|sinfo||Cluster status summary||sinfo|
|scontrol||Display configuration or job specification; modify specifications for a queued job||scontrol show partition defq ; scontrol update job job_ID part=bigmem|
|sstat||Display job status||sstat -j job_ID --allsteps|
|sacct||Display job accounting information and resource usage||sacct ; sacct -S month/day ; sacct -j job_ID -o cputime,usercpu ; sacct -A my_project --allusers|
|scancel||Cancel jobs||scancel job_ID ; scancel -u my_username|
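The commands above fit together as in the sketch below, where my_username and job_ID are placeholders and the script contents are illustrative.

```shell
# Minimal end-to-end sketch (script contents, times and IDs illustrative).
cat > myscript.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=short
#SBATCH --time=00:05:00
hostname
EOF

sbatch myscript.sh         # prints 'Submitted batch job <job_ID>'
squeue -u my_username      # watch the job queue and run
sacct -j job_ID            # accounting information once it has run
```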
There are three areas, detailed below, where you may store files:
- The Lustre filestore, /nobackup
- Your Rocket home directory
- Temporary storage on each compute node, $TMPDIR
No user files on Rocket are backed up. It is your responsibility to back up important files.
The University filestore (RDW) provides secure, longer-term storage and is mounted on the Rocket login nodes (not compute nodes) as /rdw.
While Rocket is configured with data privacy in mind, the security of your data is your responsibility. Contact the Rocket team if you have particular concerns.
Fast storage on /nobackup
Rocket has a 500TB Lustre parallel filestore, mounted as /nobackup. Each HPC project has a directory /nobackup/proj/project_code in which files can be shared between project members.
Each user also has a personal directory /nobackup/user_name.
Your use of /nobackup is not limited by a quota. However, to keep overall use under control, we have some simple policies:
- Any file that has not been accessed for 3 months will be deleted automatically
- You will be warned 3 weeks before deletion and again 1 week before your files are deleted
- If /nobackup becomes too full, the HPC support team may remove some files belonging to users or projects whose use is excessive. This may be at short notice or immediate.
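To see which of your files are approaching the deletion threshold, a find over access times can help; the 90-day value below is an approximation of the 3-month policy, and the path assumes your personal directory is named after your username.

```shell
# List files in your personal /nobackup directory that have not been
# accessed for more than 90 days (roughly the 3-month deletion threshold):
find /nobackup/$USER -type f -atime +90
```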
Home directories
Your Rocket home space is accessed via NFS and has a quota of 40 GB. Old files are not removed from your home directory, but they are not backed up either.
Compute-node scratch storage
A job-specific directory, $TMPDIR, is created on allocated compute nodes at the start of a job and is deleted when the job ends. Use this space e.g. for files that are needed only during a job’s execution. Scratch space on a node is shared between jobs and cannot be reserved; consider allocating whole nodes for jobs with large scratch-space requirements.
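A typical pattern is to stage data through $TMPDIR so that heavy I/O hits the node-local disk rather than the shared filestore. A sketch, with illustrative file and program names:

```shell
#!/bin/bash
# Sketch: stage data through node-local scratch (paths/names illustrative).
#SBATCH --partition=defq
#SBATCH --time=02:00:00

cp input.dat "$TMPDIR"/               # copy input to fast local scratch
cd "$TMPDIR"
./my_program input.dat > output.dat   # heavy I/O happens on the local disk

# $TMPDIR is deleted when the job ends, so copy results back first
# (SLURM_SUBMIT_DIR is the directory the job was submitted from):
cp output.dat "$SLURM_SUBMIT_DIR"/
```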