# JADE2 GPU-based High-Performance Cluster
JADE2 is an EPSRC-funded Tier 2 regional High-Performance Computing cluster based on GPUs. It is intended to support Artificial Intelligence research only. The compute nodes are NVIDIA DGX MAX-Q Deep Learning System platforms. The cluster has 63 servers, each containing 8 NVIDIA Tesla V100 GPUs linked by NVIDIA's NVLink interconnect technology. Newcastle University is a member of the consortium of institutions sharing this resource.
## Requesting an account as a Newcastle University researcher
Users first need to create a Hartree SAFE web account:
https://um.hartree.stfc.ac.uk/hartree/signup.jsp
Also, please sign up for a Hartree ServiceNow account. This will give you access to support and information for JADE2:
https://stfc.service-now.com/hartreecentre
Further information on how to use the machine can be found via ServiceNow:
https://stfc.service-now.com/kb?id=kb_search&kb_knowledge_base=ad6a44fc1b609050dbb77449cd4bcb96
Once you have a SAFE account, log in and you will see a "Request to join project" button on your user dashboard. Click on it and look for the project "J2AD006: JADE2 Access for Newcastle".
It will ask for a password; please use MbcSNitB2!
After completing this process, a request will be sent to Newcastle's representative on JADE2. Once it is approved, you will receive a notification with the username you have been assigned to access the system.
## Accessing JADE2 and setting up your computing environment
The gateway node's hostname is jade2.hartree.stfc.ac.uk, which you access via SSH, as with BEDE.
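For example, replacing <username> with the username you were assigned:
>ssh <username>@jade2.hartree.stfc.ac.uk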
JADE2 manages libraries with the module system, the same as on Rocket or BEDE. To run TensorFlow/Keras code, you can simply load its module: "module load tensorflow".
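For example (module names can change between software updates, so use module avail to check what is currently installed):
>module avail           # list the modules available on the system
>module load tensorflow # load the TensorFlow/Keras stack
>module list            # confirm which modules are currently loaded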
For libraries not installed there, the easiest option is to set up your own Anaconda environment. Anaconda is available on the system: "module load python/anaconda3". Afterwards, create your own environment and install packages through conda, as sketched below.
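A minimal sketch of creating and using a personal conda environment; the environment name myenv, the Python version and the packages are placeholders, so substitute whatever your project needs:
>module load python/anaconda3
>conda create --name myenv python=3.9   # create a personal environment
>source activate myenv                  # activate it (or "conda activate myenv" if conda is initialised)
>conda install numpy pandas             # install the packages you need
>python -c "import numpy; print(numpy.__version__)"   # quick check that the environment works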
## Running jobs in JADE2
JADE2 uses the Slurm cluster scheduler, the same one used on Rocket and BEDE. This means that you don't run your programs directly: you wrap them in a bash shell script (as below) containing some special lines that specify the resources you are requesting, then hand the script to the scheduler, which runs the job on a node once the resources become available.
Here is an example using the standard Keras MNIST tutorial script from https://raw.githubusercontent.com/keras-team/keras-io/master/examples/vision/mnist_convnet.py
>#!/bin/bash
>
>#SBATCH --time=1:0:0      # Run for a maximum of 1 hour
>#SBATCH --nodes=1         # Request a single node
>#SBATCH --gres=gpu:1      # Request one GPU
>#SBATCH --partition=small # Use the "small" partition
>
># Run commands:
>echo "Job running at $(hostname) and starting at $(date)"
>nvidia-smi # Display the available GPU resources
>
>module load tensorflow
>python3 mnist_convnet.py
>
>echo "Job ending at $(date)"
To submit the job, use the sbatch command with the script name as an argument. To monitor how the job is progressing, use squeue. To see an accounting of your previous jobs, use sacct. See the documentation mentioned above for the full syntax of these commands.
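For example, assuming the script above is saved as mnist_job.sh (the file name is just an illustration):
>sbatch mnist_job.sh   # submit the job; Slurm prints the assigned job ID
>squeue -u $USER       # check the state of your queued and running jobs
>sacct -j <jobid>      # show accounting information for a given job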