Troubleshooting

My job is stuck in the queue

  • Run e.g. 'squeue -j 12345' for information on why job number 12345 has not started
  • Check that your resource requests can be fulfilled.  A 5-day job will never start if it is in the default queue.
  • Some problems can be resolved without cancelling the job, e.g. 'scontrol update jobid=12345 timelimit=01:00:00' reduces its time limit.
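
In practice, the first and last steps above might look like this (job ID 12345 is a placeholder):

```shell
# The REASON column (e.g. Resources, Priority, PartitionTimeLimit)
# explains why the job is still pending
squeue -j 12345 -l

# Shrink the job's time limit so it fits within the partition limit,
# without cancelling and resubmitting
scontrol update jobid=12345 timelimit=01:00:00
```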

My job has failed

  • Check your job error and output files (normally called slurm-<job_ID>.out) for clues
  • Check that any #SBATCH directives are at the top of your job script, before any other commands, or they will be ignored
  • Check that any #SBATCH directives are correct
  • Check that your filepaths are appropriate to Rocket
  • Check that your job submission script includes any modules needed
  • If you transferred any text files from Windows, you may need to run the command 'dos2unix myfilename' to convert them to Linux format
  • Use the command 'sacct' to check your job's resource requests and usage, e.g. 'sacct -j <jobid> -o reqmem,maxrss,alloccpus,timelimit,elapsed,exitcode'
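
The line-ending check in particular can be done entirely from the shell; myjob.sh is a throwaway example name here:

```shell
# Simulate a script transferred from Windows (CRLF line endings)
printf '#!/bin/bash\r\necho hello\r\n' > myjob.sh

# Spot the carriage returns ('file myjob.sh' would also report
# "CRLF line terminators")
grep -c "$(printf '\r')" myjob.sh    # prints 2 (both lines end in CR)

# 'dos2unix myjob.sh' converts in place; if it is not installed,
# stripping the carriage returns with sed has the same effect
sed -i 's/\r$//' myjob.sh
```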

My job runs slowly

  • For multicore jobs, check the number of cores allocated in SLURM ('sacct -j <jobid>') and check that the application is distributed across the cores as expected.
  • Use the minimum number of nodes that can accommodate your job.  For example, if a job needs 88 cores, constrain it to run on 2 nodes using '#SBATCH -N 2-2' or '#SBATCH --exclusive'
  • Applications such as GROMACS run significantly faster if scheduled carefully.  Contact your local IT officer or the IT service desk for specific advice.
  • Avoid running large numbers of very short jobs (lasting e.g. a few minutes).  Bundling tasks into larger jobs will avoid overheads and is kinder to SLURM
  • Consider copying files to local /scratch before working on them
  • Avoid large numbers of small I/O requests
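
The local /scratch advice can be combined into a job script along these lines; this is only a sketch, and the paths, program name and resource requests are placeholders:

```shell
#!/bin/bash
# Hypothetical job script illustrating the local-scratch pattern
#SBATCH -N 1
#SBATCH -t 02:00:00

# One bulk copy of the input to node-local /scratch, instead of many
# small reads over the shared filesystem
mkdir -p /scratch/$USER
cp /nobackup/myuser/input.dat /scratch/$USER/
cd /scratch/$USER

myprogram input.dat > results.dat

# Copy results back before the job ends; local /scratch is per-node
# and is not visible after the job finishes
cp results.dat /nobackup/myuser/
```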

Why didn’t I get an email when my job finished/started/failed?

  • By default, SLURM doesn’t send email notifications.  To change that, add this line to your job script:

#SBATCH --mail-type=ALL

Where are the standard error and standard output files for my job?

  • By default, SLURM writes a single file combining stderr and stdout.  The file is stored in the directory from which you submitted the job and is called slurm-<jobID>.out, where <jobID> is the job number.  To change this behaviour, add lines such as these to your job script:

#SBATCH -o my_filename.out

#SBATCH -e my_filename.err

How can I see my jobs' resource usage?

The sacct and sstat commands have numerous options that report on job resources.  They are described fully on their man pages.  For example:

sacct -j <jobid> -o timelimit,elapsed,usercpu,systemcpu,reqmem,maxrss
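
sacct reads the accounting database and so covers finished jobs; for a job that is still running, sstat reports live figures per job step, e.g.:

```shell
# Live memory and CPU usage for a running job's steps
sstat -j <jobid> --format=jobid,maxrss,avecpu
```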

Contact the IT service desk if you need to know more about the behaviour of a job.

Why can't I access /rdw?

Access to /rdw requires a Kerberos authentication ticket.  If you log in using ssh keys, you will not have a ticket; generate one with the command 'kinit'.  The 'klist' command will list any active tickets.

Note that /rdw is accessible only from the login nodes, not compute nodes.
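
A typical sequence on a login node looks like this (kinit prompts for your password):

```shell
# Generate a Kerberos ticket
kinit

# List active tickets; check the 'Expires' column before long transfers
klist
```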

How can I copy several terabytes of data to /rdw without it timing out?

You need to take two steps to avoid timeouts.   Firstly, you will periodically need to renew your Kerberos authentication ticket, which controls your access to /rdw and expires after 10 hours.  The 'krenew' command will do the renewal automatically for up to a week.   Secondly, to stop your process being killed if you are logged out, run it within a tmux session, then detach from your login.  You can reattach later if necessary.  An example session might look like this:

Start a new tmux session:

tmux

Run the command to copy your data to /rdw: 

krenew -v -- bash -c 'rsync -trv /nobackup/myuser/ /rdw/myshare/ >> mylogfile'

Detach the tmux session

<Ctrl-b d>          

If necessary, start tmux again and attach to your previous session:

tmux attach

Why does 'du' give different sizes for my dataset on /nobackup and /rdw?

The du command reports disk blocks allocated, which on /rdw can differ considerably from the logical size of your files.  Add the option '--apparent-size' to report the logical size instead.
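
The difference is easy to demonstrate with a sparse file, which has a large logical size but few allocated blocks; demo.dat is a throwaway name and the commands can be run in any writable directory:

```shell
# Create a 1 GB sparse file: large apparent size, almost no blocks
truncate -s 1G demo.dat

du -h demo.dat                   # blocks allocated on disk (near zero)
du -h --apparent-size demo.dat   # logical size: 1.0G
```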

Why won't mpirun work?  

SLURM's srun command, which is the command we recommend for running MPI code, conflicts with Intel mpirun; other MPI implementations are unaffected.  If you need to use Intel mpirun, first type:

unset I_MPI_PMI_LIBRARY
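
Inside a job script the workaround might look like this sketch (the task count and program name are placeholders):

```shell
# Clear the PMI library setting that srun relies on, then launch
# with Intel mpirun as normal
unset I_MPI_PMI_LIBRARY
mpirun -np 4 ./a.out
```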

Why won't srun work?

OpenMPI programs compiled with PGI or GNU need the option '--mpi=pmi2' on the srun command line, e.g.:

srun -n 2 --mpi=pmi2 a.out
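
Put together in a job script, this might look like the following sketch; the node and task counts and the module name are placeholders, so adjust them for your program:

```shell
#!/bin/bash
# Hypothetical job script for an OpenMPI program built with the GNU
# compilers
#SBATCH -N 2
#SBATCH -n 88

module load OpenMPI

# PGI- and GNU-built OpenMPI programs need the pmi2 plugin under srun
srun -n 88 --mpi=pmi2 ./a.out
```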