Condor - guide to suitability
What is Condor?
Condor is a High Throughput Computing (HTC) environment that makes use of spare resources on open-access computing clusters across the university campus. It is a non-interactive batch system: compute 'jobs' are submitted to a work-scheduling system, which splits the task into 'chunks' for completion.
What Condor IS good for
- Large numbers of independent jobs that need to process different sets of data
- Large numbers of independent jobs that need to sweep through a range of values for one or more parameters
- Statically linked programs - the job defines all the executable code
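As an illustration of the parameter-sweep case, here is a minimal Python sketch of one way to map a job number onto a parameter combination, so that every job can run the same program and differ only in its index. The names `GRID`, `params_for_job` and `job_count`, and the parameter names and values, are hypothetical. The sketch assumes each job is told its own index on the command line (in a Condor submit description this is commonly done with the `$(Process)` macro).

```python
import itertools

# Hypothetical parameter ranges for illustration; substitute your own.
GRID = {
    "temperature": [280, 300, 320],
    "pressure": [1.0, 2.0],
}


def job_count(grid=GRID):
    """Number of independent jobs needed to cover every combination."""
    n = 1
    for values in grid.values():
        n *= len(values)
    return n


def params_for_job(index, grid=GRID):
    """Return the parameter combination assigned to job number `index`.

    Names are sorted so that the mapping is the same on every machine.
    """
    names = sorted(grid)
    combos = list(itertools.product(*(grid[name] for name in names)))
    return dict(zip(names, combos[index]))
```

With the grid above, `job_count()` gives 6, so six jobs numbered 0 to 5 would cover the sweep, and each job would call `params_for_job` with its own number to find out which values to use.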
What might be problematic
- Software that processes or generates very large amounts of data (more than 10 GB - but see partitioning below)
- Software that requires a large number of concurrent threads (PCs are 4-core)
- Software with a small number of licences
- Jobs that need to run for longer than 2-6 hours
- Dynamically linked libraries - compatibility can't be ensured in our highly heterogeneous environment
What is NOT suitable
It is possible to rule out certain large classes of job straight away:
- Jobs that require security in any shape or form (software or data): Condor jobs run on other people's computers!
- Jobs that need a guaranteed turnaround (the available resource is only what is not otherwise in use)
- Software that doesn't run on Windows 7
- Interactive software (e.g. a graphical user interface that needs a screen, keyboard or mouse - but see command line below)
- Software that requires very large amounts of RAM (more than 8 GB - but see partitioning below)
- Jobs that need to run for longer than 24 hours (but see partitioning below)
Notes
Command line
Condor jobs must essentially be jobs that can be run at the command line, with no graphical display or user interaction. However, some GUI-based applications also have command-line interfaces, or they may support compilation using command-line-based run-time libraries, in which case it may be possible to run them as Condor jobs.
Partitioning
PC Cluster machines are rebooted at 5am every day. If your job is still running at that time it will be evicted, and all processing up to that point will be lost unless you have arranged otherwise. Long-running jobs are also likely to be pre-empted by user activity.
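One way of "arranging otherwise" is for the program to save its own progress periodically and pick up from the last save when it is restarted. The following is a minimal Python sketch of that idea only. The file name `checkpoint.json`, the function names and the stand-in "work" are all hypothetical. Whether a checkpoint file survives an eviction or reboot depends on how the pool and the job are configured, which this guide does not describe, so check that before relying on it.

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical file name


def load_state(path=CHECKPOINT):
    """Resume from a saved checkpoint if one exists, else start afresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(state, path=CHECKPOINT):
    """Write to a temporary file, then rename it over the checkpoint,
    so an interruption mid-write cannot leave a corrupt checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(n_steps, every=100, path=CHECKPOINT):
    """Do n_steps units of work, saving progress every `every` steps."""
    state = load_state(path)
    for step in range(state["next_step"], n_steps):
        state["total"] += step  # stand-in for the real calculation
        state["next_step"] = step + 1
        if state["next_step"] % every == 0:
            save_state(state, path)
    save_state(state, path)
    return state["total"]
```

If the job is interrupted and later restarted with the checkpoint file in place, `run` continues from the last saved step rather than from step 0, so at most `every` steps of work are repeated.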
For jobs that process large amounts of data it may be possible to restructure them into more jobs, each processing less data. Long-running jobs could be restructured into a series of shorter jobs.
For certain types of job that require large amounts of RAM it may just be possible to restructure the calculation over more jobs, each using a smaller amount of RAM. However, note that this is not possible in general, as many algorithms do require large amounts of RAM to work effectively.
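As a rough illustration of the restructuring described above, here is a minimal Python sketch that estimates how many jobs are needed to keep each one under a data limit, and then divides a set of work items evenly among them. The function names are hypothetical, the 10 GB default is simply the figure quoted earlier in this guide, and the sketch assumes that the work items are independent and of similar size.

```python
import math


def jobs_needed(total_gb, limit_gb=10):
    """Smallest number of jobs such that no job handles more than limit_gb."""
    return max(1, math.ceil(total_gb / limit_gb))


def partition(n_items, n_jobs):
    """Split range(n_items) into n_jobs contiguous (start, stop) slices
    whose sizes differ by at most one item."""
    base, extra = divmod(n_items, n_jobs)
    slices = []
    start = 0
    for job in range(n_jobs):
        # The first `extra` jobs each take one additional item.
        stop = start + base + (1 if job < extra else 0)
        slices.append((start, stop))
        start = stop
    return slices
```

For example, `jobs_needed(35)` gives 4, and `partition(10, 3)` gives `[(0, 4), (4, 7), (7, 10)]`, so job 0 would process items 0 to 3, job 1 items 4 to 6, and job 2 items 7 to 9.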
Other platforms
Our primary resource for Condor is the large number of Windows 7 PCs comprising the PC Clusters. However, other platforms (e.g. Linux) may become available on the Condor grid.