In the Lemur Project cluster, the front-end node acts as a management node, directing user network and process traffic to where it needs to go. This means that the front-end can become heavily loaded with processes. For this reason, we have installed the Condor job submission system so that users can run jobs on the back-end nodes instead.
If you are running experiments or other processes that you know will take a long time (i.e. over an hour), please run them through the Condor job submission system. If, for some reason, you cannot use Condor, please "nice" your process to a priority of at least 4 so that other users' processes can timeshare with it.
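As a minimal sketch of the "nice" fallback described above: the snippet below runs a command with its niceness raised by 4 and prints the value it ended up with. The command being run is "nice" itself, used here only as a stand-in; substitute your own program and arguments.

```shell
#!/bin/sh
# "nice" with no arguments prints the niceness of the current shell (normally 0).
base=$(nice)

# Run a command with its niceness raised by 4. The command here is "nice"
# itself, so it just reports the value it is running at.
# For a real job this would look like: nice -n 4 /path/to/your/program args
niced=$(nice -n 4 nice)

echo "shell niceness: $base, job niceness: $niced"
```

A higher niceness value means a lower scheduling priority, so a value of 4 or more lets other users' processes get CPU time ahead of yours.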
Because of the amount of traffic that needs to flow in and out of the front-end node, we encourage everyone to launch their processes via Condor. If we see the front-end node being bogged down by processes that were not started through Condor, we will kill any such process, at any time and without discrimination, to free up memory and/or CPU time on the front-end node.
The basic Condor commands are as follows:
To create a job to be submitted to Condor, a few rules must be followed:
At a minimum, a submit file must have the following:
Optional arguments can include:
Below is a sample submit file called "sample.submit" that I have used to run "IndriBuildIndex":
    ##### start #####
    #
    # Any lines that start with a # are comment lines
    #
    # Sample submit file - runs IndriBuildIndex and saves the output
    #
    # choose the "universe" to run in
    Universe = vanilla
    # set up the executable and any arguments
    Executable = /bos/usr2/mhoy/indri/IndriBuildIndex
    Arguments = /bos/usr2/mhoy/build.parameters
    # set up our log files
    Log = sample.log
    Output = sample.out
    Error = sample.err
    # finally queue the job - this must be last!
    Queue
    #
    ##### done #####
To submit this to Condor, I would type "condor_submit sample.submit". Condor will then launch the job and move it to one or more of the nodes. Note that the directory you submit the job from becomes the job's current working directory. For example, with the submit file above, the log, output and error files would all be created in the directory that I submitted the job from.
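The submission step can be sketched as a short shell session. This is an illustration under stated assumptions, not the cluster's prescribed workflow: the working directory name "condor-sample" and the placeholder executable /bin/hostname are hypothetical, and the script only calls Condor if "condor_submit" is actually installed.

```shell
#!/bin/sh
# Hypothetical working directory; the log, output and error files named in
# the submit file will be created here, because we submit from here.
workdir=${WORKDIR:-$PWD/condor-sample}
mkdir -p "$workdir"
cd "$workdir" || exit 1

# Write a minimal submit file. /bin/hostname is a placeholder executable.
cat > sample.submit <<'EOF'
Universe   = vanilla
Executable = /bin/hostname
Log        = sample.log
Output     = sample.out
Error      = sample.err
Queue
EOF

# Submit only where Condor is installed, so the sketch is harmless elsewhere.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit sample.submit
    condor_q    # list the jobs currently in the queue
else
    echo "condor_submit not found; wrote $workdir/sample.submit only"
fi
```

Giving each experiment its own directory and submitting from inside it keeps the log, output and error files of different runs from overwriting one another.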
When the job has finished, an e-mail will be automatically sent to you.
The submit file must name the Condor "universe" to use. For small jobs, you should use the "vanilla" universe. This is the basic execution environment and should be able to run any job you have. The "standard" universe has some extra advantages, such as process checkpointing and the use of remote system calls. To use the standard universe, you must relink your program with the Condor libraries. See the section on the standard universe on the Condor website for more information on making your programs work with it.
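The relinking step for the standard universe can be sketched as follows, assuming the installed Condor version ships the "condor_compile" wrapper (older releases such as the 6.8 series do). The program name "myprog" and the use of gcc are hypothetical; the relink is skipped when Condor or the source file is not present.

```shell
#!/bin/sh
# Hypothetical program name, for illustration only.
prog=myprog

# Relink with the Condor libraries by prefixing the usual compile command
# with condor_compile. Skipped where Condor or the source file is absent.
if command -v condor_compile >/dev/null 2>&1 && [ -f "$prog.c" ]; then
    condor_compile gcc -o "$prog" "$prog.c"
else
    echo "condor_compile or $prog.c not found; skipping the relink step"
fi

# The submit file then selects the standard universe instead of vanilla.
cat > "$prog.submit" <<EOF
Universe   = standard
Executable = $prog
Log        = $prog.log
Output     = $prog.out
Error      = $prog.err
Queue
EOF
```

Only the Universe line and the relinked executable differ from a vanilla job; the rest of the submit file keeps the same shape.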
For more information on Condor, the Condor home page is: http://www.cs.wisc.edu/condor/
The online manual is available from: http://www.cs.wisc.edu/condor/manual/v6.8/
Some general tutorials for Condor are at: http://www.cs.wisc.edu/condor/tutorials/
Finally, there is a decent tutorial at: http://www.acf.bnl.gov/UserInfo/Software/Condor/effective_condor_v1.doc
The Condor daemons do not run authenticated to AFS; they do not possess AFS tokens. Therefore, no child process of Condor will be AFS authenticated. This means you must set file permissions so that your job can access any files it needs on an AFS volume without relying on your own AFS tokens.
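One way to meet this permission requirement, shown here only as a sketch, is to grant read access on the job's input directory to unauthenticated users with the AFS "fs setacl" command. The directory name is hypothetical (on a real cluster it would be a path under /afs/...), and whether a world-readable ACL is acceptable for your data is an assumption you should check before using it.

```shell
#!/bin/sh
# Hypothetical directory holding the job's input files.
jobdir=${JOBDIR:-$PWD/condor-afs-inputs}
mkdir -p "$jobdir"

# Grant read (r) and lookup (l) rights to system:anyuser so that a process
# without AFS tokens can read the files. This makes the directory readable
# by any AFS client, so keep nothing private in it.
if command -v fs >/dev/null 2>&1; then
    fs setacl -dir "$jobdir" -acl system:anyuser rl
    fs listacl "$jobdir"    # show the resulting access list
else
    echo "AFS 'fs' command not found; no ACL changed on $jobdir"
fi
```

The "rl" rights cover reading only. A job that also has to write its output would need a location it can write to without tokens, which this sketch does not set up.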
If a job you submit to Condor needs to access files residing in AFS, you have the following choices: