Lemur Project Cluster


Running User Processes


General

In the Lemur Project Cluster, the front-end node acts as a management node, directing user network and process traffic to where it needs to go. This means that the front-end can sometimes become heavily loaded with processes. For this reason, we have installed the Condor job submission system, which allows users to run jobs on the back-end nodes.

Running User Processes on the Front-End Node

If you are running experiments or other processes that you know will take a long time (for example, over an hour), please consider using the Condor job submission system to run your program. If, for some reason, you cannot use Condor, please "nice" your process to a niceness value of at least 4 so that other user processes can timeshare with it.
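As a minimal sketch of the "nice" request above, the commands below start a program at niceness 4 and lower the priority of one that is already running. Here "sleep" is only a stand-in for your own program.

```shell
# Start a long-running command at niceness 4 (a lower scheduling priority).
# "sleep 1" stands in for your real program.
nice -n 4 sleep 1

# Lower the priority of a process that is already running.
sleep 2 &
renice -n 4 -p $!
wait
```

A higher niceness value means a lower priority, so a value larger than 4 is also acceptable.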

Because of the amount of traffic that needs to flow in and out of the front-end node, we encourage everyone to launch their processes via Condor. If we see the front-end node being bogged down by processes that are not run via Condor, we will kill processes as needed, at any time and without warning, to free up memory and/or CPU time on the front-end node.

Condor Commands

The basic Condor commands are as follows:
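A short sketch of the commands most often needed day to day is given here. It assumes the standard Condor command-line tools are on your PATH; the file name sample.submit and the job ID 42.0 are placeholders.

```shell
# Intended to be run on the cluster front-end node.
if command -v condor_q >/dev/null 2>&1; then
    condor_status      # list the machines in the pool and what they are doing
    condor_q           # list the jobs currently in the queue
    # condor_submit sample.submit   # queue the job described in sample.submit
    # condor_rm 42.0                # remove job 42.0 from the queue
else
    echo "Condor tools not found on PATH."
fi
```

The last two commands are left commented out because they change the queue; uncomment them once you have a submit file or a job ID of your own.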

To create a job to be submitted to Condor, a few rules must be followed:

  1. The job cannot run interactively. That is, it cannot wait for user input.
  2. You will need to create a submission file for each job to queue.

At a minimum, a submit file must have the following:

Optional arguments can include:
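As a hedged illustration only, the sketch below writes a submit file that uses a few optional submit commands (Initialdir, Getenv, Notification and Notify_user) and submits it if Condor is available. The file name optional.submit, the directory /tmp, the test executable /bin/hostname and the e-mail address are all placeholders, not values taken from this cluster.

```shell
# Write a submit file that uses a few optional commands.
cat > optional.submit <<'EOF'
Universe     = vanilla
Executable   = /bin/hostname
# run the job from this directory instead of the submit directory
Initialdir   = /tmp
# copy the submitter's environment variables into the job
Getenv       = True
# only send e-mail if the job fails, and send it to this address
Notification = Error
Notify_user  = someone@example.com
Log          = optional.log
Output       = optional.out
Error        = optional.err
Queue
EOF

# Submit it only if Condor is installed on this machine.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit optional.submit
fi
```

With Initialdir set, the relative log, output and error file names are resolved against that directory rather than the directory you submit from.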

Below is a sample submit file called "sample.submit" that I have used to run "IndriBuildIndex":

  ##### start #####
  #
  # Any lines that start with a # are comment lines
  #
  # Sample submit file - runs IndriBuildIndex and saves the output
  #

  # choose the "universe" to run in
  Universe = vanilla

  # set up the executable and any arguments
  Executable = /bos/usr2/mhoy/indri/IndriBuildIndex
  Arguments = /bos/usr2/mhoy/build.parameters

  # set up our log files
  Log = sample.log
  Output = sample.out
  Error = sample.err

  # finally queue the job - this must be last!
  Queue

  #
  ##### done #####
  

To submit this to Condor, I would type "condor_submit sample.submit". Condor will then launch the job and move it to one or more of the nodes. Note that the job's current working directory is the directory from which it was submitted. For example, with the submit file above, the log, output and error files would all be created in the directory that I submitted the job from.

When the job has finished, an e-mail will be automatically sent to you.

As for which Condor "universe" to use: for small jobs, you should use the "vanilla" universe. It runs ordinary, unmodified programs and should be able to run any job you have. The "standard" universe has some extra advantages, such as process checkpointing and the use of remote system calls. To use the standard universe, you must relink your program with the Condor libraries. See the section on the standard universe on the Condor website for more information on making your programs work with it.
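Relinking for the standard universe is normally done by prefixing the usual compile command with condor_compile. The sketch below assumes condor_compile and gcc are installed; hello.c is a throwaway example program, not part of the cluster setup.

```shell
# Create a tiny C program to relink.
cat > hello.c <<'EOF'
#include <stdio.h>
int main(void) { printf("hello from Condor\n"); return 0; }
EOF

if command -v condor_compile >/dev/null 2>&1; then
    # Prefix the normal compile/link command with condor_compile.
    condor_compile gcc -o hello hello.c
else
    echo "condor_compile not found on PATH."
fi
```

The resulting binary would then be submitted with "Universe = standard" in the submit file.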

Condor Resources

For more information on Condor, the Condor home page is: http://www.cs.wisc.edu/condor/

The online manual is available from: http://www.cs.wisc.edu/condor/manual/v6.8/

Some general tutorials for Condor are at: http://www.cs.wisc.edu/condor/tutorials/

Finally, there is a decent tutorial at: http://www.acf.bnl.gov/UserInfo/Software/Condor/effective_condor_v1.doc

Condor and AFS

The Condor daemons do not run authenticated to AFS; they do not possess AFS tokens. Therefore, no child process of Condor will be AFS authenticated. This means that you must set file permissions so that your job can access any necessary files residing on an AFS volume without relying on your own AFS permissions.

If a job you submit to Condor needs to access files residing in AFS, you have the following choices:

  1. Copy the needed files from AFS to a location where Condor can access them (e.g. a local hard disk or a readable NFS volume).
  2. If you must keep the files on AFS, then set a host ACL (using the AFS "fs setacl" command) on the subdirectory that will serve as the current working directory for the job. For a standard universe job, the host ACL needs to give read/write permission to any process on the submit machine. For a vanilla universe job, you need to set the ACL such that any host in the pool can access the files without being authenticated.
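One possible way to set a host ACL is sketched below, assuming the standard AFS "pts" and "fs" tools: machine entries for the hosts are placed in a protection group, and the group is then put on the directory's ACL. The group name, the IP addresses and the directory are hypothetical placeholders, not values from this cluster.

```shell
# Placeholders: replace with your own group name and job directory.
GROUP="yourname:condorhosts"
JOBDIR="$HOME/condor_job"

add_host_to_group() {
    # $1 = IP address of one machine that must reach the files
    pts createuser -name "$1"                 # create a machine entry
    pts adduser -user "$1" -group "$GROUP"    # add it to the group
}

if command -v fs >/dev/null 2>&1 && command -v pts >/dev/null 2>&1; then
    pts creategroup -name "$GROUP"
    add_host_to_group 192.0.2.10
    add_host_to_group 192.0.2.11
    fs setacl -dir "$JOBDIR" -acl "$GROUP" rlidwk   # read/write for those hosts
    fs listacl -path "$JOBDIR"                      # confirm the result
else
    echo "AFS client tools (fs, pts) not found on PATH."
fi
```

For a standard universe job only the submit machine would need an entry; for a vanilla universe job every host in the pool would.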

