This document describes a group of programs I wrote to make life easier when using the LSF scheduler we have in orchestra. I call the set of programs PyPlatform.

* Motivation

Oftentimes, I submit a job that takes 5 minutes to run in an interactive session. In a 15m queue, it dies with a timeout. Sometimes the network filesystem in orchestra is slow, or the node my job was running on was too busy. In any case, I wanted a way to detect this situation and have my job rerun automatically.

Another annoying and recurring problem is that sometimes my jobs die because of transient errors in the network, or because a job did not have the home or group directories mounted.

Yet other times, my job runs for 16 minutes. My estimate of 15m was reasonable, but not quite enough. In those cases, I wanted my job to be rerun on a 2h queue. One could submit everything to the unlimited queues, but those have a lower priority.

* Description of the system

The set of programs I wrote has two main parts: a dispatcher, and command-line tools to interface with the dispatcher.

The dispatcher is a job that needs to run periodically on one of the login nodes. We can achieve this using cron (see Installation). This program checks the status of the submitted jobs, relaunches dead or timed-out jobs, and performs general bookkeeping.

If a job dies, the dispatcher will see this and decide what to do. If the job was killed because of some error, it will be rerun as it was submitted, up to a number of times you can configure. If it was killed because of a timeout, the dispatcher will decide whether the timeout was legit (if the job used very little CPU time, the timeout is not considered legit), and then either rerun it in the same queue or move it to a queue with more time available (e.g. bump it from a 2h queue to a 12h queue).

The command-line tools are mysub, myjobs and mykill. mysub supports a subset of the functionality of bsub.
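As an aside, the dispatcher's decision rule described above can be sketched in Python roughly as follows. This is only an illustration of the idea, not PyPlatform's actual code: the names (DeadJob, decide, QUEUE_LADDER), the queue names, the retry limit and the CPU-time threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical queue ladder, shortest to longest; real queue names may differ.
QUEUE_LADDER = ["15m", "2h", "12h", "unlimited"]
QUEUE_LIMIT_SECONDS = {"15m": 15 * 60, "2h": 2 * 3600, "12h": 12 * 3600}

MAX_RETRIES = 3         # assumed value for the configurable retry limit
MIN_CPU_FRACTION = 0.5  # assumed threshold below which a timeout is not "legit"


@dataclass
class DeadJob:
    queue: str          # queue the job was running in
    reason: str         # "error" or "timeout"
    cpu_seconds: float  # CPU time the job actually used
    retries: int        # how many times it has already been rerun


def decide(job):
    """Return (action, queue), where action is "rerun" or "give_up"."""
    if job.reason == "error":
        # Errors: rerun as submitted, up to the configured number of times.
        if job.retries >= MAX_RETRIES:
            return ("give_up", job.queue)
        return ("rerun", job.queue)

    # Timeouts: a job that used little CPU time was probably just unlucky
    # (slow filesystem, busy node), so it is retried in the same queue.
    limit = QUEUE_LIMIT_SECONDS.get(job.queue)
    legit = limit is not None and job.cpu_seconds >= MIN_CPU_FRACTION * limit
    if not legit:
        return ("rerun", job.queue)

    # Legit timeout: bump the job to the next longer queue.
    idx = QUEUE_LADDER.index(job.queue)
    next_queue = QUEUE_LADDER[min(idx + 1, len(QUEUE_LADDER) - 1)]
    return ("rerun", next_queue)
```

In this sketch, a job that times out in the 15m queue after using almost all of its 15 minutes of CPU time would be moved to the 2h queue, while one that used only a few seconds would be retried where it was.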
mykill is essentially like bkill, and myjobs prints the LSF job ids of the jobs that PyPlatform is taking care of.

* Installation

Follow these instructions, and hopefully you will have everything running.

1) Log in to orchestra (you will actually log in to either mezzanine or balcony).
2) cd to your home directory and execute
       svn checkout svn+ssh://orchestra.med.harvard.edu/home/et62/svnroot/PyPlatform
   This will create a directory PyPlatform in the directory where you ran the command.
3) cd to PyPlatform/trunk and install everything by executing
       make
4) Tweak the file ~/.PyPlatform/config to suit your taste. If I were you, I would leave everything as it is, but feel free to play.
5) Install PyPlatform in your crontab by executing
       crontab -e
   and adding the line
       */1 * * * * bash -login /home/et62/.PyPlatform/forcron
   anywhere in the file. REPLACE et62 with your orchestra username. crontab -e will launch a text editor. You will probably have no trouble using it. */1 means that the dispatcher will run every minute. If you want it to run every 3 minutes, replace that with */3. I feel that 1 minute is great if you are launching many short jobs, and 5 works more than fine if you are running longer jobs. The dispatcher checks whether another instance is already running, so you don't need to worry about runs overlapping.
6) Add ~/bin to your PATH, if it's not already there. You can check this by looking at the file ~/.bash_profile. I think the default version in orchestra includes your home bin directory, if it exists. If nothing of the sort is present in the file, you can add the line
       PATH=~/bin:"${PATH}"
7) Log out of orchestra, and log back in.
8) Learn how to use PyPlatform by reading the Usage section.

* Usage

Execute
    mysub --help
and you will get

    Usage: mysub [options]

    Options:
      -h, --help            show this help message and exit
      -n NAME, --name=NAME  Assign a name to the job
      -e ERRORSFILE, --errorsfile=ERRORSFILE
                            Redirect stderr to a file
      -o OUTPUTFILE, --outputfile=OUTPUTFILE
                            Redirect stdout to a file
      -N                    Send an email even if the output/errors are redirected
      -q QUEUE, --queue=QUEUE
                            Specify a queue to run the job
      -a AFTERACTION, --afterwards=AFTERACTION
                            Specify an action to be performed upon successful
                            completion

For now, ignore the -n and -a options. -e, -o, -q and -N work exactly like they do for bsub.

If you want to know the LSF job ids of the jobs that PyPlatform is controlling, you can type myjobs. For now, this only lists job ids, but you can get the rest of the information from bjobs.

If you want to kill a job that is being controlled by PyPlatform, you can use mykill. This program takes a list of LSF job ids, kills those jobs and removes them from PyPlatform. If you use bkill instead, they may be rerun (because to LSF, they will have died with an error).

There is one difference between mysub and bsub, though: the output and the errors are not stored unless you specify files with -e and -o (bsub sends both of these in the email report).

Try it by executing
    mysub ls

* Troubleshooting

If you start getting emails from cron, you can remove the PyPlatform line from your crontab (by executing crontab -e and editing the file), and then forward me the emails. For now, we'll leave it at that.

I have been using the scripts for a few weeks without any incidents. They can certainly be polished, and I'm counting on your help with that. Don't worry, I just want you to tell me what I could change or add. I'll take care of the rest.

Cheers,
Enrique