Pilot jobs using centralized storage
From EGEE-see WIki
Introduction
Another concept often used to circumvent the problem of job startup overhead are pilot jobs. Pilot jobs are dummy jobs sent to the cluster not to do real work but to acquire a processor for the user. After starting up, pilot jobs connect to some centralized service to request work. If there is work available, both the executable and input data are downloaded and started. After finishing, output is sent to the server and some more work is requested. If there is no work available on the server for some time, pilot job exits and releases the processor. Several implementations of this concept exists (listed in the references) and the next section describes an implementation of a simple pilot job client that uses centralized storage to record job state.
Implementation
Perl client described in this section was used for processing a set of input files with a sequence of parameters. Input files are stored on a GridFTP server in a predefined directory (each input file in its own subdirectory). Pilot jobs start this pilot script with the address of the GridFTP server and the directory containing input data:
$ ./client.pl storage.irb.hr /srv/storage/crogrid/vvidic/parf/work + uberftp storage.irb.hr "get /srv/storage/crogrid/vvidic/parf/work/../parf_ifort_static parf_ifort_static" + uberftp storage.irb.hr "ls /srv/storage/crogrid/vvidic/parf/work" + uberftp storage.irb.hr "get /srv/storage/crogrid/vvidic/parf/work/shufStrandPeriod1148.arff/data.gz data.gz" + uberftp storage.irb.hr "ls /srv/storage/crogrid/vvidic/parf/work/shufStrandPeriod1148.arff" + uberftp storage.irb.hr "put flag /srv/storage/crogrid/vvidic/parf/work/shufStrandPeriod1148.arff/active407" + ./parf_ifort_static -t data -r 1 -uu 1,63-1063 -c 407 -n 10000 -tv votes407.txt + uberftp storage.irb.hr "put votes407.txt /srv/storage/crogrid/vvidic/parf/work/shufStrandPeriod1148.arff/votes407.txt" + uberftp storage.irb.hr "cd /srv/storage/crogrid/vvidic/parf/work/shufStrandPeriod1148.arff; rm active407" + uberftp storage.irb.hr "ls /srv/storage/crogrid/vvidic/parf/work/shufStrandPeriod1148.arff" + uberftp storage.irb.hr "put flag /srv/storage/crogrid/vvidic/parf/work/shufStrandPeriod1148.arff/active244" + ./parf_ifort_static -t data -r 1 -uu 1,63-1063 -c 244 -n 10000 -tv votes244.txt ...
The script uses the uberftp command to access the storage, downloading the real executable whose name and location are defined in the script. If the download succeeds, directories with input files are browsed for unprocessed files. For each input file, executable needs to be started with parameters from 63 to 1063. Whether the parameter has been processed or not can be determined from the directory contents. If there is a file with the name votes56.txt than the parameter 56 has been processed and the result is available in this file. If this file doesn't exist, but there is a file named active56 then the parameter 56 is being processed by one of the clients. This file contains the hostname of the client processing the parameter and can be used for debugging in case of problems. If none of the two aforementioned files exist for a given parameter, then the parameter is available for processing.
Since multiple clients access the storage in parallel, each client selects at random a free parameter for processing. Although there is a possibility of two clients selecting the same parameter, the duplication of work does not cause problems. When a client finishes processing the input file using the selected parameter, results are uploaded as votes file and active file is removed. If there are no unprocessed parameters (all parameters are either processed or being processed) the client select one of the active parameters. This duplicate processing of active files ensures that failed jobs are handled properly - it is possible for the clients to fail while processing a parameter and never upload the results. When all the parameters have been processed, the directory with the output files is moved aside so that new clients don't need to access it just to discover that all the parameters are already done. The pilot script finishes running when all the directories with input files have been processed.
This implementation presents a sample pilot job client that can be reused for different applications and is easier to install and use than complex general purpose frameworks like Gladein and BOINC.
