Executing MPI applications using P-GRADE Portal
Introduction
In the EGEE Grid users can connect to a resource broker (RB) that accepts a file written in the Job Description Language (JDL). The purpose of this file is to describe the job itself and various requirements set up by the user represented as attribute-value pairs. The possible attributes include for example the executable name, the input files used by the executable, or the number of nodes required by the job. By using the values defined in this file and the information present in the information system the resource broker can decide about the destination resource of the job.
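To make the attribute-value structure concrete, the following minimal sketch renders a small job description as JDL-style text. The attribute names follow common JDL usage (Executable, InputSandbox, NodeNumber); the file names and the helper functions are hypothetical and serve only as illustration.

```python
def format_value(value):
    """Format a Python value in JDL style: lists in braces, strings quoted."""
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(format_value(v) for v in value) + "}"
    if isinstance(value, int):
        return str(value)
    return '"' + str(value) + '"'


def render_jdl(attributes):
    """Render attribute-value pairs as one 'Name = value;' line each."""
    return "\n".join(
        f"{name} = {format_value(value)};" for name, value in attributes.items()
    )


# Hypothetical job: executable name, input files, and required node count.
job = {
    "Executable": "solver",
    "InputSandbox": ["solver", "input.dat"],
    "NodeNumber": 4,
}
print(render_jdl(job))
```

The broker matches requirements such as the node count against the information system; this sketch covers only the description side.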
Besides the brokered method, Grid users have the possibility to access the resources directly. In this case no broker is used to find a matching resource: it is the user's responsibility to send the job to a correct resource. Direct usage can be achieved through standard GT-2 calls: GRAM and GridFTP.
P-GRADE Portal makes it possible for the user to use EGEE resources in either of the ways described above, through a common graphical user interface.
During the development of EGEE Grid support in P-GRADE Portal, we faced the following differences between the EGEE infrastructure and previously used Grid infrastructures. First, many jobmanagers create a new working directory for each GRAM call. As a consequence, the job should be started using only one GRAM call.
Second, there are resources that do not share the working directory among worker nodes. This means that input files need to be copied to all of the worker nodes because, in the case of MPI applications, any of the processes may access the input files. However, it is possible to detect the presence of shared working directories, in which case optimizations should be considered.
Third, the jobmanager used may not be able to handle MPI jobs. In order to make MPI jobs runnable on such resources, P-GRADE Portal has to start the MPI application.
Finally, there are some new services in EGEE. Files may reside on Storage Elements, which cannot be used by legacy applications: the user has to add support for them in his/her application. This is a serious problem if the user is not able to modify the application. Another important service is the EGEE broker.
Direct MPI job submission
In order to make the execution layer work for both traditional and EGEE-like resources, the following points must be considered.
As jobmanagers may not support MPI jobs, the job type should be specified as 'single' rather than 'mpi', but the required number of nodes must still be specified for MPI jobs. This makes it possible to start the application even if the jobmanager does not support MPI applications, as long as the resource itself is configured to run them.
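As a rough illustration of this point, the sketch below assembles a GRAM RSL string in which the job type is 'single' while the node count is still requested. The function name and the executable name are hypothetical; the 'jobType', 'count' and 'hostCount' attributes are the ones this page names for direct submission.

```python
def build_rsl(executable, node_count):
    """Build an RSL string that requests nodes without using jobType=mpi."""
    clauses = [
        ("executable", executable),
        ("jobType", "single"),      # never 'mpi': the jobmanager may not support it
        ("count", node_count),      # number of processes to allocate for
        ("hostCount", node_count),  # number of nodes to allocate
    ]
    return "&" + "".join(f"({name}={value})" for name, value in clauses)


# Hypothetical wrapper name; four nodes requested for an MPI job.
print(build_rsl("wrapper.sh", 4))
```

Because the job type stays 'single', the jobmanager starts only one process, and launching the MPI processes is left to whatever that process runs.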
When the working directories are not shared, it is impossible to copy input files to the worker nodes using GridFTP. Thus, files must be copied to a temporary working directory on the frontend machine instead. A GRAM call can be used to create this temporary working directory. The requested jobmanager might be unusable for this, as the directory would be created on one of the worker nodes, where it cannot be accessed using GridFTP. So the 'fork' jobmanager, which runs jobs on the frontend machine, must be used to create this working directory. In this case the job to be run is a simple 'create directory' command. After the directory is created, it is possible to copy input files to this directory using GridFTP.
P-GRADE Portal must not specify the working directory where the job is to be run, as it may be different for each GRAM call.
The jobmanager knows nothing about the input files used by the job (there is no way to specify them), so moving the input files from the frontend to the worker nodes has to be done by the job. As P-GRADE Portal offers the possibility to submit grid-unaware jobs, which know nothing about the infrastructure used, this task has to be done by the execution layer: not on the Portal machine, but on the resource, after the job has been submitted and before it is started. The straightforward solution is to copy the executable to the frontend machine just like an input file, and to specify a wrapper script as the executable in the Condor-G submit file. The wrapper script, described below, can perform all the necessary tasks that cannot be done before the job is submitted or that are not handled by the real executable.
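The staging sequence described above can be sketched as a list of command lines: one 'create directory' job sent to the fork jobmanager, followed by one GridFTP transfer per file. This is only an illustration, not the Portal's actual pre script. It assumes the standard Globus Toolkit clients globus-job-run and globus-url-copy are used, and the host name and paths are hypothetical.

```python
import os


def staging_commands(frontend, workdir, local_files):
    """Return the command lines that create workdir on the frontend and fill it.

    frontend is the frontend host name, workdir an absolute path on that host.
    """
    # The fork jobmanager runs the command on the frontend machine itself.
    fork_contact = f"{frontend}/jobmanager-fork"
    commands = [["globus-job-run", fork_contact, "/bin/mkdir", "-p", workdir]]
    for path in local_files:
        source = "file://" + os.path.abspath(path)
        target = f"gsiftp://{frontend}{workdir}/{os.path.basename(path)}"
        commands.append(["globus-url-copy", source, target])
    return commands


# Hypothetical frontend and files; the commands are only printed, not executed.
for command in staging_commands(
    "ce.example.org", "/home/user/job_tmp", ["/tmp/input.dat"]
):
    print(" ".join(command))
```

Returning the commands instead of running them keeps the sketch testable without a Grid installation; a real script would pass each list to a process launcher and check the exit status.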
With all the above in view, the execution layer of P-GRADE Portal does the following to execute jobs using direct job submission:
- The pre script queries the HOME directory on the frontend node using the fork jobmanager. Next, a temporary working directory is created in the HOME directory also using the fork jobmanager. The final step of the preparation is copying the input files and the executable to this working directory.
- The wrapper script is submitted to the requested jobmanager using the Condor-G submit file. In the case of MPI applications, the job is submitted with the 'single' job type, and the number of nodes to be allocated is also specified. This means setting the 'jobType' RSL attribute to 'single', and the 'count' and 'hostCount' attributes to the requested number of nodes. The wrapper script requires the following parameters: the fully qualified domain name (FQDN) of the frontend node, the path of the temporary working directory on that machine, the job name, and the real executable name. All these parameters are specified as environment variables in the wrapper script, and are updated by the pre script.
- In this step, the jobmanager script running on the frontend node creates a submit file for the local resource management system (LRMS), and submits it to the LRMS. The LRMS allocates the number of nodes requested, and starts the wrapper script on one of the allocated nodes. We will refer to the node where the wrapper is started as the master node.
- After the wrapper script is started, it checks whether the temporary working directory created by the pre script is present. If it is, the selected resource uses shared working directories. Otherwise the executable and input files are copied from the frontend node using scp or rcp. If the job type is sequential, the executable is simply started at this point. If it is an MPI job, the wrapper looks for 'mpirun': first the $GLOBUS_LOCATION/libexec/globus-sh-tools.sh file is read, and if the $GLOBUS_SH_MPIRUN environment variable is defined, its value is used; otherwise 'mpirun' is used, assuming it is in the PATH. The number of nodes is queried from the job description file. If the $PBS_NODEFILE environment variable is defined, its value is used as the machinefile for mpirun. If $PBS_NODEFILE is defined and no shared working directories are present on the resource, the wrapper script copies the executable and the input files to each node enumerated in the file. The target directory is the current working directory, as MPI starts processes in that directory. In this way the executable and input files are distributed.
- Next, the real executable is started using the 'mpirun' that was found, the specified process count, and the machinefile, if one was found.
- After the real executable has run, if the $PBS_NODEFILE environment variable is defined and no shared working directories are present on the resource, the output files listed in the job's input/output file description file are copied from the worker nodes enumerated in the file referenced by $PBS_NODEFILE to the master worker node using scp or rcp. If a file is not present, the error is simply ignored. After this, output files are copied from the master worker node to the temporary working directory on the frontend machine. Finally, the wrapper script exits with the return value of the real executable.
- After the Condor-G job has successfully finished, the post script copies the output files from the temporary working directory on the frontend to the Portal machine using GridFTP.
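The decisions the wrapper script makes before launching the real executable can be sketched as follows. This is an illustrative approximation, not the actual wrapper. It assumes that globus-sh-tools.sh has already been sourced, so that $GLOBUS_SH_MPIRUN (if set) is visible in the environment, and that 'mpirun' accepts the MPICH-style '-np' and '-machinefile' options. All function and file names are hypothetical.

```python
import os


def uses_shared_workdir(workdir):
    """The resource shares working directories if the directory created on
    the frontend by the pre script is also visible on this worker node."""
    return os.path.isdir(workdir)


def build_launch_command(executable, process_count, env, is_mpi=True):
    """Return the command line used to start the real executable."""
    if not is_mpi:
        return [executable]  # sequential jobs are started directly
    # Prefer the mpirun configured for Globus, else rely on the PATH.
    mpirun = env.get("GLOBUS_SH_MPIRUN") or "mpirun"
    command = [mpirun, "-np", str(process_count)]
    machinefile = env.get("PBS_NODEFILE")
    if machinefile:
        # PBS lists the allocated nodes in this file.
        command += ["-machinefile", machinefile]
    command.append(executable)
    return command


# Hypothetical environment as it might look on a PBS-managed worker node.
env = {"PBS_NODEFILE": "/var/spool/pbs/aux/1234"}
print(build_launch_command("./solver", 4, env))
```

In a real wrapper the resulting command would be executed, and its exit status returned as the wrapper's own, as the list above describes.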
The following image gives an overview of the direct job execution in P-GRADE Portal:
MPI job submission using the Resource Broker
When submitting a job through the EGEE broker, users cannot make the broker copy their remote input files residing on a storage element to the worker node where the job is run. It is possible to specify the remote input files, but this is only a hint for the broker: it can place the job close to a storage element that holds the requested files. It is the job's responsibility to download the remote input files from the storage element as needed.
P-GRADE Portal lets users run their legacy applications using remote input files, even if the application does not support using storage elements. To achieve this, the input file type has to be set to Remote on the user interface, and the file must be specified using a Logical File Name (LFN) or a GUID.
The execution layer of P-GRADE Portal does the following to execute jobs using brokered job submission:
- The Portal submits the job to the resource broker. A portal wrapper script (wrapper_p), the real executable, and local input files are sent with the job. wrapper_p is used as the executable to run.
- The resource broker creates a submit file for the GRAM jobmanager requested. The job type is specified as single, and a new script, the broker wrapper script (wrapper_rb) is specified as the executable to be run. wrapper_p, the real executable and local input files are sent as job input.
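- This step is the same as the third step of direct MPI job submission: the jobmanager submits the job to the LRMS, which allocates the requested nodes and starts the submitted script on one of them.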
- wrapper_rb starts wrapper_p on the allocated nodes using 'mpirun'. The first instance of wrapper_p starts on the master worker node.
- wrapper_p checks whether the requested remote input files are present. If not, they are downloaded from the storage element. Next, the real executable is started. This step triggers the MPI_Init function in the MPI library.
- MPI_Init starts wrapper_p on the other worker nodes. It starts wrapper_p rather than the real executable, because wrapper_p has been specified as the executable to 'mpirun'. wrapper_p running on slave worker nodes behaves just like on the master worker node: it checks whether the remote input files are present. If they are (probably because working directories are shared), the real executable is simply started. If not, the remote input files are downloaded from the storage element first.
This method copies remote input files only when they are really needed: for shared working directories only once, for unshared working directories only if they are not present. The resulting portal wrapper script is universal: it works for both sequential and MPI applications.
As can be seen, the main difference between the direct and brokered job submission implementations is that with brokered submission the implementation does not need to take care of running 'mpirun', as this is done by the broker wrapper script. In the case of direct job submission, the portal wrapper script behaves just like the broker wrapper script mentioned in this section.
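The "download only if missing" behaviour can be sketched as below. The download step is passed in as a function, because the actual transfer tool is not named on this page; on an EGEE worker node it might be a call to a data management client such as lcg-cp, but that is an assumption. The function name, the LFNs and the file names are hypothetical.

```python
import os


def fetch_missing_inputs(remote_inputs, workdir, download):
    """Download every remote input that is not yet present in workdir.

    remote_inputs maps a logical file name to the local file name the
    application expects; download(lfn, destination) performs one transfer.
    Returns the local names that had to be fetched.
    """
    fetched = []
    for lfn, local_name in remote_inputs.items():
        destination = os.path.join(workdir, local_name)
        if os.path.exists(destination):
            # Already there, e.g. fetched by another node sharing the directory.
            continue
        download(lfn, destination)
        fetched.append(local_name)
    return fetched


def report_only(lfn, destination):
    """Stand-in for a real transfer: only report what would be copied."""
    print(f"would copy {lfn} to {destination}")


# Hypothetical remote input; a real wrapper would start the executable next.
fetch_missing_inputs({"lfn:/grid/demo/matrix.dat": "matrix.dat"}, ".", report_only)
```

With a shared working directory only the first wrapper instance to run finds the file missing; later instances skip the transfer, which matches the "only once" behaviour described above.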
The following image gives an overview of MPI execution in P-GRADE Portal using a resource broker:
