Simple grid MPI application using gsl library
From EGEE-see WIki
Introduction
The application we are developing has some of the typical characteristics of many grid applications, so the approach we used, as well as the sample, could be reused by other users of the SEE-GRID infrastructure. First of all, it is very time consuming: even the simplest model needs around 20 hours of CPU time on one processor, and a more complex model would need weeks of calculation on one CPU. Fortunately, the whole computation can be divided into stages, and every stage can itself be divided into batches of tasks that need only a little communication between them. Between two consecutive stages, major communication becomes compulsory. The purpose of this document is to help others avoid pitfalls and to provide them with a simple framework for starting their own application.
The developers faced several problems in the design of this application.
- How to divide these stages and tasks efficiently and make the whole application functional on the SEE-GRID infrastructure.
- The application itself uses several grid sites, and the coordination and communication of the application across these sites are important for achieving good performance. A good design decision substantially reduces the computation time. Moreover, the end user is not grid aware, and the wall clock time of the application is what matters to us.
- The tasks of one stage are divided into several batches of tasks. Every batch is executed on a different grid site as a rather simple MPI application. Once at the site, the tasks of the batch are spawned on the available processors and communicate using MPI functions. When all tasks on one grid site are done, the master communicates the results through files using the lcg_util API. We chose to keep all files on one particular storage element. This approach changed recently as the state of the infrastructure improved (a number of sites now run an SRM-capable SE), and we currently use the gfal API. This led to the development of a C++ fstream wrapper around the gfal C API, which will also be documented in this part of the gridification guide.
- The application needs the GSL library and the Cubpack++ library. A substantial part of the time was devoted to integrating these libraries into an application that also uses the grid middleware libraries and MPI. Various compatibility issues were solved during the development.
We developed perl scripts that create the .jdl and .sh files used to submit the jobs. The plan is to further investigate ways to improve the scalability on the grid infrastructure and the responsiveness of the application.
As the application itself is rather complicated, for the sake of this guide we provide a simple "Hello grid world" application that comprises all the elements cited above, together with a cookbook to compile and use it. It could be a good starting point for developing another grid application with similar requirements.
Description of the application
This is a simple application, but it also solves the problem we face, and it would be equally useful for other similar problems. For the purpose of this guide, the application is simplified even further. We show how to calculate the values of a given function on a given set of points using different grid sites, then combine all results (extrapolating the obtained function), and then iteratively calculate the values of a new function on the same set of points. The new function in iteration $i$ depends on all values obtained in iteration $i-1$.
For simplicity, we take as the set of points a given number of equidistant points in the interval that starts at 0 and has a length equal to the number of grid sites used in the calculation.
Every grid site is responsible for calculating several equidistant points in an interval of length 1. The starting point of the interval has the same value as the position of the grid site in the application. More precisely, the first grid site is responsible for the points in the interval between 0 and 1, the second one for the points between 1 and 2, etc. Therefore, the whole application calculates the points in the interval between 0 and S, where S is the number of grid sites involved in the calculation. The number of points in one unit interval depends on the number of worknodes involved in the calculation. It does not have to be the same for all sites. The system has been built so that these properties can be changed by adjusting the number of worknodes involved in the calculation in the perl script provided with the package. The default is to have an equal number of WNs on every site.
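The division of points described above can be sketched as follows. This is a minimal illustration, not the code shipped in the package: the helper name site_points is hypothetical, sites are assumed to be numbered from 0, and the points are assumed to be spaced 1/W apart starting at the left end of the site's unit interval, where W is the number of worknodes on that site. Whether the right endpoint belongs to a site is not stated in this guide, so the half-open choice below is an assumption.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: x-coordinates handled by the site with 0-based index
// `site_index` when it runs `workers` worknodes (one point per worknode).
// Assumption: points are spaced 1/workers apart in [site_index, site_index + 1).
std::vector<double> site_points(int site_index, int workers) {
    assert(site_index >= 0 && workers > 0);
    std::vector<double> xs;
    xs.reserve(static_cast<std::vector<double>::size_type>(workers));
    for (int rank = 0; rank < workers; ++rank) {
        // The worknode with MPI rank `rank` would evaluate the function at xs[rank].
        xs.push_back(site_index + static_cast<double>(rank) / workers);
    }
    return xs;
}
```

Under these assumptions, site 1 with 5 worknodes would handle x = 1.0, 1.2, 1.4, 1.6 and 1.8.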
The following figure represents the overall structure of the application and the division of tasks among grid sites.
Use of MPI on a given grid site
We also show how to use MPI on every site to do this calculation. The value of the function at a given point is calculated using a single worknode on the site. Again, for the sake of simplicity, the use of MPI is reduced to collecting data from every worknode, but one could use the full range of a standard MPI implementation if desired. The figure shows how the calculation tasks are divided among the worknodes on the different grid sites participating in the calculation.
When all calculations on every worknode at one grid site are done, the data are collected and written to the file repository under the name filename_iterN_siteS. In this case study, this file is a simple text file with a natural format:
x-axis value \tab y-axis value
x-axis value \tab y-axis value
...
x-axis value \tab y-axis value
Of course, a developer with different needs could define a completely different format or use a standardized one. XML technologies could also be considered as a standard way to exchange data.
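A minimal sketch of writing and reading this tab-separated format is given below. The helper names format_points and parse_points are hypothetical, and printing with 17 significant digits is an assumption made here so that values survive a round trip; the guide does not say how the package formats its numbers.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: one "x<TAB>y" line per point.
std::string format_points(const std::vector<std::pair<double, double>>& pts) {
    std::ostringstream out;
    out.precision(17);  // assumption: enough digits to reproduce a double exactly
    for (const auto& p : pts) {
        out << p.first << '\t' << p.second << '\n';
    }
    return out.str();
}

// Hypothetical helper: reads the pairs back; stops at the first malformed entry.
std::vector<std::pair<double, double>> parse_points(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::pair<double, double>> pts;
    double x = 0.0;
    double y = 0.0;
    while (in >> x >> y) {
        pts.emplace_back(x, y);
    }
    return pts;
}
```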
Coordination and communication between the sites
When all grid sites are done with their tasks in iteration I, the file repository contains all the files filename_iterI_site0, filename_iterI_site1, …, filename_iterI_siteS. A special application runs on the chosen SE and waits for all these files, periodically checking whether all of them are present. The polling period is customizable but not critical for this application. In our simple case study, this application simply concatenates all files from iteration I and draws a graph in the eps format using the graphtool GNU software package. The user can then retrieve the resulting files manually using edg commands to follow the progress of the whole application.
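The presence check can be sketched as below. The helper names are hypothetical, sites are assumed to be numbered from 0, and the zero-padded four-digit numbering (iter0000_site0004) is inferred from the file listing shown later in this guide rather than from the package sources. The set of present names would come from a directory listing of the file repository, which is left out here.

```cpp
#include <cstdio>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: builds e.g. "filename_iter0003_site0001".
std::string iteration_file_name(const std::string& prefix, int iter, int site) {
    char suffix[64];
    std::snprintf(suffix, sizeof suffix, "iter%04d_site%04d", iter, site);
    return prefix + suffix;
}

// Returns the per-site result files of iteration `iter` that are not yet in
// `present`. An empty result means the concatenation step can start.
std::vector<std::string> missing_files(const std::string& prefix, int iter, int sites,
                                       const std::set<std::string>& present) {
    std::vector<std::string> missing;
    for (int s = 0; s < sites; ++s) {
        const std::string name = iteration_file_name(prefix, iter, s);
        if (present.count(name) == 0) {
            missing.push_back(name);
        }
    }
    return missing;
}
```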
The result of the concatenation step is then sent, in the form of a file, to all grid sites involved in the calculation for iteration I+1. In this simple application, as we want to keep it as simple as possible, the result that we send is simply the numerical value of the integral of the extrapolated function.
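How the integral is computed is not specified in this guide. One simple possibility, shown here purely as an assumption, is the trapezoidal rule applied to the concatenated points, which amounts to integrating a piecewise-linear function through them. The function name is hypothetical, and the points are assumed to be sorted by x.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: trapezoidal rule over (x, y) pairs sorted by x.
// Returns 0 for fewer than two points.
double trapezoid_integral(const std::vector<std::pair<double, double>>& pts) {
    double sum = 0.0;
    for (std::size_t i = 1; i < pts.size(); ++i) {
        const double dx = pts[i].first - pts[i - 1].first;
        sum += 0.5 * dx * (pts[i].second + pts[i - 1].second);
    }
    return sum;
}
```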
The whole application is divided into three parts:
- The part responsible for the numerical calculation. This part could, and should, be customized by users according to their own needs. In this tutorial we provide a simple example that calculates the Bessel function at different points on the x-axis. This part is an MPI application, usually run on several grid sites simultaneously. We will refer to this part as the numerical part of the application, NPA.
- The second part of the application is responsible for collecting the results from the different grid sites. This can be a simple operation, such as concatenating the result files obtained from every site, but it can also be more elaborate, depending on the needs of the users. We provide here a simple application that only concatenates the results and produces the new input for the next iteration to be run on the different grid sites.
- The third part consists of several perl scripts that help automate the process of submitting the jobs and retrieving the results.
The application is packaged in three corresponding parts, and in what follows we explain the necessary details for all three.
Technical description of the application
Numerical calculation
The directory structure of this part is as follows:
HelloWorld_mpi_grid
|--- source
|      *.h and *.cpp files
|--- mpi_grid
|      |---- makefile
|      |---- sources.mk
|      |---- objects.mk
|      |---- source
|--- mpi
       |---- makefile
       |---- sources.mk
       |---- objects.mk
       |---- source
To compile each part, it should be enough to run the command make in the chosen directory. The following parameters should be adjusted depending on the libraries installed on your working system. To minimize possible compatibility issues, it is best to compile it on an available UI node at your grid site.
How to compile and what are the pitfalls to avoid
Let us briefly describe the files and the functions implemented in this part of the application.
The file main.cpp contains only one function, main: int main( int argc, char **argv); Its arguments are, in order:
- The directory where all the files produced by the program are saved (for example /grid/seegrid/pmf/OAA_2/run0001/). This can be a directory in the grid catalog or a simple directory on the nfs of the cluster.
- k, the ordinal number of the grid site we use in the application.
- 5*k. This argument is application specific. In our simple example, we calculate the value of the given function at some points given by an array, and this argument denotes the index of the first point in that array for which we would like to find the value of the function. Implicitly, the 5 means that we calculate 5 points at every grid site involved in the calculation. In this simple application, every site does approximately the same amount of work, i.e. it calculates the function values at 5 points. In other applications, the amount of work performed at each site could differ and is regulated by this parameter and the next one.
- 5 * (k + 1). This argument denotes, in our example, the index of the last point in the array for which we would like to find the value of the function.
- 40. This argument denotes the number of iterations, i.e. the number of periods to evaluate in the Optimal assets allocation problem.
- A legacy argument that was used to describe the storage element for the gfal functions used in the first version of the application. For the sake of stability, we choose the se.phy.bg.ac.yu storage element.
The main function starts with a call to MPI_Init( &argc, &argv ). This is a standard MPI function that initializes the MPI environment and is mandatory in any MPI program. In our example, it also removes all MPI-related arguments from the argument list, so we can process argc and argv as usual in a C/C++ program.
The next procedure to be called within main() is void OAA_grid_set_up_parametres(int argc, char **argv, OAA_grid_parametres* OAA_grid_params);
The parameters are argc, argv, and a pointer to the struct OAA_grid_parametres, which keeps all parameters in a manageable way. The definition of this struct follows:
typedef struct{
  int OAA_nmb_of_site;
  int OAA_start_shock_points;
  int OAA_end_shock_points;
  int OAA_number_of_periods;
  int OAA_current_period;
  char OAA_current_se[100];
  char OAA_current_dir[100];
  char OAA_filename_in[100];
  char OAA_filename_out[100];
} OAA_grid_parametres;
The names and types are self-explanatory.
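As an illustration of how the command-line arguments described earlier could be mapped onto this struct, here is a minimal sketch. It is not the implementation of OAA_grid_set_up_parametres from the package: the function name parse_arguments is hypothetical, the argument order follows the description of main() in this guide, and starting OAA_current_period at 0 with empty file names is an assumption.

```cpp
#include <cstdio>
#include <cstdlib>

typedef struct {
    int OAA_nmb_of_site;
    int OAA_start_shock_points;
    int OAA_end_shock_points;
    int OAA_number_of_periods;
    int OAA_current_period;
    char OAA_current_se[100];
    char OAA_current_dir[100];
    char OAA_filename_in[100];
    char OAA_filename_out[100];
} OAA_grid_parametres;

// Hypothetical sketch. Expects argv to hold: directory, site number k,
// first point index, last point index, number of periods, storage element.
// Returns false if arguments are missing.
bool parse_arguments(int argc, char** argv, OAA_grid_parametres* p) {
    if (argc < 7 || p == nullptr) {
        return false;
    }
    // snprintf truncates over-long strings instead of overflowing the buffers.
    std::snprintf(p->OAA_current_dir, sizeof p->OAA_current_dir, "%s", argv[1]);
    p->OAA_nmb_of_site = std::atoi(argv[2]);
    p->OAA_start_shock_points = std::atoi(argv[3]);
    p->OAA_end_shock_points = std::atoi(argv[4]);
    p->OAA_number_of_periods = std::atoi(argv[5]);
    std::snprintf(p->OAA_current_se, sizeof p->OAA_current_se, "%s", argv[6]);
    p->OAA_current_period = 0;     // assumption: iterations are counted from 0
    p->OAA_filename_in[0] = '\0';  // assumption: file names are filled in later
    p->OAA_filename_out[0] = '\0';
    return true;
}
```

For site k = 2 the call would receive the values 2, 10, 15 and 40 after the directory, matching the 5*k and 5*(k+1) convention.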
After the parameters are set, the function int OAA_grid_mpi_main(MPI_Comm myMPI_COMM_WORLD, OAA_grid_parametres* OAA_grid_params, double (*f)(double x, void* p)); is called. This is the second-to-last call in main(). After it, one only needs to call MPI_Finalize(); to shut down MPI properly. The OAA_grid_mpi_main function has several purposes:
- to keep main() clean and simple
- to allow the user easy maintenance of the application
- to separate the set-up of the application from the definition of the work in the application
The first two parameters passed to this function are self-explanatory. The third one is a pointer to the function for which we calculate the values. The third parameter could also be a pointer to a class that performs all application-specific calculations, and in the real application this is the case. In this case study we pass a pointer to the simple_gsl function defined in diff_test_functions.h and diff_test_functions.cpp.
The function int OAA_grid_mpi_main(MPI_Comm myMPI_COMM_WORLD, OAA_grid_parametres* OAA_grid_params, double (*f)(double x, void* p)) is defined in the OAA_grid_functions.cpp file. Let us describe the algorithm implemented there and its purpose. There are two buffers of type char[] used to pass results to and receive results from the central data repository (SE or nfs directory), and there is one integer variable, rank, used to keep track of the rank of the process in the MPI communicator. The process with rank 0 collects all results, saves them in the appropriate directory, and also reads the processed information from the previous iteration. For all other processes, this function simply calls OAA_mpi_grid_find_points_equal_distribution, which we explain below. When all processes have finished their calculation, the results are written to OAA_buffer_out, which is then written to the file filename_iterN_siteS using the function file_creating(...).
If the file is written successfully, the procedure waits for the file filename_iterN before proceeding with the next iteration. This is done with the function file_waiting_and_reading(...), which we explain later in this guide. When the write and read operations have finished successfully, the procedure updates the parameters that describe the iteration (OAA_grid_params->OAA_current_period, OAA_grid_params->OAA_filename_in, OAA_grid_params->OAA_filename_out), and these steps are repeated until we reach the number of periods required by the user (the variable OAA_grid_params->OAA_number_of_periods holds this value, which is 40 in our example).
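The write, wait and repeat cycle described above can be sketched without MPI or grid libraries as follows. This is only an outline of the control flow on the rank-0 process: the callback types and the function run_iterations are hypothetical stand-ins for the calculation, file_creating(...) and file_waiting_and_reading(...), whose real signatures are not reproduced here, and the plain (unpadded) iteration and site numbers in the file names are a simplification.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-ins for the package functions.
using ComputeFn = std::function<std::string(int period, const std::string& previous)>;
using WriteFn = std::function<bool(const std::string& name, const std::string& data)>;
using WaitReadFn = std::function<bool(const std::string& name, std::string& data)>;

// Runs up to `periods` iterations for one site and returns how many completed.
int run_iterations(int site, int periods, const ComputeFn& compute,
                   const WriteFn& write_file, const WaitReadFn& wait_and_read) {
    std::string previous;  // combined result of the previous iteration
    int period = 0;
    for (; period < periods; ++period) {
        const std::string iter_name = "filename_iter" + std::to_string(period);
        const std::string out_name = iter_name + "_site" + std::to_string(site);
        // 1. Compute this site's results and publish them.
        if (!write_file(out_name, compute(period, previous))) {
            break;
        }
        // 2. Block until the combined file of this iteration is available.
        if (!wait_and_read(iter_name, previous)) {
            break;
        }
    }
    return period;
}
```

Passing the three operations as callbacks keeps the loop independent of whether files live on a local directory or on a storage element, which mirrors the cluster and grid variants mentioned in this guide.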
The procedure int OAA_mpi_grid_find_points_equal_distribution(MPI_Comm myMPI_COMM_WORLD, char* OAA_buffer_in, char* OAA_buffer_out, OAA_grid_parametres* OAA_grid_params, double (*f)(double x, void* p)) calculates the values of the function f (in this example, the simple_gsl function) at the different points. In this example every site calculates 5 equidistant points between k and k+1, where k is the number of the site given in the parameters of the application. For the sake of simplicity, one worknode calculates the value at only one point, and the points are assigned to the worknodes according to their ranks in the MPI communicator. When all nodes have finished their jobs, the results are collected on the process of rank 0 using MPI_Gather. This process then writes all results to OAA_buffer_out and the procedure exits. The OAA_buffer_in variable has no role in this application but could be used to pass in the results from the previous iteration. These results could influence the values of the function f in the current iteration (for example in a dynamic programming algorithm).
double simple_gsl(double x, void* p) is a really simple function that uses one gsl function. This kind of function is used here for two main reasons:
- to show the use of an external, non-standard library in a simple grid application
- to allow us to "see" the results by drawing its graph at every iteration
The parameter p is not actually used here; it appears mainly for consistency with the way gsl functions are used.
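The guide does not show the body of simple_gsl or say which gsl function it calls. The sketch below therefore only illustrates the calling convention double f(double x, void* p), assuming the function of interest is the Bessel function of the first kind of order zero. To stay compilable without GSL, it evaluates the standard power series directly; the name bessel_j0_series is hypothetical, and in a GSL build a call such as gsl_sf_bessel_J0(x) would presumably take its place.

```cpp
#include <cmath>

// Stand-in with the same signature as simple_gsl: J0(x) from its power series
//   J0(x) = sum_{m >= 0} (-1)^m (x/2)^(2m) / (m!)^2.
// Adequate for moderate |x| (roughly |x| <= 10); cancellation grows beyond that.
double bessel_j0_series(double x, void* /*p, unused as in the guide*/) {
    const double q = -0.25 * x * x;  // ratio factor between consecutive terms
    double term = 1.0;
    double sum = 1.0;
    for (int m = 1; m <= 200; ++m) {
        term *= q / (static_cast<double>(m) * m);
        sum += term;
        if (std::fabs(term) < 1e-17) {
            break;  // remaining terms no longer change the sum
        }
    }
    return sum;
}
```

Because the signature matches, a pointer to this function could be passed wherever the guide passes simple_gsl.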
The remaining two functions are defined in the file file_manipulation_grid_and_cluster.cpp, and their names already explain their purposes. The file_creating function takes two arguments: filenameout, which indicates where to write the new file and under which name, and buffer_out, a char[] variable whose contents are written to filenameout. The function is implemented in a way that distinguishes two types of application: the cluster-based application used for checking and debugging, and the real grid application.
The file_waiting_and_reading function also takes two arguments, and its role is simply to wait for the other grid sites and the concatenator on the storage element to finish their jobs. When it finds the result file of the current iteration, it reads it and allows the next iteration to start. This function is also implemented in a way that distinguishes the two types of application.
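For the cluster (plain directory) case, the two operations could look roughly like the sketch below. The function names, the polling interval and the attempt limit are assumptions for illustration; the real file_creating and file_waiting_and_reading take two arguments each, and their grid variants use the gfal API, which is not shown here.

```cpp
#include <chrono>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical local-directory counterpart of file_creating.
bool write_text_file(const std::string& path, const std::string& contents) {
    std::ofstream out(path, std::ios::trunc);
    if (!out) {
        return false;
    }
    out << contents;
    out.close();
    return !out.fail();
}

// Hypothetical local-directory counterpart of file_waiting_and_reading:
// polls until `path` can be opened, then reads it completely.
// Gives up after `max_attempts` polls and returns false.
bool wait_and_read_file(const std::string& path, std::string& contents,
                        std::chrono::milliseconds poll_interval, int max_attempts) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        std::ifstream in(path);
        if (in) {
            std::ostringstream buffer;
            buffer << in.rdbuf();
            contents = buffer.str();
            return true;
        }
        std::this_thread::sleep_for(poll_interval);
    }
    return false;
}
```

A reader that polls this way may see a file that is still being written; writing under a temporary name and renaming it once complete is one common way to avoid that, although the guide does not say how the package handles it.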
Processing the intermediary results
To process the intermediate results, we designed a separate application that deals only with the files on the storage element in use. Therefore, the best place to run this application is a neighboring computing element. In our package this application is called grid_files_concatenate. Four parameters are passed in the call of this application:
- 1 file_name
- 2 no_nodes
- 3 no_iter
- 4 se
The first one, file_name, is the first part of the name of all files used throughout the application. The second parameter denotes the total number of grid sites involved in the calculation, and the third one is the total number of iterations. The fourth argument is now deprecated; it was used as the name of the storage element that keeps track of all files (e.g. se.phy.bg.ac.yu). This application provides a simple way to coordinate the different grid sites and was built with our type of calculation, described at the beginning of this guide, in mind. The application starts with the first iteration, waiting for the results from all grid sites involved. When the first set of files is complete on the storage element, the task of the application is, in our example, to concatenate all these files and create a new one that serves as the starting point for the next iteration. One can easily replace the concatenation with a more demanding operation, depending on the needs of the particular application. The number of iterations in this auxiliary application should match the number of iterations in the main application.
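The concatenation step itself can be sketched for ordinary files as below. This is not the grid_files_concatenate source: the function name is hypothetical, and access to a storage element is replaced by local file streams.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical sketch: appends the contents of `inputs`, in order, to a newly
// created `output` file. Returns false if any file cannot be opened or written.
bool concatenate_files(const std::vector<std::string>& inputs, const std::string& output) {
    std::ofstream out(output, std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;
    }
    for (const auto& name : inputs) {
        std::ifstream in(name, std::ios::binary);
        if (!in) {
            return false;
        }
        // Streaming an empty buffer would set failbit on `out`, so skip empty files.
        if (in.peek() != std::ifstream::traits_type::eof()) {
            out << in.rdbuf();
        }
    }
    return static_cast<bool>(out);
}
```

A more demanding combination step, as suggested above, would replace the body of the loop while keeping the same inputs and output.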
Automatization using perl scripts
Once these two applications have been compiled successfully, the user should execute them on the grid infrastructure. For this purpose, we use a suite of standard .jdl files and .sh scripts. For every grid site involved in the calculation we need one .jdl and one .sh file, and then we can submit these jobs to the grid. Every new execution of the application should use an empty directory in the file repository, so we chose to create a new directory for every run of the application. The process of submitting the whole application is automated using several perl scripts that users can change to meet their needs. As there are several compatibility requirements for the application, we have a script that simply checks the functionality of all sites in the SEE-GRID infrastructure. One can use this script to build the list of "really available" sites for this kind of application. The name of this script is parse_ce_information_mpi.pl and it is provided in this package in the directory perlscripts. This script runs an unrelated application on all sites. After these applications have executed, files whose names start with the names of the functional sites will be available in the created directory. So if we choose the new directory /grid/seegrid/pmf/OAA_2/run0600, the files we can find there are
.c01.grid.etfbl.netiter0000_site0004
.ce001.grid.uni-sofia.bgiter0000_site0004
.ce001.imbm.bas.bgiter0000_site0004
.grid01.elfak.ni.ac.yuiter0000_site0004
.grid1.irb.hriter0000_site0004
.sn0.hpcc.sztaki.huiter0000_site0004
.yildirim.grid.boun.edu.triter0000_site0004
Other files are missing for different reasons: slow internet connection, lack of MPI support, etc. We expect that all sites will soon comply with the requirements of the application. To continue with the test, the user should create a file named available_sites that has on each line the name of a fully functioning CE. For example
c01.grid.etfbl.net
ce001.grid.uni-sofia.bg
ce001.imbm.bas.bg
grid01.elfak.ni.ac.yu
grid1.irb.hr
sn0.hpcc.sztaki.hu
yildirim.grid.boun.edu.tr
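Turning the marker file names listed earlier into the lines of available_sites is a small string operation, sketched below. The helper name is hypothetical, and the rule it applies (drop the leading dot, cut at the last occurrence of "iter") is inferred from the example listing in this guide, not taken from the perl scripts.

```cpp
#include <string>

// Hypothetical helper: ".c01.grid.etfbl.netiter0000_site0004" -> "c01.grid.etfbl.net".
// Returns an empty string if the name does not look like a marker file.
std::string ce_name_from_marker(const std::string& file_name) {
    const std::string::size_type cut = file_name.rfind("iter");
    if (cut == std::string::npos || file_name.empty() || file_name[0] != '.') {
        return "";
    }
    // Skip the leading '.' and keep everything before the "iterNNNN_siteNNNN" suffix.
    return file_name.substr(1, cut - 1);
}
```

Searching from the right matters for host names such as yildirim.grid.boun.edu.tr, where the letters before the suffix must not be mistaken for part of it.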
We propose to run the application on all or some of these sites, calculating the values of simple_gsl at only two points on every chosen grid site. The script that assists us in this job is run_helloWORLD_mpi_grid.pl, which could also be the starting point for a real-world script.
The third script is not actually necessary to run the application, but it is useful to see and get a feel for what is going on and to fetch results from the file repository. It takes the concatenation file from a chosen iteration, downloads it to the user's local directory, and draws the graph of the function using simple graph tools available on most Linux boxes. This script is semi-automated, and the user can make some of the choices. The name of this script is get_results_and_render_graphics.pl and you can find it in the same directory.
Getting started fast
Here we give the procedure to get a functional grid application as fast as possible.
- 1 download the tarball with the main application : helloWORLD_mpi_grid
- 2 download the tarball with the application for combining results : concat_files
- 3 download the perl scripts
