Testing MPI support
From EGEE-see Wiki
At the request of applications that need MPI support on sites, GOODs are expected to test the MPI setup on all SEE-GRID sites that claim to support it. The MPI setup test should be performed at least once a week, and GOODs should ensure that the parallel test job runs simultaneously on at least two WNs (to test the ssh setup as well). On sites with SMPSIZE=4 this requires 5 MPI processes; on sites with single-core WNs, 2 processes are enough. The GOOD is therefore responsible for sending the appropriate test job to each site. Since such test jobs require several CPUs, their execution is likely to take longer than is usual for single-CPU jobs. For this reason, such test jobs should be given enough time to complete (1-2 days). If a test job fails (or does not run to completion in a reasonable time), a GOOD site ticket should be created.
However, this guide can be used by any user/application developer interested in testing MPI configuration on any Grid site.
The following short guide was prepared by the AEGIS01-PHY-SCL site admins. The files referred to in this guide are available here.
How to find out which SEE-GRID sites currently support MPI
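The process-count rule given above (5 processes for SMPSIZE=4, 2 processes for single-core WNs) can be written as a small helper. This is only a sketch: the function name min_mpi_procs is hypothetical, and it assumes one MPI process per core, so that a WN with SMPSIZE cores holds at most SMPSIZE processes and SMPSIZE + 1 processes must span at least two WNs.

```shell
#!/bin/sh
# min_mpi_procs (hypothetical helper): smallest number of MPI processes
# that cannot fit on a single WN, given the number of cores per WN.
# Assumption: one MPI process per core.
min_mpi_procs() {
    smpsize="$1"
    # Accept only positive integers without leading zeros.
    case "$smpsize" in
        ''|0*|*[!0-9]*)
            echo "usage: min_mpi_procs SMPSIZE" >&2
            return 1
            ;;
    esac
    # One WN holds at most SMPSIZE processes, so one more spills over.
    echo $((smpsize + 1))
}

min_mpi_procs 4   # site with SMPSIZE=4
min_mpi_procs 1   # site with single-core WNs
```

The printed value is what would go into the NodeNumber attribute of the test JDL for that site.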
The following simple JDL is enough:
[antun@ce test-mpi]$ cat test-mpi.jdl
Type = "Job";
JobType = "MPICH";
NodeNumber = 2;
Executable = "test-mpi.sh";
Arguments = "test-mpi";
StdOutput = "test-mpi.out";
StdError = "test-mpi.err";
InputSandbox = {"test-mpi.sh","test-mpi.c"};
OutputSandbox = {"test-mpi.err","test-mpi.out","mpiexec.out"};
Requirements = Member("MPICH", other.GlueHostApplicationSoftwareRunTimeEnvironment);
The script test-mpi.sh, the MPI C program test-mpi.c, and the above JDL file are available here.
Typical usage:
[antun@ce test-mpi]$ voms-proxy-init -voms seegrid
Enter GRID pass phrase:
Your identity: /C=RS/O=AEGIS/OU=Institute of Physics Belgrade/CN=Antun Balaz
Cannot find file or dir: /home/antun/.glite/vomses
Creating temporary proxy .................................................................... Done
Contacting voms.grid.auth.gr:15040 [/C=GR/O=HellasGrid/OU=auth.gr/CN=voms.grid.auth.gr] "seegrid" Done
Creating proxy ................................... Done
Your proxy is valid until Tue Nov 6 23:41:51 2007
[antun@ce test-mpi]$
[antun@ce test-mpi]$ glite-wms-job-delegate-proxy -d antun
Connecting to the service https://wms.phy.bg.ac.yu:7443/glite_wms_wmproxy_server
================== glite-wms-job-delegate-proxy Success ==================
Your proxy has been successfully delegated to the WMProxy:
https://wms.phy.bg.ac.yu:7443/glite_wms_wmproxy_server
with the delegation identifier: antun
==========================================================================
[antun@ce test-mpi]$ glite-wms-job-list-match -d antun test-mpi.jdl
Connecting to the service https://wms.phy.bg.ac.yu:7443/glite_wms_wmproxy_server
==========================================================================
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found:
*CEId*
- ce.grid.pmf.unsa.ba:2119/jobmanager-pbs-seegrid
- ce.ulakbim.gov.tr:2119/jobmanager-lcgpbs-seegrid
- ce01.afroditi.hellasgrid.gr:2119/jobmanager-pbs-seegrid
- ce01.info.uvt.ro:2119/jobmanager-pbs-seegrid
- ce01.isabella.grnet.gr:2119/jobmanager-pbs-seegrid
- ce01.mosigrid.utcluj.ro:2119/jobmanager-pbs-seegrid
- ce02.grid.acad.bg:2119/jobmanager-pbs-seegrid
- cluster1.csk.kg.ac.yu:2119/jobmanager-pbs-seegrid
- grid-ce.ii.edu.mk:2119/jobmanager-pbs-seegrid
- grid01.cg.ac.yu:2119/jobmanager-pbs-seegrid
- grid01.rcub.bg.ac.yu:2119/jobmanager-pbs-seegrid
- grid1.irb.hr:2119/jobmanager-pbs-grid
- gw01.seegrid.grid.pub.ro:2119/jobmanager-pbs-seegrid
- grid01.elfak.ni.ac.yu:2119/jobmanager-pbs-seegrid
- ce.phy.bg.ac.yu:2119/jobmanager-pbs-seegrid
- ce64.phy.bg.ac.yu:2119/jobmanager-pbs-seegrid
- rti29.etf.bg.ac.yu:2119/jobmanager-pbs-seegrid
==========================================================================
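For scripted use, the CE identifiers can be pulled out of a listing like the one above. The following is a minimal sketch, assuming the output keeps the "- host:port/queue" line format shown here; the function name extract_ce_ids is hypothetical.

```shell
#!/bin/sh
# extract_ce_ids (hypothetical helper): read list-match output on stdin
# and print one CE identifier per line.
# Assumption: each CE appears on its own line as "- host:port/queue".
extract_ce_ids() {
    sed -n 's|^[[:space:]]*-[[:space:]]*\([^[:space:]]*:[0-9][0-9]*/[^[:space:]]*\)[[:space:]]*$|\1|p'
}

# Demo on a fragment of the listing above; header and separator lines
# are dropped because they do not match the pattern.
extract_ce_ids <<'EOF'
COMPUTING ELEMENT IDs LIST
*CEId*
- ce.phy.bg.ac.yu:2119/jobmanager-pbs-seegrid
- grid1.irb.hr:2119/jobmanager-pbs-grid
==========================================================================
EOF
```

In practice the function would be fed directly from the command, for example: glite-wms-job-list-match -d antun test-mpi.jdl | extract_ce_ids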
How to test MPI support at a particular CE
To test MPI support at a particular CE, the Requirements expression in the above JDL file can be extended to select that CE. For example, to submit an MPI job specifically to cluster1.csk.kg.ac.yu:2119/jobmanager-pbs-seegrid, the following JDL can be used:
[antun@ce test-mpi]$ cat test-mpi-AEGIS04-KG.jdl
Type = "Job";
JobType = "MPICH";
NodeNumber = 2;
Executable = "test-mpi.sh";
Arguments = "test-mpi";
StdOutput = "test-mpi.out";
StdError = "test-mpi.err";
InputSandbox = {"test-mpi.sh","test-mpi.c"};
OutputSandbox = {"test-mpi.err","test-mpi.out","mpiexec.out"};
Requirements = Member("MPICH", other.GlueHostApplicationSoftwareRunTimeEnvironment) &&
other.GlueCEUniqueID == "cluster1.csk.kg.ac.yu:2119/jobmanager-pbs-seegrid";
The file available here contains a JDL file for each of the CEs listed above that were identified as supporting MPI. If more CEs appear in the output of the glite-wms-job-list-match command, you can easily create JDL files for the additional CEs, and you can remove the JDL files for CEs that no longer appear in that output.
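Creating a JDL file for an additional CE amounts to repeating the JDL shown earlier with a different GlueCEUniqueID. The sketch below does this for any CE identifier. The function name write_ce_jdl and the host-based file naming are assumptions made for illustration (the example above names its file after the site, test-mpi-AEGIS04-KG.jdl, which cannot be derived from the CE identifier alone).

```shell
#!/bin/sh
# write_ce_jdl CEID [DIR] (hypothetical helper): write a copy of the test
# JDL pinned to one CE and print the name of the file that was created.
# Assumption: files are named test-mpi-<CE host>.jdl.
write_ce_jdl() {
    ceid="$1"
    dir="${2:-.}"
    host="${ceid%%:*}"          # part of the CE ID before the first ':'
    out="$dir/test-mpi-$host.jdl"
    cat > "$out" <<EOF
Type = "Job";
JobType = "MPICH";
NodeNumber = 2;
Executable = "test-mpi.sh";
Arguments = "test-mpi";
StdOutput = "test-mpi.out";
StdError = "test-mpi.err";
InputSandbox = {"test-mpi.sh","test-mpi.c"};
OutputSandbox = {"test-mpi.err","test-mpi.out","mpiexec.out"};
Requirements = Member("MPICH", other.GlueHostApplicationSoftwareRunTimeEnvironment) &&
other.GlueCEUniqueID == "$ceid";
EOF
    echo "$out"
}

# Demo: generate JDL files for two CEs in a scratch directory.
demo_dir=$(mktemp -d)
while read -r ceid; do
    write_ce_jdl "$ceid" "$demo_dir"
done <<'EOF'
cluster1.csk.kg.ac.yu:2119/jobmanager-pbs-seegrid
ce.phy.bg.ac.yu:2119/jobmanager-pbs-seegrid
EOF
rm -rf "$demo_dir"
```

NodeNumber is left at 2 here; on sites with multi-core WNs it would need to be raised so that the job spans at least two WNs.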
Typical results of the MPI test
[antun@ce test-mpi]$ voms-proxy-init -voms seegrid
Enter GRID pass phrase:
Your identity: /C=RS/O=AEGIS/OU=Institute of Physics Belgrade/CN=Antun Balaz
Cannot find file or dir: /home/antun/.glite/vomses
Creating temporary proxy .................................... Done
Contacting voms.grid.auth.gr:15040 [/C=GR/O=HellasGrid/OU=auth.gr/CN=voms.grid.auth.gr] "seegrid" Done
Creating proxy .......................................... Done
Your proxy is valid until Tue Nov 6 23:51:09 2007
[antun@ce test-mpi]$
[antun@ce test-mpi]$ glite-wms-job-delegate-proxy -d antun
Connecting to the service https://wms.phy.bg.ac.yu:7443/glite_wms_wmproxy_server
================== glite-wms-job-delegate-proxy Success ==================
Your proxy has been successfully delegated to the WMProxy:
https://wms.phy.bg.ac.yu:7443/glite_wms_wmproxy_server
with the delegation identifier: antun
==========================================================================
[antun@ce test-mpi]$ glite-wms-job-submit -d antun test-mpi-AEGIS04-KG.jdl
Connecting to the service https://wms.phy.bg.ac.yu:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://wms.phy.bg.ac.yu:9000/SGEjRpd7TpT9parBkwDvkA
==========================================================================
[antun@ce test-mpi]$ glite-wms-job-status https://wms.phy.bg.ac.yu:9000/SGEjRpd7TpT9parBkwDvkA
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://wms.phy.bg.ac.yu:9000/SGEjRpd7TpT9parBkwDvkA
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: cluster1.csk.kg.ac.yu:2119/jobmanager-pbs-seegrid
Submitted: Tue Nov 6 11:52:28 2007 CET
*************************************************************
[antun@ce test-mpi]$ glite-wms-job-output --dir . https://wms.phy.bg.ac.yu:9000/SGEjRpd7TpT9parBkwDvkA
Connecting to the service https://147.91.84.25:7443/glite_wms_wmproxy_server
Warning - Directory already exists:
/home/antun/test-mpi
Do you wish to overwrite it ? [y/n]n : y
================================================================================
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
https://wms.phy.bg.ac.yu:9000/SGEjRpd7TpT9parBkwDvkA
have been successfully retrieved and stored in the directory:
/home/antun/test-mpi
================================================================================
[antun@ce test-mpi]$ ls -l antun_8H4ERlgSmA1_Ff1Dtzx7Jw/
total 12
-rw-rw-r-- 1 antun antun 78 Aug 9 19:18 mpiexec.out
-rw-rw-r-- 1 antun antun 2085 Aug 9 19:18 test-mpi.err
-rw-rw-r-- 1 antun antun 1155 Aug 9 19:18 test-mpi.out
[antun@ce test-mpi]$
[antun@ce test-mpi]$ cat antun_8H4ERlgSmA1_Ff1Dtzx7Jw/mpiexec.out
Hello world! from processor 1 out of 2
Hello world! from processor 0 out of 2
[antun@ce test-mpi]$ cat antun_8H4ERlgSmA1_Ff1Dtzx7Jw/test-mpi.out
***********************************************************************
Running on: cluster7.csk.kg.ac.yu
As: seegrid074
***********************************************************************
***********************************************************************
Compiling binary: test-mpi
mpicc -o test-mpi test-mpi.c
*************************************
PBS Nodefile: /var/spool/pbs/aux//10950.cluster1.csk.kg.ac.yu
***********************************************************************
Node count: 2
Nodes in /var/spool/pbs/aux//10950.cluster1.csk.kg.ac.yu:
cluster7.csk.kg.ac.yu
cluster6.csk.kg.ac.yu
***********************************************************************
***********************************************************************
Checking ssh for each node:
Checking cluster7.csk.kg.ac.yu...
cluster7.csk.kg.ac.yu
Checking cluster6.csk.kg.ac.yu...
cluster6.csk.kg.ac.yu
***********************************************************************
***********************************************************************
Executing test-mpi with mpiexec
***********************************************************************
Note that the test-mpi.err file is not empty, but this is not a problem: it contains verbose details. The above CE successfully passed the MPI test.
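Inspecting the retrieved files can also be scripted. The sketch below is one possible check, not part of the original test package: it assumes mpiexec.out contains one "Hello world! from processor R out of N" line per rank, and that test-mpi.out contains a "Checking <host>..." line per node, as in the output above. The function names check_mpiexec_out and count_wns are hypothetical.

```shell
#!/bin/sh
# check_mpiexec_out FILE NPROCS (hypothetical helper): succeed only if
# every rank 0..NPROCS-1 reported exactly once with the expected total.
check_mpiexec_out() {
    file="$1"
    nprocs="$2"
    rank=0
    while [ "$rank" -lt "$nprocs" ]; do
        # grep -c prints 0 and exits non-zero when nothing matches.
        count=$(grep -c -x "Hello world! from processor $rank out of $nprocs" "$file" || true)
        if [ "$count" -ne 1 ]; then
            echo "rank $rank: expected 1 line, found $count" >&2
            return 1
        fi
        rank=$((rank + 1))
    done
}

# count_wns FILE (hypothetical helper): number of distinct hosts named in
# the "Checking <host>..." lines of test-mpi.out.
count_wns() {
    sed -n 's/^Checking \(.*\)\.\.\.$/\1/p' "$1" | sort -u | wc -l | tr -d ' '
}

# Demo on copies of the output shown above.
demo=$(mktemp -d)
printf '%s\n' 'Hello world! from processor 1 out of 2' \
              'Hello world! from processor 0 out of 2' > "$demo/mpiexec.out"
printf '%s\n' 'Checking ssh for each node:' \
              'Checking cluster7.csk.kg.ac.yu...' 'cluster7.csk.kg.ac.yu' \
              'Checking cluster6.csk.kg.ac.yu...' 'cluster6.csk.kg.ac.yu' > "$demo/test-mpi.out"
if check_mpiexec_out "$demo/mpiexec.out" 2 && [ "$(count_wns "$demo/test-mpi.out")" -ge 2 ]; then
    echo "MPI test passed"
else
    echo "MPI test FAILED"
fi
rm -rf "$demo"
```

Requiring at least two distinct WNs mirrors the requirement that the parallel job runs on at least two WNs; a job whose processes all landed on one WN would not exercise the ssh setup between nodes.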
