Testing MPI support

From EGEE-see Wiki


At the request of applications that need MPI support, GOODs are expected to test the MPI setup on all SEE-GRID sites that claim to support it. The test should be performed at least once a week, and GOODs should ensure that the parallel test job runs simultaneously on at least two WNs, so that the ssh setup between WNs is tested as well. On sites with SMPSIZE=4 this requires 5 MPI processes, while on sites with single-core WNs 2 processes are enough; the GOOD is therefore responsible for sending an appropriate test job to each site.

Since such test jobs require several CPUs, they are likely to take longer to execute than single-CPU jobs, and should be given enough time to complete (1-2 days). If a test job fails, or does not run to completion in a reasonable time, a GOOD site ticket should be created.
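The process counts quoted above follow a simple rule: a job cannot fit on a single WN if it requests one more process than a WN has cores. A minimal shell sketch of that rule, assuming the argument is the site's SMPSIZE (the helper name min_mpi_procs is hypothetical and not part of any gLite tool):

```shell
# Hypothetical helper: smallest process count that cannot fit on one WN.
# Assumes $1 is the site's SMPSIZE (cores per WN), a positive integer.
min_mpi_procs() {
    smpsize="$1"
    case "$smpsize" in
        ''|*[!0-9]*)
            echo "usage: min_mpi_procs SMPSIZE" >&2
            return 1
            ;;
    esac
    if [ "$smpsize" -lt 1 ]; then
        echo "SMPSIZE must be at least 1" >&2
        return 1
    fi
    # One process more than a single WN can hold forces a second WN.
    echo $((smpsize + 1))
}
```

The printed value is what would go into the NodeNumber attribute of the test JDL for that site.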

This guide can, however, also be used by any user or application developer interested in testing the MPI configuration on any Grid site.

The following short guide was prepared by the AEGIS01-PHY-SCL site admins. The files referred to in this guide are available here.

How to find out which SEE-GRID sites currently support MPI

The following simple JDL is enough:

[antun@ce test-mpi]$ cat test-mpi.jdl 
Type = "Job";
JobType = "MPICH";
NodeNumber = 2;
Executable = "test-mpi.sh";
Arguments = "test-mpi";
StdOutput = "test-mpi.out";
StdError = "test-mpi.err";
InputSandbox = {"test-mpi.sh","test-mpi.c"};
OutputSandbox = {"test-mpi.err","test-mpi.out","mpiexec.out"};
Requirements = Member("MPICH", other.GlueHostApplicationSoftwareRunTimeEnvironment);

The script test-mpi.sh, the MPI C program test-mpi.c, and the above JDL file are available here.

Typical usage:

 [antun@ce test-mpi]$ voms-proxy-init -voms seegrid
 Enter GRID pass phrase:
 Your identity: /C=RS/O=AEGIS/OU=Institute of Physics Belgrade/CN=Antun Balaz
 Cannot find file or dir: /home/antun/.glite/vomses
 Creating temporary proxy .................................................................... Done
 Contacting  voms.grid.auth.gr:15040 [/C=GR/O=HellasGrid/OU=auth.gr/CN=voms.grid.auth.gr] "seegrid" Done
 Creating proxy ................................... Done
 Your proxy is valid until Tue Nov  6 23:41:51 2007
 [antun@ce test-mpi]$ 
 [antun@ce test-mpi]$ glite-wms-job-delegate-proxy -d antun
 
 Connecting to the service https://wms.phy.bg.ac.yu:7443/glite_wms_wmproxy_server 
 
 
 ================== glite-wms-job-delegate-proxy Success ================== 
 
 Your proxy has been successfully delegated to the WMProxy:
 https://wms.phy.bg.ac.yu:7443/glite_wms_wmproxy_server
 
 with the delegation identifier: antun
 
 ==========================================================================
 
 
 [antun@ce test-mpi]$ glite-wms-job-list-match -d antun test-mpi.jdl  
 
 Connecting to the service https://wms.phy.bg.ac.yu:7443/glite_wms_wmproxy_server
 
 ==========================================================================
 
                      COMPUTING ELEMENT IDs LIST 
  The following CE(s) matching your job requirements have been found:
 
         *CEId*
  - ce.grid.pmf.unsa.ba:2119/jobmanager-pbs-seegrid
  - ce.ulakbim.gov.tr:2119/jobmanager-lcgpbs-seegrid
  - ce01.afroditi.hellasgrid.gr:2119/jobmanager-pbs-seegrid
  - ce01.info.uvt.ro:2119/jobmanager-pbs-seegrid
  - ce01.isabella.grnet.gr:2119/jobmanager-pbs-seegrid
  - ce01.mosigrid.utcluj.ro:2119/jobmanager-pbs-seegrid
  - ce02.grid.acad.bg:2119/jobmanager-pbs-seegrid
  - cluster1.csk.kg.ac.yu:2119/jobmanager-pbs-seegrid
  - grid-ce.ii.edu.mk:2119/jobmanager-pbs-seegrid
  - grid01.cg.ac.yu:2119/jobmanager-pbs-seegrid
  - grid01.rcub.bg.ac.yu:2119/jobmanager-pbs-seegrid
  - grid1.irb.hr:2119/jobmanager-pbs-grid
  - gw01.seegrid.grid.pub.ro:2119/jobmanager-pbs-seegrid
  - grid01.elfak.ni.ac.yu:2119/jobmanager-pbs-seegrid
  - ce.phy.bg.ac.yu:2119/jobmanager-pbs-seegrid
  - ce64.phy.bg.ac.yu:2119/jobmanager-pbs-seegrid
  - rti29.etf.bg.ac.yu:2119/jobmanager-pbs-seegrid
 
 ==========================================================================
 

How to test MPI support at a particular CE

To test MPI support at a particular CE, the above JDL file can be changed to include a suitable Requirements statement. For example, to submit an MPI job specifically to cluster1.csk.kg.ac.yu:2119/jobmanager-pbs-seegrid, the following JDL can be used:

[antun@ce test-mpi]$ cat test-mpi-AEGIS04-KG.jdl 
Type = "Job";
JobType = "MPICH";
NodeNumber = 2;
Executable = "test-mpi.sh";
Arguments = "test-mpi";
StdOutput = "test-mpi.out";
StdError = "test-mpi.err";
InputSandbox = {"test-mpi.sh","test-mpi.c"};
OutputSandbox = {"test-mpi.err","test-mpi.out","mpiexec.out"};
Requirements = Member("MPICH", other.GlueHostApplicationSoftwareRunTimeEnvironment) &&
               other.GlueCEUniqueID == "cluster1.csk.kg.ac.yu:2119/jobmanager-pbs-seegrid";
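Since the per-CE JDL differs from the generic one only in the Requirements expression (and possibly in NodeNumber), it can be generated rather than edited by hand. The following is a minimal sketch under that assumption; the function name make_ce_jdl and its arguments are hypothetical, and the attribute values are copied from the JDL shown above:

```shell
# Hypothetical helper: print a test JDL pinned to a single CE.
# $1 = CE ID as printed by glite-wms-job-list-match
# $2 = NodeNumber (optional, defaults to 2 as in the JDL above)
make_ce_jdl() {
    ce_id="$1"
    nodes="${2:-2}"
    if [ -z "$ce_id" ]; then
        echo "usage: make_ce_jdl CE_ID [NODES]" >&2
        return 1
    fi
    cat <<EOF
Type = "Job";
JobType = "MPICH";
NodeNumber = ${nodes};
Executable = "test-mpi.sh";
Arguments = "test-mpi";
StdOutput = "test-mpi.out";
StdError = "test-mpi.err";
InputSandbox = {"test-mpi.sh","test-mpi.c"};
OutputSandbox = {"test-mpi.err","test-mpi.out","mpiexec.out"};
Requirements = Member("MPICH", other.GlueHostApplicationSoftwareRunTimeEnvironment) &&
               other.GlueCEUniqueID == "${ce_id}";
EOF
}
```

For example, make_ce_jdl "cluster1.csk.kg.ac.yu:2119/jobmanager-pbs-seegrid" > test-mpi-AEGIS04-KG.jdl would reproduce the file shown above, and a second argument of 5 would produce a variant suitable for a site with SMPSIZE=4.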

The file available here contains a JDL file for each of the CEs listed above that were identified as supporting MPI. If more CEs appear in the output of the glite-wms-job-list-match command, you can easily create JDL files for the additional CEs; likewise, you can remove the JDL files for CEs that no longer appear in its output.
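Keeping the set of JDL files in step with the matching CEs can be partly automated. The sketch below assumes that CE lines in the list-match output have the form " - host:port/queue", as in the listing above, and that each per-CE JDL file quotes its CE ID in the Requirements expression. Both function names are hypothetical:

```shell
# Hypothetical helper: read glite-wms-job-list-match output on stdin
# and print one CE ID per line.
extract_ce_ids() {
    awk '$1 == "-" && $2 ~ /:[0-9]+\// { print $2 }'
}

# Hypothetical helper: read CE IDs on stdin and print those that are not
# mentioned in any *.jdl file in the given directory (default: current one).
ces_without_jdl() {
    dir="${1:-.}"
    while IFS= read -r ce; do
        # -F: fixed string, -q: quiet, -s: ignore missing files
        grep -qsF -- "\"$ce\"" "$dir"/*.jdl || echo "$ce"
    done
}
```

A typical invocation would be glite-wms-job-list-match -d antun test-mpi.jdl | extract_ce_ids | ces_without_jdl, which lists the CEs that still need a JDL file.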

Typical results of the MPI test

 [antun@ce test-mpi]$ voms-proxy-init -voms seegrid
 Enter GRID pass phrase:
 Your identity: /C=RS/O=AEGIS/OU=Institute of Physics Belgrade/CN=Antun Balaz
 Cannot find file or dir: /home/antun/.glite/vomses
 Creating temporary proxy .................................... Done
 Contacting  voms.grid.auth.gr:15040 [/C=GR/O=HellasGrid/OU=auth.gr/CN=voms.grid.auth.gr] "seegrid" Done
 Creating proxy .......................................... Done
 Your proxy is valid until Tue Nov  6 23:51:09 2007
 [antun@ce test-mpi]$ 
 [antun@ce test-mpi]$ glite-wms-job-delegate-proxy -d antun
 
 Connecting to the service https://wms.phy.bg.ac.yu:7443/glite_wms_wmproxy_server
 
 
 ================== glite-wms-job-delegate-proxy Success ==================
 
 Your proxy has been successfully delegated to the WMProxy:
 https://wms.phy.bg.ac.yu:7443/glite_wms_wmproxy_server
 
 with the delegation identifier: antun
 
 ==========================================================================
 
 
 [antun@ce test-mpi]$ glite-wms-job-submit -d antun test-mpi-AEGIS04-KG.jdl 
 
 Connecting to the service https://wms.phy.bg.ac.yu:7443/glite_wms_wmproxy_server
 
 
 ====================== glite-wms-job-submit Success ======================
 
 The job has been successfully submitted to the WMProxy
 Your job identifier is:
 
 https://wms.phy.bg.ac.yu:9000/SGEjRpd7TpT9parBkwDvkA
 
 ==========================================================================
 
 
 [antun@ce test-mpi]$ glite-wms-job-status https://wms.phy.bg.ac.yu:9000/SGEjRpd7TpT9parBkwDvkA
 
 
 *************************************************************
 BOOKKEEPING INFORMATION:
 
 Status info for the Job : https://wms.phy.bg.ac.yu:9000/SGEjRpd7TpT9parBkwDvkA
 Current Status:     Done (Success)
 Exit code:          0
 Status Reason:      Job terminated successfully
 Destination:        cluster1.csk.kg.ac.yu:2119/jobmanager-pbs-seegrid
 Submitted:          Tue Nov  6 11:52:28 2007 CET
 *************************************************************
 
 [antun@ce test-mpi]$ glite-wms-job-output --dir . https://wms.phy.bg.ac.yu:9000/SGEjRpd7TpT9parBkwDvkA
 
 Connecting to the service https://147.91.84.25:7443/glite_wms_wmproxy_server
 
 
 Warning - Directory already exists: 
 /home/antun/test-mpi
 Do you wish to overwrite it ? [y/n]n : y
 
 ================================================================================
 
                         JOB GET OUTPUT OUTCOME
 
 Output sandbox files for the job:
 https://wms.phy.bg.ac.yu:9000/SGEjRpd7TpT9parBkwDvkA
 have been successfully retrieved and stored in the directory:
 /home/antun/test-mpi
 
 ================================================================================
 
 [antun@ce test-mpi]$ ls -l antun_8H4ERlgSmA1_Ff1Dtzx7Jw/
 total 12
 -rw-rw-r--    1 antun    antun          78 Aug  9 19:18 mpiexec.out
 -rw-rw-r--    1 antun    antun        2085 Aug  9 19:18 test-mpi.err
 -rw-rw-r--    1 antun    antun        1155 Aug  9 19:18 test-mpi.out
 [antun@ce test-mpi]$ 
 [antun@ce test-mpi]$ cat antun_8H4ERlgSmA1_Ff1Dtzx7Jw/mpiexec.out 
 Hello world! from processor 1 out of 2
 Hello world! from processor 0 out of 2
 [antun@ce test-mpi]$ cat antun_8H4ERlgSmA1_Ff1Dtzx7Jw/test-mpi.out 
 ***********************************************************************
 Running on: cluster7.csk.kg.ac.yu
 As:        seegrid074
 ***********************************************************************
 ***********************************************************************
 Compiling binary: test-mpi
 mpicc -o test-mpi test-mpi.c
 *************************************
 PBS Nodefile: /var/spool/pbs/aux//10950.cluster1.csk.kg.ac.yu
 ***********************************************************************
 Node count:       2
 Nodes in /var/spool/pbs/aux//10950.cluster1.csk.kg.ac.yu: 
 cluster7.csk.kg.ac.yu
 cluster6.csk.kg.ac.yu
 ***********************************************************************
 ***********************************************************************
 Checking ssh for each node:
 Checking cluster7.csk.kg.ac.yu...
 cluster7.csk.kg.ac.yu
 Checking cluster6.csk.kg.ac.yu...
 cluster6.csk.kg.ac.yu
 ***********************************************************************
 ***********************************************************************
 Executing test-mpi with mpiexec
 ***********************************************************************
 

Note that the test-mpi.err file is not empty, but this is not a problem: it only contains verbose details. The above CE successfully passed the MPI test.
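When many sites are tested, inspecting mpiexec.out by eye becomes tedious. The following sketch checks that the file contains exactly one greeting per rank, assuming the output format shown in the transcript above ("Hello world! from processor R out of N"). The function name check_mpiexec_out is hypothetical:

```shell
# Hypothetical helper: succeed only if FILE has exactly NPROCS greeting
# lines, one for each rank 0..NPROCS-1, all reporting NPROCS as the total.
# $1 = path to mpiexec.out, $2 = expected number of processes
check_mpiexec_out() {
    file="$1"
    nprocs="$2"
    awk -v n="$nprocs" '
        /^[ \t]*Hello world! from processor [0-9]+ out of [0-9]+[ \t]*$/ {
            rank = $(NF - 3) + 0      # rank of this process
            total = $NF + 0           # total reported by this process
            if (total != n + 0 || rank >= n + 0 || seen[rank]++) bad = 1
            count++
        }
        END { exit (bad || count + 0 != n + 0) ? 1 : 0 }
    ' "$file"
}
```

For the job above, check_mpiexec_out antun_8H4ERlgSmA1_Ff1Dtzx7Jw/mpiexec.out 2 would be expected to succeed. A non-zero exit status means that some rank is missing, duplicated, or reports a different total.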
