SG MPI Guide
From EGEE-see Wiki
MPI is popular in the SEE-GRID user community, and several SEE-GRID sites therefore support it. We strongly recommend that such sites offer a shared filesystem for jobs, because otherwise MPI support is not transparent to the user. In addition to the steps outlined on the Wiki page:
the following is recommended for SEE-GRID-2 sites:
- torque and maui RPMs should be upgraded to the latest version available on: http://hepunx.rl.ac.uk/~traylens/rpms/torque/ http://hepunx.rl.ac.uk/~traylens/rpms/maui/
- The EDG_WL_SCRATCH variable must not be set on any WN (it is usually defined in /etc/profile.d/lcgenv.[c]sh). Otherwise each job is executed in the $EDG_WL_SCRATCH directory and MPI jobs fail; MPI requires that jobs run in a home directory that is shared among all WNs.
Running jobs other than MPI ones in a local scratch directory is a very good idea, especially when the home directories of pool accounts are shared. The second recommendation can therefore be implemented in two ways:
- EDG_WL_SCRATCH is not set on WNs; instead, the TMPDIR variable is set to the local scratch directory, and the pbs job manager on the lcg-CE (/opt/globus/lib/perl/Globus/GRAM/JobManager/pbs.pm) is replaced with the one available at http://glite.phy.bg.ac.yu/GLITE-3_0_2/AEGIS/pbs.pm. The modifications in this file do not affect MPI jobs or jobs requiring multiple CPUs in any way. They only move single-CPU jobs to the $TMPDIR directory after they arrive on a WN, thus avoiding use of the shared home directory in that case. To prevent the default pbs.pm job manager file from being recreated during the next reconfiguration of the lcg-CE, the file /opt/globus/setup/globus/pbs.in should be replaced by the pbs.in file available at http://glite.phy.bg.ac.yu/GLITE-3_0_2/AEGIS/pbs.in.
- EDG_WL_SCRATCH is set on WNs and unset for a job if its $PBS_NODEFILE contains more than one line; this is done by placing the following files in /etc/profile.d/ on WNs: http://glite.phy.bg.ac.yu/GLITE-3_0_2/AEGIS/scl-wn.sh and http://glite.phy.bg.ac.yu/GLITE-3_0_2/AEGIS/scl-wn.csh.
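The logic of the second approach can be sketched as follows. This is only an illustration of the rule described above, not the contents of the actual scl-wn.sh: the function name unset_scratch_for_multinode is made up for the example, and the sketch assumes that EDG_WL_SCRATCH has already been set by an earlier profile script and that $PBS_NODEFILE holds one line per allocated CPU slot.

```shell
# Illustrative sketch only (hypothetical; not the real scl-wn.sh).
# Unset EDG_WL_SCRATCH when the job has been given more than one CPU slot,
# so that multi-CPU (MPI) jobs keep running in the shared home directory.
unset_scratch_for_multinode() {
    # PBS_NODEFILE is only defined inside a batch job.
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        # Count the lines in the node file; strip padding some wc versions add.
        nslots=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
        if [ "$nslots" -gt 1 ]; then
            unset EDG_WL_SCRATCH
        fi
        unset nslots
    fi
}

unset_scratch_for_multinode
```

Outside a batch job, PBS_NODEFILE is not defined and the function leaves the environment untouched, so interactive logins on a WN are not affected.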
To resolve the infamous problem with MPI configuration on sites with dual-CPU nodes, it is important to install a submit_filter script, which “patches” the bash scripts submitted by the LCG RB. This script may also be used by sites that support Myrinet or other high-performance network devices, so that they can make their resources available in a grid-friendly way. The script we provide here must be installed as /var/spool/pbs/submit_filter.pl and made executable (chmod +x):
#!/usr/bin/perl
# This script reads a submission script on the standard input, modifies
# it, and writes the modified script on standard output. This script
# makes two modifications:
# * correct the node specification to allow all cpus to be used
# * remove any existing "account" (-A) line
#
my $GLOBUS_LOCATION = '/opt/globus';
while (<STDIN>) {
    # By default just copy the line.
    $line = $_;
    # If there is a nodes line, then extract the value and adjust it
    # as necessary. Only modify the simple nodes request. If there
    # is a more complicated request assume that the user knows what
    # he/she is doing and leave it alone.
    if (m/#PBS\s+-l\s+nodes=(\d+)\s*$/) {
        $line = process_nodes($1);
    }
    # If there is an existing accounts line, delete it.
    if (m/#PBS\s+-A/) {
        $line = '';
    }
    print $line;
}
# This takes the number specified in the "nodes" specification and
# returns a "PBS -l" line which can be allocated on the available
# resources. This essentially does per-cpu allocation.
sub process_nodes {
    my $nodes = shift;
    my $line = "";
    # Collect information from the pbsnodes command on the number of
    # machine and cpus available. Don't do anything with offline
    # nodes.
    open PBS, "pbsnodes -a |";
    my $state = 1;
    my %machines;
    while (<PBS>) {
        if (m/ state\s*=\s*((\w|,)+)/) {
            if ($1=~/offline/){
                $state=0;
            }else{
                $state=1;
            }
        } elsif (m/np = (\d+)/) {
            my $ncpus = $1;
            if ($state) {
                if (defined($machines{$ncpus})) {
                    $machines{$ncpus} = $machines{$ncpus}+1;
                } else {
                    $machines{$ncpus} = 1;
                }
            }
        }
    }
    close PBS;
    # Count the total number of machines and cpus.
    my $tnodes = 0;
    my $tcpus = 0;
    my $maxcpu = 0;
    foreach my $ncpus (sort num_ascending keys %machines) {
        $tnodes += $machines{$ncpus};
        $tcpus += $machines{$ncpus}*$ncpus;
        $maxcpu = $ncpus if ($tcpus>=$nodes);
    }
    if ($maxcpu==0) {
        # There aren't enough cpus to handle the request. Just pass
        # the request through and let the job fail.
        $line .= "#PBS -l nodes=$nodes\n";
    } else {
        $line .="#PBS -l ";
        # We've already identified the largest machine we'll have to
        # allocate. Start by allocating one of those and iterate until
        # all are used.
        my %allocated;
        my $remaining_cpus = $nodes;
        my $remaining_nodes = $tnodes;
        foreach my $ncpus (sort num_descending keys %machines) {
            if ($ncpus<=$maxcpu && $remaining_cpus>0) {
                my $nmach = $machines{$ncpus};
                for (my $i=0;
                     ($i<$nmach) && ($remaining_cpus>$remaining_nodes);
                     $i++) {
                    $remaining_cpus -= $ncpus;
                    $remaining_nodes -= 1;
                    # May only have to use part of a node. Check here
                    # for that case.
                    my $used = ($remaining_cpus>=0)
                        ? $ncpus
                        : $ncpus+$remaining_cpus;
                    # Increase the allocation.
                    if (defined($allocated{$used})) {
                        $allocated{$used} += 1;
                    } else {
                        $allocated{$used} = 1;
                    }
                }
                # If we can fill out the rest without restricting the
                # number of cpus on a node, do so.
                if ($remaining_cpus<=$remaining_nodes &&
                    $remaining_cpus>0) {
                    my $used = 1;
                    if (defined($allocated{$used})) {
                        $allocated{$used} += $remaining_cpus;
                    } else {
                        $allocated{$used} = $remaining_cpus;
                    }
                    $remaining_cpus = 0;
                }
            }
        }
        my $first = 1;
        foreach my $i (sort num_descending keys %allocated) {
            $line .= "+" unless $first;
            $line .= "nodes=" if $first;
            # $line .= "nodes=";
            $line .= $allocated{$i};
            # $line .= ":ppn=" . $i unless ($i == 1);
            $line .= ":ppn=" . $i;
            $first = 0;
        }
        $line .= "\n";
    }
    return $line;
}
sub num_ascending { $a <=> $b; }
sub num_descending { $b <=> $a; }
This script solves some problems present in the similar script used in EGEE production, because it treats information from the batch system correctly.
For sites that support special low-latency network devices such as Myrinet, further improvements are available. Most sites in SEE-GRID, however, use only TCP/IP for communication, and for them this version of the script is sufficient.
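Before relying on the filter in production, it can be useful to exercise it by hand without touching the real batch system. The following is a rough sketch of one way to do that, under stated assumptions: the helper make_fake_pbsnodes and the FILTER variable are invented for this example, and the stub only imitates the "state =" and "np =" lines that the filter parses from "pbsnodes -a".

```shell
# Hypothetical test harness: run the submit filter against a fake pbsnodes.
FILTER=${FILTER:-/var/spool/pbs/submit_filter.pl}   # install path from the text

make_fake_pbsnodes() {
    # $1 = directory for the stub, $2 = number of free dual-CPU nodes to simulate
    dir=$1
    count=$2
    {
        echo '#!/bin/sh'
        i=1
        while [ "$i" -le "$count" ]; do
            echo "echo wn$i"
            echo "echo '     state = free'"
            echo "echo '     np = 2'"
            i=$((i + 1))
        done
    } > "$dir/pbsnodes"
    chmod +x "$dir/pbsnodes"
}

stubdir=$(mktemp -d)
make_fake_pbsnodes "$stubdir" 10

if [ -x "$FILTER" ]; then
    # Put the stub first on PATH so the filter's pbsnodes call finds it.
    printf '#PBS -l nodes=16\n' | PATH="$stubdir:$PATH" "$FILTER"
else
    echo "filter not installed at $FILTER; skipping"
fi

rm -rf "$stubdir"
```

Tracing process_nodes by hand for this case (ten free nodes with np = 2 and a request for 16 CPUs), the loop allocates six full nodes and then fills the remaining four CPUs one per node, so the rewritten line should read "#PBS -l nodes=6:ppn=2+4:ppn=1".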
Site administrators will have to educate users to launch their jobs with mpiexec instead of mpirun, because the former provides correct accounting and generally better handling of the jobs.
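As an illustration of what users could be asked to put in their job scripts, here is a minimal sketch of a launcher wrapper. It is not part of any SEE-GRID tooling: the function name launch_with_mpiexec, the MPIEXEC override variable and the binary name ./my_mpi_app are all placeholders, and no mpiexec options are shown because they differ between implementations.

```shell
# Hypothetical job-script fragment: always start the program through mpiexec.
launch_with_mpiexec() {
    # MPIEXEC may point at a site-specific mpiexec; defaults to the one on PATH.
    launcher=${MPIEXEC:-mpiexec}
    if ! command -v "$launcher" >/dev/null 2>&1; then
        # Fail loudly instead of silently falling back to mpirun.
        echo "mpiexec not found on this worker node" >&2
        return 1
    fi
    "$launcher" "$@"
}

# Typical use inside the job (placeholder binary name):
#   launch_with_mpiexec ./my_mpi_app
```

Failing when mpiexec is missing, rather than falling back to mpirun, keeps jobs from running outside the accounting that the text above attributes to mpiexec.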
