MPI Job Output to stdout

From EGEE-see WIki

Summary

The following problem affects MPI jobs that print to standard output on clusters that use IBM's GPFS file system. It was identified on the SEEGRID HG-01-GRNET cluster.

Developers should be aware of this problem and modify their code to avoid it.

Description of the Problem

Assuming that the following MPI job is run:

  /* C example: rank 0 sends every other rank its own rank number */
  #include <stdio.h>
  #include <mpi.h>

  int main (int argc, char *argv[])
  {
    int message = 0;              /* stays 0 if there is only one process */
    int rank, size, i, tag;
    MPI_Status status;
    MPI_Init (&argc, &argv);      /* starts MPI */
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);        /* get current process id */
    MPI_Comm_size (MPI_COMM_WORLD, &size);        /* get number of processes */
    tag = 100;
    if (rank == 0)
      {
        for (i = 1; i < size; i++)
          {
            message = i;
            MPI_Send (&message, 1, MPI_INT, i, tag, MPI_COMM_WORLD);
          }
      }
    else
      {
        MPI_Recv (&message, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
      }
    /* every rank prints concurrently; rank 0 prints the last value it sent */
    printf ("node:%d  %d\n", rank, message);
    MPI_Finalize ();
    return 0;
  }


Running the job produces incomplete output:

  cat  /tmp/glite/glite-ui/emanouil_DREFA0szRGmDuldUSGAkhw/std.out
  node:1  1
  node:0  2

Rank 0 prints 2, the last value it sent, so the job ran with three processes and the line from node2 (node:2  2) is missing.

Why this is happening

The cause is a "feature" of IBM's General Parallel File System (GPFS), the parallel file system used on the worker nodes (WNs) of HG-01-GRNET for the /home directories.

The problem appears when MPI peer processes write to stdout concurrently, as in the example above. When node1 and node2 receive their messages from node0, they both call printf() at the same time. Standard output is redirected to a file on a GPFS file system so that it can be collected and sent to the Resource Broker (RB). GPFS does not update the file pointer atomically, so the output of one process overwrites the output of another.

This behaviour is discussed on IBM's forum; see the following thread for details:

http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?forum=479&thread=145946&cat=13

This problem does not affect MPI functionality: message passing happens exactly as it should.

The solution (Workaround)

The workaround is to have only one process write to stdout with printf(), as most MPI programs do anyway.

Note that all other filesystem implementations under Linux work as expected, but developers should be aware of this problem and apply this workaround.

Credits

Many thanks to Emanouil Atanassov for identifying the problem and Vangelis Koukis for providing the explanation and workaround for it.
