Job output monitoring using grid-stdout-mon

From EGEE-see Wiki

Introduction

In addition to the job perusal feature implemented in WMS, gLite ships with grid-stdout-mon, an external tool for monitoring job output. grid-stdout-mon works by periodically copying the job's standard output and standard error from the worker node (WN) to a predefined storage element (SE). The tool consists of three commands:

  • grid-stdout-mon - periodically copies the job output from the WN to the SE,
  • grid-stdout-mon-on - enables monitoring for the given job ID,
  • grid-stdout-mon-get - downloads output files from the SE to the UI.

grid-stdout-mon

The grid-stdout-mon command should be started as part of the user job, before the real job executable runs. It requires several environment variables to work:

  • DPM_HOST - defines the SE used to store the files,
  • LCG_STDOUT_MON_FLAG - should be set to ON_DEMAND for monitoring to work,
  • EDG_WL_JOBID - contains the jobid of the current job.
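
Since a missing variable is easy to overlook, a job script can check for the three variables listed above before starting the monitor. The following is a minimal sketch only; require_env is a hypothetical helper written for this page and is not part of grid-stdout-mon:

```shell
#!/bin/sh
# require_env VAR...
# Hypothetical helper (not part of grid-stdout-mon): returns non-zero and
# prints a message if any of the named environment variables is unset or empty.
require_env() {
    for var in "$@"; do
        # Indirect lookup of the variable whose name is stored in $var.
        eval "val=\${$var:-}"
        if [ -z "$val" ]; then
            echo "missing required variable: $var" >&2
            return 1
        fi
    done
}
```

In a job script it could be called as `require_env DPM_HOST LCG_STDOUT_MON_FLAG EDG_WL_JOBID || exit 1` just before grid-stdout-mon is started.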

In addition to these, the files to be monitored and the user VO are specified as command parameters. After querying the information system for the location of the VO directory on the SE, grid-stdout-mon forks into the background and wakes up periodically to upload changes in the output files to the SE. Output files are uploaded only if monitoring has been enabled for the given job using grid-stdout-mon-on. An example job script using output monitoring could look like this:

 #!/bin/sh
 
 # set required parameters
 export DPM_HOST=egee2.irb.hr
 export LCG_STDOUT_MON_FLAG=ON_DEMAND
 export EDG_WL_JOBID
 
 # create monitored files
 touch stdout
 touch stderr
 
 # start monitoring script
 grid-stdout-mon -out stdout -err stderr -vo 
 
 # start the real job
 ./myapp >stdout 2>stderr
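
The incremental upload that grid-stdout-mon performs in the background can be illustrated with a small local sketch. This is not the tool's actual code: save_new_part and its .state file are hypothetical, the copy to the SE is left out, and the part names (stdout.0, stdout.1, ...) are assumed from the names that appear in the grid-stdout-mon-get output further down this page.

```shell
#!/bin/sh
# save_new_part FILE OUTDIR
# Hypothetical sketch: writes the bytes appended to FILE since the previous
# call into OUTDIR/<name>.<n>, with n counting up from 0. The last offset and
# the next part number are remembered in OUTDIR/<name>.state (assumed layout).
save_new_part() {
    file=$1
    outdir=$2
    name=$(basename "$file")
    state="$outdir/$name.state"
    offset=0
    part=0
    if [ -f "$state" ]; then
        read -r offset part < "$state"
    fi
    # Arithmetic expansion strips the padding some wc implementations emit.
    size=$(($(wc -c < "$file")))
    if [ "$size" -gt "$offset" ]; then
        # tail -c +N starts at byte N (1-based); head -c caps the part at the
        # size measured above in case the file grows while it is being read.
        tail -c "+$((offset + 1))" "$file" | head -c "$((size - offset))" \
            > "$outdir/$name.$part"
        echo "$size $((part + 1))" > "$state"
    fi
}
```

A monitoring loop would call `save_new_part stdout /some/dir` after each sleep interval and then copy any newly created part to the SE; when the file has not grown, no part is written.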

grid-stdout-mon-on

grid-stdout-mon-on enables monitoring for the specified job by uploading a file containing the list of job IDs (file_on_demand.log) to the /<VOdir>/<UserDN>/ directory on the SE. The grid-stdout-mon process running on the WN downloads this file to check whether the output files of a specific job should be monitored. The command is run on the UI after the job has been submitted to the RB:

 $ export DPM_HOST=egee2.irb.hr
 $ grid-stdout-mon-on -d egee2.irb.hr -v dteam -j https://g01.phy.bg.ac.yu:9000/RD-BMH9RPVtKZR2pB7Akdg
 DPM-SE:  egee2.irb.hr
 
 JobID: 0  IDRD-BMH9RPVtKZR2pB7Akdg
 
 File already exist on DPM. Copy it to location /tmp/file_on_demand.log

The command first checks whether file_on_demand.log exists on the SE, then downloads it, appends the requested job ID, and uploads the updated file back to the SE. As can be seen from the output, only the last part of the job ID is used (protocol, host and port are stripped).
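
The ID shortening and the append step can be sketched locally as follows. The function names are hypothetical, the "ID" prefix is taken from the sample output above, and the one-ID-per-line layout of file_on_demand.log is an assumption, since its format is not documented here. The download from and upload to the SE are omitted.

```shell
#!/bin/sh
# short_job_id JOBID_URL
# Hypothetical helper: drops everything up to the last "/" (protocol, host
# and port) and adds the "ID" prefix seen in the sample output.
short_job_id() {
    printf 'ID%s\n' "${1##*/}"
}

# add_job_id LIST_FILE JOBID_URL
# Appends the shortened ID to a local copy of the list unless it is already
# there (assumes one ID per line).
add_job_id() {
    list=$1
    id=$(short_job_id "$2")
    touch "$list"
    grep -qxF -e "$id" "$list" || printf '%s\n' "$id" >> "$list"
}
```

For the job ID used in the example, `short_job_id https://g01.phy.bg.ac.yu:9000/RD-BMH9RPVtKZR2pB7Akdg` prints IDRD-BMH9RPVtKZR2pB7Akdg, which matches the ID shown in the command output.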

grid-stdout-mon-get

grid-stdout-mon-get downloads the output files from the SE to a directory on the local machine. Because the output files are stored in parts (grid-stdout-mon uploads only the part of a file that has changed since the last upload), the Status file is downloaded first. It contains the list of parts and their respective sizes. Based on this information, the parts are downloaded and merged into the original files.

 $ export DPM_HOST=egee2.irb.hr
 $ grid-stdout-mon-get -s egee2.irb.hr -v dteam -j https://g01.phy.bg.ac.yu:9000/RD-BMH9RPVtKZR2pB7Akdg -d /tmp
 Selected DPM-SE:  egee2.irb.hr
 
 JobID:   IDRD-BMH9RPVtKZR2pB7Akdg
 
 Source:   gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/
 
 Source:   : gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/Status
 filepath is set by user:    /tmp
 
 *** Files are successfully copied ***
 From:  gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/Status
 TO:  file://home.irb.hr/tmp/Status
 
 Source:   : gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/stderr.0
 filepath is set by user:    /tmp
 
 Destination:   file://home.irb.hr/tmp/stderr.0
 
 *** Files are successfully copied ***
 From:  gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/stderr.0
 TO:  file://home.irb.hr/tmp/stderr.0
 
 Source:   : gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/stderr.1
 filepath is set by user:    /tmp
 
 Destination:   file://home.irb.hr/tmp/stderr.1
 
 Source:   : gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/stdout.0
 filepath is set by user:    /tmp
 
 Destination:   file://home.irb.hr/tmp/stdout.0
 
 *** Files are successfully copied ***
 From:  gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/stdout.0
 TO:  file://home.irb.hr/tmp/stdout.0
 
 Source:   : gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/stdout.1
 filepath is set by user:    /tmp
 
 Destination:   file://home.irb.hr/tmp/stdout.1
 
 
 **** Start merging for stdout and stderror ****
 
 Reading currFile :  /tmp/stdout.0
 Reading PrevFile :  /tmp/stderr.0
 Reading currFile :  /tmp/stdout.1
 Reading PrevFile :  /tmp/stderr.1
 
 *** Files are successfuly merged to stdout/stderr at:   /tmp
 Removing stdout.* successful....
 Removing  'stderr.*' successful....
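
The merge step can be approximated by concatenating the downloaded parts in numeric order. The sketch below is not the code used by grid-stdout-mon-get: merge_parts is a hypothetical helper, it assumes parts are numbered consecutively from 0 as in the output above, and it does not read the Status file, whose format is not described on this page.

```shell
#!/bin/sh
# merge_parts DIR NAME
# Hypothetical helper: concatenates DIR/NAME.0, DIR/NAME.1, ... in numeric
# order into DIR/NAME, stopping at the first missing part number.
merge_parts() {
    dir=$1
    name=$2
    i=0
    : > "$dir/$name"    # start from an empty merged file
    while [ -f "$dir/$name.$i" ]; do
        cat "$dir/$name.$i" >> "$dir/$name"
        i=$((i + 1))
    done
}
```

With the files from the example, `merge_parts /tmp stdout` and `merge_parts /tmp stderr` would rebuild /tmp/stdout and /tmp/stderr from the downloaded parts. Counting explicitly keeps the parts in numeric order; a plain `cat stdout.*` would sort stdout.10 before stdout.2.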

Conclusion

Although the general idea of grid-stdout-mon is sound, the implementation of the three programs described above is poor. The bugs and problems are too numerous to list in full, but to name a few:

  • the programs don't work if DPM_HOST is not set,
  • merged files are not identical to the originals: lines are sometimes duplicated or contain debugging info,
  • the amount of output is wrong: grid-stdout-mon sometimes prints nothing even though it fails, while grid-stdout-mon-get prints too much,
  • documentation is poor; only the source code shows how the programs are supposed to be used,
  • only two files can be monitored,
  • WMS is not supported (EDG_WL_JOBID is used for the job ID).

Because of this, the use of grid-stdout-mon is discouraged. See gLite WMS job perusal for a better implementation of the same concept.
