Job output monitoring using grid-stdout-mon
From EGEE-see Wiki
Introduction
In addition to job perusal implemented in WMS, gLite also ships with grid-stdout-mon - an external tool capable of monitoring job output. grid-stdout-mon works by periodically copying the job output and standard error from the worker node (WN) to a predefined storage element (SE). The tool consists of three commands:
- grid-stdout-mon - periodically copies the job output from the WN to SE,
- grid-stdout-mon-on - enables monitoring for the given job ID,
- grid-stdout-mon-get - downloads output files from the SE to the UI.
grid-stdout-mon
The grid-stdout-mon command should be started as part of the user job, before the real job executable runs. It requires several environment variables to work:
- DPM_HOST - defines the SE used to store the files,
- LCG_STDOUT_MON_FLAG - should be set to ON_DEMAND for monitoring to work,
- EDG_WL_JOBID - contains the job ID of the current job.
In addition to these, the files to be monitored and the user VO are specified as command parameters. After querying the information system for the location of the VO directory on the SE, grid-stdout-mon forks into the background and wakes up periodically to upload changes in the output files to the SE. Output files are uploaded only if monitoring is enabled for the given job using grid-stdout-mon-on. An example job script using output monitoring could look like this:
    #!/bin/sh

    # set required parameters
    export DPM_HOST=egee2.irb.hr
    export LCG_STDOUT_MON_FLAG=ON_DEMAND
    export EDG_WL_JOBID

    # create monitored files
    touch stdout
    touch stderr

    # start monitoring script
    grid-stdout-mon -out stdout -err stderr -vo

    # start the real job
    ./myapp >stdout 2>stderr
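The idea of uploading only what changed since the last wake-up can be pictured with a small sketch. It is not the code of grid-stdout-mon: the function name save_new_part, the offset file and the local parts directory are all assumptions made for illustration, and the copy to the SE is deliberately left out. The sketch only shows how a growing file can be cut into numbered parts by remembering a byte offset.

```shell
#!/bin/sh
# Hypothetical helper (not part of grid-stdout-mon): save_new_part FILE PARTDIR
# Copies the bytes appended to FILE since the previous call into
# PARTDIR/<name>.<N> and remembers the state in PARTDIR/<name>.offset.
save_new_part() {
    file=$1
    partdir=$2
    name=$(basename "$file")
    state="$partdir/$name.offset"

    # Offset already saved and number of the next part; both start at 0.
    offset=0
    part=0
    if [ -f "$state" ]; then
        read -r offset part < "$state"
    fi

    # Current size in bytes (tr strips the padding some wc versions add).
    size=$(wc -c < "$file" | tr -d ' ')

    # Nothing new since the last call: do not create an empty part.
    if [ "$size" -le "$offset" ]; then
        return 0
    fi

    # tail -c +K starts at byte K (1-based); head -c caps the part at the
    # size measured above, so bytes written meanwhile go into the next part.
    tail -c +"$((offset + 1))" "$file" | head -c "$((size - offset))" \
        > "$partdir/$name.$part"
    echo "$size $((part + 1))" > "$state"
}
```

A monitoring loop would call save_new_part for each monitored file, sleep for a while, and repeat until the job finishes. Numbering the parts from 0 matches the stdout.0, stdout.1 naming that grid-stdout-mon-get prints, but the bookkeeping of the real tool may differ.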
grid-stdout-mon-on
grid-stdout-mon-on enables monitoring for the specified job by uploading a file with the list of job IDs (file_on_demand.log) to the SE (/<VOdir>/<UserDN>/ directory). grid-stdout-mon running on the WN downloads this file to check whether the output files for a specific job should be monitored. grid-stdout-mon-on is run on the UI after the job is submitted to the RB:
    $ export DPM_HOST=egee2.irb.hr
    $ grid-stdout-mon-on -d egee2.irb.hr -v dteam -j https://g01.phy.bg.ac.yu:9000/RD-BMH9RPVtKZR2pB7Akdg
    DPM-SE: egee2.irb.hr
    JobID: 0 IDRD-BMH9RPVtKZR2pB7Akdg
    File already exist on DPM. Copy it to location /tmp/file_on_demand.log
The command first checks whether file_on_demand.log exists on the SE, then downloads it, appends the requested job ID and uploads the updated file back to the SE. As can be seen from the output, only the last part of the job ID is used (protocol, host and port are stripped).
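The shortening of the job ID and the append step can be sketched in a few lines of shell. The helper names short_job_id and enable_job are hypothetical, the "ID" prefix is inferred only from the sample output, and the one-ID-per-line layout of file_on_demand.log is an assumption. The sketch works on a local copy of the file and leaves out the download and upload.

```shell
#!/bin/sh
# Hypothetical helper: short_job_id URL
# Drops everything up to the last "/" (protocol, host and port) and adds
# the "ID" prefix seen in the sample output.
short_job_id() {
    printf 'ID%s\n' "${1##*/}"
}

# Hypothetical helper: enable_job LISTFILE URL
# Appends the shortened ID to a local copy of file_on_demand.log,
# assuming one ID per line, unless it is already listed.
enable_job() {
    listfile=$1
    id=$(short_job_id "$2")
    touch "$listfile"
    if ! grep -qxF "$id" "$listfile"; then
        echo "$id" >> "$listfile"
    fi
}
```

For example, short_job_id https://g01.phy.bg.ac.yu:9000/RD-BMH9RPVtKZR2pB7Akdg prints IDRD-BMH9RPVtKZR2pB7Akdg. Checking for an existing entry before appending keeps repeated calls from growing the list.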
grid-stdout-mon-get
grid-stdout-mon-get downloads the output files from the SE to a directory on the local machine. Since the output files are stored in parts (grid-stdout-mon only uploads the part of the file that changed since the last upload), the Status file is downloaded first. It contains the list of parts and their respective sizes. Based on this information, the parts are downloaded and merged into the original files.
    $ export DPM_HOST=egee2.irb.hr
    $ grid-stdout-mon-get -s egee2.irb.hr -v dteam -j https://g01.phy.bg.ac.yu:9000/RD-BMH9RPVtKZR2pB7Akdg -d /tmp
    Selected DPM-SE: egee2.irb.hr
    JobID: IDRD-BMH9RPVtKZR2pB7Akdg
    Source: gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/
    Source: : gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/Status
    filepath is set by user: /tmp
    *** Files are successfully copied ***
    From: gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/Status
    TO: file://home.irb.hr/tmp/Status
    Source: : gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/stderr.0
    filepath is set by user: /tmp
    Destination: file://home.irb.hr/tmp/stderr.0
    *** Files are successfully copied ***
    From: gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/stderr.0
    TO: file://home.irb.hr/tmp/stderr.0
    Source: : gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/stderr.1
    filepath is set by user: /tmp
    Destination: file://home.irb.hr/tmp/stderr.1
    Source: : gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/stdout.0
    filepath is set by user: /tmp
    Destination: file://home.irb.hr/tmp/stdout.0
    *** Files are successfully copied ***
    From: gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/stdout.0
    TO: file://home.irb.hr/tmp/stdout.0
    Source: : gsiftp://egee2.irb.hr/dpm/irb.hr/home/dteam/C_HRO_eduOU_irbCN_ValentinVidic/IDRD-BMH9RPVtKZR2pB7Akdg/stdout.1
    filepath is set by user: /tmp
    Destination: file://home.irb.hr/tmp/stdout.1
    **** Start merging for stdout and stderror ****
    Reading currFile : /tmp/stdout.0
    Reading PrevFile : /tmp/stderr.0
    Reading currFile : /tmp/stdout.1
    Reading PrevFile : /tmp/stderr.1
    *** Files are successfuly merged to stdout/stderr at: /tmp
    Removing stdout.* successful....
    Removing 'stderr.*' successful....
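The merge step can be approximated by concatenating the numbered parts in order. The sketch below is only an illustration under stated assumptions: merge_parts is a hypothetical name, the parts are assumed to be numbered from 0 without gaps (as in the sample output), and the Status file is ignored because its format is not described here.

```shell
#!/bin/sh
# Hypothetical helper: merge_parts DIR NAME
# Concatenates DIR/NAME.0, DIR/NAME.1, ... in numeric order into DIR/NAME,
# removing each part once it has been appended. Stops at the first gap.
merge_parts() {
    dir=$1
    name=$2
    n=0

    # Start from an empty target file.
    : > "$dir/$name"

    # A counter is used instead of a glob so that part 10 follows part 9.
    while [ -f "$dir/$name.$n" ]; do
        cat "$dir/$name.$n" >> "$dir/$name" && rm -f "$dir/$name.$n"
        n=$((n + 1))
    done
}
```

With the files from the session above in /tmp, merge_parts /tmp stdout followed by merge_parts /tmp stderr would rebuild the two files. Unlike grid-stdout-mon-get, this sketch does not check part sizes against the Status file.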
Conclusion
Although the general idea of grid-stdout-mon is sound, the implementation of the three described programs is quite bad. Bugs and problems are too numerous to describe, but just to name a few:
- programs don't work if DPM_HOST is not set,
- merged files are not identical to the originals, sometimes lines are duplicated or contain debugging info,
- the output is poor: sometimes nothing is printed even though the program fails (grid-stdout-mon), sometimes far too much is printed (grid-stdout-mon-get),
- the documentation is poor: how the programs are supposed to be used can be learned only from the source code,
- only two files can be monitored,
- WMS is not supported (EDG_WL_JOBID is used for the job ID).
Because of this, the use of grid-stdout-mon is discouraged. See gLite WMS job perusal for a better implementation of the same concept.
