Fix gCE SAM failures

From EGEE-see WIki

The infamous PeriodicHold error, which appears very often in gCE SAM tests, can be eliminated with the simple fix described below. The fix is based on the following LCG-ROLLOUT discussion: http://listserv.cclrc.ac.uk/cgi-bin/webadmin?A1=ind0705&L=lcg-rollout#11

At AEGIS01-PHY-SCL this fix works quite well, and almost all gCE SAM failures are gone, except for a few due to SAM UI overload (Timeout when executing test gCE-sft-job after 600 seconds!) or real WMS problems (e.g. /var/glite/Sandboxdir full).

The problem is purely a configuration issue and is not caused by any middleware (MW) component. Until a solution is implemented in YAIM, the following is the simplest fix you can apply on your gCE to significantly reduce the number of SAM failures:

1) Verify that /opt/globus/etc/grid-services/jobmanager exists and that it is a symbolic link to jobmanager-fork in the same directory (i.e. /opt/globus/etc/grid-services/jobmanager-fork):

  [root@g02 root]# ls -l /opt/globus/etc/grid-services/jobmanager
  lrwxrwxrwx    1 root     root           15 Jun 29  2006 /opt/globus/etc/grid-services/jobmanager -> jobmanager-fork

2) Overwrite /opt/globus/etc/globus-job-manager.conf with the content of /opt/glite/etc/globus-job-manager.conf:

  cp /opt/glite/etc/globus-job-manager.conf /opt/globus/etc/globus-job-manager.conf

3) Restart the service:

  /etc/init.d/gLite restart
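
The first two steps above can also be scripted, which may help when the fix has to be applied on several gCE nodes. The following is only a minimal sketch: the helper names check_link and replace_conf are hypothetical and not part of gLite or Globus, and keeping a .orig backup of the old configuration file is an added precaution that the original procedure does not include.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of gLite or Globus.

# Succeed only if $1 is a symbolic link whose target's base name is $2.
check_link() {
    link=$1
    expected=$2
    if [ ! -L "$link" ]; then
        echo "$link is not a symbolic link" >&2
        return 1
    fi
    target=$(readlink "$link")
    if [ "$(basename "$target")" != "$expected" ]; then
        echo "$link points to $target, expected $expected" >&2
        return 1
    fi
    echo "OK: $link -> $target"
}

# Copy $1 over $2. If $2 already exists and no $2.orig is present,
# keep the old file as $2.orig first (an added precaution; the
# original procedure simply overwrites the file).
replace_conf() {
    src=$1
    dst=$2
    if [ ! -f "$src" ]; then
        echo "missing source file: $src" >&2
        return 1
    fi
    if [ -f "$dst" ] && [ ! -e "$dst.orig" ]; then
        cp -p "$dst" "$dst.orig" || return 1
    fi
    # Report success only if the copy worked and the files are identical.
    cp "$src" "$dst" && cmp -s "$src" "$dst"
}
```

With these helpers defined, the whole procedure could be run as root like this, stopping at the first step that fails:

  check_link /opt/globus/etc/grid-services/jobmanager jobmanager-fork &&
  replace_conf /opt/glite/etc/globus-job-manager.conf /opt/globus/etc/globus-job-manager.conf &&
  /etc/init.d/gLite restart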