GLite30

From EGEE-see WIki

Introduction

In the SEE ROC we took advantage of our representatives in PPS and installed a small certification testbed to test the installation and configuration of the gLite 3.0 release, using 5 sites in total: two from PPS, one from the Installation/Certification testbed (aka SA3), and two from the production service. HG-06-EKT, EGEE-SEE-CERT, preGR01-UOM and preGR02-UPATRAS installed a clean version of gLite 3.0, whereas GR-04-FORTH-ICS upgraded its LCG flavour services from 2.7.0 and installed an extra gLite CE. Our plan was to install an LCG CE and a gLite CE on each site, sharing the same batch system hosted on the LCG CE, and to test whether this configuration is accessible from both a WMSLB and an LCG RB.

Migration Plan

Our aim was to emulate a full testbed with all the core components that we are supposed to run in production, and to test whether doing so is feasible. Because this service is short-lived, it was designed to be independent of the core services, the GOCDB and the other secondary services installed outside the SEE federation. The aim was to install the following services on all sites:

  a. an LCG_CE + Torque or other batch system server
  b. a gLite_CE talking to the Torque server on the LCG_CE (big clusters should also consider running the Torque server independently of the LCG/gLite_CE)
  c. a gLite_SE, as the LCG equivalent no longer exists
  d. gLite_Mon
  e. gLite_WN (which are actually combined WNs according to the documentation)

Additionally, EGEE-SEE-CERT installed an LCG_RB + BDII, as this was needed to check that everything works both ways, and preGR02-UPATRAS installed the WMSLB (push mode) and a UI. In order to test our pilot testbed we ran a small number of job storms (more like drizzles), the results of which can be found here

GR-04-FORTH-ICS Notes

EGEE-SEE-CERT Notes

Issues regarding fresh gLite 3.0 installation on SL 3.0.7 using glite-yaim-3.0.0-12

SL and DHCP: Hostname issue

The symptom

You are using a dhcp server to ease your site installation/administration and hostname/IP assignment. This is the case for EGEE-SEE-CERT, for example, since installations from scratch are quite common. While configuring your SE_dpm_mysql you get errors similar to the ones below:

Setting mysql password on se01.gridctb.uoa.gr.
ERROR 1130 (00000): #HY000Host 'se01.gridctb.uoa.gr' is not allowed to connect to this MySQL server
ERROR 1130 (00000): #HY000Host 'se01.gridctb.uoa.gr' is not allowed to connect to this MySQL server

After that your node configuration script exits with a fatal error.

The cause

Both SL and the gLite middleware assume that /bin/hostname returns the FQDN of the server rather than the unqualified hostname (i.e. the system expects /bin/hostname to return se01.gridctb.uoa.gr and not se01). However, SL fails to initialize the server hostname to its FQDN when this has to be derived from the hostname and domain name assigned by a DHCP server, unless the HOSTNAME or DHCP_HOSTNAME variables have already been set to non-default values (see items 1 to 4 in the list of files below). As a consequence, when the MySQL privilege tables are initialized in the postinstall phase of mysqld, full access is granted only to root logins from localhost or from the host identified by the unqualified hostname (see items 5 and 6 below), and not from the host identified by the server's FQDN.


mysql> select Host,User,Password from user where user='root'\G
*************************** 1. row ***************************
      Host: localhost
      User: root
Password: ******encoded password*******
*************************** 2. row ***************************
      Host: se01
      User: root
Password:
2 rows in set (0.01 sec)

Therefore any later attempt by the middleware installation scripts to modify MySQL privileges (e.g. to add a new DB user) will also fail with an access denied error, since the connecting host will probably resolve to its FQDN and not to its unqualified hostname (see item 7 below).

The Solution

To get rid of this bug, and any similar ones, make sure that /bin/hostname returns the FQDN. This can be accomplished either by setting the HOSTNAME variable in /etc/sysconfig/network or by adapting your DHCP server accordingly. A better solution would be to add the "-f" argument to every call of "/bin/hostname" wherever an FQDN is expected.


  1. /etc/sysconfig/network-scripts/network-functions (see need_hostname, set_hostname)
  2. /etc/sysconfig/network-scripts/ifup-post
  3. /etc/sysconfig/network-scripts/ifup
  4. /etc/rc.d/rc.sysinit
  5. /usr/bin/mysql_install_db
  6. /usr/bin/mysql_create_system_tables
  7. /opt/glite/yaim/functions/config_DPM_mysql
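
A quick pre-flight check along the following lines can catch the hostname problem before the configuration scripts are run. This is only a minimal sketch: the function name is_fqdn is our own (it is not part of SL or gLite), and it only tests that the name has the shape of dot-separated labels, not that it resolves in DNS.

```shell
#!/bin/sh
# is_fqdn NAME: succeed if NAME looks fully qualified.
# Hypothetical helper; it checks the shape "label.label..." only.
is_fqdn() {
    case "$1" in
        ""|.*|*.|*..*) return 1 ;;   # empty, leading/trailing dot, empty label
        *.*)           return 0 ;;   # at least one dot between labels
        *)             return 1 ;;   # bare short name such as "se01"
    esac
}

# Typical use on a node, before running configure_node:
#   is_fqdn "$(/bin/hostname)" || echo "hostname is not an FQDN" >&2
```

If the check fails, set HOSTNAME in /etc/sysconfig/network (or fix the DHCP server) before installing mysqld, so that the privilege tables are created with the right host entry.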


glite-WMSLB: Failed to start glite-lb-locallogger and glite-lb-bkserverd

The symptom

Configuration of WMS/LB ends with a fatal error like the one below:

Starting glite-lb-logd ...This is LocalLogger, part of Workload
Management System in EU DataGrid.Copyright (c) 2002 CERN, INFN and
CESNET on behalf of the EU DataGrid.
[17411] Initializing...
[17411] Parse messages for correctness...[17411] yes.
[17411] Send messages also to inter-logger...[17411] yes.
[17411] Store messages with the filename prefix "/tmp/dglogd.log"...[17411] yes.
[17411] Initializing Globus common module...[17411] yes.
[17411] Failed to get GSI credentials. Exiting.  FAILED
Starting glite-lb-interlogd ...[-1218517920]   removing stale input socket /tmp/interlogger.sock
[-1218517920] Failed to load GSI credential:
edg_wll_gss_acquire_cred_gsi(): GSS Major Status: General failure  (GSS Minor Status Error Chain:
import_cred.c:199: gss_import_cred: Unable to read credential for import
globus_i_gsi_gss_utils.c:1247: globus_i_gsi_gss_cred_read_bio: Error with GSI credential
globus_gsi_credential.c:924: globus_gsi_cred_read_proxy_bio: Error reading proxy credential: Couldn't read X509 proxy cert from bio
OpenSSL Error: pem_lib.c:768: in library: PEM routines, function PEM_read_bio: bad end line)
 FAILED
[ERROR] Could not start the gLite LB Local Logger daemons
[ERROR] Please verify and re-run the script

The cause

glite-lb-logd, glite-lb-interlogd, glite-lb-notif-interlogd will fail to start whenever the certificate and the corresponding private key are passed to them as arguments (-c and -k respectively). This behavior seems to be confirmed by the following bug report although it refers to a previous release of glite (1.4.1):

bug #13988 overview: Failed to start glite-lb-locallogger on glite 1.4.1 https://savannah.cern.ch/bugs/?func=detailitem&item_id=13988

As a consequence the execution of:

/opt/glite/etc/init.d/glite-lb-locallogger and /opt/glite/etc/init.d/glite-lb-bkserverd

which call the binaries above, will also fail during the configuration of WMSLB, because the environment variables GLITE_HOST_CERT and GLITE_HOST_KEY, whenever they are defined, are passed as arguments to glite-lb-logd and glite-lb-*-interlogd.


The Solution

Besides fixing the source code of the binaries, a temporary workaround is to take advantage of the binaries' default behavior of falling back to ~UID/.globus/usercert.pem and ~UID/.globus/userkey.pem as a last resort when no arguments are provided:

stat64("/home/glite", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
stat64("/home/glite/.globus/usercert.pem", 0xbfff8130) = -1 ENOENT (No such file or directory)
stat64("/home/glite/.globus/userkey.pem", 0xbfff8130) = -1 ENOENT (No such file or directory)
stat64("/home/glite/.globus/usercred.p12", 0xbfff8130) = -1 ENOENT (No such file or directory)

The default location of the user certificate files, as defined in /opt/glite/etc/config/glite-global.cfg.xml (~glite/.certs/), does not match the built-in defaults of glite-lb-logd, glite-lb-interlogd and glite-lb-notif-interlogd (~glite/.globus).

Unfortunately, and most likely only on clean installs, the .globus directory has not been created by the time one runs the configuration scripts.

So before running the configuration script one has to:

  1. clear the "creds" variable found in glite-lb-locallogger and glite-lb-bkserverd to make sure that glite-lb-logd, glite-lb-interlogd, glite-lb-notif-interlogd are executed without the -c and -k arguments
  2. create ~glite/.globus
  3. copy the hostkey.pem and hostcert.pem to userkey.pem and usercert.pem respectively under ~glite/.globus
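
Steps 2 and 3 above could be scripted roughly as follows. This is a sketch under assumptions: the helper name install_lb_creds is hypothetical, the host certificate and key are assumed to live in /etc/grid-security, and the permission bits are our own choice rather than something the gLite documentation prescribes. Step 1 (clearing the "creds" variable in the init scripts) still has to be done by hand.

```shell
#!/bin/sh
# install_lb_creds SRC_DIR GLITE_HOME
# Copies SRC_DIR/hostcert.pem and SRC_DIR/hostkey.pem to
# GLITE_HOME/.globus/usercert.pem and userkey.pem.
# Hypothetical helper; permission bits are an assumption.
install_lb_creds() {
    src=$1
    home=$2
    mkdir -p "$home/.globus" || return 1
    cp "$src/hostcert.pem" "$home/.globus/usercert.pem" || return 1
    cp "$src/hostkey.pem"  "$home/.globus/userkey.pem"  || return 1
    chmod 644 "$home/.globus/usercert.pem"
    chmod 400 "$home/.globus/userkey.pem"
}

# On the WMSLB node one would then run, as root (paths assumed):
#   install_lb_creds /etc/grid-security ~glite
#   chown -R glite ~glite/.globus
```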

glite-CE and lcg-CE: NFS shared gridmapdir

On both glite-CE and lcg-CE, the directory /etc/grid-security/gridmapdir contains the correspondence between true Grid users and local pool accounts. Since the two CEs share the same WNs, having a separate gridmapdir on each CE can lead to security problems and other issues.

The proposed solution is to share via NFS the gridmapdir on both CEs. The exact steps are:

on the lcg-CE (ce02):

# echo "/etc/grid-security/gridmapdir ce01.gridctb.uoa.gr(rw,no_root_squash,sync)" >> /etc/exports
# chkconfig nfs on
# service nfs restart

on the glite-CE (ce01):

# rm -rf /etc/grid-security/gridmapdir/*
# echo "ce02.gridctb.uoa.gr:/etc/grid-security/gridmapdir /etc/grid-security/gridmapdir nfs defaults 0 0" >> /etc/fstab
# service netfs restart

In our case ce01 is the glite-CE and ce02 is the lcg-CE.

Now both CEs should be sharing the same gridmapdir. However, while the ownership of the gridmapdir in both systems is root:edguser, this is not the case after the NFS sharing procedure because group ids are not identical in both systems (perhaps this is a bug and should be fixed upstream). So to ensure proper permissions for the edguser on both nodes, we create a new "gridmount" group with the same GID on both CEs, and add edguser to it:

On one CE:

# chgrp 5000 /etc/grid-security/gridmapdir

On both CEs:

# groupadd -g 5000 gridmount
# gpasswd -a edguser gridmount
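
To confirm that the group really ended up with the same GID on both CEs, one can look it up in /etc/group on each node. The helper below is a small sketch (the name group_gid is ours); it assumes the group is defined in a local group file and not in NIS or LDAP.

```shell
#!/bin/sh
# group_gid NAME [FILE]: print the numeric GID of group NAME as recorded
# in FILE (default /etc/group); fail if the group is not listed.
# Hypothetical helper for comparing the two CEs.
group_gid() {
    awk -F: -v g="$1" '
        $1 == g { print $3; found = 1 }
        END     { exit (found ? 0 : 1) }
    ' "${2:-/etc/group}"
}

# On each CE:
#   group_gid gridmount      # should print 5000 on both nodes
```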

gLite-CE: Incorrect information published

In the process of testing the selected configuration (Dual CEs with Torque server and site BDII on the lcg-CE_torque) we encountered a number of bugs related to the information published by the BDII. In particular, glite-CE GRIS doesn't publish correct information for any of the GlueCEInfoLRMSVersion, GlueCEInfoTotalCPUs, GlueCEStateEstimatedResponseTime LDAP attributes.

Please find below a detailed description of the bugs and proposed solutions.

maui-client rpm missing

GIP running on the glite-CE calls the following executables in the following order: lcg-info-dynamic-scheduler-->vomaxjobs-maui-->diagnose

Unfortunately /usr/bin/diagnose doesn't exist on the glite-CE when no torque server has been installed. So the solution is to run:

# apt-get install maui-client

This should be corrected upstream in yaim: this particular rpm (maui-client) shall be included in the metapackage glite-CE, so that it is installed regardless of the torque-server.

vomaxjobs-maui bug

The /opt/lcg/libexec/vomaxjobs-maui python script calls the /usr/bin/diagnose program. When it is called with "-h somehost" parameter it should call the diagnose program with "--host=somehost " parameter. This behavior is achieved by applying the following patch:

--- /opt/lcg/libexec/vomaxjobs-maui.orig	2006-05-25 17:04:23.000000000 +0300
+++ /opt/lcg/libexec/vomaxjobs-maui	2006-05-25 22:24:38.000000000 +0300
@@ -34,7 +34,7 @@
     cmd = cmd + ' --host=' + schedhost
     
 import commands
-(stat, out) = commands.getstatusoutput('diagnose -g')
+(stat, out) = commands.getstatusoutput(cmd)
 if stat:
     print sys.argv[0] + ': Maui \'diagnose\' command' + \
           ' exited with nonzero status'

lcg-info-dynamic-scheduler configuration change

By default, when GIP on the glite-CE executes lcg-info-dynamic-scheduler, it calls vomaxjobs-maui with no parameters, which results in the diagnose command querying for a maui server on localhost.

This behaviour can be corrected in two ways. The first one (perhaps the best to be applied upstream in the yaim configure script) is to edit /var/spool/maui/maui.cfg with appropriate values, especially change SERVERHOST from localhost to the lcg-CE_torque host. The second one, and probably the simplest for the site admin to apply, is to append to the last line of /opt/lcg/etc/lcg-info-dynamic-scheduler.conf " -h YourTorqueServerHost". For example, in our case the last line becomes:

/opt/lcg/libexec/vomaxjobs-maui -h ce02.gridctb.uoa.gr
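
The second option can be applied with a small script so that re-running it does not append the option twice. This is a sketch, not a tested yaim function: the helper name add_sched_host is hypothetical, and it assumes that exactly the line calling vomaxjobs-maui needs the extra option.

```shell
#!/bin/sh
# add_sched_host FILE HOST: append " -h HOST" to the line of FILE that
# calls vomaxjobs-maui, unless that line already carries a -h option.
# Note: FILE is rewritten through a temporary copy (FILE.new).
add_sched_host() {
    file=$1
    host=$2
    sed "/vomaxjobs-maui/{/ -h /!s/[[:space:]]*\$/ -h $host/;}" "$file" \
        > "$file.new" && mv "$file.new" "$file"
}

# Example (our hostnames):
#   add_sched_host /opt/lcg/etc/lcg-info-dynamic-scheduler.conf ce02.gridctb.uoa.gr
```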

lcg-CE: maui-server & edguser privileges

After applying these workarounds, information seems to be published correctly for the glite-CE. However, looking at /opt/bdii/var/bdii.log one can see that lcg-info-dynamic-scheduler sometimes still fails.

The cause is that when this script is executed as edguser, diagnose queries the server as edguser too, so the command fails with the following error:

# su edguser
$ diagnose -g --host=ce02.hep.ntua.gr
ERROR:    'diagnose' failed
ERROR:    user 'edguser' is not authorized to execute command 'diagnose'

This one should be corrected on the server: On the lcg-CE_torque host, edit the file /var/spool/maui/maui.cfg and add edguser on the ADMIN3 line. Everything should be OK after executing "service maui restart".
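
The ADMIN3 edit can be done by hand, or with something like the sketch below. The helper name add_maui_admin3 is ours, and the script assumes that maui.cfg already contains a line starting with "ADMIN3" followed by whitespace; if no such line exists the file is left unchanged.

```shell
#!/bin/sh
# add_maui_admin3 FILE USER: add USER to the ADMIN3 line of FILE unless
# it is already listed there. Hypothetical helper.
add_maui_admin3() {
    file=$1
    user=$2
    # Nothing to do if USER already appears as a whole word on the line.
    if grep -Eq "^ADMIN3([[:space:]].*)?[[:space:]]$user([[:space:]]|\$)" "$file"; then
        return 0
    fi
    sed "/^ADMIN3[[:space:]]/s/[[:space:]]*\$/ $user/" "$file" \
        > "$file.new" && mv "$file.new" "$file"
}

# On the lcg-CE_torque host:
#   add_maui_admin3 /var/spool/maui/maui.cfg edguser && service maui restart
```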

glite-yaim-3.0.0-12: error in config_lcgenv

The following diff applies to /opt/glite/yaim/functions/config_lcgenv of glite-yaim-3.0.0-12

--- /opt/glite/yaim/functions/config_lcgenv.orig        2006-05-17 00:03:47.000000000 +0300
+++ /opt/glite/yaim/functions/config_lcgenv     2006-05-17 00:04:03.000000000 +0300
@@ -14,7 +14,7 @@
     requires VOS VO__SW_DIR SE_LIST
 fi
 
-TEMP_SE=eval `echo $SE_LIST | sed s'/^[ ]*//'`
+TEMP_SE=`echo $SE_LIST | sed s'/^[ ]*//'`
 default_se="$TEMP_SE"
 #${SE_LIST%% *}"
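
As far as we can tell, the stray "eval" makes the shell read TEMP_SE=eval as a temporary environment assignment for a command named after the first SE in the list, so TEMP_SE itself is never set in the script. The snippet below shows the corrected form in isolation; the SE names are placeholders.

```shell
#!/bin/sh
# Placeholder SE names; the leading blanks mimic a padded SE_LIST value.
SE_LIST="   se01.example.org se02.example.org"

# Corrected form from the patch: a plain command substitution.
TEMP_SE=`echo $SE_LIST | sed 's/^[ ]*//'`
default_se="$TEMP_SE"

echo "default_se=$default_se"
```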


PreGR01-UOM Notes

PreGR02-UPATRAS Notes

The full report, with job storm results, is available at EGEE%20PreGR-02-UPATRAS%20gLite%203_0%20RC5%20For%20SEE%20CERT%20-%20CSLWiki.pdf

Node Preparation

APT

cat > /etc/apt/sources.list.d/glite.list << EOF
rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/ rhel30 externals Release3.0 updates
EOF
apt-get update; apt-get dist-upgrade; apt-get upgrade

Java

rpm -ivh /nfs/yaim/j2sdk-1_4_2_08-linux-i586.rpm

CA Installation

Install CAs

cat > /etc/apt/sources.list.d/ca.list << EOF
rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG_CA/en/i386 lcg
EOF

apt-get update; apt-get install lcg-CA

YAIM Configuration

(actually after the middleware has been installed on one node ...)

The following values are not yet known, so commenting out

  • RB_HOST
  • PX_HOST
  • MON_HOST
  • FTS_HOST
  • REG_HOST

Changed From YAIM_VERSION=3.0.0-3 To YAIM_VERSION=3.0.0-11

Changed From

#LCG_REPOSITORY="'rpm http://linuxsoft.cern.ch LCG/apt/LCG-2_7_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates lcg_sl3.security' 
'rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG-2_7_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates lcg_sl3.security'"

to

LCG_REPOSITORY="'rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/ rhel30 externals Release3.0 updates'"

Had to set MON_BOX and REG_HOST to dummy values :-) (SEE does not have RGMA-Registry + Schema Server)

Had to change VO_DTEAM_DEFAULT_SE like below, in order for CSH to work on WNs:

#VO_DTEAM_DEFAULT_SE=$CLASSIC_HOST
VO_DTEAM_DEFAULT_SE=$DPM_HOST

WMS + LB

apt-get install glite-WMSLB
/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def WMSLB

BD-II

Install

apt-get install glite-BDII

Configure

/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def BDII

Change BDII_AUTO_UPDATE to NO in file /opt/bdii/etc/bdii.conf

BDII_AUTO_UPDATE=no

Manually add the GIIS URI for each site to the file /opt/bdii/etc/bdii-update.conf, like

PreGR-02-UPATRAS  ldap://grid1.csl.ee.upatras.gr:2170/mds-vo-name=PreGR-02-UPATRAS,o=grid

Restart BD-II

service bdii restart

Do a quick check

ldapsearch -x -H ldap://grid28:2170 -b o=grid

glite CE

Careful: This recipe results in a problematic info provider in the site BDII, with entries like this:

# grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam, PreGR-02-UPATRAS, grid
dn: GlueCEUniqueID=grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam,mds-vo-name=Pre
 GR-02-UPATRAS,o=grid
GlueCEStateEstimatedResponseTime: 999999

# dteam, grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam, PreGR-02-UPATRAS, grid
dn: GlueVOViewLocalID=dteam,GlueCEUniqueID=grid1.csl.ee.upatras.gr:2119/blah-p
 bs-dteam,mds-vo-name=PreGR-02-UPATRAS,o=grid
GlueCEStateEstimatedResponseTime: 0

Add BDII_site type to YAIM

cat >> /opt/glite/yaim/scripts/node-info.def << EOF
BDII_site_FUNCTIONS="config_edgusers
config_bdii"
EOF

Install yaim

apt-get install glite-yaim

Install

/opt/glite/yaim/scripts/install_node /nfs/yaim/site-info.def glite-CE glite-torque-server-config

Configure

/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def gliteCE TORQUE_server BDII_site

Changed in file /opt/glite/yaim/scripts/node-info.def to add config_torque_server

From

TORQUE_server_FUNCTIONS="
config_host_certs
config_users
config_mkgridmap
config_crl
config_torque_submitter_ssh
config_glite
config_add_glite_env"

To

TORQUE_server_FUNCTIONS="
config_host_certs
config_users
config_mkgridmap
config_crl
config_torque_submitter_ssh
config_torque_server
config_glite
config_add_glite_env"

Configure

/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def gliteCE TORQUE_server BDII_site

Run the correction as noted in the installation manual

sed -i '{s/jobmanager/blah/}' /opt/lcg/libexec/lcg-info-dynamic-scheduler

Add some lines for LCAS authorization

cat >> /etc/grid-security/grid-mapfile << EOF
/dteam/Role=lcgadmin dteamsgm
/alice/Role=lcgadmin alicesgm
/atlas/Role=lcgadmin atlassgm
/cms/Role=lcgadmin cmssgm
/lhcb/Role=lcgadmin lhcbsgm
/dteam .dteam
/dteam/* .dteam
/alice .alice
/alice/* .alice
/atlas .atlas
/atlas/* .atlas
/cms .cms
/cms/* .cms
/lhcb .lhcb
/lhcb/* .lhcb
/picard/* .picard
/picard .picard
/riker/* .riker
/riker .riker
/biomed .biomed
/biomed/* .biomed
EOF

These lines would be overwritten the next time the cron job edg-mkgridmap runs. To convince edg-mkgridmap to append these lines, put them in a file, e.g. /etc/grid-security/grid-mapfile-local and add the following line at the end of the file /opt/edg/etc/edg-mkgridmap.conf

gmf_local /etc/grid-security/grid-mapfile-local
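
To avoid adding the gmf_local line more than once when the recipe is re-run, the append can be made idempotent. A minimal sketch, with ensure_line as a hypothetical helper name:

```shell
#!/bin/sh
# ensure_line FILE LINE: append LINE to FILE unless an identical line is
# already present. Hypothetical helper; FILE is created if missing.
ensure_line() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null ||
        printf '%s\n' "$line" >> "$file"
}

# Example:
#   ensure_line /opt/edg/etc/edg-mkgridmap.conf \
#       'gmf_local /etc/grid-security/grid-mapfile-local'
```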

WN

Install yaim

apt-get install glite-yaim

Install

/opt/glite/yaim/scripts/install_node /nfs/yaim/site-info.def glite-WN glite-torque-client-config

Configure

/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def WN_torque

SE DPM

Some preparation

cat >> /etc/fstab << EOF
/dev/hdc                /dpm                    ext3    defaults        1 2
EOF

#Automount NFS share for YAIM config
cat >> /etc/auto.master << EOF
/nfs    /etc/auto.misc --timeout=300
EOF
cat >> /etc/auto.misc << EOF
yaim    -intr,rsize=8192,wsize=8192 grid10:/nfs/yaim-conf
EOF
chkconfig autofs on
service autofs start

cat > /etc/sysconfig/network << EOF
NETWORKING=yes
HOSTNAME=grid13.csl.ee.upatras.gr
GATEWAY=150.140.186.193
EOF

cat > /etc/sysconfig/network-scripts/ifcfg-eth0 << EOF
DEVICE=eth0
BOOTPROTO=static
IPADDR=150.140.186.227
NETMASK=255.255.255.192
ONBOOT=yes
TYPE=Ethernet
EOF

scp grid10:/nfs/hosts/grid13/gridsec/*.pem /etc/grid-security/

Install

apt-get install glite-SE_dpm_mysql

Configure

/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def  SE_dpm_mysql

UI

Install

apt-get install glite-UI

Configure

/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def UI

UI - HellasGrid GSI Config

Install the HellasGrid GSI rpm. You might want to visit http://pki.physics.auth.gr/hellasgrid-ca/RPMS/ to see if there is a newer version.

rpm -ivh http://pki.physics.auth.gr/hellasgrid-ca/RPMS/ca_HellasGrid-local-0.2-1.noarch.rpm

Use the tools at the following URL to generate the two necessary config files,

/etc/grid-security/globus-user-ssl.conf and 

/etc/grid-security/globus-host-ssl.conf.

http://pki.physics.auth.gr/hellasgrid-ca/util/globus2_config.html

Our files follow:

cat > /etc/grid-security/globus-user-ssl.conf << 'EOF'
RANDFILE		= $ENV::HOME/.rnd

####################################################################
[ ca ]
default_ca	= CA_default		# The default ca section

####################################################################
[ CA_default ]

dir		= ./demoCA		# Where everything is kept
certs		= $dir/certs		# Where the issued certs are kept
crl_dir		= $dir/crl		# Where the issued crl are kept
database	= $dir/index.txt	# database index file.
new_certs_dir	= $dir/newcerts		# default place for new certs.

certificate	= $dir/cacert.pem 	# The CA certificate
serial		= $dir/serial 		# The current serial number
crl		= $dir/crl.pem 		# The current CRL
private_key	= $dir/private/cakey.pem# The private key
RANDFILE	= $dir/private/.rand	# private random number file

x509_extensions	= x509v3_extensions	# The extentions to add to the cert
default_days	= 365			# how long to certify for
default_crl_days= 365 # DEE 30	# how long before next CRL
default_md	= md5			# which md to use.
preserve	= no			# keep passed DN ordering

policy		= policy_match

[ policy_match ]
countryName		= match
stateOrProvinceName	= optional
organizationName	= match
organizationalUnitName	= optional
commonName		= supplied
emailAddress		= optional

[ policy_anything ]
countryName		= optional
stateOrProvinceName	= optional
localityName		= optional
organizationName	= optional
organizationalUnitName	= optional
commonName		= supplied
emailAddress		= optional

####################################################################
[ req ]
default_bits		= 1024
default_keyfile 	= privkey.pem
distinguished_name	= req_distinguished_name
attributes		= req_attributes

[ req_distinguished_name ]
# BEGIN CONFIG
countryName                     = Country Name (2 letter code)
countryName_default             = GR
countryName_min                 = 2
countryName_max                 = 2
0.organizationName               = Level 0 Organization
0.organizationName_default       = HellasGrid
0.organizationalUnitName          = Level 0 Organizational Unit
0.organizationalUnitName_default = csl.ee.upatras.gr
commonName                      = Name (e.g., John M. Smith)
commonName_max                  = 64
# END CONFIG

[ req_attributes ]
#unstructuredName		= An optional company name

[ x509v3_extensions ]
nsCertType			= 0x40
EOF

cat > /etc/grid-security/globus-host-ssl.conf << 'EOF'
RANDFILE		= $ENV::HOME/.rnd

####################################################################
[ ca ]
default_ca	= CA_default		# The default ca section

####################################################################
[ CA_default ]

dir		= ./demoCA		# Where everything is kept
certs		= $dir/certs		# Where the issued certs are kept
crl_dir		= $dir/crl		# Where the issued crl are kept
database	= $dir/index.txt	# database index file.
new_certs_dir	= $dir/newcerts		# default place for new certs.

certificate	= $dir/cacert.pem 	# The CA certificate
serial		= $dir/serial 		# The current serial number
crl		= $dir/crl.pem 		# The current CRL
private_key	= $dir/private/cakey.pem# The private key
RANDFILE	= $dir/private/.rand	# private random number file

x509_extensions	= x509v3_extensions	# The extentions to add to the cert
default_days	= 365			# how long to certify for
default_crl_days= 365 # DEE 30	# how long before next CRL
default_md	= md5			# which md to use.
preserve	= no			# keep passed DN ordering

# A few difference way of specifying how similar the request should look
# For type CA, the listed attributes must be the same, and the optional
# and supplied fields are just that :-)
policy		= policy_match

# For the CA policy
[ policy_match ]
countryName		= match
stateOrProvinceName	= optional
organizationName	= match
organizationalUnitName	= optional
commonName		= supplied
emailAddress		= optional

# For the 'anything' policy
# At this point in time, you must list all acceptable 'object'
# types.
[ policy_anything ]
countryName		= optional
stateOrProvinceName	= optional
localityName		= optional
organizationName	= optional
organizationalUnitName	= optional
commonName		= supplied
emailAddress		= optional

####################################################################
[ req ]
default_bits		= 1024
default_keyfile 	= privkey.pem
distinguished_name	= req_distinguished_name
attributes		= req_attributes

[ req_distinguished_name ]
countryName                     = Country Name (2 letter code)
countryName_default             = GR
countryName_min                 = 2
countryName_max                 = 2
0.organizationName               = Level 0 Organization
0.organizationName_default       = HellasGrid
0.organizationalUnitName          = Level 0 Organizational Unit
0.organizationalUnitName_default = csl.ee.upatras.gr
commonName                      = Name (e.g., John M. Smith)
commonName_max                  = 64

[ req_attributes ]
#unstructuredName		= An optional company name

[ x509v3_extensions ]

nsCertType			= 0x40
EOF

LCG + gLite CE Combination

LCG CE + Torque + site BDII

Install:

apt-get install lcg-CE_torque

Configure:

/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def lcg-CE_torque

Install BLAHP Log Parser

apt-get install glite-ce-blahp

The following is a daemon

/opt/glite/bin/BLParserPBS -p 33332 -s /var/spool/pbs &

Put gLite CE into /etc/hosts.equiv

echo grid19.csl.ee.upatras.gr >> /etc/hosts.equiv

The installation guide warns about collisions in pool account mapping. These are typically avoided either by allocating separate pool account ranges to each CE or by letting the CEs share a gridmapdir.

Share gridmapdir with gLite CE using NFS

cat >> /etc/exports <<EOF
/etc/grid-security/gridmapdir grid19.csl.ee.upatras.gr(rw,sync,no_root_squash)
EOF
exportfs -r

gLite CE

/opt/glite/yaim/scripts/install_node /nfs/yaim/site-info.def glite-CE
/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def gliteCE 

Mount the gridmapdir

mount -t nfs grid1.csl.ee.upatras.gr:/etc/grid-security/gridmapdir /etc/grid-security/gridmapdir

Workaround for ERT problem

On the Torque Server node: edit the file /var/spool/maui/maui.cfg and add edguser to the line that starts with ADMIN3. The result should look like this:

ADMIN3                  edginfo rgma edguser

On the gLite CE node:

Install maui-client

apt-get install maui-client

Edit the file /opt/lcg/etc/lcg-info-dynamic-scheduler.conf and add your Torque server to the vo_max_jobs_cmd parameter, like this

vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui -h grid1.csl.ee.upatras.gr

In file /opt/lcg/libexec/vomaxjobs-maui, change the commented line to the line below:

#(stat, out) = commands.getstatusoutput('diagnose -g')
(stat, out) = commands.getstatusoutput(cmd)

Cross your fingers, wait until Aphrodite aligns with Saturn and do an

ldapsearch -x -H ldap://grid8.csl.ee.upatras.gr:2170 -b o=grid | grep Est

If you still see any 999999, then something must have gone wrong with the stars.
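
For a less astrological check, the ldapsearch output can be piped through a small filter that counts the entries still publishing the 999999 placeholder. This is a sketch; count_bad_ert is a hypothetical name, and it assumes the attribute appears on a single unwrapped line as in the entries shown earlier.

```shell
#!/bin/sh
# count_bad_ert: read LDIF on stdin and print how many entries publish
# GlueCEStateEstimatedResponseTime: 999999. Always exits 0 so the count
# can be captured safely. Hypothetical helper.
count_bad_ert() {
    grep -c '^GlueCEStateEstimatedResponseTime: 999999$' || true
}

# Example (our site BDII):
#   ldapsearch -x -H ldap://grid8.csl.ee.upatras.gr:2170 -b o=grid | count_bad_ert
```

A result of 0 means that no CE entry is stuck at the placeholder value any more.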

Changes needed on the combined WNs

Put the gLite CE hostname into /opt/edg/etc/edg-pbs-knownhosts.conf, on the line which starts with NODES = . The entries there are separated by spaces. Then run /opt/edg/sbin/edg-pbs-knownhosts, or wait for cron to run it later.

YAIM site-info.def

To remove passwords:

sed '/PASSWORD=/ s/=.*$/=<PASSWORD_REMOVED_HERE>/' /nfs/yaim/site-info.def \
| mail -s "YAIM+`date --iso-8601=minutes`" goulas@ee.upatras.gr
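
It is worth checking the redaction on sample input before mailing anything. In the sketch below (redact_passwords is a hypothetical wrapper name), the pattern =.*$ replaces everything from the first "=" to the end of the line; a pattern of =*$ would only match trailing "=" characters and would leave the secret in place.

```shell
#!/bin/sh
# redact_passwords: filter stdin, replacing the value on every line that
# contains "PASSWORD=" with a fixed marker. Hypothetical helper.
redact_passwords() {
    sed '/PASSWORD=/ s/=.*$/=<PASSWORD_REMOVED_HERE>/'
}

# Example:
#   redact_passwords < /nfs/yaim/site-info.def | less
```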

YAIM, on 2006-05-17T18:11+0300

# YAIM example site configuration file - adapt it to your site!

MY_DOMAIN=csl.ee.upatras.gr

# Node names
# Note: - SE_HOST -->  Removed, see CLASSIC_HOST, DCACHE_ADMIN, DPM_HOST below
#       - REG_HOST --> There is only 1 central registry for the time being.
CE_HOST=grid1.$MY_DOMAIN
RB_HOST=my-rb.$MY_DOMAIN
WMS_HOST=grid25.$MY_DOMAIN
PX_HOST=px01.lip.pt
BDII_HOST=grid28.$MY_DOMAIN
MON_HOST=grid19.$MY_DOMAIN
#FTS_HOST=my-fts.$MY_DOMAIN
#REG_HOST=lcgic01.gridpp.rl.ac.uk	
REG_HOST=my_reg_host.$MY_DOMAIN
GLITECE_HOST=grid19.csl.ee.upatras.gr

# VO-BOX - Set this if you are building a VO-BOX 
#VOBOX_HOST=my-vobox.$MY_DOMAIN
#VOBOX_PORT=1975

# Set this to "yes" your site provides an X509toKERBEROS Authentication Server 
# Only for sites with Experiment Software Area under AFS 
GSSKLOG=no
GSSKLOG_SERVER=my-gssklog.$MY_DOMAIN

# LFC - Set these if you are installing an LFC
#LFC_HOST=my-lfc.$MY_DOMAIN
#LFC_DB_PASSWORD=<PASSWORD_REMOVED_HERE>

# These are set to default to using the standard database on the same hosts
# as the LFC daemon is on
LFC_DB_HOST=$LFC_HOST
LFC_DB=cns_db

# All catalogues are local unless you add a VO to 
# LFC_CENTRAL, in which case that will be central
LFC_CENTRAL=""

# If you want to limit the VOs your LFC serves, add the locals here
LFC_LOCAL=""

# TORQUE - Change this if your torque server is not on the CE
# it's ingored for other batch systems
TORQUE_SERVER=$CE_HOST

# These variables tell YAIM where to find additional configuration files.
WN_LIST=/nfs/yaim/wn-list.conf
USERS_CONF=/nfs/yaim/users.conf
GROUPS_CONF=/nfs/yaim/groups.conf
FUNCTIONS_DIR=/opt/glite/yaim/functions
YAIM_VERSION=3.0.0-11

# Repository settings 
#LCG_REPOSITORY="'rpm http://linuxsoft.cern.ch LCG/apt/LCG-2_7_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates lcg_sl3.security'\
 'rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG-2_7_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates lcg_sl3.security'"
LCG_REPOSITORY="'rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/ rhel30 externals Release3.0 updates'"
CA_REPOSITORY="rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG_CA/en/i386 lcg"
REPOSITORY_TYPE="apt" # or "yum"

# For the relocatable (tarball) distribution, ensure
# that INSTALL_ROOT is set correctly
INSTALL_ROOT=/opt

# You will probably want to change these too for the relocatable dist
OUTPUT_STORAGE=/tmp/jobOutput
JAVA_LOCATION="/usr/java/j2sdk1.4.2_08"

# Set this to '/dev/null' or some other dir if you want
# to turn off yaim installation of cron jobs
CRON_DIR=/etc/cron.d

# Set this to your prefered and firewall allowed port range
GLOBUS_TCP_PORT_RANGE="20000 25000"

# Choose a good password ! And be sure that this file cannot be read by
# any grid job !
MYSQL_PASSWORD=<PASSWORD_REMOVED_HERE>
APEL_DB_PASSWORD=<PASSWORD_REMOVED_HERE>


# GRID_TRUSTED_BROKERS: DNs of services (RBs) allowed to renew/retrives 
# credentials from/at the myproxy server. Put single quotes around each trusted DN !!! 

GRID_TRUSTED_BROKERS="
'broker one'
'broker two'
"

# The RB now uses the DLI by default; set VOs here which should use RLS
RB_RLS="" # "atlas cms"

# Space separated list of ldap servers in edg-mkgridmap.conf which authenticate users.
# Ex.: GRIDMAP_AUTH="ldap://lcg-registrar.cern.ch/ou=users,o=registrar,dc=lcg,dc=org ldap://xyz"
GRIDMAP_AUTH="ldap://lcg-registrar.cern.ch/ou=users,o=registrar,dc=lcg,dc=org"

# GridIce server host name (usually run on the MON node).
GRIDICE_SERVER_HOST=$MON_HOST

# Site-wide settings 
SITE_EMAIL=egee-grid@ee.upatras.gr
SITE_NAME=PreGR-02-UPATRAS
SITE_LOC="Patras, Greece"
SITE_LAT=38.2368 # -90 to 90 degrees
SITE_LONG=21.7341 # -180 to 180 degrees
SITE_WEB="http://www.csl.ee.upatras.gr"
SITE_TIER="TIER 2"
SITE_SUPPORT_SITE="my-bigger-site.their_domain"

# Jobmanager specific settings
JOB_MANAGER=lcgpbs
CE_BATCH_SYS=torque
BATCH_BIN_DIR=/usr/bin
BATCH_VERSION=torque-1.0.1b
BATCH_LOG_DIR=/var/spool/pbs/server_priv/accounting

# Architecture and enviroment specific settings
CE_CPU_MODEL=PIII
CE_CPU_VENDOR=intel
CE_CPU_SPEED=1001
CE_OS="Scientific Linux SL"
CE_OS_RELEASE="SL"
CE_OS_VERSION=3.0.3
CE_MINPHYSMEM=513
CE_MINVIRTMEM=1025
CE_SMPSIZE=1
CE_SI00=381
CE_SF00=0
CE_OUTBOUNDIP=TRUE
CE_INBOUNDIP=FALSE
CE_RUNTIMEENV="
    LCG-2
    LCG-2_1_0
    LCG-2_1_1
    LCG-2_2_0
    LCG-2_3_0
    LCG-2_3_1
    LCG-2_4_0
    LCG-2_5_0
    LCG-2_6_0
    LCG-2_7_0
    GLITE-3_0_0
    R-GMA
"
# Set this if your WNs have a shared directory for temporary storage
CE_DATADIR=""

# Classic SE
CLASSIC_HOST="classic SE host"
CLASSIC_STORAGE_DIR="/storage"

# dCache-specific settings
# ignore if you are not running d-cache

# Your dcache admin node
DCACHE_ADMIN=""
DCACHE_POOLS="my-pool-node1:/pool-path1 my-pool-node2:/pool-path2"

# Optional
# DCACHE_PORT_RANGE="20000,25000"
# DCACHE_DOOR_SRM="door_node1[:port]"
# DCACHE_DOOR_GSIFTP="door_node1[:port] door_node2[:port]"
# DCACHE_DOOR_GSIDCAP="door_node1[:port] door_node2[:port]"
# DCACHE_DOOR_DCAP="door_node1[:port] door_node2[:port]"


# Set to "yes" only if YAIM shall reset the dCache configuration,
# i.e. if you want YAIM to configure dCache - WARNING:
# this may wipe out any dCache parameters previously configured!
# DCACHE_PORT_RANGE="20000,25000"
# RESET_DCACHE_CONFIGURATION=no

# Set to "yes" only if YAIM shall reset the dCache nameserver,
# i.e. if you want YAIM to clear the content of dCache - WARNING:
# this may wipe out any dCache files previously stored!
# RESET_DCACHE_PNFS=no

# Set to "yes" only if YAIM shall reset the dCache Databases,
# i.e. if you want YAIM to clear the metadata of dCache - WARNING:
# this may wipe out any dCache file names previously stored,
# leaving your system without any way to re-establish which files
# are stored.
# RESET_DCACHE_RDBMS=no


#
# SE_dpm-specific settings - Ignore if you are not running a DPM
#
# Set these if you are installing a DPM yourself
# and/or if you need a default DPM for the lcg-stdout-mon
#
# DPMDATA is now deprecated. Use an entry like $DPM_HOST:/filesystem in
# the DPM_FILESYSTEMS variable.
# From now on we use DPM_DB_USER and DPM_DB_PASSWORD to make clear
# that their role is different from that of the dpmmgr unix user who owns the
# directories and runs the daemons.


# The name of the DPM head node 
DPM_HOST=grid13.$MY_DOMAIN   # my-dpm.$MY_DOMAIN

# The DPM pool name
DPMPOOL=UPATRASDPM

# The filesystems/partitions parts of the pool
#DPM_FILESYSTEMS="$DPM_HOST:/path1 my-dpm-poolnode.$MY_DOMAIN:/path2"
DPM_FILESYSTEMS="$DPM_HOST:/dpm my-dpm-poolnode.$MY_DOMAIN:/path2"

# The database user
DPM_DB_USER=dpmuser

# The database user password
DPM_DB_PASSWORD=dpm_pass_pps_egee=<PASSWORD_REMOVED_HERE>

# The DPM database host
DPM_DB_HOST=$DPM_HOST

# Specifies the default amount of space reserved  for a file
DPMFSIZE=200M

# Variable for the port range  - Optional, default value is shown
# RFIO_PORT_RANGE="20000 25000" 


# This largely replaces CE_CLOSE_SE but it is a list of hostnames
SE_LIST="$CLASSIC_HOST $DPM_HOST $DCACHE_ADMIN"
SE_ARCH="multidisk" # "disk, tape, multidisk, other"

FTS_SERVER_URL="https://fts.${MY_DOMAIN}:8443/path/glite-data-transfer-fts"

# BDII/GIP specific settings
BDII_HTTP_URL="http://grid-deployment.web.cern.ch/grid-deployment/gis/lcg2-bdii/dteam/lcg2-all-sites.conf"
# Set this to use FCR
#BDII_FCR="http://goc.grid-support.ac.uk/gridsite/bdii/BDII/www/bdii-update.ldif"
# Ex.: BDII_REGIONS="CE SE RB PX VOBOX"
BDII_REGIONS="CE DPM GLITECE"	# list of the services provided by the site
BDII_CE_URL="ldap://$CE_HOST:2135/mds-vo-name=local,o=grid"
BDII_GLITECE_URL="ldap://$GLITECE_HOST:2170/mds-vo-name=resource,o=grid"
BDII_SE_URL="ldap://$CLASSIC_HOST:2135/mds-vo-name=local,o=grid"
BDII_DPM_URL="ldap://$DPM_HOST:2135/mds-vo-name=local,o=grid"
BDII_RB_URL="ldap://$RB_HOST:2135/mds-vo-name=local,o=grid"
BDII_PX_URL="ldap://$PX_HOST:2135/mds-vo-name=local,o=grid"
BDII_LFC_URL="ldap://$LFC_HOST:2135/mds-vo-name=local,o=grid"
BDII_VOBOX_URL="ldap://$VOBOX_HOST:2135/mds-vo-name=local,o=grid"
BDII_FTS_URL="ldap://$FTS_HOST:2170/mds-vo-name=resource,o=grid"

# Use this to set your contact string. 
# Ex.: BDII_BIND="mds-vo-name=mystorage,o=grid"


# E2EMONIT specific settings
# This specifies the location to download the host specific configuration file
E2EMONIT_LOCATION=grid-deployment.web.cern.ch/grid-deployment/e2emonit/production

#
# Replace this with the siteid supplied by the person setting up the networking 
# topology.
E2EMONIT_SITEID=my.siteid

# VOS="atlas alice lhcb cms dteam biomed"
# Space separated list of supported VOs by your site
VOS="dteam"	
QUEUES=${VOS}
VO_SW_DIR=/opt/exp_soft

# Set this if you want a scratch directory for jobs
EDG_WL_SCRATCH=""

# VO specific settings. For help see: https://lcg-sft.cern.ch/yaimtool/yaimtool.py
VO_ATLAS_SW_DIR=$VO_SW_DIR/atlas
VO_ATLAS_DEFAULT_SE=$CLASSIC_HOST
VO_ATLAS_STORAGE_DIR=$CLASSIC_STORAGE_DIR/atlas
VO_ATLAS_QUEUES="atlas"

VO_ATLAS_SGM=ldap://grid-vo.nikhef.nl/ou=lcgadmin,o=atlas,dc=eu-datagrid,dc=org
VO_ATLAS_USERS=ldap://grid-vo.nikhef.nl/ou=lcg1,o=atlas,dc=eu-datagrid,dc=org
VO_ATLAS_VOMS_POOL_PATH="/lcg1"
VO_ATLAS_VOMS_SERVERS="'vomss://lcg-voms.cern.ch:8443/voms/atlas?/atlas/' 'vomss://voms.cern.ch:8443/voms/atlas?/atlas/'"
#VO_ATLAS_VOMS_EXTRA_MAPS="'Role=production production' 'usatlas .usatlas'"
VO_ATLAS_VOMSES="'atlas lcg-voms.cern.ch 15001 /C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch atlas' 'atlas voms.cern.ch 15001 /C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch atlas'"


VO_ALICE_SW_DIR=$VO_SW_DIR/alice
VO_ALICE_DEFAULT_SE=$CLASSIC_HOST
VO_ALICE_STORAGE_DIR=$CLASSIC_STORAGE_DIR/alice
VO_ALICE_QUEUES="alice"

VO_ALICE_SGM=ldap://grid-vo.nikhef.nl/ou=lcgadmin,o=alice,dc=eu-datagrid,dc=org
VO_ALICE_USERS=ldap://grid-vo.nikhef.nl/ou=lcg1,o=alice,dc=eu-datagrid,dc=org
VO_ALICE_VOMS_SERVERS="'vomss://lcg-voms.cern.ch:8443/voms/alice?/alice/' 'vomss://voms.cern.ch:8443/voms/alice?/alice/'"
VO_ALICE_VOMSES="'alice lcg-voms.cern.ch 15000 /C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch alice' 'alice voms.cern.ch 15000 /C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch alice'"


VO_CMS_SW_DIR=$VO_SW_DIR/cms
VO_CMS_DEFAULT_SE=$CLASSIC_HOST
VO_CMS_STORAGE_DIR=$CLASSIC_STORAGE_DIR/cms
VO_CMS_QUEUES="cms"

VO_CMS_SGM=ldap://grid-vo.nikhef.nl/ou=lcgadmin,o=cms,dc=eu-datagrid,dc=org
VO_CMS_USERS=ldap://grid-vo.nikhef.nl/ou=lcg1,o=cms,dc=eu-datagrid,dc=org
VO_CMS_VOMS_SERVERS="'vomss://lcg-voms.cern.ch:8443/voms/cms?/cms/' 'vomss://voms.cern.ch:8443/voms/cms?/cms/'"
VO_CMS_VOMSES="'cms lcg-voms.cern.ch 15002 /C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch cms' 'cms voms.cern.ch 15002 /C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch cms'"


VO_LHCB_SW_DIR=$VO_SW_DIR/lhcb
VO_LHCB_DEFAULT_SE=$CLASSIC_HOST
VO_LHCB_STORAGE_DIR=$CLASSIC_STORAGE_DIR/lhcb
VO_LHCB_QUEUES="lhcb"

VO_LHCB_SGM=ldap://grid-vo.nikhef.nl/ou=lcgadmin,o=lhcb,dc=eu-datagrid,dc=org
VO_LHCB_USERS=ldap://grid-vo.nikhef.nl/ou=lcg1,o=lhcb,dc=eu-datagrid,dc=org
VO_LHCB_VOMS_SERVERS="'vomss://lcg-voms.cern.ch:8443/voms/lhcb?/lhcb/' 'vomss://voms.cern.ch:8443/voms/lhcb?/lhcb/'"
VO_LHCB_VOMS_EXTRA_MAPS="lcgprod lhcbprod"
VO_LHCB_VOMSES="'lhcb lcg-voms.cern.ch 15003 /C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch lhcb' 'lhcb voms.cern.ch 15003 /C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch lhcb'"

VO_DTEAM_SW_DIR=$VO_SW_DIR/dteam
#VO_DTEAM_DEFAULT_SE=$CLASSIC_HOST
VO_DTEAM_DEFAULT_SE=$DPM_HOST
VO_DTEAM_STORAGE_DIR=$CLASSIC_STORAGE_DIR/dteam
VO_DTEAM_QUEUES="dteam"

VO_DTEAM_SGM=ldap://lcg-vo.cern.ch/ou=lcgadmin,o=dteam,dc=lcg,dc=org
VO_DTEAM_USERS=ldap://lcg-vo.cern.ch/ou=lcg1,o=dteam,dc=lcg,dc=org
VO_DTEAM_VOMS_SERVERS="'vomss://lcg-voms.cern.ch:8443/voms/dteam?/dteam/' 'vomss://voms.cern.ch:8443/voms/dteam?/dteam/'"
VO_DTEAM_VOMSES="'dteam lcg-voms.cern.ch 15004 /C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch dteam' 'dteam voms.cern.ch 15004 /C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch dteam'"

VO_BIOMED_SW_DIR=$VO_SW_DIR/biomed
VO_BIOMED_DEFAULT_SE=$CLASSIC_HOST
VO_BIOMED_STORAGE_DIR=$CLASSIC_STORAGE_DIR/biomed
VO_BIOMED_QUEUES="biomed"
VO_BIOMED_USERS=ldap://vo-biome.in2p3.fr/ou=lcg1,o=biomedical,dc=lcg,dc=org
VO_BIOMED_SGM=ldap://vo-biome.in2p3.fr/ou=lcgadmin,o=biomedical,dc=lcg,dc=org
VO_BIOMED_VOMSES="biomed cclcgvomsli01.in2p3.fr 15000 /O=GRID-FR/C=FR/O=CNRS/OU=CC-LYON/CN=cclcgvomsli01.in2p3.fr/Email=sysunix@cc.in2p3.fr biomed"

Problems

gLite publishes default values for ERTs

The information providers do not run. The erratic behavior observed is shown below:

# grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam, PreGR-02-UPATRAS, grid
dn: GlueCEUniqueID=grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam,mds-vo-name=Pre
 GR-02-UPATRAS,o=grid
GlueCEStateEstimatedResponseTime: 999999

# dteam, grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam, PreGR-02-UPATRAS, grid
dn: GlueVOViewLocalID=dteam,GlueCEUniqueID=grid1.csl.ee.upatras.gr:2119/blah-p
 bs-dteam,mds-vo-name=PreGR-02-UPATRAS,o=grid
GlueCEStateEstimatedResponseTime: 0

The BDII log file /opt/bdii/var/bdii.log has the following relevant entries:

2006-05-17 14:19:30 lcg-info-dynamic-scheduler: VO max jobs backend command returned nonzero exit status
2006-05-17 14:19:30 lcg-info-dynamic-scheduler: Exiting without output, GIP will use static values

The above behavior is observed both with Torque installed on the gLite CE and with Torque on a separate node. The result is that a gLite CE is never chosen by a WMS when LCG CEs also match the job.
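
To spot affected CEs quickly, the published values can be filtered from the BDII output. The following is a minimal sketch, not a tested site tool: it assumes, based on the output above, that 999999 is the static placeholder value, and the function name count_default_ert is hypothetical. It reads LDIF text on standard input and prints how many entries still carry the placeholder.

```shell
# Count Glue entries that still publish the placeholder ERT.
# Assumption: 999999 is the static default, as seen in the output above.
count_default_ert() {
    awk -F': ' '
        $1 == "GlueCEStateEstimatedResponseTime" && $2 == 999999 { n++ }
        END { print n + 0 }
    '
}
```

It could be fed from a query such as ldapsearch -x -H ldap://<host>:<port> -b 'mds-vo-name=PreGR-02-UPATRAS,o=grid' piped into count_default_ert, where host and port are those of the BDII being checked.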

gStat does not show gLite CE in the 'Service Check' Section

The title says it all: the gLite CE is missing from gStat's 'Service Check' section.

gStat counts double CPUs

gStat is unaware that the two CEs share the same LRMS, and thus the same Worker Nodes, so it counts the CPUs twice.

BLParserPBS should be a service

The manual says to run BLParserPBS. This is actually a daemon which never returns, so it should have a service wrapper in /etc/rc.d so that it is restarted at boot time.
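
Until such a wrapper is shipped, the core of one can be sketched as a few shell functions. This is only an illustration under stated assumptions: the binary path, the pid file location and the function names are all hypothetical, and whatever arguments the manual prescribes for BLParserPBS are passed through unchanged rather than guessed here.

```shell
# Assumed locations; adjust to the actual installation.
BLPARSER_BIN="${BLPARSER_BIN:-${GLITE_LOCATION:-/opt/glite}/bin/BLParserPBS}"
BLPARSER_PIDFILE="${BLPARSER_PIDFILE:-/var/run/blparser.pid}"

# Start the daemon in the background and remember its PID.
# Any arguments are handed to the daemon as-is.
blparser_start() {
    nohup "$BLPARSER_BIN" "$@" >/dev/null 2>&1 &
    echo $! > "$BLPARSER_PIDFILE"
}

# Succeeds only if the recorded PID is still alive.
blparser_status() {
    [ -f "$BLPARSER_PIDFILE" ] && kill -0 "$(cat "$BLPARSER_PIDFILE")" 2>/dev/null
}

# Stop the daemon and remove the pid file.
blparser_stop() {
    [ -f "$BLPARSER_PIDFILE" ] && kill "$(cat "$BLPARSER_PIDFILE")" && rm -f "$BLPARSER_PIDFILE"
}
```

A real init script would wrap these in the usual start/stop/status case statement and register it for the appropriate runlevels.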

CSH problems

This error appears on (at least) UIs and WNs. When there is no classic SE, the default left in YAIM is

CLASSIC_HOST="classic SE host"

When this happens, a line like this (for dteam) appears in /etc/profile.d/lcgenv.csh for each supported VO:

setenv VO_DTEAM_DEFAULT_SE classic SE host

This is not valid syntax and causes CSH to fail to start. The open question is whether it should be corrected to DPM_HOST or to something else. Note that the line also has to be moved a couple of lines further down, since DPM_HOST is populated later.

setenv VO_DTEAM_DEFAULT_SE $DPM_HOST

The same error is also present in /etc/profile.d/lcgenv.sh, but there it does not cause the shell to break.

export VO_DTEAM_DEFAULT_SE=classic SE host

Again, I am not sure if the correction is appropriate. Note that it also has to be moved a couple of lines below, since DPM_HOST is populated later.

export VO_DTEAM_DEFAULT_SE=$DPM_HOST
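
Why the Bourne-shell file does not break can be illustrated with a small sketch (the function names are hypothetical). The shell treats the extra words as further variable names to export, so the variable silently ends up holding only the first word. Quoting removes the syntax problem, but the value is still a placeholder and not a real host name.

```shell
# Each demonstration runs in a subshell so the caller's environment is untouched.

# Same shape as the line written to lcgenv.sh: the value is unquoted.
show_unquoted_export() {
    (
        export VO_DTEAM_DEFAULT_SE=classic SE host
        printf '%s\n' "$VO_DTEAM_DEFAULT_SE"   # prints: classic
    )
}

# With quotes the whole placeholder string is kept.
show_quoted_export() {
    (
        export VO_DTEAM_DEFAULT_SE="classic SE host"
        printf '%s\n' "$VO_DTEAM_DEFAULT_SE"   # prints: classic SE host
    )
}
```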

WMS: Invalid proxy cache

I get the following e-mail from the WMS:

Subject: Cron <root@grid25> . /etc/glite/profile.d/glite_setenv.sh ; $GLITE_LOCATION/bin/glite_wms_wmproxy_purge_proxycache $GLITE_LOCATION_VAR/proxycache > $GLITE_LOCATION_LOG/glite_wms_wmproxy_purge_proxycache.log

Error - invalid proxy cache path (invalid directory (/var/glite/proxycache): boost::filesystem::is_directory: "/var/glite/proxycache": No such file or directory)

The directory referred to here does not exist. From a quick look at YAIM, I think that it creates $GLITE_LOCATION_TMP/proxycache and not $GLITE_LOCATION_VAR/proxycache.

See also [savannah #16959]
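
A possible stopgap that silences the cron error is to create the missing directory. The sketch below is an assumption-laden workaround, not the fix from the bug report: the function name is hypothetical, the fallback /var/glite is taken from the error message, and ownership and permissions are deliberately left untouched because the correct values are not known here. It may equally be that the cron job should point at $GLITE_LOCATION_TMP/proxycache instead.

```shell
# Create the proxy cache directory the purge job expects, if it is missing.
# Default: $GLITE_LOCATION_VAR/proxycache, falling back to /var/glite
# (the path from the error message) when the variable is unset.
ensure_proxycache() {
    dir="${1:-${GLITE_LOCATION_VAR:-/var/glite}/proxycache}"
    if [ ! -d "$dir" ]; then
        mkdir -p "$dir" || return 1
    fi
    # Report the directory that now exists.
    printf '%s\n' "$dir"
}
```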

Usage of CLASSIC_HOST

A host of settings are dependent on yaim parameter $CLASSIC_HOST, which defaults to CLASSIC_HOST="classic SE host".

I am not sure how safe it is to set the following in YAIM: CLASSIC_HOST=$DPM_HOST


A small percentage of jobs fail with Job proxy is expired

In job storms, 1-5 jobs out of 100 fail. Job status returns the message Job proxy is expired.

Logging info for these jobs returns the following messages:

Got a job held event, reason: Globus error 131: the user proxy expired (job is still running)

Job got an error while in the CondorG queue.

Ben Gurion University Notes

AEGIS01-PHY-SCL Notes

YAIM's config_lcgenv fix

Before running configure_node, it is necessary to fix YAIM's config_lcgenv function, as explained in several bugs in Savannah, e.g. http://savannah.cern.ch/bugs/?func=detailitem&item_id=16895

HG-06-EKT notes

Generally the installation of LCG-CE, MON, SE_dcache and WN_torque nodes was completed without any problems. However, in the site-info.def file that was given as an example there are a couple of points that need attention:

  • The LCG_REPOSITORY parameter points to the LCG-2.7.0 repository. Obviously it should be changed to
LCG_REPOSITORY="'rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/ rhel30 externals Release3.0 updates'"
  • The various VO_DTEAM_DEFAULT_SE, VO_LHCB_DEFAULT_SE etc. parameters are defined as
VO_DTEAM_DEFAULT_SE=$CLASSIC_HOST

That is OK if you have a Classic SE, but if you have a DPM or a dCache SE they should be changed to $DPM_HOST or to $DCACHE_ADMIN. Otherwise you might have trouble with /etc/profile.d/lcgenv.{sh,csh} and with the value published for GlueCEInfoDefaultSE by your site GIIS.
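
The change described above can be previewed mechanically before editing the file by hand. This is a minimal sketch with a hypothetical function name; it assumes the per-VO lines have exactly the form shown above, leaves commented-out lines alone, and writes the result to standard output instead of modifying site-info.def in place.

```shell
# Print a copy of a site-info.def file in which every active
# VO_<NAME>_DEFAULT_SE=$CLASSIC_HOST line points at another variable.
#   $1: path to site-info.def
#   $2: replacement variable name, e.g. DPM_HOST or DCACHE_ADMIN
retarget_default_se() {
    sed -E 's/^(VO_[A-Z0-9_]+_DEFAULT_SE=)\$CLASSIC_HOST[[:space:]]*$/\1$'"$2"'/' "$1"
}
```

For example, retarget_default_se site-info.def DPM_HOST | diff site-info.def - shows exactly which lines would change.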


Summary

Even though the final job storm was 80% successful (the remaining 20% failed because of expired proxies), the tweaking needed to make this configuration scheme work is far too complex and unreliable to deploy in a production environment. Thus in SEE ROC we decided to upgrade all our clusters to the LCG flavour only of gLite 3.0, and to proceed with the deployment of additional gLite CEs as soon as they are stable enough.
