GLite30
From EGEE-see Wiki
Introduction
In the SEE ROC we took advantage of our representatives in PPS and set up a small certification testbed to test the installation and configuration of the gLite 3.0 release. It used five sites in total: two from PPS, one from the Installation / Certification testbed (aka SA3), and two from the production service. HG-06-EKT, EGEE-SEE-CERT, preGR01-UOM and preGR02-UPATRAS installed a clean version of gLite 3.0, whereas GR-04-FORTH-ICS upgraded its LCG-flavour services from 2.7.0 and installed an extra gLite CE. Our plan was to install, on each site, an LCG CE and a gLite CE sharing the same batch system hosted on the LCG CE, and to test whether this configuration is accessible both from a WMSLB and from an LCG RB.
Migration Plan
Our aim was to emulate a full testbed with all the core components that we are supposed to run in production, and to test whether it is feasible to do so. As this service is short-lived, it was designed to be independent of the core services, the GOCDB and the other secondary services installed outside the SEE federation. The aim was to install the following services on all sites:
a. an LCG_CE + Torque or other batch system server
b. a gLite_CE talking to the Torque server on the LCG_CE (big clusters should also consider running the Torque server independently of the LCG/gLite_CE)
c. a gLite_SE, as the LCG equivalent no longer exists
d. gLite_Mon
e. gLite_WN (which are actually combined WNs according to the documentation)
Additionally, EGEE-SEE-CERT installed an LCG_RB + BDII, as this was needed to check that everything works both ways, and preGR02-UPATRAS installed the newer WMSLB (push mode) and a UI. In order to test our pilot testbed we ran a small number of job storms (more like drizzles), the results of which can be found here
GR-04-FORTH-ICS Notes
EGEE-SEE-CERT Notes
Issues regarding fresh gLite 3.0 installation on SL 3.0.7 using glite-yaim-3.0.0-12
SL and DHCP: Hostname issue
The symptom
You are using a dhcp server to ease your site installation/administration and hostname/IP assignment. This is the case for EGEE-SEE-CERT, for example, since installations from scratch are quite common. While configuring your SE_dpm_mysql you get errors similar to the ones below:
Setting mysql password on se01.gridctb.uoa.gr.
ERROR 1130 (00000): #HY000Host 'se01.gridctb.uoa.gr' is not allowed to connect to this MySQL server
ERROR 1130 (00000): #HY000Host 'se01.gridctb.uoa.gr' is not allowed to connect to this MySQL server
After that your node configuration script exits with a fatal error.
The cause
Both SL and the gLite middleware assume that /bin/hostname returns the FQDN of the server rather than the unqualified hostname (i.e. the system expects /bin/hostname to return se01.gridctb.uoa.gr and not se01). However, when the hostname and domain name are assigned by a DHCP server, SL fails to initialize the server hostname to its FQDN unless the HOSTNAME or DHCP_HOSTNAME variable has already been set to a non-default value (see 1.1, 1.2, 1.3, 1.4 below). As a consequence, when the MySQL privilege tables are initialized in the mysqld post-install phase, full access is granted only to root logins from localhost or from the host identified by the unqualified hostname (see 1.5, 1.6 below), and not from the host identified by the server's FQDN.
mysql> select Host,User,Password from user where user='root'\G
*************************** 1. row ***************************
Host: localhost
User: root
Password: ******encoded password*******
*************************** 2. row ***************************
Host: se01
User: root
Password:
2 rows in set (0.01 sec)
Therefore any later attempt by the middleware installation scripts to modify MySQL privileges (e.g. to add a new DB user) will also fail with an access-denied error, since the connecting host will presumably resolve to its FQDN and not to its unqualified hostname (see 1.7 below).
The Solution
To get rid of this bug, and any similar ones, make sure that /bin/hostname returns the FQDN. This can be accomplished either by setting the HOSTNAME variable in /etc/sysconfig/network or by adapting your DHCP server accordingly. A better solution would be to add the "-f" argument to every call of "/bin/hostname" where an FQDN is expected.
- /etc/sysconfig/network-scripts/network-functions (see need_hostname, set_hostname)
- /etc/sysconfig/network-scripts/ifup-post
- /etc/sysconfig/network-scripts/ifup
- /etc/rc.d/rc.sysinit
- /usr/bin/mysql_install_db
- /usr/bin/mysql_create_system_tables
- /opt/glite/yaim/functions/config_DPM_mysql
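A minimal pre-flight sketch for the check described above, to be run before configure_node. The helper names is_fqdn and check_hostname are hypothetical, and "contains a dot" is only a heuristic for "fully qualified":

```shell
# is_fqdn NAME: succeed if NAME contains a dot, i.e. looks fully qualified.
is_fqdn() {
    case "$1" in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}

# check_hostname: warn if /bin/hostname returns an unqualified name.
check_hostname() {
    name=$(/bin/hostname)
    if is_fqdn "$name"; then
        echo "OK: /bin/hostname returns $name"
    else
        echo "WARNING: /bin/hostname returns '$name'; set HOSTNAME in /etc/sysconfig/network" >&2
        return 1
    fi
}
```

Calling check_hostname on the SE before configuring SE_dpm_mysql should flag the situation that leads to the MySQL error shown earlier.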
glite-WMSLB: Failed to start glite-lb-locallogger and glite-lb-bkserverd
The symptom
Configuration of WMS/LB ends with a fatal error like the one below:
Starting glite-lb-logd ...This is LocalLogger, part of Workload Management System in EU DataGrid.
Copyright (c) 2002 CERN, INFN and CESNET on behalf of the EU DataGrid.
[17411] Initializing...
[17411] Parse messages for correctness...[17411] yes.
[17411] Send messages also to inter-logger...[17411] yes.
[17411] Store messages with the filename prefix "/tmp/dglogd.log"...[17411] yes.
[17411] Initializing Globus common module...[17411] yes.
[17411] Failed to get GSI credentials. Exiting.
FAILED
Starting glite-lb-interlogd ...[-1218517920] removing stale input socket /tmp/interlogger.sock
[-1218517920] Failed to load GSI credential: edg_wll_gss_acquire_cred_gsi(): GSS Major Status: General failure
(GSS Minor Status Error Chain:
import_cred.c:199: gss_import_cred: Unable to read credential for import
globus_i_gsi_gss_utils.c:1247: globus_i_gsi_gss_cred_read_bio: Error with GSI credential
globus_gsi_credential.c:924: globus_gsi_cred_read_proxy_bio: Error reading proxy credential: Couldn't read X509 proxy cert from bio
OpenSSL Error: pem_lib.c:768: in library: PEM routines, function PEM_read_bio: bad end line)
FAILED
[ERROR] Could not start the gLite LB Local Logger daemons
[ERROR] Please verify and re-run the script
The cause
glite-lb-logd, glite-lb-interlogd, glite-lb-notif-interlogd will fail to start whenever the certificate and the corresponding private key are passed to them as arguments (-c and -k respectively). This behavior seems to be confirmed by the following bug report although it refers to a previous release of glite (1.4.1):
bug #13988 overview: Failed to start glite-lb-locallogger on glite 1.4.1 https://savannah.cern.ch/bugs/?func=detailitem&item_id=13988
As a consequence the execution of:
/opt/glite/etc/init.d/glite-lb-locallogger and /opt/glite/etc/init.d/glite-lb-bkserverd
which call the binaries above, will also fail during the configuration of WMSLB, because the environment variables GLITE_HOST_CERT and GLITE_HOST_KEY, whenever they are defined, are passed as arguments to glite-lb-logd and glite-lb-*-interlogd.
The Solution
Apart from fixing the binaries' source code, a temporary workaround is to take advantage of the binaries' default behavior of falling back to ~UID/.globus/usercert.pem and ~UID/.globus/userkey.pem as a last resort when no arguments are given to them:
stat64("/home/glite", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
stat64("/home/glite/.globus/usercert.pem", 0xbfff8130) = -1 ENOENT (No such file or directory)
stat64("/home/glite/.globus/userkey.pem", 0xbfff8130) = -1 ENOENT (No such file or directory)
stat64("/home/glite/.globus/usercred.p12", 0xbfff8130) = -1 ENOENT (No such file or directory)
The default location of the user certificate files, as defined in /opt/glite/etc/config/glite-global.cfg.xml (~glite/.certs/), does not match the built-in defaults of glite-lb-logd, glite-lb-interlogd and glite-lb-notif-interlogd (~glite/.globus).
Unfortunately, and most likely only on clean installs, the .globus directory has not been created by the time one runs the configuration scripts.
So before running the configuration script one has to
- clear the "creds" variable found in glite-lb-locallogger and glite-lb-bkserverd to make sure that glite-lb-logd, glite-lb-interlogd, glite-lb-notif-interlogd are executed without the -c and -k arguments
- create ~glite/.globus
- copy the hostkey.pem and hostcert.pem to userkey.pem and usercert.pem respectively under ~glite/.globus
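The last two steps can be sketched as a small helper. The function name setup_lb_credentials and the file modes are assumptions, not taken from the gLite documentation. The first step (clearing "creds" in the init scripts) still has to be done by hand:

```shell
# setup_lb_credentials CERT_DIR HOME_DIR
# Copies the host certificate and key into HOME_DIR/.globus under the
# user* names that the LB daemons fall back to.
setup_lb_credentials() {
    cert_dir=$1   # e.g. /etc/grid-security
    home_dir=$2   # e.g. /home/glite
    mkdir -p "$home_dir/.globus" || return 1
    cp "$cert_dir/hostcert.pem" "$home_dir/.globus/usercert.pem" || return 1
    cp "$cert_dir/hostkey.pem" "$home_dir/.globus/userkey.pem" || return 1
    # Assumed modes: world-readable certificate, owner-only private key.
    chmod 644 "$home_dir/.globus/usercert.pem"
    chmod 400 "$home_dir/.globus/userkey.pem"
}
```

On a real node this would be called as setup_lb_credentials /etc/grid-security /home/glite, presumably followed by a chown of ~glite/.globus to the glite user.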
glite-CE and lcg-CE: NFS shared gridmapdir
On both the glite-CE and the lcg-CE, the directory /etc/grid-security/gridmapdir holds the mapping between Grid users and local pool accounts. Since the two CEs share the same WNs, having a separate gridmapdir on each CE can lead to security problems and other issues.
The proposed solution is to share via NFS the gridmapdir on both CEs. The exact steps are:
on the lcg-CE (ce02):
# echo "/etc/grid-security/gridmapdir ce01.gridctb.uoa.gr(rw,no_root_squash,sync)" >> /etc/exports
# chkconfig nfs on
# service nfs restart
on the glite-CE (ce01):
# rm -rf /etc/grid-security/gridmapdir/*
# echo "ce02.gridctb.uoa.gr:/etc/grid-security/gridmapdir /etc/grid-security/gridmapdir nfs defaults 0 0" >> /etc/fstab
# service netfs restart
In our case ce01 is the glite-CE and ce02 is the lcg-CE.
Now both CEs should be sharing the same gridmapdir. However, while the ownership of the gridmapdir in both systems is root:edguser, this is not the case after the NFS sharing procedure because group ids are not identical in both systems (perhaps this is a bug and should be fixed upstream). So to ensure proper permissions for the edguser on both nodes, we create a new "gridmount" group with the same GID on both CEs, and add edguser to it:
On one CE:
# chgrp 5000 /etc/grid-security/gridmapdir
On both CEs:
# groupadd -g 5000 gridmount
# gpasswd -a edguser gridmount
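To confirm that the group really has the same GID on both CEs, a small lookup like the following could be run on each node. The helper name gid_of is hypothetical, and it reads a group file directly rather than going through NSS:

```shell
# gid_of GROUP [FILE]: print the numeric GID of GROUP as recorded in FILE
# (default /etc/group); prints nothing if the group is absent.
gid_of() {
    awk -F: -v g="$1" '$1 == g { print $3 }' "${2:-/etc/group}"
}
```

After the groupadd above, gid_of gridmount should report 5000 on both ce01 and ce02.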
gLite-CE: Incorrect information published
In the process of testing the selected configuration (Dual CEs with Torque server and site BDII on the lcg-CE_torque) we encountered a number of bugs related to the information published by the BDII. In particular, glite-CE GRIS doesn't publish correct information for any of the GlueCEInfoLRMSVersion, GlueCEInfoTotalCPUs, GlueCEStateEstimatedResponseTime LDAP attributes.
Please find below a detailed description of the bugs and proposed solutions.
maui-client rpm missing
GIP running on the glite-CE calls the following executables in the following order: lcg-info-dynamic-scheduler-->vomaxjobs-maui-->diagnose
Unfortunately /usr/bin/diagnose doesn't exist on the glite-CE when no torque server has been installed. So the solution is to run:
# apt-get install maui-client
This should be corrected upstream in yaim: this particular rpm (maui-client) shall be included in the metapackage glite-CE, so that it is installed regardless of the torque-server.
vomaxjobs-maui bug
The /opt/lcg/libexec/vomaxjobs-maui python script calls the /usr/bin/diagnose program. When it is called with "-h somehost" parameter it should call the diagnose program with "--host=somehost " parameter. This behavior is achieved by applying the following patch:
--- /opt/lcg/libexec/vomaxjobs-maui.orig 2006-05-25 17:04:23.000000000 +0300
+++ /opt/lcg/libexec/vomaxjobs-maui 2006-05-25 22:24:38.000000000 +0300
@@ -34,7 +34,7 @@
cmd = cmd + ' --host=' + schedhost
import commands
-(stat, out) = commands.getstatusoutput('diagnose -g')
+(stat, out) = commands.getstatusoutput(cmd)
if stat:
print sys.argv[0] + ': Maui \'diagnose\' command' + \
' exited with nonzero status'
lcg-info-dynamic-scheduler configuration change
By default, when GIP on the glite-CE executes lcg-info-dynamic-scheduler, it calls vomaxjobs-maui with no parameters, which results in the diagnose command querying for a maui server on localhost.
This behaviour can be corrected in two ways. The first one (perhaps the best to be applied upstream in the yaim configure script) is to edit /var/spool/maui/maui.cfg with appropriate values, especially change SERVERHOST from localhost to the lcg-CE_torque host. The second one, and probably the simplest for the site admin to apply, is to append to the last line of /opt/lcg/etc/lcg-info-dynamic-scheduler.conf " -h YourTorqueServerHost". For example, in our case the last line becomes:
/opt/lcg/libexec/vomaxjobs-maui -h ce02.gridctb.uoa.gr
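The second option can be scripted. This is a sketch only: the function name add_torque_host is hypothetical, and it assumes, as described above, that the vomaxjobs-maui command is on the last line of the file:

```shell
# add_torque_host CONF HOST
# Appends " -h HOST" to the last line of CONF, unless that line
# already carries a -h option.
add_torque_host() {
    conf=$1
    host=$2
    if tail -n 1 "$conf" | grep -q -- ' -h '; then
        return 0
    fi
    sed '$ s/$/ -h '"$host"'/' "$conf" > "$conf.new" && mv "$conf.new" "$conf"
}
```

For the setup described here the call would be add_torque_host /opt/lcg/etc/lcg-info-dynamic-scheduler.conf ce02.gridctb.uoa.gr. Running it a second time leaves the file unchanged.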
lcg-CE: maui-server & edguser privileges
After applying these workarounds, information seems to be published right for the glite-CE. However, looking at /opt/bdii/var/bdii.log one can see that lcg-info-dynamic-scheduler sometimes still fails.
The cause is that when this script is executed as edguser, diagnose queries the server as edguser too, so the command fails with the following error:
# su edguser
$ diagnose -g --host=ce02.hep.ntua.gr
ERROR: 'diagnose' failed
ERROR: user 'edguser' is not authorized to execute command 'diagnose'
This one should be corrected on the server: On the lcg-CE_torque host, edit the file /var/spool/maui/maui.cfg and add edguser on the ADMIN3 line. Everything should be OK after executing "service maui restart".
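The ADMIN3 edit can be sketched as an idempotent helper. The name add_maui_admin3 is hypothetical, and the sketch assumes a single ADMIN3 line with space-separated user names:

```shell
# add_maui_admin3 CFG USER
# Appends USER to the ADMIN3 line of CFG unless it is already listed.
add_maui_admin3() {
    cfg=$1
    user=$2
    if grep -Eq "^ADMIN3([[:space:]].*)?[[:space:]]$user([[:space:]]|\$)" "$cfg"; then
        return 0
    fi
    sed "/^ADMIN3/ s/[[:space:]]*\$/ $user/" "$cfg" > "$cfg.new" && mv "$cfg.new" "$cfg"
}
```

On the lcg-CE_torque host this would be add_maui_admin3 /var/spool/maui/maui.cfg edguser, followed by the maui restart mentioned above.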
glite-yaim-3.0.0-12: error in config_lcgenv
The following diff applies to /opt/glite/yaim/functions/config_lcgenv of glite-yaim-3.0.0-12
--- /opt/glite/yaim/functions/config_lcgenv.orig 2006-05-17 00:03:47.000000000 +0300
+++ /opt/glite/yaim/functions/config_lcgenv 2006-05-17 00:04:03.000000000 +0300
@@ -14,7 +14,7 @@
requires VOS VO__SW_DIR SE_LIST
fi
-TEMP_SE=eval `echo $SE_LIST | sed s'/^[ ]*//'`
+TEMP_SE=`echo $SE_LIST | sed s'/^[ ]*//'`
default_se="$TEMP_SE"
#${SE_LIST%% *}"
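Why the stray eval matters can be seen in a small stand-alone sketch. The host names are placeholders and BUGGY_SE is a hypothetical variable name. With the eval, the shell treats the assignment as a temporary environment setting for a command whose name is the first SE host, so the variable never receives the list:

```shell
SE_LIST="   se01.example.org se02.example.org"

# Original form: expands to "BUGGY_SE=eval se01.example.org se02.example.org",
# i.e. it tries to run a command named after the first SE host.
unset BUGGY_SE
BUGGY_SE=eval `echo $SE_LIST | sed s'/^[ ]*//'` 2>/dev/null || true
echo "with eval:    '${BUGGY_SE:-}'"

# Patched form: a plain assignment from the command substitution.
TEMP_SE=`echo $SE_LIST | sed s'/^[ ]*//'`
echo "without eval: '$TEMP_SE'"
```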
PreGR01-UOM Notes
PreGR02-UPATRAS Notes
The full report, with job storm results, is available at
Node Preparation
APT
cat > /etc/apt/sources.list.d/glite.list << EOF
rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/ rhel30 externals Release3.0 updates
EOF
apt-get update; apt-get dist-upgrade; apt-get upgrade
Java
rpm -ivh /nfs/yaim/j2sdk-1_4_2_08-linux-i586.rpm
CA Installation
Install CAs
cat > /etc/apt/sources.list.d/ca.list << EOF
rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG_CA/en/i386 lcg
EOF
apt-get update; apt-get install lcg-CA
YAIM Configuration
(actually after the middleware has been installed on one node ...)
The following values are not yet known, so commenting out
- RB_HOST
- PX_HOST
- MON_HOST
- FTS_HOST
- REG_HOST
Changed From YAIM_VERSION=3.0.0-3 To YAIM_VERSION=3.0.0-11
Changed From
#LCG_REPOSITORY="'rpm http://linuxsoft.cern.ch LCG/apt/LCG-2_7_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates lcg_sl3.security' 'rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG-2_7_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates lcg_sl3.security'"
to
LCG_REPOSITORY="'rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/ rhel30 externals Release3.0 updates'"
Had to set MON_BOX and REG_HOST to dummy values :-) (SEE does not have RGMA-Registry + Schema Server)
Had to change VO_DTEAM_DEFAULT_SE like below, in order for CSH to work on WNs:
#VO_DTEAM_DEFAULT_SE=$CLASSIC_HOST
VO_DTEAM_DEFAULT_SE=$DPM_HOST
WMS + LB
apt-get install glite-WMSLB
/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def WMSLB
BD-II
Install
apt-get install glite-BDII
Configure
/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def BDII
Change BDII_AUTO_UPDATE to NO in file /opt/bdii/etc/bdii.conf
BDII_AUTO_UPDATE=no
Add manually GIIS URI for each site in file /opt/bdii/etc/bdii-update.conf, like
PreGR-02-UPATRAS ldap://grid1.csl.ee.upatras.gr:2170/mds-vo-name=PreGR-02-UPATRAS,o=grid
Restart BD-II
service bdii restart
Do a quick check
ldapsearch -x -H ldap://grid28:2170 -b o=grid
glite CE
Careful: This recipe results in a problematic info provider in the site BDII, with entries like this:
# grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam, PreGR-02-UPATRAS, grid
dn: GlueCEUniqueID=grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam,mds-vo-name=PreGR-02-UPATRAS,o=grid
GlueCEStateEstimatedResponseTime: 999999
# dteam, grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam, PreGR-02-UPATRAS, grid
dn: GlueVOViewLocalID=dteam,GlueCEUniqueID=grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam,mds-vo-name=PreGR-02-UPATRAS,o=grid
GlueCEStateEstimatedResponseTime: 0
Add BDII_site type to YAIM
cat >> /opt/glite/yaim/scripts/node-info.def << EOF
BDII_site_FUNCTIONS="config_edgusers config_bdii"
EOF
Install yaim
apt-get install glite-yaim
Install
/opt/glite/yaim/scripts/install_node /nfs/yaim/site-info.def glite-CE glite-torque-server-config
Configure
/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def gliteCE TORQUE_server BDII_site
Changed in file /opt/glite/yaim/scripts/node-info.def to add config_torque_server
From
TORQUE_server_FUNCTIONS=" config_host_certs config_users config_mkgridmap config_crl config_torque_submitter_ssh config_glite config_add_glite_env"
To
TORQUE_server_FUNCTIONS=" config_host_certs config_users config_mkgridmap config_crl config_torque_submitter_ssh config_torque_server config_glite config_add_glite_env"
Configure
/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def gliteCE TORQUE_server BDII_site
Run the correction as noted in the installation manual
sed -i '{s/jobmanager/blah/}' /opt/lcg/libexec/lcg-info-dynamic-scheduler
Add some lines for LCAS authorization
cat >> /etc/grid-security/grid-mapfile << EOF
/dteam/Role=lcgadmin dteamsgm
/alice/Role=lcgadmin alicesgm
/atlas/Role=lcgadmin atlassgm
/cms/Role=lcgadmin cmssgm
/lhcb/Role=lcgadmin lhcbsgm
/dteam .dteam
/dteam/* .dteam
/alice .alice
/alice/* .alice
/atlas .atlas
/atlas/* .atlas
/cms .cms
/cms/* .cms
/lhcb .lhcb
/lhcb/* .lhcb
/picard/* .picard
/picard .picard
/riker/* .riker
/riker .riker
/biomed .biomed
/biomed/* .biomed
EOF
These lines would be overwritten the next time the cron job edg-mkgridmap runs.
To convince edg-mkgridmap to append these lines, put them in a file, e.g. /etc/grid-security/grid-mapfile-local
and add the following line at the end of the file /opt/edg/etc/edg-mkgridmap.conf
gmf_local /etc/grid-security/grid-mapfile-local
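Appending the gmf_local line can be made safe to repeat with a small helper. The name ensure_line is hypothetical; it adds the line only if an identical one is not already present:

```shell
# ensure_line FILE LINE: append LINE to FILE unless FILE already
# contains exactly that line.
ensure_line() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

Here the call would be ensure_line /opt/edg/etc/edg-mkgridmap.conf 'gmf_local /etc/grid-security/grid-mapfile-local'.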
WN
Install yaim
apt-get install glite-yaim
Install
/opt/glite/yaim/scripts/install_node /nfs/yaim/site-info.def glite-WN glite-torque-client-config
Configure
/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def WN_torque
SE DPM
Some preparation
cat >> /etc/fstab << EOF
/dev/hdc /dpm ext3 defaults 1 2
EOF
#Automount NFS share for YAIM config
cat >> /etc/auto.master << EOF
/nfs /etc/auto.misc --timeout=300
EOF
cat >> /etc/auto.misc << EOF
yaim -intr,rsize=8192,wsize=8192 grid10:/nfs/yaim-conf
EOF
chkconfig autofs on
service autofs start
cat > /etc/sysconfig/network << EOF
NETWORKING=yes
HOSTNAME=grid13.csl.ee.upatras.gr
GATEWAY=150.140.186.193
EOF
cat > /etc/sysconfig/network-scripts/ifcfg-eth0 << EOF
DEVICE=eth0
BOOTPROTO=static
IPADDR=150.140.186.227
NETMASK=255.255.255.192
ONBOOT=yes
TYPE=Ethernet
EOF
scp grid10:/nfs/hosts/grid13/gridsec/*.pem /etc/grid-security/
Install
apt-get install glite-SE_dpm_mysql
Configure
/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def SE_dpm_mysql
UI
Install
apt-get install glite-UI
Configure
/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def UI
UI - HellasGrid GSI Config
Install the HellasGrid GSI rpm.
You might want to visit http://pki.physics.auth.gr/hellasgrid-ca/RPMS/ to see if there is a newer version.
rpm -ivh http://pki.physics.auth.gr/hellasgrid-ca/RPMS/ca_HellasGrid-local-0.2-1.noarch.rpm
Use the tools at the following URL to generate the two necessary config files,
/etc/grid-security/globus-user-ssl.conf and
/etc/grid-security/globus-host-ssl.conf.
http://pki.physics.auth.gr/hellasgrid-ca/util/globus2_config.html
Our files follow:
cat > /etc/grid-security/globus-user-ssl.conf << 'EOF'
RANDFILE = $ENV::HOME/.rnd
####################################################################
[ ca ]
default_ca = CA_default # The default ca section
####################################################################
[ CA_default ]
dir = ./demoCA # Where everything is kept
certs = $dir/certs # Where the issued certs are kept
crl_dir = $dir/crl # Where the issued crl are kept
database = $dir/index.txt # database index file.
new_certs_dir = $dir/newcerts # default place for new certs.
certificate = $dir/cacert.pem # The CA certificate
serial = $dir/serial # The current serial number
crl = $dir/crl.pem # The current CRL
private_key = $dir/private/cakey.pem# The private key
RANDFILE = $dir/private/.rand # private random number file
x509_extensions = x509v3_extensions # The extentions to add to the cert
default_days = 365 # how long to certify for
default_crl_days= 365 # DEE 30 # how long before next CRL
default_md = md5 # which md to use.
preserve = no # keep passed DN ordering
policy = policy_match
[ policy_match ]
countryName = match
stateOrProvinceName = optional
organizationName = match
organizationalUnitName = optional
commonName = supplied
emailAddress = optional
[ policy_anything ]
countryName = optional
stateOrProvinceName = optional
localityName = optional
organizationName = optional
organizationalUnitName = optional
commonName = supplied
emailAddress = optional
####################################################################
[ req ]
default_bits = 1024
default_keyfile = privkey.pem
distinguished_name = req_distinguished_name
attributes = req_attributes
[ req_distinguished_name ]
# BEGIN CONFIG
countryName = Country Name (2 letter code)
countryName_default = GR
countryName_min = 2
countryName_max = 2
0.organizationName = Level 0 Organization
0.organizationName_default = HellasGrid
0.organizationalUnitName = Level 0 Organizational Unit
0.organizationalUnitName_default = csl.ee.upatras.gr
commonName = Name (e.g., John M. Smith)
commonName_max = 64
# END CONFIG
[ req_attributes ]
#unstructuredName = An optional company name
[ x509v3_extensions ]
nsCertType = 0x40
EOF
cat > /etc/grid-security/globus-host-ssl.conf << 'EOF'
RANDFILE = $ENV::HOME/.rnd
####################################################################
[ ca ]
default_ca = CA_default # The default ca section
####################################################################
[ CA_default ]
dir = ./demoCA # Where everything is kept
certs = $dir/certs # Where the issued certs are kept
crl_dir = $dir/crl # Where the issued crl are kept
database = $dir/index.txt # database index file.
new_certs_dir = $dir/newcerts # default place for new certs.
certificate = $dir/cacert.pem # The CA certificate
serial = $dir/serial # The current serial number
crl = $dir/crl.pem # The current CRL
private_key = $dir/private/cakey.pem# The private key
RANDFILE = $dir/private/.rand # private random number file
x509_extensions = x509v3_extensions # The extentions to add to the cert
default_days = 365 # how long to certify for
default_crl_days= 365 # DEE 30 # how long before next CRL
default_md = md5 # which md to use.
preserve = no # keep passed DN ordering
# A few difference way of specifying how similar the request should look
# For type CA, the listed attributes must be the same, and the optional
# and supplied fields are just that :-)
policy = policy_match
# For the CA policy
[ policy_match ]
countryName = match
stateOrProvinceName = optional
organizationName = match
organizationalUnitName = optional
commonName = supplied
emailAddress = optional
# For the 'anything' policy
# At this point in time, you must list all acceptable 'object'
# types.
[ policy_anything ]
countryName = optional
stateOrProvinceName = optional
localityName = optional
organizationName = optional
organizationalUnitName = optional
commonName = supplied
emailAddress = optional
####################################################################
[ req ]
default_bits = 1024
default_keyfile = privkey.pem
distinguished_name = req_distinguished_name
attributes = req_attributes
[ req_distinguished_name ]
countryName = Country Name (2 letter code)
countryName_default = GR
countryName_min = 2
countryName_max = 2
0.organizationName = Level 0 Organization
0.organizationName_default = HellasGrid
0.organizationalUnitName = Level 0 Organizational Unit
0.organizationalUnitName_default = csl.ee.upatras.gr
commonName = Name (e.g., John M. Smith)
commonName_max = 64
[ req_attributes ]
#unstructuredName = An optional company name
[ x509v3_extensions ]
nsCertType = 0x40
EOF
LCG + gLite CE Combination
LCG CE + Torque + site BDII
Install:
apt-get install lcg-CE_torque
Configure:
/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def lcg-CE_torque
Install BLAHP Log Parser
apt-get install glite-ce-blahp
The following is a daemon
/opt/glite/bin/BLParserPBS -p 33332 -s /var/spool/pbs &
Put gLite CE into /etc/hosts.equiv
echo grid19.csl.ee.upatras.gr >> /etc/hosts.equiv
The installation guide warns about collisions in pool account mapping. These are typically avoided either by allocating separate pool account ranges to each CE or by letting the CEs share a gridmapdir.
Share gridmapdir with gLite CE using NFS
cat >> /etc/exports <<EOF
/etc/grid-security/gridmapdir grid19.csl.ee.upatras.gr(rw,sync,no_root_squash)
EOF
exportfs -r
gLite CE
/opt/glite/yaim/scripts/install_node /nfs/yaim/site-info.def glite-CE
/opt/glite/yaim/scripts/configure_node /nfs/yaim/site-info.def gliteCE
Mount the gridmapdir
mount -t nfs grid1.csl.ee.upatras.gr:/etc/grid-security/gridmapdir /etc/grid-security/gridmapdir
Work Around for ERT problem
On the Torque Server node:
edit the file /var/spool/maui/maui.cfg
and add edguser to the line that starts with ADMIN3. The result should look like this:
ADMIN3 edginfo rgma edguser
On the gLite CE node:
Install maui-client
apt-get install maui-client
Edit file /opt/lcg/etc/lcg-info-dynamic-scheduler.conf,
to add your torque server at the parameter vo_max_jobs_cmd, like this
vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui -h grid1.csl.ee.upatras.gr
In file /opt/lcg/libexec/vomaxjobs-maui, change the commented line to the line below:
#(stat, out) = commands.getstatusoutput('diagnose -g')
(stat, out) = commands.getstatusoutput(cmd)
Cross your fingers, wait until Aphrodite aligns with Saturn and do an
ldapsearch -x -H ldap://grid8.csl.ee.upatras.gr:2170 -b o=grid | grep Est
If you still see any 999999, then something must have gone wrong with the stars.
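For a less astrological check, the ldapsearch output can be piped through a small counter. The helper name count_bad_ert is hypothetical; it counts entries whose estimated response time is still the 999999 placeholder:

```shell
# count_bad_ert: read LDIF on stdin and print how many
# GlueCEStateEstimatedResponseTime attributes are still 999999.
count_bad_ert() {
    awk '/^GlueCEStateEstimatedResponseTime: 999999$/ { n++ } END { print n + 0 }'
}
```

Usage would be ldapsearch -x -H ldap://grid8.csl.ee.upatras.gr:2170 -b o=grid | count_bad_ert, where a result of 0 suggests the workaround has taken effect.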
Changes needed on the combined WNs
Put gLite CE hostname into /opt/edg/etc/edg-pbs-knownhosts.conf, line which starts with NODES = .
The entries there are separated by space.
Run /opt/edg/sbin/edg-pbs-knownhosts, or wait for cron to run it later.
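The NODES edit can be sketched as follows. The function name add_pbs_known_host is hypothetical, and the sketch assumes a single line of the form "NODES = host1 host2":

```shell
# add_pbs_known_host CONF HOST
# Appends HOST to the "NODES = ..." line of CONF unless already listed.
add_pbs_known_host() {
    conf=$1
    host=$2
    if grep -E '^NODES[[:space:]]*=' "$conf" | tr -s ' \t=' '\n' | grep -qxF -- "$host"; then
        return 0
    fi
    sed "/^NODES[[:space:]]*=/ s/[[:space:]]*\$/ $host/" "$conf" > "$conf.new" && mv "$conf.new" "$conf"
}
```

On each WN this would be add_pbs_known_host /opt/edg/etc/edg-pbs-knownhosts.conf followed by the gLite CE hostname, and then the edg-pbs-knownhosts run described above.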
YAIM site-info.def
To remove passwords:
sed '/PASSWORD=/ s/=.*$/=<PASSWORD_REMOVED_HERE>/' /nfs/yaim/site-info.def \
| mail -s "YAIM+`date --iso-8601=minutes`" goulas@ee.upatras.gr
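A sketch of the redaction step as a reusable filter, assuming the intent is to replace everything after the first "=" on lines containing "PASSWORD=". The function name redact_passwords is hypothetical. Note that a pattern of "=*$" only matches a run of "=" at the very end of the line and would leave the value in place, so "=.*$" is used here:

```shell
# redact_passwords: filter stdin, replacing the value on every line that
# contains PASSWORD= with a fixed marker; other lines pass through unchanged.
redact_passwords() {
    sed '/PASSWORD=/ s/=.*$/=<PASSWORD_REMOVED_HERE>/'
}
```

It can be checked locally with redact_passwords < /nfs/yaim/site-info.def | grep PASSWORD before anything is mailed.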
YAIM, on 2006-05-17T18:11+0300
# YAIM example site configuration file - adapt it to your site!
MY_DOMAIN=csl.ee.upatras.gr
# Node names
# Note: - SE_HOST --> Removed, see CLASSIC_HOST, DCACHE_ADMIN, DPM_HOST below
# - REG_HOST --> There is only 1 central registry for the time being.
CE_HOST=grid1.$MY_DOMAIN
RB_HOST=my-rb.$MY_DOMAIN
WMS_HOST=grid25.$MY_DOMAIN
PX_HOST=px01.lip.pt
BDII_HOST=grid28.$MY_DOMAIN
MON_HOST=grid19.$MY_DOMAIN
#FTS_HOST=my-fts.$MY_DOMAIN
#REG_HOST=lcgic01.gridpp.rl.ac.uk
REG_HOST=my_reg_host.$MY_DOMAIN
GLITECE_HOST=grid19.csl.ee.upatras.gr
# VO-BOX - Set this if you are building a VO-BOX
#VOBOX_HOST=my-vobox.$MY_DOMAIN
#VOBOX_PORT=1975
# Set this to "yes" your site provides an X509toKERBEROS Authentication Server
# Only for sites with Experiment Software Area under AFS
GSSKLOG=no
GSSKLOG_SERVER=my-gssklog.$MY_DOMAIN
# LFC - Set these if you are installing an LFC
#LFC_HOST=my-lfc.$MY_DOMAIN
#LFC_DB_PASSWORD=<PASSWORD_REMOVED_HERE>
# These are set to default to using the standard database on the same hosts
# as the LFC daemon is on
LFC_DB_HOST=$LFC_HOST
LFC_DB=cns_db
# All catalogues are local unless you add a VO to
# LFC_CENTRAL, in which case that will be central
LFC_CENTRAL=""
# If you want to limit the VOs your LFC serves, add the locals here
LFC_LOCAL=""
# TORQUE - Change this if your torque server is not on the CE
# it's ingored for other batch systems
TORQUE_SERVER=$CE_HOST
# These variables tell YAIM where to find additional configuration files.
WN_LIST=/nfs/yaim/wn-list.conf
USERS_CONF=/nfs/yaim/users.conf
GROUPS_CONF=/nfs/yaim/groups.conf
FUNCTIONS_DIR=/opt/glite/yaim/functions
YAIM_VERSION=3.0.0-11
# Repository settings
#LCG_REPOSITORY="'rpm http://linuxsoft.cern.ch LCG/apt/LCG-2_7_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates lcg_sl3.security'\
'rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG-2_7_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates lcg_sl3.security'"
LCG_REPOSITORY="'rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/ rhel30 externals Release3.0 updates'"
CA_REPOSITORY="rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG_CA/en/i386 lcg"
REPOSITORY_TYPE="apt" # or "yum"
# For the relocatable (tarball) distribution, ensure
# that INSTALL_ROOT is set correctly
INSTALL_ROOT=/opt
# You will probably want to change these too for the relocatable dist
OUTPUT_STORAGE=/tmp/jobOutput
JAVA_LOCATION="/usr/java/j2sdk1.4.2_08"
# Set this to '/dev/null' or some other dir if you want
# to turn off yaim installation of cron jobs
CRON_DIR=/etc/cron.d
# Set this to your prefered and firewall allowed port range
GLOBUS_TCP_PORT_RANGE="20000 25000"
# Choose a good password ! And be sure that this file cannot be read by
# any grid job !
MYSQL_PASSWORD=<PASSWORD_REMOVED_HERE>
APEL_DB_PASSWORD=<PASSWORD_REMOVED_HERE>
# GRID_TRUSTED_BROKERS: DNs of services (RBs) allowed to renew/retrives
# credentials from/at the myproxy server. Put single quotes around each trusted DN !!!
GRID_TRUSTED_BROKERS="
'broker one'
'broker two'
"
# The RB now uses the DLI by default; set VOs here which should use RLS
RB_RLS="" # "atlas cms"
# Space separated list of ldap servers in edg-mkgridmap.conf which authenticate users.
# Ex.: GRIDMAP_AUTH="ldap://lcg-registrar.cern.ch/ou=users,o=registrar,dc=lcg,dc=org ldap://xyz"
GRIDMAP_AUTH="ldap://lcg-registrar.cern.ch/ou=users,o=registrar,dc=lcg,dc=org"
# GridIce server host name (usually run on the MON node).
GRIDICE_SERVER_HOST=$MON_HOST
# Site-wide settings
SITE_EMAIL=egee-grid@ee.upatras.gr
SITE_NAME=PreGR-02-UPATRAS
SITE_LOC="Patras, Greece"
SITE_LAT=38.2368 # -90 to 90 degrees
SITE_LONG=21.7341 # -180 to 180 degrees
SITE_WEB="http://www.csl.ee.upatras.gr"
SITE_TIER="TIER 2"
SITE_SUPPORT_SITE="my-bigger-site.their_domain"
# Jobmanager specific settings
JOB_MANAGER=lcgpbs
CE_BATCH_SYS=torque
BATCH_BIN_DIR=/usr/bin
BATCH_VERSION=torque-1.0.1b
BATCH_LOG_DIR=/var/spool/pbs/server_priv/accounting
# Architecture and enviroment specific settings
CE_CPU_MODEL=PIII
CE_CPU_VENDOR=intel
CE_CPU_SPEED=1001
CE_OS="Scientific Linux SL"
CE_OS_RELEASE="SL"
CE_OS_VERSION=3.0.3
CE_MINPHYSMEM=513
CE_MINVIRTMEM=1025
CE_SMPSIZE=1
CE_SI00=381
CE_SF00=0
CE_OUTBOUNDIP=TRUE
CE_INBOUNDIP=FALSE
CE_RUNTIMEENV="
LCG-2
LCG-2_1_0
LCG-2_1_1
LCG-2_2_0
LCG-2_3_0
LCG-2_3_1
LCG-2_4_0
LCG-2_5_0
LCG-2_6_0
LCG-2_7_0
GLITE-3_0_0
R-GMA
"
# Set this if your WNs have a shared directory for temporary storage
CE_DATADIR=""
# Classic SE
CLASSIC_HOST="classic SE host"
CLASSIC_STORAGE_DIR="/storage"
# dCache-specific settings
# ignore if you are not running d-cache
# Your dcache admin node
DCACHE_ADMIN=""
DCACHE_POOLS="my-pool-node1:/pool-path1 my-pool-node2:/pool-path2"
# Optional
# DCACHE_PORT_RANGE="20000,25000"
# DCACHE_DOOR_SRM="door_node1[:port]"
# DCACHE_DOOR_GSIFTP="door_node1[:port] door_node2[:port]"
# DCACHE_DOOR_GSIDCAP="door_node1[:port] door_node2[:port]"
# DCACHE_DOOR_DCAP="door_node1[:port] door_node2[:port]"
# Set to "yes" only if YAIM shall reset the dCache configuration,
# i.e. if you want YAIM to configure dCache - WARNING:
# this may wipe out any dCache parameters previously configured!
# DCACHE_PORT_RANGE="20000,25000"
# RESET_DCACHE_CONFIGURATION=no
# Set to "yes" only if YAIM shall reset the dCache nameserver,
# i.e. if you want YAIM to clear the content of dCache - WARNING:
# this may wipe out any dCache files previously stored!
# RESET_DCACHE_PNFS=no
# Set to "yes" only if YAIM shall reset the dCache Databases,
# i.e. if you want YAIM to clear the metadata of dCache - WARNING:
# this may wipe out any dCache files names previously stored!
# Leaving your system without any way to reestablish which files
# are stored.
# RESET_DCACHE_RDBMS=no
#
# SE_dpm-specific settings - Ignore if you are not running a DPM
#
# Set these if you are installing a DPM yourself
# and/or if you need a default DPM for the lcg-stdout-mon
#
# DPMDATA is now deprecated. Use an entry like $DPM_HOST:/filesystem in
# the DPM_FILESYSTEMS variable.
# From now on we use DPM_DB_USER and DPM_DB_PASSWORD to make clear
# its different role from that of the dpmmgr unix user who owns the
# directories and runs the daemons.
# The name of the DPM head node
DPM_HOST=grid13.$MY_DOMAIN # my-dpm.$MY_DOMAIN
# The DPM pool name
DPMPOOL=UPATRASDPM
# The filesystems/partitions parts of the pool
#DPM_FILESYSTEMS="$DPM_HOST:/path1 my-dpm-poolnode.$MY_DOMAIN:/path2"
DPM_FILESYSTEMS="$DPM_HOST:/dpm my-dpm-poolnode.$MY_DOMAIN:/path2"
# The database user
DPM_DB_USER=dpmuser
# The database user password
DPM_DB_PASSWORD=dpm_pass_pps_egee=<PASSWORD_REMOVED_HERE>
# The DPM database host
DPM_DB_HOST=$DPM_HOST
# Specifies the default amount of space reserved for a file
DPMFSIZE=200M
# Variable for the port range - Optional, default value is shown
# RFIO_PORT_RANGE="20000 25000"
# This largely replaces CE_CLOSE_SE but it is a list of hostnames
SE_LIST="$CLASSIC_HOST $DPM_HOST $DCACHE_ADMIN"
SE_ARCH="multidisk" # "disk, tape, multidisk, other"
FTS_SERVER_URL="https://fts.${MY_DOMAIN}:8443/path/glite-data-transfer-fts"
# BDII/GIP specific settings
BDII_HTTP_URL="http://grid-deployment.web.cern.ch/grid-deployment/gis/lcg2-bdii/dteam/lcg2-all-sites.conf"
# Set this to use FCR
#BDII_FCR="http://goc.grid-support.ac.uk/gridsite/bdii/BDII/www/bdii-update.ldif"
# Ex.: BDII_REGIONS="CE SE RB PX VOBOX"
BDII_REGIONS="CE DPM GLITECE" # list of the services provided by the site
BDII_CE_URL="ldap://$CE_HOST:2135/mds-vo-name=local,o=grid"
BDII_GLITECE_URL="ldap://$GLITECE_HOST:2170/mds-vo-name=resource,o=grid"
BDII_SE_URL="ldap://$CLASSIC_HOST:2135/mds-vo-name=local,o=grid"
BDII_DPM_URL="ldap://$DPM_HOST:2135/mds-vo-name=local,o=grid"
BDII_RB_URL="ldap://$RB_HOST:2135/mds-vo-name=local,o=grid"
BDII_PX_URL="ldap://$PX_HOST:2135/mds-vo-name=local,o=grid"
BDII_LFC_URL="ldap://$LFC_HOST:2135/mds-vo-name=local,o=grid"
BDII_VOBOX_URL="ldap://$VOBOX_HOST:2135/mds-vo-name=local,o=grid"
BDII_FTS_URL="ldap://$FTS_HOST:2170/mds-vo-name=resource,o=grid"
# Use this to set your contact string.
# Ex.: BDII_BIND="mds-vo-name=mystorage,o=grid"
# E2EMONIT specific settings
# This specifies the location to download the host specific configuration file
E2EMONIT_LOCATION=grid-deployment.web.cern.ch/grid-deployment/e2emonit/production
#
# Replace this with the siteid supplied by the person setting up the networking
# topology.
E2EMONIT_SITEID=my.siteid
# VOS="atlas alice lhcb cms dteam biomed"
# Space separated list of supported VOs by your site
VOS="dteam"
QUEUES=${VOS}
VO_SW_DIR=/opt/exp_soft
# Set this if you want a scratch directory for jobs
EDG_WL_SCRATCH=""
# VO specific settings. For help see: https://lcg-sft.cern.ch/yaimtool/yaimtool.py
VO_ATLAS_SW_DIR=$VO_SW_DIR/atlas
VO_ATLAS_DEFAULT_SE=$CLASSIC_HOST
VO_ATLAS_STORAGE_DIR=$CLASSIC_STORAGE_DIR/atlas
VO_ATLAS_QUEUES="atlas"
VO_ATLAS_SGM=ldap://grid-vo.nikhef.nl/ou=lcgadmin,o=atlas,dc=eu-datagrid,dc=org
VO_ATLAS_USERS=ldap://grid-vo.nikhef.nl/ou=lcg1,o=atlas,dc=eu-datagrid,dc=org
VO_ATLAS_VOMS_POOL_PATH="/lcg1"
VO_ATLAS_VOMS_SERVERS="'vomss://lcg-voms.cern.ch:8443/voms/atlas?/atlas/' 'vomss://voms.cern.ch:8443/voms/atlas?/atlas/'"
#VO_ATLAS_VOMS_EXTRA_MAPS="'Role=production production' 'usatlas .usatlas'"
VO_ATLAS_VOMSES="'atlas lcg-voms.cern.ch 15001 /C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch atlas' 'atlas voms.cern.ch 15001 /C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch atlas'"
VO_ALICE_SW_DIR=$VO_SW_DIR/alice
VO_ALICE_DEFAULT_SE=$CLASSIC_HOST
VO_ALICE_STORAGE_DIR=$CLASSIC_STORAGE_DIR/alice
VO_ALICE_QUEUES="alice"
VO_ALICE_SGM=ldap://grid-vo.nikhef.nl/ou=lcgadmin,o=alice,dc=eu-datagrid,dc=org
VO_ALICE_USERS=ldap://grid-vo.nikhef.nl/ou=lcg1,o=alice,dc=eu-datagrid,dc=org
VO_ALICE_VOMS_SERVERS="'vomss://lcg-voms.cern.ch:8443/voms/alice?/alice/' 'vomss://voms.cern.ch:8443/voms/alice?/alice/'"
VO_ALICE_VOMSES="'alice lcg-voms.cern.ch 15000 /C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch alice' 'alice voms.cern.ch 15000 /C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch alice'"
VO_CMS_SW_DIR=$VO_SW_DIR/cms
VO_CMS_DEFAULT_SE=$CLASSIC_HOST
VO_CMS_STORAGE_DIR=$CLASSIC_STORAGE_DIR/cms
VO_CMS_QUEUES="cms"
VO_CMS_SGM=ldap://grid-vo.nikhef.nl/ou=lcgadmin,o=cms,dc=eu-datagrid,dc=org
VO_CMS_USERS=ldap://grid-vo.nikhef.nl/ou=lcg1,o=cms,dc=eu-datagrid,dc=org
VO_CMS_VOMS_SERVERS="'vomss://lcg-voms.cern.ch:8443/voms/cms?/cms/' 'vomss://voms.cern.ch:8443/voms/cms?/cms/'"
VO_CMS_VOMSES="'cms lcg-voms.cern.ch 15002 /C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch cms' 'cms voms.cern.ch 15002 /C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch cms'"
VO_LHCB_SW_DIR=$VO_SW_DIR/lhcb
VO_LHCB_DEFAULT_SE=$CLASSIC_HOST
VO_LHCB_STORAGE_DIR=$CLASSIC_STORAGE_DIR/lhcb
VO_LHCB_QUEUES="lhcb"
VO_LHCB_SGM=ldap://grid-vo.nikhef.nl/ou=lcgadmin,o=lhcb,dc=eu-datagrid,dc=org
VO_LHCB_USERS=ldap://grid-vo.nikhef.nl/ou=lcg1,o=lhcb,dc=eu-datagrid,dc=org
VO_LHCB_VOMS_SERVERS="'vomss://lcg-voms.cern.ch:8443/voms/lhcb?/lhcb/' 'vomss://voms.cern.ch:8443/voms/lhcb?/lhcb/'"
VO_LHCB_VOMS_EXTRA_MAPS="lcgprod lhcbprod"
VO_LHCB_VOMSES="'lhcb lcg-voms.cern.ch 15003 /C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch lhcb' 'lhcb voms.cern.ch 15003 /C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch lhcb'"
VO_DTEAM_SW_DIR=$VO_SW_DIR/dteam
#VO_DTEAM_DEFAULT_SE=$CLASSIC_HOST
VO_DTEAM_DEFAULT_SE=$DPM_HOST
VO_DTEAM_STORAGE_DIR=$CLASSIC_STORAGE_DIR/dteam
VO_DTEAM_QUEUES="dteam"
VO_DTEAM_SGM=ldap://lcg-vo.cern.ch/ou=lcgadmin,o=dteam,dc=lcg,dc=org
VO_DTEAM_USERS=ldap://lcg-vo.cern.ch/ou=lcg1,o=dteam,dc=lcg,dc=org
VO_DTEAM_VOMS_SERVERS="'vomss://lcg-voms.cern.ch:8443/voms/dteam?/dteam/' 'vomss://voms.cern.ch:8443/voms/dteam?/dteam/'"
VO_DTEAM_VOMSES="'dteam lcg-voms.cern.ch 15004 /C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch dteam' 'dteam voms.cern.ch 15004 /C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch dteam'"
VO_BIOMED_SW_DIR=$VO_SW_DIR/biomed
VO_BIOMED_DEFAULT_SE=$CLASSIC_HOST
VO_BIOMED_STORAGE_DIR=$CLASSIC_STORAGE_DIR/biomed
VO_BIOMED_QUEUES="biomed"
VO_BIOMED_USERS=ldap://vo-biome.in2p3.fr/ou=lcg1,o=biomedical,dc=lcg,dc=org
VO_BIOMED_SGM=ldap://vo-biome.in2p3.fr/ou=lcgadmin,o=biomedical,dc=lcg,dc=org
VO_BIOMED_VOMSES="biomed cclcgvomsli01.in2p3.fr 15000 /O=GRID-FR/C=FR/O=CNRS/OU=CC-LYON/CN=cclcgvomsli01.in2p3.fr/Email=sysunix@cc.in2p3.fr biomed"
Problems
gLite publishes default values for ERTs
The information providers do not run correctly. The erratic behavior observed is shown below:
# grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam, PreGR-02-UPATRAS, grid
dn: GlueCEUniqueID=grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam,mds-vo-name=Pre
 GR-02-UPATRAS,o=grid
GlueCEStateEstimatedResponseTime: 999999

# dteam, grid1.csl.ee.upatras.gr:2119/blah-pbs-dteam, PreGR-02-UPATRAS, grid
dn: GlueVOViewLocalID=dteam,GlueCEUniqueID=grid1.csl.ee.upatras.gr:2119/blah-p
 bs-dteam,mds-vo-name=PreGR-02-UPATRAS,o=grid
GlueCEStateEstimatedResponseTime: 0
BD-II log file /opt/bdii/var/bdii.log has the following relevant entries:
2006-05-17 14:19:30 lcg-info-dynamic-scheduler: VO max jobs backend command returned nonzero exit status
2006-05-17 14:19:30 lcg-info-dynamic-scheduler: Exiting without output, GIP will use static values
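The above behavior is observed both with Torque installed on the gLite CE and with Torque on a separate node. The result is that a gLite CE is never chosen by a WMS when LCG CEs also match the job, presumably because the static estimated response time of 999999 makes it rank last.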
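One way to spot the affected entries is to scan the LDIF published by the site for the static value. The following is only a minimal sketch, not something shipped with gLite or YAIM: the function name find_default_ert is made up for illustration, and it assumes that 999999 is the static fallback value, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of gLite): reads LDIF on stdin and prints
# the dn of every entry that publishes the static ERT value 999999.
find_default_ert() {
    awk '
        function check(l) {
            if (l ~ /^dn: /) dn = substr(l, 5)
            else if (l == "GlueCEStateEstimatedResponseTime: 999999") print dn
        }
        # A line starting with a single space continues the previous LDIF line.
        /^ / { line = line substr($0, 2); next }
        { if (line != "") check(line); line = $0 }
        END { if (line != "") check(line) }
    '
}

# Example use against a resource BDII (host and port are site-specific assumptions):
#   ldapsearch -x -LLL -H ldap://grid1.csl.ee.upatras.gr:2170 \
#       -b mds-vo-name=resource,o=grid | find_default_ert
```

The wrapped dn lines are joined before matching, so a dn that ldapsearch has folded over two lines is printed whole.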
gStat does not show gLite CE in the 'Service Check' Section
Everything is in the title
gStat counts double CPUs
gStat is unaware that the two CEs share the same LRMS, and therefore the same Worker Nodes, so it counts the CPUs twice.
BLParserPBS should be a service
The manual says to run BLParserPBS. This is actually a daemon which never returns, so it should have a service wrapper in /etc/rc.d so that it is restarted at boot time.
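Until such a wrapper is provided, something along the following lines could be used. This is a rough sketch under stated assumptions, not an official gLite script: the binary path /opt/glite/bin/BLParserPBS, the (empty) argument list and the pidfile location are guesses that must be adapted to the local installation, and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch of an init-style wrapper for BLParserPBS. The binary path, arguments
# and pidfile location below are assumptions; adjust them to your installation.
DAEMON="${BLPARSER_BIN:-/opt/glite/bin/BLParserPBS}"
DAEMON_ARGS="${BLPARSER_ARGS:-}"
PIDFILE="${BLPARSER_PIDFILE:-/var/run/blparserpbs.pid}"

blparser_status() {
    # Running means: the pidfile exists and the recorded process is alive.
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

blparser_start() {
    blparser_status && return 0
    # The daemon never returns, so push it to the background and record its PID.
    # DAEMON_ARGS is left unquoted on purpose so that it splits into words.
    nohup "$DAEMON" $DAEMON_ARGS >/dev/null 2>&1 &
    echo $! > "$PIDFILE"
}

blparser_stop() {
    blparser_status && kill "$(cat "$PIDFILE")"
    rm -f "$PIDFILE"
}

case "${1:-}" in
    start)   blparser_start ;;
    stop)    blparser_stop ;;
    restart) blparser_stop; blparser_start ;;
    status)  if blparser_status; then echo "running"; else echo "stopped"; fi ;;
    "")      : ;;   # no argument: only define the functions (e.g. when sourced)
    *)       echo "Usage: $0 {start|stop|restart|status}" >&2; exit 1 ;;
esac
```

Saved under /etc/rc.d/init.d (the file name is up to you), it could then be hooked into the boot sequence in the usual way for your distribution.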
CSH problems
This error appears (at least) on UIs and WNs. When there is no classic SE, the YAIM default is left as
CLASSIC_HOST="classic SE host"
When this happens, for each supported VO a line like this (for dteam) ends up in the file /etc/profile.d/lcgenv.csh
setenv VO_DTEAM_DEFAULT_SE classic SE host
This is not correct syntax and causes CSH to fail to start. A question is, should this be corrected to DPM_HOST or to something else? Note that it also has to be moved a couple of lines below, since DPM_HOST is populated later.
setenv VO_DTEAM_DEFAULT_SE $DPM_HOST
The same error is also in file /etc/profile.d/lcgenv.sh but this does not cause the shell to break.
export VO_DTEAM_DEFAULT_SE=classic SE host
Again, I am not sure if the correction is appropriate. Note that it also has to be moved a couple of lines below, since DPM_HOST is populated later.
export VO_DTEAM_DEFAULT_SE=$DPM_HOST
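The reason the Bourne-shell version goes unnoticed is probably that the line is still syntactically valid: export assigns only the first word and treats the remaining words as further variable names. The short demonstration below is a sketch for illustration only; it uses nothing beyond the line shown above and plain sh.

```shell
#!/bin/sh
# Each case runs in its own child shell so the current environment stays clean.
unquoted=$(sh -c 'export VO_DTEAM_DEFAULT_SE=classic SE host; printf "%s" "$VO_DTEAM_DEFAULT_SE"')
quoted=$(sh -c 'export VO_DTEAM_DEFAULT_SE="classic SE host"; printf "%s" "$VO_DTEAM_DEFAULT_SE"')

echo "unquoted: [$unquoted]"   # only the first word is assigned
echo "quoted:   [$quoted]"     # the whole placeholder string is assigned
```

So the shell starts, but VO_DTEAM_DEFAULT_SE silently ends up as "classic", which is no more usable than the placeholder itself.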
WMS: Invalid proxy cache
I get the following e-mail from the WMS:
Subject: Cron <root@grid25> . /etc/glite/profile.d/glite_setenv.sh ; $GLITE_LOCATION/bin/glite_wms_wmproxy_purge_proxycache $GLITE_LOCATION_VAR/proxycache > $GLITE_LOCATION_LOG/glite_wms_wmproxy_purge_proxycache.log
Error - invalid proxy cache path (invalid directory (/var/glite/proxycache): boost::filesystem::is_directory: "/var/glite/proxycache": No such file or directory)
The directory referred to here does not exist. From a quick look at YAIM, I think that it creates $GLITE_LOCATION_TMP/proxycache and not $GLITE_LOCATION_VAR/proxycache.
See also [savannah #16959]
Usage of CLASSIC_HOST
A host of settings are dependent on yaim parameter $CLASSIC_HOST, which defaults to CLASSIC_HOST="classic SE host".
I am not sure how safe it is to set this in YAIM: CLASSIC_HOST=$DPM_HOST
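Whatever value is finally chosen, the leftover placeholder can at least be caught before configure_node is run. The check below is a hypothetical sketch, not part of YAIM: the function name check_host_vars and the way it is called are assumptions made for illustration, and it only tests for embedded spaces.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of YAIM): warn about host variables
# whose value contains a space, such as the "classic SE host" placeholder.
check_host_vars() {
    rc=0
    for pair in "$@"; do
        name=${pair%%=*}
        value=${pair#*=}
        case "$value" in
            *" "*) echo "WARNING: $name='$value' is not a valid hostname" >&2; rc=1 ;;
        esac
    done
    return $rc
}

# Example: after sourcing site-info.def, pass the variables to be checked.
#   . /path/to/site-info.def
#   check_host_vars "CLASSIC_HOST=$CLASSIC_HOST" "DPM_HOST=$DPM_HOST"
```

Empty values are accepted on purpose, since variables such as DCACHE_ADMIN are legitimately left empty when the service is not installed.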
A small percentage of jobs fail with Job proxy is expired
In job storms, 1-5 jobs out of 100 fail. Job status returns the message Job proxy is expired.
Logging info for these jobs returns the following messages:
Got a job held event, reason: Globus error 131: the user proxy expired (job is still running)
Job got an error while in the CondorG queue.
Ben Gurion University Notes
AEGIS01-PHY-SCL Notes
YAIM's config_lcgenv fix
Before running configure_node, it is necessary to fix YAIM's config_lcgenv function, as explained in several bugs in savannah, e.g. http://savannah.cern.ch/bugs/?func=detailitem&item_id=16895
HG-06-EKT notes
Generally the installation of LCG-CE, MON, SE_dcache and WN_torque nodes was completed without any problems. However, in the site-info.def file that was given as an example there are a couple of points that need attention:
- The LCG_REPOSITORY parameter points to the LCG-2.7.0 repository. Obviously it should be changed to
LCG_REPOSITORY="'rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/ rhel30 externals Release3.0 updates'"
- The various VO_DTEAM_DEFAULT_SE, VO_LHCB_DEFAULT_SE etc. parameters are defined as
VO_DTEAM_DEFAULT_SE=$CLASSIC_HOST
That is OK if you have a Classic SE, but if you have a DPM or a dCache SE they should be changed to $DPM_HOST or to $DCACHE_ADMIN. Otherwise you might have trouble with /etc/profile.d/lcgenv.{sh,csh} and with the value published for GlueCEInfoDefaultSE by your site GIIS.
Summary
Even though the final job storm proved to be 80% successful (the remaining 20% failed because of expired proxies), the tweaking needed to make this configuration scheme work is far too complex and unreliable to deploy in a production environment. Thus, in SEE ROC we decided to upgrade all our clusters to the LCG flavour only of gLite 3.0 and to proceed with the deployment of additional gLite CEs as soon as they are stable enough.
