SG GLITE-3_0_1 Guide
This guide is based on the experiences gained by AEGIS01-PHY-SCL admins at the Scientific Computing Laboratory of the Institute of Physics in Belgrade, http://scl.phy.bg.ac.yu/
All files used in this guide can be found on http://glite.phy.bg.ac.yu/GLITE-3_0_1/
Real configuration files for AEGIS01-PHY-SCL site can be found on http://glite.phy.bg.ac.yu/GLITE-3_0_1/AEGIS/
SEE-GRID adjusted yaim files can be found on http://glite.phy.bg.ac.yu/GLITE-3_0_1/yaim/
The official upgrade and installation guide can be found on http://lcg.web.cern.ch/LCG/Sites/releases.html
A very useful set of notes and experiences compiled by the PPS of EGEE-SEE ROC can be found here: http://wiki.egee-see.org/index.php/GLite30
All SEE-GRID sites involved in the first wave of upgrade should provide their feedback and experiences (SEE-GRID Wiki, see-grid-gim), thus enabling us to plan further SEE-GRID wide deployment of gLite MW.
Suggested upgrade path for now is to install and configure LCG flavour of all services. After we gain experience with the installation of gliteCE and glite WMSLB (glite flavour of RB) on volunteer sites, we will plan upgrade of production SEE-GRID RBs to the glite flavour, and migration of SEE-GRID CEs to gliteCE version on all sites. Clear instructions will be provided for site administrators when this is decided.
0) SCL scripts for easier administration
A small set of scripts developed at AEGIS01-PHY-SCL, enabling
a) providing public/private key based ssh authentication among all nodes
b) scp of file from one node to all other nodes (or to all WNs, or to all nodes except for WNs)
c) execution of the issued command on all nodes (or on all WNs, or on all nodes except for WNs)
can be found on http://glite.phy.bg.ac.yu/GLITE-3_0_1/scl-scripts.tgz
Read README file for more details.
1) Backup
Please back up your site-info.def, wn-list.conf, users.conf, and groups.conf files if you are keeping them in the /opt/lcg/yaim/examples/ directory, since they will be deleted by the yaim upgrade. With the glite version of yaim, the default location of all yaim files is /opt/glite/yaim. Feel free to back up all other conf files you have customized. Take care not to copy them over the upgraded files afterwards; instead, diff them and change the appropriate values, since even the format may be different!
2) Scheduled downtime
Before starting the upgrade, you should declare downtime and disable all of your queues. To declare downtime, you should send an e-mail to see-grid-gim@see-grid.eu and enter appropriate information into the HGSM (SEE-GRID GOCDB), available on https://hgsm.grid.org.tr/
To disable the queues, execute as root on your CE:
qdisable seegrid dteam
Add here also other queues you may have configured. Your site will not accept any further jobs. However, you may have some running jobs. You should wait until all of them are executed, and only then start with the upgrade.
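One way to check whether the running jobs have drained is to count the jobs in state R in the output of qstat. The following is only a sketch: it assumes the default Torque qstat listing, in which the job state letter is the fifth column, and the function name count_running_jobs is hypothetical.

```shell
# Count jobs in state R (running) in a qstat listing read from stdin.
# Assumption: default Torque "qstat" layout, one line per job, with the
# state letter in the fifth whitespace-separated column.
count_running_jobs() {
    awk '$5 == "R" { n++ } END { print n + 0 }'
}
```

On the CE this could be used as `qstat | count_running_jobs`; the upgrade can start once it prints 0.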
3) OS upgrade and configuration
You are encouraged to upgrade your OS as well to the latest version. If you are using Scientific Linux < 3.0.7, you can upgrade using the following procedure: change /etc/apt/sources.list.d/sl.list on all of your nodes so that it contains
rpm ftp://ftp.scientificlinux.org/linux/scientific/ 307/i386/apt-rpm os updates contrib
and remove repositories for previous SL versions (except for 3.0.5, since 3.0.7 is just a slight modification, and these two releases share some of RPMs; if you don't have 3.0.5 repository in the above file, add it manually - just replace 307 with 305). After that, execute
apt-get update
apt-get dist-upgrade
apt-get upgrade-kernel
Be sure to manually update /etc/lsb-release on all of your nodes (just replace 3.0.5 with 3.0.7), since SL 3.0.7 repository contains redhat-lsb-1.3-3.1.SL.305.i386, not the 3.0.7 version. Do not forget to reboot all nodes if kernel upgrade is performed!
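The manual edit of /etc/lsb-release can be scripted, for example as sketched below. The function name bump_lsb_release is hypothetical, and the sketch simply replaces every occurrence of 3.0.5 with 3.0.7 while keeping a backup copy next to the file.

```shell
# Replace 3.0.5 with 3.0.7 in an lsb-release style file, keeping a backup.
# $1: path to the file (on a real node: /etc/lsb-release).
bump_lsb_release() {
    cp -p "$1" "$1.bak" &&
    sed 's/3\.0\.5/3.0.7/g' "$1.bak" > "$1"
}
```

As root on each node: `bump_lsb_release /etc/lsb-release`. The previous content stays in /etc/lsb-release.bak.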
In order to ensure automatic publication of installed OS, make sure that the following plugin is installed on your CE (assuming the same OS on WNs, which actually should be published): http://glite.phy.bg.ac.yu/GLITE-3_0_1/lcg-info-dynamic-os-generic-1.0-1.noarch.rpm
Check if ntp is working on all nodes (chkconfig ntpd on). Example of configuration file /etc/ntp.conf can be found on http://glite.phy.bg.ac.yu/GLITE-3_0_1/ntp.conf
However, be sure to create /etc/ntp.drift and /etc/ntp.drift.TEMP files and to change their ownership to ntp.ntp on all nodes:
touch /etc/ntp.drift
touch /etc/ntp.drift.TEMP
chown ntp.ntp /etc/ntp.drift
chown ntp.ntp /etc/ntp.drift.TEMP
It is also suggested to shut down all services on each node that are not needed, and to remove them from the corresponding runlevel.
Command hostname should return FQDN on all nodes. Be sure that /etc/hosts contains
127.0.0.1 localhost.localdomain localhost
and IP number-FQDN pairs for all nodes (i.e. FQDN is not allowed to be associated with 127.0.0.1).
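The /etc/hosts requirement can be checked with a small helper such as the one sketched below. The function name fqdn_on_loopback is hypothetical; it succeeds when the FQDN is (wrongly) listed on a 127.0.0.1 line.

```shell
# Succeed (exit 0) if the given FQDN appears on a 127.0.0.1 line of a hosts file.
# $1: hosts file (normally /etc/hosts), $2: FQDN (normally "$(hostname)").
fqdn_on_loopback() {
    awk -v fqdn="$2" '
        { sub(/#.*/, "") }                      # drop comments
        $1 == "127.0.0.1" {
            for (i = 2; i <= NF; i++) if ($i == fqdn) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

For example: `fqdn_on_loopback /etc/hosts "$(hostname)" && echo "fix /etc/hosts"`.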
4) Filesystems
All WNs must have shared application software filesystem where VO SGMs (software grid managers) will install VO-specific software. Please take a look at http://wiki.egee-see.org/index.php/SEE-GRID_ESM_Software_Installation_Guide for more details, including the permissions of directories.
In all template files provided at http://glite.phy.bg.ac.yu/GLITE-3_0_1/ it is assumed that this directory is /storage/exp_soft, and that it is mounted through NFS from SE to CE and all WNs. Please take care, since this is not the default in yaim example files. Relevant site-info.def variable is VO_SW_DIR.
The easiest way to make this shared directory available on all WNs (and on the CE, if you are mounting it from the SE, as we do) is to allow export of VO_SW_DIR in /etc/exports on the machine exporting it, and to include it in /etc/fstab on all machines that will mount it. This is probably the simplest approach, since all VOs will eventually require this (even if they do not currently). Do not forget to ensure that the relevant nfs services are started automatically on the exporting machine. Example line for /etc/exports on the exporting machine, allowing export of /storage/exp_soft to all machines on the network 147.91.84.*:
/storage/exp_soft 147.91.84.0/255.255.255.0(rw,sync,no_root_squash)
Example line for /etc/fstab on the machine that will mount /storage/exp_soft over nfs from se.phy.bg.ac.yu:
se.phy.bg.ac.yu:/storage/exp_soft /storage/exp_soft nfs defaults 0 0
Be sure to create /storage/exp_soft on machine that will be mounting it!
The same procedure can be applied to /var/cache/apt/archives (if you want to avoid multiple downloads of update RPMs by apt-get) and for /home (if you support MPI).
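To verify on each mounting machine that the /etc/fstab entry is in place, a check along the following lines can be used. This is a sketch only; the function name nfs_entry_present is hypothetical.

```shell
# Succeed if an fstab-style file has an NFS entry for the given mount point.
# $1: fstab file (normally /etc/fstab), $2: mount point (e.g. /storage/exp_soft).
nfs_entry_present() {
    awk -v mp="$2" '
        /^[[:space:]]*#/ { next }               # skip comment lines
        $2 == mp && $3 == "nfs" { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

For example: `nfs_entry_present /etc/fstab /storage/exp_soft || echo "missing fstab entry"`.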
5) Latest YAIM installation
Download and install latest yaim version on all of your nodes:
wget http://grid-deployment.web.cern.ch/grid-deployment/gis/yaim/glite-yaim-latest.rpm
rpm -Uvh glite-yaim-latest.rpm
Since newer version can be available from updates repository, put the following into /etc/apt/sources.list.d/glite.list file (in one line):
rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/ rhel30 externals Release3.0 updates
and execute
apt-get update
apt-get install glite-yaim
You should also apply SEE-GRID customizations. AFTER installing the latest yaim on all nodes, replace
/opt/glite/yaim/scripts/node-info.def with http://glite.phy.bg.ac.yu/GLITE-3_0_1/yaim/node-info.def
Other customizations should be put in /opt/glite/yaim/functions/local directory:
http://glite.phy.bg.ac.yu/GLITE-3_0_1/yaim/config_gip
http://glite.phy.bg.ac.yu/GLITE-3_0_1/yaim/config_fmon_client
http://glite.phy.bg.ac.yu/GLITE-3_0_1/yaim/config_vomses
The customizations provide gridice agent configuration on WNs (node-info.def and config_fmon_client), ntp monitoring by gridice on all nodes (config_fmon_client), as well as more meaningful information published by the information system, using newly introduced variables (SITE_SUPPORT, SITE_SECURITY, PHYSICAL_CPUS, LOGICAL_CPUS) that need to be defined in site-info.def (already present in the SEE-GRID templates). The last file (config_vomses) fixes a known bug:
http://savannah.cern.ch/bugs/?func=detailitem&item_id=17287
You can diff the customized and original files and change them further to suit your needs. Be aware that you need to replace the node-info.def file after any yaim upgrade is applied on your site. Be aware also that you need to do this on all of your nodes!
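Since the customizations have to be re-applied on every node after each yaim upgrade, it may help to generate the download commands once and review them before running. The sketch below only prints wget commands for the four files listed above; the function name print_customization_commands is hypothetical, and /opt/glite/yaim is assumed as the yaim location unless another directory is passed.

```shell
# Print (not run) the wget commands that fetch the SEE-GRID yaim customizations.
# $1: yaim installation directory (assumed default: /opt/glite/yaim).
print_customization_commands() {
    base="http://glite.phy.bg.ac.yu/GLITE-3_0_1/yaim"
    yaim="${1:-/opt/glite/yaim}"
    echo "wget -O $yaim/scripts/node-info.def $base/node-info.def"
    for f in config_gip config_fmon_client config_vomses; do
        echo "wget -O $yaim/functions/local/$f $base/$f"
    done
}
```

After reviewing the output, the commands can be executed as root with `print_customization_commands | sh`.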
6) Preparation of configuration files
You need to prepare the following configuration files:
site-info.def
users.conf
wn-list.conf
groups.conf
However, they can be named any way you like. It is suggested to put them into selected directory (say, /root/yaim), and not to keep them in /opt/glite/yaim/examples. The templates for all of these files can be found on http://glite.phy.bg.ac.yu/GLITE-3_0_1/
The first one (site-info-SEEGRID.def is the name of the template) should be adjusted using the instructions and comments in it. It should contain proper paths to all other configuration files. Our production conf files can be found on http://glite.phy.bg.ac.yu/GLITE-3_0_1/AEGIS/
The second one (users-SEEGRID.def is the name of the template) can be created using the following script: http://glite.phy.bg.ac.yu/GLITE-3_0_1/generate-pool-accounts-AEGIS
GLITE-3_0_0 by default uses large number of pool accounts (200 per VO), as well as some additional accounts (software grid manager - sgm, production - prd), the same as LCG-2_7_0. If you did not adjust the pool accounts earlier, or want to rearrange UIDs, the suggested solution is to remove all pool accounts and groups definitions from /etc/passwd, /etc/shadow and /etc/group, and to remove all of their home directories. All pool accounts information will be created and entered again by yaim. The above script (generate-pool-accounts-AEGIS) can be used like this:
./generate-pool-accounts-AEGIS 20000 seegrid 2000 > users-SEEGRID.conf
./generate-pool-accounts-AEGIS 21000 dteam 2100 >> users-SEEGRID.conf
Please note >> in the second line! This will generate definitions for pool accounts for seegrid in the UID range 20001-20200, seegridprd will have UID 20201, and seegridsgm will have UID 20202, while GID will be 2000 for all these accounts, and similarly for dteam (UID range 21001-21200, dteamprd UID 21201, dteamsgm UID 21202, GID 2100 for all dteam accounts). Please ensure that UID and GID ranges used here are not already occupied! This is why removing all pool account users is suggested. Our production conf files can be found on http://glite.phy.bg.ac.yu/GLITE-3_0_1/AEGIS/
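Whether a UID or GID range is already occupied can be checked before generating the accounts, for example with the sketch below. The function name uids_in_range is hypothetical. It works on any colon-separated file with the numeric ID in the third field, which is the case for both /etc/passwd and /etc/group.

```shell
# List entries of a passwd- or group-style file whose numeric ID (third field)
# lies in the closed range [lo, hi]. Empty output means the range is free.
# $1: file (e.g. /etc/passwd or /etc/group), $2: lowest ID, $3: highest ID.
uids_in_range() {
    awk -F: -v lo="$2" -v hi="$3" \
        '$3 + 0 >= lo + 0 && $3 + 0 <= hi + 0 { print $1 ":" $3 }' "$1"
}
```

For the seegrid example above: `uids_in_range /etc/passwd 20001 20202` and `uids_in_range /etc/group 2000 2000`.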
Third configuration file (wn-list-SEEGRID.def is the name of the template) is the same as in LCG-2_7_0, so you can use the old one. Our production conf files can be found on http://glite.phy.bg.ac.yu/GLITE-3_0_1/AEGIS/
Fourth configuration file (groups-SEEGRID.def is the name of the template) is the new one. The template contains definitions for SEEGRID and DTEAM VO. More definitions (if needed) can be created by hand (mutatis mutandis, i.e. replacing what should be replaced). Our production conf files can be found on http://glite.phy.bg.ac.yu/GLITE-3_0_1/AEGIS/
For sites supporting AEGIS VO
AEGIS (Academic and Educational Grid Initiative of Serbia) created its VO, and all Serbian sites should support it. In order to do so, you should enter aegis among your queues (variable VOS in site-info.def), and the following among VO specific definitions at the end of site-info.def:
VO_AEGIS_SW_DIR=$VO_SW_DIR/aegis
VO_AEGIS_DEFAULT_SE=$CLASSIC_HOST
VO_AEGIS_STORAGE_DIR=$CLASSIC_STORAGE_DIR/aegis
VO_AEGIS_QUEUES="aegis"
VO_AEGIS_VOMS_SERVERS="vomss://se.phy.bg.ac.yu:8443/voms/aegis?/aegis"
VO_AEGIS_VOMSES="aegis se.phy.bg.ac.yu 15001 /DC=ORG/DC=SEEGRID/O=Hosts/O=Institute of Physics Belgrade/CN=host/se.phy.bg.ac.yu aegis"
Note that VO_AEGIS_VOMSES should be entered in one long line. Ensure that YAIM's config_vomses function on all of your nodes is replaced by this one:
http://glite.phy.bg.ac.yu/GLITE-3_0_1/yaim/config_vomses
In addition, ensure that you have relevant section in your groups.conf and users.conf; those sections can be copied from our files on http://glite.phy.bg.ac.yu/GLITE-3_0_1/AEGIS/
7) install_node
It is necessary to execute the following command on all nodes
/opt/glite/yaim/scripts/install_node <path-to-site-info.def> <node_function> [<another_node_function> ...]
The names of all possible node functions are available on the official upgrade page. However, most SEE-GRID sites will use the following (grouped in the same way as previously, in LCG-2_7_0): lcg-CE_torque (previously CE_torque), glite-SE_classic (previously SE_classic), glite-MON (previously MON), glite-UI (previously UI), glite-WN (previously WN_torque). Please note that in order to install a combined LCG/gLite WN with the torque client (as you probably want), you should also add glite-torque-client-config to the glite-WN node type when doing install_node:
/opt/glite/yaim/scripts/install_node <path-to-site-info.def> glite-WN glite-torque-client-config
RPM dependency problems with existing OS packages are to be resolved by removing unnecessary OS packages. Do not hesitate to run install_node several times, until you are sure that everything is properly installed. After the rpm installation is complete, make sure that there are no RPM dependency issues. If you encounter problems, submit a ticket in SEE-GRID Helpdesk, or try asking for help from the see-grid-gim mailing list.
8) configure_node
There is no particular order in which nodes should be configured, but it is suggested first to upgrade CE (containing site BDII), and then all other nodes. Of course, you should configure this node (and all other ones) using all of its functions at the same time, i.e.
/opt/glite/yaim/scripts/configure_node <path-to-site-info.def> CE_torque [<another_node_function> ...]
After CE you should do configure_node on all other nodes. Please, bear in mind that you need to use CE_torque and WN_torque node functions in order to include batch system components on CE and on WNs. The names of LCG-flavoured components are the same as before (i.e. CE_torque, WN_torque, SE_classic, MON, UI, etc.).
If you encounter java problems when configuring MON node, be sure to remove jpackage-utils package.
For sites supporting AEGIS VO
The following should be done on your CE, SE, UI, and RB.
After configure_node, please verify that /opt/edg/etc/vomses/ and /opt/glite/etc/vomses/ contain configuration file for AEGIS VO. If not, save the following file http://glite.phy.bg.ac.yu/GLITE-3_0_1/AEGIS/aegis-se.phy.bg.ac.yu in /opt/edg/etc/vomses/ and /opt/glite/etc/vomses/.
Save host certificate of AEGIS VOMS server to /etc/grid-security/vomsdir/. Host certificate is available on http://glite.phy.bg.ac.yu/GLITE-3_0_1/AEGIS/se.phy.bg.ac.yu.pem
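The presence of the AEGIS vomses file in both directories can be verified with a helper like the one sketched below. The function name missing_vomses is hypothetical.

```shell
# Print each directory that lacks the named vomses file; print nothing if all have it.
# $1: file name (e.g. aegis-se.phy.bg.ac.yu), remaining arguments: directories to check.
missing_vomses() {
    name="$1"; shift
    for dir in "$@"; do
        [ -f "$dir/$name" ] || echo "$dir"
    done
}
```

For example: `missing_vomses aegis-se.phy.bg.ac.yu /opt/edg/etc/vomses /opt/glite/etc/vomses`. Any directory printed still needs the file saved into it.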
9) Postconfiguration
New SEEGRID VOMS RPM should be installed on all nodes requiring it (CE, SE, RB, UI). You can find it here: http://glite.phy.bg.ac.yu/GLITE-3_0_1/seegrid-0.2-1.noarch.rpm
It is also advisable to fine-tune your queues, setting qmgr limits for each VO. Template script is available on http://glite.phy.bg.ac.yu/GLITE-3_0_1/tune-queues-SEEGRID while our production script is available on http://glite.phy.bg.ac.yu/GLITE-3_0_1/AEGIS/tune-queues-AEGIS
To fine-tune your WNs, adjust /var/spool/pbs/mom_priv/config file and restart pbs_mom on all WNs. Template file is available on http://glite.phy.bg.ac.yu/GLITE-3_0_1/mom_priv-config-SEEGRID while our production mom_priv/config file is available on http://glite.phy.bg.ac.yu/GLITE-3_0_1/AEGIS/mom_priv-config-AEGIS
Verify that /var/spool/maui/maui.cfg on your CE contains the following line
ADMIN3 edginfo rgma
If not, add it and restart maui: /etc/init.d/maui restart
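The maui.cfg check can be automated roughly as follows. This is a sketch with a hypothetical function name, admin3_ok; it accepts any ADMIN3 line that names both edginfo and rgma, even if other users are listed on the same line.

```shell
# Succeed if a maui.cfg-style file has an ADMIN3 line naming both edginfo and rgma.
# $1: configuration file (on the CE: /var/spool/maui/maui.cfg).
admin3_ok() {
    awk '
        $1 == "ADMIN3" {
            e = 0; r = 0
            for (i = 2; i <= NF; i++) {
                if ($i == "edginfo") e = 1
                if ($i == "rgma") r = 1
            }
            if (e && r) ok = 1
        }
        END { exit (ok ? 0 : 1) }
    ' "$1"
}
```

For example: `admin3_ok /var/spool/maui/maui.cfg || echo "ADMIN3 line missing"`.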
If you are deploying a myproxy server (PX node type), please change in /etc/init.d/myproxy the line containing
VERBOSE=--verbose
to
VERBOSE=
and restart myproxy.
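The same change can be made non-interactively, as in the sketch below. The function name quiet_myproxy is hypothetical; the sketch assumes the VERBOSE=--verbose assignment stands alone on its line, as shown above, and keeps a backup of the original script.

```shell
# Clear the VERBOSE setting in a myproxy init script, keeping a backup copy.
# $1: path to the script (on a PX node: /etc/init.d/myproxy).
quiet_myproxy() {
    cp -p "$1" "$1.bak" &&
    sed 's/^\([[:space:]]*\)VERBOSE=--verbose[[:space:]]*$/\1VERBOSE=/' "$1.bak" > "$1"
}
```

As root on the PX node: `quiet_myproxy /etc/init.d/myproxy`, followed by the restart of myproxy.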
10) Testing
This should be it. Verify that all queues are correctly configured and enabled:
qmgr -c "print server"
You can enable or disable queues using the qenable and qdisable commands.
Test if your site correctly publishes data using SEE-GRID GStat available on http://goc.grid.sinica.edu.tw/gstat/seegrid/
However, be patient - the status published on the GStat page is delayed by about 15 minutes, so changes will not be reflected immediately. You can also query your site BDII and verify the correctness of the data by issuing the following command:
ldapsearch -x -H ldap://<CE FQDN>:2170 -b mds-vo-name=<SITE-NAME>,o=grid
where <CE FQDN> should be replaced by your CE hostname, and <SITE-NAME> by the name of your site as entered in site-info.def and in the SEE-GRID top-level BDIIs (example: AEGIS01-PHY-SCL for our site).
Verify that your CE and SE are visible in the SEE-GRID BDII:
lcg-infosites --vo seegrid ce
lcg-infosites --vo seegrid se
Try to submit simple test jobs to your site and to retrieve the output.
If you encounter problems, submit a ticket in SEE-GRID Helpdesk, or contact see-grid-gim@see-grid.eu mailing list; other SEE-GRID GIMs will probably be able to help you.
