SG LCG-2 7 0 Guide
From EGEE-see WIki
This guide is based on the experiences gained by AEGIS01-PHY-SCL admins at the Scientific Computing Laboratory of the Institute of Physics in Belgrade, http://scl.phy.bg.ac.yu/
All files used in this guide can be found on http://lcg.phy.bg.ac.yu/LCG-2_7_0/
Real configuration files for AEGIS01-PHY-SCL site can be found on http://lcg.phy.bg.ac.yu/LCG-2_7_0/AEGIS/
SEE-GRID adjusted yaim files can be found on http://lcg.phy.bg.ac.yu/LCG-2_7_0/yaim/
The official upgrade and installation guide can be found on http://lcg.web.cern.ch/LCG/Sites/releases.html
0) SCL scripts for easier administration
A small set of scripts developed at AEGIS01-PHY-SCL can be found on http://lcg.phy.bg.ac.yu/LCG-2_7_0/scl-scripts.tgz
The scripts provide:
a) public/private key based ssh authentication among all nodes
b) scp of a file from one node to all other nodes (or to all WNs, or to all nodes except for WNs)
c) execution of a given command on all nodes (or on all WNs, or on all nodes except for WNs)
Read the README file for more details.
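The bundled scripts are not reproduced here. Purely as an illustration of item c), a loop of this kind could be sketched as follows. The function name run_on_all, the DRY_RUN variable and the one-hostname-per-line node list are assumptions made for this example and are not taken from scl-scripts.tgz; it also assumes that key-based ssh as root (item a) is already working.

```shell
# Hypothetical helper: run a command on every host listed in a file
# (one hostname per line; blank lines and lines starting with # are skipped).
# With DRY_RUN=1 the ssh commands are only printed, not executed.
run_on_all() {
    nodes_file=$1
    shift
    while read -r node; do
        case $node in
            ''|'#'*) continue ;;
        esac
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh root@$node $*"
        else
            # -n keeps ssh from consuming the rest of the node list on stdin
            ssh -n "root@$node" "$@"
        fi
    done < "$nodes_file"
}
```

For example, `DRY_RUN=1 run_on_all /root/yaim/wn-list.conf uptime` would only print the ssh commands, which is a cheap way to check the node list before running anything for real.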
1) Backup
Please back up your site-info.def, wn-list.conf, and users.conf files if you keep them in the /opt/lcg/yaim/examples/ directory, since they can be overwritten by a yaim upgrade. This should not happen with the latest version of yaim, but it is better to be on the safe side. Feel free to back up any other conf files you have customized. Do not simply copy the backups over the upgraded files afterwards; instead, diff them and carry over the appropriate values, since even the file format may have changed!
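The backup step above could be scripted along these lines. This is only a sketch: the function name backup_yaim_conf and the destination directory used in the usage note are assumptions, not part of yaim.

```shell
# Hypothetical helper: copy the yaim configuration files into a timestamped
# backup directory. Files that do not exist are skipped with a warning.
# Prints the path of the backup directory on success.
backup_yaim_conf() {
    src_dir=$1      # e.g. /opt/lcg/yaim/examples
    dest_root=$2    # e.g. /root/yaim-backup (assumed location)
    dest="$dest_root/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || return 1
    for f in site-info.def wn-list.conf users.conf; do
        if [ -f "$src_dir/$f" ]; then
            cp -p "$src_dir/$f" "$dest/" || return 1
        else
            echo "warning: $src_dir/$f not found, skipped" >&2
        fi
    done
    echo "$dest"
}
```

After the upgrade, something like `diff -u <backup-dir>/site-info.def /opt/lcg/yaim/examples/site-info.def` shows which values need to be carried over by hand.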
2) Scheduled downtime
Before starting the upgrade, you should declare downtime and disable all of your queues. To declare downtime, you can send an e-mail to see-grid-gim@see-grid.org; later SEE-GRID GOCDB will be used for this. To disable the queues, execute as root on your CE:
qdisable seegrid dteam
Add to this command any other queues you may have configured. After that, your site will not accept any further jobs. However, you may still have some running jobs. You should wait until all of them have finished, and only then start the upgrade.
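To see whether any jobs are still active, the output of qstat can be counted by job state. The following is a minimal sketch; the function name count_jobs_in_state is made up for this example, and it assumes the default Torque qstat layout (two header lines, job state in the next-to-last column).

```shell
# Hypothetical helper: read default "qstat" output on stdin and print the
# number of jobs in the given state (R = running, Q = queued).
# Assumes two header lines and the state in the next-to-last column.
count_jobs_in_state() {
    state=$1
    awk -v s="$state" 'NR > 2 && NF >= 2 && $(NF-1) == s { n++ } END { print n + 0 }'
}
```

On the CE this would be used as `qstat | count_jobs_in_state R`. It is probably wise to check the Q state as well, since jobs that were queued before qdisable may still be started.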
3) OS upgrade
You are encouraged to upgrade your OS as well to the latest version. If you are using Scientific Linux < 3.0.5, you can upgrade using the following procedure: change /etc/apt/sources.list.d/sl.list on all of your nodes so that it contains
rpm ftp://ftp.scientificlinux.org/linux/scientific/ 305/i386/apt-rpm os updates contrib
and remove repositories for previous SL versions. After that, execute
apt-get update
apt-get dist-upgrade
apt-get upgrade-kernel
4) Latest YAIM installation
Download and install the latest yaim version on all of your nodes:
wget http://grid-deployment.web.cern.ch/grid-deployment/gis/yaim/lcg-yaim-latest.rpm
rpm -Uvh lcg-yaim-latest.rpm
You should also apply SEE-GRID customizations. AFTER installing the latest yaim on all nodes, replace
/opt/lcg/yaim/scripts/node-info.def with http://lcg.phy.bg.ac.yu/LCG-2_7_0/yaim/node-info.def
/opt/lcg/yaim/functions/config_gip with http://lcg.phy.bg.ac.yu/LCG-2_7_0/yaim/config_gip
/opt/lcg/yaim/functions/config_fmon_client with http://lcg.phy.bg.ac.yu/LCG-2_7_0/yaim/config_fmon_client
The customizations provide gridice agent configuration on WNs (node-info.def and config_fmon_client), ntp monitoring by gridice on all nodes (config_fmon_client), and more meaningful information published by the information system (config_gip), using newly introduced variables (SITE_SUPPORT, SITE_SECURITY, PHYSICAL_CPUS, LOGICAL_CPUS) that need to be defined in site-info.def (they are already present in the SEE-GRID templates). You can diff the customized and original files and change them further to suit your needs. Be aware that you need to replace these files again after every yaim upgrade applied on your site, and that you need to replace them on all of your nodes!
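Since these files have to be replaced again after every yaim upgrade, it may help to keep the original next to the customized version so the two can be diffed. A minimal sketch, assuming the customized file has already been downloaded; the function name replace_with_backup and the .orig suffix are choices made for this example.

```shell
# Hypothetical helper: replace a yaim file with a customized copy.
# If the target exists and differs from the customized file, the current
# target is first saved as <target>.orig. If the customized version is
# already in place, nothing is changed.
replace_with_backup() {
    custom=$1   # customized file, already downloaded
    target=$2   # file to be replaced, e.g. /opt/lcg/yaim/functions/config_gip
    [ -f "$custom" ] || { echo "missing $custom" >&2; return 1; }
    if [ -f "$target" ]; then
        cmp -s "$custom" "$target" && return 0
        cp -p "$target" "$target.orig" || return 1
    fi
    cp "$custom" "$target"
}
```

Afterwards, `diff -u /opt/lcg/yaim/functions/config_gip.orig /opt/lcg/yaim/functions/config_gip` shows exactly what the customization changes.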
5) Preparation of configuration files
You need to prepare the following configuration files:
site-info.def
users.conf
wn-list.conf
groups.conf
However, they can be named any way you like. It is suggested to put them into a dedicated directory (say, /root/yaim) and not to keep them in /opt/lcg/yaim/examples. The templates for all of these files can be found on http://lcg.phy.bg.ac.yu/LCG-2_7_0/
The first one (site-info-SEEGRID.def is the name of the template) should be adjusted using the instructions and comments in it. It should contain proper paths to all other configuration files. Our production conf files can be found on http://lcg.phy.bg.ac.yu/LCG-2_7_0/AEGIS/
The second one (users-SEEGRID.def is the name of the template) can be created using the following script: http://lcg.phy.bg.ac.yu/LCG-2_7_0/generate-pool-accounts-AEGIS
LCG-2_7_0 by default uses a larger number of pool accounts (200 per VO), as well as additional special accounts per VO (software grid manager, sgm, and production, prd; previously there was just the sgm account). The suggested solution is to remove all existing pool account and group definitions from /etc/passwd, /etc/shadow and /etc/group, and to remove all of their home directories. All of these will be recreated by yaim. The above script (generate-pool-accounts-AEGIS) can be used like this:
./generate-pool-accounts-AEGIS 20000 seegrid 2000 > users-SEEGRID.conf
./generate-pool-accounts-AEGIS 21000 dteam 2100 >> users-SEEGRID.conf
Please note the >> in the second line, which appends to the file instead of overwriting it. This will generate definitions for seegrid pool accounts in the UID range 20001-20200, seegridprd will have UID 20201, and seegridsgm will have UID 20202, while the GID will be 2000 for all these accounts. The same holds for dteam (UID range 21001-21200, dteamprd UID 21201, dteamsgm UID 21202, GID 2100 for all dteam accounts). Please ensure that the UID and GID ranges used here are not already occupied! This is why removing all existing pool account users is suggested. Our production conf files can be found on http://lcg.phy.bg.ac.yu/LCG-2_7_0/AEGIS/
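Whether a UID range is still free can be checked before running yaim. The sketch below is not part of generate-pool-accounts-AEGIS; the function name uid_range_is_free is an assumption, and it only looks at a local passwd-format file (accounts coming from NIS or LDAP would need a different check).

```shell
# Hypothetical helper: list accounts in a passwd-format file whose numeric
# ID (third colon-separated field) falls inside [first, last].
# Prints "name:id" for every hit; exit status is 0 when the range is free
# and 1 when at least one ID is already taken.
uid_range_is_free() {
    passwd_file=$1
    first=$2
    last=$3
    awk -F: -v lo="$first" -v hi="$last" '
        $3 + 0 >= lo + 0 && $3 + 0 <= hi + 0 { print $1 ":" $3; hits++ }
        END { exit (hits > 0) }
    ' "$passwd_file"
}
```

For the ranges used above this would be `uid_range_is_free /etc/passwd 20001 20202` and `uid_range_is_free /etc/passwd 21001 21202`. Because the GID is also the third field in /etc/group, the same function can be pointed at that file, e.g. `uid_range_is_free /etc/group 2000 2000`.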
The third configuration file (wn-list-SEEGRID.def is the name of the template) is the same as in LCG-2_6_0, so you can use the old one. Our production conf files can be found on http://lcg.phy.bg.ac.yu/LCG-2_7_0/AEGIS/
The fourth configuration file (groups-SEEGRID.def is the name of the template) is new in this release. The template contains definitions for the SEEGRID and DTEAM VOs. More definitions (if needed) can be created by hand, following the existing entries and replacing the VO-specific values. Our production conf files can be found on http://lcg.phy.bg.ac.yu/LCG-2_7_0/AEGIS/
For sites supporting AEGIS VO
AEGIS (Academic and Educational Grid Initiative of Serbia) created its VO, and all Serbian sites should support it. In order to do so, you should enter aegis among your queues (variable VOS in site-info.def), and the following among VO specific definitions at the end of site-info.def:
VO_AEGIS_SW_DIR=$VO_SW_DIR/aegis
VO_AEGIS_DEFAULT_SE=$CLASSIC_HOST
VO_AEGIS_STORAGE_DIR=$CLASSIC_STORAGE_DIR/aegis
VO_AEGIS_QUEUES="aegis"
VO_AEGIS_USERS=vomss://se.phy.bg.ac.yu:8443/voms/aegis?/aegis
VO_AEGIS_SGM=vomss://se.phy.bg.ac.yu:8443/voms/aegis?/aegis/Role=VO-Admin
VO_AEGIS_VOMS_SERVERS="vomss://se.phy.bg.ac.yu:8443/voms/aegis?/aegis"
VO_AEGIS_VOMSES="aegis se.phy.bg.ac.yu 15001 /DC=ORG/DC=SEEGRID/O=Hosts/O=Institute of Physics Belgrade/CN=host/se.phy.bg.ac.yu aegis"
Note that the VO_AEGIS_VOMSES definition must be on a single line in site-info.def, even if it appears wrapped onto two lines above.
In addition, ensure that you have relevant section in your groups.conf and users.conf; those sections can be copied from our files on http://lcg.phy.bg.ac.yu/LCG-2_7_0/AEGIS/
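A wrapped VO_AEGIS_VOMSES line is easy to introduce when copying from a web page. The check below is a minimal sketch for catching it; the function name quoted_on_one_line is made up for this example, and it only verifies that the double-quoted value opens and closes on the same line.

```shell
# Hypothetical check: succeed only if the given variable is assigned a
# double-quoted value that opens and closes on the same line of the file.
quoted_on_one_line() {
    var=$1
    file=$2
    grep -Eq "^${var}=\"[^\"]*\"[[:space:]]*\$" "$file"
}
```

For example: `quoted_on_one_line VO_AEGIS_VOMSES /root/yaim/site-info.def || echo "VO_AEGIS_VOMSES is missing or wrapped"`.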
6) install_node
It is strongly suggested to execute the following command on all nodes:
/opt/lcg/yaim/scripts/install_node <path-to-site-info.def> <node_function> [<another_node_function> ...]
The names of all possible node functions are available on the official upgrade page. However, most SEE-GRID sites will use the following (grouped in the same way as previously, in LCG-2_6_0): CE_torque, SE_classic, MON, UI, WN_torque.
Do not hesitate to run install_node several times, until you are sure that everything is properly installed. After the rpm installation is complete, make sure that there are no RPM dependency issues.
7) configure_node
As suggested by the official upgrade guide, you should first configure MON box, i.e. the node that will (in SEE-GRID, usually together with other functions) be responsible for publishing accounting data. Of course, you should configure this node using all of its functions at the same time, i.e.
/opt/lcg/yaim/scripts/configure_node <path-to-site-info.def> MON [<another_node_function> ...]
After the MON box (i.e. the node that has this function among others), you should run configure_node on all other nodes (preferably on the CE before the WNs). Please bear in mind that you need to use the CE_torque and WN_torque node functions.
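The ordering described above can be derived mechanically from a list of nodes and their functions. This is only a sketch: the function name configure_order and the input format ("hostname function [function ...]" per line) are assumptions, and placing nodes such as SE or UI between the CE and the WNs is a choice made here, not something the official guide prescribes.

```shell
# Hypothetical helper: read "hostname function [function ...]" lines on
# stdin and print the hostnames in a configuration order that puts the
# MON node first, then the CE, then other nodes, and WN-only nodes last.
configure_order() {
    awk '{
        prio = 2                       # default for SE, UI, RB and similar
        for (i = 2; i <= NF; i++) {
            if ($i == "MON") { prio = 0; break }
            if ($i ~ /^CE/) prio = 1
            else if ($i ~ /^WN/ && prio == 2) prio = 3
        }
        print prio, NR, $1             # NR keeps the input order within a group
    }' | sort -k1,1n -k2,2n | awk '{ print $3 }'
}
```

The resulting host list can then be walked through one node at a time, running configure_node with all functions of that node.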
For sites supporting AEGIS VO
The following should be done for your CE, SE, UI, and RB.
After configure_node, please verify that /opt/edg/etc/vomses/ contains a configuration file for the AEGIS VO. If not, save the file http://lcg.phy.bg.ac.yu/LCG-2_7_0/AEGIS/aegis-se.phy.bg.ac.yu in /opt/edg/etc/vomses/.
Save the host certificate of the AEGIS VOMS server to /etc/grid-security/vomsdir/. The host certificate is available on http://lcg.phy.bg.ac.yu/LCG-2_7_0/AEGIS/se.phy.bg.ac.yu.pem
8) Postconfiguration
The new SEEGRID VOMS RPM should be installed on all nodes requiring it (CE, SE, RB, UI). You can find it here: http://lcg.phy.bg.ac.yu/LCG-2_7_0/seegrid-0.2-1.noarch.rpm
It is also advisable to fine-tune your queues, setting qmgr limits for each VO. A template script is available on http://lcg.phy.bg.ac.yu/LCG-2_7_0/tune-queues-SEEGRID while our production script is available on http://lcg.phy.bg.ac.yu/LCG-2_7_0/AEGIS/tune-queues-AEGIS
To fine-tune your WNs, adjust the /var/spool/pbs/mom_priv/config file and restart pbs_mom on all WNs. You may restart pbs_server on the CE as well after that. A template file is available on http://lcg.phy.bg.ac.yu/LCG-2_7_0/mom_priv-config-SEEGRID while our production mom_priv/config file is available on http://lcg.phy.bg.ac.yu/LCG-2_7_0/AEGIS/mom_priv-config-AEGIS
Verify that /var/spool/maui/maui.cfg on your CE contains the following line
ADMIN3 edginfo rgma
If not, add it and restart maui: /etc/init.d/maui restart
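The maui.cfg check can be made repeatable with a small helper. This is a sketch only: the function name ensure_maui_admin3 is an assumption, and it handles just the exact line given above. If your maui.cfg already has an ADMIN3 line listing other users, edit that line by hand instead.

```shell
# Hypothetical helper: make sure the line "ADMIN3 edginfo rgma" is present
# in the given maui.cfg. Prints "present" if it was already there and
# "added" if it had to be appended (in which case maui needs a restart).
ensure_maui_admin3() {
    cfg=$1
    if grep -Eq '^[[:space:]]*ADMIN3[[:space:]]+edginfo[[:space:]]+rgma[[:space:]]*$' "$cfg"; then
        echo present
    else
        echo 'ADMIN3 edginfo rgma' >> "$cfg" || return 1
        echo added
    fi
}
```

On the CE: `[ "$(ensure_maui_admin3 /var/spool/maui/maui.cfg)" = added ] && /etc/init.d/maui restart`.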
9) Testing
This should be it. Verify that all queues are correctly configured and enabled:
qmgr -c "print server"
You can enable or disable queues using the qenable and qdisable commands.
Test if your site correctly publishes data using SEE-GRID GStat available on http://goc.grid.sinica.edu.tw/gstat/seegrid/
However, be patient: the status published on the GStat page is delayed by about 15 minutes, so changes will not be reflected immediately. You can also query your site BDII and verify the correctness of the data by issuing the following command:
ldapsearch -x -H ldap://<CE FQDN>:2170 -b mds-vo-name=<SITE-NAME>,o=grid
where <CE FQDN> should be replaced by your CE hostname, and <SITE-NAME> by the name of your site as entered in site-info.def and in the SEE-GRID top-level BDIIs (example: AEGIS01-PHY-SCL for our site).
Verify that your CE and SE are visible in the SEE-GRID BDII:
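The ldapsearch output is long, so a quick sanity check is to count the CE entries it contains. The sketch below assumes that CE entries are published with a DN starting with GlueCEUniqueID=, as in the Glue 1.x schema; the function name count_published_ces is made up for this example.

```shell
# Hypothetical helper: read LDIF (ldapsearch output) on stdin and print the
# number of entries whose DN starts with GlueCEUniqueID=.
count_published_ces() {
    # grep -c exits non-zero when the count is 0, so that status is masked
    grep -c '^dn: GlueCEUniqueID=' || true
}
```

Piping the ldapsearch command above into `count_published_ces` should normally give one entry per configured queue; a result of 0 suggests the CE information is not being published.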
lcg-infosites --vo seegrid ce
lcg-infosites --vo seegrid se
Try to submit simple test jobs to your site and to retrieve the output.
If you encounter problems, contact the see-grid-gim@see-grid.org mailing list; other SEE-GRID GIMs will probably be able to help you.
10) Important hack for sites with shared homes
For sites that support MPI and for this reason have shared home directories on WNs, the following hack can be useful. If the majority of jobs arriving at your site are actually single-CPU (serial) jobs, you will get better performance if they use the local file system on WNs. In order to achieve this, create a local scratch directory on ALL of your WNs, e.g.
mkdir /scratch
chmod ugo=rwx /scratch
and define shell variable TMPDIR pointing to it, e.g. add the following to /etc/sysconfig/lcg on ALL of your WNs:
TMPDIR=/scratch
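Since this has to be done on every WN, the two steps can be wrapped so that they are safe to run more than once. A minimal sketch; the function name setup_scratch is an assumption, and it leaves an existing TMPDIR line in the sysconfig file untouched.

```shell
# Hypothetical helper: create the scratch directory with rwx permissions
# for everybody and add TMPDIR=<dir> to the given sysconfig file, unless a
# TMPDIR line is already there.
setup_scratch() {
    scratch_dir=$1   # e.g. /scratch
    sysconfig=$2     # e.g. /etc/sysconfig/lcg
    mkdir -p "$scratch_dir" && chmod ugo=rwx "$scratch_dir" || return 1
    if ! grep -q '^TMPDIR=' "$sysconfig" 2>/dev/null; then
        echo "TMPDIR=$scratch_dir" >> "$sysconfig"
    fi
}
```

On each WN this would be called as `setup_scratch /scratch /etc/sysconfig/lcg`.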
After that, replace /opt/globus/lib/perl/Globus/GRAM/JobManager/pbs.pm with http://lcg.phy.bg.ac.yu/LCG-2_7_0/AEGIS/pbs.pm
You are advised to carefully inspect the diff between the original and the modified pbs.pm file. The modifications in this file will NOT affect MPI jobs or jobs requiring multiple CPUs in any way. They only move single-CPU jobs to /scratch after they arrive on a WN, thus avoiding the use of the shared home directory in this case.
