Installation and configuration guide for MonALISA
From EGEE-see Wiki
Installing MonALISA service
The MonALISA Service distribution can be downloaded from its official site, http://monalisa.cacr.caltech.edu, as a tar.gz archive.
There are four files in this archive:
* README: contains installation information;
* install.sh: the installation script;
* install2.sh: auxiliary script that is run by the "install.sh" script;
* MonaLisa.v1.2.tar.gz: the MonALISA monitoring service main distribution.
Requirements
1. Java
To run the MonALISA Service you need Java 1.4.2 or higher installed.
2. Connectivity Requirements
MonALISA needs outbound TCP connectivity to a few sites. The service should be able to connect to the hosts and ports listed below.
Lookup Discovery Services (LUSs): TCP ports 4160, 8765 and 80
* monalisa.cern.ch
* monalisa.cacr.caltech.edu
Port 8765 is used for lease renewal and service discovery. When the service is started, or when network problems make it lose its registration with the LUSs, it must also be able to reach port 4160. These should be very short-lived TCP connections, used only to bootstrap the registration mechanism.
Proxy Services: TCP ports 6001, 6002, 6003
* monalisa.cern.ch
* monalisa2.cern.ch
* monalisa.caltech.edu
* monalisa.cacr.caltech.edu
The TCP connection to port 6001 is long-lived and used for communication between ML services and proxy services. Very short-lived TCP connections are also needed to ports 6002 and 6003 whenever an ML service first discovers the proxy service.
Web Servers: TCP port 80
* monalisa.cern.ch
* monalisa.cacr.caltech.edu
These are used to autoupdate the service. Please check (for example with telnet) that you can reach the above ports from the machine running the MonALISA service.
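The reachability check can be scripted. The following is only a sketch: it assumes the nc (netcat) tool is installed and supports the -z and -w options, and the function names list_targets and check_targets are made up for this example. The hosts and ports are the ones listed in this section.

```shell
#!/bin/sh
# Hosts and ports copied from the lists in this section; adjust if your site differs.
LUS_HOSTS="monalisa.cern.ch monalisa.cacr.caltech.edu"
LUS_PORTS="4160 8765 80"
PROXY_HOSTS="monalisa.cern.ch monalisa2.cern.ch monalisa.caltech.edu monalisa.cacr.caltech.edu"
PROXY_PORTS="6001 6002 6003"

# Print one "host port" pair per line for every combination of
# the hosts in $1 and the ports in $2 (both space-separated).
list_targets() {
    for host in $1; do
        for port in $2; do
            printf '%s %s\n' "$host" "$port"
        done
    done
}

# Read "host port" pairs from stdin, try a TCP connection to each one
# with nc (assumption: nc supports -z and -w) and report the result.
# Returns non-zero if at least one target could not be reached.
check_targets() {
    failed=0
    while read -r host port; do
        if nc -z -w 5 "$host" "$port" </dev/null >/dev/null 2>&1; then
            echo "OK   $host:$port"
        else
            echo "FAIL $host:$port"
            failed=1
        fi
    done
    return $failed
}

# Usage (opens real network connections, so it is left commented out):
#   { list_targets "$LUS_HOSTS" "$LUS_PORTS"
#     list_targets "$PROXY_HOSTS" "$PROXY_PORTS"; } | check_targets
```

Port 80 on the web servers is covered by the LUS list, because the same two hosts appear there with port 80.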
Installation process
The service can be installed and given a basic configuration automatically by running the install.sh script and following its step-by-step procedure.
Automatic installation and minimal configuration
To install the service automatically, run the "install.sh" script. The MonaLisa Service MUST be run by a non-privileged user, so the installation script first checks whether the user running it is root. If the user is not root, the "install2.sh" script is executed.
If the user is root, the script asks for the account under which the MonaLisa service will run. If the specified account does not exist, the script attempts to create it. It then copies install2.sh and MonaLisa.v1.2.tar.gz to that user's ~/monalisa_install and starts install2.sh from there. After install2.sh finishes, ~/monalisa_install is deleted.
The "install2.sh" script asks for the destination folder, unpacks the MonaLisa.v1.2.tar.gz file and asks for the farm configuration options. One of the first choices you have to make is the farm name, which defaults to the short hostname of the computer. You also have to enter the latitude and longitude of your site so that the client can correctly locate and show your service. You can find an approximate value with http://geotags.com/.
The user is also asked for the path to Java, the name of a contact person (the administrator of the service) and that person's e-mail address.
If the destination folder already contains an older MonaLisa installation, all the configuration files are kept unchanged. In fact the CMD, TEST and myFarm folders remain untouched, which is a problem if the existing installation is an older version of MonaLisa. So if you want to upgrade from an older MonaLisa installation, please choose another destination folder and then check that the newly generated configuration is correct.
The directory structure is as follows:
- Service - monitoring service directory structure
  - CMD - this directory contains a set of scripts used to manage the execution of the MonALISA service, as well as a file where the user can set global environment variables:
    - the ML_SER script is used to start, stop or restart the Monitoring Service.
    - the MLD script is used to start the Monitoring Service from init.d.
    - the CHECK_UPDATE script is used to automatically update the Monitoring Service.
    - in the ml_env file you have to set some environment variables. This is the only file in this directory that can be modified.
  - SSecurity - this directory contains the farm keystore (FarmMonitor.ks), as well as scripts for exporting, importing or creating new certificates (exportCert, importCert and genKey).
  - lib - contains the libraries needed by the service.
  - ml_dl - contains the dl jar: classes that have to be loaded remotely by users of the service.
  - myFarm - this directory contains files for configuring your monitoring service. You can rename it with the name of your service.
  - usr_code - this directory contains the sources of some module examples. Users can develop their own monitoring and gathering modules and load them in the monitoring service.
- bin - contains tools used by the scripts from Service/CMD.
- policy - contains policy files for running the MonALISA service.
- util - contains various simple programs, their sources and scripts for getting useful information:
  - ShowReceivedValues - a short example program that queries a farm database and gets results from it. It takes a configuration file as its argument, like Service/TEST/ml.properties.
  - ShowStoreConfig - a short example program that parses a farm configuration file (such as Service/TEST/ml.properties) given as its argument and shows the structure of the database tables used.
  - SimpleClient - an example program that finds the MonALISA farm services registered in the reggie services from the locators given in the locators.conf file and shows their attributes.
  - SimpleDBShell - a set of useful scripts for getting information from a mysql farm database. To use these scripts you must first edit the mysql_console.sh script, set the variables in it and delete the following lines:
      echo "Please edit mysql_console.sh first" > /dev/stderr
      exit
    All the other scripts use "mysql_console.sh". Use them to get information from the service database tables.
  - TestClient - a simple client that registers with predicates at the farm service and shows the information it receives.
After the installation completes you can check the configuration by reviewing the following files:
ML_SER - starts, stops or restarts the service.
MLD - to be put in init.d so you can control the service.
CHECK_UPDATE - should be put in crontab to automatically update MonALISA.
For advanced configuration of the service environment and of the monitoring parameters edit the following files:
ML_HOME/Service/CMD/ml_env
ML_HOME/Service/CMD/site_env
ML_HOME/Service/myFarm/ml.properties
ML_HOME/Service/myFarm/myFarm.conf
Configuration
There are three configuration files that the user can modify to specify the farm service environment and characteristics: a global configuration file ($MonaLisa_HOME/Service/CMD/ml_env), and two others used by MonALISA itself ($MonaLisa_HOME/Service/<YOUR_FARM_DIRECTORY>/ml.properties and $MonaLisa_HOME/Service/<YOUR_FARM_DIRECTORY>/<YOUR_FARM_CONF_FILE>.conf).
Global configuration file
The file for the global configuration is $MonaLisa_HOME/Service/CMD/ml_env. The variables that the user has to set or can set are:
- MONALISA_USER - the name of the user that runs the service. The service will not start from any other account or from the root account.
- JAVA_HOME - the path to your current JDK.
- SHOULD_UPDATE - whether or not MonALISA should check for updates when it is started. If this parameter is "true" when MonALISA is started, it first checks for updates and then starts. If set to "false" it does not check for updates. This parameter is also used to check for autoupdates while the service is running. Please see the section “How to start a Monitoring Service with Autoupdate” in this user guide.
- MonaLisa_HOME - path to your MonALISA installation directory. Environment variables can also be used (e.g. ${HOME}/MonaLisa).
- FARM_HOME - path to the directory where your farm-specific files reside. It is better to place this directory in the Service directory. You can use the variable MonaLisa_HOME defined above (e.g. ${MonaLisa_HOME}/Service/MyTest). MonALISA comes with a simple example in ${MonaLisa_HOME}/Service/myFarm.
- FARM_CONF_FILE - the file used at the startup of the services to define the clusters, nodes and monitor modules to be used. It should be in the ${FARM_HOME} directory (e.g. FARM_CONF_FILE="${FARM_HOME}/mytest.conf").
- FARM_NAME - the name of your farm (e.g. FARM_NAME="MyTest"). Please use a short name that describes the SITE on which you are running MonALISA.
- JAVA_OPTS - an optional variable for passing parameters directly to the Java Virtual Machine (e.g. JAVA_OPTS="-Xmx128m").
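Put together, the variables might look like the fragment below. This is an illustrative sketch, not the file shipped with MonALISA: the account name, the JDK path and the MyTest names are example values, and the real ml_env may contain further lines (for instance export statements) that are not shown here.

```shell
# Example values only; replace them with the ones for your site.
MONALISA_USER="monalisa"                    # hypothetical non-privileged account
JAVA_HOME="/usr/java/j2sdk1.4.2"            # hypothetical JDK location
SHOULD_UPDATE="true"                        # check for updates at startup
MonaLisa_HOME="${HOME}/MonaLisa"
FARM_HOME="${MonaLisa_HOME}/Service/MyTest"
FARM_CONF_FILE="${FARM_HOME}/mytest.conf"
FARM_NAME="MyTest"
JAVA_OPTS="-Xmx128m"                        # optional JVM parameters
```

Because FARM_HOME and FARM_CONF_FILE are built from MonaLisa_HOME, moving the installation only requires changing that one variable.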
The MonALISA properties
The file $MonaLisa_HOME/Service/<YOUR_FARM_DIRECTORY>/ml.properties is specific for your farm configuration.
You can specify here:
- what lookup services to use (lia.Monitor.LUSs);
- the jini groups that your service should join (lia.Monitor.group);
- the location of the farm server (MonaLisa.LAT, MonaLisa.LONG, MonaLisa.Country);
- Web Services settings (lia.Monitor.startWSDL=true starts the MonALISA web service, lia.Monitor.wsdl_port);
- database configuration (lia.Monitor.keep_history sets how long to keep data in the farm database, parameters to configure the database tables, etc.);
- parameters for logging (.level, the logging level, which defaults to INFO, etc.).
You will find explanations before every field for setting it correctly.
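To see which value a property currently has, a small helper such as the one below can be used. It is a sketch under simplifying assumptions: get_prop is a made-up name, the file is treated as plain key=value lines, and Java properties features such as line continuations or ':' separators are not handled. The dots in the key are also matched loosely, as regular expression wildcards.

```shell
#!/bin/sh
# Print the value of property $1 from file $2.
# Commented-out lines are ignored; if the key appears more than once,
# the last occurrence is printed.
get_prop() {
    key=$1
    file=$2
    grep "^[[:space:]]*$key[[:space:]]*=" "$file" \
        | tail -n 1 \
        | sed 's/^[^=]*=[[:space:]]*//'
}

# Usage:
#   get_prop lia.Monitor.group "$MonaLisa_HOME/Service/myFarm/ml.properties"
```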
The MonALISA Service Configuration
The MonALISA service is using a very simple configuration file to generate the site configuration and the modules to be used for collecting monitoring information. By using the administrative interface with SSL connection the user may dynamically change the configuration and modules used to collect data.
It is possible to use the built-in modules (for snmp, local or remote /proc files...) or external modules. We provide several modules which allow exchanging information with other monitoring tools. These modules are very simple and users can also develop their own modules.
Below we will present a simple configuration example. This file is the .conf file from your Service/<FARM_DIRECTORY> directory.
For a complete list of the available monitoring modules please refer to Monitoring Modules Base.
Service monitoring configuration
Example 1.1.
*Master
>citgrid3.cacr.caltech.edu citgrid3
monProcLoad%30
monProcStat%30
monProcIO%30
*ABPing{monABPing, citgrid3.cacr.caltech.edu, " "}
*PN_CIT
>c0-0
snmp_Load%30
snmp_IO%30
snmp_CPU%30
>c0-1
snmp_Load%30
snmp_IO%30
snmp_CPU%30
>c0-2
snmp_Load%30
snmp_IO%30
snmp_CPU%30
>c0-3
snmp_Load%30
snmp_IO%30
snmp_CPU%30
The first line (*Master) defines a Functional Unit (or Cluster). The second line (>citgrid3.cacr.caltech.edu citgrid3) adds a node in this Functional Unit class and optionally an alias. The lines:
monProcLoad%30
monProcIO%30
monProcStat%30
define three monitoring modules to be used on the node "citgrid3". These measurements are done periodically, every 30s. The monProc* modules use the local /proc files to collect information about the CPU, load and IO. In this case this is the master node of a cluster, where the MonALISA service is actually running, so simple modules based on the /proc files are used to collect data.
The line:
*ABPing{monABPing, citgrid3.cacr.caltech.edu, " "}
defines a Functional Unit named "ABPing" which uses an internal module, monABPing. This module is used to perform simple network measurements using small UDP packets. It requires as its first parameter the full name of the system corresponding to the real IP on which the ABPing server is running (as part of the MonALISA service). The second parameter is not used. These ABPing measurements are used to provide information about the quality of connectivity among different centers, as well as for dynamically computing optimal trees for connectivity (minimum spanning tree, minimum path from any node to all the others...).
*PN_CIT
defines a new cluster name. This is for a set of processing nodes used by the site. The string "PN" in the name is necessary if the user wants to automatically use filters to generate global views for all these processing units.
The section then lists the nodes in the cluster and, for each node, the modules used to get monitoring information from it. For each module a repetition time is defined (%30), which means that the module is executed once every 30s. Defining the repetition time is optional; the default value is 30s.
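For clusters with many identical nodes, a section like PN_CIT can be generated instead of typed by hand. The function below is a hypothetical helper (make_cluster is not part of MonALISA); it only prints text in the format shown in Example 1.1.

```shell
#!/bin/sh
# Print a cluster section: "*<cluster>", then every node followed by its modules.
# Arguments: cluster name, repetition time in seconds,
#            space-separated module names, then one or more node names.
make_cluster() {
    cluster=$1
    interval=$2
    modules=$3
    shift 3
    printf '*%s\n' "$cluster"
    for node in "$@"; do
        printf '>%s\n' "$node"
        for module in $modules; do
            printf '%s%%%s\n' "$module" "$interval"
        done
    done
}

# Reproduces the PN_CIT section of Example 1.1:
make_cluster PN_CIT 30 "snmp_Load snmp_IO snmp_CPU" c0-0 c0-1 c0-2 c0-3
```

The output can be appended to the farm .conf file after checking it.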
Database support configuration
The configuration options relevant to the storage are set in the FARMNAME/ml.properties file:
lia.Monitor.use_emysqldb=true|false
this will unpack the embedded mysql (if any)
lia.Monitor.use_epgsqldb=true|false
for the embedded postgresql
If none of these options is enabled then the following options are relevant for the database server selection:
lia.Monitor.jdbcDriverString= com.mckoi.JDBCDriver or com.mysql.jdbc.Driver or org.postgresql.Driver
McKoi is the default database if nothing else is available, but we don't recommend using it for storing large data structures. If you have a standalone database server you should disable the embedded databases and specify the mysql or postgresql driver here accordingly. The following parameters are the connection parameters for the JDBC driver.
lia.Monitor.ServerName=IP_ADDRESS
lia.Monitor.DatabasePort=TCP_PORT
lia.Monitor.DatabaseName=DB_NAME
lia.Monitor.UserName=DB_USERNAME
lia.Monitor.Pass=DB_PASSWORD
The actual database structure is determined by the following options:
lia.Monitor.Store.TransparentStoreFast.web_writes=N
This option specifies the number of tables that are used. For each X=0..N-1 you should have:
lia.Monitor.Store.TransparentStoreFast.writer_X.total_time=SECONDS
lia.Monitor.Store.TransparentStoreFast.writer_X.table_name=UNIQUE_NAME
lia.Monitor.Store.TransparentStoreFast.writer_X.writemode=MODE
lia.Monitor.Store.TransparentStoreFast.writer_X.samples=SAMPLES
lia.Monitor.Store.TransparentStoreFast.writer_X.descr=UNIQUE_STRING
SECONDS specifies the time period for which the data is stored in the database. Data older than now()-SECONDS is automatically deleted.
The "table_name" and "descr" values should be unique among the other options of the same kind. "table_name" must be a valid database table name (no spaces and so on); "descr" can be any string you like.
You can store data in either averaged or raw mode. When using an averaged mode, the SAMPLES value determines the number of values that are kept for the specified interval. For example, if you want to store a single value each minute for a year you should specify SECONDS=31536000 and SAMPLES=SECONDS/60=525600. This is applied separately for each parameter that you store, so such a database can become rather large.
MODE has these possible values:
0: averaged mode
the table structure will be
rectime | farm | cluster | node | function | mval | mmin | mmax
long | text | text | text | text | double | double | double
1: raw mode
same structure as 0
2: raw mode for storing abstract Object values
seldom used
3: averaged mode, data is only kept in memory
to control the maximum size of the in-memory buffer use:
lia.Monitor.Store.TransparentStoreFast.writer_X.countLimit
if set to -1 then only the time limit given by SECONDS is relevant
4: raw mode, in memory, same as 3 but without data averaging
5, 6: averaged / raw modes for an optimized table structure
each farm/cluster/node/function combination is given an unique ID, stored in monitor_ids table, and the database structure is now:
rectime | id | mval | mmin | mmax
this option is the best for large data sets with always-changing parameter names (for example netflow data acquisition)
7, 8: averaged / raw modes for another ID-related structure
for each unique ID a separate table is kept with the data from that series only, the table name will be UNIQUE_NAME_id and the structure is
rectime | mval | mmin | mmax
this option is the best one when the data series are constant in time, it works well with up to 10000 table names (10000 unique ids if you have a single table writer, 5000 unique ids if you define 2 separate writers and so on).
Important
Modes 7 and 8 only work with PostgreSQL because of some stored procedures needed to improve response times.
For a large data repository we would recommend using PostgreSQL with something like:
lia.Monitor.Store.TransparentStoreFast.web_writes = 2
lia.Monitor.Store.TransparentStoreFast.writer_0.total_time=31536000
lia.Monitor.Store.TransparentStoreFast.writer_0.samples=525600
lia.Monitor.Store.TransparentStoreFast.writer_0.table_name=monitor_1y_1min
lia.Monitor.Store.TransparentStoreFast.writer_0.descr=1y 1min
lia.Monitor.Store.TransparentStoreFast.writer_0.writemode=7
lia.Monitor.Store.TransparentStoreFast.writer_1.total_time=31536000
lia.Monitor.Store.TransparentStoreFast.writer_1.samples=5256
lia.Monitor.Store.TransparentStoreFast.writer_1.table_name=monitor_1y_100min
lia.Monitor.Store.TransparentStoreFast.writer_1.descr=1y 100min
lia.Monitor.Store.TransparentStoreFast.writer_1.writemode=7
We define 2 separate writers with different averaging intervals (1min and 100min) so the repository can use the proper one in different situations. For example, when plotting a 1-hour chart it will choose the 1min table, but if you plot a 6-month chart it will choose the 100min one, reducing the number of operations needed to plot that data. A single writer would limit either the data resolution or the response speed, while more than 2 writers add considerable overhead and extra disk usage without much benefit.
Whatever storage type you use, there is a memory buffer that is used in parallel with the disk storage (if any). Its size depends on the maximum JVM memory (-Xmx parameter) and is dynamically adjusted so that it does not use all of the available memory. When making a history query this is the first source of data; if more data is needed, a separate database query is executed to retrieve the remaining interval. In a repository you can see the current buffer status in http://......./info.jsp; look for something like:
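The writer settings can be derived from the storage period and the averaging interval, since SAMPLES is the period divided by the interval. The sketch below prints the same properties as the recommended configuration; writer_block is a hypothetical helper name, not something provided by MonALISA.

```shell
#!/bin/sh
# Print the properties of one writer.
# Arguments: writer index, total time (s), averaging interval (s),
#            table name, description, write mode.
writer_block() {
    prefix="lia.Monitor.Store.TransparentStoreFast.writer_$1"
    echo "$prefix.total_time=$2"
    echo "$prefix.samples=$(( $2 / $3 ))"   # SAMPLES = SECONDS / interval
    echo "$prefix.table_name=$4"
    echo "$prefix.descr=$5"
    echo "$prefix.writemode=$6"
}

echo "lia.Monitor.Store.TransparentStoreFast.web_writes = 2"
writer_block 0 31536000 60   monitor_1y_1min   "1y 1min"   7
writer_block 1 31536000 6000 monitor_1y_100min "1y 100min" 7
```

With a one-year period (31536000 s), a 60 s interval gives 525600 samples and a 6000 s (100 min) interval gives 5256, matching the values in the example configuration.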
Data cache: values: 252275/262144 (max 262144), time frame: 2:13:29, served requests: 16490
this tells you the number of values in the buffer, what period of time it holds and how many requests were served from this buffer.
How to setup the configuration files for your site
- Go to the "MonaLisa"/Service directory and create a directory for your site (e.g. MySite). You may copy the configuration files from one of the available site directories (e.g. those from the "MonaLisa"/Service/TEST directory). You must include the following files in your new farm directory: ml.properties, db.conf.embedded and my_test.conf.
- Edit the configuration file (my_site.conf) to reflect the environment you want to monitor.
- Edit ml.properties if you would like to change the Lookup Discovery Services that will be used or if you would like to use another DB System.
- You may add a myIcon.gif file with an icon of your organization in "MonaLisa"/Service/ml_dl.
The only script used to start/stop/restart "MonaLisa" is ML_SER from the Service/CMD directory. After you have done what is described in Section 3, “The MonALISA Service Configuration”, you can start using MonALISA:
Service/CMD/ML_SER start
How to start a Monitoring Service from init.d
Please set the MonaLisa_HOME and MONALISA_USER variables correctly in ${MonaLisa_HOME}/Service/CMD/MLD.
For Red Hat-like distributions
#cp ${MonaLisa_HOME}/Service/CMD/MLD /etc/init.d
#chkconfig --add MLD
#chkconfig --level 345 MLD on
For Debian
#cp ${MonaLisa_HOME}/Service/CMD/MLD /etc/init.d
#update-rc.d MLD start 80 3 4 5 .
#update-rc.d MLD stop 86 3 4 5 .
Connectivity requirements for Monitoring Service
The MonALISA service needs only outbound TCP connectivity to the following hosts:
LUS Servers: TCP ports 4160, 8765 and 8288
- monalisa.cern.ch
- monalisa.cacr.caltech.edu
Port 8765 is used for lease renewal and service discovery. When the service is started, or when network problems make it lose its registration with the LUSs, it must also be able to reach ports 4160 and 8288. These should be very short-lived TCP connections, used only to bootstrap the registration mechanism.
Proxy Servers: TCP ports 6001, 6002, 6003
- monalisa.cern.ch
- monalisa2.cern.ch
- monalisa.caltech.edu
- monalisa-ul.caltech.edu
- monalisa.cacr.caltech.edu
The TCP connection to port 6001 is long-lived and used for communication between ML services and proxy services. Very short-lived TCP connections are also needed to ports 6002 and 6003 whenever an ML service first discovers the proxy service.
Web Servers: TCP port 80
- monalisa.cern.ch
- monalisa.cacr.caltech.edu
- monalisa.caltech.edu
These are used to autoupdate the service. Before starting the service you should check that the above hosts/ports can be reached.
How to start a Monitoring Service with Autoupdate
This allows your Monitoring Service to be updated automatically. The cron script periodically checks for updates using a list of URLs. When a new version is published, the system checks its digital signature and then downloads the new distribution as a set of signed jar files. When this operation is completed, the MonALISA service restarts automatically. The dependencies and the configuration related to the service are handled in a way very similar to the Web Start technology.
This functionality makes it very easy to maintain and run a MonALISA service. We recommend using it!
In this case you should add "MonaLisa"/Service/CMD/CHECK_UPDATE to the crontab of the user that runs MonALISA. To edit your crontab use: $crontab -e
Add the following line:
*/20 * * * * /<path_to_your_MonaLisa>/Service/CMD/CHECK_UPDATE
This checks for updates every twenty minutes. A reasonable interval is twenty minutes or more. To check for updates every 30 minutes, add the following line instead of the one above.
*/30 * * * * /<path_to_your_MonaLisa>/Service/CMD/CHECK_UPDATE
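If you prefer not to edit the crontab interactively, the entry can be added from a script. The filter below is a sketch (with_check_update is a made-up name): it reads an existing crontab on standard input and appends the entry only when no line mentions CHECK_UPDATE yet, so running it twice does not create duplicates. It assumes a cron implementation whose crontab command accepts "-" to read the new table from standard input.

```shell
#!/bin/sh
# Read a crontab from stdin and print it unchanged, then add the entry
# given as $1 unless some line already mentions CHECK_UPDATE.
with_check_update() {
    awk -v entry="$1" '
        /CHECK_UPDATE/ { found = 1 }
        { print }
        END { if (!found) print entry }
    '
}

# Usage (replaces the crontab of the current user, so it is commented out):
#   crontab -l 2>/dev/null \
#     | with_check_update "*/20 * * * * /<path_to_your_MonaLisa>/Service/CMD/CHECK_UPDATE" \
#     | crontab -
```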
To disable autoupdate you can edit the ml_env file in "MonaLisa"/Service/CMD and set SHOULD_UPDATE="false". There is no need to remove the CHECK_UPDATE script from your crontab.
Launch "MonaLisa"/Service/CMD/ML_SER start. MonALISA should check for updates now.
Configure MonALISA with gLite
1) Service installation:
The MonALISA service should be installed on the MON node of your site (also the SE in many configurations) using a non-privileged account (e.g. monalisa).
MonALISA service download: http://monalisa.cern.ch/monalisa__Download__MonALISA_service.html
monalisa@mon:~$ tar xzvf MonaLisa.v<current_version>.tar.gz
monalisa@mon:~$ cd MonaLisa.v1.4 && ./install.sh
2) LCG VO-IO and VO-JOBS Modules Configuration
The LCG modules are available at http://monitor.seegrid.grid.pub.ro:8080/lcg-modules/
Directory structure:
/lcg-modules
Download and Installation:
monalisa@mon:~$ cd <MonALISA_HOME>/Service/usr_code
monalisa@mon:usr_code$ wget http://monitor.seegrid.grid.pub.ro:8080/lcg-modules/LCG-latest.tgz
monalisa@mon:usr_code$ tar -xzvf LCG-latest.tgz
You should rebuild the modules:
monalisa@mon:usr_code$ cd LCG && ./comp
Configuration:
a) Edit <MonALISA_HOME>/Service/myFarm/myFarm.conf and paste the following configuration snippet at the end of the file:
#[LCG modules] set <your_ce_host> to your CE hostname/ip
#Site configuration
*PN_PBS_LCG{monPN_LCG_PBS, localhost, GridDistribution=LCG2.4, RemoteHost=<your_ce_host>}%60
#--Jobs Monitoring Information--
# gathered from CE (remote)
# Note: You SHOULD define "MapFile" and "SiteInfoFile" parameters to point to the location of users.conf and site-info.def files
# The default path for these files is: /opt/lcg/yaim/examples
*LcgVO_JOBS_CE{monLcgVoJobs, localhost, GridDistribution=LCG2.4, RemoteHost=<your_ce_host>, JobManager=PBS, MapFile=<path.to.site_users.conf>, SiteInfoFile=<path.to.site.site-info.def>}%60
#--IO Traffic Monitoring--
# gathered from CE (remote)
# Note: You SHOULD define "MapFile" and "SiteInfoFile" parameters to point to the location of users.conf and site-info.def files
# The default path for these files is: /opt/lcg/yaim/examples
*LcgVO_IO_CE{monLcgVO_IO, localhost, GridDistribution=LCG2.4, RemoteHost=<your_ce_host>, RemoteFile=/var/log/globus-gridftp.log, MapFile=<path.to.site_users.conf>, SiteInfoFile=<path.to.site.site-info.def>}%180
#--Storage Element (local)--
# Note: You may need to add the "MapFile" and "SiteInfoFile" parameters to define the location of users.conf and site-info.def files
# MapFile=<path.to.site_users.conf>, SiteInfoFile=<path.to.site.site-info.def>
# The default path for these files is: /opt/lcg/yaim/examples
*LcgVO_IO_SE{monLcgVO_IO, localhost, GridDistribution=LCG2.4, FtpLog=/var/log/globus-gridftp.log, MapFile=<path.to.site_users.conf>, SiteInfoFile=<path.to.site.site-info.def>}%180
b) Edit <MonALISA_HOME>/Service/myFarm/ml.properties, find the Loading of Additional modules section and set the path to your LCG modules in order to be dynamically loaded:
lia.Monitor.CLASSURLs=file:${MonaLisa_HOME}/Service/usr_code/LCG/,file:${MonaLisa_HOME}/Service/usr_code/FilterExamples/ExLoadFilter/
c) Edit <MonALISA_HOME>/Service/myFarm/ml.properties, find the REGISTRATION section and set your group to "seegrid"
lia.Monitor.group=seegrid
d) Setup remote access from the MON node to the CE and SE nodes
The MonALISA modules require remote access to CE and SE nodes in order to collect the site information:
1. It is recommended to create a user named "monalisa" on both the CE and SE nodes.
2. On the MON (SE) node, install and run the MonALISA service under the "monalisa" user credentials.
3. Set up the CE node so that remote access via ssh between the SE and the CE works without a password challenge for the "monalisa" user. To do that you can use the "remote_access.sh" script, run for the CE and SE under the "monalisa" account.
NOTE: The following procedure may help you with step 2.d):
Create the "monalisa" account on the MON, CE and SE nodes:
MON:
root@MON# groupadd monalisa && useradd -g monalisa monalisa
root@MON# passwd monalisa
New Password: <PASS>
Re-enter ...: <PASS>
root@MON# id monalisa
//example output
uid=20947(monalisa) gid=2689(monalisa) groups=2689(monalisa)
CE:
//replace XXXX with the GID returned by the "id" command on the MON
root@CE# groupadd -g XXXX monalisa
//replace XXXXX with the UID returned by the "id" command on the MON
root@CE# useradd -u XXXXX -g monalisa monalisa
//same password as on the MON
root@CE# passwd monalisa
New Password: <PASS>
Re-enter ...: <PASS>
SE (optional, if the SE is not the same as the MON node):
//replace XXXX with the GID returned by the "id" command on the MON
root@SE# groupadd -g XXXX monalisa
//replace XXXXX with the UID returned by the "id" command on the MON
root@SE# useradd -u XXXXX -g monalisa monalisa
//same password as on the MON
root@SE# passwd monalisa
New Password: <PASS>
Re-enter ...: <PASS>
Back on the MON host:
Distribute the ssh "monalisa" public key from the MON node to the CE and SE nodes.
Download the remote access setup script from http://monitor.seegrid.grid.pub.ro:8080/lcg-modules/remote_access.sh
This script copies the SSH public key from the local system to .ssh/authorized_keys on the specified remote system. If the identity public key does not exist, the script generates a *fresh* key pair.
root@MON# su - monalisa
monalisa@MON$ wget http://monitor.seegrid.grid.pub.ro:8080/lcg-modules/remote_access.sh
monalisa@MON$ chmod 700 remote_access.sh
monalisa@MON$ ./remote_access.sh <CE_HOST>
monalisa@MON$ ./remote_access.sh <SE_HOST>
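Before starting the service it may be worth confirming that the key-based login really works. The check below is a sketch and not part of the LCG modules: can_login is a hypothetical helper that relies on the OpenSSH BatchMode option, which makes ssh fail instead of prompting for a password.

```shell
#!/bin/sh
# Return 0 if the "monalisa" user can run a command on host $1
# without being asked for a password, non-zero otherwise.
can_login() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "monalisa@$1" true \
        </dev/null >/dev/null 2>&1
}

# Usage, run as the monalisa user on the MON node
# (replace the placeholder host names with your CE and SE):
#   for host in ce.example.org se.example.org; do
#       if can_login "$host"; then
#           echo "$host: ok"
#       else
#           echo "$host: password still required or host unreachable"
#       fi
#   done
```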
e) Start/Restart MonALISA service:
root@MON:Service/CMD# ML_SER start
/doc
Complete documentation for LCG modules
For more information please check: http://monalisa.caltech.edu
SEE-GRID MonALISA Repository: http://monitor.seegrid.grid.ro:8080
Interaction of MonALISA service with PBS, Condor, LSF and Ganglia
Description
The PN modules offer monitoring information about the processing nodes from a cluster. The metrics provided are a subset of the Ganglia metrics (see section 2), but the information is obtained from a job manager running on the cluster instead of Ganglia. Currently the modules work with Condor, OpenPBS/Torque and LSF and the commands used to obtain the nodes' status are:
For Condor:
condor_status [-pool <server_name>] [-constraint <constraint_expr>] -l
For OpenPBS/Torque:
pbsnodes [-s <server_name>] -a
For LSF:
bhosts -l
lshosts
Results Provided by the Modules
The parameters provided by the PN_Condor and PN_PBS modules are:
PN_Condor/PN_PBS
|____node1
|____node2
| (parameters)
| |____NoCPUs
| |____VIRT_MEM_free
| |____MEM_total
| |____Load1
|......
|____nodeN
The parameters provided by the PN_LSF module are:
PN_LSF
|____node1
|____node2
| (parameters)
| |____NoCPUs
| |____MEM_free
| |____MEM_total
| |____SWAP_free
| |____SWAP_total
| |____Load1
| |____Load15
|......
|____nodeN
where:
* NoCPUs - the number of CPUs on the node
* MEM_free - the amount of free physical memory, in MB
* MEM_total - the total amount of physical memory, in MB
* SWAP_free - the amount of free swap memory, in MB
* SWAP_total - the total amount of swap memory, in MB
* Load1 - load average for 1 minute on the node
* Load15 - load average for 15 minutes on the node
If the modules are initialized with the Statistics argument, an additional cluster with statistical information about the number of nodes is provided:
For PBS:
PN_PBS_Statistics
|____Statistics
| (parameters)
| |____Total Nodes
| |____Total Available Nodes
| |____Total Free Nodes
| |____Total Down Nodes
|____...
where:
* Total Nodes - total number of nodes registered to the PBS server
* Total Available Nodes - total number of nodes that are currently communicating with the server
* Total Free Nodes - total number of nodes which are in the "free" state (can execute incoming jobs)
* Total Down Nodes - number of nodes whose state is unknown to the server
For Condor:
PN_Condor_Statistics
|____Statistics
| (parameters)
| |____Total Nodes
| |____Total Slots
| |____Total CPUs
| |____Total Available Slots
| |____Total Free Slots
| |____Total Owner Slots
|____...
where:
* Total Nodes - total number of nodes from the Condor pool (a multi-processor machine counts as a single node)
* Total Slots - total number of slots (virtual machines in Condor). For a multi-processor machine, separate virtual machines are usually created for each processor.
* Total CPUs - total number of CPUs (should be equal to the number of slots; if that is not the case, you should add the SlotsFactor argument to the module, see below)
* Total Available Slots - total number of slots that are available for Condor (i.e., the user is not executing his/her own jobs on them)
* Total Free Slots - total number of nodes which are in the "free" state (can execute incoming jobs)
* Total Owner Slots - number of slots in the "Owner" state (the user is executing his/her own jobs on them)
The Statistics cluster also contains a "Status" node which indicates the module's status (0 if it was executed correctly and non-zero if there was an error).
For LSF:
PN_LSF_Statistics
|____Statistics
| (parameters)
| |____Total Nodes
| |____Total Slots
| |____Total Free Slots
| |____Total Down Nodes
|____...
where:
* Total Nodes - total number of nodes
* Total Slots - total number of job slots
* Total Free Slots - total number of free job slots
* Total Down Nodes - number of down nodes (nodes for which LSF bhosts does not report the "ok" status)
Modules activation
In order to use these modules, you should have MonALISA 1.2.38 or newer.
If you have the OSG distribution and put the modules in the usr_code folder under MonaLisa/Service, you need to source two scripts:
. /OSG/setup.sh
. /OSG/MonaLisa/Service/CMD/ml_env
(replace "/OSG" with the path to your OSG directory)
To compile, just run the "comp" script from the modules' directory:
./comp
Compiling is only necessary if you use the version of the module from the usr_code/ directory.
To enable the modules you should add to the farm configuration file a line of the following form:
*<cluster_name>{moduleName, localhost, <arguments>}%<time_interval>
where:
* cluster_name - the cluster name for the results that this module produces (PN_Condor, PN_PBS or PN_LSF)
* moduleName - name of the module: monPN_Condor, monPN_PBS, monPN_LSF (or, if running from usr_code: PN_Condor, PN_PBS, PN_LSF)
* <arguments> - list of arguments. The arguments that may be passed to the modules are Statistics, Server, SlotsFactor and CondorConstraints (only for PN_Condor) and NodesLabel (only for PN_PBS).
If the Statistics argument appears in the list of arguments, the module will provide an additional "cluster" that contains statistics about the number of nodes in the cluster, as described above.
The Server argument indicates the name of PBS server / Condor central manager that will be queried. For example:
Server=lcfg.rogrid.pub.ro
is a valid entry for this parameter. If this argument is used for the PN_Condor module, the "condor_status" command will be run with the "-pool" option, and for the PN_PBS module the "pbsnodes" command will be run with the "-server" option. The "Server" argument is optional and can appear more than once in the list, to specify multiple servers from which information should be collected; if it doesn't appear, the PBS server / Condor central manager corresponding to the local machine will be used.
The "SlotsFactor" argument can be used for Condor, in order to display correctly the number of CPUs (the Total CPUs result), if it is different from the number of Condor slots. The number of CPUs will be calculated as the number of Condor slots times the SlotsFactor; for example, if you have 100 CPUs and 400 Condor slots, you should set "SlotsFactor = 0.25".
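The SlotsFactor arithmetic can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names are hypothetical and not part of MonALISA, and rounding the product to the nearest integer is an assumption.

```java
// Hypothetical helper, not part of MonALISA. It mirrors the rule stated in
// the text: Total CPUs = Total Slots x SlotsFactor.
final class SlotsFactorSketch {

    // Number of CPUs reported for a given slot count and factor.
    // Rounding to the nearest integer is an assumption of this sketch.
    static long totalCpus(int totalSlots, double slotsFactor) {
        return Math.round(totalSlots * slotsFactor);
    }

    // Factor to configure when the real CPU count is known.
    static double factorFor(int cpus, int slots) {
        return (double) cpus / slots;
    }

    public static void main(String[] args) {
        double factor = factorFor(100, 400);
        System.out.println("SlotsFactor = " + factor);
        System.out.println("Total CPUs  = " + totalCpus(400, factor));
    }
}
```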
CondorConstraints = <constraints> - with this argument you can specify a constraint expression that will be used with condor_status (for example, CondorConstraints = HasCheckpointing==TRUE). Multiple Condor constraints can be combined in an expression containing "&&", "||", etc. (for example: CondorConstraints = HasCheckpointing==TRUE&&TotalVirtualMachines<4). Using quoted strings in constraint expressions is a little more complicated, because each quote must itself be escaped with 3 backslashes: CondorConstraints = FileSystemDomain==\\\"cithep90.ultralight.org\\\"
NodesLabel = <label> - for PN_PBS, with this argument you can specify a property label; the module will create statistics only for the nodes that have this label. A colon must be placed at the beginning of the label string (e.g., NodesLabel=:mylabel).
Examples:
*PN_Condor{monPN_Condor, localhost}%120
Here, the PN_Condor module is used with the default settings. The information will be obtained from the local Condor central manager and no statistics about the number of nodes will be created.
*PN_Condor{monPN_Condor, localhost, Statistics}%240
In this example the module will provide statistical information about the number of nodes.
*PN_Condor{monPN_Condor, localhost, Server=pccil.cern.ch, Statistics}%80
Here, only information from the Condor manager running on pccil.cern.ch will be collected; statistical information about the number of nodes will also be provided.
*PN_Condor{monPN_Condor, localhost, Server=lcfg.rogrid.pub.ro, Server=wn1.rogrid.pub.ro}%180
In this example the module will provide information collected from the lcfg.rogrid.pub.ro and wn1.rogrid.pub.ro Condor managers.
*PN_Condor{monPN_Condor, localhost, Server=cithep90.ultralight.org, CondorConstraints = HasCheckpointing==TRUE}%80
In this example the module will provide information collected from the cithep90.ultralight.org Condor manager, restricted to the nodes that satisfy the condition HasCheckpointing==TRUE.
*PN_PBS{monPN_PBS, localhost}%120
In this example the PN_PBS module is used with the default settings. The information will be obtained from the local PBS server and no statistics about the number of nodes will be created.
*PN_PBS{monPN_PBS, localhost, Statistics}%120
In this example the module will provide statistical information about the number of nodes.
*PN_PBS{monPN_PBS, localhost, Server=pccil.cern.ch, Statistics}%90
Here, only information from the PBS server running on pccil.cern.ch will be collected; statistical information about the number of nodes will also be provided.
*PN_PBS{monPN_PBS, localhost, Statistics, Server=gw01.rogrid.pub.ro, server=lcfg.rogrid.pub.ro}%180
In this example, information is collected from the gw01.rogrid.pub.ro and lcfg.rogrid.pub.ro servers, and statistical data about the number of nodes is provided.
*PN_LSF{monPN_LSF, localhost, Statistics}%120
In this example the PN_LSF module will provide statistical information about the number of nodes.
Note: The verification of the parameter names for these modules is case insensitive (i.e., you can write "statistics" or "Statistics").
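To make the structure of these configuration lines concrete, here is a minimal Java sketch that splits a line into cluster name, module name, host, arguments and time interval. It is not the parser used by MonALISA; the class name and fields are hypothetical, and only the Statistics and Server arguments are interpreted (case-insensitively, as the note above describes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for lines of the form
//   *<cluster_name>{moduleName, host, <arguments>}%<time_interval>
// Hypothetical code, NOT the parser used by MonALISA.
final class FarmLineSketch {
    private static final Pattern LINE =
        Pattern.compile("^\\*\\s*(\\w+)\\s*\\{(.*)\\}\\s*(?:%\\s*(\\d+))?\\s*$");

    final String cluster;
    final String module;
    final String host;
    final boolean statistics;
    final List<String> servers = new ArrayList<>();
    final int intervalSeconds; // -1 when no %<time_interval> is given

    FarmLineSketch(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised line: " + line);
        }
        cluster = m.group(1);
        String[] parts = m.group(2).split(",");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Module name and host are required");
        }
        module = parts[0].trim();
        host = parts[1].trim();
        boolean stats = false;
        for (int i = 2; i < parts.length; i++) {
            String arg = parts[i].trim();
            int eq = arg.indexOf('=');
            // Argument names are compared case-insensitively.
            String name = (eq < 0 ? arg : arg.substring(0, eq)).trim().toLowerCase(Locale.ROOT);
            String value = eq < 0 ? "" : arg.substring(eq + 1).trim();
            if (name.equals("statistics")) {
                stats = true;
            } else if (name.equals("server")) {
                servers.add(value); // may appear more than once
            }
        }
        statistics = stats;
        intervalSeconds = m.group(3) == null ? -1 : Integer.parseInt(m.group(3));
    }

    public static void main(String[] args) {
        FarmLineSketch p = new FarmLineSketch(
            "*PN_PBS{monPN_PBS, localhost, Statistics, Server=gw01.rogrid.pub.ro, server=lcfg.rogrid.pub.ro}%180");
        System.out.println(p.cluster + " " + p.module + " " + p.servers + " " + p.intervalSeconds);
    }
}
```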
When the modules are run, there are some environment variables that should be set, which indicate the location of the available queue managers:
* for PBS: if you have PBS, you should set the PBS_LOCATION variable; this variable should be set such that the path to the pbsnodes command is ${PBS_LOCATION}/bin/pbsnodes.
* for Condor: if you have Condor, you should set the CONDOR_LOCATION variable; this variable should be set such that the path to the condor_status command is ${CONDOR_LOCATION}/bin/condor_status.
* for LSF: if you have LSF, you should set the LSF_LOCATION variable; this variable should be set such that the path to the lshosts command is ${LSF_LOCATION}/bin/lshosts.
If you have the OSG distribution and you sourced the OSG/setup.sh script, all the needed variables are already set and it is not necessary to set any other environment variables.
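A quick way to check these variables before starting the service is to build the expected command paths from them. The following Java sketch does only that; the class and method names are hypothetical, and it does not reproduce how the modules themselves locate the commands.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

// Hypothetical helper: builds ${<locationVar>}/bin/<command> from an
// environment map, following the convention described in the text.
final class QueueToolPathSketch {

    // Returns the expected command path, or null when the variable is unset.
    static Path resolve(Map<String, String> env, String locationVar, String command) {
        String base = env.get(locationVar);
        if (base == null || base.isEmpty()) {
            return null;
        }
        return Paths.get(base, "bin", command);
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        System.out.println("pbsnodes:      " + resolve(env, "PBS_LOCATION", "pbsnodes"));
        System.out.println("condor_status: " + resolve(env, "CONDOR_LOCATION", "condor_status"));
        System.out.println("lshosts:       " + resolve(env, "LSF_LOCATION", "lshosts"));
    }
}
```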
Logging levels
To change the logging level of a module's logger, add or modify the following line in the ml.properties file:
lia.Monitor.modules.<module_name>.level = LEVEL
Value for LEVEL can be: SEVERE, WARNING, INFO, FINE. Value for module_name can be: monPN_Condor, monPN_PBS, monPN_LSF.
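The level names listed above coincide with java.util.logging level names. Assuming (this is an assumption, not something stated in this guide) that the property is mapped onto a java.util.logging logger of the same name, the effect of such a line would be roughly equivalent to the following sketch. The class and method names are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch only: assumes the ml.properties entry
//   lia.Monitor.modules.<module_name>.level = LEVEL
// corresponds to a java.util.logging logger with that name.
final class ModuleLogLevelSketch {

    static Logger configure(String moduleName, String levelName) {
        Logger logger = Logger.getLogger("lia.Monitor.modules." + moduleName);
        // Level.parse throws IllegalArgumentException for unknown level names.
        logger.setLevel(Level.parse(levelName));
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = configure("monPN_PBS", "FINE");
        System.out.println(logger.getName() + " -> " + logger.getLevel());
    }
}
```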
Ganglia
Ganglia is a well-known monitoring system that uses multicast messaging to collect system information from large clusters. MonALISA can be easily interfaced with Ganglia, either through the multicast messaging system or through the gmon interface, which delivers the cluster monitoring information in XML format. The MonALISA distribution provides modules for both options. If the MonALISA service runs within multicast range of the nodes sending monitoring data, we suggest using the Ganglia module that acts as a multicast listener. The code for interfacing MonALISA with Ganglia using gmon is in Service/usr_code/GangliaMod, and the code using multicast messages is in Service/usr_code/GangliaMCAST. The user may modify these modules. Please look at the service configuration examples to see how these modules may be used.
Monitoring a Farm using Ganglia gmon module
The configuration file should look like this:
Example 1.1. Farm configuration with Ganglia gmon
*PN_popcrn {IGanglia, popcrn01.fnal.gov, 8649}%30
The line:
*PN_popcrn {IGanglia, popcrn01.fnal.gov, 8649}%30
defines a cluster named "PN_popcrn" for which the entire information is provided by the IGanglia module. This module is using telnet to get an XML based output from the Ganglia gmon. The telnet request will be sent to node popcrn01.fnal.gov on port 8649.
All the nodes which report to Ganglia will be part of this cluster unit, and for all of them the parameters selected in the IGanglia module will be recorded. This measurement will be done every 30s.
The Ganglia module is located in Service/usr_code/GangliaMod. The user may edit the file and customize it. This module is NOT in the MonALISA jar files, and to use it the user MUST add the path to this module to the MonALISA loader. This can be done in ml.properties by adding this line:
lia.Monitor.CLASSURLs=file:${MonaLisa_HOME}/Service/usr_code/GangliaMod/
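To illustrate what the gmon approach involves, the sketch below connects to a gmon port, reads the XML dump and extracts the metric values per host. It is not the IGanglia module; the class name is hypothetical, and it assumes the usual Ganglia XML layout with HOST elements containing METRIC elements that carry NAME and VAL attributes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not the IGanglia module shipped in usr_code.
final class GmonXmlSketch {

    // Parses a gmon XML dump into: host name -> (metric name -> raw value).
    static Map<String, Map<String, String>> parse(InputStream xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
            Map<String, Map<String, String>> hosts = new LinkedHashMap<>();
            NodeList hostNodes = doc.getElementsByTagName("HOST");
            for (int i = 0; i < hostNodes.getLength(); i++) {
                Element host = (Element) hostNodes.item(i);
                Map<String, String> metrics = new LinkedHashMap<>();
                NodeList metricNodes = host.getElementsByTagName("METRIC");
                for (int j = 0; j < metricNodes.getLength(); j++) {
                    Element metric = (Element) metricNodes.item(j);
                    metrics.put(metric.getAttribute("NAME"), metric.getAttribute("VAL"));
                }
                hosts.put(host.getAttribute("NAME"), metrics);
            }
            return hosts;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse gmon XML", e);
        }
    }

    // Opens a TCP connection (e.g. port 8649) and parses whatever XML is sent.
    static Map<String, Map<String, String>> fetch(String host, int port) {
        try (Socket socket = new Socket(host, port)) {
            return parse(socket.getInputStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling GmonXmlSketch.fetch("popcrn01.fnal.gov", 8649) would correspond to the request described above, provided the host is reachable from the machine running the code.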
Monitoring a Farm using Ganglia Multicast module
To get copies of the monitoring data sent by the nodes running the Ganglia daemons (using a multicast port), the system on which MonALISA is running must be in multicast range for these messages.
Adding such a line:
*PN_cit{monMcastGanglia, tier2, "GangliaMcastAddress=239.2.11.71; GangliaMcastPort=8649"}
in the configuration file makes the service use the Ganglia multicast module to listen to all the monitoring data and then select certain values which will be recorded into MonALISA. The service will automatically create a configuration for all the nodes which report data in this way.
PN_cit is the name of the cluster of processing nodes. It is important for the cluster name of processing nodes to contain the "PN" string. This is used by farm filters to report global views for the farms.
tier2 is the name of the system corresponding to the real IP address on which this MonALISA service is running. The quoted parameter string defines the multicast address and port used by Ganglia.
The GangliaMCAST module is located in Service/usr_code/GangliaMCAST. The user may edit the file and customize it. This module is NOT in the MonALISA jar files, and to use it the user MUST add the path to this module to the MonALISA loader. This can be done in ml.properties by adding this line:
lia.Monitor.CLASSURLs=file:${MonaLisa_HOME}/Service/usr_code/GangliaMCAST/
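Before deploying the multicast module it can be useful to verify that the quoted argument string is well formed and that the address really is a multicast address. The following sketch does this; it is hypothetical helper code, not part of the GangliaMCAST module, and it does not listen on the network.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for checking an argument string such as
//   "GangliaMcastAddress=239.2.11.71; GangliaMcastPort=8649"
final class McastArgsSketch {

    // Splits the (optionally quoted) string into key/value pairs.
    static Map<String, String> parse(String quotedArgs) {
        Map<String, String> out = new HashMap<>();
        String body = quotedArgs.trim().replaceAll("^\"|\"$", "");
        for (String pair : body.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                out.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return out;
    }

    // True when the literal address lies in the multicast range.
    static boolean isMulticast(String address) {
        try {
            return InetAddress.getByName(address).isMulticastAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> a = parse("\"GangliaMcastAddress=239.2.11.71; GangliaMcastPort=8649\"");
        System.out.println(a + " multicast=" + isMulticast(a.get("GangliaMcastAddress")));
    }
}
```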
Deploying custom MonALISA modules and agents
Introduction
New monitoring modules can be easily developed. These modules may use SNMP requests or can simply run any script (locally or on a remote system) to collect the requested values. The mechanisms to run these modules in independent threads, to interact with the operating system or to control an SNMP session are inherited from a basic monitoring class. The user only has to provide the mechanism to collect the values, parse the output and generate a result object. The module must also provide the names of the parameters that it collects.
While the modules currently provided with MonALISA are integrated in the service binary distribution, the source code of some example modules is provided in the ${MonaLisa_HOME}/Service/usr_code directory. This is also the directory in which the users can develop their own modules. The next section contains instructions for creating and running new modules.
How to Write a New Module
Creating a new module means writing a class that extends the lia.Monitor.monitor.cmdExec class and implements lia.Monitor.monitor.MonitoringModule interface.
This interface has the following structure:
package lia.Monitor.monitor;

public interface MonitoringModule extends lia.util.DynamicThreadPoll.SchJobInt {
    public MonModuleInfo init(MNode node, String args);
    public String[] ResTypes();
    public String getOsName();
    public Object doProcess() throws Exception;
    public MNode getNode();
    public String getClusterName();
    public String getFarmName();
    public boolean isRepetitive();
    public String getTaskName();
    public MonModuleInfo getInfo();
}
The doProcess function is actually the function that collects and returns the results. Usually the return type is a vector of lia.Monitor.monitor.Result objects, but it can also be a single Result object.
The init function initializes the useful information for the module, like the name of the cluster that contains the monitoring nodes, the name of the farm and the parameters for this module. This function is the first called when the farm loads the module. The second parameter of the function represents the list of parameters provided for the module in the farm configuration file (see the section on activating the modules), which should be parsed to obtain the parameter values.
The isRepetitive function tells whether the module has to collect results only once or repeatedly. It returns the value of the module's isRepetitive boolean variable. If it is true, the module is called periodically. The repetition interval is specified in the <farm>.conf file; if it is not specified there, the default interval is 30s.
The other functions return different module information, that is usually set in the init() method. In the source code examples from usr_code you can find models for writing these functions.
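The division of work between init, ResTypes and doProcess can be sketched without the MonALISA classes. The code below is a self-contained stand-in: SketchResult and LoadModuleSketch are hypothetical types that only imitate the shape of a module, and reading /proc/loadavg is an assumption that holds only on Linux. A real module would extend lia.Monitor.monitor.cmdExec and return lia.Monitor.monitor.Result objects instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Stand-in for a result object (hypothetical, not lia.Monitor.monitor.Result).
final class SketchResult {
    final String node;
    final String param;
    final double value;

    SketchResult(String node, String param, double value) {
        this.node = node;
        this.param = param;
        this.value = value;
    }
}

// Imitates the shape of a module: init, ResTypes, isRepetitive, doProcess.
final class LoadModuleSketch {
    private static final String[] RES_TYPES = {"Load1", "Load5", "Load15"};
    private String nodeName;
    private String rawArgs;

    // Called first; args is the raw argument string from the configuration line.
    void init(String nodeName, String args) {
        this.nodeName = nodeName;
        this.rawArgs = args;
    }

    // Names of the parameters this module collects.
    String[] resTypes() {
        return RES_TYPES.clone();
    }

    boolean isRepetitive() {
        return true;
    }

    // Collects the values (Linux only) and turns them into results.
    List<SketchResult> doProcess() throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get("/proc/loadavg"));
        return parse(nodeName, new String(raw, StandardCharsets.US_ASCII));
    }

    // Parses a line such as "0.10 0.20 0.30 1/100 1234".
    static List<SketchResult> parse(String node, String line) {
        String[] fields = line.trim().split("\\s+");
        List<SketchResult> out = new ArrayList<>();
        for (int i = 0; i < RES_TYPES.length; i++) {
            out.add(new SketchResult(node, RES_TYPES[i], Double.parseDouble(fields[i])));
        }
        return out;
    }
}
```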
How to Activate a New Module
.... (myFarm.conf) ...
In order for MonALISA to be able to load the new module, the path to the module's directory should be added to the CLASSURLs property from the ${MonaLisa_HOME}/Service/ml.properties file. For example:
lia.Monitor.CLASSURLs=file:${MonaLisa_HOME}/Service/usr_code/MyModule/
Multiple directories can be specified here separated by commas.
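How a comma-separated list of class URLs can be turned into a class loader is shown in the sketch below. It is an illustration built on the standard java.net.URLClassLoader, not the MonALISA loader itself; the class name is hypothetical, and the sketch assumes that ${MonaLisa_HOME} has already been expanded to a real path.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the MonALISA class loader.
final class ClassUrlsSketch {

    // Splits a CLASSURLs-style value on commas and converts each entry to a URL.
    // Entries must be absolute URLs with variables already expanded.
    static URL[] parse(String property) {
        List<URL> urls = new ArrayList<>();
        for (String entry : property.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            try {
                urls.add(URI.create(trimmed).toURL());
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("Bad CLASSURLs entry: " + trimmed, e);
            }
        }
        return urls.toArray(new URL[0]);
    }

    // A loader that looks for .class files in the listed directories.
    static ClassLoader loaderFor(String property) {
        return new URLClassLoader(parse(property), ClassUrlsSketch.class.getClassLoader());
    }

    public static void main(String[] args) {
        for (URL u : parse("file:/opt/MonaLisa/Service/usr_code/MyModule/, file:/opt/MonaLisa/Service/usr_code/Other/")) {
            System.out.println(u);
        }
    }
}
```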
Examples
Examples to generate new modules can be found in ${MonaLisa_HOME}/Service/usr_code.
In usr_code/MDS there is an example of writing the received values into MDS. This is done using a unix pipe to communicate between the dynamically loadable java module and the script performing the update into the LDAP server.
Another simple example, which prints all the values to standard output, can be found in usr_code/SimpleWriter.
Another example to write the values into UDP sockets is in usr_code/UDPWriter.
Data Filters / Event Triggers
Filters make it possible to dynamically create any new type of derived value from the collected values. For example, a filter can evaluate the integrated traffic over the last n minutes, or the number of nodes for which the load is less than x. Filters may also send an email to a list, or SMS messages, when predefined complex conditions occur. Filters are executed in independent threads and allow any client to register for their output. They may be used to help applications react when certain conditions occur, or to help in presenting global values for large computing facilities. Each filter has its own thread in the MonALISA Service, so filters run independently of each other. To write your own filters/triggers, please follow these steps:
- 1. Your filter MUST extend lia.Monitor.Filters.GenericMLFilter
- 2. It must have a constructor with a String param (the FarmName) in which you must call super(farmName). This constructor is used to dynamically instantiate your filter at runtime.
- 3. Your filter MUST override the following methods:
* public String getName()
returns the Filter name
It is a short name to identify data sent by your filter in the client. It is also used by MonALISA clients to inform the Service that they are interested in the data processed by this filter. It MUST be unique because all the filters in ML are identified by their name.
* public monPredicate[] getFilterPred()
returns a vector of monPredicate(s)
These predicates select, from the entire data flow, only the results that the filter is interested in receiving. If the method returns null, the filter will receive all the monitoring information.
* public void notifyResult(Object o)
This method is called every time a Result matches a predicate returned by getFilterPred(). The filter can save the result in a local buffer for future analysis, or it can take some real-time decision(s)/action(s) if it is a trigger.
* public Object expressResults()
returns a vector of Gresults and/or Results
This method is called from time to time to let the filter process the data that it has received. It should return a Vector of Gresults and/or Results that will be further sent to all the registered clients, or null if no data should be sent to clients (e.g. the filter is a trigger).
* public long getSleepTime()
returns a time interval in milliseconds
This is how often expressResults() should be called. E.g.: if this method returns 2*60*1000, expressResults() will be called every 2 minutes.
- 4. In your ml.properties file please add the path to the directory where the filter has its .class files. The parameter is lia.Monitor.CLASSURLs (if there are more filters/directories, please separate them by commas). E.g.:
lia.Monitor.CLASSURLs=file:${MonaLisa_HOME}/Service/usr_code/FilterExamples/ExTrigger/
- 5. In ml.properties you must specify which filters should be loaded, separated by commas. E.g.:
lia.Monitor.ExternalFilters=ExTrigger,ExLoadFilter
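The interplay between notifyResult, expressResults and getSleepTime described in step 3 can be sketched with a self-contained stand-in. LoadFilterSketch below is hypothetical: it does not extend lia.Monitor.Filters.GenericMLFilter, it accepts plain numbers instead of Result objects, and it returns a double array instead of a Vector, so that it can run without the MonALISA jars.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in filter: buffers values and periodically reports min, max and mean.
// Hypothetical code; a real filter would extend GenericMLFilter.
final class LoadFilterSketch {
    private final String farmName;
    private final List<Double> buffer = new ArrayList<>();

    LoadFilterSketch(String farmName) {
        this.farmName = farmName;
    }

    String getFarmName() {
        return farmName;
    }

    // Must be unique among the filters loaded in the service.
    String getName() {
        return "LoadFilterSketch";
    }

    // expressResults() would be called every 2 minutes.
    long getSleepTime() {
        return 2 * 60 * 1000L;
    }

    // Called for every matching value; here it is simply buffered.
    synchronized void notifyResult(Object o) {
        if (o instanceof Number) {
            buffer.add(((Number) o).doubleValue());
        }
    }

    // Returns {min, max, mean} and clears the buffer, or null if it is empty.
    synchronized double[] expressResults() {
        if (buffer.isEmpty()) {
            return null;
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        for (double v : buffer) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        double[] out = {min, max, sum / buffer.size()};
        buffer.clear();
        return out;
    }
}
```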
The Service/usr_code/FilterExamples directory contains some simple examples of dynamic filters. One of them (ExTrigger) is a simple alarm which sends an email if the Load5 parameter on the master node reaches a threshold value, and the other one (ExLoadFilter) computes the min, max and mean values for a cluster. The data flux between MonALISA Service and clients can contain, more or less, the following two classes:
Autonomous Agents
Agents are entities loaded on the MonALISA service that process the gathered monitoring data and communicate with each other to solve a distributed task based on these data.
An agent respects a given interface. Writing an agent actually means creating a class that implements the lia.Monitor.monitor.AgentI interface. This interface has the following structure:
import lia.Monitor.DataCache.AgentsCommunication;
import lia.Monitor.monitor.AgentInfo;

public interface AgentI {
    public void init(AgentsCommunication comm);
    public void doWork();
    public String getName();
    public String getGroup();
    public String getAddress();
    public AgentInfo getAgentInfo();
    public void processMsg(Object msg);
    public void processErrorMsg(Object msg);
}
For an agent to be able to communicate, the agent-to-agent communication environment has to be initiated. An agent can do this by implementing the init method. This method is called by the Agents Engine when first loading the agent.
Agents hosted on the monitoring service usually communicate using the agents communication platform created over the TCP connections to all the proxy services. The communication is reliable, secure, fast and scalable.
The AgentsCommunication interface has methods to send agent-to-agent messages (the sendMsg method), or agent-to-proxy messages (the sendCtrlMsg method) for getting information about other agents from the distributed system (the list of agents from a group or the number of agents from a group).
package lia.Monitor.DataCache;

public interface AgentsCommunication {
    public void sendMsg(Object o);
    public void sendCtrlMsg(Object o, String cmd);
}
Messages sent between agents are of a specified format:
public class AgentMessage implements java.io.Serializable {
    public Integer messageID;
    public Long timeStamp;
    public Integer messageType;
    public Integer priority;
    public String agentAddrS;
    public String agentAddrD;
    public String agentGroupS;
    public String agentGroupD;
    public Integer messageTypeAck;
    public Object message;
}
The messages sent between agents contain the following fields:
- messageID - an integer number for the message sequence.
- timeStamp - time in milliseconds when the message was sent from the source.
- messageType - type of the message.
- priority - message priority, a number between 1 and 10, default 5. If the priority is high, the message is forwarded faster by the proxy service than the other messages.
- agentAddrS - address of the source agent.
- agentAddrD - address of the destination agent(s). Can be a multicast address, in which case the message is sent to all the agents registered in a group.
- agentGroupS - the group of the source agent. If the source agent has not yet registered in a group, this field is null. When the group is specified for the first time, the agent registers in it. If it is the first agent to register in the specified group, the new group is created in the proxy service.
- agentGroupD - the group of the destination agent.
- messageTypeAck - if it is an ACK message, a confirmation is required when the message reaches the destination.
- message - the actual message transmitted. Can be any serializable object.
What an agent does is implemented in the doWork function. An agent is loaded on the monitoring service by calling the addAgent function from lia.Monitor.DataCache.AgentsEngine. Every time an agent is loaded, a new execution thread is created. This thread executes the agent's doWork function.
An agent is identified in the monitoring service by its name. Every agent has to have a unique name. Based on this name and on the ID of the hosting monitoring service, an agent has a distinct address in the whole distributed system, agentName@farmID. An agent can also register itself in an agent group. Agent groups make it possible to send multicast messages to all agents registered in a group. If the agent doesn't want to register in a group, it doesn't set the group field. The agent's name, group and address can be obtained by calling the getName, getGroup, getAddress or getAgentInfo methods. The last of these returns an object of AgentInfo type, containing all the information about an agent. The lia.Monitor.monitor.AgentInfo class has the following structure:
public class AgentInfo {
    public String agentName;
    public String agentGroup;
    public String farmID;
    public String agentAddr;

    public AgentInfo(String agentName, String agentGroup, String farmID) {
        this.agentName = agentName;
        this.agentGroup = agentGroup;
        this.farmID = farmID;
        this.agentAddr = agentName + "@" + farmID;
    }
}
Messages can be received from other agents in the distributed system. Messages are processed by the processMsg method.
If a message sent by the agent couldn't reach the destination, an error message is returned to the sending agent to notify it of the communication failure. The error message is processed by the processErrorMsg method.
An abstract class, lia.Monitor.Agents.AbstractAgent, exists to simplify agent development. This class wraps the AgentI interface, defining all AgentI methods except the processMsg and doWork methods. There is also a method for creating messages:
public AgentMessage createMsg(
    int messageID,
    int messageType,
    int messageTypeAck,
    int priority,
    String agentAddrD,
    String agentGroupD,
    Object message);
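The agent life cycle described in this section (init, then doWork in its own thread, with incoming messages delivered to processMsg) can be tried out with a self-contained stand-in. Everything in the sketch below is hypothetical: AgentsCommunicationSketch is a local copy of the two-method communication interface, LoopbackCommSketch delivers messages directly instead of going through a proxy service, and plain strings are used instead of AgentMessage objects.

```java
import java.util.ArrayList;
import java.util.List;

// Local copy of the communication interface so the sketch compiles alone.
interface AgentsCommunicationSketch {
    void sendMsg(Object o);
    void sendCtrlMsg(Object o, String cmd);
}

// Minimal agent: sends one message in doWork and stores what it receives.
final class PingAgentSketch {
    private final String name;
    private final String farmID;
    private AgentsCommunicationSketch comm;
    final List<Object> received = new ArrayList<>();

    PingAgentSketch(String name, String farmID) {
        this.name = name;
        this.farmID = farmID;
    }

    // Called once when the agent is loaded.
    void init(AgentsCommunicationSketch comm) {
        this.comm = comm;
    }

    // Address format taken from the text: agentName@farmID.
    String getAddress() {
        return name + "@" + farmID;
    }

    // The agent's task; in the service this runs in its own thread.
    void doWork() {
        comm.sendMsg("ping from " + getAddress());
    }

    void processMsg(Object msg) {
        received.add(msg);
    }
}

// In-memory replacement for the proxy-based transport (assumption of the sketch).
final class LoopbackCommSketch implements AgentsCommunicationSketch {
    private final PingAgentSketch target;

    LoopbackCommSketch(PingAgentSketch target) {
        this.target = target;
    }

    public void sendMsg(Object o) {
        target.processMsg(o);
    }

    public void sendCtrlMsg(Object o, String cmd) {
        // Control messages are ignored in this sketch.
    }
}
```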
