User guide for ApMON (Application Monitoring API)
ApMON
ApMON is a set of flexible APIs that Grid applications can use to send monitoring information to MonALISA services. The monitoring data is sent as UDP datagrams to one or more hosts running MonALISA services. Applications can periodically report any type of information about their current status. ApMON implementations are provided for five programming languages: C, C++, Java, Perl and Python. The library is easy to use in complex data processing programs as well as from scripts or utility programs, and offers flexibility, dynamic configuration, high communication performance, and structured storage of the information in MonALISA databases.
ApMON can also send, in a background thread, additional datagrams that contain monitoring information about the system and/or the application that uses it. The system monitoring datagrams include the current values of parameters such as the CPU load, the number of running processes, the amount of free memory and disk usage. For some parameters, such as memory/swap usage or the number of running processes, ApMON reports the instantaneous value. For other parameters, the values are averaged over the time interval between the moment the last monitoring datagram was sent and the current moment. Examples of such parameters are the percentage of CPU user/system/nice time out of the total time and the average number of KB transferred per second through each network interface.
The job monitoring datagrams contain values for parameters like the amount of memory, disk and CPU time consumed by the application. Multiple jobs (determined by their parent pid and working directories) can be monitored simultaneously; if a job is multithreaded or has created child processes, all its threads/sub-processes are considered when calculating the amount of resources consumed.
All the ApMon versions are written for Linux, the monitoring information being obtained from the /proc filesystem; an exception to this is the Java version, which can be used both on Linux and Windows.
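The interval averaging described above can be illustrated with a minimal Python sketch. It is not ApMon's own code: it assumes the CPU counters come from the aggregate "cpu" line of /proc/stat (user, nice, system, idle), ignores the later fields on that line, and uses two made-up samples in place of real readings taken one interval apart. The function names are hypothetical.

```python
def parse_cpu_line(stat_text):
    """Return (user, nice, system, idle) counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            user, nice, system, idle = (int(x) for x in fields[1:5])
            return user, nice, system, idle
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(previous, current):
    """Average CPU shares over the interval between two samples."""
    deltas = [c - p for p, c in zip(previous, current)]
    total = sum(deltas)
    names = ("cpu_usr", "cpu_nice", "cpu_sys", "cpu_idle")
    if total <= 0:
        # No time elapsed between the samples: report zeros instead of dividing by 0.
        return {name: 0.0 for name in names}
    return {name: 100.0 * d / total for name, d in zip(names, deltas)}


# Made-up samples in /proc/stat format (assumption: taken one interval apart).
sample_1 = "cpu  1000 50 300 8650 0 0 0\ncpu0 1000 50 300 8650 0 0 0\n"
sample_2 = "cpu  1400 50 400 9150 0 0 0\ncpu0 1400 50 400 9150 0 0 0\n"

shares = cpu_percentages(parse_cpu_line(sample_1), parse_cpu_line(sample_2))
print(shares)
```

On a real Linux host the two samples would be obtained by reading /proc/stat twice, once per reporting interval.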
Installation
ApMON C and C++
To compile the ApMon routines and all the examples, type:
./configure [options]
make
make install
where "options" are the typical configure options. The library will be installed in $prefix/lib and the ApMon.h include file into the $prefix/include directory.
If you have Doxygen, you can get the API docs by issuing make doc.
ApMON Java
The ApMon archive contains the following files and folders:
- apmon/ - package that contains the ApMon class and other helper classes: XDRDataOutput and XDROutputStream, which are a part of a library for XDR encoding/decoding, provided under the LGPL license - see http://java.freehep.org; the lisa_host package, which contains classes from LISA - see http://monalisa.cacr.caltech.edu/dev_lisa.html.
- lib/ - directory which will contain, after building, the libraries needed in order to use ApMon
- examples/ - examples for using the routines
- lesser.txt - the LGPL license for the XDR library
- destinations.conf - contains the IP addresses or DNS names of the destination hosts and the ports where the MonALISA modules listen
- build_apmon.sh, env_apmon - for building on Linux systems
- build_apmon.bat - for building on Windows systems
- README
- Doxyfile - for generating Doxygen documentation
There is an additional directory, ApMon_docs, which contains the Doxygen documentation of the source files.
To build ApMon on Linux systems:
1. Set the JAVA_HOME environment variable to the location where Java is installed
2. To build ApMon, cd to the ApMon directory and type:
./build_apmon.sh
The ApMon jar file (apmon.jar) and a small Linux library (libnativeapm.so) are now available in the lib/ directory. In order to use ApMon, the CLASSPATH must contain the path to apmon.jar and, optionally, the LD_LIBRARY_PATH must contain the path to libnativeapm.so. This library provides only one function, mygetpid(), which has the functionality of getpid(); you might want to use it for job monitoring, as in Example_x1.java and Example_x2.java. You can adjust the CLASSPATH and LD_LIBRARY_PATH manually or by sourcing the env_apmon script.
3. To build the ApMon examples, go to the examples/ directory and type:
./build_examples.sh
To build ApMon on Windows systems:
1. Add the directory that contains the apmon package to the CLASSPATH
2. Run build_apmon.bat
3. When running the examples, the directory apmon\lisa_host\Windows must be in the library path (the system.dll library from this directory will be used). This can be done by using the option -Djava.library.path:
java -Djava.library.path=<path to apmon\lisa_host\Windows> exampleSend_1a
ApMON Perl
The ApMon archive contains the following files in the ApMon module:
- ApMon.pm - main ApMon module. It can be instantiated by users to send data.
- ApMon/Common.pm - contains common functions for all other modules.
- ApMon/XDRUtils.pm - contains functions that encode different values in XDR format
- ApMon/ProcInfo.pm - procedures to monitor the system and a given application
- ApMon/ConfigLoader.pm - manages configuration retrieval from multiple places
- ApMon/BgMonitor.pm - handles the background monitoring of system and applications
- README
- example/* - a set of examples with the usage of the ApMon module.
- example/destinations.conf - a sample destinations file, for url/file configuration
- MAN - a short description and API functions
To install this module type the following:
perl Makefile.PL
make
make test
make install
DEPENDENCIES: This module requires these other modules and libraries:
- Data::Dumper
- LWP::UserAgent
- Socket
- Exporter
ApMON Python
The ApMon archive contains the following files in the ApMON module:
- apmon.py - main ApMON module. It can be instantiated by users to send data.
- ProcInfo.py - procedures to monitor the system and a given application.
- README
- example/*.py - a set of examples with the usage of the ApMON module.
- example/*.conf - a set of sample destination files, for url/file configuration.
- ApMon_doc/*.html - HTML documentation
No other steps are needed to install the Python ApMON. For more information see the HTML documentation.
Configure ApMON
ApMon Initialization
There are several ways to initialize ApMon. The first is from a configuration file, which contains the IP addresses or DNS names of the hosts running MonALISA, to which the data will be sent; the ports on which the MonALISA services listen on the destination hosts should also be specified in the file. The configuration file can also contain lines for configuring xApMon (see Section 3, "xApMon - Automatically Sending Monitoring Information"). The lines that specify the destination hosts have the following syntax:
IP_address|DNS_name[:port] [password]
Examples:
rb.rogrid.pub.ro:8884 mypassword
rb.rogrid.pub.ro:8884
ui.rogrid.pub.ro mypassword
ui.rogrid.pub.ro
If the port is not specified, the default value 8884 will be assumed. If the password is not specified, an empty string will be sent as password to the MonALISA host (and the host will accept the datagram either if it does not require a password for the ApMon packets or if the machine from which the packet was sent is in the host's "accept" list). The configuration file may contain blank lines and comment lines (starting with "#"); these lines are ignored, and so are the leading and the trailing white spaces from the other lines.
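The parsing rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the parser shipped with ApMon: the function name parse_destinations is hypothetical, lines starting with "xApMon_" are simply skipped as option lines, and IPv6 literals are not handled.

```python
DEFAULT_PORT = 8884  # default port stated in the text above


def parse_destinations(text):
    """Return a list of (host, port, password) tuples from configuration text."""
    destinations = []
    for raw in text.splitlines():
        line = raw.strip()                  # leading/trailing white space is ignored
        if not line or line.startswith("#"):
            continue                        # blank lines and comments are ignored
        if line.startswith("xApMon_"):
            continue                        # option lines are not destinations
        parts = line.split(None, 1)         # "host[:port]" then optional password
        password = parts[1].strip() if len(parts) > 1 else ""
        host, _sep, port_text = parts[0].partition(":")
        port = int(port_text) if port_text else DEFAULT_PORT
        destinations.append((host, port, password))
    return destinations


sample = """
# MonALISA destinations
rb.rogrid.pub.ro:8884 mypassword
   ui.rogrid.pub.ro

xApMon_sys_monitoring = on
"""

for destination in parse_destinations(sample):
    print(destination)
```

A missing password is represented here as an empty string, matching the behaviour described above where an empty string is sent as the password.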
Another method to initialize ApMON is to provide a list which contains hostnames and ports as explained above, and/or URLs; the URLs point to plain text configuration files which have the format described above. The URLs may also represent requests to a servlet or a CGI script which can automatically provide the best configuration, taking into account the geographical zone in which the machine which runs ApMON is situated, and the application for which ApMon is used. The geographical zone is determined from the machine's IP and the application name is given by the user as the value of the "appName" parameter included in the URL.
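As a small illustration of the second method, the sketch below builds a configuration URL carrying the "appName" parameter and places it in a destination list next to a plain host entry. The servlet address http://example.org/ApMonConf and the application name are placeholders invented for this example; only the "appName" parameter name comes from the text above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def config_url(base_url, app_name):
    """Append the appName query parameter to a configuration servlet URL."""
    return base_url + "?" + urlencode({"appName": app_name})


# Hypothetical servlet address and application name, for illustration only.
url = config_url("http://example.org/ApMonConf", "my analysis job")

# A destination list may mix host[:port] entries and URLs.
dest_list = ["ui.rogrid.pub.ro:8884", url]
print(dest_list)
```

Using urlencode ensures that spaces or other special characters in the application name are escaped correctly.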
Sending Datagrams with User Parameters
A datagram sent to the MonALISA module has the following structure:
- a header which has the following syntax:
v:<ApMon_version>p:<password>
(the password is sent in plaintext; if the MonALISA host does not require a password, a 0-length string should be sent instead of the password).
- cluster name (string) - the name of the monitored cluster
- node name (string) - the name of the monitored node
- number of parameters (int)
- for each parameter: name (string), value type (int), value (can be double, int or string)
- optionally a timestamp (int) if the user wants to specify the time associated with the data; if the timestamp is not given, the current time on the destination host which runs MonALISA will be used. The option to include a timestamp is possible since version 2.0.
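The layout above can be illustrated with a short Python sketch that packs the fields using standard XDR rules (big-endian integers, strings prefixed by their length and padded to a multiple of 4 bytes). It is a sketch under assumptions, not the ApMon encoder: the numeric type codes and the function names are chosen for this example and should be checked against the ApMon sources before being relied upon.

```python
import struct

# Assumed value-type codes for this sketch; verify against the ApMon sources.
XDR_STRING, XDR_INT32, XDR_REAL64 = 0, 2, 5


def xdr_string(text):
    """XDR string: 4-byte length, the bytes, zero padding to a multiple of 4."""
    data = text.encode("utf-8")
    padding = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


def xdr_int(value):
    return struct.pack(">i", value)      # 32-bit big-endian signed integer


def xdr_double(value):
    return struct.pack(">d", value)      # 64-bit big-endian IEEE double


def encode_datagram(version, password, cluster, node, params, timestamp=None):
    """Pack header, cluster, node, parameter count, parameters, optional timestamp."""
    out = [xdr_string("v:%sp:%s" % (version, password)),
           xdr_string(cluster), xdr_string(node), xdr_int(len(params))]
    for name, value in params.items():
        out.append(xdr_string(name))
        if isinstance(value, int):
            out += [xdr_int(XDR_INT32), xdr_int(value)]
        elif isinstance(value, float):
            out += [xdr_int(XDR_REAL64), xdr_double(value)]
        else:
            out += [xdr_int(XDR_STRING), xdr_string(str(value))]
    if timestamp is not None:
        out.append(xdr_int(timestamp))
    return b"".join(out)


packet = encode_datagram("2.0", "", "TestCluster", "node01", {"my_cpu_load": 0.75})
print(len(packet), "bytes")
```

The resulting bytes could then be sent with a UDP socket, for example socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(packet, (host, port)).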
The configuration file and/or URLs can be periodically checked for changes, in a background thread or process, but this option is disabled by default. It can be enabled from the configuration file as follows:
- to enable/disable the periodical checking of the configuration file or URLs:
xApMon_conf_recheck = on/off
- to set the time interval at which the file/URLs are checked for changes:
xApMon_recheck_interval = number_of_seconds
xApMon - Automatically Sending Monitoring Information
ApMon can be configured to automatically send, in a background thread, monitoring information regarding the application or the system. The system information is obtained from the /proc filesystem and the job information is obtained by parsing the output of the ps command. If job monitoring for a process is requested, all its sub-processes will be taken into consideration (i.e., the resources consumed by the process and all the subprocesses will be summed).
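The summing of resources over a process and its sub-processes can be sketched as follows. This is an illustration, not ApMon's implementation: it assumes the ps output has already been parsed into (pid, ppid, rss_kb, vsz_kb) rows, and the sample rows and function names are made up.

```python
def descendants(rows, root_pid):
    """Return the set containing root_pid and every process below it in the tree."""
    children = {}
    for pid, ppid, _rss, _vsz in rows:
        children.setdefault(ppid, []).append(pid)
    found, stack = set(), [root_pid]
    while stack:
        pid = stack.pop()
        if pid in found:
            continue
        found.add(pid)
        stack.extend(children.get(pid, []))
    return found


def job_totals(rows, root_pid):
    """Sum resident and virtual memory (in KB) over the whole process tree."""
    members = descendants(rows, root_pid)
    return {
        "rss": sum(rss for pid, _ppid, rss, _vsz in rows if pid in members),
        "virtualmem": sum(vsz for pid, _ppid, _rss, vsz in rows if pid in members),
    }


# Made-up rows: (pid, parent pid, resident KB, virtual KB).
rows = [
    (1, 0, 1000, 5000),
    (100, 1, 2000, 8000),    # the monitored job
    (101, 100, 500, 1500),   # child of the job
    (102, 101, 250, 700),    # grandchild of the job
    (200, 1, 9000, 20000),   # unrelated process
]

print(job_totals(rows, 100))
```

Only processes 100, 101 and 102 contribute to the totals; the unrelated process 200 is left out.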
There are three categories of monitoring datagrams that ApMon can send:
a) job monitoring information - contains the following parameters:
- run_time: elapsed time from the start of this job
- cpu_time: processor time spent running this job
- cpu_usage: percent of the processor used for this job, as reported by ps
- virtualmem: virtual memory occupied by the job (in KB)
- rss: resident image size of the job (in KB)
- mem_usage: percent of the memory occupied by the job, as reported by ps
- workdir_size: size in MB of the working directory of the job
- disk_total: size in MB of the disk partition containing the working directory
- disk_used: size in MB of the used disk space on the partition containing the working directory
- disk_free: size in MB of the free disk space on the partition containing the working directory
- disk_usage: percent of the used disk partition containing the working directory
b) system monitoring information - contains the following parameters:
- cpu_usr: percent of the time spent by the CPU in user mode
- cpu_sys: percent of the time spent by the CPU in system mode
- cpu_nice: percent of the time spent by the CPU in nice mode
- cpu_idle: percent of the time spent by the CPU in idle mode
- cpu_usage: CPU usage percent
- pages_in: the number of pages paged in per second (average for the last time interval)
- pages_out: the number of pages paged out per second (average for the last time interval)
- swap_in: the number of swap pages brought in per second (average for the last time interval)
- swap_out: the number of swap pages brought out per second (average for the last time interval)
- load1: average system load over the last minute
- load5: average system load over the last 5 min
- load15: average system load over the last 15 min
- mem_used: amount of currently used memory, in MB
- mem_free: amount of free memory, in MB
- mem_usage: used system memory in percent
- swap_used: amount of currently used swap, in MB
- swap_free: amount of free swap, in MB
- swap_usage: swap usage in percent
- net_in: network (input) transfer in kBps
- net_out: network (output) transfer in kBps
- net_errs: number of network errors (these will produce params called sys_ethX_in, sys_ethX_out, sys_ethX_errs, corresponding to each network interface)
- processes: current number of processes (this will also produce parameters called processes_{D,R,T,S,Z}, the number of processes in the D (uninterruptible sleep), R (running), T (traced/stopped), S (sleeping) and Z (zombie) states)
- uptime: system uptime in days
- net_sockets: the number of open TCP, UDP, ICMP and Unix sockets (this will produce parameters called sockets_tcp, sockets_udp, ...)
- net_tcp_details: the number of TCP sockets in each possible state (this will produce parameters called sockets_tcp_ESTABLISHED, sockets_tcp_LISTEN, ...)
c) general system information - contains the following parameters:
- hostname: the machine's hostname
- ip: will produce ethX_ip params for each interface
- cpu_MHz: CPU frequency
- no_CPUs: number of CPUs
- total_mem: total amount of memory, in MB
- total_swap: total amount of swap, in MB
The parameters can be enabled/disabled from the configuration file (if they are disabled, they will not be included in the datagrams). In order to enable/disable a parameter, the user should write in the configuration file lines of the following form:
xApMon_job_parametername = on/off
(for job parameters) or:
xApMon_sys_parametername = on/off
(for system parameters) or:
xApMon_parametername = on/off
(for general system parameters) Example:
xApMon_job_run_time = on
xApMon_sys_load1 = off
xApMon_no_CPUs = on
By default, all the parameters are enabled.
The job/system monitoring can be enabled/disabled by including the following lines in the configuration file:
xApMon_job_monitoring = on/off
xApMon_sys_monitoring = on/off
The datagrams with general system information are only sent if system monitoring is enabled, at greater time intervals than the system monitoring datagrams. To enable/disable the sending of general system information datagrams, the following line should be written in the configuration file:
xApMon_general_info = on/off
The time interval at which job/system monitoring datagrams are sent can be set with:
xApMon_job_interval = number_of_seconds
xApMon_sys_interval = number_of_seconds
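To make the option syntax concrete, here is a minimal Python sketch that reads xApMon lines into a dictionary. It is not the ApMon configuration loader: the function name is hypothetical, and the conversion rules (on/off to booleans, digit strings to integers, anything else kept as text) are assumptions made for this example.

```python
def parse_xapmon_options(text):
    """Collect 'xApMon_name = value' lines into a dict keyed by the name."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("xApMon_") or "=" not in line:
            continue                         # destinations, comments, blank lines
        key, value = (part.strip() for part in line.split("=", 1))
        name = key[len("xApMon_"):]
        lowered = value.lower()
        if lowered in ("on", "off"):
            options[name] = (lowered == "on")
        elif lowered.isdigit():
            options[name] = int(lowered)     # e.g. intervals in seconds
        else:
            options[name] = value
    return options


sample_conf = """
ui.rogrid.pub.ro:8884
xApMon_job_monitoring = on
xApMon_sys_monitoring = on
xApMon_general_info = off
xApMon_sys_load1 = off
xApMon_job_interval = 30
xApMon_sys_interval = 60
"""

options = parse_xapmon_options(sample_conf)
print(options)
```

With this sample, system monitoring stays enabled but the load1 parameter is left out of the system datagrams, and job datagrams would be sent every 30 seconds.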
Using ApMON
In the following section we consider the Java version of ApMON. Similar information about the C/C++, Perl and Python versions can be found on the MonALISA project web site.
Starting work with ApMON
In the Java version, the ApMON features are available as methods of a class called ApMon.
An ApMon object can be initialized with one of the following constructors (see the API docs for more details):
- initializes ApMon from a configuration file whose name is given as parameter.
ApMon(String filename);
- initializes ApMon from a vector which contains strings of the form "address[:port][ password]" specifying destination hosts, and/or addresses of configuration webpages.
ApMon(Vector destList);
- initializes ApMon from a list of destination hosts and the corresponding lists of ports and passwords.
ApMon(Vector destAddresses, Vector destPorts, Vector destPasswds);
After initialization, there are two ways in which the user can send parameter values to MonALISA:
- a single parameter in a datagram
- multiple parameters in a datagram
For sending a datagram with a single parameter, the user should call the function sendParameter() which has several overloaded variants.
For sending multiple parameters in a datagram, the user should call the function sendParameters(), which receives as arguments arrays with the names and the values of the parameters to be sent.
There are two additional functions, sendTimedParameter() and sendTimedParameters(), which allow the user to specify a timestamp for the parameters.
IMPORTANT: When the ApMon object is no longer in use, the stopIt() method should be called in order to close the UDP socket used for sending the parameters.
Job Monitoring and Logging
To monitor jobs, you have to specify the PID of the parent process of the process tree that you want to monitor, the working directory, and the cluster and node names that will be registered in MonALISA; job monitoring must also be enabled. If the working directory is "", no disk information will be retrieved:
void addJobToMonitor(int pid, String workDir, String clusterName, String nodeName);
To stop monitoring a job, the
removeJobToMonitor(int pid)
should be called.
ApMon also prints its messages to a file called apmon.log, with the aid of the Logger class from the Java API. Users may print their own messages with the logger (see the examples). The ApMon loglevels are:
- FATAL (equivalent to Level.SEVERE)
- WARNING
- INFO
- FINE
- DEBUG (equivalent to Level.FINEST)
The ApMon loglevel can be set from the configuration file (by default it is INFO):
xApMon_loglevel = <level>
When setting the loglevel in the configuration file, you must use the ApMon level names rather than the Java names (so that the configuration file remains compatible with the other ApMon versions). The loglevel can also be set with the function setLogLevel() from the ApMon class.
Example of integration with applications
The following two sections present examples in Java and Python. Examples for C/C++ and Perl can be found in the official ApMON documentation.
Java example
In this section we show how the ApMON API can be used to write a simple program that sends monitoring datagrams. The source code for this short tutorial (slightly modified) can be found in the Example_2.java file under the examples/ directory. The program generates some double values for a parameter called "my_cpu_load" and sends them to the MonALISA destinations. The number of iterations is given as a command line argument; in each iteration two datagrams are sent, one with timestamp and one without timestamp.
With this example program we'll illustrate the steps that should usually be taken to write a program with ApMon:
- Import the ApMon package (and possibly other necessary packages):
import java.util.Vector;
import java.util.logging.Logger;
import apmon.*;
- Initialize the variables we shall use.
public class Example_2 {
    private static Logger logger = Logger.getLogger("apmon");

    public static void main(String args[]) {
        // a vector of destinations
        Vector destList = new Vector();
        int nDatagrams = 20;
        // the ApMon object (initially null)
        ApMon apm = null;
        double val = 0;
        int i, timestamp;
        if (args.length == 1) {
            nDatagrams = Integer.parseInt(args[0]);
        }
- Construct an ApMon object (in this example we used an initialization list containing the name of a destination host, ui.rogrid.pub.ro, and a webpage where other destination hosts and possibly ApMon options can be specified). The ApMon functions throw exceptions if errors appear, so it is recommended to place them in a try-catch block:
        destList.add(new String("ui.rogrid.pub.ro:8884 password"));
        destList.add(new String("http://lcfg.rogrid.pub.ro/~corina/destinations_2.conf"));
        try {
            apm = new ApMon(destList);
        } catch (Exception e) {
            logger.severe("Error initializing ApMon: " + e);
            System.exit(-1);
        }
- Adjust the settings for the ApMon object, if necessary (here we set the time interval for reloading the configuration page to 300 sec, and we change the logging level to DEBUG). This can also be done from the configuration file.
        // set the time interval for periodically checking the
        // configuration URL
        apm.setRecheckInterval(300);
        // this way we can change the logging level
        apm.setLogLevel("DEBUG");
- Send datagrams; we used here two functions: one that includes a timestamp in the datagram and one that doesn't.
        // run exactly nDatagrams iterations, two datagrams per iteration
        for (i = 0; i < nDatagrams; i++) {
            val += 0.05;
            if (val > 2) {
                val = 0;
            }
            logger.info("Sending " + val + " for my_cpu_load");
            try {
                /* use the wrapper function with simplified syntax */
                apm.sendParameter("TestCluster2_java", null, "my_cpu_load", ApMon.XDR_REAL64, new Double(val));
                /* now send a datagram with timestamp (as if this was 5h ago) */
                long crtTime = System.currentTimeMillis();
                timestamp = (int)(crtTime / 1000 - (5 * 3600));
                apm.sendTimedParameter("TestClusterOld2_java", null, "my_cpu_load", ApMon.XDR_REAL64, new Double(val), timestamp);
            } catch (Exception e) {
                logger.warning("Send operation failed: " + e);
            }
            try {
                // wait one second between iterations
                Thread.sleep(1000);
            } catch (Exception e) {}
        } // for
- Stop ApMon:
        apm.stopIt();
    }
}
Python example
In this section we show how the ApMON API can be used to write a simple program that sends monitoring datagrams. The source code for this short tutorial (slightly modified) can be found in the simple_send.py file under the examples/ directory.
The program generates values for a few parameters and sends them to the MonALISA destinations specified in the 'dest_2.conf' file; this action is repeated in 20 iterations.
pcardaab.cern.ch:8884
#lcfg.rogrid.pub.ro
The second host is disabled, but can be enabled by simply removing the '#' character.
With this example program we'll illustrate the steps that should usually be taken to write a program with ApMon:
- Import the ApMon module (and possibly other necessary modules):
import apmon
import time
- Initialize ApMon and possibly set some options (in this example, we disabled the background job/system monitoring and the periodical reloading of the configuration file, and also set the loglevel). These options could have also been set from a configuration file.
# Initialize ApMon specifying that it should not send information about the system.
# Note that in this case the background monitoring process isn't stopped, in case you may
# want later to monitor a job.
apm = apmon.ApMon('dest_2.conf')
apm.setLogLevel("INFO")
apm.confCheck = False
apm.enableBgMonitoring(False)
- Send the datagrams. We used here two functions: sendParameters, which specifies the cluster name and the node name (which will be cached in the ApMon object), and sendParams, which uses the names that were memorized at the call of the first function.
for i in range(1, 21):
    # you can put as many pairs of parameter_name, parameter_value as you want
    # but be careful not to create packets longer than 8K.
    apm.sendParameters("SimpleCluster", "SimpleNode", {'var_i': i, 'var_i^2': i*i})
    f = 20.0 / i
    # send in the same cluster and node as last time
    apm.sendParams({'var_f': f, '5_times_f': 5 * f, 'i+f': i + f})
    print("simple_send-ing for i=%d" % i)
    time.sleep(1)
Configure MonALISA with ApMON support
To configure MonALISA to listen on UDP port 8884 for incoming datagrams (XDR encoded, using ApMon) you should add the following line in your MonALISA farm config file:
^monXDRUDP{ListenPort=8884}%30
The Clusters, Nodes and Parameters are dynamically created in MonALISA's configuration tree every time a new one is received. It is possible, also, to dynamically remove "unused" Clusters/Nodes/Parameters, if there are no datagrams to match them for a period of time. The timeouts are in seconds:
^monXDRUDP{ParamTimeout=10800,NodeTimeout=10800,ClusterTimeout=86400,ListenPort=8884}%30
In the example above, the parameters and the nodes are automatically removed from the MonALISA configuration tree if no data is received for 3 hours (10800 seconds). The cluster is removed after one day (24 hours, or 86400 seconds).
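As a quick check of the timeout arithmetic, the following Python sketch splits such a module line into its name and arguments and converts the timeouts from seconds to hours. It is only an illustration of the syntax shown above, not MonALISA's own parser; the function name is hypothetical, and the number after the "%" sign is returned as-is without interpreting it.

```python
import re


def parse_module_line(line):
    """Split '^module{key=value,...}%N' into (module, args dict, N)."""
    match = re.match(r"^\^(\w+)\{(.*)\}%(\d+)$", line.strip())
    if not match:
        raise ValueError("unrecognised module line: %r" % line)
    module, args_text, suffix = match.groups()
    args = {}
    for item in args_text.split(","):
        if not item.strip():
            continue
        key, _sep, value = item.partition("=")
        args[key.strip()] = int(value)
    return module, args, int(suffix)


line = "^monXDRUDP{ParamTimeout=10800,NodeTimeout=10800,ClusterTimeout=86400,ListenPort=8884}%30"
module, args, suffix = parse_module_line(line)

for name in ("ParamTimeout", "NodeTimeout", "ClusterTimeout"):
    print(name, "=", args[name] / 3600.0, "hours")
```

Running it confirms the values quoted above: 3 hours for parameters and nodes, 24 hours for the cluster.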
Complete documentation for ApMON
For more information please check: http://monalisa.caltech.edu
Download
ApMon can be obtained from the MonALISA download page: http://monalisa.cern.ch/monalisa__Download__ApMon.html
