SG Logwatch-G

This page describes the Logwatch-G tool, contributed to the SA1 and JRA1 components of the SEE-GRID-SCI project by the Rudjer Boskovic Institute.

Introduction

Logwatch for Grid Services (Logwatch-G) is a set of plugins for Logwatch, a customizable log analysis system. Logwatch parses system logs and creates a report analyzing specified areas for a given period in the past. The default Logwatch installation comes with a set of plugins for reporting on standard Linux services. The goal of Logwatch-G is to extend Logwatch so that it can also report on the activity of standard Grid services. The current version of Logwatch-G supports services running on the most common node types:

  • CE
  • SE (SRM)
  • WN (Torque)
  • BDII
  • MON

Architecture

Logwatch consists of a parsing engine and a number of log file filters and service filters. The parsing engine drives the process of generating a report for a given time period and set of services. Log file filters extract the lines from the log files that match the given time period. The matched lines are then fed into a service filter, which accumulates the interesting lines and generates a summary report of the service behavior based on that information. The reports for all requested services are concatenated to form the Logwatch report. Depending on the command line options, the report is either mailed to the administrator or displayed on the standard output. The following example shows a Logwatch report for a period of one day (yesterday) and one service (torque-accounting).

$ logwatch --print --range yesterday --service torque-accounting

 ################### LogWatch 5.2.2 (06/23/04) #################### 
       Processing Initiated: Thu May 21 21:17:48 2009
       Date Range Processed: yesterday
     Detail Level of Output: 0
          Logfiles for Host: grid1.irb.hr
 ################################################################ 

 --------------------- torque-accounting Begin ------------------------ 

Job queued: 168 times
Job started: 141 times
Job exited: 141 times
Job deleted: 4 times

Short jobs: 56 times

 ---------------------- torque-accounting End ------------------------- 

 ###################### LogWatch End ######################### 
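
The log file filter step described above can be sketched in a few lines. This is only an illustration in Python, not the filter shipped with Logwatch: it assumes classic syslog timestamps ("Dec 14 22:51:58", with the day of month padded by a space), and the function names are hypothetical.

```python
from datetime import date

# English month abbreviations as used in classic syslog timestamps.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def syslog_prefix(day: date) -> str:
    """Return the 'Mmm dd' prefix that syslog writes for the given day."""
    # The day of month is padded with a space, not a zero ("Dec  4").
    return f"{MONTHS[day.month - 1]} {day.day:>2}"


def lines_for_day(lines, day: date):
    """Yield only the lines whose timestamp falls on the given day."""
    prefix = syslog_prefix(day) + " "
    for line in lines:
        if line.startswith(prefix):
            yield line
```

Classic syslog timestamps carry no year, so a sketch like this cannot tell the same calendar day of two different years apart.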

Although most of the existing log file and service filters are written in Perl, this is not a hard requirement. Logwatch filters are executed in pipelines: the output of one program becomes the input of the next filter stage. Therefore, any programming language can be used as long as the program works correctly inside a pipeline. The filters for Grid service log files are written in Perl.
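
To illustrate that a filter only has to behave well inside a pipeline, here is a minimal sketch of a service filter stage written in Python instead of Perl. The keyword, the report wording and the function names are hypothetical; the only point is that the stage reads lines from standard input and writes its summary to standard output.

```python
import sys


def count_matches(lines, keyword):
    """Count the input lines that contain the keyword."""
    return sum(1 for line in lines if keyword in line)


def main(stdin=sys.stdin, stdout=sys.stdout, keyword="error"):
    # Read everything the previous stage produced on standard input.
    total = count_matches(stdin, keyword)
    # This sketch stays silent when nothing was found, so an uneventful
    # day adds no noise to the combined report.
    if total:
        stdout.write(f"Lines containing '{keyword}': {total}\n")

# In a real filter script the last line would simply call main().
```

Because the stage touches nothing but its input and output streams, it can be placed after any program that emits the already selected log lines.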

Installation

For Grid sites using the SEE-GRID-SCI yum repository, Logwatch-G can be installed like any other system package:

# yum install logwatch-grid
Setting up Install Process
Setting up repositories
Reading repository metadata in from local files
Parsing package install arguments
Resolving Dependencies
--> Populating transaction set with selected packages. Please wait.
---> Downloading header for logwatch-grid to pack into transaction set.
---> Package logwatch-grid.noarch 0:0.6-1 set to be updated
--> Running transaction check
--> Processing Dependency: logwatch for package: logwatch-grid
--> Restarting Dependency Resolution with new changes.
--> Populating transaction set with selected packages. Please wait.
---> Package logwatch.noarch 0:5.2.2-4.el4 set to be updated
--> Running transaction check

Dependencies Resolved

=============================================================================
 Package                 Arch       Version          Repository        Size 
=============================================================================
Installing:
 logwatch-grid           noarch     0.6-1            SEE-GRID General SL4 noarch   24 k
Installing for dependencies:
 logwatch                noarch     5.2.2-4.el4      base              133 k

Transaction Summary
=============================================================================
Install      2 Package(s)         
Update       0 Package(s)         
Remove       0 Package(s)         
Total download size: 157 k
Is this ok [y/N]: y

Downloading Packages:
(1/1): logwatch-grid-0.6- 100% |=========================|  24 kB    00:00     
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing: logwatch                     ######################### [1/2] 
  Installing: logwatch-grid                ######################### [2/2] 

Installed: logwatch-grid.noarch 0:0.6-1
Dependency Installed: logwatch.noarch 0:5.2.2-4.el4
Complete!

If the SEE-GRID-SCI package repository is not used, the package can also be installed manually. Assuming the standard Logwatch package is already installed on the host, Logwatch-G can be installed by running:

# rpm -Uhv http://www.irb.hr/users/vvidic/seegrid/logwatch-grid-latest.noarch.rpm
Retrieving http://www.irb.hr/users/vvidic/seegrid/logwatch-grid-latest.noarch.rpm
warning: /var/tmp/rpm-xfer.IiA1tM: V3 DSA signature: NOKEY, key ID c204a7eb
Preparing...                ########################################### [100%]
   1:logwatch-grid          ########################################### [100%]

In the default installation, Logwatch reports are mailed to the root system user once a day. It is common to redirect root's mail to the administrators' real mail addresses, because local mailboxes on server machines are not checked regularly:

# cat /etc/aliases
...
root: grid-admin@some.site
# newaliases

After this modification, the system administrator should start receiving daily e-mail reports on the status of both Grid and standard system services.

Usage

In addition to the daily e-mail reports, Logwatch can also be called from the command line. The simplest usage is to call it with the --print parameter. The generated output reports on all services found on the system for yesterday:

# logwatch --print

 ################### LogWatch 5.2.2 (06/23/04) #################### 
       Processing Initiated: Mon Dec 14 22:51:58 2009
       Date Range Processed: yesterday
     Detail Level of Output: 0
          Logfiles for Host: grid2.irb.hr
 ################################################################ 

 --------------------- apel Begin ------------------------ 

Accounting records published: 57

 ---------------------- apel End ------------------------- 


 --------------------- globus-gridftp Begin ------------------------ 

Stores: 189 (5.55 MiB)
Retrieves: 245 (370.70 MiB)

 ---------------------- globus-gridftp End ------------------------- 


 ------------------ Disk Space --------------------

/dev/hda1             1.9G  1.2G  602M  67% /
/dev/hda8              69G   41G   25G  63% /home
/dev/hda6             479M   13M  442M   3% /tmp
/dev/hda7             1.9G  862M  927M  49% /var

 ###################### LogWatch End ######################### 

Additional parameters can be added to change the scope of the report. The --range parameter creates reports for yesterday (default), today, or all days. The --service parameter limits the report to the specified service or services. Supported services are listed in /etc/log.d/conf/services. Currently this includes:

  • apel
  • bdii-update
  • dpm-srm
  • globus-gatekeeper
  • globus-gridftp
  • gridftp-session
  • torque-accounting
  • torque-mom

# logwatch --print --range today --service globus-gridftp

 ################### LogWatch 5.2.2 (06/23/04) #################### 
       Processing Initiated: Mon Dec 14 23:00:47 2009
       Date Range Processed: today
     Detail Level of Output: 0
          Logfiles for Host: grid2.irb.hr
 ################################################################ 

 --------------------- globus-gridftp Begin ------------------------ 

Stores: 194 (5.71 MiB)
Retrieves: 252 (351.84 MiB)

 ---------------------- globus-gridftp End ------------------------- 

 ###################### LogWatch End ######################### 

The --logfile parameter limits the report to the selected log files. The list of supported log files is available in /etc/log.d/conf/logfiles. The following log files are scanned by the current version:

  • /var/log/apel.log
  • /opt/bdii/var/bdii.log
  • /var/log/srmv1/log
  • /var/log/srmv2/log
  • /var/log/srmv2.2/log
  • /var/log/messages
  • /var/log/globus-gridftp.log
  • /var/log/dpm-gsiftp/dpm-gsiftp.log
  • /var/log/gridftp-session.log
  • /var/log/dpm-gsiftp/gridftp.log
  • /var/spool/torque/server_priv/accounting/
  • /var/spool/torque/mom_logs/

Finally, the --archives parameter includes information from compressed log files in the report:

# logwatch --print --range all --logfile globus-gridftp --archives 

 ################### LogWatch 5.2.2 (06/23/04) #################### 
       Processing Initiated: Mon Dec 14 23:17:59 2009
       Date Range Processed: all
     Detail Level of Output: 0
          Logfiles for Host: grid2.irb.hr
 ################################################################ 

 --------------------- globus-gridftp Begin ------------------------ 

Stores: 13670 (5.27 GiB)
Retrieves: 19348 (128.82 GiB)

 ---------------------- globus-gridftp End ------------------------- 

 ###################### LogWatch End ######################### 

Another common use of Logwatch is to run it right after logging into a server to check the status of running services:

# logwatch --print --range today
# logwatch --print --range yesterday

Logwatch only checks the logs on the server where it runs, so it cannot analyze problems that span multiple servers. The administrator needs to check the Logwatch reports on all of the machines involved and work out what happened. For example, if APEL publishing fails on the CE node, the MON node will publish 0 new records. Logwatch-G will report both of these problems, and it is up to the administrator to find the root cause of the failure.

Internals

APEL accounting logs (/var/log/apel.log) are processed on the CE and MON nodes. First, the log lines for the given time period (yesterday by default) are extracted. The resulting lines are then scanned to determine the status of the accounting system on the site. The script checks whether APEL started and finished properly and found at least some job records. On the CE node, it checks whether the records were inserted into the accounting database on the MON node. On the MON node, it checks whether the records were transferred to the central accounting server. The following example shows the possible outputs of this check.

 --------------------- apel Begin ------------------------
Accounting records inserted: 122

Accounting records published: 84

Warning: apel not started

Warning: apel not finished

Warning: no records published
 --------------------- apel End --------------------------
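
The decision logic behind these messages might look roughly like the following Python sketch. It deliberately skips the parsing of apel.log and starts from facts already extracted from the log. The function name, its parameters and the order of the checks are assumptions; only the message texts are taken from the sample above, and the "no records inserted" variant for the CE is extrapolated from it.

```python
def apel_report(started, finished, records, action="published"):
    """Build the report lines for one APEL run.

    started, finished: whether the start and end of the run were seen.
    records: number of accounting records handled in the run.
    action: "inserted" on the CE node, "published" on the MON node.
    """
    report = []
    if not started:
        report.append("Warning: apel not started")
    elif not finished:
        report.append("Warning: apel not finished")
    elif records == 0:
        # The run completed but handled nothing, which is also suspicious.
        report.append(f"Warning: no records {action}")
    else:
        report.append(f"Accounting records {action}: {records}")
    return report
```

For a healthy run on the MON node, `apel_report(True, True, 84)` yields the single line `Accounting records published: 84`.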

BDII information system logs (/opt/bdii/var/bdii.log) are scanned on all site nodes that run the bdii-update service (CE, SE, site BDII). In the first step, the parts of the log file for the given time period are extracted. The resulting lines are then scanned for errors, which are usually generated by the ldapadd command. Any errors found are reported together with the number of times they appeared. Some common errors are shown in the following example.

 --------------------- bdii-update Begin ------------------------
could not parse entry: 2 times
invalid value for attribute GlueServiceVersion: 10 times
 --------------------- bdii-update End --------------------------
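
The "count each distinct error" step can be sketched as follows. This Python fragment is an illustration only: it assumes the error messages have already been isolated from the log lines, and the alphabetical ordering is a guess based on the sample above.

```python
from collections import Counter


def times(n):
    """Format a count the way the sample reports do: '1 time', '2 times'."""
    return f"{n} time" if n == 1 else f"{n} times"


def summarize_errors(messages):
    """Return one 'message: N times' line per distinct error message."""
    counts = Counter(messages)
    # Sorting by message text keeps the report stable from day to day.
    return [f"{msg}: {times(n)}" for msg, n in sorted(counts.items())]
```

Collapsing repeated messages into one line with a count keeps the daily report short even when the same error is logged on every update cycle.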

DPM SRM logs (/var/log/srmv*/log) are checked on the DPM SE node. A common script is used to extract the log lines belonging to the requested time period. The selected lines are then scanned for signs of errors. Any problems found are reported together with the number of times each error appeared. The following example shows a common problem of failed VOMS mappings. Unfortunately, the SRM logs do not contain detailed information on why this happened.

 --------------------- dpm-srm Begin ------------------------
Server error messages:
 Error retrieveing the VOMS credentials (3 times)
 --------------------- dpm-srm End --------------------------

The globus-gatekeeper check examines the results of the mapping from X.509 certificates to local Unix accounts, which is done by the Globus Gatekeeper and logged via syslog to /var/log/messages. This check is executed on the LCG CE node type. After the relevant lines are extracted from this log file, they are checked for successful and unsuccessful authentication and mapping events. These are reported as shown in the next example output. It is usually possible to find the cause of mapping failures and fix them using information from the local logs.

 --------------------- globus-gatekeeper Begin ------------------------
Mapping succeded: 1114 times
Mapping failed: 6 times
  /C=HR/O=SRCE CA/OU=GRID/OU=irb/CN=Some User (6 times)
 ---------------------- globus-gatekeeper End -------------------------
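
A report of this shape, with failures grouped by certificate subject, could be produced along these lines. The Python sketch below is hypothetical: it does not parse the gatekeeper's syslog lines at all, but assumes an earlier step has already turned them into (subject DN, success flag) pairs.

```python
from collections import Counter


def times(n):
    """Format a count as '1 time' or 'N times'."""
    return f"{n} time" if n == 1 else f"{n} times"


def mapping_report(events):
    """events: iterable of (subject_dn, mapped_ok) pairs."""
    ok = 0
    failed = Counter()
    for dn, mapped_ok in events:
        if mapped_ok:
            ok += 1
        else:
            failed[dn] += 1  # remember who could not be mapped
    lines = [f"Mapping succeeded: {times(ok)}"]
    total_failed = sum(failed.values())
    if total_failed:
        lines.append(f"Mapping failed: {times(total_failed)}")
        for dn, n in sorted(failed.items()):
            lines.append(f"  {dn} ({times(n)})")
    return lines
```

Listing the failing subjects under the total tells the administrator directly which users to look up in the local mapping configuration.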

The GridFTP server used on the SE and CE nodes writes two types of log files. The first type is the accounting file, which contains statistics on successful file downloads and uploads. This check processes information from /var/log/globus-gridftp.log on the CE and /var/log/dpm-gsiftp/dpm-gsiftp.log on the DPM SE. After the file transfer records for the requested time range are selected, basic statistics are calculated and displayed: the total number and size of files uploaded to (Stores) or downloaded from (Retrieves) the server in the selected time period. This report can be used to track GridFTP server usage over time.

 --------------------- globus-gridftp Begin ------------------------
Stores: 258 (1.09 MiB)
Retrieves: 3146 (4.10 GiB)
 --------------------- globus-gridftp End --------------------------
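
The statistics above can be approximated with a short Python sketch. It assumes that each transfer record is a single line of space-separated KEY=value fields that includes TYPE=STOR or TYPE=RETR and the byte count as NBYTES=...; treat that layout, like the function names, as an assumption and not as a description of the actual Perl filter.

```python
def format_size(nbytes):
    """Format a byte count with binary units, two decimals."""
    size = float(nbytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} TiB"


def transfer_stats(lines):
    """Sum up transfer counts and bytes per direction (STOR / RETR)."""
    stats = {"STOR": [0, 0], "RETR": [0, 0]}  # type -> [count, bytes]
    for line in lines:
        # Split 'KEY=value' tokens; tokens without '=' are ignored.
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        kind = fields.get("TYPE")
        if kind in stats and fields.get("NBYTES", "").isdigit():
            stats[kind][0] += 1
            stats[kind][1] += int(fields["NBYTES"])
    return stats


def report(stats):
    """Render the two summary lines in the style of the sample report."""
    return [
        f"Stores: {stats['STOR'][0]} ({format_size(stats['STOR'][1])})",
        f"Retrieves: {stats['RETR'][0]} ({format_size(stats['RETR'][1])})",
    ]
```

Lines that carry no recognizable TYPE or NBYTES field are skipped silently, so stray log noise does not distort the totals.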

The second log type produced by the GridFTP server contains the history of FTP commands sent by the clients and the responses generated by the server. On the DPM SE, session information is written to /var/log/dpm-gsiftp/gridftp.log, while the CE uses /var/log/gridftp-session.log. After the relevant lines are extracted based on the timestamp, the server responses are checked for problems. Failures (response codes starting with the digit 4 or 5) are reported together with their occurrence counts, as shown in the following example.

 --------------------- gridftp-session Begin ------------------------
Server error messages:
 421 Server terminated (5 times)
 500 Command failed : unlink error: Host not known (5 times)
 500 Command failed. : open error: No such file or directory (1 time)
 500-Command failed. : globus_xio: System error in write: Connection reset by peer (1 time)
 --------------------- gridftp-session End --------------------------
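
The rule "response code starting with 4 or 5" translates naturally into a regular expression. The Python sketch below assumes the server replies have already been cut out of the session log lines (that extraction is not shown), and the function names are hypothetical.

```python
import re
from collections import Counter

# A failure reply starts with a 4xx or 5xx code followed by a space, or by
# a hyphen when the reply continues on further lines.
FAILURE = re.compile(r"^[45]\d\d[ -]")


def failed_responses(responses):
    """Count the server replies whose code starts with 4 or 5."""
    return Counter(r.strip() for r in responses if FAILURE.match(r.strip()))


def report(counts):
    """Render the counts in the style of the sample report."""
    lines = ["Server error messages:"] if counts else []
    for msg, n in sorted(counts.items()):
        lines.append(f" {msg} ({n} {'time' if n == 1 else 'times'})")
    return lines
```

Replies with 1xx, 2xx and 3xx codes indicate normal progress in FTP, which is why only the 4xx and 5xx classes are collected.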

The Torque accounting log (/var/spool/pbs/server_priv/accounting/<date>) collects statistics on all the jobs executed by the CE batch server. The Logwatch plugin uses this information to report on cluster usage. First, the log lines for the requested time period are selected. The job status recorded in these lines is then collected and reported as the total number of jobs that were queued, started, finished, or deleted before finishing. Finally, the number of short jobs, which finished in less than a minute, is reported. Large values for this statistic can point to a cluster misconfiguration that causes jobs to exit shortly after being started. The following example shows a report generated for one day of typical small cluster usage.

 --------------------- torque-accounting Begin ------------------------
Job queued: 256 times
Job started: 256 times
Job exited: 242 times
Job deleted: 1 time

Short jobs: 89 times
 --------------------- torque-accounting End --------------------------
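
As a rough illustration of how such numbers can be derived, the following Python sketch counts accounting records by type and detects short jobs. It assumes the usual Torque accounting layout of "timestamp;record type;job id;key=value ..." with record types Q, S, E and D, and start= and end= fields given as epoch seconds on E records. The function names and the 60 second default are choices made for this sketch.

```python
from collections import Counter

LABELS = {"Q": "queued", "S": "started", "E": "exited", "D": "deleted"}


def times(n):
    """Format a count as '1 time' or 'N times'."""
    return f"{n} time" if n == 1 else f"{n} times"


def parse_record(line):
    """Split one accounting line into (record_type, attributes)."""
    # Assumed layout: timestamp;type;job id;space-separated key=value pairs
    _stamp, rtype, _jobid, message = line.rstrip("\n").split(";", 3)
    attrs = dict(kv.split("=", 1) for kv in message.split() if "=" in kv)
    return rtype, attrs


def accounting_stats(lines, short_limit=60):
    """Return (records per type, number of short jobs)."""
    counts = Counter()
    short = 0
    for line in lines:
        if line.count(";") < 3:
            continue  # not an accounting record
        rtype, attrs = parse_record(line)
        counts[rtype] += 1
        # A short job is an exited job that ran for under short_limit seconds.
        if rtype == "E" and "start" in attrs and "end" in attrs:
            if int(attrs["end"]) - int(attrs["start"]) < short_limit:
                short += 1
    return counts, short


def report(counts, short):
    """Render the statistics in the style of the sample report."""
    lines = [f"Job {LABELS[t]}: {times(counts[t])}" for t in "QSED"]
    lines += ["", f"Short jobs: {times(short)}"]
    return lines
```

Comparing the short job count with the number of exited jobs gives a quick impression of whether jobs are failing right after start.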

Similar to the previous plugin, the log file /var/spool/pbs/mom_logs/<date> contains the history of events that occurred on the given batch cluster node (WN). This check parses the file for lines pertaining to the selected time period and generates a usage report. The report includes the number of jobs that were started and finished on this WN, or were killed at the request of the CE. In addition, a common error, failure to copy input/output files from/to the CE, is detected and reported. An example report including all of the above is shown next.

 --------------------- torque-mom Begin ------------------------
File stage in/out failed 5 times.

Started jobs: 179
Finished jobs: 175
Killed jobs: 4
 --------------------- torque-mom End --------------------------

Development

Contact

Valentin Vidic <vvidic at irb.hr>
