Worker Nodes Failure Alerting Tool

From EGEE-see Wiki


This wiki page is part of the SEE-GRID Gridification Guide. It was contributed by the Belgrade University Computer Centre.


Introduction

This page describes a simple but useful alerting tool that checks the Worker Nodes (WNs) of a Computing Element (CE) for network and PBS failures, including hardware failures that leave a machine unresponsive. When an error is detected, an alert e-mail is sent to the administrator, whose address is set in the script.

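An e-mail is sent only once when an error occurs, and once more when the error is resolved. The script keeps track of this with per-node flag files in /root/data (for example /root/data/<node>.ping and /root/data/<node>.pbs).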
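
The "alert once per state change" behaviour can be illustrated in isolation. The following is only a minimal sketch, not part of testWN: the names report_state, send_alert, state_dir and alerts are hypothetical, a temporary directory stands in for /root/data, and alerts are collected in an array instead of being mailed.

```shell
#!/bin/bash
# Sketch of edge-triggered alerting with a flag file.
# report_state, send_alert, state_dir and alerts are hypothetical names.

state_dir=$(mktemp -d)      # stand-in for /root/data
alerts=()                   # collected subjects instead of real e-mail

send_alert() {              # stand-in for: mail -s "$1" $email < /root/data/note
    alerts+=("$1")
}

# report_state NODE STATUS   (STATUS: 0 = healthy, non-zero = failing)
report_state() {
    local node=$1 status=$2 flag="$state_dir/$1.ping"
    if [ "$status" -ne 0 ]; then
        # Failing: alert only if no flag file records an earlier alert
        if [ ! -f "$flag" ]; then
            send_alert "$node not responding on ping"
            touch "$flag"
        fi
    elif [ -f "$flag" ]; then
        # Healthy again after a recorded failure: alert and clear the flag
        send_alert "$node responding on ping again!"
        rm -f "$flag"
    fi
}

report_state wn01 1   # first failure: alert
report_state wn01 1   # still failing: silent
report_state wn01 0   # recovered: alert
report_state wn01 0   # still healthy: silent
printf '%s\n' "${alerts[@]}"
```

Four checks produce only two alerts, one for the failure and one for the recovery, which is the behaviour described above.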

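The script runs on the CE and tests each WN in turn, so passwordless root SSH access from the CE to the WNs must be enabled. It is intended to be run from cron; in the example given below, it runs every 5 minutes. The script also expects the directory /root/data to exist and to contain a file named note, which is used as the body of every alert e-mail.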
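
Before enabling the cron job, it may help to confirm that the files the script relies on are in place. The following is a hedged sketch of such a check: check_prereqs is a hypothetical helper that is not part of testWN, and the demonstration runs against a scratch directory rather than /root/data and /var/spool/pbs/server_priv/nodes.

```shell
#!/bin/bash
# Hypothetical pre-flight check; check_prereqs is not part of testWN.
# Paths are passed in so the function can be tried on a scratch directory.

check_prereqs() {
    local data_dir=$1 nodes_file=$2 problems=0
    if [ ! -d "$data_dir" ] || [ ! -w "$data_dir" ]; then
        echo "missing or unwritable state directory: $data_dir"
        problems=$((problems + 1))
    fi
    if [ ! -r "$data_dir/note" ]; then
        echo "missing mail body file: $data_dir/note"
        problems=$((problems + 1))
    fi
    if [ ! -r "$nodes_file" ]; then
        echo "cannot read node list: $nodes_file"
        problems=$((problems + 1))
    fi
    # Exit status is the number of problems found (0 = all good)
    return "$problems"
}

# Demonstration on a scratch directory instead of /root/data
scratch=$(mktemp -d)
check_prereqs "$scratch" "$scratch/nodes"; before=$?
echo "worker node alert" > "$scratch/note"
echo "wn01.example.org np=2" > "$scratch/nodes"
check_prereqs "$scratch" "$scratch/nodes"; after=$?
echo "problems before: $before, after: $after"
```

On a real CE one would presumably call check_prereqs /root/data /var/spool/pbs/server_priv/nodes. The passwordless login itself can be checked separately, for example with ssh -o BatchMode=yes <node> true, which fails instead of prompting for a password.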

Script source

The source of the testWN script is given below.

#!/bin/bash
# 
# filename: /root/bin/testWN
#
# description: network connection and PBS test on WN's
# prerequisite: ssh root connection from CE to WN's without password
# work: from cron on CE
# cron example: "*/5 * * * * /root/bin/testWN"
 
# email where alerts will be sent
email=admin-email

# list all nodes
for i in $(awk '{ print $1 }' /var/spool/pbs/server_priv/nodes); do
        ping -c 1 -W 2 $i > /dev/null
        if [ $? -ne 0 ]; then
                if [ ! -f /root/data/$i.ping ]; then
# network problem
                        mail -s  "$i not responding on ping" $email < /root/data/note
                        touch /root/data/$i.ping
                fi
        else
# network OK
                if [ -f /root/data/$i.ping ]; then
# network had a problem, but is OK now
                        mail -s  "$i responding on ping again!" $email < /root/data/note
                        rm -f /root/data/$i.ping
                fi
# check PBS 
                if [ -n "`pbsnodes -l $i`" ]; then
# PBS problem
                        if [ ! -f /root/data/$i.pbs ]; then 
                                touch /root/data/$i.pbs
                                pbsstatus=`/usr/bin/ssh $i service pbs_mom status | awk -F "is " '{ print $2 }'`
                                case $pbsstatus in
                                running...)
                                        mail -s "PBS on $i running...! Unknown error!!!" $email < /root/data/note
                                        ;;
                                stopped)
                                        mail -s "PBS on $i manually stopped" $email < /root/data/note
                                        ;;
                                *)
                                        mail -s "PBS error on $i. Attempting restart!" $email < /root/data/note
                                        ssh $i service pbs_mom restart
                                        sleep 5
                                        if [ -n "`pbsnodes -l $i`" ]; then 
                                                mail -s "PBS error on $i persists!" $email < /root/data/note
                                        else
                                                mail -s "PBS on $i is OK now! " $email < /root/data/note
                                                rm -f /root/data/$i.pbs
                                        fi
                                        ;;
                                esac
                        fi
                else
# PBS OK
                        if [ -f /root/data/$i.pbs ]; then
# PBS had a problem, but is OK now
                                mail -s "PBS on $i is OK again! " $email < /root/data/note
                                rm -f /root/data/$i.pbs
                        fi
                fi
        fi
done
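
The PBS branch of the script decides what to do from the text that follows "is " in the output of service pbs_mom status. The sketch below isolates that decision so it can be tried without a worker node. classify_status is a hypothetical helper, and the sample status lines are assumptions modelled on typical init script output, so the actual wording on a given system may differ.

```shell
#!/bin/bash
# classify_status is a hypothetical helper, not part of testWN.
# The sample status lines below are assumed, not captured from a real node.

classify_status() {
    local line=$1 state
    # Same extraction as the script: keep the text after the first "is "
    state=$(printf '%s\n' "$line" | awk -F "is " '{ print $2 }')
    case $state in
        running...) echo "running" ;;   # daemon up, yet node flagged by pbsnodes -l
        stopped)    echo "stopped" ;;   # treated as stopped on purpose
        *)          echo "restart" ;;   # anything else: the script tries a restart
    esac
}

classify_status "pbs_mom (pid 4242) is running..."
classify_status "pbs_mom is stopped"
classify_status "pbs_mom dead but pid file exists"
```

Any status line that does not contain "is " yields an empty state and falls through to the restart branch, which matches the catch-all case in the script.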

Contact

Milan Potocnik [milan (d) potocnik (a) rcub (d) bg (d) ac (d) yu]
