Worker Nodes Failure Alerting Tool
From EGEE-see Wiki
This wiki page is part of the SEE-GRID Gridification Guide. It was contributed by the Belgrade University Computer Centre.
Introduction
This page describes a simple alerting tool that checks the Worker Nodes (WNs) attached to a Computing Element (CE) for network and PBS failures, including hardware failures that leave a machine unresponsive. When an error is detected, an alert e-mail is sent to the administrator at the address specified in the script.
An e-mail is sent only once when an error occurs, and once more when the error is resolved.
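The once-per-transition behaviour rests on marker files: a failure creates a file for the node, and the alert is skipped while that file exists; a later successful check finds the file, reports the recovery, and deletes it. The following is a minimal sketch of that pattern only, not the tool itself. The `report_state` and `notify` helpers and the temporary `STATE_DIR` are hypothetical stand-ins for the script's inline logic, its `mail` calls, and `/root/data`.

```shell
#!/bin/bash
# Sketch of the "alert once on failure, once on recovery" pattern.
# STATE_DIR and notify() are hypothetical stand-ins for /root/data and mail.
STATE_DIR=$(mktemp -d)

notify() {
    # The real script sends mail; this sketch only prints the subject.
    echo "ALERT: $1"
}

# report_state NODE CHECK STATUS
#   STATUS is 0 when the check passed, non-zero when it failed.
report_state() {
    local node=$1 check=$2 status=$3
    local flag="$STATE_DIR/$node.$check"
    if [ "$status" -ne 0 ]; then
        # Failure: alert only if no marker file records an earlier alert.
        if [ ! -f "$flag" ]; then
            notify "$node failed $check check"
            touch "$flag"
        fi
    elif [ -f "$flag" ]; then
        # Recovery: the marker exists, so a failure was reported earlier.
        notify "$node passed $check check again"
        rm -f "$flag"
    fi
}

report_state wn01 ping 1   # prints an alert and creates the marker
report_state wn01 ping 1   # silent: marker already exists
report_state wn01 ping 0   # prints the recovery message, removes the marker
report_state wn01 ping 0   # silent: nothing to report
```

Because the state lives in files rather than in memory, it survives between separate cron invocations, which is what lets a script that starts fresh every few minutes remember that it has already raised an alert.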
The script runs on the CE and tests each WN in turn, so passwordless root ssh access from the CE to the WNs must be enabled. It is intended to be run from cron; in the example given below, it runs every 5 minutes.
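Before scheduling the tool, it may be worth confirming that the ssh prerequisite actually holds for every node, since a password prompt under cron would simply hang or fail. The sketch below is one possible pre-flight check, assuming the same PBS nodes file that the tool reads. The names `NODES_FILE`, `SSH_CMD`, and `check_wn_ssh` are hypothetical and exist only for this illustration; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```shell
#!/bin/bash
# Pre-flight check: can root on the CE reach every WN over ssh without a password?
NODES_FILE=${NODES_FILE:-/var/spool/pbs/server_priv/nodes}
SSH_CMD=${SSH_CMD:-ssh}   # hypothetical override, handy for dry runs

check_wn_ssh() {
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    "$SSH_CMD" -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true
}

# Only loop when the nodes file is readable (i.e. when run on the CE).
if [ -r "$NODES_FILE" ]; then
    for node in $(awk '{ print $1 }' "$NODES_FILE"); do
        if check_wn_ssh "$node"; then
            echo "$node: ssh OK"
        else
            echo "$node: ssh FAILED (key-based root login not working)"
        fi
    done
fi
```

If a node fails this check, key-based login is usually set up with the standard OpenSSH tools `ssh-keygen` and `ssh-copy-id`; the exact procedure depends on the site's security policy.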
Script source
The source of the testWN script is given below.
#!/bin/bash
#
# filename: /root/bin/testWN
#
# description: network connection and PBS test on WN's
# prerequisite: ssh root connection from CE to WN's without password;
#               /root/data/note must exist (it is used as the mail body)
# work: from cron on CE
# cron example: "*/5 * * * * /root/bin/testWN"

# email where alerts will be sent
email=admin-email

# list all nodes (first column of the PBS nodes file)
for i in $(awk '{ print $1 }' /var/spool/pbs/server_priv/nodes); do
    if ! ping -c 1 -W 2 "$i" > /dev/null; then
        if [ ! -f "/root/data/$i.ping" ]; then
            # network problem, not reported yet
            mail -s "$i not responding on ping" "$email" < /root/data/note
            touch "/root/data/$i.ping"
        fi
    else
        # network OK
        if [ -f "/root/data/$i.ping" ]; then
            # network had a problem, but is OK now
            mail -s "$i responding on ping again!" "$email" < /root/data/note
            rm -f "/root/data/$i.ping"
        fi
        # check PBS: pbsnodes -l prints the node only if it is in a bad state
        if [ -n "$(pbsnodes -l "$i")" ]; then
            # PBS problem
            if [ ! -f "/root/data/$i.pbs" ]; then
                touch "/root/data/$i.pbs"
                pbsstatus=$(/usr/bin/ssh "$i" service pbs_mom status | awk -F "is " '{ print $2 }')
                case $pbsstatus in
                    running...)
                        mail -s "PBS on $i running...! Unknown error!!!" "$email" < /root/data/note
                        ;;
                    stopped)
                        mail -s "PBS on $i manually stopped" "$email" < /root/data/note
                        ;;
                    *)
                        # any other status: try restarting pbs_mom once
                        mail -s "PBS error on $i. Trying restart!" "$email" < /root/data/note
                        /usr/bin/ssh "$i" service pbs_mom restart
                        sleep 5
                        if [ -n "$(pbsnodes -l "$i")" ]; then
                            mail -s "PBS error on $i persists!" "$email" < /root/data/note
                        else
                            mail -s "PBS on $i is OK now!" "$email" < /root/data/note
                            rm -f "/root/data/$i.pbs"
                        fi
                        ;;
                esac
            fi
        else
            # PBS OK
            if [ -f "/root/data/$i.pbs" ]; then
                # PBS had a problem, but is OK now
                mail -s "PBS on $i is OK again!" "$email" < /root/data/note
                rm -f "/root/data/$i.pbs"
            fi
        fi
    fi
done
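The PBS branch of the script decides what to do from the text printed by `service pbs_mom status`: awk splits the line on the string "is " and the second field selects the `case` branch. The sketch below isolates that step so it can be tried without a cluster. The `classify_pbs_status` function is hypothetical, and the three sample strings are assumed examples of init-script output, not captured logs; the actual wording depends on the distribution's init scripts.

```shell
#!/bin/bash
# classify_pbs_status: mirror the three branches of the script's case statement.
# The sample strings used below are assumed init-script outputs, not real logs.
classify_pbs_status() {
    local state
    # Same parsing as the script: take whatever follows the first "is ".
    state=$(printf '%s\n' "$1" | awk -F "is " '{ print $2 }')
    case $state in
        running...) echo "running" ;;   # daemon up, yet PBS reports the node as bad
        stopped)    echo "stopped" ;;   # daemon was stopped deliberately
        *)          echo "restart" ;;   # anything else: the script tries a restart
    esac
}

classify_pbs_status "pbs_mom (pid 2817) is running..."   # prints: running
classify_pbs_status "pbs_mom is stopped"                 # prints: stopped
classify_pbs_status "pbs_mom dead but pid file exists"   # prints: restart
```

Any status line without the string "is " leaves the second awk field empty, so it falls through to the restart branch. This also covers the case where ssh itself fails and returns no output.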
Contact
Milan Potocnik [milan (d) potocnik (a) rcub (d) bg (d) ac (d) yu]
