SG GOOD
From EGEE-see Wiki
Recognizing that improving the quality of the SEE-GRID infrastructure is an important and ongoing effort, necessary both for the work of SEE-GRID application developers and for the use of the infrastructure by the existing user community, pro-active monitoring of SEE-GRID sites is organized in rotating shifts taken by WP3 country representatives (GIMs). During its shift, a GIM is designated as Grid-Operator-On-Duty (GOOD).
Basically, the idea is that each GIM (i.e. the GIM team from one country) is on shift for one week and opens tickets in the SEE-GRID Helpdesk for sites from all countries that are failing SAM tests or have other problems identified by GStat or in some other way. Of course, all GIMs are still expected to continually monitor and support the sites in their own countries; this is their day-to-day duty and is not related to GOOD shifts.
Details of the organization of GOOD shifts are given below, and will be updated according to the available information.
1) Shifts are taken according to the following rotating plan:
(Shift rota table not preserved in this copy.)
EGEE countries are put first, so that the procedure can be polished and refined before new SEE-GRID partners with less experience take their shifts. GIMs will of course act as GOODs.
2) GOOD shift tickets and GOOD site tickets
In their work, GOODs will encounter two types of tickets: shift tickets and site tickets.
To keep a written record of each shift and of the actions the GOOD took during it, a GOOD shift ticket is created for each GOOD shift in the SEE-GRID Helpdesk (Task group: gLite, Category: Site availability). The shift ticket should be created by the outgoing GOOD and assigned to the next one at the end of each shift (e.g. when Switzerland finishes its shift, it will create a new shift ticket on 2007-02-19 for Serbia's GIM, Serbia's GOOD will create a new shift ticket for Croatia's GIM on 2007-02-26, and so on). In addition, a new plugin for the SEE-GRID Helpdesk shows the GOOD shift rota from the Helpdesk page. It is available in the task manager view only (ordinary users cannot access it). The view shows two tables on the right: future GOODs and past GOODs. The Helpdesk system takes care of sending mail notifications about open tasks to the country representative who is on duty that week according to this schedule.
The second type of ticket is the GOOD site ticket. These are normally created by GOODs each day for all sites experiencing operational problems (Task group: gLite, Category: Site availability). When a problem with a site is identified (the site fails some of the SEE-GRID SAM tests, the GStat page for the site displays problems, jobs submitted to the site by the GOOD experience problems, etc.), the GOOD will open a new ticket and assign it to the site. A page with ticket templates and links to useful Wiki pages with troubleshooting information is available here:
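The weekly rotation described above can be sketched in a few lines of Python. This is only an illustration, not the Helpdesk plugin's actual logic: the `ROTA` list contains just the three countries named in the example, `ROTA_START` assumes that Switzerland's shift began one week before the 2007-02-19 hand-over, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical rota: only the Switzerland -> Serbia -> Croatia order and the
# 2007-02-19 / 2007-02-26 hand-over dates come from the text above.
ROTA = ["Switzerland", "Serbia", "Croatia"]
ROTA_START = date(2007, 2, 12)  # assumed start of Switzerland's one-week shift


def good_on_duty(day: date) -> str:
    """Return the country whose GIM acts as GOOD on the given day."""
    weeks_elapsed = (day - ROTA_START).days // 7
    return ROTA[weeks_elapsed % len(ROTA)]


def next_handover(day: date) -> date:
    """Return the date on which the shift covering `day` is handed over."""
    weeks_elapsed = (day - ROTA_START).days // 7
    return ROTA_START + timedelta(weeks=weeks_elapsed + 1)
```

With these assumptions, `good_on_duty(date(2007, 2, 19))` gives Serbia and `next_handover(date(2007, 2, 14))` gives 2007-02-19, matching the example in the text.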
http://wiki.egee-see.org/index.php/SG_Helpdesk_tickets
GOODs are expected to update this Wiki page with new templates for various identified problems, and to add links to useful troubleshooting pages.
On the request of applications that need MPI support on sites, GOODs are expected to test the MPI setup on all SEE-GRID sites that claim to support it. The MPI setup test should be performed at least once a week, and GOODs should ensure that the parallel test job runs at the same time on at least two WNs (to test the ssh setup as well). On sites with SMPSIZE=4 this requires 5 MPI processes, while on sites with single-core WNs 2 processes are enough; the GOOD is therefore responsible for sending the appropriate test job to each site. Since such test jobs require several CPUs, they are likely to take longer to execute than typical single-CPU jobs, and should be given enough time to complete (1-2 days). If a test job fails (or does not run to completion in a reasonable time), a GOOD site ticket should be created. More details can be found on the Wiki page on Testing MPI support.
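The two process counts quoted above (5 for SMPSIZE=4, 2 for single-core WNs) are consistent with a simple rule of thumb: request one process more than fits on a single WN. The sketch below generalizes this to other SMPSIZE values, which is an assumption and not something stated on this page; the function names are hypothetical.

```python
def min_mpi_processes(smpsize: int) -> int:
    """Smallest process count that cannot fit on a single WN.

    Assumes a WN runs at most `smpsize` processes, so one extra process
    forces the job onto a second WN.
    """
    if smpsize < 1:
        raise ValueError("smpsize must be at least 1")
    return smpsize + 1


def spans_two_wns(processes: int, smpsize: int) -> bool:
    """True if a job with this many processes must use at least two WNs."""
    return processes > smpsize
```

For example, `min_mpi_processes(4)` returns 5 and `min_mpi_processes(1)` returns 2, the two cases given in the text.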
3) Hand-over report
When creating a shift ticket for the next GOOD, the outgoing GOOD should enter a brief hand-over report, i.e. a list of major problems that remain to be solved, observations about hard cases, the number of GOOD site tickets created during the last week, the overall number of GOOD site tickets still open, the number of GOOD site tickets closed during the last week, etc. (usually 2 paragraphs at most).
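The three ticket counts in the hand-over report could be derived from a list of site tickets along the following lines. This is a minimal sketch: the `SiteTicket` record and its fields are hypothetical and do not reflect the SEE-GRID Helpdesk's data model or any export format it may offer.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class SiteTicket:
    """Hypothetical record of a GOOD site ticket."""
    site: str
    opened: date
    closed: Optional[date] = None  # None while the ticket is still open


def handover_counts(tickets: List[SiteTicket],
                    shift_start: date,
                    shift_end: date) -> Dict[str, int]:
    """Count tickets created, closed and still open for one shift (inclusive dates)."""
    created = sum(1 for t in tickets if shift_start <= t.opened <= shift_end)
    closed = sum(1 for t in tickets
                 if t.closed is not None and shift_start <= t.closed <= shift_end)
    still_open = sum(1 for t in tickets
                     if t.opened <= shift_end
                     and (t.closed is None or t.closed > shift_end))
    return {"created": created, "closed": closed, "still_open": still_open}
```

The resulting numbers would go into the report next to the free-text notes on unresolved problems and hard cases.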
4) GOOD shift ticket updates
During the shift, the GOOD will update the shift ticket with all relevant information about newly created site tickets, updates to operational documents on the Wiki, etc. The ticket should be closed when the shift is finished, and a new ticket containing the hand-over report opened for the next GOOD.
