SG GOOD
From EGEE-see Wiki
Improving the quality of the SEE-GRID infrastructure is an important, ongoing effort. It is necessary for the successful work of SEE-GRID application developers and for the use of the infrastructure by the existing user community. For this reason, pro-active monitoring of SEE-GRID sites is organized in rotating shifts taken by WP3 country representatives (GIMs). During a shift, the GIM is designated as Grid-Operator-On-Duty (GOOD).
The basic idea is that each GIM (i.e. the GIM team from one country) is on shift for one week and opens tickets in the SEE-GRID Helpdesk for sites from all countries that are failing SAM tests or have other problems identified by GStat or by other means. All GIMs are still expected to continually monitor and support the sites from their own countries; this is their day-to-day duty and is independent of GOOD shifts.
Details of the organization of GOOD shifts are given below and will be updated as new information becomes available.
1) Shifts are taken according to the following rotating plan:
2) GOOD shift tickets and GOOD site tickets
In their work, GOODs will encounter two types of tickets: shift tickets and site tickets.
http://wiki.egee-see.org/index.php/SG_Helpdesk_tickets
GOODs are expected to update this Wiki page with new templates for various identified problems, and to add links to useful troubleshooting pages.
At the request of applications that need MPI support on sites, GOODs are expected to test the MPI setup on all SEE-GRID sites that claim to support it. The MPI setup test should be performed at least once a week, and GOODs should ensure that the parallel test job runs at the same time on at least two WNs (so that the ssh setup is tested as well). On sites with SMPSIZE=4 this requires 5 MPI processes; on sites with single-core WNs, 2 processes are enough. The GOOD is therefore responsible for sending the appropriate test job to each site. Since such test jobs require several CPUs, they are likely to take longer to execute than single-CPU jobs, and should be given enough time to complete (1-2 days). If a test job fails (or does not run to completion in a reasonable time), a GOOD site ticket should be created. More details can be found on the Wiki page on Testing MPI support.
3) Hand-over report
When creating the shift ticket for the next GOOD, the previous GOOD should enter a brief hand-over report: a list of major problems that remain to be solved, observations about hard cases, the number of GOOD site tickets created during the last week, the overall number of GOOD site tickets still open, the number of GOOD site tickets closed last week, etc. (usually 2 paragraphs at most).
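The three ticket counts requested in the hand-over report could be tallied along the following lines. This is only a sketch: the `SiteTicket` record, the `handover_counts` function, the site names and the dates are all hypothetical, and it assumes the GOOD has already collected the opening and closing dates of the GOOD site tickets by hand. It does not query the SEE-GRID Helpdesk in any way.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SiteTicket:
    """Hypothetical record of one GOOD site ticket (not a Helpdesk data model)."""
    site: str
    opened: date
    closed: Optional[date] = None  # None means the ticket is still open


def handover_counts(tickets, shift_start, shift_end):
    """Return the ticket counts listed in the hand-over report for one shift."""
    # Tickets opened during the shift week (both end dates inclusive).
    created = sum(1 for t in tickets if shift_start <= t.opened <= shift_end)
    # Tickets closed during the shift week.
    closed = sum(
        1 for t in tickets
        if t.closed is not None and shift_start <= t.closed <= shift_end
    )
    # Tickets opened by the end of the shift and not closed by then.
    still_open = sum(
        1 for t in tickets
        if t.opened <= shift_end and (t.closed is None or t.closed > shift_end)
    )
    return {"created": created, "closed": closed, "open": still_open}


# Made-up example data for a one-week shift.
tickets = [
    SiteTicket("site-a.example.org", date(2008, 3, 3)),
    SiteTicket("site-b.example.org", date(2008, 2, 25), date(2008, 3, 5)),
    SiteTicket("site-c.example.org", date(2008, 2, 20)),
    SiteTicket("site-d.example.org", date(2008, 3, 4), date(2008, 3, 6)),
]
counts = handover_counts(tickets, date(2008, 3, 3), date(2008, 3, 9))
print(counts)
```

The counts cover only the numeric part of the report; the list of unsolved problems and the observations about hard cases still have to be written by the GOOD.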
4) GOOD shift ticket updates
During the shift, the GOOD will update the shift ticket with all relevant information about newly created site tickets, updates to operational documents on the Wiki, etc. The ticket should be closed when the shift is finished, and a new ticket, containing the hand-over report, opened to the next GOOD.
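For the MPI test jobs described in section 2, the process counts quoted there (5 processes for SMPSIZE=4, 2 processes for single-core WNs) follow from a simple rule: request one process more than a single WN can hold, so that the job cannot fit on one node. A minimal sketch of that rule is given below. The function name `min_mpi_processes` and the site names are hypothetical, and the sketch assumes that a WN runs at most as many MPI processes as it has cores.

```python
def min_mpi_processes(cores_per_wn: int) -> int:
    """Smallest process count that cannot fit on a single WN.

    Assumes each WN runs at most `cores_per_wn` processes, so a job with
    one process more must be spread over at least two WNs.
    """
    if cores_per_wn < 1:
        raise ValueError("cores_per_wn must be at least 1")
    return cores_per_wn + 1


# Hypothetical sites mapped to their cores per WN (SMPSIZE).
sites = {"site-a.example.org": 4, "site-b.example.org": 1}
for site, smp_size in sites.items():
    print(f"{site}: request {min_mpi_processes(smp_size)} MPI processes")
```

How the resulting process count is passed to the job submission depends on the middleware in use and is not covered here.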
