Upgrades
Procedures for upgrading a site
Scope
This page is a guide to the best practice to be followed by the EGEE SEE ROC as a whole, and by the individual sites within the SEE ROC, when upgrading.
This guide is complementary to the upgrade policy, stated in
This guide does not cover special releases, such as security-related ones, where the time window for the upgrade may be rather short. It is not a policy document, and deviations from the procedures described here are acceptable.
Upgrade Phases
Preparation
The Installation Support Team follows the established procedure for deploying pre-release versions, reporting problems, etc. The administrators of the other sites plan their upgrade based on the expected release notes: the timing, and any reconfiguration of the site that is needed (OS version upgrade, kernel upgrades, new services, etc.).
Initial upgrade phase
After the release is out, sites announce their upgrade plans. The Installation Support Team releases a short installation guide describing potential problems, especially anything that is NOT in the release notes.
Two or three small to medium sites with the most experienced site administrators within the SEE ROC upgrade during the first week after the release. They report the problems encountered and their solutions to the ROLLOUT list and egee-sa1-tech, and update the installation guide. The idea is to get some positive publicity for the ROC by showing technical expertise in fixing the problems in the release, and to make the transition easier for the other sites.
Regular upgrade
Most sites within the ROC SHOULD upgrade together, during the second week after the release. This will give us good statistics. By this time most bugs should have been fixed, and the solutions to the various problems should be in the guide. There should be one guide only.
Post upgrade phase
It frequently happens that problems are found and resolved later in the rollout, while the sites that upgraded first still have those problems present. For important problems we should therefore use the helpdesk to check that all our sites have fixed them.
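The phases above can be sketched as a small helper that maps a date to a rollout phase. This is only an illustration: the function name `upgrade_phase`, the phase labels, and the exact day boundaries (days 0-6 for the first week, days 7-13 for the second) are assumptions made here, not something the ROC has defined.

```python
from datetime import date


def upgrade_phase(release_day: date, today: date) -> str:
    """Return the rollout phase for `today`, given the release date.

    Assumed boundaries (hypothetical, for illustration only):
      before the release   -> preparation
      days 0-6 after it    -> initial upgrade (a few experienced sites)
      days 7-13 after it   -> regular upgrade (most sites)
      day 14 onwards       -> post upgrade (follow-up via the helpdesk)
    """
    days_since_release = (today - release_day).days
    if days_since_release < 0:
        return "preparation"
    if days_since_release < 7:
        return "initial upgrade"
    if days_since_release < 14:
        return "regular upgrade"
    return "post upgrade"


if __name__ == "__main__":
    # Arbitrary example dates, not tied to any real release.
    release = date(2006, 5, 1)
    for day in (date(2006, 4, 28), date(2006, 5, 3), date(2006, 5, 10), date(2006, 5, 20)):
        print(day.isoformat(), upgrade_phase(release, day))
```

A site could use something like this when planning downtime, but the real schedule depends on the release notes and on what the early sites report.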
Upgrading a site
I will only describe the best practice as I see it for a given site to accomplish the upgrade.
Time
Ideally the upgrade begins and finishes within the working week. This way the site can get the best support if problems arise, and problems will be discovered and resolved more easily.
Downtime is announced sufficiently in advance. Queues are drained before entering downtime. The idea is to give the best service, not to claim the fastest upgrade.
Upgrading
Since we usually have lots of boxes, test installations can be performed. There is no harm in temporarily installing a fake CE on a WN that is out of service, just to see if the procedure works as expected. This should be done with care, though: you don't want this CE to appear in the top-level BDII :).
Most VOs pay no attention to downtimes, so you should keep your queues closed during the upgrade. The default creation of queues on the CE in the open state is especially annoying. Do NOT close the dteam queue before or during the upgrade: you lose important information, and you may lose the green dot on the map for your site.
Make sure your SE stays operational as much as possible.
Getting back to production
After the upgrade is complete comes the testing phase. Site administrators MUST test their site extensively before going back into production. This means verifying that grid functionality is in place: running jobs, gridftp transfers, the information system as shown in gstat, the R-GMA client/server, and accounting.
Sites CANNOT go back into production if they fail SFTs.
Sites SHOULD NOT go back into production if they have WARNING or ERROR status in gstat, unless the reason for this status is well known and is not considered a problem (e.g., the problem is in the gstat software itself).
Sites that do not satisfy these two points extend their downtime, describe their problems to egee-sa1-tech and resolve them ASAP.
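The two conditions above amount to a simple go/no-go check, sketched below. The names `SiteStatus`, `Decision`, and `production_decision` are hypothetical, and the status values are entered by hand here; nothing in this sketch queries SFT or gstat.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    GO = "back to production"
    EXTEND_DOWNTIME = "extend downtime and report to egee-sa1-tech"


@dataclass
class SiteStatus:
    sft_passed: bool                 # did the site pass the SFTs?
    gstat_status: str                # "OK", "WARNING" or "ERROR" (entered by hand)
    gstat_issue_known_harmless: bool = False  # e.g. the problem is in gstat itself


def production_decision(status: SiteStatus) -> Decision:
    """Apply the two rules: SFT failure always blocks; a gstat WARNING/ERROR
    blocks unless its cause is well known and not considered a problem."""
    if not status.sft_passed:
        return Decision.EXTEND_DOWNTIME
    gstat_bad = status.gstat_status.upper() in ("WARNING", "ERROR")
    if gstat_bad and not status.gstat_issue_known_harmless:
        return Decision.EXTEND_DOWNTIME
    return Decision.GO


if __name__ == "__main__":
    # Illustrative values only.
    print(production_decision(SiteStatus(True, "OK")).value)
    print(production_decision(SiteStatus(True, "WARNING")).value)
    print(production_decision(SiteStatus(True, "ERROR", gstat_issue_known_harmless=True)).value)
    print(production_decision(SiteStatus(False, "OK")).value)
```

Note that the SFT rule is checked first and has no exception, while the gstat rule can be overridden by the administrator's judgement, mirroring the difference between CANNOT and SHOULD NOT in the text.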
Support during upgrade
The SEE ROC has an Installation Support Team for dealing with problems arising during the upgrade. Problems encountered SHOULD therefore be resolved mostly by reporting them to egee-sa1-tech. Problems that look more difficult to solve, or that require some sort of interaction, may also be resolved via the helpdesk. The helpdesk approach is meant to scale better, in my opinion, so we should gradually increase its use.
Sites DO NOT request support from the LCG-ROLLOUT list unless the problem is quite special or deep, or requires expertise that we are not expected to have. The ROLLOUT list was established before the ROCs were created, and ideally it should not be needed anymore.
Sites DO report problems in the release itself - bugs, missing functionality, etc.
