First,
If the recent Forums downtime is related to the infrastructure migration and upgrade announced recently (https://news.opensuse.org/2017/07/14/heroes-preparing-to-make-the-leap/), congrats on executing this with relatively little interruption.
But,
The fact that the Forums were down on at least 2 nights/days this past week was noteworthy and perhaps disappointing.
I assume that whether the downtime was expected depends entirely on the objectives stated prior to these events…
- Were they expected and planned or were they unexpected?
- Was there an evaluation that a certain amount of downtime was bearable, given that the Forums (and any related web services) are considered non-critical, as part of defining “allowed downtime”?
- Even if a certain amount of allowed downtime was defined, did anyone think this might have been a unique opportunity to test related strategy and policy, such as disaster recovery? (It’s always nice to know you can execute a disaster recovery when you’re not in the middle of a real emergency.)
- If any of these optional objectives were considered, were sufficient resources allocated for the planning, setup, testing and then actual execution of what may be a complex sequence of steps?
So,
For example, I would think that this would have provided a nice opportunity to:
- Test virtual machine orchestration and management (OpenStack?) in both the USA and Germany colo sites.
- Test the upgrade process of the Forums, which I would think would include connecting and disconnecting the web frontend and the database backends step by step, in turn (or not), in an effort to cut downtime to minutes instead of the many hours I observed.
- Perhaps include a number of staging and test instances.
The real value is that if the above were done, all issues and solutions would be fully documented to improve procedures and practices, perhaps even resulting in white papers.
IMO,
TSU