Updates on the forum outage today?

Hello everyone,

The forums were down for most of today, and the status page wasn’t updated to show the downtime for a long while. Later on it did show the outage as a major one, but now it is marked as resolved and there’s no incident report or any update on it, as if it never happened.

Does anyone have details on the outage?
Are there steps planned to improve the platform?
How can we help out?

Clearly the monitoring could/should be improved as the status page was blissfully unaware of anything being wrong until I guess someone manually updated it. :see_no_evil:

We will schedule and contain any maintenance it deems as Major to the following time:
Thursday, 08:00 - 10:00 CET for all openSUSE Services

You may want to get in contact with the infrastructure team.

Are we to assume some maintenance was happening and it all went very wrong?
Considering the number of people on the forums, it’s surprising there was no communication at all during the maintenance and outage. Do forum admins not have access to update the status dashboard?

The upper part of the status page contains information on how to stay informed…

From the IRC:
[14:32] forums.o.o is back - for some reason the VM did not start up when it should have.


Yes, autostart did not work. We had a cold boot scheduled at the same time as the IDP (login) upgrade…

It happens…


Thanks for pointing out other avenues of contact, @hui, but a status page is meant to inform end users of the service status, and for the most part it failed to do that. :face_with_monocle:

I hope the VM supervisor (and monitoring) is fixed so it doesn’t happen again :wink:

@pavinjoseph updating status.o.o is a manual process. Unlikely. The openSUSE infrastructure is run on a best-effort basis by just a few people in the EU timezone. Always expect the unexpected…


Ah, that makes sense. If it can’t be automated, at least the infra people could grant forum admins/mods permission to update the status page :confused:

@pavinjoseph we have the maintenance window scheduled as indicated. Anything can go down at that time and sometimes not come back :wink: There was also an announcement about IDP.


Yep, just an unanticipated issue during a restart to allocate some additional resources to the system. Pretty rare kind of issue for us, but I was in the weekly Heroes meeting today to understand what happened and what we can do to prevent a recurrence.

Appreciate the concern - we’re taking steps to prevent a repeat. :slight_smile:

@hui’s mention of the contact info at the top of that page is correct - if you see something down (and can confirm it using something like downforeveryoneorjustme.com), please do send an e-mail to the admin address to get a ticket logged. The problem may already be being looked at, but if it’s not, someone will pick the ticket up.

I understand that work is under way to implement more robust monitoring. There is some already, but as Malcolm said, it’s a ‘best effort’ by a dedicated team of volunteers. :slight_smile:


I was about to post that down for everyone link as a suggestion for verifying if the forums are down, but the forum editor alerted me that you had already posted the link. That is a nice touch in forum operability!


Thank you very much :slightly_smiling_face:
I will submit a ticket next time if nothing has been submitted already.
This is the right place, right?

Ah, this service wouldn’t have been much help for yesterday’s outage: visiting the forums showed me a maintenance page telling me to go grab a cup of coffee, with a gecko lying on its back with crossed-out eyes :dizzy_face:

From that we can conclude the main web server / reverse proxy and TLS termination were working correctly, but the upstream service for this forum was down.
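
A rough sketch of why a plain "is it reachable?" probe misses this kind of outage: the check has to look at what the proxy actually returned, not just whether it answered. Everything here is an assumption for illustration. The function name, the category labels and the "maintenance" marker text are made up, and I don't know what status code or wording the real maintenance page uses.

```python
from typing import Optional

# Hypothetical marker; the real maintenance page wording is not known here.
MAINTENANCE_MARKERS = ("maintenance",)


def classify_response(status_code: Optional[int], body: str = "") -> str:
    """Classify the result of one HTTP probe against a site behind a proxy.

    status_code is None when no HTTP response arrived at all
    (DNS failure, connection refused, timeout).
    """
    if status_code is None:
        # Nothing answered: the front end itself is unreachable.
        return "unreachable"
    text = body.lower()
    if status_code >= 500 or any(marker in text for marker in MAINTENANCE_MARKERS):
        # The proxy answered, but with an error or a maintenance page,
        # which suggests the upstream service behind it is down.
        return "backend_down"
    if 200 <= status_code < 400:
        return "up"
    # Remaining 4xx responses: reachable, but something else is off.
    return "error"
```

A reachability-only service would treat every case except `"unreachable"` as "up", which matches what happened here: the proxy answered, so from the outside the site looked alive.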


An e-mail to admin@o.o will create a ticket in Progress for you.

Fair point on the down detection service.
