Recent Logins giving me Internal server error and 500 on screen

And the info is appreciated. As noted, this appears to be on the backend of the identity provider, and is not browser-related.

It’s an intermittent issue, and another member of the Heroes team and I have spent the better part of the past 2 days digging deep into troubleshooting it, with some assistance from a couple of other folks. We’ve got a couple of ideas of things to try, but it’s going to take some time to resolve.

With regard to the logout - the way OIDC works, single log-out (SLO), while conceptually feasible, is pretty challenging to implement effectively. What happens is the identity provider (IdP, id.o.o in our case) has its own authentication token. The Ipsilon identity provider system doesn’t provide an SLO endpoint as part of its well-known endpoints, so software that depends on that for knowing where to redirect for a full logout doesn’t know where to go to do that.
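As a rough illustration of what "depends on the well-known endpoints" means: OIDC providers publish a metadata document, and the RP-Initiated Logout specification defines an `end_session_endpoint` field in it. The sketch below checks an already-parsed metadata dictionary for that field. The two example documents and the `idp.example` URLs are made up for illustration; they are not Ipsilon's actual metadata, and the `find_logout_endpoint` helper is hypothetical.

```python
from typing import Optional


def find_logout_endpoint(discovery: dict) -> Optional[str]:
    """Return the advertised logout URL, or None if the IdP does not publish one."""
    endpoint = discovery.get("end_session_endpoint")
    # Treat a missing, empty, or non-string value as "no logout endpoint".
    if isinstance(endpoint, str) and endpoint:
        return endpoint
    return None


# Hypothetical metadata documents (assumed shapes, not taken from a real IdP).
with_logout = {
    "issuer": "https://idp.example",
    "authorization_endpoint": "https://idp.example/authorize",
    "end_session_endpoint": "https://idp.example/logout",
}
without_logout = {
    "issuer": "https://idp.example",
    "authorization_endpoint": "https://idp.example/authorize",
}

for name, doc in (("with_logout", with_logout), ("without_logout", without_logout)):
    target = find_logout_endpoint(doc)
    if target is None:
        print(f"{name}: no logout endpoint advertised, only the local session can be cleared")
    else:
        print(f"{name}: redirect to {target} for a full logout")
```

A client that finds no such field has nowhere to send the browser for an IdP-side logout, which matches the behaviour described above.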

In the case of Discourse, it issues its own authorization token, and that’s what the Logout link clears here - but when the token for id.o.o is still valid (as it is because the Discourse OIDC plugin doesn’t know about it), when you hit the IdP a second time, it says you’re already logged in, and essentially refreshes the token for forums.o.o without requiring a login.

The same can hold true if the logout goes the other way - while forums.o.o won’t cause id.o.o to issue a new token for itself, forums.o.o will retain its own token, and thus can remain authenticated.
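The two directions described above can be modelled with a toy sketch. This is a deliberately simplified model, not how Discourse or Ipsilon are implemented: sessions are just sets of usernames, there are no token lifetimes, and the class and method names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IdentityProvider:
    """Toy IdP that only remembers which users have a live session."""
    sessions: set = field(default_factory=set)

    def login(self, user: str) -> None:
        self.sessions.add(user)

    def logout(self, user: str) -> None:
        self.sessions.discard(user)

    def has_session(self, user: str) -> bool:
        return user in self.sessions


@dataclass
class RelyingParty:
    """Toy application (for example a forum) that keeps its own token per user."""
    idp: IdentityProvider
    tokens: set = field(default_factory=set)

    def visit(self, user: str) -> str:
        if user in self.tokens:
            return "already logged in"
        if self.idp.has_session(user):
            # The IdP session is still valid, so a new local token is issued
            # without showing a login form.
            self.tokens.add(user)
            return "silently re-authenticated"
        return "login form shown"

    def logout(self, user: str) -> None:
        # Clears only the application's own token, not the IdP session.
        self.tokens.discard(user)


idp = IdentityProvider()
forum = RelyingParty(idp)

idp.login("alice")
forum.visit("alice")

forum.logout("alice")          # logout at the application only
print(forum.visit("alice"))    # IdP session survives

idp.logout("alice")            # logout at the IdP only
print(forum.visit("alice"))    # application token survives
```

In this model, neither logout on its own ends the user's access, which is the gap a working SLO implementation has to close.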

There’s a lot of work that ultimately has to go into making SLO work, including ensuring that the token lifetimes for the application (Relying Party or Service Provider) and the IdP don’t cause each other issues.

Most sites don’t end up implementing it across all systems because doing so is incredibly challenging.

(FYI, I spent nearly a decade working for a company that makes an Identity Provider and got deep into IAM solutions - and prior to that, I was a SME on Novell’s directory technology - wrote a couple of books on it.)

@hendersj Thank you very much for the detailed information and explanation you provided so far and all the work you and others have put into investigating that problem.

Maybe it helps to know that the problem has not been there from the very beginning (i.e. when the forum software changed and the ID portal was introduced). I started seeing the problem in early 2024 (maybe April, sorry for not being more accurate).

You’re welcome. It is looking like a backend database issue in the identity provider, possibly a failing maintenance task that’s causing a weird set of conditions. The timing doesn’t correlate with any specific change that I’m aware of, but I’ll make sure that info is shared with the others troubleshooting just so we can see if there’s a correlation.

Something that I noticed while testing yesterday is that you can tell when it’s going to fail - when you get to the login form, there’s a banner at the top that says which system is asking for authentication. If it shows “discourse.opensuse.org”, then the authentication should work. If it shows “src.opensuse.org”, then it will fail.

I hadn’t noticed that in the banner before in my previous fails, but it was there yesterday when I had the single failure that occurred while we were doing additional troubleshooting.

I cannot confirm the first part of this:

When I logged in just a few minutes ago, the banner on the ID portal showed something like “discourse.opensuse.org is asking for authentication”, and after entering my credentials and pressing the login button it took me to the 500 error page.

I can’t remember seeing the “src.opensuse.org is asking for authentication” message, but I will look out for it. In the last few days I rarely saw a “normal” login, so I can probably report a hit soon.

Wanted to attach this image to your message.

image

Maybe what I shall say now is ridiculous, so no offense, right?

I feel that the intermittent character of this “Heisenbug” might be the result of some race condition. Signals are ping-ponging back and forth between servers; sometimes things go well, other times they don’t.

So, what might that “something” be? - Could it be that it is a timing issue? Can the code be “fuzzied”, i.e., manipulated by just placing more or less arbitrary delays at appropriate places (e.g., before a URL fetch statement or such), to see if you get some more intel out of it?
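The delay-injection idea could look roughly like the sketch below: wrap a call so that a random pause happens just before it, then run the suspect code path repeatedly and watch whether the failure rate changes. The `call_with_jitter` helper is hypothetical, and `fetch_something` is a stand-in for whatever call is suspected, not code from the actual identity provider.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_jitter(
    func: Callable[[], T],
    max_delay: float,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Wait a random time in [0, max_delay] seconds, then call func."""
    delay = rng.uniform(0.0, max_delay)
    sleep(delay)
    return func()


def fetch_something() -> str:
    # Placeholder for the real operation (e.g. a URL fetch) under suspicion.
    return "ok"


# A seeded generator makes a run repeatable if a particular delay pattern
# turns out to trigger the failure.
rng = random.Random(1234)
results = [call_with_jitter(fetch_something, 0.01, rng) for _ in range(5)]
print(results)
```

Passing the `sleep` function in as a parameter is a design choice that keeps the helper testable without real waiting.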

Just an idea, you know much more about these things, I guess. - Regards, M.

BTW, I get the impression that the frequency has been increasing over the last few days (I now get it once a day, on one or maybe two logins a day). Just because we are discussing it? :sweat_smile:

Appreciate the feedback, folks.

We think it is being caused by a large SQL transaction that’s causing tmp space on a drive to fill up very briefly. We don’t yet understand why it’s affecting tmp space, because it’s not the database’s own disk usage that’s causing the issue (the database is on a different host with hundreds of GB of free space). The transaction itself appears to be a maintenance transaction that runs periodically.

Not 100% sure that’s the cause, but it is correlated with the failures we’ve been able to reproduce, so it seems to at least be related in some way.
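Because the space fills up only very briefly, a one-off check would likely miss it. One way to catch it is to sample free space at short intervals and keep the lowest value seen. The sketch below does that with Python's `shutil.disk_usage`; the `/tmp` path, the sample count, and the interval are assumptions for illustration, since the thread does not say which path or host is involved, and `min_free_bytes` is a hypothetical helper.

```python
import shutil
import time
from typing import Callable


def min_free_bytes(
    path: str,
    samples: int,
    interval: float,
    usage: Callable = shutil.disk_usage,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Sample free space on path and return the lowest value observed."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    lowest = usage(path).free
    for _ in range(samples - 1):
        sleep(interval)
        lowest = min(lowest, usage(path).free)
    return lowest


if __name__ == "__main__":
    # Assumed values: watch /tmp for about 10 seconds at 0.5 second intervals.
    lowest = min_free_bytes("/tmp", samples=20, interval=0.5)
    print(f"lowest free space seen: {lowest / 1024**2:.1f} MiB")
```

Running something like this alongside the maintenance transaction would show whether the dip in free space lines up with the failed logins.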
