openSUSE 11.4 - server hangs

Hello,

After 4 months of working perfectly, my server running openSUSE 11.4 started to freeze. When I try to log in to it from another server in the same network, I get only the following message:
Last login: Sat Sep 10 23:17:38 2011 from blablabla
Have a lot of fun…

and nothing else. From my home PC I can’t connect to it at all, only from the neighbouring server.
I have 16 GB of RAM and don’t believe that swap is getting full…

Could somebody explain this server behaviour and how to find the cause?

And now I can’t even connect from the neighbouring server. I get:
ssh_exchange_identification: Connection closed by remote host

It’s so strange…

Did you try to log in to it directly, and does that fail as well? If only ssh is not working, that does not mean the server froze.

You should try that and look for the process that is consuming your resources. Maybe your server is under attack, e.g. an ssh brute-force attempt.
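One way to check for an ssh brute-force attempt is to count failed logins per source address in the system log. This is only a rough sketch: it assumes sshd writes its usual "Failed password ... from <address>" messages to a log you can read, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches sshd's "Failed password" messages for both valid and invalid users.
# The capture group is the source address of the attempt.
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+)")


def count_failed_logins(lines):
    """Count failed ssh password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

To use it, open the log file (for example /var/log/messages, if that is where sshd logs on your system), pass the file object to count_failed_logins() and look at .most_common(10). A handful of addresses with hundreds of attempts would point to a brute-force attack.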

If the server is not under heavy load, it might be a hard disk failure, with the system running only from RAM.

What do you mean by “nothing else”? Do you mean you cannot issue any command to the shell?

Try to solve it step by step: log in directly to the machine, check the disk space (df -h) and the logs. If that works, it might be an ssh issue.
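If you want the disk space check to run from a script (for example from cron, so you get a warning before the next freeze), something like the following would do. It is a minimal sketch; the function names and the 90% threshold are arbitrary choices, and the percentage can differ slightly from what df -h prints because df accounts for reserved blocks differently.

```python
import shutil


def usage_percent(path):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo filesystems report a size of zero.
        return 0.0
    return 100.0 * usage.used / usage.total


def nearly_full(paths, threshold=90.0):
    """Return the paths whose filesystem is at or above `threshold` percent."""
    return [path for path in paths if usage_percent(path) >= threshold]
```

For example, nearly_full(["/", "/var", "/tmp"]) returns the mount points that need attention, or an empty list if everything is fine.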

Cheers.

Thanks for your reply!

Yes, I have tried to log in directly and it is the same situation: I cannot send any command to the shell. After the last reboot it worked for 2 hours and froze again. Drive space was okay and there were no errors in the logs.
I’m using an SSD with TRIM support and a newer, recompiled kernel rather than SUSE’s default one. Could it be a kernel issue, or is it a hardware problem? What do you think? :)

Hello again,

I am not sure I can help with your specific configuration (I do not use openSUSE or an SSD with TRIM at the moment), but in my experience with this kind of “freezing” these were the culprits:

  1. hard disk / hardware failure - maybe you can try to boot a live CD/USB distro and do some checking there

  2. ulimit issues - a process was opening too many files (in my case it was either nscd or cron). It was hard to track down the process, since I could only see the PID in /var/log/messages, and I did a kill -HUP on it

  3. LDAP + DNS issue (the connection to the LDAP server had problems)

In the first case I could not log in to the shell or run commands in an already open console, so I had to reset the server, or sometimes a hard power-off worked. In the other two situations the server was dead slow (I really had to be patient between issuing a command and getting its output or error), but I could eventually solve the problem.

In all three cases the services on the server were still running.
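For the ulimit case (item 2 above), the process that is opening too many files can be found by counting the entries under /proc/<pid>/fd. The sketch below assumes a Linux /proc filesystem and enough privileges to read other users' processes (normally root); the function names are made up for illustration.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of open file descriptors of `pid`, or None if unreadable."""
    try:
        return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))
    except OSError:
        # The process has exited, or we lack permission to inspect it.
        return None


def top_fd_consumers(limit=5, proc_root="/proc"):
    """Return up to `limit` (pid, fd_count) pairs, highest count first."""
    results = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return results
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_count = count_open_fds(entry, proc_root)
        if fd_count is not None:
            results.append((int(entry), fd_count))
    results.sort(key=lambda item: item[1], reverse=True)
    return results[:limit]
```

Calling top_fd_consumers() as root lists the processes holding the most descriptors. Compare the counts with the limit reported by ulimit -n to see whether one of them is close to running out.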

In any case, I would suggest that you look carefully through your logs, since you might have missed the error, and also see what the boot process reports (maybe use splash=verbose at bootup).

Cheers and good luck.

Hello,

In case someone is interested, I have found the cause of the sudden hangs:

kernel: [802.825331] [drm:pch_irq_handler] ERROR PCH poison interrupt

So it was a bug in the kernel (2.6.38.2-1.2-desktop), and I replaced it with the newest one, 3.0.4-2-desktop. Hope it helps!
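For anyone hunting a similar problem, kernel error lines like the one above are easy to miss in a long log. A rough sketch for pulling them out, assuming the kernel messages carry a "kernel:" tag and the word ERROR as in this case (the function name is illustrative only):

```python
import re

# Lines tagged "kernel:" that contain the word ERROR, e.g. messages from the drm driver.
KERNEL_ERROR = re.compile(r"kernel:.*\bERROR\b")


def find_kernel_errors(lines):
    """Return the log lines that look like kernel error messages."""
    return [line.rstrip("\n") for line in lines if KERNEL_ERROR.search(line)]
```

Feed it the lines of /var/log/messages (or the output of dmesg with the pattern adjusted, since dmesg lines have no "kernel:" tag) and check what was logged shortly before each freeze.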