openSUSE 13.1 (i586) server hangs / freezes / gets unresponsive randomly

Wow, it’s been a while since I last wrote on these forums. Guess my knowledge is better nowadays. Hi all.

Anyways, today my knowledge ended after days of trying to figure this out.

My setup is quite simple: I have one little home server running in my closet, which contains an Intel Core 2 Duo E6300 1.86GHz processor, an Asus P5LD2 SE motherboard, 2 GB of Kingston DDR2 RAM, a Nexus NX-3500 350W power supply, and a couple of hard drives.

My server is over 5 years old and ran for 4 years without any problems. I’m using LAMPP, postgresql, a couple of media-streaming services like plex and subsonic, plus irssi and bots. For a couple of months now it has been freezing randomly: all connections die, SSH times out, etc. Nothing helps but physically pressing the reset button. When I check the logs there’s nothing in /var/log/messages, only an endless line of ^@^@^@^@^@ characters where the freeze was. After that there is a skip in timestamps (the time during the “timeout”/freezing/unresponsiveness) and then boot messages, nothing in between, no hints about what’s going on.

What I’ve done so far:

  • Replaced my original Deltaco 350W power supply with a Nexus NX-3500, old but works slightly better I think
  • Replaced two of the 512 MB RAM sticks with one 1 GB stick
  • Disabled all fan controls in BIOS, which took the CPU fan from 800 RPM to 1700-2000 RPM
  • Removed an unnecessary graphics card and a WLAN card I have not used for ages
  • Wiped off the old thermal paste on the CPU and applied new; the old paste had been there for about 6 years, there was very little of it and it was dry

No visible difference. Sensors sometimes show slightly high numbers, cores 1 and 2 sometimes over 70 C, but they settle back to a stable 45-50 C quite quickly.

Any hints what’s going on? Is it the system, some process, or a hardware failure? Maybe the motherboard is going bad? Should I start considering a new motherboard/processor or a more effective cooler?

I may have solved this myself. I had the nvidia proprietary drivers left in the system even though I have been running on runlevel 3 for so long. I don’t know if that caused a conflict, but I removed them by running the .run file with the --uninstall option. I also created a bash script which alerts me through pushover if the temperature gets too high (I have cpu_temp as a service which writes the CPU temperature into a file), and I run it as a cronjob every minute:

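#!/bin/bash
# Read the latest CPU temperature written by the cpu_temp service
TEMP=$(cat /home/rolle/.cpu_temp)
# -gt compares numerically (assumes the file holds a whole number)
if [[ $TEMP -gt 60 ]]; then
  sendpushover 'CPU temp is CRITICAL! (60°C+)'
fi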
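One thing to watch in a check like this: inside `[[ ]]`, `>` compares strings rather than numbers, and `-gt` only accepts whole numbers. Below is a minimal sketch of a numeric comparison that also tolerates decimal readings, assuming the file holds a single number. `temp_exceeds` is a hypothetical helper name, not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the reading in $1 is above the limit in $2.
# awk compares numerically, so "61.5" and "100" both count as greater than 60.
temp_exceeds() {
    awk -v t="$1" -v limit="$2" 'BEGIN { exit !((t + 0) > (limit + 0)) }'
}

# Possible use in the cron script (path and sendpushover are taken from the post above):
# temp_exceeds "$(cat /home/rolle/.cpu_temp)" 60 && sendpushover 'CPU temp is CRITICAL! (60°C+)'
```

Leaving the comparison to awk keeps the script free of bash-only syntax, so it should also run under plain sh.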

So when the alert went off, I noticed the CPU (both cores) had been running at 100% for 10 minutes and temperatures were as high as 90 C. The process was perl pisg, which caused the CPU to throttle. I removed that cronjob and am now investing in a new cooler; I think something is broken in the old fan.

Server uptime is now over a day and I think it will stay up. If not, I’ll be back :) Hope this helps someone.

Hi again. Sorry for the double posts but I solved it. It was not the temperatures, nor the perl or python processes. It was the hard drive that was breaking. I wonder why the errors didn’t show up in /var/log/messages but only on the actual monitor of the server. Bad block and superblock errors, a lot of them. Finally, after one freeze and reboot: “No operating system found”.

Got most of the data saved with KNOPPIX and testdisk. Now reinstalling opensuse. Hope this helps someone.

I hope your reinstalled server works better now, but it may be that the corrupted disk blocks are a result of a failing system and not the cause.

90C is crazy high. Is the closet door open so there’s adequate air circulation? The thermal paste layer should be very thin – too much paste (or any gap or non-flat surface mating) prevents it from transferring heat to the heatsink. Isopropyl alcohol is good for cleaning off the old paste. Dust clogged in the CPU or case or PSU fans can also cause overheating. Try vacuuming them and look for dust collecting on the motherboard (esp. along the RAM sockets).

The ^@ characters are just NULs (binary zero).
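A run of NULs at the crash point commonly means the file had been extended but the data never reached the disk before the reset. Here is a small sketch for checking how many NUL bytes a log contains; `count_nuls` is a hypothetical helper name.

```shell
#!/bin/sh
# Count NUL bytes in a file: tr deletes every byte except \0, wc counts what is left.
count_nuls() {
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Example (needs read access to the log):
# count_nuls /var/log/messages
```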

The pair of 512MB couldn’t have contributed much heat beyond a single 1GB, and the C2D will run faster with dual-channels. You could always run a RAM test if you think one of them might be flaky.

Thanks for the answer! That’s true, I probably should have thought of that before. I’m sad to say my server is still not working. The amazing thing is that I replaced the entire hardware. I now have a Compaq PC of the same age with almost the same type of parts, like an Intel dual core processor. No graphics card. I did a clean reinstall of suse again, just to be sure. I had to disable Intel Management Engine to stop a flood of “[65.456035] mei_me 0000:00:03.0: reset: connect/disconnect timeout” messages in the logs.

I should probably mention that the only thing left from the old PC in this one is the other hard drive (1 TB Western Digital Green). I backed up my earlier system to that drive in case I need the files there. I scanned it with e2fsck and it has bad blocks, so I played it safe and am not using it. So it’s unlikely it would cause these freezes, since the system is not on it?

When I was backing up the original Seagate 320 GB drive, the one my system was installed on when I first experienced these freezes, I noticed I couldn’t mount the partition my /home was on in the first place. It just wouldn’t mount. Only KNOPPIX mounted it in its GUI, so that way I got everything backed up. I emptied a spare hard drive from another PC, formatted it to ext4 and installed openSUSE 13.1 on it (minimal server stuff of course, nothing more).

Yeah I double checked that. Most of the time the temperatures were all good but I guess something was wrong with my fan. Nevertheless, I have now entirely new hardware and that has a proper cooler and everything, temperatures are very low all the time. And I STILL experience the same downtime! How is this possible?

I now have newer RAM. I tested the older sticks and nothing special showed up. A question: how do I boot to memtest when I don’t have an option for it in GRUB? (In fact, my server boots without splash, delay or GUI.)

I was installing the same software I used earlier when I experienced this crash again:

  • Latest transmission (I’m using nightly, would that be the problem? usually when downloading a lot of data at once)
  • Couchpotato
  • Sick-Beard
  • eggdrop IRC bot
  • willie IRC bot
  • irssi
  • apache2, mysql/mariadb, php
  • postgreSQL (not yet installed so I highly doubt this is the cause)
  • plexmediaserver (not yet installed so I highly doubt this is the cause)
  • dropbox
  • btsync (not yet installed so I highly doubt this is the cause)
  • subsonic (not yet installed so I highly doubt this is the cause)
  • samba

The only hint I currently have is that it happens when I’m writing lots of data at once or when the system is doing something processor-intensive, like transcoding a video while watching plex, but I’m not sure about that either.

For the very first time in my life I’m out of luck with these things. I’m really not sure any more whether it’s my hard drives, my memory, or something else. I replaced the motherboard and processor, so I can rule those out at this point? I don’t know.

I need help figuring this out.

On 2013-12-29 01:06, rollex2 wrote:

> My setup is quite simple, I have one little home server running in my
> closet, which contains Intel Core 2 Duo E6300 1.86GHz processor, Asus
> P5LD2 SE motherboard, 2 GB kingston DDR2 RAM, Nexus NX-3500 350W power,
> and couple of hard drives.

13.1, 32 bit arch, has a nasty kernel bug. Under some circumstances, like hibernating the machine,
it triggers. I can’t find a link just now. It needs a kernel update to a certain version, but this
has not yet been released on the update channel.

However, I understand the core 2 duo is a 64 bit processor, so you could try that. You don’t have
lots of memory, though.

And anyway, I don’t know if that is the cause of your problem.


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” (Elessar))

On 2014-01-03 21:06, rollex2 wrote:
>
> Hi again. Sorry for the double posts but I solved it. It was not the
> temperatures, nor the perl or python processes. It was the hard drive
> that was breaking. I wonder why they didn’t show up in /var/log/messages
> but only in the actual monitor of the server. Bad blocks and superblocks
> errors, a lot of them. Finally after one freezing and reboot “No
> operating system found”.

You should run the long test of smartctl to make sure the disk is bad, and that the bad blocks were
not caused by the crashes.
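A minimal sketch of that check, assuming the smartmontools package is installed and the commands are run as root. The function names are hypothetical wrappers and /dev/sda is only a placeholder for the disk under test.

```shell
#!/bin/sh
# Sketch only: pass the real device (for example /dev/sda) as the first argument.

# Start the extended (long) self-test; it runs inside the drive and can take hours.
start_long_test() {
    smartctl -t long "$1"
}

# Show the self-test log once the test has finished.
show_selftest_log() {
    smartctl -l selftest "$1"
}

# Print the attributes that most directly point at failing sectors.
show_sector_attrs() {
    smartctl -A "$1" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
}

# Example:
# start_long_test /dev/sda
```

If the long test completes without error and the sector counts stay at zero, that would fit with bad blocks being a side effect of the crashes rather than a dying disk.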


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” (Elessar))

Well, I forgot to mention that every time the crash occurs I see a lot of kernel output at the end of the log, and what caught my eye was “DWARF2 unwinder stuck” in between. I also thought this could be kernel-related.

I indeed have 32bit 13.1. I would appreciate the links or kernel update procedure!

On 2014-01-05 13:06, rollex2 wrote:

> I should probably mention that only thing I have left from old PC in
> this one is the other hard drive (1 TB Western Digital Green). I have
> backed up my earlier system to that drive in case I need the files
> there. I scanned it with e2fsck and it has bad blocks, so I played safe
> to not to use it. So it’s unlikely it would cause these freezes since
> the system is not on it?

Bad blocks may cause freezes, but typically they don’t. The system tries about 20 times to read
the same block, resetting the hard disk perhaps between each attempt. I’m not sure of the numbers,
but I’m sure that it takes a long time. Depending on where the problem is, main system disk or data
disk, the system may be totally unresponsive, or may recover. If it can write to the log, it will
certainly do so.

I had a system, with IDE buses, which periodically crashed, once or twice a month. I had to
power off, reseat the IDE cables, and it would reboot just fine. I never found the cause. I still
keep the system, but I don’t run it. It was only one hard disk which was affected.

> Yeah I double checked that. Most of the time the temperatures were all
> good but I guess something was wrong with my fan. Nevertheless, I have
> now entirely new hardware and that has a proper cooler and everything,
> temperatures are very low all the time. And I STILL experience the same
> downtime! How is this possible?

Dunno… :-o

> Only hint I currently have is that it happens when I’m writing lots of
> data at once or when system is operating some processor-intensive actions
> like transcoding a video while watching plex, but I’m not sure about
> that either.

How big is the swap? 2 gigs of RAM is too little nowadays.
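The configured swap can be read from /proc/swaps, where the Size column is in KiB. A small sketch that totals it in MiB; `swap_total_mib` is a hypothetical helper name.

```shell
#!/bin/sh
# Sum the Size column (KiB) of a /proc/swaps-style file and print the total in MiB.
# The first line is the column header, so it is skipped.
swap_total_mib() {
    awk 'NR > 1 { kib += $3 } END { print int(kib / 1024) }' "${1:-/proc/swaps}"
}

# Example:
# swap_total_mib
```

`free -m` shows the same total next to current RAM usage, which helps when judging whether the box is swapping heavily under load.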


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” (Elessar))

On 2014-01-05 13:56, rollex2 wrote:

> I indeed have 32bit 13.1. I would appreciate the links or kernel update
> procedure!

A procedure I don’t have, because there is no official update for the problem they found. And the
emails that explained it I have saved on a system that is currently running a backup, so I can not
access them. Ping me in some hours.

It is possible to update to a more recent kernel from the repo where they experiment with them. I’m
not familiar with that repo, I hesitate to give instructions.

I have a 32 bit machine I want to update to 13.1 but I’m also waiting for the kernel official
upgrade before doing it.

But I don’t know if this reported problem is related to yours.


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” (Elessar))

On 2014-01-05 14:06, Carlos E. R. wrote:
> On 2014-01-05 13:56, rollex2 wrote:
>
>> I indeed have 32bit 13.1. I would appreciate the links or kernel update
>> procedure!
>
> A procedure I don’t have, because there is no official update for the problem they found. And the
> emails that explained it I have saved on a system that is currently running a backup, so I can not
> access them. Ping me in some hours.

Found a link.

http://forums.opensuse.org/showthread.php?t=493955


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” (Elessar))

When I checked my /etc/motd I noticed I had updated from 12.3 to 13.1 on 30 November 2013, and right after that the crashing started. I had written in the motd on 04.12.2013: “I have no idea what happened, but the whole system freezed and I had to press reset physically! I was watching movie on Plex normally. This started a week ago or so.” So I guess this kind of finally confirms it’s about 32bit 13.1. I had no issues in 12.3 and prior.

Thanks a lot! It seems I have stumbled on that topic before when I googled about this. But it seems to be a different issue, though we have the freezing in common. I don’t use suspend/hibernate at all, unless the new suse has some feature that chooses to do that randomly or when idle? Seems unlikely.

Btw, how do I know if I can continue using my hard disks safely? I completely removed the one that wouldn’t mount and that gave “Error No Operating system, Please reboot and select proper boot device” after one of the freezes. That was the 320 GB Seagate my suse was on earlier. Now I have my system on a 500 GB Western Digital Blue I took from another PC, plus the 1000 GB Western Digital Green, which mostly holds media.

My server has been up for 5 hours straight perfectly fine now, completely without transmission-daemon, which I suspect is causing this behaviour. This is just a hunch, but I really suspect that seeding/downloading lots of files at once does something and the kernel hangs. I haven’t touched my 1000 GB HDD either.

But then I don’t really know. Pretty clueless… I don’t want to change my favourite distro to something else because of this. I want to take this as a challenge. The only problem is the very odd and random cause I can’t seem to sort out…

Okay, confirmed. 6 hours uptime, no problems. Started transmission-daemon = 10 minutes and frozen system. I haven’t done much installing apps and stuff so I’m going to try CentOS now. It breaks my heart to do this since I’ve been openSUSE user forever :’( BUT, if it makes my server run stable again so be it!

Some strange setting I have, a combination of apps, or something in transmission makes the kernel go BOOM. I can’t spend more time figuring it out. I’m not ruling anything out, but it is certainly not a hardware problem when things can run smoothly for hours or even days as long as one or two apps are not running.

On 2014-01-05 19:16, rollex2 wrote:
>
> Okay, confirmed. 6 hours uptime, no problems. Started
> transmission-daemon = 10 minutes and frozen system.

Well, then the culprit is transmission, not openSUSE Linux.


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

On 2014-01-05 18:06, rollex2 wrote:

> Thanks a lot! It seems I have stumbled on that topic before when googled
> about this. But it seems to be different issue, though the freezing we
> have in common. I don’t use suspend/hibernate at all, unless new suse
> has some feature that chooses to do that randomly or when idle? seems
> unlikely.

No, that would not happen.

> Btw, how do I know if I can continue using my hard disks safely? I
> removed the one completely that didn’t get mounted and was the result of
> -“Error No Operating system, Please reboot and select proper boot
> device”- after one of the freezes. That was 320 GB Seagate my suse was
> earlier. Now I have my system on 500 GB Western Digital Blue I took from
> other PC, 1000 GB Western Digital Green which has media mostly.

Well, as I said, you have to run the smartctl long test on your disks.
And you should do that before attempting anything else.

> My server has been up for 5 hours straight perfectly fine now,
> completely without transmission-daemon that I suspect is causing this
> behaviour. This is just a hunch but I really suspect seeding/downloading
> lots of files at once does something and the kernel hangs. I haven’t
> been touching to my 1000 GB HDD either.

If there is a bad block in the region of disk where those files are
stored, that would be a problem.


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

No, something in transmission causes suse to crash. I tried versions from 2.50 to 2.82+ and none of them worked with suse 13.1. Before 13.1 I had no issues.

I’m now on CentOS 6.5 and have been running the same transmission-daemon (with exactly the same settings!) for hours now, so I guess this was it. If you don’t hear from me any more, I guess this solved it. :) Thanks Carlos for your support, I really appreciate it!

On Sun, 05 Jan 2014 13:26:54 GMT “Carlos E. R.” wrote:

> http://forums.opensuse.org/showthread.php?t=493955

I use suspend only on my laptop, and because of the discussion at
your link I started using the kernel-standard repository, without any
problems so far. With the multiversion feature of zypp it is no problem to
step back once the devs solve the problem with the normal kernel.
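For reference, the multiversion behaviour is controlled in /etc/zypp/zypp.conf. Below is a small sketch for checking the current settings. `show_multiversion` is a hypothetical helper name, and the values in the comments are what I believe a stock install uses, so treat them as an assumption and check your own file.

```shell
#!/bin/sh
# Print the active (uncommented) multiversion settings from a zypp config file.
show_multiversion() {
    grep -E '^[[:space:]]*multiversion' "${1:-/etc/zypp/zypp.conf}"
}

# Typical values (assumption, verify locally):
#   multiversion = provides:multiversion(kernel)
#   multiversion.kernels = latest,latest-1,running
```

With settings like these, installing a newer kernel should leave the running one in place, so the older entry remains available in the boot menu.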

Coming back to report here that my CentOS 6.5 uptime is 167 days now. No problems whatsoever. I hope this nasty kernel bug is fixed or getting fixed soon. Sad to go but farewell my dearest openSUSE. We shall meet in Valhalla.