Mystery machine shutdown...

Hello,
I have two identical servers that seem to shut down mysteriously.

These servers export user directories to 10 node machines each. They have one IP address facing the “world” and one that NFS uses to export to the clients. Occasionally there are heavy disk writes, where 10+ GB files are written, that lead to somewhat prolonged high I/O “wait” times. The systems usually make it through the heavy write period, but it seems that these writes occasionally hang the machines.

The “world” IP address can still be pinged but not logged into. The IP address facing the nodes goes down completely. I don’t yet have access to the machines themselves (they are in a data center) to see if I can still access them without the network.

Other servers, with a different HW configuration, carry the same load and have occasionally maxed out their user drives without hanging. For completeness, the trouble machines are the only servers we run on suse11.2 and ext4…the others use suse11.1 and ext3.

Is this perhaps a disk controller issue? Could slower CPUs (dual core threaded 1.6 GHz) be THE issue?

Thank you

pdalach wrote:
> Hello,
> I have two identical servers that seem to shut down mysteriously.

note: i’m not a real guru, i’m just letting you know your post is in the
forum…maybe a real guru will come by soon…until then:

are there no hints in the logs?
are you logging temperature?
how about disk health via SMART?

> These servers export user directories to 10 node machines each. They
> have an ip address accessing the “world” and one that the nfs uses to
> export to the clients. Occasionally, there will be heavy disk writes
> where 10+gb files are written, that lead to somewhat prolonged high io
> “wait” times. The systems usually make it through the heavy write
> period. But it seems that these writes occasionally hang the machines.
>
> The ip address to the “world” can still be pinged but not logged into.
> The ip address to the nodes shuts down completely. I don’t have access
> yet to the machines themselves (in a data center) to see if I can still
> access them without the network.

as you suspect a possible networking problem: do you run atop?
do networking logs show problems?

> Other servers, different HW configuration, have the same loading and
> have occasionally maxed out their user drives without the machines
> hanging.

WHOA! maxing out disk drives is not a good thing, ever.

> In the name of being complete, the trouble machines are the
> only servers we use that operate off of suse11.2 and ext4…the others
> use suse11.1 and ext3.

i think you said all those not working correctly are running 11.2
ext4, and all those working good are running 11.1 with ext3, BUT with
different hardware…is that right?

well, with a set of identical servers with identical work, it is
either something to do with 11.1 vs 11.2 or ext3 vs ext4?

but, if it also might be a hardware problem, then we know: it is
either hardware OR software…not much to go on so far, huh?

have you checked bugzilla to see if others have reported the problem
in any case 11.1 v 11.2 or ext3 v 4 or hardware1 vs hardware2 ??

> Is this perhaps a disk controller issue? Could slower CPU’s (dual core
> threaded 1.6gz) be THE issue?

could be most anything at this point…check logs…


DenverD
CAVEAT: http://is.gd/bpoMD [posted via NNTP w/openSUSE 10.3]

DenverD,
Thanks for the reply…

I, ridiculously, didn’t think to check out the messages file. Drives look healthy…ran the smartctl self-test which came up clean.

Break-in attempts from illegal users range from 800-3500 a day. That is MUCH higher than my other servers.

I don’t see atop in the suse repositories…I may install it manually later.

Maxed drives are very bad…agreed. But on the very infrequent occasions it happened to the other servers, due to the same code in development (the drives are filled by a bug in the code that hasn’t been fully tracked down yet), the systems didn’t hang.

Don’t see anything on bugzilla that refers to an issue with 11.2 or ext4 like I’m experiencing…could be I’m not searching in the right place.

Here is a log from when machine1 had an issue last July. I have access to the datacenter containing machine2 today, so I can set up a terminal to look at logs. I’ll see if machine2 has the same errors as the following repeated message on machine1:

Jul 27 20:33:27 kernel: [30886.044626] The following is only an harmless informational message.
Jul 27 20:33:27 kernel: [30886.044660] Unless you get a continuous_flood of these messages it means
Jul 27 20:33:27 kernel: [30886.044680] everything is working fine. Allocations from irqs cannot be
Jul 27 20:33:27 kernel: [30886.044699] perfectly reliable and the kernel is designed to handle that.
Jul 27 20:33:27 kernel: [30886.044720] kswapd0: page allocation failure. order:0, mode:0x20
Jul 27 20:33:27 kernel: [30886.044742] Pid: 48, comm: kswapd0 Not tainted 2.6.31.12-0.2-desktop #1
Jul 27 20:33:27 kernel: [30886.044762] Call Trace:
Jul 27 20:33:27 kernel: [30886.044801] [<ffffffff81011a19>] try_stack_unwind+0x189/0x1b0
Jul 27 20:33:27 kernel: [30886.044833] [<ffffffff8101025d>] dump_trace+0xad/0x3a0
Jul 27 20:33:27 kernel: [30886.044862] [<ffffffff81011524>] show_trace_log_lvl+0x64/0x90
Jul 27 20:33:27 kernel: [30886.044891] [<ffffffff81011573>] show_trace+0x23/0x40
etc…

Again, thanks for the help.

Paul

pdalach wrote:

> I don’t see atop in the suse repositories…I may install it manually
> later.

atop is in
http://download.opensuse.org/repositories/server:/monitoring/openSUSE_11.2

if you add that repo, YaST will find and install it…

you need to read the man/info on it…it is kinda like top but
includes two really neat things: one is it tracks network activity,
and it by default builds a log of what was busy when…i don’t recall
now exactly but i think by default it takes a snapshot every ten
minutes…

which may or may not catch your evil…you may have to adjust the
timing to actually see what is happening…if there are no other
hints in your other logs…
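
If installing atop has to wait, the same idea (sample often enough to catch the heavy write periods) can be roughed out by hand. The following is only a minimal sketch, assuming a Linux /proc/stat whose aggregate "cpu" line lists user, nice, system, idle, iowait, irq, softirq and steal in that order. The function names and the 10-second interval are made up for illustration and are not part of atop.

```python
import time


def parse_cpu_line(line):
    """Return the numeric counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if len(parts) < 6 or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in parts[1:]]


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    deltas = [new - old for old, new in zip(before, after)]
    # Only the first eight counters are summed: guest time, where present,
    # is already included in user time and would be counted twice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


def read_cpu_sample(path="/proc/stat"):
    """Read one sample; the first line of /proc/stat is the aggregate line."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def watch(interval_s=10, samples=6):
    """Print a timestamped iowait percentage every interval_s seconds."""
    prev = read_cpu_sample()
    for _ in range(samples):
        time.sleep(interval_s)
        cur = read_cpu_sample()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print("%s iowait %.1f%%" % (stamp, iowait_percent(prev, cur)))
        prev = cur
```

Calling watch(interval_s=10, samples=360) with the output redirected to a file on a disk that is not the one being hammered would leave an hour of samples to compare against the time of a hang.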

and, reminder: you are actually hoping a real guru drops
in…unfortunately your subject title is not one which will cause a
‘real hacker’ to wanna see if you have something interesting to
solve…next time i’d use something more like

data center production servers fail daily, no hint in logs

doesn’t that sound like more of a challenge to solve?


DenverD
CAVEAT: http://is.gd/bpoMD [posted via NNTP w/openSUSE 10.3]

On 2010-08-13 17:06, pdalach wrote:
>
> DenverD,
> Thanks for the reply…
>
> I, ridiculously, didn’t think to check out the messages file. Drives
> look healthy…ran the smartctl self-test which came up clean.
>
> Break-in attempts from illegal users range from 800-3500 a day. That
> is MUCH higher than my other servers.

Huh? On what service? From Internet?

Take measures. What if they succeeded?
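
To answer the "which service, from where" question, the attempts can be tallied per source address. This is a rough sketch only, and it ASSUMES the attempts are against sshd and show up in the messages file as "Invalid user NAME from ADDRESS" (older OpenSSH versions wrote "Illegal user"). The post does not say which service is being hit, so the pattern may need changing. The function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed sshd message shape; adjust if the attempts hit another service.
ATTEMPT_RE = re.compile(
    r"sshd\[\d+\]: (?:Invalid|Illegal) user (?P<user>\S+) from (?P<addr>\S+)"
)


def attempts_by_source(lines):
    """Count rejected-user log lines per source address."""
    counts = Counter()
    for line in lines:
        m = ATTEMPT_RE.search(line)
        if m:
            counts[m.group("addr")] += 1
    return counts


# Typical use (reading the log usually needs root):
#   with open("/var/log/messages") as f:
#       for addr, n in attempts_by_source(f).most_common(10):
#           print(n, addr)
```

If a handful of addresses account for most of the attempts, that points at where to start with firewall rules or rate limiting.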

> Here is a log from when machine1 had an issue last July. I have access
> to the datacenter containing machine2 today so I can setup a terminal to
> look at logs. I’ll see if machine2 has the same errors as the following
> repeated message on machine1:

It is a kernel problem. I don’t really know if it is serious or not.

It says to ignore a problem about irq allocation that comes next in the log, but what I see is a page
allocation failure by kswapd0 followed by a stack dump. I’m not sure if that is what was expected to follow
or if it is a different thing. If the message repeats a lot, then it is indeed a problem.

I would create a bugzilla for it (after getting more info).
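
Part of that "more info" is whether the message really comes as a continuous flood. A minimal sketch for counting it per minute follows, assuming syslog lines shaped like the ones quoted above. The function name is made up, and the path in the usage comment is the usual messages file, which may differ on your setup.

```python
import re
from collections import Counter

# Matches lines such as:
# "Jul 27 20:33:27 kernel: [30886.044720] kswapd0: page allocation failure. order:0, mode:0x20"
FAILURE_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s.*?"
    r"(?P<comm>\S+): page allocation failure"
)


def failures_per_minute(lines):
    """Count page allocation failures per (minute, process name)."""
    counts = Counter()
    for line in lines:
        m = FAILURE_RE.search(line)
        if m:
            counts[(m.group("minute"), m.group("comm"))] += 1
    return counts


# Typical use (reading the log usually needs root):
#   with open("/var/log/messages") as f:
#       for (minute, comm), n in sorted(failures_per_minute(f).items()):
#           print(minute, comm, n)
```

A few isolated hits would fit the "harmless" wording of the message, while many per minute in the run-up to a hang would be worth putting in the bug report.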

Of course, it could be unrelated to your problem.

Wild guess: if the irq allocation that repeatedly fails is the network card’s, the interface could
go down. That’s your problem, yes?


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” GM (Elessar))