I have two identical servers that seem to mysteriously shut down.
These servers each export user directories to 10 node machines. Each has one IP address facing the “world” and one that NFS uses to export to the clients. Occasionally there are heavy disk writes, where files of 10+ GB are written, which lead to somewhat prolonged high I/O “wait” times. The systems usually make it through the heavy write period, but it seems that these writes occasionally hang the machines.
When that happens, the “world” IP address can still be pinged, but I cannot log in through it. The IP address facing the nodes goes down completely. I don’t yet have access to the machines themselves (they are in a data center) to see whether I can still get into them without the network.
Other servers, with a different hardware configuration, carry the same load and have occasionally filled their user drives completely without hanging. For the sake of completeness, the trouble machines are the only servers we use that run SUSE 11.2 and ext4; the others use SUSE 11.1 and ext3.
Is this perhaps a disk controller issue? Could the slower CPUs (dual-core, threaded, 1.6 GHz) be THE issue?