OpenSuSE in a cluster: Tips for improving stability?

Hi all,

I’m hoping somebody out there has some experience running OpenSuSE 11.3 in a cluster. We have a simple Linux cluster for teaching purposes, where students use 12 OpenSuSE workstations as terminals for submitting jobs via GridEngine and running programs like OpenOffice and Firefox, and there are another 10 machines in a server room that are only used for running the submitted jobs (runlevel 3, I think). The master node acts as an NFS and NIS server (the cluster is behind a firewall with no access from outside, so security isn’t a priority).

My question is: the ‘execution nodes’ are great and really stable, I can’t remember the last time I needed to reboot one of them, but the nodes that act as terminals regularly become slow or hang completely at the login screen. Checking the logs shows a few recurring crashes with things like flash web apps, but nothing serious. After a reboot they’re generally fine again. Does anyone have any tips for how I can improve stability for these workstations? Can anyone think why running X-applications would cause so much instability? Or how about a way that I could automatically detect when this was happening and auto-reboot the machine overnight before it gets too slow or freezes completely?

Thanks for any tips!

m dev34 wrote:
> My question is: the ‘execution nodes’ are great and really stable, I
> can’t remember the last time I needed to reboot one of them, but the
> nodes that act as terminals regularly become slow or hang completely at
> the login screen.

I would start by running top to get a quick idea of the processes and
memory utilization and then dig into whatever you find. If the machine
is slow then either:
(1) something is hogging the CPU
(2) something is hogging memory, causing lots of swapping
(3) something is doing lots of I/O
(4) there’s some kind of network problem such as DNS timeouts or a
broken switch
The first three of these should show up in top; ping should reveal the last one.
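
On one of the affected workstations, a quick first pass might look something like this (assuming the sysstat and iotop packages are installed, and with ‘master’ standing in for whatever your NFS/NIS head node is actually called):

top                 # CPU/memory hogs and the load average
free -m             # memory and swap usage in MB
vmstat 1 5          # si/so columns show swapping, wa shows I/O wait
iostat -x 1 5       # per-disk I/O load (from the sysstat package)
iotop               # which processes are generating the I/O
ping -c 5 master    # latency and packet loss towards the head node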

> Checking the logs shows a few recurring crashes with
> things like flash web apps, but nothing serious. After a reboot they’re
> generally fine again. Does anyone have any tips for how I can improve
> stability for these workstations?

Just reboot them every night?

> Can anyone think why running
> X-applications would cause so much instability?

No. Even if all the X clients were buggy and the X server itself was
buggy, it wouldn’t cause those symptoms, assuming you’re restarting
everything when somebody logs out.

While the technical advice above is sound, as a side note I want to add that the combination of

… where students use 12 OpenSuSE workstations as terminals …
and

… so security isn’t a priority …

struck me. I always learnt that in-house dangers are even greater than those from outside. And that students offer the greatest challenge rotfl!

One of the machines is obligingly showing a few strange symptoms (trouble logging in and the mouse cursor is suddenly very slow). It doesn’t seem to have any CPU-intensive processes running in top, ‘free -m -t’ shows plenty of free memory, iotop doesn’t show anything interesting and ping response time is the same as for any of the other nodes. As all nodes are on the same network I guess that should be ok as well…

Some kind of script might be possible, but these nodes are also execution nodes for classroom exercises and some jobs can overflow to the next day, so I’d have to be careful not to restart these. (I checked to make sure there weren’t any running jobs slowing the affected nodes down as well :))
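
If I do go the scripted route, I’m thinking of something along these lines for a nightly cron job. It’s only a sketch, and I’d still need to check that the qstat filter really matches our setup (I’m assuming queue instances are named in the usual queue@host form):

#!/bin/sh
# Sketch only: reboot overnight unless GridEngine reports running jobs on this node.
HOST=$(hostname -s)
RUNNING=$(qstat -s r -u '*' -q "*@${HOST}" | grep -c '^ *[0-9]')
if [ "$RUNNING" -eq 0 ]; then
    /sbin/shutdown -r +5 "nightly maintenance reboot"
fi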

You may be onto something there. Students often forget to log out, so we kill the processes of anyone still logged in each night. Perhaps this causes instabilities? I guess I’m probably onto a loser if I try to diagnose each problem individually, as the symptoms seem to vary from machine to machine and from day to day, with nothing in the logs and no obvious cause. It’s all curable by rebooting, though. You’re probably right that finding a way to restart the machines regularly is the best fix.

hcvv: Good point! lol! I did forget to mention that they are SWISS students, though. I wouldn’t have tried the same thing in the UK :slight_smile:

Hm, I am not sure if that is a positive point for the Swiss or for the UK students :question:

And yes, people not logging out is a (security) problem. I do not know what you kill there at night and in which sequence, but there could be processes that stay hanging around.

The Swiss have the talent, just not the inclination for causing random mayhem. Much too serious :\

Users’ sessions are locked after a short time but users aren’t logged out. New users can ‘switch user’ to log in if the last user forgot. Each night, users that are still logged in at a workstation are located with

who | grep “(localhost)”
who | grep “(console)”

All processes belonging to that user are then killed. I looked around a lot for a more graceful way to do this, but I didn’t find any documentation on logging out a user cleanly. Any suggestions there?
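
To be concrete, the cleanup currently boils down to roughly this (a sketch rather than the exact script; the pkill is my shorthand for however the processes actually get killed):

# rough sketch of the nightly cleanup, not the exact script
for user in $(who | grep -E '\((localhost|console)\)' | awk '{print $1}' | sort -u); do
    pkill -KILL -u "$user"    # kill every process belonging to that user
done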

m dev34 wrote:
> One of the machines is obligingly showing a few strange symptoms
> (trouble logging in and the mouse cursor is suddenly very slow).

Slow mouse usually indicates hardware or kernel trouble in my experience
if there’s no other very obvious problem.

> djh-novell;2451930 Wrote:
>> Just reboot them every night?
>
> Some kind of script might be possible, but these nodes are also
> execution nodes for classroom exercises and some jobs can overflow to
> the next day, so I’d have to be careful not to restart these. (I checked
> to make sure there weren’t any running jobs slowing the affected nodes
> down as well :))

It would be simpler if the two functions were on separate machines, but
perhaps the exercises could be run under another user ID so they can be
easily distinguished from user processes.

> You may be onto something there. Students often forget to logout so we
> kill the processes of anyone still logged in each night. Perhaps this
> causes instabilities?

It should be OK. I’d restart the display manager, and thus the X server,
as well, just to be sure. But it all gets more complicated than a cron
job to reboot!
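
If you do keep the nightly cleanup rather than a full reboot, something along these lines might be gentler (just a sketch: rcxdm is, as far as I know, openSUSE’s init script wrapper for the display manager, and $user would come from the same who/grep you already use):

pkill -TERM -u "$user"    # ask the user's processes to exit cleanly first
sleep 10
pkill -KILL -u "$user"    # then force anything still hanging around
rcxdm restart             # restart the display manager, and with it the X server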

> I guess I’m probably onto a loser if I try to
> diagnose each problem individually as the symptoms seem to vary from
> machine to machine and from day to day, with nothing in the logs and no
> obvious cause. It’s all curable by rebooting, though. You’re probably
> right that finding a way to restart the machines regularly is the best
> fix

I don’t know about best but it might well be easiest! :slight_smile: