Headless server crash

On Fri, 25 Jul 2008 17:46:03 GMT
nalexmay <nalexmay@no-mx.forums.opensuse.org> wrote:

>
> Thanks guys.
> Just as I got your post I was waiting for the one click install from
> the search at software.opensuse.org to finish.
> I still am!
>
> I have X on all my machines (I think).
> I think it was part of the default Suse installation.
> I have to be in init 5 for the webcam to work.
> I always have a telnet daemon running.
> Where I am administering on the local network,
> it is great to be able to telnet so easily from a Win box.
> It doesn’t seem to cause a problem on other machines.
> Yes, I have apache, but no, it’s not an external website.
> The scanner software interface is webbased and it is also useful to be
> able to see the webcam picture directly.
> I don’t really know why I have postfix. Could that cause a problem?
> Motion is the webcam application, yes.
> I did have a look at the apache logs, but I didn’t find anything
> enlightening.
>
Hi
You should look at using PuTTY, disabling telnet and using ssh instead :slight_smile:

Postfix will be running for local mail which is normal.

I would look at the firewall and mail logs, turn off any unnecessary
services.

Do you have auto updates running? Maybe zypper is running and
hogging resources.

Have you enabled the apache server status to look at what is running,
who’s connected etc.?

Have a look at running lastlog to see if someone else has been logging
in… who is another command to run.

Else just keep an eye on the hardware sensors and wait.


Cheers Malcolm °¿° (Linux Counter #276890)
SLED 10 SP2 i586 Kernel 2.6.16.60-0.25-default
up 6:22, 2 users, load average: 0.00, 0.01, 0.13
GPU GeForce Go 6600 TE/6200 TE Version: 173.14.09

Went for a long period without a crash!
But now I had one.
One of the cron jobs that I set up proved that the system was hung and not just the network.
From that it also looks like the system went down at between 21:00 and 21:10
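
A heartbeat job of the kind described here could look roughly like the sketch below. The actual cron script is not shown in this thread, so the function name `log_heartbeat`, the log line format, the paths and the pinged address are all assumptions made for illustration.

```sh
#!/bin/sh
# Hypothetical heartbeat helper: run a check command and append one
# line per run to a log, whether the check passes or fails.
# Assumed line format: "<epoch seconds> <timestamp> <OK|FAIL>"
log_heartbeat() (
    logfile=$1
    shift
    if "$@" >/dev/null 2>&1; then
        status=OK
    else
        status=FAIL
    fi
    printf '%s %s %s\n' "$(date +%s)" "$(date '+%Y-%m-%dT%H:%M:%S')" "$status" >> "$logfile"
)

# Example (placeholder paths and address), called from a script that
# cron runs every 10 minutes, e.g. "*/10 * * * * /usr/local/bin/heartbeat.sh":
#   log_heartbeat /var/log/heartbeat.log ping -c 1 -w 5 192.168.1.1
```

With a log like this, FAIL lines point at the network, while missing lines mean the job itself never ran, which is what suggests a hung machine rather than a dead link.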

I ran atop and got this:
ATOP - NEC 2008/07/27 14:28:07 42 seconds elapsed
PRC | sys 3.38s | user 2.46s | #proc 56 | #zombie 0 | #exit|
CPU | sys 23% | user 28% | irq | idle | wait 42% |
CPL | avg1 2.03 | avg5 0.53 | avg15 0.18 | csw 24066 | intr 10791 |
MEM | tot 185.1M | free 75.6M | cache 56.5M | buff 4.4M | slab 7.0M |
SWP | tot 502.0M | free 502.0M | | vmcom 32.3M | vmlim 594.6M |
DSK | sda | busy 65% | read 2804 | write 397 | avio 8 ms |
NET | network | ipi 1 | ipo 3 | ipfrw| deliv|
NET | eth0 0% | pcki| pcko| si 0 Kbps | so 0 Kbps |
*** system and process activity since boot ***
PID SYSCPU USRCPU VGROW RGROW USERNAME THR ST EXC S CPU CMD 1/4
2078 1.78s 0.52s 81628K 7280K root N- - S 5% Xorg
1946 0.13s 1.03s 5652K 3904K haldaemo 1 N- - S 3% hald
933 0.64s 0.11s 2288K 1044K root N- - S 2% udevd
1 0.56s 0.08s 740K 288K root N- - S 2% init
2456 0.10s 0.48s 24204K 12480K root N- - D 1% kdm_greet
2604 0.01s 0.08s 2768K 1440K root N- - S 0% ifup
1794 0.00s 0.06s 2640K 1316K root N- - S 0% rc
1950 0.02s 0.02s 3040K 1068K root N- - S 0% hald-runner
1843 0.00s 0.04s 2196K 836K messageb 1 N- - S 0% dbus-daemon
876 0.04s 0.00s 0K 0K root N- - D 0% kjournald
2749 0.02s 0.01s 2324K 2324K root N- - R 0% atop
1876 0.01s 0.01s 7512K 1984K root N- - S 0% console-kit-da
1815 0.02s 0.00s 1592K 632K root N- - S 0% startpar
1810 0.02s 0.00s 9984K 492K root N- - S 0% blogd

Does that help understand why the system crashed?

I checked lastlog and it looks fine.

This is the apache2 log for the day:
[Sun Jul 27 14:28:44 2008] [notice] mod_python: Creating 8 session mutexes based on 150 max processes and 0 max threads.
[Sun Jul 27 14:28:44 2008] [notice] mod_python: using mutex_directory /tmp
[Sun Jul 27 14:28:45 2008] [notice] Apache/2.2.4 (Linux/SUSE) PHP/5.2.6 with Suhosin-Patch mod_python/3.3.1 Python/2.5.1 mod_perl/2.0.3 Perl/v5.8.8 configured -- resuming normal operations
[Sun Jul 27 23:05:58 2008] [notice] mod_python: Creating 8 session mutexes based on 150 max processes and 0 max threads.
[Sun Jul 27 23:05:58 2008] [notice] mod_python: using mutex_directory /tmp
[Sun Jul 27 23:05:59 2008] [notice] Apache/2.2.4 (Linux/SUSE) PHP/5.2.6 with Suhosin-Patch mod_python/3.3.1 Python/2.5.1 mod_perl/2.0.3 Perl/v5.8.8 configured -- resuming normal operations

Here is an extract from my messages file from around the time of the crash:
Jul 27 20:58:23 NEC pure-ftpd: (?@aptiva.home) [INFO] alex is now logged in
Jul 27 20:58:28 NEC pure-ftpd: (alex@aptiva.home) [INFO] Logout.
Jul 27 20:59:22 NEC pure-ftpd: (?@aptiva.home) [INFO] New connection from aptiva.home
Jul 27 20:59:22 NEC pure-ftpd: (?@aptiva.home) [INFO] alex is now logged in
Jul 27 20:59:30 NEC pure-ftpd: (alex@aptiva.home) [INFO] Logout.
Jul 27 21:00:23 NEC pure-ftpd: (?@aptiva.home) [INFO] New connection from aptiva.home
Jul 27 21:00:23 NEC pure-ftpd: (?@aptiva.home) [INFO] alex is now logged in
Jul 27 21:00:28 NEC pure-ftpd: (alex@aptiva.home) [INFO] Logout.
Jul 27 23:05:02 NEC syslog-ng[1921]: syslog-ng version 1.6.12 starting

Help ! (please)

> Went for a long period without a crash!
> But now I had one.
> One of the cron jobs that I set up proved that the system was hung and
> not just the network.
> From that it also looks like the system went down at between 21:00 and
> 21:10
>
> I ran atop and got this:
> ATOP - NEC 2008/07/27 14:28:07 42 seconds elapsed
<snip>
> Does that help understand why the system crashed?

no it does not, (it shows no problems i can see, maybe someone else can
see a prob)…

but, then again WHEN is that atop snapshot /which is date/time stamped
as “2008/07/27 14:28:07” in relation to the actual “crash” which you
say occurred between 21:00 and 21:10??

if you ran atop at 21:28, after a forced poweroff reboot (as you
mentioned in your first posting) you should not be surprised that
everything looks good just then…huh?

anyway, what you need to do is figure out what the time WAS on your
headless server when it “crashed”, and find the correct atop raw file
/var/log/atop/atop_YYYYMMDD (where YYYYMMDD are digits representing the
date of the crash)

and run
atop -r /var/log/atop/atop_YYYYMMDD

then step through that file with ‘t’ forward, or ‘T’ backwards until you
get to the snapshots which were taken between :00 and :10 after the
hour when the crash occurred…
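
As a shortcut to stepping with ‘t’, atop can be asked to start and stop the replay at given times. The sketch below only builds and prints the command. The helper name `atop_replay_cmd` is made up for illustration, and the `-b`/`-e` begin/end options are as I recall them from the atop man page, so check `man atop` on the installed version before relying on them.

```sh
#!/bin/sh
# Build the atop replay command for one day and one time window.
# The raw-file path follows the /var/log/atop/atop_YYYYMMDD pattern
# quoted above; -b and -e are assumed to take hh:mm begin/end times.
atop_replay_cmd() (
    day=$(printf '%s' "$1" | tr -d '/-')   # 2008-07-27 or 2008/07/27 -> 20080727
    printf 'atop -r /var/log/atop/atop_%s -b %s -e %s\n' "$day" "$2" "$3"
)

# Print the command for the evening of the crash, then run it by hand as root.
atop_replay_cmd 2008-07-27 20:50 21:10
```

Keeping the window a little wider than the suspected crash time makes it easier to see the last normal snapshot and the first one after the reboot.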

good luck, and keep at it…see if you can find the culprit in the atop
logs…come back with a snapshot before, during and after the “crash”
and then we can maybe see what happened…

by the way, your definition of a “crash” (in your initial posting) of
“becomes unreachable (via http or telnet)” leads me to believe that it
sure may be necessary to add a temporary head to that machine, and the
next time someone reports they can’t http/telnet to it, physically go to
the machine and see what the heck its actual state is…i mean if it is
sitting there humming along nicely and you can, from its keyboard, run
atop (or whatever) then you KNOW it is not hardware…and, it must be
external/internal networking problem, or some software application
conflict, or or or … but, at least you will have ruled out the
physical box and OS as the problem, right?

and, you can maybe track down what is not working that should be
working BEFORE you do a forced poweroff reboot and CLEAR the problem…


DenverD (Linux Counter 282315)
A Texan in Denmark

I am pretty sure that the time of the crash was between 21:00 and 21:08.
I have a cron job that checks that the network is reachable every 10 mins and leaves a log either way. It ran successfully at 21:00, but then didn’t run again until I forced the restart at about 23:00. This tells me that it’s not just a network problem. The server has “crashed” or “hung” or something. I don’t think adding a head will help.
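
The reasoning here (a missing run means the host itself stopped) can be checked mechanically. This is a minimal sketch that assumes each run of the cron job writes one line beginning with a Unix timestamp in seconds; the real log format is not shown in the thread, and `find_gaps` is a hypothetical helper name.

```sh
#!/bin/sh
# Report gaps in a heartbeat log whose lines start with epoch seconds
# (an assumed format). A gap longer than max_gap seconds means the job
# did not run at all in that period.
find_gaps() (
    logfile=$1
    max_gap=${2:-900}   # default: 1.5x a 10-minute cron interval
    awk -v max="$max_gap" '
        NR > 1 && $1 - prev > max {
            printf "gap of %d s between %d and %d\n", $1 - prev, prev, $1
        }
        { prev = $1 }
    ' "$logfile"
)

# Usage (placeholder path): find_gaps /var/log/heartbeat.log 900
```

The two timestamps printed for a gap bracket the crash window, which is the period to look up in the atop raw file and in /var/log/messages.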

I took another look at atop the way you said. Paging through, I couldn’t seem to see the snapshot time at all, but the man page said that I could pipe the output to a file. I did that and the output is at http://alexmay.homeip.net/test/atopoutput.txt. (I didn’t change the time interval)

Does this help more? It doesn’t mean anything to me, but I think this is the log that you were looking for.

Thanks again for the help,
Alex

> I am pretty sure that the time of the crash was between 21:00 and
> 21:08.
<snip>
> I took another look at atop the way you said. Paging through, I
> couldn’t seem to see the snapshot time at all

it is at the very top of each page you come to after you hit ‘t’…see
what it looks like below…

> the output is (http://alexmay.homeip.net/test/atopoutput.txt).
> Does this help more?

that URL shows four ‘snapshots’, the time stamp is at the top of each:
ATOP - NEC 2008/07/27 20:38:07
ATOP - NEC 2008/07/27 20:48:07
ATOP - NEC 2008/07/27 20:58:07
ATOP - NEC 2008/07/27 23:05:27

three before the crash (which look pretty okay to me–anyone else see
anything?) and one after the reboot…

none in the time frame of “between 21:00 and 21:08”…
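
Listing just the snapshot headers is a quick way to see which times a saved atop text dump actually covers. The sketch below assumes header lines of the form shown above (`ATOP - NEC 2008/07/27 20:38:07`); the function name is made up for illustration.

```sh
#!/bin/sh
# Print the date and time of every snapshot header in atop text output.
# Header lines look like: "ATOP - NEC 2008/07/27 20:38:07"
atop_snapshot_times() {
    awk '$1 == "ATOP" && $2 == "-" { print $4, $5 }' "$1"
}

# Usage: atop_snapshot_times atopoutput.txt
```

If no line falls inside the suspected crash window, the sampling interval is too coarse to have caught anything.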

so, try this: set the interval to 60 (down from 600) which will take a
snap every minute…and we can hope that it will catch a shot of
something ‘strange’ starting up, just before the crash…


DenverD (Linux Counter 282315)
A Texan in Denmark

Will do.
Thanks

Ok, got another crash this morning.
This one was quite soon after the server was started up.
The crash occurred between 07:51:32 and 07:52:32.
At 12:31 I realised I had another crash and forced a restart.
I have placed the output for atop around the time of the crash at http://alexmay.homeip.net/test/atopoutput2.txt.
At 7:50 hddtemp was logged as follows:
/dev/sda: ST320014A: 22 C
This seems very low. It normally runs around 40 C.
Below is the output that was logged at 7:50 for sensors.
I notice a few alarms on here, but they are there now too and the server is running. Is there a way that I can/should clear them?
Does any of this help explain what is going on?

w83627hf-isa-0290
Adapter: ISA adapter
VCore 1:   +1.62 V  (min =  +1.44 V, max =  +1.86 V)              (beep)
VCore 2:   +1.47 V  (min =  +1.30 V, max =  +1.70 V)              
+3.3V:     +3.26 V  (min =  +2.80 V, max =  +3.81 V)              (beep)
+5V:       +4.97 V  (min =  +4.52 V, max =  +5.51 V)              (beep)
+12V:     +11.80 V  (min = +10.03 V, max = +13.98 V)              (beep)
-12V:     -11.95 V  (min = -14.01 V, max =  -9.98 V)              (beep)
-5V:       +3.54 V  (min =  -6.00 V, max =  -3.99 V)       ALARM  
V5SB:      +5.32 V  (min =  +2.58 V, max =  +3.14 V)       ALARM  
VBat:      +2.93 V  (min =  +1.50 V, max =  +3.50 V)              
fan1:        0 RPM  (min = 2657 RPM, div = 2)              ALARM  
fan2:     3183 RPM  (min = 2657 RPM, div = 2)                     
fan3:     45000 RPM  (min = 2657 RPM, div = 2)                     
temp1:       +31 C  (high =  +119 C, hyst =   -82 C)   sensor = thermistor           
temp2:     +41.5 C  (high =   +85 C, hyst =   +70 C)   sensor = diode           (beep)
temp3:     -47.0 C  (high =   +75 C, hyst =   +60 C)   sensor = thermistor           
vid:      +1.650 V  (VRM Version 8.2)
alarms:   Chassis intrusion detection                      ALARM
beep_enable:
          Sound alarm enabled
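
Since the same alarms show up while the server is running fine, it may be more useful to compare which readings are flagged in two saved `sensors` outputs than to read each one by eye. A minimal sketch, assuming the output was saved to a file in the layout above; `sensor_alarms` is a hypothetical helper name.

```sh
#!/bin/sh
# Print the label of every line in saved `sensors` output that ends
# with an ALARM flag, so two snapshots can be compared with diff.
sensor_alarms() {
    awk -F: '/ALARM[ \t]*$/ { print $1 }' "$1"
}

# Usage (placeholder file names):
#   sensor_alarms sensors-0750.txt > before.txt
#   sensor_alarms sensors-now.txt  > now.txt
#   diff before.txt now.txt
```

An empty diff would suggest the alarms are a constant of this board or its sensors configuration rather than something that changes before a crash.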

> The crash occurred between 07:51:32 and 07:52:32.

i began a note to reply about 13 hours ago…and, i still don’t have the
time to look at the atop output in enough detail to make a decision…

that is, first glance didn’t give an easy answer…

i think we have to look much more deeply into other log files…i see in your
first posting you say “I have tried looking in /var/log for a crash or
dump file”…well, no, Linux won’t build a dump or crash file…you have to
dig…

try this–as root issue these, one at a time:

grep -i 'Jul 31 07:5' /var/log/messages >> 101messages.txt

grep -i 'Jul 31 07:5' /var/log/acpid >> 101acpid.txt

grep -i 'Jul 31 07:5' /var/log/firewall >> 101firewall.txt

grep -i 'Jul 31 07:5' /var/log/scpm >> 101scpm.txt

grep -i 'Jul 31 07:5' /var/log/warn >> 101warn.txt

and, you should have several Xorg.0 or .1 files…one will cover 31
Jul; use its name in the below:

grep -i 'Jul 31 07:5' /var/log/Xorg.[fill in].log[MAYBE .old, maybe not] >> 101Xorg.[fill in above].log.txt

similarly there will be several Xorg.99 log files; pick the one
covering 31 July and

grep -i 'Jul 31 07:5' /var/log/Xorg.99.log.[MAYBE .old, maybe not] >> 101Xorg.99.log.txt

then copy all 101* files to /alexmay.homeip.net/test/

and i’ll have a look as soon as i can (and, so should others
interested–INCLUDING you Alex)…for sure look for “error” or “warn”
or “warning”…
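
The per-log extraction can also be done in one loop. This is only a sketch: `extract_window` is a made-up helper, and the log names in the usage comment are the ones listed above, some of which may not exist on a given machine (missing logs are skipped).

```sh
#!/bin/sh
# Copy every line containing a timestamp prefix from each log into a
# file named <outprefix><logname>.txt, skipping logs that are missing.
extract_window() (
    stamp=$1
    outprefix=$2
    shift 2
    for log in "$@"; do
        [ -r "$log" ] || continue
        grep -F -- "$stamp" "$log" > "${outprefix}$(basename "$log").txt"
    done
    true
)

# As root, for the window discussed here:
#   extract_window 'Jul 31 07:5' 101 /var/log/messages /var/log/acpid \
#       /var/log/firewall /var/log/scpm /var/log/warn
#   grep -i -E 'error|warn' 101*.txt
```

An empty output file simply means that log had no entries in the window, which is itself worth noting.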

peace,


DenverD (Linux Counter 282315)
A Texan in Denmark

I went through all of that, but only three files yielded any output at all for that timeframe.
Here are the links:
http://alexmay.homeip.net/test/101acpid.txt
http://alexmay.homeip.net/test/101warn.txt
http://alexmay.homeip.net/test/101messages.txt
I’ve looked at the output, but I can’t see a reason for a crash in there.
Can anyone else?

Hi
Just a couple of things in the messages: have a look at fixing the
smartd errors, there are comments on what to do :slight_smile: Also, why all the jpg
errors? Can that be addressed?

The other thing is confirming your sensors config, have a look through
the sensors.conf file and read the comments about your sensors to tweak
it for correct information. Although even if it’s not quite right at
least you may see a trend develop.

How old is the machine? I’m just thinking it may be a hardware issue,
can you change the power supply, check all the connections etc?


Cheers Malcolm °¿° (Linux Counter #276890)
SLED 10 SP2 i586 Kernel 2.6.16.60-0.27-default
up 3:15, 1 user, load average: 0.47, 0.95, 0.79
GPU GeForce Go 6600 TE/6200 TE Version: 173.14.12

I have modified /etc/smartd.conf to add ‘/dev/sda -a -d sat’

I don’t understand the jpeg errors. They must come from the mirror, but I don’t understand why. I have this server mirroring a directory on another server. The files would not exist on the other server as they are more than 12 hours old and that server is set to delete the jpegs after 12 hours. But why mirror is still trying to download them is a mystery to me.

I tried looking through sensors.conf, but I couldn’t understand the comments. Sorry. I thought the installation of the sensors monitoring included an automated process which set that up.

The machine is an old machine. I bought it second hand, and I don’t really know exactly how old it is. I don’t have another power supply for it and I can’t see anything wrong with the connections, but I haven’t checked all the solder joints!

Hi
Ok, maybe worth investigating if those errors could cause a problem
somehow? Not creating a whole stack of connections or anything, netstat
and lsof should help there.
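
One way to spot a pile-up of connections is to count TCP sockets per state. The sketch below reads `netstat -tan` output on standard input and assumes the usual Linux column layout, with the state in the sixth column; the function name is hypothetical.

```sh
#!/bin/sh
# Count TCP sockets per state from `netstat -tan` output on stdin.
# Column 6 of each tcp line is the state (LISTEN, ESTABLISHED, ...).
tcp_state_counts() {
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage: netstat -tan | tcp_state_counts
```

A handful of LISTEN and ESTABLISHED entries is what a small home server would normally show; a large and growing count in any one state would be worth a closer look.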

Sensors should be pretty ok, but the 45000rpm for a fan seems a bit
strange :wink:

So the capacitors on the motherboard look ok, not leaking or such. That
has been an issue in the past. The only other thing would be to try
another power supply and see how that goes.

Oh and no excuses for not getting the magnifying glass out for those
solder joints lol :slight_smile:


Cheers Malcolm °¿° (Linux Counter #276890)
SLED 10 SP2 i586 Kernel 2.6.16.60-0.27-default
up 1:47, 1 user, load average: 0.15, 0.32, 0.60
GPU GeForce Go 6600 TE/6200 TE Version: 173.14.12

I have dumped out netstat and lsof to the following files:
http://alexmay.homeip.net/test/netstat.txt
and
http://alexmay.homeip.net/test/lsof.txt
They are not telling me much as I don’t know what they should look like.
Should I set up a cron job to log the output from netstat and lsof so that I can see what they looked like just before a crash?
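
Such a cron job could be sketched as below. The helper name `snapshot_cmd`, the output directory and the one-minute schedule are assumptions for illustration, not a tested setup for this machine.

```sh
#!/bin/sh
# Save the output of one command to a timestamped file, then flush it
# to disk so the last snapshot has a chance of surviving a hard hang.
snapshot_cmd() (
    outdir=$1
    name=$2
    shift 2
    mkdir -p "$outdir" || exit 1
    out="$outdir/$name-$(date '+%Y%m%d-%H%M%S').txt"
    "$@" > "$out" 2>&1
    sync
    printf '%s\n' "$out"
)

# Example calls (placeholder directory), from a script run by cron
# every minute:
#   snapshot_cmd /var/log/snap netstat netstat -tan
#   snapshot_cmd /var/log/snap lsof lsof -n
```

On a small disk the snapshot directory would need pruning now and then, and the files from just before the next crash are the ones to keep.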

> I’ve looked at the output, but I can’t see a reason for a crash in
> there.
> Can anyone else?

i can only guess (and remember, you began this saying you didn’t know
where to begin looking) that you should begin by looking at:

  • why your hard drive suddenly grew cold?

i think it is because it crashed…

i think it crashed because it had just run a cron job which required it
to swing its read/write arm back and forth looking for BUNCHES of
non-existing files, and write HUGE BUNCHES of error messages in TWO
different log files…

however, i’m not certain…i am guessing…and, i guess i also ought to
notice that according to http://alexmay.homeip.net/test/101messages.txt
a guy named alex:

logged in at Jul 31 07:49:27
(and caused far too many errors to count, and then)
logged out at Jul 31 07:50:18
logged in at Jul 31 07:50:26
logged out at Jul 31 07:50:26
logged in at Jul 31 07:51:26
logged out at Jul 31 07:51:26

and, i wonder what is going on…i THINK if you solve the error
problem you will have made a step in the right direction…at least you
will have given an old hard drive an easier life…and, removed a
variable from the system, and maybe another culprit can be found…
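
To see how often those logins happen, the pure-ftpd lines in the messages file can be counted per minute. A minimal sketch, assuming syslog lines in the format quoted earlier in this thread; `ftp_logins_per_minute` is a hypothetical helper name.

```sh
#!/bin/sh
# Count pure-ftpd logins per minute in a syslog-style file.
# Assumes lines like:
#   Jul 27 20:58:23 NEC pure-ftpd: (?@aptiva.home) [INFO] alex is now logged in
ftp_logins_per_minute() {
    awk '/pure-ftpd/ && /is now logged in/ {
             key = $1 " " $2 " " substr($3, 1, 5)
             if (!(key in count)) order[++n] = key
             count[key]++
         }
         END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }' "$1"
}

# Usage: ftp_logins_per_minute /var/log/messages
```

A steady one login per minute would match a scheduled job, while bursts would point at something retrying.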

sorry, i do not know how to help you solve the mystery of why so many
…jpgs are being searched for, and not found…ask Alex :wink:

oh, and don’t throw away these log files (including the atop files)
because perhaps a REAL guru (i am NOT one) will come along and spot
the problem immediately…


DenverD (Linux Counter 282315)
A Texan in Denmark

The log in and out are from mirror. It uses my login from each machine to mirror a directory on the other machine. Why it is looking for non-existent files is beyond me. I have the identical setup on another machine and it never crashes.

I appreciate your trying to help, but I don’t think we are really getting anywhere.

Can anyone suggest a different place to look?
I thought that someone in this forum would be able to tell me how to track down the cause of a crash, but that doesn’t seem to be the case. If this is not the right forum for such a question, where is?
Is this a question for Novell? Surely their OS shouldn’t crash for no reason.
Don’t they monitor these forums?
If not, is there some other way that I can attract their interest?

> Is this a question for Novell? Surely their OS shouldn’t crash for no
> reason.

it has a reason…my guess is when you find and fix the reason for all
the errors you will have solved your crash problem…but, i’m just
guessing…

> Don’t they monitor these forums?

yes and no, this is a community of users, some of the users here are
tight with Novell…some even work for Novell (i’m neither)…but, they
don’t work here…the help here comes from other users who volunteer
their time to try to help other folks with OPENsuse…

> If not, is there some other way that I can attract their interest?

you bet…they are in the BUSINESS of supporting SUSE Enterprise Linux
Server and SUSE Enterprise Linux Desktop, if you buy a license to their
server you have access to their support…

if you don’t you have 24x7 access to us…and all the good ideas we can
come up with…sorry i can’t be more helpful…i’ve given a couple hours
of my time and wish i could solve your problem from here (and happy to
do it), but i cannot…

you might consider asking another question in another thread…something
like “Why all these ‘no such file’ errors?” in the Programming/Scripting
forum maybe…it’s free to try.

which reminds me, i get the idea you are a system administrator at
work …administering a headless linux box from a work provided
Windows™ machine…and, i wonder what is your work place’s budget for
Windows™ licenses per year, and for Linux?

just something to think about…


DenverD (Linux Counter 282315)
A Texan in Denmark

Thanks again.
I will keep working on resolving the jpeg errors,
but I’m not so sure.

Just wanted to make it clear that I really do appreciate your efforts.

No, this is not a commercial installation. I am running this server in my home so that anyone in the family can access a scanner and so that we can keep an eye on intruders with a webcam. So my “budget” is very, very, very low!

> So my “budget” is very, very, very low!

understand…please do post and see if you can get some help on that
mystery of why your cron works one way, and not another…

i’m pretty sure that is the key to your solution…


DenverD (Linux Counter 282315)
A Texan in Denmark

Well I messed about with mirror yesterday, and I may have solved the issue with the error messages, (though I am not sure how!), but I haven’t solved the crash problem.
I got another crash soon after starting the server this morning. Just after 4.30am.

Here are the logs.
hddtemp : /dev/sda: ST320014A: 33 C
messages : http://alexmay.homeip.net/test/0802messages
atop : http://alexmay.homeip.net/test/0802atop
sensors : http://alexmay.homeip.net/test/0802sensors

Has anyone any more ideas?

> Has anyone any more ideas?

since you eliminated the overworked hard drive theory (of mine) and
continue to show voltage irregularities reported by ‘sensors’…added
to your recent post “old machine. I bought it second hand” i revert to
this WAG:

you are experiencing the same kinds of flaky system symptoms i did which
couldn’t be explained by researching software problems, which i
finally overcame AFTER doing these hardware things, in this order (on
the advice of several OLD men with much more experience than me):

  1. replaced the ribbon connector between the hard drive and the
    motherboard (i just took a connector out of a ‘spare’ machine…maybe
    the first was bad, maybe it just needed the contacts swiped)

  2. replaced the power supply with one of a larger output capacity (often
    the case in retail, off-the-shelf boxes to install a power supply of the
    LOWEST possible cost which will last until the warranty expires–that
    often means it has JUST enough power capacity to run all the hardware
    supplied WITH the box…come in later and add a 7200 RPM drive and a
    CD/DVD burner/player, and a blah blah…and, the power supply runs ok
    most of the time UNTIL it doesn’t for a micro second when a system
    activity springs from one state to another, and that spike is JUST
    enough hiccup to cause ‘unexplained’ problems…)

  3. removed the cpu heatsink and applied new thermal paste in the correct
    amount

ok, so it seems your sensors rule out number three…BUT i notice a
rather high fan RPM…maybe it is ok, maybe not…i don’t know…enough
paste to do LOTs of CPUs can be had in a little tube for 5 or 10 bucks…

item one is close to no cost to just remove and replace the ribbon (to
test the slightly corroded and ALMOST 100% good theory)…

item two can cost from almost nothing to over $100 i’d guess…and, i
judge it to be the most likely weakest link…

though you never (as far as i can see) mentioned which version of SuSE
you are running, i do need to mention that old box has WAY too little
RAM (at 185 MB) for most anything after (what?) version 7-something or
so…

and, i’m not sure why you run X for a headless (webcam and scanner)
server (maybe you have to, i don’t know–as mentioned, i’m not a real
guru!!)

i really do hope you get it going…but, it might be better and cheaper
to pickup a newer box in a garage sale…there should be lots of them
coming on the market after folks tried to “upgrade” from XP to Vista and
then found it impossible on a box that will EASILY run 10.3…


DenverD (Linux Counter 282315)
A Texan in Denmark