AVG Scan causing Kernel Failure

I installed openSUSE 11.4 x64 as a fresh full install last week. It ran great until the last couple of days, when I installed AVG. Every time I run avgscan, it runs for 20 to 60 minutes and then the PC reboots. The only thing I can find is the following in /var/log/messages:


May  1 11:45:01 linuxbox sntp[7653]: Started sntp
May  1 11:45:59 linuxbox kernel: [38429.349995] iTCO_wdt: Unexpected close, not stopping watchdog!
May  1 11:46:00 linuxbox kernel: [38430.111688] PM: Marking nosave pages: 000000000009d000 - 0000000000100000
May  1 11:46:00 linuxbox kernel: [38430.111694] PM: Marking nosave pages: 00000000befd3000 - 00000000bf5cd000
May  1 11:46:00 linuxbox kernel: [38430.111726] PM: Marking nosave pages: 00000000bf5cf000 - 00000000bf681000
May  1 11:46:00 linuxbox kernel: [38430.111731] PM: Marking nosave pages: 00000000bf800000 - 0000000100000000
May  1 11:46:00 linuxbox kernel: [38430.112609] PM: Basic memory bitmaps created
May  1 11:46:00 linuxbox kernel: [38430.152038] PM: Basic memory bitmaps freed

I get this result every time I try to do a virus scan. I also have seen this result once when doing a large scp transfer from another Suse box. The AVG log just terminates and gives no warnings as to what’s going wrong.

Are there any other logs I can check to troubleshoot this further? Any ideas what’s happening here?

Thanks!

On 2011-05-01 20:36, PsychoGTI wrote:
> Any ideas what’s happening here?

Heat?


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)

What are you scanning for viruses? Do you have Windows partitions?

I checked the temperatures, thinking the same thing. The CPU is at 52 Deg C (125 deg F) and the HD’s are at ~50 deg C (122 deg F), so within normal ranges. The PC is in my basement, where it’s cool 99% of the time.

This machine serves out SAMBA shares to several Windows PC’s in my family. They all upload/download files to the server, as well as back up to it. Scanning on the server seems like a prudent thing to do. :) I had AVG working just fine under Suse 11.1 with my old server. I had it rigged to email me results and everything after a scan.

PsychoGTI wrote
> I checked the temperatures, thinking the same thing. The CPU is at 52
> Deg C (125 deg F) and the HD’s are at ~50 deg C (122 deg F), so within
> normal ranges. The PC is in my basement, where it’s cool 99% of the
> time.
The next best thing which comes to my mind is bad RAM. Did you run a memory
check (at least several hours)?


PC: oS 11.3 64 bit | Intel Core2 Quad Q8300@2.50GHz | KDE 4.6.2 | GeForce
9600 GT | 4GB Ram
Eee PC 1201n: oS 11.4 64 bit | Intel Atom 330@1.60GHz | KDE 4.6.0 | nVidia
ION | 3GB Ram

On 05/01/2011 10:18 PM, martin_helm wrote:
>
> The next best thing which comes to my mind is bad RAM.
>

googling “Marking nosave pages” i think i learned that that means the
kernel found no place to save stuff it wanted to save somewhere (even if
only momentarily)…

i guess that was probably caused by insufficient memory for the entire
scan operation…

it might be filling physical RAM but i think that would be reported
differently, so i think it is spilling into swap and filling it
also…or, (i am guessing) maybe it is filling the root partition by
filling /tmp…

maybe you need more RAM or swap space (how much do you have of each?)

if you are low on RAM–maybe you are running X and other unneeded stuff
on that server and taking resources needed for the scan? perhaps you
could skinny down the running services etc and get by…or, just add
some swap space…

you could open an instance of top and sit and watch the memory/swap etc
lines at the top and maybe see what is going on (with only an investment
of the expected 20 to 60 minutes prior to unexpected reboot)…

or you could install atop and set it to take and save a snapshot of what
is going on each (say) minute or two (the default is one save each 10
minutes, which might not be often enough)…
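
If atop is not at hand, the same kind of watch can be improvised by reading /proc/meminfo at intervals. This is only a minimal sketch, assuming the usual `Name: value kB` layout of that file. The function names and the one-minute default interval are made up for illustration, and "application" memory here is simply total minus free, buffers, and cache, which is an approximation.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def summarize(values):
    """Split memory use into application, cache, and swap (all in kB)."""
    cache = values["Buffers"] + values["Cached"]
    used = values["MemTotal"] - values["MemFree"]
    return {
        "application_kb": used - cache,
        "cache_kb": cache,
        "swap_used_kb": values["SwapTotal"] - values["SwapFree"],
    }


def log_snapshots(count, interval=60.0, path="/proc/meminfo"):
    """Print `count` timestamped summaries, `interval` seconds apart."""
    for _ in range(count):
        with open(path) as handle:
            summary = summarize(parse_meminfo(handle.read()))
        print(time.strftime("%H:%M:%S"), summary, flush=True)
        time.sleep(interval)
```

Calling `log_snapshots(90)` with the output redirected to a file would leave a record of the last minutes before an unexpected reboot, provided the data reaches the disk in time.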


CAVEAT: http://is.gd/bpoMD
[openSUSE 11.3 + KDE4.5.5 + Thunderbird3.1.8 via NNTP]
HACK Everything → http://www.youtube.com/watch?v=j5b4CCe9pS8&NR=1

So I’ve done some work on this based on the suggestions above, and here’s some results…

I tried running the Memtest86+ included on the install disc, however it would freeze at startup. I did a bit of digging around, and found out that the version on the disc (v4.10) is not compatible with my CPU and Mobo, but the newest version (v4.20) is, so I’ve started a full scan with this version and will let you know the results when finished. :)

I’ve run into compatibility issues during the install a fair amount. This isn’t unexpected, as I just bought all the parts and pieced this brand-new server together last week. For the record, it’s an Intel Sandy Bridge 2500K CPU with an Asus P8P67 Evo motherboard with the B3 Stepping and 16 GB of RAM. So pretty new. I configured it with 8 GB of swap, but the system never seems to use it. I’m starting to wonder if this newer architecture is related to a few of the issues I have encountered over the last week.

I did re-run avgscan with the System Monitor open with active graphs on CPU usage, Harddrive Usage, Memory (Total mem usage, swap, application, buffer, and cached), and Load Averages. Some graphs on a 0.5 sec refresh, some on 30 sec for longer trends as I didn’t know if this test was going to last 5 min or 60. I also installed atop as suggested and monitored that as I went.

The results give a bit of information. Essentially, the PC starts out with 0.9 GB memory used, 2.0 GB with buffered and cached worked in. You can tell when I start the scan. The memory usage is slow to climb for the first 10 minutes, mainly the cached and the total used, with the application mem used staying constant around 1.0 GB. After the 10 min mark, the total memory usage and cached mem climb fairly fast. At min 20, all 16 GB of mem is used (also confirmed by atop), however this is all due to cached mem, with the application mem using only 1.2 GB… so not bad. This whole time the swap remains untouched. In this test, the mem was maxed out for another 10 to 15 min or so, and would occasionally dip a bit after a bit of garbage collection… and then the kernel crapped out again, and the server reset.

So… not sure if this tells much of a story, other than the memory is being used, but only for caching. So most likely to read-ahead files and such. I notice this same behavior when I transfer files via scp, which, as I mentioned in my first post, also triggered one such kernel failure.

On another note… this server is configured to use mdadm for linux software raid. Two drives in RAID 1 mirroring as the system drive, and 4 drives in RAID 5 as a data drive… not sure if that makes a difference or not either. This is my first time using linux mdadm software raid.

Thanks for the suggestions!

On 2011-05-03 03:36, PsychoGTI wrote:
> So… not sure if this tells much of a story, other than the memory is
> being used, but only for caching. So most likely to read-ahead files and
> such. I notice this same behavior when I transfer files via scp, which,
> as I mentioned in my first post, also triggered one such kernel
> failure.

Maybe the problem is not memory, but disk.

Is the crash sudden and fast, or slow?

> On another note… this server is configured to use mdadm for linux
> software raid. Two drives in RAID 1 mirroring as the system drive, and 4
> drives in RAID 5 as a data drive… not sure if that makes a difference
> or not either. This is my first time using linux mdadm software raid.

Dunno, but perhaps you can try without it, I mean, with a single disk or two,
to learn whether it is the problem.


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)

On 05/03/2011 04:20 AM, Carlos E. R. wrote:
>
> Maybe the problem is not memory, but disk.
>
> Is the crash sudden and fast, or slow?
>
[snip]
> Dunno, but perhaps you can try without it, I mean, with a single disk or two,
> to learn whether it is the problem.

i agree…the memory use sounds normal…

@PsychoGTI, you have not declared your experience/knowledge level in
linux, so i throw out in case you are kinda new: though the climbing to
near full memory use may sound bad when thought of in the way Windows
uses RAM/swap, it sounds perfectly normal to me…EXCEPT for the
crash, and i think it is probably a RAID problem…

OR, maybe the kernel can’t actually write to swap…certainly it should
be able to, and if it tried and found it impossible to write it would go
on an unpredictable path (i think) which might lead to either a kernel
panic, or the error displayed…but, i’ve never seen the uncommanded
restart (but, i’ve not seen it all yet)

so, @PsychoGTI i ask you to take a look at the logs…/var/log/messages
recorded the event…each line begins with a date/time…zoom down to
the time of any kernel restart and have a good look…find the place
where it changes from all ok to going south, and copy paste that to
paste.opensuse.org (please spin the time to keep from “1 Week” to “3
Years”)…maybe someone will be able to spot the exact cause of the restart…
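
One way to find the spot where things go south: the bracketed number in each kernel line (for example `[38430.152038]` in the excerpt from the first post) counts seconds since boot, so it drops back toward zero after a restart. The sketch below returns the lines leading up to each such drop. It assumes the `kernel: [seconds]` layout shown in that excerpt; the function name and the default context size are invented for illustration, and the returned window may include a few non-kernel lines written early in the new boot.

```python
import re

# Matches the seconds-since-boot stamp the kernel puts in each message.
KERNEL_STAMP = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def lines_before_reboots(lines, context=10):
    """Return, for each detected reboot, the `context` lines logged before it."""
    results = []
    last_uptime = None
    window = []
    for line in lines:
        line = line.rstrip("\n")
        match = KERNEL_STAMP.search(line)
        if match:
            uptime = float(match.group(1))
            # The counter only goes backwards if the kernel was restarted.
            if last_uptime is not None and uptime < last_uptime:
                results.append(list(window))
            last_uptime = uptime
        window.append(line)
        window = window[-context:]
    return results
```

Feeding it an open file, as in `lines_before_reboots(open("/var/log/messages", errors="replace"))`, should give one block of lines per restart, which can then be pasted as is.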

also, atop should have a log there somewhere (i’ve not used it in a
while, so i don’t remember its name or location [both should be
learnable from its doc]) which might also give a clue…

sorry, i can’t help with RAID…never used it, too much trouble, too
easy to mess up and too easy to make you feel all safe and secure (and
lax on backup)…

but, i guess if you were to unhook all of those drives, and (do as
Carlos suggests) put in one drive and do a fresh install of 11.4 and let
the install script choose the partitioning scheme (/, /home, and swap, with
ext4 for root and home) then i guess this problem would be gone…

by the way: if Carlos and i give conflicting advice you would be wise to
always follow his! (i’m not a real guru)



PsychoGTI wrote:
> I’ve run into compatibility issues during the install a fair amount.
> This isn’t unexpected, as I just bought all the parts and pieced
> this brand-new server together last week. For the record, it’s an Intel
> Sandy Bridge 2500K CPU with an Asus P8P67 Evo motherboard with the B3
> Stepping and 16 GB of RAM. So pretty new. I configured it with 8 GB of
> swap, but the system never seems to use it. I’m starting to wonder if
> this newer architecture is related to a few of the issues I have
> encountered over the last week.

Hmm, I’m not sure why this thread claims the kernel failed? Is there any
evidence of that? All I see so far is evidence of a hardware reboot.

Anyway, given the newness of the h/w, what version of the kernel are you
running? I’d try installing the latest stable kernel in case there have
been recent bugfixes. I think there have been significant problems with
Sandy Bridge but I don’t follow hardware much.

It’s definitely also worth swapping hardware around as Carlos & DenverD
have suggested.

Cheers, Dave

On 2011-05-03 11:24, DenverD wrote:
> by the way: if Carlos and i give conflicting advice you would be wise to
> always follow his! (i’m not a real guru)

Ha, I’m sure you also have a lot of experience. Even if it is only by being
here and reading about people’s problems, you learn a lot :)


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)

The crash is definitely very fast and all of a sudden. The screen blanks out, and the next thing you know you’re looking at the BIOS mem-check screen. No sign, no warning, no nothing.

Are you suggesting setting up a separate spare disc as a single root OS drive, just for testing? What about using a bootable live disc as a test?

I agree, with the uncommanded restart and the suddenness of the crash, with no warnings, hangs, etc… seems like a driver issue of some sorts (like mdadm not handling the RAID correctly). As for my experience, I’m not a newb at linux, I have a degree in Computer Engineering and use it at work… however, I do know that this is my 10th post ever on this forum, but have camped out here for several answers to help me out from time to time. This forum is great for learning things. :) I am by no means a “Flux Capacitor Penguin”, but have an understanding how PC’s and Linux should work.

Is there a way to test the write to swap ability? Maybe from a bootable USB Live key? (I know, again with the bootable USB Live key)…

Will do. I’ll type out a short description of the problem, and then paste in the /var/log/messages from the start of the scan to the time it dies (at the Marking No Save Pages).

I would really like to run the OS drives and DATA drives in RAID. I had started another discussion thread on dmraid (fake raid) versus mdadm (software raid). As you can see… I was already seeing a number of funny anomalies and had questions on Linux RAID and such, as typically (in a work setting) I deal with expensive hardware RAIDs, and don’t know much about the software vs. fake raid setups.

I looked further into my current mdadm setup, and noticed that when doing a proper shutdown, sometimes there seems to be an “unknown” mdadm device. My /proc/mdstat should look like the following:


Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid5 sdc1[0] sdf1[4] sde1[2] sdd1[1]
     5860539648 blocks super 1.0 level 5, 128k chunk, algorithm 2 [4/4] [UUUU]
     bitmap: 5/15 pages [20KB], 65536KB chunk

md2 : active raid1 sda4[0] sdb4[1]
     951489400 blocks super 1.0 [2/2] [UU]
     bitmap: 5/8 pages [20KB], 65536KB chunk

md0 : active raid1 sda1[0] sdb1[1]
     103412 blocks super 1.0 [2/2] [UU]
     bitmap: 1/1 pages [4KB], 65536KB chunk

md1 : active raid1 sda3[0] sdb3[1]
     20972472 blocks super 1.0 [2/2] [UU]
     bitmap: 1/1 pages [4KB], 65536KB chunk

Where md0 is /boot, md1 is /, md2 is /home, and md3 is /data. This should translate to /dev/md/0, /dev/md/1, /dev/md/2, and /dev/md/3 respectively (correct me if I’m wrong, as I’m still pretty new to this mdadm stuff). However, during shutdown I sometimes notice a /dev/md1 that is not found (no final ‘/’ before the 1). When I run mdadm --monitor with the --test flag, I do get situation reports for all the correct md’s, as well as this phantom /dev/md1… not sure what that means, and if it is part of the problem.

Do you know of any log or config I can check to see what the overall /dev config should be for mdadm devices?

I called it “Kernel Failure” because I can find no other indications except those last lines in /var/log/messages, where the kernel seems to exhibit some sort of problem, and then everything stops and reboots as normal (according to the log). The reboot isn’t initiated by openSUSE… it just happens. The lines I posted in my original post on this thread are the last ones in the log before lines pertaining to normal boot.

Sandy Bridge seems to be having a couple issues… but so far I think that’s due to a new architecture and support/drivers are in short supply so far. The kernel version I’m running is 2.6.37.6, which is the latest offering by opensuse and Packman for x64 arch. I have been thinking about making a thread that lists troubles/issues I have encountered with 11.4 and Sandy Bridge architecture. So far in other installs (all on older machines or in VM’s), 11.4 was problem free.

Any other logs and such I should be checking besides /var/log/messages?

As always, thanks a ton for the feedback/help!

As promised, Paste created: SUSE Paste - http://paste.opensuse.org/86652867

Also, I ran the MEMTEST86+… it ran completely 10 times, and did not find any bad RAM.

I had a problem like that, spontaneous reboot, turned out to be a power supply going bad. Replaced it and it is solid now.

On 05/04/2011 04:36 AM, PsychoGTI wrote:
>
> I am by no means a “Flux Capacitor Penguin”

i guess you know that that means absolutely nothing, but one thing: has
posted a lot, needs to “get a life” etc…

says nothing about skill or knowledge…i wish they would get rid of
it…i tried my best to avoid it for over a year by posting using a
series of unregistered nom de guerre, but strangely some of the mods and
administrators here thought i was up to no good…and, after a direct
threat of banning i gave in and let the counting begin…

> Is there a way to test the write to swap ability?

normally the linux kernel will stuff some things into swap even before
RAM is full…i have zero idea how it goes about deciding what to put
where and when…but, top just now tells me i have been up 65 minutes
and have 3168k in swap with ram near full and about 43% of that cached…

so, it is because you wrote “This whole time the swap remains
untouched.” that i suspect that might be what caused the kernel to
complain “Marking nosave pages”

hmmmm…i googled “Marking nosave pages” and found 21k hits
with problems associated like acpi, hibernate, display and after
scanning a few pages of hits, i changed the search string, and leave it
to you to dig into these ~65 deeply: http://tinyurl.com/6f6mh5j

>> and copy paste that to paste.opensuse.org
>
>
> Will do. I’ll type out a short description of the problem, and then
> paste in the /var/log/messages from the start of the scan to the time it
> dies (at the Marking No Save Pages).
>

sorry, i see i failed to mention to paste the URL to the paste page back
to this thread (which is the only way we have access to it)

> I would really like to run the OS drives and DATA drives in RAID.

i’m not suggesting you change your production machine, just looking for
a way to prove or disprove it is a drive problem (most likely a RAID
problem) by eliminating those from the mix during a test

> [big snip of RAID output]… not sure what that
> means, and if it is part of the problem.
>
> Do you know of any log or config I can check to see what the overall
> /dev config should be for mdadm devices?

as mentioned, when it comes to RAID i’m an idiot…except to know that:

-software raid is junk (invented by folks who $ELL $oftware)
-no RAID is a good substitute for a proper backup routine
-only good if a 24x7x365 hot replacement is required
-almost always is more difficult to setup/administer than none

so, as a simple man i’ve not bothered to learn much about it…

if you have a 24x7x365 commercial commitment then you are [imnsho] in
the wrong place to begin with…openSUSE is a short life, near cutting
edge, consumer level distro where lots of enthusiasts discover LOTS of
bugs…which get worked out and eventually the code becomes clean
enough for Novell to release the commercial applications ready SUSE
Linux Enterprise Server (SLES) or SUSE Linux Enterprise Desktop (SLED)…

i have zero idea, but it might be that their latest is already set to
work and play nicely with your setup…

anyway, once you have nailed down where the bug is, i will ask you to
log a bug so that it can be fixed…

>
> I called it “Kernel Failure” because I can find no other indications

not sure, but i think Dave’s point is that an uncommanded reboot might
just be a momentary loss of ground in a hardware switch…while most
“kernel failures” are total system hard freeze, kernel panic, etc…

in fact, i think it really good to listen to him (he is a real guru)

and his post seems to really be saying: it is not a kernel problem, a
declaration i can neither prove nor disprove…however, i think the
single hard drive test will prove your system stable without RAID.

>
> Any other logs and such I should be checking besides
> /var/log/messages?

well, you could look for the strings “error” and/or “warning” in:

/var/log/boot.msg
/var/log/warn
/var/log/Xorg.0.log
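
A case-insensitive sweep over those files can be sketched as follows. The helper name is made up, the default pattern is just the two strings suggested above, and files that cannot be read (some of these need root) are reported rather than skipped silently.

```python
import re

# The two strings suggested above, matched regardless of case.
PATTERN = re.compile(r"error|warning", re.IGNORECASE)


def find_suspicious(paths, pattern=PATTERN):
    """Return (path, line_number, text) for every line matching `pattern`."""
    hits = []
    for path in paths:
        try:
            with open(path, errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if pattern.search(line):
                        hits.append((path, number, line.rstrip("\n")))
        except OSError as exc:
            # Line number 0 marks a file that could not be opened or read.
            hits.append((path, 0, f"could not read: {exc}"))
    return hits
```

For example, `find_suspicious(["/var/log/boot.msg", "/var/log/warn", "/var/log/Xorg.0.log"])` returns a list that can be printed or sorted by file. Expect a fair amount of harmless noise, since many routine messages contain the word "warning".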

hmmmm, that last one caused me to think…if you wish, before you do
surgery (removing drives and replacing with one, for a test) you might
want to remove X from the equation…you have not declared which AVG
you are running but i’m pretty confident whatever it is doesn’t
require a GUI, so with the system and hardware you have boot to runlevel
three and see if it will complete the scan without rebooting…if it does
maybe it is an X problem, and neither a kernel nor RAID problem!



On 05/04/2011 04:36 AM, PsychoGTI wrote:
>
> As promised, Paste created: SUSE Paste -
> http://paste.opensuse.org/86652867

turn off/kill/drown the watchdog timer and see if your problems fade away…

> Also, I ran the MEMTEST86+… it ran completely 10 times, and did not
> find any bad RAM.

well, 10 passes is nothing…with the amount of RAM you have it needs
to run 10 hours just to get started good…24 would be better…

“overnight” is the recommended minimum run for a normal consumer level
setup with less than 2GB…



On 05/04/2011 07:36 AM, gogalthorp wrote:
>
> I had a problem like that, spontaneous reboot, turned out to be a power
> supply going bad. Replaced it and it is solid now.

yes, an uncommanded reboot could easily be a non-software problem…



PLEASE DON’T COMBINE REPLIES!

I completely missed your reply to my posting and only spotted a mention
in DenverD’s reply. Please post individual replies. Also please don’t
start a separate discussion about the merits of various types of RAID in
a thread about alleged kernel crashes!

Keep things clear and people stand more chance of following the issue
and being able to help you.

PsychoGTI wrote:
> djh-novell;2334387 Wrote:
>> Hmm, I’m not sure why this thread claims the kernel failed? Is there any
>> evidence of that? All I see so far is evidence of a hardware reboot.
>>
>> Anyway, given the newness of the h/w, what version of the kernel are you
>> running? I’d try installing the latest stable kernel in case there have
>> been recent bugfixes. I think there have been significant problems with
>> Sandy Bridge but I don’t follow hardware much.
>>
>
> I called it “Kernel Failure” because I can find no other indications
> except those last lines in /var/log/messages, where the kernel
> seems to exhibit some sort of problem, and then everything stops and
> reboots as normal (according to the log). The reboot isn’t initiated by
> openSUSE… it just happens. The lines I posted in my original post on
> this thread are the last ones in the log before lines pertaining to
> normal boot.

Right, but the kernel would experience problems if the hardware is
faulty as well, and the kernel apparently hasn’t oopsed, so the jury is
still out on what the cause is.

> Sandy Bridge seems to be having a couple issues… but so far I think
> that’s due to a new architecture and support/drivers are in short supply
> so far. The kernel version I’m running is 2.6.37.6, which is the latest
> offering by opensuse and Packman for x64 arch.

Not quite. The latest version is 2.6.39-rc5 from
<https://build.opensuse.org/project/show?project=Kernel%3AHEAD>
It might be worth trying as a test.

> I have been thinking
> about making a thread that lists troubles/issues I have encountered with
> 11.4 and Sandy Bridge architecture. So far in other installs (all on
> older machines or in VM’s), 11.4 was problem free.

Sounds like a good idea, if there isn’t such a thing already.

> Any other logs and such I should be checking besides
> /var/log/messages?

If you really think it’s a kernel problem, you can turn up the level of
kernel logging.

Another thing worth trying might be to connect a serial terminal and
route kernel messages there as well. Messages sometimes appear there that
otherwise get lost in a crash.

On 2011-05-04 08:44, DenverD wrote:
> On 05/04/2011 04:36 AM, PsychoGTI wrote:

> turn off/kill/drown the watchdog timer and see if your problems fade away…

The job of a watchdog is to reboot the machine if it times out. Could be
that…
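
For context on the `iTCO_wdt: Unexpected close, not stopping watchdog!` line in the first post: under the Linux watchdog API, a driver that supports "magic close" only stops its timer if the character `V` is written just before the device is closed. Otherwise the timer keeps running and resets the machine when it expires. The sketch below shows that convention from the side of a program holding the device. The function names are invented, the device object is anything with `write`, `flush`, and `close`, and whether this is what actually resets this particular box is not established in the thread.

```python
MAGIC_CLOSE = b"V"  # character defined by the watchdog "magic close" convention


def ping(device):
    """Tell the watchdog the system is still alive (any write counts)."""
    device.write(b"\0")
    device.flush()


def release(device):
    """Disarm the watchdog, then close the device.

    Closing without the magic character is what the driver reports as an
    unexpected close, and it leaves the countdown running.
    """
    device.write(MAGIC_CLOSE)
    device.flush()
    device.close()
```

On a real system the device would come from something like `open("/dev/watchdog", "wb", buffering=0)`. Opening it arms the timer, so this is not something to try casually on a production server, and drivers loaded with the nowayout option ignore the magic character.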

> well, 10 passes is nothing…with the amount of RAM you have it needs to
> run 10 hours just to get started good…24 would be better…

Considering that the machine crashes so easily, I don’t think longer will
find anything. Just a guess.


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)

On 2011-05-04 08:32, DenverD wrote:
> On 05/04/2011 04:36 AM, PsychoGTI wrote:

>> Is there a way to test the write to swap ability?

Hibernating.

Writing a C program that reserves large chunks of memory.

dd can be used for that, sort of.

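dd if=/dev/zero of=/dev/null bs=1G count=1

will use one gigabyte (the input has to be /dev/zero, since /dev/null returns nothing and the buffer would never be filled). You can try more, or several processes in parallel.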
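
The "program that reserves large chunks of memory" idea can also be sketched in a few lines of Python instead of C. The function name and the 4096-byte page size are assumptions (4 KiB is the usual x86-64 page size). Each page is written to so that the kernel really has to back it with RAM or swap.

```python
PAGE = 4096  # assumed page size in bytes


def reserve_memory(chunks, chunk_bytes):
    """Allocate `chunks` buffers of `chunk_bytes` each and touch every page."""
    held = []
    for _ in range(chunks):
        buffer = bytearray(chunk_bytes)
        # Write one byte per page so the pages are actually committed.
        for offset in range(0, chunk_bytes, PAGE):
            buffer[offset] = 1
        held.append(buffer)
    return held  # keep a reference, or the memory is released again
```

With 16 GB of RAM, something like `blocks = reserve_memory(18, 1024 ** 3)` while watching the swap line in top should push the kernel into swap, since cached data is dropped first. If swap cannot be used, the likely outcome is the OOM killer ending the process, so again this is best tried on a test boot rather than in production.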

> as mentioned, when it comes to RAID i’m an idiot…except to know that:
>
> -software raid is junk (invented by folks who $ELL $oftware)

No :)

Fake raid is junk, although it serves a purpose.

> -no RAID is a good substitute for a proper backup routine
> -only good if a 24x7x365 hot replacement is required
> -almost always is more difficult to setup/administer than none

yes.

>
> so, as a simple man i’ve not bothered to learn much about it…

I have one raid partition, to test things. Other than that, I don’t use it.

>> I called it “Kernel Failure” because I can find no other indications
>
> not sure, but i think Dave’s point is that an uncommanded reboot might just
> be a momentary loss of ground in a hardware switch…while most “kernel
> failures” are total system hard freeze, kernel panic, etc…
>

Yes, the kernel tries hard to report what is happening.

Another target culprit is the video driver.

> hmmmm, that last one caused me to think…if you wish, before you do
> surgery (removing drives and replacing with one, for a test) you might want
> to remove X from the equation…you have not declared which AVG you are
> running but i’m pretty confident whatever it is doesn’t require a GUI,
> so with the system and hardware you have boot to runlevel three and see if
> it will complete the scan without rebooting…if it does maybe it is an X
> problem, and neither a kernel nor RAID problem!

That’s what I was thinking - I promise I had not read this paragraph when I
wrote mine :)


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)