Problems with nvidia modules with 4GB RAM

My PC runs fine with 2GB of DDR2 memory installed (one DIMM). But when I put 4GB in (2x2GB, dual channel), strange things start to happen!

I have a GA-MA69G-S3H motherboard, AMD Athlon™ 64 X2 Dual Core Processor 4800+ CPU, and openSUSE 11.3 installed (2.6.34.7-0.5-desktop, x86_64).

Symptoms are that X won’t start. In /var/log/warn, I see entries like this:

Jan 28 18:55:19 media kernel: [    8.690297] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 260.19.21 Thu Nov 4 21:16:27 PDT 2010
Jan 28 18:55:20 media mcelog: failed to prefill DIMM database from DMI data
Jan 28 18:55:20 media kernel: [    9.463711] NVRM: Xid (0001:00): 53, CMDre 00000000 00000000 00000000 00000001 00000001
Jan 28 18:55:23 media kernel: [   12.543351] NVRM: Xid (0001:00): 53, CMDre 00000000 00000000 00000000 00000001 00000001
Jan 28 18:55:25 media kernel: [   13.276131] NVRM: Xid (0001:00): 6, PE007e
Jan 28 18:55:25 media kernel: [   13.279962] NVRM: Xid (0001:00): 6, PE007e
Jan 28 18:55:25 media kernel: [   13.283674] NVRM: Xid (0001:00): 6, PE007e
Jan 28 18:55:25 media kernel: [   13.287397] NVRM: Xid (0001:00): 6, PE007e
Jan 28 18:55:25 media kernel: [   13.291105] NVRM: Xid (0001:00): 6, PE007e
Jan 28 18:55:25 media kernel: [   13.294818] NVRM: Xid (0001:00): 6, PE007e
Jan 28 18:55:25 media kernel: [   13.298533] NVRM: Xid (0001:00): 6, PE007e
Jan 28 18:56:18 media kdm[1422]: X server startup timeout, terminating
Jan 28 18:57:28 media kdm[1422]: X server for display :0 cannot be started, session disabled
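
For anyone who wants to summarise a longer /var/log/warn, here is a minimal Python sketch that tallies the NVRM Xid codes seen in log lines like the ones above. The function name count_xids and the SAMPLE list are made up for illustration, and the regular expression only assumes the "NVRM: Xid (device): code," shape visible in this log. It does not interpret what each code means.

```python
import re
from collections import Counter

# Captures the numeric Xid code from lines such as
# "NVRM: Xid (0001:00): 53, CMDre 00000000 ..."
XID_RE = re.compile(r"NVRM: Xid \([^)]*\): (\d+),")


def count_xids(lines):
    """Return a Counter mapping Xid code -> number of occurrences."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few lines copied from the log above (hypothetical sample input).
SAMPLE = [
    "Jan 28 18:55:20 media mcelog: failed to prefill DIMM database from DMI data",
    "Jan 28 18:55:20 media kernel: 9.463711] NVRM: Xid (0001:00): 53, CMDre 00000000 00000000 00000000 00000001 00000001",
    "Jan 28 18:55:23 media kernel: 12.543351] NVRM: Xid (0001:00): 53, CMDre 00000000 00000000 00000000 00000001 00000001",
    "Jan 28 18:55:25 media kernel: 13.276131] NVRM: Xid (0001:00): 6, PE007e",
    "Jan 28 18:55:25 media kernel: 13.279962] NVRM: Xid (0001:00): 6, PE007e",
]

if __name__ == "__main__":
    for code, n in sorted(count_xids(SAMPLE).items()):
        print(f"Xid {code}: {n} occurrence(s)")
```

To run it against the real log, pass an open file instead of SAMPLE, for example count_xids(open("/var/log/warn")).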

And also things like this:

Jan 28 18:59:09 media kernel: [  238.370726] BUG: soft lockup - CPU#0 stuck for 61s! [xfslogd/0:343]
Jan 28 18:59:09 media kernel: [  238.370729] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nvidia(P) cpufreq_conservative cpufreq_userspace cpufreq_powersave powernow_k8 mperf ext4 jbd2 crc16 loop dm_mod snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm sr_mod snd_timer usb_storage r8169 k8temp snd soundcore snd_page_alloc i2c_piix4 cdrom edac_core edac_mce_amd pcspkr serio_raw sg button xfs exportfs sd_mod fan processor thermal thermal_sys ata_generic ahci pata_atiixp libata scsi_mod [last unloaded: preloadtrace]
Jan 28 18:59:09 media kernel: [  238.370758] CPU 0
Jan 28 18:59:09 media kernel: [  238.370759] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nvidia(P) cpufreq_conservative cpufreq_userspace cpufreq_powersave powernow_k8 mperf ext4 jbd2 crc16 loop dm_mod snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm sr_mod snd_timer usb_storage r8169 k8temp snd soundcore snd_page_alloc i2c_piix4 cdrom edac_core edac_mce_amd pcspkr serio_raw sg button xfs exportfs sd_mod fan processor thermal thermal_sys ata_generic ahci pata_atiixp libata scsi_mod [last unloaded: preloadtrace]
Jan 28 18:59:09 media kernel: [  238.370782]
Jan 28 18:59:09 media kernel: [  238.370785] Pid: 343, comm: xfslogd/0 Tainted: P 2.6.34.7-0.5-desktop #1 GA-MA69G-S3H/GA-MA69G-S3H
Jan 28 18:59:09 media kernel: [  238.370788] RIP: 0010:[<ffffffff81249d21>] [<ffffffff81249d21>] delay_tsc+0x61/0xe0
Jan 28 18:59:09 media kernel: [  238.370794] RSP: 0018:ffff880001e03bb0 EFLAGS: 00000246
Jan 28 18:59:09 media kernel: [  238.370796] RAX: 00000000b69a8501 RBX: ffff88012b43ffd8 RCX: 0000000000000000
Jan 28 18:59:09 media kernel: [  238.370798] RDX: 000000000024dcc4 RSI: 0000000000263c09 RDI: 0000000000263c1d
Jan 28 18:59:09 media kernel: [  238.370800] RBP: ffffffff810039b3 R08: ffff88012b460000 R09: ffff88012e4d3150
Jan 28 18:59:09 media kernel: [  238.370802] R10: 0000000000000200 R11: 0000000000000102 R12: ffff880001e03b30
Jan 28 18:59:09 media kernel: [  238.370804] R13: ffff88012b43ffd8 R14: ffff88012e420000 R15: ffffffff8101d9c5
Jan 28 18:59:09 media kernel: [  238.370806] FS: 00007fe6cb112700(0000) GS:ffff880001e00000(0000) knlGS:0000000000000000
Jan 28 18:59:09 media kernel: [  238.370808] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 28 18:59:09 media kernel: [  238.370810] CR2: 00007f09066144f8 CR3: 0000000001a04000 CR4: 00000000000006f0
Jan 28 18:59:09 media kernel: [  238.370812] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 28 18:59:09 media kernel: [  238.370814] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 28 18:59:09 media kernel: [  238.370816] Process xfslogd/0 (pid: 343, threadinfo ffff88012b43e000, task ffff88012c8ee300)
Jan 28 18:59:09 media kernel: [  238.370818] Stack:
Jan 28 18:59:09 media kernel: [  238.370819] 0000000000263c1d 00000000b69a8501 ffff88012e420000 0000000000000001
Jan 28 18:59:09 media kernel: [  238.370822] <0> 0000000000000001 ffff88012e420000 ffff88012bd54000 ffffffffa092f1ca
Jan 28 18:59:09 media kernel: [  238.370826] <0> 0000000000000000 ffffffffa0743d80 000000004d4303ed 0000000000011406
Jan 28 18:59:09 media kernel: [  238.370830] Call Trace:
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa092f1ca>] os_delay+0x6a/0x230 [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa090400f>] _nv021806rm+0x9/0xe [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] DWARF2 unwinder stuck at _nv021806rm+0x9/0xe [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004]
Jan 28 18:59:09 media kernel: [  238.371004] Leftover inexact backtrace:
Jan 28 18:59:09 media kernel: [  238.371004]
Jan 28 18:59:09 media kernel: [  238.371004] <IRQ>
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa0880421>] ? _nv024219rm+0x36/0x40 [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa04d36e8>] ? _nv006468rm+0x30a/0x3c2 [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa07f12ca>] ? _nv018960rm+0x319/0x4ab [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa071782d>] ? _nv012515rm+0x1cc/0x33f [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa086af1c>] ? _nv023376rm+0x1fa/0x9a8 [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa086ab2b>] ? _nv023377rm+0x8c7/0xabe [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa071a884>] ? _nv013015rm+0x237/0x7ef [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa070fc6b>] ? _nv013010rm+0x60/0x6a [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa071afe0>] ? _nv013021rm+0x1a4/0x9ce [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa07dec4f>] ? _nv019002rm+0x4c4/0x746 [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa07d5062>] ? _nv018982rm+0xc2/0x107 [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa090975d>] ? _nv002165rm+0x7d/0xae [nvidia]
Jan 28 18:59:09 media kernel: [  238.371004] [<ffffffffa090f235>] ? rm_isr_bh+0x5a/0x8d [nvidia]

(I see this for various processes, not just xfslogd. But the trace always ends in the nvidia module's, uh… is that an interrupt handler?)

Can someone help me out here? This is really bugging me, I’ll gather whatever info is needed to get to the bottom of this!

Thanks,

Matt

Hi
Have you checked the memory in your system? Run memtest86 from a boot cd/dvd. You could also try changing to the default kernel as well. But I would check your memory first.

Are the RAM modules exactly the same (as in a matched pair)? If not, set the BIOS to single channel mode. You could also check the voltages required for your memory modules; if your BIOS supports it, the voltage may need a tweak, and the timings too.

I ran memtest for about an hour, no errors - I’ll leave it overnight and see if anything turns up.

Yes, the DIMMs are a single kit - a matched pair.

Does this sound like memory problems to you? I’ve had memory problems in the past, and wasn’t sure if the DIMMs or motherboard were to blame, which is why I’ve bought the new DIMMs.

Hi
I would check the manufacturer's spec on the memory and verify that the
BIOS settings are correct. How’s your power supply?


Cheers Malcolm °¿° (Linux Counter #276890)
SUSE Linux Enterprise Desktop 11 (x86_64) Kernel 2.6.32.27-0.2-default
up 10 days 6:45, 4 users, load average: 0.30, 0.14, 0.04
GPU GeForce 8600 GTS Silent - Driver Version: 260.19.36

The BIOS doesn’t allow me to change much wrt the memory - just the voltage, which is set to auto. Otherwise, I’ve loaded the “failsafe” defaults in BIOS.
The RAM is Corsair XMS2, CM2X2048-6400C5C, the power-supply a reasonably new Jersey 600W job.

Here’s a screenshot of memtest, just before completing 8 successful passes:

http://goo.gl/EuzIo

The RAM timings of 5-5-5-18 are correct, but I’m not sure about 417MHz - I thought that 6400 should run at 400?
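
As a quick sanity check on those numbers, here is a small Python sketch of the usual DDR2 naming arithmetic: two transfers per clock on a 64-bit (8-byte) bus. The function name ddr2_figures is invented for this post, and the 417MHz figure is simply what memtest displayed, which may be an estimate rather than the real memory clock.

```python
def ddr2_figures(clock_mhz, bus_bytes=8):
    """Return (transfer rate in MT/s, peak bandwidth in MB/s) for a memory clock.

    DDR memory moves data on both clock edges, so the transfer rate is twice
    the clock; a standard DIMM is 64 bits (8 bytes) wide.
    """
    transfers = 2 * clock_mhz
    bandwidth = transfers * bus_bytes
    return transfers, bandwidth


if __name__ == "__main__":
    for clock in (400, 417):
        mts, mbs = ddr2_figures(clock)
        print(f"{clock} MHz clock -> DDR2-{mts}, about {mbs} MB/s peak")
```

So PC2-6400 does correspond to a 400MHz clock (DDR2-800), and a true 417MHz would be roughly 4% above the rated speed.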

The other thing that I’m not sure about is the range of memory tested. Although the “Testing:” line always has “4095M” at the end of it, the first two numbers there (the actual range being tested?) never exceed 2048M, as in the above screenshot. And if I ask memtest to probe the memory, it comes back with “3582M”…

At the bios main menu press [Ctrl] + [F1]

This will activate a hidden menu with more memory settings.

In this menu I see something in the automatically configured values that is not the same for both DIMMs.

Trfc0 for DIMM1 is 127.5ns
Trfc2 for DIMM2 is 75ns
Trfc1 for DIMM3 is 75ns
Trfc3 for DIMM4 is 75ns

(only DIMMs 1 and 2 are present.)
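
To compare those Trfc values with timings that are given in clock cycles (like 5-5-5-18), a nanosecond figure can be converted to cycles at a given memory clock. This is only a rough Python sketch: the helper name trfc_cycles is made up, it rounds up to a whole cycle, and the BIOS may well round or program the value differently.

```python
def trfc_cycles(trfc_ps, clock_mhz):
    """Convert a tRFC given in picoseconds to whole clock cycles, rounding up.

    cycles = tRFC / clock period = tRFC[ps] * clock[MHz] / 1_000_000
    Integer arithmetic is used so the rounding is exact.
    """
    return -(-(trfc_ps * clock_mhz) // 1_000_000)


if __name__ == "__main__":
    # Values shown in the BIOS: 127.5ns for DIMM1, 75ns for the others.
    for label, ps in (("127.5ns", 127_500), ("75ns", 75_000)):
        for clock in (400, 333):
            print(f"{label} at {clock} MHz = {trfc_cycles(ps, clock)} cycles")
```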

Hi
So in the BIOS, can you set the frequency to 400MHz and see how that
goes?

Can you check the motherboard manual and see if it will work with the
4GB of memory?

Can you also post the output from dmesg:


dmesg |grep e820

or check /var/log/boot.msg for the BIOS-provided physical RAM map info.



Hi
If you pull the memory modules and swap them, do the timings follow? Does it
indicate dual channel memory in the BIOS boot up messages?



When I manually set it to 400MHz, memtest still displays 417…

The manual indicates that it supports up to 16GB across 4 slots (1.8V, DDR2, dual channel, 800/667/533 MHz modules), so yes. I’ve also contacted Gigabyte, asking if these particular DIMMs are supported. I didn’t get a clear yes or no answer, but they asked questions about which slots the DIMMs were in, suggesting that the DIMMs are okay.

dmesg output:
EDITED OUT

Yes, the BIOS messages indicate that the memory is dual-channel. I’ll try reversing the DIMMs now…

When you swap out the sticks, check the slots are clean.

Increase your memory voltage from 1.8V to 1.85V (+0.05V).

I realised that the dmesg output that I’d posted was with only 2G installed, so I removed that.

Now I’ve tried with both DIMMs in, and I’ve tried reversing the order of DIMMs - that makes no difference to the Trfc numbers in the BIOS settings. I also tried manually setting the Trfc for DIMM2 to 127.5, that also didn’t help.

Here is the dmesg output from when I’ve got 4GB installed:

[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
[    0.000000]  BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000cfee0000 (usable)
[    0.000000]  BIOS-e820: 00000000cfee0000 - 00000000cfee3000 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000cfee3000 - 00000000cfef0000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000cfef0000 - 00000000cff00000 (reserved)
[    0.000000]  BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
[    0.000000] e820 update range: 0000000000000000 - 0000000000001000 (usable) ==> (reserved)
[    0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
[    0.000000] e820 update range: 00000000cff00000 - 0000000100000000 (usable) ==> (reserved)
[    0.000000] e820 update range: 0000000000001000 - 0000000000010000 (usable) ==> (reserved)
[    0.000000] Aperture pointing to e820 RAM. Ignoring.
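
To see how much RAM the BIOS actually hands to the kernel, the "usable" ranges in that map can be added up. Below is a minimal Python sketch. The names usable_bytes and SAMPLE are made up for illustration, and the regular expression assumes the "BIOS-e820: start - end (type)" layout printed by this 2.6.34 kernel (other kernel versions may format the lines differently).

```python
import re

# Matches "BIOS-e820: <start> - <end> (<type>)" with 16-digit hex addresses.
E820_RE = re.compile(r"BIOS-e820:\s+([0-9a-f]{16}) - ([0-9a-f]{16}) \((.+?)\)")


def usable_bytes(dmesg_text):
    """Sum the sizes of all e820 ranges marked (usable)."""
    total = 0
    for start, end, kind in E820_RE.findall(dmesg_text):
        if kind == "usable":
            total += int(end, 16) - int(start, 16)
    return total


# The BIOS-e820 lines from the dmesg output above (timestamps omitted).
SAMPLE = """\
BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000cfee0000 (usable)
BIOS-e820: 00000000cfee0000 - 00000000cfee3000 (ACPI NVS)
BIOS-e820: 00000000cfee3000 - 00000000cfef0000 (ACPI data)
BIOS-e820: 00000000cfef0000 - 00000000cff00000 (reserved)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
"""

if __name__ == "__main__":
    total = usable_bytes(SAMPLE)
    print(f"usable: {total} bytes ({total / 2**20:.1f} MiB)")
```

The total comes to a little under 4096 MiB, so all of the RAM seems to be reported. The last usable range starts at the 4GB mark, which suggests the BIOS is remapping part of the memory above the hole reserved below 4GB.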

I also tried reducing the speed of the memory down to 667MHz. In this mode I was able to boot correctly and get X to start! Looking at the dmesg output, it was identical to above, but with one line missing: “Aperture pointing to e820 RAM. Ignoring.”

After I’d tested different combinations, I tried to go back to these settings (667MHz), but something is different - even at slower speeds I now also get the “Aperture” message, and X locks up again. :( But I’m convinced that this aperture is tied to the problem I’m seeing.

The RAM sticks in my server gave the same error, at the time I built it. I found a suggestion to raise voltage (which my mobo’s BIOS allows) to 1.90 V. After doing so, the machine booted, no issues at all, so I left it that way. Dual channel enabled.

These sounded like good tips, so I increased my mem voltage too. First to 1.85V, then to 1.95V, then in combination with reducing the speed to 667MHz.

No good, the problem remains the same. :’(

Hi
Well, the aperture is AFAIK related to onboard video. Do you have onboard video, and is it disabled in the BIOS? Or is the Nvidia device an onboard device?

What happens if you move one RAM module so it’s running in single channel mode?

Try using your other memory slots.

eg, if you are using dimm 1 + dimm 2, use dimm 3 + dimm 4

or

if dimm 1 + dimm 3, use dimm 2 + dimm 4

Is there a “Memory Remap Feature” in the North bridge configuration of your BIOS setup that you can enable or disable?

I’ve tried putting the 2nd DIMM in slot 3 - the machine comes up in single-channel mode with 4GB, but still hangs with the same error when I try to start X.

The nvidia card is a PCI-E card. I’ve tried removing that and using the on-board graphics. The aperture message in dmesg disappears, and X starts. So is the problem with my graphics card?

I’ve also left the internal graphics enabled and booted with the nvidia card in. The aperture message is not seen, but X still won’t start. The card is an ASUS nvidia 8400 - this one: ASUSTeK Computer Inc. - ASUS - ASUS EN8400GS SILENT/P/512M

Hi
So are you running the proprietary driver, installed via the methods here:
openSUSE Graphic Card Practical Theory Guide for Users

What is the aperture setting in the BIOS?

I would drop back to the 2GB of RAM, disable the onboard video, ensure you’re using the latest kernel and install the latest nvidia driver. Then pop your RAM back in and see how that goes.

I don’t think it’s a hardware problem - I run a couple of (albeit older) Gigabyte boards with 4GB of RAM and have had no problems with either 32- or 64-bit kernels.
I would first try the existing setup, but remove your screen settings from your xorg.conf or xorg.conf.d and try a plain vanilla xorg setup.
If this doesn’t work, then as Malcolm says, try reinstalling the nvidia drivers.