memtest+ questions

shawnsterp · June 29, 2019, 5:57pm

I installed memtest+ to the bootloader. When I run the test, it stops after about 5 seconds and restarts the computer. I assume this is not good and that it means it is failing big time? I would love a confirmation here.
I want to rule out the possibility that it is simply just the program not working correctly.

History:
I went down this path because I have been trying to debug why my computer is freezing when playing games. I have a ryzen 1700 and a vega 56. I believe that I have ruled out the ryzen bug (relates to c-states and idle times) because it does not ever freeze when idle, only when playing games. Although I have not found a way to monitor the gpu temp, I can see the “tachometer” on it and it never gets above 3 out of 10. And the computer seems to run pretty cool in general, so I think it is not that.

I then saw people talking about how RAM could cause this, hence the memtest. But I also found out that there are apparently specific lists of RAM that have been tested to work on each motherboard (I did not know this was a thing, sadface) and mine is not there. There does appear to be a way in the BIOS to adjust the voltage going to the RAM, but I am a little leery of mucking around with that since I don’t know what the heck I am doing (people have suggested bumping it up to 1.4. It is currently set to 1.2).

Any help is appreciated.

BTW: I realize this has nothing to do with opensuse (I even have a new install, and the same behavior was on manjaro). But, I would appreciate any help anyway!

malcolmlewis · June 29, 2019, 6:05pm

Hi
Have you tried pulling the RAM sticks and re-inserting? Are they all matched and in the correct slots?

After re-seating, run memtest again, all ok?

Boot into a Linux operating system and run dmidecode (as root) to see the exact part number(s) on the RAM modules.


dmidecode -t memory

From the part numbers you can then check on the manufacturing specs to see what they need to be set at, both timing and voltage, then can look at the BIOS settings and adjust as required.

Normally the system should detect, but maybe if the RAM is not of the specified ones, then the above tweaking may be required.

Again run the memtest checks after tweaking, if all ok, then install prime95 and run the torture test to be sure…
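
To compare all sticks at a glance, the dmidecode output can be trimmed to the few fields that matter here. This is only a minimal sketch, assuming a POSIX shell with grep; `dimm_summary` is a made-up helper name, and the field labels are the ones dmidecode printed in this thread (they may differ between dmidecode versions).

```shell
# dimm_summary: hypothetical helper, not part of dmidecode.
# Reads `dmidecode -t memory` output on stdin and keeps only the slot,
# part number, speed and configured-voltage lines.
dimm_summary() {
    grep -E '^[[:space:]]*(Locator|Part Number|Speed|Configured Memory Speed|Configured Voltage):'
}

# Usage (needs root):
#   dmidecode -t memory | dimm_summary
```

If the sticks report different part numbers or speeds, that is worth sorting out before touching any voltage setting.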

shawnsterp · June 29, 2019, 6:38pm

Thanks for the help and suggestions. I have not tried re-seating the RAM. That is on the todo list and I will do that next. Just in the last hour I figured out how to get journalctl to look at past boots rather than just the current one, so that is another thing that I will check if/when it crashes again. But you are right, I should reseat immediately. I will do that now. Who knows, maybe I’ll get lucky. In the meantime, for all four memory sticks, “dmidecode -t memory” gave an output such as this:

Handle 0x0037, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0027
        Error Information Handle: 0x0036
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_B2
        Bank Locator: BANK 3
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 2133 MT/s
        Manufacturer: G-Skill
        Serial Number: 00000000
        Asset Tag: Not Specified
        Part Number: F4-3200C16-8GTZSW
        Rank: 1
        Configured Memory Speed: 1067 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

The Error Information Handle caught my eye. Is the voltage listed here what it SHOULD be running at, or what it IS running at? I ask because that is the voltage listed on the spec sheet from G.Skill. Also, forgive my ignorance, but what do you mean by timing for the RAM? What should I be looking for here? (Or is that the speed, 2133 MT/s?)

EDIT: actually the spec sheet has two listings for voltage. SPD voltage is 1.2. Tested voltage is 1.35.
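
On what "timing" means in practice: the first timing number (CAS latency, CL) is counted in memory clock cycles, so the real delay depends on the speed the RAM runs at. Below is a rough sketch of that arithmetic, assuming the part number F4-3200C16 encodes DDR4-3200 at CL16. The CL15 used for the 2133 MT/s default is purely an assumed value, since the dmidecode output above does not show timings, and `cas_ns` is an illustrative name.

```shell
# cas_ns CL MTS: CAS latency in nanoseconds (hypothetical helper).
# DDR memory transfers twice per clock, so one clock cycle lasts
# 2000 / (MT/s) nanoseconds.
cas_ns() {
    awk -v cl="$1" -v mts="$2" 'BEGIN { printf "%.2f\n", cl * 2000 / mts }'
}

cas_ns 16 3200   # rated profile implied by the part number
cas_ns 15 2133   # CL15 is an assumption for the default profile
```

The same factor of two probably explains why "Configured Memory Speed: 1067 MT/s" is about half of "Speed: 2133 MT/s": the smaller figure looks like the clock in MHz rather than the transfer rate, though that is a guess from the numbers alone.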

malcolmlewis · June 29, 2019, 7:02pm

Hi
See the “Tested Latency” so timings in BIOS should be set to 16-16-16-36?

I would check the seating of the RAM first and retest, then look at moving the voltage a little, say to 1.25v and test again.

What is the spec for the CPU with regard to memory speed? On my system (Intel) I see:


Memory Device
    Array Handle: 0x003F
    Error Information Handle: Not Provided
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 4096 MB
    Form Factor: DIMM
    Set: None
    Locator: DIMM2
    Bank Locator: CHANNEL B SLOT1
    Type: DDR3
    Type Detail: Synchronous
    Speed: 1600 MT/s <==
    Manufacturer: Samsung
    Serial Number: 13B2DCED
    Asset Tag: 9876543210
    Part Number: M378B5173QH0-CK0  
    Rank: 1
    Configured Memory Speed: 1600 MT/s <==

You also might want to check over-clocking forums, as you may pick up some snippets about your RAM and setup…

It’s all about little tweaks and lots of testing…

mrmazda · June 29, 2019, 11:12pm

I suggest you run MemTest86 free version instead of memtest86+. The two are not the same thing. I haven’t observed reliable operation from memtest86+ in many moons, except on DDR2 and older hardware.

shawnsterp · June 30, 2019, 2:14pm

Okay, thanks. I took your advice and ran memtest86 overnight. 48 tests, no errors. So, now I am not sure where to go, except maybe to keep fiddling with minor voltage increments. One of the games that most reliably triggers the freeze is Hearts of Iron 4. It is not a graphically taxing game, but it does eat up a lot of memory, which still makes me think that it might have something to do with the RAM – but of course I don’t really know. There is that bug I mentioned before about c-states at idle. Again, I do not THINK that it is that, but I still took precautions and disabled c-state management in the BIOS.

Since this ram is not listed on the approved list for the motherboard, would I have better luck buying ram that IS on it, or would that be a complete waste of money? Again, any help or suggestions are greatly appreciated.

shawnsterp · June 30, 2019, 2:19pm

So, this is the last few lines on the previous boot where I was playing Hearts of Iron 4 to trigger the crash:

Jun 30 08:04:47 linux-i7cx org_kde_powerdevil[8622]: powerdevil: Scheduling inhibition from ":1.82" "My SDL application" with cookie 29 and reason "Playing a game"
Jun 30 08:04:47 linux-i7cx org_kde_powerdevil[8622]: powerdevil: Releasing inhibition with cookie  29
Jun 30 08:04:47 linux-i7cx org_kde_powerdevil[8622]: powerdevil: It was only scheduled for inhibition but not enforced yet, just discarding it
Jun 30 08:04:52 linux-i7cx org_kde_powerdevil[8622]: powerdevil: Enforcing inhibition from ":1.82" "My SDL application" with cookie 29 and reason "Playing a game"
Jun 30 08:04:52 linux-i7cx org_kde_powerdevil[8622]: powerdevil: By the time we wanted to enforce the inhibition it was already gone; discarding it
Jun 30 08:05:07 linux-i7cx org_kde_powerdevil[8622]: powerdevil: Scheduling inhibition from ":1.82" "My SDL application" with cookie 30 and reason "Playing a game"
Jun 30 08:05:07 linux-i7cx org_kde_powerdevil[8622]: powerdevil: Releasing inhibition with cookie  30
Jun 30 08:05:07 linux-i7cx org_kde_powerdevil[8622]: powerdevil: It was only scheduled for inhibition but not enforced yet, just discarding it
Jun 30 08:05:12 linux-i7cx org_kde_powerdevil[8622]: powerdevil: Enforcing inhibition from ":1.82" "My SDL application" with cookie 30 and reason "Playing a game"
Jun 30 08:05:12 linux-i7cx org_kde_powerdevil[8622]: powerdevil: By the time we wanted to enforce the inhibition it was already gone; discarding it

I used the “journalctl -b -1 -n” command for this. Let me know if there is some other command to get more info or whatever. I don’t really know what the heck I am doing lol. Anyway, plz let me know if this is relevant.

EDIT: I looked back at the entire journalctl and this is the stuff that happened right before the “playing a game” messages:

Jun 30 07:56:17 linux-i7cx kwin_x11[8574]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 14566, resource id: 85983252, major code: 19 (DeleteProperty), minor code: 0
Jun 30 07:56:17 linux-i7cx kwin_x11[8574]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 14570, resource id: 85983252, major code: 18 (ChangeProperty), minor code: 0
Jun 30 07:56:17 linux-i7cx kwin_x11[8574]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 14576, resource id: 85983252, major code: 19 (DeleteProperty), minor code: 0
Jun 30 07:56:17 linux-i7cx kwin_x11[8574]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 14577, resource id: 85983252, major code: 18 (ChangeProperty), minor code: 0
Jun 30 07:56:17 linux-i7cx kwin_x11[8574]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 14578, resource id: 85983252, major code: 19 (DeleteProperty), minor code: 0
Jun 30 07:56:17 linux-i7cx kwin_x11[8574]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 14579, resource id: 85983252, major code: 19 (DeleteProperty), minor code: 0
Jun 30 07:56:17 linux-i7cx kwin_x11[8574]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 14580, resource id: 85983252, major code: 19 (DeleteProperty), minor code: 0
Jun 30 07:56:17 linux-i7cx kwin_x11[8574]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 14581, resource id: 85983252, major code: 7 (ReparentWindow), minor code: 0
Jun 30 07:56:17 linux-i7cx kwin_x11[8574]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 14582, resource id: 85983252, major code: 6 (ChangeSaveSet), minor code: 0
Jun 30 07:56:17 linux-i7cx kwin_x11[8574]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 14583, resource id: 85983252, major code: 2 (ChangeWindowAttributes), minor code: 0
Jun 30 07:56:17 linux-i7cx kwin_x11[8574]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 14584, resource id: 85983252, major code: 10 (UnmapWindow), minor code: 0

nrickert · June 30, 2019, 2:42pm

Oh, something like:

2019-06-30T07:38:50.279355-05:00 nwr2 kwin_x11[2735]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 10587, resource id: 117440516, major code: 18 (ChangeProperty), minor code: 0

My logs have many messages like that. I am ignoring them, because they do not seem to indicate any problem (other than flooding logs with unimportant messages).

tannington · June 30, 2019, 4:52pm

That’s “sort of normal” …

That error message is rather a red herring, it’s issued when a window disappears unexpectedly (generally) as a result of something else going belly up or otherwise nuking itself into oblivion.

shawnsterp · June 30, 2019, 5:05pm

I failed to realize till just now that “journalctl -b -1 -n” is only showing logs for BOOT. Since the system is booting just fine, I assume this is not helpful. Using yast, I looked at the systemd journal. Here is an entry that I believe may be relevant, as it surfaced the last time my system crashed:

BUG: unable to handle kernel paging request at fffff7ffb4e3dd80

However, over the last few days there have been a few other messages that are similar yet different:

BUG: Bad rss-counter state mm:00000000fc5397a5 idx:2 val:-4
BUG: Bad page map in process X  pte:80000007d1b82876 pmd:7b171c067

No idea what this means.

BTW, I have reseated the ram at this point. I never got around to mentioning that.

tannington · June 30, 2019, 5:08pm

I doubt you have a RAM issue… Try running Prime95 as suggested by @malcolmlewis, that’s a real stress test.

shawnsterp:

Since this ram is not listed on the approved list for the motherboard, would I have better luck buying ram that IS on it, or would that be a complete waste of money? Again, any help or suggestions are greatly appreciated.

“Approved” normally means it’s RAM the MB manufacturer has tested in their board.

Provided the RAM you use is of “reputable” make and of equal specification to the “approved” there’s no problem (in my experience) in using it.

tannington · June 30, 2019, 5:26pm

shawnsterp:

BUG: unable to handle kernel paging request at fffff7ffb4e3dd80
However, over the last few days there have been a few other messages that are similar yet different:
BUG: Bad rss-counter state mm:00000000fc5397a5 idx:2 val:-4
BUG: Bad page map in process X  pte:80000007d1b82876 pmd:7b171c067

What is at the start of those lines? I need to see where the error originates.

For example: “Jun 30 15:16:20 Orion-15 kernel:” - Date/Time/Host and issued by the kernel

My “feeling” is a kernel error/bug – is this a new issue following a kernel update by any chance?

shawnsterp · June 30, 2019, 6:08pm

It’s not letting me copy/paste the whole line, but every single one of those lists kernel as the source.

I believe you are probably correct. I changed the systemd search parameters from crit to err in YaST, and I am getting a whole lot of things related to amdgpu. Several things like the following over the last couple of days:

Jun 29 22:25:35  kernel  amdgpu: [powerplay] No response from smu
Jun 29 22:25:36  kernel  amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0x0
Jun 30 11:37:10  kernel  [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=154467, emitted seq=154469
Jun 30 11:37:10  kernel  [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process hoi4 pid 4376 thread hoi4:cs0 pid 4381

I can confirm that the last two happened on my last crash. As for is it recent? Not really, although I do believe it has gotten worse over the course of 4-5 months. It went from happening every so often to happening pretty regularly, and even consistently on some games. This is a week old install, but it was happening on my manjaro install as well. On manjaro, I was able to test out several different kernels (4.xx - 5.xx) and it happened on all of them. Of course, afaik, they were all getting patches so I guess there could be a bug affecting them all… =(

EDIT: No idea what prime95 is. I found this: GIMPS - Free Prime95 software downloads - PrimeNet. Is this what he was talking about?
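
For pulling these messages out of the journal from a terminal instead of YaST: `journalctl -b -1` selects the whole previous boot session (everything logged until the machine went down, not only the boot phase), `-k` restricts it to kernel messages, and `-n` on its own just limits the output to the last 10 lines. A minimal sketch follows; `suspect_lines` is a made-up helper name, and the patterns are simply the strings that showed up in this thread, not a complete list of things to look for.

```shell
# suspect_lines: hypothetical filter, reads log lines on stdin.
# Keeps kernel "BUG:" reports and amdgpu/drm messages, drops the rest.
suspect_lines() {
    grep -E 'BUG:|amdgpu|\[drm'
}

# Usage (kernel messages from the previous boot):
#   journalctl -k -b -1 --no-pager | suspect_lines
```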

malcolmlewis · June 30, 2019, 6:44pm

Hi
Yes that’s the correct site for prime95

Can you, as root user, show the output from:


/sbin/lspci -nnk | egrep -A3 "VGA|Display|3D"
systool -vm amdgpu

From the systool output, just paste the Parameters: section.
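
If systool is not installed, the same values can usually be read straight from sysfs, where loaded modules expose their parameters under /sys/module/<module>/parameters/. A minimal sketch under that assumption; `module_params` is a hypothetical helper name, and some parameter files may not be readable without root.

```shell
# module_params [DIR]: hypothetical helper.
# Prints 'name = "value"' for every readable parameter file in DIR
# (default: the amdgpu parameters directory in sysfs).
module_params() {
    dir="${1:-/sys/module/amdgpu/parameters}"
    for f in "$dir"/*; do
        [ -r "$f" ] || continue   # skip unreadable files and an unmatched glob
        printf '%-20s = "%s"\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Usage:
#   module_params
```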

tannington · June 30, 2019, 6:46pm

In view of those comments I’m less inclined to think it’s a kernel bug per se … but it could be I suppose :\ …

I guess this could be a hard one to nail unless someone comes along with a “Yep, seen exactly that, it’s caused by ‘whatever’”.

shawnsterp:

EDIT: No idea what prime95 is. I found this: GIMPS - Free Prime95 software downloads - PrimeNet. Is this what he was talking about?

Yes, that’s it. It doesn’t qualify as free and open-source software, so it can’t be included in the openSUSE repositories.

It is probably well worth giving it a run, the longer the better, doesn’t stress the graphics adaptor obviously, but does a pretty good job on the rest.

I’m not a “gamer” at all, but would it be worth asking on any games forums associated with the one(s) you have problems with, maybe others have seen the issue.

shawnsterp · June 30, 2019, 7:30pm

malcolmlewis:

Can you as root user show the output from;
/sbin/lspci -nnk | egrep -A3 "VGA|Display|3D"
systool -vm amdgpu
From the systool output, just paste the Parameters: section.

Okay, I will check out the prime95 thing in the meantime. might be a while as the wife is getting antsy lol (thanks again everyone for the help):

# /sbin/lspci -nnk | egrep -A3 "VGA|Display|3D"
0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c3)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3681]
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

and

Parameters:
    aspm                = "-1"
    audio               = "-1"
    bapm                = "-1"
    benchmark           = "0"
    cg_mask             = "4294967295"
    cik_support         = "0"
    cntl_sb_buf_per_se  = "0"
    compute_multipipe   = "-1"
    cwsr_enable         = "1"
    dc                  = "-1"
    dcfeaturemask       = "0"
    debug_largebar      = "0"
    deep_color          = "0"
    disable_cu          = "(null)"
    disp_priority       = "0"
    dpm                 = "-1"
    emu_mode            = "0"
    exp_hw_support      = "0"
    fw_load_type        = "-1"
    gartsize            = "4294967295"
    gpu_recovery        = "-1"
    gttsize             = "-1"
    halt_if_hws_hang    = "0"
    hw_i2c              = "0"
    hws_max_conc_proc   = "8"
    ignore_crat         = "0"
    ip_block_mask       = "4294967295"
    job_hang_limit      = "0"
    lbpw                = "-1"
    lockup_timeout      = "10000"
    max_num_of_queues_per_device= "4096"
    moverate            = "-1"
    msi                 = "-1"
    ngg                 = "0"
    noretry             = "0"
    param_buf_per_se    = "0"
    pcie_gen2           = "-1"
    pcie_gen_cap        = "0"
    pcie_lane_cap       = "0"
    pg_mask             = "4294967295"
    pos_buf_per_se      = "0"
    ppfeaturemask       = "4294787071"
    prim_buf_per_se     = "0"
    runpm               = "-1"
    sched_hw_submission = "2"
    sched_jobs          = "32"
    sched_policy        = "0"
    sdma_phase_quantum  = "32"
    send_sigterm        = "0"
    si_support          = "0"
    smu_memory_pool_size= "0"
    test                = "0"
    virtual_display     = "(null)"
    vis_vramlimit       = "0"
    vm_block_size       = "-1"
    vm_debug            = "0"
    vm_fault_stop       = "0"
    vm_fragment_size    = "-1"
    vm_size             = "-1"
    vm_update_mode      = "-1"
    vram_page_split     = "512"
    vramlimit           = "0"

malcolmlewis · June 30, 2019, 7:44pm

Hi
So when you get a chance, add the following kernel boot option to grub via YaST -> Bootloader


amdgpu.ngg=1

It may help… it may not, as I’m not sure if Vega supports next generation graphics (ngg).

shawnsterp · June 30, 2019, 7:54pm

Oooookay,

I ran the torture test from prime95 and that triggered the system crash! Very quickly, within seconds. So, if I am reading the site right, then it was testing the CPU, not the GPU? Where do we go from here?

@malcolmlewis - I will add that now.
edit: just for completeness, this is now what the boot parameter line looks like in full: splash=silent resume=/dev/system/swap quiet idle=nomwait amdgpu.ngg=1

I added the idle=nomwait a few days back as that is a bug on these CPUs (Ryzen), but I doubt it’s doing anything as I never crashed on idle.
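
After a reboot it may be worth confirming that the options actually reached the kernel, since the running kernel's command line can be read from /proc/cmdline. A minimal sketch; `has_boot_param` is a made-up helper name, and the optional second argument exists only so the check can be tried against a pasted line like the one above.

```shell
# has_boot_param PARAM [CMDLINE]: hypothetical helper.
# Succeeds if PARAM appears as a whole word in CMDLINE
# (default: the running kernel's /proc/cmdline).
has_boot_param() {
    cmdline="${2:-$(cat /proc/cmdline)}"
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage:
#   has_boot_param amdgpu.ngg=1 && echo "ngg option is on the kernel command line"
```

Whether the driver accepted the value is a separate question; the amdgpu Parameters: listing would show `ngg = "1"` if it did.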

gogalthorp · June 30, 2019, 9:33pm

Failing hardware maybe. The increased frequency and different OS’s make me think that the CPU is starting to fail. Generally large scale integration stuff either fails early or almost never. But the key word is almost. On the MB, failing capacitors can cause problems.

shawnsterp · June 30, 2019, 9:35pm

I had attempted to overclock my cpu (only in an effort to fix my crashing issue, otherwise I don’t care). I reset the bios on the cpu, and raised the levels for the cpu fan. I am currently running the prime95 test now, although it has been stuck on the same line for a little while now so I cannot tell if it is actually doing something or just stuck. Either way, obviously my overclocking attempt left things unstable. However, this may be a different issue as I only tried to overclock to stop the freezing. I will test the games again after the test.