Lenovo ThinkPad P70, data corruption issues?

rlk · August 29, 2017, 5:48pm

I recently bought a refurb ThinkPad P70 (Xeon E3-1505Mv5, NVIDIA M4000M, UHD screen, 2x16GB RAM), and am having an odd data corruption issue with openSUSE Leap 42.3 (either the stock 4.4.79+ or a more recent 4.12). Specifically, anything that uses sockets – network or UNIX domain – gets sporadic data corruption. Sometimes it’s one or two flipped bits; sometimes several bits in a byte are wrong. The bad bytes appear to be roughly clustered, but each one is isolated: there are no runs of consecutive bad bytes.

One pattern I did pick out is that the bottom 5 bits of all of the affected file offsets (using scp to copy the files) are all ones.
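
For anyone who wants to check a known-good file against a suspect copy for the same pattern, here is a minimal Python sketch. It is not the tool used above; the function names diff_offsets and summarize are made up for illustration. It lists the differing bytes and tallies the low 5 bits of each offset and the number of bits flipped in each bad byte. It assumes both files are the same length; extra trailing bytes in the longer file are not reported.

```python
import sys
from collections import Counter

CHUNK = 1 << 20  # read 1 MiB at a time


def diff_offsets(path_a, path_b):
    """Yield (offset, byte_a, byte_b) for every position where the files differ.

    Only the common length is compared; a length mismatch is not reported.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        base = 0
        while True:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if not a and not b:
                break
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        yield base + i, x, y
            base += max(len(a), len(b))


def summarize(diffs):
    """Tally differences by low 5 bits of the offset and by bits flipped per byte."""
    low5 = Counter()
    flipped = Counter()
    for offset, x, y in diffs:
        low5[offset & 0x1F] += 1             # 0x1F masks the bottom 5 bits
        flipped[bin(x ^ y).count("1")] += 1  # XOR exposes the flipped bits
    return low5, flipped


if __name__ == "__main__" and len(sys.argv) == 3:
    low5, flipped = summarize(diff_offsets(sys.argv[1], sys.argv[2]))
    print("low 5 bits of offset -> count:", dict(low5))
    print("bits flipped per byte -> count:", dict(flipped))
```

If every bad offset has its bottom 5 bits set, the first tally will contain only the key 31.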

This happens with inbound, outbound, and local rsync (rsync is useful here because it has some additional checks). Whether I rsync from a remote machine, between two directories on the local machine (which uses a UNIX domain socket), or to a remote machine, I get the same kinds of comparison or protocol failures. I’ve seen it with inbound ftp (downloading RPMs, where of course there are checksums), too.

I have not seen any errors with simple file copy using cp -r or tar through a pipe.

This does not happen if I boot Knoppix 7.7.1 (based on kernel 4.7.9).

I have run a full pass of memtest86 with no errors. I have tried using just one DIMM at a time and changing which slot I use, and using the rear panel slots vs. the under-keyboard slots; no change in the symptoms. The BIOS is up to date, presumably with the microcode fix.

I am presently running diagnostics; that will take a while longer. I am also going to try a vanilla kernel (provided by openSUSE RPMs) to see if that makes a difference.

What I have to decide is whether I return it (and most likely eat a 15% restocking fee; I didn’t see any indication of this under Windows, which I don’t plan to use but I kept the SSD with it installed), get the mobo replaced under warranty (have to ship to Lenovo, presumably at my expense), or find a solution to this on my own. I have not been able to find anything on the net about a problem like this, either. It’s an odd one; the symptoms look generally memory-ish, but it happened with two different DIMMs in different slots, and it’s only happening with sockets. It’s also apparently happening above the transport layer, since TCP checksums aren’t catching it.

Anyone have any thoughts here?

malcolmlewis · August 29, 2017, 6:01pm

Hi
A Samsung SSD? If so, another user with the same issue… https://forums.opensuse.org/showthread.php/526664-Installing-on-SSD-moving-tmp-and-var-into-tmpfs-in-RAM-memory

rlk · August 29, 2017, 7:30pm

Doesn’t look like the same thing at all, and in any event, this happened with two different SSDs (one may have been a Samsung; the other, which I’m currently using, is a Crucial MX300).

malcolmlewis · August 29, 2017, 8:47pm

Hi
I have a Crucial running on one of my 42.3 test systems;


Model Family:     Crucial/Micron RealSSD C300/M500
Device Model:     Crucial_CT120M500SSD1

Have you checked the SSD with smartctl, and is the firmware up to date?

If you really want to test, use prime95, best tool for stress testing

rlk · August 29, 2017, 9:59pm

Again, I’ve seen the same problem with an installation on two different SSDs (different brands and models – I’m at home now, and the other one’s a SanDisk M400), and two separate DIMMs (singly and in combination). And it only affects things that use sockets; I have not seen it in any other context. I have no reason to think it’s related to disk at all.

malcolmlewis · August 29, 2017, 10:07pm

Hi
Like I said, prime95 stress test will test your ram and confirm if it’s that…

rlk · August 30, 2017, 4:27pm

So, some more information overnight.

It appears that if I remove the xf86-video-nouveau package and use the vanilla kernel (4.12.9-1.gf2ab6ba-vanilla), this problem goes away. This seems distinctly odd, and this holds even if I boot to runlevel 3 and never start the X server to begin with (and blacklist the nouveau kernel driver). However, with either nouveau installed or using the default kernel of the same vintage I have the data corruption issue I described.

I ran a full pass of the Lenovo diagnostics in addition to memtest86, and found nothing. But it has me at a loss for explanation. The failure is robust against SSD and memory configuration, and appears confined to something both very specific and very general (use of sockets). It also happens regardless of whether I have hyperthreading enabled in the BIOS (and the BIOS is up to date in any event). But with two software changes, one of which should be completely unrelated, the problem appears to reliably go away.

This is making me nervous; if I can’t find an explanation and fix, I’ll certainly have to return the machine even if I have to eat the restocking fee.

malcolmlewis · August 30, 2017, 4:52pm

Hi
There have been some threads about the nouveau driver. What about the standard 42.3 kernel: can you duplicate it there? If so, then I would create a bug report:
openSUSE:Submitting bug reports - openSUSE

rlk · August 30, 2017, 7:40pm

I can reproduce it with the standard 42.3 kernel (either the 4.4-based one or the 4.12-based standard kernel) with or without the nouveau driver being installed. With the vanilla 4.12 kernel, I can’t. The vanilla 4.4 kernel does not, I believe, have proper Skylake support so I don’t think I can test that.

I’m going to look for the threads in question, but do you have any links handy?

malcolmlewis · August 30, 2017, 7:55pm

Hi
I would create a bug then with all those details and see what happens.
Also post the bug number back here for others to reference.

rlk · August 30, 2017, 9:06pm

I’m first going to run prime95 for a while to see if it looks like a hardware problem. If that doesn’t show anything, I will file a bug.

rlk · August 31, 2017, 4:31am

No problems after 7+ hours of Prime95. I filed bug https://bugzilla.opensuse.org/show_bug.cgi?id=1056535

rlk · September 2, 2017, 5:32am

So I’m taking a different tack. I’m using socat to run a file copy back-to-back and comparing the results. With UNIX domain sockets, no trouble after 50 passes. With TCP or OpenSSL, I don’t get errors every pass, but I get enough to likely account for what I’ve been seeing. And I do see this under KNOPPIX too, contrary to what I reported earlier.

I’ve had up to about a dozen consecutive clean passes with TCP sockets. That’s about 60 GB, a long enough clean streak that copying 30 GB or so with rsync (my earlier test) could sometimes complete without errors.
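
For reference, a repeated copy-and-compare over TCP can be sketched in Python roughly as below. This is not the socat setup described above, just an illustration of the same idea; the names loopback_pass and run_passes and the 16 MiB default payload are assumptions. Note that it goes over 127.0.0.1, so it never touches the Ethernet hardware, and its throughput will differ from a real network copy.

```python
import hashlib
import os
import socket
import threading


def loopback_pass(payload, chunk=65536):
    """Send payload over a TCP connection on 127.0.0.1; True if it arrives intact."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = hashlib.sha256()

    def receiver():
        # Accept one connection and hash everything until the sender closes it.
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(chunk)
                if not data:
                    break
                received.update(data)

    worker = threading.Thread(target=receiver)
    worker.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(payload)
    worker.join()
    server.close()
    return received.hexdigest() == hashlib.sha256(payload).hexdigest()


def run_passes(passes=5, size=16 * 1024 * 1024):
    """Run several passes with fresh random data; return how many were corrupted."""
    failures = 0
    for _ in range(passes):
        if not loopback_pass(os.urandom(size)):
            failures += 1
    return failures


if __name__ == "__main__":
    print("failed passes:", run_passes())
```

On a healthy machine every pass should match. Given how sporadic the failures are, the pass count and payload size would need to be raised a lot before a clean result means much.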

rlk · September 2, 2017, 9:50pm

This is proving a very difficult puzzle to track down. I believe it’s likely to be a hardware problem, but I have not found a good test to prove this out.

I tried something else: find /test -type f -print | cpio -o > test.cpio, then the same again into a second file, and compared the two files. This should not yield any differences, since the directory was static, but it showed the same kind of data corruption as I originally saw. However, it appears that I need to use a large directory to test this with – if I use a 500 MB directory, even 100 passes never shows a discrepancy. This supports the hypothesis that it’s an I/O problem: the 35 GB directory does not fit in physical memory, so it has to be reread from the SSD on every pass, while the 500 MB directory fits and can be served from the page cache.

Mind you, find /test -type f -print | sort | xargs cat > testfile typically shows even more discrepancies from run to run, but of the same (isolated) kind.
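
A per-file variant of the same repeated-read test can be sketched in Python as below. It is not what was run above, and the names hash_tree and unstable_files are made up for illustration. It hashes every regular file under a directory several times and reports the files whose SHA-256 changes between passes, which narrows a discrepancy down to a file rather than a whole archive. As with the cpio test, the tree presumably has to be larger than RAM for the reads to actually reach the disk.

```python
import hashlib
import os


def hash_tree(root, chunk=1 << 20):
    """Return {relative_path: sha256 hex digest} for every regular file under root."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files, like find -type f does.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(chunk), b""):
                    h.update(block)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests


def unstable_files(root, passes=3):
    """Hash the tree `passes` times; return paths whose digest differs from pass one."""
    baseline = hash_tree(root)
    changed = set()
    for _ in range(passes - 1):
        current = hash_tree(root)
        changed.update(
            p for p in baseline.keys() | current.keys()
            if baseline.get(p) != current.get(p)
        )
    return sorted(changed)


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 2:
        for path in unstable_files(sys.argv[1]):
            print("digest changed between passes:", path)
```

On a static directory this should print nothing; any path it does print was read differently on different passes.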

The seller is understandably not too eager to take it back without more solid proof that there’s a problem, which for all intents and purposes means a recognized test. If memtest86 showed a memory error I suspect he would; if Lenovo diagnostics showed an error, he definitely would. If I could demonstrate data corruption under Windows, I expect he would too.

Unfortunately, I haven’t found a good I/O test that I can use for this purpose.

rlk · September 4, 2017, 10:25pm

I finally managed to reproduce it under Windows, albeit with some difficulty. I noticed on Linux that the likelihood and amount of corruption varied with the data throughput and load (it didn’t happen with WiFi, which never gave me more than 20 MB/sec or so, vs. >100 MB/sec over gigabit Ethernet). Cygwin scp is much slower than scp on Linux, but by running a load generator (prime95) in the background, I got one try out of five to fail with the same pattern of corruption.