I recently bought a refurb ThinkPad P70 (Xeon E3-1505Mv5, nVidia M4000M, UHD screen, 2x16GB RAM), and am having an odd data corruption issue with openSUSE Leap 42.3 (with either the stock 4.4.79+ kernel or a more recent 4.12). Specifically, anything that uses sockets – network or UNIX domain – gets sporadic data corruption. Sometimes it’s one or two flipped bits, sometimes it’s more than just a few bits in a byte. The bad bytes appear to cluster roughly, but there are no runs of consecutive bad bytes.
One pattern I did pick out is that the bottom 5 bits of every affected file offset (found by copying files with scp and comparing) are all ones; that is, each bad byte sits at an offset of 31 modulo 32.
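For anyone who wants to check for the same offset pattern, here is a minimal sketch in Python. It assumes a known-good copy and a suspect copy of the same file are both on local disk; the function names `diff_offsets` and `summarize` are made up for illustration and are not from any tool mentioned above.

```python
def diff_offsets(good_path, bad_path, chunk_size=1 << 20):
    """Yield (offset, good_byte, bad_byte) for every byte that differs.

    Only the common length of the two files is compared; a length
    mismatch is not reported by this sketch.
    """
    offset = 0
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        while True:
            a = good.read(chunk_size)
            b = bad.read(chunk_size)
            if not a or not b:
                break
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        yield offset + i, x, y
            offset += min(len(a), len(b))


def summarize(good_path, bad_path):
    """Return (offset, offset & 0x1F, xor_mask) for each differing byte.

    The middle value is the low 5 bits of the offset (31 means "all ones");
    xor_mask has a 1 in every bit position that was flipped.
    """
    return [(off, off & 0x1F, x ^ y)
            for off, x, y in diff_offsets(good_path, bad_path)]
```

Calling `summarize("original.iso", "copied.iso")` (hypothetical file names) and checking whether the middle value is always 31 would confirm or refute the pattern; the XOR mask shows how many bits changed in each bad byte.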
This happens with inbound, outbound, and local rsync alike (rsync is a good test since it has some additional checks): whether I rsync from a remote machine, rsync between two directories on the local machine (which uses a UNIX domain socket), or rsync to a remote machine, I get the same kinds of comparison or protocol failures. I’ve seen it with inbound ftp (downloading RPMs, where of course there are checksums), too.
I have not seen any errors with simple file copy using cp -r or tar through a pipe.
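The socket-versus-pipe difference could be tested without rsync in the picture at all. The following is only a sketch: it assumes the corruption would also show up on an in-process `socket.socketpair()` (which is AF_UNIX by default on Linux), and that has not been confirmed on this machine. The helper names are hypothetical.

```python
import hashlib
import os
import socket
import threading


def _digest_stream(read_chunk):
    """Hash everything read_chunk() returns until it returns b''."""
    h = hashlib.sha256()
    while True:
        chunk = read_chunk()
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()


def roundtrip_socket(payload):
    """Push payload through a UNIX-domain socketpair; return SHA-256 of what arrived."""
    a, b = socket.socketpair()

    def writer():
        a.sendall(payload)
        a.close()  # closing the write side signals EOF to the reader

    t = threading.Thread(target=writer)
    t.start()
    digest = _digest_stream(lambda: b.recv(65536))
    t.join()
    b.close()
    return digest


def roundtrip_pipe(payload):
    """Push payload through an anonymous pipe; return SHA-256 of what arrived."""
    r, w = os.pipe()

    def writer():
        with os.fdopen(w, "wb") as f:
            f.write(payload)

    t = threading.Thread(target=writer)
    t.start()
    with os.fdopen(r, "rb") as f:
        digest = _digest_stream(lambda: f.read(65536))
    t.join()
    return digest
```

Running both functions in a loop over a large `os.urandom()` payload and comparing each result with `hashlib.sha256(payload).hexdigest()` would show whether only the socket path ever mismatches, which is what the cp and tar observations suggest.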
This does not happen if I boot Knoppix 7.7.1 (based on kernel 4.7.9).
I have run a full pass of memtest86 with no errors. I have tried using just one DIMM at a time and changing which slot I use, and using the rear panel slots vs. the under-keyboard slots; no change in the symptoms. The BIOS is up to date, presumably with the microcode fix.
I am presently running diagnostics; that will take a while longer. I am also going to try a vanilla kernel (provided by openSUSE RPMs) to see if that makes a difference.
What I have to decide is whether to return it (and most likely eat a 15% restocking fee), get the mobo replaced under warranty (I’d have to ship it to Lenovo, presumably at my expense), or find a solution to this on my own. I didn’t see any indication of the problem under Windows, which I don’t plan to use, though I kept the SSD that has it installed. I have not been able to find anything on the net about a problem like this, either. It’s an odd one; the symptoms look generally memory-ish, but it happened with two different DIMMs in different slots, and it’s only happening with sockets. It’s also apparently happening above the transport layer, since TCP checksums aren’t catching it.
Anyone have any thoughts here?