How to troubleshoot eMMC storage

I have openSUSE 13.2/KDE running quite ok on a Toshiba Satellite CL10-B-100, which has a 32GB eMMC device for internal storage.

Something is wrong with that eMMC device, and I can’t put my finger on it. From time to time it freezes, and it has already corrupted two filesystems (albeit possibly in combination with me doing other stupid things). It also throws odd errors in dmesg, even while everything seems to be working.

If this were a spinning hard disk, I’d be checking the SMART status with smartmontools, running extended self-tests and so on, maybe checking for bad blocks. Thing is: I have no idea how to troubleshoot an eMMC device.

Symptoms:

  • 99% of the time, everything is working absolutely fine.
  • From time to time, even when under no load, the laptop grinds almost to a halt on all processes requiring storage I/O (other processes work fine, i.e., anything that does not need to talk to the eMMC storage). Left to itself, the problem goes away after anywhere from a few minutes to a few hours.
  • Funny messages in dmesg.

Any ideas on how to troubleshoot/fix this?

# dmesg | grep mmc
    2.235696] mmc0: no vqmmc regulator found
    2.235702] mmc0: no vmmc regulator found
    2.241565] mmc0: SDHCI controller on ACPI [80860F14:01] using ADMA
    2.373161] mmc0: BKOPS_EN bit is not set
    2.377040] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
    2.388097] mmc0: switch to bus width 2 failed
    2.388402] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
    2.397629] mmc0: Got command interrupt 0x00010000 even though no command operation was in progress.
    2.399161] mmc0: new HS200 MMC card at address 0001
    2.416860] mmcblk0: mmc0:0001 032GE4 29.1 GiB 
    2.416976] mmcblk0boot0: mmc0:0001 032GE4 partition 1 4.00 MiB
    2.417044] mmcblk0boot1: mmc0:0001 032GE4 partition 2 4.00 MiB
    2.417107] mmcblk0rpmb: mmc0:0001 032GE4 partition 3 4.00 MiB
    2.420115]  mmcblk0: p1 p2 p3
    2.422313]  mmcblk0boot1: unknown partition table
    2.423278]  mmcblk0boot0: unknown partition table
   12.583891] EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: acl,user_xattr
   12.672471] FAT-fs (mmcblk0p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
  147.807022] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  147.817571] mmc0: switch to bus width 2 failed
  147.818838] mmc0: unexpected status 0x800800 after switch
  147.822876] mmc0: switch to bus width 1 failed
  147.822878] mmc0: error -22 during resume (card was removed?)
  159.994532] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  203.820333] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  285.121058] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  285.131595] mmc0: switch to bus width 2 failed
  285.132864] mmc0: unexpected status 0x800800 after switch
  285.136903] mmc0: switch to bus width 1 failed
  285.136905] mmc0: error -22 during resume (card was removed?)
  292.727496] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  332.184889] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  332.195430] mmc0: switch to bus width 2 failed
  332.196705] mmc0: unexpected status 0x800800 after switch
  332.200745] mmc0: switch to bus width 1 failed
  332.200748] mmc0: error -22 during resume (card was removed?)
  405.024371] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  721.373582] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  731.372962] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  736.655591] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  792.838216] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  886.781710] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  892.651466] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.

(Note: I fscked /dev/mmcblk0p1, which is /boot/efi, and this is fine now.)
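
To see at a glance which mmc messages dominate and when the failed resumes happen, the dmesg output can be tallied with a short script. This is only a sketch: the function name summarize_mmc_log and the regular expression are my own, and the sample lines are copied from the output above. The pattern accepts timestamps with or without the leading "[".

```python
import re
from collections import Counter

# Matches "[  147.807022] mmc0: message", with or without the leading "[".
LINE_RE = re.compile(r"^\[?\s*(\d+\.\d+)\]\s+(mmc\w*):\s+(.*\S)\s*$")


def summarize_mmc_log(text):
    """Count identical mmc messages and collect the times of resume errors."""
    counts = Counter()
    resume_failures = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # not an mmc line (e.g. EXT4-fs or FAT-fs messages)
        timestamp, _device, message = match.groups()
        counts[message] += 1
        if "during resume" in message:
            resume_failures.append(float(timestamp))
    return counts, resume_failures


# Three lines copied from the dmesg output above.
SAMPLE = """\
  147.807022] mmc0: Got command interrupt 0x00000001 even though no command operation was in progress.
  147.822878] mmc0: error -22 during resume (card was removed?)
  285.136905] mmc0: error -22 during resume (card was removed?)
"""

counts, failures = summarize_mmc_log(SAMPLE)
for message, n in counts.most_common():
    print(f"{n:4d}  {message}")
print("resume failures at (seconds since boot):", failures)
```

To run it on a live system, replace SAMPLE with the text of `dmesg | grep mmc` (for example read from sys.stdin). Comparing the resume-failure timestamps with the times of the freezes may show whether the two are related.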

Forgot something:

  • Occasionally (maybe once per ten boots) it has difficulty shutting down its storage.

Snippet from journalctl:

Jan 08 22:06:49 tosh15.he systemd[1]: session-1.scope stopping timed out. Killing.
Jan 08 22:06:49 tosh15.he systemd[1]: Job dev-disk-by\x2did-raid\x2dcr_mmc\x2d032GE4_0x60ac94bd\x2dpart3.device/stop timed out.
Jan 08 22:06:49 tosh15.he systemd[1]: Timed out stoppping /dev/disk/by-id/raid-cr_mmc-032GE4_0x60ac94bd-part3.
Jan 08 22:06:49 tosh15.he systemd[1]: Job dev-disk-by\x2did-dm\x2duuid\x2dCRYPT\x2dLUKS1\x2d3c478272afc348fa81a5b0f0ea0bb356\x2dcr_mmc\x2d032GE4_0x60
Jan 08 22:06:49 tosh15.he systemd[1]: Timed out stoppping /dev/disk/by-id/dm-uuid-CRYPT-LUKS1-3c478272afc348fa81a5b0f0ea0bb356-cr_mmc-032GE4_0x60ac94
Jan 08 22:06:49 tosh15.he systemd[1]: Job dev-disk-by\x2did-dm\x2dname\x2dcr_mmc\x2d032GE4_0x60ac94bd\x2dpart3.device/stop timed out.
Jan 08 22:06:49 tosh15.he systemd[1]: Timed out stoppping /dev/disk/by-id/dm-name-cr_mmc-032GE4_0x60ac94bd-part3.
Jan 08 22:06:49 tosh15.he systemd[1]: Job dev-dm\x2d0.device/stop timed out.
Jan 08 22:06:49 tosh15.he systemd[1]: Timed out stoppping /dev/dm-0.
Jan 08 22:06:49 tosh15.he systemd[1]: Job sys-devices-virtual-block-dm\x2d0.device/stop timed out.
Jan 08 22:06:49 tosh15.he systemd[1]: Timed out stoppping /sys/devices/virtual/block/dm-0.

For reference, here’s what’s on the device (notice LVM-on-LUKS):

# lsblk
NAME                               MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
mmcblk0rpmb                        179:24   0    4M  0 disk  
mmcblk0boot0                       179:8    0    4M  1 disk  
mmcblk0boot1                       179:16   0    4M  1 disk  
mmcblk0                            179:0    0 29.1G  0 disk  
├─mmcblk0p1                        179:1    0  195M  0 part  /boot/efi
├─mmcblk0p2                        179:2    0  502M  0 part  /boot
└─mmcblk0p3                        179:3    0 28.4G  0 part  
  └─cr_mmc-032GE4_0x60ac94bd-part3 254:0    0 28.4G  0 crypt 
    ├─vg1-lvroot                   254:1    0 15.4G  0 lvm   /
    ├─vg1-lvswap                   254:2    0    2G  0 lvm   [SWAP]
    └─vg1-lvhome                   254:3    0    8G  0 lvm   /home

Hi
Maybe some sort of indexing going on?

Keep a terminal session open with top, iostat or vmstat (or three terminals, one for each), and check them when things appear to freeze.
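
If the freezes are too rare to catch by watching a terminal, a small logger that samples /proc/diskstats can be left running instead. The following is a minimal sketch, assuming the device is called mmcblk0 as in the lsblk output; the function names and the 5-second interval are arbitrary choices of mine. It relies on the standard /proc/diskstats layout (device name in the third column, followed by the I/O counters).

```python
import time


def read_diskstats(device, text):
    """Return selected /proc/diskstats counters for one device, or None."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 14 and fields[2] == device:
            return {
                "reads": int(fields[3]),       # reads completed
                "writes": int(fields[7]),      # writes completed
                "in_flight": int(fields[11]),  # I/Os currently in progress
                "io_ms": int(fields[12]),      # milliseconds spent doing I/O
            }
    return None


def busy_percent(before, after, interval_s):
    """Share of the interval during which the device had I/O outstanding."""
    return 100.0 * (after["io_ms"] - before["io_ms"]) / (interval_s * 1000.0)


def monitor(device="mmcblk0", interval_s=5.0):
    """Print one line per interval until interrupted with Ctrl-C."""
    def snapshot():
        with open("/proc/diskstats") as handle:
            stats = read_diskstats(device, handle.read())
        if stats is None:
            raise SystemExit(f"{device} not found in /proc/diskstats")
        return stats

    before = snapshot()
    while True:
        time.sleep(interval_s)
        after = snapshot()
        print(time.strftime("%H:%M:%S"),
              f"busy={busy_percent(before, after, interval_s):5.1f}%",
              f"in_flight={after['in_flight']}",
              f"reads=+{after['reads'] - before['reads']}",
              f"writes=+{after['writes'] - before['writes']}")
        before = after
```

Call monitor() and redirect the output to a file on storage other than the eMMC, if possible. A stall would probably show up as in_flight staying above zero and busy near 100% while the read and write counters barely move, but that interpretation is an assumption to be checked against what you actually see.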

Have you tried running smartctl -a against the block device to see whether it produces any output?

I tried smartctl -a, but get “unable to detect device type”. I haven’t found out which processes cause the freezing yet (doesn’t happen often; maybe unrelated), but meanwhile I have again lost a file system on this device.
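
Since smartctl cannot talk to the device, about the only identification and health data available without extra tools is what the kernel exposes in sysfs. A hedged sketch for dumping it follows. The attribute list is my selection from the kernel's MMC sysfs attributes; life_time and pre_eol_info in particular only exist on newer kernels and newer eMMC parts, so they may well be absent here. Missing attributes are simply skipped.

```python
from pathlib import Path

# Attribute names from the kernel's MMC sysfs interface. Not every kernel or
# card exposes all of them, so files that are missing are skipped.
ATTRIBUTES = ["name", "type", "manfid", "oemid", "date", "fwrev", "hwrev",
              "life_time", "pre_eol_info"]


def read_mmc_attributes(device_dir="/sys/block/mmcblk0/device"):
    """Return {attribute: value} for the attributes that exist and are readable."""
    found = {}
    for attribute in ATTRIBUTES:
        path = Path(device_dir) / attribute
        try:
            found[attribute] = path.read_text().strip()
        except OSError:
            continue  # attribute not provided by this kernel or card
    return found


# Prints nothing if the directory or the attributes do not exist.
for key, value in read_mmc_attributes().items():
    print(f"{key:14s} {value}")
```

This does not replace a SMART self-test; it only shows what the card reports about itself, which can be useful when filing a bug report.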

Something is wrong either with the eMMC at the hardware level, or with the software/drivers that talk to it. After some more digging online I found this 2014 post on the Linux Kernel Mailing List, which appears to describe exactly my problem (it refers specifically to Toshiba eMMC devices) and includes a kernel patch: https://lkml.org/lkml/2014/2/3/12.

I know nothing about kernel development. Is there any way I can find out whether that patch is in the current openSUSE kernels? Do you think I could file a bug report with the information I have, and get someone to look at this?

Meanwhile, it would be nice to know for sure whether this is a hardware or driver issue, but I don’t know how to do this. I suppose I could install Windows and see whether that works; if yes, then it’s a driver problem. Any other ideas?

Update, in case anyone stumbles across this thread: Leap 42.1 works absolutely fine on this hardware (confirming that it was a software issue, i.e., this eMMC chip was not properly supported by the older kernel in 13.2).