Docker daemon hangs on stop and becomes a defunct process; machine requires hard reset

Hi everyone,

I’m using a fully updated openSUSE Tumbleweed (kernel 4.11.8-1), and I recently installed Docker (version 17.04.0-ce) the standard way with “zypper in docker”. I can start the Docker daemon with “sudo service docker start”, and from my limited testing Docker seems to work fine. I have not actually run any containers; I have only installed Docker and nothing else (I have never done a “docker pull” or created or downloaded a Dockerfile). However, when I stop the Docker daemon (either manually with “sudo service docker stop” or by rebooting the machine), the daemon hangs and the machine hangs as well. If I run “sudo service docker stop” and then hit Ctrl-C, the dockerd process becomes defunct, and the machine is usable but very slow and hangs on many commands. If I try to reboot the machine (before or after issuing “sudo service docker stop”), the machine hangs on shutdown and never reboots.

The only way around this hang is to kill the dockerd process with “kill -9 <dockerd-pid>” (where <dockerd-pid> is the PID of dockerd, e.g. from “pidof dockerd”) before trying to stop the service with “sudo service docker stop”. Then the machine reboots normally. If I run “sudo service docker stop” first, I have found no way to reboot the machine short of a hard power cycle.

Just to be clear, this is a plain vanilla setup without any modification, freshly installed from the official repositories for openSUSE Tumbleweed. I don’t have any proxies or anything.

Expected behavior

The Docker daemon stops when “sudo service docker stop” is used.

Actual behavior

The machine freezes up and must be hard power cycled.

Steps to reproduce the behavior

Install Docker on openSUSE Tumbleweed from the official repositories. Start the Docker daemon with “sudo service docker start” and then attempt to stop it with “sudo service docker stop”.

**Output of `docker version`:**

Client:
 Version: 17.04.0-ce
 API version: 1.28
 Go version: go1.7.5
 Git commit: 78d1802
 Built: Tue Jul 4 16:31:44 2017
 OS/Arch: linux/amd64

Server:
 Version: 17.04.0-ce
 API version: 1.28 (minimum version 1.12)
 Go version: go1.7.5
 Git commit: 78d1802
 Built: Tue Jul 4 16:31:44 2017
 OS/Arch: linux/amd64
 Experimental: false

**Output of `docker info`:**

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 17.04.0-ce
Storage Driver: btrfs
 Build Version: Btrfs v4.10.2+20170406
 Library Version: 102
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: oci runc
Default Runtime: runc
Init Binary:
containerd version: (expected: 422e31ce907fd9c3833a38d7b8fdd023e5a76e73)
runc version: N/A (expected: 9c2d8d184e5da67c95d601382adf14862e4f2228)
init version: N/A (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.11.8-1-default
Operating System: openSUSE Tumbleweed
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.69GiB
Name: cyclon
ID: B2IN:EMBW:HQNI:3GK4:TFUV:L33J:ARBS:JNRR:DYVB:6DQJ:GRKC:Z3B3
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 14
 Goroutines: 20
 System Time: 2017-07-07T10:36:30.49992113-04:00
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

I also have log files that I can provide, but I can't figure out a way to attach them to this post. 

Has anyone else experienced this? Is there something specific to my setup that I can look at to diagnose this? I filed a bug report on GitHub, but it doesn't seem that those get a quick response. I'm happy to file a bug report with openSUSE, but I wanted to check with the forums first.


Please let me know if there is anything else necessary to diagnose and fix the problem. Thanks!

I ended up opening a bug report on the openSUSE Bugzilla as well (https://bugzilla.suse.com/show_bug.cgi?id=1047793), and it turns out that this is similar to another bug previously reported there (https://bugzilla.suse.com/show_bug.cgi?id=1047152). The issue is that the btrfs storage driver uses some element of the btrfs filesystem that hangs if the filesystem is first mounted as ro (read-only) and then later mounted as rw (read-write). This appears to be what’s going on here, and it seems to be a bug in btrfs, not Docker. So I’m going to consider this resolved, but if anyone stumbles on this, be warned that it exists up to kernel 4.11.8 at least.
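For anyone wondering whether their machine matches these conditions, here is a rough sketch of how one might check. It assumes `docker info` prints a `Storage Driver:` line like the output posted earlier in this thread, and that mount options are read from `/proc/mounts`. The function names `storage_driver` and `mount_mode` are made up for illustration and are not part of Docker or the btrfs tools.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Print the storage driver from `docker info` text supplied on stdin.
storage_driver() {
    sed -n 's/^Storage Driver: *//p'
}

# Print "ro" or "rw" for a mount point, reading /proc/mounts-style lines
# on stdin (fields: device, mount point, fs type, options, dump, pass).
# If the mount point appears more than once, the last entry wins.
mount_mode() {
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro" || opts[i] == "rw") mode = opts[i]
        }
        END { if (mode != "") print mode }
    '
}
```

Usage would look like `docker info | storage_driver` and `mount_mode / < /proc/mounts`. Note that this only shows the current state; it cannot tell you whether the filesystem was mounted ro earlier and switched to rw later.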

As a side note, the openSUSE developers are fantastic! The issue was resolved within hours of posting the bug! There is a patch that still needs to make its way into the kernel, but I understood the issue within hours.

I saw the bug report and the discussion there, though frankly I did not understand the circumstances necessary to trigger the bug. I have been using docker for years now and I switched all my computers to btrfs even when that was labeled experimental. Never had problems like this. I would be interested in what exactly triggers the bug in your setup. What does “first mounted ro and later mounted rw” mean in terms of setup? Does that refer to initrd mounting root ro or some other ro/rw switch?

You should try to stop your service with

systemctl stop docker

At any time, and especially after a failure, you can invoke “status”, which should display a number of items including a relevant snippet from the system log

systemctl status docker

Don’t run those other commands like “service foo stop”, which may or may not work and, even if they do work, may not do exactly what you expect. Nowadays the systemd unit file invoked by the “systemctl” command acts as a “master script” that does everything that is needed.
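If it helps, the stop-then-check sequence can be wrapped in a small function. This is only a sketch: it assumes the unit is named `docker.service`, that it is run as root, and the name `stop_and_report` is made up for illustration. The `journalctl` call is an addition of mine to show more of the log than “status” does.

```shell
#!/bin/sh
# Hypothetical wrapper: stop a unit, then show its state and recent log lines.
# Assumes the unit name docker.service unless another is passed as $1.
stop_and_report() {
    unit="${1:-docker.service}"
    rc=0
    # Remember the exit status of the stop so the caller can act on it.
    systemctl stop "$unit" || rc=$?
    # Show the unit state plus a short log excerpt without opening a pager.
    systemctl status "$unit" --no-pager
    # Show the last 50 journal lines for this unit from the current boot.
    journalctl -u "$unit" -b --no-pager -n 50
    return "$rc"
}
```

Keep in mind that if the daemon hangs as described at the top of this thread, the `systemctl stop` call itself will block, so the status and log output will only appear once it returns or times out.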

If you’re brand new to Docker, you can follow the guides I wrote, which are simple openSUSE adaptations of old, official Docker tutorials. You’ll get an introduction to the basic architecture, to downloading, managing and creating your own custom images, and to some simple commands. Although you’ve already installed Docker, you may also want to skim my instructions for installing; they contain links to material describing Docker networking and more.

https://en.opensuse.org/User:Tsu2#Docker

Regarding stopping Docker, of course you should make sure that all your containers have stopped first (how to list your containers is covered in my tutorials). Otherwise, like any other running process, they can take a very long time to respond to a HALT or KILL, which will make your system appear to hang when you attempt to stop the higher-level process.
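As a sketch of that order of operations (containers first, then the daemon), something like the following could be used. It assumes the unit is named `docker.service` and that it is run as root; the function name `stop_containers_then_daemon` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: stop any running containers, then stop the daemon.
stop_containers_then_daemon() {
    # `docker ps -q` prints only the IDs of running containers, one per line.
    running="$(docker ps -q)"
    if [ -n "$running" ]; then
        # Word splitting is intended here: one container ID per argument.
        docker stop $running
    fi
    systemctl stop docker.service
}
```

With no running containers (as in the original report), the `docker stop` step is skipped and only the daemon is stopped.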

As always, if you choose to run a multi-OS or multi-container system, I strongly recommend placing the highest emphasis on reliability, which means that nowadays Leap should be preferred over TW.

TSU

BTW -
This might be very risky advice, but if you run into a known problem that the current version of TW still has not fixed, you can try upgrading to a nightly snapshot. ISO images can be found in the TW ISO repo.

http://download.opensuse.org/tumbleweed/iso/

It should be obvious that upgrading to nightly snapshots is far riskier than using regular TW images, so know how to recover and roll back as necessary (maybe have a repair disk handy?)

TSU

I second that. But then, playing with new docker is so much more fun. Especially since the production machines I tend to are RHEL7 with docker 1.12, no docker-compose, devicemapper on loop storage :eek:

Back on topic: I was bitten several times by Docker upgrades that invalidated stuff in /var/lib/docker, and I had the odd “no space left on device” problem once. No biggie, just delete everything and start from scratch. The Dockerfiles are in SCM anyway. But yes, creating and testing is fine on TW, but prod should run a stable or even commercial version.