All systemctl operations time out

Prune · March 8, 2016, 11:18pm

This started happening a while ago on my x86-64 installation and continues even with the latest updates as of yesterday. The system boots normally, with all services running. Some time later, however, it ends up in a state where virtually all systemctl operations issued as root fail with “Failed to execute operation: Connection timed out”.
Even listing units fails: “Failed to list units: Connection timed out”.
This creates the additional problem of not being able to install packages whose RPMs have scriptlet commands that restart services – I can only do so after a fresh reboot, before the problem state develops.
Moreover, rebooting and shutting down are impossible. All the standard commands, such as reboot, shutdown, systemctl reboot, and systemctl shutdown, fail with a timeout error, forcing me to pull the power on the machine.
This is a production server and I need urgent help resolving this issue. I have no idea what’s causing this, or how to debug it.
I found this from another 13.2 user: http://serverfault.com/questions/712928/systemctl-commands-timeout-when-ran-as-root
He was able to fix it by doing a kill -9 1, which in his case caused systemd to restart without --switched-root and --deserialize. When I tried that, however, systemd ignored the SIGKILL. I also don’t know what side effects I’d get from running it without those flags; my setup is RAID+LVM+LUKS.
systemctl daemon-reexec does execute without timing out (I think the only systemctl command other than status that I’ve found not to time out), but it doesn’t resolve any of the rest of the problems.
Please help!

nrickert · March 8, 2016, 11:40pm

Boot from live media. Maybe run a memory test and a SMART drive test.

tsu2 · March 9, 2016, 6:03pm

Check for available free space on your root and other partitions with the following command

df -h

If you have sufficient/ample free space, particularly on your root partition then display your available memory. I wrote the following wiki article for using the free tool (or top)
https://en.opensuse.org/User:Tsu2/free_tool
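
As a rough illustration of the free-space check, the sketch below filters df output and prints any mount point at or above a usage threshold. The function name full_mounts and the 80% threshold are made up for this example. It assumes the usual df column layout, with Use% in the fifth column and the mount point in the sixth.

```shell
# full_mounts LIMIT
# Reads df output on stdin and prints the mount point of every filesystem
# whose Use% is at or above LIMIT. (Hypothetical helper, for illustration.)
full_mounts() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            pct = $5
            sub(/%$/, "", pct)        # "85%" -> "85"
            if (pct + 0 >= limit + 0) print $6
        }'
}

# Example use (assumed threshold of 80%):
#   df -P | full_mounts 80
```

If this prints nothing, no filesystem is at or above the threshold, and free space is unlikely to be the cause.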

If you’re not able to analyze the above information, post the results for others to evaluate

TSU

nrickert · March 9, 2016, 6:47pm

I should perhaps explain my earlier comment.

I have seen programs hanging (doing nothing) because they were stuck somewhere in a kernel process waiting for the disk drive to respond. The symptoms given seem similar. I’m expecting a failing disk. But this kind of failure is a failure of the control circuitry rather than a bad sector.

That’s only a guess. But it’s why I suggest a memory check and a hard drive check. It could also be a failure in some other component.

Prune · March 10, 2016, 12:56am

tsu2:

Check for available free space on your root and other partitions with the following command
df -h
If you have sufficient/ample free space, particularly on your root partition then display your available memory. I wrote the following wiki article for using the free tool (or top)
User:Tsu2/free tool - openSUSE Wiki

If you’re not able to analyze the above information, post the results for others to evaluate

TSU

Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/Crypto_RAID10-root  1.9T  7.6G  1.9T   1% /
devtmpfs                        7.9G     0  7.9G   0% /dev
tmpfs                           7.9G     0  7.9G   0% /dev/shm
tmpfs                           7.9G   11M  7.9G   1% /run
tmpfs                           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/sda1                       524M  411M   76M  85% /boot

top - 15:48:20 up 1 day, 1:22, 1 user, load average: 0.03, 0.03, 0.05
Tasks: 250 total, 1 running, 249 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16406760 total, 737396 used, 15669364 free, 3000 buffers
KiB Swap: 8388604 total, 0 used, 8388604 free. 392200 cached Mem

  PID USER     PR NI  VIRT   RES  SHR S  %CPU  %MEM   TIME+ COMMAND
14752 root     20  0     0     0    0 S 0.332 0.000 0:01.11 kworker/2:2
    1 root     20  0 50888 17160 3500 S 0.000 0.105 0:10.50 systemd
    2 root     20  0     0     0    0 S 0.000 0.000 0:00.05 kthreadd
(all 0s below)

So no clue from those.

Well, entering systemctl by itself to list units shouldn’t be blocking on another program. It’s just systemctl itself. I don’t think this is likely to be a hardware issue. The problem is, I don’t know where to look for error messages. Where does systemd output its own failures? I don’t see anything in the journal (I’ll try deleting the journal though, just in case there’s a problem with the journal itself).
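
One way to narrow this down might be to separate "the command hangs" from "the command fails quickly". The sketch below wraps any command in the coreutils timeout tool and reports which of the two happened. The helper name probe and the 10-second limit are assumptions for illustration, not anything systemd provides.

```shell
# probe SECONDS COMMAND [ARGS...]
# Runs COMMAND with a time limit and prints one word:
#   ok       the command finished successfully in time
#   timeout  the command was still running after SECONDS (timeout exits 124)
#   failed   the command finished in time but returned an error
probe() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    case $status in
        0)   echo ok ;;
        124) echo timeout ;;
        *)   echo failed ;;
    esac
}

# Example use on the affected machine (not run here):
#   probe 10 systemctl list-units
#   probe 10 systemctl status
#   journalctl -b _PID=1    # messages logged by PID 1 (systemd) during this boot
```

A "timeout" result for list-units alongside "ok" for status would match the symptoms described above. The journalctl filter on _PID=1 is one place to look for messages from systemd itself.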

deano_ferrari · March 10, 2016, 2:33am

Prune:

Well, entering systemctl by itself to list units shouldn’t be blocking on another program. It’s just systemctl itself. I don’t think this is likely to be a hardware issue. The problem is, I don’t know where to look for error messages. Where does systemd output its own failures? I don’t see anything in the journal (I’ll try deleting the journal though, just in case there’s a problem with the journal itself).

That’s a good idea. I encountered something similar (but much less severe) soon after installing Leap within a VM. The systemd journal files had grown to an excessive size. Since removing them and adjusting /etc/systemd/journald.conf as follows, I haven’t had any further problems:

SystemMaxUse=100M
#SystemKeepFree=
SystemMaxFileSize=20M
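
To double-check which limits are actually set, something like the following could be used. It is only a sketch: journald_setting is a hypothetical helper, it reads a single file (/etc/systemd/journald.conf by default), and it ignores commented-out lines such as #SystemKeepFree=.

```shell
# journald_setting KEY [FILE]
# Prints the value of an uncommented KEY=value line in a journald.conf-style
# file. If the key appears more than once, the last occurrence is printed.
# Prints nothing if the key is missing or commented out.
journald_setting() {
    key=$1
    file=${2:-/etc/systemd/journald.conf}
    sed -n "s/^[[:space:]]*${key}=//p" "$file" | tail -n 1
}

# Example use:
#   journald_setting SystemMaxUse
#   journald_setting SystemMaxFileSize
#   journalctl --disk-usage    # current size of the journal files on disk
```

Comparing the journalctl --disk-usage figure with SystemMaxUse gives a quick indication of whether the journal has grown beyond what was intended.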

gryzli · June 18, 2016, 9:03pm

Check your session files in /run/systemd/system; maybe you are affected by the bug between systemd-logind and dbus.

I have already written an article about this problem here:
https://gryzli.info/2016/06/18/systemd-systemctl-list-unit-files-timeouts/

In my case, setting up the cron job I mention there seemed to do the job.

Cheers.

tsu2 · June 18, 2016, 9:55pm

As of now, I don’t know of anyone running openSUSE who is affected by the referenced dbus/logind bug.

After reading the GitHub issue and the article, I ran the following on 13.2 and Leap 42.1 (both fully updated).
Both are running systemd version 210.

ls /run/systemd/system/session*

Then, as root (via su), I ran the following, which I understand could create a new systemd session file and directories:

systemctl list-unit-files

Then I re-ran the first command to see whether any new session files had been created:

ls /run/systemd/system/session*

Result:
No new session files were created.
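
The before/after comparison above can be scripted roughly as follows. count_session_files is a hypothetical helper; it assumes the session entries live directly under /run/systemd/system and that find supports -maxdepth (GNU find does).

```shell
# count_session_files [DIR]
# Counts the entries named session* directly inside DIR
# (default: /run/systemd/system). Files and directories both count.
count_session_files() {
    dir=${1:-/run/systemd/system}
    find "$dir" -maxdepth 1 -name 'session*' | wc -l | tr -d ' '
}

# Example use, as root (not run here):
#   before=$(count_session_files)
#   systemctl list-unit-files >/dev/null
#   after=$(count_session_files)
#   echo "before=$before after=$after"
```

If the count keeps growing across repeated runs and never drops, that would point toward the leftover session files described in the linked article.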

Of course, this only tests the specific things that were mentioned; for example, I used the same user (root) for every step.
If I re-ran the tests with services running under other user logins, new session IDs would presumably be created, and it would be interesting to see whether a TTL or purge mechanism properly removes those session files once the session is no longer active.

I am also speculating that using “sudo” instead of becoming root with “su” could be relevant, since sudo in other distros is often set up differently from how it is configured in openSUSE. I would expect a “sudo” session to be time-limited, so a new session would be created for each command, or for each small batch of commands.

TSU

arvidjaar · June 19, 2016, 7:17am

In this case restarting systemd should not have any visible effect. There was another bug that sounds pretty close: “[227 regression] Boot occasionally fails with "Connection timed out"”, issue #1505 in systemd/systemd on GitHub.

@Prune what you see is just a symptom that may be caused by various bugs. Many bugs manifest themselves only under specific conditions, which is why others are not seeing it. You really need to open a bug report (on the openSUSE Bugzilla) and follow the directions there on how to troubleshoot it and provide more information.

You could of course try installing the current systemd version from Base:System, but even if the problem goes away, that does not really mean it is fixed.

gryzli · June 19, 2016, 8:44am

I’m not so sure that restarting systemd wouldn’t fix the problem if this were the case, because a server restart was exactly what fixed it in my case.

Of course, this was just a suggestion, because I was having almost identical issues with the session files.