We have hit a scenario with openSUSE 12.1 where boot success or failure depends on the number of tapes in a SAS-connected tape library. I have found the cause: to cut a long story short the problem is with the default load behaviour of the scsi-changer kernel module ‘ch’, and I fixed it by changing the module’s options in /etc/modprobe.d/99-local.conf. It looks to me that the accelerated booting schedule of systemd has allowed this problem to surface.
I think that the default behaviour of this module is harmful, and should be changed in the module-init-tools package. Do people think that it would be useful to discuss this on this forum first, or should I submit a bug report straight away? It isn’t a conventional kind of software bug, because it needs a particular type of hardware setup to reproduce.
Thanks for the suggestion (this is the first system that we have set up using systemd and I’m not that familiar with it). It makes no difference though. The ‘ch’ module takes 110s to load, unless you pass it ‘init=0’. The long load time for ‘ch’ causes udev to timeout, and in the middle of all this systemd jobs that initialise filesystems on the internal disks fail. I admit that I don’t know how this cross-talk happens: it may be relevant that the tape library and the internal disks are both connected using SAS HBA’s from LSI, so both depend on LSI’s MPT Fusion SAS driver. We are using the one that is shipped with openSUSE 12.1.
While working on this, I did increase the sysconfig MDADM_DEVICE_TIMEOUT parameter to 240s from its default of 60s: this is used as the ‘udev settle’ timeout parameter by md.service (/etc/init.d/boot.md). Before I did this, md.service complained that the udev event queue wasn’t empty.
Some kernel output:
69.317466] udevd[578]: timeout: killing '/sbin/modprobe -bv scsi:t-0x01' [703]
69.329461] udevd[593]: timeout: killing '/sbin/modprobe -bv scsi:t-0x08' [704]
69.340787] udevd[610]: timeout: killing '/sbin/modprobe -bv scsi:t-0x05' [752]
70.316990] udevd[578]: timeout: killing '/sbin/modprobe -bv scsi:t-0x01' [703]
... lots more of this...
96.067580] systemd[1]: Job dev-disk-by\x2dlabel-amanda_regular.device/start timed out.
96.071342] systemd[1]: Job remote-fs-pre.target/start failed with result 'dependency'.
96.071348] systemd[1]: Job local-fs.target/start failed with result 'dependency'.
96.071351] systemd[1]: Triggering OnFailure= dependencies of local-fs.target.
96.075190] systemd[1]: Job mnt-data-irregular.mount/start failed with result 'dependency'.
96.079087] systemd[1]: Job fsck@dev-disk-by\x2dlabel-data_irregular.service/start failed with result 'dependency'.
96.083068] systemd[1]: Job mnt-data-regular.mount/start failed with result 'dependency'.
96.087186] systemd[1]: Job fsck@dev-disk-by\x2dlabel-data_regular.service/start failed with result 'dependency'.
96.087524] systemd[1]: Job mnt-amanda_holding_regular.mount/start failed with result 'dependency'.
96.087528] systemd[1]: Job dev-disk-by\x2dlabel-amanda_regular.device/start failed with result 'timeout'.
...
120.563117] udevd[593]: timeout: killing '/sbin/modprobe -bv scsi:t-0x08' [704]
120.566366] udevd[578]: timeout: killing '/sbin/modprobe -bv scsi:t-0x01' [703]
120.569573] udevd[610]: timeout: killing '/sbin/modprobe -bv scsi:t-0x05' [752]
121.097476] ch0: ... finished
121.097481] ch 5:0:0:1: Attached scsi changer ch0
121.097980] udevd[593]: '/sbin/modprobe -bv scsi:t-0x08' [704] terminated by signal 9 (Killed)
121.101644] udevd[578]: '/sbin/modprobe -bv scsi:t-0x01' [703] terminated by signal 9 (Killed)
121.105417] udevd[610]: '/sbin/modprobe -bv scsi:t-0x05' [752] terminated by signal 9 (Killed)
121.430112] systemd[1]: Startup finished in 5s 709ms 549us (kernel) + 1min 55s 698ms 689us (userspace) = 2min 1s 408ms 238us.
It is the systemd jobs in the snippet above that seem to be the problem. scsi:t-0x08/05/01 are the ch, sr_mod and st modules respectively.
At the end of the day, I can’t help feeling that having kernel modules that take minutes rather than seconds to load is asking for trouble. For starters, I don’t think that udev and module_init_tools are written with that kind of behaviour in mind (but please feel free to correct me if I am wrong or being naive).
On 04/27/2012 10:26 AM, pakeller wrote:
> At the end of the day, I can’t help feeling that having kernel modules
> that take minutes rather than seconds to load is asking for trouble. For
> starters, I don’t think that udev and module_init_tools are written with
> that kind of behaviour in mind (but please feel free to correct me if I
> am wrong or being naive).
Any module that takes longer than 60 seconds to load, including the time needed
to load firmware, is just plain wrong, and the newest udev versions are
enforcing that behavior. All the wireless drivers that were using synchronous
firmware loads needed to be rewritten to use the asynchronous versions. Is
systemd really exposing the problem, or was it a new version of udev?
So if I go to the service file /lib/systemd/system/udev-settle.service, edit as root and add the following line (in bold), restart and see what you get:
# This service is usually not enabled by default. If enabled, it
# acts as a barrier for basic.target -- so all later services will
# wait for udev completely finishing its coldplug run.
#
# If needed, to work around broken or non-hotplug-aware services,
# it might be enabled unconditionally, or pulled-in on-demand by
# the services that assume a fully populated /dev at startup. It
# should not be used or pulled-in ever on systems without such
# legacy services running.
[Unit]
Description=udev Wait for Complete Device Initialization
DefaultDependencies=no
Wants=udev.service
After=udev-trigger.service
Before=basic.target
[Service]
Type=oneshot
RemainAfterExit=yes
**TimeoutSec=0**
ExecStart=/sbin/udevadm settle
[Install]
WantedBy=basic.target
This disables the Time Out logic as I understand it. So it should not Time Out at all, its worth a try …
This is how I got systemd to wait until a new version of the nVIDIA driver and VBox drivers to compile for a new kernel update with dkms.
Thanks for the suggestion, but it didn’t help. It seems to me that udev-settle.service is running too late in the boot process to solve this particular problem. ‘dmesg’ output from booting still contains this:
94.289347] udevd[532]: timeout: killing '/sbin/modprobe -bv scsi:t-0x05' [776]
95.197712] udevd[566]: timeout: killing '/sbin/modprobe -bv scsi:t-0x08' [675]
95.200647] udevd[564]: timeout: killing '/sbin/modprobe -bv scsi:t-0x01' [676]
95.288785] udevd[532]: timeout: killing '/sbin/modprobe -bv scsi:t-0x05' [776]
95.917847] systemd[1]: Job dev-disk-by\x2dlabel-amanda_regular.device/start timed out.
95.921286] systemd[1]: Job remote-fs-pre.target/start failed with result 'dependency'.
95.921291] systemd[1]: Job local-fs.target/start failed with result 'dependency'.
95.921295] systemd[1]: Triggering OnFailure= dependencies of local-fs.target.
95.924915] systemd[1]: Job mnt-data-irregular.mount/start failed with result 'dependency'.
95.928548] systemd[1]: Job fsck@dev-disk-by\x2dlabel-data_irregular.service/start failed with result 'dependency'.
95.932272] systemd[1]: Job mnt-data-regular.mount/start failed with result 'dependency'.
95.936216] systemd[1]: Job fsck@dev-disk-by\x2dlabel-data_regular.service/start failed with result 'dependency'.
95.936574] systemd[1]: Job mnt-amanda_holding_regular.mount/start failed with result 'dependency'.
95.936579] systemd[1]: Job dev-disk-by\x2dlabel-amanda_regular.device/start failed with result 'timeout'.
i.e. the failure to find local and mount lvm filesystems is happening well before the timeout specified in /lib/systemd/system/udev-settle.service (180s before your suggested edit).
That’s what I had thought, but without your specific knowledge about this. You may well be right that it is a newer version of udev rather than systemd that has shown this up.
I’ll wait for a bit to see if anyone else makes any suggestions on this forum, and then post a report on opensuse’s bugzilla. As you imply, the correct fix would be to rewrite the ‘ch’ driver to make its default initialisation asynchronous, but my gut feeling is that it would be more likely to happen if it has been discussed on the bug-reporting board of a major distro like openSUSE first. My fix (putting ‘options ch init=0’ in /etc/modprobe.d/99-local.conf) is a distro-specific workaround of course, rather than a true solution.
For reference, the tape library is a Quantum Superloader3 with an LTO-5 drive. The source of the driver is at scsi-changer but doesn’t show much sign of activity since 2006. A recent version in the kernel source (version 3.4-rc1 is the same as in openSUSE 12.1 and 11.4, so no recent changes there.