I'm running Tumbleweed on a Ryzen/NVIDIA laptop with two internal SSDs. Essentially everything works fine (hoping that the NVIDIA garbage-screen issues after wakeup are consistently solved), but…
All relevant partitions are mounted via fstab, among them the only ext4 data partition on the 2nd disk, which is mounted at /local.
The system can sleep and wake up, but apparently shortly after wakeup the /local partition silently disappears. I can manually
mount /local with no problem. There is no fsck error and I couldn’t find any hint in dmesg or journalctl that anything happens.
I searched for strange cron jobs (zero) and also stopped the autofs service - with no effect. After a normal boot /local is mounted, so the issue is related to sleep/wakeup.
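For reference, the fstab entry for /local is essentially this (the mount options here are an assumption):

/dev/nvme1n1p1  /local  ext4  defaults  0  2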
Another strange thing happened when checking the properties of the KDE mount tool (which works fine for USB drives and
card readers). Selecting the “Camera” tab, a popup asked for the root password to mount one of the internal EFI partitions.
Maybe this is a hint that a KDE component is responsible, but I can’t find a good reason.
Any suggestions as to what the root cause might be, and a possible fix? Is there a general way to exclude all built-in disks from any automount activity?
A quick look at auto.master and related files and the man page didn't give me a clue.
Likely unrelated, but mentioned for completeness: I created a sufficiently large swap partition to support hibernation. The image is written upon hibernate, but there is no resume - the machine always does a fresh boot instead. The resume= kernel parameter is given, and I also explicitly added the "resume" module in the dracut configuration, with no luck. I'm also stuck on how to debug this.
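Concretely, the setup is along these lines (the swap device path here is an example, not the actual one):

# /etc/dracut.conf.d/99-resume.conf   (file name is an example)
add_dracutmodules+=" resume "

# kernel command line, e.g. in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#   resume=/dev/nvme0n1p3            <- the swap partition

# rebuild the initrd and the grub config afterwards:
dracut -f
grub2-mkconfig -o /boot/grub2/grub.cfg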
Thanks for the tip and the links - they somehow solved two other problems!
Audit was already installed and running, but the suggested rules didn’t log anything. That pointed me to the other problems:
On my old laptop I had tried to use auditctl to track down some network issues, which I could never solve because audit didn't log anything useful after configuring the rules. Now I realized that the kernel audit messages require the boot parameter audit=1. I set that in yast2/boot, then with grub-customizer, and then by booting into my parallel Ubuntu and using grub-customizer there to update it as well (which changed the GRUB_CMDLINE_LINUX_DEFAULT string) - still with no success (checked with cat /proc/cmdline and sysctl -a). At that point I realized that my resume= string had never actually been used either! After editing the particular Tumbleweed boot entry in Ubuntu (which is the one that actually gets executed) I can finally hibernate and resume in Tumbleweed. The resume and audit settings now appear in /proc/cmdline (but not in sysctl -a).
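A quick way to verify that such parameters actually reached the kernel (values shown are examples):

cat /proc/cmdline
# ... audit=1 resume=/dev/nvme0n1p3 ...
auditctl -s    # reports "enabled 1" once auditing is active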
The biggest surprise is that I apparently got back a feature that I lost 10+ years ago: the ability to hibernate and switch the OS (to be used with care, I know). When I turn the laptop on, I get the grub menu. Only after selecting Tumbleweed does the system perform the resume operation. Perfect! I had read in many places that this would be impossible nowadays.
The /local partition is lost after resume from hibernate as well. And since visiting the camera setting of the KDE Disks & Devices thing, I get a password request to mount other EFI partitions after every reboot… the message comes from PolicyKit1. Maybe that could be a useful hint.
Back to the problem of the lost /local mount: I followed the article to set up mount/umount logging. I do get some messages, but so far they are not very helpful to me:
cat /var/log/audit/audit.log | grep nvme
type=SERVICE_STOP msg=audit(1622965679.780:1473): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-fsck@dev-nvme1n1p1 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

This is at least suspicious because it is the only partition mentioned, namely the one that gets lost. But what does it tell me?
Searching for "mount" does not give me a clue:
cat /var/log/audit/audit.log | grep mount
type=CONFIG_CHANGE msg=audit(1623312557.163:4554): auid=4294967295 ses=4294967295 subj==unconfined op=add_rule key="mount_umount" list=4 res=1
type=CONFIG_CHANGE msg=audit(1623313714.103:71): auid=4294967295 ses=4294967295 subj==unconfined op=add_rule key="mount_umount" list=4 res=1
type=CONFIG_CHANGE msg=audit(1623314162.180:71): auid=4294967295 ses=4294967295 subj==unconfined op=add_rule key="mount_umount" list=4 res=1
type=CONFIG_CHANGE msg=audit(1623310177.815:71): auid=4294967295 ses=4294967295 subj==unconfined op=add_rule key="mount_umount" list=4 res=1
type=CONFIG_CHANGE msg=audit(1623312005.851:186): auid=4294967295 ses=4294967295 subj==unconfined op=add_rule key="mount_umount" list=4 res=1
Some more context around the STOP message for the partition name:
type=SERVICE_STOP msg=audit(1622965679.780:1471): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=klog comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
type=CRYPTO_KEY_USER msg=audit(1622965679.780:1472): pid=1595 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='op=destroy kind=server fp=SHA256:<key> direction=? spid=1595 suid=0 exe="/usr/sbin/sshd" hostname=? addr=? terminal=? res=success'
type=SERVICE_STOP msg=audit(1622965679.780:1473): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-fsck@dev-nvme1n1p1 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
type=SERVICE_STOP msg=audit(1622965679.788:1474): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=irqbalance comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
type=SERVICE_STOP msg=audit(1622965679.792:1475): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=smartd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
The only mention of /local seems to be related to restoring the working directory of an open shell (?):

type=USER_CMD msg=audit(1622588514.280:907): pid=23105 uid=1000 auid=1000 ses=4 subj==unconfined msg='cwd="/local/polaris1/etc" cmd=6370202D7020736861646F77202F6574632F terminal=pts/8 res=success'

The "success" is strange because /local is currently empty and cwd can't be the specified place.
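As an aside, the hex-encoded cmd field is plain ASCII and can be decoded by hand (ausearch -i interprets such fields too):

echo 6370202D7020736861646F77202F6574632F | xxd -r -p
# cp -p shadow /etc/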
There is no problem with the partition:
fsck /dev/nvme1n1p1
fsck from util-linux 2.36.2
e2fsck 1.46.2 (28-Feb-2021)
/dev/nvme1n1p1: clean, 8860982/109387776 files, 337456099/437549568 blocks
I can do
mount /local
but get no new message in the audit.log containing "mount" or "local".
Quite nice that you could solve two other problems using auditd!
No, I am not an auditd expert and cannot get more information out of what you shared. My first question would have been whether you see anything in auditd when you manually mount/unmount /local, but you already tried that.
As mounting/unmounting /local by hand does not leave any trace, that should be solved first.
What did you put into /etc/audit/rules.d/audit.rules? It would be good to post it in this topic.
If you followed the article, it is likely not (completely) correct anymore and needs to be updated.
You can get more information using "man audit.rules", but looking at the list of syscalls and searching for umount I already see more syscalls than the mount and umount2 mentioned in the article.
Well, it helps to read what is written in /etc/audit/rules.d/audit.rules:
After removing one of the default lines I do get mount/umount messages. But frankly, I now know that something gets unmounted, I just can't figure out what starts the process.
Most of the other documented syscalls are silently ignored or never get triggered; I kept them as comments. The remaining content (which does produce messages) is this:
## delete all rules
-D
## This suppresses syscall auditing for all tasks started with this rule in effect. Remove it if you need syscall auditing.
#-a task,never
## monitor mounting and unmounting
-a always,exit -F arch=b64 -S mount -k mount_umount_0
-a always,exit -F arch=b64 -S umount2 -k mount_umount_1
#-a always,exit -F arch=b64 -S fsmount -k mount_umount_2
#-a always,exit -F arch=b64 -S move_mount -k mount_umount_3
#-a always,exit -F arch=b64 -S oldumount -k mount_umount_4
#-a always,exit -F arch=b64 -S umount -k mount_umount_5
-w /usr/bin/mount -p x -k exec_mount
-w /usr/bin/umount -p x -k exec_umount
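For reference, the rules can be (re)loaded and then queried by key like this:

augenrules --load                # compile /etc/audit/rules.d/*.rules and activate them
auditctl -l                      # list the rules now in effect
ausearch -k mount_umount_1 -i    # interpreted events from the umount2 rule
ausearch -k exec_mount -i        # executions of /usr/bin/mount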
From the mount command I can check the ppid with grep, and the process is bash - the shell where I manually executed mount after wakeup.
For the umount of /local, the ppid 19048 does not exist anymore. It also unmounted all NFS drives from fstab (even though they had not been mounted before). Using "ausearch -k mount_umount_1 -i" the cryptic proctitle string is translated into "umount -l -f /local/". That's just the command line that was used, but it contains no hint about the parent process.
From one of the password popups (for mounting other internal EFI partitions) I stumbled across polkit and udisks2. I stopped that service with "systemctl stop udisks2.service", but this has no effect on the automatic unmounting of /local.
I could repeatedly verify that the umount happens not as part of the suspend operation but after waking up again. E.g. when I prepare a "df" command in a shell and execute it immediately after wakeup, /local is still there. If I repeat it 1 s later, it is gone.
Since the umount comes within a chain of NFS umounts, I realized from the timestamps of the messages that my /local problem is a misinterpretation of the one fstab line that mounts machine2:/local to /nfs/machine2/local. Apparently machine2 is taken as either localhost or machine1, whatever. But I don't understand how this happens. The name machine2 is not associated with any IP of machine1, and both "echo $HOST" and "cat /etc/hostname" return machine1.
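The fstab line in question is essentially this (the mount options are an assumption):

machine2:/local  /nfs/machine2/local  nfs  defaults  0  0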
The good side… now I have a workaround: I can comment out the NFS mount to the sister laptop (machine2). But the root cause is still a mystery.
There is one interesting thing about your posts. You report some symptoms you find interesting, but you fail to provide hard evidence. Mounting is governed by systemd, which generates units, e.g. in directory /run/systemd/generator/.
3400G:~ # ll /run/systemd/generator/
total 56
-rw-r--r-- 1 root root 320 Jun 20 13:19 -.mount
-rw-r--r-- 1 root root 282 Jun 20 13:19 GARMIN.mount
-rw-r--r-- 1 root root 337 Jun 20 13:19 HDD.mount
-rw-r--r-- 1 root root 338 Jun 20 13:19 WD25.mount
-rw-r--r-- 1 root root 359 Jun 20 13:19 \x2esnapshots.mount
-rw-r--r-- 1 root root 264 Jun 20 13:19 boot-efi.mount
-rw-r--r-- 1 root root 375 Jun 20 13:19 boot-grub2-i386\x2dpc.mount
-rw-r--r-- 1 root root 381 Jun 20 13:19 boot-grub2-x86_64\x2defi.mount
-rw-r--r-- 1 root root 323 Jun 20 13:19 home.mount
drwxr-xr-x 2 root root 260 Jun 20 13:19 local-fs.target.requires
drwxr-xr-x 2 root root 60 Jun 20 13:19 local-fs.target.wants
-rw-r--r-- 1 root root 345 Jun 20 13:19 opt.mount
-rw-r--r-- 1 root root 347 Jun 20 13:19 root.mount
-rw-r--r-- 1 root root 345 Jun 20 13:19 srv.mount
-rw-r--r-- 1 root root 357 Jun 20 13:19 usr-local.mount
-rw-r--r-- 1 root root 345 Jun 20 13:19 var.mount
3400G:~ #
So I am clueless what you are doing in /etc/fstab and what systemd does on your machine, presumably in unit var.mount.
3400G:~ # journalctl -b -u var.mount
-- Logs begin at Sat 2021-06-19 04:28:11 CEST, end at Sun 2021-06-20 20:01:06 CEST. --
Jun 20 13:19:15 3400G systemd[1]: Mounted /var.
3400G:~ #
In the past unmounting typically occurred due to inadvertent reloading of systemd. You may check for messages:
3400G:~ # journalctl -b --grep reload
-- Logs begin at Sat 2021-06-19 04:28:11 CEST, end at Sun 2021-06-20 20:01:06 CEST. --
Jun 20 13:19:15 3400G systemd[1]: Starting Reload Configuration from the Real Root...
Jun 20 13:19:15 3400G systemd[1]: Reloading.
Jun 20 13:19:15 3400G systemd[1]: Finished Reload Configuration from the Real Root.
Jun 20 13:19:15 3400G apparmor.systemd[556]: Reloading AppArmor profiles
Jun 20 13:19:16 3400G postfix[848]: To disable backwards compatibility use "postconf compatibility_level=3.6" and "postfix reload"
Jun 20 13:19:16 3400G postfix[956]: To disable backwards compatibility use "postconf compatibility_level=3.6" and "postfix reload"
Jun 20 13:19:26 3400G systemd[1124]: Reloading.
Jun 20 13:19:27 3400G plasmashell[1323]: kf.plasma.quick: Applet preload policy set to 1
Jun 20 13:19:28 3400G kded5[1261]: Reloading the khotkeys configuration
3400G:~ #
Interesting information. I never really looked at systemd in detail, so if you ask me what I'm doing with it … I hope nothing, at least I'm not aware of anything. My only intention is to mount the relevant local partitions from /etc/fstab and also keep a list of other machines in the network, mounting their respective /local in a tree like /nfs/machineX/local.
I noticed that newer Linux versions have the capability to unmount NFS mounts with lost connections (instead of leaving an unresponsive system), but I never thought about how this is done. Maybe now I have to care? Regarding your comments:
I did find the partition and mount point in two generated unit files, /run/systemd/generator/local.mount and /run/systemd/generator/local-fs.target.requires/local.mount. Both files are identical, and I don't see anything suspicious in them. Other internal mounts (which don't disappear) look the same.
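For reference, such a generated unit looks roughly like this (device path and options assumed):

# /run/systemd/generator/local.mount
[Unit]
Documentation=man:fstab(5) man:systemd-fstab-generator(8)
SourcePath=/etc/fstab

[Mount]
Where=/local
What=/dev/nvme1n1p1
Type=ext4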
The IPs are different, the names are well resolved when I check with a ping.
I guess "presumably in unit var.mount" should read local.mount? After reboot (/local present), suspend and resume (/local absent), the result is this, with no indication of the umount:
# journalctl -b -u local.mount
-- Logs begin at Tue 2021-06-01 05:28:08 CEST, end at Mon 2021-06-21 01:48:43 CEST. --
Jun 20 14:26:03 machine1 systemd[1]: Mounting /local...
Jun 20 14:26:03 machine1 systemd[1]: Mounted /local.
Jun 21 01:20:13 machine1 systemd[1]: local.mount: Succeeded.
Jun 21 01:48:18 machine1 systemd[1]: local.mount: Succeeded.
and the problem even started before installing laptop mode tools. No systemd reload.
I also tried the variant with the direct IP address instead of the hostname … the problem persists. It is gone as soon as the line is commented out.
Any chance that there is some kind of cache that could look up the same wrong information from both the hostname and the IP address? In /etc the only relevant places are fstab and hosts, as expected.
However, mounting /nfs/machine2/local (regardless of using name or IP) actually results in the /local mount being "hijacked"! Before the problem of the disappearing /local started this didn't happen, and I still can't see how it happens. I checked arp -a, and both machines report different MAC addresses for each other.
And compared to the other NFS mounts, no nfs-machine2-local.mount generator unit exists, as would be expected. This is doubly weird… the other NFS generator units have the filesystem type nfs written in them - unmounting those could be explained. If the actual NFS mount is for whatever reason misinterpreted as a local mount, and if this mount exists, why is it unmounted?
I couldn't find anything in /etc or /var that could explain such a mixup. fstab and hosts look as they should (IMHO); ping, arp, ifconfig and ip link all look OK. How else could the system mix up its identity with another machine?
For debugging the problem you may change the local path in /etc/fstab to something else, e.g. /nfs/machine2/LOCAL and try again. Are the units now created correctly?
3400G:~ # systemctl list-units '*var*'
UNIT LOAD ACTIVE SUB DESCRIPTION
test-var.mount loaded active mounted /test/var
var.mount loaded active mounted /var
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
2 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
3400G:~ #
From systemd.mount(5):

LazyUnmount=
Takes a boolean argument. If true, detach the filesystem from the filesystem hierarchy at time of the unmount operation, and clean up all references to the filesystem as soon as they are not busy anymore. This corresponds with umount(8)'s -l switch. Defaults to off.
You have not specified that for the local mount, and the systemd-fstab-generator unit does not show it, while your auditd log revealed:
cryptic proctitle string is translated into "umount -l -f /local/"
You can still try to make it explicit, but it looks like it is not systemd unmounting /local…
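If you wanted to make it explicit, a drop-in for the generated unit would be one way (a sketch, untested; the file name is arbitrary):

# /etc/systemd/system/local.mount.d/lazy.conf
[Mount]
LazyUnmount=yes

# then:
systemctl daemon-reload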
Sure. However for each working mount systemd loads a unit:
3400G:~ # systemctl list-units --type mount
UNIT LOAD ACTIVE SUB DESCRIPTION
-.mount loaded active mounted Root Mount
\x2esnapshots.mount loaded active mounted /.snapshots
boot-efi.mount loaded active mounted /boot/efi
boot-grub2-i386\x2dpc.mount loaded active mounted /boot/grub2/i386-pc
dev-hugepages.mount loaded active mounted Huge Pages File System
dev-mqueue.mount loaded active mounted POSIX Message Queue File System
home.mount loaded active mounted /home
opt.mount loaded active mounted /opt
root.mount loaded active mounted /root
run-user-1000-gvfs.mount loaded active mounted /run/user/1000/gvfs
run-user-1000.mount loaded active mounted /run/user/1000
srv.mount loaded active mounted /srv
sys-fs-fuse-connections.mount loaded active mounted FUSE Control File System
sys-kernel-config.mount loaded active mounted Kernel Configuration File System
sys-kernel-debug.mount loaded active mounted Kernel Debug File System
sys-kernel-tracing.mount loaded active mounted Kernel Trace File System
test-var.mount loaded active mounted /test/var
tmp.mount loaded active mounted Temporary Directory (/tmp)
usr-local.mount loaded active mounted /usr/local
var.mount loaded active mounted /var
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
20 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
3400G:~ #
All mounts are managed by systemd, which creates the appropriate units and deletes them upon unmounting. Users may want to check the correct generation of these units and verify their existence before performing the next step. systemd isn't static, it is very dynamic. Note the fine generators in /usr/lib/systemd/system-generators/.
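For example (the exact set of generators varies per installation):

ls /usr/lib/systemd/system-generators/
# systemd-cryptsetup-generator  systemd-fstab-generator  systemd-getty-generator
# systemd-gpt-auto-generator    ...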
:shame: (headbang)^3 … thanks for opening my eyes!!! Life would be boring without beautiful errors, wouldn't it?
Let's start with the solution to my immediate problem: as I mentioned, every machine's /local partition should be visible in a /nfs/machineX/local tree on my system.
For completeness this includes /local of the current machine, which is not an NFS mount but a symlink, for performance reasons. On this particular machine it turned out that the symlink was in the wrong folder (in the one for machine2 instead of machine1), so the mount point directory for the NFS mount /nfs/machine2/local was in fact a symlink pointing to /local.
Fixing the directory tree and symlink solved both the umount of /local and the bogus NFS mount which hijacked /local.
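In retrospect the mixup could have been spotted (and fixed) like this:

ls -ld /nfs/machine2/local
# lrwxrwxrwx 1 root root ... /nfs/machine2/local -> /local    <- a symlink, not a directory!

# replace the symlink with a real mount point directory:
rm /nfs/machine2/local
mkdir /nfs/machine2/local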
What remains is, from my point of view, a double error in the handling of the NFS mounts:
Auto-removal of NFS mounts appears to be useful when the mounted drive is not accessible after wakeup (assuming that NFS should survive a sleep/resume cycle if the network doesn't change in the meantime - this would explain why the umount happens on resume, not before the suspend). In my case, both machines are and remain on the same network, and I can immediately re-mount the NFS share. So why does it get unmounted at all? Both probing machine2:/local and accessing the symlink to /local should succeed; there is no reason to assume that the connection is down and the mount should be removed.
The other issue is that IMHO the NFS mount of machine2:/local should not have succeeded in the first place, and unmounting /local should not have succeeded either - because the mount point should be a directory, not a symlink (I would assume). Even more, shouldn't umount verify that the filesystem is nfs and not ext4? The FS type is advertised in the generator unit.