I'm running Tumbleweed on a Ryzen/NVIDIA laptop with two internal SSDs. Essentially everything works fine (hoping that the NVIDIA garbage-screen issues after wakeup are consistently solved), but…
All relevant partitions are mounted via fstab, among them the only ext4 data partition on the 2nd disk, which is mounted at /local.
The system can sleep and wake up, but apparently shortly after wakeup the /local partition silently disappears. I can manually
mount /local with no problem. There is no fsck error and I couldn’t find any hint in dmesg or journalctl that anything happens.
I searched for strange cron jobs (zero) and also stopped the autofs service - with no effect. After a normal boot /local is mounted, so the issue is related to sleep/wakeup.
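For reference, the fstab entry for /local is essentially this (the mount options here are an assumption):

/dev/nvme1n1p1  /local  ext4  defaults  0  2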
Another strange thing happened when checking the properties of the KDE mount tool (which works fine for USB drives and
card readers). Selecting the “Camera” tab, a popup asked for the root password to mount one of the internal EFI partitions.
Maybe this is a hint that a KDE component is responsible, but I can’t find a good reason.
Any suggestions as to what the root cause might be, and a possible fix? Is there a general way to exclude all built-in disks from any automount activity?
A quick look at auto.master and related files and the man page didn't give me a clue.
Likely unrelated, but mentioned for completeness: I created a sufficiently large swap partition to support hibernation. The image is written upon hibernate, but there is no resume - the machine always does a fresh boot instead. The resume= kernel parameter is given, and I also explicitly added the "resume" module in the dracut configuration, with no luck. I'm also stuck on how to debug this.
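Concretely, the setup is along these lines (the swap device path here is an example, not the actual one):

# /etc/dracut.conf.d/99-resume.conf   (file name is an example)
add_dracutmodules+=" resume "

# kernel command line, e.g. in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#   resume=/dev/nvme0n1p3            <- the swap partition

# rebuild the initrd and the grub config afterwards:
dracut -f
grub2-mkconfig -o /boot/grub2/grub.cfg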
Thanks for the tip and the links - they somehow solved two other problems!
Audit was already installed and running, but the suggested rules didn’t log anything. That pointed me to the other problems:
On my old laptop I had tried to use auditctl to track down some network issues, which I could never solve because audit didn't log anything useful after configuring the rules. Now I realized that the kernel audit messages require the boot parameter audit=1. I set that in yast2/boot, then with grub-customizer, and then by booting into my parallel Ubuntu and using grub-customizer there to update it as well (which changed the GRUB_CMDLINE_LINUX_DEFAULT string) - still with no success (checked with cat /proc/cmdline and sysctl -a). At that point I realized that my resume= string had never actually been used either! After editing the particular Tumbleweed boot entry in Ubuntu (which is the one that actually gets executed) I can finally hibernate and resume in Tumbleweed. The resume and audit settings now appear in /proc/cmdline (but not in sysctl -a).
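A quick way to verify that such parameters actually reached the kernel (values shown are examples):

cat /proc/cmdline
# ... audit=1 resume=/dev/nvme0n1p3 ...
auditctl -s    # reports "enabled 1" once auditing is active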
The biggest surprise is that I apparently got back a feature that I lost 10+ years ago: the ability to hibernate and switch the OS (to be used with care, I know). When I turn the laptop on, I get the grub menu. Only after selecting Tumbleweed does the system perform the resume operation. Perfect! I had read in many places that this would be impossible nowadays.
The /local partition is lost after resume from hibernate as well. And since visiting the camera setting of the KDE Disks & Devices thing, I get a password request to mount other EFI partitions after every reboot… the message comes from PolicyKit1. Maybe that could be a useful hint.
Back to the problem of the lost /local mount: I followed the article to set up mount/umount logging. I do get some messages, but so far they are not very helpful to me:
cat /var/log/audit/audit.log | grep nvme
type=SERVICE_STOP msg=audit(1622965679.780:1473): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-fsck@dev-nvme1n1p1 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

This is at least suspicious because it is the only partition mentioned, namely the one that gets lost. But what does it tell me?
Searching for "mount" does not give me a clue:
cat /var/log/audit/audit.log | grep mount
type=CONFIG_CHANGE msg=audit(1623312557.163:4554): auid=4294967295 ses=4294967295 subj==unconfined op=add_rule key="mount_umount" list=4 res=1
type=CONFIG_CHANGE msg=audit(1623313714.103:71): auid=4294967295 ses=4294967295 subj==unconfined op=add_rule key="mount_umount" list=4 res=1
type=CONFIG_CHANGE msg=audit(1623314162.180:71): auid=4294967295 ses=4294967295 subj==unconfined op=add_rule key="mount_umount" list=4 res=1
type=CONFIG_CHANGE msg=audit(1623310177.815:71): auid=4294967295 ses=4294967295 subj==unconfined op=add_rule key="mount_umount" list=4 res=1
type=CONFIG_CHANGE msg=audit(1623312005.851:186): auid=4294967295 ses=4294967295 subj==unconfined op=add_rule key="mount_umount" list=4 res=1
Some more context around the STOP message for the partition name:
type=SERVICE_STOP msg=audit(1622965679.780:1471): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=klog comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
type=CRYPTO_KEY_USER msg=audit(1622965679.780:1472): pid=1595 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='op=destroy kind=server fp=SHA256:<key> direction=? spid=1595 suid=0 exe="/usr/sbin/sshd" hostname=? addr=? terminal=? res=success'
type=SERVICE_STOP msg=audit(1622965679.780:1473): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-fsck@dev-nvme1n1p1 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
type=SERVICE_STOP msg=audit(1622965679.788:1474): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=irqbalance comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
type=SERVICE_STOP msg=audit(1622965679.792:1475): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=smartd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
The only mention of /local seems to be related to restoring the working directory of an open shell (?):

type=USER_CMD msg=audit(1622588514.280:907): pid=23105 uid=1000 auid=1000 ses=4 subj==unconfined msg='cwd="/local/polaris1/etc" cmd=6370202D7020736861646F77202F6574632F terminal=pts/8 res=success'

The "success" is strange because /local is currently empty and cwd can't be the specified place.
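As an aside, the hex-encoded cmd field is plain ASCII and can be decoded by hand (ausearch -i interprets such fields too):

echo 6370202D7020736861646F77202F6574632F | xxd -r -p
# cp -p shadow /etc/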
There is no problem with the partition:
fsck /dev/nvme1n1p1
fsck from util-linux 2.36.2
e2fsck 1.46.2 (28-Feb-2021)
/dev/nvme1n1p1: clean, 8860982/109387776 files, 337456099/437549568 blocks
I can do
mount /local
but get no new message in the audit.log containing "mount" or "local".
Quite nice that you could solve two other problems using auditd!
No, I am not an auditd expert and cannot get more information out of what you shared. My first question would have been whether you see anything in auditd when you manually mount/unmount /local, but you already tried that.
As mounting/unmounting /local by hand does not leave any trace, that should be solved first.
What did you put into /etc/audit/rules.d/audit.rules? It would be good to post it in this topic.
If you followed the article, it is likely not (completely) correct anymore and needs to be updated.
You can get more information using "man audit.rules", but looking at the list of syscalls and searching for umount I already see more syscalls than the mount and umount2 mentioned in the article.
Well, it helps to read what is written in /etc/audit/rules.d/audit.rules:
After removing one of the default lines I do get mount/umount messages. But frankly, I now know that something gets unmounted, I just can't figure out what starts the process.
Most of the other documented syscalls are silently ignored or never get triggered; I kept them as comments. The remaining content (which does produce messages) is this:
## delete all rules
-D
## This suppresses syscall auditing for all tasks started with this rule in effect. Remove it if you need syscall auditing.
#-a task,never
## monitor mounting and unmounting
-a always,exit -F arch=b64 -S mount -k mount_umount_0
-a always,exit -F arch=b64 -S umount2 -k mount_umount_1
#-a always,exit -F arch=b64 -S fsmount -k mount_umount_2
#-a always,exit -F arch=b64 -S move_mount -k mount_umount_3
#-a always,exit -F arch=b64 -S oldumount -k mount_umount_4
#-a always,exit -F arch=b64 -S umount -k mount_umount_5
-w /usr/bin/mount -p x -k exec_mount
-w /usr/bin/umount -p x -k exec_umount
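For reference, the rules can be (re)loaded and then queried by key like this:

augenrules --load                # compile /etc/audit/rules.d/*.rules and activate them
auditctl -l                      # list the rules now in effect
ausearch -k mount_umount_1 -i    # interpreted events from the umount2 rule
ausearch -k exec_mount -i        # executions of /usr/bin/mount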
From the mount command I can check the ppid with grep, and the process is bash - the shell where I manually executed mount after wakeup.
For the umount of /local, the ppid 19048 does not exist anymore. It also unmounted all NFS drives from fstab (even though they had not been mounted before). Using "ausearch -k mount_umount_1 -i" the cryptic proctitle string is translated into "umount -l -f /local/". That's just the command line that was used, but it contains no hint about the parent process.
From one of the password popups (for mounting other internal EFI partitions) I stumbled across polkit and udisks2. I stopped that service with "systemctl stop udisks2.service", but this has no effect on the automatic unmounting of /local.
I could repeatedly verify that the umount happens not as part of the suspend operation but after waking up again. E.g. when I prepare a "df" command in a shell and execute it immediately after wakeup, /local is still there. If I repeat it 1 s later, it is gone.
Since the umount comes within a chain of NFS umounts, I realized from the timestamps of the messages that my /local problem is a misinterpretation of the one fstab line that mounts machine2:/local to /nfs/machine2/local. Apparently machine2 is taken as either localhost or machine1, whatever. But I don't understand how this happens. The name machine2 is not associated with any IP of machine1, and both "echo $HOST" and "cat /etc/hostname" return machine1.
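The fstab line in question is essentially this (the mount options are an assumption):

machine2:/local  /nfs/machine2/local  nfs  defaults  0  0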
The good side… now I have a workaround: I can comment out the NFS mount to the sister laptop (machine2). But the root cause is still a mystery.
There is one interesting thing about your posts. You report some symptoms you find interesting, but you fail to provide hard evidence. Mounting is governed by systemd, which generates units, e.g. in directory /run/systemd/generator/.
3400G:~ # ll /run/systemd/generator/
total 56
-rw-r--r-- 1 root root 320 Jun 20 13:19 -.mount
-rw-r--r-- 1 root root 282 Jun 20 13:19 GARMIN.mount
-rw-r--r-- 1 root root 337 Jun 20 13:19 HDD.mount
-rw-r--r-- 1 root root 338 Jun 20 13:19 WD25.mount
-rw-r--r-- 1 root root 359 Jun 20 13:19 \x2esnapshots.mount
-rw-r--r-- 1 root root 264 Jun 20 13:19 boot-efi.mount
-rw-r--r-- 1 root root 375 Jun 20 13:19 boot-grub2-i386\x2dpc.mount
-rw-r--r-- 1 root root 381 Jun 20 13:19 boot-grub2-x86_64\x2defi.mount
-rw-r--r-- 1 root root 323 Jun 20 13:19 home.mount
drwxr-xr-x 2 root root 260 Jun 20 13:19 local-fs.target.requires
drwxr-xr-x 2 root root 60 Jun 20 13:19 local-fs.target.wants
-rw-r--r-- 1 root root 345 Jun 20 13:19 opt.mount
-rw-r--r-- 1 root root 347 Jun 20 13:19 root.mount
-rw-r--r-- 1 root root 345 Jun 20 13:19 srv.mount
-rw-r--r-- 1 root root 357 Jun 20 13:19 usr-local.mount
-rw-r--r-- 1 root root 345 Jun 20 13:19 var.mount
3400G:~ #
So I am clueless what you are doing in /etc/fstab and what systemd does on your machine, presumably in unit var.mount.
3400G:~ # journalctl -b -u var.mount
-- Logs begin at Sat 2021-06-19 04:28:11 CEST, end at Sun 2021-06-20 20:01:06 CEST. --
Jun 20 13:19:15 3400G systemd[1]: Mounted /var.
3400G:~ #
In the past unmounting typically occurred due to inadvertent reloading of systemd. You may check for messages:
3400G:~ # journalctl -b --grep reload
-- Logs begin at Sat 2021-06-19 04:28:11 CEST, end at Sun 2021-06-20 20:01:06 CEST. --
Jun 20 13:19:15 3400G systemd[1]: Starting Reload Configuration from the Real Root...
Jun 20 13:19:15 3400G systemd[1]: Reloading.
Jun 20 13:19:15 3400G systemd[1]: Finished Reload Configuration from the Real Root.
Jun 20 13:19:15 3400G apparmor.systemd[556]: Reloading AppArmor profiles
Jun 20 13:19:16 3400G postfix[848]: To disable backwards compatibility use "postconf compatibility_level=3.6" and "postfix reload"
Jun 20 13:19:16 3400G postfix[956]: To disable backwards compatibility use "postconf compatibility_level=3.6" and "postfix reload"
Jun 20 13:19:26 3400G systemd[1124]: Reloading.
Jun 20 13:19:27 3400G plasmashell[1323]: kf.plasma.quick: Applet preload policy set to 1
Jun 20 13:19:28 3400G kded5[1261]: Reloading the khotkeys configuration
3400G:~ #
Interesting information. I never really looked at systemd in detail, so if you ask me what I'm doing with it … I hope nothing, at least I'm not aware of anything. My only intention is to mount the relevant local partitions from /etc/fstab and also keep a list of other machines in the network, mounting their respective /local in a tree like /nfs/machineX/local.
I noticed that newer Linux versions have the capability to unmount NFS mounts with lost connections (instead of leaving an unresponsive system), but I never thought about how this is done. Maybe now I have to care? Regarding your comments:
I did find the partition and mount point in two generated unit files, /run/systemd/generator/local.mount and /run/systemd/generator/local-fs.target.requires/local.mount. Both files are identical, and I don't see anything suspicious in them. Other internal mounts (which don't disappear) look the same.
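For reference, such a generated unit looks roughly like this (device path and options assumed):

# /run/systemd/generator/local.mount
[Unit]
Documentation=man:fstab(5) man:systemd-fstab-generator(8)
SourcePath=/etc/fstab

[Mount]
Where=/local
What=/dev/nvme1n1p1
Type=ext4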
The IPs are different, the names are well resolved when I check with a ping.
I guess "presumably in unit var.mount" should read local.mount? After reboot (/local present), suspend and resume (/local absent), the result is this, with no indication of the umount:
# journalctl -b -u local.mount
-- Logs begin at Tue 2021-06-01 05:28:08 CEST, end at Mon 2021-06-21 01:48:43 CEST. --
Jun 20 14:26:03 machine1 systemd[1]: Mounting /local...
Jun 20 14:26:03 machine1 systemd[1]: Mounted /local.
Jun 21 01:20:13 machine1 systemd[1]: local.mount: Succeeded.
Jun 21 01:48:18 machine1 systemd[1]: local.mount: Succeeded.
and the problem even started before installing laptop mode tools. No systemd reload.
I also tried the variant with the direct IP address instead of the hostname … the problem persists. It is gone as soon as the line is commented out.
Any chance that there is some kind of cache that could look up the same wrong information from both the hostname and the IP address? In /etc the only relevant places are fstab and hosts, as expected.
However, mounting /nfs/machine2/local (regardless of using name or IP) actually results in the /local mount being "hijacked"! Before the problem of the disappearing /local started this didn't happen, and I still can't see how it happens. I checked arp -a, and both machines report different MAC addresses for each other.
And compared to the other NFS mounts, no nfs-machine2-local.mount generator unit exists, as would be expected. This is doubly weird… the other NFS generator units have the filesystem type nfs written in them - unmounting those could be explained. If the actual NFS mount is for whatever reason misinterpreted as a local mount, and if this mount exists, why is it unmounted?
I couldn't find anything in /etc or /var that could explain such a mixup. fstab and hosts look as they should (IMHO); ping, arp, ifconfig and ip link all look OK. How else could the system mix up its identity with another machine?
For debugging the problem you may change the local path in /etc/fstab to something else, e.g. /nfs/machine2/LOCAL and try again. Are the units now created correctly?
3400G:~ # systemctl list-units '*var*'
UNIT LOAD ACTIVE SUB DESCRIPTION
test-var.mount loaded active mounted /test/var
var.mount loaded active mounted /var
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
2 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
3400G:~ #
From systemd.mount(5):

LazyUnmount=
Takes a boolean argument. If true, detach the filesystem from the filesystem hierarchy at time of the unmount operation, and clean up all references to the filesystem as soon as they are not busy anymore. This corresponds with umount(8)'s -l switch. Defaults to off.
You have not specified that for the local mount, and the systemd-fstab-generator unit does not show it, while your auditd log revealed:
cryptic proctitle string is translated into "umount -l -f /local/"
You can still try to make it explicit, but it looks like it is not systemd unmounting /local…
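If you wanted to make it explicit, a drop-in for the generated unit would be one way (a sketch, untested; the file name is arbitrary):

# /etc/systemd/system/local.mount.d/lazy.conf
[Mount]
LazyUnmount=yes

# then:
systemctl daemon-reload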
Sure. However for each working mount systemd loads a unit:
3400G:~ # systemctl list-units --type mount
UNIT LOAD ACTIVE SUB DESCRIPTION
-.mount loaded active mounted Root Mount
\x2esnapshots.mount loaded active mounted /.snapshots
boot-efi.mount loaded active mounted /boot/efi
boot-grub2-i386\x2dpc.mount loaded active mounted /boot/grub2/i386-pc
dev-hugepages.mount loaded active mounted Huge Pages File System
dev-mqueue.mount loaded active mounted POSIX Message Queue File System
home.mount loaded active mounted /home
opt.mount loaded active mounted /opt
root.mount loaded active mounted /root
run-user-1000-gvfs.mount loaded active mounted /run/user/1000/gvfs
run-user-1000.mount loaded active mounted /run/user/1000
srv.mount loaded active mounted /srv
sys-fs-fuse-connections.mount loaded active mounted FUSE Control File System
sys-kernel-config.mount loaded active mounted Kernel Configuration File System
sys-kernel-debug.mount loaded active mounted Kernel Debug File System
sys-kernel-tracing.mount loaded active mounted Kernel Trace File System
test-var.mount loaded active mounted /test/var
tmp.mount loaded active mounted Temporary Directory (/tmp)
usr-local.mount loaded active mounted /usr/local
var.mount loaded active mounted /var
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
20 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
3400G:~ #
All mounts are managed by systemd, which creates the appropriate units and deletes them upon unmounting. Users may want to check the correct generation of these units and verify their existence before performing the next step. systemd isn't static, it is very dynamic. Note the fine generators in /usr/lib/systemd/system-generators/.
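For example (the exact set of generators varies per installation):

ls /usr/lib/systemd/system-generators/
# systemd-cryptsetup-generator  systemd-fstab-generator  systemd-getty-generator
# systemd-gpt-auto-generator    ...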
:shame: (headbang)^3 … thanks for opening my eyes!!! Life would be boring without beautiful errors, wouldn't it?
Let's start with the solution to my immediate problem: as I mentioned, every machine's /local partition should be visible in a /nfs/machineX/local tree on my system.
For completeness this includes /local of the current machine, which is not an NFS mount but a symlink, for performance reasons. On this particular machine it turned out that the symlink was in the wrong folder (in the one for machine2 instead of machine1), so the mount point directory for the NFS mount /nfs/machine2/local was in fact a symlink pointing to /local.
Fixing the directory tree and symlink solved both the umount of /local and the bogus NFS mount which hijacked /local.
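In retrospect the mixup could have been spotted (and fixed) like this:

ls -ld /nfs/machine2/local
# lrwxrwxrwx 1 root root ... /nfs/machine2/local -> /local    <- a symlink, not a directory!

# replace the symlink with a real mount point directory:
rm /nfs/machine2/local
mkdir /nfs/machine2/local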
What remains is, from my point of view, a double error in the handling of the NFS mounts:
Auto-removal of NFS mounts appears to be useful when the mounted drive is not accessible after wakeup (assuming that NFS should survive a sleep/resume cycle if the network doesn't change in the meantime - this would explain why the umount happens on resume, not before the suspend). In my case, both machines are and remain on the same network, and I can immediately re-mount the NFS share. So why does it get unmounted at all? Both probing machine2:/local and accessing the symlink to /local should succeed; there is no reason to assume that the connection is down and the mount should be removed.
The other issue is that IMHO the NFS mount of machine2:/local should not have succeeded in the first place, and unmounting /local should not have succeeded either - because the mount point should be a directory, not a symlink (I would assume). Even more, shouldn't umount verify that the filesystem is nfs and not ext4? The FS type is advertised in the generator unit.