Repetitive corruption of /boot partition

Hello,

I am a relatively experienced Ubuntu user, and I finally decided to switch to openSUSE Tumbleweed. I am quite happy with it, yet I have run into a big problem with my boot configuration, now for the second time in just two weeks.

First, my boot setup in short, so the following makes sense:

  • SSD drive (less than 2 years old, moderate use)
  • Legacy/MBR Grub2
  • Grub installed in MBR of the drive /dev/sda
  • separate /boot partition (/dev/sda2), ext4, unencrypted
  • / of the OS inside an extended partition (/dev/sda3), on a logical partition (/dev/sda6) holding an encrypted btrfs file system with subvolumes

A relatively simple setup. I would start my PC and randomly be greeted by a GRUB2 error saying it cannot find some of its essential files. I would then run fsck on /dev/sda2 from a USB-booted Linux, and it would find an enormous number of errors. It fixed them, but both times there was too much damage for GRUB to start properly.
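For reference, the check was roughly this (run from the live system, with the partition not mounted; -p only applies safe automatic fixes, -y answers yes to everything):

fsck.ext4 -f -p /dev/sda2      # forced check with safe automatic repairs
fsck.ext4 -f -y /dev/sda2      # second pass answering yes, if -p gives up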

I got back into my system twice with the help of Super Grub Disk and/or a chroot from a live system, and then did a full reinstall of the bootloader by formatting /boot, running grub2-install and mkinitrd, and reinstalling the kernel files. That worked, and it also showed that both times the root file system (/) was unaffected; no corruption seemed to have occurred there.
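For completeness, the repair looked roughly like this from the live system (a sketch; device names as in the list above, and the mapper name cr_root is just a placeholder):

cryptsetup open /dev/sda6 cr_root          # unlock the encrypted root
mount /dev/mapper/cr_root /mnt             # default btrfs subvolume
mkfs.ext4 /dev/sda2                        # wipe the damaged /boot
mount /dev/sda2 /mnt/boot
for d in dev proc sys; do mount --bind /$d /mnt/$d; done
chroot /mnt
grub2-install /dev/sda                     # put the core image back into the MBR
zypper install -f kernel-default           # restore the kernel files under /boot
mkinitrd                                   # regenerate the initrd
grub2-mkconfig -o /boot/grub2/grub.cfg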

The first time, the event was preceded by a “dirty shutdown” for an unrelated reason, which still should never have written anything to the /boot partition. The second time, I performed a normal shutdown and was surprised by the error in the morning.

What on earth is going on? I am pretty sure this is not a hardware problem: I have never had any corruption or data loss on this drive, it is quite new, and now this has happened twice. I suspect some weird error during unmounting, or maybe during the writing of the boot entries for updates to be installed on the next reboot?

Best regards,
Moe

Here is a journalctl output containing the whole shutdown sequence from yesterday:

https://login.owncube.cloud/index.php/s/AADncSD9atEwRS2

@rfactormo:

I don’t know too much about “MBR” . . . but I do run several multi-boot machines, each using grub . . . and there can be grub issues, but often “pilot error” is involved, I have found over the years.

  - Grub installed in MBR of the drive /dev/sda
  - separate /boot partition (/dev/sda2), ext4, unencrypted
  - / of the OS inside an extended partition (/dev/sda3), on a logical partition (/dev/sda6) holding an encrypted btrfs file system with subvolumes

My thinking is that you have “grub installed in MBR” of potentially “/dev/sda1”??? AND you have a “separate /boot partition” in sda2??? It should be that grub is installed in one partition??? flagged as “/boot” so that the system knows where it is?? You can check in GParted to see what you have done and perhaps delete one of them??
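A quick command-line equivalent of that GParted check (just listing the partition table):

fdisk -l /dev/sda          # or: parted /dev/sda print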

@non_space

Every partition contains a Master Boot Record at the very beginning, which is a potential space for a bootloader’s essentials. Additionally, there is an MBR for the whole device (/dev/sda), not linked to any one partition.

In my case, Grub2 (the initial boot code) is written into the MBR of the drive (/dev/sda); it then loads its components from /dev/sda2. This is standard procedure; grub2 even warns you against installing fully into /dev/sda2 via blocklists.
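You can see the stage written to the MBR by dumping the first sector (a sketch; the exact wording of file’s output varies, but it should at least report a DOS/MBR boot sector, and newer versions often identify GRUB explicitly):

dd if=/dev/sda of=/tmp/mbr.bin bs=512 count=1    # copy only the first 512-byte sector
file /tmp/mbr.bin                                # identify the boot code it contains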

I appreciate your reply, but I don’t think it points in the right direction. The problem here is file system corruption that happens at random, even though the system is set up properly and knows where to boot from.

My first guess would be that it is a hardware problem.

I do use a separate “/boot” but I do not use “btrfs” for the main partition.

For “/boot”, I am using “ext2” rather than “ext4”. And that’s because “ext4” is journaled, but “grub” does not read the journal. However, that should not be your issue. If the problem were due to the journaling, then “fsck” would report only that it is using the journal. If it is finding actual errors, something else is going wrong.
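If you prefer to keep ext4 but drop the journal, something like this should do it (a sketch; the partition must be unmounted, and run fsck afterwards):

tune2fs -l /dev/sda2 | grep -i features    # check whether has_journal is set
tune2fs -O ^has_journal /dev/sda2          # remove the journal
fsck.ext4 -f /dev/sda2                     # verify the result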

Tumbleweed does often update the kernel and regenerate the “initrd”. These are written to “/boot”. And some Tumbleweed updates will reinstall grub2 (also goes to “/boot” as well as the MBR). However, a clean shutdown should still leave a clean “/boot”.

@rfactormo:

No worries . . . I’ve been “wrong” many times in my life . . . . But, your subject line says “corruption of /boot partition” . . . but now you are saying “file system corruption” . . . ??? The fact that you used SuperGrub2 to boot the system would point to a “/boot” problem . . . and if it “booted” the system, then the “/” is “OK”???

I didn’t look at your journalctl file . . . but it likely is something in how you are running your install that is creating your problem . . . I just ran a fresh install of my TW system a day back . . . . I don’t have time to look into the details of “MBR” install . . . but still I believe there is some “conflict” that grub isn’t able to handle??

et al:

That was what I wanted to mention, the formatting of the /boot partition . . . for EFI I use “FAT32” as the formatting . . . so this ext2 suggestion might be a similar idea.

Hi
FWIW, when I use legacy boot I keep the GPT format rather than DOS and create an 8 MB type ef01 partition, just to make use of the GPT partition setup, avoid logical partitions, and keep the protective MBR.
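Roughly, with sgdisk, that kind of layout looks like this (a sketch on a hypothetical disk /dev/sdX; note that the small partition grub2-install embeds its core image into on GPT is the BIOS boot partition, gdisk type code ef02):

sgdisk --clear /dev/sdX                                   # new GPT table (destroys data!)
sgdisk -n 1:0:+8M -t 1:ef02 -c 1:"BIOS boot" /dev/sdX     # embed area for legacy GRUB
sgdisk -n 2:0:+512M -t 2:8300 -c 2:"boot" /dev/sdX        # optional separate /boot
sgdisk -n 3:0:0 -t 3:8300 -c 3:"root" /dev/sdX            # rest of the disk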

It would explain it, yet it is really unlikely, given that I used an identical setup with Ubuntu for two years straight without a single problem. Also, the drive is quite new. Here are my S.M.A.R.T. results (I will be doing a long test tomorrow):


Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       2
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       649
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2596
165 Total_Write/Erase_Count 0x0032   100   100   000    Old_age   Always       -       1741
166 Min_W/E_Cycle           0x0032   100   100   ---    Old_age   Always       -       15
167 Min_Bad_Block/Die       0x0032   100   100   ---    Old_age   Always       -       46
168 Maximum_Erase_Cycle     0x0032   100   100   ---    Old_age   Always       -       39
169 Total_Bad_Block         0x0032   100   100   ---    Old_age   Always       -       429
170 Unknown_Marvell_Attr    0x0032   100   100   ---    Old_age   Always       -       2
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Avg_Write/Erase_Count   0x0032   100   100   000    Old_age   Always       -       15
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       139
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   057   059   000    Old_age   Always       -       43 (Min/Max 10/59)
199 SATA_CRC_Error          0x0032   100   100   ---    Old_age   Always       -       0
230 Perc_Write/Erase_Count  0x0032   100   100   000    Old_age   Always       -       1121 768 1121
232 Perc_Avail_Resrvd_Space 0x0033   100   100   005    Pre-fail  Always       -       100
233 Total_NAND_Writes_GiB   0x0032   100   100   ---    Old_age   Always       -       7569
234 Perc_Write/Erase_Ct_BC  0x0032   100   100   000    Old_age   Always       -       21553
241 Total_Writes_GiB        0x0030   100   100   000    Old_age   Offline      -       6403
242 Total_Reads_GiB         0x0030   100   100   000    Old_age   Offline      -       6203
244 Thermal_Throttle        0x0032   000   100   ---    Old_age   Always       -       0


The official SanDisk tool on Windows claims that the drive is at 97% health (0% equals maximum read/write cycles used).
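For the long test, the usual smartctl calls are (assuming the drive really is /dev/sda):

smartctl -t long /dev/sda        # start the extended self-test in the background
smartctl -l selftest /dev/sda    # check the result once the estimated time has passed
smartctl -A /dev/sda             # re-read the attribute table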

Yes, and something must be going wrong there. I thought that the first time it was a “perfect storm” event, where a system crash happened exactly while it was doing this write to /boot. But as you said, a clean shutdown should never leave it in that state. Something else must be the cause.

“Master” refers to the first DISK sector. Boot records elsewhere on a disk, at the start of partitions rather than disks, are either Extended Boot Records (EBRs) or partition boot records (PBRs). See: Extended boot record - Wikipedia

All my MBR partitioned disks that have GNU/Linux operating systems installed have a partition with Grub Legacy installed either on sda1 or sda3 or a logical partition near the front of the disk. This started sometime early in this century, when I got serious about Linux instead of sticking to DOS and OS/2. Previously, I was booting exclusively using IBM Boot Manager, since its advent with OS/2 v2.0 about a decade earlier. None of these disks have any version of Grub on an MBR. Instead they use the same DOS/OS2/Windows-compatible code I used prior to Grub discovery. It is this code that depends on a boot flag on a primary partition, unlike Grub, which ignores MBR/EBR boot flags. IOW, if the only bootloader(s) on disk is/are Grub, then boot flags on it anywhere have no relevance on that disk, though it’s possible they could in a system with (an)other disk(s).

I don’t often mount a primary hosting Grub to /boot. I manage it manually. That way it’s never subject to damage except by me. In configurations where I do have a separate partition mounted to /boot, it’s virtually always EXT2. IIRC, I did create one somewhat recently with EXT4, but without journaling, which I deem as a practical matter pointless, given how infrequently it gets written to. I’ve never had one corrupted that the automatic fsck during init didn’t fix, if there was ever any corruption at all.

All that said, I consider my experiences with SSDs disheartening. They’re faster than HDDs for sure, but in the barely four years I’ve had them, I’ve had 5 out of 20 need RMA in less than a year’s actual use, 2 ADATAs of different model lines, a Biwin, a Mushkin and a PNY, so don’t think it can’t happen to yours.

It just happened again, and of course it was only the boot partition that got corrupted. I am pretty sure that this is not a hardware problem, given the predictable pattern.

I am really grateful for any suggestions; in any case, I will now monitor file system changes in /boot and try to analyze it myself.
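In case it helps anyone else, I plan to watch /boot roughly like this (a sketch; inotify-tools or the audit framework, whichever is available):

inotifywait -m -r --timefmt '%F %T' --format '%T %e %w%f' /boot   # log every event as it happens

# alternatively, with auditd:
auditctl -w /boot -p wa -k boot-writes     # audit all writes and attribute changes under /boot
ausearch -k boot-writes                    # review them later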

ISTR reading somewhere on one of the mailing lists, or maybe here, that a separate /boot/ filesystem in conjunction with BTRFS on / can be a problem, likely having to do with kernels and /lib/modules being on separate filesystems, in conjunction with snapshotting. You might try searching opensuse.org for such a subject.

Yes, that can be a problem. If you do a rollback, then it might roll back the modules but not the kernel itself. That can cause booting problems. But it should not cause file system corruption.

I use this Partitioning. No monitoring required. What is yours?

Latest Tumbleweed has been canonified:

erlangen:~ # ll /boot
total 127672 
lrwxrwxrwx 1 root root       50 Jul 20 21:09 .vmlinuz-5.18.11-1-default.hmac -> ../usr/lib/modules/5.18.11-1-default/.vmlinuz.hmac 
lrwxrwxrwx 1 root root       50 Sep 25 20:31 .vmlinuz-5.19.10-1-default.hmac -> ../usr/lib/modules/5.19.10-1-default/.vmlinuz.hmac 
lrwxrwxrwx 1 root root       49 Sep 12 07:27 .vmlinuz-5.19.8-1-default.hmac -> ../usr/lib/modules/5.19.8-1-default/.vmlinuz.hmac 
lrwxrwxrwx 1 root root       47 Jul 20 21:09 System.map-5.18.11-1-default -> ../usr/lib/modules/5.18.11-1-default/System.map 
lrwxrwxrwx 1 root root       47 Sep 25 20:31 System.map-5.19.10-1-default -> ../usr/lib/modules/5.19.10-1-default/System.map 
lrwxrwxrwx 1 root root       46 Sep 12 07:27 System.map-5.19.8-1-default -> ../usr/lib/modules/5.19.8-1-default/System.map 
lrwxrwxrwx 1 root root       43 Jul 20 21:09 config-5.18.11-1-default -> ../usr/lib/modules/5.18.11-1-default/config 
lrwxrwxrwx 1 root root       43 Sep 25 20:31 config-5.19.10-1-default -> ../usr/lib/modules/5.19.10-1-default/config 
lrwxrwxrwx 1 root root       42 Sep 12 07:27 config-5.19.8-1-default -> ../usr/lib/modules/5.19.8-1-default/config 
drwxr-xr-x 3 root root     4096 Jan  1  1970 efi
drwxr-xr-x 1 root root       98 Sep 27 02:20 grub2
lrwxrwxrwx 1 root root       24 Sep 25 20:31 initrd -> initrd-5.19.10-1-default 
-rw------- 1 root root 43384726 Sep 25 20:31 initrd-5.18.11-1-default 
-rw------- 1 root root 43635777 Sep 25 20:31 initrd-5.19.10-1-default 
-rw------- 1 root root 43638533 Sep 25 20:32 initrd-5.19.8-1-default 
lrwxrwxrwx 1 root root       48 Jul 20 21:09 sysctl.conf-5.18.11-1-default -> ../usr/lib/modules/5.18.11-1-default/sysctl.conf 
lrwxrwxrwx 1 root root       48 Sep 25 20:31 sysctl.conf-5.19.10-1-default -> ../usr/lib/modules/5.19.10-1-default/sysctl.conf 
lrwxrwxrwx 1 root root       47 Sep 12 07:27 sysctl.conf-5.19.8-1-default -> ../usr/lib/modules/5.19.8-1-default/sysctl.conf 
lrwxrwxrwx 1 root root       25 Sep 25 20:31 vmlinuz -> vmlinuz-5.19.10-1-default 
lrwxrwxrwx 1 root root       44 Jul 20 21:09 vmlinuz-5.18.11-1-default -> ../usr/lib/modules/5.18.11-1-default/vmlinuz 
lrwxrwxrwx 1 root root       44 Sep 25 20:31 vmlinuz-5.19.10-1-default -> ../usr/lib/modules/5.19.10-1-default/vmlinuz 
lrwxrwxrwx 1 root root       43 Sep 12 07:27 vmlinuz-5.19.8-1-default -> ../usr/lib/modules/5.19.8-1-default/vmlinuz 
erlangen:~ #

Modules and kernels belong to the same package, installed in a subvolume of @/.snapshots.

Substantiate this claim by running a file system check:

erlangen:~ # btrfs check --force /dev/nvme0n1p2
Opening filesystem to check... 
WARNING: filesystem mounted, continuing because of --force 
Checking filesystem on /dev/nvme0n1p2 
UUID: 0e58bbe5-eff7-4884-bb5d-a0aac3d8a344 
[1/7] checking root items 
[2/7] checking extents 
[3/7] checking free space tree 
[4/7] checking fs roots 
[5/7] checking only csums items (without verifying data) 
[6/7] checking root refs 
[7/7] checking quota groups skipped (not enabled on this FS) 
found 484195385344 bytes used, no error found 
total csum bytes: 465784716 
total tree bytes: 2198863872 
total fs tree bytes: 1505411072 
total extent tree bytes: 128335872 
btree space waste bytes: 460178833 
file data blocks allocated: 1207237996544 
 referenced 529045770240 
erlangen:~ #

Yes. But booting still uses them from “/boot”. And the “initrd” is still only on “/boot”.

If you know what you are doing, then after a rollback you could copy the needed kernels to “/boot” and remake the “initrd”. That would allow you to still boot.
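Roughly like this (a sketch; the version string is just taken from the listing above, and mkinitrd is openSUSE’s wrapper around dracut):

KVER=5.19.10-1-default                                   # whichever version the rollback restored
cp /usr/lib/modules/$KVER/vmlinuz /boot/vmlinuz-$KVER
ln -sf vmlinuz-$KVER /boot/vmlinuz
dracut -f /boot/initrd-$KVER $KVER                       # or simply: mkinitrd
grub2-mkconfig -o /boot/grub2/grub.cfg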

This won’t help with the file system corruption that the OP is seeing.

@rfactormo:

A couple of people, myself included, pointed to the issue of the formatting of the /boot/efi partition as a potential source of “corruption” . . . . Looks like karlmistelberger’s setup also uses “vfat”???

I mentioned “FAT32” . . . another poster mentioned “ext2” . . . those are “suggestions” . . . .

Aren’t you confusing /boot with /boot/efi? Please re-read all those posts carefully.

@hcvv:

The OP is an “experienced user” . . . but my recent post was essentially a re-post of the data, as a courtesy to the OP . . . . It’s possible that the formatting of the EFI partition has nothing to do with their issue . . . but it is/was the most obvious possible point of error, and it was “suggested” . . . but the OP didn’t seem to respond to those several suggestions on that detail.

Too busy to re-read the whole thread. I still think there was a formatting issue that grub isn’t handling too well . . . hence “the corruption” . . . O . . . the corruption that continues to corrupt the proceedings.