I am a relatively experienced Ubuntu user, and I finally decided to switch to openSUSE Tumbleweed. I am quite happy with it, yet I have encountered a big problem with my boot configuration, now for the second time in just two weeks.
In short, here is my boot setup, so the following makes sense:
- SSD drive (less than 2 years old, moderate use)
- Legacy/MBR Grub2
- Grub installed in MBR of the drive /dev/sda
- separate /boot partition (/dev/sda2), ext4, unencrypted
- / volume of OS in an extended partition (/dev/sda3), therein btrfs partition with subvolumes (/dev/sda6), encrypted
Relatively simple. I would start my PC and randomly be greeted by a GRUB2 warning that it cannot find some of its essential files. I would then run fsck on /dev/sda2 from a USB-booted Linux, and it would find an enormous number of errors. It fixed them, but both times there was too much damage for Grub to start properly.
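For reference, the repair I ran each time looks roughly like this (a sketch; the device name is from my layout above, and /boot must not be mounted while checking):

```shell
# Run from a USB-booted live system; the partition must NOT be mounted.
# Device name is from my layout above -- adjust to yours.
BOOT_PART=/dev/sda2

# Force a full check and automatically fix whatever it finds:
fsck.ext4 -f -y "$BOOT_PART"
```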
I got back into my system twice with the help of Super Grub2 Disk and/or a chroot from a live system, and then did a full reinstall of the bootloader by formatting /boot, running grub2-install and mkinitrd, and reinstalling the kernel files. That worked, and it also showed that both times the main / volume was unaffected; no corruption seemed to have occurred there.
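The recovery steps, as a sketch from memory (run from a chroot into the installed system; device names are from my layout, and the package name assumes openSUSE's default kernel):

```shell
# Inside a chroot of the installed system (with /dev, /proc, /sys bind-mounted).
# Device names are from my layout above -- adjust to yours.
BOOT_PART=/dev/sda2
DISK=/dev/sda

mkfs.ext4 "$BOOT_PART"                  # wipe and re-create the /boot filesystem
mount "$BOOT_PART" /boot

zypper install --force kernel-default   # put the kernel files back into /boot
grub2-install "$DISK"                   # core image into the MBR, modules into /boot/grub2
grub2-mkconfig -o /boot/grub2/grub.cfg  # regenerate the menu
mkinitrd                                # regenerate the initrd in /boot
```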
The first time, the event was preceded by a “dirty shutdown” for an unrelated reason, which still should never have written anything to the /boot partition. The second time, I performed a normal shutdown and was surprised by the error in the morning.
What on earth is going on? I am pretty sure this is not a hardware problem. I have never had any corruption or data loss on this drive, it is quite new, and now it has happened twice. I suspect some weird error during unmounting, or maybe during the registration of boot entries for installing updates on the next reboot?
I don’t know too much about “MBR” . . . but I do run several multi-boot machines, each using grub . . . and there can be grub issues, but often “pilot error” is involved, I have found over the years.
- Grub installed in MBR of the drive /dev/sda
- separate /boot partition (/dev/sda2), ext4, unencrypted
- / volume of OS in an extended partition (/dev/sda3), therein btrfs partition with subvolumes (/dev/sda6), encrypted
My thinking is that you have “grub installed in MBR” of potentially “/dev/sda1”??? AND you have a “separate /boot partition” in sda2??? It should be that grub is installed in one partition??? flagged as “/boot” so that the system knows where it is?? You can check in GParted to see what you have done and perhaps delete one of them??
Every partition contains a Master Boot Record at its very beginning, which is a potential space for a bootloader’s essentials. Additionally, there is an MBR of the whole device (/dev/sda), not linked to any one partition.
In my case, the initial Grub2 code is written into the MBR of the drive (/dev/sda); it then loads its components from /dev/sda2. This is standard procedure; grub2 even warns you against installing fully into /dev/sda2 via blocklists.
I appreciate your reply, but I don’t think it points in the right direction. The problem here is file system corruption that happens randomly, despite the system being set up properly and knowing where to boot from, etc.
My first guess would be that it is a hardware problem.
I do use a separate “/boot” but I do not use “btrfs” for the main partition.
For “/boot”, I am using “ext2” rather than “ext4”. That’s because “ext4” is journaled, but “grub” does not read the journal. However, that should not be your issue. If the problem were due to journaling, then “fsck” would report only that it is replaying the journal. If it is finding actual errors, something else is going wrong.
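If you want to rule the journal out anyway, something like this shows whether it is present, and can drop it (a sketch; assumes /dev/sda2 is your /boot partition and that it is unmounted):

```shell
BOOT_PART=/dev/sda2   # assumed device name; the partition must be unmounted

# Does the filesystem have a journal? Look for "has_journal" in the features.
tune2fs -l "$BOOT_PART" | grep 'Filesystem features'

# Remove the journal (grub then sees it like ext2), and re-check:
tune2fs -O ^has_journal "$BOOT_PART"
e2fsck -f "$BOOT_PART"
```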
Tumbleweed does often update the kernel and regenerate the “initrd”. These are written to “/boot”. And some Tumbleweed updates will reinstall grub2 (also goes to “/boot” as well as the MBR). However, a clean shutdown should still leave a clean “/boot”.
No worries . . . I’ve been “wrong” many times in my life . . . . But, your subject line says “corruption of /boot partition” . . . but now you are saying “file system corruption” . . . ??? The fact that you used SuperGrub2 to boot the system would point to a “/boot” problem . . . and if it “booted” the system, then the “/” is “OK”???
I didn’t look at your journalctl file . . . but it likely is something in how you are running your install that is creating your problem . . . I just ran a fresh install of my TW system a day back . . . . I don’t have time to look into the details of “MBR” install . . . but still I believe there is some “conflict” that grub isn’t able to handle??
That was what I wanted to mention, the formatting of the /boot partition . . . for EFI I use “FAT32” as the formatting . . . so this idea of ext2 might be a similar idea.
Hi
FWIW, when I use legacy boot I keep the GPT format rather than DOS and create an 8MB type ef01 “MBR partition scheme” partition, just to utilize the GPT partition setup, avoid logical partitions, and keep MBR protection.
That would explain it, yet it is really unlikely, given that I used an identical setup with Ubuntu for two years straight without a single problem. Also, the drive is quite new. Here are my S.M.A.R.T. results (I will be doing a long test tomorrow):
The official SanDisk tool on Windows claims that the drive is at 97% health (0% equals maximum read/write cycles used).
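For reference, these are the checks I am running with smartmontools (a sketch; /dev/sda is the drive in question):

```shell
DRIVE=/dev/sda   # the SSD in question

smartctl -a "$DRIVE"        # full SMART attributes and the device error log
smartctl -t long "$DRIVE"   # kick off the extended (long) self-test
# ...and later, once the test has had time to finish:
smartctl -l selftest "$DRIVE"
```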
Yes, and something must be going wrong there. I thought the first time was a “perfect storm” event, where a system crash happened exactly while it was writing to /boot. But as you said, a clean shutdown should never leave it in that state. Something else must be the cause.
“Master” refers to the first DISK sector. Boot Records elsewhere on a disk are Extended Boot Records, or EBRs, at the start of partitions, not disks, or are partition boot records, PBRs. Extended boot record - Wikipedia
All my MBR partitioned disks that have installed GNU/Linux operating systems have a partition with Grub Legacy installed either on sda1 or sda3 or a logical partition near the front of the disk. This started sometime early in this century, when I got serious about Linux instead of sticking to DOS and OS/2. Previously, I was booting exclusively using IBM Boot Manager, since its advent with OS/2 v2.0 about a decade earlier. None of these disks have any version of Grub on an MBR. Instead they use the same DOS/OS2/Windows-compatible code I used prior to Grub discovery. It is this code that depends on a boot flag on a primary partition, unlike Grub, which ignores MBR/EBR boot flags. IOW, if the only bootloader(s) on disk is/are Grub, then boot flags anywhere on it have no relevance on that disk, though it’s possible they could in a system with (an)other disk(s).
I don’t often mount a primary hosting Grub to /boot. I manage it manually. That way it’s never subject to damage except by me. In configurations when I do have a separate partition mounted to boot, it’s virtually always EXT2. IIRC, I did create one somewhat recently with EXT4, but without journaling, which I deem as a practical matter pointless, given the infrequency it gets written to. I’ve never had one corrupted that automatic fsck during init didn’t fix, if ever any corruption at all.
All that said, I consider my experiences with SSDs disheartening. They’re faster than HDDs for sure, but in the barely four years I’ve had them, I’ve had 5 out of 20 need RMA in less than a year’s actual use, 2 ADATAs of different model lines, a Biwin, a Mushkin and a PNY, so don’t think it can’t happen to yours.
It just happened again, and of course it was only the boot partition that got corrupted. I am pretty sure this is not a hardware problem, given the predictable pattern.
I am really grateful for any suggestions; in any case, I will now monitor file system changes in /boot and try to analyze it myself.
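My monitoring plan, as a sketch (requires the inotify-tools package; the log path is just my choice):

```shell
# Log every change under a directory with a timestamp, until interrupted.
# Requires inotify-tools (inotifywait).
watch_boot() {
    inotifywait -m -r \
        -e modify -e create -e delete -e move -e attrib \
        --timefmt '%F %T' --format '%T %w%f %e' \
        "$1" >> /var/log/boot-watch.log
}
```

Then leave `watch_boot /boot` running in a spare terminal (or a systemd unit) and check the log after the next corruption.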
ISTR reading somewhere on one of the mailing lists, or maybe here, that a separate /boot/ filesystem in conjunction with BTRFS on / can be a problem, likely having to do with kernels and /lib/modules being on separate filesystems, in conjunction with snapshotting. You might try searching opensuse.org for such a subject.
Yes, that can be a problem. If you do a rollback, then it might rollback the modules but not the kernel itself. That can cause booting problems. But it should not cause file system corruption.
Substantiate this claim by running a file system check:
**erlangen:~ #** btrfs check --force /dev/nvme0n1p2
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/nvme0n1p2
UUID: 0e58bbe5-eff7-4884-bb5d-a0aac3d8a344
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 484195385344 bytes used, no error found
total csum bytes: 465784716
total tree bytes: 2198863872
total fs tree bytes: 1505411072
total extent tree bytes: 128335872
btree space waste bytes: 460178833
file data blocks allocated: 1207237996544
referenced 529045770240
**erlangen:~ #**
Yes. But booting still uses them from “/boot”. And the “initrd” is still only on “/boot”.
If you know what you are doing, then after a rollback you could copy the needed kernels to “/boot” and remake the “initrd”. That would allow you to still boot.
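Roughly, something like this (a sketch, assuming openSUSE’s kernel-default package and a separate /boot):

```shell
# After a btrfs rollback with a separate /boot:
# the modules under /lib/modules were rolled back, but /boot was not.
KERNEL_PKG=kernel-default   # assumed package name for the default openSUSE kernel

# Force-reinstall the kernel so vmlinuz in /boot matches the rolled-back modules,
# then rebuild the initrd and the grub menu:
zypper install --force "$KERNEL_PKG"
mkinitrd
grub2-mkconfig -o /boot/grub2/grub.cfg
```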
This won’t help with the file system corruption that the OP is seeing.
A couple of people, myself included, pointed to the formatting of the /boot/efi partition as a potential source of “corruption” . . . . Looks like karlmistelberger’s setup also uses “vfat”???
I mentioned “FAT32” . . . another poster mentioned “ext2” . . . those are “suggestions” . . . .
The OP is “experienced user” . . . but my recent post was essentially a re-post on the data, as a courtesy to the OP . . . . It’s possible that the formatting of the EFI partition has nothing to do with their issue . . . but it is/was the most obvious problem of possible error, and it was “suggested” . . . but OP didn’t seem to respond to those several suggestions on that detail.
Too busy to re-read the whole thread. I still think there was a formatting issue that grub isn’t handling too well . . . hence “the corruption” . . . O . . . the corruption that continues to corrupt the proceedings.