Page 1 of 3
Results 1 to 10 of 21

Thread: Repetitive corruption of /boot partition

  1. #1

    Repetitive corruption of /boot partition

    Hello,

    I am a relatively experienced Ubuntu user, and I finally wanted to switch to openSUSE Tumbleweed. I am quite happy, yet I have run into a big problem with my boot configuration, now for the second time in just two weeks.

    In short, here is my boot setup, so that the following makes sense:


    • SSD drive (less than 2 years old, moderate use)
    • Legacy/MBR Grub2
    • Grub installed in MBR of the drive /dev/sda
    • separate /boot Partition (/dev/sda2), ext4, unencrypted
    • / volume of OS in an extended partition (/dev/sda3), therein btrfs partition with subvolumes (/dev/sda6), encrypted


    The problem itself is relatively simple: I start my PC and am randomly greeted by a GRUB2 warning that it cannot find some of its essential files. I then run fsck on /dev/sda2 from a USB-booted Linux, and it finds an enormous number of errors. It fixes them, but both times there was too much damage for GRUB to start properly.
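    The fsck run described above reports its outcome through its exit status as well as on the console. As a minimal sketch (the function names and the read-only e2fsck invocation are illustrative assumptions, not something posted in this thread), the status bits documented in the fsck(8) man page could be decoded like this:

```python
import subprocess

# Exit-status bits as documented in fsck(8); the status is a bitwise OR of these.
FSCK_EXIT_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(code):
    """Translate an fsck exit status into a list of human-readable messages."""
    if code == 0:
        return ["no errors"]
    return [msg for bit, msg in FSCK_EXIT_BITS.items() if code & bit]


def check_readonly(device):
    """Hypothetical helper: run a forced, read-only check on an UNMOUNTED device.

    -f forces the check even if the filesystem is marked clean,
    -n opens it read-only and answers "no" to every repair prompt.
    """
    result = subprocess.run(["e2fsck", "-f", "-n", device])
    return decode_fsck_status(result.returncode)


if __name__ == "__main__":
    # Decoding only; nothing is run against a real disk here.
    print(decode_fsck_status(5))
```

    Running check_readonly("/dev/sda2") from a live system before letting fsck repair anything would at least record what kind of damage was found, which may help when comparing the two incidents.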

    I got back into my system twice with the help of Super Grub Disk and/or a chroot from a live system, and then did a full reinstall of the bootloader by formatting /boot, running grub2-install and mkinitrd, and reinstalling the kernel files. That worked, and it also showed that both times the root filesystem / was unaffected; no corruption seemed to have occurred there.

    The first time, the event was preceded by a "dirty shutdown" due to some other reason, which, however, should never have written anything to the /boot partition. The second time, I performed a normal shutdown, and in the morning was surprised by the error.

    What on earth is going on? I am pretty sure that this is not a hardware problem. I have never had any corruption or data loss on this drive, it is quite new, and now this has happened twice. I suspect some weird error in unmounting, or maybe in registering the boot options to install updates on the next reboot?

    Best regards,
    Moe

  2. #2

    Re: Repetitive corruption of /boot partition

    Here is a journalctl output containing the whole shutdown sequence from yesterday:

    https://login.owncube.cloud/index.php/s/AADncSD9atEwRS2

  3. #3

    Re: Repetitive corruption of /boot partition

    @rfactormo:

    I don't know too much about "MBR" . . . but I do run several multi-boot machines, each using grub . . . and there can be grub issues, but often "pilot error" is involved, I have found over the years.

    Code:
    • Grub installed in MBR of the drive /dev/sda
    • separate /boot Partition (/dev/sda2), ext4, unencrypted
    • / volume of OS in an extended partition (/dev/sda3), therein btrfs partition with subvolumes (/dev/sda6), encrypted
    My thinking is that you have "grub installed in MBR" of potentially "/dev/sda1"? And you have a "separate /boot partition" in sda2? Shouldn't grub be installed in one partition, flagged as "/boot", so that the system knows where it is? You can check in GParted to see what you have done and perhaps delete one of them?

  4. #4

    Re: Repetitive corruption of /boot partition

    @non_space


    Every partition contains a Master Boot Record at the very beginning which is a potential space for a bootloader's essentials. Additionally, there is an MBR of the whole device (/dev/sda), not linked to any one partition.

    In my case, Grub2 (the initial code) is written into the MBR of the drive (/dev/sda); it then loads its components from /dev/sda2. This is standard procedure; grub2 even warns you not to install fully into /dev/sda2 via blocklists.

    I appreciate your reply, but I don't think it points in the right direction. The problem here is file system corruption that happens randomly, despite the system being set up properly and knowing where to boot from etc.

  5. #5

    Re: Repetitive corruption of /boot partition

    Quote: Originally Posted by rfactormo
    What on earth is going on?
    My first guess would be that it is a hardware problem.

    I do use a separate "/boot" but I do not use "btrfs" for the main partition.

    For "/boot", I am using "ext2" rather than "ext4". And that's because "ext4" is journaled, but "grub" does not read the journal. However, that should not be your issue. If the problem were due to the journaling, then "fsck" would report only that it is using the journal. If it is finding actual errors, something else is going wrong.

    Tumbleweed does often update the kernel and regenerate the "initrd". These are written to "/boot". And some Tumbleweed updates will reinstall grub2 (also goes to "/boot" as well as the MBR). However, a clean shutdown should still leave a clean "/boot".
    openSUSE Leap 15.4; KDE Plasma 5.24.4;
    testing Tumbleweed.
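    The points above about the ext4 journal and about a clean shutdown leaving a clean "/boot" can be checked directly from the ext2/3/4 superblock. A minimal sketch, assuming the standard on-disk layout (superblock at byte 1024; magic, state and feature fields at the offsets used below); the function names are hypothetical, and the result is only meaningful when read from a live system with the partition unmounted:

```python
import struct

SUPERBLOCK_OFFSET = 1024      # the primary superblock starts 1 KiB into the partition
EXT_MAGIC = 0xEF53            # s_magic for ext2/ext3/ext4
STATE_VALID = 0x0001          # s_state: cleanly unmounted
STATE_ERROR = 0x0002          # s_state: errors detected
COMPAT_HAS_JOURNAL = 0x0004   # s_feature_compat: filesystem has a journal
INCOMPAT_RECOVER = 0x0004     # s_feature_incompat: journal still needs replaying


def describe_superblock(sb):
    """Summarise journal and clean/dirty status from raw superblock bytes."""
    magic, state = struct.unpack_from("<HH", sb, 56)        # s_magic, s_state
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/3/4 superblock")
    compat, incompat = struct.unpack_from("<II", sb, 92)    # feature flag words
    return {
        "has_journal": bool(compat & COMPAT_HAS_JOURNAL),
        "needs_recovery": bool(incompat & INCOMPAT_RECOVER),
        "clean": bool(state & STATE_VALID) and not (state & STATE_ERROR),
    }


def read_superblock(path):
    """Read the raw superblock from a device or image file (needs read access)."""
    with open(path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        return f.read(1024)
```

    If describe_superblock(read_superblock("/dev/sda2")) reported "needs_recovery" after a normal shutdown, that would suggest /boot was not unmounted cleanly, which is exactly the case where a bootloader that does not replay the journal could see stale data.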

  6. #6

    Re: Repetitive corruption of /boot partition

    Quote: Originally Posted by rfactormo
    @non_space


    Every partition contains a Master Boot Record at the very beginning which is a potential space for a bootloader's essentials. Additionally, there is an MBR of the whole device (/dev/sda), not linked to any one partition.

    In my case, Grub2 (the initial code) is written into the MBR of the drive (/dev/sda); it then loads its components from /dev/sda2. This is standard procedure; grub2 even warns you not to install fully into /dev/sda2 via blocklists.

    I appreciate your reply, but I don't think it points in the right direction. The problem here is file system corruption that happens randomly, despite the system being set up properly and knowing where to boot from etc.
    @rfactormo:

    No worries . . . I've been "wrong" many times in my life . . . . But, your subject line says "corruption of /boot partition" . . . but now you are saying "file system corruption" . . . ??? The fact that you used SuperGrub2 to boot the system would point to a "/boot" problem . . . and if it "booted" the system, then the "/" is "OK"???

    I didn't look at your journalctl file . . . but it likely is something in how you are running your install that is creating your problem . . . I just ran a fresh install of my TW system a day back . . . . I don't have time to look into the details of "MBR" install . . . but still I believe there is some "conflict" that grub isn't able to handle??

  7. #7

    Re: Repetitive corruption of /boot partition

    Quote: Originally Posted by nrickert
    My first guess would be that it is a hardware problem.

    I do use a separate "/boot" but I do not use "btrfs" for the main partition.

    For "/boot", I am using "ext2" rather than "ext4"..
    et al:

    That was what I wanted to mention, the formatting of the /boot partition . . . for EFI I use "FAT32" as the formatting . . . so this idea of ext2 might be a similar idea.

  8. #8

    Re: Repetitive corruption of /boot partition

    Hi
    FWIW when I use legacy boot I keep the gpt format rather than dos and create an 8MB type ef01 MBR partition scheme just to utilize the partition setup and not having logical partitions and mbr protection.
    Cheers Malcolm °¿° (Linux Counter #276890)
    SUSE SLE, openSUSE Leap/Tumbleweed (x86_64) | GNOME DE
    If you find this post helpful and are logged into the web interface,
    please show your appreciation and click on the star below... Thanks!

  9. #9

    Re: Repetitive corruption of /boot partition

    Quote: Originally Posted by nrickert
    My first guess would be that it is a hardware problem.
    It would explain it, yet it is really unlikely, given that I used an identical setup with Ubuntu for two years straight without a single problem. Also, the drive is quite new. Here are my S.M.A.R.T results (I will be doing a long test tomorrow):

    Code:
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       2
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       649
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2596
    165 Total_Write/Erase_Count 0x0032   100   100   000    Old_age   Always       -       1741
    166 Min_W/E_Cycle           0x0032   100   100   ---    Old_age   Always       -       15
    167 Min_Bad_Block/Die       0x0032   100   100   ---    Old_age   Always       -       46
    168 Maximum_Erase_Cycle     0x0032   100   100   ---    Old_age   Always       -       39
    169 Total_Bad_Block         0x0032   100   100   ---    Old_age   Always       -       429
    170 Unknown_Marvell_Attr    0x0032   100   100   ---    Old_age   Always       -       2
    171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
    172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
    173 Avg_Write/Erase_Count   0x0032   100   100   000    Old_age   Always       -       15
    174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       139
    184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
    194 Temperature_Celsius     0x0022   057   059   000    Old_age   Always       -       43 (Min/Max 10/59)
    199 SATA_CRC_Error          0x0032   100   100   ---    Old_age   Always       -       0
    230 Perc_Write/Erase_Count  0x0032   100   100   000    Old_age   Always       -       1121 768 1121
    232 Perc_Avail_Resrvd_Space 0x0033   100   100   005    Pre-fail  Always       -       100
    233 Total_NAND_Writes_GiB   0x0032   100   100   ---    Old_age   Always       -       7569
    234 Perc_Write/Erase_Ct_BC  0x0032   100   100   000    Old_age   Always       -       21553
    241 Total_Writes_GiB        0x0030   100   100   000    Old_age   Offline      -       6403
    242 Total_Reads_GiB         0x0030   100   100   000    Old_age   Offline      -       6203
    244 Thermal_Throttle        0x0032   000   100   ---    Old_age   Always       -       0
    The official SanDisk tool on Windows claims that the drive is at 97% health (0% would mean that the maximum number of read/write cycles has been used).
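    For comparing the table above with the long test later, the raw values can be pulled out of the smartctl attribute listing with a few lines of Python. This is only a sketch: the watchlist of attribute names is an assumption chosen for illustration, a nonzero raw value is not by itself proof of a failing drive, and the sample rows are copied from the output above.

```python
# Attributes whose raw value is worth a second look when nonzero
# (an illustrative choice, not a vendor recommendation).
WATCHLIST = {
    "Reallocated_Sector_Ct",
    "Program_Fail_Count",
    "Erase_Fail_Count",
    "Unexpect_Power_Loss_Ct",
    "End-to-End_Error",
    "Reported_Uncorrect",
    "SATA_CRC_Error",
}

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       2
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       139
194 Temperature_Celsius     0x0022   057   059   000    Old_age   Always       -       43 (Min/Max 10/59)
199 SATA_CRC_Error          0x0032   100   100   ---    Old_age   Always       -       0
"""


def parse_smart_attributes(text):
    """Map attribute name -> first integer of RAW_VALUE for each table row."""
    rows = {}
    for line in text.splitlines():
        fields = line.split(None, 9)          # the 10th field keeps the whole raw value
        if len(fields) < 10 or not fields[0].isdigit():
            continue                          # skip headers and unrelated lines
        first = fields[9].split()[0]
        if first.isdigit():
            rows[fields[1]] = int(first)
    return rows


def nonzero_watched(rows):
    """Return only the watched attributes that have a nonzero raw value."""
    return {name: raw for name, raw in rows.items() if name in WATCHLIST and raw > 0}


if __name__ == "__main__":
    print(nonzero_watched(parse_smart_attributes(SAMPLE)))
```

    On the sample rows this picks out Reallocated_Sector_Ct and Unexpect_Power_Loss_Ct, the two watched values in the table that are not zero.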

    Quote: Originally Posted by nrickert
    Tumbleweed does often update the kernel and regenerate the "initrd". These are written to "/boot". And some Tumbleweed updates will reinstall grub2 (also goes to "/boot" as well as the MBR). However, a clean shutdown should still leave a clean "/boot".
    Yes, and something must be going wrong there. I thought that the first time it was a "perfect storm event", where a system crash might have happened exactly while it was doing this write to /boot. But as you said, a clean shutdown should never leave it in that state. Something else must be the cause.

  10. #10

    Re: Repetitive corruption of /boot partition

    Quote: Originally Posted by rfactormo
    Every partition contains a Master Boot Record at the very beginning which is a potential space for a bootloader's essentials. Additionally, there is an MBR of the whole device (/dev/sda), not linked to any one partition.
    "Master" refers to the first DISK sector. Boot Records elsewhere on a disk are Extended Boot Records, or EBRs, at the start of partitions, not disks, or are partition boot records, PBRs. https://en.wikipedia.org/wiki/Extended_boot_record

    All my MBR partitioned disks that have installed GNU/Linux operating systems have a partition with Grub Legacy installed either on sda1 or sda3 or a logical partition near the front of the disk. This started sometime early in this century, when I got serious about Linux instead of sticking to DOS and OS/2. Previously, I was booting exclusively using IBM Boot Manager, since its advent with OS/2 v2.0 about a decade earlier. None of these disks have any version of Grub on an MBR. Instead they use the same DOS/OS2/Windows-compatible code I used prior to Grub discovery. It is this code that depends on a boot flag on a primary partition, unlike Grub, which ignores MBR/EBR boot flags. IOW, if the only bootloader(s) on disk is/are Grub, then boot flags on it anywhere have no relevance on that disk, though it's possible they could in a system with (an)other disk(s).
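    To see which boot flags and extended partitions a disk actually carries, the first sector can be decoded directly. A minimal sketch, assuming the classic 512-byte MBR layout (partition table at byte 446, four 16-byte entries, 0x55 0xAA signature in the last two bytes); the function name and the returned field names are hypothetical:

```python
import struct

PARTITION_TABLE_OFFSET = 446
ENTRY_SIZE = 16
EXTENDED_TYPES = {0x05, 0x0F, 0x85}   # CHS extended, LBA extended, Linux extended


def parse_mbr(sector):
    """Decode the four primary partition entries of a 512-byte MBR sector."""
    if len(sector) < 512 or bytes(sector[510:512]) != b"\x55\xaa":
        raise ValueError("missing MBR boot signature")
    entries = []
    for i in range(4):
        start = PARTITION_TABLE_OFFSET + ENTRY_SIZE * i
        raw = bytes(sector[start:start + ENTRY_SIZE])
        status, ptype = raw[0], raw[4]
        lba_start, num_sectors = struct.unpack_from("<II", raw, 8)
        if ptype == 0x00:
            continue                          # unused slot
        entries.append({
            "index": i + 1,
            "bootable": status == 0x80,       # the boot flag that GRUB ignores
            "type": ptype,
            "extended": ptype in EXTENDED_TYPES,
            "lba_start": lba_start,
            "sectors": num_sectors,
        })
    return entries


def read_first_sector(device):
    """Read sector 0 of a whole disk such as /dev/sda (needs read access)."""
    with open(device, "rb") as f:
        return f.read(512)
```

    An entry flagged "extended" only points at the first EBR; the logical partitions inside it are described by the chain of EBRs, not by this table.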

    I don't often mount a primary hosting Grub to /boot. I manage it manually. That way it's never subject to damage except by me. In configurations when I do have a separate partition mounted to boot, it's virtually always EXT2. IIRC, I did create one somewhat recently with EXT4, but without journaling, which I deem as a practical matter pointless, given the infrequency it gets written to. I've never had one corrupted that automatic fsck during init didn't fix, if ever any corruption at all.

    All that said, I consider my experiences with SSDs disheartening. They're faster than HDDs for sure, but in the barely four years I've had them, I've had 5 out of 20 need RMA in less than a year's actual use, 2 ADATAs of different model lines, a Biwin, a Mushkin and a PNY, so don't think it can't happen to yours.
    Reg. Linux User 211409 *** multibooting since 1992
    Primary: 15.4, TW, 15.1 & 13.1 on Haswell @earthlink.net
    Secondary: eComStation (OS/2) &15.4 on i965P/Radeon
    Tertiary: Debian, Fedora, Mageia, more on Rocket Lake & older Intel, AMD, NVidia....
