Bootloader inexplicably broke (while the computer was off?)

Two months ago I installed openSUSE 12.2 on my grandmother’s computer, which is an older 32-bit machine (ext4 for root). I restarted it several times to make sure everything was working properly before heading back home. Still, a week later she called me and mentioned the computer wouldn’t boot any more, but since she barely knows how to use email and browse the web, we couldn’t debug anything while I was away. Today I came back to my grandparents’ house and looked into the issue. Although I managed to fix it by repairing / upgrading with the DVD, I’m baffled by what actually happened, and thought I’d report my experience here.

What seems to have broken was the MBR and/or grub2. As I said, I checked that the computer booted and logged in right before leaving, and I clearly remember that it was shut down without errors the last time (I also hadn’t logged in as root during the last session). During those weeks I suspected the hard drive had broken and that I might need to buy a new one. But after repairing the system, I did a check with fsck and no errors were found (SMART should be enabled in BIOS too). The command I used was:

fsck.ext4 -c /dev/sda3

When I first turned the computer on, the error I got at boot was “No MBR found”. I booted the openSUSE DVD into the Rescue console, and after trying many commands I managed to reinstall grub2 using the grub2-install command. After restarting, it got further, but grub got stuck with a message about an ELF header which did not have the expected size. I had to use the Upgrade option in the YaST installer to repair the system… but even then I got an error window when it came to installing the bootloader. Thankfully grub2 did work this time, but I had to tweak it to get it working normally once I could log in to the system (themed mode was not enabled, and a lot of default boot options were missing).
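
For reference, a typical way to reinstall GRUB 2 from a rescue console is sketched below. This is not necessarily the sequence that was used above. It assumes the root filesystem (including /boot) is on /dev/sda3, as in the fsck command, that the disk is /dev/sda, and that /mnt is free as a mount point. The function name reinstall_grub2 is made up for illustration.

```shell
# Hypothetical helper: reinstall GRUB 2 into the MBR from a rescue system.
# Assumes /boot lives on the root partition and /mnt is an empty mount point.
reinstall_grub2() {
    # $1 = root partition (e.g. /dev/sda3), $2 = disk whose MBR gets GRUB 2
    mount "$1" /mnt || return 1

    # The installed system needs the rescue system's device and kernel views.
    for fs in dev proc sys; do
        mount --bind "/$fs" "/mnt/$fs" || return 1
    done

    # Write the boot code, then regenerate the menu configuration.
    chroot /mnt grub2-install "$2" || return 1
    chroot /mnt grub2-mkconfig -o /boot/grub2/grub.cfg
}

# Usage, as root in the rescue system:
#   reinstall_grub2 /dev/sda3 /dev/sda
```

The second chroot step rebuilds grub.cfg from /etc/default/grub and the scripts in /etc/grub.d, so the menu entries match the kernels that are actually installed.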

My conclusion was that the bootloader somehow broke while the computer was off, although the hard drive itself has no physical problems. That doesn’t make any sense, since there was virtually nothing that could break the bootloader this badly except a drive failure. Can anyone explain how this was possible and what can be done in the future to avoid such phenomena? Note that the machine has been repaired, so I can no longer do any debugging, and probably not much investigation beyond what I remember and mentioned.

On Mon, 04 Mar 2013 21:46:01 +0000, MirceaKitsune wrote:

> My conclusion was that the bootloader somehow broke while the computer
> was off

That’s not really possible, without physical damage to the hardware.

What’s most likely is that something happened prior to shutdown or at
shutdown the last time the system was powered down that caused a problem
with the bootloader. You wouldn’t see the results of something like that
until you tried to actually start the machine, since the MBR is only used
by the system when booting.

Jim

Jim Henderson
openSUSE Forums Administrator
Forum Use Terms & Conditions at http://tinyurl.com/openSUSE-T-C

The only things I’m pretty sure about are that before the last shutdown, I neither logged in as root nor modified any software packages (let alone touched the partitioning or boot loader options). If that can happen during a normal shutdown, it probably means there is a major flaw somewhere.

My first question is why the OS would automatically attempt to touch the boot loader at all when shutting down. I’m pretty sure that’s only changed when you run a grub command to modify the settings or when a kernel update is done. My second question is, does this mean a machine risks breaking any time you shut it down? And is there a way to prevent such problems and random changes to the bootloader?
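
One precaution that does not prevent the damage but can shorten the repair is keeping a copy of the disk's first sector. A minimal sketch, assuming a BIOS/MBR disk such as /dev/sda; the function names and the backup path are made up for illustration. It only covers the first 512 bytes: GRUB 2 also keeps code in the sectors after the MBR and files under /boot/grub2, so this is no substitute for grub2-install.

```shell
# Hypothetical helper: copy the first 512-byte sector (boot code plus
# partition table) of a disk into a backup file.
backup_mbr() {
    # $1 = disk device (e.g. /dev/sda), $2 = backup file
    dd if="$1" of="$2" bs=512 count=1
}

# Hypothetical helper: write back only the first 446 bytes (the boot code),
# leaving the partition table in the rest of the sector untouched.
restore_mbr_bootcode() {
    # $1 = backup file, $2 = disk device
    dd if="$1" of="$2" bs=446 count=1 conv=notrunc
}

# Usage, as root:
#   backup_mbr /dev/sda /root/sda-mbr.bin
#   restore_mbr_bootcode /root/sda-mbr.bin /dev/sda
```

Restoring only 446 bytes is a deliberate choice: writing all 512 bytes would also overwrite the partition table with whatever it was at backup time.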

On Mon, 04 Mar 2013 22:16:02 +0000, MirceaKitsune wrote:

> Only things I’m pretty sure about are that before the last shutdown, I
> didn’t either log in as root or modify any software packages (let alone
> touch the partitioning or boot loader options). If that can happen
> during a normal shutdown, it probably means there might be a major flaw
> somewhere.
>
> My first curiosity is why would the OS automatically attempt to touch
> boot loader at all when shut down? Pretty sure that’s only changed when
> you run a grub command to modify the settings or a kernel update is
> done. Second question is, does this mean a machine can risk breaking any
> time when you shut it down? And is there a way to prevent such problems
> and random changes to the bootloader?

If you’re installed on the MBR, there are some BIOSes that have features
to disable modifications to the MBR.

But my point is that when the system is powered off, changes can’t be
made to the disk - making changes to the disk /requires/ power to the
drive (unless, as I said, there’s physical damage, and to affect only the
MBR would require a fair amount of expertise and/or an almost
unfathomable amount of luck to damage just that and no other part of the
disk).

So the conclusion to reach is that something changed while the system was
running. The question is what - and no, unless you updated GRUB there’d
be no need to touch that part of the disk.

Jim


On 2013-03-04 22:46, MirceaKitsune wrote:
> During those weeks I
> suspected the hard drive broke and I might need to buy a new one. But
> after repairing it, I did a check with fsck and no errors were found
> (SMART should be enabled in BIOS too). The command I used was:

No.

You have to run the long test of smarctl.


Cheers / Saludos,

Carlos E. R.
(from 11.4, with Evergreen, x86_64 “Celadon” (Minas Tirith))

Hi

how in the world should this happen?
When the computer is off, the hard disk isn’t spinning.
So how?

Why are you so sure about that?
It isn’t necessarily the hard disk drive itself, the contacts of the cable are a possible cause as well.

True.

But especially if the PC is a bit older, the contacts may not be that reliable anymore
(e.g. the contacts of the bus by which your hard disk is connected).

Or your motherboard may have some fracture, another possible cause of unexpected failure.

This could explain your problem in a reasonable way.

Instead of myths about a bootloader getting bust after power off.

To say it in a polite way: the latter just isn’t reasonable.

you meant smartctl.

On 2013-03-05 01:06, ratzi wrote:
>
> robin_listas;2532024 Wrote:
>> You have to run the long test of smarctl.
>
> you meant smartctl.

Yes, of course :)

Or better, when available, the testing utility published by the disk
manufacturer. Seagate has a very good one, a bootable ISO image of which
I forget the name, seatools perhaps. It can also test HDs of other
brands, without the specific tests, of course.


Cheers / Saludos,

Carlos E. R.
(from 11.4, with Evergreen, x86_64 “Celadon” (Minas Tirith))

Hi Carlos !

… perhaps ? :)

SeaTools | Seagate

But to run it under Linux only seems to be possible with severe restrictions

SeaTools Linux Enterprise-Edition | Seagate

:(

Anyway,

I heard the same :)

Best wishes
Mike

On 2013-03-05 01:46, ratzi wrote:

>> I forget the name, seatools perhaps.
>
> … perhaps ? :)
>
> ‘SeaTools | Seagate’
> (http://www.seagate.com/de/de/support/downloads/seatools/)

Yes, that one.

> But to run it under Linux only seems to be possible with severe
> restrictions

Last time I tried it, it was a bootable CD downloaded from their page.
You burn it and boot it. The system it uses is thus irrelevant. Old
versions were MsDos based, now they are FreeDos based.

Looking at the download link I see now both a MsDos link and a Windows
link, so we definitely want the MsDos link, which is in fact FreeDos now
and GPL :)

(see “more info” on the download link).

The Windows version runs under Windows, we do not want that version ;)

> ‘SeaTools Linux Enterprise-Edition | Seagate’
> (http://tinyurl.com/crynuy9)

Did not know this one. Yes, it is limited to scsi, apparently.


Cheers/Saludos
Carlos E. R. (12.3 Dartmouth test at Minas-Anor)

Thanks for all the replies. Yes, it’s not (scientifically) possible for changes to be made to a drive while powered off… but I also don’t see how it was possible for the system to make automatic changes to the boot loader during shutdown like that. Unless there was a physical problem with the drive (even a broken contact or cable as suggested) we’re kinda stuck in a paradox :P

And I’ll try using smartctl as well on that computer, to be even safer. Thanks for the suggestion.

> on my grandmother’s computer

i wonder how system updates are done?

is apper and/or package kit enabled?
can grandma authorize security patches to be installed?

if so, how–does she have the root password?

if she does then you need to rethink the whole situation…

imnsho:

-she should not have any administrative permissions whatsoever…

-she should only have her own password and therefore only be able
to accidentally trash /home/[her]/ and nothing more

-you should be her System Administrator and do (for example) “zypper
patch” and all other admin duties via ssh or sitting at her machine,
say once a month…

i say all of this from experience as i set up a granny aged lady with
openSUSE five or six years back and she LOVED it but didn’t even know
there was such a thing as a root password–except i told her if she
was ever asked for it she didn’t need to do whatever it was she had
just tried, and STOP trying…

she would giggle while telling that at the tea the other day with her
old lady friends they were all complaining that their computers were
messed up again and the grandson had to come and clean out the
viruses again–even though he had installed an expensive antivirus
firewall/etc…

she got GREAT enjoyment out of telling them all that her Linux was
working just fine and she had no antivirus program at all, ever!

ymmv


dd
http://tinyurl.com/DD-Caveat

@dd: Only I have the root password and even know how to use it, she only accesses her normal user account. As for updates, I initially set Apper to “Automatically apply all updates”. After this incident however I disabled that option, and will probably update software packages manually when I come to visit (or through remote desktop). And yeah, she’s ok with it as long as web, email and messenger work… I installed openSUSE on her machine since I maintain it and it’s now easier for me, and also because it’s faster and more modern (Windows XP was the last Windows OS her computer was able to install).

Anyway, I did a smartctl on the drive. It has a longer output which I don’t understand entirely, here it is:

linux-q150:/home/bunica # smartctl -a /dev/sda
smartctl 6.0 2012-10-10 r3643 [i686-linux-3.4.28-2.20-default] (SUSE RPM)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate U9
Device Model:     ST380012ACE
Serial Number:    5JVSH1JY
Firmware Version: 9.01
User Capacity:    80,026,361,856 bytes [80.0 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-6 T13/1410D revision 2
Local Time is:    Tue Mar  5 13:06:52 2013 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (15556) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  58) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   075   063   006    Pre-fail  Always       -       237551366
  3 Spin_Up_Time            0x0003   098   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       64311992
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7087
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       815
194 Temperature_Celsius     0x0022   023   056   000    Old_age   Always       -       23
195 Hardware_ECC_Recovered  0x001a   075   063   000    Old_age   Always       -       237551366
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   186   000    Old_age   Always       -       767
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 770 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 770 occurred at disk power-on lifetime: 7080 hours (295 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 00 00 00 e0 00      00:04:25.766  READ DMA
  27 00 00 00 00 00 e0 00      00:04:25.665  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      00:04:24.698  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 02      00:04:24.698  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:04:24.690  READ NATIVE MAX ADDRESS EXT

Error 769 occurred at disk power-on lifetime: 7080 hours (295 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 00 00 00 e0 00      00:04:23.490  READ DMA
  27 00 00 00 00 00 e0 00      00:04:22.535  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      00:04:24.698  IDENTIFY DEVICE
  ef 03 44 00 00 00 a0 02      00:04:24.698  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:04:24.690  READ NATIVE MAX ADDRESS EXT

Error 768 occurred at disk power-on lifetime: 7080 hours (295 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 00 00 00 e0 00      00:04:23.490  READ DMA
  27 00 00 00 00 00 e0 00      00:04:22.535  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      00:04:22.534  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 02      00:04:22.526  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:04:22.517  READ NATIVE MAX ADDRESS EXT

Error 767 occurred at disk power-on lifetime: 7080 hours (295 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 00 00 00 e0 00      00:03:44.652  READ DMA
  27 00 00 00 00 00 e0 00      00:04:22.535  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      00:04:22.534  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 02      00:04:22.526  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:04:22.517  READ NATIVE MAX ADDRESS EXT

Error 766 occurred at disk power-on lifetime: 7080 hours (295 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 00 00 00 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 01 00 00 00 e0 00      00:03:44.652  READ DMA EXT
  c4 ff 01 f4 0a ee e2 00      00:03:44.652  READ MULTIPLE
  c4 ff 01 00 00 00 e0 00      00:03:44.515  READ MULTIPLE
  c4 ff 01 00 00 00 e0 00      00:03:44.515  READ MULTIPLE
  f5 03 00 01 10 00 a0 00      00:03:44.515  SECURITY FREEZE LOCK

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

linux-q150:/home/bunica # 

On 2013-03-05 13:46, MirceaKitsune wrote:
> As for updates, I initially set Apper
> to “Automatically apply all updates”. After this incident however I
> disabled that option, and will probably update software packages

If apper automatically installed a kernel update, and that had a failure
related to grub, which has to be modified for the new kernel, that might
be your explanation.


Cheers / Saludos,

Carlos E. R.
(from 11.4, with Evergreen, x86_64 “Celadon” (Minas Tirith))
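
One way to check this theory after the fact is the zypp history log, which on openSUSE is normally kept in /var/log/zypp/history and records package installations with a timestamp. A minimal sketch, assuming the usual pipe-separated layout (date|action|package|version|...); the function name is made up for illustration.

```shell
# Hypothetical helper: list kernel and GRUB 2 package installs recorded in a
# zypp history file. Fields are assumed to be separated by "|":
#   date|action|package|version|...
bootloader_related_installs() {
    # $1 = history file, defaulting to the usual location
    awk -F'|' '$2 == "install" && $3 ~ /^(kernel-|grub2)/ { print $1, $3, $4 }' \
        "${1:-/var/log/zypp/history}"
}

# Usage, as root:
#   bootloader_related_installs
```

If the output shows a kernel or grub2 install dated shortly before the last successful shutdown, an automatic update becomes a plausible explanation; an empty result for that period points elsewhere.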

On 03/05/2013 01:46 PM, MirceaKitsune wrote:
>
> I initially set Apper
> to “Automatically apply all updates”. After this incident however I
> disabled that option, and will probably update software packages
> manually when I come to visit (or through remote desktop).

imo: right answer!

> Anyway, I did a smartctl on the drive. It has a longer output which I
> don’t understand entirely, here it is:

i have spent less than 15 minutes in my entire life reading and
learning about how to interpret SMART reports–so i really have no
basis for what i’m about to write, so this is just a pure guess:

it seems to me you should go ahead and backup all data that granny
might want to keep, because that drive is about to catch the flu!


dd
http://tinyurl.com/DD-Caveat

On 2013-03-05 16:41, dd wrote:
>
> it seems to me you should go ahead and backup all data that granny might
> want to keep, because that drive is about to catch the flu!

No, the disk seems fine. Yes, there was some kind of error, but of a
type that only a firmware expert can interpret.

These parameters are very important:

Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable.
When not zero it means that there are surface errors, probably growing.
If they grow, danger is critical.

And they are 0 in that report, so fine :)
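
That check can be automated against the attribute table printed by smartctl -A. A minimal sketch, assuming the ten-column layout shown in the report above (attribute name in column 2, raw value in column 10); the function name is made up for illustration. The report also lists UDMA_CRC_Error_Count (raw value 767), which counts CRC errors on the transfer between drive and controller rather than surface defects; it is left out here because only the three attributes above are named.

```shell
# Hypothetical helper: read a SMART attribute table on stdin and warn when
# any of the three surface-related attributes has a non-zero raw value.
# Exit status: 0 = all zero, 1 = at least one non-zero, 2 = none found.
check_surface_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            seen++
            if ($10 + 0 != 0) {
                bad++
                print "WARNING: " $2 " raw value is " $10
            }
        }
        END {
            if (seen == 0) { print "no matching attributes found"; exit 2 }
            if (bad > 0) exit 1
            print "all " seen " surface attributes are zero"
        }
    '
}

# Usage, as root:
#   smartctl -A /dev/sda | check_surface_attrs
```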

However, this:

>
>   SMART Self-test log structure revision number 1
>   Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
>   # 1  Short offline       Completed without error       00%         1

Means that the test has not been run at all! There is one short test
when the disk was one hour old, run probably by the manufacturer.
MirceaKitsune has not run any test at all!

OP, you have to run the short and the long tests ASAP. Then we can talk
about that disk health.

Cheers / Saludos,

Carlos E. R.
(from 11.4, with Evergreen, x86_64 “Celadon” (Minas Tirith))

On Tue, 05 Mar 2013 10:26:02 +0000, MirceaKitsune wrote:

> but I also don’t see how it was possible for the system to make
> automatic changes to the boot loader during shutdown like that

It wasn’t necessarily during shutdown, but that is a time when there’s a
fair amount of stuff running as root (as a one-shot to shut things down,
as opposed to running as a daemon).

Jim

I’m very sure that especially during the last session, no automated Kernel update was done (and no software package updates at all most likely). But yeah… in order to be safe I disabled automatic updates from now on.

And what’s the command to run the long test? Never used the disk checking utilities in Linux before (also since I recently moved from Windows and never had to till now).

On 2013-03-05 22:16, MirceaKitsune wrote:

> And what’s the command to run the long test? Never used the disk
> checking utilities in Linux before (also since I recently moved from
> Windows and never had to till now).

As the disk is a Seagate, you can use seatools, a downloadable CD from
Seagate that boots running FreeDos and does the checking. I was talking
about this same tool yesterday in this same thread with ratzi, so
please, read your thread ;)

As to the Linux tool, it is smartctl and has a nice man page with
examples ;)

(the test in both methods is actually the same. Or to be exact, it is
the same as what seatools calls internal mode test, or firmware, or
something similar).


Cheers / Saludos,

Carlos E. R.
(from 11.4, with Evergreen, x86_64 “Celadon” (Minas Tirith))
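
For completeness, the smartctl invocations for the tests mentioned in this thread look roughly like this. The wrapper function names are made up for illustration, and /dev/sda is the device from the report above. The extended test runs inside the drive in the background; the report above gives a recommended polling time of 58 minutes for it.

```shell
# Hypothetical wrapper: start a SMART self-test on a drive.
start_selftest() {
    # $1 = device, $2 = test type: "short" or "long" (default: long)
    smartctl -t "${2:-long}" "$1"
}

# Hypothetical wrapper: print the drive's self-test log, where finished
# tests appear with their status and the LBA of the first error, if any.
show_selftest_log() {
    smartctl -l selftest "$1"
}

# Usage, as root:
#   start_selftest /dev/sda short
#   start_selftest /dev/sda long
#   show_selftest_log /dev/sda     # once the test has had time to finish
```

A new entry in the self-test log with a LifeTime(hours) value close to the current Power_On_Hours shows that the test really ran on this machine.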