Repair of superblock failing: What next?

spaceboy909 · March 10, 2011, 11:01pm

Right after installing, I began having system crashes while trying to do updates. After a few in a row, I’m now getting the "bad magic number, bad superblock" error.

I have spent quite a bit of time reading guides and threads on this matter (including this one: Swap:superblock - openSUSE), and after trying the suggested solutions it looks like I’m facing a reinstall, but I hope you guys can give me some other options. I just don’t understand why the superblock restore isn’t working.

Here’s what I’ve tried thus far (sda6 is root, sda2 is boot):

fdisk -l : to confirm partition labeling
dumpe2fs /dev/sda6 : to verify backup superblock locations
e2fsck -b 32768 /dev/sda6 : restore superblock
e2fsck -c -f -v /dev/sda6 : repair file system
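
For reference, the "bad magic number" message refers to a two-byte signature inside the ext2/3/4 superblock. Below is a minimal read-only sketch, assuming the standard layout (primary superblock 1024 bytes into the partition, magic value 0xEF53 stored little-endian at offset 56 within it). The function name `has_ext_magic` is made up for illustration; it is not part of e2fsprogs.

```python
import struct

EXT_MAGIC = 0xEF53          # ext2/3/4 superblock signature
SUPERBLOCK_OFFSET = 1024    # primary superblock starts 1024 bytes into the partition
MAGIC_OFFSET = 56           # position of the magic field inside the superblock


def has_ext_magic(path, superblock_offset=SUPERBLOCK_OFFSET):
    """Return True if the superblock at superblock_offset carries the ext magic.

    `path` can be a partition such as /dev/sda6 (needs read permission)
    or an image file. The file is opened read-only; nothing is written.
    """
    with open(path, "rb") as dev:
        dev.seek(superblock_offset + MAGIC_OFFSET)
        raw = dev.read(2)
    if len(raw) < 2:
        return False  # file too short to contain a superblock here
    (magic,) = struct.unpack("<H", raw)  # little-endian 16-bit value
    return magic == EXT_MAGIC
```

If this returns True for the partition when run from another distro while the boot still complains about the magic number, that would hint that the boot-time check is reading something other than this superblock, though that is only a guess.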

When going through this process, it first reports that I have multiple checksum errors (maybe more, can’t recall off the top of my head).

I then choose to fix the errors. It states that the errors have been repaired, and the final report shows zero bad blocks.

Then I reboot, but I still get the same ‘bad magic number in superblock’ error. I have tried backup block 163840 and 98304 with the same results.
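
The backup block numbers tried above (32768, 98304, 163840) fit the usual pattern for a filesystem with 4 KiB blocks and 32768 blocks per group, where the sparse_super feature keeps backups in group 1 and in groups that are powers of 3, 5 and 7. A small sketch of that rule follows; the defaults are assumptions, and the dumpe2fs output for the actual partition remains the authoritative list.

```python
def backup_superblock_blocks(group_count, blocks_per_group=32768, first_data_block=0):
    """List the block numbers that hold backup superblocks under sparse_super.

    Backups live at the start of group 1 and of every group whose number is
    a power of 3, 5 or 7. The defaults assume 4 KiB blocks; for 1 KiB blocks
    the typical values are blocks_per_group=8192 and first_data_block=1.
    """
    groups = {1}
    for base in (3, 5, 7):
        g = base
        while g < group_count:
            groups.add(g)
            g *= base
    return [first_data_block + g * blocks_per_group
            for g in sorted(groups) if g < group_count]


# Example: a filesystem with 10 block groups
print(backup_superblock_blocks(10))
# [32768, 98304, 163840, 229376, 294912]
```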

Everywhere I’ve looked, people trying this repair are successful, but I’m not! So, what do you think has gone wrong that is preventing the superblock repair, and what can I do at this point?

It’s a brand new install, so it won’t kill me to reinstall it, but I am trying to learn Linux as I run into problems like this, so if I can fix it, then I’d like to try. Thanks for any help!

Oh I should also mention that the boot message also mentions attempting to restore from the swap partition…two messages about that, but I don’t think it showed any errors, and there wasn’t anything listed that mentioned a connection between the two. My guess is that those wouldn’t be related at all, but…

I also have multiple distros on disk, ftr. I’m experienced with partitioning so I’m 100% certain I didn’t overwrite anything, plus, I can access the data on the partition from my other distros, so it’s apparently just a boot problem.

ken_yap · March 10, 2011, 11:05pm

Have you run smartctl to ask the disk if it has any unrecoverable bad blocks? You may have a dying disk.

spaceboy909 · March 10, 2011, 11:07pm

I forgot to add that not only does the superblock repair fail, but when I run through the process again, it seems to report the same errors as before, as if nothing was changed! But the repair message indicates that a repair was completed. I don’t get it. It says it’s fixed, but then apparently nothing has changed, each time.

spaceboy909 · March 10, 2011, 11:08pm

No, but I’ll check it out now. Thanks.

jschellhaass · March 10, 2011, 11:48pm

What’s the file system: ext2, ext3, or ext4? What’s specified in fstab? Is the root device correct in grub?

jeff

spaceboy909 · March 10, 2011, 11:56pm

Fstab appears normal to me; here it is (ext4):

/dev/disk/by-id/ata-WDC_WD1500HLFS-01G6U0_WD-WXL408028698-part3 swap                 swap       defaults              0 0
/dev/disk/by-id/ata-WDC_WD1500HLFS-01G6U0_WD-WXL408028698-part6 /                    ext4       acl,user_xattr        1 1
/dev/disk/by-id/ata-WDC_WD1500HLFS-01G6U0_WD-WXL408028698-part2 /boot                ext4       acl,user_xattr        1 2
proc                 /proc                proc       defaults              0 0
sysfs                /sys                 sysfs      noauto                0 0
debugfs              /sys/kernel/debug    debugfs    noauto                0 0
usbfs                /proc/bus/usb        usbfs      noauto                0 0
devpts               /dev/pts             devpts     mode=0620,gid=5       0 0

And my grub menu looks ok:

# Linux bootable partition config begins
  title OpenSuse Linux (on /dev/sda2,6)
  root (hd0,1)
  kernel /vmlinuz root=/dev/sda6 ro vga=normal resume=/dev/disk/by-id/ata-WDC_WD1500HLFS-01G6U0_WD-WXL408028698-part3 splash=silent quiet showopts vga=0x345
  initrd /initrd
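
As a quick mechanical check of the kernel line above, here is a small sketch that splits it into parameters and reports the root device and any parameter given more than once. The helper names are made up for illustration. On this entry it shows root=/dev/sda6 (partition 6, like the by-id entry in fstab) and that vga appears twice.

```python
def parse_kernel_params(kernel_line):
    """Split a GRUB legacy 'kernel' line into the image path and its parameters.

    Returns (image, params), where params maps each key to the list of values
    it was given (an empty string for bare flags such as 'ro').
    """
    tokens = kernel_line.split()
    if tokens and tokens[0] == "kernel":
        tokens = tokens[1:]
    image, params = tokens[0], {}
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        params.setdefault(key, []).append(value)
    return image, params


def repeated_params(params):
    """Return only the parameters that occur more than once."""
    return {key: values for key, values in params.items() if len(values) > 1}


line = ("kernel /vmlinuz root=/dev/sda6 ro vga=normal "
        "resume=/dev/disk/by-id/ata-WDC_WD1500HLFS-01G6U0_WD-WXL408028698-part3 "
        "splash=silent quiet showopts vga=0x345")
image, params = parse_kernel_params(line)
print(params["root"])            # ['/dev/sda6']
print(repeated_params(params))   # {'vga': ['normal', '0x345']}
```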

I’m off to work…I’ll tackle the smartctl test tonight.

I suspect that it is not a bad drive, as I’ve been running Ubuntu on the partition for 3 years or so, and other distros and data on the same drive, and haven’t had any significant trouble that I know of. Of course, I know they can go bad at any time.

If it is going bad, I’ve heard of utilities that can actually physically repair the drive by ‘exercising it’.

ken_yap · March 11, 2011, 12:34am

Heh, I think that was back in the days of ST506 interfaces with stiction and such problems.

These days when a block goes bad, the disk has enough intelligence to map in a spare block. After a while it runs out of spare blocks and the errors can no longer be concealed.

DenverD · March 11, 2011, 1:58pm

On 03/10/2011 11:36 PM, spaceboy909 wrote:
>
> It says it’s fixed, but then apparently nothing has changed, each time.

i had a superblock problem like that about four or five years
ago…when all was said and done, my solution was to buy a new disk!

my advice: for now, STOP trying to fix it and BACKUP your data while you can…

and THEN try to fix it…

–
DenverD
CAVEAT: http://is.gd/bpoMD
[NNTP posted w/openSUSE 11.3, KDE4.5.5, Thunderbird3.0.11, nVidia
173.14.28 3D, Athlon 64 3000+]
“It is far easier to read, understand and follow the instructions than
to undo the problems caused by not.” DD 23 Jan 11

spaceboy909 · March 11, 2011, 11:14pm

Advice acknowledged. I’ve had a couple of drives go bad on me in the past. I think anything that I have of consequence is already backed up…don’t really have anything I’d call critical.

Well, smartctl is acting strange. Is this because my drive is croaking, or because something is wrong with smartctl?

First I installed Gsmartctl (in Debian). The basic health check reports: PASSED. I have attempted the short self test a few times. Each time, it reaches 90% complete in about 2 minutes, and then just sits at 90% for another 30 minutes, with a continual pop up that says, “Running ‘smartctl’…”, over and over about every 10-15 seconds or so.

That can’t be the ‘short’ test, since it says it should take approx. 2 minutes to complete. When I abort the test and check the log, it shows basically the same error every time, I’m guessing due to my abort of the test:

Error 133 occurred at disk power-on lifetime: 20053 hours (835 days + 13 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

After command completion occurred, registers were:
ER ST SC SN CL CH DH

40 51 00 90 d5 82 40

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

60 08 18 70 4d 83 08 08 18:34:38.550 READ FPDMA QUEUED
b0 d0 01 00 4f c2 00 08 18:34:38.261 SMART READ DATA
60 40 18 50 d5 82 08 08 18:34:38.185 READ FPDMA QUEUED
61 08 00 78 6f e6 07 08 18:34:38.184 WRITE FPDMA QUEUED
60 20 10 00 59 27 08 08 18:34:38.184 READ FPDMA QUEUED

This last part of the log (registers and commands) varies with each error, but I’m betting this is all because of my abort…I hope.
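
The register dump above can be decoded by hand. This is a rough sketch, assuming the classic ATA bit assignments for the error (ER) and status (ST) registers and a 28-bit LBA spread over SN, CL, CH and the low nibble of DH; the helper names are made up for illustration. Under that reading, ER=40 is the UNC (uncorrectable data) bit rather than ABRT (04), so this entry may not be just a side effect of cancelling the test.

```python
# Bit names for the ATA error and status registers (only the commonly used ones).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF",
               0x10: "DSC", 0x08: "DRQ", 0x01: "ERR"}


def decode_bits(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if value & bit]


def lba28(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from the sector/cylinder/device-head registers."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Values from the log line: ER ST SC SN CL CH DH = 40 51 00 90 d5 82 40
print(decode_bits(0x40, ERROR_BITS))    # ['UNC']
print(decode_bits(0x51, STATUS_BITS))   # ['DRDY', 'DSC', 'ERR']
print(lba28(0x90, 0xD5, 0x82, 0x40))    # 8574352
```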

I then tried the command from the CLI. smartctl -t short /dev/sda

This command appears to ‘complete’ instantly at the CLI: it prints a message and then immediately gives me a prompt. I checked the log with “-l”, and apparently it continues to run the test in the background with no indicator that it is running (god I love that…). (ETA) It does give a future completion time, but I much prefer an active indicator of some kind.

So, the log reports: # 1 Short offline Completed without error 00% 29910 -
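
One cross-check on the numbers quoted so far: the error entry is stamped at 20053 power-on hours, while this self-test entry shows 29910. A tiny sketch of the arithmetic, assuming both figures are the drive's lifetime power-on hours as the log labels suggest (the function names are made up for illustration). If that assumption holds, the logged error is well over a year of running time older than this test.

```python
def hours_to_days(hours):
    """Convert power-on hours to (days, remaining hours)."""
    return divmod(hours, 24)


def hours_between(error_hours, test_hours):
    """Power-on hours elapsed between a logged error and a later self-test."""
    return test_hours - error_hours


print(hours_to_days(20053))                         # (835, 13), matching the log
gap = hours_between(20053, 29910)
print(gap, hours_to_days(gap))                      # 9857 (410, 17)
```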

I’ll try the long test later tonight. Maybe I’ve got a bad disk, but…I don’t know, I have a feeling my disk is fine, and this is just some other zany problem giving me trouble.

If it does pass the long test, should I just reinstall, or is there actually some other repair that can be attempted?

DenverD · March 12, 2011, 9:20am

On 03/11/2011 11:36 PM, spaceboy909 wrote:

> If it does pass the long test, should I just reinstall, or is there
> actually some other repair that can be attempted?

i’m not a S.M.A.R.T. <http://en.wikipedia.org/wiki/S.M.A.R.T.>
guru…not by any stretch of anyone’s imagination…

so, i can’t read your output of Gsmartctl and make any worthwhile
suggestions based on what i see, or on the trouble you had…

i could suggest that you read a bunch on Gsmartctl/smartctl, and ask
some questions of its makers and maintainers (like your: Well,
smartctl is acting strange. Is this because my drive is croaking, or
because something is wrong with smartctl?)

i mean, there must be S.M.A.R.T. docs somewhere that can answer all
of your questions on what the output means, and whether or not a
croaking drive can cause smartctl to act strange…but, i don’t know
where those docs are, and even if i did i wouldn’t go and read them
for you and give you a short, easy, no-reading required answer…

i can give you (again) the benefit of my experience: “i had a
superblock problem like that about four or five years
ago…when all was said and done, my solution was to buy a new disk!”

and, i guess i should add: i bought the new disk after the disk with
non-repairable superblocks stopped working completely…as i recall
now it just wouldn’t turn anymore…

you might ask yourself these questions:

1. how old is this drive?
2. how long is the manufacturer’s warranty period for this drive?
3. is the drive still in warranty?
   a. if yes, contact the manufacturer and tell them what you see, and
      ask their advice/solution recommendation
   b. if not in warranty, continue trying to fix it, or replace it–it
      is, after all, your time you are using…


spaceboy909 · March 12, 2011, 11:54am

Well, so far so good…it passed the long test. I think I’ll reinstall. If I don’t post back, then assume that it’s going ok!

ken_yap · March 12, 2011, 12:04pm

Or you got kidnapped by aliens. lol!

DenverD · March 12, 2011, 2:59pm

On 03/12/2011 12:06 PM, ken yap wrote:
>
> spaceboy909;2302525 Wrote:
>> If I don’t post back, then assume that it’s going ok!
>
> Or you got kidnapped by aliens. lol!

watch out for the shiny, non-miniature probes!
