Failed hard drive superblock

Hi all,

I built my computer back in April while on an internship in Utah. It ran fine until shortly after I moved back home to Iowa. During the last two weeks I’ve noticed random beep codes from my motherboard (the patterns aren’t in my mobo’s user’s guide, and they didn’t start until after the move). I’ve also had two hard drives become unusable.

The hard drive problem begins to manifest itself when the drive suddenly becomes read-only. This is after the computer has been running for a while and after a clean boot. Initially, rebooting the computer will fix the situation. Eventually, though, fsck fails during boot and I can only log in as root in a recovery mode. Commenting out the hard drive from fstab will allow a “normal” boot and then I can log in graphically as root.

I ran badblocks, which reported no bad blocks, but running fsck in recovery mode told me that my superblock was bad. Running fsck again with the suggested alternate superblock returned the exact same error message.
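For anyone following along, here is a rough sketch of how the backup superblocks are usually located and tried. It assumes an ext2/ext3 filesystem and uses /dev/sdb1 as a placeholder partition name (neither is stated in the post), and `list_backup_superblocks` is just a helper name made up for this example:

```shell
# Hypothetical helper: read `dumpe2fs` output on stdin and print the block
# number of every backup superblock it lists, one per line.
list_backup_superblocks() {
    sed -n 's/^ *Backup superblock at \([0-9][0-9]*\),.*/\1/p'
}

# Usage, with /dev/sdb1 as a placeholder and the filesystem unmounted:
#   sudo dumpe2fs /dev/sdb1 | list_backup_superblocks
#   sudo e2fsck -b 32768 /dev/sdb1    # 32768 is only an example block number
```

If dumpe2fs cannot read the primary superblock at all, `mke2fs -n` (a dry run that creates nothing) prints where the backups would sit, provided it is given the same parameters that were used when the filesystem was created. Getting the identical error from every backup copy would hint that the trouble lies in the path to the disk rather than in the superblock itself.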

Any thoughts/suggestions/solutions?

Thanks!!!

System Info:
Motherboard: MSI P35 Neo2
Processor: Intel Core 2 Quad (2.4 GHz)
RAM: 4 x 1GB Corsair DDR2
Hard Drives: 4 x 250GB 3.0 Gb/s SATA Western Digital Caviar
OS: openSUSE 10.3

Seems unlikely that you would get problems with 2 drives (Good ones at that) at the same time.

I’m not experienced enough to offer serious advice, though my first stop would be to use a CD boot utility to run disk checks. I’d even go so far as to do it on a different machine, so as to rule out the mainboard.

Yes, backup your data pronto before you investigate further whether it’s the drive or mobo.

YES - Should have said it myself. BACKUP Agreed!

I copied everything from my /home/rgrandin directory to my /root directory yesterday, so everything is backed up or easily re-creatable (/dev/sdb, the problem drive, is mounted as /home; / is on /dev/sda).
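For reference, a minimal sketch of copying a directory and then checking the copy against the source, assuming the usual `cp -a` and `diff -r` are available. `backup_and_verify` is a name invented for this example, and the paths in the usage line are the ones from the post:

```shell
# Hypothetical helper: copy the contents of one directory into another
# (including dotfiles), then compare the two trees.
# Returns non-zero if the copy fails or the trees differ.
backup_and_verify() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    cp -a "$src"/. "$dest"/ || return 1
    diff -r "$src" "$dest" > /dev/null
}

# Usage (the destination directory name is only an example):
#   backup_and_verify /home/rgrandin /root/rgrandin-backup && echo "copy verified"
```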

Any suggestions for how to test the mobo?

Hi
Yup, backup… Have you run a SMART test on it? E.g.:


sudo smartctl -a /dev/sda

Note: replace sda with your drive’s device name.
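A slightly longer sketch of the same idea, in case it helps: start a short self-test, read the result, and pull out individual attributes. /dev/sdb is a placeholder, `smart_raw_value` is a helper name invented here, and attribute names can differ between drive models:

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the raw
# value (10th column) of the attribute whose name is given as $1.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Usage, with /dev/sdb as a placeholder:
#   sudo smartctl -t short /dev/sdb       # start a short self-test
#   sudo smartctl -l selftest /dev/sdb    # read the result a few minutes later
#   sudo smartctl -A /dev/sdb | smart_raw_value UDMA_CRC_Error_Count
#   sudo smartctl -A /dev/sdb | smart_raw_value Reallocated_Sector_Ct
```

A CRC error count that keeps rising tends to implicate the cable or connector rather than the disk surface, while reallocated or pending sectors point more at the drive itself.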

Have you checked the power supply voltages, maybe the 12V rail is
failing?


Cheers Malcolm °¿° (Linux Counter #276890)

some reading may be found here
MSI P35 Platinum Makes a Comeback : Re-Introduction

not sure if it’s suitable to help you

> I built my computer back in April while on internship in Utah. It was
> running fine until shortly after moving back home to Iowa.

don’t do ANYTHING until you have:

open the box…look to see what got shaken out of place during the
MOVE from Utah to Iowa!!

if something has been banging around in there you probably have BIG
problems that won’t be fixed by using the keyboard, only…

look for loose cables (data AND power) to all drives…

especially important is the cable from the hard drive to the
motherboard…REMOVE both ends and reconnect (CD/DVD? then remove all ends
and reconnect)…

and, just for fun, gently wiggle the connection points of all wires
to the motherboard…and, the ram (GENTLY)…

and, i wonder about the power supply you used to build the machine: are
you SURE it is powerful enough for two hard drives and whatever other
hardware you have there…

AND, your line “This is after the computer has been running for a
while” kinda sounds like it might be a HEAT problem…hard drives
make heat…do you have a case fan? is it working? maybe all you need
to do is open the cabinet (and remove all the cat hair)…and then
LEAVE the case open, or install a case fan…

good luck


DenverD (Linux Counter 282315) via NNTP, Thunderbird 2.0.0.14, KDE
3.5.7, SUSE Linux 10.3, 2.6.22.18-0.2-default #1 SMP i686 athlon

Thanks for the suggestions. I’ll try messing with the various connections once I get a little time (I have Iowa State marching band camp all this week, but I’ll get my life back this weekend…).

I have a 600-watt power supply in the machine and each of my hard drives draws less than 8 watts (according to WD’s website). The only PCI cards installed are my graphics card, a wireless network card (currently unused; not installed in Utah - any idea how this could mess with a hard drive?), and a TV card (currently unused, but was in use in Utah).

I doubt that heat is causing my problem. My Thermaltake case has a 140mm fan blowing across all 4 drives, and smartctl tells me that they’re running between 37 and 40 °C after the computer has been on for a while. The temperatures are all within a couple of degrees of each other, so /dev/sdb isn’t running noticeably hotter than the others.

When I get time to start pulling wires, how do I tell that the problem is solved? Currently my computer won’t even boot to a graphical interface unless /dev/sdb is commented-out of /etc/fstab (a failed fsck halts the boot and forces the recovery mode). So do I disconnect & reconnect all of my wires/cards, reformat my hard drive, restore my data, and hope it doesn’t happen again? Or is there a way to immediately tell when I’ve fixed the root of the problem?
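There may not be a way to know immediately, since the fault comes and goes. One rough approach, sketched below, is to leave the drive in service and keep checking whether it has been remounted read-only and whether the kernel log shows new disk errors. `mount_is_readonly` is a made-up helper name, and /home is the mount point mentioned earlier in the thread:

```shell
# Hypothetical helper: read /proc/mounts-style text on stdin and succeed
# (exit status 0) only if the given mount point is mounted read-only.
# If the mount point appears more than once, the last entry wins.
mount_is_readonly() {
    awk -v mp="$1" '
        $2 == mp {
            found = 1
            ro = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") ro = 1
        }
        END { if (found && ro) exit 0; exit 1 }
    '
}

# Usage:
#   mount_is_readonly /home < /proc/mounts && echo "/home is read-only"
#   dmesg | grep -iE 'ata[0-9]+|I/O error'    # look for new disk errors
```

A few quiet days prove little on their own, but a log that stays free of ATA and I/O errors over a longer stretch is a reasonable sign that the fix has held.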

rgrandin wrote:
> So do I disconnect & reconnect all of my wires/cards, reformat my
> hard drive, restore my data, and hope it doesn’t happen again? Or
> is there a way to immediately tell when I’ve fixed the root of the
> problem?

i think your solution is based on the idea that there IS a bad
superblock…

my TRIAL solution idea is based on the idea that there is a bad
connection…caused by the trip…

and, that that is the ONLY problem…

which MIGHT be ‘fixed’ simply by the process of unplugging and
replugging the drive cables (mechanically swiping EACH contact point,
removing any/most/all unwanted substances and restoring connectivity)…

that is to say: do NOT reformat until after you have proven my trial
solution idea incorrect…this is especially the route i would take
(do the easy things first to rule them out) since you have removed my
other two concerns (insufficient power and overly sufficient heat) :)

here: a gray haired old man (with lots of IBM Big Iron experience)
said to me once: “ALWAYS suspect the cables/connections FIRST!” I
have found that to be good advice. (oh, his advice actually went more
like this: “ALWAYS suspect the cables/connections FIRST! While in
there clean out the dust bunnies, wiggle stuff, check that fans are
turning and air paths are clear…”)

IF you do those things and your exact problem persists: then i’m
wrong…and you continue to be where you were when i first
answered…EXCEPT, you will have eliminated a potential
physical/hardware problem prior to proceeding to the next logical
step: find the failed drive…OR file system problem…OR or or or

YMMV

DenverD

Are the beep codes coming at boot? If it’s beeping during use, the first thing I would suspect is a hot CPU. An overheated CPU can cause absolutely anything to fail in the system. Check the motherboard for blown capacitors as well - the tops will be pushed up… This will cause your board to beep.

One other thing to check is the PSU. Open it up and see if any capacitors are blown - the tops will be pushed up. If it’s “crowbarring” it will take out a hard drive in a hurry.

Until you determine the cause of the beeps, there is really no need to go any further.

Alright, so I unplugged all of my cables and plugged everything back in. And for the 3.5 days after my next boot everything was running just fine and all of my data was still on the suspect hard drive. I thought the problem was solved.

Just now I had the read-only symptoms appear again (Skype signs itself out & “touch” doesn’t work because of the read-only filesystem). The computer did reboot successfully…after several attempts required to get past POST. The failed attempts caused the BIOS to think that there was an overclocking problem (this computer isn’t overclocked), but I didn’t change any BIOS settings when given the opportunity. It was after this overclocking error that the computer finally rebooted.

Before I go through the time and hassle of removing the motherboard and RMAing it to MSI (a 1-2 week process, according to their tech support), what else can I attempt to do to solve the problem?

rgrandin wrote:
> Before I go through the time and hassle of removing the motherboard and
> RMAing it to MSI (a 1-2 week process, according to their tech support),
> what else can I attempt to do to solve the problem?
>
>
well, as i read (when you said it ran days with no problems, and
then…) i was gonna say REPLACE the cable between the drives and
MB…but, when you said “several attempts required to get past POST”
i thought, oh-no sounds like CPU or MB…

my guess is it may be the MB, but if you wanna TRY…go ahead and
remove/replace the CPU…it MIGHT help…but, be SURE you have (and
know how to correctly apply) replacement thermal paste…otherwise
you might fry your processor too…

on the other hand, if the MB’s maker’s symptom-problem-fix maze made
them think it IS a MB problem, it probably is…i mean, they know
their product a lot better than you or me…


see caveat: http://tinyurl.com/6aagco
DenverD

For the sake of completeness, replacing the SATA cable fixed the problem.

(Sorry for the year-later post…just realized that I never stated my solution last year.)