Corrupt root partition

I’m running openSUSE 11.2 64-bit on my desktop for the moment. I’ve been having trouble getting 11.4 to download with good checksums. In an attempt to get that file, I downloaded the 64-bit DVD via aria2c overnight. I did it via sudo, as the only partition large enough to hold it was /. This morning all looked good. I checked the SHA1 checksum and it looked good. I wanted to check the MD5 just to be safe.

I downloaded the md5 checksum file from here and saved it in /home/<user>/Downloads/. The first couple of times I tried this, I clicked on the link in Firefox and did “File | Save As…” Nothing happened. I didn’t get the dialog asking where to save it (I have it set to always ask where to save). I then right-clicked on the link and selected “Save link as…” That allowed me to save it.

I switched to root in the terminal and mv’d the file to / so it would be in the same directory as the iso. I then cd’d to / to run the md5 check and did a quick “ls -lia”. It gave me an error saying something like “/usr/bin ls not a valid file or directory.” (or something very similar to that). I tried opening the KDE file manager (I forget what it’s called), but it didn’t open.

As things were acting weird at that point, I figured some process was hung or something, so I’d reboot. I exited root mode and the terminal, shut down all apps and tried to shut down via the KDE menu. It started to shut down, but then went to the console and came up with an error. I don’t remember what it was. It was long, or I would have written it down. I tried accessing the console various ways, but only got that error. I had to do a hard reset. When it booted, I got the following:

doing fast boot
Creating device nodes with udev
Trying manual resume from /dev/disk/by-id/ata-ST3160023AS_blah_blah-part9
Invoking userspace resume from /dev/disk/by-id/ata-<same as above>-part9
resume: libgcrypt version 1.4.4
Trying manual resume from /dev/disk/by-id/<same as above>-part9
Invoking in-kernel resume from /dev/disk/by-id/<same as above>-part9
Waiting for device /dev/disk/by-id/<same as above>-part7 to appear:  ok
fsck from util-linux-ng 2.16
[sbin/fsck.ext4 (1) -- /] fsck.ext4 -a /dev/sda7
/dev/sda7: clean, 8118/3474800 files, 1486910/13894209 blocks
fsck succeeded.  Mounting root device read-write.
Mounting root /dev/disk/by-id/<same as above>-part7
mount -o rw,acl,user_xattr -t ext4 /dev/disk/by-id/<same as above>-part7 /root
No init found.  Try passing init= option to the kernel.
umount:  /dev: device is busy.
              (In some cases useful info about processes that use
                the device is found by lsof(8) or fuser(1))
[    3.269491] Kernel panic - not syncing: Attempted to kill init!

I rebooted again with the install disk and ran the repair tools. The first time, I got messages saying that sda2, sda4, and sda7 were corrupt, with the option to repair. When I clicked on “repair” it didn’t appear to do anything, and the message popped right back up. I hit the repair button several times for each partition (10 or so times), then hit “skip”. At the end, I got two errors saying no root partition was found.

I rebooted and ran it again. This time, it only gave me the corruption error on sda7 (root, of course). I again hit “repair” several times, again got the two “root not found” errors, and rebooted to the recovery system. I then manually ran fsck on all partitions. They all seemed fine. sda7 took a while, and it did say it was repairing. It succeeded. Reboot. Same error as above (in the code block). If I try it in “failsafe” mode, same thing (not really a surprise).

I also tried running the partition manager from the repair utilities. It sees each of my partitions, but does not see the mount points. It will not let me specify the mount points unless I choose to format the drive. Obviously, I don’t want to do that.

So I’m pretty sure my root partition is corrupt. What are my options? Is there anything I can do to recover that partition? I was thinking about doing a re-install and just not formatting my /home and /usr partitions. However, since the install utility doesn’t recognize the partition mount points, I don’t think I can do that. Is there a way around that?

I don’t have a problem re-installing my entire system, but I don’t want to lose what’s in /home and /usr (each is a separate partition on the same disk). I’ve got data in /usr, and documents I don’t want to lose in /home.

Do you think it could be a hardware issue? I’m thinking about running Spinrite 6.0 on the disk, but that takes forever, so I can’t try anything else while that’s running.

When a disk is repaired, fsck sometimes cannot piece everything back together. Anything it does not know what to do with ends up in lost+found. If those are binaries, it may be impossible to splice them back together. So if root has had serious corruption, about the only thing to do is reinstall. The installer does not know what the mount points are; it only gives a suggestion if it can. You need to go to expert mode to set the mount points and to choose whether each partition gets formatted and, if so, how.

This may be an indication of a failing disk. Since you have Spinrite, I’d run it, even if it does take all night. Also, I’d boot from a CD OS and back up all important stuff before doing anything else.

How can I get to the lost+found directory? I mounted the root partition in the recovery system. I can see the directories on it, and I can cd into lost+found. It appears empty, though.

Well, maybe it is clean. But it does look like something was lost, or you would be able to boot.

Since you can mount it, maybe poke around a bit and see if everything looks well. I’d look at /boot/grub first and check the menu.lst file. From there I’d use the menu.lst references to see whether the kernel is still there, etc.
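The menu.lst check suggested above can be sketched in a few lines of Python. This is only an illustration, assuming a GRUB legacy menu.lst whose `kernel` and `initrd` lines give paths relative to the mounted root partition (if /boot is its own partition, the paths are relative to that partition instead). The function names and the mount point in the usage note are made up for the example.

```python
import os
import re

def referenced_boot_files(menu_lst_text):
    """Return the file paths named on kernel/initrd lines of a menu.lst."""
    paths = []
    for line in menu_lst_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        m = re.match(r"(kernel|initrd)\s+(\S+)", line)
        if m:
            # Drop a leading GRUB device prefix such as (hd0,6) if present.
            paths.append(re.sub(r"^\([^)]*\)", "", m.group(2)))
    return paths

def missing_boot_files(mount_point, menu_lst_text):
    """Return the referenced paths that do not exist under mount_point."""
    missing = []
    for path in referenced_boot_files(menu_lst_text):
        full = os.path.join(mount_point, path.lstrip("/"))
        if not os.path.exists(full):
            missing.append(path)
    return missing
```

With the damaged root mounted at, say, /mnt (a hypothetical mount point), you would read /mnt/boot/grub/menu.lst into a string and call `missing_boot_files("/mnt", text)`. An empty list means every kernel and initrd that the menu refers to is still on disk.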

Do you get any message when you try to boot?

I only get the error in the code block above.

Kernel is still there.

I do see something odd though. When I boot into recovery system, I see the following directories when I ls in root:
bin
boot
dev
etc
home
lib
lib64
media
mnt
mounts
parts
proc
root
sbin
sys
tmp
usr
var.

However, when I switch to my recently mounted root partition, I only see the following:
boot
dev
home
lost+found
opt
proc
sys
usr
var
windows.

There’s no /etc. Even weirder: I did a find for etc and it shows up in my windows directory. I did a quick cat on the fstab located there and it appears to be the correct file. It’s almost like it got moved to another directory, as if fsck recovered it to the wrong place. In fact, if I combine what’s in the “windows” directory with what’s in my root directory, all of the directories from the recovery system root are there except for “mounts” and “parts”.

The “windows” directory is just a directory which holds the directories where I mount my XP partitions (read-only, of course). So besides those WinXP directories, “windows” now holds:
bin
etc
lib
lib64
lost+found
media
mnt
root
sbin
selinux
srv
tmp.
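The comparison described above (which top-level directories are missing from the damaged root but turn up under “windows”) can be written as a small set operation. This is a rough sketch; the function name and arguments are invented for illustration, and it only reads directory listings, it moves nothing.

```python
import os

def misplaced_dirs(reference_root, damaged_root, suspect_subdir):
    """List directory names that exist in reference_root, are missing from
    damaged_root, and are present under damaged_root/suspect_subdir."""
    def subdirs(path):
        return {name for name in os.listdir(path)
                if os.path.isdir(os.path.join(path, name))}

    expected = subdirs(reference_root)
    present = subdirs(damaged_root)
    suspect = subdirs(os.path.join(damaged_root, suspect_subdir))
    # Missing from the damaged root AND found inside the suspect directory.
    return sorted((expected - present) & suspect)
```

Fed the three listings from this post, it would report bin, etc, lib, lib64, media, mnt, root, sbin and tmp. It would not flag selinux or srv, because the recovery system’s own root has no such directories to compare against, so the result is a starting point for checking by eye rather than a complete answer.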

BTW, that lost+found is also empty. All of them are. Is it possible that there is a permissions issue that is keeping me from seeing the contents? It seems odd that Linux loses all security as soon as I have an install disk.

Looks like fsck moved them to the wrong place. Things must have been seriously damaged. Personally, I’d reinstall. You might want to preserve the contents of etc for reference, since it may contain files that affect the usr binaries. I’d also check that drive.

I moved all of those directories back to root, ran the repair utility, and now it’s booting up normally. I will probably do a fresh install of 11.4 (if I can ever get a good download). But at least now I can back up my data. I looked through a bunch of the log files and I did see SMART - “prefailure” messages. However, they were on my WinXP drive. I’m in the process of replacing that drive, so that’s not catastrophic. :smiley:

I am also going to run Spinrite on all of my drives, just to be sure everything’s cool.

Thanks so much for your help! You got me on the right track.

Glad to help. That was a tough problem.

On 2011-03-30 23:36, Yippee38 wrote:

(I know you solved the issue, just some comments for next time)

> I’m running openSUSE 11.2 64-bit for the moment on my desktop. I’ve
> been having trouble getting 11.4 to download with good checksums. In an
> attempt to get that file, I downloaded the 64-bit DVD via aria2c
> overnight. I did it via sudo as the only partition large enough to hold
> it was /.

That’s a mistake. You should have created a new directory under /, given
yourself write permission on it, and run the download there. Why? Because,
for example, root can fill a partition completely, while a user process
gets stopped before that, leaving a small margin. I suspect your root
partition filled up, plus something else I don’t know about.
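The margin mentioned here can be seen from Python with `os.statvfs`, which reports free blocks for root (`f_bfree`) separately from free blocks for ordinary users (`f_bavail`). A minimal sketch; the function name and dictionary keys are invented for the example.

```python
import os

def space_report(path):
    """Report free space on the filesystem holding path, for root and for users."""
    st = os.statvfs(path)
    block = st.f_frsize                      # fragment size, in bytes
    free_for_root = st.f_bfree * block       # what a root process may still fill
    free_for_users = st.f_bavail * block     # what an ordinary user may still fill
    return {
        "total_bytes": st.f_blocks * block,
        "free_for_root_bytes": free_for_root,
        "free_for_users_bytes": free_for_users,
        "reserved_margin_bytes": free_for_root - free_for_users,
    }
```

On ext filesystems the difference between the two figures is the reserved-blocks margin (mke2fs reserves 5% by default), which is why a download run as root can eat into space that a user-run download would never touch.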

Question: what filesystem type are you using?

> I rebooted again with the install disk and ran the repair tools.

I don’t trust that automated repair tool. I’ve never been able to do
anything good with it.


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)

On 2011-03-31 03:36, Yippee38 wrote:
>
> I looked through a bunch of the log files and I did see SMART
> - “prefailure” messages.

Not necessarily important. It may simply mean that a value changed. Look at
mine:

<3.6> 2011-03-31 11:05:10 Telcontar smartd 3240 - - Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 105 to 106
<3.6> 2011-03-31 11:05:10 Telcontar smartd 3240 - - Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 39 to 40

They are trivial, happening all the time. You have to interpret what yours
says.
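To make that kind of log easier to read, the attribute changes can be pulled out with a short script. This is a sketch that assumes the message wording shown in the two lines above. The regular expression and function name are my own, not part of smartmontools.

```python
import re

# Matches messages like:
#   Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 105 to 106
SMART_CHANGE = re.compile(
    r"Device: (?P<device>[^\s,]+)(?: \[[^\]]*\])?, "
    r"SMART (?P<kind>Prefailure|Usage) Attribute: "
    r"(?P<id>\d+) (?P<name>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)

def parse_smart_changes(log_text):
    """Extract smartd attribute-change messages from log text.

    Whitespace is collapsed first so entries that a mail client wrapped
    across lines are still matched.
    """
    flat = " ".join(log_text.split())
    changes = []
    for m in SMART_CHANGE.finditer(flat):
        entry = m.groupdict()
        for key in ("id", "old", "new"):
            entry[key] = int(entry[key])
        changes.append(entry)
    return changes
```

The word “Prefailure” only names the attribute’s type. As far as I know, the numbers smartd logs by default are the normalized values, so a change like 105 to 106 means little by itself; what matters is how close the value sits to the threshold that `smartctl -A` lists for that attribute.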


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)

Carlos,

The drive shouldn’t have filled up, as it has a bit over 100 GB of free space. Once I get my Windows drive replaced, I’m going to back stuff up and do a re-install with better partition sizes, so I can put downloads someplace sensible like /home. In fact, when I was deciding to download it to /, I thought, “This is probably a bad idea to put this here.” hehe.

I don’t like the automatic repair tools. I run them manually so I have control over what they do. They always try to change my fstab and my grub. I don’t let them.

Good info on the SMART. IIRC, those are the exact same messages I was seeing (though the values were different).

Thanks for the feedback. Hopefully, I won’t make these mistakes again, and maybe somebody else will learn from my problems.
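For anyone who lands here with the same bad-download problem: the MD5 and SHA1 checks from the first post can both be done in one pass over the ISO, without moving files around as root. A minimal sketch, assuming the published checksum file uses the usual “hex digest, then filename” layout that md5sum and sha1sum produce; the function names are made up.

```python
import hashlib

def file_digests(path, algorithms=("md5", "sha1"), chunk_size=1024 * 1024):
    """Compute several digests of a file in a single pass, reading in chunks."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}

def matches_published(path, algorithm, published_line):
    """Compare a file's digest with the first field of a published checksum line."""
    expected = published_line.split()[0].lower()
    return file_digests(path, (algorithm,))[algorithm] == expected
```

Reading in chunks keeps memory use flat even for a DVD-sized image, and the script can be pointed at the checksum file wherever it was saved, so nothing needs to be moved into /.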