Help Please - Upgrade Problem Now Can't Boot

Let me start by apologizing for the long post, and for my limited Linux experience. I had installed openSUSE 10.3 to begin learning. The machine is mainly a file server for my home network. I have two drives configured for software RAID through the OS, and I mount /home and /shared from the RAID device /dev/md0.

I will try to remember everything that I have done and the errors I received. Hopefully I have not done too much damage here :slight_smile:

I wanted to upgrade the machine to openSUSE 11.0, so I booted from the CD, chose the installation option, and then chose Update. The first error I received was when I had to choose the partition. It automatically selected /dev/sdc2, which is my root partition. The other partitions listed are:
/dev/sdb1
/dev/sdb2
/dev/sda1
/dev/sda2
/dev/md0
/dev/md1
/dev/sdc1
/dev/sdc2

When I hit Next to update /dev/sdc2, I got this message:

"The partition /dev/sda1 could not be mounted.
mount: /dev/sda1 already mounted or /mnt/boot busy

If you are sure that the partition is not necessary for the update (it is not a system partition), click continue. To check or fix the mount options, click specify mount options. To abort update, click cancel."

I should probably have canceled at this point, but of course I did not, because I was thinking that /dev/sda1 was part of the RAID array and not a system partition. So I hit Continue.

The next thing that came up was the repository list. They were all for openSUSE 10.3 and disabled, so I changed most of them to the openSUSE 11.0 repositories and then enabled them. Not sure if I should have done that or not…

The installation completed and the PC rebooted. The first thing I noticed was that GRUB still listed openSUSE 10.3. Then during boot it failed to mount the first RAID device. The first error, though, was:

“Could not load /lib/modules/2.6.22.18-0.2/modules.dep: No such file or directory.”

Why would it be trying to load modules for that kernel when the latest one installed was 2.6.25.5-1.1? It then went on to try to mount the RAID partition and could not, and told me to run fsck. I ran fsck on /dev/md0, but it failed with “e2fsck: Device or resource busy while trying to open”. There was also a message that the superblock could not be read.

Then, being the genius that I am, I decided to change the path GRUB was using from the old kernel to the new kernel. This didn’t help (or I didn’t do it right); it still failed to boot with the same messages.

Next (I know I should have asked for help by now) I decided to re-run the upgrade. This seemed a little more promising; it got further than before, but then I got a message that the X server could not start. The error is:

“Fatal server error: no screens found”

The X server error window also says that the current OS is Linux suse103 2.6.22.18-0.2-default.

This brings me back to a console screen which says:
“Welcome to openSUSE 11.0 (i586) – Kernel 2.6.22.18-0.2-default (tty1)”
When I try to log in as root I get:
“cannot make/remove an entry for the specified session” and I am kicked back to the login again.

My biggest concern is getting to the two RAID partitions. I am fine with installing from scratch as long as I do not lose the data on those two partitions.

Am I fixable or did I completely hose it?

Also I forgot to add that I can boot to runlevel 1 and log in as root.

Anyone?

I am at the point now where I can boot to runlevel 1. However, I cannot mount /dev/md0 or /dev/md1.

I get a message that it is the wrong fs type, a bad option, or a bad superblock.

If I could at least mount these partitions to confirm that the data is OK, I would be fine with just installing the OS from scratch rather than trying to fix these problems.

i almost hate to write…but i will…

first, tell me you have a good, usable backup of all the data on all
those drives…

then, tell me if you have other OSs on the system, or not…

did you dual/triple boot 10.3 and what?

and, the disk you installed from, what was that? (cd/dvd? live cd with
gnome, kde3, kde4 what?)

and, tell me you read this before you started:

http://tinyurl.com/3vwrzl
http://tinyurl.com/6mu7rm
http://tinyurl.com/5eqy8k

–
DenverD (Linux Counter 282315)
A Texan in Denmark

Thank you for the response.

ok, it SEEMS to me that you only need to repair your grub setup…that
is the good news…the bad news is i can’t help you with that…

but, i remember reading some really good posts on repairing grub…maybe
the site’s search function could help you find them…

–
DenverD (Linux Counter 282315)
A Texan in Denmark

@streetglyden -

As @DenverD said . . .

i almost hate to write…but i will…

There is also the possibility that you have a broken array. You don’t say which type of array you have. You say you “mount /home and /shared from the raid /dev/md0”, which is kind of confusing; I’m guessing that you have mirrored your home directory separately in a RAID 1 array, and that the root directory is not in an array? If the upgrade wrote files to individual partitions that are in an array, rather than to the array itself, the array filesystem will lose synchronization and be broken (or worse, if RAID 0). I can’t say more than that, because if there is a problem with the array, what you do will depend on what type it is and what is (was) contained in it; that will govern the recovery approach.

If you can get to a prompt, run “cat /proc/mdstat”, which will give you the state of the arrays. The man page for mdadm, the management tool for the md RAID driver, lists the commands to query the condition of an array and rebuild it if necessary. Good luck.
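
For example, something along these lines (I’m using /dev/md0 and /dev/sdb1 only as placeholders; substitute your own array and member partition names):

cat /proc/mdstat              # quick overview: which arrays exist and whether they are active
mdadm --detail /dev/md0       # per-array view: state, member drives, UUID
mdadm --examine /dev/sdb1     # per-drive view of the RAID superblock on one member partition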

Sorry, it is just a mirrored array. You are correct that / is not in the array. Just /home and /share. I also set it up after I had installed 10.3. I don’t remember exactly how I moved /home to the raid partition but it was done after the initial install of 10.3.

/dev/md0 is /share
/dev/md1 is /home

cat /proc/mdstat

Personalities :
md1 : inactive sdb2[0] sdc2[1]
169565924 blocks super 1.0

md0 : inactive sdb1[0] sdc1[1]
146801760 blocks super 1.0

Also when I do “mdadm --detail /dev/md0” the state shows active, Not started

What is the current kernel version? uname -r shows that I am running 2.6.22.18-0.2-default - should this have been updated?

Should I compile a newer version? I have 2.6.25.5-1.1 in /usr/src

Could this be the source of all of my problems?

So you have two arrays. /dev/md0 would not have been affected by the upgrade, but /dev/md1 would be. Fortunately, not to a great extent: the KDE 3.5.9 upgrade would not change many files, and while adding in KDE 4 creates a number of new files, they would basically be empty. Since you used RAID 1, if only one drive in the array was updated, it won’t be seriously out of sync with the other.

In /proc/mdstat, “inactive” means the array is recognized but not mounted. mdadm’s “active, Not Started” means the array is configured with the md driver but not working. There are mdadm commands that will give you granular detail on the state of the array. If memory serves, you will need to break the array, rebuild it, and then synchronize it. This will take a good while. Given the above, you have a good chance of this working.

As far as the kernel goes, the most current version is 2.6.25.11-0.1.

Couple of questions…

How can I tell which drive has been updated in the array?

How exactly do I “break the array”? Do I just fail and remove the drive, like:
mdadm /dev/md1 --fail /dev/sdb2
mdadm /dev/md1 --remove /dev/sdb2

Then would I recreate it just as I did when I first set it up?

As for the kernel not getting updated during the install, should I be concerned about that?

Also you mentioned KDE. I am running gnome.

Upgrading Gnome as part of 11.0 would probably be roughly equivalent to KDE 3.5.9. That is, /home does not change much at all.

Please bear in mind I’m working from memory here . . . fortunately I don’t have to get into my arrays often enough to have all this fresh in mind. But as I recall . . .

First, the mdadm --detail command should show the array to be “clean” and the individual drives to be “active sync”. You need to determine why the array is not started, because if it is not, either only one drive is being written to or you will have failed mounts, etc. mdadm --examine looks at the individual drives in the array.

If the array needs rework, first start with Assemble mode, with --update=resync; that may be all you need to do. If it has to be re-constituted, then yes, the commands you cite are what you’re looking for: you have to --fail the drive, --remove it, and then --re-add it. And probably resync as well; frankly, I don’t remember whether the re-constitution automatically invokes the synchronization, as happens when an array is first built. (When you built the array the first time, you either copied /home into the created array, or you had /home on a drive which you placed in an array, then added a second drive to that array, which invoked the synchronization.)
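
Roughly, and again from memory, the sequence would look something like the lines below. I am using /dev/md1 with members /dev/sdb2 and /dev/sdc2 purely as an example; check the mdadm man page before running any of it:

mdadm --stop /dev/md1                                           # stop the inactive array first
mdadm --assemble --update=resync /dev/md1 /dev/sdb2 /dev/sdc2   # re-assemble and force a resync; this may be all that is needed

# Only if the array has to be re-constituted: fail, remove, then re-add one member
mdadm /dev/md1 --fail /dev/sdb2
mdadm /dev/md1 --remove /dev/sdb2
mdadm /dev/md1 --re-add /dev/sdb2

cat /proc/mdstat                                                # watch the rebuild/resync progress here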

That’s about all I recall right now.

As far as the kernel version, while AFAIK that wouldn’t be a factor in your situation, yes, I think you want to get it updated as soon as you can after the RAID issue, if any, is resolved. In major upgrades, new features or capabilities are sometimes introduced which are tied into a new kernel. There have been some posts on this forum from users indicating problems after upgrading the kernel, but frankly, that always happens and when the symptoms are random it is impossible to judge (at least for me) whether the kernel update itself was the source of the problem (unless the problem is related to a known bug). The only kernel update problem I’ve seen repeated by users of late has not been related to the kernel, but to the X server that was updated at the same time. That’s about all I can offer on that point.

Good luck.

Here is some of the output:

mdadm --detail /dev/md0

Version : 01.00.03
Creation Time : Tue Apr 1 01:46:04 2008
Raid Level : raid1
Used Dev Size : 73400880 (70.00 GiB 75.16 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Sun Aug 3 21:30:59 2008
State : active, Not Started
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

Name : 0
UUID : <UUID #>
Events: 20

Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1

I did mdadm --examine on each drive, but I am not exactly sure what I am looking for there. They both show the state to be “clean”.

Then I did:

mdadm --assemble --update=resync /dev/md0

and it returned:

mdadm: device /dev/md0 already active - cannot assemble it

Do I need to do mdadm -S /dev/md0 first?

Thanks again for your help.

Again, I need to caveat - been quite a while, so my memory is not the most reliable (can I get away with blaming that on my age?) . . .

What you’re looking for on the drives is the state (clean), the internal bitmap being correct (the checksum), and validation of the array UUID . . . what I noticed in your output was that there is no UUID . . .

Have you checked your /etc/mdadm.conf file? The arrays must be properly defined there to be started. Usually the array key used is the UUID; if you created the arrays with YaST, that would be the case. Have you looked at /var/log/boot.msg to verify that the driver (it’s embedded in the kernel now) is starting properly, that it’s seeing the arrays and initializing the devices? The command “mdadm -A --scan” will simulate system startup of the arrays, scanning the conf file. The command “mdadm -A --update=uuid” will create a new random UUID for an array (I don’t know whether it automatically updates mdadm.conf for you; if not, that would need to be done).
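
For reference, the array definitions in /etc/mdadm.conf usually look something like this (the UUIDs below are placeholders, of course - use the ones that mdadm --detail reports):

DEVICE partitions
ARRAY /dev/md0 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
ARRAY /dev/md1 UUID=yyyyyyyy:yyyyyyyy:yyyyyyyy:yyyyyyyy

With that in place, “mdadm -A --scan” should assemble whatever is defined there.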

Hope that helps . . .

Sorry, the UUID is actually in there I just didn’t want to type the whole thing :slight_smile: I am at runlevel 1 so not on the network and was just typing what was on the screen. The UUID is in /etc/mdadm.conf and is correct for each.

If it is already active and cannot assemble should I just stop it and then try to assemble it?

Sorry, the UUID is actually in there I just didn’t want to type the whole thing :slight_smile: I am at runlevel 1 so not on the network and was just typing what was on the screen.

Shame on you . . .

The UUID is in /etc/mdadm.conf and is correct for each.

Whew!

If it is already active and cannot assemble should I just stop it and then try to assemble it?

Possibly. But if everything in the config, the superblock, the UUID, and the state is looking good, then my approach would be to first find out why the array is not being started. Again, get into the boot log and take a look at what’s happening when the driver loads, the arrays are detected, and the mounts are attempted.
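
Something like the line below would pull out the relevant bits (boot.msg is where openSUSE keeps the boot output; the grep pattern is just a suggestion):

grep -iE 'raid|md0|md1|mdadm|mount' /var/log/boot.msg | less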

Ok here is some info from the boot.msg file.

Activating device mapper…
FATAL : Could not load /lib/modules/2.6.22.18-0.2-default/modules.dep: No such file or directory
failed
Starting MD Raid modprobe : FATAL : Could not load /lib/modules/2.6.22.18-0.2-default/modules.dep: No such file or directory

mdadm: failed to RUN_ARRAY /dev/md0: Invalid argument
Modprobe: FATAL : Could not load /lib/modules/2.6.22.18-0.2-default/modules.dep: No such file or directory

ls: cannot access /sys/module/apparmor: No such file or directory
ls: cannot access /sys/module/apparmor: No such file or directory
FATAL: Could not load /lib/modules/2.6.22.18-0.2-default/modules.dep: No such file or directory
Loading AppArmor module failed

System Boot Control: The system has been set up
failed features: boot.md boot.apparmor

Oh my! :frowning:

I re-read your first post. Frankly, I’m not at all sure what happened, and consequently what the current state is. The boot log excerpt shows module loads failing on a missing modules.dep; that’s an index of all the modules, created by the depmod command when a kernel is installed. So something went awry with the kernel installation. Or quite possibly much more; there could be corruption throughout the file set. We could look at the entire boot log for hints, but we can’t be sure.
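
(For reference, that index is normally regenerated by running depmod against whatever kernel is actually present under /lib/modules - I am guessing 2.6.25.5-1.1-default in your case - though given the state of things I would not count on that alone fixing it:

depmod -a 2.6.25.5-1.1-default

That is just for completeness; it would not address any wider corruption.)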

In your first post you wrote:

“I am fine if I need to install from scratch as long as I do not lose the data on those two partitions.”

So let me suggest this . . .

Boot into the Live-CD. Open a terminal, and su to root, then do:

modprobe raid1

That will load the RAID 1 kernel module. Now try to mount each array. You may have to do something with mdadm to activate the arrays; I’ve never had to do this exactly, so I just don’t know. But if you can mount the arrays and verify the data, then the cleanest solution will be to re-install. Reconstructing from the bottom up is possible from the command line, but if a re-install is acceptable, it would be far better.
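
If the arrays don’t come up on their own after loading the module, I would expect something along these lines to do it (member partitions taken from your /proc/mdstat output; the mount points are just examples):

modprobe raid1                                   # load the RAID 1 kernel module
mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1    # assemble the /share array from its two members
mdadm --assemble /dev/md1 /dev/sdb2 /dev/sdc2    # assemble the /home array from its two members
mkdir -p /mnt/md0 /mnt/md1
mount /dev/md0 /mnt/md0
mount /dev/md1 /mnt/md1
ls -l /mnt/md0 /mnt/md1                          # eyeball the data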

YES! This worked. I was able to just load the raid1 module and then mount the array.

Ok, so since I am still learning the ins and outs of Linux, what do you think went wrong? Is it not recommended to do an upgrade? Is it something I did? I guess it probably all goes back to the kernel getting messed up during the installation. Or is the kernel possibly fine and GRUB is just messed up?

Anyway, now that I know the data is ok I will just install fresh. Thank you so much for your time and patience helping me out. I really appreciate it.

One more thing… it shouldn’t have made any difference that I have / on one partition and /home on the RAID array when doing an upgrade, should it? I would think having them on separate partitions is normal practice in the real world, right?

Again, thanks for your help.

Terrific :smiley: And, you are quite welcome. I was glad to be of help. (Gotta say though, I was concerned for awhile there, given those arrays hold your user data.)

I really can’t say what went awry. I’ve never seen anything like this, and I’ve done a ton of installs and upgrades. From your description, it almost looks like the installer saw your drives in a different sequence, maybe the SATA before the IDE? Actually, come to think of it, I encountered this recently, and I had to change the drive names in fstab. You might check this out.

I would check whether the /home array was actually mounted in the upgrade or not. Look at the file time stamps. If you asked for KDE 4 to be included in the upgrade, look for a .kde4 directory - if it’s there, the array was mounted and written to. And by the way, since you are going to re-install, delete that directory, too.
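
A quick way to check would be something like this once the array is mounted (I’m using /mnt/md1 and “youruser” as stand-ins for your actual mount point and user name):

ls -lat /mnt/md1/youruser | head -20        # newest files first; look for timestamps from the day of the upgrade
ls -d /mnt/md1/youruser/.kde4               # present only if KDE 4 wrote its config there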

I would advise now doing a clean install rather than an upgrade; conditions being what they may be in the system, you may only have that choice. As far as /home, you can mount the array in the partitioning step and not have it formatted, and then the software installation will update what you have (do please have a backup first). If you were using this system as a file server and don’t have a lot of gui apps, there probably won’t be much changed. If you had KDE installed, be sure to only install KDE 3.5.9 at least for now. You could also let the installation do a clean install of /home within the root partition temporarily, migrate your user data there afterward, test everything, and if OK, copy that entire /home back to your array, delete the temporary /home, mount the array at /home - that’s extra safe, the arrays won’t be touched by the install. That’s what I would do.
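
The copy-back step at the end would be roughly this (the paths and filesystem type are illustrative; I’m assuming the array is mounted at /mnt/md1 while the temporary /home sits on the root partition):

rsync -aH /home/ /mnt/md1/        # copy the temporary /home onto the array, preserving ownership and permissions
# then point /etc/fstab at the array for /home, adjusting the fs type to what you actually use, e.g.:
# /dev/md1   /home   ext3   defaults   1 2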

Another thought: Use the DVD, not the Live-CD, for this. I trust the DVD more. Just my opinion. To your other questions . . .

I don’t think GRUB was a factor at all, unless again the drive order got switched; it would then be installed on the wrong drive. Re the kernel: its installation went wrong; that’s why you were missing your modules.dep index. But I strongly suspect a lot more is borked than just the kernel.

Finally, no, having root and /home on separate partitions, even though one is RAID and the other not, should have absolutely no effect. I use 3 arrays, for root, home, and boot, and they aren’t even all the same type. Let me put in another plug here for backups. Yes, RAID 1 means fault-tolerance, but only in the sense of one of the drives dying. If you get file system corruption, it will be on both drives. Other things can go wrong, too; you only get true fault-tolerance with a true hardware RAID controller. So I mirror my critical data with rsync running from a script scheduled with cron; works great.
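
For what it’s worth, the setup I mean is nothing fancy - roughly something like this (the script path, backup destination, and schedule are of course just examples):

#!/bin/sh
# /usr/local/bin/mirror-data.sh - mirror /home and /share to a separate backup disk
rsync -a --delete /home/  /backup/home/
rsync -a --delete /share/ /backup/share/

plus a crontab entry such as “0 2 * * * /usr/local/bin/mirror-data.sh” to run it nightly.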

I think that about covers it. Late here, gotta sign off. Again, great that things worked out OK!