I have a mysterious filesystem issue with openSUSE on an old machine.
Installation wasn’t very pleasant, but it finally worked. It boots fine. However, after 5 or 10 minutes, all filesystems vanish. Most programs (but not all) cannot be found anymore. “ls” output is empty everywhere. “top” used to work (it was running over an ssh connection that I just lost). Now it complains about an unset TERM variable. I was able to set it again to xterm. Now “top” is working and displays the running processes: mysqld, Xorg, icewm (my WM). Nothing strange there, nobody is eating up the resources. I have other Linux systems installed on that machine: Ubuntu freezes occasionally, mostly with Gnome, but that seems to be a totally different problem (maybe related to the radeon driver); Mandriva works fine so far. The two IBM 120 GB hard drives are quite old, but the manufacturer’s fitness test didn’t find any failures. I also checked for bad sectors before formatting in openSUSE setup. It doesn’t look like a physical I/O problem. In such a case, it would crash or at least complain about not being able to access the HD. Well … I switched to the console but I didn’t get any login prompt, and now I cannot switch back to X. I can still ping the machine but of course not ssh to it. The SysRq keys (although enabled) don’t work. I guess I will have to reboot brutally. Last time it took me about one hour of pressing ‘y’ to repair the filesystem with Ubuntu. I’m sure I will be back soon. What do you guys think about that situation? I suspect a kernel issue with that hardware. Have you heard about similar problems? I’m afraid I will have to reinstall 11.1, which used to work fine, on that machine.
Some details about the hardware :
The (onboard) Promise controller and the SATA controller are not used.
The 2 HD are on the first IDE interface as master and slave.
Jan 17 08:59:24 miriam kernel: [ 1012.198135] attempt to access beyond end of device
Jan 17 08:59:24 miriam kernel: [ 1012.198210] sdb7: rw=32, want=34359730520, limit=32467302
Jan 17 08:59:24 miriam kernel: [ 1012.198229] EXT3-fs error (device sdb7): ext3_get_inode_loc: unable to read inode block - inode=148327, block=4294966314
Jan 17 08:59:24 miriam kernel: [ 1012.200417] attempt to access beyond end of device
Jan 17 08:59:24 miriam kernel: [ 1012.200448] sdb9: rw=32, want=27099922480, limit=6249222
Jan 17 08:59:24 miriam kernel: [ 1012.200467] EXT3-fs error (device sdb9): ext3_get_inode_loc: unable to read inode block - inode=105967, block=3387490309
Jan 17 08:59:24 miriam kernel: [ 1012.215485] EXT3-fs error (device sdb9) in ext3_reserve_inode_write: IO failure
Jan 17 08:59:24 miriam kernel: [ 1012.243797] attempt to access beyond end of device
Jan 17 08:59:24 miriam kernel: [ 1012.243824] sdb7: rw=32, want=34359738328, limit=32467302
Jan 17 08:59:24 miriam kernel: [ 1012.243842] EXT3-fs error (device sdb7): ext3_get_inode_loc: unable to read inode block - inode=35589, block=4294967290
Jan 17 08:59:24 miriam kernel: [ 1012.255703] attempt to access beyond end of device
Jan 17 08:59:24 miriam kernel: [ 1012.255747] sdb9: rw=32, want=27099922480, limit=6249222
Jan 17 08:59:24 miriam kernel: [ 1012.255766] EXT3-fs error (device sdb9): ext3_get_inode_loc: unable to read inode block - inode=105967, block=3387490309
Jan 17 08:59:24 miriam kernel: [ 1012.256105] EXT3-fs error (device sdb9) in ext3_reserve_inode_write: IO failure
Jan 17 08:59:24 miriam kernel: [ 1012.257125] attempt to access betail: error reading `/var/log/messages’: Input/output error
Now I cannot access /var/log/messages anymore or any other file (the system was running for about 10 minutes).
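The numbers in that log can be checked with a little arithmetic. What follows is only a rough sketch: it assumes 4 KiB ext3 blocks and 512-byte sectors (neither is stated in this thread), and the helper names `parse` and `as_signed32` are made up for illustration. Under those assumptions each `want` value equals (block + 1) * 8, and two of the requested blocks sit within a thousand of 2^32, which looks more like a damaged 32-bit value in memory than a real location on partitions this small.

```python
import re

SECTOR = 512                         # assumption: 512-byte sectors
BLOCK = 4096                         # assumption: 4 KiB ext3 blocks
SECTORS_PER_BLOCK = BLOCK // SECTOR

# Lines copied from the log excerpt above (timestamps removed).
LOG = """\
sdb7: rw=32, want=34359730520, limit=32467302
EXT3-fs error (device sdb7): ext3_get_inode_loc: unable to read inode block - inode=148327, block=4294966314
sdb9: rw=32, want=27099922480, limit=6249222
EXT3-fs error (device sdb9): ext3_get_inode_loc: unable to read inode block - inode=105967, block=3387490309
sdb7: rw=32, want=34359738328, limit=32467302
EXT3-fs error (device sdb7): ext3_get_inode_loc: unable to read inode block - inode=35589, block=4294967290
"""


def parse(log):
    """Pair each want/limit line with the block number reported right after it."""
    wants = re.findall(r"(sd\w+): rw=\d+, want=(\d+), limit=(\d+)", log)
    blocks = re.findall(r"block=(\d+)", log)
    return [(dev, int(w), int(l), int(b)) for (dev, w, l), b in zip(wants, blocks)]


def as_signed32(n):
    """Reinterpret an unsigned 32-bit value as a signed one."""
    return n - 2**32 if n >= 2**31 else n


for dev, want, limit, block in parse(LOG):
    last_block = limit // SECTORS_PER_BLOCK   # highest block the partition can hold
    print(dev,
          "requested block:", block,
          "as signed 32-bit:", as_signed32(block),
          "partition ends near block:", last_block,
          "want == (block + 1) * 8:", want == (block + 1) * SECTORS_PER_BLOCK)
```

If the assumptions hold, the requested blocks are thousands of times beyond the end of either partition, so the partition sizes themselves are not what is off.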
I also tried to change filesystems, reformat ext4 partitions in ext3 and reiserfs in ext4, also deleted and recreated partitions with different sizes between installation attempts. It didn’t make any difference.
I don’t have file system problems on Ubuntu, nor did I on 11.1. And I can mount these openSUSE filesystems in Ubuntu or Mandriva for more than 10 minutes. I also checked the hard drives several times and didn’t find any failure. However, I don’t know how reliable the IBM drive fitness test is. I checked for bad sectors before formatting the partitions in openSUSE. I don’t know what the openSUSE setup would do if it had found any. Fedora would refuse to install (that happened to me once with a broken HD).
On Sun, 17 Jan 2010 17:26:01 +0000, please try again wrote:
> I also tried to change filesystems, reformat ext4 partitions in ext3 and
> reiserfs in ext4, also deleted and recreated partitions with different
> sizes between installation attempts. It didn’t make any difference.
Looks like an imminent hardware failure to me. I’ve seen this kind of
behaviour before - in the weeks leading up to a head crash that rendered
the drive inoperable.
Get a new drive and then get as much data off the drive as you can before
it dies.
Looks like you have a couple of hard drives. You did not say which drive holds which Linux OS. But it still feels like a hardware problem to me. I use a commercial program called Spinrite to do low level scans on drives, but you should be able to get a free one from the drive manufacturer’s site.
I use the manufacturer’s program for low level scans. That’s the first thing I do. openSUSE is on the second drive, but also mounts /home from the first drive. Ubuntu and Mandriva are on the first drive but also use /tmp and /srv from the second drive. All systems swap on both drives.
I didn’t have problems with 11.1 on that hard drive. I know, there could also be (at least) two unrelated problems occurring at the same time, like a hardware failure during a system update.
BTW SysRq reboot is OK after changing the default value (176) to 1.
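The value in question lives in /proc/sys/kernel/sysrq and is a bitmask. A minimal sketch of decoding it, assuming the bit meanings given in the kernel’s SysRq documentation (the function name `decode_sysrq` is made up for illustration):

```python
# Bit meanings as listed in the kernel's SysRq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(value):
    """Return the list of SysRq function groups enabled by a sysrq value."""
    if value == 0:
        return ["SysRq disabled"]
    if value == 1:
        return ["all SysRq functions enabled"]
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


for value in (176, 1):
    print(value, "->", decode_sysrq(value))
```

By this reading, 176 (128 + 32 + 16) permits only sync, remount read-only and reboot/poweroff, while 1 permits everything.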
i) Make a backup
ii) Make a second backup, just in case
When you are having the problem, is the output of ‘mount’ different from the normal state?
From what I can tell/guess, it seems possible that your partition table doesn’t quite correspond with the physical layout of the drives (well, the numbers on SDB7/SDB9 look way out).
Check that the BIOS isn’t getting reset because the battery is dying…that could explain a lot if sometimes you were booting with the disk addressing mode set up incorrectly.
I don’t worry about that. Important stuff, if any, is on a RAID file server (running openSUSE 11.1).
> When you are having the problem, is the output of ‘mount’ different from the normal state?
Nope. mount output is the same. However umount is missing, as most commands cannot be found anymore.
> From what I can tell/guess, it seems possible that your partition table doesn’t quite correspond with the physical layout of the drives (well, the numbers on SDB7/SDB9 look way out).
I use the same kind of layout or even a more sophisticated one on a dozen machines running Linux and Unix. In an attempt to fix the problem, I recreated all the partitions from sdb7 to the end before installing openSUSE, since I wanted to resize sdb7, which was becoming too small for /usr with 11.2. I created the partitions with gparted under Ubuntu. Normally I would use Partition Magic under DOS (for historical reasons!), but my old version can now only create ext3 partitions, not delete existing ones. I guess it cannot read them since the default inode size has changed (I had the same problem with Unix and I had to patch some kernels in order to mount ext3 partitions).
> Check that the BIOS isn’t getting reset because the battery is dying…that could explain a lot if sometimes you were booting with the disk addressing mode set up incorrectly.
The BIOS looks OK. This machine cannot be that old and seems to use only LBA addressing (it doesn’t offer the choice of normal or large mode, as older mainboards did). I replaced the battery about two years ago.
And what about an fsck? When was that last done?
When this problem occurs, I first reboot into Ubuntu and do a manual fsck. It repairs the filesystem when necessary (sometimes it is not).
Missing files and directories are a sure sign of file system corruption. Since no one else is seeing this, there are only a couple of reasons it could be occurring.
I booted openSUSE in runlevel 3, ran top over ssh as well as tail -f /var/log/messages on another terminal. Nothing happened for about 2 hours. Since it became boring, I decided to start X, neither Gnome nor KDE but icewm (started with startx). As I expected, the filesystems on sdb7, sdb8 and sdb9 became unreadable after 10 minutes. However (this time) the / filesystem is still there, so all commands in /bin are still available and I could run “df -hl”, which produces this fascinating output:
All filesystems except sda11 and sda12 (which still look OK) were mounted at boot time.
The interesting thing is that sda13 (/home) which is on the first harddisk - as its name indicates - and only contains user stuff and config files also looks empty and shows a 16T size (!)
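A size of 16T is suspicious in itself. As a rough sketch of the arithmetic, assuming 4 KiB filesystem blocks (the block size is not stated in this thread; the function name is made up for illustration): 16 TiB is what a size report comes to when the block count is close to 2^32, that is, a 32-bit counter with nearly all bits set.

```python
BLOCK_SIZE = 4096   # assumption: 4 KiB filesystem blocks
TIB = 2**40


def size_in_tib(block_count, block_size=BLOCK_SIZE):
    """Filesystem size in TiB for a given number of blocks."""
    return block_count * block_size / TIB


# A 32-bit block counter with every bit set gives just under 16 TiB.
print(size_in_tib(2**32 - 1))

# For comparison: a whole 120 GB drive holds far fewer 4 KiB blocks.
print(120 * 10**9 // BLOCK_SIZE, "blocks on a 120 GB drive vs", 2**32 - 1)
```

Under that assumption, the 16T figure would fit a block count that has been overwritten with an all-ones style value rather than a real partition size.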
Do you still think it’s a hardware or even a partitioning issue with my second hard disk?
As I already mentioned, I didn’t have such problems before with 11.1 on the same hard disk, nor do I have filesystem errors under Ubuntu and Mandriva, which mount sdb10 (/tmp) and sdb11 (/srv) at boot time. I have been running Mandriva for the past 3 or 4 days with Gnome (!) without any filesystem error.
mount output still looks fine.
“ls /boot” output looks empty, although “/boot” is not on a separate partition and “/” still looks reasonable.
I guess the best thing to do would be to reinstall 11.1 on the same partitions and see if the problem persists. If it does not, I would say that this openSUSE kernel doesn’t like the IDE controller on that mainboard. Before doing that, I’ll try to plug the two harddisks on the Promise onboard controller and see if it makes a difference …
If the file systems were corrupted, they would remain corrupted (or get repaired) when rebooting into another Linux. But that is not necessarily the case. I just rebooted into Ubuntu, checked the filesystems and they were clean. They might or might not get corrupted when I have to brutally reset the machine (btw SysRq does NOT work after all), but at the time they appear empty under openSUSE, I guess they are not corrupted.
Did you try a repair? Before doing it though do a media check.
Your df check is just too odd. If the FS shows OK from another OS, I guess that it is not really a FS problem but that the kernel is losing track of stuff somehow.
What does df say if you run it first thing after a boot, preferably to runlevel 3?
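One way to make that comparison repeatable is to record the sizes right after boot and again later, and flag anything larger than the disk could hold. A minimal sketch, assuming Linux with /proc/mounts and taking the 120 GB drive size mentioned earlier in the thread as the limit; the function names are made up for illustration:

```python
import os

DISK_BYTES = 120 * 10**9   # the thread mentions two 120 GB drives


def mounted_dirs(path="/proc/mounts"):
    """Mount points of filesystems backed by /dev/... devices.

    Note: mount points containing spaces are escaped in /proc/mounts;
    this sketch does not unescape them.
    """
    with open(path) as fh:
        return [line.split()[1] for line in fh if line.startswith("/dev/")]


def fs_size(mount_point):
    """Total size in bytes as reported by statvfs (what df is based on)."""
    st = os.statvfs(mount_point)
    return st.f_blocks * st.f_frsize


def implausible(sizes, limit=DISK_BYTES):
    """Mount points whose reported size exceeds what one disk can hold."""
    return sorted(m for m, size in sizes.items() if size > limit)


if __name__ == "__main__":
    sizes = {}
    try:
        mounts = mounted_dirs()
    except OSError:
        mounts = []            # no /proc/mounts, e.g. not running on Linux
    for m in mounts:
        try:
            sizes[m] = fs_size(m)
        except OSError as err:
            print("cannot stat", m, err)
    print("implausible sizes:", implausible(sizes))
```

Run once in runlevel 3 right after boot and once after X has been up for a while; a filesystem that only shows up in the second list changed while the machine was running.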
You got it! That’s exactly the situation happening there. That explains why the trouble began about 10 minutes after starting X: xscreensaver and glslideshow started at that point.
Who wrote in this forum that there was no trouble with ATI cards on openSUSE?! ATI cards are a nightmare under 11.2. I gave up trying to get anything other than a black screen on an iMac with a Radeon HD 2400 XT and finally reinstalled 11.1. Now on that old old computer with that old old All In Wonder Radeon 7500 (RV200), I end up with a garbage filesystem like I’ve never seen on any Linux before.
Thanks a lot for pointing to this explanation! That is an awesome bug report too.
This one is slightly earlier and seemed progressive at the time; since it is on the Debian list, it might be worth raising on the Bugzilla.
As for finding it, Google searching with the terms df 16tb hit it; I just figured it had to be significant.
Anyway, glad to have enlightened you, but without a bug report you (and others) may be waiting until the upstream fix is mainstream (if it is there). I noticed there had been chats on #radeon implying they could see where the potential problem may lie.
I did a fresh install of openSUSE 11.2 last night. I glided through setup & it detected all my hardware correctly (except my printer, a Canon Pixma MS850). No issues (yet) with my ATI Radeon 7500, maybe because I disabled the screensaver. I don’t like screen savers & prefer my screen to power off after 30 min of inactivity.
Wow…I read bug report #550562 & it certainly cannot be ignored. Although I do not use screensavers, what happens if other apps invoke OpenGL? I would be toast. Hmmm…perhaps best for me to go back to the drawing board…
Unless I’m mistaken, using the proprietary driver should fix it. Those bug reports seemed to be based on radeon and maybe radeonhd, so I would have thought fglrx.
This doesn’t seem that common and seems tricky to distinguish, though I guess some other ATI problems are masquerading as something else. But I didn’t find a bug report upstream or on the Novell one either.
This is specifically related to file system corruption, and what I see as common is df showing 16tb when it shouldn’t. (glxgears seems to trigger it as well, going by the other link.) The other bug report highlights this further, but last time I looked neither had moved on, and if they had it was on things like testing 2.6.33, which is still very bleeding edge. So if people want it fixed in the distro, they’ll need to file a bug report.