File system corruption due to NFS not unmounting properly?

Not quite sure where this particular question should be posted… networking or boot/login problems… maybe both.

My server seems to end up with a corrupt filesystem on EVERY reboot once I’ve mounted its NFS share on my laptop. During the boot process I’m told there are “Multiply claimed block” errors, and system repair from the DVD won’t fix them. I have to run fsck manually and let it fix somewhere between 10 and 100 errors.
Now, running fsck is a very lengthy process that leaves me unable to reach my server for hours… so it’s starting to annoy me quite a bit.

I think the corruption is caused by my laptop never shutting down properly, because for some reason it shuts down network access before unmounting the disks. (I suspect that’s what is going on, as it never successfully completes unmounting NFS and simply hangs on that part of the shutdown process.)

I tried # umount /home/Pascal/Shared but got: umount.nfs: /home/Pascal/Shared: device is busy
So I tried # fuser -c /home/Pascal/Shared, which returned nothing.
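In case it helps while hunting the shutdown hang: a lazy unmount detaches the mount point immediately and lets the kernel finish the cleanup once it stops being busy, so shutdown can’t stall on it. A minimal sketch (the path is the one from this thread; the mount check assumes a Linux /proc/mounts):

```shell
#!/bin/sh
# Check /proc/mounts before unmounting; mount points there are
# surrounded by single spaces, so the padded pattern matches exactly.
is_mounted() {
    grep -q " $1 " /proc/mounts
}

# "umount -l" (lazy) detaches the mount point right away and cleans up
# once it is no longer busy, instead of hanging on "device is busy".
lazy_umount() {
    if is_mounted "$1"; then
        umount -l "$1"
    else
        echo "not mounted: $1"
    fi
}

lazy_umount /home/Pascal/Shared
```

Hooking something like this in before the network goes down would at least sidestep the hang, even if it doesn’t explain the corruption.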

So…
A) How do I get my laptop to properly unmount the share so it can shut down cleanly?
B) Is this the cause of the corruption, and if not, what is?

For both Systems:
openSUSE 11.1 x86_64, KDE4.x, ext3 filesystem

Shutdown log… well, I don’t know where the log is stored, so I typed it out by hand:

Shutting down Name Service Cache Daemon
Shutting down hddtemp daemon for /dev/hda
Saving random seed
**Shutting down the NetworkManager**
Shutting down (remotefs) network interfaces:
  wlan0 device: Intel Corporation PRO/Wireless 4965 AG or AGN [Kedron] Network Connection (rev 61)
**Shutting down NFS client services**:Shutting down the HAL daemon.

Shutting down NetworkManager before shutting down NFS client services? Smells like a design error.

Wonder what would happen if you changed the order yourself? It’s not easy to do, but it can be done. Go into the /etc/init.d/nfs script and try changing the “Required-Stop” parameters. Here’s what my nfs script looks like:


### BEGIN INIT INFO
# Provides:       nfs
# Required-Start: $network $portmap
# Required-Stop:
# Default-Start:  3 5
# Default-Stop:
# Short-Description: NFS client services
# Description:    All necessary services for NFS clients
### END INIT INFO

In other words, change “Required-Stop” to the same spec:


# Required-Stop: $network $portmap

From what I understand about init scripts, that should do it. Worth a try, anyway … and it’s easy to fix if I’m wrong! :slight_smile:

Unless I misunderstood it already seems to be there:

### BEGIN INIT INFO
# Provides:       nfs
# Required-Start: $network $portmap
# Required-Stop:  $network $portmap
# Default-Start:  3 5
# Default-Stop:   0 1 2 6
# Short-Description: NFS client services 
# Description:    All necessary services for NFS clients
### END INIT INFO

The NFS client ought not to be able to cause filesystem corruption by failing to unmount. Otherwise you could never run a large NFS network reliably (especially in the old days, when Win9x and WinNT boxes would crash 2-5 times a day on average). NFS was designed as a stateless protocol to make it robust, at the cost of write performance, since writes were synchronous (though protocol enhancements have reduced that).

This smells like a serious NFS bug that you ought to be reporting.

Perhaps you could mount the share via NFS ro (read-only) for a while and push updates back with rsync -e ssh or scp in the meantime, until it’s fixed, and see if that helps.

I kinda need them read/write, as my Eclipse projects are on there and I can’t get any work done otherwise.

I can’t say with certainty that it’s due to NFS not unmounting. It’s just that it’s the only file-related thing not going properly, and every reboot I seem to be forced to run fsck.
Quick reboots where I don’t bother mounting any of the shares via NFS go fine, which is how I reached that conclusion.

Come to think of it… all shares are ‘double’ shared, via both Samba and NFS, but I doubt Samba is the issue, as I’m pretty much the only one accessing them and I haven’t used Samba (or Windows) for weeks.

SMART is telling me the disk is fine, by the way, and it never hangs or anything, so I think the hardware at least is not to blame.

Hi,
I have the same problem with CIFS mounts. I had to modify the scripts so that the folders are mounted after the network starts and unmounted before the network stops. To do that I set:

Required-Start: $network $remote_fs

Should-Stop: network-remotefs

and it works fine. Anyway, this is a bug in 11.1, where we are unable to deactivate the RUN_PARALLEL option in sysconfig.

Michel.
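For reference, a complete header with Michel’s changes in place might look like the sketch below. Only the Required-Start and Should-Stop lines are taken verbatim from his post; the Required-Stop line is my guess at matching his intent, so treat it as an assumption:

```shell
### BEGIN INIT INFO
# Provides:       nfs
# Required-Start: $network $remote_fs
# Required-Stop:  $network $remote_fs
# Should-Stop:    network-remotefs
# Default-Start:  3 5
# Default-Stop:   0 1 2 6
# Short-Description: NFS client services
# Description:    All necessary services for NFS clients
### END INIT INFO
```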

Well, it was worth a shot. I’m using 10.3 here at home, and it WASN’T in mine.

But if that’s the case, it SHOULD be dismounting the NFS before the network is shut down. The other suggestions here may be more fruitful. I don’t know anything else to suggest, because I don’t use NFS.

10-100 fsck problems seems excessive; I doubt NFS can cause that. Are you sure you don’t have a failing disk? How about not using NFS, doing a normal shutdown, and seeing whether you get errors to fix?

Chain rebooting goes fine; it’s after the machine has been on for a while (which by default means I’ve been using NFS, as nearly everything is stored on it) that I’m facing a nice error message again.
Disk seems to be fine, not making any strange sounds, not showing slowdowns and SMART is content with it as well.

I’ve also had errors on my software RAID5 array, though not recently, which seems odd… just the WD failing. In fact, my RAID array was damaged to the point where it ended up as ext2 (I recently converted it back to ext3, though).
It might also be that I simply didn’t notice the errors on the RAID array; as you might imagine, after seeing the errors so many times I just hold down the Y key and watch a wall of text fly by.

Although I can’t say with certainty it’s not hardware failure or that NFS is to blame, they are my main suspects.

It’s the error message “Multiply claimed block” that sounds like a software allocation issue. With a hardware failure you’d expect to see massive slowdowns and plenty of errors in /var/log/messages on the afflicted machine.
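One way to tell software allocation damage apart from a dying disk is a read-only fsck pass, which reports multiply-claimed blocks without changing anything. A sketch, run here against a throwaway image file so nothing real is at risk; on the server you would point the same invocation at the unmounted partition (the image path is made up):

```shell
#!/bin/sh
# Build a small disposable ext3 image so no real disk is touched.
dd if=/dev/zero of=/tmp/demo.img bs=1M count=8 2>/dev/null
mkfs.ext3 -q -F /tmp/demo.img

# -n answers "no" to every prompt (no modifications), -f forces a full
# check even if the filesystem looks clean. Against a real device this
# would be e.g.:  fsck.ext3 -n -f /dev/sdXN
fsck.ext3 -n -f /tmp/demo.img && echo "clean"
```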

Requiring a client to unmount a network filesystem properly is a serious flaw in a server system. Laptops get hibernated, machines die, networks suffer interruptions. Your server should plough on unaffected.

I really hope you have reported this one into Bugzilla.

Well, since I can’t say with certainty that NFS is to blame, I’d first have to get NFS shutting down properly… then repair the filesystem… and then see if the corruption recurs.

Mmmh, unless I messed up (I don’t think I did… I’d be getting an error of some kind otherwise?), it doesn’t seem to work.

This is what the section looks like now:

### BEGIN INIT INFO
# Provides:       nfs
# Required-Start: $network $remote_fs
# Required-Stop:  $network $portmap
# Should-Stop:    network-remotefs
# Default-Start:  3 5
# Default-Stop:   0 1 2 6
# Short-Description: NFS client services 
# Description:    All necessary services for NFS clients
### END INIT INFO

Does it change anything at all? KWrite displays it in the same color as ‘regular’ comments.
I’d expect to have to put a dollar sign in front of it… it seems variables are referenced like that (similar to PHP).

Need some guidance I guess as I’ve zero know-how about this type of file.

They are regular comments, but SUSE looks for that section and uses it to determine how (and when) to start and stop the service.

Just like the “#!/bin/bash” at the start of a shell script: technically it’s a comment, but it also tells the system which interpreter to use.
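One thing that may explain why the edit appears to change nothing (an assumption on my part, but regenerating the links is how insserv-based systems pick up header changes): the header is only parsed when the runlevel links are rebuilt, so after editing it you would re-run insserv:

```shell
# Rebuild the start/stop symlinks from the LSB headers (openSUSE):
insserv /etc/init.d/nfs

# The computed stop order is encoded in the K## numbers of the links
# for the current runlevel:
ls /etc/init.d/rc5.d/ | grep -i nfs
```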

I’d expect to have to put a dollar sign in front of it… it seems variables are referenced like that (similar to PHP).

Seems like that to me as well, simply because it has the $ in the other cases. But like you, I’m anything but an expert on these scripts. I just use them. :slight_smile:

Bumping old topics is always fun, huh…

Anyway, I think I found the culprit… or at least I hope I did.
I changed the KTorrent settings to “Reserve disk space before starting a torrent”, along with “Fully reserve disk space (avoids fragmentation)” and “Disk space reservation method: Filesystem Specific”.

So far it seems to have solved my problem, although I’ll need to test it after it’s been running for a week or so to be sure.

So much for that, yet another file system corruption.

Really puzzled as to what is causing it; not only has the hard disk been replaced, but so were the motherboard, RAM, and CPU… it’s practically a completely new system.

It has to be some piece of software acting dodgy, but I’m clueless as to which.
Really annoying, as every reboot means losing access to pretty much all of my files for about an hour.

Reinstalled, without using the same home directory… STILL no luck.

Upon a reboot I got:
udevd-event[1029]: ‘/sbin/modprobe’ abnormal exit

Adding 2096472k swap on /dev/sde2. Priority:-1 extents:1 across:2096472k FAILED