SuSE 12.1 NFS Server needs frequent restarting

Just finished upgrading my home server to openSUSE 12.1. My workstation(s)
are still on 11.4 but the NFS shares from the server keep hanging the server
at unpredictable intervals.
I’m not seeing any specific messages except that the workstations report
that the server is inaccessible and hang until I restart the nfsserver.
It does seem that the nfs service dies after a certain amount of writing to
the share, but it’s not easy to predict.

On the server (zanshin) - I’ve got the system set up with nfsv4 enabled and

<code>
zanshin:~ # cat /etc/exports
/data 192.168.9.0/24(fsid=0,crossmnt,rw,root_squash,sync,no_subtree_check)
/secure 192.168.9.0/24(rw,root_squash,sync,no_subtree_check)
</code>

On the workstations I’ve tried with and without nfsv4 enabled and the system
hangs either way.
From /etc/fstab …
<code>
192.168.9.254:/secure /secure nfs defaults 0 0
192.168.9.254:/data /data nfs defaults 0 0
</code>

It must be something going down on the server since, if one workstation
hangs with a lost connection, other workstations also hang until I issue
rcnfsserver restart on zanshin.

One suggestion I’ve seen on the web is to add bg,soft,intr to the workstation
fstab entries, which I’m currently trying on one workstation and, so far,
this seems to be working, but I gather that soft risks data corruption on writes.
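For reference, a sketch of what that modified entry looks like (same server and mount point as above; whether these are sensible options is exactly what’s in question):

```text
192.168.9.254:/secure   /secure nfs     defaults,bg,soft,intr 0 0
```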

Can anyone throw any light on what could be causing this, and does anyone
have any suggestions for a cure?


Alan

Changing to NFSv4 in Yast is not enough. On the clients, change “#DefaultVers=4” to “DefaultVers=3” in /etc/nfsmount.conf
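That is, on each client, uncomment and edit the line in /etc/nfsmount.conf (section name as shipped; check your own file for the exact spelling):

```ini
[ NFSMount_Global_Options ]
# Was: #DefaultVers=4
DefaultVers=3
```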

Next time, please post any output between CODE tags.

The client machine fstab entries have not been changed; only the server has been rebuilt, to add new hardware and bring it up to openSUSE 12.1 (it was on 11.3 before), so surely it has to be something to do with the server.

On the subject of the CODE tags - I’m usually working from a news reader (knode) so I don’t have the prompts for the code tag syntax. I’m working from the web interface ATM so now I know the syntax.

Ok - the details …

I’ve got a home network set up with one box as the main file|print|LDAP|Apache|nfs|samba|DNS|DHCP server, plus a couple of workstations running openSUSE 11.4 and SWMBO’s Windoze XP box.

The main server (ZANSHIN) has been running openSUSE 11.3 and I’ve rebuilt it to install a 120 GB SSD to replace an old SCSI HDD and an ATA HDD, plus adding a couple of trayless SATA caddies for backups. It also has a pair of 2 TB SATA HDDs running as RAID 1 and managed with LVM.
It’s got a pair of ethernet ports bonded as bond0.
As part of the rebuild I’ve installed openSUSE 12.1 and spent the last couple of days making sure that all the services have been rebuilt/restored correctly.
At first the nfs service seemed to be perfectly OK - just using the defaults from YaST.
I’ve been using nfs since the days of DEC Ultrix and never really had any problems but now the service keeps hanging.

There are 2 nfs shares
/data - a large LV for movies, music, photos etc.
/secure - an encrypted LV holding user homes and groups used to synchronize with workstation homes using unison.

At fairly unpredictable intervals a workstation will hang, locking up the nfs service on the server and any other workstation as well until the nfs service is restarted.
One definite killer is using unison to sync a home on a workstation with the nfs-mounted /secure share, so it seems that a lot of file access throws the glitch.

Running rcnfsserver status when the lock-up occurs shows the service running and no error messages:


zanshin:~ # rcnfsserver status
nfsserver.service - LSB: Start the kernel based NFS daemon
          Loaded: loaded (/etc/init.d/nfsserver)
          Active: active (running) since Tue, 26 Jun 2012 08:37:33 +0100; 23min ago
         Process: 26528 ExecStop=/etc/init.d/nfsserver stop (code=exited, status=0/SUCCESS)
         Process: 26548 ExecStart=/etc/init.d/nfsserver start (code=exited, status=0/SUCCESS)
          CGroup: name=systemd:/system/nfsserver.service
                  ├ 26569 /usr/sbin/rpc.idmapd
                  ├ 26575 /usr/sbin/rpc.statd --no-notify
                  └ 26577 /usr/sbin/rpc.mountd

and there is nothing in the syslog to indicate a problem on the server, and the syslog on the workstations just notes that the server is inaccessible.

At some points I’m having to restart the nfs service every couple of minutes and I haven’t any good ideas where to go from here.

From my original post - the server is running with NFSv4 enabled and /etc/exports entries of


zanshin:~ # cat /etc/exports
/data   192.168.9.0/24(fsid=0,crossmnt,rw,root_squash,sync,no_subtree_check)
/secure 192.168.9.0/24(rw,root_squash,sync,no_subtree_check)

and the clients have (well, had - I’ve been trying various additional parameters, to no avail) fstab entries


192.168.9.254:/secure   /secure nfs     defaults 0 0
192.168.9.254:/data     /data   nfs     defaults 0 0

Any help, hints appreciated - I’m getting desperate.

Alan

Knurpht wrote:
> Changing to NFSv4 in Yast is not enough. On the clients, change
> “#DefaultVers=4” to “DefaultVers=3” in /etc/nfsmount.conf

Sorry, I’m not thinking too clearly but if the goal is to change to
NFSv4, why uncomment and set a line to v3?

The idea was to make it work on v3 first.

Another thing: is the server using systemd? If so, try booting it with System V’s sysvinit. To do so: hit F5 at boot, select System V, boot and see if the problem persists. If not, you need some adjustments to the services handled by systemd.

Knurpht wrote:

>
> Changing to NFSv4 in Yast is not enough. On the clients, change
> “#DefaultVers=4” to “DefaultVers=3” in /etc/nfsmount.conf
>
> Next time, please post any output between CODE tags.
>
>

For what it’s worth, the previous incarnation of the server was running
NFSv4

In the backup copy of /etc/sysconfig/nfs I have
NFS4_SUPPORT="yes"
as well as
NFS3_SERVER_SUPPORT="yes"


Alan

Knurpht wrote:

>
> djh-novell;2471338 Wrote:
>> Knurpht wrote:
>> > Changing to NFSv4 in Yast is not enough. On the clients, change
>> > “#DefaultVers=4” to “DefaultVers=3” in /etc/nfsmount.conf
>>
>> Sorry, I’m not thinking too clearly but if the goal is to change to
>> NFSv4, why uncomment and set a line to v3?
>
> The idea was to make it work on v3 first.
>
> Another thing: is the server using systemd? If so, try booting it with
> System V’s sysvinit. To do so: hit F5 at boot, select System V, boot and
> see if the problem persists. If not, you need some adjustments to the
> services handled by systemd.
>
>

I don’t know what the default is - I’m guessing systemd as I’ve just tried
your suggestion to reboot and select System V and the boot sequence looks a
little different.
I’ve just tried to run unison on the client workstation and I’ve got
nfs: server 192.168.9.254 not responding, still trying
in syslog again with a system hang.


Alan

Fudokai wrote:

> Knurpht wrote:
>
>>
>> djh-novell;2471338 Wrote:
>>> Knurpht wrote:
>>> > Changing to NFSv4 in Yast is not enough. On the clients, change
>>> > “#DefaultVers=4” to “DefaultVers=3” in /etc/nfsmount.conf
>>>
>>> Sorry, I’m not thinking too clearly but if the goal is to change to
>>> NFSv4, why uncomment and set a line to v3?
>>
>> The idea was to make it work on v3 first.
>>
>> Another thing: is the server using systemd? If so, try booting it with
>> System V’s sysvinit. To do so: hit F5 at boot, select System V, boot and
>> see if the problem persists. If not, you need some adjustments to the
>> services handled by systemd.
>>
>>
>
> I don’t know what the default is - I’m guessing systemd as I’ve just tried
> your suggestion to reboot and select System V and the boot sequence looks
> a little different.
> I’ve just tried to run unison on the client workstation and I’ve got
> nfs: server 192.168.9.254 not responding, still trying
> in syslog again with a system hang.
>

Ok - just tried a reboot and specifically selected systemd and still have
the same problem.
How do I find out which is the default?

Alan

On 2012-06-26 09:56, fudokai wrote:

> On the subject of the CODE tags - I’m usually working from a news
> reader (knode) so I don’t have the prompts for the code tag syntax.

You write “CODE” inside square brackets, without quotes, in a line at the
start of the section. At the end of it you write a similar line, but with
“/CODE” instead.
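So, for example, typing this in a post (the command line is just an illustration) renders as a monospaced block on the web interface:

```text
[CODE]
zanshin:~ # rcnfsserver status
[/CODE]
```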


sample code from the news interface.


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

On 2012-06-26 15:28, Fudokai wrote:
> I’ve just tried to run unison on the client workstation and I’ve got
> nfs: server 192.168.9.254 not responding, still trying
> in syslog again with a system hang.

unison uses ssh.


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

On 2012-06-25 16:39, Fudokai wrote:
> Just finished upgrading my home server to openSUSE 12.1.

Which method?

Online upgrade method
Offline upgrade method (http://en.opensuse.org/SDB:Offline_upgrade)

?


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

Carlos E. R. wrote:
> On 2012-06-26 15:28, Fudokai wrote:
>> I’ve just tried to run unison on the client workstation and I’ve got
>> nfs: server 192.168.9.254 not responding, still trying
>> in syslog again with a system hang.
>
> unison uses ssh.

Perhaps the ‘local’ hierarchy that unison is trying to synchronise
actually contains NFS-mounted files

Carlos E. R. wrote:

> On 2012-06-25 16:39, Fudokai wrote:
>> Just finished upgrading my home server to openSUSE 12.1.
>
> Which method?
>
> Online upgrade method
> Offline upgrade method (http://en.opensuse.org/SDB:Offline_upgrade)
>
> ?
>
Hardware rebuild and fresh install (DVD) + restores from backups of previous
system


Alan

Dave Howorth wrote:

> Carlos E. R. wrote:
>> On 2012-06-26 15:28, Fudokai wrote:
>>> I’ve just tried to run unison on the client workstation and I’ve got
>>> nfs: server 192.168.9.254 not responding, still trying
>>> in syslog again with a system hang.
>>
>> unison uses ssh.
>
> Perhaps the ‘local’ hierarchy that unison is trying to synchronise
> actually contains NFS-mounted files

The client machine user/unison configuration is unchanged from before the
server upgrade.
I’m running unison to sync the local home with a copy of home on an nfs
mounted share (192.168.9.254:/secure) so ssh doesn’t come in to it

I’m mainly using unison to test the nfs mount because it’s one I know will
cause the hang.
If I don’t use unison I’ve got to play with the box for a while before it
hangs.

I did want to get the server working properly first but maybe I should just
upgrade one of the clients to openSUSE 12.1 (currently 11.4) to match and
see if the problem goes away.


Alan

On 2012-06-26 16:42, Fudokai wrote:

> Hardware rebuild and fresh install (DVD) + restores from backups of previous
> system

So, you did not do an upgrade in the openSUSE parlance.
Did those restore operations include configuration files?


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

Carlos E. R. wrote:

> On 2012-06-26 16:42, Fudokai wrote:
>
>> Hardware rebuild and fresh install (DVD) + restores from backups of
>> previous system
>
> So, you did not do an upgrade in the openSUSE parlance.
> Did those restore operations include configuration files?
>
Those that worked anyway :-)

Ok - so, strictly speaking, it was a fresh install. Do we have a term for a
fresh install that’s also an upgrade from an earlier system?

I mostly restored specific configuration files and dumps - /etc/exports,
smb.conf, *.sql, *.ldiff and the like. I had to do some fiddling to get
Samba working etc. and, although I did start off by simply restoring
/etc/exports, after I started having problems I regenerated it from scratch
using YaST.

If you’re looking for the old /etc files to cross-check, I still have them
of course.


Alan

On 2012-06-27 09:22, Fudokai wrote:
> Carlos E. R. wrote:

>> So, you did not do an upgrade in the openSUSE parlance.
>> Did those restore operations include configuration files?
>>
> Those that worked anyway :-)
>
> Ok - so, strictly speaking, it was a fresh install. Do we have a term for a
> fresh install that’s also an upgrade from an earlier system?

Not that I know :-)

The thing is, an upgrade does some checking of those configuration files.
Sometimes you get the new file and the old is saved aside, and sometimes
you get the old file and the new is saved aside. In both cases, you can
compare the new with the old.

> I mostly restored specific configuration files and dumps - /etc/exports,
> smb.conf, *.sql, *.ldiff and the like. I had to do some fiddling to get
> Samba working etc. and, although I did start off by simply restoring
> etc/exports, after I started having problems I regenerated it from scratch
> using YaST.

And it still does not work after regeneration. Mmmm.

Did you replace pam files? Those are tricky.

Have you checked for errors in the network hardware? They would be listed
under “ifconfig”.


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

Carlos E. R. wrote:

> On 2012-06-27 09:22, Fudokai wrote:
>> Carlos E. R. wrote:
>
>
>>> So, you did not do an upgrade in the openSUSE parlance.
>>> Did those restore operations include configuration files?
>>>
>> Those that worked anyway :-)
>>
>> Ok - so, strictly speaking, it was a fresh install. Do we have a term for
>> a fresh install that’s also an upgrade from an earlier system?
>
> Not that I know :-)
>
> The thing is, an upgrade does some checking of those configuration files.
> Sometimes you get the new file, and the old is saved aside, and some times
> you get the old file, and the new is saved aside. In both cases, you can
> compare the new with the old.
>
>
>> I mostly restored specific configuration files and dumps - /etc/exports,
>> smb.conf, *.sql, *.ldiff and the like. I had to do some fiddling to get
>> Samba working etc. and, although I did start off by simply restoring
>> etc/exports, after I started having problems I regenerated it from
>> scratch using YaST.
>
> And it still does not work after regeneration. Mmmm.
>
> Did you replace pam files? Those are tricky.
>
>
> Have you checked for errors in the network hardware? They would be listed
> under “ifconfig”.
>

Ok - It looks like the problem is unison :-(

I’m pretty certain I’ve had the system lock up at other times (i.e. not
running unison), but I know unison locks it up. In fact it locked up when I
was on another machine which doesn’t have the unison-based syncing set up on
it - but maybe the unison-based box (which was running at the time) was
responsible. I swear the unison scripts were only running from .kde4/env and
.kde4/shutdown but …

One of the reasons I’d been using unison with nfs mounted shares is because
it gets picky about incompatible versions if run over the network, and this
was supposed to prevent that problem by making the two systems ‘local’ (the
version on the new server is different from the one on the workstation).

Since it appeared that the lock-up happened when there was fairly
intense file activity, I fired off:


find . -type f -exec stat -t >/dev/null {} \;

from the roots of the mounted shares and that’s not caused the lockup.

I’ve now set up a script to use rsync instead of unison and that’s running
without a problem.
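For what it’s worth, the replacement is only a thin wrapper around rsync; a minimal sketch, with paths and excludes that are illustrative rather than my exact script:

```shell
#!/bin/sh
# One-way sync of the local home onto the NFS-mounted /secure share.
# SRC must end with a trailing slash so rsync copies the contents,
# not the directory itself. DEST is just an example path.
SRC="$HOME/"
DEST="/secure/homes/$USER/"

# -a        archive mode (permissions, times, symlinks)
# --delete  remove files from DEST that no longer exist in SRC
rsync -a --delete --exclude '.cache/' "$SRC" "$DEST"
```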

So - I’ll log this one down to unison for the moment.

Shame really - the rsync solution, being one-way, is fine for a single
logon, but the unison solution would have allowed the same user to have
multiple logons. I shouldn’t try to be too clever :-P

Thanks for everybody’s help on this - I really appreciate it :-)


Alan

Fudokai wrote:
> Ok - It looks like the problem is unison :-(

Well, it looks like unison is exposing the problem, but not causing it.
It shouldn’t be possible for an application to lock up an NFS server.

> One of the reasons I’d been using unison with nfs mounted shares is because
> it gets picky about incompatible versions if run over the network and this
> was supposed to prevent that problem by making the two systms ‘local’ (the
> version on the new server is different from the one on the workstation)

I’ve used that configuration myself, for similar reasons, but it was so
long ago that there’s no value to be gained by straining my memory :-(

> I’ve now set up a script to use rsync instead of unison and that’s running
> without a problem.
>
> So - I’ll log this one down to unison for the moment.

From memory (unchecked), unison uses librsync, which isn’t quite the
same as rsync, so it might be worth testing some other application that
uses librsync, and/or googling and experimenting with other versions.

> Shame really - the rsync solution, being one-way, is fine for a single
> logon, but the unison solution would have allowed the same user to have
> multiple logons. I shouldn’t try to be too clever :-P

As well as solving your problem, it looks to me like you’d also be
helping to solve a problem in NFS, if you do carry on investigating.