openSUSE 11.1 autofs issues

We have a 150-node cluster that was running openSUSE 10.2 with no problems. To allow users to reach their autofs-mounted home directories (via NIS) across all the nodes, we set /proc/sys/sunrpc/max_resvport to 5000, which worked great the whole time. After limited, albeit successful, testing we upgraded the cluster to openSUSE 11.1, and now we get these errors with max_resvport set to 5000:

kernel: lockd_up: makesock failed, error=-13

Users are randomly unable to get to their home directories, and the problem gradually gets worse. We can get rid of the errors by dropping max_resvport back to 1023, but that defeats the purpose, because then we run into the reserved-port limit on NFS mounts. I’ve tried both portmap and rpcbind (also with -i for insecure mode), but I get the same behavior regardless. I don’t think it’s a firewall issue: the internal network is wide open, and the external side (through a head node) allows all NFS and NIS traffic through. Any ideas would be greatly appreciated.
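For anyone following along: error=-13 is -EACCES (permission denied), and the port-range knob mentioned above can be set either directly through /proc or via sysctl. A sketch, using the value from the post; persisting it in /etc/sysctl.conf is an assumed (but standard) extra step:

```shell
# Raise the upper bound of the reserved-port range the kernel's
# SUNRPC client draws source ports from (the default upper bound
# is 1023). Must be run as root.
echo 5000 > /proc/sys/sunrpc/max_resvport

# Equivalent sysctl form; adding "sunrpc.max_resvport = 5000" to
# /etc/sysctl.conf makes the change survive a reboot.
sysctl -w sunrpc.max_resvport=5000
```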

On Tue, 2009-04-28 at 23:26 +0000, steelah1 wrote:
> We have a 150-node cluster that was running openSUSE 10.2 with no
> problems. To allow users to reach their autofs-mounted home
> directories (via NIS) across all the nodes, we set
> /proc/sys/sunrpc/max_resvport to 5000, which worked great the whole
> time. After limited, albeit successful, testing we upgraded the
> cluster to openSUSE 11.1, and now we get these errors with
> max_resvport set to 5000:
>
> kernel: lockd_up: makesock failed, error=-13
>
> Users are randomly unable to get to their home directories, and the
> problem gradually gets worse. We can get rid of the errors by
> dropping max_resvport back to 1023, but that defeats the purpose,
> because then we run into the reserved-port limit on NFS mounts. I’ve
> tried both portmap and rpcbind (also with -i for insecure mode), but
> I get the same behavior regardless. I don’t think it’s a firewall
> issue: the internal network is wide open, and the external side
> (through a head node) allows all NFS and NIS traffic through. Any
> ideas would be greatly appreciated.

(this isn’t necessarily a direct answer, but an architectural
consideration)

From a scalability perspective, we realized that the idea of
mounting a separate home dir for EVERY user was not good. So our
mounts are broken out into areas containing many home dirs (let’s
say 200 home dirs per area, though it could be thousands). Thus
instead of mounting /export/home/<username>, we mount /export/home1,
which contains the home dirs for a plethora of usernames. Does that
make sense?

Consider the following (actual implementation here):

$ ypcat -k auto.master
/qahome auto.qahome bg,intr
/qashr auto.qashr bg,intr
/shr auto.shr bg,intr

We push a map called auto.qahome:

$ ypcat -k auto.qahome
nas01 qanas01:/export/qahome
nas02 qanas02:/export/qahome

(qanas01 and qanas02 should be fully qualified hostnames)

$ echo $HOME
/qahome/nas01/ccox

$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda2 82716124 2316692 80399432 3% /
udev 257500 84 257416 1% /dev
qanas01:/export/qahome
209708800 462400 209246400 1% /qahome/nas01

Now, if somebody were logged in and their home dir was under
/qahome/nas02, you’d see a mount for that one as well.
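To see the scaling benefit on a node, one can count the active NFS mounts; with per-area maps that number tracks the handful of areas in use rather than the number of logged-in users. A sketch (this command is not from the original thread):

```shell
# List current NFS mounts and count them. With per-user maps a busy
# node could carry one mount per logged-in user; with per-area maps
# it carries at most one mount per area (e.g. /qahome/nas01,
# /qahome/nas02).
mount -t nfs | wc -l
```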

Hope that helps,
Chris

Thanks for the response, but changing the architecture at this point isn’t really an option, and it was running great before the upgrade. I will definitely plan on implementing that idea in the future.
Update: I have found that the cause of the ypcall and RPC errors is nscd dying. If I restart it, I no longer get ypcall errors/timeouts, and users can log in and run their jobs. Evidently, without the cache daemon, too many requests were going out over the network. Running nscd with -d enabled, I can see it segfaulting when it dies, so something is crashing it. I’ll bump up the debug log level and keep digging.

FYI, I’m also running Sun Grid Engine 6.2u2, which wasn’t working until the nscd restart workaround. If anyone has any ideas as to what I can look for with the nscd crashes, or anything else, please let me know. Thanks.
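A few generic steps that may help narrow down the nscd crash; this is a sketch, and the init-script path is an assumption based on a stock openSUSE 11.1 (sysvinit) install:

```shell
# Restart the cache daemon when it dies (the workaround from above).
/etc/init.d/nscd restart

# Run nscd in the foreground with debug output so the segfault is
# visible on the terminal.
nscd -d

# Invalidate one cache table at a time to see whether a particular
# table (passwd, group, hosts) is the one that triggers the crash.
nscd -i passwd
nscd -i group
nscd -i hosts

# Allow core dumps so the segfault leaves a core file to inspect.
ulimit -c unlimited
```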