When accessing an NFS mount for a large (200MB+) file transfer, the transfer starts rapidly, then becomes slower and slower until it hangs. On several occasions, it has frozen the client machine. Both client and server are set to default to nfs version 3. Slowdown and hang also occur when connecting to FreeBSD NFS mounts.
Presumably (I hope), there is some sort of configuration for the client that needs to be set. Does anybody have any idea what should be changed in the configuration? This worked out of the box in OpenSUSE 11.0.
I don’t use NFS myself, but I found an interesting read at the following site. It even includes information on getting status and other diagnostics that might help pin down the problem.
Thanks, I’ll take a look at that. It’s just disturbing that something that used to work out of the box – did I say that already? – seems to be pretty severely broken in 11.3.
So, there were a few issues caused by the 2.6.34 kernel in openSUSE 11.3, along with the decision to stop shipping sax2. Some found openSUSE 11.2 a better choice, and others found that upgrading the kernel to 2.6.36 helped. In my case, I was trying to use an external USB 3 hard drive, and the default kernel in openSUSE 11.3 would not permit a USB 3 drive to be mounted from the fstab file. Upgrading to the then-new 2.6.35 kernel was the fix; since then I have upgraded as far as 2.6.37-rc8. The kernel is where the device drivers live, and the newer the kernel, the better the hardware support. Obviously the kernel is not the only place a problem might exist, as the desktop selected is almost half the game on most distros. Nonetheless, upgrading your kernel might help. I suggest doing more research on NFS right now, but I do have a script that can help upgrade your kernel; message #17 has the most recent version of SAKC.
Experimenting with the read and write buffer sizes (rsize and wsize in /etc/nfsmount.conf) makes a HUGE difference. The default value (8k) is pretty much unusable on my local network, probably because of the ancient equipment involved. Here’s the difference between a 2k and a 4k buffer size:
192.168.1.200:/usr/local/share/common on /mnt type nfs (rw,rsize=2048,wsize=2048)
grond:~ # time dd if=/dev/zero of=/mnt/testfile bs=16k count=16384 | tee -a /tmp/nfstests
16384+0 records in
16384+0 records out
268435456 bytes (268 MB) copied, 169.502 s, 1.6 MB/s
real 2m49.579s
user 0m0.012s
sys 0m1.672s
192.168.1.200:/usr/local/share/common on /mnt type nfs (rw,rsize=4096,wsize=4096)
grond:~ # time dd if=/dev/zero of=/mnt/testfile bs=16k count=16384
16384+0 records in
16384+0 records out
268435456 bytes (268 MB) copied, 15003.7 s, 17.9 kB/s
real 250m3.790s
user 0m0.120s
sys 0m3.120s
Yes, 3 minutes vs. over 4 hours for the same 256 MB.
Why are you using such small block sizes? This isn’t the 1990s anymore. Let it default to 32kB and use TCP instead of UDP. Perhaps you have old config values which are not appropriate nowadays.
Your equipment must be very ancient, or you have a switch that is about to die. I use 32k read/write buffers in both directions (Linux to FreeBSD, FreeBSD to Linux) and copy big files at between 50 and 95 MB/s with common network cards and switches, albeit over cat6 cable. If you had to reduce the buffer below the default (which is not too high), you probably have a bottleneck in your LAN.
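For anyone wanting to try that, a mount along those lines might look like this in /etc/fstab (a sketch only; the export path is the one quoted earlier in this thread, so substitute your own):

```
# NFSv3 over TCP with 32k read/write buffers
192.168.1.200:/usr/local/share/common  /mnt  nfs  rw,proto=tcp,rsize=32768,wsize=32768  0  0
```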
On 2011-01-03 15:06, dcfleck2 wrote:
>
> Experimenting with the read and write size buffers (Rsize and Wsize in
> /etc/nfsmount.conf) make a HUGE difference. The default value (8k) is
> pretty much unusable on my local network, probably because of the
> ancient equipment involved. Here’s the difference between a 2k and a 4k
> buffer size:
…
> Yes, 3 minutes vs. over 4 hours for the same 250 MB.
Wild guess: you have a high error rate that makes the network retry each
packet several times. With smaller packets, some may arrive intact; with big
packets, each one has to be resent in full. Plus, using UDP means that it is the
NFS layer that has to ensure correct transmission. With TCP it would be the
network layer.
Run ifconfig as root, after trying a big file, and look at the error figures.
If I’m right, another wild guess: your cables are rated for 10 Mbit/s, not
100 Mbit/s. Wild guess #3: wrong twisted pairs. Telephone cable, not
network. I have seen it happen.
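If you’d rather not hunt through ifconfig output, here is a one-liner (a sketch; it assumes the Linux /proc/net/dev layout, which is the same data ifconfig reports) that prints the RX/TX error counters per interface:

```shell
# /proc/net/dev: two header lines, then one line per interface.
# Replace the ':' after the interface name with a space; then
# field 4 is RX errors and field 12 is TX errors.
awk 'NR > 2 { sub(/:/, " "); printf "%s rx_errs=%s tx_errs=%s\n", $1, $4, $12 }' /proc/net/dev
```

A non-zero count that keeps growing after a large transfer points at the cable, card, or switch port on that segment.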
–
Cheers / Saludos,
Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)
This might also be the case if you’re using fiber and connect your optical interfaces with multimode fiber (usually orange or green coating) where in fact they should be connected with singlemode fiber (usually yellow or blue coating), or the other way around. Broken fiber might also cause this (it’s quite easy to break it while closing the rack, for example). I’ve seen this problem make accessing CIFS slow, to say the least, on a perfectly well-configured network.
Thanks for all the responses. In truth, there could be any number of failure points here – some of the equipment involved is, in fact, from the 1990’s - none is less than 7 or 8 years old. The bits have to travel via a wireless card using ndiswrapper to a router, then across 2 old lengths of cat5 joined by a hub to a very old ethernet card. So it doesn’t surprise me at all that the buffer size needs to be ridiculously small – I’m just glad I can get it configured so that it is usable.
The ability to keep computers functional with very old hardware is one of the main reasons I use Linux.
On 2011-01-04 03:06, dcfleck2 wrote:
>
> Thanks for all the responses. In truth, there could be any number of
> failure points here – some of the equipment involved is, in fact, from
> the 1990’s - none is less than 7 or 8 years old. The bits have to
> travel via a wireless card using ndiswrapper to a router, then across 2
> old lengths of cat5 joined by a hub to a very old ethernet card. So it
> doesn’t surprise me at all that the buffer size needs to be ridiculously
> small – I’m just glad I can get it configured so that it is usable.
I would still check for errors on each sector of that transmission path,
and replace the cable or link if the error rate is high.
–
Cheers / Saludos,
Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)
If you are wondering why it shouldn’t work at least as well as before, let me tell you about the time my NFS performance actually got worse when I changed to a faster server. Eventually it turned out to be this: with a faster server, packets were now arriving at the low-grunt machine faster than the lousy NIC (an RTL8139, I think) could handle, and it started losing UDP packets, which led to timeouts and retransmission. This was fixed by putting a good NIC in the client. NFS over UDP is sensitive to this sort of thing.
Hi,
Coincidentally, I was just looking at this today, as a by-product of something else I’m doing. Let me suggest some reading and a “modern” practical approach to TCP windows…
First, you should be aware that today’s kernels support TCP window scaling (dynamic resizing). There is an optimum window size (not too small and not too large) for every file transfer, based on the size of the file, the bandwidth, and the transmission medium (losses due to an unreliable medium like wireless), all of which are affected by factors like congestion, TTL and packet loss. To that end, when you first open a TCP socket the values are very conservative, so as not to overload the connection, but the stack will gradually search for the optimum window size.
This is where it gets tricky… There are at least 11 different algorithms that can be selected to manage TCP window scaling, and the default SuSE install may not have configured what is best for your situation.
Note that since this article was written, a kernel patch supporting Microsoft’s CTCP (Compound TCP) algorithm has been published, which may be the most versatile (and best-performing) algorithm available.
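On any recent kernel you can check which of those algorithms are available and which one is active (a sketch; the exact list depends on how your kernel was built — cubic is the usual Linux default, and CTCP would only show up with that patch applied):

```shell
# Algorithms compiled into (or loaded by) this kernel
cat /proc/sys/net/ipv4/tcp_available_congestion_control
# The algorithm used for new TCP sockets
cat /proc/sys/net/ipv4/tcp_congestion_control
# As root you could switch, e.g.:
#   sysctl -w net.ipv4.tcp_congestion_control=cubic
```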