On 02/08/2018 02:26 PM, bishoptc wrote:
>
> that gave me some things to think about and try.
> Seems there was a major oversight on my part.
>
> The NFS server providing the 75TB /home has bonded 10GigE cards
> everything else has bonded GigE cards.
> Our local network admin suggested maybe the 10G is flooding the
> workstations w/ packets and they can’t keep up.
> That he suggested would be a spike in CPU not packets or other…which
> is what I'm seeing w/ Ksysguard.
I would be interested in hearing more about this theory, as it
sounds… strange to me. Sure, 10 Gb cards can go much faster than 1 Gb
equivalents, but what would the NFS box be sending that the 1 Gb boxes
were not requesting? Nothing, of course; NFS doesn’t just send packets to
boxes in some sort of DDoS attack by virtue of being on the same network.
If the clients do request something big, then the 10 Gb side could send it
much faster, but TCP windows are designed to prevent any single connection
from overwhelming the other end by letting the recipient tell the sender
how fast to send data; otherwise every dial-up user would be overwhelmed by
every non-dial-up server in the world when requesting any old download.
This is not meant to say your network admin is crazy; maybe, even without
packets arriving faster than a 1 Gb connection can handle, the offloading
of work to the NIC is not working, so the CPU itself has to process tons of
packets even at the 1 Gb rate, and that may cause things to stall a bit
while that CPU is pegged. I would hope that the other few-dozen CPUs could
take over, but hardware and disk I/O are pretty painful sometimes.
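If you want to test the offload theory rather than guess, something like the
following is harmless; eth0 is just a placeholder for whichever interfaces
are in the bond, and ethtool/proc are standard on any current distro:

    # show whether the usual offloads (TSO/GSO/GRO, checksumming) are enabled
    ethtool -k eth0 | grep -E 'segmentation|offload|checksum'
    # see how much work is landing in network softirq context per CPU
    grep -E 'NET_RX|NET_TX' /proc/softirqs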
> Thought I had reconfigured the server to ONLY have GigE but it was still
> live and I was sloppy in the reconfig.
As a note that is probably obvious, your server is handling many clients
at a time, so having the extra possible throughput is likely worthwhile so
it can handle their concurrent load.
> Today I’ve done a complete shutdown of the 10Gig cards and the problem
> seems to be resolved
> (it is intermittent so I’ll let things run for a day or two to see if
> that really did it)
I would be interested to know what the MTU is on the various boxes, both
before you reconfigured and now that you have reconfigured. Having it too
small on some boxes, or inconsistent between boxes, may cause a lot of
overhead as packets are split up here or there, e.g. going from a 9k max
size to 1.5k means each packet has to be fragmented into six pieces.
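As a rough sketch of how I would check that (the hostname is a placeholder),
compare the configured MTUs and then test whether a jumbo-sized packet
actually survives the path unfragmented:

    # configured MTU on every local interface
    cat /sys/class/net/*/mtu
    # 8972 = 9000 minus 20 bytes IP header and 8 bytes ICMP header;
    # -M do forbids fragmentation, so this fails if the path cannot take 9k
    ping -c 3 -M do -s 8972 nfs-server
    # the standard-size equivalent (1500 - 28) should always get through
    ping -c 3 -M do -s 1472 nfs-server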
> The question now is:
> Should I expect problems mixing a server with a 10GigE NIC and
> workstations w/ GigE NICs in the same switch/LAN?
> (ok networking is not my expertise and the admins are doing something to
> the switch that I’m not fully clued in about)
By virtue of being there? I do not see any reason why, but others may
have different experiences, particularly depending on the settings applied
to the boxes. These are not old hub networks (you cannot even have 1 Gb
connections through a hub per the Ethernet spec) where every packet is seen
by every other box on the network; the switch only forwards packets to the
right ports (unless it is misconfigured, so check that too). Sometimes
switches can be forced/tricked into hub-like flooding via attacks, but that
should not happen normally.
> Do I have to tune some networking parameters someplace to have the NFS
> client on GigE play well w/ NFS server on 10GigE?
I’d check MTU, but otherwise I do not think so. You could also put the 10
Gb stuff on another network entirely, but it seems like this should just
work. Looking at the offloading, maybe that is a big part of it, though I
would hope not with current hardware and defaults:
https://www.mail-archive.com/kde-bugs-dist@kde.org/msg172605.html
https://www.ibm.com/support/knowledgecenter/en/linuxonibm/com.ibm.linux.z.lgdd/lgdd_t_qeth_wrk_packing.html
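On the tuning question, before changing anything I would just record what
the client mounts negotiated today; nfsstat ships with nfs-utils, and I am
only suggesting looking, not changing:

    # show the options each NFS mount actually ended up with
    # (rsize, wsize, vers, proto, timeo, and so on)
    nfsstat -m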
> OTHER INFORMATION:
> intermittent: means it happens 3-5 times a day at seemingly random
> times
> temporary: means lasts less than 10 seconds (but it seems like forever)
Tracking over time may be useful. My own system, completely separate from
what you are seeing, sometimes feels like it locks up when I pull an ISO
from a fast-enough connection, not because either is faster, but just
because my own system’s I/O is busy doing things that apparently cause my
really old kernel to get congested. In my case that basically only
happens when I pull at Gb speeds from something local, but in your case
your systems all have /home on remote storage, and that system is (or was)
fast enough to keep up with them. Turning that throughput down may have
just kept the clients from taking on I/O they can technically handle, but
only at the expense of other, more interactive work. More on this below.
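On the tracking-over-time point, logging vmstat with timestamps for a day
or two costs almost nothing and lets you line the stalls up with the
freezes you feel; the log path is only an example:

    # one sample per second, each line stamped so it can be matched against
    # the times the UI locked up
    vmstat 1 | while read line; do
        echo "$(date '+%H:%M:%S') $line"
    done >> /tmp/vmstat.log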
> network lockup: truth be told it’s not necc. the network. the User
> Interface is what locks up.
> I still have pointer control s.t. can move the mouse but seems nothing
> else is responsive.
This is exactly what I see when I/O (disk in particular) is really high
on my system, though in my case it is very predictable. There may be all
kinds of services on the box that are pushing I/O at seemingly-random
times, like indexing the /home directory’s contents for searching in the UI.
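Since you mention Plasma below, the usual suspect for that kind of
background I/O on KDE is the Baloo file indexer; checking it and suspending
it for a test window is harmless, assuming the standard balooctl tool is
installed:

    # is the indexer running, and how much is left to index?
    balooctl status
    # pause it while you watch for stalls; "balooctl resume" turns it back on
    balooctl suspend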
> UI: yes i’m using KDE Plasma 5.8.7/ Frameworks 5.32.0/ Qt 5.6.2
I think it would be interesting to know a bit more about how non-GUI
things respond during these times, and what kind of utilization is
apparently throttling the system. Is it user-mode stuff, system stuff,
wait/IO, or something else?
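If you can catch one of the freezes in the act, a per-CPU breakdown from
the sysstat tools would answer exactly that; the two-second interval is
just a suggestion:

    # split each CPU's time into %usr, %sys, %iowait, %soft, etc.
    mpstat -P ALL 2
    # list processes stuck in uninterruptible (D) sleep, which usually means
    # they are waiting on disk or NFS I/O
    ps -eo state,pid,comm | awk '$1=="D"'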
--
Good luck.
If you find this post helpful and are logged into the web interface,
show your appreciation and click on the star below.
If you want to send me a private message, please let me know in the
forum as I do not use the web interface often.