intermittent temporary network lockup

Dear O. Suse.

My office machines experience intermittent, temporary lockups.
The lockup seems to correlate with a spike of up to 100% CPU utilization on one thread that I believe is related to network issues. This happens on every machine in my network, but not simultaneously. The machines have the same OS/software configuration but differ significantly in hardware.

(Oddly, while monitoring CPU usage with ksysguard, the reported usage on CPU 31 of an AMD Opteron with 32 cores went up to 3,135,000.00% and stayed there… but that seems to be another (related?) problem, as the machine continues to function OK. Restarting ksysguard reset the reporting on the aberrant CPU.)

I’m running openSUSE 42.3 (x86_64)
Linux 4.4.104-39-default #1 SMP
in a network in which all machines run the same version of openSUSE.
The network is a mix of AMD and Intel CPUs on Supermicro or Dell machines.

All are configured with dual GigE cards, and all are set up with channel bonding and static IP addresses, using
mode=balance-rr miimon=100
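For reference, on openSUSE this kind of bond lives under /etc/sysconfig/network/; a minimal sketch of what ours looks like (the IP address and slave interface names below are placeholders, not our real ones):

```
# /etc/sysconfig/network/ifcfg-bond0 -- minimal balance-rr bond sketch
STARTMODE='auto'
BOOTPROTO='static'
IPADDR='192.168.1.10/24'                          # placeholder static address
BONDING_MASTER='yes'
BONDING_MODULE_OPTS='mode=balance-rr miimon=100'  # the options quoted above
BONDING_SLAVE_0='eth0'                            # placeholder slave names
BONDING_SLAVE_1='eth1'
```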

All machines mount a 75TB /home via an autofs mount on our fileserver.
The fileserver is configured the same as the rest, but with the server configuration installed rather than the full graphical install.
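The autofs setup is the usual wildcard map, roughly like this sketch (the server name and export path are placeholders):

```
# /etc/auto.master -- hand /home over to an autofs map
/home  /etc/auto.home

# /etc/auto.home -- mount each user directory from the fileserver on demand
*  -fstype=nfs,rw  fileserver:/export/home/&
```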

This did not happen in earlier versions of openSUSE using the same hardware/network/system configurations.

Any assistance tracking this down would be greatly appreciated!
TOm

How did you come to notice this?

By “lockups” do you mean the UI locks up, and could you describe that a
bit (the symptoms, the duration, desktop environment (presumably KDE based
on ksysguard), etc.)?

Could you test a system disconnected from the network to see if the
problem persists, or maybe set up one system without the bond to see if
that matters? I cannot imagine why it would, but it could help narrow
things down. Alternatively, having a box running in multi-user (vs.
graphical) mode, but with the bond, may be another way to narrow things
down. Similarly, making /home permanently mounted (vs. using autofs) or
not mounted at all (use local disks) may rule things in or out.

I know this does not provide any definite answers for you, but your
description makes me think you are not new around these parts, so it may
be better to find the unlikely/unknown things than to wait too long for
somebody to have the exact answer at the forefront of their mind.

Thanks for the nice write-up.


Good luck.

Thanks Ab,
that gave me some things to think about and try.
It seems there was a major oversight on my part.

The NFS server providing the 75TB /home has bonded 10GigE cards; everything else has bonded GigE cards.
Our local network admin suggested that maybe the 10G is flooding the workstations with packets and they can’t keep up.
That, he suggested, would show up as a spike in CPU rather than in packet counts or anything else… which is what I’m seeing with ksysguard.

I thought I had reconfigured the server to ONLY use GigE, but the 10GigE was still live; I was sloppy in the reconfiguration.

Today I did a complete shutdown of the 10GigE cards, and the problem seems to be resolved
(it is intermittent, so I’ll let things run for a day or two to see if that really did it).
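(For anyone following along, I verified which ports were actually live and took the 10GigE ones down with the usual tools; the interface names here are placeholders for ours:)

```
ip link show                    # list every interface and its state
cat /proc/net/bonding/bond0     # which slaves the bond really has, and link status
ip link set eth2 down           # take a 10GigE port offline (as root)
ip link set eth3 down
```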

**The question now is:** Should I expect problems mixing a server with a 10GigE NIC and workstations with GigE NICs on the same switch/LAN?
(OK, networking is not my expertise, and the admins are doing something to the switch that I’m not fully clued in on.)

Do I have to tune some networking parameters somewhere to have the NFS client on GigE play well with the NFS server on 10GigE?

**OTHER INFORMATION:**

intermittent: means it happens 3-5 times a day at seemingly random times
temporary: means it lasts less than 10 seconds (but it seems like forever)

network lockup: truth be told, it’s not necessarily the network; the User Interface is what locks up.
I still have pointer control, so I can move the mouse, but nothing else seems to be responsive.

UI: yes, I’m using KDE Plasma 5.8.7 / Frameworks 5.32.0 / Qt 5.6.2

As for that CPU usage going above 3,000%, this is a known issue in KDE:
https://www.mail-archive.com/kde-bugs-dist@kde.org/msg172605.html

Thanks in advance for more suggestions.
Tom

Networks will pull speed down to the slowest NIC found on the segment. Give the 10G its own segment. Everything 10G should be on its own segment (subnet), and everything at 1G should be on a separate subnet.

On 02/08/2018 02:26 PM, bishoptc wrote:
>
> that gave me some things to think about and try.
> It seems there was a major oversight on my part.
>
> The NFS server providing the 75TB /home has bonded 10GigE cards;
> everything else has bonded GigE cards.
> Our local network admin suggested that maybe the 10G is flooding the
> workstations with packets and they can’t keep up.
> That, he suggested, would show up as a spike in CPU rather than in packet
> counts or anything else… which is what I’m seeing with ksysguard.

I would be interested in hearing more about this theory, as it
sounds… strange to me. Sure, 10 Gb cards can go much faster than 1 Gb
equivalents, but what would the NFS box be sending that the 1 Gb boxes
were not requesting? Nothing, of course; NFS doesn’t just send packets to
boxes in some sort of DDoS attack by virtue of being on the same network.
If the clients do request something big, then the 10 Gb side could send it
much faster, but TCP windows are designed to prevent any single connection
from overwhelming any other by letting the recipient tell the sender how
fast to send stuff; otherwise every dial-up user would be overwhelmed by
every non-dial-up server in the world when requesting any old download.
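If you want to see what the receive side is actually allowed to advertise, the kernel’s TCP buffer limits are visible via sysctl; something like this (read-only, safe to run anywhere):

```
# min / default / max receive and send buffer sizes used by TCP autotuning
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
# hard caps on socket buffer sizes
sysctl net.core.rmem_max net.core.wmem_max
```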

This is not meant to say your network admin is crazy; maybe, even without
packets arriving faster than a 1 Gb connection can handle, the offloading
of work to the NIC is not working, so the CPU itself may have to process
tons of packets, even at the 1 Gb rate, and that may cause things to stall
a bit while that CPU is pegged. I would hope that the other few dozen CPUs
could take over, but hardware and disk I/O are pretty painful sometimes.
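Checking whether the NIC offloads are actually enabled is cheap; a sketch, assuming an interface name like eth0 (yours will differ, and the bond slaves are the interesting ones):

```
# Which work the NIC takes off the CPU (TSO/GSO/GRO, checksums, scatter-gather)
ethtool -k eth0 | grep -iE 'offload|scatter'
# NIC-level counters; look for drops/errors growing during a lockup
ethtool -S eth0 | grep -iE 'drop|err'
```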

> I thought I had reconfigured the server to ONLY use GigE, but the 10GigE
> was still live; I was sloppy in the reconfiguration.

As a note that is probably obvious, your server is handling many clients
at a time, so having the extra possible throughput is probably worthwhile
so it can handle their concurrent load.

> Today I did a complete shutdown of the 10GigE cards, and the problem
> seems to be resolved
> (it is intermittent, so I’ll let things run for a day or two to see if
> that really did it).

I would be interested to know what the MTU is on the various boxes, both
before and after you reconfigured. Having it too
small on some boxes, or inconsistent between boxes, may cause a lot of
overhead as packets are split up here or there, e.g. from a 9k max size to
1.5k, meaning each packet needs to be split into six fragments.
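A quick way to compare them, plus a test of whether a jumbo-sized packet survives the path unfragmented (the hostname is a placeholder; 8972 is 9000 minus 28 bytes of IP+ICMP header):

```
# Print interface name and MTU on each box
ip -o link show | awk '{print $2, $4, $5}'
# Send a 9000-byte frame with "don't fragment" set; failure means a smaller MTU in the path
ping -M do -s 8972 fileserver
```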

> **The question now is:** Should I expect problems mixing a server with a
> 10GigE NIC and workstations with GigE NICs on the same switch/LAN?
> (OK, networking is not my expertise, and the admins are doing something to
> the switch that I’m not fully clued in on.)

By virtue of being there? I do not see any reason why, but others may
have different experiences, particularly depending on the settings applied
to the boxes. These are not old hub networks (you cannot even
have 1 Gb connections on a hub per the Ethernet spec) where every packet
would really be seen by every other box on the network; the
switch only sends packets to the right ports (unless the
switch is misconfigured, so check that too). Sometimes switches can be
forced/tricked into hub-like mode via attacks, but that should not happen normally.

> Do I have to tune some networking parameters somewhere to have the NFS
> client on GigE play well with the NFS server on 10GigE?

I’d check MTU, but otherwise I do not think so. You could also put the 10
Gb stuff on another network entirely, but it seems like this should just
work. Looking at the offloading, maybe that is a big part of it, though I
would hope not with current hardware and defaults:
https://www.mail-archive.com/kde-bugs-dist@kde.org/msg172605.html
https://www.ibm.com/support/knowledgecenter/en/linuxonibm/com.ibm.linux.z.lgdd/lgdd_t_qeth_wrk_packing.html

> **OTHER INFORMATION:**
>
> intermittent: means it happens 3-5 times a day at seemingly random
> times
> temporary: means it lasts less than 10 seconds (but it seems like forever)

Tracking this over time may be useful. My own system, completely separate
from what you are seeing, sometimes feels like it locks up when I pull an
ISO over a fast-enough connection, not because either end is faster, but
just because my own system’s I/O is busy doing things that apparently cause
my really old kernel to get congested. In my case that basically only
happens when I pull at Gb speeds from something local, but in your case
your systems all have /home on remote storage, and that system is (or was)
fast enough to keep up with them. Slowing that system down may have
just kept the clients from overwhelming themselves with I/O they can
technically handle, but only at the expense of other realtime things.
More on this below.

> network lockup: truth be told, it’s not necessarily the network; the User
> Interface is what locks up.
> I still have pointer control, so I can move the mouse, but nothing else
> seems to be responsive.

This is exactly what I see when I/O (disk in particular) is really high
on my system, though in my case it is very predictable. There may be all
kinds of services on the box that push I/O at seemingly random
times, like indexing the /home directory’s contents for searching in the UI.

> UI: yes i’m using KDE Plasma 5.8.7/ Frameworks 5.32.0/ Qt 5.6.2

I think it would be interesting to know a bit more about how non-GUI
things respond during these times, and what kind of utilization is
apparently throttling the system. Is it user-mode stuff, system stuff,
I/O wait, or something else?
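During one of the freezes, something along these lines from a text console or an SSH session would show where the time is going (iotop may need to be installed first):

```
# Per-second CPU breakdown: us=user, sy=system, wa=I/O wait, plus run/blocked queues
vmstat 1
# Only the processes currently doing I/O (run as root)
iotop -o
```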


Good luck.

IMO the “flooding” theory is hogwash.
TCP is a flow-controlled protocol, which means that new packets won’t be sent until the recipient has acknowledged that all expected packets were received; only then is the next batch sent. And this would be the case even if there is a speed mismatch between network devices. The exception would be streaming protocols, which don’t rely on acknowledgements because you want smooth audio/video even if a few packets get dropped.

I also wouldn’t suspect MTU unless someone has been monkeying around with the setting… You’d have to be using or crossing really ancient equipment for MTU to be a factor (but see what I describe next, which in some ways could be considered a similar limitation).

I recommend you do a packet capture to determine what the problem might be.
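For example, something like this on a workstation while the problem is occurring, then open the capture in wireshark and look for retransmissions and duplicate ACKs (the interface, hostname, and output path are placeholders; NFS is port 2049):

```
# Capture full packets to/from the fileserver during a lockup window (run as root)
tcpdump -i bond0 -s 0 -w /tmp/lockup.pcap host fileserver and port 2049
```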

Something to consider:
if you have a network congestion problem, which can be caused by many things, it typically looks like the following…

  • Fast throughput for a bit.
  • Gradual slowing over a few minutes.
  • An eventual slide to a full halt.
  • A full halt for a couple of minutes or so.
  • Then suddenly fast throughput again, and the cycle repeats.

If the above describes what you’re seeing, then you either need to find and fix the cause of your dropped packets, or mitigate the problem by changing your TCP/IP congestion-control algorithm and possibly enlarging your TCP/IP buffers. To understand all of this and how to address it, see the following article; I wrote it a long time ago for a much older version of openSUSE, but it’s still completely relevant today. And remember that openSUSE by default is tuned for use as a personal computing device or a small server. If your machine is doing “big server” things (including running a torrent application), then you need to make changes to support heavier networking.

https://sites.google.com/site/4techsecrets/optimize-and-fix-your-network-connection
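As a first concrete step, the current congestion-control algorithm and its alternatives are visible via sysctl; a minimal sketch (pick only from what your kernel lists as available):

```
# Current algorithm and what else is available on this kernel
sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_available_congestion_control
# Switch (as root); make it permanent in /etc/sysctl.conf if it helps
sysctl -w net.ipv4.tcp_congestion_control=reno   # example value only
```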

TSU

@bishoptc:
A short check list:

  • What’s ‘top’ and/or ‘ksysguard’ indicating as the process which is using the most CPU on the desktop machines?
  • Is Firefox the thing that’s locking up?
  • Which file system(s) do you have for system and user partitions?
  • Is anything noticeable in the LAN traffic and/or the router to your ISP?
  • What’s ‘top’ and/or ‘ksysguard’ indicating on the NFS server(s)?
  • What are the NFS counters indicating? (see the command sketch just below)
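For the NFS counters, something like this should do (run the first two on a workstation, the last on the fileserver):

```
nfsstat -c    # client-side RPC/NFS call counts; watch retrans in particular
nfsstat -m    # mount options actually in effect (rsize/wsize, proto, vers)
nfsstat -s    # server-side counters
```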

The reason that I’m asking about Firefox running from KDE is that just this afternoon I experienced Firefox (v52.6.0 ESR on Leap 42.3) eating CPU for no apparent reason. I cleaned out the ~/.mozilla/ and ~/.cache/mozilla/firefox/ directories completely, restarted Firefox and restored the settings and bookmarks, and now it’s quiet, or at least as quiet as Firefox can be when reading newspapers.

Thanks for all the replies.
I have not ignored them, but this is intermittent and seems to have disappeared after switching the fileserver from 10G channel bonding to 1G channel bonding.

Again, the only indication other than the UI freeze (mouse still movable but not clickable) was one CPU on the client going to 100%; network and memory on the client were not affected (using ksysguard for monitoring).
I don’t have info from the server at the time of the incidents.

A possibly related aside: when I had 2x10G on the server and 2x1G on the clients (workstations), I could get 1.8 Gbps iperf3 transfers from client to server but only 0.9 Gbps in the other direction. Workstation to workstation was ~1 Gbps in either direction. The machines seemed to perform fine during these tests, even when using UDP.
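(Those numbers came from plain iperf3 runs, roughly like this; the hostname is a placeholder for our fileserver:)

```
iperf3 -s                        # on the server, once
iperf3 -c fileserver             # client -> server, TCP
iperf3 -c fileserver -R          # reverse direction: server -> client
iperf3 -c fileserver -u -b 1G    # UDP at a 1 Gbit/s target rate
```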

When I get a chance I’ll switch back to 2x10Gbps on the server and see if the problem returns.

I do the sysadmin work in my spare time, so specific tests/recommendations should come with a command-line suggestion to help things along :)

FYI: filesystem is XFS

An oldie but goodie:

It’s been more than 1 month, and I used the 2x 1GigE lines with channel bonding without a glitch.
Yesterday I switched back to the 2x 10GigE lines with channel bonding, and boom: the intermittent temporary network lockup (well, it’s actually the desktop environment that locks… mostly… the mouse pointer still works but the rest locks up).

I get this lockup on a machine that’s two switches away from the server as well as on a machine that is on the same switch. The two machines are vastly different in hardware but have the same OS/software/DE configuration.

All machines have been through several updates and are now at kernel
4.4.120-45-default, so whatever is causing this is NOT going away.

I guess I could just use a single 10GigE link to try to eliminate any channel-bonding issues.
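If I try that, I understand a slave can be pulled out of a live bond through sysfs, roughly like this (interface names are placeholders; needs root):

```
# Drop one slave from the bond, leaving a single 10GigE link active
echo -eth3 > /sys/class/net/bond0/bonding/slaves
# Confirm the remaining slave and its link state
cat /proc/net/bonding/bond0
```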

But any other ideas would be greatly appreciated.

Thanks in advance,
TOm

I really can’t address your other problems, but is your mouse problem possibly a result of your mouse button(s) not working correctly?

For example, my mouse, which should be considered to be working fine, at times has problems when pressing the centre button/wheel. Yet the mouse otherwise works excellently (it had better; I’ve only had the computer for two years and the mouse came with it).

These are just my thoughts on the subject.