Thread: intermittent temporary network lockup

  1. #1
    Join Date
    Mar 2014
    Location
    Louisiana, USA
    Posts
    9

    intermittent temporary network lockup

    Dear O. Suse.

    My office machines experience intermittent temporary lockups.
    The lockups seem to correlate with a spike of up to 100% CPU utilization on one thread, which I believe is related to network issues. This happens on each machine in my network, but not simultaneously. The machines have the same OS/software configuration but differ significantly in hardware.

    (Oddly, while I was monitoring CPU usage with ksysguard, the reported usage on CPU 31 of a 32-CPU AMD Opteron went up to 3,135,000.00% and stayed there, but that seems to be another (related?) problem, as the machine continues to function OK. Restarting ksysguard reset the reporting on the aberrant CPU.)

    I'm running openSUSE 42.3 (x86_64)
    Linux 4.4.104-39-default #1 SMP
    in a network in which all machines run the same version of openSUSE.
    The network is a mix of AMD and Intel CPUs on Supermicro or Dell machines.

    All are configured with dual GigE cards, and all are set up with channel bonding, static IP addresses, and
    mode=balance-rr miimon=100

    All machines mount a 75TB /home via an autofs mount on our fileserver.
    The fileserver is configured the same as the rest, but with the server configuration installed rather than the full graphics install.

    This did not happen in earlier versions of openSUSE using the same hardware/network/system configurations.

    Any assistance tracking this down would be greatly appreciated!
    TOm

  2. #2

    Re: intermittent temporary network lockup

    How did you come to notice this?

    By "lockups" do you mean the UI locks up, and could you describe that a
    bit (the symptoms, the duration, desktop environment (presumably KDE based
    on ksysguard), etc.)?

    Could you test a system disconnected from the network to see if the
    problem persists, or maybe set up one system without the bond to see if
    that matters? I cannot imagine why it would, but it could help narrow
    things down. Alternatively, having a box running in multi-user (vs.
    graphical) mode, but with the bond, may be another way to narrow things
    down. Similarly, making /home permanently-mounted (vs. using autofs) or
    not-mounted at all (use local disks) may rule things in or out.

    I know this does not provide any definite answers for you, but your
    description makes me think you are not new around these parts, so it may
    be better to find the unlikely/unknown things than to wait too long for
    somebody to have the exact answer at the forefront of their mind.

    Thanks for the nice write-up.

    --
    Good luck.

    If you find this post helpful and are logged into the web interface,
    show your appreciation and click on the star below.

    If you want to send me a private message, please let me know in the
    forum as I do not use the web interface often.

  3. #3
    Join Date
    Mar 2014
    Location
    Louisiana, USA
    Posts
    9

    Re: intermittent temporary network lockup

    Thanks Ab,
    that gave me some things to think about and try.
    Seems there was a major oversight on my part.

    The NFS server providing the 75TB /home has bonded 10GigE cards; everything else has bonded GigE cards.
    Our local network admin suggested maybe the 10G is flooding the workstations with packets and they can't keep up.
    That, he suggested, would show up as a spike in CPU rather than in packets or anything else, which is what I'm seeing with ksysguard.

    Thought I had reconfigured the server to ONLY have GigE but it was still live and I was sloppy in the reconfig.

    Today I completely shut down the 10GigE cards and the problem seems to be resolved
    (it is intermittent so I'll let things run for a day or two to see if that really did it)

    The question now is:
    Should I expect problems mixing a server with 10GigE NICs and workstations with GigE NICs on the same switch/LAN?
    (ok networking is not my expertise and the admins are doing something to the switch that I'm not fully clued in about)

    Do I have to tune some networking parameters someplace to have the NFS client on GigE play well w/ NFS server on 10GigE?



    OTHER INFORMATION:

    intermittent: means it happens 3-5 times a day at _seemingly_ random times
    temporary: means it lasts less than 10 seconds (but it seems like forever)

    network lockup: truth be told, it's not necessarily the network. The user interface is what locks up.
    I still have pointer control, so I can move the mouse, but nothing else seems responsive.

    UI: yes, I'm using KDE Plasma 5.8.7/ Frameworks 5.32.0/ Qt 5.6.2

    As far as that CPU usage going above 3,000%, this is a known issue in KDE
    https://www.mail-archive.com/kde-bugs-dist@kde.org/msg172605.html

    Thanks in advance for more suggestions.
    Tom

  4. #4
    Join Date
    Nov 2009
    Location
    West Virginia Sector 13
    Posts
    16,287

    Re: intermittent temporary network lockup

    Networks will pull down speed to the slowest NIC found on the segment. Provide the 10G with its own segment. Everything 10G should be on its own segment (subnet); everything at 1 Gb should be on a separate subnet.

  5. #5

    Re: intermittent temporary network lockup

    On 02/08/2018 02:26 PM, bishoptc wrote:
    >
    > that gave me some things to think about and try.
    > Seems there was a major oversight on my part.
    >
    > The NFS server providing the 75TB /home has bonded 10GigE cards;
    > everything else has bonded GigE cards.
    > Our local network admin suggested maybe the 10G is flooding the
    > workstations w/ packets and they can't keep up.
    > That, he suggested, would be a spike in CPU, not packets or other... which
    > is what I'm seeing w/ ksysguard.


    I would be interested in hearing more about this theory, as it
    sounds... strange to me. Sure, 10 Gb cards can go much faster than 1 Gb
    equivalents, but what would the NFS box be sending that the 1 Gb boxes
    were not requesting? Nothing, of course; NFS doesn't just send packets to
    boxes in some sort of DDoS attack by virtue of being on the same network.
    If the clients do request something big, then the 10 Gb could send it much
    faster, but TCP windows are designed to prevent any single connection from
    overwhelming any other by letting the recipient dictate how fast to send
    stuff; otherwise every dial-up user would be overwhelmed by every
    non-dial-up server in the world when requesting any old download.

    This is not meant to say your network admin is crazy. Even without
    packets arriving faster than a 1 Gb connection can handle, the offloading
    of work to the NIC may not be working, so the CPU itself may have to
    process tons of packets, even at the 1 Gb rate, and that may cause things
    to stall a bit while that CPU is pegged. I would hope that the other
    few-dozen CPUs could take over, but hardware and disk I/O are pretty
    painful sometimes.
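
    One rough way to see whether a single CPU is absorbing most of the
    receive-side packet work is the NET_RX row of /proc/softirqs. The sketch
    below is only an illustration and was not posted in the thread: the
    helper names and the sample text are made up, and it assumes the usual
    layout of that file (a header row of CPU names, then one row per
    softirq type).

```python
from pathlib import Path


def net_rx_per_cpu(softirqs_text):
    """Return {cpu_name: NET_RX count} parsed from /proc/softirqs-style text."""
    lines = softirqs_text.strip().splitlines()
    if not lines:
        return {}
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if label.strip() == "NET_RX":
            counts = [int(field) for field in rest.split()]
            return dict(zip(cpus, counts))
    return {}


def busiest_cpu(counts):
    """Name of the CPU with the largest NET_RX count, or None if empty."""
    return max(counts, key=counts.get) if counts else None


# Made-up sample (hypothetical numbers) so the parsing can be tried anywhere.
SAMPLE = """\
                    CPU0       CPU1       CPU2       CPU3
          HI:          1          0          0          0
      NET_TX:        120         90         75         80
      NET_RX:     950000       1200        800       1100
"""

if __name__ == "__main__":
    path = Path("/proc/softirqs")
    text = path.read_text() if path.exists() else SAMPLE
    counts = net_rx_per_cpu(text)
    print(counts, "busiest:", busiest_cpu(counts))
```

    The counters are cumulative since boot, so comparing two readings taken
    a few seconds apart during a freeze says more than a single snapshot.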

    > Thought I had reconfigured the server to ONLY have GigE but it was still
    > live and I was sloppy in the reconfig.


    As a note that is probably obvious, your server is handling many clients
    at a time, so having the extra possible throughput is probably worthwhile
    so it can handle their concurrent load.

    > Today I completely shut down the 10GigE cards and the problem
    > seems to be resolved
    > (it is intermittent so I'll let things run for a day or two to see if
    > that really did it)


    I would be interested to know what the MTU is on the various boxes, both
    before you reconfigured and now that you have reconfigured. Having it too
    small on some boxes, or inconsistent between boxes, may cause a lot of
    overhead as packets are split up here or there, e.g. from 9k max size to
    1.5k, meaning packets need to be split in sixths.
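
    To put numbers on that, here is a small sketch (not from the thread; the
    function names are made up, and it assumes plain IPv4 with a 20-byte
    header and no options). It counts how many fragments one packet needs on
    a smaller MTU and lists the MTU that sysfs reports for each local
    interface. With these assumptions a 9000-byte packet crossing a
    1500-byte MTU comes out as seven fragments rather than six, because
    every fragment carries its own IP header.

```python
import math
from pathlib import Path

IPV4_HEADER = 20  # bytes, assuming no IP options


def ipv4_fragments(packet_bytes, mtu):
    """Number of IPv4 fragments needed to carry one packet over a smaller MTU."""
    if packet_bytes <= mtu:
        return 1
    payload = packet_bytes - IPV4_HEADER
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


def local_mtus(sys_net="/sys/class/net"):
    """Map interface name -> MTU as reported by sysfs (Linux only)."""
    mtus = {}
    root = Path(sys_net)
    if not root.is_dir():
        return mtus
    for iface in sorted(root.iterdir()):
        try:
            mtus[iface.name] = int((iface / "mtu").read_text())
        except (OSError, ValueError):
            continue  # not an interface entry, or unreadable
    return mtus


if __name__ == "__main__":
    print("local MTUs:", local_mtus())
    print("9000 over 1500:", ipv4_fragments(9000, 1500), "fragments")
```

    Running it on the server and on a workstation and comparing the bond
    and slave interface values would show quickly whether the MTUs differ.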

    > The question now is:
    > Should I expect problems mixing server with 10GigE NIC and
    > workstations w/ GigE NICs in the same switch/LAN?
    > (ok networking is not my expertise and the admins are doing something to
    > the switch that I'm not fully clued in about)


    By virtue of being there? I do not see any reason why, but others may
    have different experiences, particularly depending on the settings applied
    to the boxes. These are not old hub networks (gigabit hubs essentially
    do not exist in practice) where every packet
    would really be analyzed by every other box on the network, since the
    switch does that and only sends packets to the right ports (unless the
    switch is misconfigured, so check that too). Sometimes switches can be
    forced/tricked into hub mode via attacks, but it should not happen normally.

    > Do I have to tune some networking parameters someplace to have the NFS
    > client on GigE play well w/ NFS server on 10GigE?


    I'd check MTU, but otherwise I do not think so. You could also put the 10
    Gb stuff on another network entirely, but it seems like this should just
    work. Looking at the offloading, maybe that is a big part of it, though I
    would hope not with current hardware and defaults:
    https://www.mail-archive.com/kde-bug...msg172605.html
    https://www.ibm.com/support/knowledg...k_packing.html

    > OTHER INFORMATION:
    >
    > intermittent: means it happens 3-5 times a day at _seemingly_ random
    > times
    > temporary: means lasts less than 10 seconds (but it seems like forever)


    Tracking over time may be useful. My own system, completely separate from
    what you are seeing, sometimes feels like it locks up when I pull an ISO
    from a fast-enough connection, not because either is faster, but just
    because my own system's I/O is busy doing things that apparently cause my
    really old kernel to get congested. In my case that basically only
    happens when I pull at Gb speeds from something local, but in your case
    your systems all have /home on remote storage, and that system is (or was)
    fast enough to keep up with them. Turning those systems down may have
    just kept them from overwhelming themselves with I/O they can handle but
    at the expense of other realtime things. More on this below.

    > network lockup: truth be told it's not necessarily the network. The User
    > Interface is what locks up.
    > I still have pointer control, so I can move the mouse, but seems nothing
    > else is responsive.


    This is exactly what I see when I/O (disk in particular) is really high
    on my system, though in my case it is very predictable. There may be all
    kinds of services on the box that are pushing I/O at seemingly-random
    times, like indexing the /home directory's contents for searching in the UI.

    > UI: yes I'm using KDE Plasma 5.8.7/ Frameworks 5.32.0/ Qt 5.6.2


    I think it would be interesting to know a bit more about how non-GUI
    things respond during these times, and what kind of utilization is
    apparently throttling the system. Is it user-mode stuff, system stuff, or
    wait/IO, or other?
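
    For a command-line way to answer that, something like the following
    could be left running in a terminal. It is only a sketch with made-up
    helper names, not a polished tool: it takes two samples of /proc/stat
    one second apart and prints, per CPU, the share of time spent in user,
    system, iowait, softirq and so on.

```python
import time
from pathlib import Path

FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(text):
    """Return {cpu_label: {field: ticks}} for every 'cpu*' line."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0].startswith("cpu"):
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            result[parts[0]] = dict(zip(FIELDS, values))
    return result


def shares(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    delta = {f: after.get(f, 0) - before.get(f, 0) for f in FIELDS}
    total = sum(delta.values())
    if total == 0:
        return {f: 0.0 for f in FIELDS}
    return {f: 100.0 * delta[f] / total for f in FIELDS}


if __name__ == "__main__":
    stat = Path("/proc/stat")
    if stat.exists():
        first = parse_proc_stat(stat.read_text())
        time.sleep(1)
        second = parse_proc_stat(stat.read_text())
        for cpu in first:
            if cpu in second:
                pct = shares(first[cpu], second[cpu])
                # Only show states that account for at least 1% of the interval.
                print(cpu, {k: round(v, 1) for k, v in pct.items() if v >= 1.0})
```

    A CPU that is pegged in softirq during a freeze would point at network
    receive processing, while a high iowait share would point at storage
    or NFS waits.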

    --
    Good luck.

  6. #6
    Join Date
    Jun 2008
    Location
    San Diego, Ca, USA
    Posts
    13,295
    Blog Entries
    2

    Re: intermittent temporary network lockup

    IMO the "flooding" theory is hogwash.
    TCP is a flow-controlled protocol, which means the sender will not keep pushing new packets beyond what the recipient has acknowledged and has room for; it waits for acknowledgements before sending more. This is the case even if there is a speed mismatch between network devices. The exception would be streaming protocols, which don't rely on acknowledgements because you want smooth audio/video even if a few packets are dropped.

    I also wouldn't suspect MTU unless someone has been monkeying around with the setting... You'd have to be using or crossing really ancient equipment for MTU to be a factor (but see what I describe next, which in some ways could be considered a similar limitation).

    Recommend you do a packet capture to determine what the problem might be.

    Something to consider,
    If you have a network congestion problem which can be caused by many things, your problem would typically look like the following...
    Fast throughput for a bit.
    Gradually slowing over a few minutes.
    Eventually slows to a full halt.
    Full Halt for a couple minutes or so.
    Then, suddenly fast throughput, followed by repeating the cycle.

    If the above describes what you're seeing, then you need either to find and fix the cause of your dropped packets, or to mitigate the problem by changing your TCP congestion control algorithm and possibly enlarging your TCP buffers. I wrote the following article a long time ago, for a much older version of openSUSE, to explain all of this and how to address it, and it is still completely relevant today. Remember that openSUSE by default is tuned for use as a personal computing device or a small server. If your machine is doing "big server" things (including running a torrent application), then you need to make changes to support heavier networking.

    https://sites.google.com/site/4techs...ork-connection
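
    As a quick way to see where a machine currently stands, the sketch
    below (helper names are made up, and the link speed and round-trip time
    are illustrative figures, not measurements from this network) reads the
    current congestion control algorithm and buffer limits from /proc/sys
    and computes a bandwidth-delay product to compare them against.

```python
from pathlib import Path


def bdp_bytes(bandwidth_bits_per_s, rtt_microseconds):
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    return bandwidth_bits_per_s * rtt_microseconds // 8_000_000


def read_sysctl(name, proc_sys="/proc/sys"):
    """Read a sysctl such as 'net.ipv4.tcp_rmem' via /proc/sys, or None."""
    path = Path(proc_sys, *name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for key in ("net.ipv4.tcp_congestion_control",
                "net.ipv4.tcp_rmem",
                "net.ipv4.tcp_wmem",
                "net.core.rmem_max",
                "net.core.wmem_max"):
        print(key, "=", read_sysctl(key))
    # Illustrative figures only: a 10 Gb/s link with a 500 microsecond round trip.
    print("BDP:", bdp_bytes(10_000_000_000, 500), "bytes")
```

    If the maximum buffer sizes are well below the bandwidth-delay product
    for the fastest path, a single connection cannot fill that path, which
    is the situation the buffer tuning is meant to address.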

    TSU
    Beginner Wiki Quickstart - https://en.opensuse.org/User:Tsu2/Quickstart_Wiki
    Solved a problem recently? Create a wiki page for future personal reference!
    Learn something new?
    Attended a computing event?
    Post and Share!

  7. #7
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    4,576

    Re: intermittent temporary network lockup

    @bishoptc:
    A short check list:
    • What's 'top' and/or 'ksysguard' indicating as the process which is using the most CPU on the desktop machines?
    • Is Firefox the thing that's locking up?
    • Which file system(s) do you have for system and user partitions?
    • Is anything noticeable in the LAN traffic and/or the router to your ISP?
    • What's 'top' and/or 'ksysguard' indicating on the NFS server(s)?
    • What are the NFS counters indicating?
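
    For the last item, a minimal sketch of reading the client-side RPC
    counters follows. It is an assumption-laden illustration with a made-up
    function name: it assumes the 'rpc' line of /proc/net/rpc/nfs lists
    calls, retransmissions and auth refreshes in that order, which are the
    same figures 'nfsstat -c' summarizes.

```python
from pathlib import Path


def rpc_client_counters(text):
    """Parse the 'rpc' line of /proc/net/rpc/nfs-style text, or return None."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "rpc":
            calls, retrans = int(parts[1]), int(parts[2])
            ratio = retrans / calls if calls else 0.0
            return {"calls": calls, "retrans": retrans, "retrans_ratio": ratio}
    return None


if __name__ == "__main__":
    path = Path("/proc/net/rpc/nfs")
    if path.exists():
        print(rpc_client_counters(path.read_text()))
    else:
        print("no NFS client statistics found on this machine")
```

    A retransmission count that climbs around the time of a freeze would
    suggest the client is waiting on lost or delayed NFS replies.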


    The reason I'm asking about Firefox running from KDE is that just this afternoon I experienced Firefox (v52.6.0 ESR on Leap 42.3) eating CPU for no apparent reason. I cleaned out the ~/.mozilla/ and ~/.cache/mozilla/firefox/ directories completely, restarted Firefox, and restored the settings and bookmarks, and now it's quiet, or at least as quiet as Firefox can be when reading newspapers.

  8. #8
    Join Date
    Mar 2014
    Location
    Louisiana, USA
    Posts
    9

    Re: intermittent temporary network lockup

    Thanks for all the replies.
    I have not ignored them but this is intermittent and seems to have disappeared after switching the fileserver from 10G channel bonded to 1Gig channel bonded.

    Again, the only indication other than the UI freeze (mouse still movable but not clickable) was 1 CPU on the client going to 100%. Network and memory on the client were not affected (using ksysguard for monitoring).
    I don't have info from the server at the time of the incidents.

    A possibly related aside: when I had 2x10G on the server and 2x1G on the clients (workstations), I could get 1.8 Gbps iperf3 transfers from client to server but only 0.9 Gbps in the other direction. Workstation to workstation was at Gbps speeds in either direction. The machines seemed to perform fine during these tests, even when using UDP.

    When I get a chance I'll switch back to 2x10Gbps on the server and see if the problem returns.

    I do the sys-admin in my spare time so specific tests/recommendations should come w/ a command line suggestion to help things along :-)

    FYI: filesystem is XFS

  9. #9
    Join Date
    Mar 2014
    Location
    Louisiana, USA
    Posts
    9

    Re: intermittent temporary network lockup

    An oldie but goodie:

    It's been more than 1 month of using the 2 x 1GigE lines with channel bonding, and not a glitch.
    Yesterday I switched back to the 2 x 10GigE lines with channel bonding and boom: the intermittent temporary network lockup is back (well, it's actually the desktop environment that locks, mostly; the mouse pointer still works but the rest locks).

    I get this lockup on a machine that's two switches away from the server as well as on a machine that is on the same switch. The two machines are vastly different in hardware but have the same OS/software/DE config.

    All machines have been thru several updates and are now at kernel
    4.4.120-45-default so whatever is causing this is NOT going away.

    I guess I could just use a single 10GigE line to try to eliminate any channel bonding issues.

    But any other ideas are greatly appreciated.

    Thanks in advance,
    TOm




  10. #10
    Join Date
    Jun 2008
    Location
    Belleville, Ontario, Canada
    Posts
    503

    Re: intermittent temporary network lockup

    I really can't address your other problems,
    but is your mouse problem possibly a result
    of your mouse button(s) not working correctly?

    For example, my mouse which should be considered working fine has
    at times problems when pressing the centre button/wheel. Yet,
    the mouse otherwise works excellently (it better had, I've only had the
    computer for two years and the mouse came with it).

    These are just my thoughts on the subject.
    "The time is always right to do what's right." Rev. Dr. Martin Luther King, Jr.
    openSUSE 15.3 5.3.18-59.40-default x86_64
