Intermittent network failures

StevenKarp · June 11, 2018, 6:37pm

Greetings.

I’ve got a machine here running 42.3–not quite ready to upgrade to 15. The network occasionally just seems to stop. Running “systemctl restart wicked” brings it back, and everything is fine again. Until the next failure. (To be fair, it’s running for multiple hours between failures. But since this is the main server for the house, having the network down on that box is unacceptable.)

For the time being, I’m running a cron job once a minute to ping its own IP and restart wicked if it fails. But that’s hardly elegant.
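A minimal sketch of the kind of watchdog described above, assuming iputils ping and a systemd-managed wicked. The function name net_watchdog, the log tag, and the address passed in are illustrative placeholders, not the poster's actual script.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: probe one address, restart wicked on failure.
# Argument: TARGET (the address to ping; a placeholder, choose your own)
net_watchdog() {
    target="$1"
    # -c 1: send a single probe; -W 2: wait at most 2 seconds for the reply
    if ping -c 1 -W 2 "$target" >/dev/null 2>&1; then
        return 0
    fi
    # Leave a trace in the system log so each failure can be timed later
    logger -t net-watchdog "ping to $target failed; restarting wicked"
    systemctl restart wicked
}

# Only act when an address was given on the command line
if [ "$#" -gt 0 ]; then
    net_watchdog "$1"
fi
```

Saved as, say, /usr/local/sbin/net-watchdog.sh (a hypothetical path), it could be run from root's crontab once a minute with a line such as `* * * * * /usr/local/sbin/net-watchdog.sh 192.168.1.1`. The logger line is there so the journal records when each failure happened.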

The network card is a “Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller” using the r8169 driver. I’ve seen (older) reports on the Internet and these forums that the driver can be a bit unstable, and suggestions to use the r8168 driver from PackMan (or Sauerland), but I’m reluctant to take that step based on advice to others in situations that aren’t quite the same and risk confusing matters further.
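For anyone wanting to confirm which driver is actually bound to the card, here is a small sketch that filters the output of lspci -k. The helper name ethernet_drivers is made up for illustration, and the filter assumes the usual lspci layout (device lines start in column one, detail lines are indented).

```shell
# Reads `lspci -k` output on stdin and prints the kernel driver bound to
# each Ethernet controller, one per line.
ethernet_drivers() {
    awk '
        # A device line starts with a hex bus address; note whether it is Ethernet
        /^[0-9a-f]/ { in_eth = ($0 ~ /Ethernet controller/) }
        # Within an Ethernet device block, print the last field of the driver line
        in_eth && /Kernel driver in use:/ { print $NF }
    '
}
```

Usage would be `lspci -k | ethernet_drivers`. The same check can be repeated after installing or removing the r8168 packages to see whether the driver changed.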

Suggestions or recommendations–especially for ways to pin down why the network is going down–gratefully appreciated.

gogalthorp · June 11, 2018, 7:17pm

I’ve got the same chipset running the r8168 driver without problems. Install the blacklist package as well.

Sauerland · June 11, 2018, 7:23pm

If the Packman repo is enabled, run as root:

zypper in r8168-blacklist-r8169 r8168-kmp-default

To uninstall:

zypper rm r8168-blacklist-r8169 r8168-kmp-default

tsu2 · June 11, 2018, 7:51pm

You should try to gather as much info as possible.

To a certain degree you’re handicapped because your server doesn’t have a user actively logged in when a problem happens, but…

You first need to clearly define for yourself what you mean when you say the network seems to stop. There is a difference, for example, between a particular service on the machine having stopped and some external networking issue preventing other machines from connecting to your server.

Some places to start…

If you suspect the network service running on your openSUSE, the following will return all system logs related to the network service during your previous boot…

journalctl -b -1 -u network.service

The following will return any type of error in your current boot

journalctl -p err -b

And the following will return errors and warnings in your current boot

journalctl -p warning -b

If you want to monitor your system log in real time, you can run the following, perhaps remotely using SSH (If monitoring remotely, you may see the last entries before networking stopped)

journalctl -f

Consider that restarting wicked may not actually mean that the problem is local. When you restart networking, your machine re-advertises itself on the network affecting how every other machine might interact with your machine.

From another machine (or from the openSUSE machine itself), you may want to run tcpdump (if on another machine, perhaps in promiscuous mode) to capture network packets for later analysis with a tool like Wireshark (a bit heavier, but Wireshark can do the packet capture, too).
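A rough sketch of such a capture, limited to traffic to or from a single host so the file stays small. The function name capture_host and its three arguments are illustrative only. tcpdump normally needs root and puts the interface into promiscuous mode by default.

```shell
# Capture traffic to or from one host into a pcap file for later analysis.
# Arguments: INTERFACE HOST OUTFILE (all placeholders to be filled in)
capture_host() {
    iface="$1"
    host="$2"
    outfile="$3"
    # -n: do not resolve names; -i: interface to listen on;
    # -w: write raw packets to a file instead of printing them
    tcpdump -n -i "$iface" -w "$outfile" host "$host"
}
```

For example, `capture_host eth0 192.168.1.10 server.pcap`, where the interface name, address and file name are hypothetical. Stop it with Ctrl-C and open the resulting file in Wireshark.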

If you’re serving files or providing Network services, you may want to analyze your network security as well (Are you deployed as a Workgroup which can regularly experience periods of downtime due to things like Master Browser elections).

You’ll notice that the journalctl commands I recommended filter by boot. There are other ways to use journalctl to collect data, filtering differently, e.g. by time window instead of by boot. You can find documentation and examples on the web, in blogs, in the man pages and plenty of other places. If you have difficulty running any command, just post details about your attempts and your question.
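As one possible example of filtering by time window, the sketch below wraps journalctl's --since and --until options. The function name journal_window is invented for illustration, and wicked.service in the usage line is an assumption about which unit is of interest.

```shell
# Show journal entries between two timestamps, optionally for a single unit.
# Arguments: SINCE UNTIL [UNIT]
journal_window() {
    since="$1"
    until_ts="$2"
    unit="${3:-}"
    if [ -n "$unit" ]; then
        # Restrict the output to one systemd unit
        journalctl --no-pager --since "$since" --until "$until_ts" -u "$unit"
    else
        # No unit given: show everything logged in the window
        journalctl --no-pager --since "$since" --until "$until_ts"
    fi
}
```

Something like `journal_window "2018-06-16 19:30" "2018-06-16 20:10" wicked.service` would then show what was logged around a known outage (the timestamps here are made up).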

TSU

StevenKarp · June 12, 2018, 7:43am

That’s an interesting notion, that the problem might be external to the server. Certainly, there’s nothing I can see that’s relevant in the logs you suggested I check (and I looked on my usual client machine as well). There are a metric buttload of ssh login failures on the client (no surprise there), but they continue through the period where the server was off in loony land.

It’s possible the problem was in my network switch; during the course of my attempts to resurrect networking this morning, I moved the server to another port, and the problem has not returned. Or, if the problem is on the server, maybe my pings are keeping something awake that was previously passing out from boredom.

I’m going to leave everything as-is now, and see if the problem returns. If so, I’ll take another look with your suggestions. If not, I’ll assume the problem has gone away.

Thanks for the help, all!

StevenKarp · June 16, 2018, 8:05pm

An update:

The problem has recurred several times. I’ve started leaving a user account logged in on the console for troubleshooting.

When the network goes out, the computer can ping its own IP address, which I presume means there’s nothing wrong with the hardware. However, any attempt to access another machine (ping, ssh, http) results in “no route to host”. It’s not a DNS problem, because I get the same result using IPs.
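One way to pin this down further might be to record the link, address, route and neighbour state at the moment of failure, before restarting wicked. A ping to the machine's own address is normally answered locally without touching the wire, so it may not say much about the card or the link. The sketch below is only an assumption about what is worth collecting, and the function name snapshot_net_state is made up.

```shell
# Append a timestamped snapshot of the network state to a file.
# Argument: OUTFILE
snapshot_net_state() {
    outfile="$1"
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        ip -s link show      # carrier state and per-interface error counters
        ip addr show         # addresses currently assigned
        ip route show        # routing table, including the default route
        ip neigh show        # ARP/neighbour cache, e.g. FAILED or INCOMPLETE entries
        dmesg | tail -n 30   # most recent kernel messages, e.g. from the NIC driver
    } >>"$outfile" 2>&1
}
```

Calling `snapshot_net_state /var/log/net-snapshot.log` (a hypothetical path) during an outage and again after the restart gives two states to compare.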

I’ve tried tsu2’s suggestions, and there’s nothing in the logs related to networking, other than a minor samba misconfiguration, now corrected.

I’m still reluctant to switch drivers without knowing what the underlying problem is. Unless I get a flash of insight, I’m going to stagger along with the “restart wicked” workaround until payday, and then pick up an Intel network card and disable the onboard networking.