Hello all, I have done plenty of searching and have been unable to resolve this problem, so I am hoping I may find help here. I am running OpenSuse 13.1 as a host monitoring system on our network; in the most basic sense, if it can’t ping a remote branch router, it triggers an e-mail alert to our IT department so we know the router is offline. The host also has two NICs, one of which runs in promiscuous mode, monitoring internet traffic on a switch mirror port with Ntopng.
I have a very strange issue: when a remote branch router goes offline due to a network outage and then comes back online, it still cannot be pinged from the OpenSuse server. From the OpenSuse server I can ping any other host on the remote subnet except the router, and I can ping the router from any other host on the same subnet as the OpenSuse server. I am not sure if this is due to some form of IP redirect/routing issue, but it only affects the OpenSuse server and no other machines. I did discover that stopping and starting the firewall on OpenSuse clears the erroneous state, but it does not prevent it from recurring. Does anybody have any insight into such an issue? Forgive me if I left out any pertinent information.
My first reaction is that there is probably a “discovery” problem.
So, for instance…
It can make a difference whether you’re pinging the remote machine (the router) by name or by IP address.
Is this router a managed device? Is there an app that “knows” this device, and if so, where is the app located? Is it part of overall network security?
Looking at how your openSUSE is set up, is it different from your other machines? For example, are your other machines part of a network security system like LDAP or AD while your openSUSE is not?
There may be other possible questions; the ones above are all similar in that they relate to how machines in your network are “found.”
If you deploy Wireshark and analyze packets, you may also get better insight into why and where your openSUSE is failing in finding the remote router.
I am not sure I understand your question about an “app” that “knows” the device.
To answer your question about AD and LDAP: I can ping the same host after a network outage from another OpenSuse box on the same subnet as the problematic OpenSuse installation, and neither machine is part of Active Directory or LDAP.
I will try to get a tcpdump capture when this occurs and analyze it with Wireshark but it is clearly difficult to anticipate an outage.
Is your other machine running the same openSUSE version as the one giving you this problem? I ask because I recall this old thread, where it was mentioned that the ARP cache behaviour seems to have changed. I’m not completely sure about the implications of this, but perhaps that is what is affecting you.
The other machine is the same version. I will look at the ARP issue, but I don’t believe it is related: the host that I am pinging is not on the same subnet as the OpenSuse server, so it would not be in the ARP cache anyway.
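To make the subnet reasoning above concrete, here is a minimal Python sketch of which address a host has to resolve with ARP for a given destination. The addresses are documentation-range placeholders, not values from this thread, and `arp_target` is a hypothetical helper name. It only models the simple case of one interface and one default gateway.

```python
import ipaddress


def arp_target(local_iface: str, destination: str, gateway: str) -> str:
    """Return the IP the host must resolve with ARP to reach `destination`.

    local_iface is the host's own address in CIDR form, e.g. "192.0.2.10/24".
    On-link destinations are resolved directly; anything else is sent to the
    gateway, so only the gateway's MAC ends up in the ARP cache.
    """
    iface = ipaddress.ip_interface(local_iface)
    dest = ipaddress.ip_address(destination)
    if dest in iface.network:
        return str(dest)
    return str(ipaddress.ip_address(gateway))


# Placeholder addresses (assumptions, not taken from the thread).
print(arp_target("192.0.2.10/24", "198.51.100.1", "192.0.2.1"))  # remote router
print(arp_target("192.0.2.10/24", "192.0.2.77", "192.0.2.1"))    # local host
```

With these placeholders the first call returns the gateway address and the second returns the destination itself, which matches the point that a router on a remote subnet would not have its own ARP entry on the monitoring server.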
Is there a chance that this particular machine is assigned a duplicate IP address? (You haven’t told us if the addresses are statically or dynamically assigned).
Are you pinging by hostname or IP address?
Can you provide traceroute output?
No duplicate IP and using only static IP addresses on all hosts.
Pinging by IP address only.
I cannot provide traceroute output until it is broken again. Here is my ifconfig output and routing table, though. I am not sure whether the auto-assigned 169.254 address on ens192:av is of any relevance.
Well, the 169.254.x.y subnet is reserved for Automatic Private IP Addressing (APIPA). (I could be wrong, but I think openSUSE self-assigns one when no IP address is assigned, e.g. by a DHCP server.) Anyway, it’s a non-routable link-local address.
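As a quick illustration of the link-local point, Python's standard `ipaddress` module can tell whether an address falls in 169.254.0.0/16. The two addresses below are made-up examples, not the ones from the poster's ifconfig output.

```python
import ipaddress


def is_link_local(addr: str) -> bool:
    """True if addr is in the link-local range (169.254.0.0/16 for IPv4)."""
    return ipaddress.ip_address(addr).is_link_local


# Example addresses are assumptions chosen for illustration only.
for addr in ("169.254.10.20", "192.168.1.5"):
    print(addr, is_link_local(addr))
```

The first address is reported as link-local and the second is not, so traffic to a remote branch router would not normally be sourced from or routed via the 169.254 alias.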
That was my understanding as well, but I figured at this point anything could be relevant. The good news is that it broke again over the weekend, so I can provide the previously requested traceroute. I also performed a packet capture but only see the echo request and nothing else useful. I turned up the logging sensitivity on the firewall, and it restarted in order to apply the settings, automagically fixing the dead host issue in the process. So it definitely has something to do with the firewall; I am just not sure what.
Still thinking about possibilities, but I don’t think that it is definitely the firewall.
It could be a dependency (like a network restart).
So, for instance, just throwing mud at the wall…
If it is actually an ARP cache issue, then restarting network services could have triggered a fresh ARP request from your machine. As I recall, unlike Windows machines, which ARP periodically and continuously (creating all sorts of traffic “noise”), Linux by default only ARPs once; I don’t remember for sure whether that is at boot (unlikely) or at network service start (likely). This behavior can of course be modified so that your openSUSE box can be just as noisy on the network as any Windows box…
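Whatever the refresh behavior turns out to be, the current ARP cache contents can be read directly from `/proc/net/arp` on Linux. The following is a rough Python sketch that parses that table; the sample text and the helper name `parse_arp_table` are illustrative assumptions, and the `0x2` flag is taken to be the kernel's "complete" bit.

```python
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
192.0.2.1        0x1         0x2         00:11:22:33:44:55     *        ens192
192.0.2.50       0x1         0x0         00:00:00:00:00:00     *        ens192
"""


def parse_arp_table(text: str) -> list:
    """Parse the text of /proc/net/arp into a list of dicts."""
    entries = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        ip, _hw_type, flags, mac, _mask, device = fields[:6]
        entries.append({
            "ip": ip,
            "mac": mac,
            "device": device,
            # Bit 0x2 set means the entry has a resolved hardware address.
            "complete": (int(flags, 16) & 0x2) != 0,
        })
    return entries


for entry in parse_arp_table(SAMPLE):
    print(entry["ip"], entry["mac"], "complete" if entry["complete"] else "incomplete")
```

On a real system the input would be `open("/proc/net/arp").read()` instead of `SAMPLE`. Comparing the gateway's entry before and after a firewall or network restart would show whether the ARP cache is actually changing.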
In the event anyone else experiences this issue: disabling ICMP redirects fixed my problem. Whether it will cause other problems remains to be seen.
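The post does not say how redirects were disabled. On Linux the usual knob is the `net.ipv4.conf.<interface>.accept_redirects` sysctl, which is exposed under `/proc/sys/net/ipv4/conf/`. Assuming that is the setting involved, a small Python sketch for checking its current state per interface might look like this (`redirect_settings` is a hypothetical helper name):

```python
from pathlib import Path


def redirect_settings(conf_root: str = "/proc/sys/net/ipv4/conf") -> dict:
    """Map each interface name to True if it accepts ICMP redirects.

    Assumes the standard Linux layout <conf_root>/<iface>/accept_redirects,
    where the file contains "1" (accept) or "0" (ignore).
    """
    root = Path(conf_root)
    settings = {}
    if not root.is_dir():
        return settings  # not Linux, or /proc not mounted
    for iface_dir in sorted(root.iterdir()):
        flag = iface_dir / "accept_redirects"
        if flag.is_file():
            settings[iface_dir.name] = flag.read_text().strip() == "1"
    return settings


if __name__ == "__main__":
    for iface, accepted in redirect_settings().items():
        print(f"{iface}: accept_redirects={'on' if accepted else 'off'}")
```

This only reads the values; changing them is normally done with `sysctl` or an entry in `/etc/sysctl.conf`. How the `all` entry combines with the per-interface entries is described in the kernel's ip-sysctl documentation, so it is worth checking both.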