Odd intermittent network/routing problem

Hi

I have a weird intermittent network problem and I am running out of ideas!

Last week I upgraded my server to OpenSuSE 12.3, and right from the install it had difficulty verifying internet connectivity.

Server 192.168.2.1, SuSE 12.3
Gateway 192.168.2.138, IPCop 2.0.6
Gigabit switch, no wireless.

The configuration seems fine.


eth0      Link encap:Ethernet  HWaddr 00:24:8C:06:61:7D
          inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:306484 errors:24 dropped:2094 overruns:23 frame:1
          TX packets:151440 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:319127546 (304.3 Mb)  TX bytes:14483317 (13.8 Mb)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:3573 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3573 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:259788 (253.6 Kb)  TX bytes:259788 (253.6 Kb)

Routing also seems OK.


Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.2.138   0.0.0.0         UG    0      0        0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U     0      0        0 lo
169.254.0.0     0.0.0.0         255.255.0.0     U     0      0        0 eth0
192.168.2.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0

And ARP is also OK.


Address                  HWtype  HWaddress           Flags Mask            Iface
192.168.2.25             ether   bc:ae:c5:6f:c6:f1   C                     eth0
192.168.2.138            ether   00:19:66:8a:62:8f   C                     eth0

The weird thing is I can connect to the server with SSH with no problem (that’s from 192.168.2.25). The gateway router can ping the server all the time. But the server occasionally just stops being able to send packets to the router. At the same time, the server can ping the printer for instance, just not the router.
No error messages, nothing in log/messages, it just fails. Nothing in the logs on the router. The server firewall is off BTW.
Tcpdump on the server claims to be sending, tcpdump on the router claims nothing arrives from the server, but can see packets from other sources.

Then some time later it works again.
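
Since the failure comes and goes, it may help to record exactly when the gateway stops answering, so the outages can be lined up against tcpdump captures or other events. The following is only a rough sketch, not something taken from this thread: the function names are made up for illustration, and `ping_once` assumes the Linux iputils `ping` with its `-c` and `-W` options.

```python
import subprocess
import time

GATEWAY = "192.168.2.138"  # gateway address from the post


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to host is answered.

    Assumes the Linux iputils ping (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def log_transitions(probe, samples, interval_s=1.0,
                    clock=time.time, sleep=time.sleep):
    """Call probe() `samples` times and yield (timestamp, state)
    each time the result differs from the previous one."""
    previous = None
    for i in range(samples):
        state = probe()
        if state != previous:
            yield clock(), state
            previous = state
        if i + 1 < samples:
            sleep(interval_s)
```

To use it on the server, iterate over `log_transitions(lambda: ping_once(GATEWAY), samples=3600)` and print each timestamp with its UP/DOWN state. Only the changes are reported, so a long run stays readable.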

I tried SMEserver on the same hardware last week. It worked fine with the network but I found it a bit too restrictive, so I went back to SuSE.

Any ideas?

Thanks
Liam

BTW, in case anyone is wondering why the gateway is at 192.168.2.138: it used to be on a 10mbit segment at 192.168.1.1, and 192.168.2.1 was the gateway to that segment. I got rid of the 10mbit segment and put the router on the high speed net but never got around to giving it a more conventional address.

Well, either your NIC or your router is probably flaky. Can you swap out
either to test? If you could put a hub between the server and router from
which you could run tcpdump to capture all traffic on the wire that’d be a
quick way to test both without changing either.
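
One cheap check before swapping hardware: the eth0 RX line in the ifconfig output earlier in the thread shows errors:24 dropped:2094 overruns:23 out of 306484 packets, so drops are roughly 0.7% of what was received. That is not proof of a bad NIC, but watching whether those counters climb during an outage might narrow things down. Here is a minimal sketch for turning that line into rates. It assumes the older net-tools ifconfig format shown in the post, and the function name is just illustrative.

```python
import re

RX_LINE = re.compile(
    r"RX packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+) frame:(\d+)"
)


def rx_counter_rates(ifconfig_text):
    """Return each RX problem counter as a fraction of RX packets.

    Only the first RX line in the text is used, i.e. the first
    interface listed.
    """
    match = RX_LINE.search(ifconfig_text)
    if match is None:
        raise ValueError("no RX counter line found")
    packets, errors, dropped, overruns, frame = map(int, match.groups())
    counters = {
        "errors": errors,
        "dropped": dropped,
        "overruns": overruns,
        "frame": frame,
    }
    if packets == 0:
        return {name: 0.0 for name in counters}
    return {name: count / packets for name, count in counters.items()}


# The eth0 RX line from the original post.
sample = "RX packets:306484 errors:24 dropped:2094 overruns:23 frame:1"
for name, rate in rx_counter_rates(sample).items():
    print(f"{name}: {rate:.4%}")
```

Running it against two ifconfig snapshots taken a few minutes apart, one while things work and one while they don't, would show whether the drop rate jumps when the router becomes unreachable.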

Good luck.

Thanks for the input. That was basically my feeling too, but you never know, I might have missed something obvious.
Spare NICs are on the way. I will have to do some digging to see if I have a real hub somewhere, and it is likely just 10mbit so it might not help with different frame sizes etc.
I also have a small switch with port mirroring on the way too - the current switch is dumb.

Another simple test… maybe one of the ports on the switch is going out.
Have you tried swapping ports with the box that always works?

Good luck.

I considered that, and it is worth a try (when I get time over the weekend), but it doesn’t really explain why ping works well in one direction but only intermittently in the other.

It’s at this point that I’d consider running Wireshark to look at the packets. You might not even need much experience with it if something is color-coded and saying “FAILED.”

Do you have any other machines on the same network, and if you do, are any of them having similar problems connecting to the router or to your upgraded openSUSE box? If you have only a few machines, or ones that are rarely used, you might have overlooked a larger issue that you assumed was isolated to these two machines.

Also, I’d want to ask whether you’re running any kind of network security like LDAP, AD or something similar… If your machine changed identity somehow during your upgrade, that could have thrown things awry. So, for instance, clear the ARP caches on the router and your openSUSE box if you can, and on anything else that retains identity (or, if you’re satisfied both machines aren’t getting info from another machine, boot both simultaneously).
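
One way to check the "changed identity" idea without guessing is to save the ARP table while things work and again while they fail, then compare the two. Below is a rough sketch, assuming the column layout of the ARP listing posted earlier in the thread. Both function names are made up for illustration.

```python
def parse_arp_table(text):
    """Map IP address -> MAC address from an ARP listing.

    Expects rows shaped like: <ip> ether <mac> <flags> <iface>
    Header and blank lines are skipped.
    """
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "ether":
            table[fields[0]] = fields[2].lower()
    return table


def changed_macs(before, after):
    """Return {ip: (old_mac, new_mac)} for IPs whose MAC differs."""
    return {
        ip: (before[ip], after[ip])
        for ip in before
        if ip in after and before[ip] != after[ip]
    }
```

If 192.168.2.138 shows a different MAC in the failing snapshot than the 00:19:66:8a:62:8f seen in the original post, something else on the network is answering for the gateway's address. If it never changes, stale or conflicting ARP entries become a less likely explanation.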

Hardware is only one possible problem; there are still many software possibilities to be considered.

HTH,
TSU

Well, it is currently working but I have no idea why.
I tried a brand new Longshine smart switch. This didn’t help. I tried using the port mirroring function to check on the ICMP packets with wireshark, but there was nothing visible, even though ping was working in one direction at least.
It seems like the Longshine doesn’t mirror ICMP for some reason but I could not confirm this from any documentation.

So then I finally dug out an old 10mbit hub. This worked fine. I could monitor the packets in both directions. Unfortunately (or fortunately), at 10mbit, everything worked fine and I could ping in both directions, the gateway was visible and everything worked as expected!!

So next I tried a little 5 port smart switch from Netgear. Everything is still working as it should. The Netgear switch also works with port mirroring just as I would expect, showing ICMP packets in both directions.

So now everything works, but I am still no wiser as to why it failed last week.

Liam