Hello All!
Up Front Info
I have a bare metal system with Leap 15.4 installed. Hardware has 4 NICs – of which I have eth0 and eth1 cabled. The eth0 network is the primary network. It gets its IP from DHCP. The eth1 interface is what we consider the camera/accessory network. It also receives its IP from DHCP.
The eth0 interface is a part of a different network segment than the eth1. We can call the eth0 segment “facility” and the eth1 segment “management.”
I have a couple of hundred of these machines in the field. Of these, maybe 10 or so have had this same issue. Each location has its own DHCP server.
My Issue and Troubleshooting
The issue that I am having with my system is that when I toggle wicked.service (or reboot the box), it is a crapshoot as to whether eth0 or eth1 will come up as the primary interface. When the primary interface is switched to eth1, we lose the ability to ssh to the machine, even though both eth0 and eth1 show a status of UP. We were experiencing this issue before changing any configuration files or default routes.
e.g.
While my machine was pinging, I bounced wicked.service (instead of rebooting, which is where we also see this issue manifest). After that first service restart, the machine only dropped off of the network for a few seconds and then began pinging again. When I ran the ‘ip route’ command, I saw that eth0 was showing as the default.
I restarted the service again (5 more times) and each time the box stopped pinging indefinitely (I am consoled in), and ‘ip route’ showed the default route on the 192 (eth1) IP address.
Once I restarted the service again, we were back to working condition with ‘ip route’ showing the expected eth0 interface and IP.
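To make the check after each restart repeatable, here is a minimal sketch of a helper that pulls the default-route device out of ‘ip route’ output. The function name default_route_devs is made up for illustration, and it assumes the usual iproute2 line format (default via <gateway> dev <iface> ...).

```shell
#!/bin/sh
# Hypothetical helper: read `ip route` output on stdin and print the device
# of every default route, one per line, in the order they are listed.
default_route_devs() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "dev") { print $(i + 1); break }
    }'
}

# Typical use after bouncing the service (left commented out on purpose):
#   systemctl restart wicked.service
#   ip -4 route show | default_route_devs
```

If this prints more than one device, both interfaces installed a default route, which would be worth knowing on its own.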
As I have been troubleshooting this, I have become more familiar with some of the wicked configuration files (but I am by no means an expert – so please show some grace). My ifcfg-eth* files are shown below. The additions I made to troubleshoot are followed by an asterisk.
eth0:
STARTMODE=auto
BOOTPROTO=dhcp
DEFROUTE=yes
ZONE=public
DHCLIENT_SET_HOSTNAME=yes
DEBUG=yes*
PERSISTENT=yes*
eth1:
STARTMODE=auto
BOOTPROTO=dhcp
DEFROUTE=no
DEBUG=yes*
My /etc/udev/rules.d/70-persistent-net.rules look like this (scrubbed for personalized info):
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="<some_mac>:b7", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth3"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="<some_mac>3:b5", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="<some_mac>:b6", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="<some_mac>:b4", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
There were no ‘route’ files in /etc/sysconfig/network until I manually created a file named “routes”. Once I went into YaST and configured eth0 as the default route, the other files were created (and routes was zeroed).
$ ls -l *route*
-rw-r--r-- 1 root root 28 Jun 13 14:48 ifroute-eth0
-rw-r--r-- 1 root root 29 Jun 13 14:48 ifroute-eth0.YaST2save
-rw-r--r-- 1 root root 0 Jun 13 14:48 routes
-rw-r--r-- 1 root root 0 Jun 13 14:48 routes.YaST2save
I have since removed all route files except for ifroute-eth0, which contains the following line:
default <router_ip> - eth0
*Note: I have also tried the segment IP (ending in .1) and this does not work either.
*Another Note: We do not have route files configured in ANY of the ~200 machines in the field, so I am not sure how impactful this is.
From the console, I had one session redirecting dmesg output, while in another session I restarted wicked. On restarts that resulted in the default route being switched to eth1, nothing stood out to me in dmesg (or journalctl -u wicked.service) that would indicate why/how this primary route flip occurred. Admittedly, this could be because I don't know exactly what to look for.
When a box is working properly, we see this line in the output from the ‘wicked show all’ command under the eth0 interface:
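Since I am not sure what to look for, one approach is to snapshot the same state after a good restart and after a bad one and compare the two. The following is only a sketch: the function name capture_state and the output file names are made up, and it assumes the DHCPv4 client logs under wickedd-dhcp4.service on this system.

```shell
#!/bin/sh
# Hypothetical helper: save routing and wicked state into the directory
# given as $1 so a "good" and a "bad" restart can be compared afterwards.
capture_state() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    # Each file is created even if the command fails; errors land in the file.
    ip -4 route show > "$outdir/ip-route.txt" 2>&1
    wicked show all > "$outdir/wicked-show.txt" 2>&1
    # Assumption: wickedd-dhcp4.service is the unit handling DHCPv4 leases.
    journalctl -b -u wicked.service -u wickedd-dhcp4.service \
        > "$outdir/wicked-journal.txt" 2>&1
    return 0
}

# Typical use (left commented out on purpose):
#   systemctl restart wicked.service; sleep 30; capture_state /tmp/run-good
#   systemctl restart wicked.service; sleep 30; capture_state /tmp/run-bad
#   diff -u /tmp/run-good/ip-route.txt /tmp/run-bad/ip-route.txt
```

The 30 second wait is an arbitrary guess at how long both leases take to settle.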
route: ipv4 default via <segment_ip> 1 proto dhcp
What I am struggling to find is the root cause of the primary/default route changing once the machine is rebooted or wicked.service is restarted. I have run as many checks as I know of, but we are still in the same state on this machine, where a reboot could make the machine inaccessible via ssh.
I read through many similar issues on this forum, but nothing resolved my issue. Any help would be greatly appreciated. Thank you all in advance! -Alex