All -
I just wanted to close this out with a final report. Although no solution has been found, I do have a workaround, and unless someone wants to push on this more, I’m just going with what I have.
PROBLEM SUMMARY
To recap the problem:
- IPv6 external routing on Leap 15.6 is failing on one machine 1-4 hours after server boot.
- The problem can be reproduced reliably, even on a fresh load of Leap 15.6, server profile only.
- It only happens on IPV6; IPV4 is unaffected.
- It only happens when the network interface is in bridged mode (br0), regardless of whether the Xen hypervisor is active or not.
- It only happens at OVH locations using the new, undocumented, “V3” networking mode for IPV6, characterized by a default gateway of fe80::1. It is not happening at other OVH or non-OVH locations.
- Changes to /sys, /proc, and other sysctl structures did not solve the problem (many were tried, including forwarding, multicast_snooping, ignore_linkdown, etc., none of them solved the issue).
- No clues, anomalies, or changes are noted in the ip -6 addr, neighbor, or route tables. Clearing or resetting those tables does not restore connectivity.
- It happens whether IPV6 is configured in Yast2, via NetworkManager, via Wicked, or through the use of manual setup commands.
- When external routing fails, the host can still communicate over IPV6 with itself, with the default gateway at OVH, and with any guests on the host. IPV6 itself is not failing, only routing to and from the outside world.
- When external routing fails, any running guests are also cut off for IPV6.
- systemctl restart network, or similar attempts to reset the network, do not restore connectivity.
- Only a full reboot of the server (physical host) will bring back connectivity, until it fails again 1-4 hours later.
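The symptom pattern above (gateway still reachable, outside world not) can be checked with a small script. This is only a minimal sketch and was not part of the original troubleshooting: the function names are hypothetical, the external test address is an assumption (2001:4860:4860::8888 is Google public DNS, but any external IPV6 host you trust will do), and it assumes the iputils version of ping with the -6, -c, -W and -I options.

```shell
#!/bin/bash
# Sketch: tell "IPV6 is down" apart from "IPV6 external routing is down".
# BRIDGE, GATEWAY and EXTERNAL are assumptions; adjust them for your host.
BRIDGE="${BRIDGE:-br0}"
GATEWAY="${GATEWAY:-fe80::1}"
EXTERNAL="${EXTERNAL:-2001:4860:4860::8888}"

# Return 0 if the target answers at least one of two pings.
probe() {
    ping -6 -c 2 -W 2 "$@" >/dev/null 2>&1
}

# Map the two probe results (0 = reachable) to a short verdict.
classify_ipv6_state() {
    local gw_rc="$1" ext_rc="$2"
    if [ "$ext_rc" -eq 0 ]; then
        echo "ok"
    elif [ "$gw_rc" -eq 0 ]; then
        echo "external-routing-failed"
    else
        echo "ipv6-down"
    fi
}

# Probe the gateway on the bridge, then the external host, and print a verdict.
check_ipv6() {
    local gw_rc ext_rc
    probe -I "$BRIDGE" "$GATEWAY"; gw_rc=$?
    probe "$EXTERNAL"; ext_rc=$?
    classify_ipv6_state "$gw_rc" "$ext_rc"
}
```

Calling check_ipv6 by hand or from cron prints one of the three verdicts. The failure described in this thread corresponds to "external-routing-failed".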
STEPS TO REPRODUCE
This process can be reproduced with the following steps:
- Lease a new server from OVH which uses their new V3 networking (expensive, given their setup fees)
- Do a fresh load of openSUSE Leap 15.6, using just the basic server profile.
- Enable Xen, which will create a bridged network interface. It is not necessary to boot into Xen Hypervisor mode, just having the br0 interface in use is enough.
- Leave IPV4 as is (DHCP) or configure it manually; it makes no difference.
- Configure IPV6.
- Wait 1-4 hours.
IPV6 routing connectivity will fail reliably at this point, and only a reboot will bring it back.
The simplest way to configure IPV6 in this case is using manual commands. I used:
ip -6 addr add 2607:5000:1:1::5/56 dev br0
ip -6 route add default via fe80::1 dev br0 onlink
to bring up IPV6 service on boot. As noted, the method doesn’t matter, the failure will happen regardless of configuration method. The 2607 address shown in this thread is an example only. The fe80::1 is the actual default gateway for all IPV6 servers on the new V3 OVH networks.
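To confirm that the default route really is in place after running those commands, something like the following can be used. It is a sketch under assumptions: has_default_route is a hypothetical helper name, and it simply pattern-matches the text printed by ip -6 route show default. As noted in the problem summary, the route table looks the same before and after the failure, so this only verifies configuration. It does not detect the failure itself.

```shell
#!/bin/bash
# Sketch: check the output of `ip -6 route show default` for a default
# route via a given gateway on a given device.
# has_default_route is a hypothetical helper, not part of iproute2.
has_default_route() {
    local gw="$1" dev="$2"
    # Reads route lines on stdin; succeeds if one matches.
    grep -Eq "^default via ${gw} dev ${dev}( |\$)"
}

# Usage (run on the host):
#   ip -6 route show default | has_default_route fe80::1 br0 && echo "route present"
```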
As noted, no solution has been found for this issue as of the time of this writing. However, by random chance through ongoing testing, I have found a workaround:
WORKAROUND
After the server boots, and IPV6 has come up, execute the following command sequence:
#!/bin/bash
# /usr/local/bin/recycleipv6
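# Toggling disable_ipv6 flushes the IPV6 addresses and routes on all
# interfaces, so the address and default route are added back afterwards.
#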
echo 1 > /proc/sys/net/ipv6/conf/all/disable_ipv6
sleep 1
echo 0 > /proc/sys/net/ipv6/conf/all/disable_ipv6
sleep 1
/sbin/ip -6 addr add 2607:5000:1:1::5/56 dev br0
/sbin/ip -6 route add default via fe80::1 dev br0 onlink
exit 0
Substitute in your own IPV6 address of course.
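If you would rather not hard-code the address, the same sequence can be parameterized. This is a sketch, not the script I am running: the variable names IPV6_ADDR, IPV6_GW, IPV6_DEV and DRY_RUN are made up for illustration, and sysctl -w is used as an equivalent of the echo into /proc. With DRY_RUN=1 it only prints what it would do, which is handy for checking the values before touching a remote server.

```shell
#!/bin/bash
# Sketch of a parameterized variant of recycleipv6.
# IPV6_ADDR, IPV6_GW, IPV6_DEV and DRY_RUN are hypothetical variable names.
IPV6_ADDR="${IPV6_ADDR:-2607:5000:1:1::5/56}"   # example address only
IPV6_GW="${IPV6_GW:-fe80::1}"
IPV6_DEV="${IPV6_DEV:-br0}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Disable and re-enable IPV6, then restore the address and default route.
recycle_ipv6() {
    run sysctl -w net.ipv6.conf.all.disable_ipv6=1
    run sleep 1
    run sysctl -w net.ipv6.conf.all.disable_ipv6=0
    run sleep 1
    run ip -6 addr add "$IPV6_ADDR" dev "$IPV6_DEV"
    run ip -6 route add default via "$IPV6_GW" dev "$IPV6_DEV" onlink
}

# Usage: DRY_RUN=1 recycle_ipv6    (preview)
#        recycle_ipv6              (as root, for real)
```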
Running these commands will cause IPV6 to come up and stay up permanently. It is not necessary to wait for IPV6 to fail before running these commands (although they will work to restore service reliably even after the failure.) You can just run them 5 seconds after the server boots. Making a new systemd service that runs “After=network.service” and then calls a script to run these commands is sufficient. For example:
# /usr/lib/systemd/system/recycleipv6.service
#
[Unit]
Description=Corrects OVH IPV6 routing issue at system startup.
After=network.service
After=mysql.service
[Service]
Type=oneshot
User=root
ExecStart=/usr/local/bin/recycleipv6
KillSignal=SIGHUP
[Install]
WantedBy=multi-user.target
Make sure you have “exit 0” in the script to keep systemd happy.
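For completeness, installing and enabling the pieces above might look like this. It is a sketch that assumes the script and unit file are already saved at the paths shown in this post. The install_recycleipv6 function and the DRY_RUN switch are hypothetical conveniences, while chmod, systemctl daemon-reload and systemctl enable are the standard commands.

```shell
#!/bin/bash
# Sketch: make the script executable and enable the unit at boot.
# Paths match the examples in this post; adjust if yours differ.
SCRIPT=/usr/local/bin/recycleipv6
UNIT=recycleipv6.service

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

install_recycleipv6() {
    run chmod 755 "$SCRIPT"          # systemd needs the script to be executable
    run systemctl daemon-reload      # pick up the new unit file
    run systemctl enable "$UNIT"     # start it on every boot
}

# Usage (as root): install_recycleipv6
```

After the next reboot, systemctl status recycleipv6.service should show whether the script ran and what it exited with.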
This is not a solution to the problem. It’s a workaround. I still don’t know why this is happening. I cannot, of course, get access to the OVH routers, or get any useful help from OVH at all; they are not interested in this. I cannot say why toggling disable_ipv6 works, and corrects the problem until the next reboot, when restarting the whole network does not. I surmise that there is something about the first initialization of IPV6 in the Leap stack at boot that is being “triggered” by OVH’s crazy network setup. Perhaps something is being done out of order, or in an unexpected order, that is only exposed in bridge mode, and only at OVH. But such things are beyond my ability (or resources) to debug further or understand.
Regardless, resetting just the IPV6 stack as illustrated here after each reboot does return stability to IPV6, for the host, and for any guests running on the host. Indeed, it is not necessary to run this on Xen guests. You need only run it on the physical host, and the problem is prevented for host and guests.
So, while this is just a hack, not a solution, I will nevertheless attempt to mark this as a solution so that in case anyone trips over this in the future they might find this thread. OVH has informed me that they’re rolling out this new network to all new servers in all data centers, so unless we find something on our end, other people wanting to run Xen on Leap at OVH will probably trip over this.
@arvidjaar - Thank you for your patience and help, I am very grateful.
@malcolmlewis - Thank you for overseeing this and educating me on this forum. I am very grateful to you also.
Thank you,
Glen