CentOS 6.3 running in Xen: private networking randomly stops working

I have a CentOS 6.3 (64-bit) VM running on openSUSE 12.2 (64-bit). The host has two physical network connections: one goes to our private network and the other is the public interface. The public network seems to work fine all the time, but the private network interface in the VM only works right after boot; once the VM has been running for a while, the private net in the VM stops working entirely. The public interface in the VM keeps working, the private net on the host still responds, and other CentOS VMs on the same host can continue to use the private network with no problems. The issue ONLY appears inside the VM. The same situation is occurring on two different VMs on two different hosts. We have a total of 7 openSUSE hosts running the Xen hypervisor, and on those we have more than 20 virtual servers, all running CentOS 6.3, and only two are giving us problems.

What we have recently changed:
All of the CentOS VMs have been upgraded to the latest kernel.
The openSUSE updates have been applied to the first host, but not to any of the others (one of the VMs with the networking problem resides on that server; the other is on a host that has not been upgraded yet).

I find it weird that only two of the servers are having this problem, and that they sit on two different hosts.
Oh, and the CentOS VM sees the interface as UP and can ping itself; it just can't ping anything else and nothing can ping it.

Here is my Xen info:

xm info

host : HP-Alpha
release : 3.4.33-2.24-xen
version : #1 SMP Tue Feb 26 03:34:33 UTC 2013 (5f00a32)
machine : x86_64
nr_cpus : 8
nr_nodes : 1
cores_per_socket : 4
threads_per_core : 1
cpu_mhz : 3000
hw_caps : bfebfbff:20000800:00000000:00000940:000ce3bd:00000000:00000001:00000000
virt_caps : hvm
total_memory : 20477
free_memory : 2
free_cpus : 0
max_free_memory : 11132
max_para_memory : 11128
max_hvm_memory : 11093
xen_major : 4
xen_minor : 1
xen_extra : .4_02-5.21.1
xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler : credit
xen_pagesize : 4096
platform_params : virt_start=0xffff800000000000
xen_changeset : 23432
xen_commandline :
cc_compiler : gcc version 4.7.1 20120723 [gcc-4_7-branch revision 189773] (SU
cc_compile_by : abuild
cc_compile_domain :
cc_compile_date : Sat Mar 30 13:14:09 UTC 2013
xend_config_format : 4

The hardware running openSUSE is all identical:
HP DL360 G5
2x 3.00GHz quad-core Xeon
14GB RAM
2x 73GB HDD for the openSUSE install (the virtual machines live on the NAS over NFS)
^^ with the exception of the first server, which has 146GB of local storage for one VM (this VM is one of the ones having problems)
2x 1Gb NIC: eth0 = private net, eth1 = public net

If you have any more questions or need more information, please let me know.
On the first server the problem appeared at a random time, while on the other the interface stopped working during a network backup that was running over it.
Thanks for any help.

Also, I just found that I get the following when attempting to ping over that interface:
"ping: sendmsg: No buffer space available"

If you think something related to TCP/IP buffers may be happening, I recommend you take a look at my paper:
https://sites.google.com/site/4techsecrets/optimize-and-fix-your-network-connection

I assume the commands described should work equally well for CentOS (your Guests) and openSUSE (your Host).

Currently, it looks like there is likely a systemd/kernel problem executing sysctl.conf, so those commands may need to be replaced by writing directly to the /proc/sys/net files, but you should still be able to run the commands that read the existing values and utilization.
(If you have a specific question about what I just said, post it.)
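To give a rough sketch of what I mean by writing directly to the proc files instead of relying on sysctl.conf: the paths below are the standard kernel buffer knobs, but the 16MB maximums are only placeholder values for illustration, not a recommendation for your particular links.

# read the current core and TCP buffer limits
cat /proc/sys/net/core/rmem_max
cat /proc/sys/net/core/wmem_max
cat /proc/sys/net/ipv4/tcp_rmem
cat /proc/sys/net/ipv4/tcp_wmem

# write new limits directly as root (the tcp_* files take min/default/max)
echo 16777216 > /proc/sys/net/core/rmem_max
echo 16777216 > /proc/sys/net/core/wmem_max
echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 16777216" > /proc/sys/net/ipv4/tcp_wmem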

Note that, as I describe there, there is a relationship between the TCP/IP congestion control algorithm and the buffer sizes and settings, so you may have to play around a bit to find the best combination. Also note that adjusting these settings is likely needed for any setup other than a small, low-volume, low-utilization server. If you have big, fat network connections, high traffic, many simultaneous users, transfer very large files (larger than about 5MB), or have non-optimal network conditions, then this is for you.
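As a quick sketch of the knobs involved in that "playing around": the reads below are safe everywhere, while the htcp switch is purely an example and assumes that algorithm is available in your kernel.

# list the available congestion control algorithms and show the active one
cat /proc/sys/net/ipv4/tcp_available_congestion_control
cat /proc/sys/net/ipv4/tcp_congestion_control

# switch the algorithm directly through proc (htcp is only an example value)
echo htcp > /proc/sys/net/ipv4/tcp_congestion_control

# see how much memory TCP sockets are currently using
cat /proc/net/sockstat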

HTH,
TSU

Thank you for your prompt response. After reading a bit online, it appears that the offloading on the NIC was causing the problem.
The following is what we used, and it seems to have resolved our issue for now; we are going to do some further testing and will post the results.

If this is what it seems to be, the fix should be applied to all of the virtual network cards on the vservers (VMs), both private and public, but the private card is the one that apparently needs it the most.
We're turning off all "offloading". To see the offloading status of any card, for example eth0, we do this:
ethtool --show-offload eth0
The default output in our vservers looks like this (shown here for eth1; -k is the short form of --show-offload):
[root@Utility6 ~]# ethtool -k eth1
Features for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
rx-vlan-offload: off
tx-vlan-offload: off
ntuple-filters: off
receive-hashing: off
As you can see, a number of offloading options are turned on.
To turn them all off, in the case of eth0, you enter these commands:
ethtool -K eth0 tx off
ethtool -K eth0 sg off
ethtool -K eth0 tso off
ethtool -K eth0 gso off
ethtool -K eth0 gro off
If you rerun the --show-offload command, you should see that all of the offloading has been turned off.
These changes don't require a reboot of the vserver to take effect, but they are also lost when the vserver reboots. To counter that, add these lines to the /etc/rc.local file (once again with eth0 as the example):
/sbin/ethtool -K eth0 tx off
/sbin/ethtool -K eth0 sg off
/sbin/ethtool -K eth0 tso off
/sbin/ethtool -K eth0 gso off
/sbin/ethtool -K eth0 gro off
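Since the same lines are needed for every virtual NIC on each vserver (private and public), we will probably collapse them into a small loop in /etc/rc.local instead. This is only a sketch of that idea, with our interface names eth0 and eth1 assumed:

# disable the same offload features on both cards at boot
for IF in eth0 eth1; do
    for FEATURE in tx sg tso gso gro; do
        /sbin/ethtool -K $IF $FEATURE off
    done
done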

If "offloading" is what I think it is, namely that a number of more recent NICs have some low-level networking functionality built in, thereby offloading CPU cycles and possibly IOPS, then yeah, I've heard that early implementations on some cards are faulty, which is typical when new technologies are introduced.

Maybe virtualization technologies introduced something unplanned by the NIC architects.

TSU