This is really hard to troubleshoot, but it seems worth passing on.
Maybe others have seen this?
I know the good folks from SUSE seem to see posts on this board (thanks).
We have a VMware infrastructure (which could easily interact with all of this).
I’ve had three servers with the new kernel patches (kernel-default-4.4.155-68.1.x86_64) that “go south” with 100% CPU.
All three have some similar characteristics.
- No logs, no sar/sysstat, cannot ping, no nothing after the CPU spikes
- VMware console view is similarly locked up.
- All have “spiky” usage patterns: longish periods of inactivity followed by bursts of high CPU
- VMware vCenter monitoring shows CPU capped at 100% until the machine is reset
- VMware cannot see the open-vm-tools agent after the spike
- And (of course) all the servers with this issue have the newer kernel, but older servers with similar (and even identical) workloads are fine.
One server in particular has frozen like this many times. It’s serving Varnish and Hitch (which can easily spike CPU for brief intervals). This server is telling: it’s currently the “warm spare”, while the production version (running the older kernel) is completely solid.
For fairly obvious reasons, I cannot offer a whole lot more details (no logs, no connectivity, and all the servers are production to some degree).
This makes me reluctant to file a formal bug report.
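Since nothing gets logged once the CPU spikes, one cheap way to at least pin down when a box dies is a heartbeat that forces each entry to disk. The following is only a rough sketch, not something I have running on these servers: the function names, the log path in the usage note, and the 5-second interval are all made up for illustration.

```python
import os
import time


def heartbeat_line(now=None):
    """Build one log line: local timestamp plus 1/5/15-minute load averages."""
    now = time.time() if now is None else now
    load1, load5, load15 = os.getloadavg()  # available on Unix-like systems
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    return f"{stamp} load={load1:.2f},{load5:.2f},{load15:.2f}\n"


def append_durably(path, line):
    """Append a line and fsync it, so it should survive a hard reset."""
    with open(path, "a") as fh:
        fh.write(line)
        fh.flush()
        os.fsync(fh.fileno())


def run_heartbeat(path, interval=5.0, count=None):
    """Write a heartbeat every `interval` seconds; run forever if count is None."""
    written = 0
    while count is None or written < count:
        append_durably(path, heartbeat_line())
        written += 1
        time.sleep(interval)
```

Calling something like `run_heartbeat("/var/tmp/heartbeat.log")` (hypothetical path) from a background service would leave the last timestamp and load figures on disk after a freeze. It will not say why the machine hung, only roughly when, and what the load looked like just before.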
For the time being I’ve reverted to kernel-default-4.4.143-65.1.x86_64 on the problematic servers, which does seem to be working (though it will take more time to verify).
As I understand it, the kernel update was about Spectre-like speculative execution fixes, which to my uneducated eye makes it seem likely that there’s some odd edge case triggered by load.
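If anyone wants to compare what the two kernels actually enable, a small script along these lines could dump the mitigation status on a good and a bad server for a side-by-side diff. This is a sketch under assumptions: it reads the standard sysfs directory `/sys/devices/system/cpu/vulnerabilities`, and I have not confirmed that both 4.4.x builds expose it (the function simply returns nothing if the directory is missing). The function name is my own.

```python
from pathlib import Path

# Standard sysfs location for mitigation status on kernels that expose it.
# Whether a particular 4.4.x build provides it is an assumption here.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigations(vuln_dir=VULN_DIR):
    """Return {vulnerability_name: status_line} for each file in vuln_dir."""
    vuln_dir = Path(vuln_dir)
    if not vuln_dir.is_dir():
        return {}  # kernel does not expose the directory
    status = {}
    for entry in sorted(vuln_dir.iterdir()):
        try:
            status[entry.name] = entry.read_text().strip()
        except OSError:
            status[entry.name] = "<unreadable>"
    return status


if __name__ == "__main__":
    import platform

    print("kernel:", platform.release())
    for name, state in read_mitigations().items():
        print(f"{name}: {state}")
```

Running it on one server with each kernel and diffing the output would at least show which mitigations differ between the two, which might narrow down where the edge case lives.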