Serious defunct process issue building Linux source code on Tumbleweed

This has taken many hours of investigation. I suspect this needs to be looked at by someone in the know. 20180702.

Basically, during “make -j4” of the kernel 4.17.2 source (config from make oldconfig), the parent make fails to wake up and reap defunct make/sh processes. That ties up tokens in make’s jobserver pipe (the shared read/write file descriptors), so make intermittently behaves like -j1. It doesn’t happen on every run: build time with -j4 varies from 20 to 60 minutes, while -j1 takes 65 minutes. make -j12 or -j24 seems to hide the issue. The hardware is all brand new and top-shelf (8086K, 2x16GB RAM, 1TB NVMe); the CPU is not overclocked, the 3200 RAM is vastly underclocked at 2666, with extra voltage all around. These are the key observations, and this state persists for far too long:
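For anyone not familiar with the jobserver, here is a rough toy model in Python of how the token pipe limits parallelism. It is only a sketch of the idea, not GNU make's code: the helper names (make_jobserver, acquire, release) are made up for illustration, and I'm assuming the usual scheme where -jN puts N-1 tokens in the pipe and the parent keeps one implicit slot.

```python
import os

def make_jobserver(jobs):
    """Create a pipe preloaded with jobs-1 tokens (assumed: one slot is implicit)."""
    r, w = os.pipe()
    os.write(w, b"+" * (jobs - 1))
    os.set_blocking(r, False)  # so an empty pipe reports "no token" instead of hanging
    return r, w

def acquire(r):
    """Try to take one token; return it, or None if none is available."""
    try:
        return os.read(r, 1)
    except BlockingIOError:
        return None

def release(w, token):
    """Hand a token back once the child holding it has been reaped."""
    os.write(w, token)

r, w = make_jobserver(4)
held = [acquire(r) for _ in range(3)]  # three extra jobs start, pipe is now empty
stalled = acquire(r)                   # a further request finds no token
release(w, held.pop())                 # a finished child is reaped, token returns
resumed = acquire(r)                   # now the next job can start
```

The point is the middle step: a token only goes back into the pipe when the finished child is reaped. If defunct children sit unreaped, their tokens stay out of circulation and the build drifts toward one job at a time.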

pstree [pid of make -j4]

make - make - sh - gcc - cc1+as
… |- 2*[make] # stalled for many minutes
… - make - sh # stalled for many minutes

ps ux | grep defunct


make <defunct>
make <defunct>
sh <defunct>
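If you want to catch these without eyeballing ps output, something like the following Python sketch should list defunct processes together with their parent PID. It assumes a Linux /proc filesystem and the usual /proc/<pid>/stat layout (pid, command in parentheses, state, ppid); the function names are my own.

```python
import os

def parse_stat(line):
    """Return (pid, comm, state, ppid) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = line.index("(")
    rpar = line.rindex(")")
    pid = int(line[:lpar])
    comm = line[lpar + 1:rpar]
    rest = line[rpar + 2:].split()
    return pid, comm, rest[0], int(rest[1])

def list_zombies(proc="/proc"):
    """Collect every process whose state is Z (zombie, shown as <defunct> by ps)."""
    zombies = []
    if not os.path.isdir(proc):
        return zombies
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                info = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited, or the line was unreadable, while scanning
        if info[2] == "Z":
            zombies.append(info)
    return zombies

if __name__ == "__main__":
    for pid, comm, state, ppid in list_zombies():
        print(pid, comm, "<defunct>", "parent:", ppid)
```

Running it in a loop during the build and watching whether the same PIDs stay listed for minutes would show the stall, and the parent PID tells you which make is not reaping.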

top

… 90.0 idle …

So, I’m off to test another distro to see if it’s something openSUSE Tumbleweed has done with kernel/package compile flags or kernel options. Also, I have a Ryzen 2700X Tumbleweed box on which I can try to reproduce the issue, to rule out anything brand-specific. But something is vastly wrong here.

And wasn’t Tumbleweed having an issue with their build farm performance recently? I might have just outlined the symptoms of your problem above …

I’m NOT seeing the same issue on Ubuntu 18.04 so far. Looks like Ubuntu is still on kernel 4.15 and gcc 7.3. Ubuntu compiles 2-4x faster than Tumbleweed because it’s actually running 4 things in parallel for the entire build for -j4.

Likely problem: pselect in GNU make is not receiving signals, so make fails to reap defunct sub-processes in a timely manner. A race condition may have been introduced by a code, compiler, or kernel change.
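To illustrate the kind of race I mean, here is a small Python sketch of a "wait until readable, then read" sequence on a token pipe. This is not GNU make's code and not the upstream patch; it only shows the general hazard that another process can take the token between the wait and the read. With a blocking read the caller would then sit there (and reap nothing); with a non-blocking read it can go back to check on its children. The between hook is purely a demo device I added to force the race.

```python
import os
import select

def try_acquire(r, timeout=0.1, between=None):
    """Wait up to `timeout` for a token, then read without blocking.
    Returns the token, or None so the caller can reap children and retry."""
    ready, _, _ = select.select([r], [], [], timeout)
    if not ready:
        return None
    if between is not None:
        between()  # demo hook: lets a rival take the token right here
    try:
        return os.read(r, 1)
    except BlockingIOError:
        return None  # readable a moment ago, but someone else got there first

r, w = os.pipe()
os.set_blocking(r, False)
rival = os.dup(r)  # stands in for another make sharing the same pipe

os.write(w, b"+")
lost = try_acquire(r, timeout=0, between=lambda: os.read(rival, 1))

os.write(w, b"+")
won = try_acquire(r, timeout=0)
```

In the first call the pipe looks readable but the token is gone by the time of the read, so the function returns None instead of hanging. In the second call nobody interferes and the token is acquired.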

Wild guesses in no particular order:
(1) issue with kernel 4.17
(2) issue with glibc
(3) issue with gcc8 + kernel / make / glibc

This is a serious issue!!!

Reply from the GNU make mailing list: “There has been a fix made to GNU make since 4.2.1 was released which could be related to the problem you’re seeing.” Okay, that’s encouraging, and could explain why TW is not showing symptoms outside of GNU make. I will try to get my hands on the latest make source code to see if it resolves the current problem.

http://lists.gnu.org/archive/html/help-make/2018-07/msg00004.html

Pulled GNU make from git, can confirm it resolves the issue. How can I check for an existing bug report? This needs to be made known to openSUSE devs. A little surprised this hasn’t been noticed more widely. Kinda obvious when make -j4 degrades to -j1 while building the Linux kernel!

Filed https://bugzilla.opensuse.org/show_bug.cgi?id=1100504

I think this means a fix has been queued up in Factory:

https://build.opensuse.org/request/show/623670

Thanks for sharing your experiences here. I don’t think many of us compile our own kernels, but the ones that do could certainly benefit.

It would affect make -j of any software package, not just kernel source code. I just happened to be testing compile times between 4790K, Ryzen 2700X, and 8086K using Linux source code, when I observed the issue.

I compiled kernels three times this week, and have seen nothing of the sort. Using make -j4, all 4 of my cores are pinging to 100% for much of the process. Haven’t seen any “stalled for many minutes” messages, or any acting like a -j1 process.

make 4.2.1-5.5
gcc 8-2.1
kernel 4.17.7

Trust me, it’s there. What CPU do you have? Just because you don’t see it on your particular system doesn’t mean others aren’t affected. I don’t know exactly what it takes to tickle the bug. It’s very, very obvious on my 8086K with -j4. The make maintainer knows about the issue, there is a patch from him that addresses it very specifically, and the patch resolves it. There are no messages; CPU utilization is just low, and pstree shows make sub-processes that sit there until the end and then somehow get unblocked. I’m fairly certain I also observed it on a Ryzen 1500X CPU. It doesn’t happen every time, but fairly regularly. I did test and confirm that the Factory rpm (“4.2.1-227.1”) works properly.

Anyway, I can see the patch’s changelog note in yast2 sw_single, but the rpm clearly has not been updated: the build date is still June 13. It’s weird, and rather misleading, that the Changelog shows the patch notes while the rpm is not updated. How is it that the Changelog is ahead of the rpm? (4.2.1-5.5)

I figured this one out. It was pulling the Changelog from the fixed Factory 4.2.1-227.1 package installed on my system. After re-installing the older, broken 4.2.1-5.5, the Changelog no longer shows the patch info. Mystery solved. So check the “Technical Data” tab: a “Build Time” of July 16th or later should have the fix.

20180731 has make-4.2.1-6.1 which includes the fix.

rotfl!