Only one CPU core is used (various programs)

alexsb · June 12, 2018, 1:31pm

Hello, I freshly installed Leap 15 with KDE (everything up to date) and noticed a strange behavior of some user processes. If I start multiple instances of the same program, they all run on the first core. They won’t be distributed across the cores.

First I noticed that behavior in Blender. It starts multiple instances for the renderer but they all run on the first core.
Another example is when I manually start computations in two separate instances of Gimp, they both max out only the first core and the other ones don’t show any significant load.
However, if I start those instances and bind each one to a different core via the GOMP_CPU_AFFINITY variable (e.g. GOMP_CPU_AFFINITY="1" gimp -n), it works as expected: the specified cores are fully maxed out.
Not every program shows this behavior; “make”, for example, uses as many cores as I specify with --jobs.

What is the problem here?

I have a new machine with a Xeon CPU, all cores are recognized by the system.

cat /proc/cpuinfo  
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 158
model name      : Intel(R) Xeon(R) CPU E3-1285 v6 @ 4.10GHz
stepping        : 9
microcode       : 0x84
cpu MHz         : 4100.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves ibpb ibrs stibp dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 8208.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

...

alexsb · June 12, 2018, 2:42pm

As additional info: taskset -pa <Blender-PID> is a workaround; it allows Blender to use all cores, as it normally should.

So there is something wrong with the default CPU affinity value for some programs, I guess.
My Linux knowledge is too limited to investigate this issue further; I hope someone can help.
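
The affinity that each thread of a running process ends up with can also be read directly from /proc, without taskset. A minimal sketch, assuming a Linux /proc filesystem; show_affinity is a hypothetical helper name, not an existing tool:

```shell
#!/bin/bash
# show_affinity is a hypothetical helper, not part of any package.
# It prints "TID<TAB>allowed CPUs" for every thread of a process, read from
# the Cpus_allowed_list field in /proc/<PID>/task/<TID>/status.
show_affinity() {
    local pid=${1:-$$}   # default: the current shell
    local status tid cpus
    for status in /proc/"$pid"/task/*/status; do
        [ -r "$status" ] || continue
        tid=${status%/status}   # strip the trailing "/status"
        tid=${tid##*/}          # keep only the thread ID
        cpus=$(awk '/^Cpus_allowed_list:/ {print $2}' "$status")
        printf '%s\t%s\n' "$tid" "$cpus"
    done
}
```

Called as show_affinity <Blender-PID>, a thread that is restricted to the first core should be listed with 0, while an unrestricted thread on an 8-CPU machine should be listed with 0-7.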

tsu2 · June 12, 2018, 4:05pm

It’s my understanding that the number of CPUs (cores) used is determined by how the application(in this case blender) is compiled.
When an app is compiled to run on only one core, I don’t know if there is anything that will load balance entire app instances, I can’t remember such a thing (but may exist, just not to my knowledge).
If an app is compiled to support multi-threading, then an instance can potentially allocate across multiple cores.

All really old apps used to be compiled to run on a single core, but generally speaking almost all modern apps are compiled to support SMP processing.
But, blender may be one of those apps that needs to run on a single core, likely because of extreme high use of CPU cache or similar on integrated CPU/GPU architectures.

The solution you found is commonly done on big servers, particularly running database apps.
Because load balancing is not automatic, it’s not always an optimal solution but when you know the immediate running conditions, you get results as expected.

TSU

flymail · June 12, 2018, 4:32pm

On 2018-06-12, tsu2 <tsu2@no-mx.forums.microfocus.com> wrote:
> It’s my understanding that the number of CPUs (cores) used is determined
> by how the application(in this case blender) is compiled.

My understanding (at least in C++11) is that thread availability and assignment is kernel-controlled but multithreading
support can be invoked from g++ using the -pthread option.

> When an app is compiled to run on only one core, I don’t know if there
> is anything that will load balance entire app instances, I can’t
> remember such a thing (but may exist, just not to my knowledge).

Again, I believe that is kernel-controlled. It’s possible to compile a binary that distributes computations over
multithreads either using C++ pthreads or OpenMP but the quality of the distribution is very much dependent on the
quality of the code to do that.

> If an app is compiled to support multi-threading, then an instance can
> potentially allocate across multiple cores.

I’m not sure what you mean by `allocate’. One of the really tricky aspects of writing multithreading code is thread
safety and potential loss of performance from L1/L2 cache misses. Intel chips however share L3 cache across all
threads.

> All really old apps used to be compiled to run on a single core, but
> generally speaking almost all modern apps are compiled to support SMP
> processing.

That is not my impression, but appreciate my experience will differ from yours.

> But, blender may be one of those apps that needs to run on a single
> core, likely because of extreme high use of CPU cache or similar on
> integrated CPU/GPU architectures.

I’m not familiar with Blender, but I would suspect its performance would be optimal under Nvidia/CUDA compared to CPU.
From Blender’s homepage, it appears it does not support gcc versions 4.7 and up, so I would avoid it.

tsu2 · June 13, 2018, 5:00am

The bottom line, which I may or may not have made clear:
It appears that blender might have been compiled to be single-threaded for some good reason.
I don’t know if there has ever been a way to automatically allocate single-threaded apps to any other core than the default which is why @alexsb found all blender instances running in the same core. The manual assignment he described is what I have seen but manual assignments cannot dynamically adjust to changing loads.

TSU

alexsb · June 13, 2018, 9:31am

OK guys, thanks for the answers, but it’s not a blender specific problem.

Only some programs show this behavior. I looked up the CPU affinity of several of them via taskset -p <PID>.
Blender gets started with affinity mask “1”, so every thread runs on the first CPU.
The same goes for gimp, gwenview and inkscape.
Other programs like chromium, okular and libreoffice run with “ff” (CPU affinity for all 8 cores) right after startup.

I have access to another machine with a fresh Leap 15 setup and a different CPU; it shows the same behavior.
I also tried a machine with 42.2. This one had no problems: everything was started with “f” affinity (this machine has only 4 cores).
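
The masks that taskset -p prints are hexadecimal bitmaps in which bit N stands for CPU N, so “1” means CPU 0 only, “f” means CPUs 0-3 and “ff” means CPUs 0-7. A small sketch that decodes such a mask; mask_to_cpus is a hypothetical helper, and it assumes the mask describes at most 63 CPUs so that it fits into the shell's integer arithmetic:

```shell
#!/bin/bash
# mask_to_cpus is a hypothetical helper that turns a hexadecimal affinity
# mask (as printed by "taskset -p") into a comma-separated list of CPU numbers.
mask_to_cpus() {
    local mask=$(( 0x$1 ))   # interpret the argument as hexadecimal
    local cpu=0 list=""
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            list="${list:+$list,}$cpu"   # append, with a comma if needed
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$list"
}

mask_to_cpus 1    # prints 0
mask_to_cpus f    # prints 0,1,2,3
mask_to_cpus ff   # prints 0,1,2,3,4,5,6,7
```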

I wrote a wrapper script that changes the CPU affinity after startup as a workaround. An interesting thing is, I had to add a little pause, because right after startup the CPU affinity is “ff” and then it changes to “1”.

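#!/bin/bash
# Start Blender in the background and remember its PID.
/usr/bin/blender "$@" &
pid=$!
taskset -p "$pid"    # right after startup the mask is still "ff"
sleep 1              # give the program time to change its own affinity
taskset -p "$pid"    # by now the mask has dropped to "1"
# Re-allow CPUs 0-7 for every thread (-a) of the process.
taskset -pca 0-7 "$pid"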
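
The same workaround can be written without the hard-coded 0-7 range and the fixed one-second pause. This is a sketch only, assuming taskset (util-linux), nproc and a sleep that accepts fractional seconds (GNU coreutils) are installed; the function names, the polling interval and the 5-second limit are made up for illustration:

```shell
#!/bin/bash
# Hypothetical helpers; names and timings are illustrative only.

# CPU list covering every installed processor, e.g. "0-7" on an 8-CPU machine.
all_cpus() {
    echo "0-$(( $(nproc --all) - 1 ))"
}

# Start a program, wait (at most about 5 s) until its affinity mask differs
# from the one it started with, then re-allow all CPUs for all of its threads.
run_unpinned() {
    local pid initial current tries=0
    "$@" &
    pid=$!
    initial=$(taskset -p "$pid")
    while [ "$tries" -lt 50 ]; do
        # taskset fails once the process has exited; stop polling then.
        current=$(taskset -p "$pid" 2>/dev/null) || break
        if [ "$current" != "$initial" ]; then
            break
        fi
        sleep 0.1
        tries=$(( tries + 1 ))
    done
    taskset -pca "$(all_cpus)" "$pid"
}
```

In a wrapper script this would be called as run_unpinned /usr/bin/blender "$@". Polling avoids guessing how long the program needs before it changes its own affinity, but it is still a workaround and does not address the cause.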

flymail · June 13, 2018, 10:28am

On 2018-06-13, alexsb <alexsb@no-mx.forums.microfocus.com> wrote:
> I have access to another machine with a fresh leap 15 setup and a
> different cpu, there is the same thing.
> I also tried a machine with 42.2. This one had no problems, everything
> was started with “f” affinity (this machine has only 4 cores)

Interesting. Perhaps the kernel change in Leap 15 comes with a different CPU scheduler. Unfortunately I don’t have a
Leap 15 box at hand to test. You may, however, find the documentation from SUSE12_SP3 helpful, although I’m not sure it will
solve this problem:

https://www.suse.com/documentation/slerte-12/singlehtml/art_slert_quickstart/art_slert_quickstart.html

techwinder · June 23, 2018, 2:50pm

Same issue here with the new Leap 15.
Is there a simpler fix than the one above, one that average users could implement?
Thanks,

tsu2 · June 23, 2018, 5:07pm

You may be barking up the wrong tree.
As I described earlier, AFAIK the affinity setting is typically set when compiled (and as you’ve described might be over-ridden).

Recommend you either inspect source of the current version of your app (Open Source projects allow you to inspect) or ask in that project’s community.

Only after you’ve verified that the distributed version doesn’t set affinity to default should you then consider other possibilities.
I can’t imagine a rationale for why a CPU would alter affinity settings for specific apps, although it is not impossible that the standard setting was changed (the result could then be what you’ve speculated: a whole class of apps, if not all apps, executing the wrong way).

IMO,
TSU

d3vnull · June 26, 2018, 9:57am

Since you mention a Xeon processor, will you please try:


dmesg|grep coretemp

If you are getting coretemp errors there is a bug report on it and an updated kernel. I have not installed the kernel, but I do notice this same characteristic even though I have 16 cores.

techwinder · June 28, 2018, 1:39pm

Thanks for the hint. It turns out that I was indeed barking up the wrong tree.
The app I try to run uses the OpenBLAS library, and its FAQ (Faq · OpenMathLib/OpenBLAS Wiki · GitHub) says:

If your application is already multi-threaded, it will conflict with OpenBLAS multi-threading. Thus, you must set OpenBLAS to use single thread

Fixed this by disabling OpenBLAS multi-threading.
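
For reference, OpenBLAS can be limited to a single thread at run time through the OPENBLAS_NUM_THREADS environment variable, without rebuilding the library. A minimal sketch; with_single_threaded_blas is a hypothetical helper and ./myapp a placeholder, and whether the fix was applied this way here is an assumption:

```shell
#!/bin/bash
# Hypothetical helper: run a command with OpenBLAS limited to one thread.
# The variable is set only for that command, not for the whole session.
with_single_threaded_blas() {
    OPENBLAS_NUM_THREADS=1 "$@"
}

# Example (./myapp is a placeholder for the real application):
# with_single_threaded_blas ./myapp
```

The OpenBLAS FAQ also describes an OPENBLAS_MAIN_FREE=1 setting that disables the library's thread affinity at run time. That may be worth trying when OpenBLAS should stay multi-threaded, but whether it helps in this case is untested.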

flymail · June 28, 2018, 2:42pm

On 2018-06-28, techwinder <techwinder@no-mx.forums.microfocus.com> wrote:
> Fixed this by disabling OpenBLAS multi-threading.

… which does rather defeat the point of using OpenBLAS. Its performance is superior to other linear algebra libraries
(e.g. eigen/ATLAS) and this is in large part due to its multithreading capabilities.

techwinder · June 28, 2018, 3:05pm

Agreed, but hey, what can you do?
In the present case, multi-threading the app (home development, Qt-based) outweighs the benefits of having OpenBLAS multithreaded.
Funny thing is that I’m pretty sure there was no such conflict with Leap 42.3.