Number of cores, "SUSE"

Hi!
First, a question about the number of cores/threads and the hysteria around them:
http://jodo.nu/pic5/DSC00765.JPG
There are a lot of blogs testing new hardware. But I also have a question and would like to check with the forum if I…

Output of top on the sending side. My old server in the basement, an AMD Athlon II X4 630 (4 cores/threads), never exceeds a load of 4.0 except when I am running a backup (I have a 1 Gbit LAN):

top - 15:05:22 up 1 day,  1:46,  2 users,  load average: 4,45, 3,96, 3,20
Tasks: 192 total,   1 running, 191 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1,1 us,  4,8 sy,  0,6 ni, 23,8 id, 63,7 wa,  0,0 hi,  6,0 si,  0,0 st
KiB Mem:  12045016 total, 11772764 used,   272252 free,     3188 buffers
KiB Swap:  2103292 total,        0 used,  2103292 free.  8628420 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND    
 3337 root      20   0 4291420 2,668g 2,620g S 12,25 23,22 186:03.34 VBoxHeadl+ 
20133 root      20   0  259948   7216   5248 S 4,636 0,060   2:27.66 iftop      
 1862 root      20   0       0      0      0 D 3,311 0,000   2:32.24 nfsd       
 1864 root      20   0       0      0      0 D 3,311 0,000   3:32.81 nfsd       
 1861 root      20   0       0      0      0 D 2,980 0,000   2:30.24 nfsd       
 1863 root      20   0       0      0      0 D 2,318 0,000   2:48.05 nfsd       
20968 root      20   0   15360   2616   2072 R 0,662 0,022   0:00.37 top

On my main PC (AMD A8-3850 APU, 4 cores/threads), which is the receiving side, writing to an eSATA mechanical disk:

top - 15:02:39 up 1:59, 3 users, load average: 3,63, 3,70, 3,32
Tasks: 263 total, 3 running, 259 sleeping, 0 stopped, 1 zombie
%Cpu(s): 17,9 us, 9,7 sy, 0,0 ni, 38,5 id, 25,5 wa, 0,0 hi, 8,5 si, 0,0 st
KiB Mem: 11782244 total, 11630944 used, 151300 free, 39952 buffers
KiB Swap: 2104316 total, 0 used, 2104316 free. 10254940 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11752 xxxxx 20 0 39092 16096 1476 S 48,84 0,137 21:24.93 rsync
11731 xxxxx 20 0 39456 18016 3200 D 45,51 0,153 19:51.63 rsync
12861 root 0 -20 0 0 0 S 30,90 0,000 1:40.91 kworker/3+
43 root 20 0 0 0 0 S 3,987 0,000 1:37.93 kswapd0
12777 root 20 0 0 0 0 D 2,658 0,000 0:05.53 kworker/u+
9205 root 20 0 0 0 0 R 1,993 0,000 0:49.98 jbd2/sdf1+

The question is: with 4 cores, as long as the load does not go over 4.0, am I still in the game? I would like to buy a Porsche, but do I need to?

ROI, I attach a picture from 2002 from the Middle East. The computer is in my private workshop at home, running 32-bit TW.

regards

On 2017-09-05, jonte1 <jonte1@no-mx.forums.microfocus.com> wrote:
> <SNIP>
> Code:
> --------------------
> top - 15:05:22 up 1 day, 1:46, 2 users, load average: 4,45, 3,96, 3,20
> Tasks: 192 total, 1 running, 191 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 1,1 us, 4,8 sy, 0,6 ni, 23,8 id, 63,7 wa, 0,0 hi, 6,0 si, 0,0 st
> KiB Mem: 12045016 total, 11772764 used, 272252 free, 3188 buffers
> KiB Swap: 2103292 total, 0 used, 2103292 free. 8628420 cached Mem
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 3337 root 20 0 4291420 2,668g 2,620g S 12,25 23,22 186:03.34 VBoxHeadl+
> 20133 root 20 0 259948 7216 5248 S 4,636 0,060 2:27.66 iftop
> 1862 root 20 0 0 0 0 D 3,311 0,000 2:32.24 nfsd
> 1864 root 20 0 0 0 0 D 3,311 0,000 3:32.81 nfsd
> 1861 root 20 0 0 0 0 D 2,980 0,000 2:30.24 nfsd
> 1863 root 20 0 0 0 0 D 2,318 0,000 2:48.05 nfsd
> 20968 root 20 0 15360 2616 2072 R 0,662 0,022 0:00.37 top
> --------------------
>
> <SNIP>
>
> The question is, - 4 cores and when the load is not step over 4.0 I’m
> still in the game? I would like to buy a Porsche but do I need to?

I don’t understand your question. If you want to see the load per core in `top`, you have to press 1. There’s no
point increasing your core count if your speed-critical program is not explicitly coded to be CPU multithreaded:
it will still use just a single thread however many logical threads you have available. The correspondence
between logical threads and cores is not necessarily 1:1 (as it is with AMD cores), since Intel chips with
Hyper-Threading run two threads per core.
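To check how logical threads map onto physical cores on a given machine, something like the following Python sketch could be used. It assumes the Linux x86 /proc/cpuinfo layout (blank-line separated blocks with `processor`, `physical id` and `core id` fields); the function name `count_cpus` and the sample text are made up for illustration.

```python
def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    physical_id = core_id = None
    # The extra "" makes sure the last block is closed like the others.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one logical-CPU block.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
    return logical, len(cores)


# Hypothetical sample: two logical CPUs that share one physical core.
SAMPLE = (
    "processor\t: 0\nphysical id\t: 0\ncore id\t\t: 0\n\n"
    "processor\t: 1\nphysical id\t: 0\ncore id\t\t: 0\n"
)

if __name__ == "__main__":
    logical, physical = count_cpus(SAMPLE)
    print(f"{logical} logical CPUs on {physical} physical core(s)")
```

On a real system the text would come from `open("/proc/cpuinfo").read()`. If both numbers are equal, the mapping is 1:1.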

However, execution threads need not correspond with the theoretical logical maximum. Some programmers even subscribe to
the belief that the number of execution threads should be the theoretical maximum plus two. In my opinion that is
rubbish: there is little or no benefit in exceeding the theoretical maximum, and more often than not doing so
harms performance.

Bear in mind that even for multithreaded code, doubling the maximum thread count rarely halves the execution time (with
the notable exception of going from 1 to 2 threads). It depends on the operation and the total thread count, since there
are diminishing returns on increasing numbers of threads, and within the range of CPUs used on desktop PCs you’re doing
well if the execution time is reduced by ~30% per doubling of the maximum thread count.
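The diminishing returns can be illustrated with Amdahl’s law, a standard textbook model rather than anything measured in this thread. In the Python sketch below the parallel fraction of 0.6 is purely an assumed value, chosen because it gives a 30% reduction for the first doubling. The function names are made up for illustration.

```python
def amdahl_time(parallel_fraction, threads):
    """Relative run time (1.0 = one thread) under Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction / threads


def reduction_per_doubling(parallel_fraction, threads):
    """Fraction of run time saved when going from threads to 2 * threads."""
    before = amdahl_time(parallel_fraction, threads)
    after = amdahl_time(parallel_fraction, 2 * threads)
    return 1.0 - after / before


if __name__ == "__main__":
    p = 0.6  # assumed parallel fraction, not a measured value
    for n in (1, 2, 4, 8):
        saved = reduction_per_doubling(p, n) * 100
        print(f"{n:2d} -> {2 * n:2d} threads: run time reduced by {saved:.1f}%")
```

Each further doubling saves less than the one before, because the serial part of the program does not shrink.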

Today’s Intel and AMD x86/x64 processors are SMP processors.

This generally means that no matter whether your app is written multi-threaded or not, execution will be spread across all processors (and cores) according to the processors’ own internal instructions.

But, if your application is specifically written to be multi-threaded, then more complex calculations can be performed faster by improved parallelism.

What this generally means is that you’re probably not going to get desired results unless your benchmarking app (or other software) over-rides what is done automatically and does things like assigning processor affinities.

Note that this is different than for example what I’ve observed on ARM processors which is one major reason why every app might run full screen by default. To support screen splitting (apps running side by side on the screen) and multi-tasking, each app is running in its own core.

And, if you’re talking about GPU processing and rendering, then that’s another architecture, where you have a layer of virtual cores on top of your real, physical cores. This vastly increases the potential for parallelism compared to CPUs, which still rely on SIMD (Intel), i.e. specialized instruction sets for certain high-performance tasks.

So, if you’re going to benchmark, you’ll first need to understand what you’re testing (unless you’re only interested in default settings and bottom-line numbers).

HTH,
TSU

On 2017-09-06, tsu2 <tsu2@no-mx.forums.microfocus.com> wrote:
>
> Today’s Intel and AMD x86/x64 processors are SMP processors.
>
> This generally means that no matter whether your app is written
> multi-threaded or not, execution will be across all processors (and
> cores) according to the processors’s own internal instructions.

Ahh… if only that were true, it would save me a lot of work!

In practice, without explicit multithreading code (e.g. using std::thread or pthreads), at best a single executable is
distributed across two threads under any non-trivial CPU load. The problem is that the fastest caches (L1
and L2) cannot be shared across cores, and as a result SMP rarely yields more than a 1.2X-1.4X increase in speed, whatever
the core count. In my opinion, therefore, the SMP environment is ideal for running different programs at the same time
but a poor substitute for multithreading.

A couple of things:

  • When writing posts, please be clear about what your issue is.
  • It’s openSUSE. SUSE is a different beast.
  • On 64-bit systems, run 64-bit openSUSE. The 32-bit TW is still there, but fewer and fewer people are using it. Count on it disappearing.

True, one of the factors affecting execution is the L1 and L2 caches, and whether they are available to specific cores or to all cores of a processor. IIRC the latest AMD processor is supposed to do something revolutionary to improve this.

TSU

Good general advice, but this thread is more of a hardware topic, not specific to software at any level except at lowest levels.

TSU

What I read from the two “top” outputs is that the CPUs are not fully loaded ("id"le values of 23.8% and 38.5%) and the system is waiting for I/O to complete ("wa"it values of 63.7% and 25.5%).
IMHO more CPU cores do not make a lot of sense with this workload. Everything else needs further investigation into the bottlenecks and more info about the workload(s).
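The same reading can be done mechanically. The Python sketch below parses the two `%Cpu(s)` lines posted above (with their comma decimals) and splits each into busy, I/O wait and idle. The function names are made up for illustration. Note also that on Linux the load average counts tasks in uninterruptible sleep (state D, like the nfsd threads above), so a load near 4 does not by itself mean four busy cores.

```python
import re


def parse_cpu_line(line):
    """Parse a top '%Cpu(s):' summary line into {field: percent}."""
    # Accept both "63,7 wa" and "63.7 wa", depending on the locale.
    pairs = re.findall(r"(\d+[.,]\d+)\s+([a-z]{2})\b", line)
    return {name: float(value.replace(",", ".")) for value, name in pairs}


def summarize(line):
    """Return (busy, io_wait, idle) percentages, rounded to one decimal."""
    cpu = parse_cpu_line(line)
    busy = 100.0 - cpu["id"] - cpu["wa"]
    return round(busy, 1), cpu["wa"], cpu["id"]


# The two summary lines as posted in this thread.
SERVER = "%Cpu(s):  1,1 us,  4,8 sy,  0,6 ni, 23,8 id, 63,7 wa,  0,0 hi,  6,0 si,  0,0 st"
MAIN_PC = "%Cpu(s): 17,9 us, 9,7 sy, 0,0 ni, 38,5 id, 25,5 wa, 0,0 hi, 8,5 si, 0,0 st"

if __name__ == "__main__":
    for name, line in (("server", SERVER), ("main pc", MAIN_PC)):
        busy, wait, idle = summarize(line)
        print(f"{name}: busy {busy}%, waiting for I/O {wait}%, idle {idle}%")
```

In both cases the time spent actually computing is well below what four cores could deliver.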

Hendrik

It isn’t. It is General Chitchat. Everybody can say what (s)he wants (as long as it abides by the T&C). Nobody has to take anything here seriously. lol!

On 2017-09-06, tsu2 <tsu2@no-mx.forums.microfocus.com> wrote:
> True, one of the factors affecting execution is are the L1 and L2
> caches, and then if available to specific or all cores of a processor.

L1 and L2 caches are core-confined, whereas L3 is pooled across cores. My experience with prefetching data to L3 caches is
very limited, but the L3 latency is so poor that, from the little testing I’ve done, I’ve found prefetches to L3 largely
homeopathic for performance.

> IIRC the latest AMD processor is supposed to do something revolutionary
> to improve this.

That sounds exciting! Do you have a reference for this?

Since I don’t work with this kind of info regularly, I can’t remember the exact references I read.

But, you can start with the Wikipedia entry for the AMD Zen micro-architecture.
https://en.wikipedia.org/wiki/Zen_(microarchitecture)

And the stuff I’m describing about shared L2 and L3 caches might be related to the third listed feature, the CCX. Now, I can’t remember for sure whether the L1 cache is core-specific only (that would make sense), but still… a shared L1 might take a performance hit, though the hit could be minimized by physical engineering, for example a layered manufacturing process that gives each core a very short physical path to any of the 4 L1 caches in a 4-core cluster.

TSU

On 2017-09-07, tsu2 <tsu2@no-mx.forums.microfocus.com> wrote:
> But, you can start with the Wikipedia entry for the AMD Zen
> micro-architecture.
> https://en.wikipedia.org/wiki/Zen_(microarchitecture)

Thanks Tsu. Very interesting.

> Now, I can’t remember for
> sure if the L1 cache is core-specific only (would make sense) but
> still…

Looking at the article, it appears the L1 and L2 caches are core-specific (32KiB data L1 and 512KiB L2 per core), but the
SMT design accommodates two threads per core, just like Intel CPUs.

> shared L1 might take a performance hit but could also be
> physically engineered to be minimized through a layered manufacturing
> process so cores could have a very short physical path to any of the 4
> L1 cache in a 4 core cluster.

The CCX seems to share only the L3 cache (8MiB per quad-core). I can’t tell from the article whether L1 cache
reads/writes between cores within a CCX cluster have reduced latencies compared to those across different CCX clusters.
If they do, it’s an interesting strategy and I wonder whether it improves real-life performance.

It’s good to see AMD introducing AVX2 instructions to Zen; it’s just a shame they take twice as long as their Intel
counterparts, although I suspect this is less critical for performance than caching.

I’m curious. I have an i7 with 8 cores, but it runs at 3000 to 3200 MHz. It cost quite a bit compared to other chips. Without going into a discussion of L# caches: in practical terms, for normal openSUSE desktop things, is it faster to have more cores, or fewer cores and a higher clock speed? Would a 4-core processor that runs at 3500 to 3700 MHz be faster at the normal things, like starting LibreOffice or Firefox?

Bart

You can always test and see.

It depends on the application. If the application is single-threaded, then it only uses one core. If there is significant multi-threading, then it can use several cores at the same time.

I would guess that Firefox and LibreOffice are multi-threaded. But even if that’s right, it doesn’t tell you how well that works for the particular way you are using them. So maybe do some testing.
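One quick check, sketched here in Python under the assumption of a Linux system, is the `Threads:` field in /proc/<pid>/status. The function names and the sample text are made up for illustration.

```python
def thread_count(status_text):
    """Return the value of the 'Threads:' field, or None if it is missing."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1])
    return None


def threads_of(pid):
    """Read the thread count of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return thread_count(status_file.read())


# Hypothetical excerpt of a /proc/<pid>/status file.
SAMPLE = "Name:\tsoffice.bin\nState:\tS (sleeping)\nThreads:\t12\n"

if __name__ == "__main__":
    print("threads in sample:", thread_count(SAMPLE))
```

For a live process, call `threads_of(pid)` with a PID taken from top. A thread count above one only shows that the program creates several threads, not that they keep several cores busy at once, so the per-core view in top (press 1) is still needed while the program is working.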

Hi
Depends on what you’re doing… for me, I’m not worried about GHz, but about cores when building packages… I want one of these AMD Ryzen Threadripper 1950X (16-core/32-thread)

To run a program it is better to have an M.2 SSD.