Executable's speed jumping between low and high in 13.1

I have an executable that does math calculations and array operations, using about 1 GB of RAM. The executable is similar to the one described in https://forums.opensuse.org/showthread.php/484988-Executable-running-slower-on-opensuse-12-3

It appears that the execution speed takes one of two values, differing by a factor of three! The speed changes, perhaps depending on reboot, compilation, or user. There are no intermediate speeds: only the slow one and the quick one. The same is observed with the older libm 2.15, which I select using LD_PRELOAD.

The CPU’s frequency is fixed. The computer is not otherwise loaded.

I cannot figure out when the executable starts in slow mode and when in quick mode; it looks random.

I wonder what the reason is and how to proceed with diagnostics and correction.

On 2014-01-14, ZStefan <ZStefan@no-mx.forums.opensuse.org> wrote:
> I have an executable that does math calculations and array
> operations, using about 1 GB of RAM. The executable is similar to the
> one described in http://tinyurl.com/pgzwclo

To diagnose any optimisation issue, you’ll need to provide the code, and specify whether it’s compiled in a 32- or 64-
bit environment.

> It appears that the execution speed takes one of two values, differing
> by a factor of three! The speed changes, perhaps depending on reboot,
> compilation, or user. There are no intermediate speeds: only the slow
> one and the quick one. The same is observed with the older libm 2.15,
> which I select using LD_PRELOAD.

I can think of two likely causes of bistable benchmarks from the same compile:

  1. Stack alignment. Either the stack is aligned for a given execution or it isn’t. IIRC, GNU’s C/C++ compiler only
    guarantees a constant stack offset, not stack alignment. To guarantee alignment, you have to code it yourself inline.

  2. Data alignment. If you’re using (directly or indirectly) SIMD intrinsics, then your data is either aligned on 128-bit
    boundaries or it isn’t. If not, performance will be seriously degraded (although I’m told this is less of an issue
    with modern processors… but I don’t believe it). A sketch of explicitly aligned allocation follows below.
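
As a minimal sketch of the data-alignment point (an illustration only, using posix_memalign; nothing here is from ZStefan’s program):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
  void* p = 0;
  /* Ask for memory aligned on a 16-byte (128-bit) boundary, suitable
     for aligned SSE loads/stores; _mm_malloc or C++11 alignas would
     also do the job. */
  if (posix_memalign(&p, 16, 1 << 20) != 0) return 1;
  printf("address mod 16 = %u\n", (unsigned)((uintptr_t)p % 16)); /* always 0 */
  free(p);
  return 0;
}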

I can think of a few more exotic causes, but I wouldn’t like to guess without seeing the code.

I don’t want to bring the whole code here, but it is similar to the code in Post #20 of https://forums.opensuse.org/showthread.php/484988-Executable-running-slower-on-opensuse-12-3

There is a call to a function, executed in a loop. The function does math calculations similar to those in Post #20. Nothing fancy; just a computation with several math functions like sin, cos, and rand, plus arithmetic.

Somehow, in years past there was no need to program stack alignment or data alignment, and the program ran normally.

In openSUSE 12.3 there was a major bug in libm slowing down computation by a factor of 3. In openSUSE 13.1, I observe these jumps in speed. I have never seen such a thing before, on any computer. Three times is too much.

I compiled and ran the code on openSUSE 12.2. It runs normally every time. I don’t remember any slowdowns with several previous versions.

With openSUSE 13.1, on one computer I observe mostly normal running (99% of the time, though right after installation it was 50%). On another computer, I observe mostly slow running, 95% of the time.

Did somebody again screw up gcc, g++, glibc, the kernel, openSUSE, or something else, in an attempt to improve it?

I will try to figure out the conditions under which the program runs normally. It looks like there are more chances for a normal run fresh after a reboot, and when there is external storage connected.

I have checked the RAM with memtest. The OS is updated often. Swap is not used. This is a 64-bit computer running 64-bit openSUSE 13.1. gcc version 4.8.1 20130909 [gcc-4_8-branch revision 202388] (SUSE Linux). Kernel 3.11.6-4-desktop, 64-bit.

On 2014-01-15, ZStefan <ZStefan@no-mx.forums.opensuse.org> wrote:
>
> I don’t want to bring the whole code here, but it is similar to the code
> in Post #20 of http://tinyurl.com/pgzwclo

Without being able to compile your specific code myself, I can’t really test it and work out why you are seeing problems.

> There is a call to a function, executed in a loop. The function does
> math calculations similar to those in Post #20. Nothing fancy; just a
> computation with several math functions like sin, cos, and rand, plus
> arithmetic.

A loop calling a leaf function isn’t as efficient as incorporating the loop within the leaf function (a sketch of the
difference follows below), and I don’t think external looping improves branch prediction either. Anyway, varying
performance has little to do with branch prediction, unless the number of loop iterations is data- (or data-sign-)
dependent.
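
A minimal sketch of the loop-placement point (hypothetical functions, nothing to do with ZStefan’s program); the second variant pays the call overhead once instead of once per iteration and gives the optimiser a whole loop to work on:

double leaf(double x) { return x*x + 1.0; }

double loop_outside(double x, int n)
{
  for (int i = 0; i < n; i++) x = leaf(x);   /* n call/returns */
  return x;
}

double loop_inside(double x, int n)
{
  /* same computation with the loop inside: one call, no per-iteration
     call overhead */
  for (int i = 0; i < n; i++) x = x*x + 1.0;
  return x;
}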

> Somehow, in years past there was no need to program stack
> alignment or data alignment, and the program ran normally.

This is not true. In years past, if your data wasn’t aligned and SSEn instructions were executed, you’d get a
segmentation fault. Some compilers using intrinsics like to try and hide this from you, but in my opinion this approach
is counterproductive.

> In openSUSE 12.3 there was a major bug in libm slowing down computation
> by a factor of 3. In openSUSE 13.1, I observe these jumps in speed. I
> have never seen such a thing before, on any computer. Three times is
> too much.

I don’t think it’s ever been called a `bug’ because it gets the right answer. The libm modification has the benefit that
the result is less CPU-architecture-dependent, but at the cost of taking longer.

> I compiled and ran the code on openSUSE 12.2. It runs normally every time.
> I don’t remember any slowdowns with several previous versions.
> With openSUSE 13.1, on one computer I observe mostly normal running (99%
> of the time, though right after installation it was 50%). On another
> computer, I observe mostly slow running, 95% of the time.

If the same compiled executable gives bistable performance with constant background processes running, this can only
result from one or more of three issues:

  1. Data: different floating-point data can dramatically affect performance (e.g. with the introduction of Infs/NaNs).
  2. Data: some data addresses read within the executable are accessed more quickly than others, typically due to
    differences in address alignment (or address distances from related data); a sketch for checking this follows below.
  3. Data: data processing outside the executable (e.g. external processes, swap memory) can very easily have a variable
    impact on the performance of the binary machine instructions running inside the executable.

Since you found everything was `normal’ under 12.2, #1 is unlikely.
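
On point #2, a minimal sketch (an illustration only, not ZStefan’s code) for checking where an allocation happens to land; run-to-run differences in placement beyond malloc’s guaranteed alignment are exactly the kind of thing that can produce bistable timings:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
  double* a = (double*)malloc(1000 * sizeof(double));
  /* glibc's malloc on x86_64 guarantees 16-byte alignment, but where
     the block lands relative to cache-line (64-byte) or page boundaries
     can differ from run to run. */
  printf("a mod 16 = %u, a mod 64 = %u\n",
         (unsigned)((uintptr_t)a % 16),
         (unsigned)((uintptr_t)a % 64));
  free(a);
  return 0;
}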

> Did somebody again screwed up gcc, g++, glibc, kernel, opensuse or
> something else, in an attempt to improve?

It’s not the compiler or library - if something really were screwed up you’d never see a `good’ run. I don’t know enough
about the kernel or openSUSE to comment on whether it’s anything to do with them, but I suspect not.

> I will try to figure out conditions under which the program runs
> normally.

Again, please specify your compile architecture (e.g. 32- vs 64-bit), as this changes the shortlist of likely
causes.

> It looks like there are more chances for a normal run fresh after a reboot,
> and when there is external storage connected. I have checked the RAM.
> The OS is updated often. Swap is not used.

This comes back to the likely cause being data-addressing issues, as the memory burden changes after startup. I can’t
help you further without the code, though.

This is the simplified program. When I compile and run it on a computer, the speed of one iteration appears to be bistable, with durations varying by about 2.5 times. Sometimes the speed changes in the middle of a run, say, after the 10th iteration.


#include <iostream>
#include <string>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <sys/time.h>
using namespace std;

float do_calculation(double y)
{
 float   z1 = 33.33; 
 double  z2 = 400.077;
 double  z3;
 double  sum = 0.0;
 
 for(int i=0; i<1000000; i++)
 {
   z1 = sin(z1+z2+i) / (fabs(z1) + 4.6);
   z2 = cos(z1*i + y);
   z3 = atan2(z1+i, z1+z2-i/(i+30));
   z3 += z1/(i+4) + fabs(rand()*z1 + z2 - cos(i*3.06));
   z3 = exp(-fabs(z3 + 2.7*y + z1 - z2/5.7)); 
   sum += -y + z3/(fabs(z3) + 2.0 + i/(i+4.7));
 }
 return sum;
} 



int main() {
 const int ONE_MILLION = 1000000;
 const int LOOP  = 8;
 struct timeval  tv;        struct timezone tz;
 long   dtm;
 double starttm, endtm, how_long;
 double tmpsum = 0.0;

 for(;;) {
  gettimeofday(&tv, &tz);  dtm = tv.tv_sec; 
  gettimeofday(&tv, &tz); 
  starttm = (tv.tv_sec - dtm) + (1.0/ONE_MILLION)*tv.tv_usec;
 
  for(int i=0; i<LOOP; i++) tmpsum += cos(2.0*i) * do_calculation(2.2);
  tmpsum = (fabs(rand()*tmpsum) - 4000.66) / (fabs(tmpsum) + 0.006);

  gettimeofday(&tv, &tz); 
  endtm   = (tv.tv_sec - dtm) + (1.0/ONE_MILLION)*tv.tv_usec;              
  how_long = endtm - starttm; // = time one iteration takes, in seconds
  printf("Time of one iteration = %9.4lf s, computation result = %e
", how_long, tmpsum); 
 }
return 0;
}
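
A note on the timing itself: gettimeofday follows the wall clock, which NTP can step or slew. A minimal sketch of the same harness idea using the monotonic clock instead (clock_gettime with CLOCK_MONOTONIC; on older glibc you may need to link with -lrt):

#include <stdio.h>
#include <time.h>

static double now_seconds(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);  /* unaffected by clock adjustments */
  return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main()
{
  double t0 = now_seconds();
  /* ... the work to be timed goes here ... */
  double t1 = now_seconds();
  printf("elapsed = %9.4f s\n", t1 - t0);
  return 0;
}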

This looks like a bug, but I cannot figure out where. I have written many programs in C and C++ and have never observed such behavior in a compiled program. The computer is not loaded with any other processes.

This is a 64-bit computer running 64-bit openSUSE 13.1, updated.

On 2014-01-17, ZStefan <ZStefan@no-mx.forums.opensuse.org> wrote:
>
> This is the simplified program. When I compile and run it on a computer,
> the speed of one iteration appears to be bistable, with durations
> varying by about 2.5 times. Sometimes the speed changes in the middle of
> a run, say, after the 10th iteration.
>
<SNIP>
>
> This looks like a bug, but I cannot figure out where. I have written
> many programs in C and C++ and have never observed such behavior in
> a compiled program. The computer is not loaded with any other
> processes.
>
> This is a 64-bit computer running 64-bit openSUSE 13.1, updated.

Thank you for posting the code. This is the result of my first run (openSUSE 13.1 x86_64 KDE, fully updated):


sh-4.2$ g++ a.cpp
sh-4.2$ ./a.out
Time of one iteration =    2.6495 s, computation result = 9.366910e+08
Time of one iteration =    2.6481 s, computation result = 5.707878e+08
Time of one iteration =    2.6480 s, computation result = 4.978989e+08
Time of one iteration =    2.6479 s, computation result = 1.190977e+09
Time of one iteration =    2.6481 s, computation result = 9.298645e+08
Time of one iteration =    2.6478 s, computation result = 1.552897e+08
Time of one iteration =    2.6478 s, computation result = 1.710884e+09
Time of one iteration =    2.6477 s, computation result = 2.023878e+09
Time of one iteration =    2.6482 s, computation result = 1.185668e+09
Time of one iteration =    2.6480 s, computation result = 1.627396e+09
Time of one iteration =    2.6479 s, computation result = 1.285587e+09
Time of one iteration =    2.6479 s, computation result = 1.284789e+09
Time of one iteration =    2.6480 s, computation result = 5.410588e+07
Time of one iteration =    2.6480 s, computation result = 8.577717e+08
Time of one iteration =    2.6479 s, computation result = 1.333628e+09
Time of one iteration =    2.6480 s, computation result = 6.056444e+08
Time of one iteration =    2.6482 s, computation result = 1.130641e+09
Time of one iteration =    2.6480 s, computation result = 8.075440e+08
Time of one iteration =    2.6480 s, computation result = 1.071738e+09
Time of one iteration =    2.6481 s, computation result = 1.497799e+09
Time of one iteration =    2.6482 s, computation result = 6.832906e+08
Time of one iteration =    2.6480 s, computation result = 1.850542e+09
Time of one iteration =    2.6479 s, computation result = 2.099904e+09
Time of one iteration =    2.6479 s, computation result = 6.601369e+08
Time of one iteration =    2.6480 s, computation result = 1.654135e+09
Time of one iteration =    2.6482 s, computation result = 1.265377e+09
Time of one iteration =    2.6480 s, computation result = 7.491475e+08
Time of one iteration =    2.6481 s, computation result = 1.618072e+09
Time of one iteration =    2.6479 s, computation result = 9.804571e+08
Time of one iteration =    2.6482 s, computation result = 9.666316e+08
Time of one iteration =    2.6481 s, computation result = 1.076384e+09
Time of one iteration =    2.6480 s, computation result = 1.915146e+09
Time of one iteration =    2.6481 s, computation result = 1.325639e+09
Time of one iteration =    2.6479 s, computation result = 1.309158e+09
Time of one iteration =    2.6481 s, computation result = 1.040363e+09
Time of one iteration =    2.6480 s, computation result = 8.497754e+08
Time of one iteration =    2.6481 s, computation result = 2.008792e+09
Time of one iteration =    2.6480 s, computation result = 3.436796e+08
Time of one iteration =    2.6480 s, computation result = 4.997928e+08
Time of one iteration =    2.6477 s, computation result = 1.252841e+09
Time of one iteration =    2.6481 s, computation result = 1.206959e+09
Time of one iteration =    2.6478 s, computation result = 1.049436e+09
Time of one iteration =    2.6480 s, computation result = 9.855468e+07
Time of one iteration =    2.6482 s, computation result = 1.643193e+09
Time of one iteration =    2.6480 s, computation result = 1.217729e+09
Time of one iteration =    2.6482 s, computation result = 4.503854e+08
Time of one iteration =    2.6479 s, computation result = 1.326227e+09
Time of one iteration =    2.6479 s, computation result = 1.415164e+09
Time of one iteration =    2.6479 s, computation result = 9.095195e+08
Time of one iteration =    2.6477 s, computation result = 1.485605e+09
Time of one iteration =    2.6481 s, computation result = 4.506785e+08
Time of one iteration =    2.6480 s, computation result = 1.261918e+09
Time of one iteration =    2.6480 s, computation result = 2.026706e+09
Time of one iteration =    2.6481 s, computation result = 7.786600e+08
Time of one iteration =    2.6481 s, computation result = 1.524530e+09
Time of one iteration =    2.6476 s, computation result = 1.570667e+09
Time of one iteration =    2.6481 s, computation result = 1.206787e+09
Time of one iteration =    2.6478 s, computation result = 1.250838e+09
Time of one iteration =    2.6480 s, computation result = 2.095063e+09
Time of one iteration =    2.6481 s, computation result = 1.865386e+09
Time of one iteration =    2.6480 s, computation result = 3.281944e+08
Time of one iteration =    2.6480 s, computation result = 1.132857e+09
Time of one iteration =    2.6478 s, computation result = 6.180122e+08
Time of one iteration =    2.6479 s, computation result = 6.900676e+08
Time of one iteration =    2.6479 s, computation result = 1.615849e+09
Time of one iteration =    2.6481 s, computation result = 1.126033e+09
Time of one iteration =    2.6479 s, computation result = 2.139742e+09
Time of one iteration =    2.6478 s, computation result = 1.894477e+08
Time of one iteration =    2.6477 s, computation result = 1.631572e+09
Time of one iteration =    2.6479 s, computation result = 8.535685e+08
Time of one iteration =    2.6481 s, computation result = 1.273882e+09
Time of one iteration =    2.6482 s, computation result = 1.660796e+08
Time of one iteration =    2.6482 s, computation result = 1.038502e+09
Time of one iteration =    2.6482 s, computation result = 5.600697e+07
Time of one iteration =    2.6482 s, computation result = 1.619022e+09
Time of one iteration =    2.6481 s, computation result = 6.123897e+08
Time of one iteration =    2.6481 s, computation result = 8.961824e+08
Time of one iteration =    2.6483 s, computation result = 1.648120e+09
Time of one iteration =    2.6481 s, computation result = 1.074455e+09
Time of one iteration =    2.6481 s, computation result = 1.626155e+09
Time of one iteration =    2.6482 s, computation result = 1.550417e+09
Time of one iteration =    2.6483 s, computation result = 1.369261e+08
Time of one iteration =    2.6479 s, computation result = 2.025879e+09
Time of one iteration =    2.6481 s, computation result = 4.188721e+08
Time of one iteration =    2.6482 s, computation result = 1.167625e+09
Time of one iteration =    2.6483 s, computation result = 1.395522e+09
Time of one iteration =    2.6480 s, computation result = 1.273302e+09
Time of one iteration =    2.6481 s, computation result = 1.901946e+09
Time of one iteration =    2.6478 s, computation result = 9.852336e+08
Time of one iteration =    2.6482 s, computation result = 1.411876e+09
Time of one iteration =    2.6481 s, computation result = 1.194318e+09
Time of one iteration =    2.6479 s, computation result = 9.078349e+07
Time of one iteration =    2.6480 s, computation result = 3.560195e+08
Time of one iteration =    2.6480 s, computation result = 1.043311e+08
Time of one iteration =    2.6480 s, computation result = 8.975936e+08
Time of one iteration =    2.6482 s, computation result = 1.611007e+09
Time of one iteration =    2.6481 s, computation result = 1.073908e+09
Time of one iteration =    2.6480 s, computation result = 1.342637e+09
Time of one iteration =    2.6482 s, computation result = 8.293112e+08
Time of one iteration =    2.6482 s, computation result = 2.042515e+09
Time of one iteration =    2.6481 s, computation result = 1.620367e+09
Time of one iteration =    2.6481 s, computation result = 9.631723e+08
Time of one iteration =    2.6481 s, computation result = 1.870537e+09
Time of one iteration =    2.6480 s, computation result = 2.074187e+09
Time of one iteration =    2.6481 s, computation result = 9.977173e+08
Time of one iteration =    2.6482 s, computation result = 1.094236e+09
Time of one iteration =    2.6484 s, computation result = 2.101217e+09
Time of one iteration =    2.6481 s, computation result = 4.879628e+08
Time of one iteration =    2.6483 s, computation result = 1.372519e+08
Time of one iteration =    2.6480 s, computation result = 1.352022e+09
Time of one iteration =    2.6480 s, computation result = 1.427920e+09
Time of one iteration =    2.6480 s, computation result = 8.491647e+08
Time of one iteration =    2.6480 s, computation result = 1.062124e+09
Time of one iteration =    2.6480 s, computation result = 4.722387e+08
Time of one iteration =    2.6480 s, computation result = 9.563156e+08
Time of one iteration =    2.6481 s, computation result = 1.815737e+09
^C
sh-4.2$

As you can see, there’s very little variation in the time of each iteration (out of more than 100). Unfortunately I
cannot reproduce your problem despite trying the following:

  1. Running it as a different user.
  2. Rebooting repeatedly.
  3. Disabling/re-enabling swap.

At the very least, the results suggest this is unlikely to be a bug. Unfortunately this isn’t very helpful in
identifying the cause of your results. I suspect this is a tough one to solve. While we are running the same openSUSE
version, our configurations are likely to differ. It would be helpful if anyone else could try out your code to see
whether my configuration or yours is the odd one out. I would also be interested to see if you can reproduce the
problem running the code at runlevel 2.

136 iterations calculated; the times are:
min = 2.5894 s
max = 2.7090 s
Not really a large variation.
Calculated on an i3 @ 2.3 GHz.


PC: oS 13.1 x86_64 | i7-2600@3.40GHz | 16GB | KDE 4.11 | GTX 650 Ti
ThinkPad E320: oS 13.1 x86_64 | i3@2.30GHz | 8GB | KDE 4.11 | HD 3000
HTPC: oS 13.1 x86_64 | Celeron@1.8GHz | 2GB | Gnome 3.10 | HD 2500

Thank you for testing on your machines.

The code I posted is a cut-down emulation of a larger program. The large program and this small piece display the same behavior: when one runs slow, the other also runs slow.

Currently, I observe that the likely conditions for getting the bistable behavior are the following, though none is a decisive factor:

  • Boot after a long (more than a few minutes) shutdown. This is the main factor.
  • Upgrade.
  • Mounted external storage present.
  • Old computer, new openSUSE.
  • Running in runlevel 3.

The following have no effect on the bistable behavior (they leave the speed at one value, whatever it was):

  • Running as root or as a normal user.
  • Other processes loading other cores of the CPU.
  • Swap present or absent.
  • Compilation options (none, -O, -O2, -ffast-math).
  • Using the older math library libm-2.15.so.
  • Running an executable that was compiled on another 64-bit computer.

I am now thinking that this bistable behavior is not limited to my small program. Likely, most programs on openSUSE 13.1 are bistable. But I am not sure how to test this. Is there a simple package to test the CPU’s performance?

I think that the possible culprits might be:

  • Wrong throttling of the processor or parts of it. The frequency does not change, though.
  • The processor switching the execution thread between cores in wrong ways (a core-pinning sketch follows below).
  • Defective microcode.
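
To rule out the second culprit, one could pin the process to a single core. A minimal sketch (Linux-specific, using glibc’s sched_setaffinity; the command-line equivalent is taskset -c 0 ./a.out):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* for sched_setaffinity / CPU_SET */
#endif
#include <sched.h>
#include <stdio.h>

int main()
{
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(0, &set);                                    /* allow core 0 only */
  if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = this process */
    perror("sched_setaffinity");
    return 1;
  }
  /* ... run the benchmark loop here; the scheduler can no longer
     migrate the process between cores ... */
  return 0;
}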

The same bistable behavior is observed at runlevel 2.

On 2014-01-17, ZStefan <ZStefan@no-mx.forums.opensuse.org> wrote:
> I am now thinking that this bistable behavior is not limited to my small
> program. Likely, most programs on openSUSE 13.1 are bistable. But I am
> not sure how to test this. Is there a simple package to test the CPU’s
> performance?

OK. Let’s test your configuration with another benchmarking program, one that is independent of libm. Please compile
this C++ file and post the output (please excuse the irregular indenting, but NNTP posts are just rubbish at preserving
correct indentation!):


//---------------------------------------------------------------------------
#include <iostream>
#include <time.h>

//---------------------------------------------------------------------------
static const volatile double COS2PI5  =  0.30901699437494745;
static const volatile double SIN2PI5  =  0.95105651629515353;
static const volatile double COS4PI5  = -0.80901699437494734;
static const volatile double SIN4PI5  =  0.58778525229247325;

//---------------------------------------------------------------------------
using namespace std;

//---------------------------------------------------------------------------
void fft5(double* ro, double* io, double* ri, double* ii);

//---------------------------------------------------------------------------
int main(int argc, char* argv[])
{
  int i, j, n, N, M;
  M = 50;
  N = 50000000;
  n = 5;
  clock_t t0, t1;

  double* ri = new double[n];
  double* ii = new double[n];
  double* ro = new double[n];
  double* io = new double[n];

  for (i = 0; i < n; i++) {
    ri[i] = (double)i;
    ii[i] = (double)(n-i);
  }

  fft5(ro, io, ri, ii);

  cout << "Result: " << endl;
  for (i = 0; i < n; i++) {
    cout << i << ": (" << ro[i] << ", " << io[i] << ");" << endl;
  }

  for (j = 0; j < M; j++) {
    t0 = clock();
    for (i = N; i; i--) {
      fft5(ro, io, ri, ii);
    }
    t1 = clock();
    cout << j << ": number of clock cycles = " << t1-t0 << ";" << endl;
  }

  delete[] ri;
  delete[] ii;
  delete[] ro;
  delete[] io;
}

//---------------------------------------------------------------------------
void fft5(double* ro, double* io, double* ri, double* ii) {
  double sr1, sr2, si1, si2;
  double dr1, dr2, di1, di2;
  double ar1, ar2, ai1, ai2;
  double br1, br2, bi1, bi2;

  sr1 = ri[1] + ri[4]; si1 = ii[1] + ii[4];
  dr1 = ri[1] - ri[4]; di1 = ii[1] - ii[4];
  sr2 = ri[2] + ri[3]; si2 = ii[2] + ii[3];
  dr2 = ri[2] - ri[3]; di2 = ii[2] - ii[3];

  ar1 = ri[0] + sr1*COS2PI5 + sr2*COS4PI5;
  ai1 = ii[0] + si1*COS2PI5 + si2*COS4PI5;
  ar2 = ri[0] + sr1*COS4PI5 + sr2*COS2PI5;
  ai2 = ii[0] + si1*COS4PI5 + si2*COS2PI5;

  br1 = dr1*SIN2PI5 + dr2*SIN4PI5;
  bi1 = di1*SIN2PI5 + di2*SIN4PI5;
  br2 = dr1*SIN4PI5 - dr2*SIN2PI5;
  bi2 = di1*SIN4PI5 - di2*SIN2PI5;

  ro[0] = ri[0] + sr1 + sr2; io[0] = ii[0] + si1 + si2;
  ro[1] = ar1 + bi1;         io[1] = ai1 - br1;
  ro[2] = ar2 + bi2;         io[2] = ai2 - br2;
  ro[3] = ar2 - bi2;         io[3] = ai2 + br2;
  ro[4] = ar1 - bi1;         io[4] = ai1 + br1;
}

//---------------------------------------------------------------------------

//---------------------------------------------------------------------------

If your results are still bistable, it’s nothing to do with libm but with your computer configuration.

> I think that the possible culprits might be:
>
>   • Wrong throttling of the processor or parts of it. The frequency does
>     not change, though.
>   • The processor switching the execution thread between cores in wrong ways.
>   • Defective microcode.

… I don’t understand: what do you mean by `defective microcode’?


On 2014-01-17 21:48, flymail wrote:
> On 2014-01-17, ZStefan <> wrote:
>>

> It would be helpful if anyone else could try out your code to see whether my
> configuration or yours is the odd one out. I would also be interested to see if you can reproduce the problem
> you see running the code at runlevel 2.

We should all have the exact build line, so that we all compile it the same way.


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

I compiled and ran the code which does not use libm on two computers. The compilation command is the same: g++ -Wall -O2 code.C. On a powerful computer, the speed is stable; over several runs, the picture is the same. This computer began behaving 99% correctly (stable speed) in December:

Result:
0: (10, 15);
1: (0.940955, 5.94095);
2: (-1.6877, 3.3123);
3: (-3.3123, 1.6877);
4: (-5.94095, -0.940955);
0: number of clock cycles = 508502;
1: number of clock cycles = 453678;
2: number of clock cycles = 445733;
3: number of clock cycles = 445755;
4: number of clock cycles = 445716;
5: number of clock cycles = 445726;
6: number of clock cycles = 445741;
7: number of clock cycles = 445745;
8: number of clock cycles = 445716;
9: number of clock cycles = 445734;
10: number of clock cycles = 445715;
11: number of clock cycles = 445732;
12: number of clock cycles = 445743;
13: number of clock cycles = 445711;
14: number of clock cycles = 445723;
15: number of clock cycles = 445731;
16: number of clock cycles = 445732;
17: number of clock cycles = 445734;
18: number of clock cycles = 445733;
19: number of clock cycles = 445720;
20: number of clock cycles = 445725;
21: number of clock cycles = 445736;
22: number of clock cycles = 445716;
23: number of clock cycles = 445726;
24: number of clock cycles = 445732;
25: number of clock cycles = 445725;
26: number of clock cycles = 445722;
27: number of clock cycles = 445732;
28: number of clock cycles = 445730;
29: number of clock cycles = 445700;
30: number of clock cycles = 445735;
31: number of clock cycles = 445724;
32: number of clock cycles = 445729;
33: number of clock cycles = 445743;
34: number of clock cycles = 445754;
35: number of clock cycles = 445736;
36: number of clock cycles = 445736;
37: number of clock cycles = 445721;
38: number of clock cycles = 445722;
39: number of clock cycles = 445731;
40: number of clock cycles = 445726;
41: number of clock cycles = 445720;
42: number of clock cycles = 445730;
43: number of clock cycles = 445755;
44: number of clock cycles = 445709;
45: number of clock cycles = 445727;
46: number of clock cycles = 445731;
47: number of clock cycles = 445712;
48: number of clock cycles = 445735;
49: number of clock cycles = 445724;

On a weak computer, the speed is unstable, both within a run and from run to run. Here are the results of several runs:

Result:
0: (10, 15);
1: (0.940955, 5.94095);
2: (-1.6877, 3.3123);
3: (-3.3123, 1.6877);
4: (-5.94095, -0.940955);
0: number of clock cycles = 4687464;
1: number of clock cycles = 4686562;
2: number of clock cycles = 4691095;
3: number of clock cycles = 4689177;
4: number of clock cycles = 4683970;
5: number of clock cycles = 4683920;
6: number of clock cycles = 4692106;
7: number of clock cycles = 4675616;
8: number of clock cycles = 4675457;
9: number of clock cycles = 4680332;
10: number of clock cycles = 4673424;
11: number of clock cycles = 4672849;
12: number of clock cycles = 3823070;
13: number of clock cycles = 3775575;
14: number of clock cycles = 3778872;
15: number of clock cycles = 3773278;
16: number of clock cycles = 3772061;
17: number of clock cycles = 3777061;
18: number of clock cycles = 3774211;
19: number of clock cycles = 3771861;
20: number of clock cycles = 3773929;
21: number of clock cycles = 3770219;
22: number of clock cycles = 3774789;
23: number of clock cycles = 3772138;
24: number of clock cycles = 3772746;
25: number of clock cycles = 3775962;
26: number of clock cycles = 3770545;
27: number of clock cycles = 3774087;
28: number of clock cycles = 3795417;
29: number of clock cycles = 3775497;
30: number of clock cycles = 3770216;
31: number of clock cycles = 3774042;
32: number of clock cycles = 3772416;
33: number of clock cycles = 3774656;
34: number of clock cycles = 3776444;
35: number of clock cycles = 3769991;
36: number of clock cycles = 3773826;
37: number of clock cycles = 3770372;
38: number of clock cycles = 3779762;
39: number of clock cycles = 3773585;
40: number of clock cycles = 3771470;
41: number of clock cycles = 3777312;
42: number of clock cycles = 3769601;
43: number of clock cycles = 3774548;
44: number of clock cycles = 3774476;
45: number of clock cycles = 3795706;
46: number of clock cycles = 3771409;
47: number of clock cycles = 3772462;
48: number of clock cycles = 3771654;
49: number of clock cycles = 3774292;

Result:
0: (10, 15);
1: (0.940955, 5.94095);
2: (-1.6877, 3.3123);
3: (-3.3123, 1.6877);
4: (-5.94095, -0.940955);
0: number of clock cycles = 3776052;
1: number of clock cycles = 3773844;
2: number of clock cycles = 3769828;
3: number of clock cycles = 3774982;
4: number of clock cycles = 3772525;
5: number of clock cycles = 3859246;
6: number of clock cycles = 3817527;
7: number of clock cycles = 3771042;
8: number of clock cycles = 3780128;
9: number of clock cycles = 3784646;
10: number of clock cycles = 3789066;
11: number of clock cycles = 3788524;
12: number of clock cycles = 3773102;
13: number of clock cycles = 3773818;
14: number of clock cycles = 3769632;
15: number of clock cycles = 3773451;
16: number of clock cycles = 3771665;
17: number of clock cycles = 3771029;
18: number of clock cycles = 3774964;
19: number of clock cycles = 3770120;
20: number of clock cycles = 3776895;
21: number of clock cycles = 3769220;
22: number of clock cycles = 3773796;
23: number of clock cycles = 3769357;
24: number of clock cycles = 3773784;
25: number of clock cycles = 3784030;
26: number of clock cycles = 3779311;
27: number of clock cycles = 3773114;
28: number of clock cycles = 3824498;
29: number of clock cycles = 3844043;
30: number of clock cycles = 3786189;
31: number of clock cycles = 3774377;
32: number of clock cycles = 3850162;
33: number of clock cycles = 3776456;
34: number of clock cycles = 3773919;
^C
(Aborted by me).

Result:
0: (10, 15);
1: (0.940955, 5.94095);
2: (-1.6877, 3.3123);
3: (-3.3123, 1.6877);
4: (-5.94095, -0.940955);
0: number of clock cycles = 3778007;
1: number of clock cycles = 3774864;
2: number of clock cycles = 3770265;
3: number of clock cycles = 3774364;
4: number of clock cycles = 3785002;
5: number of clock cycles = 3879270;
6: number of clock cycles = 3834924;
7: number of clock cycles = 3790309;
8: number of clock cycles = 3925327;
9: number of clock cycles = 9767283;
10: number of clock cycles = 9023985;
11: number of clock cycles = 9018802;
12: number of clock cycles = 8594617;
13: number of clock cycles = 8594863;
14: number of clock cycles = 8771053;
15: number of clock cycles = 8593567;
16: number of clock cycles = 8691937;
17: number of clock cycles = 8573534;
18: number of clock cycles = 8616794;
19: number of clock cycles = 8582255;
20: number of clock cycles = 8584782;
21: number of clock cycles = 8581797;
22: number of clock cycles = 8581370;
23: number of clock cycles = 8579252;
24: number of clock cycles = 8585953;
25: number of clock cycles = 8598985;
26: number of clock cycles = 8602138;
27: number of clock cycles = 8583094;
28: number of clock cycles = 8665441;
29: number of clock cycles = 8599922;
30: number of clock cycles = 8583363;
31: number of clock cycles = 8584255;
32: number of clock cycles = 8591014;
33: number of clock cycles = 8630006;
34: number of clock cycles = 8577533;
35: number of clock cycles = 8588553;
36: number of clock cycles = 8587103;
37: number of clock cycles = 8588094;
38: number of clock cycles = 8585281;
39: number of clock cycles = 8589022;
40: number of clock cycles = 8618443;
41: number of clock cycles = 8579892;
42: number of clock cycles = 8580931;
43: number of clock cycles = 8581267;
44: number of clock cycles = 8590375;
45: number of clock cycles = 8623978;
46: number of clock cycles = 8586611;
47: number of clock cycles = 8610163;
48: number of clock cycles = 8596857;
49: number of clock cycles = 8576197;

Result:
0: (10, 15);
1: (0.940955, 5.94095);
2: (-1.6877, 3.3123);
3: (-3.3123, 1.6877);
4: (-5.94095, -0.940955);
0: number of clock cycles = 8635552;
1: number of clock cycles = 8588276;
2: number of clock cycles = 8615518;
3: number of clock cycles = 8789013;
4: number of clock cycles = 8587630;
5: number of clock cycles = 8576961;
6: number of clock cycles = 8632155;
7: number of clock cycles = 8604124;
8: number of clock cycles = 8587669;
9: number of clock cycles = 8663184;
10: number of clock cycles = 8583841;
11: number of clock cycles = 8664453;
12: number of clock cycles = 8583779;
13: number of clock cycles = 8571368;
14: number of clock cycles = 8595658;
15: number of clock cycles = 8579887;
16: number of clock cycles = 8641144;
17: number of clock cycles = 8615109;
18: number of clock cycles = 8582671;
19: number of clock cycles = 8588623;
20: number of clock cycles = 8580934;
21: number of clock cycles = 8584653;
22: number of clock cycles = 8585117;
23: number of clock cycles = 8579035;
24: number of clock cycles = 8617971;
25: number of clock cycles = 8584675;
26: number of clock cycles = 8586880;
27: number of clock cycles = 8582742;
28: number of clock cycles = 8584673;
29: number of clock cycles = 8586307;
30: number of clock cycles = 8695514;
31: number of clock cycles = 8821997;
32: number of clock cycles = 8575760;
33: number of clock cycles = 8582446;
34: number of clock cycles = 8589007;
35: number of clock cycles = 8591747;
36: number of clock cycles = 8579118;
37: number of clock cycles = 8575939;
38: number of clock cycles = 8637496;
39: number of clock cycles = 8583684;
40: number of clock cycles = 8582790;
41: number of clock cycles = 8581654;
42: number of clock cycles = 8584947;
43: number of clock cycles = 8580994;
44: number of clock cycles = 8588341;
45: number of clock cycles = 8607097;
46: number of clock cycles = 8591201;
47: number of clock cycles = 8582559;
48: number of clock cycles = 8583620;
49: number of clock cycles = 8587400;

If I understand correctly, microcode is some sort of helper driver of the processor. It may execute the compiled code in different ways if it is defective.

uname -a

Linux computer 3.11.6-4-desktop #1 SMP PREEMPT Wed Oct 30 18:04:56 UTC 2013 (e6d4a27) x86_64 x86_64 x86_64 GNU/Linux

gcc -v

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/4.8/lto-wrapper
Target: x86_64-suse-linux
Configured with: …/configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java,ada --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.8 --enable-ssp --disable-libssp --disable-plugin --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion=‘SUSE Linux’ --disable-libgcj --disable-libmudflap --with-slibdir=/lib64 --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-version-specific-runtime-libs --enable-linker-build-id --program-suffix=-4.8 --enable-linux-futex --without-system-libunwind --with-arch-32=i586 --with-tune=generic --build=x86_64-suse-linux
Thread model: posix
gcc version 4.8.1 20130909 [gcc-4_8-branch revision 202388] (SUSE Linux)

On 2014-01-18, ZStefan <ZStefan@no-mx.forums.opensuse.org> wrote:
> I compiled and ran the code which does not use libm on two computers.
> The compilation command is the same: g++ -Wall -O2 code.C. On a
> powerful computer, the speed is stable; over several runs, the picture
> is the same. This computer began behaving 99% correctly (stable speed)
> in December:

<SNIP>

> On a weak computer, the speed is unstable, both within a run and from
> run to run. Here are the results of several runs:

Thank you. For your information, my code gives very little variation in performance within and across runs on the 4
openSUSE 13.1 machines I’ve tested it on (1 is 32-bit, the other 3 are 64-bit). I think your trials have established
two things (indicating that the cause is likely to be #3 in the list from one of my previous posts, i.e. external
influences):

  1. Your problem is not related to libm.
  2. There is something very wrong with your weak computer or its configuration. The results from this machine certainly
    seem to identify it as the odd one out.

Since my code doesn’t use any exotic libraries, the problem is nothing to do with gcc/g++ or any of its headers. To
establish this beyond doubt, could you test your CPU maths using a precompiled benchmarking program to confirm the
bistable results?

As to why this is happening: that’s probably beyond me, and I’d be guessing, to be honest. It is unlikely to be a
programming issue, and more likely to do with the idiosyncrasies of your weak machine and its relationship with the
kernel: e.g. your machine may not like kernel 3.11. It is possible to test this hypothesis using a different
distribution that uses the same kernel (e.g. Mint Petra), but I suspect other people may have cleverer ideas.

On 2014-01-18 09:26, ZStefan wrote:
>
> robin_listas;2617000 Wrote:
>>
>> We should all have the exact build line, so that we all compile it the same way.
>>
>
> uname -a
>
> Linux computer 3.11.6-4-desktop #1 SMP PREEMPT Wed Oct 30 18:04:56 UTC
> 2013 (e6d4a27) x86_64 x86_64 x86_64 GNU/Linux
>
>
> gcc -v
>

No, no, I mean the line to compile your code.


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

On 2014-01-18, Carlos E. R. <robin_listas@no-mx.forums.opensuse.org> wrote:
> No, no, I mean the line to compile your code.

Doesn’t ZStefan’s post (#12) say…


g++ -Wall -O2 code.C

… or am I missing something?

On 2014-01-18 09:16, ZStefan wrote:
>
> flymail;2616993 Wrote:
>>
>> … I don’t understand: what do you mean by `defective microcode’?
>>
>
> If I understand correctly, microcode is some sort of helper driver of
> the processor. It may execute the compiled code in different ways if it
> is defective.

The idea is to recode how CPU instructions work. You can reprogram them to
some extent, so that you can cure some CPU bugs.
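
As a minimal sketch (assuming an x86 Linux where /proc/cpuinfo exposes a "microcode" field), one can at least read the revision the kernel has loaded and compare it between the two machines:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
  std::ifstream cpuinfo("/proc/cpuinfo");
  std::string line;
  while (std::getline(cpuinfo, line)) {
    // Print the microcode revision reported for the first CPU and stop.
    if (line.compare(0, 9, "microcode") == 0) {
      std::cout << line << std::endl;
      break;
    }
  }
  return 0;
}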


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)