I have a number of benchmarks that are often used in reviews of new Raspberry Pis and have also been run on everything from DOS to Windows 10, Linux distros, including OpenSUSE, and Android, many with 32 bit and 64 bit compilations. I am converting these to run on the Raspberry Pi 3 at 64 bits via SUSE and OpenSUSE. I won’t bore you with details (unless you want me to) but these can be found at:
I have encountered a problem that prevents realistic performance from being demonstrated, apparently caused by the CPU clock alternating between 1200 and 600 MHz, even when idling. I monitored MHz and CPU temperature via the watch command, with 1 second sampling. One way of avoiding the MHz variation was to change the watch sampling rate to 0.1 seconds - why?
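For anyone who wants to watch the same thing without the watch command, the frequency and temperature can be polled directly from sysfs. A minimal sketch, assuming the standard cpufreq and thermal interfaces are present (the paths can differ or be absent on some kernels and distributions):

```c
/* Sketch only: poll CPU frequency and temperature from sysfs, assuming
   the standard cpufreq and thermal interfaces (paths can differ or be
   absent on some kernels). Build with gcc -O2 pollfreq.c -o pollfreq */
#include <stdio.h>
#include <unistd.h>

static long read_value(const char *path)   /* returns -1 if unreadable */
{
    FILE *f = fopen(path, "r");
    long v = -1;
    if (f) {
        if (fscanf(f, "%ld", &v) != 1) v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    for (;;) {
        long khz  = read_value(
            "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
        long temp = read_value(
            "/sys/class/thermal/thermal_zone0/temp");  /* millidegrees C */
        printf("cpu0 %4ld MHz  temp %5.1f C\n", khz / 1000, temp / 1000.0);
        usleep(100000);    /* 0.1 second sampling, as with watch -n 0.1 */
    }
    return 0;
}
```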
As established in the above RPi topic, a workaround is to change force_turbo=0 to force_turbo=1 in the boot config.txt (Turbo mode: 0 = enable dynamic freq/voltage, 1 = always max). Always running at 1200 MHz is good for benchmarking but not for real work. Using Raspbian, the CPU normally runs at 1200 MHz when executing CPU bound programs and 600 MHz when idling.
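For reference, the setting is just one line in the Raspbian boot configuration (the file may live elsewhere, or not apply at all, on other distributions):

```
# /boot/config.txt on Raspbian
# Turbo mode: 0 = dynamic frequency/voltage (default), 1 = always run at maximum
force_turbo=1
```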
Another issue is to find out what happens when I run my MP stress tests. These cause the RPi 3 to overheat, but the CPU MHz is throttled to minimise the effects. This clock speed is identified using the command vcgencmd measure_clock arm, with variable steps, but that does not appear to be available for SUSE. How is this overheating avoided with SUSE - running continuously at 600 MHz?
Thanks, that works but, after setting performance, all cores run permanently at 1200 MHz. With OpenSUSE 11.3, on a PC, normal operation is mainly at a low frequency when idle (e.g. 800 MHz), then at full speed (like 3000 MHz) when a benchmark is running, but only on the core being used, then back to the lower MHz when finished. The RPi 3 with Raspbian works in the same way, at 600 or 1200 MHz, except all cores run at the same frequency. Reminder - normal operation with the RPi 3 and SUSE is that it switches to the lower frequency when a CPU bound program is running.
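For anyone following along, one common way of selecting a governor is to write its name to the cpufreq sysfs files, which is what tools such as cpupower or cpufreq-set do. A minimal sketch (run as root; on the RPi 3 the four cores normally share one frequency policy, so setting cpu0 is usually enough):

```c
/* Sketch only: select a cpufreq governor by writing its name to sysfs
   (run as root). Tools such as cpupower or cpufreq-set do the same
   thing. On the RPi 3 the cores normally share one policy, so setting
   cpu0 is usually enough, but all four are written for completeness.
   Usage: ./setgov performance   or   ./setgov ondemand */
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *gov = (argc > 1) ? argv[1] : "performance";
    char path[96];
    for (int cpu = 0; cpu < 4; cpu++) {
        snprintf(path, sizeof path,
            "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); continue; }
        fprintf(f, "%s\n", gov);
        fclose(f);
    }
    return 0;
}
```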
I have compiled my RPi floating point and integer CPU stress tests, at 64 bits for OpenSUSE, and run them on my RPi 3. The purpose was to see what happens when the system is running with the CPU frequency governor settings of performance and on demand, bearing in mind that the first converted single core benchmarks demonstrated unexpectedly slow performance with the latter setting. Details of the tests are in the following and benchmarks, with source code, in the tar.gz file.
The first tests were with the RPi 3 fitted in a FLIRC case ##, where 15 minute 32 bit tests, under Raspbian, showed no degradation in performance, with a limited increase in CPU temperature. The next ones had a copper heat sink on the CPU and no case, where the original tests demonstrated CPU MHz throttling and slower performance, with increases in temperature.
## With the FLIRC case, the whole aluminium case acts as a heatsink.
A summary of OpenSUSE results is below, showing average speeds per core for multi-core tests, with floating point test results in MFLOPS and integer tests in MB/second.
Single vs Multi Core - With the performance setting, up to 10% per core degradation could be expected. With on demand, even an overheated MP core can be faster than a cold single core.
Performance vs On Demand - Multi core performance is essentially the same, but single core OD speeds can be too slow.
FLIRC case - Note constant performance over the 15 minute MP tests.
Copper Heatsink - CPU speed throttling kicks in at over 80°C CPU temperature, recorded as ongoing small reductions in MHz and slower measured performance.
Per Core Average Speeds

                          On Demand       Performance       OD/Perf
                        MFLOPS MB/sec    MFLOPS MB/sec    MFL   MB/s

Single core 1 pass        1996   2278      3832   2768    52%    82%

FLIRC Case
4 cores first 2 passes    3623   2465      3644   2528    99%    98%
4 cores last 2 passes     3591   2476      3645   2618    99%    95%

Copper Heatsink
4 cores first 2 passes    3603   2485      3463   2477   104%   100%
4 cores last 2 passes     3152   2017      3104   1975   102%   102%
The above htm report includes 8 graphs, with 2 represented below, indicating variable recordings with the on demand setting and fairly constant speeds using the performance option, with limited increases in CPU temperature.
> Now you need to look at kernel tweaks, or compare the kernel configurations between openSUSE and SLES (free for a year) for aarch64 <
Thanks, but I have never considered using kernel tweaks and prefer to stick to “as is”, unless a simple run time option is available. I already have a copy of SLES and have repeated the stress tests; it produced the same loading effects as OpenSUSE.
Following are average speeds of stressintPi64, running 6 minute tests at 8 KB, using 1, 2 and 4 cores. With the performance setting, except for a little degradation due to heat effects with 4 threads, MB/second results were the same. On the other hand, the default on demand option produced better performance per core as the load increased. That seems to be the wrong way round.
               On Demand              Performance
Program      Total   Average        Total   Average
Copies      MB/sec   MB/sec        MB/sec   MB/sec
   1          2079     2079          2758     2758
   2          4651     2325          5519     2759
   4         10811     2703         10806     2701
I have had quite a few crashes using the SUSE operating systems, but it might be my Raspberry Pi 3 and/or the sort of programs I am running, as they also occurred (less frequently) using Raspbian. Unfortunately, sometimes the SD cards cannot boot afterwards. This happened with SLES. Although it is no real hassle to produce another copy of the system, the new copy could not install SUSE software/updates and the old registration code was not recognised. But, so far, I can still run my benchmarks using SLES.
I know it’s a bit off topic, but can you guys please advise if Leap 42.2 for RPi is mature enough for daily usage, or is it still more of a proof-of-concept? I have been using Raspbian for several months and I’m quite disappointed with it. I would love to switch to openSUSE, because that’s what I use for everything else, but I don’t want to end up with something even worse than Raspbian.
I understand it all depends on the intended use etc, but I just want to know your subjective opinion - is openSUSE for Arm similar in maturity and stability to the x86 version?
Hi
Had no issues with openSUSE Leap 42.2, Tumbleweed (I like the xfce on this) or the free (1 year subscription) SLES 12 SP2 version, which I tend to use for command line only and playing with the GPIO. I had to modify the wiringPi code as well as create fake cpu info.
I have compiled and run the first set of 64 bit benchmarks. Full details and results are in the following, with benchmarks and source codes in the tar.gz file:
The Classic Benchmarks are the first programs that set standards of performance for computers in the 1970s and 1980s. They are Whetstone, Dhrystone, Linpack and Livermore Loops. Improvements indicated relate to comparisons of gcc-6 64 bit versions and 32 bit compilations from gcc 4.8 via Raspbian.
**Whetstone** - This includes simple test loops that do not benefit from advanced instructions. There was a 40% improvement in overall performance, due to a small number of dominant tests using functions such as COS and EXP.
**Dhrystone** - Rated in VAX MIPS, AKA DMIPS, this produced a 43% improvement, but the benchmark is susceptible to over optimisation.
**Linpack** - Double and single precision versions (DP and SP), with results reported in MFLOPS. Speed improvements, over the 32 bit version, were around 1.9 times DP and 2.5 times SP. There is also a version that uses NEON intrinsic functions which, at 32 bits and 64 bits, are compiled as different varieties of vector instructions, with only a 10% improvement.
**Livermore Loops** - This has 24 test kernels, where 64 bit performance increased by between 1.02 and 2.88 times. The official average was 34% faster, at 279 MFLOPS. This is 21 times faster than the Cray 1 supercomputer, where this benchmark confirmed the original selection.
**Memory tests** - These measure cache and RAM speeds, with results in MB/second. As could be expected, RAM speeds were generally quite similar for particular test functions.
**MemSpeed** - Nine tests measure speeds using floating point (FP) and integer calculations. Cache based improvements were 1.64 to 2.60 times for DP FP, 1.17 to 1.55 for SP FP and 1.03 to 1.23 for integer.
**BusSpeed** - This reads data via loops with 64 AND instructions, attempting to measure maximum data transfer speeds. It includes variable address increments to identify burst reading and to provide a means of estimating bus speeds. The main differences were on using L1 cache data, where average burst speeds were 38% faster but reading all data was slower. This is surprising, as the 64 bit disassembly indicates that far more registers were used, with fewer load instructions, and the same type of AND instructions.
**NeonSpeed** - All floating point data is single precision. The source code carries out the same calculations using normal arithmetic and more complicated NEON intrinsic functions, the latter being compiled as different types of vector instructions, with no real average 64 bit improvement. The normal SP calculations were slightly faster.
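To give an idea of the code behind these two benchmarks, below is a simplified sketch (not the actual benchmark source, which is in the tar.gz file) of a BusSpeed style AND read loop with a variable address increment, plus the NeonSpeed style scalar versus NEON intrinsics comparison. It assumes an ARM target and gcc -O3:

```c
/* Sketch only, assuming an ARM target: not the benchmark sources. */
#include <stdio.h>
#include <stdlib.h>
#include <arm_neon.h>

/* AND words together with address increment inc, so larger increments
   use only part of each burst. The real benchmark unrolls 64 ANDs per
   loop; 8 are shown. Returning the result stops the compiler removing
   the reads. */
static int and_read(const int *x, long n, long inc)
{
    int a = ~0;
    for (long i = 0; i + 7 * inc < n; i += 8 * inc)
        a &= x[i]           & x[i + inc]     & x[i + 2 * inc]
           & x[i + 3 * inc] & x[i + 4 * inc] & x[i + 5 * inc]
           & x[i + 6 * inc] & x[i + 7 * inc];
    return a;
}

/* The same single precision calculation in plain C and with NEON
   intrinsics, four floats at a time (n assumed a multiple of 4). */
static void triad_c(float *x, const float *y, long n, float a)
{
    for (long i = 0; i < n; i++)
        x[i] = x[i] + y[i] * a;
}

static void triad_neon(float *x, const float *y, long n, float a)
{
    float32x4_t va = vdupq_n_f32(a);
    for (long i = 0; i < n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);
        float32x4_t vy = vld1q_f32(y + i);
        vst1q_f32(x + i, vmlaq_f32(vx, vy, va));   /* vx + vy * va */
    }
}

int main(void)
{
    long n = 1024;
    int   *w = malloc(n * sizeof(int));
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (long i = 0; i < n; i++) { w[i] = ~0; x[i] = 1.0f; y[i] = 2.0f; }

    printf("AND read check %d\n", and_read(w, n, 2));
    triad_c(x, y, n, 0.5f);
    triad_neon(x, y, n, 0.5f);
    printf("triad check %f\n", x[0]);

    free(w); free(x); free(y);
    return 0;
}
```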
The latest programs converted were my Fast Fourier Transform benchmarks, which showed some 64 bit performance improvements. Source code and execution files are included in the above. These execute FFTs sized 1K to 1024K, the larger ones depending on RAM speeds. Using Raspbian (32 bit), SUSE and another Linux distro (64 bit), the short FFTs, with execution times of less than 0.5 milliseconds, produced inconsistent running times (sometimes half speed). This was only with the “on demand” MHz setting. SUSE also produced longer periods of poor performance, as observed through random slow results on other benchmarks. To investigate this, I produced another test that executes 30 1K sized FFTs 500 times, with 32 bit and 64 bit compilations (these will be included in the tar.gz file). Example results are below.
Most of my multithreading benchmarks run using 1, 2, 4 and 8 threads. Many have tests that use approximately 12 KB, 120 KB and 12 MB of data, to use both caches and RAM. The first set attempts to measure maximum MFLOPS, with two test procedures, one with two floating point operations per data word and the other with 32. The latter includes a mixture of multiplications and additions, coded to enable SIMD operation. In this case, using single precision numbers, four at a time, plus linked multiply and add, a top end CPU can execute eight operations per clock cycle per core. It is not clear what the potential maximum MFLOPS is on an ARM Cortex-A53, but eight per core is mentioned. The same benchmark code obtained a maximum of 24 MFLOPS/MHz on a top end quad core Intel CPU, via Linux - see the following:
Following shows the format of the MP-MFLOPS benchmarks with the best 64 bit Raspberry Pi 3 results. Note the performance increases using more threads, except when limited by RAM speed. These benchmarks carry out a fixed number of test passes, with each thread carrying out the same calculations on different sections of data. Numeric results produced (x100000) are output to show that all data has been used.
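As a rough illustration of the two levels of arithmetic intensity described above - a sketch, not the benchmark source - the inner loops are of this general form, with the final print acting as the kind of data check just mentioned:

```c
/* Sketch only: roughly the style of the MP-MFLOPS inner loops, with 2
   and 32 floating point operations per data word. In the benchmark each
   thread runs loops like these over its own section of the data. */
#include <stdio.h>

static void ops2(float *x, int n, float a, float b)
{
    for (int i = 0; i < n; i++)
        x[i] = (x[i] + a) * b;                        /* 2 ops per word */
}

static void ops32(float *x, int n, float a, float b, float c,
                  float d, float e, float f)
{
    for (int i = 0; i < n; i++) {
        float t = x[i];
        t = (t + a) * b - (t + c) * d + (t + e) * f;  /*  8 ops */
        t = (t + a) * b - (t + c) * d + (t + e) * f;  /* 16 ops */
        t = (t + a) * b - (t + c) * d + (t + e) * f;  /* 24 ops */
        t = (t + a) * b - (t + c) * d + (t + e) * f;  /* 32 ops per word */
        x[i] = t;
    }
}

int main(void)
{
    static float x[100000];
    for (int i = 0; i < 100000; i++) x[i] = 0.999999f;
    ops2(x, 100000, 0.000020f, 0.999980f);
    ops32(x, 100000, 0.000020f, 0.999980f,
          0.000010f, 0.999990f, 0.000005f, 0.999995f);
    printf("data check x100000 %8.0f\n", x[0] * 100000.0f);
    return 0;
}
```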
Benchmarks appropriate for comparison of 32 and 64 bit versions are the single and double precision versions compiled for normal floating point, and one using NEON intrinsic functions that are clearly suitable for SIMD operation and are converted to different types of vector operation.
64 bit/32 bit speed comparisons are below. Single precision MP-MFLOPS has the highest gain, by using vector instructions instead of scalar. With compiled intrinsics, the systems use different forms of vector instructions.
There is also an OpenMP benchmark that carries out the same calculations, but also with 8 calculations per data word. OpenSUSE uses all available CPU cores. So, for comparison purposes, a version without the MP directive is also provided. Results identify MP gains of up to 3.89 times at 64 bits. The 64 bit version produces some speeds similar to the 32 bit compilation, but was faster by 2.47 to 2.80 times using 32 floating point operations per word in the MP tests.
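A minimal sketch of how the OpenMP version spreads the work across cores (illustrative only, not the benchmark source):

```c
/* Sketch only: the OpenMP directive that lets the system use all cores.
   Build with gcc -O3 -fopenmp; without -fopenmp the pragma is ignored
   and the loop runs on a single core, as in the comparison version. */
#include <stdio.h>

static void calc(float *x, int n, float a, float b)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] = (x[i] + a) * b;        /* each thread takes a slice of i */
}

int main(void)
{
    static float x[1000000];
    for (int i = 0; i < 1000000; i++) x[i] = 0.999999f;
    for (int pass = 0; pass < 100; pass++)
        calc(x, 1000000, 0.000020f, 0.999980f);
    printf("data check %f\n", x[0]);
    return 0;
}
```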
As usual, benchmarks, source codes, details and results are in:
The other MP benchmarks, included in the tar.gz file, demonstrate some MP and 64 bit performance gains, with others identifying that multithreading provided little or no benefit and, sometimes, much worse performance.
**MP-Whetstone** - Multiple threads each run the eight test functions at the same time, but with some dedicated variables. MP performance is good, but the simple test functions are not appropriate for more advanced instructions at 64 bits, so relative 32 bit performance is between 0.48 and 2.08.
**MP-Dhrystone** - This runs multiple copies of the whole program at the same time. Dedicated data arrays are used for each thread, but there are numerous other variables that are shared. The latter reduces performance gains via multiple threads and, in some cases, these can be slower than using a single thread. In this case, some quad core improvements are shown as up to 2.5 times faster than a single core. The single core 64 bit/32 bit speed ratio was 1.50, reducing to 1.10 using four threads.
**MP-Linpack** - The original Linpack benchmark operates on double precision floating point 100x100 matrices. This one runs on 100x100, 500x500 and 1000x1000 single precision matrices using 0, 1, 2 and 4 separate threads, mainly via NEON intrinsic functions that are compiled into different forms of vector instructions. The benchmark was produced to demonstrate that the original Linpack code could not be converted (by me) to show increased performance using multiple threads. The official line is that users are allowed to implement their own linear equation solver for this purpose. At 100 x 100, data is in L2 cache; the others depend more on RAM speed. The critical daxpy function is affected by numerous thread create and join directives, even when using one thread. This leads to slow and constant performance across all the thread tests - see the example below, and the threading sketch that follows it. The 32 bit version produced slightly slower speeds.
Linpack Single Precision MultiThreaded Benchmark
64 Bit NEON Intrinsics, Wed Mar 8 11:36:25 2017
MFLOPS 0 to 4 Threads, N 100, 500, 1000

Threads     None        1        2        4
N  100    552.47   112.73   105.19   105.31
N  500    442.32   303.75   303.64   305.03
N 1000    353.88   315.96   309.15   308.31
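To illustrate why the thread create and join overheads dominate at N 100, below is a minimal pthreads sketch of a threaded saxpy of the kind described above (not the benchmark's actual threading code; build with gcc -O2 -pthread). At N 100 each call does only a few hundred floating point operations, but Linpack calls the function many thousands of times, so the fixed thread overhead swamps the useful work:

```c
/* Minimal pthreads sketch (not the benchmark's own threading code) of a
   saxpy where threads are created and joined on every call, even for a
   single thread - the situation described above for the daxpy function. */
#include <pthread.h>
#include <stdio.h>

typedef struct { int n; float a; const float *x; float *y; } Work;

static void *saxpy_part(void *p)
{
    Work *w = (Work *)p;
    for (int i = 0; i < w->n; i++) w->y[i] += w->a * w->x[i];
    return NULL;
}

/* One call = one create plus one join per thread (nthreads 1 to 4) */
static void saxpy_threaded(int n, float a, const float *x, float *y,
                           int nthreads)
{
    pthread_t tid[4];
    Work w[4];
    int chunk = n / nthreads;
    for (int t = 0; t < nthreads; t++) {
        w[t].n = (t == nthreads - 1) ? n - t * chunk : chunk;
        w[t].a = a;
        w[t].x = x + t * chunk;
        w[t].y = y + t * chunk;
        pthread_create(&tid[t], NULL, saxpy_part, &w[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}

int main(void)
{
    static float x[100], y[100];
    for (int i = 0; i < 100; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    /* In Linpack this sits inside loops executed thousands of times, so
       the fixed create/join cost repeats on every call. */
    saxpy_threaded(100, 0.5f, x, y, 1);
    printf("check %f\n", y[0]);
    return 0;
}
```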
**MP-BusSpeed** - This runs integer read only tests using caches and RAM, each thread accessing the same data, but with staggered starting points. It includes tests with variable address increments, to identify burst reading and bus speeds. The main “Read All” test is intended to identify maximum RAM speed. The benchmark demonstrated some appropriate MP performance gains, but slow 64 bit speeds, with the 32 bit version being 2.5 times faster via cache based data. The reason is that the latter compiled the arithmetic as 16 four way NEON operations, compared with 64 scalar instructions.
**MP-RandMem** - The benchmark has cache and RAM read only and read/write tests using sequential and random access, each thread accessing the same data but starting at different points. The read only L1 cache based tests demonstrated MP gains of 3.6 times, with the 64 bit version 43% faster than the 32 bit variety. Read/write tests produced no multithreading performance improvement, and the latest benchmark appeared to be somewhat slower than the 32 bit version.
The OpenGL GLUT benchmark was produced for use on Linux based PCs. It has four tests using coloured or textured simple objects, then a wireframe and textured complex kitchen structure. It can be run from a script file specifying different window sizes and a command to disable VSYNC, enabling speeds greater than 60 FPS to be demonstrated. The benchmark, source code and details are in the following:
In 2012, I approved a request from a Quality Engineer at Canonical to use this OpenGL benchmark in the testing framework of the Unity desktop software. One reason probably was that a test can be run for extended periods as a stress test.
Below are results from a Raspberry Pi 3, using the experimental desktop GL driver and the new 64 bit version. The latter included tests at a smaller window size, to show that maximum speed was not limited by VSYNC. It can be seen that, using smaller windows, the 32 bit version was significantly faster running simple coloured objects, with the 64 bit benchmark being ahead with complex structures. Then, performance became close up to 1024 x 768, with the latter program falling over with a full screen display (config setting?). Note that this benchmark would not run on some Leap installations.
######################### RPi 3 Original #########################
 GLUT OpenGL Benchmark 32 Bit Version 1, Wed Jul 27 20:31:52 2016

 Window Size  Coloured Objects  Textured Objects  WireFrm Texture
   Pixels       Few     All       Few     All     Kitchen Kitchen
 Wide  High     FPS     FPS       FPS     FPS       FPS     FPS
  320   240    308.4   182.1      82.6    52.3      21.6    13.7
  640   480    129.5   119.6      74.6    49.2      21.6    13.8
 1024   768     54.8    52.2      43.7    39.2      21.4    13.6
 1920  1080     21.5    17.9      20.3    19.6      20.6    13.4

########################## RPi 3 SUSE ###########################
 GLUT OpenGL Benchmark 64 Bit Version 1, Sat Mar 18 19:03:25 2017

 Window Size  Coloured Objects  Textured Objects  WireFrm Texture
   Pixels       Few     All       Few     All     Kitchen Kitchen
 Wide  High     FPS     FPS       FPS     FPS       FPS     FPS
  160   120     87.1    76.3      64.3    46.9      24.3    15.6
  320   240     59.2    54.7      53.7    43.9      25.6    15.6
  640   480     33.4    31.7      31.0    27.6      24.4    15.3
 1024   768     17.5    17.5      17.7    17.0      16.2    14.1
 1920  1080      8.2     8.3       9.0     9.3       8.4     7.6
JavaDraw Benchmarks
The benchmark uses small to rather excessive numbers of simple objects to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades. Benchmarks, further details and results can be obtained via the above links.
Results below include all sorts of issues, where the original system did not run well after the new OpenGL GLUT driver was installed, and OpenSUSE performance depended on the particular distribution.
##################### RPi 3 JavaDraw FPS ######################

                      PNG      PNG    +Sweep     +200    +320    +4000
                    Bitmaps  Bitmaps  Gradient  Small    Long    Small
                       1        2     Circles  Circles  Lines   Circles
 Pi 2 900 MHz         44.4     56.8     57.3     55.0    38.6     25.2
 Pi 3 Original        55.0     69.5     70.0     67.7    46.4     29.5
 Pi 3 +GLUT Driver     2.9      3.2      7.3      8.1     7.5      7.0
 Pi 3 OpenSUSE         8.6     10.9     10.7     10.1     7.9      3.6
 Pi 3 OpenSUSE        22.8     32.1     32.3     27.7    15.3      6.2
Java Whetstone Benchmark
Details and results are also included in the above files. Excluding two tests, where each was much faster, the average 64 bit speed was nearly twice as fast.
64 Bit I/O Benchmarks
**My DriveSpeed and LanSpeed programs have now been recompiled as DriveSpeed64 and LanSpeed64**, with benchmarks, source codes, details and results in the tar.gz and htm files quoted earlier. The code for these is identical, except that DriveSpeed opens files to use direct I/O, avoiding caching. LanSpeed normally runs without using local caching. The benchmarks measure writing and reading speeds of relatively large files, random access and numerous small files.
There might be a tuning parameter, but DriveSpeed64 produced errors using the installed OpenSUSE, where direct I/O did not appear to be available. It did run using SUSE SLES, producing the results shown below. In this case, random access and small file test results were not as expected.
#################### DriveSpeed64 SUSE SLES ####################

 DriveSpeed RasPi 64 Bit 1.1 Mon Apr 3 23:40:21 2017
 Current Directory Path: /home/roy/driveLANSUSE
 Total MB 29465, Free MB 27495, Used MB 1970

                        MBytes/Second
    MB   Write1  Write2  Write3   Read1   Read2   Read3
     8    10.26   15.50    7.78   47.27   51.62   48.91
    16    10.58   13.86   10.14   54.05   55.50   45.78
 Cached
     8   520.96  586.68  601.25  709.43  709.23  706.46

 Random                Read                   Write
 From MB        4       8      16       4       8      16
 msecs      0.005   0.004   0.004   16.91   20.31   22.13

 200 Files           Write                    Read          Delete
 File KB        4       8      16       4       8      16     secs
 MB/sec      0.25    0.36    1.06  252.55  403.28  621.47
 ms/file    16.10   23.00   15.43    0.02    0.02    0.03    0.029

 End of test Mon Apr 3 23:40:59 2017
>>>>>>>>>>>>>>>>>>> Comparison with 32 Bit Version <<<<<<<<<<<<<<<<<<<
Large Files > Faster SD card reflected, reading > twice as fast
Random > Writing exceptionally slow, reading far too fast, data cached?
Small Files > Writing exceptionally slow, reading far too fast, data cached?
DriveSpeed can also be used for testing USB connected drives. This produced errors using flash drives and USB connected SD cards. It did run on the latter, via a different 64 bit OS, but only on a btrfs formatted partition.
**LAN** access could only be used via OpenSUSE, following installation of additional facilities. Samba for SUSE SLES could not be downloaded following a necessary reinstallation of the system. OpenSUSE results below are from accessing a Windows based PC.
Cannot insert the table in CODE - see:
http://www.roylongbottom.org.uk/Raspberry%20Pi%20Benchmarks.htm#anchor22a
LanSpeed64 was also successfully run targeting the main and USB drives that would not run DriveSpeed64, identifying speeds when data was cached, and suggesting that the earlier failures were due to trying to open files (as the programs do) to force direct I/O. Details are available in the aforementioned htm report.
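For reference, the direct I/O behaviour comes down to opening the test files with O_DIRECT. A minimal sketch is below (the file name is just an example); where the kernel or filesystem does not support O_DIRECT, the open or the subsequent transfers typically fail with EINVAL, which may explain the errors seen above:

```c
/* Sketch of a direct I/O open of the kind DriveSpeed relies on, not the
   benchmark source. Buffers for O_DIRECT must be aligned; 4096 bytes is
   used here. Build with gcc -O2 directio.c -o directio */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(void)
{
    const char *name = "testfile.dat";         /* example test file name */
    int fd = open(name, O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        printf("open with O_DIRECT failed: %s\n", strerror(errno));
        return 1;
    }
    void *buf;
    if (posix_memalign(&buf, 4096, 1024 * 1024) != 0) return 1;
    memset(buf, 0, 1024 * 1024);
    ssize_t n = write(fd, buf, 1024 * 1024);    /* bypasses the page cache */
    if (n < 0)
        printf("direct write failed: %s\n", strerror(errno));
    else
        printf("wrote %zd bytes with direct I/O\n", n);
    close(fd);
    free(buf);
    return 0;
}
```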