OpenSuSE 13.1, 3.11.10-7-desktop #1 SMP PREEMPT: I/O performance with SSDs is very bad.

Hi,

I installed Oracle 11.2 under OpenSuSE 13.1 and verified that asynchronous I/O is set up correctly.
For a given tablespace I create 4 files, each 1 GB in size, in parallel. This means that at runtime
4 Oracle processes are writing into the filesystem in parallel.
The destination is an ext4 filesystem mounted at /oracle with the options:

acl,user_xattr

The whole operation takes 32 seconds, which means about 128 MB per second in total. I checked via strace that
the Oracle processes perform io_submit() and io_getevents() calls, so asynchronous I/O is used.
Looking at the tool iotop, called with the options

iotop -ok --user=oracle

I see that every Oracle process writes at around 32 MB per second.
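
For reference, the asynchronous I/O check itself can look roughly like this (the PID is just a placeholder for one of the Oracle writer processes, and the aio sysctls are only an additional sanity check):

strace -f -e trace=io_submit,io_getevents -p <pid>
cat /proc/sys/fs/aio-nr        # async I/O requests currently allocated system-wide
cat /proc/sys/fs/aio-max-nr    # system-wide limit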

If I do the same thing under Oracle Linux Version 6 Update 5, 3.8.13-16.2.1.el6uek.x86_64 #1 SMP,
I see with iotop that every Oracle process writes at around 120 MB per second. The whole
operation takes about 4-5 seconds. The filesystem mount options are the same.

Why are we limited to 32 MB per second per process under OpenSuSE 13.1? Before this I used OpenSuSE 12.3,
and under that version I did not encounter this problem.

Thanks for any hints and help.

Regards,

Uwe

Hi,

I just did some tests with the dd command under OpenSuSE 13.1, using the mounted filesystem
/oracle:

Using the DIRECT and SYNC options:

glorfindel:/oracle # dd if=/dev/zero bs=1048576 count=1024 oflag=direct,sync of=/oracle/dd.01
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 12.4679 s, 86.1 MB/s

glorfindel:/oracle # dd if=/oracle/dd.01 bs=1048576 count=1024 iflag=direct,sync of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.64461 s, 406 MB/s

Using the DIRECT option only:

glorfindel:/oracle # dd if=/dev/zero bs=1048576 count=1024 oflag=direct of=/oracle/dd.01
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 8.37267 s, 128 MB/s
glorfindel:/oracle # dd if=/oracle/dd.01 bs=1048576 count=1024 iflag=direct of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.63974 s, 407 MB/s

Under Oracle Linux I get these results:

root@olv6:/vorlons> dd if=/dev/zero bs=1048576 count=1024 oflag=direct,sync of=/vorlons/dd.01
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 8.29227 s, 129 MB/s

root@olv6:/vorlons> dd if=/vorlons/dd.01 bs=1048576 count=1024 iflag=direct,sync of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.10068 s, 511 MB/s

root@olv6:/vorlons> dd if=/dev/zero bs=1048576 count=1024 oflag=direct of=/vorlons/dd.01
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.25797 s, 476 MB/s

root@olv6:/vorlons> dd if=/vorlons/dd.01 bs=1048576 count=1024 iflag=direct of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.12025 s, 506 MB/s

Unfortunately I cannot simulate asynchronous I/O with dd. But I can see that the difference between
the two operating systems is in the handling of the SYNC option. Any ideas?
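
For reference: a tool like fio can drive asynchronous I/O in a way that is close to what the Oracle writer processes do. A minimal sketch, assuming fio is installed and /oracle is writable (job name, I/O depth and sizes are only examples):

fio --name=orawrite --directory=/oracle --rw=write --bs=1M --size=1G --numjobs=4 --ioengine=libaio --iodepth=8 --direct=1 --group_reporting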

Regards,

Uwe

Hello and welcome here.

Can you please use CODE tags around your computer output? You get them by clicking on the # button in the tool bar of the post editor.

You should check and compare the device and filesystem parameters between the two systems; for example the readahead setting:

blockdev --getra /dev/sda1

and the io scheduler:

cat /sys/block/sda/queue/scheduler

and …
If these settings are different, try to change them.
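
For example (the device names and values here are only placeholders, adjust them to your setup):

blockdev --setra 1024 /dev/sda1                  # increase the readahead
echo deadline > /sys/block/sda/queue/scheduler   # switch the I/O scheduler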

Hendrik

Thanks for the reply, Hendrik.

For both systems I wrote a small script which prints the output of the various blockdev commands
(a rough sketch of the script follows after the two outputs below). In addition I made sure that the
settings for the block devices in use are the same on both systems. For OpenSuSE I get this output:


glorfindel:~ # blockdev.sh /dev/sdb4
getsz: 290400256
getro: 0
getdiscardzeroes: 0
getss: 512
getpbsz: 512
getiomin: 512
getioopt: 0
getalignoff: 0
getmaxsect: 2048
getbsz: 4096
getsize64: 148684931072
getra: 256
getfra: 256

For Oracle Linux Version 6 it is:


root@olv6:/proc/fs/ext4/sda3> blockdev.sh /dev/sda3
getsz: 41943040
getro: 0
getdiscardzeroes: blockdev: Unknown command: --getdiscardzeroes

Usage:
  blockdev -V
  blockdev --report [devices]
  blockdev [-v|-q] commands devices

Available commands:
        --getsz                        get size in 512-byte sectors
        --setro                        set read-only
        --setrw                        set read-write
        --getro                        get read-only
        --getss                        get logical block (sector) size
        --getpbsz                      get physical block (sector) size
        --getiomin                     get minimum I/O size
        --getioopt                     get optimal I/O size
        --getalignoff                  get alignment offset
        --getmaxsect                   get max sectors per request
        --getbsz                       get blocksize
        --setbsz BLOCKSIZE             set blocksize
        --getsize                      get 32-bit sector count
        --getsize64                    get size in bytes
        --setra READAHEAD              set readahead
        --getra                        get readahead
        --setfra FSREADAHEAD           set filesystem readahead
        --getfra                       get filesystem readahead
        --flushbufs                    flush buffers
        --rereadpt                     reread partition table

getss: 512
getpbsz: 512
getiomin: 512
getioopt: 0
getalignoff: 0
getmaxsect: 2048
getbsz: 4096
getsize64: 21474836480
getra: 256
getfra: 256
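
For reference, blockdev.sh is nothing special; it is roughly just a loop over the blockdev query options, something like this sketch:

#!/bin/sh
# blockdev.sh: print the blockdev settings for the given device
DEV=$1
for opt in getsz getro getdiscardzeroes getss getpbsz getiomin getioopt \
           getalignoff getmaxsect getbsz getsize64 getra getfra
do
        echo "$opt: $(blockdev --$opt $DEV)"
done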

The I/O scheduler on both systems is the deadline one. Redoing the tests does not
change the behaviour on the OpenSuSE 13.1 side. I then thought that the problem
is maybe not in the block layer but in the filesystem layer, in this case ext4. I took a look at the
runtime ext4 settings. For OpenSuSE:


cd /proc/fs/ext4/sdb4
cat options
rw
delalloc
barrier
user_xattr
acl
resuid=0
resgid=0
errors=continue
commit=5
min_batch_time=0
max_batch_time=15000
stripe=0
data=ordered
inode_readahead_blks=32
init_itable=10
max_dir_size_kb=0

For Oracle Linux I get:


cd /proc/fs/ext4/sda3
cat options
rw
mblk_io_submit
delalloc
barrier
user_xattr
acl
resuid=0
resgid=0
errors=continue
commit=5
min_batch_time=0
max_batch_time=15000
stripe=0
data=ordered
inode_readahead_blks=32
init_itable=10
max_dir_size_kb=0
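
To make sure I am not overlooking anything, the two option lists can also be diffed directly; for example, assuming the Oracle Linux output has been copied over to /tmp/sda3.options:

diff /tmp/sda3.options /proc/fs/ext4/sdb4/options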

The only difference I see here is the option mblk_io_submit, which I do not
see under OpenSuSE 13.1. I then changed the mount options of the filesystem in
question (under OpenSuSE it is /oracle) in the following way:


umount /oracle
mount -o acl,user_xattr,mblk_io_submit /oracle

I then checked the options the ext4 filesystem actually uses:


cd /proc/fs/ext4/sdb4
cat options
rw
delalloc
barrier
user_xattr
acl
resuid=0
resgid=0
errors=continue
commit=5
min_batch_time=0
max_batch_time=15000
stripe=0
data=ordered
inode_readahead_blks=32
init_itable=10
max_dir_size_kb=0

The new option mblk_io_submit is not set. I then checked the kernel messages:


dmesg
...
[ 6888.112342] EXT4-fs (sdb4): Ignoring removed mblk_io_submit option
[ 6888.112355] EXT4-fs (sdb4): Ignoring removed mblk_io_submit option
[ 6888.121224] EXT4-fs (sdb4): mounted filesystem with ordered data mode. Opts: acl,user_xattr,mblk_io_submit,acl,user_xattr,mblk_io_submit

Could that explain the difference?

Regards,

Uwe

I forgot to list my laptop details:

Newest Alienware M17
8GB main memory
CPU: Intel(R) Core™ i7-4700MQ CPU @ 2.40GHz
2 Samsung SSDs with 256GB each: SAMSUNG SSD 830
No RAID configuration.
Internal Intel 4400 graphics plus an NVIDIA GTX 860M.

Using Samsung Magician 4.1 I ran some performance tests under Windows 7, which uses the same internal
disk as OpenSuSE 13.1. The sequential write performance there is about 100 MB per second, and under OpenSuSE 13.1
I get around 94 MB per second.

Conclusion: The slow performance is not OS related.

I will have to talk to the Alienware support about why the throughput is so slow even under Windows 7. Thanks anyway.

Regards,

Uwe

mblk_io_submit could make a difference if it works as advertised (multiple pages in one I/O request), especially for sequential I/O.
But there have been reports of data corruption issues with this option.
I tried to find out whether these issues have been fixed, but did not find anything. :(
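
One way to at least see how a given kernel treats the option is to grep the ext4 sources (assuming the kernel sources are installed under /usr/src/linux):

grep -rn mblk_io_submit /usr/src/linux/fs/ext4/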

Hendrik