Disk write increase after upgrade from 42.3 to 15.0

Hello

I have a problem with the disk load since I upgraded from openSUSE Leap 42.3 to 15.0.
As with every upgrade, I basically stopped the main services, changed the repos and ran zypper dup. After the reboot and starting everything again, everything seemed to work. But then I noticed in the monitoring an increase in the IO write statistics on the system disk (btrfs file system). This also leads to a noticeable decrease in system performance, and the system now spends a significant portion of its time in iowait.
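
For reference, the numbers come from standard tools; roughly something like this shows the write rate and the iowait share (just a sketch; sda is my system disk):

 iostat -x sda 5     # the w/s and wkB/s columns show the write load on the disk
 vmstat 5            # the "wa" column is the iowait percentage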

This is what I found out so far:

  • The source of the problem seems to be the Nextcloud instance running in docker. It consists of a Nextcloud 15 container, a MariaDB 10.3 container and a Redis 5.0 container. When I stop these, everything goes back to normal.
  • These were exactly the same containers as before the upgrade. Of course, I recreated the containers during the analysis, but this didn’t change anything.
  • Docker version is the same in openSUSE Leap 42.3 and 15.0 according to the package list.
  • When I make sure that no request can reach Nextcloud and deactivate the cron job that calls it every 15 minutes, there is NO increased writing to the disk.
  • Because the effect is intermittent, it is hard to tell exactly what is causing it, but watching iotop for a while it seems to be the mysql database that causes the increased writing (see the short iotop sketch after this list). But what could have changed there compared to before the upgrade?
  • I cannot find anything unusual in the logs of Nextcloud or mysql; no problems or errors are reported.
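
This is roughly what I mean by “looking at iotop for a while” (a sketch; the exact flags may differ between iotop versions), plus a per-container snapshot for comparison:

 iotop -o -b -k -d 5 -n 60    # batch mode, only processes actually doing IO, 5 s samples, sizes in kB
 docker stats --no-stream     # one snapshot of CPU, memory and block IO per container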

I am running out of ideas, so any hint would be highly appreciated. What would be good further directions to investigate?

Best

Size of memory and swap usage please???

I think some pictures from the monitoring say the most. The gap is the time during the upgrade; before the upgrade I did a backup.
iostat:
http://susepaste.org/images/91283756.png
cpu:
http://susepaste.org/images/83928036.png

memory:
http://susepaste.org/images/7536393.png
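
If plain numbers help more than the pictures, the sizes can of course also be taken from the command line (sketch only, output omitted here):

 free -h           # total/used/free memory, buffers/cache and swap
 swapon --show     # active swap devices and their sizes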

# btrfs fi usage /
Overall:
    Device size:                 266.09GiB
    Device allocated:             89.07GiB
    Device unallocated:          177.02GiB
    Device missing:                  0.00B
    Used:                         80.01GiB
    Free (estimated):            184.31GiB      (min: 95.80GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              280.69MiB      (used: 0.00B)

Data,single: Size:83.01GiB, Used:75.71GiB
   /dev/sda2      83.01GiB

Metadata,DUP: Size:3.00GiB, Used:2.15GiB
   /dev/sda2       6.00GiB

System,DUP: Size:32.00MiB, Used:16.00KiB
   /dev/sda2      64.00MiB

Unallocated:
   /dev/sda2     177.02GiB

Hello all,

after looking at the problem for a while, here are two more things I noticed:

  1. The used disk space does not change significantly, so I assume that the writing is repeatedly overwriting something.

  2. I can now see in the memory usage that the caching behavior seems to have changed:
    http://susepaste.org/images/18334418.png

Before the upgrade, the cache more or less constantly used the full amount of otherwise unused memory; now it decreases over time. The daily jump up at 3 am is when an automatic “backup” runs.
Also, “active” and “inactive” are much less constant. Only the changes in the “committed” memory can I explain, namely by activating and deactivating docker containers for testing.
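
The cache curve from the monitoring can also be cross-checked on the command line; a minimal sketch, run for a while and compare the numbers:

 while true; do date; grep -E '^(MemFree|Cached|Dirty|Writeback):' /proc/meminfo; sleep 60; done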

So my theory is that before the upgrade, the data that is now causing the disk IO was still being cached in RAM.

If this is correct, the question is, what caused this change?

My research so far has brought me to the point where I know I can configure the disk cache behavior with these settings:

 sysctl -a | grep dirty
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200

But these seem to be perfectly common values, and I guess they didn’t change with the upgrade?
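
If these values do turn out to matter, they can at least be changed on the fly for testing (the numbers below are only examples, not a recommendation, and they are gone after a reboot unless written to a sysctl config file):

 sysctl -w vm.dirty_ratio=40               # allow more dirty pages before writers are forced to flush
 sysctl -w vm.dirty_background_ratio=20    # start background writeback later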

Do you know of anything that changed from 42.3 to 15.0 that could explain this change in behavior?
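
The obvious candidates to compare between the two installations are at least the kernel, the docker package and the storage driver docker uses (just a sketch of what could be compared):

 uname -r                                  # running kernel version
 rpm -q docker                             # docker package version from the distribution
 docker info | grep -i 'storage driver'    # which storage driver docker uses on this filesystem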

Best