rpm installation randomly fails inside VirtualBox --- ideas needed...

Hi folks,

Summary: When I install packages with rpm inside VirtualBox, rpm sometimes deletes arbitrary files as a side-effect, files which should not be affected by the installation. It happens rarely, but it breaks openSUSE installations, kiwi-builds and other things that rely on rpm. My host is openSUSE 12.1 with VirtualBox (4.1.4) from oss repo, my guests are openSUSE 12.1 and 12.2 (build 0091). It happened also with host=oS 11.4. I have no idea where to look for the cause.

I usually test new openSUSE versions inside VirtualBox before I install them on my production machines. This testing includes building kiwi images. Starting with oS 12.1, the kiwi built process sometimes produced broken images. *(https://bugzilla.novell.com/754259) but kiwi’s principal author WONTFIX it since it only occurs in VirtualBox and since it is possibly an rpm bug and not in kiwi. I can understand his decision.

I boiled down the kiwi sequence to lots of rpm calls and observed that the bug can occur whenever rpm -U is used. The following script installs the “filesystem” package about a million times but stops as soon as the resulting root tree differs from the previous one.


#!/bin/bash

# Mount install dvd in /mnt/suse !
suse=/mnt/suse
logprefix=/tmp/rpmbug
rpmroot=/tmp/rpmroot
arch=x86_64

for k in $(seq -f %06.0f 999999)
do
  rm -rf $rpmroot
  echo -n "$k: "
  # Prepare root dir
  mkdir -p $rpmroot/etc
  cp /var/adm/fillup-templates/group.aaa_base $rpmroot/etc/group
  cp /var/adm/fillup-templates/passwd.aaa_base $rpmroot/etc/passwd

  # Install 'filesystem' package
  rpm \
  --root $rpmroot \
  -U \
  --nodeps \
  -vv \
  -- \
  $suse/suse/$arch/filesystem-*.$arch.rpm \
  >& $logprefix-$k.rpm.log

  # Get hash of directory structure
  currentsha1=$( \
     (cd $rpmroot ; find) \
     | tee $logprefix-$k.ls \
     | sha1sum \
     | cut -d' ' -f1 \
     | tee $logprefix-$k.ls.sha1)
  echo $currentsha1
  # Stop if hash is different from last one
   -z "$firstsha1" ] && firstsha1=$currentsha1
   "$firstsha1" != "$currentsha1" ] && break
done

The host is an x86_64 machine with openSUSE 12.1 and VirtualBox 4.1.4, as guest I use openSUSE 12.1 or 12.2 from DVD (also x86_64). When I call this script, it stops after a few 1000 iterations — on my last test run, the bug occured on iteration 2837. The log files from rpm (rpmbug-00283{6,7}.rpm.log) are exactly the same, but the resulting root trees are different. Here are the first few lines of “vimdiff rpmbug-00283{6,7}.ls”:


  .                                |  .
  ./selinux                        |  --------------------------------
  ./root                           |  ./root
  ./root/bin                       |  --------------------------------
  ./root/.gnupg                    |  --------------------------------
  ./var                            |  ./var
  ./var/X11R6                      |  --------------------------------
  ./var/adm                        |  ./var/adm
  ./var/adm/backup                 |  ./var/adm/backup
  ./var/adm/backup/rpmdb           |  --------------------------------
  ./var/adm/backup/sysconfig       |  --------------------------------
  ./var/adm/update-scripts         |  --------------------------------
  ./var/adm/perl-modules           |  --------------------------------
  ./var/adm/SuSEconfig             |  ./var/adm/SuSEconfig
  ./var/adm/SuSEconfig/bin         |  --------------------------------
  ./var/adm/SuSEconfig/md5         |  --------------------------------
  ./var/adm/update-messages        |  --------------------------------
  ./var/adm/fillup-templates       |  --------------------------------
  ./var/games                      |  --------------------------------
  ./var/spool                      |  ./var/spool
  ./var/spool/clientmqueue         |  --------------------------------
  ./var/spool/lpd                  |  --------------------------------

And “wc rpmbug-00283{6,7}.ls” says:


  484   484 11579 rpmbug-002836.ls
  193   193  4059 rpmbug-002837.ls
  677   677 15638 insgesamt

I’m stuck. This bug bites me hard since November and I cannot ignore it. Most of the broken kiwi images boot perfectly, but basic applications (like vim) refuse to work because some files are missing. Note that I cannot reproduce this ouside of VBox. Also note that in some (early) test runs with multiple zypper calls, one of those calls removed files which have been installed by earlier zypper invocations. So it appears that rpm does not just fail to install the files/directories it is supposed to do, but it actively removes existing files — all happening inside the root tree on which rpm operates. The removal is not completely random, e.g., my kiwi images often lack “/var/lib/empty” and “/etc/alternatives”.

I have no idea where I might look for the cause of this. Did anyone of you observe something similar? If yes I’d be highly interested in details. Do you have an idea what I could do to further round up the cause? I’d like to reopen my bug report but I would prefer to return with something which can be reproduced by the maintainers.

– Yarny*

So this is very likely a VirtualBox problem. I say that because we know openSUSE has no such rpm installation problem normally. I see that VirtualBox is up to version 4.1.20 you can install outside of YaST and perhaps its is time to download the latest version into your $HOME/Downloads folder, open your file manager, right click on the new file and select Open With -->** Install/Remove Software** and replace the version you installed from the YaST Repository. You would also need to get the latest additions and install them into each VM as its first run. For more info on this have a look at this Article on the subject here: http://forums.opensuse.org/content/59-how-install-virtualbox-opensuse-11-4-12-1-tumbleweed.html

Thank You,

Maybe you could try to install openSUSE in kvm virtual machines and see if you experience the same issues.

Hi,

thanks for your thoughts and suggestions so far!

I just did this: The bug also occurs inside qemu-kvm! I tested with openSUSE 12.1 as guest, it stopped after ~3800 iterations. So it’s not a VirtualBox problem, but it seems to happen inside … err … any kind of virtualization environment. Thanks, please_try_again, I think this is quite a usefull observation.

With the observation above, do you think this would still be helpful? I’m asking because testing with a not-from-oss-repo version of VirtualBox would be quite involved (I cannot replace my current version from my production system easily, so I would do the testing from a live medium instead).

I’m stuck again. It seemingly does not depend on the virtualization software, but it requires virtualization. I remember doing this before, but I’m running my test script on my machine (without virtualization) again, just to be sure this bug depends on virtualization at all. It just passed 12000 iterations. Where should I look for cause of such a bug?

Thanks! – Yarny

I made a copy of your script and will test it in VirtualBox 4.1.20 to see if it makes any difference this weekend.

Thank You,

Wow, thanks already! Some observations which just came into my mind and with might be helpful:

  • I couldn’t reproduce it when I booted from SUSE’s live CD (inside VBox), no matter if the rpm-root is on a VBox harddisk or in ram. The bug occurs on a standard installation (minimal text) from SUSE’s setup DVD. My script uses the filesystem-rpm file from the DVD since the DVD is already connected to the machine after the setup, but the bug also occurs when the filesystem-rpm is on the virtual harddisk.
  • During some test runs, the bug did not appear for 10000 iterations. I stopped the machine, switched back to a snapshot just after the suse installation, and quickly restarted the script. Then the bug appeared very quickly (~2000 it). After I observed this several times, I started practicing the following: Whenever I switched on my host for some testing, it started the script, then after ~100 iterations quickly stop, back to snapshot, start again. I have the impression that this increased the rate of bug occurence, but I’m not sure…

Thanks again for your offer, I really appreciate this!
– Yarny

P.S.: One day I thought it might depend on the speed of the underlying hard disk (maybe a race-condition in the kernel?). I tested the script on a network block device (not in VBox) to artifically slow down disk access times, but couldn’t reproduce it either.

I played a bit with the caching options of qemu-kvm. Precisely, I’m using the command line

qemu-kvm -smp 4 -drive file=suse121-A.qcow2,media=disk,cache=$CACHE -drive file=../openSUSE-12.1-DVD-x86_64.iso,media=cdrom -m 768 -net nic -net user,hostfwd=tcp::50000-:50000 -monitor stdio -S -enable-kvm -k en-us -vnc :0

I couldn’t reproduce the bug with CACHE=none, but I could with CACHE=writeback.

More interesting: With CACHE=unsafe the find command of my script said

find: `./media': No such file or directory
find: `./mnt': No such file or directory
find: `./sys': No such file or directory
find: `./home': No such file or directory

and with CACHE=writethrough I received 20 lines like that with many different directories.

It seems that the directories mysteriously vanish after rpm exited, while find is running, such that find sees the directories but some microseconds later when it tries to parse these directories, they’re gone. My conclusion is that rpm is innocent and there is a problem in the kernel. But I’m still puzzled by why this bug only appears in virtualized environments.

I also tried to reproduce the bug on an old i686 laptop, inside VirtualBox, but failed.

I revived my bug report. – Yarny