Hello!
It has happened multiple times now (maybe I should say: every time since Leap 15): I start the installation, create a 500MB RAID1 with EXT4 for /boot and a 60GB RAID10 with XFS for / (with layout o2, chunk size 1M), select the “Server” system role, add some packages (about 600MB download, 2.5GB installed size) and let it go.
After some time (2-5 minutes, not always the same), the install stalls (well, halts, never to continue) and the RAID10 resync also stalls, never moving any further.
On the 4th virtual console there are messages like “task md1_resync blocked for more than 480 seconds” and various “kworker/xxx blocked for more than 480 seconds”. The disks are all new and checked, with no bad sectors, and the machine is a new Dell PowerEdge T30 (it also happened on a random old PC with two good disks, and with SSDs).
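For reference, those reports come from the kernel's hung-task watchdog, and can be pulled out of the kernel log like this (a sketch; the exact task names will differ per system, and the 480-second threshold is the kernel.hung_task_timeout_secs sysctl):

```shell
# Show kernel hung-task reports (e.g. md1_resync, kworker threads)
dmesg | grep -E 'task .* blocked for more than [0-9]+ seconds'

# The threshold that produces the "480 seconds" number is tunable:
cat /proc/sys/kernel/hung_task_timeout_secs
```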
After a reboot, if I let the RAID finish its sync, I can complete the install using “current partitions” and just mkfs-ing them again. If I create the RAIDs from scratch, the same problem occurs. It looks like some race condition between writing to the RAID and its initial sync (should I mention that writing during the initial sync is officially supported?).
I have been doing this for years the same way, and this kind of problem never happened before.
Not really a solution: that way I cannot have the rootfs on RAID10 (and I really want that). I am already doing that for the other filesystems (/home, /data, …).
It seems to me there is some race condition between the RAID10 sync and (XFS?) writes to the same RAID volume, but I am not enough of a developer to pinpoint the problem.
My workaround for now is to create the md's for /boot and / in advance, then start the install…
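In concrete terms, the workaround looks roughly like this from a shell before starting the installer (a sketch only: the device names /dev/sdaX and /dev/sdbX are assumptions, and these commands destroy any data on those partitions):

```shell
# Create the arrays manually before starting the installer (example device names!)
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/sda1 /dev/sdb1                          # 500MB RAID1 for /boot
mdadm --create /dev/md1 --level=10 --layout=o2 --chunk=1M \
      --raid-devices=2 /dev/sda2 /dev/sdb2         # RAID10 for /

# Block until the initial resync has finished, then run the install
mdadm --wait /dev/md1
```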
I’m setting up a test machine to try different filesystems (EXT4, Btrfs) and RAID layouts; I will get back with results.
So it happened again on a new machine: AMD Ryzen, 32GB RAM, 2x WD RED 2TB (I know, not “server” disks, but good enough for testing).
Started a “Net” install, created a 500MB RAID1 with EXT4 for /boot and a 40GB RAID10 (with layout o2) with XFS for /. When the installation reached 9%, it stopped. Switched to VC2; /proc/mdstat says the sync is at 21.8% and not moving any further…
Just did a fresh LEAP 42.3 install the same way: same PC, Net install, created the same partitions (deleted everything first), same RAID config, and it passed just as expected.
Just tested with the latest LEAP 15.1 Alpha: Dell PowerEdge T30, two 1TB HDDs in AHCI mode.
Ran the install from the NET CD, created two partitions on both disks (first 500MB, second 60GB), then a RAID1 mirror over the 500MB partitions for /boot and a RAID10 over the 60GB partitions for / (with 1MB chunk size and o2 layout). Continued to the Server selection and started the install.
Everything was working OK until 22%, when it stopped. /proc/mdstat said the resync was at 32.7%. I waited 15 minutes, then rebooted…
Tried LEAP 15.0: deleted all partitions, recreated everything the same from scratch, and started the install, which stopped at 18% (RAID resync at 23.5%).
Back to 42.3: deleted all partitions, recreated everything the same from scratch, started the install, and it finished without problems.
I’d say there is definitely something wrong with LEAP 15+ …
I feel like I’m talking to myself, but here it is again:
New setup with two 120GB SSDs: Net install of LEAP 15.0, Server selection, 500MB /boot on RAID1 and 110GB / on RAID10 (o2 layout, XFS).
Everything was going smoothly until 60%, when it stopped. Back on VT2, cat /proc/mdstat says the resync is at 89.9% and not moving any further.
There were no strange messages in dmesg, nor in the other VTs.
Next, I tried everything the same, except that I paused the package installation by clicking Abort, waited until the initial sync was over, then clicked “Continue installation”, and everything went OK.
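For anyone who would rather script that wait than watch the progress, something like this works from a shell on VT2 (the awk line is just a convenience I use to pull the percentage out of /proc/mdstat; the md device names are examples):

```shell
# Print the current resync/recovery percentage from /proc/mdstat, if any
awk '/resync|recovery/ { for (i = 1; i <= NF; i++) if ($i ~ /%/) print $i }' /proc/mdstat

# Or simply block until the arrays are in sync before clicking "Continue installation"
mdadm --wait /dev/md0 /dev/md1
```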
Now, with 15.1 being in alpha and showing the same issue, I’d like to see this fixed before release, since I don’t think anything can be done for 15.0.