Leap 15 install on RAID hangs

Hello!
It has happened multiple times now (maybe I should say: every time since Leap 15): I start the installation, create a 500MB RAID1 with EXT4 for /boot and a 60GB RAID10 with XFS for / (layout o2, chunk size 1M), select the “Server” system role, add some packages (about 600MB download, 2.5GB installed size), and let it go.
After some time (2-5 minutes, not always the same), the install stalls (well, halts, never to continue), and the RAID10 resync also stalls and never moves further.
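
For reference, what I set up in the partitioner should be roughly equivalent to creating the arrays by hand like this (just a sketch; the sdX partition names are placeholders, and the installer may pick different defaults, e.g. the metadata version):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1    # 500MB RAID1 for /boot
mdadm --create /dev/md1 --level=10 --layout=o2 --chunk=1024 --raid-devices=2 /dev/sda2 /dev/sdb2    # 60GB RAID10 for /, offset layout, 1M chunk
mkfs.ext4 /dev/md0    # /boot
mkfs.xfs /dev/md1     # /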

On the 4th virtual console, there are messages about “task md1_resync blocked for more than 480 seconds” and various “kworker/xxx blocked for more than 480 seconds”. The disks are all new and checked, with no bad sectors; the machine is a new Dell PowerEdge T30 (the same thing also happened on a random old PC with two good disks, and with SSDs too).

After a reboot, if I let the RAID finish syncing, I can complete the install using “current partitions” and just mkfs-ing them again. If I create the RAIDs from scratch, same problem. It looks like some race condition between writing to the RAID and its first-time sync (and writing during the initial sync is officially supported, isn’t it?).
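
(For anyone trying the same workaround: after the reboot I just let the resync run to completion before restarting the installer, e.g.

cat /proc/mdstat        # shows the resync progress
mdadm --wait /dev/md1   # blocks until any resync/recovery on md1 has finished

with md1 being the RAID10 device.)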

I have been doing this the same way for years, and this kind of problem has never happened before.

Has anybody seen such behavior?

Best regards,
Sinisa

You might consider creating your RAID 10 array *after* you’ve completed your installation…

To my eye, this SLES documentation is completely applicable to openSUSE (including Leap 15):

https://www.suse.com/documentation/sles-12/singlehtml/stor_admin/stor_admin.html#cha.raid10

TSU

Not really a solution: that way I cannot have the root filesystem on RAID10 (and I really want that). I am already doing that for the other filesystems (/home, /data, …).

It seems to me that some race condition occurs between the RAID10 sync and (XFS?) writes to the same RAID volume, but I am not enough of a developer to pinpoint the problem.

My workaround for now is to create the md devices for /boot and / in advance (the same mdadm commands as above, from a VT shell), let them sync, then start the install…

I’m setting up a test machine to test different filesystems (EXT4, Btrfs) and RAID layouts, and will get back with results.

Sinisa

So it happened again, on a new machine: AMD Ryzen, 32GB RAM, 2x WD Red 2TB (I know, not “server” disks, but good enough for testing).
Started a “Net” install, created a 500MB RAID1 with EXT4 for /boot and a 40GB RAID10 (layout o2) with XFS for /. When the installation was at 9%, it stopped. Switched to VT-2; /proc/mdstat said the sync was at 21.8% and not moving any further…

15 minutes later, dmesg says:
[ 1463.260482] INFO: task kworker/2:1:55 blocked for more than 480 seconds.
[ 1463.260485] Not tainted 4.12.14-lp150.11-default #1
[ 1463.260486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1463.260488] kworker/2:1 D 0 55 2 0x00000000
[ 1463.260534] Workqueue: xfs-eofblocks/md1 xfs_eofblocks_worker [xfs]
[ 1463.260536] Call Trace:
[ 1463.260546] ? __schedule+0x23f/0x870
[ 1463.260549] schedule+0x28/0x80
[ 1463.260552] rwsem_down_write_failed+0x153/0x320
[ 1463.260594] ? xlog_grant_head_check+0x42/0xd0 [xfs]
[ 1463.260599] ? call_rwsem_down_write_failed+0x13/0x20
[ 1463.260601] call_rwsem_down_write_failed+0x13/0x20
[ 1463.260605] down_write+0x20/0x30
[ 1463.260641] xfs_free_eofblocks+0x11a/0x1c0 [xfs]
[ 1463.260678] xfs_inode_free_eofblocks+0x179/0x1b0 [xfs]
[ 1463.260713] ? xfs_inode_ag_walk_grab+0x5f/0x90 [xfs]
[ 1463.260744] xfs_inode_ag_walk.isra.14+0x191/0x420 [xfs]
[ 1463.260776] ? __xfs_inode_clear_eofblocks_tag+0x120/0x120 [xfs]
[ 1463.260781] ? load_balance+0x13c/0x920
[ 1463.260785] ? sched_clock+0x5/0x10
[ 1463.260816] ? __xfs_inode_clear_eofblocks_tag+0x120/0x120 [xfs]
[ 1463.260819] ? radix_tree_gang_lookup_tag+0xc4/0x130
[ 1463.260849] ? __xfs_inode_clear_eofblocks_tag+0x120/0x120 [xfs]
[ 1463.260879] xfs_inode_ag_iterator_tag+0x73/0xb0 [xfs]
[ 1463.260910] xfs_eofblocks_worker+0x29/0x40 [xfs]
[ 1463.260915] process_one_work+0x1da/0x3f0
[ 1463.260919] worker_thread+0x2b/0x3f0
[ 1463.260922] ? process_one_work+0x3f0/0x3f0
[ 1463.260925] kthread+0x11a/0x130
[ 1463.260928] ? kthread_create_on_node+0x40/0x40
[ 1463.260930] ret_from_fork+0x22/0x40
[ 1463.260934] INFO: task kworker/0:2:118 blocked for more than 480 seconds.
[ 1463.260936] Not tainted 4.12.14-lp150.11-default #1
[ 1463.260936] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1463.260937] kworker/0:2 D 0 118 2 0x00000000
[ 1463.260947] Workqueue: md md_submit_flush_data [md_mod]
[ 1463.260948] Call Trace:
[ 1463.260952] ? __schedule+0x23f/0x870
[ 1463.260955] schedule+0x28/0x80
[ 1463.260959] wait_barrier+0x11c/0x170 [raid10]
[ 1463.260963] ? wait_woken+0x80/0x80
[ 1463.260966] raid10_write_request+0x178/0x910 [raid10]
[ 1463.260969] ? wait_woken+0x80/0x80
[ 1463.260973] ? mempool_alloc+0x55/0x160
[ 1463.260976] raid10_make_request+0xbc/0x130 [raid10]
[ 1463.260979] ? wait_woken+0x80/0x80
[ 1463.260985] md_make_request+0x93/0x230 [md_mod]
[ 1463.260990] generic_make_request+0x101/0x2e0
[ 1463.260994] ? raid10_write_request+0x6cc/0x910 [raid10]
[ 1463.260997] raid10_write_request+0x6cc/0x910 [raid10]
[ 1463.260999] ? wait_woken+0x80/0x80
[ 1463.261002] ? mempool_alloc+0x55/0x160
[ 1463.261004] ? sched_clock+0x5/0x10
[ 1463.261007] ? sched_clock_cpu+0xc/0xb0
[ 1463.261010] ? pick_next_task_fair+0x494/0x530
[ 1463.261013] raid10_make_request+0xbc/0x130 [raid10]
[ 1463.261019] md_submit_flush_data+0x36/0x70 [md_mod]
[ 1463.261022] process_one_work+0x1da/0x3f0
[ 1463.261026] worker_thread+0x2b/0x3f0
[ 1463.261029] ? process_one_work+0x3f0/0x3f0
[ 1463.261031] kthread+0x11a/0x130
[ 1463.261034] ? kthread_create_on_node+0x40/0x40
[ 1463.261036] ret_from_fork+0x22/0x40
[ 1463.261062] INFO: task kworker/u32:0:5018 blocked for more than 480 seconds.
[ 1463.261063] Not tainted 4.12.14-lp150.11-default #1
[ 1463.261064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1463.261065] kworker/u32:0 D 0 5018 2 0x00000000
[ 1463.261072] Workqueue: writeback wb_workfn (flush-9:1)
[ 1463.261074] Call Trace:
[ 1463.261077] ? __schedule+0x23f/0x870
[ 1463.261080] schedule+0x28/0x80
[ 1463.261083] wait_barrier+0x11c/0x170 [raid10]
[ 1463.261086] ? wait_woken+0x80/0x80
[ 1463.261089] raid10_write_request+0x178/0x910 [raid10]
[ 1463.261091] ? wait_woken+0x80/0x80
[ 1463.261094] ? mempool_alloc+0x55/0x160
[ 1463.261097] raid10_make_request+0xbc/0x130 [raid10]
[ 1463.261099] ? wait_woken+0x80/0x80
[ 1463.261105] md_make_request+0x93/0x230 [md_mod]
[ 1463.261109] ? pagevec_lookup_tag+0x1d/0x30
[ 1463.261111] ? write_cache_pages+0xdf/0x430
[ 1463.261113] generic_make_request+0x101/0x2e0
[ 1463.261116] ? submit_bio+0x6c/0x140
[ 1463.261118] submit_bio+0x6c/0x140
[ 1463.261152] xfs_submit_ioend+0x70/0x1a0 [xfs]
[ 1463.261186] xfs_vm_writepages+0xaa/0xc0 [xfs]
[ 1463.261189] do_writepages+0x3c/0xd0
[ 1463.261195] ? ata_scsi_security_inout_xlat+0x140/0x140
[ 1463.261198] ? ata_scsi_translate+0xce/0x1a0
[ 1463.261200] ? __writeback_single_inode+0x3d/0x320
[ 1463.261202] __writeback_single_inode+0x3d/0x320
[ 1463.261205] ? fprop_reflect_period_percpu.isra.5+0x70/0xb0
[ 1463.261208] writeback_sb_inodes+0x18a/0x430
[ 1463.261211] __writeback_inodes_wb+0x5d/0xb0
[ 1463.261214] wb_writeback+0x243/0x2d0
[ 1463.261217] ? wb_workfn+0x16d/0x3f0
[ 1463.261219] wb_workfn+0x16d/0x3f0
[ 1463.261222] process_one_work+0x1da/0x3f0
[ 1463.261226] worker_thread+0x2b/0x3f0
[ 1463.261229] ? process_one_work+0x3f0/0x3f0
[ 1463.261231] kthread+0x11a/0x130
[ 1463.261233] ? kthread_create_on_node+0x40/0x40
[ 1463.261237] ? do_syscall_64+0x7b/0x140
[ 1463.261240] ? SyS_exit_group+0x10/0x10
[ 1463.261242] ret_from_fork+0x22/0x40
[ 1463.261245] INFO: task md1_resync:5145 blocked for more than 480 seconds.
[ 1463.261247] Not tainted 4.12.14-lp150.11-default #1
[ 1463.261247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1463.261248] md1_resync D 0 5145 2 0x00000000
[ 1463.261251] Call Trace:
[ 1463.261254] ? __schedule+0x23f/0x870
[ 1463.261256] ? wait_woken+0x80/0x80
[ 1463.261258] schedule+0x28/0x80
[ 1463.261262] raise_barrier+0x83/0x160 [raid10]
[ 1463.261264] ? wait_woken+0x80/0x80
[ 1463.261268] raid10_sync_request+0x1ea/0x1d50 [raid10]
[ 1463.261275] ? is_mddev_idle+0xc9/0x109 [md_mod]
[ 1463.261282] ? is_mddev_idle+0xa4/0x109 [md_mod]
[ 1463.261288] md_do_sync+0x882/0xe90 [md_mod]
[ 1463.261292] ? cpumask_next_and+0x26/0x40
[ 1463.261294] ? wait_woken+0x80/0x80
[ 1463.261301] ? find_pers+0x70/0x70 [md_mod]
[ 1463.261306] ? md_thread+0x10d/0x140 [md_mod]
[ 1463.261312] md_thread+0x10d/0x140 [md_mod]
[ 1463.261315] kthread+0x11a/0x130
[ 1463.261317] ? kthread_create_on_node+0x40/0x40
[ 1463.261320] ? do_syscall_64+0x7b/0x140
[ 1463.261323] ? SyS_exit_group+0x10/0x10
[ 1463.261325] ret_from_fork+0x22/0x40
[ 1463.261331] INFO: task xfsaild/md1:5167 blocked for more than 480 seconds.
[ 1463.261332] Not tainted 4.12.14-lp150.11-default #1
[ 1463.261332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1463.261333] xfsaild/md1 D 0 5167 2 0x00000000
[ 1463.261335] Call Trace:
[ 1463.261339] ? __schedule+0x23f/0x870
[ 1463.261341] schedule+0x28/0x80
[ 1463.261345] wait_barrier+0x11c/0x170 [raid10]
[ 1463.261347] ? wait_woken+0x80/0x80
[ 1463.261350] raid10_write_request+0x178/0x910 [raid10]
[ 1463.261352] ? wait_woken+0x80/0x80
[ 1463.261355] ? mempool_alloc+0x55/0x160
[ 1463.261358] raid10_make_request+0xbc/0x130 [raid10]
[ 1463.261360] ? wait_woken+0x80/0x80
[ 1463.261366] md_make_request+0x93/0x230 [md_mod]
[ 1463.261371] ? crc32c_pcl_intel_update+0x93/0xa0 [crc32c_intel]
[ 1463.261374] generic_make_request+0x101/0x2e0
[ 1463.261377] ? submit_bio+0x6c/0x140
[ 1463.261378] submit_bio+0x6c/0x140
[ 1463.261412] _xfs_buf_ioapply+0x2fa/0x4a0 [xfs]
[ 1463.261445] ? xfs_buf_delwri_submit_buffers+0xe8/0x260 [xfs]
[ 1463.261476] ? xfs_buf_submit+0x61/0x210 [xfs]
[ 1463.261506] xfs_buf_submit+0x61/0x210 [xfs]
[ 1463.261537] xfs_buf_delwri_submit_buffers+0xe8/0x260 [xfs]
[ 1463.261579] ? xfsaild+0x343/0x710 [xfs]
[ 1463.261619] ? xfsaild+0x343/0x710 [xfs]
[ 1463.261656] xfsaild+0x343/0x710 [xfs]
[ 1463.261693] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
[ 1463.261696] ? kthread+0x11a/0x130
[ 1463.261698] kthread+0x11a/0x130
[ 1463.261700] ? kthread_create_on_node+0x40/0x40
[ 1463.261703] ? do_syscall_64+0x7b/0x140
[ 1463.261705] ? SyS_exit_group+0x10/0x10
[ 1463.261707] ret_from_fork+0x22/0x40
[ 1463.261712] INFO: task rpm:5337 blocked for more than 480 seconds.
[ 1463.261713] Not tainted 4.12.14-lp150.11-default #1
[ 1463.261713] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1463.261714] rpm D 0 5337 3865 0x00000000
[ 1463.261716] Call Trace:
[ 1463.261720] ? __schedule+0x23f/0x870
[ 1463.261723] schedule+0x28/0x80
[ 1463.261758] _xfs_log_force_lsn+0x1d5/0x310 [xfs]
[ 1463.261761] ? file_check_and_advance_wb_err+0x2c/0xc0
[ 1463.261764] ? wake_up_q+0x70/0x70
[ 1463.261793] xfs_file_fsync+0xda/0x1a0 [xfs]
[ 1463.261796] do_fsync+0x38/0x60
[ 1463.261799] SyS_fdatasync+0xf/0x20
[ 1463.261801] do_syscall_64+0x7b/0x140
[ 1463.261804] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 1463.261807] RIP: 0033:0x7fb6ebfdf1a4
[ 1463.261808] RSP: 002b:00007ffd2d96bd98 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 1463.261810] RAX: ffffffffffffffda RBX: 0000000000bf4350 RCX: 00007fb6ebfdf1a4
[ 1463.261812] RDX: 0000000000be43a0 RSI: 0000000000bf4350 RDI: 0000000000000004
[ 1463.261813] RBP: 0000000000b5d170 R08: 0000000000c853b0 R09: 00007fb6ecd07b60
[ 1463.261814] R10: 0000000000c83758 R11: 0000000000000246 R12: 0000000000000000
[ 1463.261815] R13: 0000000000000064 R14: 0000000000010830 R15: 0000000000c83708

Just did a fresh LEAP 42.3 install the same way: same PC, Net install, created the same partitions (deleted everything first), same RAID config, and it went through as expected.

Will try 15.0 again a bit later…

Just tested with the latest LEAP 15.1 Alpha: Dell PowerEdge T30, 2x 1TB HDDs in AHCI mode.

Ran the install from the NET CD, created two partitions on both disks (first 500MB, second 60GB), created a RAID1 mirror over the 500MB partitions for /boot and a RAID10 over the 60GB partitions for / (with 1M chunk size and o2 layout). Continued to the Server selection and started the install.

Everything was working OK until 22%, when it stopped. /proc/mdstat said the resync was at 32.7%. Waited 15 minutes, then rebooted…

Tried LEAP 15.0: deleted all partitions, recreated everything the same from scratch, and started the install, which stopped at 18% (RAID resync at 23.5%).

Back to 42.3: deleted all partitions, recreated everything the same from scratch, started the install - it finished without problems.

I’d say there is definitely something wrong with LEAP 15+ …

Best regards,
Sinisa

Hello everyone,

I feel like I’m talking to myself, but here it is again:

New setup with two 120GB SSDs: Net install of LEAP 15.0, Server selection, 500MB /boot on RAID1 and 110GB / on RAID10, o2 layout, XFS.

Everything was going smoothly until 60%, when it stopped. Back at VT-2, cat /proc/mdstat said the resync was at 89.9% and never moving further.

There were no strange messages in dmesg, nor in the other VTs.

Next, I tried everything the same, only I paused the package installation by clicking Abort, waited until the initial sync was over, then clicked “Continue installation”, and everything went OK.
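
An alternative workaround I have not tried yet would be to freeze the resync from a VT shell while the packages install, using the md sysfs interface (md1 is assumed to be the RAID10 device here, and I have not verified that the resync resumes cleanly afterwards):

echo frozen > /sys/block/md1/md/sync_action   # pause the initial resync
# ... let the package installation finish ...
echo idle > /sys/block/md1/md/sync_action     # unfreeze; md should restart the resync from its checkpoint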

Now, with 15.1 in alpha and showing the same issue, I’d like to see this fixed before release, since I don’t think anything can be done for 15.0.

Best regards,
Sinisa

If you want things fixed, report it on Bugzilla; everyone here is just a user.

Well, the forum title says “Technical Help”, so I thought that I might get some technical help here…

Will try Bugzilla.
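
For the report, it might help to attach the installer logs collected from a VT shell (save_y2logs is part of the openSUSE installation system; the target path is just an example):

save_y2logs /tmp/y2logs.tgz   # collects the YaST installer logs into one archive
dmesg > /tmp/dmesg.txt        # plus the hung-task traces from the kernel log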

We help if we can, but in general we do not fix bugs.

Complex RAID environments can be tricky; in any case, you really need to know the ins and outs.