Hard drive failing with Btrfs error correction running. How to minimise damage?

My desktop machine crashed last night. I left it running an update, and the next day I had an amber warning light on the panel and my network was down. Looking at the network from another machine on the LAN, I was told there was an address conflict.
I decided to reboot, but the boot stalled at what the boot screen described as rebuilding the Btrfs filesystem, or similar wording. It still had not corrected the errors after 9 minutes, so I chickened out and turned the machine off while I had a think. It appears the network problem was a red herring caused by DHCP being messed up during boot, as the network was fine once the failing machine was turned off.

It is clear that a hard drive is in trouble, but my problem is that my notes from installation were lost in a house move. I believe there are possibly two RAID arrays set up, but in hardware, so they are difficult to inspect. As I recall, the first drives were set up with the root partition on Btrfs and the home partition on XFS. There was also a very large RAID volume just for data. I am sure the data partition is OK and fully backed up. What I am not so confident about is the home partition, and the backups I have on my NAS seem too old, so I would like to try and recover it if possible.

My options appear to be:

1. Use GParted to interrogate the partitions as far as possible, perhaps with help here, to see what we have in the box.
2. Look for warning lights on the drives themselves and, if only one is winking, clone it and see if it can be fixed on boot. (I have not seen any RAID warnings anywhere!)
3. Reboot the machine, wait and pray.

I would be grateful for some guidance, please, as I am sure more damage will be done every time I try to boot.

Get a live or rescue system and follow SDB:BTRFS - openSUSE Wiki
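From the live system, take a read-only look first before attempting any repair; a rough sketch, with the device name as an example only, so adjust it to your layout:

    # Identify the Btrfs filesystem and its member devices
    sudo btrfs filesystem show

    # Read-only consistency check; writes nothing to the disk
    sudo btrfs check --readonly /dev/sda2

    # If it mounts read-only, copy your data off before anything else
    sudo mount -o ro /dev/sda2 /mnt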

Hi, and very many thanks for the link. Before I received your help I spent some time trying to research what I had done, and although my MS notes are lost, I had some history on websites and was able to call up the RAID configuration GUI. Quite difficult to follow, but it appears I have all eight 1 TB drives set up as a RAID 5 array, and one has failed.

In consequence the RAID has, I assume, been working away in the background, and the system still allowed me to boot. At that point Btrfs must have had a fit chasing a moving system underneath it. So there is some fault tolerance built into Btrfs and also into RAID 5. What should I do next?
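Meanwhile, from the rescue system I can at least ask Btrfs what it has recorded; if I have read the man page correctly, something like this shows the per-device error counters (the mount point is an example):

    # Per-device read/write/corruption error counters for a mounted filesystem
    sudo btrfs device stats /mnt

    # A scrub re-reads everything and verifies checksums; probably best
    # left until the degraded RAID underneath is healthy again
    sudo btrfs scrub start /mnt
    sudo btrfs scrub status /mnt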

I have now identified the dead drive but am struggling to find an exact replacement. Just a thought: I have several 2 TB drives of similar spec available. How tolerant would the RAID controller be of a replacement which is larger but only uses what is required? Any hardware specialists out there?

Meanwhile I shall search for a replacement of the same type and size, but I don't hold out much hope, so I need to work out a plan for what comes next. Plenty of reading through the weekend, I fear.

Hi
As long as the disk has the same sector size, e.g. 512 bytes logical/physical, create the partition with the same sector count as the smaller drive you're matching and it should be fine…
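You can compare the geometry of the old and new drives from a terminal; a quick sketch, with device names as examples only:

    # Logical sector size, physical block size, total size in 512-byte sectors
    sudo blockdev --getss --getpbsz --getsz /dev/sdb   # old/matching drive
    sudo blockdev --getss --getpbsz --getsz /dev/sdc   # candidate 2 TB drive

    # smartctl also reports sector sizes in its identity section
    sudo smartctl -i /dev/sdc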

A hindsight comment… Remember RAID does not equal BACKUP/RESTORE… so you should have backup disk(s) the same size or greater than your RAID configuration… For me, when creating a RAID setup, I normally purchase a spare disk to have around :wink:

Hi Malcolm,
I think the spare went to family some time ago! Fortunately I have found a new identical drive online at much less than the original price, and it should be here within a couple of days. Meanwhile I will leave the machine alone. When the disk arrives and I put it in, do I then issue a "rebuild" instruction, or will it just work? I am terrified I will issue the wrong command and wipe the lot. So many choices/options in the RAID controller!

Hi
I would check the array with mdadm (assuming you have used software RAID?), or look at using the YaST partitioner to remove the degraded drive and add the new one…
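For reference, if it were software RAID, the checks and the swap would look roughly like this (the array and partition names are examples only):

    # Quick overview of all md arrays and their sync state
    cat /proc/mdstat

    # Detailed state of one array, including failed/degraded members
    sudo mdadm --detail /dev/md0

    # Mark the dead member failed, remove it, then add the replacement
    sudo mdadm --manage /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
    sudo mdadm --manage /dev/md0 --add /dev/sdc1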

Hi Malcolm,
No software RAID here. Using IBM ServeRAID hardware.

Hi
Then you likely need to perform the removal/replacement at the RAID BIOS level, or does it support hot swap?

Hi Malcolm,
All singing and dancing, and I recall it does hot swap, so once I get the drive I will plug it in and try. The controller can be driven by a CLI or a small GUI which runs from the BIOS, but I am spoiled for choice with options and am not confident with the terminology. The term "rebuild" could mean rebuilding the array, but does it keep the file system intact? In my distant memory a rebuild meant you lost the existing array, so I will not try that. I was hoping for "repair"!
More when I get the drive.
Regards,
Budge

Hi
I would suggest going in and removing the drive from the array, shutting down, replacing the drive, then bringing it up and re-attaching it to the array. Read the IBM documentation for the controller, but I would expect you need to format and rebuild, all via the controller, before starting the system.
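If your ServeRAID model happens to be one of the Adaptec-based ones, there may also be a view from the running system via Adaptec's arcconf tool; that is an assumption about your hardware, but the usual status queries are:

    # Status of controller 1: everything, logical drives, physical drives
    sudo arcconf getconfig 1 AL
    sudo arcconf getconfig 1 LD
    sudo arcconf getconfig 1 PD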

Hi Malcolm,
I had the failed drive out already and was in BIOS -> RAID configuration, where I left the machine while the drive was in the post. It arrived this evening (yes, on a Sunday, thanks to Amazon Prime, with no postage charge!!!). I swapped the drive into the caddy, put it back in, and the magic took over. The warning light went out after a few seconds, then the drive lights did a dance. I exited from the BIOS and the machine restarted. After a wait of about 5 minutes while Btrfs played catch-up, I went back to the machine and it was up and running, except that now I cannot get back online.
I am using NetworkManager with two built-in NICs. No wireless device, and only one of the two LAN ports is connected, but when I right-click on the Networks icon there are two wired connections shown:
Wired connection 1 (enp0s26f1u2) and
Wired connection 1 (eno1)
Strange, but when I selected the other one (I think eno1), it connected. What was going on there? Should I post a new question in the networking forum?
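For anyone following along, this is how I have been listing what NetworkManager sees, as far as I understand nmcli:

    # Which devices exist and which connection profile each one is using
    nmcli device status

    # All saved connection profiles, with the active ones marked
    nmcli connection show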
Will let you know how the RAID is doing tomorrow, but I guess it will be working on rebuilding the hot spare.
Thanks for the hand-holding while I worried. Now I will revisit my backup procedures and set up warning email alerts.
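For the email alerts I am looking at smartmontools; as I understand it, a single line in /etc/smartd.conf covers the basics (the address is a placeholder, mail delivery must already be working, and drives hidden behind a hardware RAID controller may need an extra -d option):

    # /etc/smartd.conf: monitor all visible drives and email on failure
    # indications; -M test sends one test mail at startup to prove delivery
    DEVICESCAN -a -m admin@example.com -M test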
Regards,
Budge.