XFS loses data

Hi All,

I had an installation of openSUSE Leap 42.1. The root partition was Btrfs, while the partition hosting the /home directory was XFS. I used the openSUSE default partitioning suggestions. The system is/was a clean install on a Fujitsu server with an LSI RAID adapter, run as RAID1. The system ran flawlessly with no downtime for about a year. I decided that the box needed more memory, so I added a stick of RAM. I booted up the otherwise perfectly running system and noticed that /var/run and its subdirectories had been wiped clean, and that random directories were also missing under /home. For example, in user A's directory (/home/usera) there might be a set of folders A1, A2, A3, A4 and so forth. The youngest of the folders would be missing: upon reboot, I would have A's home directory, but subdirectories A3 and A4 would be gone.

I dismissed this occurrence as some sort of foolishness on my part. Perhaps I had not waited long enough for the system to shut down. Yesterday, I decided to upgrade the hard drives from 500GB to 1TB. I backed up most of the information, shut the box down gracefully and waited for the drives to spin down. I installed the new drives, created a new virtual disk with the two new 1TB drives and installed Leap 42.3. Everything installed and the box came up. I powered the box down and added the original two 500GB drives back as a secondary and separate RAID1 virtual disk. I powered the box up, and openSUSE booted.

  1. The new (old) 500GB virtual disk did not mount automatically, which was surprising.
  2. However, what was more surprising and worrisome was that upon manually mounting the partition that held the (old) /home directories, I observed the same phenomenon: random subdirectories had disappeared on the 500GB RAID array. Again, age seemed to matter, as "older" directories were more likely to have survived than newer ones. Paradoxically, the hidden directories (.*) all seem to have survived.

While the box runs on battery backup, I am concerned that #2 will come back to haunt me. I will want to upgrade again at some point, and I am worried that when I reboot, I will get the same loss of data. I assume there is something misconfigured on my end; otherwise, the internet forums would be awash in stories like this. Does anyone have any similar experience? How can I diagnose this further?

Thank you,

-Greg

IMO
You need to first evaluate the health of your disk hardware.
Install smartmontools, take a reading, do some stuff over a day, and then take another reading a day later.
Look for reported errors and especially any increase in errors.
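
Something like this, as a rough sketch (assuming the first disk shows up as /dev/sda; adjust for your setup, and note that disks behind some LSI RAID controllers are only visible to smartctl with the -d megaraid,N option):

  sudo zypper install smartmontools    # provides smartctl and smartd
  sudo smartctl -a /dev/sda            # full health report, attributes, error log
  sudo smartctl -t short /dev/sda      # queue a short self-test
  sudo smartctl -l selftest /dev/sda   # read the self-test results afterwards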

When you are certain you don’t have hardware problems,
then run fsck on the suspect partition(s).

I haven't had to do this on XFS, but according to the man page, the regular fsck command calls fsck.xfs as needed:
https://linux.die.net/man/8/fsck.xfs
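
Note that fsck.xfs is actually a no-op (XFS replays its journal at mount time), so the real repair tool is xfs_repair. A rough sketch, assuming the suspect partition shows up as /dev/sdb1 (check lsblk for the real device name); it must be unmounted first:

  sudo umount /home
  sudo xfs_repair -n /dev/sdb1   # -n = no-modify mode, just report problems
  sudo xfs_repair /dev/sdb1      # actual repair, after reviewing the report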

TSU

Thanks for the suggestion. One of the drives in the old RAID did have one UNC error, but that's it. I tried to repair the partition, but nothing more came back. While it could be that one lone UNC error spoiled the punch, the fact that it seems to be a systematic loss of data is puzzling. We are not talking about a random collection of corrupted files, but the loss of entire directories. It is also not as if both drives were affected, just one. Here is the SMART output:

SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It “wraps” after 49.710 days.

Error 1 occurred at disk power-on lifetime: 45163 hours (1881 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


40 51 00 74 f1 05 00 Error: UNC at LBA = 0x0005f174 = 389492

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


60 00 08 70 f1 05 40 00 16d+15:58:57.841 READ FPDMA QUEUED
60 00 08 10 a7 06 40 00 16d+15:58:57.841 READ FPDMA QUEUED
60 00 08 60 f1 05 40 00 16d+15:58:57.841 READ FPDMA QUEUED
60 00 08 20 a7 06 40 00 16d+15:58:57.840 READ FPDMA QUEUED
60 00 08 28 a7 06 40 00 16d+15:58:57.839 READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Short offline Completed without error 00% 51099 -

I can live with the loss of the directories; I am much, much more interested in preventing this from happening again.

Thank you,

-Greg

AFAIK the following methods are reliable and maintained:

  1. Disks: SMART.
  2. Other hardware: regular inspection of the systemd journals (see the example after this list).
  3. KDE: "System Monitor" – there are some additional sensors which can be added to a new tab.
  4. GNOME: don’t know.
  5. HP and Dell offer Linux monitoring applications for their hardware.
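
For item 2, a rough sketch of what I mean (assuming systemd's journalctl; -p filters by priority, -b selects a boot):

  journalctl -p err -b          # errors and worse from the current boot
  journalctl -p warning -b -1   # warnings and worse from the previous boot
  journalctl -k -b              # kernel messages from the current boot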

Occasionally, KDE Plasma widgets appear which may be useful, but they're often short-lived …