nbd corrupting ext3 filesystem?

Hello there.

This is my first post. I hope it’s in the correct section; the description seems to indicate it’s the most appropriate.

I’ve been using openSUSE as my main desktop for about 2 years, but I’m still very much a noob. I guess that’s down to a combination of being too busy or lazy to really get stuck into the CLI, and the fact that it mostly just works!

I have a fresh install of 11.1, and have a show-stopping problem:

When trying to copy data from one partition to another:

#rsync -av --progress /mnt/sdb5 /mnt/sdb8

…the output gave multiple examples of the form:

rsync: read errors mapping “<filename>”: Input/output error (5)

About 1 in 10 files gave this message and were not copied. I had the feeling they were mainly large files (100MB plus).
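
To pin down exactly which files are affected, independently of rsync, one option is to read every file end to end and list the ones that fail. This is only a minimal sketch, assuming a POSIX shell; the function name list_unreadable is made up for illustration and is not part of any tool mentioned here.

```shell
#!/bin/sh
# list_unreadable DIR (hypothetical helper): print every regular file
# under DIR that cannot be read from start to finish.
list_unreadable() {
    find "$1" -type f -exec sh -c '
        for f in "$@"; do
            # cat fails with a non-zero status on a read error (e.g. EIO)
            cat -- "$f" > /dev/null 2>&1 || printf "%s\n" "$f"
        done
    ' sh {} +
}
```

Running list_unreadable /mnt/sdb5 > bad-files.txt before and after a repair attempt would show whether the same files keep failing, which would point at fixed locations on the disk rather than at the copy tool.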

Both partitions hold LUKS-encrypted filesystems. I’ve been using LUKS without issues for over a year, and don’t THINK my problems are related to that.

So I unmounted and did this:

#fsck.ext3 /dev/mapper/sdb5

This resulted in numerous:

Error reading block xxxxxx (Attempt to read block from filesystem resulted in a short read ) while doing inode scan. Ignore<y>? yes

Force rewrite? yes
<filesystemlabel>: ***** FILE SYSTEM WAS MODIFIED *****
<filesystemlabel>: 402274/15261696 files (0.4% non-contiguous), 22952676/61030662 blocks

Just to be sure, I repeated:
#fsck.ext3 /dev/mapper/sdb5
e2fsck 1.41.1 (01-Sep-2008)
<filesystemlabel>: clean, 402274/15261696 files, 22952676/61030662 blocks (check in 5 mounts)

Then, I tried the copy again:
#rsync -av --progress /mnt/sdb5 /mnt/sdb8

And got exactly the same errors!

I’ve searched for ages to no avail, but I did find this:
https://bugzilla.redhat.com/show_bug.cgi?id=305301

In which one poster asks:

“Have you shared the underlying block device out over nbd or iscsi or anything?”

I have indeed shared the underlying block device out over nbd; in fact, this is the whole purpose of the box in question.

I then searched for “nbd corrupt filesystem” and similar terms, but am still none the wiser.

So, my questions are: can nbd corrupt a filesystem, and if so, how might I prevent it? Or have I missed something?

TIA & regards.

How old is your disk? Maybe you really do have a failing disk. Have you looked at the disk health with smartctl, the utility from the smartmontools package? With it you can read the drive’s built-in statistics and also start self-tests.

Hi Ken, thanks for replying.

It’s a brand new drive, and has suffered no shocks or maltreatment (at least since it was delivered from eBay!)

smartctl tells me:

SMART overall-health self-assessment test result: PASSED

but with errors as below.

I’ll read up on badblocks, but meanwhile, do you think the drive is toast?

smartctl -t short /dev/sdb5

smartctl 5.39 2008-10-24 22:33 [i686-suse-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-8 by Bruce Allen, smartmontools Home Page (last updated $Date: 2009/01/23 13:25:23 $)

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: “Execute SMART Short self-test routine immediately in off-line mode”.
Drive command “Execute SMART Short self-test routine immediately in off-line mode” successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Sat Feb 21 00:03:37 2009

Use smartctl -X to abort test.

Waited 5 minutes…

smartctl -a /dev/sdb5

smartctl 5.39 2008-10-24 22:33 [i686-suse-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-8 by Bruce Allen, smartmontools Home Page (last updated $Date: 2009/01/23 13:25:23 $)

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green family
Device Model: WDC WD10EACS-65D6B0
Serial Number: WD-WCAU41924733
Firmware Version: 01.01A01
User Capacity: 1,000,204,886,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sat Feb 21 00:11:37 2009 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x85) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (21600) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 248) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 198 198 051 Pre-fail Always - 21595
3 Spin_Up_Time 0x0027 159 153 021 Pre-fail Always - 7008
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 123
5 Reallocated_Sector_Ct 0x0033 150 150 140 Pre-fail Always - 393
7 Seek_Error_Rate 0x002e 100 253 051 Old_age Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1292
10 Spin_Retry_Count 0x0032 100 100 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 051 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 123
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 9
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 123
194 Temperature_Celsius 0x0022 135 103 000 Old_age Always - 15
196 Reallocated_Event_Count 0x0032 001 001 000 Old_age Always - 248
197 Current_Pending_Sector 0x0032 195 192 000 Old_age Always - 872
198 Offline_Uncorrectable 0x0030 196 192 000 Old_age Offline - 770
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 6
200 Multi_Zone_Error_Rate 0x0008 171 164 051 Old_age Offline - 3993

SMART Error Log Version: 1
Warning: ATA error count 18934 inconsistent with error log pointer 1

ATA Error Count: 18934 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It “wraps” after 49.710 days.

Error 18934 occurred at disk power-on lifetime: 1290 hours (53 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
40 51 00 71 05 78 e6 Error: UNC at LBA = 0x06780571 = 108529009

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
c8 00 06 70 05 78 06 00 1d+04:20:07.888 READ DMA
ec 00 00 00 00 00 00 02 1d+04:20:07.879 IDENTIFY DEVICE
ef 03 45 00 00 00 00 02 1d+04:20:07.872 SET FEATURES [Set transfer mode]
ec 00 00 00 00 00 00 02 1d+04:20:07.865 IDENTIFY DEVICE
c8 00 06 70 05 78 06 00 1d+04:20:04.662 READ DMA

Error 18933 occurred at disk power-on lifetime: 1290 hours (53 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
40 51 00 71 05 78 e6 Error: UNC at LBA = 0x06780571 = 108529009

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
c8 00 06 70 05 78 06 00 1d+04:20:04.662 READ DMA
ec 00 00 00 00 00 00 02 1d+04:20:04.654 IDENTIFY DEVICE
ef 03 45 00 00 00 00 02 1d+04:20:04.654 SET FEATURES [Set transfer mode]
ec 00 00 00 00 00 00 02 1d+04:20:04.646 IDENTIFY DEVICE
c8 00 06 70 05 78 06 00 1d+04:20:01.441 READ DMA

Error 18932 occurred at disk power-on lifetime: 1290 hours (53 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
40 51 00 71 05 78 e6 Error: UNC at LBA = 0x06780571 = 108529009

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
c8 00 06 70 05 78 06 00 1d+04:20:01.441 READ DMA
ec 00 00 00 00 00 00 02 1d+04:20:01.432 IDENTIFY DEVICE
ef 03 45 00 00 00 00 02 1d+04:20:01.425 SET FEATURES [Set transfer mode]
ec 00 00 00 00 00 00 02 1d+04:20:01.418 IDENTIFY DEVICE
c8 00 06 70 05 78 06 00 1d+04:19:58.500 READ DMA

Error 18931 occurred at disk power-on lifetime: 1290 hours (53 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
40 51 00 71 05 78 e6 Error: UNC at LBA = 0x06780571 = 108529009

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
c8 00 06 70 05 78 06 00 1d+04:19:58.500 READ DMA
ec 00 00 00 00 00 00 02 1d+04:19:58.490 IDENTIFY DEVICE
ef 03 45 00 00 00 00 02 1d+04:19:58.484 SET FEATURES [Set transfer mode]
ec 00 00 00 00 00 00 02 1d+04:19:58.476 IDENTIFY DEVICE
c8 00 06 70 05 78 06 00 1d+04:19:55.414 READ DMA

Error 18930 occurred at disk power-on lifetime: 1290 hours (53 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
40 51 00 71 05 78 e6 Error: UNC at LBA = 0x06780571 = 108529009

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
c8 00 06 70 05 78 06 00 1d+04:19:55.414 READ DMA
ec 00 00 00 00 00 00 02 1d+04:19:55.405 IDENTIFY DEVICE
ef 03 45 00 00 00 00 02 1d+04:19:55.398 SET FEATURES [Set transfer mode]
ec 00 00 00 00 00 00 02 1d+04:19:55.391 IDENTIFY DEVICE
c8 00 06 70 05 78 06 00 1d+04:19:52.476 READ DMA

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
1 Short offline Completed: read failure 90% 1291 53632856

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
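
A report this long is easier to read with the sector-related counters pulled out. The sketch below filters attribute rows for IDs 5, 196, 197 and 198 and prints those with a non-zero raw value. Choosing these four IDs is a common rule of thumb for spotting media trouble, not a verdict from smartmontools, and the function name flag_attrs is hypothetical. The sample rows are copied from the report above. (SMART data belongs to the whole drive, so /dev/sdb would be the more usual argument than /dev/sdb5.)

```shell
#!/bin/sh
# flag_attrs (hypothetical helper): read smartctl attribute rows on stdin
# and print NAME=RAW for the sector-related attributes whose raw value
# is above zero. Column 1 is the ID, column 2 the name, column 10 RAW_VALUE.
flag_attrs() {
    awk '$1 ~ /^(5|196|197|198)$/ && $10 + 0 > 0 { print $2 "=" $10 }'
}

# Sample rows taken from the smartctl output in this thread.
flag_attrs <<'EOF'
  1 Raw_Read_Error_Rate 0x002f 198 198 051 Pre-fail Always - 21595
  5 Reallocated_Sector_Ct 0x0033 150 150 140 Pre-fail Always - 393
196 Reallocated_Event_Count 0x0032 001 001 000 Old_age Always - 248
197 Current_Pending_Sector 0x0032 195 192 000 Old_age Always - 872
198 Offline_Uncorrectable 0x0030 196 192 000 Old_age Offline - 770
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 6
EOF
```

For the rows above this prints Reallocated_Sector_Ct=393, Reallocated_Event_Count=248, Current_Pending_Sector=872 and Offline_Uncorrectable=770, one per line. On a live system the same filter could be fed with smartctl -A /dev/sdb | flag_attrs.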

No, the drive’s probably ok. What about the interface? Any doubts about that?

No reason to doubt the interface that I can think of - PCI SATA card in IBM e330 server.

When I first used the card/drive combination, it froze both the above box and my ASUS P4S800D-X on boot. Updating the SATA card BIOS fixed that. I then used the card/drive combination in the ASUS to partition the drive and format it with ext3 on LUKS, and copied about 100GB onto each of sdb5-8.

sdb7 has (had?) an identical copy of the data on sdb5, which I checked when it was originally copied with:

diff -r /originaldata /mnt/sdb5
diff -r /originaldata /mnt/sdb7

When I:

rsync -av --progress /mnt/sdb7/ /mnt/sdb8

no errors are reported.
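
The diff -r check above compares two trees at one point in time. A different technique, swapped in here purely as an illustration, is to record a checksum per file so the same tree can be rechecked later, for example to see whether sdb5 still matches sdb7. This is a rough sketch assuming md5sum from GNU coreutils is installed; tree_sums and same_tree are made-up names, and filenames containing newlines are not handled.

```shell
#!/bin/sh
# tree_sums DIR (hypothetical helper): print "checksum  ./relative/path"
# for every regular file under DIR, sorted by path.
tree_sums() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

# same_tree A B (hypothetical helper): succeed only if both trees contain
# the same relative paths with the same contents.
same_tree() {
    a=$(tree_sums "$1") && b=$(tree_sums "$2") && [ "$a" = "$b" ]
}
```

For instance, same_tree /mnt/sdb5 /mnt/sdb7 && echo identical would report whether the two copies still agree, and tree_sums /mnt/sdb7 > sdb7.md5 keeps a manifest for later comparison. Files that fail to read show up as md5sum errors on stderr and as missing lines in the manifest.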

I’ve only just put it together and, being cautious, have only tested the nbd export from sdb5. The only write operations were to sdb5 (from the nbd-client box):

touch /mnt/exported-nbd/test.txt
rm /mnt/exported-nbd/test.txt

Which is one reason I suspected nbd.

I’m thinking of trying badblocks tomorrow. I’ve never run it before:

#e2fsck -c -C 0 -V /dev/mapper/sdb5

??
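
On that command: in e2fsck, -c asks it to run badblocks in read-only mode and add any bad blocks it finds to the filesystem’s bad-block list, and -C 0 shows a progress bar. Capital -V, however, only prints the version and exits; lowercase -v is the verbose switch. A guarded version might look like the sketch below. The helper names is_mounted and scan_readonly are hypothetical, and the mount check assumes a Linux /proc/mounts (if that file is missing, the device is treated as not mounted).

```shell
#!/bin/sh
# is_mounted DEV (hypothetical helper): succeed if DEV is listed as a
# mount source in /proc/mounts.
is_mounted() {
    DEV="$1" awk '$1 == ENVIRON["DEV"] { found = 1 } END { exit !found }' /proc/mounts
}

# scan_readonly DEV (hypothetical helper): refuse to touch a mounted
# filesystem, otherwise run e2fsck with a read-only bad-block scan.
scan_readonly() {
    if is_mounted "$1"; then
        echo "refusing to scan $1: it is mounted" >&2
        return 1
    fi
    # -c   : read-only badblocks scan, results go to the bad-block list
    # -C 0 : progress bar
    # -v   : verbose (-V would just print the version and exit)
    e2fsck -c -C 0 -v "$1"
}
```

It would be called as scan_readonly /dev/mapper/sdb5, as root, with the LUKS mapping open and the filesystem unmounted.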

Thanks for your continued interest, much appreciated.