Software RAID not working after a while

Hello,

I have a server running openSUSE 12.1. Because several services, including third-party software, run on this machine, I have not upgraded to a newer version. All in all the server runs well, except for one thing.

I have installed 4 SATA disk drives (Seagate ST1000DL002) in addition to the existing main disk. The 4 drives are supposed to work as a software RAID 5. Unfortunately, no matter which RAID level I use (0, 4, 5) and regardless of the file system (I tried xfs, ext3, ext4), after using the RAID for some hours and writing data to it, the system changes the mount option from rw to ro and the RAID can no longer be used properly.

As I said, I tried all kinds of combinations, but I always get the error described above. On the other hand, the disks seem to be okay and their SMART values are also okay. If I mount each disk separately, I can work with it and copy hundreds of GB without any problem.

I also tried to set up a logical volume, but still with no success as long as a RAID sits underneath it. dmraid (device-mapper RAID) misbehaves in a similar way.
I replaced the SATA controller (initially SiL, now Promise), but even this did not help.

The astonishing part is that the RAID works for, let’s say, a few hours and I can write to it, but then the mount option changes.

Do you have any advice for me?

Thank you in advance

honterus0

Did you check the log files and dmesg?

As hcvv suggested, checking the logs is where to start. Although the SMART attributes may show the drives are healthy, there may be issues with drive or controller timeouts, or various other things that cause reads or writes to fail, resulting in the file system being remounted read-only. Such events should be reflected in the logs.
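To make the log check concrete, here is a minimal sketch in Python that picks storage-related lines out of kernel log text. The function name `suspicious_lines` and the pattern list are my own assumptions for illustration; the patterns are not exhaustive and may need adjusting for your kernel version.

```python
import re

# Patterns that often accompany storage trouble in kernel logs.
# This selection is an illustrative assumption, not a complete catalogue.
SUSPICIOUS = re.compile(
    r"EXT[234]-fs error|I/O error|\bata\d+[.:]|read-only",
    re.IGNORECASE,
)


def suspicious_lines(log_text):
    """Return the lines of log_text that match any storage-related pattern."""
    return [line for line in log_text.splitlines() if SUSPICIOUS.search(line)]
```

You could feed it the saved output of dmesg, for example `suspicious_lines(open("dmesg.txt").read())`, where `dmesg.txt` is a hypothetical file name.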

Are you using the onboard SATA controller or an HBA card? Which mode is it set to? Is the BIOS up to date?

Hello,

I repeated the test and got the error again. Here is what I did:

  1. Combined the 4 disks into a RAID 5 at /dev/md0 using YaST
  2. Formatted /dev/md0 with ext4
  3. Mounted /dev/md0 with the rw option
  4. Started a copy (in a shell, using cp) of roughly 1 TB from an external USB disk
  5. After some time of copying, I got error messages in the shell running the copy
  6. The dmesg output is appended below. The excerpt covers the steps above.
  7. As you can see, EXT4-fs errors occur on /dev/md0, but I don’t know what to do about them.
  8. I used add-on PCI SATA controller cards (SiL or Promise chipset), but unfortunately I always got the same misbehaviour.
  9. The SiL controller was brand new and the Promise FastTrak is old, but in both cases the disks were recognized and shown with the right size in YaST.

Can you please give me a hint?

Thank You.

honterus0

Section of dmesg:


[521594.336785] device-mapper: uevent: version 1.0.3
[521594.336896] device-mapper: ioctl: 4.21.0-ioctl (2011-07-06) initialised: dm-devel@redhat.com
[521597.584331] st: Version 20101219, fixed bufsize 32768, s/g segs 256
[521746.460166] md: md0 still in use.
[521746.660368] md0: detected capacity change from 4000810270720 to 0
[521746.660508] md: md0 stopped.
[521746.660514] md: unbind<sdf1>
[521746.663048] md: export_rdev(sdf1)
[521746.663253] md: unbind<sde1>
[521746.677045] md: export_rdev(sde1)
[521746.677236] md: unbind<sdc1>
[521746.679030] md: export_rdev(sdc1)
[521746.679191] md: unbind<sdd1>
[521746.681051] md: export_rdev(sdd1)
[521858.614080] md: bind<sdc1>
[521859.002666] md: bind<sdd1>
[521859.339553] md: bind<sde1>
[521859.717642] md: bind<sdf1>
[521859.765485] bio: create slab <bio-1> at 1
[521859.765500] md/raid0:md0: looking at sdf1
[521859.765504] md/raid0:md0:   comparing sdf1(1953520640) with sdf1(1953520640)
[521859.765507] md/raid0:md0:   END
[521859.765509] md/raid0:md0:   ==> UNIQUE
[521859.765511] md/raid0:md0: 1 zones
[521859.765513] md/raid0:md0: looking at sde1
[521859.765515] md/raid0:md0:   comparing sde1(1953520640) with sdf1(1953520640)
[521859.765519] md/raid0:md0:   EQUAL
[521859.765521] md/raid0:md0: looking at sdd1
[521859.765523] md/raid0:md0:   comparing sdd1(1953520640) with sdf1(1953520640)
[521859.765526] md/raid0:md0:   EQUAL
[521859.765528] md/raid0:md0: looking at sdc1
[521859.765531] md/raid0:md0:   comparing sdc1(1953520640) with sdf1(1953520640)
[521859.765534] md/raid0:md0:   EQUAL
[521859.765535] md/raid0:md0: FINAL 1 zones
[521859.765541] md/raid0:md0: done.
[521859.765543] md/raid0:md0: md_size is 7814082560 sectors.
[521859.765546] ******* md0 configuration *********
[521859.765548] zone0=[sdc1/sdd1/sde1/sdf1/]
[521859.765553]         zone offset=0kb device offset=0kb size=3907041280kb
[521859.765555] **********************************
[521859.765556] 
[521859.765578] md0: detected capacity change from 0 to 4000810270720
[521859.779983]  md0: unknown partition table
[521860.109882] async_tx: api initialized (async)
[521860.121925] xor: automatically using best checksumming function: pIII_sse
[521860.126005]    pIII_sse  :  6256.000 MB/sec
[521860.126008] xor: using function: pIII_sse (6256.000 MB/sec)
[521860.188733] raid6: int32x1    750 MB/s
[521860.206619] raid6: int32x2    914 MB/s
[521860.223058] raid6: int32x4    656 MB/s
[521860.246044] raid6: int32x8    730 MB/s
[521860.266039] raid6: mmxx1     1656 MB/s
[521860.286113] raid6: mmxx2     3011 MB/s
[521860.303075] raid6: sse1x1     429 MB/s
[521860.339041] raid6: sse1x2    1164 MB/s
[521860.356441] raid6: sse2x1     582 MB/s
[521860.377089] raid6: sse2x2    1113 MB/s
[521860.377093] raid6: using algorithm sse1x2 (1164 MB/s)
[521860.405784] md: raid6 personality registered for level 6
[521860.405788] md: raid5 personality registered for level 5
[521860.405790] md: raid4 personality registered for level 4
[521951.943652] usb 1-5: USB disconnect, device number 7
[522234.775364] md0: detected capacity change from 4000810270720 to 0
[522234.775374] md: md0 stopped.
[522234.775381] md: unbind<sdf1>
[522234.778391] md: export_rdev(sdf1)
[522234.778463] md: unbind<sde1>
[522234.781053] md: export_rdev(sde1)
[522234.781074] md: unbind<sdd1>
[522234.786072] md: export_rdev(sdd1)
[522234.786148] md: unbind<sdc1>
[522234.791053] md: export_rdev(sdc1)
[522246.843755] md: bind<sdc1>
[522246.852228] md: bind<sdd1>
[522246.857100] md: bind<sde1>
[522246.869474] md: bind<sdf1>
[522246.872070] bio: create slab <bio-1> at 1
[522246.872098] md/raid:md0: device sde1 operational as raid disk 2
[522246.872101] md/raid:md0: device sdd1 operational as raid disk 1
[522246.872105] md/raid:md0: device sdc1 operational as raid disk 0
[522246.872582] md/raid:md0: allocated 4221kB
[522246.872694] md/raid:md0: raid level 5 active with 3 out of 4 devices, algorithm 2
[522246.872696] RAID conf printout:
[522246.872698]  --- level:5 rd:4 wd:3
[522246.872701]  disk 0, o:1, dev:sdc1
[522246.872703]  disk 1, o:1, dev:sdd1
[522246.872705]  disk 2, o:1, dev:sde1
[522246.880533] created bitmap (8 pages) for device md0
[522246.892550] md0: bitmap initialized from disk: read 1/1 pages, set 14905 of 14905 bits
[522246.912504] md0: detected capacity change from 0 to 3000607703040
[522246.932966]  md0: unknown partition table
[522248.005506] RAID conf printout:
[522248.005513]  --- level:5 rd:4 wd:3
[522248.005517]  disk 0, o:1, dev:sdc1
[522248.005520]  disk 1, o:1, dev:sdd1
[522248.005522]  disk 2, o:1, dev:sde1
[522248.005524]  disk 3, o:1, dev:sdf1
[522248.006196] md: recovery of RAID array md0
[522248.006199] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[522248.006202] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[522248.006207] md: using 128k window, over a total of 976760320k.
[522326.577671] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
[522350.500031] usb 1-5: new high speed USB device number 8 using ehci_hcd
[522350.643274] usb 1-5: New USB device found, idVendor=174c, idProduct=55aa
[522350.643280] usb 1-5: New USB device strings: Mfr=2, Product=3, SerialNumber=1
[522350.643284] usb 1-5: Product: MEDION HDDrive2GO
[522350.643286] usb 1-5: Manufacturer: MEDION
[522350.643288] usb 1-5: SerialNumber: 3120000000000000A4FF
[522350.644933] scsi15 : usb-storage 1-5:1.0
[522355.365202] scsi 15:0:0:0: Direct-Access     ST1000DM 003-9YN162       NDP3 PQ: 0 ANSI: 0
[522355.365509] sd 15:0:0:0: Attached scsi generic sg2 type 0
[522355.383457] sd 15:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
[522355.384297] sd 15:0:0:0: [sdb] Write Protect is off
[522355.384301] sd 15:0:0:0: [sdb] Mode Sense: 23 00 00 00
[522355.385044] sd 15:0:0:0: [sdb] No Caching mode page present
[522355.385047] sd 15:0:0:0: [sdb] Assuming drive cache: write through
[522355.387422] sd 15:0:0:0: [sdb] No Caching mode page present
[522355.387428] sd 15:0:0:0: [sdb] Assuming drive cache: write through
[522355.407882]  sdb: sdb1
[522355.410431] sd 15:0:0:0: [sdb] No Caching mode page present
[522355.410438] sd 15:0:0:0: [sdb] Assuming drive cache: write through
[522355.410442] sd 15:0:0:0: [sdb] Attached SCSI disk
[522356.401491] kjournald starting.  Commit interval 5 seconds
[522356.402228] EXT3-fs (sdb1): using internal journal
[522356.402237] EXT3-fs (sdb1): mounted filesystem with ordered data mode
[540850.776025] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729096: block 62922884: comm cp: bad entry in directory: directory entry across blocks - offset=0(0), inode=3908271042, rec_len=26844, name_len=121
[540909.160932] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729098: block 62922885: comm cp: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=3687578907, rec_len=52137, name_len=198
[540909.372793] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729098: block 62922885: comm cp: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=3687578907, rec_len=52137, name_len=198
[543166.878317] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729164: block 62922897: comm cp: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1288353844, rec_len=20641, name_len=25
[543425.258248] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729169: block 62922899: comm cp: bad entry in directory: directory entry across blocks - offset=0(0), inode=2879359133, rec_len=16348, name_len=76
[545313.940811] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729217: block 62922911: comm cp: bad entry in directory: inode out of bounds - offset=0(0), inode=1210108374, rec_len=204, name_len=0
[547589.595886] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729270: block 62922924: comm cp: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2039048537, rec_len=59290, name_len=217
[547589.635018] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729270: block 62922924: comm cp: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2039048537, rec_len=59290, name_len=217
[547657.370717] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729273: block 62922926: comm cp: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1381318334, rec_len=28898, name_len=43
[547657.483228] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729273: block 62922926: comm cp: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1381318334, rec_len=28898, name_len=43
[547657.521590] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729273: block 62922926: comm cp: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1381318334, rec_len=28898, name_len=43
[548464.883348] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729307: block 62922934: comm cp: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=3531331575, rec_len=33229, name_len=97
[548465.044910] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729307: block 62922934: comm cp: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=3531331575, rec_len=33229, name_len=97
[548465.091959] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729307: block 62922934: comm cp: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=3531331575, rec_len=33229, name_len=97
[566606.166017] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15728641: block 62922945: comm cp: bad entry in directory: directory entry across blocks - offset=0(0), inode=341302059, rec_len=10836, name_len=87
...
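In the excerpt above, every EXT4-fs error comes from add_dirent_to_buf while cp is running, and the affected blocks fall in a narrow range (62922884 to 62922945). A small Python sketch like the following can produce that summary from a longer log; the function name `summarize_ext4_errors` and the regular expression are assumptions written to fit the line format shown here, and other kernel versions may format these messages differently.

```python
import re
from collections import Counter

# Written against lines of this shape (taken from the excerpt above):
# [540850.776025] EXT4-fs error (device md0): add_dirent_to_buf:1271: inode #15729096: block 62922884: comm cp: ...
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<dev>\w+)\): "
    r"(?P<func>\w+):\d+: inode #(?P<inode>\d+): block (?P<block>\d+):"
)


def summarize_ext4_errors(log_text):
    """Count EXT4-fs errors per (device, block) and return the block range seen.

    Returns (counter, span), where span is (lowest_block, highest_block)
    or None when no matching line was found.
    """
    per_block = Counter()
    for line in log_text.splitlines():
        match = EXT4_ERROR.search(line)
        if match:
            per_block[(match.group("dev"), int(match.group("block")))] += 1
    blocks = sorted(block for _dev, block in per_block)
    span = (blocks[0], blocks[-1]) if blocks else None
    return per_block, span
```

Repeated hits on the same block within a second or so, as in the excerpt, are likely cp retrying against the same damaged directory block rather than new corruption each time.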

Hi
Hardware fault… Is the power supply strong enough for all those drives? Is the CPU overheating (run the sensors command)? What are the hard drive temperatures (hddtemp or smartctl)? Are all the fans working OK? RAM, perhaps?
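One way to look for silent corruption of the kind a hardware fault would cause is to write known data, read it back, and compare checksums. The Python sketch below does that for a single file; the function name `write_and_verify`, its parameters, and the seed value are all assumptions for illustration. Note that a read-back shortly after writing may be served from the page cache rather than from the disks, so for a meaningful test the amount written should exceed the machine's RAM, or the file should be re-read after a remount.

```python
import hashlib
import os


def write_and_verify(path, total_mib=64, chunk_mib=4, seed=b"raid-test"):
    """Write a deterministic pattern to path, read it back, compare SHA-256.

    Returns True when the data read back matches what was written.
    """
    chunk_size = chunk_mib * 1024 * 1024
    written = hashlib.sha256()
    with open(path, "wb") as f:
        for i in range(total_mib // chunk_mib):
            # Each chunk repeats a 32-byte digest derived from seed and index,
            # so the content is reproducible and differs between chunks.
            block = hashlib.sha256(seed + str(i).encode()).digest()
            chunk = block * (chunk_size // len(block))
            f.write(chunk)
            written.update(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the data to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            read_back.update(chunk)
    return written.hexdigest() == read_back.hexdigest()
```

Running it once against a file on the RAID and once against a file on a single disk, under the same load, would show whether the mismatch only appears when all four drives are active.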