openSUSE 11.0 64-bit RAID 5 Array Questions

After a few months of research I still have a couple questions concerning a RAID 5 array I wish to build soon.

System specs:

Asus P5Q Pro
Intel E6750
2x 1GB DDR2 800
Gb Ethernet LAN
openSUSE 11.0 64-bit

This is my personal server. I mainly use it for NAS and media functions (encoding, streaming, etc). All my LAN hardware is Gb and I do intend to use the Gb speeds (even now without the RAID I push 300-400 Mb/s over the network).

I want to install a ~2TB RAID 5 array (I’m still trying to decide between 4x 500, 750, or 1024 GB drives). BIOS RAID is not an option. I am not sure if I want to use kernel RAID (pure software) or go with a full hardware RAID. I would like a single large volume and I would like the throughput to be close to (or in excess of) 1 Gb/s.

The first question is will kernel RAID be able to provide the target 1 Gb/s throughput? I know the hardware solution will be faster (and use less CPU), but if the software solution is fast enough I don’t see the need to buy a $500 RAID controller just yet.

Second, do large partitions/volumes (over 2 TB) work? I have not been able to find a conclusive answer to this. I know traditional partitions do not support sizes over 2 TB. The other confusion here is that the terms partition and volume seem to be used interchangeably. As I understand it, the partition is the way the volume is divided up (say you have a 500 GB hard drive: the volume would be 500 GB, but you could make up to 4 primary partitions with a total size of up to 500 GB; basically the partitions are divisions of the volume). I think this concerns hardware RAID more than kernel RAID, since hardware RAID would present the OS with a single very large hard drive with a capacity over 2 TB. I will probably put the OS on a different volume due to the size issues, so this large volume is for storage (mainly multimedia).

If I go hardware I am trying to decide between the HighPoint RocketRAID 3510 (Newegg.com - HighPoint RocketRAID 3510 SATA II Hardware RAID Controller with Intel 2nd Generation PCI-Express I/O Processor, RAID 0/1/5/6/10/JBOD) or the 3ware 9650SE-8LPML (Newegg.com - 3ware 9650SE-8LPML PCI Express SATA II Controller Card, RAID 0/1/5/6/10/50, Single Disk, JBOD). I am leaning towards the 3ware because the other 3ware products seem to have good Linux support. I also want the RAID controller to be natively supported by the kernel. I have had bad luck with other (albeit cheap) RAID solutions and their proprietary drivers not working.

In the future I will have to add another SATA/RAID adapter since the P5Q Pro only has 6x SATA connectors (for future expansion I intend to add a second RAID 5 array when I fill this first one, then cycle out the arrays as needed). This would also be an argument for hardware RAID. One concern with hardware RAID is that reported performance seems to be mixed. Some people claim nice high speeds while others claim speeds as low as 8 MB/s (note that most of the HDs I am considering I have tested before and clocked real-world speeds at about 60 MB/s sustained). The RAID array needs to sustain at least 125 MB/s. The RAID is mostly read operations, although it does do a fair share of writing (the data has to get there somehow).

Also which file system should I use? Should I stick with ext3 or should I try something else? I have only used ext2/3 and ReiserFS so I don’t know a whole lot about the others out there (I have seen XFS mentioned a lot with RAID 5 arrays).

Any comments, suggestions, and/or advice are welcome. The purpose of this thread is to figure out what I should buy before I invest $500-1K in hardware. I know things like backups will be a pain with this large array (they already are and I only have 600 GB of data…).

One last thing I was pondering but could never find any mention of anyone attempting: some of the Western Digital GP drives support multiple RPM speeds (5400 or 7200). From what I understand they idle at 5400 RPM and then kick up to 7200 RPM when accessed. I have one in an external enclosure and it works nicely, but I am wondering what that would do to a RAID array. My guess is it would be like building an array using some 5400 RPM drives and some 7200 RPM drives. The power (and cooling) savings would be nice, but intuitively I think it would be bad for the array. Any ideas why no one has attempted this (or at least not admitted to trying it)?

No one has any ideas/suggestions?

Followup:

Hardware change:
RAM is now 4GB (2x 2GB DDR2 800)

Since my last post I have set up a RAID 5 array using software RAID. I have 4x 1TB Western Digital WD1001FALS drives and the onboard SATA ports on my motherboard. The final array is 2.7TB. I placed a single partition on the array and formatted it as ext3 (note that my OS is on a separate drive not in the array).
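
For anyone curious, creating the array itself was roughly along these lines (a sketch with example device names, not my exact command line):

mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
cat /proc/mdstat    # watch the initial build/resync progress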

So far I see very little CPU usage when reading/writing (less than 10% on average), which, considering what I use the server for, is fine. When writing over the Gb LAN from my 750 GB SATA WD Green drive (the 5400 to 7200 RPM variable drive) to the RAID array via Samba I get about 74 MB/s. Reading has yet to be accurately benchmarked over the network (I’ll post back later once I figure out a good way to measure that).

Using dd to generate 10 GB files of 0’s I get 27 MB/s write speed and 193 MB/s read speed. When I generate files smaller than 1 GB I get 403 MB/s write speed and 3 GB/s read speed.

Overall I am pleased with the performance of the software RAID array. Also, Windows XP SP3 has no issues connecting to a Samba share over 2 TB.

The first question is will kernel RAID be able to provide the target 1 Gb/s throughput? I know the hardware solution will be faster (and use less CPU), but if the software solution is fast enough I don’t see the need to buy a $500 RAID controller just yet.

I have a software RAID 5 with 4x 160 GB drives. I get around 775 Mb/s throughput on a read-heavy, light-write workload. Kernel RAID writes are usually slower than reads (about the same as a single drive without tweaking). Hardware will improve both read and write performance, but for writes the most significant improvement requires the controller to allow write-back caching (which usually requires battery-backed cache on the controller).

Second, do large partitions/volumes (over 2 TB) work? I have not been able to find a conclusive answer to this. I know traditional partitions do not support sizes over 2 TB. The other confusion here is that the terms partition and volume seem to be used interchangeably. As I understand it, the partition is the way the volume is divided up (say you have a 500 GB hard drive: the volume would be 500 GB, but you could make up to 4 primary partitions with a total size of up to 500 GB; basically the partitions are divisions of the volume). I think this concerns hardware RAID more than kernel RAID, since hardware RAID would present the OS with a single very large hard drive with a capacity over 2 TB. I will probably put the OS on a different volume due to the size issues, so this large volume is for storage (mainly multimedia).

MS-DOS partition tables do not support partitions greater than 2 TB. openSUSE 11.1 (finally) has GPT partition table support, which allows for larger partitions and eliminates the distinction between primary and extended partitions. That said, unless your physical disks are > 2 TB each, this won’t be an issue (for software RAID anyway).
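
If you do end up with a single device larger than 2 TB (e.g. a hardware RAID volume), something along these lines should give you a GPT label and one big partition (device name is just an example):

parted /dev/sdb mklabel gpt
parted /dev/sdb mkpart primary 0% 100%
parted /dev/sdb print    # verify the new layout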

If you choose the MD RAID route, I’d recommend putting a 100 MB partition on each drive for a RAID 1 /boot, and using the rest of each drive to make your RAID 5. Then use LVM to carve the large RAID 5 up into /, /home, swap and whatever else.
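
As a rough sketch (example device names, layout and sizes; adjust to taste):

# RAID 1 /boot from the small first partitions
mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
# RAID 5 from the large second partitions
mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
# LVM on top of the RAID 5
pvcreate /dev/md1
vgcreate system /dev/md1
lvcreate -L 20G -n root system
lvcreate -L 2G -n swap system
lvcreate -l 100%FREE -n home system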

In the future I will have to add another SATA/RAID adapter since the P5Q Pro only has 6x SATA connectors (for future expansion I intend to add a second RAID 5 array when I fill this first one, then cycle out the arrays as needed). This would also be an argument for hardware RAID. One concern with hardware RAID is that reported performance seems to be mixed. Some people claim nice high speeds while others claim speeds as low as 8 MB/s (note that most of the HDs I am considering I have tested before and clocked real-world speeds at about 60 MB/s sustained). The RAID array needs to sustain at least 125 MB/s. The RAID is mostly read operations, although it does do a fair share of writing (the data has to get there somehow).

Here’s how my array performs with a largish sequential write:

kenn@loki:~> time dd if=/dev/zero of=testfile bs=4096 count=1048576 && time sync
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 43.3409 s, 99.1 MB/s

real	0m43.550s
user	0m0.196s
sys	0m6.776s

real	0m4.153s
user	0m0.000s
sys	0m0.100s

(99 MB/s = 792 Mb/s) Here’s read (with the same file created above):

kenn@loki:~> time cat testfile > /dev/null

real 0m42.733s
user 0m0.168s
sys 0m4.716s

Bear in mind I didn’t shut anything down for these quick benchmarks; I’ve got a boatload of stuff running. This isn’t quite your target, but it’s pretty close.

Expandability is also an argument for the software/LVM approach. If you add more SATA ports, you can create a new software RAID set and add it as another physical volume to your volume group. You can then concatenate that onto your existing logical volume, so when the first fills, you start using the second (I would love to say you can use LVM to stripe them, RAID 50 style, but I don’t know if you can do that with existing data; you can if you set it up that way in the beginning).
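
Roughly, the grow step would look like this (a sketch; "storage"/"data" and /dev/md2 are example names):

pvcreate /dev/md2
vgextend storage /dev/md2
lvextend -l +100%FREE /dev/storage/data
resize2fs /dev/storage/data    # grow the ext3 filesystem to fill the larger LV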

Another option for adding to a software RAID 5 (from the md(4) man page):

As of Linux 2.6.17, md can reshape a raid5 array to have more devices. Other possibilities may follow in future kernels.
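
In practice the reshape looks something like this (a sketch; /dev/sde1 is an example new device):

mdadm --add /dev/md1 /dev/sde1
mdadm --grow /dev/md1 --raid-devices=5
cat /proc/mdstat    # the reshape runs in the background and can take many hours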

Also which file system should I use? Should I stick with ext3 or should I try something else? I have only used ext2/3 and ReiserFS so I don’t know a whole lot about the others out there (I have seen XFS mentioned a lot with RAID 5 arrays).

If you have a lot of large media files, use XFS, but keep in mind that you can only grow XFS, not shrink it. If you ever think you’ll need to shrink the filesystem, use ext3.
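
If you do go XFS, it can be told about the RAID geometry at creation time, which tends to help on RAID 5. A sketch, assuming a 64k chunk size and a 4-drive RAID 5 (3 data disks):

mkfs.xfs -d su=64k,sw=3 /dev/md1
xfs_growfs /mount/point    # later, after growing the underlying volume/array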

Oops, I didn’t read the entire thread. Glad you’re happy with MD RAID.

Try increasing your stripe cache. My writes improved significantly when I went from 256 (the default) to 4096.

As root:

echo 4096 > /sys/block/md0/md/stripe_cache_size

Thanks for the info. I will look into the cache settings. Honestly I was a little disappointed in the write speed, but at the same time the fastest thing I can feed the array with (or from) is a single SATA drive (which I know from previous usage has a max of about 50-80 MB/s depending on which drive/system I use), so it is rather hard to test the real situation (the RAID is meant for a NAS-like setup, thus the 1 Gb/s target). Also I find it odd that using dd to write a file of 0’s writes slower and reads faster than a real drive copying data.

The slow write is due to using write-through caching instead of write-back.

With write-back, the write request returns as soon as it is committed to cache. Write-through doesn’t return until it’s committed to disk.

The end result is that with MD RAID you cannot have write-back without a battery-backed cache, so writes are really only as fast as 1 disk (at best).

Increasing the stripe cache seems to speed writes up though.

Ya, I kind of figured the caching would be needed. But still, the same model drives are reportedly writing at 100 MB/s (in some other review I found). I should have tested them before putting them into the array, but I didn’t think to do so and now it is too late to take the array apart (it took more work than I am willing to redo for the sake of testing; restoring 600 GB of data takes a while no matter what your backup source is).

I cannot change the stripe cache at the moment (I have a few other things running), but I will report back my findings once I do test it.

Examining the /sys/block/md0/md/stripe_cache_size file I see it contains a single value, 256. After entering

# echo 4096 > /sys/block/md0/md/stripe_cache_size

the file is empty. After a reboot the file again contains 256.

Any idea what is going on? The redirect doesn’t seem to be doing what it should be doing.

Can I manually edit this file with KWrite? Or is this really a binary file that is being interpreted as 256?

I don’t see a button to edit posts.

Anyway, I do notice a performance increase when I echo 4096 to stripe_cache_size (I get 99 MB/s). I also see a slight increase at 8192 (107 MB/s).

What is the cost of increasing the stripe cache size? I noticed a little more CPU usage (15-30% now instead of the previous 8-20%). Is there an optimal (or close to optimal) stripe cache size based on drive properties? For example, my 4 drives each have 32 MB of cache and the system has 4 GB of RAM (most of which is free all the time). I guess what I am asking is: is there a relationship between the disks’ cache, RAM, CPU usage, and the optimal stripe cache size?

I did some testing with various stripe cache sizes and 4096 seems to be the best performance/resource tradeoff.

For a 10 GB file of 0’s (dd if=/dev/zero of=test bs=1M count=10240):

At 4096: 10-20% CPU, write ~99 MB/s, read ~200 MB/s

At 8192: 20-30% CPU, write ~107 MB/s, read ~217 MB/s

At 16384: 40-50% CPU, write ~115 MB/s, read ~203 MB/s

While increasing the stripe cache yielded faster RAID speeds, it came at the cost of CPU time. For my purposes I think 4096 is ideal. ~100 MB/s is good enough for writing, and anything over 125 MB/s read speed will not be noticed over the Gb LAN.

Interestingly, with a stripe cache of 4096 (I didn’t test this with the others), I noticed that smaller files slowed down. It’s almost like increasing the cache “spread out” the IO speeds. I was able to hit 3 GB/s read speed on a small file with the default 256 cache, but with 4096 the same file dropped to a “mere” 400 MB/s. Nothing to worry about, and it makes sense when you consider how a cache works and what it’s meant to do, but I find it interesting nonetheless to see the real numbers change (I’m used to dealing with the theoretical stuff; this is the first “real” array I have built).
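
For reference, each run was roughly along these lines (a sketch, not my exact commands; paths are examples, and the page cache is dropped before the read test so the number isn’t just RAM speed):

echo 4096 > /sys/block/md0/md/stripe_cache_size
dd if=/dev/zero of=/mnt/raid/test bs=1M count=10240 && sync
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/raid/test of=/dev/null bs=1M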

I still have not figured out how to make the cache change permanent. I did, however, find this on another site talking about RAID optimizations:

vi /etc/sysctl.conf
vm.vfs_cache_pressure=200
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100

Most of the stuff on that site probably doesn’t apply to me since it was talking about optimizing a hardware-controlled RAID, but I am wondering if I can add this to /etc/sysctl.conf:


vm.stripe_cache_size = 4096

Will that work or will it cause issues?

Also, I see read-ahead mentioned a lot. For example:

blockdev --setra 8192 /dev/sda

Would that do anything for a software RAID?
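
I guess the equivalent for the array itself would presumably be something like this (untested on my end; assuming the array is /dev/md0):

blockdev --setra 8192 /dev/md0    # set read-ahead (in 512-byte sectors) on the md device
blockdev --getra /dev/md0         # check the current value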

Sorry to hijack the thread. I am interested in building a NAS server on a Linux box and found this topic interesting. However, I plan on using a hardware RAID and a few 1 TB SATA drives.

I plan on serving up storage for Windows and Linux clusters/file systems. Any hardware recommendations and advice would be greatly appreciated.

I am somewhat new to the Linux world, but I am learning.

Regards,
BILL

For hardware you probably want to take a look at Hardware - openSUSE.
I suggest you start a new thread with some more info, like the number of disks, total size required, etc.

Anyone know how to make stripe_cache_size permanent?

Also, wm_sorg, you might find these links useful:
How can I improve performance using 3ware controllers with the Linux 2.6 kernel?
Kyle Cordes

They talk about performance tuning on hardware RAID (they are where I found some of the above snippets). I don’t know if it will help. I haven’t been looking into the hardware side of things since the software RAID is performing well enough.

OK, I figured out how to make the stripe cache size permanent.

I was looking for a setting somewhere but found none, so I resorted to adding a line to the boot.local file.

I added

echo 4096 > /sys/block/md0/md/stripe_cache_size

to

/etc/init.d/boot.local

The change now takes effect at boot time.

Thank you. I will investigate.

Hi there,

Sorry for hijacking this thread too, but I have exactly the same problem. I also have a RAID 5 (but with only 3 drives) and Gigabit LAN, and I would like to speed up the RAID to match the LAN speed. I have already set the stripe_cache_size to 8192, which increased the speed by a factor of 3, but it is still only 60 MB/s write speed and 140 MB/s read speed.

Did you do any more optimization, such as using other drivers or tweaking more configs?
Otherwise I would have to accept that my hardware is just too slow. :frowning:

Greetings,
ZeD

You might find that turning off barriers (the ext3 barrier mount option) improves performance.
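
For example, something like this in /etc/fstab (device and mount point are just placeholders):

/dev/md0   /srv/storage   ext3   defaults,barrier=0   0 2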

If you really want speed, RAID 5 is counter-productive; use RAID 10 or RAID 0+1.

Well, RAID 10 or 0+1 would “waste” another drive, which I just cannot afford. :wink:

In the meantime I checked the speed of a single drive, and I think there’s my problem:

fileserver:~ # hdparm -t /dev/sdb

/dev/sdb:
 Timing buffered disk reads:  258 MB in  3.01 seconds =  85.65 MB/sec

So unless somebody can tell me how to speed up the drives themselves, I think I won’t get any better read/write speeds… :\
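
I guess I should also run the same quick test against the array itself (assuming it shows up as /dev/md0):

hdparm -t /dev/md0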

…and risk severe file system corruption during a crash, especially if the drive is doing out-of-order write caching.

Anyway, file systems sitting on top of the device mapper (like software RAID and LVM) may not support barriers and will turn them off automatically.