Which filesystem to choose for bigger blocksize?


I have an application that uses a 256k blocksize, and I would like a filesystem whose I/O blocksize is as close as possible to that.

I cannot use EXT3, because its blocks are too small.
I ran into problems with XFS: its blocksize depends on the PAGESIZE in the kernel, and it cannot use a blocksize bigger than the kernel's PAGESIZE.

What other options do I have?
Or how can I get XFS to work with bigger blocks?

Please help me.

Regards Tomas

You realize the amount of wasted space you will have with a block size that large? A block is the smallest unit the filesystem will allocate. That means a file of one byte will use 256K.

You are not on Windows, which fragments files badly! That is the only reason to want a huge blocksize.

In Linux, files are stored in block groups large enough to hold the whole file. Then, as space frees up and the system is idle, files are moved to keep as much space available as possible.

The most ideal blocksizes are 512, 1024, 2048 and 4096. Let’s say you have a blocksize of 512 bytes and most of your programs and files are about 500 bytes. Then on average there would be maybe as little as 12 bytes of wasted space, but for very small files of, say, 20 bytes there would be 492 bytes of wasted space. Now you come along with the odd program that needs to save a 256k file. With a 512-byte blocksize it would take 512 blocks with no waste, while still being optimal for the other files that need to be saved.
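The arithmetic above can be checked with a minimal sketch. It assumes a simple model where every file occupies a whole number of blocks (no tail packing or inline data), and the function name `allocation` is made up for illustration:

```python
def allocation(file_size, block_size):
    """Return (blocks_used, wasted_bytes) when a file must occupy whole blocks."""
    if file_size <= 0:
        return 0, 0
    blocks = -(-file_size // block_size)  # integer ceiling division
    return blocks, blocks * block_size - file_size


if __name__ == "__main__":
    # The cases discussed in the post, with a 512-byte blocksize.
    for size in (20, 500, 256 * 1024):
        blocks, wasted = allocation(size, 512)
        print(f"{size} bytes, 512-byte blocks: {blocks} block(s), {wasted} bytes wasted")

    # A one-byte file on a 256k blocksize.
    blocks, wasted = allocation(1, 256 * 1024)
    print(f"1 byte, 256k blocks: {blocks} block(s), {wasted} bytes wasted")
```

Under this model the waste per file is always less than one block, so it matters for many small files and hardly at all for a few large backup volumes.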

With a 256k blocksize you would need about a 500 terabyte drive for a basic Linux system to function as well as it does with a 4k blocksize on a 20GB drive. You get the idea.

Thanks for the quick reply.

The application is a backup application, so the “waste” of space is not a big problem here.
But the small blocks degrade performance.


I don’t know of any backup program that needs huge blocks. HDD read/write is controlled at a fixed rate by the head control electronics. I really think you need a better understanding of how sectors, clusters, platters, and azimuth play into the drive’s geometry. Asking for a blocksize greater than 8192 bytes (8k) will seriously affect even a mainframe, to the point of rendering it virtually useless.

Are you sure you mean blocksize? What backup program are you trying to use? Maybe they are meaning something totally different.

If you have an 8ms access time on a drive, and the drive is at rest, it will take .008 seconds to bring the drive to life. Then comes the rotational speed of, say, 5400rpm. In addition, there is track-step time, number of tracks, number of heads, and physical bit write speed. All of these come into play to define the speed at which the drive can perform. If, in the course of making a backup, the drive must seek back and forth to fetch the various blocks of data, and seek again to a location to write the data back out (assuming source and destination are on the same drive, maybe in different partitions), transfer speed will suffer. Like I said before, this is typically found on Windows when the drive is fragmented (files broken into little chunks spread all over the drive). In Linux, this fragmentation does not occur unless the partition is almost completely full. Thus, on a Linux system, drive seek performance stays at its best. The biggest obstacles to fast backups are other running processes and backing up to slow devices (CDRW, DVDRW, USB flash, and external USB hard disks), which can’t read or write at the speeds possible from an internal drive.
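To put rough numbers on the seek and rotation costs mentioned above, here is a back-of-the-envelope sketch. It assumes the worst case where every request pays a full 8ms seek plus, on average, half a revolution at 5400rpm; the 100MB/s media rate is a made-up illustrative figure, not a measurement from this thread:

```python
def avg_rotational_latency_s(rpm):
    """Average rotational latency: half a revolution, in seconds."""
    return 60.0 / rpm / 2.0


def random_io_throughput_mb_s(request_bytes, seek_s=0.008, rpm=5400,
                              media_rate_mb_s=100.0):
    """Estimated MB/s when every request needs a seek and a rotational wait.

    media_rate_mb_s is an assumed sustained transfer rate for illustration.
    """
    transfer_s = request_bytes / (media_rate_mb_s * 1e6)
    per_request_s = seek_s + avg_rotational_latency_s(rpm) + transfer_s
    return request_bytes / per_request_s / 1e6


if __name__ == "__main__":
    print(f"rotational latency at 5400rpm: {avg_rotational_latency_s(5400) * 1000:.2f} ms")
    for size in (4 * 1024, 64 * 1024, 256 * 1024):
        print(f"{size // 1024}k requests: ~{random_io_throughput_mb_s(size):.2f} MB/s")
```

Sequential transfers avoid most of the per-request cost, so this is only a lower bound for the seek-heavy case described in the post.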

We have decided not to go with Linux because of the bad I/O performance.
We lost 100MB/s in the filesystem layer (80%), compared to the device layer.
On AIX we only lost 10%.
On Windows we will probably only lose around 5-10% too.

daleto wrote:
> On AIX we only lost 10%.
> On Windows we will probably only lose around 5-10% too.

sounds like you don’t know how to run big iron…

so, sticking to Redmond is probably your best course…
don’t forget the nose ring…

DenverD (Linux Counter 282315)
CAVEAT: http://is.gd/bpoMD
posted via NNTP w/TBird | KDE 3.5.7 | openSUSE 10.3 SMP i686
AMD Athlon 1 GB RAM | GeForce FX 5500 | ASRock K8Upgrade-760GX |
CMedia 9761 AC’97 Audio

This hasn’t been my experience, but it’s your choice. You talk about loss between the filesystem and device layers and yet have not said which backup software you are using. AIX supports up to 8192-byte blocks; Windows 95/98/ME support 512/2048/4096-byte blocks; Windows 2K/XP/Vista support 512-byte blocks if pre-created, otherwise only 4096-byte; and Linux supports 512b/1kb/2kb/4kb/8kb.

As for filesystems, Windows used fat12/fat16/fat32 before 2K and added NFS for NT, NTFS for 2k/XP/Vista. AIX uses one of 16 JFS formats; Linux/Unix use ext2/ext3/ext4 as regular formats.

This from the AIX forum:

What Blocksize does AIX use? I am looking for something around 512000 for easy backups.
"What you are talking about is something completely different. This just means that when creating/changing a filesystem you can specify the size of it in multiples of 512 (bytes), Megabytes or Gigabytes. Nothing more, nothing less.
Basically put, the device layer will always == the whole size of the defined filesystem, which is a single partition (physical or logical). Now, the larger you make the sector size, the more wasted space occurs when dealing with smaller files.
Thus, if you're referring to loss as a function of %, it's probably because you have a large number of files that are very small compared to the sector size."

I am using IBM Tivoli Storage Manager
with a File type device, which handles 256k blocks.

We ran some tests on Linux, and we couldn’t get more than 50-70MB/s,
compared to IBM AIX, which delivers 120MB/s.
The device layer is capable of delivering 150MB/s.
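For reference, the figures just quoted can be turned into a loss relative to the device layer with a short sketch. It uses only the 150MB/s, 50-70MB/s and 120MB/s numbers from this post; the results need not match percentages quoted elsewhere in the thread, which may come from different runs:

```python
def loss_percent(device_mb_s, filesystem_mb_s):
    """Throughput lost in the filesystem layer, as a percentage of the device rate."""
    return (device_mb_s - filesystem_mb_s) / device_mb_s * 100.0


if __name__ == "__main__":
    device = 150.0  # MB/s at the device layer, as quoted above
    for label, rate in (("Linux, low end", 50.0),
                        ("Linux, high end", 70.0),
                        ("AIX", 120.0)):
        print(f"{label}: {rate:.0f} MB/s, {loss_percent(device, rate):.1f}% below the device layer")
```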

So EXT3 is not a good choice for TSM,
and neither is XFS.
Or are there more things to tune?

I suggest you go to IBM Tivoli Storage Manager support, because after reading the details on the site it looks like you have not set the device up properly. The loss and the blocksize you are talking about refer to the read speed at the source HDD vs. the write speed at the destination tape, and the residual wait time between operations; the blocksize refers to how much data is transferred at a time.

This is clearly an issue not of the filesystem / sector size but of the way the system is configured to do the transfer. Given the SAME HDD used and tested under AIX, Windows, and Linux, the source read speed or destination write speed of the drive will be virtually the same.

We are emulating the I/O using iozone with the same block size that TSM writes.
The logical volume can deliver 150MB/s but EXT3 can only deliver 50MB/s. I do not understand why it should be a TSM problem that the EXT3 filesystem drops 100MB/s in performance. The only thing I can figure out is that there is no “read-ahead” or other caching mechanism in EXT3 (or I do not understand how to configure it) to bundle the 4k EXT3 blocks into bigger I/O blocks for the underlying layer (LVM).
AIX has vom option to tune the I/O behaviour, but what is the equivalent in EXT3?
Or can I use bigger filesystem blocks from the start, e.g. 256k blocks on the filesystem?
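On Linux, read-ahead and the maximum request size are properties of the block device rather than of EXT3 itself, and they are exposed under `/sys/block/<device>/queue/`. The sketch below only reads the current values. The helper name is made up, and the device names `sda` and `dm-0` are assumptions; an LVM logical volume shows up as a `dm-N` device with its own settings:

```python
from pathlib import Path


def read_queue_settings(device, sysfs_root="/sys/block"):
    """Read a few block-queue tunables (values in KiB) for one block device.

    Returns None for any value that is missing or unreadable.
    """
    queue = Path(sysfs_root) / device / "queue"
    settings = {}
    for name in ("read_ahead_kb", "max_sectors_kb", "max_hw_sectors_kb"):
        try:
            settings[name] = int((queue / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings


if __name__ == "__main__":
    # "sda" and "dm-0" are placeholder device names; substitute your own.
    for dev in ("sda", "dm-0"):
        print(dev, read_queue_settings(dev))
```

Changing `read_ahead_kb` requires root; `blockdev --getra` and `blockdev --setra` operate on the same read-ahead setting, expressed in 512-byte sectors. Whether a larger value helps this particular workload would have to be measured.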

iozone is a benchmark tool that measures, for a given RECORD size of transfer, what a typical read/write/seek speed will be for a specific set of parameters. The data it provides is virtually useless if you are emulating an I/O process under a different OS. In real life, transfers are governed by the real OS and system load. If you look at the graph provided on the iozone site you will note along the left the record size, from 0 to 150,000 bytes per record, along the z-axis the sector size as a function of .5/1/2/4/8/16k, and along the y-axis the type of transfer being done.
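For readers unfamiliar with what a fixed record size means in such a test, here is a minimal sketch of a sequential write using 256 KiB records. It is not iozone and makes no attempt to reproduce its methodology; the function name, the temporary file location and the 16 MiB total are illustrative assumptions:

```python
import os
import tempfile
import time


def timed_sequential_write(path, total_bytes, record_bytes=256 * 1024):
    """Write total_bytes to path in record_bytes chunks; return (bytes, MB/s)."""
    buf = b"\0" * record_bytes  # all zeros, so any slice of it is valid payload
    written = 0
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while written < total_bytes:
            chunk = buf[: min(record_bytes, total_bytes - written)]
            written += os.write(fd, chunk)  # os.write may write fewer bytes
        os.fsync(fd)  # include the flush to disk in the timing
    finally:
        os.close(fd)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return written, written / elapsed / 1e6


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "records.bin")
        n, rate = timed_sequential_write(target, 16 * 1024 * 1024)
        print(f"wrote {n} bytes in 256 KiB records at about {rate:.1f} MB/s")
```

A file this small mostly measures the page cache and the cost of the final `fsync`, so real tests need a total size well above the machine's RAM.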

You really need to read the full details on the IBM TSM site and read the iozone disclaimers! I don’t know of many databases with record sizes of 150,000 bytes. That would mean only 6 records per MB. What are you doing, computing seismic data for the whole earth in one database?

I don’t know how I can help further; maybe someone else has some clues.

I just checked my system running openSUSE 11.2 and it benchmarks at 148,933 bytes per second with a sector size of 4k. Yes, Linux does use look-ahead caching. Part of the cache is in the hard disk controller circuitry and a small part is handled by the OS.

It is great that EXT3 does read-ahead.
How about caching of writes too, for performance improvements?
And what parameters can I use to tune it?
Is the blocksize still 4k if you look with iostat?
I have never seen EXT3 try to send bigger I/O to the underlying layer.
It seems to be fixed at 4k?
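One way to check the request size actually reaching a device is to derive it from the counters in `/proc/diskstats`, which iostat also reads. The filesystem blocksize and the request size seen at the block layer are different things, because the block layer can merge adjacent requests. The sketch below computes the average since boot; the function name is made up for illustration:

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def avg_request_kib(diskstats_line):
    """Return (device, avg_read_KiB, avg_write_KiB) from one /proc/diskstats line.

    An average is None when no requests of that kind have completed.
    """
    f = diskstats_line.split()
    name = f[2]
    reads, sectors_read = int(f[3]), int(f[5])
    writes, sectors_written = int(f[7]), int(f[9])
    avg_read = sectors_read * SECTOR_BYTES / reads / 1024 if reads else None
    avg_write = sectors_written * SECTOR_BYTES / writes / 1024 if writes else None
    return name, avg_read, avg_write


if __name__ == "__main__":
    try:
        with open("/proc/diskstats") as stats:
            for line in stats:
                if len(line.split()) >= 10:
                    print(avg_request_kib(line))
    except OSError:
        print("/proc/diskstats is not available on this system")
```

If the write average for the logical volume stays near 4 KiB during a backup run, requests are not being merged; an average well above that would suggest the 4k blocksize is not what limits the request size.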