Storing and accessing a very large number of files

Hello,

I am working with a very large number (~10^5) of files stored in a single folder. I am thinking about increasing this to 10^6.

1/ Could this be bad for the filesystem (ext3) somehow? For example, making it slower for overall use.
2/ What is the maximum number of files that can be present in one folder?
3/ What is the best way to store a large number of files?
4/ What database would you suggest to use in my case?

Thank you for your time.
SDA

Try to spread out the tree by using more levels of directories. For example, you could divide the files by first letter, like this:

a/aardvark
/…
b/bravo
/…

Or two levels:

a/aa/aardvark
/…

If the filenames don’t distribute nicely by alphabet, use a hash function on the filename, as sketched below.
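For instance, a minimal sketch in Perl (the two-hex-digit bucket scheme and the example filename are illustrative choices, not a recommendation for your particular data):

#!/usr/bin/perl
# Map a filename to a two-level bucket directory derived from a
# hash of the name, so files spread evenly even when names cluster.
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

sub sharded_path {
    my ($name) = @_;
    my $h = md5_hex($name);
    # Two hex digits per level: 256 x 256 = 65536 leaf directories.
    return join '/', substr($h, 0, 2), substr($h, 2, 2), $name;
}

print sharded_path('aardvark.pdb'), "\n";   # e.g. 88/3c/aardvark.pdb

With 10^6 files that works out to roughly 15 files per leaf directory, which any filesystem handles comfortably.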

Hi

4/ What database would you suggest to use in my case?

Do you mean that your big number of files are database tables? Then there is possibly something wrong with your layout. Can you explain what you are trying to do?

I use MySQL, but that's my personal taste. Please be aware that MySQL creates 3 files per table when MyISAM tables are used. That would triple the number of your files, but with InnoDB tables you can keep everything in a few huge files.

For MASSIVE databases you should be using PostgreSQL.

I have wondered about the number of files handled in KDE4 as well. My system often grinds to a stall when I am performing an operation on 10^6 files at a time, regardless of their total size. Too many times I have had to dump out, usually with a kill command.

I know it sounds awkward, but you might need to revert to small scripts for repeating an action on smaller groups of files if you are having similar problems.

Don’t forget the industry standard (though widely neglected) practice of backing up REGULARLY!

Cheers!

MySQL can handle massive DBs just as easily as Postgres. Amazon has done it, and Facebook too, among others.

AFAIK there's no fixed limit on how many files one can have inside a directory on ext3 (you are bounded only by the number of inodes on the filesystem). However, there is a limit on how many subdirectories a single directory can hold, and that is about 32,000.

Depending on the file sizes, if most of your files will be large, I would go for XFS. I've just created one million files in a directory on an ext3 partition using a simple for loop (something like the sketch below), and the operation is not the speediest one, especially when ext3 starts to sync/commit changes every 5 seconds (the default) when used in ordered mode. So I would go with XFS if I were you.
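The loop was essentially of this shape (a Perl reconstruction; the directory name and count are illustrative, and the original test may well have used the shell):

#!/usr/bin/perl
# Create N empty files in a single directory to watch how the
# filesystem behaves as the directory fills up.
use strict;
use warnings;

my $dir = 'testdir';
my $n   = 1_000_000;

mkdir $dir or die "mkdir $dir: $!" unless -d $dir;
for my $i (1 .. $n) {
    open my $fh, '>', "$dir/file$i" or die "open: $!";
    close $fh;
}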

Have a look at THIS to see the differences in limits. You’ll see immediately that PostgreSQL handles far larger sizes, namely tables and rows, etc.

Maybe you don’t need this size, but that is the issue here, right?

Filesystems will degrade in performance (for various operations, e.g. lookup, creation, deletion) as the number of files in a directory increases, some more than others. It’s best to do your own benchmarks to see if this matters to you; a rough timing sketch follows below. Don’t forget the filesystem may have options that affect this, e.g. dir_index on ext3 (though that is probably standard now). You can mitigate this by adding directory levels to spread out the tree. It’s a common technique; see how SourceForge lays out its files, or the Squid cache directory.
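If you want a quick-and-dirty number, something like this Perl sketch times lookups in a crowded directory (the directory and name pattern are illustrative, matching the million-file test above):

#!/usr/bin/perl
# Time stat() on random names in a big directory; both hits and
# misses exercise the directory lookup path.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $dir = 'testdir';
my $t0  = [gettimeofday];
for (1 .. 10_000) {
    my $name = sprintf '%s/file%d', $dir, int(rand 1_000_000) + 1;
    stat $name;
}
printf "10000 lookups in %.3f s\n", tv_interval($t0);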

If you are indexing the files and always retrieving a whole file at once, perhaps you might consider storing the data as blobs in an RDBMS.

Care to tell us what application this is?

To change the default size limit for MyISAM tables, set myisam_data_pointer_size, which sets the number of bytes used for internal row pointers. The value is used to set the pointer size for new tables if you do not specify the MAX_ROWS option. The value of myisam_data_pointer_size can be from 2 to 7. Each pointer byte multiplies the addressable size by 256, so a value of 4 allows tables up to 2^32 bytes (4GB), and a value of 6 allows tables up to 2^48 bytes (256TB).

On Linux 2.2, you can get MyISAM tables larger than 2GB in size by using the Large File Support (LFS) patch for the ext2 file system. Most current Linux distributions are based on kernel 2.4 or higher and include all the required LFS patches. On Linux 2.4, patches also exist for ReiserFS to get support for big files (up to 2TB). With JFS and XFS, petabyte and larger files are possible on Linux.

And this is only when using the default MyISAM engine; with InnoDB you can go even higher.

MySQL :: MySQL 5.0 Reference Manual :: B.1.2.12 The table is full

Thank you for the answers - I now have some new ideas.

I am not working with database tables. I am working with text files. To go into a bit more detail: I am running molecular docking, and the files I was talking about are coordinate files of proteins in plain text format. The program enters a directory, takes a file, creates a new lower-level directory, and starts operations with the file inside this directory. Since the FS is getting slower and slower as the number of files approaches 10^5, I started to wonder whether I could store all my coordinate files in a single database file.

So, is there a database that can store plain text files and retrieve them on demand?

Thanks.

You can just store it as a blob column in an RDBMS. The nice thing about an RDBMS is that you can store other metadata in the table, you can index the table by other columns, the DB takes care of concurrent access issues, and you get finer-grained control over access.

Limits

LFS raises the limit on the maximal file size. For 32-bit systems the limit is 2^31 bytes (2 GiB), but using the LFS interface on filesystems that support it, applications can handle files as large as 2^63 bytes.

For 64-bit systems the file size limit is 2^63 bytes, unless a filesystem (like NFSv2) supports less.

(from http://www.suse.de/~aj/linux_lfs.html)

Yes, you should be using a database program in this case. It is short-sighted to have begun with text files to store data for such an application. MySQL might be the easiest to switch to. If you want to do things right, learn PostgreSQL, but if you are very new to programming then your choice should be MySQL. Of course, this assumes two things. Use of MySQL for commercial applications requires a license, which is not so for PostgreSQL. So either you are developing a program for personal use, or you have no problem paying a license fee for use in the commercial program you are developing. If these two assumptions are true then you should use MySQL. Otherwise, PostgreSQL is my advice.

Thank you for answering.

I've installed PostgreSQL and read the manual, and now I have a more advanced question.

1/ I've created a table in the database

mypdb=# create table pdb(
index int,
filename text,
filecontent text
);

2/ Then I want to read a file into it

First I insert metainfo

mypdb=# insert into pdb (index, filename) values (1,
'/home/sda/Documents/Work/PVA_India/PGA_test_modeling/pdb/1gm9.pdb');
INSERT 0 1

But then comes the problem, because the only command I found to read in
the file content is COPY FROM, and the following command does not work:

mypdb=# copy pdb (filecontent) from
'/home/sda/Documents/Work/PVA_India/PGA_test_modeling/pdb/1gm9.pdb'
where index=1;
ERROR: syntax error at or near "where"
LINE 1: ...ts/Work/PVA_India/PGA_test_modeling/pdb/1gm9.pdb' where
inde...

QUESTION: what is the command to read the content of a plain text file
into a SPECIFIED table entry?

Thank you for your time.
SDA

Write a little program in, say, Perl to populate your table. It should take maybe 20 lines max using the DBI module.

Yes, this is perfectly possible with a database. I use MySQL to store something like 50,000 medium-large texts with several keys and fulltext search, or a couple of million weather reports. Backup and reloading get a bit tricky (and slow!) when your tables grow very large, but this is not a limitation of the database.

But keep in mind that a database is not very well suited to keeping tree-like structures. It's possible, though.

genesup, here is a start for you with the Perl DBI.
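Something along these lines should do it. This is a sketch, not tested against your setup: the connect string, user, and password are placeholders, and it assumes the pdb table you created above.

#!/usr/bin/perl
# Read a whole text file into the filecontent column of an
# existing row, using a placeholder so quoting is handled for us.
use strict;
use warnings;
use DBI;

my $file = '/home/sda/Documents/Work/PVA_India/PGA_test_modeling/pdb/1gm9.pdb';

# Connection parameters are placeholders; adjust to your setup.
my $dbh = DBI->connect('dbi:Pg:dbname=mypdb', 'sda', 'secret',
                       { RaiseError => 1, AutoCommit => 1 });

# Slurp the file into one scalar.
open my $fh, '<', $file or die "open $file: $!";
my $content = do { local $/; <$fh> };
close $fh;

# COPY is for loading whole tables; to fill one column of one
# specified row, UPDATE is the right tool.
$dbh->do('UPDATE pdb SET filecontent = ? WHERE index = ?',
         undef, $content, 1);

$dbh->disconnect;

Getting a file back out is the mirror image:

my ($text) = $dbh->selectrow_array(
    'SELECT filecontent FROM pdb WHERE index = ?', undef, 1);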

I hope this helps.

On Thu, 04 Jun 2009 11:16:01 GMT, genesup
<genesup@no-mx.forums.opensuse.org> wrote:

>I am working with a very large number (~10^5) of files stored in a
>single folder. I am thinking about increasing this to 10^6.
>[...]

From my personal experience, file counts in excess of 1E5 are
problematic in any one directory on ext3. Something like 1.4E5 files
in a single directory nearly killed one of my machines once. Do try
to look for alternatives.