
Thread: Storing and accessing a very big number of files

  1. #1

    Default Storing and accessing a very big number of files

    Hello,

    I am working with a very large number of files (~10^5) stored in a single folder, and I am thinking about increasing this to 10^6.

    1/ Could this be bad for the filesystem (ext3) somehow, for example by making it slower for overall use?
    2/ What is the maximum number of files that can be present in one folder?
    3/ What is the best way to store a large number of files?
    4/ What database would you suggest to use in my case?

    Thank you for your time.
    SDA

  2. #2
    Join Date
    Jun 2008
    Location
    UTC+10
    Posts
    9,686
    Blog Entries
    4

    Default Re: Storing and accessing a very big number of files

    Try to spread out the tree by using more levels of directories. For example, you could divide the files by first letter like this:

    a/aardvark
    /...
    b/bravo
    /...

    Or two levels:

    a/aa/aardvark
    /...

    If the filenames don't distribute nicely across the alphabet, use a hash function on the filename.
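    If you want to go the hash route, here is a rough sketch in Python of what I mean (the bucket_path/store helper names are just made up for illustration):

    Code:
    import hashlib
    import os
    import shutil

    def bucket_path(root, filename, levels=2, width=2):
        # Hash the filename and use slices of the hex digest as directory names,
        # e.g. root/5f/2c/aardvark for levels=2 and width=2 (the actual prefixes
        # depend on the hash of the name).
        digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
        parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
        return os.path.join(root, *parts, filename)

    def store(root, src):
        # Copy src into its hashed bucket under root, creating directories as needed.
        dest = bucket_path(root, os.path.basename(src))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy2(src, dest)
        return dest

    # store("data", "aardvark.txt") ends up somewhere like data/5f/2c/aardvark.txt

    With two levels of two hex characters you get 256 x 256 buckets, so a million files works out to roughly 15 per directory.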

  3. #3
    Join Date
    Jan 2009
    Location
    Switzerland
    Posts
    1,529

    Default Re: Storing and accessing a very big number of files

    Hi

    4/ What database would you suggest to use in my case?
    Do you mean that your big number of files are database tables? Then there is possibly something wrong with your layout. Can you explain what you are trying to do?

    I use MySQL, but that's my personal taste. Please be aware that MySQL creates three files per table when MyISAM tables are used. That would triple the number of your files, but with InnoDB tables you can keep everything in a few huge files.

  4. #4

    Default Re: Storing and accessing a very big number of files

    For MASSIVE databases you should be using PostgreSQL.

    I have wondered about how many files KDE4 can handle as well. My system often grinds to a stall when I run an operation on 10^6 files at a time, regardless of their total size. Too many times I have had to bail out, usually with a kill command.

    I know it sounds awkward, but if you are running into similar problems you might need to fall back to small scripts that repeat an action on smaller groups of files.

    Don't forget the industry-standard (though widely neglected) practice of backing up REGULARLY!

    Cheers!

  5. #5
    Join Date
    Jun 2008
    Location
    /dev/belgium
    Posts
    1,946

    Default Re: Storing and accessing a very big number of files

    MySQL can handle massive DBs just as easily as Postgres. Amazon has done it, and Facebook too, among others.

    AFAIK, there's no limit to how many files one can have inside a directory on ext3. However, there is a limit on how many subdirectories one can have, and that is about 32,000.

    Depending on the file sizes: if most of your files will be large, I would go for XFS. I've just created one million files in a directory on an ext3 partition using a simple for loop, and the operation is not the speediest, especially when ext3 starts to sync/commit changes every 5 seconds (the default) in ordered mode. So I would go with XFS if I were you.
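    If you want to reproduce that test yourself, something along these lines does it (a rough sketch in Python rather than my shell loop; the target path is just an example, point it at your own partition):

    Code:
    import os
    import time

    def create_files(directory, count=1000000):
        # Create `count` tiny files in `directory` and return the elapsed time in seconds.
        os.makedirs(directory, exist_ok=True)
        start = time.time()
        for i in range(count):
            with open(os.path.join(directory, "file_%07d.txt" % i), "w") as f:
                f.write("x")
        return time.time() - start

    # print(create_files("/mnt/test/manyfiles"))

    Running it once on ext3 and once on XFS will show you the difference on your own hardware.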

  6. #6

    Default Re: Storing and accessing a very big number of files

    Have a look at THIS to see the differences in limits. You'll see immediately that PostgreSQL handles far larger sizes for tables, rows, etc.

    Maybe you don't need this size, but that is the issue here, right?

  7. #7
    Join Date
    Jun 2008
    Location
    UTC+10
    Posts
    9,686
    Blog Entries
    4

    Default Re: Storing and accessing a very big number of files

    Filesystems will degrade in performance (for various operations, e.g. lookup, creation, deletion) as the number of files in a directory increases, some more than others. It's best to do your own benchmarks to see whether this matters to you. Don't forget the filesystem may have options that affect this, e.g. dir_index on ext3 (though that is probably standard by now). You can mitigate this by adding directory levels to spread out the tree. It's a common technique; see the SourceForge filesystem or the Squid cache directory.

    If you are indexing the files and only ever retrieve a whole file at once, you might consider storing the data as blobs in an RDBMS.

    Care to tell us what application this is?

  8. #8
    Join Date
    Jun 2008
    Location
    /dev/belgium
    Posts
    1,946

    Default Re: Storing and accessing a very big number of files

    Quote Originally Posted by recraig2 View Post
    Have a look at THIS to see the differences in limits. You'll see immediately that PostgreSQL handles far larger sizes for tables, rows, etc.

    Maybe you don't need this size, but that is the issue here, right?
    To change the default size limit for MyISAM tables, set the myisam_data_pointer_size, which sets the number of bytes used for internal row pointers. The value is used to set the pointer size for new tables if you do not specify the MAX_ROWS option. The value of myisam_data_pointer_size can be from 2 to 7. A value of 4 allows tables up to 4GB; a value of 6 allows tables up to 256TB.
    On Linux 2.2, you can get MyISAM tables larger than 2GB in size by using the Large File Support (LFS) patch for the ext2 file system. Most current Linux distributions are based on kernel 2.4 or higher and include all the required LFS patches. On Linux 2.4, patches also exist for ReiserFS to get support for big files (up to 2TB). With JFS and XFS, petabyte and larger files are possible on Linux.
    And this is only when using the default MyISAM engine; with InnoDB you can go even higher.

    MySQL :: MySQL 5.0 Reference Manual :: B.1.2.12 The table is full

  9. #9

    Default Re: Storing and accessing a very big number of files

    Thank you for the answers - I now have some new ideas.

    I am not working with database tables; I am working with text files. To go into a bit more detail: I am running molecular docking, and the files I was talking about are coordinate files of proteins in plain-text format. The program enters a directory, takes a file, creates a new lower-level directory, and starts operating on the file inside that directory. Since the FS is getting slower and slower as the number of files approaches 10^5, I started to wonder whether I could store all my coordinate files in a single database file.

    So is there a database that can store plain files and retrieve them on demand?

    Thanks.

  10. #10
    Join Date
    Jun 2008
    Location
    UTC+10
    Posts
    9,686
    Blog Entries
    4

    Default Re: Storing and accessing a very big number of files

    You can just store it as a blob column in an RDBMS. The nice thing about an RDBMS is that you can store other metadata in the table, you can index the table by other columns, the DB takes care of concurrent access issues, and you have finer-grained control over access.
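    For example, here is a rough sketch in Python using sqlite3 (chosen only because it needs no server; the same idea works with a blob column in MySQL or PostgreSQL, and the table and column names are just made up):

    Code:
    import os
    import sqlite3

    def init_db(path="coords.db"):
        # One table: filename as the key, some metadata, and the file contents as a blob.
        conn = sqlite3.connect(path)
        conn.execute("""CREATE TABLE IF NOT EXISTS coord_files (
                            name    TEXT PRIMARY KEY,
                            protein TEXT,
                            data    BLOB)""")
        return conn

    def store_file(conn, filepath, protein):
        # Read a plain-text coordinate file and store its contents as a blob.
        with open(filepath, "rb") as f:
            data = f.read()
        conn.execute("INSERT OR REPLACE INTO coord_files (name, protein, data) VALUES (?, ?, ?)",
                     (os.path.basename(filepath), protein, data))
        conn.commit()

    def fetch_file(conn, name):
        # Retrieve the stored contents on demand.
        row = conn.execute("SELECT data FROM coord_files WHERE name = ?",
                           (name,)).fetchone()
        return row[0] if row else None

    Because everything lives in one database file, the directory never fills up with 10^5 entries, and you can still query by the protein column or whatever other metadata you add.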

