Looking for advice on distributed filesystem choice for openSUSE

Hi all.

First of all, sorry if this is the wrong place in the forums to put this question: I was unsure where it would be most appropriate, especially after it was refused on the mailing lists.

I’m looking for some advice for a test implementation of a distributed filesystem at a computational chemistry students lab. If it is successful, the same solution will be considered for the (small) research lab clusters available.

At the research lab we have been using NFS for quite some time (since the ’90s) under SUSE, even for our clusters. The students lab has also usually used similar solutions, but for reasons that are unimportant here a single NFS server won’t be an option now. The students lab is also being upgraded at this moment, and will have six new i3 computers with 500 GB disks (which will still share the room with five old Pentium Ds with 200 GB disks).

The idea that we are considering is the following:

  1. The old Pentium Ds will mount an NFS /home from the i3s (with “failure fallback” between them);

  2. The i3s will serve the NFS /home for the Pentium Ds;

  3. In order to have all computers access all files coherently, a distributed filesystem must be used among the i3s for this /home. This will also increase the available disk space due to data striping;

  4. Reliability is, however, important, and as such we are considering that instead of 3 TB (6 × 500 GB) of total space we would have only 2 TB (4 × 500 GB, with the other two disks’ worth of space used for redundancy) and be able to withstand up to two i3 computers failing (a rough capacity sketch follows this list);

  5. Reliability must go further than just the files: any necessary metadata servers and other services must also be redundant, optimally replicated on every i3 computer;

  6. The filesystem must also be able to deal with all 11 computers accessing, with both reads and writes, the whole /home. From molecular dynamics to quantum mechanics calculations and molecular docking, it must handle different demands and some big (several GB) files, all simultaneously;

  7. Also, we would not like to have to give special instructions to the users, like “you can’t rename a file; copy it and delete the original later”, as some (old?) Gluster instructions indicate is necessary;

  8. The simpler the implementation and maintenance, the better it will be, both for the current students lab and for future production cluster use.
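
To make the arithmetic in item 4 explicit, here is a rough sketch of the capacity trade-off; the “4 data + 2 redundancy” split is just my framing of the requirement, and the real layout would depend on the filesystem chosen:

```python
# Capacity trade-off for the students lab: six i3 machines with one
# 500 GB disk each, reserving two disks' worth of space for redundancy
# so that up to two machines can fail. Purely illustrative numbers.

def usable_capacity(nodes, disk_gb, failures_tolerated):
    """Usable space in a RAID-6-like 'k data + m parity' layout."""
    return (nodes - failures_tolerated) * disk_gb

raw = 6 * 500                         # 3000 GB (~3 TB) of raw space
usable = usable_capacity(6, 500, 2)   # 2000 GB (~2 TB), survives 2 failures
print(f"raw = {raw} GB, usable = {usable} GB")
```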

From what I can gather, the most usual suggestions are Ceph, Gluster, or BeeGFS. Which of these are available on openSUSE Leap 15? Are there any other options worth considering? Which ones are recommended for the proposed application?

Any help or direction on this will be really useful, and we would be very grateful for it.

Thanks a lot in advance.

Let’s first consider one of your proposed ideas which is to mount /home as a shared directory used by all.

In principle, that’s not ordinarily advisable.
The /home partition is typically associated with a User profile tied to the logged-in User. A “better” decision is for every person on your network to log in with their own personal, unshared credentials so that each person’s activities are unique. It’s the only way that you can have accountability. Once Users start sharing login credentials and network resources under the same credentials, you’ve lost any way of identifying who did what and when. Even if everyone on your network is your most trusted family, things can get sticky when odd things happen, people point fingers at each other, and you have no way to tie a particular action at a particular time to the User whose credentials were used. This should all sound somewhat familiar if you’ve been running your lab for years; it’s hard to avoid situations where you’d want to know who did what.

So,
/home should be unique to the logged-in User.
And you can mount network resources separately and make access to those resources easy to find… Like a shortcut on the Desktop. Or a website with links that automatically connect to, or launch, those mount points. Or web pages with instructions. Or whatever else can be imagined.

Now,
As for installing Ceph, you can if you wish:
https://en.opensuse.org/openSUSE:Ceph
And you can install GlusterFS if you wish:
https://software.opensuse.org/package/glusterfs

But, you’d have to really consider whether either of those addresses any need. Do Users really need to share their results? Do your lab exercises require intensive writing to a shared resource?

Typically, the lab exercises I’m familiar with do nothing of the above: every exercise only reads from, and doesn’t write to, a shared resource. Users/students run exercises individually or as individual groups, and the results are stored as “own” files which, even if uploaded somewhere, would still be in an “own” location, not used by others to do additional work.

Note that if you’re setting up an application to run in a lab (or otherwise), that would be different. Any time you <run an application>, the specific requirements for that stand alone.

The requirements to support what I just described are relatively modest… You might cluster as one strategy to support fault tolerance, but you shouldn’t have any real issues with performance or contention of any type even today before considering any changes. If your requirements haven’t changed, then I’d ask why you might want to change what has worked for you, and what you’re familiar with.

HTH and IMO,
TSU

On 07/08/2018 08:56 PM, tsu2 wrote:

What was originally proposed, and what you are writing here at the top of
your post, were, I think, entirely different things.

> Let’s first consider one of your proposed ideas which is to mount /home
> as a shared directory used by all.

In a normal system, /home is a shared directory used by all, but just not
directly; it is shared and used by ‘joe’ to provide /home/joe, and by
‘susie’ to provide /home/susie, etc. Maybe the OP meant it your way, but
that just sounds crazy, as you pointed out below.

> In principle, that’s not ordinarily advisable.
> The /home partition is typically associated with a User profile tied to
> the logged-in User. A “better” decision is for every person on your
> network to log in with their own personal, unshared credentials so that
> each person’s activities are unique. It’s the only way that you can have
> accountability. Once Users start sharing login credentials and network
> resources under the same credentials, you’ve lost any way of identifying
> who did what and when. Even if everyone on your network is your most
> trusted family, things can get sticky when odd things happen, people
> point fingers at each other, and you have no way to tie a particular
> action at a particular time to the User whose credentials were used.
> This should all sound somewhat familiar if you’ve been running your lab
> for years; it’s hard to avoid situations where you’d want to know who
> did what.

It did not sound, to me, like the proposal was to simply have everybody
write directly into a shared /home directory, but to mount /home on all
systems and then still have separate directories within it for each user,
with appropriate/normal permissions.

> But, you’d have to really consider whether either of those addresses any
> need. Do Users really need to share their results? Do your lab exercises
> require intensive writing to a shared resource?
>
> Typically, the lab exercises I’m familiar with do nothing of the above:
> every exercise only reads from, and doesn’t write to, a shared resource.
> Users/students run exercises individually or as individual groups, and
> the results are stored as “own” files which, even if uploaded somewhere,
> would still be in an “own” location, not used by others to do additional
> work.

I guess my experience here is more like the OP’s; while in many academic settings each user should do their own assignment sans collaboration, I’ve also seen a lot of work in both academia and the commercial world where sharing is the norm among teams. Maybe sharing within /home is not the norm, but having /home available (with subdirectories) for each user, and then having another shared location (maybe /home/finance-group or /var/finance-group or whatever), is also necessary for projects. Sure, users can share files in other ways, but depending on the task at hand, sharing directly with a group is much more efficient, especially as the groups get beyond a few individuals.

Just a few more opinions, worth whatever anybody gets from them.


Good luck.


@ab,
Agree with everything you posted.

But,
If collaboration on a project is the desired organization,
Then I’d suggest various common solutions…

  • Directory services like LDAP or AD support “single sign-on” to grant group-level permissions to resources.
  • A web-based solution, e.g. a wiki, is commonly used as a gathering point. Users are granted permission to contribute at various levels. A CMS might also be used.

Either of the above can then be applied to services and resources like file shares, specific applications, and more.

And, perhaps I was too quick to assume that a shared /home would mean shared User settings, file storage, etc.
But, if the goal is collaboration, nothing comes to mind at the moment that would be simply deployed in /home, although it’s certainly possible to store <settings> that point to collaborative resources.

TSU

I have probably got this wrong, but I read the OP differently from both tsu2 and ab. I took the critical phrases to be “distributed file system”, “reliability/redundancy”, and “accessibility”. Mounting, NFS, etc. just seemed to be means to an end.

The use of a striped filesystem across multiple machines and resiliency are contradictory goals. Whether a suitable compromise is available would depend on the number of machines (old and new) that are available. Hard drive (spinning) space is cheap. I would abandon striping and consider fitting the older machines with larger hard drives to avoid them being dependent on a cluster or a partnered new machine.

Assume that the application data files are stored (student read-only) in “/home/chemlabdata/” or some such, and that we have only a small number (fewer than 100?) of active students.
Use rsync to synchronise each local “/home” with the most recent versions available in the lab at the start of the day and/or at user login. At logout, use an rsync script to update “$HOME” on the other machines to the local version. If a student/user makes a mess of a session, they can simply crash out of the session (<ctrl+alt+backspace>) without logging out, so as to start again. Combined with an out-of-lab backup this gives a lot of resiliency with minimum effort and is easy to administer.
I have used this approach with rooms of 30 machines and 90+ users in the past – BSD Unix and SuSE/openSUSE Linux.
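
In case it helps, a rough sketch of the login/logout synchronisation idea; the hostnames are placeholders and the rsync options are only a starting point, so adapt to your lab:

```python
#!/usr/bin/env python3
# Sketch of the per-user home synchronisation described above: pull the
# newest copies of $HOME from the other lab machines at login, push the
# local copy back out at logout. Hostnames below are hypothetical.
import getpass
import subprocess

PEERS = ["lab01", "lab02", "lab03"]          # the other workstations
USER = getpass.getuser()
HOME = f"/home/{USER}/"

def pull_home():
    """At login: fetch files that are newer on any peer."""
    for host in PEERS:
        subprocess.run(["rsync", "-a", "--update", f"{host}:{HOME}", HOME],
                       check=False)           # skip peers that are down

def push_home():
    """At logout: update the peers with the local versions."""
    for host in PEERS:
        subprocess.run(["rsync", "-a", "--update", HOME, f"{host}:{HOME}"],
                       check=False)

if __name__ == "__main__":
    pull_home()   # hook push_home() into the logout script instead
```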

Hi guys.

First, thank you all for the prompt replies! I wasn’t expecting it to be so fast.

Second, let me clarify the misunderstandings pointed out mostly by tsu: I really meant that /home would be shared over the network, but each user (a student in this case) only has access to their own /home/$USERNAME directory (in the research clusters, where there is a need for different users to access the same files, we deal with that through XFS capabilities on demand).

In this way, every student would have their own files readily available on every single computer: except for network or server failures, they would be resilient to single-seat failures of any kind.

Also: there is no way of using a single NFS server now, because each computer we got has a single 500 GB disk which, both bureaucratically and physically, cannot be put together with the others into a single NFS file server machine. That is why we are considering putting a distributed filesystem in place. The widely known Brazilian economic crisis accounts for the impossibility of upgrading the older computers (getting new ones in place of the totally broken ones is already enough of a miracle nowadays).

As such, our main problem is: what would be the recommended approach to building “some sort of over-the-network RAID among different computers”, allowing both the striping of data that improves space usage and the redundancy that provides resilience?

I see it like the following: let’s suppose I have data “ABCDEF” which, spread across 6 computers, would be:
1: ABC
2: BCD
3: CDE
4: DEF
5: EFA
6: FAB

Somewhat like RAID 6 would be, but over the network. I think a well-chosen distributed filesystem is the answer, but which one, and with what configuration?
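
To show what I mean by the placement (and only the placement; the parity maths itself would be whatever the chosen filesystem implements), a toy sketch:

```python
# Toy illustration of the "network RAID 6" idea: each stripe is split into
# 4 data chunks plus 2 parity chunks and rotated across the 6 machines,
# so losing any 2 machines still leaves the stripe recoverable.

NODES = 6
DATA_CHUNKS, PARITY_CHUNKS = 4, 2

def place_stripe(stripe_index, chunks):
    """Rotate the chunks of one stripe across the nodes, one per node."""
    assert len(chunks) == DATA_CHUNKS + PARITY_CHUNKS == NODES
    return {f"node{(stripe_index + i) % NODES + 1}": chunk
            for i, chunk in enumerate(chunks)}

print(place_stripe(0, ["A", "B", "C", "D", "P1", "P2"]))
# {'node1': 'A', 'node2': 'B', 'node3': 'C', 'node4': 'D',
#  'node5': 'P1', 'node6': 'P2'}  -- and the next stripe starts on node2.
```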

Thanks a lot for all the answers! :slight_smile:

For what you’re describing you might also take a look at DRBD.
Keep in mind, there is a good chance this is overkill, but it can deliver the best solution supporting High Availability.
I was going to set this up on a project years ago, but the SUSE/openSUSE implementation was broken at that time so I had to pass (I didn’t want to take the time to learn the whole thing on RHEL)…

But, in response to a bug I submitted,
DRBD is supposed to be fixed, and it looks like, besides the standard user-space management tools, there is even a YaST module (that’s cool, isn’t it?).

The Wikipedia entry
https://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
SLES documentation
https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_drbd_configure.html
https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_drbd_configure.html

Note that this is the kind of thing that you set up when you need really fast performance across a network, and which is ordinarily very expensive to set up… It’s the kind of thing you might consider if something like rsync isn’t suitable for some reason: the files are too large, you need to minimize latency, you need nearly real-time mirroring, etc.

TSU

I concur. Investigate using DRBD for a cost-effective open source solution.

https://en.opensuse.org/openSUSE:High_Availability

Thanks for the suggestion. However, it seems that DRBD is still not the right solution for the problem at hand. :’(

As you mentioned, rsync is not a solution due to both the big file sizes (100s of GB easily) and the real-time mirroring needed (say a computer fails during class: the student must just jump to the next seat and continue working, which is not doable with rsync solutions).

DRBD also has a big issue: while having a dedicated YaST module is lovely, it only “kind of mimics” a RAID 1 configuration, and striping plus parity, something closer to RAID 6, is a requirement (yes, I’m considering the possibility of a second failure before the requested fix for the first failure arrives: it has already happened more than once! :stuck_out_tongue: ).

From this suggestion, however, I learned about NBD which, while a very interesting idea for other problems, still doesn’t “quite fill the gap”: unfortunately, as it only allows a single read-write access, only one NFS server can work on it, thereby again creating a “single point of failure”.

I spent a good chunk of the day studying both Gluster and Ceph. Gluster is almost out of the pool: while pretty easy to install, the impossibility of using a simple “mv” command for renaming, together with the whole-file storage approach it seems to use, is a show stopper. I cannot, for example, accept a situation where I have 4 × 400 GB disks (not even considering parity here, just for the sake of argument) and might have issues writing 5 × 300 GB files. Other smaller issues also seemed to “float around” it.
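
To spell out the arithmetic behind that concern (assuming a purely distributed, non-striped volume where each file must live whole on a single brick; the “pick the emptiest brick” placement below is only a simplification for illustration):

```python
# Why 5 x 300 GB files can fail to fit on 4 x 400 GB bricks when each file
# must be stored whole on one brick, even though total free space remains.

bricks = [400, 400, 400, 400]        # free GB per brick
files = [300] * 5
placed = 0

for size in files:
    idx = max(range(len(bricks)), key=lambda i: bricks[i])  # emptiest brick
    if bricks[idx] >= size:
        bricks[idx] -= size
        placed += 1
    else:
        print(f"no brick has {size} GB free; per-brick free space: {bricks}")
        break

print(f"placed {placed} of {len(files)} files; total free: {sum(bricks)} GB")
# -> the 5th file fails even though 400 GB is still free across the volume.
```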

Ceph seems like a good option; however, its metadata server approach concerns me: it can now have more than one metadata server, but I’m unsure how redundancy works among them (so, if I lose one or two of them, will the rest still be able to serve the whole system and keep the files accessible?), and whether I can put a metadata server on each computer together with the actual file storage, or whether I have to have dedicated computers (which would be unacceptable and a total show stopper once again).

Does anybody know the answers to the above questions (tomorrow I’ll also post them on the Ceph forums or similar)? Any alternative suggestions, either other distributed filesystems or a completely different approach?

Once again, thanks for all the help up to now. lol!

The thing that makes DRBD lightning fast for mirroring large files is that you’re talking about <block> replication, i.e. only the changes at the disk block level are replicated.

Take for example a 4 GB data file; it could be an enormous spreadsheet (?!) or an RDBMS file, and only one tiny part of the file is changed, like a single sentence or a single data point. With a plain file copy, the entire 4 GB file would have to be transferred to maintain the mirror’s integrity (rsync’s delta algorithm can cut the network transfer down, but it still has to read and checksum the whole file on both ends). With DRBD, only the changed data in 256 KB or 512 KB blocks would be replicated, perhaps totaling a transfer of only 4 MB. That’s a big difference in network transfer: 4 GB vs 4 MB.
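
A quick back-of-the-envelope comparison; the block size and the number of blocks touched are illustrative assumptions, not DRBD’s actual bookkeeping:

```python
# Bytes on the wire: shipping the whole 4 GB file versus replicating only
# the handful of disk blocks that a small edit actually changed.

GB, KB = 1024**3, 1024

file_size = 4 * GB
block_size = 512 * KB
blocks_touched = 8                       # assume the edit dirties 8 blocks

whole_file = file_size                   # what a plain full copy transfers
changed_blocks = blocks_touched * block_size

print(f"whole file: {whole_file / GB:.0f} GB")
print(f"changed blocks only: {changed_blocks / (1024 * KB):.0f} MB")
# -> 4 GB versus 4 MB, the difference described above.
```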

Replication is also a one-way transfer; replication is not the same as synchronization, so it’s not the same as RAID 1, where both members of the set are considered both master and slave.

Think of DRBD this way… it can be the backbone of High Availability in that you can have at least one, and possibly many, nearly exact clones of an original machine (or part of a machine) every second… Without DRBD, alternatives for setting up high-performance replication or something similar to RAID 1 are usually limited to a specific application and may not perform as well.

Because of the limitations of replication, DRBD cannot be considered the ultimate and perfect solution for fault tolerance, but it can uniquely make some things possible that wouldn’t be possible or practical any other way.

TSU

Hi Tsu.

Thanks for the explanation of why it is so fast, but it still suffers from the problem of being “comparable” to RAID 1 and not to RAID 6: of course, that increases reliability, but at a big cost in available space. Even if I could make it work with 6 computers at once, and that seems to be a big “if”, I’m looking for a solution that would give me 4 disks’ worth of space with resilience to up to 2 computer failures, not one where I would end up with 1 disk’s worth of space with resilience to up to 5 failures.

Don’t get me wrong: I’m really thankful for all the help and suggestions. I just don’t see DRBD being the solution for this particular problem’s requirements (while it can certainly be great for other uses).

Would Hadoop meet your requirements? I know you’ve mentioned RAID 6, but some of the features it offers might be a good fit for using your existing hardware while delivering the high availability you’re after. There is an interesting discussion here…

https://community.hortonworks.com/questions/82202/jbods-vs-raid-for-data-nodes.html

First, Hadoop provides a distributed file system (HDFS). It is designed to manage big data. Files in Hadoop are large, and the default block size is 128 MB. If you have a 300 MB file, you have 3 blocks (128 MB, 128 MB, and 44 MB) distributed across the cluster, meaning a single file does not live on one machine. By default, Hadoop makes three copies of each block and distributes the blocks across the nodes in the cluster. In case of a failure, a block can be pulled from a node that is still up. Since three copies of the data are already made, you don’t need RAID 6. By the way, if you don’t like making 3 copies of the data, newer versions of Hadoop include support for erasure coding (RAID 6-style data resiliency). Again, no need to implement your own RAID 6. JBOD is the way to go.
Now, if you put 16 × 10 TB drives on one machine and that machine goes down, you lose 160 TB of data (it’s going to be less, but assume that theoretically). When Hadoop loses a machine and realizes that it’s operating with under-replicated blocks, it will start making copies of the lost blocks on other nodes to ensure it again has three copies of the data. In your case, that would be 160 TB. Do the maths and it comes to around 35 hours of data movement across the cluster (to copy 160 TB of blocks).
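
Spelling out that arithmetic (the copy bandwidth is purely an assumption, chosen only to show where an estimate of roughly 35 hours comes from):

```python
# HDFS-style block splitting and a rough re-replication time estimate.

MB, TB = 10**6, 10**12

def hdfs_blocks(file_bytes, block_bytes=128 * MB):
    """Split a file into fixed-size blocks; the last block may be smaller."""
    full, rest = divmod(file_bytes, block_bytes)
    return [block_bytes] * full + ([rest] if rest else [])

print([b // MB for b in hdfs_blocks(300 * MB)])   # [128, 128, 44]

lost_data = 160 * TB                  # blocks lost with the failed machine
copy_bandwidth = 1.25e9               # bytes/s across the cluster (assumed)
hours = lost_data / copy_bandwidth / 3600
print(f"re-replication: ~{hours:.0f} hours")      # ~36 hours, close to the
                                                  # ~35 quoted above
```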

To give you an idea…
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html

The basics of setting up with openSUSE…

Replicated nodes are not supposed to be a replacement for RAID, but a different strategy.
When you design fault tolerance, architects usually apply the “Rule of Three”:
Within any particular strategy, create 3 separate instances.
When applying strategies, implement 3 different and unrelated strategies.
And, so on.

The idea is to be able to suffer catastrophic failure of any 2 of 3, but still have a working solution.
Replication, and particularly DRBD in a High Availability solution, is just another strategy to be considered.

I don’t know that there are many solutions that implement RAID over a network; you could consider running RAID over iSCSI.

TSU

I don’t know how well Hadoop might work…
Although a well-designed Hadoop cluster will provide fault tolerance,
the data would be hard to retrieve… Most tools today are built around the needs of Big Data analysis, which means tagging the data using search tools that understand the stored data.

If anyone wants to play around with this stuff, I wrote a wiki article quite a while ago (it likely needs to be updated) on installing Elastic, an alternative and direct competitor to the Hadoop stack. Elastic was created with a number of improvements at the time, primarily standardizing on JavaScript and JSON to store, transfer, query, and manage data. The Elastic stack is complete, with plugin support to extend it, whereas the Hadoop stack is a number of apps, each with their own language to manage and exchange data.

https://en.opensuse.org/User:Tsu2/elasticsearch_logstash_official_repos

The original question was about a “distributed file system” which suggests that the data is not stored in a database and does not need analytical tools or other special means to input and retrieve data.

TSU

Hi again. :slight_smile:

I hadn’t thought about Hadoop before. I looked it up today and it looked promising; however (there is always a “but”): a block size of 128 MB is a real killer because, while I need to be able to deal with 100+ GB files, the filesystem will also hold user space. I can’t even begin to account for the thousands of KDE configuration dotfiles of a single user. Moreover, being based on Java (awkward!) might lead to performance issues, and the lack of POSIX compliance will cause problems for several typical commands in the simple scripts we use heavily. I also found some other issues, though I’m uncertain about them: having a single (point-of-failure) master for some tasks, for instance.

Also, Tsu, sorry for the misuse of the “RAID” expression: I’m aware it describes a completely different thing; however, the well-known, simple naming of the different levels is impossible to resist when making these comparisons. Also, iSCSI is and will be both “bureaucratically” (remember, I can’t change the internal hardware of the machines… :stuck_out_tongue: ) and “financially” impossible for a very long time.

I’ve found a few comparison papers and presentations now: from those, I’m more certain that Gluster is NOT the best option, while BeeGFS is now being considered along with Ceph (plus any other new idea someone can bring to the post! :slight_smile: ). Does anybody have any previous experience with either of those? Moreover, any other ideas for how to solve the described problem?

Once again, thanks a lot for all help!

I remain extremely apprehensive about hosting a live distributed filesystem on shared student workstations unless you can ensure sufficient redundancy and high availability.

For instance, in the ABCDEF scenario it would only take two users powering off their machines when they finished their sessions to crash the shared filesystem. This tends to be reflex behaviour for many home computer users. My experience with trained office workers is that an inadvertent power-off (with the loss of a shared resource) occurs in about 1 in 65 workplace shift changes.
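
For a rough feel of the odds with six workstations hosting the filesystem, using that 1-in-65 figure per machine per shift change (a back-of-the-envelope binomial estimate, nothing more):

```python
# Chance that two or more of six machines are powered off in the same
# shift change, if each has roughly a 1-in-65 chance of an inadvertent
# power-off.
from math import comb

n, p = 6, 1 / 65
p_two_or_more = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                    for k in range(2, n + 1))
print(f"P(two or more off together) ~ {p_two_or_more:.4f}")
# ~0.0033, i.e. very roughly once every 300 shift changes.
```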

At a minimum I recommend disabling non-admin user power-off and reboot options, and protecting the Off Buttons. I also have an on-boot script that attempts to fix things when the users realise what has happened and switch the machines back on.

That is already the “standard procedure” due to that standard “student issue”: don’t worry, it will be implemented again. :wink: