After a while of using openSUSE I’ve grown to like it more than other Linux distros. I’ve had really poor experiences with other distros when it comes to coexisting with other operating systems on the same machine, developing software, and handling updates. So far openSUSE has been stable at all of that and has survived the longest under my use. To complement my project I would like to create a cloud/cluster of openSUSE machines that perform multiple tasks. I do not like the idea of using a single system image, for reasons of flexibility and disk management.
I’d basically like to set up a distributed system for:
GPGPU cluster (openCL)
compiling code (like distcc for example)
Storage cloud (only a specific directory replicated the same way on every system, with a size limit, and not an entire user folder; a bit like Dropbox but for LAN only). The storage cloud will also act as a file server for my own remote use. I do have a good upload speed, though.
running software in a distributed manner.
There should be no central server.
The system must be flexible: new systems should be added automatically when turned on, systems can be turned off, systems on standby can be woken up when needed, and separate systems must be usable at the same time.
I’ve found a guide on making a web server, which is very useful for one machine but doesn’t cover the uses above. Starting with the openSUSE installation from DVD, what packages and configurations would I need?
All systems will be 64-bit with enough RAM and CPU, connected through gigabit LAN with a high-performance managed gigabit switch. The programming languages I use are web based, C, C++ and Java. If I can avoid using Ubuntu for this it would really help a lot. I also need a way to monitor or log in to another openSUSE machine without losing my local session, so it would look like using a virtual machine when logging into another system and I don’t need a lot of monitors and input devices.
I like icecc for distributed compiling. I have used it on my servers running both openSUSE and Debian, though I am not fully sure of all of its limitations. It’s available from software.opensuse.org.
While I can’t imagine you will receive step-by-step instructions for a complete implementation of that rather large list of requirements, you can likely get pointed in the right direction on how best to meet some of your needs.
Essentially it sounds like you want:
A compute cluster
There are many cluster management tools out there, and tools to make adding nodes via network booting or network-based OS installs easy. Once the nodes are installed / provisioned you can certainly use distcc on them to compile, run MPI jobs on them, etc.
Install Manager
If you want to install Suse on multiple nodes, but not use the exact same image, you could consider setting up a Cobbler PXE boot / install server.
You could then create AutoYaST boot config files so when you boot a machine you can select which type of profile you would like to have installed on it.
A cloud type service, such as OpenStack
This will let you bring up virtual machines dynamically (and very quickly), so you could spin up N compile or compute nodes; getting them to work with GPUs will be tricky, however.
The machines are based on images, though, so you don’t get the flexibility of installing them individually.
StorageCloud
Perhaps ownCloud (like your own private Dropbox-type service) might suit this need, though I’m not sure what types of file systems and storage protocols you need to support, etc.
All the above can certainly be accomplished with OpenSuse.
Or . . . .
Just use Amazon EC2 and S3: they have SUSE images, storage, and GPU nodes. You could fire up N SUSE nodes very quickly and shut them off when you don’t need to compile or run jobs.
Thanks for the replies, but when I said step by step I meant for each function I listed. I don’t need a complex cluster for running stuff, only for computation, compilation, storage and management.
I thought Linux had some sort of networked file system, or could create a partial networked file system. I don’t mind using scripts. I don’t think Dropbox provides enough space, and while Dropbox does sync over LAN, it is not as fast on LAN as I’d like it to be.
Although there are many tools out there, I’d like suggestions on tools that work with openSUSE. For managing them I don’t mind logging into each machine via SSH and X, but a tool that shows the resource usage of all machines in a single display would also work. All machines should have the same software except for drivers; the hardware of each machine is different.
The point is to create a cluster from the hardware I have around, for use with OpenCL. I know Amazon has GPU clusters, but what I really need is the physical hardware, for study and for creating software that fully utilises it. That means tracking what I install from scratch and installing the least amount of software possible to get it working. If you used Amazon’s GPU cluster it would essentially look like a single system image. At home people don’t really use such things: you can’t expect people to keep all systems online just to run your software, so it has to be able to work in mixed environments.
For the GPGPU cluster I intend to write my own software to use it in a cluster. icecc seems like it would work well for compilations. Does openMPI work with software that wasn’t written for it?
System Error Message wrote:
> Thanks for the replies, but when I said step by step I meant for each
> function I listed. I don’t need a complex cluster for running stuff, only
> for computation, compilation, storage and management.
As somebody else said, you’re going to need to post specific questions
for each problem you come across, and after you try to solve them
yourself.
> I thought Linux had some sort of networked file system, or could create a
> partial networked file system. I don’t mind using scripts. I don’t think
> Dropbox provides enough space, and while Dropbox does sync over LAN, it is
> not as fast on LAN as I’d like it to be.
One thing that doesn’t help is that it’s not clear what your level of
expertise is. You seem to know something about a fairly wide range of
subjects, but it appears that you haven’t heard of NFS, which is rather
basic. So perhaps if you gave more background it might help, or again be
specific in your questions.
> Although there are many tools out there, I’d like suggestions on tools
> that work with openSUSE.
Well, to a first approximation, all linux tools work with openSUSE.
I would rate myself as an intermediate user. I can code complex software and operate Linux from the terminal, and I use features like SSH and remote logins, but I am not familiar with the Linux package manager on the command line.
I know what NFS is, but I don’t want the entire file system replicated on the network, since each system will have a different drive configuration. I also want to keep them separate, since replicating the entire Linux partition would cause issues with GPU drivers. Hence I want to replicate only a single directory (not the user directory) in which work is stored, operating as a sort of networked RAID 1, for backing up and for speeding up networked software such as distcc. I’ve never used a distributed file system before except for GlusterFS where I used to work, but that didn’t go well with distcc since it caused a lot of errors during compilation.
From what I’ve gathered so far, I’ll need icecc and openMPI.
I’m still not sure what to use for the networked directory.
If by “replicating the entire Linux partition” you mean /, you would not do this with NFS.
Why not set up an NFS server and export a /work directory to all clients? Use the system with the best filesystem IO for the server: iozone and other tools can help you benchmark its read / write performance. Is there some reason this would not work? Using an NFS /work directory to share libraries and code between systems is a common solution.
Are you certain you even need this though? You don’t need a distributed file system to use distcc, and using NFS with distcc can present some extra considerations as outlined on the distcc site. Of course if you set up an NFS /work you could try both using it and not using it (for compiling) and see how it goes.
Well, I would like to share /work, but I do not want a central server. I could use distcc/icecc without NFS and expect the new files to be replicated over the network afterwards. From what I’ve tested, it takes multiple connections to saturate gigabit LAN, so it shouldn’t slow things down if distcc operated from one drive in the cluster and replicated the new files to the other systems currently online. So can NFS be used without a central server?
Every system will have a drive, but I don’t want a central server, as I’m trying to set up a cluster without any central control. If I did have to set up a central server the electrical cost would be high, since every system has more than one GPU, with the server having four GPUs and dual CPUs. I could modify the GPU BIOS to make them use a lot less power, but that wouldn’t solve the problem I’m trying to solve, which is to have compute without a central server.
Having a networked/distributed directory is essential so that I don’t have to copy my compiled work to each system in order to run it on them at the same time. It’s also good to have backups and redundancy, so if I have trouble with one system I can still continue my work.
On Fri, 19 Jul 2013 15:56:02 +0000, System Error Message wrote:
> Well, I would like to share /work, but I do not want a central server.
It’s not clear to me what you mean by this - either you’re sharing the
directory from a central location, or you’re not. You can’t share the
directory and not have one system acting as a “central server”, unless
you use something like rsync to synchronize the directory between
different systems.
Yeah, basically I want a way to synchronise the directory between systems with no central server. I don’t like keeping a system powered on 24/7. It’s easy to set up magic packets using a router, so I could have a web server on standby whenever it’s idle, but I don’t think NFS would work if the central server were on standby. I need software to synchronise a directory across multiple systems, LAN only, with no central structure. With RouterOS I can set up all sorts of networked setups, but packets do not go through the router when two systems sit on the same physical network. I could bridge ports instead of switching in order to detect and mark LAN packets, but that would severely limit network bandwidth. If I had to use a central server and it went down, it would defeat the purpose of having redundancy.
You can certainly achieve HA NFS to provide redundancy, but in these types of cases (as with running file servers and clusters in general) there is some assumption that you’re going to leave the servers running all the time. Your desire for full distribution and replication with no central server, while being able to power off nodes at will, is a pretty tall order to fill. You may need to re-examine your requirements given the type of system / cluster you’re trying to build.
Options for distributed replicated file systems are pretty few. You might check out Tahoe-LAFS but honestly this is likely to get more complicated than it may be worth. Another option might be DRBD: The DRBD User’s Guide but the same applies.
Another option might be to use a version control system like git, and have each system push /work to github.com and then the “clients” would “sync” by doing a git pull. Lots of manual intervention required, not transparent though it could be automated to some degree.
The really simple approach: depending on the size of the data, you could also revisit Dropbox / Spiderweb / other file-sharing services. This would let you offload the central server to someone else, and simply install clients on your systems to handle the syncing. The Dropbox client can also use LAN sync when clients are on the same network, eliminating the need to send traffic through the Dropbox central server and improving performance. I’m still having trouble understanding why this solution would not work for you, outside of the sync delay, which may or may not matter in your case. Setup would be negligible, and nodes could be powered on / off at will and would (generally) resync correctly.
Most researchers I know are interested in spending their time on computing, not on building complex, hard-to-maintain clusters. Your goals and time may differ, and of course if you’re looking to learn new tools, that is great. All I’m saying is that if you want to “get work done” it may be necessary to re-evaluate your requirements. I’m not saying you can’t have your cake and eat it too, merely pointing out it may be a pretty costly cake.
On Sat, 20 Jul 2013 02:46:02 +0000, System Error Message wrote:
> Yeah, basically I want a way to synchronise the directory between systems
> with no central server. I don’t like keeping a system powered on 24/7.
> It’s easy to set up magic packets using a router, so I could have a
> web server on standby whenever it’s idle, but I don’t think NFS would work
> if the central server were on standby. I need software to synchronise a
> directory across multiple systems, LAN only, with no central structure.
> With RouterOS I can set up all sorts of networked setups, but packets do
> not go through the router when two systems sit on the same physical
> network. I could bridge ports instead of switching in order to detect and
> mark LAN packets, but that would severely limit network bandwidth. If I
> had to use a central server and it went down, it would defeat the purpose
> of having redundancy.
That does clarify what you’re trying to do a lot - and rsync would be the
solution to look at for this, or something like dropbox.
And you’re right that NFS would have problems when the server was on
standby - but if you woke it up, it /should/ pop back online without any
trouble. (I use NFS on my systems, and occasionally have to turn it off on
the “server” (which is actually my desktop); the “client” (which is
actually a server system) will automatically reconnect when I restart the
server service on the desktop.)
But bear in mind that synchronization solutions have their own unique
problems - namely file collisions (the same file gets modified on both
“system1” and “system2” - whose copy is authoritative?). I do a sync like
this between my laptop and my desktop (because the desktop isn’t always
local to the other systems), and I run into file collisions regularly if
I’m not careful. Dropbox handles them fairly well (in that it renames
the file that’s a collision indicating the date/time of the collision, as
well as the source), but I periodically have to clean it up.
There are other ways to provide redundancy, though - clusters with their
own storage set up as a mirror over the network would be a common way.
But when you’re talking HA solutions like this, you’re usually looking at
a commercial solution, too. SLES has some options like this that are
pretty easy to set up, so I’m led to understand.
I think rsync may work well as long as it updates files dynamically. When using rsync, is it possible to update the file on every change? So if PC-A edits a file at the same time as PC-B, it would write the file similar to how working on the same text file in two instances of gedit would look.
I checked icecc, and it still requires a server to schedule jobs. So it looks like I’ll have to use distcc instead.
Or Dropbox with local LAN sync? It’s been mentioned a few times, but I’m not sure if you have ruled it out. You say you don’t want a centralized server, but with its local LAN sync feature it does not need to communicate externally - it will sync on your LAN - isn’t this essentially what you want?
On Wed, 24 Jul 2013 09:46:02 +0000, System Error Message wrote:
> I think rsync may work well as long as it updates files dynamically.
> When using rsync, is it possible to update the file on every change? So
> if PC-A edits a file at the same time as PC-B, it would write the file
> similar to how working on the same text file in two instances of gedit
> would look.
>
> I checked icecc, and it still requires a server to schedule jobs. So it
> looks like I’ll have to use distcc instead.
You have to trigger rsync, but as LewsTherinTelemon said you can use
inotify to detect changes.
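Just to illustrate the idea (the hostname "node2" and the /work paths below are only placeholders, and this only watches the top level of the directory, not subdirectories), a small C program along these lines would block on inotify and fire off an rsync whenever something under /work changes:

Code:
/* Sketch: watch /work with inotify and push changes to another node
 * with rsync.  One-way and naive (no conflict handling, top-level
 * directory only) -- just to show the "inotify triggers rsync" idea. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/inotify.h>

#define WATCH_DIR "/work"
#define SYNC_CMD  "rsync -a --delete /work/ node2:/work/"  /* placeholder host */

int main(void)
{
    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    /* watch for files being written, created, deleted or moved in */
    if (inotify_add_watch(fd, WATCH_DIR,
            IN_CLOSE_WRITE | IN_CREATE | IN_DELETE | IN_MOVED_TO) < 0) {
        perror("inotify_add_watch");
        return 1;
    }

    char buf[4096];
    for (;;) {
        /* block until something changes under /work */
        ssize_t len = read(fd, buf, sizeof buf);
        if (len <= 0)
            continue;

        /* something changed: push the directory to the other node */
        if (system(SYNC_CMD) != 0)
            fprintf(stderr, "rsync failed, will retry on next change\n");

        sleep(2); /* crude debounce so a burst of writes causes one sync */
    }
}

In practice a ready-made tool like lsyncd does essentially this (recursively, and for multiple targets), so you may not need to roll your own.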
Another option would be Dropbox (as he suggested) or something like the
BitTorrent sync tool (which looks interesting and AFAICS is entirely peer-
to-peer).
Ultimately, though, you’re trying to do something that is easily done
using standard tools (like NFS and clustering) but adding in a constraint
that makes that not work. If you’re looking for a HA solution,
generally, that means that systems are “always on”. It may be worth
looking at your requirements again and adjusting them so that standard
methods can be used - otherwise what you’re doing is adding complexity
for what appears to be no /good/ reason (from a system design
perspective), which is going to make the system more difficult to support
in the long run.
I found AeroFS, which is basically a LAN-only Dropbox with collision management and versioning. It works really well so far, and it requires no central server.
I’ve chosen to use distcc and openMPI for distributed compilation and for running CPU code.
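From what I’ve read, programs have to call MPI themselves; openMPI doesn’t distribute an ordinary binary on its own. A minimal sketch of the kind of thing I mean (just a hello-world; the job layout comes from however mpirun is started, nothing here is specific to my cluster):

Code:
/* Minimal MPI "hello" in C: each process reports its rank and the
 * node it runs on.  Build with mpicc, launch with mpirun. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* start the MPI runtime     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(name, &name_len);

    printf("rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

Compiled with something like mpicc hello.c -o hello and started with mpirun -np 8 --hostfile hosts ./hello, where the hostfile just lists the machines in the cluster.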
For OpenCL I’ll write my own network-based software.
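As a starting point for that, each node only needs to report what OpenCL platforms and GPU devices its drivers expose. A rough sketch with minimal error handling (the array sizes here are arbitrary, and device names/counts will differ per machine):

Code:
/* Sketch: list OpenCL platforms and their GPU devices on one node. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;

    if (clGetPlatformIDs(8, platforms, &num_platforms) != CL_SUCCESS
        || num_platforms == 0) {
        fprintf(stderr, "no OpenCL platforms found\n");
        return 1;
    }
    if (num_platforms > 8)
        num_platforms = 8;          /* we only fetched up to 8 */

    for (cl_uint p = 0; p < num_platforms; p++) {
        char pname[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME,
                          sizeof pname, pname, NULL);

        cl_device_id devices[16];
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU,
                           16, devices, &num_devices) != CL_SUCCESS)
            num_devices = 0;        /* e.g. CL_DEVICE_NOT_FOUND */
        if (num_devices > 16)
            num_devices = 16;       /* we only fetched up to 16 */

        printf("platform %u: %s (%u GPU device(s))\n", p, pname, num_devices);

        for (cl_uint d = 0; d < num_devices; d++) {
            char dname[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                            sizeof dname, dname, NULL);
            printf("  device %u: %s\n", d, dname);
        }
    }
    return 0;
}

It builds with the OpenCL headers installed and -lOpenCL; the actual work-distribution code would then be layered on top of this.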
What’s the best way to manage multiple machines remotely over LAN? If only the power and network cables are needed, then it would greatly simplify things. I’d still like to use X, but have the client machine render instead of the remote one. I only plan to use remote access on LAN. Which way would work and is easy to set up for what I described?
openSUSE makes it easy to set startup scripts and programs. I’ve gone through the openSUSE manual on remote access but can’t seem to make any of the GUI options work.
Once remote access is done my cluster will be complete and ready for software development.
On Mon, 29 Jul 2013 21:56:02 +0000, System Error Message wrote:
> I found AeroFS, which is basically a LAN-only Dropbox with collision
> management and versioning. It works really well so far, and it requires
> no central server.
>
> I’ve chosen to use distcc and openMPI for distributed compilation and
> for running CPU code.
> For OpenCL I’ll write my own network-based software.
>
> What’s the best way to manage multiple machines remotely over LAN? If
> only the power and network cables are needed, then it would greatly
> simplify things. I’d still like to use X, but have the client machine
> render instead of the remote one. I only plan to use remote access on
> LAN. Which way would work and is easy to set up for what I described?
>
> openSUSE makes it easy to set startup scripts and programs. I’ve gone
> through the openSUSE manual on remote access but can’t seem to make any
> of the GUI options work.
>
> Once remote access is done my cluster will be complete and ready for
> software development.
Interesting.
Something else you might look at, if this type of solution meets your needs,
is BitTorrent Sync. It’s different from the BitTorrent used for (for
example) downloading openSUSE ISOs, in that it’s just a file sync product.
I’ve been meaning to set it up and try it out, just haven’t gotten around
to it.
Do you mean beyond ssh to manage them? Are you looking for a graphical / desktop session to each machine? (NXServer/FreeNX work well there).
If by manage you mean managing the physical hardware (outside the OS) remotely, then IPMI allows you to power on, power off, etc. If these are servers (not desktop systems) they may also have come with some kind of service processor / remote management tools (iLO, DRAC, etc.).
If you wish to view OpenGL simulations remotely you will quickly discover that OpenGL is actually rendered on the client. Although the computations may be done on the remote system, all rendering commands are sent to the client to be displayed locally, essentially killing performance. VirtualGL provides a solution to this by rendering results on the server, storing them in a pixel buffer and transferring that to the client. It does so in a clever way, by attaching a loadable module to the binary which intercepts the OpenGL calls and redirects them locally.
For managing, I’d like an X window system or GUI that renders on the client machine, since the GPUs won’t be connected to a monitor (or, in the case of some, will have a dummy VGA connector). Most of them are desktop systems, so they don’t have an IPMI port.
For OpenGL I do know that you can render into a buffer. The Windows NVIDIA drivers allow you to render OpenGL on a different NVIDIA GPU that doesn’t handle any monitors. It shouldn’t be hard to render OpenGL into a buffer on the server, but if the 3D engine ran on both the client and the server, with the server only updating 3D coordinates, it might work to render on the client. It mainly depends on what form of data is sent. Even 1 Gb/s Ethernet is not fast enough to stream uncompressed 1080p at 60 fps: 1920 x 1080 pixels x 3 bytes x 60 frames per second is roughly 370 MB/s, or about 3 Gb/s.
Jim wrote:
> Interesting.
> Something else you might look at, if this type of solution meets your needs,
> is BitTorrent Sync. It’s different from the BitTorrent used for (for
> example) downloading openSUSE ISOs, in that it’s just a file sync product.
> I’ve been meaning to set it up and try it out, just haven’t gotten around
> to it.
I chose AeroFS over BitTorrent Sync because it has the same kind of versioning system that Dropbox uses and asks you what you want to do with collisions. BitTorrent Sync may be faster if AeroFS doesn’t use multicast or UDP, but if the file sync does use multicast, it would work for updating multiple machines from a single machine quickly. I think AeroFS may be similar to BitTorrent Sync in how it updates, except that BitTorrent Sync uses the torrent protocol.