Beowulf Cluster

I am working on a Beowulf cluster and having trouble getting it to boot. It says “could not mount root filesystem – exiting to /usr/sbin/sulogin”.
Is there any way to test NFS to see if it is working?

You’ll need to post

  • The reference you’re following to set up. A URL if online.
  • The full error which should generally include either the command that resulted in the error or the results leading up to the error.
  • The distro version you’re running

TSU

   

 Notes on Beowulf:
 

     A set of dumb terminals or a group of computers put together under a boot server is basically a cluster.
    The term to look for when building a Blender render farm or similar is parallel computing.
 

     A server is needed with 2 network cards. One facing the cluster and one facing “out”.
     DHCP needs to be running on the inside interface, or on a second server inside the cluster, to assign IPs to each machine as it comes online and to point it to the PXE boot server.
     On Linux the PXE boot server is most often set up with TFTP, or ATFTP for larger clusters; both use UDP on port 69.  
     The image(s?) for the cluster is stored in tftpboot/.  
 

         ***DHCP***
 

 Once installed, configure /etc/dhcpd.conf, then start dhcpd and set it to run on boot:
 

 authoritative;
 log-facility local7;
 default-lease-time 600;
 ddns-update-style none;
 

 subnet 192.168.10.0 netmask 255.255.255.0  
 {
     range 192.168.10.10 192.168.10.254;
     default-lease-time 14400;
     max-lease-time 172800;
     next-server 192.168.10.1;
     filename "prelinux.0";
 }    
 

     ***TFTP***
 

 Once installed, in /etc/xinetd.d/tftp make sure that
 

  - server_args has -s set, so tftpd serves only from the given folder ( -c is what allows new files to be saved there ) 
  - the folder and path are set ( in this case /srv/tftpboot ) 
  - permissions on the folder are set to 777 
  - disable is set to no 


   
         server_args     = -u tftp -c -s /srv/tftpboot
         disable         = no
 

 Start tftp and set it to run on boot
 

      Testing:      tftp -v 192.168.1.1 -c get myfile ( have some test file in the folder to “get” )
 

    Adding --verbose to the end of the tftp command enables logging.
 Be sure to check that all files ( for now ) inside tftpboot are set to 777
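
 As far as I know, the tftp-hpa in.tftpd only needs files to be world-readable before it will serve them, so 777 on everything is more than a read-only boot folder requires. A minimal sketch for spotting files a client could not fetch, assuming the /srv/tftpboot folder from above (the function name is made up for illustration):

```shell
# not_world_readable DIR: list regular files under DIR that lack the
# world-read bit, i.e. files in.tftpd would normally refuse to serve.
# The function name is hypothetical; the find options are standard.
not_world_readable() {
    find "$1" -type f ! -perm -004
}

# Usage (path assumed from the notes above):
#   not_world_readable /srv/tftpboot
```

 An empty result means every file in the folder can be read by the tftp daemon's unprivileged user.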
 

 

     ***SNMP***
  

  Once installed, start snmpd and set it to start at boot
  

     chkconfig --add snmpd    # or: chkconfig snmpd on
     service snmpd start
 

 check to see if it is working
 

     netstat -natv | grep ':199'
 

 Set snmp to be read remotely:
 

     in /etc/snmp/snmpd.conf add the network to be read from:
 

         rocommunity public 192.168.128.0/20
 

 

     ***MRTG***
 

 Create a starter / simple map:
 

     cfgmaker public@127.0.0.1  --global "WorkDir: /srv/www/htdocs/mrtg/" --output server.cfg
 

 Test the script output:
 

     indexmaker --output test.html server.cfg  
 

 Run the script:
 

     env LANG=C /usr/bin/mrtg server.cfg
 

 

     ***CRON***
 

 Create a cron job to run the mrtg script ( every 5 mins ):
 

 */5 * * *  * env LANG=C /usr/bin/mrtg /srv/server.cfg
 

 

     ***PXE***
 

 In the BIOS settings, check if the network ROM is enabled
 

     **PXE Boot**
 

 Before really starting to build the PXE environment, you have to install the syslinux package. This package provides:
 

   /usr/share/syslinux/pxelinux.0
   cp /usr/share/syslinux/pxelinux.0 /srv/tftpboot
 

 Download the installer kernels and initrds:
 

  wget http://download.opensuse.org/factory/repo/oss/boot/i386/loader/linux
  wget http://download.opensuse.org/factory/repo/oss/boot/i386/loader/initrd
  wget -O initrd64 http://download.opensuse.org/factory/repo/oss/boot/x86_64/loader/initrd
  wget -O linux64 http://download.opensuse.org/factory/repo/oss/boot/x86_64/loader/linux
 

 The folder now contains:
 

  initrd    initrd64    linux    linux64    syslinux.cfg     then   pxelinux.cfg/default
 

 Contents of pxelinux.cfg/default:
 

  default install
  prompt   1
  timeout  30
  

  # Install i386 Linux
  label install
    kernel linux
    append initrd=initrd splash=silent vga=0x314 showopts install=boot/i386/loader/
  

  # Install x86_64 Linux
  #label install64
  #  kernel linux64
  #  append initrd=initrd64 splash=silent vga=0x314 showopts install=http://download.opensuse.org/factory/repo/oss/
 

     ***BLENDER***
  

 **Render a picture**
  

      # blender -b file.blend -o //file -F JPEG -x 1 -f 1
  

      -b            Load Blender without an interface
      file.blend    The .blend file to render
      -o //file     Directory + target image file
      -F JPEG       JPEG image format
      -x 1          Ensures the extension .jpg is added to the file name
      -f 1          Render frame 1
  

  

 **Render a movie**
  

      # blender -b file.blend -x 1 -o //file -F MOVIE -s 003 -e 005 -a  
  

      -b            Load Blender without an interface
      file.blend    The .blend file to render
      -x 1          Ensures the extension .avi is added to the movie
      -o //file     Directory + target file
      -F MOVIE      Saves an .AVI movie with low compression
      -s 003 -e 005 -a
                    Set the start frame to 003 and the end frame to 005, then render the animation. Important: -s and -e must come before -a; if they are given out of order, they will not take effect!
 

 

 

 

 

      Gregm from Linuxforums added  
 

     They're mostly used for scientific computing: number crunching that would take a single processor much longer. With GPU-based floating-point processing becoming ubiquitous, it seems that workloads with a lot of disk I/O or high memory usage (such as rendering in Blender) could be good candidates.
 

     I haven't worked with Beowulf, but I think you can execute a program with mpirun. The path to the executable must be the same on each server, and you use the switch -np to indicate the number of processes,
 e.g.
 

 

 mpirun -np 6 blender -b file.blend -a -x 1 -o //render.out
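
 Note that, as far as I know, Blender is not MPI-aware, so every process started this way would render the same frames. One simple alternative is to hand each node its own slice of the frame range. A rough sketch, assuming nodes are numbered from 0; frame_slice, render_cmd and the output name render_ are placeholders invented here:

```shell
# frame_slice NODE COUNT START END
# Print "first last" for the slice of START..END that node NODE
# (0-based) out of COUNT nodes should render; print nothing if the
# node has no frames left.
frame_slice() {
    node=$1; count=$2; start=$3; end=$4
    total=$(( end - start + 1 ))
    per=$(( (total + count - 1) / count ))   # frames per node, rounded up
    first=$(( start + node * per ))
    last=$(( first + per - 1 ))
    if [ "$last" -gt "$end" ]; then
        last=$end
    fi
    if [ "$first" -le "$end" ]; then
        echo "$first $last"
    fi
}

# render_cmd NODE COUNT START END FILE
# Print the blender command for that slice; -s and -e come before -a.
render_cmd() {
    set -- "$5" $(frame_slice "$1" "$2" "$3" "$4")
    if [ $# -eq 3 ]; then
        echo "blender -b $1 -o //render_ -x 1 -s $2 -e $3 -a"
    fi
}

# Usage: print the command node 1 of 6 would run for frames 1..250
#   render_cmd 1 6 1 250 file.blend
```

 Each node would then run its own command; how the jobs get dispatched to the nodes (ssh, a queue, or something else) is left open here.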
     
 

 

 

 

 

 

 

 

 

 

 

 

 SuSE Linux Instructions
 

 The procedure for setting up a TFTP server under SuSE Linux is given below. For other distributions of Linux this procedure may differ slightly. You may have to install TFTP if it is not already installed on your system.
 Setting up from command line
 

 First, as root, make a directory to store the uClinux image which will be loaded onto the target system.
 

 bash# mkdir /tftpboot
 

 Next, the ownership of this directory must be changed to nobody, as this is the default user ID set up by tftpd. It is also a good idea to give world write permission to this directory, to allow normal users (not root) to copy files for downloading.
 

 bash# chown nobody:nobody /tftpboot
 bash# chmod 777 /tftpboot
 

 Next move the image of a compiled version of uClinux into the tftpboot directory. This file is usually named linux. For more information on compiling uClinux see Compiling the Kernel.
 

 Now the file /etc/xinetd.d/tftp must be edited to match the following:
 

 service tftp
 {
    socket_type  =  dgram
    protocol     =  udp
    wait         =  yes
    user         =  root
    server       =  /usr/sbin/in.tftpd
    server_args  =  -s /tftpboot
    disable      =  no        
 }
 

 Under SuSE Linux 9.0 the following options needed to be changed from their default values:
 

 disable = no (enables the TFTP service)
 server_args = -s /tftpboot (sets the directory to /tftpboot)
 

 Next, to start the TFTP server enter the following command as root:
 

 bash# /etc/init.d/xinetd restart
 

 

 

 

 

 

 

 Install all of kiwi. The latest release is in:
 

 http://download.opensuse.org/repositories/Virtualization:/Appliances/
 

 Then look at the doc for kiwi (/usr/share/doc/packages/kiwi/kiwi.pdf
 - comes with the kiwi-doc package). Specifically chapter 10 on "PXE
 Image - Thin Clients". LTSP is one implementation of that type of image.
 But there is much more you can do.
 

 We use kiwi to set up a number of opensuse images that boot via pxe and
 form our version of a cluster/cloud/jboc that we use in image processing
 and other systems. Works very well.
 

 Also, look at SUSE Studio (www.susestudio.com). It is a web wrapper
 (quite a powerful one at that) around kiwi that lets you make all kinds
 of images.
 

 

 There are several sites that I have used. We used SUSE Studio to make our operating system. We are going to use this for a Blender render farm.


So, if you are using SUSE Studio, do you have a public link to what you’ve created?
Or, if you’re running locally as I described earlier you have to describe exactly where and when you experienced your error, complete with the verbatim error displayed if possible.

So, in your case there can be numerous reasons why a specified location won’t mount and if you don’t describe your steps to that point no one is going to be able to guess what you’ve been doing (wrong or right).

Also, because you haven’t posted any information about what distro version you’re using there is no way to know if you’re following instructions which are more suitable for the subsystems in SUSE (which your instructions are based on) or openSUSE (which incorporates much later architecture).

TSU

It is openSUSE 13.1.
We think it is a problem with NFS. Is there a way to test NFS to see if it is working?

For starters,
As I described earlier, you’re using a guide that’s based on old architecture (System V init).
You’re advised to use systemd commands instead of the init commands in your guide.

So, this is probably a starting point… How familiar are you with openSUSE and systemd?
Do you know how to start/stop/query status for any service like nfs, and that if something is amiss a relevant snippet of the system log is automatically displayed?

Additionally, the guide you’re following describes setting up nfs (and other) config files manually, which is always subject to human error, and if you created files by hand you may well have put them in the wrong location. For those who are unfamiliar with managing and configuring openSUSE (and SUSE), there is a tool that will enable you to properly set up and manage services… YAST.

Although the SDB NFS article is a bit dated, it looks like it probably still should work, and it does describe installing nfs-kernel-server and yast2-nfs-server.
https://en.opensuse.org/SDB:Network_file_system

So,
It looks to me according to the Beowulf guide you’re following that you should be installing the YAST management modules for the following services…

zypper in yast2-dhcp-server yast2-tftp-server

It also looks like you may need to verify some or all of the following are running properly (again, systemd is probably the best first step)
SNMP
MRTG
PXE / PXE BOOT

Standard troubleshooting practice is probably required…

  • Verify networking ports are open. Generally telnet (or a similar probing app for some protocols) will return a result telling you whether the port is closed by firewall (eg denied), simply not active, or port open but no functionality behind it.
  • Running the systemd command “systemctl” is probably the best general tool to verify a network service is running or if not some hints what the problem might be. Some services may also write to their own special logfiles.

HTH,
TSU

We can plug any machine in and it will get an IP address. It loads pxelinux.cfg, vmlinuz, and initrd. Then it goes to a screen that says
could not mount root filesystem – exiting to /usr/sbin/sulogin
Give root password for maintenance
(or type Control-D to continue)

Ping is a poor and very minimal network troubleshooting tool; all it will tell you is whether you have <very> fundamental network connectivity (and name resolution, if pinging names), nothing more.

You’d do far better if you used the tools I mentioned which

  • Test for network connectivity on a particular specified port, returning various possible state/status.
  • Test for functionality behind an open port
  • Test locally for the status of the app/service providing functionality
  • Displays helpful information if an app/service is non-functional

Also,
Whether you are experienced setting up certain services on openSUSE or not, the tools I described will increase the likelihood of doing it right many-fold.

As I said in my first post, there can be many reasons for why a file or location “cannot be found.” At this point, without more information a cause and solution is only wild guess-work. You can use the tools I described to gather more information and maybe even fix the problem on your own. The tools are not that difficult to use, try them. If you do have problems using them, post specific details about your attempt and anyone on this Forum will be able to help.

TSU

Hi TSU or anyone else that has information that is willing to try to help

Thanks for your reply to my student.
I wanted to help with the information for the Beowulf project.
We plan to use it for a Blender render farm.
We want to have a diskless system, except the server.
We’d like to boot from the server rather than from CD or USB on each client.

We have a server with:

openSUSE 13.1 (i586)
VERSION = 13.1
CODENAME = Bottle

There is no firewall, it is inside the lab.
The Yast tools you speak of are present.

dhcpd.conf contains:

<code>
allow booting;
allow bootp;
next-server 192.168.10.1;

subnet 192.168.10.0 netmask 255.255.255.0 {
  next-server 192.168.10.1;
  filename "prelinux.0";
  range 192.168.10.10 192.168.10.254;
  default-lease-time 14400;
  max-lease-time 172800;
}

host cub1
{
  hardware ethernet 00:26:18:43:07:A5;
  fixed-address 192.168.10.30;

}

</code>

We have the one test machine added directly to the dhcp list
and will most likely have any others that are added to the cluster
in here later as well.
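
One thing that may be worth checking here: the fixed address for cub1 (192.168.10.30) sits inside the dynamic range 192.168.10.10 to 192.168.10.254. It is usually advised to keep fixed addresses outside the range dhcpd hands out dynamically. A small sketch for checking that, with helper names of my own choosing:

```shell
# ip_to_int ADDR: convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
}

# in_range ADDR LOW HIGH: succeed if ADDR lies within LOW..HIGH inclusive.
in_range() {
    a=$(ip_to_int "$1")
    lo=$(ip_to_int "$2")
    hi=$(ip_to_int "$3")
    [ "$a" -ge "$lo" ] && [ "$a" -le "$hi" ]
}

# Usage with the values from the dhcpd.conf above:
#   in_range 192.168.10.30 192.168.10.10 192.168.10.254 && echo "inside the dynamic range"
```

If the check succeeds for a fixed address, either the range or the host entry could be moved so the two no longer overlap.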

TFTP seems to be up and working

merv@tpinstructor:~> tftp -v 192.168.142.1 -c get prelinux.0
Connected to 192.168.142.1 (192.168.142.1), port 69
getting from 192.168.142.1:prelinux.0 to prelinux.0 [netascii]
Received 27120 bytes in 0.2 seconds [1155764 bit/s]

NFS

BigBadWulf:~ # service nfs status
nfs.service - LSB: NFS client services
   Loaded: loaded (/etc/init.d/nfs)
  Drop-In: /run/systemd/generator/nfs.service.d
           └─50-insserv.conf-$remote_fs.conf
   Active: active (running) since Wed 2015-03-18 14:37:56 EDT; 1 day 23h ago
  Process: 22825 ExecStop=/etc/init.d/nfs stop (code=exited, status=0/SUCCESS)
  Process: 22878 ExecStart=/etc/init.d/nfs start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/nfs.service
           └─22909 /usr/sbin/rpc.gssd -D -p /var/lib/nfs/rpc_pipefs

Mar 18 14:37:56 BigBadWulf nfs[22878]: Starting NFS client services: sm-notify gssd idmapd…done
Mar 18 14:37:56 BigBadWulf systemd[1]: Started LSB: NFS client services.

a clip from systemctl …

<code>
cycle.service loaded active exited LSB: Set default boot entry if called
dbus.service loaded active running D-Bus System Message Bus
dhcpd.service loaded active running LSB: ISC DHCP 4.x Server
getty@tty1.service loaded active running Getty on tty1
kmod-static-nodes.service loaded active exited Create list of required static device nodes for the current kernel
network.service loaded active exited LSB: Configure network interfaces and set up routing
network@enp0s7.service loaded active exited ifup managed network interface enp0s7
network@enp1s6.service loaded active exited ifup managed network interface enp1s6
nfs.service loaded active running LSB: NFS client services
nfsserver.service loaded active running LSB: Start the kernel based NFS daemon
nscd.service loaded active running Name Service Cache Daemon
postfix.service loaded active running Postfix Mail Transport Agent
rc-local.service loaded active exited /etc/init.d/boot.local Compatibility
rpcbind.service loaded active running RPC Bind
rsyslog.service loaded active running System Logging Service
sshd.service loaded active running OpenSSH Daemon
</code>

merv@tpinstructor:~> netstat -a …

<code>
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 *:nfs *:* LISTEN
tcp 0 0 *:sunrpc *:* LISTEN
tcp 0 0 *:49424 *:* LISTEN
tcp 0 0 *:mountd *:* LISTEN
tcp 0 0 *:53236 *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 localhost:smtp *:* LISTEN
tcp 0 0 balewolf:ssh 192.168.131.23:52742 ESTABLISHED
tcp 0 0 balewolf:ssh 192.168.128.254:60294 ESTABLISHED
tcp 0 0 *:40224 *:* LISTEN
tcp 0 0 *:nfs *:* LISTEN
tcp 0 0 *:sunrpc *:* LISTEN
tcp 0 0 *:mountd *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 *:33752 *:* LISTEN
tcp 0 0 localhost:smtp *:* LISTEN
udp 0 0 *:791 *:*
udp 0 0 *:19849 *:*
udp 0 0 *:48612 *:*
udp 0 0 *:nfs *:*
udp 0 0 *:bootps *:*
udp 0 0 *:mountd *:*
udp 0 0 *:sunrpc *:*
udp 0 0 *:37097 *:*
udp 0 0 localhost:746 *:*
udp 0 0 *:791 *:*
udp 0 0 *:37716 *:*
udp 0 0 *:53118 *:*
udp 0 0 *:nfs *:*
udp 0 0 *:tftp *:*
</code>

merv@tpinstructor:~> netstat -an | egrep 'Proto|LISTEN'

<code>
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:2049 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:49424 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:20048 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:53236 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN
tcp 0 0 :::40224 :::* LISTEN
tcp 0 0 :::2049 :::* LISTEN
tcp 0 0 :::111 :::* LISTEN
tcp 0 0 :::20048 :::* LISTEN
tcp 0 0 :::22 :::* LISTEN
tcp 0 0 :::33752 :::* LISTEN
tcp 0 0 ::1:25 :::* LISTEN
</code>
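
To read a listing like the one above without eyeballing it, the ports can be checked mechanically. A sketch, assuming the port numbers seen in this output (111 for sunrpc, 2049 for nfs, 20048 for mountd); check_ports is a made-up helper, not part of any package:

```shell
# check_ports PORT...
# Read `netstat -an` output on stdin and report, for each port given
# as an argument, whether a TCP listener exists on it.
check_ports() {
    input=$(cat)
    for port in "$@"; do
        if printf '%s\n' "$input" | awk -v p="$port" '
            $1 ~ /^tcp/ && $NF == "LISTEN" {
                n = split($4, parts, ":")      # last piece is the port
                if (parts[n] == p) found = 1
            }
            END { exit !found }'; then
            echo "$port listening"
        else
            echo "$port not listening"
        fi
    done
}

# Usage on the server:
#   netstat -an | check_ports 111 2049 20048
```

This only shows that something is bound to the port; it says nothing about whether the export itself is usable.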

merv@tpinstructor:~> telnet 192.168.142.1 111
Trying 192.168.142.1...
Connected to 192.168.142.1.
Escape character is '^]'.
^

Connection closed by foreign host.

merv@tpinstructor:~> telnet 192.168.142.1 2049
Trying 192.168.142.1...
Connected to 192.168.142.1.
Escape character is '^]'.
^

Connection closed by foreign host.

from the default script in tftpboot/pxelinux.cfg/ :

kernel boot/vmlinuz
append initrd=boot/initrd ramdisk_size=512000 ramdisk_blocksize=4096 root=/dev/nfs nfsroot=192.168.10.1/srv/Den/root rw

This is what we are presently trying.
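
For comparison, the kernel's NFS-root documentation gives the argument as nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>], with a colon between the server address and the path. The append line above has none, which may be worth ruling out; whether the initrd from this particular image honours nfsroot= at all is an open question. A small sketch that flags this, with check_nfsroot being a name invented here:

```shell
# check_nfsroot "KERNEL CMDLINE"
# Look for an nfsroot= argument and report whether it has the form
# server:/path or /path (either may be followed by ,options).
check_nfsroot() {
    for arg in $1; do
        case $arg in
            nfsroot=*:/*|nfsroot=/*)
                echo "ok: ${arg#nfsroot=}"
                return 0 ;;
            nfsroot=*)
                echo "suspect: ${arg#nfsroot=} (no ':' between server and path)"
                return 1 ;;
        esac
    done
    echo "no nfsroot= argument found"
    return 1
}

# Usage on a booted client, or with the append line pasted in:
#   check_nfsroot "$(cat /proc/cmdline)"
```

Run against the append line above, this reports the nfsroot value as suspect; with 192.168.10.1:/srv/Den/root it reports ok.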

Jared got a custom ISO from SUSE Studio “burnt” to a USB stick.
Systems will boot from the USB stick into a console-only OS.
I used the partitions from the USB drive to create the “Den” folder for the
clients, which is what NFS is pointing to.

We used vmlinuz and initrd from the usb stick for the pxe boot

Through YaST, NFS is set:

192.168.10.1/255.255.255.0 rw,no_root_squash,no_subtree_check,crossmnt,fsid=0
The export is /srv/Den

drwxrwxrwx 23 1002 users 4096 Feb 11 14:58 Den
drwxrwxrwx 25 tftp tftp 4096 Mar 3 13:04 tftpboot

What is happening

If we use the test system or one of the systems from the lab, it receives an IP,
PXE boots and finds the tftpboot folder, gets the pxelinux.cfg/default script,
and loads the kernel, then the initrd. The machine begins to load but then hangs
trying to mount root.

It boots rather quickly, and several lines of configuration go past too fast to see,
but then it says it cannot mount the root file system - exiting to /usr/sbin/sulogin,
give root password for maintenance or Control-D to continue.

We have tried watching with iptraf on the server but do not see an NFS attempt.

It seems to be down to the client NFS connection, but we are not sure, nor do we
know how to test past this.

The ISO has Blender, nfs-client, pico, python.

  1. Have you tried to simply mount your exported NFS to see if you have anything?
  2. Your test results suggest that you have some kind of working functionality behind the NFS port, but you still don’t know if you configured your NFS correctly.
  • Recommend setting up a second NFS export, it can be something small and simple to test whether you can do it. Although I sense resistance, I highly recommend you set up using YAST, it’s solved many problems for many people.
  • Whether you set up a second NFS or not, I highly recommend backtracking and remove all the NFS server stuff you’ve already done and then set up again using YAST.
  3. I’m a big believer in spending a little extra time learning required fundamentals because there is immediate big payback when you run into issues. Besides knowing and following highly recommended practice (like using YAST), the ability to get a little bit more relevant information can be the difference between success and failure.
  • Spend a few minutes learning the basics of using “systemctl” and what “journalctl” might provide. If you do this then you won’t be posting init and service commands which may or may not still be useful on a systemd system today.
  • The telnet commands and results were big in at least determining to a degree what isn’t your problem (network connectivity, whether something is responding behind the NFS port). But because systemctl wasn’t run properly, it isn’t returning relevant information about the local nfs service. An alternative to a simple systemctl command, though more difficult, is to parse the syslog for relevant entries. If you’re really, really lost then maybe posting your syslog to pastebin would allow others to wade through the system’s entire logfile for you.
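
A sketch of what those checks might look like, assuming the server address 192.168.10.1 and the export /srv/Den from the post above, and the unit name nfsserver.service seen in the systemctl listing. The mount point and the function names are placeholders, and forcing NFSv3 is an assumption made because the export uses fsid=0, which changes how NFSv4 clients see the path. Setting DRY_RUN=1 only prints the commands.

```shell
# Values assumed from the posts above; adjust to the real setup.
SERVER=${SERVER:-192.168.10.1}
EXPORT=${EXPORT:-/srv/Den}
MNT=${MNT:-/mnt/nfstest}

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On the server: unit state, this boot's log lines, and the active exports.
check_server() {
    run systemctl status nfsserver.service rpcbind.service
    run journalctl -b -u nfsserver.service
    run exportfs -v
}

# On a client (as root): list the exports, mount one, look inside, unmount.
check_client() {
    run showmount -e "$SERVER"
    run mkdir -p "$MNT"
    run mount -t nfs -o vers=3 "$SERVER:$EXPORT" "$MNT"
    run ls -l "$MNT"
    run umount "$MNT"
}

# Usage:
#   DRY_RUN=1 check_client    # show what would be run
#   check_client              # actually try the mount
```

If the manual mount works from an installed client but the PXE-booted kernel still cannot mount root, the export is probably fine and the problem lies in what the booting client asks for.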

I’m moving very fast through the above recommendations without details because a post with details would be <extremely> long. Am hoping you can figure out how to do what I recommend on your own, but if you are unable, then <do> post specifically what you’ve tried and are trying to do, and people can help with that.

HTH,
TSU

Thinking a bit about your setup,

If I were designing your cluster and wanted to boot nodes from common files, I’d personally prefer to set up iSCSI instead of NFS.

  • Although there may be less of a difference nowadays with the latest openSUSE, because almost all the runtime is now in RAM and rarely references the disk files after boot, using a distributed block device like iSCSI avoids the overhead that a network share technology like NFS is supposed to have. Still, whether you choose NFS or iSCSI, I’d recommend you mark the files read-only as much as possible.
  • As a block device, it should be easier to set up and implement. Less configuration and complications than what you’d have configuring NFS.

HTH,
TSU

It boots to the screen that shows the suse logo. And it won’t go any further.
It says waiting for device
/dev/disk/by-ID/usb
then after that it has the name of the flash drive. There is no flash drive in the machine.

You’ll have to be a bit more descriptive.

During the boot sequence…

  • An initial marked boot partition is found
  • The boot files on the partition are run.
  • On openSUSE, the default bootloader GRUB2 is run
  • The GRUB menu displays (has its own openSUSE logo) which contains various boot options including for a normal boot, rescue mode and possible kernel options… and advanced options.
  • After a GRUB menu option is selected, the boot sequence continues and is directed to a root partition which contains the files to boot the proper operating system.
  • Assuming that no Desktop is installed, the boot sequence will mount partitions and start essential system services, eventually ending in displaying an openSUSE logo and login prompt.

So, where in the above sequence does your error display and does it result in a “soft” error allowing you to login or does it result in a system halt?
If it’s displaying somewhere in the last 2 steps I described, then IMO it’s likely a problem with your original image likely in /etc/fstab.
If it’s earlier, then it might be a problem with your GRUB configuration or something else.
In any case, it’s a problem with your image, not a setup problem (that I can see now).

Note that I’m relying on your statement that no USB devices should be referenced; if a device should exist but is not found, then the problem would more likely have been the method by which the device was referenced.

TSU

The last thing we were doing was trying to see if NFS worked. We connected a machine internally and got it to connect to NFS, and we can see the directories after we connect. Now we are at the part where we need the kernel to connect to nfsroot. How would we do that?

Sounds like you’ve largely resolved the networking part of your setup (congrats!)
I’d recommend you either search existing threads or ask this new question, which is no longer about networking, in the Install/boot/login Forum

You’ll need to manually edit your grub.cfg file creating a new entry pointing to your specified location…

TSU