Josh Moore
http://home.attmil.ne.jp/a/jm/

Building a Beowulf Cluster

Introduction-

I finished the Linux From Scratch project about a week ago and was looking for something else to try.  After poking around the Internet, I decided to build a Beowulf cluster.  While building it, I noticed a lack of guides and decided to write one for others interested in building their own cluster.  This is not the only way to build a cluster, but it is the way I built mine, and this document is a guide to building a cluster similar to it.

Before we get into any technical details, you must first know what a Beowulf cluster is.  A Beowulf cluster is nothing more than a group of two or more computers networked together so that programs can be split into parts and run across the machines to speed up tasks.  The first Beowulf cluster was built by two NASA engineers in 1994.  It was constructed out of 16 Intel DX4 processors running at 100 MHz each.  The individual computers in a cluster are called nodes.  Groups that need huge processing power (universities, labs, engineering firms, NASA) often use clusters instead of supercomputers.  Why spend a fortune on an eight-processor machine when it would be cheaper to build a cluster of eight computers with the same or better performance?  As with SMP (multiple processors), not just any program can take advantage of a cluster.  It must first be written and compiled to use message passing (the clustering software).  This is just a watered-down introduction to what a cluster is, but it should be enough to give you a basic idea for the rest of the document.

For my first cluster, I built a few computers from parts lying around the office.  The cluster consists of three machines ranging from a Pentium 133 MHz to a Pentium 200 MHz, each with between 64 and 128 MB of RAM.  All run without keyboards or monitors but allow SSH logins for administration.  The cluster also sits behind my firewall on the same network as my other computers.  I used Red Hat 9 as the operating system on all nodes.  Any setup will work, but this is what worked best for me since it required no money and was purely a project to use as a reference while building other clusters.

Nodes-
Most clusters have a master node, which controls everything, and a group of slave nodes.  The master node should be the fastest computer and the one with the most disk space.  Once the cluster is complete, users only need to log into the master node to run jobs and perform other tasks.

When and Where-
Once the nodes are set up and working the way you want, you can begin the actual cluster setup.  The first step is to edit the hosts file on each node and make sure all the nodes can talk to each other by name.  It is a good idea to assign the nodes static IPs or have a DHCP server issue them addresses based on MAC address.  I used master, slave1, slave2, etc. to keep things simple.  For those in LA County, you might need to go with a different naming scheme.  It is also important for all nodes to have the same time.  You should already have the clocks synchronized, since this is a network environment and synchronized clocks help with intrusion detection and other tasks.  It is best to have an internal computer running a Network Time Protocol (NTP) server and synchronize all the computers on the network with it.  If not, synchronizing all the clocks from an external NTP server works, as long as every node uses the same server.
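
As a rough sketch, the /etc/hosts file on each node might look something like this (the 192.168.1.x addresses here are just placeholders for whatever your network actually uses):

127.0.0.1       localhost.localdomain   localhost
192.168.1.10    master
192.168.1.11    slave1
192.168.1.12    slave2

Clocks can then be kept in sync by running something like 'ntpdate your_time_server' from cron on each node, or by pointing ntpd on every node at the same server.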

Who-
All nodes must also have a common user with identical home directories and settings.  I just added a user named cluster to keep things simple.  I also exported the /home directory on the master node and mounted it on all the slaves using NFS.  This ensures that every node sees the same user configuration files and work space.
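
A minimal sketch of this setup, assuming the user is named cluster and using 500 as an example UID (any UID works as long as it is the same everywhere):

On every node:
useradd -u 500 cluster

On the master, in /etc/exports:
/home    slave1(rw,sync)    slave2(rw,sync)

On each slave, in /etc/fstab:
master:/home    /home    nfs    defaults    0 0

Keeping the UID identical on all nodes avoids file permission problems on the NFS-mounted home directory.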

SSH versus RSH-
To run jobs on each other, all nodes must be able to log into and run commands on the other nodes.  This can be done one of two ways.  You can set up SSH to require no password by using a key file.  This adds some security since the traffic is encrypted, but it also slows the cluster down a little because everything has to be encrypted and decrypted.  Since all I am doing is running programs, I don't really care whether the packets are encrypted, so I went with the second option: RSH.  It can be a little trickier to set up, but it works fine once it is up and running.  If you want to use RSH but are really paranoid, you could always add a second NIC to the master node, hook it up to your main network, and firewall that connection.  It would also improve performance a little, since the cluster would be on its own network and wouldn't have to compete with other traffic.
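
If you decide to go the SSH route instead, the usual approach is roughly the following (just a sketch, assuming OpenSSH and the shared /home described in the previous section).  As the cluster user, generate a key with an empty passphrase and add the public key to the authorized_keys file:

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

Since /home is shared over NFS, doing this once makes passwordless logins work between all the nodes.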

Since most people probably don't know how to set up RSH, I will explain it.  The following must be done on every node in the cluster.  First edit the /etc/hosts.equiv file and add the host names of all the computers in the cluster.  Edit the /etc/xinetd.d/rsh and /etc/xinetd.d/rlogin files and change disable=yes to disable=no.  This line can be at the top or bottom of the file; I have seen it in both places on RH9.  Also edit /etc/pam.d/rlogin and move the 'auth sufficient pam_rhosts_auth.so' line from the middle to the top of the file.  Restart xinetd.  Test it out by issuing simple commands such as 'rsh host_name uptime'.
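
For example, on my three-node setup the /etc/hosts.equiv file on every node simply contains the following:

master
slave1
slave2

After the xinetd and PAM files are edited, restart xinetd with 'service xinetd restart' (or '/etc/init.d/xinetd restart') and then try 'rsh slave1 uptime' from the master as the cluster user.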

Message Passing-
Now it is time to install the message passing software.  There are several choices, some better than others, but I went with MPICH.  The Mathematics and Computer Science Division at Argonne National Laboratory at www-unix.mcs.anl.gov describes MPICH as "a freely available, portable implementation of MPI, the Standard for message passing libraries."  Once MPICH was downloaded, I extracted it into the cluster home directory and built it according to the install guide.  It is always a good idea to follow the install guide.  Installation was not too difficult: ./configure, make, su -c "make install".
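
For reference, the build goes roughly like this (the tarball name and version number are only examples; follow whatever the install guide for your version says):

tar xzf mpich.tar.gz
cd mpich-1.2.5
./configure
make
su -c "make install"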

After MPICH was installed on the master node, I exported its directory via NFS and mounted it on all the slave nodes.  This makes future upgrades easier and causes no measurable decrease in performance.  There is one last step in configuring MPICH: the machines.LINUX file under the share directory of MPICH must be edited.  By default, localhost.localdomain is listed five times.  Delete all occurrences of it and list the host names of all the nodes in the cluster.  Also edit your .bash_profile and add the mpich/bin directory to your path.  Now it is time to test everything out.
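
Continuing the three-node example, the machines.LINUX file ends up containing just the host names, one per line:

master
slave1
slave2

The lines added to .bash_profile look something like the following, where the path is wherever MPICH ended up installed on your system:

PATH=$PATH:/usr/local/mpich/bin
export PATH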

Testing-
It is important to do all clustering work as the cluster user.  Running as root is always a bad idea, and anything you do wrong could mess up several computers at once because of NFS and RSH.  There is a test program called tstmachines located in the sbin directory of MPICH.  Running it should produce no output if everything is OK.  You can run it with the -v (verbose) option, in which case you should see something similar to the following.

Trying true on master ...
Trying true on slave1 ...
Trying true on slave2 ...
Trying ls on master ...
Trying ls on slave1 ...
Trying ls on slave2 ...
Trying user program on master ...
Trying user program on slave1 ...
Trying user program on slave2 ...

Once all the work and troubleshooting is done, you are ready to actually run programs.  This is done with 'mpirun -np # -nolocal program_name'.  Replace # with the number of processes you want to start.  The -nolocal option tells mpirun not to run any of the processes on the local machine, so the job is spread across the nodes listed in machines.LINUX.  MPICH comes with a sample program called cpi which calculates pi and gives you a rough benchmark of the system.  It is best to run it on each node individually, then on all nodes together, and compare the results.  If all goes well, you can get into some serious clustering.
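
For example, to run the bundled cpi program with three processes spread across the nodes (run it from the directory where cpi lives, or give its full path):

mpirun -np 3 -nolocal cpi

Running it with -np 1 on a single node and then with -np 3 across the cluster gives a rough before-and-after comparison of the speedup.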

My own programs-
You might be thinking: great, now I can take a few computers to a LAN party, run Quake faster than anyone, and be the frag master.  It would be nice if you could, but life is not that simple.  I mentioned earlier that clustering is a lot like having multiple CPUs, except cheaper.  Not just any program can be run across a Beowulf cluster.  It must be written in parallel and compiled for the message passing library.  Several companies offer parallel compilers, but MPICH comes with a compiler wrapper, mpicc, that works well.  The programmer splits the work into parts that can be run on multiple nodes.  Parallel programming is beyond the scope of this document, but there are several good books on the subject, and if you have a fairly good knowledge of C or FORTRAN, it shouldn't be too hard to pick up.  Below is the code for a simple hello_world.c program written in parallel.

#include <stdio.h>
#include "mpi.h"
 
int main(int argc, char **argv)
{
   int myrank, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];
 
   MPI_Init(&argc, &argv);                            /* start up the MPI environment */
   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);            /* total number of processes */
   MPI_Comm_rank(MPI_COMM_WORLD, &myrank);            /* this process's rank */
   MPI_Get_processor_name(processor_name, &namelen);  /* name of the node this runs on */
 
   printf("Hello World!  I'm rank %d of %d on %s\n", myrank,
      nprocs, processor_name);
 
   MPI_Finalize();                                    /* shut down MPI */
   return 0;
}


The above prints hello world along with the name of the node it is running on and that node's rank compared to the others.  The best way to learn is probably to look at the source code of other parallel programs and learn from them.  There are several programs available that are already written in parallel, including image and video renderers, password crackers, astronomy and weather simulators, and programs that look for oil.
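
To build and run it yourself, use the mpicc wrapper that comes with MPICH and then launch the result through mpirun; assuming the file is saved as hello_world.c and mpich/bin is in your path, it looks something like this:

mpicc -o hello_world hello_world.c
mpirun -np 3 -nolocal hello_world

Each process prints its own line, so you should see one "Hello World!" per rank.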

What's next?
This was just a basic guide to getting a cluster working.  A real-world cluster would benefit from: administration tools, so you only have to type a command on one node and have it issued on all nodes, which makes upgrades and maintenance easier; monitoring software to track CPU and memory usage and make sure everything is running smoothly; and job scheduling.  Job scheduling makes sure that all jobs and users share CPU power, which makes the cluster more efficient.  It can be very useful together with monitoring software for quickly diagnosing and fixing any drop in performance.  Perhaps sometime later I will write a part two that explains what to do with the cluster after it is built and gives more details on the optional tools that are available and how to use them.