MrBayes on the Biowulf Linux Cluster

MrBayes on Biowulf

MrBayes is a program for the Bayesian estimation of phylogeny. Bayesian inference of phylogeny is based upon a quantity called the posterior probability distribution of trees, which is the probability of a tree conditioned on the observations. The conditioning is accomplished using Bayes's theorem. The posterior probability distribution of trees is impossible to calculate analytically; instead, MrBayes uses a simulation technique called Markov chain Monte Carlo (or MCMC) to approximate the posterior probabilities of trees.

The program takes as input a character matrix in NEXUS file format. The output consists of several files containing the parameters that were sampled by the MCMC algorithm. MrBayes can summarize the information in these files for the user. Program features include:

Extensive help available via the command line;
Ability to analyze nucleotide, amino acid, restriction site, and morphological data;
Mixing of data types, such as molecular and morphological characters, in a single analysis;
A general method for assigning parameters across data partitions;
An abundance of evolutionary models, including 4 X 4, doublet, and codon models for nucleotide data and many of the standard rate matrices for amino acid data;
Estimation of positively selected sites in a fully hierarchical Bayes framework;
Distributed computing using MPI

The most recent version of the program, 3.1, is the first non-beta release of MrBayes 3. Compared to previous beta versions, it includes numerous bug fixes as well as several new features such as topology constraints, inference of ancestral states and site rates, and simultaneous independent runs with convergence diagnostics calculated on the fly.

MrBayes is developed by a group of researchers at several institutions.

Submitting a single MrBayes batch job

1. Create a script file that contains the MrBayes commands, as below:

---------- /data/user/mrbayes/test --------------

#!/bin/bash
#
#PBS -N MrBayes
#PBS -m be
#PBS -k oe
PATH=/usr/local/mpich/bin:$PATH; export PATH

cd /usr/local/bench/mrbayes

# $np is not set in this script; it is supplied at submission time
# with 'qsub -v np=...'
mpirun -machinefile $PBS_NODEFILE -np $np /usr/local/bin/mrbayes arch107_L1000.nex

2. Now submit the script using the 'qsub' command, e.g.

qsub -v np=8 -l nodes=4:o2800 /data/user/mrbayes/test

Where
np is the desired number of processors (2x the number of nodes, 4x for dual-core nodes)
nodes is the desired number of nodes (in this case, 4)
o2800 is the desired type of processor
"test" is the name of the script file.
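
The relationship between 'nodes' and 'np' described above can be sketched as a small shell helper. This is an illustrative sketch only: the function name procs_for_nodes and its arguments are hypothetical, and the per-node processor counts (2, or 4 for dual-core nodes) are taken from the description above. It prints the qsub command instead of running it.

```shell
#!/bin/bash
# Hypothetical helper (not part of Biowulf or MrBayes): compute the value
# to pass as 'np' from the node count and the processors per node.
procs_for_nodes() {
    local nodes=$1          # number of nodes requested with -l nodes=...
    local per_node=$2       # 2 for standard nodes, 4 for dual-core nodes
    echo $(( nodes * per_node ))
}

np=$(procs_for_nodes 4 2)
# Print the qsub command rather than running it, so it can be checked first.
echo "qsub -v np=$np -l nodes=4:o2800 /data/user/mrbayes/test"
```

With 4 standard nodes this reproduces the submission command shown in step 2.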

Parallelization

MrBayes is parallelized, and uses MPI to distribute heated and cold chains among the available processors. When run in parallel, each chain is run by a single processor, so MrBayes cannot use more processors than there are chains. If you submit your MrBayes job to more processors than you have chains, you will see the error message

> The number of chains must be at least as great
> as the number of processors (#)

To use more processors, increase the number of chains (nchains) or the number of independent runs (nruns) before submitting. Increasing 'nruns' and running on more processors will not speed up the calculation, since each independent run still takes the same amount of time to compute. It does, however, allow more independent runs to be evaluated at the same time, which gives the convergence diagnostics more runs to compare.
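
The processor limit described above can be checked before a job is submitted. The following is a minimal sketch, assuming the total number of chains is nruns x nchains (consistent with the text above, where raising either parameter permits more processors). The function name check_procs is hypothetical and is not a MrBayes or PBS command.

```shell
#!/bin/bash
# Hypothetical pre-submission check (not a MrBayes or PBS command).
# Assumption: total chains = nruns * nchains.
check_procs() {
    local np=$1 nruns=$2 nchains=$3
    local total=$(( nruns * nchains ))
    if [ "$np" -gt "$total" ]; then
        echo "too many processors: np=$np exceeds $total chains"
        return 1
    fi
    echo "ok: np=$np for $total chains"
}

# 2 runs x 4 chains = 8 chains, so 8 processors are acceptable.
check_procs 8 2 4
# 16 processors for the same 8 chains would trigger the MrBayes error.
check_procs 16 2 4 || echo "reduce np or raise nchains/nruns"
```

The nruns and nchains values passed here must match the settings in the MrBayes block of the NEXUS file being run.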

Submitting a swarm of MrBayes jobs

To run a large number of MrBayes jobs, and have each job use multiple processors, follow this procedure. Set up a swarm command file along the following lines:

# --- this file is swarm.cmd ------
setenv PATH /usr/local/mpich/bin:$PATH ; cd /data/user/myjob/a1 ; mpirun -machinefile $PBS_NODEFILE -np 4 /usr/local/bin/mrbayes test1.nex
setenv PATH /usr/local/mpich/bin:$PATH ; cd /data/user/myjob/a2 ; mpirun -machinefile $PBS_NODEFILE -np 4 /usr/local/bin/mrbayes test2.nex
setenv PATH /usr/local/mpich/bin:$PATH ; cd /data/user/myjob/a3 ; mpirun -machinefile $PBS_NODEFILE -np 4 /usr/local/bin/mrbayes test3.nex

In the example above, each run uses its own directory for convenience. Each MrBayes run is set up to use 4 processors ('-np 4'). Thus, the swarm command must also be set up so that each MrBayes run is allocated a single node with 4 processors.

Submit this swarm with the command:

swarm -f swarm.cmd -n 1 -l nodes=1:dc

The '-n 1' flag ensures that only 1 MrBayes job will run on each node. Since the MrBayes commands within the swarm file use '-np 4' (i.e. run on 4 processors), the swarm jobs must run on nodes with 4 processors, i.e. the dual-core (dc) nodes.
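
When there are many runs, the swarm command file can be generated rather than typed by hand. The following is a minimal sketch, assuming the directory and input file naming pattern from the example above (a1/test1.nex, a2/test2.nex, and so on). The function name swarm_lines is hypothetical.

```shell
#!/bin/bash
# Hypothetical generator for swarm.cmd lines; prints one line per run.
# Single quotes keep $PATH and $PBS_NODEFILE literal so that they are
# expanded when the swarm job runs, not when this script runs.
swarm_lines() {
    local n=$1 i=1
    while [ "$i" -le "$n" ]; do
        printf '%s' 'setenv PATH /usr/local/mpich/bin:$PATH ; '
        printf 'cd /data/user/myjob/a%d ; ' "$i"
        printf '%s' 'mpirun -machinefile $PBS_NODEFILE -np 4 '
        printf '/usr/local/bin/mrbayes test%d.nex\n' "$i"
        i=$(( i + 1 ))
    done
}

# Print the lines for three runs; redirect to a file to create swarm.cmd.
swarm_lines 3
```

Redirecting the output (for example, swarm_lines 3 > swarm.cmd) produces a file that can be passed to 'swarm -f' as shown above.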

Running MrBayes interactively

If you have a small MrBayes job, it is probably easiest to run on Helix. Occasionally, for debugging purposes, an interactive job may be run on Biowulf by allocating an interactive node. Please remember to exit from the node when done.

<biowulf>% qsub -I -l nodes=1

qsub: waiting for job 593807.biobos to start
qsub: job 593807.biobos ready

<p2>% cd /usr/local/bin
<p2>% mrbayes

                               MrBayes v3.1.2

                      (Bayesian Analysis of Phylogeny)

                             (Parallel version)
                         (1 processors available)

                                     by

                  John P. Huelsenbeck and Fredrik Ronquist

                 Section of Ecology, Behavior and Evolution
                       Division of Biological Sciences
                     University of California, San Diego
                           johnh@biomail.ucsd.edu

                       School of Computational Science
                           Florida State University
                            ronquist@csit.fsu.edu 

              Distributed under the GNU General Public License

               Type "help" or "help <command>" for information
                     on the commands that are available.

MrBayes > execute /usr/local/bench/mrbayes/arch107_L1000.nex 


   Executing file "/usr/local/bench/mrbayes/arch107_L1000.nex"
   UNIX line termination
   Longest line length = 1011
   Parsing file
   Expecting NEXUS formatted file
   Reading data block
      Allocated matrix
      Matrix has 107 taxa and 1000 characters
      Missing data coded as ?
      Gaps coded as -
      Data is Dna
      Setting default partition (does not divide up characters).
      Taxon   1 -> Har.maris2
      Taxon   2 -> Har.maris1
      Taxon   3 -> Har.mukoht
      Taxon   4 -> Ntm.pharao
      Taxon   5 -> AB012057
      Taxon   6 -> AB012052
      Taxon   7 -> AB012054
      Taxon   8 -> Hc.salifo2
      Taxon   9 -> Hb.cutirub
      Taxon  10 -> AF071880
      [....]
      Taxon 101 -> U81774
      Taxon 102 -> AB019720
      Taxon 103 -> AF068822
      Taxon 104 -> AB019721
      Taxon 105 -> AB019719
      Taxon 106 -> AB019715
      Taxon 107 -> AB019717
      Setting output file names to "/usr/local/bench/mrbayes/arch107_L1000.nex.run<i>.<p/t>"
      Successfully read matrix
   Exiting data block
   Reached end of file

MrBayes > quit

        Deleting matrix
        Quitting program




<p2>% exit
<biowulf>%

Documentation

The MrBayes website

The MrBayes manual (PDF)

The MrBayes wiki, which contains the manual and a FAQ.