Structural Biology on the Biowulf Cluster
David Hoover, hooverdm@helix.nih.gov
Helix Systems, CIT/NIH
December 5, 2007
Biowulf cluster -- details
Dependent vs. independent parallel processing
There are two computational situations which are very well suited to a large cluster.
Dependent parallel processing: large, monolithic processes which can be broken into smaller interdependent processes:
These are typically solved using an application that is already parallelized. The application usually takes an input file, perhaps some command-line options, and possibly some required environment variables.
Independent parallel processing: short processes that can be run independently, with the results combined later on (also termed embarrassingly parallel):
These sometimes take work from the user to set up. This can typically be done with shell scripts. Some complex situations call for Perl or Python scripts, or C/C++/Fortran programs for the ambitious.
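As a sketch of the "run independently, combine later" pattern (the `analyze` function and file names are hypothetical stand-ins for a real program and its data):

```shell
#!/bin/bash
# Sketch: run several independent tasks, then combine the results.
# "analyze" is a hypothetical placeholder for the real per-input program.
analyze() { echo "result for $1"; }      # placeholder work
for i in 1 2 3 4; do
    analyze "$i" > "part-$i.out" &       # each task runs independently
done
wait                                     # let them all finish
cat part-*.out > combined.out            # combine the results afterward
```

On the cluster, each loop iteration would instead become its own batch job (see the swarm command below), but the structure is the same.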
Applications
AMBER
home page: http://amber.scripps.edu/
version: 9.0
type: molecular dynamics
ease-of-use: *
documentation: http://biowulf.nih.gov/apps/amber.html
parallelized? yes
myrinet? yes
scaling: 8-16 cpu

AMBER is a package of molecular simulation programs. It is also a set of molecular mechanics force fields for the simulation of biomolecules. AMBER was initially created by Peter Kollman; the package is currently a joint development of at least six institutions.
There are about 50 programs in version 8. The main programs are segregated into three categories:
- Preparatory programs: LEaP, antechamber
- Simulation programs: sander, pmemd, nmode
- Analysis programs: ptraj, and mm_pbsa
Unless you are a hard-core theoretical chemist, you probably want to go through the AMBER tutorials before doing anything specialized.
AMBER is compiled to run as an MPI-parallel program across multiple nodes. It can also use Myrinet interconnects to improve efficiency. It can scale to about 8 or 16 processors, depending on the processor and interconnect type as well as the program executed.
CHARMM
home page: http://www.charmm.org
version: 27-34, others
type: molecular dynamics
ease-of-use: *
documentation: http://biowulf.nih.gov/apps/charmm/index.html
parallelized? yes
myrinet? yes
scaling: 16 cpu

CHARMM (Chemistry at HARvard Macromolecular Mechanics) is a command-line program for performing molecular dynamics simulations of biomolecules. CHARMM was initially created by Martin Karplus; like AMBER, CHARMM is currently a joint development of at least a dozen institutions, including NIH. CHARMM is actively developed here at NIH by Rick Venable and Bernard Brooks.
CHARMM can be run in parallel over either regular ethernet or Myrinet interconnects, depending on the version used.
Two command scripts, qcharmm and mpicharmm, are available for simplifying submission of CHARMM jobs to Biowulf. An input script (.inp) with a series of commands and variables is required to run.
Here is a decent introduction to MD using CHARMM.
GROMACS
home page: http://www.gromacs.org
version: 3.3.1, 3.2.1
type: molecular dynamics
ease-of-use: ***
documentation: http://biowulf.nih.gov/apps/gromacs/index.html
parallelized? yes
myrinet? yes
scaling: 10-20 cpu

GROMACS (GROningen MAchine for Chemical Simulations) is touted as the World's Fastest Molecular Dynamics, and it is definitely more user-friendly than CHARMM or AMBER. It was designed and developed primarily by Herman Berendsen's group at Groningen University, although there is some collaboration with other institutions.
Coarse grain simulations are also possible, speeding up simulations by ~1000 fold.
NAMD
home page: http://www.ks.uiuc.edu/Research/namd
version: 2.6
type: molecular dynamics
ease-of-use: ***
documentation: http://biowulf.nih.gov/apps/namd/index.html
parallelized? yes
myrinet? no
scaling: 4-32+ cpu

NAMD (Not Another Molecular Dynamics program) is a molecular dynamics simulation program that was designed specifically for Beowulf-class clusters (like Biowulf). It was developed by the Theoretical Biophysics Group at the Beckman Institute (University of Illinois).
NAMD, like GROMACS, primarily performs molecular dynamics. It scales quite well with ordinary ethernet interconnects, but not very well with Myrinet interconnects.
NAMD takes easily obtained PSF, PDB, and parameter files from CHARMM and X-PLOR as input, and is submitted via qsub with simple commands.
NAMD uses spatial-decomposition strategies for parallelism; CHARMM and AMBER use atom-decomposition (replicated-data) strategies.
VMD was written specifically for NAMD, so the output is very easily visualized.
APBS
home page: http://apbs.sourceforge.net
version: 0.5.0
type: electrostatics
ease-of-use: **
documentation: http://biowulf.nih.gov/apps/apbs.html
parallelized? no
myrinet? no
scaling: n/a

APBS (Adaptive Poisson-Boltzmann Solver) is a software package for the numerical solution of the Poisson-Boltzmann equation (PBE).
APBS is run in batch mode, and its output can be visualized using VMD. It is similar to GRASP, but is more complex and powerful.
GAMESS
home page: http://www.msg.ameslab.gov/GAMESS/
version: Mar. 2007
type: quantum chemistry
ease-of-use: ***
documentation: http://biowulf.nih.gov/apps/gamess.html
parallelized? yes
myrinet? no
scaling: 8 cpu?

GAMESS (the General Atomic and Molecular Electronic Structure System) is a general ab initio quantum chemistry package. GAMESS is maintained by the members of the Gordon research group at Iowa State University.
GAUSSIAN03
home page: http://www.gaussian.com/g03.htm
version: D02
type: quantum chemistry
ease-of-use: ***
documentation: http://biowulf.nih.gov/apps/gaussian/
parallelized? no
myrinet? no
scaling: n/a

Gaussian03 is the latest in the Gaussian series of electronic structure programs. Designed to model a broad range of molecular systems under a variety of conditions, it performs its computations starting from the basic laws of quantum mechanics.
Q-chem
home page: http://www.q-chem.com/
version: 2.1
type: quantum chemistry
ease-of-use: **
documentation: http://biowulf.nih.gov/apps/q-chem.html
parallelized? yes
myrinet? no
scaling: ?

Q-Chem is an ab initio electronic structure program capable of performing first-principles calculations on both the ground and excited states of molecules.
PROSPECT
home page: http://compbio.ornl.gov/structure/prospect2/index.html
version: 2.0
type: structure prediction
ease-of-use: ***
documentation: http://biowulf.nih.gov/apps/prospect_guide.html
parallelized? yes
myrinet? no
scaling: 64+ cpu

PROSPECT is a threading-based protein structure prediction system. PROSPECT will find structural homologs of a target sequence, even when the structural homolog sequences have insignificant identity to the target sequence.
HADDOCK
home page: http://www.nmr.chem.uu.nl/haddock/
version: 2.0
type: structure prediction
ease-of-use: *
documentation: http://biowulf.nih.gov/apps/haddock_biowulf.html
parallelized? yes
myrinet? no
scaling: ?

HADDOCK (High Ambiguity Driven protein-protein DOCKing) is an approach for predicting protein-protein complex structures that makes use of biochemical and/or biophysical interaction data, such as chemical shift perturbation data resulting from NMR titration experiments, or mutagenesis data.
CNS, XPLOR-NIH
home page: http://cns.csb.yale.edu/v1.1/
version: 1.1
type: structure determination and refinement
ease-of-use: *
documentation: http://helix.nih.gov/apps/structbio/cns.html, http://biowulf.nih.gov/apps/xplor-nih.html
parallelized? yes and no
myrinet? no
scaling: ?

Crystallography and NMR System (CNS) is a flexible multi-level package for macromolecular structure determination.
Xplor-NIH is a structure determination program which builds on the X-PLOR program, including additional tools for NMR analysis. The advantage of running Xplor-NIH on Biowulf would be to spawn a large number of independent refinement jobs which would run on multiple Biowulf nodes.
Qs
home page: http://www.mbg.duth.gr/~glykos/Qs.html
version: 1.3
type: structure determination and refinement
ease-of-use: ***
documentation: http://biowulf.nih.gov/apps/Qs/index.html
parallelized? no
myrinet? no
scaling: n/a

Qs (Queen of Spades) is a "brute force" style molecular replacement program which uses a method based on a reverse Monte Carlo minimisation of the conventional crystallographic R-factor in the 6n-dimensional space defined by the rotational and translational parameters of the n molecules. Because all parameters of all molecules are determined simultaneously, this algorithm should improve the signal-to-noise ratio in difficult cases involving high crystallographic/non-crystallographic symmetry in tightly packed crystal forms.
AMoRe
home page: http://www.gv.cnrs-gif.fr/english/vs2-english.html
version: n/a
type: structure determination and refinement
ease-of-use: **
documentation: http://biowulf.nih.gov/apps/amore/index.html
parallelized? no
myrinet? no
scaling: n/a

AMoRe is an automated utility for performing molecular replacement using fast rotation and translation functions in a step-wise fashion.
PovRay
home page: http://www.povray.org/
version: 3.1, 3.6
type: visualization
ease-of-use: **
documentation: http://biowulf.nih.gov/apps/povray/index.html
parallelized? yes and no
myrinet? no
scaling: n/a

POVRAY (Persistence of Vision RAYtracer) is a high-quality tool for creating three-dimensional graphics. Raytraced images are publication-quality and 'photo-realistic', but are computationally expensive, so large images can take many hours to create. PovRay images can also require more memory than many desktop machines can handle. To address these concerns, a parallelized version of PovRay (povray_swarm) has been installed on the Biowulf system.
VMD
home page: http://www.ks.uiuc.edu/Research/vmd/current/docs.html
version: 1.8.6
type: visualization
ease-of-use: ****
documentation: http://helix.nih.gov/Applications/vmd.html
parallelized? no
myrinet? no
scaling: n/a

VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting. It has powerful and comprehensive filtering and configuration capabilities. It is especially well suited for analyzing NAMD results.
RasMol
home page: http://www.umass.edu/microbio/rasmol/
version: 2.7.2.1
type: visualization
ease-of-use: ***
documentation: http://www.openrasmol.org/
parallelized? no
myrinet? no
scaling: n/a

RasMol is a molecular graphics program intended for the visualisation of proteins, nucleic acids, and small molecules. The program is aimed at display, teaching, and generation of publication-quality images. RasMol runs on a wide range of architectures and operating systems, including Microsoft Windows, Apple Macintosh, UNIX, and VMS systems. UNIX and VMS versions require an 8, 24, or 32 bit colour X Windows display (X11R4 or later). The X Windows version of RasMol provides optional support for a hardware dials box and accelerated shared-memory communication (via the XInput and MIT-SHM extensions) if available on the current X Server.
Rosetta
home page: http://www.rosettacommons.org/
version: 2.2
type: protein structure prediction and modeling
ease-of-use: *
documentation: http://biowulf.nih.gov/apps/Rosetta.html
parallelized? no
myrinet? no
scaling: n/a

The Rosetta++ software suite focuses on the prediction and design of protein structures, protein folding mechanisms, and protein-protein interactions. The Rosetta codes have been repeatedly successful in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition as well as the CAPRI competition, and have been modified to address additional aspects of protein design, docking, and structure.
ZDOCK
home page: http://zlab.bu.edu/zdock/index.shtml
version: 2.3
type: protein modeling
ease-of-use: ***
documentation: http://biowulf.nih.gov/apps/zdock.html
parallelized? yes
myrinet? no
scaling: up to 32 cpu

ZDOCK uses a fast Fourier transform to search all possible binding modes for the proteins, evaluating them based on shape complementarity, desolvation energy, and electrostatics.
nest
home page: http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:nest
version: n/a
type: homology modeling
ease-of-use: *
documentation: http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:nest
parallelized? no
myrinet? no
scaling: n/a

nest is a program for modeling protein structure based on a given sequence-template alignment. It has the following capabilities:
- model building with artificial evolution
- sequence alignment tuning
- composite structure building
- model building based on multiple templates
- structure refinement
nest can be used to build homology models based on:
- a single sequence-template alignment
- multiple templates for the entire structure
- different templates for different regions of the structure
It also carries out energy-based structure refinement and can change an alignment based on energetic considerations.
nest, and the entire Jackal suite from Jason Xiang, is also available through mmignet.
Autodock
home page: http://autodock.scripps.edu/
version: 3.0.5
type: protein modeling
ease-of-use: **
documentation: http://biowulf.nih.gov/apps/autodock.html
parallelized? no
myrinet? no
scaling: n/a

Autodock is a suite of automated docking tools. It is designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure. Autodock was developed at the Scripps Research Institute in San Diego.
PROCHECK
home page: http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
version: 3.5
type: structure analysis
ease-of-use: **
documentation: http://biowulf.nih.gov/apps/procheck/
parallelized? no
myrinet? no
scaling: n/a

PROCHECK checks the stereochemical quality of a protein structure, producing a number of PostScript plots analysing its overall and residue-by-residue geometry.
DSSP
home page: http://swift.cmbi.ru.nl/gv/dssp/
version: Nov. 2002, CMBI version
type: structure analysis
ease-of-use: ***
documentation: n/a
parallelized? no
myrinet? no
scaling: n/a

The DSSP program was designed by Wolfgang Kabsch and Chris Sander to standardize secondary structure assignment. DSSP is a database of secondary structure assignments (and much more) for all protein entries in the Protein Data Bank (PDB). DSSP is also the program that calculates DSSP entries from PDB entries.
Benchmarks for MD (AMBER, CHARMM, and NAMD): http://brooks.scripps.edu/charmm_docs/Benchmarks/chm_amb_namd.html
Running the applications
All applications run on Biowulf must be submitted through qsub. For short tests, an interactive session can be started with the -I flag, but all long runs (greater than 30 minutes) should be submitted to the regular batch queue.
A script containing commands is created:
myjob.sh:
#!/bin/bash
myprog < /data/me/mydata
This is submitted with qsub:
qsub -l nodes=1 myjob.sh
Minimally, the number of nodes must be supplied with the -l nodes=1 option. More precise node properties can be added:
qsub -l nodes=1:o2200:myr2k:m2048 myjob.sh
Node properties:
- faste: fast ethernet (100 Mb/s) interconnect
- gige: gigabit ethernet (1 Gb/s) interconnect
- myr2k: Myrinet (2 Gb/s) interconnect
- ib: Infiniband (10 Gb/s) interconnect
- m1024: 1 GB memory
- m2048: 2 GB memory
- m4096: 4 GB memory
- p2800: 2.8 GHz Intel Xeon
- o2000: 2.0 GHz AMD Opteron 246
- o2200: 2.2 GHz AMD Opteron 248
- o2600: 2.6 GHz AMD Opteron 285, dual-core (4 CPU)
- o2800: 2.8 GHz AMD Opteron 254
- altix: SGI Altix 350 (see Firebolt page for more information)
- x86-64: o2000 + o2200 + o2800 nodes
- dc: dual-core (o2600) nodes
- centos: 2.8 GHz dual-core (o2800) nodes running CentOS 4.2
Other options:
-N name             Declare a name for the job
-m mail_options     Send mail to the user upon 'a' (abort), 'b' (begin), or 'e' (end)
-k keep             Keep output files: e = standard error, o = standard output
-S path_list        Declare the shell that interprets the job
-v variable_list    Export named environment variables to hosts running the job
-V                  Export all environment variables
All options (except for the -l nodes=... option) can be placed within the qsub script:
myjob.sh:
#!/bin/bash
#PBS -N MyJob
#PBS -m be
#PBS -k oe
#PBS -V
#PBS -S /bin/sh
myprog < /data/me/mydata
PBS-specific environment variables:
$PBS_O_HOST        name of the host upon which the qsub command is running
$PBS_O_QUEUE       name of the original queue to which the job was submitted
$PBS_O_SYSTEM      operating system name given by uname -s on $PBS_O_HOST
$PBS_O_WORKDIR     absolute path of the directory from which the qsub command was given
$PBS_ENVIRONMENT   either PBS_BATCH or PBS_INTERACTIVE
$PBS_JOBID         job identifier assigned to the job by the batch system
$PBS_JOBNAME       job name supplied by the user
$PBS_NODEFILE      pathname of the file containing the list of nodes assigned to the job
$PBS_QUEUE         name of the queue from which the job is executed
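A batch script can use these variables to return to the submission directory and count its allocated processors. A minimal sketch (the fallback values are for running the script outside of PBS; `myprog` is a placeholder):

```shell
#!/bin/bash
#PBS -N EnvDemo
# Run from the directory where qsub was invoked.
cd "${PBS_O_WORKDIR:-$PWD}"
# Count the processors PBS assigned by reading the node file.
if [ -n "$PBS_NODEFILE" ] && [ -f "$PBS_NODEFILE" ]; then
    np=$(wc -l < "$PBS_NODEFILE")
else
    np=1    # fallback when run outside PBS
fi
echo "Job ${PBS_JOBID:-interactive} using $np processor(s)"
# myprog would be launched here with $np processes
```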
Monitoring and deleting qsub jobs
Monitor through the web: http://biowulf.nih.gov/sysmon/
Monitor interactively:
[biowulf]$ freen
         m1024    m2048    m4096    m8192    Total
----------------- GeneralPool -----------------
o2800      /        /      0/210      /      0/210
o2200      /      22/232    0/58      /     22/290
o2000      /      17/40      /        /     17/40
p2800    2/79     91/195    0/62      /     93/336
------------------- Myrinet -------------------
o2200      /      34/71      /        /     34/71
o2000    38/47      /        /        /     38/47
p2800    37/38      /        /        /     37/38
----------------- Infiniband ------------------
o2800      /        /      14/93      /     14/93
------------------ Reserved -------------------
o2800      /      46/89      /      27/34    73/123
o2600      /        /      46/274     /     46/274
-------------------- Altix --------------------
Available: 15 processors, 15.2 GB memory
qstat -u user displays a simple list of jobs for a single user:
[biowulf]$ qstat -u me
biobos:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
999999.biobos me norm MyJob 8713 1 1 -- -- R 99:99
qstat -f jobid displays a detailed report of a single job:
[biowulf]$ qstat -f 999999.biobos
Job Id: 999999.biobos
Job_Name = MyJob
Job_Owner = me@p1397
resources_used.cpupercent = 98
resources_used.cput = 23:26:24
resources_used.mem = 9452kb
resources_used.ncpus = 1
resources_used.vmem = 152328kb
resources_used.walltime = 23:26:56
job_state = R
queue = norm
server = biobos
Checkpoint = u
ctime = Wed Jan 4 14:25:35 2006
Error_Path = p1397:/home/me/MyJob.e999999
exec_host = p295/0
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = ae
mtime = Wed Jan 4 14:27:00 2006
Output_Path = p1397:/home/me/MyJob.o999999
Priority = 0
qtime = Wed Jan 4 14:25:35 2006
Rerunable = True
Resource_List.ncpus = 1
Resource_List.neednodes = 1:faste
Resource_List.nodect = 1
Resource_List.nodes = 1:faste
session_id = 8713
Variable_List = PBS_O_HOME=/home/me,PBS_O_LANG=en_US,
PBS_O_LOGNAME=me,PBS_O_PATH=/bin:/usr/bin,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=p1397,PBS_O_WORKDIR=/home/me
PBS_O_SYSTEM=Linux,PBS_O_QUEUE=vlong
comment = Job run at Wed Jan 04 at 14:26
etime = Wed Jan 4 14:25:35 2006
qdel jobid kills the job:
[biowulf]$ qdel 999999.biobos
Sometimes you have to push:
[biowulf]$ qdel -W force 999999.biobos
Delete all your jobs:
[biowulf]$ qdel -W force `qselect -u me`
The swarm command
http://biowulf.nih.gov/apps/swarm.html
A large set of independent processes can be submitted automatically to the cluster without having to create a qsub script for each process.
MyJob.swarm:
cd /home/me/a; myprog -param a < infile-a > outfile-a
cd /home/me/b; myprog -param b < infile-b > outfile-b
cd /home/me/c; myprog -param c < infile-c > outfile-c
cd /home/me/d; myprog -param d < infile-d > outfile-d
cd /home/me/e; myprog -param e < infile-e > outfile-e
cd /home/me/f; myprog -param f < infile-f > outfile-f
cd /home/me/g; myprog -param g < infile-g > outfile-g
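Repetitive swarm files like the one above need not be typed by hand; a short bash loop can generate them (directory layout and `myprog` as in the example):

```shell
#!/bin/bash
# Generate the seven-line swarm file from the example above.
# /home/me, myprog, and the infile/outfile names are from the example.
for p in a b c d e f g; do
    echo "cd /home/me/$p; myprog -param $p < infile-$p > outfile-$p"
done > MyJob.swarm
```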
Submit the swarm job:
[biowulf]$ swarm -f MyJob.swarm -V -l nodes=1:x86-64
720749.biobos
720750.biobos
720751.biobos
720752.biobos
Bundled swarm jobs:
If there are thousands of processes within a single swarm file, each lasting a minuscule amount of time, it is better to serially run a block of individual processes on a single host rather than spawn a new batch job for each process. This makes PBS much happier and is a much more efficient use of time:
[biowulf]$ swarm -b 100 -f MyJob.swarm -V -l nodes=1:x86-64
720805.biobos
720806.biobos
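The number of resulting batch jobs is the ceiling of (command lines) / (bundle size). A quick sketch of the arithmetic (the total of 5000 is an example value, not from the document):

```shell
#!/bin/bash
# With -b 100, swarm packs up to 100 command lines into each batch job.
total=5000      # command lines in the swarm file (hypothetical example)
bundle=100      # value passed to swarm -b
jobs=$(( (total + bundle - 1) / bundle ))   # ceiling division
echo "$total commands in bundles of $bundle -> $jobs batch jobs"
```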
Deleting a single set of swarm jobs:
It is tricky to delete a single set of swarm jobs in the midst of other jobs in the batch queue. This is simplified by using the swarmdel command.
Type the swarmdel command using one of the jobids as the argument:
[biowulf]$ swarmdel 720751.biobos
720749 'swarm1n29087' deleted
720750 'swarm2n29087' deleted
720751 'swarm3n29087' deleted
720752 'swarm4n29087' deleted
MPI and multirun
http://www-unix.mcs.anl.gov/mpi/
MPI is a library specification for message-passing, proposed as a standard by a broadly based committee of vendors, implementors, and users. A program is compiled using MPICH (TCP/IP) or MPICH-GM (Myrinet GM), and the program is run using the command mpirun:
mpirun -nolocal -machinefile $PBS_NODEFILE -np 8 MyProg
Here is a typical batch command file to run an MPI-compiled program (AMBER):
amber.run:
#!/bin/csh
#PBS -N sander
#PBS -m be
#PBS -k oe
set path = ( /usr/local/mpich-pg/bin $path )
set file = /data/me/amber/dinuc_test
cd /data/me/amber/nomyri
date
mpirun -machinefile $PBS_NODEFILE -np $np /usr/local/amber/exe.mpich-pg/sander \
    -i $file.in -o $file.out -p $file.top -c $file.coor -x $file.crd -e $file.en \
    -inf $file.info -r $file.rst
This script can be submitted with the qsub command:
[biowulf]$ qsub -v np=8 -l nodes=4:o2200 amber.run
The multirun command
multirun is similar to swarm, but more controlled (and often less efficient): it creates a single job with unified STDOUT and STDERR output files.
1. Create an executable shell script which will run multiple instances of your program (run6.sh):
#!/bin/csh
switch ($MP_CHILD)
case 0:
MyProg < args0
breaksw
case 1:
MyProg < args1
breaksw
case 2:
MyProg < args2
breaksw
case 3:
MyProg < args3
breaksw
case 4:
MyProg < args4
breaksw
case 5:
MyProg < args5
breaksw
endsw
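For those who prefer bash over csh, the same dispatch can be written with a bash case statement; `MyProg` and the args files are the placeholders from the example, and `$MP_CHILD` is the index multirun assigns to each copy of the script:

```shell
#!/bin/bash
# bash equivalent of run6.sh: pick an input file by multirun child index.
# MyProg and args0..args5 are hypothetical placeholders from the example.
case "${MP_CHILD:-0}" in
    0) input=args0 ;;
    1) input=args1 ;;
    2) input=args2 ;;
    3) input=args3 ;;
    4) input=args4 ;;
    5) input=args5 ;;
    *) echo "unexpected MP_CHILD=$MP_CHILD" >&2; exit 1 ;;
esac
echo "child ${MP_CHILD:-0} would run: MyProg < $input"
```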
2. Use mpirun in your batch command file (MyJob.sh) to run the mpi shell program (run6.sh):
#!/bin/tcsh
#PBS -N MyJob
#PBS -m be
#PBS -k oe
set path=(/usr/local/mpich/bin $path)
mpirun -machinefile $PBS_NODEFILE -np 6 \
/usr/local/bin/multirun -m /home/me/run6.sh
3. Submit the job to the batch system:
[biowulf]$ qsub -l nodes=3 MyJob.sh
Large-scale structural biology
The term "large-scale" here refers to repetitively executing a series of programs on a large number of individual inputs (protein structures, nucleotide sequences, data sets, etc.).
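As a sketch of what such a driver looks like (the program names `step1`/`step2` and the data layout are hypothetical), a loop over structures can emit one swarm line per input, chaining the programs for each:

```shell
#!/bin/bash
# Hypothetical large-scale pipeline: two placeholder analysis steps
# per PDB structure, emitted as one swarm command line per input.
mkdir -p pdb results
touch pdb/1abc.pdb pdb/2xyz.pdb        # stand-in structure files
for pdb in pdb/*.pdb; do
    id=$(basename "$pdb" .pdb)
    echo "step1 $pdb > results/$id.s1; step2 results/$id.s1 > results/$id.out"
done > pipeline.swarm
```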
Practical tips for parallelizing jobs using scripts
Important commands and tools:
Important concepts:
Managing I/O, memory, and disk space requirements
Important elements:
Important concepts:
Visualizing results
This document is available as http://helix.nih.gov/talks/strbio.html
Last modified: 05 Dec 2007