Applications on Biowulf

Biowulf at the NIH

Scientific Applications on Biowulf

The Biowulf staff maintains a range of programs, packages, and databases for our users. Below is a non-exhaustive list of software available on Biowulf, with site-specific instructions for running each package on the cluster and links to vendor- or author-provided documentation where applicable.

Sequence Analysis

BLAST, developed at NCBI, is a set of programs to find similarity between a query protein or DNA sequence and a sequence database. A scheme for efficiently running a large number of sequence files against a variety of BLAST databases has been implemented on Biowulf.

IPRScan is a tool that combines different protein signature recognition methods against multiple databases into one resource. These databases include PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D and PANTHER.

The EMBOSS package is a comprehensive suite of sequence analysis software that can perform sequence alignment, motif identification, pattern analysis, and more.

WU-BLAST, developed at Washington University, is a fast, gapped BLAST with statistics, intended to find similarity between a query protein or DNA sequence and a sequence database.

BLAT is a DNA/protein sequence analysis program written by Jim Kent at UCSC. It is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes finds them down to 22 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice, DNA BLAT works well on primates, and protein BLAT on land vertebrates.

The fasta program package contains many programs for searching DNA and protein databases and one program (prss) for evaluating statistical significance from randomly shuffled sequences.

Meme is designed to discover motifs (highly conserved regions) in groups of related DNA or protein sequences, and Mast will search sequence databases using motifs.

Profile hidden Markov models (profile HMMs) can be used to do sensitive database searching using statistical descriptions of a sequence family's consensus. HMMER uses profile HMMs for several types of homology searches.

RandFold computes the probability that, for a given RNA sequence, the minimum free energy (MFE) of the secondary structure is different from a distribution of MFE values computed with random sequences.
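The idea of comparing one sequence's energy against a distribution from randomized sequences can be sketched as a generic shuffle test. This is only an illustration, not RandFold's own code or exact formula: `toy_energy` is a hypothetical stand-in for a real MFE calculation, and `shuffle_p_value` with its parameters is an assumed name.

```python
import random

# Watson-Crick pairs used by the toy scoring function below.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def toy_energy(seq):
    """Hypothetical stand-in for an MFE calculation (NOT a real folding model).

    Each complementary pair between position i and its mirror position
    lowers the score by 1, so the value depends on base order.
    """
    n = len(seq)
    return -sum((seq[i], seq[n - 1 - i]) in PAIRS for i in range(n // 2))


def shuffle_p_value(seq, energy, n_shuffles=999, seed=0):
    """Estimate how often a shuffled sequence scores as low as the original.

    Uses the common Monte Carlo estimate (hits + 1) / (n_shuffles + 1).
    Shuffling preserves base composition but destroys base order.
    """
    rng = random.Random(seed)
    observed = energy(seq)
    letters = list(seq)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if energy("".join(letters)) <= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    hairpin_like = "GGGGAAAACCCC"
    print(toy_energy(hairpin_like))
    print(shuffle_p_value(hairpin_like, toy_energy))
```

A small value suggests that the original ordering scores lower than most shuffles; with a real folding program in place of `toy_energy`, the same loop applies unchanged.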

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). On average, almost 50% of a human genomic DNA sequence currently will be masked by the program.
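The masking step described above can be illustrated with a short sketch. It is not RepeatMasker code: the function names and the 0-based, half-open interval convention are assumptions made for the example, and the repeat coordinates would in practice come from the program's annotation output.

```python
def mask_repeats(sequence, repeats, mask_char="N"):
    """Return a copy of `sequence` with each annotated repeat replaced by mask_char.

    `repeats` is a list of (start, end) intervals, 0-based and half-open
    (an assumed convention for this sketch).
    """
    masked = list(sequence)
    for start, end in repeats:
        # Clamp to the sequence bounds so out-of-range annotations are harmless.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


def masked_fraction(masked, mask_char="N"):
    """Fraction of positions that carry the mask character."""
    return masked.count(mask_char) / len(masked) if masked else 0.0


if __name__ == "__main__":
    result = mask_repeats("ACGTACGTAC", [(2, 5)])
    print(result, masked_fraction(result))
```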

Jim Kent Library jksrc454.zip

A collection of executables from Jim Kent has been compiled on Biowulf. The programs perform a multitude of tasks from simple number crunching to highly specific sequence analysis and database construction. The executables are located in the directory /usr/local/ucsc on Biowulf. Brief description of each program.

Scientific Databases

A list of all nucleotide, protein, structural, and other databases available on the system for BLAST, WU-BLAST, FASTA, etc., and their update status.

Sequence Assembly

The PCAP program is intended for large-scale assembly of genomic sequences with quality values and with or without forward-reverse read pairs.

Linkage/Phylogenetics

Fastlink/FastSLINK

FASTLINK is a modified and improved version of the original LINKAGE suite for genetic linkage analysis. The additional LINKAGE utilities are also installed. FastSLINK is a merger of code from FASTLINK v 2.x to the SLINK package, which simulates and analyzes replicates.

Loki is a linkage analysis package, primarily for large and complex pedigrees, which uses Markov chain Monte Carlo (MCMC) techniques to avoid many of the computational problems that prevent exact computational methods being used for large pedigrees.

MERLIN uses sparse trees to represent gene flow in pedigrees and is one of the fastest pedigree analysis packages around.

MrBayes performs Bayesian estimation of phylogeny.

SimWalk2 is a statistical genetics computer application for haplotype, parametric linkage, non-parametric linkage (NPL), identity by descent (IBD) and mistyping analyses on any size of pedigree. It uses Markov chain Monte Carlo (MCMC) and simulated annealing algorithms to perform these multipoint analyses.

A package of software to perform several kinds of statistical genetic analysis, including linkage analysis, quantitative genetic analysis, and covariate screening.

PAUP* (Phylogenetic Analysis Using Parsimony) is a software package for inference of evolutionary trees.

Tools for the statistical analysis of family-based association studies (FBAT).

PLINK is a whole-genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.

TREE-PUZZLE is a computer program to reconstruct phylogenetic trees from molecular sequence data by maximum likelihood. It implements a fast tree search algorithm, quartet puzzling, that allows analysis of large data sets and automatically assigns estimations of support to each internal branch.

Computational Chemistry / Molecular Modeling

AMBER (Assisted Model Building with Energy Refinement) is a package of molecular simulation programs. Version 9 is currently installed on Biowulf. Major programs in the AMBER package include sander, gibbs, nmode, and LEaP.

APBS (Adaptive Poisson-Boltzmann Solver) is a software package for the numerical solution of the Poisson-Boltzmann equation (PBE), one of the most popular continuum models for describing electrostatic interactions between molecular solutes in salty, aqueous media.

A suite of automated docking tools. It is designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure.

CHARMM (Chemistry at HARvard Molecular Mechanics) is a program which supports a wide range of theoretical modeling calculations of the structure and dynamics of biological molecules. In addition to energy minimization and molecular dynamics simulations, Monte Carlo sampling, use of genetic algorithms, and several interfaces to quantum codes (AM1, GAMESS) are available or under development. Recent CHARMM versions have been made available for use on Biowulf, as a joint effort between NHLBI/LBC Computational Biophysics Section and CBER/OVRR Biophysics Lab and with the support of Biowulf Staff. Multiple executables are available for each version, in order to support larger molecular systems, and the different types of parallel communications available on Biowulf, i.e. ethernet and Myrinet 2000. The support files are also available for the above versions, e.g. version .doc files, and the standard topology and parameter files.

GAMESS is a program for ab initio quantum chemistry. Briefly, GAMESS can compute wavefunctions ranging from RHF, ROHF, UHF, GVB, and MCSCF, with CI and MP2 energy corrections available for some of these. Analytic gradients are available for these SCF functions, for automatic geometry optimization, transition state searches, or reaction path following. Computation of the energy hessian permits prediction of vibrational frequencies. A variety of molecular properties, ranging from simple dipole moments to frequency dependent hyperpolarizabilities may be computed.

Gaussian 03 is a series of electronic structure programs performing computations starting from the basic laws of quantum mechanics. Gaussian can predict energies, molecular structures, vibrational frequencies for systems in the gas phase and in solution, and it can model them in both their ground state and excited states.

GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins and lipids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (which usually dominate simulations), many groups are also using it for research on non-biological systems, e.g. polymers.
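To make "simulate the Newtonian equations of motion" concrete, here is a minimal sketch of one common integration scheme, velocity Verlet, applied to a single particle on a spring. It is a textbook illustration under stated assumptions, not GROMACS code and not necessarily the integrator a given GROMACS run uses; all names and parameter values are hypothetical.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Advance position x and velocity v by n_steps of size dt.

    `force` is a function of position. One force evaluation per step is
    reused for the second half-kick of the current step and the first
    half-kick of the next.
    """
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update
    return x, v


def harmonic_force(x, k=1.0):
    """Toy force for a spring with stiffness k (illustrative system only)."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy spring system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


if __name__ == "__main__":
    x, v = velocity_verlet(1.0, 0.0, harmonic_force, mass=1.0, dt=0.01, n_steps=1000)
    print(x, v, total_energy(x, v))
```

For this toy system the total energy should stay close to its starting value, which is a quick sanity check on any integrator of this kind.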

NAMD is a parallel molecular dynamics program for UNIX platforms designed for high-performance simulations in structural biology. It is developed by the Theoretical Biophysics Group at the Beckman Center, University of Illinois. NAMD is particularly well suited to Beowulf clusters, as it was specifically designed to run efficiently on parallel machines.
VMD, the molecular visualization program integrated with NAMD, is also available on Helix and Biowulf.

PROSPECT is a threading-based protein structure prediction system. PROSPECT will find structural homologs of a target sequence, even when the structural homolog sequences have insignificant identity to the target sequence.

An ab initio electronic structure program capable of performing first principles calculations on both the ground and excited states of molecules.

A limited number of Schrödinger applications (such as MacroModel, Jaguar, and QikProp) are available through the Molecular Modeling Interest Group.

Proteomics / Mass Spectrometry

OMSSA is an efficient search engine for identifying MS/MS peptide spectra by searching libraries of known protein sequences. It scores significant hits with a probability score developed using classical hypothesis testing, the same statistical method used in BLAST.

Matches tandem mass spectra with peptide sequences for protein identification.

An MS/MS database search tool specifically designed to address two crucial needs of the proteomics community: post-translational modification identification and search speed.

Mathematical Analysis / Statistics

The GAUSS Mathematical and Statistical System is a fast matrix programming language designed for computationally intensive tasks, which has a wide variety of statistical, mathematical and matrix handling routines.

Matlab integrates mathematical computing, visualization, and a powerful language to provide a flexible environment for technical computing.

Mathematica is a fully integrated environment for technical and scientific computing. Mathematica combines numerical and symbolic computation, visualization, and programming in a single, flexible interactive system.

GNU Octave is an open-source language for numerical calculations that has a command-line interface and can interpret many (but not all) Matlab scripts. It is not license-limited and so can be used for many simultaneous independent runs.

R (the R Project) is a language and environment for statistical computing and graphics. R is similar to S, and provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, ...).

The SAS System is an integrated, hardware-independent system of applications software for data access, management, statistical analysis and report writing. The Base SAS windowing environment provides a full-screen facility for interacting with all parts of a SAS program.

Scilab is an open-source alternative to Matlab which includes hundreds of mathematical functions and the ability to interactively add C/Fortran programs. It includes a Matlab->Scilab converter.

FSL is a comprehensive library of image analysis and statistical tools for FMRI, MRI and DTI brain imaging data.

AFNI (Analysis of Functional NeuroImages) is a set of C programs for processing, analyzing, and displaying functional MRI (FMRI) data - a technique for mapping human brain activity.

EMAN is a suite of scientific image processing tools aimed primarily at the transmission electron microscopy community, though it is beginning to be used in other fields as well.

Structural Biology

Xplor-NIH is a structure determination program which builds on the X-PLOR program, including additional tools for NMR analysis. The advantage of running Xplor-NIH on Biowulf is the ability to spawn a large number of independent refinement jobs that run on multiple Biowulf nodes.

POV-Ray (Persistence of Vision Raytracer) is a high-quality tool for creating three-dimensional graphics. Raytraced images are publication-quality and 'photo-realistic', but are computationally expensive, so large images can take many hours to create. POV-Ray images can also require more memory than many desktop machines can handle. To address these concerns, a parallelized version of POV-Ray has been installed on the Biowulf system.

Queen of Spades

Qs (Queen of Spades) is a "brute force" style molecular replacement program which uses a method based on a reverse Monte Carlo minimisation of the conventional crystallographic R-factor in the 6n-dimensional space defined by the rotational and translational parameters of the n molecules. Because all parameters of all molecules are determined simultaneously, this algorithm should improve the signal-to-noise ratio in difficult cases involving high crystallographic/non-crystallographic symmetry in tightly packed crystal forms.
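The quantity being minimised, the conventional crystallographic R-factor, can be written as a short function. This is a sketch of the standard unscaled definition, R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|), and is not taken from the Qs source; the function name and the omission of a scale factor are assumptions of the example.

```python
def r_factor(f_obs, f_calc):
    """Conventional R-factor between observed and calculated amplitudes.

    Both inputs are sequences of structure factor amplitudes in the same
    reflection order. No scale factor is applied in this sketch.
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    denominator = sum(abs(f) for f in f_obs)
    if denominator == 0:
        raise ValueError("sum of observed amplitudes is zero")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    return numerator / denominator


if __name__ == "__main__":
    observed = [10.0, 20.0, 30.0]
    print(r_factor(observed, observed))            # identical data sets
    print(r_factor(observed, [12.0, 18.0, 30.0]))  # small disagreement
```

A search over the rotational and translational parameters would recompute the calculated amplitudes for each trial placement and favour placements with a lower value of this function.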

AMoRe is an automated utility for performing molecular replacement using fast rotation and translation functions in a step-wise fashion.

HADDOCK (High Ambiguity Driven protein-protein DOCKing) is an approach for predicting protein-protein complex structures that makes use of biochemical and/or biophysical interaction data such as chemical shift perturbation data resulting from NMR titration experiments or mutagenesis data.

The Rosetta++ software suite can perform de novo protein structure predictions, identify low free energy sequences for target protein backbones, predict the structure of a protein-protein complex from the individual structures of the monomer components, incorporate NMR data into the basic Rosetta protocol to accelerate the process of NMR structure prediction, and more...

Chemical-Shift-ROSETTA is a robust protocol to use NMR chemical shifts for de novo protein structure generation by SPARTA-based selection of protein fragments from the PDB, in conjunction with a regular ROSETTA Monte Carlo assembly and relaxation method.

ZDOCK predicts protein-docking models, and uses a fast Fourier transform to search all possible binding modes for proteins, evaluating based on shape complementarity, desolvation energy, and electrostatics.

Nest is a command-line homology model builder (written by Jason (Zhexin) Xiang) on par with MODELER. To use, type nest at the prompt. Nest can be used in conjunction with PROSPECT using prospect2pdb.pl.

General Purpose

Swarm is a program designed to simplify submitting a group of commands to the Biowulf cluster. Some programs do not scale well and thus are not suited to true parallelization. Other programs may be such that each individual job is very short, but many such jobs need to be run. Such programs are well suited to running 'swarms of single-threaded jobs'. The Swarm program simplifies this process. See the documentation for details. Download swarm.
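As an illustration of preparing "a group of commands" for many short, independent runs, the sketch below builds one command line per input file and writes them to a text file. It assumes a one-command-per-line file layout; the program name `myprog`, its `-x` option, and the file names are hypothetical, and the Swarm documentation remains the reference for the actual file format and submission step.

```python
import shlex


def build_command_lines(program, input_files, extra_args=()):
    """Return one shell-quoted command line per input file."""
    lines = []
    for path in input_files:
        parts = [program, *extra_args, path]
        # shlex.quote protects file names that contain spaces or shell characters.
        lines.append(" ".join(shlex.quote(part) for part in parts))
    return lines


def write_command_file(lines, out_path):
    """Write the commands to out_path, one per line."""
    with open(out_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    commands = build_command_lines("myprog", ["a.fa", "b c.fa"], ["-x"])
    write_command_file(commands, "commands.txt")
    print("\n".join(commands))
```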

A collection of utility programs is also available on the Biowulf cluster.