HMMER on the Biowulf Linux Cluster

HMMER on Biowulf

Profile hidden Markov models for biological sequence analysis

Profile hidden Markov models (profile HMMs) can be used to do sensitive database searching using statistical descriptions of a sequence family's consensus. HMMER uses profile HMMs, and can be useful in situations like:

You are working with an evolutionarily diverse protein family, and a BLAST search with any individual sequence may not find the rest of the sequences in the family.
The top hits in a BLAST search are hypothetical sequences from genome projects.
Your protein consists of several domains of different types.

HMMER (pronounced 'hammer', as in a more precise mining tool than BLAST) was developed by Sean Eddy at Washington University in St. Louis. The HMMER website is hmmer.janelia.org.

HMMER User Guide (PDF)

HMMER is very CPU-intensive and is parallelized using threads, so each instance of hmmpfam or hmmsearch can use all the CPUs available on a node. HMMER on Biowulf is intended for those who need to run HMMER searches on large numbers of query sequences.

Searching query sequences against a profile HMM database

One use of HMMER is to look for known domains in a query sequence by searching a single sequence against a library of HMMs. One such library is the PFAM database. PFAM is available and updated on our systems in the directory /fdb/fastadb/pfam. It is also possible to create your own database (see the user guide for details).

Create a swarm command file with one line for each of the query sequences. Sample swarm command file:

---------------- file swarm.cmd ----------------------------------------------------
hmmpfam  /fdb/fastadb/pfam/Pfam_fs  /data/user/seqs/myseq1 > /data/user/out/seq1.out
hmmpfam  /fdb/fastadb/pfam/Pfam_fs  /data/user/seqs/myseq2 > /data/user/out/seq2.out
hmmpfam  /fdb/fastadb/pfam/Pfam_fs  /data/user/seqs/myseq3 > /data/user/out/seq3.out
hmmpfam  /fdb/fastadb/pfam/Pfam_fs  /data/user/seqs/myseq4 > /data/user/out/seq4.out
hmmpfam  /fdb/fastadb/pfam/Pfam_fs  /data/user/seqs/myseq5 > /data/user/out/seq5.out
[....]
------------------------------------------------------------------------------------
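For a large number of query sequences, the command file can be generated rather than typed by hand. The following is a minimal sketch in Bourne shell, assuming each query sequence sits in its own file in one directory. The function name make_swarm is hypothetical, and the output files are named after the input files (myseq1.out rather than seq1.out as in the sample above).

```shell
#!/bin/sh
# Sketch: print one hmmpfam command line per query sequence file.
# make_swarm is a hypothetical helper name, not part of HMMER or swarm.
make_swarm() {
    seqdir=$1    # directory holding one query sequence per file
    outdir=$2    # directory that will receive the hmmpfam output
    db=$3        # profile HMM library, e.g. /fdb/fastadb/pfam/Pfam_fs
    for seq in "$seqdir"/*; do
        [ -f "$seq" ] || continue    # skip subdirectories and unmatched globs
        name=$(basename "$seq")
        printf 'hmmpfam  %s  %s > %s/%s.out\n' "$db" "$seq" "$outdir" "$name"
    done
}

# Example call using the paths from the sample file above:
# make_swarm /data/user/seqs /data/user/out /fdb/fastadb/pfam/Pfam_fs > swarm.cmd
```

The function only writes the command lines; it does not run hmmpfam or create the output directory, so the output directory should exist before the swarm is submitted.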

The HMMER programs hmmcalibrate, hmmsearch, and hmmpfam are set up to use all available CPUs on a node. This swarm job should therefore be submitted so that only a single command runs on each node. Submit with:

swarm -f swarm.cmd -n 1

Searching a sequence database for homologues of a protein family

Another common use of HMMER is to search a sequence database for homologues of a protein family of interest. If you start with a file containing several sequences belonging to the family, you can use this to find remote homologues from a protein database. The following sample batch script will run hmmbuild, hmmcalibrate, and hmmsearch in sequence.

----------- file hmm_homolog  -----------------------------------------
#!/bin/csh
#PBS -N Hmmer
#PBS -m be
#PBS -k oe

cd /data/user/mydir
hmmbuild -g globins.hmm globins.msf 
hmmcalibrate  globins.hmm 
hmmsearch globins.hmm /fdb/fastadb/ecoli.aa.fas
------------------------------------------------------------------------

This script starts with a multiple sequence alignment of a protein domain or protein family in the file globins.msf, which can be created by aligning sequences with ClustalW. The hmmbuild command builds a profile HMM from the alignment, the hmmcalibrate command calibrates the model's E-value statistics, which increases the sensitivity of the search, and the hmmsearch command uses the globin model to search for globin domains in the E. coli database. See the HMMER documentation for more information.

Submit this file with:

qsub -l nodes=1 hmm_homolog
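The same build, calibrate, and search sequence can also be sketched as a Bourne shell function that stops at the first failing step, so hmmsearch never runs against a model that failed to build. This is an illustration only, not the script used above: the function name run_family and the DRYRUN variable are hypothetical, and the model file name is assumed to be the alignment file name with its extension replaced by .hmm.

```shell
#!/bin/sh
# Sketch: build, calibrate and search with one alignment, stopping on error.
# run_family and DRYRUN are hypothetical names introduced for this example.
# Set DRYRUN=echo to print the commands instead of running them.
run_family() {
    aln=$1                  # multiple sequence alignment, e.g. globins.msf
    db=$2                   # sequence database to search
    hmm="${aln%.*}.hmm"     # globins.msf -> globins.hmm
    ${DRYRUN:-} hmmbuild -g "$hmm" "$aln" &&
    ${DRYRUN:-} hmmcalibrate "$hmm" &&
    ${DRYRUN:-} hmmsearch "$hmm" "$db"
}

# Example call with the files from the batch script above:
# run_family globins.msf /fdb/fastadb/ecoli.aa.fas > globins.search.out
```

Running it once with DRYRUN=echo is a quick way to check the generated commands before submitting a batch job that calls the function.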

More information

The entire HMMER suite of programs is available in /usr/local/hmmer. Note that only hmmcalibrate, hmmsearch and hmmpfam are parallelized.

A large collection of protein sequence databases is in /fdb/fastadb/.
Fasta-format databases and update status.