Biowulf at the NIH
HADDOCK on Biowulf


Determining the structure of protein-protein complexes is a tedious and lengthy process, by both NMR and X-ray crystallography. Several docking-based methods for studying protein complexes have been developed over the past few years. Most of these approaches, however, are not driven by experimental data but are based on a combination of energetics and shape complementarity. HADDOCK (High Ambiguity Driven protein-protein DOCKing) is an approach that makes use of biochemical and/or biophysical interaction data, such as chemical shift perturbation data from NMR titration experiments or mutagenesis data.

Author: Alexandre Bonvin


How to use

Read the HADDOCK instructions carefully (particularly the Tutorial), generate the Ambiguous Interaction Restraints (AIR) file, set up a new project, and create the new.html file.

The following script generates a run.cns file from a new.html file. The run.cns file will appear in a directory run#, where # is the RUN_NUMBER from new.html.

haddock_setup.csh

#!/bin/csh
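# set environment variables
setenv HADDOCK /usr/local/haddock
source $HADDOCK/haddock_configure.csh
# read the run number from new.html so the run directory can be reported
set runnum = `grep -P -o "^RUN_NUMBER=\d+" new.html | cut -f2 -d=`
echo "run.cns will be written to run$runnum"
# generate run$runnum/run.cns from new.html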
haddock2.0
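
To make the run directory naming concrete, here is a minimal Python sketch of what the grep/cut pipeline in haddock_setup.csh extracts. The function name run_directory and the sample new.html lines are assumptions made for illustration; a real new.html contains many more parameters. Unlike grep -o, which prints every matching line, this sketch uses only the first match.

```python
import re


def run_directory(new_html_text):
    """Return 'run<N>' from the first line that starts with RUN_NUMBER=<digits>."""
    match = re.search(r"^RUN_NUMBER=(\d+)", new_html_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no RUN_NUMBER=<digits> line found")
    return "run" + match.group(1)


# Assumed sample content, not copied from a real new.html
sample = "HADDOCK_DIR=/usr/local/haddock<BR>\nRUN_NUMBER=1<BR>\n"
print(run_directory(sample))
```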

run.cns can be edited using a web interface. Go to the HADDOCK main page and click on 'Project setup'. Upload run.cns, edit, and save to your local disk. Replace the original run.cns.

To parallelize the HADDOCK run properly on Biowulf, the run.cns file must be edited dynamically after nodes have been assigned to the job (unless a multinode job is run interactively, which is strongly discouraged).

The following PBS script inserts the node hostnames and the number of CPUs per node into the run.cns file, then starts the parallelized HADDOCK run.

haddock.csh

#!/bin/csh
#PBS -N haddock
#PBS -o haddock.out
#PBS -e haddock.err
cd $PBS_O_WORKDIR
# set environment variables
setenv HADDOCK /usr/local/haddock
source $HADDOCK/haddock_configure.csh
source /usr/local/cns/cns_solve_env
set cns=$CNS_INST/bin/cns
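# keep a pristine copy of run.cns the first time the job runs
if (! -e run.cns-orig) then
   cp run.cns run.cns-orig
endif
# substitute node hostnames, the CNS executable, and CPU counts in run.cns
set inum=1
foreach i (`uniq $PBS_NODEFILE`)
   # queue_N: command used to launch jobs on node N
   sed -i -e "s#queue_$inum="\""#queue_$inum="\""rsh $i csh#g" run.cns
   # cns_exe_N: CNS executable used on node N
   sed -i -e "s#cns_exe_$inum="\""#cns_exe_$inum="\""$cns#g" run.cns
   # cpunumber_N: number of processors reported by node N
   set ncpu=`rsh $i "grep -c ^processor /proc/cpuinfo"`
   sed -i -e "s#cpunumber_$inum=0#cpunumber_$inum=$ncpu#g" run.cns
   @ inum+=1
end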
# now run parallelized HADDOCK
haddock2.0 >& haddock.log
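
To preview what the three sed substitutions in haddock.csh do, the following Python sketch applies the same text replacements to a small run.cns fragment. The function name fill_node_settings, the hostnames, the CNS path, and the exact layout of the sample lines are all assumptions; the sketch only relies on the unedited file containing queue_N="", cns_exe_N="", and cpunumber_N=0, as the sed patterns imply.

```python
def fill_node_settings(run_cns_text, nodes, cns_exe):
    """Mimic the sed edits for each (hostname, ncpu) pair, numbered from 1."""
    text = run_cns_text
    for inum, (host, ncpu) in enumerate(nodes, start=1):
        # queue_N="  ->  queue_N="rsh <host> csh
        text = text.replace(f'queue_{inum}="', f'queue_{inum}="rsh {host} csh')
        # cns_exe_N="  ->  cns_exe_N="<path to cns>
        text = text.replace(f'cns_exe_{inum}="', f'cns_exe_{inum}="{cns_exe}')
        # cpunumber_N=0  ->  cpunumber_N=<ncpu>
        text = text.replace(f'cpunumber_{inum}=0', f'cpunumber_{inum}={ncpu}')
    return text


# Assumed fragment of an unedited run.cns with two node slots
sample = (
    '{===>} queue_1="";\n'
    '{===>} cns_exe_1="";\n'
    '{===>} cpunumber_1=0;\n'
    '{===>} queue_2="";\n'
    '{===>} cns_exe_2="";\n'
    '{===>} cpunumber_2=0;\n'
)

# Hypothetical hostnames, CPU counts, and CNS path
print(fill_node_settings(sample, [("nodeA", 2), ("nodeB", 4)], "/path/to/cns"))
```

Like the sed commands, the replacements prepend to whatever follows the opening quote, so applying them a second time to an already edited file would insert the values twice.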

Submit the haddock.csh PBS script with the qsub command from the run directory (e.g., run1) where the run.cns file was created:

qsub -l nodes=N haddock.csh

where N is the number of nodes desired.

More information and finer details

HADDOCK is not perfectly parallelized, and is certainly not optimized for running under PBS Pro. It is most efficient to make the number of structures a multiple of the number of CPUs used. For example, a run with 1000 structures in iteration 0 and 200 in iteration 1 is best run with 20, 40, or 50 CPUs. This maximizes the number of active CPUs during the run and minimizes the total time for the job to complete.
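
The choice of CPU count can be checked with a few lines of Python. This is a minimal sketch, not part of HADDOCK: the function name good_cpu_counts and the upper limit of 64 CPUs are assumptions chosen for illustration. It lists every CPU count that divides the structure count of each iteration evenly.

```python
def good_cpu_counts(structure_counts, max_cpus):
    """Return CPU counts (1..max_cpus) that evenly divide every structure count."""
    return [
        n
        for n in range(1, max_cpus + 1)
        if all(count % n == 0 for count in structure_counts)
    ]


# 1000 structures in iteration 0 and 200 in iteration 1, at most 64 CPUs (assumed limit)
print(good_cpu_counts([1000, 200], 64))
# prints [1, 2, 4, 5, 8, 10, 20, 25, 40, 50]
```

The values 20, 40, and 50 from the example above all appear in the list. Smaller counts also divide evenly but leave the job running longer.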

Also, there are still bugs in version 2.0. Sometimes HADDOCK will fail unexpectedly or stall partway through a run. It is best to keep an eye on the job to make sure it doesn't go haywire.