Biowulf at the NIH
VCFtools on Biowulf

VCFtools contains a Perl API (Vcf.pm) and a number of Perl scripts that can be used to perform common tasks with VCF files, such as file validation, file merging, intersecting, and computing complements. The Perl tools support all versions of the VCF specification (3.2, 3.3, and 4.0); nevertheless, users are encouraged to use the latest version, VCFv4.0. VCFtools has mainly been used with diploid data, but the Perl tools aim to support polyploid data as well.
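
For illustration, a few of the Perl tools can be run as in the hypothetical commands below (A.vcf.gz and B.vcf.gz are placeholder file names; the Perl tools expect bgzip-compressed, tabix-indexed input):

vcf-validator A.vcf.gz                                     # check a file against the VCF specification
vcf-merge A.vcf.gz B.vcf.gz | bgzip -c > merged.vcf.gz     # merge two files
vcf-isec -n +2 A.vcf.gz B.vcf.gz | bgzip -c > isec.vcf.gz  # keep positions present in both files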

VCFtools is maintained and developed by Adam Auton, Petr Danecek, and collaborators. See the VCFtools paper for details.

Program Location

/usr/local/vcftools/bin

Please note: tabix and bgzip are located in the same directory.
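
The Perl tools generally expect their VCF input to be bgzip-compressed and tabix-indexed. As a minimal sketch (file.vcf is a placeholder name), a file can be prepared with:

bgzip file.vcf              # produces file.vcf.gz
tabix -p vcf file.vcf.gz    # produces the index file.vcf.gz.tbi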

It is important that the paths for VCFtools be set up correctly. This can be done by typing 'module load vcftools', as in the examples below.
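
For example, after loading the module you can verify that the paths are set (the exact output will depend on the installed version):

module load vcftools
which vcf-compare      # expected to report /usr/local/vcftools/bin/vcf-compare
echo $PERL5LIB         # expected to include the directory containing Vcf.pm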

Submitting a single VCFtools batch job

(More examples of VCFtools command lines)

1. Create a batch script similar to the one below, modifying the paths before running. Remember that the $PERL5LIB and $PATH environment variables must be set correctly; the 'module load vcftools' line takes care of this.

#!/bin/bash
# This file is vcftools
#
#PBS -N vcftools
#PBS -m be
#PBS -k oe

# set up $PATH and $PERL5LIB for the VCFtools programs and Perl modules
module load vcftools

# run the comparison in the directory containing the input files
cd /data/user/somewhereWithInputfile
vcf-compare inputFile1 inputFile2

2. Submit the script using the 'qsub' command on Biowulf.

qsub -l nodes=1:g4 /data/username/theScriptFileAbove

This submits the job to a node with 4 GB of memory ('g4' in the command above). Use 'freen' to see the available node types.

Submitting a swarm of vcftools jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

The command 'module load vcftools' can be included in the swarm command file, as in the example below. Alternatively, it can be added to your .bashrc or .cshrc file, in which case it does not need to appear in the swarm command file.
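
For example, bash users would add this single line to ~/.bashrc (tcsh users add the equivalent to ~/.cshrc):

module load vcftools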

Set up a swarm command file (eg /data/username/cmdfile). Here is a sample file:


module load vcftools; cd /data/user/somewhereWithInputfile1; vcf-compare inputFile1 inputFile2
module load vcftools; cd /data/user/somewhereWithInputfile2; vcf-compare inputFile1 inputFile2
module load vcftools; cd /data/user/somewhereWithInputfile3; vcf-compare inputFile1 inputFile2
[....etc....]
module load vcftools; cd /data/user/somewhereWithInputfile15; vcf-compare inputFile1 inputFile2

Submit this swarm with:

swarm -f cmdfile

By default, each command line above is executed on one processor core of a node and can use up to 1 GB of memory. If each VCFtools command requires more than 1 GB of memory, specify the memory required by using the '-g #' flag to swarm, where # is the number of gigabytes of memory required.

For example, if each of the vcftools commands in the swarm command file above requires 10 GB of memory, then you will need to submit the swarm job with:

biowulf% swarm -g 10 -f cmdfile

For more information about running swarm, see swarm.html

Running an interactive VCFtools job

Users may occasionally need to run jobs interactively. Such jobs should not be run on the Biowulf login node; instead, allocate an interactive node as described below and run the interactive job there.

biowulf% qsub -I -l nodes=1 
qsub: waiting for job 2236960.biobos to start 
qsub: job 2236960.biobos ready

[user@p4]$ cd /data/user/myruns 
[user@p4]$ module load vcftools
[user@p4]$ cd /data/userID/vcftools/run1 
[user@p4]$ vcf-compare file1.gz file2.gz 
[user@p4]$ .......... 
[user@p4]$ exit 
qsub: job 2236960.biobos completed 
[user@biowulf ~]$ 

If you want a specific type of node, you can specify that on the qsub command line. For example, to request a node with 24 GB of memory, use

biowulf% qsub -I -l nodes=1:g24

Documentation

http://vcftools.sourceforge.net/docs.html