TagZilla

TagZilla is a computer program that selects tag SNPs based on pairwise linkage disequilibrium while taking into account the practical needs and constraints faced by investigators.

TagZilla is implemented in the Python programming language, with a very small proportion of optional accelerator code in C. It runs on Windows, Unix/Linux, and Apple operating systems, and has been released as an open source application, free for both non-profit and commercial use.

notice

TagZilla version 1.1 will be released at the end of next week. It will contain numerous updates, including several corrections and extensions to the multi-population tagging capabilities.

background

Over 7 million common single nucleotide polymorphisms (SNPs), defined as those with minor allele frequency (MAF) > 5%, are thought to be present in major human populations, and that number is likely to be higher in African populations. However, under simple genetic models, it has been shown that a genome-wide search for association need not genotype every SNP, provided one is willing to accept a modest decrease in the power of the study. Indeed, knowledge of linkage disequilibrium (LD) between SNPs located in the same chromosomal region enables the grouping of SNPs into "bins" and the selection of a subset, called tag SNPs, which when genotyped will capture most of the available information. Carlson et al. (2004) proposed an algorithm that uses pairwise correlation between SNPs to efficiently select tag SNPs that provide information robust to genotyping errors and missing data.
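
The binning idea can be illustrated with a short Python sketch. It is a simplified greedy procedure in the spirit of the pairwise approach described above, not TagZilla's actual implementation: the function name greedy_bins, the SNP identifiers, and the r² values are all hypothetical, and each bin is given a single tag for brevity.

```python
def greedy_bins(snps, r2, threshold=0.8):
    """Simplified greedy pairwise binning (illustrative only).

    snps      -- list of SNP identifiers
    r2        -- dict mapping frozenset({a, b}) to the pairwise r^2 of a and b
    threshold -- minimum r^2 for one SNP to capture another
    Returns a list of (tag, members) tuples, one per bin.
    """
    def linked(a, b):
        # A SNP always captures itself; missing pairs are treated as r^2 = 0.
        return a == b or r2.get(frozenset((a, b)), 0.0) >= threshold

    remaining = list(snps)
    bins = []
    while remaining:
        # Pick the SNP that captures the most SNPs still unassigned.
        # max() returns the first candidate in case of a tie.
        tag = max(remaining, key=lambda s: sum(linked(s, t) for t in remaining))
        members = [t for t in remaining if linked(tag, t)]
        bins.append((tag, members))
        remaining = [t for t in remaining if t not in members]
    return bins


# Hypothetical data: rs2 is strongly correlated with both rs1 and rs3.
example_r2 = {
    frozenset(("rs1", "rs2")): 0.90,
    frozenset(("rs2", "rs3")): 0.85,
    frozenset(("rs1", "rs3")): 0.50,
    frozenset(("rs3", "rs4")): 0.20,
}
bins = greedy_bins(["rs1", "rs2", "rs3", "rs4"], example_r2)
for tag, members in bins:
    print(tag, members)
```

In this toy case rs2 tags a bin of three SNPs and rs4 forms a singleton bin, so genotyping two SNPs captures all four at the chosen threshold.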

features

  • Integrated LD estimation capability for unrelated individuals and founders within families, including support for sex-linked loci and hemizygous genotypes.
  • Incorporation of flexible assay design and SNP quality scores for optimal tag selection.
  • A high-performance implementation of a variant of the pairwise greedy algorithm suggested by Carlson et al., suitable for whole genome tag SNP selection.
  • Rapid and efficient multi-population tagging to produce panels that cover more than a single population, with the ability to assign independent thresholds for MAF, LD, and other parameters by population.
  • Creation of fixed-size or fixed-coverage panels, based on the specification of targets for maximum number of allowed tag SNPs or polymorphic loci to be covered.
  • Evaluation of coverage by computing the proportion of bins and loci covered.
  • Efficient augmentation of tag SNP sets to minimize information loss due to missing genotype data.
  • Ability to efficiently force selected SNPs to be included in or excluded from the set of tag SNPs.
  • Supports many input file formats for genotype or LD data, including
    • HapMap genotypes and pedigrees
    • Linkage genotypes, locus files, and pedigrees
    • ldSelect/PrettyBase genotypes with Linkage pedigree files
    • FESTA LD input
  • Produces informative program outputs that show the final disposition of each locus, numbers of SNPs captured by each bin size, average bin widths, and coverage summaries.
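
The pairwise LD statistic that underlies the features above can be sketched as follows. This is a minimal illustration that assumes phased, biallelic haplotypes coded 0/1 are already available, which is a simplification: estimating LD from unphased genotypes, sex-linked loci, or family data requires additional steps not shown here. The function name pairwise_r2 is hypothetical and is not part of TagZilla.

```python
def pairwise_r2(haplotypes):
    """Compute r^2 between two biallelic loci from phased haplotypes.

    haplotypes -- list of (allele_at_locus_1, allele_at_locus_2) pairs,
                  with alleles coded as 0 or 1
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # frequency of allele 1 at locus 1
    p_b = sum(b for _, b in haplotypes) / n           # frequency of allele 1 at locus 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # At least one locus is monomorphic, so r^2 is undefined;
        # returning 0.0 is a convention chosen for this sketch.
        return 0.0
    d = p_ab - p_a * p_b                               # disequilibrium coefficient D
    return d * d / denom


perfect = pairwise_r2([(1, 1), (1, 1), (0, 0), (0, 0)])      # alleles always co-occur
independent = pairwise_r2([(1, 1), (1, 0), (0, 1), (0, 0)])  # no association
print(perfect, independent)
```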

example

TagZilla is able to analyze the entire HapMap build 20 CEU data in under 6 hours of computation time on a fast desktop PC, using default settings and without splitting any data files. For example, within a single analysis it is possible to identify a near-optimal fixed-size panel of multi-population tag SNPs that simultaneously captures the HapMap build 20 CEU loci with an r² threshold of 0.8 and MAF threshold of 5%, and the loci of the HapMap JPT, CHB, and YRI populations with an r² threshold of 0.7 and MAF threshold of 10%.
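
To make the role of per-population thresholds concrete, the sketch below evaluates how well a candidate tag set covers each population under its own r² and MAF cutoffs. It is an illustration only: the function name coverage, the data layout, and all numbers are hypothetical, loci at exactly the MAF cutoff are assumed to be eligible, and nothing here reflects TagZilla's input formats or internals.

```python
def coverage(tags, maf, r2, r2_min, maf_min):
    """Fraction of eligible loci in one population captured by the tag set.

    tags    -- list of tag SNP identifiers
    maf     -- dict mapping SNP identifier to minor allele frequency
    r2      -- dict mapping frozenset({a, b}) to pairwise r^2
    r2_min  -- minimum r^2 for a tag to capture a locus
    maf_min -- loci with MAF below this value are ignored (assumed inclusive cutoff)
    """
    eligible = [s for s, f in maf.items() if f >= maf_min]
    if not eligible:
        return 0.0

    def captured(s):
        return any(t == s or r2.get(frozenset((s, t)), 0.0) >= r2_min for t in tags)

    return sum(captured(s) for s in eligible) / len(eligible)


# Hypothetical per-population thresholds: (r2_min, maf_min).
thresholds = {"CEU": (0.8, 0.05), "YRI": (0.7, 0.10)}
maf = {
    "CEU": {"rs1": 0.20, "rs2": 0.30, "rs3": 0.03},
    "YRI": {"rs1": 0.15, "rs2": 0.25, "rs3": 0.40},
}
r2 = {
    "CEU": {frozenset(("rs1", "rs2")): 0.90},
    "YRI": {frozenset(("rs1", "rs2")): 0.75, frozenset(("rs1", "rs3")): 0.30},
}

result = {
    pop: coverage(["rs1"], maf[pop], r2[pop], r2_min, maf_min)
    for pop, (r2_min, maf_min) in thresholds.items()
}
print(result)
```

With these made-up values, the single tag rs1 covers both eligible CEU loci but only two of the three eligible YRI loci, which is the kind of gap a multi-population panel has to close by adding tags.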