HIV sequence database



Sequence Quality Control

"No matter what drug we give, our sequences always are 25% wildtype."
- Quote from the head of a reference laboratory.

Some HIV researchers are convinced that careful lab work is enough to prevent contamination. We disagree. Contamination happens, even in the best laboratories. Screening for contamination should be done before the analysis of the sequences, and periodically during the course of large sequencing studies, so problems can be detected and corrected early.

To show what contamination looks like in practice, we have collected some examples of (mostly published) datasets where contamination is a problem, and included some references that discuss contamination.

Following the steps below will help you check your sequences. They are no substitute for common sense and precautions, but they may help spot contamination in your sequences. We have created interactive pages where you can build a tree and do a BLAST search with your sequences. If you work with sequences from very conserved regions (such as protease or RT), see step 5 for more tips on identifying problem sequences.


1. Create a phylogenetic tree that includes all the sequences in the study.

A phylogenetic tree clarifies the relationships among the sequences. If you have lab strain contamination or a sample mix-up between two patients, a phylogenetic tree will likely show it. Once your sequences are aligned, generate a simple neighbor-joining tree to check for potential problems.

Common signs of trouble are sequences from different patients that cluster together, and patient sequences that cluster with a lab strain.
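Before or alongside building the tree, the same aligned sequences can be screened with a simple pairwise distance check. The sketch below is not a neighbor-joining implementation; it only flags pairs of sequences from different patients that are closer than a chosen cutoff. The function names, the sequence labels, and the 1% threshold are illustrative assumptions, and a sensible cutoff depends on the gene or region being sequenced.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing sites, skipping columns with a gap or N in either sequence."""
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            diffs += 1
    return diffs / compared if compared else float("nan")


def suspicious_pairs(aligned, patient_of, threshold=0.01):
    """Return (name1, name2, distance) for inter-patient pairs closer than the threshold."""
    hits = []
    for (n1, s1), (n2, s2) in combinations(aligned.items(), 2):
        if patient_of[n1] == patient_of[n2]:
            continue  # close sequences within one patient are expected
        d = p_distance(s1, s2)
        if d < threshold:  # a NaN distance (no comparable sites) is never flagged
            hits.append((n1, n2, d))
    return hits


# Hypothetical toy alignment: P2_b is identical to a sequence from patient P1.
aligned = {
    "P1_a": "ACGTACGTACGTACGTACGT",
    "P1_b": "ACGTACGTACGAACGTACGT",
    "P2_a": "ACGAACGTTCGTACCTACGA",
    "P2_b": "ACGTACGTACGTACGTACGT",
}
patient_of = {"P1_a": "P1", "P1_b": "P1", "P2_a": "P2", "P2_b": "P2"}

for name1, name2, dist in suspicious_pairs(aligned, patient_of):
    print(f"{name1} and {name2} come from different patients but differ by {dist:.3f}")
```

Any pair reported here deserves a closer look in the tree and the alignment; it is a prompt for inspection, not proof of contamination.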


2. Compare your sequences to all published sequences (BLAST search).

BLAST is a program that finds sequences with very high similarity to the query sequence. If your sequence is very similar to a published strain, especially a lab strain used for in vitro studies, contamination is likely. Even if your sequence is not identical to the lab strain, watch out for in vitro recombination, where only part of the sequence matches the lab strain and the rest is derived from your patient sample. You can compare your sequences against all GenBank entries (GenBank BLAST), which include the very latest sequences, or against the HIV database (click the BLAST button), which can lag a few weeks behind but contains more background information about the sequences.

What counts as 'reasonable similarity' depends on the gene or region (RT sequences are much more similar to one another than V3 sequences) and on the population: a set of clonal sequences from different tissues of one person will be more similar than a set from different persons in a clustered outbreak, which in turn will be more similar than a set from different African countries. We've prepared some basic guidelines that show the varying degrees of conservation among different genes of HIV.
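The in vitro recombination case described above, where only part of a sequence matches a lab strain, can be missed by an overall similarity score. One way to look for it is a sliding-window identity scan of the query against the lab strain, once the two are aligned. The following is a minimal sketch, not part of BLAST or of the database tools; the function names, window size, step, and 98% cutoff are all assumptions chosen for illustration.

```python
def window_identity(query, reference, window=30, step=10):
    """Yield (start, identity) for successive windows of two aligned sequences."""
    length = min(len(query), len(reference))
    for start in range(0, max(length - window, 0) + 1, step):
        q = query[start:start + window].upper()
        r = reference[start:start + window].upper()
        # Ignore columns where either sequence has an alignment gap.
        pairs = [(x, y) for x, y in zip(q, r) if x != "-" and y != "-"]
        if pairs:
            yield start, sum(x == y for x, y in pairs) / len(pairs)


def matching_segments(query, reference, window=30, step=10, min_identity=0.98):
    """Return the start positions of windows that are nearly identical to the reference."""
    return [
        start
        for start, identity in window_identity(query, reference, window, step)
        if identity >= min_identity
    ]


# Hypothetical example: the first half of the query matches the lab strain exactly.
lab_strain = "ACGT" * 15
query = lab_strain[:30] + "T" * 30

print(matching_segments(query, lab_strain))
```

A query that matches the lab strain in some windows but not in others is a candidate recombinant and should be examined by hand. In conserved regions such as RT the cutoff would need to be much stricter, since unrelated sequences are already very similar there.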


3. Look carefully at the alignments, and pay attention to patient signature patterns.

Signature patterns often help to show what is 'typical' and 'atypical' for a patient, and thus help to recognize sequences that don't seem to belong with a patient. The usefulness of signature patterns can be seen in the contamination examples. You can use VESPA to find the patterns, but often a simple alignment is sufficient to spot suspicious sequences. When you have an alignment, you can use SeqPublish to create a formatted version of it that will make it easy to spot problems (see example 2).
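As a rough illustration of the idea behind signature patterns, the sketch below builds a per-patient consensus from an alignment and flags any sequence that agrees better with another patient's consensus than with its own. This is not how VESPA works; it is a simplified stand-in under the assumption that all sequences are already aligned to the same length, and the function names and patient labels are hypothetical.

```python
from collections import Counter


def consensus(seqs):
    """Most common residue in each column of a set of aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def mismatches(seq, cons):
    """Number of positions where a sequence differs from a consensus."""
    return sum(1 for x, y in zip(seq, cons) if x != y)


def atypical_sequences(by_patient):
    """Flag sequences that fit another patient's consensus better than their own.

    Returns a list of (sequence name, labeled patient, better-matching patient).
    """
    cons = {p: consensus(list(seqs.values())) for p, seqs in by_patient.items()}
    flagged = []
    for patient, seqs in by_patient.items():
        others = [p for p in cons if p != patient]
        if not others:
            continue  # nothing to compare against with a single patient
        for name, seq in seqs.items():
            own = mismatches(seq, cons[patient])
            best = min(others, key=lambda p: mismatches(seq, cons[p]))
            if mismatches(seq, cons[best]) < own:
                flagged.append((name, patient, best))
    return flagged


# Hypothetical data: P1_4 is labeled as patient P1 but carries the P2 pattern.
by_patient = {
    "P1": {"P1_1": "ACGTACGTAC", "P1_2": "ACGTACGTAC",
           "P1_3": "ACGTACGTAT", "P1_4": "TCGAACCTAC"},
    "P2": {"P2_1": "TCGAACCTAC", "P2_2": "TCGAACCTAC", "P2_3": "TCGAACCTAG"},
}

print(atypical_sequences(by_patient))
```

Because each consensus includes the sequence being tested, the check is only meaningful when a patient has several sequences; with one or two, visual inspection of the alignment is more reliable.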


4. Keep a background set of sequences that are commonly used in your laboratory for comparison.

BLAST searches can detect problems with common lab strains, but contamination with material that was recently used in your lab may not show up. Aligning suspicious-looking sequences with other sequences that your lab has produced may bring this type of contamination to light.
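A background set can be kept as a simple collection of named, aligned sequences and searched whenever a new sequence looks suspicious. The sketch below reports the closest background entry by counting differing positions. It assumes the query and the background sequences are aligned to the same length, and the entry names are invented for illustration.

```python
def hamming(a, b):
    """Number of differing positions between two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


def closest_background(query, background):
    """Return (name, differences) for the background sequence nearest to the query."""
    return min(
        ((name, hamming(query, seq)) for name, seq in background.items()),
        key=lambda item: item[1],
    )


# Hypothetical background set of material recently handled in the lab.
background = {
    "plasmid_X": "ACGTACGTAC",
    "patient_17": "ACGTTCGTAC",
}

name, differences = closest_background("ACGTTCGTAC", background)
print(f"closest background sequence: {name} ({differences} differences)")
```

A match with zero or very few differences to unrelated material from your own lab is a strong reason to re-amplify from the original sample.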


5. Special case: Quality control in conserved regions.

Even if your region contains very little variation, there are ways to increase confidence in the validity of the sequences. See Contamination in Conserved Regions (article posted July 1998).


If you would like more information or help on checking your data for contamination, please get in touch with us at the address below. We'll be glad to help.

last modified: Tue Apr 22 11:33 2008


Questions or comments? Contact us at seq-info@lanl.gov.