HIV sequence database

HIV Sequence Alignments

Web Alignments provide nucleotide and protein alignments that represent the full spectrum of HIV and SIV sequences in the database.
Subtype Reference Alignments contain approximately 4 representatives of each subtype, and are useful for classifying new sequences.
Consensus/Ancestral Sequences include a consensus for each subtype, an M-group consensus-of-consensuses, and some ancestral sequences.

Before use, please read the additional information below.

Web alignments

What sequences are included

The alignments presented on the web differ from the ones printed in the Compendia. The whole-genome alignments are complete, meaning that they draw on all genome sequences we have in the database, except that very similar sequences (e.g. multiple clones from one isolate, multiple sequences from one person) have been removed. This selection was made on the basis of phylogenetic trees: from each tight cluster of sequences, one representative was retained and the others were removed. An exception has been made for HXB2/LAI, which were kept despite their similarity because they are important lab strains that are frequently used in experiments.

How the sequences are aligned

These alignments were generated by iterating between automated alignment using HMMER and manual editing using MASE, BioEdit or Se-Al. Most gaps have been introduced in multiples of 3 bases to maintain open reading frames. Any alignment is a compromise between optimal alignment, readability, and an attempt to keep codons intact. The 'Other SIV' sequences in particular are difficult to align, so please treat those alignments as a starting point for your analyses.
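Because most, but not all, gaps fall in multiples of 3 bases, it can be useful to locate the exceptions before translating an aligned sequence. The following is a minimal sketch, not part of the database tools; the function name `non_codon_gap_runs` and the use of `-` as the gap character are assumptions made for illustration.

```python
import re


def non_codon_gap_runs(aligned_seq, gap_char="-"):
    """Return (start, length) for each gap run whose length is not a multiple of 3.

    `aligned_seq` is one row of a nucleotide alignment; `start` is 0-based.
    """
    pattern = re.escape(gap_char) + "+"
    return [
        (m.start(), len(m.group()))
        for m in re.finditer(pattern, aligned_seq)
        if len(m.group()) % 3 != 0
    ]


# Hypothetical aligned sequence with one single-base gap at position 9.
example = "ATG---GCA-TTT------TAA"
print(non_codon_gap_runs(example))
```

A run reported here is not necessarily an error: it may reflect a genuine frameshift or a region where keeping codons intact was not possible.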

Further details

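A codon containing an IUPAC multistate (ambiguity) character is translated to an amino acid when the ambiguity is silent, that is, when every nucleotide the character could stand for yields the same amino acid. Otherwise the codon is translated to 'X'.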
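The rule for multistate characters can be sketched as follows: expand the codon into every unambiguous codon it could represent, translate each one, and report an amino acid only if all translations agree. This is an illustrative sketch using the standard genetic code, not the code used to build the alignments; `translate_codon` is a hypothetical name, and gapped codons are simply returned as 'X' here, which is an assumption.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_codon(codon):
    """Translate one codon; ambiguity is tolerated only when it is silent."""
    codon = codon.upper().replace("U", "T")
    if len(codon) != 3 or any(base not in IUPAC for base in codon):
        return "X"  # assumption: gaps and unknown symbols are not translated
    possible = {
        CODON_TABLE["".join(bases)]
        for bases in product(*(IUPAC[base] for base in codon))
    }
    return possible.pop() if len(possible) == 1 else "X"


print(translate_codon("GGN"))  # all four GGx codons encode glycine
print(translate_codon("ATR"))  # ATA is Ile, ATG is Met, so the result is X
```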

The protein alignments provided for each gene were constructed using both nucleotide and translated amino acid sequences. Because the translations are based on alignments, they may differ from a straight, non-aligned, translation. For instance, an aligned translation will include frameshift compensation.

For all genome and single-gene DNA alignments, we have tried to keep the reading frame intact. This is not always possible, so be cautious when translating the aligned nucleotide sequences.

Sequences that are known to be recombinants are usually labeled as such, even if they are not recombinant in the region under consideration.

Relevant links

How the HIV Database Classifies Sequence Subtypes

Subtype Reference Alignments

What sequences are included

For each subtype, 4 genomes were selected as being broadly representative of that subtype. A hypertext and printable table of the sequences included in the 2005 Subtype Reference Set is provided. A paper describing the criteria used in selecting the 2005 reference sequences is available online as an HTML or PDF file.

How the sequences are named

The subtype, country, and year of isolation are given if they are defined in our database. If the sequence name is undefined, the GenBank accession number appears instead.

Relevant links

2005 Subtype Reference Sequences, a table listing all sequences in the set
Leitner et al. 2005, a compendium review article describing the 2005 subtype reference set
Information about HIV and SIV subtype nomenclature
How the HIV Database Classifies Sequence Subtypes
Information about CRFs

Consensus/Ancestral Sequences

What sequences are included

We provide consensuses for the M group subtypes A (including A, A1, and A2), B, C, D, F (including F1 and F2), and G; the circulating recombinant forms CRF01 and CRF02; and group O. We also provide a Consensus M-group, which is a consensus of the consensus sequences for subtypes A, B, C, D, F, G, and H. Ancestral sequences are also provided; they are based on the Complete Genome M-group Ancestral sequence and its phylogenetic tree. For more details, see the M-group Consensus Construction explanation file.

How the consensus alignments are made

The input alignments are the HIV Sequence Web Alignments. These sequences have undergone additional annotation after retrieval. Specifically, question marks in consensus sequences have been resolved, and glycosylation sites have been aligned. From the input, consensus sequences were built using our Consensus Maker site.

The consensus sequences were calculated according to the default values on the Consensus Maker tools except that they were computed for all subtypes having 3 or more (rather than 4 or more) sequences in the alignment. If a column in a subtype group contained equal numbers of two different letters, we resolved that tie by looking at the same column throughout the M group and using the most common letter as the consensus. An upper case letter in a DNA consensus sequence indicates that the nucleotide is preserved unanimously in that position in all sequences used to make the consensus. In cases of nonunanimity, the most common nucleotide is shown in lowercase. Regions spanned by multiple insertions and deletions are difficult to align; we attempt to anchor alignments in such regions on glycosylation sites, and to preserve the minimal elements that span such regions.
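The case convention and tie-breaking rule described above can be illustrated with a short sketch. It is not the Consensus Maker implementation: the function names are hypothetical, gaps are treated as ordinary characters, and a tie is broken by choosing whichever tied letter is most common in the same M-group column, which is one reading of the rule. The sketch applies to DNA consensuses only.

```python
from collections import Counter


def column_consensus(column, fallback_column=None):
    """Consensus letter for one alignment column.

    Upper case means the column is unanimous; lower case marks the most
    common letter in a variable column. Ties are resolved with
    `fallback_column` (e.g. the same column across the whole M group).
    """
    counts = Counter(c.upper() for c in column)
    top = max(counts.values())
    winners = sorted(c for c, n in counts.items() if n == top)
    if len(winners) > 1 and fallback_column is not None:
        fallback = Counter(c.upper() for c in fallback_column)
        # Stable sort: prefer the tied letter most common in the fallback.
        winners.sort(key=lambda c: -fallback[c])
    best = winners[0]
    return best if len(counts) == 1 else best.lower()


def consensus(subtype_seqs, group_seqs=None, min_seqs=3):
    """Build a consensus for one subtype from equal-length aligned sequences."""
    if len(subtype_seqs) < min_seqs:
        raise ValueError(f"need at least {min_seqs} sequences")
    length = len(subtype_seqs[0])
    all_seqs = list(subtype_seqs) + list(group_seqs or [])
    if any(len(s) != length for s in all_seqs):
        raise ValueError("sequences must be aligned to the same length")
    letters = []
    for i in range(length):
        column = [s[i] for s in subtype_seqs]
        fallback = [s[i] for s in group_seqs] if group_seqs else None
        letters.append(column_consensus(column, fallback))
    return "".join(letters)


# Hypothetical data: columns 3 and 4 are tied within the subtype.
subtype = ["ACGT", "ACGA", "ACTA", "ACTT"]
m_group = subtype + ["ACTA", "ACTA"]
print(consensus(subtype, m_group))
```

With these inputs the first two columns are unanimous and stay upper case, while the two tied columns are settled by the larger group and shown in lower case.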

How the ancestral sequences are derived

The ancestral tree and sequences were built as described in the Ancestral Tree Construction explanation file.

Interpreting the format of consensus sequences

An upper case letter in a DNA consensus sequence indicates that the nucleotide is preserved in that position in all sequences used to make the consensus. A lower case letter is the most common nucleotide in a variable position.

Protein sequences are always shown in upper case letters.

The number of sequences used to make the consensus is indicated in parentheses following the subtype designation.

Reagent development

Consensus and Ancestral sequences from this web page are suitable for reagent development because they do not contain question marks or ambiguous characters.

Relevant links

Consensus Maker Tools allow you to build a consensus from your own alignment according to your preferences.
Consensus Maker Explanation page shows the output format options.
Ancestral Tree Construction explanation file.
M-group Consensus Construction explanation file.

last modified: Mon Jul 14 17:04 2008

