HIV sequence database

HIV Sequence Alignments

Web Alignments provide nucleotide and protein alignments (one per patient) that represent the full spectrum of HIV and SIV sequences in the database.
Filtered Web Alignments contain fewer sequences than the original web alignments: sequences with large insertions, a high content of ambiguity codes (>1%), or multiple frame shifts have been removed, so the resulting alignments are cleaner but contain less information.
Subtype Reference Alignments contain approximately 4 representatives of each subtype, and are useful for classifying new sequences.
Compendium Alignments are the alignments printed in the Compendium.
Consensus/Ancestral Sequences include a consensus for each subtype, an M-group consensus-of-consensuses, and some ancestral sequences. For more details, see the Consensus/Ancestral Sequences section below.

Before use, please read the additional information below.

*“2008” alignments contain sequences published before 2009. The Compendia are published the year after the sequences were published, so the 2008 alignment appears in the 2009 Compendium.

Web Alignments

What sequences are included

These alignments are complete, meaning that they contain all sequences we have in the database, with some exceptions:

Very similar sequences have been deleted. The cut-off for deleting similar sequences has been determined by looking at the distance distribution, and varies by gene.
When multiple sequences come from one patient, only one is kept.
In some cases curation is done after the web alignment is built, so multiple sequences from one patient end up in the alignment. These are removed the following year.
Problematic sequences are removed.
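
The removal of very similar sequences can be illustrated with a minimal sketch. It assumes a simple p-distance (fraction of mismatched sites among columns where neither sequence has a gap) and a greedy pass over the alignment; the function names and the cut-off value are hypothetical, and the database's actual distance measure and per-gene cut-offs are not specified here.

```python
GAP = "-"

def p_distance(a, b):
    """Fraction of differing sites among columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != GAP and y != GAP]
    if not pairs:
        return 1.0  # no comparable columns: treat the pair as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)

def drop_near_duplicates(aligned, cutoff):
    """Keep a sequence only if it is at least `cutoff` away from every sequence kept so far.

    `aligned` maps sequence names to aligned sequences of equal length.
    `cutoff` is a hypothetical, gene-specific threshold chosen by the user.
    """
    kept = {}
    for name, seq in aligned.items():
        if all(p_distance(seq, other) >= cutoff for other in kept.values()):
            kept[name] = seq
    return kept
```

Because the pass is greedy, the result depends on input order: the first member of each cluster of near-identical sequences is the one retained.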

How the sequences are aligned

These alignments were generated by an iterative process that alternates between automated alignment using HMMER and manual editing using MASE, BioEdit or Se-Al. Most gaps have been introduced in multiples of 3 bases to maintain open reading frames. Any alignment is a compromise between optimal alignment, readability, and an attempt to keep codons intact. In particular, the 'Other SIV' sequences are difficult to align, so please consider these alignments as a starting point for your analyses.
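
Since most, but not all, gaps come in multiples of 3, it can be useful to locate the exceptions before a codon-based analysis. The sketch below reports internal gap runs whose length is not a multiple of 3. The function name is hypothetical, the gap character is assumed to be '-', and leading and trailing gaps are skipped on the assumption that they reflect missing sequence rather than indels.

```python
import re

def gap_runs_breaking_frame(aligned_seq, gap="-"):
    """Return (start, length) for each internal gap run whose length is not a multiple of 3.

    `start` is a 0-based column index into the original aligned sequence.
    """
    # Assumption: terminal gaps mark missing data, so only internal runs are checked.
    core = aligned_seq.strip(gap)
    offset = len(aligned_seq) - len(aligned_seq.lstrip(gap))
    bad = []
    for match in re.finditer(re.escape(gap) + "+", core):
        length = match.end() - match.start()
        if length % 3 != 0:
            bad.append((match.start() + offset, length))
    return bad
```

For example, in `--ATG---CCC-GGA--` the three-base gap is accepted, while the single-base gap at column 11 is reported.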

Further details

Codons containing IUPAC multistate characters involved in silent substitutions are translated to the correct amino acids; when this is not possible, they are translated to 'X'.
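
The rule above can be sketched as follows: expand each IUPAC code to the bases it stands for, translate every resulting codon, and report the amino acid only if all expansions agree. This is an illustrative reimplementation using the standard genetic code, not the database's own code; the names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def translate_codon(codon):
    """Translate one codon; return 'X' when its IUPAC expansions disagree."""
    codon = codon.upper()
    if len(codon) != 3 or any(base not in IUPAC for base in codon):
        return "X"  # gaps, partial codons, or unknown symbols
    expansions = product(*(IUPAC[base] for base in codon))
    amino_acids = {CODON_TABLE["".join(c)] for c in expansions}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"
```

Under this sketch, `GGN` translates to G because all four expansions encode glycine, whereas `ATR` translates to X because ATA encodes isoleucine and ATG encodes methionine.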

The protein alignments provided for each gene were constructed using both nucleotide and translated amino acid sequences. Because the translations are based on alignments, they may differ from a straight, non-aligned, translation. For instance, an aligned translation will include frameshift compensation.

For all genome and single-gene DNA alignments, we have tried to keep the reading frame intact. However, this is not always possible, so be cautious when translating the aligned nucleotide sequences.

Sequences that are known to be recombinants are usually labeled as such, even if they are not recombinant in the region under consideration.

Relevant links

Codes and Symbols in Sequence Alignments
How the HIV Database Classifies Sequence Subtypes

Filtered Web Alignments

What sequences are included

We have added a new category of curated web alignments, called "filtered alignments". These alignments contain fewer sequences than the original web alignments: sequences that contain large insertions, a high content of ambiguity codes (>1%), or multiple frame shifts have been removed, so the resulting alignments are cleaner but contain less information.
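
As an illustration of the ambiguity criterion only, the sketch below computes the fraction of IUPAC ambiguity codes in a sequence and compares it with the 1% threshold. It assumes the fraction is taken over non-gap positions, which the text does not state; the function names are hypothetical, and the insertion and frame-shift criteria are not covered.

```python
AMBIGUITY_CODES = set("RYSWKMBDHVN")

def ambiguity_fraction(seq, gap="-"):
    """Fraction of non-gap positions that carry an IUPAC ambiguity code."""
    residues = [c for c in seq.upper() if c != gap]
    if not residues:
        return 0.0
    return sum(c in AMBIGUITY_CODES for c in residues) / len(residues)

def passes_ambiguity_filter(seq, max_fraction=0.01):
    """True if the sequence has no more than `max_fraction` ambiguity codes."""
    return ambiguity_fraction(seq) <= max_fraction
```

A 100-base sequence with one N sits exactly at the threshold and is kept; with two Ns it exceeds 1% and would be dropped under this sketch.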

Subtype Reference Alignments

What sequences are included

For each subtype, 4 genomes were selected as being broadly representative of that subtype. A paper describing the criteria used in selecting the 2005 reference sequences is available online as an HTML or PDF file.

How the sequences are named

The subtype, country, and year of isolation are given if they are defined in our database. If the sequence name is undefined, the GenBank accession number appears instead.

Relevant links

Leitner, et al. 2005, a compendium review article describing the 2005 subtype reference set
Information about HIV and SIV subtype nomenclature
How the HIV Database Classifies Sequence Subtypes
Information about CRFs
Codes and Symbols in Sequence Alignments

Compendium Alignments

The compendium alignments comprise a re-aligned selection from the web alignments. Because they need to fit in a limited space, this set is limited to ~200 sequences. We try to include newer sequences in this set, in addition to the subtype reference sequences.

Consensus/Ancestral Sequences

What sequences are included

We provide consensuses for the M group subtypes A (including A, A1, and A2), B, C, D, F (including F1 and F2), and G; the circulating recombinant forms CRF01 and CRF02; and group O. We also provide a Consensus M-group, which is a consensus of consensus sequences for subtypes A, B, C, D, F, G, H. Ancestral sequences are also provided; they are based on the Complete Genome M-group Ancestral sequence and its phylogenetic tree. For more details, see the M-group Consensus Construction explanation file.

How the consensus alignments are made

The input alignments are the HIV Sequence Web Alignments. These sequences have undergone additional annotation after retrieval. Specifically, question marks in consensus sequences have been resolved, and glycosylation sites have been aligned. From the input, consensus sequences were built using our Consensus Maker site.

The consensus sequences were calculated using the default values of the Consensus Maker tools, except that they were computed for all subtypes having 3 or more (rather than 4 or more) sequences in the alignment. If a column in a subtype group contained equal numbers of two different letters, we resolved the tie by looking at the same column throughout the M group and using the most common letter as the consensus. An upper case letter in a DNA consensus sequence indicates that the nucleotide is conserved in that position in all sequences used to make the consensus. Where the sequences are not unanimous, the most common nucleotide is shown in lower case. Regions spanned by multiple insertions and deletions are difficult to align; we attempt to anchor alignments in such regions on glycosylation sites, and to preserve the minimal elements that span such regions.
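
The column rules described above can be sketched in a few lines. This is not the Consensus Maker implementation: the function names are hypothetical, gaps are counted like any other character, and a tie is resolved by choosing, among the tied letters only, the one most common in the wider M-group column, which is one reading of the rule.

```python
from collections import Counter

def column_consensus(column, wider_column=None):
    """Consensus letter for one column: upper case if unanimous, else lower case."""
    counts = Counter(c.upper() for c in column)
    ranked = counts.most_common()
    top = ranked[0][1]
    tied = [letter for letter, n in ranked if n == top]
    if len(tied) > 1 and wider_column:
        # Assumption: break the tie using the same column across the M group.
        wider = Counter(c.upper() for c in wider_column)
        winner = max(tied, key=lambda letter: wider[letter])
    else:
        winner = tied[0]
    return winner if len(counts) == 1 else winner.lower()

def consensus(subtype_seqs, m_group_seqs=None, min_sequences=3):
    """Build a consensus from aligned sequences of equal length."""
    if len(subtype_seqs) < min_sequences:
        raise ValueError("too few sequences to build a consensus")
    result = []
    for i, column in enumerate(zip(*subtype_seqs)):
        wider = [s[i] for s in m_group_seqs] if m_group_seqs else None
        result.append(column_consensus(column, wider))
    return "".join(result)
```

For the subtype sequences ACGT, ACGA, ATGA, ATGT, columns 2 and 4 are tied; if the M-group set adds two copies of ATGA, the ties resolve to T and A and the consensus is `AtGa`.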

How the ancestral sequences are derived

The ancestral tree and sequences were built as described in the Ancestral Tree Construction explanation file.

Interpreting the format of consensus sequences

An upper case letter in a DNA consensus sequence indicates that the nucleotide is preserved in that position in all sequences used to make the consensus. A lower case letter is the most common nucleotide in a variable position.

Protein sequences are always written in upper case letters.
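
Because case encodes conservation in DNA consensus sequences, variable positions can be read directly from the string. The helper names below are hypothetical, and the sketch applies to DNA consensuses only, since protein consensuses carry no case information.

```python
def variable_positions(dna_consensus):
    """0-based indices of lower-case (non-unanimous) positions."""
    return [i for i, c in enumerate(dna_consensus) if c.islower()]

def conserved_fraction(dna_consensus):
    """Fraction of letter positions (gaps ignored) that are upper case."""
    letters = [c for c in dna_consensus if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)
```

For the consensus `AtGa-C`, positions 1 and 3 are variable, and three of the five letters are conserved.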

The number of sequences used to make the consensus is indicated in parentheses following the subtype designation.

Reagent development

Consensus and Ancestral sequences from this web page are suitable for reagent development because they do not contain question marks or ambiguous characters.

Relevant links

Consensus Maker Tools allow you to build a consensus from your own alignment according to your preferences.
Consensus Maker Explanation page shows the output format options.
Ancestral Tree Construction explanation file.
M-group Consensus Construction explanation file.
Codes and Symbols in Sequence Alignments

last modified: Wed Mar 14 15:10 2012