*“2008” alignments contain sequences published before 2009. The compendia are published the year after the sequences were published, so the 2008 alignment appears in the 2009 compendium.
These alignments are complete, meaning that they contain all sequences we have in the database, with some exceptions:
These alignments were generated by iterating between automated alignment using HMMER and manual editing using MASE, BioEdit or Se-Al. Most gaps have been introduced in multiples of 3 bases to maintain open reading frames. Any alignment is a compromise between optimal alignment, readability, and an attempt to keep codons intact. In particular, the 'Other SIV' sequences are difficult to align, so please consider these alignments as a starting point for your analyses.
Codons containing IUPAC multistate characters involved in silent substitutions are translated to the correct amino acids; when this is not possible, they are translated to 'X'.
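The translation rule above can be illustrated with a minimal Python sketch. It is not the database's own code: the function names translate_codon and translate_aligned are hypothetical, the standard genetic code is assumed, and translating an all-gap codon to '-' is an assumption made for illustration. A codon with IUPAC multistate characters is expanded into every codon it could represent; if all of them encode the same amino acid (a silent substitution), that amino acid is returned, otherwise 'X'.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the bases each one can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_codon(codon):
    """Translate one codon, resolving IUPAC codes only when they are silent."""
    codon = codon.upper().replace("U", "T")
    if codon == "---":
        return "-"  # assumption: an all-gap codon stays a gap
    if len(codon) != 3 or any(base not in IUPAC for base in codon):
        return "X"  # partial gaps or unknown symbols cannot be resolved
    # Collect the amino acids encoded by every possible expansion.
    amino_acids = {
        CODON_TABLE["".join(bases)]
        for bases in product(*(IUPAC[base] for base in codon))
    }
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


def translate_aligned(seq):
    """Translate an in-frame aligned sequence codon by codon."""
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(translate_codon(seq[i:i + 3]) for i in range(0, usable, 3))


print(translate_codon("GGN"))  # all four expansions encode glycine: G
print(translate_codon("ATR"))  # ATA is Ile, ATG is Met: X
```

For example, 'CTR' expands to CTA and CTG, both leucine, so it translates to 'L', while 'ATR' covers isoleucine and methionine and therefore becomes 'X'.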
The protein alignments provided for each gene were constructed using both nucleotide and translated amino acid sequences. Because the translations are based on alignments, they may differ from a straight, non-aligned, translation. For instance, an aligned translation will include frameshift compensation.
For all genome and single-gene DNA alignments, we have tried to keep the reading frame intact. However, this doesn't always work out. Be cautious when translating the aligned nucleotide sequences.
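Before translating an aligned nucleotide sequence, it can help to locate gap runs that are not a multiple of 3 bases long, since these are the places where the reading frame may shift. The following Python sketch shows one way to do that; the function name off_frame_gaps and the use of '-' as the gap character are assumptions for illustration.

```python
import re


def off_frame_gaps(aligned_seq):
    """Return (start, length) for each gap run whose length is not a multiple of 3."""
    return [
        (match.start(), len(match.group()))
        for match in re.finditer(r"-+", aligned_seq)
        if len(match.group()) % 3 != 0
    ]


# The 3-base gap keeps the frame; the 1-base and 2-base gaps do not.
print(off_frame_gaps("ATG---CCC-GGG--TAA"))  # [(9, 1), (13, 2)]
```

An empty result does not guarantee a correct translation, but any reported position deserves a manual check before the sequence is translated.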
Sequences that are known to be recombinants are usually labeled as such, even if they are not recombinant in the region under consideration.
Codes and Symbols in Sequence Alignments
How the HIV Database Classifies Sequence Subtypes
We have added a new category of curated web alignments, called "filtered alignments". These alignments contain fewer sequences than the original web alignments: sequences with large insertions, a high content of ambiguity codes (>1%), or multiple frameshifts have been removed. The resulting alignments are cleaner but contain less information.
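As an illustration of the ambiguity-code criterion only, here is a minimal Python sketch. It is not the procedure used to build the filtered alignments: the function names are hypothetical, the 1% threshold is measured against the ungapped sequence length (an assumption), and the checks for large insertions and frameshifts are not covered.

```python
def ambiguity_fraction(aligned_seq):
    """Fraction of non-gap characters that are not plain A, C, G, T or U."""
    residues = [ch for ch in aligned_seq.upper() if ch not in "-."]
    if not residues:
        return 0.0
    ambiguous = sum(1 for ch in residues if ch not in "ACGTU")
    return ambiguous / len(residues)


def filter_by_ambiguity(alignment, max_fraction=0.01):
    """Keep sequences whose ambiguity content does not exceed max_fraction.

    `alignment` is a dict mapping sequence names to aligned sequences.
    """
    return {
        name: seq
        for name, seq in alignment.items()
        if ambiguity_fraction(seq) <= max_fraction
    }


example = {
    "clean": "ACGT" * 50,            # no ambiguity codes
    "one_n": "ACGT" * 49 + "ACGN",   # 1 of 200 bases ambiguous (0.5%)
    "noisy": "ACGTN" * 20,           # 20 of 100 bases ambiguous (20%)
}
print(sorted(filter_by_ambiguity(example)))  # ['clean', 'one_n']
```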
For each subtype, 4 genomes were selected as being broadly representative of that subtype. A paper describing the criteria used in selecting the 2005 reference sequences is available online as an HTML or PDF file.
The subtype, country, and year of isolation are given if they are defined in our database. If the sequence name is undefined, the GenBank accession number appears instead.
Leitner et al. 2005, a compendium review article describing the 2005 subtype reference set
Information about HIV and SIV subtype nomenclature
Information about CRFs
The compendium alignments comprise a re-aligned selection from the web alignments. Because they need to fit in a limited space, this set is limited to ~200 sequences. We try to include newer sequences in this set, in addition to the subtype reference sequences.
We provide consensus sequences for the M group subtypes A (including A, A1, and A2), B, C, D, F (including F1 and F2), and G; the circulating recombinant forms CRF01 and CRF02; and group O. We also provide a Consensus M-group, which is a consensus of the consensus sequences for subtypes A, B, C, D, F, G, and H. Ancestral sequences are also provided; they are based on the Complete Genome M-group Ancestral sequence and its phylogenetic tree. For more details, see the M-group Consensus Construction explanation file.
The input alignments are the HIV Sequence Web Alignments. These sequences have undergone additional annotation after retrieval. Specifically, question marks in consensus sequences have been resolved, and glycosylation sites have been aligned. From the input, consensus sequences were built using our Consensus Maker site.
The consensus sequences were calculated according to the default values on the Consensus Maker tools except that they were computed for all subtypes having 3 or more (rather than 4 or more) sequences in the alignment. If a column in a subtype group contained equal numbers of two different letters, we resolved that tie by looking at the same column throughout the M group and using the most common letter as the consensus. An upper case letter in a DNA consensus sequence indicates that the nucleotide is preserved unanimously in that position in all sequences used to make the consensus. In cases of nonunanimity, the most common nucleotide is shown in lowercase. Regions spanned by multiple insertions and deletions are difficult to align; we attempt to anchor alignments in such regions on glycosylation sites, and to preserve the minimal elements that span such regions.
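The column rule described above (most common letter, upper case when unanimous, lower case otherwise, ties resolved by the M group) can be sketched in Python as follows. This is not the Consensus Maker implementation: the function names are hypothetical, and restricting the tie-break to the tied letters is one reading of the description, labeled here as an assumption. Gap handling and the other Consensus Maker defaults are not modeled.

```python
from collections import Counter


def consensus_column(subtype_column, m_group_column):
    """Consensus letter for one alignment column of a subtype."""
    counts = Counter(ch.upper() for ch in subtype_column)
    top = max(counts.values())
    leaders = [ch for ch, n in counts.items() if n == top]
    if len(leaders) == 1:
        winner = leaders[0]
    else:
        # Tie: prefer the tied letter that is most common across the M group
        # (assumption: only the tied letters are candidates).
        m_counts = Counter(ch.upper() for ch in m_group_column)
        winner = max(leaders, key=lambda ch: m_counts[ch])
    unanimous = len(counts) == 1
    return winner.upper() if unanimous else winner.lower()


def build_consensus(subtype_seqs, m_group_seqs, min_sequences=3):
    """Build a DNA consensus for one subtype from equal-length aligned sequences."""
    if len(subtype_seqs) < min_sequences:
        raise ValueError("too few sequences to build a consensus")
    columns = zip(zip(*subtype_seqs), zip(*m_group_seqs))
    return "".join(consensus_column(sub, m) for sub, m in columns)


subtype = ["ACGT", "ACGA", "ACTA", "ACTT"]
m_group = subtype + ["ACTA", "ACTA", "ACTA"]
# Columns 1 and 2 are unanimous; columns 3 and 4 are tied within the subtype
# and resolved by the M-group counts.
print(build_consensus(subtype, m_group))  # ACta
```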
The ancestral tree and sequences were built as described in the Ancestral Tree Construction explanation file.
An upper case letter in a DNA consensus sequence indicates that the nucleotide is preserved in that position in all sequences used to make the consensus. A lower case letter is the most common nucleotide in a variable position.
Protein sequences are always shown in upper case letters.
The number of sequences used to make the consensus is indicated in parentheses following the subtype designation.
Consensus and Ancestral sequences from this web page are suitable for reagent development because they do not contain question marks or ambiguous characters.
Consensus Maker Tools allow you to build a consensus from your own alignment according to your preferences.
Consensus Maker Explanation page shows the output format options.
Ancestral Tree Construction explanation file.
M-group Consensus Construction explanation file.