HIV sequence database



HIV Sequence Alignments

Before use, please read the additional information below.


Options
Alignment type
Year *
Organism
Region: Pre-defined region of the genome
User-defined range: Start, End (Coordinates: HIV1-HXB2, HIV2-Mac239)
Subtype
DNA/Protein
Format



*“2008” alignments contain sequences published before 2009. The Compendia are published the year after the sequences were published, so the 2008 alignment appears in the 2009 compendium.





Web Alignments

What sequences are included

These alignments are complete, meaning that they contain all sequences we have in the database, with some exceptions:

How the sequences are aligned

These alignments were generated by an iterative process that alternates between automated alignment using HMMER and manual editing using MASE, BioEdit, or Se-Al. Most gaps have been introduced in multiples of 3 bases to maintain open reading frames. Any alignment is a compromise between optimal alignment, readability, and an attempt to keep codons intact. In particular, the 'Other SIV' sequences are difficult to align, so please treat these alignments as a starting point for your analyses.
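Because most, but not all, gaps come in multiples of 3 bases, it can be useful to locate the exceptions before translating an aligned sequence. The following is a minimal sketch, not part of the database tools; the function name and the use of '-' as the gap character are assumptions made for illustration.

```python
import re

def gap_runs_not_multiple_of_three(aligned_seq, gap_char="-"):
    """Return (start, length) for every gap run whose length is not a multiple of 3.

    `start` is the 0-based column of the first gap character in the run.
    """
    pattern = re.escape(gap_char) + "+"
    return [
        (match.start(), len(match.group()))
        for match in re.finditer(pattern, aligned_seq)
        if len(match.group()) % 3 != 0
    ]

# Hypothetical aligned sequence: one 3-base gap, one 2-base gap, one 6-base gap.
example = "ATG---GCC--TTA------TAA"
print(gap_runs_not_multiple_of_three(example))  # [(9, 2)]
```

Runs reported by this check are the places where the reading frame of the aligned sequence may shift.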

Further details

Codons containing IUPAC multistate characters involved in silent substitutions are translated to the correct amino acids; when this is not possible, they are translated to 'X'.
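One way to read the rule above: expand each multistate character into the bases it stands for, translate every resulting codon, and keep the amino acid only if all expansions agree. The sketch below illustrates that reading with the standard genetic code. It is not the database's own implementation; the names translate_codon, CODON_TABLE, and IUPAC are hypothetical, and returning 'X' for anything that is not a three-letter IUPAC codon (gaps, for example) is an assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def translate_codon(codon):
    """Translate one codon; ambiguous codons yield 'X' unless every expansion agrees."""
    codon = codon.upper()
    if len(codon) != 3 or any(base not in IUPAC for base in codon):
        return "X"  # assumption: gaps and unknown symbols are not translated
    expansions = product(*(IUPAC[base] for base in codon))
    amino_acids = {CODON_TABLE["".join(bases)] for bases in expansions}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"

print(translate_codon("GGR"))  # G: GGA and GGG both encode glycine
print(translate_codon("YTA"))  # L: CTA and TTA both encode leucine
print(translate_codon("RAT"))  # X: AAT encodes N, GAT encodes D
```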

The protein alignments provided for each gene were constructed using both nucleotide and translated amino acid sequences. Because the translations are based on alignments, they may differ from a straight, non-aligned translation. For instance, an aligned translation will include frameshift compensation.

For all genome and single-gene DNA alignments, we have tried to keep the reading frame intact. However, this is not always possible, so be cautious when translating the aligned nucleotide sequences.

Sequences that are known to be recombinants are usually labeled as such, even if they are not recombinant in the region under consideration.

Relevant links

Codes and Symbols in Sequence Alignments
How the HIV Database Classifies Sequence Subtypes





Filtered Web Alignments

What sequences are included

We have added a new category of curated web alignments, called "filtered alignments". These alignments contain fewer sequences than the original web alignments: sequences that contain large insertions, a high content of ambiguity codes (>1%), or multiple frame shifts have been removed. The resulting alignments are cleaner, but contain less information.
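The ambiguity-code criterion can be approximated on your own alignments. The sketch below covers only that one criterion (not large insertions or frame shifts) and rests on assumptions that the text above does not specify: the fraction is computed over non-gap characters, anything other than A, C, G, T, or U counts as ambiguous, and the function names are hypothetical.

```python
def ambiguity_fraction(seq, gap_chars="-."):
    """Fraction of non-gap characters that are not an unambiguous base."""
    residues = [char for char in seq.upper() if char not in gap_chars]
    if not residues:
        return 0.0
    ambiguous = sum(1 for char in residues if char not in "ACGTU")
    return ambiguous / len(residues)

def filter_by_ambiguity(records, max_fraction=0.01):
    """Keep only sequences whose ambiguity fraction does not exceed max_fraction.

    `records` maps sequence names to (aligned) nucleotide strings.
    """
    return {
        name: seq
        for name, seq in records.items()
        if ambiguity_fraction(seq) <= max_fraction
    }

# Hypothetical records: "noisy" has 4 ambiguity codes in 200 bases (2%).
records = {
    "clean": "ACGT" * 50,
    "noisy": "ACGT" * 49 + "NNRY",
}
print(list(filter_by_ambiguity(records)))  # ['clean']
```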





Subtype Reference Alignments

What sequences are included

For each subtype, 4 genomes were selected as being broadly representative of that subtype. A paper describing the criteria used in selecting the 2005 reference sequences is available online as an HTML or PDF file.

How the sequences are named

The subtype, country, and year of isolation are given if they are defined in our database. If the sequence name is undefined, the GenBank accession number appears instead.

Relevant links

Leitner, et al. 2005, a compendium review article describing the 2005 subtype reference set
Information about HIV and SIV subtype nomenclature
How the HIV Database Classifies Sequence Subtypes
Information about CRFs
Codes and Symbols in Sequence Alignments





Compendium alignments

The compendium alignments comprise a re-aligned selection from the web alignments. Because they need to fit in a limited space, this set is limited to ~200 sequences. We try to include newer sequences in this set, in addition to the subtype reference sequences.





Consensus/Ancestral Sequences

What sequences are included

We provide consensuses for the M group subtypes A (including A, A1, and A2), B, C, D, F (including F1 and F2), and G; the circulating recombinant forms CRF01 and CRF02; and group O. We also provide a Consensus M-group, which is a consensus of the consensus sequences for subtypes A, B, C, D, F, G, and H. Ancestral sequences are also provided; they are based on the Complete Genome M-group Ancestral sequence and its phylogenetic tree. For more details, see the M-group Consensus Construction explanation file.

How the consensus alignments are made

The input alignments are the HIV Sequence Web Alignments. These sequences have undergone additional annotation after retrieval. Specifically, question marks in consensus sequences have been resolved, and glycosylation sites have been aligned. From the input, consensus sequences were built using our Consensus Maker site.

The consensus sequences were calculated using the default values of the Consensus Maker tool, except that they were computed for all subtypes having 3 or more (rather than 4 or more) sequences in the alignment. If a column in a subtype group contained equal numbers of two different letters, we resolved the tie by looking at the same column throughout the M group and using the most common letter as the consensus. An upper case letter in a DNA consensus sequence indicates that the nucleotide is preserved unanimously in that position in all sequences used to make the consensus. Where the sequences are not unanimous, the most common nucleotide is shown in lower case. Regions spanned by multiple insertions and deletions are difficult to align; we attempt to anchor alignments in such regions on glycosylation sites, and to preserve the minimal elements that span such regions.
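The case convention and the tie-breaking rule described above can be sketched as follows. This is an illustration only, not the Consensus Maker implementation: the function names are hypothetical, gaps are treated like any other character, the other Consensus Maker defaults are not reproduced, and a tie is resolved by comparing only the tied letters in the M-group column, which is one possible reading of the rule.

```python
from collections import Counter

MIN_SEQUENCES = 3  # the text above uses 3 or more sequences per subtype

def consensus_column(subtype_col, mgroup_col):
    """Consensus letter for one column: upper case if unanimous, else lower case."""
    counts = Counter(char.upper() for char in subtype_col)
    if len(counts) == 1:
        return next(iter(counts))  # unanimous column
    top = max(counts.values())
    tied = sorted(letter for letter, n in counts.items() if n == top)
    if len(tied) > 1:
        # Assumption: break the tie with the M-group counts of the tied letters.
        m_counts = Counter(char.upper() for char in mgroup_col)
        winner = max(tied, key=lambda letter: m_counts[letter])
    else:
        winner = tied[0]
    return winner.lower()

def consensus(subtype_seqs, mgroup_seqs):
    """Build a consensus for one subtype from equal-length aligned sequences."""
    if len(subtype_seqs) < MIN_SEQUENCES:
        raise ValueError("need at least %d sequences" % MIN_SEQUENCES)
    columns = zip(zip(*subtype_seqs), zip(*mgroup_seqs))
    return "".join(consensus_column(sub, m) for sub, m in columns)

# Hypothetical data: columns 3 and 4 are tied 2:2 within the subtype.
subtype = ["ACGT", "ACGA", "ACTA", "ACTT"]
mgroup = subtype + ["ACTA", "ACTA", "ACGA"]
print(consensus(subtype, mgroup))  # ACta
```

In the example, the first two columns are unanimous and stay upper case, while the two tied columns are resolved by the M-group counts (T over G, then A over T) and are written in lower case.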

How the ancestral sequences are derived

The ancestral tree and sequences were built as described in the Ancestral Tree Construction explanation file.

Interpreting the format of consensus sequences

An upper case letter in a DNA consensus sequence indicates that the nucleotide is preserved in that position in all sequences used to make the consensus. A lower case letter is the most common nucleotide in a variable position.

Protein sequences are always shown in upper case letters.

The number of sequences used to make the consensus is indicated in parentheses following the subtype designation.

Reagent development

Consensus and Ancestral sequences from this web page are suitable for reagent development because they do not contain question marks or ambiguous characters.

Relevant links

Consensus Maker Tools allow you to build a consensus from your own alignment according to your preferences.
Consensus Maker Explanation page shows the output format options.
Ancestral Tree Construction explanation file.
M-group Consensus Construction explanation file.
Codes and Symbols in Sequence Alignments



last modified: Wed Mar 14 15:10 2012


Questions or comments? Contact us at seq-info@lanl.gov.