


Frequently Asked Questions: HIV Sequence Database

For questions about immunology resources and tools, see Immunology Database FAQ.

 

Site overview

What can I find on this website?

This FAQ addresses questions about the HIV Sequence Database. We provide a variety of tools and information for researchers studying HIV and SIV. The main aim of this website is to provide easy access to our sequence database, alignments, and the tools and interfaces we have produced. The toolbar at the top of the page should help you navigate among these resources.

The HIV Sequence Database focuses on five primary goals:

Who are you?

The database staff includes molecular biologists, sequence analysts, computer technicians, post-docs, and graduate research assistants. We are part of the Theoretical Biology and Biophysics Group (T-10) at the Los Alamos National Laboratory. We are funded by the Division of AIDS of the National Institute of Allergy and Infectious Diseases through an interagency agreement with the Department of Energy.

Why are there several separate databases here?

Our databases are organized around several areas of viral informatics. The other affiliated databases are:

I am HIV positive; does this site have information useful to me?

The information on this site is developed for researchers who study the AIDS virus and are seeking ways of defeating it. The information available here is not directly helpful to patients. We are not qualified to give medical advice of any kind; please discuss medical issues with your doctor. You can find links to more relevant websites on our Links Page.

 

Sequence retrieval

What sequences can I find here, and why would I use this database to retrieve them?

Our sequence database receives all HIV-1, HIV-2, and SIV sequences that are deposited in GenBank. We retrieve these sequences periodically, so the most recent deposits may not be here yet. In addition to the information contained in GenBank records, we annotate the sequences with additional information (see the questions below about annotation). We also provide a range of tools for understanding and working with these sequences.

How do I retrieve a specific region of a specific sequence?

You can search for sequences by either common name or accession number, using our search interface.

This search interface finds all sequences that fit your criteria (e.g., all subtype B sequences from Thailand with names starting with 'H'), and allows you to download them either aligned or unaligned. If you want them aligned, you will get an alignment that spans the length of the complete genome, and it can contain non-overlapping sequences (for example, if your retrieval contains both env and gag sequences). Alternatively, you can specify a region you are interested in (e.g., env, V3, or HXB2 nucleotide positions 5253-7640). In that case, the search returns only the sequences that both fit your criteria and contain that region, and the alignment will contain only that region.

Additional information about sequence retrieval is available in the Search Help page.

How do I retrieve a set of sequences from a specific paper?

One simple way is to find the PubMed ID (listed in any PubMed record) and paste that into the appropriate field of the search interface. Another is to use other search criteria (e.g., accession number, author name, title word) to find one sequence from that article, and display its accession record (by clicking on the accession number in the search output). In the accession record you will find a link called "display all sequences from this publication". If you click on that link, it will retrieve all those sequences.

How do I obtain an alignment of all sequences of a particular gene with a particular subtype or country of origin?

This can easily be done with our search interface by choosing the appropriate fields. There are predefined regions (genes and proteins) or you can use the genome map and find the coordinates. To find coordinates if you have a sequence, use the "Sequence Locator Tool", which lets you paste in a sequence fragment and find its beginning and ending coordinates.

What is that "Patient code" that's listed in the output? Why are there two numbers there?

We try to link groups of sequences from a single patient by assigning them all to a unique patient ID. This unique patient ID that we assign is the number in parentheses. The other name/number is usually the sample name assigned by the authors (for example, "Patient_1"). For more details, please see the Search Help page.

A search limited to "complete genome" yields more sequences than the same search limited to "Gag" only. Why?

This is an artifact of how we define "complete genome". A search for "complete genome" includes all sequences longer than 7000 base pairs. These "complete" genomes are not always 100% complete; many have a small truncation of the 5' end of Gag. A search for "Gag" is limited to sequences that have a full-length Gag gene; sequences with a small truncation of Gag are omitted, so fewer sequences are returned.

If you want to search for Gag sequences that include those sequences with small truncations of the 5' end, it is best to search using exact genome coordinates, with the 5' coordinate selected for the greatest truncation you are willing to accept.
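The difference between the two searches can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the search interface's actual logic: the records are invented, and the Gag coordinates (HXB2 positions 790 to 2292) are an assumption that should be checked against the genome map.

```python
from typing import NamedTuple

# Assumed HXB2 coordinates of the Gag coding region (verify on the genome map).
GAG_START, GAG_END = 790, 2292
# "Complete genome" means longer than this many base pairs.
COMPLETE_GENOME_MIN_LENGTH = 7000


class Record(NamedTuple):
    name: str   # hypothetical sequence name
    start: int  # first HXB2 position covered by the sequence
    end: int    # last HXB2 position covered by the sequence


def is_complete_genome(rec: Record) -> bool:
    """True if the sequence is longer than the complete-genome cut-off."""
    return rec.end - rec.start + 1 > COMPLETE_GENOME_MIN_LENGTH


def covers(rec: Record, start: int, end: int) -> bool:
    """True if the sequence spans the whole region start..end."""
    return rec.start <= start and rec.end >= end


# Invented records: two near-full-length genomes lack the first bases of Gag.
records = [
    Record("full", 1, 9719),
    Record("near_full_a", 800, 9600),
    Record("near_full_b", 805, 9600),
    Record("gag_only", 790, 2292),
]

complete = [r.name for r in records if is_complete_genome(r)]
full_gag = [r.name for r in records if covers(r, GAG_START, GAG_END)]
# Accept up to 20 missing bases at the 5' end by moving the 5' coordinate.
relaxed_gag = [r.name for r in records if covers(r, GAG_START + 20, GAG_END)]

print(len(complete), len(full_gag), len(relaxed_gag))
```

In this toy data set the "complete genome" filter keeps three records, the strict Gag filter keeps two, and the relaxed coordinate search keeps all four.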

 

Sequence annotation

What is different about HIV Database accession entries and GenBank entries?

Only the author (or owner) of an entry can update or modify the GenBank entry. Our database includes fields and comments that we add ourselves in GenBank-style entries. The fields we add are usually not reviewed by the authors. For example, we might add the health status of the patient, the date of sampling, the patient risk group, the phenotype of a viral culture from which a sequence is derived, subtype information, or additional references to a specific sequence entry. These added comments and fields come from our reading of the literature or analysis. Often, GenBank entries are not updated by their authors after the initial submission, but subsequent publications provide new and important information that pertains to a particular sequence, and we try to link this information.

Why is added information available for some but not all sequences?

Most information annotated to sequences is derived from publications and entered manually by our staff. As this is a time-consuming process, not all sequences are annotated. We are systematically adding some of the most important fields (subtype and country of sampling, for example) to records. For more detailed annotation, we emphasize full-length genomes, complete gene coding regions, and sequences over 500 base pairs in length. However, not all published papers provide additional information.

How do you guys enter the country information?

GenBank requests this information for new HIV sequence submissions, so almost all entries now have this information. Please note that we distinguish between 'sampling country' and 'infection country'. This distinction can be important when, for example, a Somali immigrant lives in Sweden and gets tested there: the sampling country is Sweden, but the likely infection country is Somalia. Filling out the infection country field is a bit of a judgement call, so this field should be regarded as 'likely infection country'. If additional geographic information is known, we also include it in the database in comment lines or with patient information. For example, if an Ethiopian moved to Israel and was included in a study of HIV in Israel, the sequence would be listed as an Israeli sample, and the additional information about Ethiopia would be included elsewhere in the entry.

What are the abbreviations for each country?

We represent the country by a two-letter country code based on the international naming convention (ISO 3166). These two-letter codes are intuitive and short (for example, UG = Uganda, JP = Japan), so they can be easily linked to a sequence name for more informative representation in alignments and phylogenetic trees.
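As a rough illustration of linking a country code to a sequence name, here is a minimal Python sketch. The lookup table is deliberately incomplete, and the "CODE.name" layout, the isolate names, and the "XX" fallback for unknown countries are illustrative assumptions rather than the database's naming rules.

```python
# A few ISO 3166-1 alpha-2 codes; the table is deliberately incomplete.
COUNTRY_CODES = {"Uganda": "UG", "Japan": "JP", "Thailand": "TH", "Sweden": "SE"}


def tagged_name(name, country):
    """Return the sequence name prefixed with its two-letter country code.

    The 'CODE.name' layout and the 'XX' fallback are illustrative choices.
    """
    code = COUNTRY_CODES.get(country, "XX")
    return f"{code}.{name}"


# Hypothetical isolate names, labeled for use in an alignment or tree.
labels = [tagged_name("isolate1", "Uganda"), tagged_name("isolate2", "Japan")]
print(labels)
```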

How reliable is information about risk group, infection date, country, etc.?

We try to only include information that is 'very likely or certain' to be true. This means, for example, that when a dual risk group is listed we do not include risk group information. If a paper states someone was probably infected in country X or country Y, we include that information only as a note. When someone was infected 'between 1989 and 1991' we do not include an infection year.

Why is it that the comment lines in the Los Alamos database accession entries are not always smooth reading?

When we created our relational database, we combined comment lines that were linked to sequences by accession number from several sources in the older versions of our database. Thus information may be repeated. There are tens of thousands of entries in the database, and we felt it was more important to get the information in than to have it read smoothly. As time permits, we are going back to these comments and making them more readable.

Can I submit HIV sequences directly to your database?

No. Sequences must be deposited to one of the major sequence databases: GenBank, EMBL, or DDBJ. All HIV and SIV sequences deposited will automatically enter our database within approximately one month from their public release by any of these databases.

I have some patient/sequence data related to HIV sequences I deposited to GenBank. Can I send you this information to add to your database?

Yes! Before sending data, please contact us at the e-mail address at the bottom of the page for details.

 

Subtypes and Recombinants

What are M, N, O and CPZ sequences?

"M" is the main group of viruses in the HIV-1 global pandemic, and it contains multiple subtypes. O is the "outlier" group, and N is a very distinctive form of the virus that is Non-M, Non-O. CPZ are the primate viruses isolated from chimpanzees, which are the primate viruses most closely related to HIV-1.

What are subtypes?

Subtypes are phylogenetically associated groups of HIV-1 or HIV-2 sequences. Sometimes the word "clade" is used to mean subtype. The sequences within any one subtype are more similar to each other than to sequences from different subtypes. These subtypes represent different lineages of HIV, and have some geographical associations. There are many ambiguities in the subtyping system; however, it describes genetic clustering patterns and provides a useful system for organizing viruses by genetic similarity. This topic is explained in detail in HIV and SIV Subtype Nomenclature.

Each year we gather a set of Subtype Reference Sequences that are considered to be representative of all of the subtypes of the HIV-1 M, N, and O groups. Larger sets of HIV/SIV Alignments of each gene and complete genomes, including the subtype reference sequences, are also available.

Why can I no longer find subtype E in the nucleotide or protein alignments in the database?

Subtype E was redesignated as CRF01_AE in 1998. It was originally described as subtype E based on envelope genes from isolates from southeast Asia. When gag genes and complete genomes from these isolates were sequenced, it was found that regions of the genome other than the env gene are more similar to the A subtype, so "subtype E" turned out to be a recombinant. Small fragments in the env region are still commonly called "E" because there they do appear to be completely separate from all other subtypes. The E subtype has only been clearly defined in the env region, and the evolutionary history and origin of this mosaic form remain controversial.

What do multiple letters representing a subtype mean?

Multiple letters indicate that the sequence is a recombinant of parental viruses originating from 2 or more clades. For example, AGI indicates that three subtypes are thought to have recombined to form the sequenced virus: A, G, and I. The subtypes are listed alphabetically. The regions of the genome that are derived from a particular subtype are not indicated by the name, because recombinants can show complex patterns of breakpoints, and often a single gene can contain several subtypes.

Intersubtype recombinant genomes become designated as "circulating recombinant forms" (CRF) if 3 or more people with no known epidemiological linkage are infected with HIV-1 strains which share the same recombination breakpoints (i.e., are derived from the same ancestral recombinant genome). CRFs are named with a number (e.g., 01_AE or 02_AG). We maintain a current list of specific CRFs: HIV-1 Circulating Recombinant Forms.

How does the HIV database classify sequences and recombinants?

We name recombinants alphabetically. If a recombinant has A, C, and H fragments, it will be labeled ACH. If it is a recombinant of a CRF and a subtype, the number precedes the letter; a recombinant between CRF01 and subtype B will be labeled 01B. Recombinants of multiple CRFs are named in the same way; a recombinant between CRF01 and CRF02 would be labeled 0102. If a recombinant contains unclassified regions, a 'U' will be included in the name.

This topic is addressed in detail in How the HIV Database Classifies Sequences.
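The naming rules above can be sketched in a few lines of Python. This is a minimal illustration, not the database's classification code. It assumes that CRF numbers are passed as two-digit strings and that 'U' is simply sorted alphabetically with the subtype letters (the rules above state only that a 'U' is included).

```python
def recombinant_label(components):
    """Build a recombinant label from its parental components.

    Components are subtype letters ("A", "C"), CRF numbers written as
    two-digit strings ("01", "02"), or "U" for an unclassified region.
    CRF numbers come first, then letters, each group sorted.
    """
    unique = set(components)
    crf_numbers = sorted(c for c in unique if c.isdigit())
    letters = sorted(c for c in unique if not c.isdigit())
    return "".join(crf_numbers + letters)


print(recombinant_label(["H", "A", "C"]))  # subtypes only
print(recombinant_label(["B", "01"]))      # CRF01 plus subtype B
print(recombinant_label(["02", "01"]))     # two CRFs
print(recombinant_label(["A", "U"]))       # with an unclassified region
```

With these inputs the function returns ACH, 01B, 0102, and AU, matching the examples given in the answer above.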

Sometimes a sequence is labeled as a recombinant, but seems to be a pure subtype. Why?

When we have multiple sequences from one patient in the database, sometimes some are known to be recombinants, while others appear to be pure subtypes. When we have env sequences of subtype A and gag sequences of subtype C from the same patient, we will usually label all sequences "AC", unless the authors specifically mention that the person is dually infected with two different subtypes.

What is a CRF?

CRFs are viruses whose complete genome has been shown to be recombinant or mosaic, consisting of some regions which cluster with one subtype and other regions of the genome which cluster with another subtype in phylogenetic analyses. CRFs are numbered sequentially in the order in which they are reported in the literature, starting with CRF01_AE, which is the new name of what used to be subtype E. The name of the isolate which was first sequenced and described is used to indicate the prototype of that CRF. This is done because there can be many different recombinant genomes containing the same subtypes, but only some of them have the same recombination breakpoints, and are apparently derived from the same common ancestor. In order to classify a recombinant as a circulating recombinant form, it must be found and sequenced in at least 3 patients who were not directly epidemiologically linked. The structure of all presently known CRFs can be found here: HIV-1 Circulating Recombinant Forms.

Why are subtypes specified for sequences that are gene fragments when they might be embedded in a recombinant genome?

When a short region of sequence has a subtype designation, one should be aware that the subtype designation often refers only to that fragment of sequence, and the virus that it is derived from may be recombinant.

There are some exceptions. Sometimes the sequence is known to come from an isolate from which other fragments have also been sequenced; in that case, we try to indicate both subtypes. For example, if we have an env sequence that is subtype A and a gag sequence that is subtype B, we try to assign 'AB' to both. However, because of the manual effort involved, we don't always manage to do this consistently.

 

Alignments

What is a "consensus sequence" and how is it made?

A consensus sequence is a sequence of the most common nucleotide or amino acid at each position in an alignment. We generally use a 50% cut-off: if at least 50% of the sequences have the same character at a position, that character is used; otherwise we put a question mark at that position (CONS.a in the example below). Another way to create a consensus is to take the most frequently occurring character, even if it is not the majority (CONS.b in the example below).

CONS.a  ACG?A?CAT?CTATCAGT  
CONS.b  ACGTAGCATACTATCAGT 
------  ------------------ 
SEQ1    ACGTAGCATGCTATCAGT 
SEQ2    ACGTAGCATGCTATCAGT 
SEQ3    ACGTACCATCCTATCAGA 
SEQ4    ACGAAACATCCTATCAGT 
SEQ5    ACGAATCATACTATCAGT 
SEQ6    ACGGATCATACTATCAGT 
SEQ7    ACGCACCATACTATCAGT 
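
The two consensus rules shown in the example above can be sketched in Python as follows. This is a minimal illustration, not the code behind our Consensus Maker tools. It assumes the sequences are already aligned and of equal length, and it breaks ties by taking the character that appears first when reading the column from top to bottom, which is an arbitrary choice.

```python
from collections import Counter

# The seven aligned sequences from the example above.
ALIGNMENT = [
    "ACGTAGCATGCTATCAGT",
    "ACGTAGCATGCTATCAGT",
    "ACGTACCATCCTATCAGA",
    "ACGAAACATCCTATCAGT",
    "ACGAATCATACTATCAGT",
    "ACGGATCATACTATCAGT",
    "ACGCACCATACTATCAGT",
]


def consensus(seqs, min_fraction=None, unknown="?"):
    """Return the consensus of equal-length aligned sequences.

    With min_fraction=None the most frequent character is always used.
    Otherwise the most frequent character is used only if it occurs in at
    least that fraction of the sequences; if not, `unknown` is written.
    """
    total = len(seqs)
    result = []
    for column in zip(*seqs):
        counts = Counter(column)
        # max() keeps the first character seen among equally common ones.
        char = max(counts, key=counts.get)
        if min_fraction is not None and counts[char] / total < min_fraction:
            char = unknown
        result.append(char)
    return "".join(result)


cons_a = consensus(ALIGNMENT, min_fraction=0.5)  # 50% cut-off
cons_b = consensus(ALIGNMENT)                    # most frequent character
print("CONS.a ", cons_a)
print("CONS.b ", cons_b)
```

Run on the seven sequences, this reproduces the CONS.a and CONS.b lines of the example. At the sixth position G, C, and T each occur twice, so the G in CONS.b reflects only the tie-breaking choice.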

Consensus sequences are built from an alignment. The alignment itself might be dominated by one type of sequence, such as subtype B sequences from the United States. So in general a consensus sequence is not the same as the common ancestor of the sequences, although in some cases it can approximate an ancestral sequence.

To make a consensus from your own sequences, we provide Consensus Maker Tools. We also provide premade HIV-1 Subtype Consensus Sequences.

What are the Intelligenetics, Mase, FastA, and other sequence formats?

Many sequence analysis tools use multiple sequence alignments as the input data for analysis. There are many different ways to format the alignment information. For example, FastA and Intelligenetics formats list sequences sequentially and have a sequence name on a line by itself, followed by many lines of sequence. Intelligenetics format allows comment lines preceded by semicolons; FastA format does not.

It is relatively simple to convert from one format to another, and most program packages provide scripts or programs to aid in file format conversion. FastA format is considered a standard format, and most sequence analysis programs can import or convert FastA files. Our tools accept many common sequence formats. If your sequences are in a format the programs cannot handle, you can find web-based sequence conversion tools, on our site and elsewhere.
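To make the FastA layout concrete, here is a minimal Python sketch of a reader. It assumes each name line starts with ">" (the usual FastA convention) and joins the sequence lines that follow it. It does not handle Intelligenetics comment lines or any other format, and it is not the parser used by our tools.

```python
def read_fasta(text):
    """Parse FastA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A name line starts a new record.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence lines belong to the most recent name line.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = """>SEQ1
ACGTAGCAT
GCTATCAGT
>SEQ2
ACGTAGCATGCTATCAGT
"""

sequences = read_fasta(example)
for seq_name, seq in sequences.items():
    print(seq_name, len(seq))
```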

Which alignment is best for my purpose?

We provide a variety of alignment sources. These include both premade alignments, and tools you can use to produce an alignment of your own sequences.

How can I make a printable alignment for publication?

We have created a tool for producing publication-quality sequence alignment figures: SeqPublish.

 

Tools

I tried one of your tools and it failed. What should I do?

First, read the information on the web page for that tool, including the Explanation file, if available. Pay particular attention to the input format of your data. Is your data in one of the common sequence formats? Do your sequences need to be aligned (or codon-aligned) for this tool? Do your sequences contain line breaks or any non-standard characters? Does the tool work using the Sample Input? If you cannot find the source of the problem, please write to the address at the bottom of the page for help. Sometimes our tools are not working properly, and it is very useful for us to be informed about the problem. We are happy to help with troubleshooting.

Where can I get an overview of all of your tools and what they do?

The Tools Index lists all of our tools, with brief descriptions of what they do. We also provide links to relevant tools on other websites in our list of External Tools.

How can I determine the subtype/clade of my HIV sequences?

There are many tools that can help you identify the subtype/clade. Each tool has its pros and cons. Some can handle many sequences at once, while some can only analyze one sequence at a time.

The Recombinant Identification Program (RIP) is a program developed at the HIV Database to identify sequences that appear to be mosaics of distinct phylogenetic clades. The idea is that such mosaic sequences are likely recombinants. RIP was designed to detect recombinants of sequences belonging to different subtypes of HIV-1, in response to the identification of intersubtype recombination as an important source of new variation in HIV-1, but it can be used for other applications, including analysis of non-HIV sequences. For additional details, see RIP Explanation.

How do I find where my sequence is located in a gene or protein (for example, where are the boundaries of a PCR primer or a CTL epitope)?

We recommend the use of HXB2 as a reference strain, and have developed a tool called HIV/SIV Sequence Locator, which determines the position number boundaries of a stretch of sequence, or alternatively identifies the stretch of sequence which corresponds to specified positions. You can also use this tool to identify which coordinates you need to use to find the region in the database that corresponds to your sequence; the numbering system in the search interface is based on the same method.
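The idea of mapping a fragment to reference coordinates, and back, can be sketched in Python. This toy version uses exact string matching against a short made-up reference; it is an assumption-laden illustration only. The Sequence Locator itself works against HXB2 and has to cope with sequences that differ from the reference, which exact matching cannot do.

```python
def locate(fragment, reference):
    """Return the 1-based, inclusive (start, end) of fragment in reference.

    Returns None if the fragment does not occur exactly in the reference.
    """
    index = reference.upper().find(fragment.upper())
    if index == -1:
        return None
    return index + 1, index + len(fragment)


def extract(reference, start, end):
    """Return the stretch of reference between 1-based positions start..end."""
    return reference[start - 1:end]


# A short made-up reference, standing in for a real reference strain.
toy_reference = "TTGACGTAGCATGCTATCAGTCCA"

coordinates = locate("gcatgc", toy_reference)
print(coordinates)
print(extract(toy_reference, *coordinates))
```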

What is hypermutation?

In HIV, some viruses have very frequent G -> A mutations. When this happens, the virus is generally not viable, as many stop codons are introduced. We have developed a tool called Hypermut which you can use to screen your aligned sequences for potential hypermutation.
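A first rough screen for this pattern is to count G -> A changes relative to a reference sequence and compare them with all other changes. The Python sketch below does only that, on invented toy sequences. It is not the Hypermut algorithm, which applies its own criteria; the function name and the example sequences are illustrative assumptions.

```python
def count_substitutions(reference, query):
    """Count G->A changes and all other changes between aligned sequences.

    Positions where either sequence has a gap ('-') are ignored.
    Returns a tuple (g_to_a, other).
    """
    g_to_a = 0
    other = 0
    for ref_base, query_base in zip(reference.upper(), query.upper()):
        if ref_base == query_base or "-" in (ref_base, query_base):
            continue
        if ref_base == "G" and query_base == "A":
            g_to_a += 1
        else:
            other += 1
    return g_to_a, other


# Toy aligned pair: the query differs from the reference mostly by G->A.
toy_reference = "ATGGGTGCGAGAGCGTCAG"
toy_query = "ATGAGTACAAGAACGTCAT"

g_to_a, other = count_substitutions(toy_reference, toy_query)
print(f"G->A changes: {g_to_a}, other changes: {other}")
```

A sequence with many G -> A changes and few other changes, as in this toy pair, would be a candidate for closer inspection.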

What is principal coordinate analysis?

Principal Coordinate Analysis is a procedure to find meaningful patterns in sequence data with no a priori knowledge about them. The procedure attempts to summarize the variation in the sequences in a limited number of axes or dimensions. A 'dimension' is basically a combination of positions in a sequence that behave similarly (for example, position 133 usually has an A when position 250 has a G). For additional information, see PCOORD Explanation file.
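To give a feel for the procedure, the following Python sketch computes the first principal coordinate axis for a handful of invented aligned sequences. It is a simplified illustration and not the method implemented in PCOORD. As assumptions, it treats the number of differing positions between two sequences as their squared distance, and it finds only the first axis, using power iteration rather than a full eigen-decomposition.

```python
import math


def differences(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def first_principal_coordinate(seqs, iterations=100):
    """Return each sequence's coordinate on the first principal axis."""
    n = len(seqs)
    # Squared-distance matrix (assumption: differences = squared distance).
    d2 = [[differences(a, b) for b in seqs] for a in seqs]
    row_mean = [sum(row) / n for row in d2]
    grand_mean = sum(row_mean) / n
    # Double-centre the matrix: B = -1/2 (d2 - row - column + grand mean).
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Power iteration converges to the dominant eigenvector of B.
    vector = [float(i + 1) for i in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(b[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            break
        vector = [x / norm for x in product]
        eigenvalue = norm
    # Scale the unit eigenvector by the square root of its eigenvalue.
    return [x * math.sqrt(eigenvalue) for x in vector]


# Invented sequences: two groups that differ at the first four positions.
toy_sequences = [
    "AAAAAAAA", "AAAAAAAC", "AAAAAACA",
    "GGGGAAAA", "GGGGAAAC", "GGGGAACA",
]

coordinates = first_principal_coordinate(toy_sequences)
for seq, value in zip(toy_sequences, coordinates):
    print(seq, round(value, 3))
```

In this toy data set the first axis separates the two groups: the first three sequences receive coordinates of one sign and the last three the opposite sign. Which group is positive is arbitrary.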

Is it possible to get the group at the Los Alamos database to modify programs or write additional code?

If you have an analysis question and our code doesn't give you the right output, we will try to adapt the code to your needs if we can. Write to the e-mail address below and let us know. If you have something you would like to do to analyze HIV sequences, and can't find the computer code you need to do it, write to us. We will consider writing the programs if we feel they will be generally useful, or we may be able to point you in the right direction if we are aware of code that already exists.

last modified: Wed Apr 16 12:44 2008


Questions or comments? Contact us at seq-info@lanl.gov.