HIV sequence database



HELP for using the search interface

Tips

Accession

To search for a range of accession numbers, use X12345 .. X23456. Note that some sequence names (for example, HXB2) have more than one associated accession number. To retrieve them all, use the Sequence Name field.

Subtype

In this field, you can search on multiple subtypes by clicking the ones you want. Also select the 'include recombinants' check box if you want recombinants of the chosen subtype(s) as well. While most browser/operating system combinations will let you select non-adjacent fields, in some you need to use 'ctrl-click' instead of 'shift-click' to do that. Note that if your search is limited to a specific genomic region, you may bring up some recombinant sequences that are not of the selected subtype in that region.

For more information on the subtype and CRF classifications, see:
Overview of HIV-1, HIV-2, and SIV subtype nomenclature
How the HIV database classifies sequence subtypes
Overview of primate immunodeficiency viruses lists SIV subtypes
CRF mosaic patterns lists recognized CRFs
HIV-1 M group nomenclature publication 1999

Sequence name

This is usually a name based on the patient ID, isolate, or clone name for a sequence, and may be the way a sequence is referred to in publications. This field also searches the GenBank Locus Name field.

Sampling country

The country in which the sample was taken or from which the sequence was obtained. We use 2-letter Country Codes.

Sampling city

The city/province/region in which the sample was obtained.

Sampling year

The year in which the sample was obtained.

Sequence length

The length of the nucleotide sequence, in base pairs.

Problematic sequences

This field marks sequences that users usually want to exclude from a retrieval.

Please note: By default, these sequences are excluded from searches, but users may choose to include them. Our criteria are very conservative, to keep false positives few, but that leaves some unlabeled sequences that are still problematic. You still need to check for problem sequences prior to publication!

N: (1)
High content of non-ACTG characters, meeting one of the following criteria:
  • more than 100 consecutive non-ACTG characters
  • >7% non-ACTG characters for sequences of length <1000
  • >5% non-ACTG characters for sequences of length 1000-2999
  • >3% non-ACTG characters for sequences of length 3000 or above.
While direct sequences will naturally contain some IUPAC ambiguity characters, sequences annotated as N have such a high fraction that multiple alignment programs and other analysis programs have trouble with them. All new data are screened as they come in.
C: (2)
Likely contamination with a laboratory strain. If a major part of a study set is contaminated, we may label the full set with a C; in other cases, we choose a very conservative (high) level of similarity before we mark a particular sequence C. Different genomic regions will have different standards; for example it is harder to detect potential contamination in a short region in pol than it is in env. Contaminants cannot be reliably annotated through an automatic screen, so many potential contaminant sequences are still unmarked.
H: (3)
Hypermutation. We are currently screening the database for extreme cases of G->A hypermutation, and mark these sequences with an H. Some submitters already state in the "note" field of the GenBank submission that a sequence is hypermutated; others do not. We are currently determining parameters that will identify the most extreme cases, as these sequences carry substitutions not found in viable viruses. Such sequences alter phylogenetic tree branch lengths and complicate the determination of appropriate evolutionary models. For additional information about hypermutation, see Hypermut Tool.
S: (4)
A synthetic sequence, i.e., a sequence that does not represent a naturally-occurring viral sequence. These include sequences that have been manipulated to change codon usage and sequences that contain non-HIV components.
D: (5)
A sequence containing an artifactual deletion of >100 nucleotides. These sequences occur when an author joins 2 sequences from a single sample (for example, a protease and an RT sequence) but omits the intervening sequence. Sequences that represent viruses with naturally-occurring deletions are not annotated in this category.
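The N criteria listed above can be sketched in a few lines of Python, which may help when pre-screening your own sequences. This is only an illustration, not the database's actual screening code. The function names are hypothetical, and it assumes that every character other than A, C, G, or T (including gap characters and IUPAC ambiguity codes) counts as non-ACTG.

```python
import re


def longest_non_actg_run(seq: str) -> int:
    """Return the length of the longest run of consecutive non-ACTG characters."""
    runs = re.findall(r"[^ACGT]+", seq.upper())
    return max((len(run) for run in runs), default=0)


def has_high_non_actg_content(seq: str) -> bool:
    """Apply the N criteria from the help text to one nucleotide sequence.

    Assumption: anything that is not A, C, G, or T counts as non-ACTG.
    """
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        return False

    # Criterion 1: more than 100 consecutive non-ACTG characters.
    if longest_non_actg_run(seq) > 100:
        return True

    # Criteria 2-4: the allowed fraction depends on the sequence length.
    if length < 1000:
        threshold = 0.07
    elif length < 3000:
        threshold = 0.05
    else:
        threshold = 0.03

    non_actg = sum(1 for base in seq if base not in "ACGT")
    return non_actg / length > threshold
```

Note that the same number of ambiguous characters can pass in a short sequence and fail in a longer one, because the permitted fraction drops as the length increases.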

Patient code

The patient code is displayed in searches as a two-part number, for example "P1(10139520)". The first part is usually the name or number by which the patient is identified in publication(s). The second part is an internal number assigned by our database, the patient ID. A patient code such as "P1" can (and does) refer to more than one patient. However, the sequence records associated with "10139520" are specific to one patient.
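If you post-process downloaded search results, the two-part display format can be split as in the following sketch. The function and class names are hypothetical, and the pattern assumes the display always has the form code(number), as in the example above.

```python
import re
from typing import NamedTuple, Optional


class PatientCode(NamedTuple):
    code: str        # name or number used in the publication(s), e.g. "P1"
    patient_id: int  # internal database number, specific to one patient


# Assumed display format: "<code>(<digits>)", e.g. "P1(10139520)".
_DISPLAY_PATTERN = re.compile(r"^\s*(?P<code>.+?)\s*\((?P<id>\d+)\)\s*$")


def parse_patient_code(display: str) -> Optional[PatientCode]:
    """Split a displayed patient code into its two parts, or return None."""
    match = _DISPLAY_PATTERN.match(display)
    if match is None:
        return None
    return PatientCode(match.group("code"), int(match.group("id")))
```

Because a code such as "P1" can refer to more than one patient, group records by the second part (the patient ID) when you need one group per patient.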

To search this field, use the code (for example, "P1"). It is not possible to search using the database ID (for example, "10139520"). If you are looking for a patient from a specific publication, include an author name or PubMed ID in your search. Not all sequences in the database have an assigned patient code; if your search by patient code fails, try entering the search in the "Sequence Name" field (sequence names often contain the patient code).

The search algorithm used for this field is different from the other fields, in that a space (for example, "Patient G") is not interpreted as an AND, as it is in the other fields. Instead, the entire string, including spaces, is used as the search term. You can use the words 'and' and 'or' to join several criteria.

Infection country

This field is used when the infection country is distinct from the sampling country. This field is only filled in if it is actually known with high certainty where a patient was infected. For example, the database contains many sequences for which sampling country is Sweden that are actually most likely from African countries. In these cases, if a patient has indicated that his/her only chance of infection was in an African country, the infection country is set to that; but if the infection country is not explicitly named, it will be blank.

Risk factor

The risk factor describes the risk activity by which the patient most likely was infected. Dual risk factors are not recorded. Again, the risk factor has to be positively established to be recorded in this field.

Infection year

This is the year in which the patient was infected. The year is only recorded when it is known with some certainty.

Viral load

The plasma viral load in units of copies/ml of plasma. A viral load of "1" is entered for data that are below the limit of detection (usually <50 copies/ml).

HLA information

Any HLA information available from the individual who was sequenced.

CD4 count

The CD4+ T-cell count at the time of sampling, in absolute counts of cells/ul.

CD8 count

The CD8+ T-cell count at the time of sampling, in absolute counts of cells/ul.

Coreceptor and phenotype

These fields are annotated based on biological data only, not based on presumed usage inferred from sequences. For information about these fields, see articles:
Biological and Molecular Aspects of HIV-1 Coreceptor Usage
Coreceptor Use by Primate Lentiviruses

Drug naive

Sequences that were sampled prior to the patient receiving any type of antiretroviral treatment. Sequences are annotated by the database staff as drug naive only when there is certainty that the patient has not undergone antiretroviral drug treatment.

Authors

This search assumes an 'and', so if you search for 'smith jones' you will retrieve all sequences for which both Smith and Jones are in the author list. Please do not include initials. Please note that our database does contain some errors in author names; we are unable to correct mistakes made in the original sequence submissions.

Days from seroconversion

The number in this field indicates the number of days between the patient’s seroconversion and the date the sample was taken for sequencing. For samples taken before seroconversion, negative numbers are used.

Please note: Days from infection or seroconversion are always estimates, and different studies use different methods and definitions. We have attempted to translate these values into a single system for study cross-comparisons, but please use these fields with caution; go back to the original papers for confirmation of study-specific timing definitions.

Prior to December 2005, these data were available as “Months from seroconversion”. The decision to convert to days for this field was based on the assumption that the majority of new data being reported will be in days. Previous data in the form of months were converted to days as follows. Most published studies linked to existing data were reviewed by hand. For studies giving data in the form of days, the data were re-entered in days. When data were given in weeks, weeks were converted to days by multiplying by 7. For studies where the data were given in months, the field was converted to days by multiplying months by 30.42 and rounding to the nearest whole number. In some cases, when all data in a large set of sequences were >6 months, the publication was not examined, and the 30.42 conversion was applied.
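As an illustration of the conversion described above, the sketch below turns a value given in days, weeks, or months into whole days. The function names are hypothetical. The help text does not say how exact halves are rounded, so rounding halves up is an assumption here.

```python
import math

DAYS_PER_WEEK = 7
DAYS_PER_MONTH = 30.42  # factor stated in the help text


def round_half_up(value: float) -> int:
    """Round to the nearest whole number (assumption: exact halves round up)."""
    return math.floor(value + 0.5)


def to_days(value: float, unit: str) -> int:
    """Convert a time from seroconversion or infection into whole days."""
    if unit == "days":
        return round_half_up(value)
    if unit == "weeks":
        return round_half_up(value * DAYS_PER_WEEK)
    if unit == "months":
        return round_half_up(value * DAYS_PER_MONTH)
    raise ValueError(f"unknown unit: {unit!r}")
```

For example, a value reported as 6 months becomes 183 days (6 x 30.42 = 182.52), and a sample taken 2 weeks before seroconversion becomes -14.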

In cases where studies give data that are vague but possibly useful, a text entry may appear in this field. The following text entries may appear:

“Pre-seroconversion”: sample was taken before seroconversion, but the exact number of days is unknown.
“Early”: <1 year after seroconversion
“Late”: ≥2 years after seroconversion
“No data”: the same meaning as a blank field

Days from infection

The number in this field indicates the number of days from the time the patient was infected until the sample was taken for sequencing. Post-infection dates are relatively rare. Most often they are known when a patient seeks medical treatment for acute illness shortly after having a sexual encounter with a stranger. We use this when the primary author presents the data as post infection in the original citation.

Please note: Days from infection or seroconversion are always estimates, and different studies use different methods and definitions. We have attempted to translate these values into a single system for study cross-comparisons, but please use these fields with caution; go back to the original papers for confirmation of study-specific timing definitions.

Prior to December 2005, these data were available as “Months from Infection”. The conversion of data from months to days was performed as described above for “Days from seroconversion”.

If a user is interested in sequences from a particular timepoint relative to infection or seroconversion, it may be wise to search the ‘days from seroconversion’ and ‘days from infection’ fields separately, as most data are recorded in one field or the other, not both. For example, if the user wants sequences that are either <90 days post-infection or <30 days post-seroconversion, two separate searches are needed.

PubMed/Medline ID

The PubMed/Medline/GenBank databases switched from one numbering system (Medline ID) to another (PubMed ID). This field searches both listings. The search returns only the PubMed ID, because Medline IDs are no longer being assigned. However, Medline IDs may still be searched and retrieved.

Geographic region

This is a way to retrieve all sequences from (for example) the African continent without having to search for each country separately. For a list of countries included in each region, see:
Definitions of the HIV Database geographic regions

Other fields

This is a shortcut to search on several other fields in the database without cluttering the interface. More fields may be added in the future. If you need to limit your search by 2 or more fields listed here, use the Advanced Search Interface. As elsewhere, entering an asterisk (*) will display any available data without limiting the search.

Infection city

If the infection city is known to be different from the city where the sample was collected, that is noted here. This field rarely contains data.

Title

Any keywords from the title of the journal article.

Comment

When you search on "Comment", the comment fields in both the sequence entries and the patient records will be searched. This search can be very time-consuming. In some cases, epidemiological linkage can be found by searching on comment keywords such as "link", "partner", etc.

Patient sex

Categories are: M and F.

Project

If the name of the project or cohort that the patient was enrolled in is known, it is recorded here.

Progression

This field records the rate of progression of the patient, if recorded by the study. Categories are: EC, LTNP, SP, RP, P (elite controller, long-term non-progressor, slow progressor, rapid progressor, progressor).

Patient age

The age of the patient in integer days when the sample was taken. Ages previously entered in years were converted by multiplying by 365.25. You can use y for year and m for month. For example, to select for sequences from patients under 18 years, enter either "<6575" or "<18y".
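The following sketch shows how an age entry such as "<18y" relates to integer days. It is an illustration only, not the search interface's own parser. The function name is hypothetical, the set of accepted comparison symbols is an assumption, and so is the month length (365.25 / 12 days), which the help text does not state. Halves are rounded up, which is consistent with "<18y" being equivalent to "<6575" (18 x 365.25 = 6574.5).

```python
import math
import re
from typing import Tuple

DAYS_PER_YEAR = 365.25                # factor stated in the help text
DAYS_PER_MONTH = DAYS_PER_YEAR / 12   # assumption: not stated in the help text

# Assumed entry format: optional "<" or ">", a number, optional "y" or "m".
_AGE_PATTERN = re.compile(r"\s*([<>]?)\s*(\d+(?:\.\d+)?)\s*([ymYM]?)\s*")


def parse_age_query(text: str) -> Tuple[str, int]:
    """Return (comparison symbol, age in whole days) for an age entry."""
    match = _AGE_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"cannot parse age entry: {text!r}")
    comparison, number, unit = match.group(1), float(match.group(2)), match.group(3).lower()
    if unit == "y":
        days = number * DAYS_PER_YEAR
    elif unit == "m":
        days = number * DAYS_PER_MONTH
    else:
        days = number  # no unit: the value is already in days
    # Round to whole days (assumption: exact halves round up).
    return comparison, math.floor(days + 0.5)
```

With these assumptions, "<18y" and "<6575" parse to the same query.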

Patient health

The health status of the patient at the time of sampling. Categories are: acute infection, asymptomatic, symptomatic, AIDS, and deceased.

Isolate name

This field is usually the same as the "sequence name".

Clone name

The clone number of the sequence. This field is only used for sets of cloned samples.

Sample tissue

The tissue from which the sample was derived. Categories include: plasma, PBMC, blood, brain, CSF, semen, cervix, feces, etc.

Culture method

For samples derived from PBMC. Categories are: uncultured, primary, expanded, co-cultured.

Genomic Region search

To search for HIV-1 sequences that span a given region, select the region from the list. The exact coordinates (relative to reference strain HXB2) for each fragment are listed below. Note that to be included, a given sequence must span the entire length of the region selected. The exception to this is the "complete genome" category, which yields all sequences over 7000 base pairs. You can also choose any other region of interest based on HXB2 coordinates. For a diagram of HIV-1 showing the HXB2 breakpoints, see HIV-1 Genomic Map.

If you want all sequences that partly cover a selected region, check the "Include fragments" box and enter a minimum length for the included fragments. As another example, if you want to search for fragments over 500 base pairs from any region of the genome, check the "Include fragments" box, enter "500" in the box for minimum length, and enter "complete genome" as the region; this search will not work using the "Any" selection. Do not use symbols such as < in the fragment length box.

This search feature incorporates the program formerly called HIV-MAP, which allows you to obtain all sequences (or a subset of sequences) that contain a selected region. A full description of the HIV-MAP tools has been published by Gaschen et al. 2001. Briefly, the sequences are internally aligned, and their starting and ending positions are determined; then these positions are compared to the region of interest. This option is only available for HIV-1.
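The comparison of start and end positions against a region of interest can be pictured with the sketch below. It is not the HIV-MAP code. The function names are hypothetical, coordinates are assumed to be inclusive HXB2 positions, and the minimum fragment length is assumed to apply to the part of the sequence that overlaps the region. The "complete genome" category, which is defined by length rather than coordinates, is not covered.

```python
from typing import Tuple

# A few regions from the coordinate list on this page (HXB2 numbering).
REGIONS = {
    "Gag": (790, 2292),
    "V3": (7110, 7216),
    "Env CDS": (6225, 8795),
}


def overlap_length(seq_start: int, seq_end: int, region: Tuple[int, int]) -> int:
    """Number of positions shared by the sequence and the region (inclusive)."""
    region_start, region_end = region
    return max(0, min(seq_end, region_end) - max(seq_start, region_start) + 1)


def matches_region(
    seq_start: int,
    seq_end: int,
    region: Tuple[int, int],
    include_fragments: bool = False,
    min_fragment_length: int = 0,
) -> bool:
    """Decide whether a sequence is returned for a region search."""
    region_start, region_end = region
    # Default behavior: the sequence must span the entire region.
    if seq_start <= region_start and seq_end >= region_end:
        return True
    if not include_fragments:
        return False
    # "Include fragments": partial coverage is accepted (assumption: the
    # minimum length is measured on the overlapping part).
    return overlap_length(seq_start, seq_end, region) >= max(1, min_fragment_length)
```

For example, a sequence covering HXB2 positions 1000 to 2400 does not span Gag (790 to 2292) and is excluded by default, but it is returned when fragments of at least 500 bases are included.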

The following are the specific HXB2 coordinates of each of the fragments that can be downloaded from the search interface. The 3' LTR fragments in this list are not options on the search interface, but can be obtained by entering the coordinates. For additional information about HXB2 coordinates, see HIV-1 Genomic Map.

  Fragment                  HXB2 coordinates
  complete genome           any >7000 bp
  5' LTR                       1 - 633
  5' LTR R                   456 - 551
  5' LTR U3                    1 - 455
  5' LTR U5                  552 - 633
  TAR                        453 - 513
  Gag-Pol                    790 - 5096
  Gag                        790 - 2292
  p17 (matrix)               790 - 1185
  p24 (capsid)              1186 - 1878
  p7 (nucleocapsid)         1921 - 2085
  p6                        2134 - 2292
  Pol CDS                   2085 - 5096
  p51 (RT)                  2550 - 3870
  p15 (RNAse H)             3870 - 4229
  p31 (integrase)           4230 - 5096
  protease                  2253 - 2550
  Vif CDS                   5041 - 5619
  Vpr CDS                   5559 - 5850
  Tat CDS (plus intron)     5831 - 8469
  Tat exon 1                5831 - 6045
  Tat exon 2                8379 - 8469
  Rev CDS (plus intron)     5970 - 8653
  Rev exon 1                5970 - 6045
  Rev exon 2                8379 - 8653
  Vpu CDS                   6062 - 6310
  Env CDS                   6225 - 8795
  V1                        6615 - 6691
  V2                        6696 - 6811
  V3                        7110 - 7216
  V4                        7377 - 7477
  V5                        7602 - 7636
  RRE                       7710 - 8061
  gp41                      7758 - 8795
  gp120                     6225 - 7758
  Nef CDS                   8797 - 9417
  3' LTR                    9086 - 9719
  3' LTR R                  9541 - 9636
  3' LTR U3                 9086 - 9540
  3' LTR U5                 9637 - 9719

 

last modified: Thu Jan 17 11:03 2008


Questions or comments? Contact us at seq-info@lanl.gov.