To search for a range of accession numbers, use the format X12345..X23456.
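As an illustration of the range format, the sketch below tests whether an accession falls inside a query such as X12345..X23456. It assumes that an accession is a letter prefix followed by digits and that a range covers every number between two endpoints sharing the same prefix; the function names are hypothetical, and the database's own parser may behave differently.

```python
import re

# Assumed accession layout: a letter prefix followed by digits (e.g. X12345).
_ACCESSION = re.compile(r"^([A-Za-z_]+)(\d+)$")


def parse_range(query):
    """Split a query like 'X12345..X23456' into (prefix, low, high)."""
    start, sep, end = query.partition("..")
    if not sep:
        raise ValueError("expected the form START..END")
    first = _ACCESSION.match(start.strip())
    last = _ACCESSION.match(end.strip())
    if first is None or last is None:
        raise ValueError("unrecognized accession format")
    if first.group(1).upper() != last.group(1).upper():
        raise ValueError("range endpoints must share a prefix")
    low, high = int(first.group(2)), int(last.group(2))
    if low > high:
        raise ValueError("range start exceeds range end")
    return first.group(1).upper(), low, high


def in_range(accession, query):
    """True if the accession lies within the (inclusive) range query."""
    prefix, low, high = parse_range(query)
    m = _ACCESSION.match(accession.strip())
    if m is None or m.group(1).upper() != prefix:
        return False
    return low <= int(m.group(2)) <= high
```

For example, in_range("X20000", "X12345..X23456") is true under these assumptions, while an accession with a different prefix is rejected.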
You can also enter a list of space separated accessions, such as
This is usually the isolate or clone name for a sequence, and may be the way a sequence is referred to in publications. This field also searches the GenBank Locus Name field.
The length of the nucleotide sequence in base pairs.
The year in which the sample was obtained. If the year of sampling is not specified exactly by the authors, the data in the database may be a range of years. You can choose whether or not to include these data by using the "exact" checkbox.
Exact
When the "exact" box is checked, your search will return only sequences sampled in the exact year(s) specified. When unchecked, your search will return sequences sampled within any range of years that includes the year(s) specified.
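The effect of the "exact" checkbox can be sketched as a comparison between a query year and the recorded sampling range. This is an illustration only: it assumes each record stores a first and last possible sampling year (equal when the year is known exactly), and the function name is hypothetical.

```python
def year_matches(sample_first, sample_last, query_year, exact=True):
    """Compare a query year with a record's sampling-year range.

    sample_first and sample_last are the inclusive bounds of the recorded
    sampling year (assumed equal when the authors gave an exact year).
    """
    if exact:
        # Only records with a single, exactly known year can match.
        return sample_first == sample_last == query_year
    # Otherwise any recorded range that contains the query year matches.
    return sample_first <= query_year <= sample_last
```

Under this model, a sample recorded as 1994 to 1996 is excluded from an exact search for 1995 but returned when the box is unchecked.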
The country in which the sample was taken. We use 2-letter Country Codes.
The organism sequenced (e.g., HIV-1, HIV-2, SIV). The choice of virus will determine the choices available in the Subtype field.
In this field, you can search on multiple subtypes by clicking the ones you want. To select non-adjacent fields, use 'ctrl-click' instead of 'shift-click'. Note that if your search is limited to a specific genomic region, you may bring up some recombinant sequences that are not of the selected subtype in that region.
Include recombinants
Select the 'include recombinants' check box if you want recombinants of the chosen subtype(s). You must have one or more subtypes selected.
For more information on the subtype and CRF classifications, see:
Overview of HIV-1, HIV-2, and SIV subtype nomenclature
How the HIV database classifies sequence subtypes
Overview of primate immunodeficiency viruses lists SIV subtypes
Circulating Recombinant Forms lists currently recognized CRFs
HIV-1 M group nomenclature (1999)
A unique identifying number assigned sequentially to each sequence as it is imported into the LANL HIV Database.
The year in which the sequence was entered in GenBank.
This field is usually the same as the "sequence name".
The clone number of the sequence. This field is only used for sets of cloned samples.
The tissue from which the sample was derived. Categories include: plasma, PBMC, blood, brain, CSF, semen, cervix, feces, etc.
For samples derived from PBMC. Categories are: cultured, uncultured, primary, expanded, co-cultured.
Check box to select only sequences that were sampled prior to the patient receiving any drug treatment. Sequences are annotated as drug naive only when there is certainty that the patient has not been treated. If you want to select for sequences from drug-treated (non-naive) patients, use the Advanced Search.
When you search on "Comment", the comment fields in both the sequence entries and the patient records will be searched.
These fields are annotated based on biological data only, not based on presumed usage inferred from sequences.
For information about these fields, see articles:
Biological and Molecular Aspects of HIV-1 Coreceptor Usage
Coreceptor Use by Primate Lentiviruses
All genomic region searches are based on the program formerly called HIV-MAP, which allows you to obtain all sequences (or a subset of sequences) that contain a selected region. A full description of the HIV-MAP tools was published by Gaschen et al. 2001. Briefly, the sequences are internally aligned and their starting and ending positions are determined; these positions are then compared to the region of interest.
You can limit your search to a specific genomic region of the virus. For HIV-1, these regions are defined according to the coordinates below. Additional details about the HXB2 reference coordinates can be found on the HIV-1 Genomic Map. Note that the "complete genome" category yields all sequences over 7000 base pairs, regardless of exact coordinates. Thus, some sequences obtained from the "complete genome" region may lack parts of LTR, gag, and nef.
Fragment | HXB2 coordinates |
complete genome | any >7000 bp |
5' LTR | 1 - 633 |
5' LTR R | 456 - 551 |
5' LTR U3 | 1 - 455 |
5' LTR U5 | 552 - 633 |
TAR | 453 - 513 |
Gag-Pol | 790 - 5096 |
Gag | 790 - 2292 |
p17 (matrix) | 790 - 1185 |
p24 (capsid) | 1186 - 1881 |
p7 (nucleocapsid) | 1921 - 2085 |
p6 | 2134 - 2292 |
Pol CDS | 2085 - 5096 |
p51 (RT) | 2550 - 3870 |
p15 (RNAse H) | 3870 - 4229 |
p31 (integrase) | 4230 - 5096 |
protease | 2253 - 2550 |
Vif CDS | 5041 - 5619 |
Vpr CDS | 5559 - 5850 |
Tat CDS (plus intron) | 5831 - 8469 |
Tat exon 1 | 5831 - 6045 |
Tat exon 2 | 8379 - 8469 |
Rev CDS (plus intron) | 5970 - 8653 |
Rev exon 1 | 5970 - 6045 |
Rev exon 2 | 8379 - 8653 |
Vpu CDS | 6062 - 6310 |
Env CDS | 6225 - 8795 |
V1 | 6615 - 6691 |
V2 | 6696 - 6811 |
V3 | 7110 - 7216 |
V4 | 7377 - 7477 |
V5 | 7602 - 7636 |
RRE | 7710 - 8061 |
gp41 | 7758 - 8795 |
gp120 | 6225 - 7758 |
Nef CDS | 8797 - 9417 |
3' LTR | 9086 - 9719 |
3' LTR R | 9541 - 9636 |
3' LTR U3 | 9086 - 9540 |
3' LTR U5 | 9637 - 9719 |
You can also choose to search for sequences based on your own coordinates. The pulldown menu does not include the 3' LTR, but it can be obtained by specifying its coordinates.
By default, all sequences obtained from a genomic region search will span the entire region selected. If you want to include sequences that only partly cover the selected region, check the "Include fragments" box and enter a minimum length for the overlap of the included fragments. Do not use symbols such as > in this box.
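The difference between the default behavior and the "Include fragments" option can be sketched as a comparison of coordinate ranges. The code below is a simplified model, not the database's implementation: it assumes each sequence has already been assigned inclusive HXB2 start and end positions, and the function names are hypothetical.

```python
def overlap_length(seq_start, seq_end, region_start, region_end):
    """Number of positions shared by two inclusive coordinate ranges."""
    return max(0, min(seq_end, region_end) - max(seq_start, region_start) + 1)


def region_hit(seq_start, seq_end, region_start, region_end,
               include_fragments=False, min_overlap=1):
    """Decide whether a sequence is returned for a region search."""
    if not include_fragments:
        # Default: the sequence must span the entire selected region.
        return seq_start <= region_start and seq_end >= region_end
    # With fragments: partial coverage is enough if the overlap is long enough.
    overlap = overlap_length(seq_start, seq_end, region_start, region_end)
    return overlap >= min_overlap


# V3 coordinates taken from the table above.
V3 = (7110, 7216)
```

With these definitions, a sequence covering 7150 to 7300 is not returned for V3 by default, but region_hit(7150, 7300, *V3, include_fragments=True, min_overlap=50) accepts it because 67 positions overlap.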
When you enter a sequence alignment here, you will limit your database search to the genomic region of your input sequences. Your sequences must already be aligned, and it is helpful if they are all approximately the same length.
The publication ID is a unique number assigned to each publication by the HIV Database.
This search assumes an 'and', so if you search for 'smith jones' you will retrieve all sequences for which both Smith and Jones are in the author list. Do not include initials or first names. Author names are taken directly from GenBank; we do not correct mistakes in the sequence records.
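The implied "and" can be pictured as requiring every search term to appear in the author list. The sketch below assumes case-insensitive, whole-surname matching, which may not reflect how the database actually compares names; the function name is hypothetical.

```python
def author_query_matches(query, author_surnames):
    """True if every space-separated term appears among the surnames."""
    surnames = {name.lower() for name in author_surnames}
    # A space between terms is treated as 'and': all terms must be present.
    return all(term.lower() in surnames for term in query.split())
```

Here author_query_matches("smith jones", ["Smith", "Jones", "Lee"]) is true, while a record listing only Smith and Lee does not match.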
This field restricts your search to sequences from a published paper specified by its PubMed ID.
Search with any word or set of words from the Title of the paper or from the name of the Journal itself.
A unique number, assigned by this database, that links sequences from a single, unique patient.
The patient identifier is displayed in searches as a two-part number, for example "P1(19555)". The first part is the code name or number by which the patient is identified in publication(s). The second part is an internal number assigned by our database, the patient ID. A patient code such as "P1" may refer to more than one patient, but the sequence records associated with "19555" are specific to a single patient.
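When post-processing search output, the two-part identifier can be split into its patient code and patient ID. The sketch below assumes the display format is always the code followed by a numeric ID in parentheses, as in the example; the function name is hypothetical.

```python
import re

# Assumed display format: <patient code>(<numeric patient ID>)
_PATIENT = re.compile(r"^(?P<code>.*)\((?P<patient_id>\d+)\)$")


def split_patient_identifier(text):
    """Return (patient_code, patient_id) from a string like 'P1(19555)'."""
    m = _PATIENT.match(text.strip())
    if m is None:
        raise ValueError("not in the form CODE(ID): %r" % text)
    return m.group("code"), int(m.group("patient_id"))
```

Because only the numeric patient ID is unique, it is the safer key for grouping sequences by patient.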
Not all sequences in the database have an assigned patient code/id; if your search for a patient code fails, try entering the code in the "Sequence Name" field (sequence names often contain a patient code).
The search algorithm used for this field is different from the other fields, in that a space (for example, "Patient 2") is not interpreted as AND. Instead, the entire string, including spaces, is used as the search term.
The risk factor describes the risk activity by which the patient most likely was infected. Dual risk factors are not recorded. The risk factor must be established with reasonable certainty to be recorded in this field.
SG - homosexual
This is the year in which the patient was infected. The year is only recorded when it is known with some certainty.
When the checkbox is selected, you will get only sequences from patients with any known HLA data, and this HLA information will be displayed.
The HLA field can be searched for specific HLA types using the Advanced Search. To search the field, enter a space-separated list. Wildcard searches using * do not work because HLA data often contain the * character.
Categories are: M and F.
The number in this field indicates the estimated number of days between the patient's seroconversion and the date the sample was taken for sequencing. For samples taken before seroconversion, negative numbers are used. If the source data were given in weeks or months, these numbers have been converted to days.
Please note: Days from infection or seroconversion are almost always estimates, and different studies use different methods and definitions. In many studies these estimates are very rough. We have attempted to translate these values into a single system for study cross-comparisons, but please use these fields with caution; go back to the original papers to confirm the study-specific timing definitions.
In cases where studies give data that are vague, but possibly useful, a text entry may appear in this field. The following text entries may appear:
“Pre-seroconversion”: sample was taken before seroconversion, but the exact number of days is unknown.
“Early”: <1 year after seroconversion
“Late”: ≥2 years after seroconversion
“No data”: the same meaning as a blank field
The number in this field indicates an estimate of the number of days from the time the patient was infected with the virus until the sample was taken for sequencing. Post-infection dates are relatively rare. Most often they are known when a patient seeks medical treatment for acute illness shortly after having a sexual encounter with a stranger. We use this field when the primary author presents the data as post infection in the original citation. Please note: Days from infection or seroconversion are always estimates, and different studies use different methods and definitions.
If you are interested in sequences from a particular timepoint relative to infection or seroconversion, it may be wise to perform 2 separate searches of both the ‘days from seroconversion’ and ‘days from infection’ fields, as most data are recorded in one field or the other, not both. For example, if you want sequences that are either <90 days post-infection or <30 days post-seroconversion, two separate searches are needed.
If the patient was enrolled in a named project or cohort, it is recorded in this field.
The health status of the patient at the time of sampling. Categories are: acute infection, asymptomatic, symptomatic, AIDS, and deceased.
The age of the patient in integer days when the sample was taken. You can use y for year and m for month. For example, to select for sequences from patients under 18 years, enter either "<6575" or "<18y".
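The equivalence of "<18y" and "<6575" can be reproduced with a small conversion. The sketch assumes 365.25 days per year, one twelfth of that per month, and rounding up to whole days; these constants are chosen because they reproduce the example above and are not confirmed as the database's own. The function name is hypothetical.

```python
import math
import re

DAYS_PER_YEAR = 365.25                 # assumption: average year length
DAYS_PER_MONTH = DAYS_PER_YEAR / 12    # assumption: average month length

# Optional comparison sign, an integer, and an optional y/m unit suffix.
_AGE = re.compile(r"^([<>]?)\s*(\d+)\s*([ymYM]?)$")


def age_query_to_days(query):
    """Convert a query such as '<18y' into (operator, age in days)."""
    m = _AGE.match(query.strip())
    if m is None:
        raise ValueError("unrecognized age query: %r" % query)
    operator = m.group(1) or "="
    value = int(m.group(2))
    unit = m.group(3).lower()
    if unit == "y":
        days = math.ceil(value * DAYS_PER_YEAR)
    elif unit == "m":
        days = math.ceil(value * DAYS_PER_MONTH)
    else:
        days = value  # already expressed in days
    return operator, days
```

Under these assumptions, age_query_to_days("<18y") and age_query_to_days("<6575") both yield ("<", 6575).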
The plasma viral load in units of copies/ml of plasma. A viral load of "1" indicates that the viral load was below the limit of detection (usually <50 copies/ml).
The CD4+ T-cell count at the time of sampling, in absolute counts of cells/ul.
The CD8+ T-cell count at the time of sampling, in absolute counts of cells/ul.
This field records the rate of disease progression of the patient, if recorded by the study.
This field can limit your search to patients with multiple sequences in the database. For example, if you enter ">9" in this field, your output will include only sequences from patients with 10 or more sequences in the database. This option is particularly useful for searches of intrapatient sequence sets. For more options for intrapatient searches, see Intrapatient Search Interface.
A cluster is a group of two or more epidemiologically-linked patients. A cluster ID links two or more patient IDs in the database. Each cluster is assigned a name, which is not necessarily unique to a single cluster. (For example, there may be more than one "chain1" clusters in the database.) Clusters are assigned only when both the publication and the sequences themselves indicate epidemiological linkage of the patients.
The cluster transmission type describes the mode of transmission of the virus among all patients in the cluster. For example, clusters with Heterosexual transmission are pairs or chains of patients linked by heterosexual transmission of the virus. In many cases, a cluster will have more than one transmission type. For example, a cluster consisting of a heterosexual couple and their infected child would have both Heterosexual and Mother->Child transmission types.
The cluster comment field gives information about the cluster of epidemiologically-linked patients.
Fiebig stage is a staging system for early HIV infection. This field can be searched from the Advanced Search and the Intrapatient Search interfaces.
Fiebig stage | Duration in days (range) | Cumulative duration (range) |
Eclipse | 10 (7,21) | 10 (7,21) |
1 (vRNA+) | 7 (5,10) | 17 (13,28) |
2 (p24Ag+) | 5 (4,8) | 22 (18,34) |
3 (ELISA+) | 3 (2,5) | 25 (22,37) |
4 (Western Blot +/-) | 6 (4,8) | 31 (27,43) |
5 (Western Blot +, p31-) | 70 (40,122) | 101 (71,154) |
6 (Western Blot +, p31+) | open-ended |
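The cumulative column in the table is the running total of the mean stage durations, which the short sketch below reproduces. The approximate_stage helper is purely illustrative: it uses only the mean durations, ignores the wide ranges shown in parentheses, and is not a substitute for laboratory-based staging.

```python
from itertools import accumulate

# Mean stage durations in days, copied from the table above.
STAGES = [
    ("Eclipse", 10),
    ("1", 7),
    ("2", 5),
    ("3", 3),
    ("4", 6),
    ("5", 70),
]

# Running total of the mean durations: the table's cumulative column.
cumulative = list(accumulate(days for _, days in STAGES))


def approximate_stage(days_since_infection):
    """Map days since infection to a stage using mean durations only."""
    for (name, _), upper_bound in zip(STAGES, cumulative):
        if days_since_infection <= upper_bound:
            return name
    return "6"  # stage 6 is open-ended
```

The computed list is [10, 17, 22, 25, 31, 101], matching the mean values in the cumulative duration column.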
References for Fiebig staging system:
- Fiebig et al. 2003. Dynamics of HIV viremia and antibody seroconversion in plasma donors: implications for diagnosis and staging of primary HIV infection. AIDS 17(13): 1871-1879.
- Keele et al. 2008. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc Natl Acad Sci U S A 105(21): 7552-7557.
Although this is geographical information, the search field is located at the top under Sequence Information for convenience.
The city/province/state/region in which the sample was obtained.
This field records the country in which the patient was infected. This field is filled in only when the infection country differs from the sampling country and the infection country is known specifically and with high certainty.
If the infection city is known to be different than the city where the sample was collected, that is noted here. This field rarely contains data.
This is a way to retrieve all sequences from (for example) the African continent without having to search for each country separately.
For a list of countries included in each region, see:
Definitions of the HIV Database geographic regions
Restricts the search to sequences containing a specified amino acid motif, such as YCVHQRIEIKDTK. The output can be downloaded as nucleotides or amino acids.
Boolean searches with "and" and "or" work normally. As in other fields, a space means "and", so the query "KKE ESK" returns sequences that contain both KKE and ESK. In contrast, the query KKE*ESK returns sequences that contain both tripeptides only in the specified order (KKE 5' of ESK).
To search for a motif where amino acids are separated by an exact number of residues, use underscore. For example, if you want to find sequences with the HLA-A1 motif xxDExxxxxY, enter the query __DE_____Y
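The motif syntax described above maps naturally onto regular expressions. The sketch below is one possible translation, not the database's own code: it treats a space as "and", * as any stretch of residues, and each underscore as exactly one residue. Explicit "and"/"or" keywords are not handled, and the function names are hypothetical.

```python
import re


def term_to_regex(term):
    """Translate one motif term into a compiled regular expression."""
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(".*")   # any number of residues, order preserved
        elif ch == "_":
            parts.append(".")    # exactly one residue of any type
        else:
            parts.append(re.escape(ch.upper()))
    return re.compile("".join(parts))


def motif_matches(query, protein):
    """True if the protein satisfies every space-separated term."""
    protein = protein.upper()
    return all(term_to_regex(term).search(protein) for term in query.split())
```

For instance, the sequence AESKAAKKEA satisfies "KKE ESK" but not "KKE*ESK", because ESK precedes KKE in it.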
Restricts the motif search to a particular gene for which amino acid information is available.
This field marks sequences that users usually want to exclude from a retrieval. Our default excludes these sequences from searches, but users may choose to include them if desired. Our criteria are very conservative, so that we have few false positives. Thus there are some unlabeled sequences that are still problematic. You still need to check for problem sequences!
High content of non-ACTG characters, meeting one of the following criteria:
While direct sequences will naturally contain some IUPAC ambiguity characters, sequences annotated as N have such a high fraction that multiple alignment programs and other analysis programs have trouble with them. All incoming data are automatically screened.
Likely contamination with a laboratory strain. If a major part of a study set is contaminated, we may label the full set with C. In other cases, we choose a very conservative (high) level of similarity before we mark a particular sequence C. Different genomic regions have different standards; for example it is harder to detect potential contamination in pol than it is in env. In short sequences, it is particularly difficult to say with certainty if a sequence is a contaminant or not. Contaminants cannot be reliably annotated through an automatic screen, so many potential contaminant sequences are still unmarked.
We screen all incoming sequences for extreme cases of G->A hypermutation, and mark these sequences with H. Hypermutated sequences can carry substitutions not found in viable viruses, so such sequences alter phylogenetic tree branch lengths and complicate the determination of appropriate evolutionary models. For additional information about hypermutation, see the Hypermut Tool.
A synthetic sequence does not represent a naturally-occurring viral sequence. There are many ways that this can occur, including:
A sequence containing an artifactual deletion of >100 nucleotides. These sequences occur when an author puts together 2 sequences from a single sample (for example, a protease and an RT sequence), but omits some intervening sequence. Sequences that represent viruses with naturally-occurring deletions are not annotated in this category.
A tiny sequence (< 50 bp).
A sequence that was deposited as its reverse complement.
Percentage of non-ACGT characters in the nucleotide sequence.
Example: to restrict your search to sequences with less than 0.5% non-ACGT character content, enter <0.5 in the input box.
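To check a sequence of your own against such a threshold, the percentage can be computed as in the sketch below. It assumes that every character in the string, including gaps and IUPAC ambiguity codes, counts toward the total length; the database may handle gaps differently. The function name is hypothetical.

```python
def percent_non_acgt(sequence):
    """Percentage of characters in the sequence that are not A, C, G or T."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    # Count every character outside the four unambiguous bases.
    non_acgt = sum(1 for base in seq if base not in "ACGT")
    return 100.0 * non_acgt / len(seq)
```

A sequence passes the <0.5 filter in the example when percent_non_acgt(sequence) < 0.5.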