HIV Databases
HIV sequence database



HELP for the Search Interface

Tips

 


 

Sequence Information

Accession

To search for a range of accession numbers, use the format X12345..X23456.
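
As a rough illustration of the range format, the sketch below tests whether an accession falls inside a START..END query. The helper names are hypothetical, and the comparison rule (same letter prefix, numeric part compared as an integer) is an assumption, not the database's documented behavior.

```python
import re

# Hypothetical helpers; the database's real matching rules may differ.
_ACCESSION = re.compile(r"^([A-Za-z]+)(\d+)$")

def split_accession(acc):
    """Split an accession such as 'X12345' into ('X', 12345)."""
    m = _ACCESSION.match(acc)
    if m is None:
        raise ValueError(f"unrecognized accession: {acc!r}")
    return m.group(1).upper(), int(m.group(2))

def in_accession_range(acc, query):
    """Return True if acc falls inside a 'START..END' query (inclusive)."""
    start, end = query.split("..")
    prefix, number = split_accession(acc)
    start_prefix, start_number = split_accession(start)
    end_prefix, end_number = split_accession(end)
    # Assumption: a range is only meaningful within one letter prefix.
    if not (prefix == start_prefix == end_prefix):
        return False
    return start_number <= number <= end_number
```

For example, in_accession_range("X20000", "X12345..X23456") is true under these assumptions, while an accession with a different prefix is rejected.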

You can also enter a list of space-separated accessions.

Sequence name

This is usually the isolate or clone name for a sequence, and may be the way a sequence is referred to in publications. This field also searches the GenBank Locus Name field.

Sequence length

The length of the nucleotide sequence in base pairs.

Sampling year

The year in which the sample was obtained. If the year of sampling is not specified exactly by the authors, the data in the database may be a range of years. You can choose whether or not to include these data by using the "exact" checkbox.

Exact

When the "exact" box is checked, your search will return only sequences sampled in the exact year(s) specified. When unchecked, your search will return sequences sampled within any range of years that includes the year(s) specified.
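
The effect of the checkbox can be sketched as follows. This is a minimal illustration, assuming each record stores its sampling year as a (first_year, last_year) pair with equal values for an exactly known year; the function name and data layout are hypothetical.

```python
def year_matches(sample_range, query_year, exact):
    """Decide whether a record's sampling year satisfies the query.

    sample_range is (first_year, last_year); equal values mean the
    sampling year is known exactly (an assumed representation).
    """
    first, last = sample_range
    if exact:
        # Only records with a single, exactly known year qualify.
        return first == last == query_year
    # Otherwise any range that includes the query year qualifies;
    # an exactly known year is treated as a one-year range.
    return first <= query_year <= last
```

With this logic, a record sampled in 1994-1996 matches a query for 1995 only when "exact" is unchecked.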

Sampling country

The country in which the sample was taken. We use 2-letter Country Codes.

Virus

The organism sequenced (e.g., HIV-1, HIV-2, SIV). The choice of virus will determine the choices available in the Subtype field.

Subtype

In this field, you can search on multiple subtypes by clicking the ones you want. To select non-adjacent fields, use 'ctrl-click' instead of 'shift-click'. Note that if your search is limited to a specific genomic region, you may bring up some recombinant sequences that are not of the selected subtype in that region.

Include recombinants

Select the 'include recombinants' check box if you want recombinants of the chosen subtype(s). You must have one or more subtypes selected.

For more information on the subtype and CRF classifications, see:
Overview of HIV-1, HIV-2, and SIV subtype nomenclature
How the HIV database classifies sequence subtypes
Overview of primate immunodeficiency viruses lists SIV subtypes
Circulating Recombinant Forms lists currently recognized CRFs
HIV-1 M group nomenclature (1999)

 


 

More Sequence Information

SE id

A unique identifying number assigned sequentially to each sequence as it is imported into the LANL HIV Database.

GB Create Date (YYYY)

The year in which the sequence was entered in GenBank.

Isolate name

This field is usually the same as the "sequence name".

Clone name

The clone number of the sequence. This field is only used for sets of cloned samples.

Sample tissue

The tissue from which the sample was derived. Categories include: plasma, PBMC, blood, brain, CSF, semen, cervix, feces, etc.

Culture method

For samples derived from PBMC. Categories are: cultured, uncultured, primary, expanded, co-cultured.

Drug naive

Check box to select only sequences that were sampled prior to the patient receiving any drug treatment. Sequences are annotated as drug naive only when there is certainty that the patient has not been treated. If you want to select for sequences from drug-treated (non-naive) patients, use the Advanced Search.

Comment

When you search on "Comment", the comment fields in both the sequence entries and the patient records will be searched.

Coreceptor and phenotype

These fields are annotated based on biological data only, not based on presumed usage inferred from sequences. For information about these fields, see articles:
Biological and Molecular Aspects of HIV-1 Coreceptor Usage
Coreceptor Use by Primate Lentiviruses

 


 

Find all sequences for a specific gene or region

All genomic region searches are based on the program formerly called HIV-MAP, which allows you to obtain all sequences (or a subset of sequences) that contain a selected region. A full description of the HIV-MAP tools was published by Gaschen et al. 2001. Briefly, the sequences are internally aligned and their starting and ending positions are determined; these positions are then compared to the region of interest.

Genomic Region

You can limit your search to a specific genomic region of the virus. For HIV-1, these regions are defined according to the coordinates below. Additional details about the HXB2 reference coordinates can be found on the HIV-1 Genomic Map. Note that the "complete genome" category yields all sequences over 7000 base pairs, regardless of exact coordinates. Thus, some sequences obtained from the "complete genome" region may lack parts of LTR, gag, and nef.

  Fragment HXB2 coordinates  
  complete genome any >7000 bp  
  5' LTR    1 - 633  
  5' LTR R  456 - 551  
  5' LTR U3    1 - 455  
  5' LTR U5  552 - 633  
  TAR  453 - 513  
  Gag-Pol  790 - 5096  
  Gag  790 - 2292  
  p17 (matrix)  790 - 1185  
  p24 (capsid) 1186 - 1881  
  p7 (nucleocapsid) 1921 - 2085  
  p6 2134 - 2292  
  Pol CDS 2085 - 5096  
  p51 (RT) 2550 - 3870  
  p15 (RNAse H) 3870 - 4229  
  p31 (integrase) 4230 - 5096  
  protease 2253 - 2550  
  Vif CDS 5041 - 5619  
  Vpr CDS 5559 - 5850  
  Tat CDS (plus intron) 5831 - 8469  
  Tat exon 1 5831 - 6045  
  Tat exon 2 8379 - 8469  
  Rev CDS (plus intron) 5970 - 8653  
  Rev exon 1 5970 - 6045  
  Rev exon 2 8379 - 8653  
  Vpu CDS 6062 - 6310  
  Env CDS 6225 - 8795  
  V1 6615 - 6691  
  V2 6696 - 6811  
  V3 7110 - 7216  
  V4 7377 - 7477  
  V5 7602 - 7636  
  RRE 7710 - 8061  
  gp41 7758 - 8795  
  gp120 6225 - 7758  
  Nef CDS 8797 - 9417  
  3' LTR 9086 - 9719  
  3' LTR R 9541 - 9636  
  3' LTR U3 9086 - 9540  
  3' LTR U5 9637 - 9719  

Start/End Coordinates

You can also choose to search for sequences based on your own coordinates. The pulldown menu does not include the 3' LTR, but it can be obtained by specifying its coordinates.

Include fragments of minimum length __

By default, all sequences obtained from a genomic region search will span the entire region selected. If you want to include sequences that only partly cover the selected region, check the "Include fragments" box and enter a minimum length for the overlap of the included fragments. Do not use symbols such as > in this box.
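
The difference between a full-span match and a fragment match can be sketched as below. This is an illustration only, assuming inclusive HXB2 coordinates; the function and parameter names are hypothetical and do not describe the HIV-MAP implementation.

```python
def matches_region(seq_start, seq_end, region_start, region_end,
                   min_fragment=None):
    """Check a sequence's coordinates against a region of interest.

    All coordinates are treated as inclusive (an assumption).
    """
    if min_fragment is None:
        # Default behavior: the sequence must span the entire region.
        return seq_start <= region_start and seq_end >= region_end
    # "Include fragments": require a minimum overlap with the region.
    overlap = min(seq_end, region_end) - max(seq_start, region_start) + 1
    return overlap >= min_fragment
```

For V3 (7110 - 7216), a sequence covering 7150 - 7300 fails the default test but overlaps the region by 67 positions, so it would pass with a minimum fragment length of 50.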

 


 

Combine database sequences with your own sequence alignment

When you enter a sequence alignment here, you will limit your database search to the genomic region of your input sequences. Your sequences must already be aligned, and it is helpful if they are all approximately the same length.

Ragged ends

When you paste or upload an alignment, the interface normally defines the genomic region of your search by taking the genome coordinates of the first sequence in your alignment. If you choose "ragged ends", the interface will determine the coordinates of all the sequences in the alignment, and it will use the lowest 5' coordinate and the highest 3' coordinate. Choosing this option will prevent the interface from using the wrong coordinates in cases where your first sequence is shorter than the others. However, this option makes searches significantly slower.
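
The two ways of choosing the search region can be sketched as follows, assuming the genome coordinates of each aligned sequence are already known as (start, end) pairs. The function name is hypothetical.

```python
def search_region(coords, ragged_ends=False):
    """Pick the genomic region used for the database search.

    coords is a list of (start, end) genome coordinates, one pair per
    sequence in the alignment, in alignment order.
    """
    if not ragged_ends:
        # Default: use the coordinates of the first sequence only.
        return coords[0]
    # Ragged ends: lowest 5' coordinate and highest 3' coordinate.
    return (min(start for start, _ in coords),
            max(end for _, end in coords))
```

If the first sequence is shorter than the rest, the default region is too narrow, which is the case the "ragged ends" option is meant to handle.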

 


 

Publication Information

Publication ID

The publication ID is a unique number assigned to each publication by the HIV Database.

Author Last Name

This search assumes an 'and', so if you search for 'smith jones' you will retrieve all sequences for which both Smith and Jones are in the author list. Do not include initials or first names. Author names are taken directly from GenBank; we do not correct mistakes in the sequence records.

PubMed ID

This field restricts your search to sequences from a published paper specified by its PubMed ID.

Title and Journal

Search with any word or set of words from the Title of the paper or from the name of the Journal itself.

 


 

Patient Information

Patient id

A unique number, assigned by this database, that links sequences from a single, unique patient.

Patient code

The patient identifier is displayed in searches as a two-part number, for example "P1(19555)". The first part is the code name or number by which the patient is identified in publication(s). The second part is an internal number assigned by our database, the patient ID. A patient code such as "P1" may refer to more than one patient, but the sequence records associated with "19555" are specific to a single patient.
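
If you post-process search output, the two parts of the displayed identifier can be separated with a small helper like the one below. The function name is hypothetical, and the pattern assumes the identifier always ends in a parenthesized number.

```python
import re

# Assumed layout: <patient code>(<patient ID>), e.g. "P1(19555)".
_PATIENT = re.compile(r"^(?P<code>.+)\((?P<patient_id>\d+)\)$")

def split_patient_identifier(text):
    """Return (patient_code, patient_id) from a displayed identifier."""
    m = _PATIENT.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognized patient identifier: {text!r}")
    return m.group("code"), int(m.group("patient_id"))
```

Grouping records by the numeric patient ID rather than by the code avoids merging different patients who share a code such as "P1".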

Not all sequences in the database have an assigned patient code/id; if your search for a patient code fails, try entering the code in the "Sequence Name" field (sequence names often contain a patient code).

The search algorithm used for this field is different from the other fields, in that a space (for example, "Patient 2") is not interpreted as AND. Instead, the entire string, including spaces, is used as the search term.

Risk factor

The risk factor describes the risk activity by which the patient most likely was infected. Dual risk factors are not recorded. The risk factor must be established with reasonable certainty to be recorded in this field.

SG - homosexual
SB - bisexual
SM - male sex with male
SH - heterosexual
SW - sex worker
SU - sexual transmission, unspecified type
PH - hemophiliac
PB - Blood transfusion
PI - IV drug use
MB - Mother-baby
NO - Nosocomial
EX - Experimental
NR - not recorded (or unknown)
OT - other

Infection year

This is the year in which the patient was infected. The year is only recorded when it is known with some certainty.

HLA information

When the checkbox is selected, you will get only sequences from patients with any known HLA data, and this HLA information will be displayed.

The HLA field can be searched for specific HLA types using the Advanced Search. To search the field, enter a space separated list. Wildcard searches using * do not work because HLA data often contain the * character.

Patient sex

Categories are: M and F.

Days from seroconversion

The number in this field indicates the estimated number of days between the patient's seroconversion and the date the sample was taken for sequencing. For samples taken before seroconversion, negative numbers are used. If the source data were given in weeks or months, these numbers have been converted to days.

Please note: Days from infection or seroconversion are almost always estimates, and different studies use different methods and definitions. In many studies these estimates are very rough. We have attempted to translate these values into a single system for study cross-comparisons, but please use these fields with caution; go back to the original papers to confirm the study-specific timing definitions.

In cases where studies give data that are vague but possibly useful, a text entry may appear in this field. The following text entries may appear:

“Pre-seroconversion”: sample was taken before seroconversion, but the exact number of days is unknown.
“Early”: <1 year after seroconversion
“Late”: ≥2 years after seroconversion
“No data”: the same meaning as a blank field

Days from infection

The number in this field indicates an estimate of the number of days from the time the patient was infected with the virus until the sample was taken for sequencing. Post-infection dates are relatively rare. Most often they are known when a patient seeks medical treatment for acute illness shortly after having a sexual encounter with a stranger. We use this field when the primary author presents the data as post infection in the original citation. Please note: Days from infection or seroconversion are always estimates, and different studies use different methods and definitions.

If you are interested in sequences from a particular timepoint relative to infection or seroconversion, it may be wise to perform 2 separate searches of both the ‘days from seroconversion’ and ‘days from infection’ fields, as most data are recorded in one field or the other, not both. For example, if you want sequences that are either <90 days post-infection or <30 days post-seroconversion, two separate searches are needed.

Project

If the patient was enrolled in a named project or cohort, it is recorded in this field.

Patient health

The health status of the patient at the time of sampling. Categories are: acute infection, asymptomatic, symptomatic, AIDS, and deceased.

 


 

More Patient Information

Patient age

The age of the patient in integer days when the sample was taken. You can use y for year and m for month. For example, to select for sequences from patients under 18 years, enter either "<6575" or "<18y".
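
The unit suffixes can be read as a conversion to days, sketched below. The conversion factors (365.25 days per year, one twelfth of that per month) are assumptions chosen to agree with the "<18y" example; the database's exact rounding is not documented here, and the parser name is hypothetical.

```python
import re

DAYS_PER_YEAR = 365.25               # assumption, not from the database
DAYS_PER_MONTH = DAYS_PER_YEAR / 12  # assumption

_AGE = re.compile(r"^([<>])?\s*(\d+)\s*([ym])?$")

def parse_age_query(text):
    """Return (operator, days) for a query such as '<18y' or '<6575'."""
    m = _AGE.match(text.strip().lower())
    if m is None:
        raise ValueError(f"unrecognized age query: {text!r}")
    op, number, unit = m.groups()
    factor = {"y": DAYS_PER_YEAR, "m": DAYS_PER_MONTH, None: 1}[unit]
    # A missing operator is treated as an equality search.
    return op or "=", int(number) * factor
```

Under these assumptions "<18y" converts to 6574.5 days, which is within a day of the "<6575" form given above.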

Viral load

The plasma viral load in units of copies/ml of plasma. A viral load of "1" indicates that the viral load was below the limit of detection (usually <50 copies/ml).

CD4 count

The CD4+ T-cell count at the time of sampling, in absolute counts of cells/ul.

CD8 count

The CD8+ T-cell count at the time of sampling, in absolute counts of cells/ul.

Progression

This field records the rate of disease progression of the patient, if recorded by the study.

# patient sequences

This field can limit your search to patients with multiple sequences in the database. For example, if you enter ">9" in this field, your output will include only sequences from patients with 10 or more sequences in the database. This option is particularly useful for searches of intrapatient sequence sets. For more options for intrapatient searches, see Intrapatient Search Interface.
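
The same kind of filter can be applied to downloaded results. The sketch below assumes each record is an (accession, patient ID) pair; the record layout, the placeholder accessions, and the function name are hypothetical.

```python
from collections import Counter

def sequences_from_sampled_patients(records, minimum):
    """Keep sequences whose patient has at least `minimum` sequences.

    records is a list of (accession, patient_id) pairs.
    """
    counts = Counter(patient_id for _, patient_id in records)
    return [acc for acc, patient_id in records
            if counts[patient_id] >= minimum]

# Placeholder data for illustration only.
records = [("seqA", 1), ("seqB", 1), ("seqC", 2)]
kept = sequences_from_sampled_patients(records, minimum=2)
```

A query of ">9" in the search field corresponds to minimum=10 here.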

Cluster name

A cluster is a group of two or more epidemiologically-linked patients. A cluster ID links two or more patient IDs in the database. Each cluster is assigned a name, which is not necessarily unique to a single cluster. (For example, there may be more than one "chain1" cluster in the database.) Clusters are assigned only when both the publication and the sequences themselves indicate epidemiological linkage of the patients.

Cluster transmission type

The cluster transmission type describes the mode of transmission of the virus among all patients in the cluster. For example, clusters with Heterosexual transmission are pairs or chains of patients linked by heterosexual transmission of the virus. In many cases, a cluster will have more than one transmission type. For example, a cluster consisting of a heterosexual couple and their infected child would have both Heterosexual and Mother->Child transmission types.

Cluster comment

The cluster comment field gives information about the cluster of epidemiologically-linked patients.

Fiebig stage

Fiebig stage is a staging system for early HIV infection. This field can be searched from the Advanced Search and the Intrapatient Search interfaces.

  Fiebig stage               Duration in days (range)   Cumulative duration (range)
  Eclipse                    10 (7,21)                  10 (7,21)
  1 (vRNA+)                  7 (5,10)                   17 (13,28)
  2 (p24Ag+)                 5 (4,8)                    22 (18,34)
  3 (ELISA+)                 3 (2,5)                    25 (22,37)
  4 (Western Blot +/-)       6 (4,8)                    31 (27,43)
  5 (Western Blot +, p31-)   70 (40,122)                101 (71,154)
  6 (Western Blot +, p31+)   open-ended
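
The central values in the cumulative column are running sums of the per-stage durations, which the short check below reproduces. The variable names are illustrative. The ranges in parentheses are not simple sums of the per-stage ranges, so they are not recomputed here.

```python
from itertools import accumulate

# Central duration estimates (days) for Eclipse and stages 1 to 5,
# copied from the table; stage 6 is open-ended and has no duration.
stages = ["Eclipse", "1", "2", "3", "4", "5"]
durations = [10, 7, 5, 3, 6, 70]

# Running total of days at the end of each stage.
cumulative = list(accumulate(durations))
by_stage = dict(zip(stages, cumulative))
```

Here by_stage["5"] is 101, the cumulative duration listed for stage 5.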

References for Fiebig staging system:

  • Fiebig et al. 2003. Dynamics of HIV viremia and antibody seroconversion in plasma donors: implications for diagnosis and staging of primary HIV infection. AIDS 17(13): 1871-1879.
  • Keele et al. 2008. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc Natl Acad Sci U S A. 105(21):7552-7.

 


 

Geographical Information

Sampling country

Although this is geographical information, the search field is located at the top under Sequence Information for convenience.

Sampling city

The city/province/state/region in which the sample was obtained.

Infection country

This field records the country in which the patient was infected. This field is filled in only when the infection country differs from the sampling country and the infection country is known specifically and with high certainty.

Infection city

If the infection city is known to be different than the city where the sample was collected, that is noted here. This field rarely contains data.

Geographic region

This is a way to retrieve all sequences from (for example) the African continent without having to search for each country separately. For a list of countries included in each region, see:
Definitions of the HIV Database geographic regions

 


 

Amino Acid Motif Search

Restricts the search to sequences containing a specified amino acid motif, such as YCVHQRIEIKDTK. The output can be downloaded as nucleotides or amino acids.

Boolean searches with "and" and "or" work normally. As in other fields, a space means "and". So the query "KKE ESK" returns sequences that contain both KKE and ESK. In contrast, the query KKE*ESK will return sequences that contain both tripeptides, but in the specified order (KKE 5' of ESK).

To search for a motif where amino acids are separated by an exact number of residues, use underscore. For example, if you want to find sequences with the HLA-A1 motif xxDExxxxxY, enter the query __DE_____Y
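
To make the query syntax concrete, the sketch below translates it into regular expressions: a space means "and", * allows any gap while preserving order, and each underscore stands for exactly one residue. This is an illustration of the rules described above, not the database's implementation; the function names are hypothetical, explicit "and"/"or" keywords are not handled, and sequences are assumed to be uppercase.

```python
import re

def term_to_regex(term):
    """Translate one query term into a compiled regular expression."""
    parts = []
    for ch in term:
        if ch == "_":
            parts.append(".")      # exactly one residue of any kind
        elif ch == "*":
            parts.append(".*")     # any gap, order preserved
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def motif_matches(query, protein):
    """Return True if the protein satisfies every space-separated term."""
    return all(term_to_regex(term).search(protein)
               for term in query.split())
```

Under this translation, "KKE ESK" matches a protein containing the two tripeptides in either order, while "KKE*ESK" matches only when KKE comes first.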

Gene

Restricts the motif search to a particular gene for which amino acid information is available.

 


 

Output Options

Problematic sequences

This field marks sequences that users usually want to exclude from a retrieval. Our default excludes these sequences from searches, but users may choose to include them if desired. Our criteria are very conservative, so that we have few false positives. Thus there are some unlabeled sequences that are still problematic. You still need to check for problem sequences!

  1. N: Non-ACTG characters

    High content of non-ACTG characters, meeting one of the following criteria:

    • more than 100 consecutive non-ACTG characters
    • >7% non-ACTG characters for sequences of length <1000
    • >5% non-ACTG characters for sequences of length 1000-2999
    • >3% non-ACTG characters for sequences of length 3000 or above.

    While direct sequences will naturally contain some IUPAC ambiguity characters, sequences annotated as N have such a high fraction that multiple alignment programs and other analysis programs have trouble with them. All incoming data are automatically screened.

  2. C: Contaminant

    Likely contamination with a laboratory strain. If a major part of a study set is contaminated, we may label the full set with C. In other cases, we choose a very conservative (high) level of similarity before we mark a particular sequence C. Different genomic regions have different standards; for example it is harder to detect potential contamination in pol than it is in env. In short sequences, it is particularly difficult to say with certainty if a sequence is a contaminant or not. Contaminants cannot be reliably annotated through an automatic screen, so many potential contaminant sequences are still unmarked.

  3. H: Hypermutant

    We screen all incoming sequences for extreme cases of G->A hypermutation, and mark these sequences with H. Hypermutated sequences can carry substitutions not found in viable viruses, so such sequences alter phylogenetic tree branch lengths and complicate the determination of appropriate evolutionary models. For additional information about hypermutation, see the Hypermut Tool.

  4. S: Synthetic

    A synthetic sequence does not represent a naturally-occurring viral sequence. There are many ways that this can occur, including:

    • spliced sequences containing non-HIV/SIV components
    • sequences altered to change codon usage
    • patent sequences for which we cannot determine their origin
    • sequences where the author has accidentally concatenated two sequences into one
    • sequences where the author has accidentally produced a DNA sequence reverse-translated from protein

  5. D: Deletion

    A sequence containing an artifactual deletion of >100 nucleotides. These sequences occur when an author puts together 2 sequences from a single sample (for example, a protease and an RT sequence), but omits some intervening sequence. Sequences that represent viruses with naturally-occurring deletions are not annotated in this category.

  6. T: Tiny

    A tiny sequence (< 50 bp).

  7. R: Reverse complement

    A sequence that was deposited as its reverse complement.

% non-ACGT

Percentage of non-ACGT characters in the nucleotide sequence.
Example: to restrict your search to sequences with less than 0.5% non-ACGT character content use <0.5 in the input box.
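
The percentage, and the thresholds listed under category N above, can be reproduced for your own sequences with a sketch like the following. It assumes ungapped nucleotide strings and applies the stated cutoffs literally; the function names are hypothetical and the database's own screening code may differ in detail.

```python
import re

def percent_non_acgt(seq):
    """Percentage of characters other than A, C, G, T in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    non_acgt = sum(1 for base in seq if base not in "ACGT")
    return 100.0 * non_acgt / len(seq)

def meets_n_criteria(seq):
    """Apply the category N thresholds as stated in the list above."""
    seq = seq.upper()
    # More than 100 consecutive non-ACGT characters.
    if re.search(r"[^ACGT]{101,}", seq):
        return True
    length = len(seq)
    if length < 1000:
        limit = 7.0
    elif length < 3000:
        limit = 5.0
    else:
        limit = 3.0
    return percent_non_acgt(seq) > limit
```

A sequence that passes this check can still be problematic for other reasons, so it does not replace manual inspection.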

 

last modified: Thu Apr 19 14:04 2012


Questions or comments? Contact us at seq-info@lanl.gov.