HIV sequence database

HELP for the Search Interface

Tips

Queries are case insensitive.
Most fields accept the following characters: A-Z, a-z, 0-9, *, %, =, !, <, >, (, ), [, ], ., -, and _
Most fields accept the following operators: == (exact match), <, >, <=, >=, .. (range), and != (not equal)
To search for a range of values, do not use the dash (-). Use ".." or < > operators.
examples: "100..200" or ">=100 AND <=200"
You can use "and" and "or" between groups of characters.
example: "ab OR ac"
Note: these operators act on individual entries, not on the results as a whole. For example, a search that includes Sampling Country "US and CA" will produce no sequences, because no sequence has both sampling countries.
Spaces between groups of characters are interpreted as "and".
example: "ab cd ef" is equivalent to "ab AND cd AND ef"
You can use ( ) in alpha-numeric fields for logical grouping.
example: "(a1b OR c2d) AND ef"
You can use square brackets [ ] in alpha-numeric fields to enclose ranges or sets, such as [a-f] or [abcdef]
example: AF220[6-7] searches for accessions in the AF2206* to AF2207* range
The wildcard * matches 0 or more characters.
Entering * by itself in most fields will display the available data without limiting the search.
To select multiple non-adjacent options from a list, PC users need to hold down the CTRL key. For most other browser/platform combinations, either shift-click or Command (Apple)-click will do this.
For information about the structure of the database, see Data Dictionary.
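
As a rough illustration of the wildcard and square-bracket tips above, the following Python sketch translates a single search term into a regular expression. The function name term_to_regex is hypothetical, and the translation is only an assumption about how such a term could be interpreted; it is not the database's actual implementation.

```python
import re


def term_to_regex(term: str) -> re.Pattern:
    """Translate one search term into a case-insensitive regex (illustrative only)."""
    parts = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "*":
            # The wildcard * matches 0 or more characters.
            parts.append(".*")
            i += 1
        elif ch == "[":
            # Square brackets enclose a range or set, e.g. [a-f] or [abcdef].
            end = term.find("]", i)
            if end == -1:
                raise ValueError("unclosed '[' in search term")
            parts.append("[" + term[i + 1:end] + "]")
            i = end + 1
        else:
            # Every other character is matched literally.
            parts.append(re.escape(ch))
            i += 1
    # Queries are case insensitive.
    return re.compile("".join(parts), re.IGNORECASE)


pattern = term_to_regex("AF220[6-7]*")
print(bool(pattern.fullmatch("AF22071")))  # accession in the AF2207* group
print(bool(pattern.fullmatch("AF22081")))  # accession outside the bracketed range
```

The sketch uses a full-string match and an explicit trailing *, so it makes no claim about whether the real interface matches prefixes implicitly.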

Sequence Information

Accession

To search for a range of accession numbers, use the format X12345..X23456.

You can also enter a list of space-separated accessions.

Sequence name

This is usually the isolate or clone name for a sequence, and may be the way a sequence is referred to in publications. This field also searches the GenBank Locus Name field.

Sequence length

The length of the nucleotide sequence in base pairs.

Sampling year

The year in which the sample was obtained. If the year of sampling is not specified exactly by the authors, the data in the database may be a range of years. You can choose whether or not to include these data by using the "exact" checkbox.

Exact
When the "exact" box is checked, your search will return only sequences sampled in the exact year(s) specified. When unchecked, your search will return sequences sampled within any range of years that includes the year(s) specified.
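
The effect of the "exact" checkbox can be sketched as follows. This is a minimal Python illustration, assuming that each record stores its sampling year as an inclusive (start, end) range, with start equal to end when the year is known exactly; the function name and this storage model are assumptions, not a description of the database internals.

```python
def year_matches(sample_start: int, sample_end: int, query_year: int, exact: bool) -> bool:
    """Return True if a record's sampling-year range satisfies the query."""
    if exact:
        # Exact: the record must have a single, exactly known year.
        return sample_start == sample_end == query_year
    # Not exact: any range of years that includes the query year qualifies.
    return sample_start <= query_year <= sample_end


print(year_matches(1995, 1995, 1995, exact=True))   # exactly known year
print(year_matches(1994, 1996, 1995, exact=True))   # range, excluded when exact
print(year_matches(1994, 1996, 1995, exact=False))  # range, included when not exact
```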

Sampling country

The country in which the sample was taken. We use 2-letter Country Codes.

Virus

The organism sequenced (e.g., HIV-1, HIV-2, SIV). The choice of virus will determine the choices available in the Subtype field.

Subtype

In this field, you can search on multiple subtypes by clicking the ones you want. To select non-adjacent fields, use 'ctrl-click' instead of 'shift-click'. Note that if your search is limited to a specific genomic region, you may bring up some recombinant sequences that are not of the selected subtype in that region.

Include recombinants
Select the 'include recombinants' check box if you want recombinants of the chosen subtype(s). You must have one or more subtypes selected.

For more information on the subtype and CRF classifications, see:
Overview of HIV-1, HIV-2, and SIV subtype nomenclature
How the HIV database classifies sequence subtypes
Overview of primate immunodeficiency viruses lists SIV subtypes
Circulating Recombinant Forms lists currently recognized CRFs
HIV-1 M group nomenclature (1999)

More Sequence Information

SE id

A unique identifying number assigned sequentially to each sequence as it is imported into the LANL HIV Database.

GB Create Date (YYYY)

The year in which the sequence was entered in GenBank.

Isolate name

This field is usually the same as the "sequence name".

Clone name

The clone number of the sequence. This field is only used for sets of cloned samples.

Sample tissue

The tissue from which the sample was derived. Categories include: plasma, PBMC, blood, brain, CSF, semen, cervix, feces, etc.

Culture method

For samples derived from PBMC. Categories are: cultured, uncultured, primary, expanded, co-cultured.

Drug naive

Check box to select only sequences that were sampled prior to the patient receiving any drug treatment. Sequences are annotated as drug naive only when there is certainty that the patient has not been treated. If you want to select for sequences from drug-treated (non-naive) patients, use the Advanced Search.

Comment

When you search on "Comment", the comment fields in both the sequence entries and the patient records will be searched.

Coreceptor and phenotype

These fields are annotated based on biological data only, not based on presumed usage inferred from sequences. For information about these fields, see articles:
Biological and Molecular Aspects of HIV-1 Coreceptor Usage
Coreceptor Use by Primate Lentiviruses

Find all sequences for a specific gene or region

All genomic region searches are based on the program formerly called HIV-MAP, which allows you to obtain all sequences (or a subset of sequences) that contain a selected region. A full description of the HIV-MAP tools was published by Gaschen et al. 2001. Briefly, the sequences are internally aligned, and their starting and ending positions are determined; then these positions are compared to the region of interest.

Genomic Region

You can limit your search to a specific genomic region of the virus. For HIV-1, these regions are defined according to the coordinates below. Additional details about the HXB2 reference coordinates can be found on the HIV-1 Genomic Map. Note that the "complete genome" category yields all sequences over 7000 base pairs, regardless of exact coordinates. Thus, some sequences obtained from the "complete genome" region may lack parts of LTR, gag, and nef.

	Fragment	HXB2 coordinates
	complete genome	any >7000 bp
	5' LTR	1 - 633
	5' LTR R	456 - 551
	5' LTR U3	1 - 455
	5' LTR U5	552 - 633
	TAR	453 - 513
	Gag-Pol	790 - 5096
	Gag	790 - 2292
	p17 (matrix)	790 - 1185
	p24 (capsid)	1186 - 1881
	p7 (nucleocapsid)	1921 - 2085
	p6	2134 - 2292
	Pol CDS	2085 - 5096
	p51 (RT)	2550 - 3870
	p15 (RNAse H)	3870 - 4229
	p31 (integrase)	4230 - 5096
	protease	2253 - 2550
	Vif CDS	5041 - 5619
	Vpr CDS	5559 - 5850
	Tat CDS (plus intron)	5831 - 8469
	Tat exon 1	5831 - 6045
	Tat exon 2	8379 - 8469
	Rev CDS (plus intron)	5970 - 8653
	Rev exon 1	5970 - 6045
	Rev exon 2	8379 - 8653
	Vpu CDS	6062 - 6310
	Env CDS	6225 - 8795
	V1	6615 - 6691
	V2	6696 - 6811
	V3	7110 - 7216
	V4	7377 - 7477
	V5	7602 - 7636
	RRE	7710 - 8061
	gp41	7758 - 8795
	gp120	6225 - 7758
	Nef CDS	8797 - 9417
	3' LTR	9086 - 9719
	3' LTR R	9541 - 9636
	3' LTR U3	9086 - 9540
	3' LTR U5	9637 - 9719

Start/End Coordinates

You can also choose to search for sequences based on your own coordinates. The pulldown menu does not include the 3' LTR, but it can be obtained by specifying its coordinates.

Include fragments of minimum length __

By default, all sequences obtained from a genomic region search will span the entire region selected. If you want to include sequences that only partly cover the selected region, check the "Include fragments" box and enter a minimum length for the overlap of the included fragments. Do not use symbols such as > in this box.
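
The difference between the default behavior and the "Include fragments" option can be sketched in Python as below. The function name is hypothetical, and the sketch assumes 1-based, inclusive HXB2 coordinates; the interface's own overlap calculation may differ in detail.

```python
from typing import Optional


def covers_region(seq_start: int, seq_end: int,
                  region_start: int, region_end: int,
                  min_fragment: Optional[int] = None) -> bool:
    """Decide whether a sequence qualifies for a genomic region search."""
    if min_fragment is None:
        # Default: the sequence must span the entire selected region.
        return seq_start <= region_start and seq_end >= region_end
    # "Include fragments": the overlap with the region must reach the minimum length.
    overlap = min(seq_end, region_end) - max(seq_start, region_start) + 1
    return overlap >= min_fragment


V3 = (7110, 7216)  # HXB2 coordinates of V3 from the table above
print(covers_region(7000, 7300, *V3))                   # spans all of V3
print(covers_region(7150, 7300, *V3))                   # partial, rejected by default
print(covers_region(7150, 7300, *V3, min_fragment=50))  # 67 positions overlap
```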

Combine database sequences with your own sequence alignment

When you enter a sequence alignment here, you will limit your database search to the genomic region of your input sequences. Your sequences must already be aligned, and it is helpful if they are all approximately the same length.

Ragged ends

When you paste or upload an alignment, the interface normally defines the genomic region of your search by taking the genome coordinates of the first sequence in your alignment. If you choose "ragged ends", the interface will determine the coordinates of all the sequences in the alignment, and it will use the lowest 5' coordinate and the highest 3' coordinate. Choosing this option will prevent the interface from using the wrong coordinates in cases where your first sequence is shorter than the others. However, this option makes searches significantly slower.
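
The two ways of deriving the search region from an alignment can be sketched as follows. This Python example assumes the genome coordinates of each input sequence are already known as (start, end) pairs; the function name and input shape are hypothetical.

```python
from typing import List, Tuple


def search_region(coords: List[Tuple[int, int]], ragged_ends: bool = False) -> Tuple[int, int]:
    """Pick the genomic region to search, given per-sequence (start, end) coordinates."""
    if not coords:
        raise ValueError("alignment contains no sequences")
    if not ragged_ends:
        # Default: use the coordinates of the first sequence only.
        return coords[0]
    # Ragged ends: lowest 5' coordinate and highest 3' coordinate over all sequences.
    return (min(start for start, _ in coords), max(end for _, end in coords))


alignment = [(7000, 7100), (6900, 7200), (6950, 7150)]
print(search_region(alignment))                    # (7000, 7100)
print(search_region(alignment, ragged_ends=True))  # (6900, 7200)
```

Scanning every sequence rather than only the first is consistent with the note above that the ragged ends option is slower.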

Publication Information

Publication ID

The publication ID is a unique number assigned to each publication by the HIV Database.

Author Last Name

This search assumes an 'and', so if you search for 'smith jones' you will retrieve all sequences for which both Smith and Jones are in the author list. Do not include initials or first names. Author names are taken directly from GenBank; we do not correct mistakes in the sequence records.

PubMed ID

This field restricts your search to sequences from a published paper specified by its PubMed ID.

Title and Journal

Search with any word or set of words from the Title of the paper or from the name of the Journal itself.

Patient Information

Patient id

A unique number, assigned by this database, that links sequences from a single, unique patient.

Patient code

The patient identifier is displayed in searches as a two-part number, for example "P1(19555)". The first part is the code name or number by which the patient is identified in publication(s). The second part is an internal number assigned by our database, the patient ID. A patient code such as "P1" may refer to more than one patient, but the sequence records associated with "19555" are specific to a single patient.

Not all sequences in the database have an assigned patient code/id; if your search for a patient code fails, try entering the code in the "Sequence Name" field (sequence names often contain a patient code).

The search algorithm used for this field is different from the other fields, in that a space (for example, "Patient 2") is not interpreted as AND. Instead, the entire string, including spaces, is used as the search term.
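
If you post-process search output, the two-part identifier described above can be split with a short Python sketch like the one below. The function name is hypothetical, and the sketch assumes the displayed form is always the code followed by the numeric patient ID in parentheses.

```python
import re
from typing import Tuple

# Assumed display form: <patient code>(<numeric patient ID>), e.g. "P1(19555)".
_PATIENT_RE = re.compile(r"^(?P<code>.*)\((?P<pid>\d+)\)$")


def split_patient_identifier(text: str) -> Tuple[str, int]:
    """Split a displayed identifier into (patient code, patient ID)."""
    match = _PATIENT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a two-part patient identifier: {text!r}")
    # The code is kept whole, including any spaces it contains.
    return match.group("code"), int(match.group("pid"))


print(split_patient_identifier("P1(19555)"))  # ('P1', 19555)
```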

Risk factor

The risk factor describes the risk activity by which the patient most likely was infected. Dual risk factors are not recorded. The risk factor must be established with reasonable certainty to be recorded in this field.

SG - homosexual
SB - bisexual
SM - male sex with male
SH - heterosexual
SW - sex worker
SU - sexual transmission, unspecified type
PH - hemophiliac
PB - Blood transfusion
PI - IV drug use
MB - Mother-baby
NO - Nosocomial
EX - Experimental
NR - not recorded (or unknown)
OT - other

Infection year

This is the year in which the patient was infected. The year is only recorded when it is known with some certainty.

HLA information

When the checkbox is selected, you will get only sequences from patients with any known HLA data, and this HLA information will be displayed.

The HLA field can be searched for specific HLA types using the Advanced Search. To search the field, enter a space separated list. Wildcard searches using * do not work because HLA data often contain the * character.

Patient sex

Categories are: M and F.

Days from seroconversion

The number in this field indicates the estimated number of days between the patient's seroconversion and the date the sample was taken for sequencing. For samples taken before seroconversion, negative numbers are used. If the source data were given in weeks or months, these numbers have been converted to days.

Please note: Days from infection or seroconversion are almost always estimates, and different studies use different methods and definitions. In many studies these estimates are very rough. We have attempted to translate these values into a single system for study cross-comparisons, but please use these fields with caution; go back to the original papers to confirm the study-specific timing definitions.

In cases where studies give data that are vague but possibly useful, a text entry may appear in this field. The following text entries may appear:

“Pre-seroconversion”: sample was taken before seroconversion, but the exact number of days is unknown.
“Early”: <1 year after seroconversion
“Late”: ≥2 years after seroconversion
“No data”: the same meaning as a blank field

Days from infection

The number in this field indicates an estimate of the number of days from the time the patient was infected with the virus until the sample was taken for sequencing. Post-infection dates are relatively rare. Most often they are known when a patient seeks medical treatment for acute illness shortly after having a sexual encounter with a stranger. We use this field when the primary author presents the data as post infection in the original citation. Please note: Days from infection or seroconversion are always estimates, and different studies use different methods and definitions.

If you are interested in sequences from a particular timepoint relative to infection or seroconversion, it may be wise to perform 2 separate searches of both the ‘days from seroconversion’ and ‘days from infection’ fields, as most data are recorded in one field or the other, not both. For example, if you want sequences that are either <90 days post-infection or <30 days post-seroconversion, two separate searches are needed.

Project

If the patient was enrolled in a named project or cohort, it is recorded in this field.

Patient health

The health status of the patient at the time of sampling. Categories are: acute infection, asymptomatic, symptomatic, AIDS, and deceased.

More Patient Information

Patient age

The age of the patient in integer days when the sample was taken. You can use y for year and m for month. For example, to select for sequences from patients under 18 years, enter either "<6575" or "<18y".
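
The equivalence of "<6575" and "<18y" can be illustrated with the Python sketch below. The conversion factor is an assumption: 365.25 days per year, rounded up, is chosen only because it reproduces the 18-year example above, and the month conversion (one twelfth of a year) is likewise assumed. The sketch also assumes that a bare number means an exact match. None of this is confirmed as the database's own rule.

```python
import math
import re
from typing import Tuple

DAYS_PER_YEAR = 365.25  # assumption chosen so that 18y corresponds to 6575 days

_AGE_RE = re.compile(r"^(<=|>=|<|>|==|!=)?\s*(\d+)\s*([ym]?)$", re.IGNORECASE)


def age_query_to_days(query: str) -> Tuple[str, int]:
    """Convert an age query such as '<18y' into (operator, age in integer days)."""
    match = _AGE_RE.match(query.strip())
    if match is None:
        raise ValueError(f"unrecognized age query: {query!r}")
    operator = match.group(1) or "=="  # assumed default: exact match
    value = int(match.group(2))
    unit = match.group(3).lower()
    if unit == "y":
        days = math.ceil(value * DAYS_PER_YEAR)
    elif unit == "m":
        days = math.ceil(value * DAYS_PER_YEAR / 12)
    else:
        days = value  # already in days
    return operator, days


print(age_query_to_days("<18y"))   # ('<', 6575)
print(age_query_to_days("<6575"))  # ('<', 6575)
```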

Viral load

The plasma viral load in units of copies/ml of plasma. A viral load of "1" indicates that the viral load was below the limit of detection (usually <50 copies/ml).

CD4 count

The CD4+ T-cell count at the time of sampling, in absolute counts of cells/ul.

CD8 count

The CD8+ T-cell count at the time of sampling, in absolute counts of cells/ul.

Progression

This field records the rate of disease progression of the patient, if recorded by the study.

# patient sequences

This field can limit your search to patients with multiple sequences in the database. For example, if you enter ">9" in this field, your output will include only sequences from patients with 10 or more sequences in the database. This option is particularly useful for searches of intrapatient sequence sets. For more options for intrapatient searches, see Intrapatient Search Interface.
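
The ">9" example can be pictured as a filter on the number of sequences per patient. The Python sketch below assumes records are available as (patient ID, accession) pairs; that record shape, the function name, and the sample values are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def patients_with_more_than(records: List[Tuple[int, str]], threshold: int) -> List[Tuple[int, str]]:
    """Keep only records whose patient has more than `threshold` sequences."""
    counts = Counter(patient_id for patient_id, _ in records)
    return [(pid, acc) for pid, acc in records if counts[pid] > threshold]


# Hypothetical records: patient 1 has three sequences, patient 2 has one.
records = [(1, "X00001"), (1, "X00002"), (2, "X00003"), (1, "X00004")]
print(patients_with_more_than(records, 2))  # only the three records of patient 1
```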

Cluster name

A cluster is a group of two or more epidemiologically-linked patients. A cluster ID links two or more patient IDs in the database. Each cluster is assigned a name, which is not necessarily unique to a single cluster. (For example, there may be more than one "chain1" cluster in the database.) Clusters are assigned only when both the publication and the sequences themselves indicate epidemiological linkage of the patients.

Cluster transmission type

The cluster transmission type describes the mode of transmission of the virus among all patients in the cluster. For example, clusters with Heterosexual transmission are pairs or chains of patients linked by heterosexual transmission of the virus. In many cases, a cluster will have more than one transmission type. For example, a cluster consisting of a heterosexual couple and their infected child would have both Heterosexual and Mother->Child transmission types.

Cluster comment

The cluster comment field gives information about the cluster of epidemiologically-linked patients.

Fiebig stage

Fiebig stage is a staging system for early HIV infection. This field can be searched from the Advanced Search and the Intrapatient Search interfaces.

Fiebig stage	Duration in days (range)	Cumulative duration (range)
Eclipse	10 (7,21)	10 (7,21)
1 (vRNA+)	7 (5,10)	17 (13,28)
2 (p24Ag+)	5 (4,8)	22 (18,34)
3 (ELISA+)	3 (2,5)	25 (22,37)
4 (Western Blot +/-)	6 (4,8)	31 (27,43)
5 (Western Blot +, p31-)	70 (40,122)	101 (71,154)
6 (Western Blot +, p31+)	open-ended
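
The central values in the cumulative duration column are the running sum of the per-stage durations, which the short Python sketch below reproduces. Only the central estimates are summed; the ranges in parentheses in the table are not simple sums of the per-stage ranges, so they are left out.

```python
from itertools import accumulate

# Per-stage durations in days (central estimates from the table above).
stages = ["Eclipse", "1", "2", "3", "4", "5"]
durations = [10, 7, 5, 3, 6, 70]

# Running total of days since infection at the end of each stage.
cumulative = list(accumulate(durations))

for stage, total in zip(stages, cumulative):
    print(f"Stage {stage}: ends about day {total}")
```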

References for Fiebig staging system:

Fiebig et al. 2003. Dynamics of HIV viremia and antibody seroconversion in plasma donors: implications for diagnosis and staging of primary HIV infection. AIDS 17(13): 1871-1879.

Keele et al. 2008. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc Natl Acad Sci U S A. 105(21):7552-7.

Geographical Information

Sampling country

Although this is geographical information, the search field is located at the top under Sequence Information for convenience.

Sampling city

The city/province/state/region in which the sample was obtained.

Infection country

This field records the country in which the patient was infected. This field is filled in only when the infection country differs from the sampling country and the infection country is known specifically and with high certainty.

Infection city

If the infection city is known to be different than the city where the sample was collected, that is noted here. This field rarely contains data.

Geographic region

This is a way to retrieve all sequences from (for example) the African continent without having to search for each country separately. For a list of countries included in each region, see:
Definitions of the HIV Database geographic regions

Amino Acid Motif Search

Motif

Restricts the search to sequences containing a specified amino acid motif, such as YCVHQRIEIKDTK. The output can be downloaded as nucleotides or amino acids.

Boolean searches with "and" and "or" work normally. As in other fields, a space means "and", so the query "KKE ESK" returns sequences that contain both KKE and ESK. In contrast, the query KKE*ESK will return sequences that contain both tripeptides, but in the specified order (KKE 5' of ESK).

To search for a motif where amino acids are separated by an exact number of residues, use underscore. For example, if you want to find sequences with the HLA-A1 motif xxDExxxxxY, enter the query __DE_____Y
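
The underscore and * conventions for motifs map naturally onto regular expressions. The Python sketch below is one possible translation, with hypothetical function names; it assumes an underscore stands for exactly one residue of any kind and * for any run of residues, and it is not the search engine's actual code.

```python
import re


def motif_to_regex(motif: str) -> re.Pattern:
    """Translate a motif query into a regex (illustrative only)."""
    parts = []
    for ch in motif:
        if ch == "_":
            parts.append(".")    # exactly one residue, any amino acid
        elif ch == "*":
            parts.append(".*")   # zero or more residues
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def contains_motif(protein: str, motif: str) -> bool:
    """Return True if the motif occurs anywhere in the protein sequence."""
    return motif_to_regex(motif).search(protein) is not None


print(contains_motif("MKKEAAAESKQ", "KKE*ESK"))    # KKE precedes ESK
print(contains_motif("MESKAAAKKEQ", "KKE*ESK"))    # wrong order
print(contains_motif("AADEAAAAAY", "__DE_____Y"))  # HLA-A1 motif xxDExxxxxY
```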

Gene

Restricts the motif search to a particular gene for which amino acid information is available.

Output Options

Problematic sequences

This field marks sequences that users usually want to exclude from a retrieval. Our default excludes these sequences from searches, but users may choose to include them if desired. Our criteria are very conservative, so that we have few false positives. Thus there are some unlabeled sequences that are still problematic. You still need to check for problem sequences!

N: Non-ACTG characters

High content of non-ACTG characters, meeting one of the following criteria:
- more than 100 consecutive non-ACTG characters
- >7% non-ACTG characters for sequences of length <1000
- >5% non-ACTG characters for sequences of length 1000-2999
- >3% non-ACTG characters for sequences of length 3000 or above.
While direct sequences will naturally contain some IUPAC ambiguity characters, sequences annotated as N have such a high fraction that multiple alignment programs and other analysis programs have trouble with them. All incoming data are automatically screened.
C: Contaminant

Likely contamination with a laboratory strain. If a major part of a study set is contaminated, we may label the full set with C. In other cases, we choose a very conservative (high) level of similarity before we mark a particular sequence C. Different genomic regions have different standards; for example it is harder to detect potential contamination in pol than it is in env. In short sequences, it is particularly difficult to say with certainty if a sequence is a contaminant or not. Contaminants cannot be reliably annotated through an automatic screen, so many potential contaminant sequences are still unmarked.
H: Hypermutant

We screen all incoming sequences for extreme cases of G->A hypermutation, and mark these sequences with H. Hypermutated sequences can carry substitutions not found in viable viruses, so such sequences alter phylogenetic tree branch lengths and complicate the determination of appropriate evolutionary models. For additional information about hypermutation, see the Hypermut Tool.
S: Synthetic

A synthetic sequence does not represent a naturally-occurring viral sequence. There are many ways that this can occur, including:
- spliced sequences containing non-HIV/SIV components
- sequences altered to change codon usage
- patent sequences for which we cannot determine their origin
- sequences where the author has accidentally concatenated two sequences into one
- sequences where the author has accidentally produced a DNA reverse-translated from protein
D: Deletion

A sequence containing an artifactual deletion of >100 nucleotides. These sequences occur when an author puts together 2 sequences from a single sample (for example, a protease and an RT sequence), but omits some intervening sequence. Sequences that represent viruses with naturally-occurring deletions are not annotated in this category.
T: Tiny

A tiny sequence (< 50 bp).
R: Reverse complement

A sequence that was deposited as its reverse complement.

% non-ACGT

Percentage of non-ACGT characters in the nucleotide sequence.
Example: to restrict your search to sequences with less than 0.5% non-ACGT character content use <0.5 in the input box.

last modified: Thu Apr 19 14:04 2012
