HCV Database
HCV sequence database
 


Please read an important announcement about the future of the HCV database here.

[Click here for help with the search results.]

Tips

Description of search fields:

Accession number To search for a range of accession numbers, use X12345 .. X23456. Some sequences (such as VN235) have multiple accession numbers; searching on accession number will match any accession number associated with a sequence, not just the primary accession.

Genotype and subtype In these fields, you can search on multiple genotypes and subtypes by using 'ctrl-click' on all of your choices. A table of the old and new genotype/subtype classification used by the database can be found here.

Include recombinants By default, recombinants are included in the retrieval; uncheck this box to exclude them.

Confirmed Following the standard HCV nomenclature (Simmonds et al, 2005) the HCV database distinguishes 'provisional' and 'confirmed' genotypes. Initially, all genotypes in the database are called "provisional"; the status will be switched to 'confirmed' when the DB staff has made sure that a genotype meets the requirements, as manpower allows.

Sequence name This is often a name based on the isolate or clone name for a sequence, and may be the way a sequence is commonly referred to in publications. This field also searches the Genbank Locus Name field.

Sequence length The length of the nucleotide sequence.

Authors Searches for author(s) listed on the publication. Please do not include initials.

Pubmed/Medline ID The Pubmed/Medline/Genbank databases have switched from using the Medline ID to the Pubmed ID. This field searches both listings, and all entries in the HCV database should have both if there are two defined. The search returns only the pubmed ID, because the Medline ID will become obsolete in the future; since the numbers are linked to Pubmed summaries, we use that numbering system.

Patient code This field needs to be used with care. The patient code is usually the name or number by which the patient is identified in publications. However, the patient code "Patient 1" can refer to many different patients. While there can be many patient records that have the patient code "Patient 1", they are different records; they are distinguished by an internal database variable, the patient ID. Therefore, the sequence records associated with "patient 1" are specific, i.e. records linked to patient 1 from study X are all from the same patient, and different from patient 1 in study Y. When using patient codes, do not assume the code is unique. If you are looking for a patient from a specific publication, include an author name or Medline/Pubmed ID in your search. The patient codes are chosen by the database staff to identify patients in conjunction with a publication and/or a set of sequence records.
The search algorithm used for this field is also different from the other fields, in that a space (as in "patient 1") is not interpreted as an AND, as it is in the other fields. Instead the entire string, including spaces, is used as the search term. You can use the words 'and' and 'or' to join several criteria. For example, to find all sequences that have a patient code, you would search for '[A-Z]* or [0-9]*', to catch any strings starting with a letter between A and Z or a number between 0 and 9.
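As an illustration of how such a query can be read, here is a minimal Python sketch. It assumes the patterns behave like shell-style wildcards (so '[A-Z]*' means "one capital letter, then anything") and that matching is case-sensitive; neither assumption is confirmed for the database itself, and the function name matches_patient_code is hypothetical.

```python
from fnmatch import fnmatchcase


def matches_patient_code(code: str, query: str) -> bool:
    """Return True if `code` satisfies patterns joined by 'and' / 'or'.

    Assumption: shell-style wildcards, case-sensitive. Spaces inside a
    term are kept, so "Patient 1" is treated as one search string.
    """
    # 'or' separates alternatives; any one alternative may match.
    for alternative in query.split(" or "):
        # Within an alternative, every 'and' term has to match.
        terms = [term.strip() for term in alternative.split(" and ")]
        if all(fnmatchcase(code, term) for term in terms):
            return True
    return False


if __name__ == "__main__":
    for code in ["Patient 1", "12-B", "_unnamed"]:
        print(code, matches_patient_code(code, "[A-Z]* or [0-9]*"))
```

Note that the whole string "Patient 1" is compared as one term, so it does not match the code "Patient 10".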

Infection route The infection route describes the risk activity by which the patient most likely was infected. Dual infection routes are not recorded. The infection route has to be positively established to be recorded in this field.

Patient Health Main categories: chronic infection, fulminant/acute infection, Hepatocellular Carcinoma (HCC), cirrhosis, liver disease (unspecified), death/autopsy. When needed, other information can also be entered in this field.

Treatment response Treatment response at the end of the observation period, as documented in the publication. "Response" pertains to whatever the author describes as such, often but not always viral suppression. Categories are:

non-response no measurable response at any time
breakthrough a patient responds initially but relapses before the end of treatment
relapse initial response followed by relapse
end-of-therapy response response at the end of the treatment period
sustained response response at the end of the observation period, which must be longer than the treatment period

Please note that different authors can use different category systems. For example, not all authors report (or recognize) 'breakthrough' patients; often the treatment response is only evaluated at the end of the treatment period, so that patients called breakthrough in one study would be classified as nonresponders in another. Similarly, the observation period after cessation of treatment can vary, so an end-of-treatment response could end up either as a relapse or a sustained response if there is no follow-up after treatment.

Infection country Use the official two-letter country code for the infection country. The infection country is distinct from the sampling country. This field is only filled in if it is actually known with high certainty where a patient was infected. For example, the database contains many sequences for which the sampling country is Sweden that are actually most likely from African countries. In these cases, if a patient has indicated that his/her only chance of infection was in an African country, the infection country is set to that; but if the infection country is not explicitly named, it will be blank.

Infection year This is the year in which the patient was infected. The year is only recorded when it is known exactly; descriptions like 'between 1985 and 1987' are not recorded.

Sampling country The two-letter code for the country in which the sample was taken. Click here for a list of country codes.

Sampling year The year in which the sample was taken from which the sequence was obtained.

Geographic region This is a way to retrieve all sequences from (for example) the African continent without having to search for each country separately. See here for a list of which countries are included in each region.

Other fields This is a shortcut to search several other fields in the database without cluttering the interface even more.

Genomic region search A full description of this tool has been published here. Briefly, the sequences are internally aligned and the location of their starting and ending positions are determined; then these positions are compared to the specified region, and part or all of the sequences are used if they fall within that region. A genomic map showing the regions is here.

Exclude nonhuman host checkbox Checking this box will exclude known non-human sequences from the retrieval. This field is annotated both manually and based on the "specific_host" and "lab_host" fields in the Genbank entry. As with all fields, we cannot guarantee that it is correct, but obvious non-human sequences will probably be excluded.

Exclude known patent sequences checkbox Unchecking this box will include known patent-related sequences in the retrieval. These sequences are very often identical to other sequences in the database, but there is usually less background information about them. This field is annotated based on the occurrence of the word "patent" in the Genbank entry. As with all fields, we cannot guarantee that it is correct, but most patent sequences will be excluded.

Exclude synthetic sequences checkbox Synthetic sequences are sequences that were manipulated in the laboratory. Examples are laboratory-generated recombinants, sequences with artificial mutations, etc. This field is read from Locus line (the top line) of the Genbank entry, but may also be manually changed by the HCV database.

Exclude "bad" sequences checkbox "Bad" sequences are defined as sequences that either contain more than 10% undetermined characters (N's or IUPAC codes) or that have been found to be questionable by the HCV database staff. In the latter case, the reason for the characterization is usually noted in the comments field (shown in green in the Genbank entry).
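The 10% criterion can be checked on your own sequences with a few lines of Python. This is a sketch only: it assumes that gap characters are not counted as residues and that anything other than A, C, G, T or U counts as undetermined. The second criterion (sequences judged questionable by the staff) cannot be computed and is not covered. The function names are hypothetical.

```python
def undetermined_fraction(seq: str) -> float:
    """Fraction of residues that are N's or other IUPAC ambiguity codes."""
    # Assumption: alignment gaps ('-' and '.') are not residues.
    residues = [c for c in seq.upper() if c not in "-."]
    if not residues:
        return 0.0
    undetermined = sum(1 for c in residues if c not in "ACGTU")
    return undetermined / len(residues)


def is_bad(seq: str, threshold: float = 0.10) -> bool:
    """True if MORE than `threshold` of the residues are undetermined."""
    return undetermined_fraction(seq) > threshold


if __name__ == "__main__":
    print(is_bad("ACGTACGTAN"))  # exactly 10% undetermined
    print(is_bad("ACGTNNRYAC"))  # 40% undetermined
```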

Help with the search results

On the search interface output page these options are available:
- sorting the sequences on any of the fields shown (click on the column labels)
- excluding epidemiologically related sequences ('Exclude related' button)
- creating a phylogenetic tree from the retrieved sequences. Optionally, sequence labels can be modified, and genotype reference sequences can be included. This option opens a new page where tree parameters can be specified.
- downloading the sequences, aligned or unaligned, or translated to amino acids in three frames. The sequences can be labeled in many different formats, including custom user-defined formats. The genotype reference sequences, aligned to the database sequences, can be included in the alignment.
- downloading the background information that is displayed on the output page (and following pages) as a tab-delimited file, with or without the sequences
- displaying a histogram that shows how much sequence information is available in the retrieved sequences for each position in the genome, colored by genotype

Sorting the sequences

To sort the sequences on the content of one of the columns, click on the title of that column.

Excluding epidemiologically related sequences

This button only appears if you have selected a certain genomic region as a search parameter. Its action is to discard all but the first sequence (in the current sorting order) of all clusters of epidemiologically related sequences in the same genomic region. These sequences can either be from the same patient (one sample or different samples), or they can be from another patient who is known to be related to this patient. These relationships are defined in "clusters"; an overview of clusters can be obtained by selecting "clusters" from the "Other fields" pulldown menu, and typing an underscore (_) in the search field (this lists all sequences that have a value in the cluster field). You can see which cluster a patient is a member of (if any) by clicking on the patient code link. Unfortunately, there is currently no way (except by the name) to tell which clusters are nested inside others. We're working on it...

The reason we have limited the use of this button to cases where the genomic region is part of the search parameters is that it doesn't make much sense to discard an NS5B sequence from a patient because there is a Core sequence from that same patient. However, if you want to have it apply to sequences from multiple regions, select all those regions or the complete genome from the 'genomic region' box, and then set the "Include fragments longer than X" box to a small number (or to 0). Please be aware that this will mean that the patient or transmission cluster will not be represented in more than one of the selected regions! More information on how the clusters are annotated is here.
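The effect of the button can be pictured as "keep the first record of each cluster, in the current sort order". The Python sketch below illustrates that rule under stated assumptions: each record is a dictionary with a hypothetical "cluster" key, records without a cluster are always kept, and nested clusters are ignored. It is not the database's own code.

```python
def exclude_related(records):
    """Keep only the first record of each cluster, preserving order."""
    seen_clusters = set()
    kept = []
    for record in records:
        cluster = record.get("cluster")  # hypothetical field name
        if cluster is None:
            # Not part of any cluster: nothing to discard.
            kept.append(record)
        elif cluster not in seen_clusters:
            seen_clusters.add(cluster)
            kept.append(record)
    return kept


if __name__ == "__main__":
    hits = [
        {"name": "seq1", "cluster": "C1"},
        {"name": "seq2", "cluster": "C1"},
        {"name": "seq3", "cluster": None},
        {"name": "seq4", "cluster": "C2"},
    ]
    print([r["name"] for r in exclude_related(hits)])
```

Because the first record in the current order is the one that survives, re-sorting the output before pressing the button changes which sequence represents each cluster.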

Downloading the sequences

aligned vs. unaligned You can download nucleotide sequences as an alignment, or unaligned. Amino acids only come unaligned, in any (or all) of the three frames. If you choose aligned sequences, two things can happen. If you used a genomic region or sequence coordinates to retrieve your alignment and you have checked the 'clip to selected region' box, your sequences will be limited to the selected region. Otherwise, you will end up with an alignment that covers the entire genome, i.e. is around 11,000 characters long. This can be convenient if you want to align your sequences to a set of complete genomes, or to other sequences retrieved using the same method (these alignments may differ by a few positions). Please note that the alignments are not necessarily optimal and may require manual adjustment; but they form a very good starting point.
WARNING: If you download an alignment, sequences that do not have valid coordinates relative to H77 will not be included in the alignment. This can happen if the sequences are very short, if they contain non-HCV inserts, or if they are reverse complements. These sequences can be recognized in the search interface output because they do not have the icon that shows the location, but say "no location info" instead.
Sequence labels We have pre-defined five ways to label your sequences; if you prefer another way, on the 'Download sequences' menu there is also a link called "Compose labels". If you click this, you can set the order in which a large number of database fields appear in the label, and what symbols are used as field separators and as 'missing' indicators.
Including the reference sequence You can include the reference sequence in your downloaded sequences. This will often make it easier to use the SynchAligns interface to align these sequences to other sets.
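If you download unaligned nucleotide sequences and want to reproduce a three-frame translation yourself, the following Python sketch shows the idea. It uses the standard genetic code, removes gap characters first, and writes X for any codon containing an ambiguity code; those choices are assumptions and may differ from what the download option does.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, frame: int = 0) -> str:
    """Translate `seq` starting at offset `frame` (0, 1 or 2)."""
    seq = seq.upper().replace("U", "T").replace("-", "")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Codons with ambiguity codes are not in the table and become 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frames(seq: str) -> list[str]:
    """Return the translations in forward frames 1, 2 and 3."""
    return [translate(seq, frame) for frame in range(3)]


if __name__ == "__main__":
    print(three_frames("ATGAGCACG"))
```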

Creating a phylogenetic tree

We have incorporated a phylogenetic treemaker in the search interface. You can make a tree from (all or a subset of) your retrieved sequences, and include the genotype reference sequences or the consensus sequences. The interface allows you to compose labels for your sequences, to choose the evolutionary model for the distance calculation (currently F84, Kimura 2-parameter, Log-det, or Jukes-Cantor), to set the transition-transversion ratio, and to determine the outgroup. In the near future, many more options will become available. The alignment, treefile, and various graphical representations of the tree can be downloaded.
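To give a feel for what the simpler distance models compute, here is a Python sketch of the textbook Jukes-Cantor and Kimura 2-parameter formulas for a pair of aligned sequences. It is not the treemaker's own implementation; in particular, skipping every column that contains a gap or ambiguity code is an assumption made for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(a: str, b: str):
    """Return (compared sites, transitions, transversions)."""
    sites = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: skip gaps and ambiguity codes
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return sites, transitions, transversions


def jukes_cantor(a: str, b: str) -> float:
    """d = -3/4 * ln(1 - 4p/3), with p the proportion of differing sites."""
    sites, ts, tv = count_differences(a, b)
    p = (ts + tv) / sites
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(a: str, b: str) -> float:
    """d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)); P transitions, Q transversions."""
    sites, ts, tv = count_differences(a, b)
    P, Q = ts / sites, tv / sites
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


if __name__ == "__main__":
    s1, s2 = "ACGTACGTAC", "ACGTACGTAT"  # one transition in ten sites
    print(round(jukes_cantor(s1, s2), 4), round(kimura_2p(s1, s2), 4))
```

Both functions fail (division by zero or a math domain error) when the sequences share no comparable sites or are too divergent for the formula, which is also where these distance corrections stop being meaningful.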

How the sequences are aligned
When the sequences are uploaded into the database, they are internally aligned against a 'model sequence' that represents all sequences that are already present in the database. For this alignment we use the HMMER program, written by Sean Eddy. The start and end coordinates of each sequence relative to the model sequence, as well as the location of all the gaps, are stored in the database. When you request all sequences encompassing the core gene, for example, the coordinates for the core gene in the model sequence are retrieved, and all sequences with a lower (or equal) start point and a higher (or equal) stop point are retrieved. When the sequences are downloaded, the gaps relative to the model sequence are inserted. For the little image that shows the location of the sequence relative to the genome, a slightly different set of coordinates is used, relative to H77 instead of the model sequence. These coordinates are produced by an algorithm, and are identical to the coordinates that the Sequence Locator tool produces. Please note that the location of some sequences cannot be accurately determined, often because they are too short or because they are located in a region where H77 is undefined (such as D85026). In these cases, the sequence will not be included in the aligned download, but if you download the sequences unaligned it will be there.
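The retrieval rule described above (start at or before the region start, stop at or after the region end) can be sketched in Python as follows. The Placement type, the accession labels and all coordinates are made up for illustration and do not correspond to real database entries or to real positions in the model sequence.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    """Stored position of one sequence relative to the model sequence."""
    accession: str  # hypothetical label
    start: int
    end: int


def covering(placements, region_start: int, region_end: int):
    """Return the sequences that span the whole requested region."""
    return [
        p for p in placements
        if p.start <= region_start and p.end >= region_end
    ]


if __name__ == "__main__":
    stored = [
        Placement("SEQ_A", 1, 9000),    # near-complete genome
        Placement("SEQ_B", 400, 800),   # fragment inside the region
        Placement("SEQ_C", 300, 900),   # exactly the region
    ]
    # Illustrative region coordinates, not the real core gene positions.
    print([p.accession for p in covering(stored, 300, 900)])
```

Under this strict rule SEQ_B is not returned, because it starts after the region start and stops before the region end.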

In August 2004, we updated and adjusted the model sequence so that almost all downloaded alignments are now codon-aligned, meaning that they can be translated into amino acids without having to move gaps. This can cause the UTR alignments to look a little strange, since gaps there also tend to be inserted in triplets; however, the total length of the UTR model sequence has actually decreased, showing that the alignments are not more gappy (stretched out) than they were before. The exact method used for creating the internal alignments and retrieving the regions has been described here.
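A quick way to spot rows in a downloaded alignment that may not be codon-aligned is to test whether every run of gaps has a length that is a multiple of three. The Python sketch below does only that; it assumes '-' is the gap character and does not check that gaps start on a codon boundary, so treat it as a rough screen rather than a full test.

```python
import re


def gaps_in_triplets(aligned_seq: str) -> bool:
    """True if every run of '-' has a length divisible by three."""
    return all(
        len(run.group()) % 3 == 0
        for run in re.finditer(r"-+", aligned_seq)
    )


if __name__ == "__main__":
    print(gaps_in_triplets("ATG---GCA"))
    print(gaps_in_triplets("ATG--GCA"))
```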

Downloading the background information

It is possible to download the output shown in the search interface as tab-delimited files, which allows you to tabulate background data for the retrieved set that will not show up in the sequence names. Examples of background information: patient information (code, health status, age, gender, risk factor, infection date, infection country and city), comments from the authors or HCV database staff, tissue type, and strain, isolate, and clone name.
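Once downloaded, a tab-delimited file is easy to summarize with a short script. The Python sketch below counts the values in one column; the column names and rows in the example are hypothetical, so check the header line of your own download for the actual names.

```python
import csv
import io
from collections import Counter


def tally(tsv_text: str, column: str) -> Counter:
    """Count the values found in `column` of a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Empty cells are grouped under an explicit label.
    return Counter(row[column] or "(blank)" for row in reader)


if __name__ == "__main__":
    # Hypothetical header and rows, for illustration only.
    example = (
        "Name\tGenotype\tSampling country\n"
        "seq1\t1a\tUS\n"
        "seq2\t6a\tVN\n"
        "seq3\t6a\t\n"
    )
    print(tally(example, "Genotype"))
    print(tally(example, "Sampling country"))
```

To work from a saved file instead of a string, read the file contents first and pass them to tally.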

Displaying a histogram

The histogram is a graphic overview of how many sequences in the retrieved sequence set cover each region of the genome. It can be very useful, for example, to select a region to sequence for which a lot of relevant comparison sequences (e.g. genotype 6 from Vietnam) are available. These histograms are computationally intensive for large numbers of sequences, and can take some time to compute.
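The numbers behind such a histogram are per-position coverage counts, split by genotype. The Python sketch below computes them from start and end coordinates; the record layout (genotype, start, end with 1-based inclusive positions) and the toy genome length are assumptions for this example, and no plotting is done.

```python
from collections import defaultdict


def coverage_by_genotype(records, genome_length: int):
    """Count, per genotype, how many sequences cover each position.

    `records` is an iterable of (genotype, start, end) tuples with
    1-based, inclusive coordinates (an assumption for this sketch).
    """
    coverage = defaultdict(lambda: [0] * genome_length)
    for genotype, start, end in records:
        counts = coverage[genotype]
        for pos in range(start - 1, end):
            counts[pos] += 1
    return dict(coverage)


if __name__ == "__main__":
    hits = [("1a", 1, 5), ("1a", 3, 8), ("6a", 4, 6)]
    for genotype, counts in coverage_by_genotype(hits, 10).items():
        print(genotype, counts)
```

The inner loop visits every covered position of every sequence, which gives a sense of why the real histogram can take some time for large retrievals.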


Questions or comments? Contact us at hcv-info@lanl.gov