NCBI Resource Guide |
Each link in this Resource Guide leads to a brief description of the resource on this page, then to the resource itself. A graphical Site Map and an Alphabetical Quicklinks Table provide direct links to resources and bypass the descriptions. |
About NCBI | Overview |
About NCBI - The science behind our resources. An introduction for researchers, educators and the public. Includes a Science Primer, with plain language introductions to bioinformatics, genome mapping, molecular modeling, SNPs, ESTs, microarray technology, molecular genetics, pharmacogenomics, and phylogenetics. |
Programs and Services - basic research, databases and software, outreach and education |
Contact Information - postal address, phone, e-mail addresses for various services |
Exhibit Schedule - NCBI exhibits at upcoming conferences |
NCBI Handbook - an online book, written by NCBI staff, that discusses the many resources available at NCBI. Each chapter is devoted to one service; after a brief overview on using the resource, there is an account of how the resource works, including topics such as how data are included in a database, database design, query processing, and how the different resources relate to each other. |
Organizational Structure - functions of the three NCBI branches: Computational Biology Branch (CBB), Information Engineering Branch (IEB), and Information Resources Branch (IRB) |
Board of Scientific Counselors - advises the NIH Director, the Deputy Director for Intramural Research, the NLM Director, and the NCBI Director about the intramural research and development programs of the NCBI. |
Postdoctoral Fellowships - general information, application procedure |
Statistics for NCBI Resources - A page listing statistics that are available for selected NCBI resources, including number of records present in various databases, number of genomes available at NCBI and statistics for the individual genomes, and server usage. |
Site Search - Search the NCBI web site and display results in various formats. The default Homepage view sorts NCBI pages based on the number of other NCBI pages that link to them. The NCBI Site Search function is part of the Entrez system (described below). Therefore, the search features described in the Entrez help document also apply to the site search function. |
News and Announcements |
GenBank | Overview |
General Information (sample record, release notes, GenBank divisions, statistics), Submissions (general, special categories, other data types), International Collaboration, FTP GenBank |
General Information |
What is GenBank? - a database of nucleotide sequences from >160,000 organisms. Records that are annotated with coding region (CDS) features also include amino acid translations. GenBank belongs to an international collaboration of sequence databases (described below), which also includes EMBL and DDBJ. GenBank is updated daily in NCBI search systems, and a full release is issued on the FTP site approximately the 15th of every February, April, June, August, October, and December. Each full release contains all the data present in GenBank as of the cutoff date specified in the release notes (described below). The FTP site also provides daily cumulative and non-cumulative update files (more about the FTP site below). |
Sample Record - detailed description of each field in a GenBank record. Includes, for example, information about accession number formats, sequence identifiers (GI number and accession.version), a listing of GenBank divisions, and more. Describes some commonly annotated biological features, such as CDS, and provides links to documents that list and define the complete set of biological features that can be annotated on sequence records. Includes a link to a sequence revision history tool that can be used to track changes that have occurred to the sequence data in a record. Also lists the Entrez search field(s) that can be used to search each part of a sequence record. |
GenBank Divisions - summary of GenBank divisions, including abbreviations, full spellings, information about what the GenBank divisions are, and what they are not. (This information is part of the GenBank sample record, described above.) |
Access GenBank - through Entrez Nucleotides. Search by accession number, author name, organism, gene/protein name, and a variety of other text terms. Additional information about Entrez is below. Use BLAST for sequence similarity searches against GenBank and other databases. An option to download the GenBank full release and updates via FTP is also available. |
Growth Statistics (graph) - see also Release Notes sections 2.2.6 (per division statistics), 2.2.7 (per organism statistics), 2.2.8 (growth of GenBank). For statistics on other NCBI databases, please see the page that summarizes sources of Statistics for NCBI Resources. |
GenBank Release Notes - A document that accompanies each full release (described in "What is GenBank?", above) of the GenBank database. The release notes describe the format and content of the flat files that comprise the release. They also include notices of recent and upcoming changes, information about GenBank divisions, growth statistics, citing GenBank, and more. |
Genetic Codes - synopsis of 17 genetic codes; used to ensure correct translation of coding sequences in GenBank records. |
GenBank Bionet Newsgroup - A moderated list that includes announcements of new GenBank releases, recent and upcoming changes, and discussion among subscribers. For information on how to subscribe by e-mail, see the NCBI Announcements Email Lists page. |
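The release schedule given under "What is GenBank?" (a full release on approximately the 15th of every February, April, June, August, October, and December) can be turned into a small date helper. The sketch below is illustrative only: the function names are hypothetical, and it treats the 15th as a nominal date even though the guide says "approximately".

```python
from datetime import date
from typing import List

# Months in which a full GenBank release is issued, per the schedule above.
RELEASE_MONTHS = (2, 4, 6, 8, 10, 12)


def nominal_release_dates(year: int) -> List[date]:
    """Return the nominal (15th of the month) full-release dates for a year."""
    return [date(year, month, 15) for month in RELEASE_MONTHS]


def next_nominal_release(today: date) -> date:
    """Return the first nominal release date on or after `today`."""
    for candidate in nominal_release_dates(today.year):
        if candidate >= today:
            return candidate
    # Past December 15th: the next release falls in February of the next year.
    return date(today.year + 1, RELEASE_MONTHS[0], 15)
```

Actual release dates and data cutoff dates should always be taken from the release notes rather than computed this way.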
GenBank Submissions |
General Information |
In addition to GenBank, there are other databases at NCBI to which a variety of data types can be submitted (third party annotations (TPA), variation, expression, MHC data, SKY/M-FISH/CGH data, traces). |
Submission Software Programs |
Special Types of Submissions to GenBank |
Genomes, Alignments, ESTs, GSSs, HTGs, STSs, WGS |
Other Types of Data Submissions (Other NCBI databases, separate from GenBank, to which data can be submitted) |
International Nucleotide Sequence Database Collaboration |
GenBank, DDBJ, EMBL - Overview of collaborative projects and links to home pages. The GenBank, DDBJ (DNA Data Bank of Japan), and EMBL (European Molecular Biology Laboratory) databases share data on a daily basis and are therefore equivalent. The record formats and search systems might differ among the databases, but the accession numbers, sequence data, and annotations are the same in all of them. E.g., you can retrieve the record with accession number U12345 from GenBank, DDBJ, or EMBL and it will contain the same sequence data, references, etc. in all three databases. |
DDBJ/EMBL/GenBank Feature Table - feature table formats and standards used in the annotation of sequence records by the collaborating databases; makes data sharing possible; includes detailed appendices. |
FTP GenBank and Daily Updates |
GenBank flat file format - see sample GenBank record and detailed description in GenBank release notes; download most recent full release (described above) and daily cumulative or non-cumulative update files. |
ASN.1 format - Abstract Syntax Notation 1, an International Standards Organization (ISO) data representation format; download most recent full release (described above) and daily cumulative or non-cumulative update files. (more on ASN.1) |
FASTA format - definition line followed by sequence data only (example); see readme file for database descriptions, including nt.Z (daily updated non-redundant BLAST nucleotide database, contains GenBank+EMBL+DDBJ+PDB sequences, but no EST, STS, GSS, or HTGS sequences), nr.Z (daily updated non-redundant proteins), est.Z, gss.Z, htg.Z, sts.Z, and others. |
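As described above, a FASTA record is a definition line followed by sequence data only. The following is a minimal sketch of reading that layout in Python, assuming the definition line starts with ">" and the sequence may wrap over several lines. The function name and the example records are made up for illustration and are not taken from any NCBI file.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each definition line (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            current = line[1:]  # start a new record
            records[current] = ""
        elif current is not None:
            records[current] += line  # append a wrapped sequence line
    return records


# Hypothetical example data, not real GenBank records.
example = """>example_1 hypothetical sequence one
ACGTACGT
ACGT
>example_2 hypothetical sequence two
TTGACA
"""
records = parse_fasta(example.splitlines())
```

For the large compressed database files listed in the readme, a streaming approach (yielding one record at a time) would be preferable to building a dictionary in memory.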
Molecular Databases | Overview |
Nucleotide Sequences, Protein Sequences, Structures, Genes, Expression, Taxonomy |
Nucleotide Sequence Databases |
Entrez Nucleotides - combines data from a number of source databases, including GenBank, RefSeq, TPA, and PDB. Data can be searched by accession number, author name, organism, gene/protein name, and a variety of other text terms. Additional information about Entrez below. For retrieval of large data sets, Batch Entrez (described below) is available. |
GenBank - a database of nucleotide sequences from >160,000 organisms. Records that are annotated with coding region (CDS) features also include amino acid translations. GenBank belongs to an international collaboration of sequence databases (described above), which also includes EMBL and DDBJ. A sample record, which provides a detailed description of each field in a GenBank record, is also available. A variety of sequence records exist in GenBank, such as characterized genes that have been well-studied and annotated, batch produced sequences (ESTs, GSSs, STSs), high throughput genomic sequences, complete genomes, and more. Additional information about GenBank is given in the GenBank Overview section of this guide. |
RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, mRNAs and proteins for gene models, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits. Nucleotide sequence records have accessions: NT_123456, NM_123456, NC_123456, NG_123456, XM_123456, XR_123456 (more info about accession numbers and access). Additional details about RefSeq are provided in the NCBI Handbook, which is available online in the Entrez Books database. |
Consensus CoDing Sequence (CCDS) Database - The CCDS project is a collaborative effort to identify a core set of human protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations on the human genome. The collaborators include the National Center for Biotechnology Information (NCBI, Map Viewer), European Bioinformatics Institute (EBI, Ensembl), University of California, Santa Cruz (UCSC, Genome Browser), and Wellcome Trust Sanger Institute (WTSI, Vega). They identify the position of protein-coding regions of genes that are (1) annotated consistently on the human genome by all of the participating centers and (2) supported by transcript evidence, use of canonical splice sites, and other quality assurance measures. Additional information about the curation, process flow, and quality testing is available on the CCDS web site. |
Third Party Annotation (TPA) database - a database of experimentally supported annotations on assemblies of sequences already present in DDBJ/EMBL/GenBank. Whereas DDBJ/EMBL/GenBank contains primary sequence data and corresponding annotations submitted by the laboratories that did the sequencing, the TPA database contains third-party assemblies of primary data with experimentally supported annotation that has been published in a peer-reviewed scientific journal. Details about how to submit data, as well as examples of what can and cannot be submitted to TPA, are provided on the TPA home page. Note: Although TPA records are derived from DDBJ/EMBL/GenBank, TPA is actually a separate database. Therefore, TPA records are not present in the GenBank FTP files, but will be available in separate FTP files. |
dbEST - database of expressed sequence tags; short, single pass read cDNA (mRNA) sequences. Also includes cDNA sequences from differential display experiments and RACE experiments. Note: EST sequences are available from two sources: dbEST and the EST division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ. (data submission instructions...) |
dbGSS - database of genome survey sequences; short, single pass read genomic sequences, exon trapped sequences, cosmid/BAC/YAC ends, others. Note: GSS sequences are available from two sources: dbGSS and the GSS division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ. (data submission instructions...) |
dbMHC - Provides a platform where the human leukocyte antigen (HLA) community can submit, edit, view, and exchange Major Histocompatibility Complex (MHC) data. The MHC database is fully integrated with other NCBI resources, as well as with the International Histocompatibility Working Group (IHWG) Web site, and provides links to the IMmunoGeneTics HLA (IMGT/HLA) database. Additional details are available in the NCBI Handbook. |
dbSNP - database of single nucleotide polymorphisms, small-scale insertions/deletions, polymorphic repetitive elements, and microsatellite variation. dbSNP includes polymorphism data that are experimentally derived or computationally derived, as well as hybrid data determined by the alignment of an experimentally derived molecule to genomic sequence data. Currently, dbSNP comprises four general classes of submissions: (a) The SNP Consortium (TSC) - candidate SNPs identified by sequencing, either by the reduced representation shotgun strategy or by alignment of random reads to genomic sequence; (b) Overlaps - candidate SNPs identified in sequence overlaps between individual BACs or PACs; (c) ESTs - SNPs identified in EST clusters, including those identified by the Cancer Genome Anatomy Project (described below); (d) Other - SNPs identified after screening larger numbers of chromosomes, which include many with alleles of lower frequency (1%-20%). (data submission instructions) To receive announcements about updates and new features to dbSNP, see the NCBI Announcements Email Lists page. Note: Although dbSNP is a separate database from GenBank, SNP records include cross-references to GenBank records. |
dbSTS - database of sequence tagged sites; short sequences that are operationally unique in the genome, used to generate mapping reagents. Note: STS sequences are available from two sources: dbSTS and the STS division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ. (data submission instructions...) |
UniSTS - a unified, non-redundant view of sequence tagged sites (STSs). UniSTS integrates marker and mapping data from a variety of public resources. If two or more markers have different names but the same primer pair, a single STS record is presented for the primer pair and all the marker names are shown. Each UniSTS record displays the primer sequences, product size, mapping information, and cross references to Entrez Gene, dbSNP, RHdb, GDB, MGD, and the Map Viewer. The marker report also lists GenBank and RefSeq records that contain the primer sequences, as determined by Electronic PCR (e-PCR). Data sources include dbSTS, RHdb, GDB, various human maps (Genethon genetic map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map, Stanford RH map, NHGRI chr 7 physical map, WashU chrX physical map), various mouse maps (Whitehead RH map, Whitehead YAC map, Jackson laboratory's MGD map). |
UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. UniGene clusters are annotated with mapping and expression information when possible (e.g., for human), and include cross-references to other resources. Sequence data can be downloaded by cluster through the UniGene web pages, or the complete data set can be downloaded from the repository/UniGene directory of the FTP site. In addition, UniGene DDD (described below) can be used to show differential expression of genes between cDNA libraries. The organisms represented in UniGene are listed on the UniGene home page. |
HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. Organisms represented are listed on the HomoloGene home page. |
Mammalian Gene Collection (MGC) - The NIH Mammalian Gene Collection (MGC) is a trans-NIH initiative that seeks to identify and sequence a representative full open reading frame (FL-ORF) clone for each human, mouse, and rat gene. The MGC project entails the production of cDNA libraries and sequences, database and repository development, as well as the support of research for improved library construction, sequencing, and analytic technologies. All the resources generated by the MGC are publicly accessible to the biomedical research community. |
Trace Archives - a repository of the raw sequence traces generated by large sequencing projects. It allows retrieval of both the sequence file and the underlying data that generated the file. In the case of projects that rely on a Whole Genome Shotgun (WGS) strategy, the Trace Archive will be the sole source of raw sequence data. (More information about WGS projects is provided in the Resource Guide section on special types of submissions to GenBank/WGS.) NCBI will be exchanging data regularly with the Ensembl Trace Server. The Trace Archive can be searched by using Trace BLAST (described below), or by entering a term in the search box at the top of the Trace Archives page. (data submission instructions...) |
Short Read Archive - houses sequencing data generated by new sequencing platforms. |
Assembly Archive - links the raw sequence information found in the Trace Archives with assembly information found in publicly available sequence repositories (GenBank/EMBL/DDBJ). The Assembly Viewer allows a user to see the multiple sequence alignments as well as the actual sequence chromatogram. |
UniVec - a database that can be used to quickly identify segments within nucleic acid sequences which may be of vector origin. Screening using UniVec is efficient because a large number of redundant sub-sequences have been eliminated to create a database that contains only one copy of every unique sequence segment from a large number of vectors. The VecScreen tool, described below (under sequence analysis tools), can be used to compare a query sequence against the UniVec database in order to identify possible vector contamination. |
Genomes - Resources in the Genomes and Maps section contain the nucleotide sequences for a variety of genomes. Examples of the genomes available include: >1000 organisms in Entrez Genome, human, mouse, rat, zebrafish, Drosophila, nematode, plant genomes, yeast, malaria, microbial genomes, viruses, viroids, plasmids, eukaryotic organelles. |
Nucleotide Sequence Analysis - various tools are available for analyzing nucleotide sequences and are described below. |
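The UniSTS entry in the list above notes that when two or more markers have different names but the same primer pair, a single STS record is presented with all the marker names. A minimal sketch of that merging rule follows. It is not the UniSTS implementation: the function name, the tuple layout, and the marker data are hypothetical, and it assumes primer pairs match only when both primers are identical (ignoring case) and given in the same forward/reverse order.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (marker name, forward primer, reverse primer); layout is an assumption.
Marker = Tuple[str, str, str]


def group_by_primer_pair(markers: Iterable[Marker]) -> Dict[Tuple[str, str], List[str]]:
    """Collapse markers that share a primer pair into one record of names."""
    grouped: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for name, forward, reverse in markers:
        key = (forward.upper(), reverse.upper())  # case-insensitive comparison
        if name not in grouped[key]:
            grouped[key].append(name)
    return dict(grouped)


# Hypothetical markers: two names for one primer pair, plus one distinct marker.
markers = [
    ("MARKER_A", "ACGTAC", "TTGCAA"),
    ("MARKER_B", "acgtac", "ttgcaa"),
    ("MARKER_C", "GGGTTT", "CCCAAA"),
]
sts_records = group_by_primer_pair(markers)
```

Here `sts_records` holds two entries: one primer pair carrying both `MARKER_A` and `MARKER_B`, and one carrying `MARKER_C`.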
Protein Sequence Databases |
Entrez Proteins - search protein sequence records (from GenPept + RefSeq + Swiss-Prot + PIR + PRF + PDB) by accession number, author name, organism, gene/protein name, and a variety of other text terms. Additional information about Entrez below. For retrieval of large data sets, Batch Entrez (described below) is available. Entrez Proteins also includes BLink ("BLAST Link"), a feature which displays the results of BLAST searches that have been done for every protein sequence in the Entrez Proteins data domain. To access it, follow the BLink link displayed beside any hit in the results of an Entrez Proteins search. More information about BLink is provided below. |
RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, mRNAs and proteins for gene models, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits. Protein sequence records have accessions: NP_123456 or XP_123456 (more info about accession numbers and access). |
FTP GenPept - download the "relxxx.fsa_aa.gz" file. The filename stands for "Release number XXX FASTA formatted amino acid translations". The translations are extracted from GenBank/EMBL/DDBJ records that are annotated with one or more CDS features. |
Conserved Domain Database (CDD) - a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It includes domains from Smart and Pfam, as well as domains contributed by NCBI researchers. It also includes alignments of the domains to known 3-dimensional protein structures in the MMDB database (described below). CDD can be used to identify conserved domains in a protein query sequence, using the CD-Search service (described below). In addition, the CDART tool (described below) uses CDD and RPS-BLAST (described below) to retrieve proteins with similar domain architectures. |
HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data. More information about this database is provided under "Literature Databases". |
PROW - Protein Resources on the Web - short authoritative guides on the approximately 200 human CD cell-surface molecules. Peer-reviewed; provides approximately 20 standardized categories of information (biochemical function, ligands, etc.) for each CD antigen. |
Protein Sequence Analysis - various tools are available for analyzing protein sequences and are described below. |
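The RefSeq entries in this guide describe accessions as two letters, an underscore, and six digits, with NT_, NM_, NC_, NG_, XM_, and XR_ used for nucleotide records and NP_ and XP_ for protein records. As a rough illustration, the sketch below checks that pattern and reports the record type. The function name is hypothetical, and the code covers only the prefixes and the six-digit form stated in this guide; it deliberately does not handle version suffixes (accession.version) or any other accession styles.

```python
import re
from typing import Optional

# Prefixes as listed in the RefSeq entries of this guide.
NUCLEOTIDE_PREFIXES = {"NT", "NM", "NC", "NG", "XM", "XR"}
PROTEIN_PREFIXES = {"NP", "XP"}

# Two uppercase letters, an underscore, then exactly six digits.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d{6})$")


def classify_refseq(accession: str) -> Optional[str]:
    """Return 'nucleotide', 'protein', or None if the pattern does not match."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    prefix = match.group(1)
    if prefix in NUCLEOTIDE_PREFIXES:
        return "nucleotide"
    if prefix in PROTEIN_PREFIXES:
        return "protein"
    return None  # well-formed, but not a prefix listed in this guide
```

For example, `classify_refseq("NM_123456")` returns `"nucleotide"`, `classify_refseq("XP_123456")` returns `"protein"`, and a GenBank-style accession such as `U12345` returns `None`.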
Proteomes |
Structure Databases |
Structure Home - general information about the NCBI Structure Group and its research projects, as well as access to the Molecular Modeling Database (MMDB) and related tools to search and display structures. |
MMDB: Molecular Modeling Database - a database of three-dimensional biomolecular structures derived from X-ray crystallography and NMR spectroscopy. MMDB is a subset of three-dimensional structures obtained from the Brookhaven Protein DataBank (PDB), excluding theoretical models. MMDB reorganizes and validates the information in a way that enables cross-referencing between the chemistry and the three-dimensional structure of macromolecules. Its data specification includes a description of a biopolymer's spatial structure, a description of how it is organized chemically, and a set of pointers linking the two. By integrating chemical, sequence, and structure information, MMDB is designed to serve as a resource for structure-based homology modeling and protein structure prediction. MMDB records are stored in ASN.1 format and can be displayed with the Cn3D, Rasmol, or Kinemage viewers. In addition, similar structures within the database have been identified using VAST, and new structures can be compared against the database using VASTsearch. |
3D Domains Database - compact structural domains identified automatically in MMDB, Entrez's macromolecular three-dimensional structure database. These domains are identified by searching for breakpoints in the structure between major secondary structure elements so that the ratio of intra- to inter-domain contacts falls above a set threshold. 3D Domains are the units of comparison for structure neighbor ("related structures") calculations using the VAST algorithm. |
Conserved Domain Database (CDD) - a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It includes domains from Smart and Pfam, as well as domains contributed by NCBI researchers. It also includes alignments of the domains to known 3-dimensional protein structures in the MMDB database (described above). CDD can be used to identify conserved domains in a protein query sequence, using the CD-Search service (described below). In addition, the CDART tool (described below) uses CDD and RPS-BLAST (described below) to retrieve proteins with similar domain architectures. |
PubChem - contains the chemical structures of small organic molecules and information on their biological activities. It is intended to support the Molecular Libraries and Imaging component of the NIH Roadmap Initiative. PubChem's chemical structure database may be searched on the basis of descriptive terms, chemical properties, and structural similarity. When possible, PubChem's chemical structure records are linked to other NCBI databases, including the PubMed scientific literature database and NCBI's protein 3D structure database. PubChem also contains the results of high-throughput biological screening experiments. PubChem is organized as three linked databases within the Entrez/PubMed information retrieval system. |
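The 3D Domains entry above describes choosing breakpoints so that the ratio of intra- to inter-domain contacts falls above a set threshold. The sketch below illustrates only that ratio for a single breakpoint that splits a chain into two candidate domains. It is not the algorithm used for MMDB: the function name, the representation of contacts as pairs of residue indices, the example data, and the threshold value are all assumptions made for illustration, and the real procedure also takes secondary structure elements into account.

```python
from typing import Iterable, Tuple


def contact_ratio(contacts: Iterable[Tuple[int, int]], breakpoint: int) -> float:
    """Ratio of intra-domain to inter-domain contacts for one breakpoint.

    Residues with index < breakpoint form one candidate domain and the
    remaining residues form the other.
    """
    intra = 0
    inter = 0
    for i, j in contacts:
        if (i < breakpoint) == (j < breakpoint):
            intra += 1  # both residues on the same side of the breakpoint
        else:
            inter += 1  # the contact crosses the breakpoint
    if inter == 0:
        return float("inf")  # no contact crosses the split
    return intra / inter


# Hypothetical residue-residue contacts and threshold.
contacts = [(1, 2), (2, 3), (1, 3), (10, 11), (11, 12), (3, 10)]
THRESHOLD = 3.0  # assumed value, not taken from MMDB
accept_split = contact_ratio(contacts, breakpoint=5) > THRESHOLD
```

With these made-up contacts, a breakpoint at residue 5 leaves five contacts within the two halves and one crossing them, so the split passes the assumed threshold; a breakpoint at residue 2 cuts through a tightly connected region and does not.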
Structure-Related Tools - in addition to the structure databases described above, NCBI offers several tools: |
Genes |
Entrez Gene - Entrez Gene provides a gene-based view of the data from a wide range of genomes. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. Each record represents a single gene from a given organism. The minimum set of data in a gene record includes a unique identifier or GeneID assigned by NCBI, a preferred symbol, and any of sequence information, map information, or official nomenclature from an authority list. In addition, a gene record can also include expression, structure, functional, and homology data, when available. Entrez Gene includes data from all organisms that have RefSeq genome records (with NC_* accessions, see more info above), and can also include data from recognized genome-specific databases that provide NCBI with information about genes (preferably with defining sequence) or mapped phenotypes. Entrez Gene is the successor to LocusLink (described below). |
GeneRIF - Gene References into Function (GeneRIFs) provide a simple mechanism to allow scientists to add to the functional annotation of loci described in Entrez Gene. They appear as annotated bibliographies in Entrez Gene records, and consist of brief statements on gene function with links to the corresponding PubMed records (example: human MLH1). The GeneRIF help page describes the simple steps needed to submit information. GeneRIFs are also added to the Entrez Gene records by the MEDLINE Indexing Staff of the National Library of Medicine. GeneRIFs are currently available for a subset of organisms in Entrez Gene, and will be provided for the loci of other organisms as the development of Entrez Gene continues. |
LocusLink - LocusLink was discontinued as of March 1, 2005. It provided a foundation for what is now Entrez Gene and was described in several articles (Pruitt KD, Maglott DR (2001); Pruitt KD, Katz KS, Sicotte H, Maglott DR (2000)). It contained data for a number of species such as human, mouse, rat, zebrafish, nematode, fruit fly, cow, sea urchin, African clawed frog, HIV-1, and a few other model and commonly studied organisms. Data for these organisms (and from the ongoing collaboration among the groups listed above) are now available in the Entrez Gene database (described above), which is the successor to LocusLink. The major differences between LocusLink and Entrez Gene are scope of data and search interface. Entrez Gene contains data from all organisms with RefSeq genome records. (RefSeq is described in the Molecular Databases/Nucleotide Sequences section of this guide.) Entrez Gene also uses the Entrez search system, and therefore offers helpful functions such as Preview/Index, History, and LinkOut that are available for other Entrez databases. The Entrez Gene help document includes numerous tips for previous users of LocusLink. |
Consensus CoDing Sequence (CCDS) Database - The CCDS project is a collaborative effort to identify a core set of human protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations on the human genome. The collaborators include the National Center for Biotechnology Information (NCBI, Map Viewer), European Bioinformatics Institute (EBI, Ensembl), University of California, Santa Cruz (UCSC, Genome Browser), and Wellcome Trust Sanger Institute (WTSI, Vega). They identify the position of protein-coding regions of genes that are (1) annotated consistently on the human genome by all of the participating centers and (2) supported by transcript evidence, use of canonical splice sites, and other quality assurance measures. Additional information about the curation, process flow, and quality testing is available on the CCDS web site. |
UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. UniGene clusters are annotated with mapping and expression information when possible (e.g., for human), and include cross-references to other resources. Sequence data can be downloaded by cluster through the UniGene web pages, or the complete data set can be downloaded from the repository/UniGene directory of the FTP site. In addition, UniGene DDD (described below) can be used to show differential expression of genes between cDNA libraries. The organisms represented in UniGene are listed on the UniGene home page. |
HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. Organisms represented are listed on the HomoloGene home page. |
Mammalian Gene Collection (MGC) - The NIH Mammalian Gene Collection (MGC) is a trans-NIH initiative that seeks to identify and sequence a representative full open reading frame (FL-ORF) clone for each human, mouse, and rat gene. The MGC project entails the production of cDNA libraries and sequences, database and repository development, as well as the support of research for improved library construction, sequencing, and analytic technologies. All the resources generated by the MGC are publicly accessible to the biomedical research community. |
HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data. More information about this database is provided under "Literature Databases". |
AceView (Acembly) - AceView offers an integrated view of the human, nematode and Arabidopsis genes reconstructed by co-alignment of all publicly available mRNAs and ESTs on the genome sequence. The goals are to offer a reliable up-to-date resource on the genes and their functions and to stimulate further validating experiments at the bench. AceView carefully computes co-alignment and clustering of experimental cDNA sequences; no prediction is involved. The resulting AceView genes and their alternative variants are analyzed in terms of expression, intron-exon structure, alternative features, regulation and neighbor relationships. The protein products are analyzed for completeness, their best covering clones are identified, and the proteins are searched for motifs, membership in a protein family, conservation in evolution, closest homologues in other species and signals for subcellular localization. The genes are presented in the context of biological annotations gathered from various sources. AceView can be queried by meaningful words or sentences as well as by most standard identifiers. |
Expression |
Gene Expression Omnibus (GEO) - a gene expression and hybridization array data repository, as well as a curated, online resource for gene expression data browsing, query and retrieval. GEO was the first fully public high-throughput gene expression data repository, and became operational in July 2000. Many types of gene expression data, from platforms such as spotted microarray (microarray), high-density oligonucleotide array (HDA), hybridization filter (filter), and serial analysis of gene expression (SAGE), are accepted, accessioned, and archived as a public data set. GEO data can be accessed through several search and browsing tools on the GEO home page, Entrez (via Entrez GEO Profiles and Entrez GDS (GEO DataSets)), and the FTP site. The Tools/Gene Expression section of this file provides information about data visualization and exploration capabilities available in GEO. |
GENSAT - The Gene Expression Nervous System Atlas, or GENSAT, project aims to map the expression of genes in the central nervous system of the mouse, using both in situ hybridization and transgenic mouse techniques. The GENSAT database contains a series of images related to gene expression experiments. The images are indexed on a number of fields relevant to biological discovery. Search criteria include gene names, gene symbols, gene aliases and synonyms, mouse ages, and imaging protocols. The GENSAT project is a collaboration among the National Institute of Neurological Disorders and Stroke (NINDS), Rockefeller University, St. Jude Children's Research Hospital, and NCBI. |
Expression-Related Tools - in addition to the GEO database, described above, NCBI offers several tools: |
Taxonomy |
NCBI Taxonomy Database Home - general information about the Taxonomy project, including taxonomic resources and a list of outside curators collaborating with NCBI taxonomists. The NCBI Taxonomy Database contains the names and lineages of >160,000 organisms, both living and extinct, that are represented in the genetic databases with at least one nucleotide or protein sequence. New organisms are added to the database as sequence data are deposited for them. The purpose of the taxonomy project at NCBI is to build a consistent phylogenetic taxonomy for the sequence databases. |
Taxonomy Browser - The search bar on the Taxonomy home page allows you to browse the NCBI taxonomy database. Enter the scientific or common name of a species (e.g., Canis familiaris or dog) or a higher taxon (e.g., Canidae) to view that organism or taxon's lineage; retrieve the available nucleotide, protein, structure, and genome records; and browse up and down the taxonomic tree. (Tip: For the broadest search results, select the "token set" option in the search bar, which searches for any string, whether in the beginning, middle, or end of a word.) Entrez also provides an interface for browsing the taxonomy database, and offers features such as the Common Tree function, which allows you to build a tree for your own selection of organisms or taxa (more...). |
Taxonomy BLAST - an implementation of Gapped BLAST (2.x) that groups hits by source organism, according to information in NCBI's Taxonomy database. Species are listed in order of sequence similarity to the query sequence, with the strongest match listed first. Three report views are available: |
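The grouping that Taxonomy BLAST performs can be illustrated with a minimal sketch. It assumes hits are already available as hypothetical (organism, subject ID, score) tuples; the tuple layout, the function name, and the use of a single numeric score are illustrative assumptions, not the tool's actual data model or ranking criterion.

```python
def group_hits_by_organism(hits):
    """Group (organism, subject_id, score) hits by organism.

    Returns a list of (organism, [(subject_id, score), ...]) pairs,
    ordered so that the organism with the strongest hit comes first.
    """
    groups = {}
    for organism, subject_id, score in hits:
        groups.setdefault(organism, []).append((subject_id, score))
    # Within each organism, list the strongest hit first.
    for organism_hits in groups.values():
        organism_hits.sort(key=lambda hit: hit[1], reverse=True)
    # Order organisms by their best hit.
    return sorted(groups.items(), key=lambda kv: kv[1][0][1], reverse=True)


# Hypothetical hits, for illustration only.
example_hits = [
    ("Mus musculus", "s1", 50),
    ("Homo sapiens", "s2", 90),
    ("Mus musculus", "s3", 70),
]
grouped = group_hits_by_organism(example_hits)
```

Here `grouped` lists Homo sapiens first, because its best hit has the highest score.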
TaxPlot - a tool for 3-way comparisons of genomes on the basis of the protein sequences they encode. To use TaxPlot, one selects a reference genome to which two other genomes are compared. Pre-computed BLAST results are then used to plot a point for each predicted protein in the reference genome, based on the best alignment with proteins in each of the two genomes being compared. |
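As a rough illustration of the TaxPlot idea, the sketch below turns a set of pre-computed hits into one (x, y) point per reference protein. The input layout, the function name, and the choice of "highest score" as the measure of best alignment are assumptions made for the example; the guide does not say which score TaxPlot uses.

```python
from collections import defaultdict


def taxplot_points(hits, genome_a, genome_b):
    """Map each reference protein to (best score vs genome_a, best score vs genome_b).

    hits: iterable of (reference_protein, target_genome, score) tuples.
    A reference protein with no hit in one of the two genomes gets 0.0 on that axis.
    Reference proteins with no hit in either genome do not appear in the result.
    """
    best = defaultdict(lambda: {genome_a: 0.0, genome_b: 0.0})
    for ref_protein, genome, score in hits:
        if genome in (genome_a, genome_b):
            best[ref_protein][genome] = max(best[ref_protein][genome], score)
    return {p: (s[genome_a], s[genome_b]) for p, s in best.items()}


# Hypothetical pre-computed hits, for illustration only.
example_hits = [
    ("refP1", "A", 120.0),
    ("refP1", "A", 95.0),
    ("refP1", "B", 60.0),
    ("refP2", "B", 210.0),
    ("refP3", "C", 300.0),  # ignored: genome C is not being compared
]
points = taxplot_points(example_hits, "A", "B")
```

Each value in `points` could then be drawn as one point in a scatter plot, with the two compared genomes on the two axes.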
Literature Databases | Overview |
PubMed - A database of citations and abstracts for biomedical literature. These citations are from MEDLINE and additional life science journals. PubMed also includes links to many sites providing full text articles and other related resources. PubMed is accessible through the Entrez search and retrieval system (described below). |
PubMed Central - a digital archive of biomedical and life sciences journal literature, including clinical medicine and public health, managed by the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine (NLM). It is not a journal publisher. Access to PubMed Central (PMC) is free and unrestricted. |
OMIM - Online Mendelian Inheritance in Man - continuously updated catalog of human genes and genetic disorders, with links to associated literature references, sequence records, maps, and related databases. |
Entrez Books - In collaboration with book publishers, the NCBI is adapting textbooks for the web and linking them to PubMed, the biomedical bibliographic database. The idea is to provide background information to PubMed, so that users can explore unfamiliar concepts found in PubMed search results. |
HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. RefSeq protein sequence records serve as anchors for collecting published information about interactions between HIV-1 and human proteins. Each HIV Interactions database record lists an HIV protein and the human proteins with which it has been found to interact. In turn, the Entrez Gene record for each human protein contains annotated HIV-1 Interactions bibliographies, which consist of brief statements on protein interactions with links to the corresponding PubMed records and sequence data. The HIV Interactions database is a collaborative project among the developers of RefSeq (description) and Entrez Gene (description), and is similar in concept to GeneRIF (description). In contrast to GeneRIFs for single genes, however, the publications cited in the HIV Interactions Database contain statements about binding between two proteins rather than statements about the function of a single gene. |
Genomes and Maps | Overview |
organism collections (including Entrez Genome, Entrez Genome Project, Map Viewer, Entrez Gene, UniGene, HomoloGene, and COGs), and organism-specific resources, such as: human, mouse, rat, zebrafish, Drosophila, nematode, plant genomes, yeast, malaria, microbial genomes, viruses, viroids, plasmids, eukaryotic organelles |
Organism Collections |
Genomic Biology - An introduction to the field of genomic biology, with links to the genome resources pages for major organisms and organism groups, as well as links to additional NCBI genome resources. |
Entrez Genome - sequence and map data from the whole genomes of over 1000 organisms. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life - bacteria, archaea, and eukaryota - are represented, as well as many viruses, phages, viroids, plasmids, and organelles. Entrez Genome provides graphical overviews of complete genomes/chromosomes, and the ability to explore regions of interest in progressively greater detail. ProtTables and TaxTables are provided for organisms on which analyses have been done by NCBI staff. In addition, the Map Viewer, a software component of Entrez Genome, provides views of integrated chromosome maps for a variety of organisms (see additional information about the Map Viewer below). Information about submitting genome data from complete genomes is provided in the Resource Guide section on Submission of complete genomes. After data from complete genomes are submitted, they are made available in Entrez Genome (as complete genomes or chromosomes) and Entrez Nucleotide (as chromosome or genome fragments such as contigs). Entrez Nucleotide also provides access to the records for complete genomes/chromosomes, but the default view of those records in the Nucleotide database is GenBank format, whereas the default view in Entrez Genome is a graphical overview. A companion database, Entrez Genome Project, is described below. |
Entrez Genome Project - a companion database to Entrez Genome (described above). The actual data from genome sequencing projects are contained in Entrez Genome (as complete genomes or chromosomes) and Entrez Nucleotide (as chromosome or genome fragments such as contigs). The Genome Project database, on the other hand, provides an umbrella view of the status of each genome project, links to project data in the other Entrez databases, and links to a variety of other NCBI and external resources associated with a given genome project. A genome project's status can be complete or in-progress, and the project can include large-scale sequencing, assembly, annotation, and mapping efforts. New genome sequencing projects can be registered through the Genome project submission form. More information about the submission of data from complete genomes is provided in the Resource Guide section on Submission of complete genomes. (Although the Entrez Genome Project database does not include viral genome sequencing projects, data from those projects are submitted to GenBank and are available in the Entrez Nucleotide and Entrez Genome databases. There is also a special set of resources at NCBI dedicated to Viral Genomes.) |
Genomes Announcements - To receive announcements about recently completed genomes, see the NCBI Announcements Email Lists page. |
Map Viewer - The Map Viewer is a software component of Entrez Genome (described above) that provides special browsing capabilities for a subset of organisms. It allows you to view and search an organism's complete genome, display chromosome maps, and zoom into progressively greater levels of detail, down to the sequence data for a region of interest. If multiple maps are available for a chromosome, it displays them aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. The organisms currently represented in the Map Viewer are listed on the Map Viewer home page and in the Map Viewer help document, which provides general information on how to use that tool. The number and types of available maps vary by organism, and are described in the "data and search tips" file provided for each organism. |
Entrez Gene - Entrez Gene provides a gene-based view of the data from a wide range of genomes. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. Each record represents a single gene from a given organism. The minimum set of data in a gene record includes a unique identifier or GeneID assigned by NCBI, a preferred symbol, and any of sequence information, map information, or official nomenclature from an authority list. In addition, a gene record can also include expression, structure, functional, and homology data, when available. Entrez Gene includes data from all organisms that have RefSeq genome records (with NC_* accessions, see more info above), and can also include data from recognized genome-specific databases that provide NCBI with information about genes (preferably with defining sequence) or mapped phenotypes. |
UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. UniGene clusters are annotated with mapping and expression information when possible (e.g., for human), and include cross-references to other resources. Sequence data can be downloaded by cluster through the UniGene web pages, or the complete data set can be downloaded from the repository/UniGene directory of the FTP site. In addition, UniGene DDD (described below) can be used to show differential expression of genes between cDNA libraries. The organisms represented in UniGene are listed on the UniGene home page. |
HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. Organisms represented are listed on the HomoloGene home page. |
COGs - Clusters of Orthologous Groups - natural system of gene families from complete genomes. Clusters of Orthologous Groups (COGs) were delineated by comparing protein sequences encoded in complete unicellular genomes representing 30 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. The Initial Version of COGs includes 44 organisms. The Updated Version of COGs includes 66 organisms in the Unicellular Clusters, plus Eukaryotic Clusters (called KOGs). More organisms will be added in the future. |
Download Genomes <350 KB via Entrez Genome pages for individual organisms |
Download Genomes >350 KB from the NCBI ftp site - see FTP information below; ftp links are also available from Entrez Genome pages for individual organisms |
Genome Sequencing Centers - list of genome sequencing centers and the organisms on which they work |
Human Genome |
Guide, Chromosomes, Sequences, Genes, BLAST, Clones, Genome Maps, Mapped Markers, Cytogenetics, Gene Expression, Genetic Variation, Disorders, Cancer Research, FTP |
Guide |
Chromosomes |
Sequences |
Genes |
BLAST against human genomic sequence data |
Clones |
NCBI does not distribute clones. However, some NCBI resources contain information about clones and the sources from which they can be obtained. |
Clone Information for Other (Non-human) Organisms - Some organisms have additional clone information resources. For example, the resources available for the mouse genome include several items mentioned above, plus a CloneFinder, described below. In addition, many records in dbEST (described above) include information about clone sources such as the I.M.A.G.E. consortium. |
Genome Maps |
Mapped Markers |
Cytogenetics |
Gene Expression |
Genetic Variation |
Disorders |
Cancer Research |
FTP |
Mouse Genome |
Guide, Chromosomes, Sequences, Genes, Clones, Maps and Mapped Markers, Cytogenetics, BLAST, FTP |
Guide |
Chromosomes |
Sequences |
Genes |
Clones |
Maps and Mapped Markers |
Cytogenetics |
BLAST |
FTP |
Rat Genome |
Rat Genome Resources Guide - brings together information on diverse rat-related resources from multiple centers: sequence, mapping, and clone information as well as pointers to strain and mutant resources. |
Map Viewer - integrated chromosome maps - The Map Viewer is a software component of Entrez Genome that displays one or more maps which have been aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. The maps that are currently available for rat are described in the Rattus norvegicus data and search tips document. The Map Viewer help document provides general information on how to use that tool. |
Entrez Gene - a gene-based view of the data from a wide range of genomes, including rat. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. More information about Entrez Gene is provided above, in the Molecular Databases/Genes section. |
BLAST against the rat genome - Nucleotide or protein query sequences can be used. A variety of database choices are provided. |
UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. Additional information about UniGene is provided above. |
HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms, including human, mouse, rat, zebrafish, and fruit fly, in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. |
Cow Genome |
UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. Additional information about UniGene is provided above. |
Zebrafish Genome |
Zebrafish Genome Resources Guide - brings together information on diverse zebrafish-related resources from multiple centers: sequence, mapping, and clone information as well as pointers to strain and mutant resources. |
Entrez Gene - a gene-based view of the data from a wide range of genomes, including zebrafish. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. More information about Entrez Gene is provided above, in the Molecular Databases/Genes section. |
Map Viewer - integrated chromosome maps - The Map Viewer is a software component of Entrez Genome that displays one or more maps which have been aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. The maps that are currently available for Danio rerio are described in the Danio rerio genome data and search tips document. The Map Viewer help document provides general information on how to use that tool. |
UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. Additional information about UniGene is provided above. |
HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms, including human, mouse, rat, zebrafish, and fruit fly, in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. |
Drosophila Genome |
Drosophila melanogaster Home Page - provides an overview of available resources for that organism, graphically displays all the chromosomes (to scale), and allows you to search both cytogenetic and sequence data across the whole genome through the Entrez Genome browser. Entrez Genome presents a unified graphical view of maps (genetic and physical) and sequence data for an organism. After you search for a term such as a gene symbol, it presents a graphic Genome View of search results, from which you can zoom into progressively more detailed Map Views of the region of interest, and link to sequence data and associated resources that contain additional detail. |
Map Viewer - integrated chromosome maps - The Map Viewer is a software component of Entrez Genome that displays one or more maps which have been aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. The sequence and cytogenetic maps that are currently available for Drosophila are described in the Drosophila melanogaster genome data and search tips document. The Map Viewer help document provides general information on how to use that tool. |
Entrez Gene - a gene-based view of the data from a wide range of genomes, including Drosophila. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. More information about Entrez Gene is provided above, in the Molecular Databases/Genes section. |
HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms, including human, mouse, rat, zebrafish, and fruit fly, in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. |
BLAST against Drosophila melanogaster genome sequence |
FTP Site - see additional information about the genomes FTP directories, below |
Nematode Genome |
Caenorhabditis elegans Home Page - Graphical representation of chromosomes that can be viewed in their entirety or explored in progressively greater detail in the Map Viewer (described above). Home page also includes links to many related resources, such as sequencing centers, other nematode sequencing projects, related databases, etc. | |
FTP Site - the chromosome data sets are available for ftp in a variety of formats, including GenBank, FASTA, ASN.1, and others, in the genbank/genomes/C_elegans/ directory of the NCBI FTP site (ftp://ftp.ncbi.nih.gov/). An NCBI curated version of the data is available in the genomes/C_elegans/ directory. (See additional note in the FTP section, below, about the two different FTP directories.) |
Plant Genomes |
Plant Genomes Central - provides access to data from large-scale sequencing projects, genetic maps, and large-scale EST sequencing projects. All organism names on the page are linked to the corresponding taxonomic information in NCBI's Taxonomy database (described above). In addition, organisms listed under "large-scale sequencing projects" and "genetic maps" are represented in the Map Viewer (described above). Organisms listed under "large-scale EST sequencing projects" are linked to their EST sequences in Entrez (described above). |
UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. Additional information about UniGene is provided above. |
Yeast Genome |
Saccharomyces cerevisiae Home Page - baker's yeast - graphical representation of chromosomes that can be viewed in their entirety or explored in progressively greater detail in Entrez Genome (described above), with links to associated sequence data. Home page also includes links to many related resources, such as sequencing centers, other fungi sequencing projects, related databases, etc. |
Schizosaccharomyces pombe Home Page - fission yeast - similar to the home page for Saccharomyces cerevisiae, described above. |
COGs - Clusters of Orthologous Groups - natural system of gene families from complete genomes. Clusters of Orthologous Groups (COGs) were delineated by comparing protein sequences encoded in complete unicellular genomes representing 30 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. The Initial Version of COGs includes 44 organisms. The Updated Version of COGs includes 66 organisms in the Unicellular Clusters, plus Eukaryotic Clusters (called KOGs). More organisms will be added in the future. |
BLAST against the Saccharomyces cerevisiae or Schizosaccharomyces pombe genome sequences |
FTP Saccharomyces cerevisiae Chromosomes |
Malaria Genome |
Malaria Genetics & Genomics - provides data and information relevant to malaria genetics and genomics. Resources include organism specific sequence BLAST databases (Plasmodium falciparum only, all Plasmodium, and all Toxoplasma), genome maps, linkage markers, and information about genetic studies. Links are provided for other malaria web sites and genetic data on related apicomplexan parasites, including Toxoplasma gondii. |
Map Viewer - The Map Viewer (described above) provides graphical views and search capabilities for both Plasmodium falciparum and Anopheles gambiae (malaria mosquito). |
BLAST against Malaria sequences |
FTP |
Microbial Genomes |
Entrez Genome - Graphical representation of complete bacterial genomes that can be viewed in their entirety or explored in progressively greater detail; links to associated sequence data. A "ProtTable" of protein coding genes is provided for each bacterium. There are also links to a "TaxTable," showing the distribution of BLAST protein homologs by taxa (sequences grouped by superkingdom), and to a distribution of BLAST protein homologs by 3-D structure (sequences with known structure). Additional information about Entrez Genome is also provided above. |
Entrez Genome Project - provides an umbrella view of the status of a wide range of genome projects, and includes information about microbial genome sequencing projects. Tabs allow you to switch between lists of completed and in-progress microbial genome projects. The list of completed genomes includes links to NCBI graphical views of the data (in Entrez Genome), sequencing centers, and the results of various analyses that have been done on the genomes at NCBI (e.g., TaxTable, COG Table, 3-D Neighbors, and more). The list of in-progress sequencing projects includes links to sequencing centers and, when available, to BLASTable data. A more detailed description of the Entrez Genome Project database is provided in the section on Genomes and Maps/Organism Collections. |
COGs - Clusters of Orthologous Groups - natural system of gene families from complete genomes. Clusters of Orthologous Groups (COGs) were delineated by comparing protein sequences encoded in complete unicellular genomes representing 30 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. The Initial Version of COGs includes 44 organisms. The Updated Version of COGs includes 66 organisms in the Unicellular Clusters, plus Eukaryotic Clusters (called KOGs). More organisms will be added in the future. |
BLAST against Microbial Genomes - sequences from selected completed and unfinished eukaryotic and prokaryotic genomes; partial genomic sequences have been graciously provided by the sequencing centers or extracted from GenBank. NCBI encourages sequencing centers to submit partially sequenced genomes to be included in this BLAST page. Data can be submitted via ftp, after contacting genomes@ncbi.nlm.nih.gov to set up an account. |
FTP - download complete bacterial genomes in a variety of formats, including GenBank flat file (*.gbk), GenBank summary file (*.gbs), FASTA Nucleic Acid file (*.fna), FASTA Amino Acid file (*.faa), Protein Table (*.ptt), and others. (See additional note in the FTP section, below, about the two different FTP directories.) |
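When working with files downloaded from the bacterial genomes FTP directory, the extensions listed above can be used to identify the format of each file. The following is a small sketch; the function name and the sample file names are hypothetical, and only the five extensions named in this guide are covered.

```python
from pathlib import PurePosixPath

# Extension-to-format mapping, taken from the list in this guide.
GENOME_FILE_FORMATS = {
    ".gbk": "GenBank flat file",
    ".gbs": "GenBank summary file",
    ".fna": "FASTA Nucleic Acid file",
    ".faa": "FASTA Amino Acid file",
    ".ptt": "Protein Table",
}


def describe_genome_file(filename):
    """Return the format name for a downloaded file, based on its extension."""
    suffix = PurePosixPath(filename).suffix.lower()
    return GENOME_FILE_FORMATS.get(suffix, "unknown format")


# Hypothetical file names, for illustration only.
for name in ["example_genome.gbk", "example_genome.faa", "README"]:
    print(name, "->", describe_genome_file(name))
```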
Viral Genomes |
Viruses Home Page - provides brief background information on the biology of viruses, links to viral genome sequences in Entrez Genome (described below), and a wide range of related resources. It also includes information about Viral Reference Sequences, a collection of reference sequences for more than 1000 viral genomes. |
Entrez Genome - Graphical representation of complete viral genomes that can be viewed in their entirety or explored in progressively greater detail; links to associated sequence data. A summary of Coding Regions (described above) is provided for each virus. Additional information about Entrez Genome is also provided above. |
Influenza Virus Resource - A collection of resources specifically designed to support research on the flu virus. Includes links to genome sequence data, analytical tools, epidemiological information, and the Influenza Genome Sequencing Project, funded by the National Institute of Allergy and Infectious Diseases (NIAID). |
Retrovirus Resources - A collection of resources specifically designed to support research on retroviruses. Resources include a genotyping tool that uses the BLAST algorithm to identify the genotype of a query sequence; an alignment tool for global alignment of multiple sequences; an HIV-1 automatic sequence annotation tool; and annotated maps of 16 retroviruses viewable in GenBank, FASTA, and graphic formats, with links to associated sequence records. |
HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data. More information about this database is provided under "Literature Databases". |
PASC (PAirwise Sequence Comparison) - a web tool for analysis of pairwise identity distribution within viral families. The identities are pre-computed for every pair of genomes within a family, and their distribution is plotted as a histogram in which each bar corresponds to an interval of identities. Only complete genomes should be used as query sequences; results from partial sequences are not suitable for the purpose of this tool. After you submit your sequence, PASC computes pairwise identities between the submitted genome and the existing genome sequences of the family. At the end of the process, you are presented with a list of the 15 closest matches to the genome within the family. The documentation provides more details about using PASC. |
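The two outputs described for PASC, a histogram of pairwise identities and a list of the closest matches, can be sketched as follows. This is only an illustration under stated assumptions: identities are taken to be percentages from 0 to 100, the bin width of 10 is arbitrary, and the function names and sample values are hypothetical. How PASC itself computes identities is not covered here.

```python
from collections import Counter


def identity_histogram(identities, bin_width=10):
    """Count pairwise identities (percent, 0-100) per interval.

    Returns {interval_start: count}. An identity of exactly 100 is counted
    in the last interval rather than opening a new one.
    Assumes bin_width divides 100 evenly.
    """
    counts = Counter()
    for pct in identities:
        start = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        counts[start] += 1
    return dict(sorted(counts.items()))


def closest_matches(query_identities, n=15):
    """Return the n genomes with the highest identity to the query, best first.

    query_identities: {genome_name: percent identity to the query genome}.
    """
    ranked = sorted(query_identities.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


# Hypothetical values, for illustration only.
histogram = identity_histogram([5, 12.5, 18, 95, 100])
top_two = closest_matches({"g1": 80.0, "g2": 91.5, "g3": 70.0}, n=2)
```

With these sample values, `histogram` has one identity in the 0-10 interval, two in 10-20, and two in 90-100, and `top_two` lists g2 before g1.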
Viroid Genomes |
Entrez Genome - Graphical representation of complete viroid genomes with links to corresponding sequence records. Additional information about Entrez Genome is also provided above. |
Plasmids |
Entrez Genome - Graphical representation of complete plasmids that can be viewed in their entirety or explored in progressively greater detail; links to associated sequence data. A summary of Coding Regions (described above) is provided for each plasmid. Additional information about Entrez Genome is also provided above. |
Eukaryotic Organelles |
Eukaryotic Organelles Home Page - Provides an overview of eukaryotic organelles; a description of the Organelle Reference Sequences project (part of RefSeq, see above); and links to (a) lists of completely sequenced organelles shown in taxonomic hierarchy and alphabetically by organism, (b) gene and RNA order in metazoan mitochondria, and (c) related web sites. |
Entrez Genome - Graphical representation of complete eukaryotic organelles that can be viewed in their entirety or explored in progressively greater detail; links to associated sequence data. A summary of Coding Regions (described above) is provided for each organelle. Additional information about Entrez Genome is also provided above. |
Tools | Overview |
Text Term Searching (Entrez), Sequence Similarity Searching (BLAST), Nucleotide Sequence Analysis, Protein Sequence Analysis and Proteomics, 3-D Structure Display and Similarity Searching, Genome Analysis, Gene Expression |
Data Retrieval - Text Term Searching |
Entrez - provides integrated access to nucleotide and protein sequence data from >160,000 organisms, along with 3D protein structures, genomic mapping information, PubMed MEDLINE, and more. Sequence data are combined from various sources, including GenBank, EMBL, DDBJ, RefSeq, PIR-International, PRF, Swiss-Prot, and PDB. A Data Model provides a schematic illustration of the connections between the many data types in Entrez. |
Batch Entrez - allows you to retrieve a large number of nucleotide sequences or protein sequences from Entrez, in a batch mode, by importing a file containing a list of the desired GI or accession numbers. Search results are saved directly to a local disk file on your computer. |
Entrez Utilities - Entrez Programming Utilities, also called E-Utilities, are tools that provide access to Entrez data outside of the regular web query interface. They represent a method of making WWW links to Entrez. Each utility performs a specialized retrieval task, and can be used simply by writing a specially formatted URL. For example, EFetch retrieves records in the requested format from a list of one or more primary IDs or from the user's environment. The E-Utilities web page describes the available utilities and links to a brief help document for each one. E-Utilities can be helpful for retrieving search results for future use in another environment. To receive announcements about Entrez Utilities, see the NCBI Email Lists page. |
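To make the idea of a "specially formatted URL" concrete, here is a minimal sketch that builds an EFetch URL without sending any request. The base URL and the parameter names (db, id, rettype, retmode) are not given in this guide; they are assumptions based on the publicly documented E-Utilities interface and should be checked against the current E-Utilities help pages. The IDs shown are placeholders, not real records.

```python
from urllib.parse import urlencode

# Assumption: public E-Utilities base URL (not stated in this guide).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, ids, rettype="fasta", retmode="text"):
    """Build an EFetch URL for one or more primary IDs.

    db, rettype and retmode are passed through unchanged; the IDs are
    joined into a single comma-separated 'id' parameter.
    """
    params = {
        "db": db,
        "id": ",".join(str(i) for i in ids),
        "rettype": rettype,
        "retmode": retmode,
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Placeholder IDs, for illustration only.
url = efetch_url("nucleotide", [12345, 67890])
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client. Building the URL separately from fetching it makes it easy to inspect or log the request first.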
LinkOut - a registry service to create links from specific articles, journals, or biological data in Entrez (described above) to resources on external web sites. Third parties can provide a URL, resource name, brief description of their web site, and specification of the NCBI data from which they would like to establish links. The specification can be written as a valid Boolean query to Entrez, or as a list of identifiers for specific articles or sequences. Entrez PubMed users can then select which external links are visible in their searches, through the My NCBI service (described below). To receive announcements about updates and new features in LinkOut, see the NCBI Announcements Email Lists page. |
My NCBI - Formerly known as "Cubby", My NCBI allows Entrez users to store and update searches, receive automatic e-mails of search updates, select the Filter folder tabs shown by default for any Entrez database, and customize their LinkOut (described above) display to include or exclude links to providers. My NCBI requires that your system accept cookies. You must also complete a brief registration form in which you select a username and password. You will need those in order to access your "My NCBI" account. There is also an option to remain logged into My NCBI, if desired. For additional information, see the help document and tutorial. |
Query E-mail Server - The Query server, which provided e-mail access to a subset of Entrez databases, was discontinued on April 15, 2002 because of limited usage. Almost all Entrez searchers now use the WWW Entrez interface, described above. It provides access to more databases and more features than are possible through the e-mail interface. |
Citation Matcher - allows you to find the PubMed ID of any article in the PubMed database, given its bibliographic information (journal, volume, page, etc.). |
Sequence Similarity Searching |
BLAST Announcements - To receive announcements about updates and new features, and advance notices about upcoming changes in the NCBI BLAST service, see the NCBI Announcements Email Lists page. |
BLAST 2.x - A version of BLAST (Altschul, et al., 1997) that permits gaps in the alignments it produces. Assessments of statistical significance are based upon prior simulations using random sequences. (more...) |
QBLAST - A queuing system that allows users to retrieve Gapped BLAST results at their convenience and format their results multiple times with different formatting options. This system also allows the NCBI to more efficiently use computational resources, better serving the community. As of Fall 1999, the QBLAST system is used for all BLAST searches. (more...) |
MegaBLAST - permits searching with batches of ESTs or with large cDNA or genomic sequences. (more...) |
PHI-BLAST - Pattern Hit Initiated BLAST (Zhang, et al., 1998) - A program to search a protein database using a protein query, seeking only alignments that preserve a specified pattern contained within the query. (more...) |
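To make the idea of a "specified pattern contained within the query" concrete, the sketch below translates a pattern into a regular expression and locates it in a query. It assumes the pattern is written in PROSITE-style syntax and supports only a small subset of it (single residues, x, [..], {..}, and repeat counts); the function names and the example sequence are hypothetical, and no BLAST search is performed.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden}, with an optional (n) or (n,m) repeat.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as 'C-x(2,4)-[ST]-{P}' into a Python regex."""
    parts = []
    for element in pattern.strip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                 # x matches any residue
            core = "."
        elif core.startswith("{"):      # {P} means any residue except P
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


def find_pattern_hits(query, pattern):
    """Return (start, matched_text) for each non-overlapping occurrence of the pattern."""
    regex = prosite_to_regex(pattern)
    return [(m.start(), m.group()) for m in re.finditer(regex, query)]


hits = find_pattern_hits("MKCAAASGL", "C-x(2,4)-[ST]-{P}")
print(hits)
```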
PSI-BLAST - Position-Specific Iterated BLAST (Altschul, et al., 1997) - A program for searching protein databases using protein queries, in order to find other members of the same protein family. All statistically significant alignments found by BLAST are combined into a multiple alignment, from which a position-specific score matrix is constructed. This matrix is used to search the database for additional significant alignments, and the process may be iterated until no new alignments are found. (more...) |
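The step in which a multiple alignment becomes a position-specific score matrix can be sketched as follows. This is a deliberately simplified model, assuming a gapless toy alignment, a uniform background frequency, and a flat pseudocount; PSI-BLAST's actual sequence weighting and pseudocount scheme are not reproduced here, and all names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build log-odds scores per column from equal-length aligned sequences.

    Assumes a uniform background (1/20) and a flat pseudocount per residue.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment of the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy alignment standing in for the significant hits of one search round.
alignment = ["ACDW", "ACEW", "ACDW"]
pssm = build_pssm(alignment)
print(round(score_segment(pssm, "ACDW"), 2), round(score_segment(pssm, "WWWW"), 2))
```

In an iterated search, segments scoring above a threshold would be added to the alignment and the matrix rebuilt, which is the loop the description above refers to.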
RPS-BLAST - Reverse Position-Specific BLAST - A program used to identify conserved domains in a protein query sequence. It does this by comparing a query protein sequence to position-specific score matrices that have been prepared from conserved domain alignments. The service is accessible through Conserved Domain Search (CD-Search), described below. A readme file provides additional detail about the RPS-BLAST program.
Note: RPS-BLAST is a "reverse" version of position-specific iterated BLAST (PSI-BLAST), described above. Both RPS-BLAST and PSI-BLAST use multiple alignments and position-specific score matrices (PSSMs) to derive conserved features of a protein family. However, RPS-BLAST compares a query sequence against a database of profiles prepared from ready-made alignments, while PSI-BLAST builds alignments starting from a single protein sequence. The programs also differ in purpose: RPS-BLAST is used to identify conserved domains in a query sequence, while PSI-BLAST is used to identify other members of the protein family to which a query sequence belongs. |
Taxonomy BLAST - an implementation of Gapped BLAST (2.x) that groups hits by source organism, according to information in NCBI's Taxonomy database. Species are listed in order of sequence similarity to the query sequence, with the strongest match listed first. Three report views are available:
|
Primer-BLAST - find primers specific to a PCR template. |
BLAST 2 Sequences - A BLAST-based tool for aligning two nucleotide or protein sequences, producing a pairwise DNA-DNA or protein-protein sequence comparison. (more...) |
IgBLAST - IgBLAST was developed to facilitate analysis of immunoglobulin sequences in GenBank. It allows blastp or blastn searches of either the nr database or a special database of Immunoglobulin (Ig) germline V (variable region) genes. Searches may be limited to either human or mouse genes. IgBLAST performs three main functions: (1) reports the variable, D, or J regions that most closely match the query sequence; (2) annotates the immunoglobulin domains (FWR1 through FWR3) according to Kabat et al.; and (3) for searches against the nucleotide nr or protein nr database, simplifies the process of identifying related sequences by matching the IgBLAST hits to the closest germline V genes. (more...) |
BLink - BLink ("BLAST Link") displays the results of BLAST searches that have been done for every protein sequence in the Entrez Proteins data domain. To access it, follow the BLink link displayed beside any hit in the results of an Entrez Proteins search. In contrast to Entrez's "Related Sequences" feature, which lists the titles of similar sequences, BLink displays the graphical output of pre-computed blastp results against the protein non-redundant (nr) database. The output includes the positions of up to 200 BLAST hits on the query sequence, scores, and alignments. (View sample BLink output for human MLH1 protein.) BLink offers a variety of display options, including the distribution of hits by taxonomic grouping, the best hit to each organism, the protein domains in the query sequence, similar sequences that have known 3-D structures, and more. Additional options allow you to specify which taxa you would like to exclude, increase or decrease the BLAST cutoff score, or filter the BLAST hits to show only those from a specific source database, such as RefSeq or Swiss-Prot. See the BLink help document for additional information. |
BLAST E-mail server - an e-mail-based sequence similarity search service; this was discontinued on June 17, 2002 because of limited usage. Most BLAST searches are now done through the BLAST web page. |
Network BLAST - a TCP/IP-based client-server version of BLAST. Makes a direct connection with the NCBI databases over the Internet to retrieve data. No web browser is required. Client software is available for PC, Mac, and Unix on the FTP site at ftp://ftp.ncbi.nih.gov/blast/blastcl3/ |
Stand-alone BLAST - download BLAST executables for local use from ftp://ftp.ncbi.nih.gov/blast/executables/. Binaries are provided for IRIX 6.2, Solaris 2.6, DEC OSF1 (ver. 4.0d), LINUX, and Win32 systems. Please read the README file in the ftp directory for more information. BLAST databases are also available for downloading. There is also some information on setting up Standalone BLAST at the NHGRI site at http://genome.nhgri.nih.gov/blastall/blast_install. |
Nucleotide Sequence Analysis |
BLAST - see sequence similarity searching, above, for a complete list of BLAST programs. |
e-PCR - Electronic PCR - compare a query sequence to a database of mapped sequence-tagged sites (STSs) to find a possible map location for the query sequence, or compare a query STS to a database of nucleotide sequences to identify the sequences that contain the STS. e-PCR can be used on the WWW, or the software can be downloaded from the /pub/schuler/e-PCR directory of the NCBI ftp site. Additional information is provided by Schuler, G.D. There are two versions of e-PCR:
|
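The STS lookup described in the e-PCR entry above can be sketched as a search for a primer pair at roughly the expected spacing. This is a minimal sketch, assuming an STS is given as a forward primer, a reverse primer, and an expected product size, and that only exact primer matches count; the real program's mismatch handling and indexing are not modeled, and all names and sequences are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sts(sequence, forward, reverse, expected_size, margin=50):
    """Return (start, product_size) for each exact primer-pair hit near the expected size."""
    sequence, forward = sequence.upper(), forward.upper()
    # The reverse primer anneals to the opposite strand, so search for its reverse complement.
    reverse_site = reverse_complement(reverse.upper())
    hits = []
    start = sequence.find(forward)
    while start != -1:
        end = sequence.find(reverse_site, start + len(forward))
        while end != -1:
            size = end + len(reverse_site) - start
            if abs(size - expected_size) <= margin:
                hits.append((start, size))
            end = sequence.find(reverse_site, end + 1)
        start = sequence.find(forward, start + 1)
    return hits


# Toy template: forward primer, 20 nt of filler, then the reverse primer's binding site.
template = "TT" + "ACGTAC" + "T" * 20 + "GGATCA" + "CC"
sts_hits = find_sts(template, "ACGTAC", "TGATCC", expected_size=32, margin=5)
print(sts_hits)
```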
Entrez Gene - as described in the Molecular Databases/Genes section of this guide, each Entrez Gene record encapsulates a wide range of information for a given gene and organism. When possible, the information includes results of analyses that have been done on the sequence data. The amount and type of information presented depend on what is available for a particular gene and organism and can include: (1) graphic summary of the genomic context, intron/exon structure, and flanking genes, (2) link to a graphic view of the mRNA sequence, which in turn shows biological features such as CDS, SNPs, etc., (3) links to gene ontology and phenotypic information, (4) links to corresponding protein sequence data and conserved domains, (5) links to related resources, such as mutation databases. |
Malaria Genetics and Genomics - provides data and information relevant to malaria genetics and genomics. Resources include organism specific sequence BLAST databases (Plasmodium falciparum only, all Plasmodium, and all Toxoplasma). More about the Malaria genome resources below. |
Model Maker - allows you to view the evidence (mRNAs, ESTs, and gene predictions) that was aligned to assembled genomic sequence in order to build a gene model, and to edit the model by selecting or removing putative exons. You can then view the mRNA sequence and potential ORFs for the edited model, and save the mRNA sequence data for use in other programs. Model Maker is accessible from sequence maps that were analyzed at NCBI and displayed in Map Viewer (described above). To see an example, follow the "mm" link beside any gene annotated on the human "Gene_Sequence" map in the Map Viewer. (More info about human data in Map Viewer is given above.) |
ORF Finder - graphical analysis tool which finds all open reading frames of a selected minimum size in a user's sequence or in a sequence already in the database. Designed for prokaryotic sequences. Identifies all open reading frames using the standard or alternative genetic codes. The deduced amino acid sequence can be saved in various formats and searched against the sequence database using the WWW BLAST server. The ORF Finder is also packaged with the Sequin sequence submission software. The stand-alone program can be downloaded from the NCBI ftp site. |
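The search for open reading frames of a minimum size can be sketched as a scan of all six reading frames. The sketch assumes the standard genetic code, treats an ORF as running from an ATG to the next in-frame stop codon, and reports coordinates only; it is an illustration of the idea, not the ORF Finder implementation, and the names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(sequence, min_length=30):
    """Return (strand, frame, start, end) for ATG..stop ORFs of at least min_length nt.

    Coordinates are 0-based on the strand being scanned; end is exclusive and
    includes the stop codon.
    """
    orfs = []
    forward = sequence.upper()
    for strand, seq in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if pos + 3 - start >= min_length:
                        orfs.append((strand, frame, start, pos + 3))
                    start = None                     # look for the next ORF in this frame
    return orfs


# Toy sequence: ATG, ten GCT codons, then a TAA stop.
example = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
orfs = find_orfs(example, min_length=30)
print(orfs)
```

An alternative genetic code would be handled by swapping the stop-codon set (and, for translation, the codon table).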
ProtEST - a tool that presents a graphical view of matches between nucleotide sequences in UniGene and possible translational products. To generate the alignments, the 6-frame translations of mRNA and EST sequences in UniGene are compared to protein sequences using BLASTX with -e 1e-6. The translated nucleotide sequences are compared with proteins from a number of model organisms and the best match in each organism is recorded. ProtEST links are displayed in UniGene (description) reports in the section on model organism protein similarities. |
PASC (PAirwise Sequence Comparison) - a web tool for analysis of pairwise identity distribution within viral families. The identities are pre-computed for every pair within the families, and the distribution is plotted as a histogram in which each bar corresponds to an interval of identities. Only complete genomes should be used as query sequences. The results from partial sequences are not suitable for the purpose of this tool. After you submit your sequence, PASC will start computing pairwise identities between the external genome and the existing genome sequences of the family. At the end of the process, you will be presented with a list of the 15 closest matches to the genome within the family. The documentation provides more details about using PASC. |
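The identity histogram can be sketched as below. The sketch assumes the sequences are already aligned to equal length (with '-' for gaps) and uses 10% intervals; how PASC itself aligns genomes and chooses its intervals is not stated here, so both are assumptions, as are the function names and toy data.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap ('-')."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def identity_histogram(aligned, bin_width=10):
    """Count pairwise identities per interval; keys are the lower bound of each interval."""
    counts = {}
    for a, b in combinations(aligned, 2):
        identity = percent_identity(a, b)
        # 100% identity is folded into the top interval rather than opening a new one
        lower = min(int(identity // bin_width) * bin_width, 100 - bin_width)
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Toy pre-aligned sequences standing in for complete genomes of one family.
family = ["ACGTACGTAC", "ACGTACGTAA", "TTTTACGTAC"]
histogram = identity_histogram(family)
print(histogram)
```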
Retroviruses Resources - A collection of resources specifically designed to support the research of retroviruses. Resources include a genotyping tool that uses the BLAST algorithm to identify the genotype of a query sequence; an alignment tool for global alignment of multiple sequences; an HIV-1 automatic sequence annotation tool; and annotated maps of 16 retroviruses viewable in GenBank, FASTA, and graphic formats, with links to associated sequence records. |
SAGEmap - SAGEmap provides a tool for performing statistical tests designed specifically for differential-type analyses of SAGE (Serial Analysis of Gene Expression) data. The data include SAGE libraries generated by individual labs as well as those generated by the Cancer Genome Anatomy Project (CGAP, described above), which have been submitted to Gene Expression Omnibus (GEO, described above). Gene expression profiles that compare the expression in different SAGE libraries are also available on the Entrez GEO Profiles pages. It is possible to enter a query sequence in the SAGEmap resource to determine what SAGE tags are in the sequence, then map to associated SAGEtag records and view the expression of those tags in different CGAP SAGE libraries. |
Spidey - mRNA-to-genomic alignment program that was designed to find good alignments regardless of intron size, and to avoid getting confused by nearby pseudogenes and paralogs. It uses a combination of alignment algorithms and heuristics to construct its models. Spidey has been optimized for both intraspecies and interspecies alignments. (See Spidey documentation for more information.) |
Splign - a utility for computing cDNA-to-Genomic, or spliced sequence alignments. It is based on a variation of the Needleman–Wunsch global alignment algorithm and specifically accounts for introns and splice signals. It is due to this algorithm that Splign is accurate in determining splice sites and tolerant to sequencing errors. Splign also uses BLAST hits to identify possible locations of genes and their duplications on genomic sequences and to speed up the core dynamic programming. (See Splign documentation for more information.) |
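For orientation, the sketch below is the plain Needleman-Wunsch global alignment score with a linear gap penalty. It is the textbook algorithm only: Splign's variation, which accounts for introns and splice signals, is not reproduced, and the scoring values are arbitrary assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of a and b with a linear gap penalty."""
    # previous[j] holds the best score for aligning the processed prefix of a with b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


print(needleman_wunsch("GATTACA", "GATTACA"))
print(needleman_wunsch("ACGT", "AGT"))
```

A spliced aligner would add a separate, much cheaper penalty for long gaps in the cDNA that begin and end at plausible splice signals, which is the part this sketch leaves out.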
UniGene DDD - Digital Differential Display - an online tool to compare computed gene expression profiles between selected cDNA libraries. Using a statistical test, genes whose expression levels differ significantly from one tissue to the next are identified and shown to the user. Additional information about UniGene is in the Molecular Databases/Genes section. |
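The entry above says only that "a statistical test" is used. As one plausible illustration, the sketch below applies a two-sided Fisher exact test to the counts of one gene's ESTs in two libraries; the choice of test, the table layout, and the counts are assumptions made for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a, c: ESTs assigned to the gene in library 1 and library 2
    b, d: all other ESTs in library 1 and library 2
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        # Hypergeometric probability of seeing x gene ESTs in library 1
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one
    return sum(
        p for p in (probability(x) for x in range(low, high + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical counts: 30 of 1000 ESTs in one library versus 5 of 1000 in the other.
p_value = fisher_exact_two_sided(30, 970, 5, 995)
print(p_value < 0.001)
```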
VecScreen - a tool for identifying segments of a nucleic acid sequence that may be of vector, linker or adapter origin prior to sequence analysis or submission. VecScreen was developed to combat the problem of vector contamination in public sequence databases. It is also useful to run a new sequence through VecScreen before performing any kind of analysis on the sequence, since the presence of vector sequences can lead to misleading BLAST hits, etc. VecScreen compares a query sequence against the UniVec database, described above. |
Protein Sequence Analysis and Proteomics |
BLAST - see sequence similarity searching, above, for a complete list of BLAST programs. |
BLink - BLink ("BLAST Link") displays the results of BLAST searches that have been done for every protein sequence in the Entrez Proteins data domain. To access it, follow the BLink link displayed beside any hit in the results of an Entrez Proteins search. In contrast to Entrez's "Related Sequences" feature, which lists the titles of similar sequences, BLink displays the graphical output of pre-computed blastp results against the protein non-redundant (nr) database. The output includes the positions of up to 200 BLAST hits on the query sequence, scores, and alignments. (View sample BLink output for human MLH1 protein.) BLink offers a variety of display options, including the distribution of hits by taxonomic grouping, the best hit to each organism, the protein domains in the query sequence, similar sequences that have known 3-D structures, and more. Additional options allow you to specify which taxa you would like to exclude, increase or decrease the BLAST cutoff score, or filter the BLAST hits to show only those from a specific source database, such as RefSeq or Swiss-Prot. See the BLink help document for additional information. |
CD-Search - The Conserved Domain Search Service (CD-Search) can be used to identify the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (described above) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD) (described above). Hits can be displayed as a pairwise alignment of the query sequence with a representative domain sequence, or as a multiple alignment. Alignments are also mapped to known 3-dimensional structures, and can be displayed using Cn3D (described above). In the Cn3D display, residues in sequence alignments are variously colored, based on their degree of conservation. (more...) |
COGnitor - compare your sequence to the COGs database (described above) to identify the cluster of orthologous groups to which it belongs. A stand-alone dignitor program is also available. It runs cognitor in batch mode, comparing a large group of proteins to the COGs database, and can be downloaded from the ftp site. |
Conserved Domain Architecture Retrieval Tool (CDART) - When given a protein query sequence, CDART displays the functional domains that make up the protein and lists proteins with similar domain architectures. The functional domains for a sequence are found by comparing the protein sequence to a database of conserved domain alignments, CDD (described above), using RPS-BLAST (described above). |
Open Mass Spectrometry Search Algorithm (OMSSA) - a public search service that allows proteomics researchers to submit the mass spectra of peptides and proteins for identification. OMSSA then compares these mass spectra to theoretical ions generated from databases of known protein sequences and then ranks the results using a score derived from classical hypothesis testing. References available from the OMSSA home page describe the OMSSA algorithm and its validation. |
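The comparison of observed spectra with "theoretical ions" can be illustrated with singly charged b- and y-ions of a peptide. The sketch assumes unmodified residues and standard monoisotopic masses, and it merely counts peaks that fall within a tolerance; OMSSA's actual scoring, described in the references on its home page, is not reproduced, and the function names are hypothetical.

```python
# Monoisotopic residue masses in daltons (unmodified amino acids).
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as singly charged m/z values for a peptide."""
    masses = [MONOISOTOPIC[aa] for aa in peptide]
    # b-ions keep the N-terminal residues; y-ions keep the C-terminal residues plus water
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions, y_ions


def count_matches(observed_peaks, theoretical, tolerance=0.5):
    """Count theoretical ions that have an observed peak within the tolerance."""
    return sum(
        any(abs(peak - ion) <= tolerance for peak in observed_peaks)
        for ion in theoretical
    )


b_ions, y_ions = fragment_ions("GAS")
matched = count_matches([58.03, 106.05, 300.0], b_ions + y_ions)
print(matched)
```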
ProtEST - a tool that presents a graphical view of matches between nucleotide sequences in UniGene and possible translational products. To generate the alignments, the 6-frame translations of mRNA and EST sequences in UniGene are compared to protein sequences using BLASTX with -e 1e-6. The translated nucleotide sequences are compared with proteins from a number of model organisms and the best match in each organism is recorded. ProtEST links are displayed in UniGene (description) reports in the section on model organism protein similarities. |
TaxPlot - a tool for 3-way comparisons of genomes on the basis of the protein sequences they encode. To use TaxPlot, one selects a reference genome to which two other genomes are compared. Pre-computed BLAST results are then used to plot a point for each predicted protein in the reference genome, based on the best alignment with proteins in each of the two genomes being compared. |
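The plotting rule can be sketched as one (x, y) point per reference protein, where x and y are the best scores against each of the two compared genomes. The input layout, the use of 0 for "no hit", and all identifiers below are assumptions made for illustration; the sketch does not read real pre-computed BLAST results.

```python
def taxplot_points(hits_a, hits_b):
    """Return {reference_protein: (best_score_vs_a, best_score_vs_b)}.

    Each hits list holds (reference_protein, subject_protein, score) tuples
    against one of the two compared genomes; a missing hit is plotted as 0.
    """
    def best(hits):
        scores = {}
        for reference, _subject, score in hits:
            scores[reference] = max(score, scores.get(reference, 0))
        return scores

    best_a, best_b = best(hits_a), best(hits_b)
    return {
        reference: (best_a.get(reference, 0), best_b.get(reference, 0))
        for reference in sorted(set(best_a) | set(best_b))
    }


# Hypothetical pre-computed hits for two reference proteins.
against_a = [("ref1", "a7", 250), ("ref1", "a9", 310), ("ref2", "a2", 80)]
against_b = [("ref1", "b4", 120)]
points = taxplot_points(against_a, against_b)
print(points)
```

Points near the diagonal are proteins equally similar to both genomes; points far from it are closer to one genome than the other.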
3-D Structure Display and Similarity Searching |
Cn3D - "See in 3-D," a structure and sequence alignment viewer for NCBI databases. It allows viewing of 3-D structures and sequence-structure or structure-structure alignments. Cn3D can work as a helper application to your browser, or as a client-server application that retrieves structure records from MMDB (described above) directly over the internet. The Cn3D home page provides access to information on how to install the program, a tutorial to get started, and a comprehensive help document. |
VAST - Vector Alignment Search Tool - a computer algorithm developed at NCBI and used to identify similar protein 3-dimensional structures. The "structure neighbors" for every structure in MMDB are pre-computed and accessible via links on the MMDB Structure Summary pages. These neighbors can be used to identify distant homologs that cannot be recognized by sequence comparison alone. |
VAST Search - structure-structure similarity search service. Compares 3D coordinates of a newly determined protein structure to those in the MMDB/PDB database. VAST Search computes a list of structure neighbors that you may browse interactively, viewing superpositions and alignments by molecular graphics. |
CD-Search - The Conserved Domain Search Service (CD-Search) can be used to identify the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (described above) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD) (described above). Hits can be displayed as a pairwise alignment of the query sequence with a representative domain sequence, or as a multiple alignment. Alignments are also mapped to known 3-dimensional structures, and can be displayed using Cn3D (described above). In the Cn3D display, residues in sequence alignments are variously colored, based on their degree of conservation. |
Threading - As part of NCBI's Computational Biology Branch (described above), the Structure group, led by Dr. Steve Bryant, conducts research in protein threading. Protein threading predicts the three-dimensional structure of a protein sequence by threading it through known structures and calculating its energy. The experimental software developed by the NCBI Structure group is available on the FTP site. A readme file provides more information as well as references. |
Genome Analysis Tools |
Entrez Genome - whole genomes of over 1000 organisms. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life - bacteria, archaea, and eukaryota - are represented, as well as many viruses, phages, viroids, plasmids, and organelles. Entrez Genome provides graphical overviews of complete genomes/chromosomes, and the ability to explore regions of interest in progressively greater detail. ProtTables and TaxTables are provided for organisms on which analyses have been done by NCBI staff. |
Map Viewer - shows integrated views of chromosome maps for many organisms. Used to view the NCBI assembly of complete genomes, including human, Map Viewer is a valuable tool for the identification and localization of genes, particularly those that contribute to diseases. Additional information about Map Viewer is provided in the Genomes and Maps section of this guide. |
SKY/M-FISH & CGH Database - The NCI and NCBI SKY/M-FISH and CGH Database is a repository of publicly submitted data from Spectral Karyotyping (SKY), Multiplex Fluorescence In Situ Hybridization (M-FISH), and Comparative Genomic Hybridization (CGH), which are complementary fluorescent molecular cytogenetic techniques. SKY/M-FISH permits the simultaneous visualization of each human or mouse chromosome in a different color, facilitating the identification of chromosomal aberrations; CGH can be used to generate a map of DNA copy number changes in tumor genomes. Collaborative project with the National Cancer Institute. (data submission instructions...) |
PASC (PAirwise Sequence Comparison) - a web tool for analysis of pairwise identity distribution within viral families. The identities are pre-computed for every pair within the families, and the distribution is plotted as a histogram in which each bar corresponds to an interval of identities. Only complete genomes should be used as query sequences. The results from partial sequences are not suitable for the purpose of this tool. After you submit your sequence, PASC will start computing pairwise identities between the external genome and the existing genome sequences of the family. At the end of the process, you will be presented with a list of the 15 closest matches to the genome within the family. The documentation provides more details about using PASC. |
Retrovirus Resources - A collection of resources specifically designed to support the research of retroviruses. Resources include a genotyping tool that uses the BLAST algorithm to identify the genotype of a query sequence; an alignment tool for global alignment of multiple sequences; an HIV-1 automatic sequence annotation tool; and annotated maps of 16 retroviruses viewable in GenBank, FASTA, and graphic formats, with links to associated sequence records. |
Gene Expression Tools |
Gene Expression Omnibus (GEO) - provides several tools to assist with the visualization and exploration of GEO data. Datasets may be viewed as hierarchical cluster heat maps, providing insight into the relationships between samples and co-regulated genes. Individual gene expression profiles showing significant differences between experimental subsets may be located using average subset rank value comparisons. Related gene expression profiles may be identified on the basis of sequence similarity, profile similarity, or homology. Indicators of dataset normalization quality are provided as distribution graphs, and by flagging outliers. Links to other NCBI sequence, mapping and publication database resources are provided where possible. (More information about GEO is provided in the Molecular Databases/Gene Expression section of this file.) |
SAGEmap - SAGEmap provides a tool for performing statistical tests designed specifically for differential-type analyses of SAGE (Serial Analysis of Gene Expression) data. The data include SAGE libraries generated by individual labs as well as those generated by the Cancer Genome Anatomy Project (CGAP, described above), which have been submitted to Gene Expression Omnibus (GEO, described above). Gene expression profiles that compare the expression in different SAGE libraries are available on the Entrez GEO Profiles pages. It is also possible to enter a query sequence in the SAGEmap resource to determine what SAGE tags are in the sequence, then map to associated SAGEtag records and view the expression of those tags in different CGAP SAGE libraries. (More information about SAGEmap is provided in the Molecular Databases/Gene Expression section of this file.) |
Cancer Genome Anatomy Project (CGAP) - an interdisciplinary program to identify the human genes expressed in different cancerous states, based on cDNA (EST) libraries, and to determine the molecular profiles of normal, precancerous, and malignant cells. CGAP is a collaboration among the National Cancer Institute, the NCBI, and numerous research labs. (Related resources are listed under human genome/cancer research.) The following tools are provided by the National Cancer Institute (NCI) through their CGAP web page:
|
UniGene DDD - Digital Differential Display - an online tool to compare computed gene expression profiles between selected cDNA libraries. Using a statistical test, genes whose expression levels differ significantly from one tissue to the next are identified and shown to the user. Additional information about UniGene is in the Molecular Databases/Genes section. |
Research at NCBI | Overview |
Computational Biology Branch Home Page - Overview of the research program in the Computational Biology Branch (CBB) of NCBI and a list of Senior Investigators. The research programs focus on theoretical, analytical, and applied approaches to a broad range of fundamental problems in molecular biology, including biomolecular structures, genome analysis, theory of sequence analysis, hardware design, software and database design, and text retrieval and document analysis. |
Senior Investigators in PubMed - publications written by senior investigators in the NCBI Computational Biology Branch and represented in the PubMed database. The PubMed records include links to publisher web sites and/or full text articles when available. |
Seminar Schedule - Seminars held at NCBI on a wide range of molecular biology and mathematical topics. These seminars are open to the NIH community and the general public, and are presented by NCBI staff as well as visiting scientists. |
Postdoctoral Fellowships - general information, application procedure |
Software Engineering | Overview |
Information Engineering Branch Home Page - Overview of the functions of the Information Engineering Branch (IEB) of NCBI, which is responsible for designing and building NCBI's production software and databases. |
NCBI ToolBox - Supported software tools from IEB. Describes the three components of the ToolBox: data model, data encoding, and programming libraries. Provides access to documentation for the data model, C toolkit, C++ toolkit, NCBI Toolkit Source Browser, XML demo program, XML DTDs, and the FTP site. Additional information about the FTP site is provided below. |
R&D Projects - The IEB Research and Development Area is a place for IEB projects and datasets which may never become fully supported NCBI resources. This includes early prototypes of software, results of early or one-off analyses, tools that are fully functional but not integrated into the main, public NCBI systems, or datasets that may have some value but do not fit well into the main NCBI pages. |
ASN.1 - The software in the NCBI ToolBox is primarily designed to read Abstract Syntax Notation 1 (ASN.1) format records, an International Standards Organization (ISO) data representation format. The readme files in the toolbox and toolbox/ncbi_tools directories of the FTP site contain more information about the toolbox and ASN.1. An ASN.1 summary is also available. The ToolBox can produce data as either ASN.1, as before, or as XML (more about XML). Additional information about the ToolBox, documentation, and demo programs are available on the NCBI ToolBox page. |
Education | Overview |
News, Books, Glossaries, Tutorials, Courses, Additional Resources |
News - keeping up with the changes at NCBI |
NCBI News - announcements about new resources, enhancements to existing resources, staff publications, tutorials, FAQs. |
What's New - recently released resources and enhancements to existing resources |
NCBI Announcements Email Lists - Receive announcements about changes and updates to a variety of NCBI services. In addition to a general NCBI-announce list, topic-specific e-mail lists are available for BLAST, GenBank, dbSNP, Genomes, LinkOut, RefSeq, Sequin, and Entrez Utilities (for making WWW Links to Entrez). Information on how to subscribe is provided. |
Books |
Coffee Break - a collection of short reports on recent biological discoveries. Each report incorporates interactive tutorials that show how bioinformatics tools are used as a part of the research process. |
Genes and Disease - introduction to the relationship between genetic factors and human disease. Summary information for ~60 genetic diseases with links to related databases and organizations. |
NCBI Handbook - an online book, written by NCBI staff, that discusses the many resources available at NCBI. Each chapter is devoted to one service; after a brief overview on using the resource, there is an account of how the resource works, including topics such as how data are included in a database, database design, query processing, and how the different resources relate to each other. |
Entrez Books - In collaboration with book publishers, the NCBI is adapting textbooks for the web and linking them to PubMed, the biomedical bibliographic database. The idea is to provide background information to PubMed, so that users can explore unfamiliar concepts found in PubMed search results. |
Glossaries |
NCBI Handbook Glossary - part of the NCBI Handbook, described above. Includes a variety of terms pertaining to biological data and bioinformatics. |
FieldGuide Glossary - developed for the Field Guide course described below. |
Genome Glossary - commonly used genome terms; includes links to associated literature for each term. |
NHGRI Talking Glossary of Genetic Terms - by the National Human Genome Research Institute (NHGRI). |
Tutorials |
BLAST QuickStart: Example-Driven Web-Based BLAST Tutorial - a tutorial based on the former NCBI minicourse, "BLAST Quick Start", within the Comparative Genomics online book. |
PSI-BLAST Tutorial - a chapter within the Comparative Genomics online book. |
Identification of Disease Genes: Example-Driven Web-Based Tutorial - a tutorial based on the former NCBI minicourse, "Identification of Disease Genes", within the Comparative Genomics online book. |
Science Primer - The science behind our resources. An introduction for researchers, educators and the public. Provides plain language introductions to bioinformatics, genome mapping, molecular modeling, SNPs, ESTs, microarray technology, molecular genetics, pharmacogenomics, and phylogenetics. |
PubMed Tutorial - comprehensive instruction on using PubMed's various features |
Entrez Tutorial - shows users how to make use of the full power of the Entrez data retrieval system. Using a human gene as an example, it demonstrates the variety of information that can be gathered for a single gene across a number of Entrez databases. |
BLAST Statistics |
3-D Protein Structure Tutorial: Cn3D structure viewing program |
Map Viewer Exercises - a chapter within the NCBI Handbook (described above). |
Coffee Break - a collection of short reports on recent biological discoveries. Each report incorporates interactive tutorials that show how bioinformatics tools are used as a part of the research process. |
Courses |
Education - for information on past NCBI courses, please see the Education home page. |
Getting Started with LinkOut - LinkOut is a feature of PubMed that provides users with links from PubMed and other Entrez databases to a wide variety of relevant web-accessible resources, including full-text publications, biological databases, consumer health information, research tools, and more. The goal is to facilitate access to relevant online resources beyond the Entrez system to extend, clarify, or supplement information found in the Entrez databases. This hands-on class is designed to introduce students to LinkOut and provide step-by-step instruction on activating LinkOut for print and electronic journal collections, allowing users to see their own library's holdings and access electronic full text through the PubMed interface. Topics covered are registration for LinkOut, entering holdings, displaying a library's icon for "branding" purposes, and access to free full text through LinkOut. Getting Started with LinkOut is a free class and carries 4 MLA continuing education credits. For more information and to register, visit the NLM's National Training Center and Clearinghouse (NTCC) website: http://nnlm.gov/mar/online/. Questions about the class can be sent to lib-linkout@ncbi.nlm.nih.gov |
Additional Resources |
Cancer Information - a wide range of accurate, credible cancer information brought to you by the National Cancer Institute (NCI). CancerNet information is reviewed regularly by oncology experts and is based on the latest research. It includes information selected and organized for patients, health professionals, and basic researchers. |
Human Genome Project - an international research effort to characterize the genomes of human and selected model organisms through complete mapping and sequencing of their DNA; to develop technologies for genomic analysis; to examine the ethical, legal, and social implications of human genetics research; and to train scientists who will be able to utilize the tools and resources developed through the HGP to pursue biological studies that will improve human health. This link leads to the information provided on the National Human Genome Research Institute (NHGRI) web site. |
NHGRI Educational Resources - the National Human Genome Research Institute (NHGRI) provides a range of educational resources, including glossaries, fact sheets, multimedia educational kits, genetic education modules for use by teachers, and a variety of online materials. |
NIH Office of Science Education - offers a wide variety of educational resources for students at various grade levels, teachers, and the general public. Resources cover a wide range of topics, including Genetics, and formats of educational materials range from lesson plans and curricula to multimedia, online materials, and more. Website also includes a section on career exploration. |
FTP Site | Overview |
Download Databases |
BLAST databases - a collection of databases formatted for use with the BLAST software. A readme file provides database descriptions. |
GenBank and Daily Updates |
RefSeq - NCBI database of Reference Sequences. A curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, mRNAs and proteins for gene models, and entire chromosomes. Accession numbers have the format of two letters, an underscore, and six digits, for example: NT_123456, NM_123456, NP_123456, NC_123456, NG_123456, XM_123456, XR_123456, XP_123456 (more info about accession numbers and access). |
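The accession format described in the RefSeq entry (two letters, an underscore, six digits) can be checked with a short pattern. The following is a minimal sketch, not an NCBI tool: the function name `parse_accession` and the set `KNOWN_PREFIXES` are hypothetical, the prefix set contains only the prefixes listed in the entry, and version suffixes or other accession lengths are deliberately not handled.

```python
import re

# Prefixes taken from the examples in the RefSeq entry above (assumption:
# this set is illustrative and not an exhaustive list of RefSeq prefixes).
KNOWN_PREFIXES = {"NT", "NM", "NP", "NC", "NG", "XM", "XR", "XP"}

# Two uppercase letters, an underscore, and exactly six digits.
ACCESSION_RE = re.compile(r"([A-Z]{2})_(\d{6})")


def parse_accession(text):
    """Return (prefix, digits) if text matches the described format, else None."""
    match = ACCESSION_RE.fullmatch(text.strip())
    if match is None:
        return None
    prefix, digits = match.groups()
    if prefix not in KNOWN_PREFIXES:
        return None
    return prefix, digits


if __name__ == "__main__":
    for candidate in ["NM_123456", "XP_123456", "NM_12345", "AB_123456"]:
        print(candidate, parse_accession(candidate))
```

Restricting matches to a known prefix set is a design choice for the sketch; dropping that check would accept any two-letter prefix in the same layout.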
Entrez Gene - a collection of files from the Entrez Gene database, which is described in the Molecular Databases/Genes section of this guide. |
dbSNP - database of single nucleotide polymorphisms, small-scale insertions/deletions, polymorphic repetitive elements, and microsatellite variation |
Taxonomy - data from the NCBI Taxonomy database (described above). Includes a UNIX compressed tar file called "taxdump.tar.Z" that is updated daily and contains a dump of the taxonomy information from Sybase. Note that the *.dmp files are not human-friendly, but they can be loaded into Sybase with the BCP facility. When you uncompress and untar the file, you will see several files, including a Readme file that contains more information. |
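For readers who want to inspect the *.dmp files without Sybase, a small parser is usually enough. The sketch below assumes the bulk-copy layout in which fields are separated by a tab, a vertical bar, and a tab, and each row ends with a tab and a vertical bar; confirm this against the Readme file in the archive before relying on it. The function names `parse_dmp_line` and `iter_dmp_rows` are hypothetical, and the sample rows are illustrative rather than copied from a real dump.

```python
import io

FIELD_SEPARATOR = "\t|\t"   # assumed field delimiter
ROW_TERMINATOR = "\t|"      # assumed row terminator (before the line break)


def parse_dmp_line(line):
    """Split one *.dmp row into a list of stripped field values."""
    line = line.rstrip("\n")
    if line.endswith(ROW_TERMINATOR):
        line = line[: -len(ROW_TERMINATOR)]
    return [field.strip() for field in line.split(FIELD_SEPARATOR)]


def iter_dmp_rows(lines):
    """Yield parsed rows from any iterable of lines, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_dmp_line(line)


if __name__ == "__main__":
    # Illustrative rows only; a real file would be opened with open(path).
    sample = io.StringIO(
        "9606\t|\t9605\t|\tspecies\t|\n"
        "9605\t|\t207598\t|\tgenus\t|\n"
    )
    for row in iter_dmp_rows(sample):
        print(row)
```

Because `iter_dmp_rows` accepts any iterable of lines, the same function works on an open file handle once the archive has been uncompressed and untarred.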
Repository of databases - This FTP directory contains a mix of NCBI databases (e.g., UniGene, GeneMap, dbEST, dbGSS, dbSTS, OMIM) and a number of externally developed databases (e.g., EPD, TFD). The external databases are made available on the FTP site as a service to the scientific community. They are contributed by outside scientists and maintained independently of NCBI. All the files in the FTP directory of a non-NCBI database are placed there and maintained by the developers of that database. Questions about non-NCBI databases should be directed to the contacts listed in the readme or other background files for the individual databases. Note that additional NCBI databases are also found in the root directory of the FTP site (under the database name, such as GenBank, Gene, RefSeq), or in the "pub" directory (usually under the name of the primary resource developer). |
Download Genomes |
Human Genome Project Data - the ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/ directory contains one folder for each chromosome, which includes genomic contigs (NT_* records) built from finished and unfinished sequence data. The contigs are available in various formats, described below. The contig assembly and annotation process is described in a separate document. |
Other Genomes - genomes of other organisms, such as bacteria, nematode, and mouse, can be downloaded from one of two directories:
Note: In some cases, an organism might be listed in both directories. This can happen for either of two reasons: (1) two versions of the genome are available, one in GenBank and one in RefSeq; or (2) the organism's data were assembled at NCBI and were available from the "/genbank/genomes/" directory before the new "/genomes/" directory was set up. In the latter case, the data now exist in the new "/genomes/" directory, but a symbolic link was preserved in the original directory to facilitate user access. |
Download Software |
BLAST Programs |
NOTE: Preformatted BLAST databases are also available for downloading, in addition to the software listed above. A readme file provides database descriptions. |
Client/server programs |
Cn3D - "See in 3-D," a structure and sequence alignment viewer for NCBI databases. It allows viewing of 3-D structures and sequence-structure or structure-structure alignments. Cn3D can work as a helper application to your browser, or as a client-server application that retrieves structure records from MMDB (described above) directly over the internet. The Cn3D home page provides access to information on how to install the program, a tutorial to get started, and a comprehensive help document. |
NCBI Software ToolBox - a set of software and data exchange specifications used by NCBI to produce portable, modular software for molecular biology. The software in the ToolBox is primarily designed to read records in Abstract Syntax Notation 1 (ASN.1), a data representation format standardized by the International Organization for Standardization (ISO). The software is available to the public in the toolbox/ncbi_tools directory of NCBI's FTP site, and can be used in its own right or as a foundation for building tools with similar properties. The readme files in the toolbox and toolbox/ncbi_tools directories of the FTP site contain more information about the ToolBox and ASN.1. An ASN.1 summary is also available. The ToolBox can produce data either as ASN.1, as before, or as XML (more about XML). Additional information about the ToolBox, documentation, and demo programs is available on the NCBI ToolBox page. Additional information about the Information Engineering Branch (IEB) of NCBI, which develops the ToolBox, is provided above, along with other items of interest to software developers. |
Software programs developed as personal projects by various NCBI scientists - the /pub directory of the FTP site contains programs such as MACAW (multiple sequence alignments) and e-PCR (described above). |
Revised: February 3, 2009. |