NCBI Resource Guide

Each link in this Resource Guide leads to a brief description of the resource on this page, then to the resource itself. A graphical Site Map and an Alphabetical Quicklinks Table provide direct links to resources and bypass the descriptions.



RESOURCES BY CATEGORY

About NCBI
programs and services, contact information, NCBI handbook, news (what's new, NCBI News, announcements e-mail lists, RSS feeds), exhibit schedule, postdoctoral fellowships, organizational structure, resource statistics, site search

GenBank
overview, submit sequences, submit genomes, sample record, GenBank divisions, statistics, release notes, international collaboration, FTP GenBank

Molecular Databases
nucleotides, proteins, structures, genes, gene expression, taxonomy

Literature Databases
PubMed, PubMedCentral, Journals, OMIM, Books, Citation Matcher

Genomes and Maps
organism collections (including Entrez Genome, Entrez Genome Project, Map Viewer, Entrez Gene, UniGene, HomoloGene, and COGs), and organism-specific resources, such as: human, mouse, rat, cow, zebrafish, Drosophila, nematode, plant genomes, yeast, malaria, microbial genomes, viruses, viroids, plasmids, eukaryotic organelles

Tools
Entrez, LinkOut, My NCBI, BLAST, nucleotide sequence analysis, protein sequence analysis, 3-D structure display and similarity searching, genome analysis, gene expression

Research at NCBI
Computational Biology Branch (CBB), senior investigators in PubMed, seminar schedule, postdoctoral fellowships

Software Engineering
IEB home page, NCBI ToolBox, R&D projects, ASN.1

Education
news, science primer, books, glossaries, tutorials, courses, additional resources

FTP Site
download databases, genomes, and software, NCBI Software ToolBox
ALPHABETICAL INDEX
with links to resource descriptions
(To bypass descriptions, use the Alphabetical Quicklinks Table.)
About NCBI | GenBank sample record | Plant Genomes
Announcements | Genes | Protein Sequences
ASN.1 | Genes and Disease | PubChem
BankIt | Genomes (data, projects, submissions) | PubMed
BLAST | GENSAT | PubMed Central
BLink | GEO | RefSeq
Books | Glossaries | Research at NCBI
Cancer Chromosomes | Handbook | Retroviruses
CCDS | HIV Interactions | SAGEmap
CDART | HTGs | Science Primer
CDD | HomoloGene | Seminars
CGAP | Human Genome Resources | Sequin
Clones | Human-Mouse Homology Maps | Site Search
Cn3D | Journals | SKY/M-FISH & CGH Database
Coffee Break | LinkOut | Software Engineering
COGs | Malaria | Splign
Computational Biology Branch | Map Viewer | Statistics
Data Submissions | MeSH | Structures
dbEST | MGC | Submit Data
dbGSS | Microbial Genomes | Taxonomy
dbMHC | MMDB | Tools
dbSNP | Model Maker | TPA
dbSTS | Mutation Databases | Trace Archive
Education | My NCBI | UniGene
e-PCR | NCBI Home | UniSTS
Entrez | NCBI News | VAST
Entrez Utilities | Nucleotide Sequences | VecScreen
Expression | OMIM | Viruses
FTP | OMSSA | WGS
GenBank | ORF Finder | What's New


About NCBI Overview

About NCBI - The science behind our resources. An introduction for researchers, educators and the public. Includes a Science Primer, with plain language introductions to bioinformatics, genome mapping, molecular modeling, SNPs, ESTs, microarray technology, molecular genetics, pharmacogenomics, and phylogenetics.
Programs and Services - basic research, databases and software, outreach and education
Contact Information - postal address, phone, e-mail addresses for various services
Exhibit Schedule - NCBI exhibits at upcoming conferences
NCBI Handbook - an online book, written by NCBI staff, that discusses the many resources available at NCBI. Each chapter is devoted to one service; after a brief overview on using the resource, there is an account of how the resource works, including topics such as how data are included in a database, database design, query processing, and how the different resources relate to each other.
Organizational Structure - functions of the three NCBI branches: Computational Biology Branch (CBB), Information Engineering Branch (IEB), and Information Resources Branch (IRB)
Board of Scientific Counselors - advises the NIH Director, the Deputy Director for Intramural Research, the NLM Director, and the NCBI Director about the intramural research and development programs of the NCBI.
Postdoctoral Fellowships - general information, application procedure
Statistics for NCBI Resources - A page listing statistics that are available for selected NCBI resources, including number of records present in various databases, number of genomes available at NCBI and statistics for the individual genomes, and server usage.
Site Search - Search the NCBI web site and display results in various formats. The default Homepage view sorts NCBI pages based on the number of other NCBI pages that link to them. The NCBI Site Search function is part of the Entrez system (described below). Therefore, the search features described in the Entrez help document also apply to the site search function.
News and Announcements
  • What's New - recently released resources and enhancements to existing resources.
  • NCBI News - announcements about new resources, enhancements to existing resources, staff publications, tutorials, FAQs.
  • NCBI Announcements Email Lists - Receive announcements about changes and updates to a variety of NCBI services. In addition to a general NCBI-announce list, topic-specific e-mail lists are available for BLAST, GenBank, dbSNP, Genomes, LinkOut, RefSeq, Sequin, and Entrez Utilities (for making WWW Links to Entrez). Follow the link to the NCBI Announcements Email Lists page to see a complete list of available topics. Information on how to subscribe is provided.
  • NCBI RSS Feeds - Receive announcements about various NCBI services using an RSS (Really Simple Syndication) feed reader. RSS feeds are available for resources such as Bookshelf, HomoloGene, PubMed Central, PubMed New and Noteworthy, Probe Database, and UniGene. Follow the link to the NCBI RSS Feeds page to see a complete list of available topics. Additional information about RSS is provided in a short series of FAQs.

GenBank Overview

General Information (sample record, release notes, GenBank divisions, statistics),   Submissions (general, special categories, other data types),   International Collaboration,   FTP GenBank
 

General Information

What is GenBank? - a database of nucleotide sequences from >160,000 organisms. Records that are annotated with coding region (CDS) features also include amino acid translations. GenBank belongs to an international collaboration of sequence databases (described below), which also includes EMBL and DDBJ.  GenBank is updated daily in NCBI search systems, and a full release is issued on the FTP site on approximately the 15th of every February, April, June, August, October, and December. Each full release contains all the data present in GenBank as of the cutoff date specified in the release notes (described below). The FTP site also provides daily cumulative and non-cumulative update files (more about the FTP site below).
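As a rough illustration of the bimonthly schedule just described, the following Python sketch estimates the next expected full-release date. The function name is hypothetical, and because the guide says releases appear on approximately the 15th, the result is only an estimate, not an official date.

```python
from datetime import date

# Months in which a full GenBank release is issued, per the description above.
RELEASE_MONTHS = (2, 4, 6, 8, 10, 12)

def next_expected_release(today: date) -> date:
    """Return the next nominal release date (the 15th of a release month).

    Assumption: the 15th is used as a nominal day; actual releases are
    only approximately on that date.
    """
    for month in RELEASE_MONTHS:
        candidate = date(today.year, month, 15)
        if candidate >= today:
            return candidate
    # Past December 15: the next release falls in February of the next year.
    return date(today.year + 1, RELEASE_MONTHS[0], 15)
```

For example, calling `next_expected_release(date(2005, 3, 1))` gives the nominal April 15 date of the same year.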
Sample Record - detailed description of each field in a GenBank record.
Includes, for example, information about accession number formats, sequence identifiers (GI number and accession.version), a listing of GenBank divisions, and more. Describes some commonly annotated biological features, such as CDS, and provides links to documents that list and define the complete set of biological features that can be annotated on sequence records. Includes a link to a sequence revision history tool that can be used to track changes that have occurred to the sequence data in a record.  Also lists the Entrez search field(s) that can be used to search each part of a sequence record.
GenBank Divisions - summary of GenBank divisions, including abbreviations, full spellings, information about what the GenBank divisions are, and what they are not. (This information is part of the GenBank sample record, described above.)
Access GenBank - through Entrez Nucleotides. Search by accession number, author name, organism, gene/protein name, and a variety of other text terms. Additional information about Entrez is below. Use BLAST for sequence similarity searches against GenBank and other databases. An option to download the GenBank full release and updates via FTP is also available.
Growth Statistics (graph) - see also Release Notes sections 2.2.6 (per division statistics), 2.2.7 (per organism statistics), 2.2.8 (growth of GenBank). For statistics on other NCBI databases, please see the page that summarizes sources of Statistics for NCBI Resources.
GenBank Release Notes - A document that accompanies each full release (described in "What is GenBank?", above) of the GenBank database. The release notes describe the format and content of the flat files that comprise the release. They also include notices of recent and upcoming changes, information about GenBank divisions, growth statistics, citing GenBank, and more.
Genetic Codes - synopsis of 17 genetic codes; used to ensure correct translation of coding sequences in GenBank records.
GenBank Bionet Newsgroup - A moderated list that includes announcements of new GenBank releases, recent and upcoming changes, and discussion among subscribers. For information on how to subscribe by e-mail, see the NCBI Announcements Email Lists page.

GenBank Submissions
General Information
 
In addition to GenBank, there are other databases at NCBI to which a variety of data types can be submitted (third party annotations (TPA), variation, expression, MHC data, SKY/M-FISH/CGH data, traces).
 
Submission Software Programs
  • BankIt - WWW submission tool for one or few submissions, designed to make the submission process quick and easy.  (BankIt also automatically uses VecScreen to identify segments of nucleic acid sequence which may be of vector, adapter, or linker origin to combat the problem of vector contamination in GenBank.)
  • Sequin - submission software program for one or many submissions, long sequences, complete genomes, alignments, population/phylogenetic/mutation studies. Can be used as a stand-alone application or in a TCP/IP-based "network aware" mode, with links to other NCBI resources and software such as Entrez.  (Use VecScreen prior to submission).  To receive announcements about updates to the Sequin submission software, see the NCBI Announcements Email Lists page.

Special Types of Submissions to GenBank
Genomes,   Alignments,   ESTs,   GSSs,   HTGs,   STSs,   WGS
 
  • Submission of complete genomes and other large sequence records - Recent enhancements to Sequin make it convenient for genome sequencing centers to annotate their records with Sequin and submit the resulting ASN.1 file to GenBank. After the Sequin files are prepared, large genomes should be submitted by ftp; write to genomes@ncbi.nlm.nih.gov to obtain an ftp account. Smaller records less than 350 kb can be sent by email to gb-sub@ncbi.nlm.nih.gov.

    More information about submitting genomes and other large sequence records is provided on the following pages: GenBank submissions, Sequin, tabular layout for submitting annotated features, bacterial genome submission guidelines.

    In addition, sequencing centers can register a sequencing project with NCBI prior to the submission of any data. This can be done through a Genome project submission form. For each registered project, NCBI will create a sequencing project page that describes the project, links out to genome-specific resources, and provides a focal point for the addition of links to NCBI resources such as Map Viewer and genomic BLAST. Projects can be listed publicly or remain unlisted, and sequences may be held until publication (the default), released immediately, or made available for BLAST searches only. The form can also be used to set up an FTP site for the upload of data to NCBI, or to specify a URL to be used by NCBI for download of project or sequence data. (See Fall 2003/Winter 2004 issue of NCBI News for more information.)
  • ESTs - expressed sequence tags; short, single pass read cDNA (mRNA) sequences. Also includes cDNA sequences from differential display experiments and RACE experiments.
  • GSSs - genome survey sequences; short, single pass read genomic sequences, exon trapped sequences, cosmid/BAC/YAC ends, others.
  • HTGs - high throughput genome sequences from large scale genome sequencing centers; unfinished (phase 0, 1, 2) and finished (phase 3) sequences. (Note that contigs assembled from draft and finished human HTG sequences are accessible from the Map Viewer, described below.)
  • STSs - sequence tagged sites; short sequences that are operationally unique in the genome, used to generate mapping reagents.
  • WGS - data from Whole Genome Shotgun (WGS) sequencing projects can be submitted to GenBank. The data can contain annotations and an entire project is updated as sequencing progresses. WGS submissions are given accession numbers in the format of four letters followed by eight digits, e.g., XXXX00000000. The four letters are a stable project_ID, which does not change as the project is updated. The first two digits represent the version number, which corresponds to a particular project update. The last six digits represent an individual contig within the WGS project. For example, if a project's assigned accession number is XXXX00000000, then that project's first assembly version would be XXXX01000000, and the first contig of that version would be XXXX01000001. (more...)
    The nucleotide data from WGS projects go into the appropriate organismal GenBank Divisions and the BLAST wgs database. The protein translations of annotated coding sequences go into the BLAST protein nr database. In addition, quality data from many WGS projects are submitted to the Trace Archive (described in the ResourceGuide section on Nucleotide Sequence Databases).
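The WGS accession layout described above (four letters, two version digits, six contig digits) can be decomposed mechanically. The Python sketch below is an illustration only; the class and function names are hypothetical, and it accepts only the exact twelve-character form given in this guide.

```python
import re
from typing import NamedTuple

class WgsAccession(NamedTuple):
    project_id: str  # four letters, stable across project updates
    version: int     # first two digits: the project update (assembly version)
    contig: int      # last six digits: contig number within that version

# Exactly four letters followed by eight digits, as stated in the text.
_WGS_PATTERN = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")

def parse_wgs_accession(accession: str) -> WgsAccession:
    """Split a WGS accession such as XXXX01000001 into its three parts."""
    match = _WGS_PATTERN.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a WGS accession in the described format: {accession!r}")
    project_id, version, contig = match.groups()
    return WgsAccession(project_id, int(version), int(contig))
```

With the placeholder accession from the text, `parse_wgs_accession("XXXX01000001")` yields project `XXXX`, version 1, contig 1. In the guide's examples, a contig field of 000000 refers to the project or assembly version as a whole rather than to an individual contig.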
Other Types of Data Submissions
(Other NCBI databases, separate from GenBank, to which data can be submitted)
  • Third Party Annotations (TPA) - a database of experimentally supported annotations on assemblies of sequences already present in DDBJ/EMBL/GenBank. Whereas DDBJ/EMBL/GenBank contains primary sequence data and corresponding annotations submitted by the laboratories that did the sequencing, the TPA database contains third-party assemblies of primary data with experimentally supported annotation that has been published in a peer-reviewed scientific journal. Details about how to submit data, as well as examples of what can and cannot be submitted to TPA, are provided on the TPA home page. Additional information about the TPA database is provided below.

International Nucleotide Sequence Database Collaboration

GenBank, DDBJ, EMBL - Overview of collaborative projects and links to home pages. The GenBank, DDBJ (DNA Data Bank of Japan), and EMBL (European Molecular Biology Laboratory) databases share data on a daily basis and are therefore equivalent. The record formats and search systems might differ among the databases, but the accession numbers, sequence data, and annotations are the same in all of them. E.g., you can retrieve the record with accession number U12345 from GenBank, DDBJ, or EMBL and it will contain the same sequence data, references, etc. in all three databases.
DDBJ/EMBL/GenBank Feature Table - feature table formats and standards used in the annotation of sequence records by the collaborating databases; makes possible sharing of data; includes detailed appendices such as:
  • biological features reference key (alphabetical list also available)
  • feature qualifiers
  • IUPAC abbreviations for nucleotides
  • IUPAC abbreviations for amino acids

FTP GenBank and Daily Updates

    GenBank flat file format - see sample GenBank record and detailed description in GenBank release notes; download most recent full release (described above) and daily cumulative or non-cumulative update files.
    ASN.1 format - Abstract Syntax Notation 1, an International Standards Organization (ISO) data representation format; download most recent full release (described above) and daily cumulative or non-cumulative update files.  (more on ASN.1)
    FASTA format - definition line followed by sequence data only (example); see readme file for database descriptions, including nt.Z (daily updated non-redundant BLAST nucleotide database, contains GenBank+EMBL+DDBJ+PDB sequences, but no EST, STS, GSS, or HTGS sequences), nr.Z (daily updated non-redundant proteins), est.Z, gss.Z, htg.Z, sts.Z, and others.
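To make the FASTA layout concrete (a definition line followed by sequence data), here is a minimal Python reader. It assumes the common convention that each definition line starts with ">" and that sequence data may be wrapped across several lines; the function name is hypothetical and this is not an NCBI tool.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (definition_line, sequence) pairs from FASTA-formatted lines."""
    definition: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record begins; emit the previous one, if any.
            if definition is not None:
                yield definition, "".join(chunks)
            definition = line[1:]
            chunks = []
        elif definition is not None:
            chunks.append(line)  # sequence data, possibly wrapped
    if definition is not None:
        yield definition, "".join(chunks)
```

Because the function accepts any iterable of lines, it can be fed an open text file directly, for example one of the decompressed FASTA files downloaded from the FTP site.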


    Molecular Databases Overview

    Nucleotide Sequences,   Protein Sequences,   Structures,   Genes,   Expression,   Taxonomy
     

    Nucleotide Sequence Databases

    Entrez Nucleotides - combines data from a number of source databases, including GenBank, RefSeq, TPA, and PDB. Data can be searched by accession number, author name, organism, gene/protein name, and a variety of other text terms. Additional information about Entrez is provided below. For retrieval of large data sets, Batch Entrez (described below) is available.
    GenBank - a database of nucleotide sequences from >160,000 organisms. Records that are annotated with coding region (CDS) features also include amino acid translations. GenBank belongs to an international collaboration of sequence databases (described above), which also includes EMBL and DDBJ. A sample record, which provides a detailed description of each field in a GenBank record, is also available. A variety of sequence records exist in GenBank, such as characterized genes that have been well-studied and annotated, batch produced sequences (ESTs, GSSs, STSs), high throughput genomic sequences, complete genomes, and more. Additional information about GenBank is given in the GenBank Overview section of this guide.
    RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, mRNAs and proteins for gene models, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits. Nucleotide sequence records have accessions: NT_123456, NM_123456, NC_123456, NG_123456, XM_123456, XR_123456 (more info about accession numbers and access). Additional details about RefSeq are provided in the NCBI Handbook, which is available online in the Entrez Books database.
    Consensus CoDing Sequence (CCDS) Database - The CCDS project is a collaborative effort to identify a core set of human protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations on the human genome. The collaborators include the National Center for Biotechnology Information (NCBI, Map Viewer), European Bioinformatics Institute (EBI, Ensembl), University of California, Santa Cruz (UCSC, Genome Browser), and Wellcome Trust Sanger Institute (WTSI, Vega). They identify the position of protein-coding regions of genes that are (1) annotated consistently on the human genome by all of the participating centers and (2) supported by transcript evidence, use of canonical splice sites, and other quality assurance measures. Additional information about the curation, process flow, and quality testing is available on the CCDS web site.
    Third Party Annotation (TPA) database - a database of experimentally supported annotations on assemblies of sequences already present in DDBJ/EMBL/GenBank. Whereas DDBJ/EMBL/GenBank contains primary sequence data and corresponding annotations submitted by the laboratories that did the sequencing, the TPA database contains third-party assemblies of primary data with experimentally supported annotation that has been published in a peer-reviewed scientific journal. Details about how to submit data, as well as examples of what can and cannot be submitted to TPA, are provided on the TPA home page.
    Note:  Although TPA records are derived from DDBJ/EMBL/GenBank, TPA is actually a separate database. Therefore, TPA records are not present in the GenBank FTP files, but will be available in separate FTP files.

    The TPA database uses an accession format similar to GenBank records (e.g., two letters followed by six digits) and is organized into similar divisions. (A list of GenBank divisions is given in the GenBank Sample Record. Some divisions, such as EST, GSS, and HTG, are present in GenBank but will not be present in TPA.)

    TPA records can be easily recognized because the definition lines begin with the letters "TPA", and they contain "Third Party Annotation; TPA" in the Keywords field. This is illustrated in a sample TPA record, BK000627.

    TPA records can be retrieved from Entrez Nucleotides (described above). To only see data from TPA, use the "Index" mode to select "tpa" from the Properties search field, or simply add the command AND tpa[prop] to your query.

    Details about how to submit data, as well as examples of what can and cannot be submitted to TPA, are provided on the TPA home page. An announcement and additional information about the TPA database is provided in section 1.4.5, "Third-Party Annotation and Consensus Sequences (TPA)" of the GenBank 133.0 release notes.
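The two TPA conventions described above (the `tpa[prop]` filter and the "TPA" markers in the definition line and Keywords field) can be sketched in Python. Both function names are hypothetical and are not part of any NCBI software; the first simply builds the query text one would enter in Entrez Nucleotides.

```python
def restrict_to_tpa(query: str) -> str:
    """Append the TPA property filter to an Entrez Nucleotides query."""
    query = query.strip()
    if not query:
        return "tpa[prop]"
    return f"{query} AND tpa[prop]"

def looks_like_tpa_record(definition: str, keywords: str) -> bool:
    """Apply the two recognition rules given in the text.

    The definition line must begin with "TPA" and the Keywords field
    must contain "Third Party Annotation; TPA".
    """
    return (
        definition.lstrip().startswith("TPA")
        and "Third Party Annotation; TPA" in keywords
    )
```

For instance, `restrict_to_tpa("insulin")` produces `insulin AND tpa[prop]`.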
    dbEST - database of expressed sequence tags; short, single pass read cDNA (mRNA) sequences. Also includes cDNA sequences from differential display experiments and RACE experiments.
    Note: EST sequences are available from two sources: dbEST and the EST division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ.  (data submission instructions...)
    dbGSS - database of genome survey sequences; short, single pass read genomic sequences, exon trapped sequences, cosmid/BAC/YAC ends, others.
    Note: GSS sequences are available from two sources: dbGSS and the GSS division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ.  (data submission instructions...)
    dbMHC - Provides a platform where the human leukocyte antigen (HLA) community can submit, edit, view, and exchange Major Histocompatibility Complex (MHC) data. The MHC database is fully integrated with other NCBI resources, as well as with the International Histocompatibility Working Group (IHWG) Web site, and provides links to the IMmunoGeneTics HLA (IMGT/HLA) database. Additional details are available in the NCBI Handbook.
    dbSNP - database of single nucleotide polymorphisms, small-scale insertions/deletions, polymorphic repetitive elements, and microsatellite variation.  dbSNP includes polymorphism data that is experimentally derived, computationally derived, as well as hybrid data that is determined by the alignment of an experimentally derived molecule to genomic sequence data.  Currently, dbSNP is comprised of 4 general classes of submissions: (a) The SNP Consortium (TSC) - candidate SNPs identified by sequencing using either the reduced representation shotgun strategy or by alignment of random reads to genomic sequence;  (b) Overlaps - candidate SNPs were identified in sequence overlaps between individual BACs or PACs;   (c) ESTs - SNPs identified in EST clusters, including those identified by the Cancer Genome Anatomy Project (described below);  (d) Other - SNPs identified after screening larger numbers of chromosomes include many with alleles of lower frequency (1%-20%).  (data submission instructions)   To receive announcements about updates and new features to dbSNP, see the NCBI Announcements Email Lists page.
    Note: Although dbSNP is a separate database from GenBank, SNP records include cross-references to GenBank records.  
    dbSTS - database of sequence tagged sites; short sequences that are operationally unique in the genome, used to generate mapping reagents.
    Note: STS sequences are available from two sources: dbSTS and the STS division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ.  (data submission instructions...)
    UniSTS - a unified, non-redundant view of sequence tagged sites (STSs). UniSTS integrates marker and mapping data from a variety of public resources. If two or more markers have different names but the same primer pair, a single STS record is presented for the primer pair and all the marker names are shown. Each UniSTS record displays the primer sequences, product size, mapping information, and cross references to Entrez Gene, dbSNP, RHdb, GDB, MGD, and the Map Viewer. The marker report also lists GenBank and RefSeq records that contain the primer sequences, as determined by Electronic PCR (e-PCR). Data sources include dbSTS, RHdb, GDB, various human maps (Genethon genetic map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map, Stanford RH map, NHGRI chr 7 physical map, WashU chrX physical map), various mouse maps (Whitehead RH map, Whitehead YAC map, Jackson laboratory's MGD map).
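    The merging rule described for UniSTS (markers with different names but the same primer pair share one record) amounts to grouping by primer pair. The Python sketch below illustrates the idea with made-up marker names and primer sequences; it is not the UniSTS implementation, and it assumes that a primer pair is an ordered (forward, reverse) tuple compared case-insensitively.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

PrimerPair = Tuple[str, str]

def group_markers_by_primer_pair(
    markers: Iterable[Tuple[str, str, str]]
) -> Dict[PrimerPair, List[str]]:
    """Collect all marker names that share the same primer pair.

    Each input item is (marker_name, forward_primer, reverse_primer).
    """
    grouped: Dict[PrimerPair, List[str]] = defaultdict(list)
    for name, forward, reverse in markers:
        key = (forward.upper(), reverse.upper())  # assumption: case-insensitive match
        if name not in grouped[key]:
            grouped[key].append(name)
    return dict(grouped)
```

Each key in the result corresponds to a single STS record, and its list holds every marker name reported for that primer pair.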

    UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. UniGene clusters are annotated with mapping and expression information when possible (e.g., for human), and include cross-references to other resources. Sequence data can be downloaded by cluster through the UniGene web pages, or the complete data set can be downloaded from the repository/UniGene directory of the FTP site. In addition, UniGene DDD (described below) can be used to show differential expression of genes between cDNA libraries. The organisms represented in UniGene are listed on the UniGene home page.
    HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. Organisms represented are listed on the HomoloGene home page.
    Mammalian Gene Collection (MGC) - The NIH Mammalian Gene Collection (MGC) is a trans-NIH initiative that seeks to identify and sequence a representative full open reading frame (FL-ORF) clone for each human, mouse, and rat gene. The MGC project entails the production of cDNA libraries and sequences, database and repository development, as well as the support of research for improved library construction, sequencing, and analytic technologies. All the resources generated by the MGC are publicly accessible to the biomedical research community.
    Trace Archive - a repository of the raw sequence traces generated by large sequencing projects. It allows retrieval of both the sequence file and the underlying data which generated the file. In the case of projects that rely on a Whole Genome Shotgun (WGS) strategy, the Trace Archive will be the sole source of raw sequence data. (More information about WGS projects is provided in the ResourceGuide section on special types of submissions to GenBank/WGS.) NCBI will be exchanging data regularly with the Ensembl Trace Server. The Trace Archive can be searched by using MegaBLAST (described below), or by entering a term in the search box at the top of the Trace Archive Page. (data submission instructions...)
    Assembly Archive - links the raw sequence information found in the Trace Archive with assembly information found in publicly available sequence repositories (GenBank/EMBL/DDBJ). The Assembly Viewer allows a user to see the multiple sequence alignments as well as the actual sequence chromatogram.
    UniVec - a database that can be used to quickly identify segments within nucleic acid sequences which may be of vector origin. Screening using UniVec is efficient because a large number of redundant sub-sequences have been eliminated to create a database that contains only one copy of every unique sequence segment from a large number of vectors. The VecScreen tool, described below (under sequence analysis tools), can be used to compare a query sequence against the UniVec database in order to identify possible vector contamination.
    Genomes - Resources in the Genomes and Maps section contain the nucleotide sequences for a variety of genomes. Examples of the genomes available include:   >1000 organisms in Entrez Genome, human, mouse, rat, zebrafish, Drosophila, nematode, plant genomes, yeast, malaria, microbial genomes, viruses, viroids, plasmids, eukaryotic organelles.
    Nucleotide Sequence Analysis - various tools are available for analyzing nucleotide sequences and are described below.

    Protein Sequence Databases

    Entrez Proteins - search protein sequence records (from GenPept + RefSeq + Swiss-Prot + PIR + RPF + PDB) by accession number, author name, organism, gene/protein name, and a variety of other text terms. Additional information about Entrez is provided below. For retrieval of large data sets, Batch Entrez (described below) is available. Entrez Proteins also includes BLink ("BLAST Link"), a feature which displays the results of BLAST searches that have been done for every protein sequence in the Entrez Proteins data domain. To access it, follow the BLink link displayed beside any hit in the results of an Entrez Proteins search. More information about BLink is provided below.
    RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, mRNAs and proteins for gene models, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits. Protein sequence records have accessions: NP_123456 or XP_123456 (more info about accession numbers and access).
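    The RefSeq accession prefixes listed in this guide (NT, NM, NC, NG, XM, XR for nucleotide records; NP, XP for protein records) lend themselves to a simple lookup. The Python sketch below follows the format exactly as stated here (two letters, an underscore, six digits); the function name is hypothetical, and accessions that carry a version suffix or a different number of digits would need a looser pattern.

```python
import re
from typing import Optional

# Prefixes as listed in this guide.
NUCLEOTIDE_PREFIXES = {"NT", "NM", "NC", "NG", "XM", "XR"}
PROTEIN_PREFIXES = {"NP", "XP"}

# Two letters, an underscore, and six digits, per the description above.
_REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d{6})$")

def classify_refseq_accession(accession: str) -> Optional[str]:
    """Return "nucleotide", "protein", or None if the accession is not recognized."""
    match = _REFSEQ_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    prefix = match.group(1)
    if prefix in NUCLEOTIDE_PREFIXES:
        return "nucleotide"
    if prefix in PROTEIN_PREFIXES:
        return "protein"
    return None
```

Using the placeholder accessions from the text, `NM_123456` is classified as a nucleotide record and `NP_123456` as a protein record.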
    FTP GenPept - download the "relxxx.fsa_aa.gz" file. The filename stands for "Release number XXX FASTA formatted amino acid translations". The translations are extracted from GenBank/EMBL/DDBJ records that are annotated with one or more CDS features.
    Conserved Domain Database (CDD) - a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It includes domains from Smart and Pfam, as well as domains contributed by NCBI researchers. It also includes alignments of the domains to known 3-dimensional protein structures in the MMDB database (described below). CDD can be used to identify conserved domains in a protein query sequence, using the CD-Search service (described below). In addition, the CDART tool (described below) uses CDD and RPS-BLAST (described below) to retrieve proteins with similar domain architectures.
    HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data. More information about this database is provided under "Literature Databases".
    PROW - Protein Resources on the Web - short authoritative guides on the approximately 200 human CD cell-surface molecules. Peer-reviewed; provides approximately 20 standardized categories of information (biochemical function, ligands, etc.) for each CD antigen.
    Protein Sequence Analysis - various tools are available for analyzing protein sequences and are described below.
    Proteomes
    • Entrez Genome - provides ProtTable and TaxTable for various organisms. The ProtTable provides a summary of protein coding regions in a genome, and provides links to the corresponding nucleotide and protein sequences in FASTA format. The TaxTable, also referred to as the "distribution of BLAST protein homologs by taxa," summarizes the results of BLAST analyses done for the proteins, and displays the relationship of the organism to others through a color-coded graphical summary. (Additional information about Entrez Genome is provided below.)
    • FTP Genome Proteins - download an *.faa file (FASTA formatted amino acid sequences) and *.ptt file (protein table) for various organisms from the genbank/genomes directory of the FTP site; see the readme file for more information. Protein tables can also be viewed in Entrez Genome, as noted above.
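
    The FASTA-formatted protein files mentioned above (the GenPept "relxxx.fsa_aa.gz" file and the per-organism *.faa files) can be read with a few lines of code. The following is a minimal sketch, not an NCBI tool: the function names read_fasta and read_fasta_gz are hypothetical, and the example records are made up for illustration.

```python
import gzip
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text handle in FASTA format."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fasta_gz(path):
    """Read a gzip-compressed FASTA file (for example a downloaded *.fsa_aa.gz)."""
    with gzip.open(path, "rt") as handle:
        return list(read_fasta(handle))


# In-memory example with invented records (not real GenPept data).
example = io.StringIO(
    ">prot1 hypothetical protein\nMKT\nAYI\n>prot2 another example\nMSD\n"
)
records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

    Sequence lines are joined per record, so a sequence wrapped over several lines is returned as one string.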

    Structure Databases

    Structure Home - general information about the NCBI Structure Group and its research projects, as well as access to the Molecular Modeling Database (MMDB) and related tools to search and display structures.
    MMDB: Molecular Modeling Database - a database of three-dimensional biomolecular structures derived from X-ray crystallography and NMR spectroscopy. MMDB is a subset of three-dimensional structures obtained from the Brookhaven Protein Data Bank (PDB), excluding theoretical models. MMDB reorganizes and validates the information in a way that enables cross-referencing between the chemistry and the three-dimensional structure of macromolecules. Its data specification includes a description of a biopolymer's spatial structure, a description of how it is organized chemically, and a set of pointers linking the two. By integrating chemical, sequence, and structure information, MMDB is designed to serve as a resource for structure-based homology modeling and protein structure prediction. MMDB records are stored in ASN.1 format and can be displayed with the Cn3D, RasMol, or Kinemage viewers. In addition, similar structures within the database have been identified using VAST, and new structures can be compared against the database using VAST Search.
    3D Domains Database - compact structural domains identified automatically in MMDB, Entrez's macromolecular three-dimensional structure database. These domains are identified by searching for breakpoints in the structure between major secondary structure elements so that the ratio of intra- to inter-domain contacts falls above a set threshold. 3D Domains are the units of comparison for structure neighbor ("related structures") calculations using the VAST algorithm.
    Conserved Domain Database (CDD) - a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It includes domains from Smart and Pfam, as well as domains contributed by NCBI researchers. It also includes alignments of the domains to known 3-dimensional protein structures in the MMDB database (described above). CDD can be used to identify conserved domains in a protein query sequence, using the CD-Search service (described below). In addition, the CDART tool (described below) uses CDD and RPS-BLAST (described below) to retrieve proteins with similar domain architectures.
    PubChem - contains the chemical structures of small organic molecules and information on their biological activities. It is intended to support the Molecular Libraries and Imaging component of the NIH Roadmap Initiative. PubChem's chemical structure database may be searched on the basis of descriptive terms, chemical properties, and structural similarity. When possible, PubChem's chemical structure records are linked to other NCBI databases, including the PubMed scientific literature database and NCBI's protein 3D structure database. PubChem also contains the results of high-throughput biological screening experiments. PubChem is organized as three linked databases within the Entrez/PubMed information retrieval system.
    • PubChem Substance - Primary data that NCBI obtains from various public depositories. The PubChem Substance database contains approximately 13 million records as of October 2006, provided by various sources, including DTP/NCI, NIAID, ChemIDplus, NIST, NIST webbook, MOLI/NCI, ChemBank, MMDB, KEGG, and more. Substance information includes chemical structures, synonyms, registration IDs, descriptions, related URLs, and database cross-reference links to PubMed, protein 3D structures, and biological screening results.
    • PubChem Compound - A database made by NCBI and derived from PubChem Substance. It is a non-redundant view of the chemically validated substances in PubChem Substance. There is one PubChem Compound record for each unique substance, and for each unique substance component. There can be multiple PubChem Substance records associated with one PubChem Compound record. PubChem Compound contains all standardized structures, mixture components, and precalculated structure neighboring links. Compound information includes structure, compound property information (molecular weight, formula, xLogP, count of the rotatable bonds, H bond donor, H bond acceptor, etc.), and structure description (SMILES, IUPAC name, InChI).
    • PubChem BioAssay - The assay database consists of deposited bioactivity data and descriptions of bioactivity assays used for screening of the chemical substances contained in PubChem Substance, including descriptions of the conditions and the readouts (bioactivity levels) specific to the screening procedure. The assay database includes DTP/NCI's 710 million lines of in vitro and in vivo data, covering fields that range from cancer and HIV to many others.
    Structure-Related Tools - in addition to the structure databases described above, NCBI offers several tools:
    • Cn3D - "See in 3-D," a structure and sequence alignment viewer for NCBI databases. It allows viewing of 3-D structures and sequence-structure or structure-structure alignments. Cn3D can work as a helper application to your browser, or as a client-server application that retrieves structure records from MMDB (described above) directly over the internet. The Cn3D home page provides access to information on how to install the program, a tutorial to get started, and a comprehensive help document.
    • CD-Search - The Conserved Domain Search Service (CD-Search) can be used to identify the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (described above) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD) (described above). Hits can be displayed as a pairwise alignment of the query sequence with a representative domain sequence, or as a multiple alignment. Alignments are also mapped to known 3-dimensional structures, and can be displayed using Cn3D (described above). In the Cn3D display, residues in sequence alignments are variously colored, based on their degree of conservation.
    • VAST - Vector Alignment Search Tool - a computer algorithm developed at NCBI and used to identify similar protein 3-dimensional structures. The "structure neighbors" for every structure in MMDB are pre-computed and accessible via links on the MMDB Structure Summary pages. These neighbors can be used to identify distant homologs that cannot be recognized by sequence comparison alone.
    • VAST Search - structure-structure similarity search service. Compares 3D coordinates of a newly determined protein structure to those in the MMDB/PDB database. VAST Search computes a list of structure neighbors that you may browse interactively, viewing superpositions and alignments by molecular graphics.
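
    The 3D Domains entry above describes choosing breakpoints so that the ratio of intra- to inter-domain contacts exceeds a threshold. The sketch below illustrates only that ratio for a single candidate breakpoint. It is not the NCBI algorithm: the function name contact_ratio, the representation of contacts as residue-index pairs, and the two-domain split are all assumptions made for illustration.

```python
def contact_ratio(contacts, breakpoint):
    """Return the intra- to inter-domain contact ratio for one breakpoint.

    contacts   -- iterable of (i, j) residue-index pairs that are in contact
    breakpoint -- residues with index < breakpoint form one hypothetical domain,
                  all other residues form the second
    """
    intra = inter = 0
    for i, j in contacts:
        # A contact is intra-domain when both residues fall on the same side.
        if (i < breakpoint) == (j < breakpoint):
            intra += 1
        else:
            inter += 1
    # With no inter-domain contacts the ratio is unbounded.
    return float("inf") if inter == 0 else intra / inter


# Invented contact list: three contacts within domains, one across the split.
example_contacts = [(1, 2), (2, 3), (10, 11), (3, 10)]
ratio = contact_ratio(example_contacts, breakpoint=5)
print(ratio)
```

    A candidate breakpoint would be accepted under this scheme only if the ratio is above whatever threshold is chosen; the guide does not state the value used.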

    Genes

    Entrez Gene - Entrez Gene provides a gene-based view of the data from a wide range of genomes. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. Each record represents a single gene from a given organism. The minimum set of data in a gene record includes a unique identifier or GeneID assigned by NCBI, a preferred symbol, and any of sequence information, map information, or official nomenclature from an authority list. In addition, a gene record can also include expression, structure, functional, and homology data, when available. Entrez Gene includes data from all organisms that have RefSeq genome records (with NC_* accessions, see more info above), and can also include data from recognized genome-specific databases that provide NCBI with information about genes (preferably with defining sequence) or mapped phenotypes. Entrez Gene is the successor to LocusLink (described below).
    GeneRIF - Gene References into Function (GeneRIFs) provide a simple mechanism to allow scientists to add to the functional annotation of loci described in Entrez Gene. They appear as annotated bibliographies in Entrez Gene records, and consist of brief statements on gene function with links to the corresponding PubMed records (example: human MLH1). The GeneRIF help page describes the simple steps needed to submit information. GeneRIFs are also added to the Entrez Gene records by the MEDLINE Indexing Staff of the National Library of Medicine. GeneRIFs are currently available for a subset of organisms in Entrez Gene, and will be provided for the loci of other organisms as the development of Entrez Gene continues.
    LocusLink - LocusLink was discontinued as of March 1, 2005. It provided a foundation for what is now Entrez Gene and was described in several articles ( Pruitt KD, Maglott DR (2001), Pruitt KD, Katz KS, Sicotte H, Maglott DR (2000)). It contained data for a number of species such as human, mouse, rat, zebrafish, nematode, fruit fly, cow, sea urchin, African clawed frog, HIV-1, and a few other model and commonly studied organisms. Data for these organisms (and from the ongoing collaboration among the groups listed above) are now available in the Entrez Gene database (described above), which is the successor to LocusLink. The major differences between LocusLink and Entrez Gene are scope of data and search interface. Entrez Gene contains data from all organisms with RefSeq genome records. (RefSeq is described in the Molecular Databases/Nucleotide Sequences section of this guide). Entrez Gene also uses the Entrez search system, and therefore offers the helpful functions such as Preview/Index, History, and LinkOut that are available for other Entrez databases. The Entrez Gene help document includes numerous tips for previous users of LocusLink.
    Consensus CoDing Sequence (CCDS) Database - The CCDS project is a collaborative effort to identify a core set of human protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations on the human genome. The collaborators include the National Center for Biotechnology Information (NCBI, Map Viewer), European Bioinformatics Institute (EBI, Ensembl), University of California, Santa Cruz (UCSC, Genome Browser), and Wellcome Trust Sanger Institute (WTSI, Vega). They identify the position of protein-coding regions of genes that are (1) annotated consistently on the human genome by all of the participating centers and (2) supported by transcript evidence, use of canonical splice sites, and other quality assurance measures. Additional information about the curation, process flow, and quality testing is available on the CCDS web site.
    UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. UniGene clusters are annotated with mapping and expression information when possible (e.g., for human), and include cross-references to other resources. Sequence data can be downloaded by cluster through the UniGene web pages, or the complete data set can be downloaded from the repository/UniGene directory of the FTP site. In addition, UniGene DDD (described below) can be used to show differential expression of genes between cDNA libraries. The organisms represented in UniGene are listed on the UniGene home page.
    HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. Organisms represented are listed on the HomoloGene home page.
    Mammalian Gene Collection (MGC) - The NIH Mammalian Gene Collection (MGC) is a trans-NIH initiative that seeks to identify and sequence a representative full open reading frame (FL-ORF) clone for each human, mouse, and rat gene. The MGC project entails the production of cDNA libraries and sequences, database and repository development, as well as the support of research for improved library construction, sequencing, and analytic technologies. All the resources generated by the MGC are publicly accessible to the biomedical research community.
    HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data. More information about this database is provided under "Literature Databases".
    AceView (Acembly) - AceView offers an integrated view of the human, nematode and Arabidopsis genes reconstructed by co-alignment of all publicly available mRNAs and ESTs on the genome sequence. The goals are to offer a reliable up-to-date resource on the genes and their functions and to stimulate further validating experiments at the bench. AceView carefully computes co-alignment and clustering of experimental cDNA sequences; no prediction is involved. The resulting AceView genes and their alternative variants are analyzed in terms of expression, intron-exon structure, alternative features, regulation and neighbor relationships; the protein products are analyzed for completeness, their best covering clones are identified, the proteins are searched for motifs, membership to a protein family, conservation in evolution, closest homologues in other species and signals for subcellular localization. The genes are presented in the context of biological annotations gathered from various sources. AceView can be queried by meaningful words or sentences as well as by most standard identifiers.

    Expression

    Gene Expression Omnibus (GEO) - a gene expression and hybridization array data repository, as well as a curated, online resource for gene expression data browsing, query and retrieval. GEO was the first fully public high-throughput gene expression data repository, and became operational in July 2000. Many types of gene expression data, from platforms such as spotted microarray (microarray), high-density oligonucleotide array (HDA), hybridization filter (filter), and serial analysis of gene expression (SAGE), are accepted, accessioned, and archived as public data sets. GEO data can be accessed through several search and browsing tools on the GEO home page, Entrez (via Entrez GEO Profiles and Entrez GDS (GEO DataSets)), and the FTP site. The Tools/Gene Expression section of this file provides information about data visualization and exploration capabilities available in GEO.
    GENSAT - The Gene Expression Nervous System Atlas, or GENSAT, project aims to map the expression of genes in the central nervous system of the mouse, using both in situ hybridization and transgenic mouse techniques. The GENSAT database contains a series of images related to gene expression experiments. The images are indexed on a number of fields relevant to biological discovery. Search criteria include gene names, gene symbols, gene aliases and synonyms, mouse ages, and imaging protocols. The GENSAT project is a collaboration among the National Institute of Neurological Disorders and Stroke (NINDS), Rockefeller University, St. Jude Children's Research Hospital, and NCBI.
    Expression-Related Tools - in addition to the GEO database, described above, NCBI offers several tools:
    • SAGEmap - Serial Analysis of Gene Expression, or SAGE, is an experimental technique designed to quantitatively measure gene expression. SAGEmap is an online tool to compare computed gene expression profiles between SAGE libraries generated by the Cancer Genome Anatomy Project (CGAP, described under human genome/cancer research) and submitted by others through the Gene Expression Omnibus (GEO, described above). SAGEmap also includes a comprehensive analysis of SAGE tags in human GenBank records, in which a UniGene identifier is assigned to each human sequence that contains a SAGE tag. Data can be retrieved by tag, by sequence, by UniGene cluster ID and by library name. When retrieving data by sequence or UniGene cluster ID, follow a SAGE tag's hotlink to find out its expression level in different SAGE libraries, and how it is represented in the rest of the sequences in GenBank. Retrieving data by library name takes one to GEO, where all SAGEmap data has been stored by library. Analytical tools include xProfiler, which compares gene expression between SAGE libraries of your choice as well as uploaded data. More information about the additional analytical capabilities of the SAGEmap resource is provided in the tools/gene expression section of this file.
    • CGAP - Cancer Genome Anatomy Project - interdisciplinary program to identify the human genes expressed in different cancerous states, based on cDNA (EST) libraries, and to determine the molecular profiles of normal, precancerous, and malignant cells. Collaboration among the National Cancer Institute, the NCBI, and numerous research labs. Additional information about CGAP is provided in the tools/gene expression section of this file. Related resources are described in the human genome/cancer research section.
    • UniGene DDD - Digital Differential Display - an online tool to compare computed gene expression profiles between selected cDNA libraries. Using a statistical test, genes whose expression levels differ significantly from one tissue to the next are identified and shown to the user. Additional information about UniGene is in the molecular databases/genes section.

    Taxonomy

    NCBI Taxonomy Database Home - general information about the Taxonomy project, including taxonomic resources and a list of outside curators collaborating with NCBI taxonomists. The NCBI Taxonomy Database contains the names and lineages of >160,000 organisms, both living and extinct, that are represented in the genetic databases with at least one nucleotide or protein sequence. New organisms are added to the database as sequence data are deposited for them. The purpose of the taxonomy project at NCBI is to build a consistent phylogenetic taxonomy for the sequence databases.
    Taxonomy Browser - The search bar on the Taxonomy home page allows you to browse the NCBI taxonomy database. Enter the scientific or common name of a species (e.g., Canis familiaris or dog) or a higher taxon (e.g., Canidae) to view that organism or taxon's lineage; retrieve the available nucleotide, protein, structure, and genome records; and browse up and down the taxonomic tree. (Tip: For the broadest search results, select the "token set" option in the search bar, which searches for any string, whether in the beginning, middle, or end of a word.) Entrez also provides an interface for browsing the taxonomy database, and offers features such as the Common Tree function, which allows you to build a tree for your own selection of organisms or taxa.
    Taxonomy BLAST - an implementation of Gapped BLAST (2.x) that groups hits by source organism, according to information in NCBI's Taxonomy database. Species are listed in order of sequence similarity to the query sequence, with the strongest match listed first. Three report views are available:
    • organism report - sorts the BLAST hits according to species, so that all of the hits to the same organism will appear together
    • lineage report - gives a simplified view of the relationships between the organisms, according to their classification in the taxonomy database. This report is "focused" on the organism which yielded the strongest BLAST hit. It answers the question, "how closely are the organisms in the BLAST hit list related to the query sequence according to the taxonomy database?"
    • taxonomy report - provides a more detailed report about the relationships among all of the organisms found in the BLAST hit list, including a summary of the taxa that are represented, the number of species and subspecies, and the number of BLAST hits at each node in the taxonomic hierarchy.
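
    The organism report described above amounts to grouping hits by source organism and listing the organisms by their strongest match. The following is a minimal sketch of that grouping, not the Taxonomy BLAST implementation: the function name organism_report, the (organism, subject_id, score) tuple layout, and the example hits are assumptions for illustration.

```python
from collections import defaultdict


def organism_report(hits):
    """Group BLAST-like hits by organism, strongest organism first.

    hits -- list of (organism, subject_id, score) tuples
    Returns a list of (organism, [(subject_id, score), ...]) pairs in which
    each organism's hits are sorted by descending score and the organisms
    are ordered by their best score.
    """
    groups = defaultdict(list)
    for organism, subject_id, score in hits:
        groups[organism].append((subject_id, score))
    for members in groups.values():
        members.sort(key=lambda hit: hit[1], reverse=True)
    # After sorting, the first hit of each group is that organism's best hit.
    return sorted(groups.items(), key=lambda item: item[1][0][1], reverse=True)


# Invented hits for illustration only.
example_hits = [
    ("Mus musculus", "m1", 180),
    ("Homo sapiens", "h1", 200),
    ("Mus musculus", "m2", 90),
    ("Homo sapiens", "h2", 150),
]
report = organism_report(example_hits)
for organism, members in report:
    print(organism, members)
```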
    TaxPlot - a tool for 3-way comparisons of genomes on the basis of the protein sequences they encode. To use TaxPlot, one selects a reference genome to which two other genomes are compared. Pre-computed BLAST results are then used to plot a point for each predicted protein in the reference genome, based on the best alignment with proteins in each of the two genomes being compared.
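
    The TaxPlot description above can be read as: for each protein in the reference genome, take the best alignment against each of the two comparison genomes and use the pair as plot coordinates. The sketch below shows only that step. It is not the TaxPlot code: the function name taxplot_points, the dictionary inputs, the use of 0 for proteins with no hit, and the example scores are assumptions for illustration.

```python
def taxplot_points(hits_a, hits_b):
    """Return one (best_a, best_b) point per reference protein.

    hits_a, hits_b -- dicts mapping a reference protein ID to the list of
                      alignment scores found in comparison genome A or B
    A protein with no hit in a genome gets 0 on that axis (an assumption).
    """
    proteins = sorted(set(hits_a) | set(hits_b))
    return {
        protein: (
            max(hits_a.get(protein, []), default=0),
            max(hits_b.get(protein, []), default=0),
        )
        for protein in proteins
    }


# Invented pre-computed scores for three reference proteins.
scores_vs_a = {"p1": [50, 120], "p2": [30]}
scores_vs_b = {"p1": [80], "p3": [40, 10]}
points = taxplot_points(scores_vs_a, scores_vs_b)
print(points)
```

    Points far from the diagonal correspond to proteins that match one comparison genome much better than the other.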

    Literature Databases Overview

    PubMed - A database of citations and abstracts for biomedical literature. These citations are from MEDLINE and additional life science journals. PubMed also includes links to many sites providing full text articles and other related resources. PubMed is accessible through the Entrez search and retrieval system (described below).
    • Journals Database - allows you to look up journals that are cited in any of the Entrez databases, including PubMed. Journals can be searched using the journal title, MEDLINE or ISO abbreviation, ISSN, or the NLM Catalog ID.
    • MeSH - The Medical Subject Headings (MeSH) database is NLM's controlled vocabulary used for indexing articles for MEDLINE/PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts.
    PubMed Central - a digital archive of biomedical and life sciences journal literature, including clinical medicine and public health, managed by the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine (NLM). It is not a journal publisher. Access to PubMed Central (PMC) is free and unrestricted.
    OMIM - Online Mendelian Inheritance in Man - continuously updated catalog of human genes and genetic disorders, with links to associated literature references, sequence records, maps, and related databases.
    Entrez Books - In collaboration with book publishers, the NCBI is adapting textbooks for the web and linking them to PubMed, the biomedical bibliographic database. The idea is to provide background information to PubMed, so that users can explore unfamiliar concepts found in PubMed search results.
    HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. RefSeq protein sequence records serve as anchors for collecting published information about interactions between HIV-1 and human proteins. Each HIV Interactions database record lists an HIV protein and the human proteins with which it has been found to interact. In turn, the Entrez Gene record for each human protein contains annotated HIV-1 Interactions bibliographies, which consist of brief statements on protein interactions with links to the corresponding PubMed records and sequence data. The HIV Interactions database is a collaborative project among the developers of RefSeq and Entrez Gene, and is similar in concept to GeneRIF. In contrast to GeneRIFs for single genes, however, the publications cited in the HIV Interactions Database contain statements about binding between two proteins rather than statements about the function of a single gene.

    Genomes and Maps Overview
    organism collections (including Entrez Genome, Entrez Genome Project, Map Viewer, Entrez Gene, UniGene, HomoloGene, and COGs), and organism-specific resources, such as: human, mouse, rat, zebrafish, Drosophila, nematode, plant genomes, yeast, malaria, microbial genomes, viruses, viroids, plasmids, eukaryotic organelles

    Organism Collections

    Genomic Biology - An introduction to the field of genomic biology, with links to the genome resources pages for major organisms and organism groups, as well as links to additional NCBI genome resources.
    Entrez Genome - sequence and map data from the whole genomes of over 1000 organisms. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life - bacteria, archaea, and eukaryota - are represented, as well as many viruses, phages, viroids, plasmids, and organelles. Entrez Genome provides graphical overviews of complete genomes/chromosomes, and the ability to explore regions of interest in progressively greater detail. ProtTables and TaxTables are provided for organisms on which analyses have been done by NCBI staff. In addition, the Map Viewer, a software component of Entrez Genome, provides views of integrated chromosome maps for a variety of organisms (see additional information about the Map Viewer below).
    Information about submitting genome data from complete genomes is provided in the Resource Guide section on Submission of complete genomes. After data from complete genomes are submitted, they are made available in Entrez Genome (as complete genomes or chromosomes) and Entrez Nucleotide (as chromosome or genome fragments such as contigs). Entrez Nucleotide also provides access to the records for complete genomes/chromosomes, but the default view of those records in the Nucleotide database is GenBank format, whereas the default view in Entrez Genome is a graphical overview. A companion database, Entrez Genome Project, is described below.
    Entrez Genome Project - a companion database to Entrez Genome (described above). The actual data from genome sequencing projects are contained in Entrez Genome (as complete genomes or chromosomes) and Entrez Nucleotide (as chromosome or genome fragments such as contigs). The Genome Project database, on the other hand, provides an umbrella view of the status of each genome project, links to project data in the other Entrez databases, and links to a variety of other NCBI and external resources associated with a given genome project. A genome project's status can be complete or in-progress, and the project can include large-scale sequencing, assembly, annotation, and mapping efforts. New genome sequencing projects can be registered through the Genome project submission form. More information about the submission of data from complete genomes is provided in the Resource Guide section on Submission of complete genomes. (Although the Entrez Genome Project database does not include viral genome sequencing projects, data from those projects are submitted to GenBank and are available in the Entrez Nucleotide and Entrez Genome databases. There is also a special set of resources at NCBI dedicated to Viral Genomes.)
    Genomes Announcements - To receive announcements about recently completed genomes, see the NCBI Announcements Email Lists page.
    Map Viewer - The Map Viewer is a software component of Entrez Genome (described above) that provides special browsing capabilities for a subset of organisms. It allows you to view and search an organism's complete genome, display chromosome maps, and zoom into progressively greater levels of detail, down to the sequence data for a region of interest. If multiple maps are available for a chromosome, it displays them aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. The organisms currently represented in the Map Viewer are listed on the Map Viewer home page and in the Map Viewer help document, which provides general information on how to use that tool. The number and types of available maps vary by organism, and are described in the "data and search tips" file provided for each organism.
    Entrez Gene - Entrez Gene provides a gene-based view of the data from a wide range of genomes. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. Each record represents a single gene from a given organism. The minimum set of data in a gene record includes a unique identifier or GeneID assigned by NCBI, a preferred symbol, and any of sequence information, map information, or official nomenclature from an authority list. In addition, a gene record can also include expression, structure, functional, and homology data, when available. Entrez Gene includes data from all organisms that have RefSeq genome records (with NC_* accessions, see more info above), and can also include data from recognized genome-specific databases that provide NCBI with information about genes (preferably with defining sequence) or mapped phenotypes.
    UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. UniGene clusters are annotated with mapping and expression information when possible (e.g., for human), and include cross-references to other resources. Sequence data can be downloaded by cluster through the UniGene web pages, or the complete data set can be downloaded from the repository/UniGene directory of the FTP site. In addition, UniGene DDD (described below) can be used to show differential expression of genes between cDNA libraries. The organisms represented in UniGene are listed on the UniGene home page.
    HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. Organisms represented are listed on the HomoloGene home page.
    COGs - Clusters of Orthologous Groups - natural system of gene families from complete genomes. Clusters of Orthologous Groups (COGs) were delineated by comparing protein sequences encoded in complete unicellular genomes representing 30 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. The Initial Version of COGs includes 44 organisms. The Updated Version of COGs includes 66 organisms in the Unicellular Clusters, plus Eukaryotic Clusters (called KOGs). More organisms will be added in the future.
    Download Genomes <350 KB via Entrez Genome pages for individual organisms
    Download Genomes >350 KB from the NCBI ftp site - see FTP information below; ftp links are also available from Entrez Genome pages for individual organisms
    Genome Sequencing Centers - list of genome sequencing centers and the organisms on which they work

    Human Genome
    Guide, Chromosomes, Sequences, Genes, BLAST, Clones, Genome Maps, Mapped Markers, Cytogenetics, Gene Expression, Genetic Variation, Disorders, Cancer Research, FTP
    Guide
    • Human Genome Resources Guide - overview of available human genome data resources. Includes bulletins and progress reports concerning the Human Genome Project and provides centralized access to previously disparate data.
    • Introduction to NCBI's Genome Resource - overview of the nature of data generated by the human genome project, and of the processes used to assemble and annotate the data and to integrate it with a wide range of information from other resources.
    • NCBI Contig Assembly and Annotation Process - describes the processes used to assemble contigs from the high throughput genomic sequences (HTGs, described above), and to annotate the contigs with features. It also describes the various resources that can be used to access the human genome data.
    • Tour of the Draft Human Genome Sequence - provides an introduction to how the draft sequence of the human genome can be used by biologists, and includes examples of the types of questions that can be answered with the data.
    Chromosomes
    • Map Viewer - integrated views of chromosome maps - The Map Viewer (described above) displays one or more maps which have been aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. For human, the Map Viewer includes >20 sequence, cytogenetic, genetic linkage, radiation hybrid, and other maps.  (When viewing a chromosome, use the "Maps & Options" dialog box to display the map(s) of interest.)   The sequence maps are based on the contigs built from the draft and finished sequence data generated by the Human Genome Project. A list of available human maps and their descriptions is provided. The Map Viewer help document provides general information on how to use that tool. Information about the NCBI Contig Assembly and Annotation Process is also available.

    Sequences
    • RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, mRNAs and proteins for gene models, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits, for example, NC_123456, NT_123456, NM_123456, NP_123456 (more info about accession numbers and access).
    • Entrez - provides integrated access to nucleotide and protein sequence data in GenBank, EMBL, DDBJ, RefSeq, PIR-International, PRF, Swiss-Prot, and PDB, along with 3D protein structures, genomic mapping information, and PubMed MEDLINE. Entrez contains pre-computed similarity searches for each database record, producing a list of related sequences, structures, and MEDLINE records. Includes sequence data from >160,000 species; use the organism field to limit searches to human records.
    • UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. Additional information about UniGene is provided above.
    • dbEST - Database of Expressed Sequence Tags - short (about 300-500 bp) cDNA sequences representing single-pass reads from mRNA. ESTs are usually produced in large numbers and represent a snapshot of the genes expressed in a given tissue and/or at a given developmental stage. Also includes ESTs generated by the CGAP project (see Cancer Research, below), and sequences from differential display and RACE experiments.
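    The RefSeq accession format described above (two letters, an underscore, and six digits) can be checked mechanically. The following is a minimal sketch only: the function name is_refseq_accession is hypothetical, the pattern encodes just the format stated in this guide, and version suffixes or other accession styles are assumed to be out of scope.

```python
import re

# Format stated in the guide: two letters, an underscore, six digits
# (for example NM_123456). Assumption: no version suffix is handled.
REFSEQ_PATTERN = re.compile(r"[A-Z]{2}_\d{6}")


def is_refseq_accession(text: str) -> bool:
    """Return True if text matches the two-letter/underscore/six-digit format."""
    return REFSEQ_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    # The first four values are the placeholder examples used in the guide.
    for candidate in ["NC_123456", "NT_123456", "NM_123456", "NP_123456", "U12345"]:
        print(candidate, is_refseq_accession(candidate))
```

    A check like this only confirms the shape of an identifier; it does not show that a record with that accession exists.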
    Genes
    • Entrez Gene - a gene-based view of the data from a wide range of genomes, including human. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. More information about Entrez Gene is provided above, in the Molecular Databases/Genes section.
    • OMIM - Online Mendelian Inheritance in Man - continuously updated catalog of human genes and genetic disorders, with links to associated literature references, sequence records, maps, and related databases.
    • RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, mRNAs and proteins for gene models, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits, for example, NC_123456, NT_123456, NM_123456, NP_123456 (more info about accession numbers and access).
    • UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. Additional information about UniGene is provided above.
    • HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms, including human, mouse, rat, zebrafish, and fruit fly, in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene.
    BLAST against human genomic sequence data
    Clones
    NCBI does not distribute clones. However, some NCBI resources contain information about clones and the sources from which they can be obtained.
    • Clone Maps - various clone maps for human have been included in the Map Viewer, described below. The document that describes the various maps available for human includes a section listing the maps that contain clone information. To select those maps for display, use the Maps&Options dialog box when viewing any human chromosome. (Several other organisms accessible through the Map Viewer are represented by maps that contain clone information. The organism-specific "data and search tips" files provide additional detail about the maps available for each organism.)
    • Human BAC Resource - A cytogenetic resource of large-insert, FISH-mapped clones containing sequence-tagged sites. Will help integrate cytogenetic, radiation-hybrid, linkage, and sequence maps of the human genome. Includes links to clone distributors.
    • Mammalian Gene Collection (MGC) - The NIH Mammalian Gene Collection (MGC) is a trans-NIH initiative that seeks to identify and sequence a representative full open reading frame (FL-ORF) clone for each human, mouse, and rat gene. The MGC project entails the production of cDNA libraries and sequences, database and repository development, as well as the support of research for improved library construction, sequencing, and analytic technologies. All the resources generated by the MGC are publicly accessible to the biomedical research community.
    Clone Information for Other (Non-human) Organisms - Some organisms have additional clone information resources. For example, the resources available for the mouse genome include several items mentioned above, plus a CloneFinder, described below. In addition, many records in dbEST (described above) include information about clone sources such as the I.M.A.G.E. consortium.
    Genome Maps
    • Entrez Genome - links to the human chromosome views in the Map Viewer (details below). Entrez Genome also includes a view of the human mitochondrion (accessible under eukaryotic organelles), which can be viewed in its entirety or explored in progressively greater detail (additional information about Entrez Genome above).
    • Map Viewer - integrated views of chromosome maps - The Map Viewer is a software component of Entrez Genome that displays one or more maps which have been aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. For human, the Map Viewer includes >20 sequence, cytogenetic, genetic linkage, radiation hybrid, and other maps.  (When viewing a chromosome, use the "Maps & Options" dialog box to display the map(s) of interest.)   The sequence maps are based on the contigs built from the draft and finished sequence data generated by the Human Genome Project. A list of available human maps and their descriptions is provided. The Map Viewer help document provides general information on how to use that tool. Information about the NCBI Contig Assembly and Annotation Process is also available.
    • GeneMap'99 - physical map of >35,000 human gene-based markers, constructed by the International Radiation Hybrid Mapping Consortium using a consistent set of RH reagents and methodologies. Provides a framework for accelerated sequencing efforts by highlighting key landmarks (gene-rich regions) of the chromosomes, and represents the cooperative efforts of more than one hundred scientists throughout the world.
      Note: The GeneMap'99 data have also been included in the Map Viewer, described above. When viewing a chromosome, use the "Maps & Options" dialog box to select the map(s) of interest.
    • NCBI RH Map - NCBI Integrated Radiation Hybrid Map contains 23,723 markers from both the G3 and GB4 RH panels of GeneMap'99. Those markers were mapped with respect to 1084 framework markers (a subset of markers common to the G3 and GB4 panels). All markers from both panels were interpolated onto the GB4 scale. The article by R. Agarwala et al. provides detail about the integration strategy, as well as the methods used to evaluate the quality of the integrated map.
      Note: The NCBI RH Map data have also been included in the Map Viewer, described above. When viewing a chromosome, use the "Maps & Options" dialog box to select the map(s) of interest.
    • OMIM Gene Map - cytogenetic locations of genes that have been reported in the literature and determined by a variety of mapping methods. Can be searched by gene symbol or cytogenetic chromosomal location. Accessible from the OMIM page (described above).
      Note: The OMIM Gene Map data have also been included in the Genes_Cytogenetic Map of the Map Viewer (described above). When viewing a chromosome, use the "Maps & Options" dialog box to select the map(s) of interest.
    • OMIM Morbid Map - alphabetical listing of diseases and corresponding cytogenetic map locations, with links to OMIM entries. Accessible from the OMIM page (described above).
      Note: The OMIM Morbid Map data have also been included in the Map Viewer, described above. When viewing a chromosome, use the "Maps & Options" dialog box to select the map(s) of interest.
    • Human-Mouse Homology Maps - a table comparing genes in homologous segments of DNA from human and mouse, sorted by position in each genome. Computed by integrating orthologs identified at The Jackson Laboratory with putative orthologs identified by sequence homology. The original maps by M. F. Seldin of the University of California at Davis are also available.
    Mapped Markers
    • dbSTS - Database of Sequence Tagged Sites - short (about 200-500 bp) genomic sequences that are thought to be operationally unique in a genome, and therefore define a specific position on the physical map.
    • UniSTS - a unified, non-redundant view of sequence tagged sites (STSs). UniSTS integrates marker and mapping data from a variety of public resources. If two or more markers have different names but the same primer pair, a single STS record is presented for the primer pair and all the marker names are shown. Each UniSTS record displays the primer sequences, product size, mapping information, and cross references to Entrez Gene, dbSNP, RHdb, GDB, MGD, and the Map Viewer. The marker report also lists GenBank and RefSeq records that contain the primer sequences, as determined by Electronic PCR (e-PCR). Data sources include dbSTS, RHdb, GDB, various human maps (Genethon genetic map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map, Stanford RH map, NHGRI chr 7 physical map, WashU chrX physical map), and various mouse maps (Whitehead RH map, Whitehead YAC map, Jackson Laboratory's MGD map).
    • e-PCR (Electronic PCR) - find putative map location of a query sequence. Computational procedure for finding sequence tagged sites in DNA sequences. (See additional information in the Tools/Nucleotide Sequence Analysis Section.)
    • GeneMap'99 - physical map of >35,000 human gene-based markers, constructed by the International Radiation Hybrid Mapping Consortium using a consistent set of RH reagents and methodologies. Provides a framework for accelerated sequencing efforts by highlighting key landmarks (gene-rich regions) of the chromosomes, and represents the cooperative efforts of more than one hundred scientists throughout the world.
      Note: The GeneMap'99 data have also been included in the Map Viewer, described above. When viewing a chromosome, use the "Maps & Options" dialog box to select the map(s) of interest.
    • OMIM Gene Map - cytogenetic locations of genes that have been reported in the literature and determined by a variety of mapping methods. Can be searched by gene symbol or cytogenetic chromosomal location. Accessible from the OMIM page (see Genes, above).
      Note: The OMIM Gene Map data have also been included in the Genes_Cytogenetic Map of the Map Viewer (described above). When viewing a chromosome, use the "Maps & Options" dialog box to select the map(s) of interest.
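    The UniSTS rule described above (markers with different names but the same primer pair are shown as a single STS record) can be illustrated with a short grouping sketch. This is not NCBI's implementation; the function name, the marker names, and the primer sequences below are all invented for illustration.

```python
from collections import defaultdict


def merge_by_primer_pair(markers):
    """Group marker names that share the same (forward, reverse) primer pair.

    markers: iterable of (name, forward_primer, reverse_primer) tuples.
    Returns a dict mapping each primer pair to the list of marker names.
    """
    merged = defaultdict(list)
    for name, forward, reverse in markers:
        # Assumption: primers are compared case-insensitively as plain strings.
        key = (forward.upper(), reverse.upper())
        if name not in merged[key]:
            merged[key].append(name)
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical markers; names and sequences are made up.
    example = [
        ("MARKER_A", "ACGTACGT", "TTGGCCAA"),
        ("MARKER_B", "acgtacgt", "ttggccaa"),  # same primers, different name
        ("MARKER_C", "GGGGCCCC", "AAAATTTT"),
    ]
    for pair, names in merge_by_primer_pair(example).items():
        print(pair, names)
```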
    Cytogenetics
    • Human BAC Resource - A cytogenetic resource of large-insert, FISH-mapped clones containing sequence-tagged sites. Will help integrate cytogenetic, radiation-hybrid, linkage, and sequence maps of the human genome. Includes links to clone distributors.
    • SKY/M-FISH & CGH Database - The NCI and NCBI SKY/M-FISH and CGH Database is a repository of publicly submitted data from Spectral Karyotyping (SKY), Multiplex Fluorescence In Situ Hybridization (M-FISH), and Comparative Genomic Hybridization (CGH), which are complementary fluorescent molecular cytogenetic techniques. SKY/M-FISH permits the simultaneous visualization of each human or mouse chromosome in a different color, facilitating the identification of chromosomal aberrations; CGH can be used to generate a map of DNA copy number changes in tumor genomes. Collaborative project with the National Cancer Institute.  (data submission instructions...)
    Gene Expression
    • Gene Expression Omnibus (GEO) - a gene expression and hybridization array data repository, as well as a curated, online resource for gene expression data browsing, query and retrieval. GEO was the first fully public high-throughput gene expression data repository, and became operational in July 2000. Many types of gene expression data from platforms such as spotted microarray (microarray), high-density oligonucleotide array (HDA), hybridization filter (filter) and serial analysis of gene expression (SAGE) data, are accepted, accessioned, and archived as a public data set. GEO data can be accessed through several search and browsing tools on the GEO home page, Entrez (via Entrez GEO Profiles and Entrez GDS (GEO DataSets)), and the FTP site. The Tools/Gene Expression section of this file provides information about data visualization and exploration capabilities available in GEO.
    • CGAP - Cancer Genome Anatomy Project - interdisciplinary program to identify the human genes expressed in different cancerous states, based on cDNA (EST) libraries, and to determine the molecular profiles of normal, precancerous, and malignant cells. Collaboration among the National Cancer Institute, the NCBI, and numerous research labs. Additional information about CGAP is provided in the Tools/Gene Expression section of this file. Related resources are described in the Human Genome/Cancer Research section.
    • SAGEmap - Serial Analysis of Gene Expression, or SAGE, is an experimental technique designed to quantitatively measure gene expression. SAGEmap is an online tool to compare computed gene expression profiles between SAGE libraries generated by the Cancer Genome Anatomy Project (CGAP, described under human genome/cancer research) and submitted by others through the Gene Expression Omnibus (GEO, described under molecular databases/gene expression). SAGEmap also includes a comprehensive analysis of SAGE tags in human GenBank records, in which a UniGene identifier is assigned to each human sequence that contains a SAGE tag. Data can be retrieved by tag, by sequence, by UniGene cluster ID and by library name. When retrieving data by sequence or UniGene cluster ID, follow a SAGE tag's hotlink to find out its expression level in different SAGE libraries, and how it is represented in the rest of the sequences in GenBank. Retrieving data by library name takes one to GEO, where all SAGEmap data has been stored by library. Analytical tools include xProfiler, which compares gene expression between SAGE libraries of your choice as well as uploaded data. More information about the additional analytical capabilities of the SAGEmap resource is provided in the tools/gene expression section of this file.
    • UniGene DDD - Digital Differential Display - an online tool to compare computed gene expression profiles between selected cDNA libraries. Using a statistical test, genes whose expression levels differ significantly from one tissue to the next are identified and shown to the user. Additional information about UniGene is above.
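    The kind of comparison DDD performs (testing whether a gene's share of ESTs differs between two libraries) can be sketched with a small count-based test. This guide does not name the statistical test, so the sketch below assumes a two-sided Fisher exact test on a 2x2 table; the function name and the counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a: ESTs for the gene in library 1;  b: all other ESTs in library 1
    c: ESTs for the gene in library 2;  d: all other ESTs in library 2
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x gene ESTs in library 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Hypothetical counts: 1 of 10 ESTs in library 1 versus 11 of 14 in library 2.
    print(fisher_exact_two_sided(1, 9, 11, 3))
```

    A small p-value suggests the gene's representation differs between the two libraries; any significance cutoff is a separate choice not specified here.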
    Genetic Variation
    • dbSNP - Database of Single Nucleotide Polymorphisms - NCBI database of single nucleotide polymorphisms, microsatellites, and small-scale insertions and deletions. dbSNP contains population-specific frequency and genotype data, experimental conditions, molecular context, and mapping information for both neutral polymorphisms and clinical mutations.
    • OMIM - Online Mendelian Inheritance in Man - allelic variants in ~900 (9%) of OMIM records. To view a list of those OMIM records, use the OMIM Limits page in Entrez and check the box for "Allelic Variants" under the section titled "Only records with." (For more information about OMIM, see Genes, above.)
    • Locus Specific Mutation Databases - links to numerous external mutation databases are provided on the OMIM allied resources page and from related OMIM entries. When an individual OMIM entry contains links to locus-specific mutation databases, the links are shown under the "LinkOut" header in the blue sidebar. (The LinkOut header appears only in entries that have links to resources outside of the Entrez system.)
    Disorders
    • Genes and Disease - introduction to the relationship between genetic factors and human disease. Summary information for ~60 genetic diseases with links to related databases and organizations.
    • OMIM - Online Mendelian Inheritance in Man - continuously updated catalog of human genes and genetic disorders, with links to associated literature references, sequence records, maps, and related databases.
    • OMIM Morbid Map - alphabetical listing of diseases and corresponding cytogenetic map locations, with links to OMIM entries. Accessible from OMIM page (see Genes).
    Cancer Research
    • CCAP - Cancer Chromosome Aberration Project - designed to expedite the definition and detailed characterization of the distinct chromosomal alterations that are associated with malignant transformation. Collaboration among the National Cancer Institute, the NCBI, and numerous research labs.
    • CGAP - Cancer Genome Anatomy Project - interdisciplinary program to identify the human genes expressed in different cancerous states, based on cDNA (EST) libraries, and to determine the molecular profiles of normal, precancerous, and malignant cells. Collaboration among the National Cancer Institute, the NCBI, and numerous research labs. Additional information about CGAP is provided in the Tools/Gene Expression section of this file.
    • SKY/M-FISH & CGH Database - The NCI and NCBI SKY/M-FISH and CGH Database is a repository of publicly submitted data from Spectral Karyotyping (SKY), Multiplex Fluorescence In Situ Hybridization (M-FISH), and Comparative Genomic Hybridization (CGH), which are complementary fluorescent molecular cytogenetic techniques. SKY/M-FISH permits the simultaneous visualization of each human or mouse chromosome in a different color, facilitating the identification of chromosomal aberrations; CGH can be used to generate a map of DNA copy number changes in tumor genomes. Collaborative project with the National Cancer Institute.  (data submission instructions...)
    • SAGE Analysis - differential expression of SAGE tags in cancer libraries. (See additional information about SAGEmap, below.)
    FTP
    • Human chromosome data - the ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/ directory contains one folder for each chromosome, which includes genomic contigs (NT_* records) built from finished and unfinished sequence data. The contigs are available in various formats, described below. The contig assembly and annotation process is described in a separate document.
      hs_chr*.asn ASN.1 format (description above)
      hs_chr*.fa.gz FASTA format (description above)
      hs_chr*.gbk.gz GenBank flat file format
      (annotations currently include STS markers; known and predicted genes will be added in coming months)
      hs_chr*.gbs GenBank summary format
      (this format does not contain sequence data, but instead contains a "CONTIG" field, showing how the contig is assembled from individual GenBank accessions)
      Data from the Map Viewer (described above) is available in the ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview/ subdirectory.
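    The wildcard patterns listed above can be used to sort a directory listing by file format. The sketch below is illustrative only: it uses the four patterns exactly as given in this guide, while the function name and the sample file names are hypothetical and are not taken from the FTP site.

```python
from fnmatch import fnmatchcase

# Wildcard patterns as listed in the guide, mapped to the format each denotes.
HUMAN_CONTIG_FORMATS = {
    "hs_chr*.asn": "ASN.1",
    "hs_chr*.fa.gz": "FASTA",
    "hs_chr*.gbk.gz": "GenBank flat file",
    "hs_chr*.gbs": "GenBank summary",
}


def classify_contig_file(filename):
    """Return the format label for a file name, or None if no pattern matches."""
    for pattern, label in HUMAN_CONTIG_FORMATS.items():
        if fnmatchcase(filename, pattern):
            return label
    return None


if __name__ == "__main__":
    # Hypothetical file names that fit (or do not fit) the patterns above.
    for name in ["hs_chr1.fa.gz", "hs_chrX.gbs", "hs_chr7.gbk.gz", "README"]:
        print(name, "->", classify_contig_file(name))
```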

    Mouse Genome
    Guide,   Chromosomes,   Sequences,   Genes,   Clones,   Maps and Mapped Markers,   Cytogenetics,   BLAST,   FTP
     
    Guide
    • Mouse Genome Resources Guide - brings together information on diverse mouse-related resources from multiple centers: sequence, mapping, and clone information as well as pointers to strain and mutant resources.
    Chromosomes
    • Mouse Genome Sequencing - sequencing progress of the Mouse Genome Project; high-throughput genomic sequence contigs assembled from finished (phase 3) data; view by chromosome number, size and physical position; download sequence data by contig or by chromosome; BLAST against contigs.
    • Map Viewer - integrated chromosome maps - The Map Viewer is a software component of Entrez Genome that displays one or more maps which have been aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. The maps that are currently available for Mus musculus are described in the Mus musculus data and search tips document. The Map Viewer help document provides general information on how to use that tool.
    Sequences
    • Mouse Genome Sequencing - sequencing progress of the Mouse Genome Project; high-throughput genomic sequence contigs assembled from finished (phase 3) data; view by chromosome number, size and physical position; download sequence data by contig or by chromosome; BLAST against contigs.
    • Entrez - includes sequence data from >160,000 species; use the organism field to limit searches to mouse records. See additional information about Entrez above, and Batch Entrez, below.
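    Limiting an Entrez search by organism, as suggested above, amounts to combining the search term with an organism field restriction. The helper below is a hypothetical sketch that only builds the query text; it assumes the bracketed [Organism] field tag and performs no search itself.

```python
def limit_to_organism(term: str, organism: str) -> str:
    """Combine a search term with an organism restriction.

    Assumption: the organism field is written with the [Organism] tag
    and the organism name is quoted so it is treated as one phrase.
    """
    return f'({term}) AND "{organism}"[Organism]'


if __name__ == "__main__":
    # Hypothetical search term, restricted to mouse records.
    print(limit_to_organism("p53", "Mus musculus"))
```

    The resulting string can be pasted into the Entrez search box; the same helper works for human records by passing "Homo sapiens".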
    Genes
    • Entrez Gene - a gene-based view of the data from a wide range of genomes, including mouse. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. More information about Entrez Gene is provided above, in the Molecular Databases/Genes section.
    • UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. Additional information about UniGene is provided above.
    • HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms, including human, mouse, rat, zebrafish, and fruit fly, in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene.
    Clones
    • Mammalian Gene Collection (MGC) - The NIH Mammalian Gene Collection (MGC) is a trans-NIH initiative that seeks to identify and sequence a representative full open reading frame (FL-ORF) clone for each human, mouse, and rat gene. Additional information is provided above.
    • CloneFinder - allows users to identify clones that contain an object, or that are contained within a particular genomic region. Clone placement is based on the alignment of BAC end sequences (BES) to the current genome assembly. Currently, CloneFinder is available for mouse only.
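    The two kinds of lookup described for CloneFinder (clones that contain an object, and clones contained within a region) reduce to interval comparisons once clones have been placed on the assembly. The sketch below is not CloneFinder's implementation; the class, the function names, the clone names, and the coordinates are all hypothetical, and all positions are assumed to lie on the same chromosome.

```python
from typing import NamedTuple


class Clone(NamedTuple):
    name: str
    start: int  # placement start on the assembly (hypothetical coordinates)
    end: int    # placement end, inclusive


def clones_containing(clones, obj_start, obj_end):
    """Names of clones whose placement fully covers the object."""
    return [c.name for c in clones if c.start <= obj_start and obj_end <= c.end]


def clones_within(clones, region_start, region_end):
    """Names of clones that lie entirely inside the region."""
    return [c.name for c in clones if region_start <= c.start and c.end <= region_end]


if __name__ == "__main__":
    placed = [
        Clone("clone_1", 1_000, 150_000),
        Clone("clone_2", 120_000, 300_000),
        Clone("clone_3", 400_000, 550_000),
    ]
    print(clones_containing(placed, 130_000, 140_000))
    print(clones_within(placed, 100_000, 600_000))
```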
    Maps and Mapped Markers
    • Map Viewer - integrated chromosome maps - The Map Viewer is a software component of Entrez Genome that displays one or more maps which have been aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. The maps that are currently available for Mus musculus are described in the Mus musculus data and search tips document. The Map Viewer help document provides general information on how to use that tool.
    • Human-Mouse Homology Maps - a table comparing genes in homologous segments of DNA from human and mouse, sorted by position in each genome. Computed by integrating orthologs identified at The Jackson Laboratory with putative orthologs identified by sequence homology. The original maps by M. F. Seldin of the University of California at Davis are also available.
    • UniSTS - a unified, non-redundant view of sequence tagged sites (STSs). UniSTS integrates marker and mapping data from a variety of public resources. If two or more markers have different names but the same primer pair, a single STS record is presented for the primer pair and all the marker names are shown. Each UniSTS record displays the primer sequences, product size, mapping information, and cross references to Entrez Gene, dbSNP, RHdb, GDB, MGD, and the Map Viewer. The marker report also lists GenBank and RefSeq records that contain the primer sequences, as determined by Electronic PCR (e-PCR). Data sources include dbSTS, RHdb, GDB, various human maps (Genethon genetic map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map, Stanford RH map, NHGRI chr 7 physical map, WashU chrX physical map), and various mouse maps (Whitehead RH map, Whitehead YAC map, Jackson Laboratory's MGD map).
    Cytogenetics
    • SKY/M-FISH & CGH Database - The NCI and NCBI SKY/M-FISH and CGH Database is a repository of publicly submitted data from Spectral Karyotyping (SKY), Multiplex Fluorescence In Situ Hybridization (M-FISH), and Comparative Genomic Hybridization (CGH), which are complementary fluorescent molecular cytogenetic techniques. SKY/M-FISH permits the simultaneous visualization of each human or mouse chromosome in a different color, facilitating the identification of chromosomal aberrations; CGH can be used to generate a map of DNA copy number changes in tumor genomes. Collaborative project with the National Cancer Institute.  (data submission instructions...)
    BLAST

    FTP
    • Mouse chromosome data - the ftp://ftp.ncbi.nih.gov/genomes/M_musculus/ directory contains one folder for each chromosome, which includes genomic contigs (NT_* records) built from finished sequence data. The contigs are available in various formats:
      mm_chr*.asn ASN.1 format (description above)
      mm_chr*.fa.gz FASTA format (description above)
      mm_chr*.gbk.gz GenBank flat file format
      (annotations currently include STS markers; known and predicted genes will be added in coming months)
      mm_chr*.gbs GenBank summary format
      (this format does not contain sequence data, but instead contains a "CONTIG" field, showing how the contig is assembled from individual GenBank accessions)
      See additional information about the genomes FTP directories, below

    Rat Genome
    Rat Genome Resources Guide - brings together information on diverse rat-related resources from multiple centers: sequence, mapping, and clone information as well as pointers to strain and mutant resources.
    Map Viewer - integrated chromosome maps - The Map Viewer is a software component of Entrez Genome that displays one or more maps which have been aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. The maps that are currently available for rat are described in the Rattus norvegicus data and search tips document. The Map Viewer help document provides general information on how to use that tool.
    Entrez Gene - a gene-based view of the data from a wide range of genomes, including rat. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. More information about Entrez Gene is provided above, in the Molecular Databases/Genes section.
    BLAST against the rat genome -   Nucleotide or protein query sequences can be used. A variety of database choices are provided.
    UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. Additional information about UniGene is provided above.
    HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms, including human, mouse, rat, zebrafish, and fruit fly, in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene.

    Cow Genome
    UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. Additional information about UniGene is provided above.

    Zebrafish Genome
    Zebrafish Genome Resources Guide - brings together information on diverse zebrafish-related resources from multiple centers: sequence, mapping, and clone information as well as pointers to strain and mutant resources.
    Entrez Gene - a gene-based view of the data from a wide range of genomes, including zebrafish. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. More information about Entrez Gene is provided above, in the Molecular Databases/Genes section.
    Map Viewer - integrated chromosome maps - The Map Viewer is a software component of Entrez Genome that displays one or more maps which have been aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. The maps that are currently available for Danio rerio are described in the Danio rerio genome data and search tips document. The Map Viewer help document provides general information on how to use that tool.
    UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. Additional information about UniGene is provided above.
    HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms, including human, mouse, rat, zebrafish, and fruit fly, in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene.

    Drosophila Genome
    Drosophila melanogaster Home Page - provides an overview of available resources for that organism, graphically displays all the chromosomes (to scale), and allows you to search both cytogenetic and sequence data across the whole genome through the Entrez Genome browser. Entrez Genome presents a unified graphical view of maps (genetic and physical) and sequence data for an organism. After you search for a term such as a gene symbol, it presents a graphic Genome View of search results, from which you can zoom into progressively more detailed Map Views of the region of interest, and link to sequence data and associated resources that contain additional detail.
    Map Viewer - integrated chromosome maps - The Map Viewer is a software component of Entrez Genome that displays one or more maps which have been aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. The sequence and cytogenetic maps that are currently available for Drosophila are described in the Drosophila melanogaster genome data and search tips document. The Map Viewer help document provides general information on how to use that tool.
    Entrez Gene - a gene-based view of the data from a wide range of genomes, including Drosophila. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. More information about Entrez Gene is provided above, in the Molecular Databases/Genes section.
    HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms, including human, mouse, rat, zebrafish, and fruit fly, in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene.
    BLAST against Drosophila melanogaster genome sequence
    • select Drosophila genome as the target database when using the nucleotide BLAST, protein BLAST, or translated BLAST search pages.  Or,
    • check the box for Drosophila melanogaster in the list of organisms on the BLAST with Eukaryotic genomes page.
    FTP Site - see additional information about the genomes FTP directories, below

    Nematode Genome
    Caenorhabditis elegans Home Page - Graphical representation of chromosomes that can be viewed in their entirety or explored in progressively greater detail in the Map Viewer (described above). Home page also includes links to many related resources, such as sequencing centers, other nematode sequencing projects, related databases, etc.
    FTP Site - the chromosome data sets are available for FTP in a variety of formats, including GenBank, FASTA, ASN.1, and others, in the genbank/genomes/C_elegans/ directory of the NCBI FTP site (ftp://ftp.ncbi.nih.gov/). An NCBI-curated version of the data is available in the genomes/C_elegans/ directory.  (See the additional note in the FTP section, below, about the two different FTP directories.)

    Plant Genomes
    Plant Genomes Central - provides access to data from large-scale sequencing projects, genetic maps, and large-scale EST sequencing projects. All organism names on the page are linked to the corresponding taxonomic information in NCBI's Taxonomy database (described above). In addition, organisms listed under "large-scale sequencing projects" and "genetic maps" are represented in the Map Viewer (described above). Organisms listed under "large-scale EST sequencing projects" are linked to their EST sequences in Entrez (described above).
    UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. Additional information about UniGene is provided above.

    Yeast Genome
    Saccharomyces cerevisiae Home Page - baker's yeast - graphical representation of chromosomes that can be viewed in their entirety or explored in progressively greater detail in Entrez Genome (described above), with links to associated sequence data. Home page also includes links to many related resources, such as sequencing centers, other fungi sequencing projects, related databases, etc.
    Schizosaccharomyces pombe Home Page - fission yeast - similar to the home page for Saccharomyces cerevisiae, described above.
    COGs - Clusters of Orthologous Groups - natural system of gene families from complete genomes. Clusters of Orthologous Groups (COGs) were delineated by comparing protein sequences encoded in complete unicellular genomes representing 30 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. The Initial Version of COGs includes 44 organisms. The Updated Version of COGs includes 66 organisms in the Unicellular Clusters, plus Eukaryotic Clusters (called KOGs). More organisms will be added in the future.
    BLAST against the Saccharomyces cerevisiae or Schizosaccharomyces pombe genome sequences
    • check the box for Saccharomyces cerevisiae and/or Schizosaccharomyces pombe in the list of organisms on the BLAST with Eukaryotic genomes page.   OR
    • select yeast as the target database when using the nucleotide BLAST, protein BLAST, or translated BLAST search pages; this searches only Saccharomyces cerevisiae data, however.
    FTP Saccharomyces cerevisiae Chromosomes

    Malaria Genome

    Malaria Genetics & Genomics - provides data and information relevant to malaria genetics and genomics. Resources include organism-specific sequence BLAST databases (Plasmodium falciparum only, all Plasmodium, and all Toxoplasma), genome maps, linkage markers, and information about genetic studies. Links are provided for other malaria web sites and genetic data on related apicomplexan parasites, including Toxoplasma gondii.
    Map Viewer - The Map Viewer (described above) provides graphical views and search capabilities for both Plasmodium falciparum and Anopheles gambiae (malaria mosquito).
    BLAST against Malaria sequences
    FTP

    Microbial Genomes

    Entrez Genome - Graphical representation of complete bacterial genomes that can be viewed in their entirety or explored in progressively greater detail; links to associated sequence data. A "ProtTable" of protein coding genes is provided for each bacterium. There are also links to a "TaxTable," showing the distribution of BLAST protein homologs by taxa (sequences grouped by superkingdom), and to a distribution of BLAST protein homologs by 3-D structure (sequences with known structure). Additional information about Entrez Genome is also provided above.
    Entrez Genome Project - provides an umbrella view of the status of a wide range of genome projects, and includes information about microbial genome sequencing projects. Tabs allow you to switch between lists of completed and in-progress microbial genome projects. The list of completed genomes includes links to NCBI graphical views of the data (in Entrez Genome), sequencing centers, and the results of various analyses that have been done on the genomes at NCBI (e.g., TaxTable, COG Table, 3-D Neighbors, and more). The list of in-progress sequencing projects includes links to sequencing centers and, when available, to BLASTable data. A more detailed description of the Entrez Genome Project database is provided in the section on Genomes and Maps/Organism Collections.
    COGs - Clusters of Orthologous Groups - natural system of gene families from complete genomes. Clusters of Orthologous Groups (COGs) were delineated by comparing protein sequences encoded in complete unicellular genomes representing 30 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. The Initial Version of COGs includes 44 organisms. The Updated Version of COGs includes 66 organisms in the Unicellular Clusters, plus Eukaryotic Clusters (called KOGs). More organisms will be added in the future.
    BLAST against Microbial Genomes - sequences from selected completed and unfinished eukaryotic and prokaryotic genomes; partial genomic sequences have been graciously provided by the sequencing centers or extracted from GenBank. NCBI encourages sequencing centers to submit partially sequenced genomes to be included in this BLAST page. Data can be submitted via ftp, after contacting genomes@ncbi.nlm.nih.gov to set up an account.
    FTP - download complete bacterial genomes in a variety of formats, including GenBank flat file (*.gbk), GenBank summary file (*.gbs), FASTA Nucleic Acid file (*.fna), FASTA Amino Acid file (*.faa), Protein Table (*.ptt), and others.
    (See additional note in the FTP section, below, about the two different FTP directories)

    Viral Genomes

    Viruses Home Page provides brief background information on the biology of viruses, links to viral genome sequences in Entrez Genome (described below), and a wide range of related resources. It also includes information about Viral Reference Sequences, a collection of reference sequences for more than 1000 viral genomes.
    Entrez Genome - Graphical representation of complete viral genomes that can be viewed in their entirety or explored in progressively greater detail; links to associated sequence data. A summary of Coding Regions (described above) is provided for each virus. Additional information about Entrez Genome is also provided above.
    Influenza Virus Resource - A collection of resources specifically designed to support research on the flu virus. Includes links to genome sequence data, analytical tools, epidemiological information, and the Influenza Genome Sequencing Project, funded by the National Institute of Allergy and Infectious Diseases (NIAID).
    Retrovirus Resources - A collection of resources specifically designed to support research on retroviruses. Resources include a genotyping tool that uses the BLAST algorithm to identify the genotype of a query sequence; an alignment tool for global alignment of multiple sequences; an HIV-1 automatic sequence annotation tool; and annotated maps of 16 retroviruses viewable in GenBank, FASTA, and graphic formats, with links to associated sequence records.
    HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data. More information about this database is provided under "Literature Databases".
    PASC (PAirwise Sequence Comparison) - a web tool for analyzing the distribution of pairwise identities within viral families. Identities are pre-computed for every pair of genomes within each family, and their distribution is plotted as a histogram in which each bar corresponds to an interval of identities. Only complete genomes should be used as query sequences; results from partial sequences are not suitable for the purpose of this tool. After you submit your sequence, PASC computes pairwise identities between the submitted genome and the existing genome sequences of the family. At the end of the process, you are presented with a list of the 15 closest matches to the submitted genome within the family. The documentation provides more details about using PASC.
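
    The histogram idea behind PASC can be illustrated with a small sketch. This is not PASC's own method: it assumes the sequences are already aligned to the same length, and the function names, bin width, and toy sequences are all hypothetical. It only shows how pairwise identities for every pair in a family might be binned into intervals.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent of identical positions in two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def identity_histogram(sequences, bin_width=10):
    """Count sequence pairs per identity interval, keyed by the interval's lower bound."""
    n_bins = 100 // bin_width
    counts = {i * bin_width: 0 for i in range(n_bins)}
    for a, b in combinations(sequences, 2):
        identity = percent_identity(a, b)
        # A pair with 100% identity is placed in the top interval.
        index = min(int(identity // bin_width), n_bins - 1)
        counts[index * bin_width] += 1
    return counts


# Toy "family" of three pre-aligned sequences (hypothetical data).
family = ["ACGTACGTAC", "ACGTACGTAA", "TTGTACGTAA"]
for lower_bound, pairs in identity_histogram(family).items():
    print(f"{lower_bound:3d}-{lower_bound + 10:3d}%: {'#' * pairs}")
```

    Each printed row corresponds to one histogram bar; a real analysis would first align complete genomes, which this sketch leaves out.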

    Viroid Genomes
    Entrez Genome - Graphical representation of complete viroid genomes with links to corresponding sequence records. Additional information about Entrez Genome is also provided above.

    Plasmids
    Entrez Genome - Graphical representation of complete plasmids that can be viewed in their entirety or explored in progressively greater detail; links to associated sequence data. A summary of Coding Regions (described above) is provided for each plasmid. Additional information about Entrez Genome is also provided above.

    Eukaryotic Organelles
    Eukaryotic Organelles Home Page - Provides an overview of eukaryotic organelles; a description of the Organelle Reference Sequences project (part of RefSeq, see above); and links to (a) lists of completely sequenced organelles shown in taxonomic hierarchy and alphabetically by organism, (b) gene and RNA order in metazoan mitochondria, and (c) related web sites.
    Entrez Genome - Graphical representation of complete eukaryotic organelles that can be viewed in their entirety or explored in progressively greater detail; links to associated sequence data. A summary of Coding Regions (described above) is provided for each organelle. Additional information about Entrez Genome is also provided above.

    Tools Overview

    Text Term Searching (Entrez),   Sequence Similarity Searching (BLAST),   Nucleotide Sequence Analysis,   Protein Sequence Analysis and Proteomics,   3-D Structure Display and Similarity Searching,   Genome Analysis,   Gene Expression
     

    Data Retrieval - Text Term Searching

    Entrez - provides integrated access to nucleotide and protein sequence data from >160,000 organisms, along with 3D protein structures, genomic mapping information, PubMed MEDLINE, and more. Sequence data are combined from various sources, including GenBank, EMBL, DDBJ, RefSeq, PIR-International, PRF, Swiss-Prot, and PDB. A Data Model provides a schematic illustration of the connections between the many data types in Entrez.
    • Two unique features of Entrez are:
      1. pre-computed similarity searches for each database record, identifying the related records ("neighbors") within that database. The algorithm used to identify related records depends upon the database.
      2. links from a record in one database to associated records in the other Entrez databases, providing integrated access across the various databases. For example, if a MEDLINE record cites a GenBank nucleotide sequence record, which in turn is linked to a protein translation, there will be a link between those three records. The Entrez Data Model illustrates the links that exist among the various Entrez Databases.
    • Entrez can be searched with a wide variety of text terms such as author name, journal name, gene or protein name, organism, unique identifier (e.g., accession number, sequence ID, PubMed ID), and other terms, depending on the database being searched.
    • The help document provides more information about the databases available in Entrez as well as search tips. External resources can be linked to Entrez records using the new LinkOut service (described below). Entrez also allows users to store search strategies and select a customized subset of LinkOut links through the My NCBI service (described below).
    Batch Entrez - allows you to retrieve a large number of nucleotide sequences or protein sequences from Entrez, in a batch mode, by importing a file containing a list of the desired GI or accession numbers. Search results are saved directly to a local disk file on your computer.
    Entrez Utilities - Entrez Programming Utilities, also called E-Utilities, are tools that provide access to Entrez data outside of the regular web query interface. They represent a method of making WWW links to Entrez. Each utility performs a specialized retrieval task, and can be used simply by writing a specially formatted URL. For example, EFetch retrieves records in the requested format from a list of one or more primary IDs or from the user's environment. The E-Utilities web page describes the available utilities and links to a brief help document for each one. E-Utilities can be helpful for retrieving search results for future use in another environment. To receive announcements about Entrez Utilities, see the NCBI Email Lists page.
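
    As a rough illustration of what "a specially formatted URL" means for EFetch, the sketch below only assembles a URL and does not contact NCBI. It assumes the current public E-Utilities address and the db, id, rettype, and retmode parameters; the helper name build_efetch_url and the accession used are illustrative only, so check the E-Utilities help documents before relying on it.

```python
from urllib.parse import urlencode

# Assumed base address of the E-Utilities service; verify against the help documents.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(db, ids, rettype, retmode="text"):
    """Return an EFetch URL for one database and a list of primary IDs."""
    params = {
        "db": db,                              # Entrez database to fetch from
        "id": ",".join(str(i) for i in ids),   # comma-separated primary IDs
        "rettype": rettype,                    # requested record format
        "retmode": retmode,                    # requested output mode
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Example accession chosen only for illustration.
url = build_efetch_url("nucleotide", ["NM_000249"], "fasta")
print(url)
```

    The resulting string can be pasted into a browser or passed to any HTTP client; keeping URL construction separate from the request makes it easy to inspect exactly what will be sent.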
    LinkOut - a registry service to create links from specific articles, journals, or biological data in Entrez (described above) to resources on external web sites. Third parties can provide a URL, resource name, brief description of their web site, and specification of the NCBI data from which they would like to establish links. The specification can be written as a valid Boolean query to Entrez, or as a list of identifiers for specific articles or sequences. Entrez PubMed users can then select which external links are visible in their searches, through the NCBI My NCBI service (described below).   To receive announcements about updates and new features in LinkOut, see the NCBI Announcements Email Lists page.
    My NCBI - Formerly known as "Cubby", My NCBI allows Entrez users to store and update searches, receive automatic e-mails of search updates, select the Filter folder tabs shown by default for any Entrez database, and customize their LinkOut (described above) display to include or exclude links to providers. My NCBI requires that your system accepts cookies. You must also complete a brief registration form in which you select a username and password. You will need those in order to access your "My NCBI" account. There is also an option to remain logged into My NCBI, if desired. For additional information, see the help document and tutorial.
    Query E-mail Server - The Query server, which provided e-mail access to a subset of Entrez databases, was discontinued on April 15, 2002 because of limited usage. Almost all Entrez searchers now use the WWW Entrez interface, described above. It provides access to more databases and more features than are possible through the e-mail interface.
    Citation Matcher - allows you to find the PubMed ID of any article in the PubMed database, given its bibliographic information (journal, volume, page, etc.).

    Sequence Similarity Searching

    BLAST Home Page - provides access to BLAST (Basic Local Alignment Search Tool) programs, overview, help documentation, FAQs.   BLAST programs include:

    BLAST Announcements - To receive announcements about updates and new features, and advance notices about upcoming changes in the NCBI BLAST service, see the NCBI Announcements Email Lists page.
    BLAST 2.x - A version of BLAST (Altschul, et al., 1997) that permits gaps in the alignments it produces. Assessments of statistical significance are based upon prior simulations using random sequences. (more...)
    QBLAST - A queuing system that allows users to retrieve Gapped BLAST results at their convenience and format their results multiple times with different formatting options. This system also allows the NCBI to more efficiently use computational resources, better serving the community. As of Fall 1999, the QBLAST system is used for all BLAST searches. (more...)
    MegaBLAST - permits searching with batches of ESTs or with large cDNA or genomic sequences. (more...)
    • MegaBLAST against the Trace Archives - compare nucleotide sequence data against the raw data underlying all of the sequence generated by various genome projects. Additional information about the Trace Archive is above.
    PHI-BLAST - Pattern Hit Initiated BLAST (Zhang, et al., 1998) - A program to search a protein database using a protein query, seeking only alignments that preserve a specified pattern contained within the query. (more...)
    PSI-BLAST - Position-Specific Iterated BLAST (Altschul, et al., 1997) - A program for searching protein databases using protein queries, in order to find other members of the same protein family. All statistically significant alignments found by BLAST are combined into a multiple alignment, from which a position-specific score matrix is constructed. This matrix is used to search the database for additional significant alignments, and the process may be iterated until no new alignments are found. (more...)
    RPS-BLAST - Reverse Position-Specific BLAST - A program used to identify conserved domains in a protein query sequence. It does this by comparing a query protein sequence to position-specific score matrices that have been prepared from conserved domain alignments. The service is accessible through Conserved Domain Search (CD-Search), described below. A readme file provides additional detail about the RPS-BLAST program.
    Note: RPS-BLAST is a "reverse" version of position-specific iterated BLAST (PSI-BLAST), described above. Both RPS-BLAST and PSI-BLAST use multiple alignments and position-specific score matrices (PSSMs) to derive conserved features of a protein family. However, RPS-BLAST compares a query sequence against a database of profiles prepared from ready-made alignments, while PSI-BLAST builds alignments starting from a single protein sequence. The programs also differ in purpose: RPS-BLAST is used to identify conserved domains in a query sequence, while PSI-BLAST is used to identify other members of the protein family to which a query sequence belongs.
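
    To make the term "position-specific score matrix" concrete, here is a toy sketch that derives log-odds scores from the columns of a small multiple alignment and uses them to score a segment. It is not the procedure used by PSI-BLAST or RPS-BLAST: the uniform background frequencies, the simple pseudocount, the four-letter alphabet, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            residue: math.log2(((counts[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of a segment as long as the matrix."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


# Hypothetical three-sequence alignment over a toy four-letter alphabet.
alignment = ["ACD", "ACE", "ACD"]
pssm = build_pssm(alignment, "ACDE")
print(score_segment(pssm, "ACD"), score_segment(pssm, "DEA"))
```

    A segment that follows the conserved columns scores higher than one that does not, which is the property both programs rely on when comparing a sequence with a profile.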
    Taxonomy BLAST - an implementation of Gapped BLAST (2.x) that groups hits by source organism, according to information in NCBI's Taxonomy database. Species are listed in order of sequence similarity to the query sequence, with the strongest match listed first. Three report views are available:
    • organism report - sorts the BLAST hits according to species, so that all of the hits to the same organism will appear together
    • lineage report - gives a simplified view of the relationships between the organisms, according to their classification in the taxonomy database. This report is "focused" on the organism which yielded the strongest BLAST hit. It answers the question, "how closely are the organisms in the BLAST hit list related to the query sequence according to the taxonomy database?"
    • taxonomy report - provides a more detailed report about the relationships among all of the organisms found in the BLAST hit list, including a summary of the taxa that are represented, the number of species and subspecies, and the number of BLAST hits at each node in the taxonomic hierarchy.
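
    The grouping performed by the organism report can be sketched in a few lines. This is only an illustration of the idea, not the Taxonomy BLAST implementation; the hit tuples, identifiers, and scores below are hypothetical.

```python
from collections import defaultdict


def organism_report(hits):
    """Group (organism, accession, score) hits by organism.

    Organisms are ordered by their best score, strongest first, and the hits
    within each organism are ordered the same way.
    """
    by_organism = defaultdict(list)
    for organism, accession, score in hits:
        by_organism[organism].append((accession, score))

    report = []
    for organism, members in by_organism.items():
        members.sort(key=lambda member: member[1], reverse=True)
        report.append((organism, members))
    # Rank organisms by the score of their strongest hit.
    report.sort(key=lambda entry: entry[1][0][1], reverse=True)
    return report


# Hypothetical hit list: (organism, identifier, score).
hits = [
    ("Mus musculus", "hitA", 310.0),
    ("Homo sapiens", "hitB", 420.0),
    ("Mus musculus", "hitC", 395.0),
    ("Homo sapiens", "hitD", 150.0),
]
for organism, members in organism_report(hits):
    print(organism, members)
```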
    BLAST 2 Sequences - A BLAST-based tool for aligning two nucleotide or protein sequences, producing a pairwise DNA-DNA or protein-protein sequence comparison. (more...)
    IgBLAST - IgBLAST was developed to facilitate analysis of immunoglobulin sequences in GenBank. It allows blastp or blastn searches of either the nr database or a special database of Immunoglobulin (Ig) germline V (variable region) genes. Searches may be limited to either human or mouse genes. IgBLAST performs three main functions: (1) reports the variable, D, or J regions that most closely match the query sequence; (2) annotates the immunoglobulin domains (FWR1 through FWR3) according to Kabat et al.; and (3) for searches against the nucleotide nr or protein nr database, simplifies the process of identifying related sequences by matching the IgBLAST hits to the closest germline V genes. (more...)
    BLink - BLink ("BLAST Link") displays the results of BLAST searches that have been done for every protein sequence in the Entrez Proteins data domain. To access it, follow the BLink link displayed beside any hit in the results of an Entrez Proteins search. In contrast to Entrez's "Related Sequences" feature, which lists the titles of similar sequences, BLink displays the graphical output of pre-computed blastp results against the protein non-redundant (nr) database. The output includes the positions of up to 200 BLAST hits on the query sequence, scores, and alignments. (View sample BLink output for human MLH1 protein.) BLink offers a variety of display options, including the distribution of hits by taxonomic grouping, the best hit to each organism, the protein domains in the query sequence, similar sequences that have known 3-D structures, and more. Additional options allow you to specify which taxa you would like to exclude, increase or decrease the BLAST cutoff score, or filter the BLAST hits to show only those from a specific source database, such as RefSeq or Swiss-Prot.  See the BLink help document for additional information.
    BLAST E-mail server - an e-mail-based sequence similarity search service; this was discontinued on June 17, 2002 because of limited usage. Most BLAST searches are now done through the BLAST web pages.
    Network BLAST - a TCP/IP-based client-server version of WWW BLAST. Makes a direct connection with the NCBI databases over the Internet to retrieve data. No web browser is required. Client software is available for PC, Mac, and Unix on the FTP site at ftp://ftp.ncbi.nih.gov/blast/blastcl3/
    Stand-alone BLAST - download BLAST executables for local use from ftp://ftp.ncbi.nih.gov/blast/executables/. Binaries are provided for IRIX 6.2, Solaris 2.6, DEC OSF1 (ver. 4.0d), LINUX, and Win32 systems. Please read the README file in the ftp directory for more information. BLAST databases are also available for downloading. There is also some information on setting up Standalone BLAST at the NHGRI site at http://genome.nhgri.nih.gov/blastall/blast_install.

    Nucleotide Sequence Analysis

    BLAST - see sequence similarity searching, above, for a complete list of BLAST programs.
    e-PCR - Electronic PCR - compare a query sequence to a database of mapped sequence-tagged sites (STSs) to find a possible map location for the query sequence, or compare a query STS to a database of nucleotide sequences to identify the sequences that contain the STS.  e-PCR can be used on the WWW, or the software can be downloaded from the /pub/schuler/e-PCR directory of the NCBI ftp site. Additional information is provided by Schuler, G.D. There are two versions of e-PCR:
    • Forward e-PCR - Search STS database with a query sequence. Electronic PCR (e-PCR) is a computational procedure that is used to identify sequence tagged sites (STSs) within DNA sequences. e-PCR looks for potential STSs in DNA sequences by searching for subsequences that closely match the PCR primers and have the correct order, orientation, and spacing that could represent the PCR primers used to generate known STSs.
    • Reverse e-PCR - Search sequence database with STS. The main motivation for implementing reverse searching (called Reverse e-PCR) was to make it feasible to search the human genome sequence and other large genomes. The new version of e-PCR provides a search mode using a query sequence against a sequence database.
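
    The primer-matching idea described for forward e-PCR can be sketched as follows. This is a simplified illustration, not the e-PCR program: it requires exact primer matches (e-PCR looks for close matches), searches only the given strand, and uses hypothetical function names, primers, and a hypothetical size margin.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(text, pattern):
    """Yield every start index at which pattern occurs in text."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)


def epcr_hits(sequence, left_primer, right_primer, expected_size, margin=50):
    """Return (start, end, size) wherever both primers match in the right
    order and orientation and the product size is within the margin."""
    # The right primer anneals to the opposite strand, so look for its
    # reverse complement on the strand being searched.
    right_site = reverse_complement(right_primer)
    hits = []
    for left_pos in find_all(sequence, left_primer):
        for right_pos in find_all(sequence, right_site):
            end = right_pos + len(right_site)
            size = end - left_pos
            in_order = right_pos >= left_pos + len(left_primer)
            if in_order and abs(size - expected_size) <= margin:
                hits.append((left_pos, end, size))
    return hits


# Hypothetical query sequence and primer pair.
query = "TTACGTACGGGGGGGGGAGCTTAA"
print(epcr_hits(query, "ACGTAC", "AAGCTC", expected_size=20, margin=2))
```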
    Entrez Gene - as described in the Molecular Databases/Genes section of this guide, each Entrez Gene record encapsulates a wide range of information for a given gene and organism. When possible, the information includes results of analyses that have been done on the sequence data. The amount and type of information presented depend on what is available for a particular gene and organism and can include: (1) graphic summary of the genomic context, intron/exon structure, and flanking genes, (2) link to a graphic view of the mRNA sequence, which in turn shows biological features such as CDS, SNPs, etc., (3) links to gene ontology and phenotypic information, (4) links to corresponding protein sequence data and conserved domains, (5) links to related resources, such as mutation databases.
    Malaria Genetics and Genomics - provides data and information relevant to malaria genetics and genomics. Resources include organism-specific sequence BLAST databases (Plasmodium falciparum only, all Plasmodium, and all Toxoplasma). More about the Malaria genome resources is given above.
    Model Maker - allows you to view the evidence (mRNAs, ESTs, and gene predictions) that was aligned to assembled genomic sequence in order to build a gene model, and to edit the model by selecting or removing putative exons. You can then view the mRNA sequence and potential ORFs for the edited model, and save the mRNA sequence data for use in other programs. Model Maker is accessible from sequence maps that were analyzed at NCBI and displayed in Map Viewer (described above).  To see an example, follow the "mm" link beside any gene annotated on the human "Gene_Sequence" map in the Map Viewer. (More info about human data in Map Viewer is given above.)
    ORF Finder - a graphical analysis tool that finds all open reading frames of a selected minimum size in a user's sequence or in a sequence already in the database. Designed for prokaryotic sequences. Identifies all open reading frames using the standard or alternative genetic codes. The deduced amino acid sequence can be saved in various formats and searched against the sequence database using the WWW BLAST server. The ORF Finder is also packaged with the Sequin sequence submission software. The stand-alone program can be downloaded from the NCBI FTP site.
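
    A minimal sketch of an open-reading-frame scan is shown below. It is not the ORF Finder program: it assumes ORFs begin at ATG and end at a stop codon of the standard genetic code, ignores alternative genetic codes, and uses hypothetical function names. Coordinates reported for the minus strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_strand(seq, min_length):
    """Yield (frame, start, end) for ATG...stop ORFs of at least min_length nt."""
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3  # end is exclusive and includes the stop codon
                if end - start >= min_length:
                    yield frame, start, end
                start = None


def find_orfs(seq, min_length=30):
    """Return ORFs in all six frames as (strand, frame, start, end)."""
    seq = seq.upper()
    found = [("+", f, s, e) for f, s, e in orfs_in_strand(seq, min_length)]
    minus = reverse_complement(seq)
    found += [("-", f, s, e) for f, s, e in orfs_in_strand(minus, min_length)]
    return found


# Hypothetical short sequence with one ORF in the third forward frame.
print(find_orfs("CCATGGCTGCTGCTTAAGG", min_length=12))
```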
    ProtEST - a tool that presents a graphical view of matches between nucleotide sequences in UniGene and possible translational products. To generate the alignments, the 6-frame translations of mRNA and EST sequences in UniGene are compared to protein sequences using BLASTX with -e 1e-6. The translated nucleotide sequences are compared with proteins from a number of model organisms and the best match in each organism is recorded. ProtEST links are displayed in UniGene (description) reports in the section on model organism protein similarities.
    PASC (PAirwise Sequence Comparison) - a web tool for analyzing the distribution of pairwise identities within viral families. Identities are pre-computed for every pair of genomes within each family, and their distribution is plotted as a histogram in which each bar corresponds to an interval of identities. Only complete genomes should be used as query sequences; results from partial sequences are not suitable for the purpose of this tool. After you submit your sequence, PASC computes pairwise identities between the submitted genome and the existing genome sequences of the family. At the end of the process, you are presented with a list of the 15 closest matches to the submitted genome within the family. The documentation provides more details about using PASC.
    Retrovirus Resources - A collection of resources specifically designed to support research on retroviruses. Resources include a genotyping tool that uses the BLAST algorithm to identify the genotype of a query sequence; an alignment tool for global alignment of multiple sequences; an HIV-1 automatic sequence annotation tool; and annotated maps of 16 retroviruses viewable in GenBank, FASTA, and graphic formats, with links to associated sequence records.
    SAGEmap - SAGEmap provides a tool for performing statistical tests designed specifically for differential-type analyses of SAGE (Serial Analysis of Gene Expression) data. The data include SAGE libraries generated by individual labs as well as those generated by the Cancer Genome Anatomy Project (CGAP, described above), which have been submitted to Gene Expression Omnibus (GEO, described above). Gene expression profiles that compare the expression in different SAGE libraries are also available on the Entrez GEO Profiles pages. It is possible to enter a query sequence in the SAGEmap resource to determine what SAGE tags are in the sequence, then map to associated SAGEtag records and view the expression of those tags in different CGAP SAGE libraries.
    Spidey - mRNA-to-genomic alignment program that was designed to find good alignments regardless of intron size, and to avoid getting confused by nearby pseudogenes and paralogs. It uses a combination of alignment algorithms and heuristics to construct its models. Spidey has been optimized for both intraspecies and interspecies alignments. (See Spidey documentation for more information.)
    Splign - a utility for computing cDNA-to-Genomic, or spliced sequence alignments. It is based on a variation of the Needleman–Wunsch global alignment algorithm and specifically accounts for introns and splice signals. It is due to this algorithm that Splign is accurate in determining splice sites and tolerant to sequencing errors. Splign also uses BLAST hits to identify possible locations of genes and their duplications on genomic sequences and to speed up the core dynamic programming. (See Splign documentation for more information.)
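
    For readers unfamiliar with the Needleman-Wunsch algorithm mentioned in the Splign entry, the sketch below computes the score of a plain textbook global alignment. It is not Splign's variation: it has no notion of introns or splice signals, and the scoring values and function name are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of a and b (linear gap cost)."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            ))
        previous = current
    return previous[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

    Only two rows of the dynamic programming table are kept, which is enough when just the score is needed; recovering the alignment itself would require storing the full table or traceback pointers.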
    UniGene DDD - Digital Differential Display - an online tool to compare computed gene expression profiles between selected cDNA libraries. Using a statistical test, genes whose expression levels differ significantly from one tissue to the next are identified and shown to the user. Additional information about UniGene is in the Molecular Databases/Genes section.
    VecScreen - a tool for identifying segments of a nucleic acid sequence that may be of vector, linker or adapter origin prior to sequence analysis or submission. VecScreen was developed to combat the problem of vector contamination in public sequence databases. It is also useful to run a new sequence through VecScreen before performing any kind of analysis on the sequence, since the presence of vector sequences can lead to misleading BLAST hits, etc. VecScreen compares a query sequence against the UniVec database, described above.

    Protein Sequence Analysis and Proteomics

    BLAST - see sequence similarity searching, above, for a complete list of BLAST programs.
    BLink - BLink ("BLAST Link") displays the results of BLAST searches that have been done for every protein sequence in the Entrez Proteins data domain. To access it, follow the BLink link displayed beside any hit in the results of an Entrez Proteins search. In contrast to Entrez's "Related Sequences" feature, which lists the titles of similar sequences, BLink displays the graphical output of pre-computed blastp results against the protein non-redundant (nr) database. The output includes the positions of up to 200 BLAST hits on the query sequence, scores, and alignments. (View sample BLink output for human MLH1 protein.) BLink offers a variety of display options, including the distribution of hits by taxonomic grouping, the best hit to each organism, the protein domains in the query sequence, similar sequences that have known 3-D structures, and more. Additional options allow you to specify which taxa you would like to exclude, increase or decrease the BLAST cutoff score, or filter the BLAST hits to show only those from a specific source database, such as RefSeq or Swiss-Prot.  See the BLink help document for additional information.
    CD-Search - The Conserved Domain Search Service (CD-Search) can be used to identify the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (described above) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD) (described above). Hits can be displayed as a pairwise alignment of the query sequence with a representative domain sequence, or as a multiple alignment. Alignments are also mapped to known 3-dimensional structures, and can be displayed using Cn3D (described above). In the Cn3D display, residues in sequence alignments are variously colored, based on their degree of conservation. (more...)
    COGnitor - compare your sequence to the COGs database (described above) to identify the cluster of orthologous groups to which it belongs. A stand-alone dignitor program is also available. It runs cognitor in batch mode, comparing a large group of proteins to the COGs database, and can be downloaded from the ftp site.
    Conserved Domain Architecture Retrieval Tool (CDART) - When given a protein query sequence, CDART displays the functional domains that make up the protein and lists proteins with similar domain architectures. The functional domains for a sequence are found by comparing the protein sequence to a database of conserved domain alignments, CDD (described above), using RPS-BLAST (described above).
    Open Mass Spectrometry Search Algorithm (OMSSA) - a public search service that allows proteomics researchers to submit the mass spectra of peptides and proteins for identification. OMSSA then compares these mass spectra to theoretical ions generated from databases of known protein sequences and then ranks the results using a score derived from classical hypothesis testing. References available from the OMSSA home page describe the OMSSA algorithm and its validation.
    ProtEST - a tool that presents a graphical view of matches between nucleotide sequences in UniGene and possible translational products. To generate the alignments, the 6-frame translations of mRNA and EST sequences in UniGene are compared to protein sequences using BLASTX with -e 1e-6. The translated nucleotide sequences are compared with proteins from a number of model organisms and the best match in each organism is recorded. ProtEST links are displayed in UniGene (description) reports in the section on model organism protein similarities.
    TaxPlot - a tool for 3-way comparisons of genomes on the basis of the protein sequences they encode. To use TaxPlot, one selects a reference genome to which two other genomes are compared. Pre-computed BLAST results are then used to plot a point for each predicted protein in the reference genome, based on the best alignment with proteins in each of the two genomes being compared.
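
    As a rough illustration of the 6-frame translation step mentioned in the ProtEST entry above, the following Python sketch translates a nucleotide sequence in three forward and three reverse-complement frames. It assumes the standard genetic code and is not NCBI's implementation; all function names are hypothetical.

```python
from itertools import product

# Standard genetic code (an assumption; other translation tables exist),
# built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1..+3, -1..-3) to their translations."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

    Each of the six resulting protein strings could then be compared against a protein database; the comparison itself is outside the scope of this sketch.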

    3-D Structure Display and Similarity Searching

    Cn3D - "See in 3-D," a structure and sequence alignment viewer for NCBI databases. It allows viewing of 3-D structures and sequence-structure or structure-structure alignments. Cn3D can work as a helper application to your browser, or as a client-server application that retrieves structure records from MMDB (described above) directly over the internet. The Cn3D home page provides access to information on how to install the program, a tutorial to get started, and a comprehensive help document.
    VAST - Vector Alignment Search Tool - a computer algorithm developed at NCBI and used to identify similar protein 3-dimensional structures. The "structure neighbors" for every structure in MMDB are pre-computed and accessible via links on the MMDB Structure Summary pages. These neighbors can be used to identify distant homologs that cannot be recognized by sequence comparison alone.
    VAST Search - a structure-structure similarity search service. Compares 3D coordinates of a newly determined protein structure to those in the MMDB/PDB database. VAST Search computes a list of structure neighbors that you may browse interactively, viewing superpositions and alignments by molecular graphics.
    CD-Search - The Conserved Domain Search Service (CD-Search) can be used to identify the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (described above) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD) (described above). Hits can be displayed as a pairwise alignment of the query sequence with a representative domain sequence, or as a multiple alignment. Alignments are also mapped to known 3-dimensional structures, and can be displayed using Cn3D (described above). In the Cn3D display, residues in sequence alignments are variously colored, based on their degree of conservation.
    Threading - As part of NCBI's Computational Biology Branch (described above), the Structure group, led by Dr. Steve Bryant, conducts research in protein threading. Protein threading predicts the three-dimensional structure of a protein sequence by threading it through known structures and calculating its energy. The experimental software developed by the NCBI Structure group is available on the FTP site. A readme file provides more information as well as references.

    Genome Analysis Tools

    Entrez Genome - whole genomes of over 1000 organisms. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life - bacteria, archaea, and eukaryota - are represented, as well as many viruses, phages, viroids, plasmids, and organelles. Entrez Genome provides graphical overviews of complete genomes/chromosomes, and the ability to explore regions of interest in progressively greater detail. ProtTables and TaxTables are provided for organisms on which analyses have been done by NCBI staff.
    Map Viewer - shows integrated views of chromosome maps for many organisms. Used to view the NCBI assembly of complete genomes, including human, Map Viewer is a valuable tool for the identification and localization of genes, particularly those that contribute to diseases. Additional information about Map Viewer is provided in the Genomes and Maps section of this guide.
    SKY/M-FISH & CGH Database - The NCI and NCBI SKY/M-FISH and CGH Database is a repository of publicly submitted data from Spectral Karyotyping (SKY), Multiplex Fluorescence In Situ Hybridization (M-FISH), and Comparative Genomic Hybridization (CGH), which are complementary fluorescent molecular cytogenetic techniques. SKY/M-FISH permits the simultaneous visualization of each human or mouse chromosome in a different color, facilitating the identification of chromosomal aberrations; CGH can be used to generate a map of DNA copy number changes in tumor genomes. Collaborative project with the National Cancer Institute.  (data submission instructions...)
    PASC (PAirwise Sequence Comparison) - a web tool for analysis of pairwise identity distribution within viral families. The identities are pre-computed for every pair of genomes within each family, and their distribution is plotted as a histogram in which each bar corresponds to an interval of identities. Only complete genomes should be used as query sequences; results from partial sequences are not suitable for the purpose of this tool. After you submit your sequence, PASC computes pairwise identities between the submitted genome and the existing genome sequences of the family. At the end of the process, you are presented with a list of the 15 closest matches to the genome within the family. The documentation provides more details about using PASC.
    Retrovirus Resources - A collection of resources specifically designed to support the research of retroviruses. Resources include a genotyping tool that uses the BLAST algorithm to identify the genotype of a query sequence; an alignment tool for global alignment of multiple sequences; an HIV-1 automatic sequence annotation tool; and annotated maps of 16 retroviruses viewable in GenBank, FASTA, and graphic formats, with links to associated sequence records.
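
    The pairwise-identity histogram described in the PASC entry above can be sketched in Python as follows. This is only an illustration under stated assumptions: the sequences are taken to be already aligned to equal length (this guide does not describe how PASC aligns genomes), identity is defined here as the percentage of matching alignment columns, and the function names and bin width are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent of aligned columns with identical residues (gap-gap columns skipped)."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def identity_histogram(aligned, bin_width=10):
    """Count pairwise identities per interval; bin i covers [i*w, (i+1)*w)."""
    n_bins = -(-100 // bin_width)  # ceiling division
    counts = [0] * n_bins
    for (_, a), (_, b) in combinations(aligned.items(), 2):
        pid = percent_identity(a, b)
        index = min(int(pid // bin_width), n_bins - 1)  # 100% goes in the last bin
        counts[index] += 1
    return counts


def closest_matches(query, aligned, top=15):
    """Return up to `top` (identity, name) pairs, most similar first."""
    scored = [(percent_identity(query, seq), name) for name, seq in aligned.items()]
    return sorted(scored, reverse=True)[:top]
```

    Here `aligned` is a dictionary mapping genome names to aligned sequences; `closest_matches` mirrors the list of 15 closest matches mentioned above, using the same identity measure.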

    Gene Expression Tools

    Gene Expression Omnibus (GEO) - provides several tools to assist with the visualization and exploration of GEO data. Datasets may be viewed as hierarchical cluster heat maps, providing insight into the relationships between samples and co-regulated genes. Individual gene expression profiles showing significant differences between experimental subsets may be located using average subset rank value comparisons. Related gene expression profiles may be identified on the basis of sequence similarity, profile similarity, or homology. Indicators of dataset normalization quality are provided as distribution graphs, and by flagging outliers. Links to other NCBI sequence, mapping and publication database resources are provided where possible. (More information about GEO is provided in the Molecular Databases/Gene Expression section of this file.)
    SAGEmap - SAGEmap provides a tool for performing statistical tests designed specifically for differential-type analyses of SAGE (Serial Analysis of Gene Expression) data. The data include SAGE libraries generated by individual labs as well as those generated by the Cancer Genome Anatomy Project (CGAP, described above), which have been submitted to Gene Expression Omnibus (GEO, described above). Gene expression profiles that compare the expression in different SAGE libraries are available on the Entrez GEO Profiles pages. It is also possible to enter a query sequence in the SAGEmap resource to determine what SAGE tags are in the sequence, then map to associated SAGEtag records and view the expression of those tags in different CGAP SAGE libraries. (More information about SAGEmap is provided in the Molecular Databases/Gene Expression section of this file.)
    Cancer Genome Anatomy Project (CGAP) - an interdisciplinary program to identify the human genes expressed in different cancerous states, based on cDNA (EST) libraries, and to determine the molecular profiles of normal, precancerous, and malignant cells. CGAP is a collaboration among the National Cancer Institute, the NCBI, and numerous research labs. (Related resources are listed under human genome/cancer research.) The following tools are provided by the National Cancer Institute (NCI) through their CGAP web page:
    UniGene DDD - Digital Differential Display - an online tool to compare computed gene expression profiles between selected cDNA libraries. Using a statistical test, genes whose expression levels differ significantly from one tissue to the next are identified and shown to the user. Additional information about UniGene is in the Molecular Databases/Genes section.
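
    The UniGene DDD entry above refers to "a statistical test" without naming it. As a hedged illustration of how counts from two cDNA libraries might be compared, the Python sketch below assumes a two-sided Fisher's exact test on a 2x2 table (sequences assigned to the gene versus all other sequences, in each library). The choice of test, the function name, and the example counts are assumptions, not a description of the tool's internals.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    a: sequences for the gene in library 1, b: all other sequences in library 1,
    c: sequences for the gene in library 2, d: all other sequences in library 2.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
```

    For example, `fisher_exact_two_sided(30, 9970, 5, 9995)` would test whether a gene seen 30 times among 10,000 sequences in one library and 5 times among 10,000 in another differs more than expected by chance; these counts are made up.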

    Research at NCBI Overview

    Computational Biology Branch Home Page - Overview of the research program in the Computational Biology Branch (CBB) of NCBI and a list of Senior Investigators. The research programs focus on theoretical, analytical, and applied approaches to a broad range of fundamental problems in molecular biology, including biomolecular structures, genome analysis, theory of sequence analysis, hardware design, software and database design, and text retrieval and document analysis.
    Senior Investigators in PubMed - publications written by senior investigators in the NCBI Computational Biology Branch and represented in the PubMed database. The PubMed records include links to publisher web sites and/or full text articles when available.
    Seminar Schedule - Seminars held at NCBI on a wide range of molecular biology and mathematical topics. These seminars are open to the NIH community and the general public, and are presented by NCBI staff as well as visiting scientists.
    Postdoctoral Fellowships - general information, application procedure

    Software Engineering Overview

    Information Engineering Branch Home Page - Overview of the functions of the Information Engineering Branch (IEB) of NCBI, which is responsible for designing and building NCBI's production software and databases.
    NCBI ToolBox - Supported software tools from IEB. Describes the three components of the ToolBox: data model, data encoding, and programming libraries. Provides access to documentation for the data model, C toolkit, C++ toolkit, NCBI Toolkit Source Browser, XML demo program, XML DTDs, and the FTP site. Additional information about the FTP site is provided below.
    R&D Projects - The IEB Research and Development Area is a place for IEB projects and datasets which may never become fully supported NCBI resources. This includes early prototypes of software, results of early or one-off analyses, tools that are fully functional but not integrated into the main, public NCBI systems, or datasets that may have some value but do not fit well into the main NCBI pages.
    ASN.1 - The software in the NCBI ToolBox is primarily designed to read Abstract Syntax Notation 1 (ASN.1) format records, an International Organization for Standardization (ISO) data representation format. The readme files in the toolbox and toolbox/ncbi_tools directories of the FTP site contain more information about the toolbox and ASN.1. An ASN.1 summary is also available. The ToolBox can produce data as either ASN.1, as before, or as XML (more about XML). Additional information about the ToolBox, documentation, and demo programs are available on the NCBI ToolBox page.

    Education Overview

    News,   Books,   Glossaries,   Tutorials,   Courses,   Additional Resources
     

    News - keeping up with the changes at NCBI

    NCBI News - announcements about new resources, enhancements to existing resources, staff publications, tutorials, FAQs
    What's New - recently released resources and enhancements to existing resources
    NCBI Announcements Email Lists - Receive announcements about changes and updates to a variety of NCBI services. In addition to a general NCBI-announce list, topic-specific e-mail lists are available for BLAST, GenBank, dbSNP, Genomes, LinkOut, RefSeq, Sequin, and Entrez Utilities (for making WWW Links to Entrez). Information on how to subscribe is provided.

    Books

    Coffee Break - a collection of short reports on recent biological discoveries. Each report incorporates interactive tutorials that show how bioinformatics tools are used as a part of the research process.
    Genes and Disease - introduction to the relationship between genetic factors and human disease. Summary information for ~60 genetic diseases with links to related databases and organizations.
    NCBI Handbook - an online book, written by NCBI staff, that discusses the many resources available at NCBI. Each chapter is devoted to one service; after a brief overview on using the resource, there is an account of how the resource works, including topics such as how data are included in a database, database design, query processing, and how the different resources relate to each other.
    Entrez Books - In collaboration with book publishers, the NCBI is adapting textbooks for the web and linking them to PubMed, the biomedical bibliographic database. The idea is to provide background information to PubMed, so that users can explore unfamiliar concepts found in PubMed search results.

    Glossaries

    NCBI Handbook Glossary - part of the NCBI Handbook, described above. Includes a variety of terms pertaining to biological data and bioinformatics.
    BLAST Tutorial Glossary of Terms - includes terms pertaining to BLAST sequence similarity searching.
    FieldGuide Glossary - developed for the Field Guide course described below.
    Genome Glossary - commonly used genome terms; includes links to associated literature for each term.
    NHGRI Talking Glossary of Genetic Terms - by the National Human Genome Research Institute (NHGRI).

    Tutorials

    Science Primer - The science behind our resources. An introduction for researchers, educators and the public. Provides plain-language introductions to bioinformatics, genome mapping, molecular modeling, SNPs, ESTs, microarray technology, molecular genetics, pharmacogenomics, and phylogenetics.
    PubMed Tutorial - comprehensive instruction on using PubMed's various features
    Entrez Tutorial - shows users how to make use of the full power of the Entrez data retrieval system. Using a human gene as an example, it demonstrates the variety of information that can be gathered for a single gene across a number of Entrez databases.
    BLAST tutorials for new and veteran users
    • Query Tutorial - formulating a BLAST query; entering sequence data; beginners welcome!
    • BLAST Tutorial - setting up a protein query; parameters - how and why; interpreting BLAST output
    • BLAST Guide - printable; setting up a query; deciphering results; post-BLAST analysis
    • PSI-BLAST Tutorial - when to use PSI-BLAST; understanding iterations; interpreting PSI-BLAST output
    • More Information - principles of similarity searching; rules of thumb; glossary of terms; references.
    BLAST Statistics
    3-D Protein Structure Tutorial: Cn3D structure viewing program
    Map Viewer Exercises - a chapter within the NCBI Handbook (described above).
    Coffee Break - a collection of short reports on recent biological discoveries. Each report incorporates interactive tutorials that show how bioinformatics tools are used as a part of the research process.

    Courses

    Field Guide to GenBank and NCBI Resources - three-hour lecture plus two-hour optional hands-on computer lab designed for end users with a science background. Presented at universities across the United States as well as on-site at NLM. Companion courses are also available and provide more detailed coverage on the following topics:
    • Exploring 3D Molecular Structures Using NCBI Tools - a companion to the Field Guide that focuses specifically on effectively using the NCBI databases, search services, and analysis tools to mine 3D macromolecular structure data. The course includes a 60-minute lecture plus a 90-minute computer workshop. It is offered by NCBI staff at various universities throughout the country and can be requested along with a Field Guide or separately.
    Medical Library Association approved courses - designed for library staff (including librarians and scientists employed by libraries) who are establishing educational workshops and end-user support services in the use of molecular biology databases, retrieval systems, and analytical tools.
    • Introduction to Molecular Biology Information Resources - a three-day Medical Library Association (MLA) CE Course designed for librarians who have little or no experience with molecular biology databases and search systems, and who handle occasional questions about those resources at the reference desk. Course format combines lecture, demonstration, and hands-on experience and is approved by MLA for 20 CE hours.
    • NCBI Advanced Workshop for Bioinformatics Information Specialists - a five-day workshop designed for library staff with a science background who have full-time bioinformatics support positions. This includes bioinformatics librarians as well as scientists who have been hired by libraries to establish training and user support programs. Applicants must already have some experience with molecular biology databases and software programs. Course format combines lecture, demonstration, and hands-on experience and is approved by the Medical Library Association (MLA) for 40 CE hours.
    Mini-Courses - NCBI bioinformatics mini-courses are either problem-based, such as "Identification of Disease Genes", or NCBI resource-based, such as "BLAST Quick Start". The courses are 2 hours long, with the first hour devoted to an overview, followed by a one-hour hands-on session. The courses are free and, with the exception of those sessions offered at NIH's CIT, are open to anyone who would like to attend.

    Principles of PubChem - a course including lectures and computer workshops on effectively using the NCBI PubChem system: a collection of databases, search services, and analysis tools that focus on small chemicals and their biological activities.

    Power Tools - 3-day workshops for bioinformatics and information specialists who are interested in accessing NCBI data in an automated fashion. Each course is offered quarterly at the NCBI Training Center, NIH Building 38A.
    Getting Started with LinkOut - LinkOut is a feature of PubMed that provides users with links from PubMed and other Entrez databases to a wide variety of relevant web-accessible online resources, including full-text publications, biological databases, consumer health information, research tools, and more. The goal is to facilitate access to relevant online resources beyond the Entrez system to extend, clarify, or supplement information found in the Entrez databases. This hands-on class is designed to introduce students to LinkOut and provide step-by-step instruction on activating LinkOut for print and electronic journal collections, allowing users to see their own library's holdings and access electronic full-text through the PubMed interface. Topics covered are registration for LinkOut, entering holdings, displaying a library's icon for "branding" purposes, and access to free full-text through LinkOut. Getting Started with LinkOut is a free class and is awarded 4 MLA continuing education credits. For more information and to register, visit the NLM's National Training Center and Clearinghouse (NTCC) website: http://nnlm.gov/mar/online/. Questions about the class can be sent to lib-linkout@ncbi.nlm.nih.gov

    Additional Resources

    Cancer Information - a wide range of accurate, credible cancer information brought to you by the National Cancer Institute (NCI). CancerNet information is reviewed regularly by oncology experts and is based on the latest research. It includes information selected and organized for patients, health professionals, and basic researchers.
    Human Genome Project - an international research effort to characterize the genomes of human and selected model organisms through complete mapping and sequencing of their DNA; to develop technologies for genomic analysis; to examine the ethical, legal, and social implications of human genetics research; and to train scientists who will be able to utilize the tools and resources developed through the HGP to pursue biological studies that will improve human health. This link leads to the information provided on the National Human Genome Research Institute (NHGRI) web site.
    NHGRI Educational Resources - the National Human Genome Research Institute (NHGRI) provides a range of educational resources, including glossaries, fact sheets, multimedia educational kits, genetic education modules for use by teachers, and a variety of online materials.
    NIH Office of Science Education - offers a wide variety of educational resources for students at various grade levels, teachers, and the general public. Resources cover a wide range of topics, including Genetics, and formats of educational materials range from lesson plans and curricula to multimedia, online materials, and more. Website also includes a section on career exploration.

    FTP Site Overview

    Download Databases

    BLAST databases - a collection of databases formatted for use with the BLAST software. A readme file provides database descriptions.
    GenBank and Daily Updates
    • ASN.1 format - Abstract Syntax Notation 1, an International Organization for Standardization (ISO) data representation format; download the most recent full release (described above) and daily cumulative or non-cumulative update files. (more on ASN.1)
    • FASTA format - definition line followed by sequence data only (example). The FASTA formatted data are available in the BLAST databases directory of the FTP site. A readme file in that directory provides descriptions of the available data sets, such as nt.Z (daily updated non-redundant BLAST nucleotide database, contains GenBank+EMBL+DDBJ+PDB sequences, but no EST, STS, GSS, or HTGS sequences), nr.Z (daily updated non-redundant proteins), est.Z, gss.Z, htg.Z, sts.Z, and others.
    RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, mRNAs and proteins for gene models, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits, for example:  NT_123456, NM_123456, NP_123456, NC_123456, NG_123456, XM_123456, XR_123456, XP_123456 (more info about accession numbers and access).
    Entrez Gene - a collection of files from the Entrez Gene database, which is described in the Molecular Databases/Genes section of this guide.
    dbSNP - database of single nucleotide polymorphisms, small-scale insertions/deletions, polymorphic repetitive elements, and microsatellite variation
    Taxonomy - data from the NCBI Taxonomy database (described above). Includes a UNIX compressed tar file called "taxdump.tar.Z" that is updated daily and contains a dump of the taxonomy information from SyBase. Note that the *.dmp files are not human-friendly files, but can be uploaded into SyBase with the BCP facility. When you uncompress and untar the file, you will see several files, including a Readme file that contains more information.
    Repository of databases - This FTP directory contains a mix of NCBI databases (e.g., UniGene, GeneMap, dbEST, dbGSS, dbSTS, OMIM) and a number of externally developed databases (e.g., EPD, TFD). The external databases are made available on the FTP site as a service to the scientific community. They are contributed by outside scientists and maintained independently of NCBI. All the files in the FTP directory of a non-NCBI database are placed there and maintained by the developers of that database. Questions about non-NCBI databases should be directed to the contacts listed in the readme or other background files for the individual databases. Note that additional NCBI databases are also found in the root directory of the FTP site (under the database name, such as GenBank, Gene, RefSeq), or in the "pub" directory (usually under the name of the primary resource developer).
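
    The RefSeq accession format described above (two letters, an underscore, and six digits) can be checked with a short Python sketch like the one below. It follows the format exactly as stated in this guide and accepts only the prefixes listed in the RefSeq entry; accessions written differently (for example, with a version suffix) would need a looser pattern. The function name is hypothetical.

```python
import re

# Two uppercase letters, an underscore, and six digits, as described in this guide.
ACCESSION_PATTERN = re.compile(r"([A-Z]{2})_(\d{6})")

# Prefixes taken from the examples in the RefSeq entry above.
LISTED_PREFIXES = {"NT", "NM", "NP", "NC", "NG", "XM", "XR", "XP"}


def parse_accession(text):
    """Return (prefix, number) for a matching accession, or None otherwise."""
    match = ACCESSION_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    prefix, digits = match.groups()
    if prefix not in LISTED_PREFIXES:
        return None
    return prefix, int(digits)
```

    A helper of this kind could be used to sort downloaded records by prefix before further processing.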

    Download Genomes

    Human Genome Project Data - the ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/ directory contains one folder for each chromosome, which includes genomic contigs (NT_* records) built from finished and unfinished sequence data. The contigs are available in various formats, described below. The contig assembly and annotation process is described in a separate document.
    • hs_chr*.asn - ASN.1 format (description above)
    • hs_chr*.fa.gz - FASTA format (description above)
    • hs_chr*.gbk.gz - GenBank flat file format (annotations currently include STS markers; known and predicted genes will be added in coming months)
    • hs_chr*.gbs - GenBank summary format (this format does not contain sequence data, but instead contains a "CONTIG" field, showing how the contig is assembled from individual GenBank accessions)
    • hs_chr*.mfa.gz - masked FASTA format (masked nucleotides are lower case)
    Data from the Map Viewer (described above) are available in the ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview/ subdirectory.
    Other Genomes - such as bacteria, nematode, mouse, and others can be downloaded from one of two directories:
    Note: In some cases, an organism might be listed in both directories. This can happen for several reasons: (1) two versions of the genome are available - one in GenBank, and one in RefSeq; or (2) the organism's data was assembled at NCBI and was available from the "/genbank/genomes/" directory before the new "/genomes/" directory was set up. In the latter case, the data now exists in the new "/genomes/" directory, but a symbolic link was preserved in the original directory to facilitate user access.
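
    The FASTA and masked FASTA files listed above (a definition line followed by sequence data, with masked nucleotides in lower case) can be read with a few lines of Python. The sketch below is a minimal reader, not an NCBI tool; the function names are hypothetical, and it assumes the files are plain text or gzip-compressed as the ".gz" suffixes suggest.

```python
import gzip


def read_fasta(handle):
    """Yield (definition_line, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def masked_fraction(sequence):
    """Fraction of residues written in lower case (the masking convention above)."""
    if not sequence:
        return 0.0
    return sum(1 for ch in sequence if ch.islower()) / len(sequence)


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")
```

    For a downloaded file, `read_fasta(open_fasta(path))` yields one record per contig, and `masked_fraction` reports how much of each sequence is masked.
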
    Download Software
    BLAST Programs
    • BLAST Web Server Program - allows you to set up your own in-house version of the NCBI BLAST web pages on a UNIX web server. You can set up the program to search your own custom databases or downloaded copies of the NCBI databases. This server is not intended to handle the large loads which may exist in public service settings. A readme file provides more information.
    • Network BLAST - a TCP/IP-based client-server version of WWW Gapped BLAST (2.0). Makes a direct connection with the NCBI databases over the Internet to retrieve data. Client software is available for PC, Mac, and Unix. For general information about Gapped BLAST, see above.
    NOTE: Preformatted BLAST databases are also available for downloading, in addition to the software listed above. A readme file provides database descriptions.
    Client/server programs
    • Sequin - submission software program for one or many submissions, long sequences, complete genomes, alignments, population/phylogenetic/mutation studies. Can be used as a stand-alone application or in a TCP/IP-based "network aware" mode, with links to other NCBI resources and software such as Entrez.
    • Network Entrez - a TCP/IP-based client-server version of WWW Entrez. Makes a direct connection with the NCBI databases over the Internet to retrieve data. Client software is available for PC, Mac, and Unix. For general information about Entrez, see above.
    Cn3D - "See in 3-D," a structure and sequence alignment viewer for NCBI databases. It allows viewing of 3-D structures and sequence-structure or structure-structure alignments. Cn3D can work as a helper application to your browser, or as a client-server application that retrieves structure records from MMDB (described above) directly over the internet. The Cn3D home page provides access to information on how to install the program, a tutorial to get started, and a comprehensive help document.
    NCBI Software ToolBox - set of software and data exchange specifications used by NCBI to produce portable, modular software for molecular biology. The software in the ToolBox is primarily designed to read Abstract Syntax Notation 1 (ASN.1) format records, an International Organization for Standardization (ISO) data representation format. The software is available to the public in the toolbox/ncbi_tools directory of NCBI's FTP site, and can be used in its own right or as a foundation for building tools with similar properties. The readme files in the toolbox and toolbox/ncbi_tools directories of the FTP site contain more information about the toolbox and ASN.1. An ASN.1 summary is also available. The ToolBox can produce data as either ASN.1, as before, or as XML (more about XML). Additional information about the ToolBox, documentation, and demo programs are available on the NCBI ToolBox page. Additional information about the Information Engineering Branch (IEB) of NCBI, which develops the ToolBox, is provided above, along with other items of interest to software developers.
    Software programs developed as personal projects by various NCBI scientists - the /pub directory of the FTP site contains programs such as MACAW (multiple sequence alignments) and e-PCR (description above).


    Revised September 5, 2007
    Questions about NCBI resources to  info@ncbi.nlm.nih.gov
    Comments about resource guide to Renata Geer  renata@ncbi.nlm.nih.gov
