NCBI Virus: Test drive our new SARS-CoV-2 interactive data dashboard!

Are you looking for SARS-CoV-2 sequence data? Look no further! The NCBI Virus SARS-CoV-2 Data Hub now has an interactive data dashboard (Figure 1) that shows the collection location (country and US state), the date of collection, and the date of public availability for SARS-CoV-2 sequence data. You can view available nucleotide and protein sequences based on criteria you select and send these to a data table. You can further filter by normalized source information, including sequence length, protein content, host, and anatomical isolation source. The sequence records link to related SRA records and to publications in PubMed when available. You can download the data as FASTA-formatted sequences with customizable titles, as accession lists, or as a table that includes data descriptors. See the help documentation for more details.
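If you download FASTA-formatted sequences, a few lines of Python are enough to take a first look at them. The sketch below is only an illustration, not an NCBI tool: the function name `parse_fasta`, the accessions, and the `accession|host|country` title layout are all hypothetical, since the titles you get depend on how you customize them at download time.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA title (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    title = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            title = line[1:]
            records[title] = ""
        elif title is not None:
            # Sequence lines may be wrapped, so join them per record.
            records[title] += line
    return records


# Made-up records in an assumed "accession|host|country" title layout.
example = """>HYPO_0001|Homo sapiens|USA
ATGCATGC
ATGC
>HYPO_0002|Homo sapiens|Japan
GGGTTT
""".splitlines()

records = parse_fasta(example)
# Sequence length per accession (first '|'-separated field of the title).
lengths = {title.split("|")[0]: len(seq) for title, seq in records.items()}
print(lengths)
```

For a real download, you would pass an open file handle to `parse_fasta` instead of the inline example.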

The sequences in NCBI Virus were submitted to members of the International Nucleotide Sequence Database Collaboration (INSDC): GenBank, EMBL, and DDBJ. This collaborative effort ensures that data is freely available to the scientific and public health communities, where it can be used to understand the biology, evolution, and spread of SARS-CoV-2.

Figure 1. The NCBI Virus SARS-CoV-2 Data Hub Dashboard. You can narrow down sequence data using collection location, collection date, or the public release date. After making your selections, click “View results, Analyze, or Download” near the top of the page to see your dataset in the results table, which shows nucleotide, protein, and RefSeq sequences as well as associated metadata.


Read assembly and Annotation Pipeline Tool (RAPT) is available for use and testing

We are excited to launch a beta version of RAPT, the Read assembly and Annotation Pipeline Tool, a one-step application for the genome assembly and gene annotation of archaeal and bacterial isolates. Start from an Illumina run in SRA or on your local machine and get a fully annotated genome!

A RAPT Docker container includes SKESA, a high-accuracy assembler for short reads; PGAP, the annotation pipeline written in the Common Workflow Language (CWL) and used by RefSeq; and cwltool, the reference implementation for CWL. A RAPT release also includes a set of reference data that are critical for a quality annotation. RAPT can be executed with Docker, Singularity, or podman on any local or remote machine meeting basic requirements. For users of the Google Cloud Platform, RAPT can be launched from the Google Cloud Shell without configuring a virtual machine in advance.

To learn more about RAPT, register for our upcoming webinar.

Questions? Interested in becoming a beta tester? Contact us!

RAPT is available here.

New Columns added to the web BLAST Descriptions Table

In response to your requests, we have added new columns to the Descriptions Table for the web BLAST output. The new columns are Scientific Name, Common Name, Taxid, and Accession Length. Common Name and Accession Length are now part of the default display. You can click ‘Select columns’ or ‘Manage columns’ to add or remove columns from the display (Figure 1). Your preferences will be saved for your next visit to BLAST, and when you download your results, whatever columns you have displayed will be saved.

Figure 1. The web BLAST Descriptions Table with all possible columns. You can remove columns through the ‘Manage columns’ menu. If you are not displaying any non-default columns, the same menu is titled ‘Select columns’ and you can use it to add them.

Customize columns in NCBI’s Multiple Sequence Alignment Viewer

We’re excited to report that researchers using the NCBI Multiple Sequence Alignment Viewer (MSAV) can now add or remove columns from the alignment view. In this way, you can choose to show only columns with data relevant for analysis of the sequences in your alignment.

When you arrive at an MSA alignment view, you’ll see columns for the Sequence ID (e.g., sequence accession number), Start and End of the alignment, and the organism (species name).

Sometimes, the information in these default columns isn’t the most useful information for sorting through the alignment. In the example above, all the sequences are from the same organism, so looking at the Organism column won’t help in figuring out the differences among the different sequences in the alignment.


Search NCBI’s Pathogen Detection websites with simple keywords

We’ve redesigned the filters on NCBI’s Pathogen Detection websites to make searching easier!

For example, say you wanted to search for outbreak isolates related to flour. Before the filters were redesigned, you’d have to know that some of the available metadata terms include “flour”, “All-purpose Wheat flour”, and “wheat flour”, along with seventeen other terms. Now, you can see all of your available options after typing in your search term and select only those that are relevant to your search.

Figure 1. Isolates Browser. Use the “Filters” button to see and search all filters.


BLAST+ 2.11.0 now available with limited usage reporting to help improve BLAST

The BLAST+ 2.11.0 release is now available from our FTP site. With this release, BLAST+ provides usage reports to NCBI to help us improve BLAST. This information is limited to the name of the BLAST program, some basic database metadata, a few BLAST parameters, as well as the number and total size of your queries (Figure 1).

Figure 1. An example of the report sent back to NCBI from the 2.11.0 BLAST programs.
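If you launch BLAST+ from a script and would rather not send usage reports, the sketch below shows one way to do that from Python. It relies on the `BLAST_USAGE_REPORT` environment variable that the BLAST+ documentation describes as the opt-out switch; this excerpt does not mention that variable, so please check it against the documentation for your BLAST+ version. The helper name `blast_env_without_usage_report` is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def blast_env_without_usage_report(
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with BLAST usage reporting turned off."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: BLAST+ 2.11.0 and later skip the report when this is "false".
    env["BLAST_USAGE_REPORT"] = "false"
    return env


env = blast_env_without_usage_report({"PATH": "/usr/bin"})
print(env["BLAST_USAGE_REPORT"])

# To run BLAST+ with this environment (requires BLAST+ to be installed):
# import subprocess
# subprocess.run(["blastn", "-version"], env=blast_env_without_usage_report(), check=True)
```

Copying the environment rather than changing `os.environ` in place keeps the setting limited to the BLAST processes you start.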


RefSeq Release 203 now available

RefSeq release 203 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
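As a rough illustration of the E-utilities route, the sketch below builds an EFetch request URL for a single RefSeq record in FASTA format. The helper name `efetch_fasta_url` is hypothetical, and the accession NC_045512.2 (the SARS-CoV-2 reference genome) is simply a convenient example that is not taken from this announcement.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_fasta_url(accession: str, db: str = "nuccore") -> str:
    """Build an EFetch URL that returns one record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = efetch_fasta_url("NC_045512.2")
print(url)

# To actually download the record (needs network access):
# from urllib.request import urlopen
# with urlopen(url) as response:
#     fasta_text = response.read().decode("utf-8")
```

For protein records, you would pass `db="protein"` along with a protein accession. For anything beyond occasional requests, please follow the E-utilities usage guidelines on request rates and API keys.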

This full release incorporates genomic, transcript, and protein data available as of November 2, 2020, and contains 256,340,911 records, including 186,482,096 proteins, 34,176,314 RNAs, and sequences from 105,349 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Other announcements:

RefSeq annotation of mouse GRCm39
RefSeq has finished its initial annotation of the new mouse reference assembly, GRCm39, recently released by the Genome Reference Consortium. This is the first coordinate-changing update to the mouse reference since the 2012 release of GRCm38, resolving over 400 issues, almost doubling the scaffold N50, closing almost half the gaps, and adding 1.9 Mb of sequence.

The annotation report for annotation release 109 is available here.

The annotation products are available in the sequence databases and on the FTP site.

New eukaryotic genome annotations
In addition to mouse (GRCm39), this release contains new annotations generated by NCBI’s eukaryotic genome annotation pipeline for 27 species, including:

Pallas’s mastiff bat annotation release 100, based on the assembly mMolMol1.p (GCF_014108415.1)
Myotis myotis bat annotation release 100, based on the assembly mMyoMyo1.p (GCF_014108235.1)
southern grasshopper mouse annotation release 100, based on the new assembly mOncTor1.1 (GCF_903995425.1)
American pika (pictured above) annotation release 102, based on the new assembly OchPri4.0 (GCF_014633375.1)
pharaoh ant annotation release 102, based on the new assembly ASM1337386v2 (GCF_013373865.1)
olive fruit fly annotation release 101, based on the assembly MU_Boleae_v2 (GCF_001188975.3)

Updated human genome Annotation Release 105.20201022 (GRCh37.p13)
Annotation Release 105.20201022 is an annotation update for the previous human reference assembly, GRCh37.p13 (hg19). This update is not part of the RefSeq FTP release, but the annotation products are available in the sequence databases and on the genomes FTP site.

COVID-19 related human gene annotation now in NCBI RefSeq and Gene
The RefSeq group has compiled a set of human genes with roles in coronavirus infection and disease. You can now see and search for these genes and their regulatory elements in NCBI Gene and RefSeq.

Matched Annotation by NCBI and EMBL-EBI (MANE) version 0.92
NCBI RefSeq and Ensembl/GENCODE announced MANE v0.92, which covers 16,865 genes or ~88% of known human protein-coding genes.

NCBI Datasets

NCBI Datasets now provides downloads of gene data for more than 30 thousand organisms.

Human GRCh37 (hg19) RefSeq annotation update

The NCBI RefSeq group has been in overdrive, making improvements to our human genome annotation and reference transcript and protein sets, with 8,000 new and 15,000 updated transcripts in the last year alone! That’s about 30% of our curated transcript dataset (the transcripts with NM_ and NR_ accessions), with a big focus on transcripts that are well-expressed, have conserved exons, or are transcribed from new promoters.

With all these improvements, we’ve been updating the RefSeq annotation of GRCh38.p13 every quarter. But what about GRCh37 (hg19), which many of you still use?
