Frequently Asked Questions

FAQs

Q: Old blastn vs. new blastn
Q: What happened to the "month" database?
Q: What are the lower case, grey letters in the query sequence in BLAST results?
Q: Submitting primers or other short sequences
Q: Default database for nucleotide-nucleotide searches
Q: How to limit a search to an organism or taxonomic group
Q: How to limit a search to a subset of database sequences
Q: Saving your search parameters
Q: How can I search a batch of sequences with BLAST?
Q: How to write a program to submit jobs to NCBI's BLAST servers
Q: How to use BLAST to align two or more sequences without a database search.
Q: What is the Expect (E) Value?
Q: What is Low Complexity sequence?

Troubleshooting

ERROR: "No significant similarity found"
ERROR: An error has occurred on the server, Too many HSPs to save all
ERROR: An error has occurred on the server, [blastsrv4.REAL]:Error: CPU usage limit was exceeded, resulting in SIGXCPU (24).
ERROR: BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check query sequence
Why some batch searches on the web may seem to take longer than expected.

Frequently Asked Questions

Q: Old blastn vs. new blastn

Blastn in the new interface, which became the BLAST home page on April 16, 2007, uses a different scoring matrix than before. The match reward (-r) and mismatch penalty (-q) settings are now 2, -3 (from 1, -3). The rationale for this change is as follows:
  • Match/mismatch scores of 1/-3 are best suited for alignments of about 99% identity. We reasoned that if you are searching for alignments with that high a percent identity, megablast is a better algorithm. If you choose not to use megablast, then the assumption is that you want to look farther phylogenetically (lower percent identity), so discontiguous megablast or blastn is a better choice. Match/mismatch scores of 2/-3 target alignments of about 90% identity.
  • Also note that the nucleotide BLAST programs now have 'mask for lookup table only' on by default. This filter setting masks low complexity regions of the query sequence only during construction of the lookup table, meaning that word matches outside of a masked region are allowed to extend through masked regions. This was not the default filter setting in 'old blast' and could be another cause of different results between old and new blastn.
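To see how the two reward/penalty settings treat the same alignment, here is a minimal sketch that scores one ungapped alignment under both schemes. The function name `ungapped_score` and the example sequences are hypothetical; real BLAST scoring also involves gap costs and statistical normalization, which are left out here.

```python
def ungapped_score(query, subject, reward, penalty):
    """Raw score of an ungapped alignment of two equal-length sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(reward if q == s else penalty for q, s in zip(query, subject))


# Hypothetical 20-base alignment with 2 mismatches (90% identity).
query = "ACGTACGTACGTACGTACGT"
subject = "ACGTACGAACGTACGTACCT"

# 18 matches and 2 mismatches:
#   old settings (1, -3): 18*1 - 2*3 = 12
#   new settings (2, -3): 18*2 - 2*3 = 30
old_score = ungapped_score(query, subject, reward=1, penalty=-3)
new_score = ungapped_score(query, subject, reward=2, penalty=-3)
print(old_score, new_score)
```

With the larger match reward, each mismatch costs relatively less, which is why the 2/-3 setting tolerates more divergent alignments.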

Q: What happened to the Month database?

The BLAST month database is no longer searchable from the web pages, although it can still be downloaded from:
  • ftp.ncbi.nlm.nih.gov/blast/db/FASTA.
You can search recently added or modified sequence records by using an Entrez query. Create the BLAST database on the fly, and change the time period to whatever you want, by using an Entrez query such as:
  • 2007/06/30:2007/07/31[mdat] (mdat = modification date)
Or, use this simpler Entrez query text:
  • 1 month[filter]
  • 2 months[filter]
  • ...
  • 6 months[filter]
This applies to whatever database you select. Also, this Entrez query retrieves records modified in any way, not only those records whose sequence has changed.
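If you build these date-range queries regularly, a small helper can format them for you. The following is only a sketch: the function names `mdat_query` and `last_n_days_query` are hypothetical, and the only thing taken from this FAQ is the `YYYY/MM/DD:YYYY/MM/DD[mdat]` query format.

```python
from datetime import date, timedelta


def mdat_query(start, end):
    """Format an Entrez modification-date range, e.g. 2007/06/30:2007/07/31[mdat]."""
    if start > end:
        raise ValueError("start date must not be after end date")
    return f"{start:%Y/%m/%d}:{end:%Y/%m/%d}[mdat]"


def last_n_days_query(n_days, today=None):
    """Range covering the n_days before 'today' (defaults to the current date)."""
    today = today or date.today()
    return mdat_query(today - timedelta(days=n_days), today)


print(mdat_query(date(2007, 6, 30), date(2007, 7, 31)))
```

Paste the resulting string into the Entrez Query box of the BLAST search page.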

Q: What are the lower case grey letters in the query sequence in BLAST results?

You are seeing the result of automatic filtering of your query for low-complexity sequence. This filter is on by default on most pages in order to prevent alignments that many consider to be artifacts. The filter substitutes any low-complexity sequence with lowercase grey characters. This allows you to see the sequence that was filtered instead of the "X"s and "N"s of the previous BLAST output.

Q: Submitting primers or other short sequences

The "Search for short, nearly exact matches" nucleotide and protein pages no longer exist. Instead, the nucleotide and protein BLAST programs automatically check for short queries and adjust the search parameters accordingly. This adjustment occurs when the query, either nucleotide or amino acid, is 30 residues or shorter. Neither the translating BLAST programs nor searches on the Genome BLAST pages have this auto-adjust feature. If you submit forward and reverse primers in the same search by concatenating them with "N"s, and your primers total more than 30 bases, the adjustment will not occur (the Ns are not counted toward the 30-base threshold). In this case, adjust manually: click 'Algorithm parameters', uncheck 'Automatically adjust parameters for short input sequences', and then set your own parameters.
A good starting point for nucleotide queries is:
  • word size 7
  • expect value 1000
  • blastn
  • turn off low complexity filter
For protein queries, start with:
  • word size 2
  • expect value 30000
  • matrix PAM30
  • turn off low complexity filter
  • set composition-based statistics to 'no adjustment'
Once you are satisfied with the parameters you want, you can bookmark that page for future use. The bookmark link is near the top right of the blast pages. You can also save the search, if logged into My NCBI, which will save that search strategy.
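The 30-base rule for concatenated primers can be checked before you submit. The sketch below reads the rule the way this FAQ describes it (Ns are ignored and the remaining bases are compared with 30); the function names are hypothetical, and the dictionary keys are illustrative labels rather than actual BLAST form fields or command-line options.

```python
SHORT_QUERY_THRESHOLD = 30  # length stated in this FAQ

# Starting points suggested above, recorded under illustrative key names.
NUCLEOTIDE_SHORT_SETTINGS = {
    "program": "blastn",
    "word_size": 7,
    "expect": 1000,
    "low_complexity_filter": False,
}
PROTEIN_SHORT_SETTINGS = {
    "word_size": 2,
    "expect": 30000,
    "matrix": "PAM30",
    "low_complexity_filter": False,
    "composition_based_statistics": "no adjustment",
}


def effective_primer_length(query):
    """Count nucleotide bases, ignoring the N spacers used to join primers."""
    return sum(1 for base in query.upper() if base != "N")


def auto_adjust_expected(query):
    """True if a nucleotide query is short enough for the automatic adjustment."""
    return effective_primer_length(query) <= SHORT_QUERY_THRESHOLD


forward = "ACGT" * 5   # hypothetical 20-base primer
reverse = "TGCA" * 5   # hypothetical 20-base primer
joined = forward + "N" * 10 + reverse

print(auto_adjust_expected(forward))  # a single 20-base primer is under the threshold
print(auto_adjust_expected(joined))   # 40 non-N bases exceed it, so set parameters manually
```

The helper applies to nucleotide queries only, because N is a valid residue (asparagine) in protein sequences.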

Q: Default database for nucleotide-nucleotide searches

The default nucleotide-nucleotide BLAST database is currently "Human genomic + transcript". If you want to search the nucleotide 'nr' database (also called 'nt') or another database, click the radio button for "Others" and use the drop down menu to select a database.

Q: Saving your search parameters

Once you are satisfied with the parameters for a particular search, you can bookmark that page for future use. The bookmark link is near the top right of the search input pages. You can also save the search, if logged into My NCBI, which will save that search strategy. To save, click on the "[Save Search Strategies]" link near the top of the blast results page.

Q: How to limit a search to an organism or taxonomic group

To search only sequences from an organism or taxonomic group, use the "Organism" text box. On the nucleotide blast pages, first click the radio button for "Others (nr etc.)". The "Organism" text box has an auto fill function. Begin to enter an organism common name (rat, bacteria, etc.), a genus or species (elegans, danio, etc.), or an NCBI taxonomy id; then select a name from the list.
You can also use Entrez Query terms as before. Put those in the Entrez Query box just below the Organism field; for example, rattus norvegicus[organism] or simply, rat[orgn]. Also, see the FAQ, "How to limit a search to a subset of database sequences."
You can search for taxa in the Taxonomy Browser.

Q: How to limit a search to a subset of database sequences

In addition to specifying the organism, you can use the Entrez Query text box to limit the database that you have chosen in other ways.
For example:
  • to search against mammals other than human, use: mammals[orgn] NOT human[orgn]
  • to exclude all mammals, use: all[filter] NOT mammals[orgn]
  • to search against all records that contain "phosphorylase" in the title (definition line) of the record, use: phosphorylase[title].
Get help with writing Entrez queries in the NCBI Handbook. Since these queries initiate an Entrez search, the BLAST search often takes longer to run.

Q: How can I search a batch of sequences with BLAST?

There are three options for "batch" BLAST searches:
  • 1) Web megablast. This program is optimized for aligning nucleotide sequences that differ slightly as a result of sequencing or other similar "errors", and is good for scanning a large number of EST type sequences (about 500 kb in length) against a large database. You can import a file of EST sequences in FASTA format or as a list of GenBank accessions or GIs. The default output is an easily reviewable Hit Table format, although you can download and save the results in Standard pairwise HTML or any of the other result output options. Web megablast is available from the BLAST home page. Megablast is also part of the Standalone BLAST executables and an option in the Network BLAST client (see below).
  • 2) Standalone BLAST executables. These are command line programs which run BLAST searches against local, downloaded copies of the NCBI BLAST databases, or against custom databases formatted for BLAST. The programs will handle either a single large file with multiple FASTA query sequences, or you can create a script to send multiple files one at a time. The executables are available for a wide variety of platforms, including many "flavors" of UNIX (LINUX, Solaris, etc.), Windows, and Mac OSX.
    The Standalone package can be downloaded at http://www.ncbi.nlm.nih.gov/blast/download.shtml or the anonymous FTP location, ftp://ftp.ncbi.nih.gov/blast/executables/; get the "blast" package for your platform.
    Documentation for the programs is bundled with the downloaded binaries and is also available on the download page. More detailed installation instructions and program documentation are available here: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/.
  • 3) Network BLAST client (also called netblast and blastcl3). The Network client is a simple command-line program that allows you to submit a single file of FASTA sequences over an internet connection to the NCBI BLAST databases. You submit searches through the client to the NCBI servers and do not need to download the databases locally. There are client versions for various UNIX platforms, Windows, and Mac OSX.
    The client is available at http://www.ncbi.nlm.nih.gov/blast/download.shtml or the anonymous FTP location, ftp://ftp.ncbi.nih.gov/blast/executables/; get "netblast" for your platform.

Q: How to write a program to submit jobs to NCBI's BLAST servers

Use the URLAPI. Documentation is also available in PostScript and PDF formats.
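As a rough illustration of what such a program does, the sketch below only builds the request strings; it does not contact any server. The endpoint URL and the parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`, `ENTREZ_QUERY`, `RID`, `FORMAT_TYPE`) are not given in this FAQ. They are assumptions that should be checked against the URLAPI documentation before use, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the URLAPI documentation.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_put_request(query, program="blastn", database="nt", entrez_query=None):
    """URL-encoded body for submitting a search (assumed parameter names)."""
    params = {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
    }
    if entrez_query:
        params["ENTREZ_QUERY"] = entrez_query
    return urlencode(params)


def build_get_request(rid, format_type="Text"):
    """URL-encoded body for retrieving results by request ID (assumed names)."""
    return urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type})


print(build_put_request("ACGT", entrez_query="rat[orgn]"))
print(build_get_request("ABC123"))  # "ABC123" is a made-up request ID
```

A submitting program would send the first string to the server, read the request ID from the response, and then poll with the second string until results are ready. Pace the polling according to the usage guidelines in the URLAPI documentation.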

Q: How to use BLAST to align two sequences without a database search.

NCBI has a tool for aligning two sequences provided by the user. The tool is called BLAST 2 Sequences, which uses the chosen BLAST algorithm to align sequences as if they were found in a database search. This can be helpful for observing differences between two sequences. However, it still performs local alignments, not global alignments.
Because BLAST 2 Sequences uses the size of the current nucleotide or protein nr database to calculate Expect values, you may need to significantly increase the Expect threshold in order to see shorter alignments. Also, the low complexity filter is on by default; this may be the cause of "missing" alignments.
If comparing very large sequences (on the order of hundreds of kilobases), you may need to specify a sequence sub range with the "from" and "to" boxes. Also, submitting the shorter of two sequences as Sequence 1 may help when the two queries are of very different lengths.

Q: What is the Expect (E) value?

The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.

The lower the E-value, or the closer it is to zero, the more "significant" the match is. However, keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence. These high E values make sense because shorter sequences have a higher probability of occurring in the database purely by chance. For more details please see the calculations in the BLAST Course.

The Expect value can also be used as a convenient way to create a significance threshold for reporting results. You can change the Expect value threshold on most BLAST search pages. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
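The exponential dependence on score can be made concrete with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length and n is the database length. This formula is not spelled out in this FAQ, and the lambda, K, and database-size values below are placeholders chosen for illustration, not the values BLAST uses for any particular scoring system.

```python
import math


def expect_value(score, query_len, db_len, lam, k):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder statistical parameters and database size (assumptions).
LAM, K, DB_LEN = 1.0, 0.5, 1e9

# A perfect 20-base match cannot score higher than 20 with a match reward of 1,
# while a perfect 200-base match can reach 200.
e_short = expect_value(score=20, query_len=20, db_len=DB_LEN, lam=LAM, k=K)
e_long = expect_value(score=200, query_len=200, db_len=DB_LEN, lam=LAM, k=K)

print(f"short perfect match: E = {e_short:.3g}")
print(f"long perfect match:  E = {e_long:.3g}")
```

Even though both alignments are 100% identical, the short one has a far higher E value because its maximum attainable score is small.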

Q: What is "low-complexity" sequence?

Regions with low-complexity sequence have an unusual composition that can create problems in sequence similarity searching. For amino acid queries this compositional bias is determined by the SEG program (Wootton and Federhen, 1996). For nucleotide queries it is determined by the DustMasker program (Morgulis, et al., 2006).

Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits.

In BLAST searches performed without a filter, high scoring hits may be reported only because of the presence of a low-complexity region. Most often, it is inappropriate to consider this type of match as the result of shared homology. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
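A crude way to put a number on "low complexity" is the Shannon entropy of the residue composition. This is a simple stand-in for illustration only; it is not the SEG or DustMasker algorithm, and the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Composition entropy in bits per residue (lower means less complex)."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


low_dna = "AAATAAAAAAAATAAAAAAT"      # nucleotide example from this FAQ
mixed_dna = "ACGTACGTACGTACGTACGT"    # hypothetical sequence with even composition
low_protein = "PPCDPPPPPKDKKKKDDGPP"  # protein example from this FAQ

for name, seq in [("low DNA", low_dna), ("mixed DNA", mixed_dna), ("low protein", low_protein)]:
    print(f"{name}: {shannon_entropy(seq):.2f} bits")
```

The maximum is 2 bits per residue for DNA and about 4.32 bits for the 20 amino acids, so values well below those limits indicate a biased composition. Entropy ignores residue order, so a simple repeat such as ACGTACGT still scores as fully complex here.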

Troubleshooting

ERROR: "No significant similarity found"

Below are common reasons that a BLAST search results in the "No significant similarity found" message.
  • Short query sequences: Short alignments may have Expect values above the default threshold, which is 10 on most pages, and are therefore not displayed. Try increasing the Expect threshold (under 'Algorithm parameters'). Also, see the FAQ, "Submitting primers or other short sequences."
  • Filtering: Some of the BLAST programs mask regions of low complexity by default. These regions are not allowed to initiate alignments, so if your query is largely low complexity, the filter may prevent all hits to the database. On the Basic BLAST pages, adjust the filter settings in the section 'Filters and Masking', under 'Algorithm parameters'. For a description of low complexity filters, see "What is low-complexity sequence?"

ERROR: An error has occurred on the server, Too many HSPs to save all

This error occurs when the total number of high-scoring segment pairs (HSPs) is too large for the BLAST servers to return the results. This is rare, because the results have to amount to several hundred megabytes of information for this to happen. However, certain searches can generate a huge amount of data. Most typically, this error occurs when the default filters are turned off or when the query sequences contain repeat elements. If you get this error, you have numerous options depending on your goals:
  • 1) If using tblastx, try blastx instead. The tblastx program is very CPU intensive as it not only translates the query in six reading frames but every database sequence as well. Often, using tblastx is a measure of last resort; a blastx search against a database of known proteins may provide what you need.
  • 2) Search a smaller database, such as refseq_rna. Larger databases obviously contain more sequences and for some queries this results in numerous "background" hits. If you want a database of known mRNAs (and their translations) then refseq_rna is a good choice.
  • 3) Break up large queries into smaller pieces; submit each piece in a separate search. A common cause of errors in BLAST is searching with a huge sequence, like a complete chromosome, against a large database like nr. This is better accomplished in portions rather than one large, continuous sequence.
  • 4) Limit the database by taxonomy. Start with large groups, such as mammals, bacteria, etc. Any taxonomic node or tax id number that you can find in the Taxonomy Browser can be used in the 'Organism' text box; see the BLAST FAQ, "How to limit a search to an organism or taxonomic group." Also see the Taxonomy Browser.
  • 5) You may be hitting a large number of 'PREDICTED' or 'hypothetical protein' records. If you do not want these hits, use an Entrez Query such as: all[filter] NOT predicted[title].
  • 6) If your queries contain repeat regions, you either need to have one of the species-specific repeat filters turned on, or you can first run your query through a program such as RepeatMasker to identify the repeat regions and remove them from your query. You can check the filter settings on the Basic BLAST pages in the 'Filters and Masking' section, under 'Algorithm parameters'. On the Genome BLAST pages, the default filter includes a repeat filter.
  • 7) For megablast and blastn searches, try increasing the word size and/or decreasing the Expect threshold.
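Option 3 above, breaking a large query into pieces, can be sketched as follows. The function name `split_query`, the chunk size, and the optional overlap are all assumptions made for illustration. The overlap is a design choice not mentioned in this FAQ, added so that an alignment spanning a chunk boundary is not lost.

```python
def split_query(sequence, chunk_size, overlap=0):
    """Split a sequence into (offset, piece) tuples of at most chunk_size residues."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sequence), step):
        chunks.append((start, sequence[start:start + chunk_size]))
        if start + chunk_size >= len(sequence):
            break  # the last piece already reaches the end of the sequence
    return chunks


# Tiny demonstration; a real query would use chunk sizes in the kilobase range.
for offset, piece in split_query("ABCDEFGHIJ", chunk_size=4, overlap=1):
    print(offset, piece)
```

Keeping the offset of each piece lets you translate hit coordinates back to positions in the original sequence.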

ERROR: An error has occurred on the server, [blastsrv4.REAL]:Error: CPU usage limit was exceeded, resulting in SIGXCPU (24).

This error occurs when your search is so large that the backend machines cannot complete it in the time allowed, which is about one hour of combined CPU time. This is distinct from the "Too many HSPs" error in that there are no results and the servers have essentially killed the process. However, the causes, such as large output or unfiltered queries, are similar.

If you get this error you have numerous options depending on your goals. See the BLAST FAQ, "ERROR: An error has occurred on the server, Too many HSPs to save all".

ERROR: BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check query sequence

This will happen if your entire query sequence has been masked by low complexity filtering. You will need to turn filtering off to get hits. For further information on filtering, please read the BLAST FAQ, "What is Low Complexity sequence?"

Why some batch searches on the web may seem to take longer than expected.

The NCBI WWW BLAST server is a shared resource and it would be unfair for a few users to monopolize it. To prevent this, the server keeps track of how many queries are in the queue for each user and penalizes those users with many queries in the queue. This is done by calculating a 'Time of Execution' (TOE). If a user has only one query in the queue, the TOE is set to the current time. As a user adds more queries to the queue, the TOE of each new query is set to the current time plus 60 seconds for every query that user already has in the queue. For example, if a user sends five requests one after the other without waiting for any to be worked on, the TOEs for the requests are:
1st request: current time
2nd request: current time + 60 seconds
3rd request: current time + 120 seconds
4th request: current time + 180 seconds
5th request: current time + 240 seconds

The BLAST server works through requests in the order of earliest to latest TOE. A query will be executed before its TOE if there are no other queries with an earlier TOE. Users with large numbers of queries are encouraged to use the BLAST servers at off-peak hours, which are from 8 p.m. to 8 a.m. (EST).
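The TOE arithmetic described above can be written out as a short sketch. It assumes all requests arrive at essentially the same moment, as in the five-request example; the function name `times_of_execution` is hypothetical and the code is not the server's actual implementation.

```python
from datetime import datetime, timedelta

PENALTY_SECONDS = 60  # delay per query the user already has in the queue


def times_of_execution(now, n_requests):
    """TOE for each of n_requests submitted back to back at time 'now'."""
    return [now + timedelta(seconds=PENALTY_SECONDS * i) for i in range(n_requests)]


now = datetime(2007, 1, 1, 12, 0, 0)  # arbitrary example timestamp
for number, toe in enumerate(times_of_execution(now, 5), start=1):
    delay = int((toe - now).total_seconds())
    print(f"request {number}: current time + {delay} seconds")
```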