Frequently Asked Questions
Q: Old blastn vs. new blastn
- Match/mismatch scores of 1/-3 are best suited for alignments of about 99% identity. We reasoned that if you are searching for alignments with that high a percent identity, megablast is a better algorithm. If you choose not to use megablast, then the assumption is that you want to look farther phylogenetically (lower percent identity), so discontiguous megablast or blastn is a better choice. Match/mismatch scores of 2/-3 target alignments of about 90% identity.
- Also note that the nucleotide BLAST programs now have 'mask for lookup table only' on by default. This filter setting masks low complexity regions of the query sequence only during construction of the lookup table, meaning that word matches outside of a masked region are allowed to extend through masked regions. This was not the default filter setting in 'old blast' and could be another cause of different results between old and new blastn.
Q: What happened to the Month database?
- ftp.ncbi.nlm.nih.gov/blast/db/FASTA.
- 2007/06/30:2007/07/31[mdat] (mdat = modification date)
- 1 month[filter]
- 2 months[filter]
- ...
- 6 months[filter]
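The date-range term in the list above follows the pattern start:end[mdat]. The following is a minimal Python sketch for building such a term for an arbitrary window. The function name mdat_range is hypothetical and not part of any NCBI tool.

```python
from datetime import date

def mdat_range(start: date, end: date) -> str:
    """Format an Entrez modification-date range such as 2007/06/30:2007/07/31[mdat]."""
    if end < start:
        raise ValueError("end date must not precede start date")
    # date objects accept strftime-style format specs inside f-strings
    return f"{start:%Y/%m/%d}:{end:%Y/%m/%d}[mdat]"

# Reproduces the range shown in the list above.
print(mdat_range(date(2007, 6, 30), date(2007, 7, 31)))
```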
Q: What are the lower case grey letters in the query sequence in BLAST results?
Q: Submitting primers or other short sequences
For short nucleotide sequences:
- word size 7
- expect value 1000
- blastn
- turn off low complexity filter
For short peptide sequences:
- word size 2
- expect value 30000
- matrix PAM30
- turn off low complexity filter
- set composition-based statistics to 'no adjustment'
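The two groups of settings above can also be written down as command-line arguments. This is only a sketch under several assumptions: that the first group (the one naming blastn) applies to short nucleotide queries and the second (the one naming PAM30) to short peptide queries, and that you are running the BLAST+ command-line programs rather than the web page this FAQ describes. The flag names are BLAST+ options, not something stated in this FAQ, and the file and database names are hypothetical.

```python
# Settings from the two lists above, expressed as BLAST+ arguments.
# Assumption: BLAST+ is installed; the FAQ itself describes the web form.
SHORT_NUCLEOTIDE = [
    "blastn", "-task", "blastn",   # plain blastn rather than megablast
    "-word_size", "7",
    "-evalue", "1000",
    "-dust", "no",                 # turn off the low complexity filter
]
SHORT_PEPTIDE = [
    "blastp",
    "-word_size", "2",
    "-evalue", "30000",
    "-matrix", "PAM30",
    "-seg", "no",                  # turn off the low complexity filter
    "-comp_based_stats", "0",      # 'no adjustment'
]

def short_query_command(query_file: str, database: str, protein: bool) -> list[str]:
    """Return an argument list for a short-query search (nothing is executed here)."""
    base = SHORT_PEPTIDE if protein else SHORT_NUCLEOTIDE
    return base + ["-query", query_file, "-db", database]

# Hypothetical file and database names, for illustration only.
print(" ".join(short_query_command("primers.fasta", "nt", protein=False)))
```

The resulting list can be passed to subprocess.run once the query file and a local database actually exist.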
Q: Default database for nucleotide-nucleotide searches
Q: Saving your search parameters
Q: How to limit a search to an organism or taxonomic group
Q: How to limit a search to a subset of database sequences
- to search against mammals other than human, use: mammals[orgn] NOT human[orgn]
- to exclude all mammals, use: all[filter] NOT mammals[orgn]
- to search against all records that contain "phosphorylase" in the title (definition line) of the record, use: phosphorylase[title].
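The example queries above all have the shape term[field], optionally joined with NOT. The following is a small Python sketch for composing such strings. The helper names entrez_term and exclude are hypothetical, and the sketch only builds text; it does not validate field names against Entrez.

```python
def entrez_term(value: str, field: str) -> str:
    """Format a single Entrez term, e.g. mammals[orgn]."""
    return f"{value}[{field}]"

def exclude(include: str, *excluded: str) -> str:
    """Join one included term with any number of excluded terms using NOT."""
    return " NOT ".join([include, *excluded])

# Mammals other than human, as in the first example above.
print(exclude(entrez_term("mammals", "orgn"), entrez_term("human", "orgn")))
# Everything except mammals, as in the second example above.
print(exclude(entrez_term("all", "filter"), entrez_term("mammals", "orgn")))
```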
Q: How can I search a batch of sequences with BLAST?
- 1) Web megablast. This program is optimized for aligning nucleotide sequences that differ slightly as a result of sequencing or other similar "errors", and is good for scanning a large number of EST type sequences (about 500 kb in length) against a large database. You can import a file of EST sequences in FASTA format or as a list of GenBank accessions or GIs. The default output is an easily reviewable Hit Table format, although you can download and save the results in Standard pairwise HTML or any of the other result output options. Web megablast is available from the BLAST home page. Megablast is also part of the Standalone BLAST executables and an option in the Network BLAST client (see below).
- 2) Standalone BLAST executables. These are command line programs which run BLAST searches against local, downloaded copies of the NCBI BLAST databases, or against custom databases formatted for BLAST. The programs will handle either a single large file with multiple FASTA query sequences, or you can create a script to send multiple files one at a time. The executables are available for a wide variety of platforms, including many "flavors" of UNIX (LINUX, Solaris, etc.), Windows, and Mac OSX.
The Standalone package can be downloaded at http://www.ncbi.nlm.nih.gov/blast/download.shtml or the anonymous FTP location, ftp://ftp.ncbi.nih.gov/blast/executables/; get the "blast" package for your platform. Documentation for the programs is bundled with the downloaded binaries and is also available on the download page. More detailed installation instructions and program documentation are available here: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/.
- 3) Network BLAST client (also called netblast and blastcl3). The Network client is a simple command-line program that allows you to submit a single file of FASTA sequences over an internet connection to the NCBI BLAST databases. You submit searches through the client to the NCBI servers and do not need to download the databases locally. There are client versions for various UNIX platforms, Windows, and Mac OSX.
The client is available at http://www.ncbi.nlm.nih.gov/blast/download.shtml or the anonymous FTP location, ftp://ftp.ncbi.nih.gov/blast/executables/; get "netblast" for your platform.
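Option 2 above mentions either one file with multiple FASTA queries or a script that sends multiple files one at a time. The following is a minimal Python sketch of the preparation step for the second approach: splitting multi-FASTA text into records and writing one file per query. The function names and the output file naming scheme are hypothetical, and the sketch deliberately avoids any third-party parser.

```python
from pathlib import Path

def split_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from multi-FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def write_one_per_file(records, out_dir: Path) -> list[Path]:
    """Write each record to its own FASTA file and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, (header, seq) in enumerate(records, start=1):
        path = out_dir / f"query_{i:04d}.fasta"
        path.write_text(f">{header}\n{seq}\n")
        paths.append(path)
    return paths

example = ">est1 sample\nACGTACGT\nACGT\n>est2\nTTGGCCAA\n"
print(split_fasta(example))
```

Each written file can then be passed to the search program in a loop.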
Q: How to write a program to submit jobs to NCBI's BLAST servers
Q: How to use BLAST to align two sequences without a database search.
Q: What is the Expect (E) value?
The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.
The lower the E value, or the closer it is to zero, the more "significant" the match is. However, keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence. These high E values make sense because shorter sequences have a higher probability of occurring in the database purely by chance. For more details please see the calculations in the BLAST Course.
The Expect value can also be used as a convenient way to create a significance threshold for reporting results. You can change the Expect value threshold on most BLAST search pages. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
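The behavior described above can be illustrated with the standard Karlin-Altschul approximation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The following is a minimal Python sketch assuming this simplified form; BLAST itself applies length corrections for edge effects, so the values it reports will differ. The database size and the bit scores used here are hypothetical.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E = m * n * 2**(-S'), ignoring BLAST's edge-effect corrections."""
    return query_len * db_len * 2.0 ** (-bit_score)

DB_LEN = 10**9  # hypothetical database size in residues

# Each additional bit of score halves E (the exponential decrease with score).
for bits in (30, 40, 50):
    print(bits, expect_value(bits, query_len=500, db_len=DB_LEN))

# E scales with the search space: the same score in a database ten times
# larger gives an E value ten times higher.
print(expect_value(40, 500, DB_LEN), expect_value(40, 500, 10 * DB_LEN))
```

Because a short alignment can only accumulate a small score, even a perfect short match ends up with a comparatively high E value under this formula.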
Q: What is "low-complexity" sequence?
Regions with low-complexity sequence have an unusual composition that can create problems in sequence similarity searching. For amino acid queries this compositional bias is determined by the SEG program (Wootton and Federhen, 1996). For nucleotide queries it is determined by the DustMasker program (Morgulis et al., 2006).
Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits.
In BLAST searches performed without a filter, high scoring hits may be reported only because of the presence of a low-complexity region. Most often, it is inappropriate to consider this type of match as the result of shared homology. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
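To make "unusual composition" more tangible, the following Python sketch computes the Shannon entropy of the residue composition for the two example sequences above. This is not the SEG or DustMasker algorithm, only a rough illustration of compositional bias. The third sequence is a made-up contrast with balanced composition.

```python
from collections import Counter
from math import log2

def shannon_entropy(seq: str) -> float:
    """Bits per residue; lower values mean a more biased composition."""
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((c / total) * log2(c / total) for c in counts.values())

examples = (
    "PPCDPPPPPKDKKKKDDGPP",   # low-complexity protein example from the text
    "AAATAAAAAAAATAAAAAAT",   # low-complexity nucleotide example from the text
    "ACGTTGCAAGCTGATCCGTA",   # made-up contrast: five each of A, C, G, T
)
for seq in examples:
    print(seq, round(shannon_entropy(seq), 2))
```

A composition-only measure like this misses repeats whose overall composition is balanced, which is one reason the dedicated programs use more elaborate scoring.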
Troubleshooting
ERROR: "No significant similarity found"
- Short query sequences: Short alignments may have Expect values above the default threshold, which is 10 on most pages, and, therefore, are not displayed. Try increasing the Expect threshold (under 'Algorithm parameters'). Also, see the FAQ "Submitting primers or other short sequences".
- Filtering: Some of the BLAST programs mask regions of low complexity by default. These regions are not allowed to initiate alignments, so if your query is largely low complexity, the filter may prevent all hits to the database. On the Basic BLAST pages, adjust the filter settings in the section 'Filters and Masking', under 'Algorithm parameters'. For a description of low complexity filters, see "What is low-complexity sequence?"
ERROR: An error has occurred on the server, Too many HSPs to save all
- 1) If using tblastx, try blastx instead. The tblastx program is very CPU intensive as it not only translates the query in six reading frames but every database sequence as well. Often, using tblastx is a measure of last resort; a blastx search against a database of known proteins may provide what you need.
- 2) Search a smaller database, such as refseq_rna. Larger databases obviously contain more sequences and for some queries this results in numerous "background" hits. If you want a database of known mRNAs (and their translations) then refseq_rna is a good choice.
- 3) Break up large queries into smaller pieces; submit each piece in a separate search. A common cause of errors in BLAST is searching with a huge sequence, like a complete chromosome, against a large database like nr. This is better accomplished in portions rather than one large, continuous sequence.
- 4) Limit the database by taxonomy. Start with large groups, such as mammals, bacteria, etc. Any taxonomic node or tax id number that you can find in the Taxonomy browser can be used in the 'Organism' text box; see the BLAST FAQ, "How to limit a search to an organism or taxonomic group." Also see the Taxonomy browser.
- 5) You may be hitting a large number of 'PREDICTED' or 'hypothetical protein' records. If you do not want these hits, use an Entrez Query such as: all[filter] NOT predicted[title].
- 6) If your queries contain repeat regions, you either need to have one of the species-specific repeat filters turned on, or you can first run your query through a program such as RepeatMasker to identify the repeat regions and remove them from your query. You can check the filter settings on the Basic BLAST pages in the 'Filters and Masking' section, under 'Algorithm parameters'. On the Genome BLAST pages, the default filter includes a repeat filter.
- 7) For megablast and blastn searches, try increasing the word size and/or decreasing the Expect threshold.
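Option 3 above suggests breaking a large query into smaller pieces. The following is a minimal Python sketch of one way to do that. The overlap between consecutive pieces is an assumption added here so that an alignment spanning a cut point is not lost entirely; it is not something this FAQ prescribes, and suitable sizes depend on the search.

```python
def chunk_sequence(seq: str, size: int, overlap: int = 0):
    """Yield (start, piece) tuples covering seq; consecutive pieces share `overlap` residues."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    for start in range(0, len(seq), step):
        yield start, seq[start:start + size]
        # Stop once a piece reaches the end, to avoid a trailing fragment
        # that is already fully contained in the previous piece.
        if start + size >= len(seq):
            break

# Tiny illustration; real pieces would be far longer.
pieces = list(chunk_sequence("ACGTACGTAC", size=4, overlap=1))
print(pieces)
```

Keeping the start offset with each piece makes it possible to map hit coordinates back to the original sequence.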
ERROR: An error has occurred on the server, [blastsrv4.REAL]:Error: CPU usage limit was exceeded, resulting in SIGXCPU (24).
If you get this error you have numerous options depending on your goals. See the BLAST FAQ, "ERROR: An error has occurred on the server, Too many HSPs to save all".
Why do I get the message "ERROR:BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check query sequence"?
Why some batch searches on the web may seem to take longer than expected.
1st request: current time
2nd request: current time + 60 seconds
3rd request: current time + 120 seconds
4th request: current time + 180 seconds
5th request: current time + 240 seconds
The BLAST server works through requests in order from earliest to latest time of execution (TOE). A query can run before its TOE if no other queries have an earlier TOE. Users with large numbers of queries are encouraged to use the BLAST servers during off-peak hours, which are from 8 p.m. to 8 a.m. (EST).
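The schedule above can be sketched in Python. The 60-second spacing is taken from the list of requests above. Everything else is an illustration and not NCBI's actual scheduler: the function names, the two users, and the reading of TOE as a scheduled execution time per request are assumptions.

```python
import heapq

SPACING_SECONDS = 60  # spacing between consecutive requests, from the list above

def assign_toes(now: float, n_requests: int) -> list[float]:
    """TOE for the 1st, 2nd, ... request of a batch submitted at time `now`."""
    return [now + i * SPACING_SECONDS for i in range(n_requests)]

def execution_order(queues: dict[str, list[float]]) -> list[tuple[float, str]]:
    """Order requests from several users by earliest TOE."""
    heap = [(toe, user) for user, toes in queues.items() for toe in toes]
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

# One user submits five requests at t=0; another submits a single request at t=30.
batch = assign_toes(now=0, n_requests=5)
single = assign_toes(now=30, n_requests=1)
print(batch)
print(execution_order({"batch_user": batch, "single_user": single}))
```

In this toy model the single request is served after only one of the batch requests, which shows why a large batch can seem slow while the server remains responsive to other users.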