Performance and Optimization
Benchmarking Software on Hopper and Carver
PURPOSE
- Test the performance impact of multithreading with representative public domain software including blastn, blastp, rpsblast, hmmsearch, usearch.
- Run on Hopper (24 cores/node) and Carver (8 cores/node) with different combinations of the number of tasks and threads.
- Provide useful set of parameters to maximize throughput
PROGRAMS TESTED
- BLAST+ programs (blastn, blastp, rpsblast) version 2.2.26
- usearch verison 5.2.32
- hmmsearch version 3.0
DATASETS
- usearch: a collection of protein sequences (~900MB) against a reference "udb" (~900MB)
- blastn
- Query: a collection of nucleotide sequences from NCBI Microbial database from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/ (34,154 sequences, 3.2GB)
- Subject: a collection of NCBI RefSeq genomic sequences from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ forrmatted to NCBI database using makeblastdb (484 sequences, 4.2Gbp)
- blastp
- Query: a collection of protein sequences from NCBI Microbial database from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/ (1,318,911 sequences, 513MB)
- Subject: a collection of NCBI nr sequences from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ formatted to NCBI database using makeblastdb (602,888 sequences, 200Mbp)
- rpsblast
- Query: same with blastp
- Subject: 13,672 pfam raw HMM models from ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz converted to a database using makeprofiledb (~300MB)
- hmmsearch
- Query: same with blastp
- Subject: Pfam-A database from ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases/Pfam26.0/ (1.1GB)
RESULTS I - Multithreading performance with different input size
The below figures show the performance behavior of each program using different multithreading settings on Hopper and Carver. The y-axis labeled with “B-rate” represents the throughput per unit time and is calculated by the following equation:
((input size per task) * (number of tasks)) / (total runtime) |
The x-axis shows increasing number of input sequence data. For the runs on Hopper, we used 24 cores from the allocated node and varies the number of tasks / threads. For Caver, we used 8 cores. The legend of (p x t) means p tasks with t threads. For example, “4 tasks x 2 threads” means each program is forked to four processes and each process runs in two threads.
Figure 1 and Figure 2 show how blastn performs in terms of the b-rate metric on Hopper and Carver. Interestingly, it is shown that the b-rate is degraded when more threads are used and increasing the query size results in increasing the b-rate. However, the rates are saturated around 1,600 sequences and starts decreasing on both Hopper and Carver. First of all, multithreading doesn't improve the throughput. For all the experiments, runs with one thread always show the best performance. The BLAST program consists of three steps: (i) scan for matches between fixed size words, (ii) extend each matching word as an ungapped alignment on the condition that there is another word match nearby, and (iii) perform gapped alignment for those matches that passed the second step. Note that the multithreading is only implemented for the word matching step of blastn. The construction of separate hash tables for each thread in order to do the word matching and then merging those tables for the next step (i.e. ungapped alignment) has large overhead and explains this trend.
The saturation points in the curves are explained by the physical memory limitations on the compute nodes. FYI, Hopper provides 32GB main memory per node (1.33GB per core) and Carver has 24GB memory space for the "smallmem" configuration (3 GB per core). It is clearly shown that the multithreaded runs of blastn have more overhead than single threaded runs. Also, the matching step scans through all subject databases. If each thread needs to access different databases, that can also undermine the memory sharing advantage of multithreading because different part of the database should be loaded.
The steep increase of the throughput with smaller query size can be explained by the two ways. First, blastn is quite I/O intensive. As long as there is memory available, it can load the data into memory and avoid writing to disk. Second, blastn deals with short query sequences effectively: it concatenates set of input sequences as long one to maximize performance. Until the saturation point, smaller number of queries can take advantage of this feature. For example, we’ve done a rapid comparison between BLAST and BLAST+ using nt and a short query sequence:
$ time blastall -p blastn -i gene.fa -d nt -o out.blst $ time blastn -db nt -query gene.fa -out out.blstp |
The BLAST+ is somehow faster, for a single sequence against nt database. Again it has been reported that significant speed-ups are possible when querying with long sequences. A whole genome sequence is used as a query for next experiment and BLAST+ shows 30x speedup over the traditional BLAST implementation.
$ time blastall -a 4 -p blastn -i NC_011353.fna -d nt -o out.blst $ time blastn -num_threads 4 -db nt -query NC_011353.fna -out out.blstp |
Figure 3 and Figure 4 demonstrate the behaviors of blastp. Again, enabling multithreading doesn’t seem to be useful in terms of the b-rate. Further, the query size is not related with the rate as well. This may due to the compute-intensive nature of blastp in which a very large amount of word candidates should be scanned thought the database and recorded in the hash table. So we see the opposite behavior from the blastn case, loading data into memory, which significantly decreased the I/O overhead in blastn, does not improve the performance of blastp.
|
Figure 5 and Figure 6 show the performance of rpsblast. The trends are quite similar with the case of blastp except for the run with smallest query input. That can be explained by that the loading PSSMs takes larger portion of its operation than computing similarities of input sequences against the database.
Figure 7 and Figure 8 show the performance of hmmsearch. Like the others, we’ve seen that multithreading increases overhead and does not improve throughput. We've found the saturation point is around 51,200 sequences.
Figure 9 and Figure 10 show the behavior of usearch on Hopper and Carver. The througputs in both plots stop increasing around the input size of 819,200.
RESULTS II - Performance with different database size
Now that the optimal input size and the number of threads are determined, the throughput of the softwares with varying database size is evaluated. The following table shows is the input size in basepair and the original database size used in the above experiments.
S/W | Input size (bp) | Database size (MB) |
blastn | 20M | 1,018 |
blastp | 300K | 560 |
rpsblast | 300K | 630 |
usearch | 60M | 841 |
For varying the database size, each original database is reduced to 1/16, 1/8, 1/4, and 1/2, respectively, and additionally one more database from doubling up the original size is generated. As we found the 1-thread run is efficient from the previous tests, all runs use only one thread. The final throughput is normalized by the database size.
The x-axis means the DB size. The “1” means the original database size which I used for the multithreading tests. The y-axis means the throughput ratio. The blastp shows 3.4~3.6 times increases in throughput when using only 1/16-sized database. It means that the smaller database size, the better throughput we can get from it. Interestingly, the usearch seems like not quite being affected by the database size even though the throughput is getting worse with larger database size. For rpsblast and blastn runs, the throughput keeps increasing until the original database size and it decreases when the database size is doubled. On Carver, the blastn shows the best throughput with the half size database. The hmmsearch results are not shown in this report because it shows large variability in throughout when it searches against smaller databases.
SUMMARY AND RECOMMENDATIONS
To summarize, we’ve found that the programs tested are not able to take advantage of the multithreading on Hopper and Carver. These results are interesting because it is expected that multithreading will improve performance by sharing most of its resources with the parent process. The programs cannot take advantage of multithreading because it needs to scan through large subject database. Accessing different portions of the database definitely dilutes the merit of multithreading. It is also closely related with the available physical memory space on a computation node. If the database is small (<500MB, for example), it might possible that when the whole database can be loaded into memory, the multithreaded performance could be maximized. Most JGI users perform computations that require very large database as subject; thus, we would recommend NOT using multithreading options for those programs. Additionally, for running BLAST programs, we would recommend that:
- Use evalue as strict as possible, for example, -evalue 1.0e-10.
- Control the number of results to show, for example, -max_target_seq 1, if only the best hit is enough (for tabular format output).
- Set the amount of alignment calculation as small as possible. The alignment calculation is very slow. Set “-num_alignments” smaller that 250 (for non-tabular output format).
- Use bigger word size. By default, the word size is 3 for proteins, 11 for nucleotides.
As for the optimal input and database sizes on Hopper and Carver, the recommended settings are shown in the following table (In the case of hmmsearch, we found large variability in the throughput when it searches against smaller databases).
S/W | Input size (bp) | Database size (MB) | Number of threads |
blastn | 20 ~ 40M | 500 ~ 1,000 | 1 |
blastp | any | < 560 (the smaller, the better) | 1 |
rpsblast | 300 ~ 600K | 300 ~ 630 | 1 |
usearch | 60 ~ 120M | < 841 (the smaller, the better) | 1 |
hmmsearch | 20 ~ 40M | - | 1 |