
Benchmarking Software on Hopper and Carver

 

 

PURPOSE
 

  • Test the performance impact of multithreading with representative public domain software, including blastn, blastp, rpsblast, hmmsearch, and usearch.
  • Run on Hopper (24 cores/node) and Carver (8 cores/node) with different combinations of the number of tasks and threads.
  • Provide a useful set of parameters to maximize throughput.

 

PROGRAMS TESTED
 

  • BLAST+ programs (blastn, blastp, rpsblast) version 2.2.26
  • usearch version 5.2.32
  • hmmsearch version 3.0

 

DATASETS
 

RESULTS I - Multithreading performance with different input sizes


The figures below show the performance behavior of each program under different multithreading settings on Hopper and Carver. The y-axis, labeled “B-rate”, represents throughput, i.e., the amount of input processed per unit time, and is calculated by the following equation:

((input size per task) * (number of tasks)) / (total runtime)
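
For example, with hypothetical numbers, 4 tasks each given 5,000,000 bp of input and finishing together in 250 seconds would give a B-rate of (5,000,000 * 4) / 250 = 80,000 bp/s.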

The x-axis shows an increasing number of input sequences. For the runs on Hopper, we used 24 cores of the allocated node and varied the combination of tasks and threads; for Carver, we used 8 cores. A legend entry of (p x t) means p tasks with t threads each. For example, “4 tasks x 2 threads” means the program is forked into four processes and each process runs two threads.
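
For reference, a “4 tasks x 6 threads” run of blastn on a single Hopper node could be launched along the following lines. This is only a minimal sketch, not the exact scripts used for these benchmarks: the query chunk file names are hypothetical, and it assumes the query has been pre-split into one FASTA file per task.

#PBS -q regular
#PBS -l mppwidth=24
#PBS -l walltime=02:00:00

cd $PBS_O_WORKDIR

# Wrapper so each task works on its own query chunk; ALPS_APP_PE is the
# task rank that aprun assigns to each process on the Cray XE6.
cat > run_chunk.sh << 'EOF'
#!/bin/bash
blastn -num_threads 6 -db nt -query query.${ALPS_APP_PE}.fa -out out.${ALPS_APP_PE}.blst
EOF
chmod +x run_chunk.sh

# -n: number of tasks, -d: threads (depth) per task; 4 tasks x 6 threads = 24 cores
aprun -n 4 -d 6 ./run_chunk.sh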

 

Fig. 1.  Performance of blastn on Hopper

 

Fig. 2. Performance of blastn on Carver

Figure 1 and Figure 2 show how blastn performs in terms of the b-rate metric on Hopper and Carver. Interestingly, the b-rate degrades when more threads are used, while increasing the query size increases the b-rate. However, the rates saturate around 1,600 sequences and then start decreasing on both Hopper and Carver. First of all, multithreading does not improve the throughput: in all experiments, runs with one thread show the best performance. The BLAST program consists of three steps: (i) scan for matches between fixed-size words, (ii) extend each matching word into an ungapped alignment on the condition that there is another word match nearby, and (iii) perform gapped alignment for those matches that passed the second step. Note that multithreading is only implemented for the word-matching step of blastn. Constructing separate hash tables for each thread to do the word matching and then merging those tables for the next step (i.e., the ungapped alignment) has a large overhead and explains this trend.

The saturation points in the curves are explained by the physical memory limitations on the compute nodes. For reference, Hopper provides 32GB of main memory per node (1.33GB per core) and Carver has 24GB per node in the "smallmem" configuration (3GB per core). It is clearly shown that the multithreaded runs of blastn have more overhead than single-threaded runs. Also, the matching step scans through the entire subject database. If each thread needs to access a different part of the database, that also undermines the memory-sharing advantage of multithreading, because different parts of the database must be loaded.

The steep increase of the throughput with smaller query sizes can be explained in two ways. First, blastn is quite I/O intensive; as long as memory is available, it can load the data into memory and avoid writing to disk. Second, blastn handles short query sequences effectively: it concatenates a set of input sequences into one long sequence to maximize performance. Below the saturation point, a smaller number of queries can take advantage of this feature. For example, we did a quick comparison between BLAST and BLAST+ using nt and a short query sequence:

$ time blastall -p blastn -i gene.fa -d nt -o out.blst 
32.19s user 3.01s system 99% cpu 35.469 total

$ time blastn -db nt -query gene.fa -out out.blstp
11.61s user 2.22s system 99% cpu 13.846 total

BLAST+ is noticeably faster for a single sequence against the nt database. It has also been reported that significant speed-ups are possible when querying with long sequences. For the next experiment a whole genome sequence is used as a query, and BLAST+ shows roughly a 30x speedup in CPU time over the traditional BLAST implementation.

$ time blastall -a 4 -p blastn -i NC_011353.fna -d nt -o out.blst 
46115.20s user 24.96s system 388% cpu 3:17:58.91 total

$ time blastn -num_threads 4 -db nt -query NC_011353.fna -out out.blstp
1462.23s user 7.98s system 233% cpu 10:29.37 total

 

Fig. 3. Performance of blastp on Hopper

Fig. 4. Performance of blastp on Carver

Figure 3 and Figure 4 demonstrate the behavior of blastp. Again, enabling multithreading does not appear to be useful in terms of the b-rate. Furthermore, the query size is not correlated with the rate either. This may be due to the compute-intensive nature of blastp, in which a very large number of word candidates must be scanned against the database and recorded in the hash table. So we see the opposite behavior from the blastn case: loading data into memory, which significantly decreased the I/O overhead for blastn, does not improve the performance of blastp.

Fig. 5. Performance of rpsblast on Hopper

Fig. 6. Performance of rpsblast on Carver


Figure 5 and Figure 6 show the performance of rpsblast. The trends are quite similar to the blastp case, except for the run with the smallest query input. That can be explained by the fact that loading the PSSMs takes a larger portion of the run than computing similarities of the input sequences against the database.

 

Fig. 7. Performance of hmmsearch on Hopper

 

Fig. 8. Performance of hmmsearch on Carver

Figure 7 and Figure 8 show the performance of hmmsearch. Like the others, we’ve seen that multithreading increases overhead and does not improve throughput. We've found the saturation point is around 51,200 sequences.

Fig. 9. Performance of usearch on Hopper

Fig. 10. Performance of usearch on Carver

Figure 9 and Figure 10 show the behavior of usearch on Hopper and Carver. The throughputs in both plots stop increasing around an input size of 819,200 sequences.

 

RESULTS II - Performance with different database sizes


Now that the optimal input size and the number of threads have been determined, the throughput of each program with varying database size is evaluated. The following table shows the input size in base pairs and the original database size used in the above experiments.

 

S/W         Input size (bp)   Database size (MB)
blastn      20M               1,018
blastp      300K              560
rpsblast    300K              630
usearch     60M               841

To vary the database size, each original database was reduced to 1/16, 1/8, 1/4, and 1/2 of its size, and one additional database was generated by doubling the original size. Since the previous tests showed single-threaded runs to be the most efficient, all runs use only one thread. The final throughput is normalized by the database size.
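
As an illustration, one way to build a fractional-size nucleotide database is to keep every Nth sequence of the FASTA source and re-format it with makeblastdb. The sketch below (with hypothetical file names, and not necessarily the exact procedure used here) produces a roughly 1/16-sized database:

# keep every 16th sequence of the FASTA source, then rebuild the BLAST database
$ awk -v keep=16 '/^>/ {n++} n % keep == 1' nt.fa > nt_1_16.fa
$ makeblastdb -in nt_1_16.fa -dbtype nucl -title nt_1_16 -out nt_1_16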

Fig. 11. Performance with varying database size on Hopper

Fig. 12. Performance with varying database size on Carver


The x-axis shows the database size; “1” means the original database size used for the multithreading tests. The y-axis shows the throughput ratio. blastp shows a 3.4 ~ 3.6x increase in throughput when using the 1/16-sized database; in other words, the smaller the database, the better the throughput. Interestingly, usearch does not appear to be strongly affected by the database size, although its throughput degrades as the database grows. For the rpsblast and blastn runs, the throughput keeps increasing up to the original database size and decreases when the database size is doubled. On Carver, blastn shows the best throughput with the half-sized database. The hmmsearch results are not shown in this report because it exhibits large variability in throughput when searching against smaller databases.

 

SUMMARY AND RECOMMENDATIONS


To summarize, we found that the programs tested are not able to take advantage of multithreading on Hopper and Carver. These results are interesting because multithreading is expected to improve performance, since each thread shares most of its resources with the parent process. The programs cannot take advantage of multithreading because they need to scan through a large subject database; accessing different portions of the database dilutes the merit of multithreading. It is also closely related to the available physical memory on a compute node. If the database is small (<500MB, for example), it might be possible that, when the whole database can be loaded into memory, multithreaded performance could be maximized. Most JGI users perform computations that use a very large database as the subject; thus, we recommend NOT using the multithreading options for these programs. Additionally, for running the BLAST programs, we recommend the following (an example command combining these options is sketched after the list):

  • Use an E-value cutoff as strict as possible, for example, -evalue 1.0e-10.
  • Control the number of results to report, for example, -max_target_seqs 1, if only the best hit is needed (for tabular output format).
  • Keep the amount of alignment calculation as small as possible; the alignment calculation is very slow. Set -num_alignments to a value smaller than 250 (for non-tabular output formats).
  • Use a bigger word size. By default, the word size is 3 for proteins and 11 for nucleotides.
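
For example, a blastn run that follows these recommendations might look like the following (a sketch only; the database and file names are placeholders):

$ blastn -num_threads 1 -db nt -query input.fa -out out.blst \
    -evalue 1.0e-10 -word_size 16 -outfmt 6 -max_target_seqs 1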


As for the optimal input and database sizes on Hopper and Carver, the recommended settings are shown in the following table (in the case of hmmsearch, we found large variability in the throughput when it searches against smaller databases, so no database size is recommended).

S/W         Input size (bp)   Database size (MB)                 Number of threads
blastn      20 ~ 40M          500 ~ 1,000                        1
blastp      any               < 560 (the smaller, the better)    1
rpsblast    300 ~ 600K        300 ~ 630                          1
usearch     60 ~ 120M         < 841 (the smaller, the better)    1
hmmsearch   20 ~ 40M          -                                  1