Lect Notes Comput Sci. Author manuscript; available in PMC 2008 May 13.
Published in final edited form as:
Lect Notes Comput Sci. 2008 May; 4983(LNBI): 122–133.
PMCID: PMC2380387
NIHMSID: NIHMS40326
Accelerating the neighbor-joining algorithm using the adaptive bucket data structure
Leonid Zaslavsky and Tatiana A. Tatusova
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Leonid Zaslavsky, Email: zaslavsk@ncbi.nlm.nih.gov
Tatiana A. Tatusova, Email: tatiana@ncbi.nlm.nih.gov
Abstract
The complexity of the neighbor-joining method is determined by the complexity of the search for an optimal pair ("neighbors to join") performed globally at each iteration. Accelerating the neighbor-joining method therefore requires a smarter search for an optimal pair of neighbors, one that avoids re-evaluating all possible pairs of points at each iteration.
We developed an acceleration technique for the neighbor-joining method that significantly decreases its complexity for important applications without any change in the results of the method. The technique utilizes the bucket data structure. The pairs of nodes are arranged in buckets according to the values of the goal function δ_ij = u_i + u_j − d_ij. Buckets are adaptively re-arranged after each neighbor-joining step. While the pairs of nodes in the top bucket are re-evaluated at every iteration, pairs in lower buckets are accessed more rarely, only when the algorithm determines that the elements of a bucket need to be re-evaluated based on new values of δ_ij. As a result, only a small portion of the candidate pairs of nodes is examined at each iteration.
The algorithm is cache efficient, since the bucket data structures are able to exploit locality and adjust to cache properties.
Keywords: neighbor-joining algorithm, bucket data structure, adaptive, cache-efficient
1 Introduction
The neighbor-joining algorithm [1], [2] is one of the most popular distance methods for the creation of phylogenetic trees. It is a greedy agglomerative algorithm that constructs a tree in steps [3]. The algorithm is based on the minimum-evolution criterion for phylogenetic trees. It is well-tested and studied theoretically, provides good results and is statistically consistent under many models of evolution [4], [5], [6], [7], [8]. Several algorithms have been developed as improvements to the classical neighbor-joining method [9], [10]. Since the neighbor-joining method is much more efficient than other algorithms of comparable quality, it is widely used for phylogenetic analysis as the tool of choice for preliminary analysis, with results being verified and refined by maximum likelihood and Bayesian methods [11].
However, the use of the neighbor-joining method within interactive exploratory analysis tools makes it desirable to further accelerate the algorithm for large datasets. This is especially true when bootstrap analysis is performed and multiple trees need to be calculated [12], [3]. Since the O(N³) complexity of the neighbor-joining method is determined by the number of operations in the search step performed globally at each iteration to find an optimal pair ("neighbors to join"), accelerating the method requires a smarter search methodology that avoids brute-force re-evaluation of all possible pairs of points. Our interest in accelerating the neighbor-joining method is motivated by our ongoing efforts to develop and improve NCBI interactive analysis web tools, such as the NCBI Influenza Virus Resource [13], [14], where the neighbor-joining method is the default tree method. The goal is to enable bootstrap analysis for meaningful datasets in a timeframe acceptable for interactive web tools. This paper describes an ongoing effort toward this goal.
Accelerating strategies for the neighbor-joining method have been proposed by several authors. The QuickJoin algorithm [15], [16] uses a quad-tree data structure to accelerate the search for the optimal value of the goal function, while still constructing the same tree as the original algorithm. The ClearCut algorithm [17], [18] implements the relaxed neighbor-joining approach: rather than searching for a globally optimal pair, it selects a locally optimal pair (i, j) at each step, such that δ_ij ≥ δ_ik and δ_ij ≥ δ_jk for all other nodes k, where δ_ij = u_i + u_j − d_ij. In the Fast Neighbor Joining method [19] the goal function is not optimized globally, but rather over a set called the "visible set". That algorithm is guaranteed to produce the same result as the neighbor-joining method for an additive input.
In this paper we pursue the same goal as [16]: to accelerate the search for the pair of nodes to be joined while constructing the same tree as the classical neighbor-joining algorithm. We arrange the candidate pairs of nodes in buckets according to the value of the NJ goal function (see Figure 1). When the number of nodes is large, the value of the neighbor-joining goal function δ_ij = u_i + u_j − d_ij changes only slightly with each iteration for each pair of nodes. As a result, only elements of the top bucket are re-evaluated at every iteration, while elements of lower buckets are accessed more rarely, when needed. Only a small portion of the candidate pairs of nodes is examined at each iteration: O(N) buckets are retained, eliminating the need to evaluate O(N²) pairs per iteration. Instead, we rearrange O(N) buckets without accessing their contents and evaluate only the pairs in the top bucket. The proposed bucket-based data structure allows cache-efficient manipulations [20], [21]. Note that bucket data structures have been used in the bucket-sort algorithm [22] and in shortest-path algorithms [23], [24], [25], [26], [27].
Fig. 1. A bucket data structure.
2 Methodology
Below we first describe the classical neighbor-joining algorithm and then show how to use the bucket data structure to perform an efficient search for a pair of nodes to be joined.
2.1 The neighbor-joining method
Classical NJ algorithm.
At each iteration m = 0, …, N − 2 of the classical neighbor-joining method [2], [3], average distances

u_i = \frac{1}{n-2} \sum_{k \neq i} d_{ik},  where n = N − m is the current number of nodes,   (1)

are calculated for each node. A global search is performed to find a pair of nodes (i*, j*) such that

\delta_{i^* j^*} = \max_{(i,j)} \delta_{ij},   (2)

where

\delta_{ij} = u_i + u_j - d_{ij}.   (3)

Nodes i* and j* are joined in a new node k*.
The branch lengths v_{i*} and v_{j*} are calculated as

v_{i^*} = \tfrac{1}{2} d_{i^* j^*} + \tfrac{1}{2} (u_{i^*} - u_{j^*})   (4)

v_{j^*} = \tfrac{1}{2} d_{i^* j^*} + \tfrac{1}{2} (u_{j^*} - u_{i^*})   (5)

and distances from node k* to the rest of the nodes are determined for each node p by the formula

d_{k^* p} = \tfrac{1}{2} (d_{i^* p} + d_{j^* p} - d_{i^* j^*}).   (6)
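For illustration, the following C++ sketch performs one iteration of formulas (1)–(6) over a dense distance matrix. It is a minimal sketch with names of our own choosing, not the implementation used in our tools.

    #include <cstddef>
    #include <vector>

    // One iteration of the classical neighbor-joining search, formulas (1)-(6).
    // d is the symmetric distance matrix over the n currently active nodes
    // (n > 3 is assumed).
    struct Join { std::size_t i, j; double vi, vj; };

    Join njIteration(std::vector<std::vector<double> >& d) {
        const std::size_t n = d.size();
        std::vector<double> u(n, 0.0);                         // (1)
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t k = 0; k < n; ++k) u[i] += d[i][k];
            u[i] /= double(n - 2);
        }
        std::size_t bi = 0, bj = 1;                            // (2)-(3)
        double best = u[0] + u[1] - d[0][1];
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = i + 1; j < n; ++j)
                if (u[i] + u[j] - d[i][j] > best) {
                    best = u[i] + u[j] - d[i][j]; bi = i; bj = j;
                }
        double vi = 0.5 * d[bi][bj] + 0.5 * (u[bi] - u[bj]);   // (4)
        double vj = 0.5 * d[bi][bj] + 0.5 * (u[bj] - u[bi]);   // (5)
        for (std::size_t p = 0; p < n; ++p) {                  // (6): row/column bi
            if (p == bi || p == bj) continue;                  // becomes node k*
            d[bi][p] = d[p][bi] = 0.5 * (d[bi][p] + d[bj][p] - d[bi][bj]);
        }
        return Join{bi, bj, vi, vj};                           // caller retires bj
    }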
Preserving non-negativity of branch lengths and distances
For the implementations used in our web analysis tools [13], we chose to keep branch lengths and distances non-negative. We modify formulas (4) and (5) as follows. Define
w^* = \max\left(0,\; \min\left(d_{i^* j^*},\; \tfrac{1}{2} d_{i^* j^*} + \tfrac{1}{2}(u_{i^*} - u_{j^*})\right)\right).   (7)

Analogues of equations (4) and (5) are:

v_{i^*} = w^*   (8)

v_{j^*} = d_{i^* j^*} - w^*.   (9)

A non-negative analogue of equation (6) is

d_{k^* p} = \max\left(0,\; \tfrac{1}{2}(d_{i^* p} + d_{j^* p} - d_{i^* j^*})\right).   (10)
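A minimal C++ sketch of the non-negative updates (8)–(10); the helper names are ours, not part of any library.

    #include <algorithm>

    // Non-negative branch lengths and distances, formulas (8)-(10):
    // the branch length is clamped into [0, d_ij] and the new distance at 0.
    inline double branchI(double dij, double ui, double uj) {        // (8)
        return std::max(0.0, std::min(dij, 0.5 * dij + 0.5 * (ui - uj)));
    }
    inline double branchJ(double dij, double ui, double uj) {        // (9)
        return dij - branchI(dij, ui, uj);
    }
    inline double newDistance(double dip, double djp, double dij) {  // (10)
        return std::max(0.0, 0.5 * (dip + djp - dij));
    }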
Recursive formulas
New average distances u_p^{(m+1)} can be calculated from the old ones u_p^{(m)} by the formula

u_p^{(m+1)} = \frac{(n-2)\, u_p^{(m)} - d_{i^* p} - d_{j^* p} + d_{k^* p}}{n-3},   (11)

or, using (10),

u_p^{(m+1)} = \frac{(n-2)\, u_p^{(m)} - d_{i^* p} - d_{j^* p} + \max\left(0,\; \tfrac{1}{2}(d_{i^* p} + d_{j^* p} - d_{i^* j^*})\right)}{n-3}.   (12)
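Formula (11) turns the update of all average distances into an O(n) operation per join instead of a full recomputation. A minimal sketch, with illustrative names; the distance rows of the joined and new nodes are passed in explicitly:

    #include <cstddef>
    #include <vector>

    // Recursive O(n) update of average distances, formula (11).
    // dIp[p] = d_{i*p}, dJp[p] = d_{j*p}, dKp[p] = d_{k*p};
    // n is the number of active nodes before the join.
    void updateAverages(std::vector<double>& u,
                        const std::vector<double>& dIp,
                        const std::vector<double>& dJp,
                        const std::vector<double>& dKp,
                        std::size_t n) {
        for (std::size_t p = 0; p < u.size(); ++p)
            u[p] = ((n - 2) * u[p] - dIp[p] - dJp[p] + dKp[p]) / double(n - 3);
    }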
2.2 Upper bound for change in the goal function value for a pair of points over a neighbor-joining step
Estimates for growth of average distances u_i

From (12), it is easy to obtain the following inequalities:

u_p^{(m+1)} - u_p^{(m)} \le \frac{1}{n-3}\, u_p^{(m)}   (13)

and

u_p^{(m+1)} - u_p^{(m)} \le \frac{1}{n-3}\, u_{\max}^{(m)},  where u_{\max}^{(m)} = \max_k u_k^{(m)}.   (14)
Estimates for growth of δ_ij

From (3), it is easy to see that

\delta_{pq}^{(m+1)} - \delta_{pq}^{(m)} = (u_p^{(m+1)} - u_p^{(m)}) + (u_q^{(m+1)} - u_q^{(m)}).

Using the inequalities (13) and (14), we get

\delta_{pq}^{(m+1)} - \delta_{pq}^{(m)} \le \frac{1}{n-3} (u_p^{(m)} + u_q^{(m)})

or

\delta_{pq}^{(m+1)} - \delta_{pq}^{(m)} \le \frac{2}{n-3}\, u_{\max}^{(m)}.

Finally, we obtain two growth estimates:

\delta_{pq}^{(m+1)} \le \delta_{pq}^{(m)} + \gamma_{pq}^{(m)}   (15)

and

\delta_{pq}^{(m+1)} \le \delta_{pq}^{(m)} + \gamma^{(m)},   (16)

where

\gamma_{pq}^{(m)} = \frac{1}{n-3} (u_p^{(m)} + u_q^{(m)})

and

\gamma^{(m)} = \frac{2}{n-3}\, u_{\max}^{(m)}.

The estimates (15) and (16) together can be written as:

\delta_{pq}^{(m+1)} \le \delta_{pq}^{(m)} + \tilde{\gamma}_{pq}^{(m)},   (17)

where

\tilde{\gamma}_{pq}^{(m)} = \min(\gamma_{pq}^{(m)}, \gamma^{(m)}).
Note. These estimates show that when the number of nodes is large, the value δ_pq^{(m+1)} can exceed δ_pq^{(m)} only slightly. For example, when n = N − m is about 10³, the increase is limited by approximately 2·10⁻³ of the maximal average distance.
2.3 Construction of buckets and operating them
The arrangement of pairs (i, j) in groups is performed according to the values of the neighbor-joining goal function δij defined by formula (3). Our purpose is to limit evaluation of the individual pairs only to those that were close to optimal in the previous iteration and may become optimal at the current step.
First, consider the treatment of pairs with zero or near-zero distances between nodes. Let us introduce a special bucket for pairs (i, j) such that

d_{ij} \le \varepsilon,

where ε is a small number (e.g., ε = 10⁻⁶). We join zero-distance and near-zero-distance pairs first, as we do in our current implementation of the classical neighbor-joining algorithm [13].
Let us now consider the regular pairs. Define the bucket intervals as follows:

B_k^{(m)} = \{ (i,j) : \lambda_{k+1}^{(m)} < \delta_{ij}^{(m)} \le \lambda_k^{(m)} \},  k = 0, 1, 2, …,   (18)

where

\lambda_0^{(m)} > \lambda_1^{(m)} > \lambda_2^{(m)} > ⋯
Our initial idea was to use intervals with a constant step Δ^{(m)}:

\lambda_k^{(m)} = \lambda_0^{(m)} - k\, \Delta^{(m)},  k = 0, 1, …, N − 1.   (19)
If the parameter Δ^{(m)} satisfies the condition

\Delta^{(m)} \ge \gamma^{(m)},   (20)

where λ_0^{(m)} is chosen so that λ_0^{(m)} ≥ δ_max^{(m)} = \max_{(i,j)} \delta_{ij}^{(m)}, only elements of the top bucket can have δ_ij^{(m)} equal to δ_max^{(m)}. All other buckets contain suboptimal elements. Moreover, because of the estimate (15), pairs that are currently in the buckets with k ≥ 2 remain suboptimal in the next iteration.
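Under the constant-step construction (19), locating the bucket for a given value of δ_ij takes a single arithmetic operation, which is what makes wholesale re-bucketing cheap. A minimal sketch (names are ours):

    #include <cmath>
    #include <cstddef>

    // Bucket index under the constant-step intervals (19):
    // bucket k holds values in (lambda0 - (k+1)*step, lambda0 - k*step].
    inline std::size_t bucketIndex(double delta, double lambda0, double step,
                                   std::size_t nBuckets) {
        double k = std::floor((lambda0 - delta) / step);
        if (k < 0.0) k = 0.0;                                    // at or above the top anchor
        if (k > double(nBuckets - 1)) k = double(nBuckets - 1);  // clamp to lowest bucket
        return std::size_t(k);
    }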
At each neighbor-joining iteration, a new collection of N buckets is constructed according to (19). New pairs appearing at the current iteration are placed in the buckets accordingly. The contents of most of the existing buckets are placed into new buckets without being evaluated. First, for each old bucket, the new bucket index k_new is determined by the formula

k_{new} = \left\lfloor \frac{\lambda_0^{(m+1)} - \lambda_k^{(m)}}{\Delta^{(m+1)}} \right\rfloor.   (21)

If k_new ≥ 1, the content of the kth old bucket (k ≥ 1) is placed into the k_new-th new bucket without any evaluation. However, if according to formula (21) k_new ≤ 0, then each pair of nodes in the kth old bucket is evaluated and placed into an appropriate new bucket. The pairs of nodes in the top bucket are always evaluated to find the optimal pair. In the process of evaluation, a pair can be removed from the top bucket for two reasons:
  • if the pair refers to a node that has already been eliminated, the pair is removed;
  • if the pair has a low value of δ_ij, it is moved from the top bucket to the corresponding lower bucket.
The values λ_0^{(m)} and Δ^{(m)} selected in each iteration should satisfy (20). In the initial step, for m = 0, the value λ_0^{(0)} = δ_max^{(0)} is used, since the maximal value is computed when all pairs of nodes are placed in buckets. In subsequent steps, the maximal δ_ij in the top bucket is taken as λ_0^{(m)}.
However, this simple construction (19) would not be efficient if the values δ_ij are distributed non-homogeneously: many intervals may be empty, while others are overpopulated (in an extreme case, one outlier pair may stay at the top while all others fall into the bottom interval). In order to overcome this problem, we propose an adaptive construction of the buckets.
In the adaptive construction, the intervals for the initial step (m = 0) are defined as follows:

\lambda_{k+1}^{(0)} = \max \{ \delta_{ij}^{(0)} : \delta_{ij}^{(0)} \le \lambda_k^{(0)} - \Delta^{(0)} \}   (22)

B_k^{(0)} = \{ (i,j) : \lambda_{k+1}^{(0)} < \delta_{ij}^{(0)} \le \lambda_k^{(0)} \}   (23)

for k = 0, 1, 2, …, N − 2. If there are no values δ_ij^{(0)} below λ_k^{(0)} − Δ^{(0)}, we stop adding intervals. The values λ_0^{(m)} are defined in the same way as before: λ_0^{(0)} = δ_max^{(0)}, and subsequent values λ_0^{(m)} are set to the maximal δ_ij in the top bucket. Since each threshold λ_{k+1}^{(0)} is attained by some pair, the formulas (22)–(23) guarantee that no bucket at the initial step (m = 0) is empty. Otherwise, the construction has the same useful properties as (19).
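A minimal sketch of the initial adaptive construction (22)–(23): the δ values are sorted once, and each next threshold is the largest value lying at least Δ^{(0)} below the previous one, so every interval is non-empty. The function name and signature are illustrative:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Adaptive thresholds lambda_0 > lambda_1 > ... for the initial step,
    // per (22)-(23). Each threshold is attained by an actual delta value,
    // so every bucket interval (lambda_{k+1}, lambda_k] is non-empty.
    std::vector<double> adaptiveThresholds(std::vector<double> deltas,
                                           double step, std::size_t maxBuckets) {
        std::sort(deltas.begin(), deltas.end(), std::greater<double>());
        std::vector<double> lambda(1, deltas.at(0));   // lambda_0 = maximal delta
        std::size_t pos = 0;
        while (lambda.size() <= maxBuckets) {
            // first delta lying at least `step` below the current threshold
            while (pos < deltas.size() && deltas[pos] > lambda.back() - step)
                ++pos;
            if (pos == deltas.size()) break;           // nothing further below: stop
            lambda.push_back(deltas[pos]);
        }
        return lambda;   // bucket k = (lambda[k+1], lambda[k]]
    }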
In the subsequent iterations (m = 1, 2, 3, …) the buckets are operated as follows (a compact sketch of the refresh-and-merge pass is given after this list):
  • The maximal δ_ij in the top bucket is taken as λ_0^{(m)}. The elements of the top bucket are evaluated as described above.
  • The remaining buckets are moved toward the top by setting k_new = k − 1.
  • If λ_k > λ_0^{(m)} − Δ^{(m)}, the value of λ_k is refreshed by resetting it to the maximal δ_ij in the bucket.
  • If the refreshed value still satisfies λ_k > λ_0^{(m)} − Δ^{(m)}, the bucket is merged with the top bucket.
  • New pairs (k*, p) are considered in descending order and are first attempted to be placed in an existing bucket. If

    \min_{k :\, \lambda_k \ge \delta_{k^* p}} \left( \lambda_k - \delta_{k^* p} \right) \le \Delta^{(m)},   (24)

    then the pair (k*, p) is placed in the bucket providing the minimum value. Otherwise, a new bucket is created. The same procedure is applied to the pairs moved from the top bucket to lower buckets.
  • If the number of buckets exceeds N, the top N buckets are kept and the remaining buckets are merged into the lowest bucket. Empty buckets, if any, are also removed.
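The refresh-and-merge pass from the list above can be sketched compactly as follows; the Pair and Bucket types and the delta callback are illustrative stand-ins, not our actual implementation:

    #include <algorithm>
    #include <cstddef>
    #include <list>
    #include <vector>

    struct Pair { std::size_t i, j; double dij; };           // record (i, j, d_ij)
    struct Bucket { double upper; std::list<Pair> pairs; };  // upper bound lambda_k

    // delta(p) recomputes the current goal value of a record from the
    // up-to-date average distances.
    template <class DeltaFn>
    void refreshAndMerge(std::vector<Bucket>& buckets, double step, DeltaFn delta) {
        if (buckets.empty()) return;
        const double lambda0 = buckets[0].upper;   // maximal delta of the top bucket
        for (std::size_t k = 1; k < buckets.size(); ++k) {
            Bucket& b = buckets[k];
            if (b.upper <= lambda0 - step || b.pairs.empty()) continue;
            double m = delta(b.pairs.front());     // refresh: true maximal delta
            for (const Pair& p : b.pairs) m = std::max(m, delta(p));
            b.upper = m;
            if (b.upper > lambda0 - step)          // still near the top: O(1) merge
                buckets[0].pairs.splice(buckets[0].pairs.end(), b.pairs);
        }
    }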
Data structures
Below we briefly describe data structures for a record, a bucket, and a bucket collection.
Record
For a pair of nodes i and j (i < j), we keep a record consisting of the two indices and the distance between the nodes: R = (i, j, d_ij). There is no reason to store the actual value of the goal function δ_ij, since it changes at each step of the algorithm and cannot be reused. However, keeping the d_ij value in the record allows us to avoid gathering these values from a large two-dimensional array and makes the algorithm more cache-friendly [20], [21].
Bucket
Each bucket contains a linked list of records. In our initial implementation, we use the C++ STL list class. A standard constant-time splice algorithm [28] is used to combine linked lists. Records referring to nodes that have already been eliminated are erased using the constant-time C++ STL list erase() function. Special memory allocation and reallocation can be used to provide cache-efficient placement of bucket elements [29].
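For example, merging one bucket's record list into another via splice() relinks the list nodes in constant time:

    #include <list>

    struct Record { int i, j; double dij; };   // the record described above

    // Constant-time merge: splice() relinks the nodes of `src` onto the
    // end of `dst`, copying nothing and leaving `src` empty.
    void mergeInto(std::list<Record>& dst, std::list<Record>& src) {
        dst.splice(dst.end(), src);
    }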
Bucket Collection
Each bucket collection contains two arrays: the first contains real numbers in decreasing order and describes the bucket intervals, while the second contains pointers to the buckets. In our initial implementation, we use the C++ STL vector class for these arrays. As described above, N buckets are allocated at each neighbor-joining step.
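A minimal rendering of the collection (illustrative):

    #include <list>
    #include <vector>

    struct Record { int i, j; double dij; };

    // Two parallel arrays: decreasing interval bounds and pointers to the
    // buckets; bucket k holds goal values in (bounds[k+1], bounds[k]].
    struct BucketCollection {
        std::vector<double> bounds;
        std::vector<std::list<Record>*> buckets;
    };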
In addition to the bucket-based data structures, we use arrays implemented as C++ STL vector objects for the average distances u_i and for the status of each node (active/inactive), which changes when a node is eliminated. Since calculating new distances by formula (6) requires direct access to d_ij, we keep this two-dimensional matrix as vector< vector<double> > in our initial implementation.
3 Test results
To evaluate the algorithm, we used the following four data sets containing full-length Influenza A H3N2 hemagglutinin protein coding sequences obtained from the NCBI Influenza Virus Resource [13]:
  • Dataset containing 44 sequences from 1968–1972;
  • Dataset containing 252 sequences from 1971–1996;
  • Dataset containing 504 sequences from 1972–2000;
  • Dataset containing 1000 sequences from 1972–2005.
Figure 2 illustrates the performance of the algorithm for these four datasets. The total number of pair evaluations stays approximately proportional to N², where N is the number of sequences in the dataset: the proportionality constant is c ≈ 3 for N = 252, N = 504 and N = 1000, while c ≈ 7.5 for N = 44.
Fig. 2. Number of pair evaluations per cycle divided by N².
Figure 3 shows the initial distribution of δ(i, j) = u(i) + u(j) − d(i, j) for non-identical pairs of sequences in the largest dataset (N = 1000). The maximal value is 0.441. Only 100 pairs of sequences out of 499,500 (0.02%) have values of δ(i, j) greater than 0.3, and only about 1,000 pairs of sequences (0.2%) have values of δ(i, j) above 0.1.
Fig. 3. Values of u(i) + u(j) − d(i, j) for pairs of sequences, N = 1000.
Figure 4 shows the number of pair evaluations in each iteration of the algorithm. The total number of pair evaluations accumulates from a very small number of evaluations in the majority of steps and a large number of evaluations on rare occasions.
Fig. 4. Number of pair evaluations in each iteration.
Figure 5 shows the execution times for the tests on two NCBI Linux computers: gizmo2, with an Intel Xeon E5345 2.33 GHz processor and a 4 MB cache, and sutils0, with an Intel Xeon 3.20 GHz processor and a 512 KB cache. The corresponding execution times divided by N², where N is the number of sequences in the test, are shown in the table below. Note that while sutils0 has a faster processor, gizmo2 consistently demonstrates 1.7–1.8 times faster execution, probably because its cache is much larger. We believe that the speed of memory access, rather than processor speed, is critical for execution speed, and that the code will greatly benefit from tuning aimed at minimizing cache misses. Comparing our normalized times of 1.33·10⁻⁶ and 2.34·10⁻⁶ seconds to the 5.64·10⁻⁶ seconds reported for QuickJoin (based on Table 1 in [15]) makes us very optimistic about the performance of an optimized version of the code.
Fig. 5. Execution times.
Table 1. Normalized execution times.
4 Discussion
Accelerating the neighbor-joining method is important for enhancing the performance of online web analysis tools, where users expect to perform initial exploratory analysis of their datasets in real time and to perform bootstrapping as fast as possible. In this paper we present an adaptive bucket algorithm that significantly reduces the number of evaluations in the search steps by distributing candidate pairs in buckets and evaluating only a small portion of all pairs in each iteration. The proposed construction avoids empty buckets and allows the algorithm to handle values of the neighbor-joining goal function that are distributed non-homogeneously, including cases when outliers are present. The algorithm uses simple data structures that can be further optimized, including their cache-efficiency. Our preliminary test results are shown above; we continue optimizing the code and plan to perform comprehensive comparisons. While the proposed algorithm is designed to produce the same results as classical neighbor-joining, the degree of acceleration it provides is determined by the distribution of the values of the neighbor-joining goal function, which in turn depends on the structure of the dataset.
5 Acknowledgements
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
The authors are thankful to David J. Lipman, Alejandro Schaffer, Stacy Ciufo, Vahan Grigoryan and Yuri Kapustin for productive discussions.
References
1.
Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987;4:406–425. [PubMed]
2.
Studier JA, Keppler KJ. A note on the neighbor-joining algorithm of Saitou and Nei. Molecular Biology and Evolution. 1988;5:729–731. [PubMed]
3.
Felsenstein J. Inferring Phylogenies. Cambridge University Press; 2003.
4.
Atteson K. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica. 1999;25:251–278.
5.
Tamura K, Nei M, Kumar S. Prospects for inferring very large phylogenies by using the neighbor-joining method. PNAS. 2004;101:11030–11035. [PubMed]
6.
Bryant D. On the uniqueness of the selection criterion in neighbor-joining. Journal of Classification. 2005;22
7.
Desper R, Gascuel O. The minimum-evolution distance-based approach to phylogenetic inference. In: Gascuel O, editor. Mathematics of evolution and phylogeny. Oxford University Press; 2005. pp. 1–32.
8.
Gascuel O, Steel M. Neighbor-joining revealed. Molecular Biology and Evolution. 2006;23:1997–2000. [PubMed]
9.
Gascuel O. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 1997;14:685–695. [PubMed]
10.
Bruno WJ, Socci N, Halpern AL. Weighted neighbor-joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 2000;17:189–197. [PubMed]
11.
Yang Z. Computational Molecular Evolution. Oxford University Press; 2006.
12.
Bryant D. A classification of consensus methods for phylogenies. In: Janowitz M, Lapointe FJ, McMorris F, Mirkin B, Roberts F, editors. BioConsensus. DIMACS, American Mathematical Society; 2003. pp. 163–184.
13.
Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, Tatusova T, Ostell J, Lipman D. The Influenza Virus Resource at the National Center for Biotechnology Information. Journal of Virology. 2008;82:596–601. [PubMed]
14.
Zaslavsky L, Bao Y, Tatusova TA. An adaptive resolution tree visualization of large influenza virus sequence datasets. In: Mandoiu I, Zelikovsky A, editors. Bioinformatics Research and Applications, Proc. of ISBRA 2007. Volume LNBI 4463 of Lecture Notes in Bioinformatics. Springer-Verlag; 2007. pp. 192–202.
15.
Mailund T, Pedersen CN. Quickjoin – fast neighbor-joining tree reconstruction. Bioinformatics. 2004;20:3261–3262. [PubMed]
16.
Mailund T, Brodal GS, Fagerberg R, Pedersen CNS, Phillips D. Recrafting the neighbor-joining method. BMC Bioinformatics. 2006;7(29)
17.
Shenerman L, Evans J, Foster JA. Clearcut: fast implementation of relaxed neighbor joining. Bioinformatics. 2006;22:2823–2824. [PubMed]
18.
Evans J, Shenerman L, Foster J. Relaxed Neighbor-Joining: A Fast Distance-Based Phylogenetic Tree Construction Method. J. Mol. Evol. 2006;62:785–792. [PubMed]
19.
Elias I, Lagergren J. Fast neighbor joining. In: Caires L, et al., editors. ICALP. Volume 3580 of Lect. Notes Comp. Sci. Springer-Verlag; 2005. pp. 1263–1274.
20.
LaMarca A, Ladner RE. The influence of caches on the performance of sorting. Journal of Algorithms. 1999;31:66–104.
21.
Brodal GS, Fagerberg R, Vinther K. Engineering a cache-oblivious sorting algorithm. Journal of Experimental Algorithmics. 2007;12:2.1.
22.
Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms. Second Edition. MIT Press and McGraw-Hill; 2001.
23.
Dial RB. Algorithm 360: Shortest path forest with topological ordering. Comm. ACM. 1969;12:632–633.
24.
Wagner RA. A shortest path algorithm for edge-sparse graphs. J. Assoc. Comput. Mach. 1976;23:50–57.
25.
Dinic EA. Economical algorithms for finding shortest paths in a network. In: Popkov YS, Shmulyan BL, editors. Transportation Modeling Systems. The Institute for System Studies; 1978. pp. 36–44. (In Russian).
26.
Denardo EV, Fox BL. Shortest-route methods: 1. reaching, pruning, and buckets. Oper. Res. 1979;27:161–186.
27.
Cherkassky BV, Goldberg AV, Silverstein C. Buckets, heaps, lists, and monotone priority queues. SIAM Journal on Computing. 1999;28:1326–1346.
28.
Musser DR, Derge GJ, Saini A. STL Tutorial and Reference Guide: C++ Programming with the Standard Template Library. 2nd edn. Addison-Wesley Professional; 2001.
29.
Meyers S. Effective STL. Addison-Wesley Professional; 2001.