Lect Notes Comput Sci. Author manuscript; available in PMC 2008 May 13.
Published in final edited form as:
Lect Notes Comput Sci. 2008 May; 4983(LNBI): 122–133.
PMCID: PMC2380387
NIHMSID: NIHMS40326
Accelerating the neighbor-joining algorithm using the adaptive bucket data structure
Leonid Zaslavsky and Tatiana A. Tatusova
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Leonid Zaslavsky, Email: zaslavsk@ncbi.nlm.nih.gov
Tatiana A. Tatusova, Email: tatiana@ncbi.nlm.nih.gov
Abstract
The complexity of the neighbor-joining method is determined by the complexity of the search for an optimal pair ("neighbors to join") performed globally at each iteration. Accelerating the neighbor-joining method therefore requires a smarter search for an optimal pair of neighbors, one that avoids re-evaluating all possible pairs of points at each iteration.
We developed an acceleration technique for the neighbor-joining method that significantly decreases its complexity for important applications without any change in the results of the method. The technique utilizes the bucket data structure. The pairs of nodes are arranged in buckets according to the values of the goal function δ_ij = u_i + u_j − d_ij. Buckets are adaptively re-arranged after each neighbor-joining step. While the pairs of nodes in the top bucket are re-evaluated at every iteration, pairs in lower buckets are accessed more rarely, only when the algorithm determines that the elements of a bucket need to be re-evaluated based on new values of δ_ij. As a result, only a small portion of the candidate pairs of nodes is examined at each iteration.
The algorithm is cache efficient, since the bucket data structures are able to exploit locality and adjust to cache properties.
Keywords: neighbor-joining algorithm, bucket data structure, adaptive, cache-efficient
1 Introduction
The neighbor-joining algorithm [1], [2] is one of the most popular distance methods for the creation of phylogenetic trees. It is a greedy agglomerative algorithm that constructs a tree in steps [3]. The algorithm is based on the minimum-evolution criterion for phylogenetic trees. It is well-tested and studied theoretically, provides good results and is statistically consistent under many models of evolution [4], [5], [6], [7], [8]. Several algorithms have been developed as improvements to the classical neighbor-joining method [9], [10]. Since the neighbor-joining method is much more efficient than other algorithms of comparable quality, it is widely used for phylogenetic analysis as the tool of choice for preliminary analysis, with results being verified and refined by maximum likelihood and Bayesian methods [11].
However, the use of the neighbor-joining method within interactive exploratory analysis tools makes it desirable to further accelerate the algorithm for large datasets. This is especially true when bootstrap analysis is performed and multiple trees need to be calculated [12], [3]. Since the O(N³) complexity of the neighbor-joining method is determined by the number of operations in the search step performed globally at each iteration to find an optimal pair ("neighbors to join"), accelerating the method requires a smarter search methodology that avoids brute-force re-evaluation of all possible pairs of points. Our interest in accelerating the neighbor-joining method is motivated by our ongoing efforts to develop and improve NCBI interactive analysis web tools, such as the NCBI Influenza Virus Resource [13], [14], where the neighbor-joining method is the default tree method. The goal is to enable bootstrap analysis for meaningful datasets in a timeframe acceptable for interactive web tools. This paper describes an ongoing effort toward this goal.
Accelerating strategies for the neighbor-joining method have been proposed by several authors. The QuickJoin algorithm [15], [16] uses a quad-tree data structure to accelerate the search for the optimal value of the goal function, while still constructing the same tree as the original algorithm. The ClearCut algorithm [17], [18] implements the relaxed neighbor-joining approach: rather than searching for a globally optimal pair, it selects a locally optimal pair (i, j) at each step, such that δ_ij ≥ δ_ik and δ_ij ≥ δ_jk for all other nodes k, where δ_ij = u_i + u_j − d_ij. In the Fast Neighbor Joining method [19] the goal function is not optimized globally, but rather over a set called the "visible set". That algorithm is guaranteed to produce the same result as the neighbor-joining method for an additive input.
In this paper we pursue the same goal as [16]: to accelerate the search for the pair of nodes to be joined while constructing the same tree as the classical neighbor-joining algorithm. We arrange the candidate pairs of nodes in buckets according to the value of the NJ goal function (see Figure 1). When the number of nodes is large, the value of the neighbor-joining goal function δ_ij = u_i + u_j − d_ij changes only slightly with each iteration for each pair of nodes. As a result, only elements of the top bucket are re-evaluated at every iteration, while elements of lower buckets are accessed more rarely, when needed. Only a small portion of the candidate pairs of nodes is examined at each iteration: O(N) buckets are retained, eliminating the need to evaluate O(N²) pairs per iteration. Instead, we rearrange O(N) buckets without accessing their contents and evaluate only the pairs in the top bucket. The proposed bucket-based data structure allows cache-efficient manipulations [20], [21]. Note that bucket data structures have been used in the bucket-sort algorithm [22] and in shortest-path algorithms [23], [24], [25], [26], [27].
Fig. 1. A bucket data structure.
2 Methodology
Below we first describe the classical neighbor-joining algorithm and then show how to use the bucket data structure to perform an efficient search for a pair of nodes to be joined.
2.1 The neighbor-joining method
Classical NJ algorithm.
At each iteration m = 0, …, N − 2 of the classical neighbor-joining method [2], [3], average distances

u_i = \frac{1}{n-2} \sum_{k \neq i} d_{ik},  where n = N − m is the current number of nodes,   (1)

are calculated for each node. A global search is performed to find a pair of nodes (i*, j*) such that

\delta_{i^* j^*} = \max_{(i,j)} \delta_{ij},   (2)

where

\delta_{ij} = u_i + u_j - d_{ij}.   (3)

Nodes i* and j* are joined in a new node k*.
The branch lengths v_{i*} and v_{j*} are calculated as

v_{i^*} = \tfrac{1}{2} d_{i^* j^*} + \tfrac{1}{2} (u_{i^*} - u_{j^*})   (4)

v_{j^*} = \tfrac{1}{2} d_{i^* j^*} + \tfrac{1}{2} (u_{j^*} - u_{i^*})   (5)

and distances from node k* to the rest of the nodes are determined for each node p by the formula

d_{k^* p} = \tfrac{1}{2} (d_{i^* p} + d_{j^* p} - d_{i^* j^*}).   (6)
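For illustration, the following C++ sketch performs one iteration of formulas (1)–(6) over a dense distance matrix. It is a minimal sketch with names of our own choosing, not the implementation used in our tools.

    #include <cstddef>
    #include <vector>

    // One iteration of the classical neighbor-joining search, formulas (1)-(6).
    // d is the symmetric distance matrix over the n currently active nodes
    // (n > 3 is assumed).
    struct Join { std::size_t i, j; double vi, vj; };

    Join njIteration(std::vector<std::vector<double> >& d) {
        const std::size_t n = d.size();
        std::vector<double> u(n, 0.0);                         // (1)
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t k = 0; k < n; ++k) u[i] += d[i][k];
            u[i] /= double(n - 2);
        }
        std::size_t bi = 0, bj = 1;                            // (2)-(3)
        double best = u[0] + u[1] - d[0][1];
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = i + 1; j < n; ++j)
                if (u[i] + u[j] - d[i][j] > best) {
                    best = u[i] + u[j] - d[i][j]; bi = i; bj = j;
                }
        double vi = 0.5 * d[bi][bj] + 0.5 * (u[bi] - u[bj]);   // (4)
        double vj = 0.5 * d[bi][bj] + 0.5 * (u[bj] - u[bi]);   // (5)
        for (std::size_t p = 0; p < n; ++p) {                  // (6): row/column bi
            if (p == bi || p == bj) continue;                  // becomes node k*
            d[bi][p] = d[p][bi] = 0.5 * (d[bi][p] + d[bj][p] - d[bi][bj]);
        }
        return Join{bi, bj, vi, vj};                           // caller retires bj
    }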
Preserving non-negativity of branch lengths and distances
For the implementations used in our web analysis tools [13], we chose to keep branch lengths and distances non-negative. We modify formulas (4) and (5) as follows. Define
w^* = \max\left(0,\; \min\left(d_{i^* j^*},\; \tfrac{1}{2} d_{i^* j^*} + \tfrac{1}{2}(u_{i^*} - u_{j^*})\right)\right).   (7)

Analogues of equations (4) and (5) are:

v_{i^*} = w^*   (8)

v_{j^*} = d_{i^* j^*} - w^*.   (9)

A non-negative analogue of equation (6) is

d_{k^* p} = \max\left(0,\; \tfrac{1}{2}(d_{i^* p} + d_{j^* p} - d_{i^* j^*})\right).   (10)
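A minimal C++ sketch of the non-negative updates (8)–(10); the helper names are ours, not part of any library.

    #include <algorithm>

    // Non-negative branch lengths and distances, formulas (8)-(10):
    // the branch length is clamped into [0, d_ij] and the new distance at 0.
    inline double branchI(double dij, double ui, double uj) {        // (8)
        return std::max(0.0, std::min(dij, 0.5 * dij + 0.5 * (ui - uj)));
    }
    inline double branchJ(double dij, double ui, double uj) {        // (9)
        return dij - branchI(dij, ui, uj);
    }
    inline double newDistance(double dip, double djp, double dij) {  // (10)
        return std::max(0.0, 0.5 * (dip + djp - dij));
    }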
Recursive formulas
New average distances u_p^{(m+1)} can be calculated from the old ones u_p^{(m)} by the formula

u_p^{(m+1)} = \frac{(n-2)\, u_p^{(m)} - d_{i^* p} - d_{j^* p} + d_{k^* p}}{n-3},   (11)

or, using (10),

u_p^{(m+1)} = \frac{(n-2)\, u_p^{(m)} - d_{i^* p} - d_{j^* p} + \max\left(0,\; \tfrac{1}{2}(d_{i^* p} + d_{j^* p} - d_{i^* j^*})\right)}{n-3}.   (12)
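Formula (11) turns the update of all average distances into an O(n) operation per join instead of a full recomputation. A minimal sketch, with illustrative names; the distance rows of the joined and new nodes are passed in explicitly:

    #include <cstddef>
    #include <vector>

    // Recursive O(n) update of average distances, formula (11).
    // dIp[p] = d_{i*p}, dJp[p] = d_{j*p}, dKp[p] = d_{k*p};
    // n is the number of active nodes before the join.
    void updateAverages(std::vector<double>& u,
                        const std::vector<double>& dIp,
                        const std::vector<double>& dJp,
                        const std::vector<double>& dKp,
                        std::size_t n) {
        for (std::size_t p = 0; p < u.size(); ++p)
            u[p] = ((n - 2) * u[p] - dIp[p] - dJp[p] + dKp[p]) / double(n - 3);
    }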
2.2 Upper bound for change in the goal function value for a pair of points over a neighbor-joining step
Estimates for growth of average distances u_i

From (12), it is easy to obtain the following inequalities:

u_p^{(m+1)} - u_p^{(m)} \le \frac{1}{n-3}\, u_p^{(m)}   (13)

and

u_p^{(m+1)} - u_p^{(m)} \le \frac{1}{n-3}\, u_{\max}^{(m)},  where u_{\max}^{(m)} = \max_k u_k^{(m)}.   (14)
Estimates for growth of δ_ij

From (3), it is easy to see that

\delta_{pq}^{(m+1)} - \delta_{pq}^{(m)} = (u_p^{(m+1)} - u_p^{(m)}) + (u_q^{(m+1)} - u_q^{(m)}).

Using the inequalities (13) and (14), we get

\delta_{pq}^{(m+1)} - \delta_{pq}^{(m)} \le \frac{1}{n-3} (u_p^{(m)} + u_q^{(m)})

or

\delta_{pq}^{(m+1)} - \delta_{pq}^{(m)} \le \frac{2}{n-3}\, u_{\max}^{(m)}.

Finally, we obtain two growth estimates:

\delta_{pq}^{(m+1)} \le \delta_{pq}^{(m)} + \gamma_{pq}^{(m)}   (15)

and

\delta_{pq}^{(m+1)} \le \delta_{pq}^{(m)} + \gamma^{(m)},   (16)

where

\gamma_{pq}^{(m)} = \frac{1}{n-3} (u_p^{(m)} + u_q^{(m)})

and

\gamma^{(m)} = \frac{2}{n-3}\, u_{\max}^{(m)}.

The estimates (15) and (16) together can be written as:

\delta_{pq}^{(m+1)} \le \delta_{pq}^{(m)} + \tilde{\gamma}_{pq}^{(m)},   (17)

where

\tilde{\gamma}_{pq}^{(m)} = \min(\gamma_{pq}^{(m)}, \gamma^{(m)}).
Note. These estimates show that when the number of nodes is large, the value δ_pq^{(m+1)} can exceed δ_pq^{(m)} only slightly. For example, when n = N − m is about 10³, the increase is limited by approximately 2·10⁻³ of the maximal average distance.
2.3 Construction of buckets and operating them
The arrangement of pairs (i, j) in groups is performed according to the values of the neighbor-joining goal function δij defined by formula (3). Our purpose is to limit evaluation of the individual pairs only to those that were close to optimal in the previous iteration and may become optimal at the current step.
First, consider the treatment of pairs with zero or near-zero distances between nodes. Let us introduce a special bucket for pairs (i, j) such that

d_{ij} \le \varepsilon,

where ε is a small number (e.g., ε = 10⁻⁶). We join zero-distance and near-zero-distance pairs first, as we do in our current implementation of the classical neighbor-joining algorithm [13].
Let us now consider the regular pairs. Define the bucket intervals as follows:

B_k^{(m)} = \{ (i,j) : \lambda_{k+1}^{(m)} < \delta_{ij}^{(m)} \le \lambda_k^{(m)} \},  k = 0, 1, 2, …,   (18)

where

\lambda_0^{(m)} > \lambda_1^{(m)} > \lambda_2^{(m)} > ⋯
Our initial idea was to use intervals with a constant step Δ^{(m)}:

\lambda_k^{(m)} = \lambda_0^{(m)} - k\, \Delta^{(m)},  k = 0, 1, …, N − 1.   (19)
If the parameter Δ^{(m)} satisfies the condition

\Delta^{(m)} \ge \gamma^{(m)},   (20)

where λ_0^{(m)} is chosen so that λ_0^{(m)} ≥ δ_max^{(m)} = \max_{(i,j)} \delta_{ij}^{(m)}, only elements of the top bucket can have δ_ij^{(m)} equal to δ_max^{(m)}. All other buckets contain suboptimal elements. Moreover, because of the estimate (15), pairs that are currently in the buckets with k ≥ 2 remain suboptimal in the next iteration.
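Under the constant-step construction (19), locating the bucket for a given value of δ_ij takes a single arithmetic operation, which is what makes wholesale re-bucketing cheap. A minimal sketch (names are ours):

    #include <cmath>
    #include <cstddef>

    // Bucket index under the constant-step intervals (19):
    // bucket k holds values in (lambda0 - (k+1)*step, lambda0 - k*step].
    inline std::size_t bucketIndex(double delta, double lambda0, double step,
                                   std::size_t nBuckets) {
        double k = std::floor((lambda0 - delta) / step);
        if (k < 0.0) k = 0.0;                                    // at or above the top anchor
        if (k > double(nBuckets - 1)) k = double(nBuckets - 1);  // clamp to lowest bucket
        return std::size_t(k);
    }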
At each neighbor-joining iteration, a new collection of N buckets is constructed according to (19). New pairs appearing at the current iteration are placed in the buckets accordingly. The contents of most of the existing buckets are placed into new buckets without being evaluated. First, for each old bucket, the new bucket index k_new is determined by the formula

k_{new} = \left\lfloor \frac{\lambda_0^{(m+1)} - \lambda_k^{(m)}}{\Delta^{(m+1)}} \right\rfloor.   (21)

If k_new ≥ 1, the content of the kth old bucket (k ≥ 1) is placed into the k_new-th new bucket without any evaluation. However, if according to formula (21) k_new ≤ 0, then each pair of nodes in the kth old bucket is evaluated and placed into an appropriate new bucket. The pairs of nodes in the top bucket are always evaluated to find the optimal pair. In the process of evaluation, a pair can be removed from the top bucket for two reasons:
  • if the pair refers to a node that has already been eliminated, the pair is removed;
  • if the pair has a low value of δ_ij, it is moved from the top bucket to the corresponding lower bucket.
The values λ_0^{(m)} and Δ^{(m)} selected in each iteration should satisfy (20). In the initial step, for m = 0, the value λ_0^{(0)} = δ_max^{(0)} is used, since the maximal value is computed when all pairs of nodes are placed in buckets. In subsequent steps, the maximal δ_ij in the top bucket is taken as λ_0^{(m)}.
However, this simple construction (19) would not be efficient if the values δ_ij are distributed non-homogeneously: many intervals may be empty, while others are overpopulated (in an extreme case, one outlier pair may stay at the top while all others fall into the bottom interval). In order to overcome this problem, we propose an adaptive construction of the buckets.
In the adaptive construction, the intervals for the initial step (m = 0) are defined as follows:

\lambda_{k+1}^{(0)} = \max \{ \delta_{ij}^{(0)} : \delta_{ij}^{(0)} \le \lambda_k^{(0)} - \Delta^{(0)} \}   (22)

B_k^{(0)} = \{ (i,j) : \lambda_{k+1}^{(0)} < \delta_{ij}^{(0)} \le \lambda_k^{(0)} \}   (23)

for k = 0, 1, 2, …, N − 2. If there are no values δ_ij^{(0)} below λ_k^{(0)} − Δ^{(0)}, we stop adding intervals. The values λ_0^{(m)} are defined in the same way as before: λ_0^{(0)} = δ_max^{(0)}, and subsequent values λ_0^{(m)} are set to the maximal δ_ij in the top bucket. Since each threshold λ_{k+1}^{(0)} is attained by some pair, the formulas (22)–(23) guarantee that no bucket at the initial step (m = 0) is empty. Otherwise, the construction has the same useful properties as (19).
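A minimal sketch of the initial adaptive construction (22)–(23): the δ values are sorted once, and each next threshold is the largest value lying at least Δ^{(0)} below the previous one, so every interval is non-empty. The function name and signature are illustrative:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Adaptive thresholds lambda_0 > lambda_1 > ... for the initial step,
    // per (22)-(23). Each threshold is attained by an actual delta value,
    // so every bucket interval (lambda_{k+1}, lambda_k] is non-empty.
    std::vector<double> adaptiveThresholds(std::vector<double> deltas,
                                           double step, std::size_t maxBuckets) {
        std::sort(deltas.begin(), deltas.end(), std::greater<double>());
        std::vector<double> lambda(1, deltas.at(0));   // lambda_0 = maximal delta
        std::size_t pos = 0;
        while (lambda.size() <= maxBuckets) {
            // first delta lying at least `step` below the current threshold
            while (pos < deltas.size() && deltas[pos] > lambda.back() - step)
                ++pos;
            if (pos == deltas.size()) break;           // nothing further below: stop
            lambda.push_back(deltas[pos]);
        }
        return lambda;   // bucket k = (lambda[k+1], lambda[k]]
    }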
In the subsequent iterations (m = 1, 2, 3, …) the buckets are operated as follows (a compact sketch of the refresh-and-merge pass is given after this list):
  • The maximal δ_ij in the top bucket is taken as λ_0^{(m)}. The elements of the top bucket are evaluated as described above.
  • The remaining buckets are moved toward the top by setting k_new = k − 1.
  • If λ_k > λ_0^{(m)} − Δ^{(m)}, the value of λ_k is refreshed by resetting it to the maximal δ_ij in the bucket.
  • If the refreshed value still satisfies λ_k > λ_0^{(m)} − Δ^{(m)}, the bucket is merged with the top bucket.
  • New pairs (k*, p) are considered in descending order and are first attempted to be placed in an existing bucket. If

    \min_{k :\, \lambda_k \ge \delta_{k^* p}} \left( \lambda_k - \delta_{k^* p} \right) \le \Delta^{(m)},   (24)

    then the pair (k*, p) is placed in the bucket providing the minimum value. Otherwise, a new bucket is created. The same procedure is applied to the pairs moved from the top bucket to lower buckets.
  • If the number of buckets exceeds N, the top N buckets are kept and the remaining buckets are merged into the lowest bucket. Empty buckets, if any, are also removed.
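The refresh-and-merge pass from the list above can be sketched compactly as follows; the Pair and Bucket types and the delta callback are illustrative stand-ins, not our actual implementation:

    #include <algorithm>
    #include <cstddef>
    #include <list>
    #include <vector>

    struct Pair { std::size_t i, j; double dij; };           // record (i, j, d_ij)
    struct Bucket { double upper; std::list<Pair> pairs; };  // upper bound lambda_k

    // delta(p) recomputes the current goal value of a record from the
    // up-to-date average distances.
    template <class DeltaFn>
    void refreshAndMerge(std::vector<Bucket>& buckets, double step, DeltaFn delta) {
        if (buckets.empty()) return;
        const double lambda0 = buckets[0].upper;   // maximal delta of the top bucket
        for (std::size_t k = 1; k < buckets.size(); ++k) {
            Bucket& b = buckets[k];
            if (b.upper <= lambda0 - step || b.pairs.empty()) continue;
            double m = delta(b.pairs.front());     // refresh: true maximal delta
            for (const Pair& p : b.pairs) m = std::max(m, delta(p));
            b.upper = m;
            if (b.upper > lambda0 - step)          // still near the top: O(1) merge
                buckets[0].pairs.splice(buckets[0].pairs.end(), b.pairs);
        }
    }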
Data structures
Below we briefly describe data structures for a record, a bucket, and a bucket collection.
Record
For a pair of nodes i and j (i < j), we keep a record consisting of the two indices and the distance between the nodes: R = (i, j, d_ij). There is no reason to store the actual value of the goal function δ_ij, since it changes at each step of the algorithm and cannot be reused. However, keeping the d_ij value in the record allows us to avoid gathering these values from a large two-dimensional array and makes the algorithm more cache-friendly [20], [21].
Bucket
Each bucket contains a linked list of records. In our initial implementation, we use the C++ STL list class. A standard constant-time splice algorithm [28] is used to combine linked lists. Records referring to nodes that have already been eliminated are erased using the constant-time C++ STL list erase() function. Special memory allocation and reallocation can be used to provide cache-efficient placement of bucket elements [29].
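For example, merging one bucket's record list into another via splice() relinks the list nodes in constant time:

    #include <list>

    struct Record { int i, j; double dij; };   // the record described above

    // Constant-time merge: splice() relinks the nodes of `src` onto the
    // end of `dst`, copying nothing and leaving `src` empty.
    void mergeInto(std::list<Record>& dst, std::list<Record>& src) {
        dst.splice(dst.end(), src);
    }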
Bucket Collection
Each bucket collection contains two arrays: the first contains real numbers in decreasing order and describes the bucket intervals, while the second contains pointers to the buckets. In our initial implementation, we use the C++ STL vector class for these arrays. As described above, N buckets are allocated at each neighbor-joining step.
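A minimal rendering of the collection (illustrative):

    #include <list>
    #include <vector>

    struct Record { int i, j; double dij; };

    // Two parallel arrays: decreasing interval bounds and pointers to the
    // buckets; bucket k holds goal values in (bounds[k+1], bounds[k]].
    struct BucketCollection {
        std::vector<double> bounds;
        std::vector<std::list<Record>*> buckets;
    };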
In addition to the bucket-based data structures, we use arrays implemented as C++ STL vector objects for the average distances u_i and for the status of each node (active/inactive), which changes when a node is eliminated. Since calculating new distances by formula (6) requires direct access to d_ij, we keep this two-dimensional matrix as vector< vector<double> > in our initial implementation.
3 Test results
To evaluate the algorithm, we used the following four data sets containing full-length Influenza A H3N2 hemagglutinin protein coding sequences obtained from the NCBI Influenza Virus Resource [13]:
  • Dataset containing 44 sequences from 1968–1972;
  • Dataset containing 252 sequences from 1971–1996;
  • Dataset containing 504 sequences from 1972–2000;
  • Dataset containing 1000 sequences from 1972–2005.
Figure 2 illustrates the performance of the algorithm for these four datasets. The total number of pair evaluations stays approximately proportional to N², where N is the number of sequences in the dataset: the proportionality constant is c ≈ 3 for N = 252, N = 504 and N = 1000, while c ≈ 7.5 for N = 44.
Fig. 2. Number of pair evaluations per cycle divided by N².
Figure 3 shows the initial distribution of δ(i, j) = u(i) + u(j) − d(i, j) for non-identical pairs of sequences in the largest dataset (N = 1000). The maximal value is 0.441. Only 100 pairs of sequences out of 499,500 (0.02%) have values of δ(i, j) greater than 0.3, and only about 1,000 pairs of sequences (0.2%) have values of δ(i, j) above 0.1.
Fig. 3. Values of u(i) + u(j) − d(i, j) for pairs of sequences, N = 1000.
Figure 4 shows the number of pair evaluations in each iteration of the algorithm. The total number of pair evaluations accumulates from a very small number of evaluations in the majority of steps and a large number of evaluations on rare occasions.
Fig. 4. Number of pair evaluations in each iteration.
Figure 5 shows the execution times for the tests on two NCBI Linux computers: gizmo2, with an Intel Xeon E5345 2.33 GHz processor and a 4 MB cache, and sutils0, with an Intel Xeon 3.20 GHz processor and a 512 KB cache. The corresponding execution times divided by N², where N is the number of sequences in the test, are shown in the table below. Note that while sutils0 has a faster processor, gizmo2 consistently demonstrates 1.7–1.8 times faster execution, probably because its cache is much larger. We believe that the speed of memory access, rather than processor speed, is critical for execution speed, and that the code will greatly benefit from tuning aimed at minimizing cache misses. Comparing our normalized times of 1.33·10⁻⁶ and 2.34·10⁻⁶ seconds to the 5.64·10⁻⁶ seconds reported for QuickJoin (based on Table 1 in [15]) makes us very optimistic about the performance of an optimized version of the code.
Fig. 5. Execution times.
Table 1. Normalized execution times.
4 Discussion
Accelerating the neighbor-joining method is important for enhancing the performance of online web analysis tools, where users expect to perform initial exploratory analysis of their datasets in real time and to perform bootstrapping as fast as possible. In this paper we present an adaptive bucket algorithm that significantly reduces the number of evaluations in the search steps by distributing candidate pairs in buckets and evaluating only a small portion of all pairs in each iteration. The proposed construction avoids empty buckets and allows the algorithm to handle values of the neighbor-joining goal function that are distributed non-homogeneously, including cases when outliers are present. The algorithm uses simple data structures that can be further optimized, including their cache-efficiency. Our preliminary test results are shown above; we continue optimizing the code and plan to perform comprehensive comparisons. While the proposed algorithm is designed to produce the same results as classical neighbor-joining, the degree of acceleration it provides is determined by the distribution of the values of the neighbor-joining goal function, which in turn depends on the structure of the dataset.
5 Acknowledgements
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
The authors are thankful to David J. Lipman, Alejandro Schaffer, Stacy Ciufo, Vahan Grigoryan and Yuri Kapustin for productive discussions.
References
1.
Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987;4:406–425. [PubMed]
2.
Studier JA, Keppler KJ. A note on the neighbor-joining algorithm of Saitou and Nei. Molecular Biology and Evolution. 1988;5:729–731. [PubMed]
3.
Felsenstein J. Inferring Phylogenies. Cambridge University Press; 2003.
4.
Atteson K. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica. 1999;25:251–278.
5.
Tamura K, Nei M, Kumar S. Prospects for inferring very large phylogenies by using the neighbor-joining method. PNAS. 2004;101:11030–11035. [PubMed]
6.
Bryant D. On the uniqueness of the selection criterion in neighbor-joining. Journal of Classification. 2005;22
7.
Desper R, Gascuel O. The minimum-evolution distance-based approach to phylogenetic inference. In: Gascuel O, editor. Mathematics of evolution and phylogeny. Oxford University Press; 2005. pp. 1–32.
8.
Gascuel O, Steel M. Neighbor-joining revealed. Molecular Biology and Evolution. 2006;23:1997–2000. [PubMed]
9.
Gascuel O. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 1997;14:685–695. [PubMed]
10.
Bruno WJ, Socci N, Halpern AL. Weighted neighbor-joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 2000;17:189–197. [PubMed]
11.
Yang Z. Computational Molecular Evolution. Oxford University Press; 2006.
12.
Bryant D. A classification of consensus methods for phylogenies. In: Janowitz M, Lapointe FJ, McMorris F, Mirkin B, Roberts F, editors. BioConsensus. DIMACS, American Mathematical Society; 2003. pp. 163–184.
13.
Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, Tatusova T, Ostell J, Lipman D. The Influenza Virus Resource at the National Center for Biotechnology Information. Journal of Virology. 2008;82:596–601. [PubMed]
14.
Zaslavsky L, Bao Y, Tatusova TA. An adaptive resolution tree visualization of large influenza virus sequence datasets. In: Mandoiu I, Zelikovsky A, editors. Bioinformatics Research and Applications, Proc. of ISBRA 2007. Volume LNBI 4463 of Lecture Notes in Bioinformatics. Springer-Verlag; 2007. pp. 192–202.
15.
Mailund T, Pedersen CN. Quickjoin – fast neighbor-joining tree reconstruction. Bioinformatics. 2004;20:3261–3262. [PubMed]
16.
Mailund T, Brodal GS, Fagerberg R, Pedersen CNS, Phillips D. Recrafting the neighbor-joining method. BMC Bioinformatics. 2006;7(29)
17.
Shenerman L, Evans J, Foster JA. Clearcut: fast implementation of relaxed neighbor joining. Bioinformatics. 2006;22:2823–2824. [PubMed]
18.
Evans J, Shenerman L, Foster J. Relaxed Neighbor-Joining: A Fast Distance-Based Phylogenetic Tree Construction Method. J. Mol. Evol. 2006;62:785–792. [PubMed]
19.
Elias I, Lagergren J. Fast neighbor joining. In: Caires L, et al., editors. ICALP. Volume 3580 of Lect. Notes Comp. Sci. Springer-Verlag; 2005. pp. 1263–1274.
20.
LaMarca A, Ladner RE. The influence of caches on the performance of sorting. Journal of Algorithms. 1999;31:66–104.
21.
Brodal GS, Fagerberg R, Vinther K. Engineering a cache-oblivious sorting algorithm. Journal of Experimental Algorithmics. 2007;12:2.1.
22.
Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms. Second Edition. MIT Press and McGraw-Hill; 2001.
23.
Dial RB. Algorithm 360: Shortest path forest with topological ordering. Comm. ACM. 1969;12:632–633.
24.
Wagner RA. A shortest path algorithm for edge-sparse graphs. J. Assoc. Comput. Mach. 1976;23:50–57.
25.
Dinic EA. Economical algorithms for finding shortest paths in a network. In: Popkov YS, Shmulyan BL, editors. Transportation Modeling Systems. The Institute for System Studies; 1978. pp. 36–44. (In Russian).
26.
Denardo EV, Fox BL. Shortest-route methods: 1. reaching, pruning, and buckets. Oper. Res. 1979;27:161–186.
27.
Cherkassky BV, Goldberg AV, Silverstein C. Buckets, heaps, lists, and monotone priority queues. SIAM Journal on Computing. 1999;28:1326–1346.
28.
Musser DR, Derge GJ, Saini A. STL Tutorial and Reference Guide: C++ Programming with the Standard Template Library. 2nd edn. Addison-Wesley Professional; 2001.
29.
Meyers S. Effective STL. Addison-Wesley Professional; 2001.