Accelerated engagement to deploy NINJA for Neighbor-Joining trees with large data sets.
The contact for this project is Travis Wheeler
- Web site...
- Implementation of Neighbor-Joining algorithm.
- Written in Java, no major external dependencies.
- Can do a 220K species tree in six days on one CPU, 2Gbytes RAM
- Test case was 16s rRNA gene sequence data set, the same one as FASTTREE
- NINJA offers the full Neighbor Joining method but is optimized for large data sets.
- Employs filtering steps to minimize the number of pair-wise comparisons required for each iteration.
- Relies on file I/O for externalization (ie query-optimized serialization) of data structures to free up RAM.
- Uses external application (FASTTREE) to pre-process sequence data into a distance matrix. The file I/O involved is non-trivial.
- Completion of the software by adding sequence pre-processing (distance matrix calculation)
- Benchmarking/testing on large data sets
- What is really wanted is to port the algorithm to C or C++ and make parallelizable
- Travis thinks the code is written in such a way that it could be easily ported to C. There are no serious external dependencies.
- Is reliance on disk I/O to free up memory is a potential problem for large-scale parallel computing?
- (edit from Travis: I don't believe this will be an impediment to parallelization, as the data structures are well suited to being handled in a distributed fashion. The core difficulty will be in setting those structures up in the first place ... which I think won't be terribly challenging)
- Calculation of pair-wise distance matrix could be parallelized without too much pain, perl wrapper (for example) for matrix assembly would be the most challenging part, but not that hard.
- (edit from Travis: I'd propose that distance computation should be done internal to NINJA, to avoid the I/O overhead of writing the distance matrix out to disk, then reading it back in for tree construction. To give a sense of scale, for 200,000 sequences, that distance matrix stored in an ascii file will require a file larger than 200GB)
- Travis Wheeler is available to convey domain knowledge and design input but does not have time to be directly involved in coding
- (edit from Travis: I'm able to do some coding, just not most of it)
- Cubic growth in run-time predicts a doubling of the data set (to 440K) would require approx ~~40 days on one CPU
- That would be challenging for parameter tweaking
- No explicit policy for missing data
- Would be best to use the same sequence data as RaXML.
1) port NINJA from Java to C/C++
- probably C++ as the binheaps used in the Java version are also in STL in C++ 2) import distance matrix calculation functionality 3) test/benchmark C++ vs. Java implementation
scale-up and parallelization