NINJA

Accelerated engagement to deploy NINJA for Neighbor-Joining trees with large data sets.
The contact for this project is Travis Wheeler

NINJA

  • Reference...
  • Web site...
  • Implementation of Neighbor-Joining algorithm.
  • Written in Java, no major external dependencies.
  • Can do a 220K species tree in six days on one CPU, 2Gbytes RAM
  • Test case was 16s rRNA gene sequence data set, the same one as FASTTREE
  • NINJA offers the full Neighbor Joining method but is optimized for large data sets.
    • Employs filtering steps to minimize the number of pair-wise comparisons required for each iteration.
    • Relies on file I/O for externalization (ie query-optimized serialization) of data structures to free up RAM.
  • Uses external application (FASTTREE) to pre-process sequence data into a distance matrix. The file I/O involved is non-trivial.

Requirements

  • Completion of the software by adding sequence pre-processing (distance matrix calculation)
  • Benchmarking/testing on large data sets
  • What is really wanted is to port the algorithm to C or C++ and make parallelizable

Notes

  • Travis thinks the code is written in such a way that it could be easily ported to C. There are no serious external dependencies.
  • Is reliance on disk I/O to free up memory is a potential problem for large-scale parallel computing?
    • (edit from Travis: I don't believe this will be an impediment to parallelization, as the data structures are well suited to being handled in a distributed fashion. The core difficulty will be in setting those structures up in the first place ... which I think won't be terribly challenging)
  • Calculation of pair-wise distance matrix could be parallelized without too much pain, perl wrapper (for example) for matrix assembly would be the most challenging part, but not that hard.
    • (edit from Travis: I'd propose that distance computation should be done internal to NINJA, to avoid the I/O overhead of writing the distance matrix out to disk, then reading it back in for tree construction. To give a sense of scale, for 200,000 sequences, that distance matrix stored in an ascii file will require a file larger than 200GB)
  • Travis Wheeler is available to convey domain knowledge and design input but does not have time to be directly involved in coding
    • (edit from Travis:  I'm able to do some coding, just not most of it)
  • Cubic growth in run-time predicts a doubling of the data set (to 440K) would require approx ~~40 days on one CPU
    • That would be challenging for parameter tweaking
  • No explicit policy for missing data
  • Would be best to use the same sequence data as RaXML.

Draft plan

Phase 1

1) port NINJA from Java to C/C++

Phase 2

scale-up and parallelization

note: Travis has posted a page with ideas/suggestions for the details of parallelization.