Accelerated engagement to deploy NINJA for Neighbor-Joining trees with large data sets.
The contact for this project is Travis Wheeler.

NINJA

  • Reference...
  • Web site...
  • Implementation of the Neighbor-Joining algorithm.
  • Written in Java, with no major external dependencies.
  • Can build a 220K-species tree in six days on one CPU with 2 GB of RAM.
  • The test case was a 16S rRNA gene sequence data set, the same one used for FastTree.
  • NINJA offers the full Neighbor-Joining method but is optimized for large data sets (the canonical join step is sketched after this list).
    • Employs filtering steps to minimize the number of pairwise comparisons required for each iteration.
    • Relies on file I/O for externalization (i.e. query-optimized serialization) of data structures to free up RAM.
  • Uses an external application (FastTree) to pre-process sequence data into a distance matrix; the file I/O involved is non-trivial.
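
For reference, the core of the canonical Neighbor-Joining step that NINJA implements looks roughly like the sketch below. This is an illustrative fragment, not NINJA's actual code: the class and method names (NjStep, pickJoinPair) are invented, the dense double[][] matrix stands in for NINJA's externalized data structures, and the naive scan over every candidate pair is exactly the work that NINJA's filtering steps are designed to avoid.

    final class NjStep {
        // One canonical Neighbor-Joining iteration: choose the pair (i, j)
        // minimizing Q(i,j) = (n-2)*d[i][j] - r[i] - r[j], where r holds the
        // row sums of the distance matrix. Scanning every pair on every
        // iteration is what makes plain NJ cubic overall; NINJA's filtering
        // restricts this scan to a small set of promising candidates.
        static int[] pickJoinPair(double[][] d, int n) {
            double[] r = new double[n];               // row sums
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    r[i] += d[i][j];

            int bestI = -1, bestJ = -1;
            double bestQ = Double.POSITIVE_INFINITY;
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    double q = (n - 2) * d[i][j] - r[i] - r[j];
                    if (q < bestQ) { bestQ = q; bestI = i; bestJ = j; }
                }
            }
            return new int[] { bestI, bestJ };        // the pair joined this iteration
        }
    }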

Requirements

  • Complete the software by adding sequence pre-processing (distance matrix calculation); a rough sketch follows this list.
  • Benchmark and test on large data sets.
  • What is really wanted is to port the algorithm to C or C++ and make it parallelizable.
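
One possible shape for the pre-processing step, purely as a sketch: the class and method names (DistancePreprocess, distanceMatrix, pDistance) are invented, and the uncorrected p-distance used here is only a placeholder for whatever distance model the project settles on (e.g. matching what FastTree uses).

    import java.util.List;

    final class DistancePreprocess {
        // Turn aligned sequences into the pairwise distance matrix that
        // Neighbor-Joining consumes. A dense double[][] is fine for moderate
        // inputs; very large inputs would need NINJA-style externalization.
        static double[][] distanceMatrix(List<String> alignedSeqs) {
            int n = alignedSeqs.size();
            double[][] d = new double[n][n];
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    double dist = pDistance(alignedSeqs.get(i), alignedSeqs.get(j));
                    d[i][j] = dist;
                    d[j][i] = dist;
                }
            }
            return d;
        }

        // Uncorrected p-distance: fraction of mismatched, non-gap columns.
        static double pDistance(String a, String b) {
            int compared = 0, mismatches = 0;
            for (int k = 0; k < a.length(); k++) {
                char x = a.charAt(k), y = b.charAt(k);
                if (x == '-' || y == '-') continue;   // skip alignment gaps
                compared++;
                if (x != y) mismatches++;
            }
            return compared == 0 ? 0.0 : (double) mismatches / compared;
        }
    }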

Notes

  • Travis thinks the code is written in such a way that it could be easily ported to C. There are no serious external dependencies.
  • Is reliance on disk I/O to free up memory a potential problem for large-scale parallel computing?
    • (edit from Travis: I don't believe this will be an impediment to parallelization, as the data structures are well suited to being handled in a distributed fashion. The core difficulty will be in setting those structures up in the first place ... which I think won't be terribly challenging)
  • Calculation of the pairwise distance matrix could be parallelized without too much pain; a Perl wrapper (for example) for matrix assembly would be the most challenging part, but not that hard. (A rough sketch of a parallel distance computation follows these notes.)
    • (edit from Travis: I'd propose that distance computation should be done internal to NINJA, to avoid the I/O overhead of writing the distance matrix out to disk, then reading it back in for tree construction. To give a sense of scale, for 200,000 sequences, that distance matrix stored in an ascii file will require a file larger than 200GB)
  • Travis Wheeler is available to convey domain knowledge and design input but does not have time to be directly involved in coding
    • (edit from Travis:  I'm able to do some coding, just not most of it)
  • Cubic growth in run-time predicts that doubling the data set (to 440K) would require roughly 8× as long, i.e. approximately 40–50 days on one CPU.
    • That would be challenging for parameter tweaking.
  • No explicit policy for missing data
  • Would be best to use the same sequence data as RAxML.
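
A rough sketch of the row-blocked parallelization of the distance matrix mentioned above; all names (ParallelDistances, compute) are illustrative, not an existing design. For genuinely large inputs the dense double[][] used here (hundreds of GB at 200K sequences) would have to be replaced by streaming the computed blocks straight into NINJA's externalized structures, which is also how the large ASCII intermediate file Travis mentions would be avoided.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.function.ToDoubleBiFunction;

    final class ParallelDistances {
        // Each worker takes an interleaved set of rows (offset, offset+threads,
        // offset+2*threads, ...) so the shrinking row lengths stay roughly
        // balanced across workers. Every cell is written by exactly one worker.
        static double[][] compute(List<String> seqs,
                                  ToDoubleBiFunction<String, String> dist,
                                  int threads) throws Exception {
            int n = seqs.size();
            double[][] d = new double[n][n];
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<?>> work = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int offset = t;
                work.add(pool.submit(() -> {
                    for (int i = offset; i < n; i += threads) {
                        for (int j = i + 1; j < n; j++) {
                            double v = dist.applyAsDouble(seqs.get(i), seqs.get(j));
                            d[i][j] = v;
                            d[j][i] = v;
                        }
                    }
                }));
            }
            for (Future<?> f : work) f.get();   // wait for all workers, surface errors
            pool.shutdown();
            return d;   // assembled in memory; no ASCII intermediate file
        }
    }

A caller could, for example, pass in the placeholder p-distance from the pre-processing sketch above: ParallelDistances.compute(seqs, DistancePreprocess::pDistance, 16).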

Draft plan

Phase 1

Port NINJA from Java to C/C++.

Phase 2

Scale up and parallelize.

Note: Travis has posted a page with ideas/suggestions for the details of parallelization.
