TR120521

Agenda

  1. Documentation/DB Schema
    1. Jamie has made a better PDF of the database schema and will upload this to GitHub
    2. Jamie also made a copy of the schema that uses InnoDB tables instead of the MyISAM tables we currently use. This is for folks interested in enforced referential integrity. I will push this to GitHub if I have not done so already. This work mainly involved making sure the foreign keys all made sense with no redundancies and specifying the table type in the *.SQL code.
  2. Testing database with large trees
    1. To test database query speed and scalability, Jamie has uploaded the following species trees:
      1. Deep green species tree from John Kerry
        1. 83 terminal taxa
      2. APG3 Derived megatree
        1. 1,827 labeled nodes
        2. This tree includes labeled internal nodes that are useful for visualization tests and clade based selection queries
        3. This is the tree used by phylomatic
      3. Smith et al. 
        1. 55,473 terminal taxa
          1. To put this in perspective, the IUCN Red List estimates about 300,000 plant species on earth, and Mora et al. also peg the number of plant species at about 300,000. You can also check out Stuart Pimm's paper if you are interested in estimates of plant species numbers.
        2. http://www.amjbot.org/content/98/3/404.abstract
    2. The large reconciled gene trees from John Kerry are ready for upload but not in the database yet.
    3. The database is still very responsive, and I am currently testing subtree selection query speeds and results
      1. Jamie will add a wiki page to document this.
    4. GUI/Visualization tests will need to wait for updates to the services and GUI
  3. DupTree Support
    1. During the conference call on Monday we discussed loading output from DupTree and Notung as part of hosting the pilot study from the 1kp project.
    2. It turns out that DupTree does not produce a reconciliation mapping that could be loaded to the database, so uploading these results to the database will not be possible.
    3. DupTree has a companion tree visualization GUI that does do parsimony based reconciliation, but it currently does not output the reconciliation in any format that could be used by other resources.
    4. I brought up (to Gordon Burleigh) the possibility of adding NEXML TR encoding support to their GUI using the work I generated last summer as a Google Summer of Code Project with Daniel Packer. Gordon thought this would be a good idea and was going to discuss this with Andre.
  4. Notung Support
    1. Notung has a different way of representing reconciled trees (see attached PowerPoint slides).
    2. The short story is that Notung output CAN be loaded to the database with some additional coding. I have about another day of coding to finish this support for uploading Notung trees. This will support loading the 1kp Notung trees to the database.
  5. Notung LOST Nodes
    1. Unlike TREEBEST and PrimeGSR, Notung also includes an explicit mapping of the LOST gene copies. The TR database does allow for this mapping.
    2. We could include these in the database GUI presentation; we just need to make sure that the GUI and API are able to handle them.
    3. An alternative is to ignore the LOST copies for now when loading trees, and just present the duplication and speciation nodes.
  6. Given the 1kp deep green data, I believe that Notung will in some cases produce a set of equally likely reconciliation mappings for the parameters used. We need to consider how we want to deal with those situations.
    1. Storage of multiple reconciliations for a given parameter set and program can be supported in the database by adding a rank SMALLINT(8) column to the reconciliation table. This would allow all of the equally likely reconciliations to be stored. This column could allow NULL, or we could always set rank to 1 for results that have only a single reconciliation. (Rank does not necessarily imply an order of quality of the reconciliations, just the order in which they were presented in the NOTUNG format reconciliation file.)
    2. Retrieval of reconciled trees and counting of reconciliation mappings would need to consider rank.
      1. The simplest approach is to consider only rank=1, in which case we could just load the first tree.
      2. More complex is to summarize the reconciliations across all ranks for a given reconciliation result
      3. For example, counting the number of duplications on a branch in the species tree would need to consider rank so that a single reconciliation result is not counted multiple times
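
To make the rank discussion in item 6 concrete, here is a minimal sketch using an in-memory SQLite database from Python. It is not the TR schema: the table and column names (`reconciliation`, `result_id`, `duplication`, `species_branch`) and the toy data are assumptions made up for illustration, and SQLite stands in for MySQL. It compares three ways of counting duplications per species tree branch: rank=1 only, a naive count over all ranks, and a per-result mean across ranks.

```python
import sqlite3

# Hypothetical, simplified tables; the real TR schema will differ.
SCHEMA = """
CREATE TABLE reconciliation (
    reconciliation_id INTEGER PRIMARY KEY,
    result_id INTEGER NOT NULL,          -- one program run / parameter set
    "rank" SMALLINT NOT NULL DEFAULT 1   -- order in the Notung output file
);
CREATE TABLE duplication (
    reconciliation_id INTEGER NOT NULL
        REFERENCES reconciliation(reconciliation_id),
    species_branch TEXT NOT NULL         -- branch the duplication maps to
);
"""

def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Result 1 has two equally likely reconciliations; result 2 has one.
    conn.executemany(
        'INSERT INTO reconciliation (reconciliation_id, result_id, "rank") '
        "VALUES (?, ?, ?)",
        [(1, 1, 1), (2, 1, 2), (3, 2, 1)],
    )
    conn.executemany(
        "INSERT INTO duplication VALUES (?, ?)",
        [(1, "A"), (1, "A"), (1, "B"),
         (2, "A"), (2, "B"), (2, "B"),
         (3, "A")],
    )
    return conn

def count_rank1(conn):
    """Simplest option: only look at the first reconciliation per result."""
    rows = conn.execute(
        "SELECT d.species_branch, COUNT(*) FROM duplication d "
        "JOIN reconciliation r USING (reconciliation_id) "
        'WHERE r."rank" = 1 GROUP BY d.species_branch'
    )
    return dict(rows.fetchall())

def count_naive(conn):
    """Ignores rank, so a result with k reconciliations is counted k times."""
    rows = conn.execute(
        "SELECT species_branch, COUNT(*) FROM duplication "
        "GROUP BY species_branch"
    )
    return dict(rows.fetchall())

def count_mean_over_ranks(conn):
    """Weights each duplication by 1 / (number of ranks in its result)."""
    rows = conn.execute(
        "SELECT d.species_branch, SUM(1.0 / n.n_ranks) FROM duplication d "
        "JOIN reconciliation r USING (reconciliation_id) "
        "JOIN (SELECT result_id, COUNT(*) AS n_ranks FROM reconciliation "
        "      GROUP BY result_id) n ON n.result_id = r.result_id "
        "GROUP BY d.species_branch"
    )
    return dict(rows.fetchall())

if __name__ == "__main__":
    conn = build_demo_db()
    print("rank=1 only:", count_rank1(conn))
    print("naive:", count_naive(conn))
    print("mean over ranks:", count_mean_over_ranks(conn))
```

With this toy data the naive count inflates the totals for result 1, which is the double counting described in 6.2.3. The per-result mean is only one possible way to summarize across ranks; other summaries (for example, the range of counts across ranks) may suit the GUI better.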