TE101102

Trait Evolution

Nov 2, 2010

Invitees/Attendees (in orange)

Barb Banbury (bbanbury@utk.edu)
Jeremy Beaulieu (jeremy.beaulieu@yale.edu)
Joe Felsenstein (joe@gs.washington.edu)
Eric Lyons (elyons@iplantcollaborative)
Naim Matasci (nmatasci@iplantcollaborative.org)
Sheldon McKay (sheldon.mckay@gmail.com)
Brian O'Meara (omeara.brian@gmail.com)
Ann Stapleton (stapletona@uncw.edu)

Topics for discussion (updates in orange)

ape integration and timeline

The R statistical software package ape (Analyses of Phylogenetics and Evolution) offers a comprehensive suite of phylogenetic methods. A key point in its favor is that it can estimate the uncertainty of ancestral state reconstructions. Naim has started testing the function (ace) and the possibility of running R code efficiently on the Condor cluster. Another candidate package for integration is geiger, which includes several methods for model fitting.

Naim reports that he successfully ran ace on the Condor cluster and that, because the GUI and filetypes for ace are the same as for contrasts, he thinks he can have a working demo within 2 weeks. In addition, because ace can also be used to estimate discrete character states, he thinks he may be able to provide that functionality too. One aspect that has not yet been dealt with is what form the output should take, and in particular the need to merge the tree with the internal node value estimates. Naim will look into what solutions are available. Brian and Barb also point out that people will use ace to reconstruct ancestral character states, whereas if they want to obtain the model parameters, they might prefer to use geiger. Another important part of the output is the errors and warnings produced by the scripts.

Naim reports that he encountered some issues with running R that are now solved. He has submitted the code to core software development, who estimate that the implementation effort will take about a week, starting on Wed/Thurs. Unfortunately, core software encountered some problems with the link to the execution framework, which resulted in delays. They indicate that the problem might be resolved later this week. In the meantime, Naim changed the CACE R script so that it can handle multiple traits. Brian asked if it can handle missing data, and Naim answered that it can, although he has not tested it. He will test it immediately and ensure that there are no problems.
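One possible form for the merged output mentioned above is a Newick string whose internal nodes carry the estimated values as node labels. The following is only a minimal sketch in Python (not R, and not ape's own output format): the nested-tuple tree representation, the function name `annotate_newick`, the idea of keying estimates by the set of tips below each node, and the example values are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, FrozenSet, Tuple, Union

# Hypothetical tree representation: a tip is a string,
# an internal node is a tuple of child subtrees.
Tree = Union[str, tuple]


def annotate_newick(
    tree: Tree, estimates: Dict[FrozenSet[str], float]
) -> Tuple[str, FrozenSet[str]]:
    """Return (newick_fragment, tips_below_this_node).

    Internal nodes found in `estimates` (keyed by their set of tips)
    get the estimate written as a node label with three decimals.
    """
    if isinstance(tree, str):
        return tree, frozenset([tree])

    parts = []
    tips: FrozenSet[str] = frozenset()
    for child in tree:
        text, child_tips = annotate_newick(child, estimates)
        parts.append(text)
        tips |= child_tips

    label = format(estimates[tips], ".3f") if tips in estimates else ""
    return "(" + ",".join(parts) + ")" + label, tips


# Made-up example values, not results from any analysis.
tree = (("A", "B"), "C")
estimates = {
    frozenset({"A", "B"}): 1.25,
    frozenset({"A", "B", "C"}): 2.0,
}
newick, _ = annotate_newick(tree, estimates)
newick += ";"
print(newick)
```

Numeric internal node labels are often read as support values by other tools, so a real implementation would probably need a less ambiguous convention, or a separate value table alongside the tree.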

Performance

Naim will work on performance testing of the various applications, with particular focus on buffer underflows, which could yield faulty results. Additionally, Jeremy had pointed out the possibility of using another library (NLopt) for maximum likelihood optimization. Testing will provide a measure for deciding whether the other library should be used. Following a discussion with core software, Naim tested the computational overhead associated with parsing newick strings into (and deparsing them from) the edges and nodes format used internally by ape and geiger. The results indicate a 9x increase in computational time when newick strings are parsed/deparsed, which adds an average of 4 minutes for a 500K tree (with peaks reaching 17 minutes). Although this is not a major issue, it should be taken into account when loading, saving and exchanging data among applications. Brian suggests that, given the small size of tree files, one option would be to store trees in both formats and provide the correct one to each application. Brian mentions that, because of their low adoption rate, support for phyloXML and nexml is not a high priority.
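To make the parsing step discussed above concrete, here is a minimal sketch of converting a Newick string into an edge list. It is written in Python for illustration and is not ape's or geiger's parser: the function name `newick_to_edges` is hypothetical, the node numbering is a simple order-of-first-visit scheme rather than ape's convention, and only plain Newick (no quoted labels, no comments) is handled.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int, Optional[float]]  # (parent id, child id, branch length)


def newick_to_edges(newick: str) -> Tuple[List[Edge], Dict[int, str]]:
    """Parse a simple Newick string into (edges, labels).

    Node ids are assigned in the order nodes are first visited (root = 0).
    `labels` maps node id -> label for every labelled node.
    """
    text = newick.strip().rstrip(";")
    edges: List[Edge] = []
    labels: Dict[int, str] = {}
    pos = 0
    next_id = 0

    def parse_node(parent: Optional[int]) -> None:
        nonlocal pos, next_id
        node = next_id
        next_id += 1

        # An opening parenthesis starts a list of child subtrees.
        if pos < len(text) and text[pos] == "(":
            pos += 1
            while True:
                parse_node(node)
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at {pos}")

        # Optional node label.
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos].strip()
        if label:
            labels[node] = label

        # Optional branch length after a colon.
        length: Optional[float] = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])

        if parent is not None:
            edges.append((parent, node, length))

    parse_node(None)
    if pos != len(text):
        raise ValueError("trailing characters after tree")
    return edges, labels


edges, labels = newick_to_edges("((A:1,B:2):0.5,C:3);")
for parent, child, length in edges:
    print(parent, child, length, labels.get(child, ""))
```

Storing the parsed edge list next to the original string, as Brian suggests, would let each application read whichever form it needs without repeating this step.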

Concurrent implementation of other tools

Regarding the concurrent implementation of other software, Naim reports that AncML cannot be compiled on the Condor machines and suggests dropping that software, especially given that the ace implementation is underway. Naim also informs the group that Joe is planning to revise PHYLIP and that he could use some coding help from iPlant; Naim and Sheldon are looking into it. Joe has revised the contrast code and will send it to Naim for integration. Joe is also planning to add new functionality that is highly relevant for Trait Evolution and could provide better performance. The WG thinks that, given the scope, this could be an incubator project, and the group will discuss this possibility. Brian mentioned that the iPlant software should be open source and thus made available. Naim indicates that the problem might be which licenses to use. Joe points out that PHYLIP code is not open source, so iPlant needs to make sure that it is not accidentally released under an open source license and that its status is clearly indicated.

geiger integration

Another priority identified by the working group is Model Fitting. The R package geiger offers this functionality and Naim will start looking into it. Barb also points out that there are some additional general methods in geiger that could be useful for tree/data matching (e.g. dropping taxa from a tree in case of missing data).

Methods' input validation

To minimize software errors after job submission, a validation step is necessary. As currently designed, the validation only checks whether input values and filetypes are correct. However, a major source of errors is going to be the file content rather than its format. In particular, a newick file might contain a correctly specified tree that is nevertheless unsuitable for certain kinds of analyses (e.g. those requiring a tree that is rooted, additive,...). A compatibility table could be used in that case to ensure that the chosen method will accept the input provided. The group also thinks it would be useful to have such a chart. Naim will put it online.
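A compatibility table of the kind described above could be checked before job submission along the following lines. This is only a sketch under stated assumptions: the method names `method_a` and `method_b`, the listed tree properties, and the requirements in the table are placeholders, not the actual requirements of any tool discussed in these notes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TreeProperties:
    """Properties of an uploaded tree, as determined by inspecting its content."""
    rooted: bool
    has_branch_lengths: bool
    fully_bifurcating: bool


# Hypothetical compatibility table: method -> required property values.
# The entries are placeholders for illustration only.
COMPATIBILITY: Dict[str, Dict[str, bool]] = {
    "method_a": {"rooted": True, "has_branch_lengths": True},
    "method_b": {"has_branch_lengths": True, "fully_bifurcating": True},
}


def check_compatibility(method: str, props: TreeProperties) -> List[str]:
    """Return a list of human-readable problems; an empty list means compatible."""
    if method not in COMPATIBILITY:
        raise KeyError(f"unknown method: {method}")
    problems = []
    for prop, required in COMPATIBILITY[method].items():
        if getattr(props, prop) != required:
            problems.append(f"{method} requires {prop}={required}")
    return problems


# Example: an unrooted, fully bifurcating tree with branch lengths.
tree_props = TreeProperties(rooted=False, has_branch_lengths=True,
                            fully_bifurcating=True)
print(check_compatibility("method_a", tree_props))
print(check_compatibility("method_b", tree_props))
```

Returning a list of problems rather than a single yes/no lets the interface tell the user exactly why a tree was rejected.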

GUI

One important aspect is to provide a powerful tool while offering a clean interface that does not dissuade the naive user. Brian indicates that it would be better to avoid an extra window for the advanced options. As an alternative, he suggests separating basic and advanced options within the same window and giving prominence to the basic ones (e.g. gray vs. white background). Another question is how to organize model selection, as the parameters required depend on the choice of model. The preferred solution is to list the models in a drop-down box and to create or rename the parameter names/boxes accordingly. Naim discussed the GUI options with core software, who confirmed that the option of two groups in one panel is possible. Naim has also started discussing workflows and organization with core software. TreeTapper provides a good template for how phylogenetic tools can be organized.

Datasets

Ann points out that it will be very important to offer the tools at the same time as the Big Tree is unveiled. Another requirement is to have datasets to showcase the tool functionalities, for user training and for teaching. One option could be to match published trees with published data stored in repositories. Ann points out that, since we will have the Big Tree, we will only have to focus on the data aspect. Naim and Jeremy will look into the GBIF data. Barb warns us that taxon name synonyms can be a very touchy subject. Naim reports that one solution floated by the Taxonomic Name Resolution Service WG is to allow synonyms and offer discussion pages for the community to comment on. It is also pointed out that it will be critical to obtain authorization and ensure proper citation: the analysis output should directly mention the source of the data. Given the ability of the DE to track the history of a file, Naim thinks that this might be possible to implement and will investigate with core software. Naim has also contacted a researcher in Ecology and Evolutionary Biology at the University of Arizona who is working on trait evolution on a ~1000 taxa tree.

Tree visualization

Working with Nicole, I found some documents indicating the needs of TE with regard to visualization. It seems that edge coloring is the main priority (see: Metadata wish list), and Brian also provided some examples of relevant trees (see: Cross cutting needs analysis). Upon Eric's suggestion, the group takes a look at the current state of the Phyloviewer. Brian confirms that edge (and node) coloring is the top priority, including the possibility of coloring edges with gradients and the coloring of the triangles (collapsed lineages). The color of the triangle should represent the proportion of lineages in which a certain trait is present, either by subdividing the triangle or by averaging the colors. Naim reports that the visualization group is leaning towards supporting nexml and that he will work with the Tree Visualization group on the domain model. Brian reiterated that wedge coloring is a requirement and that users should be able to choose whether they want average color or proportion. [btw, "wedge" here isn't a typo, as I first thought -- it refers to the triangle of an unresolved clade -- Brian]. Naim reports that the focus of the Tree Visualization group is now being redirected towards implementing the needs of the working groups. He compiled a list of requirements based on wiki documents and discussions regarding visualization and invites the group to review and modify it: https://pods.iplantcollaborative.org/wiki/display/iptol/Visualization+Needs. However, Ann points out that the value tables are the most relevant output of the analyses.
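The two wedge-coloring options Brian describes (average color vs. proportion) can be sketched as follows. This is an illustrative Python sketch only, not Phyloviewer code: the function names, the RGB palette, and the example trait states are assumptions made for the example.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]


def trait_proportions(tip_states: List[str]) -> Dict[str, float]:
    """Fraction of collapsed lineages showing each trait state."""
    if not tip_states:
        raise ValueError("a collapsed clade must contain at least one tip")
    counts: Dict[str, int] = {}
    for state in tip_states:
        counts[state] = counts.get(state, 0) + 1
    return {state: n / len(tip_states) for state, n in counts.items()}


def average_color(proportions: Dict[str, float], palette: Dict[str, RGB]) -> RGB:
    """'Average' mode: one color, blended by the state proportions."""
    return tuple(
        round(sum(p * palette[state][channel] for state, p in proportions.items()))
        for channel in range(3)
    )


def wedge_segments(
    proportions: Dict[str, float], palette: Dict[str, RGB]
) -> List[Tuple[RGB, float]]:
    """'Proportion' mode: one (color, fraction of the triangle) per state."""
    return [(palette[state], p) for state, p in sorted(proportions.items())]


# Hypothetical palette and collapsed clade with four tips.
palette = {"present": (255, 0, 0), "absent": (0, 0, 255)}
props = trait_proportions(["present", "present", "present", "absent"])

print(average_color(props, palette))
print(wedge_segments(props, palette))
```

Both modes start from the same proportions, so letting users switch between them would only change the final drawing step.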

Taxa name matching

Barb reports that she had some issues with the interface that allows users to match taxa names in the datafile to those in the tree file, especially for large groups. Barb collected problematic cases and provided some suggestions on how to solve them: https://pods.iplantcollaborative.org/jira/browse/TRAILEVOL-8. Some of the problems will be solved (or mitigated) by the Taxonomic Name Resolution Service, and Naim will discuss it and the other issues identified by Barb with core software.
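As a rough illustration of the kind of matching involved, the sketch below pairs tree tip names with data file names after a simple normalization and reports what is left unmatched on each side. It is an assumption-laden example, not the existing interface's logic: the function names and the normalization rule (underscores to spaces, collapsed whitespace, case-insensitive) are hypothetical, and it does not attempt synonym resolution, which is the job of the Taxonomic Name Resolution Service.

```python
from typing import Dict, List, Tuple


def normalize(name: str) -> str:
    """Underscores -> spaces, collapse whitespace, lower-case."""
    return " ".join(name.replace("_", " ").split()).lower()


def match_taxa(
    tree_tips: List[str], data_names: List[str]
) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Return (matched tip -> data name, tips without data, data without tips)."""
    data_by_key = {normalize(name): name for name in data_names}
    matched: Dict[str, str] = {}
    tree_only: List[str] = []
    for tip in tree_tips:
        key = normalize(tip)
        if key in data_by_key:
            # pop() so each data row is paired with at most one tip
            matched[tip] = data_by_key.pop(key)
        else:
            tree_only.append(tip)
    data_only = sorted(data_by_key.values())
    return matched, tree_only, data_only


# Hypothetical example names.
tips = ["Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla"]
rows = ["homo sapiens", "Pan  troglodytes", "Pongo abelii"]
matched, tree_only, data_only = match_taxa(tips, rows)
print(matched)
print("in tree only:", tree_only)
print("in data only:", data_only)
```

The unmatched lists are what a user would need to review, or what a pruning step could use to drop tips that have no data.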

Open action items (see also TE jira page)

Completed items

Data donation

To investigate: an optional “make trees available for internal testing” functionality when trees are uploaded for analysis. Nicole suggested that this function can be made available within the planned collaborative framework by creating an "iPlant Testing" user group.

Method list

I have received some suggestions for additional methods that could be integrated into the discovery environment. At present the focus is on the ones already identified, but I think keeping track of the various possibilities could be useful in the longer term. A wiki page for such a list will be created. Also, Brian will invite Ann Stapleton, who provided the suggestions, to join the working group. Wiki page: Software Wishlist

Jira adoption

Naim set up a jira page to track, discuss, and collaboratively work on action items, issues and tasks. The project page is located at https://pods.iplantcollaborative.org/jira/browse/TRAILEVOL.