Copy of TE110329

Trait Evolution

March 29th, 2011

Invitees/Attendees (in orange)

Barb Banbury (bbanbury@utk.edu)
Jeremy Beaulieu (jeremy.beaulieu@yale.edu)
Joe Felsenstein (joe@gs.washington.edu)
Eric Lyons (elyons@iplantcollaborative)
Naim Matasci (nmatasci@iplantcollaborative.org)
Sheldon McKay (sheldon.mckay@gmail.com)
Brian O'Meara (omeara.brian@gmail.com)
Ann Stapleton (stapletona@uncw.edu)
Luke Harmon (lukeh@uidaho.edu)
Bill Piel (william.piel@yale.edu)

Open action items (see also TE jira page)

  • Identify test datasets for teaching and presentation. To be listed in a Dataset collection on the wiki – WG
  • Investigate GBIF dataset -- Naim and Jeremy
  • Identify test users – WG
  • Naim to send Bill example datasets for TNRS
  • Naim to investigate the possibilities and deadline to have data propagated through the DE

Topics for discussion (updates in orange)

EOT Effort

Ann would like to go over how to best interface with Dolan to produce an educational version of the TE tools.  This will be the first educational adaptation of the DE, so we need to plan it carefully.  First let's go over what we see as possible and problematical from our end, then at a later meeting we can have the Dolan folks and Sheldon talk with us to finalize the steps in the plan.

Ann has begun discussions with Dolan on educational use of TE, notes of the first meeting are here [missing link]. Questions EOT has for the working group include:

  1. Is it acceptable to show a 'before and after' tree comparison?
  2. How can the before and after trees be colored or other visuals used to best get across the concept of PIC without causing misconceptions?
  3. What exactly would be done with the current tabular output to interpret the results, and are those post-PIC analysis steps available in the iPlant toolbox right now?

PIC using Contrast should be our first priority for education, as it is a more difficult concept and can incorporate teaching about hypothesis testing. Ancestral state reconstruction would be next. For learning PIC, it would be best to show a correlation or regression plot using data before correction and after, rather than showing a tree. There is no plotter currently available within iPlant, so we'll need to integrate one for educational use. For ancestral state reconstruction we would show a tree (with the ancestral state indicated on the nodes and the current state on the tips); we can use the existing iPlant tree viewer for this.
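
The before/after comparison for PIC can be illustrated with a small sketch. The Python below is a minimal illustration of Felsenstein's independent contrasts on a fully bifurcating tree; it is not the PHYLIP Contrast code. The nested-tuple tree layout, the function names `pic` and `slope_through_origin`, and the four-tip example data are all assumptions made for illustration.

```python
import math

def pic(node, trait):
    """Compute independent contrasts for one trait.

    A tip is (name, branch_length); an internal node is
    ((left, right), branch_length). Returns the node's estimated
    value, its adjusted branch length, and the list of contrasts.
    """
    label, length = node
    if isinstance(label, str):              # tip: use the observed value
        return trait[label], length, []
    left, right = label
    x1, v1, c1 = pic(left, trait)
    x2, v2, c2 = pic(right, trait)
    contrast = (x1 - x2) / math.sqrt(v1 + v2)           # standardized contrast
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)     # weighted node estimate
    adjusted = length + v1 * v2 / (v1 + v2)             # lengthen the stem branch
    return value, adjusted, c1 + c2 + [contrast]

def slope_through_origin(cx, cy):
    """Regression of contrasts in y on contrasts in x, forced through 0."""
    return sum(a * b for a, b in zip(cx, cy)) / sum(a * a for a in cx)

# Hypothetical example tree ((A:1,B:1):1,(C:1,D:1):1) and made-up trait values.
left = ((("A", 1.0), ("B", 1.0)), 1.0)
right = ((("C", 1.0), ("D", 1.0)), 1.0)
tree = ((left, right), 0.0)
x = {"A": 1.0, "B": 3.0, "C": 6.0, "D": 10.0}
y = {name: 2.0 * value for name, value in x.items()}

_, _, contrasts_x = pic(tree, x)
_, _, contrasts_y = pic(tree, y)
slope = slope_through_origin(contrasts_x, contrasts_y)
```

For teaching, the raw tip values (`x` against `y`) and the contrasts (`contrasts_x` against `contrasts_y`) would be the two scatter plots shown side by side.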

We first need existing data sets as examples and to learn to use the tools, then we need to build in easy ways to put in data collected in class and do PIC. Existing data sets need to fit into the biology curriculum in several places and to be 'flashy'. The Ackerly data set will be in the next DE release; Naim has already set it up. Jeremy will put in his climate data set (the one that has already been published), and also one on genome size. Other useful ones would be on selfing (Igic's data?), carnivory, poisonous plants, medicinal value, use as supplements, seed size, fruit fleshiness... Naim will send the link to a data page, and each of us will look for data sources to add. Sample data sets should be organized by topic (e.g. ecology, plant physiology), especially if topics can overlap. For input of student data we will need tree pruning, and we'll need to test that common foods, herbs and spices, and horticultural plants are represented in the source tree. The source tree could be Steven Smith's.

The list of datasets will be stored in a Dataset collection on the wiki.

We will also need a way to share; right now analyses are viewable only by the login that ran them.

Ann informed us that the newly formed EOT WG will be meeting on March 28th. Naim and Ann will attend. The EOT group is planning to have an iPlant educational workshop in mid July (Botany meeting?) and aims to be able to present Contrast. Naim will work with Cornel Ghiban and coordinate.

ape integration and timeline

The R statistical software package ape (Analyses of Phylogenetics and Evolution) offers a comprehensive suite of phylogenetic methods. A key point in favor of ape is the fact that ape can perform an estimation of uncertainty of ancestral state reconstructions. Naim has started testing the function and the possibility to efficiently run R code on the Condor cluster. Another candidate package for integration is geiger, which includes several methods for model fitting. Naim reports that he could successfully run ace on the Condor cluster and that, given that the GUI and filetypes for ace are the same as for contrasts, he thinks he can have a working demo within 2 weeks. In addition, because ace can also be used to estimate discrete character states, he thinks he may be able to provide that functionality too. One aspect that has not been dealt with yet is what form the output should have, and in particular the need to merge the tree with the internal node value estimates. Naim will look into what possible solutions are available. Brian and Barb also point out that people will use ace to reconstruct the ancestral character state, whereas if they want to obtain the model parameters, they might prefer to use geiger. Another important part of the output is the errors and warnings produced by the scripts. Naim reports that he encountered some issues with running R that are now solved. He has submitted the code to core software development, who estimate that the implementation effort will take about a week, starting on Wed/Thurs. Unfortunately, core software encountered some problems with the link to the execution framework which resulted in delays. They indicate that the problem might be resolved later this week. In the meantime, Naim changed the CACE R script so that it can handle multiple traits. Brian asked if it can handle missing data, and Naim answered that it can. However, he has not tested it. He will immediately test it and ensure that there are no problems.
Naim was informed that the problems have been resolved and that the development is now focused on the integration. He was told that he should be able to run R scripts by Dec 3rd. The core software team can now run R scripts and is actively integrating DACE into the DE. Integration is still ongoing. Naim will inform the WG as soon as the tools become available.

DACE and CACE were released on March 11th as part of the 0.3 DE Release.

Concurrent implementation of other tools

Regarding the concurrent implementation of other software, Naim reports that AncML cannot be compiled on the Condor machines, and suggests dropping that software, especially given that the ace implementation is underway. Naim also informs the group that Joe is planning to revise PHYLIP and that he could use some help with coding from iPlant; Naim and Sheldon are looking into it. Joe has revised the contrast code and will send it to Naim for integration. Joe is also planning to add new functionalities that are highly relevant for Trait Evolution and could provide better performance. The WG thinks that given the scope, this could be an incubator project and the group will discuss this possibility. Brian mentioned that the iPlant software should be open source and thus made available. Naim indicates that the problem might be what licenses to use. Joe points out that PHYLIP code is not open source and thus iPlant needs to make sure not to accidentally include it under an open source license and that this fact is clearly indicated. On a related note, iPlant is reviewing its policy for code and process transparency. Two other tools have been identified for integration: a tool to perform ancestral state reconstruction using parsimony and a tool to perform Diversity analyses (spun off geiger). These will be discussed in future meetings.

geiger integration

Another priority identified by the working group is Model Fitting. The R package geiger offers this functionality and Naim will start looking into it. Barb also points out that there are some additional general methods in geiger that could be useful for tree/data matching (e.g. dropping taxa from a tree in case of missing data). Brian contacted Luke Harmon, the author of geiger, and invited him to join the WG. The group started to discuss the integration priorities of the different components of geiger. The functions that have been identified are fitContinuous, fitDiscrete and dtt. Tree transformation functions (deltaTree etc.) can also be used in conjunction with the model fitting function to display the new tree shape. Naim points out that in the near future it will be possible and particularly easy to link tools to one another into workflows. Luke indicates that according to his experience, the most used function is fitContinuous. Accordingly, the implementation of that function will be prioritized and all functionalities will be available to the users. It should be possible for the user to select a number of models (from 1 to all) to be fitted to the data in the same job. The user should be warned if the parameter estimates are close to the function bounds. Naim and Jeremy will work with Luke to modify the optimization functions to perform faster and more reliably.
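
The bounds warning mentioned above could look roughly like the following. This is a sketch only, not geiger's behavior: the function name `near_bounds`, the 1% relative tolerance, and the parameter names and bounds in the example are assumptions chosen for illustration.

```python
def near_bounds(estimates, bounds, rel_tol=0.01):
    """Return warning messages for estimates that sit close to a bound.

    estimates: {parameter_name: fitted_value}
    bounds:    {parameter_name: (lower, upper)}
    rel_tol:   fraction of the bound range treated as "close" (assumed value)
    """
    messages = []
    for name, value in estimates.items():
        lower, upper = bounds[name]
        margin = rel_tol * (upper - lower)
        if value - lower <= margin:
            messages.append(f"{name}={value:g} is at or near its lower bound {lower:g}")
        elif upper - value <= margin:
            messages.append(f"{name}={value:g} is at or near its upper bound {upper:g}")
    return messages

# Hypothetical fitted values and search bounds.
fit = {"sigsq": 0.5, "alpha": 0.0001}
search_bounds = {"sigsq": (0.0, 10.0), "alpha": (0.0, 5.0)}
warnings_for_user = near_bounds(fit, search_bounds)
```

The messages could be written to the same text output as the other errors and warnings of the job.
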
Two additional points of broader relevance are raised.

  1. As the number of integrated tools increases it will be crucial to provide the users with the details of and bibliographic references to the methods. This information is stored in the tool description and should be outputted as a text file with every job. 
  2. The output of the function dtt includes a graphical plot. As far as members of the group can tell, there is no plan to integrate a plotter into the discovery environment. Because a plotter could be a cross cutting need, Ann will follow up with the other working groups and in particular with TV to assess their needs and plans. As the integration of a plotter could take some months, the short term solution is to use R's plotting capabilities and output pdf files. Ann points out that the main missing piece is performing the statistical analyses necessary to turn the raw output data into visual information. Naim mentions that he has started discussing with core software a framework to directly connect visual tools through adaptors. He thinks that the framework could be generalized to tool output as well. Ann also mentions that DNA Subway has a framework for tool integration and that it would be valuable to follow their development.

Jeremy worked towards improving the performance of the algorithm to compute the variance-covariance matrix used in geiger. Using matrix algebra, his implementation reduces the runtime for a 10,000 tip tree from >12 hours to approximately 4 minutes! This implementation requires a lot of memory (~10 GB) and a 64-bit architecture. Naim will check with Nirav whether he has any suggestions.
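
For readers unfamiliar with the matrix in question: under Brownian motion, the entry for two tips is the branch length they share between the root and their most recent common ancestor, and the diagonal is each tip's distance from the root. The sketch below computes it with a plain tree traversal. It is not Jeremy's matrix-algebra implementation and is not tuned for large trees; the nested-tuple tree layout, the function names, and the three-tip example are assumptions for illustration.

```python
def shared_path_lengths(node, depth=0.0, shared=None):
    """Collect root-to-MRCA path lengths for every pair of tips.

    A tip is (name, branch_length); an internal node is
    (children, branch_length) with children a tuple of nodes.
    """
    if shared is None:
        shared = {}
    label, length = node
    depth += length
    if isinstance(label, str):
        shared[(label, label)] = depth      # variance: root-to-tip distance
        return [label], shared
    tips_per_child = [shared_path_lengths(child, depth, shared)[0] for child in label]
    # Tips in different child clades share the path down to this node only.
    for i, group_a in enumerate(tips_per_child):
        for group_b in tips_per_child[i + 1:]:
            for a in group_a:
                for b in group_b:
                    shared[(a, b)] = shared[(b, a)] = depth
    return [tip for group in tips_per_child for tip in group], shared

def vcv_matrix(root):
    """Return tip names and the matrix as a list of rows."""
    tips, shared = shared_path_lengths(root)
    return tips, [[shared[(a, b)] for b in tips] for a in tips]

# Hypothetical tree ((A:1,B:1):1,C:2)
clade_ab = ((("A", 1.0), ("B", 1.0)), 1.0)
tree = ((clade_ab, ("C", 2.0)), 0.0)
tip_names, matrix = vcv_matrix(tree)
```

The pairwise loops make the quadratic memory cost visible: a 10,000 tip tree needs 10^8 entries, which is consistent in order of magnitude with the memory requirement noted above.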

Barb notes that Diversity through Time does not fit in with the other functions (fitContinuous, fitDiscrete and deltaTree) and the group agrees to split it from this project and discuss it as a separate project. Also, the name of the project was changed to Model Testing for Mode and Tempo to better reflect the tool's functionalities. Barb points out that we don't have a clear idea of who the target user is, or of what we want to expose from these functions and how. The group decides to discuss the scope of this project in the next meeting.

Methods' input validation

To minimize software errors after job submission, a validation step is necessary. The way it is currently designed, the validation only checks whether input values and filetypes are correct. However, a major source of errors is going to be the file content, rather than its format. In particular, a newick file might contain a correctly specified tree, but the tree might not be suitable for certain kinds of analyses (e.g. rooted, additive,...). A compatibility table could be used in that case to ensure that the chosen method will accept the input provided. The group also thinks it would be useful to have such a chart. Naim will put it online.
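
A content check driven by a compatibility table might be sketched as follows. The tree layout, the property names (`binary`, `branch_lengths`, `ultrametric`) and, in particular, the requirements listed per method are placeholders for illustration; the actual table is the one the group still has to draw up.

```python
def all_nodes(node):
    """Yield every node; a tip is (name, length), a clade is (children, length)."""
    yield node
    if not isinstance(node[0], str):
        for child in node[0]:
            yield from all_nodes(child)

def tip_depths(node, depth=0.0):
    """Root-to-tip distances, treating a missing length as 0."""
    label, length = node
    depth += length or 0.0
    if isinstance(label, str):
        return [depth]
    return [d for child in label for d in tip_depths(child, depth)]

def tree_properties(root):
    """Derive the properties of a parsed tree that methods may depend on."""
    props = set()
    nodes = list(all_nodes(root))
    if all(len(n[0]) == 2 for n in nodes if not isinstance(n[0], str)):
        props.add("binary")
    if all(n[1] is not None for n in nodes if n is not root):
        props.add("branch_lengths")
        depths = tip_depths(root)
        if max(depths) - min(depths) < 1e-9:
            props.add("ultrametric")
    return props

# Placeholder compatibility table: method name -> required tree properties.
REQUIREMENTS = {
    "contrast": {"binary", "branch_lengths"},
    "dtt": {"binary", "branch_lengths", "ultrametric"},
}

def missing_requirements(method, root):
    """Return the required properties that the tree does not have."""
    return REQUIREMENTS[method] - tree_properties(root)

# Hypothetical tree (A:1,(B:0.5,C:0.7):0.5) whose tips are not equidistant from the root.
tree = ([("A", 1.0), ([("B", 0.5), ("C", 0.7)], 0.5)], None)
problems = missing_requirements("dtt", tree)
```

An empty result means the job can be submitted; otherwise the missing properties can be reported to the user before anything runs.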

GUI

One important aspect is to provide a powerful tool and at the same time offer a clean interface that would not dissuade the naive user. Brian indicates that it would be better to avoid an extra window for the advanced options. He suggests as an alternative to separate basic vs. advanced options in the same window and give prominence to the basic ones (e.g. gray vs. white background). Another instance is how to organize model selection, as the parameters required depend on the choice of model. The preferred solution is to have the models listed in a drop-down box and the parameter names/boxes created or renamed accordingly. Naim discussed the GUI options with core software, which confirmed that the "two groups in a panel" option is possible. Naim has also started discussing workflows and organization with core software. TreeTapper provides a good template for how phylogenetic tools can be organized.

Test users

Ann is in contact with a small college which would make for a good use case for relatively naive users. The people at the college indicated their willingness to be a test user group. Andrea Schwarzbach from UTBrownsville has data sets on terpenes and volatiles, some with medicinal effects.

Tree visualization

Working with Nicole, I found some documents indicating the needs of TE with regard to visualization. It seems that edge coloring is the main priority: Metadata wish list and Brian also provided some examples of relevant trees: Cross cutting needs analysis. Upon Eric's suggestion, the group takes a look at the current state of the Phyloviewer. Brian confirms that edge (and node) coloring is the top priority, including the possibility to color edges with gradients and the coloring of the triangles (collapsed lineages). The color of the triangle should represent the proportion of lineages in which a certain trait is present, either by subdividing the triangle or by averaging the colors. Naim reports that the visualization group is leaning towards supporting nexml and that he will work in contact with the Tree Visualization group on the domain model. Brian reiterated that wedge coloring is a requirement and that users should be able to choose whether they want average color or proportion. [btw, "wedge" here isn't a typo, as I first thought -- it refers to the triangle of an unresolved clade -- Brian]. Naim reports that the focus of the Tree Visualization group is now being redirected towards implementing the needs of the working groups. He compiled a list of requirements based on wiki documents and discussions regarding visualization and invites the group to review and modify it: https://pods.iplantcollaborative.org/wiki/display/iptol/Visualization+Needs However, Ann points out that the value tables are the most relevant output of the analyses. Naim reports on the outcome of the most recent TV meeting. Currently the major holdup is the lack of an interface and TV discussed some available options (toolbox, right click, visualization mode vs. editing). The group finds that these options have several drawbacks (right click / modifier key might not work as well on all platforms, toolboxes are often cluttered and non-intuitive).
Barb suggested that the most efficient and intuitive option is to have a collapsible side menu like the one found in FigTree. Brian gave a quick presentation of the software and the consensus was that this solution was indeed the preferred one. Naim will pass on this information to TV. Naim reports from the latest TV meeting that the left-handed collapsible menu has been adopted for the TV prototype and that the viewer development is proceeding very well.
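
The two wedge-coloring modes Brian asks for (proportion vs. average color) differ only in how the tip states of a collapsed clade are summarized. A minimal sketch, assuming discrete tip states and an RGB palette; the function names, the palette and the example clade are illustrative and not part of any viewer API:

```python
from collections import Counter

def wedge_proportions(tip_states):
    """Fraction of tips in each state; used to subdivide the triangle."""
    counts = Counter(tip_states)
    return {state: count / len(tip_states) for state, count in counts.items()}

def wedge_average_color(tip_states, palette):
    """Single RGB color: the palette colors weighted by state proportion."""
    proportions = wedge_proportions(tip_states)
    return tuple(
        round(sum(share * palette[state][channel] for state, share in proportions.items()))
        for channel in range(3)
    )

# Hypothetical collapsed clade of four tips and a two-color palette.
palette = {"present": (255, 0, 0), "absent": (0, 0, 255)}
clade = ["present", "present", "present", "absent"]
slices = wedge_proportions(clade)
blended = wedge_average_color(clade, palette)
```

Averaging in RGB space is the simplest choice; a viewer might prefer to blend in a perceptual color space instead.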

Taxa name matching

Barb reports that she had some issues with the interface that allows users to match taxa names in the datafile to those in the tree file, especially for large groups. Barb collected problematic cases and provided some suggestions on how to solve them: https://pods.iplantcollaborative.org/jira/browse/TRAILEVOL-8. Some of the problems will be solved (or mitigated) by the Taxonomic Name Resolution Service (TNRS) and Naim will discuss it and the other issues identified by Barb with core. Naim has seen a prototype of an improved dialogue that will address these concerns. Barb will start working in close cooperation with iPlant on a project to match tree and data. Bill Piel joined the group to discuss the integration of the DataMatch tool with the TNRS. Although an API doesn't exist yet for the TNRS, Bill indicates that the syntax will be similar to what is used in the GNI Parser. He also points out that the TNRS will sometimes return possible synonyms and that a system should be in place for the user to indicate how to resolve synonyms. The preferred approach is to have an automatic mode that will accept only one of the returned names, and to have a manual mode that returns the synonyms and allows the user to pick which ones to use. Bill is also interested in having some data examples, so that he can already identify features that would not work with the TNRS (e.g. underscores vs dot, abbreviated genera). The group proceeds to identify the following milestones: data-tree matching (Next meeting, March 29th), data-data and tree-tree, TNRS integration. Barb and Jeremy will work on the backend and Naim will coordinate with Core Software for the UI. When necessary, we will also coordinate meetings with the TNRS group to discuss the integration.
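
Independently of the TNRS (whose API does not exist yet and is not modeled here), the formatting issues Bill mentions, underscores vs. dots and abbreviated genera, can be handled locally before any lookup. The sketch below is one possible approach; the function names and the example names are assumptions. Names that match nothing, or more than one tree name, are returned for manual resolution rather than guessed.

```python
import re

def normalize(name):
    """Lowercase a taxon name and turn underscores, dots and runs of spaces into one space."""
    return re.sub(r"[_.\s]+", " ", name).strip().lower()

def match_names(data_names, tree_names):
    """Map data names to tree names; return (matched, needs_manual_review)."""
    index = {}
    for tree_name in tree_names:
        index.setdefault(normalize(tree_name), []).append(tree_name)
    matched, unresolved = {}, []
    for data_name in data_names:
        key = normalize(data_name)
        hits = index.get(key, [])
        genus, _, epithet = key.partition(" ")
        if not hits and len(genus) == 1 and epithet:
            # Abbreviated genus such as "A. thaliana": compare initial and epithet.
            hits = [
                name
                for tree_key, names in index.items()
                for name in names
                if tree_key.startswith(genus) and tree_key.partition(" ")[2] == epithet
            ]
        if len(hits) == 1:
            matched[data_name] = hits[0]
        else:
            unresolved.append(data_name)    # no hit or ambiguous: leave to the user
    return matched, unresolved

# Hypothetical example names.
tree_tips = ["Quercus alba", "Arabidopsis thaliana", "Zea mays"]
data_rows = ["Quercus_alba", "A. thaliana", "Oryza sativa"]
matched, unresolved = match_names(data_rows, tree_tips)
```

This corresponds to the automatic mode only where the match is unique; everything in `unresolved` would go to the manual dialogue.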

Other

Completed items

Items

  • Make Ackerly data available through the DE at release -- Naim. The limits of the current DE version prevent this feature.
  • Remove shorebird dataset from the DE -- Naim. Filed a ticket to have the dataset removed.

Datasets

Ann points out that it will be very important to offer the tools at the same time as the big tree is unveiled. Another requirement is to have datasets to showcase the tool functionalities, for user training and for teaching. One option could be to match published trees with published data, stored in repositories. Ann points out that having the Big Tree, we will have to focus on the data aspect only. Naim and Jeremy will look into the GBIF data. Barb warns us that taxon name synonyms can be a very touchy subject. Naim reports that one solution floated by the Taxonomic Name Resolution Service WG is to allow synonyms and offer discussion pages for the community to comment on. It is also pointed out that it will be critical to obtain authorization and ensure proper citing: the analysis output should directly mention the source of the data. Given the ability of the DE to track the history of a file, Naim thinks that this might be possible to implement and will investigate with core software. Naim has also contacted a researcher in Ecology and Evolutionary Biology at the University of Arizona who is working on trait evolution on a ~1000 taxa tree. The Data Integration group plans to regularly ingest data from repositories (NCBI), build a phylogenetic tree and make the tree available through the DE. This will require a tool capable of automatically pruning large trees to match datasets, and Naim asked Barb if she would be willing and interested in working on that tool as she had already raised the issue of matching tree and data (see below). Based on that previous discussion, Naim had already started drafting the scope for this project and Barb agreed to take over.
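
The pruning step amounts to dropping tips that have no data and merging any branch left with a single descendant into its parent branch. A minimal recursive sketch follows; it is not the tool Barb will scope, and the tree layout, the function name `prune` and the example are assumptions. A recursive version like this would also need reworking for very large trees.

```python
def prune(node, keep):
    """Return the tree restricted to the tip names in `keep`, or None if none remain.

    A tip is (name, branch_length); a clade is (children, branch_length)
    with children a list of nodes.
    """
    label, length = node
    if isinstance(label, str):
        return node if label in keep else None
    kept = [p for p in (prune(child, keep) for child in label) if p is not None]
    if not kept:
        return None
    if len(kept) == 1:
        # A single surviving child: merge its branch with this one.
        child_label, child_length = kept[0]
        return (child_label, child_length + length)
    return (kept, length)

# Hypothetical tree ((A:1,B:1):1,(C:1,D:1):1) and a dataset lacking taxon B.
tree = ([([("A", 1.0), ("B", 1.0)], 1.0), ([("C", 1.0), ("D", 1.0)], 1.0)], 0.0)
taxa_with_data = {"A", "C", "D"}
pruned = prune(tree, taxa_with_data)
```

Taxa present in the data but absent from the tree need the reverse check, which is a plain set difference against the tree's tip names.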

User personas, user stories, use cases and acceptance tests

The core software team requires user stories for the analyses being currently implemented (CACE and DACE). The WG has already collected such information for the PIC implementation and it would be useful to review what is already there as a starting point. The group discussed the various reasons to have user personas and user stories. From the development perspective these stories will greatly facilitate the writing of the necessary documentation for the tools as well as provide the base for the user acceptance tests. Given the progress with the integration of DACE and CACE, the development of such stories has become more urgent. The stories should be somewhat application specific but also include more general aspects (data gathering, visualization,...). Ann points out that she has started to work with the Dolan center towards integrating trait evolution methods into the DNA subway. This offers a formidable opportunity to provide a tool that can be used to teach evolutionary concepts. The group will also take the educational needs into account in their user stories. The stories will be developed interactively and iteratively by the group through a wiki page. A linked wiki page will be used to define user acceptance tests and test datasets. These are basic tests required to make sure that the applications run properly within the iPlant environment. The use of the dataset (testing, example, illustrating,...) will need to be indicated. The tests for CACE and DACE have been written and submitted to the QA team that will start testing as soon as the tools come online.

Next goals for TE

As iPlant CI matures, the Working Group should be able to work more independently and be able to integrate and consume resources more easily. With this framework in mind, it would be valuable to provide the postdocs with greater access to the CI and development resources so that they can more effectively contribute their expertise and at the same time use or develop these resources for their own research, for example in improving software to handle large datasets. Consistent with the new SCiPlant framework, Naim has started organizing and scoping the TE projects. Jeremy and Barb will be more involved in the active development and tool integration.

Performance

Naim will work on performance testing of the various applications, with particular focus on buffer underflows, which could yield faulty results. Additionally, Jeremy had pointed out the possibility to use another library (NLopt) for maximum likelihood optimization. Testing will provide a measure to decide whether the other library should be used. Following a discussion with core software, Naim tested the computational overhead associated with parsing newick strings into (and deparsing from) the edges and nodes format used internally by ape and geiger. The results indicate that there is a 9x increase in computational time when newick strings are parsed/deparsed, which adds an average of 4 minutes for a 500K tree (however with peaks reaching 17 minutes). Although not a major issue, this is something that should be taken into account when loading, saving and exchanging data among applications. Brian suggests that given the small size of tree files, an option could be to store trees in both formats and provide the correct one to the various applications. Brian mentions that because of a low adoption rate, support for phyloXML and nexml is not a high priority.
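
To make the parse/deparse step concrete, the sketch below converts a Newick string into a list of (parent, child, length) edges plus a label table, and back. It only illustrates the idea of an edge-based representation; it is not ape's parser or its exact internal layout, it ignores quoted labels and comments, and the function names and node numbering are assumptions.

```python
def parse_newick(text):
    """Parse a simple Newick string into (root_id, edges, labels)."""
    text = text.strip().rstrip(";")
    edges, labels = [], {}
    pos = 0
    next_id = 0

    def read_until(stops):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stops:
            pos += 1
        return text[start:pos].strip()

    def parse_node():
        nonlocal pos, next_id
        node_id = next_id
        next_id += 1
        if pos < len(text) and text[pos] == "(":
            pos += 1                                # consume "("
            while True:
                child_id, child_length = parse_node()
                edges.append((node_id, child_id, child_length))
                if text[pos] == ",":
                    pos += 1                        # another sibling follows
                    continue
                pos += 1                            # consume ")"
                break
        label = read_until(":,)")
        if label:
            labels[node_id] = label
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            length = float(read_until(",)"))
        return node_id, length

    root_id, _ = parse_node()
    return root_id, edges, labels

def to_newick(root_id, edges, labels):
    """Rebuild a Newick string from the edge list."""
    children = {}
    for parent, child, length in edges:
        children.setdefault(parent, []).append((child, length))

    def write(node_id):
        parts = [
            write(child) + ("" if length is None else f":{length:g}")
            for child, length in children.get(node_id, [])
        ]
        return ("(" + ",".join(parts) + ")" if parts else "") + labels.get(node_id, "")

    return write(root_id) + ";"

# Hypothetical example tree.
source = "((A:1,B:2):0.5,C:3);"
root, edge_list, names = parse_newick(source)
round_trip = to_newick(root, edge_list, names)
```

Storing both forms, as Brian suggests, would let applications skip this conversion entirely.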

Jeremy and Naim worked on the warnings generated by ace and pinpointed the origin to the optimization routines. The errors are caused by the behavior of the underlying optimization routine (nlminb, package stats, R's interface to Netlib's PORT routines) when dealing with non-numeric (and not Inf) results. Such results are generated when the function is evaluated at 0. The problem can be easily resolved either by removing 0 from the bounds or by catching the non-numeric results. Jeremy also suggested using better matrix exponentiation functions and modified the code to allow fixing of root states. Naim presented the results on the completed analysis and will contact the author of ape. Emmanuel Paradis responded immediately and has integrated the suggested bugfixes into a new version of ape (6.2-2) which has already been deployed across iPlant infrastructure.
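
The two remedies named above, excluding 0 from the bounds and catching non-numeric results, can be illustrated generically. The sketch below is in Python rather than R and is not the patch that went into ape; the function names, the penalty value, the epsilon and the toy objective are all assumptions.

```python
import math

def guard_objective(objective, penalty=1e50):
    """Wrap an objective so that failures and non-finite values become a large penalty."""
    def guarded(x):
        try:
            value = objective(x)
        except (ValueError, ZeroDivisionError, OverflowError):
            return penalty
        return value if math.isfinite(value) else penalty
    return guarded

def exclude_zero(lower, upper, eps=1e-8):
    """Move a lower bound of 0 (or below) to a small positive value."""
    return max(lower, eps), upper

# Toy negative log-likelihood in a rate parameter; undefined at rate = 0.
def neg_log_lik(rate):
    return 2.0 * rate - 3.0 * math.log(rate)

safe_nll = guard_objective(neg_log_lik)
bounds = exclude_zero(0.0, 100.0)
at_zero = safe_nll(0.0)
at_interior = safe_nll(1.5)
```

Either remedy alone keeps the optimizer from seeing a non-numeric value; using a penalty has the side effect of making the objective discontinuous at the boundary, which is one reason to prefer adjusting the bounds where that is acceptable.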

Tree visualization

Review list of visualization needs: https://pods.iplantcollaborative.org/wiki/display/iptol/Visualization+Needs

Data donation

To be investigated: an optional “make trees available for internal testing” functionality when trees are uploaded for analysis. Nicole suggested that this function can be made available within the planned collaborative framework by creating an "iPlant Testing" user group.

Method list

I have received some suggestions of additional methods that could be integrated in the discovery environment. At present the focus is on the ones already identified, but I think keeping track of the various possibilities could be useful in the longer term. A wiki page for such a list will be created. Also, Brian will invite Ann Stapleton, who provided the suggestions, to join the working group. Wiki page: Software Wishlist

Jira adoption

Naim set up a jira page to track, discuss, and collaboratively work on action items, issues and tasks. The project page is located at https://pods.iplantcollaborative.org/jira/browse/TRAILEVOL.