TE101019

Trait Evolution

Oct. 19, 2010

Invitees/Attendees (in orange)

Barb Banbury (bbanbury@utk.edu)
Jeremy Beaulieu (jeremy.beaulieu@yale.edu)
Joe Felsenstein (joe@gs.washington.edu)
Eric Lyons (elyons@iplantcollaborative)
Naim Matasci (nmatasci@iplantcollaborative.org)
Sheldon McKay (sheldon.mckay@gmail.com)
Brian O'Meara (omeara.brian@gmail.com)
Ann Stapleton (stapletona@uncw.edu)

Topics for discussion (updates in orange)

ape integration and timeline

The R statistical software package ape (Analyses of Phylogenetics and Evolution) offers a comprehensive suite of phylogenetic methods. A key point in its favor is that its ace function can estimate the uncertainty of ancestral state reconstructions. Naim has started testing ace and whether R code can be run efficiently on the Condor cluster. Another candidate package for integration is geiger, which includes several methods for model fitting.

Naim reports that he successfully ran ace on the Condor cluster. Because the GUI and filetypes for ace are the same as for contrasts, he thinks he can have a working demo within 2 weeks. In addition, because ace can also estimate discrete character states, he thinks he can provide that functionality too. One aspect that has not been dealt with yet is the form of the output, and in particular the need to merge the tree with the internal node value estimates; Naim will look into the available solutions. Brian and Barb also point out that people will use ace to reconstruct ancestral character states, whereas if they want the model parameters they might prefer geiger. Another important part of the output is the errors and warnings produced by the scripts.

Naim reports that he encountered some issues with running R that are now solved. He has submitted the code to core software development, who estimate that the implementation will take about a week, starting on Wed/Thurs. Naim will work on performance testing of the various applications, with particular focus on buffer underflows, which could yield faulty results. Additionally, Jeremy had pointed out the possibility of using another library (NLopt) for maximum likelihood optimization. Testing will provide a measure for deciding whether that library should be used.

Concurrent implementation of other tools

Regarding the concurrent implementation of other software, Naim reports that AncML cannot be compiled on the Condor machines and suggests dropping it, especially given that the ace implementation is underway. Naim also informs the group that Joe is planning to revise PHYLIP and could use some coding help from iPlant; Naim and Sheldon are looking into it. Joe has revised the contrast code and will send it to Naim for integration. Joe is also planning to add new functionality that is highly relevant for Trait Evolution and could provide better performance. The WG thinks that, given the scope, this could be an incubator project, and the group will discuss this possibility.

geiger integration

Another priority identified by the working group is Model Fitting. The R package geiger offers this functionality and Naim will start looking into it. Barb also points out that geiger has some additional general methods that could be useful for tree/data matching (e.g. dropping taxa from a tree in case of missing data).
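The tree/data matching idea can be illustrated with a minimal Python sketch. It is not geiger code: the function name match_taxa, the example taxa and the convention that None marks a missing value are all assumptions made for illustration.

```python
def match_taxa(tree_tips, trait_data):
    """Compare tree tip labels with trait records (illustrative helper only).

    tree_tips  -- iterable of tip labels taken from the tree
    trait_data -- dict mapping taxon name to a trait value (None = missing)
    """
    tips = set(tree_tips)
    # Taxa that actually have a usable trait value.
    with_data = {name for name, value in trait_data.items() if value is not None}
    shared = sorted(tips & with_data)                 # usable in the analysis
    drop_from_tree = sorted(tips - with_data)         # tips without data
    drop_from_data = sorted(set(trait_data) - tips)   # records without a tip
    return shared, drop_from_tree, drop_from_data


# Made-up example input.
tips = ["Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla", "Pongo_abelii"]
traits = {
    "Homo_sapiens": 1.2,
    "Pan_troglodytes": 0.9,
    "Gorilla_gorilla": None,   # present in the data file, but value missing
    "Macaca_mulatta": 0.7,     # not in the tree
}
shared, drop_tree, drop_data = match_taxa(tips, traits)
```

Reporting the three lists separately lets the user review what would be dropped before anything is pruned from the tree.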

Methods' input validation

A validation step is necessary to minimize software errors after job submission. As currently designed, the validation only checks whether input values and filetypes are correct. However, a major source of errors is going to be the file content rather than its format. In particular, a newick file might contain a correctly specified tree that is nevertheless unsuitable for certain kinds of analyses (e.g. those that require a rooted or additive tree). A compatibility table could be used in that case to ensure that the chosen method will accept the input provided. The group agrees that such a chart would be useful. Naim will put it online.
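One possible shape for such a compatibility table is sketched below in Python. The method names and their requirements are placeholders, not the real requirements of any tool discussed here; only the two tree properties (rooted, additive) come from the discussion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeProperties:
    """Properties of an uploaded tree, assumed to be determined elsewhere."""
    rooted: bool
    additive: bool


# Hypothetical compatibility table: method name -> properties the tree must have.
COMPATIBILITY = {
    "method_requiring_rooted_tree": {"rooted"},
    "method_requiring_rooted_additive_tree": {"rooted", "additive"},
    "method_with_no_requirements": set(),
}


def unmet_requirements(method, tree):
    """Return the sorted list of required properties that the tree lacks."""
    if method not in COMPATIBILITY:
        raise KeyError(f"no compatibility entry for method {method!r}")
    return sorted(prop for prop in COMPATIBILITY[method] if not getattr(tree, prop))


unrooted_additive = TreeProperties(rooted=False, additive=True)
problems = unmet_requirements("method_requiring_rooted_additive_tree", unrooted_additive)
```

An empty list means the job can be submitted; a non-empty list can be shown to the user before submission instead of failing later on the cluster.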

GUI

One important aspect is to provide a powerful tool while offering a clean interface that does not put off the naive user. Brian indicates that it would be better to avoid an extra window for the advanced options. As an alternative, he suggests separating basic and advanced options within the same window and giving prominence to the basic ones (e.g. gray vs. white background). Another question is how to organize model selection, as the parameters required depend on the choice of model. The preferred solution is to list the models in a drop-down box and to create or rename the parameter names/boxes accordingly. Naim will investigate these possibilities with core software.
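The drop-down idea can be sketched as a lookup from the selected model to the fields the form should show. This is a Python illustration only: the model names, parameter labels and the choice of white for basic and gray for advanced fields are assumptions, not decisions recorded by the group.

```python
from typing import NamedTuple


class Field(NamedTuple):
    label: str
    advanced: bool = False


# Hypothetical models and parameter labels, for illustration only.
MODEL_FIELDS = {
    "Model A": [Field("rate")],
    "Model B": [Field("rate"), Field("optimum"), Field("starting value", advanced=True)],
}


def fields_for(model):
    """Return (label, background) pairs for the boxes shown in the single window."""
    return [
        (field.label, "gray" if field.advanced else "white")
        for field in MODEL_FIELDS[model]
    ]


boxes = fields_for("Model B")
```

Changing the drop-down selection would call fields_for again and rebuild or rename the boxes from the returned list.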

Datasets

Ann points out that it will be very important to offer the tools at the same time as the big tree is unveiled. Another requirement is to have datasets to showcase the tool functionalities, for user training and for teaching.

Tree visualization

Working with Nicole, I found some documents indicating the needs of TE with regard to visualization. It seems that edge coloring is the main priority (see: Metadata wish list), and Brian also provided some examples of relevant trees (see: Cross cutting needs analysis). Upon Eric's suggestion, the group takes a look at the current state of the Phyloviewer. Brian confirms that edge (and node) coloring is the top priority, including the possibility of coloring edges with gradients and of coloring the triangles (collapsed lineages). The color of a triangle should represent the proportion of lineages in which a certain trait is present, either by subdividing the triangle or by averaging the colors. Naim reports that the visualization group is leaning towards supporting nexml and that he will work with the Tree Visualization group on the domain model. Brian reiterated that wedge coloring is a requirement and that users should be able to choose whether they want average color or proportion. [btw, "wedge" here isn't a typo, as I first thought -- it refers to the triangle of an unresolved clade -- Brian]
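The two wedge-coloring options can be made concrete with a small Python sketch. It assumes a binary trait (present/absent) and RGB colors, and the function name wedge_fill is made up for illustration; it is not Phyloviewer code.

```python
def wedge_fill(tip_states, present_color, absent_color, mode="average"):
    """Describe how to fill the triangle of a collapsed clade.

    tip_states -- list of booleans, True where the trait is present
    colors     -- (r, g, b) tuples with components in 0..255
    Returns a list of (color, fraction_of_triangle) segments.
    """
    if not tip_states:
        raise ValueError("a collapsed clade needs at least one tip")
    proportion = sum(1 for state in tip_states if state) / len(tip_states)
    if mode == "proportion":
        # Subdivide the triangle: one segment per state, sized by frequency.
        return [(present_color, proportion), (absent_color, 1.0 - proportion)]
    if mode == "average":
        # Blend the two colors, weighted by how common the trait is.
        blended = tuple(
            round(p * proportion + a * (1.0 - proportion))
            for p, a in zip(present_color, absent_color)
        )
        return [(blended, 1.0)]
    raise ValueError(f"unknown mode: {mode!r}")


RED, BLUE = (255, 0, 0), (0, 0, 255)
states = [True, True, True, False]  # trait present in 3 of 4 lineages
averaged = wedge_fill(states, RED, BLUE, mode="average")
subdivided = wedge_fill(states, RED, BLUE, mode="proportion")
```

Exposing the choice as a single mode argument mirrors the requirement that users can pick between average color and proportion.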

Open action items (see also TE jira page)

  • Find out release plans for big tree -- Naim
  • Post screenshots of issues with Taxon Name matching -- Barb (https://pods.iplantcollaborative.org/jira/browse/TRAILEVOL-8)
  • Send Contrast code -- Joe
  • Identify test datasets for teaching and presentation -- WG
  • Identify test users -- WG
  • Investigate datasets available through MyPlant/Data integration -- Naim

Completed items

Data donation

Investigate an optional “make trees available for internal testing” functionality for trees uploaded for analysis. Nicole suggested that this function can be made available within the planned collaborative framework by creating an "iPlant Testing" user group.

Method list

I have received some suggestions of additional methods that could be integrated into the discovery environment. At present the focus is on the ones already identified, but I think keeping track of the various possibilities could be useful in the longer term. A wiki page for such a list will be created. Also, Brian will invite Ann Stapleton, who provided the suggestions, to join the working group. Wiki page: Software Wishlist

Taxa name matching

Barb reports that she had some issues with the interface that allows users to match taxa names in the datafile to those in the tree file, especially for large groups. She will use it more and collect ideas on how to make the process more efficient (e.g. automatic taxa dropping).

Jira adoption

Naim set up a jira page to track, discuss, and collaboratively work on action items, issues and tasks. The project page is located at https://pods.iplantcollaborative.org/jira/browse/TRAILEVOL.