Nov 30, 2010
Invitees/Attendees (in orange)
Barb Banbury (firstname.lastname@example.org)
Jeremy Beaulieu (email@example.com)
Joe Felsenstein (firstname.lastname@example.org)
Eric Lyons (elyons@iplantcollaborative)
Naim Matasci (email@example.com)
Sheldon McKay (firstname.lastname@example.org)
Brian O'Meara (email@example.com)
Ann Stapleton (firstname.lastname@example.org)
Luke Harmon (email@example.com)
Topics for discussion (updates in orange)
User personas, user stories, use cases and acceptance tests
The core software team requires user stories for the analyses currently being implemented (CACE and DACE). The WG has already collected such information for the PIC implementation, and it would be useful to review what is already there as a starting point. The group discussed the various reasons to have user personas and user stories. From the development perspective, these stories will greatly facilitate writing the necessary documentation for the tools and will provide the basis for the user acceptance tests. Given the progress with the integration of DACE and CACE, the development of such stories has become more urgent. The stories should be somewhat application specific but also include more general aspects (data gathering, visualization, ...). Ann points out that she has started working with the Dolan center towards integrating trait evolution methods into the DNA subway. This offers a formidable opportunity to provide a tool that can be used to teach evolutionary concepts. The group will also take the educational needs into account in their user stories. The stories will be developed interactively and iteratively by the group through a wiki page.
ape integration and timeline
The R statistical software package ape (Analyses of Phylogenetics and Evolution) offers a comprehensive suite of phylogenetic methods. A key point in favor of ape is that it can estimate the uncertainty of ancestral state reconstructions. Naim has started testing the function and the possibility of efficiently running R code on the Condor cluster. Another candidate package for integration is geiger, which includes several methods for model fitting. Naim reports that he successfully ran ace on the Condor cluster and that, because the GUI and filetypes for ace are the same as for contrasts, he thinks he can have a working demo within 2 weeks. In addition, because ace can also be used to estimate discrete character states, he thinks he may be able to provide that functionality too. One aspect that has not been dealt with yet is what form the output should have, and in particular the need to merge the tree with the internal node value estimates. Naim will look into what possible solutions are available. Brian and Barb also point out that people will use ace to reconstruct the ancestral character state, whereas if they want to obtain the model parameters, they might prefer to use geiger. Another important part of the output is the errors and warnings produced by the scripts. Naim reports that he encountered some issues with running R that are now solved. He has submitted the code to core software development, who estimate that the implementation effort will take about a week, starting on Wed/Thurs. Unfortunately, core software encountered some problems with the link to the execution framework, which resulted in delays. They indicate that the problem might be resolved later this week. In the meantime, Naim changed the CACE R script so that it can handle multiple traits. Brian asked if it can handle missing data, and Naim answered that it can. However, he has not tested it. He will test it immediately and ensure that there are no problems.
Naim was informed that the problems have been resolved and that the development is now focused on the integration. He was told that he should be able to run R scripts by Dec 3rd.
Naim will work on performance testing of the various applications, with particular focus on buffer underflows, which could yield faulty results. Additionally, Jeremy had pointed out the possibility of using another library (NLopt) for maximum likelihood optimization. Testing will provide a measure to decide whether the other library should be used. Following a discussion with core software, Naim tested the computational overhead associated with parsing newick strings into (and deparsing from) the edges and nodes format used internally by ape and geiger. The results indicate that there is a 9x increase in computational time when newick strings are parsed/deparsed, which adds an average of 4 minutes for a 500K tree (with peaks reaching 17 minutes). Although not a major issue, this is something that should be taken into account when loading, saving and exchanging data among applications. Brian suggests that, given the small size of tree files, an option could be to store trees in both formats and provide the correct one to the various applications. Brian mentions that because of a low adoption rate, support for phyloXML and nexml is not a high priority.
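To illustrate what the parse/deparse step involves, here is a minimal Python sketch that converts a simple newick string into an edge list and back. It is only an illustration: the function names and the node numbering are assumptions made for this example, they do not reproduce the internal format of ape or geiger, and quoted labels and comments are not handled.

```python
def newick_to_edges(newick):
    """Parse a simple newick string.

    Returns (root, edges, labels) where edges is a list of
    (parent_id, child_id, branch_length_or_None) and labels maps
    node ids to names. Node ids are assigned in pre-order (hypothetical scheme).
    """
    text = newick.strip().rstrip(";")
    edges, labels = [], {}
    pos = 0
    next_id = 0

    def read_token():
        # Read characters up to the next structural symbol.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:":
            pos += 1
        return text[start:pos].strip()

    def read_node():
        # Recursive descent; very deep trees may need an iterative version.
        nonlocal pos, next_id
        node = next_id
        next_id += 1
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                children.append(read_node())
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError("unexpected character at %d" % pos)
        label = read_token()
        if label:
            labels[node] = label
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            length = float(read_token())
        for child, child_length in children:
            edges.append((node, child, child_length))
        return node, length

    root, _ = read_node()
    return root, edges, labels


def edges_to_newick(root, edges, labels):
    """Deparse the edge list produced above back into a newick string."""
    children = {}
    for parent, child, length in edges:
        children.setdefault(parent, []).append((child, length))

    def write(node):
        parts = []
        for child, length in children.get(node, []):
            part = write(child)
            if length is not None:
                part += ":" + repr(length)
            parts.append(part)
        inner = "(" + ",".join(parts) + ")" if parts else ""
        return inner + labels.get(node, "")

    return write(root) + ";"
```

Storing both representations, as Brian suggests, would avoid repeating this conversion each time a tree is passed between applications.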
Jeremy and Naim worked on the warnings generated by ace and pinpointed the origin to the optimization routines. The errors are caused by the behavior of the underlying optimization routine (nlminb, in package stats, R's interface to Netlib's PORT library) when dealing with non-numeric (and not Inf) results. Such results are generated when the function is evaluated at 0. The problem can be easily resolved either by removing 0 from the bounds or by catching the non-numeric results. Jeremy also suggested using better matrix exponentiation functions and modified the code to allow fixing of root states. Naim presented the results of the completed analysis and will contact the author of ape.
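The two remedies mentioned above (moving the lower bound away from 0, or catching non-numeric results) can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in ace: the objective function is a made-up stand-in for a likelihood that is undefined at 0, and a simple grid search stands in for the real optimizer.

```python
import math


def guarded(objective, penalty=1e50):
    """Wrap an objective so that undefined results become a large finite penalty."""
    def wrapper(x):
        try:
            value = objective(x)
        except (ValueError, ZeroDivisionError, OverflowError):
            return penalty  # e.g. log(0) raises ValueError in Python
        return penalty if math.isnan(value) else value
    return wrapper


def toy_neg_log_lik(rate):
    # Hypothetical objective: undefined at rate == 0, minimum at rate == 2.
    return rate - 2.0 * math.log(rate)


def grid_minimize(objective, lower, upper, steps=1000):
    """Naive bounded minimizer (placeholder for a real optimization routine)."""
    best_x, best_val = None, math.inf
    for i in range(steps + 1):
        x = lower + (upper - lower) * i / steps
        val = objective(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val


# Remedy 1: keep 0 in the bounds but catch the undefined result.
x_guarded, _ = grid_minimize(guarded(toy_neg_log_lik), 0.0, 10.0)

# Remedy 2: remove 0 from the bounds by using a small positive lower limit.
x_shifted, _ = grid_minimize(toy_neg_log_lik, 1e-8, 10.0)
```

Both variants locate the minimum near 2 for this toy function. Which remedy is preferable for ace depends on whether 0 is a meaningful parameter value for the model being fitted.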
Concurrent implementation of other tools
Regarding the concurrent implementation of other software, Naim reports that AncML cannot be compiled on the Condor machines and suggests dropping that software, especially given that the ace implementation is underway. Naim also informs the group that Joe is planning to revise PHYLIP and that he could use some help with coding from iPlant; Naim and Sheldon are looking into it. Joe has revised the contrast code and will send it to Naim for integration. Joe is also planning to add new functionalities that are highly relevant for Trait Evolution and could provide better performance. The WG thinks that, given the scope, this could be an incubator project, and the group will discuss this possibility. Brian mentioned that the iPlant software should be open source and thus made available. Naim indicates that the problem might be what licenses to use. Joe points out that PHYLIP code is not open source, and thus iPlant needs to make sure not to accidentally include it under an open source license and that this fact is clearly indicated. On a related note, iPlant is reviewing its policy for code and process transparency.
Another priority identified by the working group is Model Fitting. The R package geiger offers this functionality and Naim will start looking into it. Barb also points out that there are some additional general methods in geiger that could be useful for tree/data matching (e.g. dropping taxa from a tree in case of missing data). Brian contacted Luke Harmon, the author of geiger, and invited him to join the WG.
Methods' input validation
To minimize software errors after job submission, a validation step is necessary. As currently designed, the validation only checks whether input values and filetypes are correct. However, a major source of errors is going to be the file content, rather than its format. In particular, a newick file might contain a correctly specified tree, but the tree might not be suitable for certain kinds of analyses (for example, an analysis might require a rooted or additive tree). A compatibility table could be used in that case to ensure that the chosen method will accept the input provided. The group also thinks it would be useful to have such a chart. Naim will put it online.
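A compatibility table of this kind could be represented as a simple lookup, as in the Python sketch below. The requirements listed for each method are placeholders chosen for illustration only; the actual entries would come from the chart the group will compile, and the property names are likewise assumptions.

```python
# Hypothetical compatibility table: method name -> required tree properties.
# The entries are placeholders, not the group's agreed requirements.
COMPATIBILITY = {
    "PIC": {"rooted": True, "has_branch_lengths": True},
    "CACE": {"has_branch_lengths": True},
    "DACE": {"has_branch_lengths": True},
}


def check_compatibility(method, tree_properties):
    """Return a list of human-readable problems; an empty list means compatible."""
    if method not in COMPATIBILITY:
        raise KeyError("unknown method: %s" % method)
    problems = []
    for prop, required in COMPATIBILITY[method].items():
        actual = tree_properties.get(prop)
        if actual != required:
            problems.append(
                "%s requires %s=%s, but the tree has %s=%s"
                % (method, prop, required, prop, actual)
            )
    return problems


# Example: an unrooted tree with branch lengths submitted to PIC.
issues = check_compatibility("PIC", {"rooted": False, "has_branch_lengths": True})
```

Running such a check before job submission would let the interface report content problems to the user instead of failing later on the cluster.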
One important aspect is to provide a powerful tool and at the same time offer a clean interface that would not dissuade the naive user. Brian indicates that it would be better to avoid an extra window for the advanced options. He suggests as an alternative to separate basic vs. advanced options in the same window and give prominence to the basic ones (e.g. gray vs. white background). Another instance is how to organize model selection, as the parameters required depend on the choice of model. The preferred solution is to have the models listed in a drop-down box and the parameter names/boxes created or renamed accordingly. Naim discussed the GUI options with core software, who confirmed that the option of two groups in a panel is possible. Naim has also started discussing workflows and organization with core software. TreeTapper provides a good template for how phylogenetic tools can be organized.
Ann points out that it will be very important to offer the tools at the same time as the big tree is unveiled. Another requirement is to have datasets to showcase the tool functionalities, for user training and for teaching. One option could be to match published trees with published data stored in repositories. Ann points out that, since we will have the Big Tree, we will only have to focus on the data aspect. Naim and Jeremy will look into the GBIF data. Barb warns us that taxon name synonyms can be a very touchy subject. Naim reports that one solution floated by the Taxonomic Name Resolution Service WG is to allow synonyms and offer discussion pages for the community to comment on. It is also pointed out that it will be critical to obtain authorization and ensure proper citing: the analysis output should directly mention the source of the data. Given the ability of the DE to track the history of a file, Naim thinks that this might be possible to implement and will investigate with core software. Naim has also contacted a researcher in Ecology and Evolutionary Biology at the University of Arizona who is working on trait evolution on a ~1000 taxa tree.
Ann is in contact with a small college which would make for a good use case for relatively naive users. The people at the college indicated their willingness to be a test user group.
Working with Nicole, I found some documents indicating the needs of TE with regard to visualization. It seems that edge coloring is the main priority: Metadata wish list. Brian also provided some examples of relevant trees: Cross cutting needs analysis. Upon Eric's suggestion, the group took a look at the current state of the Phyloviewer. Brian confirms that edge (and node) coloring is the top priority, including the possibility to color edges with gradients and the coloring of the triangles (collapsed lineages). The color of the triangle should represent the proportion of lineages in which a certain trait is present, either by subdividing the triangle or by averaging the colors. Naim reports that the visualization group is leaning towards supporting nexml and that he will work with the Tree Visualization group on the domain model. Brian reiterated that wedge coloring is a requirement and that users should be able to choose whether they want average color or proportion. [btw, "wedge" here isn't a typo, as I first thought -- it refers to the triangle of an unresolved clade -- Brian]. Naim reports that the focus of the Tree Visualization group is now being redirected towards implementing the needs of the working groups. He compiled a list of requirements based on wiki documents and discussions regarding visualization and invites the group to review and modify it: https://pods.iplantcollaborative.org/wiki/display/iptol/Visualization+Needs. However, Ann points out that the value tables are the most relevant output of the analyses. Naim reports on the outcome of the most recent TV meeting. Currently the major holdup is the lack of an interface, and TV discussed some available options (toolbox, right click, visualization mode vs. editing). The group finds that these options have several drawbacks (right click / modifier keys might not work as well on all platforms; toolboxes are often cluttered and unintuitive).
Barb suggested that the most efficient and intuitive option is to have a collapsible side menu like the one found in FigTree. Brian gave a quick presentation of the software and the consensus was that this solution was indeed the preferred one. Naim will pass on this information to TV.
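The two wedge coloring modes requested by Brian (average color vs. proportion) can be sketched in Python as follows. The function names, the default colors and the return format are assumptions made for this illustration; they do not describe the Phyloviewer or any existing Tree Visualization code.

```python
def trait_proportion(tip_states):
    """Fraction of lineages in a collapsed clade in which the trait is present."""
    if not tip_states:
        raise ValueError("collapsed clade has no tips")
    return sum(1 for state in tip_states if state) / len(tip_states)


def average_color(proportion, present_rgb, absent_rgb):
    """Blend the two colors linearly according to the trait proportion."""
    return tuple(
        round(p * proportion + a * (1.0 - proportion))
        for p, a in zip(present_rgb, absent_rgb)
    )


def wedge_fill(tip_states, present_rgb=(255, 0, 0), absent_rgb=(0, 0, 255),
               mode="average"):
    """Return a list of (rgb, fraction_of_wedge) segments for one wedge.

    mode="average": one segment filled with the blended color.
    mode="proportion": the wedge is subdivided, one segment per state.
    """
    proportion = trait_proportion(tip_states)
    if mode == "average":
        return [(average_color(proportion, present_rgb, absent_rgb), 1.0)]
    if mode == "proportion":
        return [(present_rgb, proportion), (absent_rgb, 1.0 - proportion)]
    raise ValueError("unknown mode: %s" % mode)


# Example: a collapsed clade of four lineages, one of which has the trait.
states = [True, False, False, False]
blended = wedge_fill(states, mode="average")
split = wedge_fill(states, mode="proportion")
```

A linear blend in RGB is only one way to average colors; the visualization group may prefer a different color space or a gradient.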
Taxa name matching
Barb reports that she had some issues with the interface that allows users to match taxa names in the datafile to those in the tree file, especially for large groups. Barb collected problematic cases and provided some suggestions on how to solve them: https://pods.iplantcollaborative.org/jira/browse/TRAILEVOL-8. Some of the problems will be solved (or mitigated) by the Taxonomic Name Resolution Service, and Naim will discuss it and the other issues identified by Barb with core. Naim has seen a prototype of an improved dialogue that will address these concerns.
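As an illustration of the basic matching step, the Python sketch below pairs names from a data file with names from a tree file after a light normalization (underscores, spacing and case) and reports what remains unmatched. It is a hypothetical helper, not the Discovery Environment implementation, and it does not resolve synonyms, which is the role of the Taxonomic Name Resolution Service.

```python
def normalize(name):
    """Treat underscores as spaces, collapse whitespace and ignore case."""
    return " ".join(name.replace("_", " ").split()).lower()


def match_taxa(tree_names, data_names):
    """Return (matched, only_in_data, only_in_tree).

    matched maps each data-file name to the corresponding tree-file name.
    """
    tree_index = {normalize(name): name for name in tree_names}
    data_index = {normalize(name): name for name in data_names}
    matched = {
        data_index[key]: tree_index[key]
        for key in data_index
        if key in tree_index
    }
    only_in_data = sorted(
        data_index[key] for key in data_index if key not in tree_index
    )
    only_in_tree = sorted(
        tree_index[key] for key in tree_index if key not in data_index
    )
    return matched, only_in_data, only_in_tree


# Example with made-up input lists.
matched, only_in_data, only_in_tree = match_taxa(
    ["Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla"],
    ["homo sapiens", "Pan paniscus", "Gorilla gorilla"],
)
```

For large groups, presenting only the two unmatched lists to the user, rather than every pair, might reduce the amount of manual matching required.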
Open action items (see also TE jira page)
- Find out release plans for big tree. He will also ask for whatever is there to be made available for testing and preparing examples --Naim
- Identify test datasets for teaching and presentation – WG
- Investigate GBIF dataset -- Naim and Jeremy
- Identify test users – WG
- Investigate datasets available through MyPlant/Data integration – Naim
- Review list of visualization needs: https://pods.iplantcollaborative.org/wiki/display/iptol/Visualization+Needs
- Investigate an optional “make trees available for internal testing” functionality when trees are uploaded for analysis. Nicole suggested that this function can be made available within the planned collaborative framework by creating an "iPlant Testing" user group.
I have received some suggestions of additional methods that could be integrated in the discovery environment. At present the focus is on the ones already identified, but I think keeping track of the various possibilities could be useful in the longer term. A wiki page for such a list will be created. Also, Brian will invite Ann Stapleton, who provided the suggestions, to join the working group. Wiki page: Software Wishlist
Naim set up a jira page to track, discuss, and collaboratively work on action items, issues and tasks. The project page is located at https://pods.iplantcollaborative.org/jira/browse/TRAILEVOL.