
...

May 10th, 2011

Invitees/Attendees (in orange)

Barb Banbury (bbanbury@utk.edu)
Jeremy Beaulieu (jeremy.beaulieu@yale.edu)
Joe Felsenstein (joe@gs.washington.edu)
Eric Lyons (elyons@iplantcollaborative)
Naim Matasci (nmatasci@iplantcollaborative.org)
Sheldon McKay (sheldon.mckay@gmail.com)
Brian O'Meara (omeara.brian@gmail.com)
Ann Stapleton (stapletona@uncw.edu)

Open action items (see also TE jira page)

Identify 6 tools for workshop integration, by April 20th. See https://pods.iplantcollaborative.org/wiki/display/sciplant/Tools+under+discussion+for+integration
Naim to ask D. Ackerly for the leaf data. See Dataset collection
Naim to coordinate model integration. See https://pods.iplantcollaborative.org/wiki/display/iptol/Model+Testing+for+Mode+and+Tempo
Barb to make optimization code available.
Sheldon to send out details of stat student and collect suggestions from the group.
Identify test datasets for teaching and presentation. To be listed in a Dataset collection on the wiki – WG
Investigate GBIF dataset -- Naim and Jeremy
Identify test users – WG

Topics for discussion (updates in orange)

EOT Effort

Ann would like to go over how best to interface with Dolan to produce an educational version of the TE tools. This will be the first educational adaptation of the DE, so we need to plan it carefully. First let's go over what we see as possible and problematic from our end, then at a later meeting we can have the Dolan folks and Sheldon talk with us to finalize the steps in the plan.

...

We will also need a way to share; right now analyses are viewable only by that login.

Naim and Ann present the recommendation from the EOT working group. The plan is structured in 2 components: (1) evolving the iPlant cyberinfrastructure into a user-friendly research space and (2) disseminating iPlant tools and services to the broadest audience of researchers, developers, teachers, and students.

The TE group is directly affected by both components. As part of illustrating iPlant's capabilities, TE could be included in a prototype phylogenetic workflow that consists of starting with a big tree (55K or 120K, provided by Big Trees), subsetting it to match a dataset of interest (provided by TE, as part of our EOT effort), running an evolutionary analysis (e.g. CACE or DACE) and displaying the results (using the Tree Viewer developed by TV). To expand the cyberinfrastructure, iPlant is planning a tool integration push in the form of a workshop in which iPlant-affiliated postdocs (and PhD students) would each pick 2 tools that are relevant to their work and integrate them into the DE. Through that model, TE could implement approximately 6 additional tools: 2 by Barb, 2 by Jeremy and an additional 2 by Naim. Naim and Brian have started compiling a list of possible 'areas of interest' https://pods.iplantcollaborative.org/wiki/display/iptol/TE110412 and the group will identify tools of interest in the next few days. Brian suggested NIMBioS as a possible hosting site.

Additionally, Naim will ask David Ackerly about the possibility of using the leaf longevity dataset as an example of independent contrasts.

The EOT effort will now focus on https://pods.iplantcollaborative.org/wiki/display/sciplant/Pipeline+Integration+%28related+to+EOT+workshops%29

...

Regarding the concurrent implementation of other software, Naim reports that AncML cannot be compiled on the Condor machines and suggests dropping that software, especially given that the ace implementation is underway. Naim also informs the group that Joe is planning to revise PHYLIP and that he could use some help with coding from iPlant; Naim and Sheldon are looking into it. Joe has revised the contrast code and will send it to Naim for integration. Joe is also planning to add new functionalities that are highly relevant for Trait Evolution and could provide better performance. The WG thinks that, given the scope, this could be an incubator project, and the group will discuss this possibility. Brian mentioned that the iPlant software should be open source and thus made available. Naim indicates that the problem might be which licenses to use. Joe points out that the PHYLIP code is not open source, so iPlant needs to make sure that it is not accidentally included under an open source license and that this fact is clearly indicated. On a related note, iPlant is reviewing its policy for code and process transparency. Two other tools have been identified for integration: a tool to perform ancestral state reconstruction using parsimony and a tool to perform diversity analyses (spun off from geiger). These will be discussed in future meetings. Brian illustrated the differences between the PHYLIP implementation and the APE one (see https://pods.iplantcollaborative.org/wiki/display/iptol/ASR-MP). It was decided to wait to talk to Gordon Burleigh before making a decision. Due to Gordon's other commitments, it was decided to use the PHYLIP implementation. Naim will integrate the tool during the June 6-8 Integration Workshop.

geiger integration

Another priority identified by the working group is Model Fitting. The R package geiger offers this functionality and Naim will start looking into it. Barb also points out that there are some additional general methods in geiger that could be useful for tree/data matching (e.g. dropping taxa from a tree in case of missing data). Brian contacted Luke Harmon, the author of geiger, and invited him to join the WG. The group started to discuss the integration priorities of the different components of geiger. The functions that have been identified are fitContinuous, fitDiscrete and dtt. Tree transformation functions (deltaTree etc.) can also be used in conjunction with the model fitting functions to display the new tree shape. Naim points out that in the near future it will be possible and particularly easy to link tools to one another into workflows. Luke indicates that, in his experience, the most used function is fitContinuous. Accordingly, the implementation of that function will be prioritized and all functionalities will be available to the users. It should be possible for the user to select a number of models (from 1 to all) to be fitted to the data in the same job. The user should be warned if the parameter estimates are close to the function bounds. Naim and Jeremy will work with Luke to modify the optimization functions to perform faster and more reliably.
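The bounds warning mentioned above could be sketched roughly as follows. This is only an illustration: the parameter names, the bounds and the 1% tolerance are assumptions, not values taken from geiger or from the notes.

```python
def near_bounds(estimates, bounds, rel_tol=0.01):
    """Return the names of parameters whose estimate lies within
    rel_tol * (upper - lower) of either bound."""
    flagged = []
    for name, value in estimates.items():
        lower, upper = bounds[name]
        margin = rel_tol * (upper - lower)
        if value - lower <= margin or upper - value <= margin:
            flagged.append(name)
    return flagged


# Hypothetical estimates and search bounds, for illustration only.
estimates = {"sigsq": 0.42, "alpha": 0.0001}
bounds = {"sigsq": (0.0, 10.0), "alpha": (0.0, 10.0)}

for name in near_bounds(estimates, bounds):
    print(f"Warning: the estimate for {name} is close to its bound")
```

An estimate sitting on a bound often means the optimizer was stopped by the bound and did not reach an interior optimum, which is why the user should see the warning next to the fit results.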
Two additional points of broader relevance are raised.

...

Jeremy worked towards improving the performance of the algorithm to compute the variance-covariance matrix used in geiger. Using matrix algebra, his implementation reduces the runtime for a 10,000 tip tree from >12 hours to approximately 4 minutes! This implementation requires a lot of memory (~10 GB) and a 64-bit architecture. Naim will check with Nirav whether he has any suggestions.
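The notes do not describe Jeremy's implementation, so the sketch below is not his method. It only illustrates what the matrix contains under Brownian motion: entry (i, j) is the branch length that tips i and j share on their paths from the root. The nested-tuple tree encoding is a hypothetical one chosen for the example.

```python
def vcv(tree):
    """Shared-path-length matrix for a rooted tree.

    A node is (content, branch_length); content is a tip name (str)
    or a list of child nodes. This encoding is an assumption made
    for this example only.
    """
    cov = {}

    def visit(node, depth):
        content, length = node
        depth += length  # distance from the root to this node
        if isinstance(content, str):
            cov[(content, content)] = depth
            return [content]
        groups = [visit(child, depth) for child in content]
        # Tips in different child subtrees share the path down to this node.
        for i, left in enumerate(groups):
            for right in groups[i + 1:]:
                for a in left:
                    for b in right:
                        cov[(a, b)] = cov[(b, a)] = depth
        return [tip for group in groups for tip in group]

    tips = visit(tree, 0.0)
    return tips, [[cov[(a, b)] for b in tips] for a in tips]


# ((A:1,B:1):1,C:2) written in the encoding above.
example_tree = ([([("A", 1.0), ("B", 1.0)], 1.0), ("C", 2.0)], 0.0)
tips, matrix = vcv(example_tree)
```

The matrix has one row and one column per tip, so a 10,000 tip tree yields 100 million entries, which is consistent with the memory footprint reported above.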

Barb notes that Diversity through Time does not fit in with the other functions (fitContinuous, fitDiscrete and deltaTree) and the group agrees to split it from this project and discuss it as a separate project. Also, the name of the project was changed to Model Testing for Mode and Tempo to better reflect the tool's functionalities. Barb points out that we don't have a clear idea of who the target user is, or of what we want to expose from these functions and how. The group decides to discuss the scope of this project in the next meeting.

Barb will integrate fitContinuous, Jeremy fitDiscrete and Naim deltaTree. Barb wanted to know how the tool should be presented to the user and which output it should return. Users should be allowed to check boxes for the model(s) they want to test, and 4 measures of fit will be returned: AIC, AICc, likelihood and likelihood ratio test results against a null model. Barb and Brian worked to improve the reliability of the optimization routines used by geiger. These will be incorporated into the version integrated into the DE and will eventually be pushed back to the community, preferably as an update for geiger. Barb will put the code on a repository so that it can be worked on and used by the rest of the group.
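For reference, the measures of fit listed above can be derived from a model's log-likelihood as in the sketch below. It uses the standard definitions and is not taken from geiger's code. The function name and the numbers are made up for illustration, and the sample size n is assumed here to be the number of tips.

```python
def fit_measures(lnl, k, n, lnl_null, k_null):
    """Summarize the fit of one model against a nested null model.

    lnl, lnl_null: maximized log-likelihoods
    k, k_null:     number of free parameters in each model
    n:             sample size (assumed to be the number of tips)
    """
    aic = 2 * k - 2 * lnl
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)  # small-sample correction
    lrt = 2 * (lnl - lnl_null)                    # likelihood ratio statistic
    return {
        "lnL": lnl,
        "AIC": aic,
        "AICc": aicc,
        "LRT": lrt,
        "LRT_df": k - k_null,
    }


# Made-up values, for illustration only.
result = fit_measures(lnl=-10.0, k=3, n=20, lnl_null=-12.5, k_null=2)
```

The likelihood ratio statistic is only meaningful when the null model is nested within the tested model. It is usually compared against a chi-square distribution with LRT_df degrees of freedom.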

Methods' input validation

...

Working with Nicole, I found some documents indicating the needs of TE with regard to visualization. It seems that edge coloring is the main priority (see Metadata wish list), and Brian also provided some examples of relevant trees (see Cross cutting needs analysis). Upon Eric's suggestion, the group takes a look at the current state of the Phyloviewer. Brian confirms that edge (and node) coloring is the top priority, including the possibility to color edges with gradients and the coloring of the triangles (collapsed lineages). The color of the triangle should represent the proportion of lineages in which a certain trait is present, either by subdividing the triangle or by averaging the colors. Naim reports that the visualization group is leaning towards supporting nexml and that he will work with the Tree Visualization group on the domain model. Brian reiterated that wedge coloring is a requirement and that users should be able to choose whether they want average color or proportion. [btw, "wedge" here isn't a typo, as I first thought -- it refers to the triangle of an unresolved clade -- Brian]. Naim reports that the focus of the Tree Visualization group is now being redirected towards implementing the needs of the working groups. He compiled a list of requirements based on wiki documents and discussions regarding visualization and invites the group to review and modify it: https://pods.iplantcollaborative.org/wiki/display/iptol/Visualization+Needs However, Ann points out that the value tables are the most relevant output of the analyses. Naim reports on the outcome of the most recent TV meeting. Currently the major holdup is the lack of an interface, and TV discussed some available options (toolbox, right click, visualization mode vs. editing). The group finds that these options have several drawbacks (right click / modifier key might not work as well on all platforms; toolboxes are often cluttered and non-intuitive).
Barb suggested that the most efficient and intuitive option is to have a collapsible side menu like the one found in FigTree. Brian gave a quick presentation of the software and the consensus was that this solution was indeed the preferred one. Naim will pass on this information to TV. Naim reports from the latest TV meeting that the left-hand collapsible menu has been adopted for the TV prototype and that the viewer development is proceeding very well.
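The two wedge-coloring options discussed above (proportion vs. average color) can be contrasted with a small sketch. The function names, the trait states and the RGB palette are all hypothetical, and this is not how the Tree Viewer computes colors.

```python
def wedge_proportions(tip_states):
    """Fraction of lineages in a collapsed clade carrying each trait state.
    A viewer could subdivide the triangle according to these fractions."""
    counts = {}
    for state in tip_states:
        counts[state] = counts.get(state, 0) + 1
    total = len(tip_states)
    return {state: count / total for state, count in counts.items()}


def wedge_average_color(tip_states, palette):
    """Single RGB color for the triangle: the per-channel mean of the
    state colors, with every lineage weighted equally."""
    sums = [0, 0, 0]
    for state in tip_states:
        for channel, value in enumerate(palette[state]):
            sums[channel] += value
    total = len(tip_states)
    return tuple(round(value / total) for value in sums)


# Hypothetical collapsed clade with four lineages.
states = ["present", "present", "present", "absent"]
palette = {"present": (255, 0, 0), "absent": (0, 0, 255)}

proportions = wedge_proportions(states)
average = wedge_average_color(states, palette)
```

Subdividing preserves the exact proportions at a glance, while averaging keeps the triangle a single color but can produce hues that match no state in the legend.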

The integration of the visualizer has hit a major roadblock as incompatibilities with the DE surfaced. Naim is working with Nirav and the Core Software group on an interim solution that would allow the visualization of large trees and the results of their analyses.

Taxa name matching

Barb reports that she had some issues with the interface that allows users to match taxa names in the datafile to those in the tree file, especially for large groups. Barb collected problematic cases and provided some suggestions on how to solve them: https://pods.iplantcollaborative.org/jira/browse/TRAILEVOL-8. Some of the problems will be solved (or mitigated) by the Taxonomic Name Resolution Service (TNRS) and Naim will discuss it and the other issues identified by Barb with core. Naim has seen a prototype of an improved dialogue that will address these concerns. Barb will start working in close cooperation with iPlant on a project to match tree and data. Bill Piel joined the group to discuss the integration of the DataMatch tool with the TNRS. Although an API doesn't exist yet for the TNRS, Bill indicates that the syntax will be similar to what is used in the GNI Parser. He also points out that the TNRS will sometimes return possible synonyms and that a system should be in place for the user to indicate how to resolve synonyms. The preferred approach is to have an automatic mode that will accept only one of the returned names, and a manual mode that returns the synonyms and allows the user to pick which ones to use. Bill is also interested in having some data examples, so that he can already identify features that would not work with the TNRS (e.g. underscores vs. dots, abbreviated genera). The group proceeds to identify the following milestones: data-tree matching (next meeting, March 29th), data-data and tree-tree, TNRS integration. Barb and Jeremy will work on the backend and Naim will coordinate with Core Software for the UI. When necessary, we will also coordinate meetings with the TNRS group to discuss the integration.
Barb has provided Naim with the algorithmic part of the code and Naim is working on the binding to the DE. Naim sent out the prototype for review and received some changes from Brian. He will integrate these, externalize the parameters and send out a working version for testing next week.
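As an illustration of the separator problems and the two synonym-resolution modes discussed above, a minimal sketch might look like this. It is not Barb's code and does not call the TNRS (which has no API yet). All function names are hypothetical, and having automatic mode keep the first candidate is an assumption.

```python
import re


def normalize(name):
    """Lowercase a name and treat underscores, dots and runs of
    whitespace as a single space."""
    return re.sub(r"[_.\s]+", " ", name).strip().lower()


def match_names(data_names, tree_names):
    """Pair data names with tree tip labels that agree after normalization."""
    lookup = {normalize(label): label for label in tree_names}
    matched, unmatched = {}, []
    for name in data_names:
        key = normalize(name)
        if key in lookup:
            matched[name] = lookup[key]
        else:
            unmatched.append(name)
    return matched, unmatched


def resolve(candidates, mode="automatic", choose=None):
    """Automatic mode keeps the first candidate (an assumption);
    manual mode hands the candidates to a user-supplied chooser."""
    if not candidates:
        return None
    if mode == "automatic":
        return candidates[0]
    return choose(candidates)


data_names = ["Quercus_alba", "Acer.rubrum", "Q. rubra"]
tree_names = ["Quercus alba", "Acer rubrum", "Quercus rubra"]
matched, unmatched = match_names(data_names, tree_names)
```

Normalizing separators handles the underscore and dot cases, but the abbreviated genus ("Q. rubra") stays unmatched. That is the kind of case that would be left to a name resolution service or to the user.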

Other

Naim informed the group that a statistics grad student will be part of iPlant this summer with an availability of approximately 540 hours. Sheldon will send out more details and collect suggestions from the group for projects the student could work on.

Completed items

ape integration and timeline

...

DACE and CACE were released on March 11th as part of the 0.3 DE Release.

Items

Make Ackerly data available through the DE at release -- Naim. The limits of the current DE version prevent this feature.
Remove shorebird dataset from the DE -- Naim. Filed a ticket to have the dataset removed.

Datasets

Ann points out that it will be very important to offer the tools at the same time as the big tree is unveiled. Another requirement is to have datasets to showcase the tool functionalities, for user training and for teaching. One option could be to match published trees with published data stored in repositories. Ann points out that, having the Big Tree, we will have to focus on the data aspect only. Naim and Jeremy will look into the GBIF data. Barb warns us that taxon name synonyms can be a very touchy subject. Naim reports that one solution floated by the Taxonomic Name Resolution Service WG is to allow synonyms and offer discussion pages for the community to comment on. It is also pointed out that it will be critical to obtain authorization and ensure proper citing: the analysis output should directly mention the source of the data. Given the ability of the DE to track the history of a file, Naim thinks that this might be possible to implement and will investigate with Core Software. Naim has also contacted a researcher in Ecology and Evolutionary Biology at the University of Arizona who is working on trait evolution on a ~1000 taxa tree. The Data Integration group plans to regularly ingest data from repositories (NCBI), build a phylogenetic tree and make the tree available through the DE. This will require a tool capable of automatically pruning large trees to match datasets, and Naim asked Barb if she would be willing and interested in working on that tool, as she had already raised the issue of matching tree and data (see below). Based on that previous discussion, Naim had already started drafting the scope for this project and Barb agreed to take over.
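The pruning step mentioned above can be sketched as follows. This is a minimal illustration and not the planned tool. The nested-tuple tree encoding and the function name are assumptions made for the example.

```python
def prune(node, keep):
    """Drop every tip whose name is not in `keep`.

    A node is (content, branch_length); content is a tip name (str)
    or a list of child nodes (a hypothetical encoding). Internal nodes
    left with a single child are collapsed, and their branch length
    is added to that child's branch.
    """
    content, length = node
    if isinstance(content, str):
        return node if content in keep else None
    kept = [p for p in (prune(child, keep) for child in content) if p is not None]
    if not kept:
        return None
    if len(kept) == 1:
        child_content, child_length = kept[0]
        return (child_content, length + child_length)
    return (kept, length)


# ((A:1,B:1):1,C:2) written in the encoding above.
big_tree = ([([("A", 1.0), ("B", 1.0)], 1.0), ("C", 2.0)], 0.0)
dataset_taxa = {"A", "C"}  # taxa present in the dataset
pruned = prune(big_tree, dataset_taxa)
```

Adding the collapsed branch lengths together keeps the root-to-tip distances of the retained taxa unchanged, which matters for the downstream comparative analyses.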

...

To investigate an optional "make trees available for internal testing" functionality when trees are uploaded for analysis. Nicole suggested that this function can be made available within the planned collaborative framework by creating an "iPlant Testing" user group.

...
