iPToL_Summary

Working Group Formation

In May 2009, the iPToL Grand Challenge Project "Kickoff" meeting was held to establish specific focal areas and to develop a high-level implementation plan for the project. At this meeting, it was decided to divide the grand challenge into six focus areas, each corresponding to a working group. Leads and co-leads for each working group were recommended by consensus, and working group membership was drafted from meeting participants and non-participating community members. During the kickoff meeting, support was expressed for two other possible collaborations: the Angiosperm Phylogeny Web Site (APWEB2) and the Botanical Information and Ecology Network (BIEN). By consensus, these were determined not to be within the scope of the core iPToL budget, but participants agreed that the efforts should be supported. In the fall of 2009, new collaborations were established with both groups. The APWEB2 project is regarded as an Education, Outreach and Training (EOT) extension of the iPToL project. The BIEN collaboration is also associated with iPToL; it is meant as an expanded solution to the pervasive "taxonomic intelligence" problem (described below) and will also help to integrate ecological trait data with iPlant's tree of life Discovery Environment.

Working groups are composed of a lead and an optional co-lead, who are typically faculty-level scientists from the research community most relevant to the group's focus. Two or three other external scientists also participate by invitation. Each working group (WG) is assigned an iPlant Engagement Team Analyst (ETA) who serves as the primary liaison and contact for the WG. WGs also interact directly with the engagement team Scientific Lead and the Project Manager. The engagement team as a whole is advised and guided by the iPToL Scientific Lead, in consultation with the iPToL Project Manager, both of whom receive support from an Administrative Assistant. Groups usually meet biweekly via teleconference, have access to a shared collaborative web space, and use group-specific mailing lists.

Working Groups

Trait Evolution: The goal of the Trait Evolution (TE) working group is to develop infrastructure to support the analysis of traits with reference to an established species phylogeny. Examples of such analyses include mapping biotic traits (e.g. floral morphology) and abiotic traits (e.g. geographic range) onto species trees. The first component will be a phylogenetic independent contrasts service that will be included in the initial release of the "Discovery Environment" (described below). The TE group is actively working on defining future development priorities. The next planned deliverable is a service for discrete ancestral character state reconstruction. We expect that the Botanical Information and Ecology Network (BIEN) collaboration will contribute considerable expertise in "taxonomic intelligence" as well as a rich source of ecological trait data that will benefit from the analytical services being developed by the TE working group.

With scientific and technical direction from the TE working group and the iPToL engagement team, the iPlant core software group is currently developing the first release of the Discovery Environment. As a whole, the Discovery Environment aims to be a user-friendly, interactive tool that enables plant biologists to conduct sophisticated computational analyses. In addition to a phylogenetic independent contrasts analysis tool, the Discovery Environment includes the core web application framework with data upload and tree viewing capabilities, user authentication, and collaboration tools. Most of these capabilities are also needed by several of the other iPToL working groups. The first release of the Discovery Environment is scheduled for February 2010.
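To illustrate the kind of computation a phylogenetic independent contrasts service performs, the following is a minimal sketch of the standard contrasts procedure (Felsenstein's method) for one continuous trait on a rooted binary tree. It is not the Discovery Environment implementation; the `Node` class, the `independent_contrasts` function, and the example tree and trait values are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A node in a rooted binary tree (hypothetical structure for this sketch)."""
    name: str
    branch_length: float = 0.0       # length of the edge above this node
    trait: Optional[float] = None    # observed trait value (leaves only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def independent_contrasts(root: Node) -> List[Tuple[str, float]]:
    """Return one standardized contrast per internal node, in post-order."""
    contrasts: List[Tuple[str, float]] = []

    def visit(node: Node) -> Tuple[float, float]:
        # Returns (trait estimate at node, effective branch length above node).
        if node.left is None and node.right is None:
            return node.trait, node.branch_length
        x1, v1 = visit(node.left)
        x2, v2 = visit(node.right)
        # Standardized contrast: difference scaled by its expected std. dev.
        contrasts.append((node.name, (x1 - x2) / math.sqrt(v1 + v2)))
        # Weighted average of the two descendants (weights 1/v1 and 1/v2).
        x = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
        # Lengthen the branch above to account for uncertainty in x.
        v = node.branch_length + (v1 * v2) / (v1 + v2)
        return x, v

    visit(root)
    return contrasts


# Illustrative tree ((A:1,B:1)AB:1,C:2)root with made-up trait values.
tree = Node("root",
            left=Node("AB", 1.0,
                      left=Node("A", 1.0, trait=1.0),
                      right=Node("B", 1.0, trait=3.0)),
            right=Node("C", 2.0, trait=5.0))

for name, value in independent_contrasts(tree):
    print(f"{name}: {value:.4f}")
```

The sketch assumes strictly positive branch lengths below every internal node and a fully bifurcating tree; a production service would also need to handle polytomies, missing trait values, and very deep trees (where recursion would be replaced by an explicit stack).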

Tree Reconciliation: The primary goal of the Tree Reconciliation (TR) working group is to develop infrastructure that facilitates the reconciliation of gene trees and species trees, the latter being the inferred phylogeny of the species. This includes other related but discordant phylogenies, such as those arising from host-parasite co-evolution. The TR group will publish reconciliations generated as part of the development process.

A secondary goal of this project is the generation of data to help resolve the deep phylogeny of green plants. To this end, the group has initiated a collaboration with the thousand plant transcriptome initiative (http://onekp.org), which is using next-generation sequencing to sequence transcripts from diverse green plant taxa ranging from green algae through to angiosperms. The anticipated outcomes of this collaboration are the identification of many gene families and their reconciliation with green plant species trees.

The first software deliverable identified for the tree-reconciliation group is a gene/species tree resolution service that will be included in the second release of the Discovery Environment. Requirements for this service are more complex and are being handled in part through development of a prototype implementation using established gene families and species trees. The Texas Advanced Computing Center (TACC) will host a mirror of the onekp transcriptome dataset and provide assistance with the analytical pipeline that will ultimately produce gene trees that feed into the analytical tools developed by the TE working group.

Data Assembly: The goal of the Data Assembly (DA) working group is to develop a prioritized list of phylogenetic datasets and tools to guide the assembly of the data matrix required for the phylogenetic inference being performed by the Big Trees (BT) working group (described below). The first step in this process was the data assembly workshop held in November 2009. This meeting brought together stakeholders in the plant phylogenetics and analysis communities to discuss priorities and strategies for assembling data for the marquee "big tree" analysis that will form the foundation of the iPToL Discovery Environment. The activities of this working group are fundamental for iPToL, not only for data assembly but also for seeking input from the phylogenetics community at large, to ensure that iPlant continues to develop a cyberinfrastructure that addresses the needs laid out in the tree of life grand challenge proposal.

Data Integration: The purpose of the Data Integration (DI) working group is cross-cutting in nature. Data integration and interoperability touch all aspects of the tree of life grand challenge, from initial data entry through to analysis and dissemination of analytical results. The DI group is actively assessing the integration and interoperability issues that currently face the other working groups and iPToL collaborations, as well as anticipating those that will arise in the future. The considerable expertise of this group in database and phylogenetic software design is also being leveraged to contribute to the high-level design of the Discovery Environment infrastructure. Collaborations with two external groups also affect the operations of the DI working group. The first concerns the overarching need for a "taxonomic intelligence" solution, which addresses the highly complex task of mapping species identifiers across alternative taxonomies while coping with synonyms, homonyms, data entry errors, etc. The DI group has developed a prototype name resolution service, which is now being expanded as a consequence of the relationship with the BIEN group. Members of the DI group recently attended a BIEN workshop at NCEAS. The second collaboration is with the EvoIO consortium (http://www.evoio.org), associated with the National Evolutionary Synthesis Center (NESCent) evolutionary informatics working group. This consortium develops the "EvoIO stack", an informatics infrastructure for phylogenetic data standards and interoperability.
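As a rough illustration of what a name resolution step involves, the sketch below normalizes a submitted name, checks it against a list of accepted names and a synonym table, and falls back to approximate string matching to catch data entry errors. It is not the DI group's prototype service: the `resolve_name` function, the tiny lookup tables, and the similarity cutoff are assumptions made for this example, and a real service would query curated taxonomic databases and would also have to disambiguate homonyms, which this sketch does not attempt.

```python
import difflib
from typing import Dict, List, Optional

# Illustrative lookup tables only (assumed for this sketch).
ACCEPTED_NAMES: List[str] = [
    "Solanum lycopersicum",
    "Arabidopsis thaliana",
    "Zea mays",
]
SYNONYMS: Dict[str, str] = {
    # synonym -> accepted name
    "Lycopersicon esculentum": "Solanum lycopersicum",
}


def resolve_name(raw: str, cutoff: float = 0.85) -> Optional[str]:
    """Map a submitted name to an accepted name, or None if unresolved."""
    parts = raw.strip().split()
    if not parts:
        return None
    # Normalize to "Genus epithet" capitalization and single spaces.
    name = " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])

    if name in ACCEPTED_NAMES:
        return name
    if name in SYNONYMS:
        return SYNONYMS[name]

    # Fuzzy match against both accepted names and known synonyms
    # to tolerate small spelling errors.
    candidates = ACCEPTED_NAMES + list(SYNONYMS)
    close = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    if not close:
        return None
    return SYNONYMS.get(close[0], close[0])


for submitted in ["lycopersicon  esculentum", "Arabidopsis thalliana", "Quercus alba"]:
    print(repr(submitted), "->", resolve_name(submitted))
```

The cutoff trades false matches against missed matches; in practice, a fuzzy match would normally be reported with its similarity score so that a curator can confirm it rather than being accepted silently.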

Big Trees: The primary goal of the Big Trees (BT) inference working group is to produce the tree of life for green plants, which will form the foundation of the final iPToL Discovery Environment. The technical challenges faced by this group are formidable. Currently, the largest published phylogenetic tree contains ca. 73,000 species. This working group aims to scale up phylogenetic inference methods by an order of magnitude, in particular the maximum likelihood and maximum parsimony methods implemented in RAxML (Randomized Axelerated Maximum Likelihood). The current capability is a 55,000-species tree in about 4 days on a Sun x4600 with 32 cores and 64 GB of RAM. This must be multiplied by a factor of 500-1000 to measure uncertainty via bootstrap replicates, for a total of up to about 4,000 days of clock time. At that rate, a scale-up to 500,000 species would require about 40,000 days of clock time. iPlant high performance computing experts are working with the BT working group to refactor and redesign the software so that a 500,000-species tree becomes achievable in a fraction of that time through the use of a high performance computing environment.
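The run-time estimates above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the assumption that run time grows linearly with the number of species is ours, made to match the rounded 40,000-day figure, and actual scaling may well differ. All constant and function names are illustrative.

```python
# Figures quoted in the text.
BASE_SPECIES = 55_000        # species in the current benchmark tree
BASE_DAYS = 4                # approximate wall-clock days for one inference
TARGET_SPECIES = 500_000     # target tree size
REPLICATES = (500, 1000)     # bootstrap replicates needed (low, high)


def bootstrap_days(days_per_run: float, replicates: int) -> float:
    """Total days if every bootstrap replicate is run one after another."""
    return days_per_run * replicates


low = bootstrap_days(BASE_DAYS, REPLICATES[0])
high = bootstrap_days(BASE_DAYS, REPLICATES[1])
print(f"55,000 species with bootstrap: {low:,.0f} to {high:,.0f} days")

# ASSUMPTION: run time scales linearly with the number of species.
scale = TARGET_SPECIES / BASE_SPECIES
scaled_high = high * scale
print(f"Scale factor: {scale:.1f}x")
print(f"500,000 species with 1000 replicates: about {scaled_high:,.0f} days")
```

Under this linear assumption the scale factor is roughly 9, which gives a figure in the high 30,000s of days, consistent with the rounded estimate of 40,000 days. Because bootstrap replicates are independent of one another, they can be distributed across many nodes, which is one reason a high performance computing environment reduces the wall-clock time so sharply.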

Tree Visualization: The Tree Visualization (Viz) working group has only recently been established. Karen Cranston has taken on the lead role, with Mike Sanderson acting as co-lead. The high-level objective of this group is a tool that will take as input a tree description of up to 500,000 leaf nodes, with their labels and edge lengths, and display the tree in a way that allows user interaction. The user will be able to browse, zoom, select, search and annotate, while preserving the input information. Browsing will balance conveying the size of the overall phylogeny against keeping the displayed tree, labels and lengths readable. Zooming will reveal additional information, not simply change the scale of a static image. The user will be able to select single nodes or groups of nodes and add annotations such as text labels, colours or images, either manually or from a file. There will be output options to export a static image that could be used in a publication or presentation. In consultation with the Viz group, a prototype tree viewer is under development at TACC for the "big tree" use case (see https://pods.iplantcollaborative.org/wiki/display/iptol/Use+Case+1+Prototype).

Other Collaborations

NINJA: The engagement team is collaborating with Travis Wheeler on improving the NINJA implementation of the complete neighbor-joining method for phylogeny reconstruction. NINJA is 10 times faster than the previous best complete neighbor-joining implementation and can handle large data sets in excess of 200,000 species. A neighbor-joining tree will complement the Big Trees working group's maximum likelihood and maximum parsimony analyses in RAxML. The goals of this collaboration are to add DNA distance calculation to the application, port NINJA from Java to C++, add checkpointing, and optimize its performance in a parallel/HPC environment. Additional details of this project can be found at https://pods.iplantcollaborative.org/wiki/display/iptol/NINJA
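Neighbor-joining operates on a matrix of pairwise distances, so a DNA distance calculation step converts aligned sequences into that matrix. The sketch below shows one common choice, the Jukes-Cantor corrected distance computed from the proportion of differing sites. Which distance model NINJA will adopt is not stated here, so the choice of Jukes-Cantor is an assumption, and the function names and example sequences are illustrative only.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned DNA sequences.

    Sites where either sequence has a gap or ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - (4/3) * p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The correction is undefined at or beyond saturation.
        return math.inf
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Made-up aligned sequences differing at one of eight sites.
a = "ACGTACGT"
b = "ACGTACGA"
print(f"p-distance: {p_distance(a, b):.3f}")
print(f"Jukes-Cantor distance: {jukes_cantor_distance(a, b):.4f}")
```

For data sets of the size mentioned above, the full distance matrix has on the order of n squared entries, so its computation and storage are themselves candidates for the parallel/HPC optimization described in the text.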

APWEB2: The Angiosperm Phylogeny Web Site is a resource that draws together information about green plant phylogeny at the order and family level. APWEB, curated by Peter Stevens at the Missouri Botanical Garden, is a valuable resource for plant evolution, ecology and allied disciplines. The contents of this website are contained entirely in static HTML documents and images. They have grown to the extent that they need to be rationalized into a database with more sophisticated ways to access the information. Starting January 1, 2010, APWEB2 (Peter Stevens, Amy Zanne, Cam Webb) and the iPToL engagement team will be collaborating on rebuilding APWEB to be driven by a database that will have both a traditional web interface and a web-services interface, to facilitate APWEB's role as an informational and teaching resource. Additional information on this project can be found at https://pods.iplantcollaborative.org/wiki/display/iptol/AP_Project_Charter.

BIEN: The Botanical Information and Ecology Network is an NCEAS working group composed of leading collectors and managers of botanical survey and inventory data, informaticians, and ecologists doing synthetic research. This group aims to integrate the most significant existing sets of vegetation data spanning North and South America. This effort will incorporate database resources for plant plot information and taxonomies and will encompass several million records of species occurrences. The result will be the largest assembly of data on plant diversity and distribution for both tropical and temperate plant species yet created. There are two specific areas of synergy with iPlant: 1) taxonomic intelligence and 2) mapping biotic and abiotic trait data to species phylogenies. The BIEN group will work closely with the iPToL Data Integration group and will also share information with the iPToL Trait Evolution working group.