This document summarizes iPlant's engagement with the Tree of Life grand challenge project (iPToL).
- 1 Summary
- 2 Ongoing Development on the Tree of Life Grand Challenge
- 3 The iPToL Engagement Team
- 4 The iPToL Working Groups
- 5 Core Software and the iPToL Discovery Environment
Working Group Formation
In May 2009, the iPToL Grand Challenge Project "Kickoff" meeting was held to establish specific focal areas and to develop a high-level implementation plan for the project. At this meeting, it was decided to divide the grand challenge into six focus areas, each corresponding to a working group. Leads/co-leads for each working group were recommended by consensus, and working group membership was drafted from meeting participants and non-participating community members. During the kickoff meeting, support was expressed for two other possible collaborations: the Angiosperm Phylogeny Web Site (APWEB2) and the Botanical Information and Ecology Network (BIEN). By consensus, these were determined not to be within the scope of the core iPToL budget, but participants agreed that these efforts should be supported. In the fall of 2009, new collaborations were established with these groups. The APWEB2 project is regarded as an Education, Outreach and Training (EOT) extension of the iPToL project. The BIEN collaboration is also associated with iPToL and is meant as an expanded solution to the pervasive "taxonomic intelligence" problem (described below); it will also help to integrate ecological trait data with iPlant's tree of life discovery environment.
Each working group (WG) comprises a lead and an optional co-lead, who are typically faculty-level scientists from the research community most relevant to the group's focus. Two or three other external scientists also participate by invitation. Each WG is assigned an iPlant Engagement Team Analyst (ETA) who serves as the primary liaison/contact for the WG. WGs also interact directly with the engagement team Scientific Lead and the Project Manager. The engagement team as a whole is advised and guided by the iPToL Scientific Lead, in consultation with the iPToL Project Manager; both receive support from an Administrative Assistant. Groups usually meet biweekly via teleconference, have access to a shared collaborative web space, and use group-specific mailing lists.
Trait Evolution: The goal of the Trait Evolution (TE) working group is to develop infrastructure to support analysis of traits with reference to an established species phylogeny. Examples of such analyses are mapping biotic traits (e.g. floral morphology) and abiotic traits (e.g. geographic range) onto species trees. The first component will consist of a phylogenetic independent contrasts service that will be included in the initial release of the "Discovery Environment" (described below). The TE group is actively working on defining future development priorities. The next planned deliverable is a service for discrete ancestral character state reconstruction. We expect that the Botanical Information and Ecology Network (BIEN) collaboration will add considerable expertise in "taxonomic intelligence" and will also provide a rich source of ecological trait data that will benefit from the analytical services being developed by the TE working group.
With scientific and technical direction from the TE working group and iPToL engagement team, the iPlant core software group is currently developing the first release of the Discovery Environment. As a whole, the Discovery Environment aims to be a user friendly, interactive tool that enables plant biologists to conduct sophisticated computational analysis. In addition to a phylogenetic independent contrasts analysis tool, the discovery environment also includes the core web application framework with data upload and tree viewing capability, user authentication and collaboration tools. Most of these capabilities are also needed by several of the other iPToL working groups. The first release of the discovery environment is scheduled for February 2010.
Tree Reconciliation: The primary goal of the Tree Reconciliation (TR) working group is to develop infrastructure facilitating the reconciliation of gene and species trees, the latter of which is the inferred phylogeny for the species. This includes related but discordant phylogenies, such as host-parasite species co-evolution. The TR group will publish reconciliations generated as part of the development process.
A secondary goal of this project is the generation of data to help resolve the deep phylogeny of green plants. To this end, the group has initiated a collaboration with the thousand plant transcriptome initiative (http://onekp.org), which is using next generation sequencing-based transcript sequencing for diverse green plant taxa ranging from green algae through to angiosperms. The anticipated outcomes of this collaboration are the identification of many gene families and reconciliation with green plant species trees.
The first software deliverable identified for the tree-reconciliation group is a gene/species tree resolution service that will be included in the second release of the Discovery Environment. Requirements for this service are more complex and are being handled in part through development of a prototype implementation using established gene families and species trees. The Texas Advanced Computing Center (TACC) will host a mirror of the onekp transcriptome dataset and provide assistance with the analytical pipeline that will ultimately lead to gene trees that feed into the analytical tools developed by the TE working group.
Data Assembly: The goal of the Data Assembly (DA) working group is to develop a prioritized list of phylogenetic datasets and tools to guide the assembly of the data matrix required for the phylogenetic inference being performed by the Big Trees (BT) working group (described below). The first step in this process was the data assembly workshop held in November 2009. This meeting brought together stakeholders in the plant phylogenetics and analysis communities to discuss priorities and strategies for assembling data for the marquee "big tree" analysis that will form the foundation of the iPToL discovery environment. The activities of this working group are fundamental for iPToL, not only for data assembly but also for seeking input from the phylogenetics community at large to ensure that iPlant continues to develop a cyberinfrastructure that addresses the needs laid out in the tree of life grand challenge proposal.
Data Integration: The purpose of the Data Integration (DI) working group is cross-cutting in nature. Data integration and interoperability touch all aspects of the tree of life grand challenge, from initial data entry through to analysis and dissemination of analytical results. The group is actively assessing the integration and interoperability issues that currently face the other working groups and iPToL collaborations, and is anticipating those that will arise in the future. The considerable expertise of this group in database and phylogenetic software design is also being leveraged to contribute to high-level design of the Discovery Environment infrastructure. Collaborations with two external groups also impact the operations of the DI working group. The first concerns the overarching need for a "taxonomic intelligence" solution, which addresses the highly complex task of mapping species identifiers across various alternative taxonomies, synonymies, data entry errors, homonyms, etc. The DI group has developed a prototype name resolution service, which is now being expanded as a consequence of the relationship with the BIEN group. Members of the DI group recently attended a BIEN workshop at NCEAS. A second collaboration is with the EvoIO consortium (http://www.evoio.org), associated with the National Evolutionary Synthesis Center (NESCent) evolutionary informatics working group. This consortium develops the "EvoIO stack", an informatics infrastructure for phylogenetic data standards and interoperability.
Big Trees: The primary goal of the Big Trees (BT) inference working group is to produce the tree of life for green plants; this will form the foundation of the final iPToL Discovery Environment. The technical challenges faced by this group are formidable. Currently, the largest published phylogenetic tree contains ca. 73,000 species. This working group aims to scale up phylogenetic inference methods by an order of magnitude, in particular the maximum likelihood and maximum parsimony methods implemented in RAxML (Randomized Axelerated Maximum Likelihood). The current capability is 55,000 species in ~4 days on a Sun x4600 with 32 cores and 64GB of RAM. This must be multiplied by a factor of 500-1000 for measurement of uncertainty via bootstrap replicates, for a total of 2,000 to 4,000 days of clock time. A scale-up to 500,000 species would require about 40,000 days of clock time. iPlant high performance computing experts are working with the BT working group to refactor and redesign the software to make a 500,000-species tree achievable in a fraction of that time through the use of a high performance computing environment.
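The arithmetic behind these estimates can be sketched as follows. This is a back-of-the-envelope illustration only: the function names are hypothetical, and the linear scaling of run time with species count is an assumption made here to approximate the quoted figure, not a measured property of RAxML.

```python
def bootstrap_days(single_run_days, replicates):
    """Serial clock time if one full inference run is needed per replicate."""
    return single_run_days * replicates


def scaled_days(base_days, base_species, target_species):
    """Scale run time linearly with species count (an assumption, see above)."""
    return base_days * target_species / base_species


# ~4 days per run on 55,000 species, 500 to 1000 bootstrap replicates
low_estimate = bootstrap_days(4, 500)
high_estimate = bootstrap_days(4, 1000)

# Projecting the upper estimate from 55,000 to 500,000 species
projected = scaled_days(high_estimate, 55_000, 500_000)
```

Under the linear assumption the projection lands in the tens of thousands of days, the same order of magnitude as the 40,000 days quoted above; the true scaling behaviour of the inference is likely to differ.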
Tree Visualization: The Visualization (Viz) working group has only recently been established. Karen Cranston has taken on the lead role with Mike Sanderson acting as a co-lead. The high-level objective of this group is a tool that will take a tree description of up to 500k leaf nodes, with their labels and edge lengths, and display the tree in a way that allows for user interaction. The user will be able to browse, zoom, select, search and annotate, while preserving the input information. Browsing will balance between capturing the size of the overall phylogeny and keeping the displayed tree, labels and lengths readable. Zooming will reveal additional information, not simply change the scale of a static image. The user will be able to select single nodes or groups of nodes and add annotations such as text labels, colours or images, either manually or from a file. There will be output options to export a static image that could be used in a publication or presentation. In consultation with the Viz group, a prototype tree viewer is under development at TACC for the "big tree" use case (see https://pods.iplantcollaborative.org/wiki/display/iptol/Use+Case+1+Prototype).
NINJA: The engagement team is collaborating with Travis Wheeler on improving the NINJA implementation of the complete neighbor-joining method for phylogeny reconstruction. NINJA is 10X faster than the previous best complete neighbor-joining implementation and is capable of dealing with large data sets in excess of 200K species. A neighbor-joining tree will complement the "big tree" working group's maximum likelihood and maximum parsimony analysis in RAxML. The goal of this collaboration is to add DNA distance calculation to the application, port NINJA from Java to C++, add checkpointing, and optimize its performance in a parallel/HPC environment. Additional details of this project can be found at https://pods.iplantcollaborative.org/wiki/display/iptol/NINJA
APWEB2: The Angiosperm Phylogeny Website is a resource that draws together information about green plant phylogeny at the order and family level. APWEB, curated by Peter Stevens at the Missouri Botanical Garden, is a valuable resource for plant evolution, ecology and allied disciplines. The contents of this website are contained entirely in static HTML documents and images. They have grown to the extent that they need to be rationalized into a database with more sophisticated ways to access the information. Starting January 1, 2010, APWEB2 (Peter Stevens, Amy Zanne, Cam Webb) and the iPToL engagement team will be collaborating on re-building APWEB to be driven by a database that will have both a traditional web interface and a web-services interface to facilitate APWEB's role as an informational and teaching resource. Additional information on this project can be found at https://pods.iplantcollaborative.org/wiki/display/iptol/AP_Project_Charter.
BIEN: The Botanical Information and Ecology Network is an NCEAS working group comprised of leading collectors and managers of botanical survey and inventory data, informaticians and ecologists doing synthetic research. This group aims to integrate the most significant existing sets of vegetation data spanning North and South America. This effort will incorporate database resources for plant plot information and taxonomies and will encompass several million records of species occurrences. The result will be the largest assembly of data on plant diversity and distribution for both tropical and temperate plant species yet created. There are two specific areas of synergy with iPlant: 1) taxonomic intelligence and 2) mapping biotic and abiotic trait data to species phylogenies. The BIEN group will work closely with the iPToL Data Integration group and will also share information with the iPToL Trait Evolution working group.
Ongoing Development on the Tree of Life Grand Challenge
Ongoing development in collaboration with the Trait Evolution working group (https://pods.iplantcollaborative.org/wiki/display/iptol/Trait+Evolution) is focused on using the existing phylogenetic tree to address evolution of particular traits in relation to the species phylogeny. The first iteration of the iPToL discovery environment contains an implementation of Phylogenetic Independent Contrasts, using the CONTRAST program from the PHYLIP package. Future development priorities include support for the following methods:
- Discrete ancestral state reconstruction
- Pagel 1994 character correlation
- Fitting models (OU, BM, etc.: various stretching models like those in Blomberg et al. 2003).
- Continuous ancestral state reconstruction
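As an illustration of what the Phylogenetic Independent Contrasts service computes, the sketch below implements the standard contrasts algorithm (Felsenstein 1985) for one continuous trait on a rooted binary tree. It is a minimal teaching example, not the CONTRAST code or the Discovery Environment implementation; the tree encoding, the function name and the toy data are assumptions made here, and all branch lengths are assumed to be positive.

```python
import math


def independent_contrasts(node, traits):
    """Return (node_value, extra_branch_length, contrasts) for a subtree.

    A leaf is a taxon name (str). An internal node is a pair
    ((left_child, left_length), (right_child, right_length)).
    """
    if isinstance(node, str):
        return traits[node], 0.0, []
    (left, v_left), (right, v_right) = node
    x_l, extra_l, c_l = independent_contrasts(left, traits)
    x_r, extra_r, c_r = independent_contrasts(right, traits)
    # Branches below an estimated node are lengthened to reflect the
    # uncertainty of that estimate.
    v_l = v_left + extra_l
    v_r = v_right + extra_r
    # Standardized contrast between the two daughter values.
    contrast = (x_l - x_r) / math.sqrt(v_l + v_r)
    # Node value: average of the daughters weighted by inverse branch length.
    value = (x_l / v_l + x_r / v_r) / (1 / v_l + 1 / v_r)
    extra = v_l * v_r / (v_l + v_r)
    return value, extra, c_l + c_r + [contrast]


# Toy data (hypothetical): the tree ((A:1,B:1):1,C:2) with one trait.
tree = (((("A", 1.0), ("B", 1.0)), 1.0), ("C", 2.0))
traits = {"A": 4.0, "B": 2.0, "C": 9.0}
root_value, _, contrasts = independent_contrasts(tree, traits)
```

A tree with n tips yields n - 1 contrasts, which can then be used in place of the raw species values in a correlation or regression.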
Taxonomic Name Resolution
A cross-cutting data integration problem for many iPToL projects is the resolution of synonymous, erroneous, or otherwise conflicting taxonomic names. A pilot project is underway with the BIEN working group (https://pods.iplantcollaborative.org/wiki/display/iptol/BIEN) and other iPlant stakeholders to unify taxonomic name resolution with the Tropicos database and other taxonomic name resources (see: https://pods.iplantcollaborative.org/wiki/display/iptol/TNRS+Workshop). This is the first "incubator project" (https://pods.iplantcollaborative.org/wiki/display/IP/TNRS) in iPlant. The incubator is a new, accelerated collaborative development model that brings together scientific advisers from the working groups, the iPToL engagement team, and members of the iPlant core services and core software development groups.
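The kind of lookup a name resolution service performs can be sketched in a few lines. The example below is only an illustration of the idea (exact match, then synonym lookup, then fuzzy match for spelling errors); it is not the TNRS design. The function name, the similarity cutoff and the tiny in-memory name lists are assumptions, and a real service would draw on a resource such as Tropicos.

```python
import difflib

# Toy reference data (illustrative only).
ACCEPTED = {"Solanum lycopersicum", "Zea mays", "Arabidopsis thaliana"}
SYNONYMS = {"Lycopersicon esculentum": "Solanum lycopersicum"}


def resolve_name(raw, accepted=ACCEPTED, synonyms=SYNONYMS, cutoff=0.85):
    """Return (accepted_name_or_None, how_it_was_matched)."""
    # Normalize whitespace and capitalization of a plain binomial.
    name = " ".join(raw.split()).capitalize()
    if name in accepted:
        return name, "exact"
    if name in synonyms:
        return synonyms[name], "synonym"
    # Fall back to approximate string matching to catch data entry errors.
    candidates = sorted(accepted | set(synonyms))
    close = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    if close:
        match = close[0]
        return synonyms.get(match, match), "fuzzy"
    return None, "unresolved"
```

Returning the match type alongside the name lets a curator review fuzzy matches separately from exact ones, which matters because homonyms and near-identical names cannot be resolved by string similarity alone.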
Scaling up Phylogenetic Inference
Phylogenetic analysis at the level of several hundred thousand species represents a new scalability challenge that iPlant is addressing on two parallel tracks. The general approach is to optimize existing methods (Maximum Likelihood with RAxML and Neighbor Joining with NINJA/WINDJAMMER). The current testing data set is a matrix of eight genes for 116K species provided by Stephen Smith. The ultimate goal is to have the computing capacity to build phylogenetic trees for up to 500K species when the data become available.
RAxML (http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm) was developed by Alexandros Stamatakis. Members of the iPlant engagement team at the Texas Advanced Computing Center (TACC) have been working to scale up the application to be able to leverage the substantial TACC high performance computing resources. The first major achievement for this work is the addition of check-pointing to allow stopping and restarting of long-running processing jobs without loss of data or compute time. Ongoing work is on improving parallel implementation, to decrease overall run-time for large data sets.
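The general idea of check-pointing can be illustrated with a toy long-running job. The sketch below is not the RAxML implementation; the function name, the JSON state file and the `stop_after` parameter (used only to simulate an interruption) are assumptions made for the example.

```python
import json
import os


def run_with_checkpoints(total_steps, path, stop_after=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    If a checkpoint file exists at `path`, the job resumes from it.
    Returns the final total, or None if the run was interrupted.
    """
    state = {"step": 0, "total": 0}
    if os.path.exists(path):
        with open(path) as fh:
            state = json.load(fh)  # resume from the last saved state
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated interruption (e.g. queue time limit)
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        # Write to a temporary file and rename it, so that a crash during
        # the write cannot leave a corrupt checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(state, fh)
        os.replace(tmp, path)
    return state["total"]
```

Calling the function a second time with the same path picks up where the first call stopped, so no completed work is repeated. In practice checkpoints would be written far less often than every step to limit I/O overhead.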
Neighbor-joining is a hierarchical clustering method for inferring phylogenies. NINJA, developed by Travis Wheeler, is a new tool that produces correct neighbor-joining trees much faster than the canonical algorithm; it is able to scale to inputs of the size needed for assembling the "tree of life" for all green plants, but would require months to do so on a single computer. The goal of this project is to build a parallelized implementation of the core algorithm from NINJA, to allow biologists to analyze very large data sets in a matter of hours, instead of months. The end product will be a new implementation of this clustering method (code named WindJammer) entirely re-written in C (the original version was in Java).
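For reference, the canonical neighbor-joining algorithm mentioned above can be sketched as follows. This is the textbook cubic-time procedure, which examines every pair of nodes at every step; it is not NINJA or WindJammer code, and the function name and data layout are assumptions made for the example.

```python
def neighbor_joining(names, dist):
    """Canonical neighbor joining on a symmetric distance table.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of (node, neighbor, branch_length) edges of an
    unrooted tree; new internal nodes are named node0, node1, ...
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums of the current distance matrix.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a)
             for a in nodes}
        # Pick the pair minimizing Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[frozenset((a, b))] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        d_ab = d[frozenset((a, b))]
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the new internal node to the joined pair.
        len_a = 0.5 * d_ab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        edges += [(u, a, len_a), (u, b, len_b)]
        # Distances from the new node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((u, c))] = 0.5 * (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - d_ab)
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges
```

Each of the n - 2 joins scans all remaining pairs, which is what makes this version impractical for hundreds of thousands of taxa and motivates both faster search strategies and a parallel implementation.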
Data Integration and Assembly
The Data Integration and Assembly working group has two main purposes. The first is to engage the plant science community and facilitate the aggregation of data needed to assemble the "tree of life" for green plants. The second is to establish the infrastructure to support the logistics of handling the data as key resources are identified.
My-Plant.org is a phylogenetically-structured social networking website under development for plant scientists, educators, and other interested parties. Users will be able to easily share information and research, collaborate, and stay on top of the latest news in their field. Members of the My-Plant.org community will gather around clades of choice, then view and contribute to information surrounding the clades including image galleries, message board discussions, wiki pages, clade pages, and access to external sources of data. The site is under development and planned launch is late summer 2010.
Data Intake Pipeline
The goal of this project is to establish a robust data pipeline that makes use of different methods to compare/cluster sequences, identify orthologous genes and compute multiple sequence alignments as well as gene trees. The group has begun implementing several approaches to move efficiently from sequences to gene trees. The PHLAWD data intake pipeline (http://code.google.com/p/phlawd/), developed by Stephen Smith, takes predefined gene regions of interest and exemplar sequences representing phylogenetic diversity; it does not require clustering. The All-by-all BLAST pipeline, developed by Gordon Burleigh, offers the advantage of not requiring any a priori functional knowledge. Testing of the pipelines is in progress and work is under way to migrate them to the HPC environment at TACC.
Gene-Species tree reconciliation
Ongoing development in collaboration with the Tree Reconciliation working group (https://pods.iplantcollaborative.org/wiki/display/iptol/Tree+Reconciliation) and the Thousand Plant Transcriptome project (oneKP; http://www.onekp.com/) is focused on using the existing phylogenetic tree to address evolution of gene families. Work is underway on an iPlant incubator project (https://pods.iplantcollaborative.org/wiki/display/IP/TR) to provide a gene-species tree reconciliation service that will reconcile gene families being generated by the onekp project with the green plant species phylogeny.
The onekp project is also a potential source of input data for the data assembly and integration working group's efforts to feed into the "big tree" analysis.
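One common way to reconcile a gene tree with a species tree is to map every gene-tree node to the last common ancestor (LCA) of its species in the species tree and to infer a duplication wherever a node maps to the same species-tree node as one of its children. The sketch below illustrates that idea only; the source does not say which reconciliation method the service will use, and the function names, tree encoding and toy data are assumptions.

```python
def ancestors(node, parent):
    """Path from a species-tree node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(a, b, parent):
    """Last common ancestor of two nodes in the species tree."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def count_duplications(gene_tree, gene_to_species, parent):
    """Return (species_node_mapped_to, number_of_inferred_duplications).

    A gene-tree leaf is a gene name (str); an internal node is a pair
    (left_subtree, right_subtree).
    """
    if isinstance(gene_tree, str):
        return gene_to_species[gene_tree], 0
    left, right = gene_tree
    m_l, d_l = count_duplications(left, gene_to_species, parent)
    m_r, d_r = count_duplications(right, gene_to_species, parent)
    m = lca(m_l, m_r, parent)
    # A node that maps to the same place as a child implies a duplication.
    is_duplication = m in (m_l, m_r)
    return m, d_l + d_r + int(is_duplication)


# Toy species tree ((rice, maize)grasses, arabidopsis)angiosperms,
# stored as child -> parent. All gene names are hypothetical.
PARENT = {"rice": "grasses", "maize": "grasses",
          "grasses": "angiosperms", "arabidopsis": "angiosperms"}
GENES = {"rice_g1": "rice", "rice_g2": "rice",
         "maize_g1": "maize", "arabidopsis_g1": "arabidopsis"}
```

For example, the gene tree `(("rice_g1", "maize_g1"), ("rice_g2", "arabidopsis_g1"))` requires one duplication at the root, whereas a gene tree that matches the species tree requires none.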
Large Tree Visualization
With current technologies, visualization of phylogenetic trees with large numbers of species becomes increasingly slow and difficult. The goal of the Large Tree Visualization project is to develop an application for viewing, analyzing and exploring large phylogenetic trees. Code named "PhyloViewer", the application is a platform-independent, web-based viewer that will enable users to rapidly navigate very large trees (>500k taxa), integrate their own metadata and share that information with others. PhyloViewer is under active development and will be integrated into the Discovery Environment as well as being available through a separate web service.
Angiosperm Phylogeny Website
A collaboration is underway between iPlant and the Angiosperm Phylogeny Website to modernize the infrastructure and convert the current static HTML pages of APWEB into dynamic database-driven web services (APWEB2) that will offer these widely used data to researchers and educators using a variety of different interfaces. See https://pods.iplantcollaborative.org/wiki/display/iptol/AP_Project_Charter for more information.
The iPToL Engagement Team
The Engagement team is an outward-facing part of iPlant that serves as the interface between the Grand Challenge projects and the cyberinfrastructure developers.
Sheldon McKay, Scientific lead
Sheldon received his PhD in phylogenetics. His research background is in evolution, genomics and comparative genomics. He has been involved in bioinformatics and scientific software development for the past 10 years and has served as a bioinformatics liaison between informatics and research components of a number of large research consortia, most recently for the model organism Encyclopedia of DNA Elements (modENCODE) project. Sheldon is a member of the research faculty at Cold Spring Harbor Laboratory (CSHL). He is also active in development and outreach for the Generic Model Organism Database (GMOD) project and is an open source and data interoperability advocate.
Michael Gonzales, Project Manager
Michael received his Ph.D. in cell biology. Michael’s scientific efforts have been focused on various aspects of computational biology including high throughput genomics/proteomics and protein-ligand interactions. Michael serves as the Life Sciences Program Director at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. He is responsible for setting the strategic direction and developing the resources and services to support life sciences users, and for leading collaborative research projects in computational biology.
Victoria Bryan, Project Manager
Natalie Henriques, Administrative Assistant
Natalie has an MBA from Wright State University. She is a senior administrative associate at TACC. Previously, she was at the Ohio Supercomputer Center in July 2008. She was the program coordinator for the Department of Defense (DoD) High Performance Computing Modernization Program (HPCMP) and other large computing platforms. Natalie assists Michael Gonzales with project management of the iPToL engagement team.
Engagement Team Analysts
Adam has a BS in computer science and extensive experience in C++, Python, Java, OpenGL/scientific and geospatial visualization, multi-threaded programming, software architecture and design and SQL.
Andrew has a BS in Computer Science from the University of Arizona. He is a Software Engineer on the Core Software team. He was previously the Lead Developer on the Tree of Life Web Project. Andrew has extensive experience in industrial grade software engineering and has taught courses on enterprise web development and component-oriented development for the University of Arizona Computer Science Department.
Zhenyuan (Jerry) Lu
Jerry received MS degrees in both Computer Science and Biochemistry. His scientific background is in scientific visualization, large-scale data integration, numeric simulation and rapid prototype development for biological applications. His computing experience is in Perl, Java; MySQL and Oracle databases.
Bernice obtained her PhD in psychology from Columbia University and postdoctoral training in psychophysics at Harvard University. She is a fellow in the Society for Optics, Photonics and Imaging. She is currently based at TACC. Bernice has just joined iPlant and will contribute her expertise in human perception and data visualization in both the iPToL and iPG2P visualization working groups.
Sharon has Masters degrees in both Computer Science (University of Houston) and Molecular Genetics (U.T.-M.D. Anderson Cancer Center, Houston, TX). She is experienced in the Ensembl genome database infrastructure, from its core/compara databases to its analytical pipelines and web browser. Sharon has also worked on evolutionary projects such as the origin of maize and is familiar with various phylogenetic methods, such as Neighbor Joining and Maximum Likelihood. She is experienced in Perl, Java, C, MySQL and Oracle.
Floating Team members
Staff members at TACC and CSHL can be brought in on a shorter term basis as required. TACC members contribute high performance computing and software engineering expertise and CSHL staff contribute expertise in phylogenetics and bioinformatics.
Engagement Team Role
The primary purpose of the engagement team is to translate the needs of the grand challenge working groups into plans that will inform the development of software and cyber-infrastructure components that address these needs. This involves several components:
Broadly stated, requirements analysis involves the identification of high level software and infrastructure deliverables that will address research and analysis needs. The deliverables and high-level requirements are documented by the engagement team. In collaboration with the core software needs analysis specialists, these are decomposed into logical components, dependencies, user personas, user stories etc. that will inform orderly development of the final software product. The process is bilateral and involves further iteration with the domain experts in the working groups as required.
An example of a high-level deliverable is "a web application for reconciliation of gene and species trees". This breaks down into a list of requirements. For example:
- Where do the gene tree and species trees come from?
- How will the taxon names (species names) in the gene and species trees be reconciled?
- What data formats will be used?
- What reconciliation algorithm/software will be implemented?
- How will the analysis results be stored?
- How will the analysis results be accessed, shared, displayed, etc?
Project Management and Coordination
Another key responsibility of the engagement team is to coordinate activities of the working groups, the engagement team, the software engineering group and other iPlant staff. This support consists of meeting and other administrative logistics, budgeting, staffing, planning, etc. The engagement team has a professional project manager who works with all stakeholders to assure that documentary requirements are met, that milestones and time-lines are mapped out, that progress is monitored and that necessary adjustments are made to ensure that the project succeeds.
Another facet of requirements analysis is to test ideas and generate proofs of concept that will facilitate early innovation and inform promising directions for further development. An example of this process was the grand-challenge pre-projects. These early engagements with Plant Science community members resulted in several proof-of-principle applications, including, among others, a phylogenetic tree retrieval, manipulation, storage and visualization platform that linked the TreeBASE database with the PhyloWidget tree editing and display tool. Although the scope of future prototypes will be limited, this engagement method will be employed in some cases to facilitate needs assessment prior to commitment of resources to core software development. An example of such a prototype is an experimental taxon name resolution service being undertaken in the data integration working group (described in more detail below).
In cases where there are clearly circumscribed, primarily technical challenges that intersect with core iPlant expertise, an accelerated engagement process is being used. For example, the "big tree" working group's early goal is to ensure that the RAxML software is up to the challenge of inferring phylogenetic trees for very large data sets. There was a clear need for benchmarking, check-pointing and deployment on the high performance compute resources available to iPlant. Several iPlant staff members have relevant expertise and this process proceeded with little lag time. Another example of this style of engagement is an ongoing collaboration (discussed below) with Travis Wheeler on improving and optimizing the NINJA implementation of the Neighbor Joining algorithm for very large data sets.
The iPToL Working Groups
Collaborative implementation was organized into working groups with focused development goals. Each group has an iPToL superuser or faculty member designated as the lead and point of contact. The four main working groups are: Big Trees (Alexis Stamatakis), Data Assembly (Doug Soltis, Pam Soltis, Michael Donoghue), Tree Reconciliation (Todd Vision), and Trait Evolution (Brian O'Meara). Two crosscutting working groups that develop shared data and compute infrastructure are Data Integration (Val Tannen) and Visualization (Karen Cranston).
Data Assembly Working Group
Data Integration Working Group
Trait Evolution Working Group
Tree Reconciliation Working Group
Visualization Working Group
Big Trees Working Group
Core Software and the iPToL Discovery Environment
The iPlant core software engineering group is based at the University of Arizona. It serves all grand challenge projects as well as other iPlant cyberinfrastructure development activities. The lead developer in this group is Sonya Lowry.
Interface with the iPToL engagement team
Planning and high level design issues are communicated directly from Sheldon McKay and Michael Gonzales to Sonya Lowry, the lead developer. This is primarily off-line communication.
Requirements Analysis and Development
Each engagement team analyst (ETA) has primary responsibility for at least one working group and secondary responsibility for a second. The ETAs attend working group meetings and work directly with the working group lead and other members to assess the scientific and technical requirements of the working group. The ETAs then communicate these requirements, with appropriate triage and refactoring, to Nicole Hopkins, core software's needs assessment specialist. Acquisition of scientific and computational domain knowledge for the working group is primarily the responsibility of the ETA.
Early on in this process, there was more direct engagement between core software and the working group members but this approach does not scale well for eight ongoing iPToL collaborations, so the respective ETA for each working group needs to be the primary conduit for communicating needs and development priorities to the software engineers.
There is direct communication on detailed needs assessment between all members of the engagement team and Nicole Hopkins, the core software needs analysis specialist. This level of communication is almost entirely documented on the Confluence wiki space for core software (https://pods.iplantcollaborative.org/wiki/display/testdev/Home+-+Core+Software). The Confluence wiki helps to track design discussions and development issues associated with the discovery environment. Engagement team members also contribute documents and comments to the core software wiki space. Detailed development issues are tracked internally by the core software group in the JIRA issue tracking system, excerpts of which are also posted on the wiki. Core software group members also post "virtual standup" reports on the wiki on an ongoing basis.
There is a bi-weekly design and retrospective meeting held by the core software group and attended by the iPToL engagement team analysts. This is a platform for discussion of architecture design decisions and detailed reports on development activities.
Near Term Road Map
This is a high-level plan for the first iteration of the discovery environment, slated for release in February 2010. Many other features will be added to subsequent releases.
This effort will result in the creation of a simple web-based application for performing basic create, read, update, and delete operations on limited tree data while ensuring that architectural and design choices are made with an eye toward easy addition of features, extensibility, and production quality as this application will become the basis for the final iPToL production Discovery Environment. This foundational application is not specific to the trait evolution groups and is designed to meet current and anticipated requirements for several of the working groups.
This effort will extend the iPToL DE to include the capability for users to apply independent contrast methodology for trait analysis and produce reports with results that can be printed, saved, and viewed. Initially, this service will primarily consume user-supplied tree data but the emphasis will shift to the iPToL big tree data as this comes online.
This effort will extend the iPToL DE to include user workspaces, authentication, authorization, tree and trait data sharing capabilities, and annotations.
Development effort on these three projects will combine to result in a 1.0 release of the iPToL Discovery Environment. We are still iterating on details of Branch and the tree reconciliation related project.