Ongoing Development on the Tree of Life Grand Challenge

Trait Evolution

Ongoing development in collaboration with the Trait Evolution working group (https://pods.iplantcollaborative.org/wiki/display/iptol/Trait+Evolution) is focused on using the existing phylogenetic tree to address evolution of particular traits in relation to the species phylogeny. The first iteration of the iPToL discovery environment contains an implementation of Phylogenetic Independent Contrasts, using the CONTRAST program from the PHYLIP package. Future development priorities include support for the following methods:

Discrete ancestral state reconstruction
Pagel 1994 character correlation
Fitting models (OU, BM, etc., including various stretching models like those in Blomberg et al. 2003)
Continuous ancestral state reconstruction
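
To make the Phylogenetic Independent Contrasts method mentioned above concrete, here is a minimal sketch of Felsenstein's contrast calculation for one continuous trait on a fully bifurcating tree with branch lengths. It is an illustration only and is not the CONTRAST program's code; the nested-tuple tree encoding and the function name `contrasts` are assumptions made for this example.

```python
import math


def contrasts(node):
    """Return (node_value, extra_branch_length, contrasts) for a subtree.

    Hypothetical encoding used only in this sketch:
      tip           -> (name, trait_value)
      internal node -> ((left_subtree, left_length), (right_subtree, right_length))
    """
    if not isinstance(node[0], tuple):          # a tip: (name, value)
        return node[1], 0.0, []
    (left, left_len), (right, right_len) = node
    x_l, extra_l, c_l = contrasts(left)
    x_r, extra_r, c_r = contrasts(right)
    # Branches below already-processed nodes are lengthened to account for
    # the uncertainty in the estimated values at those nodes.
    v_l = left_len + extra_l
    v_r = right_len + extra_r
    # Standardized contrast between the two descendants.
    contrast = (x_l - x_r) / math.sqrt(v_l + v_r)
    # Weighted average of the descendants, weights 1/v.
    value = (x_l / v_l + x_r / v_r) / (1.0 / v_l + 1.0 / v_r)
    extra = v_l * v_r / (v_l + v_r)
    return value, extra, c_l + [contrast] + c_r


# Three taxa: ((A:1, B:1):1, C:1.5) with made-up trait values.
cherry = ((("A", 4.0), 1.0), (("B", 2.0), 1.0))
tree = ((cherry, 1.0), (("C", 1.0), 1.5))
root_value, _, pics = contrasts(tree)
```

A tree with n tips yields n - 1 contrasts; in a typical analysis the contrasts for two traits are computed on the same tree and then correlated through the origin.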

Taxonomic Name Resolution

A cross-cutting data integration problem for many iPToL projects is the resolution of synonymous, erroneous, or otherwise conflicting taxonomic names. A pilot project is underway with the BIEN working group (https://pods.iplantcollaborative.org/wiki/display/iptol/BIEN) and other iPlant stakeholders to unify taxonomic name resolution with the Tropicos database and other taxonomic name resources (see: https://pods.iplantcollaborative.org/wiki/display/iptol/TNRS+Workshop). This is the first iPlant "incubator project" (https://pods.iplantcollaborative.org/wiki/display/IP/TNRS), a new, accelerated collaborative development model that brings together scientific advisers from the working groups, the iPToL engagement team, and members of the iPlant core services and core software development groups.
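
As a rough illustration of what resolving synonymous or misspelled names involves, the sketch below checks a submitted name against an accepted list, then a synonym table, then falls back to approximate string matching with Python's standard `difflib` module. It is not the TNRS design: the tiny `ACCEPTED` and `SYNONYMS` tables, the `resolve` function, and the 0.85 cutoff are assumptions for this example, and nothing here queries Tropicos.

```python
import difflib

# Illustrative lookup tables only; a real service would draw on a
# curated taxonomic database.
ACCEPTED = {"Quercus robur", "Acer saccharum"}
SYNONYMS = {"Quercus pedunculata": "Quercus robur"}


def resolve(name, cutoff=0.85):
    """Return (accepted_name_or_None, how_it_was_matched)."""
    # Normalize whitespace and capitalization: "Genus epithet".
    cleaned = " ".join(name.split()).capitalize()
    if cleaned in ACCEPTED:
        return cleaned, "accepted"
    if cleaned in SYNONYMS:
        return SYNONYMS[cleaned], "synonym"
    # Fall back to approximate matching to catch misspellings.
    candidates = sorted(ACCEPTED | set(SYNONYMS))
    close = difflib.get_close_matches(cleaned, candidates, n=1, cutoff=cutoff)
    if close:
        match = close[0]
        return SYNONYMS.get(match, match), "fuzzy"
    return None, "unresolved"


for submitted in ["Quercus robur", " quercus  PEDUNCULATA", "Quercus robar", "Zea mays"]:
    print(submitted, "->", resolve(submitted))
```

Returning the match type alongside the name lets a curator review fuzzy matches separately from exact ones.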

Scaling up Phylogenetic Inference

Phylogenetic analysis at the level of several hundred thousand species represents a new scalability challenge that iPlant is addressing on two parallel tracks. The general approach is to optimize existing methods (maximum likelihood with RAxML and neighbor joining with NINJA/WINDJAMMER). The current testing data set is a matrix of eight genes for 116K species provided by Stephen Smith. The ultimate goal is to have the computing capacity to build phylogenetic trees for up to 500K species when the data become available.

RAxML

RAxML (http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm) was developed by Alexandros Stamatakis. Members of the iPlant engagement team at the Texas Advanced Computing Center (TACC) have been working to scale up the application so that it can leverage TACC's substantial high-performance computing resources. The first major achievement of this work is the addition of check-pointing, which allows long-running jobs to be stopped and restarted without loss of data or compute time. Ongoing work focuses on improving the parallel implementation to decrease overall run time for large data sets.
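
The idea behind check-pointing can be sketched generically: periodically write the state of the computation to disk, and on start-up resume from the last saved state if one exists. The example below is not RAxML's implementation or file format; the JSON state, the function names, and the placeholder "work" are all assumptions for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write the state to a temporary file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    # Renaming avoids leaving a half-written checkpoint behind.
    os.replace(tmp_path, path)


def run(total_steps, path, every=10, stop_after=None):
    """Run a stand-in computation, resuming from `path` if it exists.

    Returns the final total, or None if interrupted via `stop_after`.
    """
    state = {"step": 0, "total": 0}
    if os.path.exists(path):
        with open(path) as handle:
            state = json.load(handle)
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated interruption, e.g. a queue time limit
        state["total"] += state["step"] ** 2  # placeholder for real work
        state["step"] += 1
        done_this_run += 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "search.ckpt.json")
    first = run(100, checkpoint, stop_after=35)  # stops early; last save at step 30
    second = run(100, checkpoint)                # resumes from step 30 and finishes
```

Only the work done since the last checkpoint (here, steps 30 to 34) is repeated after a restart, which is the trade-off controlled by the checkpoint interval.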

NINJA/WINDJAMMER

Neighbor-joining is a hierarchical clustering method for inferring phylogenies. NINJA, developed by Travis Wheeler, is a new tool that produces correct neighbor-joining trees much faster than the canonical algorithm. It can scale to inputs of the size needed for assembling the "tree of life" for all green plants, but would require months to do so on a single computer. The goal of this project is to build a parallelized implementation of the core algorithm from NINJA, allowing biologists to analyze very large data sets in a matter of hours instead of months. The end product will be a new implementation of this clustering method (code named WindJammer), entirely rewritten in C (the original version was in Java).
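
For readers unfamiliar with the method, the following is a minimal sketch of the canonical neighbor-joining algorithm, the cubic-time baseline that tools such as NINJA are designed to speed up. It does not reproduce NINJA's or WindJammer's optimizations. The function name, the edge-list output, and the `node0`, `node1`, ... labels for internal nodes are assumptions for this example.

```python
def neighbor_joining(names, matrix):
    """Canonical neighbor joining on a symmetric distance matrix.

    Returns the unrooted tree as a list of (node, node, branch_length) edges.
    Internal nodes are labeled node0, node1, ... (taxa must not use these names).
    """
    d = {a: {b: matrix[i][j] for j, b in enumerate(names) if j != i}
         for i, a in enumerate(names)}
    active = list(names)
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        # Row sums over the currently active nodes.
        r = {a: sum(d[a][b] for b in active if b != a) for a in active}
        # Choose the pair that minimizes the Q criterion.
        best = None
        for i, a in enumerate(active):
            for b in active[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to their new parent.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"node{next_id}"
        next_id += 1
        rest = [c for c in active if c not in (a, b)]
        # Distances from the new node to every remaining node.
        d[new] = {}
        for c in rest:
            d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        edges += [(new, a, len_a), (new, b, len_b)]
        active = rest + [new]
    if len(active) == 2:
        a, b = active
        edges.append((a, b, d[a][b]))
    return edges


# A small five-taxon example with additive distances.
taxa = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree_edges = neighbor_joining(taxa, dist)
```

Each iteration scans all pairs of active nodes, so the total work grows with the cube of the number of taxa; this is why inputs with hundreds of thousands of taxa call for both algorithmic shortcuts and parallel execution.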

Data Integration and Assembly

The Data Integration and Assembly working group has two main purposes. First, it engages the plant science community and facilitates the aggregation of data needed to assemble the "tree of life" for green plants. Second, it is establishing the infrastructure to support the logistics of handling the data as key resources are identified.

My-Plant

My-Plant.org is a phylogenetically structured social networking website under development for plant scientists, educators, and other interested parties. Users will be able to easily share information and research, collaborate, and stay on top of the latest news in their field. Members of the My-Plant.org community will gather around clades of their choice, then view and contribute to information about those clades, including image galleries, message board discussions, wiki pages, clade pages, and links to external sources of data. The planned launch is late summer 2010.

Data Intake Pipeline

The goal of this project is to establish a robust data pipeline that makes use of different methods to compare and cluster sequences, identify orthologous genes, and compute multiple sequence alignments as well as gene trees. The group has begun implementing several approaches to move efficiently from sequences to gene trees. The PHLAWD data intake pipeline (http://code.google.com/p/phlawd/), developed by Stephen Smith, takes predefined gene regions of interest and exemplar sequences representing phylogenetic diversity; it does not require clustering. The All-by-all BLAST pipeline, developed by Gordon Burleigh, offers the advantage of not requiring any a priori functional knowledge. Testing of the pipelines is in progress, and work is under way to migrate them to the HPC environment at TACC.
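
One generic way to turn all-by-all similarity hits into sequence clusters is single-linkage clustering: any two sequences connected by a sufficiently strong hit end up in the same cluster. The sketch below is not the All-by-all BLAST pipeline itself, whose clustering method is not described here; the `(query, subject, e-value)` hit tuples, the threshold, and the function name are assumptions for illustration.

```python
def cluster_hits(hits, max_evalue=1e-10):
    """Single-linkage clustering of sequences from pairwise hits.

    `hits` is an iterable of (query_id, subject_id, evalue) tuples.
    Returns clusters as sorted lists, largest cluster first.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        root_q, root_s = find(query), find(subject)
        if evalue <= max_evalue and root_q != root_s:
            parent[root_s] = root_q  # merge the two clusters

    clusters = {}
    for seq in list(parent):
        clusters.setdefault(find(seq), set()).add(seq)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Made-up hits: s3-s4 is too weak to link the two groups.
example_hits = [
    ("s1", "s2", 1e-50),
    ("s2", "s3", 1e-20),
    ("s4", "s5", 1e-30),
    ("s3", "s4", 0.5),
    ("s6", "s6", 0.0),
]
print(cluster_hits(example_hits))
```

Single linkage is simple but can chain unrelated sequences together through intermediate hits, so stricter thresholds or other clustering methods are often preferred in practice.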

Gene-Species tree reconciliation

Ongoing development in collaboration with the Tree Reconciliation working group (https://pods.iplantcollaborative.org/wiki/display/iptol/Tree+Reconciliation) and the Thousand Plant Transcriptome project (oneKP; http://www.onekp.com/) is focused on using the existing phylogenetic tree to address evolution of gene families. Work is underway on an iPlant incubator project (https://pods.iplantcollaborative.org/wiki/display/IP/TR) to provide a gene-species tree reconciliation service that will reconcile gene families being generated by the oneKP project with the green plant species phylogeny.

The oneKP project is also a potential source of input data for the Data Integration and Assembly working group's efforts to feed into the "big tree" analysis.
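
To illustrate what reconciling a gene tree with a species tree means, the sketch below applies the classic LCA-mapping rule: each gene-tree node is mapped to the smallest species-tree clade containing all of its species, and a node is counted as a duplication when it maps to the same clade as one of its children. This is a textbook technique shown for orientation, not the method chosen for the incubator project; the nested-tuple trees, the `species_of` mapping, and the function names are assumptions for this example.

```python
def clades(tree):
    """Return one frozenset of leaf names per node of a nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def lca(species_set, species_clades):
    """Smallest species-tree clade that contains every species in the set."""
    return min((c for c in species_clades if species_set <= c), key=len)


def count_duplications(gene_tree, species_tree, species_of):
    """Count gene-tree nodes inferred as duplications under LCA mapping."""
    species_clades = clades(species_tree)
    duplications = 0

    def walk(node):
        nonlocal duplications
        if isinstance(node, str):
            return frozenset([species_of[node]])
        child_maps = [walk(child) for child in node]
        mapped = lca(frozenset().union(*child_maps), species_clades)
        if mapped in child_maps:   # parent and child map to the same clade
            duplications += 1
        return mapped

    walk(gene_tree)
    return duplications


# Hypothetical species A, B, C and genes named after them.
species_tree = (("A", "B"), "C")
species_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
congruent = (("a1", "b1"), "c1")
duplicated = ((("a1", "b1"), ("a2", "b2")), "c1")
print(count_duplications(congruent, species_tree, species_of))
print(count_duplications(duplicated, species_tree, species_of))
```

The linear scan in `lca` keeps the example short; a service handling many large gene families would need a faster lowest-common-ancestor lookup.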

Large Tree Visualization

With current technologies, visualization of phylogenetic trees with large numbers of species becomes increasingly slow and difficult. The goal of the Large Tree Visualization project is to develop an application for viewing, analyzing, and exploring large phylogenetic trees. Code named "PhyloViewer", the application is a platform-independent, web-based viewer that will enable users to rapidly navigate very large trees (>500k taxa), integrate their own metadata, and share that information with others. PhyloViewer is under active development and will be integrated into the Discovery Environment as well as made available through a separate web service.

Angiosperm Phylogeny Website

A collaboration is underway between iPlant and the Angiosperm Phylogeny Website to modernize the infrastructure and convert the current static HTML pages of APWEB into dynamic database-driven web services (APWEB2) that will offer these widely used data to researchers and educators using a variety of different interfaces. See https://pods.iplantcollaborative.org/wiki/display/iptol/AP_Project_Charter for more information.