Sankoff_iPAT

Project Charter January 23, 2010

Project Title: Phylogenetic tools for plant comparative genomics

Start Date: September 1, 2009

End Date: February 1, 2010

Project Justification:
A variety of comparative genomic problems involve the reconstruction of unknown genomes, more specifically their gene orders, starting from knowledge of one or more given contemporary genomes. These include the inference of ancestral genomes as part of the small phylogeny problem (i.e., with a given tree topology), and its archetypical cases, the median and quartet problems, based on three or four given genomes, respectively. They also include the genome halving and genome aliquoting problems, based on a single genome where every gene is duplicated or in a gene family of size m ≥ 2, the idea being to infer the immediate pre-polyploid ancestor. There are also problems where one or more current, rather than ancestral, genomes, lacking full gene order or affected by paralogy or high levels of order error, are to be reconstructed based on comparative evidence. In addition, there are various hybrid problems, such as guided genome halving, which combines genome halving with phylogenetic inference, or small phylogeny on unequal genomes, which integrates missing data considerations.

The computational status of most of these problems has been reviewed in a recent paper by Tannier et al. and in a recent textbook by Fertin et al. With few exceptions (such as genome halving or the multichromosomal breakpoint median), they are NP-hard problems, especially in their more realistic versions. Nevertheless, with high demand from the phylogenetics community, a great variety of exact and heuristic algorithms have been developed, some of which are in wide use, like MGR and GRAPPA.

At the heart of many of these methods, especially those that find, or at least seek, a most economical solution in terms of genomic distances, is the strategy of maximizing the number of cycles in the breakpoint graph (Siepel, Caprara, and many others), or its dual, the adjacency graph (Bergeron et al.), while reconstructing the unknown genome. For this project, we propose a data structure that is designed entirely for this type of strategy, adapt it to each of the main gene-order reconstruction problems, and implement the corresponding algorithms in an integrated software package.
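To make the cycle-counting idea concrete, the following minimal Python sketch counts the alternating cycles in the breakpoint graph of two genomes. It is an illustration only, not the Sankoff lab software. It assumes each genome is a single circular chromosome written as a list of signed integers over the same gene set; the names extremities, adjacencies and count_cycles are hypothetical.

```python
def extremities(gene):
    """Return (left_end, right_end) of a signed gene as read along the chromosome."""
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def adjacencies(genome):
    """Map every gene extremity to its neighbour in a circular signed genome."""
    adj = {}
    n = len(genome)
    for i in range(n):
        _, right = extremities(genome[i])
        left, _ = extremities(genome[(i + 1) % n])
        adj[right] = left
        adj[left] = right
    return adj


def count_cycles(genome_a, genome_b):
    """Count alternating cycles in the breakpoint graph of two genomes.

    Assumption: both genomes are circular and contain the same genes,
    so every extremity has exactly one A-edge and one B-edge.
    """
    adj_a = adjacencies(genome_a)
    adj_b = adjacencies(genome_b)
    seen = set()
    cycles = 0
    for start in adj_a:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            u = adj_a[v]      # follow the edge from genome A
            seen.add(u)
            v = adj_b[u]      # then the edge from genome B
    return cycles


print(count_cycles([1, 2, 3], [1, 2, 3]))    # identical genomes
print(count_cycles([1, 2, 3], [1, -2, 3]))   # one inverted gene
```

Identical genomes give one cycle per gene, and each rearrangement that separates the genomes tends to lower the count, which is why the reconstruction methods seek to maximize it.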

Pathgroups is based on a compact and flexible way of storing partially completed cycles, so that genome-wide greedy searches (allowing various weightings, look-aheads and cut-offs) are rapidly executed and the database rapidly updated. Path groups are readily adapted to virtually all the gene-order reconstruction problems; indeed, the virtues of path groups could be anticipated by studying the basic steps in procedures as diverse as genome halving and the median problem. The procedure handles constraints on the reconstructed genomes (e.g., exact tetraploidy) efficiently.
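The following Python sketch illustrates the general idea of storing partially completed cycles and extending them greedily, here for a median of three genomes. It is a deliberately simplified stand-in, not the Pathgroups implementation: it has no weightings, look-ahead, cut-offs or chromosome constraints, it assumes circular single-chromosome genomes over the same gene set, and the names extremities, adjacencies and greedy_median are hypothetical.

```python
def extremities(gene):
    """Return (left_end, right_end) of a signed gene as read along the chromosome."""
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def adjacencies(genome):
    """Map every gene extremity to its neighbour in a circular signed genome."""
    adj = {}
    n = len(genome)
    for i in range(n):
        _, right = extremities(genome[i])
        left, _ = extremities(genome[(i + 1) % n])
        adj[right] = left
        adj[left] = right
    return adj


def greedy_median(genomes):
    """Greedily choose median adjacencies that close the most cycles.

    ends[c][v] is the far endpoint of the partially completed cycle (path)
    of input genome c that starts at the still-unmatched extremity v.
    """
    ends = [adjacencies(g) for g in genomes]
    free = set(ends[0])
    median, cycles = [], 0
    while free:
        best = None
        for u in sorted(free):
            for e in ends:
                v = e[u]  # only such a v can close a cycle with u
                score = sum(1 for f in ends if f[u] == v)
                if best is None or score > best[0]:
                    best = (score, u, v)
        score, u, v = best
        median.append((u, v))
        cycles += score
        for e in ends:
            a, b = e[u], e[v]
            if a != v:            # no cycle closed: join the two paths
                e[a], e[b] = b, a
        free -= {u, v}
    return median, cycles


median, cycles = greedy_median([[1, 2, 3], [1, 2, 3], [1, -2, 3]])
print(cycles, median)
```

Because only path endpoints are stored, each new adjacency updates a constant number of entries per input genome, which is the property that makes genome-wide greedy searches cheap in a scheme of this kind.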

Project Objectives: To produce an accessible, production-ready implementation of Pathgroups that allows plant biologists to correctly reconstruct ancestral genomes based on gene order or syntenic relationships in genome data.

Overview of Deliverables: Reconstructed ancestral gene orders, based on gene orders in diploid or polyploid descendants, updatable as more genomes are added.

Approach:

Test and harden Sankoff lab software on the iPlant server, and develop documentation and tutorials.

Success Criteria:
Working tool deployed in the iPlant iPTOL portal.
Identification of new collaborators who will begin a computational/plant biology collaboration with advice from Sankoff and Albert.

Key Assumptions:
Must have standard tree with genome data already available (already have grape, papaya, poplar; could do Arabidopsis, sorghum).
Genome data must have been analyzed with OrthoMCL or inParanoid. Genome data must have gene ids (e.g., from fgenesh), gene location data (such as contig_base#), and orthology relation fields for each gene id.
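As an illustration of how the three required fields could feed a gene-order analysis, the Python sketch below reads one record per gene and returns the order of orthology groups along each contig. The tab-separated layout, the field order, and the names GeneRecord, parse_record and gene_orders are assumptions made for this example, not a format defined by the project; gene orientation is not among the fields listed above and is therefore not handled.

```python
from collections import defaultdict
from typing import NamedTuple


class GeneRecord(NamedTuple):
    gene_id: str
    contig: str
    start: int
    ortho_group: str


def parse_record(line):
    """Parse 'gene_id<TAB>contig_base<TAB>orthology_group' (assumed layout)."""
    gene_id, location, ortho_group = line.rstrip("\n").split("\t")
    # The location is assumed to look like contig_base#, e.g. ctg1_300.
    contig, _, start = location.rpartition("_")
    if not (gene_id and contig and ortho_group and start.isdigit()):
        raise ValueError(f"malformed record: {line!r}")
    return GeneRecord(gene_id, contig, int(start), ortho_group)


def gene_orders(lines):
    """Return, per contig, the orthology groups sorted by base position."""
    by_contig = defaultdict(list)
    for line in lines:
        if line.strip():
            rec = parse_record(line)
            by_contig[rec.contig].append(rec)
    return {
        contig: [r.ortho_group for r in sorted(recs, key=lambda r: r.start)]
        for contig, recs in by_contig.items()
    }


example = ["g1\tctg1_300\tOG2", "g2\tctg1_100\tOG1", "g3\tctg2_50\tOG1"]
print(gene_orders(example))
```

A record that lacks any of the three fields raises ValueError, so incomplete genome data is caught before any reconstruction is attempted.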

Resources: Sankoff lab software components
Albert lab expertise in testing data sets.
iPlant software engineering, gui development, and HPC expertise.

Roles and Responsibilities: Supervising: Victor Albert, David Sankoff
Coordinating: Ann Stapleton
Researchers: Chunfang Zheng (Sankoff postdoctoral associate); Adriana Munoz (Sankoff PhD student)
iPlant iPTOL engagement team coordinator: Sheldon McKay
iPlant software development supervisor:
iPlant software engineering:

Signatures—The following people agree that the above information is accurate:
Project team members: Project sponsor and/or authorizing manager(s):

Notes/Comments: