OneKP Phyloinformatics Pipelines and Output

As part of the iPlant Assembling the Tree of Life (iPToL) grand challenge project, the Tree Reconciliation Working Group (Todd Vision, Cecile Ane, Jim Leebens-Mack and Jamie Estill), collaborators (Norm Wickett, Raj Ayyampalayam, Mike Barker, Claude dePamphilis, Tandy Warnow, Gordon Burleigh, Oliver Eulenstein) and the iPlant team (including Michael Gonzales, Chris Jordan and Sheldon McKay) have developed pipelines for analysis of gene families within the context of organismal phylogenies. Data are being processed through these pipelines as sequences become available, and intermediate result files will be posted for 1KP consortium members.

Analysis of Transcriptome Data for each cDNA Library 

Figure 1 shows how individual transcriptomes are being processed as we receive data from BGI-Shenzhen.

  

Figure 1. Description of analysis pipelines and data files for individual transcriptomes. Databases and downloadable files shown in the grey box are being made available as described below.

Access to Data and Assembly Analysis Files:

  • Raw reads and SOAP assemblies are available from the Westgrid server in Alberta and the TACC server in Texas. Individual directories (e.g. ABEH-Heliotropium_greggii) at these sites include raw Illumina reads, assemblies, and BLAST analysis output files.
  • BLASTX analyses of all SOAP assemblies are being performed to compare assemblies to gene annotations for plant genomes available in JGI's Phytozome database. Downloadable files with BLAST results can be found within the transcriptome assembly directories on the TACC server (e.g. ABEH-assembly.fa_blastx.results.bz2).
  • Sequences and BLAST results for individual assemblies will also be accessible through an assembly viewer. The web-based viewer will support searches on annotation terms for BLAST hits found in the BLASTX analyses described above. A prototype of the assembly viewer is currently available on a server at the University of Georgia (note that the prototype includes OLD assemblies).
  • A BLAST portal will soon be launched on the TACC server. A barebones prototype with OLD assemblies is currently available on a BLAST server at the University of Georgia.
  • Mike Barker is running all transcriptome assemblies through his DupPipe Ks plot analysis pipeline.  Ks plots will be placed in the assembly directory for each transcriptome as they are generated.   
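
For readers who want to work with the compressed BLASTX result files directly, the following is a minimal Python sketch that keeps the best-scoring hit for each assembled sequence. It assumes the decompressed file is in BLAST's standard 12-column tab-delimited layout; the actual layout of the 1KP result files is not specified here, so check a file before relying on this. The function names are illustrative only.

```python
import bz2

# Column names of BLAST's standard 12-column tabular output.
# ASSUMPTION: the 1KP BLASTX result files use this layout.
BLAST_TAB_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {query_id: hit} keeping the highest-bitscore hit per query."""
    best = {}
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < len(BLAST_TAB_COLUMNS):
            continue  # not a standard 12-column row
        hit = dict(zip(BLAST_TAB_COLUMNS, fields))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


def best_hits_from_bz2(path):
    """Read a bz2-compressed result file without unpacking it to disk."""
    with bz2.open(path, "rt") as handle:
        return best_hits(handle)
```

Under that assumption, calling best_hits_from_bz2 on a downloaded file such as ABEH-assembly.fa_blastx.results.bz2 would return one entry per assembled sequence that has at least one hit.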

Sequence Annotation (i.e. Specification of definition lines in FASTA files)

For now, 1KP assembly sequences are labeled only with sequence names (e.g. scaffold-AALA-0000003-Meliosma_cuneifolia). We are considering appending the annotation of the best BLASTX hit to the label (e.g. scaffold-AALA-0000003-Meliosma_cuneifolia | similarity to A. thaliana pyridoxal-dependent decarboxylase family gene AT3G17720), but we are somewhat concerned that some may over-interpret the label as implying orthology or functional orthology.
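
As an illustration of how such labels could be handled in scripts, here is a small Python sketch that splits a label into its parts and optionally appends an annotation after a " | " separator. The field names (kind, sample_code, number, species) are our reading of the example label above, not an official specification, and the helper names are hypothetical.

```python
from typing import NamedTuple, Optional


class SequenceLabel(NamedTuple):
    kind: str                  # e.g. "scaffold"
    sample_code: str           # e.g. "AALA"
    number: str                # e.g. "0000003" (kept as text to preserve zero padding)
    species: str               # e.g. "Meliosma_cuneifolia"
    annotation: Optional[str]  # text after " | ", if present


def parse_label(label: str) -> SequenceLabel:
    """Split a definition line such as scaffold-AALA-0000003-Meliosma_cuneifolia."""
    name, sep, annotation = label.partition("|")
    # Split on the first three hyphens only, so a hyphen inside
    # the species name would be left intact.
    parts = name.strip().split("-", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected label format: {label!r}")
    kind, sample_code, number, species = parts
    return SequenceLabel(kind, sample_code, number, species,
                         annotation.strip() if sep else None)


def with_annotation(label: str, annotation: str) -> str:
    """Append a best-hit annotation to a label, separated by ' | '."""
    return f"{label.strip()} | {annotation.strip()}"
```

Keeping the annotation in a separate field makes it easy to drop again in analyses where a best-hit description should not be read as evidence of orthology.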

Phylogenomic Analyses - Gene Tree and Species Tree Estimations

As indicated in Figure 1, 1KP assemblies will be sorted into gene families, and sequence alignments and maximum likelihood trees will be generated for each gene family (Figure 2). Norm Wickett is directing construction of gene family data sets, alignments and tree estimations using the SOAP assemblies and a custom assembly pipeline developed in the dePamphilis lab (CLC+). For our pilot study, parallel analyses will be done using the SOAP and CLC+ pipelines in order to compare the length, completeness and number of contigs and scaffolds sorted to each family, and the degree of resolution within the gene trees estimated using these two assembly approaches. This work is in its early stages, but FASTA files with the gene family data sets to be used in the pilot study should be available by the end of March or early April. Gene family alignments and gene trees should be available by the end of May, and our first species tree estimations will follow shortly after that. Consortium members will be informed as the gene family data sets become available.
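
To give a sense of the kind of comparison involved, the sketch below summarizes the number and length of sequences in a gene family FASTA file. It is a minimal Python illustration, not the pipeline's own code; the choice of summary statistics (count, total, median and longest length) is an assumption made for the example, and completeness is not measured.

```python
from statistics import median


def read_fasta(handle):
    """Yield (definition_line, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record, if any.
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle):
    """Return simple count and length statistics for one FASTA file."""
    lengths = [len(seq) for _, seq in read_fasta(handle)]
    if not lengths:
        return {"sequences": 0, "total_length": 0,
                "median_length": 0, "longest": 0}
    return {
        "sequences": len(lengths),
        "total_length": sum(lengths),
        "median_length": median(lengths),
        "longest": max(lengths),
    }
```

Running summarize on the SOAP and CLC+ files for the same gene family gives two small dictionaries that can be compared side by side.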

After analyses for the pilot study are complete, other 1KP transcriptomes will be added to the gene family data sets. A portal will be developed for consortium members to identify gene families of interest through BLAST and annotation term searches.


Figure 2. Description of analysis pipelines and data files for estimation of gene family sequence alignments, gene trees and species trees.