TR_14DEC09

Action items

  • ACTION: Assess PlantTribes, Phytozome, Plaza, and Ensembl Plants as sources of gene family data, especially aligned sequences
  • ACTION: Get started on TreeBeST as the reconciliation tool
  • ACTION: Think about database/model to support queries as specified in use cases 1-4.
  • PIs: follow-up on Jens

Agenda

onekp followup?

Questions from data integration group (Zhenyuan Lu):

During prototyping, the data will most likely be entered through files. Even if much of the data is later retrieved directly from the iPlant internal database, I think we still need to consider data from users and external data providers, so there will be data integration problems we should address. As an ETA of the data integration group, I think it is important that the data integration group also gets involved in the discussion here and helps to work on those problems.

I have the following questions that I would like to go through:   [some quick answers below - Todd]

1. what types of data will be used as input, intermediate results, and output?

(a) a species tree, w/ or w/o polytomies, and (b) either a gene tree, or a gene alignment (depending on which method is used). 
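
Since the species tree may arrive with or without polytomies, a quick check on the input could be useful. The following is only a sketch: it assumes the tree is a rooted Newick string without quoted labels, and the function name and example taxa are made up for illustration.

```python
def find_polytomies(newick: str) -> list[int]:
    """Return the child count of every internal node with more than two children.

    Assumes a rooted Newick string whose labels contain no quotes,
    parentheses, or commas (an assumption, not a full Newick parser).
    """
    stack = []        # one comma counter per currently open parenthesis
    polytomies = []
    for ch in newick:
        if ch == "(":
            stack.append(0)
        elif ch == "," and stack:
            stack[-1] += 1
        elif ch == ")":
            children = stack.pop() + 1   # n commas separate n + 1 children
            if children > 2:
                polytomies.append(children)
    return polytomies


# Hypothetical species tree with one unresolved three-way split.
species_tree = "((Arabidopsis,Populus),(Oryza,Zea,Sorghum));"
print(find_polytomies(species_tree))
```

An empty result means the tree is fully bifurcating under these assumptions.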

2. what exchange file formats will be used for each type of data?

Initially, these will be internal (not user-provided), so no requirement there.  In a more mature interface, users will need to upload gene trees as Newick or gene alignments as Clustal/Phylip/Fasta. I would like to see the web service layer accept and produce NeXML to enable machine-to-machine communication.
For NEXUS, we might as well adopt the Mesquite-style gene-to-species tree mapping system: the TaxaAssociation block. For NeXML, we should convene a discussion among interested parties (especially Rutger) and formally extend NeXML's capabilities to handle gene-to-species mapping (Bill)
Could this be done over a semester by Jamie Estill? (Jim)
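
If users do eventually upload files in the formats listed above, the upload step will need some way to tell them apart. A rough sketch of first-line sniffing follows; the function name is hypothetical and the heuristics are assumptions that a real validator would need to tighten.

```python
def guess_format(text: str) -> str:
    """Guess the file format from its first non-blank line (heuristic only)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    first = lines[0].strip()
    if first.startswith(">"):
        return "fasta"                      # FASTA records begin with '>'
    if first.upper().startswith("CLUSTAL"):
        return "clustal"                    # Clustal header line
    if first.upper().startswith("#NEXUS"):
        return "nexus"                      # NEXUS files start with the #NEXUS token
    parts = first.split()
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return "phylip"                     # "<n_sequences> <n_sites>" header
    if first.startswith("(") and lines[-1].strip().endswith(";"):
        return "newick"                     # parenthesised tree ending in ';'
    return "unknown"


print(guess_format(">gene1\nATGAAATAG\n"))
print(guess_format("((gene1,gene2),gene3);"))
```

NeXML is XML and would be better recognised with an XML parser than with this kind of sniffing.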

3. what is the minimum set of metadata (e.g. provenance data) that should be kept? And how should the metadata be handled/stored?

For genes, we will want a standard menu of annotation data (GenBank/EMBL IDs, Pfam domains, GO annotations, possibly pathway designations) in order to facilitate search.  That is the only external metadata I can think of, apart from the taxonomic names.  As for provenance of user-supplied files down the road, at a minimum we will want creator, date, title, and format.
I think we would want the relative positions of genes in the sequenced genomes. This information is captured in the gene names for many but not all genomes (Jim)
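
The minimum provenance fields named above (creator, date, title, format) could be captured in a small record like the one below. This is a sketch only: the class name, field names, and example values are assumptions, and nothing here reflects an agreed schema.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class Provenance:
    """Minimal provenance for a user-supplied file (hypothetical field names)."""
    creator: str
    created: date        # the "date" field from the notes
    title: str
    file_format: str     # the "format" field, e.g. "newick" or "fasta"


# Example values are invented for illustration.
record = Provenance(
    creator="A. User",
    created=date(2009, 12, 14),
    title="Example gene family alignment",
    file_format="fasta",
)
print(asdict(record))
```

Keeping the record immutable (frozen) is one way to make sure provenance is written once at upload time and not edited afterwards.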

4. Will/should any ontology be used?

CDAO for the NeXML.  Controlled vocabularies will be useful for species names (scientific/common) and gene/protein names/IDs, since mapping the taxon of the gene to the taxon in the species tree is fundamental.
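
Because mapping each gene's taxon to a taxon in the species tree is fundamental, it may be worth validating that mapping before any reconciliation is attempted. A minimal sketch, assuming the mapping is available as an explicit gene-to-species table (all identifiers below are invented):

```python
def check_mapping(gene_leaves, species_leaves, gene_to_species):
    """Return (genes with no mapping, genes mapped to a species not in the tree)."""
    species = set(species_leaves)
    unmapped = [g for g in gene_leaves if g not in gene_to_species]
    unknown = [
        g for g in gene_leaves
        if g in gene_to_species and gene_to_species[g] not in species
    ]
    return unmapped, unknown


# Hypothetical gene IDs and species names, for illustration only.
gene_leaves = ["geneA1", "geneA2", "geneB1", "geneC1"]
species_leaves = ["Arabidopsis thaliana", "Oryza sativa"]
gene_to_species = {
    "geneA1": "Arabidopsis thaliana",
    "geneA2": "Arabidopsis thaliana",
    "geneB1": "Oryza sativa",
    "geneC1": "Zea mays",            # not present in the species tree
}

unmapped, unknown = check_mapping(gene_leaves, species_leaves, gene_to_species)
print(unmapped, unknown)
```

A controlled vocabulary for species names would make the membership test above reliable; with free-text names, synonyms and common names would show up as false mismatches.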

plan and time-line for prototype development

DRAFT
Dec 7-31

  1. Sheldon and Andy investigate data sources and assemble test data sets for development purposes
  2. Sheldon and Andy will publish a more detailed development plan before Dec 18.

Jan 1-15

  1. Sheldon: Work on back-end data, alignment service (if need be), treebest gene tree construction
  2. Andy: Assemble first data sets for prototype, gather more requirements from the WG

Jan 16-31

  1. Sheldon: Work on tree reconciliation and reconciled tree view; queries
  2. Andy: queries; Rough out UI; general tree viewing

Feb 1-28

  1. Sheldon, Andy - iterate with WG, refine design, integrate XXX canned data sets.

Prototype Thoughts and Assumptions

DRAFT

  • May mix and match sources for best coverage. Prototype will use worked examples (may be quite a few), no user-supplied data.
  • Full length CDS from ATG to stop codon?
  • AA-guided nucleotide alignments for Phyml/treebest; may need to do (or redo) the alignments.
  • Is there a divergence threshold beyond which we consider sequences too divergent for DNA-based ML? If so, is it measured as raw distance, Ks, or Ka?
  • Species trees require internal node labels for treebest (may need to edit).
  • treebest gives us gene trees. Want to reconcile downstream of treebest, use 'reconcile' and primetv to show reconciled trees graphically.
  • Want to be able to view: alignments, gene/species trees, reconciled trees
  • treebest is packaged as compiled C binaries (good for integration with web app)
  • treefam database schema comes with a Perl API (good) and is tied to treebest; consider using a subset of the treefam schema for the prototype back-end to get the schema and API for free.
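
On the point above that species trees need internal node labels (and may need editing), one possible way to fill in missing labels is sketched below. It assumes plain Newick without quoted labels or whitespace after closing parentheses; the function name and the generated "N1, N2, ..." labels are placeholders, not anything treebest prescribes.

```python
def label_internal_nodes(newick: str, prefix: str = "N") -> str:
    """Insert a generated label after every ')' that has no label of its own."""
    out = []
    counter = 0
    for i, ch in enumerate(newick):
        out.append(ch)
        if ch == ")":
            nxt = newick[i + 1] if i + 1 < len(newick) else ""
            # A ':' (branch length), ',', ')', ';' or end of string right
            # after ')' means this internal node is unlabeled.
            if nxt in ("", ":", ",", ")", ";"):
                counter += 1
                out.append(f"{prefix}{counter}")
    return "".join(out)


print(label_internal_nodes("((A,B),(C,D));"))
print(label_internal_nodes("((A:1,B:1):2,C:3);"))
```

Existing internal labels are left untouched, so a tree that is already partly labeled only gets the gaps filled. Meaningful clade names would still have to be supplied by hand.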

Notes

Attendees: Cecile Ane, Bernice Rogowitz, Andrew Lenards, Jim Leebens-Mack, Michael Gonzales, Adam Kubach, Natalie Henriques.

  • Reviewed action items. Sheldon and Andrew have been working through the items.
  • Cecile spoke about having a meeting through Skype this week for members who were not able to attend today’s meeting.
  • Bernice Rogowitz was introduced as a new engagement team analyst to the TR working group. She has been working with the G2P group constructing high level workflows.
  • Discussed the questions from the DI group. Points of focus were:
    • What is the definition of Gene Family? What is it meant to be? Will everyone be in agreement with the definition?
    • Gene Family – common ancestor/similarities (Jim stressed that a gene family is "operationally" defined)
    • Which database is going to give the set of alignments we want to work with?
    • Andy – In a general way how do we get from point A to use cases? That information would be helpful for the development of the prototype and the core software group. Understanding the high-level workflow of a user is helpful (a summary or narrative at the level of a "day in the life" of a user). And from the use cases, what are the end points? What data is important for export out of a system so that it can be used for other analysis, publication, etc.
  • Bernice discussed doing a similar high level workflow for the TR group as is being done in the G2P group. Characterizing what they are doing today and exploring what to do in the future.
  • Discussion on TR decisions – uncertainty on tools, philosophical approach, source of data, different algorithms produce different answers.
  • ACTION: Michael will schedule a 1KP meeting for tomorrow to follow up on an action item from the workshop. The goal of the meeting will be to review the technical details of mirroring the 1KP data at TACC, discuss any obstacles (e.g. data sharing policies), and begin to formulate a plan for execution.