TR_25JAN10

Action items

Agenda

1. Brief review, any outstanding action items missed (5 min.)
2. Updates on development efforts (10 min.)
3. Requirements Discussion (remainder, ~45 min.)

Requirements Discussion

Outline where we're headed in the next few meetings. We will work through the following:

1. Understand how Tree Reconciliation research is done today (what is not possible, shortcomings, annoyances) - the current workflow [*]

2. Refine the current workflow into the high-level workflow

  • identify what parts of the workflow are "pre-processing"
  • determine where new development is needed (which is different than improving existing tools/APIs)
    [this will provide an understanding of the scope of prototype development]

3. Begin analysis of the problem statements (which includes alternative workflows and user actions)

[*] what's not possible falls out of this effort and gives the insight of "where the gaps are". These become our problem statements, which turn into requirements for software development. These requirements will be scoped into projects, like the prototype development.

{Requirements Discussion will need offline feedback and further coverage potentially Feb. 1, 2010}

Notes

Attendees: Andrew Lenards, Cecile Ane, Nicole Hopkins, Adam Kubach, Sheldon McKay, Todd Vision, Jerry Lu, Bill Piel, Jim Leebens-Mack, Natalie Henriques.

Review

  • TR meetings are now biweekly. However, to keep momentum in the area of workflow analysis, Andy has asked members of the working group to meet Monday, Feb. 1 at the regular meeting time. Michael Gonzales, Sheldon McKay, and Natalie Henriques are not expected to attend (though Sheldon may call in).
  • Andy and Todd are planning a face-to-face meeting at NESCent in the near future. Sheldon McKay and John Bowers may also attend.
    • [Update: Andy will be at NESCent Tuesday, Feb. 9 through Friday, Feb. 12]
  • Prototype Development: Sheldon gave some updates and thoughts
    • TreeBeST can utilize a "really big" species tree and can use gene trees that are present on the species tree
    • The output of TreeBeST is a gene tree w/ annotations indicating the duplications. It does not perform the reconciliation, though the resulting gene tree and the input species tree can be run through PrimeTV to create a "Fat Tree" represented as a static image (JPG? PNG?). A static image prevents any interactive interface with the tree.
  • Action Item: Andy provided the Tree Reconciliation working group with the Call-in/WebEx information for the Core Software "Planning and Retrospective" meeting. This serves as a chance for the current development efforts to be demonstrated and feedback gathered.

Requirements Discussion - Workflow: Detailed Walkthrough

Working through the example workflow provided by Todd:

  • [User Action] Our user (Andy) copies the protein (fertility restorer, aka PPR/Rf) in FASTA format so it could be used as input
    • [Aside: If there was a gene catalog available in, say, the current Discovery Environment, we would now BLAST against it to look for homologs]
  • [User Action] Andy copies the protein sequence into the input box on the Phytome Search:Single Blast tab
    • Questions/Comments on the interface to the search
      • Are the defaults okay? The interface is familiar to anyone who uses BLAST (it's just the NCBI BLAST). Jim said that he would change the Expect value (aka E-value) from 10 to 0.01, but that change should not matter because the best hit will be isolated. Regarding E-values, it was suggested that
      • Do we need to be able to map the protein back to CDS (aka Coding Sequence)? [open question]
      • Should guide users through the process by de-emphasizing the BLAST interface and encouraging them to find the gene catalog record and use that as their starting point
      • Does the BLAST interface alleviate the gene naming differences between genome projects and data sources? If they found an identifier, how is it resolved? If you want a unique name, that's a problem. If you're okay with multiple names, use the GenBank Accession ID for search. Most researchers are willing to find the Accession ID. The system (aka Discovery Environment) needs to resolve the Accession ID against the gene catalog. Along w/ the Accession ID, a select few naming schemes should be supported. Examples: AT##### (Arabidopsis), EMBL, OS#### (rice).
    • Summary - two options for entry into the system have been identified
      • 1. User may not know the gene name, so maintain a BLAST interface. The user then chooses a gene from the results list they get
      • 2. User has Accession ID (or one of the select/limited supported identifiers)
        • PlantTribes project determined people don't have a good sense of gene family names, so BLAST hits and search on Accession ID were used. PlantTribes does use GO ID (unique identifier for the term) & GO Term.
      • 3? Another entry for search might be terms that appear in accessions, like our example record. So terms like "transcription factor" would get hits in PlantTribes. "PPR" & "Rf" get hits in searches on PlantTribes. The "Definition Line" in a GenBank accession record is also a way to find information with searches in PlantTribes.
        • [REVISIT] Don't get anything beyond the definition line from the GenBank record, so fertility restorer from our example record would appear - but there would be no mention of "PPR" nor "Rf".
  • [User Action] Andy performs the Single BLAST search with input.
  • [User Action] Andy reviews the search results and sees a summary of the inputs, the gene families with hits, and the best scores within each family (both the bit score & E-value).
    • Questions/Comments on the search results
      • What is the "best ensemble?" In the case of our example, the best ensemble is all hits within Gene Family #139.
      • So the Single BLAST takes the user from a gene and leads them to the gene family
        • 1. If the user wants to see the whole family, they use BLAST to ID it, look at the reconciliation, then look at the individual
        • 2. Or the user uses BLAST to ID the gene and follows through the reconciliation from the beginning. The presence of the gene in the catalog means you have the most closely related gene given a hit.
      • In PlantTribes, they give a list of hits and show which families have hits - all values above 3^-24 are considered to be in the same family. The E-value is more important, more widely interpreted, and more consistent, but it is normalized by the size of the catalog. Across multiple searches of different data sources, a user would pay more attention to the bit score or the E-value.
      • Table on the top of the results page is a value add. It provides context and the best hit for the family - which our user knows is Gene Family #139, the next best is Gene Family #64.
    • Discussion of what the user learned from the search results
      • Can conclude the gene is probably in Family #139, but the alternative to the hypothesis would be hits that were pretty close in other families (like Family #64).
      • PlantTribes gives more family clustering
      • What do we mean by clustering here? Summary: a Markov graph expansion is created. All genes in the gene catalog are BLASTed against each other; the genes become the nodes of the graph and the scores become the edges. Clusters are found where many edges are present and the edges have strong scores (that is, low E-values).
  • [User Action] ...From Family #139, where does Andy go from here?
    • Comments/Questions
      • Phytome provides alignment & gene family phylogeny - can download all sequences or select individual members of the family (subfamilies).
      • How does a user know which subfamily they want?
        • [REVISIT] The alias is listed w/ the Phytome IDs on the subfamily selection page. A user familiar w/ the organisms would recognize the IDs - but they have to have some knowledge to figure out that there is some correlation. The presence of the alias in the subfamily selection page makes this a richer interface. If a user doesn't know the correlation, they can't home in on the subfamily they are after.
        • [REVISIT] Consider having the user mark hits in the original search so that they have them flagged as they progress through each step of the analysis from the family table/page. So flag best hit and trace that through the steps. Note: first column in Phytome results is the internal identifier in Phytome, would like to provide more metadata to user.
      • Regarding graphical view when hits fall in cluster or tree
        • Subfamilies are clades in the tree. Green triangles highlight those hits (w/ number of hits to right).
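
To make the clustering summary above concrete (all-vs-all BLAST, genes as nodes, scores as edges), here is a minimal sketch of Markov clustering in plain Python. It is illustrative only and is not the PlantTribes or Phytome implementation. The edge weight (-log10 of the E-value), the self-loops, the inflation value of 2.0, the fixed iteration count, and the gene names g1..g5 with their E-values are all assumptions made for the example.

```python
import math

def build_matrix(genes, hits, floor=1e-200):
    """Symmetric weight matrix from all-vs-all hits given as (gene_a, gene_b, e_value).

    Assumption: edge weight = -log10(E-value), so a lower E-value is a stronger edge.
    """
    index = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    m = [[0.0] * n for _ in range(n)]
    for a, b, e_value in hits:
        weight = -math.log10(max(e_value, floor))  # clamp E-values reported as 0
        if a == b or weight <= 0:
            continue  # skip self-hits and hits with E-value >= 1
        i, j = index[a], index[b]
        m[i][j] = m[j][i] = max(m[i][j], weight)
    for i in range(n):
        m[i][i] = max(m[i]) or 1.0  # self-loop, so every column has a nonzero sum
    return m

def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total
    return m

def mcl(m, inflation=2.0, iterations=30):
    """Alternate expansion (matrix squaring) and inflation (element-wise power)."""
    n = len(m)
    m = normalize_columns([row[:] for row in m])
    for _ in range(iterations):
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        m = normalize_columns([[x ** inflation for x in row] for row in m])
    return m

def clusters(m, genes, eps=1e-6):
    """Read clusters from rows that keep weight on their own diagonal (attractors)."""
    n = len(m)
    found = set()
    for i in range(n):
        if m[i][i] > eps:
            found.add(frozenset(genes[j] for j in range(n) if m[i][j] > eps))
    return sorted(sorted(c) for c in found)

# Hypothetical catalog of five genes; no hits are reported between the two groups.
genes = ["g1", "g2", "g3", "g4", "g5"]
hits = [("g1", "g2", 1e-50), ("g1", "g3", 1e-50), ("g2", "g3", 1e-50),
        ("g4", "g5", 1e-30)]
print(clusters(mcl(build_matrix(genes, hits)), genes))
```

In this toy input the two groups share no edges, so they come out as two separate families. On a real catalog, weak cross-family hits would also be present, and the inflation parameter would control how readily they are cut.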

We left off at the representation of BLAST hits and how they help a user resolve what they are searching for.
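
As a rough illustration of the per-family summary table discussed in the walkthrough (families with hits, plus the best bit score and E-value within each family), the following Python sketch groups hits by family and ranks the families. The `Hit` record, the function name, and every score are hypothetical; only the family numbers #139 and #64 come from the example above. It also shows why moving the E-value cutoff from 10 to 0.01 should not change the best hit.

```python
from collections import defaultdict
from typing import NamedTuple

class Hit(NamedTuple):
    subject_id: str   # hypothetical identifier of the matched catalog gene
    family: int       # gene family the matched gene belongs to
    bit_score: float
    e_value: float

def summarize_by_family(hits, max_e_value=10.0):
    """Return one row per family, ordered by best bit score (highest first)."""
    grouped = defaultdict(list)
    for hit in hits:
        if hit.e_value <= max_e_value:  # apply the Expect-value cutoff
            grouped[hit.family].append(hit)
    rows = []
    for family, members in grouped.items():
        best = max(members, key=lambda h: h.bit_score)
        rows.append({
            "family": family,
            "n_hits": len(members),
            "best_hit": best.subject_id,
            "best_bit_score": best.bit_score,
            "best_e_value": min(h.e_value for h in members),
        })
    rows.sort(key=lambda row: row["best_bit_score"], reverse=True)
    return rows

# Made-up scores, for illustration only.
example_hits = [
    Hit("hit_a", 139, 310.0, 1e-85),
    Hit("hit_b", 139, 250.0, 1e-67),
    Hit("hit_c", 64, 120.0, 1e-28),
    Hit("hit_d", 64, 35.0, 0.5),
    Hit("hit_e", 7, 30.0, 4.0),
]

for cutoff in (10.0, 0.01):
    print("E-value cutoff:", cutoff)
    for row in summarize_by_family(example_hits, max_e_value=cutoff):
        print("  ", row)
```

The stricter cutoff only trims weak hits from the bottom of the table; the top-ranked family and its best hit are the same under both settings, which is the alternative-hypothesis view (best family versus next best) that the results table gives the user.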

Will continue from this point on Monday, February 1