Call-in/WebEx Information

TR_WebEx - this meeting will include visuals, so WebEx Video will be utilized

Action items

AI: Former user (Deleted) will present rough UI for tracking results/hits [done]
AI: Former user (Deleted) to mock up an expert tree prune/selection interface w/ auto-complete to find taxa. A common taxa box should be incorporated. APWEB has an example of how to show users where their species/group of interest is placed on the tree. [incomplete]

Agenda

Brief review (10 min.)
Discussion of UI Mock-flows
- Pick-up w/ discussion of individual gene hyperlinks in Results tab/view
Pre-computation Workflow review

Open Questions

Regarding the creation of an iPlant Gene Catalog, what should be included beyond 1KP or 1KP Pilot Data?
- [answer] (from Todd Vision)
  1. NCBI. All NCBI RefSeq mRNAs for Viridiplantae include 273,424 sequences for 11 taxa. There are 960,134 mRNAs if one does not confine it to RefSeq, but even this excludes all unigenes. A merge of the unigene set and the mRNA set, with some processing to remove redundancies, would be close to ideal. Unigenes are not available for all the taxa with substantial EST collections (I notice Mimulus is absent, for instance), but NCBI might be persuaded to include them.
  2. PlantGDB, which already calculates the cDNA-EST merge automatically and has more comprehensive taxonomic coverage, so this option might be preferred.
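
A minimal sketch of the "merge with some processing to remove redundancies" idea from the answer above. It assumes the sequences are already loaded as ID-to-sequence dictionaries, treats only exact duplicates and exact substrings of a longer sequence as redundant, and ignores reverse complements. The function name and the IDs are hypothetical; a real merge would more likely rely on a dedicated clustering tool.

```python
def merge_nonredundant(*sequence_sets):
    """Merge {id: sequence} dicts, dropping sequences contained in a longer one."""
    records = []
    for sequences in sequence_sets:
        for seq_id, seq in sequences.items():
            records.append((seq_id, seq.upper()))

    # Longest first, so shorter redundant sequences are compared against kept ones.
    records.sort(key=lambda rec: (-len(rec[1]), rec[0]))

    kept = {}
    for seq_id, seq in records:
        # Exact duplicates are also substrings, so one test covers both cases.
        if any(seq in longer for longer in kept.values()):
            continue
        kept[seq_id] = seq
    return kept


# Hypothetical toy inputs standing in for an mRNA set and a unigene set.
mrna_set = {"mrna1": "ATGGCCAAATTTGGG", "mrna2": "ATGCCCTTT"}
unigene_set = {"ug1": "GCCAAATTT", "ug2": "atgccCTTT", "ug3": "TTTTGGGGCCCC"}
merged = merge_nonredundant(mrna_set, unigene_set)
```

The pairwise substring test is quadratic in the number of sequences, so this is only suitable for illustrating the rule, not for sets of the sizes quoted above.
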
It was suggested that if an entire subfamily is not selected (i.e., particular members are checked in the checkbox interface, but not all), we might want to represent this visually.
- How do we represent subfamilies or families not checked by the user?

Notes

Attendees: Former user (Deleted), weix, jleebensmack, ane (Unlicensed), tjv, Zhenyuan Lu, Former user (Deleted), NicoleH

Meeting Overview

Discussion of what data will be needed to populate a gene catalog for use by prototype and beyond
Discussion of user interface functionality facilitated by simple mockups
Discussion of aspects of necessary 'Pre-computation Workflow' for creation of gene trees and reconciliations
- this workflow supported Use Case #1 for Phase 1 of the Prototype

Detailed Discussion Notes

Brief review and discussion of postdoc and meeting w/ collaborator Jens by Cecile and Todd.
Rejoin discussion about the results user interface reference
- GenBank Accession will likely not work because some data used in the gene catalog will not be present there.
  - Want to avoid using internal iPlant-specific identifiers here
  - What would a sensible "fall back" or pre-population value be? [open question]
    - discuss domain/business rules for how to populate this later...
  - Defer this discussion until it is decided what will be in the gene catalog
What will be used in the iPlant Gene Catalog beyond 1KP Transcriptome and/or 1KP Pilot Data?
- Two options:
  - NCBI RefSeq & mRNA/cDNA, but only 11 taxa are represented (Mimulus is absent, for example). Merge data from UniGene with mRNA set.
  - PlantGDB - calculates cDNA/EST merge, provides their own unique IDs, updated every 4 months. Path of less hassle & higher quality data. From PlantGDB, you get unigene builds, but what are they doing w/ Next-Gen data? We may want to touch base w/ Volker Brendel - his group runs PlantGDB at Iowa State University.
  - ActionItem: Andrew needs to speak with Steve Goff to see if next-gen assembly is on the radar and what actions iPlant can take to aid these efforts.
- This led into a discussion of the Short Read Archive and NextGen data
  - Raw data goes into Short Read archive - folks have own databases of assemblies, not aware if those are being submitted to GenBank or only exist in SR Archive.
  - iPlant may be included in assembly of Next-Gen datasets (re iPG2P).
  - How many species are there that have RNASeq data?
  - Summary: group focused discussion on taking next-gen data and assembling using an NGS pipeline
    - Submission of data w/ standard format key
    - NCBI's Short Read Archive is difficult to use, extremely cumbersome, and described as disorganized.
Discussion of mockup for Results Tracking reference
- Todd was pleased with user interface, overall group liked the suggested approach
The Action Item for Andy to produce a mockup for "expert prune/selection" interface
- A common taxa box should be included.
- APWEB was mentioned as an example of how to show users where group/species is on the tree, but Andrew was not able to find the example. Jim clarified the example here
- Filter results:
  - only interested in genes from a single taxon (or group of taxa); use the 'star interface' such that a gene is starred
  - only show genes of interest
  - only show genes from one or more of the common taxa
- Re 'star interface' - a user could star all rice genes and later (at the same stage) indicate interest in only specific genes in a limited number of rice species.
- Re 'common taxa' box
  - should first start out being populated with model organisms (the most complete) and have an option (say an 'Add...' hyperlink) to select more organisms (the less complete, but of interest to user).
  - highlight & restrict options would be nice in 'common taxa' box
  - organize list by taxonomic groupings - collapsible list? Include model organisms above the additions to the common taxa box so users can quickly find them.
  - The 'Add...' more selections should be adding to the collapsible list?
  - In some ways, this functions like a faceted search
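
The filter options above behave like facets that combine with AND. A minimal sketch, assuming each result row carries a taxon name and a star flag; the class and function names are hypothetical and not taken from any iPlant code.

```python
from dataclasses import dataclass


@dataclass
class GeneHit:
    gene_id: str
    taxon: str
    starred: bool = False


def star_taxon(hits, taxon):
    """Star every hit from one taxon (e.g. 'star all rice genes')."""
    for hit in hits:
        if hit.taxon == taxon:
            hit.starred = True


def filter_hits(hits, taxa=None, starred_only=False):
    """Keep hits that satisfy every active facet; an inactive facet matches all."""
    selected = []
    for hit in hits:
        if taxa is not None and hit.taxon not in taxa:
            continue
        if starred_only and not hit.starred:
            continue
        selected.append(hit)
    return selected


# Hypothetical results list.
hits = [
    GeneHit("g1", "Oryza sativa"),
    GeneHit("g2", "Arabidopsis thaliana"),
    GeneHit("g3", "Oryza sativa"),
    GeneHit("g4", "Zea mays"),
]
star_taxon(hits, "Oryza sativa")        # star all rice genes
hits[2].starred = False                 # later narrow to specific genes
genes_of_interest = filter_hits(hits, starred_only=True)
common_taxa_view = filter_hits(hits, taxa={"Arabidopsis thaliana", "Zea mays"})
```

Keeping the star flag on the hit itself lets a user star broadly first and un-star individual genes later without re-running the search.
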
Discussion of Pre-computation Workflow (for Phase 1 Prototype development)
- TreeBeST needs Amino Acid guided alignments as part of the input. Are there scripts or tooling for this? Can we use ESTs to produce Amino Acid guided alignments?
  - is an AA-guided alignment going to be provided? Or will a data source for test datasets be available?
  - Scripts are out there to force this; need to figure out how to best estimate EST from unigenes?
    - BioPerl has an AA-to-DNA align that can take AA-align & output cDNA
    - With unigene data that has untranslated regions, what is the best tool to translate?
      - Find the largest open reading frame
      - Users may want to know what the translation (the prediction) is - this intermediate form may be of interest and, therefore, may need to be stored.
  - ESTWise - search protein sequences, pull out the hits, and feed 3 of them together w/ the EST or mRNA sequence; it provides a prediction of the translation. Suggested by Todd, who has used it with very satisfactory results. Tool is part of EBI toolkit.
- How should we be doing the clustering?
  - 1KP Analysis - TribeMCL using any large genome sequences or full length cDNA sequences - cluster all using TribeMCL & use it as a framework for sorting these using BLAST (then augment w/ Hidden Markov Model). Todd & Jim discussed whether TribeMCL was deprecated. It appeared that it either has been deprecated or has memory issues with very large datasets.
  - OrthoMCL - to achieve higher granularity - identify smaller gene families (or clusters).
  - all-by-all BLAST and cluster based on e-values - do w/ all large full length coding sequences. This becomes a scaffold - might want to use some of iPlant's computational resources (would need to be able to specify clearly to TACC what is desired/needed).
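
Two of the steps discussed above, sketched in Python under stated assumptions: translating the largest open reading frame of a unigene, and grouping sequences from all-by-all BLAST hits by e-value. Assumptions: an ORF runs from the first ATG after a stop (or the sequence start) to the next stop codon or the end of the sequence, since unigenes can be partial; all six frames are searched with the standard genetic code; BLAST hits arrive as (query, subject, e-value) tuples; the 1e-5 cutoff is illustrative only; and simple single-linkage grouping stands in for the Markov clustering that TribeMCL and OrthoMCL use. All function names are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS))


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orf(seq):
    """Protein translation of the longest ORF over six frames ('' if none)."""
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            # Translate whole codons only; unknown codons (e.g. with N) become X.
            protein = "".join(
                CODON_TABLE.get(strand[i:i + 3], "X")
                for i in range(frame, len(strand) - 2, 3))
            for segment in protein.split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best):
                    best = segment[start:]
    return best


def cluster_by_evalue(hits, max_evalue=1e-5):
    """Single-linkage clusters: link two sequences if any hit passes the cutoff."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        root_q, root_s = find(query), find(subject)
        if evalue <= max_evalue:
            parent[root_q] = root_s

    clusters = {}
    for seq_id in parent:
        clusters.setdefault(find(seq_id), []).append(seq_id)
    return sorted(sorted(members) for members in clusters.values())
```

For example, `cluster_by_evalue([("a", "b", 1e-30), ("b", "c", 1e-10), ("c", "d", 0.5)])` puts a, b and c in one cluster and leaves d on its own. Single linkage tends to chain unrelated families together through multi-domain proteins, which is one reason the MCL-based tools named above are usually preferred for real datasets.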

Additional Information

Here is a description provided by Todd of the ESTWise pipeline from Phytome v2 for translating the DNA sequences.

Unipeptides

Unipeptide sequences have been derived from translated Unigenes. Phytome currently uses homology-based gene prediction to derive Unipeptides for species other than Arabidopsis. The procedure was as follows. First, Swissprot (release 47.6)/TrEMBL (release 30.6) was searched with each Unigene sequence using BLASTX. The top three matches with E-values less than 1e-5 were used as templates for translation by ESTWise.

In most cases (Arabidopsis and rice excepted), Unipeptides are then inferred from the Unigene sequences. A multi-stage homology search is done against several protein sequence databases using BLAST. First, Uniprot/Swissprot plus TrEMBL plant proteins are searched. If a nearly perfect match is found to a protein from the same species, this protein (or a consensus of all such proteins) is used as the Unipeptide for that Unigene. Failing that, the top three homologs are input to a homology-guided translation step using ESTWise (Birney et al. 2004), drawn from the following three datasets in descending order of priority: (i) all Uniprot/Swissprot plus TrEMBL plant records, (ii) non-plant records in Uniprot/Swissprot, or (iii) non-plant records in Uniprot/TrEMBL. Some Unigenes do not produce a corresponding Unipeptide in Phytome for any number of (not mutually exclusive) reasons: they may lack a coding sequence (consisting entirely of 5' or 3' untranslated region, or of an RNA gene), possess a coding sequence that is too short, have no detectable homologs, or fail the homology-based translation.
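
The template-selection logic described above can be sketched as follows. This is only an illustration of the decision order, not Phytome's code: the class, the function, the dataset numbering and the 98% identity threshold for a "nearly perfect" match are all assumptions (the description gives no number for that), and the sketch stops at choosing templates rather than invoking ESTWise itself.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    protein_id: str
    evalue: float
    percent_identity: float
    species: str
    # Assumed priority codes: 1 = plant Swissprot/TrEMBL,
    # 2 = non-plant Swissprot, 3 = non-plant TrEMBL.
    dataset: int


def choose_templates(hits, unigene_species, max_evalue=1e-5, near_perfect=98.0):
    """Return (strategy, protein_ids) for one Unigene's BLAST hits."""
    # Step 1: a near-perfect same-species plant protein is used directly.
    direct = [h.protein_id for h in hits
              if h.dataset == 1
              and h.species == unigene_species
              and h.percent_identity >= near_perfect]
    if direct:
        return "direct", direct

    # Step 2: otherwise take the top three significant hits from the
    # highest-priority dataset that has any, as ESTWise templates.
    for dataset in (1, 2, 3):
        significant = sorted(
            (h for h in hits if h.dataset == dataset and h.evalue < max_evalue),
            key=lambda h: h.evalue)
        if significant:
            return "estwise", [h.protein_id for h in significant[:3]]

    # Step 3: no usable homologs, so no Unipeptide is produced.
    return "none", []
```

Returning the strategy alongside the IDs makes it straightforward to count how many Unigenes drop out at the BLAST stage versus at the translation stage.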

In Phytome version 2, 1,070,0355 Unigenes were used for ESTwise translations, derived from 5,017,744 ESTs. This resulted in 793,706 Unipeptides. Attrition was either due to BLAST failure (no or too few hits) or to ESTwise failure.

Reference

Birney E, Clamp M, Durbin R (2004) GeneWise and Genomewise. Genome Res. 14:988-995.

iPToL

TR_15MAR10