DI_20091203

iPG2P Data Integration

December 3, 2009, 12pm EST

Attendees: Karla Gendler, Doreen Ware, Pankaj Jaiswal, Qi Sun, Carolyn Lawrence, Jerry Lu, Matt Pickard, Justin Borevitz, Chris Jordan, Lukas Mueller

Action Items:

  • All review NGS workflow and make more recommendations
    • Think about what else might be needed/associated with genomes
    • Think of what is required and where it can be extended to other attributes
  • Jordan will send around an updated list of resource providers
    • All review and add to as needed
  • Lu and Jordan will update questionnaire to be more specific and ask more targeted questions, especially with regard to ontologies
  • Lu will set up interviews with individuals to discuss the Maize Gene Analysis and Ruth Grene's Network workflows, but all should be familiar with them
  • Gendler will send out poll for next meeting time, Dec 17th somewhere between 1pm-5pm EST
  • Lu will review DI needs from phylogenetic group and see if we can overlay what they are working on with what G2P is doing

Notes/Agenda

  1. Review Action Items
    1. All: for each component of NGS workflow, start to identify reference resources (will need to be delegated in more detail)
      1. With NGS, reference resources are looking at genomes
        1. Workflows are hanging on reference genome and not really capturing meta-data
        2. In looking at the workflow presented, only information needed from reference was FASTA format and transcript annotation
      2. Need to decide at what level and how much of the annotation needs to be extracted
      3. The only information they were looking for is gene annotations (for NGS)
        1. What are the attributes?
        2. Are we going to be exchanging identifier information (protein, gene or exon; accession ID or positional information on reference genome)?
          1. Will need both gene and position information (need both read and position as there could be multiple reads in the same position)
      4. We need to first ensure that NGS wants everything in FASTA
      5. Can envision lots of other attributes, but what they have described needs only reference sequence and annotation
      6. Genome databases don’t necessarily use the same definitions; do we have a standard controlled vocabulary that describes transcripts in GFF? GFF3 is very tightly linked to the Sequence Ontology (SO) but doesn’t require that SO terms be used as descriptors
        1. Can be difficult to change to SO and as a group, we need to make recommendations
        2. For iPlant, will want to have standard ontologies
        3. Ware asked the following questions: Do we convert, or do we use their format and just provide that as the output? Do we encourage and try to enforce it? Are we just going to input and output what comes from the field?
          1. Jordan said won’t be able to just take what is given, with the workflows that are coming, will need to be able to pass on to other tools; will need to either translate it or describe it in a sufficient
          2. Need controlled vocab based on transcripts; will need some controlled vocab based on analysis; standard file formats that need to be defined for outputs of population data
          3. Will need samples of data of what inputs and outputs are from each step; what’s the minimum amount of information? What are current norms for the file formats
            1. Sequence: FASTA
            2. Annotations: GFF (or ASN.1; what are NCBI’s formats?)
      7. Borevitz asked about the outlook on using non-reference genomes
        1. Ware is personally interested in getting some type of assembly and thinks that it should be put forth in NGS group as high priority
    2. Gendler to send questionnaire to Sonya Lowry for her review
      1. Jordan stated that the questions Core had are related to the discussion held earlier in this meeting: the notion that we should select and utilize some of the standards that are out there; the problems will be in differing or incomplete implementations of them
      2. Pickard asked if there is a possibility that some ontologies will need to be created. Jordan would like to avoid it. Jaiswal commented that there are ontologies for plants (phenotypes, traits, anatomy) and that the Gene Ontology is mature enough to cover most of what is needed. He suggested that the group should survey the other working groups to see what is needed overall and what potential ontologies are needed. Jordan said that metabolomics will be needed by the viz group. Jaiswal suggested that they could use ChEBI, a very mature biochemical ontology that is funded out of EBI. However, there are problems in trying to classify newly identified metabolites.
      3. Jordan would like to see iPlant contribute to the development of some of these standards and hopefully won’t have to contribute any new ones
      4. In response, Jaiswal stated that most of the ontologies are mature enough to support what is needed. In Sol, they are using GO and SO, and have developed their own phenotype ontology (should be much more widely applicable, might consider registering on OBO), and also use PATO (a quality ontology: Phenotype And Trait Ontology; more of an attribute ontology). It is important to remember that it is possible to map to plant ontologies; the mapping is not very granular, but it at least gives a way to submit all data to the Plant Ontology. There might be issues between species-specific ontologies and general plant ontologies.
    3. Gessler: begin to identify semantic resources for the NGS workflow. What are the requirements for semantic integration from a genome sequence view? Identify semantic services that exist and what needs to be done to use them
      1. iPG2P Data Integration - Semantic Web Services Status report.pdf
    4. Ware: follow up with members of group to see if there are roadblocks with times of the meeting
  2. Reference Database Survey
    1. Results
    2. What changes are necessary before it is sent out, and to whom?
      1. What specific ontologies are being used, and are they registered in OBO?
      2. Review who else is going to get it, update survey, and then send out to all the reference resources
      3. Include standard ontologies
      4. More specific based on what standards are currently used (ask specific and not have them come up with a de novo list)
      5. Would like to get out before the end of this year
  3. Discuss DI needs for Maize Gene Analysis Workflow
  4. Identify Action Items
    1. All review NGS workflow and make more recommendations
      1. Think about what else might be needed/associated with genomes
      2. Think of what is required and where it can be extended to other attributes
    2. Jordan will send around an updated list of resource providers
      1. All review and add to as needed
    3. Lu and Jordan will update questionnaire to be more specific and ask more targeted questions, especially with regard to ontologies
    4. Lu will set up interviews with individuals to discuss the following two workflows, but all should be familiar with them
      1. Maize Gene Analysis
      2. Ruth Grene's Network
    5. Gendler will send out poll for next meeting time, Dec 17th somewhere between 1pm-5pm EST
    6. Lu will review DI needs from phylogenetic group and see if we can overlay what they are working on with what G2P is doing
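
Illustration for item 1 (reference resources for NGS): the discussion noted that the workflow needs only sequence plus gene/transcript annotation, with both identifier and positional information, and that providers do not all use the same feature vocabulary. The sketch below shows one way that minimum could be pulled from GFF3 and normalized. It is not an agreed iPlant design. The provider labels `transcript_model` and `coding_exon`, the `LOCAL_TO_SO` table, and the sample records are invented for illustration; only the nine-column GFF3 layout, the `ID`/`Parent` attributes, and the SO term names `gene`, `mRNA`, and `exon` come from the GFF3 and Sequence Ontology conventions.

```python
import io

# Hypothetical translation table: provider-specific feature labels (invented)
# mapped to Sequence Ontology term names (real SO terms).
LOCAL_TO_SO = {
    "transcript_model": "mRNA",
    "coding_exon": "exon",
}


def parse_attributes(field):
    """Split a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def read_gff3(handle):
    """Yield the minimum per-feature information discussed in the notes:
    identifier, feature type, and position on the reference sequence."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line (for example, embedded FASTA)
        seqid, _source, ftype, start, end, _score, strand, _phase, attr_field = cols
        attrs = parse_attributes(attr_field)
        yield {
            "seqid": seqid,
            # Translate to an SO term when we know one; otherwise pass through.
            "type": LOCAL_TO_SO.get(ftype, ftype),
            "start": int(start),
            "end": int(end),
            "strand": strand,
            "id": attrs.get("ID"),
            "parent": attrs.get("Parent"),
        }


# Invented sample records, for illustration only.
SAMPLE = "\n".join([
    "##gff-version 3",
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tdemo\ttranscript_model\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tdemo\tcoding_exon\t100\t300\t.\t+\t0\tID=ex1;Parent=tx1",
])

features = list(read_gff3(io.StringIO(SAMPLE)))
for feature in features:
    print(feature["id"], feature["type"], feature["seqid"],
          feature["start"], feature["end"], feature["parent"])
```

Unknown labels are passed through unchanged rather than rejected, which corresponds to the "use their format" option Ware raised; enforcing SO terms would mean rejecting or flagging them instead.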