DI_20091112

iPG2P Data Integration

November 12, 2009, 3pm EST

Attendees: Doreen Ware, Chris Jordan, Jerry Lu, Doina Caragea, Damian Gessler, Steve Goff, Karla Gendler

Action Items:

  • All: for each component of workflow, have DI start to identify reference resources (will need to be delegated in more detail)
  • Gendler to send questionnaire to Sonya Lowry for her review
  • Gessler: begin to identify semantic resources for the NGS workflow.  What are the requirements for semantic integration from a genome sequence view?  Identify semantic services that exist and what needs to be done to use them
  • Ware: follow up with members of group to see if there are roadblocks with times of the meeting

Notes/Agenda

  1. Status of Sequence Data Provider Questionnaire - Jerry/Chris
    1. In response to an email from Chris Jordan, members of the Steering Committee provided some additional resources.  He will post these resources.
    2. Those brainstormed, with contact information:
      1. MaizeGDB: Carolyn J. Lawrence
      2. Solanaceae: Lukas A. Mueller
      3. Arabidopsis: Eva Huala
      4. Gramene: Doreen Ware
      5. Soybase: Rex Nelson
      6. NCBI: Tatiana Tatusova, Brian Smith White
      7. ENSEMBL: Paul Kersey/Doreen Ware
      8. JGI: [Doreen/Steve G will provide names later]
      9. Oak Ridge and Pacific Bio Northwest National Labs: Goff stated they are major data providers that want to connect to iPlant (biofuels).
    3. Questionnaire for the models and genome sequence providers
    4. Survey
    5. Focus is on genome providers, but we are beginning to look at metabolomic data, which is based essentially on workflows that are being developed by other working groups.  With the genome questionnaire, we want to establish a model of how to get information about what is out there, how to access it, the way the data is expressed, etc.  If this model works, we can go through the same process to get the other types of information necessary.
  2. Summary of VisAnalytics workshop and implications for DI - Karla/Chris
    1. Potential for workflows to drive much of what we do
      1. 00_NGS_Framework_v03.pdf
      2. For each component of the workflows presented, have DI start to identify reference resources. Once these resources are identified, take it to Plant Genome and see if we are missing anything (also Mike Sanderson).
    2. Discuss addition of metabolomics data to current DI efforts, preliminary data specification from Tom and Ruth
      1. Tom: NGS_Documents 
      2. Workflow: NextGen to DataViz Pipe.ppt
      3. Ruth Grene is working with Chris to get him specific references.
    3. Will need to look at pathways, ortholog data sets, networks
      1. Mike Sanderson is very interested in finding orthologs and speeding that process up. Look at what the botanical gardens and NYU have.
  3. Developing effort to define metadata/provenance/data quality standards for iPlant - Chris
    1. Need others in the group who can volunteer to help with this
    2. At a minimum, need input from group members on their fields of expertise - can be wiki, e-mail, etc.
    3. Ware will begin to delegate action items pertaining to this topic on Monday and go over with Chris
  4. Other working groups progress
    1. StatInf
      1. Goff will talk to Rebecca Doerge about Arabidopsis eQTL data sets
      2. Metabolomics data sets will come from David salk, Rob Last, Dan K and people overseas
      3. For algorithms, what is the minimum amount of data necessary (Doina can address)
        1. will need labeled data, as much as possible, especially for machine learning
  5. Priorities
    1. next look at ortholog set, identify reference resources and meta-data
    2. expression, pathways, metabolite profiles