
Goal

To gather input from group members on the DI WG's needs for the following workflow:

Here are some questions we might need to address:

  1. What exchange file formats could/should be used for each type of data?
  2. What is the minimum set of metadata (especially provenance data) that could/should be kept?
  3. Which ontologies, if any, could/should be used?

Interview Notes

Interview with Qi

Build our own data models (perhaps similar to NCBI Gene, RefSeq, HUGO Gene Nomenclature)

The models should:

  1. Capture all the required information
  2. Be easy to transform from/to popular exchange formats

Some basic data models should be implemented:

  1. Reference-genome-based gene annotation
    1. Gene family
    2. Gene
    3. Transcript
    4. etc
  2. Model Network (capturing the functional relationship)
    1. Metabolite data
    2. Expression data
    3. etc
  3. Population
    1. Diversity data (might or might not be based on a reference genome)
    2. Might consider using a pan-genome to represent the whole genome instead of a single reference genome
    3. etc
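
The reference-genome-based annotation model above could be sketched roughly as follows. This is only an illustration of the gene family / gene / transcript hierarchy and of the "easy to transform" requirement (here, to JSON); every class name, field name, and the choice of JSON are assumptions, not something decided in the interview.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class Transcript:
    transcript_id: str
    # Length of the encoded product; None for non-coding transcripts.
    protein_length: Optional[int] = None


@dataclass
class Gene:
    gene_id: str
    reference_genome: str  # hypothetical label for the assembly used
    transcripts: List[Transcript] = field(default_factory=list)
    # Cross-references to other databases, keyed by database name.
    xrefs: Dict[str, str] = field(default_factory=dict)


@dataclass
class GeneFamily:
    family_id: str
    genes: List[Gene] = field(default_factory=list)

    def gene_ids(self) -> List[str]:
        """Return the IDs of all member genes, in insertion order."""
        return [gene.gene_id for gene in self.genes]


def to_json(family: GeneFamily) -> str:
    """Serialize a family and everything nested under it to JSON."""
    return json.dumps(asdict(family), sort_keys=True)
```

Keeping the models as plain nested records is one way to make conversion to and from exchange formats mostly mechanical.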

Interview with Carolyn

  1. Maize Gene Analysis
    1. MaizeGDB can provide the gene name and associated GenBank ID for most, if not all, of the loci (and candidate genes) in the maize genome. This should be incorporated into the literature search step.
    2. Incorporate the upcoming maize gene models from the maize sequencing project.
    3. Sorghum and rice should also be used in the homology search, since functionally they are closer to maize.
  2. Interactive Analysis of Omics Data
    1. Tools should be able to handle expression data from next-generation sequencing as well as from Affy chips.
    2. Incorporate the curated pathway data from MaizeCyc (a collaboration between Gramene and MaizeGDB), SoyCyc, and PlantCyc.
    3. Carolyn will follow up on a proposed protein-protein interaction model across the maize genome.
  3. Plant Stress Experiment
    1. Use the Environment Ontology.
    2. Tools should be able to handle expression data from next-generation sequencing as well as from Affy chips.

Interview with Eva

  1. AGI code should be used for Arabidopsis genes.
  2. In TAIR, the splice form encoding the longest product is used as the default reference (called the representative gene model). Other splice forms might be incorporated depending on the data set and the analysis.
  3. Use the Plant Ontology.
  4. Study the proposed tools and their data formats to identify suitable exchange data formats.
  5. It might be useful to gather feedback from real users of the tools.
  6. Another tool for visualizing omics data: Pathway Tools Omics Viewers
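
As an illustration of items 1 and 2, the sketch below checks that an identifier looks like an AGI locus code and picks, for each gene, the transcript with the longest product. The regular expression reflects the usual AT<chromosome>G<five digits> shape and is an assumption, as are the function names, the input tuple layout, and the tie-breaking rule (lowest transcript ID wins); the lengths in the usage note are made up.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumed shape of an AGI locus code, e.g. AT1G01010 (chromosomes 1-5, C, M).
AGI_PATTERN = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)


def is_agi_code(identifier: str) -> bool:
    """Return True if the identifier matches the assumed AGI pattern."""
    return AGI_PATTERN.match(identifier) is not None


def representative_models(
    transcripts: Iterable[Tuple[str, str, int]]
) -> Dict[str, str]:
    """Map each gene ID to the transcript encoding the longest product.

    Each input row is (gene_id, transcript_id, product_length).
    Ties are broken by the lexicographically smallest transcript ID,
    which is an arbitrary choice made for this sketch.
    """
    best: Dict[str, Tuple[int, str]] = {}
    for gene_id, transcript_id, length in transcripts:
        candidate = (-length, transcript_id)  # smaller tuple is better
        if gene_id not in best or candidate < best[gene_id]:
            best[gene_id] = candidate
    return {gene_id: key[1] for gene_id, key in best.items()}
```

For example, given rows for `AT1G01010.1` (length 429) and `AT1G01010.2` (length 300), the function would select `AT1G01010.1` as the representative model for `AT1G01010`.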

Interview with Damian

Two kinds of approaches

  1. User-scenario-driven approach
    1. To address the problems in specific user scenarios, such as how to identify genes involved in the response to drought.
  2. User-scenario-independent approach
    1. To address some common problems, such as how to integrate disparate data housed in different places.

What can DI WG do?

  1. Help iPlant CI and iPG2P develop concrete user scenarios.
    Once we have these hard targets, we can identify the data integration needs and challenges more clearly. This is an exercise in specifically addressing the user scenarios, while being cognizant that they are exemplary, but by no means exhaustive, of the larger DI issues.
  2. In parallel, perform requirements analysis on those issues independent of user scenarios, such as the fundamental distributed nature of data and services and the lack of uniform interfaces. A short-term deliverable would be a report along a format dictated by CI.

Interview with Pankaj

  1. Include not only co-expressed genes (up/down) but also genes that share the same regulatory elements in the co-expression analysis step
  2. In every step after co-expression analysis, look up the genes in the known networks; if a gene is not present, add it to a network or build a new network
    1. Consider evidence such as whether the genes regulate the same phenotype, share the same regulatory elements, share the same expression profile, or contribute to the same network
      This is important because, based on expression, we might have a set of, say, X genes that are known to interact with Y other genes. Using the new set of X+Y, you would go back to the previous step (iterate), look at expression again, and come back with more conclusive evidence before proceeding to find additional homologs from maize (based on X+Y).
  3. The data exchange format should be compatible with popular tools such as Cytoscape http://www.cytoscape.org/
    This is important because every gene-gene network that is generated must be portable to Cytoscape for visualization of the network as well as the expression overlay.
  4. metadata
    1. cache metabolomics/pathway/protein-protein interaction data
    2. experimental conditions (germplasm, treatment/abiotic stress, geo-reference, etc.)
    3. functional annotation of genes

Interview with Julie Dickerson (Iowa State University)

  1. Use R (e.g. Bioconductor/explorase) for interactive statistical analysis
    1. new technology/methods (published/shared) in R packages
  2. Flexibility in tools
    1. allow different tools for different tasks (e.g. MapMan is good for known pathways, but not for new pathways)
    2. a flexible interface to allow data from different data sources
      1. Prepare for the difficulties of parsing metabolomics and pathway data (various data sources/data formats)
  3. Add a data quality (e.g. normalization) step to the workflow
  4. An open framework that allows plugins/modules/tools from users (e.g. MATLAB)
    1. Many tools are available in the biostatistics community; there is no need to reinvent the wheel
    2. A good example of open-source infrastructure software: Systems Biology Workbench http://sbw.sourceforge.net/
  5. Desirable Features
    1. saved workflows that can be shared among colleagues
    2. visually overlay hypotheses on existing data