Tools should be able to handle expression data from next-generation sequencing in addition to Affymetrix chips.
Interview with Eva
AGI codes should be used to identify Arabidopsis genes.
In TAIR, the splice form encoding the longest product is used as the default reference (called the representative gene model). Other splice forms might be incorporated depending on the data set and the analysis.
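The representative gene model rule above can be illustrated with a minimal sketch. It assumes gene model IDs follow the AGI pattern of a locus code plus a numeric suffix (for example `AT1G01010.2`) and that product lengths are already available in a dictionary; the function names, the tie-breaking rule, and the lengths in the example are hypothetical, not TAIR's actual procedure or data.

```python
from collections import defaultdict


def locus_of(model_id):
    """Return the locus part of a gene model ID, e.g. 'AT1G01010.2' -> 'AT1G01010'."""
    return model_id.split(".")[0]


def representative_models(product_lengths):
    """Pick, for each locus, the gene model with the longest product.

    product_lengths: dict mapping gene model ID -> product length.
    Ties go to the lexicographically smaller model ID; this tie-break is
    an assumption made only to keep the result deterministic.
    """
    by_locus = defaultdict(list)
    for model_id, length in product_lengths.items():
        by_locus[locus_of(model_id)].append((model_id, length))
    return {
        locus: min(models, key=lambda m: (-m[1], m[0]))[0]
        for locus, models in by_locus.items()
    }


# Made-up lengths, for illustration only.
example_lengths = {"AT1G01010.1": 300, "AT1G01010.2": 450, "AT1G01020.1": 200}
print(representative_models(example_lengths))
```

An analysis that needs other splice forms could skip this step and keep the full gene model IDs instead.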
To address the problems in specific user scenarios, such as: how do we identify genes involved in the response to drought?
User scenario-independent approach
To address some common problems, such as: how do we integrate disparate data housed in different places?
What can DI WG do?
Help iPlant CI and iPG2P develop concrete user scenarios.
Once we have these hard targets, we can identify the data integration needs and challenges more clearly. This is an exercise in specifically addressing the user scenarios, while being cognizant that they are exemplary, but by no means exhaustive, of the larger DI issues.
In parallel, perform requirements analysis on those issues that are independent of user scenarios, such as the fundamentally distributed nature of data and services and the lack of uniform interfaces. A short-term deliverable would be a report in a format dictated by CI.
Interview with Pankaj
In the co-expression analysis step, include not only co-expressed genes (up/down) but also genes that share the same regulatory elements
In every step after co-expression analysis, query the genes against the known networks; if a gene is not found, add it to an existing network or build a new network
Consider evidence including whether the genes regulate the same phenotype, share the same regulatory elements, share the same expression profile, or contribute to the same network
This is important because expression alone will give us a set of, say, X genes, but these genes are known to interact with Y other genes. With the new set of X+Y genes, you would go back to the previous step (iterate), look at expression again, and come back with more conclusive evidence before proceeding to find additional homologs from maize (based on X+Y).
The data exchange format should be compatible with popular tools such as Cytoscape (http://www.cytoscape.org/)
This is important because every gene-gene network that is generated must be portable to Cytoscape, both for visualization of the network and for the expression overlay.
cache metabolomics/pathway/protein-protein interaction data
experimental conditions (germplasm, treatment/abiotic stress, geo-reference, etc.)
functional annotation of genes
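Two of the points from this interview, expanding an expression-derived gene set X with its known interaction partners Y and keeping every network portable to Cytoscape, can be sketched together. This is only an illustration under stated assumptions: the known network is given as a list of gene pairs, the export uses Cytoscape's simple interaction format (SIF, one `source<TAB>type<TAB>target` line per edge), and the function names, gene names, and the `interacts_with` label are hypothetical. The expression overlay is not covered here.

```python
def expand_with_known_partners(seed_genes, known_interactions):
    """One iteration of the X -> X+Y step.

    seed_genes: iterable of gene IDs found by co-expression (the X set).
    known_interactions: iterable of (gene_a, gene_b) pairs from known networks.
    Returns the seeds plus every gene that directly interacts with a seed.
    """
    seeds = set(seed_genes)
    expanded = set(seeds)
    for a, b in known_interactions:
        if a in seeds:
            expanded.add(b)
        if b in seeds:
            expanded.add(a)
    return expanded


def to_sif(edges, interaction_type="interacts_with"):
    """Serialize edges as SIF text: source<TAB>interaction type<TAB>target."""
    return "\n".join(f"{a}\t{interaction_type}\t{b}" for a, b in sorted(edges))


# Hypothetical gene names, for illustration only.
seeds = {"geneA", "geneB"}
known = [("geneA", "geneC"), ("geneD", "geneE")]
expanded = expand_with_known_partners(seeds, known)

# Keep only the edges whose endpoints are both in the expanded set.
edges = [(a, b) for a, b in known if a in expanded and b in expanded]
print(to_sif(edges))
```

In an iterative workflow, the expanded set would be fed back into the expression analysis, and the loop would stop once no new genes are added.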
Interview with Julie Dickerson (Iowa State University)
Using R (e.g., Bioconductor/explorase) for interactive statistical analysis
new technology/methods (published/shared) in R packages
Flexibility in tools
allow different tools for different tasks (e.g., MapMan is good for known pathways, but not for new pathways)
flexible interface to allow data from different data sources
Preparing for the difficulties in parsing metabolomics and pathway data (various data sources/data formats)
Adding a data quality step (e.g., normalization) to the workflow
An open framework that allows plugins/modules/tools from users (e.g., MATLAB)
Many tools are already available in the biostatistics community; no need to reinvent the wheel
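As an illustration of the data quality step mentioned in this interview, the sketch below applies a log2 transform followed by per-sample median centering. The notes do not name a normalization method, so this choice, the function name, the pseudocount, and the example values are all assumptions; in practice an established package from the biostatistics community would likely be used instead.

```python
import math
from statistics import median


def log2_median_center(samples, pseudocount=1.0):
    """Log2-transform expression values and center each sample on its median.

    samples: dict mapping sample name -> dict of gene ID -> raw expression value.
    pseudocount: added before the log so that zero values stay defined
    (an assumption of this sketch).
    """
    normalized = {}
    for sample, values in samples.items():
        logged = {gene: math.log2(v + pseudocount) for gene, v in values.items()}
        center = median(logged.values())
        normalized[sample] = {gene: x - center for gene, x in logged.items()}
    return normalized


# Made-up values, for illustration only.
raw = {"sample1": {"g1": 1.0, "g2": 3.0, "g3": 7.0}}
print(log2_median_center(raw))
```

After this step every sample has a median of zero on the log2 scale, which makes samples from different sources easier to compare.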