

A solution to the genotype-to-phenotype (G2P) problem can be described as an analytic process that allows an investigator to begin with a trait of interest in a species possessing limited genetic resources and progress towards the ability to predict trait scores for known genotypes in given, non-constant environments. To this end, the iPG2P Steering Committee is identifying common, abstract workflows that will support this process and has defined high-level use cases that the working groups should strive to implement. The two prioritized use cases are 1) Carbon metabolism (including C3 and C4) and flowering time and 2) Hypothesis generation through data mining, processing, and visualization. (need to attach summary document)

Working Group Formation

In late July 2009, 16 scientific research community members with interests in phenology, drought stress, photosynthesis, bioinformatics, and machine learning and 8 iPlant faculty/staff members met in Chicago at the “iPG2P Project Kickoff” meeting to establish the specific focal areas for this Grand Challenge and to develop a high-level implementation plan. Prior to this meeting, a survey was conducted among planning group members and meeting participants to gain insight into what cyberinfrastructure areas were of highest interest to the participants and what use cases might be established. At the meeting, a series of plenary sessions with the entire group, smaller breakout sessions, and extensive group discussions were used to define five working groups that would cover the intellectual and technical range of the "genotype-to-phenotype" problem. Leads/co-leads for each working group were recommended by consensus and working group membership was drafted from appropriate meeting participants and non-participating community members.

Working groups are composed of a lead and co-lead, who are senior investigators in fields pertinent to the group's focus, one or more iPlant Engagement Team Analysts, and five to ten invited members of the plant science, computer science, and biological informatics communities. They are advised and guided by the G2P Scientific Lead, and are supported by the G2P Project Manager and an Administrative Assistant. Groups meet biweekly via teleconference, have access to a shared collaborative web space, and use group-specific mailing lists.

Working Group Progress

  • NextGen Sequencing Pipeline: The goal of the NextGen Sequencing Pipeline working group is to develop tools to permit efficient use of next generation sequencing (NGS) data by members of the plant science research community involved in genotype-to-phenotype research. Requirements analysis for this working group is the most advanced. A draft statement has been developed describing the first iteration of an NGS discovery environment, which includes a core web application framework with data upload capability, user authentication and collaboration tools, pre-processing and quality control tools, support for a variant detection workflow, and support for a transcript quantitation workflow. Also included is support for command-line access to RESTful services for advanced users. Software development on this iteration is scheduled to start in early Q1 2010. The working group is currently assembling a list of prioritized development activities for future releases. The group is also collaborating with the Visual Analytics Working Group to understand the types and forms of data that should emerge from NGS workflows to best facilitate visualization. In conjunction with the Data Integration Working Group, the NGS group is defining what data needs to flow through the pipelines. In addition, several matters of standards and practices are being addressed via formation of sub-working groups: base-calling in polyploid genomes, standard formats for representing variants, standard formats for representing sequencing-based transcriptional data, and logical representation of genomic structural variants.
  • Statistical Inference: The Statistical Inference group is working to develop a Discovery Environment that makes advanced computational approaches for statistically linking genotype to phenotype more available to the general user and faster for the specialist user. A first iteration of this system will be described by the end of Q1 2010. The group has identified and prioritized general classes of statistical genetics methods that should be supported by this platform. These include General Linear Models (GLM), Mixed Models, Machine Learning, and Bayesian approaches. General Linear Models, being most pertinent to the widest cross-section of plant biologists, are being addressed first. A test implementation of parallel GLM is being developed by iPlant scientific programmers, which should enable larger, more intensive genetic mapping analyses to be conducted. A prototype of this tool is expected to be complete in Q1 of 2010.
    In addition, a comprehensive description of GLM-based QTL analysis is being developed for two research computing teams affiliated with iPlant who will develop implementations of this algorithm for GPU and FPGA architectures, with the goal of dramatically decreasing execution time for GLM analyses. The group has also initiated discussions with the Visual Analytics group on how to view and explore the large (2.5E+6 points) multidimensional data sets that are expected to emerge from genetic association studies with the advent of relatively inexpensive whole-genome resequencing, as well as how to make the results of such analyses more accessible to the general research community. Finally, the group is working to develop universal standards for defining and describing genotype/phenotype mapping experiments.
  • Modeling Tools: The Modeling Tools working group seeks to develop framework tools to support construction, parameter and confidence estimation, sensitivity analysis, verification testing, and utilization of models. To date, an exemplar modeling workflow has been described and is currently under review by the working group. This workflow includes integration with the activities and products of the NextGen, Statistical Inference, and Visual Analytics working groups. In addition, the group is evaluating model description languages such as SBML and modeling repositories and platforms such as BioModels.net and OpenMI for potential synergy with iPlant-led efforts.
  • Visual Analytics: The goal of the Visual Analytics working group is a Discovery Environment capable of displaying diverse types of data from laboratory, field, in silico analyses and simulations, and other sources specific to genotype-to-phenotype research in ways that reveal underlying patterns, lead to novel hypotheses, provide concise syntheses, and support publication, collaborations, and education. In early November 2009, members of the Visual Analytics working group met at TACC so that the plant biologists and computer scientists in the group could develop a mutual understanding. During this meeting, the working group identified major issues in G2P analysis and emerged with test cases for a canvas/widget approach to analysis and visualization. Currently, a series of demonstration applications is being developed to showcase this approach. Workflows have been generated to describe visualization needs for a "Maize Gene Analysis" and an "Interactive Gene Expression and Metabolomics Analysis".
  • Data Integration: The Data Integration group seeks to build Discovery Environment software atop existing middleware systems that use metadata to achieve situational awareness of available data, logical relationships between different data sets, and tools that enable users to find relevant information even when they are not sure what data may exist. The group is currently analyzing workflows from the various G2P working groups as they are developed. One initial focus has been on identifying data integration needs in the NextGen Sequencing variant detection and transcript abundance workflows. To this end, a survey was created and distributed to reference sequence data providers asking for details on the types and formats of data they offer, as well as inquiring about communications standards and methods for integration service applications. A similar questionnaire will be sent out to reference sources identified in the Visual Analytics (maize gene analysis and stress biology analysis) and Statistical Inference (GWAS/QTL mapping) workflows. The group has also begun to define metadata/provenance/data quality standards for iPlant, in collaboration with Sudha Ram and her graduate students. Finally, the working group is exploring approaches for integrating expression data, molecular pathways, metabolite profiles, and biological networks.
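
As a rough illustration of the kind of GLM-based genetic mapping mentioned under Statistical Inference, the sketch below fits a simple linear model of trait score on allele count at each marker in turn. It is not the iPlant implementation: the function name single_marker_scan, the 0/1/2 genotype coding, and the toy data are all assumptions made for this example. Because each marker is fitted independently, the loop is the sort of computation that lends itself to parallel execution.

```python
from statistics import mean


def single_marker_scan(genotypes, phenotype):
    """Fit phenotype = a + b * genotype separately for each marker.

    genotypes: dict mapping marker name -> list of allele counts (0, 1, 2),
               one entry per individual (hypothetical coding for this sketch)
    phenotype: list of trait scores, one per individual
    Returns a dict mapping marker name -> (slope, r_squared).
    """
    y_bar = mean(phenotype)
    ss_total = sum((y - y_bar) ** 2 for y in phenotype)
    results = {}
    for marker, x in genotypes.items():
        x_bar = mean(x)
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        # A marker (or trait) with no variation carries no information.
        if sxx == 0 or ss_total == 0:
            results[marker] = (0.0, 0.0)
            continue
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, phenotype))
        slope = sxy / sxx                           # least-squares effect estimate
        r_squared = (sxy * sxy) / (sxx * ss_total)  # fraction of variance explained
        results[marker] = (slope, r_squared)
    return results


# Hypothetical toy data: six individuals, three markers.
toy_genotypes = {
    "m1": [0, 1, 2, 0, 1, 2],
    "m2": [0, 0, 1, 1, 2, 2],
    "m3": [1, 1, 1, 1, 1, 1],
}
toy_phenotype = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]

scan = single_marker_scan(toy_genotypes, toy_phenotype)
best = max(scan, key=lambda m: scan[m][1])  # marker explaining the most variance
```

A real analysis would add covariates, significance testing, and multiple-testing correction; the sketch only shows the per-marker structure of the computation.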

Engagement Team

  • The composition and role of the iPG2P Engagement Team and working groups are described. The Engagement Team is the primary point of contact among the five working groups. It has two main roles: to perform requirements analysis and to provide project management and coordination for the working groups.