Report on Meeting of G2P Visualization Working Group at TACC, November 6 and 7, 2009
Submitted to the Group by Ruth Grene November 14, 2009
Present: Dan Stanzione, Greg Abram, Bernice Rogowitz, Chris Jordan, Nicole Hopkins, Steve Welch, Steve Goff, Tom Brutnell, Bjoern Usadel, Eric Lyons, Nick Provart, Ruth Grene, Stephen Kobourov, Karla Gendler, Ruth Jordan, Arjun Krishnan, Nirav Marchant, Lin Wang.
Purpose: The meeting was billed as one centered on achieving a mutual understanding, between the plant biologists and the computer scientists on the team, of the task before us. This was, indeed, the promising outcome, as indicated below.
Goal: An extensible lingua franca that unites contemporary data and meta-data in plant biology in a user-friendly visualization tool.
Outcome: Workflows that integrate data will be constructed as web-based services. Greg A. and Bernice R. will be involved in this construction, in collaboration with the domain experts, e.g., Tom B., Bjoern U., Ruth G., Nick P., Eric L., and Steve W. Such a synthesis will provide the user with qualitatively and quantitatively greater resources in his or her efforts to understand fundamental plant processes and to formulate novel, more sophisticated working hypotheses.
Ruth gave an introduction to some of the fundamentals of contemporary biology, starting with a formalized diagram of a plant cell and its associated functions (one of which, C3 and C4 photosynthesis, was subsequently presented in detail by Tom). Ruth focused on images currently constructed by plant biologists to present working hypotheses of mechanisms of plant responses to non-constant environments. These images are often called “models” in the plant biology community, although they are only qualitative, not quantitative. (The more usual, and more precise, formulation of “models” was subsequently demonstrated by Steve Welch, who, among other accomplishments, has successfully modeled flowering time in the field across a swath of Western Europe for Arabidopsis thaliana, complete with dynamic maps.) The non-quantitative plant biology “models” can represent attempts to relate available genomic/genotypic data to phenotypic outcomes: e.g., a mutation in Gene X leads to a decreased drought tolerance phenotype, leading to the formulation of a hypothetical drought perception and signaling pathway that identifies Gene X as an essential component of the process. An extra layer of complexity arises in the numerous cases in which the interests of the user lie outside the canon of fully sequenced plant species (an agronomically important crop species, for example), in which case a tool has to be used to identify closely related genes in a species where more information is available. (Eric presented a tool, “CoGe”, that allows such identifications to be made, using an example from potato, a species whose sequence has only just become available and is, as yet, without annotations.)
In many cases, a constellation of interacting transcription factors and signaling and metabolic pathways has been implicated in a plant’s responses to specific changes in the environment (Bjoern’s data on responses of Arabidopsis to lower temperatures are an example of such a multiplicity of data types). Given such complexity, it is clear that data that attempt to relate genotype to phenotype are multi-dimensional. To date, there is no visualization tool that allows the integration of data such as Bjoern’s. The substantial gaps in our understanding of plants’ responses to their varying environments present an additional, enormous challenge.
Current formulations that offer the possibility of visualizing the details of some of these events were presented by Nick (BioArray Resources and eFP Browser, including gene expression, subcellular localization, protein-protein interactions) and Bjoern (metabolic pathways in MapMan). Both Bjoern and Nick utilize gene classification schemes to group, or “bin”, co-expressed genes into functionally related groupings. This is a most valuable approach that must be preserved in the larger visualization “wrapper” to be built in collaboration with Greg and Bernice.
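The binning idea can be illustrated with a minimal sketch. It is not taken from MapMan or the eFP Browser; the gene names, bin labels, and expression values are hypothetical and chosen only to show how a classification scheme lets per-gene measurements be summarized by functional group.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical gene -> functional bin assignments (illustrative labels only;
# a real classification scheme would supply these).
GENE_TO_BIN = {
    "geneA": "photosynthesis",
    "geneB": "photosynthesis",
    "geneC": "cold response",
    "geneD": "cold response",
    "geneE": "unassigned",
}

# Hypothetical log2 fold changes for one treatment versus control.
LOG2_FOLD_CHANGE = {
    "geneA": -1.0,
    "geneB": -2.0,
    "geneC": 3.0,
    "geneD": 1.0,
    "geneE": 0.5,
}


def summarize_bins(gene_to_bin, values):
    """Group per-gene values by functional bin and return the mean per bin."""
    grouped = defaultdict(list)
    for gene, value in values.items():
        # Genes missing from the classification fall into "unassigned".
        grouped[gene_to_bin.get(gene, "unassigned")].append(value)
    return {name: mean(vals) for name, vals in grouped.items()}


if __name__ == "__main__":
    summary = summarize_bins(GENE_TO_BIN, LOG2_FOLD_CHANGE)
    for name, avg in sorted(summary.items()):
        print(f"{name}: mean log2 fold change {avg:+.2f}")
```

Keeping the gene-to-bin mapping separate from the measurements means the same classification can be reused across data sets, which is one way the binning approach could be preserved inside a larger wrapper.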
Nick’s tool offers the user the option of clicking on various plant tissues that are hyperlinked to repositories of gene expression data, such as NASC arrays, GEO, or a record of relevant literature (example shown in Killian et al., 2007). Nick emphasized the importance of parallel coordinates in the construction of data visualizations. It is possible to access gene expression data for multiple genes at a time. Co-expression tools are available: “GeneMANIA”, using Pearson correlation metrics and provided by a colleague at the University of Toronto, and Expression Angler, from Nick’s own work. The tool allows discrimination between the gene expression patterns of different members of multi-gene families. A number of fully sequenced plant species are accessible in Nick’s tool, and that number is continually updated. The tool already has the capacity to incorporate data across platforms. Eric’s tool, CoGe, could be invoked to establish confidence ratings for inter-species homology relationships (see below). Using Cytoscape, Nick’s tool has already been used to identify, for example, an ABA-ORFeome: genes that are co-expressed in response to exposure to the plant stress hormone abscisic acid (Park et al., 2009).
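The Pearson-based co-expression ranking mentioned above can be sketched as follows. This is an illustration of the metric only, not the implementation used by any of the tools named in this report; the gene names and expression profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def rank_coexpressed(query, profiles):
    """Rank all other genes by correlation with the query, highest first."""
    target = profiles[query]
    scores = [
        (gene, pearson(target, profile))
        for gene, profile in profiles.items()
        if gene != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical expression values across four tissues or conditions.
PROFILES = {
    "query": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # rises with the query
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as the query rises
    "geneD": [1.0, 3.0, 2.0, 4.0],  # loosely follows the query
}

if __name__ == "__main__":
    for gene, r in rank_coexpressed("query", PROFILES):
        print(f"{gene}: r = {r:+.2f}")
```

Because the coefficient depends on the shape of a profile rather than its absolute level, two genes with very different expression magnitudes can still rank as strongly co-expressed.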
One goal of Nick’s work is to map tissues across plant species, a task that is not currently feasible due to the limitations of the current version of Plant Ontology.
Nick’s tool does not offer a way to paint/brush gene expression data on known metabolic pathways, nor does it offer the optimum tool for identifying homologs across species. The former is offered by MapMan and the latter by CoGe. It is hoped that the combined power of, for example, the eFP browser, MapMan, and CoGe can be harnessed into a combined visualization tool that allows individual plant biology users to view a range of different kinds of data at different scales in an ensemble, something that has not been achieved to date.
Bernice presented a perspective on visualization from the other side of the fence, so to speak. Some of the tasks/judgments facing a developer can be enumerated as magnitude, changes over time, interactions, patterns, relationships, and outliers, all terms that are familiar to biologists. A user’s actions could comprise browsing, exploration of relationships, brushing, transformation of variables, tagging and annotation, and integration of analysis (e.g., how do a given model of Steve W’s kind and the actual data interact?). The plant biologists need access to all of these tasks and could act as users in any or all of the categories enumerated by Bernice.
Greg presented a suggested tool, VisTrails, that could, in principle, serve as the framework for the visual synthesis of the plant biologists’ various kinds of data. Tools already developed, such as those described above, would be wrapped by a Python script, and a workflow would emerge from this merging process. The “intellectual descent” of data and tools would be retained as information and visualizations traveled through VisTrails, or its like. The group agreed that Tom’s gene expression data, relating to gene expression patterns and the acquisition of C3 and C4 photosynthetic capacity in maize leaves, Bjoern’s data, showing transcriptomic and metabolomic responses of Arabidopsis to low temperature, and Ruth’s use case scenario would each be used as test cases for the approach suggested by Greg and Bernice. A library of workflows will be constructed and a Visualization Working Group Database built, which will first be populated with test data. The plant biologists present were concerned to establish what will travel on the edges of the workflow graph. The test cases will be used to make this path clear.
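The idea of wrapping existing tools in Python while retaining the “intellectual descent” of the data can be sketched as below. This is a plain-Python illustration under stated assumptions and does not use the API of the framework Greg proposed; the step names, the stand-in tool functions, and the test values are all hypothetical. In this sketch, what travels along each edge is a value together with the ordered list of steps that produced it.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Tracked:
    """A value plus the ordered names of the source and steps behind it."""
    value: Any
    lineage: List[str] = field(default_factory=list)


def wrap(name: str, func: Callable[[Any], Any]) -> Callable[[Tracked], Tracked]:
    """Wrap a plain function so its output carries an extended lineage."""
    def step(item: Tracked) -> Tracked:
        return Tracked(func(item.value), item.lineage + [name])
    return step


def run_workflow(source_name: str, data: Any, steps) -> Tracked:
    """Pass data through each wrapped step in order, recording provenance."""
    item = Tracked(data, [source_name])
    for step in steps:
        item = step(item)
    return item


# Hypothetical stand-ins for existing tools.
def keep_responsive(changes):
    """Keep genes whose absolute log2 fold change is at least 1."""
    return {gene: v for gene, v in changes.items() if abs(v) >= 1.0}


def rank_by_magnitude(changes):
    """Return gene names ordered by absolute change, largest first."""
    return sorted(changes, key=lambda gene: abs(changes[gene]), reverse=True)


if __name__ == "__main__":
    test_data = {"geneA": 2.5, "geneB": 0.2, "geneC": -1.5}  # hypothetical
    result = run_workflow(
        "test_data",
        test_data,
        [wrap("keep_responsive", keep_responsive),
         wrap("rank_by_magnitude", rank_by_magnitude)],
    )
    print(result.value)
    print(" -> ".join(result.lineage))
```

A dedicated workflow system would record far richer provenance (parameters, tool versions, timestamps); the point here is only that each wrapped tool receives and returns the same kind of object, so steps can be chained in any order that makes biological sense.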
Active collaboration between the Greg/Bernice team and the domain experts has already begun at the time of this report (November 15, 2009), and a draft of the first test case is in circulation among the members of the Visualization Group.