MT_20091118

iPG2P Modeling Tools Working Group

November 18th, 2009 - 4pm EST

Attendees: Karla Gendler, Adam Kubach, Ann Stapleton, Chris Myers, Matt Vaughn, Jeff White, Liya Wang, Melanie Correll, Sanjoy Das, Steve Welch

Action Items:

  • All: Provide list of modeling tools currently used
  • Talk with other working groups to see how modeling connects with use cases from the steering committee.
  • Adam and Liya: look at SBML, BioModels.net, and OpenMI
  • Gendler: send out email with Confluence introduction, how organized, where to find meeting notes, etc
  • Gendler: send out poll to establish a regular meeting time

Notes/Agenda:

  1. Tools: what commonalities should we focus on, when should we let 1000 flowers bloom, and how do we connect the two?
    1. Myers asked what the sticking points are in the workflows we execute now and whether there are ways to join forces to help solve some of these problems.  White suggested that perhaps we don't want to build a general solution but instead should be interested in standardizing interfaces.  Das pointed out that everyone writes in their own language and asked whether there will be a way to integrate the models.  Welch commented that the goal would be to work towards a strategy that lets people share, using SBML as an example.  Myers said that the issues that arise in modeling are different from creating a centralized workflow as the other working groups are doing.  Vaughn commented that people tend to use their own code but could be limited by access to data; iPlant should be considered a big tent, and while others are taking a common/centralized approach, this group should think about how to democratize modeling.  Myers said that there will be a need for enhanced computational resources and/or development efforts, and asked how we should prioritize development on certain tools.  The discussion was tabled with an action item that everyone should provide a list of the modeling tools they currently use.
  2. Formats, standards, and interchange: what are they good for?
    1. White stated that in crop modeling, formats and standards are all over the place, which makes it extremely hard to compare models.  He proposed moving to a much more standard interface.  Myers pointed out that SBML can be used to exchange models and work in the same format.  Vaughn asked if SBML makes assumptions about the way models are executed.  Myers answered that you can take SBML and do Monte Carlo or deterministic modeling; there is no specification as to how a model is executed.  Welch added that SBML Level 3 will expand its capabilities and that this group might want to contact an SBML representative to help draft standards.  Myers said that the BioModels.net group in Europe would be a good group to partner with.  Correll added that BioModels.net would be a good framework to look at, with its repository for linking models, and it would at least be something to consider in the iPlant modeling group.
  3. What types of modeling problems will best make use of unique iPlant/TACC resources?
    1. In working with NAM populations, White said that just simulating 500 genotypes can add up in a hurry.  Myers said it would be good to have an infrastructure to manage related but not identical big runs.  White asked if it was possible to put a wrapper around an executable, and how you deal with all of the data this generates.  Welch asked if there are tools that help and whether there are issues on both the input and output sides.  Vaughn said that that is a cyberinfrastructure problem, not necessarily a data integration issue.  Welch said we can help the data integration group by identifying needs.  (A rough sketch of wrapping an executable for per-genotype runs appears after these notes.)
  4. Data integration drives us nuts: how can we convey useful requests and specifications to the Data Integration group?
  5. Personnel: what tasks can we hand to iPlant developers now, and when we find a group postdoc, what will he/she work on?
  6. Use Cases
    1. Welch said that the use cases should be a litmus test: if we are meeting the identified needs, then we are making progress, and with these use cases, larger groups can be involved.  It would be good to begin cataloguing tools now.  Myers sent a request to the group to start listing tools.  White said that he is reluctant to go outside of the group until the group is more focused.  However, with the work on photosynthesis/phenology, both Visual Analytics and Statistical Inference are looking at the NAM lines, and maybe the question is how one would model phenology in maize and perhaps to start mapping out that process.  What data do we have to work with?  Myers suggested RNAseq data.  He also asked where there are connections with other iPlant activities; with photosynthesis there is work with Tom Brutnell.  White would like to look at wheat phenology data.  Myers suggested that the group also work top down, to see how modeling connects to other parts of iPlant.  Correll asked what the other groups are doing and was pointed to Confluence.
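
  A rough, illustrative sketch of the "wrapper around an executable" idea from item 3: the script below launches one run of an existing model executable per genotype input file and keeps each run's output in its own directory.  The binary name (crop_model), its flags, and the directory layout are all hypothetical placeholders; a production version at iPlant/TACC would more likely submit these as batch jobs rather than run them locally.

    # Rough sketch: wrap an existing model executable and run it once per
    # genotype input file.  "crop_model", its flags, and the directory layout
    # are hypothetical placeholders.
    import subprocess
    from pathlib import Path

    input_dir = Path("genotypes")        # one input file per genotype
    output_dir = Path("runs")
    output_dir.mkdir(exist_ok=True)

    for geno_file in sorted(input_dir.glob("*.txt")):
        run_dir = output_dir / geno_file.stem
        run_dir.mkdir(exist_ok=True)
        # Each run writes into its own directory so outputs stay separable.
        subprocess.run(
            ["./crop_model", "--input", str(geno_file), "--outdir", str(run_dir)],
            check=True,
        )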

Expanded agenda:

Modeling Tools group,
Regarding this afternoon's working group teleconference, I've elaborated
a bit on the draft agenda that was circulated previously (included
below).  Whether or not such an elaboration is useful remains to be seen.
Talk to you later,
Chris

  1. Tools: what commonalities should we focus on, when should we let
    1000 flowers bloom, and how do we connect the two?
    Some thoughts on tools from the Steve & Steve Trip Report:
    Selecting and/or developing modeling tool sets.  Tools are needed
    for parameter estimation, sensitivity analysis, verification, and
    model comparison.  Because modeling is such a diversified activity, it
    may be useful for the members of the work group to identify items from
    their own workflows and seek commonality.
    A few general points on each are:
    1. Parameter estimation.  This really equates to the need to optimize
      one or more goodness-of-fit functions [e.g., least squares, maximum
      likelihood, maximum entropy (possibly), or hand-crafted objectives].
      So the real need is for optimizers that can be readily used in a
      generalized fashion.  This need is shared by Statistical Inference.
      As these problems are numerically intensive, parallel approaches
      should be investigated.  Also, both nondeterministic (e.g. particle
      swarm optimization) and deterministic (e.g. DIRECT) algorithms should
      be considered.
    2. Sensitivity analysis.  In principle, three types of sensitivities
      can be investigated, namely to (i) initial conditions, (ii) parameter
      values, and (iii) input values.  Of these, sensitivity to
      parameters is probably most important in the near term.  Tools are
      needed that can explore model responses near an optimally fitting set
      of parameters.  These responses include both the values of model
      outputs and of functions thereof (e.g. least squares values).  Both
      numeric and symbolic derivatives are probably needed, with the latter
      including derivatives of computer source code.  The ability to take
      complicated derivatives will be of assistance in parameter estimation.
      The need exists to visualize the results of sensitivity analyses.
      Sensitivity regions can be expected to extend orders of magnitude
      further in some directions than others.  (A small worked sketch of
      points 1 and 2 follows this list.)
    3. Verification.  Sometimes referred to as "model validation", the
      basic question is whether there exist grounds to reject a model based
      on observations.  There is a large literature on how this might best
      be done.  The question is complicated by the fact that verification
      should be considered in the context of some proposed model use.  In
      research contexts the focus is heavily on model falsification but in
      applied contexts model acceptance may be related to 'acceptable levels
      of error'.
    4. Model comparisons.  The question in this context is generally which
      of two or more models better represents a given set of data.  Again,
      there is literature of various methods from which to choose.  This
      topic is also of relevance to "model selection" in Statistical
      Inference.
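
    As a minimal, illustrative sketch of points 1 and 2 above (with an AIC line touching on point 4), the Python snippet below fits a hypothetical two-parameter logistic curve to synthetic observations by least squares and then probes local parameter sensitivity with finite differences.  The model, data, and parameter names are placeholders, and scipy's least_squares is only one of many possible optimizers; particle swarm or DIRECT would be conceptual alternatives.

      # Minimal sketch: least-squares parameter estimation plus a crude
      # finite-difference sensitivity probe around the fitted optimum.
      # The model and "observations" are placeholders, not iPlant tools.
      import numpy as np
      from scipy.optimize import least_squares

      def logistic_model(params, t):
          """Hypothetical two-parameter logistic growth curve."""
          r, K = params
          return K / (1.0 + (K - 1.0) * np.exp(-r * t))

      def residuals(params, t, observed):
          return logistic_model(params, t) - observed

      # Synthetic data standing in for real growth/phenology observations.
      t = np.linspace(0.0, 10.0, 25)
      observed = logistic_model([0.8, 50.0], t) + np.random.normal(0.0, 1.0, t.size)

      # 1. Parameter estimation: minimize the sum of squared residuals.
      fit = least_squares(residuals, x0=[0.5, 30.0], args=(t, observed))
      print("fitted parameters:", fit.x)

      # 2. Sensitivity analysis: change in the objective for a 1% bump in each
      #    fitted parameter.
      def sse(params):
          return float(np.sum(residuals(params, t, observed) ** 2))

      base = sse(fit.x)
      for i, name in enumerate(["r", "K"]):
          bumped = fit.x.copy()
          step = 0.01 * fit.x[i]
          bumped[i] += step
          print(name, "sensitivity ~", (sse(bumped) - base) / step)

      # 4. Model comparison (sketch): the same residuals feed a least-squares
      #    AIC, which could be compared against an alternative model's AIC.
      n, k = t.size, fit.x.size
      print("AIC:", n * np.log(base / n) + 2 * k)
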
  2. Formats, standards, and interchange: what are they good for?
    A useful discussion of at least some standards, model formats, and
    ontologies is being developed at BioModels.net (e.g., SBML, MIRIAM, SBGN).
    On a related point, it might make sense for iPlant to partner with
    BioModels.net to (a) provide a home/portal for plant-specific models
    and (b) provide more substantial computational resources for online
    simulation.
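
    As a rough illustration of consuming such an interchange format, the snippet below reads an SBML file and lists its species and reactions, assuming the python-libsbml bindings are installed; the file name is a placeholder.  Consistent with the discussion above, nothing here prescribes how the model is simulated.

      # Minimal sketch: inspect an SBML model with the libSBML Python bindings.
      # "plant_model.xml" is a placeholder file name.
      import libsbml

      doc = libsbml.readSBML("plant_model.xml")
      if doc.getNumErrors() > 0:
          doc.printErrors()                  # report parse/consistency problems
      else:
          model = doc.getModel()
          print("species:")
          for i in range(model.getNumSpecies()):
              print("  ", model.getSpecies(i).getId())
          print("reactions:")
          for i in range(model.getNumReactions()):
              print("  ", model.getReaction(i).getId())
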
  3. What types of modeling problems will best make use of unique
    iPlant/TACC resources?
    There is generally a sense (among those of us who have been discussing
    it) that modeling problems of current interest become "big" when we
    consider explorations across spaces of parameters, initial conditions,
    and populations.  Among other things, there are data management and data
    integration problems that arise in coordinating sets of simulations.
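
    To make the point concrete, a minimal sketch of such an exploration: the snippet below runs a small grid of parameter combinations in parallel and records each output next to the parameter values that produced it.  The model function and parameter names are placeholders; at realistic scales this bookkeeping is exactly where the data management and integration problems appear.

      # Minimal sketch: a parallel parameter sweep whose outputs are recorded
      # together with the parameter values that produced them.  The model
      # function and parameter names are placeholders.
      import csv
      import itertools
      from multiprocessing import Pool

      def run_model(params):
          """Stand-in for a real simulation; returns one summary output."""
          growth_rate, capacity = params
          return growth_rate * capacity      # placeholder computation

      if __name__ == "__main__":
          grid = list(itertools.product([0.2, 0.4, 0.8], [10.0, 50.0, 100.0]))

          with Pool() as pool:
              outputs = pool.map(run_model, grid)

          # Keep parameters and outputs together so runs remain identifiable.
          with open("sweep_results.csv", "w", newline="") as f:
              writer = csv.writer(f)
              writer.writerow(["growth_rate", "capacity", "output"])
              for (rate, cap), out in zip(grid, outputs):
                  writer.writerow([rate, cap, out])
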
  4. Data integration drives us nuts: how can we convey useful requests
    and specifications to the Data Integration group?
  5. Personnel: what tasks can we hand to iPlant developers now, and when
    we find a group postdoc, what will he/she work on?
  6. Use cases:
    1. the intersection of photosynthesis/carbon metabolism
      and flowering time
    2. hypothesis-generation through data-mining,
      processing, and visualization
    3. lignin biosynthesis (interest from a group at NCSU working to develop models from detailed 'omics datasets).