Suggested Modeling Workflow
The above diagram indicates a possible workflow (in cyan) for the modeling portion of the iPG2P project including integration with the activities/products of the NextGen (gold), StatInf (tan), and Visual Analytics (violet) workgroups. (These interactions serve to define certain needs to be met by those groups.) The paragraphs that follow describe this workflow and the biological questions to be answered by its various parts. No assumptions are made at this time as to what particular extant modeling tools might be adopted/adapted for use in the cyan blocks. In addition, accompanying files elaborate on selected portions of the diagram.
At an abstract level, a “model” is defined as an object that, when executed or interpreted, utilizes values of external time series (the “environment”) and scalars and, in turn, exposes both time series and scalar values (collectively, the “solution”). The externally supplied scalars subdivide into three categories: the “metadata”, which is sufficient to biologically distinguish runs from one another; the “parameters”, which define fixed properties of the biophysical/chemical system; and the “control values”, mathematically required to generate the solution. External scalars must be available at the start of a run. All time series are defined over the same interval, namely, [start time, stop time], where start time is metadata and stop time may be metadata or determined during the run. Models may vary as to whether values of external time series are supplied in toto at the start of a run or incrementally requested during a run. Similarly, the solution may be exposed incrementally as it is computed or as a unit at the end of a run. It is the responsibility of the user to resolve any timing incompatibilities involving the supplying of external quantities or the exposing of the solution.
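As a purely illustrative sketch, the abstraction above can be rendered as a small set of data structures. All class and field names here are hypothetical, not part of any existing standard; the point is only to show the separation of metadata, parameters, control values, environment, and solution.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class RunInputs:
    metadata: Dict[str, float]     # biologically distinguishes runs (incl. start time)
    parameters: Dict[str, float]   # fixed biophysical/chemical properties
    controls: Dict[str, float]     # mathematically required, e.g. solver step size
    # external time series, each represented here as a function t -> value
    environment: Dict[str, Callable[[float], float]]

@dataclass
class Solution:
    scalars: Dict[str, float]
    time_series: Dict[str, List[float]]  # each defined on [start time, stop time]

def run_model(model: Callable[[RunInputs], Solution], inputs: RunInputs) -> Solution:
    # All external scalars are bundled in `inputs` and thus available at run start,
    # as the definition requires.
    return model(inputs)
```

A "model" in this scheme is simply any callable mapping `RunInputs` to a `Solution`; whether environment values are pulled incrementally or supplied in toto is hidden behind the `Callable` representation of each series.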
A key attribute of the G2P problem is that it encompasses multiple scales from the molecular to the whole plant levels. A great deal of scientific knowledge and successful modeling efforts currently exist at both ends of this spectrum. To give (perhaps debatable) names to these spectral endpoints, we might refer to “systems biology” and “ecophysiological” models. The former focus typically on molecular or cellular events, while the latter deal with whole-plant and/or major physiological processes. Unfortunately, while both these classes and others at intermediate levels conform to the above abstraction (textbox), they use quite different programming technologies. Nevertheless, if research efforts are constrained to solely use a bottom-up or, alternatively, a top-down approach, the G2P problem will be considerably more difficult, if not impossible. Therefore, flexible methods are needed that support contributions from both styles of modeling. This workflow has been designed with this requirement in mind, beginning with model entry.
There are no biological questions directly addressed during model entry. Rather, the task is to specify a model that is operated on by other tools. Four methods of model entry are indicated: (i) manual entry is used when specifying a model de novo or when editing a model previously described; (ii) Systems Biology Markup Language (SBML) will be supported, both in its current form and as might be updated in the future #1; (iii) executable units will be supported provided they use the OpenMI standard to comply with the above definition of “model”; and (iv) other forms of model description (e.g., GPML) may be supported either when format converters exist that can produce descriptions in the forms (i) to (iii), or when community prioritization suggests that a particular such converter should be developed.
A specific technical development is required to support the workflow described herein – let us call it a (or the) “composer”. The composer permits executable, OpenMI-compliant modules (typically written in conventional procedural languages) to co-operate and communicate at execution time with modules written in SBML (an interpreted script currently supported by over 170 tools). A foundation for composer functionality could be provided by putting an OpenMI wrapper around an SBML interpreter #2. Extensive information and expertise on both protocols is readily available.
- A suggestion would be for computer implementation specialists to immediately begin researching these protocols and identifying issues related to the design of the composer.
This activity addresses the question “What values of the ‘parameters’ (as defined in the textbox) yield model ‘solutions’ most congruent with experimental data?” Solving a parameter estimation problem entails several prerequisites: (i) a model, here supplied by the Model Entry activity, (ii) experimental data, to be discussed momentarily, and (iii) an optimization procedure which comprises (a) an algorithm for iteratively suggesting plausible sets of parameter values and (b) a measurement protocol wherein one or more “objective functions” calculate the model’s goodness-of-fit to the experimental data when those parameters are used.
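The three prerequisites can be made concrete with a toy sketch, assuming SciPy is available. The exponential-decay model, the "experimental" data values, and the choice of objective are all illustrative stand-ins, not part of the project design.

```python
import numpy as np
from scipy.optimize import minimize

# (ii) toy "experimental" data: times and observed values (fabricated)
t_obs = np.array([0.0, 1.0, 2.0, 3.0])
y_obs = np.array([1.0, 0.37, 0.14, 0.05])

# (i) a toy model standing in for one supplied by Model Entry
def model(params, t):
    (k,) = params
    return np.exp(-k * t)  # the model's "solution" at the observation times

# (iii)(b) an objective function scoring goodness-of-fit (sum of squared residuals)
def objective(params):
    return float(np.sum((model(params, t_obs) - y_obs) ** 2))

# (iii)(a) an algorithm that iteratively suggests plausible parameter sets
result = minimize(objective, x0=[0.5], method="Nelder-Mead")
```

The estimated decay rate `result.x[0]` should land near 1.0, since the fabricated data were chosen to roughly follow exp(-t).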
The first point to be made about data is a crucial one.
- An absolute requirement for all external data is that they be accompanied by sufficient metadata so as to be able (i) to align with the metadata that identify a model run and, within that run (ii) to uniquely pair each observation with any scalars and/or specific time series values that they represent or that depend on them.
The diagram identifies three categories of experimental data. Two of these, “RNA-seq data” and “biological data”, quantify the plant traits and/or behaviors (phenotypes) that the model is intended to mimic. RNA-seq data is an anticipated product of the NextGen pipeline. It describes the expression levels of genes of interest at defined time points within specified tissues, which may be particular plant parts or quite generic (e.g., all above-ground material). As a direct indication of on-going gene activity, this type of data is singled out as particularly relevant to G2P research. The exponentially falling cost of measuring gene expression by transcript counting is generating, and will continue to generate, ever-increasing volumes of this type of data.
Biological data, as portrayed in the diagram, is a generic category containing diverse other kinds of phenotypic information. These range from metabolomic data, to morphological measurements, to quantification of plant part biomasses, to spectral characterizations of tissue states, etc. (As an aside, the economics of phenotypic measurements involving manual work is increasingly adverse; R&D efforts should target the creation of effective alternatives.)
The ultimate goal of G2P research as defined in this project is the prediction of trait scores in non-constant environments. This cannot be fully achieved in the absence of models that incorporate descriptions of such environments into their calculations. Thus, recording environmental data is a mandatory component of any well-designed experiment. Because most plant processes are environmentally contingent, records should permit reconstruction of the plant’s surroundings from some suitably early stage of development up to the point where the phenotype is measured. In some cases, e.g. where maternal effects are of interest, records may, of necessity, extend into the parental generation, or even earlier.
One can identify three levels of environmental specification that are of decreasing utility to G2P modeling.
- Time series of environmental variables spanning the course of an experiment. Ideally, these are collected via automated instrumentation. These data are essential in any experiment involving non-constant conditions #3. Note that time series can be constructed for constant environment experiments – they merely display no temporal changes.
- Experiments wherein values for some or many needed variables are not explicitly available, but can be inferred or assumed because standard protocols are used and cited #4.
- Experiments wherein, aside from treatment levels, plant rearing conditions prior to the measurement of phenotype are neither reported nor plausibly inferable. Noteworthy risk exists that data from separate instances of such experiments will be incomparable or simply unusable for modeling.
In general, optimization algorithms can be dichotomized along several dimensions that are relevant to the current problem. “Single-objective algorithms” utilize one goodness-of-fit measure, commonly least squares or maximum likelihood. “Multi-objective algorithms” employ more than one such measure. Whereas single-objective methods typically yield one (hopefully) best set of parameters, multi-objective algorithms produce a population of alternative solutions reflecting different possible tradeoffs between the goodness-of-fit criteria. The latter methods are both much more flexible and more robust in combining experimental data of various types or that originate from several labs having unnoticed differences in their operating procedures #5. However, the user is compelled to make post hoc choices of which specific parameter sets to carry forward to subsequent steps in the modeling process.
A strategy intermediate between single- and multi-objective methods is the use of “penalty” functions. These are ad hoc measures that are added to objective functions and reduce the primary goodness-of-fit score when other desirable conditions are poorly met. A multi-objective approach can be converted to single-objective by construing all but one of the objectives to be penalties that reduce the scores of the most important criterion. The cost to the user is that weights (which can strongly influence the outcome) must be assigned to the penalties a priori.
However, it may be necessary to utilize penalty functions even with multi-objective methods. Current multi-objective algorithms are limited in how many goodness-of-fit criteria can be successfully employed – solution becomes problematic above five or so objectives. Yet there may be many well-known biological features that one would wish a model to reproduce. (As one concrete example, it can be difficult to get a clock model to exhibit oscillations without explicitly including a measure of the failure to do so.)
- A suggestion would be to make our first priority be single objective optimization with a mechanism for including penalties and rank multi-objective methods as a secondary, follow-on priority.
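The penalty mechanism can be sketched in a few lines. Both functions below are toy examples (the "oscillation" penalty is a stand-in for the clock-model case mentioned above), and the weight `w` illustrates the a priori choice the user must make.

```python
# Toy primary goodness-of-fit (lower is better): squared distance from a target
def primary_fit(params):
    x = params[0]
    return (x - 2.0) ** 2

# Toy penalty: a stand-in for "failure to oscillate" or any other
# desirable condition that is poorly met; zero when the condition holds
def oscillation_penalty(params):
    x = params[0]
    return max(0.0, 1.0 - x)  # penalize x < 1

# Single-objective conversion: the secondary objective becomes a weighted
# penalty added to the primary score. The weight w must be fixed a priori
# and can strongly influence the outcome.
def penalized_objective(params, w=10.0):
    return primary_fit(params) + w * oscillation_penalty(params)
```

Any single-objective optimizer can then be pointed at `penalized_objective`, trading the post hoc solution choice of multi-objective methods for the a priori choice of `w`.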
A second dichotomy relates to whether the optimization algorithm uses only the values of the goodness-of-fit measures or attempts to exploit knowledge of their rates of change as the parameters are varied. The latter are generically referred to as “gradient methods”, while the former are “derivative-free”. Derivative-free methods are often preferred for complex models because accurate estimation of rates of change can range from extremely expensive computationally to effectively impossible when models involve discontinuous behavioral thresholds. There are techniques that circumvent this by acting in response to general trends in objective function behavior (e.g., the Nelder-Mead algorithm). Some “hybrid algorithms” combine such trend-exploiting methods with a classical derivative-free approach. The idea is that the outer, derivative-free method is effective at exploring parameter effects at sparse intervals over large ranges, while the inner method accelerates the location of very good parameter values within small regions once a desirable neighborhood is found.
- A suggestion would be that we include a hybrid algorithm among the repertoire of methods supported.
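One plausible realization of the outer/inner pairing, assuming SciPy is available, is a stochastic global search followed by a local Nelder-Mead refinement. The multimodal objective is a toy; the two-stage structure is the point.

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

# Toy multimodal objective with several local minima over the search range
def objective(x):
    return np.sin(3.0 * x[0]) + (x[0] - 0.5) ** 2

# Outer stage: derivative-free global exploration over a large range
coarse = differential_evolution(objective, bounds=[(-5.0, 5.0)],
                                polish=False, seed=0)

# Inner stage: local refinement within the promising neighborhood found above
refined = minimize(objective, x0=coarse.x, method="Nelder-Mead")
```

Because the inner search starts from the outer search's best point, the refined score can only match or improve on the coarse one.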
A third dichotomy separates algorithms into deterministic and stochastic procedures. The latter includes random elements in the procedure by which plausible trial parameter sets are chosen for goodness-of-fit scoring. A common problem situation occurs when a parameter value range yields high goodness-of-fit scores but is separated from even better solutions by barrier regions of poorly performing values. Algorithms can become trapped in the good-but-not-optimal ranges and fail to find the best possible answers. Stochastic algorithms attempt to evade this problem by creating non-zero probabilities of leaving any region, no matter how good it may appear, in order to search for better ones. (Generally, the better a region, the lower the permitted probability of departure.)
The bulk of current optimization research focuses on stochastic approaches. There is, however, a body of deterministic work that exploits very general topological features of functions possessing finite rates of change (e.g., the “DIRECT” method). There are indications that such methods can exhibit superior performance for certain problems.
- A suggestion would be to make sure that each of these two categories be represented with the latter considered as being a technology evaluation in the context of G2P modeling.
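The stochastic-escape idea described above is exemplified by the classic Metropolis acceptance rule (used in simulated annealing, among others): every move has a non-zero probability of acceptance, with worse moves less likely the larger the worsening and the lower the "temperature". The function below is a generic sketch, not a proposal for a specific algorithm.

```python
import math
import random

def accept_move(current_score, proposed_score, temperature, rng=random.random):
    """Metropolis rule: lower score = better fit. Improvements are always
    accepted; worsenings are accepted with probability exp(-delta/T) > 0,
    giving the algorithm a chance to leave any region, however good."""
    if proposed_score <= current_score:
        return True
    delta = proposed_score - current_score
    return rng() < math.exp(-delta / temperature)
```

As the temperature is lowered, the probability of accepting a worsening move of fixed size shrinks, which is one way of realizing "the better a region, the lower the permitted probability of departure."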
A fourth dichotomy is between algorithms readily susceptible to acceleration by parallelization vs. those that are not (e.g., the classic Simulated Annealing algorithm). The choice here is a no-brainer.
- We should focus exclusively on algorithms for which parallel implementations exist.
Except for very specialized cases (e.g., least squares fitting of linear models) the optimization algorithms used in modeling are iterative in nature. That is, they begin with one or more, usually random, trial solutions and progressively improve them. It is thus necessary for the user to specify when fitting efforts should cease. Common “stopping criteria” are a maximum permitted number of iterations or a threshold on the minimum allowed relative or absolute per-iteration improvement. “Overfitting” occurs when optimization proceeds to the point that the model is mimicking the idiosyncrasies and/or measurement errors of a particular data set rather than the fundamentals of the underlying processes. When sufficient measurements are available, a set of “guard data” can be sequestered as a protection against overfitting. Periodically during the optimization process, the goodness-of-fit to the guard data is calculated using the best model parameters found to that point. At first, goodness-of-fit to both the primary data and the guard data will improve. But when overfitting sets in, goodness-of-fit to the primary data will continue to improve while that to the guard data will not. Fitting efforts can be stopped at this point.
- A suggestion would be to provide users the option to use goodness-of-fit to guard data as a stopping criterion.
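The guard-data stopping rule can be sketched as a simple check on the history of guard-set scores. The `patience` parameter (how many non-improving checks to tolerate) is an illustrative design choice, not a fixed requirement.

```python
def should_stop(guard_history, patience=3):
    """Stop when the guard-data score (lower = better fit) has failed to
    improve on its earlier best for `patience` consecutive checks,
    signalling that overfitting to the primary data has likely set in."""
    if len(guard_history) <= patience:
        return False
    best_before = min(guard_history[:-patience])
    return min(guard_history[-patience:]) >= best_before
```

During optimization, one would append the periodic guard-data score to `guard_history` and halt once `should_stop` returns True, even if the primary-data score is still improving.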
There are interactions that can influence the specific choices of algorithms to use. For example, the estimation of rates of change with parameter variation is intimately connected to Sensitivity Analysis. Additionally, the association mapping tasks of the Statistical Inference workgroup involve the solution of linear least squares problems for which very effective parallel approaches exist. We should look for specific opportunities to synergize between these topics.
Once completely parameterized versions of any SBML-expressible model exist, it should be possible to export it in a self-contained form for use with any SBML-compatible tool.
Visualization of parameterized models
Once models are parameterized, users will wish to view their behaviors. Typical questions will be: “Is the behavior of the model generally reasonable?” “How is it affected by different environments?” “How does it compare to the behavior of other models?” “How does it behave relative to earlier versions of the same model?” Etc. If model outputs are available in simple formats (e.g., CSV), a raft of tools is available to address these questions.
However, because models are intended to be representations of reality, it should also be possible to visualize their outputs in the same formats and with the same tools that can be used for experimental data. Two suitable data visualization tools identified by the Visual Analytics group are MapMan and the eFP browser. Sample outputs for MapMan (slides 1-2) and eFP (slides 3-4) are contained in the “Examples.ppt ” file that accompanies this document #6.
At one level there are similarities between these programs in that each relies upon template images upon which data is superimposed. In the case of MapMan, the data is color coded as small squares, positioned into arrays and plotted over the templates. My understanding is that eFP uses PNG image files (e.g., of black and white line drawings) whose embedded color fill codes are rewritten to reflect data values. Both of these approaches currently require centralized work to prepare the image templates and to describe the exact pairing of data variables with the regions to be color coded.
During the modeling process, the variables used change many times (e.g., daily, if not faster). With nearly equal rapidity, the modeler-user will alter ideas about what s/he would like to see.
- Therefore, a requirement is that users be able to easily create templates and link regions with model variables.
It is by no means necessary that all visualization features currently enabled within these programs be supported in this mode. But users should be able to create passable graphics suitable for presentation or publication (in grey scale if needed).
The dramatic decline in the cost of gene expression measurements will make the collection of time series data ever more common. Thus, there is a convergence between the evolving visualization needs of experimental data and the requirement to display the time series outputs of simulation models.
- A suggestion would be to enhance the abilities of MapMan and the eFP browser to display time series data.
With these modifications, users will also be able to segue from views of model outputs into related explorations of real data being described by the Visual Analytics group, thus permitting seamless integration of theory and experiment.
Association mapping of model quantities
Association mapping seeks regions of the genome that appear to be statistically linked to some phenotype of interest. That is, regions for which DNA sequence changes from one line to another correspond to changes in the trait score that are systematic to within statistical error. When found, such regions often contain many genes, but the thought is that one or more of them will have a causative influence on the trait of interest. Via carefully designed crosses, applied breeding programs will endeavor to assemble desirable combinations of such regions within individual lines in the hope of obtaining superior performance. Research programs will probe the regions with more refined and stringent methods to attempt to isolate candidate genes and confirm their causative influence.
In the broadest sense, a phenotype is any feature that some human takes an interest in. Quantitative phenotypes are measured with many different tools ranging from rulers to complex laboratory instruments. Conceptually there is no difference between these approaches and a method wherein a model is constructed, fit to a set of diverse genetic lines, and the resulting numerical quantities viewed as traits. Thus, model parameters and instantaneous values of time series outputs can be subjected to association mapping. There are three types of questions that can be addressed by doing so.
- For models of phenomena at levels of organization above the cellular, one can ask “What genomic regions might contain genes influencing this process?”
- For models of metabolic and gene regulatory networks, one can ask “Where might one find genes that could be influencing this process but which are not already contained in the model?”
- Related to the preceding, one can also ask a verification question, “Is there statistical evidence that the model is either incomplete or wrong?” (See textbox just below.)
Metabolic and gene regulatory models are mathematically similar to Hopfield neural networks. Since they are universal function approximators, neural networks are prized for their abilities to mimic the twists and turns of highly nonlinear data. The related downside is that molecular network models may fit experimental data not because they are correct but because they are good at fitting things in general. Similar concerns attend higher level models that reflect emergent network properties. Confidence is gained when models prove they can mimic data from independent sources not used in model construction. Trust declines when model parameters that the modeler asserts to be genetically fixed show significant variation across environments or tissues. Finally, one would expect that the parameters of network nodes asserted to be genes should associate with genomic regions containing those genes.
- A suggestion would be that the model parameters and time series values be readily transferable to the association mapping tools being developed by the Statistical Inference workgroup.
With such a connection in place, the tools that the Visual Analytics group is developing and/or linking in support of the StatInf group can also be used to display this aspect of model analysis. In particular, the QTL visualization tool is designed to display hundreds (or more) traits as mapped against potentially millions of markers. A preliminary draft description of it accompanies this document. When genomic regions of interest are identified, the comparative genomics browser CoGe can be invoked for detailed examination, including syntenic relationships with other species and homolog identification.
This activity asks “How do small changes in model parameters affect model outputs?” This information is important both (i) to the modeling and experimental process and (ii) to basic biological understanding and improving our abilities to manipulate living systems. Typically only a minority of biological elements will have major effects in commonly encountered situations – a fact that can be given a precise mathematical characterization. In terms of experimentation and model development, it is precisely these elements for which one wants to have the most complete and accurate knowledge. Also, when models fail to meet expectations, improving sensitive components can be the most productive route toward better performance. A concrete question that combines these two elements is “How should I prioritize my experimentation and modeling efforts to most rapidly advance knowledge of my topic?”
The paper by Daniels et al. (2008) that accompanies this document notes that there are links between robustness, evolvability, canalization, homeostasis, stability, redundancy, and plasticity. All of them have something to do with how much or how little phenotypes change when external conditions change. Phenotype changes can be divided into two categories: (i) those that are ‘computed’ by the network without alteration of its topology or parametric constitution; and (ii) those that result from exactly such changes. The latter category entails DNA changes in the form of allelic substitutions, and is most relevant to G2P research.
Sensitivity is always computed as the ratio of the change in one dependent variable per unit change in something else. (Because these ratios may be finite difference ratios or partial derivatives, we can refer to them generically as “sensitivity coefficients”.) For example, an evolutionary biologist will ask “What is the change in fitness when allele B is substituted for allele A?” A plant breeder may ask the same question but, in this case, “fitness” is defined purely in terms of the particular phenotypes comprising the breeding target. Assessing this type of sensitivity requires models to be fit to populations of lines where more than one allele is represented for each genetic locus of interest.
Such sensitivity is another point of contact between modeling and association mapping. The latter also estimates such sensitivity coefficients, but for linear models indirectly linked to particular genes via genetic markers. A question would be “What quantitative relationships exist between sensitivity as estimated by these two methods?” The linear models used in association mapping and other aspects of statistical genetics have had a long and generally successful history of representing G2P relationships. Close agreement between sensitivities as measured by the two methods would help document that this ability continues to hold.
- A suggestion would be to enable calculation of sensitivity to allelic substitution when the data exists to do so.
- Another suggestion would be to facilitate comparisons between such sensitivities and related estimates from association mapping.
Beyond this, or when multi-allelic data is unavailable, more general information can be obtained by assessing changes in dependent quantities of interest per unit change in the model parameters fit to one line. There are two cases of interest. In the first, the dependent variable is some specific model output such as flowering time or photosynthetic rate. In the other, the dependent variable is some measure of overall model behavior. An important example of the latter is when the dependent variable is a goodness-of-fit measure such as sum of squared error. In this case a biological question is “How sensitive are model outputs in general within the context of situations delimited by this body of data?”
- A suggestion would be that mechanisms be provided to calculate sensitivity coefficients for model outputs and for goodness-of-fit functions (sans penalty functions).
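Both cases reduce to estimating a sensitivity coefficient, here sketched as a central finite difference. The toy decay model, data values, and step size are illustrative only.

```python
import numpy as np

def central_diff(f, params, i, h=1e-6):
    """Finite-difference sensitivity coefficient: change in f per unit
    change in the i-th parameter, via a central difference."""
    p_plus, p_minus = params.copy(), params.copy()
    p_plus[i] += h
    p_minus[i] -= h
    return (f(p_plus) - f(p_minus)) / (2.0 * h)

t = np.array([0.0, 1.0, 2.0])
data = np.array([1.0, 0.6, 0.35])   # fabricated observations

# Case 1: the dependent variable is a specific model output (value at t=2)
def output(params):
    return float(np.exp(-params[0] * t)[-1])

# Case 2: the dependent variable is a goodness-of-fit measure (SSE)
def sse(params):
    return float(np.sum((np.exp(-params[0] * t) - data) ** 2))

s_output = central_diff(output, np.array([0.5]), 0)  # d(output)/d(k)
s_fit = central_diff(sse, np.array([0.5]), 0)        # d(SSE)/d(k)
```

For this toy case both coefficients can be checked against closed-form derivatives, which is a useful sanity test for any such mechanism.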
In addition to their biological relevance, the sensitivities of goodness-of-fit functions have important statistical applications. An entity known as the Fisher Information Matrix (see workflow) can be approximated from derivatives of maximum likelihood goodness-of-fit functions.#7 Asymptotic standard errors for the parameter estimates can be calculated from this matrix. In addition to their desirability from a reporting perspective, standard errors permit testing of hypotheses about the parameters such as “Is this parameter significantly different from zero (or from some experimentally measured value)?” For this reason, as well as because of its contributions to model improvement, the workflow shows sensitivity analysis preceding verification. In point of fact, however, sensitivity analyses that target biological questions rather than model development are usually applied after verification.
Slight changes in some parameters and/or combinations thereof have major effects on model outputs. If one imagines each set of parameter values as specifying a point in a high dimensional space, the combinations just described define “stiff directions”. Major changes in other combinations have very slight effects; these are called “sloppy directions”. Stiff sensitivities may be 1000 times greater than sloppy ones. Figure 2 in Daniels et al 2008 exemplifies both these terms when the dependent variable is a least squares goodness-of-fit function. The paper also describes the mathematics of these directions, which involve computing eigenvalues and eigenvectors of a matrix called the “Hessian”, which is approximated by the Fisher Information Matrix.
- A suggestion would be that mechanisms be provided to calculate the Fisher Information Matrix along with its eigenvalues and eigenvectors.
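For a least-squares fit, the Fisher Information Matrix can be approximated as J^T J, where J is the Jacobian of the residuals with respect to the parameters; its eigenvalues then quantify stiffness/sloppiness and its eigenvectors give the corresponding directions. The two-parameter exponential model below is a toy stand-in.

```python
import numpy as np

t = np.linspace(0.0, 3.0, 10)

def residuals(params, data):
    a, k = params
    return a * np.exp(-k * t) - data

def jacobian(params, data, h=1e-6):
    # forward-difference Jacobian of the residuals w.r.t. the parameters
    base = residuals(params, data)
    cols = []
    for i in range(len(params)):
        p = np.array(params, dtype=float)
        p[i] += h
        cols.append((residuals(p, data) - base) / h)
    return np.column_stack(cols)

params = np.array([1.0, 0.8])
data = params[0] * np.exp(-params[1] * t)   # synthetic, noise-free data

J = jacobian(params, data)
fim = J.T @ J                               # Fisher Information approximation
eigvals, eigvecs = np.linalg.eigh(fim)      # ascending: sloppy -> stiff
```

The ratio `eigvals[-1] / eigvals[0]` measures the spread between the stiffest and sloppiest directions; in realistic systems biology models this spread can be orders of magnitude, as discussed above.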
Eigenvalues provide an indication of overall levels of sensitivity; that is, they quantify degrees of stiffness or sloppiness. Interestingly, it has been shown for quite a few systems biology models that sensitivities measured in this way tend to form hierarchies that are evenly spaced when plotted on a log scale. For example, a sensitivity analysis of a flowering time model revealed groups of genes whose parametric effects on the phenotype were roughly separated by powers of two. I am unaware if such analyses have been applied to ecophysiological models, but any outcome of such a study would be interesting. I do know that a not-uncommon assumption in simulation studies of association mapping designs is that effect sizes follow a power law. It is easy to imagine that this practice has some empirical underpinnings based on real data. If such data exists, it could be reflective of a similar underlying distribution of eigenvalues.
Each eigenvalue has a corresponding eigenvector whose components correspond to and quantify the contribution of individual parameters to overall sensitivity; that is, they specify the stiff or sloppy directions. Thus, parameters that only contribute to sloppy directions have negligible effects on phenotype, at least within the structure of the model as proposed and the range of environments evaluated.
- A suggestion would be that visualization tools be provided to depict eigenvalue and eigenvector relationships.
Such a tool is denoted by the “Eigen Plot” box in the workflow. The range of desirable features such a plot should provide needs to be delineated. But it may prove to be a relatively simple application of what visualization experts call “parallel coordinates”. A description of this methodology can be found at http://en.wikipedia.org/wiki/Parallel_coordinates. In this scheme, each parallel axis could depict the contribution of one parameter, with an extra axis supplied for eigenvalues. Axes could be reorderable to group parameters by magnitude; by the gene, metabolite, reaction, or physiological process they were a part of; or by other criteria of interest.
- Another suggestion would be to let sensitivity coefficients, eigenvalues, and eigenvector components be inputs to MapMan and the eFP browser (along with other parametric values) and display them in juxtaposition with the processes or entities they relate to.
An important aspect of implementation concerns the methods to be used to calculate the derivatives needed in sensitivity analysis. Numerical calculation of derivatives proceeds by finite approximation. Such schemes may be constructed from the algebraic expressions whose limiting values mathematically define derivatives. Alternative approaches exactly differentiate polynomials that are fit to model outputs generated from closely spaced values of model parameters. The most sophisticated approaches to numerical differentiation always pay close attention to the error terms of the corresponding differentiated Taylor series.
Needless to say, achieving accuracy with these methods is numerically intensive. Of course, in the current setting, we have enough computational horsepower to overcome this problem. There are, however, two problematic issues. First, models in the class under consideration here are, themselves, solved by finite difference methods, so their solution errors will be proportional to the length of the time step raised to some power. Models specified in SBML are likely to be solved by fourth-order (probably adaptive) Runge-Kutta methods, so errors will be quite small – proportional to the fifth power of the time step. Many ecophysiological models rely on Euler integration, so errors are much greater – proportional to the second power of the time step. Unfortunately, the numerical methods described in the last paragraph assume that values from the function to be differentiated are available at machine precision. An analysis would be required to ascertain how best to mitigate the effects of this problem.
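The issue can be demonstrated directly: a central difference of a smooth function has error of order h^2, but any solution error of amplitude `err` is amplified by roughly `err/h`. The "solver error" below is a deterministic toy perturbation, not an actual integrator output.

```python
import math

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2.0 * h)

def noisy_sin(err):
    # sin(x) plus a small-amplitude, rapidly varying perturbation standing in
    # for the solution error of a finite-difference integrator
    return lambda x: math.sin(x) + err * math.cos(1000.0 * x)

exact = math.cos(1.0)
h = 1e-4
clean_error = abs(central_diff(math.sin, 1.0, h) - exact)
noisy_error = abs(central_diff(noisy_sin(1e-4), 1.0, h) - exact)
```

With a machine-precision function, the derivative estimate is accurate to roughly h^2; with a perturbation of amplitude 1e-4 and the same step, the estimate degrades by several orders of magnitude, illustrating why solver accuracy bounds the usable differentiation step.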
A second problem can arise depending on the method used to inter-communicate with the differentiation code. A generic communication method that is useful between very disparate codes is for one piece of software to write a file that the other one reads. However, if text files are used, large amounts of precision can be lost if quantities are written with too few significant digits. This, in effect, superimposes a low-amplitude, but very high frequency noise signal on top of the model outputs, which is capable of rendering numerical derivatives worthless #8. I see no way to solve this problem other than by communicating values at close to machine precision.
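A tiny round-trip experiment makes the point concrete. Writing values with six significant digits (a common default) quantizes the model outputs; for a small differentiation step the two quantized values can coincide exactly, making the numerical derivative meaningless. The format string and step size below are illustrative.

```python
import math

def through_text(value, fmt="%.6g"):
    # simulate writing a value to a text file and reading it back
    return float(fmt % value)

x, h = 1.0, 1e-7
exact = math.cos(x)

# central difference at machine precision
full = (math.sin(x + h) - math.sin(x - h)) / (2.0 * h)

# the same computation after a 6-significant-digit text round trip
lossy = (through_text(math.sin(x + h)) - through_text(math.sin(x - h))) / (2.0 * h)
```

Here `full` is accurate to many digits, while the two text-formatted values round to the same six-digit string, so `lossy` collapses entirely; this is the "high frequency noise" effect described above.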
Other approaches for taking derivatives do not rely on finite differences. The first is “symbolic differentiation”, wherein the rules for taking derivatives are applied to the model equations to yield exact formulas for the sensitivity coefficients. In principle, this can be done by hand, but models are typically so complex that doing so without error is, for all practical purposes, impossible. Fortunately, it can easily be done by appropriate computer software. However, because it is a very involved process, it is best done on a one-time basis as a model preprocessing step. Symbolic differentiation is more suitable for models expressed in scripting languages, which usually represent equations in ways that are closer to standard mathematical notation. At least one program #9 exists that can apply this method to models expressed in SBML.
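To illustrate the idea (and not any particular SBML tool), here is a toy symbolic differentiator over a minimal expression language; real software supports a far richer set of operations, but the principle of applying derivative rules to an expression tree is the same.

```python
# A minimal expression tree with exact derivative rules.  The result is
# a formula, evaluable at any point with no step size involved.

class Expr:
    def __add__(self, other): return Add(self, other)
    def __mul__(self, other): return Mul(self, other)

class Const(Expr):
    def __init__(self, v): self.v = v
    def diff(self, var): return Const(0.0)
    def eval(self, env): return self.v

class Var(Expr):
    def __init__(self, name): self.name = name
    def diff(self, var): return Const(1.0 if self.name == var else 0.0)
    def eval(self, env): return env[self.name]

class Add(Expr):
    def __init__(self, a, b): self.a, self.b = a, b
    def diff(self, var): return Add(self.a.diff(var), self.b.diff(var))
    def eval(self, env): return self.a.eval(env) + self.b.eval(env)

class Mul(Expr):
    def __init__(self, a, b): self.a, self.b = a, b
    def diff(self, var):      # product rule
        return Add(Mul(self.a.diff(var), self.b),
                   Mul(self.a, self.b.diff(var)))
    def eval(self, env): return self.a.eval(env) * self.b.eval(env)

# d/dx (x*x + 3x) = 2x + 3, which is 7 at x = 2.
x = Var("x")
dexpr = (x * x + Const(3.0) * x).diff("x")
print(dexpr.eval({"x": 2.0}))   # 7.0
```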
A second approach is suitable for programs written in procedural languages such as FORTRAN, C, and C++. During the process of translating such programs into executables (“compilation”), they are decomposed into large numbers of very simple operations. Derivatives are easily calculated for these individual operations and combined mathematically to yield the derivative of the original code. This is called “automatic differentiation” #10 and can be done at execution time.
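A minimal sketch of the forward mode of automatic differentiation using dual numbers: each elementary operation propagates both a value and a derivative, which mirrors what AD tools do to the decomposed operations of compiled code. The `model_rate` function is a hypothetical stand-in for procedural model code.

```python
class Dual:
    """A (value, derivative) pair; arithmetic applies the chain rule."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__
    def __mul__(self, other):   # product rule on the derivative part
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def model_rate(k):
    """A toy rate expression; any code built from + and * works unchanged."""
    return 3.0 * k * k + 2.0 * k + 1.0

k = Dual(2.0, 1.0)            # seed derivative dk/dk = 1
y = model_rate(k)
print(y.value, y.deriv)       # 17.0 and d/dk(3k^2+2k+1) at k=2 -> 14.0
```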
- A suggestion would be that symbolic differentiation be supported for SBML components and that automatic differentiation be available for model components written in major procedural languages.
- The composer should be able to integrate SBML and executables produced, respectively, by symbolic and automatic differentiation.
However, as a backup,
- The requirements for high precision numerical differentiation should be researched in case the above approach proves unsuitable for one or more priority use cases.
In any event,
- The system should provide users with information and caveats regarding the calculation of sensitivity coefficients.
There is a huge literature on methods for assessing model quality. This narrative sets forth a few common assessment types that might be included, but I will not be surprised if readers have many other ideas/suggestions. Basic questions are: “How well does this model perform?” “Is the model contradicted by data?” “How skillfully might it predict ‘future’ data, i.e., data not used in constructing the model?” And, “Is it better than other models?”
The simplest and by far most common general measure of model performance is R^2^, the coefficient of determination #11. This is the proportion of variability in a data set that is accounted for by the model. (Note: R^2^ values can be calculated for any model, not just linear ones.) Clearly, when a model has multiple outputs, its skill may be greater for some than for others. In addition, a model may perform better for some subsets of the data than for others (e.g., a drought stress model might behave better for well-drained soils than for clayey ones). Exploring R^2^ values for data subsets can therefore be quite revealing.
- Whatever else is done, flexible methods for computing R^2^ values should be provided for all model outputs, and for data subsets extractable via the metadata.
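A minimal sketch of the R^2^ computation described above; the observed and predicted values are hypothetical, and the same function could be applied to any metadata-defined subset of a data set.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SS_res/SS_tot: the proportion of variability in the
    observations accounted for by the model.  Works for any model,
    linear or not."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot

obs  = [1.0, 2.0, 3.0, 4.0]     # hypothetical observations
pred = [1.1, 1.9, 3.2, 3.8]     # hypothetical model outputs
print(r_squared(obs, pred))
```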
As the article in Footnote 11 points out, there are a variety of important issues about which R2 is mute. Many of these can be thrown into sharp relief by a graphical analysis of model outputs, especially of data point residuals. A variety of useful plots are described in the article at http://en.wikipedia.org/wiki/Statistical_model_validation and it is to be expected that our visualization experts will be aware of others #12. In any event, it is certain that users will want great flexibility in simple graphical analysis of model outputs. Therefore,
- A suggestion would be that model outputs be exportable into an Excel template pre-programmed with macros for various predetermined graphical analyses.
A very common graphical analysis is to construct a scatter plot where the [x,y] coordinates of each point are [model_output_value, observed_value] or the reverse. If the model were perfect, all points would lie on the 45-degree diagonal. Departures from this configuration are indicative of model discrepancies. A further step is to fit a linear trend line through this plot and look for statistically significant deviations of the slope from unity and the intercept from zero. While intuitive, this test has been criticized because of a known relationship between the sample regression slope, the standard deviations of x and y, and the sample correlation coefficient r~xy~. Given enough data, this relationship can lead to the rejection of a correct model. However, the criticism is itself not beyond criticism, and the test is perfectly usable in selected circumstances.
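The trend-line test can be sketched as follows. The data are hypothetical, and the t statistics returned would still need to be compared against a t distribution with n-2 degrees of freedom; the caveat above about the slope's dependence on r~xy~ and the standard deviations of x and y applies when interpreting the result.

```python
import math

def slope_intercept_test(x, y):
    """OLS fit of y = a + b*x, returning (b, a, t_slope, t_intercept)
    where the t statistics test H0: b = 1 and H0: a = 0."""
    n = len(x)
    xb = sum(x) / n
    yb = sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = yb - b * xb
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - 2)        # residual variance
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1.0 / n + xb ** 2 / sxx))
    return b, a, (b - 1.0) / se_b, a / se_a

pred = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical model outputs
obs  = [1.2, 1.9, 3.1, 4.2, 4.9]   # hypothetical observations
b, a, t_slope, t_inter = slope_intercept_test(pred, obs)
print(b, a, t_slope, t_inter)
```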
- Whenever possible (and meaningful) all the methods in this section should be equally applicable both to the data used to construct the model and to data not so used.
This desideratum (i) allows the user to “see how things are going” while the model is being built, (ii) provides an independent test of model skill (using sequestered data) when model building is finished, and (iii) gives some indication of the presence of overfitting via a comparison of the two. Note that great care must be exercised to assure the true independence of verification data. For example, “guard” data used in model parameter estimation does not qualify. Although it never enters directly into the calculation of a goodness-of-fit measure for any trial set of parameters, it nevertheless influences the final estimates by helping to determine when the optimizer stops.
When data are limited, cross validation or bootstrapping can be used to estimate means and variances for (i) the sample parameter estimates, (ii) prediction errors against independent data, (iii) goodness-of-fit measures, etc. Both methods assume that the distribution of the data reflects that of the world #13. The two techniques use resampling to repeatedly estimate the quantities of interest. Cross validation partitions the data into k disjoint, nearly equal-sized sets; re-estimates model parameters k times, each time with one of the subsets left out; calculates model outputs corresponding to the omitted cases; and then determines the desired means and variances using the k sets of results (parameter estimates, model outputs, etc.). Instead of a fixed partitioning of the data, bootstrapping repeatedly draws random samples (with replacement) from the data, but then proceeds in the same manner. Both methods are described at http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html in the not-unrelated context of neural networks. Cross validation is numerically intensive and bootstrapping is considerably more so.
- A suggestion would be to provide both cross validation and bootstrapping capabilities, along with a means for the user to estimate the computational requirements of each.
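Both resampling schemes can be sketched for a deliberately simple one-parameter model y = k*x, whose least-squares estimate has the closed form sum(x*y)/sum(x*x); the `fit` function and data below are hypothetical stand-ins for a full parameter-estimation run.

```python
import random

def fit(data):
    """Closed-form least-squares estimate of k in y = k*x."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

def k_fold_estimates(data, k):
    """Cross validation: refit k times, each time leaving out one of k
    disjoint, nearly equal-sized subsets."""
    folds = [data[i::k] for i in range(k)]
    out = []
    for i in range(k):
        train = [pt for j, f in enumerate(folds) if j != i for pt in f]
        out.append(fit(train))
    return out

def bootstrap_estimates(data, n_reps, rng):
    """Bootstrap: refit on samples drawn with replacement from the data."""
    return [fit([rng.choice(data) for _ in data]) for _ in range(n_reps)]

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
rng = random.Random(0)
cv = k_fold_estimates(data, k=5)
bs = bootstrap_estimates(data, n_reps=200, rng=rng)
mean_bs = sum(bs) / len(bs)
var_bs = sum((b - mean_bs) ** 2 for b in bs) / (len(bs) - 1)
print(cv)               # k leave-one-out parameter estimates
print(mean_bs, var_bs)  # bootstrap mean and variance of the estimate
```

The computational contrast noted above is visible even here: cross validation refits k times, the bootstrap n_reps times.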
Because the individual data subsets are statistically independent, these techniques can answer the question “How well can the model predict new data?” But, as the preceding paragraph indicates, they can do much more besides. For instance, bootstrapping can estimate standard errors for the sample parameter values. This enables statistical comparisons of model estimates to direct experimental measurements in cases where the latter were not incorporated into the model. “Asymptotic” parameter standard error estimates can also be obtained from the Fisher Information Matrix at a much lower computational cost because resampling is not entailed. The downside is indicated by the word “asymptotic” – standard errors obtained in this fashion converge to the true values only as the amount of data becomes infinite.
- The user should not be able to escape being advised to calculate parameter estimate standard errors by one means or another and, when more than one method is used, all their results should be displayed in juxtaposition.
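The Fisher-Information route can be sketched for a two-parameter least-squares case, where the parameter covariance is approximately s^2 (J^T J)^-1, with J the matrix of sensitivity coefficients and s^2 the residual variance. The Jacobian and residuals below are hypothetical, and the 2x2 inverse is written out explicitly to keep the linear algebra visible.

```python
import math

def asymptotic_se(J, residuals, n_params=2):
    """Asymptotic parameter standard errors for a 2-parameter model:
    the square roots of the diagonal of s^2 * (J^T J)^{-1}."""
    n = len(J)
    s2 = sum(r * r for r in residuals) / (n - n_params)
    # Form J^T J explicitly for the 2x2 case.
    a = sum(row[0] * row[0] for row in J)
    b = sum(row[0] * row[1] for row in J)
    d = sum(row[1] * row[1] for row in J)
    det = a * d - b * b
    # Diagonal of the inverse, scaled by s2, gives parameter variances.
    return math.sqrt(s2 * d / det), math.sqrt(s2 * a / det)

# Hypothetical model y = theta0 + theta1*x: sensitivities are [1, x].
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
J = [[1.0, x] for x in xs]
residuals = [0.1, -0.2, 0.1, 0.1, -0.1]
se0, se1 = asymptotic_se(J, residuals)
print(se0, se1)
```

No resampling occurs anywhere above, which is the source of the cost advantage over bootstrapping.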
In the biological literature a common model verification display for time series is to plot the independent data points with whiskers depicting experimental confidence limits and superimpose the trajectory predicted by the model. This approach is very unfair to the model because it represents the model outputs as being known without error. Thus, if the model curve passes outside the data confidence limits, it appears that the model has failed. In point of fact, there are at least three increasingly expensive ways to estimate standard errors for model outputs:
- Linearized estimates can be obtained by multiplying an appropriate matrix of sensitivity coefficients by a vector of parameter standard errors obtained from the Fisher Information Matrix;
- Monte Carlo estimates can be calculated by (i) assuming some probability distribution for the parameters (e.g., multivariate normal) whose means and (co)variances come from the estimates and the Fisher Information Matrix, respectively; (ii) repeatedly drawing random numbers from that distribution; (iii) running the model with these random parameters; and (iv) tabulating the frequency distributions of the model outputs;
- Obtaining them from bootstrapping runs.
Only the first and second of these require code not already subsumed by suggestions above. However, the first method is extremely simple to implement, and the second is markedly cheaper computationally than the third because parameter re-estimation is not required.
- Thus, a suggestion would be that all three methods be supported and that, when more than one method is used, it should be possible to juxtapose the results.
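The first (linearized) method can be sketched for a single model output: with sensitivity vector s = d(output)/d(theta), the output variance is approximately s^T Cov(theta) s, where Cov(theta) comes from the Fisher Information Matrix. The sensitivity vector and covariance matrix below are hypothetical.

```python
import math

def linearized_output_se(s, cov):
    """Linearized standard error of one model output: sqrt of the
    quadratic form s^T * Cov(theta) * s."""
    var = sum(s[i] * cov[i][j] * s[j]
              for i in range(len(s)) for j in range(len(s)))
    return math.sqrt(var)

sens = [1.0, 3.0]               # sensitivities at one time point
cov = [[0.016, -0.004],         # hypothetical parameter covariance
       [-0.004, 0.0027]]
print(linearized_output_se(sens, cov))
```

Repeating this at every time point yields the error band needed to plot model trajectories fairly alongside data whiskers.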
Once realistic standard errors for model outputs are available, it should be possible to fairly compare model predictions with data. A caveat is that appropriate p-value corrections are needed for (i) multiple comparisons involving sets of data points and (ii) serial correlations between successive time series values.
- A suggestion would be to consult an appropriate statistical authority to determine how to make such corrections.
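Purely as an illustration of the simplest such correction (and not a substitute for that consultation), Bonferroni multiplies each p-value by the number of comparisons; it addresses point (i) only and does nothing about serial correlation.

```python
def bonferroni(p_values):
    """Bonferroni adjustment for multiple comparisons: multiply each
    p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values for three data-point comparisons.
print(bonferroni([0.01, 0.04, 0.30]))   # each p multiplied by 3, capped at 1
```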
As noted in the introduction to this section, it is also desirable to compare new and extant models in some objective fashion that enables them to be ranked. This is termed the “model selection” problem. Various information theoretic measures can be used to construct such rankings. Two of them are the Akaike Information Criterion #14 (AIC) and the Bayesian Information Criterion #15 (BIC). Both of them operate by penalizing the maximized likelihood (or residual sum of squares) by amounts related to the number of parameters. The notion is that the penalty offsets the tendency for models with more parameters to fit better. Lower scores for both these measures are better, but their individual values are meaningless – their only valid use is to rank models.
- A suggestion would be to include the ability to compare models using the AIC and BIC.
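Both criteria can be sketched from residual sums of squares under a Gaussian error assumption, using AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n); the RSS values and parameter counts below are hypothetical.

```python
import math

def aic_bic(rss, n, k):
    """AIC and BIC for a least-squares fit with Gaussian errors:
    n data points, k parameters, residual sum of squares rss."""
    base = n * math.log(rss / n)
    return base + 2 * k, base + k * math.log(n)

# Hypothetical comparison: model A (3 params) vs model B (6 params),
# both fit to the same n = 50 data points.
aic_a, bic_a = aic_bic(rss=12.0, n=50, k=3)
aic_b, bic_b = aic_bic(rss=10.5, n=50, k=6)
print(aic_a, bic_a)
print(aic_b, bic_b)   # lower is better for both criteria
```

Because BIC's per-parameter penalty ln(n) exceeds AIC's penalty of 2 whenever n >= 8, the two criteria can rank the same pair of models differently, as happens with these hypothetical numbers; juxtaposing both is therefore informative.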
This raises the interesting question as to whether there is some “default model” that can be used as a floor for comparison. If there is theory/discussion on this, I am unaware of it. However, in the context of G2P work a particular baseline suggests itself. The association mapping of parameter values entails fitting the model to the set of lines comprising the mapping population. Thus, model predictions will exist for relevant phenotypes in that population and can, themselves, be association mapped. Doing so would permit calculation of AIC and BIC values for the mapping models.
- We should discuss the idea of including comparisons of information theoretic measures for both association models as well as simulation models.
Ta’ Ta’ For Now
1. Over time we can expect the expressivity of methods used to describe biological models to increase, perhaps leading to the point where most models falling within the textbox abstraction can be supported within a single linguistic framework. An important step in this direction is being made with SBML 3, whose specification is currently in the design phase.
2. This implementation note is provided on the principle that it is incumbent on designers that they be able to specify at least one way that crucial components might be built.
3. Within, but not necessarily outside, the modeling community it is widely known that mathematical hypotheses regarding non-constant effects and influences are easily described. Moreover, effects seen in constant environments can be quite unrepresentative of plant behavior under natural conditions. These facts argue that data collected under variable conditions can, in fact, be analyzed and might, in some cases, be considerably more revealing biologically.
4. One might ask if there are environmental variables whose values should always be recorded. My personal view is that its ubiquitous effects on biophysical/chemical rates make temperature one such variable. Other researchers may have additional variables they would like to see on such a list. Indeed, one can envision considerable discussion on how to best balance realism with practicality. A concrete method for resolving such a debate in specific instances is that models omitting critical environmental information will fail during verification.
5. This latter property could make multi-objective methods helpful in cases where environmental specifications fall in the third category listed above, provided the data are deemed to be usable at all.
6. Credit for material in these slides traces by various routes including but not limited to Tom Brutnell to Bjoern Usadel (MapMan) and Nick Provart (eFP). The examples should be viewed in slideshow mode because they contain some animations.
7. When experimental errors are normally distributed, maximum likelihood and least squares goodness-of-fit functions are the same.
8. This is another strong argument for using derivative-free parameter estimation methods.
12. A number of instructive pathological cases are depicted at http://en.wikipedia.org/wiki/Anscombe%27s_quartet
13. An important way this assumption can fail is in systems that are extended in time during which the world changes in ways that the model is insufficiently skilled to handle. Such is life.