SI_20091020

iPG2P Statistical Inference

October 20, 2009

Attendees: Ed Buckler, Jean-Luc Jannink, Chris Myers, Peter Bradbury, Scott Menor, Liya Wang, Steve Welch, Barb Stranger, Dan Kliebenstein and Matt Vaughn (locked out)

Action Items:
[SECTION I] Please comment on the Data outline, and I will forward it to the Data Integration group for their consideration.
[SECTION III] Please come up with specific algorithms to compare lots of genotypes to an individual phenotype (these algorithms will then be run against other phenotypes for the same population).

Notes/Agenda:

Personalities
1. For iPlant, there is a general process to identify potential users and develop their profiles to help with the computational development (Know your audience).
2. Matt/Karla will update us about setting up the computational postdoc profile for editing
  1. The computational postdoc profile (Andres) looked like a good beginning and there were no real suggestions for improvement.
3. Which other users do we feel this package should serve?
  1. Biologically savvy graduate student
    1. Generates phenotypic data and just wants to find a list of genotypes that might influence the phenotype
  2. Mathematical or Statistical scientist
    1. Wants to access the data and computational resources to test their algorithm for genotype to phenotype linkage
    2. Would want the ability to insert new algorithms into the package for use.
    3. Would want the ability to get performance measures on the computational time per test, etc.
  3. High school or undergraduate lab student
    1. Might have a mapping population and do a phenotyping analysis in a lab course.
    2. Would then want to be able to do the genotype to phenotype tests and get some generally informative answer.
    3. Would likely involve links to other modules.
  4. Lab Instructor
    1. Ability to understand what their students are doing with the module.
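The requests from the mathematical or statistical scientist profile (inserting new algorithms and getting timing per test) could be sketched roughly as follows. This is only an illustration, not an agreed design: the names `ALGORITHMS`, `register`, `mean_difference`, and `run_timed` are hypothetical, the genotype layout (one list of 0/1 allele codes per marker) is an assumption, and the toy test is a stand-in for a real statistic.

```python
import time

# Hypothetical registry: algorithm name -> function(genotypes, phenotype)
# returning one score per marker. Not an existing iPlant interface.
ALGORITHMS = {}


def register(name):
    """Decorator that adds a user-supplied algorithm to the registry."""
    def decorator(func):
        ALGORITHMS[name] = func
        return func
    return decorator


@register("mean_difference")
def mean_difference(genotypes, phenotype):
    """Toy test: absolute difference in mean phenotype between allele codes 0 and 1."""
    scores = []
    for marker in genotypes:
        g0 = [y for a, y in zip(marker, phenotype) if a == 0]
        g1 = [y for a, y in zip(marker, phenotype) if a == 1]
        if not g0 or not g1:
            scores.append(0.0)  # monomorphic marker: nothing to compare
        else:
            scores.append(abs(sum(g1) / len(g1) - sum(g0) / len(g0)))
    return scores


def run_timed(name, genotypes, phenotype):
    """Run a registered algorithm and report average wall-clock seconds per test."""
    func = ALGORITHMS[name]
    start = time.perf_counter()
    scores = func(genotypes, phenotype)
    elapsed = time.perf_counter() - start
    per_test = elapsed / len(scores) if scores else 0.0
    return scores, per_test
```

A user would call `run_timed("mean_difference", genotypes, phenotype)` and receive both the per-marker scores and the average time per test.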
Data
1. The previous meeting came up with the following request for a universal data format for Genotype and Phenotype in mapping populations (structured and non-structured).
2. Outline
  1. Line data
    1. Geographic position of collection
    2. Environmental data
    3. Population design
  2. Genotypic Data
    1. Physical position within the organism's genome
    2. Genetic position within the map
    3. Class variable for allelic state
      1. Basepair
    4. Quantitative value for allelic state
      1. Allows for polyploidy
      2. Allows for copy number variation
    5. Genotyping Assay
      1. Error rate
  3. Phenotypic Information
    1. Phenotype classification
      1. Nucleic acid based
        1. Gene ID from which it originated
      2. Metabolic based
        1. MetaCyc identity
      3. Physiological based
        1. ??
      4. Developmental based
        1. Plant Ontology ID
    2. Phenotype value
      1. Individual replications?
      2. Means?
    3. Covariates per line per measurement
      1. Environment or treatment
  4. Experimental Database
    1. Environment descriptors
    2. Treatment descriptor
    3. Tissue descriptor
    4. Time descriptor
  5. The predominant setup will be where there is one main genotypic data source per population and then potentially multiple independent sources for phenotypic information
    1. e.g. QTL mapping populations are generated once by one lab and then phenotyped independently by a large number of labs
    2. These additional labs may also have separately scored genotypes but it is not known if it is desirable to allow these into the main database
3. We need to finalize this outline and see if we can forward it to the Data Integration group
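To make the outline easier to comment on, here is one way it might map onto record types. This is a sketch for discussion only: every class and field name below is hypothetical, the Experimental Database descriptors are folded into a free-form `covariates` dictionary for brevity, and nothing here has been agreed with the Data Integration group.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Line:
    line_id: str
    latitude: Optional[float] = None           # geographic position of collection
    longitude: Optional[float] = None
    environment: Dict[str, str] = field(default_factory=dict)
    population_design: Optional[str] = None    # free text in this sketch


@dataclass
class Marker:
    marker_id: str
    physical_bp: Optional[int] = None          # physical position in the genome
    genetic_cm: Optional[float] = None         # genetic position within the map
    assay: Optional[str] = None                # genotyping assay
    assay_error_rate: Optional[float] = None


@dataclass
class GenotypeCall:
    line_id: str
    marker_id: str
    allele_class: Optional[str] = None         # class variable, e.g. a base pair
    allele_value: Optional[float] = None       # quantitative: polyploidy, copy number


@dataclass
class PhenotypeRecord:
    line_id: str
    trait_id: str        # e.g. gene ID, MetaCyc identity, Plant Ontology ID
    trait_class: str     # nucleic acid, metabolic, physiological, developmental
    value: float
    replicate: Optional[int] = None            # None when value is already a mean
    covariates: Dict[str, str] = field(default_factory=dict)
    source_lab: Optional[str] = None           # phenotypes may come from many labs


def line_means(records, trait_id):
    """Average the replicate values of one trait within each line."""
    grouped = {}
    for rec in records:
        if rec.trait_id == trait_id:
            grouped.setdefault(rec.line_id, []).append(rec.value)
    return {line: sum(vals) / len(vals) for line, vals in grouped.items()}
```

Storing individual replications and computing means on demand, as `line_means` does, is one possible answer to the "Individual replications? Means?" question above, since means can be derived from replicates but not the reverse.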
Algorithms
1. Which base algorithms should we start with when attempting to link genotypes to phenotypes using these different approaches?
  1. GLM (ANOVA)
    1. Maybe more suited to QTL populations, where population structure is less of a worry
    2. Which algorithms do F-test searching through the genotypic space? An iterative approach might be nice
      1. ??
      2. ??
  2. Mixed Model
    1. Maybe more suited to GWA
    2. Iteration
    3. Technology demonstration
    4. Algorithms
      1. EMMA with two SNPs or more on moderate datasets
      2. EMMA with one SNP on monster datasets
      3. ??
  3. Bayesian
    1. Would be nice to have it work for both structured (QTL) and non-structured (GWA) populations
    2. Algorithms
      1. ??
      2. ??
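As a starting point for the GLM (ANOVA) item, a single-marker F-test scan of many genotypes against one phenotype might look like the sketch below. It is a plain one-way ANOVA per marker with no correction for population structure and no iterative model building, so it is not a substitute for the mixed-model or Bayesian approaches listed above. The function names and the input layout (a dictionary from marker ID to one allele class per line) are assumptions made for illustration.

```python
from collections import defaultdict


def f_statistic(alleles, phenotype):
    """One-way ANOVA F statistic for one marker.

    Returns (F, df_between, df_within), or None when the marker has fewer
    than two allele classes or there are no residual degrees of freedom.
    """
    groups = defaultdict(list)
    for allele, y in zip(alleles, phenotype):
        groups[allele].append(y)
    n, k = len(phenotype), len(groups)
    if k < 2 or n <= k:
        return None
    grand_mean = sum(phenotype) / n
    ss_between = sum(
        len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values()
    )
    ss_within = sum(
        (y - sum(v) / len(v)) ** 2 for v in groups.values() for y in v
    )
    if ss_within == 0:
        return float("inf"), k - 1, n - k
    return (ss_between / (k - 1)) / (ss_within / (n - k)), k - 1, n - k


def scan(genotypes, phenotype):
    """Test every marker against one phenotype; strongest F statistic first."""
    results = []
    for marker_id, alleles in genotypes.items():
        res = f_statistic(alleles, phenotype)
        if res is not None:
            results.append((marker_id,) + res)
    results.sort(key=lambda r: r[1], reverse=True)
    return results
```

The same `scan` can be rerun unchanged against other phenotypes for the same population. Converting each F statistic to a p-value needs the F distribution, which the Python standard library does not provide; `scipy.stats.f.sf(F, df_between, df_within)` is one option if SciPy is available.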
Next Meeting
1. Dan K will send around a Doodle poll for the next meeting, in two weeks' time. We will try to use this Doodle poll to set up a standing time for the rest of the year.
Other Topics

WebEx:

Topic: iPG2P Statistical Inference Meeting
Date: Tuesday, October 20, 2009
Time: 10:00 am, Mountain Standard Time (GMT -07:00, Arizona)
Meeting Number: 759 336 365
Meeting Password: iPC123