SI_20091201

iPG2P Statistical Inference

December 1, 2009, 11am PST

Attendees: Matt Vaughn, Karla Gendler, Chris Myers, Dan Kliebenstein, Liya Wang, Peter Bradbury, Steve Welch

Action Items:

  • Liya – Check on Maximum Likelihood Estimation at TACC
  • Liya – Create a project description to rough in
  • Everyone edit the project description that Liya sends out
  • Everyone look at list of QTL applications and their plusses/minuses. Edit, if at all possible.
  • Everyone think of topics to be discussed in San Diego

Notes/Agenda:

  1. Review action items
    1. Karla and Liya find the person at TACC who would help us identify the approaches available at TACC for GLM
      1. TACC is currently working on code for iPToL and will not be available until after the 1st of the year. However, a project description needs to be generated so that the guys working on the GPU and FPGA implementations can start.
    2. Karla and Liya distribute the information about approaches available to group and we attempt to make decision via email to get things going. Possible Doodle poll.
    3. Everyone contemplate Mixed models and which of the two approaches would be most useful first and then second.
    4. Everyone (me especially) read up on Machine Learning.
    5. Dan will email everyone in a week to see how things are going.
  2. Proof of Concept update
    1. Liya - Traditional Parallelization
      1. Liya will be working on this problem using CSHL's BlueHelix cluster, which has 2,000 compute cores running at 2 GHz. Short term, he will be working with Peter to modify the architecture of the GLM app so that it can process arbitrarily small chunks of the data set instead of whole chromosomes, and then getting that up and running under Sun Grid Engine on the CSHL system. He can then optimize the execution speed of single GLM instances, test for linearity of scaling, etc. The benefit of using CSHL's system is that he has real-time access to the admin and developers for advice.
    2. GPU/FPGA (Matt)
      1. Two groups have been identified to work on GLM implementation from a research perspective: John Hartman at the University of Arizona for GPU, and Convey for FPGA. Liya is currently writing a project description for these groups that will contain:
        1. Math, Statistics and Theory of QTL mapping using GLM
        2. References to publications describing implementation of the specific method your working group has chosen to support
        3. Reference software implementations of specific methods (Peter's Java code, the source to the GLM portion of Tassel, etc)
        4. Sample data, solvable in reasonable time on single commodity processor.
        5. Solved example (Inputs, Output, all necessary explanation).
          1. Benchmark of some repeat trials of solving this data set using the Java GLM implementation
      2. Liya will rough out this document and will pass it on to the group for comments (mainly Dan K for the biology and Peter for the statistics).
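The chunking idea from the traditional parallelization update (processing arbitrarily small chunks of the data set instead of whole chromosomes) can be sketched roughly as follows. This is only an illustration, not Peter's Java code or the actual GLM app: the function name `chunk_markers`, the marker names, and the chunk size are all hypothetical, and the sketch assumes that each marker can be tested independently of the others.

```python
from typing import Iterator, List, Sequence


def chunk_markers(markers: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most chunk_size markers.

    Hypothetical helper: each yielded chunk stands for one independent
    unit of work (for example, one job on the cluster).
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(markers), chunk_size):
        yield list(markers[start:start + chunk_size])


# Made-up marker names, for illustration only.
markers = [f"m{i}" for i in range(10)]
chunks = list(chunk_markers(markers, 4))

for job_id, chunk in enumerate(chunks):
    # A real run would hand each chunk to a separate GLM instance here.
    print(job_id, chunk)
```

Because the chunk size is a free parameter, the same code could be used to test how run time scales as the work is spread over more, smaller jobs.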
  3. Discussion on methods and software implementation of GLM and mixed model analysis
    1. Looking to do a general survey; need to know what other people have done and what is good and bad about it. Liya will drive this process.
    2. Mixed-model and association mapping are still very ad hoc: tools are still being developed, and most are simply algorithms written in R or another language without a dedicated front end.
    3. QTL mapping in structured populations
      1. QTL Cartographer
        1. Problems
          1. Adapted to biparental populations at best
          2. Not parallelized to handle large datasets very easily
          3. Multiple-interval mapping tool is nice but not automated
        2. Plusses
          1. Free
          2. Still updated occasionally, but this looks to be ending
      2. MultiQTL
        1. Problems
          1. Commercial (industry) package, so it is expensive
          2. Adapted to biparental populations at best
          3. Not parallelized to handle large datasets very easily
          4. Multiple-interval mapping tool is nice but not automated
        2. Plusses
          1. Continuously updated
      3. R/QTL
        1. Problems
          1. Adapted to biparental populations at best
          2. Not parallelized to handle large datasets very easily
          3. Multiple-interval mapping tool is nice but not automated
          4. Does not look like it will be continuously improved
          5. Doesn’t have a QTL calling attachment
        2. Plusses
          1. Free
      4. R/bQTL
        1. Problems
          1. Adapted to biparental populations at best
          2. Not parallelized to handle large datasets very easily
          3. Multiple-interval mapping tool is nice but not automated
          4. Does not look like it will be continuously improved
        2. Plusses
          1. Free
          2. Bayesian approach to QTL mapping
      5. R/BIMQTL
        1. Problems
          1. Adapted to biparental populations at best
          2. Not parallelized to handle large datasets very easily
          3. Does not look like it will be continuously improved
        2. Plusses
          1. Free
          2. Bayesian approach with mixed model aspects
      6. Others
        1. http://www.stat.wisc.edu/~yandell/statgen/reference/software.html
      7. Visualization
        1. Here is a crude eQTL visualization tool, but it is so clunky that we never use it.
          1. http://statgen.ncsu.edu/eQTLViewer/svgHome.html
    4. How do you summarize results? Results always have to be taken out of the package to be analyzed.
    5. What about maximum likelihood (ML) estimation? Does TACC have expertise with ML estimation? How closely does it track with RAxML?
  4. Discussion on functional data types and structures for performing QTL and GWA analyses
    1. Genetic Map
      1. Three columns
        1. Marker
        2. Chromosome
        3. Map position
    2. Genotype Information
      1. Rows are plant lines within the population
      2. Column 1 may be a population identifier
        1. Cells will be an indicator of which population the line came from.
      3. Columns 2 through n will be the different markers
        1. Cells will be the allelic value of the given marker per line
        2. These values need to relate to allele values present in the parents from which the line is obtained
    3. Phenotype Information
      1. Rows are plant lines within the population
      2. Columns are different phenotypes
        1. Cells are the value of that phenotype in that line
        2. Values need to be in a numerical format
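The three tables described above (genetic map, genotype information, phenotype information) could be represented along the following lines. This is a minimal sketch under stated assumptions, not an agreed format: the class name `MapEntry`, the dictionary layout, the `validate` helper, and all sample values are hypothetical, and the checks only cover what the notes state (markers in the genotype table match the map, phenotype rows correspond to lines, phenotype values are numerical).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MapEntry:
    """One row of the genetic map: marker, chromosome, map position."""
    marker: str
    chromosome: str
    position: float  # units are not specified in the notes


def validate(genetic_map: List[MapEntry],
             genotypes: Dict[str, dict],
             phenotypes: Dict[str, Dict[str, object]]) -> List[str]:
    """Return a list of problems found; an empty list means the tables agree."""
    problems = []
    map_markers = {entry.marker for entry in genetic_map}
    for line, row in genotypes.items():
        # Every line should carry an allelic value for exactly the mapped markers.
        if set(row["alleles"]) != map_markers:
            problems.append(f"{line}: markers do not match the genetic map")
    for line, row in phenotypes.items():
        if line not in genotypes:
            problems.append(f"{line}: phenotype row has no genotype row")
        for name, value in row.items():
            # Phenotype values need to be numerical (bool is excluded on purpose).
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{line}: phenotype {name!r} is not numeric")
    return problems


# Made-up example data, for illustration only.
genetic_map = [
    MapEntry("m1", "1", 0.0),
    MapEntry("m2", "1", 12.5),
    MapEntry("m3", "2", 3.0),
]
genotypes = {
    "line_001": {"population": "popA", "alleles": {"m1": "A", "m2": "B", "m3": "A"}},
    "line_002": {"population": "popA", "alleles": {"m1": "B", "m2": "B", "m3": "A"}},
}
phenotypes = {
    "line_001": {"height": 10.2, "flowering_time": 31},
    "line_002": {"height": 11.7, "flowering_time": 29},
}

print(validate(genetic_map, genotypes, phenotypes))
```

The sketch does not check that allelic values relate to the parental alleles, since the notes do not say how parental genotypes would be supplied.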
  5. Discussion on minimal metadata that needs to be associated with the output of a QTL or GWA mapping analysis
    1. Link to experiment from which the input was obtained
    2. Link to algorithm and version from which output was obtained
      1. Any manual input parameters also linked to output
    3. Link to version of genetic map
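As an illustration of the minimal metadata listed above, one possible record attached to each analysis output is sketched here. The field names, the `ResultMetadata` class, and every value are assumptions made for the example; the notes only specify which links are needed, not how they would be stored.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class ResultMetadata:
    """Hypothetical metadata record for one QTL or GWA mapping output."""
    experiment_link: str        # link to the experiment that supplied the input
    algorithm: str              # algorithm that produced the output
    algorithm_version: str      # version of that algorithm
    genetic_map_version: str    # version of the genetic map used
    parameters: Dict[str, str] = field(default_factory=dict)  # manual input parameters


# Placeholder values, for illustration only.
meta = ResultMetadata(
    experiment_link="experiment/EXAMPLE-ID",
    algorithm="GLM",
    algorithm_version="0.0-example",
    genetic_map_version="map-example-v1",
    parameters={"chunk_size": "100"},
)

# Serialize so the record can be stored alongside the output file.
record = json.dumps(asdict(meta), sort_keys=True)
print(record)
```

Keeping the record as plain key/value data means it can travel with the output regardless of which package produced the results.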
  6. Model population/phenotype update (Dan and E. coli)
    1. Will talk about this on another date
  7. Topics for January meeting
    1. Storage of genotypic data or reimputation on site.
    2. Storage of output given its size.
    3. Metadata of input and output.
  8. Identify action items
    1. Liya – Check on Maximum Likelihood Estimation at TACC
    2. Liya – Create a project description to rough in
    3. Everyone edit the project description that Liya sends out
    4. Everyone look at list of QTL applications and their plusses/minuses. Edit, if at all possible.
    5. Everyone think of topics to be discussed in San Diego
  9. Next Meeting: Tuesday, December 15th, at 11am PST