Participants

Ali Akoglu

Dave Lowenthal

Martha Narro

Tapasya Patki

Matt Vaughn

 

Agenda

Discuss the update posted by Ali on 5/6/2011.

Notes

  • Population effects are now handled.
  • GPU version of PLM
  • Far more efficient than the GLM version was.
  • They got further than expected, so ran into some issues working on TACC.
    • Have an MPI version and an MPI version with GPU on TACC.
    • Performance of the GPU implementation is not as good as they would like.
  • Matt's philosophy is: get it working, then get it working right, then get it working fast.
    • They are in the "get it working fast" stage.
  • The multi-node, multi-GPU version has issues that would not come up in the other implementation.
    • For example, where will the data reside in this implementation?
    • TP: The accuracy is fine.
    • The issue is how to lay out the GPU and distribute the data.
  • MV: The 1 to 1 comparison is a decent speed up.
  • The performance problem is in the communication.
  • A: Which is more informative for you: time spent on alignment per SNP or complete execution time?
    • MV: Thinks execution time per SNP is what needs to be optimized.
    • D: Makes Sense.
  • 448 vs 362 sec is a ____ (CPU mpi is one node one core)
  • Numbers are much better off Longhorn.
  • It's probably that we are new on Longhorn. Expect within a month it will be better.
  • Getting only a 20% speed up in a gpu would not be satisfactory to us.
  • MV: Longhorn is large and has older, single precision GPUs.
    • Lonestar has a decent number of new Fermi nodes(?). Might that help?
    • D: We could try that if our latest runs do not look better.
    • Would need account.
    • MV: I can arrange that. No problem.
  • A: cpu gpu hybrid approach not giving great improvement.
  • Reduce the memory footprint and run the whole thing on the GPU, to reduce memory I/O overhead.
  • There is a limit in terms of number of total threads.
  • MV: Won’t change size of dataset, just number of threads?
    • D: correct
    • M: That’s a bit unintuitive. Thanks for clarifying. Wondered about it in the update.
    • D: That's why we wanted to talk.
  • A: Remember that the cpu version also improved (glm vs clm). If we hadn’t improved the cpu version (baseline for comparison), the numbers would look better.
  • MV: Is that the c++ from John Peterson?
  • TP: Yes, modified by Peter.
  • MV: Include the run with that original code. That would make a nice comparison.
  • Ali describing next steps to work on:
    • 1) Further optimization on current implementation
    • 2) Ideal comparison would be build from SVD eliminated, and the partitioned linear model (PLM).
      • MV: This problem launched a long linear algebra discussion here.
    • 3) Stepwise regression
  • MV: They are building Tassel into the DE.
    • Glm, clm, modular...
    • It would be nice to drop a hook to this code into their DE tool.
    • Once it is running on TACC's hardware, there is a lot more to Tassel than this algorithm.
  • MV: This project has been an interesting model for data scalability that cannot just be wrapped or handled by additional cores/nodes (expensive). It illustrates that the original algorithm's code needs to be optimized.
  • Dave and Ali agreed to post monthly updates to the wiki.
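
The timing comparison and the "execution time per SNP" metric discussed above can be sketched as a small calculation. This is only an illustration: the notes do not say which of the two quoted timings (448 s and 362 s) belongs to which implementation, so the sketch assumes 448 s is the CPU MPI baseline (one node, one core) and 362 s is the GPU run. The SNP count is hypothetical, since no dataset size is recorded here.

```python
def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Ratio of baseline run time to optimized run time (>1 means faster)."""
    return baseline_seconds / optimized_seconds


def percent_time_saved(baseline_seconds: float, optimized_seconds: float) -> float:
    """Reduction in wall-clock time, as a percentage of the baseline."""
    return 100.0 * (baseline_seconds - optimized_seconds) / baseline_seconds


def seconds_per_snp(total_seconds: float, num_snps: int) -> float:
    """Average execution time per SNP, the metric MV suggested optimizing."""
    return total_seconds / num_snps


if __name__ == "__main__":
    # Assumption: 448 s is the CPU MPI baseline, 362 s is the GPU run.
    cpu_seconds, gpu_seconds = 448.0, 362.0
    # Hypothetical SNP count, for illustration only.
    num_snps = 1000

    print(f"speedup: {speedup(cpu_seconds, gpu_seconds):.2f}x")
    print(f"time saved: {percent_time_saved(cpu_seconds, gpu_seconds):.1f}%")
    print(f"GPU time per SNP: {seconds_per_snp(gpu_seconds, num_snps):.3f} s")
```

Under that assumption the ratio works out to roughly 1.24x, which is in line with the "only a 20% speed up" remark in the notes.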

Decisions

  • Next work to be done:
    • 1) Further optimization on current implementation
    • 2) Ideal comparison would be build from SVD eliminated, and the partitioned linear model (PLM).
    • 3) Stepwise regression
  • Dave and Ali will post monthly updates to the wiki.