TR120604

Agenda

1KP Data
  • Green light given to make data public under the Ft. Lauderdale agreement.
  • Concern about how the BLAST database and annotation search engine will cope with larger amounts of data.
    • Previously, searches were conducted one taxon at a time. Now users may wish to search multiple taxa at a time.
      • Is the database currently in place the one we wish to use if moving front end to DE?
      • If not (or in both cases), will need to decide upon BLAST parameters for doing search.
      • Jim suggests limiting BLAST searches to an E-value cutoff of 1e-5 and a maximum of 10 genes per query. Otherwise, the result files are too large.
  • Currently no other projects are using the DE to run BLAST searches or to look through annotation results. The interface should be easy to construct, but the back end is an issue.
    • Issue with performance: a single BLAST query took too long with only 100 species. Now the scale is 1000 species x 80000 transcripts.
    • Matt suggests solutions:
      • MPI BLAST (would give linear scaling up to 2048 cores)
      • TACC could accommodate (will have 600,000 cores in January)
      • Turnaround time expectation: users currently search one taxon at a time and would like results within a few seconds. Annotation searches should be equally fast.
    • Jamie asks whether the amount of RAM available to the machine is the limiting factor (unknown).
      • Matt states that TACC has nodes provisioned with different amounts of RAM, but highly recommends going to MPI BLAST (folks at TACC can assist with this).
      • Jamie states that, in the case of someone running a single query, an increase in memory might solve the problem.
    • TACC needs query database and sample subjects to determine how to optimize.
    • All the data is there now (900+ transcriptomes). It's a matter of taking all of the new assemblies (SOAPdenovo-Trans assemblies), concatenating them, and making one big database.
      • Other option: allowing users to choose from a nested set of taxa.
        • Question of granularity: would like to go down to species level, but this is not a major requirement. Jim does have examples of how this could be done.
  • Regarding parsing scripts:
    • Matt doesn't have direct experience with these scripts for optimization.
    • Raj developed Java scripts that take the full BLAST output and parse the annotation and alignment information. The size of the files being parsed may have been causing the performance issues.
      • Limiting searches to an E-value cutoff of 1e-5 may help performance.
      • Scripts are located on Corral under tacc_biocomp-7.6. Matt to investigate.
  • Regarding Corral:
    • All of the earlier BLAST results (run with no limits) can be thrown out now. If disk space is an issue, they can be removed; a new series of BLASTs can then be run, and the results files zipped up and put on the same server as the assemblies.
  • Matt: does it make sense to make them all available in the community data space in the DE?
    • Want the info available via search, but don't want to encourage download before the 1KP group gets to make use of it.
    • Moving forward, data can be mirrored and visible via community data space. Subset of this data will be published by (hopefully) end of summer (83 of 1000).
    • It would be a good idea to make the interface available for BLAST and annotation search in the DE.
  • Matt and Naim to work on optimizations. Will continue to collaborate with Jim.
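
Sketch for the optimization work: the limits Jim suggested (e-5 cutoff, at most 10 per query) could be applied to BLAST results roughly as below. This is only a minimal illustration, not Raj's scripts. It assumes the results are in BLAST+ tabular format (-outfmt 6, default 12 columns), reads "10 genes per query" as the 10 best hits per query, and uses made-up sequence IDs; the function name filter_hits is hypothetical.

```python
import csv
from collections import defaultdict

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(lines, max_evalue=1e-5, max_hits=10):
    """Keep hits with evalue <= max_evalue, at most max_hits per query.

    Assumption: "best" means lowest E-value, ties broken by higher bit score.
    Returns a dict mapping each query ID to its list of retained hits.
    """
    per_query = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            per_query[hit["qseqid"]].append(hit)
    return {
        query: sorted(hits, key=lambda h: (h["evalue"], -h["bitscore"]))[:max_hits]
        for query, hits in per_query.items()
    }


# Made-up example rows; the taxon-prefixed subject IDs are hypothetical.
SAMPLE = "\n".join([
    "q1\tTAXA_scaffold1\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-50\t380",
    "q1\tTAXB_scaffold9\t80.0\t150\t30\t0\t1\t150\t1\t150\t1e-3\t45.0",
    "q2\tTAXA_scaffold2\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t150",
])

if __name__ == "__main__":
    for query, hits in filter_hits(SAMPLE.splitlines()).items():
        for hit in hits:
            print(query, hit["sseqid"], hit["evalue"], hit["bitscore"])
```

Filtering line by line keeps memory use proportional to the retained hits rather than to the full output file, which may matter if file size was the cause of the parsing slowdown.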
Other items
  • API development for TR database (and front end progress)
    • Naim not available for update. Jamie feels the API will be a couple of days' work.
    • Work needs to be done so Ray can make updates to the front end. Still questions as to whether the front end will scale.
    • Jamie can take over developing the API. Will work with Naim to come up with a plan to get it done within the next week, or two at the most.
    • Ray is going to Penn State at the end of the month, but is still willing to assist (his co-advisor is also on the project).
  • Review of Ontology paper outline: Jamie getting back to this.