2012.06.01 BIEN db

BIEN Database June 1, 2012

Update

  • Column-based import is now documented on the wiki under Column-based Import with the SQL steps for sample columns.
    • The examples use the QMOR dataset, which is on vegbiendev in vegbien's QMOR.specimens table (access instructions on the wiki under PhpPgAdmin).
  • Aaron simplified the import process by translating "relational functions" into plain SQL functions. This sped up the QMOR import by 5x, so that it is now 122x faster than row-based import. New benchmarks are on the wiki under Column-based Import
  • Aaron made the SQL code more self-documenting, by taking advantage of PostgreSQL's support of special characters in names.
  • To import non-CSV inputs to staging tables, Aaron will need to add equivalent scripts for DB and XML inputs (SALVIAS and CTFS). Note that SALVIAS's CSV format is already supported, for individual SALVIAS downloads.

Agenda

  • Update on column-based import
  • Update milestones timeline

To Do

  1. Discuss problems with large VegX files
  2. Finish translating XML functions to SQL functions for column-based import
  3. Reimplement row-based logging mechanism for column-based import
  4. Reload DB using column-based import
  5. Load all plots data
  6. Generic data provider feedback mechanism
  7. Translate existing validation utilities to Python/Postgres

Notes

  • Milestones development timeline has been updated to reflect the month we spent on optimization and column-based import
  • Changes
    • Extended import utilities because that's ongoing while we're loading data
    • Extended Load data (and Reload data) to overlap with Validation utilities because some validation may be done in the mapping process
      • Some invalid data violates database constraints and must be set to NULL on import. If a mapping later fixes the data, the data must be re-loaded to store these fields.
    • Extended Revise schemas II because that could continue as long as we're trying to fit new data into the schema
    • Extended Validation utilities to the endpoint of Data provider feedback because they are related, and to allow additional time for geovalidation and possibly georeferencing
  • Brad will be gone until late July
    • he will be available via e-mail (nightly?), except for one week of "radio blackout"