2014.03.06 BIEN db

BIEN Database

March 6, 2014

Participants

Aaron Marcuse-Kubitza, Brad Boyle, Martha Narro

Agenda

  • Review progress on quantitative validations (Aaron, 50 min.)
  • Outline next steps (5 min.)

Previous Week's To Do list

For Aaron
Schema changes

  1. rename location.top_plot -> plot_location_id (DONE)
    1. in the database, the triggers, and the FAQ
  2. send e-mail that top_plot renamed
  3. rename top_plot view -> plot

SALVIAS aggregating validations

  1. test each query outside of the pipeline, in order
  2. send e-mail that queries written
  3. implement each query in the pipeline

Notes

Progress on quantitative validations

  • Aaron: Top Plot table, query throws errors. Brad didn't understand the structure.
    • Plots are not stored in hierarchical way in VegBIEN. Stored recursively.
    • Can’t tell if count problem is due to the data or the query having a problem.
  • BB: What are the causes for failures in the queries? What's an example?
  • Aaron: A plots issue such as things being attached to projects.
  • BB: That sounds like a problem with the schema rather than with the query.
  • Aaron: Can’t tell.
  • BB: Sounds like the query is doing it’s job (identifying problems in the schema/scripts)
  • A: only affected 3 locations.
  • BB: Figuring out these sorts of problems is a legitimate reason for things taking longer than anticipated.
  • You're doing a combination of two things:
    • 1) Working to make the queries run correctly.
    • 2) Discovering problems in the schema or import scripts.
  • It sounds like once you get through SALVIAS, things should procede more quickly.
  • A: Disagreeing. All queries depend on projects, and not all sources have projects.
  • BB: That sounds like a problem with our schema.
    • You need to insert a dummy project so participation can be mandatory.
    • How difficult would it be to inforce that rule?
  • A: Would need to be enforced for specimens too.
  • BB: For specimens, couldn’t the project be same as the herbarium (i.e., groups of specimens)?
    • When the queries have to be written so they say “for some data sets do this and for others do that” it’s an indication of a problem.
    • This is good. The queries are doing their job identifying shortcomings in the schema.
    • Aaron needs to promptly contact Brad when an issue crops up.
    • Bottom line is how can Brad help Aaron do this more quickly without taking careless short cuts.
    • Need to make a push to make sure the queries themselves make sense and have no problems themselves.
    • Then will have greater confidence that the errors are due to data problems.
  • Brad: apologizes if he misunderstood the schema and therefore didn't write the queries correctly.
  • Aaron: Suggesting that since time is a problem, fewer than 18 queries should be run to validate.
    • There is no single VegCore schema for plots. Only for specimens.
    • The input staging tables for plots vary depending on the data source.
    • VegCore is a naming conventions for columns
    • Could try left joining all plots staging tables.
  • BB: In other words we never moved to “new style import” for all sources?
  • We should have had a single common schema (staging table).
  • Martha: Strongly agreeing that her understanding was new style import would have a common schema for plots.
  • BB: Asking if Aaron is still going form original structure of source data for plots to the VegBIEN schema.
    • He shouldn't be. That's why Brad kept arguing for using a flatten first (denormalize), normalize later approach.
    • Which plots are in new style import?
  • Aaron: Plots in de-normalized (new style) import are: CVS, FIA, VegBank, All specimens
    • Not: CTFS, Madidi, Salvias, team, (NVS doesn’t count)
    • “Left join” is denormalized VegBIEN(?)
  • Brad: Pointing out Aaron has done it (switched to new style import) for the most complex plot sources.
    • There are 11 FIA queries that Brad wrote. Could add the others and they would return no records.
    • Brad will re-number the FIA queries. 
  • The output queries have been changing too.
  • SALVIAS: As long as the output queries are the same, could leave them.
  • (Aaron's to do list and priorities were revised from these after the call based on Martha and Brad's discussion. The To Do list below the notes is the correct one.)
  • Aaron’s top priority is to fix the output queries work (18 plots + 16 specimens, traits are done) 
    • Estimated time is on the order of “weeks” 
    • Brad expressing disbelief. At the end of each day, send the output queries to Brad. He can then do the iput queries. 
  • Brad will validate SALVIAS against those output queries.
  • Aaron: Denormalize CTFS, Madidi, team (not SALVIAS).
  • Write input queries against input
  • Write output queries for specimens.
  • Write input queries for specimens against the VegCore schema.
  • Brad will validate FIA as well as SALVIAS.
  • There is one VegCore vocabulary (it needs to be locked down as a denormalized schema).
  • Brad considers left joins to be a sign of trouble with the schema.
  • Business rules and triggers should be built on the front end so ensure have a well-normalized database with good referential integrity.

To Do (Aaron)

  1. Finish all plot output queries. Make sure they work and are stable. This is a single set of 18 queries modeled after the SALVIAS queries Brad provided.
  2. Write specimens output queries. Make sure they work and are stable. This is a single set of 16 queries modeled after the NYBG queries Brad provided.
  3. Write plots input queries against the data in the denormalized VegCore schema. Make sure they work and are stable. This is a single set of 18 queries that match the plot output queries.
  4. Write specimens input queries against the specimen data in the vegCore schema. Make sure they work and are stable. This is a single set of 16 queries that match the specimen output queries. 
  5. Run plot validations for two plot data sets that have already been denormalized, excluding FIA (VegBank, CVS).
  6. Run all specimen validations.
  7. Denormalize the remaining plot datasources TEAM, Madidi, and CTFS so they will work with the denormalized plot input queries.
  8. Complete plot validations on TEAM, Madidi, and CTFS.
  9. As you work on completing the queries, please send Brad each day the most recent version of the queries you are working on (plot output, specimen output, plot input or specimen input) so he can gauge progress and help with any problems you might be encountering.

For Brad

  • Brad will re-number the FIA queries so they correspond with the other 18 queries.
  • Brad will validate FIA as well as SALVIAS.