2014.04.17 BIEN db

BIEN Database

April 17, 2014

Participants

Aaron Marcuse-Kubitza, Martha Narro, Brad Boyle, Brian Enquist

Agenda

  • Switch meetings to Wednesdays? (Martha, 5 min.) 
  • Review progress and address questions on quantitative validations (Aaron, 40 min.)
    • Troubleshooting specimen queries
    • Plot input queries
  • FIA (Brad, 10 min.))
  • Outline next steps (5 min.)

Previous Week's To Do for Aaron

VegBIEN schema

  • add taxonverbatim.subspecies
  • add TNRS and geoscrub tables to ERD

specimens aggregating validations

  1. fix input queries #4,5: remove subspecies IS NOT NULL filter
  2. fix input queries #6,7: add subspecies IS NOT NULL filter
  3. fix output queries #6,7: use subspecies instead of the concatenated taxonomic name

Email Updates (Aaron)

======== 1st Message ========

From: Aaron Marcuse-Kubitza <aaronmk@nceas.ucsb.edu>

Subject: Re: [Bien-db] April 9 BIEN db call

Date: April 16, 2014 7:51:08 PM MST

> the remaining two problem queries (13, 15)

I have finished fixing and troubleshooting queries #13,15. After revising the input/output sides to match, this revealed the following import issues:

#13:

(#13a,b refer to "breakout queries" that expand the aggregated counts in #13 into diff-able lists: _specimens_13a_list_of_all_verbatim_lat_long, _specimens_13b_list_of_all_decimal_lat_long)

  1. only coordinates with both a latitude and a longitude are stored, due to NOT NULL constraints on latitude_deg and longitude_deg
    1. affects 65/941,107 rows = 0.007% (#13b)
    2. creates diff in [#13+a,b|http://vegpath.org/VegBIEN/NY] (#13a: rows 1-53)
  2. verbatim coordinates are not stored in the database when the column is a valid double precision
    1. this discards trailing zeros that would be useful in determining the precision of the value, in the absence of a coordinatePrecision
    2. creates diff in [#13+a,b|http://vegpath.org/VegBIEN/NY] (#13a: rows 54+)

#15:

  1. habitat, substrate, vegetation should be stored in locationRemarks, not locality
    • creates diff in #15
    • requires schema change: add location.location_remarks
    • requires mappings change: mappings/VegCore-VegBIEN.csv: fields incorrectly mapped to location.locationnarrative (locality): remap to new location.location_remarks instead
  2. sometimes, multiple input fields are concatenated together in the XPath mappings to form a single output column
    • these fields should instead be concatenated together at stage II (source-general derived columns), so that their concatenation can be compared directly with what's stored in the database
    • creates diff in #15, which concatenates locality and habitat into locationnarrative

====== 2nd Message ===========

From: Aaron Marcuse-Kubitza <aaronmk@nceas.ucsb.edu>

Subject: Re: [Bien-db] April 17 BIEN db call

Date: April 17, 2014 3:02:40 AM MST
> an update

  • I fixed queries #4,5,6,7,13,15 (see my earlier e-mail about queries #13,15). bugs uncovered by the fixes are listed on the wiki.
  • I did a test run of the import with 2 datasources to troubleshoot the disk space leak. because the test run crashed as well, the problem is most likely a bug in Postgres or Linux itself. this unfortunately means that we can't run the import until we find the Postgres/Linux bug that is causing the problem.
  • I added the TNRS and geoscrub tables to the VegBIEN ERD

Notes

Specimen queries

  • Bugs in the system: in Postgres and Linux
  • There is a disk space leak, so can’t load an entire data source. Server crashed partway through the import.
  • Is Nick Brandt aware of the problem? As sys admin he could possibly fix it.
  • Top priority: Aaron to contact Nick this morning. Aaron should work with Nick to resolve the problem since it is a system issue. Copy Mark. Let us know as soon as that is resolved.
  • Reloading to normalized database is impacted, so work on validations is impacted.

The queries have been fixed.

  • Do we want to work through the bugs? 
  • Insufficient time. Let’s just look at #15.

Query 15

  • The bug requires a change to the schema since a standard Darwin Core field (locationRemarks ) isn’t in VegBIEN.
  • This field contains notes about the location (locationRemarks).
  • Concatenating things that should have been stored in separate fields results in the difference. This is an example of VegBIEN not being sufficiently atomic, but adding locationRemarks to the schema will fix the problem.
  • Summarizing:
    • Schema change required to add locationRemarks.
    • Then reload the data to get it in that field.
    • Fix the queries.
    • Run the diff.
  • This is good. The validations are doing what they are intended to do now that the copy/paste mistakes in the queries have been fixed (last week).
  • Hopefully any issues remaining are import and schema issues. (as this one was)
  • Decision to move on to making sure the plot queries are correct instead of working through the bugs and issues found from running the specimen validation on the test source (NY).

Plot Input Queries

  • Have the generic input queries against VegCore for plots been written yet?
  • No
  • Those are the input queries Martha mentioned in her email.

SALVIAS queries

  • We are using SALVIAS as the test case for plot queries.
    • Output and input queries are written for SALVIAS.
  • (Going through the issues running SALVIAS.)
  • Brad took the input and output queries he wrote in mysql against nimoy(?), ran them and compared them to the postgres query results.
    • He couldn’t get any of the queries to pass.
  • Query #1 (count of projects)
    • 14 projects on Nimoy
    • 12 projects on VegBIEN
    • Aaron finds both input and output indicate 12 projects (against VegBIEN).
  • Query # 2 (list of project names)
    • On Nimoy, two projects ACA Amazon forest inventory, OTS transects are listed that don’t appear in the input list running against SALVIAS staging table.
    • Both are public, so that isn’t the reason they were filtered out.
    • The IPR_specific field is blank. (so there are no specific use conditions) 
    • Decision: Don’t filter on the IPRC_specific field.
    • Just filter out the projects Brad listed at the top of the MySQL queries.
  • This points out the value of querying the original database, which we aren’t doing for most sources.
  • Aaron should fix problems in the order they are numbered. They are designed to build on each other. The upstream (lower number) validations often must be correct in order for the downstream validations to be correct.
  • Aaron needs to fix the input filter to use that list of plots and reimport the data before the SALVIAS queries can be expected to generate correct output.
  •  
  • After the disk leak is fixed, Aaron can reload just the SALVIAS data.
  • Also reload NYBG so Brad can check the diff results on both.
  • Has a TNRS cache, so doesn’t actually gave to re-run the names through TNRS.

HIgh level status

  • BE: What is 33K ft view of where we stand?
  • First finish the quantitative validations, which is likely to still take another couple of months (Martha's opinion).
  • Then get BIEN members in to look at the data.
  • The validations just test the data were imported correctly. (not geo or cultivated species validations).
  • The analytical tables will do that.

FIA

  • Filtering of FIA requires fields that are in the original data, but that are not imported into VegBIEN.
  • To avoid a bunch of schema changes to accommodate FIA, Brad suggests:
    • If Aaron can make form plot codes as Brad described,
    • Aaron, can continue to import FIA into VegBIEN and validate FIA. (as he is currently doing)
    • Brad Can build a series of scripts that query the raw FIA database to build the indexes in the analytical tables after the data are loaded into VegBIEN.
    • This bypasses Aaron’s pipeline for the filtering. 
  • A: What about FIA cultivated filter?
  • BB: Toss that old script since it is out of date.

Decisions

  • Move on to making sure the plot queries are correct instead of working through the bugs and issues found from running the specimen validation on the test source (NY).
  • Regarding plots query 2: Don’t filter plots on the IPRC_specific field. Just filter out the projects Brad listed at the top of the MySQL queries.

To Do

  • Top priority: Aaron to contact Nick this morning. Aaron should work with Nick to resolve the problem since it is a system issue. Copy Mark. Let us know as soon as that is resolved.
  • Specimens Query 15: 
  •  
    • Schema change required to add locationRemarks.
    • Then reload the data to get it in that field.
    • Fix the queries.
    • Run the diff.
  • Aaron needs to fix the input filter to use that list of plots and reimport the data before the SALVIAS queries can be expected to generate correct output.
  • FIA: Aaron will follow Brads suggestion to bypass Aaron's pipeline for filtering when loading FIA data into VegBIEN. 
  • FIA: Brad will write scripts (that run after FIA data are loaded into VegBIEN) to filter and build indexes on the analytical tables.