
BIEN Database

...

  • Brad wrote the queries for which he understood the schema.
    • Left others blank for Aaron to write the queries.
  • Estimated time for Aaron to write the queries: a day
  • A: "Line intercept" is in "aggregate occurrence" table under "linecover" field.
    • For SALVIAS, these data have been mapped.
  • Brad: Each query is numbered.
    • Each query needs to be run on each source and VegBIEN.
    • Automate them.
    • Keep track of which (do and) don't pass validation.
    • Validation could fail due to problem with source data or with VegBIEN schema.
  • Should Aaron make the “easy” schema changes before validating?
  • BB: Yes.
  • MN: Let the group know if difficult schema changes come up.
  • BB: Validations help us spot necessary schema changes.
    • Aaron to fix queries that are Blank or have a comment in all CAPS (queries 18, 12).
    • Brad is not confident about the taxon queries since he doesn't yet fully understand how they are stored in VegBIEN. Aaron should check those queries carefully.
  • A: Formation of taxon name happens during mapping. It’s not stored in the schema.
  • Brad: We need to compare the original name from the original database to the verbatim name in VegBIEN.
    • There are standard ways to form a taxon name. He looks for that same fully constructed taxon name in VegBIEN.
    • Need to make sure the fully formed taxon name is stored in VegBIEN.
    • BB: Is it stored in taxon label?
    • A: Yes. The concatenated name.
  • A: Which queries involve taxon name? 12,13,14,15,16
  • BB: Take the SALVIAS query and run it on the SALVIAS database on nimoy. That should guide Aaron as to how to form the query against VegBIEN.
  • Brad doesn’t understand what Aaron is saying.
    • Aaron: taxonlabel.taxonicname may need to be used.
    • MS: Reiterating that Aaron should do what Brad suggests, i.e., run the SALVIAS queries and then construct a query against VegBIEN that should replicate the outcome.
  • Brad showing what the taxon query outputs should look like.
    • Aaron can just use the field that is sent to TNRS.
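The taxon-name comparison described above could be sketched as below, using SQLite as a stand-in for Postgres. The table and column names (source_taxa, target_taxa, concatenated_name) are hypothetical illustrations, not the actual VegBIEN schema; the real storage field (taxonlabel) would have to be confirmed.

```python
import sqlite3

# Sketch of the taxon-name check: concatenate the name parts in the source
# data and look for the same fully formed name in the target. All table and
# column names here are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source_taxa (genus TEXT, species TEXT);
    CREATE TABLE target_taxa (concatenated_name TEXT);
    INSERT INTO source_taxa VALUES ('Quercus', 'alba'), ('Acer', 'rubrum');
    INSERT INTO target_taxa VALUES ('Quercus alba');
""")

# Source names whose concatenated form is missing from the target.
missing = con.execute("""
    SELECT genus || ' ' || species
    FROM source_taxa
    WHERE genus || ' ' || species NOT IN
          (SELECT concatenated_name FROM target_taxa)
""").fetchall()
print(missing)  # → [('Acer rubrum',)]
```

An empty result would mean every fully formed source name was found in the target.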
  • Brad: The big picture for quantitative validations is, for any source, there are 5 steps:
    1) Write the query for the original database.
    2) Write the parallel query for VegBIEN (a single set of queries should work for all sources).
    3) Get the queries working.
    4) Turn them into a quantitative validation pipeline.
    5) Run validations and fix any problems.
  • They (the queries) will generate an error report so we'll know if there are errors that need to be fixed.
  • Initially, errors could mean:
    • Problem with validation query
    • Error in mapping
    • Error in import scripts
    • Problem with schema
    • Problem with source data
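The paired-query validation run Brad describes could eventually be automated along these lines. This is a rough sketch using SQLite as a stand-in; the check IDs, table names (plots, locationevent), and report format are placeholders, not the actual BIEN pipeline.

```python
import sqlite3

# Sketch of a paired-query validation run: each numbered check has one query
# for the source database and one for VegBIEN, and a mismatch in their
# results is recorded in an error report. All names here are illustrative.
def run_validations(source_db, vegbien_db, checks):
    report = []
    for check_id, (src_sql, veg_sql) in checks.items():
        src = source_db.execute(src_sql).fetchone()[0]
        veg = vegbien_db.execute(veg_sql).fetchone()[0]
        if src != veg:
            report.append((check_id, src, veg))
    return report  # an empty report means every check passed

# Toy databases standing in for one source and for VegBIEN.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE plots (id INTEGER)")
src.executemany("INSERT INTO plots VALUES (?)", [(1,), (2,), (3,)])
veg = sqlite3.connect(":memory:")
veg.execute("CREATE TABLE locationevent (id INTEGER)")
veg.executemany("INSERT INTO locationevent VALUES (?)", [(1,), (2,)])

checks = {1: ("SELECT count(*) FROM plots",
              "SELECT count(*) FROM locationevent")}
print(run_validations(src, veg, checks))  # → [(1, 3, 2)]
```

The returned tuples are the per-check error report, which could feed the per-provider reporting discussed below.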
  • A: If we don’t have a Postgres database, will use the staging table, which is post-mapping.
    • Would need to load the CSVs into a database.
  • BB: How Brad works:
    • Imports CSV file into database.
    • Maps to common staging table.
  • Aaron changes the CSV column names to VegCORE schema. Doesn't keep the original CSV column names in database.
  • Brad thinks that's OK since the same number of columns and rows exist.
  • A: The number of rows isn’t the same since some couldn’t be imported.
  • This validates just the normalization.
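Aaron's point that some rows couldn't be imported could be caught by a simple row-count check, sketched here with invented data; the CSV content and staging table are hypothetical.

```python
import csv, io, sqlite3

# Compare the number of data rows in a CSV against the number of rows that
# made it into the staging table; a shortfall means some rows failed import.
csv_text = "genus,species\nQuercus,alba\nAcer,rubrum\nPinus,strobus\n"
csv_rows = sum(1 for _ in csv.DictReader(io.StringIO(csv_text)))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging (genus TEXT, species TEXT)")
con.executemany("INSERT INTO staging VALUES (?, ?)",
                [("Quercus", "alba"), ("Acer", "rubrum")])  # one row lost
staged_rows = con.execute("SELECT count(*) FROM staging").fetchone()[0]

print(csv_rows - staged_rows)  # → 1 row failed to import
```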
  • MS: Are we losing a lot of data due to frame shift problem? (REMIB)
    • A: Yes, because it isn’t clear how to fix it.
    • MS: We need to provide that sort of problem report to the data provider.
    • BB: Quantitative validation should produce a list of errors. Suggests automatically creating a page where errors are reported per data provider; the report could be emailed to providers who opt in.
  • MN: Confused. Thought Brad had told her in other conversations that he wasn't comfortable validating from the staging tables (after mapping to VegCORE schema).
  • BB: We need to validate against the earliest possible product we have for data received as CSV files -- Before renaming the columns to VegCORE.
  • A: Means he needs to write a separate set of queries for each of the 22 data sources.
  • BB: For specimen data only, validate against original import with column names renamed to VegCORE schema (post mapping).
  • For plot data, need to validate against the original imported data, not VegCORE mapping.
  • A: Thinks should validate both specimen and plot data before the VegCORE mapping step.
    • Requires 22 different query sets and will take a while.
    • Points out that this will be circular since he will have mapped from the original column names to VegCORE schema and then back to original column names. This is all dependent on his mappings, so isn't really the same as validating using queries against the original database. This really only validates the normalization, not the mapping.
  • MN: He's right. It's circular. The only way to really validate pre-mapping is for someone to write queries against the original database (column names) who understands its schema. Whatever Aaron does will be based on his understanding of the schema which underlies all the mappings.
  • Brad is comfortable with validating against the VegCORE staging tables for specimens.
  • For plots, Brad will write the queries against the unmapped original databases, and Mike needs to do so for VegBank and CVS.
  • The SALVIAS query set is the template for generating the desired result set.
  • A: write the blank queries and fix the ones that have comments in caps.
    • Review the taxon related queries since Brad is less confident about them.
  • Make schema changes
  • Inform us of difficult schema changes
  • A day and a half
  • A: sees only 1 schema change that will need to be made (at this point)
  • Create the pipeline. 
    • He'll do it all on the in-house Postgres databases, so he doesn't have to match up between MySQL and Postgres. Will need to translate the MySQL queries to Postgres.
    • Fine with BB.
  • For plots, need to unmap the staging table names.
  • It is unlikely Aaron will be able to begin the quantitative validations in the coming week.

Decisions 

  • For specimens, validate against the VegCORE staging tables.
  • For plots, validate against the unmapped original database column names.
    • Brad will write the queries against the unmapped original databases, and Mike needs to do so for VegBank and CVS.

...