2014.01.23 BIEN db

BIEN Database

January 23, 2014

Participants

Aaron, Brad, Mark, Bob, Ramona, Martha

Agenda

  • Review progress on tasks
  • Outline next steps

Previous Week's To Do List

Plot (and Project) Data Providers

Aaron: For SALVIAS, complete the work as described in Item 2. "Plot data providers" from Brad's "High priority tasks" email message of Dec. 17/18.

BIEN2 Traits">BIEN2 Traits

Brad: Validate the BIEN2 trait data, taking this task off Aaron's plate. Use the VegBIEN normalized trait table and the original input data from BIEN2.

  • Spot check the data.
  • Write quantitative validation queries.
  • Send queries to Aaron so he can put them into the validation pipeline.

Aaron: After Brad sends you his BIEN2 Trait quantitative validation queries, put them into the validation pipeline.
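
A hedged illustration of what one of these quantitative trait validation queries might look like, assuming Postgres; the bien2_traits table and its columns are placeholders, and the VegBIEN-side table and column names should be checked against the actual schema:

    -- Hypothetical sketch: compare per-trait measurement counts between the
    -- original BIEN2 trait data and the VegBIEN normalized trait table.
    -- Table and column names are placeholders, not the real schemas.

    -- Run against the original BIEN2 database:
    SELECT trait_name, count(*) AS n_measurements
    FROM   bien2_traits
    GROUP  BY trait_name
    ORDER  BY trait_name;

    -- Run the parallel query against VegBIEN and diff the two result sets;
    -- any per-trait count mismatch flags that trait for inspection.
    SELECT name AS trait_name, count(*) AS n_measurements
    FROM   trait
    GROUP  BY name
    ORDER  BY trait_name;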

Quantitative Validations">Quantitative Validations

Brad: Send Aaron the queries on the original SALVIAS database, which he forgot to attach previously.

Aaron: Work on Items 3.1 and 3.2 described in "Plot data providers" in Brad's "High priority tasks" email message of Dec. 17/18.

  • After the queries (12,13,15) are fixed, send them back to Brad so he understands where his mistakes were.

Notes

Plot Data Providers 

  • For SALVIAS, Item 2. "Plot data providers" from Brad's "High priority tasks" email message of Dec. 17/18
  • Completed!

Quantitative Validations

  • Brad wrote the queries for which he understood the schema.
    • Left the others blank for Aaron to write.
  • Estimated time for Aaron to write the queries: a day
  • A: "Line intercept" is in "aggregate occurrence" table under "linecover" field.
    • For SALVIAS, these data have been mapped.
  • Brad: Each query is numbered.
    • Each query needs to be run on each source and VegBIEN.
    • Automate them (see the pass/fail bookkeeping sketch at the end of these notes).
    • Keep track of which do and don't pass validation.
    • Validation could fail due to problem with source data or with VegBIEN schema.
  • Should Aaron make the “easy” schema changes before validating?
  • BB: Yes.
  • MN: Let the group know if difficult schema changes come up.
  • BB: Validations help us spot necessary schema changes.
    • Aaron to fix queries that are blank or have a comment in all CAPS (queries 18, 12).
    • Brad is not confident about the taxon queries since he doesn't yet fully understand how taxon names are stored in VegBIEN. Aaron should check those queries carefully.
  • A: Formation of taxon name happens during mapping. It’s not stored in the schema.
  • Brad: We need to compare the original name from the original database to the verbatim name in VegBIEN.
    • There are standard ways to form a taxon name. He looks for that same fully constructed taxon name in VegBIEN.
    • Need to make sure the fully formed taxon name is stored in VegBIEN.
    • BB: Is it stored in taxon label?
    • A: Yes, the concatenated name (see the taxon-name sketch at the end of these notes).
  • A: Which queries involve taxon name? 12,13,14,15,16
  • BB: Take the SALVIAS query and run it on the SALVIAS database on nimoy. That should guide Aaron as to how to form the query against VegBIEN.
  • Brad doesn’t understand what Aaron is saying.
    • Aaron: taxonlabel.taxonomicname may need to be used.
    • MS: Reiterating that Aaron should do what Brad suggests, i.e., run the SALVIAS queries and then construct a query against VegBIEN that should replicate the outcome.
  • Brad showing what the taxon query outputs should look like.
    • Aaron can just use the field that is sent to TNRS.
  • Brad: The big picture for quantitative validations is, for any source, there are 5 steps:
    • 1) Write the query for the original database.
    • 2) Write the parallel query for VegBIEN (a single set of VegBIEN queries should work for all sources).
    • 3) Get both queries working and agreeing.
    • 4) Turn them into a quantitative validation pipeline.
    • 5) Run the validations and fix any problems.
  • The queries will generate an error report, so we'll know if there are errors that need to be fixed.
  • Initially, errors could mean:
    • Problem with validation query
    • Error in mapping
    • Error in import scripts
    • Problem with schema
    • Problem with source data
  • A: If we don’t have a Postgres database for a source, he will use the staging table, which is post-mapping.
    • Would need a database installation of the CSVs (see the CSV-loading sketch at the end of these notes).
  • BB: How Brad works:
    • Imports CSV file into database.
    • Maps to common staging table.
  • Aaron changes the CSV column names to the VegCORE schema. He doesn't keep the original CSV column names in the database.
  • Brad thinks that's OK since the same number of columns and rows exist.
  • A: The number of rows isn’t the same since some couldn’t be imported.
  • This validates just the normalization.
  • MS: Are we losing a lot of data due to the frame-shift problem (REMIB)?
    • A: Yes, because it isn’t clear how to fix it.
    • MS: We need to provide that sort of problem report to the data provider.
    • BB: Quantitative validation should produce a list of errors. Suggesting automatically creating a page where errors are reported per data provider; the report could be emailed to providers who opt in.
  • MN: Confused. Thought Brad had told her in other conversations that he wasn't comfortable validating from the staging tables (after mapping to the VegCORE schema).
  • BB: We need to validate against the earliest possible product we have for data received as CSV files, i.e., before renaming the columns to the VegCORE schema.
  • A: Means he needs to write a separate set of queries for each of the 22 data sources.
  • BB: For specimen data only, validate against original import with column names renamed to VegCORE schema (post mapping).
  • For plot data, need to validate against the original imported data, not VegCORE mapping.
  • A: Thinks we should validate both specimen and plot data before the VegCORE mapping step.
    • Requires 22 different query sets and will take a while.
    • Points out that this will be circular, since he will have mapped from the original column names to the VegCORE schema and then back to the original column names. It all depends on his mappings, so it isn't really the same as validating with queries against the original database; it only validates the normalization, not the mapping.
  • MN: He's right. It's circular. The only way to really validate pre-mapping is for someone to write queries against the original database (column names) who understands its schema. Whatever Aaron does will be based on his understanding of the schema which underlies all the mappings.
  • Brad is comfortable with validating against the VegCORE staging tables for specimens.
  • For plots, Brad will write the queries against the unmapped original databases, and Mike needs to do so for VegBank and CVS.
  • The SALVIAS query set is the template for generating the desired result set.
  • A: Write the blank queries and fix the ones that have comments in CAPS.
    • Review the taxon related queries since Brad is less confident about them.
  • Make the schema changes.
  • Inform the group of difficult schema changes.
  • Estimated time: a day and a half.
  • A: sees only 1 schema change that will need to be made (at this point)
  • Create the pipeline. 
    • He'll run everything on the in-house Postgres databases, so he doesn't have to match up results between MySQL and Postgres. He will need to translate the MySQL queries to Postgres.
    • Fine with BB.
  • For plots, need to unmap the staging table names.
  • It is unlikely Aaron will be able to begin the quantitative validations in the coming week.
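
The following is a minimal sketch of the per-query, per-source pass/fail bookkeeping discussed above, assuming Postgres. The validation_results table, the salvias_plots and plot table names, and the query numbering are all illustrative, not the actual pipeline schema:

    -- Minimal sketch of per-query, per-source pass/fail tracking.
    CREATE TABLE IF NOT EXISTS validation_results (
        datasource   text        NOT NULL,  -- e.g. 'SALVIAS'
        query_id     integer     NOT NULL,  -- Brad's query number
        source_rows  bigint,                -- result of the source-side query
        vegbien_rows bigint,                -- result of the VegBIEN-side query
        passed       boolean,
        run_at       timestamptz NOT NULL DEFAULT now(),
        PRIMARY KEY (datasource, query_id, run_at)
    );

    -- Example: a query that simply compares total plot counts.
    WITH src AS (
        SELECT count(*) AS n FROM salvias_plots  -- hypothetical source-side table
    ), veg AS (
        SELECT count(*) AS n FROM plot           -- hypothetical VegBIEN-side count;
                                                 -- would be filtered to this source
    )
    INSERT INTO validation_results (datasource, query_id, source_rows, vegbien_rows, passed)
    SELECT 'SALVIAS', 1, src.n, veg.n, src.n = veg.n
    FROM src, veg;

    -- The error report: queries that don't pass, per source.
    SELECT datasource, query_id, source_rows, vegbien_rows
    FROM   validation_results
    WHERE  NOT passed;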
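
On the taxon-name question, the comparison amounts to assembling the fully formed name from its parts on the source side and looking for the same concatenated string in VegBIEN. A sketch under those assumptions; the source-side table and column names are placeholders, and the exact VegBIEN column (taxonlabel.taxonomicname per the notes above) should be verified against the schema:

    -- Hypothetical sketch: compare assembled taxon names between a source
    -- and VegBIEN. Column names on the source side are placeholders.
    SELECT concat_ws(' ', genus, specific_epithet, infraspecific_rank,
                     infraspecific_epithet) AS assembled_name,
           count(*)                         AS n
    FROM   source_taxa
    GROUP  BY 1
    ORDER  BY 1;

    -- Parallel query against VegBIEN; the two result sets should match
    -- row for row if the mapping preserved the names.
    SELECT taxonomicname AS assembled_name, count(*) AS n
    FROM   taxonlabel
    GROUP  BY 1
    ORDER  BY 1;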
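
For sources that arrive only as CSV files, the "database installation of the CSVs" could be as simple as the following, again a sketch with hypothetical file, table, and column names:

    -- Hypothetical sketch: load a provider's CSV into a Postgres staging
    -- table, keeping the original column names, so the validation queries
    -- have something to run against.
    CREATE TABLE staging_specimens (
        catalognumber  text,
        scientificname text,
        collector      text
        -- ...one text column per CSV header
    );

    COPY staging_specimens
    FROM '/data/bien/specimens.csv'
    WITH (FORMAT csv, HEADER true);

    -- A row-count check against the file (line count minus the header)
    -- catches rows that failed to import, e.g. the frame-shift problem
    -- mentioned above for REMIB.
    SELECT count(*) FROM staging_specimens;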

Decisions 

  • For specimens, validate against the VegCORE staging tables.
  • For plots, validate against the unmapped original database column names.
    • Brad will write the queries against the unmapped original databases, and Mike needs to do so for VegBank and CVS.

To Do

Aaron

  • Write the blank quantitative validations queries and fix the queries that have comments in CAPS.
    • Review the taxon related queries since Brad is less confident about them.
    • Let Brad and Mike know when the queries are ready since they'll need to refer to them to write the plot validation queries.
  • Make any necessary easy schema changes for the queries (above).
    • Inform the group if any difficult schema changes arise.
  • Create the quantitative validation pipeline. 
    • Do all queries on the in-house Postgres databases, so there's no need to match up results between MySQL and Postgres. The MySQL queries will need to be translated to Postgres (see the dialect-translation sketch after this list).
  • For plots, unmap the staging table names.
    • Let Brad and Mike know when tables/dbs with original plot column names are ready.
  • For specimens, write the queries against the VegCORE staging tables for each specimen source.
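
As a hedged example of the MySQL-to-Postgres translation mentioned above (the table and column names are illustrative, not actual SALVIAS fields):

    -- MySQL original (illustrative):
    --   SELECT `PlotCode`, IFNULL(`Habit`, 'unknown') AS habit, COUNT(*)
    --   FROM   `PlotObservations`
    --   GROUP  BY `PlotCode`, habit;

    -- Postgres translation: double quotes instead of backticks for quoted
    -- identifiers, and COALESCE instead of IFNULL.
    SELECT "PlotCode", coalesce("Habit", 'unknown') AS habit, count(*)
    FROM   "PlotObservations"
    GROUP  BY "PlotCode", habit;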

Brad and Mike

  • Brad will write the queries against (most) unmapped original databases.
  • Mike will need to do so for VegBank and CVS.