BIEN Database

February 13, 2014

Agenda

  • Review progress on quantitative validations (Aaron, 45 min.)
  • Outline next steps (5 min.)
  • Updates on Cultivated specimens and Taxonomy (Brad, 5 min.)
  • Reminder: Friday call about strategy for getting BIEN data to iPlant (Martha)

Previous week's To Do List

Aaron

aggregating validations

1. provide rough diagram of pipeline

  • send to Brad
  • not so general that it's vague, not so specific that it constantly needs revision

2. implement pipeline for traits

  • generate pass/fail report on # queries that passed

3. implement pipeline for SALVIAS

  • traits aggregating validations
  • pre-delete staging table rows with no taxon name (Taxon = '')

Brad

  • Continue work on cultivated specimens and taxonomy.

Notes

Pipeline for traits

  • A: The output side is in place.
    • Working on the input side.
  • BB: Why are you using views? (wants to understand Aaron's thinking and approach)
  • A: If anything changes in a query, everything gets changed automatically.
    • The most important parts of the pipeline are within the database queries.
  • Brad uses Shell, PHP, or Python to wrap all the validation queries.
  • Aaron is using a Shell script, but is putting as much as possible in the database so that changes propagate automatically, instead of requiring search and replace in all the query files.
    • Makes it more maintainable.
  • BB: Will validation have to be manually triggered?
  • A: No, it will be triggered as part of the import process, probably in the make_analytical_db script
  • Brad likes to have a high level summary of the output of the validation queries so he can see at a glance how things went.
    • So he doesn't have to click into individual result sets.
    • Need a log file with links to result sets for queries that have failures.
  • When a mapping is performed, where are initial errors reported?
  • A: TraitObservations.errors reports things like data type failures.
  • BB: For these quantitative validations, the errors are a mismatch between input and output data.
    • Those will be reported as shown in the diagram (comparison of table value rows)
    • For error reporting to data providers, will want a summary report in a manner that non-geeks can access and understand.
    • In WordPress.
  • Martha: Clarifying that the work to make validation information accessible to the data provider should be done LATER, not now.
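
The view-based approach Aaron describes could be sketched roughly as follows, using SQLite via Python purely for illustration (the table, view, and column names here are hypothetical, not BIEN's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical staging table standing in for a trait data source.
cur.execute("CREATE TABLE staging_traits (taxon TEXT, trait TEXT, value REAL)")
cur.executemany("INSERT INTO staging_traits VALUES (?, ?, ?)",
                [("Quercus alba", "height", 24.0),
                 ("", "height", 10.0),          # row with no taxon name
                 ("Acer rubrum", "leaf_area", 31.5)])

# The validation logic lives in a view: if the underlying query changes,
# everything built on top of the view picks up the change automatically,
# with no search and replace across query files.
cur.execute("""
    CREATE VIEW valid_traits AS
    SELECT * FROM staging_traits WHERE taxon <> ''
""")

# A count query over the view gives the at-a-glance pass/fail summary.
total = cur.execute("SELECT COUNT(*) FROM staging_traits").fetchone()[0]
valid = cur.execute("SELECT COUNT(*) FROM valid_traits").fetchone()[0]
print(f"{valid}/{total} rows passed")  # 2/3 rows passed
```

Because the view is just a saved query, the same pattern works whether validation is triggered manually or from the import script.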

Diagram

  • Looking at the validation workflow diagram to clarify what has been completed and what remains to be done.
  • Output (right side) is completed.
  • Prototype is for traits.
  • The input query work still needs to be done.
    • The table joins are part of that.
  • Mark: Requesting Aaron add additional details to the diagram:
    • Fat, stippled arrow between staging tables and normalized db to indicate
    • Links to example input and output files; DarwinCore, Salvias

(returning to) Pipeline for traits

  • A: Designing the input side of the pipeline. Will put results into db table.
  • estimated time: 1 week
  • Also have the sequencing aspect (?)
  • Brad: Should results be stored in a database table or csv file?
  • A: Would have to write code to store the results in a db. Would need little or no code to dump results to a csv file.
  • Brad: Explaining the sort of reporting desired:
    • First level of report is "something didn’t match". 
    • 2nd is which query or queries failed.
    • 3rd: if something failed, diagnose by looking at the unmatched rows.
  • BB: If store results in database, it is also easier to create a report for data providers. (not true for csv file)
  • Mark: Likes to use text tools (in R, Perl, Awk) to sort through the reports.
    • Thinks that's also easier for ecologists.
  • BB: It's also doable in SQL.
    • Better for implementation.
    • The person doing this is more likely to be the person responsible for the source database, not an ecologist.
    • Store the report in the database and can dump a csv file from that as needed.
    • Just automate creation of the database table.
  • Martha: The queries that result in a count would be useful to report in a logfile. The data provider could look at the numeric values and would know if they were correct or not. This would help (in part) address our concerns about having to validate against the staging tables instead of the original data source.
  • BB: Good point. Report the column header and count.
  • Are we talking about a view or a materialized view?
  • Use a dynamic view; the user will trigger it when they want it, and we can dump the results as csv. Compress the csv files to keep the size manageable.
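
The reporting levels discussed above (pass/fail summary first, then which checks failed, then the mismatched values, dumped to a compressed csv) could look roughly like this; the per-column counts are made up for illustration and do not come from the actual pipeline:

```python
import csv
import gzip

# Hypothetical per-column row counts from the input (staging) and
# output (analytical) sides of the pipeline.
input_counts = {"taxon": 1200, "height": 1200, "leaf_area": 950}
output_counts = {"taxon": 1200, "height": 1180, "leaf_area": 950}

# Level 1: did everything match?  Level 2: which checks failed?
failures = {col: (input_counts[col], output_counts[col])
            for col in input_counts
            if input_counts[col] != output_counts[col]}
print("PASS" if not failures else f"FAIL: {len(failures)} check(s)")

# Report the column header and count, and dump the failing comparisons
# to a compressed csv that a data provider could open and read.
with gzip.open("validation_report.csv.gz", "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["column", "input_count", "output_count"])
    for col, (n_in, n_out) in sorted(failures.items()):
        writer.writerow([col, n_in, n_out])
```

Dumping only the failures keeps the report small, while the full comparison stays queryable in the database.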

Recapping

  • In the coming week, Aaron will:
    • Complete implementation of the input pipeline and table join (diff)
    • Send results from Traits to the group.
    • Then move on to SALVIAS and other data sources (if time permits.)
  • Regarding writing scripts that generate compressed csv files: this is UI, so it is really for Brad (later).

Decisions

  • Work to make validation results accessible to the data provider as CSV files should be done LATER, not now, and is part of Brad's UI work, not Aaron's database development work.
  • The queries that result in a count would be useful to report in a logfile. The data provider could look at the numeric values and would know if they were correct or not. Report the column header and count.
    • This would help (in part) address our concerns about validating against the staging tables instead of the original data source.

To Do

Aaron

Diagram

  • Add details to the quantitative validation workflow diagram as discussed, including links to examples. (DONE)

Development

  • Complete implementation of the input pipeline and table join (diff)
  • Send results from Traits to the group.
  • Then move on to SALVIAS and other data sources (if time permits.)