2014.07.17 BIEN db

BIEN Database

July 17, 2014

Participants

Aaron Marcuse-Kubitza, Mark Schildhauer, Martha Narro

Agenda

  • Data reload bug
  • Other tasks in queue dependent on data reload (dates, taxon names, workflow for viewFullOccurrence)

Notes

Data reload bug

  • It's fixed!
  • Aaron started re-importing the data, which is expected to take 5-6 days.

Performance optimization

  • Nick will spec out some hardware to speed things up.
    • NCEAS has a machine with CPUs that are twice as fast, but disks are slow.
    • Looking into faster localized disks for systems.
    • Hope to cut the data loading time in half. 
  • Part of the performance problem is how the Python code is interwoven with the Postgres code.
    • Considering parallelizing the code.
    • Aaron says Python doesn’t support multithreading.
    • Mark thinks code performance can be improved.
    • Stack Overflow has lots of discussion about optimizing performance.
  • It takes 8 hrs just to construct the analytical database (analytical_stem).
    • This strikes Mark as being extremely slow.
    • At some point, Aaron needs to look into why the performance isn't what it should be and how to improve it.
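
Note on the parallelizing idea above: CPython threads do not run Python bytecode in parallel because of the global interpreter lock, but the standard-library multiprocessing module can spread CPU-bound work across cores using separate processes. Below is a minimal sketch, assuming the import could be split into independent per-datasource jobs (this may not hold for the real pipeline). The function names load_datasource and load_all are hypothetical placeholders, not the actual BIEN import code.

```python
from multiprocessing import Pool


def load_datasource(name):
    """Hypothetical stand-in for one datasource's import step."""
    # Placeholder CPU-bound work; a real step would run the import for `name`.
    checksum = sum(ord(c) for c in name)
    return name, checksum


def load_all(names, workers=4):
    """Run load_datasource for each name in separate worker processes."""
    with Pool(processes=workers) as pool:
        # map() returns results in the same order as the inputs.
        return pool.map(load_datasource, names)


if __name__ == "__main__":
    for name, result in load_all(["FIA", "MO", "SALVIAS", "TRT"]):
        print(name, result)
```

Whether this helps depends on where the time actually goes: if most of it is spent waiting on Postgres or on slow disks, extra worker processes may add little.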

Other tasks in queue dependent on data reload (dates, taxon names, workflow for viewFullOccurrence)

Status of work on dates when data reload began:

  • A summary of the date problems is in the Google doc titled BIENDatesPopulated.
  • Date fixes: the problems/bugs that didn’t need decisions from the PIs were fixed by Aaron before the data reload began.
    • FIA - These cases are now coded. Will be able to check that they are handled correctly after the data reload.
    • MO - Just needed a data reload, since the problem had apparently been fixed at some point. Confirm after the data reload completes.
  • SALVIAS - Aaron still needs to troubleshoot the SALVIAS date problem, after the data reload.
  • TRT - Mark will look into TRT dates missing after a certain row.
  • GBIF, NVS, JMB, HVAA - PIs need to make decisions.
    • PIs need to find time to make decisions on the other date-related items detailed in BIENDatesPopulated.
    • Martha will provide sufficient info for the PIs to make decisions, and one person will communicate the decisions back to Aaron.

Taxon name work will proceed while data are reloading.

Then the next priority is developing the workflow for viewFullOccurrence.

  • Due to time constraints, the analytical databases (tables) that we expect to have in place by end of August are analytical_stem and viewFullOccurrence.

People's (non)availability

  • Mark: Next 3 weeks, NCEAS and RENCI jointly hosting a training session. Mark is one of the instructors and will also support the students.
  • Martha: Next two weeks out on vacation.
  • Brian E & Brian Mc: At a Gordon Conference next week.

To Do

Aaron

  • Complete data reloading.
    • When the data reload finishes, check that the fixed date-related bugs and the newly handled cases now produce correct dates. Afterwards, ask Brian McGill to access VegBIEN to confirm.
    • Troubleshoot and fix SALVIAS date bug.
  • Taxon names: complete issue #917 (TNRS with TPL)
    • However, there was an email exchange between Aaron and Brad about implementing Jerry Lu's algorithm to sort the names. Before Aaron launches into this, Martha is checking with Nicole and will get back to Aaron on Friday.
    • Complete the other taxon name tasks listed under issue #928.
  • Workflow for viewFullOccurrence - move on to this as time permits or if waiting on other things.

Brian E, Bob, Brian Mc

  • Decide how to handle the date problems. Details to follow in a separate email.

Mark

  • Look into TRT dates missing after a certain row.