2014.02.14 BIEN db to iPlant (planning)

BIEN Database to iPlant

February 14, 2014

Discussion Topics

  • Needs (Mark, Brad, Martha)
    • Analytical views: These are what users will access. What do we want users to be able to do? 
    • Large-scale analyses: What do we envision users doing? 
    • Normalized BIEN3 database: What do we want iPlant to provide – backup, mirroring?
  • Possible Approaches (Edwin, Nirav)
    • To the needs above
  • Summary (Martha)
    • Preliminary plan or options for consideration by the BIEN db leadership group

Email from Mark

Hi Folks,

In anticipation of our chat this morning, I just wanted to drill-down a bit on Martha's list:

the BIEN scientists are brainstorming what views of our database will be most useful to expose to (most) scientists. We expect a relatively small number of these, that individual researchers might then want to further subset (select/project) for download into their own analytical framework (desktop model, R-code, etc.). But subsequent to developing those appropriate queries, some issues arise--

1) can iPlant help develop Web-interfaces for accessing these query results? We have minimal experience developing UI, and imagine iPlant will want some consistent look&feel in how their information resources are exposed..

2) does iPlant have some standard API(s) to access RDBMS (e.g. postgreSQL in the case of BIEN)?

3) does iPlant want to expose our pre-canned BIEN views-- or enable the development of more customized result sets derived from BIEN-- via Web services? For example, we know that SSWAP is Web svc based, and one could imagine a semantic query on BIEN that extracts only the desired subset of materials. In addition, we are guessing that iPlant is promoting some standard approach to Web services in order to facilitate syntactic and semantic interoperability of any analytical pipelines developed within your project?

These areas-- API development, Web service development (as a subset of API), and Web UI development-- are areas in which the current BIEN team does not have strong expertise. Can the iPlant core-CI team help us out in these regards? Talk soon...

Thanks!
Mark

Notes

Overview

  • Mark: The code and database schema will be open source.
    • Want to give a small set of high level users, SQL access to the normalized postgres database.
    • Most users will access a set of views.
    • Mostly the view result set will be more than what user wants.
  • Nirav: ?hat are the space and compute requirements?
  • Mark doesn’t know exactly, but probably in the single gigabyte range at most.
    • We're developing on a quad core machine with 32 gigs of RAM.
  • Brad: If derived products, esp., range maps included, the size gets to terabytes.
  • Nirav and Mark: Those sorts of derived products are not in scope for today's discussion.
    • Today's scope is just the normalized database and only derived products that can be generated using an SQL query on that database.
  • Nirav: The BIEN db is not so large it requires a cluster?
  • Mark: Correct
    • But the time required to re-populate a schema change or reload data needs to be considered.

Recommendation for database and views

  • Nirav: The model is:
    • Have the analytics being actively accessed separate from the database that is actively maintained and growing.
    • Have a replicate database that is for updating.
    • Create a virtual "appliance" with everything in it.
    • It's trivial to do the updating in it with full access control. (Atmosphere)
    • For general user access,there are canned tools that create “reports from a database”. They sit on top of the database. Example: Jasper

Access and Authentication

  • Mark: Access and authentication is needed.
    • Want to be able to use iPlant’s.
  • Nirav: That is a solved problem in our infrastructure. Easy for you to use them.
    • Data sharing is very granular.
  • Mark: Concerned about onus being on them (BIEN/NCEAS) to implement it.
  • Nirav: Need to go through iPlant's ECS (Extended Collaborative Support) process to request developer time.
    • But it's trivial to give them (BIEN/NCEAS) 3 lines of code to drop into their PHP code.
  • Mark: OLAP – online analytical processing...
    • A handful of views will meet the needs of most ecologists.
  • Brad: There is also a long tail in the distribution of users. We are starting with users at the peak, not the tail, but requests from the tail will come up quickly. He can already think of several examples.
    • The infrastructure is open source, but the content is not. Some datasets are not free to distribute, although many can be.
    • Example: Carnagie project still in progress, an active dataset with plots in Peru plots and LIDAR data.
  • Nirav: Everyone has needs to hide part of their data. This is common
  • Brad: Access can vary at the level of the dataset, row and column.
  • Nirav: Have they implemented that level of access control on Postgres?
  • BB: No. Mostly we share the data as csv dump of views.
    • Have a 3 tier level of permissions in WordPress (website).
  • Nirav: It's hard to implement that level of granular access control on materialized views.
  • BB: SALVIAS – implemented access control at the plot level. (not record level)
    • Let the owner manage the permissions.
  • Nirav: In that instance there was more leeway.
  • Need tiers of users if you want someone to use the command line to pull data, vs a web application.
  • (Discussion of access in Postgres for normalized database)
  • Brad: In the database schema, we modeled the access from the provider end, but we haven’t captured all the information yet.
    • Haven’t included the users.
  • Nirav: It sounds interesting if that logic is in the data model.
    • Could do some interesting things in terms of _____
  • What you want is not an out of the box solution. There will need to be some investigation into how to scope it so it’s useable, but doable.
    • Need to look for solutions to provide prepackaged views.
  • It's much harder to do the more specific things (very fine-grained access control).
  • Martha: The more specific things will likely need outside funding. Beyond scope of ECS.

Analyses and APIs

  • Mark: Wants an update on what iPlant can provide for making it possible to connect data to analysis.
  • Does iPlant have an open source database framework or views... layer something on top to make them chain into pipelines?
  • Nirav: Not the chaining.
    • A web service needs an API for SSWAP to reason over it.
  • Mark: Concerned about BIEN having to develop their own APIs.
    • There are a lot of frameworks (e.g., Druple, Google) where you can take an RDBMS and layer their stuff on top.
  • Nirav: Will have a better answer in about 3 weeks since we (iPlant) are harmonizing our APIs.
  • Edwin: We have a lot of APIs in iPlant, developed according to indivdual products (HPC, Atmosphere cloud, Authentication, job launching, data, etc.)
    • It's hard for people to know which to use. So documenting and harmonizing.
  • REST SQL provides SQL interface to databases.
    • Some development will need to be done to make a database a restful service.
  • Nirav: Within data APIs we have some capability to subset for climate data formats, but not for RDBMS.
    • So, will have better answer in 3 weeks.
  • Mark: So we will hold off on any API discussions.
  • Nirav: Also we will address if a project already has an API, how do we bring them in? BIEN could be an example.

User Interface and Web Developer

  • Mark: Does iPlant have a web developer on board?
    • iPlant must have a standard look and feel for their web services.
    • Is there someone who could do this development for us?
    • Envisions geospatial subsetting, info graphics, ...
  • Nirav: You listed several areas of different expertise.
    • We have themes for our web service interfaces, but those may not be a good fit for your project.
    • However, we can give those pieces to people to use.
    • For the geospatial part, we do not have much yet. But a separate project of Nirav's (not iPlant, but related to it) is bringing in people with that expertise.
    • Have many people with general Java scripting expertise. 
  • If looking for something so specific, will need to go through ECS.
  • Website development can be very expensive in terms of development time.
  • Martha: This sounds like it would need outside funding.

Wrap up

  • Martha: The BIEN db group (and iPlant) will need to prioritize these features.
  • Edwin:
    • Can use Atmosphere for the full postgres database.
    • Can have views for users.
    • For the web development project, need a document to describe the scope, architecture and functionality (e.g., access control).
    • Need to break things out into short term deliverables.
    • The long term, larger development projects need additional funding.
  • Mark: Had envisioned authentication and web presentation as being there from iPlant(for the BIEN database project).
  • Nirav: We can advise the developer.
  • Mark: We're delivering a Postgres database with associated ETL code in python.
    • How do we integrate the product with iPlant?
    • The presentation (user interface) could be simple or complex.
    • Would like to see the iPlant desired approaches for doing this sort of integration.
    • Can he be pointed to some documentation?
  • Nirav: That's more of a conversation than reading documentation.
    • More about findng the skill set of the BIEN team and how it can work with iPlant.
    • We'll need to have more discussion to bounce ideas around to find a path forward.
    • Then the BIEN group can fill out a request for ECS.
  • Edwin: We still need a scoping document for describing the presentation (user interface).
    • Then we can have the conversation.
  • Mark: If iPlant can not come forward with a lot of developer assistance, we will have to keep it simple.
  • Nirav: Start modest and iterate on the product.
    • In our experience, if you wait to develop a highly polished product, it takes too long.
    • Provide something quickly for the users, then iterate on it.
  • Martha: The BIEN group will need to focus on more modest, short term deliverables.
    • The user interface development discussed today may even be out of scope for ECS.

Summary

Database and Views

  • Separate the analytical view that will be actively accessed by users from the full relational database that is maintained and growing.
  • Use Atmosphere to create a virtual "appliance" for updating the full relational database.
  • For general user access, identify a canned tool (e.g., Jasper) that creates “reports from a database”.
  • It's likely the group will need to focus initially on only creating the handful of views that will meet the needs of most ecologists, rather than including the more specialized views for ecologists in the tails of the user distribution.
  • Need to look for solutions to provide prepackaged views.

Access and Authentication

  • This is a solved problem in iPlant's infrastructure and the BIEN group can use it.
    • iPlant can provide a few lines of code to drop into the BIEN PHP code.
  • It's hard to implement very granular access control on materialized views (at the level of the dataset, row and column).
  • The group has not implemented that level of access control on the full Postgres database.
  • In the database schema, they have modeled the access from the provider end (not user), so it could be possible to leverage that.
  • What the group desires for access control is not an out of the box solution.
  • The group will need to scope it so it’s useable, but doable.
  • The very fine-grained access control is beyond the scope of what iPlant's ECS (Extended Collaborative Support) could provide. Outside funding would be needed.

Analyses and APIs

  • A web service needs an API for SSWAP to reason over it. (so not a good approach for BIEN group)
  • iPlant is currently documenting and harmonizing it's many APIs, so will be able to provide better information in about three weeks.
  • Therefore, the BIEN group will hold off on any API discussions.

User Interface

  • In iPlant's experience, developing a highly polished web interface product, takes too long and is very expensive.
  • iPlant recommends BIEN group start modest.
    • Provide something quickly for the users, then iterate on it.
  • For the web development project, need a document to describe the scope, architecture and functionality (e.g., access control), then discuss it to find a way forward.
    • Need to break things out into short term deliverables.
    • The more involved development can be submitted as an ECS request, but may be out of scope for ECS.
  • AFTER THE CALL: Martha realized she forgot to mention that Brian suggested Nathan Casler could probably help with UI development.

Action Items

  • Nirav, Edwin: Look for solutions to provide prepackaged views for the database. (DONE)
    • Get back to the group when the iPlant APIs have been harmonized, so using them can be discussed.
  • BIEN leadership group: 
    • Draft a document describing the scope of access control desired so it’s useable, but doable in the near term.
    • Draft a document describing the scope of and features for the User Interface.
    • Send the documents to iPlant.
  • Martha: Schedule a follow up call when the group has scoping documents to discuss.
  • (Later) iPlant can provide a few lines of code to enable the BIEN db project to use iPlant's authentication and access control solutions.