Data Integration Discussion at CIPRES Wrapup

Author: Andrew Lenards <lenards@ipc> (where 'ipc' = the iPlant Collaborative domain name)

Date: Wednesday, July 22, 2009
Location: Jepson Herbaria, University of California Berkeley

Discussions with Val Tannen started informally on Wednesday when I brought up a subversive idea Rod Page had posted on his blog about modeling a taxonomy with a filesystem.  We quickly jumped from this "leftfield" [iptoldataint:1] idea to the potential storage needs of iPToL, along with Val's concerns about needing to get the Data Integration Working Group started.  This led to scheduling a meeting with Val and Bill Piel.  Our discussion ended with Val walking back to the hotel with the group (Karla, Nirav, Steve, and myself), during which he shared his skeptical view of RDF, OWL, and semantic Web technologies [iptoldataint:2]; in his view, the functionality these technologies promise might never actually be delivered.  A nice counterpoint Val mentioned was that the life sciences still lack unique identifiers, and adopting semantic Web technologies might at least deliver those.  Val's candor was refreshing.

[iptoldataint:1]  http://iphylo.blogspot.com/2009/06/taxonomy-on-hard-disk.html - Rod Page acknowledges that the idea is a bit out of left field, yet the approach yields some very interesting benefits (like being able to easily move clades around and to place the taxonomy under version control).

[iptoldataint:2] One Core Software consideration has been how data will be stored and modeled; one flexible approach might be to model data with RDF.  Using RDF to model data does not mean that you need to use OWL or other semantic Web technologies.

I emailed Val the link to Rod Page's entry on his blog, iPhylo, where he discussed the idea of using a hierarchical system (a hard disk is how the idea is presented, but what is really being used is the filesystem built on top of the disk).  A variation of this idea was apparently suggested several years earlier by David Shorthouse (affiliated with the Encyclopedia of Life): placing a taxonomic hierarchy under version control.
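To make the "taxonomy on a hard disk" idea concrete, here is a minimal Python sketch; the taxon names, the layout, and the clade move at the end are my own illustration (and biologically arbitrary), not anything proposed in the discussion.  Each taxon becomes a directory, so a clade is a subtree of directories, moving a clade is a single rename, and the whole hierarchy can be handed to a version control system as ordinary files:

    import os
    import shutil

    # A toy taxonomy as nested dicts; leaves are empty dicts.
    TAXONOMY = {
        "Eukaryota": {
            "Viridiplantae": {
                "Chlorophyta": {},
                "Streptophyta": {"Embryophyta": {}},
            },
            "Metazoa": {"Chordata": {}},
        }
    }

    def materialize(taxonomy, root):
        """Create one directory per taxon, nesting children inside parents."""
        for name, children in taxonomy.items():
            path = os.path.join(root, name)
            os.makedirs(path, exist_ok=True)
            materialize(children, path)

    materialize(TAXONOMY, "taxonomy")

    # Moving a clade is one rename; the target here is arbitrary and only
    # demonstrates the mechanics, not a real taxonomic revision.
    shutil.move("taxonomy/Eukaryota/Metazoa",
                "taxonomy/Eukaryota/Viridiplantae/Metazoa")

Running ``git init`` inside the resulting directory would be enough to start tracking revisions to the classification, which is essentially Shorthouse's version-control observation.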

Date: Thursday, July 23, 2009
Location: Wozniak Lounge/Soda Hall, University of California Berkeley

Now that Val had a chance to read the idea, the group (Val, Bill Piel, Karla Gendler, Nirav Merchant, and myself) kicked it around verbally.  Val acknowledged that the unorthodox suggestion actually gets you some really nice features out of the box.  But a filesystem is not necessarily structured to make querying and other operations efficient.  Nirav pointed out that deeply nested filesystem structures can be troublesome to search, and that this issue is what led to the creation of the UNIX ``locate`` command (a minimal sketch of this pre-built-index idea appears below, after the list).  Bill Piel suggested that this was an idea in line with the Maddisons' approach to creating the Tree of Life Web Project (where each taxon could be represented as a hyperlink).  As we continued, we began discussing how data would be created, imported, and represented by the tools GCTs (Grand Challenge Teams) would use, in particular iPToL.  Val raised a series of considerations that he felt had been glossed over or not fully understood by other iPToL Working Groups:

  • How long will users' results be available for post-analysis?
  • How will users pull in and use "the big 500k taxa tree"?
  • How will users persist post-analysis work like annotations?
  • How will remote data sources be handled? 
  • Will provenance be tracked and available? [iptoldataint:3] 
  • How will differences between the results of remote data source queries be handled? 

[iptoldataint:3] In particular, modifications to the tree in, say, a prune/graft operation.
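Nirav's ``locate`` point above is essentially that walking a deep directory tree at query time is expensive, so ``locate`` answers queries from an index built ahead of time.  Here is a minimal Python sketch of the same idea, reusing the toy "taxonomy" directory from the earlier example (the function name and layout are mine, purely illustrative):

    import os
    from collections import defaultdict

    def build_index(root):
        """One exhaustive walk up front, analogous to locate's updatedb step."""
        index = defaultdict(list)
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                index[name].append(os.path.join(dirpath, name))
        return index

    index = build_index("taxonomy")
    # Lookup is now a dictionary hit; no tree traversal at query time.
    print(index.get("Chlorophyta", []))

The tradeoff is the usual one: the index answers name queries quickly but goes stale as the tree changes, which is exactly the kind of maintenance cost a filesystem-backed taxonomy would take on.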

Returning to Val's list: these considerations and concerns were not intended to be exhaustive, but more to get my attention (and thus the attention of the development and engagement staff).  Or, informally, the concerns were aimed at scaring or jolting a developer.  It seems that the Data Integration Working Group is ready to dig in and roll up their sleeves, if you will.

Bill also pointed out that it is going to be necessary for the Data Assembly Working Group to begin formalizing their workflow for the perpetual tree-updating machine.  As this happens, the Data Integration Working Group will need to be kept informed of this workflow.
At this point, Val had other obligations and excused himself from the conversation. 

Note: Midway through our discussion of concerns with Val and Bill, Rutger Vos had joined.

We (Karla, Nirav, myself) continued the discussion with Bill and Rutger.  Still on the topic of data storage, Bill and Rutger suggested that we look at BioSQL as a starting point for representing data relationally (as many may know, it provides a generic model for representing sequences, features, annotations, taxonomy, and ontologies).

We briefly discussed the TreeBASE II effort's use of Mesquite as a headless processing engine.  It seems like a powerful usage of the tool and, as indicated in Engagement Team meetings, a deeper understanding of Mesquite's many modules may help the development process when we encounter a problem one of them is designed to solve.  Since evaluating the entire tool (or workbench, as you could call it) is daunting, a survey of the available, mature modules and developing contacts [iptoldataint:4] who could help the development effort would be the easiest way to approach Mesquite currently.  We then briefly discussed Rutger's effort of screen-scraping TimeTree to get approximations of evolutionary divergence dates; such dates are not currently available via a convenient web service.

Finally, we moved to the discussion of RDF and ontologies with Rutger.  He is leading the movement within TreeBASE to modernize the application and push toward a flexible data modeling approach that would be amenable to the semantic Web, though he has not done any large-scale modeling with RDF, a.k.a. "triples".  Storage designed specifically for persisting triples has been considered slow when scaling beyond hundreds of millions of triples (popular triplestores like Mulgara and Sesame max out at 500M and 70M triples respectively, according to references on the W3C ESW wiki page [iptoldataint:5]).  At this phase, we do not know whether these limits would affect the project.  When asked by Nirav, Rutger indicated that he'd like to be included in the effort and/or discussion of modeling data as triples.  The Core Software group is not fully committed to the idea, but is definitely exploring using a triplestore or a BigTable implementation to persist data and metadata in a flexible manner.
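To ground the "triples" terminology, here is a minimal sketch using the Python rdflib library of what modeling tree structure as RDF might look like.  The namespace, node URIs, and predicate names (``hasParent``, ``taxonLabel``) are invented for illustration; they are not a vocabulary anyone proposed at the meeting:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    # Hypothetical namespace for this sketch only.
    EX = Namespace("http://example.org/iptol/")

    g = Graph()
    g.bind("ex", EX)

    root, child_a, child_b = EX["node/1"], EX["node/2"], EX["node/3"]

    # Every statement below is one triple: (subject, predicate, object).
    g.add((root, RDF.type, EX.TreeNode))
    g.add((child_a, RDF.type, EX.TreeNode))
    g.add((child_b, RDF.type, EX.TreeNode))
    g.add((child_a, EX.hasParent, root))
    g.add((child_b, EX.hasParent, root))
    g.add((child_a, EX.taxonLabel, Literal("Taxon A")))
    g.add((child_b, EX.taxonLabel, Literal("Taxon B")))

    print(g.serialize(format="turtle"))

As a back-of-the-envelope check against the store limits cited above: at a handful of triples per node, even the "big 500k taxa tree" comes to only a few million triples, well under the 70M-500M ceilings; it is the per-node annotations, metadata, and provenance records that could multiply the count quickly.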

[iptoldataint:4] Jeff Oliver is a contributor to Mesquite and a former doctoral student of David Maddison.  He might be able to help with training, and with evaluating whether Mesquite is the right tool to solve the problems we have identified or will identify (Jeff has been contacted - contact the author for his email).
[iptoldataint:5] http://esw.w3.org/topic/LargeTripleStores