DI_17JUN09

First meeting of the Data Integration Working Group
June 17, 2009

Attendees: Val Tannen, William Piel, Sheldon McKay, Karla Gendler, Dan Stanzione, Steve Goff, others...

Meeting purpose: The purpose of this meeting is to seek guidance from the designated iPToL working group leads on:

1. establishing the computational framework for the problem of data integration
2. entering the first stage of requirements analysis to determine where iPlant resources and personnel can best be used to accelerate and improve the data integration efforts.

The main focus is introducing the problem to the engagement team and core developers and
roughing out the general parameters of the required infrastructure components.

Introduction: Sheldon McKay (5 minutes)

Roles of the engagement team, iPToL steering committee and iPlant development team

Data Integration Introduction: Val Tannen and Bill Piel (30 minutes)

Key motivations and problems for data integration
Bottlenecks and checkpoints
Applicable existing tools and software components
Role of Bill Piel as superuser
iPToL expectations of the engagement team and iPlant developers

Procedural issues: Karla Gendler (5 minutes)

Funding
Meeting Frequency and Project Time Line
Reporting and Accountability

Questions from Executive Team (Dan Stanzione, Steve Goff)

Questions from Engagement Team (Sheldon McKay, Karla Gendler, Developer/analysts)

Meeting Notes 061709

Attending: Adam, Damian, Sheldon, Jerry, Liya, Matt, Karla, Nirav, Val, Scott, Bill P., Steve G., Dan S.

Intro/Roles – Sheldon

Working group is Val, Bill, + engagement team + developers. Goal to come up with design and prototypes in iterative fashion; then harden apps to production quality. Steering committee for iPTOL (SG, DS, RAJ, Sanderson, Donoghue, Pam, Karla, Sheldon) meets once a month.

Val – the iPTOL leads had a meeting with the iPlant engagement team and made some executive decisions re: scope. Defined several WGs, one of which is Data Integration; it will be crosscutting and important to the other WGs. Refer to Val’s Word doc “iPTOL Data Management Issues”. Identified 3 big efforts; one is building the big trees (backbone trees). This will involve Alexis’ group working on algorithms and input data (molecular sequences of DNA). B/c it’s important to get good data (Pam and Doug Soltis leading that effort), we should have clean and good provenance info on the datasets for scientific reproducibility. To properly integrate datasets, need to know provenance; Nirav said versioning may also be an issue. Versioning of chromatograms and alignments also could be problematic.
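A minimal sketch of what a provenance record with versioning might look like for a dataset such as an alignment. The class name, field names and the choice of a SHA-256 content hash are assumptions made for illustration; nothing here was decided at the meeting.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance entry for one version of a dataset."""
    dataset_name: str   # e.g. the name of an alignment file
    source: str         # who or what produced the data
    version: int        # version counter assigned by the submitter
    sha256: str         # hash of the content, used to detect changes


def make_record(dataset_name: str, source: str, version: int,
                content: bytes) -> ProvenanceRecord:
    """Build a record whose hash is computed from the raw content."""
    digest = hashlib.sha256(content).hexdigest()
    return ProvenanceRecord(dataset_name, source, version, digest)


def content_changed(old: ProvenanceRecord, new: ProvenanceRecord) -> bool:
    """True when two records describe different content."""
    return old.sha256 != new.sha256
```

Comparing hashes rather than version numbers means an analysis can be tied to the exact data it used, even if two submitters number their versions differently.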

Also can build large trees by combining trees (supertree construction). Not sure about this; needs to be discussed by the steering committee.

Applications – important aspect of our endeavor, as this is where the community will meet our work. 2 major applications. The first is character reconstruction: data about a whole bunch of species, botanical information. Brian O’Meara will lead this effort.

Mapping character data from the trees brings up question of which tree? Backbone trees generally.

(Bill had to leave at 1:30 PM. He sees this WG as broad and crosscutting, and sees the WG effort as similar to preproject effort for data integration; it allows people to analyze life sci data without obvious connection to plant genetics and derive interesting info from it. Integrating phylogenetic databases produces trees. Character reconstruction and tree decoration integrate with phylogenetic data.)

Users of these features may want to map their own data onto the tree. So need mechanisms for this (get data from one place, map it onto iPTOL trees).
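As a rough illustration of mapping user data onto a tree, the sketch below matches user-supplied species names to tip labels after simple normalization (case, underscores, extra spaces). The function names and the normalization rule are assumptions, not a design agreed by the WG; real name matching would need the taxonomic intelligence discussed in these notes.

```python
def normalize(name: str) -> str:
    """Lower-case a name and treat underscores and runs of spaces alike."""
    return " ".join(name.replace("_", " ").split()).lower()


def map_onto_tips(tip_labels, user_data):
    """Attach user values to tree tips by normalized name.

    tip_labels: iterable of tip labels taken from a tree.
    user_data:  dict of species name -> value supplied by the user.
    Returns (mapped, unmatched): a dict keyed by the original tip label,
    and a list of user names that matched no tip.
    """
    index = {normalize(tip): tip for tip in tip_labels}
    mapped, unmatched = {}, []
    for name, value in user_data.items():
        tip = index.get(normalize(name))
        if tip is None:
            unmatched.append(name)
        else:
            mapped[tip] = value
    return mapped, unmatched
```

Returning the unmatched names separately, rather than silently dropping them, lets the user see which records still need reconciling.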

The third, gene-species tree reconciliation: a user might bring a gene tree to the backbone tree. We must be able to facilitate this.

Crosscutting issues – none will work without taxonomic intelligence; management of products; visualization, user interfaces, browsing, exploration.

Developers’ questions – none so far.

Dan – experience with brute force methods for data integration. What are the biggest issues or promising techniques? Val – this is not a classical data integration problem; the hard part is finding a mechanism for bringing in data on demand. What is needed is a minimal interface and taxonomic intelligence for a common language. So we do not need much preprocessing ability and don’t want to build a huge data warehouse. Propose that we have a round of use cases defined, then converse with experts via discussions. Sheldon – NESCent Bioinformatics WG is valuable to keep tabs on. Val – looking for someone in the web information community.

On the taxonomic intelligence issue – it’s been a big headache for iPTOL; there are some solutions but not good ones. Find out status of work being done by Encyclopedia of Life people. Have designed some techniques for reconciling data/names. Val and Bill will write internal white paper on this.
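To make the name reconciliation problem concrete, here is a minimal sketch that resolves a submitted name to an accepted name through a synonym table. The two small tables are hand-made for illustration and stand in for whatever authority source is eventually chosen; this is not one of the techniques referred to above.

```python
# Illustrative tables only; a real system would load these from a
# taxonomic name service rather than hard-code them.
ACCEPTED = {"Solanum lycopersicum", "Zea mays"}
SYNONYMS = {"lycopersicon esculentum": "Solanum lycopersicum"}


def reconcile(name: str):
    """Return the accepted name for a submitted name, or None if unknown."""
    key = " ".join(name.split()).lower()
    # First check whether the name is already an accepted name.
    for accepted in ACCEPTED:
        if accepted.lower() == key:
            return accepted
    # Otherwise fall back to the synonym table.
    return SYNONYMS.get(key)
```

For example, `reconcile("Lycopersicon esculentum")` returns `"Solanum lycopersicum"`, while a name found in neither table returns `None` and would need manual review.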

Dynamic aspects of integration – has been work on web info extraction, but data is unstructured in those cases.

Val volunteered that he and Bill will write white paper on Tax Intelligence before next meeting.

Next step is to meet in 2 wks, July 1.