Data Integration

Investigators studying genotype-to-phenotype relationships need access to multiple sources of current information about all genes, proteins, metabolites, etc. with known or suspected roles in diverse response networks. Equally important are data sets on environmental variables that can strongly influence (i) gene product dynamics under non-constant conditions and/or (ii) reaction rates or other features central to laboratory measurement techniques. An enormous amount of data of all these types is currently available but is difficult or impossible to access in any useful way. The impediments range from incompatible indexing, storage, or display formats to the near impossibility of keeping track of new data sets as they are created or, in time, superseded by data from better technology. As a result, current data are vastly under-utilized. Moreover, as large as the existing corpus is, it will soon be dwarfed by the rapidly accelerating rate of data acquisition made possible by new technologies such as next-generation sequencing.

The data integration work group will investigate and apply methods for describing and unifying data sets into virtual systems that support other project activities. This will not entail physically merging data from different sources; such concatenation is neither practical in the short term nor sustainable over time. Instead, the approach will build upon existing middleware systems that use metadata to provide situational awareness of available data, to capture the logical relationships between different data sets, and to support tools that enable users to find relevant information even when they are not sure what data may exist. The system will support both publicly distributed data and secure, private, and/or user-local repositories, and will enable information to be pipelined into statistical inference, visualization, and/or modeling tools and applications.
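
To make the idea of metadata-based virtual integration more concrete, the following is a minimal sketch in Python, not a description of any particular middleware system. It assumes a catalog that stores only descriptive records (the data stay in their original repositories) and that can be searched by keyword or followed along declared relationships. All names here (DatasetRecord, MetadataCatalog, build_example_catalog, the data set identifiers, and the location strings) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Metadata describing one data set; the data itself stays at `location`."""
    dataset_id: str
    location: str                        # hypothetical pointer to the source repository
    keywords: frozenset                  # lowercase descriptive terms
    public: bool = True                  # False for private or user-local data
    related_to: frozenset = frozenset()  # ids of logically related data sets


class MetadataCatalog:
    """Holds metadata records only; no data are copied or merged."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        self._records[record.dataset_id] = record

    def search(self, keyword, include_private=False):
        """Return sorted ids of data sets tagged with `keyword`."""
        kw = keyword.lower()
        return sorted(
            r.dataset_id
            for r in self._records.values()
            if kw in r.keywords and (r.public or include_private)
        )

    def related(self, dataset_id):
        """Return sorted ids of registered data sets linked to `dataset_id`."""
        record = self._records[dataset_id]
        return sorted(i for i in record.related_to if i in self._records)


def build_example_catalog():
    """Build a small catalog with made-up records for illustration."""
    catalog = MetadataCatalog()
    catalog.register(DatasetRecord(
        "expr-001", "repo-a://expression/001",
        frozenset({"drought", "expression"}),
        related_to=frozenset({"env-007"}),
    ))
    catalog.register(DatasetRecord(
        "env-007", "repo-b://environment/007",
        frozenset({"drought", "temperature"}),
    ))
    catalog.register(DatasetRecord(
        "lab-042", "file:///local/lab/042",
        frozenset({"drought", "metabolite"}),
        public=False,
        related_to=frozenset({"expr-001"}),
    ))
    return catalog
```

In this sketch, a user who does not know which data sets exist could call search("drought") to discover candidates and then related("expr-001") to find the environmental measurements linked to an expression data set. Private records are returned only when include_private is set, which stands in for whatever access control a real system would apply.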