Core database schema for CoGe. This schema represents the relationships among the data in CoGe that allow it to store multiple versions of multiple genomes from multiple organisms, along with all relevant associated data: sequences, genomic features (e.g. genes), locations, annotations, names, data sources, etc. The only data not stored directly in the database are the genomic DNA sequences themselves, which are stored as flat files instead. There are two main reasons for this:
1. They take up the most space (~100GB; ~113,000,000,000 nucleotides of sequence), and keeping them in the database makes incremental backups difficult (each database change, no matter how small, required backing up the entire database).
2. System performance: when extracting sub-strings from a sequence, a file seek on the hard disk is faster than a database query.
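To illustrate the second point, here is a minimal sketch of pulling a sub-string out of a flat sequence file with a single seek. It is not CoGe's actual code. The function name `fetch_subsequence`, the 1-based inclusive coordinates, and the file layout (raw nucleotides, one byte each, no header lines or line breaks) are all assumptions made for the example.

```python
def fetch_subsequence(path, start, stop):
    """Return nucleotides start..stop (1-based, inclusive) from a flat file.

    Assumes a hypothetical layout: the file holds only raw nucleotide
    characters, one byte each, with no header and no newlines, so a
    sequence position maps directly to a byte offset.
    """
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    with open(path, "rb") as handle:
        # Jump straight to the first requested base instead of
        # reading everything that precedes it.
        handle.seek(start - 1)
        data = handle.read(stop - start + 1)
    return data.decode("ascii")
```

Under these assumptions, the cost of a lookup depends on the length of the requested region and not on the size of the genome. A file with line breaks or FASTA headers would need an index to translate positions into byte offsets.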
As a measure of the "robustness" of this database, CoGe is currently storing and serving:
36,000,000 genomic features
76,700,000 feature locations
60,800,000 feature names
81,600,000 feature annotations
Total DB size: 30GB
As an aside, note the three levels for "annotations". This is to accommodate GO (Gene Ontology), as described in the data-types document.
Posted on: Thu 1 Oct 2009 09:07:38