Populating the Database
Populating the Tree Reconciliation Database
This page describes how to populate the tree reconciliation (TR) database.
Initializing the Database
Currently, the database is initialized by running the SQL code that creates the tables.
Controlled Vocabularies/Ontologies
Controlled vocabularies are used throughout the database to tag attributes with the type of entity that is being described.
Tables Populated
- cvterm
- cv
- cvtermprop
- cvterm_relationship
- cvterm_path
- cvterm_synonym
Available Tools
- Chado Tools - The controlled vocabulary tables are taken from the ontology module of Chado, so the Chado ontology loading scripts can be used to load standard or custom ontologies. Published OBO ontologies are supported by the existing tools, and new ontologies can be created with the OBO-Edit program for import into the database.
Source Data
The following OBO ontologies are used by the TR database.
- Relationship Ontology (OBO_REL) - http://www.obofoundry.org/ro/
Defines core relations used in all OBO ontologies.
- Gene Ontology (GO) - http://www.geneontology.org/
  - Biological Process
  Provides structured controlled vocabularies for the annotation of gene products with respect to their biological role.
  - Cellular Component
  Provides structured controlled vocabularies for the annotation of gene products with respect to their cellular location.
  - Molecular Function
  Provides structured controlled vocabularies for the annotation of gene products with respect to their molecular function.
- Homology Ontology (HOM) - http://www.obofoundry.org/cgi-bin/detail.cgi?id=homology_ontology
This ontology represents concepts related to homology, as well as other concepts used to describe similarity and non-homology. PubMed Reference
The following ontologies were generated with OBO-Edit for use in the TR database.
- Tree Reconciliation Ontology (TRON) - available from svn
A simple TR ontology was developed to store attributes associated with the reconciliations.
- Phylogeny Ontology (PHYLO) - available from svn
Ontology for terms applied to phylogenetic trees. This incorporates element tags used by the NHX format, phyloXML elements, and PRIME. The tags used by the different systems are stored under separate namespaces.
Loading ontologies into the database
The relationship ontology must first be loaded into the database to allow for other ontologies to use these terms.
The available tools for loading ontologies into the database require the OBO file to be converted to chadoxml format with the go2chadoxml program from GMOD. This program takes a valid OBO format file as input and converts it to a chadoxml file. For example, to convert the phylogeny ontology to chadoxml:
go2chadoxml phylo_ontology.obo > phylo_ontology.chado.xml
The resulting chadoxml file can then be loaded into the database using stag-storenode.pl available from CPAN.
This program accepts the following options:
- -d
DSN for connecting to the database. This should be in the format 'dbi:mysql:dbname=[DATABASENAME];host=[HOSTREF]'
- --user
user name for connecting to the database
- --password
password for the database connection
For example, to load the chadoxml file for the phylogeny ontology (phylo_ontology.chado.xml):
stag-storenode.pl -d 'dbi:mysql:dbname=tr_test;host=localhost' --user [USERNAME] --password [PASSWORD] phylo_ontology.chado.xml
For more information on loading custom ontologies into this framework, see the documentation for how to load a custom ontology into Chado.
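The two steps above (convert, then load) can be scripted. The following is a minimal sketch, assuming go2chadoxml and stag-storenode.pl are on the PATH and accept exactly the arguments shown in the examples above. The function names are hypothetical helpers, not part of GMOD or any TR tool.

```python
import shlex
import subprocess


def build_dsn(dbname, host, driver="mysql"):
    """Return a DBI-style DSN in the format described above."""
    return f"dbi:{driver}:dbname={dbname};host={host}"


def convert_command(obo_path):
    """argv for converting an OBO file to chadoxml (output goes to stdout)."""
    return ["go2chadoxml", obo_path]


def load_command(xml_path, dbname, host, user, password):
    """argv for loading a chadoxml file, using only the options listed above."""
    return [
        "stag-storenode.pl",
        "-d", build_dsn(dbname, host),
        "--user", user,
        "--password", password,
        xml_path,
    ]


def convert_and_load(obo_path, xml_path, dbname, host, user, password):
    """Run both steps; raises CalledProcessError if either tool fails."""
    with open(xml_path, "w") as out:
        subprocess.run(convert_command(obo_path), stdout=out, check=True)
    subprocess.run(load_command(xml_path, dbname, host, user, password), check=True)


if __name__ == "__main__":
    # Print the load command instead of running it, so it can be inspected first.
    cmd = load_command("phylo_ontology.chado.xml", "tr_test", "localhost",
                       "USERNAME", "PASSWORD")
    print(shlex.join(cmd))
```

Because the relationship ontology must be loaded first, a wrapper like this would be called once for that ontology before any of the others.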
Loading Term Relationships into the Database
An overview of the use of transitive closure and deductive closure for GO terms is available from the Gene Ontology wiki (http://wiki.geneontology.org/index.php/Transitive_closure). In the TR database, the relationships among terms in the cvterm table are stored in the cvtermpath table. It is possible to use the code from GMOD to generate transitive closure links directly from the data in the database. This program requires Perl 5.10.0, which can limit its use.
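To illustrate what a transitive closure over is_a links contains, here is a minimal sketch. It is not the GMOD code and not the TR import program; the term names are placeholders, and the output shape (child, ancestor, distance) is only an assumption about what a closure row might hold.

```python
from collections import defaultdict


def transitive_closure(is_a_pairs):
    """Map (term, ancestor) to the shortest number of is_a steps between them.

    is_a_pairs is an iterable of direct (child, parent) links.
    """
    parents = defaultdict(set)
    for child, parent in is_a_pairs:
        parents[child].add(parent)

    closure = {}
    for start in list(parents):
        seen = {start}
        frontier = {start}
        distance = 0
        # Breadth-first walk up the graph, one is_a step per round.
        while frontier:
            distance += 1
            next_frontier = set()
            for term in frontier:
                for parent in parents.get(term, ()):
                    if parent not in seen:
                        seen.add(parent)
                        closure[(start, parent)] = distance
                        next_frontier.add(parent)
            frontier = next_frontier
    return closure


if __name__ == "__main__":
    links = [("term:C", "term:B"), ("term:B", "term:A")]
    for (term, ancestor), dist in sorted(transitive_closure(links).items()):
        print(term, ancestor, dist)
```

With the closure stored, finding all descendants of a term becomes a single lookup on the ancestor column rather than a recursive query.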
The current TR database does not use GMOD tools for computing relationships among terms in the database. Instead, it makes use of a precomputed transitive closure table for GO available at http://www.geneontology.org/scratch/transitive_closure/go_transitive_closure.links. This precomputed file is the result of running obo2linkfile on core GO terms; obo2linkfile is included with the download of the OBO-Edit program. It is possible to parse out the is_a links from this file and load only the is_a relationships into the database. The program tr_import_go_transitive_closure.pl (available from svn) can then be used to import the text file into the cvtermpath table.
The use of the cvtermpath table to query the database for GO terms, including all child terms of a parent query term, is described elsewhere in the wiki.
Taxonomy
Taxonomy information is needed for both the species tree (taxon_id in the species tree node table) and the gene tree (taxon_id in the member table). Ensembl Compara (EC) uses the NCBI taxonomy database for storing this information.
Tables Populated
- ncbi_taxa_node
Taxa nodes in a hierarchical framework that allows for selection of individual taxa and daughter taxa. This includes left and right index fields that will need to be filled by a tree crawling algorithm.
- ncbi_taxa_name
Information on the multiple names used to refer to taxa (e.g. common name, scientific name)
Available Tools
- mysqldumpGenomeDBNcbiTaxa.pl - EC seems to have a weird way of doing this: it writes the INSERT queries to a database instead of running the queries with the Perl DBI module. In general, tools from Ensembl Compara should be able to populate this table, at least initially. However, it may be difficult to keep this table up to date as new species identifiers are added to NCBI. The existing code could be modified to first search for an existing NCBI id in the database and only do INSERTs for the missing values.
- taxonTreeTool.pl - I think this is the EC code to fill left and right index information for the NCBI taxonomy information.
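The left and right index fields mentioned above follow the nested-set idea: a depth-first walk numbers each node on the way down (left) and on the way back up (right). The sketch below shows one way to compute them. It is not the taxonTreeTool.pl algorithm, and it assumes a root is marked either by being its own parent or by having a parent that is absent from the input.

```python
from collections import defaultdict


def assign_left_right(parent_of):
    """Return {node_id: (left_index, right_index)} from a {node: parent} mapping."""
    children = defaultdict(list)
    roots = []
    for node, parent in parent_of.items():
        if parent == node or parent not in parent_of:
            roots.append(node)
        else:
            children[parent].append(node)

    left, right = {}, {}
    counter = 0
    for root in sorted(roots):
        # Iterative depth-first traversal; avoids recursion limits on deep trees.
        stack = [(root, False)]
        while stack:
            node, leaving = stack.pop()
            counter += 1
            if leaving:
                right[node] = counter
            else:
                left[node] = counter
                stack.append((node, True))
                for child in sorted(children.get(node, ()), reverse=True):
                    stack.append((child, False))
    return {node: (left[node], right[node]) for node in left}


if __name__ == "__main__":
    # Node 1 is the root (its own parent); 2 and 3 are its children; 4 is under 2.
    tree = {1: 1, 2: 1, 3: 1, 4: 2}
    for node, (l, r) in sorted(assign_left_right(tree).items()):
        print(node, l, r)
```

A taxon and all of its daughter taxa can then be selected with a range condition: every descendant has a left index between the left and right index of its ancestor.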
Potential Needs
- It may be worth revisiting the existing code for something more elegant. I did NCBI tree traversal and automated download of NCBI Taxonomy information as part of a MyGCAT project and may be able to reuse some old code.
Source Data
- NCBI taxonomy files - ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
- Information on these files is available at: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt
- These files used to be updated every Tuesday in the AM; however, the timestamp on the currently available file is one year old.
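The taxdump files use a tab-pipe-tab field separator, with each row ending in a tab and a pipe. The sketch below reads the first three columns of nodes.dmp (tax id, parent tax id, rank) and reports which tax ids are not yet in the database, so that only those would need INSERTs. The sample lines are shortened for illustration, and the helper names are hypothetical.

```python
import io


def parse_dmp_line(line):
    """Split one taxdump row into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def read_nodes(handle):
    """Yield (tax_id, parent_tax_id, rank) from a nodes.dmp style file."""
    for line in handle:
        if not line.strip():
            continue
        fields = parse_dmp_line(line)
        yield int(fields[0]), int(fields[1]), fields[2]


def missing_tax_ids(parsed_ids, existing_ids):
    """Return the tax ids present in the dump but absent from the database."""
    return sorted(set(parsed_ids) - set(existing_ids))


if __name__ == "__main__":
    # Shortened sample rows; real nodes.dmp rows carry additional columns.
    sample = io.StringIO(
        "1\t|\t1\t|\tno rank\t|\n"
        "131567\t|\t1\t|\tno rank\t|\n"
        "2\t|\t131567\t|\tsuperkingdom\t|\n"
    )
    nodes = list(read_nodes(sample))
    print(nodes)
    print(missing_tax_ids([n[0] for n in nodes], existing_ids=[1]))
```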
Potential Problem
The code for this needs to be added to the database in a straightforward way, and the code needs to be documented here. The use of NCBI taxonomy files assumes that all species to be included in the reconciliation database have NCBI identifiers associated with them.
Genes
We will be populating the gene information from either 1kp data or full genome data. It may make sense to have modified pipelines for each of these two sources. It should be possible to load annotated genes from fully sequenced genomes into the database using the existing EC tools.
Tables Populated
- member
The member table appears to have some dependence on core Ensembl, since it holds a reference to the Ensembl id and includes version information for the Ensembl id.
- member_attribute
This table would be the place to put ids from the 1kp project or other information for individual sequence accessions.
- sequence
The amino acid (and possibly DNA) sequence of the gene.
Potential Problem
It is possible that the EC schema currently only supports amino acid sequences, but we will need to support both AA sequences and DNA sequences.
The sequence table can be patched by adding a dna_sequence column.
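As a rough illustration of such a patch, the sketch below adds a dna_sequence column to a simplified sequence table. It runs against an in-memory SQLite database purely so that it is self-contained; the table layout and column types are assumptions, and the real patch would be written for the MySQL schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in for the existing sequence table (layout is an assumption).
conn.execute(
    "CREATE TABLE sequence ("
    " sequence_id INTEGER PRIMARY KEY,"
    " length INTEGER NOT NULL,"
    " sequence TEXT NOT NULL)"
)

# The patch itself: a nullable column, so existing amino acid rows stay valid.
conn.execute("ALTER TABLE sequence ADD COLUMN dna_sequence TEXT")

conn.execute(
    "INSERT INTO sequence (length, sequence, dna_sequence) VALUES (?, ?, ?)",
    (2, "MK", "ATGAAA"),
)

columns = [row[1] for row in conn.execute("PRAGMA table_info(sequence)")]
print(columns)
```

Keeping the new column nullable means rows that only have an amino acid sequence do not need to be touched when the patch is applied.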
Program Completed
The program for this is: tr_import_members_from_fasta.pl
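For readers unfamiliar with the input, the sketch below shows the kind of FASTA parsing an import step like this needs: one (identifier, sequence) pair per record, with the identifier taken as the first word of the header line. It is an illustration only and does not reproduce the logic of tr_import_members_from_fasta.pl.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) for each record in a FASTA file."""
    identifier, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines may be wrapped; collect them until the next header.
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


if __name__ == "__main__":
    sample = io.StringIO(">seq1 example description\nMKT\nAY\n>seq2\nACGT\n")
    for identifier, sequence in read_fasta(sample):
        print(identifier, len(sequence), sequence)
```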
Clusters/Gene Families
We will be populating the database with clusters from outside of the EC pipeline. These may be delivered from an external source and will need to be parsed and added to the database.
Tables Populated
- family
Each gene cluster will have a unique row in the family table for a given clustering method.
- family_member
This links the individual genes to the family that they are a member of for a given clustering method.
- method_link_species_set
The method used to generate a clustering result.
- species_set
The set of species that were used to generate a given cluster set.
Needs:
- We will need to develop tools to parse incoming clusters
- It should be possible to update these clusters by adding new members when HMM-type alignments are used to add members to a family
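Since the incoming cluster format is not yet defined, the sketch below assumes the simplest case: a tab-delimited file with a family id in the first column and a member id in the second. Both the format and the function names are hypothetical. It also shows an update step that adds a member to an existing family without creating duplicates.

```python
import io
from collections import defaultdict


def read_clusters(handle):
    """Return {family_id: set of member ids} from a two-column tab-delimited file."""
    families = defaultdict(set)
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        family_id, member_id = line.split("\t")[:2]
        families[family_id].add(member_id)
    return families


def add_member(families, family_id, member_id):
    """Add a member to a family; return True only if it was not already present."""
    if member_id in families[family_id]:
        return False
    families[family_id].add(member_id)
    return True


if __name__ == "__main__":
    sample = io.StringIO("# family\tmember\nfam1\tgeneA\nfam1\tgeneB\nfam2\tgeneC\n")
    families = read_clusters(sample)
    print(add_member(families, "fam1", "geneD"))
    print({fam: sorted(members) for fam, members in families.items()})
```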
Alignments
Given gene families, these are the alignments for the members of that family. These are stored in CIGAR format. The CIGAR format should be documented somewhere in the Exonerate man pages (http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.html).
For some reconciliation software this alignment will be an output; for other software it will be an input.
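To show how a CIGAR line relates to a stored sequence, the sketch below expands an ungapped sequence into its aligned form and compresses it back. It assumes a compact notation in which M marks aligned residues, D marks gap columns, and a missing count means 1. That notation is an assumption and should be checked against the format actually stored in the database.

```python
import re
from itertools import groupby


def expand_cigar(sequence, cigar):
    """Rebuild the gapped alignment row from an ungapped sequence and a CIGAR line."""
    if not re.fullmatch(r"(?:\d*[MD])+", cigar):
        raise ValueError(f"unsupported CIGAR line: {cigar!r}")
    aligned = []
    pos = 0
    for count, op in re.findall(r"(\d*)([MD])", cigar):
        n = int(count) if count else 1
        if op == "M":
            chunk = sequence[pos:pos + n]
            if len(chunk) != n:
                raise ValueError("CIGAR line is longer than the sequence")
            aligned.append(chunk)
            pos += n
        else:
            aligned.append("-" * n)
    if pos != len(sequence):
        raise ValueError("CIGAR line does not cover the whole sequence")
    return "".join(aligned)


def to_cigar(aligned):
    """Compress a gapped alignment row into the compact M/D notation."""
    parts = []
    for is_gap, run in groupby(aligned, key=lambda ch: ch == "-"):
        n = len(list(run))
        parts.append(f"{n if n > 1 else ''}{'D' if is_gap else 'M'}")
    return "".join(parts)


if __name__ == "__main__":
    row = expand_cigar("MKTAY", "2M3D3M")
    print(row)
    print(to_cigar(row))
```

Storing the CIGAR line instead of the gapped row keeps the sequence table free of alignment-specific copies of each sequence.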
Tables Populated
- protein_tree_hmmprofile
An HMM profile generated from the alignments of the members of a tree
- protein_tree_member
Contains fields to store alignments in CIGAR format
- protein_tree_member_score
Contains fields to store alignments in CIGAR format
- family_member
_Includes alignment information in CIGAR format_
Phylogenies
Species Trees
Species trees are stored in separate tables from gene trees.
Tables Populated
- species_tree
A unique identifier for each species tree.
- species_tree_attribute
Information related to the species tree.
- species_tree_node
The individual node in a species tree, with the information for its parent node. If parent node is set to zero, the node is the root.
- species_tree_node_attribute
Information related to the individual node in the species tree. This could be the type of node it represents (duplication or speciation).
- species_tree_node_path
Loading Species Trees
Species tree data are loaded into the database using the tr_import_species_tree.pl program.
tr_import_species_tree.pl -i my_species_tree.nwk -n species_tree_name -u USERNAME -p PASSWORD -d DATABASENAME --host HOSTNAME --driver mysql
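To make the node and parent layout concrete, the sketch below turns a simple Newick string into (node_id, parent_id, label) rows, with the root's parent set to zero as described for the species_tree_node table. It is a minimal illustration and not the logic of tr_import_species_tree.pl; it assumes unquoted labels without spaces, ignores branch lengths, and does not handle Newick comments.

```python
import re


def newick_to_rows(newick):
    """Return a list of (node_id, parent_id, label); the root has parent_id 0."""
    text = "".join(newick.split()).rstrip(";")
    rows = {}            # node_id -> [parent_id, label]
    stack = []           # ids of internal nodes that are still open
    last_closed = None   # internal node awaiting its label after ")"
    next_id = 0

    for token in re.findall(r"[(),]|[^(),]+", text):
        if token == "(":
            next_id += 1
            rows[next_id] = [stack[-1] if stack else 0, ""]
            stack.append(next_id)
            last_closed = None
        elif token == ")":
            if not stack:
                raise ValueError("unbalanced parentheses")
            last_closed = stack.pop()
        elif token == ",":
            last_closed = None
        else:
            label = token.split(":")[0]  # drop any branch length
            if last_closed is not None:
                rows[last_closed][1] = label
                last_closed = None
            else:
                next_id += 1
                rows[next_id] = [stack[-1] if stack else 0, label]
    if stack:
        raise ValueError("unbalanced parentheses")
    return [(node_id, parent, label) for node_id, (parent, label) in sorted(rows.items())]


if __name__ == "__main__":
    for row in newick_to_rows("((A:0.1,B:0.2)AB:0.3,C:0.4)root;"):
        print(row)
```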
Gene Trees
Loading gene trees from TreeBest output
Tables Populated
*
Reconciliations
Loading reconciliations may also require loading the corresponding species trees and gene trees, or linking to species trees and gene trees that already exist in the database.
TreeBest format
PrimeTV format
------
Relevant links
Ensembl Compara Code:
Perl Modules for Tree IO:
- BioPerl's TreeIO - http://doc.bioperl.org/releases/bioperl-1.4/Bio/TreeIO.html
- Bio Phylo from Rutger Vos - http://search.cpan.org/dist/Bio-Phylo/