Populating the Database

Populating the Tree Reconciliation Database

This page describes the process for populating the tree reconciliation (TR) database.

Initializing the Database

Currently, the database is initialized by running the SQL code that creates the tables.

Controlled Vocabularies/Ontologies

Controlled vocabularies are used throughout the database to tag attributes with the type of entity that is being described.

Tables Populated
  • cvterm
  • cv
  • cvtermprop
  • cvterm_relationship
  • cvterm_path
  • cvterm_synonym
Available Tools
  • Chado Tools - The controlled vocabulary tables are taken from the ontology module of Chado, so Chado ontology loading scripts can be used to load standard or custom ontologies. Published OBO ontologies are supported by the existing tools, and new ontologies can be created with the OBO-Edit program for import into the database.
Source Data

The following OBO ontologies are used by the TR database.

The following ontologies were generated with OBO-edit for use in the TR database. 

  • Tree Reconciliation Ontology (TRON) - available from svn
    A simple TR ontology was developed to store attributes associated with the reconciliations.
  • Phylogeny Ontology (PHYLO) - available from svn
    Ontology for terms applied to phylogenetic trees. This incorporates the element tags used by the NHX format, phyloXML elements, and PRIME. The tags used by the different systems are stored under separate namespaces.
Loading ontologies into the database

The relationship ontology must first be loaded into the database to allow for other ontologies to use these terms.

The available tools for loading ontologies into the database require that the OBO file first be converted to chadoxml format with the go2chadoxml program from GMOD. This program takes a valid OBO format file as input and converts it to a chadoxml file. For example, to convert the phylogeny ontology to chadoxml:

 go2chadoxml phylo_ontology.obo > phylo_ontology.chado.xml

The resulting chadoxml file can then be loaded into the database using stag-storenode.pl available from CPAN.

This program accepts the following options:

  • -d
    DSN for connecting to the database. This should be in the format of 'dbi:mysql:dbname=[DATABASENAME];host=[HOSTREF]'
  • --user
    user name for connecting to the database
  • --password
    password for the database connection

For example, to load the converted file for the phylogeny ontology (phylo_ontology.chado.xml):

 stag-storenode.pl -d 'dbi:mysql:dbname=tr_test;host=localhost' --user [USERNAME] --password [PASSWORD] phylo_ontology.chado.xml
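
The conversion and loading steps above could be chained in a small wrapper. The following is a minimal sketch, not part of the TR code base: the function names are hypothetical, and it uses only the go2chadoxml and stag-storenode.pl options shown on this page. It assumes both programs are on the PATH.

```python
import subprocess


def convert_command(obo_path):
    """Build the go2chadoxml command line for one OBO file."""
    return ["go2chadoxml", obo_path]


def load_command(xml_path, dbname, host, user, password):
    """Build the stag-storenode.pl command line using the options listed above."""
    dsn = f"dbi:mysql:dbname={dbname};host={host}"
    return ["stag-storenode.pl", "-d", dsn,
            "--user", user, "--password", password, xml_path]


def convert_and_load(obo_path, xml_path, dbname, host, user, password):
    """Convert an OBO file to chadoxml, then load the result (hypothetical helper)."""
    # go2chadoxml writes to standard output, so redirect it into the XML file.
    with open(xml_path, "w") as out:
        subprocess.run(convert_command(obo_path), stdout=out, check=True)
    subprocess.run(load_command(xml_path, dbname, host, user, password), check=True)
```

Because the relationship ontology must be loaded first, a caller would run convert_and_load on that ontology before any of the others.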

For more information on loading custom ontologies into this framework, see the documentation for how to load a custom ontology into Chado.

Loading Term Relationships into the Database

An overview of the use of transitive closure and deductive closure for GO terms is available from the [Gene Ontology wiki|http://wiki.geneontology.org/index.php/Transitive_closure]. In the TR database, the relationships among terms in the cvterm table are stored in the cvtermpath table. It is possible to use the code from GMOD to generate transitive closure links directly from the data in the database. This program requires Perl 5.10.0, which may limit its use.

The current TR database does not use GMOD tools to compute relationships among terms in the database. Instead, it uses a precomputed transitive closure table for GO, available at http://www.geneontology.org/scratch/transitive_closure/go_transitive_closure.links. This precomputed file is the result of running obo2linkfile on the core GO terms; obo2linkfile is included with the download of the OBO-Edit program. The is_a links can be parsed out of this file so that only the is_a relationships are loaded into the database. The program tr_import_go_transitive_closure.pl (available from svn) can then be used to import the text file into the cvtermpath table.
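
Parsing out the is_a links might look like the sketch below. The layout of the links file is an assumption here: the code assumes tab-delimited lines with the child term, the relationship, and the parent term in the first three columns, and this should be checked against the actual file. The function name is hypothetical and this is not the logic of tr_import_go_transitive_closure.pl.

```python
def extract_is_a_links(lines):
    """Return (child, parent) pairs for the is_a lines of a links file.

    Assumed layout (not confirmed by this page): child<TAB>relation<TAB>parent.
    """
    links = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        child, relation, parent = fields[0], fields[1], fields[2]
        # Accept both a bare "is_a" and a prefixed form such as "OBO_REL:is_a".
        if relation.split(":")[-1] == "is_a":
            links.append((child, parent))
    return links


# Made-up lines for illustration only; these are not real GO identifiers.
sample = ["TERM:1\tOBO_REL:is_a\tTERM:2\n", "TERM:1\tpart_of\tTERM:3\n"]
is_a_links = extract_is_a_links(sample)
```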

Using the cvtermpath table to query the database for GO terms, including all child terms of a parent query term, is described elsewhere in the wiki.
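
As a rough illustration of why a transitive closure table is useful, the sketch below runs a descendant query against an in-memory SQLite stand-in. The schema is deliberately simplified and is an assumption: the subject_id (child), object_id (ancestor), and pathdistance columns follow the Chado convention, but the TR database tables may differ, and the term names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE cvtermpath (subject_id INTEGER, object_id INTEGER, pathdistance INTEGER);
""")
conn.executemany("INSERT INTO cvterm VALUES (?, ?)",
                 [(1, "root term"), (2, "child term"),
                  (3, "grandchild term"), (4, "unrelated term")])
# The closure table holds direct links and the implied indirect link (3 -> 1).
conn.executemany("INSERT INTO cvtermpath VALUES (?, ?, ?)",
                 [(2, 1, 1), (3, 2, 1), (3, 1, 2)])


def descendant_names(conn, parent_name):
    """Return the names of all terms below parent_name, at any depth."""
    rows = conn.execute("""
        SELECT child.name
        FROM cvtermpath AS p
        JOIN cvterm AS parent ON parent.cvterm_id = p.object_id
        JOIN cvterm AS child ON child.cvterm_id = p.subject_id
        WHERE parent.name = ?
        ORDER BY child.name
    """, (parent_name,)).fetchall()
    return [row[0] for row in rows]
```

A single join retrieves children and grandchildren together, with no recursive traversal at query time.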

Taxonomy

Taxonomy information is needed for the species tree (taxon_id in the species tree node table) and the gene tree (taxon_id in the member table). EnsEMBL Compara (EC) uses the NCBI taxonomy database to store this information.

Tables Populated
  • ncbi_taxa_node
    Taxa nodes in a hierarchical framework that allows for selection of individual taxa and daughter taxa. This includes left and right index fields that will need to be filled by a tree crawling algorithm.
  • ncbi_taxa_name
    Information on the multiple names used to refer to taxa (e.g., common name, scientific name)
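
The left and right index fields can be filled with a depth-first crawl that numbers each node once on the way down and once on the way back up (the nested set approach). The sketch below is a generic illustration, not the EC implementation; the function name and the input format (a dictionary mapping each parent id to its child ids) are assumptions.

```python
def assign_left_right(children, root):
    """Return {node_id: (left, right)} for the tree rooted at root.

    children maps a parent id to a list of its child ids (assumed input format).
    An explicit stack is used so that deep taxonomies do not hit recursion limits.
    """
    index = {}
    left = {}
    counter = 1
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            # All descendants are numbered, so close this node's interval.
            index[node] = (left[node], counter)
        else:
            left[node] = counter
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
        counter += 1
    return index


# Small invented tree: node 1 has children 2 and 3, and node 2 has child 4.
indexes = assign_left_right({1: [2, 3], 2: [4]}, root=1)
```

With these values, the daughter taxa of a node are exactly the nodes whose left index falls strictly between that node's left and right values.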
Available Tools
  • mysqldumpGenomeDBNcbiTaxa.pl - EC seems to take an unusual approach here, writing the INSERT queries to a database instead of running the queries with the Perl DBI module. In general, tools from Ensembl Compara should be able to populate this table, at least initially. However, it may be difficult to keep the table up to date as new species identifiers are added to NCBI. The existing code could be modified to first search for an existing NCBI id in the database and do INSERTs only for the missing values.
  • taxonTreeTool.pl - I think this is the EC code to fill left and right index information for the NCBI taxonomy information.
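
The suggested modification, inserting only the NCBI ids that are not already present, could be sketched as follows. This uses an in-memory SQLite table as a stand-in for the MySQL database, and the reduced column set (taxon_id, parent_id) and the function name are assumptions for illustration.

```python
import sqlite3


def insert_missing_taxa(conn, taxa):
    """Insert (taxon_id, parent_id) pairs that are not yet in ncbi_taxa_node.

    Returns the number of rows inserted.
    """
    existing = {row[0] for row in conn.execute("SELECT taxon_id FROM ncbi_taxa_node")}
    missing = {}
    for taxon_id, parent_id in taxa:
        if taxon_id not in existing:
            missing[taxon_id] = parent_id  # also collapses duplicates in the input
    conn.executemany(
        "INSERT INTO ncbi_taxa_node (taxon_id, parent_id) VALUES (?, ?)",
        list(missing.items()))
    conn.commit()
    return len(missing)


# Stand-in table with one taxon already loaded (ids are invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ncbi_taxa_node (taxon_id INTEGER PRIMARY KEY, parent_id INTEGER)")
conn.execute("INSERT INTO ncbi_taxa_node VALUES (1, 0)")
first_run = insert_missing_taxa(conn, [(1, 0), (2, 1), (3, 1)])
second_run = insert_missing_taxa(conn, [(1, 0), (2, 1), (3, 1)])
```

Running the same import twice inserts nothing the second time, which is what makes repeated updates from NCBI safe. The left and right indexes would still need to be recomputed after new nodes are added.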
Potential Needs
Source Data

Potential Problem

The code for this needs to be added to the database in a straightforward way, and the code needs to be documented here. The use of NCBI taxonomy files assumes that all species to be included in the reconciliation database have NCBI identifiers associated with them.

Genes

We will populate the gene information from either 1kp data or full genome data. It may make sense to have a modified pipeline for each of these two sources. Annotated genes from fully sequenced genomes should be loadable into the database using the existing EC tools.

Tables Populated
  • member
    The member table appears to have some dependence on the core Ensembl, since it refers to the Ensembl id and includes version information for the Ensembl id.
  • member_attribute
    This table would be the place to put ids from the 1kp project or other information for individual sequence accessions
  • sequence
    The amino acid (and possibly DNA) sequence of the gene

Potential Problem

It is possible that the EC schema currently supports only amino acid sequences. We will need to support both AA sequences and DNA sequences.

The sequence table can be patched by adding a dna_sequence column.
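
A minimal sketch of such a patch is shown below, run against an in-memory SQLite stand-in. The LONGTEXT column type and the existing columns of the stand-in table are assumptions; only the table name and the dna_sequence column name come from this page.

```python
import sqlite3

# Assumed patch statement; the column type is a guess and should match the schema in use.
PATCH_SQL = "ALTER TABLE sequence ADD COLUMN dna_sequence LONGTEXT"

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the existing sequence table.
conn.execute(
    "CREATE TABLE sequence (sequence_id INTEGER PRIMARY KEY, length INTEGER, sequence LONGTEXT)")
conn.execute(PATCH_SQL)

# List the column names after the patch to confirm the new column exists.
columns = [row[1] for row in conn.execute("PRAGMA table_info(sequence)")]
```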

Program Completed

The program for this is: tr_import_members_from_fasta.pl

Clusters/Gene Families

We will populate the database with clusters generated outside of the EC pipeline. These may be delivered from an external source and will need to be parsed and added to the database.

Tables Populated
  • family
    Each gene cluster will have a unique row in the family table for a given clustering method
  • family_member
    This links the individual genes to the family that they are a member of for a given clustering method.
  • method_link_species_set
    The method used to generate a clustering result.
  • species_set
    The set of species that were used to generate a given cluster set.
Needs:
  • We will need to develop tools to parse incoming clusters
  • It should be possible to update these clusters by adding new members when we use HMM-type alignments to add members to a family

Alignments

Given gene families, these are the alignments for the members of that family. These are stored in CIGAR format. The CIGAR format should be documented somewhere in the Exonerate man pages (http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.html).
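
To make the idea concrete, the sketch below converts between a gapped alignment row and a compact CIGAR string. The exact layout is an assumption: it uses a count followed by M for aligned residues and D for gap columns, with a count of 1 omitted. The Exonerate man page describes its own layout, so this should be checked against whatever the database actually stores. The function names are hypothetical.

```python
import re
from itertools import groupby


def alignment_to_cigar(aligned):
    """Encode a gapped sequence such as 'AC--GT' as a compact CIGAR string."""
    parts = []
    for is_gap, run in groupby(aligned, key=lambda ch: ch == "-"):
        length = len(list(run))
        count = "" if length == 1 else str(length)
        parts.append(count + ("D" if is_gap else "M"))
    return "".join(parts)


def cigar_to_alignment(cigar, sequence):
    """Rebuild the gapped row from the CIGAR string and the ungapped sequence."""
    out = []
    pos = 0
    for count, op in re.findall(r"(\d*)([MD])", cigar):
        length = int(count) if count else 1
        if op == "M":
            out.append(sequence[pos:pos + length])
            pos += length
        else:
            out.append("-" * length)
    return "".join(out)


cigar = alignment_to_cigar("AC--GTT-A")
restored = cigar_to_alignment(cigar, "ACGTTA")
```

Storing the CIGAR string alongside the ungapped sequence avoids keeping a full gapped copy of every sequence for every alignment.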

For some reconciliation software, this alignment will be an outcome; for other software, it will be an input.

Tables Populated
  • protein_tree_hmmprofile
    An HMM profile generated from the alignments of the members of a tree
  • protein_tree_member
    Contains fields to store alignments in CIGAR format
  • protein_tree_member_score
    Contains fields to store alignments in CIGAR format
  • family_member
    _Included alignment information in CIGAR format_

Phylogenies

Species Trees

Species trees are stored in separate tables from gene trees.

Tables Populated
  • species_tree
    A unique identifier for each species tree.
  • species_tree_attribute
    Information related to the species tree.
  • species_tree_node
    The individual node in a species tree, with the information for its parent node. If the parent node is set to zero, the node is the root.
  • species_tree_node_attribute
    Information related to the individual node in the species tree. This could be the type of node it represents (duplication or speciation)
  • species_tree_node_path
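
The parent-pointer layout described for species_tree_node can be turned back into a tree by grouping nodes under their parents and treating the node whose parent is zero as the root. The sketch below is an illustration only; the function name and the (node_id, parent_id) row format are assumptions.

```python
def build_tree(rows):
    """Return (root_id, children) from (node_id, parent_id) rows.

    A parent_id of zero marks the root, as described for species_tree_node.
    """
    children = {}
    roots = []
    for node_id, parent_id in rows:
        if parent_id == 0:
            roots.append(node_id)
        else:
            children.setdefault(parent_id, []).append(node_id)
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    return roots[0], children


# Invented rows: node 1 is the root, 2 and 3 are its children, 4 is a child of 2.
root, children = build_tree([(1, 0), (2, 1), (3, 1), (4, 2)])
```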
Loading Species Trees

Species tree data are loaded into the database using the tr_import_species_tree.pl program. 

 tr_import_species_tree.pl -i my_species_tree.nwk -n species_tree_name -u USERNAME -p PASSWORD -d DATABASENAME --host HOSTNAME --driver mysql

Gene Trees

Loading gene trees from TreeBest output

Tables Populated

*

Reconciliations

This may also require loading species trees and gene trees, or using existing species trees and gene trees.

TreeBest format
PrimeTV format

------

Ensembl Compara Code:

Perl Modules for Tree IO: