Populating the Tree Reconciliation Database

This page describes the process for populating the tree reconciliation (TR) database.

Initializing the Database

Currently, the database is initialized by running the SQL code that creates the tables.

Controlled Vocabularies/Ontologies

Controlled vocabularies are used throughout the database to tag attributes with the type of entity that is being described.

Tables Populated

cvterm
cv
cvtermprop
cvterm_relationship
cvterm_path
cvterm_synonym

Available Tools

Chado Tools - The controlled vocabulary tables are taken from the ontology module of Chado, so the Chado ontology loading scripts can be used to load standard or custom ontologies. Published OBO ontologies are supported by the existing tools, and new ontologies can be generated for import into the database using the OBO-Edit program.

Source Data

The following OBO ontologies are used by the TR database.

Relationship Ontology (OBO_REL) - http://www.obofoundry.org/ro/
Defines core relations used in all OBO ontologies.
Gene Ontology (GO) - http://www.geneontology.org/
- Biological Process
  Provides structured controlled vocabularies for the annotation of gene products with respect to their biological role.
- Cellular Component
  Provides structured controlled vocabularies for the annotation of gene products with respect to their cellular location.
- Molecular Function
  Provides structured controlled vocabularies for the annotation of gene products with respect to their molecular function.
Homology Ontology (HOM) - http://www.obofoundry.org/cgi-bin/detail.cgi?id=homology_ontology
This ontology represents concepts related to homology, as well as other concepts used to describe similarity and non-homology. PubMed Reference

The following ontologies were generated with OBO-Edit for use in the TR database.

Tree Reconciliation Ontology (TRON) - available from svn
A simple TR ontology was developed to store attributes associated with the reconciliations.
Phylogeny Ontology (PHYLO) - available from svn
Ontology for terms applied to phylogenetic trees. This incorporates element tags used by the NHX format, phyloXML elements, and PRIME. The tags used by the different systems are stored under separate namespaces.

Loading ontologies into the database

The relationship ontology must first be loaded into the database to allow for other ontologies to use these terms.

The available tools for loading ontologies into the database require the OBO file to be converted to chadoxml format with the go2chadoxml program from GMOD. This program takes a valid OBO format file as input and converts it to a chadoxml file. For example, to convert the phylogeny ontology to chadoxml:

 go2chadoxml phylo_ontology.obo > phylo_ontology.chado.xml

The resulting chadoxml file can then be loaded into the database using stag-storenode.pl available from CPAN.

This program accepts the following options:

-d
DSN for connecting to the database. This should be in the format of 'dbi:mysql:dbname=[DATABASENAME];host=[HOSTREF]'
--user
user name for connecting to the database
--password
password for the database connection

For example, to load the chadoxml file for the phylogeny ontology (phylo_ontology.chado.xml):

 stag-storenode.pl -d 'dbi:mysql:dbname=tr_test;host=localhost' --user [USERNAME] --password [PASSWORD] phylo_ontology.chado.xml

For more information on loading custom ontologies into this framework, see the documentation for how to load a custom ontology into Chado.

Loading Term Relationships into the Database

An overview of the use of transitive closure and deductive closure for GO terms is available from the Gene Ontology wiki (http://wiki.geneontology.org/index.php/Transitive_closure). In the TR database, the relationships among terms in the cvterm table are stored in the cvtermpath table. It is possible to use the code from GMOD to generate transitive closure links directly from the data in the database. This program requires Perl 5.10.0, which can limit its use.

The current TR database does not use GMOD tools for computing relationships among terms in the database. Instead, it makes use of a precomputed transitive closure table for GO available at http://www.geneontology.org/scratch/transitive_closure/go_transitive_closure.links. This precomputed file is the result of running obo2linkfile on core GO terms. This program is included with the download of the OBO-Edit program. It is possible to parse out the is_a links from this file and then load only the is_a relationships into the database. The program tr_import_go_transitive_closure.pl (available from svn) can then be used to import the text file into the cvtermpath table.
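The step of parsing out the is_a links can be sketched as follows. This is only an illustration, not the logic of tr_import_go_transitive_closure.pl: it assumes the links file is tab-delimited with the child term, the relation, and the parent term in the first three columns, which should be checked against the actual file. The function name and the term identifiers in the sample are placeholders.

```python
import io


def iter_is_a_links(handle, relation_suffix="is_a"):
    """Yield (child, parent) pairs for rows whose relation ends in is_a.

    Assumed layout (not confirmed by the source): tab-delimited rows with
    child term, relation, parent term in the first three columns.
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip rows that do not fit the assumed layout
        child, relation, parent = fields[0], fields[1], fields[2]
        if relation.endswith(relation_suffix):
            yield child, parent


# Placeholder rows standing in for the real links file.
sample = io.StringIO(
    "TERM:B\tOBO_REL:is_a\tTERM:A\n"
    "TERM:C\tpart_of\tTERM:A\n"
)
links = list(iter_is_a_links(sample))
```

Only the first sample row is kept, since the second describes a part_of relationship.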

The use of the cvtermpath table to query the database for GO terms, including all child terms of a parent query term, is described elsewhere in the wiki.
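As a rough illustration of why a precomputed closure table is useful, the sketch below builds a tiny in-memory SQLite database and fetches every descendant of a parent term with a single join. The column names subject_id, object_id, and pathdistance follow the Chado cvtermpath table. The reduced table definitions, the term names, and the descendants function are assumptions made for the example, and the TR database itself runs on MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cvtermpath (
        subject_id INTEGER,   -- descendant term
        object_id INTEGER,    -- ancestor term
        pathdistance INTEGER
    );
    INSERT INTO cvterm VALUES
        (1, 'parent term'), (2, 'child term'),
        (3, 'grandchild term'), (4, 'unrelated term');
    -- The closure stores direct and indirect links alike.
    INSERT INTO cvtermpath VALUES (2, 1, 1), (3, 2, 1), (3, 1, 2);
    """
)


def descendants(conn, parent_name):
    """Return the names of all terms below parent_name, at any depth."""
    rows = conn.execute(
        """
        SELECT child.name
        FROM cvterm AS parent
        JOIN cvtermpath AS path ON path.object_id = parent.cvterm_id
        JOIN cvterm AS child ON child.cvterm_id = path.subject_id
        WHERE parent.name = ?
        ORDER BY child.name
        """,
        (parent_name,),
    )
    return [row[0] for row in rows]


found = descendants(conn, "parent term")
```

Because the indirect link (grandchild to parent) is already stored, no recursive query is needed.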

Taxonomy

Taxonomy information is needed for the species tree (taxon_id in the species tree node table) and the gene tree (taxon_id in the member table). EnsEMBL Compara (EC) uses the NCBI taxonomy database for storing this information.

Tables Populated

ncbi_taxa_node
Taxa nodes in a hierarchical framework that allows for selection of individual taxa and their daughter taxa. This includes left and right index fields that will need to be filled by a tree-crawling algorithm.
ncbi_taxa_name
Information on the multiple names used to refer to taxa (e.g. common name, scientific name).
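The left and right index fields can be filled by a depth-first walk that numbers each node once on the way down and once on the way back up. The sketch below is a generic version of that idea and is not the EC implementation. The function name and the input layout (a dictionary mapping each parent id to its list of child ids) are assumptions for the example.

```python
def assign_indexes(children, root):
    """Return {node: (left_index, right_index)} for the tree under root.

    children maps a parent id to a list of child ids. An explicit stack is
    used instead of recursion because the NCBI taxonomy is deep and large.
    """
    indexes = {}
    left = {}
    counter = 1
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        if visited:
            # All descendants are numbered, so close this node.
            indexes[node] = (left[node], counter)
            counter += 1
        else:
            left[node] = counter
            counter += 1
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return indexes


# Toy tree: 1 is the root, 2 and 3 are its children, 4 is a child of 2.
tree = {1: [2, 3], 2: [4]}
indexes = assign_indexes(tree, 1)
```

With these indexes, the daughter taxa of a node are exactly the nodes whose left index falls between that node's left and right index, which allows selection with a simple range condition.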

Available Tools

mysqldumpGenomeDBNcbiTaxa.pl - EC seems to have an unusual way of doing this, writing the INSERT queries to a database instead of doing the queries with the Perl DBI module. In general, tools from Ensembl Compara should be able to populate this at least initially. However, it may be difficult to update this table continuously as new species identifiers are added to NCBI. The existing code could be modified to first search for an existing NCBI id in the database and only do INSERTs on the missing values.
taxonTreeTool.pl - I think this is the EC code to fill the left and right index information for the NCBI taxonomy data.
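The suggested modification (look up existing NCBI ids first and INSERT only the missing ones) might look roughly like the sketch below. It uses an in-memory SQLite table with only the taxon_id and parent_id columns so that it runs on its own. The function name is hypothetical and this is not code from EC.

```python
import sqlite3


def insert_missing_taxa(conn, taxa):
    """Insert only the (taxon_id, parent_id) rows not already present.

    Returns the number of rows inserted.
    """
    existing = {
        row[0] for row in conn.execute("SELECT taxon_id FROM ncbi_taxa_node")
    }
    missing = [taxon for taxon in taxa if taxon[0] not in existing]
    conn.executemany(
        "INSERT INTO ncbi_taxa_node (taxon_id, parent_id) VALUES (?, ?)",
        missing,
    )
    conn.commit()
    return len(missing)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ncbi_taxa_node (taxon_id INTEGER PRIMARY KEY, parent_id INTEGER)"
)
conn.execute("INSERT INTO ncbi_taxa_node VALUES (9605, 207598)")

incoming = [(9605, 207598), (9606, 9605)]
first_run = insert_missing_taxa(conn, incoming)
second_run = insert_missing_taxa(conn, incoming)
```

Running the same input twice inserts nothing the second time, so the update can be repeated safely whenever a new taxonomy dump is downloaded.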

Potential Needs

It may be worth revisiting the existing code for something more elegant. I did NCBI tree traversal and automated download of NCBI Taxonomy information as part of a MyGCAT project and may be able to reuse some old code.

Source Data

NCBI taxonomy files - ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
- Information on these files is available at: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt
- These files used to be updated every Tuesday morning; however, the timestamp on the file currently available is one year old.
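For reference, the .dmp files in the taxdump archive separate fields with a tab, a vertical bar, and a tab, and end each row with a tab and a vertical bar. A minimal reader for nodes.dmp, keeping only the first three fields (tax id, parent tax id, rank), could be sketched as below. The function names are placeholders and the sample rows are shortened to three fields, whereas the real file carries additional columns.

```python
import io


def parse_dmp_line(line):
    """Split one taxdump row into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return line.split("\t|\t")


def read_nodes(handle):
    """Yield (tax_id, parent_tax_id, rank) from a nodes.dmp style handle."""
    for line in handle:
        if line.strip():
            fields = parse_dmp_line(line)
            yield int(fields[0]), int(fields[1]), fields[2]


# Shortened sample rows; real rows have more columns after the rank.
sample = io.StringIO(
    "9606\t|\t9605\t|\tspecies\t|\n"
    "9605\t|\t207598\t|\tgenus\t|\n"
)
nodes = list(read_nodes(sample))
```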

Potential Problem

The code for this needs to be added to the database in a straightforward way, and the code needs to be documented here. The use of NCBI taxonomy files assumes that all species to be included in the reconciliation database have NCBI identifiers associated with them.

Genes

We will be populating the gene information from either 1kp data or full genome data. It may make sense to have a modified pipeline for each of these two sources. Annotated genes from fully sequenced genomes can probably be loaded into the database using the existing EC tools.

Tables Populated

member
The member table appears to have some dependence on the core Ensembl database, since it references the Ensembl id and includes version information for the Ensembl id.
member_attribute
This table would be the place to put ids from the 1kp project or other information for individual sequence accessions
sequence
The amino acid (and possibly DNA) sequence of the gene

Potential Problem

It is possible that the EC schema currently only supports amino acid sequences. We will need to support both AA sequences and DNA sequences.

The sequence table can be patched by adding a dna_sequence column.

Program Completed

The program for this is: tr_import_members_from_fasta.pl
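As an illustration of the kind of input such an import deals with, the sketch below reads FASTA records into (identifier, sequence) pairs that could then be written to the member and sequence tables. It is not taken from tr_import_members_from_fasta.pl. The function name and the choice to use the first word of the header line as the identifier are assumptions.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA formatted handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            words = line[1:].split()
            # Assumption: the first word of the header is the stable id.
            header, chunks = (words[0] if words else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records for illustration.
sample = io.StringIO(
    ">gene1 example description\nMKV\nLA\n"
    ">gene2\nMST\n"
)
records = list(read_fasta(sample))
```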

Clusters/Gene Families

We will be populating the database with clusters from outside of the EC pipeline. These may be delivered from an external source, and will need to be parsed and added to the database.

Tables Populated

family
Each gene cluster will have a unique row in the family table for a given clustering method
family_member
This links the individual genes to the family that they are a member of for a given clustering method.
method_link_species_set
The method used to generate a clustering result.
species_set
The set of species that were used to generate a given cluster set.

Needs:

We will need to develop tools to parse incoming clusters
It should be possible to update these clusters by adding new members when we use HMM-type alignments to add members to a family
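Since the incoming cluster format is not yet fixed, the following is only a sketch of the two needs listed above. It assumes one cluster per line with whitespace-separated member ids, which is the layout produced by some clustering programs. The family naming scheme and both function names are hypothetical.

```python
import io


def read_clusters(handle, prefix="family_"):
    """Return {family_id: [member ids]} with one cluster per input line."""
    families = {}
    number = 0
    for line in handle:
        members = line.split()
        if members:
            number += 1
            families[f"{prefix}{number}"] = members
    return families


def add_member(families, family_id, member_id):
    """Add a member to a family; return False if it was already present."""
    members = families.setdefault(family_id, [])
    if member_id in members:
        return False
    members.append(member_id)
    return True


# Made-up member ids for illustration.
sample = io.StringIO("geneA\tgeneB\tgeneC\n\ngeneD\tgeneE\n")
families = read_clusters(sample)
added = add_member(families, "family_2", "geneF")
duplicate = add_member(families, "family_2", "geneF")
```

In the database, each dictionary entry would correspond to one row in the family table and each member id to one row in family_member.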

Alignments

Given gene families, these are the alignments for the members of that family. These are stored in CIGAR format. The CIGAR format should be documented somewhere in the Exonerate man pages (http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.html).
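To show how a CIGAR string and an unaligned sequence together encode one row of an alignment, here is a small sketch. It assumes the compact form in which a count precedes each letter, M consumes residues from the sequence, D inserts gap columns, and a missing count means 1. The exact form stored in the database should be checked against the schema, and the function name is a placeholder.

```python
import re


def expand_cigar(sequence, cigar_line):
    """Rebuild the gapped alignment row for one family member."""
    aligned = []
    position = 0
    for count, operation in re.findall(r"(\d*)([MD])", cigar_line):
        length = int(count) if count else 1
        if operation == "M":
            # Copy the next residues from the unaligned sequence.
            aligned.append(sequence[position:position + length])
            position += length
        else:
            # D marks columns where this member has a gap.
            aligned.append("-" * length)
    if position != len(sequence):
        raise ValueError("CIGAR line does not match the sequence length")
    return "".join(aligned)


row = expand_cigar("MKVLA", "2M3D3M")
```

Storing only the CIGAR string avoids keeping a second, gapped copy of every sequence for every alignment.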

For some reconciliation software, this alignment will be an outcome, for other software this will be an input.

Tables Populated

protein_tree_hmmprofile
An HMM profile generated from the alignments of the members of a tree
protein_tree_member
Contains fields to store alignments in CIGAR format
protein_tree_member_score
Contains fields to store alignments in CIGAR format
family_member
Includes alignment information in CIGAR format

Phylogenies

Species Trees

Species trees are stored in separate tables from gene trees.

Tables Populated

species_tree
A unique identifier for each species tree.
species_tree_attribute
Information related to the species tree.
species_tree_node
The individual node in a species tree, with the information for its parent node. If the parent node is set to zero, the node is the root.
species_tree_node_attribute
Information related to the individual node in the species tree. This could be the type of node it represents (duplication or speciation)
species_tree_node_path

Loading Species Trees

Species tree data are loaded into the database using the tr_import_species_tree.pl program.

 tr_import_species_tree.pl -i my_species_tree.nwk -n species_tree_name -u USERNAME -p PASSWORD -d DATABASENAME --host HOSTNAME --driver mysql
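To make the node and parent layout of species_tree_node concrete, the sketch below turns a Newick string into (node_id, parent_id, label) rows, using a parent of zero for the root as described above. It is a simplified illustration and not the logic of tr_import_species_tree.pl: branch lengths are discarded, quoted labels and comments are not handled, and the node numbering and function name are assumptions.

```python
def newick_to_rows(newick):
    """Return (node_id, parent_id, label) rows; parent_id 0 marks the root."""
    text = newick.strip().rstrip(";")
    rows = []
    position = 0
    next_id = 1

    def parse_node(parent_id):
        nonlocal position, next_id
        node_id = next_id
        next_id += 1
        row_index = len(rows)
        rows.append(None)  # reserve the slot so parents precede children
        if position < len(text) and text[position] == "(":
            position += 1  # consume "("
            while True:
                parse_node(node_id)
                if position < len(text) and text[position] == ",":
                    position += 1
                    continue
                break
            if position >= len(text) or text[position] != ")":
                raise ValueError("unbalanced parentheses")
            position += 1  # consume ")"
        start = position
        while position < len(text) and text[position] not in ",()":
            position += 1
        # Keep the label and drop any ":branch_length" suffix.
        label = text[start:position].split(":")[0].strip()
        rows[row_index] = (node_id, parent_id, label)

    parse_node(0)
    if position != len(text):
        raise ValueError("unexpected trailing characters")
    return rows


rows = newick_to_rows("((A:1,B:2)AB:0.5,C)root;")
```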

Gene Trees

Loading gene trees from TreeBest output

Tables Populated

*

Reconciliations

Loading reconciliations may also require loading the associated species trees and gene trees, or may make use of species trees and gene trees that already exist in the database.

TreeBest format

PrimeTV format

------

Relevant links

Ensembl Compara Code:

Perl Modules for Tree IO:

BioPerl TreeIO - http://doc.bioperl.org/releases/bioperl-1.4/Bio/TreeIO.html
Bio::Phylo from Rutger Vos - http://search.cpan.org/dist/Bio-Phylo/

iPToL
