This box searches only this space. The box at the upper right searches the entire iPlant wiki.

Skip to end of metadata
Go to start of metadata

Schema Overview

This schema for storing information related to gene tree reconciliation keeps data related to species trees separate from data related to gene trees. This makes minimal changes to the existing tables in Ensemble Compara and adds information for species trees, reconciliations, and a controlled vocabulary.

Advantages of this schema:

  • Minimizes changes to the existing Ensemble/Compara schema
  • Follows weak typing paradigm, attribute tables can hold any type of specialized information
  • Makes use of controlled vocabularies and ontologies store data types
  • Use Chaod tables for controlled vocabularies(alternatively could use Chado cv module - http://gmod.org/wiki/Chado_CV_Module) allows for import of CV terms in OBO format using existing tools

Disadvantages of this schema:

  • Not the best schema for representing uncertainty in tree topology
  • Uncertainty in root not easily supported, The E/C schema assumes a gene tree with a single root, other roots would require a replication of the tree topology
  • This requires that ontologies exist and are loaded into the database before attribute values can be stored 

The tables in this database fall into a few major groups:

Protein Tables

Unknown macro: {mockup}

member

Information related to the an individual locus

Name

Type

Description

member_id

INTEGER(8)

Primary key

stable_id

VARCHAR(64)

Stable identifier for the locus such as the gene name or gene identifier.

version

INTEGER(8)

Version of the stable ID (see EnsEMBL core DB)

source_name

CHAR(17)

The source of the member.

taxon_id

INTEGER(8)

Source taxon the protein is derived from. Reference to ncbi_taxa_node.taxon_id

genome_db_id

INTEGER(8)

Reference to geneome_db.genome_db_id

sequence_id

INTEGER(8)

Identifier for the sequence used to represent this protein. Reference to sequence.sequence_id.

gene_member_id

INTEGER(8)

Reference to member.member_id to allow linkage from peptides to genes.

description

TEXT

The description of the gene/protein as described in the core database or from the Uniprot entry.

chr_name

CHAR(40)

The name of the chromosome this sequence maps to. Many assemblies use letters or other strings to name their chromosomes, so an integer would not work here.

chr_start

INTEGER(8)

The start of the location of the member on the chromosome.

chr_end

INTEGER(8)

The end of the location of the member on the chromosome.

chr_strand

TINYINT(1)

Strand of the chromosome.

display_label

VARCHAR(128)

This information not described in the E/C schema description.

member_attribute

This table allows for additional information to be stored in the member attribute table using the controlled vocabulary terms in the database.

Name

Type

Description

member_attribute_id

INTEGER(10)

Primary key

member_id

INTEGER(8)

Foreign key to member table

cvterm_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

TEXT

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

source_id

INTEGER(10)

 

sequence

The protein sequence.

Name

Type

Description

sequence_id

INTEGER(10)

Primary key

length

INTEGER(10)

The length of the protein sequence

sequence

LONGTEXT

The protein sequence.

Protein Family Tables

Unknown macro: {mockup}

family

The protein family. Contains all the group homologies found. There are several family_member entries for each family entry.

Name

Type

Description

family_id

INTEGER(10)

Primary key.

stable_id

VARCHAR(40)

stable family ID

version

INTEGER(10)

version of the stable_id (may only change when members migrate between this family and another one; stays the same otherwise)

method_link_species_set_id

INTEGER(10)

external reference to method_link_species_set.method_link_species_set_id

description

VARCHAR(255)

description of the family as found using the Longest Common String (LCS) of the descriptions of the member proteins.

description_score

DOUBLE

Scores the accuracy of the annotation (max. 100)

family_attribute

Attributes of the protein family.

Name

Type

Description

family_attribute_id

INTEGER(10)

Primary Key, Unique Identifier

family_id

INTEGER(10)

 

cvterm_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

source_id

INTEGER(10)

 

family_member

Contains the proteins corresponding to protein family relationship found. There are several family_member entries for each family entry.

Name

Type

Description

family_id

INTEGER(10)

external reference to family.family_id

member_id

INTEGER(10)

external reference to member.member_id

cigar_line

MEDIUMTEXT

internal description of the multiple alignment (see homology_member table)

Protein Tree Tables

These are mostly relevant tables from the Ensemble Compara schema.

Unknown macro: {mockup}

protein_tree

Information related to an entire protein tree.

Name

Type

Description

protein_tree_id

INTEGER(10)

Unique identifier for the protein tree

family_id

INTEGER(10)

 

root_node_id

INTEGER(10)

ID of the root node for the tree.

protein_tree_attribute

This is a table to store information about the protein tree.

Name

Type

Description

protein_tree_attribute_id

INTEGER(10)

Primary Key, Unique Identifier

protein_tree_id

INTEGER(10)

 

cvterm_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

source_id

INTEGER(10)

 

protein_tree_member

The information for the individual proteins in the tree. I believe this holds the alignment information for the tree in the cigar lines

Name

Type

Description

node_id

INTEGER(10)

The node_id in protein tree this refers to

root_id

INTEGER(8)

The root_id of the tree this refers to. I think that this is the way that E/C refers to the set of nodes the comprise a tree

member_id

INTEGER(8)

Foreign key, The protein that is being represented

method_link_species_set_id

INTEGER(8)

External reference to method_link_species_set.method_link_species_set_id

cigar_line

TEXT

The alignment in CIGAR format

cigar_start

INTEGER(8)

The start of the alignment

cigar_end

INTEGER(8)

The end of the alignment

protein_tree_node

This table represents the topology of the gene tree.

Name

Type

Description

node_id

INTEGER(8)

Unique identifier for nodes in the protein tree

protein_tree_id

INTEGER(10)

This is an addition to the existing E/C schema. This is a unique identifier for the protein tree that the nodes are a member of.

parent_id

INTEGER(8)

This is the parent node for the node representend by node_id

root_id

INTEGER(8)

The id for the root node in the protein tree. By my reading of the E/C schema, this is the only way to refer to the gene tree as a group

clusterset_id

INTEGER(8)

This is the set of clusters (the gene family) represented by the gene tree

left_index

INTEGER(8)

Left indexing value, this can be used with right_index to select the child nodes of a given internal node in the gene tree.

right_index

INTEGER(8)

Right indexing value, this can be used with left_index to select the child nodes of a given internal node in the gene tree.

distance_to_parent

DOUBLE

This is essentially the edge length. This is the way that the data are currently represented in E/C, but it would be possible to make this information available as a protein_tree_node attribute value

protein_tree_node_attribute

This is a table to store information about the individual nodes in the protein tree.

Name

Type

Description

protein_node_attribute_id

INTEGER(10)

Primary Key, Unique Identifier

node_id

INTEGER(8)

The identifier of the node the attribute refers to.

cvterm_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

source_id

INTEGER(10)

 

protein_tree_node_path

This table holds information or transitive closure queries, ie the reachability and distance from an ancestral node to all of its children nodes. This table will be used for least common ancestor queries, and to select a subtree from a larger species tree.

Name

Type

Description

protein_tree_node_path_id

INTEGER(10)

Primary key.

parent_node_id

INTEGER(10)

The parent node being referenced

child_node_id

INTEGER(10)

The child node being referenced (not necessatily the direct descendent, but a child node

path

TEXT

This may be considered optional. This would store the node path from the parent to the child

distance

INTEGER(10)

The distance in number of edges from the parent node to the child node. This data can be used to minimize the distance for Least Common Ancestor (LCA) queries.

Species Tree Tables

These are the tables related to the species tree data.

Unknown macro: {mockup}

species_tree

Name

Type

Description

species_tree_id

INTEGER(10)

Primary key

species_tree_name

TEXT

The name of the species tree.

root_node_id

INTEGER(10)

The root node of the species tree. The current version of the TR database required rooted trees.

version

SMALLINT(3)

I am assuming that there will be different versions of species trees. The version information will make sure that the reconciled data match to the species tree being referred to here. This can be null

Example of data in this table:

species_tree_attribute

Name

Type

Description

species_tree_attribute_id

INTEGER(10)

Primary key

species_tree_id

INTEGER(10)

Foreign Key, The species tree being referenced

term_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

species_tree_node

Name

Type

Description

species_tree_node_id

INTEGER(10)

Primary Key

species_tree_id

INTEGER(10)

The species tree this node is a member of

parent_id

INTEGER(10)

The id of the parent node to this species_tree_node

left_index

INTEGER(10)

A left index value used to select clades

right_index

INTEGER(10)

A right index value used to select clades

Example of species tree nodes from the bowers_rosids tree:

species_tree_node_attribute

Name

Type

Description

species_tree_node_attribute_id

INTEGER(10)

Primary key

species_tree_node

INTEGER(10)

Foreign Key, The species tree node being referenced

term_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

species_tree_node_path

This table holds information or transitive closure queries, ie the reachability and distance from an ancestral node to all of its children nodes. This table will be used for least common ancestor queries, and to select a subtree from a larger species tree.

Name

Type

Description

parent_node_id

INTEGER(10)

The parent node being referenced

child_node_id

INTEGER(10)

The child node being referenced (not necessatily the direct descendent, but a child node

path

VARCHAR(255)

This may be considered optional. This would store the node path from the parent to the child

distance

INTEGER(10)

The distance in number of edges from the parent node to the child node. This data can be used to minimize the distance for Least Common Ancestor (LCA) queries.

Reconciliation Tables

The purpose of the reconciliation tables is to map the nodes of a gene tree to the nodes and edges of a species tree.

Unknown macro: {mockup}

reconciliation

This table describes the gene_tree and species tree that are being reconciled. This also allows for a subset of the species to be selected from the species tree with the species_set column.

Name

Type

Description

reconciliation_id

INTEGER(10)

Primary Key

species_tree_id

INTEGER(10)

The species tree that is serving as the host for the mapping of the protein tree

species_set_id

INTEGER(10)

The species from the large species tree that are being used in the reconciliation, this would probably best be represented as the list of nodes in the species tree that are the leaf nodes being represented

protein_tree_id

INTEGER(10)

The protein tree that is being mapped onto the species tree

reconciliation_attribute

This table contains information related to the overall mapping of the reconciliation itself.

Name

Type

Description

reconciliation_attribute_id

INTEGER(10)

Primary Key

reconciliation_id

INTEGER(10)

The id for the specific reconciliation being referenced

term_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

reconciliation_node

The mapping of an individual node from a gene tree to a node or edge in the species tree.

Name

Type

Description

reconciliation_node_id

INTEGER(10)

Primary Key, unique identifier for each row in the table.

reconciliation_id

INTEGER(10)

Foreign Key, The individual reconciliation the node maps is a member of

node_id

INTEGER(10)

Foreign Key, The node_id in the gene tree that this reconciliation mapping refers to.

host_parent_node_id

INTEGER(10)

The parent node in the species tree, this will be an ID from the species_tree_node_table

host_child_node_id

INTEGER(10)

The child node in the species tree, tis will be an ID from the species_tree_node table. NOTE: It is not required that the parent_node of this child be the parent node of the host_parent_node ID in the species tree.

is_on_node

BOOLEAN

TRUE if the node on the gene tree is located on a node in the species tree.

reconciliation_node_attribute

An attribute related to the node mapping of a node in the gene tree to a node in the species tree.

Name

Type

Description

reconciliation_node_attribute_id

INTEGER(10)

Primary Key

reconciliation_node_id

INTEGER(10)

Foreign Key, The reconciled node the attribute refers to

term_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

Controlled Vocabulary Tables

The controlled vocabulary tables incorporated into this schema were taken from the Chado CV module and converted from PostgreSQL syntax to MySQL tables. The following documentation is taken from the GMOD description of these tables.

Unknown macro: {mockup}

cv

The controlled vocabulary groups. Individual controlled vocabulary terms belong to individual controlled vocabularies as indicated by the cv_id in the cvterm table.

Name

Type

Description

cv_id

INTEGER(10)

 

name

VARCHAR(255)

 

definition

TEXT

 

For example, the ontologies loaded in the current implementation of the tree reconciliation database:

cvterm

A term, class, universal or type within an ontology or controlled vocabulary. This table is also used for relations and properties. cvterms constitute nodes in the graph defined by the collection of cvterms and cvterm_relationships.

Name 

Type

Description

cvterm_id

INTEGER(10)

Primary Key

cv_id

INTEGER(10)

The cv or ontology or namespace to which this cvterm belongs.

name

VARCHAR(950)

A concise human-readable name or label for the cvterm. Uniquely identifies a cvterm within a cv.

definition

TEXT

A human-readable text definition.

dbxref_id

INTEGER(10)

Primary identifier dbxref - The unique global OBO identifier for this cvterm. Note that a cvterm may have multiple secondary dbxrefs - see also table: cvterm_dbxref.

is_obsolete

TINYINT(1)

Boolean 0=false,1=true; see GO documentation for details of obsoletion. Note that two terms with different primary dbxrefs may exist if one is obsolete.

is_relationshiptype

TINYINT(1)

Boolean 0=false,1=true relations or relationship types (also known as Typedefs in OBO format, or as properties or slots) form a cv/ontology in themselves. We use this flag to indicate whether this cvterm is an actual term/class/universal or a relation. Relations may be drawn from the OBO Relations ontology, but are not exclusively drawn from there.

Note that relationship types among cvterms are also stored as cvterms:

The tree reconciliation database includes commonly use ontologies such as GO terms, but also controlled vocabularies for systems of tagging trees such as the PRIME format of representing reconciled trees. For example:

cvterm_dbxref

In addition to the primary identifier (cvterm.dbxref_id) a cvterm can have zero or more secondary identifiers/dbxrefs, which may refer to records in external databases. The exact semantics of cvterm_dbxref are not fixed. For example: the dbxref could be a pubmed ID that is pertinent to the cvterm, or it could be an equivalent or similar term in another ontology. For example, GO cvterms are typically linked to InterPro IDs, even though the nature of the relationship between them is largely one of statistical association. The dbxref may be have data records attached in the same database instance, or it could be a "hanging" dbxref pointing to some external database. NOTE: If the desired objective is to link two cvterms together, and the nature of the relation is known and holds for all instances of the subject cvterm then consider instead using cvterm_relationship together with a well-defined relation.

Name 

Type

Description

cvterm_dbxref_id

INTEGER

 

cvterm_id

INTEGER

 

dbxref_id

INTEGER


is_for_definition

INTEGER

A cvterm.definition should be supported by one or more references. If this column is true, the dbxref is not for a term in an external database - it is a dbxref for provenance information for the definition.

cvterm_relationship

A relationship linking two cvterms. Each cvterm_relationship constitutes an edge in the graph defined by the collection of cvterms and cvterm_relationships. The meaning of the cvterm_relationship depends on the definition of the cvterm R refered to by type_id. However, in general the definitions are such that the statement "all SUBJs REL some OBJ" is true. The cvterm_relationship statement is about the subject, not the object. For example "insect wing part_of thorax".

Name 

Type

Description

cvterm_relationship_id

INTEGER

 

type_id

INTEGER

The nature of the relationship between subject and object. Note that relations are also housed in the cvterm table, typically from the OBO relationship ontology, although other relationship types are allowed.

subject_id

INTEGER

The subject of the subj-predicate-obj sentence. The cvterm_relationship is about the subject. In a graph, this typically corresponds to the child node.

object_id

INTEGER

The object of the subj-predicate-obj sentence. The cvterm_relationship refers to the object. In a graph, this typically corresponds to the parent node.

cvtermpath

The reflexive transitive closure of the cvterm_relationship relation.

Name 

Type

Description

cvterm_relationship_id

INTEGER

 

type_id

INTEGER

The nature of the relationship between subject and object. Note that relations are also housed in the cvterm table, typically from the OBO relationship ontology, although other relationship types are allowed.

subject_id

INTEGER

The subject of the subj-predicate-obj sentence. The cvterm_relationship is about the subject. In a graph, this typically corresponds to the child node.

object_id

INTEGER

The object of the subj-predicate-obj sentence. The cvterm_relationship refers to the object. In a graph, this typically corresponds to the parent node.

cv_id

INTEGER

Closures will mostly be within one cv. If the closure of a relationship traverses a cv, then this refers to the cv of the object_id cvterm.

pathdistance

INTEGER

The number of steps required to get from the subject cvterm to the object cvterm, counting from zero (reflexive relationship).

cvtermprop

Additional extensible properties can be attached to a cvterm using this table. Corresponds to AnnotationProperty in W3C OWL format.

Name 

Type

Description

cvtermprop_id

INTEGER

 Primay key

cvterm_id

INTEGER

 

type_id

 

The name of the property or slot is a cvterm. The meaning of the property is defined in that cvterm.

value

INTEGER

The value of the property, represented as text. Numeric values are converted to their text representation.

rank

TEXT

Property-Value ordering. Any cvterm can have multiple values for any particular property type - these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used.

cvtermsynonym

A cvterm actually represents a distinct class or concept. A concept can be refered to by different phrases or names. In addition to the primary name (cvterm.name) there can be a number of alternative aliases or synonyms. For example, "T cell" as a synonym for "T lymphocyte".

Name 

Type

Description

cvtermsynonym_id

INTEGER

 Primay key

cvterm_id

INTEGER

 

synonym_id

 

 

type_id

INTEGER

A synonym can be exact, narrower, or broader than.

db

A database authority. Typical databases in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this shortened form, which is unique within the bioinformatics and biomedical realm. To Do - add support for URIs, URNs (e.g. LSIDs). We can do this by treating the URL as a URI - however, some applications may expect this to be resolvable - to be decided.

Name 

Type

Description

db_id

INTEGER

 

name

VARCHAR(255)

 

description

TEXT

 

urlprefix

 

 

url

 

 

dbxref

A unique, global, public, stable identifier. Not necessarily an external reference - can reference data items inside the particular chado instance being used. Typically a row in a table can be uniquely identified with a primary identifier (called dbxref_id); a table may also have secondary identifiers (in a linking table <T>_dbxref). A dbxref is generally written as <DB>:<ACCESSION> or as <DB>:<ACCESSION>:<VERSION>

Name 

Type

Description

dbxref_id

INTEGER(10)

 

db_id

INTEGER(10)

 

accession

VARCHAR(255)

The local part of the identifier. Guaranteed by the db authority to be unique for that db.

version

VARCHAR(255)

 

description

TEXT

 

dbxrefprop

Metadata about a dbxref. Note that this is not defined in the dbxref module, as it depends on the cvterm table. This table has a structure analagous to cvtermprop.

Name 

Type

Description

dbxrefprop_id

INTEGER

 Primay key

dbxref_id

INTEGER

 

type_id

INTEGER

 

value

TEXT

 

rank

INTEGER

 

  • No labels