SQL Queries - Controlled Vocabulary Terms

Gene Ontology Terms

A detailed description of Gene Ontology (GO) terms is available at http://www.geneontology.org/GO.doc.shtml.

Types of GO Terms

In general, there are three major classes of GO terms.

1) biological_process

A biological process is series of events accomplished by one or more ordered assemblies of molecular functions. Examples of broad biological process terms are cellular physiological process or signal transduction. Examples of more specific terms are pyrimidine metabolic process or alpha-glucoside transport. It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct steps.

2) cellular_component

A cellular component is just that, a component of a cell, but with the proviso that it is part of some larger object; this may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer). See the Documentation on the cellular component ontology for more details.

3) molecular_function

Molecular function describes activities, such as catalytic or binding activities, that occur at the molecular level. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where or when, or in what context, the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are catalytic activity, transporter activity, or binding; examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding.

GO Term Example

An example of an individual GO term in OBO format is shown below.

GO OBO Format
[Term]
id: GO:0000003
name: reproduction
namespace: biological_process
alt_id: GO:0019952
alt_id: GO:0050876
def: "The production by an organism of new individuals that contain some portion of their genetic material inherited from that organism." [GOC:go_curators, GOC:isa_complete, ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular Biology"]
subset: goslim_generic
subset: goslim_pir
subset: goslim_plant
subset: gosubset_prok
synonym: "reproductive physiological process" EXACT []
xref: Wikipedia:Reproduction
is_a: GO:0008150 ! biological_process

For additional examples of GO terms, see http://www.geneontology.org/GO_slims/goslim_generic.obo .

GO terms have an ID that is an integer (ie. 0000003) and a name that is a string (ie. reproduction). For the tree reconciliation database we are interested in fetching either the GO ID value or the name. We are also interested in the namespace that it is a part of (ie. biological_process, cellular component or molecular function).

GO Terms in the Tree Reconciliation Database

In the tree reconciliation database, the Gene Ontology (GO) terms are stored in the tables described in the Chado Controlled Vocabulary Module ([http://gmod.org/wiki/Chado_CV_Module|http://gmod.org/wiki/Chado_CV_Module]).

GO ID

The integers associated with a GO term is stored in the table dbxref in the column accession

GO Name

The name associated with a GO term is stored in the table cvterm in the column name.

GO Namespace

The namespace that the GO terms is under is stored in the table cvtermpath in the column object_id using* *the integer id of the controlled vocabulary term corresponding to the name space. This cvtermpath table, stores the transitive closure paths of parent and child relationships in the GO nomenclature. These are generally in 'part_of' relationships are 'is_a' relationships.

Additional information of transitive closure of GO terms in the cv module is available at http://gmod.org/wiki/Chado_CV_Module#Transitive_Closure as well as http://wiki.geneontology.org/index.php/Transitive_closure.

GO Assignments

The assignments of GO terms to individual genes are stored in the table member_attribute.

Populating the Database with GO Terms and Relationships

The GO terms are loaded to the tree reconciliation database using some prexisting code from GMOD along with some code written specifically for this.

GO Terms

All ontologies are loaded to the database by first converting the ontologies to chado xml using the program go2chadoxml and then loading these XML files to the database using the program stag-storenode.pl. An example of this is shown below

>go2chadoxml goslim_generic.obo > goslim_generic.chado.xml

>stag-storenode.pl -d 'dbi:mysql:dbname=tr_test;host=localhost' --user USERNAME --password SECRET goslim_generic.chado.xml

This populates the necessary columns in the cvterm set of tables.

GO Namespace

The type of GO term (biological_process, molecular_function etc) are stored in the table cvtermpath. The precomputed transitive closures for is_a and part_of relations were download from http://www.geneontology.org/scratch/transitive_closure/ as the file go_transitive_closure.links. This file is the result of running obo2linkfile on core GO. The program tr_import_go_transitive_closure.pl was used to populate the database with the is_a relations from the file go_tc_isa.links.txt which only includes the is_a relations and ignores the part_of relations.

This program is available at http://svn.iplantcollaborative.org/iptol/tree_reconciliation/scripts/tr_import_go_transitive_closure.pl.

The input file of is_a and part_of relations is on SVN at http://svn.iplantcollaborative.org/iptol/tree_reconciliation/ontologies/go_transitive_closure.links.txt

The input file that only includes is_a relations is also on SVN at http://svn.iplantcollaborative.org/iptol/tree_reconciliation/ontologies/go_tc_isa.links.txt.

Note: It also possible to populate this database without requiring the precomputed transitive closures by calculatring transitive closures using the relation information stored in the database. Tools are available to do this from GMOD cvs. I chose not to do it this way since this program depended on the module DBIX::Class::Tree::Nested set which required Perl v 5.10.0. I'm using 5.8.9 and was willing to update my entire perl distro just to update a single table.

GO Assignments

Sheldon determined GO assignments for each gene using the program ____. The program tr_import_go_assignments.pl loads these assignments to the database using the files that Sheldon provided as input.

This program is available at http://svn.iplantcollaborative.org/iptol/tree_reconciliation/scripts/tr_import_go_assignments.pl

Fetching GO Terms

Some examples of using the GO terms in the database are shown below

Getting the cvterm id for a GO integer ID

A Perlish subfunction for this is:

sub gonumber_2_cvterm_id {
    my ($dbh, $go_number) = @_;
    my ($sql, $cur, $result, @row);

    # The following will give the cvterm_id from the database
    # for the GO number (ie GO:0009987)
    $sql = "SELECT cvterm.cvterm_id".
	" FROM cvterm".
	" LEFT JOIN dbxref".
	" ON cvterm.dbxref_id=dbxref.dbxref_id".
	" LEFT JOIN db".
	" ON db.db_id=dbxref.db_id".
	" WHERE db.name = 'GO'".
	" AND dbxref.accession = '".$go_number."'";

    $cur = $dbh->prepare($sql);
    $cur->execute();
    @row=$cur->fetchrow;
    $result=$row[0];
    $cur->finish();

    return $result;

}

and in SQL the core query is

SELECT cvterm.cvterm_id
	FROM cvterm
	LEFT JOIN dbxref
	ON cvterm.dbxref_id=dbxref.dbxref_id
	LEFT JOIN db
	ON db.db_id=dbxref.db_id
	WHERE db.name = 'GO'
	AND dbxref.accession = ?;

-- For example to show information for the GO term illustrated above
SELECT * FROM cvterm
	LEFT JOIN dbxref
	ON cvterm.dbxref_id=dbxref.dbxref_id
	LEFT JOIN db
	ON db.db_id=dbxref.db_id
	WHERE db.name = 'GO'
	AND dbxref.accession = 3\G

*************************** 1. row ***************************
          cvterm_id: 171
              cv_id: 9
               name: reproduction
         definition: The production by an organism of new individuals that contain some portion of their genetic material inherited from that organism.
          dbxref_id: 338
        is_obsolete: 0
is_relationshiptype: 0
          dbxref_id: 338
              db_id: 66
          accession: 0000003
            version:
        description: NULL
              db_id: 66
               name: GO
        description: NULL
          urlprefix: NULL
                url: NULL
1 row in set (0.54 sec)


Fetching Relationship cvterm_id Value

The types of relationship among GO terms are also stored in the cvterm table. For example 'is_a' and 'part_of' have cvterm ids for using these relationships. So it may be necessary to fetch the cvterm_id for the relationship type.

A perlish example of this:

sub gorel_2_cvterm_id {
    my ($dbh, $rel_term) = @_;
    my ($sql, $cur, $result, @row);

    # The following will give the cvterm_id from the relationship
    # this should work for multiple ontologies since these ontologies
    # share the values used to set relationship types
    $sql = "SELECT cvterm_id".
	" FROM cvterm".
	" WHERE".
	" cvterm.name = \'".$rel_term."\'";

    # This could be narrowed by searching under OBO_REL for the db name

    print STDERR "\tSQL:\t".$sql."\n"
	if $verbose;

    $cur = $dbh->prepare($sql);
    $cur->execute();
    @row=$cur->fetchrow;
    $result=$row[0];
    $cur->finish();

    return $result;

}

To see all relationship types stored for all ontologies in the database:

mysql> select * from cvterm where is_relationshiptype=1;


+-----------+-------+----------------------+------------+-----------+-------------+---------------------+
| cvterm_id | cv_id | name                 | definition | dbxref_id | is_obsolete | is_relationshiptype |
+-----------+-------+----------------------+------------+-----------+-------------+---------------------+
|         3 |     1 | is_a                 |            |         3 |           0 |                   1 |
|         8 |     1 | part_of              |            |        10 |           0 |                   1 |
|         9 |     1 | has_part             | NULL       |        12 |           0 |                   1 |
|        10 |     1 | integral_part_of     |            |        14 |           0 |                   1 |
|        11 |     1 | has_integral_part    | NULL       |        16 |           0 |                   1 |
|        12 |     1 | proper_part_of       |            |        18 |           0 |                   1 |
|        13 |     1 | has_proper_part      | NULL       |        20 |           0 |                   1 |
|        14 |     1 | located_in           |            |        22 |           0 |                   1 |
|        15 |     1 | location_of          | NULL       |        24 |           0 |                   1 |
|        16 |     1 | contained_in         |            |        26 |           0 |                   1 |
|        17 |     1 | contains             | NULL       |        28 |           0 |                   1 |
|        18 |     1 | adjacent_to          |            |        30 |           0 |                   1 |
|        19 |     1 | transformation_of    |            |        32 |           0 |                   1 |
|        20 |     1 | transformed_into     | NULL       |        34 |           1 |                   1 |
|        21 |     1 | derives_from         |            |        36 |           0 |                   1 |
|        22 |     1 | derived_into         | NULL       |        38 |           0 |                   1 |
|        23 |     1 | preceded_by          |            |        40 |           0 |                   1 |
|        24 |     1 | precedes             | NULL       |        42 |           0 |                   1 |
|        25 |     1 | has_participant      |            |        44 |           0 |                   1 |
|        26 |     1 | participates_in      | NULL       |        46 |           0 |                   1 |
|        27 |     1 | has_agent            |            |        48 |           0 |                   1 |
|        28 |     1 | agent_in             | NULL       |        50 |           0 |                   1 |
|        29 |     1 | instance_of          |            |        52 |           0 |                   1 |
|        30 |     1 | has_improper_part    | NULL       |        54 |           1 |                   1 |
|        31 |     1 | improper_part_of     |            |        56 |           1 |                   1 |
|        32 |     1 | relationship         |            |        58 |           1 |                   1 |
|       166 |    12 | negatively_regulates | NULL       |       328 |           0 |                   1 |
|       167 |    12 | positively_regulates | NULL       |       329 |           0 |                   1 |
|       168 |    12 | regulates            | NULL       |       330 |           0 |                   1 |
+-----------+-------+----------------------+------------+-----------+-------------+---------------------+
29 rows in set (0.49 sec)

Fetching a Count of GO Terms for a Gene Tree.

Perlish code is

my $biol_process_id = get_object_id($dbh,"biological_process");
#my $cell_component_id = get_object_id($dbh,"cellular_component");
#my $mol_function_id = get_object_id($dbh,"molecular_function");
my $go_count_sql = "SELECT cvterm.name,COUNT(*)".
    " FROM protein_tree_node".
    " LEFT JOIN protein_tree_member ON".
    " protein_tree_node.node_id = protein_tree_member.node_id".
    " LEFT JOIN member ON".
    " protein_tree_member.member_id = member.member_id".
    " LEFT JOIN member_attribute ON".
    " member.member_id = member_attribute.member_id".
    " LEFT JOIN cvterm ON".
    " member_attribute.cvterm_id = cvterm.cvterm_id".
    " LEFT JOIN dbxref ON".
    " cvterm.dbxref_id = dbxref.dbxref_id".
    " LEFT JOIN db ON".
    " dbxref.db_id = db.db_id".
    " LEFT JOIN cvtermpath ON".
    " cvtermpath.subject_id = cvterm.cvterm_id".
    " WHERE db.name = \"GO\"".
    " AND cvtermpath.object_id = \'".$biol_process_id."\'".
    " AND protein_tree_node.protein_tree_id = ".$tree_id.
    " GROUP BY name";

It is also possible to do this without fetching $biol_process_id separately, but this is substantially slower.

my $go_count_sql = "SELECT cvterm.name,COUNT(*)".
    " FROM protein_tree_node".
    " LEFT JOIN protein_tree_member ON".
    " protein_tree_node.node_id = protein_tree_member.node_id".
    " LEFT JOIN member ON".
    " protein_tree_member.member_id = member.member_id".
    " LEFT JOIN member_attribute ON".
    " member.member_id = member_attribute.member_id".
    " LEFT JOIN cvterm ON".
    " member_attribute.cvterm_id = cvterm.cvterm_id".
    " LEFT JOIN dbxref ON".
    " cvterm.dbxref_id = dbxref.dbxref_id".
    " LEFT JOIN db ON".
    " dbxref.db_id = db.db_id".
    " LEFT JOIN cvtermpath AS cv_subject ON".
    " cv_subject.subject_id = cvterm.cvterm_id".
    " LEFT JOIN cvterm AS cv_object ON".
    " cv_object.cvterm_id = cv_subject.object_id".
    " WHERE db.name = \"GO\"".
    " AND cv_object.name = \'biological_process\'".
    " AND protein_tree_node.protein_tree_id = ".$tree_id.
    " GROUP BY name";

For example:

SELECT cvterm.name,COUNT(*)
    FROM protein_tree_node
    LEFT JOIN protein_tree_member ON
    protein_tree_node.node_id = protein_tree_member.node_id
    LEFT JOIN member ON
    protein_tree_member.member_id = member.member_id
    LEFT JOIN member_attribute ON
    member.member_id = member_attribute.member_id
    LEFT JOIN cvterm ON
    member_attribute.cvterm_id = cvterm.cvterm_id
    LEFT JOIN dbxref ON
    cvterm.dbxref_id = dbxref.dbxref_id
    LEFT JOIN db ON
    dbxref.db_id = db.db_id
    LEFT JOIN cvtermpath AS cv_subject ON
    cv_subject.subject_id = cvterm.cvterm_id
    LEFT JOIN cvterm AS cv_object ON
    cv_object.cvterm_id = cv_subject.object_id
    WHERE db.name = 'GO'
    AND cv_object.name = 'biological_process'
    AND protein_tree_node.protein_tree_id = 100
    GROUP BY name;

-- Results in:
+---------------------------------------------------------+----------+
| name                                                    | COUNT(*) |
+---------------------------------------------------------+----------+
| circadian rhythm                                        |       23 |
| cytokinin mediated signaling pathway                    |       23 |
| embryo development ending in seed dormancy              |       23 |
| red light signaling pathway                             |       23 |
| red or far-red light signaling pathway                  |       23 |
| regulation of circadian rhythm                          |       23 |
| regulation of transcription                             |       23 |
| regulation of transcription, DNA-dependent              |       23 |
| response to chitin                                      |       23 |
| response to cytokinin stimulus                          |       23 |
| response to red light                                   |       23 |
| response to stress                                      |       19 |
| stem cell maintenance                                   |       23 |
| transcription                                           |       23 |
| two-component signal transduction system (phosphorelay) |       23 |
+---------------------------------------------------------+----------+
15 rows in set (28.48 sec)

But the following is much quicker:

SELECT cvterm.name,COUNT(*)
    FROM protein_tree_node
    LEFT JOIN protein_tree_member ON
    protein_tree_node.node_id = protein_tree_member.node_id
    LEFT JOIN member ON
    protein_tree_member.member_id = member.member_id
    LEFT JOIN member_attribute ON
    member.member_id = member_attribute.member_id
    LEFT JOIN cvterm ON
    member_attribute.cvterm_id = cvterm.cvterm_id
    LEFT JOIN dbxref ON
    cvterm.dbxref_id = dbxref.dbxref_id
    LEFT JOIN db ON
    dbxref.db_id = db.db_id
    LEFT JOIN cvtermpath ON
    cvtermpath.subject_id = cvterm.cvterm_id
    WHERE db.name = 'GO'
    AND cvtermpath.object_id = 6377
    AND protein_tree_node.protein_tree_id = 100
    GROUP BY name

-- First Time as
+---------------------------------------------------------+----------+
| name                                                    | COUNT(*) |
+---------------------------------------------------------+----------+
| circadian rhythm                                        |       23 |
| cytokinin mediated signaling pathway                    |       23 |
| embryo development ending in seed dormancy              |       23 |
| red light signaling pathway                             |       23 |
| red or far-red light signaling pathway                  |       23 |
| regulation of circadian rhythm                          |       23 |
| regulation of transcription                             |       23 |
| regulation of transcription, DNA-dependent              |       23 |
| response to chitin                                      |       23 |
| response to cytokinin stimulus                          |       23 |
| response to red light                                   |       23 |
| response to stress                                      |       19 |
| stem cell maintenance                                   |       23 |
| transcription                                           |       23 |
| two-component signal transduction system (phosphorelay) |       23 |
+---------------------------------------------------------+----------+
15 rows in set (1.01 sec)

-- Followup time as
+---------------------------------------------------------+----------+
| name                                                    | COUNT(*) |
+---------------------------------------------------------+----------+
| circadian rhythm                                        |       23 |
| cytokinin mediated signaling pathway                    |       23 |
| embryo development ending in seed dormancy              |       23 |
| red light signaling pathway                             |       23 |
| red or far-red light signaling pathway                  |       23 |
| regulation of circadian rhythm                          |       23 |
| regulation of transcription                             |       23 |
| regulation of transcription, DNA-dependent              |       23 |
| response to chitin                                      |       23 |
| response to cytokinin stimulus                          |       23 |
| response to red light                                   |       23 |
| response to stress                                      |       19 |
| stem cell maintenance                                   |       23 |
| transcription                                           |       23 |
| two-component signal transduction system (phosphorelay) |       23 |
+---------------------------------------------------------+----------+
15 rows in set (0.07 sec)