SQL Queries - Controlled Vocabulary Terms
Gene Ontology Terms
A detailed description of Gene Ontology (GO) terms is available at http://www.geneontology.org/GO.doc.shtml.
Types of GO Terms
In general, there are three major classes of GO terms.
1) biological_process
A biological process is series of events accomplished by one or more ordered assemblies of molecular functions. Examples of broad biological process terms are cellular physiological process or signal transduction. Examples of more specific terms are pyrimidine metabolic process or alpha-glucoside transport. It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct steps.
2) cellular_component
A cellular component is just that, a component of a cell, but with the proviso that it is part of some larger object; this may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer). See the Documentation on the cellular component ontology for more details.
3) molecular_function
Molecular function describes activities, such as catalytic or binding activities, that occur at the molecular level. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where or when, or in what context, the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are catalytic activity, transporter activity, or binding; examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding.
GO Term Example
An example of an individual GO term in OBO format is shown below.
[Term] id: GO:0000003 name: reproduction namespace: biological_process alt_id: GO:0019952 alt_id: GO:0050876 def: "The production by an organism of new individuals that contain some portion of their genetic material inherited from that organism." [GOC:go_curators, GOC:isa_complete, ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular Biology"] subset: goslim_generic subset: goslim_pir subset: goslim_plant subset: gosubset_prok synonym: "reproductive physiological process" EXACT [] xref: Wikipedia:Reproduction is_a: GO:0008150 ! biological_process
For additional examples of GO terms, see http://www.geneontology.org/GO_slims/goslim_generic.obo .
GO terms have an ID that is an integer (ie. 0000003) and a name that is a string (ie. reproduction). For the tree reconciliation database we are interested in fetching either the GO ID value or the name. We are also interested in the namespace that it is a part of (ie. biological_process, cellular component or molecular function).
GO Terms in the Tree Reconciliation Database
In the tree reconciliation database, the Gene Ontology (GO) terms are stored in the tables described in the Chado Controlled Vocabulary Module ([http://gmod.org/wiki/Chado_CV_Module|http://gmod.org/wiki/Chado_CV_Module]).
GO ID
The integers associated with a GO term is stored in the table dbxref in the column accession.
GO Name
The name associated with a GO term is stored in the table cvterm in the column name.
GO Namespace
The namespace that the GO terms is under is stored in the table cvtermpath in the column object_id using* *the integer id of the controlled vocabulary term corresponding to the name space. This cvtermpath table, stores the transitive closure paths of parent and child relationships in the GO nomenclature. These are generally in 'part_of' relationships are 'is_a' relationships.
Additional information of transitive closure of GO terms in the cv module is available at http://gmod.org/wiki/Chado_CV_Module#Transitive_Closure as well as http://wiki.geneontology.org/index.php/Transitive_closure.
GO Assignments
The assignments of GO terms to individual genes are stored in the table member_attribute.
Populating the Database with GO Terms and Relationships
The GO terms are loaded to the tree reconciliation database using some prexisting code from GMOD along with some code written specifically for this.
GO Terms
All ontologies are loaded to the database by first converting the ontologies to chado xml using the program go2chadoxml and then loading these XML files to the database using the program stag-storenode.pl. An example of this is shown below
>go2chadoxml goslim_generic.obo > goslim_generic.chado.xml
>stag-storenode.pl -d 'dbi:mysql:dbname=tr_test;host=localhost' --user USERNAME --password SECRET goslim_generic.chado.xml
This populates the necessary columns in the cvterm set of tables.
GO Namespace
The type of GO term (biological_process, molecular_function etc) are stored in the table cvtermpath. The precomputed transitive closures for is_a and part_of relations were download from http://www.geneontology.org/scratch/transitive_closure/ as the file go_transitive_closure.links. This file is the result of running obo2linkfile on core GO. The program tr_import_go_transitive_closure.pl was used to populate the database with the is_a relations from the file go_tc_isa.links.txt which only includes the is_a relations and ignores the part_of relations.
This program is available at http://svn.iplantcollaborative.org/iptol/tree_reconciliation/scripts/tr_import_go_transitive_closure.pl.
The input file of is_a and part_of relations is on SVN at http://svn.iplantcollaborative.org/iptol/tree_reconciliation/ontologies/go_transitive_closure.links.txt
The input file that only includes is_a relations is also on SVN at http://svn.iplantcollaborative.org/iptol/tree_reconciliation/ontologies/go_tc_isa.links.txt.
Note: It also possible to populate this database without requiring the precomputed transitive closures by calculatring transitive closures using the relation information stored in the database. Tools are available to do this from GMOD cvs. I chose not to do it this way since this program depended on the module DBIX::Class::Tree::Nested set which required Perl v 5.10.0. I'm using 5.8.9 and was willing to update my entire perl distro just to update a single table.
GO Assignments
Sheldon determined GO assignments for each gene using the program ____. The program tr_import_go_assignments.pl loads these assignments to the database using the files that Sheldon provided as input.
This program is available at http://svn.iplantcollaborative.org/iptol/tree_reconciliation/scripts/tr_import_go_assignments.pl
Fetching GO Terms
Some examples of using the GO terms in the database are shown below
Getting the cvterm id for a GO integer ID
A Perlish subfunction for this is:
sub gonumber_2_cvterm_id { my ($dbh, $go_number) = @_; my ($sql, $cur, $result, @row); # The following will give the cvterm_id from the database # for the GO number (ie GO:0009987) $sql = "SELECT cvterm.cvterm_id". " FROM cvterm". " LEFT JOIN dbxref". " ON cvterm.dbxref_id=dbxref.dbxref_id". " LEFT JOIN db". " ON db.db_id=dbxref.db_id". " WHERE db.name = 'GO'". " AND dbxref.accession = '".$go_number."'"; $cur = $dbh->prepare($sql); $cur->execute(); @row=$cur->fetchrow; $result=$row[0]; $cur->finish(); return $result; }
and in SQL the core query is
SELECT cvterm.cvterm_id FROM cvterm LEFT JOIN dbxref ON cvterm.dbxref_id=dbxref.dbxref_id LEFT JOIN db ON db.db_id=dbxref.db_id WHERE db.name = 'GO' AND dbxref.accession = ?; -- For example to show information for the GO term illustrated above SELECT * FROM cvterm LEFT JOIN dbxref ON cvterm.dbxref_id=dbxref.dbxref_id LEFT JOIN db ON db.db_id=dbxref.db_id WHERE db.name = 'GO' AND dbxref.accession = 3\G *************************** 1. row *************************** cvterm_id: 171 cv_id: 9 name: reproduction definition: The production by an organism of new individuals that contain some portion of their genetic material inherited from that organism. dbxref_id: 338 is_obsolete: 0 is_relationshiptype: 0 dbxref_id: 338 db_id: 66 accession: 0000003 version: description: NULL db_id: 66 name: GO description: NULL urlprefix: NULL url: NULL 1 row in set (0.54 sec)
Fetching Relationship cvterm_id Value
The types of relationship among GO terms are also stored in the cvterm table. For example 'is_a' and 'part_of' have cvterm ids for using these relationships. So it may be necessary to fetch the cvterm_id for the relationship type.
A perlish example of this:
sub gorel_2_cvterm_id { my ($dbh, $rel_term) = @_; my ($sql, $cur, $result, @row); # The following will give the cvterm_id from the relationship # this should work for multiple ontologies since these ontologies # share the values used to set relationship types $sql = "SELECT cvterm_id". " FROM cvterm". " WHERE". " cvterm.name = \'".$rel_term."\'"; # This could be narrowed by searching under OBO_REL for the db name print STDERR "\tSQL:\t".$sql."\n" if $verbose; $cur = $dbh->prepare($sql); $cur->execute(); @row=$cur->fetchrow; $result=$row[0]; $cur->finish(); return $result; }
To see all relationship types stored for all ontologies in the database:
mysql> select * from cvterm where is_relationshiptype=1; +-----------+-------+----------------------+------------+-----------+-------------+---------------------+ | cvterm_id | cv_id | name | definition | dbxref_id | is_obsolete | is_relationshiptype | +-----------+-------+----------------------+------------+-----------+-------------+---------------------+ | 3 | 1 | is_a | | 3 | 0 | 1 | | 8 | 1 | part_of | | 10 | 0 | 1 | | 9 | 1 | has_part | NULL | 12 | 0 | 1 | | 10 | 1 | integral_part_of | | 14 | 0 | 1 | | 11 | 1 | has_integral_part | NULL | 16 | 0 | 1 | | 12 | 1 | proper_part_of | | 18 | 0 | 1 | | 13 | 1 | has_proper_part | NULL | 20 | 0 | 1 | | 14 | 1 | located_in | | 22 | 0 | 1 | | 15 | 1 | location_of | NULL | 24 | 0 | 1 | | 16 | 1 | contained_in | | 26 | 0 | 1 | | 17 | 1 | contains | NULL | 28 | 0 | 1 | | 18 | 1 | adjacent_to | | 30 | 0 | 1 | | 19 | 1 | transformation_of | | 32 | 0 | 1 | | 20 | 1 | transformed_into | NULL | 34 | 1 | 1 | | 21 | 1 | derives_from | | 36 | 0 | 1 | | 22 | 1 | derived_into | NULL | 38 | 0 | 1 | | 23 | 1 | preceded_by | | 40 | 0 | 1 | | 24 | 1 | precedes | NULL | 42 | 0 | 1 | | 25 | 1 | has_participant | | 44 | 0 | 1 | | 26 | 1 | participates_in | NULL | 46 | 0 | 1 | | 27 | 1 | has_agent | | 48 | 0 | 1 | | 28 | 1 | agent_in | NULL | 50 | 0 | 1 | | 29 | 1 | instance_of | | 52 | 0 | 1 | | 30 | 1 | has_improper_part | NULL | 54 | 1 | 1 | | 31 | 1 | improper_part_of | | 56 | 1 | 1 | | 32 | 1 | relationship | | 58 | 1 | 1 | | 166 | 12 | negatively_regulates | NULL | 328 | 0 | 1 | | 167 | 12 | positively_regulates | NULL | 329 | 0 | 1 | | 168 | 12 | regulates | NULL | 330 | 0 | 1 | +-----------+-------+----------------------+------------+-----------+-------------+---------------------+ 29 rows in set (0.49 sec)
Fetching a Count of GO Terms for a Gene Tree.
Perlish code is
my $biol_process_id = get_object_id($dbh,"biological_process"); #my $cell_component_id = get_object_id($dbh,"cellular_component"); #my $mol_function_id = get_object_id($dbh,"molecular_function"); my $go_count_sql = "SELECT cvterm.name,COUNT(*)". " FROM protein_tree_node". " LEFT JOIN protein_tree_member ON". " protein_tree_node.node_id = protein_tree_member.node_id". " LEFT JOIN member ON". " protein_tree_member.member_id = member.member_id". " LEFT JOIN member_attribute ON". " member.member_id = member_attribute.member_id". " LEFT JOIN cvterm ON". " member_attribute.cvterm_id = cvterm.cvterm_id". " LEFT JOIN dbxref ON". " cvterm.dbxref_id = dbxref.dbxref_id". " LEFT JOIN db ON". " dbxref.db_id = db.db_id". " LEFT JOIN cvtermpath ON". " cvtermpath.subject_id = cvterm.cvterm_id". " WHERE db.name = \"GO\"". " AND cvtermpath.object_id = \'".$biol_process_id."\'". " AND protein_tree_node.protein_tree_id = ".$tree_id. " GROUP BY name";
It is also possible to do this without fetching $biol_process_id separately, but this is substantially slower.
my $go_count_sql = "SELECT cvterm.name,COUNT(*)". " FROM protein_tree_node". " LEFT JOIN protein_tree_member ON". " protein_tree_node.node_id = protein_tree_member.node_id". " LEFT JOIN member ON". " protein_tree_member.member_id = member.member_id". " LEFT JOIN member_attribute ON". " member.member_id = member_attribute.member_id". " LEFT JOIN cvterm ON". " member_attribute.cvterm_id = cvterm.cvterm_id". " LEFT JOIN dbxref ON". " cvterm.dbxref_id = dbxref.dbxref_id". " LEFT JOIN db ON". " dbxref.db_id = db.db_id". " LEFT JOIN cvtermpath AS cv_subject ON". " cv_subject.subject_id = cvterm.cvterm_id". " LEFT JOIN cvterm AS cv_object ON". " cv_object.cvterm_id = cv_subject.object_id". " WHERE db.name = \"GO\"". " AND cv_object.name = \'biological_process\'". " AND protein_tree_node.protein_tree_id = ".$tree_id. " GROUP BY name";
For example:
SELECT cvterm.name,COUNT(*) FROM protein_tree_node LEFT JOIN protein_tree_member ON protein_tree_node.node_id = protein_tree_member.node_id LEFT JOIN member ON protein_tree_member.member_id = member.member_id LEFT JOIN member_attribute ON member.member_id = member_attribute.member_id LEFT JOIN cvterm ON member_attribute.cvterm_id = cvterm.cvterm_id LEFT JOIN dbxref ON cvterm.dbxref_id = dbxref.dbxref_id LEFT JOIN db ON dbxref.db_id = db.db_id LEFT JOIN cvtermpath AS cv_subject ON cv_subject.subject_id = cvterm.cvterm_id LEFT JOIN cvterm AS cv_object ON cv_object.cvterm_id = cv_subject.object_id WHERE db.name = 'GO' AND cv_object.name = 'biological_process' AND protein_tree_node.protein_tree_id = 100 GROUP BY name; -- Results in: +---------------------------------------------------------+----------+ | name | COUNT(*) | +---------------------------------------------------------+----------+ | circadian rhythm | 23 | | cytokinin mediated signaling pathway | 23 | | embryo development ending in seed dormancy | 23 | | red light signaling pathway | 23 | | red or far-red light signaling pathway | 23 | | regulation of circadian rhythm | 23 | | regulation of transcription | 23 | | regulation of transcription, DNA-dependent | 23 | | response to chitin | 23 | | response to cytokinin stimulus | 23 | | response to red light | 23 | | response to stress | 19 | | stem cell maintenance | 23 | | transcription | 23 | | two-component signal transduction system (phosphorelay) | 23 | +---------------------------------------------------------+----------+ 15 rows in set (28.48 sec)
But the following is much quicker:
SELECT cvterm.name,COUNT(*) FROM protein_tree_node LEFT JOIN protein_tree_member ON protein_tree_node.node_id = protein_tree_member.node_id LEFT JOIN member ON protein_tree_member.member_id = member.member_id LEFT JOIN member_attribute ON member.member_id = member_attribute.member_id LEFT JOIN cvterm ON member_attribute.cvterm_id = cvterm.cvterm_id LEFT JOIN dbxref ON cvterm.dbxref_id = dbxref.dbxref_id LEFT JOIN db ON dbxref.db_id = db.db_id LEFT JOIN cvtermpath ON cvtermpath.subject_id = cvterm.cvterm_id WHERE db.name = 'GO' AND cvtermpath.object_id = 6377 AND protein_tree_node.protein_tree_id = 100 GROUP BY name -- First Time as +---------------------------------------------------------+----------+ | name | COUNT(*) | +---------------------------------------------------------+----------+ | circadian rhythm | 23 | | cytokinin mediated signaling pathway | 23 | | embryo development ending in seed dormancy | 23 | | red light signaling pathway | 23 | | red or far-red light signaling pathway | 23 | | regulation of circadian rhythm | 23 | | regulation of transcription | 23 | | regulation of transcription, DNA-dependent | 23 | | response to chitin | 23 | | response to cytokinin stimulus | 23 | | response to red light | 23 | | response to stress | 19 | | stem cell maintenance | 23 | | transcription | 23 | | two-component signal transduction system (phosphorelay) | 23 | +---------------------------------------------------------+----------+ 15 rows in set (1.01 sec) -- Followup time as +---------------------------------------------------------+----------+ | name | COUNT(*) | +---------------------------------------------------------+----------+ | circadian rhythm | 23 | | cytokinin mediated signaling pathway | 23 | | embryo development ending in seed dormancy | 23 | | red light signaling pathway | 23 | | red or far-red light signaling pathway | 23 | | regulation of circadian rhythm | 23 | | regulation of transcription | 23 | | regulation of transcription, DNA-dependent | 23 | | response to chitin | 23 | | response to cytokinin stimulus | 23 | | response to red light | 23 | | response to stress | 19 | | stem cell maintenance | 23 | | transcription | 23 | | two-component signal transduction system (phosphorelay) | 23 | +---------------------------------------------------------+----------+ 15 rows in set (0.07 sec)