SQL Queries - Queries on Reconciled Trees
Example species tree
These queries use the example species tree below:
Find all gene trees having a speciation event at a specific point on the species tree
The goal is to find gene trees that have a bifurcation event for a specific point on the species trees. This could be at a node on the species tree for speciation events, or at an edge on the species trees for duplication events. For example, to find the gene trees that have a speciation event at the common ancestor of poplar,cucumber and soybean we would use node_id = 4.
To find the reconciliation_nodes that map to a specific node on the species tree, we can make use of the is_on_node BOOLEAN:
SELECT reconciliation_id FROM reconciliation_node WHERE host_child_node_id = 4 AND is_on_node = TRUE;
this results in
+-------------------+ | reconciliation_id | +-------------------+ | 4 | | 5 | | 6 | | 7 | | 9 | | 9 | ... | 2539 | | 2540 | | 2540 | | 2540 | +-------------------+ 1466 rows in set (0.15 sec)
Alternatively we can use a query for host_childe_node_id = host_parent_node_id. I put this here since when first done, this query takes twice as long a the boolean query above, but when cached, this query is about twice as fast as the boolean query.
SELECT reconciliation_id FROM reconciliation_node WHERE host_child_node_id = 4 AND host_child_node_id = host_parent_node_id;
To convert these reconciliation_ids to protein trees we could build on the above queries:
SELECT DISTINCT protein_tree_id FROM reconciliation WHERE reconciliation_id IN ( SELECT reconciliation_id FROM reconciliation_node WHERE host_child_node_id = 4 AND is_on_node = TRUE );
but this is quite slow (~ 15.8 seconds), so it is better to use joins (~ 0.3 seconds)
SELECT DISTINCT protein_tree_id FROM reconciliation LEFT JOIN reconciliation_node ON reconciliation.reconciliation_id = reconciliation_node.reconciliation_id WHERE reconciliation_node.host_child_node_id = 4 AND reconciliation_node.is_on_node = TRUE;
If there is more
Find all gene trees having a duplication event at a specific point on the species tree
To find duplication events, we search for gene trees node that map to an edge on the species tree. Edges are delineated by the host and parent node and the host child node. Fore example, the branch leading up to the clade containing cucumber and soybean is defined by having the host_parent node = 4 and the host_child_node = 6.
To find duplication events, for a specific branch on the species tree (ie parent_node = 4 and child_node = 6):
-- -- QUERY TO FIND TREE WITH DUPLICATION AT -- A SPECIFIC BRANCH ON THE SPECIES TREE -- SELECT DISTINCT protein_tree_id FROM reconciliation LEFT JOIN reconciliation_node ON reconciliation.reconciliation_id = reconciliation_node.reconciliation_id WHERE reconciliation_node.host_child_node_id = 6 AND reconciliation_node.host_parent_node_id = 4; +-----------------+ | protein_tree_id | +-----------------+ | 14 | | 54 | | 96 | ... | 1978 | | 2201 | | 2395 | | 2491 | +-----------------+ 33 rows in set (0.15 sec)
Identify the Duplication Nodes and Speciation Nodes in a Gene Tree
For databases that only contain duplications and speciations (ie. no horizontal transfer events), the duplication nodes will be those nodes where reconciliation_node.is_on_node is false. For example, with the gene tree with the protein_tree_id = 1337.
SELECT protein_tree_node.node_id, reconciliation_node.host_parent_node_id, reconciliation_node.host_child_node_id, reconciliation_node.is_on_node, reconciliation_node.reconciliation_id FROM protein_tree_node RIGHT JOIN reconciliation_node ON protein_tree_node.node_id = reconciliation_node.node_id WHERE protein_tree_node.protein_tree_id = 1337;
Should give results like:
+---------+---------------------+--------------------+------------+-------------------+ | node_id | host_parent_node_id | host_child_node_id | is_on_node | reconciliation_id | +---------+---------------------+--------------------+------------+-------------------+ | 30527 | 3 | 3 | 1 | 1337 | | 30528 | 9 | 9 | 1 | 1337 | | 30529 | 10 | 10 | 1 | 1337 | | 30530 | 11 | 11 | 1 | 1337 | | 30531 | 4 | 4 | 1 | 1337 | | 30532 | 7 | 7 | 1 | 1337 | | 30533 | 4 | 5 | 0 | 1337 | | 30534 | 5 | 5 | 1 | 1337 | | 30535 | 5 | 5 | 1 | 1337 | | 30536 | 1 | 1 | 1 | 1337 | | 30537 | 6 | 8 | 0 | 1337 | | 30538 | 8 | 8 | 1 | 1337 | | 30539 | 8 | 8 | 1 | 1337 | | 30540 | 1 | 1 | 1 | 1337 | +---------+---------------------+--------------------+------------+-------------------+
The nodes 30533 and 30537 represent duplication events in the gene tree. Node 30533 has a duplication on the branch on the species tree that is between node 4 and 5, which is the branch leading up to poplar. Node 30537 has a duplication that is on the branch in the species tree between nodes 6 and 8 which is the branch leading up to soybean.
Find Orthologous Loci in Reconciled Trees
Given a reconciled gene tree in which the nodes are annotated as duplication (is_on_node = 0) or speciation nodes (is_on_node = 1), it is possible to return all leaf nodes that are orthologous to a query leaf node.
For example, in the tree below we may want to identify the orthologs of leaf node 44706 for the duplication event shown as node 44701.
Click on thumbnail to enlarge.
The variables for the query is the duplication node of interest (ie. node 44701) as well as the query leaf node of interest (ie. node 44706):
SELECT * FROM protein_tree_node WHERE protein_tree_node.left_index IN ( SELECT np.left_index FROM protein_tree_node nc JOIN protein_tree_node np ON nc.left_index < np.left_index AND nc.right_index > np.right_index WHERE nc.node_id = ( SELECT node_id FROM protein_tree_node WHERE parent_id = 44701 AND left_index < ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 ) AND right_index > ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 ) ) ) AND protein_tree_node.right_index = protein_tree_node.left_index + 1;
This query returns the leaf nodes that are the orthologous set, including the query node in the result:
+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+ | node_id | protein_tree_id | parent_id | root_id | clusterset_id | left_index | right_index | distance_to_parent | +---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+ | 44703 | 2191 | 44702 | 44693 | 0 | 89401 | 89402 | 1 | | 44705 | 2191 | 44704 | 44693 | 0 | 89404 | 89405 | 1 | | 44706 | 2191 | 44704 | 44693 | 0 | 89406 | 89407 | 1 | +---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+
Another example for the same reconciled tree is to use node 44694 as the duplication node of interest.
Click thumbnail above for full size image
As a query this is:
SELECT * FROM protein_tree_node WHERE protein_tree_node.left_index IN ( SELECT np.left_index FROM protein_tree_node nc JOIN protein_tree_node np ON nc.left_index < np.left_index AND nc.right_index > np.right_index WHERE nc.node_id = ( SELECT node_id FROM protein_tree_node WHERE parent_id = 44694 -- DUPLICATION NODE AND left_index < ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 -- LEAF NODE ID ) AND right_index > ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 -- LEAF NODE ID ) ) ) AND protein_tree_node.right_index = protein_tree_node.left_index + 1; -- Returns leaf nodes
and this returns:
+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+ | node_id | protein_tree_id | parent_id | root_id | clusterset_id | left_index | right_index | distance_to_parent | +---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+ | 44703 | 2191 | 44702 | 44693 | 0 | 89401 | 89402 | 1 | | 44705 | 2191 | 44704 | 44693 | 0 | 89404 | 89405 | 1 | | 44706 | 2191 | 44704 | 44693 | 0 | 89406 | 89407 | 1 | | 44707 | 2191 | 44701 | 44693 | 0 | 89410 | 89411 | 1 | | 44708 | 2191 | 44700 | 44693 | 0 | 89413 | 89414 | 1 | +---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+ 5 rows in set (0.96 sec)
This can be modified to a single use of the leaf node id variable by using joins to reduce this SQL statement to:
SELECT * FROM protein_tree_node WHERE protein_tree_node.left_index IN ( SELECT np.left_index FROM protein_tree_node nc JOIN protein_tree_node np ON nc.left_index < np.left_index AND nc.right_index > np.right_index WHERE nc.node_id = ( SELECT np.node_id FROM protein_tree_node nc JOIN protein_tree_node np ON nc.left_index > np.left_index AND nc.left_index < np.right_index WHERE nc.node_id = 44706 AND np.parent_id = 44694 ) ) AND protein_tree_node.right_index = protein_tree_node.left_index + 1;
This also returns:
+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+ | node_id | protein_tree_id | parent_id | root_id | clusterset_id | left_index | right_index | distance_to_parent | +---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+ | 44703 | 2191 | 44702 | 44693 | 0 | 89401 | 89402 | 1 | | 44705 | 2191 | 44704 | 44693 | 0 | 89404 | 89405 | 1 | | 44706 | 2191 | 44704 | 44693 | 0 | 89406 | 89407 | 1 | | 44707 | 2191 | 44701 | 44693 | 0 | 89410 | 89411 | 1 | | 44708 | 2191 | 44700 | 44693 | 0 | 89413 | 89414 | 1 | +---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+
Instead of using leaf node_id as input, it is straightforward to also use the locus_id for the gene in the query. For example, in the example reconciled tree used above, we also have locus identifiers associated with the leaf nodes.
Click thumbnail above for full size image
For example, we could use the locus identifier Glyma08g05730 in the query.
SELECT * FROM protein_tree_node WHERE protein_tree_node.left_index IN ( SELECT np.left_index FROM protein_tree_node nc JOIN protein_tree_node np ON nc.left_index < np.left_index AND nc.right_index > np.right_index WHERE nc.node_id = ( SELECT np.node_id FROM protein_tree_node nc JOIN protein_tree_node np ON nc.left_index > np.left_index AND nc.left_index < np.right_index LEFT JOIN protein_tree_member ON nc.node_id = protein_tree_member.node_id LEFT JOIN member ON protein_tree_member.member_id = member.member_id WHERE member.stable_id = "Glyma08g05730" AND np.parent_id = 44694 ) ) AND protein_tree_node.right_index = protein_tree_node.left_index + 1;
To include the sequence string and the stable locus id (usually a gene name) in the output:
SELECT protein_tree_node.node_id, protein_tree_node.protein_tree_id, member.stable_id, sequence.sequence FROM protein_tree_node LEFT JOIN protein_tree_member ON protein_tree_node.node_id = protein_tree_member.node_id LEFT JOIN member ON protein_tree_member.member_id = member.member_id LEFT JOIN sequence ON member.sequence_id = sequence.sequence_id WHERE protein_tree_node.left_index IN ( SELECT np.left_index FROM protein_tree_node nc JOIN protein_tree_node np ON nc.left_index < np.left_index AND nc.right_index > np.right_index WHERE nc.node_id = ( SELECT np.node_id FROM protein_tree_node nc JOIN protein_tree_node np ON nc.left_index > np.left_index AND nc.left_index < np.right_index LEFT JOIN protein_tree_member ON nc.node_id = protein_tree_member.node_id LEFT JOIN member ON protein_tree_member.member_id = member.member_id WHERE member.stable_id = "Glyma08g05730" AND np.parent_id = 44694 ) ) AND protein_tree_node.right_index = protein_tree_node.left_index + 1\G
Returns
*************************** 1. row *************************** node_id: 44703 protein_tree_id: 2191 stable_id: CPS0003G408 sequence: MATFFGSPPFLSPPLARTHHLSSSPQPPPPSPPTPPSQPQSPSPQLSTSSSEEQLSPASVQVQQQKPTKRTTVETTDWIASSLTRRFGLGAGLAWAAFLAVGVISEQVKTRLEVSQQEANTRDVGKEEEVVLPNGVRYYELRVGGGASPRTGDLVVIELKGKVEESGQVFVDTFGGDKKPLALVMGSRPYSKGLCEGVEYVLRSMKAGGRRRVIVPPSLGFGEKGADLGSGIQIPPFATLEYIVEIDRVSIAPA *************************** 2. row *************************** node_id: 44705 protein_tree_id: 2191 stable_id: Glyma05g33920 sequence: MAAFFGSPPIFSLPPTIIRTHHISSSSQTPPPTPSPQSQPPTSSPQQLRTTNLNEESVQVSTEAKQQKPIKPVTSSTKVGSTDWIATSLTRRFGIGAGLAWVGFLAFGVISEQIKTRLELSQQEANTRNVEKVEEVVLPNGIRYYELKLGGGASPRPGDLVVIDIMGKIESSEVFVDTFEGDKKPLALVMGSRPYSKGVCEGIEYALKTMKAGGKRKVIVPPKLGFGENGADLGTGVQIPPLATLEYILEVEKVSIAPA *************************** 3. row *************************** node_id: 44706 protein_tree_id: 2191 stable_id: Glyma08g05730 sequence: MATFFGSPPIFSLPLTRTHHISSSSQTPPPTPPPQSQPPTSSPQQLRTTNLNDESMQVCTEAKQQKPIKPSTKVESTDWIATSLTRRFGIGAGLAWVGFLAFGVISEQIKTRLEVSQQEANTRNVEKVEEVVLPNGIRYYELKVGGGASPRPGDLVVIDITGKIESSGEVFVNTFEGDKKPLALVMGSRPYSKGVCEGIEYVLKTMKAGGKRKVIVPPQLGFGENGADFDSGTGVQIPPLATLEYILEVEKVSIAPA *************************** 4. row *************************** node_id: 44707 protein_tree_id: 2191 stable_id: POPTR-0012s04560.1 sequence: MATFFGSPPFLSHPLTRTNFSSSSQTPPPPTPPIPPSQPNPSPQLSSSSSEQPLLPASVRVQQQKPAKPAGNSATKVETTDWIASTLTRRFGLGAGLAWAAFLAVGVVSEQIKTRIEVSQQEANTRNVGKEEEVALPNGIRYYELRVGGGASPKTGDLVVIDLKGKIEGSGEVFVDTFGGDRKPLALVMGSRPYSKGMCEGVEYVLRSVKAGGKRRVIVPPNLGFRENGADLGTGVQIPPFATLEYIVEVERVSIAPA *************************** 5. row *************************** node_id: 44708 protein_tree_id: 2191 stable_id: V17G0220 sequence: MAAFFGSPPFVYHPLTRTPHFSSSQTPPPPPPPTPPSQPQSPTPPLSITSSEPPAAASVKGQQQKPTKPATTPVTTADSTDWIASTLTRRFGLGAGLAWAAFLAFGVVSEQIKTRLEVSQQEANTRDVEKEEEVVLPNGIRYYEMRVGGGASPRPGDLVVIDLKGSVQGSGEVFVDTFDGEKKSLALVMGSRPYTKGMCEGIESVLRSMKAGGKRRVTIPPSLGFGEKGADLGSGLQIPPSATLEYIVEVDKVSIAPA 5 rows in set (0.57 sec)
If the reconciliation nodes are tagged by meaningful duplication event names in the database (ie. Gamma duplication), it would then be possible to fetch all orthologs for a given locus_id for a given named duplication event by modifications to the query above. This could then be turned into a service that takes locus_id and duplication event as input, and returns orthologous loci as output.
Find Paralogous Loci in Reconciled Trees
In addition to finding the loci that are orthologous to the query leaf node of interest, we may want to find the paralogous leaf nodes. This is the clade that is sister to the orthologous clade for a duplication node of interest.
Click thumbnail above for full size image
In the above query we used the following to find the node id defining the orthologous clade. This finds the child node of the duplication node that contains the query leaf node. We used the left_index and right_index values to do this:
SELECT node_id FROM protein_tree_node WHERE left_index < ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 ) AND right_index > ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 ) AND parent_id = 44694;
To find the node or nodes defining the paralogous clade, we need to identify the child nodes of the parent node that does NOT contain query node. Again we can use the left_index and right_index values to do this:
SELECT node_id FROM protein_tree_node WHERE ( left_index > ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 ) AND right_index > ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 ) AND parent_id = 44694 ) OR ( left_index < ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 ) AND right_index < ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 ) AND parent_id = 44694 )
Using this node to identify the leaf nodes in this paralogous clade yields the following query.
SELECT * FROM protein_tree_node WHERE protein_tree_node.left_index IN ( SELECT np.left_index FROM protein_tree_node nc JOIN protein_tree_node np ON nc.left_index < np.left_index AND nc.right_index > np.right_index WHERE nc.node_id = ( SELECT node_id FROM protein_tree_node WHERE ( parent_id = 44694 -- DUPLICATION NODE AND left_index < ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 -- LEAF NODE ID ) AND right_index < ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 -- LEAF NODE ID ) ) OR ( parent_id = 44694 -- DUPLICATION NODE AND left_index > ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 -- LEAF NODE ID ) AND right_index > ( SELECT left_index FROM protein_tree_node WHERE node_id = 44706 -- LEAF NODE ID ) ) ) ) AND protein_tree_node.right_index = protein_tree_node.left_index + 1; -- Returns leaf nodes
which returns:
+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+ | node_id | protein_tree_id | parent_id | root_id | clusterset_id | left_index | right_index | distance_to_parent | +---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+ | 44697 | 2191 | 44696 | 44693 | 0 | 89390 | 89391 | 1 | | 44698 | 2191 | 44696 | 44693 | 0 | 89392 | 89393 | 1 | | 44699 | 2191 | 44695 | 44693 | 0 | 89395 | 89396 | 1 | +---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+