SQL Queries - Queries on Reconciled Trees

Example species tree

These queries use the example species tree below:

Unknown macro: {mockup}
Find all gene trees having a speciation event at a  specific point on the species tree

The goal is to find gene trees that have a bifurcation event for a specific point on the species trees. This could be at a node on the species tree for speciation events, or at an edge on the species trees for duplication events. For example, to find the gene trees that have a speciation event at the common ancestor of poplar,cucumber and soybean we would use node_id = 4.

Unknown macro: {mockup}

To find the reconciliation_nodes that map to a specific node on the species tree, we can make use of the is_on_node BOOLEAN:

SELECT reconciliation_id 
FROM reconciliation_node 
WHERE host_child_node_id = 4 
AND is_on_node = TRUE;

this results in

+-------------------+
| reconciliation_id |
+-------------------+
|                 4 |
|                 5 |
|                 6 |
|                 7 |
|                 9 |
|                 9 |
...
|              2539 |
|              2540 |
|              2540 |
|              2540 |
+-------------------+
1466 rows in set (0.15 sec)

Alternatively we can use a query for host_childe_node_id = host_parent_node_id. I put this here since when first done, this query takes twice as long a the boolean query above, but when cached, this query is about twice as fast as the boolean query.

SELECT reconciliation_id
FROM reconciliation_node 
WHERE host_child_node_id = 4 
AND host_child_node_id = host_parent_node_id;

To convert these reconciliation_ids to protein trees we could build on the above queries:

SELECT DISTINCT protein_tree_id
FROM reconciliation
WHERE
reconciliation_id  IN
(
  SELECT reconciliation_id
  FROM reconciliation_node
  WHERE host_child_node_id = 4
  AND is_on_node = TRUE
);

but this is quite slow (~ 15.8 seconds), so it is better to use joins (~ 0.3 seconds)

SELECT DISTINCT protein_tree_id
FROM reconciliation
LEFT JOIN reconciliation_node ON
reconciliation.reconciliation_id = reconciliation_node.reconciliation_id
WHERE
reconciliation_node.host_child_node_id = 4
AND reconciliation_node.is_on_node = TRUE;

If there is more

Find all gene trees having a duplication event at a specific point on the species tree

To find duplication events, we search for gene trees node that map to an edge on the species tree. Edges are delineated by the host and parent node and the host child node. Fore example, the branch leading up to the clade containing cucumber and soybean is defined by having the host_parent node = 4 and the host_child_node = 6.

Unknown macro: {mockup}

To find duplication events, for a specific branch on the species tree (ie parent_node = 4 and child_node = 6):

--
-- QUERY TO FIND TREE WITH DUPLICATION AT
-- A SPECIFIC BRANCH ON THE SPECIES TREE
--
SELECT DISTINCT protein_tree_id
FROM reconciliation
LEFT JOIN reconciliation_node ON
reconciliation.reconciliation_id = reconciliation_node.reconciliation_id
WHERE
reconciliation_node.host_child_node_id = 6
AND reconciliation_node.host_parent_node_id = 4;

+-----------------+
| protein_tree_id |
+-----------------+
|              14 |
|              54 |
|              96 |

...
|            1978 |
|            2201 |
|            2395 |
|            2491 |
+-----------------+
33 rows in set (0.15 sec)

Identify the Duplication Nodes and Speciation Nodes in a Gene Tree

For databases that only contain duplications and speciations (ie. no horizontal transfer events), the duplication nodes will be those nodes where reconciliation_node.is_on_node is false. For example, with the gene tree with the protein_tree_id = 1337.

SELECT protein_tree_node.node_id,
       reconciliation_node.host_parent_node_id,
       reconciliation_node.host_child_node_id,
       reconciliation_node.is_on_node,
       reconciliation_node.reconciliation_id
FROM protein_tree_node
RIGHT JOIN reconciliation_node ON
protein_tree_node.node_id = reconciliation_node.node_id
WHERE protein_tree_node.protein_tree_id = 1337;

Should give results like:

+---------+---------------------+--------------------+------------+-------------------+
| node_id | host_parent_node_id | host_child_node_id | is_on_node | reconciliation_id |
+---------+---------------------+--------------------+------------+-------------------+
|   30527 |                   3 |                  3 |          1 |              1337 |
|   30528 |                   9 |                  9 |          1 |              1337 |
|   30529 |                  10 |                 10 |          1 |              1337 |
|   30530 |                  11 |                 11 |          1 |              1337 |
|   30531 |                   4 |                  4 |          1 |              1337 |
|   30532 |                   7 |                  7 |          1 |              1337 |
|   30533 |                   4 |                  5 |          0 |              1337 |
|   30534 |                   5 |                  5 |          1 |              1337 |
|   30535 |                   5 |                  5 |          1 |              1337 |
|   30536 |                   1 |                  1 |          1 |              1337 |
|   30537 |                   6 |                  8 |          0 |              1337 |
|   30538 |                   8 |                  8 |          1 |              1337 |
|   30539 |                   8 |                  8 |          1 |              1337 |
|   30540 |                   1 |                  1 |          1 |              1337 |
+---------+---------------------+--------------------+------------+-------------------+

The nodes 30533 and 30537 represent duplication events in the gene tree. Node 30533 has a duplication on the branch on the species tree that is between node 4 and 5, which is the branch leading up to poplar. Node 30537 has a duplication that is on the branch in the species tree between nodes 6 and 8 which is the branch leading up to soybean.

Find Orthologous Loci in Reconciled Trees

Given a reconciled gene tree in which the nodes are annotated as duplication (is_on_node = 0) or speciation nodes (is_on_node = 1), it is possible to return all leaf nodes that are orthologous to a query leaf node.

For example, in the tree below we may want to identify the orthologs of leaf node 44706 for the duplication event shown as node 44701.


Click on thumbnail to enlarge.

The variables for the query is the duplication node of interest (ie. node 44701) as well as the query leaf node of interest (ie. node 44706):

SELECT * FROM
protein_tree_node
WHERE protein_tree_node.left_index IN (
 SELECT np.left_index
 FROM protein_tree_node nc
 JOIN protein_tree_node np
 ON nc.left_index < np.left_index
 AND nc.right_index > np.right_index
 WHERE nc.node_id = (
                     SELECT node_id
                     FROM protein_tree_node
                     WHERE parent_id = 44701
                     AND
                     left_index < ( SELECT left_index
                                    FROM protein_tree_node
                                    WHERE node_id = 44706
                                   )
                     AND
                     right_index > ( SELECT left_index
                                     FROM protein_tree_node
                                     WHERE node_id = 44706
                                    )
                    )
)
AND protein_tree_node.right_index = protein_tree_node.left_index + 1;

This query returns the leaf nodes that are the orthologous set, including the query node in the result:

+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+
| node_id | protein_tree_id | parent_id | root_id | clusterset_id | left_index | right_index | distance_to_parent |
+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+
|   44703 |            2191 |     44702 |   44693 |             0 |      89401 |       89402 |                  1 |
|   44705 |            2191 |     44704 |   44693 |             0 |      89404 |       89405 |                  1 |
|   44706 |            2191 |     44704 |   44693 |             0 |      89406 |       89407 |                  1 |
+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+

Another example for the same reconciled tree is to use node 44694 as the duplication node of interest.


Click thumbnail above for full size image

As a query this is:

SELECT * FROM
protein_tree_node
WHERE protein_tree_node.left_index IN (
 SELECT np.left_index
 FROM protein_tree_node nc
 JOIN protein_tree_node np
 ON nc.left_index < np.left_index
 AND nc.right_index > np.right_index
 WHERE nc.node_id = (
                     SELECT node_id
                     FROM protein_tree_node
                     WHERE parent_id = 44694                           -- DUPLICATION NODE
                     AND
                     left_index < ( SELECT left_index
                                    FROM protein_tree_node
                                    WHERE node_id = 44706              -- LEAF NODE ID
                                    )
                     AND
                     right_index > ( SELECT left_index
                                     FROM protein_tree_node
                                     WHERE node_id = 44706             -- LEAF NODE ID
                                   )
                     )
 )
AND protein_tree_node.right_index = protein_tree_node.left_index + 1;  -- Returns leaf nodes

and this returns:

+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+
| node_id | protein_tree_id | parent_id | root_id | clusterset_id | left_index | right_index | distance_to_parent |
+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+
|   44703 |            2191 |     44702 |   44693 |             0 |      89401 |       89402 |                  1 |
|   44705 |            2191 |     44704 |   44693 |             0 |      89404 |       89405 |                  1 |
|   44706 |            2191 |     44704 |   44693 |             0 |      89406 |       89407 |                  1 |
|   44707 |            2191 |     44701 |   44693 |             0 |      89410 |       89411 |                  1 |
|   44708 |            2191 |     44700 |   44693 |             0 |      89413 |       89414 |                  1 |
+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+
5 rows in set (0.96 sec)

This can be modified to a single use of the leaf node id variable by using joins to reduce this SQL statement to:

SELECT * FROM
protein_tree_node
WHERE protein_tree_node.left_index IN (
  SELECT np.left_index
  FROM protein_tree_node nc
  JOIN protein_tree_node np
  ON nc.left_index < np.left_index
  AND nc.right_index > np.right_index
  WHERE nc.node_id = (
    SELECT np.node_id
    FROM protein_tree_node nc
    JOIN protein_tree_node np
    ON nc.left_index > np.left_index
    AND nc.left_index < np.right_index
    WHERE nc.node_id = 44706
    AND np.parent_id = 44694
  )
)
AND protein_tree_node.right_index = protein_tree_node.left_index + 1;

This also returns:

+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+
| node_id | protein_tree_id | parent_id | root_id | clusterset_id | left_index | right_index | distance_to_parent |
+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+
|   44703 |            2191 |     44702 |   44693 |             0 |      89401 |       89402 |                  1 |
|   44705 |            2191 |     44704 |   44693 |             0 |      89404 |       89405 |                  1 |
|   44706 |            2191 |     44704 |   44693 |             0 |      89406 |       89407 |                  1 |
|   44707 |            2191 |     44701 |   44693 |             0 |      89410 |       89411 |                  1 |
|   44708 |            2191 |     44700 |   44693 |             0 |      89413 |       89414 |                  1 |
+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+

Instead of using leaf node_id as input, it is straightforward to also use the locus_id for the gene in the query. For example, in the example reconciled tree used above, we also have locus identifiers associated with the leaf nodes.


Click thumbnail above for full size image

For example, we could use the locus identifier Glyma08g05730 in the query.

SELECT * FROM
protein_tree_node
WHERE protein_tree_node.left_index IN (
  SELECT np.left_index
  FROM protein_tree_node nc
  JOIN protein_tree_node np
  ON nc.left_index < np.left_index
  AND nc.right_index > np.right_index
  WHERE nc.node_id = (
    SELECT np.node_id
    FROM protein_tree_node nc
    JOIN protein_tree_node np
    ON nc.left_index > np.left_index
    AND nc.left_index < np.right_index
    LEFT JOIN protein_tree_member ON
    nc.node_id = protein_tree_member.node_id
    LEFT JOIN member ON
    protein_tree_member.member_id = member.member_id
    WHERE member.stable_id = "Glyma08g05730"
    AND np.parent_id = 44694
  )
)
AND protein_tree_node.right_index = protein_tree_node.left_index + 1;

To include the sequence string and the stable locus id (usually a gene name) in the output:

SELECT protein_tree_node.node_id,
       protein_tree_node.protein_tree_id,
       member.stable_id,
       sequence.sequence
FROM protein_tree_node
LEFT JOIN protein_tree_member ON
protein_tree_node.node_id = protein_tree_member.node_id
LEFT JOIN member ON
protein_tree_member.member_id = member.member_id
LEFT JOIN sequence ON
member.sequence_id = sequence.sequence_id
WHERE protein_tree_node.left_index IN (
  SELECT np.left_index
  FROM protein_tree_node nc
  JOIN protein_tree_node np
  ON nc.left_index < np.left_index
  AND nc.right_index > np.right_index
  WHERE nc.node_id = (
    SELECT np.node_id
    FROM protein_tree_node nc
    JOIN protein_tree_node np
    ON nc.left_index > np.left_index
    AND nc.left_index < np.right_index
    LEFT JOIN protein_tree_member ON
    nc.node_id = protein_tree_member.node_id
    LEFT JOIN member ON
    protein_tree_member.member_id = member.member_id
    WHERE member.stable_id = "Glyma08g05730"
    AND np.parent_id = 44694
  )
)
AND protein_tree_node.right_index = protein_tree_node.left_index + 1\G

Returns

*************************** 1. row ***************************
        node_id: 44703
protein_tree_id: 2191
      stable_id: CPS0003G408
       sequence: MATFFGSPPFLSPPLARTHHLSSSPQPPPPSPPTPPSQPQSPSPQLSTSSSEEQLSPASVQVQQQKPTKRTTVETTDWIASSLTRRFGLGAGLAWAAFLAVGVISEQVKTRLEVSQQEANTRDVGKEEEVVLPNGVRYYELRVGGGASPRTGDLVVIELKGKVEESGQVFVDTFGGDKKPLALVMGSRPYSKGLCEGVEYVLRSMKAGGRRRVIVPPSLGFGEKGADLGSGIQIPPFATLEYIVEIDRVSIAPA
*************************** 2. row ***************************
        node_id: 44705
protein_tree_id: 2191
      stable_id: Glyma05g33920
       sequence: MAAFFGSPPIFSLPPTIIRTHHISSSSQTPPPTPSPQSQPPTSSPQQLRTTNLNEESVQVSTEAKQQKPIKPVTSSTKVGSTDWIATSLTRRFGIGAGLAWVGFLAFGVISEQIKTRLELSQQEANTRNVEKVEEVVLPNGIRYYELKLGGGASPRPGDLVVIDIMGKIESSEVFVDTFEGDKKPLALVMGSRPYSKGVCEGIEYALKTMKAGGKRKVIVPPKLGFGENGADLGTGVQIPPLATLEYILEVEKVSIAPA
*************************** 3. row ***************************
        node_id: 44706
protein_tree_id: 2191
      stable_id: Glyma08g05730
       sequence: MATFFGSPPIFSLPLTRTHHISSSSQTPPPTPPPQSQPPTSSPQQLRTTNLNDESMQVCTEAKQQKPIKPSTKVESTDWIATSLTRRFGIGAGLAWVGFLAFGVISEQIKTRLEVSQQEANTRNVEKVEEVVLPNGIRYYELKVGGGASPRPGDLVVIDITGKIESSGEVFVNTFEGDKKPLALVMGSRPYSKGVCEGIEYVLKTMKAGGKRKVIVPPQLGFGENGADFDSGTGVQIPPLATLEYILEVEKVSIAPA
*************************** 4. row ***************************
        node_id: 44707
protein_tree_id: 2191
      stable_id: POPTR-0012s04560.1
       sequence: MATFFGSPPFLSHPLTRTNFSSSSQTPPPPTPPIPPSQPNPSPQLSSSSSEQPLLPASVRVQQQKPAKPAGNSATKVETTDWIASTLTRRFGLGAGLAWAAFLAVGVVSEQIKTRIEVSQQEANTRNVGKEEEVALPNGIRYYELRVGGGASPKTGDLVVIDLKGKIEGSGEVFVDTFGGDRKPLALVMGSRPYSKGMCEGVEYVLRSVKAGGKRRVIVPPNLGFRENGADLGTGVQIPPFATLEYIVEVERVSIAPA
*************************** 5. row ***************************
        node_id: 44708
protein_tree_id: 2191
      stable_id: V17G0220
       sequence: MAAFFGSPPFVYHPLTRTPHFSSSQTPPPPPPPTPPSQPQSPTPPLSITSSEPPAAASVKGQQQKPTKPATTPVTTADSTDWIASTLTRRFGLGAGLAWAAFLAFGVVSEQIKTRLEVSQQEANTRDVEKEEEVVLPNGIRYYEMRVGGGASPRPGDLVVIDLKGSVQGSGEVFVDTFDGEKKSLALVMGSRPYTKGMCEGIESVLRSMKAGGKRRVTIPPSLGFGEKGADLGSGLQIPPSATLEYIVEVDKVSIAPA
5 rows in set (0.57 sec)

If the reconciliation nodes are tagged by meaningful duplication event names in the database (ie. Gamma duplication), it would then be possible to fetch all orthologs for a given locus_id for a given named duplication event by modifications to the query above. This could then be turned into a service that takes locus_id and duplication event as input, and returns orthologous loci as output.

Find Paralogous Loci in Reconciled Trees

In addition to finding the loci that are orthologous to the query leaf node of interest, we may want to find the paralogous leaf nodes. This is the clade that is sister to the orthologous clade for a duplication node of interest.

Click thumbnail above for full size image

In the above query we used the following to find the node id defining the orthologous clade. This finds the child node of the duplication node that contains the query leaf node. We used the left_index and right_index values to do this:

SELECT node_id
FROM protein_tree_node
WHERE
left_index < ( SELECT left_index
               FROM protein_tree_node
               WHERE node_id = 44706
               )
AND
right_index > ( SELECT left_index
                FROM protein_tree_node
                WHERE node_id = 44706
               )
AND parent_id = 44694;

To find the node or nodes defining the paralogous clade, we need to identify the child nodes of the parent node that does NOT contain query node. Again we can use the left_index and right_index values to do this:

SELECT node_id
FROM protein_tree_node
WHERE
(
  left_index > ( SELECT left_index
                FROM protein_tree_node
                WHERE node_id = 44706
                )
  AND
  right_index > ( SELECT left_index
                 FROM protein_tree_node
                 WHERE node_id = 44706
                )
 AND parent_id = 44694
)
OR
(
  left_index < ( SELECT left_index
                 FROM protein_tree_node
                 WHERE node_id = 44706
                )
 AND
 right_index < ( SELECT left_index
                 FROM protein_tree_node
                 WHERE node_id = 44706
               )
 AND parent_id = 44694
)

Using this node to identify the leaf nodes in this paralogous clade yields the following query.

SELECT * FROM
protein_tree_node
WHERE protein_tree_node.left_index IN (
 SELECT np.left_index
 FROM protein_tree_node nc
 JOIN protein_tree_node np
 ON nc.left_index < np.left_index
 AND nc.right_index > np.right_index
 WHERE nc.node_id = (

                      SELECT node_id
                      FROM protein_tree_node
                      WHERE
                        (
                        parent_id = 44694                                 -- DUPLICATION NODE
                        AND
                        left_index < ( SELECT left_index
                                       FROM protein_tree_node
                                       WHERE node_id = 44706              -- LEAF NODE ID
                                       )
                        AND
                        right_index < ( SELECT left_index
                                        FROM protein_tree_node
                                        WHERE node_id = 44706             -- LEAF NODE ID
                                      )
                        )
                      OR
                        (
                        parent_id = 44694                                 -- DUPLICATION NODE
                        AND
                        left_index > ( SELECT left_index
                                       FROM protein_tree_node
                                       WHERE node_id = 44706              -- LEAF NODE ID
                                       )
                        AND
                        right_index > ( SELECT left_index
                                        FROM protein_tree_node
                                        WHERE node_id = 44706             -- LEAF NODE ID
                                      )

                        )

                     )
 )
AND protein_tree_node.right_index = protein_tree_node.left_index + 1;  -- Returns leaf nodes

which returns:

+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+
| node_id | protein_tree_id | parent_id | root_id | clusterset_id | left_index | right_index | distance_to_parent |
+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+
|   44697 |            2191 |     44696 |   44693 |             0 |      89390 |       89391 |                  1 |
|   44698 |            2191 |     44696 |   44693 |             0 |      89392 |       89393 |                  1 |
|   44699 |            2191 |     44695 |   44693 |             0 |      89395 |       89396 |                  1 |
+---------+-----------------+-----------+---------+---------------+------------+-------------+--------------------+