Final Report

Examining the Variation in Number and Conservation of Repeats Found Within the Carboxy-terminus Domain of the 

Introduction/Background

Transcriptional gene silencing (TGS) is an important process for establishing epigenetic modifications that repress transcription of particular genes or transposable elements across the genome. In plants, epigenetic modifications such as cytosine methylation and histone modifications are established by the RNA-directed DNA methylation (RdDM) pathway, which is responsible for the de novo methylation that leads to these modifications of the genome. The RdDM pathway is a complex system that involves the coordinated interaction of numerous proteins at the same loci. One of the most important proteins in the pathway is an RNA polymerase, Nuclear RNA Polymerase V (NRPE), also referred to as RNA Pol V. RNA Pol V arose through a whole genome duplication event and subsequently evolved a new and unique function distinct from that of the ancestral RNA Pol II. The largest subunit of RNA Pol V (NRPE1) contains several well-conserved domains as well as two recently discovered motifs within the carboxy-terminal domain (CTD). One of these motifs is a dipeptide consisting of glycine (G) and tryptophan (W) in the form WG, GW, or GWG. This motif, commonly referred to as an "AGO Hook", is important for protein-protein interactions occurring at the CTD and is necessary for proper function of the RdDM pathway. AGO Hooks received their name because the motif was first identified in proteins that interact with ARGONAUTE (AGO) proteins, which are among the core proteins of the RdDM pathway. It is hypothesized that AGO Hooks bind AGO proteins to improve their interaction with the complex and the efficiency of the overall process. The other newly discovered motif is a tandemly repeated sequence found within the CTD of NRPE1 and also in the SUPPRESSOR OF TY5-like (SPT5L) protein, which is likewise involved in RdDM.
Previous studies in the plant family Brassicaceae reveal a great deal of variation in the number of repeats and in sequence conservation between species within the family, and indicate that the repeated sequence is unique to the family. The function of these repeats is unknown; they are hypothesized to have arisen through misalignment during homologous recombination, which allows the repeat array to expand or contract. In this project, the variation in number and conservation of this newly discovered motif was analyzed in the RdDM pathway protein NRPE1 and in one of its interacting partners, SPT5L. In addition, single nucleotide polymorphism data for Arabidopsis thaliana and Oryza sativa gene sequences were used to assess the selection acting on the protein and on the newly discovered motif in particular.

Results

Workflow Developed

Gene Identification

Gene sequences for NRPE1 and SPT5L were obtained for thirteen (13) species within the Brassicaceae family and eleven (11) Oryza species (Table 1). The genomic location and sequence of NRPE1 and SPT5L were identified using the National Center for Biotechnology Information (NCBI), Comparative Genomics (CoGe), and EnsemblPlants databases, which contain whole genome sequences for all of the species of interest. Using the protein coding sequence of each gene from Arabidopsis thaliana or Oryza sativa ssp japonica, BLAST searches were performed to identify each gene in all other species. When the coding sequence was not available for a species, the gene prediction software FGENESH+ was used to isolate it. Because Pol V is similar in sequence to Pol IV, each sequence was confirmed to be syntenic with the A. thaliana or O. sativa sequence.

Brassicaceae      Oryza
A. arabicum       O. barthii
A. lyrata         O. brachyantha
A. thaliana       O. glaberrima
B. oleracea       O. glumaepatula
B. rapa           O. longistaminata
C. rubella        O. meridionalis
C. sativa         O. nivara
E. parvulum       O. punctata
E. salsugineum    O. rufipogon
L. alabamica      O. sativa ssp indica
N. paniculata     O. sativa ssp japonica
S. irio
T. hassleriana

Table 1. Brassicaceae and Oryza species whose sequences were analyzed.

 

Analysis of NRPE1 in Brassicaceae and Oryza

To begin determining the carboxy-terminal repeat sequence in each protein, the EMBL-EBI Rapid Automatic Detection and Alignment of Repeats (RADAR) software was used to identify repetitive regions. RADAR analysis provided a starting point for annotating the carboxy-terminal repeats by identifying larger stretches of repetitive sequence, but not the individual repeats. Because the repeats vary in size and sequence, manual curation was necessary to correctly determine the repeat structure, general amino acid consensus, locations, and total number of repeats.
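The consensus step of the manual curation can be illustrated with a short sketch. The Python below is not the procedure used in this project. It assumes the repeat units have already been aligned to equal length (gaps written as "-") and takes a simple majority-rule consensus. The function name, the 50% threshold, and the toy repeat units are all hypothetical.

```python
from collections import Counter


def repeat_consensus(aligned_units, min_fraction=0.5, unknown="x"):
    """Majority-rule consensus of equal-length, pre-aligned repeat units.

    A column is reported as its most common residue when that residue is
    present in at least `min_fraction` of the units; otherwise `unknown`
    is written. Gap characters ("-") are never reported as the consensus
    unless the whole column is gaps.
    """
    if not aligned_units:
        raise ValueError("no repeat units supplied")
    if len({len(unit) for unit in aligned_units}) != 1:
        raise ValueError("repeat units must be aligned to equal length")

    consensus = []
    for column in zip(*aligned_units):
        residues = [r for r in column if r != "-"]
        if not residues:
            consensus.append("-")
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # The fraction is taken over all units, so gaps count against support.
        consensus.append(residue if count / len(column) >= min_fraction else unknown)
    return "".join(consensus)


# Toy repeat units, invented for illustration only (not taken from NRPE1).
toy_units = ["DKKNWG", "DKKSWG", "EKKNWG", "DRKNWG"]
print(repeat_consensus(toy_units))
```

Positions with no clear majority are written as "x" so that variable columns stay visible in the consensus rather than being resolved arbitrarily.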

Fig. 1. Depiction of NRPE1 coding sequence (top) and genomic sequence (bottom) for A. thaliana (top panel) and O. sativa ssp japonica (bottom panel), aligned for visualization of exons. Annotations are included for conserved domains (pink), AGO Hooks (WG: red, GWG: purple, GW: dark red), and the unique repeat (gray).

Species          # of Repeats  # of WG  # of GW  # of GWG  Species                  # of Repeats  # of WG  # of GW  # of GWG
A. arabicum      13            20       1        0         O. barthii               5             6        5        2
A. lyrata        12            16       1        1         O. brachyantha           4             11       5        2
A. thaliana      12            17       1        1         O. glaberrima            1             5        1        1
B. oleracea      19            27       2        2         O. glumaepatula          3             6        3        2
B. rapa          20            27       2        2         O. longistaminata        3             6        3        1
C. rubella       12            16       2        2         O. meridionalis          5             7        4        2
C. sativa        11            17       1        1         O. nivara                3             6        2        2
E. parvulum      4             10       1        0         O. punctata              3             6        3        2
E. salsugineum   11            13       2        1         O. rufipogon             5             7        5        2
L. alabamica     12            22       1        0         O. sativa ssp indica     5             7        5        2
N. paniculata    15            17       2        1         O. sativa ssp japonica   5             7        5        2
S. irio          16            21       0        0
T. hassleriana   0             11       2        1

Table 2. Repeat and AGO Hook numbers for NRPE1 in Brassicaceae and Oryza species.

Fig. 2. Newly identified tandem repeating sequence of unknown function found within the NRPE1 carboxy-terminal domain in A. thaliana (left panel) and O. sativa ssp japonica (right panel).

Single Nucleotide Polymorphism Analysis of A. thaliana NRPE1

Single nucleotide polymorphism (SNP) data for Arabidopsis thaliana were obtained from the 1001 Genomes Project through the SALK Institute Genomic Analysis Laboratory (http://signal.salk.edu/atg1001/index.php). Data for all available accessions were obtained, and any sequences containing ambiguities were removed prior to analysis. Sequences were aligned using the MUSCLE algorithm in Biomatters Geneious software 6.1.8. Tajima's D analysis was performed to evaluate selection acting on the gene. For the protein overall, Tajima's D was -2.117. A negative Tajima's D reflects an excess of rare variants, which is consistent with purifying selection or a recent population expansion. Only the first 100 accessions were analyzed initially, to estimate both the amount of selection on this gene and the computation time needed. A secondary analysis of the gene using 500 accessions supported and even reinforced the original value.


Fig. 3. Tajima's D analysis of 100 accessions. Sliding window of 100 nt with step size of 25 nt.
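For reference, the statistic can be computed directly from an alignment. The sketch below is a minimal Python implementation of the standard Tajima's D formula, with a sliding window matching the 100 nt window and 25 nt step of Fig. 3. It is not the implementation used to produce the reported values. The function names are hypothetical, and the code assumes equal-length aligned sequences containing no gaps or ambiguity codes.

```python
from collections import Counter
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences, or None if undefined."""
    n = len(seqs)
    if n < 4:
        return None  # the variance term is zero for fewer than 4 sequences

    seg_sites = 0        # S: number of segregating (polymorphic) sites
    pairwise_diffs = 0   # differences summed over all pairs of sequences
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Number of sequence pairs that differ at this site.
            pairwise_diffs += (n * n - sum(c * c for c in counts.values())) // 2
    if seg_sites == 0:
        return None  # no polymorphism, D is undefined

    pi = pairwise_diffs / (n * (n - 1) / 2)  # mean pairwise differences

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


def sliding_tajimas_d(seqs, window=100, step=25):
    """Return (0-based window start, D) for each full window along the alignment."""
    length = len(seqs[0])
    return [
        (start, tajimas_d([s[start:start + window] for s in seqs]))
        for start in range(0, length - window + 1, step)
    ]
```

Windows that contain no segregating sites return None rather than zero, so that invariant regions are not mistaken for regions evolving neutrally.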

 

Analysis of Transcription elongation factor Suppressor of TY 5-like (SPT5L) in Brassicaceae and Oryza

Employing the improved workflow described above, the protein SPT5L was quickly identified and annotated in both Brassicaceae and Oryza species. Once annotation was complete, repeat number and conservation could be analyzed.

Fig. 4. SPT5L coding sequence (top) and genomic sequence (bottom) for A. thaliana (top panel) and O. sativa ssp japonica (bottom panel), aligned for visualization of exons. Annotations are included for conserved domains (pink), AGO Hooks (WG: red, GWG: purple, GW: dark red), and the unique repeat (gray).

Species          # of Repeats (A)  # of Repeats (B)  # of WG  # of GW  # of GWG
A. arabicum      19                14                33       1        2
A. lyrata        15                17                35       3        5
A. thaliana      19                17                40       3        2
B. oleracea      19                20                43       2        2
B. rapa          20                23                43       2        2
C. rubella       19                8                 35       5        3
C. sativa        TBD               TBD               TBD      TBD      TBD
E. parvulum      20                17                27       5        2
E. salsugineum   3                 9                 15       1        1
L. alabamica     22                14                33       5        3
N. paniculata    19                14                37       4        2
S. irio          24                10                34       5        2
T. hassleriana   24                16                25       5        12

Table 3. Repeat and AGO Hook numbers for SPT5L in Brassicaceae and Oryza species.

Fig. 5. Two newly identified tandem repeating sequences of unknown function found within the SPT5L carboxy-terminal domain in A. thaliana.

Methods

NRPE1 and SPT5L Gene Sequence Identification

The A. thaliana or O. sativa ssp japonica coding sequence was used as the query in BLAST searches on NCBI (http://www.ncbi.nlm.nih.gov/), CoGe (https://genomevolution.org/coge/), and EnsemblPlants (http://plants.ensembl.org/index.html). The best-matching sequence was taken along with an additional 5 kb upstream and downstream of the match. If no coding or transcript sequence was available for a species, the genomic sequence was submitted to the FGENESH+ (http://linux1.softberry.com/) gene prediction algorithm, with the A. thaliana and O. sativa ssp japonica protein sequences serving as templates for Brassicaceae and Oryza species, respectively.
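The 5 kb flanking step can be sketched in a few lines of Python. This is an illustration only, not the procedure actually run. It assumes 1-based inclusive hit coordinates as reported by BLAST, and the function name and parameters are hypothetical.

```python
def flanked_region(hit_start, hit_end, seq_length, flank=5000):
    """Extend a BLAST hit by `flank` bp on each side, clamped to the sequence.

    Coordinates are 1-based and inclusive. Hits on the minus strand, which
    BLAST reports with start greater than end, are handled by sorting the
    two coordinates first.
    """
    low, high = sorted((hit_start, hit_end))
    return max(1, low - flank), min(seq_length, high + flank)


# A hit at 12,000-15,000 on a 100,000 bp scaffold.
print(flanked_region(12000, 15000, 100000))  # (7000, 20000)
```

Clamping matters for hits near the end of a scaffold, where fewer than 5 kb of flanking sequence may exist.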

Gene Annotation, Repeat Identification, and Evaluation

Conserved domains, AGO Hooks, and the newly discovered motif were annotated in each gene for each species using Geneious software. A custom Python script was created to scan for and identify any AGO Hooks present in the proteins (https://github.com/J3TT/PLS-599-Homework). Protein sequences from A. thaliana and O. sativa ssp japonica were analyzed with the EMBL-EBI Rapid Automatic Detection and Alignment of Repeats (RADAR) software. The RADAR output was used along with manual annotation to identify the repeat structure. Annotation files for each species analyzed are available.
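A scan for AGO Hooks can be sketched as follows. This is not the script in the repository cited above, whose logic is not reproduced here. It is a minimal illustration under one assumed counting convention: the protein is scanned left to right without overlap, and a GWG is counted once as GWG rather than also as a GW and a WG. The function names and the example peptide are hypothetical.

```python
import re
from collections import Counter

# GWG is listed first so that it wins over GW when both match at a position.
AGO_HOOK = re.compile(r"GWG|WG|GW")


def find_ago_hooks(protein):
    """Return (1-based start, motif) for each non-overlapping AGO Hook."""
    return [(m.start() + 1, m.group()) for m in AGO_HOOK.finditer(protein.upper())]


def count_ago_hooks(protein):
    """Count WG, GW, and GWG motifs under the convention described above."""
    counts = Counter(motif for _, motif in find_ago_hooks(protein))
    return {motif: counts.get(motif, 0) for motif in ("WG", "GW", "GWG")}


# Made-up peptide for illustration, not a real NRPE1 or SPT5L fragment.
peptide = "MAWGSSGWGKKGWA"
print(find_ago_hooks(peptide))   # [(3, 'WG'), (7, 'GWG'), (12, 'GW')]
print(count_ago_hooks(peptide))  # {'WG': 1, 'GW': 1, 'GWG': 1}
```

Other conventions, such as counting every overlapping WG and GW, would give different totals, so counts from this sketch should not be expected to match the tables in this report exactly.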

Single Nucleotide Polymorphism (SNP) Analysis

SNP data for A. thaliana gene sequences were obtained from the 1001 Genomes Project through the SALK Institute Genomic Analysis Laboratory (http://signal.salk.edu/atg1001/index.php). Only SNPs present in the coding sequence of each accession were used in the analysis. Oryza sativa SNP data were obtained from the 3000 genomes project through the International Rice Informatics Consortium (http://oryzasnp.org/iric-portal/).
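As noted in the Results, sequences containing ambiguities were removed before analysis. A minimal Python sketch of such a filter is given below. It assumes the sequences are unaligned coding sequences held in a dictionary keyed by accession name, and that anything other than A, C, G, or T counts as an ambiguity. The function name and the accession labels are hypothetical.

```python
UNAMBIGUOUS_BASES = set("ACGT")


def drop_ambiguous(records):
    """Keep only accessions whose sequence consists solely of A, C, G, and T.

    `records` maps accession name to nucleotide sequence. Sequences with
    IUPAC ambiguity codes (N, R, Y, ...) or gap characters are removed.
    """
    return {
        name: seq
        for name, seq in records.items()
        if set(seq.upper()) <= UNAMBIGUOUS_BASES
    }


# Toy input; the accession labels are placeholders.
toy = {"acc1": "ATGGCT", "acc2": "ATGNCT", "acc3": "atggcr"}
print(sorted(drop_ambiguous(toy)))  # ['acc1']
```

Dropping whole accessions, rather than masking individual sites, keeps every remaining sequence complete.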

References


The 3,000 Rice Genomes Project. (2014). The 3,000 rice genomes project. GigaScience, 3, 7. doi:10.1186/2047-217X-3-7

Alexandrov, N., et al. (2015). SNP-Seek database of SNPs derived from 3000 rice genomes. Nucleic Acids Research, 43(D1), D1023–D1027.

Bies-Etheve, N., Pontier, D., Lahmy, S., Picart, C., Vega, D., Cooke, R., & Lagrange, T. (2009). RNA-directed DNA methylation requires an AGO4-interacting member of the SPT5 elongation factor family. EMBO Reports, 10(6), 649–54. doi:10.1038/embor.2009.31

El-Shami, M., Pontier, D., Lahmy, S., Braun, L., Picart, C., Vega, D., … Lagrange, T. (2007). Reiterated WG/GW motifs form functionally and evolutionarily conserved ARGONAUTE-binding platforms in RNAi-related components. Genes & Development, 21(20), 2539–44. doi:10.1101/gad.451207

Eulalio, A., Tritschler, F., & Izaurralde, E. (2009). The GW182 protein family in animal cells: new insights into domains required for miRNA-mediated gene silencing. RNA (New York, N.Y.), 15(8), 1433–42. doi:10.1261/rna.1703809

Heger, A., & Holm, L. (2000). Rapid automatic detection and alignment of repeats in protein sequences. Proteins: Structure, Function and Genetics, 41(2), 224–237. doi:10.1002/1097-0134(20001101)41:2<224::AID-PROT70>3.0.CO;2-Z

Kane, J., Freeling, M., & Lyons, E. (2010). The evolution of a high copy gene array in Arabidopsis. Journal of Molecular Evolution, 70(6), 531–544.

Matzke, M. A., & Mosher, R. A. (2014). RNA-directed DNA methylation: an epigenetic pathway of increasing complexity. Nature Reviews Genetics, 15(6), 394–408. doi:10.1038/nrg3683

Nelson, A. D. L., Forsythe, E. S., Gan, X., Tsiantis, M., & Beilstein, M. A. (2014). Extending the model of Arabidopsis telomere length and composition across Brassicaceae. Chromosome Research, 22(2), 153–66. doi:10.1007/s10577-014-9423-y

Pfaff, J., & Meister, G. (2013). Argonaute and GW182 proteins: an effective alliance in gene silencing. Biochemical Society Transactions, 41(4), 855–60. doi:10.1042/BST20130047

Solovyev V.V. (2007) Statistical approaches in Eukaryotic gene prediction. In Handbook of Statistical genetics (eds. Balding D., Cannings C., Bishop M.), Wiley-Interscience; 3d edition, 1616 p.

Tajima, F. (1996). The amount of DNA polymorphism maintained in a finite population when the neutral mutation rate varies among sites. Genetics, 143(3), 1457–1465.

Till, S., Lejeune, E., Thermann, R., Bortfeld, M., Hothorn, M., Enderle, D., … Ladurner, A. G. (2007). A conserved motif in Argonaute-interacting proteins mediates functional interactions through the Argonaute PIWI domain. Nature Structural & Molecular Biology, 14(10), 897–903. doi:10.1038/nsmb1302