Nuclear RNA Polymerase V (NRPE) functions in the RNA-directed DNA methylation (RdDM) pathway, leading to transcriptional gene silencing. RNA Pol V arose through a whole genome duplication event and subsequently evolved a new and unique function from the ancestral RNA Pol II. The largest subunit of RNA Pol V (E1) is unique to RNA Pol V and contains RNA binding domains as well as two recently discovered motifs in the C-terminal domain (CTD). Of these two motifs, one is important for protein-protein interactions, but the other is still of unknown function. The first identified motif is a dipeptide consisting of glycine (G) and tryptophan (W), known as "AGO Hooks", and is important for protein-protein interactions occurring at the CTD. The other motif is more variable in length and conservation, but its copies are always found spread throughout the CTD. Previous studies in the plant family Brassicaceae reveal a great deal of variation in the number of repeats between species within the family and indicate that the repeated sequence is unique to the family. Several other well-studied plant families also possess a repeat sequence unique to that particular family. It is hypothesized that these repeats, whose function is still unknown, arose through mismatching during homologous recombination, which allows the repeats to expand or contract. Refer to the documentation for more information about the project.
Updated concept map with information and ideas necessary to create desired pipeline
Old version! Needs revision as pipeline has significantly changed.
GitHub repo: https://github.com/J3TT/PLS-599-Homework
I need fasta sequences for SNP analysis and currently have a CSV file with my data
To do list
Modify AGO Hooks Identification module
- Add code to produce file with results in addition to printing results to screen
Continue analyzing Arabidopsis SNP data
Use variant tools to convert the CSV file with SNP data into a VCF file. The VCF file can then be used with GATK tools to produce a fasta file with sequences for all accessions of Oryza sativa.
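As a rough sketch of the CSV to VCF step, the conversion could look something like the following. This is plain Python, not variant tools itself, and the column names (chrom, pos, ref, alt) are assumptions about how the CSV is laid out, not the actual headers in my file. It writes only the eight fixed VCF columns; per-accession genotypes would still need FORMAT and sample columns before sequences for each accession could be produced.

```python
import csv
import io

# Minimal VCF header: file format line plus the eight fixed columns.
VCF_HEADER = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]


def csv_to_vcf_lines(csv_handle):
    """Convert SNP rows from a CSV handle into a list of VCF text lines.

    Assumed (hypothetical) CSV columns: chrom, pos, ref, alt.
    """
    reader = csv.DictReader(csv_handle)
    records = []
    for row in reader:
        records.append(
            (row["chrom"], int(row["pos"]), row["ref"].upper(), row["alt"].upper())
        )
    # Sort by chromosome name, then by numeric position.
    records.sort()
    lines = list(VCF_HEADER)
    for chrom, pos, ref, alt in records:
        # ID, QUAL, FILTER and INFO are unknown here, so they are written as ".".
        lines.append("\t".join([chrom, str(pos), ".", ref, alt, ".", ".", "."]))
    return lines


if __name__ == "__main__":
    # Tiny in-memory example instead of a real file.
    demo = io.StringIO("chrom,pos,ref,alt\n1,200,a,g\n1,50,C,T\n")
    print("\n".join(csv_to_vcf_lines(demo)))
```

For a real run, the in-memory example would be replaced with an opened CSV file and the returned lines written to a .vcf file.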
3/9 - 3/16
Install necessary python modules
- I was able to install most of the modules needed to run the InterProScan script by installing python 2.7 (py27) in addition to python 3.4 (py34). Even with py27 I was having issues with module installation, and Stack Overflow helped me figure out how to fix them: I just needed to install the modules as the admin, not from my 'user' side of the terminal.
Complete AGO Hook Identification module of project
- I was able to complete my RegEx script to look through fasta files and identify the different AGO Hooks
- Slight modifications to be made in the future, but usable as is
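For reference, a minimal sketch of this kind of RegEx search is shown below. It is not the actual script in the repo. It assumes an AGO Hook is counted as any GW or WG dipeptide (both orientations, overlaps included), and the function names and the optional output path are made up for illustration. It also covers the to-do item of writing the results to a file in addition to printing them.

```python
import re

# Assumption: every GW or WG dipeptide counts as one AGO Hook.
# The lookahead lets overlapping hits (e.g. "GWG") both be reported.
AGO_HOOK = re.compile(r"(?=(GW|WG))")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def find_ago_hooks(sequence):
    """Return (1-based position, motif) for each GW/WG dipeptide found."""
    return [(m.start() + 1, m.group(1)) for m in AGO_HOOK.finditer(sequence)]


def report(records, out_path=None):
    """Print one tab-separated row per hit; also write them if out_path is given."""
    rows = []
    for header, seq in records:
        for pos, motif in find_ago_hooks(seq):
            rows.append("{}\t{}\t{}".format(header, pos, motif))
    text = "\n".join(rows)
    print(text)
    if out_path is not None:
        with open(out_path, "w") as handle:
            handle.write(text + "\n")
    return rows
```

With a real file this could be called as report(read_fasta(open("sequences.fasta")), "ago_hooks.tsv"), where both file names are placeholders.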
Installed compiler software, Visual C++, needed to accompany some modules for InterProScan and future scripts
Downloaded SRA toolkit to access 3000 Oryza genome project information/sequences
Installed remaining modules/programs, PyXML and SOAPpy, necessary to run InterProScan script. Obtained updated script since the script I originally found and was using is out of date.
Learned to use InterProScan through command line
Downloaded RADAR program and began downloading necessary modules
Obtained SNP information for SPT5L in Oryza from the 3k genomes data
Began setting up Linux system on my other laptop to have another OS in addition to my PC. Windows can do all the same things, but more slowly or in a more confusing way, so this will allow me to be faster and more efficient at what I need to get done.
Installed Linux system on old laptop as well as an Ubuntu VM on my new laptop
- Install programs and modules necessary to run analyses such as RADAR, InterProScan, and last
Looked at SNPs for all O. sativa japonica varieties to get an understanding of the number and conservation of SNPs
Downloaded and installed MEGA6
Finished obtaining all NRPE1 and SPT5L sequences for both Brassicaceae and Oryza. Continuing to annotate and identify the repeating sequence.
Downloaded DnaSP for SNP analysis
Downloaded A. thaliana 1001 genomes SNP data/sequences for NRPE1 and SPT5L
Started SNP analysis of NRPE1 gene by evaluating 100 accessions
- Tajima's D test performed to determine if any selection is occurring and where in the protein
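To make the test concrete, here is a small sketch of how Tajima's D can be computed from aligned sequences using the standard formula (Tajima 1989). It is not the DnaSP implementation, and the function name is my own. It assumes every sequence has the same length and treats any character, including gaps or N, as an ordinary allele, so real data would need filtering first.

```python
from collections import Counter
from math import sqrt


def tajimas_d(sequences):
    """Return Tajima's D for aligned sequences, or None if no site varies."""
    n = len(sequences)
    if n < 4:
        # The variance term is zero for n < 4, so D is undefined.
        raise ValueError("need at least 4 sequences")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")

    n_pairs = n * (n - 1) / 2.0
    seg_sites = 0      # S: number of segregating (variable) sites
    diff_pairs = 0.0   # total pairwise differences summed over sites
    for column in zip(*sequences):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            same = sum(c * (c - 1) / 2.0 for c in counts.values())
            diff_pairs += n_pairs - same
    if seg_sites == 0:
        return None

    pi = diff_pairs / n_pairs  # mean pairwise differences
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1.0) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2.0) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)
```

A negative D reflects an excess of rare variants and a positive D an excess of intermediate-frequency variants. To see where in the protein a signal sits, the same function could be applied to sliding windows of the alignment.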
Continuing annotation of gene sequences to correctly identify repeats for further analysis
Gathered current results to produce figures and images
Downloaded variant tools and GATK tools