Author(s): Dr. Upendra Kumar Devisetty, CyVerse/University of Arizona and Dr. Andrew D. L. Nelson, School of Plant Sciences, University of Arizona
Evolinc is a two-part pipeline that identifies lincRNAs from an assembled transcriptome file (.gtf output from Cufflinks) and then determines the extent to which those lincRNAs are conserved in the genomes and transcriptomes of other species.
The first part of the pipeline is lincRNA identification. The second part is the comparative genomics and transcriptomics analysis, which takes the output of the first part as its input. The two parts are kept separate in case users do not want to perform an evolutionary analysis on the identified lincRNAs. The results depend heavily on the quality of the genomes, transcriptomes, and annotation datasets being used. Even a couple of SNPs could lead to a transcript being misidentified as either a lincRNA or a protein-coding gene.
All of the necessary Python modules are already installed on this instance, so you can get started analyzing right away!
This tutorial will take users through the steps of:
- Launching the Evolinc Atmosphere image
- Running Evolinc on a test dataset
Part 1: Connect to an instance of an Atmosphere Image (Virtual Machine)
Step 1. Go to https://atmo.iplantcollaborative.org and log in with your CyVerse credentials.
Step 2. Click on the Launch New Instance button and search for Evolinc.
Step 3. Select the image Evolinc and click Launch Instance. It will take 10-15 minutes for the cloud instance to be launched.
Note: Instances can be configured for different amounts of CPU, memory, and storage depending on user needs. This tutorial can be completed with a relatively small instance size, medium1 (4 CPUs, 8 GB memory, 80 GB root).
Part 2: Set up an Evolinc run using the Terminal window
Step 1. Open the Terminal. Use ssh with your username and the IP address of your instance to connect to the instance through the terminal.
Step 2. All the dependencies and scripts for running Evolinc are located in the "/opt/Evolinc" folder. Evolinc is run from the command line with the options explained below.
Explanation of the command-line options
- -c: Cuffcompare output file in gtf format
- -g: Reference genome file in fasta format
- -r: Reference cDNA file in fasta format
- -b: Transposable elements file in fasta format
- -t: TSS site file in gff format
- -x: Known long non-coding RNA file in gff format
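As a rough illustration of how the options above fit together, the following Python sketch checks that each input file exists and assembles the matching argument list. The function name build_evolinc_args and the script parameter are hypothetical; the name of the Evolinc launcher script is not given here, so it is passed in by the caller. Treating -t and -x as optional is an assumption based on the outputs that are generated only when a TSS file or a known lincRNA file is supplied.

```python
import os


def build_evolinc_args(script, cuffcompare_gtf, genome_fa, cdna_fa, te_fa,
                       tss_gff=None, known_lncrna_gff=None):
    """Assemble an argument list from the documented Evolinc flags.

    `script` is a placeholder for the Evolinc launcher (its real name is
    not stated in this tutorial). -t and -x are assumed to be optional.
    """
    flagged = [
        ("-c", cuffcompare_gtf),   # Cuffcompare output (gtf)
        ("-g", genome_fa),         # reference genome (fasta)
        ("-r", cdna_fa),           # reference cDNA (fasta)
        ("-b", te_fa),             # transposable elements (fasta)
    ]
    if tss_gff is not None:
        flagged.append(("-t", tss_gff))            # TSS sites (gff)
    if known_lncrna_gff is not None:
        flagged.append(("-x", known_lncrna_gff))   # known lncRNAs (gff)

    args = [script]
    for flag, path in flagged:
        # Fail early with a clear message instead of partway through a run.
        if not os.path.isfile(path):
            raise FileNotFoundError(f"{flag} input not found: {path}")
        args.extend([flag, path])
    return args
```

The returned list could be handed to subprocess.run once the actual launcher path on your instance is known.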
Part 3: Running sample data
The staged example data can be found in two folders, "Evolinc/sample.data.arabi" and "Evolinc/sample.data.brapa", within the "Evolinc" folder. List their contents with the ls command (for example, ls Evolinc/sample.data.arabi).
Executing the code with the provided test data
A) Arabidopsis test data
B) Brassica test data
This will produce a folder named "output". The Evolinc pipeline generates the following output files:
- lincRNA_final_transcripts.fa - Final Long intergenic ncRNA transcripts in fasta format
- lincRNA_final_transcripts.bed - Final Long intergenic ncRNA transcripts in bed format
- lincRNA_final_transcripts.promoters.fa - Promoter sequences of the final Long intergenic ncRNA transcripts in fasta format
- lincRNA_final_transcripts_counts.txt - File showing the number of transcripts left at every step of the pipeline
- lincRNA_final_transcripts_demographics.txt - Final Long intergenic ncRNA transcripts demographics
- lincRNA_CAGE_final_transcripts.fa - Final Long intergenic ncRNA transcripts that overlap with TSS sites (generated only when you provide a TSS file)
- lincRNA_overlapping_known_final_transcripts.fa - Final Long intergenic ncRNA transcripts that overlap with known lincRNAs (generated only when you provide a known lincRNA file)
- lincRNA_final_transcripts_updated.gtf - Final updated cuffcompare output with the final Long intergenic ncRNA transcripts
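After a run, it can be useful to confirm that the files listed above were actually written. The sketch below is one way to do that in Python; it assumes the files sit directly inside the output folder (the layout is not specified here), and the helper names missing_outputs and count_fasta_records are illustrative only.

```python
import os

# File names taken from the list above.
ALWAYS_EXPECTED = [
    "lincRNA_final_transcripts.fa",
    "lincRNA_final_transcripts.bed",
    "lincRNA_final_transcripts.promoters.fa",
    "lincRNA_final_transcripts_counts.txt",
    "lincRNA_final_transcripts_demographics.txt",
    "lincRNA_final_transcripts_updated.gtf",
]
TSS_OUTPUT = "lincRNA_CAGE_final_transcripts.fa"
KNOWN_OUTPUT = "lincRNA_overlapping_known_final_transcripts.fa"


def missing_outputs(output_dir, expect_tss=False, expect_known=False):
    """Return the expected output files that are absent from output_dir."""
    expected = list(ALWAYS_EXPECTED)
    if expect_tss:      # only produced when a TSS file was supplied
        expected.append(TSS_OUTPUT)
    if expect_known:    # only produced when a known lincRNA file was supplied
        expected.append(KNOWN_OUTPUT)
    return [name for name in expected
            if not os.path.isfile(os.path.join(output_dir, name))]


def count_fasta_records(path):
    """Count sequences in a fasta file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, count_fasta_records on lincRNA_final_transcripts.fa gives the number of final lincRNA transcripts, which can be compared with the last step reported in lincRNA_final_transcripts_counts.txt.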
Part 4: Trying out your data
To run Evolinc on your own data, make a folder within the Evolinc folder, upload your files into that folder, and run the same command as for the test data, substituting your own files. Either Cuffcompare or Cuffmerge output files are acceptable. The genome fasta file should be the same one to which you aligned your transcriptomic data. The transposable element data set can come either from your species of interest or from a family of closely related species. For example, there is a maintained data set of Brassicaceae transposable elements against which A. thaliana lncRNAs can be compared. If you have not generated TSS data yourself, there are publicly available data sets of transcription start sites that may be useful, but only for a limited number of species. If there are multiple public data sets of known lncRNAs for your species that you would like to compare your set against, merge them into one gff document.
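Merging several known-lncRNA gff files into one document can be done in many ways. The sketch below is a minimal Python version under stated assumptions: all files use the same reference coordinates, none contains an embedded ##FASTA section, and comment lines can be discarded. The function name merge_gff is hypothetical. It keeps a single ##gff-version line (if any input has one) at the top and appends every feature line in input order; it does not remove duplicate features.

```python
def merge_gff(input_paths, output_path):
    """Concatenate the feature lines of several gff files into one file.

    Returns the number of feature lines written.
    """
    version_line = None
    features = []
    for path in input_paths:
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\r\n")
                if not line.strip():
                    continue                      # skip blank lines
                if line.startswith("#"):
                    # Keep only the first version pragma; drop other comments.
                    if line.startswith("##gff-version") and version_line is None:
                        version_line = line
                    continue
                features.append(line)

    with open(output_path, "w") as out:
        if version_line is not None:
            out.write(version_line + "\n")
        for feature in features:
            out.write(feature + "\n")
    return len(features)
```

The merged file can then be supplied through the -x option in place of the individual data sets.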