Rationale and Background
RMTA Instructional is designed to let students assemble RNA-sequencing data from a publicly available source (NCBI's SRA). RMTA performs read mapping and transcript assembly of RNA-seq data with minimal user input, whereas earlier workflows required two separate programs to complete these steps. Reads are mapped against a reference genome provided by the user, after which transcripts are assembled and reads are quantified. The user can then quantify the expression of specified genes and/or transcripts.
Minimum data requirements:
- Reference Genome as a FASTA file, or an Indexed Reference Genome as a HISAT2 index
- Reference Annotation
- SRA ID (starting with DRR, ERR, or SRR)
- A CyVerse account (register at https://user.cyverse.org)
- An up-to-date web browser with Java enabled.
- The following mandatory fields:
a. Analysis Name
- A default name for the analysis is provided but can be changed to a more descriptive one.
- Select the folder that should receive the results of the analysis as the output folder.
b. Inputs and Parameters
i. The location of the genome for the default, Arabidopsis thaliana flowers, is pre-filled and can be used as is. To use the genome of a different species, select Browse and choose the folder labeled with the desired species. Then select de_support, where you can select the genome.fa file.
ii. Select the annotation that matches the chosen genome, or use the default. To select an annotation for a species other than the default, select Browse, choose your species, and then select de_support, where you can select the annotation.gtf file.
iii. Provide a single NCBI SRA ID from the same species as the selected genome and annotation file.
iv. Select Single end or Paired end to match the type of sequencing library.
c. Location of results
i. Select a name for your results folder or use the default of RMTA_Output.
An example run can be done by simply launching the analysis with the default selections.
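Because the SRA ID must start with DRR, ERR, or SRR, it can be worth checking an ID before launching the analysis. The following is a minimal sketch of such a check and is not part of RMTA itself. The function name is hypothetical, and the assumption that the prefix is followed only by digits is ours rather than something stated above.

```python
import re

# Prefixes come from the minimum data requirements above.
# Assumption: the prefix is followed by one or more digits and nothing else.
SRA_RUN_PATTERN = re.compile(r"(DRR|ERR|SRR)\d+")


def is_valid_sra_run_id(sra_id: str) -> bool:
    """Return True if sra_id looks like a DRR/ERR/SRR run accession."""
    return SRA_RUN_PATTERN.fullmatch(sra_id.strip()) is not None


if __name__ == "__main__":
    # Illustrative IDs only; they are not taken from the tutorial.
    for candidate in ["SRR1234567", "ERR000001", "PRJNA12345", "srr123"]:
        status = "ok" if is_valid_sra_run_id(candidate) else "rejected"
        print(f"{candidate}: {status}")
```

A project or sample accession, or an ID typed in lowercase, is rejected by this check, which may catch a common copy-and-paste mistake before the run is submitted.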
If the RMTA run is successful, three separate output folders will be generated.
Index: This folder will contain the genome index.
Output: This folder will contain the output from HISAT2, the assembler StringTie, and Cuffcompare. The Feature_counts folder will also be here.
a. There will be a folder for each SRA containing five files:
i. A filtered GTF
ii. A sorted.bam file of the mapped and unmapped reads.
iii. An index for the BAM file above.
iv. An unprocessed GTF directly from StringTie, for users who wish to investigate further.
v. A combined GTF containing the output of the RMTA step in which all identified transcripts are compared to the reference to identify previously unobserved transcripts.
b. The Feature_counts output folder will contain one file with the complete read count data for each gene for the SRA examined in the run.
c. The Fastqc_output folder will have a subfolder for each SRA input file. Each subfolder will contain an HTML file with the details of the FastQC run for each set of reads (1 for SE, 2 for PE).
Logfiles: This folder will contain the log files written to standard output or standard error, along with logs specific to running within the DE.
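After downloading the results, a quick check that all three folders are present can confirm that the run completed. The sketch below is not part of RMTA. It assumes that the folders on disk are named exactly Index, Output, and Logfiles as described above, and that the results folder uses the default name RMTA_Output. The actual names and capitalization may differ, so adjust them to match what you see. The function name is hypothetical.

```python
from pathlib import Path

# Assumption: on-disk folder names match the labels used in this section.
EXPECTED_FOLDERS = ("Index", "Output", "Logfiles")


def missing_output_folders(results_dir):
    """Return the expected folder names that are absent from results_dir."""
    root = Path(results_dir)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    # "RMTA_Output" is the default results folder name mentioned above.
    missing = missing_output_folders("RMTA_Output")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All three output folders are present.")
```

If any folder is reported missing, the log files are a reasonable first place to look for the cause.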