The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.
Please work through the documentation and add your comments at the bottom of this page, or email them to support@cyverse.org.
Rationale and background:
RMTA is a high-throughput RNA-seq read mapping and transcript assembly workflow. It combines the standard RNA-seq analysis programs, traditionally run one at a time, into a single, easy-to-use workflow that can rapidly assemble and process any amount of local (FASTQ) or NCBI-stored (SRA) RNA-seq data. RMTA maps reads to a user-provided reference genome using either HISAT2 (transcript analysis) or Bowtie2 (SNP analysis), assembles transcripts using StringTie, and then performs read quantification using featureCounts. Beyond read mapping and assembly, RMTA automates several onerous data transformation and quality control steps, producing outputs that can be used directly for differential expression analysis, data visualization, or novel gene identification, all of which can be performed in the DE or at CoGe.
Minimum data requirements:
- Reference Genome (FASTA) or HISAT2 Indexed Reference Genome (in a subdirectory)
- Reference Transcriptome (GFF3/GTF/GFF)
- RNA-Seq reads (FASTQ): single-end or paired-end (compressed or uncompressed), or multiple NCBI SRA IDs (each SRA ID on a separate row in a text file).
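The SRA ID list format described above (one ID per row) can be sketched in Python as follows. This is only an illustration and not part of RMTA: the function name `write_sra_id_list` and the accession pattern (SRR, ERR, or DRR followed by digits) are assumptions made for this example.

```python
import re
from pathlib import Path

# Assumed pattern for SRA run accessions: SRR/ERR/DRR followed by digits.
SRA_RUN_PATTERN = re.compile(r"^[SED]RR\d+$")


def write_sra_id_list(ids, path):
    """Write one SRA ID per line, skipping blanks and rejecting malformed IDs."""
    cleaned = [i.strip() for i in ids if i.strip()]
    bad = [i for i in cleaned if not SRA_RUN_PATTERN.match(i)]
    if bad:
        raise ValueError(f"Not SRA run accessions: {bad}")
    # One ID per row, with a trailing newline after the last ID.
    Path(path).write_text("\n".join(cleaned) + "\n")
    return cleaned
```

For example, `write_sra_id_list(["SRR3464102", "SRR3464103"], "my_sra_ids.txt")` would produce a two-line text file (the file name here is hypothetical) that can then be uploaded to the Data Store.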
Pre-Requisites:
A CyVerse account (Register for a free CyVerse account at https://user.cyverse.org).
An up-to-date Java-enabled web browser.
Mandatory fields
Analysis Name
Choose an appropriate name for your analysis and make comments if you wish. Default name is shown in the figure below.
Select the output folder for the results of the analysis.
Reference genome
- Custom genome (required)
- HISAT2 Indexed folder (for indexed genomes)
Reference annotation
- Custom Reference annotation
Paired-end reads
- FASTQ Files (Read 1): HT path list of read 1 files of paired-end data
- FASTQ Files (Read 2): HT path list of read 2 files of paired-end data
Single-end reads
- Single end FASTQ files or a HT path list of read files of single-end data
SRA
- Enter the SRA ID, or
- Select a file containing a list of SRA IDs (one per line) or a HT path list of multiple SRA ID list files
Aligners
- HISAT2 (default)
- Bowtie2
featureCounts
- Choose a Feature Type. The default option will be "exon"
- Choose a Gene Attribute. The default option will be "gene_id"
- Select the Type of Strandedness. The three options are unstranded, stranded, and reversely stranded.
Advanced options:
- Type of Sequence: Choose either Single End or Paired End
- Number of threads (Default is 4)
- FPKM cut-off threshold (For RNA-Seq reads only with HISAT2) (Default is 0)
- Coverage cut-off threshold (For RNA-Seq reads only with HISAT2) (Default is 0)
- Choose RNA strandedness (default is unstranded)
- Trim bases from 5' end of read: Trim bases from 5' (left) end of each read before alignment (Default is 0)
- Trim bases from 3' end of read: Trim bases from 3' (right) end of each read before alignment (Default is 0)
- Minimum intron length: Set minimum intron length (Default is 20)
- Maximum intron length: Set maximum intron length (Default is 500000)
- Phred64 (Default is Phred33): Select to run Phred64
- Run FastQC
- Remove duplicate reads
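The Phred64 option above only matters for older reads whose quality scores use an ASCII offset of 64 rather than 33. As a rough sketch (not part of RMTA), the following Python heuristic guesses the encoding by looking at the range of quality characters. The function name `guess_phred_offset`, the sample size, and the thresholds are assumptions for illustration, and the code assumes four-line FASTQ records without wrapped sequences.

```python
import gzip
from itertools import islice


def guess_phred_offset(fastq_path, max_records=10000):
    """Guess the quality offset (33 or 64) of a FASTQ file; None if ambiguous."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    lowest, highest = 255, 0
    with opener(fastq_path, "rt") as handle:
        # The 4th line of each 4-line FASTQ record holds the quality string.
        for qual in islice(handle, 3, max_records * 4, 4):
            codes = [ord(c) for c in qual.rstrip("\r\n")]
            if codes:
                lowest = min(lowest, min(codes))
                highest = max(highest, max(codes))
    if lowest < 59:
        return 33   # characters this low cannot occur in offset-64 data
    if lowest >= 64 and highest > 74:
        return 64   # consistent with older offset-64 Illumina encodings
    return None     # not enough evidence either way
```

A result of `None` means the sampled reads do not distinguish the two encodings; in that case, the sequencing provider's documentation is the more reliable source.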
RMTA_Output:
Name of the output folder (Default is RMTA_Output)
The following Arabidopsis test data are provided for testing RMTA at /iplant/home/shared/iplantcollaborative/example_data/RMTA (this path can be copied and pasted into the navigation bar of a data window within the DE).
***Note that when testing SRA IDs, only one of steps 3, 4, or 5 may be used at a time per test run.***
- Reference Genome: genome_chr1.fa
- Reference Annotation: genome_chr1.gtf
- Paired End Reads:
  - Left End Reads: SRR2037320_R1.fastq.gz and SRR2932454_R1.fastq.gz
  - Right End Reads: SRR2037320_R2.fastq.gz and SRR2932454_R2.fastq.gz
- Single End Reads: SRR3464102.fastq.gz and SRR3464103.fastq.gz
- List of SRA IDs:
  - Paired End: sra_id_pe.txt
  - Single End: sra_id_se.txt
- Aligners: HISAT2
- Feature Count: leave as default
- Advanced Options:
  - Type of Sequence: select Single for Single End and Paired for Paired End
  - Leave the rest as default
- RMTA Output: leave as default
All other settings should be left as default.
Results
Successful execution of RMTA will generate three output folders:
- Index: This folder contains the index of the genome.
- Output: This folder contains the output from HISAT2, StringTie, and Cuffcompare, as well as the Feature_counts and (optional) FASTqc_out folders.
  - It contains five files for each SRA or FASTQ input:
    - A filtered GTF (if either the read/base or FPKM filter was set).
    - A sorted.bam file of all mapped and unmapped reads.
    - An index associated with the above BAM file.
    - A GTF that represents the unprocessed output straight from StringTie, for users who would like to investigate their data further.
    - A combined.gtf that is the output of the cuffmerge step of RMTA (comparing all identified transcripts back to the reference to identify novel transcripts). This file is useful for novel transcript identification.
  - The Feature_counts folder contains one file with the read count data for each gene across all SRA/FASTQ files examined in the run.
  - The FASTqc_out folder contains a subfolder for each SRA/FASTQ input file. Each subfolder holds an HTML file with the details of the FastQC run for each set of reads (one for single-end, two for paired-end).
- Logfiles: This folder contains the stdout and stderr files (information written to standard output or standard error), as well as logs specific to running within the DE.
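The read count file in the Feature_counts folder can be loaded for downstream work with a few lines of Python. The sketch below assumes the file keeps the standard featureCounts tab-delimited layout (a leading `#` comment line, then a header with `Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`, and one column per sample) and that counts are whole numbers. The function name `read_feature_counts` is hypothetical, and RMTA may name or post-process the file differently.

```python
import csv


def read_feature_counts(path):
    """Return {gene_id: {sample_column: count}} from a featureCounts table."""
    annotation_cols = {"Geneid", "Chr", "Start", "End", "Strand", "Length"}
    counts = {}
    with open(path, newline="") as handle:
        # Skip the leading '#' comment line that featureCounts writes.
        rows = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(rows, delimiter="\t")
        # Every column that is not annotation is treated as a sample.
        samples = [c for c in reader.fieldnames if c not in annotation_cols]
        for row in reader:
            # Assumes integer counts (no fractional counting was used).
            counts[row["Geneid"]] = {s: int(row[s]) for s in samples}
    return counts
```

The returned dictionary maps each gene ID to its per-sample counts, which is a convenient starting point for building the count matrix that differential expression tools expect.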