
The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.

Please work through the documentation and add your comments on the bottom of this page, or email comments to support@cyverse.org.

Rationale and background:

RMTA is a wrapper script built on top of several publicly available bioinformatic tools that can rapidly proceed from raw short read data to assembled transcripts. RMTA performs this by mapping reads using either HiSat2 or Bowtie2 and then assembling transcripts using either Cufflinks or StringTie, according to user preference. RMTA can process FASTQ files containing paired-end or single-end reads or can directly process one or more sequence read archives (SRA) from NCBI using SRA IDs. RMTA has been successfully used by many groups as the first step towards the identification of long non-coding RNAs using the Evolinc workflow. More information about RMTA can be found here

RMTA (Read Mapping and Transcript Assembly) is a gene quantification workflow for RNA-Seq data. It uses the CyVerse Discovery Environment, with HTCondor for job submission and the Data Store for data management.

RMTA minimally requires the following input data:

  1. Reference Genome (FASTA) or Hisat2 Indexed Reference Genome (in a subdirectory)
  2. Reference Transcriptome (GFF3/GTF/GFF)
  3. RNA-Seq reads (FASTQ): single-end or paired-end (compressed or uncompressed), or a text file listing one or more NCBI SRA IDs (one SRA ID per row).
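
The SRA ID list described in item 3 is a plain text file with one ID per row. The following is a minimal sketch of how such a file could be prepared and checked before upload. The function name `format_sra_id_list` is hypothetical, and the accession pattern (SRR, ERR, or DRR followed by digits) is an assumption about what counts as a valid run ID, not a rule stated by RMTA.

```python
import re

# Assumption: run accessions look like SRR/ERR/DRR followed by digits.
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def format_sra_id_list(ids):
    """Return file content with one SRA run ID per row.

    Surrounding whitespace is stripped; malformed or empty entries
    raise ValueError instead of being written to the list.
    """
    cleaned = []
    for raw in ids:
        sra_id = raw.strip()
        if not RUN_ACCESSION.fullmatch(sra_id):
            raise ValueError(f"not a run accession: {raw!r}")
        cleaned.append(sra_id)
    # Each ID sits on its own row, with a trailing newline at the end.
    return "\n".join(cleaned) + "\n"


if __name__ == "__main__":
    # Example IDs taken from the test data file names on this page.
    content = format_sra_id_list(["SRR2037320", "SRR2932454"])
    with open("sra_id_pe.txt", "w") as handle:
        handle.write(content)
```

The resulting text file can then be uploaded to the Data Store and selected as the SRA ID list input.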


Pre-Requisites:

  1. A CyVerse account (Register for a free CyVerse account at https://user.cyverse.org). 

  2. An up-to-date Java-enabled web browser. (Firefox recommended. If you wish to work with your own large datasets and upload them using iCommands, Chrome is not suitable due to its issues in utilizing 64-bit Java.)

  3. Mandatory arguments 

    1. Reference genome 

      Select at least one of the following two options to supply the reference genome:

      1. Custom genome (required)
      2. Hisat2 Indexed folder (for indexed genomes)
    2. Reference annotation

      1. Custom Reference annotation
      2. Reference Annotation from the list

      If you have many files to process through the Discovery Environment, an HT Analysis Path List File may prove useful, as this app takes only 1 file at a time. For information on how to create an HT path list, click here.

      Only one of the following three read options (3, 4, or 5: paired-end reads, single-end reads, or SRA) may be selected.

    3. Paired-end reads

      1. FASTQ Files (Read 1): HT path list of read 1 files of paired-end data
      2. FASTQ Files (Read 2): HT path list of read 2 files of paired-end data
    4. Single-end reads

      1. Single end FASTQ files: HT path list of read files of single-end data
    5. SRA

      1. Enter the SRA id, or
      2. Select a file containing SRA ids: HT path list of multiple SRA ids list files
    6. Aligners

      Select exactly one of the two options below; they cannot be combined.

      1. Hisat2 (default)

      2. Bowtie2

        When to use Hisat2 or Bowtie2

        Hisat2 is a splice-aware aligner used to perform reference genome-based read mapping. StringTie is then used to assemble transcripts based on this read mapping.

        The read aligner Bowtie2 is included as an optional aligner in the RMTA workflow for users who wish to call single nucleotide polymorphisms (SNPs) from their RNA-seq (or DNA-seq) data in a high-throughput manner. When the Bowtie2 option is selected, Hisat2 and StringTie are both removed from the workflow, and the additional option to remove duplicate reads (important for population-level analyses) becomes available.

    7. Featurecount
      1. Choose a Feature Type. The default option will be "exon"
      2. Choose a gene attribute. The default option will be "gene_id"
      3. Select the type of strandedness. The three options include unstranded, stranded, and reversely stranded.

  4. Advanced options
    1. Type of Sequence: Choose either Single End or Paired End
    2. Number of threads (Default is 4)
    3. FPKM cut-off threshold (For RNA-Seq reads only with Hisat2) (Default is 0)
    4. Coverage cut-off threshold (For RNA-Seq reads only with Hisat2) (Default is 0)
    5. Choose RNA strandedness (default is unstranded) 
    6. Trim bases from 5' end of read: Trim bases from 5' (left) end of each read before alignment (Default is 0)
    7. Trim bases from 3' end of read: Trim bases from 3' (right) end of each read before alignment (Default is 0)
    8. Minimum intron length: Set minimum intron length (Default is 20)
    9. Maximum intron length: Set maximum intron length (Default is 500000)
    10. Phred64 (Default is Phred33): Check to run Phred64
    11. Run Fastqc
    12. Remove duplicate reads

      When using Bowtie2

      When using Bowtie2, be sure to check the box labeled "Remove duplicate reads," as shown in the figure below.




  5. RMTA_Output
    1. Name of the output folder (Default is RMTA_Output)
  6. README
    1. HISAT2 and Bowtie2 are fast and sensitive alignment programs for mapping next-generation sequencing reads (both DNA and RNA). StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. Cuffcompare compares your assembled transcripts to a reference annotation and tracks Cufflinks transcripts across multiple experiments (e.g. across a time course).
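
Several of the advanced options above (threads, 5' and 3' trimming, minimum and maximum intron length, Phred encoding) have direct counterparts among HISAT2's own command-line flags. The sketch below shows one way those settings could be turned into a HISAT2 command line. It is an illustration only: the function `hisat2_args` and its parameter names are hypothetical, and this page does not state how the RMTA wrapper actually invokes HISAT2.

```python
def hisat2_args(index_prefix, reads, out_sam, *, threads=4, trim5=0, trim3=0,
                min_intron=20, max_intron=500000, phred64=False):
    """Build a HISAT2 argument list from RMTA-style advanced options.

    `reads` is a tuple: (read1, read2) for paired-end data or
    (reads,) for single-end data. Defaults mirror the defaults
    listed in the advanced options above.
    """
    args = [
        "hisat2",
        "-x", index_prefix,                 # prefix of the HISAT2 index files
        "-p", str(threads),                 # number of threads
        "--trim5", str(trim5),              # bases trimmed from the 5' end
        "--trim3", str(trim3),              # bases trimmed from the 3' end
        "--min-intronlen", str(min_intron),
        "--max-intronlen", str(max_intron),
        "--phred64" if phred64 else "--phred33",
    ]
    if len(reads) == 2:
        args += ["-1", reads[0], "-2", reads[1]]   # paired-end input
    elif len(reads) == 1:
        args += ["-U", reads[0]]                   # single-end input
    else:
        raise ValueError("reads must contain one or two file names")
    args += ["-S", out_sam]                        # SAM output file
    return args


if __name__ == "__main__":
    # Hypothetical index prefix and output name; read files are from the test data.
    cmd = hisat2_args("Index/genome_chr1",
                      ("SRR2037320_R1.fastq.gz", "SRR2037320_R2.fastq.gz"),
                      "SRR2037320.sam")
    print(" ".join(cmd))
```

Returning a list rather than a single string keeps file names with spaces intact if the command is later passed to `subprocess.run`.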

Example Runs

The following Arabidopsis test data are provided for testing RMTA at: /iplant/home/shared/iplantcollaborative/example_data/RMTA

Note that only one of the read inputs below (item 3, 4, or 5) may be used per test run; for example, when testing SRA IDs, do not also supply FASTQ files.

  1. Reference Genome: genome_chr1.fa
  2. Reference Annotation: genome_chr1.gtf
  3. Paired End Reads: 
    1. Left End Reads: SRR2037320_R1.fastq.gz and SRR2932454_R1.fastq.gz
    2. Right End Reads: SRR2037320_R2.fastq.gz and SRR2932454_R2.fastq.gz
  4. Single End Reads: SRR3464102.fastq.gz and SRR3464103.fastq.gz
  5. List of SRA IDs:
    1. Paired End: sra_id_pe.txt
    2. Single End: sra_id_se.txt
  6. Aligners: HiSat2
  7. Feature Count: leave as default
  8. Advanced Options:
    1. Type of Sequence: select Single for Single End and Paired for Paired End 
    2. Leave the rest as default
  9. RMTA Output: leave as default 

All other settings should be left as default.
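
Because each test run accepts only one kind of read input (paired-end FASTQ, single-end FASTQ, or an SRA ID list), it can help to check a planned run before submitting it. The following is a minimal sketch of such a check. The function `select_read_source` and its argument names are hypothetical and are not part of RMTA or the Discovery Environment.

```python
def select_read_source(paired=None, single=None, sra=None):
    """Return the name of the single read source that was supplied.

    paired: (read1, read2) file names for paired-end data
    single: file name for single-end data
    sra:    file name of an SRA ID list (or a single SRA ID)
    Raises ValueError unless exactly one source is given.
    """
    supplied = {
        name: value
        for name, value in (("paired", paired), ("single", single), ("sra", sra))
        if value
    }
    if len(supplied) != 1:
        raise ValueError(
            f"expected exactly one read source, got {sorted(supplied) or 'none'}"
        )
    name, value = next(iter(supplied.items()))
    if name == "paired" and len(value) != 2:
        raise ValueError("paired-end input needs a read 1 and a read 2 file")
    return name


if __name__ == "__main__":
    # A paired-end test run using file names from the example data.
    print(select_read_source(
        paired=("SRR2037320_R1.fastq.gz", "SRR2037320_R2.fastq.gz")))
```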

Results 

Successful execution of RMTA will generate two output folders:

  1. Index: This folder consists of the index of the genome
  2. Output: This folder consists of the output from Hisat2, Stringtie and Cuffcompare (Please refer to the manual for the explanation of outputs from these individual programs)