RNA-seq Tutorial- HISAT2, StringTie and Ballgown*

Now users can run Rstudio as app in CyVerse Discovery Environment. Please see this updated wiki page to run New RNA-Seq Tuxedo protocol - RNA-seq Tutorial- HISAT2, StringTie and Ballgown using DE and Rstudio-Ballgown*

New RNA-Seq tuxedo protocol using the Discovery Environment

Rational and background

RNA-seq involves preparing the mRNA which is converted to cDNA and provided as input to next generation sequencing library preparation method. Prior to RNA-seq there were hybridization based microarrays used for gene expression studies, the main drawback was the poor quantification of lowly and highly expressed genes.  RNA-seq provides distinct advantages over microarrays, it provides better insights into alternative gene splicing, post-transcriptional modifications, gene fusion and deferentially expressed genes and thus helping to understanding the gene structure and expression patterns of genes across different samples, treatment conditions and time points. The ease of sequencing and the low cost have made  RNA-seq a workhorse in transcriptomic studies and viable option even for small scale labs. But the main challenge remains in analyzing the sequenced data.

The current ecosystems of RNA-seq tools provide a varied ways of analyzing RNA-seq data. Depending on the experiment goal one could align the reads to reference genome or pseduoalign to transcriptome and perform quantification and differential expression of genes or if you want to annotate your reference, assemble RNA-seq reads using a denvo transcriptome assembler. Here we focus on workflows that align reads to reference genomes.  The most commonly cited and widely used workflow is the Tuxedo protocol (Tophat,Cufflinks,Cuffdiff)developed by Cole Trapnell et al. The main drawback of this workflow is the ability to scale i.e they tend to take more runtime compared to newly updated Tuxedo protocol(HISAT,StringTie,Ballgown) by Mihaela Pertea et al. This updated Tuxedo protocol not only scales but is more accurate in detecting deferentially expressed genes.


In this example we will compare gene transcript abundance drought sensitive sorghum line under drought stress(DS) and well-watered (WW) condition. The expression of drought-related genes was more abundant in the drought sensitive genotype under DS condition compared to WW.

We will use RNAseq to compare expression levels for genes between DS and WW- samples for drought sensitive genotype IS20351  and to identify new transcripts or isoforms. In this tutorial, we will use data stored at the NCBI Sequence Read Archive.

  1. Align the data to the Sorghum v1 reference genome using HISAT2
  2. Transcript assembly using StringTie
  3. Identify differential-expressed genes using Ballgown
  4. Use Atmosphere to visually explore the differential gene expression results.

If you do not have an account, please see one of the on-site CyVerse staff for a temporary account.

Here is the webinar that explains more about the new pipeline:

Specific Objectives

By the end of this module, you should

  1. Be more familiar with the DE user interface
  2. Understand the starting data for RNA-seq analysis
  3. Be able to align short sequence reads with a reference genome in the DE
  4. Be able to analyze differential gene expression in the DE and Atmosphere

Note on Staged Data:

Several of the methods in this tutorial can take 2 to 4 hrs to complete on a full-sized data set. So that you can complete the tutorial in the allotted time, we have pre-staged input and output files in the 'Community Data' folder for each step. You can start your analyses then skip to the next step using pre-staged data.

Original data from NCBI Sequence Read Archive study re accessible through GEO Series accession number GSE80699 

Paper Reference for dataset: Drought stress tolerance strategies revealed by RNA-Seq in two sorghum genotypes with contrasting WUE, 

Alessandra Fracasso,  Luisa M. Trindade and Stefano Amaducci; DOI: 10.1186/s12870-016-0800-x, May 2016

The Staged Fastq Data can be found in the

Community Data -> iplantcollaborative -> example_data -> HISAT2_StringTie_Ballgown -> reads

 

Section 1: Align reads to reference using HISAT2 aligner

 Hisat2 is another efficient splice aligner which is a replacement for Tophat in the new Tuxedo protocol. Like Tophat2 it uses one global FM index along with several small local FM indexes to build an efficient data structure which helps speed its alignment several times faster than Tophat2.

HISAT2 paper: HISAT: a fast spliced aligner with low memory requirements 
Reference Manual

a) Open HISAT2 app

Click on Operation -> Categories ->  Mapping ->  HISAT2-index-align.


b) Click on the name of the App to open it.


c) In the Input  section,you need to first select the a reference genome and annotation file either from the drop list or upload a fasta for the genomes or gtf annotation file.

Here we upload the reference genome file from:

Reference genome : Community Data ->iplantcollaborative->example_data->HISAT2_StringTie_Ballgown->Sorghum_bicolor.Sorbi1.20.dna.toplevel.fa


you can click 'Add' at the top right of the FASTQ File(s) box to navigate to the folder containing the FASTQ files: For paired end data add the read 1 gzipped files for all samples then followed by read 2 files in gzipped format.

Select File type as PE since we are using paired end data set.


Note: For multiple input files, it is convenient to open a separate data window, navigate to the above folder and drag all of files into the input data box. Each of the input files will be processed independently. For convenience, a batch of FASTQ files can be analyzed together but these files can also be processed concurrently in independent STAR runs

Under advance options select to make alignments compatible with StringTie and Cufflinks for downstream analysis


d) Start the analysis by clicking the Launch Analysis button, naming the analysis 'HISAT2' in the dialog box. Reasonable default options are provided for the analysis settings.


e) HISAT2 will require some time to complete its work on each sample. Click on the Analyses icon to view the status of your submitted analysis.


f) When the analysis is complete, navigate to the results under your 'analyses' folder. The principle outputs from HISAT2 are BAM files, each set of alignments in a bam_output folder, bam file corresponding to the name of the original Fastq file. This is the most time-consuming step of this training module. If your analysis has not completed in time, you can skip ahead by using the pre-computed results:

 

Community Data ->iplantcollaborative->example_data->HISAT2_StringTie_Ballgown-> HISAT2_results

 In the HISAT2_results folder, you should see these folders:

HISAT2_results:  The result directory for the HISAT2 runs contain the following

  • bam_output: directory of alignment files coordinate sorted in bam format for each sample, along with their index bai files

We will use the bam_output folder to assemble transcripts using Stringtie.


 

Section 2: Assemble transcripts with StringTie-1.3.3

StringTie like Cufflinks2 assembles transcripts from RNA seq reads aligned to the reference and does quantification. It follows a netflow algorithm where it assembles and quantitates simultaneously the highly expressed transcripts and removes reads associated with that transcripts and repeats the process until all the reads are used. This algorithm improves the run time for StringTie using less memory compared to Cufflinks2. If provided with a reference annotation file Stringtie uses it to construct assembly for  low abundance genes, but this is optional. Alternatively, you can skip the assembly of novel genes and transcripts, and use StringTie simply to quantify all the transcripts provided in an annotation file

a) Click on the Apps icon and find StringTie-1.3.3. Open it. Name the analysis as StringTie-1.3.3

b) In the 'Select Input data' section, add the 'bam' files by navigating to  the bam_output folder  from HISAT2 (above) and select and drag all  bam files into the input box. For convenience, a batch of HISAT2 bam files can be analyzed together but these files can also be processed concurrently in independent StringTie runs. The path for the HISAT2 bam files

 

Community Data ->iplantcollaborative->example_data->HISAT2_StringTie_Ballgown-> bam_output


c) In the Reference Annotaiton section, select reference annotation Sorghum_bicolor.Sorbi1.20.gtf.

d) Run StringTie-1.3.3. When it is complete, you will see the following outputs


StringTie_output contains the following files:

  • Stringtie's main output is a GTF file containing the assembled transcripts e.g:IS20351_DS_1_1.gtf
  • Gene abundances in tab-delimited format e.g IS20351_DS_1_1_abund.tab
  • Fully covered transcripts that match the reference annotation, in GTF format e.g IS20351_DS_1_1.refs.gtf

Examine the GTF file, IS20351_DS_1_1.gtf This file contains annotated transcripts assembled by StringTie-1.3.3, using the annotated transcripts selected from the reference file uploaded in the StringTie-1.3.3 app as a guide.This file gives normalized expression metrics in both FPKM and TPM along with per base coverage

# stringtie IS20351_DS_1_1.sorted.bam -o IS20351_DS_1_1.gtf -G Sorghum_bicolor.Sorbi1.20.gtf -p 4 -m 200 -c 2.5 -g 50 -f 0.1 -C IS20351_DS_1_1.refs.gtf -A IS20351_DS_1_1.abund.tab
# StringTie version 1.3.3
8       StringTie       transcript      10354   14469   1000    +       .       gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "3.489598"; FPKM "0.801415"; TPM "0.990349";
8       StringTie       exon    10354   10640   1000    +       .       gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; cov "1.000000";
8       StringTie       exon    13241   13900   1000    +       .       gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "2"; cov "5.615151";
8       StringTie       exon    13975   14469   1000    +       .       gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "3"; cov "2.098990";
8       StringTie       transcript      43183   43920   1000    +       .       gene_id "STRG.2"; transcript_id "STRG.2.1"; cov "2.972900"; FPKM "0.682751"; TPM "0.843710";
8       StringTie       exon    43183   43920   1000    +       .       gene_id "STRG.2"; transcript_id "STRG.2.1"; exon_number "1"; cov "2.972900";
8       StringTie       transcript      43997   44781   1000    +       .       gene_id "STRG.3"; transcript_id "STRG.3.1"; cov "6.772846"; FPKM "1.555440"; TPM "1.922136";
8       StringTie       exon    43997   44216   1000    +       .       gene_id "STRG.3"; transcript_id "STRG.3.1"; exon_number "1"; cov "9.109091";

 

Section 3: Merge all StringTie-1.3.3 transcripts into a single transcriptome annotation file using StringTie-1.3.3_merge

StringTie-merge like Cufflinks-merge uses the same principle of merging the transcript assemblies from samples into a consolidated annotation set. This merging steps helps restore full length of the structure transcript especially for transcript assembled with low coverage.The main purpose of this application is to make it easier to make an assembly GTF file suitable for use with Ballgown. A merged, empirical annotation file will be more accurate than using the standard reference annotation, as the expression of rare or novel genes and alternative splicing isoforms seen in this experiment will be better reflected in the empirical transcriptome assemblies.

 

a) Open the StringTie-1.3.3_merge app. Under 'Input Data', browse to the results of the StringTie-1.3.3 analyses (gtf_out). Select the all GTF files as input for StringTie-1.3.3_merge.

 

b) Under Reference Annotation, select the Sorghum_bicolor.Sorbi1.20.gtf.



c) Run the App. When it is complete, you should see the following outputs under StringTie-1.3.3_merge_results:


We will use the merged.out.gtf with StrignTie-1.3.3 app again

 

Section 4: Create Ballgown input files using  with StringTie-1.3.3

We will use StringTie-1.3.3 again to assemble transcripts but this time will will use the consolidated annotation file we got from StringTie-1.3.3_merge step. This will re-estimate the transcript abundances using the merged structures but reads may need to be re-allocated for transcripts whose structures were altered by the merging step. We will set the option -B, e  of StringTie. These options are set by checking the two boxes at the bottom of the 'analysis option' section of the app.  This will create table count files (*ctab files) for each sample to be used in Ballgown for differential expression. Name the analysis "StringTie-1.3.3_from_merged_annotation"


Run the App. When it is complete, you should see the following outputs under

 

  • Ballgown_input_files: count files (*ctab files) for each sample

    • e_data.ctab: exon-level expression measurements.

    • i_data.ctab: intron- (i.e., junction-) level expression measurements

    • t_data.ctab: transcript-level expression measurements

    • e2t.ctab: table with two columns, e_id and t_id, denoting which exons belong to which transcripts

    • i2t.ctab: table with two columns, i_id and t_id, denoting which introns belong to which transcripts

 

More details of the ctab files please refere the Ballgown documentation

 

Section 5: Compare expression analysis using Ballgown

Ballgown is a R package that uses abundance data produced by StringTie to perform differential expression analysis at gene, transcript, exon or junction level. It does both time series and fixed condition differential expression analysis.

a) In Apps select Ballgown

b) provide a experiment design matrix file in txt, This file defines  samples, group and replicate information e:g

ID      condition   reps

IS20351_DS_1_1    DS     1

IS20351_DS_1_2     DS     2

IS20351_DS_1_3     DS     3

IS20351_WW_1_1    WW     1

IS20351_WW_1_2    WW     2

IS20351_WW_1_3    WW     3

Here we are doing a pairwise comparison for differential expression in sensitive genotype under Drought Stress(DS) and Well Watered(WW) condition. We have 3 replicates under each condition.

c) Upload the following files

  • design matrix file

 

Community Data -> iplantcollaborative -> example_data -> HISAT2_StringTie_Ballgown -> design_matrix


  • Ballgown_input_files directoey containing the ctab count files from stringtie runs

 

Community Data -> iplantcollaborative -> example_data -> HISAT2_StringTie_Ballgown -> StringTie-1.3.3_from_merged_annotation -> ballgown_input_files


  • provide experimental covariate which is condition(it should match column in design matrix file)


d) Launch analysis

e) Examine results

 

Successful execution of the  Ballgown  will create a directory named output. The directory will contain the following files:

 

  1.   Rplots.pdf- Boxplot of FPKM distribution of each smaple
     
  2.   results_gene.tsv- Gene level Differential expression with no filtering
     
  3.   results_gene_filter.sig.tsv- Identify genes with p value < 0.05
  4.   results_gene_filter.tsv- Filter low-abundance genes, here we remove all genes with a variance across samples less than one
  5.   results_trans.tsv-transcript level Differential expression with no filtering
     
  6.   results_trans_filter.sig.tsv- Identify transcripts with p value < 0.05
     
  7.   results_trans_filter.tsv-Filter low-abundance genes, here we remove all transcript with a variance across samples less than one


 

Running Ballgown for Differential gene expression and visualization in Atmosphere.

 

Section 1: Connect to an instance of an Atmosphere Image and launch a instance

 

1. Go to https://atmo.cyverse.org/ and log in with IPLANT TEST USER CREDENTIALS.

 

 

2. Click on images. Search "Ubuntu 16_04 GUI XFCE Base v2" on the search space. Click on "Ubuntu 16_04 GUI XFCE Base v2" image.

 

 

3. Click on Launch. Add to existing projects if you have already created one. In this case, we will add the instance to existing project "RNA-seq visualization". Select the size of the instance "tiny 2(CPU:1,Mem: 8, Disk: 60Gb)". Click on Launch instance. This will launch a instance of the image.

 

4. Once the instance is active, you will get a email confirmation and can also see  the status active on the Atmosphere portal

5. Copy the IP address "128.196.64.168", if you are using MAC or Ubuntu open a terminal window and paste the IP address on the command line using your Cyverse user name and hit enter. This will ask you your Cyverse user account passowrd, hit enter
Note- After entering the password if its ask you to add your IP address to the config, enter yes.

escudilla:~ kchougul$ ssh kapeelc@128.196.64.168

After you enter the password you should see something like this

 

escudilla:~ kchougul$ ssh kapeelc@128.196.64.168
kapeelc@128.196.64.168's password: 
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-72-generic x86_64)

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

45 packages can be updated.
0 updates are security updates.


*** System restart required ***
Welcome to
    _   _                             _
   / \ | |_ _ __ ___   ___  ___ _ __ | |__   ___ _ __ ___
  / _ \| __| '_ ` _ \ / _ \/ __| '_ \| '_ \ / _ \ '__/ _ \
 / ___ \ |_| | | | | | (_) \__ \ |_) | | | |  __/ | |  __/
/_/   \_\__|_| |_| |_|\___/|___/ .__/|_| |_|\___|_|  \___|
                               |_|
Last login: Mon Jun 19 14:16:47 2017 from 143.48.50.204
**************************************************************************
# A new feature in cloud-init identified possible datasources for        #
# this system as:                                                        #
#   ['OpenStack', 'None']                                                #
# However, the datasource used was: CloudStack                           #
#                                                                        #
# In the future, cloud-init will only attempt to use datasources that    #
# are identified or specifically configured.                             #
# For more information see                                               #
#   https://bugs.launchpad.net/bugs/1669675                              #
#                                                                        #
# If you are seeing this message, please file a bug against              #
# cloud-init at                                                          #
#    https://bugs.launchpad.net/cloud-init/+filebug?field.tags=dsid      #
# Make sure to include the cloud provider your instance is               #
# running on.                                                            #
#                                                                        #
# After you have filed a bug, you can disable this warning by launching  #
# your instance with the cloud-config below, or putting that content     #
# into /etc/cloud/cloud.cfg.d/99-warnings.cfg                            #
#                                                                        #
# #cloud-config                                                          #
# warnings:                                                              #
#   dsid_missing_source: off                                             #
**************************************************************************

Disable the warnings above by:
  touch /home/kapeelc/.cloud-warnings.skipy
or
  touch /var/lib/cloud/instance/warnings/.skip
kapeelc@vm64-168:~$ 


6. Now your instance is ready to use.  We first download the data (Section 4  "/iplant/home/shared/iplantcollaborative/example_data/HISAT2-StringTie-Ballgown/StringTie-1.3.3_from_merged_annotation/ballgown_input_files") needed to run Ballgown. We create a directory ballgown_analysis to download the input files. We will use Rstudio  on atmosphere to do the interactive analysis using Ballgown R package. To download the data we will use command line icommands that are installed in this instance for you. Type the below command on the terminal where the instance is opened and enter your cyverse password to begin download

 

1. kapeelc@vm64-135:~$ mkdir ballgown_analysis
2. kapeelc@vm64-168:~$ iget -PVr /iplant/home/shared/iplantcollaborative/example_data/HISAT2-StringTie-Ballgown/StringTie-1.3.3_from_merged_annotation/ballgown_input_files
NOTICE: created irodsHome=/iplant/home/kapeelc
NOTICE: created irodsCwd=/iplant/home/kapeelc
NOTICE: created irodsHome=/iplant/home/kapeelc
NOTICE: created irodsCwd=/iplant/home/kapeelc
NOTICE: created irodsHome=/iplant/home/kapeelc
NOTICE: created irodsCwd=/iplant/home/kapeelc
NOTICE: created irodsHome=/iplant/home/kapeelc
NOTICE: created irodsCwd=/iplant/home/kapeelc
NOTICE: created irodsHome=/iplant/home/kapeelc
NOTICE: created irodsCwd=/iplant/home/kapeelc
NOTICE: created irodsHome=/iplant/home/kapeelc
NOTICE: created irodsCwd=/iplant/home/kapeelc
NOTICE: created irodsHome=/iplant/home/kapeelc
NOTICE: created irodsCwd=/iplant/home/kapeelc
Enter your current iRODS password:

3. kapeelc@vm64-135:~$ iget -PV /iplant/home/shared/iplantcollaborative/example_data/HISAT2-StringTie-Ballgown/design_matrix .

Once the down load is complete you should see the directory which contains the ctab count files for each sample that will be used in Ballgown for differential expression and plotting

kapeelc@vm64-168:~$ ls -lh
total 36K
drwxr-xr-x 3 kapeelc iplant-everyone 4.0K Jul  19 13:50 ballgown_analysis
drwxr-xr-x 2 root root 4.0K Jun 19 13:50 Desktop
drwxr-xr-x 2 kapeelc iplant-everyone 4.0K Jun 19 13:50 Documents
drwxr-xr-x 2 kapeelc iplant-everyone 4.0K Jun 19 13:50 Downloads
drwxr-xr-x 2 kapeelc iplant-everyone 4.0K Jun 19 13:50 Music
drwxr-xr-x 2 kapeelc iplant-everyone 4.0K Jun 19 13:50 Pictures
drwxr-xr-x 2 kapeelc iplant-everyone 4.0K Jun 19 13:50 Public
drwxr-xr-x 2 kapeelc iplant-everyone 4.0K Jun 19 13:50 Templates
drwxr-xr-x 2 kapeelc iplant-everyone 4.0K Jun 19 13:50 Videos

 

Section 2: Using R studio on atmosphere to do the interactive analysis using Ballgown R package

  1.  On the command line type below three commands to start a Rstudio and enter your Cyverse password. This will install all the dependencies needed to start Rstudio.
1. kapeelc@vm64-135:~$ sudo apt-get update && sudo apt-get -y install gdebi-core r-base libxml2-dev
[sudo] password for kapeelc:
2. kapeelc@vm64-135:~$ wget https://download2.rstudio.org/rstudio-server-1.0.136-amd64.deb
3. kapeelc@vm64-135:~$ sudo gdebi -n rstudio-server-1.0.136-amd64.deb

2. If you want to find what the IP address of the R studio then run below command,

kapeelc@vm64-135:~$ echo My RStudio Web server is running at: http://$(hostname):8787/
My RStudio Web server is running at: http://vm64-135.iplantcollaborative.org:8787/

just copy and paste  http://vm64-135.iplantcollaborative.org:8787/ on to the web-browser

3. Enter your Cyverse user name and password to launch the R studio on browser

4. paste the below R commands to begin the analysis. Start File -> New File -> R script, then paste the commands

 

 

 

 

Unable to render {include} The included page could not be found.