This space is home to learning materials and tutorials created for CyVerse products and services.


LEARNING MATERIALS

General Workflow:

  • Transfer data to the CyVerse Discovery Environment
  • Gunzip your files
  • FastQC
  • Hisat2 (index and align)
  • HTSeq Count
  • DESeq2


 

Example Data

Example data to be used with this tutorial can be found here: Community Data -> iplantcollaborative -> example_data -> Mouse_RNAseq_DESeq2

 

How to Transfer your files to the CyVerse Discovery Environment (DE)

There are multiple ways to transfer your files (either locally from your computer or from the UAGC data storage space). The most common methods are listed below. If you are unfamiliar with the command line, I suggest using Cyberduck. If you are comfortable with the command line, I suggest using iRODS (iCommands). 

  1. Transfer using Cyberduck: https://pods.iplantcollaborative.org/wiki/display/DS/Transfer+data+between+AWS+and+CyVerse
  2. Transfer using iRODS/iCommands: https://pods.iplantcollaborative.org/wiki/display/DS/Using+iCommands 

How to confirm that your files were transferred correctly:

  • Log in to the CyVerse Discovery Environment
  • Click on the Data icon (cloud with up and down arrows) on the left-hand side
  • Check that all of your files have been transferred to the DE

Interpreting your file names and understanding what they mean:

  • Abbreviations: XXX, YYYYYY, and X are placeholders (they indicate that something is there, but the actual value may be different for everyone)
  • Typical format for Illumina data
    • XXX_YYYYYY_L00X_RX_001.fastq.gz (the .gz extension indicates a gzip-compressed file, so you need to gunzip it before proceeding; this is the raw FASTQ file)
    • XXX_YYYYYY_L00X_RX_001.fastq (regular FASTQ file; you can proceed to the FastQC section)
    • XXX = usually a name you provided for your sample (e.g., untreated1)
    • YYYYYY = specific to each sample (the barcode used to label your sample)
    • L00X = the lane your samples were run on
    • RX = X can be 1 or 2, where R1 = read 1 (forward sequencing) and R2 = read 2 (reverse sequencing)
    • .fastq = the file extension of a FASTQ file, the format in which RNA-Seq reads are delivered
  • Processed FASTQ Files
    • Some sequencing facilities perform a simple trim on your fastq files to remove barcodes and adapters from the sequences and to filter out very poor-quality sequences.
    • If your reads have not been trimmed and filtered you can use Trimmomatic to do it.
    • In the case of this data the processed files are either:
      • pair.XXX_YYYYYY_L00X_RX_001.fastq.gz (paired-end)
      • sngl.XXX_YYYYYY_L00X_RX_001.fastq.gz (single-end)
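
If you have many files, it can help to check their names with a small script before uploading. The naming convention above can be written as a regular expression. This is only a sketch under the assumption that your files follow the XXX_YYYYYY_L00X_RX_001 layout exactly; the function name parse_fastq_name and the example barcode ACGTAC are made up for illustration.

```python
import re

# Pattern for the XXX_YYYYYY_L00X_RX_001.fastq[.gz] layout described above.
# An optional "pair." or "sngl." prefix marks processed files.
FASTQ_NAME = re.compile(
    r"^(?:(?P<processed>pair|sngl)\.)?"   # optional processed-file prefix
    r"(?P<sample>.+)_"                    # XXX: sample name you provided
    r"(?P<barcode>[^_]+)_"                # YYYYYY: barcode
    r"L(?P<lane>\d{3})_"                  # L00X: lane
    r"R(?P<read>[12])_"                   # RX: 1 = forward, 2 = reverse
    r"\d{3}"                              # trailing 001 block
    r"\.fastq(?P<gz>\.gz)?$"              # .fastq, optionally gzip-compressed
)


def parse_fastq_name(filename):
    """Return the parts of a FASTQ file name, or None if it does not match."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    return {
        "processed": match.group("processed"),  # "pair", "sngl", or None
        "sample": match.group("sample"),
        "barcode": match.group("barcode"),
        "lane": int(match.group("lane")),
        "read": int(match.group("read")),
        "compressed": match.group("gz") is not None,
    }


# Hypothetical file name, used only to show the call.
print(parse_fastq_name("untreated1_ACGTAC_L001_R1_001.fastq.gz"))
```

A file whose "compressed" value is true still needs to go through the gunzip step.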

Gunzip your Files (Unzip your Files)

If your files are still compressed (meaning they have a .fastq.gz extension), follow the steps below to unzip them. 

Steps:

  1. Click on the Apps Icon
  2. Type "gunzip" into the search bar located near the top
  3. Click on "Uncompress files with gunzip" by Matthew Vaughn

How to input your files and process the gunzip:

  1. Analysis Name: You can change the name of your analysis. I usually keep the name of the app I'm using so I remember what kind of analysis it was
  2. Comments: You can include comments/notes for yourself
  3. Select output folder: You can browse to the folder where you want your results to go once the analysis finishes processing (I usually keep the default)
  4. Retain Inputs? If you want the computer to copy the initial input files into the final analysis result folder, check the box (if you don't want it to copy the files, leave the box unchecked)
  5. Dropdown Menu: Click on the dropdown arrow to move to the next section

Gunzip Inputs:

  1. Click on the "Add" button
  2. A file browser will open, where you can search for the fastq.gz files you want to unzip
  3. Click on "OK" and it will load the files you chose
  4. The "Names" box will list all the files you selected
  5. I keep the default settings. If you want to change them, read the gunzip manual here: Uncompress files with gunzip 1.6-2
  6. Repeat steps 1-4 as necessary until all the files you want are loaded, then click "Launch Analysis" to start the analysis

Gunzip Outputs:

You'll find the unzipped files in your analyses folder. You can move those files to another folder if you choose to. 
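
If you would rather unzip files on your own computer before transferring them, Python's standard gzip module can do the same job. This is a minimal sketch and not what the DE app runs internally. The function name gunzip_file and the keep_input option are made up for illustration (keep_input only loosely mirrors the "Retain Inputs?" checkbox).

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path, keep_input=True):
    """Decompress one .gz file next to the original and return the new path."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"{gz_path} does not have a .gz extension")

    # Dropping the final suffix turns sample.fastq.gz into sample.fastq.
    out_path = gz_path.with_suffix("")

    # Stream the data so large FASTQ files are never held fully in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    if not keep_input:
        gz_path.unlink()  # remove the compressed original
    return out_path
```

For example, gunzip_file("untreated1_ACGTAC_L001_R1_001.fastq.gz") (a hypothetical file name) would write untreated1_ACGTAC_L001_R1_001.fastq in the same folder.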

FastQC

FastQC is a tool to evaluate the quality of your sequencing data. More information on using FastQC.

Steps:

  1. Click on the Apps Icon
  2. Type "fastqc" into the search bar located near the top
  3. Click on "HTProcess-fastqc-0.2" by Roger Barthelson

Input File Types:

  1. Move all your forward .fastq files to one folder for the easiest analysis
  2. Move all your reverse .fastq files to one folder for the easiest analysis
  3. Move all your unpaired .fastq files to one folder for the easiest analysis
  4. Follow through the app directions on how to input the files

Output File Types:

  1. In your analyses folder, click on "HTProcess-fastqc-0.2" 
  2. Click on "individual report"
  3. You will see "XXX_report_html" as your readout file for the quality of your reads

Hisat2

In this tutorial we will be using Hisat2 to index our genome sequence and align our reads to that indexed sequence.

Steps:

  1. Get the latest mouse gtf file from Ensembl or Entrez (depending on your analysis): Ensembl
  2. Download and/or transfer the gtf.gz file to your DE (the example data for this tutorial uses chromosome 3 only)
  3. Click on the Apps icon
  4. Type "hisat2" into the search bar located near the top
  5. Click on "HISAT2-index-align-2.1" by Kapeel Chougule

Input File Types:

  1. Move all your forward and reverse .fastq files to one folder for the easiest analysis
  2. Input your first forward file and your first reverse file. For "Fragment Library", select "forward-reverse" if you put the forward file first and the reverse file second
  3. For "File Type", select PE (paired end) or SE (single end) depending on which files you input
  4. Repeat steps 1-3 for all your samples (e.g., untreated 1, untreated 2, untreated 3, treated 1, treated 2, treated 3). With those six samples, you would go through the steps six times. 
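
Because each sample needs its forward file matched with its reverse file, it is easy to mix them up when there are many samples. The sketch below groups file names into forward/reverse pairs before you enter them in the app. It assumes the _R1_ and _R2_ tokens from the Illumina naming format and that sample names do not themselves contain those tokens; the function name pair_fastq_files and the example file names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the read token in XXX_YYYYYY_L00X_RX_001.fastq names.
READ_TOKEN = re.compile(r"_R([12])_")


def pair_fastq_files(filenames):
    """Group file names into (forward, reverse) pairs keyed by sample prefix.

    Returns (paired, unpaired): a dict of prefix -> (R1 file, R2 file),
    and a sorted list of files whose mate is missing.
    """
    groups = defaultdict(dict)
    for name in sorted(filenames):
        match = READ_TOKEN.search(name)
        if match is None:
            continue  # not an R1/R2 file, ignore it
        prefix = name[: match.start()]  # everything before _R1_ or _R2_
        groups[prefix]["R" + match.group(1)] = name

    paired = {
        prefix: (reads["R1"], reads["R2"])
        for prefix, reads in groups.items()
        if "R1" in reads and "R2" in reads
    }
    unpaired = sorted(
        name
        for reads in groups.values()
        if len(reads) == 1
        for name in reads.values()
    )
    return paired, unpaired


# Hypothetical file names, used only to show the call.
files = [
    "untreated1_ACGTAC_L001_R1_001.fastq",
    "untreated1_ACGTAC_L001_R2_001.fastq",
    "treated1_TTGACC_L001_R1_001.fastq",
]
paired, unpaired = pair_fastq_files(files)
print(paired)
print(unpaired)
```

Each entry in paired corresponds to one pass through the steps above, with the forward file entered first. Anything in unpaired is missing its mate and is worth checking before you launch the analysis.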

Output File Types:

  1. In your analyses folder, click on "HISAT2-index-align-2.1" 
  2. You will see "XXX.paired.bam" files for your proper alignments
  3. You will also see a "XXX.paired.bam.bai" index file corresponding to each BAM file.

HTSeq Count

Once you have aligned your reads you need to determine the number of reads that mapped to each gene/transcript. For this, we will use 'HTseq count'.

Steps:

  1. Click on the Apps icon
  2. Type "htseq" into the search bar located near the top
  3. Click on "HTseq-count-0.6.1" by Upendra Kumar Devisetty

Input File Types:

  1. Move all your .bam files to one folder for the easiest analysis
  2. Select all your .bam files for your input and put your .gtf file as the GFF file
  3. Select:
    1. Input file type: bam
    2. Sorting order of alignment: name
    3. Everything else: as default

Output File Types:

  1. In your analyses folder, click on "HTSeq-count-0.6.1" 
  2. You will see "paired.sorted.XXXX.txt" files for your counts. This will be your count matrix
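
If you want to look at the counts for several samples side by side, the per-sample files can be combined on your own computer. The sketch below assumes the usual htseq-count output layout: one tab-separated line per feature (feature ID, then count), followed by summary lines whose names start with a double underscore, such as __no_feature. The function names are made up for illustration, and the DESeq2 app still takes the individual files as input.

```python
import csv
from pathlib import Path


def read_htseq_counts(path):
    """Read one count file; return (gene counts, summary counters)."""
    counts, summary = {}, {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            feature, value = row[0], int(row[-1])
            if feature.startswith("__"):
                summary[feature] = value  # e.g. __no_feature, __ambiguous
            else:
                counts[feature] = value
    return counts, summary


def build_count_matrix(paths):
    """Combine per-sample files into rows of [gene, count_1, count_2, ...]."""
    per_sample = {Path(p).name: read_htseq_counts(p)[0] for p in paths}
    samples = list(per_sample)
    genes = sorted(set().union(*per_sample.values()))
    # A gene missing from one file is reported as 0 for that sample.
    rows = [[g] + [per_sample[s].get(g, 0) for s in samples] for g in genes]
    return samples, rows
```

Calling build_count_matrix on a list of your paired.sorted.XXXX.txt paths returns the sample names (column order) and one row per gene.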

Example of Count Matrix for UT_top5M_1.sorted

DESeq2

Now that we know the quantity of each transcript in each sample, we need to compare those quantities between samples. For this, we will use DESeq2.

Steps:

  1. Click on the Apps icon
  2. Type "deseq2" into the search bar located near the top
  3. Click on "Deseq2 (multifactorial pairwise comparison)" by Upendra Kumar Devisetty

Input File Types:

  1. Move all your paired.sorted.XXX.txt files to one folder for the easiest analysis
  2. Select all your paired.sorted.XXX.txt files for your input (you cannot run 27 pairwise comparisons at the same time; select smaller sets of samples for your pairwise comparisons)
  3. Create a target file: see the image below
  4. Select:
    1. Reference Biological condition: probably whatever the name is for your untreated/control samples
    2. Everything else: as default

Output File Types:

  1. In your analyses folder, click on "Deseq2 (multifactorial pairwise comparison)" 
  2. You will see "XXX.complete.txt", which has all the genes and your pairwise comparison
  3. You will see "XXX.up.txt", which lists the genes that are upregulated in your pairwise comparison
  4. You will see "XXX.down.txt", which lists the genes that are downregulated in your pairwise comparison
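
To see how the up and down lists relate to the complete table, you can split a results table yourself. The sketch below is only an illustration, not the app's actual rule. It assumes a tab-separated table with columns named log2FoldChange and padj (the names used by the DESeq2 R package) and an adjusted p-value cutoff of 0.05; the app may use different column names or thresholds. The function names are made up for illustration.

```python
import csv
import math


def read_table(path):
    """Read a tab-separated results table into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def split_by_direction(rows, alpha=0.05):
    """Split rows into (up, down) by the sign of log2FoldChange.

    Rows with a missing (NA) value or padj above alpha are left out.
    Column names and the alpha cutoff are assumptions, see above.
    """
    up, down = [], []
    for row in rows:
        try:
            lfc = float(row["log2FoldChange"])
            padj = float(row["padj"])
        except ValueError:
            continue  # "NA" or other non-numeric entries
        if math.isnan(lfc) or math.isnan(padj) or padj > alpha:
            continue
        if lfc > 0:
            up.append(row)
        elif lfc < 0:
            down.append(row)
    return up, down
```

For example, split_by_direction(read_table("XXX.complete.txt")) would return two lists that you can compare against XXX.up.txt and XXX.down.txt, provided the column names match.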