UHTS_Test_RNAseq_2

Overview

User script

  1. Manage Data: Import data files into DE
    1. http://irodsweb.iplantcollaborative.org/collections/g2p_di/Testing/UHTS/RNAseq/RMMT/data/s_8_sequence.txt
  2. Process Sequence Data
    1. Set up single-end processing of s_8_sequence.txt
    2. You don't have barcodes
    3. Do 3' adapter trimming
      1. Adapter sequence is: CTGTAGGCACCATCAATCGTATGCCGTCTTCTGCTTG
      2. Minimum sequence length is 20
      3. Discard sequences with N's
      4. Output both clipped and unclipped sequences
    4. You don't need to Remove Non-Biological Sequences
    5. Check that the option in 'Convert to Sanger Scoring' is set to Illumina
    6. Don't do Quality filtering
    7. Submit processing job and note about how long it takes to complete.
    8. Record the names of the files created in your DE workspace by the processing step.
  3. Align to Genome
    1. Select reference genome 'Arabidopsis thaliana'
    2. Select the s_8_sequence clipped and converted file as your input
    3. Launch job. Try to monitor how long it takes to run. We estimate 3-4 hours.
    4. Record the name of the SAM file that results from this alignment process.
    5. Inspect the SAM file and verify that its provenance tag indicates it was generated by TopHat/BOWTIE.
  4. Measure Transcript Abundance
    1. Select reference genome 'Arabidopsis thaliana'
    2. Select files for input: This should be the 'accepted_hits.sam' SAM file from your previous alignment.
    3. Set Minimum Isoform Fraction to Display to 0.15
    4. Set Expected pre-mRNA fraction to 0.05
    5. Launch job and track how long it takes to run. We estimate 20 minutes.
  5. Validation
    1. Verify presence of files in DE named genes.expr, genes_all.expr, transcripts.expr, and transcripts_all.expr
    2. Inspect genes_all.expr and compare to
      gene_id	bundle_id	chr	left	right	FPKM
      AT1G01010	118298	Chr1	3759	5630	3.41898
      AT1G01020	118300	Chr1	6914	8666	2.84584
      AT1G01030	118300	Chr1	11863	12940	0.780031
      AT1G01040	000000	Chr1	23146	31227	0.0000
      AT1G01046	000000	Chr1	28500	28706	0.0000
      AT1G01050	118314	Chr1	31381	32670	35.9864
      AT1G01060	000000	Chr1	33379	37840	0.0000
      AT1G01070	118326	Chr1	38897	40877	19.4927
      AT1G01073	000000	Chr1	44677	44787	0.0000
      AT1G01080	118330	Chr1	45502	46789	28.0984
      AT1G01090	000000	Chr1	47485	49286	0.0000
      AT1G01100	000000	Chr1	50075	51199	0.0000
      AT1G01110	000000	Chr1	52239	54692	0.0000
      AT1G01115	000000	Chr1	56624	56740	0.0000
      AT1G01120	000000	Chr1	57269	59167	0.0000
      AT1G01130	000000	Chr1	61963	63811	0.0000
      AT1G01140	000000	Chr1	64166	67625	0.0000
      AT1G01150	000000	Chr1	70115	72138	0.0000
      AT1G01160	118334	Chr1	72582	73883	15.3588
      AT1G01170	118334	Chr1	74104	74443	24.1693
      AT1G01180	118336	Chr1	75632	77446	12.0147
      AT1G01183	000000	Chr1	78932	79032	0.0000
      AT1G01190	118344	Chr1	83044	84864	0.783669
      AT1G01200	118346	Chr1	86714	88145	2.64735
      

Background

This is an example of strand-specific RNAseq data from Arabidopsis thaliana. The data set is comprised of 6632564 51 bp reads. These reads were generated using a protocol that may result in a terminal 3' sequence adapter (CTGTAGGCACCATCAATCGTATGCCGTCTTCTGCTTG) that must be removed before genome alignment. The objective of the experiment is to generate a list of every transcript in Arabidopsis along with a quantified measurement of its abundance in the original RNA library submitted into the sequencing process. Input is $BASE/data/s_8_sequence.txt (2.1 GB). Outputs are $BASE/results/tophat_out/accepted_hits.sam, $BASE/results/genes_all.expr, and $BASE/results/transcripts_all.expr

This data set can be found on the iRODS server at bitol.iplantcollaborative.org://iplant/collections/g2p_di/Testing/UHTS/RNAseq/RMMT

The following environment variables are assumed in the scripts provided as examples.
export BASE=/home/plantgroup/vaughn/iPlant/Testing/UHTS/RNAseq/RMMT
export BIN=/home/plantgroup/bin

Versions

[vaughn@blade170 ~]$ bowtie --version
bowtie version 0.12.5
64-bit
Built on sycamore.umiacs.umd.edu
Sat Apr 24 15:56:29 EDT 2010
Compiler: gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
Options: -O3 -m64  -Wl,--hash-style=both
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

[vaughn@blade170 ~]cufflinks --version
cufflinks v0.8.1
...

[vaughn@blade170 ~]$ tophat --version
TopHat v1.0.13

Preprocessing

  1. Trim 3' adapter
  2. Convert explicitly to Sanger (PHRED33) scoring
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -l virtual_free=1.9G
#$ -N process.rmmt
#$ -V

export BASE=/data/plantgroup/vaughn/iPlant/Testing/UHTS/RNAseq/RMMT
export BIN=/home/plantgroup/bin

# Clip off 3' adapter
$BIN/fastx_clipper -a "CTGTAGGCACCATCAATCGTATGCCGTCTTCTGCTTG" -l 20 -n 3 -v -i $
BASE/data/s_8_sequence.txt -o $BASE/data/s_8_sequence.clipper.txt

# Convert to PHRED33 (Sanger) quality space
$BASE/scripts/fq_all2std.pl sol2std2 $BASE/data/s_8_sequence.clipper.txt > $BASE
/data/s_8_sequence.clipper.sanger.txt

TopHat Alignment

Script in $BASE/scripts/run_tophat.sh
Logs in $BASE/logs/tophat*

Runtime

~4h 43m for 6.6 million reads running on 2 cores in same processing node
Wed Jun 2 19:11:59 EDT 2010: Start
Wed Jun 2 23:54:08 EDT 2010: End

Cufflinks Analysis

Script in $BASE/scripts/run_cufflinks.sh
Logs in $BASE/logs/tophat*

Runtime

~6m for 972MB SAM file running on 2 cores in same processing node
Thu Jun 3 06:09:59 EDT 2010
Thu Jun 3 06:15:52 EDT 2010

Results

Final Listing of Results directory

[vaughn@blade170 RMMT]$ ls -alth $BASE/results
total 1.1G
-rw-r--r-- 1 vaughn plantgroup 3.1M 2010-06-03 06:14 transcripts_all.expr
-rw-r--r-- 1 vaughn plantgroup 1.3M 2010-06-03 06:14 genes_all.expr
-rw-r--r-- 1 vaughn plantgroup  25M 2010-06-03 06:14 transcripts.gtf
-rw-r--r-- 1 vaughn plantgroup 802K 2010-06-03 06:14 genes.expr
-rw-r--r-- 1 vaughn plantgroup 1.5M 2010-06-03 06:14 transcripts.expr
drwxr-xr-x 3 vaughn plantgroup 2.0K 2010-06-03 05:56 .
drwxr-xr-x 6 vaughn plantgroup 2.0K 2010-06-03 05:45 ..
-rw-r--r-- 1 vaughn plantgroup    0 2010-06-03 05:24 cufflinks.log
-rw-r--r-- 1 vaughn plantgroup  78M 2010-06-02 23:52 coverage.wig
-rw-r--r-- 1 vaughn plantgroup 927M 2010-06-02 23:50 accepted_hits.sam
-rw-r--r-- 1 vaughn plantgroup 725K 2010-06-02 23:43 junctions.bed
drwxr-xr-x 2 vaughn plantgroup 2.0K 2010-06-02 21:39 logs
-rw-r--r-- 1 vaughn plantgroup 3.7M 2010-06-02 19:13 annotation.juncs
-rw-r--r-- 1 vaughn plantgroup    0 2010-06-02 19:10 tophat.log

Metrics