UHTS_Test_RNAseq_2
Overview
User script
- Manage Data: Import data files into DE
- Process Sequence Data
- Set up single-end processing of s_8_sequence.txt
- You don't have barcodes
- Do 3' adapter trimming
- Adapter sequence is: CTGTAGGCACCATCAATCGTATGCCGTCTTCTGCTTG
- Minimum sequence length is 20
- Discard sequences with N's
- Output both clipped and unclipped sequences
- You don't need to Remove Non-Biological Sequences
- Check that the option in 'Convert to Sanger Scoring' is set to Illumina
- Don't do Quality filtering
- Submit processing job and note about how long it takes to complete.
- Record the names of the files created in your DE workspace by the processing step.
- Align to Genome
- Select reference genome 'Arabidopsis thaliana'
- Select the s_8_sequence clipped and converted file as your input
- Launch job. Try to monitor how long it takes to run. We estimate 3-4 hours.
- Record the name of the SAM file that results from this alignment process.
- Inspect the SAM file and verify that its provenance tag indicates it was generated by TopHat/BOWTIE.
- Measure Transcript Abundance
- Select reference genome 'Arabidopsis thaliana'
- Select files for input: This should be the 'accepted_hits.sam' SAM file from your previous alignment.
- Set Minimum Isoform Fraction to Display to 0.15
- Set Expected pre-mRNA fraction to 0.05
- Launch job and track how long it takes to run. We estimate 20 minutes.
- Validation
- Verify presence of files in DE named genes.expr, genes_all.expr, transcripts.expr, and transcripts_all.expr
- Inspect genes_all.expr and compare to
gene_id bundle_id chr left right FPKM AT1G01010 118298 Chr1 3759 5630 3.41898 AT1G01020 118300 Chr1 6914 8666 2.84584 AT1G01030 118300 Chr1 11863 12940 0.780031 AT1G01040 000000 Chr1 23146 31227 0.0000 AT1G01046 000000 Chr1 28500 28706 0.0000 AT1G01050 118314 Chr1 31381 32670 35.9864 AT1G01060 000000 Chr1 33379 37840 0.0000 AT1G01070 118326 Chr1 38897 40877 19.4927 AT1G01073 000000 Chr1 44677 44787 0.0000 AT1G01080 118330 Chr1 45502 46789 28.0984 AT1G01090 000000 Chr1 47485 49286 0.0000 AT1G01100 000000 Chr1 50075 51199 0.0000 AT1G01110 000000 Chr1 52239 54692 0.0000 AT1G01115 000000 Chr1 56624 56740 0.0000 AT1G01120 000000 Chr1 57269 59167 0.0000 AT1G01130 000000 Chr1 61963 63811 0.0000 AT1G01140 000000 Chr1 64166 67625 0.0000 AT1G01150 000000 Chr1 70115 72138 0.0000 AT1G01160 118334 Chr1 72582 73883 15.3588 AT1G01170 118334 Chr1 74104 74443 24.1693 AT1G01180 118336 Chr1 75632 77446 12.0147 AT1G01183 000000 Chr1 78932 79032 0.0000 AT1G01190 118344 Chr1 83044 84864 0.783669 AT1G01200 118346 Chr1 86714 88145 2.64735
Background
This is an example of strand-specific RNAseq data from Arabidopsis thaliana. The data set is comprised of 6632564 51 bp reads. These reads were generated using a protocol that may result in a terminal 3' sequence adapter (CTGTAGGCACCATCAATCGTATGCCGTCTTCTGCTTG) that must be removed before genome alignment. The objective of the experiment is to generate a list of every transcript in Arabidopsis along with a quantified measurement of its abundance in the original RNA library submitted into the sequencing process. Input is $BASE/data/s_8_sequence.txt (2.1 GB). Outputs are $BASE/results/tophat_out/accepted_hits.sam, $BASE/results/genes_all.expr, and $BASE/results/transcripts_all.expr
This data set can be found on the iRODS server at bitol.iplantcollaborative.org://iplant/collections/g2p_di/Testing/UHTS/RNAseq/RMMT
The following environment variables are assumed in the scripts provided as examples.
export BASE=/home/plantgroup/vaughn/iPlant/Testing/UHTS/RNAseq/RMMT
export BIN=/home/plantgroup/bin
Versions
[vaughn@blade170 ~]$ bowtie --version bowtie version 0.12.5 64-bit Built on sycamore.umiacs.umd.edu Sat Apr 24 15:56:29 EDT 2010 Compiler: gcc version 4.1.2 20080704 (Red Hat 4.1.2-46) Options: -O3 -m64 -Wl,--hash-style=both Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8} [vaughn@blade170 ~]cufflinks --version cufflinks v0.8.1 ... [vaughn@blade170 ~]$ tophat --version TopHat v1.0.13
Preprocessing
- Trim 3' adapter
- Convert explicitly to Sanger (PHRED33) scoring
#!/bin/bash #$ -S /bin/bash #$ -cwd #$ -l virtual_free=1.9G #$ -N process.rmmt #$ -V export BASE=/data/plantgroup/vaughn/iPlant/Testing/UHTS/RNAseq/RMMT export BIN=/home/plantgroup/bin # Clip off 3' adapter $BIN/fastx_clipper -a "CTGTAGGCACCATCAATCGTATGCCGTCTTCTGCTTG" -l 20 -n 3 -v -i $ BASE/data/s_8_sequence.txt -o $BASE/data/s_8_sequence.clipper.txt # Convert to PHRED33 (Sanger) quality space $BASE/scripts/fq_all2std.pl sol2std2 $BASE/data/s_8_sequence.clipper.txt > $BASE /data/s_8_sequence.clipper.sanger.txt
TopHat Alignment
Script in $BASE/scripts/run_tophat.sh
Logs in $BASE/logs/tophat*
Runtime
~4h 43m for 6.6 million reads running on 2 cores in same processing node
Wed Jun 2 19:11:59 EDT 2010: Start
Wed Jun 2 23:54:08 EDT 2010: End
Cufflinks Analysis
Script in $BASE/scripts/run_cufflinks.sh
Logs in $BASE/logs/tophat*
Runtime
~6m for 972MB SAM file running on 2 cores in same processing node
Thu Jun 3 06:09:59 EDT 2010
Thu Jun 3 06:15:52 EDT 2010
Results
Final Listing of Results directory
[vaughn@blade170 RMMT]$ ls -alth $BASE/results total 1.1G -rw-r--r-- 1 vaughn plantgroup 3.1M 2010-06-03 06:14 transcripts_all.expr -rw-r--r-- 1 vaughn plantgroup 1.3M 2010-06-03 06:14 genes_all.expr -rw-r--r-- 1 vaughn plantgroup 25M 2010-06-03 06:14 transcripts.gtf -rw-r--r-- 1 vaughn plantgroup 802K 2010-06-03 06:14 genes.expr -rw-r--r-- 1 vaughn plantgroup 1.5M 2010-06-03 06:14 transcripts.expr drwxr-xr-x 3 vaughn plantgroup 2.0K 2010-06-03 05:56 . drwxr-xr-x 6 vaughn plantgroup 2.0K 2010-06-03 05:45 .. -rw-r--r-- 1 vaughn plantgroup 0 2010-06-03 05:24 cufflinks.log -rw-r--r-- 1 vaughn plantgroup 78M 2010-06-02 23:52 coverage.wig -rw-r--r-- 1 vaughn plantgroup 927M 2010-06-02 23:50 accepted_hits.sam -rw-r--r-- 1 vaughn plantgroup 725K 2010-06-02 23:43 junctions.bed drwxr-xr-x 2 vaughn plantgroup 2.0K 2010-06-02 21:39 logs -rw-r--r-- 1 vaughn plantgroup 3.7M 2010-06-02 19:13 annotation.juncs -rw-r--r-- 1 vaughn plantgroup 0 2010-06-02 19:10 tophat.log