UHTS_Test_RNAseq_1
Overview
User script
- Manage Data: Import two data files by URL from SRA into your workspace
- ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX018/SRX018907
- Inspect the two files after input to verify that they were uncompressed
- Process Sequence Data
- You need to do this for each of the two FASTQ files
- You don't have barcodes
- You don't need to do adapter clipping
- You don't need to remove Non-Biological Sequences
- DO check that the option in 'Convert to Sanger Scoring' is set to Sanger PHRED33 (SRA)
- Don't do Quality filtering
- Submit processing job and note about how long it takes to complete.
- Record the names of the files in your DE workspace by these processing steps.
- Align to Genome
- NOTE: You will perform two alignments, one on each FASTQ file from the 'Process Sequence Data' steps
- Select reference genome 'Zea mays'
- Select one of the clipped and converted FASTQ files as your input
- Launch job.
- Repeat this with the other FASTQ file.
- Monitor how long it takes to complete both alignments (they should be running in parallel now)
- Record the name of the SAM files that result each alignment
- Inspect each SAM file and verify that its provenance tag indicates it was generated by TopHat/BOWTIE.
- Measure Transcript Abundance
- Select reference genome 'Zea mays'
- Select TWO SAM files for input: This should be the two 'accepted_hits.sam' files from your alignment process.
- Set Minimum Isoform Fraction to Display to 0.15
- Set Expected pre-mRNA fraction to 0.05
- Launch job and track how long it takes to run. We estimate 40 minutes.
- Validation
- Verify presence of files in DE named genes.expr, genes_all.expr, transcripts.expr, and transcripts_all.expr
- Inspect genes_all.expr and compare to
gene_id bundle_id chr left right FPKM GRMZM2G000014 000000 chr4 210486437 210490902 0.0000 GRMZM2G000035 000000 chr8 8673735 8674470 0.0000 GRMZM2G000039 000000 chr2 8966185 8975382 0.0000 GRMZM2G000042 000000 chr2 8966202 8967590 0.0000 GRMZM2G000044 000000 chr2 8964709 8966031 0.0000 GRMZM2G000047 000000 chr2 8956399 8959233 0.0000 GRMZM2G000052 000000 chr2 8944359 8949614 0.0000 GRMZM2G000071 000000 chr5 140167310 140168829 0.0000 GRMZM2G000076 000000 chr3 97024467 97028155 0.0000
- Inspect transcripts_all.expr and compare to
trans_id bundle_id chr left right FPKM FMI frac FPKM_conf_lo FPKM_conf_hi coverage length GRMZM2G000014_T01 000000 chr4 210486437 210490902 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000014_T02 000000 chr4 210486437 210490902 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000035_T01 000000 chr8 8673735 8674470 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000039_T01 000000 chr2 8966185 8975382 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000042_T01 000000 chr2 8966202 8967590 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000044_T01 000000 chr2 8964709 8966031 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000044_T02 000000 chr2 8964709 8966031 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000047_T01 000000 chr2 8956399 8959233 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000052_T01 000000 chr2 8944359 8949614 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000
Background
SRX018907: aRNA-Seq experiment using bundle-sheath cells from maize leaf. 19,869,189 36bp reads from maize B73 accession. Objective of the experiment is to detect and quantify expression levels at the gene and transcript levels. Inputs are $BASE/data/SRR039512.fastq and $BASE/data/SRR039514.fastq. Outputs are $BASE/results/transcripts_all.expr, $BASE/results/genes_all.expr and $BASE/results/accepted_hits.sam. This data set can be found on the iRODS server at bitol.iplantcollaborative.org://iplant/collections/g2p_di/Testing/UHTS/RNAseq/SRX018907
Data Intake
- Search NCBI SRA (Sequence Read Archive) for SRX018907.
- Follow "Download data for this experiment SRX018907", which leads to an FTP directory
ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX018/SRX018907
Index of /sra/static/SRX018/SRX018907 Name Size Date Modified [parent directory] SRR039512.fastq.gz 368.7 MB 5/1/10 7:38:00 PM SRR039514.fastq.gz 438 MB 5/1/10 7:40:00 PM
Downloaded and unpacked gzipped fastq files.
wget "ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX018/SRX018907/SRR039512.fastq.gz" wget "ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX018/SRX018907/SRR039514.fastq.gz" gunzip SRR039512.fastq.gz gunzip SRR039514.fastq.gz
The resulting FASTQ files have already been converted to Sanger PHRED33 scaling. There is no processing required. This knowledge is implicit if the user is familiar with NCBI SRA data.
export BASE=/data/plantgroup/vaughn/iPlant/Testing/UHTS/RNAseq/SRX018907
export BIN=/home/plantgroup/bin
TopHat
Scripts for running these alignments are in $BASE/scripts/run_tophat_1.sh and $BASE/scripts/run_tophat_2.sh
Note that outputs from this stage are in $BASE/results/1 and $BASE/results/2, respectively. This is because TopHat always generates a file named 'accepted_hits.sam' and we'd have a namespace collision if we dumped both alignment runs into $BASE/results.
Run time: 1h 37m (2 threads)
Thu Jun 3 16:34:00 EDT 2010: Start TopHat
Thu Jun 3 18:11:34 EDT 2010: End TopHat
Cufflinks
Script for running this step alignments is $BASE/scripts/run_cufflinks.sh
Important features of this script:
- Join and sort multiple SAM alignments (I cheat and assume a fixed-length header. A better approach would be to filter out lines beginning with @, the SAM comment character).
- Post-processing .expr files to add in zero records involves a sort stage. This was not described in the original JIRA ticket explaining this step.
Run time: 38m (2 threads)
Fri Jun 4 08:06:00 EDT 2010
Fri Jun 4 08:44:04 EDT 2010
Results
Listing of results directory
[vaughn@blade170 SRX018907]$ ls -alth results total 6.2G drwxr-xr-x 4 vaughn plantgroup 2.0K 2010-06-04 08:47 . -rw-r--r-- 1 vaughn plantgroup 4.8M 2010-06-04 08:42 transcripts_all.expr -rw-r--r-- 1 vaughn plantgroup 1.6M 2010-06-04 08:42 genes_all.expr -rw-r--r-- 1 vaughn plantgroup 108K 2010-06-04 08:42 transcripts.expr -rw-r--r-- 1 vaughn plantgroup 62K 2010-06-04 08:42 genes.expr -rw-r--r-- 1 vaughn plantgroup 1.8M 2010-06-04 08:42 transcripts.gtf -rw-r--r-- 1 vaughn plantgroup 0 2010-06-04 07:58 cufflinks.log -rw-r--r-- 1 vaughn plantgroup 6.2G 2010-06-04 07:53 accepted_hits.sam drwxr-xr-x 6 vaughn plantgroup 2.0K 2010-06-04 06:22 .. drwxr-xr-x 3 vaughn plantgroup 2.0K 2010-06-03 18:10 2 drwxr-xr-x 3 vaughn plantgroup 2.0K 2010-06-03 17:58 1 -rw-r--r-- 1 vaughn plantgroup 0 2010-06-03 16:24 tophat.log
Ten lines of *.expr files
[vaughn@blade170 SRX018907]$ head -n 10 results/transcripts_all.expr trans_id bundle_id chr left right FPKM FMI frac FPKM_conf_lo FPKM_conf_hi coverage length GRMZM2G000014_T01 000000 chr4 210486437 210490902 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000014_T02 000000 chr4 210486437 210490902 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000035_T01 000000 chr8 8673735 8674470 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000039_T01 000000 chr2 8966185 8975382 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000042_T01 000000 chr2 8966202 8967590 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000044_T01 000000 chr2 8964709 8966031 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000044_T02 000000 chr2 8964709 8966031 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000047_T01 000000 chr2 8956399 8959233 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 GRMZM2G000052_T01 000000 chr2 8944359 8949614 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 [vaughn@blade170 SRX018907]$ head -n 10 results/genes_all.expr gene_id bundle_id chr left right FPKM GRMZM2G000014 000000 chr4 210486437 210490902 0.0000 GRMZM2G000035 000000 chr8 8673735 8674470 0.0000 GRMZM2G000039 000000 chr2 8966185 8975382 0.0000 GRMZM2G000042 000000 chr2 8966202 8967590 0.0000 GRMZM2G000044 000000 chr2 8964709 8966031 0.0000 GRMZM2G000047 000000 chr2 8956399 8959233 0.0000 GRMZM2G000052 000000 chr2 8944359 8949614 0.0000 GRMZM2G000071 000000 chr5 140167310 140168829 0.0000 GRMZM2G000076 000000 chr3 97024467 97028155 0.0000