UHTS_Test_RNAseq_1

Overview

User script

  1. Manage Data: Import two data files by URL from SRA into your workspace
    1. ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX018/SRX018907
    2. Inspect the two files after input to verify that they were uncompressed
  2. Process Sequence Data
    1. You need to do this for each of the two FASTQ files
    2. You don't have barcodes
    3. You don't need to do adapter clipping
    4. You don't need to remove Non-Biological Sequences
    5. DO check that the option in 'Convert to Sanger Scoring' is set to Sanger PHRED33 (SRA)
    6. Don't do Quality filtering
    7. Submit processing job and note about how long it takes to complete.
    8. Record the names of the files in your DE workspace by these processing steps.
  3. Align to Genome
    1. NOTE: You will perform two alignments, one on each FASTQ file from the 'Process Sequence Data' steps
    2. Select reference genome 'Zea mays'
    3. Select one of the clipped and converted FASTQ files as your input
    4. Launch job.
    5. Repeat this with the other FASTQ file.
    6. Monitor how long it takes to complete both alignments (they should be running in parallel now)
    7. Record the name of the SAM files that result each alignment
    8. Inspect each SAM file and verify that its provenance tag indicates it was generated by TopHat/BOWTIE.
  4. Measure Transcript Abundance
    1. Select reference genome 'Zea mays'
    2. Select TWO SAM files for input: This should be the two 'accepted_hits.sam' files from your alignment process.
    3. Set Minimum Isoform Fraction to Display to 0.15
    4. Set Expected pre-mRNA fraction to 0.05
    5. Launch job and track how long it takes to run. We estimate 40 minutes.
  5. Validation
    1. Verify presence of files in DE named genes.expr, genes_all.expr, transcripts.expr, and transcripts_all.expr
    2. Inspect genes_all.expr and compare to
      gene_id	bundle_id	chr	left	right	FPKM
      GRMZM2G000014	000000	chr4	210486437	210490902	0.0000
      GRMZM2G000035	000000	chr8	8673735	8674470	0.0000
      GRMZM2G000039	000000	chr2	8966185	8975382	0.0000
      GRMZM2G000042	000000	chr2	8966202	8967590	0.0000
      GRMZM2G000044	000000	chr2	8964709	8966031	0.0000
      GRMZM2G000047	000000	chr2	8956399	8959233	0.0000
      GRMZM2G000052	000000	chr2	8944359	8949614	0.0000
      GRMZM2G000071	000000	chr5	140167310	140168829	0.0000
      GRMZM2G000076	000000	chr3	97024467	97028155	0.0000
      
    3. Inspect transcripts_all.expr and compare to
      trans_id	bundle_id	chr	left	right	FPKM	FMI	frac	
      FPKM_conf_lo	FPKM_conf_hi	coverage	length
      GRMZM2G000014_T01	000000	chr4	210486437	210490902	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
      GRMZM2G000014_T02	000000	chr4	210486437	210490902	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
      GRMZM2G000035_T01	000000	chr8	8673735	8674470	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
      GRMZM2G000039_T01	000000	chr2	8966185	8975382	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
      GRMZM2G000042_T01	000000	chr2	8966202	8967590	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
      GRMZM2G000044_T01	000000	chr2	8964709	8966031	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
      GRMZM2G000044_T02	000000	chr2	8964709	8966031	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
      GRMZM2G000047_T01	000000	chr2	8956399	8959233	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
      GRMZM2G000052_T01	000000	chr2	8944359	8949614	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
      

Background

SRX018907: aRNA-Seq experiment using bundle-sheath cells from maize leaf. 19,869,189 36bp reads from maize B73 accession. Objective of the experiment is to detect and quantify expression levels at the gene and transcript levels. Inputs are $BASE/data/SRR039512.fastq and $BASE/data/SRR039514.fastq. Outputs are $BASE/results/transcripts_all.expr, $BASE/results/genes_all.expr and $BASE/results/accepted_hits.sam. This data set can be found on the iRODS server at bitol.iplantcollaborative.org://iplant/collections/g2p_di/Testing/UHTS/RNAseq/SRX018907

Data Intake

  1. Search NCBI SRA (Sequence Read Archive) for SRX018907.
  2. Follow "Download data for this experiment SRX018907", which leads to an FTP directory ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX018/SRX018907
    Index of /sra/static/SRX018/SRX018907
    Name	Size	Date Modified
    [parent directory]		
    SRR039512.fastq.gz	368.7 MB	5/1/10 7:38:00 PM
    SRR039514.fastq.gz	438 MB	5/1/10 7:40:00 PM
    

Downloaded and unpacked gzipped fastq files.

wget "ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX018/SRX018907/SRR039512.fastq.gz"
wget "ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX018/SRX018907/SRR039514.fastq.gz"
gunzip SRR039512.fastq.gz
gunzip SRR039514.fastq.gz

The resulting FASTQ files have already been converted to Sanger PHRED33 scaling. There is no processing required. This knowledge is implicit if the user is familiar with NCBI SRA data.

export BASE=/data/plantgroup/vaughn/iPlant/Testing/UHTS/RNAseq/SRX018907
export BIN=/home/plantgroup/bin

TopHat

Scripts for running these alignments are in $BASE/scripts/run_tophat_1.sh and $BASE/scripts/run_tophat_2.sh

Note that outputs from this stage are in $BASE/results/1 and $BASE/results/2, respectively. This is because TopHat always generates a file named 'accepted_hits.sam' and we'd have a namespace collision if we dumped both alignment runs into $BASE/results.

Run time: 1h 37m (2 threads)

Thu Jun 3 16:34:00 EDT 2010: Start TopHat
Thu Jun 3 18:11:34 EDT 2010: End TopHat

Script for running this step alignments is $BASE/scripts/run_cufflinks.sh

Important features of this script:

  • Join and sort multiple SAM alignments (I cheat and assume a fixed-length header. A better approach would be to filter out lines beginning with @, the SAM comment character).
  • Post-processing .expr files to add in zero records involves a sort stage. This was not described in the original JIRA ticket explaining this step.

Run time: 38m (2 threads)

Fri Jun 4 08:06:00 EDT 2010
Fri Jun 4 08:44:04 EDT 2010

Results

Listing of results directory

[vaughn@blade170 SRX018907]$ ls -alth results
total 6.2G
drwxr-xr-x 4 vaughn plantgroup 2.0K 2010-06-04 08:47 .
-rw-r--r-- 1 vaughn plantgroup 4.8M 2010-06-04 08:42 transcripts_all.expr
-rw-r--r-- 1 vaughn plantgroup 1.6M 2010-06-04 08:42 genes_all.expr
-rw-r--r-- 1 vaughn plantgroup 108K 2010-06-04 08:42 transcripts.expr
-rw-r--r-- 1 vaughn plantgroup  62K 2010-06-04 08:42 genes.expr
-rw-r--r-- 1 vaughn plantgroup 1.8M 2010-06-04 08:42 transcripts.gtf
-rw-r--r-- 1 vaughn plantgroup    0 2010-06-04 07:58 cufflinks.log
-rw-r--r-- 1 vaughn plantgroup 6.2G 2010-06-04 07:53 accepted_hits.sam
drwxr-xr-x 6 vaughn plantgroup 2.0K 2010-06-04 06:22 ..
drwxr-xr-x 3 vaughn plantgroup 2.0K 2010-06-03 18:10 2
drwxr-xr-x 3 vaughn plantgroup 2.0K 2010-06-03 17:58 1
-rw-r--r-- 1 vaughn plantgroup    0 2010-06-03 16:24 tophat.log

Ten lines of *.expr files

[vaughn@blade170 SRX018907]$ head -n 10 results/transcripts_all.expr 
trans_id	bundle_id	chr	left	right	FPKM	FMI	frac	FPKM_conf_lo	FPKM_conf_hi	coverage	length
GRMZM2G000014_T01	000000	chr4	210486437	210490902	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
GRMZM2G000014_T02	000000	chr4	210486437	210490902	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
GRMZM2G000035_T01	000000	chr8	8673735	8674470	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
GRMZM2G000039_T01	000000	chr2	8966185	8975382	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
GRMZM2G000042_T01	000000	chr2	8966202	8967590	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
GRMZM2G000044_T01	000000	chr2	8964709	8966031	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
GRMZM2G000044_T02	000000	chr2	8964709	8966031	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
GRMZM2G000047_T01	000000	chr2	8956399	8959233	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000
GRMZM2G000052_T01	000000	chr2	8944359	8949614	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	00000

[vaughn@blade170 SRX018907]$ head -n 10 results/genes_all.expr 
gene_id	bundle_id	chr	left	right	FPKM
GRMZM2G000014	000000	chr4	210486437	210490902	0.0000
GRMZM2G000035	000000	chr8	8673735	8674470	0.0000
GRMZM2G000039	000000	chr2	8966185	8975382	0.0000
GRMZM2G000042	000000	chr2	8966202	8967590	0.0000
GRMZM2G000044	000000	chr2	8964709	8966031	0.0000
GRMZM2G000047	000000	chr2	8956399	8959233	0.0000
GRMZM2G000052	000000	chr2	8944359	8949614	0.0000
GRMZM2G000071	000000	chr5	140167310	140168829	0.0000
GRMZM2G000076	000000	chr3	97024467	97028155	0.0000