UHTS_Test_VarDetect_1

Overview

User script

  1. Manage Data: Import data files into DE
    1. Here, we're testing import from an Amazon S3 bucket. Now this is pod racing!
    2. http://irods-collections.s3.amazonaws.com/g2p_di/Testing/UHTS/VariantDetection/Ler/data/landsberg.fastq
  2. Process Sequence Data
    1. We do NO processing on this data
  3. Align to Genome
    1. Select reference genome 'Arabidopsis thaliana'
    2. Select the landsberg.fastq file as input
    3. Launch job. Try to monitor how long it takes to run. We estimate 1-2 hours.
  4. Find SNPS
    1. Select reference genome 'Arabidopsis thaliana'
    2. Select files for input: This should be the SAM file from your previous alignment.
    3. In base calling, set haplotypes to 1 since Arabidopsis thaliana is generally monoallelic. Set Probability of Indel to 40.
    4. All settings in Filtering are defaults, but set name of sample to 'Ler'
    5. Launch job and track how long it takes to run. We estimate 20 minutes.
  5. Validation
    1. Inspect the SAM file that came from your Alignment process
    2. Inspect the VCF file that results from your SNP-finding endeavour. Compare to this reference VCF for common SNPs, header formatting, etc.
      ##fileformat=VCFv3.3
      ##INFO=DP,1,Integer,"Total Depth"
      ##FORMAT=GT,1,String,"Genotype"
      ##FORMAT=GQ,1,Integer,"Genotype Quality"
      ##FORMAT=DP,1,Integer,"Read Depth"
      #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Landsberg
      Chr1	711	.	T	C	126	0	DP=4	GT:GQ:DP	1/1:126:4
      Chr1	956	.	C	T	126	0	DP=4	GT:GQ:DP	1/1:126:4
      Chr1	6324	.	T	IA	79	0	DP=3	GT:GQ:DP	1/1:47:3
      Chr1	10904	.	A	T	169	0	DP=6	GT:GQ:DP	1/1:169:6
      Chr1	14534	.	G	D1	39	0	DP=4	GT:GQ:DP	1/1:10:4
      Chr1	37388	.	G	T	84	0	DP=3	GT:GQ:DP	1/1:84:3
      Chr1	37768	.	A	IG	24	0	DP=6	GT:GQ:DP	0/1:24:6
      Chr1	41749	.	G	D2	29	0	DP=3	GT:GQ:DP	0/1:29:3
      Chr1	44685	.	C	T	24	0	DP=4	GT:GQ:DP	1/1:24:4
      Chr1	71326	.	G	T	235	0	DP=12	GT:GQ:DP	1/1:235:12
      Chr1	71348	.	C	T	140	0	DP=7	GT:GQ:DP	1/1:140:7
      Chr1	80288	.	T	IC	79	0	DP=3	GT:GQ:DP	1/1:47:3
      Chr1	87746	.	C	IA	39	0	DP=3	GT:GQ:DP	1/1:47:3
      Chr1	90334	.	A	D1	40	0	DP=14	GT:GQ:DP	0/1:40:14
      Chr1	90378	.	G	IA	27	0	DP=5	GT:GQ:DP	0/1:27:5
      Chr1	91199	.	G	A	33	0	DP=4	GT:GQ:DP	1/1:33:4
      Chr1	96770	.	C	A	93	0	DP=3	GT:GQ:DP	1/1:93:3
      Chr1	97473	.	C	T	173	0	DP=6	GT:GQ:DP	1/1:173:6
      Chr1	100305	.	G	T	32	0	DP=3	GT:GQ:DP	1/1:32:3
      

Background

This data set is comprised of 30344190 32 bp reads from the Arabidopsis thaliana Landsberg erecta land race, which is estimated to be ~5% divergent from the reference Columbia land race. The objective of the analysis is to determine the entire table of high-confidence SNPs.

This test set starts with the file data/landsberg.fastq and the genome directory data/arabidopsis_thaliana/v9 and emits results/landsberg.v9.sam and results/landsberg.v9.vcf as end-user deliverables. From start to finish it required ~115 minutes to complete.

This data set can be found on the iRODS server at bitol.iplantcollaborative.org://iplant/collections/g2p_di/Testing/UHTS/VariantDetection/Ler

The following environment variables are assumed in the scripts provided as examples.
export BASE=/home/plantgroup/vaughn/iPlant/Testing/UHTS/VariantDetection/Ler
export BIN=/home/plantgroup/bin

Versions

[vaughn@blade173] bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.5.7 (r1310)

[vaughn@blade173 src]$ samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.7-6 (r530)

Indexing genome sequence

The Engagement Team are providing indexed genomes for maize and Arabidopsis but if you'd care to test that you've got indexing working, here are the expected outputs.

cd $HOME/data/arabidopsis_thaliana/v8
bwa index genome.fas

[bwa_index] Pack FASTA... 2.05 sec
[bwa_index] Reverse the packed sequence... 0.63 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 69.32 seconds elapse.
[bwa_index] Construct BWT for the reverse packed sequence...
[bwa_index] 68.97 seconds elapse.
[bwa_index] Update BWT... 0.66 sec
[bwa_index] Update reverse BWT... 0.67 sec
[bwa_index] Construct SA from BWT and Occ... 20.89 sec
[bwa_index] Construct SA from reverse BWT and Occ... 20.72 sec

ls -alth
total 288M
-rw-r--r-- 1 vaughn plantgroup  15M 2010-05-28 16:11 genome.fas.rsa
drwxr-xr-x 2 vaughn plantgroup 2.0K 2010-05-28 16:11 .
-rw-r--r-- 1 vaughn plantgroup  15M 2010-05-28 16:10 genome.fas.sa
-rw-r--r-- 1 vaughn plantgroup  43M 2010-05-28 16:10 genome.fas.rbwt
-rw-r--r-- 1 vaughn plantgroup  43M 2010-05-28 16:10 genome.fas.bwt
-rw-r--r-- 1 vaughn plantgroup  29M 2010-05-28 16:08 genome.fas.rpac
-rw-r--r-- 1 vaughn plantgroup 8.1K 2010-05-28 16:08 genome.fas.amb
-rw-r--r-- 1 vaughn plantgroup  249 2010-05-28 16:08 genome.fas.ann
-rw-r--r-- 1 vaughn plantgroup  29M 2010-05-28 16:08 genome.fas.pac
-rw-r--r-- 1 vaughn plantgroup 117M 2010-05-28 15:59 genome.fas
cd $HOME/data/arabidopsis_thaliana/v9
bwa index genome.fas

[bwa_index] Pack FASTA... 2.04 sec
[bwa_index] Reverse the packed sequence... 0.63 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 69.08 seconds elapse.
[bwa_index] Construct BWT for the reverse packed sequence...
[bwa_index] 69.06 seconds elapse.
[bwa_index] Update BWT... 0.65 sec
[bwa_index] Update reverse BWT... 0.66 sec
[bwa_index] Construct SA from BWT and Occ... 21.00 sec
[bwa_index] Construct SA from reverse BWT and Occ... 21.07 sec

ls -alth
total 287M
-rw-r--r-- 1 vaughn plantgroup  15M 2010-05-28 16:15 genome.fas.rsa
drwxr-xr-x 2 vaughn plantgroup 2.0K 2010-05-28 16:15 .
-rw-r--r-- 1 vaughn plantgroup  15M 2010-05-28 16:15 genome.fas.sa
-rw-r--r-- 1 vaughn plantgroup  43M 2010-05-28 16:14 genome.fas.rbwt
-rw-r--r-- 1 vaughn plantgroup  43M 2010-05-28 16:14 genome.fas.bwt
-rw-r--r-- 1 vaughn plantgroup  29M 2010-05-28 16:12 genome.fas.rpac
-rw-r--r-- 1 vaughn plantgroup 7.4K 2010-05-28 16:12 genome.fas.amb
-rw-r--r-- 1 vaughn plantgroup  690 2010-05-28 16:12 genome.fas.ann
-rw-r--r-- 1 vaughn plantgroup  29M 2010-05-28 16:12 genome.fas.pac
drwxr-xr-x 4 vaughn plantgroup 2.0K 2010-05-28 15:58 ..
-rw-r--r-- 1 vaughn plantgroup 116M 2010-05-28 15:58 genome.fas

Running the BWA alignment

The following is explained as an SGE script for launching an alignment job. Commands for alignment and SAI->SAM conversion are indicated. This isn't very fast since it is a single threaded alignment. See below for actual benchmarks.

run_bwa_align.sh
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -l virtual_free=3.9G

# SGE script to launch single-thread BWA align then convert output to SAM

date
export BASE=/home/plantgroup/vaughn/iPlant/Test/UHTS/VariantDetection/Ler
export BIN=/home/plantgroup/bin

# Align using default BWA params. There are other params specified in the BWA
# man page but they invoke changes in BWAs actual behavior (read trimming, 
# suboptimal alignments, etc)
$BIN/bwa aln -n 0.04 -o 1 -e -1 -d 16 -i 5 -k 2 -t 1 -M 3 -O 11 -E 4 $BASE/data/arabidopsis_thaliana/v9/genome.fas $BASE/data/landsberg.fastq > $BASE/results/landsberg.v9.sai

date

# SAI -> SAM
$BIN/bwa samse $BASE/data/arabidopsis_thaliana/v9/genome.fas $BASE/results/landsberg.v9.sai $BASE/data/landsberg.fastq > $BASE/results/landsberg.v9.sam

date

Result 1

Script start: Sun May 30 06:05:53 EDT 2010
Alignment complete: Sun May 30 07:31:01 EDT 2010
SAI conversion complete: Sun May 30 07:37:20 EDT 2010

On CSHL's compute nodes (Opteron 64 processors @ 2.0GHz), the alignment took 85 minutes and SAI conversion required 6 minutes. This works out to about 5888 sequences/second on this hardware for the alignment and 5057365 sequences/second for SAI conversion.

ls -alt results
total 5174176
-rw-r--r-- 1 vaughn plantgroup 4503664386 2010-05-30 07:36 landsberg.v9.sam
-rw-r--r-- 1 vaughn plantgroup  794623076 2010-05-30 07:29 landsberg.v9.sai
drwxr-xr-x 2 vaughn plantgroup       2048 2010-05-30 06:04 .
drwxr-xr-x 5 vaughn plantgroup       2048 2010-05-28 16:00 ..

Variant Detection

run_vardetect.sh
#!/bin/bash 
#$ -S /bin/bash
#$ -cwd
#$ -l virtual_free=1.9G

# SGE script to launch single-thread BWA align then convert output to SAM

export BASE=/home/plantgroup/vaughn/iPlant/Test/UHTS/VariantDetection/Ler
export BIN=/home/plantgroup/bin

# Convert to compressed BAM
date
$BIN/samtools view -bt $BASE/data/arabidopsis_thaliana/v9/genome.fas.fai -o $BASE/results/landsberg.v9.unsorted.bam $BASE/results/landsberg.v9.sam

# Sort BAM
date
$BIN/samtools sort $BASE/results/landsberg.v9.unsorted.bam $BASE/results/landsberg.v9

# Pileup
# -N is 1 because Arabidopsis is an inbred. This is an example of the parameter differing
# from the default value.
date
$BIN/samtools pileup -T 0.85 -N 1 -r 0.001 -I 40 -vcf $BASE/data/arabidopsis_thaliana/v9/genome.fas $BASE/results/landsberg.v9.bam > $BASE/results/landsberg.v9.pileup.raw

# Filtering
date
$BIN/samtools.pl varFilter -Q 25 -q 10 -d 3 -D 100 -G 25 -w 10 -W 10 -N 2 -l 30 $BASE/results/landsberg.v9.pileup.raw | awk '$6>=20' > $BASE/results/landsberg.v9.pileup

# Conversion to VCF3.3
date
$BIN/sam2vcf.pl -r $BASE/data/arabidopsis_thaliana/v9/genome.fas -t Landsberg < $BASE/results/landsberg.v9.pileup > $BASE/results/landsberg.v9.vcf

date

Results 2

Run Times

Sun May 30 08:34:53 EDT 2010: Start
Sun May 30 08:40:59 EDT 2010: BAM conversion done (6 min)
Sun May 30 08:52:04 EDT 2010: BAM sort done (11 min)
Sun May 30 08:58:18 EDT 2010: Pileup done (6 min)
Sun May 30 08:58:35 EDT 2010: Pileup filter done (17 sec)
Sun May 30 08:58:50 EDT 2010: Conversion to VCF done (15 sec)

Final contents of results directory

ls -alt results
total 7198848
-rw-r--r-- 1 vaughn plantgroup   11961564 2010-05-30 09:26 landsberg.v9.vcf
-rw-r--r-- 1 vaughn plantgroup   10457460 2010-05-30 08:57 landsberg.v9.pileup
-rw-r--r-- 1 vaughn plantgroup   31503169 2010-05-30 08:57 landsberg.v9.pileup.raw
drwxr-xr-x 2 vaughn plantgroup       2048 2010-05-30 08:50 .
-rw-r--r-- 1 vaughn plantgroup  899645311 2010-05-30 08:50 landsberg.v9.bam
-rw-r--r-- 1 vaughn plantgroup 1119586992 2010-05-30 08:39 landsberg.v9.unsorted.bam
-rw-r--r-- 1 vaughn plantgroup 4503664386 2010-05-30 07:36 landsberg.v9.sam
-rw-r--r-- 1 vaughn plantgroup  794623076 2010-05-30 07:29 landsberg.v9.sai

Metrics

VCF file is 244993 lines long

wc results/landsberg.v9.vcf 
244993  2449888 11961564 landsberg.v9.vcf

First 10 lines of VCF

head -n results/landsberg.v9.vcf

##fileformat=VCFv3.3
##INFO=DP,1,Integer,"Total Depth"
##FORMAT=GT,1,String,"Genotype"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=DP,1,Integer,"Read Depth"
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Landsberg
Chr1	711	.	T	C	126	0	DP=4	GT:GQ:DP	1/1:126:4
Chr1	956	.	C	T	126	0	DP=4	GT:GQ:DP	1/1:126:4
Chr1	6324	.	T	IA	79	0	DP=3	GT:GQ:DP	1/1:47:3
Chr1	10904	.	A	T	169	0	DP=6	GT:GQ:DP	1/1:169:6