UHTS_Test_VarDetect_1
Overview
User script
- Manage Data: Import data files into DE
- Here, we're testing import from an Amazon S3 bucket. Now this is pod racing!
- http://irods-collections.s3.amazonaws.com/g2p_di/Testing/UHTS/VariantDetection/Ler/data/landsberg.fastq
- Process Sequence Data
- We do NO processing on this data
- Align to Genome
- Select reference genome 'Arabidopsis thaliana'
- Select the landsberg.fastq file as input
- Launch job. Try to monitor how long it takes to run. We estimate 1-2 hours.
- Find SNPS
- Select reference genome 'Arabidopsis thaliana'
- Select files for input: This should be the SAM file from your previous alignment.
- In base calling, set haplotypes to 1 since Arabidopsis thaliana is generally monoallelic. Set Probability of Indel to 40.
- All settings in Filtering are defaults, but set name of sample to 'Ler'
- Launch job and track how long it takes to run. We estimate 20 minutes.
- Validation
- Inspect the SAM file that came from your Alignment process
- Inspect the VCF file that results from your SNP-finding endeavour. Compare to this reference VCF for common SNPs, header formatting, etc.
##fileformat=VCFv3.3 ##INFO=DP,1,Integer,"Total Depth" ##FORMAT=GT,1,String,"Genotype" ##FORMAT=GQ,1,Integer,"Genotype Quality" ##FORMAT=DP,1,Integer,"Read Depth" #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Landsberg Chr1 711 . T C 126 0 DP=4 GT:GQ:DP 1/1:126:4 Chr1 956 . C T 126 0 DP=4 GT:GQ:DP 1/1:126:4 Chr1 6324 . T IA 79 0 DP=3 GT:GQ:DP 1/1:47:3 Chr1 10904 . A T 169 0 DP=6 GT:GQ:DP 1/1:169:6 Chr1 14534 . G D1 39 0 DP=4 GT:GQ:DP 1/1:10:4 Chr1 37388 . G T 84 0 DP=3 GT:GQ:DP 1/1:84:3 Chr1 37768 . A IG 24 0 DP=6 GT:GQ:DP 0/1:24:6 Chr1 41749 . G D2 29 0 DP=3 GT:GQ:DP 0/1:29:3 Chr1 44685 . C T 24 0 DP=4 GT:GQ:DP 1/1:24:4 Chr1 71326 . G T 235 0 DP=12 GT:GQ:DP 1/1:235:12 Chr1 71348 . C T 140 0 DP=7 GT:GQ:DP 1/1:140:7 Chr1 80288 . T IC 79 0 DP=3 GT:GQ:DP 1/1:47:3 Chr1 87746 . C IA 39 0 DP=3 GT:GQ:DP 1/1:47:3 Chr1 90334 . A D1 40 0 DP=14 GT:GQ:DP 0/1:40:14 Chr1 90378 . G IA 27 0 DP=5 GT:GQ:DP 0/1:27:5 Chr1 91199 . G A 33 0 DP=4 GT:GQ:DP 1/1:33:4 Chr1 96770 . C A 93 0 DP=3 GT:GQ:DP 1/1:93:3 Chr1 97473 . C T 173 0 DP=6 GT:GQ:DP 1/1:173:6 Chr1 100305 . G T 32 0 DP=3 GT:GQ:DP 1/1:32:3
Background
This data set is comprised of 30344190 32 bp reads from the Arabidopsis thaliana Landsberg erecta land race, which is estimated to be ~5% divergent from the reference Columbia land race. The objective of the analysis is to determine the entire table of high-confidence SNPs.
This test set starts with the file data/landsberg.fastq and the genome directory data/arabidopsis_thaliana/v9 and emits results/landsberg.v9.sam and results/landsberg.v9.vcf as end-user deliverables. From start to finish it required ~115 minutes to complete.
This data set can be found on the iRODS server at bitol.iplantcollaborative.org://iplant/collections/g2p_di/Testing/UHTS/VariantDetection/Ler
The following environment variables are assumed in the scripts provided as examples.
export BASE=/home/plantgroup/vaughn/iPlant/Testing/UHTS/VariantDetection/Ler
export BIN=/home/plantgroup/bin
Versions
[vaughn@blade173] bwa Program: bwa (alignment via Burrows-Wheeler transformation) Version: 0.5.7 (r1310) [vaughn@blade173 src]$ samtools Program: samtools (Tools for alignments in the SAM format) Version: 0.1.7-6 (r530)
Indexing genome sequence
The Engagement Team are providing indexed genomes for maize and Arabidopsis but if you'd care to test that you've got indexing working, here are the expected outputs.
cd $HOME/data/arabidopsis_thaliana/v8 bwa index genome.fas [bwa_index] Pack FASTA... 2.05 sec [bwa_index] Reverse the packed sequence... 0.63 sec [bwa_index] Construct BWT for the packed sequence... [bwa_index] 69.32 seconds elapse. [bwa_index] Construct BWT for the reverse packed sequence... [bwa_index] 68.97 seconds elapse. [bwa_index] Update BWT... 0.66 sec [bwa_index] Update reverse BWT... 0.67 sec [bwa_index] Construct SA from BWT and Occ... 20.89 sec [bwa_index] Construct SA from reverse BWT and Occ... 20.72 sec ls -alth total 288M -rw-r--r-- 1 vaughn plantgroup 15M 2010-05-28 16:11 genome.fas.rsa drwxr-xr-x 2 vaughn plantgroup 2.0K 2010-05-28 16:11 . -rw-r--r-- 1 vaughn plantgroup 15M 2010-05-28 16:10 genome.fas.sa -rw-r--r-- 1 vaughn plantgroup 43M 2010-05-28 16:10 genome.fas.rbwt -rw-r--r-- 1 vaughn plantgroup 43M 2010-05-28 16:10 genome.fas.bwt -rw-r--r-- 1 vaughn plantgroup 29M 2010-05-28 16:08 genome.fas.rpac -rw-r--r-- 1 vaughn plantgroup 8.1K 2010-05-28 16:08 genome.fas.amb -rw-r--r-- 1 vaughn plantgroup 249 2010-05-28 16:08 genome.fas.ann -rw-r--r-- 1 vaughn plantgroup 29M 2010-05-28 16:08 genome.fas.pac -rw-r--r-- 1 vaughn plantgroup 117M 2010-05-28 15:59 genome.fas
cd $HOME/data/arabidopsis_thaliana/v9 bwa index genome.fas [bwa_index] Pack FASTA... 2.04 sec [bwa_index] Reverse the packed sequence... 0.63 sec [bwa_index] Construct BWT for the packed sequence... [bwa_index] 69.08 seconds elapse. [bwa_index] Construct BWT for the reverse packed sequence... [bwa_index] 69.06 seconds elapse. [bwa_index] Update BWT... 0.65 sec [bwa_index] Update reverse BWT... 0.66 sec [bwa_index] Construct SA from BWT and Occ... 21.00 sec [bwa_index] Construct SA from reverse BWT and Occ... 21.07 sec ls -alth total 287M -rw-r--r-- 1 vaughn plantgroup 15M 2010-05-28 16:15 genome.fas.rsa drwxr-xr-x 2 vaughn plantgroup 2.0K 2010-05-28 16:15 . -rw-r--r-- 1 vaughn plantgroup 15M 2010-05-28 16:15 genome.fas.sa -rw-r--r-- 1 vaughn plantgroup 43M 2010-05-28 16:14 genome.fas.rbwt -rw-r--r-- 1 vaughn plantgroup 43M 2010-05-28 16:14 genome.fas.bwt -rw-r--r-- 1 vaughn plantgroup 29M 2010-05-28 16:12 genome.fas.rpac -rw-r--r-- 1 vaughn plantgroup 7.4K 2010-05-28 16:12 genome.fas.amb -rw-r--r-- 1 vaughn plantgroup 690 2010-05-28 16:12 genome.fas.ann -rw-r--r-- 1 vaughn plantgroup 29M 2010-05-28 16:12 genome.fas.pac drwxr-xr-x 4 vaughn plantgroup 2.0K 2010-05-28 15:58 .. -rw-r--r-- 1 vaughn plantgroup 116M 2010-05-28 15:58 genome.fas
Running the BWA alignment
The following is explained as an SGE script for launching an alignment job. Commands for alignment and SAI->SAM conversion are indicated. This isn't very fast since it is a single threaded alignment. See below for actual benchmarks.
#!/bin/bash #$ -S /bin/bash #$ -cwd #$ -l virtual_free=3.9G # SGE script to launch single-thread BWA align then convert output to SAM date export BASE=/home/plantgroup/vaughn/iPlant/Test/UHTS/VariantDetection/Ler export BIN=/home/plantgroup/bin # Align using default BWA params. There are other params specified in the BWA # man page but they invoke changes in BWAs actual behavior (read trimming, # suboptimal alignments, etc) $BIN/bwa aln -n 0.04 -o 1 -e -1 -d 16 -i 5 -k 2 -t 1 -M 3 -O 11 -E 4 $BASE/data/arabidopsis_thaliana/v9/genome.fas $BASE/data/landsberg.fastq > $BASE/results/landsberg.v9.sai date # SAI -> SAM $BIN/bwa samse $BASE/data/arabidopsis_thaliana/v9/genome.fas $BASE/results/landsberg.v9.sai $BASE/data/landsberg.fastq > $BASE/results/landsberg.v9.sam date
Result 1
Script start: Sun May 30 06:05:53 EDT 2010
Alignment complete: Sun May 30 07:31:01 EDT 2010
SAI conversion complete: Sun May 30 07:37:20 EDT 2010
On CSHL's compute nodes (Opteron 64 processors @ 2.0GHz), the alignment took 85 minutes and SAI conversion required 6 minutes. This works out to about 5888 sequences/second on this hardware for the alignment and 5057365 sequences/second for SAI conversion.
ls -alt results total 5174176 -rw-r--r-- 1 vaughn plantgroup 4503664386 2010-05-30 07:36 landsberg.v9.sam -rw-r--r-- 1 vaughn plantgroup 794623076 2010-05-30 07:29 landsberg.v9.sai drwxr-xr-x 2 vaughn plantgroup 2048 2010-05-30 06:04 . drwxr-xr-x 5 vaughn plantgroup 2048 2010-05-28 16:00 ..
Variant Detection
#!/bin/bash #$ -S /bin/bash #$ -cwd #$ -l virtual_free=1.9G # SGE script to launch single-thread BWA align then convert output to SAM export BASE=/home/plantgroup/vaughn/iPlant/Test/UHTS/VariantDetection/Ler export BIN=/home/plantgroup/bin # Convert to compressed BAM date $BIN/samtools view -bt $BASE/data/arabidopsis_thaliana/v9/genome.fas.fai -o $BASE/results/landsberg.v9.unsorted.bam $BASE/results/landsberg.v9.sam # Sort BAM date $BIN/samtools sort $BASE/results/landsberg.v9.unsorted.bam $BASE/results/landsberg.v9 # Pileup # -N is 1 because Arabidopsis is an inbred. This is an example of the parameter differing # from the default value. date $BIN/samtools pileup -T 0.85 -N 1 -r 0.001 -I 40 -vcf $BASE/data/arabidopsis_thaliana/v9/genome.fas $BASE/results/landsberg.v9.bam > $BASE/results/landsberg.v9.pileup.raw # Filtering date $BIN/samtools.pl varFilter -Q 25 -q 10 -d 3 -D 100 -G 25 -w 10 -W 10 -N 2 -l 30 $BASE/results/landsberg.v9.pileup.raw | awk '$6>=20' > $BASE/results/landsberg.v9.pileup # Conversion to VCF3.3 date $BIN/sam2vcf.pl -r $BASE/data/arabidopsis_thaliana/v9/genome.fas -t Landsberg < $BASE/results/landsberg.v9.pileup > $BASE/results/landsberg.v9.vcf date
Results 2
Run Times
Sun May 30 08:34:53 EDT 2010: Start
Sun May 30 08:40:59 EDT 2010: BAM conversion done (6 min)
Sun May 30 08:52:04 EDT 2010: BAM sort done (11 min)
Sun May 30 08:58:18 EDT 2010: Pileup done (6 min)
Sun May 30 08:58:35 EDT 2010: Pileup filter done (17 sec)
Sun May 30 08:58:50 EDT 2010: Conversion to VCF done (15 sec)
Final contents of results directory
ls -alt results total 7198848 -rw-r--r-- 1 vaughn plantgroup 11961564 2010-05-30 09:26 landsberg.v9.vcf -rw-r--r-- 1 vaughn plantgroup 10457460 2010-05-30 08:57 landsberg.v9.pileup -rw-r--r-- 1 vaughn plantgroup 31503169 2010-05-30 08:57 landsberg.v9.pileup.raw drwxr-xr-x 2 vaughn plantgroup 2048 2010-05-30 08:50 . -rw-r--r-- 1 vaughn plantgroup 899645311 2010-05-30 08:50 landsberg.v9.bam -rw-r--r-- 1 vaughn plantgroup 1119586992 2010-05-30 08:39 landsberg.v9.unsorted.bam -rw-r--r-- 1 vaughn plantgroup 4503664386 2010-05-30 07:36 landsberg.v9.sam -rw-r--r-- 1 vaughn plantgroup 794623076 2010-05-30 07:29 landsberg.v9.sai
Metrics
VCF file is 244993 lines long
wc results/landsberg.v9.vcf 244993 2449888 11961564 landsberg.v9.vcf
First 10 lines of VCF
head -n results/landsberg.v9.vcf ##fileformat=VCFv3.3 ##INFO=DP,1,Integer,"Total Depth" ##FORMAT=GT,1,String,"Genotype" ##FORMAT=GQ,1,Integer,"Genotype Quality" ##FORMAT=DP,1,Integer,"Read Depth" #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Landsberg Chr1 711 . T C 126 0 DP=4 GT:GQ:DP 1/1:126:4 Chr1 956 . C T 126 0 DP=4 GT:GQ:DP 1/1:126:4 Chr1 6324 . T IA 79 0 DP=3 GT:GQ:DP 1/1:47:3 Chr1 10904 . A T 169 0 DP=6 GT:GQ:DP 1/1:169:6