Variant Detection
Background
This example uses data from the SRA (Sequence Read Archive): experiment SRX010829, "GA II sequencing the maize inbred line Mo17". It contains two data sets, SRR026741 (47472582 reads of 36 bp) and SRR026996 (2308132 reads of 32 bp), from the Zea mays inbred line Mo17, which is genetically divergent from the reference inbred line B73. The objective of the analysis is to produce a table of high-confidence SNPs.
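To get a feel for how much sequence these two data sets provide, the read counts and lengths above can be turned into a rough coverage estimate. The sketch below is only back-of-the-envelope arithmetic: the genome size is an assumed round figure that does not come from this tutorial, and the result is an upper bound because not every read will align.

```python
# Read counts and read lengths are taken from the Background paragraph.
DATASETS = {
    "SRR026741": (47_472_582, 36),  # (number of reads, read length in bp)
    "SRR026996": (2_308_132, 32),
}

# ASSUMPTION: approximate maize genome size in bp; not stated in this tutorial.
ASSUMED_GENOME_SIZE_BP = 2_300_000_000


def total_bases(datasets):
    """Sum reads * read length over all data sets."""
    return sum(n_reads * read_len for n_reads, read_len in datasets.values())


def estimated_coverage(datasets, genome_size_bp):
    """Raw sequenced bases divided by genome size (ignores unaligned reads)."""
    return total_bases(datasets) / genome_size_bp


if __name__ == "__main__":
    print(f"total bases: {total_bases(DATASETS):,}")
    cov = estimated_coverage(DATASETS, ASSUMED_GENOME_SIZE_BP)
    print(f"approximate raw coverage: {cov:.2f}x")
```

Under that assumed genome size the average raw coverage comes out below 1x, so you should expect most variant calls to be supported by only a handful of reads.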
User script
- Manage Data: Import data files into the Discovery Environment by choosing File=>Import from URL in the My Data window.
- ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX010/SRX010829/SRR026741.fastq.gz
- ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX010/SRX010829/SRR026996.fastq.gz
- After the files have been imported, they will appear in My Data.
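If you would rather keep a local copy of the same files outside the Discovery Environment, the import step can be approximated with a few lines of Python. This is a minimal sketch, not part of the Discovery Environment itself: the helper names `local_name` and `download` are made up for illustration, and the FTP URLs are the ones listed above, which may no longer be available.

```python
import os
import urllib.request
from urllib.parse import urlparse

# URLs copied from the Manage Data step above.
URLS = [
    "ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX010/SRX010829/SRR026741.fastq.gz",
    "ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX010/SRX010829/SRR026996.fastq.gz",
]


def local_name(url):
    """Return the file name at the end of the URL path (hypothetical helper)."""
    return os.path.basename(urlparse(url).path)


def download(url, dest_dir="."):
    """Fetch url into dest_dir and return the local path (hypothetical helper)."""
    dest = os.path.join(dest_dir, local_name(url))
    urllib.request.urlretrieve(url, dest)
    return dest


if __name__ == "__main__":
    for url in URLS:
        print("downloaded", download(url))
```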
- Process Sequence Data (using Investigation Type QC Pre Processing (FASTQ files))
- You don't need to do any pre-processing on this data: it has no barcodes or sequencing adapters, and NCBI has already rescaled the quality scores to Phred+33 (PHRED33).
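For readers unfamiliar with the PHRED33 encoding mentioned above: each character of a FASTQ quality string stores a quality score as its ASCII code minus 33, and a score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch of the decoding, with an invented quality string as input:

```python
PHRED_OFFSET = 33  # ASCII offset used by the Phred+33 encoding


def decode_qualities(quality_string, offset=PHRED_OFFSET):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    example = "!5I"  # hypothetical quality string, not taken from the data
    for ch, q in zip(example, decode_qualities(example)):
        print(ch, q, error_probability(q))
```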
- Align to Genome: There will be TWO alignments performed, one for each FASTQ (using Investigation Type Variant Detection)
- Select reference genome 'Zea mays'
- Select the SRR026741.fastq file as input
- Launch job.
- Repeat this with SRR026996.fastq.
- Each job is named after its job id, for example Job_12345. You can view the status of your jobs in the My Jobs window. When a job is done, your My Data window will contain a folder named after the job id that holds the results. We estimate run times of about 30 minutes for the smaller file and 3 hours for the larger one.
- You will get two SAM files in My Data. If you're curious about what's in these files, you can preview the first 8 KB of a file by selecting View in the file manager.
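If you download a SAM file, you can reproduce the 8 KB preview locally and split each alignment line into its fields. The sketch below assumes the eleven mandatory tab-separated columns of the SAM format and uses the field names from the current SAM specification; the function names and the example text are invented for illustration.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def preview(path, n_bytes=8192):
    """Return the first n_bytes of a file as text, similar to the View preview."""
    with open(path, "rb") as handle:
        return handle.read(n_bytes).decode("ascii", errors="replace")


def parse_alignment(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("incomplete SAM alignment line")
    record = dict(zip(SAM_FIELDS, cols))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = cols[len(SAM_FIELDS):]  # optional fields, if any
    return record


def alignments(text):
    """Yield parsed alignments from preview text, skipping '@' header lines.

    The final piece after the last newline is dropped because a fixed-size
    preview usually cuts the last line short.
    """
    for line in text.split("\n")[:-1]:
        if line and not line.startswith("@"):
            yield parse_alignment(line)


# Hypothetical preview text: one header, one full alignment, one cut-off line.
EXAMPLE = ("@SQ\tSN:chr1\tLN:1000\n"
           "read1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\n"
           "read2\t16\tchr1\t2")

if __name__ == "__main__":
    for rec in alignments(EXAMPLE):
        print(rec["QNAME"], rec["RNAME"], rec["POS"], rec["CIGAR"])
```

With a real file you would call `alignments(preview("your_file.sam"))`, where the file name is whatever your alignment job produced.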
- Find SNPs (using Investigation Type Variant Detection)
- Select reference genome 'Zea mays'
- Select files for input: Select the TWO SAM files generated by the previous alignment jobs. You'll need to select them one at a time as the current interface doesn't allow multiple selections.
- In base calling, set haplotypes to 2. Set Probability of Indel to 40.
- Leave all settings in Filtering at their defaults, but set the sample name to 'Mo17'.
- Launch the job. We estimate it will take 60-90 minutes to complete; the My Jobs window will show the status of your job.
- Validation
- You will get a VCF file in a folder in My Data named after the job id. If you inspect it, it should look something like the snippet below. These are novel alleles in Mo17 relative to B73.
##fileformat=VCFv3.3
##INFO=DP,1,Integer,"Total Depth"
##FORMAT=GT,1,String,"Genotype"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=DP,1,Integer,"Read Depth"
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Mo17
chr0 45785 . C G 39 0 DP=4 GT:GQ:DP 1/1:39:4
chr0 124336 . C T 42 0 DP=5 GT:GQ:DP 1/1:42:5
chr0 124356 . A G 34 0 DP=5 GT:GQ:DP 1/1:34:5
chr0 729215 . A G 39 0 DP=4 GT:GQ:DP 1/1:39:4
chr0 729224 . T C 39 0 DP=4 GT:GQ:DP 1/1:39:4
chr0 1116103 . C A 45 0 DP=8 GT:GQ:DP 1/1:17:8
chr0 1116109 . C A 44 0 DP=8 GT:GQ:DP 1/1:16:8
chr0 1353844 . C IT 248 0 DP=33 GT:GQ:DP 0/1:248:33
chr0 1774140 . T D2 29 0 DP=3 GT:GQ:DP 0/1:29:3
chr0 1936168 . G C 39 0 DP=4 GT:GQ:DP 1/1:39:4
chr0 1936169 . C T 36 0 DP=4 GT:GQ:DP 1/1:36:4
chr0 2039472 . T D1 159 0 DP=5 GT:GQ:DP 1/1:53:5
chr0 2185119 . T C 36 0 DP=3 GT:GQ:DP 1/1:36:3
chr0 2559281 . T C 51 0 DP=8 GT:GQ:DP 1/1:51:8
chr0 2559295 . G A 51 0 DP=8 GT:GQ:DP 1/1:51:8
chr0 2565923 . T C 48 0 DP=7 GT:GQ:DP 1/1:48:7
chr0 2570157 . T G 97 0 DP=53 GT:GQ:DP 0/1:97:53
chr0 2570245 . C T 32 0 DP=3 GT:GQ:DP 1/1:28:3
chr0 2571239 . T C 36 0 DP=3 GT:GQ:DP 1/1:36:3
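Since the goal is a table of high-confidence SNPs, you may want to post-process the VCF yourself. The following is a rough sketch rather than a definitive filter: it reads whitespace-separated VCFv3.3 records like the ones above, treats ALT values of the form I<bases> and D<n> as insertions and deletions, and keeps SNPs above a quality and depth threshold. The function names and the thresholds (QUAL of at least 40, depth of at least 5) are assumptions chosen for illustration, not values recommended by this tutorial.

```python
from collections import Counter

# Four records copied from the example VCF output above.
EXAMPLE_RECORDS = """\
chr0 45785 . C G 39 0 DP=4 GT:GQ:DP 1/1:39:4
chr0 1353844 . C IT 248 0 DP=33 GT:GQ:DP 0/1:248:33
chr0 1774140 . T D2 29 0 DP=3 GT:GQ:DP 0/1:29:3
chr0 2570157 . T G 97 0 DP=53 GT:GQ:DP 0/1:97:53
"""


def parse_record(line):
    """Parse one single-sample VCF record (10 whitespace-separated columns)."""
    chrom, pos, _id, ref, alt, qual, _filter, _info, fmt, sample = line.split()
    sample_fields = dict(zip(fmt.split(":"), sample.split(":")))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "genotype": sample_fields["GT"],
        "depth": int(sample_fields["DP"]),
    }


def variant_type(alt):
    """Classify an ALT value using the VCFv3.3 I<bases> / D<n> indel notation."""
    if alt.startswith("I"):
        return "insertion"
    if alt.startswith("D"):
        return "deletion"
    return "snp"


def high_confidence_snps(records, min_qual=40, min_depth=5):
    """Keep SNPs meeting the (assumed, illustrative) quality and depth cutoffs."""
    return [r for r in records
            if variant_type(r["alt"]) == "snp"
            and r["qual"] >= min_qual
            and r["depth"] >= min_depth]


if __name__ == "__main__":
    records = [parse_record(l) for l in EXAMPLE_RECORDS.splitlines() if l]
    print(Counter(variant_type(r["alt"]) for r in records))
    for r in high_confidence_snps(records):
        print(r["chrom"], r["pos"], r["ref"], r["alt"], r["genotype"])
```

To run this on a real result, read the lines of your VCF file, skip those starting with '#', and pass the rest to `parse_record`.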