Variant Detection

Background

This example uses data from the SRA (Sequence Read Archive): experiment SRX010829, "GA II sequencing the maize inbred line Mo17". It contains two data sets, SRR026741 (47472582 36 bp reads) and SRR026996 (2308132 32 bp reads), from the Zea mays inbred line Mo17, which is genetically divergent from the reference inbred line B73. The objective of the analysis is to produce a table of high-confidence SNPs.

User script

  1. Manage Data: Import data files into the Discovery Environment by choosing File=>Import from URL in the My Data window.
    1. ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX010/SRX010829/SRR026741.fastq.gz
    2. ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX010/SRX010829/SRR026996.fastq.gz
    3. After files have uploaded, they will appear in My Data
  2. Process Sequence Data (using Investigation Type QC Pre Processing (FASTQ files))
    1. You don't need to pre-process this data: it has no barcodes or sequencing adapters, and NCBI has already rescaled its quality scores to PHRED33.
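
If you want to confirm the quality encoding yourself, a minimal Python sketch along these lines can help. It assumes a plain 4-line-per-record FASTQ file (no wrapped sequence lines) and uses a common heuristic: any quality character below ASCII 59 can only occur in PHRED33 data. The function name, the record limit, and the local file name in the usage comment are hypothetical, not part of the Discovery Environment.

```python
import gzip
from itertools import islice


def guess_phred_offset(fastq_lines, max_records=10000):
    """Guess the quality offset (33 or 64) from the first max_records reads.

    Returns None if no quality line was seen. This is a heuristic: data whose
    qualities are all very high can be misreported as offset 64.
    """
    lowest = None
    for index, line in enumerate(islice(fastq_lines, max_records * 4)):
        if index % 4 != 3:  # only the 4th line of each record holds qualities
            continue
        quality = line.rstrip("\r\n")
        if quality:
            smallest = min(ord(char) for char in quality)
            lowest = smallest if lowest is None else min(lowest, smallest)
    if lowest is None:
        return None
    return 33 if lowest < 59 else 64


# Usage, assuming the file was downloaded locally under this name:
# with gzip.open("SRR026996.fastq.gz", "rt") as handle:
#     print(guess_phred_offset(handle))
```

A result of 33 is consistent with the statement above that no rescaling is needed.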
  3. Align to Genome: Perform TWO alignments, one for each FASTQ file (using Investigation Type Variant Detection)
    1. Select reference genome 'Zea mays'
    2. Select the SRR026741.fastq file as input
    3. Launch job.
    4. Repeat this with SRR026996.fastq.
    5. Each job is named after its job id, for example Job_12345. You can view the status of your job in the My Jobs window. When the job is done, your My Data window will contain a folder named after the job id that holds the results. We estimate about 30 minutes for the smaller file (SRR026996) and about 3 hours for the larger one (SRR026741).
    6. You will get two SAM files in My Data. If you're curious about what's in these files, you can preview the first 8k of the file by selecting View in the file manager.
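
Beyond previewing the file, you can get a rough sense of how well the reads aligned by counting mapped and unmapped records. The Python sketch below relies only on the SAM format itself: header lines start with @, and bit 0x4 of the FLAG column (the second tab-separated field) marks an unmapped read. The function name and the file path in the usage comment are hypothetical.

```python
from itertools import islice


def summarize_sam(lines, limit=None):
    """Count mapped and unmapped alignment records in SAM text lines.

    limit caps the number of alignment records examined (None means all).
    """
    records = (line for line in lines if line.strip() and not line.startswith("@"))
    counts = {"mapped": 0, "unmapped": 0}
    for line in islice(records, limit):
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & 0x4:                   # bit 0x4 set: read is unmapped
            counts["unmapped"] += 1
        else:
            counts["mapped"] += 1
    return counts


# Usage, assuming a SAM file downloaded from the job folder:
# with open("Job_12345/alignment.sam") as handle:
#     print(summarize_sam(handle, limit=100000))
```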
  4. Find SNPs (using Investigation Type Variant Detection)
    1. Select reference genome 'Zea mays'
    2. Select files for input: Select the TWO SAM files generated by the previous alignment jobs. You'll need to select them one at a time as the current interface doesn't allow multiple selections.
    3. In base calling, set haplotypes to 2. Set Probability of Indel to 40.
    4. Leave all settings in Filtering at their defaults, but set the sample name to 'Mo17'
    5. Launch the job. We estimate it will take 60-90 minutes to complete; your My Jobs window will update with the status of your job.
  5. Validation
    1. You will get a VCF file in a folder in My Data named after the job id. If you inspect it, it should look something like the snippet below. These are novel alleles in Mo17 relative to B73.
      ##fileformat=VCFv3.3
      ##INFO=DP,1,Integer,"Total Depth"
      ##FORMAT=GT,1,String,"Genotype"
      ##FORMAT=GQ,1,Integer,"Genotype Quality"
      ##FORMAT=DP,1,Integer,"Read Depth"
      #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Mo17
      chr0	45785	.	C	G	39	0	DP=4	GT:GQ:DP	1/1:39:4
      chr0	124336	.	C	T	42	0	DP=5	GT:GQ:DP	1/1:42:5
      chr0	124356	.	A	G	34	0	DP=5	GT:GQ:DP	1/1:34:5
      chr0	729215	.	A	G	39	0	DP=4	GT:GQ:DP	1/1:39:4
      chr0	729224	.	T	C	39	0	DP=4	GT:GQ:DP	1/1:39:4
      chr0	1116103	.	C	A	45	0	DP=8	GT:GQ:DP	1/1:17:8
      chr0	1116109	.	C	A	44	0	DP=8	GT:GQ:DP	1/1:16:8
      chr0	1353844	.	C	IT	248	0	DP=33	GT:GQ:DP	0/1:248:33
      chr0	1774140	.	T	D2	29	0	DP=3	GT:GQ:DP	0/1:29:3
      chr0	1936168	.	G	C	39	0	DP=4	GT:GQ:DP	1/1:39:4
      chr0	1936169	.	C	T	36	0	DP=4	GT:GQ:DP	1/1:36:4
      chr0	2039472	.	T	D1	159	0	DP=5	GT:GQ:DP	1/1:53:5
      chr0	2185119	.	T	C	36	0	DP=3	GT:GQ:DP	1/1:36:3
      chr0	2559281	.	T	C	51	0	DP=8	GT:GQ:DP	1/1:51:8
      chr0	2559295	.	G	A	51	0	DP=8	GT:GQ:DP	1/1:51:8
      chr0	2565923	.	T	C	48	0	DP=7	GT:GQ:DP	1/1:48:7
      chr0	2570157	.	T	G	97	0	DP=53	GT:GQ:DP	0/1:97:53
      chr0	2570245	.	C	T	32	0	DP=3	GT:GQ:DP	1/1:28:3
      chr0	2571239	.	T	C	36	0	DP=3	GT:GQ:DP	1/1:36:3
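
The Background states the goal as a table of high-confidence SNPs, but the VCF output also lists indels: in VCFv3.3, an ALT value of I followed by bases is an insertion, and D followed by a number is a deletion. A minimal Python sketch of that last filtering step might look like the following. The function names and the min_qual and min_depth thresholds are hypothetical, chosen only for illustration; they are not settings from the Discovery Environment, and sensible cutoffs depend on your coverage.

```python
def parse_vcf_record(line):
    """Parse one tab-delimited, single-sample VCF data line into a dict."""
    fields = line.rstrip("\r\n").split("\t")
    chrom, pos, _id, ref, alt, qual, filt, _info, fmt, sample = fields[:10]
    # Pair FORMAT keys (GT:GQ:DP) with the sample's values (1/1:39:4).
    sample_values = dict(zip(fmt.split(":"), sample.split(":")))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "filter": filt,
        "genotype": sample_values.get("GT"),
        "depth": int(sample_values.get("DP", 0)),
    }


def is_snp(record):
    """True for single-base substitutions; excludes I.../D... indel alleles."""
    return len(record["ref"]) == 1 and record["alt"] in {"A", "C", "G", "T"}


def high_confidence_snps(lines, min_qual=40.0, min_depth=5):
    """Keep unfiltered SNP records meeting the (hypothetical) thresholds."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        record = parse_vcf_record(line)
        if (
            is_snp(record)
            and record["filter"] == "0"  # in VCFv3.3, 0 means not filtered
            and record["qual"] >= min_qual
            and record["depth"] >= min_depth
        ):
            kept.append(record)
    return kept


# Usage, assuming the VCF was downloaded from the job folder:
# with open("Job_12345/variants.vcf") as handle:
#     for snp in high_confidence_snps(handle):
#         print(snp["chrom"], snp["pos"], snp["ref"], snp["alt"], snp["genotype"])
```

Raising min_qual or min_depth trades sensitivity for confidence; with low-coverage data such as the records above (many at depth 3 to 5), strict cutoffs will discard most calls.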