The purpose of this exercise is to gain familiarity with a commonly used procedure for de novo whole genome assembly of Illumina reads using the iPlant Discovery Environment (DE). The procedure is based on a publication describing the Assemblathon project, where different methods of assembly were compared.
This procedure will include assembly of paired and unpaired Illumina reads with Soapdenovo2, followed by analysis of the assembly quality for comparison to what was accomplished in one of the Assemblathon procedures that used Soapdenovo1.
The procedure begins with reads previously trimmed with Scythe to remove extraneous sequence, and Sickle to remove low quality reads and low quality portions of the remaining reads. After assembly with the Soapdenovo 2 App in the DE, the resulting assembly will be analyzed for basic quality statistics, and compared to a reference genome for more in depth analysis of the assembly, specifically for assembly fidelity.
The Assemblathon1 Experiment
The Assemblathon, the first one, was a competition to compare different approaches to genome assembly. For the studies, 2 different genomes were used, but primarily one was tested in the competition: an artificially created genome, the species A genome, which is approximately 12.5 mbp. Synthetic Illumina reads were generated for a diploid version of this genome, and 17 groups from different institutions tried different approaches/assemblers to assemble the genome sequence from the reads provided. Two different groups used Soapdenovo 1, a commonly used whole genome assembler for assembling Illumina sequences. For this tutorial, Soapdenovo 2 will be used for assembly of the same artificial diploid genome, but the exact procedures used to create the final assemblies entered for the Assemblathon. For the tutorial, a recommended approach will be followed in testing different kmer settings, and in this, testing the benefits of using the GapCloser program that is provided by the same group that created Soapdenovo 2.
The original Assemblathon1 data is located in the iPlant Data Store at this location: /iplant/home/shared/iplant_assembly_test_data/assemblathon1. The data to be used in this tutorial is available in the Data Store (Community Data > iplant_training > genome_assembly2 > input_reads).
This represents the original data from the first assemblathon, trimmed and cleaned up with the applications Scythe and Sickle, as described in a separate tutorial. During the process of trimming, many culled reads left unpaired mates behind, which were moved into a separate single reads file. All the single read files were combined together and renamed to spA_singles.fq.
More information on the Assemblathon is available here: http://assemblathon.org/assemblathon1.
The Assemblathon1 publication, Dent, E. et al., Assemblathon 1: A competitive assessment of de novo short read assembly methods, Genome Res. 2011. 21:2224-2241, is found here: http://genome.cshlp.org/content/21/12/2224.full?sid=74019122-f944-4ccc-bffe-d16fdd0e7d6c.
I. Soapdenovo 2
App: Soapdenovo 2.0.4 (Public Applications>NGS>Assemblers).
A de novo, whole genome assembler for Illumina reads created by BGI.
Time for execution of the tutorial data, approximately 3 hours.
a. General settings:
b. Paired Reads 1:
c. Paired Reads 2:
d. Paired Reads 3:
d. Paired Reads 4:
d. Paired Reads 5:
f. Run Settings
II. Assembly with kmer parameter sweep
4. Click Analyses in the Discovery Environment, select the checkbox next to the Soapdenovo job run above, and then click Relaunch.
5. If desired, append kmer45 to your name and/description. Under the General Settings select a kmer size of 45 and then click Launch analysis.
6. Repeat steps 4 and 5, this time selecting a kmer size of 49 (naming/describing your job appropriately)and then launching your analysis.
Note: A best practice is to repeat this same assembly with 2 other kmer settings (e.g., 45 and 49). Creating good assemblies generally requires testing multiple kmer settings with most assemblers in use.
III. Assess assembly.
Time for execution of the tutorial data, approximately 1 hour.
7. Open a Data window by clicking Data in the Discovery Environment. Navigate to the test data Community Data > iplant_training > genome_assembly2 > assess_assembly > speciesA.diploid.fa. Keep this window off to the side so you can drag and drop the test data into the app window.
if you are using your own data, additional supported genomes can be found at: ( Community Data > iplantcollaborative > genomeservices > builds > 0.2.1> )
8. Click on Apps and search for the Assess Assembly vs whole genome app (Public Applications>NGS>Assembly Annotation). If desired make adjustment to the analysis name, description, etc. (Basic Documentation)
Explanation: Compares an assembly to a genome fasta file and evaluates the fidelity of the assembly.
9. For App Inputs use the following values for test data:
10. Repeat steps 7-9 for the other two Soapdenovo assemblies (kmer size values 45-49 respectively).
11. Compare the results for the different assemblies. Some have larger N50 values for the scaffolds. Some have lower numbers of inserts and deletions. Compare the representation values (the amount of the genome that is accurately reproduced in the assembly).
IV. Additional Assembly: GapCloser
GapCloser is a tool for filling gaps in scaffolds produced with the Soapdenovo 2 assembler. The whole output directory from a successful run of the App Soapdenovo 2.0.4 is ingested and the reads are used to find matches at the internal gap margins of the output scaffold.
12. Click on Apps and search for GapCloser, click on GapCloser 1.12.0. If desired make adjustment to the analysis name, description
13. For Input enter the entire output directory for a Soapdenovo2 assembly that you want to use with GapCloser. For the test data, (spA genome) enter one of the assemblies from running Soapdenovo2, for example “soapdenovo2_spA47” Community Data > iplant_training > genome_assembly2 > Soapdenovo2)
14. For General Settings use the following:
V. Assess assembly on GapCloser outputs
15. Click on Apps and search for the Assess Assembly vs whole genome app. (Public Applications>NGS>Assembly Annotation). If desired make adjustment to the analysis name, description, etc.
16. For App Inputs use the following values for test data:
1. Input files:
a. Illumina fastq read files sample:
2. Output files:
a. Scaffold files in fasta format:
b. Scaffold fasta file after GapCloser:
>12629 length 638 cvg_9.1_tip_1
>12631 length 638 cvg_6.0_tip_1
fasta format text file
scaffold fasta file, which represents the final output for the assembly sequences (includes large gaps filled with N’s)
fasta format text file
contig fasta file, the final assembly of continuous sequence (only small gaps)
text formatted list of statistics about the scaffold sequences including the N50 value
text formatted list of statistics about the scaffold sequences (or any input fasta file), including N50, but also comparisons to a reference genome (e.g. representation value)
fasta format text file
scaffold fasta file, which represents the final output for the assembly sequences from GapCloser