Introduction and overview
Author: Dr. Roger Barthelson, iPlant Collaborative/University of Arizona
The goal of this tutorial is to gain familiarity with a commonly used procedure for de novo whole genome assembly of Illumina reads using the iPlant Discovery Environment (DE).
This procedure will include assembly of paired and unpaired Illumina reads with SOAPdenovo2, followed by an analysis of the assembly quality.
- Workflow: De novo Assembly I: Genome assembly with SOAPdenovo2.
- The procedure begins with reads previously trimmed with Scythe to remove extraneous sequence, and with Sickle to remove low-quality reads and low-quality portions of the remaining reads. After assembly with the SOAPdenovo2 App in the DE, the resulting assembly will be analyzed for basic quality statistics and compared to a reference genome for a more in-depth analysis, specifically of assembly fidelity.
Rationale and Background
De Novo Sequencing: A process in which a novel genome is sequenced for the first time, which requires specialized assembly of the sequencing reads. For this tutorial, the assembler SOAPdenovo2 will be used to assemble the genome, following a recommended approach of testing different k-mer settings.
SOAPdenovo: A novel short-read assembly method that can build draft assemblies de novo for human-sized genomes. The program is designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way. SOAPdenovo is aimed at large plant and animal genomes, although it also works well on bacterial and fungal genomes.
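To make the idea of a k-mer setting concrete, the following is a minimal Python sketch of how a read breaks down into overlapping k-mers, the units that a de Bruijn graph assembler works with. It is an illustration only, not the SOAPdenovo2 implementation. The function names and the range of candidate k values are hypothetical choices for this example. Odd values of k are used because an odd-length k-mer can never be its own reverse complement.

```python
def kmers(read, k):
    """Return every overlapping substring of length k in a read."""
    if k < 1 or k % 2 == 0:
        # Odd k avoids k-mers that equal their own reverse complement.
        raise ValueError("k must be a positive odd integer")
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def candidate_k_values(start=21, stop=63, step=10):
    """Hypothetical list of odd k values to try in separate assembly runs."""
    return [k for k in range(start, stop + 1, step) if k % 2 == 1]


if __name__ == "__main__":
    read = "ACGTACGGTCA"
    for k in (3, 5):
        # A read of length L yields L - k + 1 k-mers.
        print(k, len(kmers(read, k)), kmers(read, k)[:3])
    print(candidate_k_values())
```

A larger k gives fewer k-mers per read but each one is more specific, which is why assemblies are usually compared across several settings rather than run with a single value.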
- An iPlant account. (Register for an iPlant account at user.iplantcollaborative.org.)
- The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.
This tutorial uses Rhodobacter sphaeroides Illumina sequencing data that are stored in the Data Store in the iPlant Discovery Environment. They were downloaded from the GAGE project, which, in turn, retrieved them from the NCBI Sequence Read Archive (SRA) to test genome assembly methods. The data were trimmed and cleaned up with the applications Scythe and Sickle, as described in a separate tutorial. During trimming, many culled reads left unpaired mates behind, and these mates were moved into a separate single-reads file. The data to be used in this tutorial are available in the Data Store at Community Data > iplant_training > genome_assembly_soapdenovo > A_Assemble_Reads.
More information on the GAGE project is available here.
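The separation of paired reads from unpaired mates described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, assuming reads are held in dictionaries keyed by a shared read ID. It is not how Sickle itself is implemented, and the function name and data layout are hypothetical.

```python
def split_pairs(forward, reverse):
    """Separate surviving read pairs from mates whose partner was culled.

    forward, reverse: dicts mapping read ID -> sequence after trimming
    (hypothetical in-memory layout chosen for this sketch).
    """
    shared = forward.keys() & reverse.keys()
    # A pair survives only if both mates are still present.
    paired = [(rid, forward[rid], reverse[rid]) for rid in sorted(shared)]
    # Any read without its mate goes to the single-reads collection.
    singles = [
        (rid, seq)
        for reads in (forward, reverse)
        for rid, seq in reads.items()
        if rid not in shared
    ]
    return paired, singles


if __name__ == "__main__":
    fwd = {"r1": "ACGT", "r2": "GGCC"}
    rev = {"r1": "TTAA", "r3": "CCGG"}
    pairs, single_reads = split_pairs(fwd, rev)
    print(pairs)
    print(single_reads)
```

Keeping the orphaned mates in their own file lets the assembler still use their sequence while treating only the intact pairs as carrying insert-size information.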
- FASTA-format text file: the scaffold FASTA file, which represents the final output for the assembly sequences (includes large gaps filled with N’s)
- FASTA-format text file: the contig FASTA file, the final assembly of continuous sequence (only small gaps)
- Text-formatted list of statistics about the scaffold sequences, including the N50 value
- Text-formatted list of statistics about the scaffold sequences (or any input FASTA file), including N50, but also comparisons to a reference genome (e.g. representation value)
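The N50 value reported in these statistics can be understood from a short sketch. The Python below is a minimal illustration, not the code used by the DE apps, and its function names are hypothetical. It uses the common definition of N50: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header closes the previous record and starts a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


if __name__ == "__main__":
    example = [">scaffold1", "ACGTAC", "GT", ">scaffold2", "GGC"]
    print(fasta_lengths(example))
    print(n50(fasta_lengths(example)))
```

To apply it to a real file, the lines of a scaffold or contig FASTA file could be passed in directly, for example `n50(fasta_lengths(open(path)))`, where `path` is whatever output file is being examined.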
Approximate analysis durations for the iPlant sample data are provided in each step. Other datasets, depending on size, could take less or more time. Using the sample data, users can skip through the workflow (a la 'cooking show'), returning later to examine the results of their own analysis.