Rationale and background:
J Gregory Caporaso, Justin Kuczynski, Jesse Stombaugh, Kyle Bittinger, Frederic D Bushman, Elizabeth K Costello, Noah Fierer, Antonio Gonzalez Pena, Julia K Goodrich, Jeffrey I Gordon, Gavin A Huttley, Scott T Kelley, Dan Knights, Jeremy E Koenig, Ruth E Ley, Catherine A Lozupone, Daniel McDonald, Brian D Muegge, Meg Pirrung, Jens Reeder, Joel R Sevinsky, Peter J Turnbaugh, William A Walters, Jeremy Widmann, Tanya Yatsunenko, Jesse Zaneveld and Rob Knight; Nature Methods, 2010; doi:10.1038/nmeth.f.303
This tutorial will take users through steps of:
- Launching the QIIME-1.9.1 Atmosphere image
- Running QIIME-1.9.1 on an test data
Please work through the tutorial and add your comments on the bottom of this page. Or send comments per email to email@example.com. Thank you.
Part 1: Connect to an instance of an Atmosphere Image (Virtual Machine)
Step 1. Go to https://atmo.iplantcollaborative.org and log in with your CyVerse credentials.
Step 2. Click on the Launch New Instance button and search for Qiime-1.9.1
Step 3. Select the image Qiime-1.9.1 and click Launch Instance. It will take ~10-15 minutes for the cloud instance to be launched.
Note: Instances can be configured for different amounts of CPU, memory, and storage depending on user needs. This tutorial can be accomplished with the small instance size, medium1 (4 CPUs, 8 GB memory, 80 GB root)
Part 2: Set up a Qiime-1.9.1 run using the Terminal window
Step 1. Open the Terminal. Add the ssh details along with your IP address to connect the instance through the terminal. Remember to put your actually iPlant username in place of the text 'username' and 'IPaddress' in this next line of code:
Step 2. You will find tutorial data in "/opt/qiime_illumina_test_data" folder. List its contents with the ls command.
We'll change to the
illumina directory for the remaining steps.
Part 3: Run Qiime-1.9.1
Step 1: Check our mapping file for errors¶
The QIIME mapping file contains all of the per-sample metadata, including technical information such as primers and barcodes that were used for each sample, and information about the samples, including what body site they were taken from. In this current tutorial data set we're looking at human microbiome samples from four sites on the bodies of two individuals at mutliple time points. The metadata in this case therefore includes a subject identifier, a timepoint, and a body site for each sample. You can review the
map.tsv file to see an example of the data (or view the published Google Spreadsheet version, which is more nicely formatted).
In this step, we run
validate_mapping_file.py to ensure that our mapping file is compatible with QIIME. First we will use a perfectly normal mapping file to do test this
In this case there were no errors, but if there were we would review the resulting HTML summary to find out what errors are present. You could then fix those in a spreadsheet program or text editor and rerun
validate_mapping_file.py on the updated mapping file.
For the sake of illustrating what errors in a mapping file might look like, we've created a bad mapping file (
map-bad.tsv). We'll next call
validate_mapping_file.py on the file
Step 2: Demultiplexing and quality filtering sequences
We next need to demultiplex and quality filter our sequences (i.e. assigning barcoded reads to the samples they are derived from). In general, you should get separate fastq files for your sequence and barcode reads. Note that we pass these files while still gzipped to
split_libraries_fastq.py command which can handle gzipped or unzipped fastq files. The default strategy in QIIME for quality filtering of Illumina data is described in Bokulich et al (2013).
The above command creates 3 new files which we can view:
We can examine the results of the split libraries commands by using the less.
You can also see how many sequencing reads passed after demultiplexing and quality filtering using the
Step 3: OTU picking using an open-reference OTU picking protocol by searching reads against the Greengenes database
Now that we have demultiplexed sequences, we're ready to cluster these sequences into OTUs. The QIIME demo here uses Open-reference OTU picking (though it takes some time but this is currently QIIME's preferred method). Discussion of these methods can be found in Rideout et. al (2014). Note that this command takes the slout/
seqs.fna file that was generated in the previous step. Specifically, we set
True, which allows sequences to match the reference database if either their forward or reverse orientation matches to a reference sequence. This parameter is specified in theparameters file which is passed as
-p. You can find information on defining parameters files here.
This step can take long time to complete since we are working on a small machine. With a larger instance you can take advantage of parallelization.
The primary output that we get from this command is the OTU table, or the number of times each operational taxonomic unit (OTU) is observed in each sample. QIIME uses the Genomics Standards Consortium Biological Observation Matrix standard (BIOM) format for representing OTU tables. You can find additional information on the BIOM format here, and information on converting these files to tab-separated text that can be viewed in spreadsheet programs here. Several OTU tables are generated by this command. The one we typically want to work with is
otus/otu_table_mc2_w_tax_no_pynast_failures.biom. This has singleton OTUs (or OTUs with a total count of 1) removed, as well as OTUs whose representative (i.e., centroid) sequence couldn't be aligned with PyNAST. It also contains taxonomic assignments for each OTU as observation metadata.
The open-reference OTU picking command also produces a phylogenetic tree where the tips are the OTUs. The file containing the tree is
otus/rep_set.tre, and is the file that should be used with
otus/otu_table_mc2_w_tax_no_pynast_failures.biom in downstream phylogenetic diversity calculations. The tree is stored in the widely used newick format.
To compute some summary statistics of the OTU table we can run the biom summarize-table command.
The key piece of information you need to pull from this output is the depth of sequencing (Counts/sample) that should be used in diversity analyses (step 4). Many of the analyses that follow require that there are an equal number of sequences in each sample, so you need to review the output and decide what depth you'd like. Any samples that don't have at least that many sequences will not be included in the analyses, so this is always a trade-off between the number of sequences you throw away and the number of samples you throw away. For some perspective on this, see Kuczynski 2010.
Now we run the
core_diversity_analyses.py script which does the diversity analyses that users are generally interested in. The main output that users will interact with is the otus/
index.html file, which provides links into the different analysis results. Note that in this step we're passing
-e (depth of sampling) that should be used for diversity analyses. Based on reviewing the above output from
biom summarize-table, 1114 seems to be e value. This value will be study-specific, so don't just use this value on your own data (though it's fine to use that value for this tutorial).
Step 5: Run emperor
After this runs, you can reload the Emperor plots that you accessed from the above