This space is home to learning materials and tutorials created for CyVerse products and services. To search the entire CyVerse wiki, use the box at the upper right.






Skip to end of metadata
Go to start of metadata

Mothur-1.40.0 on Atmosphere (Images Tutorial)

Rationale and background:

Mothur project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community.
Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. (2013): Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology. 79(17):5112-20.


This tutorial follows the basic Schloss Lab MiSeq Standard Operating Procedures Tutorial. Follow along there for more details about each command. For help with any command, use the mothur website or type the command with 'help' e.g. make.contigs(help). This tutorial only uses one sample and does not make use of 'groups', see the SOP for more details on running multiple samples simultaneously.



Starting out we need to first determine, what is our question? The Schloss lab is interested in understanding the effect of normal variation in the gut microbiome on host health. To that end we collected fresh feces from mice on a daily basis for 365 days post weaning (we're accepting applications). During the first 150 days post weaning (dpw), nothing was done to our mice except allow them to eat, get fat, and be merry. We were curious whether the rapid change in weight observed during the first 10 dpw affected the stability microbiome compared to the microbiome observed between days 140 and 150. We will address this question in this tutorial using a combination of OTU, phylotype, and phylogenetic methods. To make this tutorial easier to execute, we are providing only part of the data - you are given the flow files for one animal at 10 time points (5 early and 5 late). In addition, to sequencing samples from mice fecal material, we resequenced a mock community composed of genomic DNA from 21 bacterial strains. We will use the 10 fecal samples to look at how to analyze microbial communities and the mock community to measure the error rate and its effect on other analyses.

Part 1: Connect to an instance of an Atmosphere Image (virtual machine)

Step 1. Go to and log in with your CyVerse credentials.

note: click images to enlarge

Step 2. Create a project tab and name the project name as Mothur and description as Prediction-based taxonomic classification

note: click images to enlarge

Step 3. Click the Mothur project and then click Instance. Select the image Mothur and click Launch Instance. It will take ~10-15 minutes for the cloud instance to be launched. 


note: click images to enlarge

Note: Instances can be configured for different amounts of CPU, memory, and storage depending on user needs.  This tutorial can be accomplished with the small instance size, medium1 (4 CPUs, 8 GB memory, 80 GB root)

Step 4. Once the VM is ready. Click the VM which will take onto next screen where you can open Web Shell

note: click images to enlarge

or ssh into the ip address on a terminal like below.


Part 2: Set up a Mothur run using the Web Shell

Get oriented.
You will find staged example data in folder "MiSeq_SOP_sub/" within the folder /opt".  Copy that folder onto your home directory

Reducing sequencing and PCR errors

The first thing we want to do is combine our two sets of reads for each sample and then to combine the data from all of the samples. This is done using the make.contigs command, which requires stability.files as input. This command will extract the sequence and quality score data from your fastq files, create the reverse complement of the reverse read and then join the reads into contigs. We have a very simple algorithm to do this. First, we align the pairs of sequences. Next, we look across the alignment and identify any positions where the two reads disagree. If one sequence has a base and the other has a gap, the quality score of the base must be over 25 to be considered real. If both sequences have a base at that position, then we require one of the bases to have a quality score 6 or more points better than the other. If it is less than 6 points better, then we set the consensus base to an N. 

This command will also produce several files that you will need down the road: stability.trim.contigs.fasta and stability.contigs.groups. These contain the sequence data and group identity for each sequence. The file will tell you something about the contig assembly for each read.

this outputs several files

  stability.trim.contigs.fasta   # trimmed contigs in fasta format
  stability.trim.contigs.qual    # base by base quality calls on contigs       # combined sequence report
  stability.scrap.contigs.fasta  # contigs not passing quality controls
  stability.scrap.contigs.qual   # quality calls on bad contigs
  stability.contigs.groups       # group membership of each new contig

Let's see what these sequences look like using the 

This tells us that we have 7793 sequences that for the most part vary between 249 and 253 bases. Interestingly, the longest read in the dataset is 502 bp. Be suspicious of this. Recall that the reads are supposed to be 251 bp each. This read clearly didn't assemble well (or at all). Also, note that at least 2.5% of our sequences had some ambiguous base calls. We'll take care of these issues in the next step when we run screen.seqs.

 This implementation of the command will remove any sequences with ambiguous bases and anything longer than 275 bp. There's another way to run this using the output from summary.seqs:


This may be faster because the summary.seqs output file already has the number of ambiguous bases and sequence length calculated for your sequences. Also, mothur is smart enough to remember that we used 4 processors in make.contigs and so it will use that throughout your current session. To see what else mothur knows about you, run the following:


Processing improved sequences

We anticipate that many of our sequences are duplicates of each other. Because it's computationally wasteful to align the same thing a bazillion times, we'll unique our sequences using the unique.seqs command:

 If two sequences have the same identical sequence, then they're considered duplicates and will get merged. In the screen output there are two columns - the first is the number of sequences characterized and the second is the number of unique sequences remaining. So after running unique.seqs we have gone from 7793 to 6638 sequences. This will make our life much easier. Another thing to do to make our lives easier is to simplify the names and group files. If you look at the most recent versions of those files you'll see together they are 13 MB. This may not seem like much, but with a full MiSeq run those long sequence names can add up and make life tedious. So we'll run count.seqs to generate a table where the rows are the names of the unique sequences and the columns are the names of the groups. The table is then filled with the number of times each unique sequence shows up in each group.

 This will generate a file called stability.trim.contigs.good.count_table. In subsequent commands we'll use it by using the count option:


Cool, right? Now we need to align our sequences to the reference alignment. Again we can make our lives a bit easier by making a database customized to our region of interest using the pcr.seqscommand. To run this command you need to have the reference database (silva.bacteria.fasta) and know where in that alignment your sequences start and end. To remove the leading and trailing dots we will set keepdots to false. You could also run this command using your primers of interest.:

Let's rename it to something more useful using the rename.file command:


Let's take a look at what we've made:

Now we have a customized reference alignment to align our sequences to. The nice thing about this reference is that instead of being 50,000 columns wide, it is now 13,425 columns wide which will save our hard drive some space and should improve the overall alignment quality. We'll do the alignment with align.seqs:


This should be done in a manner of seconds and we can run summary.seqs again:

So what does this mean? You'll see that the bulk of the sequences start at position 1968 and end at position 11550. Some sequences start at position 1967 and end at 11549. These deviants from the mode positions are likely due to an insertion or deletion at the terminal ends of the alignments. Sometimes you'll see sequences that start and end at the same position indicating a very poor alignment, which is generally due to non-specific amplification. To make sure that everything overlaps the same region we'll re-run screen.seqs to get sequences that start at or before position 1968 and end at or after position 11550. We'll also set the maximum homopolymer length to 8 since there's nothing in the database with a stretch of 9 or more of the same base in a row (this really could have been done in the first execution of screen.seqs above). Note that we need the count table so that we can update the table for the sequences we're removing and we're also using the summary file so we don't have to figure out again all the start and stop positions:

Now we know our sequences overlap the same alignment coordinates, we want to make sure they only overlap that region. So we'll filter the sequences to remove the overhangs at both ends. Since we've done paired-end sequencing, this shouldn't be much of an issue, but whatever. In addition, there are many columns in the alignment that only contain gap characters (i.e. "-"). These can be pulled out without losing any information. We'll do all this with filter.seqs:


At the end of running the command we get the following information:

This means that our initial alignment was 13425 columns wide and that we were able to remove 13128 terminal gap characters using trump=. and vertical gap characters using vertical=T. The final alignment length is 297 columns. Because we've perhaps created some redundancy across our sequences by trimming the ends, we can re-run unique.seqs:

The next thing we want to do to further de-noise our sequences is to pre-cluster the sequences using the pre.cluster command allowing for up to 2 differences between sequences. This command will split the sequences by group and then sort them by abundance and go from most abundant to least and identify sequences that are within 2 nt of each other. If they are then they get merged. We generally favor allowing 1 difference for every 100 bp of sequence:

We now have 564 unique sequences. At this point we have removed as much sequencing error as we can and it is time to turn our attention to removing chimeras. We'll do this using the VSEARCH algorithm that is called within mothur using the chimera.vsearch command. Again, this command will split the data by sample and check for chimeras. Our preferred way of doing this is to use the abundant sequences as our reference. In addition, if a sequence is flagged as chimeric in one sample, the default (dereplicate=F) is to remove it from all samples. Our experience suggests that this is a bit aggressive since we've seen rare sequences get flagged as chimeric when they're the most abundant sequence in another sample. This is how we do it:

Running chimera.vsearch with the count file will remove the chimeric sequences from the count file. But you still need to remove those sequences from the fasta file. We do this using remove.seqs:

Running summary.seqs we see what we're left with:


Note that we went from 7793 to 6628 sequences; this is a reasonable number of sequences to be flagged as chimeric. As a final quality control step, we need to see if there are any "undesirables" in our dataset. Sometimes when we pick a primer set they will amplify other stuff that gets to this point in the pipeline such as 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondria. There's also just the random stuff that we want to get rid of. Now you may say, "But wait I want that stuff". Fine. But, the primers we use, are only supposed to amplify members of the Bacteria and if they're hitting Eukaryota or Archaea, then its a mistake. Also, realize that chloroplasts and mitochondria have no functional role in a microbial community. But I digress. Let's go ahead and classify those sequences using the Bayesian classifier with the classify.seqs command:


Now that everything is classified we want to remove our undesirables. We do this with the remove.lineage command:


Also of note is that "unknown" only pops up as a classification if the classifier cannot classify your sequence to one of the domains. If you run summary.seqs you'll see that we now have 2281 unique sequences and a total of 118150 total sequences. This means about 350 of our sequences were in these various groups. Now, to create an updated taxonomy summary file that reflects these removals we use the command:

 This creates a file with the undesirables removed. At this point we have curated our data as far as possible and we're ready to see what our error rate is.

Assessing error rates

Measuring the error rate of your sequences is something you can only do if you have co-sequenced a mock community. This is something we include for every 95 samples we sequence. You should too because it will help you gauge your error rates and allow you to see how well your curation is going and whether something is wrong with your sequencing set up. First we want to pull the sequences out that were from our "Mock" sample using the get.groups command:

 This tells us that we had 289 unique sequences and a total of 6196 total sequences in our F3D0 sample. We can now cluster the sequences into OTUs to see how many spurious OTUs we have:

This string of commands will produce a file for you called Open it. You'll see that for 4061 sequences, we'd have 64 OTUs from the Mock community. This number of course includes some stealthy chimeras that escaped our detection methods. If we used 3000 sequences, we would have about 31 OTUs. In a perfect world with no chimeras and no sequencing errors, we'd have 20 OTUs. This is not a perfect world. But this is pretty darn good!

Preparing for analysis

You can follow the rest of the tutorial from here -




  • No labels