Scalable metagenomic analysis using iPlant

Pre-Workshop Homework

  1. Please test your iPlant credentials to ensure you have access. You can log in from user.iplantcollaborative.org: click the button "iPlant Login"
  2. Once authenticated, click the "My Services" link
  3. Under "My Services" please ensure you have access to Discovery Environment.
  4. If you have account problems, please email support@iplantcollaborative.org
  5. Please check your email and this wiki prior to the workshop, as this is the way we will communicate any important changes.

Objective

To provide an interactive, hands-on training on iPlant's bioinformatics cyberinfrastructure in the context of exploring metagenomics.

Key Resources

Overview

Metagenomics remains one of the most computationally-intensive challenges in the modern bioinformatics space. In this session, we will start with a publicly available metagenomics data set from NCBI SRA. We will perform first-pass taxonomic classification using METAPHLAN, then create and explore an interactive visualization of the results using Krona. We'll build a draft assembly of the data using Newbler and perform computational gene discovery on the contigs using MetaGeneMark. If time permits, we will use iPlant's high-performance BLAST service to annotate the entire data set using NCBI nr, then interactively explore the results in MEGAN running in an Atmosphere virtual machine.

Metagenomics of the rice leaf and root microbial communities

  • What is metagenomics? From the excellent Wikipedia article on the topic it "...is the study of metagenomes, genetic material recovered directly from environmental samples. The broad field may also be referred to as environmental genomics, ecogenomics or community genomics. While traditional microbiology and microbial genome sequencing and genomics rely upon cultivated clonal cultures, early environmental gene sequencing cloned specific genes (often the 16S rRNA gene) to produce a profile of diversity in a natural sample. Such work revealed that the vast majority of microbial biodiversity had been missed by cultivation-based methods. Recent studies use "shotgun" Sanger sequencing or massively parallel pyrosequencing to get largely unbiased samples of all genes from all the members of the sampled communities. Because of its ability to reveal the previously hidden diversity of microscopic life, metagenomics offers a powerful lens for viewing the microbial world that has the potential to revolutionize understanding of the entire living world".
  • The question at hand is "What are the major differences between the phyllosphere (leaf) and rhizosphere (root) microbial communities in rice"? There are a few ways to get at this question, and we will explore some of them using a public data set Metaproteogenomic analysis of microbial communities in the phyllosphere and rhizosphere of rice.

The above- and below ground parts of rice plants create specific habitats for various microorganisms. While the identity and the ecological impact of microorganisms that live in association with rice roots are subject of extensive studies since many years, less is known about the microorganisms in the rice phyllosphere. In this study, we characterized the phyllosphere and rhizosphere microbiota of rice cultivars using a metaproteogenomic approach to get insight into the physiology of the bacteria and archaea.

Phylogenetic profiling: What species are present, and in what quantities, in the given sample(s)?

  • MetaPhyler-SR : De facto standard
    • Metaphyler is a software tool for inferring the taxonomic composition of a microbial community from whole-metagenome (WGS) sequencing data. Metaphyler relies on alignments to a curated database of housekeeping genes.
  • MetaPhlan 1.6.0 : Extremely fast!
    • A computational tool for profiling the composition of microbial communities from metagenomic data (WGS). MetaPhlAn relies on unique clade-specific marker genes identified from reference genomes, allowing very fast computational times, unambiguous taxonomic assignments, and species-level resolution.

What genes are present in the samples?

  • Assemble then find genes
    • Velvet, metaVelvet, Ray, SOAPdenovo, and more. Since we are using 454 data, we can assemble using Newbler
      • If you are using Illumina or similar high-volume data, please read about Titus Brown's khmer digital normalization tool
    • MetaGeneMark
      • Uses a Hidden Markov Model (HMM) based on 350+ bacterial and archaebacterial species to predict genes
  • Find gene fragments from short reads
    • FragGeneScan 1.16
      • Combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding region in short reads

What do the genes do?

  • BLASTX reads (or a subset) using NCBI refseq, nr, SEED, etc (challenging to do at scale!)
  • BLASTX predicted genes or fragments using NCBI refseq, nr, SEED, etc
    • Consolidate functional annotation using Blast2Go

Phylogenetic profiling of the rice rhizosphere using the iPlant Discovery Environment

Staged data files for each step of these analyses can be found in the Data/Community Data/uncc-wings shared folder in the Discovery Environment. This is helpful in case you fall behind or are having trouble progressing through a specific step.

Data/Community Data/uncc-wings
/iplant/home/shared/uncc-wings:
  C- /iplant/home/shared/uncc-wings/01_sra_import_SRP009087
  C- /iplant/home/shared/uncc-wings/02_fastq_dump_SRR358565
  C- /iplant/home/shared/uncc-wings/02_fastq_dump_SRR358608
  C- /iplant/home/shared/uncc-wings/03a_fasta_SRR358565_16k
  C- /iplant/home/shared/uncc-wings/03a_fasta_SRR358608_16k
  C- /iplant/home/shared/uncc-wings/03_fasta_SRR358565
  C- /iplant/home/shared/uncc-wings/03_fasta_SRR358608
  C- /iplant/home/shared/uncc-wings/04_metaphlan_phyllosphere
  C- /iplant/home/shared/uncc-wings/04_metaphlan_rhizosphere
  C- /iplant/home/shared/uncc-wings/05_krona_phyllosphere
  C- /iplant/home/shared/uncc-wings/05_krona_rhizosphere
  C- /iplant/home/shared/uncc-wings/06_newbler_phyllosphere
  C- /iplant/home/shared/uncc-wings/06_newbler_rhizosphere
  C- /iplant/home/shared/uncc-wings/07_metagenemark_phyllosphere
  C- /iplant/home/shared/uncc-wings/08_blastx_phyllo_refseq
  C- /iplant/home/shared/uncc-wings/09_bunzip_blastx
  C- /iplant/home/shared/uncc-wings/10_concat_blastx
  1. Use the NCBI SRA Import tool to bring in the SRA files from SRP009087
    1. Paste this URL ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP009/SRP009087 into the text box and launch the job. It should take just a few minutes to complete the transfer.
  2. You now have a pair of .sra files, one for each SRA-defined sample. In this case, 1 sample = 1 SRA file
    1. Convert SRA-Import/SRP009087/SRR358565/SRR358565.sra to FASTQ using NCBI SRA Toolkit fastq-dump 2.19
    2. Make sure to un-check "SRA files contains paired-end read data" as this is 454 data and is by nature, single-ended
  3. Many of the tools we want to work with require data in Fasta format
    1. Use FASTX Fastq-to-Fasta to do this conversion
    2. Make sure to specify a name for your new FASTA file or it will be, quite boringly, named output.fa
    3. I recommend naming the output from SRR358565 'Rhizosphere.fa'
  4. Now, it's time to profile the microbial composition of this sample!
    1. Launch MetaPhlan 1.6.1 and select 'Rhizosphere.fa' as the input file.
    2. It should not take long to run, as we make use of TACC's Lonestar supercomputer
    3. The results of MetaPhlan will include an XML file, which is a taxonomic report in PhyloXML; a *.krona.txt file which contains data formatted for the Krona viz tool, and a .metaphlan.txt file, which is the native format of MetaPhlan
  5. Optional: Rename the file metaphlan.tax_report.krona.txt to Rhizosphere.krona.txt
  6. Open KronaTools ktImportText 2.3 and launch it using the .krona.txt file from your previous analysis as input.
    1. Once it completes, find the text.krona.html file in your analyses result folder and click on it to open it in a new window. Have fun exploring your first taxonomic profile!
  7. On your own, go back and run the same analyses on the rice Phyllosphere data set, so that you have two files: Rhizosphere.krona.txt and Phyllosphere.krona.txt. For now, skip ahead to Assembling and annotating the rice phyllosphere

Phylogenetic profiling questions

  1. What are the most striking differences between the Krona taxonomic profiles of the phyllosphere and rhizosphere?
  2. Using trusty Google (unless you're a microbiologist, in which case I expect you to know this cold), look up the life habits of some of the microbial classes that differ significantly between the two samples. Are these differences consistent with your expectations, or are you surprised by some of them? Why?
  3. Can you figure out how to make single Krona plot that allows you to toggle dynamically between data sets?
  4. Advanced: Can you identify any interesting differences between our rice phyllosphere sample and a soybean phyllosphere data set?

Assembling and annotating the rice phyllosphere

  1. Open an instance of the Newbler 2.6 app, and select your Phyllosphere.fa file from before as its input. Note that Newbler can accept multiple files, which is useful if you want to combine replicates or samples.
    1. Select 8h as an estimated run time. This app will run on one of the 16x 1TB RAM systems attached to the Lonestar supercomputer.
    2. Please note that not everyone's assembly will complete during the workshop, so it's OK to use the pre-staged data in 06_newbler_phyllosphere for the next step
    3. Look inside the NewblerOutput subdirectory in the job's output folder. The file containing your assembly, in Fasta form, is named 454LargeContigs.fna. The other files contain scaffolding, diagnostic, quality, and other useful information.
  2. Now, we will find ORFs in the assembled contigs using MetaGeneMark 1.0.0
    1. Select 454LargeContigs.fna as your input, and launch the job. There are no special parameters to set
    2. MetaGeneMark 1.01 is another HPC application
    3. Congratulations! You've assembled a metagenome, identified the predicted genes using Hidden Markov Models, and now it's time to extract some biological knowledge.

Visualize your assembled genome(s) and annotation using IGB

  1. First, we must prepare the assembly and predicted gene files for use in Integrated Genome Browser
    1. Open the tool 'MetaGeneMark GFF to IGB'
      1. Select the file 454LargeContigs.fna from your phyllosphere Newbler assembly
      2. Select the file metagenemark-predicted.gff from your phyllosphere MetaGeneMark job
      3. Launch the job and let it run. It should take just a couple of minutes!
  2. Download the files 454LargeContigs.fna.2bit, metagenemark-predicted.gff.bed.gz, and metagenemark-predicted.gff.bed.gz.tbi to a single directory on your local system
    1. Launch IGB 7.0.2
    2. Under the File menu, select 'Open Reference Sequence' and choose the 454LargeContigs.fna.2bit file
    3. Now, under File, select 'Open File' and choose metagenemark-predicted.gff.bed.gz
    4. On the far right, sort the list of ~25,000 sequences by length. Select one of the larger ones, then click the 'Load Data' button in IGB
    5. Congratulations, you have assembled and visualized annotation of a metagenome!

Visualization questions

  1. The file metagenemark-predicted.gff.bed.gz is in gzipped BED12 format, which means that it contains two extra fields: the gene name field (13) and a description field (14) which is used for described text, such as curator notes or whatever else you want.
  2. If you had BLAST or GO matches for every predicted gene (see below), could you update the MetaGeneMark GFF to IGB program to include text annotations of all the gene models in the metagenemark-predicted.gff.bed.gz file?

Here is the source code for the converter script for your reference

#!/bin/sh
faToTwoBit 454LargeContigs.fa 454LargeContigs.2bit
metagenemarkGffToBedDetail.py metagenemark-predicted.gff | sort -k1,1 -k2,2n | bgzip > metagenemark-predicted.bed.gz
tabix -s 1 -b 2 -e 3 metagenemark-predicted.bed.gz

Optional: MEGAN4 for Metagenomics

The MEGAN tool is a very powerful tool for exploring metagenomics data sets. We provide MEGAN4 on an iPlant Atmosphere VM called 'Entangled Genomes' but to simplify things, I assume you will download and install it on your local system. The following workflow is very abbeviated, as it assumes familiarity with the iPlant DE at this point.

  1. Download and install MEGAN from http://ab.inf.uni-tuebingen.de/software/megan/.
    1. You will need to obtain a free academic license to use it from the download page
    2. MEGAN requires that you have Java JRE 1.6 or better installed
  2. Download seq.0.blastx from uncc-wings/09_bunzip_blastx/blast_out to your local drive
  3. Select File...Import from BLAST, then navigate to seq.0.blastx, select it, and load it
    1. MEGAN will present you with a default taxonomic distribution tree
    2. Explore the various display options.The image gallery below will give you some inspiration about the types of information you can glean from MEGAN.

MEGAN questions

  1. As we have configured it, MEGAN is trying to make taxonomic inference based on contig assemblies rather than reads. Do you believe it's doing a good job?
  2. Does the taxonomic distribution you get using MEGAN agree broadly with the one you derived for the phyllosphere using MetaPhlan + Krona?
  3. Assuming you belive the MEGAN results a little bit, what is the dominant shape and temperature range for bacteria found in the phyllosphere. What is the dominant genus?

On Your Own: Comparing Phyllosphere and Rhizosphere samples in MEGAN

  1. Use the Split FASTA file app to split your Rhizosphere.fa and Phyllosphere.fa sequence read files into chunks of 16,000 sequences. You will use ONLY the first chunk of 16,000 from each split operation for downstream analysis. I suggest calling them Rhizosphere_16k.fa and Phyllosphere_16k.fa
  2. Launch BLASTX+ 2.2.26 (HPC) jobs on each 16k file, estimating that you have 16000 sequences when asked
  3. Uncompress and join the resulting BLASTX files as you did before. You should aim to end up with two files: Rhizosphere_16k.blastx and Phyllosphere_16k.blastx
  4. Load BOTH sample .blastx files into MEGAN and note that you can now initiate comparative analyses. Have fun!

MEGAN Background