MAKER-P Genome Annotation using Atmosphere

For an updated MAKER-P genome annotation tutorial, please use the latest version: MAKER-P_2.31.9 Atmosphere Tutorial.

Rationale and background:

MAKER-P is a flexible and scalable genome annotation pipeline that automates the many steps necessary for the detection of protein coding genes. MAKER-P identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality indices. MAKER-P was developed by the Yandell Lab. Its predecessor, MAKER, is described in several publications (Cantarel et al. 2008; Holt & Yandell 2011). Additional background is available at the MAKER Tutorial at GMOD and is highly recommended reading. MAKER-P v2.28 is currently available as an Atmosphere image and is MPI-enabled for parallel processing.

This tutorial takes users through the following steps:

Launching the MAKER-P Atmosphere image
Uploading data from the iPlant Data Store (Optional)
Running MAKER-P on an example small genome assembly
Visualizing MAKER-P output using the Integrative Genomics Viewer (IGV)

Note: Parts of this tutorial require the stand-alone VNC Viewer. Please download VNC Viewer (for Windows, if you are unsure which version to download, select the 32-bit .exe file).

Part 1: Connect to an instance of an Atmosphere Image (virtual machine)

Step 1. Go to https://atmo.iplantcollaborative.org and log in with your iPlant credentials.

Step 2. Click on the Launch New Instance button and search for MAKER-P_2.28

Step 3. Select the image MAKER-P_2.28 (emi-F13821D0) and click Launch Instance. It will take 10-15 minutes for the cloud instance to be launched. You will be notified by email when the image is ready.

Note: Instances can be configured for different amounts of CPU, memory, and storage depending on user needs. This tutorial can be completed with the small instance size, m1.small (default). For real genome annotation, larger instances will be required.

Step 4. Launch the standalone VNC Viewer application on your desktop computer and enter YOUR VM's IP address followed by :1

Once you have connected, you will see your virtual Desktop, a Linux server running in the Atmosphere cloud computing platform. You will interact with it the same way as you would a physical computer.

Part 2: Set up a MAKER-P run using the Terminal window

Step 1. On the desktop click Terminal. A terminal window will open giving a command-line interface.

Step 2. Get oriented. You will find staged example data in folder "MAKER-P_example_data" within the folder "Desktop". List its contents with the ls command:

$ ls Desktop/MAKER-P_example_data/
README           mRNA.fasta        maker_exe.ctl    msu-irgsp-proteins.fasta   test_genome.fasta
example_output   maker_bopts.ctl   maker_opts.ctl   plant_repeats.fasta

The fasta files include a scaled-down genome (test_genome.fasta) consisting of the first 100 kb of each rice chromosome. The remaining fasta files provide evidence for annotation: mRNA sequences from NCBI, publicly available annotated protein sequences of rice (MSU7.0 and IRGSP1.0), and a collection of plant repeats. The maker_*.ctl files are configuration files for MAKER, which can optionally be used for this exercise or generated as described later.

Executables for running MAKER-P are located in /opt/maker/bin and /opt/maker/exe:

$ ls /opt/maker/bin
cegma2zff       fasta_merge        iprscan2gff3    maker2jbrowse           maker_functional_gff  map_gff_ids
chado2gff3      fasta_tool         iprscan_wrap    maker2wap               maker_map_ids         mpi_evaluator
compare         genemark_gtf2gff3  maker           maker2zff               map2assembly          mpi_iprscan
cufflinks2gff3  gff3_merge         maker2chado     maker_functional        map_data_ids          tophat2gff3
evaluator       ipr_update_gff     maker2eval_gtf  maker_functional_fasta  map_fasta_ids

$ ls /opt/maker/exe/
RepeatMasker  augustus  blast  exonerate  mpich2-1.5  mpich2.tar.gz  snap

As the names suggest, the /opt/maker/bin directory includes many useful auxiliary scripts. For example, cufflinks2gff3 will convert output from an RNA-seq analysis into a GFF3 file that can be used as input evidence for MAKER-P. Both Cufflinks and cufflinks2gff3 are available as tools in the iPlant Discovery Environment (DE). Other auxiliary scripts now available in the DE include tophat2gff3, maker2jbrowse, and maker2zff.

RepeatMasker, augustus, blast, exonerate, and snap are programs that MAKER-P uses in its pipeline. We recommend reading the MAKER Tutorial at GMOD for more information about these.

Step 3. Set up your environment. Note: this step is not essential for the purpose of this tutorial. If you skip it, you will need to specify the complete path when running the maker command (/opt/maker/bin/maker).

$ export PATH=/opt/maker/bin/:$PATH
$ export ZOE=/opt/maker/exe/snap
$ export AUGUSTUS_CONFIG_PATH=/opt/maker/exe/augustus/config
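
Because Step 3 is optional, a script cannot assume that maker is on the PATH. The following is a minimal sketch of one way to handle both cases, assuming the install location /opt/maker/bin/maker given in this tutorial. The function name maker_cmd is hypothetical and is not part of MAKER-P.

```shell
#!/bin/sh
# maker_cmd: print the command to use for MAKER-P.
# Prefers a "maker" found on PATH; otherwise falls back to the full path
# given in this tutorial. (The function name is illustrative only.)
maker_cmd() {
    if command -v maker >/dev/null 2>&1; then
        command -v maker
    else
        echo /opt/maker/bin/maker
    fi
}

MAKER=$(maker_cmd)
echo "Using MAKER-P executable: $MAKER"
```

Later commands in a script can then call "$MAKER" instead of a hard-coded name.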

Step 4. Set up a MAKER-P run. Create a working directory called "maker_run" using the mkdir command and use cd to move into that directory:

$ mkdir maker_run
$ cd maker_run

Step 5. Create another directory for the data called "test_data" using mkdir and copy the staged fasta data into it using cp. Verify using the ls command:

$ mkdir test_data
$ cp ~/Desktop/MAKER-P_example_data/*.fasta test_data/.
$ ls
test_data
$ ls test_data/
mRNA.fasta  msu-irgsp-proteins.fasta  plant_repeats.fasta  test_genome.fasta
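
Beyond listing the files, it can be useful to confirm that each one actually contains sequences. The sketch below counts FASTA header lines (lines beginning with ">"). The function name count_fasta_records and the small demonstration file are illustrative and are not part of the staged data.

```shell
#!/bin/sh
# count_fasta_records FILE: print the number of sequences in a FASTA file
# by counting header lines, which begin with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Demonstration on a tiny stand-in file (not the tutorial data).
demo_fasta=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCC\nTTAA\n' > "$demo_fasta"
echo "records: $(count_fasta_records "$demo_fasta")"
```

Applied to test_data/test_genome.fasta, the count should equal the number of sequences in the scaled-down genome, one per rice chromosome.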

Step 6. Run the maker command with the --help flag to get a usage statement and list of options:

$ maker --help

MAKER version 2.28

Usage:

     maker [options] <maker_opts> <maker_bopts> <maker_exe>


Description:

     MAKER is a program that produces gene annotations in GFF3 format using
     evidence such as EST alignments and protein homology. MAKER can be used to
     produce gene annotations for new genomes as well as update annotations
     from existing genome databases.

     The three input arguments are control files that specify how MAKER should
     behave. All options for MAKER should be set in the control files, but a
     few can also be set on the command line. Command line options provide a
     convenient machanism to override commonly altered control file values.
     MAKER will automatically search for the control files in the current
     working directory if they are not specified on the command line.

     Input files listed in the control options files must be in fasta format
     unless otherwise specified. Please see MAKER documentation to learn more
     about control file  configuration.  MAKER will automatically try and
     locate the user control files in the current working directory if these
     arguments are not supplied when initializing MAKER.

     It is important to note that MAKER does not try and recalculated data that
     it has already calculated.  For example, if you run an analysis twice on
     the same dataset you will notice that MAKER does not rerun any of the
     BLAST analyses, but instead uses the blast analyses stored from the
     previous run. To force MAKER to rerun all analyses, use the -f flag.

     MAKER also supports parallelization via MPI on computer clusters. Just
     launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
     configured during the MAKER installation process for this to work though


Options:

     -genome|g <file>    Overrides the genome file path in the control files

     -RM_off|R           Turns all repeat masking options off.

     -datastore/         Forcably turn on/off MAKER's two deep directory
      nodatastore        structure for output.  Always on by default.

     -old_struct         Use the old directory styles (MAKER 2.26 and lower)

     -base    <string>   Set the base name MAKER uses to save output files.
                         MAKER uses the input genome file name by default.

     -tries|t <integer>  Run contigs up to the specified number of tries.

     -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
                         Note: this is for BLAST and not for MPI!

     -force|f            Forces MAKER to delete old files before running again.
                         This will require all blast analyses to be rerun.

     -again|a            recaculate all annotations and output files even if no
                         settings have changed. Does not delete old analyses.

     -quiet|q            Regular quiet. Only a handlful of status messages.

     -qq                 Even more quit. There are no status messages.

     -dsindex            Quickly generate datastore index file. Note that this
                         will not check if run settings have changed on contigs

     -nolock             Turn off file locks. May be usful on some file systems,
                         but can cause race conditions if running in parallel.

     -TMP                Specify temporary directory to use.

     -CTL                Generate empty control files in the current directory.

     -OPTS               Generates just the maker_opts.ctl file.

     -BOPTS              Generates just the maker_bopts.ctl file.

     -EXE                Generates just the maker_exe.ctl file.

     -MWAS    <option>   Easy way to control mwas_server for web-based GUI

                              options:  STOP
                                        START
                                        RESTART

     -version            Prints the MAKER version.

     -help|?             Prints this usage statement.

Step 7. Create control files that tell MAKER-P what to do. Three files are required:

maker_opts.ctl - gives location of input files (genome and evidence) and sets options that affect MAKER-P behavior.
maker_exe.ctl - gives path information for the underlying executables.
maker_bopts.ctl - sets parameters for filtering BLAST and Exonerate alignment results.

To create these files run the maker command with the -CTL flag. Verify with ls:

$ maker -CTL
$ ls
maker_bopts.ctl  maker_exe.ctl  maker_opts.ctl  test_data

The maker_exe.ctl file is automatically generated with the correct paths to executables and does not need to be modified. The maker_bopts.ctl file is automatically generated with reasonable default parameters and also does not need to be modified unless you want to experiment with optimizing these parameters.

The automatically generated maker_opts.ctl file does need to be modified in order to specify the genome file and evidence files to be used as input. Several text editors are available, including emacs, nano, and gedit. If you have not tried any of these, gedit is probably the easiest to learn as it has a familiar graphical user interface. It can be started by typing gedit & on the command line or by selecting it from the menu on the VNC desktop (Applications=>Accessories=>Text Editor).

Note: If pressed for time, a pre-edited version of the maker_opts.ctl file is staged in ~/Desktop/MAKER-P_example_data/. Delete the current file and copy the staged version here, then skip to Step 8.
$ rm maker_opts.ctl
$ cp ~/Desktop/MAKER-P_example_data/maker_opts.ctl .

$ gedit &

Here are the sections of the maker_opts.ctl file you need to edit. Add path information to files as shown. Do not put spaces around the equal sign or within a value. You can specify a complete path or a relative path, as shown here. In general, you can specify multiple files by separating them with commas, without spaces.

This section pertains to specifying the genome assembly to be annotated and setting organism type:

#-----Genome (these are always required)
genome=./test_data/test_genome.fasta #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

The following section pertains to EST and other mRNA expression evidence. Here we are only using same-species data, but one could specify data from a related species using the altest parameter. With RNA-seq data aligned to your genome by Cufflinks or TopHat, one could use the MAKER auxiliary scripts (cufflinks2gff3 and tophat2gff3) to generate GFF3 files and specify these using the est_gff parameter:

#-----EST Evidence (for best results provide a file for at least one)
est=./test_data/mRNA.fasta #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

The following section pertains to protein sequence evidence. Here we are using previously annotated protein sequences. Another option would be to use SwissProt or other database:

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=./test_data/msu-irgsp-proteins.fasta  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

This next section pertains to repeat identification:

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib=./test_data/plant_repeats.fasta #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

Various programs for ab initio gene prediction can be specified in the next section. Here we are using SNAP set to use an HMM trained on rice. Specifying the entire path to the hmm file would not be necessary if the ZOE environment variable is set as described in step 3 above.

#-----Gene Prediction
snaphmm=/opt/maker/exe/snap/HMM/O.sativa.hmm #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
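
The edits above can also be made without a text editor, which helps when the same run is set up repeatedly. The following is a minimal sketch using sed. The helper name set_ctl_opt is hypothetical, and the demonstration uses a three-line stand-in file rather than a real maker_opts.ctl. It assumes each option sits on its own line as key=value followed by a # comment, as in the excerpts above.

```shell
#!/bin/sh
# set_ctl_opt KEY VALUE FILE
# Rewrite the "KEY=..." line of a MAKER control file, keeping the trailing
# "#comment". Assumes VALUE contains no "|", "&" or "\" characters, which
# are special to the sed expression used here. A backup is left in FILE.bak.
set_ctl_opt() {
    sed -i.bak "s|^$1=[^#]*|$1=$2 |" "$3"
}

# Demonstration on a stand-in file with three lines in control-file format.
demo_ctl=$(mktemp)
cat > "$demo_ctl" <<'EOF'
genome= #genome sequence
est= #set of ESTs
est_gff= #aligned ESTs
EOF

set_ctl_opt genome ./test_data/test_genome.fasta "$demo_ctl"
set_ctl_opt est ./test_data/mRNA.fasta "$demo_ctl"
cat "$demo_ctl"
```

Because the pattern is anchored on the full key followed by the equal sign, setting est leaves est_gff untouched.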

Step 8. Run MAKER-P !!!

Make sure you are in the maker_run directory and all of your files are in place. Perform these steps to check:

$ pwd
/home/steinj/maker_run
$ ls
maker_bopts.ctl  maker_exe.ctl  maker_opts.ctl  stderr  test_data
$ ls test_data/
mRNA.fasta  msu-irgsp-proteins.fasta  plant_repeats.fasta  test_genome.fasta
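
The same check can be scripted. The sketch below reports any of the three control files that are missing from a directory. The function name missing_inputs is illustrative, and the demonstration runs in a scratch directory rather than in maker_run.

```shell
#!/bin/sh
# missing_inputs DIR: print the name of each expected control file that is
# absent from DIR. Prints nothing when all three are present.
missing_inputs() {
    for f in maker_opts.ctl maker_bopts.ctl maker_exe.ctl; do
        [ -f "$1/$f" ] || echo "$f"
    done
}

# Demonstration in a scratch directory that lacks maker_exe.ctl.
demo_dir=$(mktemp -d)
touch "$demo_dir/maker_opts.ctl" "$demo_dir/maker_bopts.ctl"
missing_inputs "$demo_dir"
```

Running missing_inputs . from inside maker_run should print nothing if Step 7 completed normally.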

Starting the MAKER-P run is as simple as entering the command maker. If you did not set the PATH environmental variable in Step 3 then you will need to type the entire path: /opt/maker/bin/maker. Because your maker control files are in the present directory you do not have to explicitly specify these in the command; they will be found automatically. MAKER-P automatically outputs thousands of lines of STDERR, reporting on its progress and producing warnings and errors if they arise. It is a good practice to redirect this output to a log file so you have a record of it, especially if something goes wrong. Since we are in the bash shell you can redirect the output by typing the command as follows: maker 2> log_file &. Another useful practice is to employ the unix time command to report statistics on how long the run took to complete. The output of time will appear at the end of the captured standard error file. Putting all of this together we type the following command to start MAKER-P:

(time /opt/maker/bin/maker) 2> log_file &

This part of the tutorial usually takes about 30 minutes to complete. You will know MAKER-P is finished when the log_file announces "Maker is now finished!!!". Use the tail command to look at the last 10 lines of the log_file:

$ tail log_file
processing chunk output
processing contig output


Maker is now finished!!!


real    31m11.250s
user    36m52.210s
sys     1m34.106s
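
Since the run is in the background, the completion message can also be tested for from a script. This is a minimal sketch that searches the log for the message quoted above. The function name maker_finished is hypothetical, and the demonstration uses a stand-in log rather than real MAKER-P output.

```shell
#!/bin/sh
# maker_finished LOG: succeed (exit status 0) if the log contains the
# completion message quoted in this tutorial.
maker_finished() {
    grep -q 'Maker is now finished!!!' "$1"
}

# Demonstration on a stand-in log file.
demo_log=$(mktemp)
printf 'processing contig output\n\nMaker is now finished!!!\n' > "$demo_log"
if maker_finished "$demo_log"; then
    echo "run complete"
else
    echo "still running (or stopped early)"
fi
```

A missing message does not distinguish a run that is still in progress from one that stopped with an error, so the log itself should still be inspected.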


Step 9. Examining MAKER-P output

Output data appears in a new directory called test_genome.maker.output. Move to that directory and examine its contents:

$ cd test_genome.maker.output/
$ ls -l
total 32
-rw-r--r--  1 steinj iplant-everyone 1413 Dec  3 11:20 maker_bopts.log
-rw-r--r--  1 steinj iplant-everyone 1164 Dec  3 11:20 maker_exe.log
-rw-r--r--  1 steinj iplant-everyone 4593 Dec  3 11:20 maker_opts.log
drwxr-xr-x  5 steinj iplant-everyone 4096 Dec  3 11:20 mpi_blastdb
-rw-r--r--  1 steinj iplant-everyone    0 Dec  3 11:20 seen.dbm
drwxr-xr-x 14 steinj iplant-everyone 4096 Dec  3 11:50 test_genome_datastore
-rw-r--r--  1 steinj iplant-everyone 2048 Dec  3 11:20 test_genome.db
-rw-r--r--  1 steinj iplant-everyone 1152 Dec  3 11:51 test_genome_master_datastore_index.log


The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER.
The mpi_blastdb directory contains FASTA indexes and BLAST database files created from the input EST, protein, and repeat databases.
test_genome_master_datastore_index.log contains information on both the run status of individual contigs and information on where individual contig data is stored.
The test_genome_datastore directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.

Check the test_genome_master_datastore_index.log to see if there were any failures:

$ less test_genome_master_datastore_index.log

Chr1    test_genome_datastore/41/30/Chr1/       STARTED
Chr1    test_genome_datastore/41/30/Chr1/       FINISHED
Chr10   test_genome_datastore/7C/72/Chr10/      STARTED
Chr10   test_genome_datastore/7C/72/Chr10/      FINISHED
Chr11   test_genome_datastore/1E/AA/Chr11/      STARTED
Chr11   test_genome_datastore/1E/AA/Chr11/      FINISHED
Chr12   test_genome_datastore/1B/FA/Chr12/      STARTED
Chr12   test_genome_datastore/1B/FA/Chr12/      FINISHED
Chr2    test_genome_datastore/E9/36/Chr2/       STARTED
Chr2    test_genome_datastore/E9/36/Chr2/       FINISHED
Chr3    test_genome_datastore/CC/EF/Chr3/       STARTED
Chr3    test_genome_datastore/CC/EF/Chr3/       FINISHED
Chr4    test_genome_datastore/A3/11/Chr4/       STARTED
Chr4    test_genome_datastore/A3/11/Chr4/       FINISHED
Chr5    test_genome_datastore/8A/9B/Chr5/       STARTED
Chr5    test_genome_datastore/8A/9B/Chr5/       FINISHED
Chr6    test_genome_datastore/13/44/Chr6/       STARTED
Chr6    test_genome_datastore/13/44/Chr6/       FINISHED
Chr7    test_genome_datastore/91/B7/Chr7/       STARTED
Chr7    test_genome_datastore/91/B7/Chr7/       FINISHED
Chr8    test_genome_datastore/9A/9E/Chr8/       STARTED
Chr8    test_genome_datastore/9A/9E/Chr8/       FINISHED
Chr9    test_genome_datastore/87/90/Chr9/       STARTED
Chr9    test_genome_datastore/87/90/Chr9/       FINISHED


All completed. Other possible status entries include:

FAILED - indicates a failed run on this contig, MAKER will retry these
RETRY - indicates that MAKER is retrying a contig that failed
SKIPPED_SMALL - indicates the contig was too short to annotate (minimum contig length is specified in maker_opts.ctl)
DIED_SKIPPED_PERMANENT - indicates a failed contig that MAKER will not attempt to retry (number of times to retry a contig is specified in maker_opts.ctl)
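
With many contigs, reading the index by eye becomes impractical. The sketch below reports every contig whose most recent entry is something other than FINISHED. The function name unfinished_contigs is illustrative. It assumes the three-column layout shown above (contig name, directory, status), and the demonstration index is a stand-in with an invented FAILED entry.

```shell
#!/bin/sh
# unfinished_contigs INDEX: print "contig<TAB>status" for every contig whose
# most recent entry in the datastore index is not FINISHED.
# Assumes whitespace-separated columns: contig name, directory, status.
unfinished_contigs() {
    awk '{ last[$1] = $NF }
         END { for (c in last) if (last[c] != "FINISHED") print c "\t" last[c] }' "$1" | sort
}

# Demonstration on a stand-in index with one failed contig.
demo_index=$(mktemp)
cat > "$demo_index" <<'EOF'
Chr1 test_genome_datastore/41/30/Chr1/ STARTED
Chr1 test_genome_datastore/41/30/Chr1/ FINISHED
Chr2 test_genome_datastore/E9/36/Chr2/ STARTED
Chr2 test_genome_datastore/E9/36/Chr2/ FAILED
EOF
unfinished_contigs "$demo_index"
```

For the run shown in this tutorial, where every contig ends in FINISHED, the function would print nothing.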

The actual output data is stored under test_genome_datastore in a nested directory structure.

A typical set of outputs for a contig looks like this:

$ ls test_genome_datastore/41/30/Chr1/
Chr1.gff                                                Chr1.maker.proteins.fasta                 Chr1.maker.transcripts.fasta
Chr1.maker.non_overlapping_ab_initio.proteins.fasta     Chr1.maker.snap_masked.proteins.fasta     run.log
Chr1.maker.non_overlapping_ab_initio.transcripts.fasta  Chr1.maker.snap_masked.transcripts.fasta  theVoid.Chr1

The Chr1.gff file is in GFF3 format and contains the maker gene models and underlying evidence such as repeat regions, alignment data, and ab initio gene predictions, as well as fasta sequence. Having all of these data in one file is important to enable visualization of the called gene models and underlying evidence, especially using tools like Apollo which enable manual editing and curation of gene models.
The fasta files Chr1.maker.proteins.fasta and Chr1.maker.transcripts.fasta contain the protein and transcript sequences for the final MAKER-P gene calls.
The Chr1.maker.non_overlapping_ab_initio.proteins.fasta and Chr1.maker.non_overlapping_ab_initio.transcripts.fasta files contain ab initio models that do not overlap MAKER-P genes and were rejected for lack of support.
The Chr1.maker.snap_masked.proteins.fasta and Chr1.maker.snap_masked.transcripts.fasta files contain the initial SNAP-predicted models, not further processed by MAKER-P.
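
Because each contig's files sit in its own nested directory, the datastore index is the easiest way to locate them. The sketch below prints the expected GFF3 path for every FINISHED contig, assuming the naming seen above (directory followed by the contig name and .gff). The function name finished_gffs is hypothetical. The gff3_merge script listed in /opt/maker/bin is intended for combining GFF3 files; its options are not covered here.

```shell
#!/bin/sh
# finished_gffs INDEX: print the expected path of the GFF3 file for each
# contig marked FINISHED, relative to the directory holding the index.
# Assumes the layout shown in this tutorial: <directory>/<contig>.gff
finished_gffs() {
    awk '$NF == "FINISHED" { print $2 $1 ".gff" }' "$1"
}

# Demonstration on a stand-in index; Chr10 has not finished yet.
demo_index=$(mktemp)
cat > "$demo_index" <<'EOF'
Chr1 test_genome_datastore/41/30/Chr1/ STARTED
Chr1 test_genome_datastore/41/30/Chr1/ FINISHED
Chr10 test_genome_datastore/7C/72/Chr10/ STARTED
EOF
finished_gffs "$demo_index"
```

The printed paths can be passed to ls or to a viewer to confirm that the files exist before further processing.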


The output directory theVoid.Chr1 contains raw output data from all of the pipeline steps. One useful file found here is the repeat-masked version of the contig, query.masked.fasta.
