MAKER-P Tutorial for Atmosphere
MAKER-P Genome Annotation using Atmosphere
For an updated MAKER-P genome annotation tutorial, please use the latest version: MAKER-P_2.31.9 Atmosphere Tutorial.
Rationale and background:
MAKER-P is a flexible and scalable genome annotation pipeline that automates the many steps necessary for the detection of protein-coding genes. MAKER-P identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality indices. MAKER-P was developed by the Yandell Lab. Its predecessor, MAKER, is described in several publications (Cantarel et al. 2008; Holt & Yandell 2011). Additional background is available in the MAKER Tutorial at GMOD, which is highly recommended reading. MAKER-P v2.28 is currently available as an Atmosphere image and is MPI-enabled for parallel processing.
This tutorial will take users through steps of:
- Launching the MAKER-P Atmosphere image
- Uploading data from the iPlant Data Store (Optional)
- Running MAKER-P on an example small genome assembly
- Visualizing MAKER-P output using the Integrated Genome Viewer (IGV)
Note: Parts of this tutorial require the stand-alone VNC Viewer. Please download VNC Viewer (for Windows, if you are unsure which version to download, select the 32bit.exe file).
Part 1: Connect to an instance of an Atmosphere Image (virtual machine)
Step 1. Go to https://atmo.iplantcollaborative.org and log in with your iPlant credentials.
Step 2. Click on the Launch New Instance button and search for MAKER-P_2.28
Step 3. Select the image MAKER-P_2.28 (emi-F13821D0) and click Launch Instance. It will take 10-15 minutes for the cloud instance to be launched. You will be notified by email when the image is ready.
Note: Instances can be configured for different amounts of CPU, memory, and storage depending on user needs. This tutorial can be accomplished with the small instance size, m1.small (default). For real genome annotation larger instances will be required.
Step 4. Launch the standalone VNC Viewer application on your desktop computer and enter your VM's IP address followed by :1
Once you have connected, you will see your virtual Desktop, a Linux server running in the Atmosphere cloud computing platform. You will interact with it the same way as you would a physical computer.
Part 2: Set up a MAKER-P run using the Terminal window
Step 1. On the desktop click Terminal. A terminal window will open giving a command-line interface.
Step 2. Get oriented. You will find staged example data in folder "MAKER-P_example_data" within the folder "Desktop". List its contents with the ls command:
$ ls Desktop/MAKER-P_example_data/
README mRNA.fasta maker_exe.ctl msu-irgsp-proteins.fasta test_genome.fasta
example_output maker_bopts.ctl maker_opts.ctl plant_repeats.fasta
The fasta files include a scaled-down genome (test_genome.fasta), which consists of the first 100kb of each chromosome of rice. The remaining fasta files provide evidence for annotation: mRNA sequences from NCBI, publicly available annotated protein sequences of rice (MSU7.0 and IRGSP1.0), and a collection of plant repeats. The maker_*.ctl files are configuration files for MAKER-P; they can optionally be used for this exercise or generated as described later.
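To confirm that each sequence in the scaled-down genome is about 100kb long, the sequence lengths can be tallied with awk. This is only a minimal sketch: the function name fasta_lengths is illustrative and is not part of MAKER-P.

```shell
# Print "<sequence id><TAB><length>" for every record in a FASTA file.
# fasta_lengths is a hypothetical helper, not a MAKER-P tool.
fasta_lengths() {
    awk '
        /^>/ {
            if (id != "") print id "\t" len
            id = substr($1, 2)   # first word of the header, without ">"
            len = 0
            next
        }
        { gsub(/[ \t\r]/, ""); len += length($0) }   # ignore stray whitespace
        END { if (id != "") print id "\t" len }
    ' "$1"
}
```

For example, fasta_lengths Desktop/MAKER-P_example_data/test_genome.fasta should print one line per chromosome.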
Executables for running MAKER-P are located in /opt/maker/bin and /opt/maker/exe:
$ ls /opt/maker/bin
cegma2zff fasta_merge iprscan2gff3 maker2jbrowse maker_functional_gff map_gff_ids
chado2gff3 fasta_tool iprscan_wrap maker2wap maker_map_ids mpi_evaluator
compare genemark_gtf2gff3 maker maker2zff map2assembly mpi_iprscan
cufflinks2gff3 gff3_merge maker2chado maker_functional map_data_ids tophat2gff3
evaluator ipr_update_gff maker2eval_gtf maker_functional_fasta map_fasta_ids
$ ls /opt/maker/exe/
RepeatMasker augustus blast exonerate mpich2-1.5 mpich2.tar.gz snap
As the names suggest, the /opt/maker/bin directory includes many useful auxiliary scripts. For example, cufflinks2gff3 converts output from an RNA-seq analysis into a GFF3 file that can be used as evidence input for MAKER-P. Both Cufflinks and cufflinks2gff3 are available as tools in the iPlant Discovery Environment (DE). Other auxiliary scripts now available in the DE include tophat2gff3, maker2jbrowse, and maker2zff.
RepeatMasker, augustus, blast, exonerate, and snap are programs that MAKER-P uses in its pipeline. We recommend reading MAKER Tutorial at GMOD for more information about these.
Step 3. Set up your environment. Note: this step is not essential for the purpose of this tutorial. If you skip it, you will need to specify the complete path when running the maker command (/opt/maker/bin/maker).
$ export PATH=/opt/maker/bin/:$PATH
$ export ZOE=/opt/maker/exe/snap
$ export AUGUSTUS_CONFIG_PATH=/opt/maker/exe/augustus/config
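A quick way to confirm that the PATH setting took effect is to ask the shell where it finds each program. The sketch below is an assumption-laden convenience, not part of MAKER-P; check_tools is a made-up name.

```shell
# Report whether each named program can be found on the current PATH.
# check_tools is a hypothetical helper; it returns non-zero if any are missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if found=$(command -v "$tool"); then
            printf '%s\t%s\n' "$tool" "$found"
        else
            printf '%s\tNOT FOUND\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}
```

After Step 3, check_tools maker would be expected to report /opt/maker/bin/maker.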
Step 4. Set up a MAKER-P run. Create a working directory called "maker_run" using the mkdir command and use cd to move into that directory:
$ mkdir maker_run $ cd maker_run
Step 5. Create another directory for the data called "test_data" using mkdir and copy the staged fasta data into it using cp. Verify using the ls command:
$ mkdir test_data
$ cp ~/Desktop/MAKER-P_example_data/*.fasta test_data/.
$ ls test_data/
mRNA.fasta msu-irgsp-proteins.fasta plant_repeats.fasta test_genome.fasta
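Beyond checking that the files are listed, counting the records in each copied file can catch an empty or truncated copy. A minimal sketch follows; count_fasta_records is an illustrative name, not a MAKER-P script.

```shell
# Count FASTA records (header lines starting with ">") in each file given.
# count_fasta_records is a hypothetical helper, not a MAKER-P tool.
count_fasta_records() {
    for f in "$@"; do
        # grep -c exits non-zero when the count is 0, so tolerate that case
        n=$(grep -c '^>' "$f" || true)
        printf '%s\t%s\n' "$f" "$n"
    done
}
```

For example: count_fasta_records test_data/*.fasta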
Step 6. Run the maker command with the --help flag to get a usage statement and list of options:
$ maker --help

MAKER version 2.28

Usage:

     maker [options] <maker_opts> <maker_bopts> <maker_exe>

Description:

     MAKER is a program that produces gene annotations in GFF3 format using
     evidence such as EST alignments and protein homology. MAKER can be used to
     produce gene annotations for new genomes as well as update annotations
     from existing genome databases.

     The three input arguments are control files that specify how MAKER should
     behave. All options for MAKER should be set in the control files, but a
     few can also be set on the command line. Command line options provide a
     convenient machanism to override commonly altered control file values.
     MAKER will automatically search for the control files in the current
     working directory if they are not specified on the command line.

     Input files listed in the control options files must be in fasta format
     unless otherwise specified. Please see MAKER documentation to learn more
     about control file configuration. MAKER will automatically try and
     locate the user control files in the current working directory if these
     arguments are not supplied when initializing MAKER.

     It is important to note that MAKER does not try and recalculated data that
     it has already calculated. For example, if you run an analysis twice on
     the same dataset you will notice that MAKER does not rerun any of the
     BLAST analyses, but instead uses the blast analyses stored from the
     previous run. To force MAKER to rerun all analyses, use the -f flag.

     MAKER also supports parallelization via MPI on computer clusters. Just
     launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
     configured during the MAKER installation process for this to work though

Options:

     -genome|g <file>    Overrides the genome file path in the control files
     -RM_off|R           Turns all repeat masking options off.
     -datastore/         Forcably turn on/off MAKER's two deep directory
      nodatastore        structure for output. Always on by default.
     -old_struct         Use the old directory styles (MAKER 2.26 and lower)
     -base    <string>   Set the base name MAKER uses to save output files.
                         MAKER uses the input genome file name by default.
     -tries|t <integer>  Run contigs up to the specified number of tries.
     -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
                         Note: this is for BLAST and not for MPI!
     -force|f            Forces MAKER to delete old files before running again.
                         This will require all blast analyses to be rerun.
     -again|a            recaculate all annotations and output files even if no
                         settings have changed. Does not delete old analyses.
     -quiet|q            Regular quiet. Only a handlful of status messages.
     -qq                 Even more quit. There are no status messages.
     -dsindex            Quickly generate datastore index file. Note that this
                         will not check if run settings have changed on contigs
     -nolock             Turn off file locks. May be usful on some file systems,
                         but can cause race conditions if running in parallel.
     -TMP                Specify temporary directory to use.
     -CTL                Generate empty control files in the current directory.
     -OPTS               Generates just the maker_opts.ctl file.
     -BOPTS              Generates just the maker_bopts.ctl file.
     -EXE                Generates just the maker_exe.ctl file.
     -MWAS    <option>   Easy way to control mwas_server for web-based GUI
                              options:  STOP
                                        START
                                        RESTART
     -version            Prints the MAKER version.
     -help|?             Prints this usage statement.
Step 7. Create control files that tell MAKER-P what to do. Three files are required:
- maker_opts.ctl - gives location of input files (genome and evidence) and sets options that affect MAKER-P behavior
- maker_exe.ctl - gives path information for the underlying executables
- maker_bopts.ctl - sets parameters for filtering BLAST and Exonerate alignment results
To create these files run the maker command with the -CTL flag. Verify with ls:
$ maker -CTL
$ ls
maker_bopts.ctl maker_exe.ctl maker_opts.ctl test_data
The maker_exe.ctl file is automatically generated with the correct paths to executables and does not need to be modified. The maker_bopts.ctl file is automatically generated with reasonable default parameters and also does not need to be modified unless you want to experiment with optimization of these parameters.
The automatically generated maker_opts.ctl file does need to be modified in order to specify the genome file and evidence files to be used as input. Several text editors are available, including emacs, nano, and gedit. If you have not used any of these, gedit is probably the easiest to learn because it has a familiar graphical user interface. It can be started by typing gedit & on the command line or by selecting it from the menu on the VNC desktop (Applications=>Accessories=>Text Editor).
Note: If you are pressed for time, a pre-edited version of the maker_opts.ctl file is staged in ~/Desktop/MAKER-P_example_data/. Delete the current file and copy the staged version here. Then skip to Step 8.
$ rm maker_opts.ctl
$ cp ~/Desktop/MAKER-P_example_data/maker_opts.ctl .
$ gedit &
Here are the sections of the maker_opts.ctl file you need to edit. Add path information to files as shown. Do not allow any spaces after the equal sign or anywhere else. You can specify a complete path or relative path as shown here. In general you can specify multiple files by separating with a comma without spaces.
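As an alternative to a text editor, an option can be set from the command line with sed. This is a rough sketch rather than a MAKER-P feature: set_ctl_option is a made-up helper, and it assumes each option sits on its own line in the form key=value #comment and that the new value contains no "|", "&", or backslash characters.

```shell
# Set KEY=VALUE in a MAKER control file, leaving the trailing "#comment" alone.
# set_ctl_option is a hypothetical helper. Usage: set_ctl_option KEY VALUE FILE
set_ctl_option() {
    key=$1 value=$2 file=$3
    # Replace "key=" plus any existing value (up to whitespace or "#")
    # so that no space is introduced after the equal sign.
    sed "s|^${key}=[^#[:space:]]*|${key}=${value}|" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}
```

For example: set_ctl_option est ./test_data/mRNA.fasta maker_opts.ctl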
This section pertains to specifying the genome assembly to be annotated and setting organism type:
#-----Genome (these are always required)
genome=./test_data/test_genome.fasta #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
The following section pertains to EST and other mRNA expression evidence. Here we are only using same species data, but one could specify data from a related species using the altest parameter. With RNA-seq data aligned to your genome by Cufflinks or Tophat one could use maker auxiliary scripts (cufflinks2gff3 and tophat2gff3) to generate GFF3 files and specify these using the est_gff parameter:
#-----EST Evidence (for best results provide a file for at least one)
est=./test_data/mRNA.fasta #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format
The following section pertains to protein sequence evidence. Here we are using previously annotated protein sequences. Another option would be to use SwissProt or other database:
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=./test_data/msu-irgsp-proteins.fasta #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff= #aligned protein homology evidence from an external GFF3 file
This next section pertains to repeat identification:
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib=./test_data/plant_repeats.fasta #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
Various programs for ab initio gene prediction can be specified in the next section. Here we are using SNAP set to use an HMM trained on rice. Specifying the entire path to the hmm file would not be necessary if the ZOE environment variable is set as described in step 3 above.
#-----Gene Prediction
snaphmm=/opt/maker/exe/snap/HMM/O.sativa.hmm #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
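Before starting a run, it can help to confirm that every input file named in maker_opts.ctl actually exists. The sketch below is a hypothetical helper, check_ctl_paths, not a MAKER-P tool. It inspects only the four input options edited in this tutorial (genome, est, protein, rmlib), splits comma-separated values, and assumes relative paths are resolved from the directory where the check is run.

```shell
# List input options in a control file whose value names a missing file.
# check_ctl_paths is a hypothetical helper. Usage: check_ctl_paths FILE
check_ctl_paths() {
    awk -F'=' '
        /^(genome|est|protein|rmlib)=/ {
            value = $2
            sub(/[ \t]*#.*/, "", value)       # drop the trailing comment
            gsub(/[ \t\r]+$/, "", value)      # drop trailing whitespace
            n = split(value, parts, ",")      # several files may be listed
            for (i = 1; i <= n; i++)
                if (parts[i] != "") print $1 "\t" parts[i]
        }
    ' "$1" |
    while IFS="$(printf '\t')" read -r key path_value; do
        [ -e "$path_value" ] || printf 'missing: %s=%s\n' "$key" "$path_value"
    done
}
```

Running check_ctl_paths maker_opts.ctl from the maker_run directory prints nothing when all listed files are found.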
Step 8. Run MAKER-P !!!
Make sure you are in the maker_run directory and all of your files are in place. Perform these steps to check:
$ pwd
/home/steinj/maker_run
$ ls
maker_bopts.ctl maker_exe.ctl maker_opts.ctl stderr test_data
$ ls test_data/
mRNA.fasta msu-irgsp-proteins.fasta plant_repeats.fasta test_genome.fasta
Starting the MAKER-P run is as simple as entering the command maker. If you did not set the PATH environment variable in Step 3, you will need to type the entire path: /opt/maker/bin/maker. Because your maker control files are in the present directory, you do not have to specify them explicitly in the command; they will be found automatically. MAKER-P automatically outputs thousands of lines of STDERR, reporting on its progress and producing warnings and errors if they arise. It is good practice to redirect this output to a log file so you have a record of it, especially if something goes wrong. Since we are in the bash shell, you can redirect the output by typing the command as follows: maker 2> log_file &. Another useful practice is to employ the Unix time command to report statistics on how long the run took to complete. The output of time will appear at the end of the captured standard error file. Putting all of this together, we type the following command to start MAKER-P:
(time /opt/maker/bin/maker) 2> log_file &
This part of the tutorial usually takes about 30 minutes to complete. You will know MAKER-P is finished when the log_file announces "Maker is now finished!!!". Use the tail command to look at the last 10 lines of the log_file:
$ tail log_file
processing chunk output
processing contig output
Maker is now finished!!!

real 31m11.250s
user 36m52.210s
sys 1m34.106s
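The same check can be scripted so that it simply succeeds or fails. This is a minimal sketch; maker_finished is an illustrative name, and it relies only on the completion message quoted in this step.

```shell
# Succeed only if the captured STDERR log contains MAKER's completion message.
# maker_finished is a hypothetical helper. Usage: maker_finished LOG_FILE
maker_finished() {
    grep -q 'Maker is now finished!!!' "$1"
}
```

For example: maker_finished log_file && echo "run complete"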
Step 9. Examining MAKER-P output
Output data appears in a new directory called test_genome.maker.output. Move to that directory and examine its contents:
$ cd test_genome.maker.output/
$ ls -l
total 32
-rw-r--r-- 1 steinj iplant-everyone 1413 Dec 3 11:20 maker_bopts.log
-rw-r--r-- 1 steinj iplant-everyone 1164 Dec 3 11:20 maker_exe.log
-rw-r--r-- 1 steinj iplant-everyone 4593 Dec 3 11:20 maker_opts.log
drwxr-xr-x 5 steinj iplant-everyone 4096 Dec 3 11:20 mpi_blastdb
-rw-r--r-- 1 steinj iplant-everyone 0 Dec 3 11:20 seen.dbm
drwxr-xr-x 14 steinj iplant-everyone 4096 Dec 3 11:50 test_genome_datastore
-rw-r--r-- 1 steinj iplant-everyone 2048 Dec 3 11:20 test_genome.db
-rw-r--r-- 1 steinj iplant-everyone 1152 Dec 3 11:51 test_genome_master_datastore_index.log
- The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER.
- The mpi_blastdb directory contains FASTA indexes and BLAST database files created from the input EST, protein, and repeat databases.
- test_genome_master_datastore_index.log contains information on both the run status of individual contigs and information on where individual contig data is stored.
- The test_genome_datastore directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.
Check the test_genome_master_datastore_index.log to see if there were any failures:
$ less test_genome_master_datastore_index.log
Chr1 test_genome_datastore/41/30/Chr1/ STARTED
Chr1 test_genome_datastore/41/30/Chr1/ FINISHED
Chr10 test_genome_datastore/7C/72/Chr10/ STARTED
Chr10 test_genome_datastore/7C/72/Chr10/ FINISHED
Chr11 test_genome_datastore/1E/AA/Chr11/ STARTED
Chr11 test_genome_datastore/1E/AA/Chr11/ FINISHED
Chr12 test_genome_datastore/1B/FA/Chr12/ STARTED
Chr12 test_genome_datastore/1B/FA/Chr12/ FINISHED
Chr2 test_genome_datastore/E9/36/Chr2/ STARTED
Chr2 test_genome_datastore/E9/36/Chr2/ FINISHED
Chr3 test_genome_datastore/CC/EF/Chr3/ STARTED
Chr3 test_genome_datastore/CC/EF/Chr3/ FINISHED
Chr4 test_genome_datastore/A3/11/Chr4/ STARTED
Chr4 test_genome_datastore/A3/11/Chr4/ FINISHED
Chr5 test_genome_datastore/8A/9B/Chr5/ STARTED
Chr5 test_genome_datastore/8A/9B/Chr5/ FINISHED
Chr6 test_genome_datastore/13/44/Chr6/ STARTED
Chr6 test_genome_datastore/13/44/Chr6/ FINISHED
Chr7 test_genome_datastore/91/B7/Chr7/ STARTED
Chr7 test_genome_datastore/91/B7/Chr7/ FINISHED
Chr8 test_genome_datastore/9A/9E/Chr8/ STARTED
Chr8 test_genome_datastore/9A/9E/Chr8/ FINISHED
Chr9 test_genome_datastore/87/90/Chr9/ STARTED
Chr9 test_genome_datastore/87/90/Chr9/ FINISHED
All completed. Other possible status entries include:
- FAILED - indicates a failed run on this contig; MAKER will retry these
- RETRY - indicates that MAKER is retrying a contig that failed
- SKIPPED_SMALL - indicates the contig was too short to annotate (minimum contig length is specified in maker_opts.ctl)
- DIED_SKIPPED_PERMANENT - indicates a failed contig that MAKER will not attempt to retry (number of times to retry a contig is specified in maker_opts.ctl)
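For a genome with many contigs, reading the index by eye becomes impractical. The sketch below reduces the log to the last status recorded for each contig. contig_status is a made-up helper, and it assumes the three whitespace-separated columns seen in the index log (contig, directory, status), with later lines superseding earlier ones.

```shell
# Print the last recorded status of every contig in a datastore index log.
# contig_status is a hypothetical helper. Usage: contig_status INDEX_LOG
contig_status() {
    awk '
        NF >= 3 {
            if (!($1 in last)) order[++n] = $1   # remember first-seen order
            last[$1] = $3                        # later lines overwrite earlier ones
        }
        END { for (i = 1; i <= n; i++) print order[i] "\t" last[order[i]] }
    ' "$1"
}
```

Piping the result through grep -v 'FINISHED$' leaves only the contigs that may need attention.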
The actual output data is stored under test_genome_datastore in a nested directory structure.
A typical set of outputs for a contig looks like this:
$ ls test_genome_datastore/41/30/Chr1/
Chr1.gff Chr1.maker.proteins.fasta Chr1.maker.transcripts.fasta
Chr1.maker.non_overlapping_ab_initio.proteins.fasta Chr1.maker.snap_masked.proteins.fasta run.log
Chr1.maker.non_overlapping_ab_initio.transcripts.fasta Chr1.maker.snap_masked.transcripts.fasta theVoid.Chr1
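Because the intermediate directory names differ for every contig, the location of a contig's output can be looked up in the index log instead of being typed by hand. A minimal sketch, using a hypothetical helper named contig_dir that reads the second column of the first matching line:

```shell
# Print the output directory recorded for one contig in the index log.
# contig_dir is a hypothetical helper. Usage: contig_dir CONTIG INDEX_LOG
contig_dir() {
    awk -v contig="$1" '$1 == contig { print $2; exit }' "$2"
}
```

For example: ls "$(contig_dir Chr1 test_genome_master_datastore_index.log)"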
- The Chr1.gff file is in GFF3 format and contains the maker gene models and underlying evidence such as repeat regions, alignment data, and ab initio gene predictions, as well as fasta sequence. Having all of these data in one file is important to enable visualization of the called gene models and underlying evidence, especially using tools like Apollo which enable manual editing and curation of gene models.
- The fasta files Chr1.maker.proteins.fasta and Chr1.maker.transcripts.fasta contain the protein and transcript sequences for the final MAKER-P gene calls.
- The Chr1.maker.non_overlapping_ab_initio.proteins.fasta and Chr1.maker.non_overlapping_ab_initio.transcripts.fasta files contain ab initio models that do not overlap MAKER-P genes and were rejected for lack of support.
- The Chr1.maker.snap_masked.proteins.fasta and Chr1.maker.snap_masked.transcripts.fasta files contain the initial SNAP-predicted models, not further processed by MAKER-P.
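A simple first look at a contig's GFF3 file is to count the gene models it contains. The sketch below is not part of MAKER-P; count_maker_genes is an illustrative name. It assumes tab-separated GFF3 in which final gene calls carry the source "maker" in column 2 and the type "gene" in column 3, and it stops reading at the ##FASTA directive so the embedded sequence is not scanned.

```shell
# Count gene features with source "maker" in a GFF3 file.
# count_maker_genes is a hypothetical helper. Usage: count_maker_genes FILE.gff
count_maker_genes() {
    awk -F'\t' '
        /^##FASTA/ { exit }                       # sequence section follows; stop
        $2 == "maker" && $3 == "gene" { n++ }
        END { print n + 0 }                       # print 0 rather than an empty line
    ' "$1"
}
```

For example: count_maker_genes test_genome_datastore/41/30/Chr1/Chr1.gff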
The output directory theVoid.Chr1 contains raw output data from all of the pipeline steps. One useful file found here is the repeat-masked version of the contig, query.masked.fasta.