MAKER-P Genome Annotation using cctools and Atmosphere
Rationale and background:
MAKER-P with cctools is a modified version of the MAKER annotation tool that can run on distributed computing resources (Thrasher et al. 2012). Using the Work Queue platform, users can run MAKER-P across multiple virtual machines and reduce the duration of a MAKER run several-fold.
MAKER-P is a flexible and scalable genome annotation pipeline that automates the many steps necessary for the detection of protein-coding genes (Campbell et al. 2013). MAKER-P identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations with evidence-based quality indices. MAKER-P was developed by the Yandell Lab. Its predecessor, MAKER, is described in several publications (Cantarel et al. 2008; Holt & Yandell 2011). Additional background is available in the MAKER Tutorial at GMOD, which is highly recommended reading.
MAKER-P v2.28 is currently available as an Atmosphere image and is MPI-enabled for parallel processing. MAKER-P is also packaged with JBrowse, which lets users set up a web-based genome browser to display annotations; that newer Atmosphere image is called MAKER-P_2.31_3_JBrowse (7888b8e1-c006-4794-82d9-4c940ddbf4c6). For instructions on setting up JBrowse and loading your MAKER-P data, see the "README_Jbrowse" document in the MAKER-P_tutorial_data folder inside the Desktop folder.
This tutorial takes users through the following steps:
- Launching the MAKER-P with cctools Atmosphere image
- Uploading data from the iPlant Data Store (Optional)
- Running MAKER-P with cctools on an example genome assembly
Note: Parts of this tutorial require the stand-alone VNC Viewer. Please download VNC Viewer (for Windows, if you are unsure which version to download, select the 32-bit .exe file).
Part 1: Connect to an instance of an Atmosphere Image (virtual machine)
Step 1. Go to https://atmo.iplantcollaborative.org and log in with your iPlant credentials.
Step 2. Click the 'Projects' tab on the top menu bar, click the 'Create New Project' button, and name the new project. For this demo, name your project 'maker-demo'.
Step 3. Click on the maker-demo project. Click on the 'New' button and select 'Image' to open the Instance Launch Wizard. Go through Steps 1-7 of the wizard to launch an instance.
Wizard Step 1. In the wizard's search box, enter 'maker' to list all images with the term 'maker' in the name.
Double click to launch the image MAKER-P_2.28_with_CCTools_5 (702d6867-e37d-4aa6-b1cf-0c4c69fb33b4). For the demo, we will launch 4 instances from the same maker image. The first instance will be designated as the master and the rest as worker instances. The process for launching each of the 4 instances is the same: go through Steps 1-7 of the wizard for each one.
Wizard Step 2. Name the instance 'maker-demo-master', set the version to "1.0", select "iPlant Cloud-Tucson" as the provider, and click 'Continue'. Currently every iPlant user has access to iPlant's Tucson and Austin clouds. Your choice of provider will depend on the resources you have available (AUs, CPUs) and the needs of your instance. If you need more AUs or CPUs to run instances, contact the Atmosphere team at email@example.com.
Wizard Step 3. Select an instance size. Your choice of instance size will depend on the tasks to be run on the instance. For annotation runs on small genomes (0-50 Mb), a tiny 1-CPU instance with 4 GB memory and 30 GB root is sufficient. Larger genomes require more resources. For this demo, select "tiny1" as the instance size and click "Continue". Check the Projected Resources Usage to confirm that you have enough resources to run the instance. If you need more AUs or CPUs to run instances, contact the Atmosphere team at firstname.lastname@example.org.
Wizard Steps 4-6. These can be skipped for this demo.
Wizard Step 7. Review the details of the instance you are about to launch and click "Launch Instance".
Step 4. As the instance is launched behind the scenes, you will get an update as it goes through each step. Status updates during instance launch include Build-requesting launch, Build-networking, Build-spawning, Active-networking, and Active-deploying. Depending on the usage load on Atmosphere, it can take anywhere from 20 minutes to a couple of hours for an instance to become active. You can force a status check with the refresh button on the Instance launch page or the refresh button of your browser. Once the instance becomes active, a virtual machine with the IP address provided becomes available for you to connect to. This virtual machine has all the components necessary to run MAKER, along with test files for a MAKER demo.
Step 5. Launch worker instances. To launch worker instances, follow Steps 1-7 of the Instance Launch Wizard. For the demo, launch 3 or 4 instances of the tiny1 (1 CPU) MAKER-P_2.28_with_CCTools_5 image.
Step 6. Connect to Master instance and browse through MAKER components and demo files. Launch the standalone VNC Viewer application on your desktop computer and enter YOUR VM's IP address by:
Once you have connected, you will see your virtual Desktop, a Linux server running in the Atmosphere cloud computing platform. You will interact with it the same way as you would a physical computer.
Part 2: Set up a MAKER-P run using the Terminal window
Step 1. On the desktop click Terminal. A terminal window will open giving a command-line interface.
Step 2. Get oriented. You will find staged example data in folder "MAKER-P_tutorial_data/" within the folder "Desktop". List its contents with the ls command:
The maker_opts.ctl file is a configuration file that can be used for this exercise or generated as described below. The FASTA files include a scaled-down genome (test_genome.fasta) consisting of the first 300 kb of each of the 12 rice chromosomes. The remaining FASTA files provide evidence for annotation: mRNA sequences from NCBI, publicly available annotated protein sequences of rice (MSU7.0 and IRGSP1.0), and a collection of plant repeats.
Executables for running MAKER-P are located in /opt/maker/bin and /opt/maker/exe:
As the name suggests, the "/opt/maker/bin" directory includes many useful auxiliary scripts. For example, cufflinks2gff3 converts output from an RNA-seq analysis into a GFF3 file that can be used as input evidence for MAKER-P. Both Cufflinks and cufflinks2gff3 are available as tools in the iPlant Discovery Environment (DE). Other auxiliary scripts now available in the DE include tophat2gff3, maker2jbrowse, and maker2zff.
RepeatMasker, augustus, blast, exonerate, and snap are programs that MAKER-P uses in its pipeline. We recommend reading the MAKER Tutorial at GMOD for more information about these.
Step 3. Environment variables should already be set up automatically in your instance (you can check with the env command). If they are not (for example, you type the maker command and get an error saying "command not found"), here are instructions for setting the environment variables necessary for MAKER-P:
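A minimal sketch of such a check, assuming only what Step 2 states: that the MAKER-P executables sit in /opt/maker/bin. The variable name MAKER_BIN is a local helper chosen for illustration. Other variables, such as ZOE, are not covered because their values are not given in this section.

```shell
# Assumption: MAKER-P scripts live in /opt/maker/bin, as stated in Step 2.
MAKER_BIN=/opt/maker/bin

# Prepend the directory to PATH only if it is not already listed.
case ":$PATH:" in
  *":$MAKER_BIN:"*) ;;
  *) PATH="$MAKER_BIN:$PATH"; export PATH ;;
esac

# Report whether the shell can now find the maker executable.
if command -v maker >/dev/null 2>&1; then
  echo "maker found at: $(command -v maker)"
else
  echo "maker not found on PATH; check that $MAKER_BIN exists on this instance"
fi
```

The export only affects the current terminal session, so it has to be repeated in each new terminal (or on each worker) unless it is added to a shell startup file.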
Step 4. Set up a MAKER-P run. Create a working directory called "maker_run" using the mkdir command and use cd to move into that directory:
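For instance, starting from any writable directory such as your home directory (the starting location is an assumption), the two commands could look like this:

```shell
# Create the working directory (no error if it already exists) and enter it.
mkdir -p maker_run
cd maker_run

# Confirm the current location.
pwd
```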
Step 5. Copy the directory "test_data" into the current directory using cp -r. Verify using the ls command. The test data in this
Step 6. Run the maker command with the --help flag to get a usage statement and list of options:
Step 7. Create control files that tell MAKER-P what to do. Three files are required:
- maker_opts.ctl - gives the location of input files (genome and evidence) and sets options that affect MAKER-P behavior.
- maker_exe.ctl - gives path information for the underlying executables.
- maker_bopt.ctl - sets parameters for filtering BLAST and Exonerate alignment results.
To create these files run the maker command with the -CTL flag. Verify with ls:
The "maker_exe.ctl" is automatically generated with the correct paths to executables and does not need to be modified. The "maker_bopt.ctl" is automatically generated with reasonable default parameters and also does not need to be modified unless you want to experiment with optimization of these parameters.
The automatically generated "maker_opts.ctl" file does need to be modified in order to specify the genome file and evidence files to be used as input. Several text editors are available, including emacs, nano, and gedit. If you have not tried any of these, gedit is probably the easiest to learn because it has a familiar graphical user interface. It can be started by typing gedit & on the command line or by selecting it from the menu on the VNC desktop (Applications=>Accessories=>Text Editor).
Here are the sections of the "maker_opts.ctl" file you need to edit. Add path information to files as shown. Do not allow any spaces after the equal sign or anywhere else. You can specify a complete path or relative path as shown here. In general you can specify multiple files by separating with a comma without spaces. If you do not have a specific file type to submit, leave the entry blank.
This section pertains to specifying the genome assembly to be annotated and setting organism type:
The following section pertains to EST and other mRNA expression evidence. Here we are only using same species data, but one could specify data from a related species using the altest parameter. With RNA-seq data aligned to your genome by Cufflinks or Tophat one could use maker auxiliary scripts (cufflinks2gff3 and tophat2gff3) to generate GFF3 files and specify these using the est_gff parameter:
The following section pertains to protein sequence evidence. Here we are using previously annotated protein sequences. Another option would be to use SwissProt or other database:
This next section pertains to repeat identification:
Various programs for ab initio gene prediction can be specified in the next section. Here we are using SNAP set to use an HMM trained on rice. Specifying the entire path to the hmm file would not be necessary if the ZOE environment variable is set as described in step 3 above.
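As an alternative to a text editor, the edits described in the sections above could be scripted. The sketch below is one possible approach, not part of MAKER-P itself: set_ctl_option is a hypothetical helper name, and the path ./test_data/test_genome.fasta assumes the test data was copied into the working directory as in Step 5. The helper rewrites a single "key=" line, keeps the explanatory comment that follows it, and rejects values containing spaces, in line with the warning above.

```shell
# set_ctl_option FILE KEY VALUE
# Rewrite the "KEY=" line of a MAKER control file, keeping its trailing comment.
set_ctl_option() {
  ctl_file=$1 ctl_key=$2 ctl_value=$3
  # The tutorial warns against spaces in values, so refuse them outright.
  case $ctl_value in
    *" "*) echo "error: value for $ctl_key contains a space" >&2; return 1 ;;
  esac
  awk -v k="$ctl_key" -v v="$ctl_value" '
    index($0, k "=") == 1 {
      # Line starts with exactly "KEY=": keep any inline comment after "#".
      c = index($0, "#")
      comment = (c > 0) ? " " substr($0, c) : ""
      print k "=" v comment
      next
    }
    { print }
  ' "$ctl_file" > "$ctl_file.tmp" && mv "$ctl_file.tmp" "$ctl_file"
}

# Example use; the genome path is an assumption based on Step 5.
if [ -f maker_opts.ctl ]; then
  set_ctl_option maker_opts.ctl genome ./test_data/test_genome.fasta
  set_ctl_option maker_opts.ctl organism_type eukaryotic
fi
```

Matching on the full "KEY=" prefix means that setting est does not touch est_gff, and likewise for protein and protein_gff. Multiple input files can be passed as one comma-separated value with no spaces.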
Step 8. Run MAKER-P 2.28 with CCTools
Before running MAKER-P, check to make sure all worker instances have become active.
IMPORTANT: Jetstream sets up a firewall around each instance upon launch. This differs from Atmosphere, where all ports on the master are open by default and workers can communicate with it freely. On Jetstream, users have to open the WQ-MAKER port on the master so that the workers can connect. The WQ-MAKER master runs on port 9155.
NOTE: A WQ-MAKER image that includes these port changes was being prepared at the time of writing. Once that image is available, you do not need to run the iptables edit.
On the Master instance, make sure you are in the "maker_run" directory and all of your files are in place. Perform these steps to check:
Connect to the IP address of each worker the same way as you did for the master and run the export commands from Step 3. In addition, run this export command:
On each worker instance, run the following command:
$ work_queue_worker -N maker_project_name -s /home/usr_name/ --cores=#cores -dall &
Once the MAKER run is started on the master, jobs become available for the workers to pick up after a sequence processing step. On Jetstream, run "sudo service iptables stop" on the Master instance if the workers are not able to connect to the master after the sequence processing step.
Step 9. Example output on master instance
Step 10. Benchmarking MAKER-P 2.28 with CCTools
| Benchmark run | Data used for benchmarking | CPUs per worker | Number of workers | Time to completion (min) |
| 1 | 12 chromosomes, 100K bases per chromosome | 4 | 3 | 16.83 |
| 2 | 24 chromosomes, 100K bases per chromosome | 4 | 3 | 23.42 |
| 3 | 12 chromosomes, 200K bases per chromosome | 4 | 3 | 23.13 |
| 4 | 12 chromosomes, 1M bases per chromosome | 4 | 3 | 54.77 |
| 5 | 12 chromosomes, 1M bases per chromosome | 4 | 7 | 17.54 |
Use the above benchmarking to determine the resources you would need for your project. Note: Please contact us at email@example.com if you need help determining the optimal amount of resources to annotate your genome.
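As a rough illustration of reading the table, the ratio between runs 4 and 5 (same data, 3 versus 7 workers) can be computed as below. The speedup helper is a hypothetical name, and the times are rounded to two decimals. Each configuration appears only once in the table, so the ratio should be treated as indicative rather than as a scaling guarantee.

```shell
# speedup BASELINE_MIN NEW_MIN: print how many times faster the new run was.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Run 4 (3 workers, 54.77 min) versus run 5 (7 workers, 17.54 min).
speedup 54.77 17.54
```

For these two runs the ratio comes out at a little over 3, which suggests that adding workers pays off on the 1M-base dataset. The same helper can be applied to timings from a short pilot run on your own genome.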
Step 11. Examining MAKER-P output
Output data appears in a new directory called "test_genome.maker.output". Move to that directory and examine its contents:
- The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER.
- The mpi_blastdb directory contains FASTA indexes and BLAST database files created from the input EST, protein, and repeat databases.
- test_genome_master_datastore_index.log contains information on both the run status of individual contigs and information on where individual contig data is stored.
- The test_genome_datastore directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.
Check the test_genome_master_datastore_index.log to see if there were any failures:
All completed. Other possible status entries include:
- FAILED - indicates a failed run on this contig, MAKER will retry these
- RETRY - indicates that MAKER is retrying a contig that failed
- SKIPPED_SMALL - indicates the contig was too short to annotate (minimum contig length is specified in
- DIED_SKIPPED_PERMANENT - indicates a failed contig that MAKER will not attempt to retry (number of times to retry a contig is specified in
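The status values listed above can be tallied directly from the index log rather than read by eye. The sketch below assumes the log is tab-separated with the contig name in the first column and the status in the third; that layout is an assumption to verify against your own file. The function names summarize_status and list_problem_contigs are hypothetical helpers, not MAKER commands.

```shell
# summarize_status LOG: count how often each status appears (third column).
summarize_status() {
  awk -F '\t' 'NF >= 3 { count[$3]++ } END { for (s in count) print s, count[s] }' "$1" | sort
}

# list_problem_contigs LOG: names of contigs that failed at least once.
list_problem_contigs() {
  awk -F '\t' '$3 == "FAILED" || $3 == "DIED_SKIPPED_PERMANENT" { print $1 }' "$1" | sort -u
}

# Run against the real log only if it exists in the current directory.
if [ -f test_genome_master_datastore_index.log ]; then
  summarize_status test_genome_master_datastore_index.log
  list_problem_contigs test_genome_master_datastore_index.log
fi
```

An empty result from list_problem_contigs means no contig was recorded as FAILED or DIED_SKIPPED_PERMANENT.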
The actual output data is stored under test_genome_datastore in a nested directory structure.
A typical set of outputs for a contig looks like this:
- The Chr1.gff file is in GFF3 format and contains the maker gene models and underlying evidence such as repeat regions, alignment data, and ab initio gene predictions, as well as fasta sequence. Having all of these data in one file is important to enable visualization of the called gene models and underlying evidence, especially using tools like Apollo which enable manual editing and curation of gene models.
- The fasta files Chr1.maker.proteins.fasta and Chr1.maker.transcripts.fasta contain the protein and transcript sequences for the final MAKER-P gene calls.
- The Chr1.maker.non_overlapping_ab_initio.proteins.fasta and Chr1.maker.non_overlapping_ab_initio.transcripts.fasta files contain ab initio models that do not overlap MAKER-P genes and were rejected for lack of support.
- The Chr1.maker.snap_masked.proteins.fasta and Chr1.maker.snap_masked.transcripts.fasta files contain the initial SNAP-predicted models, not further processed by MAKER-P.
The output directory theVoid.Chr1 contains raw output data from all of the pipeline steps. One useful file found here is the repeat-masked version of the contig, query.masked.fasta.
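Because the per-contig results sit in nested subfolders of test_genome_datastore, a recursive search is a convenient way to locate them. The sketch below lists every per-contig GFF3 file. The function name list_contig_gffs is a hypothetical helper, and the .gff extension follows the Chr1.gff example above.

```shell
# list_contig_gffs DATASTORE_DIR: print every per-contig GFF3 file, sorted.
list_contig_gffs() {
  find "$1" -type f -name '*.gff' | sort
}

# Run against the real datastore only if it exists in the current directory.
if [ -d test_genome_datastore ]; then
  list_contig_gffs test_genome_datastore
fi
```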