MAKER-P Genome Annotation using Atmosphere (Images Tutorial)

Rationale and background:

MAKER-P is a flexible and scalable genome annotation pipeline that automates the many steps necessary for the detection of protein coding genes (Campbell et al. 2013).  MAKER-P identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality indices.  MAKER-P was developed by the Yandell Lab.  Its predecessor, MAKER, is described in several publications (Cantarel et al. 2008; Holt & Yandell 2011).  Additional background is available at the MAKER Tutorial at GMOD and is highly recommended reading.  MAKER-P v2.28 is currently available as an Atmosphere image and is MPI-enabled for parallel processing.

This tutorial will take users through steps of:

  1. Launching the MAKER-P Atmosphere image
  2. Uploading data from the iPlant Data Store (optional)
  3. Running MAKER-P on an example small genome assembly

Note: Parts of this tutorial require standalone VNC Viewer. If you have not already done so, download VNC Viewer. (For Windows, if you are unsure of which version to download, select the 32bit.exe file.)

New: MAKER-P is now packaged with JBrowse, enabling users to set up a genome browser over the web to display annotations.  The new Atmosphere image is called MAKER-P_2.31_3_JBrowse (7888b8e1-c006-4794-82d9-4c940ddbf4c6).  For instructions on setting up JBrowse and loading your MAKER-P data, see the documentation "README_Jbrowse" located in the Desktop folder under the MAKER-P_tutorial_data folder.

Part 1: Connect to an instance of an Atmosphere Image (virtual machine)

Step 1. Go to https://atmo.iplantcollaborative.org and log in with your iPlant credentials.

Step 2. Click on the Launch New Instance button and search for MAKER-P.

Step 3. Select the image MAKER-P_2.31.3 (2bb7bb3b-3ceb-4f76-b97a-70e2fce03dc3) and click Launch Instance. It will take 10-15 minutes for the cloud instance to be launched. The panel on the left side of the screen will show progress and status. You will also be notified by email when the instance is ready.


Note: Instances can be configured for different amounts of CPU, memory, and storage depending on user needs.  This tutorial can be accomplished with the small instance size, m1.small (default).  For real genome annotation, larger instances will be required.

Step 4. Launch the standalone VNC Viewer application on your desktop computer and enter your VM's IP address followed by :1 (that is, <IP address>:1).

Once you have connected, you will see your virtual desktop, a Linux server running in the Atmosphere cloud computing platform. You will interact with it the same way as you would a physical computer.

Part 2: Set up a MAKER-P run using the Terminal window

Step 1. On the desktop, click Terminal.  A terminal window with a command-line interface will open.

Step 2. Get oriented. You will find staged example data in folder "MAKER-P_tutorial_data/" within the folder "Desktop".  List its contents with the ls command:

The maker_opts.ctl file is a configuration file that can be used for this exercise or generated as described below.

The subdirectory "test_data" includes input data files that you will use for this tutorial.  The subdirectory "example_output" contains output that you should expect to see when completing the tutorial.  Have a look at "test_data" directory:

The fasta files include a scaled-down genome (test_genome.fasta) that consists of the first 300 kb of three chromosomes of rice.  The remaining fasta files provide evidence for annotation: mRNA sequences from NCBI, publicly available annotated protein sequences of rice (MSU7.0 and IRGSP1.0), and a collection of plant repeats.

Executables for running MAKER-P are located in /opt/maker/bin and /opt/maker/exe:

As the names suggest, the "/opt/maker/bin" directory includes many useful auxiliary scripts.  For example cufflinks2gff3 will convert output from an RNA-seq analysis into a GFF3 file that can be used for input as evidence for MAKER-P.  Both Cufflinks and cufflinks2gff3 are available as tools in the iPlant Discovery Environment (DE).  Other auxiliary scripts now available in the DE include tophat2gff3, maker2jbrowse, and maker2zff.

RepeatMasker, augustus, blast, exonerate, and snap are programs that MAKER-P uses in its workflow. We recommend reading MAKER Tutorial at GMOD for more information about these.

(Optional) Step 3.  Environment variables should already be set up automatically in your instance; you can check them with the env command.  If they are not set (for example, typing the maker command returns a "command not found" error), set the environment variables that MAKER-P needs:
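
A minimal sketch for the bash shell, assuming MAKER-P is installed under /opt/maker as described above. The ZOE value is an assumption (SNAP expects it to point at the directory that holds its HMM folder); confirm the location on your instance before relying on it:

```shell
# Install prefix named in this tutorial.
MAKER_HOME=/opt/maker

# Put the MAKER-P executables and auxiliary scripts on the search path.
export PATH="$MAKER_HOME/bin:$PATH"

# Assumed location of the SNAP directory that holds the HMM files.
export ZOE="$MAKER_HOME/exe/snap"

# Report where the shell finds maker, or print a hint if it cannot.
command -v maker || echo "maker not found; check $MAKER_HOME/bin"
```

Adding the export lines to ~/.bashrc makes the settings persist across terminal sessions.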

Step 4. Set up a MAKER-P run.  Create a working directory called "maker_run" using the mkdir command and use cd to move into that directory:
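
In the terminal this is two commands; pwd is included here only to confirm the move:

```shell
# Create the working directory (-p avoids an error if it already exists).
mkdir -p maker_run

# Move into it and confirm the current location.
cd maker_run
pwd
```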

Step 5. Copy the directory "test_data" into the current directory using cp -r.  Verify using the ls command:
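
A sketch of the copy, assuming the staged data sit in ~/Desktop/MAKER-P_tutorial_data/test_data as described in Step 2. The existence check is an addition that reports a wrong path clearly:

```shell
# Location of the staged example data on the image (from Step 2).
SRC="$HOME/Desktop/MAKER-P_tutorial_data/test_data"

# Copy the whole directory, keeping its name, into the current directory.
if [ -d "$SRC" ]; then
    cp -r "$SRC" .
else
    echo "Staged data not found at $SRC" >&2
fi

# Confirm that test_data is now listed.
ls
```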

Step 6. Run the maker command with the --help flag to get a usage statement and list of options:

Step 7.  Create control files that tell MAKER-P what to do. Three files are required:

  • maker_opts.ctl - Gives location of input files (genome and evidence) and sets options that affect MAKER-P behavior
  • maker_exe.ctl - Gives path information for the underlying executables
  • maker_bopts.ctl - Sets parameters for filtering BLAST and Exonerate alignment results

To create these files run the maker command with the -CTL flag. Verify with ls:
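
In the terminal this amounts to the following. The check around maker is an addition so that a missing installation is reported clearly:

```shell
# Generate the three control files in the current directory
# (skipped with a message if maker is not on the PATH).
if command -v maker >/dev/null 2>&1; then
    maker -CTL
else
    echo "maker not found on PATH" >&2
fi

# Verify that the control files were written.
ls
```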

The "maker_exe.ctl" file is automatically generated with the correct paths to executables and does not need to be modified.  The "maker_bopts.ctl" file is automatically generated with reasonable default parameters and also does not need to be modified unless you want to experiment with optimization of these parameters.

The automatically generated "maker_opts.ctl" file does need to be modified in order to specify the genome file and evidence files to be used as input.  Several text editors are available, including emacs, nano, and gedit.  If you have not tried any of these, gedit is probably the easiest to learn, as it has a familiar graphical user interface.  It can be started by typing gedit & on the command line or by selecting it from the menu on the VNC desktop (Applications=>Accessories=>Text Editor).

Note: If pressed for time, a pre-edited version of the "maker_opts.ctl" file is staged in ~/Desktop/MAKER-P_tutorial_data/. Delete the current file and copy the staged version here.  Then skip to Step 8.
$ rm maker_opts.ctl
$ cp ~/Desktop/MAKER-P_tutorial_data/maker_opts.ctl .

Here are the sections of the "maker_opts.ctl" file you need to edit.  

  • Add path information to files as shown. 
  • Do not allow any spaces after the equal sign or anywhere else. 
  • You can specify a complete path or relative path as shown here. 
  • In general, you can specify multiple files by separating with a comma without spaces.

This section pertains to specifying the genome assembly to be annotated and setting organism type:

The following section pertains to EST and other mRNA expression evidence.  Here we are only using same species data, but one could specify data from a related species using the altest parameter.  With RNA-seq data aligned to your genome by Cufflinks or Tophat, one could use maker auxiliary scripts (cufflinks2gff3 and tophat2gff3) to generate GFF3 files and specify these using the est_gff parameter:

The following section pertains to protein sequence evidence.  Here we are using previously annotated protein sequences.  Another option would be to use SwissProt or other database:

This next section pertains to repeat identification:

Various programs for ab initio gene prediction can be specified in the next section.  Here we are using SNAP set to use an HMM trained on rice.  Specifying the entire path to the hmm file would not be necessary if the ZOE environment variable is set as described in step 3 above.
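
The edits described in the sections above can also be applied from the command line instead of a text editor. The following is a sketch only: set_ctl is a hypothetical helper written for this example, and every file name except test_genome.fasta, as well as the SNAP HMM path, is a placeholder to be replaced with the names found on your instance. The parameter names (genome, est, protein, rmlib, snaphmm) are those used in maker_opts.ctl.

```shell
# set_ctl KEY VALUE FILE  (hypothetical helper)
# Replace the value of KEY in a control file, keeping any trailing "#comment".
# Values must not contain "|" or "&", which are special to this sed expression.
set_ctl() {
    sed -e "s|^$1=[^#]*#|$1=$2 #|" -e t -e "s|^$1=.*|$1=$2|" "$3" > "$3.tmp" &&
        mv "$3.tmp" "$3"
}

CTL=maker_opts.ctl
if [ -f "$CTL" ]; then
    # Genome file named in this tutorial.
    set_ctl genome  ./test_data/test_genome.fasta "$CTL"
    # Placeholder file names: use the names shown by "ls test_data".
    set_ctl est     ./test_data/mRNA.fasta        "$CTL"
    set_ctl protein ./test_data/proteins.fasta    "$CTL"
    set_ctl rmlib   ./test_data/repeats.fasta     "$CTL"
    # Placeholder path to a rice-trained SNAP HMM.
    set_ctl snaphmm /opt/maker/exe/snap/HMM/O.sativa.hmm "$CTL"

    # Show the edited lines.
    grep -E '^(genome|est|protein|rmlib|snaphmm)=' "$CTL"
fi
```

Anchoring each pattern at the start of the line with the trailing equal sign keeps est from matching est_gff, and the value is written without surrounding spaces, in line with the rules listed above.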

Step 8.  Run MAKER-P.

Make sure you are in the "maker_run" directory and all of your files are in place.  Perform these steps to check:
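
A sketch of such a check, assuming the file and directory names used in this tutorial:

```shell
# Confirm that the current directory is maker_run.
pwd

# Check that the control files and the input data are present.
missing=0
for item in maker_opts.ctl maker_exe.ctl maker_bopts.ctl test_data; do
    if [ -e "$item" ]; then
        echo "ok       $item"
    else
        echo "MISSING  $item"
        missing=$((missing + 1))
    fi
done
echo "$missing item(s) missing"
```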

Starting the MAKER-P run is as simple as entering the command maker.  Because your maker control files are in the present directory, you do not have to explicitly specify these in the command; they will be found automatically.  

  • MAKER-P automatically outputs thousands of lines of STDERR, reporting on its progress and producing warnings and errors if they arise.  It is a good practice to redirect this output to a log file so you have a record of it, especially if something goes wrong.  Since we are in the bash shell, you can redirect the STDERR by typing the command as follows: maker 2> log_file &
  • Another useful practice is to employ the Unix time command to report statistics on how long the run took to complete. The output of time will appear at the end of the captured standard error file.

Putting all of this together we type the following command to start MAKER-P:
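
One way to combine the two practices in bash, offered as a sketch and not the only valid form. The parentheses matter: time is a bash keyword that reports on the shell's own STDERR, so wrapping the command in a subshell is what sends the timing report into the log file along with maker's messages.

```shell
# Run maker under time in a subshell so that both maker's STDERR and the
# timing report are redirected to log_file; & puts the job in the background.
(time maker) 2> log_file &

# Optional: follow the log as it grows (press Ctrl-C to stop following).
# tail -f log_file
```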

Running MAKER-P in MPI mode

If you have launched an Atmosphere instance with multiple CPUs, you can distribute MAKER-P across the processors using the mpiexec command.  A few additional steps are required to ensure that mpiexec can locate hosts.  This example assumes that the user has launched a "medium" instance with 4 CPUs.

  1. Add the host key fingerprint of 'localhost' to ~/.ssh/known_hosts using the ssh command:

  2. Enter your iPlant password at the prompt:

  3. Create a file called "hostfile" in your run directory that lists host and CPU information in the following format:

<hostname/IP>:<cores per node>

For this example the hostfile looks like this:
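
For a single instance with 4 CPUs the file has one line. A sketch that writes the file and prints it back, using localhost as the host name:

```shell
# One line per host: <hostname/IP>:<cores per node>
echo "localhost:4" > hostfile

# Display the file to confirm its contents.
cat hostfile
```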

  4. Finally, run MAKER-P with the mpiexec command, using the -n flag to specify the number of CPUs and the -f flag to specify the hostfile:
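
A sketch of the launch for the 4-CPU example, with logging and timing handled as in the single-process run. NCPU is a variable introduced here for illustration. The -f option is the one named in this tutorial; other MPI implementations may spell the hostfile option differently.

```shell
# Number of CPUs on the instance (4 for the "medium" size assumed above).
NCPU=4

# Launch maker on all CPUs; STDERR and the timing report go to log_file.
(time mpiexec -f hostfile -n "$NCPU" maker) 2> log_file &
```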

MAKER-P should now be running.  For this example, it usually takes about 30 minutes to complete.  

Monitor progress and check for errors by examining the log_file.  You will know MAKER-P is finished when the log_file announces "Maker is now finished!!!".  

  5. Use the tail command to look at the last 10 lines of the log_file.  Because we included the time command, statistics on the duration of the run are automatically appended to the end of the file:
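
A sketch of this check. maker_status is a hypothetical helper added here; it only searches the log for the completion message quoted above.

```shell
# maker_status LOGFILE  (hypothetical helper)
# Print "finished" if the completion message is in the log, else "not finished".
maker_status() {
    if grep -q "Maker is now finished" "$1" 2>/dev/null; then
        echo "finished"
    else
        echo "not finished"
    fi
}

# Show the last 10 lines of the log (tail's default), then the status.
[ -f log_file ] && tail log_file
maker_status log_file
```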

Step 10. Examine MAKER-P output.

Output data appears in a new directory called "test_genome.maker.output".  Move to that directory and examine its contents:

  • The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER.
  • The mpi_blastdb directory contains FASTA indexes and BLAST database files created from the input EST, protein, and repeat databases.
  • test_genome_master_datastore_index.log contains information on both the run status of individual contigs and information on where individual contig data is stored.
  • The test_genome_datastore directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.

Check the test_genome_master_datastore_index.log to see if there were any failures:

All completed.  Other possible status entries include:

  • FAILED - Indicates a failed run on this contig; MAKER will retry it
  • RETRY - Indicates that MAKER is retrying a contig that failed
  • SKIPPED_SMALL - Indicates the contig was too short to annotate (minimum contig length is specified in maker_opts.ctl)
  • DIED_SKIPPED_PERMANENT - Indicates a failed contig that MAKER will not attempt to retry (number of times to retry a contig is specified in maker_opts.ctl)
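
The status values listed above can be searched for directly. A sketch, assuming you are inside the test_genome.maker.output directory; problem_contigs is a hypothetical helper, and it matches the status words anywhere on a line without assuming a particular column layout.

```shell
# problem_contigs INDEX_LOG  (hypothetical helper)
# Print the index-log lines that carry one of the problem statuses.
problem_contigs() {
    grep -E 'FAILED|RETRY|SKIPPED_SMALL|DIED_SKIPPED_PERMANENT' "$1"
}

INDEX=test_genome_master_datastore_index.log
if [ -f "$INDEX" ]; then
    problem_contigs "$INDEX" || echo "no problem entries found"
fi
```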

The actual output data are stored in a nested set of directories under test_genome_datastore.

A typical set of outputs for a contig looks like this:

  • The Chr1.gff file is in GFF3 format and contains the maker gene models and underlying evidence such as repeat regions, alignment data, and ab initio gene predictions, as well as fasta sequence.  Having all of these data in one file is important to enable visualization of the called gene models and underlying evidence, especially using tools like Apollo which enable manual editing and curation of gene models.
  • The fasta files Chr1.maker.proteins.fasta and Chr1.maker.transcripts.fasta contain the protein and transcript sequences for the final MAKER-P gene calls.
  • The Chr1.maker.non_overlapping_ab_initio.proteins.fasta and Chr1.maker.non_overlapping_ab_initio.transcripts.fasta files contain ab initio models that do not overlap MAKER-P genes and were rejected for lack of supporting evidence.
  • The Chr1.maker.snap_masked.proteins.fasta and Chr1.maker.snap_masked.transcripts.fasta files contain the initial SNAP-predicted models, not further processed by MAKER-P.

The output directory theVoid.Chr1 contains raw output data from all of the pipeline steps.  One useful file found here is the repeat-masked version of the contig, query.masked.fasta.

13 Comments

  1. Step 5:

    cp ~/Desktop/MAKER-P_example_data/*.fasta test_data/.

    should be

    cp ~/Desktop/MAKER-P_example_data/test_data/*.fasta test_data/.

    The test_data is missing in the original documentation at Step 5

    1. Thanks for the feedback the tutorial has been updated accordingly

  2. In Step 3 the directory I get is

    Desktop  maker_bopts.ctl  maker_exe.ctl  maker_opts.ctl  start.jnlp

    When I try to follow the directions in Step 5 I keep getting this response:

    cp: target `test_data/.' is not a directory

  3. In this example you are working with the files on the ~desktop/test_run folder. However, in the real world our data is not there. How can we connect to read our data from /iplant/home/yourusername/ directory?

    I have a large genome file, protein files, ests etc. I do not think the space on a instance is enough.

    Thanks,

    1. You can transfer data from data store to additional space in your instance (see link below)

      https://pods.iplantcollaborative.org/wiki/display/atmman/Using+Volumes

  4. If one wants to use MAKER-P in MPI mode  what is the localhost name. Should I literally type localhost or what? If not could you please revise the tutorial and write an example localhost name?

    1. localhost is understood by the system ..so add that to the file as mentioned

      1. This is where confusion arises. In the yellow section it has mentioned hostname/IP then below it is localhost:4 which one should I put. Make a file and literally declare localhost:4 (or whatever CPUs) ?

        Next create a file called "hostfile" in your run directory that lists host and CPU information in the following format:

        <hostname/IP>:<cores per node>

        For this example the hostfile looks like this:

        $ cat hostfile
        localhost:4

  5. Maker_Run_11_12_2014]$ less log_file 
    STATUS: Parsing control files...
    WARNING: Could not get initialization lock. Trying Again...
    WARNING: Could not get initialization lock. Trying Again...
    WARNING: Could not get initialization lock. Trying Again...
    WARNING: Could not get initialization lock. Trying Again...
    ERROR: Cannot get initialization lock.
    If you are running maker in parallel or via MPI
    You may be facing a race condition.
    --> rank=NA, hostname=vm64-127.iplantcollaborative.org
    Could you please comment on the following error?

    1. How many cpu/cores does your instance have ?

      And where are you reading the data from (if its via fuse mount to data store )

      1. ID:d622ff7b-71f2-4411-80df-1cddd4360bb9

        Based on Image:MAKER-P_2.31.3
        (2bb7bb3b-3ceb-4f76-b97a-70e2fce03dc3)

        Size:medium2 (4 CPUs, 16 GB memory, 160 GB disk)

        I mounted the data using iRODS

        My Resource Usage   Request More Resources

        12%
        You are using 371 of 3000 allotted AUs.
        25%
        You are using 4 of 16 allotted CPUs.
        12%
        You are using 16 of 128 allotted GBs.

        1. Bring your files over to local drive on atmosphere (use icommands or idrop ) ..fuse mounted iRODS is not designed for high performance ..so you will see file locking issues

          1. I did and it started to work. Thanks.