MAKER Genome Annotation and gene editing using Apollo
Rationale and background:
MAKER-P is a flexible and scalable genome annotation pipeline that automates the many steps necessary for the detection of protein coding genes (Campbell et al. 2013). MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality indices. MAKER was developed by the Yandell Lab and is described in several publications (Cantarel et al. 2008; Holt & Yandell 2011). Additional background is available at the MAKER Tutorial at GMOD and is highly recommended reading.
Apollo is the first instantaneous, collaborative genomic annotation editor available on the web. With Web Apollo researchers can use any of the common browsers (for example, Chrome or Firefox) to jointly analyze and precisely describe the features of a genome in real time, whether they are in the same room or working from opposite sides of the world. The task of manual curation is spread out among many hands and eyes, enabling the creation of virtual research networks of researchers linked by a common interest in a particular organism or population.
This tutorial takes users through the following steps:
- Running MAKER on the Jetstream cloud
- Running downstream quality control tools on the predicted genes
- Running the Apollo gene editing tool to get highly curated gene annotations
Sounds great, what do I need to get started?
- XSEDE account
- Later on, you can request a startup XSEDE allocation.
- Your data (or you can run example data)
What kind of data do I need?
- Mandatory requirements
- Genome assembly (fasta file)
- Organism type
- Eukaryotic (default, set as: organism_type=eukaryotic)
- Prokaryotic (set as: organism_type=prokaryotic)
- Additional data that can be used to improve the annotation (Highly recommended)
- RNA evidence (at least one of them is needed)
- Assembled mRNA-seq transcriptome (fasta file)
- Expressed sequence tags (ESTs) data (fasta file)
- Aligned EST or transcriptome GFF3 from your organism
- Aligned EST or transcriptome GFF3 from a closely related organism
- Protein evidence
- Protein sequence file in fasta format (e.g. from multiple organisms)
- Protein GFF (aligned protein homology evidence from an external GFF3 file)
- For this particular tutorial we will use maize specific test data.
What kind of resources will I need for my project?
- Enough storage space on the MAKER-P Jetstream instance for both input and output files
- Creating an external volume and mounting it on the running MAKER-P instance is recommended
- Enough AUs to run your computation
Part 1: Connect to an instance of a MAKER Jetstream image (virtual machine)
Step 1. Go to https://use.jetstream-cloud.org/application and log in with your XSEDE credentials.
Step 2. Click "Create New Project" in the Project tab at the top, then enter a project name and a brief description.
Step 3. Launch an instance from the selected image and name it as MAKER-run
After the project has been created and you have opened it, click the "New" button, select the "MAKER-P_v3" image, and then click "Launch instance". In the next window (Basic Info):
- name the instance as "MAKER-run" (don't worry if you forgot to name the instance at this point, as you can always modify the name of the instance later)
- set base image version as "1.0" (default)
- leave the project as it is or change to a different project if needed
- select "Jetstream - Indiana University" or "Jetstream - TACC" as Provider and click "Continue". Your choice of provider will depend on the resources you have available (AUs) and the needs of your instance
- select "m1.medium" as Instance size (this is the minimum size required by the MAKER-P image) and click "Continue".
Step 4. As the instance is launched behind the scenes, you will get an update as it goes through each step.
Status updates during instance launch include Build-requesting launch, Build-networking, Build-spawning, Active-networking, and Active-deploying. Depending on the usage load on Jetstream, it can take anywhere from 2 to 5 minutes for an instance to become active. You can force a status check by using the refresh button on the Instance launch page or the refresh button in your browser. Once the instance becomes active, a virtual machine with the IP address provided will become available for you to connect to. This virtual machine will have all the necessary components to run MAKER-P.
Step 5: Create a volume
The m1.medium instance size (60 GB disk space) selected for the MASTER instance of MAKER-P may not provide enough storage for most MAKER runs, so running on an attached volume is recommended.
5.1 Click the "New" button in the project and select "Create Volume". Enter the volume name, the size needed (GB), and the provider (TACC or Indiana), then click "Create Volume".
Attach the created volume to the MASTER instance
5.2 Click on the MAKER-P_v3 instance now
Jetstream provides web-shell, a web-based terminal, for accessing your VM at the command line level once it has been deployed.
However, you may wish to access your VM via SSH if you have provisioned it with a routable IP address. For SSH access, you can create (or copy) SSH public keys for your non-Jetstream computer and deposit them in your Atmosphere settings so that the computer can access Jetstream. More instructions can be found here
5.3 Mount the volume to a specified drive.
Once you have logged in to your MASTER instance using the web shell or SSH, you must change the directory permissions as below
Step 6. Set up iCommands for data transfer
We will use iCommands, a service from iRODS, to transfer evidence data from the CyVerse Data Commons repository. iCommands is a collection of commands for Linux and Mac OS operating systems that are used in the iRODS system to interact with the CyVerse Data Store. iCommands can be used to transfer large amounts of data from CyVerse to the running Jetstream instance. A complete list of iCommands and their usage is here
The first time you use iCommands, you must initiate the connection to iRODS.
In a terminal window, enter iinit to initialize iCommands and your Data Store connection. For example, here's what you would do if your iRODS user name is cyverse-user:
After iinit has finished, type ils to check that iCommands is working. You should see your home directory at /iplant/home/your_user_name
Download the evidence set required for annotation
Part 3: Set up a MAKER run using the Terminal window
The files listed below, found in the data folder, will be used as evidence datasets for running the MAKER-P annotation on maize genomes:
1) genbank_ests.fasta: These are maize ESTs downloaded from GenBank and identified using this search command: (EST[Keyword]) AND maize[Organism]
2) genbank_ests_ATCG.fasta: These are full-length cDNAs downloaded from GenBank and identified using this search command: (FLI-CDNA[Keyword]) AND maize[Organism]
3) wang_isoseq.fasta: These are transcripts built from Iso-Seq data, published here:
Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing.
Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, Lu Z, Olson A, Stein JC, Ware D.
Nat Commun. 2016 Jun 24;7:11708. doi: 10.1038/ncomms11708.
4) Protein files: All of these data sets were downloaded from the Gramene FTP site using these commands:
Sorghum: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/sorghum_bicolor/pep/Sorghum_bicolor.Sorbi1.27.pep.all.fa.gz
Rice: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/oryza_sativa/pep/Oryza_sativa.IRGSP-1.0.27.pep.all.fa.gz
Arabidopsis: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/arabidopsis_thaliana/pep/Arabidopsis_thaliana.TAIR10.27.pep.all.fa.gz
Setaria: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/setaria_italica/pep/Setaria_italica.JGIv2.0.27.pep.all.fa.gz
Brachypodium: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/brachypodium_distachyon/pep/Brachypodium_distachyon.v1.0.27.pep.all.fa.gz
5) Wessler-Bennetzen_2.fasta: This is the repeat library generated for the original B73 annotation. The helitrons were removed to prevent overmasking.
6) martin_nature_seedling_transcriptome_longer_than_300bp.fa: These are the assembled transcripts from a very high-depth seedling transcriptome, published here:
A near complete snapshot of the Zea mays seedling transcriptome revealed from ultra-deep sequencing.
Martin JA, Johnson NV, Gross SM, Schnable J, Meng X, Wang M, Coleman-Derr D, Lindquist E, Wei CL, Kaeppler S, Chen F, Wang Z.
Sci Rep. 2014 Mar 31;4:4519. doi: 10.1038/srep04519.
7) law_trinity_longer_than_300bp_cdhit_99.fasta: These are the Trinity assemblies from the 95 RNA-seq experiments used for annotation in this paper:
Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes.
Law M, Childs KL, Campbell MS, Stein JC, Olson AJ, Holt C, Panchy N, Lei J, Jiao D, Andorf CM, Lawrence CJ, Ware D, Shiu SH, Sun Y, Jiang N, Yandell M.
Plant Physiol. 2015 Jan;167(1):25-39. doi: 10.1104/pp.114.245027. Epub 2014 Nov 10.
Transcripts shorter than 300 bp have been removed, and CD-HIT was run on the remaining sequences with a similarity threshold of 0.99.
8) jiao_w22_all.fasta: These are Trinity-assembled W22 transcripts from the following tissues
Executables for running MAKER are located in /opt/maker/bin and /opt/maker/exe:
As the names suggest, the "bin" directory includes many useful auxiliary scripts. For example, cufflinks2gff3 converts output from an RNA-seq analysis into a GFF3 file that can be used as input evidence for MAKER. RepeatMasker, Augustus, BLAST, Exonerate, and SNAP are programs that MAKER uses in its pipeline. We recommend reading the MAKER Tutorial at GMOD for more information about these.
Step 2. Run the maker command with the --help flag to get a usage statement and list of options:
Step 3. Create control files that tell MAKER what to do. Three files are required:
- maker_opts.ctl: gives the location of input files (genome and evidence) and sets options that affect MAKER behavior
- maker_exe.ctl: gives path information for the underlying executables
- maker_bopt.ctl: sets parameters for filtering BLAST and Exonerate alignment results
To create these files run the maker command with the -CTL flag. Verify with ls:
- The "maker_exe.ctl" is automatically generated with the correct paths to executables and does not need to be modified.
- The "maker_bopt.ctl" is automatically generated with reasonable default parameters and also does not need to be modified unless you want to experiment with optimization of these parameters.
- The automatically generated "maker_opts.ctl" file needs to be modified in order to specify the genome file and evidence files to be used as input. You can use the text editor "vi" or "nano" that is already installed in the instance
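The same check can be scripted. The following is a minimal Python sketch, assuming the three control files were written to the directory you pass in; the helper name missing_control_files is illustrative and not part of MAKER.

```python
from pathlib import Path

# The three control files named in this step.
CONTROL_FILES = ("maker_opts.ctl", "maker_exe.ctl", "maker_bopt.ctl")


def missing_control_files(run_dir="."):
    """Return the names of control files that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in CONTROL_FILES if not (run_dir / name).is_file()]
```

Calling missing_control_files() in the run directory should return an empty list once all three files have been created.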
Open maker_opts.ctl with vi.
Here are the sections of the "maker_opts.ctl" file you need to edit. For more information, see The_MAKER_control_files_explained. Add path information to files as shown.
This section pertains to specifying the genome assembly to be annotated and setting organism type:
The following section pertains to EST and other mRNA expression evidence. Here we are only using maize data, but one could specify data from a related species using the "altest" parameter. With RNA-seq data aligned to your genome by Cufflinks or TopHat, one could use the MAKER auxiliary scripts (cufflinks2gff3 and tophat2gff3) to generate GFF3 files and specify these using the est_gff parameter:
The following section pertains to protein sequence evidence. Here we are using previously annotated protein sequences. Another option would be to use SwissProt or another database:
This next section pertains to repeat identification:
This next section pertains to settings for gene predictors:
Keep the rest of the settings as default.
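If you prefer to script these edits instead of making them by hand in vi, the following Python sketch rewrites selected key=value lines in the control file text while keeping the trailing #comment on each line. It assumes the control file uses one key=value pair per line followed by an optional comment; the function name set_ctl_options and any paths you pass in are illustrative, not part of MAKER.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with `key=value` lines rewritten for keys in updates.

    The `#comment` text that follows each value is preserved. Keys that do
    not occur in the file raise KeyError instead of being appended silently.
    """
    remaining = dict(updates)
    out_lines = []
    for line in ctl_text.splitlines():
        # key, then everything up to an optional comment, then the comment
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            line = f"{key}={remaining.pop(key)} {comment}".rstrip()
        out_lines.append(line)
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out_lines) + "\n"
```

For example, passing {"genome": "/path/to/genome.fasta", "organism_type": "eukaryotic"} (a hypothetical path) updates only those two lines and leaves every other setting at its default. Failing on unknown keys is a deliberate choice here, so that a misspelled option name is noticed before the run starts.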
Step 7. Run MAKER-P
MAKER-P will be run using MPI to scale across the 44 CPUs available on the instance.
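As a rough sketch, the run could be launched from Python as shown here. It assumes that mpiexec and maker are both on the PATH, that the three control files are in the working directory, and that all output is collected in maker.log; the function names are illustrative and not part of MAKER.

```python
import subprocess


def build_maker_mpi_command(n_procs):
    """Return the argument list for launching MAKER under MPI with n_procs processes."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return ["mpiexec", "-n", str(n_procs), "maker"]


def run_maker(n_procs, log_path="maker.log"):
    """Run MAKER in the current directory; write stdout and stderr to log_path."""
    with open(log_path, "w") as log:
        result = subprocess.run(
            build_maker_mpi_command(n_procs),
            stdout=log,
            stderr=subprocess.STDOUT,
            check=False,  # inspect the return code instead of raising
        )
    return result.returncode
```

Calling run_maker(44) would match the CPU count mentioned above; use a smaller number on a smaller instance.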
You can track the status of the MAKER-P run by checking the contents of the maker.log file
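One way to follow the log without opening it in an editor is a small helper such as the one below. This is a sketch that assumes maker.log is a plain text file in the run directory; the function names are illustrative, and the phrase you search for is whatever completion or error message you expect to see in your own log.

```python
from collections import deque


def tail(path, n=10):
    """Return the last n lines of a text file, reading it line by line."""
    with open(path, errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def log_mentions(path, phrase):
    """Return True if any line of the log contains phrase (case-sensitive)."""
    with open(path, errors="replace") as handle:
        return any(phrase in line for line in handle)
```

For example, tail("maker.log", 20) returns the most recent 20 lines of output.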
Once MAKER-P finishes, check the maker.log file again. You should see the following message
Step 8: Merge the GFF and fasta files generated from the MAKER-P run
To merge the GFF files
To merge the fasta files
The above commands create consolidated annotation files as shown below:
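To confirm that the merged files are present, a small Python sketch such as the following can list them. It assumes the merged outputs share a base name followed by ".all." (as in test.all.maker.proteins.fasta, used later in this tutorial); the function name is illustrative.

```python
from pathlib import Path


def list_merged_outputs(run_dir=".", base="test"):
    """Return the sorted names of files in run_dir that start with '<base>.all.'."""
    run_dir = Path(run_dir)
    return sorted(p.name for p in run_dir.glob(f"{base}.all.*") if p.is_file())
```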
Part 4: Quality control of annotated genes
Once the MAKER run is finished, the next step is to filter out misannotated gene models and gene models with little supporting evidence. The sections below describe how to filter out such gene models.
4.1 Gene and transcript counts
Make sure the MAKER-generated protein and transcript files have the same counts as the mRNA count in test.all.all
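The comparison can be sketched in Python as follows. It assumes standard fasta files (one ">" header per sequence) and a tab-separated GFF3 file whose third column holds the feature type; the function names and the optional source filter are illustrative.

```python
def count_fasta_records(path):
    """Count sequences in a fasta file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_gff3_features(path, feature_type="mRNA", source=None):
    """Count GFF3 rows of the given type, optionally restricted to one source."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # embedded sequence section: no feature rows follow
            if line.startswith("#"):
                continue  # directive or comment
            columns = line.rstrip("\n").split("\t")
            if len(columns) < 3 or columns[2] != feature_type:
                continue
            if source is not None and columns[1] != source:
                continue
            count += 1
    return count
```

The three numbers (protein records, transcript records, and mRNA features) should agree; a mismatch suggests that one of the merge steps was incomplete.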
4.2 Run InterProScan on MAKER annotated Proteins
First we create a directory called Interproscan and copy the MAKER-annotated protein file "test.all.maker.proteins.fasta" to it
Next we divide the protein fasta into 10 chunks using the fastq_plittl.pl script. This allows us to run InterProScan in parallel on the chunks instead of on a single large protein sequence file.
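For readers who want to see what the chunking step does, here is a minimal Python sketch of the same idea. It is not the script used in this tutorial: it distributes records round-robin across the output files, and the output naming scheme (prefix.partNN.fasta) is an assumption made for illustration.

```python
def read_fasta(path):
    """Yield (header_line, sequence_lines) for each record in a fasta file."""
    header, lines = None, []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if header is not None:
                    yield header, lines
                header, lines = line, []
            elif header is not None:
                lines.append(line)
    if header is not None:
        yield header, lines


def split_fasta(path, n_chunks, out_prefix):
    """Write records round-robin into n_chunks files and return their paths."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    out_paths = [f"{out_prefix}.part{i + 1:02d}.fasta" for i in range(n_chunks)]
    handles = [open(p, "w") for p in out_paths]
    try:
        for index, (header, lines) in enumerate(read_fasta(path)):
            out = handles[index % n_chunks]
            out.write(header)
            out.writelines(lines)
    finally:
        for handle in handles:
            handle.close()
    return out_paths
```

Round-robin assignment keeps the chunks close to equal in record count, which helps the parallel jobs finish at roughly the same time.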
Create a jobs list of interproscan commands to be submitted
With this we can split the batch file "interpro_jobs_to_split.txt" into multiple files and run them in parallel.
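The two steps above (building the jobs list and dividing it into batches) can be sketched in Python as follows. The interproscan.sh options shown (-i, -f tsv, -o, -goterms, -iprlookup) are standard InterProScan 5 options, but whether they match the exact command used for this tutorial is an assumption; the function names and output file naming are illustrative.

```python
def interproscan_commands(chunk_paths, out_dir="."):
    """Build one interproscan.sh command line per fasta chunk."""
    commands = []
    for path in chunk_paths:
        base = path.rsplit("/", 1)[-1]  # file name without its directory
        commands.append(
            f"interproscan.sh -i {path} -f tsv -o {out_dir}/{base}.tsv"
            " -goterms -iprlookup"
        )
    return commands


def batch(items, n_batches):
    """Split items into n_batches lists of near-equal size, preserving order."""
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches
```

Each batch can then be written to its own file and run as a separate shell script.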
InterProScan outputs a TSV file with IPR domains
Combine all the TSV files into a single TSV file
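Because InterProScan TSV output has no header row, combining the per-chunk files amounts to concatenating them. A minimal Python sketch, with an illustrative function name:

```python
def combine_tsv(tsv_paths, out_path):
    """Concatenate headerless TSV files into out_path and return the row count."""
    rows = 0
    with open(out_path, "w") as out:
        for path in tsv_paths:
            with open(path) as handle:
                for line in handle:
                    if not line.strip():
                        continue  # skip blank lines
                    # make sure a file lacking a final newline does not merge rows
                    out.write(line if line.endswith("\n") else line + "\n")
                    rows += 1
    return rows
```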
4.3 Run BLASTP with MAKER-annotated proteins against the UniProt database
4.4 Update the MAKER annotation with the functional annotation
First we will update the MAKER gff with InterProScan output. In this step you will update the gff3 file to contain the iprscan information on the mRNA line
This procedure added a Dbxref tag to column nine of the gene and mRNA features in the GFF3 file that have Pfam domains identified by InterProScan. The value for this tag contains InterPro and Pfam IDs as well as the Gene Ontology IDs associated with the identified domains, and looks like this:
We now update the MAKER gff and fasta files with functions identified from BLASTP against UniProt/SwissProt
This procedure added a functional tag identified by BLASTP to column nine of the gene and mRNA features in the GFF3 file. The value looks like this
Note=Similar to RNP1: Heterogeneous nuclear ribonucleoprotein 1 (Arabidopsis thaliana)
Similarly, add this information to the MAKER fasta files (protein and transcript sequences)
4.5 Build shorter IDs/Names for MAKER genes and transcripts following the NCBI suggested naming format
- --prefix : The prefix to use for all IDs (default = 'MAKER_')
- --justify : The unique integer portion of the ID will be right justified with '0's to this length (default = 8)
This will create a mapping file as below, and it can be used to rename the feature IDs
maker-scaffold10-augustus-gene-0.3 PYU1_000001
maker-scaffold10-augustus-gene-0.3-mRNA-3 PYU1_000001-RA
maker-scaffold10-augustus-gene-0.3-mRNA-2 PYU1_000001-RB
maker-scaffold10-augustus-gene-0.3-mRNA-1 PYU1_000001-RC
maker-scaffold10-augustus-gene-0.4 PYU1_000002
maker-scaffold10-augustus-gene-0.4-mRNA-1 PYU1_000002-RA
maker-scaffold10-augustus-gene-0.4-mRNA-2 PYU1_000002-RB
maker-scaffold10-augustus-gene-0.5 PYU1_000003
maker-scaffold10-augustus-gene-0.5-mRNA-5 PYU1_000003-RA
maker-scaffold10-augustus-gene-0.5-mRNA-6 PYU1_000003-RB
Alternate transcripts are assigned the suffixes -RA, -RB, -RC, and so on.
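To make the naming scheme concrete, the following Python sketch reproduces the pattern seen in the mapping file: a prefix, a zero-padded gene number, and a letter suffix per transcript. It is an illustration of the scheme only, not a reimplementation of maker_map_ids; the function names are hypothetical, and the order in which genes and transcripts are numbered is simply the order you pass in.

```python
import string


def transcript_suffix(index):
    """Return the suffix for the index-th transcript (0 gives RA, 1 gives RB)."""
    if not 0 <= index < len(string.ascii_uppercase):
        raise ValueError("this sketch only covers 26 transcripts per gene")
    return "R" + string.ascii_uppercase[index]


def build_id_map(genes, prefix="MAKER_", justify=8):
    """Map long MAKER names to short IDs.

    genes is an ordered list of (gene_name, [transcript_names]) pairs.
    """
    id_map = {}
    for number, (gene, transcripts) in enumerate(genes, start=1):
        short = f"{prefix}{number:0{justify}d}"  # zero-padded to `justify` digits
        id_map[gene] = short
        for index, transcript in enumerate(transcripts):
            id_map[transcript] = f"{short}-{transcript_suffix(index)}"
    return id_map
```

With prefix="PYU1_" and justify=6, the first gene becomes PYU1_000001 and its transcripts PYU1_000001-RA, PYU1_000001-RB, and so on, matching the listing above.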
You can explore more options with the maker_map_ids script to name feature IDs
We map the short IDs/Names from genome.all.id.map onto the MAKER GFF3 file test.all.functional_ipr.uniprot.gff; the old IDs/Names are kept in the Alias attribute
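The effect of this mapping on a single GFF3 row can be sketched in Python as shown here. This is a simplified illustration, not the MAKER script itself: it assumes a two-column, whitespace-separated map file, renames only ID, Name, and Parent values that appear in the map exactly, and does not handle child features (such as exons) whose IDs are derived from the transcript name. The function names are hypothetical.

```python
def load_id_map(path):
    """Read a two-column, whitespace-separated map of old ID to new ID."""
    id_map = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) == 2:
                id_map[fields[0]] = fields[1]
    return id_map


def rename_gff3_line(line, id_map):
    """Rename ID, Name, and Parent in one GFF3 row; keep the old ID as Alias."""
    if line.startswith("#") or line.count("\t") < 8:
        return line  # directive, comment, or not a nine-column feature row
    columns = line.rstrip("\n").split("\t")
    attributes, alias = [], None
    for pair in columns[8].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        if key in ("ID", "Name") and value in id_map:
            if key == "ID":
                alias = value  # remember the old ID for the Alias attribute
            value = id_map[value]
        elif key == "Parent":
            value = ",".join(id_map.get(v, v) for v in value.split(","))
        attributes.append(f"{key}={value}")
    if alias is not None:
        attributes.append(f"Alias={alias}")
    columns[8] = ";".join(attributes)
    return "\t".join(columns) + "\n"
```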