
MAKER Genome Annotation and gene editing using Apollo

Rationale and background:

MAKER-P is a flexible and scalable genome annotation pipeline that automates the many steps necessary for the detection of protein coding genes (Campbell et al. 2013). MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality indices. MAKER was developed by the Yandell Lab and is described in several publications (Cantarel et al. 2008; Holt & Yandell 2011). Additional background is available in the MAKER Tutorial at GMOD, which is highly recommended reading.

Apollo is the first instantaneous, collaborative genomic annotation editor available on the web. With Web Apollo researchers can use any of the common browsers (for example, Chrome or Firefox) to jointly analyze and precisely describe the features of a genome in real time, whether they are in the same room or working from opposite sides of the world. The task of manual curation is spread out among many hands and eyes, enabling the creation of virtual research networks of researchers linked by a common interest in a particular organism or population.

This tutorial will take users through steps of:

  1. Running MAKER on Jetstream cloud
  2. Running downstream quality control tools on the predicted genes
  3. Running Apollo gene editing tool to get highly curated gene annotations

Considerations

Sounds great, what do I need to get started?

  1. XSEDE account
  2. A startup XSEDE allocation, which you can request later on
  3. Your data (or you can run example data)

What kind of data do I need?

  1. Mandatory requirements
    1. Genome assembly (fasta file)
    2. Organism type
      1. Eukaryotic (default, set as: organism_type=eukaryotic)
      2. Prokaryotic (set as: organism_type=prokaryotic)
  2. Additional data that can be used to improve the annotation (Highly recommended)
    1. RNA evidence (at least one of them is needed)
      1. Assembled mRNA-seq transcriptome (fasta file)
      2. Expressed sequence tags (ESTs) data (fasta file)
      3. Aligned EST or transcriptome GFF3 from your organism
      4. Aligned EST or transcriptome GFF3 from a closely related organism
    2. Protein evidence
      1. Protein sequence file in fasta format (e.g. from multiple organisms)
      2. Protein GFF (aligned protein homology evidence from an external GFF3 file)

  3. For this particular tutorial we will use maize specific test data.

What kind of resources will I need for my project?

  1. Enough storage space on the MAKER-P Jetstream instance for both input and output files
    1. Creating and mounting an external volume to the running MAKER-P instance would be recommended
  2. Enough AUs to run your computation

Part 1: Connect to an instance of a MAKER Jetstream image (virtual machine)

Step 1. Go to https://use.jetstream-cloud.org/application and log in with your XSEDE credentials.

 

Step 2. Click "Create New Project" in the Project tab at the top, then enter the name of the project and a brief description.


Step 3. Launch an instance from the selected image and name it as MAKER-run

After the project has been created and you have opened it, click the "New" button, select the "MAKER-P_v3" image and then click "Launch instance". In the next window (Basic Info):

  • name the instance as "MAKER-run" (don't worry if you forgot to name the instance at this point, as you can always modify the name of the instance later)
  • set base image version as "1.0" (default)
  • leave the project as it is or change to a different project if needed
  • select "Jetstream - Indiana University" or "Jetstream - TACC" as Provider and click "Continue". Your choice of provider will depend on the resources you have available (AUs) and the needs of your instance
  • select "m1.medium" as Instance size (this is the minimum size that is required by MAKER-P image) and click "Continue". 

Step 4. As the instance is launched behind the scenes, you will get an update as it goes through each step.


Status updates of the instance launch include Build-requesting launch, Build-networking, Build-spawning, Active-networking and Active-deploying. Depending on the usage load on Jetstream, it can take anywhere from 2 to 5 minutes for an instance to become active. You can force a status check by using the refresh button on the Instance launch page or the refresh button of your browser. Once the instance becomes active, a virtual machine with the IP address provided will become available for you to connect to. This virtual machine will have all the necessary components to run MAKER-P.

Step 5: Create a volume

Since the disk space of the m1.medium instance size (60 GB) selected for the MASTER instance of MAKER-P may not be sufficient for most MAKER runs, it is recommended to run MAKER on an attached volume.

5.1 Click the "New" button in the project and select "Create Volume". Enter the name of the volume, volume size (GB) needed and the provider (TACC or Indiana) and finally click "Create Volume" 

Attach the created volume to the MASTER instance

5.2 Click on the MAKER-P_v3 instance now

Jetstream provides web-shell, a web-based terminal, for accessing your VM at the command line level once it has been deployed.

However, you might find that you wish to access your VM via SSH if you’ve provisioned it with a routable IP number. For SSH access, you can create (or copy) SSH public-keys for your non-Jetstream computer that will allow it to access Jetstream then deposit those keys in your Atmosphere settings. More instructions can be found here 

5.3 Mount the volume to a specified drive. 

Once you have logged in to your MASTER instance using the web shell or SSH, you must change the directory permissions of the mounted volume.
 

Step 6: Set up iCommands for data transfer

We will use iCommands, a service from iRODS, to transfer evidence data from the CyVerse Data Commons repository. iCommands is a collection of commands for Linux and Mac OS operating systems that are used in the iRODS system to interact with the CyVerse Data Store. iCommands can be used to transfer large amounts of data from the CyVerse Data Store to the running Jetstream instance. A complete list of iCommands and their usage is here

The first time you use iCommands, you must initiate the connection to iRODS.

 

  1. In a terminal window, enter iinit to initialize iCommands and your Data Store connection. For example, here's what you would do if your iRODS user name is cyverse-user:

     

     

  2. Once iinit has finished, type ils to check that iCommands is working. You should see your home directory at /iplant/home/your_user_name

  3. Download the evidence set required for annotation

 

Part 3: Set up a MAKER run using the Terminal window

 

Step 1. Get oriented. You will find your test data within your mounted volume at "/vol_b/run_data". List its contents with the ls command:

The files listed below, found in the data folder, will be used as evidence datasets for running the MAKER-P annotation on maize genomes:


1) genbank_ests.fasta: These are maize ESTs downloaded from genbank and identified using this search command: (EST[Keyword]) AND maize[Organism]
2) genbank_ests_ATCG.fasta: These are full length cDNAs downloaded from genbank and identified using this search command: (FLI-CDNA[Keyword]) AND maize[Organism]
3) wang_isoseq.fasta: These are transcripts built from isoseq data. Published here:
    Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing.
    Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, Lu Z, Olson A, Stein JC, Ware D.
    Nat Commun. 2016 Jun 24;7:11708. doi: 10.1038/ncomms11708.
    PMID: 27339440
4) Protein files: All of these data sets were downloaded from the Gramene FTP site using these commands
    Sorghum: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/sorghum_bicolor/pep/Sorghum_bicolor.Sorbi1.27.pep.all.fa.gz
    Rice: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/oryza_sativa/pep/Oryza_sativa.IRGSP-1.0.27.pep.all.fa.gz
    Arabidopsis: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/arabidopsis_thaliana/pep/Arabidopsis_thaliana.TAIR10.27.pep.all.fa.gz
    Setaria: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/setaria_italica/pep/Setaria_italica.JGIv2.0.27.pep.all.fa.gz
    Brachypodium: wget ftp://ftp.gramene.org/pub/gramene/release46/data/fasta/brachypodium_distachyon/pep/Brachypodium_distachyon.v1.0.27.pep.all.fa.gz
5) Wessler-Bennetzen_2.fasta: This is the repeat library generated for the original B73 annotation. The helitrons were removed to prevent overmasking.
6) martin_nature_seedling_transcriptome_longer_than_300bp.fa: These are the assembled transcripts from a very high depth seedling transcriptome published here
    A near complete snapshot of the Zea mays seedling transcriptome revealed from ultra-deep sequencing.
    Martin JA, Johnson NV, Gross SM, Schnable J, Meng X, Wang M, Coleman-Derr D, Lindquist E, Wei CL, Kaeppler S, Chen F, Wang Z.
    Sci Rep. 2014 Mar 31;4:4519. doi: 10.1038/srep04519.
    PMID: 24682209
7) law_trinity_longer_than_300bp_cdhit_99.fasta: These are the Trinity assemblies from the 95 RNA-seq experiments used for annotation in this paper
    Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes.
    Law M, Childs KL, Campbell MS, Stein JC, Olson AJ, Holt C, Panchy N, Lei J, Jiao D, Andorf CM, Lawrence CJ, Ware D, Shiu SH, Sun Y, Jiang N, Yandell M.
    Plant Physiol. 2015 Jan;167(1):25-39. doi: 10.1104/pp.114.245027. Epub 2014 Nov 10.
    PMID: 25384563
    Transcripts less than 300bp have been removed and cdhit was run on the remaining sequences with a similarity threshold of .99
8) jiao_w22_all.fasta: These are Trinity-assembled W22 transcripts from the following tissues
    ear
    embryo
    endosperm
    kernel
    leaf
    root
    shoot
    tassel

Executables for running MAKER are located in /opt/maker/bin and /opt/maker/exe:

As the names suggest, the "/usr/local/maker/bin/" directory includes many useful auxiliary scripts. For example, cufflinks2gff3 will convert output from an RNA-seq analysis into a GFF3 file that can be used as input evidence for WQ-MAKER. RepeatMasker, augustus, blast, exonerate, and snap are programs that MAKER uses in its pipeline. We recommend reading the MAKER Tutorial at GMOD for more information about these.


Step 2. Run the maker command with the --help flag to get a usage statement and list of options:

Step 3.  Create control files that tell MAKER what to do. Three files are required:

  • maker_opts.ctl - gives location of input files (genome and evidence) and sets options that affect MAKER behavior
  • maker_exe.ctl - gives path information for the underlying executables.
  • maker_bopt.ctl - sets parameters for filtering BLAST and Exonerate alignment results

To create these files run the maker command with the -CTL flag. Verify with ls:

  • The "maker_exe.ctl" is automatically generated with the correct paths to executables and does not need to be modified.  
  • The "maker_bopt.ctl" is automatically generated with reasonable default parameters and also does not need to be modified unless you want to experiment with optimization of these parameters.
  • The automatically generated "maker_opts.ctl" file needs to be modified in order to specify the genome file and evidence files to be used as input. You can use the text editor "vi" or "nano", both of which are already installed on the instance.

 

Open maker_opts.ctl with vi.

 

Here are the sections of the "maker_opts.ctl" file you need to edit. For more information, see The_MAKER_control_files_explained. Add path information to files as shown.


Do not leave any spaces after the equal sign or anywhere else.

The files can be in the same directory as "maker_opts.ctl"; if they are in other directories, make sure you use the relative path.
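If you prefer to script these edits rather than make them by hand in vi, the following is a minimal Python sketch. It assumes that each option sits on its own line in the form key=value, optionally followed by a #comment, as in the file generated by maker -CTL. The function name set_ctl_option, the excerpt text and the file name genome.fasta are hypothetical and only serve as illustration; genbank_ests.fasta is one of the test data files listed above.

```python
def set_ctl_option(ctl_text, key, value):
    """Return ctl_text with the line `key=...` rewritten as `key=value`.

    Any trailing `#comment` on that line is kept. The value is written
    directly after the equal sign, and values containing spaces are rejected.
    """
    if any(ch.isspace() for ch in value):
        raise ValueError(f"value for {key!r} must not contain spaces: {value!r}")
    updated = []
    found = False
    for line in ctl_text.splitlines():
        if line.startswith(key + "="):
            comment = line.partition("#")[2]
            line = f"{key}={value}"
            if comment:
                line += f" #{comment}"
            found = True
        updated.append(line)
    if not found:
        raise KeyError(f"option {key!r} not found in control file")
    return "\n".join(updated) + "\n"


# Hypothetical excerpt; the real file is produced by `maker -CTL`.
example_ctl = (
    "genome= #genome sequence\n"
    "organism_type=eukaryotic #eukaryotic or prokaryotic\n"
    "est= #set of ESTs or assembled mRNA-seq in fasta format\n"
)

edited = set_ctl_option(example_ctl, "genome", "genome.fasta")  # assumed file name
edited = set_ctl_option(edited, "est", "genbank_ests.fasta")
print(edited)
```

To apply this to the real control file, read maker_opts.ctl into a string, call the function once per option, and write the result back. Keep a copy of the original file first.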

This section pertains to specifying the genome assembly to be annotated and setting organism type:

The following section pertains to EST and other mRNA expression evidence. Here we are only using maize data, but one could specify data from a related species using the "altest" parameter. With RNA-seq data aligned to your genome by Cufflinks or Tophat, one could use the MAKER auxiliary scripts (cufflinks2gff3 and tophat2gff3) to generate GFF3 files and specify these using the est_gff parameter:

The following section pertains to protein sequence evidence. Here we are using previously annotated protein sequences. Another option would be to use SwissProt or another database:

This next section pertains to repeat identification:

This next section pertains to settings for gene predictors:


Keep the rest of the settings as default.

Step 7.  Run MAKER-P

MAKER-P will be run using MPI to scale across the 44 CPUs available on the instance.

You can track the status of the MAKER-P run by checking the contents of the maker.log file

Once MAKER-P finishes, check the maker.log file again. You should see the following message:

 

Step 8: Merge gff and fasta files generated from MAKER-P run

To merge the GFF files:

To merge the FASTA files:

 

The above commands create consolidated annotation files as shown below:

Part 4: Quality control of annotated genes

Once the MAKER run is finished, the next step is to filter out misannotated gene models and gene models with little supporting evidence. The section below describes how to filter out such gene models.

4.1 Gene and transcript counts

Make sure the MAKER-generated protein and transcript files have the same sequence counts as the number of mRNA features in test.all.all
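As an illustration of this check, here is a minimal Python sketch that counts FASTA records (lines starting with ">") and mRNA features (third column of a nine-column GFF3 line). The function names are hypothetical, and the functions take any iterable of lines so they work on open file handles as well as on lists.

```python
def count_fasta_records(lines):
    """Count sequences in FASTA-formatted lines (one '>' header per record)."""
    return sum(1 for line in lines if line.startswith(">"))


def count_gff3_features(lines, feature_type="mRNA"):
    """Count GFF3 features whose type (column 3) equals feature_type."""
    count = 0
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if line.startswith("#"):
            continue  # header or comment line
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9 and columns[2] == feature_type:
            count += 1
    return count


def counts_agree(gff_lines, protein_lines, transcript_lines):
    """True if the mRNA, protein and transcript counts are all equal."""
    n_mrna = count_gff3_features(gff_lines, "mRNA")
    n_proteins = count_fasta_records(protein_lines)
    n_transcripts = count_fasta_records(transcript_lines)
    return n_mrna == n_proteins == n_transcripts
```

With real files, pass open handles, for example count_fasta_records(open("test.all.maker.proteins.fasta")), and compare the result with the mRNA count from the merged GFF3 file.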

 

4.2 Run InterProScan on MAKER annotated Proteins

First we create a directory called Interproscan and copy the MAKER-annotated protein file "test.all.maker.proteins.fasta" to it

Next we divide the protein fasta into 10 chunks using the fastq_plittl.pl script. This will allow us to run InterProScan in parallel on the chunks instead of on a single large protein sequence file.
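To make the idea of chunking concrete, here is a minimal Python sketch of one way to divide a protein FASTA into 10 roughly equal parts. It is not the script used in this tutorial; the function names and the round-robin strategy are assumptions chosen for illustration.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) tuples."""
    records = []
    header, parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        elif line and header is not None:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def split_into_chunks(records, n_chunks=10):
    """Distribute records round-robin over n_chunks lists; drop empty chunks."""
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append(record)
    return [chunk for chunk in chunks if chunk]


def format_fasta(records):
    """Turn (header, sequence) tuples back into FASTA text."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)
```

Each chunk can then be written to its own file with format_fasta. Round-robin assignment keeps the number of proteins per chunk within one of each other, although the total sequence length per chunk may still differ.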

Create a jobs list of interproscan commands to be submitted

With this we can split the batch file "interpro_jobs_to_split.txt" into multiple files and run them in parallel.
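The two steps above (one command per chunk, then several batches of commands) can be sketched in Python as follows. This is only an illustration: the chunk file names are hypothetical, and the command line uses just the basic InterProScan options -i (input), -f (output format) and -o (output file). The options used in the original run are not shown in this tutorial and may well differ.

```python
def build_interproscan_jobs(chunk_files):
    """Return one InterProScan command line per protein chunk file."""
    return [
        f"interproscan.sh -i {path} -f tsv -o {path}.tsv"
        for path in chunk_files
    ]


def split_jobs(jobs, n_batches):
    """Distribute the job lines over n_batches lists; drop empty batches."""
    batches = [jobs[start::n_batches] for start in range(n_batches)]
    return [batch for batch in batches if batch]


# Hypothetical chunk file names, for illustration only.
chunk_files = [f"chunk_{number:02d}.fasta" for number in range(1, 11)]
jobs = build_interproscan_jobs(chunk_files)
batches = split_jobs(jobs, 5)
print(len(jobs), "jobs in", len(batches), "batches")
```

Writing the list returned by build_interproscan_jobs to "interpro_jobs_to_split.txt", one command per line, gives the jobs list. Each batch can be written to its own file and run independently.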

InterProScan outputs a TSV file with IPR domains

Combine all the TSV files into a single TSV file
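Because the per-chunk TSV files have no header row, combining them amounts to concatenation. A minimal Python sketch, assuming each input is an iterable of lines and using the hypothetical function name combine_tsv:

```python
def combine_tsv(tables):
    """Concatenate several TSV line iterables, skipping blank lines.

    Every returned line ends with exactly one newline, so a chunk whose
    last line lacks a trailing newline does not get fused to the next chunk.
    """
    combined = []
    for lines in tables:
        for line in lines:
            if line.strip():
                combined.append(line.rstrip("\n") + "\n")
    return combined
```

With real files, open each per-chunk TSV, pass the handles in a list, and write the returned lines to the combined output file.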

 

4.3 Run BLASTP with MAKER-annotated proteins against the UniProt database

 

4.4 Update the MAKER annotation with the functional annotation

First we will update the MAKER gff with InterProScan output. In this step you will update the gff3 file to contain the iprscan information on the mRNA line

 

This procedure added a Dbxref tag to column nine of the gene and mRNA features that have Pfam domains identified by InterProScan in the GFF3 file. The value for this tag contains InterPro and Pfam IDs as well as the Gene Ontology IDs associated with the identified domains, and looks like this:

Dbxref=InterPro:IPR001300,Pfam:PF00648;Ontology_term=GO:0004198,GO:0005622,GO:0006508
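To inspect these tags programmatically, column nine can be split on ";" into tag=value pairs and each value on "," into individual IDs. The following Python sketch shows this under the assumption that the values contain no escaped separators; the function name parse_gff3_attributes is hypothetical.

```python
def parse_gff3_attributes(column_nine):
    """Parse a GFF3 attribute column into a dict of tag -> list of values."""
    attributes = {}
    for field in column_nine.strip().split(";"):
        if not field.strip():
            continue  # tolerate a trailing ';'
        tag, _, value = field.partition("=")
        attributes[tag.strip()] = [
            item.strip() for item in value.split(",") if item.strip()
        ]
    return attributes


example = (
    "Dbxref=InterPro:IPR001300,Pfam:PF00648;"
    "Ontology_term=GO:0004198,GO:0005622,GO:0006508"
)
parsed = parse_gff3_attributes(example)
print(parsed["Dbxref"])
print(parsed["Ontology_term"])
```

Stripping whitespace around each item makes the parser tolerant of stray spaces after commas.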

We now update the MAKER GFF and FASTA files with functions identified from BLASTP against UniProt/SwissProt

This procedure added a functional tag identified by BLASTP to column nine of the gene and mRNA features in the GFF3 file. The value looks like this:

Note=Similar to RNP1: Heterogeneous nuclear ribonucleoprotein 1 (Arabidopsis thaliana)

Similarly, add this information to the MAKER FASTA files (protein and transcript sequences)

 

4.5 Build shorter IDs/Names for MAKER genes and transcripts following the NCBI suggested naming format

where

  • --prefix : The prefix to use for all IDs (default = 'MAKER_')
  • --justify : The unique integer portion of the ID will be right justified with '0's to this length (default = 8)

This will create a mapping file as below, and it can be used to rename the feature IDs

maker-scaffold10-augustus-gene-0.3	PYU1_000001
maker-scaffold10-augustus-gene-0.3-mRNA-3	PYU1_000001-RA
maker-scaffold10-augustus-gene-0.3-mRNA-2	PYU1_000001-RB
maker-scaffold10-augustus-gene-0.3-mRNA-1	PYU1_000001-RC
maker-scaffold10-augustus-gene-0.4	PYU1_000002
maker-scaffold10-augustus-gene-0.4-mRNA-1	PYU1_000002-RA
maker-scaffold10-augustus-gene-0.4-mRNA-2	PYU1_000002-RB
maker-scaffold10-augustus-gene-0.5	PYU1_000003
maker-scaffold10-augustus-gene-0.5-mRNA-5	PYU1_000003-RA
maker-scaffold10-augustus-gene-0.5-mRNA-6	PYU1_000003-RB

 

Alternate transcripts are assigned the suffixes -RA, -RB, -RC, etc.
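The mapping file is a plain two-column table (old ID, new ID), so it is easy to load and apply in a script. The following Python sketch illustrates the idea for FASTA headers; the function names are hypothetical, and it is not a replacement for the MAKER mapping scripts.

```python
def load_id_map(lines):
    """Read 'old_id<whitespace>new_id' lines into a dict; skip other lines."""
    id_map = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            id_map[fields[0]] = fields[1]
    return id_map


def rename_fasta_header(header, id_map):
    """Replace the sequence ID (first word after '>') if it is in id_map."""
    name, separator, rest = header[1:].rstrip("\n").partition(" ")
    return ">" + id_map.get(name, name) + separator + rest


mapping_lines = [
    "maker-scaffold10-augustus-gene-0.3\tPYU1_000001\n",
    "maker-scaffold10-augustus-gene-0.3-mRNA-3\tPYU1_000001-RA\n",
    "maker-scaffold10-augustus-gene-0.3-mRNA-2\tPYU1_000001-RB\n",
]
id_map = load_id_map(mapping_lines)
print(rename_fasta_header(">maker-scaffold10-augustus-gene-0.3-mRNA-2", id_map))
```

IDs that are missing from the mapping are left unchanged, which makes it easy to spot features that were not renamed.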

You can explore more options with the maker_map_ids script to name feature ID's

 

We map the short IDs/Names from genome.all.id.map to the MAKER GFF3 file test.all.functional_ipr.uniprot.gff; the old IDs/Names are mapped to the Alias attribute

Similarly, map genome.all.id.map to the MAKER protein and transcript sequences