Rationale:

If you are involved in a genome project for an emerging model organism, you should already have an EST database, or more likely now mRNA-Seq data, generated as part of the original sequencing project. A protein database can be assembled from the genome databases of closely related organisms, or taken from the UniProt/Swiss-Prot or NCBI NR protein databases. A trained ab initio gene predictor, however, is much harder to come by. Gene predictors need existing gene models on which to base their prediction parameters, and for an emerging model organism you are unlikely to have any. So how are you supposed to train your gene prediction programs?

Ab initio gene predictors perform much better when they have been trained for a particular genome, and those used by MAKER are no exception. However, a reliable training set is not always available, especially for non-model species. In this situation, the MAKER documentation suggests running MAKER to generate an initial set of predictions using parameters trained for a related species, then using those predictions as the basis for training and subsequent annotation runs (MAKER has an automated process for iterative training and re-annotation). MAKER also gives the user the option to produce gene annotations directly from the EST evidence. You can then use these imperfect gene models to train a gene predictor. Once you have re-run WQ-MAKER with the newly trained gene predictor, you can use the second set of gene annotations to train the gene predictor again. This bootstrap process lets you iteratively improve the performance of ab initio gene predictors.

SNAP training:

SNAP is an ab initio gene predictor that uses trained models of gene structure, including splice sites, to find transcripts and coding sequences within a genome assembly. The models can be trained for different species.

The following steps assume that you have already run the test data through WQ-MAKER.

Make sure you have the final merged file, test_genome.all.gff, in the working directory. If not, go back to the main WQ-MAKER tutorial.

Step 1: Run SNAP gene prediction (Round 1)

1.1 To train SNAP, we need to convert the GFF3 gene models to ZFF format. To do this we need to collect all GFF3 files into a single directory.

There should now be two new files, genome.ann and genome.dna. The first is the ZFF-format annotation file and the second is the FASTA file against which its coordinates are referenced. Both are used to train SNAP.
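
To make the ZFF layout concrete, here is a minimal Python sketch of how the coding exons of a gene model might be written as ZFF records. It assumes the short ZFF layout as described in SNAP's own documentation: a `>sequence` header followed by one `label begin end group` line per exon, with `begin > end` on the minus strand. The function names and the sample coordinates are hypothetical, and the real conversion tool also applies quality filters that this sketch ignores.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, start <= end


def exons_to_zff(gene_id: str, strand: str, exons: List[Exon]) -> List[str]:
    """Return ZFF lines for the coding exons of one gene model."""
    if not exons:
        return []
    # Put exons in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(exons, reverse=(strand == "-"))
    lines = []
    for i, (start, end) in enumerate(ordered):
        if len(ordered) == 1:
            label = "Esngl"   # single-exon gene
        elif i == 0:
            label = "Einit"   # first coding exon
        elif i == len(ordered) - 1:
            label = "Eterm"   # last coding exon
        else:
            label = "Exon"    # internal exon
        # Minus-strand features are written with begin > end.
        begin, stop = (start, end) if strand == "+" else (end, start)
        lines.append(f"{label}\t{begin}\t{stop}\t{gene_id}")
    return lines


def to_zff(seq_id: str, genes: List[Tuple[str, str, List[Exon]]]) -> str:
    """Build one ZFF block: a '>' header plus the records of every gene."""
    out = [f">{seq_id}"]
    for gene_id, strand, exons in genes:
        out.extend(exons_to_zff(gene_id, strand, exons))
    return "\n".join(out)


# Hypothetical gene models on a hypothetical contig.
print(to_zff("contig1", [
    ("gene1", "+", [(201, 325), (400, 480), (600, 700)]),
    ("gene2", "-", [(1000, 1200)]),
]))
```

Each `>` header in the annotation file corresponds to a sequence with the same name in the FASTA file, which is why the two files travel together.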

1.2 The basic steps for training SNAP are, first, to filter the input gene models, then to capture the genomic sequence immediately surrounding each model locus, and finally to use those captured segments to produce the HMM. You can explore the internal SNAP documentation for more details if you wish.
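
The three steps above can be sketched as a dry-run command plan. The Python below only builds and prints the command lines; it does not run them. The commands (`fathom`, `forge`, `hmm-assembler.pl`) and their arguments are given as they appear in the general MAKER tutorial, and the 1000 bp flank and the `test_snap1` name are assumptions, so check them against your installed SNAP before use. The helper name `snap_training_commands` is hypothetical.

```python
import shlex
from typing import List


def snap_training_commands(hmm_name: str, flank: int = 1000) -> List[List[str]]:
    """Return the SNAP training commands, in order, as argument lists."""
    return [
        # 1. Filter: sort gene models into categories (unique, warnings, errors...).
        ["fathom", "-categorize", str(flank), "genome.ann", "genome.dna"],
        # 2. Capture: export each unique model with its flanking sequence,
        #    flipped to the plus strand.
        ["fathom", "-export", str(flank), "-plus", "uni.ann", "uni.dna"],
        # 3a. Estimate model parameters from the exported segments.
        ["forge", "export.ann", "export.dna"],
        # 3b. Assemble the HMM; this writes to stdout, so redirect it to a file.
        ["hmm-assembler.pl", hmm_name, "."],
    ]


commands = snap_training_commands("test_snap1")
for cmd in commands[:-1]:
    print(shlex.join(cmd))
# The last command needs a shell redirect to produce the .hmm file.
print(shlex.join(commands[-1]) + " > test_snap1.hmm")
```

Keeping the plan as data makes it easy to reuse for the second training round by changing only the HMM name.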

1.3 The final training parameters file is test_snap1.hmm. We do not expect SNAP to perform very well with this file because it is based on incomplete gene models; however, it is a good starting point for further training. We need to run MAKER again with the new HMM file we just built for SNAP. Edit the maker_opts.ctl file to include the test_snap1.hmm file.
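
The control-file edit can also be scripted. The sketch below is a minimal Python helper, under the assumption that each option in maker_opts.ctl sits on its own line in the form `key=value #comment`; the helper name `set_ctl_option` is hypothetical and the sample lines are illustrative. Switching `est2genome` off for the second pass follows the general MAKER tutorial and is an assumption here, not something this page states.

```python
def set_ctl_option(ctl_text: str, key: str, value: str) -> str:
    """Set `key` to `value` in control-file text, keeping trailing comments."""
    out = []
    found = False
    for line in ctl_text.splitlines():
        if line.startswith(key + "="):
            # Split off the old value, then keep any '#...' comment.
            _, _, rest = line.partition("=")
            _, hash_mark, comment = rest.partition("#")
            line = f"{key}={value}"
            if hash_mark:
                line += f" #{comment}"
            found = True
        out.append(line)
    if not found:
        raise KeyError(f"option not found: {key}")
    return "\n".join(out) + "\n"


# Illustrative excerpt, not a complete control file.
sample = (
    "snaphmm= #SNAP HMM file\n"
    "est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no\n"
)
updated = set_ctl_option(sample, "snaphmm", "test_snap1.hmm")
updated = set_ctl_option(updated, "est2genome", "0")  # assumption, see above
print(updated, end="")
```

To apply it to a real file, read maker_opts.ctl into a string, call the helper, and write the result back after keeping a copy of the original.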


1.4 Run WQ-MAKER 

Before running MAKER, check to make sure all worker instances have become active. 

On the MASTER instance, make sure you are in the "maker_run" directory and all of your files are in place and then run:

Wait for the MASTER to advertise master status to the catalog server before you run WQ-MAKER on the WORKERS (see below).

 $ tail log_file.txt

Mon Sep 11 15:08:22 2017 :: Submitting file ./test_data/test_genome.fasta_000008 for processing.
Mon Sep 11 15:08:22 2017 :: Submitted task 11 for annotating ./test_data/test_genome.fasta_000008 with command: mpiexec -n 1 maker -g ./test_data/test_genome.fasta_000008 -base test_genome -debug_size_limit=0
Mon Sep 11 15:08:22 2017 :: Submitting file ./test_data/test_genome.fasta_000006 for processing.
Mon Sep 11 15:08:22 2017 :: Submitted task 12 for annotating ./test_data/test_genome.fasta_000006 with command: mpiexec -n 1 maker -g ./test_data/test_genome.fasta_000006 -base test_genome -debug_size_limit=0
warning: this work queue master is visible to the public.
warning: you should set a password with the --password option.
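
The task lines in log output like the sample above can be tallied with a few lines of Python. This is a minimal sketch that relies only on the "Submitted task N for annotating FILE" wording visible in that sample; the function name `submitted_tasks` is hypothetical, and the sketch does not detect completion, because the wording of completion messages is not shown here.

```python
import re
from typing import Dict

# Matches the "Submitted task" lines seen in the sample log output.
SUBMITTED = re.compile(r"Submitted task (\d+) for annotating (\S+)")


def submitted_tasks(log_text: str) -> Dict[int, str]:
    """Map each submitted task number to the FASTA chunk it annotates."""
    return {int(m.group(1)): m.group(2) for m in SUBMITTED.finditer(log_text)}


sample_log = (
    "Mon Sep 11 15:08:22 2017 :: Submitted task 11 for annotating "
    "./test_data/test_genome.fasta_000008 with command: mpiexec -n 1 maker "
    "-g ./test_data/test_genome.fasta_000008 -base test_genome -debug_size_limit=0\n"
    "Mon Sep 11 15:08:22 2017 :: Submitted task 12 for annotating "
    "./test_data/test_genome.fasta_000006 with command: mpiexec -n 1 maker "
    "-g ./test_data/test_genome.fasta_000006 -base test_genome -debug_size_limit=0\n"
)

for task_id, chunk in sorted(submitted_tasks(sample_log).items()):
    print(task_id, chunk)
```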

Once log_file.txt shows the above output and your WORKERS are in the active state, connect to each WORKER, either over ssh or through the web shell, and run:

Advanced users

To check the status of the WQ-MAKER job, run the following. 

log_file.txt will tell you whether the job has finished. The following are the output files from WQ-MAKER:

 

Step 2: Run SNAP gene prediction (Round 2) 

Let's retrain SNAP, and run MAKER again.

First, merge the GFF files.

Expect to see the final merged file, test_genome_snap1.all.gff, in the working directory.
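
A quick sanity check on a merged GFF3 file is to count features by source and type, which also makes it easy to compare one training round with the next. The Python sketch below assumes standard nine-column, tab-separated GFF3 with an optional `##FASTA` section at the end; the function name `count_features` and the sample records are hypothetical.

```python
from collections import Counter


def count_features(gff_text: str) -> Counter:
    """Count GFF3 features keyed by (source, type)."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        counts[(cols[1], cols[2])] += 1
    return counts


# Hypothetical records standing in for a merged MAKER GFF3 file.
rows = [
    ["contig1", "maker", "gene", "100", "900", ".", "+", ".", "ID=gene1"],
    ["contig1", "maker", "mRNA", "100", "900", ".", "+", ".", "ID=gene1-mRNA-1;Parent=gene1"],
    ["contig1", "snap_masked", "match", "120", "880", ".", "+", ".", "ID=snap1"],
]
sample = (
    "##gff-version 3\n"
    + "\n".join("\t".join(r) for r in rows)
    + "\n##FASTA\n>contig1\nACGT\n"
)

for (source, ftype), n in sorted(count_features(sample).items()):
    print(source, ftype, n)
```

For a real run, pass the contents of the merged file (for example test_genome_snap1.all.gff) to the function instead of the sample string.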

2.1 Run SNAP gene prediction

To train SNAP, we need to convert the GFF3 gene models to ZFF format. To do this we need to collect all GFF3 files into a single directory.

There should now be two new files (genome.ann and genome.dna). The first is the ZFF format file and the second is a FASTA file the coordinates can be referenced against. These will be used to train SNAP.

2.2 The basic steps for training SNAP are the same as before

The final training parameters file is test_snap2.hmm. SNAP is now trained; next we will train Augustus using BUSCO (see below) and then do the final WQ-MAKER run. If you don't want to run Augustus alongside SNAP, you can run WQ-MAKER with test_snap2.hmm alone: follow steps 1.3 and 1.4, replacing test_snap1.hmm with test_snap2.hmm and giving the output directory a different name (-base test_genome_snap2). Finally, merge the results to generate the final GFF file, this time adding the `-n` option.


Augustus training using BUSCO (Optional but recommended):

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool that provides quantitative measures of genome assembly, gene set, and transcriptome completeness, based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB. BUSCO assessments are implemented in open-source software, with comprehensive lineage-specific sets of Benchmarking Universal Single-Copy Orthologs for arthropods, vertebrates, metazoans, fungi, eukaryotes, and bacteria. These conserved orthologs are ideal candidates for large-scale phylogenomics studies, and the annotated BUSCO gene models built during genome assessments provide a comprehensive gene predictor training set for use in genome annotation pipelines. BUSCO assessments offer intuitive metrics, based on evolutionarily informed expectations of gene content from hundreds of species, to gauge the completeness of rapidly accumulating genomic data and to satisfy an Iberian's quest for quality: "Busco calidad/qualidade". The software is freely available for download at http://busco.ezlab.org/.

Step 1: Run BUSCO using the Docker image (this takes a long time)

  1. The BUSCO Docker image is already installed on the Jetstream instance. The installed version is v3.0, the latest as of 20 September 2017.
    1. Create an Augustus directory in the main project folder

    2. Download the lineage file specific to your species of interest. All the lineage data can be found and downloaded from here. Since the test data is from rice, here we will download the embryophyta dataset

    3. Copy the test data to this folder and run BUSCO to generate Augustus gene models


    4. Once the BUSCO run is finished, rename the retraining_parameters folder inside the output folder to match the name prefix of the files it contains, and change the permissions of the folder

    5. Copy the Augustus config folder from the /opt folder to your current directory, and copy the renamed Augustus folder into its species folder. Then continue to Step 2.

Step 2: Adjust the parameters in the maker_opts.ctl file

Step 3: Run WQ-MAKER 

Before running MAKER, check to make sure all worker instances have become active. 

On the MASTER instance, make sure you are in the "maker_run" directory and all of your files are in place and then run:

Wait for the MASTER to advertise master status to the catalog server before you run WQ-MAKER on the WORKERS (see below).

 $ tail log_file.txt

Mon Sep 11 15:08:22 2017 :: Submitting file ./test_data/test_genome.fasta_000008 for processing.
Mon Sep 11 15:08:22 2017 :: Submitted task 11 for annotating ./test_data/test_genome.fasta_000008 with command: mpiexec -n 1 maker -g ./test_data/test_genome.fasta_000008 -base test_genome -debug_size_limit=0
Mon Sep 11 15:08:22 2017 :: Submitting file ./test_data/test_genome.fasta_000006 for processing.
Mon Sep 11 15:08:22 2017 :: Submitted task 12 for annotating ./test_data/test_genome.fasta_000006 with command: mpiexec -n 1 maker -g ./test_data/test_genome.fasta_000006 -base test_genome -debug_size_limit=0
warning: this work queue master is visible to the public.
warning: you should set a password with the --password option.

Once log_file.txt shows the above output and your WORKERS are in the active state, connect to each WORKER, either over ssh or through the web shell, and run:

Advanced users

Once the job has finished, merge the GFF files.

This marks the completion of gene annotation. When you are done, look at one of the larger contigs in a viewer such as Apollo and compare the raw Augustus calls, the raw SNAP calls, and the evidence-aware Augustus and SNAP calls produced by MAKER. If SNAP and Augustus are properly trained, they will produce similar calls, and these will also be similar to the evidence-aware calls from MAKER (this convergence is the result of the training). If one predictor still produces very divergent calls, drop that predictor from the analysis: a bad predictor will make all results worse.

 
