The first step for the workflow is ensuring that you have data to work with. If you already have data feel free to proceed to the next step (GWAS Tools) if not then you can use simulate to generate data to work with. In the future we plan to include the conversion script directly into the workflow, but for now it is available as an attachment to this page and can be used by following the instructions which follow.
Simulate is a python script that runs directly from the command-line. It is a forward-in-time genetic individual-based simulation software that evolves for a specified number of generations and estimates a quantitative trait from an additive model.
To Run Simulate
First, open a command-line terminal from an Atmosphere instance of Validate V.09 and change directory to /usr/bin
Next create a file for your simulate data. In the example code below the new file has been labeled as <simulations>.
Run Simulate with just the help option selected to see all possible options.
Possible options include:
-v or --verbose: Triggers verbose mode which explicitly states each option before running and prints statements at each step of the simulation process.
-s or --size: One or more integers, separated by commas with no spaces, indicating the size of your population or sub-populations. Ideally, you should place these values above 100, as any number below that will result in a population convergence and an error in the program.
-n or --number: An integer, indicates the number of genetic markers per individual. This number will be one of the key determinants of the size of the final output, and will influence how frequently the effects of the loci will change per generation. If multiple subpopulations are specified, the individuals of all subpopulations will have the number of markers denoted by n
-l or --loci: A list of integers, starting with 0 (e.g. 0,1,2) to denote which number of loci will have the effects. For example, if
-l 0,1,2is chosen, then individuals having 0 loci, 1 locus, or 2 loci will have some kind of effect, the exact effect being represented with the next command line argument...
-e or --effect: A list of floating point numbers equal in length to the list given by
-l; indicates the actual effect on the phenotype having said number of loci will have. Using the above example with 0,1, and 2 loci having effects, if one listed
-e 0.2,0.2,0.2, then in this case each additional loci would contribute 0.2 units to the phenotype of an individual.
-m or --mean: One or multiple floating point numbers separated by commas. The length of the list should not exceed the number of subpopulations specified. Represents the mean of the phenotype for your population or subpopulation. If only one number is given for the mean when multiple subpopulations exist, then the mean will apply to all (sub)population(s).
-i or --heritability: Decimal between 0 and 1. Denotes the heritability coefficient for your (sub)population(s). The heritability coefficient in this program means the probability of an individual inheriting a given set of loci and effects from a parent individual.
-g or --gen: A single integer, indicates the number of generations you wish to have your (sub)population(s) evolve. To get reliable output, we recommend a number above 50.
-r or --rate: Decimal value between 0 and 1. Denotes the genetic recombination rate of your (sub)population(s). The genetic recombination rate is the proportion of offspring in a population that will have a combination of traits not found in either of the individual's parents.
-d or --distribution: The underlying distribution of your population. Can either be 0 for normal distribution (the default option) or 1 for gamma distribution. Any other number will result in an error (for now...) If option 1 is chosen for the gamma distribution then you will need to fill in the next two arguments:
-p1 or --parameter1: A single floating point number. Denotes the shape parameter of your population's underlying gamma distribution (only required if gamma distribution was chosen as an input above).
-p2 or --parameter2: A single floating point number. Denotes the scale parameter of your population's underlying gamma distribution (only required if gamma distribution was chosen as an input above).
-f or --filename: A character string. Indicates the name to be attached to all Simulate outputs. For example, if filename was MyResults, then the genomic information file would be MyResults_genomes.csv
Using these options is simply a matter of extending on to the above command (Note: Using the option --help automatically ends further execution so that the help command cannot be selected if you intend to execute a simulation at that time.)
For example, suppose you wanted to have two sub-populations, with 1000 individuals each, 3000 markers per individual and only the first three loci have effects of 0.2 each. The population mean is 2.7 for both populations and you want to evolve the populations for 100 generations. Also, suppose you wanted to have a heritability of about 0.25. The following code would generate a single simulation:
Notice that all options with multiple arguments such as -s with 2 sub-populations are separated by commas with no spaces.
In order to generate multiple simulations, we recommend using the command-line terminal to iterate the process like below, after creating a folder on the Desktop. This way, when iterating over your simulations, you can be sure to already have them in some aggregated location.
Running the simulations through your GWAS applicaion
This tutorial does not explain how to convert simulations to the necessary formats for your application, as each format may be unique. Although we can provide a basic R script to show you how you might use R to get Simulate.py output files in to a basic PEDMAP file for the purposes of this tutorial (http://dustinlanders.net/wp-content/uploads/2014/05/convert.R.zip), once in PEDMAP file, you can use the PLINK software to convert these to other file formats simply with:
for example to exchange in to the BED-BIM-FAM format. For other conversions, check out the PLINK site athttp://pngu.mgh.harvard.edu/~purcell/plink/.
For the purposes of this tutorial, we are going to pretend that you are validating a popular bioinformatics application called PLINK! First, we will assume that you have run some equivalent of the above code and that you have 100 simulations at your disposal now to test through PLINK. Since PLINK requires a format called PEDMAP, we will provide an R script to convert your Simulate.py simulations in to PEDMAP formats for the purposes of this tutorial.
Installing Simulate2 and convert.R
Unfortunately the version of simulate that comes preinstalled in the Validate Workflow v0.9 image is not compatible with the version of convert.R listed below. To convert files to the PEDMAP form you must first download both Simulate2 and convert.R, then run your simulations (using the directions listed for Simulate in the first section of this page.
To download Simulate2 and convert.R, enter in to the command line:
Next you must modify the script for simulate2.py so that it calls convert.R from the correct location. To do this you must first change the permissions on simulate2.py so that you can make edits. To do this enter:
Once this is entered, access the simulate2.py script by right clicking its icon on the desktop and selecting "open with Leafpad." On the fourth line from the bottom change "Rscript convert.R " to "Rscript ~/Desktop/convert.R ". You may now carry on simulations using the directions listed at the start of this page with a few key differences:
- Because simulate2 is saved to the desktop you must change directory to desktop (enter <cd ~/Desktop> rather than /usr/bin.
- Change <python Simulate.py> becomes <python simulate2.py> any time it is being entered in to the command line
- Spaces, rather than commas are now required to separate multiple arguments in an option
- <normal> and <gamma> have replaced <0> and <1> as arguments for the distribution option
- When changing the above code line make sure that your capitals are consistent with the names given to your directories (i.e. Simulate vs simulate2)
The code below reflects these changes and has been put into the form for creating multiple simulations
If you would like to follow along with this section purely for the demonstration, or if you would just like to evaluate PLINK for yourself and/or use its conversion capabilities, you can acquire PLINK from the command line:
Because /usr/bin is in your path environment variable, after this, you should be able to utilize plink directly from the command-line. Also, in this example, we have made a new folder on our Desktop called "qassoc" which will house all PLINK results files.
where each Test$i has been converted in to the proper PED MAP format. In order for Validate to actually run, we must have a folder with only the results files. In this case, we have additional log files for each run that we need to get rid of. In order to get rid of them, we will run the following from the terminal:
Now we are prepared to run this folder of outputs through Validate.R!
If the aforementioned scripts are inaccessible, the Syngenta simulation data sets located in the iPlant community Data Store are also ideal for PLINK testing. The Syngenta simulation data may be found in the Discovery Environment under Community_Data -> syngenta_sim -> PEDMAP_DE. To use them in an Atmosphere instance, please consult this tutorial: Using FUSE to Mount the iPlant Data Store
Documentaiton for simulate can be found here.