What is Winnow?
Winnow is a Python-compatible known-truth testing tool for genome-wide association studies (GWAS). Winnow is, quite simply, a tool which evaluates other tools. It takes as input the output that a GWAS tool produces after analyzing a data set (most GWAS tools share the same style of output, which is why Winnow evaluates only these programs). Given the "known truth" of the original data set, Winnow outputs a series of fit statistics, such as root mean-squared error and the false positive rate, to determine the validity of the GWAS tool and whether or not it was truly useful in analyzing the data. The two main statistics to look at are root mean-squared error (RMSE) and area under the receiver-operator curve (AUC).
How to Get Started
In your Validate Workflow v0.9 Atmosphere instance, the files for the program are located in the /usr/bin directory, along with the documentation and the example files. To start, you can change your directory (though if you wish to just call Winnow from its location, this is not necessary). This can be done by opening the Terminal emulator, and typing:
Note that the documentation and the example data for Winnow are both contained in the /usr/bin directory under the Validate_Info and Example_Data folders, respectively. The actual program is contained in winnow.py, so run the following command to see all the possible inputs for Winnow:
--Folder (or -F) which denotes the folder of aggregated GWAS results
--Class (or -C) to specify the “known-truth” file
--Snp (or -S) to specify a string/name for the SNP column in the input file (e.g. rs in example data)
--Score (or -P) to specify a string/name for the scoring column in results file (e.g. p-value in example data)
--filename (or -f) to specify the desired filename for the Winnow output file, without file extension (defaults to Results.txt as the output filename)
--kttype (or -k) to specify the type of known-truth file for --Class (either OTE or FGS)
Additional input options
--analysis (or -a) to specify the type of analysis, “GWAS” or “prediction” (currently, only GWAS is available and if left blank, Winnow assumes GWAS)
--threshold (or -t) to specify a desired threshold for classification metrics where necessary (default is 0.05)
--verbose (or -v) to trigger verbose mode. This option is just a flag and requires no trailing arguments.
--seper (or -s) to indicate how values are separated in the GWAS output files (comma or whitespace) (e.g. comma for example data)
--kttypeseper (or -r) to specify the delimiter in the known-truth file (comma or whitespace)
--beta (or -b) to specify a string for the name of the estimated SNP effect column in results folder (e.g. beta in example data)
--pvaladjust (or -p) to specify an option for p-value adjustment if desired. The possible options are:
bonferroni: one-step Bonferroni method
sidak: one-step Sidak method
holm-sidak: step down method using Sidak adjustments
holm: step down method using Bonferroni adjustments
simes-hochberg: step up method (for independent statistics)
hommel: closed method based on Simes procedure (non-negatively associated statistics only)
fdr_bh: Benjamini-Hochberg method for false discovery rate control (non-negatively associated or independent statistics only)
fdr_by: Benjamini-Yekutieli method for false discovery rate control
fdr_tsbh: Two-stage FDR control (non-negatively associated or independent statistics only)
fdr_tsbky: Two-stage FDR control
--covar (or -c) to specify a string/name for the covariate weight column, if one exists in the GWAS output
--savep (or -o) to indicate whether or not to save the SNPs, p-values, and adjusted p-values to a separate file. This option is just a flag and requires no trailing arguments. If it is triggered, Winnow will create a separate file named <filename specified from command line>_scores.txt with two or three columns: SNP, P-value, and P-value Adjusted (this last column will be omitted if a p-value adjustment option is not given). Keep in mind that this file records all SNPs and p-values from all files under analysis, so depending on time constraints, you may want to turn this option off.
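To make the --pvaladjust options above more concrete, the following sketch implements two of the listed methods, one-step Bonferroni and Benjamini-Hochberg, from their standard textbook definitions. It is an illustration only and not Winnow's own code: the function names and the sample p-values are hypothetical.

```python
def bonferroni(pvalues):
    """One-step Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values for false discovery rate control."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, for illustration only.
pvals = [0.001, 0.02, 0.03, 0.20]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

threshold = 0.05
n_bonf = sum(p < threshold for p in bonf)
n_bh = sum(p < threshold for p in bh)
print(n_bonf, n_bh)
```

With these made-up values the Bonferroni adjustment leaves fewer SNPs below the 0.05 threshold than the Benjamini-Hochberg adjustment does, which is the usual trade-off between the two kinds of method.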
For the format of the known-truth file, OTE stands for “Only Truth and Effect.” As the name implies, such a file only contains the SNPs with effects and the values of those effects. FGS stands for “Full Genome Set,” and this file type lists every SNP with its effect. Both file formats can be arranged with either two rows or two columns. Also, if your aggregated results have the CSV file extension, you must change both the kttypeseper and seper options to "comma" (as opposed to the standard "whitespace" delimiter).
Example command line to run Winnow
Two output files will be created on the Desktop: the actual results file, ExampleResults.txt, and a parameter file. Please note that if the beta argument was not included in your Winnow run, the MAE by AUC plot in the Demonstrate2 code for the next step will be unavailable, so you must set that option to FALSE. All other plots are still available, however.
The input files for Winnow must be output from GWAS analysis tools to obtain an accurate reading; however, because many different GWAS tools exist, the format(s) must be standardized for practical use.
Data must be arranged into columns, not rows
Spreadsheet format (more specifically, CSV) is preferred, but not required
The following columns are required for any file format:
A column denoting the particular SNP being analyzed
A column with the significance score indicating whether or not a SNP is statistically significant (e.g. a p-value column)
Other columns that may be included are:
A column indicating the effect size or weight of a given SNP for determining a phenotype value
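As a rough illustration of the column requirements above, the sketch below reads a comma-separated results file and pulls out the SNP, score, and optional effect columns by name, in the spirit of the --Snp, --Score, and --beta options. The function name and the sample rows are hypothetical; only the column names rs, p-value, and beta come from the example data, and this is not how Winnow itself is necessarily implemented.

```python
import csv
import io


def read_gwas_columns(handle, snp_col, score_col, beta_col=None, delimiter=","):
    """Pull the SNP, score and (optional) effect columns out of a delimited file."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    snps, scores, betas = [], [], []
    for row in reader:
        snps.append(row[snp_col])
        scores.append(float(row[score_col]))
        if beta_col is not None:
            betas.append(float(row[beta_col]))
    return snps, scores, betas


# Made-up rows for illustration only; an in-memory file stands in for a
# real GWAS results file.
sample = io.StringIO(
    "rs,p-value,beta\n"
    "snp1,0.0004,0.82\n"
    "snp2,0.31,0.05\n"
    "snp3,0.047,-0.40\n"
)
snps, scores, betas = read_gwas_columns(sample, "rs", "p-value", beta_col="beta")
print(snps, scores, betas)
```

A file without an effect column can be read the same way by leaving beta_col unset, in which case the returned list of effects is empty.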
Known Truth File
The known-truth file from your analyses may be arranged in either two rows or two columns. Also, there are two acceptable formats for the known-truth file:
OTE: Only Truth and Effect. As the name implies, this format only lists those SNPs which are known to have a significant effect, along with their corresponding effect values (in this case, the SNP weight or effect size column would be mandatory in the GWAS output folder)
FGS: Full Genome Set. This type of file lists all of the SNPs in the given dataset along with each of their corresponding effect sizes or weights.
The known-truth file must be a text file which is separated either by whitespace (regular spacing or tab) or commas
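The difference between the two formats can be sketched in a few lines of Python. The example below parses a two-column known-truth text and derives a true/false label for every SNP in a data set. It rests on two assumptions that are not stated in this document: that an FGS file records SNPs without a true effect as having an effect of 0, and that SNPs missing from an OTE file are treated as having no effect. All names and values are hypothetical.

```python
def parse_known_truth(text, delimiter=None):
    """Parse two-column known-truth text (SNP, effect) into a dict.

    delimiter=None splits on any whitespace; pass "," for comma-separated text.
    """
    truth = {}
    for line in text.strip().splitlines():
        fields = line.split(delimiter)
        truth[fields[0].strip()] = float(fields[1])
    return truth


def truth_labels(snps, truth):
    """True where a SNP has a nonzero known effect.

    Assumption: SNPs absent from the known-truth file have no effect.
    """
    return [truth.get(snp, 0.0) != 0.0 for snp in snps]


all_snps = ["snp1", "snp2", "snp3", "snp4"]

# OTE: only the SNPs with effects are listed (made-up values).
ote = parse_known_truth("snp1,0.8\nsnp3,-0.4\n", delimiter=",")

# FGS: every SNP is listed, here whitespace-separated (made-up values).
fgs = parse_known_truth("snp1 0.8\nsnp2 0\nsnp3 -0.4\nsnp4 0\n")

ote_labels = truth_labels(all_snps, ote)
fgs_labels = truth_labels(all_snps, fgs)
print(ote_labels == fgs_labels)
```

Under these assumptions both formats describe the same truth; the OTE file is simply the shorter way to write it.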
Example command line when running example data
Note: It is usually helpful to convert the data into Comma Separated Value (CSV) format, which can be done with the following command line:
Then, when running the example data, the following command line can be used:
The output file for Winnow, specified with whatever name you chose, contains between 15 and 18 columns, and the number of rows depends on how many results files were placed in the aggregated folder. Each column indicates a particular fit statistic, with each row indicating one result file. The fit statistics currently used in Winnow, in order of appearance, are:
Root mean-squared error (RMSE)^
Mean absolute error^
Matthews correlation coefficient
Area under the receiver-operator curve (AUC)
True positive rate
False positive rate
Error (defined as: (falseNegatives + falsePositives)/(truePositives + trueNegatives + falsePositives + falseNegatives))
Sensitivity (true positives / (true positives + false negatives))
Specificity (true negatives / (true negatives + false positives))
Total precision (defined as true positives divided by the total detected positives)
False discovery rate (defined as false positives divided by total detected positives)
Youden statistic (sensitivity + specificity - 1)
Average covariate weight*
^Statistic will be excluded if a beta column is not specified for the GWAS output
*Statistic will be excluded if a covariate column is not specified for the GWAS output
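The definitions in the list above can be written out directly. The sketch below computes the confusion-matrix statistics for a single results file, plus RMSE and mean absolute error between estimated and known effects. It is a minimal illustration rather than Winnow's implementation: the function names and sample values are hypothetical, and treating a SNP as detected when its p-value is strictly below the threshold is an assumption.

```python
import math


def ratio(numerator, denominator):
    """Divide, returning NaN instead of raising when the denominator is 0."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(pvalues, is_causal, threshold=0.05):
    """Confusion counts and derived statistics for one results file."""
    tp = fp = tn = fn = 0
    for p, causal in zip(pvalues, is_causal):
        detected = p < threshold  # assumption: strict inequality
        if detected and causal:
            tp += 1
        elif detected:
            fp += 1
        elif causal:
            fn += 1
        else:
            tn += 1
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "error": ratio(fp + fn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": ratio(tp, tp + fp),
        "fdr": ratio(fp, tp + fp),
        "youden": sensitivity + specificity - 1,
    }


def rmse(estimated, true):
    """Root mean-squared error between estimated and known effects."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated, true)) / len(true))


def mae(estimated, true):
    """Mean absolute error between estimated and known effects."""
    return sum(abs(e - t) for e, t in zip(estimated, true)) / len(true)


# Made-up values, for illustration only.
pvalues = [0.001, 0.20, 0.03, 0.60, 0.04]
is_causal = [True, True, True, False, False]
stats = classification_metrics(pvalues, is_causal)
effect_rmse = rmse([0.9, 0.1, -0.3], [0.8, 0.0, -0.4])
effect_mae = mae([0.9, 0.1, -0.3], [0.8, 0.0, -0.4])
print(stats, effect_rmse, effect_mae)
```

Returning NaN for an undefined ratio (for example, precision when nothing is detected) is one possible design choice; it keeps a single degenerate results file from aborting a run over many files.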
Though the output is typically a text file, one can easily reformat it into .CSV style or other formats. An example of formatted Winnow results from the aforementioned PLINK analysis is shown below.
In this case we can conclude that PLINK was perhaps not an appropriate tool for this dataset. The inability to identify any true positives after adjustment means that none of the SNPs from the known-truth file were detected. Beyond that, the AUC values below 0.5 indicate that PLINK was actually less accurate than random guessing. Though we can choose to visualize this in Demonstrate, it is likely safe to conclude that analyzing this data with another GWAS tool would work better than using PLINK.
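To see why an AUC below 0.5 is worse than guessing, it helps to read the AUC as the probability that a randomly chosen true SNP is ranked as more significant than a randomly chosen null SNP. The sketch below computes it that way, by comparing every true/null pair of p-values and counting ties as half. This pairwise formulation is a standard equivalent of the area under the curve, but the function name and data are hypothetical, and Winnow may compute its AUC differently.

```python
def auc_from_pvalues(pvalues, is_causal):
    """AUC as the fraction of (causal, non-causal) pairs ranked correctly.

    A pair is ranked correctly when the causal SNP has the smaller p-value;
    ties count as half a correct pair.
    """
    causal = [p for p, c in zip(pvalues, is_causal) if c]
    null = [p for p, c in zip(pvalues, is_causal) if not c]
    if not causal or not null:
        raise ValueError("need at least one causal and one non-causal SNP")
    correct = 0.0
    for p_causal in causal:
        for p_null in null:
            if p_causal < p_null:
                correct += 1.0
            elif p_causal == p_null:
                correct += 0.5
    return correct / (len(causal) * len(null))


# Made-up p-values, for illustration only.
pvalues = [0.001, 0.02, 0.5, 0.8]
auc_good = auc_from_pvalues(pvalues, [True, True, False, False])
auc_mixed = auc_from_pvalues(pvalues, [True, False, True, False])
auc_bad = auc_from_pvalues(pvalues, [False, False, True, True])
print(auc_good, auc_mixed, auc_bad)
```

In the last case the true SNPs consistently receive the largest p-values, so the ranking is systematically inverted and the AUC falls below 0.5.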
Along with the actual results, each run through of Winnow will also produce a parameters file for future reference or use with Demonstrate. While extra lines will be included depending on whether covariates or p-value adjustments were included, several parameters will always be included: the output name, the analysis type (with beta or without beta), and the significance threshold. Using the data and options from the example above, the Winnow parameter file would look like this:
Updating or Modifying Winnow
Because of the program structure for Winnow, one may easily update or modify the source code with additional performance metrics, delimiter options, and the like. If you are interested in modifying the Winnow program or wish to add more fit statistics to the output, please consult this tutorial.
iPlant profile for Dustin Landers, the original architect of Winnow: https://pods.iplantcollaborative.org/wiki/display/~landersda
Information on the AUC: https://www.kaggle.com/wiki/AreaUnderCurve
Example data for analysis can be found: http://mirrors.iplantcollaborative.org/browse/iplant/home/shared/iplantcollaborative/example_data/Validate/Validate_Test_Data