Winnow

What is Winnow?

Winnow is a Python-based known-truth testing tool for genome-wide association studies (GWAS). Winnow is, quite simply, a tool that evaluates other tools. It takes the output that a GWAS tool produces after analyzing a data set; most GWAS tools produce output in a similar style, which is why Winnow is restricted to these programs. Given the "known truth" of the original data set, Winnow outputs a series of fit statistics, such as root mean squared error and the false positive rate, to help determine the validity of the GWAS tool and whether it was truly useful in analyzing the data. The two main statistics to look at are root mean squared error (RMSE) and area under the receiver-operator curve (AUC).
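
As a rough illustration of the two headline statistics, the sketch below computes RMSE between estimated and true SNP effects and a rank-based AUC from p-values. This is not Winnow's own code: the function names and toy data are hypothetical, and treating a smaller p-value as stronger evidence for a SNP is an assumption made for the example.

```python
import math

def rmse(estimated, truth):
    """Root mean squared error between estimated and true effect sizes."""
    if len(estimated) != len(truth):
        raise ValueError("inputs must have the same length")
    squared = [(e - t) ** 2 for e, t in zip(estimated, truth)]
    return math.sqrt(sum(squared) / len(squared))

def auc_from_pvalues(pvalues, is_causal):
    """AUC as the chance that a random causal SNP has a smaller p-value
    than a random non-causal SNP (ties count as one half)."""
    pos = [p for p, c in zip(pvalues, is_causal) if c]
    neg = [p for p, c in zip(pvalues, is_causal) if not c]
    if not pos or not neg:
        raise ValueError("need at least one causal and one non-causal SNP")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): four SNPs, the first two are truly causal.
pvals = [0.001, 0.20, 0.04, 0.90]
causal = [True, True, False, False]
toy_auc = auc_from_pvalues(pvals, causal)
toy_rmse = rmse([0.5, 0.0, 0.1, 0.0], [0.4, 0.3, 0.0, 0.0])
print(toy_auc, toy_rmse)
```

An AUC of 0.5 corresponds to random ranking, so values near 1 indicate that the tool ranks the truly causal SNPs ahead of the rest.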

How to Get Started

In your Validate Workflow v0.9 Atmosphere instance, the program files are located in the /usr/bin directory, along with the documentation and the example files. To start, you can change to that directory (this is not necessary if you prefer to call Winnow from its location). To do so, open the Terminal emulator and type:

Note that the documentation and the example data for Winnow are both contained in the /usr/bin directory under the Validate_Info and Example_Data folders, respectively. The actual program is contained in winnow.py, so run the following command to see all the possible inputs for Validate:

Required inputs

  • --Folder (or -F) which denotes the folder of aggregated GWAS results

  • --Class (or -C) to specify the “known-truth” file

  • --Snp (or -S) to specify a string/name for the SNP column in the input file (e.g. rs in example data)

  • --Score (or -P) to specify a string/name for the scoring column in results file (e.g. p-value in example data)

  • --filename (or -f) to specify the desired filename for the Validate output file (without file extension); if omitted, the output filename defaults to Results.txt

  • --kttype (or -k) to specify the type of known-truth file given to --Class (either OTE or FGS)

Additional input options

  • --analysis (or -a) to specify the type of analysis, “GWAS” or “prediction” (currently, only GWAS is available and if left blank, Winnow assumes GWAS)

  • --threshold (or -t) to specify a desired threshold for classification metrics where necessary (default is 0.05)

  • --verbose (or -v) to trigger verbose mode. This option is just a flag and requires no trailing arguments.

  • --seper (or -s) to indicate how values are separated in the GWAS output files (comma or whitespace) (e.g. comma for example data)

  • --kttypeseper (or -r) to specify the delimiter used in the known-truth file (comma or whitespace)

  • --beta (or -b) to specify a string for the name of the estimated SNP effect column in results folder (e.g. beta in example data)

  • --pvaladjust (or -p) to specify an option for p-value adjustment if desired. The possible options are:

    • bonferroni: one-step Bonferroni method

    • sidak: one-step Sidak method

    • holm-sidak: step down method using Sidak adjustments

    • holm: step down method using Bonferroni adjustments

    • simes-hochberg: step up method (for independent statistics)

    • hommel: closed method based on Simes procedure (non-negatively associated statistics only)

    • fdr_bh: Benjamini-Hochberg method for false discovery rate control (non-negatively associated or independent statistics only)

    • fdr_by: Benjamini-Yekutieli method for false discovery rate control

    • fdr_tsbh: Two-stage FDR control (non-negatively associated or independent statistics only)

    • fdr_tsbky: Two-stage FDR control

  • --covar (or -c) to specify a string/name for the covariate weight column, if one exists in the GWAS output

  • --savep (or -o) to indicate whether or not to save the SNPs, p-values, and adjusted p-values to a separate file. This option is just a flag and requires no trailing arguments. If it is triggered, Winnow will create a separate file named <filename specified from command line>_scores.txt with two or three columns: SNP, P-value, and P-value Adjusted (this last column will be omitted if a p-value adjustment option is not given). Keep in mind that this file records all SNPs and p-values from all files under analysis, so depending on time constraints, you may want to turn this option off.
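
To make the --pvaladjust options more concrete, here is a minimal plain-Python sketch of two of the listed adjustments, the one-step Bonferroni method and the Benjamini-Hochberg method. It only illustrates the arithmetic and is not taken from Winnow's source; the function names and p-values are hypothetical.

```python
def bonferroni(pvalues):
    """One-step Bonferroni: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values for false discovery rate control."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four SNPs.
pvals = [0.01, 0.04, 0.03, 0.20]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
print(bonf)
print(bh)
```

With the default threshold of 0.05, only the first SNP stays significant under either adjustment in this toy case, although three of the four raw p-values are below 0.05.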

For the format of the known-truth file, OTE stands for “Only Truth and Effect.” As the name implies, such a file contains only the SNPs with effects and the values of those effects. FGS stands for “Full Genome Set,” and this file type lists every SNP with its effect. Both file formats can be arranged with either two rows or two columns. Also, if your aggregated results have the CSV file extension, you must change both the kttypeseper and seper options to "comma" (as opposed to the standard "whitespace" delimiter).
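
The options above can be collected into an argument list before calling Winnow. The helper below is a hypothetical sketch, not part of Winnow: it only uses flags documented on this page, and the `python winnow.py` invocation, the folder name, and the known-truth filename are assumptions chosen for illustration.

```python
def build_winnow_args(folder, truth_file, snp_col, score_col, kttype,
                      filename=None, beta_col=None, comma_separated=False):
    """Assemble an argument list from the options documented above."""
    if kttype not in ("OTE", "FGS"):
        raise ValueError("kttype must be 'OTE' or 'FGS'")
    # Assumed invocation: the program lives in winnow.py and is run with python.
    args = ["python", "winnow.py",
            "--Folder", folder,
            "--Class", truth_file,
            "--Snp", snp_col,
            "--Score", score_col,
            "--kttype", kttype]
    if filename is not None:
        args += ["--filename", filename]
    if beta_col is not None:
        args += ["--beta", beta_col]
    if comma_separated:
        # CSV inputs need both separator options set to "comma".
        args += ["--seper", "comma", "--kttypeseper", "comma"]
    return args

# Hypothetical paths; column names follow the example data mentioned above.
args = build_winnow_args("gwas_results", "known_truth.csv", "rs", "p-value",
                         "OTE", filename="ExampleResults", beta_col="beta",
                         comma_separated=True)
print(" ".join(args))
```

Building the list in one place makes it harder to forget that --seper and --kttypeseper must both be switched to comma for CSV inputs.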

Example command line to run Winnow

Two output files will be created on the Desktop: the actual results file, ExampleResults.txt, and a parameter file. Please note that if the beta argument was not included in your Winnow run, the MAE by AUC plot in the Demonstrate2 code for the next step will be unavailable, so you must set that option to FALSE. All other plots are still available, however.

Example Input

Analyses File

  • The input files for Winnow must be output from GWAS analysis tools to obtain an accurate reading; however, because many different GWAS tools exist, the format(s) must be standardized for practical use.

  • Data must be arranged into columns, not rows

  • Spreadsheet format (more specifically, CSV) is preferred, but not required

  • The following columns are required for any file format:

    • A column denoting the particular SNP being analyzed

    • A column with the significance score indicating whether or not a SNP is statistically significant (e.g. a p-value column)

  • Other columns that may be included are:

    • A column indicating the effect size or weight of a given SNP for determining a phenotype value
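
The column requirements above can be illustrated with a short sketch that reads a comma-separated results file using Python's standard csv module. The reader function and the sample rows are hypothetical; the column names rs, p-value, and beta follow the example data mentioned in the options list.

```python
import csv
import io

def read_scores(handle, snp_col, score_col, beta_col=None):
    """Return one dict per SNP with its name, score, and optional effect."""
    reader = csv.DictReader(handle)
    rows = []
    for record in reader:
        row = {"snp": record[snp_col], "score": float(record[score_col])}
        if beta_col is not None:
            row["beta"] = float(record[beta_col])
        rows.append(row)
    return rows

# Hypothetical results file with the two required columns plus an effect column.
example = io.StringIO(
    "rs,p-value,beta\n"
    "snp1,0.001,0.42\n"
    "snp2,0.350,-0.03\n"
)
rows = read_scores(example, "rs", "p-value", beta_col="beta")
print(rows)
```

A file that lacks the SNP or score column raises a KeyError here, which mirrors the requirement that both columns be present in every results file.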

Known Truth File

The known-truth file from your analyses may be arranged in either two rows or two columns. Also, there are two acceptable formats for the known-truth file:

  1. OTE: Only Truth and Effect. As the name implies, this format only lists those SNPs which are known to have a significant effect, along with their corresponding effect values (in this case, the SNP weight or effect size column would be mandatory in the GWAS output folder)

  2. FGS: Full Genome Set. This type of file lists all of the SNPs in the given dataset along with each of their corresponding effect sizes or weights.

   

The known-truth file must be a text file whose values are separated either by whitespace (regular spacing or tab) or by commas.
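
The difference between the two formats can be sketched as follows. This is a hypothetical parser, not Winnow's own: it handles only the two-column layout, and for FGS files it assumes that an effect of zero marks a SNP with no true effect.

```python
def parse_known_truth(text, kttype, separator="whitespace"):
    """Map SNP name -> true effect from a two-column known-truth file."""
    if kttype not in ("OTE", "FGS"):
        raise ValueError("kttype must be 'OTE' or 'FGS'")
    effects = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split(",") if separator == "comma" else line.split()
        effects[parts[0].strip()] = float(parts[1])
    return effects

def is_causal(snp, effects, kttype):
    """OTE lists only SNPs with effects; for FGS a zero effect is
    assumed (for this sketch) to mean the SNP has no true effect."""
    if kttype == "OTE":
        return snp in effects
    return effects.get(snp, 0.0) != 0.0

# Hypothetical file contents in each format.
ote = parse_known_truth("snp1 0.4\nsnp7 -0.2\n", "OTE")
fgs = parse_known_truth("snp1,0.4\nsnp2,0\nsnp7,-0.2\n", "FGS", separator="comma")
print(is_causal("snp2", ote, "OTE"), is_causal("snp2", fgs, "FGS"))
print(is_causal("snp7", ote, "OTE"), is_causal("snp7", fgs, "FGS"))
```

Both formats lead to the same truth labels; the OTE file is simply shorter because it omits every SNP without an effect.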

 

Example command line when running example data

Note: It is usually helpful to convert the data into Comma Separated Value (CSV) format. This can be done with the following command line:


The following command line can then be used to run the example data:


Output Files

The output file for Winnow, saved under whatever name you chose, contains between 15 and 18 columns, and the number of rows depends on how many results files were placed in the aggregated folder. Each column holds a particular fit statistic, and each row corresponds to one results file. The fit statistics currently used in Validate, in order of appearance, are:

  • Root mean-squared error (RMSE)^

  • Mean absolute error^

  • Matthews correlation coefficient

  • Area under the receiver-operator curve (AUC)

  • True positives

  • False positives

  • True negatives

  • False negatives

  • True positive rate

  • False positive rate

  • Error (defined as: (falseNegatives + falsePositives)/(truePositives + trueNegatives + falsePositives + falseNegatives))

  • Accuracy   

  • Sensitivity (true positives / (true positives + false negatives))

  • Specificity (true negatives / (true negatives + false positives))

  • Total precision (defined as true positives divided by the total detected positives)

  • False discovery rate (defined as false positives divided by total detected positives)

  • Youden statistic (sensitivity + specificity - 1)

  • Average covariate weight*

^Statistic will be excluded if a beta column is not specified for the GWAS output

*Statistic will be excluded if a covariate column is not specified for the GWAS output
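
Several of the statistics listed above are simple functions of the four confusion-matrix counts. The sketch below applies the definitions given in the list; the accuracy and Matthews correlation coefficient formulas are the standard textbook ones rather than something stated on this page, and the function name, the NaN convention for undefined ratios, and the counts are hypothetical.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Fit statistics derived from confusion-matrix counts."""
    total = tp + fp + tn + fn
    detected = tp + fp

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN here.
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "error": ratio(fp + fn, total),
        "accuracy": ratio(tp + tn, total),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": ratio(tp, detected),
        "fdr": ratio(fp, detected),
        "youden": sensitivity + specificity - 1,
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }

# Hypothetical counts for one results file.
stats = classification_metrics(tp=8, fp=2, tn=85, fn=5)
for name, value in stats.items():
    print(name, round(value, 4))
```

When no SNP is called significant, precision and the false discovery rate have a zero denominator, which this sketch reports as NaN rather than as zero.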

Though the output is typically a text file, one can easily reformat it into .CSV style or other formats. An example of formatted Winnow results from the aforementioned PLINK analysis is shown below.

In this case we can conclude that PLINK was perhaps not an appropriate tool for this dataset. The inability to identify any true positives after adjustment means that none of the SNPs from the known truth file were detected. Beyond that, the AUC levels below 0.5 indicate that using PLINK actually was less accurate than random guessing. Though we can choose to visualize this in Demonstrate, one likely would be safe in concluding that running this data with another GWAS analysis tool may be better than using PLINK.

Along with the actual results, each run of Winnow also produces a parameters file for future reference or for use with Demonstrate. Extra lines are included when covariates or p-value adjustments are used, but several parameters are always present: the output name, the analysis type (with beta or without beta), and the significance threshold. Using the data and options from the example above, the Winnow parameter file would look like this:

 

Updating or Modifying Winnow

Because of the program structure for Winnow, one may easily update or modify the source code with additional performance metrics, delimiter options, and the like. If you are interested in modifying the Winnow program or wish to add more fit statistics to the output, please consult this tutorial.

 

Further Information

iPlant profile for Dustin Landers, the original architect of Winnow: https://pods.iplantcollaborative.org/wiki/display/~landersda

Information on the AUC: https://www.kaggle.com/wiki/AreaUnderCurve

Example data for analysis can be found: http://mirrors.iplantcollaborative.org/browse/iplant/home/shared/iplantcollaborative/example_data/Validate/Validate_Test_Data


This tool is still in development and we are testing it currently. If you notice any issues or have any comments we would greatly appreciate them!
Please contact us at labstapleton@gmail.com. Thank you for using our tools!
