
Validate for the Developer

This documentation is intended as a comprehensive manual on how to extend or alter Validate so that it provides a more useful way to test genotype-to-phenotype association methods with known-truth data sets. Importantly, this manual is meant to be used in conjunction with the docstrings and other documentation within the code itself. The intention is to help you start altering the code much more quickly.

Validate, as you may be aware, is a tool written in Python that is intended to completely replace other, more ad hoc known-truth testing methods.

Known-truth testing is, quite simply, a method of creating realistic data sets where we "know the truth" and then seeing how well our methods identify this "truth." Validate covers the second part: measuring how well a method identifies the truth.

The code for Validate is written in Python, which was chosen because it is an easy-to-understand language. Since methods for validation may change over the years, this manual was written to tell tool developers and others how to alter the code so that it does what they need.

Although in many ways this documentation may reflect what you would expect from an API developer doc, Validate is more of an open-source script than an API. However, we have worked hard to modularize the scripts so that they are easy for others to understand and modify. Further, by including well-known scientific modules, we hope to encourage wide use of these modules in writing additional functions and classes for Validate.

Module, Object, and Function Reference

Validate is divided into several modular files, listed below. For function and class references, please see the docstring for each function in the Python code.

File Name          Purpose of File
validate.py        Contains and runs the main function of the application. It also makes all necessary calls to the other files.
checkhidden.py     Does not need to be modified. It simply provides functions that check for hidden files within a directory, to ensure those are not included in any analysis.
data.py            Contains a class named "Data", which transforms a delimited file into usable data.
fileimport.py      Provides functions to import data. Currently, only whitespace- and comma-delimited files are supported.
gwas.py            Contains functions for analyzing the output of a GWAS application. For a prediction application, additional functions will need to be written.
commandline.py     Contains functions for retrieving command line arguments.
performetrics.py   Contains the individual functions for each performance metric used in validating an application. This is the file we expect to be most heavily modified in the future.
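
To make the roles of checkhidden.py and fileimport.py more concrete, here is a small sketch of the two ideas: skipping hidden files and splitting whitespace- or comma-delimited lines. The function names `is_hidden` and `split_line` are hypothetical and are not taken from the Validate source. Treating dot-prefixed names as hidden is also an assumption.

```python
import os


def is_hidden(filename):
    """Treat dot-prefixed names as hidden (Unix convention; an assumption)."""
    return os.path.basename(filename).startswith(".")


def split_line(line, delimiter="whitespace"):
    """Split one line of a delimited file on whitespace or on commas."""
    if delimiter == "comma":
        return [field.strip() for field in line.strip().split(",")]
    # str.split() with no argument splits on any run of whitespace
    return line.split()


# Hidden files are dropped before any analysis takes place
visible = [name for name in [".DS_Store", "results.txt", "truth.ote"]
           if not is_hidden(name)]

# The same record parsed from a comma-delimited and a whitespace-delimited line
comma_fields = split_line("snp1, 0.03, 0.8\n", delimiter="comma")
space_fields = split_line("snp1  0.03\t0.8\n")
```

Both calls to `split_line` yield the same three fields, which is the property a class like Data would rely on when turning a delimited file into lists.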

There are five main objects (four lists and one scalar) that you will need to understand in order to write additional functions, especially metric functions. These are:

Object Name      What the Object Actually Is
betaColumn       The generated list, or vector, of SNP weights.
betaTrueFalse    The "truth" about the betas, expressed numerically. This is generated from the truth files specified during execution (files with an *.ote extension). Generally, this list has the same length as betaColumn; it contains zeros for SNPs with no effect and the explicit quantitative effect for all other SNPs.
snpTrueFalse     This "truth" list is equivalent to betaTrueFalse, but expressed categorically with booleans rather than quantitatively.
scoreColumn      The list of generated "scores", one per SNP, as assigned by the GWAS application. For example, this is the column that would generally hold P-values.
threshold        A scalar used by metrics that require a threshold, such as True Positive Rate.
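
The toy example below shows how the five objects relate to each other. All values are made up, the derivation of snpTrueFalse from betaTrueFalse is inferred from the descriptions above, and `true_positive_rate` is a hypothetical helper that assumes the scores are P-values, so that a smaller score means a more significant SNP.

```python
# Toy values only; in Validate these objects are built from result and truth files.
betaColumn = [0.8, 0.0, -0.3, 0.1]       # estimated SNP weights
betaTrueFalse = [1.0, 0.0, -0.5, 0.0]    # true effects; zero means "no effect"
scoreColumn = [0.001, 0.60, 0.20, 0.04]  # e.g. P-values from the GWAS application
threshold = 0.05                         # scalar cutoff used by some metrics

# Categorical version of the truth: True wherever the true effect is non-zero
snpTrueFalse = [beta != 0 for beta in betaTrueFalse]


def true_positive_rate(snpTrueFalse, scoreColumn, threshold):
    """Fraction of true-effect SNPs whose score falls below the threshold."""
    called = [score < threshold for score in scoreColumn]
    true_positives = sum(1 for truth, call in zip(snpTrueFalse, called)
                         if truth and call)
    positives = sum(snpTrueFalse)
    return true_positives / positives if positives else float("nan")


tpr = true_positive_rate(snpTrueFalse, scoreColumn, threshold)
```

Here two SNPs have a true effect and only the first one passes the threshold, so the rate is one half. The fourth SNP also passes the threshold but has no true effect, which makes it a false positive that this particular metric ignores.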

Change Log

Validate for Python release 0.8.0 – Scheduled for mid-June 2014 – First release of the Python re-write. Validate is in beta. All functionality works, apart from defects in the H measure.

Validate for R release 0.8.0 – Released May 2014 – Re-write of the R version of Validate. Same version as the one included in the Validate-Toolkit, but installed on the DE.

Validate-Toolkit-0.3.0 – Released on Atmosphere, Spring 2014 – New version of Validate written in R. Included a major expansion of features: the ability to handle different file types (for truth files), some automatic file transformation (for truth files), and ten new performance measures. Also corrected a bug that forced Validate to fail when only a single result/output file was being validated. (Under GitHub repository name "ktaR")

Validate for R release 0.3.0 – Released on the DE, Fall 2013 – First release of the R version of Validate. This version was an alpha, but most functionality was available. (Under GitHub repository name "ktaR")

Developer Guide

Adding an additional performance metric

Adding an additional performance metric to the Python version of Validate is straightforward. Let's use the example of adding a correlation. (This feature is already included, as the correlation between estimated SNP weights and actual "truth" SNP weights.)

1. To add this measure, we first need a correlation function. We could write our own, which would be simple enough, but for now we will assume you have no desire to reinvent the wheel and will simply use the Pearson correlation available through the scipy module. So at the beginning of the performetrics.py file, which holds the metric functions, we will add an import for scipy's Pearson correlation function.
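
A minimal sketch of that import, assuming the metric functions live in performetrics.py and that scipy is installed:

```python
# Near the top of performetrics.py (assumed location of the metric functions)
from scipy.stats import pearsonr  # returns the correlation coefficient and a p-value
```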

2. Then, anywhere in this file, we can write the function that performs the correlation. We know from our object reference that it needs to take the betaColumn and betaTrueFalse lists as inputs.

3. Finally, the new function needs to be added to the gwas functions in the gwas.py file, so that the correlation function we just created is called and its result is saved while a results file is being validated.
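
The registration step might be sketched as follows. The signature of gwasWithBeta, the header names, and the `meanAbsoluteError` placeholder metric are assumptions made for illustration; only the two-list shape of the return line comes from the description in this guide. The correlation is written by hand here purely so the sketch runs without dependencies.

```python
import math


def correlation(betaColumn, betaTrueFalse):
    """Stand-in Pearson correlation so this sketch has no dependencies."""
    mean_b = sum(betaColumn) / len(betaColumn)
    mean_t = sum(betaTrueFalse) / len(betaTrueFalse)
    cov = sum((b - mean_b) * (t - mean_t)
              for b, t in zip(betaColumn, betaTrueFalse))
    spread_b = math.sqrt(sum((b - mean_b) ** 2 for b in betaColumn))
    spread_t = math.sqrt(sum((t - mean_t) ** 2 for t in betaTrueFalse))
    return cov / (spread_b * spread_t)


def meanAbsoluteError(betaColumn, betaTrueFalse):
    """Placeholder for a metric that is already registered (hypothetical)."""
    return sum(abs(b - t) for b, t in zip(betaColumn, betaTrueFalse)) / len(betaColumn)


def gwasWithBeta(betaColumn, betaTrueFalse, snpTrueFalse, scoreColumn, threshold):
    """Hypothetical signature: header names first, metric values second."""
    return (["mae", "corr"],                                # names for the results header
            [meanAbsoluteError(betaColumn, betaTrueFalse),  # value for "mae"
             correlation(betaColumn, betaTrueFalse)])       # value for "corr"


names, values = gwasWithBeta([1.0, 2.0, 3.0], [2.0, 4.0, 6.0],
                             [True, True, True], [0.01, 0.02, 0.03], 0.05)
```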

The first part of the return line in the gwasWithBeta function specifies a list of names. These names will be used in the header of the Validate results file. Just make sure that the position of each name in the first list corresponds to the position of its function call in the second list, where the actual function is called. Finally, keep in mind that we are only altering the gwasWithBeta function, since a correlation between SNP weights cannot be computed for a GWAS application that does not report SNP weights.
