InterProScan 5.36.75


InterProScan version 5.36.75 is an HPC-enabled app that runs on TACC computing resources.

InterProScan provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (referred to as member databases) that make up the InterPro consortium (http://www.ebi.ac.uk/interpro/about.html). InterPro signatures are also mapped to functional annotation resources such as the Gene Ontology and UniPathway, allowing researchers to predict GO and pathway information for their protein data sets.

In this version of InterProScan, the input is limited to 50,000 protein sequences for a single analysis. Researchers with transcriptome data should use Transdecoder (or a similar tool) to translate their transcripts and format them as fasta files. The split fasta app can be used to divide large fasta sequence files into files of 50,000 sequences or fewer.
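For readers who prefer to prepare input outside the Discovery Environment, the 50,000-sequence limit can be respected with a short script. The following is a minimal Python sketch and not the split fasta app itself; the function names, the output naming scheme (`.partN.fasta`), and the assumption that every record starts with a `>` header line are illustrative choices.

```python
from typing import Iterable, Iterator, List


def split_fasta_records(lines: Iterable[str], max_records: int = 50_000) -> Iterator[List[str]]:
    """Yield lists of fasta lines, each holding at most max_records sequences.

    Assumes each sequence record begins with a header line starting with '>'.
    """
    chunk: List[str] = []
    count = 0
    for line in lines:
        if line.startswith(">"):
            # Start a new chunk once the current one is full.
            if count == max_records:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk


def write_chunks(in_path: str, out_prefix: str, max_records: int = 50_000) -> List[str]:
    """Write each chunk to '<out_prefix>.partN.fasta' (hypothetical naming) and return the paths."""
    out_paths: List[str] = []
    with open(in_path) as handle:
        for index, chunk in enumerate(split_fasta_records(handle, max_records), start=1):
            out_path = f"{out_prefix}.part{index}.fasta"
            with open(out_path, "w") as out:
                out.writelines(chunk)
            out_paths.append(out_path)
    return out_paths
```

Calling `write_chunks("proteins.fasta", "proteins")` (file names are placeholders) would produce one output file per block of up to 50,000 sequences, each of which can be submitted as a separate analysis.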

For more information about InterPro & InterProScan:

Quick Start

To use InterProScan 5.36.75, import your data in fasta format.

Inputs:

  • Select fasta file 
  • Perform look up of corresponding Gene Ontology annotation: check to return GO annotation mappings (default)
  • Perform look up of corresponding pathways annotation: check to return pathways mappings (default)

Test Data

Test data for this app appears directly in the Discovery Environment in the Data window under Community Data -> iplantcollaborative -> example_data -> InterproScan5-44.0.

Test file: chick_test.fasta

This file contains 15 chicken protein sequences downloaded as a fasta file from NCBI.

Input File(s)

Use chick_test.fasta from the directory above as test input. This fasta file contains 15 chicken protein sequences downloaded from NCBI.

Parameters Used in App

When the app is run in the Discovery Environment, use the following parameters with the above input file to get the output described in the next section.

  • Perform look up of corresponding Gene Ontology annotation: check to return GO annotation mappings (default)
  • Perform look up of corresponding pathways annotation: check to return pathways mappings (default)

Output File(s)

In this version of InterProScan, you can retrieve output in any of the following six formats:

  • TSV: a simple tab-delimited file format
  • XML: the new "IMPACT" XML format (XSD available here).
  • GFF3: The GFF 3.0 format
  • JSON
  • SVG
  • HTML

Please note that protein match positions can be traced back to the original nucleotide sequence only with the GFF3 and XML formats.

This app also parses the InterProScan XML output to provide additional outputs. For more information on these outputs, see the InterProScan Results Function documentation.

Tab-separated values format (TSV)

The TSV format presents the match data in columns as follows:

  1. Protein Accession (e.g. P51587)
  2. Sequence MD5 digest (e.g. 14086411a2cdf1c4cba63020e1622579)
  3. Sequence Length (e.g. 3418)
  4. Analysis (e.g. Pfam / PRINTS / Gene3D)
  5. Signature Accession (e.g. PF09103 / G3DSA:2.40.50.140)
  6. Signature Description (e.g. BRCA2 repeat profile)
  7. Start location
  8. Stop location
  9. Score: the e-value of the match as reported by the member database method (e.g. 3.1E-52)
  10. Status: the status of the match (T: true)
  11. Date: the date of the run
  12. (InterPro annotations - accession (e.g. IPR002093) - optional column; only displayed if -iprlookup option is switched on)
  13. (InterPro annotations - description (e.g. BRCA2 repeat) - optional column; only displayed if -iprlookup option is switched on)
  14. (GO annotations (e.g. GO:0005515) - optional column; only displayed if --goterms option is switched on)
  15. (Pathways annotations (e.g. REACT_71) - optional column; only displayed if --pathways option is switched on)
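The column layout above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the file has no header row, the field names in `COLUMNS` are illustrative labels chosen here (they are not defined by InterProScan), and rows that lack the optional columns 12 to 15 are padded with empty strings.

```python
from typing import Dict, Iterable, Iterator, Union

# Illustrative labels for the 15 columns listed above (names are assumptions).
COLUMNS = [
    "protein_accession", "sequence_md5", "sequence_length", "analysis",
    "signature_accession", "signature_description", "start", "stop",
    "score", "status", "date",
    "interpro_accession", "interpro_description",
    "go_annotations", "pathway_annotations",
]

# Columns that hold whole numbers according to the list above.
INT_FIELDS = ("sequence_length", "start", "stop")


def parse_interproscan_tsv(lines: Iterable[str]) -> Iterator[Dict[str, Union[str, int]]]:
    """Yield one dict per match row of a tab-delimited InterProScan result."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        # Optional trailing columns may be absent; pad so every key exists.
        fields += [""] * (len(COLUMNS) - len(fields))
        record: Dict[str, Union[str, int]] = dict(zip(COLUMNS, fields))
        for name in INT_FIELDS:
            record[name] = int(record[name])
        # The score is kept as text because it may not always be numeric.
        yield record
```

A typical use would be to open the TSV output and iterate, for example `for match in parse_interproscan_tsv(open("results.tsv")):` (the file name is a placeholder), then filter on `match["analysis"]` or collect the non-empty `match["go_annotations"]` values.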

Tool Source for App