VirSorter is a pipeline designed to mine microbial draft genomes for viral signal (complete viral contigs or viral regions within microbial contigs).
- To use VIRSorter 1.0.2, import your data in fasta format. Related microbial genomes (i.e., from the same taxonomic group) can be processed together. Microbial metagenomes can also be used as input, although the model derived from the microbial contigs will be less accurate, since it will be computed from a complex microbial community rather than a homogeneous set of genomes.
Input file formats, parameters, and results files
Input File(s) and Database selection
- Microbial genome(s) (draft or completely assembled) should all be in a single fasta file, with no spaces in the sequence names or in the name of any folder containing this file.
- The viral reference database is the only choice the user has to make. "RefseqABVir" includes all bacterial and archaeal viral genomes from RefseqVirus, and is the most conservative choice. "Viromes" includes the same RefseqVirus genomes plus unknown viral genomes from previously published viral metagenomes (human gut, seawater, freshwater). Although this database has been manually curated, using these unknown genomes carries a higher risk of false positives. In return, "Viromes" covers the viral sequence space much better than "RefseqABVir", which makes it a good choice for exploratory studies or for the analysis of microbes for which no (or few) viruses have been detected so far.
- Custom viral sequences can be added to the reference database through a third (optional) input file. Viral genome(s) should be in fasta format (nucleotide sequences, not predicted proteins), and will be automatically added to the selected database as the first step of the VirSorter computation.
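The naming constraints on the input file can be checked before a run. The following is a minimal sketch, not part of VirSorter itself: the function names are hypothetical, and it only looks for whitespace in sequence names and spaces in the folder names along the path given.

```python
from pathlib import PurePath
from typing import Iterable, List


def sequence_names_with_spaces(fasta_lines: Iterable[str]) -> List[str]:
    """Return the FASTA sequence names (header lines) that contain whitespace."""
    bad_names = []
    for line in fasta_lines:
        if line.startswith(">"):
            # Drop the leading '>' and the trailing line ending only.
            name = line[1:].rstrip("\r\n")
            if any(ch.isspace() for ch in name):
                bad_names.append(name)
    return bad_names


def folders_with_spaces(fasta_path: str) -> List[str]:
    """Return the folder names along the given path that contain a space."""
    # Only the folders are checked here, not the file name itself.
    return [part for part in PurePath(fasta_path).parent.parts if " " in part]
```

For a file on disk, the first function can be called on the open file handle (for example `sequence_names_with_spaces(open(path))`). An empty list from both functions means neither check found a problem.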
- The main VirSorter result file (VIRSorter_global_phage_signal.csv) is found at the root of the result folder. This table (which can be opened with any spreadsheet program) lists all sequences detected as viral by VirSorter, organized by category. Entirely viral sequences come first, from the most to the least confident predictions (categories 1, 2, and 3), followed by prophages (viral regions detected in a cellular contig), again from the most to the least confident predictions (categories 4, 5, and 6). For each sequence, the table gives the number of predicted genes, the number of viral hallmark genes, and the significance scores for viral gene enrichment, non-Caudovirales gene enrichment, Pfam depletion, uncharacterized gene enrichment, strand switch depletion, and short gene enrichment.
- Fasta and GenBank files of all predicted sequences, grouped by category, are automatically generated and placed in the Predicted_viral_sequences/ folder.
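The category numbering described above can be summarized in a small helper. This is only an illustrative sketch: the function name and the wording of the labels are assumptions, and "intermediate confidence" for the middle category of each group is inferred from the ordering given in the text.

```python
def describe_category(category: int) -> str:
    """Translate a VirSorter category number (1 to 6) into a short description."""
    if category not in range(1, 7):
        raise ValueError(f"expected a category between 1 and 6, got {category!r}")

    # Categories 1-3 are entirely viral sequences, 4-6 are prophages.
    if category <= 3:
        kind = "entirely viral sequence"
    else:
        kind = "prophage (viral region in a cellular contig)"

    # Within each group, the lowest number is the most confident prediction.
    rank = (category - 1) % 3
    confidence = ("most confident", "intermediate confidence", "least confident")[rank]
    return f"category {category}: {kind}, {confidence}"
```

For example, `describe_category(4)` describes the most confident prophage predictions, while `describe_category(3)` describes the least confident entirely viral sequences.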
The contents of the other folders are as follows:
- Fasta_files/ contains the input sequences pre-processed by Metavir, including gene prediction.
- r_X/ folders contain the successive databases generated. r_0/ is the first revision and corresponds to the database selected by the user (RefseqABVir or Viromes, potentially complemented by custom viral sequences). r_1/ and later revisions correspond to databases generated from the predictions of the previous run: all category 1 viral sequences from revision n-1 are included in the database used to mine the input dataset in revision n.
- Metrics_files/ contains intermediate files used by VirSorter to summarize metrics for each sequence.
- Tab_files/ contains the raw results from the HMM and BLAST searches against the databases.
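To see how many database revisions a run went through, the r_X/ folders described above can be listed in revision order. The sketch below assumes only the folder naming given in this document (r_0, r_1, and so on, at the root of the result folder); the function name is hypothetical.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Matches folder names such as r_0, r_1, r_12 (naming taken from the list above).
REVISION_PATTERN = re.compile(r"r_(\d+)")


def list_revisions(result_dir: str) -> List[Tuple[int, Path]]:
    """Return (revision number, folder path) pairs sorted by revision number."""
    revisions = []
    for entry in Path(result_dir).iterdir():
        match = REVISION_PATTERN.fullmatch(entry.name)
        if match and entry.is_dir():
            revisions.append((int(match.group(1)), entry))
    # Sort numerically so that r_10 comes after r_2.
    return sorted(revisions, key=lambda pair: pair[0])
```

The last pair in the returned list corresponds to the final revision of the run, and the first pair (revision 0) to the database selected by the user.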