tbl2asn (gapped)-25.8 using DE

The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.

Please work through the tutorial and add your comments to the bottom of this page, or send comments by email to support@cyverse.org. Thank you.

Rationale and background: 

If your contig sequences include runs of N's that represent gaps, you will need to include assembly_gap features with the appropriate linkage evidence. If the sequences meet certain requirements, you can generate a gapped submission with tbl2asn using the arguments -l (to add linkage evidence) and -a (to add assembly_gaps), as described below.

Tbl2asn is a command-line program that automates the creation of sequence records for submission to GenBank. It uses many of the same functions as Sequin but is driven generally by data files. Using a template, tbl2asn generates .sqn files for submission to GenBank. No additional manual editing is required before submission.

Pre-Requisites

  1. A CyVerse account. (Register for a CyVerse account here - user.cyverse.org)

Mandatory arguments

  1. Template file containing a text ASN.1 Submit-block object (suffix .sbt).
  2. Nucleotide sequence data in FASTA format (suffix .fsa). This is a single FASTA file that contains either one sequence or multiple sequences.
  3. Linkage Evidence: Type of evidence used to assert linkage across the gaps. The available options correspond to the options for column 9 of an AGP file (for example, paired-ends or align-genus).
  4. Output filename

Optional arguments

  1. Feature Table or Annotation file (suffix .tbl). [Required only if including annotation]
  2. Structured comment file (suffix .cmt)

Gap details

There are two types of gap lengths:

  • Estimated Gap length: The approximate gap size is known. This is also used if the gap is known to be small (e.g. the gap could be between 10 and 50 N's).
  • Unknown Gap length: The gap size is not known (e.g. the gap could be 50 or 50000 N's) but the order and orientation of the contigs are known. We suggest using 100 N's to represent gaps of unknown length rather than a random number because it will allow you to add assembly_gap features using tbl2asn.
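Before choosing a gap setting, it can help to see which runs of N's a sequence actually contains. The following is a minimal Python sketch, not part of tbl2asn or the DE app; the function name `find_n_runs` and the example contig are hypothetical and used only for illustration.

```python
import re

def find_n_runs(sequence, min_run=1):
    """Return (start, length) for each run of N's at least min_run long.

    Start positions are 0-based. Both upper- and lower-case N are counted.
    """
    return [(m.start(), len(m.group()))
            for m in re.finditer(r"[Nn]+", sequence)
            if len(m.group()) >= min_run]

# Hypothetical contig: a short run of 3 N's and a run of 100 N's.
contig = "ACGT" + "N" * 3 + "GGCC" + "N" * 100 + "TTAA"

for start, length in find_n_runs(contig):
    print(f"run of {length} N's starting at position {start}")
```

Whether a given run is a gap or a stretch of ambiguous bases still depends on your assembly parameters; the sketch only reports lengths and positions.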

Parameters

  1. Master Genome Flags 
  2. Discrepancy Report: Recommended only for annotated genome submissions, complete or WGS. See the Discrepancy Report page for information about its output.
  3. Modifiers for FASTA Definition Lines: Allows the addition of source qualifiers that will be the same for each submission
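The modifiers are written as bracketed name=value pairs, as in the organism example used in the tests further down this page. A minimal Python sketch for assembling such a string from a dictionary follows; the helper name `format_source_modifiers` is hypothetical and not part of tbl2asn.

```python
def format_source_modifiers(modifiers):
    """Join name/value pairs into bracketed FASTA definition-line modifiers."""
    return " ".join(f"[{name}={value}]" for name, value in modifiers.items())

# Values taken from the test parameters on this page.
mods = {
    "organism": "Helicobacter pylori ABC1",
    "strain": "ABC1",
    "host": "Homo sapiens",
    "isolation-source": "blood",
}

print(format_source_modifiers(mods))
```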

Test/sample data:


Test data for tbl2asn (gapped)-25.8 are provided in /iplant/home/shared/iplantcollaborative/example_data/tbl2asn.sample.data.

Use the following inputs/outputs and parameters to test tbl2asn (gapped)-25.8:

1. All the gaps are of estimated lengths: Every run of 5 or more Ns represents a gap of estimated length, and the linkage evidence is paired-ends:

Note that you should only include an assembly_gap for runs of N's that represent gaps.  Do not add assembly_gaps for single or short runs of N's that represent ambiguous bases. You will need to check your assembly parameters to determine what the N's represent.

  1. Mandatory arguments
    1. Template file - template_BP_BS.sbt
    2. Fasta file - sample.gapped.unknown.fsa
    3. Linkage evidence - paired-ends (i.e., for paired ends or mate pairs)
    4. Output file - out.gapped.sqn
  2. Optional arguments 
    1. Annotation file - multiple.tbl
    2. Structured comment file - assembly.cmt
  3. Gap details
    1. Estimated Gap length - r5k (Runs of 5 or more N's are estimated gaps and shorter runs of N's are ambiguous bases). 
  4. Parameters
    1. Organism name - [organism=Helicobacter pylori ABC1] [strain=ABC1] [host=Homo sapiens] [isolation-source=blood]
    2. Master Genome Flag - n (default)
    3. Run Discrepancy report - checked (Recommended)
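For readers who want to relate these DE settings to a tbl2asn command line, here is a hedged Python sketch that only assembles an argument list and prints it; nothing is executed. The -l and -a arguments come from the text above. The other flags (-t, -i, -o, -f, -w, -j, -M, -Z) reflect the editor's understanding of the tbl2asn command line, and the mapping from DE fields to flags is an assumption, so check both against the tbl2asn help output before relying on them. The function name `build_tbl2asn_command` is hypothetical.

```python
import shlex

def build_tbl2asn_command(template, fasta, linkage, gap_spec, output,
                          table=None, comment=None, modifiers=None,
                          master_flag="n", discrepancy_report=None):
    """Assemble a tbl2asn argument list; nothing is executed here."""
    cmd = ["tbl2asn",
           "-t", template,      # template file (.sbt)
           "-i", fasta,         # FASTA input (.fsa)
           "-l", linkage,       # linkage evidence
           "-a", gap_spec,      # gap setting, e.g. r5k
           "-o", output,        # output .sqn file
           "-M", master_flag]   # master genome flag
    if table:
        cmd += ["-f", table]                # feature table (assumed flag)
    if comment:
        cmd += ["-w", comment]              # structured comment (assumed flag)
    if modifiers:
        cmd += ["-j", modifiers]            # source qualifiers
    if discrepancy_report:
        cmd += ["-Z", discrepancy_report]   # discrepancy report output
    return cmd

# Settings from test 1 on this page.
cmd = build_tbl2asn_command(
    template="template_BP_BS.sbt",
    fasta="sample.gapped.unknown.fsa",
    linkage="paired-ends",
    gap_spec="r5k",
    output="out.gapped.sqn",
    table="multiple.tbl",
    comment="assembly.cmt",
    modifiers="[organism=Helicobacter pylori ABC1] [strain=ABC1] "
              "[host=Homo sapiens] [isolation-source=blood]",
    discrepancy_report="discrep",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps the bracketed modifiers intact as one argument; `shlex.join` is used only to display it with shell quoting.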

2. All of the gaps are 100 bp and are of unknown length: Every gap is a run of 100 N's of unknown length, and the linkage evidence is alignment to another genome of the same genus:

Note that all of the unknown length gaps must be 100 N's. An assembly_gap will be added for every run of 100 N's.  All other N's will be ignored.  Please contact us for additional instructions if there are unknown length gaps of other sizes. Note that you must know the order and orientation of the contigs.  You cannot randomly link contigs using unknown (or known) length gaps.  If you do not have linkage evidence, submit the sequences as individual contigs.

  1. Mandatory arguments
    1. Template file - template_BP_BS.sbt
    2. Fasta file - sample.gapped.known.fsa
    3. Linkage evidence - align-genus
    4. Output file - out.gapped.sqn
  2. Optional arguments 
    1. Annotation file - multiple.tbl
    2. Structured comment file - assembly.cmt
  3. Gap details
    1. Unknown Gap length - r100u (Runs of exactly 100 N's are gaps of unknown length; all other N's are ignored).
  4. Parameters
    1. Organism name - [organism=Helicobacter pylori ABC1] [strain=ABC1] [host=Homo sapiens] [isolation-source=blood]
    2. Master Genome Flag - n (default)
    3. Run Discrepancy report - checked (Recommended)
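Because an assembly_gap is added only for runs of exactly 100 N's in this case, a quick pre-check for runs of any other length may be useful before submitting. This is a minimal Python sketch under that assumption; the function name `non_standard_n_runs` and the example sequence are hypothetical.

```python
import re

def non_standard_n_runs(sequence, gap_length=100):
    """Return (start, length) for every run of N's whose length is not gap_length."""
    return [(m.start(), len(m.group()))
            for m in re.finditer(r"[Nn]+", sequence)
            if len(m.group()) != gap_length]

# Hypothetical scaffold: one run of 100 N's and one run of 42 N's.
scaffold = "AC" + "N" * 100 + "GT" + "N" * 42 + "CA"

for start, length in non_standard_n_runs(scaffold):
    print(f"run of {length} N's at position {start} would not become an assembly_gap")
```

An empty result means every run of N's is exactly 100 long. Any run that is reported needs a decision: it is either ambiguous bases or a gap that requires the additional instructions mentioned in the note above.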

3. There are both estimated-length and unknown-length gaps: Runs of 10 or more N's are estimated gaps, shorter runs of N's are ambiguous bases, all runs of exactly 100 N's are unknown gaps, and the linkage evidence is paired-ends:

Note that all of the unknown length gaps must be 100 N's. The number in the gap setting (for example, the 10 in r10u) is the minimum number of N's to convert to an estimated length gap. If some runs of 100 N's are unknown length and others are estimated length, please contact us for more information.

  1. Mandatory arguments
    1. Template file - template_BP_BS.sbt
    2. Fasta file - sample.gapped.unknown.fsa
    3. Linkage evidence - paired-ends (i.e., for paired ends or mate pairs)
    4. Output file - out.gapped.sqn
  2. Optional arguments 
    1. Annotation file - multiple.tbl
    2. Structured comment file - assembly.cmt
  3. Gap details
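    1. Estimated Gap length - r10u (Runs of 10 or more N's are estimated gaps, runs of exactly 100 N's are unknown gaps, and shorter runs of N's are ambiguous bases).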
  4. Parameters
    1. Organism name - [organism=Helicobacter pylori ABC1] [strain=ABC1] [host=Homo sapiens] [isolation-source=blood]
    2. Master Genome Flag - n (default)
    3. Run Discrepancy report - checked (Recommended)
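The rule described for this mixed case can be written down as a small classifier. The Python sketch below mirrors only the descriptions on this page (a setting such as r10u or r5k, with 100 N's marking unknown-length gaps); it is not tbl2asn's own logic, and the function name `classify_run` is hypothetical.

```python
import re

def classify_run(length, gap_spec, unknown_length=100):
    """Classify a run of N's of the given length under a setting like r10u or r5k.

    Follows the descriptions on this page: the number is the minimum run
    length treated as a gap; with a trailing 'u', runs of exactly
    unknown_length N's are unknown-length gaps.
    """
    match = re.fullmatch(r"r(\d+)([ku])", gap_spec)
    if match is None:
        raise ValueError(f"unrecognised gap setting: {gap_spec!r}")
    min_run, kind = int(match.group(1)), match.group(2)
    if kind == "u" and length == unknown_length:
        return "unknown"
    if length >= min_run:
        return "estimated"
    return "ambiguous"

for n in (4, 25, 100):
    print(n, classify_run(n, "r10u"))
```

With r10u, a run of 4 N's is treated as ambiguous bases, a run of 25 as an estimated gap, and a run of exactly 100 as an unknown-length gap. With a trailing k (as in r5k), no run is classified as unknown.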

Output Reports:

  1. out.gapped.sqn - sqn file for submission to WGS
  2. multiple.val - validation report
  3. discrep - discrepancy report
  4. errorsummary.val - Summary file showing the number, severity and type of errors found in all the .val files.

More information about tbl2asn (gapped)-25.8 can be found at http://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/ and http://www.ncbi.nlm.nih.gov/genbank/wgs_gapped/