Genome submissions consist of genomic DNA sequences representing either incomplete or complete genomes from both prokaryotes and eukaryotes. Incomplete genomes (or incomplete chromosomes of prokaryotes or eukaryotes) are those submissions that have been derived from data created by whole-genome shotgun (WGS) sequencing methods or traditional clone-based sequencing, respectively. WGS projects may be annotated, but annotation is not required. Complete genomes are those genomes of prokaryotes or eukaryotes that have each chromosome in a single sequence without gaps or Ns that represent gaps.
This workflow enables CyVerse users to submit incomplete genomes to the NCBI Whole Genome Shotgun (WGS) division only. If you are submitting complete genomes to the NCBI (complete prokaryotic genomes, or complete eukaryotic genomes or chromosomes), see the table below for more information.
|Complete prokaryotic genomes||GenBank Prokaryotic Genomes: Records retrievable from the Nucleotide Database|
GenBank archives complete prokaryotic genomes with submitter-supplied annotations. Alternatively, submitters can now request automated NCBI annotation of sequences as part of the submission process. NCBI has a Prokaryotic Genomes Annotation Pipeline that may be requested when genome files are submitted to GenBank. This pipeline generates a submission-ready annotated file that the submitter can edit prior to data release. For more information, read about the Prokaryotic Genomes Annotation Pipeline.
|Complete eukaryotic genomes or chromosomes||GenBank Eukaryotic Genomes and Chromosomes: Records retrievable from the Nucleotide Database|
GenBank accepts the submission of complete eukaryotic chromosomes or complete genomes with submitter-supplied annotations. Complete genomes, with each of the chromosomes in single sequences, should be submitted to GenBank as a complete genome. The most common complete genomes are bacteria, archaea, and fungi. Complete genomes are defined for GenBank as the chromosomes, although any plasmids that are isolated with the chromosomes should be submitted too. As of July 2013, these sequences are allowed to contain gaps and are not required to include annotation. However, submitters need to know what kinds of gaps and linkage evidence are present, as described in Gapped Format for Genome Submissions. For information about annotating genomes, see the prokaryotic annotation guide or eukaryotic annotation guide.
If you are unsure about the type of data submitted to the WGS division, visit the WGS List for example projects.
This workflow is only meant for WGS submissions. The differences in GenBank purposes are:
For both Non-WGS and WGS
There are two main formats for WGS submissions:
.sqn, Optional AGP, .qvl and .tbl files
.fsa, Optional AGP
An example of submission package metadata is in the Discovery Environment Data window under Community Data -> iplantcollaborative -> example_data -> WGS_submission
The submission package is created using tools in the DE. The submission package has three levels: BioProject, BioSample, and Library. Package organization is similar to the SRA organization detailed in the NCBI Quick Start Guide.
Each submission will create a BioProject, BioSample(s) and a Library folder(s).
Only one BioProject can be created per submission.
BioProject, BioSample, and Library metadata are entered using metadata templates in the DE.
Only submission package folders have metadata. Do not add metadata to the sequence files.
Use the Metadata Term Guide in the DE for explanations of each metadata term. The guide is located within each template.
See http://www.ncbi.nlm.nih.gov/biosample/docs/packages/ for help determining the appropriate BioSample type for your data.
Use the saved metadata file in Step 2 to create the submission template (.sbt) using the meta2tbl app in the DE.
Run tbl2asn-gapped-25.3 or tbl2asn-ungapped-25.3, depending on the type of your WGS submission, along with the submission template generated in Step 3 to convert the fasta files to sqn format. Check the output of the Validation and Discrepancy Report, and fix any problems.
Move the sqn file generated in Step 4 into the Library folder under BioProject -> BioSample, then save the BioProject metadata to a file.
Run the NCBI_WGS_Submit app to submit to the WGS.
Make sure you uncheck the Validate metadata file only checkbox.
The app will both create the submission.xml metadata file and transfer all sequence files to the WGS.
An example of submission package metadata is in the Discovery Environment Data window under Community Data -> iplantcollaborative -> example_data -> WGS_submission.
The submission package is created using tools in the DE. Submission packages have three levels: BioProject, BioSample, and Library. Package organization is similar to the SRA organization detailed in the NCBI Quick Start Guide.
Until the next DE release, the submission package is the same for both SRA and WGS.
A WGS submission package contains a BioProject folder with one or more BioSample folders. Each BioSample folder contains one or more Library folders, and each Library folder contains one or more sequence files. Use the Discovery Environment (DE) Create NCBI SRA Submission Folder tool to create the submission package (see figure below).
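The three-level layout described above can be checked with a short script before submitting. This is only a sketch: it assumes you have a local copy of the package on disk (the DE itself stores the package in the CyVerse Data Store), and the function name `check_package` is hypothetical, not part of any CyVerse or NCBI tool.

```python
from pathlib import Path


def check_package(bioproject: Path) -> list[str]:
    """Return problems found in a BioProject > BioSample > Library folder tree."""
    problems = []
    # Level 2: every sub-folder of the BioProject is treated as a BioSample.
    biosamples = sorted(p for p in bioproject.iterdir() if p.is_dir())
    if not biosamples:
        problems.append(f"{bioproject.name}: no BioSample folders")
    for sample in biosamples:
        # Level 3: every sub-folder of a BioSample is treated as a Library.
        libraries = sorted(p for p in sample.iterdir() if p.is_dir())
        if not libraries:
            problems.append(f"{sample.name}: no Library folders")
        for library in libraries:
            # A Library folder is expected to hold at least one sequence file.
            if not any(f.is_file() for f in library.iterdir()):
                problems.append(f"{sample.name}/{library.name}: no sequence files")
    return problems
```

An empty list means all three levels are present and every Library folder holds a file. Empty Library folders are reported so that you can decide whether to remove them or leave them in place.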
Caveats and suggestions
Enter information on the number of BioSamples and Libraries.
Name the top-level BioProject folder (click the link for more information on NCBI BioProjects).
Assign each genome to a BioProject. Genomes sequenced as part of the same research effort can belong to a single BioProject.
Enter the total number of BioSamples in your submission (click the link for more information on NCBI BioSamples).
If the same sample is used for two different genome assemblies, use the same BioSample for both.
Enter the largest number of sample-specific sequencing libraries among your BioSamples. For example, if you have two BioSamples and one of them has one library and the other has two, enter ‘2’ for the number of libraries. If you have more Libraries for some BioSamples than others, this will generate some empty Library folders in the next step. You can remove these empty Library folders, or ignore them.
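The rule for the number of libraries can be worked out ahead of time. The sketch below is illustrative only: the function name `library_folder_plan` and the BioSample names are hypothetical. It returns the value to enter in the tool and the number of empty Library folders each BioSample will end up with.

```python
def library_folder_plan(libraries_per_sample: dict[str, int]) -> tuple[int, dict[str, int]]:
    """Return (number of libraries to enter, empty Library folders per BioSample)."""
    # The tool creates the same number of Library folders in every BioSample,
    # so the value to enter is the largest per-sample library count.
    value_to_enter = max(libraries_per_sample.values())
    # BioSamples with fewer libraries are left with the difference as empty folders.
    empty_folders = {
        sample: value_to_enter - count
        for sample, count in libraries_per_sample.items()
        if count < value_to_enter
    }
    return value_to_enter, empty_folders


# Two BioSamples, one with a single library and one with two libraries.
plan = library_folder_plan({"BioSample1": 1, "BioSample2": 2})
```

For the two-BioSample example in the text, `plan` is `(2, {"BioSample1": 1})`: enter 2, and expect one empty Library folder under the first BioSample.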
Output: Metadata file saved from the top-level BioProject folder in the submission package.
Three metadata templates will be used to add metadata to the submission package: BioProject, BioSample, and Library, successively:
For each BioSample folder in the submission package, select the NCBI BioSample - Plant WGS BioSample Metadata template, and enter metadata (metadata template tutorial):
For each Library folder in the submission package, select the NCBI WGS Library Metadata template and enter metadata (metadata template tutorial). To facilitate metadata entry, enter all shared metadata for a single Library folder and then copy it to all other Library folders. After copying, you can add unique metadata to each Library folder.
Do not add metadata to the sequence files.
After the metadata has been entered, select the top-level BioProject folder in the submission package and use the ‘Save metadata’ function to save a BioProject metadata file for the submission package. This file will serve as input into the WGS submission app in the next step (Step 3):
Once the metadata file has been saved, select the NCBI WGS Submit app to validate the submission package and metadata file. Note: Do not put the metadata file in the BioProject folder.
Logs folder with information on job execution.
A folder named with your CyVerse username and the top-level BioProject folder ID that contains the submission.xml (metadata file formatted for ingestion by the WGS).
Tbl2asn is a command-line program that automates the creation of sequence records for submission to GenBank. It uses many of the same functions as Sequin but is driven generally by data files. Tbl2asn generates .sqn files using the template generated in Step 3 for WGS submission. Depending on whether your genome is gapped or ungapped, choose either the tbl2asn-gapped-25.3 or the tbl2asn-ungapped-25.3 DE app.
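For readers who run tbl2asn outside the DE, the sketch below builds a command line of the kind NCBI's WGS instructions have used. Treat it as an assumption-laden illustration: the DE apps may pass different options, the function name `tbl2asn_command` and the file names are hypothetical, and the gapped settings (`-a r10k`, runs of 10 or more Ns treated as gaps of known length, with `-l paired-ends` as linkage evidence) must be replaced with whatever matches your assembly.

```python
def tbl2asn_command(template: str, input_dir: str, gapped: bool,
                    discrepancy_report: str = "discrep") -> list[str]:
    """Build (but do not run) a tbl2asn argument list for a WGS submission."""
    command = [
        "tbl2asn",
        "-t", template,            # submission template (.sbt)
        "-p", input_dir,           # directory holding the .fsa (and optional .tbl) files
        "-M", "n",                 # normal genome submission flags
        "-Z", discrepancy_report,  # file name for the Discrepancy Report
    ]
    if gapped:
        # Assumed gap settings; adjust to the gaps and linkage evidence in your assembly.
        command += ["-a", "r10k", "-l", "paired-ends"]
    return command


# Hypothetical file names, for illustration only.
ungapped = tbl2asn_command("template.sbt", "fsa_files", gapped=False)
gapped = tbl2asn_command("template.sbt", "fsa_files", gapped=True)
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where tbl2asn is installed. Building it separately makes it easy to inspect the options before anything is executed.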
One or more .fsa (fasta) files. Nucleotide sequences in fasta file must conform to the following standards:
There should be no gaps represented, although Ns can be used to represent sequence ambiguities.
There should be no more than 10,000 sequences per file. It is often convenient to group sequences by molecule type (e.g., chromosome) or sequence status (e.g., unplaced or unlocalized).
Typically, files will end with an .fsa extension (e.g., chr1.fsa, chr2.fsa, unknown.fsa).
Optional files: These correspond to and have the same basenames as the .fsa files:
Annotation files, if appropriate. The .tbl files have a 5-column tab-delimited table of feature locations and qualifiers.
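Two of the fasta standards listed above are easy to check before running tbl2asn. The following is a minimal sketch, not an NCBI validator: the function name `check_fasta` is hypothetical, and it assumes that a gap would appear as a `-` character in a sequence line.

```python
def check_fasta(lines, max_sequences: int = 10_000):
    """Return (sequence count, problems) for the lines of one .fsa file."""
    problems = []
    count = 0
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Each definition line starts a new sequence.
            count += 1
        elif "-" in line:
            # Assumption: gaps are written as '-'. Ns are allowed for ambiguities.
            problems.append(f"line {number}: gap character '-' found")
    if count > max_sequences:
        problems.append(f"{count} sequences exceeds the limit of {max_sequences} per file")
    return count, problems


# A tiny in-memory example; a real file can be passed as an open file handle.
example = [">contig1", "ACGTNNACGT", ">contig2", "AC-GT"]
result = check_fasta(example)
```

Here `result` is `(2, ["line 4: gap character '-' found"])`. To check a file on disk, pass an open handle, for example `check_fasta(open("chr1.fsa"))`.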
Check the output of the Validation and Discrepancy Report and fix problems:
NCBI requires that you compress your sqn file before submitting it to WGS. Use the "Compress files with gzip" app to compress the output.sqn file into an output.sqn.gz file.
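If you prepare the sqn file outside the DE, the same compression can be done with Python's standard library. This is a sketch with a hypothetical helper name, `gzip_file`. It leaves the original file in place and writes the compressed copy next to it.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(source: Path) -> Path:
    """Compress source to source + '.gz' and return the new path."""
    target = source.with_name(source.name + ".gz")
    # Stream the bytes so large sqn files are not read into memory at once.
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, `gzip_file(Path("output.sqn"))` writes `output.sqn.gz` in the same folder.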
Caveats and suggestions
In the DE, you can open two Data windows and move the sqn files from one window to the other. Large files take slightly longer to move.
If you already have sqn files, you can upload them to the DE. See this guide to choose the most appropriate upload method. Cyberduck is highly recommended for uploads (see the CyVerse Upload Tutorial).
Input: The BioProject folder (top-level of the submission package) and the BioProject metadata file (saved from the top-level of the submission package).
Logs folder with information on job execution that includes a ‘manifest.txt’ file with a log of the files transferred to the WGS.
Folder named with your CyVerse username and the top-level BioProject folder ID that contains the submission.xml (metadata file formatted for ingestion by the WGS) and a submit.ready file used to signal WGS systems that submission is complete and to process the submission package.
Caveats and suggestions
After you submit, the submission package will be validated by the WGS system, and the WGS will send email notifications to the contact email entered in the BioProject metadata to confirm successful submission or to communicate submission errors.
What happens at WGS? CyVerse systems connect to WGS systems and create the submission folder on the WGS side. Files are transferred and a submit.ready file is sent to the WGS to signal that the submission package is complete and they can begin processing. The WGS system validates the submission package and generates a report.xml file containing any errors detected. The WGS system sends notification email(s) to the contact email provided in the BioProject metadata template, and to the CyVerse team, to report either a successful or failed submission. The first email will be titled "Submission ownership transfer". Follow the instructions in that email to transfer ownership of the submission to the NCBI user included in the package metadata. After ownership transfer, you can view the submission progress at https://submit.ncbi.nlm.nih.gov/subs/. You may need to log in with the NCBI credentials for the account you used in the submission metadata. After you receive further notification from the WGS, if there are errors, you can retrieve the submission report.xml file from WGS servers with the "NCBI_Report_Download" App in the DE, make corrections, and resubmit (see below).
Caveats and suggestions
WGS processing may take 72 hours (or longer) depending on the load on their systems. If you do not receive any notifications after a week, please email us at firstname.lastname@example.org
If error correction and resubmission are needed, the WGS-generated error report can be retrieved with the "NCBI_Report_Download" App. Use this report to correct the errors and resubmit. Corrections to the submission package can be made within the DE by updating the submission package organization or metadata and resubmitting, beginning with Step 4.
Remember to save a new metadata file from the top level of the submission package before resubmitting. It is best practice to name this file differently from the previous metadata file.
If no report.xml is retrieved after running this app, this does not necessarily mean your submission failed. The WGS system may not have generated it yet. Make sure to wait for notification from the WGS that the submission has been received and processed.
After successful processing, you should receive a confirmation email from the WGS.
If you encounter any issues during WGS submission, please send an email to email@example.com.