DATA COMMONS USER MANUAL

The CyVerse Data Commons manages all public data in the Data Store that is stored under the /iplant/home/shared directory and supports data publication to external repositories. “Public data” on CyVerse is defined as any data that is visible to the public via datacommons.cyverse.org, whether or not the viewer has a CyVerse user account. Public data on CyVerse is also available to registered users via all methods described under Downloading and Uploading Data.
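As a rough illustration of the path convention above, the following sketch tests whether a Data Store path falls inside the /iplant/home/shared tree. The helper name is_under_shared_root is hypothetical and not part of any CyVerse tool. The check looks only at where the path sits; it says nothing about permissions or about whether the data actually appears on datacommons.cyverse.org.

```python
from pathlib import PurePosixPath

# Root of the shared tree named in the text above.
SHARED_ROOT = PurePosixPath("/iplant/home/shared")


def is_under_shared_root(path: str) -> bool:
    """Return True if `path` lies inside /iplant/home/shared.

    Hypothetical helper for illustration only: it inspects the path string
    and does not contact the Data Store or check permissions.
    """
    candidate = PurePosixPath(path)
    # PurePosixPath does not resolve ".." components, so reject them outright,
    # along with relative paths.
    if not candidate.is_absolute() or ".." in candidate.parts:
        return False
    return candidate == SHARED_ROOT or SHARED_ROOT in candidate.parents
```

For example, is_under_shared_root("/iplant/home/shared/myproject/reads.fastq.gz") is true under this sketch, while a path in a personal home folder is not.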

Publishing in the Data Commons Repository

CyVerse provides a landing page for each public dataset. Each landing page is populated with the metadata provided by the user.

DOIs are assigned at the request of the project lead. A DOI is a type of global identifier that allows a digital object to be persistently referenced on the Internet even if the item is moved to another online repository. DOIs use the DataCite metadata schema for purposes of citation. However, for data to be reused, more descriptive information is required, so we encourage users to document their datasets further. Please see ……
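To make the citation metadata requirement more concrete, here is a minimal sketch that reports which citation fields are missing from a metadata record before a DOI is requested. The list of mandatory properties reflects our understanding of DataCite Metadata Schema 4.x and should be confirmed against the schema version in use. The dictionary key names and the function missing_datacite_fields are assumptions made for illustration; they are not the format used by the CyVerse metadata templates.

```python
from typing import Dict, List

# Mandatory properties as understood from DataCite Metadata Schema 4.x.
# The key spellings below are an assumption for this sketch.
DATACITE_MANDATORY = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_datacite_fields(record: Dict[str, object]) -> List[str]:
    """Return the mandatory fields that are absent or empty in `record`."""
    return [field for field in DATACITE_MANDATORY if not record.get(field)]
```

An empty result means only that the citation fields are present. It does not cover the additional descriptive metadata that makes a dataset reusable.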

What about data that has already been published elsewhere?

If an upload involves data that has been published elsewhere and/or has an existing DOI, project leads can reference those datasets using the External URL box. The existing DOI and/or a link can be added to the dataset information.

After publication, data creators can request that their data be withdrawn from the repository. To do so, a user must contact the repository curator and provide a justification. A record stating that the dataset was available, including an abstract and an explanation of why the data was removed, will remain in place. Note that the dataset's DOI will remain active, so that people who follow it from a citation can verify that the data is no longer available.

Publishing to external repositories

SRA pipeline: Data Commons enables CyVerse users to make submissions directly to the NCBI Sequence Read Archive. Submissions include compressed sequence files (FASTQ.gz, SFF.gz, and BAM.gz) and an XML metadata file, organized into a submission package.
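The package contents described above can be sanity-checked locally before submission. The sketch below assumes, purely for illustration, that the package is a single flat folder holding the compressed sequence files and at least one XML metadata file. The real layout required by the SRA pipeline may differ, and check_package is a hypothetical helper, not part of CyVerse or NCBI tooling.

```python
from pathlib import Path
from typing import List

# Compressed sequence formats listed in the text above.
ACCEPTED_SUFFIXES = (".fastq.gz", ".sff.gz", ".bam.gz")


def check_package(folder: Path) -> List[str]:
    """Return a list of problems found in `folder`; an empty list means
    the folder passes the checks in this sketch (assumed flat layout)."""
    problems: List[str] = []
    files = sorted(p for p in folder.iterdir() if p.is_file())
    xml_files = [p for p in files if p.suffix.lower() == ".xml"]
    seq_files = [p for p in files if p.name.lower().endswith(ACCEPTED_SUFFIXES)]

    if not xml_files:
        problems.append("no XML metadata file found")
    if not seq_files:
        problems.append("no compressed sequence files found")
    # Flag anything that is neither metadata nor an accepted sequence file.
    for p in files:
        if p not in xml_files and p not in seq_files:
            problems.append(f"unexpected file: {p.name}")
    return problems
```

The check looks only at file names. It does not validate the XML against any schema or verify that the compressed files are readable.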

WGS pipeline and TSA: Coming soon

The Data Commons publishes data to our own repository at datacommons.cyverse.org as well as to external repositories. All data published to CyVerse Curated Data receive a permanent identifier (PID) in the form of a DOI (Digital Object Identifier) or ARK (Archival Resource Key) and are expected to be stable and permanent. Data published to the Community Released folder do not have PIDs and may be changed or removed at any time. All data published to the Data Commons are expected to have at least minimal metadata. The sections below provide more information on each type of data publication available through CyVerse. For more details on the range of data sharing options in CyVerse, see the CyVerse Data Policy and Data Commons User Agreement.

Publishing CyVerse Curated Data

Data publication to CyVerse Curated Data is a service offered for datasets that are intended to be stable and permanent. For CyVerse Curated Data, the Data Commons provides landing pages and permanent DOIs or ARKs, and requires an open data license. Permanent identifiers give data a stable location on the web so that other users can always find them, along with the information that makes them understandable, citable, and reusable. An open data license is important to allow others to reuse your data, but it does not exempt users from the obligation to cite your data correctly.

For more information about whether CyVerse Curated Data is right for your dataset, the difference between DOIs and ARKs, and other questions, see the Permanent Identifier FAQs page and the Data Commons Policy.

When you are ready to publish, see the quickstart on how to request a DOI.

Publishing Community Released Data

Community Released Data folders are available for evolving datasets that individuals or communities want to make available as quickly as possible for research and reuse. Community Released Data are intended for datasets that are growing or changing frequently or that may not need long-term preservation. Data can transition from Community Released Data to CyVerse Curated Data by requesting a DOI or ARK.

To prepare your community data for public release, see Preparing Community Released Data Folders.

Publishing to external repositories

Currently, CyVerse users can publish data directly to the NCBI Sequence Read Archive (SRA) and the NCBI Whole Genome Shotgun (WGS) archive. To suggest additional repositories to publish to, contact us.

To submit your files directly from CyVerse to the SRA, see the NCBI Sequence Read Archive (SRA) Submission (Workflow Tutorial).

To submit your files directly from CyVerse to the WGS archive, see the NCBI Whole Genome Shotgun (WGS) Submission Tutorial.