
DATA COMMONS USER MANUAL
 

 

 

 


Draft - Pending final approval

 The CyVerse Data Commons User Agreement

...

This agreement is between CyVerse and the users of the CyVerse Data Commons. Using services or data available at or submitting data to the Data Commons (DC) requires agreeing to the following policies. This document covers only policies specific to the DC. The CyVerse Data Policy covers policies relevant to any data hosted by CyVerse, including data in the DC. Acceptance of this document implies acceptance of the CyVerse Data Policy. Please see other CyVerse Policies for general usage of CyVerse cyberinfrastructure.

About the Data Commons

The Data Commons (DC) provides services within the CyVerse cyberinfrastructure to organize, preserve, and publish data derived from scientific research. We strive to aid researchers in creating, managing, publishing, reusing, and discovering research data by:

  1. Facilitating metadata entry and acquisition

  2. Supporting the translation of metadata across existing metadata standards such as DataCite, Dublin Core, or MIxS

  3. Publishing data through the Data Commons Repository or to external repositories

  4. Providing access to public data that is in the CyVerse Data Store

  5. Providing persistent access to datasets through globally unique, permanent identifiers (DOIs and ARKs)

  6. Connecting data to analyses conducted on CyVerse platforms to support reproducible science

  7. Raising data visibility and discoverability

  8. Preserving datasets in secure and reliable large-scale storage systems

DC development builds on foundational CyVerse infrastructure such as our Data Store, APIs, and user interfaces, while expanding into new areas such as metadata and ontologies, a data repository, and federation with external collaborators and repositories. Key components of the Data Commons are the web data portal at http://datacommons.cyverse.org/ (also http://dc.cyverse.org) and functions within the Discovery Environment such as metadata templates, permanent identifier requests, data submissions to NCBI, and a Projects Interface (under development).

Data Commons Mission and Vision 

Vision Statement

To provide infrastructure for open data where researchers can organize, preserve, and publish data derived from scientific research and where data can live as a searchable, discoverable, and reusable resource.

Mission Statement

To aid researchers in creating, managing, publishing, reusing, and discovering data.

Data Commons Functionalities

Data Organization and Curation

Data curation is the set of processes involved in generating and maintaining a sustainable, complete, and accurate dataset across time. In the DC, users are the primary, specialized curators of their own data, because they know their data and how it was produced. DC users are responsible for organizing and describing their data in a way that represents their research. To facilitate these activities, the DC provides functions through the Discovery Environment where users can organize and append standardized metadata to the data that they will publish, including metadata templates and bulk metadata upload. In addition, data curators on the DC team are available for consultation about how to organize data, what metadata standards are recommended for your data, and how to assign identifiers.

 

A data curator verifies all datasets that are submitted for publication and will contact users if they identify incomplete metadata or any issue that can be improved to present data to the public in a way that is clear for reuse. The curator does not verify the contents or quality of the data files; this is the responsibility of the researcher creating the dataset. Data in the DC is not peer reviewed, but it may be reviewed outside CyVerse as part of a journal article.

Data Hosting and Publication

The DC hosts public data (data accessible without a user account) that is stored in Public Data Folders under the directory /iplant/home/shared and allows CyVerse users to publish data through the Data Commons Repository (DCR). The DC also supports publication to selected External Repositories. See the CyVerse Data Policy for a detailed description of the different types of data stored at CyVerse. The key difference between Public Data Folders and the DCR is that Public Data Folders are controlled by a community member and subject to change, whereas folders in the DCR are unchanging and can only be updated by DC curators.

 

The following points apply to all data made available through the DC, either in Public Data Folders or the DCR:

  • By submitting data to the DC, you give permission to make the data publicly available.

  • Public data in the Data Commons are visible to anyone via the Data Commons web interface and all methods described in Downloading Data without a User Account.

  • CyVerse provides access to research data so that they can be used with CyVerse tools and services. The current focus of the DC is on life sciences data. Other data types will be considered on a case-by-case basis.

  • Public data that are appropriate for hosting at a long-term public data repository (e.g., NCBI, DRYAD, or TreeBASE) should be deposited at that repository. If there is a valid reason to duplicate data from a public repository on the CyVerse Data Store (e.g., the dataset is enhanced with additional data or features or is actively being used for analysis with the CyVerse cyberinfrastructure), the depositor should specify this in their request for a Public Data Folder or when requesting a DOI or ARK in the DCR.

  • CyVerse requires users who wish to share their data via the Data Commons to indicate the license and terms of use. CyVerse strongly encourages the adoption of licenses that support the U.S. Open Data Policy and the FAIR Data Principles, such as ODC PDDL for non-copyrightable materials (i.e., data only) and CC0 for copyrightable material (Workflows, White Papers, Project Documents). Data in the DCR is required to have an ODC PDDL or CC0 license, unless prior arrangements are made (e.g., for aggregated datasets that include existing data with another license).

  • Any data, including derived data, shared with the public through CyVerse must comply with any copyright or reuse restrictions placed on the original source data.

  • If there is a dispute about data that is uploaded or published to the DC, we will not display the data publicly until the dispute is resolved.

  • When necessary or preferable for technical reasons, CyVerse may mirror or replicate existing reference databases. For data provided to CyVerse by a reference database, CyVerse will comply with the policies asserted by the specific data source.

  • Data published to any external repository via CyVerse services is subject to the terms and conditions of that repository.

  • There is no limit to dataset size in the DC. When you request a Public Data Folder, you must specify the expected size of the data, and an allocation increase will be considered at the same time, if needed. If you plan to publish a dataset to the DCR that is larger than the default allocation of 100 GB, you must first request an allocation increase, so that you can upload and organize your data in your private directory. As part of your allocation increase request, you should indicate that you plan to publish the data to the DCR. For methods of uploading data into the CyVerse Data Store, see Downloading and Uploading Data.

Public Data Folders

Public Data Folders are available for evolving datasets that individuals or communities want to make available as quickly as possible for research and reuse. Public Data Folders are intended for datasets that are growing or changing frequently or that may not need long-term preservation. Data can transition from a Public Data Folder to publication in the DCR.

 

In addition to the policies above for all data in the DC, the following apply specifically to data in Public Data Folders:

  • Public Data Folders in CyVerse are required to meet minimal metadata standards, as described in Preparing Public Data Folders. The owner of the folder maintains control over data organization.

  • Public data folders are owned by the user who requests them and count toward that user’s allocation. However, it is understood that users owning Public Data folders will have larger total data allocations, and their personal allocation will not be penalized for hosting a public folder.

  • To request a Public Data folder, use the Community and Public Folder Request Form. You can simultaneously request an additional allocation with this form.

  • Data in a Public Data folder can be published to a repository, but the published data should move out of the Public data folder, unless a formal request is made to continue to house them there, for example, for use in CyVerse analysis tools.

  • You may keep part of the data in a Public Data folder private (for example, data being prepared for release or supporting content such as data management documents), but Public Data folders are intended to hold primarily public data.

Data Commons Repository (DCR)

Data publication to the DCR is a service offered for datasets that are intended to be stable and permanent. For published data, the DCR provides landing pages and permanent Digital Object Identifiers (DOIs) or Archival Resource Keys (ARKs), and it requires an open data license. Permanent identifiers give data a stable location on the web so that other users can always find it, along with the information that makes it understandable, citable, and reusable. An open data license is important to allow others to reuse your data, but it does not exempt users from the obligation to correctly cite your data.

 

In addition to the policies above for all data in the DC, the following apply specifically to data in the DCR:

  • Before requesting a permanent identifier, you must determine if your data is ready to publish in the DCR.

  • Before you can publish your data to the DCR, it must meet minimum requirements for data organization and metadata, as specified in the DCR Data Organization Guidelines.

  • Datasets in the DCR are required to meet minimal metadata standards, as described in the DCR Data Organization Guidelines. This includes using the DOI Request - DataCite Metadata template available through the Discovery Environment.

  • Additional scientific metadata for both the home data folder and elements contained in the folder are highly recommended.

  • All metadata will be displayed on Data Commons landing pages. The landing page is the best advertisement for a research project, and it is the user’s responsibility to provide complete and accurate data and metadata about the project for display on the dataset landing page.

  • All datasets in the DCR must include a ReadMe file and inventory, as described in the DCR Data Organization Guidelines.

  • You should include specific instructions on how to cite your dataset as part of the DOI Request - DataCite Metadata template and in the ReadMe file.

  • The DCR generally does not publish data if a canonical repository for the data type already exists (e.g., NCBI, TreeBASE).

  • For data deposited in the DCR, data depositors maintain intellectual ownership and authority over the data, but no longer have the ability to edit the data or metadata. To make changes to the data published in the DCR, contact data_curator@cyverse.org.

External Repositories 

The DC provides documented, easy-to-use workflows for users who want to publish data through canonical repositories such as NCBI.

  • For more information, see Publishing through the Data Commons.

  • Data published to any external repository via CyVerse services is subject to the terms and conditions of that repository.

Reusing Data

The DC fully supports reuse of the data it hosts. If you download or reuse any data in the DC, you must:

  • follow the conditions that are stated in the data license for the dataset(s) you use.

  • follow any conditions for data reuse stated in the associated metadata and ReadMe files.

  • cite the dataset using its DOI and the citation information available in the dataset landing page, if you use data stored in DC for work that produces a publication.

 

New data derived from original DC data may be distributed only under terms and conditions established by the creators of the data and stated in the license.

Long-term Preservation and Access to Data in the DCR

Data in the DCR are stored in a high-performance storage resource that has built-in redundancy and is continuously monitored for security and failure. They are synchronously backed up at both the University of Arizona in Tucson and the Texas Advanced Computing Center in Austin. At ingest into the DCR, data are manually checked for organization, format (to ensure that they are readable by non-proprietary software), completeness of metadata, and inclusion of a ReadMe file. An MD5 checksum is generated and displayed as part of the file’s metadata so that users can check its authenticity.
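
As an illustration of how a user might check a downloaded file against the checksum shown in its metadata, here is a minimal Python sketch. It uses only the standard hashlib module. The function names and the example file name are hypothetical and are not part of any CyVerse tool; the published checksum is assumed to be the hexadecimal string copied from the file's metadata.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks
    so that large datasets do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, published_md5):
    """Compare a local file's MD5 with the value copied from the
    file's metadata (assumed to be a hex string)."""
    return md5_of_file(path) == published_md5.strip().lower()
```

For example, matches_published_checksum("downloaded_file.csv", "<checksum from metadata>") returns True only if the local copy has the same MD5 digest as the published value. A mismatch usually means the download was incomplete or the file was altered after publication.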

 

Data and metadata in the DCR are visible to anyone via the Data Commons web interface and via all methods described in Downloading Data with a User Account. Through a contract with EZID, CyVerse is committed to the long-term preservation of data in the DCR. If DCR services are discontinued for some reason, we will make arrangements to transfer the published data free of charge to another long-term repository that will sustain access to the data and metadata, and the DOIs will be redirected to the new location. All CyVerse users will be notified of the new location of DCR data before the move is completed.

 

Data and metadata in Public Data Folders in the DC (not the DCR) are not guaranteed long-term preservation. Public Data Folders that are in active use (have been accessed in the past year) are available via browsing at datacommons.cyverse.org, can be accessed without a CyVerse user account using any of the methods described in Downloading Data without a User Account, and are searchable by anyone with a CyVerse user account through the Discovery Environment. Data in Public Data Folders that are inactive (have not been accessed in over one year) may be moved to a long-term storage archive, where the data will be available upon request. The owner of the Public Data Folder will be notified before data is moved to a storage archive.

Disclaimers

THE SERVICES AND DATA OF THE CYVERSE DATA COMMONS ARE PROVIDED “AS IS”. NO WARRANTIES OR REPRESENTATIONS ARE MADE RELATING TO THE DC OR ANY DOCUMENTATION. NO WARRANTY IS PROVIDED THAT THE DATA COMMONS PORTAL OR ANY DATA WILL SATISFY ANY REQUIREMENTS, THAT THE DC OR ANY OF THE DATA THEREIN IS WITHOUT DEFECT OR ERROR, OR THAT OPERATION OF CYVERSE WILL BE UNINTERRUPTED. ALL TERMS AND CONDITIONS OF THE CYVERSE SERVICE LEVEL AGREEMENT, CYVERSE DATA POLICY, AND CYVERSE ACCEPTABLE USE POLICY APPLY TO THE DATA COMMONS, INCLUDING THE FOLLOWING POINTS:

  • Prohibited content: Users shall not post, transmit, or store data or content on or through CyVerse servers that, in CyVerse's sole determination, constitutes a violation of any federal, state, local, or international law, regulation, ordinance, court order, or other legal process, as detailed in the Acceptable Use Policy.

  • Abuse and unacceptable uses: Users are prohibited from engaging in any activities that CyVerse determines, in its sole and absolute discretion, to constitute abuse or unacceptable uses, as detailed in the Acceptable Use Policy.

  • Intellectual property infringement: Users may not transmit, distribute, download, copy, cache, host, or store on a CyVerse VM or server any information, data, material, or work that infringes the intellectual property rights of others or violates any trade secret right of any other person/User. See the Acceptable Use Policy.

  • Copyright: Any data, including derived data, shared with the public through CyVerse must comply with any copyright or reuse restrictions placed on the original source data (Data Policy).

  • Disputes: If there is a dispute about data that is uploaded or published to the DC, we will not display the data publicly until the dispute is resolved (Data Policy).

Agreement and Policy Subject to Change

 

The functionalities, business model, and characteristics of the DC are continually improving; thus, the details of this agreement and policy are subject to revision every 3 months.

This agreement is now available on the CyVerse website at http://www.cyverse.org/data-commons-user-agreement.