Fitness for use: A scientist looking to reuse data often finds that the metadata needed to determine whether the data are suitable for a new project is missing. Papers that cite data sources often document that missing metadata, providing evidence as to whether the data are reusable for a new purpose.
A graduate student wrote a script to analyze data for a publication. They want to cite the software in the publication, but do not know where to submit the software, how to document it, how to obtain a citable identifier, or how to cite the script.
Data and software publication licensing agreements and citation goals do not always align, and this creates problems for citation practice and linking. Take, for example, a Dryad case that is infrequent yet not uncommon.
An author (researcher) submits a copyrighted Perl script, released under a GNU license, which was used in the data analysis. The author prefers to retain copyright and not release the script under our CC0 terms. Dryad repository policy states, however, that all data, including software and code, must be published under CC0.
For this reason, the author places the script on a personal "bioinformatics tools" webpage and asks the Dryad curators to link to the script there. In this case, Dryad's curatorial staff are not in agreement on how to proceed with curation, and the decision can affect the automatic generation of a citation. One curator holds that we are obliged to link out to the script on the personal page, viewing this as the practical end of Dryad's mission to promote data discovery, reuse, etc.; the other holds that we should not link out, because in doing so Dryad would promote a bad practice that conflicts with the repository's policy and, in the absence of a Dryad DOI, can interfere with long-term preservation.
Dryad does not give detailed guidance to submitters. Under the first, practical preference, the curator would encourage submission of the script to GitHub for new works; in this scenario, however, the publication was already out, with a DOI. The goal of the practical approach is to move the submission through in compliance with our required terms, since grounds to deny it seemed too drastic. The second, more rule-bound preference was guided by strict adherence to Dryad policy.
A scientist can provide citation links between scholarly papers and the particular executions of a script or model from which they were derived. These links can take the form of published identifiers for the execution trace, or of links from the trace to the paper via, e.g., a DOI. The goal is to link a paper to a given script run in order to document the exact process used to derive published data, figures, or tables. In R or Matlab, after executing a script and recording provenance information about the run, a researcher can later update the execution's provenance information with a link to a permanent identifier (such as a DOI) of a published document. Other researchers can then discover the papers from the links attached to the script executions.
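A minimal sketch of this workflow, in Python for illustration: a provenance record is created at execution time and the paper's DOI is attached after publication. The function names (`record_run`, `attach_publication`) and the DOI are hypothetical, not a real provenance API.

```python
import datetime

# Hypothetical sketch: record provenance for one script execution, then
# later attach the DOI of the paper derived from that run.

def record_run(script, inputs, outputs):
    """Create a provenance record for one execution of a script."""
    return {
        "script": script,
        "inputs": inputs,
        "outputs": outputs,
        "executed_at": datetime.datetime.utcnow().isoformat() + "Z",
        "publications": [],  # filled in later, once a paper exists
    }

def attach_publication(run_record, doi):
    """Link a published paper's DOI back to the recorded execution."""
    run_record["publications"].append({"relation": "isSourceOf", "doi": doi})
    return run_record

run = record_run("analysis.R", inputs=["field_data.csv"], outputs=["figure1.png"])
attach_publication(run, "10.5555/example-doi")
```

Because the record is updated in place, the link can be added months after the run, once the publication DOI exists.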
Via data processing, analysis, modeling, and visualization processes, researchers create derived products, including derived data sets, figures, tables, animations, and other artifacts. By establishing citation links that capture provenance relationships among these derived and source products, we can preserve the dependency relationships needed to reproduce the science, and enable discovery of data and products through those relationships. For example, with appropriate relationships (prov:wasGeneratedBy, prov:used), one can determine whether one product was derived from another and, by following the graph of such linkages, discover other analyses and products derived from the same source data sets.
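The graph traversal described above can be sketched with plain data structures (not a real PROV library): prov:wasGeneratedBy and prov:used edges are stored as mappings, and a walk over them recovers every upstream product. All identifiers are invented for illustration.

```python
# Illustrative prov-style edges:
was_generated_by = {            # product -> activity that produced it
    "figure1.png": "analysis-run-1",
    "derived.csv": "cleaning-run-1",
}
used = {                        # activity -> products it consumed
    "analysis-run-1": ["derived.csv"],
    "cleaning-run-1": ["source_dataset.csv"],
}

def derived_from(product):
    """Return all upstream products a given product was derived from."""
    sources = set()
    activity = was_generated_by.get(product)
    for inp in used.get(activity, []):
        sources.add(inp)
        sources |= derived_from(inp)   # follow the chain transitively
    return sources

print(derived_from("figure1.png"))
```

Running the traversal in the other direction (from a source data set outward) is what enables discovery of other analyses derived from the same data.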
Reproducible science refers to the ability to follow citations, including data citations and software citations, in order to reproduce and evaluate a process. What is required for the results reported in a paper to be reproducible? That seems to be the topic of one of the keynote lectures.
|identify & version||round-2|
A user has a dataset that they want to cite in a paper submitted for publication, and they have identified a repository that assigns dataset DOIs. They want to include the dataset DOI in the publication and the publication DOI in the dataset, printed in hardcopy on both and linked in the metadata for both, all at once. This creates a chicken-and-egg scenario. Our current work-around at the repository is to assign a dataset DOI with a future release date, wait for the publication DOI, update the dataset metadata file with the publication DOI relationship, and only sometimes manage to include the publication DOI in the actual hardcopy of the dataset (which should not be altered after the original DOI assignment). What is the best practice for this?
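The metadata half of that work-around can be sketched with DataCite-style relatedIdentifier entries: the dataset DOI is reserved first, and the article DOI is linked in once the publisher assigns it. The DOIs and the helper function are hypothetical; a real deployment would call the repository's metadata API.

```python
# Dataset record with its DOI reserved ahead of the release date.
dataset_metadata = {
    "doi": "10.5555/dataset-xyz",
    "relatedIdentifiers": [],
}

def add_related(metadata, doi, relation):
    """Append a DataCite-style relatedIdentifier entry."""
    metadata["relatedIdentifiers"].append({
        "relatedIdentifier": doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation,
    })

# Once the publisher assigns the article DOI, link the two records.
add_related(dataset_metadata, "10.5555/article-abc", "IsCitedBy")
```

Only the metadata record changes after the fact; the deposited data files themselves stay untouched, which is what makes the work-around tolerable.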
|identify & version||round-2|
The Biological & Chemical Oceanography Data Management Office (BCO-DMO) manages data from NSF GEO Ocean Sciences (OCE) and Polar Programs (PLR) awards. The office submits datasets (and metadata) to our institutional repository, the WHOI Data Library and Archives (WHOAS DSpace system). The Data Library provides BCO-DMO with dataset DOIs (minted by CrossRef). In general, every data submission receives a new DOI, but I expect that in some circumstances it would be more useful to the community if a new version of the data were submitted and the DOI record amended. In cases where the original data are unmodified but new rows (e.g., the same measurements from additional observation events) or columns (e.g., additional measurements from the original observation events) have been appended, it might be more useful to add a new version of the data and amend the DOI metadata record. It would be useful to have guidelines describing how responsible curators decide whether to version the DOI or request a new DOI and relate it to any previous ones. The example included below is for a dataset from the Fukushima nuclear incident. We fully expect these data to be amended, and issuing a new version (as opposed to a new DOI with a separate metadata record) would allow researchers to clearly follow the provenance.
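The kind of guideline sought here might be expressed as a rule of thumb: append-only changes version the existing DOI record, while modifications to previously published values get a new, related DOI. The categories and function below are illustrative, not BCO-DMO policy.

```python
def doi_action(original_values_modified, rows_or_columns_appended):
    """Rule-of-thumb sketch for version-vs-new-DOI decisions."""
    if original_values_modified:
        # Previously published values changed: relate a fresh DOI back.
        return "new DOI, related to the previous one (e.g. IsNewVersionOf)"
    if rows_or_columns_appended:
        # Append-only growth: keep the DOI, amend the record.
        return "amend the existing DOI record with a new data version"
    return "no DOI change needed"

# Fukushima-style case: new observation events appended, originals untouched.
print(doi_action(original_values_modified=False, rows_or_columns_appended=True))
```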
Meta-analysis/synthesis. Maidment is proposing a National Flood Interoperability Experiment this coming summer using data services. That could provide an example where we have to figure out data citation.
HydroShare is a collaborative environment being developed for open sharing of hydrologic data and models (Tarboton et al., 2014a; 2014b). The goal is to enable scientists to load data and models into HydroShare, easily discover and access hydrologic data and models, retrieve them to their desktop, or perform analyses in a distributed computing environment that may include grid, cloud, or high performance computing model instances, and ultimately publish data and models as permanent digital objects supporting reproducible research.
Collaborative Data Analysis and Publication is one use case driving the development of HydroShare (Figure 1). This extends existing Consortium of Universities for the Advancement of Hydrologic Science Inc. (CUAHSI) Hydrologic Information System (HIS) (Tarboton et al., 2009) data sharing functionality into a dynamic collaborative environment leading to the eventual archival publication of data.
Figure 1. Collaborative data analysis and publication use case.
At (1) data are observed and then loaded (2). In the current CUAHSI Hydrologic Information System (HIS) data is loaded into an observations data model relational database on a server that publishes it using web services (Horsburgh et al., 2008; 2010). Metadata is harvested by the HIS Central catalog, and supports geographic and context based data discovery. A desktop client user (3) discovers, downloads and analyzes the data, or uses it in a model. Steps 1 to 3 are supported by the existing CUAHSI HIS. HydroShare picks up from here allowing the user to next post the results (data and model) to HydroShare as resources, retaining provenance information on the original data source (4). This will be done through sharing features being added to the CUAHSI desktop client, HydroDesktop (Ames et al., 2012). HydroShare will also support direct entry of new resources. Upon ingestion, background actions parse metadata and enable analysis based on rules and policies. The user shares posted resources with colleagues (5), designating who has permission to access the resources. A group collaborates on refining the analysis, model or result. HydroShare tracks provenance supporting reproducibility and transparency. After iteration, the result is finalized and submitted for publication (6). At this point the resources produced (data, model, workflow, paper) are made immutable, access is opened and permanent persistent identifiers (e.g., DOIs) are assigned. The data may be moved to a permanent repository under the auspices of the CUAHSI Water Data Center (7) or alternative digital library or archive.
A trusted software framework is proposed to enable reliable software to be discovered, accessed and then deployed on multiple hardware environments. More specifically, this framework will enable those who generate the software, and those who fund the development of software, to gain credit for the effort, IP, time and dollars spent, and facilitate quantification of the impact of individual codes. For scientific users, the framework delivers reviewed and benchmarked scientific software with mechanisms to reproduce results.
The trusted framework will have five separate, but connected components: Register, Review, Reference, Run, and Repeat.
- The Register component will facilitate discovery of relevant software from multiple open source code repositories.
- The Review component will target verification of the software, typically against a set of benchmark cases.
- Referencing will be accomplished by linking the Software Framework to services such as Figshare or ImpactStory.
- The Run component will draw on information supplied in the registration process, etc., to instantiate the scientific code on the selected environment.
- The Repeat component will tap into existing provenance workflow engines that automatically capture information relating to a particular run of the software.
|large & complex data||round-2|
This dataset consists of several different data streams, all recorded at the same location (Chilbolton Observatory, Hampshire, UK, http://www.stfc.ac.uk/chilbolton/default.aspx).
Raw and preprocessed observations of meteorological phenomena
Data are stored in NetCDF format with images for the two dimensional datasets provided in png format.
Some of the older Chilbolton data on the BADC are in different formats. Early drop counting raingauge data are in ASCII format. Radar data from the winter of 1998-99 are stored in a simple ASCII format with composite images comparing radar, ceilometer, rain gauge and ECMWF data each day in gif format.
There is a need to accurately and easily cite a subset of the CFARR data covering a set time period but spanning multiple instruments.
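A hypothetical sketch of such a subset citation: a citation string that pins both the time window and the instruments included. The DOI and field layout are invented for illustration.

```python
def cite_subset(dataset_doi, start, end, instruments):
    """Build a citation naming the instruments and period covered."""
    inst = ", ".join(sorted(instruments))
    return (f"Chilbolton Facility for Atmospheric and Radio Research "
            f"(CFARR) data, instruments: {inst}; period {start} to {end}. "
            f"doi:{dataset_doi}")

citation = cite_subset("10.5555/cfarr", "1998-12-01", "1999-02-28",
                       ["radar", "ceilometer", "rain gauge"])
print(citation)
```

Keeping the subset parameters in the citation itself, rather than only in out-of-band metadata, is what makes the cited slice recoverable later.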
|large & complex data||round-2|
Investigator runs experiments where the main raw data type is high resolution images and videos. Raw data is about 4 TB per experiment. Processed data is still 1 TB in order to provide a dataset that would allow reproducing the results. Current data repositories usually do not offer this much storage, so it is very hard to obtain a citable DOI for such a large dataset. Usually, dataset DOIs are not assigned unless the "trusted" allocating agent has possession of the data resource (so that the DOI will not point to a resource that moves or is changed).
|large & complex data||round-2||https://github.com/ResearchSoftwareInstitute/software-data-citation-ws/issues/21|
A PhD student has authored an R package that provides several functions for studying the size and shape of fossils. The package is an improvement over existing programs – it is faster, simpler, and free. Although she has already constructed a manual to accompany the package, she knows she cannot cite this manual as scholarly material in her field. She spends additional time preparing a traditional journal article describing the package and its use cases. However, the article is rejected from her discipline journals for being too technical and outside the scope of paleontology. Instead, she publishes it as a short piece for a software journal. Although she lists the paper on her resume, senior scholars in her field ignore the paper because it is not in a journal they recognize. The PhD student is further discouraged because although she has posted links to the manual, article, and package on her personal website, she finds few colleagues are switching to her improved method. The PhD student seeks advice from her dissertation advisor. Her advisor recommends that she should focus more on “real” material for her dissertation and drop her programming interest. Before she spends more time on her R-package, her PhD advisor wants to see proof that it is paying off for her academic career.
At https://odin.jrc.ec.europa.eu a scientific database application hosting data sets for tests performed on engineering alloys has been enabled for data citation using DataCite DOIs. The collection consists of about 20,000 discrete data sets, of which to date approximately 10% have been assigned DOIs. For traditional scientific publications, it is typically the case that hundreds of data sets will be reported. Obviously, citing the data sets individually in the references section of the written publication is not practical. However, for transparency and reproducibility the full range of data sets needs to be cited.
Attribution for derived products. When one data product is based upon others, in essence, the original products “contribute” to the derived citation. Thus, data-to-data attribution helps in measuring utility of sources for a data set.
At DOE’s Atmospheric Radiation Measurement (ARM) Program data center at Oak Ridge National Laboratory, a concerted effort is being made to ensure that authors who use ARM data properly attribute and cite the data sets they are using. A historical review of publications was conducted to ensure that those using ARM data had a citation to those data. ARM worked with Thomson Reuters to make sure that when data were cited, there could be an active link between the data set and the publication. Going forward with new publications, ARM always asks people who request its data to cite the data when they are used in articles. Further, ARM tells the requestor how to cite the data, to make citation easier and to ensure its quality. The DOI (which is assigned to all ARM datasets) is the critical linkage between the data set and the scientific publication.
Many datasets are the product of the aggregation of data from multiple sources and often multiple parties. For example, the Open PHACTS platform (http://dev.openphacts.org) provides integrated access to over 10 different databases. These databases, for example ChEMBL and UniProt, are amalgamations of data extracted both automatically and via human curation from other sources such as the literature. These databases in turn rely on data models and ontologies developed by still others, for example the GO or ChEBI ontologies. Additionally, integrators may slightly change datasets through format changes or the addition of links.
- Issue 1: The provenance or credit chain of a single answer given by a data integration platform can be much bigger than the answer itself. How do we correctly ensure credit is given to all actors in the system? Furthermore, how do we ensure that these chains can be effectively traced back?
An additional aspect of such integrations is that they are all in constant flux. UniProt is released every four weeks, ChEMBL quarterly, and some sources, such as SureChEMBL, are updated hourly. How does a data integrator appropriately capture and expose this information? Currently, this is often done by providing versioned data dumps. However, this may not be allowed or supported in many cases due to licensing, technical, or policy issues.
- Issue 2: How do we cite developing data sets originating from multiple sources?
How would middleware (e.g., GI-Cat/GI-Axe) that mediates between data services provided by a repository (including data sets available via FTP, OPeNDAP, OGC W*S, etc., or advertised through OAI-PMH, THREDDS, web-enabled folders, etc.) provide citation information to users of its services? What would the citation(s) cover? Whom would they credit? In what roles?
Assuming that the RDA-devised strategy for dealing with subset specification (i.e., dynamic data) were implemented by repositories, how would a broker or other such middleware service generate a citation to pass along to whatever client the user is using (e.g., ArcGIS, IDL, R, Matlab, a project-specific GUI, etc.) to support this form of reproducibility?
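Following the RDA dynamic-data approach, a broker could cite a subset by recording the source PID, the subset query, a query timestamp, and a hash of the result set, so any client can later verify it received the same subset. The PID, query syntax, and record layout below are illustrative assumptions.

```python
import hashlib

def broker_citation(source_pid, query, timestamp, result_bytes):
    """Assemble a citation record for a dynamically generated subset."""
    digest = hashlib.sha256(result_bytes).hexdigest()[:12]
    return {
        "sourcePID": source_pid,
        "query": query,
        "queriedAt": timestamp,
        "resultHash": digest,  # lets a client verify the subset later
    }

c = broker_citation("hdl:4263537/4000",
                    "time >= 2014-01-01 AND var = precipitation",
                    "2014-10-01T12:00:00Z",
                    b"...subset bytes...")
```

Because the citation is a small structured record, the broker can hand it to any downstream client unchanged, regardless of which repository actually served the bytes.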
Metrics of software and data utility: citing the data used in a paper provides an aggregate measure of how much the data influence scientific progress.