Thursday, November 12, 2015

Open Science Framework (OSF): A useful free tool for data and workflow management for scientific reproducibility

On October 13, 2015, DataOne hosted a webinar led by Courtney Soderberg from the Center for Open Science.

The webinar had two goals: (1) To outline the issues with existing scientific workflows that can lead to bias and results that are not reproducible, and (2) To introduce the Open Science Framework (OSF) as a tool to overcome these biases and increase the reproducibility of science.

Regarding issues of reproducibility, most scientists are probably aware of the narrow issue of computational reproducibility, i.e., the ability to take the data collected by a team of researchers, perform the same analyses, and reach the same conclusions.  Ms. Soderberg described this issue in her talk, but she also described more subtle biases and issues with reproducibility.  One issue is publication bias: analyses often change through the course of a project, and only the final (successful) analyses and results are documented, while negative results or dead-end analyses are never captured.  Related to publication bias is Hypothesizing After Results are Known, or HARKing.  To present a succinct story, publications often present hypotheses as a priori, whereas hypotheses may in fact have been generated after researchers spent significant time poring over the data.  In what is known as researcher degrees of freedom, data processing and analytical decisions are often made after seeing and interacting with data, severely increasing the potential for false positive outcomes, often outside of the awareness of a researcher.  (For further discussion of reproducibility problems, I suggest the enlightening recent special issue of Nature on this topic.)
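
To make the researcher-degrees-of-freedom point concrete, here is a minimal sketch of how quickly false positives accumulate when several analyses are tried on the same data. It rests on simplifying assumptions that are not from the webinar: every analysis is an independent test, there is no true effect, and the significance threshold is 0.05. The function name is hypothetical.

```python
def familywise_false_positive_rate(num_analyses: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_analyses` tests comes out "significant"
    when no real effect exists.

    Assumes the tests are independent, which real analyses of one data set
    usually are not, so treat the result as a rough illustration only.
    """
    if num_analyses < 0:
        raise ValueError("num_analyses must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # P(at least one false positive) = 1 - P(no false positives in any test)
    return 1.0 - (1.0 - alpha) ** num_analyses


if __name__ == "__main__":
    # Compare a single pre-specified analysis with several exploratory ones.
    for k in (1, 5, 10, 20):
        rate = familywise_false_positive_rate(k)
        print(f"{k:2d} analyses -> {rate:.1%} chance of at least one false positive")
```

Under these assumptions, a single pre-specified test keeps the false positive rate at the nominal 5%, while trying ten alternative analyses pushes it to roughly 40%.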

In response to these various potential sources of bias, the OSF, a free web-based resource for data and workflow management, builds in mechanisms to reduce (or at least document) them.  The OSF is meant to be used through the whole research life cycle, from project conception to final paper and data publication.  All actions taken, wiki entries written, and files uploaded on the OSF are timestamped and version controlled.  For example, it is possible to document a timestamped hypothesis prior to data collection and analysis to avoid HARKing.  More details on the OSF can be found in Ms. Soderberg's excellent presentation; below, I provide a few highlights:

  • OSF pages can be public or private, and there is granular control over access to individual pages and sections for collaborators or the general public.  Public projects are fully searchable.
  • Built-in tools smooth the collaboration process.  One can create templates for common file types, and projects can be "forked" to create copies of files/folders with original content intact.
  • Third-party software such as GitHub, Google Drive, and FigShare can be seamlessly integrated through add-ons.  This is especially useful for large files that exceed the current 128 MB limit for individual files stored with OSF (no total storage limit across all files).  The one catch is that, while all file versions uploaded directly to OSF are stored permanently, linked third-party content remains stored with third parties subject to their version control/storage policies.  Nonetheless, OSF does keep track of all version changes (even if it does not keep the original files).
  • Permanent identifiers (GUIDs) are assigned to projects created on OSF.  Other identifiers and profiles (e.g., DOIs, ORCID, LinkedIn) can be linked to projects and/or researchers.
  • Versions of a project can be "registered" at a fixed point in time, such as when submitting an article for publication.  Registered versions become read-only and fully include all linked (third-party) content, so a registered project can provide a stable data/workflow accompaniment to a published journal article.  Registered versions can remain private for an embargo period of up to four years.  Once public, registered projects can be assigned a DOI.
  • Data sustainability is extremely important to OSF.  In case the Center for Open Science disappears, a "sustainability fund" has been established to maintain existing data in a read-only format indefinitely.
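
As a practical aid for the file size limit mentioned above, the following is a minimal sketch for sorting a project's files into those that fit under the 128 MB per-file limit and those that would need a third-party add-on. It does not use any OSF API. The function names are hypothetical, and reading "128 MB" as 128 * 1024 * 1024 bytes is an assumption, since the limit is only quoted in MB.

```python
from pathlib import Path

# Assumption: "128 MB" is interpreted as 128 * 1024 * 1024 bytes.
OSF_FILE_LIMIT_BYTES = 128 * 1024 * 1024


def partition_by_limit(sizes, limit=OSF_FILE_LIMIT_BYTES):
    """Split a {name: size_in_bytes} mapping into two sorted lists of names:
    files at or under the limit, and files over it."""
    fits = sorted(name for name, size in sizes.items() if size <= limit)
    too_large = sorted(name for name, size in sizes.items() if size > limit)
    return fits, too_large


def scan_directory(root):
    """Map every regular file under `root` (recursively) to its size in bytes."""
    root_path = Path(root)
    return {
        str(path.relative_to(root_path)): path.stat().st_size
        for path in root_path.rglob("*")
        if path.is_file()
    }


if __name__ == "__main__":
    # Check the current directory; replace "." with a project folder.
    fits, too_large = partition_by_limit(scan_directory("."))
    print(f"{len(fits)} file(s) fit under the per-file limit")
    for name in too_large:
        print(f"over the limit, consider a third-party add-on: {name}")
```

Keeping the size check separate from the directory scan makes it easy to change the limit if the storage policy changes.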

I strongly encourage all scientists to investigate OSF as an option for workflow and data management.  The advantage of OSF is that it provides a flexible, robust architecture for many data management challenges.  The disadvantage is that it may not fulfill discipline-specific needs of sediment experimentalists.  As we continue to develop the SEN Knowledge Base, we will closely follow developments of OSF and other data management platforms.

Raleigh L. Martin
UCLA Dept. of Atmospheric and Oceanic Sciences


Click here to view the webinar on the DataOne website.
