By Shreejoy Tripathy
The concept of data sharing is central to the annual Congress of Neuroinformatics, organized by the INCF (International Neuroinformatics Coordinating Facility).
We as a community recognize that many advances could be made in neuroscience if authors went beyond simply describing their data within a publication to sharing and uploading the entire dataset used to draw scientific conclusions.
This would make it easier to reproduce original findings and would make it possible to reuse data for other purposes, such as large meta-analyses (see Akil et al. for a nice introduction). With the recent announcements of the EU Human Brain Project and the US BRAIN Initiative, the importance of data sharing for neuroscience has reached a fever pitch (see Kandel et al.).
In full disclosure, I am hardly unbiased when it comes to data sharing. I co-developed neuroelectro.org, an online resource that organizes information on neuronal electrophysiological properties built from text-mining the existing literature. In building NeuroElectro, I quickly realized that the data we really wanted (the raw electrophysiological voltage and current traces) were usually not published within a paper but instead remained on the authors' original backup tapes and hard drives. Thus I was quite excited to hear about PLOS' updated data policy this March, which requires that all data underlying a publication be made fully available without restriction and with rare exception.
This change now requires answers to the following questions: where is the data stored? How can someone gain access to it? How can this data be discovered?
In short, PLOS wishes to change the fundamental practice for data sharing from “data will be made available upon request” to “data is made immediately available upon publication at the following location”.
During the neuroinformatics meeting, Jennifer Lin of the PLOS Data team, together with PLOS ONE Editorial Director Damian Pattinson and Tom Nichols, a PLOS ONE Academic Editor and neuroimaging statistician from the University of Warwick, hosted a working lunch to discuss the new data sharing policy. This was in part to address specific concerns raised about the policy following its announcement. After briefly outlining the updated policy, we had a lively discussion on specific aspects.
The fundamental issues revolved around what constitutes data sharing.
- Is it necessary to share the entire “raw” data files? Or is some reduced form sufficient?
- If you collect a large dataset but only analyze and publish a single aspect of it, are you obligated to share the dataset in its entirety? For example, in cellular neurophysiology, are authors now obligated to share each voltage trace sampled at 10 kHz from every recorded cell? Or simply the appropriate summary measurements and calculations in an Excel spreadsheet?
- In addition, do authors also need to annotate such data files with necessary metadata to make them meaningful and discoverable?
After acknowledging that the policy as originally announced did not explicitly answer these questions, PLOS' Jennifer Lin clarified that the new policy does not require that such raw, large (> ~1 GB) datasets be uploaded and deposited. Instead, she explained, the intention is to require that tables of summary measurements, such as those represented within bar graphs and scatterplots, that directly underlie the published findings be made available.
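To make the raw-versus-summary distinction concrete, here is a purely hypothetical sketch (not drawn from any actual PLOS submission): a single one-second voltage trace sampled at 10 kHz is already ~10,000 floating-point values per cell, while the summary measurements that feed a bar graph reduce to a handful of numbers. The synthetic trace and field names below are illustrative assumptions.

```python
import numpy as np

# Hypothetical example: one second of a synthetic resting-potential
# recording at 10 kHz is ~10,000 samples per cell...
rng = np.random.default_rng(0)
fs = 10_000                                  # sampling rate (Hz)
trace = -65 + rng.normal(0, 0.5, fs)         # synthetic trace (mV)

# ...while the summary table underlying a published figure might hold
# just a few derived values per cell.
summary = {
    "resting_potential_mV": float(trace.mean()),
    "noise_sd_mV": float(trace.std()),
}
print(summary)
```

Under the clarified policy, it is a table like `summary` (for every cell in the study) that must be deposited, not the full set of `trace` arrays.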
Diving into the Details
While the neuroinformaticists present at the PLOS Workshop whole-heartedly agreed that it would be ideal if all collected data were deposited in their most raw form, there was general agreement that this was currently not feasible or practical for some domains. For example, in cases of data from human subjects, exceptions would be made if consent had not been originally obtained for sharing the collected data. In general, the PLOS representatives indicated that reasonable exceptions to the data sharing policy would be granted on a case-by-case basis. However, the usual comment of “we do not wish to share our data yet because we wish to continue publishing with it” was offered explicitly as an exception that would not be granted under the new policy.
We also discussed other issues related to neuroscience data sharing, for example:
- WHO PAYS? Maryann Martone, of UC San Diego, advocated that the funding agencies that paid for the original research (like the US NIH) should also fund costs associated with annotating, depositing, and long-term data hosting.
- WHAT DATA FORMATS? For most neuroscience sub-domains, the data format and metadata standards are still in active development (this is a major topic of research in neuroinformatics). For example, while Tom Nichols indicated that neuroimaging has a widely-embraced file format in NIFTI, the metadata standards required to describe complex imaging experimental designs and contrasts are still under development, e.g. NIDM. This is similarly true for neurophysiology data, for which Fritz Sommer and Jeff Teeters of UC Berkeley indicated that standards are actively being developed by the neurodata-without-borders project.
- WHICH DATA SETS? Given that sufficient data standards don't yet exist in most domains of neuroscience (gene microarray and sequencing data are notable exceptions), several of us suggested that manuscript reviewers would be best positioned to specify which data authors should share. Our reasoning was that reviewers would know what data is most central to the manuscript and what constitutes a reasonable mechanism for sharing such data. For example, for a study in cellular neuroscience, a reviewer could suggest that the spreadsheet summarizing synaptic bouton counts in control and manipulated cell cultures be deposited, in lieu of uploading the corresponding raw microscope images.
Ultimately, my perspective is that we should be mindful that data sharing is currently a moving target in neuroscience. In the meantime, given that PLOS' policy is in some sense an experiment, it will be helpful to track where the policy succeeds: in 2-3 years, which datasets are most cited and reused? And what aspects of how these datasets are organized, such as their annotation quality and comprehensiveness, make them most useful?
Thus, just as the "h-index" tracks citations to an author's papers, Asla Pitkänen of the University of Eastern Finland suggested developing an "s-index", or sharing index, to track how often an author's datasets are shared and reused.
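Pitkänen's s-index was offered as a proposal, not a defined formula. As a purely hypothetical illustration, one could mirror the h-index computation, substituting dataset reuse counts for paper citations; the function name and definition below are my own assumptions, not part of the proposal.

```python
def s_index(dataset_reuse_counts):
    """Hypothetical 's-index': the largest s such that an author has
    s shared datasets that have each been reused at least s times
    (directly analogous to the h-index over paper citations)."""
    counts = sorted(dataset_reuse_counts, reverse=True)
    s = 0
    for rank, reuses in enumerate(counts, start=1):
        if reuses >= rank:
            s = rank
        else:
            break
    return s

# An author whose five shared datasets were reused 9, 4, 3, 1, and 0
# times would have an s-index of 3 under this definition.
print(s_index([9, 4, 3, 1, 0]))  # 3
```

Whatever the eventual definition, such a metric would only be computable if deposited datasets carry stable identifiers that reuse can be traced back to.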
In summary, as domain standards are developed, tested, and proliferate, I expect that what constitutes adequate data sharing will continue to become clearer.
Akil et al. Challenges and Opportunities in Mining Neuroscience Data. Science 331(6018):708-712 (2011). doi:10.1126/science.1199305
Kandel et al. Neuroscience thinks big (and collaboratively). Nature Reviews Neuroscience 14:659-664 (2013). doi:10.1038/nrn3578
Shreejoy Tripathy is a post-doc in the Centre for High-Throughput Biology at the University of British Columbia where he uses a combination of data mining, machine learning, and domain knowledge to link disparate data modalities in neuroscience, with a focus on neuron electrophysiology and genomics. His website: http://www.neuroelectro.org/. On Twitter @neuronjoy