Open Data Projects Win Wellcome Trust, NIH and HHMI Open Science Prize

April 18, 2017 | Sheryl P. Denker | In the News, Innovation, Open Science, Technology

Editor’s Note: See related items in PLOS Biology and The Official PLOS Blog.

“Scientists can do much more with their own data if things are shared publicly and shared publicly quickly in order to have potential for real world impact.” -Trevor Bedford, lead of the Open Science Prize winning team.

The Open Science Prize, a new initiative from the US National Institutes of Health (NIH), Howard Hughes Medical Institute (HHMI) and the Wellcome Trust, encourages and supports open science approaches that generate benefit to society, advance research and spur innovation. An integral component of the selection process is demonstrated use and generation of open data, so PLOS is proud that this year’s winner of the Open Science Prize is PLOS author Trevor Bedford, an evolutionary and computational biologist at the Fred Hutchinson Cancer Research Center in Seattle, Washington. Other finalists for the prize are also PLOS authors, including Michael Bamshad’s team, featured in a blog post for Rare Disease Day; Aurel Lazar’s and Ann-Shyn Chiang’s team, for the Fruit Fly Brain Observatory; and Ben Goldacre’s team, for OpenTrialsFDA.

These scientists and their teams are making sure that open content, including publications, datasets, code and other research outputs, is discovered, accessed and reused. Bedford and his team won the prize for the development of nextstrain.org, a website that integrates shared, open sequence data from global research teams into a model for real-time tracking of virus evolution. This gives the larger community a powerful graphical tool to facilitate pathogen surveillance and epidemiological investigations.

Open Data Tool Accelerates Policy and Research

In an interview discussing the value of open science approaches, Bedford spoke about open data, attribution, licensing and his experience in using preprints to support a publication strategy that releases data quickly while providing peer-reviewed citations for himself, his international collaborators and his postdocs and students.

One of the three final criteria in judging for the award is the level of demand and utility demonstrated by the proposed service or tool. This criterion worked in favor of nextstrain.org, as the team works with publicly available viral sequence data to infer transmission patterns and evolutionary dynamics. Over the course of the last 15 years, according to Bedford, the analytical methods have matured. Most recently, “fast genomic turnaround times means more actionable information is possible. This has created a powerful situation during outbreaks, where context is needed for robust conclusions, so investigators are willing to share data,” says Bedford. “We need to put datasets together for comprehensive inferences about what is going on,” he continues.

In creating the nextstrain.org website, Bedford wanted to do something useful that wouldn’t be construed as scooping other people’s data for a publication. He sees the website as a good way to provide value to the community and work with other labs’ data without being perceived as making a claim of ownership, as a preprint or published paper would. Those involved in the project are committed to the use and reuse of properly attributed pre- and post-publication data that is publicly available and citable.

What gives Bedford’s collaborators their intellectual property claim? “I admit this is a wild west at the moment for sequence data,” he says. Many researchers deposit sequences in GenBank before publication “but fear that it is not clear this is prepublication data,” he adds (GenBank doesn’t have this type of setting). Scientists also post data to lab websites or GitHub with caveats that the data is prepublication; his website uses all these sources. Sequences posted on GitHub are incorporated immediately, and the sources are notified that their data is being used.

When asked whether everyone is a believer in open data, and whether he had encountered resistance or hesitancy to share data, Bedford replied that his team uses whatever people want to share. He has noticed a positive trend in the sharing ethos, however. During the Ebola outbreak there was a significant lag in data sharing; by the time of the Zika outbreak, that lag had shortened. The publisher agreement, signed by PLOS and others, to make data rapidly and openly available helped in this area, he believes. “The requirement for sequence data to be deposited in GenBank or otherwise made publicly available at the time of manuscript submission, not publication, contributes to research reproducibility,” says Bedford. PLOS, through its own sequence deposition policies and partnerships for enhanced methods reporting, continually works to strengthen these practices.

For some, the Open Access and Open Science community needs to do a better job of showcasing the value of this more transparent and open way of doing science, from bench to publication and beyond. Thus far there has been positive engagement with the World Health Organization on influenza vaccine strain selection via the related tool, nextflu.org (eventually slated to migrate to the nextstrain website). Bedford envisions three audiences that would make practical use of his team’s open data tool:

Those performing viral sequencing or using sequence data, as a useful platform to compare and share data
Those involved in outbreak responses, as a tool to understand data, transmission patterns and strain evolution
Researchers or others interested in characterization of mutants and the ability to look at historical mutations

Publishing and License Choice

Bedford has an integrated publication strategy for his lab and work that best uses the various venues available. He publishes in a mix of Open Access and paywalled journals, creates webtools, deposits datasets and posts preprints. One strategy is to publish a statistical model or methods article, develop the model into a website or webtool and link to the website in published articles (rather than embedding JavaScript for the tool directly into the article).

He likes the pattern of building an ecosystem around a work: post a preprint with links to published or released genomes, update the preprint with new data or analysis and then submit that paper to a peer-reviewed journal. This allows his team to capture the whole chain of research and progress, establishing provenance of credit along the way. Concerns about datasets posted on GitHub or GenBank being scooped mirror the scooping concerns raised about preprint servers. Helping people understand that posting puts an intellectual claim on their data (or paper) has ameliorated, but not eliminated, those concerns.

Those using source code to develop tools for Open Science have several choices in licensing. For smaller projects, Bedford prefers the MIT license (also used for code developed at PLOS that is released as Open Source), which grants free and unlimited use and reuse rights, provided the copyright and license notice is retained as attribution. Other projects of his, including nextstrain.org, are released to the public under a GNU General Public License (GPL). This license provides that anyone who distributes a product derived from the source code must, in turn, make that product open source under the same terms. In other words, if a commercial entity builds on his open source code and distributes the result, that company must release its code as open source as well. The license status is essentially inherited and passed down to the next generation of product together with the code. One benefit of choosing the less restrictive MIT license, similar to CC BY for published articles, is maximum reuse with minimal restriction.

Congratulations to all finalists of the Open Science Prize, who are sharing their work and data for the benefit of basic science, translational research and global public health.

Image Credits: The Open Science Prize, nextstrain