Unrestricted Text and Data Mining with allofPLOS

November 28, 2017 Sheryl P. Denker Innovation Open Science Publishing Technology

Content mining, machine learning, text and data mining (TDM) and data analytics all refer to the process of obtaining information from machine-read material. Far faster than a human could, machine-learning approaches can analyze data, metadata and text content; find structural similarities between research problems in unrelated fields; and synthesize content from thousands of articles to suggest directions for further research. Given the continually expanding volume of peer-reviewed literature, the value of TDM should not be underestimated. Text and data mining is a useful tool for developing new scientific insights and new ways to understand the story told by the published literature.

Application and Challenges

Researchers have leveraged text mining of abstracts and NCBI databases to advance precision medicine through discovery of disease-gene-variant relationships, employed text mining of journal articles for sleep disorder terminologies to determine publication trends, and used text mining to cluster and relationship-map BioMed Central journal content. A study posted on bioRxiv found that text mining full articles gave significantly better information than mining abstracts only, as expected. However, the authors of this study described challenges in the way content was presented and in the need to obtain copyright permissions. In addition to content availability and license status, support for early adopters and training for future practitioners are also cited as barriers to broad use of TDM for research purposes. The foundational value of CC BY licensing for TDM is that no additional permissions or documentation are required. Open Access facilitates TDM:

not on a case-by-case basis, but for all people, in all places, and at all times
without lengthy legal agreements or restrictions
by providing unrestricted reuse, remix and mining rights

No Restrictions, No Conditions: allofPLOS

With more than 200,000 fully Open Access research articles available for content mining, PLOS can help advance the discussion and application of content mining through real-world experiences. Through our API we provide article text and metadata in a single XML format that follows the Journal Article Tag Suite (JATS), the National Information Standards Organization (NISO) standard tag suite for archiving and exchanging journal article content.
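
To give a sense of what working with JATS XML can look like, here is a minimal sketch in Python using only the standard library. It is not the PLOS parsing tooling. The sample document is hand-written for illustration, its DOI is made up, and the helper names (flatten, summarize_article) are hypothetical. The element names used (article-id, article-title, abstract, body) are standard JATS tags.

```python
import xml.etree.ElementTree as ET

# Hand-written JATS-style sample for illustration only (not a real article;
# the DOI below is invented).
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.9999/example.0001</article-id>
      <title-group>
        <article-title>A <italic>hypothetical</italic> study of sleep</article-title>
      </title-group>
      <abstract>
        <p>We mined text.</p>
        <p>It worked.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>Full text goes here.</p>
    </sec>
  </body>
</article>"""


def flatten(element):
    """Return all text inside an element (nested inline tags included),
    with runs of whitespace collapsed to single spaces."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def summarize_article(xml_string):
    """Extract a few common fields from one JATS-style article."""
    root = ET.fromstring(xml_string)
    return {
        "doi": flatten(root.find(".//article-id[@pub-id-type='doi']")),
        "title": flatten(root.find(".//article-title")),
        "abstract": flatten(root.find(".//abstract")),
        "body": flatten(root.find("body")),
    }


if __name__ == "__main__":
    for field, value in summarize_article(SAMPLE_JATS).items():
        print(f"{field}: {value}")
```

Because the tags are standardized, the same few lookups should apply to any article that follows the tag suite, although real articles carry many more elements than this sample does.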

The new allofPLOS project is a step forward in giving researchers easier opportunities for new discovery and for illuminating non-obvious connections between data, research articles and fields of study. With allofPLOS, the content of every PLOS article (excluding Figures or Supplemental Data) is provided in JATS XML format, together with XML parsing tools. By including tags, content and parsing tools together, we hope to simplify and streamline the process for those wanting to experiment with content mining and TDM tools.

With content mining, scientists, educators, policymakers and others can identify and map patterns and trends across millions of articles, extract the information they want, and gain new insights to advance research. TDM results can be shared as a new research article or as a database for others to use.

Setting the Stage for a Text and Data Mining Future

To support policies and public awareness that TDM for research purposes is compatible with current and future publishing industry practices, in 2015 PLOS participated in construction of The Hague Declaration on Knowledge Discovery in the Digital Age, a set of five core principles and a roadmap for action to enable researchers to carry out TDM of digital content on the web without legal repercussions. Unrestricted access to the scientific literature together with standards that promote machine readability of the facts, data and ideas contained within ensures that journal content is available for maximum discovery and reusability.

“We are producing so much information, not just as published literature but as even data from sensors, from monitoring activities, monitoring the planet, and monitoring species, and living things and nonliving things it is simply not humanly possible to attract full value from this, let alone value that we don’t even know that exists inside it,” says Puneet Kishor, former Science and Data Policy Manager, Creative Commons, in a video on The Hague Declaration website. “Using computers and machines is the only way programmatically to figure out what’s hidden inside,” he says.

Next Steps

Visit the PLOS Text and Data Mining page to download the PLOS research article corpus and XML parsing tools, and stay tuned to this space for upcoming stories of how researchers are using these tools.