Sharing paleodata (Part 1): What databases are out there?

June 24, 2013 PLOS Blogs Open Data

Science depends on the ability to make observations, repeat experiments, test hypotheses, and share knowledge. When a new study comes out, other researchers evaluate an author’s arguments based on the data they present and the analyses they perform. This is why most journals require that at least some of the original data be included as part of the manuscript. Without access to the original data, science loses that critical requirement of repeatability.

Original data can be a description: “This is what the anatomy looks like, and here is a picture showing that feature”. Other times, it’s a table of measurements or a map of locations. Original data can also take the form of a chart or graph that summarizes the individual data points and reports trends for the group. Depending on the study, the number of data points you used might be 20, or 20,000. You might have to write new software to run a specific analysis. Printed in a journal, the spreadsheet or code involved could take up hundreds of pages. In these cases it’s impractical to publish all of the original data or code in printed format. But you still need that data and code for the work to be repeatable.

My particular line of research involves the examination of bone tissues in order to study the growth patterns of living and fossil animals. The primary data I use are microscopic bone tissue characters (the number or shape of bone cells, patterns or orientation of blood vessels) that vary in appearance around the slide. It’s hard to reproduce images in a print journal at the size and resolution others need to verify my observations, much less enough to capture variation around the slide. To make my observations verifiable or repeatable, you need an enormous image – ideally, the whole slide, at high resolution and decent magnification (say, 4x or 5x). Even if the file sizes were reasonable (they’re not – I create a lot of gigapixel images), a printed version that allowed you to see all the bone cells and blood vessels would be more like a map in size than a page of a journal.

OMNH 34784 Tenonto tibia CL T1-2 BF 4xMOR tiny — It is hard to see the histological details of an image this size (click on the image for slightly larger one). Go to MorphoBank to see it large enough to examine histological details. Image (c) 2012 Sarah Werning / The Sam Noble Oklahoma Museum of Natural History.

How to solve this? Some researchers publish data to their own websites. For example, Drew Lee (Midwestern University) posts high-resolution histology images on his Paleohistology Repository. Larry Witmer’s lab (Ohio University) posts 3D visualizations of their CT data on their website, along with a bunch of downloadable interactive projects on animal anatomy. Emma Schachner (University of Utah) posts supplemental images, posters, and other ancillary images to her website, www.theropoda.com. These are pretty spectacular, but not every researcher has the time to maintain a data site, and eventually the researcher that owns them won’t be around to maintain them. So while they fill the gap now, this is not an ideal longterm solution.

Most academic journals allow you to publish online supplementary materials, including ginormous spreadsheets or software code (ginormous images are sometimes problematic – some journals limit uploaded files to 10MB). So while the data is usually out there, even in this form it’s not always accessible. You might need to subscribe to the journal to access it, or the file could get lost over time (I know the latter seems ridiculous, but unfortunately, it’s not uncommon for older papers, or when journals change publishers).

Even if the data is easy to access, comparing it in a meaningful way could require pulling together hundreds or thousands of sources. In these cases, a sort of universal clearinghouse for data would be ideal. These exist, for certain types of data. For example, almost all journals published in the US (and elsewhere) require anyone whose data includes a new genetic sequence to upload it to GenBank before publication. GenBank is a database maintained by the National Center for Biotechnology Information (NCBI), a subdivision of the National Institutes of Health (NIH). GenBank shares and synchronizes data with two similar databases (one in Europe and one in Japan) as part of the International Nucleotide Sequence Database Collection (INSDC). GenBank assigns a unique accession number to each sequence, and these are reported in the paper when it is published.

UCMP77305_Isoodon_sm — There is no standard repository for measurements or morphological data, but I outline some options below. UCMP 77305, *Isoodon* femur. (c) 2013 Sarah Werning

The advantage to hosting your raw data on publicly-accessible databases or data clearinghouses is clear: researchers around the globe have access to your sequences and can discover them easily, even if they weren’t familiar with your original paper before running a search. If you receive federal funding (say, from the NSF), it’s now also required. Every grant proposal submitted to NSF requires a Data Management Plan, which explains how the grantee will share “the primary data, samples, physical collections and other supporting materials created or gathered in the course of work” at “at no more than incremental cost and within a reasonable time”.

There’s no universal designated clearinghouse for bone measurements or pictures of fossil histology (yet), but a number of databases and repositories have sprung up in the last ten years that host different types of data. A few of these (for example, Dryad) have become the repository of choice for multiple journals, the first steps toward the type of data clearinghouses I described above.

Because paleontology is so interdisciplinary and draws on so many types of information, our data can be found in a number of repositories (in addition to supplemental information hosted on journals’ sites). There so many options for a paleontologist who wants to share data that it might be hard to keep them straight. Over my next few blog posts, I’ll be discussing some of the common data repositories used by paleontologists. For each of these, I’ll give pertinent information, such as funding sources/costs and site features/limitations, discuss my thoughts on usability, and highlight some recent paleo papers that take advantage of these databases. My hope is to assemble a comparative guide for paleodata repositories over the next month or so. The first of these will go up on Wednesday and discuss Dryad (mentioned above). Other sites I will discuss in the near future include Figshare, MorphoBank, and Morphbank: Biological Imaging.

repository logos — Coming soon: I review all of these. Clockwise from top left: MorphoBank, MorphBank: Biological Imaging, Dryad, Figshare.

Today, I’ll leave you with a list of some of the data repositories I’ve used and encountered in my paleontological research, along with a brief description of the type of data they host. Please post good suggestions for additions to this list in the comments.

SARAH’s REPOSITORY REPOSITORY

A comprehensive list of data repositories can be found at Databib. However, not all of these are data repositories in the sense that you can use them to host the data for a given manuscript (some of them just provide data summaries). Particularly useful to me is this list of Biological Science Databases: http://databib.org/index_subjects.php#Bio

The remainder of these sites are ones you can use to host your own data (some restrictions may apply).

AMCED: http://rruff.geo.arizona.edu/AMS/amcsd.php (crystal structures)

Biomesh: http://www.biomesh.org/ (FEA models for biology)

data.gov: http://www.data.gov/ (science data generated by state- and federally-funded agencies)

Digimorph: http://www.digimorph.org/ (CT data)

Dryad: http://datadryad.org/ (all types of data)

ESA Data Registry: http://data.esa.org/esa/style/skins/esa/index.jsp (ecological and environmental data)

GenBank: (http://www.ncbi.nlm.nih.gov/genbank/ (genetic sequences)

figshare: http://figshare.com (all types of data; explanatory media including figures, presentations, and posters)

miRBase: http://www.mirbase.org/ (microRNA sequences)

MorphBank: Biological Imaging: http://www.morphbank.net/ (phylogenetic data, morphology images)

MorphoBank: http://www.morphobank.org/ (phylogenetic data, morphology images)

Movebank: https://www.movebank.org/ (animal tracking data)

Neotoma: http://www.neotomadb.org/ (paleoecological data, including faunal occurrences and pollen data)

NOAA Paleoclimatology: http://www.ncdc.noaa.gov/paleo/paleo.html (paleoclimate data)

Paleobiology Database: http://paleodb.org/ (fossil occurrence data; some measurements)

Pangaea: http://www.pangaea.de/ (earth sciences data)

Protocolpedia: http://www.protocolpedia.com/ (lab protocols)

TreeBASE: http://treebase.org/ (phylogenetic trees and supporting data/files)

ZooBank: http://zoobank.org/ (official registry of zoological nomenclature)