This is the fourth in Neuroskeptic’s series of interviews for the PLOS Neuroscience Community! If you missed them, check out the interviews with Srivas Chennu, Michael Corballis and Cordelia Fine. Neuroskeptic’s regular blog is at Discover Magazine.
How common is “p-hacking” and what does it mean for science? I spoke to Megan Head, of the Division of Evolution, Ecology and Genetics at the Australian National University in Canberra, Australia. She’s the first author of “The Extent and Consequences of P-Hacking in Science”, published in PLOS Biology on March 13th.
In that paper, which has attracted a lot of interest and has already been viewed over 30,000 times, Head and colleagues used automatic text parsing (text-mining) to extract published p-values from the Results and Abstract sections of every open access paper in the PubMed database. Head et al. found an excess of p-values just below 0.05, the threshold that conventionally denotes statistical significance. This implies that p-hacking or other biases are acting to favour the publication of significant results.
NS: In your study, you used text mining to automatically extract p-values from scientific publications. Previous studies have used the same approach – how does your method differ from the ones that went before?
MH: Our method differs from previous text-mining work on p-hacking in that it extracts data from the full text of articles (the Results section) rather than just the abstracts.
This is important for studies of p-hacking, because the p-values presented in the abstract may not be representative of p-values presented more generally.
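To make the text-mining step concrete, here is a minimal Python sketch of pulling reported p-values out of a passage of text. It is an illustration only, not the pipeline used by Head and colleagues: the regular expression, the function name `extract_p_values` and the `exact_only` option are all assumptions made for this example.

```python
import re

# Hypothetical pattern for this sketch: matches forms such as
# "p = 0.043", "P<0.05" or "p-value = .049" (values below 1 only).
P_VALUE_RE = re.compile(
    r"\bp(?:-value)?\s*([<>=])\s*(0?\.\d+)",
    re.IGNORECASE,
)


def extract_p_values(text, exact_only=False):
    """Return (operator, value) pairs for each p-value reported in text.

    With exact_only=True, keep only values reported with "=", since
    "p < 0.05" does not say where below the threshold the value lies.
    """
    found = [(op, float(num)) for op, num in P_VALUE_RE.findall(text)]
    if exact_only:
        found = [(op, value) for op, value in found if op == "="]
    return found


if __name__ == "__main__":
    passage = "Growth differed (p = 0.043) but survival did not (P > 0.2)."
    print(extract_p_values(passage))
    print(extract_p_values(passage, exact_only=True))
```

A real analysis would need a much more forgiving pattern (scientific notation, "ns", values in tables and so on), but the sketch shows why the section of the paper being parsed matters: the function only ever sees the text it is handed.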
You found that “p-hacking is widespread throughout science” but also that “its effect seems to be weak relative to the real effect sizes being measured.” Were you surprised by either of these results?
I wasn’t surprised that it was widespread. Evidence for p-hacking has been found previously in specific disciplines, and I thought there would be no reason why those disciplines would be special. Initially I was surprised that the effect of p-hacking was weak, but on reflection this makes sense. In my discipline – Evolutionary Biology – the data used in meta-analyses often come from studies in which those data were not part of the primary hypothesis. This kind of data is less likely to be hacked than data relating to the primary hypothesis of a paper. Further, I suspect that in Evolutionary Biology most p-hacking occurs for p-values very close to the significance threshold, so in reality effect sizes aren’t being altered drastically by p-hacking.
You looked at all of the Open Access papers in the PubMed database. Do you expect the situation to be any different for non-Open Access papers?
It’s hard to say. There is a lot of debate about quality control in open access journals versus the prestige of some non-open access journals, and both of these factors might affect the extent of p-hacking. But actually I suspect that the result would be the same, because papers are often sent to multiple journals before they are eventually published.
When collecting data for this paper we had thought that it would be interesting to compare open access and non-open access journals. However, our inability to easily obtain data from non-open access journals made this impossible. Being able to obtain this kind of data is another advantage of open access that is often neglected.
You’re an evolutionary biologist – what led you to decide to do a study of p-hacking?
I had read lots of papers from the psychology/neuroscience literature on p-hacking and was interested to see how widespread it was and how bad the problem was in my field. The bias created by p-hacking could potentially inhibit scientific progress, so this is an important question for any field. Science is the best method we have for finding out how the world works. But we still need to be critical of our methods to ensure unbiased and rigorous results.
In your opinion, what would be the single best way to reduce or mitigate p-hacking?
I think the best way to reduce p-hacking is to educate researchers on the way that common practices may create bias. Many researchers don’t realise that the methods they employ lead to p-hacking. For example, often when researchers are asked what their sample size will be they reply something like “I’ll do a certain number of samples and check the results to see if I need to do more.” What they mean is that if their results are significant they’ll stop, and if they are close to significant they’ll do more – this is one form of p-hacking.
I hope that our study helps to promote the issues surrounding p-hacking and also highlights that this is an issue that all disciplines should be concerned with.
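The "check the results and collect more if needed" habit described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis from the paper: the data are pure noise (no real effect), the test is a simple two-sided z-test with known unit variance, and the starting sample size, step size, upper limit and the "close to significant" cut-off of 0.10 are arbitrary choices for illustration.

```python
import math
import random


def two_sided_p(samples):
    """Two-sided z-test p-value for mean == 0, assuming known unit variance."""
    n = len(samples)
    z = sum(samples) / math.sqrt(n)  # equals sample mean * sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_sims=10_000, start_n=20, step=10, max_n=50,
             alpha=0.05, near=0.10, seed=1):
    """Return (fixed_rate, hacked_rate) of false positives under the null.

    fixed_rate:  test once at start_n and stop, whatever the result.
    hacked_rate: if the result is "close" (alpha <= p < near), add `step`
                 more samples and retest, until significant or max_n.
    """
    rng = random.Random(seed)
    fixed_hits = 0
    hacked_hits = 0
    for _ in range(n_sims):
        data = [rng.gauss(0.0, 1.0) for _ in range(start_n)]
        p = two_sided_p(data)
        if p < alpha:
            fixed_hits += 1
        # Optional stopping: keep sampling only while the result is "close".
        while alpha <= p < near and len(data) < max_n:
            data.extend(rng.gauss(0.0, 1.0) for _ in range(step))
            p = two_sided_p(data)
        if p < alpha:
            hacked_hits += 1
    return fixed_hits / n_sims, hacked_hits / n_sims


if __name__ == "__main__":
    fixed, hacked = simulate()
    print(f"fixed sample size: {fixed:.3f}")
    print(f"optional stopping: {hacked:.3f}")
```

Because both rates are computed on the same simulated experiments, the optional-stopping rate can never be lower than the fixed-sample rate: every extra look at the data is another chance for noise to cross the threshold.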
Are issues around questionable research practices being much discussed in your field of evolutionary biology?
Our group has started doing work on this, but in general, no, it is not really discussed. I think most evolutionary biologists either think that this is a problem with other disciplines or accept that there are problems but think nothing can be done about it.
You point out that “Many researchers don’t realise that the methods they employ lead to p-hacking.” Who do you think bears the responsibility for educating people about this? Is it a matter for statistics lecturers, or is it a broader issue?
We should certainly be teaching students about questionable research practices like p-hacking early on. However, I wouldn’t put sole responsibility on statistics lecturers; doing that could lead to a long lag in better practice while students move through the ranks. There are plenty of opportunities to have these conversations, for example when advising on experimental design and analyses, when reading draft manuscripts, when discussing journal articles and when reviewing papers.
As you point out, these issues are often discussed in the context of psychology, but the same problems crop up widely – seemingly in almost every field that uses p-values. What do you make of the argument that the best way to stop p-hacking is to stop using p-values altogether?
I’m not convinced that ditching the p-value is the answer to the problem. Studies have shown that hacking can also occur with other metrics that are used to indicate the importance of findings – for example, effect sizes.
I think researchers need to have a better understanding of what the p-value represents and combine that information with information from other metrics presented. They also need to acknowledge the potential for bias and employ practices that reduce this bias in their research. I think taking arbitrary thresholds too seriously is also a mistake.
Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. On Twitter @Neuro_Skeptic
The views expressed in this post belong to the author and are not necessarily those of PLOS.