PLOS’s open data fever dream
Posted: February 25, 2014
The publisher of the largest scientific journal in the world, PLOS, recently announced that all data relevant to every paper must be accessible in a stable repository, with a DOI and everything. Some discussion of this is going on over at Drugmonkey, and this is a comment that got out of hand, so I posted it here instead.
What is the purpose of this policy? I don’t see how anyone could be fooled into thinking this could somehow help eliminate fraud. Fraud is about intent to deceive, and one can deceive with a selective dataset as easily as (or, actually, much more easily than) with Photoshop.
What else? Well, you could comb through the data of that pesky competitor or some other closely related work, looking for mistakes or things they missed that you could take advantage of. Frankly, I can’t imagine bothering. I mean, how could you not have something better to do? Like collecting data. Maybe you’re a stats scold who wants to check the assumptions behind every ANOVA. I can sorta see this… I wish data were presented better in a lot of papers. But I don’t really lose sleep over this for reasons described below.
Then, there is the data utility argument: Everyone should have access to datasets so they can run their own analyses and make new discoveries with the same data. Efficiency, right? Leverage the “utility” we get out of the Johnsons’ payroll taxes. This, to me, is lunacy, outside a small subset of disciplines (genomics, crystallography apparently) that generate data that is inherently amenable to curated storage and is to a large extent independent of the details of how it was acquired (a sequence is a sequence; a series of electrical events in a neuron, by contrast, is just a bunch of stuff that happened one time). Two thoughts occur to me:
1. Would I put my name on an analysis of someone else’s data? I would not. I want to be intimately familiar with how the data were generated. I want to talk to the person. I want to have seen the experiment done at least once. Everything else is GIGO.
2. For real, can you get grants and tenure for publishing new analyses of other people’s data? If yes, sign me the fuck up for that shit.
3. OK, three thoughts. Isn’t this the mother of all repeated measures fallacies [edit: Ian points out below that what I mean here is multiple testing, not repeated measures]? Don’t we all know better than to keep re-analyzing the same data until something comes up with p < 0.05? I mean, y’all remember the recent p-value hysteria about how science was crumbling under its rotten foundation of bad stats? Right. Well, who in the hell believes something is A True Fact because of one p-value? Or one dataset? Or one paper? Scientific knowledge is not data. It is consistent results from a wide range of experimental approaches and negative results from attempts to falsify a hypothesis. FROM DIFFERENT EXPERIMENTS. ACROSS TIME AND SPACE. Science is the motorboat… data are the wake behind it… shit we’ve already churned through. I know there are special cases of clinical trials or whole genomes or whatever that have special utility as resources for data mining and meta-analysis, or where data collection is prohibitively expensive. Most science, I argue, is not like that. Want to know more? Get more data.
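The multiple-testing worry above is easy to demonstrate with a quick simulation (my sketch, not from the post; the sample sizes, the number of re-analyses, and the crude z-test are all illustrative assumptions). One noise-only dataset, re-analyzed ten times with arbitrary post-hoc groupings, yields a “significant” p-value far more often than the nominal 5%:

```python
# Stdlib-only sketch: one noise-only dataset, re-analyzed k times
# with arbitrary groupings, inflates the false-positive rate.
import math
import random

random.seed(42)

def two_sample_p(a, b):
    """Two-sided p-value from a crude two-sample z-test on means."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = abs(ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2))))

n_sims, n, k = 2000, 40, 10
single_hits = 0  # the first analysis alone yields p < 0.05
any_hits = 0     # at least one of k re-analyses yields p < 0.05

for _ in range(n_sims):
    data = [random.gauss(0.0, 1.0) for _ in range(n)]  # pure noise
    pvals = []
    for _ in range(k):
        grouping = random.sample(data, n)  # an arbitrary post-hoc split
        pvals.append(two_sample_p(grouping[:n // 2], grouping[n // 2:]))
    single_hits += pvals[0] < 0.05
    any_hits += any(p < 0.05 for p in pvals)

print(f"one test:          {single_hits / n_sims:.2f} false-positive rate")
print(f"best of {k} tests:  {any_hits / n_sims:.2f} false-positive rate")
```

A single analysis comes in near the nominal 5%, while “best of ten” lands several-fold higher, which is exactly why uncorrected re-analysis of a published dataset shouldn’t move anyone’s beliefs much.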
4. Someone publishes using analysis A that the whizzle is critical for regulation of yumping. If someone takes that same dataset, does analysis B, and then says “Yes, the whizzle does regulate yumping,” do you believe it more? You should not. What if analysis B says the whizzle is not required for yumping? Do you believe the first paper less? Not particularly, because based solely on one dataset, the whizzle-yumping connection is provisional at best anyway. What to do? Someone has to go get more data, using their science brain to design a better experiment.
4. This is the second point #4. Who pays for this? If I pay out the nose to dump a TB of videos and physiological recordings on Figshare or whatever, who decides if I’ve annotated it sufficiently? Am I obliged to spend time answering every query/complaint/conspiracy theory someone emails the PLOS editorial staff?
5 or 6. But, you say, the clarification of the policy seems to say that as long as you just put in “spreadsheets of measurements” you are OK. So remind me again what the point of this was? If you’re showing me the spreadsheet, you’ve already done the important part of the analysis. Now we’re just back to the stats scolds and whizzle-doubters.
Sharing data is good. You should do it; it’s part of being a good science citizen. But a sweeping, ambiguous, unenforceable mandate like this is nuts.
Maybe I am thinking too narrowly about the kind of data in my subfield. Every video? Every confocal stack? HDs full of EM data? Every trace? Every recording? Who is going to host and curate this?
I dunno….convince me, wackaloons.