Don’t say Policy when you mean Mission Statement
Posted: March 13, 2014
Hopefully my final thoughts on the PLOS data whatnot. There is a good overview and compelling defense of the PLOS data access policy from Tal Yarkoni here. I like this take on it, and it nudged me in the open data direction… but… I will now repeat myself and elaborate.
1. The “taxpayer rights” argument? Really? Not even worth bringing up to dismiss (which to be fair is what Tal mostly did).
2. The point with e.g. behav/phys data is only partly that it is hard to make accessible and that there are no accepted standards. It is also that it is bad science to use the same dataset for subsequent analysis after analysis. Bolstering belief in a result or hypothesis comes from new data and different approaches to the same question…we don’t want to learn more and more about a particular set of data (I know some of you do, but maybe you can settle for touching every third lamp post on the way home), we want to learn more about the phenomena of interest. In my field, that can only mean experimentation. So perhaps the willingness to expend effort providing accessibility is proportionate to the value different fields see in having this kind of accessibility.
For me, the first step to doing *any* work based on another group’s result is to replicate it in our lab the best we can. This can mean repeating exactly what they did, or doing an experiment of our own that should be consistent with their result but uses our strengths or is a better starting off point for where we’d like to go with it. The point is, you don’t believe something because it’s published. Y’all know that, right? That’s not a flaw in the system, that’s just science. Unknown confounds, honest errors, noisy systems…c’est la vie. It takes multiple kinds of data from multiple approaches before a result starts to acquire the property of being a stable fact/assumption that can underlie further experiments without further confirmation.
So, using someone else’s primary data is so far over my trust threshold I can’t even. I don’t even use anyone else’s saline. Say there is a hot result in your field that you want to explore further. Aren’t there two kinds of responses to this?
A. I’m going to see if I can confirm that, then take it my direction.
B. I want to re-analyze that dataset.
B makes no sense to me. Say you get a different answer, and they did something wrong. Good for you, and good for science (see #6 below). Say you come to the same conclusion… so what? That might increase your trust in their bar charts, but it shouldn’t increase your trust in their data! I’m not a volunteer fact checker, and I treat any new result in my field that really matters to me with a provisional “maybe,” as should you, dataset in hand or not.
4. PLOS volunteer AEs are not going to manage this in any agreed-upon or consistent way. They haven’t been given guidance on how to handle the myriad author-reader conflicts and scenarios that will arise from this, and we know they don’t currently enforce other PLOS policies in agreed-upon or consistent ways. If this means in practice that the policy is to require authors to tick a box and then to respond to complaints with “the author ticked the box,” then it is the Open Data folks who should be angry.
5. I would love a stable, annotated, searchable repository for my primary data…a little upfront work so I don’t have to worry about long-term storage (drawers full of slowly disintegrating DVDs) sounds fine. But it doesn’t exist. Though the early roots were academic, current repositories for genomics (and a few other limited data types) are supported by a US federally funded program created by an act of Congress, so I have little patience for the gene types who go on about how easy it is to share a few GBs of sequence*. Imagine GenBank, but orders of magnitude larger and encompassing hundreds or thousands of data formats. Who is even talking about that? Who will pay for it? Yes, yes, I know there are options: right now that space is occupied by a couple of companies and non-profits that are frankly untested at large scales and over time, are poorly subscribed, and are (let’s face it) unlikely to exist in the medium to long term.
6. Transparency is by far the best argument here, and the only one I find compelling. However, PLOS sacrificed transparency immediately with the frankly bizarre redefinition of “primary data” to mean the measurements or downstream analysis of the primary data, as a kind of special exclusion for cases where the primary data aren’t the kinds of things they happened to have in mind when they wrote the policy. Thus, to the extent it’s been clarified, the policy has been clarified into irrelevance with respect to what I see to be its primary legitimate purpose.
So I get the big picture point and agree. Build it, and I’ll show up. Mandate it with no planning or forethought, I’ll publish elsewhere.
* And then when you point out the difficulties, they backtrack about 10 miles to “they don’t really mean primary data in your case, they mean a table of numbers you made from your primary data… duh! And they don’t really mean any of it, because if you give the editor a reason why it’s hard they’ll surely give you an exception.” OK, so what part of that is a “policy”?