PLOS clarification confuses me more

…the Data Policy states the ‘minimal dataset’ consists “of the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety. This does not mean that authors must submit all data collected as part of the research, but that they must provide the data that are relevant to the specific analysis presented in the paper.”

I have been trying to parse this for the last couple of days. I cannot see how “the dataset used to reach the conclusions” could somehow be something different from “all data collected as part of the research.” Trivially you could mean “experiments the results of which are not in the paper.” But that’s dumb, who would even consider including that?

So what IS the distinction PLOS is trying to make? Are the videos of monkeys that Marc Hauser used the “minimal dataset,” or are they in the vague “all data collected as part of the research” category you don’t need to make accessible? If you had 30 DVDs of videos, how would you make them accessible with metadata and DOIs if you wanted to? I can’t see how those videos could be anything other than “the dataset used to reach the conclusions.” The impression I’ve gotten from others (and from the lengthy PLOS clarifications and FAQs), however, is that the output file from the manual coding of these videos would be the expected “available” data under the PLOS policy.

This illustrates my point, and what PLOS seems unwilling to address. Making the video accessible in the way they demand is insanely burdensome. Making the data file that is the result of coding the video accessible is easy but pointless. Everything important about the analysis occurred between the video and the coded file.

This isn’t just true of behavior videos…it’s nearly anything where you perform measurements that attempt to extract “important” features from complex, continuous phenomena. In other words, a LOT of experimental science. When you have large video or physiology datasets, there is usually no agreed-upon standard for how you convert that into a manageable set of measurements that are amenable to quantitative analysis, or necessarily even agreement on what features are important to measure. For the data I collect, for example, some are analyzed purely by code. This is great for resting easy about bias, but code is so stupid that it often makes terrible, systematic mistakes that have to be manually checked. For example, often there is no “correct” threshold or segmentation criterion that is going to work in every case, so you either have to have exclusion criteria or a partial human-judgment step or some other way of filtering your data. At the other end of the spectrum are totally subjectively scored events (like Hauser’s monkeys). Because humans are humans, you do your best to keep the experimenter blinded to whatever independent variables you are interested in. That’s not always possible, so I try to avoid experiments that rely solely on “eyeballing” something.

Most behavioral or physiological analysis is somewhere between “pure code” analysis and “eyeball” analysis. It happens over several stages of acquisition, segmenting, filtering, thresholding, transforming, and converting into a final numeric representation that is amenable to statistical testing or clear representation. Some of these steps are moving numbers from column A to row C and dividing by x. Others require judgment. It’s just like that. There is no right answer or ideal measurement, just the “best” (at the moment, with available methods) way to usefully reduce something intractable to something tractable.
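A toy sketch of that mix of mechanical steps and judgment calls might look like the following. Everything here is invented for illustration (the z-score threshold, the ambiguity margin, and the run-detection logic are hypothetical, not any lab’s actual method): an automated pass extracts supra-threshold events from a raw trace, and any event whose peak sits too close to the cutoff gets flagged for manual review instead of being trusted blindly.

```python
import numpy as np

def extract_events(trace, threshold=3.0, ambiguity_margin=0.5):
    """Toy event extraction from a raw 1-D trace.

    Events are contiguous runs of samples exceeding `threshold`
    (in z-score units). Events whose peak lies within
    `ambiguity_margin` of the threshold are flagged for manual
    review -- no single cutoff works in every case.
    """
    z = (trace - trace.mean()) / trace.std()   # mechanical step
    above = z > threshold                       # mechanical step
    # Find contiguous runs of supra-threshold samples (candidate events).
    edges = np.diff(above.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    if above[0]:
        starts = np.insert(starts, 0, 0)
    if above[-1]:
        ends = np.append(ends, len(z))
    accepted, needs_review = [], []
    for s, e in zip(starts, ends):
        peak = z[s:e].max()
        # Judgment step: borderline events go to a human, not the stats.
        if peak < threshold + ambiguity_margin:
            needs_review.append((s, e))
        else:
            accepted.append((s, e))
    return accepted, needs_review
```

The point of the sketch is that even in a “pure code” pipeline, the borderline cases still route back to human judgment, and a different lab would defensibly pick a different threshold or margin.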

It is interrogating or revisiting this judgment that is the only purpose I can see for “open data” in these kinds of experiments, which means you need every TB of raw data to make this useful. So, again, is the PLOS policy going to make this happen? If so, how? If not, it is pointless. That’s why I think, for these kinds of data and in the absence of a massive investment (on the scale of GenBank) in a repository for enormous video files and raw time series data (in what formats? annotated how? paid for by whom?), the PLOS policy is either impossibly burdensome or pointless. Even then, given the large set of uncontrollable and unknown variables that affect the collection of experimental data like this, pooling or comparing such datasets would be very bad science indeed.

I’m trying to think of more ways to frame this. Here’s another try:
There is no answer to the question “what is rat behavior in response to X?”  in the sense that there is an answer to “what is the structure of rat myoglobin?” There is only what some particular group of rats did when some version of X (as understood and applied by a given experimenter under lab conditions that they can only partially control) happened to them.  The “data” are everything those rats did that the experimenter chose to observe/measure after they did X to them. (You’ve already made important choices in deciding what to measure.) Many information-reducing and analytical steps later, there is a summary of what you decided was important and quantifiable about what the rats did.  

This approximate, mediated, interpreted, judgment-based, tentative kind of conclusion is what neuroscience (and many sciences) lives with. We never get the discrete right/wrong answers you get about sequences, or phosphorylation sites, or the number of planets around a particular star, or the mass of a subatomic particle. We aren’t measuring facts of nature, we are asking “what kinds of things usually happen when…?”

If you think that’s frustrating to the “standardize your formats so all your data can be pooled/analyzed by others” set, imagine how frustrating it is to those of us who are doing the experiments. I wish there were a sequence of numbers I could pull out of a behavioral dataset that is the True Result of That Experiment, let alone The True Facts of That Behavior. There is not and never will be; there is only the raw video/trace of some stuff that happened one time.

The gene jockeys keep saying “we have these standards/repositories because the community demanded them.” It might be worth reflecting on reasons why other communities haven’t demanded them beyond being bad scientists, old fashioned, selfish, and uncollaborative.

I will also note that an ongoing issue at PLOS ONE is the extent to which a given editor buys into the PLOS ONE mission. This is as clearly an articulated policy as one could hope for (and one I deeply believe in), and the majority of editors understand it and follow it even when reviewers don’t. A significant portion of the editors, however, don’t get the mission and don’t follow it, happily rejecting papers based on reviews that say they aren’t “novel” or “exciting” enough. A smaller proportion don’t even seem to know that P1 has a mission that differentiates it from other journals and go on about how P1 should be working to increase its impact factor. Given this editorial variance, I can’t wait to see the wildly differing experiences people have navigating this ball of confusion about what a “minimal dataset” is.


19 Comments on “PLOS clarification confuses me more”

  1. drugmonkey says:

    Most behavioral or physiological analysis is somewhere between “pure code” analysis and “eyeball” analysis. It happens over several stages
    You’ve already made important choices in deciding what to measure.

    could use a little more stitching together. That place between eyeball and pure code is informed, over time, by the experience of the scientist, of the lab, of the sub-sub-field, of the training tradition. The good part is that it takes what is a bit of a subjective process and puts some objectivity into the mix. The decisions about how to score, use, collect, analyze, reduce, etc the data seem more sciency and objective if you do things the same way, each and every time. But as you point out, different people doing the same nominal assay might have arrived at different rubrics for their data.

    People do fight over their differences. A little. And it may make the occasional paper fail to gain acceptance at one journal….but it will get in at the next one. So nobody’s OneTrueWay to do the assay will come to dominate.

    A problem with the PLoS initiative is that it is going to start pounding all the pegs down into one consensus hole. And that will, if successful, homogenize the science. Which isn’t good. Sometimes the little variances in methods turn up really cool stuff. Sometimes the variances protect us from over generalizing. Sometimes the fact that the same result is obtained from a diversity of approaches enhances our confidence that it is the right conclusion being drawn. Etc.

  2. rxnm says:

    That last paragraph is particularly bang on, DM… and reminds me of what drives me up the wall about BRAINI stuff, particularly the “brain observatories” and huge expenditure of effort on very narrow anatomical questions and tools. It’s a push toward homogeneity in approach, methods, and model, and that is death for creative science.

  3. “Everything important about the analysis occurred between the video and the coded file”

    This is true. If it’s too burdensome to do for an entire publication’s worth of data, I wonder if it could be done only for the Fig1/main finding, to give readers a sense of the general approach, which would presumably apply to other datasets…

  4. Chris says:

    If PLoS’ form of words fails it is because they are addressing a big chunk of science in one go. That makes it easy to pick holes. A more charitable unpacking might go as follows:

    1. Sharing is the new black, but PLoS almost never specifies what should be reported; it characterises it generically as the pyramid of data and work below the paper (up to some sensible cut-off), but what that means in each community is going to be different and can only be agreed through peer consensus. A while back, MINI attempted to standardise some neuroscience reporting wrt metadata content (but still _not_ experimental design); in following that project I became aware that neuroscientists seem often to take a dim view of sharing. If that is still generally true then journal editors would find that and would _not_ require more for such submissions than could be so supported. This means another round of ‘MI’ projects, essentially, carried out piecemeal through referee conversations like case law, but in the end the only authority on what you should be sharing (what is useful to share) is you and those like you. Get a group and make the arguments cogently in a paper; bring the opposition, if any, out of the woodwork, then that collective view validates your future position wrt sharing your data and you’ve done a service for those like you.

    2. What PLoS does dine out on is _where_ to put stuff. This is brilliant and does a great job of promoting Dryad and FigShare — Godsends both — the important point being that if you want to share you now can, whatever your domain. Size might be an issue at the upper extreme, but you can’t argue with this stuff. Basically, if you want to, or need to, then you can. Ace.

    Everyone seems to think that those writing policy are Stalinist-Sadist lizards on a mission. In fact those ‘wielding’ policy are almost never willing to enforce it at present, largely because all those policies in some sense refer to community consensuses that have never been established. All it needs is the word ‘enough’. Enough according to whom? According to those that know. Which is you. But because everyone kicks off without any attempt to really unpack things and understand each other (blech, yeah, whatever) all this goes by the wayside.

  5. DrugMonkey says:

    My “enough” is the status quo. “On request”. So presumably we’re done here? I can submit an article to PLoS with no new steps?

  6. namnezia says:

    DM, I’d say yes.

  7. rxnm says:

    Chris, I’ll plead guilty to picking holes… the fact that they used the norms of what’s useful and practical for one subset of fields and just sorta blanket mandated it for everyone is my point.

    And in pointing out the merely misguided and impractical, I’m not even addressing the very legitimate times when you might not want to share data. So far, PLOS acts like these will be easily solvable on a case-by-case basis through consultation with editors, who will apparently, based on some criteria (what criteria?), grant requests to not follow the policy. I guess they have a lot of time on their hands. I wonder how they will feel when they are fielding massive numbers of complaints and requests citing this policy from ARAs, Republican congressional aides looking for “wasteful spending,” and other science trolls.

  8. namnezia says:

    Presumably as long as you are willing to provide the data, you can create your own repository (e.g., your lab’s hard drive).

  9. DrugMonkey says:

    No Namnezia. The policy *requires* a third-party. No such third party in their right mind is going to sign off unless they verify the archive. IMO.

  10. chrisftaylor says:

    @DM/rxnm Logically, if a community (fuzzy thing but tending to revolve around a subject or technique) demonstrated majority support for a position paper (imagine a conference session) that asserted there are no data (beyond those in the paper) worth sharing, given either their singular nature or the herculean effort involved in collecting them, then I would expect editors to defer to that. Of course getting those data (or just one datum, even) from the PDF is a pain so you could still submit a little tabbed file tagged with some nice keywords to Dryad :)

    Overall we’re going to need a few meetings amongst the great and the good and the rest in various fields to agree a sensible approach for each. There will be less-straightforward cases where consumers are more broadly spread (by field) than producers; credit mechanisms based on data citation may be the answer there.

    Anyway, PLoS’ effort is wide open to criticism, but it also supports an interpretation as a sensible, flexible, consultative request to share what is useful, to validate and extend the research presented, and to support who-knows-what future use by tagging and bagging it.

  11. rxnm says:

    Chris, if that much work is required to agree on what’s needed for certain subdisciplines, isn’t PLOS on the hook for an ill-conceived, premature, and unenforceable mandate?

    If a community (by some definition) is not asking for this (or wants something different), why does PLOS feel it is its place to impose it?

    Who is going to mediate this flexibility and consultation? AEs? Reviewers? PLOS staff? These are basic procedural questions PLOS is not answering.

    Again, I am not against the principle here, just saying that the narrow thinking behind this policy could lead to an epic can of worms.

  12. Anon says:

    Hm, seems like it’s time for an online poll – all neurophysiologists who think the PLoS requirements are insane, raise your hands…

  13. […] As one blogger points out, even this carefully crafted statement leaves a lot of room for interpretation and creates potentially significant burdens or unrealistic expectations for researchers in some fields: […]

  14. […] had led to criticism at the DrugMonkey blog, and a February 26 clarification seemed to do little to convince another critic. In particular, there were objections to a section that began […]

  15. Suresh says:

    As a neurophysiologist, I can easily see how data could be made accessible along with code: it would represent the data files (possibly after some extraction steps) and then the code that produced the figures and results in the paper, and that was probably used while exploring the database actively before publication (since a large set of our papers are pilot results rather than truly pre-designed). While I think everyone benefits from having such datasets available, I don’t think that a citation is the right reward for the amount of work that was done. Having a data-source tag in PubMed that can be used in CVs (the number of papers that used my data as a source) and given appropriate credit would be a prerequisite before open sharing is possible, IMO…

  16. rxnm says:

    Suresh, Thanks for commenting. I think the point is that what you consider “extraction steps” someone else might consider “data manipulation” and vice versa. The stated purpose of this policy is to allow reanalysis and, presumably, to try to allow different methods of analyzing the raw data.

    And, if anyone wants to pool data from multiple papers (a terrible idea, but it’s out there), you would be even stupider to pool the “data used to generate the figures” if those “data” had all been produced with different extraction steps relative to raw recordings.

  17. Suresh says:

    Yes, I agree, and while I do not know why PLOS is (really) doing this, my own perspective does not sacrifice trust. It is not to police people’s results, but to reach the endpoint where people all over the world with different theoretical perspectives, skills and interests can work on public datasets and test novel theories and analyses. I think DrugMonkey wrote somewhere that the main result of open data would be that people questioned the conclusions of published papers, and I agree completely, but I don’t see why that is not desirable, especially given the (poor) level of statistical rigor in the fields I am familiar with, and the preponderance of extremely complex analyses that are reported with almost non-existent detail and assumption-checks, and where it would be very useful to know whether other experts looking at the data would reach the same conclusions, especially when looked at in different complementary ways (a Granger causality analysis would be one such example). I bet simply eliminating bugs in the data analysis would change a clearly-greater-than-zero percentage of the results. I am not sure how conscientious people are about sending out their raw data to anyone interested: in any event, getting people to put (some form of) data out along with the paper is fine, I think. I just don’t think that given the credit-assignment and funding policies in place at the moment, simply getting experimentalists to work incredibly hard generating data and then just letting it go for “citation rewards” is a workable measure: citations are too flawed to serve as a good measure for credit-allocation anyway. Just my take…

  18. […] of a multiresistant KPC-3 and VIM-1 carbapenemase-producing Escherichia coli strain in Spain PLOS clarification confuses me more I Didn’t Want To Lean Out: Why I Left, How I Left, and What It Would Have Taken to Keep Me in […]

  19. […] DrugMonkey’s complaint that the inmates are running the asylum at PLOS (more choice posts are here, here, here, and here); the yays have Edmund Hart telling the nays to get over themselves and share […]
