PLOS’s open data fever dream

The publisher of the largest scientific journal in the world, PLOS, recently announced that all data relevant to every paper must be accessible in a stable repository, with a DOI and everything. Some discussion of this is going on over at Drugmonkey, and this is a comment that got out of hand, so I posted it here instead.

What is the purpose of this policy? I don’t see how anyone could be fooled into thinking this could somehow help eliminate fraud. Fraud is about intent to deceive, and one can deceive with a selective dataset as easily as (actually, much more easily than) with Photoshop.

What else? Well, you could comb through the data of that pesky competitor or some other closely related work, looking for mistakes or things they missed that you could take advantage of. Frankly, I can’t imagine bothering. I mean, how could you not have something better to do? Like collecting data. Maybe you’re a stats scold who wants to check the assumptions behind every ANOVA. I can sorta see this… I wish data were presented better in a lot of papers. But I don’t really lose sleep over this for reasons described below.

Then, there is the data utility argument: Everyone should have access to datasets so they can run their own analyses and make new discoveries with the same data. Efficiency, right? Leverage the “utility” we get out of the Johnson’s payroll taxes. This, to me, is lunacy, outside a small subset of disciplines (genomics, crystallography apparently) that generate data that is inherently amenable to curated storage and is to a large extent independent of the details of how it was acquired (a sequence is a sequence, a series of electrical events in a neuron is just a bunch of stuff that happened one time). Two thoughts occur to me:

1. Would I put my name on an analysis of someone else’s data? I would not. I want to be intimately familiar with how the data were generated. I want to talk to the person. I want to have seen the experiment done at least once. Everything else is GIGO.

2. For real, can you get grants and tenure for publishing new analyses of other people’s data? If yes, sign me the fuck up for that shit.

3. OK, three thoughts. Isn’t this the mother of all repeated measures fallacies [edit: Ian points out below that what I mean here is multiple testing, not repeated measures]? Don’t we all know better than to keep re-analyzing the same data until something comes up p<0.05? I mean, y’all remember the recent p-value hysteria about how science was crumbling under its rotten foundation of bad stats? Right. Well, who in the hell believes something is A True Fact because of one p-value? Or one dataset? Or one paper? Scientific knowledge is not data. It is consistent results from a wide range of experimental approaches and negative results from attempts to falsify a hypothesis. FROM DIFFERENT EXPERIMENTS. ACROSS TIME AND SPACE. Science is the motorboat… data are the wake behind it…shit we’ve already churned through. I know there are special cases of clinical trials or whole genomes or whatever that have special utility as resources for data mining and meta-analysis, or where data collection is prohibitively expensive. Most science, I argue, is not like that. Want to know more? Get more data.

4. Someone publishes using analysis A that the whizzle is critical for regulation of yumping. If someone takes that same dataset, does analysis B, and then says “Yes, the whizzle does regulate yumping,” do you believe it more? You should not. What if analysis B says the whizzle is not required for yumping? Do you believe the first paper less? Not particularly, because based solely on one dataset, the whizzle-yumping connection is provisional at best anyway. What to do? Someone has to go get more data, using their science brain to design a better experiment.

4. This is the second point #4. Who pays for this? If I pay out the nose to dump a TB of videos and physiological recordings on Figshare or whatever, who decides if I’ve annotated it sufficiently? Am I obliged to spend time answering every query/complaint/conspiracy theory someone emails the PLOS editorial staff?

5 or 6. But, you say, the clarification of the policy seems to say that as long as you just put in “spreadsheets of measurements” you are ok. OK, so remind me again what the point of this was? If you’re showing me the spreadsheet, you’ve already done the important part of the analysis. Now we’re just back to the stats scolds and whizzle-doubters.
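The multiple-testing worry in point 3 is easy to make concrete with a small simulation. This is an illustrative sketch only — the group sizes, number of re-analyses, and seed are arbitrary choices of mine, not anything from the PLOS policy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_tests, n_sims = 0.05, 20, 2000

families_with_hit = 0
for _ in range(n_sims):
    # 20 independent "re-analyses" of pure-noise data: both groups are
    # drawn from the same distribution, so every null hypothesis is true.
    x = rng.normal(size=(n_tests, 30))
    y = rng.normal(size=(n_tests, 30))
    p = stats.ttest_ind(x, y, axis=1).pvalue
    if (p < alpha).any():
        families_with_hit += 1

fwer = families_with_hit / n_sims
# Theory: 1 - (1 - 0.05)**20 is roughly 0.64 -- most of the time at
# least one of the 20 analyses "finds" p < 0.05 in pure noise.
print(f"fraction of datasets with a spurious hit: {fwer:.2f}")
```

In other words, about two times in three, dredging a single null dataset twenty different ways turns up at least one p < 0.05, which is exactly why one more analysis of the same data shouldn’t move your beliefs much.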

Sharing data is good. You should do it; it’s part of being a good science citizen. A sweeping, ambiguous, unenforceable mandate like this is nuts.

Maybe I am thinking too narrowly about the kind of data in my subfield. Every video? Every confocal stack? HDs full of EM data? Every trace? Every recording? Who is going to host and curate this?

I dunno….convince me, wackaloons.


17 Comments on “PLOS’s open data fever dream”

  1. DrugMonkey says:

    Yes. Very well put.

  2. Clearly I must be one of these “wackaloons” of which you speak, although I have never thought of myself as one. I just thought making data (and my scripts) available as I published papers seemed like common sense for science. Sure, there are some data types that are harder to store than others. However, as others have said, an honest effort is what is being asked for. I do not think the editors at PLoS (or other journals or funding agencies that require this) would complain if you could not provide format X for your data (the raw data coming off the machine in a proprietary format); some flat text file of the data that you use for the analysis would probably be fine.

    Here are some of my thoughts on your points (I discuss this more over on my blog, http://genesgonewild.blogspot.com/2014/02/why-would-any-scientists-fuss-over.html).

    I agree that using the raw data to “root out” fraud may be rare. However, if such a case did occur (say someone simulated data but pretended it was empirical data), some forensic data sleuthing may shed some light on it (the data fits a particular distribution more closely than other data in the experiment, or its distribution is radically unlike other related data in the field). Still, my guess is that the amount of sleuthing will be small. However, in the age of increasing post-publication reviewing (and sleuthing of fraud, image manipulation, etc.), perhaps some students will do it.

    You seem concerned that some competitor will look at your published data, and try to find mistakes that invalidate your results. A couple of thoughts on this. First, I have had publicly available data for years, and to my knowledge this has never happened. More importantly, if I did make a mistake, I would like to know what it is. Also, how we analyze data changes (statistics is as rapidly developing as any other field). I have published several papers where the analysis was appropriate for the time, but I now reflect back on them and can see better ways to address my questions (plus many new questions, more on that below).

    You also discussed an issue of multiple testing (you called it repeated measures, but that refers to something different from what you discussed). This issue is well encapsulated in the idea of data dredging, or (from a quote attributed to Ronald Coase)

    “If you torture the data long enough, it will confess to anything”

    Of course this is an issue. However, as long as the provenance of the original data is known and spelled out clearly during the re-analysis, readers will be aware. That said, providing such raw data is incredibly useful for meta-analyses (common in my field, and I believe in many clinical fields as well), where the results from many studies can be examined simultaneously. This is far easier to do with raw data than with summary data (which may not summarize some of the relevant data at all).
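To make the raw-vs-summary distinction concrete: with only published summary statistics, about all a meta-analyst can do is pool effect estimates, along these lines (a minimal sketch — the effect sizes and standard errors below are made up for illustration):

```python
import numpy as np

# Hypothetical per-study summaries: an estimated effect and its standard
# error. With the raw data, one could instead model covariates, check
# distributions, or re-estimate the effects on a common scale.
effects = np.array([0.30, 0.10, 0.45])
ses = np.array([0.15, 0.10, 0.20])

# Classic fixed-effect, inverse-variance pooling: weight each study by
# the precision (1/SE^2) of its estimate.
w = 1.0 / ses**2
pooled = float(np.sum(w * effects) / np.sum(w))
pooled_se = float(np.sqrt(1.0 / np.sum(w)))
print(f"pooled effect = {pooled:.3f} +/- {pooled_se:.3f}")
```

That pooled number is the ceiling of what summary data supports; anything finer-grained (subgroups, covariates, distributional checks) needs the raw measurements.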

    More to the point, there have been some truly heroic experiments done, many years and years ago. These data are still really relevant and important (possibly more so than the original experimenters even knew). Many of the PIs who oversaw or led such work have since retired or died. Is this data now gone forever? How could this ever possibly be good for science? Would it not be better to make it available along with publication? Sure, most data may never get looked at or re-used. But for that very small fraction (much like all of our science), it could be critical.

  3. rxnm says:

    Thanks for the comments, Ian, and the correction.

    I don’t really care about people having access to my data in principle (I was imagining scenarios where people might care). I do care about the enormous cost and effort of making TB of data available, curated, annotated, and DOI’d, whether someone wants to access it or not (hint: there is a 99.9999% chance that no one wants to).

    Again, the PLOS policy says “all data” and suggests that the purpose is to allow reanalysis. Behavioral videos, ephys recordings, and imaging data cannot be reduced in a way that doesn’t make critical choices about what is important and how it should be measured. By the time you get to a reasonably-sized flat file of measurements (“spreadsheets of original measurements”), you have lost the point of reanalysis.

    If what PLOS actually wants is not “all data” but “the table I used to make this bar chart,” nothing could be easier. That doesn’t seem to be the intent or wording (unambiguously, anyway) of the policy. But say it’s a chart of a secondary analysis of the output of a primary analysis that is based on a subset of measurements from the raw data. What am I obliged to provide? How much annotation of each of these data sets? Why can’t someone who is interested pick up the damn phone? (I have yet to meet a scientist who is unwilling to talk about their work.)

    I think this “old data might be useful someday” is bunk. Certainly not true often enough to justify such a sweeping mandate. We should test hypotheses with more data and different experiments, not “test” data with endless reanalysis. (And, again, I see the important exceptions in things like clinical studies and genomes.)

    I think this is driven in part by certain kinds of data…. DNA sequence in particular. It is useful to have stable repositories for this because DNA sequences are on some space/time scale a stable fact of nature. This is not true of behavior and physiology. Re-examining the same instances/events of a broader set of phenomena over and over creates the illusion that we are somehow gaining more information when we are not.

    That said, if someone really wanted to reanalyze animal behavior videos (in a fraud investigation, for testing or comparing a new analysis method, etc) of course this should be fine. And it is. Yes, some people are assholes and resist sharing, so what else is new? I assume they won’t publish with PLOS.

    I won’t because curating and annotating data well enough so that my lab can figure out what the hell it is is already hard enough. This is pretty much why we have journals to publish figures in.

  4. Thanks for your reply. I have a couple of thoughts

    With respect to behavioural data (from video):
    I would really enjoy having a longer conversation with you about some of these issues. I am particularly interested with respect to behavioural data (for you, me and others in the community). Like you, we generate large amounts of “raw” video data for behavioural analysis. In our current (and very slow) analysis pipeline we use event recording software to manually “curate” (I cannot think of a better word for it) and extract the behavioural states and events from these videos (in addition to the other experimental covariates). Since we are currently using JWatcher (but experimenting with JAABA for more high-throughput work), we end up with flat text files that are the next level of “raw” data. We have some simple scripts to parse these into formats that make it easier to analyze. Certainly these “raw” output files from JWatcher can easily be put up on figshare or github or DRYAD, along with the parsing scripts (and even the reformatted versions of the file for analysis). We write a simple readme file (almost exactly the same one the students make for me so I know what the pipeline is) and can post it along with the data so anyone who wants to use such data can.
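A pipeline like that — event-logger output as flat text, plus a small parsing script — can be sketched in a few lines. Note this is a made-up stand-in format, NOT real JWatcher output; the field layout and behaviour keys are invented for illustration:

```python
from io import StringIO

# Hypothetical flat event file in the spirit of the pipeline described
# above: comment header lines, then "time_ms,behaviour_key" records.
raw = StringIO("""\
# subject: subject_01
# observer: student_a
0,START
1523,w
2874,g
4010,w
5500,EOF
""")

def parse_events(fh):
    """Return a list of (time_ms, key) tuples, skipping header lines."""
    events = []
    for line in fh:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        t, key = line.split(",")
        events.append((int(t), key))
    return events

events = parse_events(raw)
# Duration of each behavioural state = gap to the next event.
durations = [(k, t2 - t1) for (t1, k), (t2, _) in zip(events, events[1:])]
```

Posting the flat files alongside a parser and readme like this is cheap, and it is the form of “raw” data a third party could actually re-use.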

    Similarly, for our large genetics-of-shape experiments (combining geometric morphometrics of Drosophila wings with genomic analyses), we have large sets of raw images of wings, followed by raw text files of the b-splines, followed by the superimposed landmark data (the last in an easy-to-use format). We submit all of this (not the images, more on that below) along with scripts and the readme file for the pipeline. I have found that requiring all of this in my lab has made it much easier for me when working with students and post-docs to make sense of everything.

    Then there is the question of what to do about the terabytes of raw video and image data. I do not have an easy answer for this. However, given that my lab’s SRA deposits for nextgen sequencing data (our experiments tend to be on the order of 50-500GB) are approximately the same size on disk as our imaging experiments, I think that suggests such databases are quite possible. In terms of figuring out how to incorporate meta-data (organisms, aspects of experiments), simple text files with information (with some standardized elements for species, etc., and other elements left flexible for each experiment) may work. However, there does not yet seem to be a major motivation for this. Of course, we could all make our own lab youtube channels, and I think this may be a fun (and possibly futile) experiment.

    Your other major point seemed to be about who would use the data. I think you may be surprised. Let me provide a few examples.

    With my own work on the genetics of shape we have generated ~3*10^6 wing images (across many different experiments). Over the past 3-4 years, I have been asked 3 (maybe 4) times for portions of the raw images from folks working in computer vision/biometrics who want to try out their new methods for either feature extraction or classification. I keep such sets zipped up and ready to FTP. These requests have also resulted in several new collaborations for me (with a few papers and a few small funded proposals). I had no idea at the outset how useful such data would be, and now I try to make sure that when we collect data for our experiments, we do so in a manner that also maximizes the utility for these researchers (without sacrificing our explicit experimental goals, of course).

    Another great example is from one of the oldest studies of natural selection, the so-called Bumpus data (you can read about it and download it here http://www.fieldmuseum.org/explore/hermon-bumpus-and-house-sparrows).
    As new statistical methods have been developed to quantify natural selection, this has often been a go-to data set. Not because it is ideal in any particular sense, but because it allows for comparison with other methodological approaches.
    More generally in studies of natural selection, it has not been the individual empirical studies that have been particularly thought provoking (although of course some have been), but the meta-analyses. As almost every author of such meta-analyses has pointed out, they could have done much deeper analysis if they had the raw data.

    This is equally true for the study and estimation of genetic covariance matrices in Evolution and quantitative genetics.

    With respect to the behavioural data you might be collecting (and we are collecting), the sequences of behavioural events may be very useful for those trying to understand underlying motivational states (using hidden Markov models). Indeed, a relatively new extension of HMMs (random-effect HMMs) really does need lots of sequential data across many individuals for behavioural analysis.

  5. There was a letter about the costs of archiving data in Dryad in Nature a while back (http://researchremix.wordpress.com/2011/05/19/nature-letter/), which shows that it’s actually very good value for money. Most datasets I’ve ever seen at Molecular Ecology are well under 10GB.

  6. Well argued piece rxnm but, like I said to drugmonkeyblog, I think you protest too much. It is the principle that matters. While many people are generous with their data and make it available on request, there are many who do not. Yes, it is inconvenient, but that is a price we pay. I think it’s a straw man argument to say that there are types of data that are useless to third parties. Of course this is true, and PLOS isn’t going to demand that all data is made available. To do so would be discriminatory. The policy is stupid if taken literally, and they (PLOS) have released a follow-up statement to clarify what they mean – in response to well argued as well as hyperbolic responses from the community. The issue of re-analysis you bring up is a good one. But I think it needs perspective. Who is going to re-analyse? I think it’s safe to assume these are going to be peers who are familiar with the type of research. They may have had a hard time replicating the data and want access to try to see what they are doing wrong. Sometimes people are intimidated by bigger/established labs. They can’t simply ask. So having data available without relying on the depositing lab would be a big help. Unfortunately, we are not all “good science citizens”.

    It is hard to justify the policy on a benefit/cost balance given that most data sitting in public repositories will collect dust (or errant electrons), but that’s also not the issue the policy tries to address. It’s the principle that the results we publish should include the root data on which they are based, to the extent possible. This has to be practical, and it looks like PLOS recognizes there will be exceptions and technique-dependent limits. Indeed, they won’t be able to police the amount or extent of supporting data. Hence, only in situations where there is a reasonable complaint of insufficiency will they be able to fall back on the intent of the policy.

    I think this is a move in the right direction. The requirement for public deposition of data is onerous and needs to be weighed in each case to avoid unreasonable burden. But I also think that we’ll find the right balance and that without such policies, there would be no recourse to addressing authors who refuse to make their data available (for whatever reason).

  7. rxnm says:

    Tim, if we were talking 10s of GB, no problem. Behavior/physiology papers routinely have TBs of raw data. Here is where PLOS is inconsistent… the policy says they want “any and all of the digital materials that are collected and analyzed.” Later they say that “spreadsheets of original measurements” are acceptable.

    If it is the latter, depositing data becomes trivial. If it is the former it is an expensive and time consuming burden.

    So PLOS can either majorly walk back the “any and all” or they need to start specifying what is required for different types of data and how they intend to support authors’ depositing them somewhere.

    I am still processing Ian’s thoughts on this, which are valuable. How do we provide data on imaging or behavior or physiology in a way that is useful to others? If I wanted to “reanalyze” someone’s 4D confocal files or behavior data, giving me “spreadsheets of original measurements” is useless. That’s because “making the charts” isn’t the analysis… converting complex biological phenomena into manageable sets of numbers is the critical component of analysis here, and all of the choices about filtering, segmentation, and how to measure ARE the analysis.
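That point — that the reduction choices are the analysis — can be shown with a toy example. The trace, smoothing windows, and thresholds below are all arbitrary illustrative choices, not any lab’s real pipeline:

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy "raw recording": Gaussian noise plus three slow, decaying events.
trace = rng.normal(0.0, 1.0, size=2000)
for onset in (300, 900, 1500):
    trace[onset:onset + 50] += 3.0 * np.exp(-np.arange(50) / 15.0)

def count_events(x, smooth_win, threshold):
    # Boxcar smoothing followed by threshold crossings. Both parameters
    # are analysis decisions, not neutral preprocessing: change them and
    # the resulting "spreadsheet of measurements" changes with them.
    kernel = np.ones(smooth_win) / smooth_win
    s = np.convolve(x, kernel, mode="same")
    above = s > threshold
    return int(np.sum(above[1:] & ~above[:-1]))  # count rising edges

# Two defensible-looking parameter choices, two event counts -- the
# numbers handed downstream depend entirely on these choices.
n_strict = count_events(trace, smooth_win=5, threshold=2.0)
n_loose = count_events(trace, smooth_win=51, threshold=0.4)
print(n_strict, n_loose)
```

The “spreadsheet” a reanalyzer receives is the output of `count_events`, not the trace; without the raw recording, the filtering and threshold decisions baked into it cannot be revisited.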

    Jim, I get that this is “right” and momentum will carry us that way. Again, what PLOS needs to provide is REAL clarification. This policy is incredibly lazy in its sweeping and inconsistent mandates, and dumps all interpretation and enforcement on authors and volunteer AEs/reviewers. Why would I bother?

    What should an author with hours (or days) of video or other large-scale datasets do to comply? Because the way the policy is written, they cannot. They have eliminated entire (very large) disciplines from their author pool.

  8. brembs says:

    I’m also a wackaloon. In fact, we are beginning to set up our lab such that any and all data we produce is made publicly available (with DOI and all) as soon as we generate it, with no additional effort from us at all. No more pestering your students to back up their data. We are currently almost ready to submit a manuscript where all figures are just single lines of R code in the manuscript that evaluate said data – no more photoshopping and illustrating. Huge time and effort saver: look at your data after you collect them, and your data gets automatically backed up and made available with a DOI. Write your paper without fiddling with your (R) figures, just insert code and write the legend. Any reader who double clicks on the figure will be able to edit the R code to see other aspects of the data.

    Once this transformation is complete, we’ll not even notice PLOS-like mandates – we’ll notice if a publisher won’t provide such services.

  9. Neuro Polarbear (@NeuroPolarbear) says:

    I am obviously on the other side of this debate, but I have to admit, brembs’ approach is very exciting.

  10. […] paper should become community property. Some have the extreme position that as soon as a datum is collected, regardless of the circumstances, it should become public knowledge as promptly as it is recorded. […]

  11. rxnm says:

    brembs system sounds awesome… I saw something similar in a small company. All data was automatically curated (you had to enter metadata when setting up data acquisition), immediately accessible (internally) and integrated with other experiments, etc. But it took an IT team 2 years to develop it to be stable and robust and required one full time IT/data support person to manage. I don’t have the expertise or resources to be able to manage data like that within my lab, let alone distribute it that way…I think it requires an economy of scale of a large lab or, I think in brembs case, a deep interest, commitment, and ability to develop this kind of system. I have neither.

    I have heard a lot of good arguments and examples from both “sides” (there aren’t really sides here, just kind of a continuum of opinions on the practicality/desirability of the PLOS policy). Working on a follow-up post based on this and clarifications from PLOS.

  12. brembs says:

    Actually, it was a postdoc in my lab (Julien Colomb) who used rOpenSci’s “rFigshare” to implement this on the side. As of now, it’s a proof of concept that only works for one of our experiments (the easiest one, of course!), but as we modernize our tools, we will implement this until eventually the entire lab will be running it.

    Our library is currently working on the back-end of this, so we won’t have to use figshare.

    Our library is also involved in a project with other labs to create citable software repositories (sort of a GitHub for scientific code) for our software as well.

  13. readkev says:

    Reblogged this on Kevin the Librarian and commented:
    I wanted to share this post on the fears about PLOS’s new open data policy from a neuroscience researcher’s blog. It addresses many of the concerns from the research community about sharing data, but also highlights many of the ways libraries can contribute. I encourage you to read through the comments section to learn about additional (and innovative) ways researchers are working towards meeting this requirement, as well as many other requirements that will begin to emerge in the near future.

  14. Suresh says:

    “For real, can you get grants and tenure for publishing new analyses of other people’s data? If yes, sign me the fuck up for that shit.” – I think the answer to this is yes. In fact, computational biologists routinely brag about this as a positive sign of their career choice and lifestyle, and computational neuroscientists do much the same these days (of course, they would like to think that it is their math skills that allow them to occupy this “easier” perch, compared to the bench-grubbers). It is important, though, that the people generating the data get clear credit, either as co-authors or through some other “data-source” recognizing mechanism. In my field, generating data is extremely risky, anxiety-inducing and difficult: it is psychologically debilitating to publish papers with open data and then have someone else with more time (and possibly, as a result, a better skillset) on their hands do an analysis that the experimenter would, in principle, like to do oneself. With good credit-assignment methods, these problems can hopefully be overcome, but I thought the PLOS jump was premature.

  15. […] It’s no secret that researchers have mixed feelings about the policy; some are angry and frustrated, others see the light and understand that this has been a long time coming. What librarians can do […]

  16. The motorboat analogy strikes me as a little blasé… I can think of a ton of examples in genetics where software used to draw conclusions is continually found to be lacking and updated in some way to give better results – if the raw/original data isn’t available, it’s then significantly harder to reanalyse for updated values, reinterpret in light of new findings, etc. My own interest, for example, is in NHEJ paired-end assembly, where better tools for doing so make this kind of attitude seem a bit self-assured. I think it’s important to recognize that the best available technology in 5 to 10 years may disprove, improve or give reason to look again at old results. Not sure how the situation in neurosci is, though, and the more constructive criticism, the better the policies, so thanks for posting.

  17. rxnm says:

    I think you’re exactly right about genetics/genomics…. and that is my point. That is a community that has been at the forefront of open data for the simple fact that it is useful to them. They are also the scientific culture from which PLOS arose. What I object to is the idea that because it is useful to their unicorn/snowflake discipline (to use the pejoratives they have been directing this way), that it is therefore useful and should be mandatory for all disciplines. I am not against sharing, I am against putting enormous effort and resources (on the scale of Genbank, but worse) into something with little scientific value.

