Michael Kinsley & Climategate

The recent news about data sharing leading to progress in fighting Alzheimer's reminded me that I wanted to make one point about the long dead and forgotten about "Climategate" scandal.

The point isn't even mine, it's Michael Kinsley's, namely his often-used dictum that "the scandal isn't what's illegal; it's what's legal."

Which is to say: I agree with the three independent reviews that the Climate Research Unit and Phil Jones acted within the normally accepted range of contemporary scientific practice. The issue is that I and many others find that range, well, scandalous.

The scandal has many dimensions, but the specific one I want to talk about here is the availability of data and computer code. My argument is simple: science depends on the ability of researchers to reproduce, test, and extend results found by others, and that is impossible when the result is based upon data and computer code that is not made freely available.

And to fill out the minor premises: science is good. More specifically, results that have been exposed to reproduction, testing, etc. are orders of magnitude more reliable than results that have not been so exposed, so (1) encouraging such reliability is good and (2) distinguishing between information sources with vastly different degrees of reliability is also good.

I'm not just advocating open data for science. What I propose is:

  1. The NSF, NIH, DARPA and other government funders of science require that any projects or researchers they fund make all data and computer programs they develop as part of their funding available without cost and without encumbrance for future research. I'm fine with people who don't like that restriction applying to the NEA.

  2. Science journals do not publish papers unless the data and computer programs used in the production of the research are available for a reasonable cost. (It's okay, for example, to use Mathematica, Mac OS, or other such commercial programs.) Scientists who dislike this restriction are still free to blog, tweet, and issue press releases about their research, or publish such in People magazine.

  3. Universities and colleges only make the research component of tenure and other hiring decisions according to papers meeting the above criteria. I don't want to discourage scientists from blogging, but I would categorize that as an educational, not a research, activity. (And yes, universities and colleges should weight educational activities much more than they do now when making tenure decisions.)

  4. Public policy, particularly policy affecting billions of people and touching trillions of dollars of economic activity, be based to the maximal extent possible on papers meeting the above criteria.

In short, I propose that we take back the word science, and reserve it only for research that is openly reproducible and testable. We will of course need a name for the other stuff that we used to call science, something like "non-creative writing".

Now, the strongest and most obvious counterargument to all of this is that the benefits of such open reproducibility would be outweighed by the costs in reduced information flow. We want, for example, the world to know as soon as possible that high temperature superconductivity is possible (so that post-docs and the like can start preparing to move into the field), but under the proposed scheme the discoverers might be tempted to sit on their data until they have squeezed the last drop of analysis from it, lest they be scooped by another group. I think the issue here can be solved, if indeed it needs solving, by slightly changing the incentives around press releases. I think something like a press release is what we want here -- something that communicates the news, but also communicates its provisional nature -- and while there clearly are some incentives to issue press releases now, those incentives may have to be increased.

Another potential casualty would be research about, or that depended upon, expensive proprietary computer programs. I used to do this sort of research myself, in the field of speech recognition. There were (and to my knowledge still are) no good open source speech recognizers, so the majority of research was done by the ten places that had a state-of-the-art recognizer, and none of it was reproducible unless you were at one of those ten places. And even then, it was darn hard (and sometimes even impossible) to reproduce another group's result, because no one shared source code. For example, sometimes the alleged improvement was really just (unintentionally) covering over a bug in the other guy's code. I shed no tears at the thought of not calling this sort of thing science.

Phil Willis, the chairman of the House of Commons' investigative committee into Climategate, described the "standard practice" of not routinely making the data and programs used in climate science available as "reprehensible", saying it "needs to change and it needs to change quickly". He's right, and he's right even with "climate science" replaced with just "science". If I had Alzheimer's, I would be happy that the researchers in that field had started sharing data and I would be furious that they hadn't started doing so twenty years ago.