Tuesday, March 2, 2010

The endurance of University data records - be discouraged

Much has been made of the destruction or loss of data from the files of the Climatic Research Unit (CRU) of the University of East Anglia. Dr Phil Jones, the Director until this all became public, did not do himself much good with his remarks before the Commons Select Committee in the UK that is looking into the Climategate matter. Destroying the raw data, or not making it available, so that all that one can use is the modified and gridded data, means that there is no way to check that the adjusted data has been properly derived. But the destruction of research data is not only encouraged in some institutions, it is mandated by regulation. That is a point that a lot of folk may have missed, so I thought I would mention it, since I suspect that it affects much more than just the data at my own University.

I retired from the University last Friday, and have spent today throwing away about 80% of the material in one of the three offices in which it has, transiently, been stored. (It happened that in recent months the three folk who had worked with me on many of my research programs also retired, and so their records were boxed and collectively stored with mine until they could be sorted.) We worked through file after file, with data going back to the first experiments that I had run some 40 years ago when I came to this place as a very junior Assistant Professor. That data was still on graph paper, with hand-plotted curves. It went into the trash barrels. As did many of the journals that I had paid large chunks of money for over the years, and almost all of the correspondence dealing with the millions of dollars of contracts that I have managed during my term here.

I am working with the University Archivists, and they and a couple of students helped me work through many of the files, and did most of the actual disposal. We have done some pretty interesting things over the years (I was incredibly fortunate to be involved in many of the activities that changed my discipline from an academic curiosity into something that impacts, in one way or another, many people's lives every day). But not much of that is being kept.

A treatment for skin cancer that discriminates between healthy and diseased tissue – save the patent – the rest into the trash. Cleaning the Statue of Freedom atop the Capitol building in Washington, save the proposal, the final report and one paper. The rest – into the trash.

You might think that I am being deliberately destructive, but this is what the regulations require. For those as incredulous as I was, until last week, here is the information:

Notice that the applicable dates are for three years back. In other words, three years after I get a contract or grant I am supposed to archive the proposal, the report and the sample of data (the paper that I mentioned earlier), and then four years later I am supposed to destroy all the research data. I hope the sanctions are not too onerous since, until today, I had kept everything. Now much of it lies in grey trash bags stacked down a hallway.

I actually got a bit annoyed about this last week, and it was then that this all came to my attention. As it happened, when my pension was calculated (ours is based on length of service) the record did not show that I worked for part of 1997. Now I knew that I had, but if I (or actually the Center staff) had followed University rules, rather than what I had wanted, then there would have been no other record against which to check the records held in the Central Personnel Office. Given, however, that we had kept the records, the copy was found, sent up and the matter was straightened out within about an hour. (However, since a large number of boxes have recently left for the incinerator, I don’t believe that my replacement as Director has continued my cautionary practice.)

In the past the Center has been audited, and had, on another occasion, to defend a set of experiments that were investigated by a government agency. In the first case I found a record, from a period that I suppose I should have had destroyed, that showed that the audit inquiry was misinformed, and in the other I was able to supply all the documentation required (setting aside that it took several full weeks of several individuals’ time to copy, this being before much of our information was stored digitally). As a result, and based on that information, the inquiry was discontinued.

The amount of space that is needed to store digital records is trivial against the bookcases of material that have just gone into the trash. But storing the material only in digital form has some risks. I have just finished a comprehensive review of one of our programs, requiring data that was stored digitally back in about 1987. I cannot open the files for any of the information. I can’t find readers that will read some of the disc storage that I recorded it on. (And where I made copies of the files and transferred them, the current versions of the software won’t read files from that far back.) It wasn’t, in this case, too much of a problem since, in violation of policy, I had the paper copies and just scanned them in to get what I needed (and created a digital copy), but those paper copies will be in those grey bags next week.

There are problems with data storage. Although I kept the written records, when I vacate the room the books will go onto a bookcase in the hall for the students who want them to help themselves (my colleagues already have), a small amount of material will go home, and some will go to the Archives, but the majority will burn or be landfilled, because the person who follows me into that space has their own research and documentation, which they will put into the bookcases that I am vacating. There is not enough room to store the material. We used to use microfiche to do that – I haven’t seen anyone use one of those readers in years; I stopped when ours broke.

The Federal Government and the National Labs are no different. I was on a National Panel which needed some information on a project from one of the National Labs and we wrote for it. It was about 15 years after the experiments. They no longer held the data, and there was no one there that we could talk to about the work. (It was one of those supposedly crazy ideas that folk go out and try, and bless my socks, this one worked, and might have been helpful if we could have found out more.)

So while I continue to think that it is madness to be spending the amount that we are on research into the possible problems of the greenhouse gases without a more robust set of raw data that everyone agrees has integrity and that has been compiled in a way that is logical and transparent, I have to point out that the protocols governing records at Universities are not supportive of my position. Not that this makes me feel any better, rather the reverse. And there are many, many research programs that do not have that level of visibility.

I used to joke in my class that disasters happen in about 20-year cycles, because nobody read anything that was older than that, and thus missed some less-than-obvious design features, which became forgotten until their lack led to disaster. But I had not realized that the data was all gone. And in the digital age, if it isn't on the web, who knows where to look for it?

Troubling thoughts!


  1. The policy to destroy research data seven years after the end of a project troubles me too. A colleague once told me that models and theories come and go but a good observation lasts forever. As you point out, though, it may not.

    I’ve worked for universities but never at one. My inexperienced view of academic scientists is that they usually collect data as the basis for publication. Once the reports and the papers are written, or before then, it’s on to the next grant and/or the next contract. The university’s administration values its scientists for their successes and awards, the money and prestige they can attract and, in the best case, for their teaching ability. Data preservation is not what universities do; none of the foregoing rewards it.

    A newsletter I produce recently received a letter from an academic oceanographer near retirement. Oceanographic observations made by a scientist on a ship have often been fabulously costly per bit and are irreplaceable. Those exact observations can never be made again. This scientist was worried that his data would vanish when he retired, and he thought the government should do something about that. He had, of course, used the data to publish but the publications had no room for the raw data which might, for example, be useful to someone investigating ocean climate. His university also had no mechanism for data stewardship and, I suspect, no interest in it.

    I have a bit more experience with the Canadian government’s oceanographic program, where the government does, indeed, have a program to rescue data from its own retiring scientists. Some of them also started work when notes, graph paper and a calculator were the standard analytical tools. The rescued data must go through steps for description, quality control, formatting and archiving, after which they become available for others to retrieve on demand. The government’s historical physical oceanographic observations have been relatively well looked after in this fashion, the chemical and biological data less so. Unlike an academic department, a government agency often has a mission that entails preserving and building on the past. However, science within government has fallen from favour in Canada, leaving programs like data rescue minimally funded.

    I’ve spent some time thinking about this problem but to little avail. One solution might see government granting agencies require all the data arising from a grant to be quality controlled and submitted electronically, along with the metadata, before the grant can be signed off. After a while, someone at the granting agency might realise they had something valuable and start organising it for retrieval.

    Even in the short term, it does no good to save data without the metadata to describe its collection, analysis and quality control. In the longer term it does no good to save it without a mechanism for retrieval, and that’s a costly affair. Data storage and archiving is neither sexy, fashionable nor publicly demanded, and that seems pretty much to be the end of the story.

    Congratulations on your retirement, HO. I hope it brings you continuing interest, enjoyment and good health.

  2. Thanks for the comment and the good wishes, which are much appreciated.

    One of the sources for the data that I am sort of relying on lies in the appendices to the dissertations and master's theses that were also produced from the work. I generally insisted that these contain the data files, and these are stored at the University in the library, in paper. But it may, in some cases, be quite difficult to know of those documents, and then to get access to them. (I have been refused permission to see some on occasion.)

