Hanging on to data


I had a conversation the other night about oceanographic data sampling: how it depends on getting time on a research ship, and on living with the limitations of the ship's size and how long it can reasonably stay at sea. Data sampled this way is patchy, and concentrated on areas accessible from the various research bases. This is not new. It turns out that there is a body of temperature data from the 17th century that corresponds to the routes used by the East India Company, and we turned to wondering how such data can survive.

At the time this reminded me of a general principle of computer systems design: data should not be deleted. From a system perspective this allows for auditing of changes and reuse of old data sets in the light of new knowledge. Nothing new here, except that a day or so later the New Statesman carried an article [1] on what we might call data rot on the World Wide Web. Nothing particularly new there either, but it set my mind rambling.

Now, data is lost because someone deletes it (deliberately or accidentally), because a hardware fault corrupts it, or because the storage technology is out of date. If you are a bank then you know about these things and you make sure your backup processes handle them. You use access controls (software and hardware), multiple copies, and a re-copy process that keeps your technology up to date. Any organisation serious about not deleting stuff knows these things and takes precautions appropriate to the type of data.

It can also happen that data is lost simply because it becomes, well, lost. You forget the name of the spreadsheet, you drop the USB stick down the drain, or the URL ceases to work. This last one is becoming more and more of a thing. Web sites are restructured almost routinely; old pages are lost and there may not be an adequate search facility. At other times a web site may disappear entirely, either because someone didn't want to pay for hosting or because the domain name was not kept up. Much of what we see is run by businesses, and businesses change, get taken over, or simply close.

But even if you can reach the data, you still have the problem of reading it. Flash, anyone? Old database formats? Old document formats?
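For the curious, the "multiple copies, checked regularly" habit mentioned above can be sketched in a few lines of Python. This is only an illustration, not how any particular bank does it: the function names `sha256_of` and `check_copies` are my own invention, and it assumes you recorded a SHA-256 checksum of the file when it was first stored.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_copies(copies: list[Path], expected: str) -> dict[str, str]:
    """Classify each copy as 'ok', 'missing' or 'corrupt'.

    `expected` is the checksum recorded when the data was first stored
    (an assumption of this sketch).
    """
    report = {}
    for copy in copies:
        if not copy.is_file():
            report[str(copy)] = "missing"   # deleted, or simply lost
        elif sha256_of(copy) == expected:
            report[str(copy)] = "ok"
        else:
            report[str(copy)] = "corrupt"   # contents no longer match
    return report
```

Run periodically over every copy, something like this would catch deletion and corruption while a good copy still exists to restore from. It does nothing, of course, for the problem of formats nobody can read any more.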

Actually, I don't have a problem with the equivalent of burning all your letters. I have more of a problem with reading a book or paper and trying to follow a reference that was only downloaded a year or so ago, only to find the URL doesn't work. I can even shrug that off. What, I think, is really the problem is that cyberspace is like a self-storage facility: it's rented. This is not about whether the rents are fair. Google's rent is rather like the Devil at the crossroads, but other rents may be quite fair. I'm not talking about fairness. What bothers me is that it is ephemeral. A missed payment and it's gone. Data can be lost in the real world, of course, but on the internet we have the potential to burn down the Library of Alexandria any time someone misses a payment.

If someone bought the contents of a self-storage box they might see a collection of old-looking books, leather-bound, pages filled with handwritten numbers, and they just might recognise that these books could be important. I wonder if we could do something similar with the internet?