It’s time to come clean: I no longer know what data is. I am looking pretty hard but I just can’t see it any more. It’s a troubling realisation for someone who has spent twenty years or so trying to preserve the stuff. But the most unsettling part is this: I don’t think it’s me who is lost. Don’t get me wrong, this is not some delayed attack of post-modern angst. I am just trying to get to the end of the day. Is it possible that, just as it was reaching a crescendo of profile, polemic and promise, data has vanished, like Bilbo Baggins on his eleventy-first birthday?

It just used to seem simpler to get through the day. I never doubted that data was a slippery idea – we can’t even agree if it’s singular or plural (hats off to Christine Borgman for the best title on my bookcase: Big Data, Little Data, No Data). But the few simple concepts that steadied my nerves don’t seem to work any more. I have never been entirely certain I understood ‘data’ in the scientific or theological sense of the word, and have long doubted that it could exist independently of theory or method or language. I strayed into digital preservation from archaeology, so I can be entirely robust on my fourfold hermeneutics of interpretation: that claims to truth based solely on data are absurd and that ‘raw data’ is an oxymoron. We understand the world in so many ways, but we always end up capturing that world in language, and only the brave or the boorish can ignore the implied trap: that language is scarcely an unproblematic container. But I don’t share these thoughts often because not everything is archaeology (not yet). I am mindful of my limits and, in all humility, readily defer to wisdom in other domains as to how their epistemologies stack up.

So when asked about data I have ducked the question and given a workaday computing definition that distinguishes data from software and hardware. Data can be shared because it can be packaged and because it is independent of the tools we apply to it. Data has no on-off switch and it is not executable. It’s the output from the device, not the device. It’s the present inside the file format wrapping paper. Data has no physical form and no physical constant because it’s distinct from media or hardware. You can’t hurt yourself on data and it is in some ways blissfully ignorant of the rest of the world. It doesn’t require patches or updates (pause for a minute to recall my futile rage during last Monday’s 55-minute reboot). You can just about see it if you have the right kind of scanning electron microscope. Data might be formatted or unformatted; structured or unstructured; born digital or digitised. It could lack important metadata and be mostly useless, because even dark data is data. It could be a single observation in a CSV or a theory of everything asserted through 1200 pages of tightly argued PDF; it could be fastidiously validated and tightly calibrated quantities; or couthie remembrances repeated so often that they might as well be true: but if it looks like information content that someone could use then it’s probably data. This big-rock-candy-mountain definition won’t ever trouble dictionary corner, but it has helped me get through the day.

I have come to the grave realisation that my feeble classification of ‘data versus application’ no longer works. It never really did. Practical experience tells me that data implies all sorts of internal dependencies and applications that confound naivety: libraries and services that exist in the space between inert data and active processes. And that matters, because it means the file is no longer – and indeed never has been – the atomic unit of preservation.

Here are three short stories that illustrate the point. Two of them are whimsical, as usual, to make them memorable and to illustrate a wider point. All of them are real and all of them point to a serious challenge and a subtle misalignment in digital preservation practice.

A Ship Called Dignity

I had an extraordinary phone call a few years ago. A recruitment consultant called to say that she was looking to fill a ‘very senior appointment’ in a ‘high profile and dynamic’ public institution. The role in question included ‘significant strategic development opportunities’ in ‘information and digital’ and would only be offered to the ‘highest calibre candidate’ with an impressive record of successful change management. It was all super-confidential and her client wasn’t intending an open process. Did I know anyone who might fit the bill?

I’d like to put on record that I suddenly, and somewhat unexpectedly, transformed into the coolest cucumber in the salad drawer. Instead of instantly recommending myself and spluttering through my own shambolic CV, I asked a few well-placed questions, then coyly suggested she send me the further particulars to see if I could think of anyone for her. The email duly arrived with a file called something like ‘FurtherPaticulars.pdf’. What followed was perhaps a cunning aptitude test: I may never know. I did subsequently establish that the organisation behind this prestigious job was unusually touchy about the authenticity of documents, so touchy that it had commissioned its own font as a way to validate the origins of its most important documents. The recruitment agent was clearly oblivious, the font hadn’t been wrapped into the PDF, and nor did I have a local copy to deploy: because that would have enabled someone less scrupulous to start forging. My gold-plated, life-changing job prospect was thus rendered as wingdings. A range of attempted font substitutions, exports to Postscript and whatnot later, I still don’t really know what the job was about. I do recall the sullen moment when I passed on the contacts of a few shining stars of the digital preservation community while skulking back to my obscure portfolio career. So here I am: dignity intact but no super-yacht to call my own.

I take two lessons from the experience: an indifferent one about how to deal with head-hunters on the rare occasions they call; and an important one about data and dependencies. The data (the PDF file) only made sense with respect to an external component (the font) which I couldn’t access, and without this context the file could not be rendered. Is the font data? Not by the definition above. Is the PDF data without the font embedded? By the definition above, probably yes, but with very little comfort to anyone using the file. Is it only data when the text and the font are packaged together? That certainly seems more useful, but it takes me to a different and more complicated place, where the boundary between data, representation information and application is porous. And when you realise that this is a rudimentary office document, intended for personal use on a personal computer, then it suggests that my workaday distinction between data and application is bogus. If this is true of a PDF, then how much more will it be true of a multi-user relational database, an intricate CAD model or a spreadsheet groaning under the weight of its macros? And if it’s true on a desktop PC with canonical releases of software and self-contained files, how much more complicated is it going to be in a highly distributed and highly interdependent environment like the Cloud?
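As it happens, the dependency is easy enough to detect before it bites. Here is a minimal sketch, assuming the open-source pypdf library and an illustrative filename of my own invention, which walks a PDF’s font resources and reports whether an embedded font programme actually travels with the document:

```python
# A minimal sketch, assuming the pypdf library (pip install pypdf) and an
# illustrative filename. It lists each font a page declares and whether an
# embedded font programme (/FontFile, /FontFile2 or /FontFile3) is included.
from pypdf import PdfReader

def embedded_fonts(path):
    reader = PdfReader(path)
    report = {}
    for page in reader.pages:
        resources = page.get("/Resources")
        if resources is None:
            continue
        fonts = resources.get_object().get("/Font")
        if fonts is None:
            continue
        for name, ref in fonts.get_object().items():
            font = ref.get_object()
            # Composite (Type0) fonts keep their descriptor on the descendant font.
            if "/DescendantFonts" in font:
                font = font["/DescendantFonts"][0].get_object()
            descriptor = font.get("/FontDescriptor")
            embedded = descriptor is not None and any(
                key in descriptor.get_object()
                for key in ("/FontFile", "/FontFile2", "/FontFile3")
            )
            report[str(font.get("/BaseFont", name))] = embedded
    return report

if __name__ == "__main__":
    for base_font, ok in embedded_fonts("FurtherParticulars.pdf").items():
        print(base_font, "embedded" if ok else "NOT embedded")
```

Had anyone run a check like this before hitting send, the wingdings would at least have announced themselves in advance.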

Keeping the F in Font

Digital preservationists talk a great deal about the Cloud as a storage platform, but for me all the discussion about data security and the jurisdictions of server farms misses the wider point. The penny hasn’t quite dropped, and it only will when we stop seeing the Cloud as a storage facility for archives and start seeing it as a utility or a service to data creators. There’s a major architectural shift from computing as a product to computing as a service, and it needs some attention. We haven’t spent enough time yet thinking about the implications of preserving cloud-born data. My thwarted career change may as well serve as a useful example of why.

In a desktop computing environment files are more or less self-contained on a disk, and canonical versions of software are provided so that a user can create or alter files in a more or less canonical format in which data is wrapped. Now I grant you the files might be remote, and may even be spread over a number of file stores and symbolic references and disks: but there is a thing somewhere that can be described as a file. Likewise, the software is dynamic, with versions and sub-versions and service packs supporting it: but there is a single stack, and all of it is present on the grey or beige or black box in front of you.

In a cloud environment, you don’t need to worry about the software or software updates because someone else is looking after that: all you need is a browser, a web connection and a login. And each time you use the software it will adapt to your device or your browser and will have been updated on the fly. The software is remote and the stack of applications is assembled on demand. It might look like you have a file, but it might actually be a series of symbolic links to a highly distributed series of byte streams. You might be able to save all that to a local hard disk: but that is just a neat way of synchronising with the remote storage, and is in some sense always an export from the original. There may never be a file in the conventional sense, nor a self-contained programme to run, nor a stable format to be rendered.

Once you’ve spotted that platform, software, infrastructure and even data have become services in the Cloud, then it becomes quite easy to think of a remote service being compiled from distributed micro-services. And that is what they are: services built from other services. There’s a sort of fractal recursion implicit here, with micro-services drawing on yet smaller services and so on. It may not carry on ad infinitum, but it can continue as long and as far as the business need exists and the supply chain can support it. And thus long chains of interdependence emerge in which multiple components have maintenance cycles that are invisible, at least until they go wrong. Understanding and assessing the resulting interdependencies is pretty troublesome, but the consequences are all too obvious when something goes wrong, and it really should be keeping digital preservation specialists busy.
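To make the shape of the problem concrete, here is a toy model of such a chain. The service names and the dependency map are invented purely for illustration; the point is that the full chain reachable from one familiar service is always longer than the list of dependencies anyone consciously chose.

```python
# A toy model of the 'fractal' dependency chains described above: each service
# names the services it is assembled from, and a depth-first walk exposes the
# full chain. Every name here is invented for illustration only.
dependencies = {
    "document-editor": ["font-service", "spell-checker", "storage-api"],
    "font-service": ["font-cdn"],
    "spell-checker": ["language-model-api"],
    "storage-api": ["object-store", "auth-service"],
}

def full_chain(service, seen=None):
    """Return every service reachable from `service`, however indirectly."""
    seen = set() if seen is None else seen
    for dep in dependencies.get(service, []):
        if dep not in seen:
            seen.add(dep)
            full_chain(dep, seen)
    return seen

# The editor 'depends on' three services, but actually relies on seven.
print(sorted(full_chain("document-editor")))
```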

At first inspection, the font dependency issue described above dissolves when you move to a cloud-based architecture. Fonts are an example of the micro-services from which a service is constructed, and they can be summoned when you need them. If everything is always available on demand and if problems are constantly being corrected on the fly, then the absence of a font is unlikely to be a major problem. But small incremental corrections also mean small incremental errors, which can be amplified without much notice. That is what happened for a few hours in January 2016 when the Merriweather Sans font was updated in Google Docs.

Fonts typically include special characters called ligatures, which smooth the visual discontinuity between specific glyphs. A close inspection of much text, whether in print or on screen, will reveal that combinations like ‘ti’ are sometimes replaced with a special joint character called a ligature, where the dot above the ‘i’ and the stroke through the ‘t’ elide. A small flaw in the January 2016 update to the Merriweather Sans font meant the ‘fi’ ligature failed. So for a few hours the letter f disappeared.
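The stored characters never changed, only the glyphs that rendered them, which is exactly why no checksum complained. The snippet below is a small illustration of that character-versus-glyph distinction rather than a reconstruction of the Google Docs incident: the ‘fi’ ligature also exists as its own Unicode codepoint, one visible shape standing in for two stored letters.

```python
# A small illustration of the character/glyph distinction behind the incident.
# The 'fi' ligature also exists as a Unicode codepoint (U+FB01): one visible
# shape, standing in for two stored letters.
import unicodedata

pair = "fi"                  # two characters as stored in a document
ligature = "\ufb01"          # 'ﬁ' – a single ligature character

print(len(pair), len(ligature))                         # 2 1
print(unicodedata.name(ligature))                       # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFKC", ligature) == pair)  # True – normalisation recovers 'fi'
```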

Now I am as easy-going as anyone. We could have spent the hours joking about how the long ‘s’ of early printing presses looks like an ‘f’; we could have sniggered knowingly about how Google’s parent company is called Alphabet; we could have spent the time constructing ever wittier lipograms. But I am not sure I can cope with a world where that sort of uncertainty exists: where a distant micro-service can suddenly constrain my creativity to 25 characters. What fresh typographic hell is this?

It’s all good fun. The lights stayed on and we all made it to the end of the day. Someone noticed, someone fixed it, and we were all back at work the next day none the wiser. But it underlines a fragility in our data: if a small error in a font update can change the apparent meaning of documents, then something quite serious is happening to the authenticity and integrity of those documents. If it is true of simple things like documents, it will certainly be true of complicated ones like CAD plans, GeoData or multi-user relational databases. If it is true of things that are being used every day, where people notice unexpected flaws, then you can bet it will be true of things which need to be preserved but which are not used very frequently. If it is true of unexceptional or uncontested data, then it will be even more true of highly contested or high-impact systems.

The digital preservation questions that arise are profound: what are the limits of dependency; how can we trace such dependencies confidently; and how can we assess the risks they pose? What are the implications for licensing, and is there a potential for orphaned micro-services much as there is for orphan works? How might we ensure that data is authentic and accessible under reproducible conditions when the underlying services are continuously variable? These are not small questions. It would seem that a significant proportion of the cloud just became representation information.

Shared Space and Common Function

Both of these experiences came to mind while reading Jen Mitcham’s excellent blog on trying to archive Google Drive, then discussing it with her afterwards. It’s worth a read in full because, if you run any kind of repository or archive, this story is coming your way. To summarise in one sentence: a researcher was required to deposit data in Jen’s institutional repository and, instead of sending files or a disk across campus, sent a link to Google Drive. The rest you can pretty much imagine for yourself, and the parts you can’t imagine, Jen will explain.

For me the significant part of Jen’s account isn’t that this is strange, but that it is perfectly sensible and entirely legitimate. From the producer’s perspective, it’s a bit like being asked to archive a network drive: but it’s conceptually and practically very different for the archive. It nudges us to a new realisation about the coming digital preservation challenge: the cloud is not just where archivists manage and store data, it’s also where we all initiate, design and share our digital creativity. It is assembled on demand, it is always new, and it is almost impossible to pin down. We can generate downloads, but the act of downloading is a kind of migration or normalisation, and it makes the problems of migration and normalisation all the more obvious.

Executable Futures

There’s nothing very surprising to a digital preservation audience in the idea that files have dependencies, and that unless these are well managed the data they contain will become inaccessible. We’ve long debated the scope and nature of representation information and significant properties. It’s just another, albeit digitally transformed, version of archives needing context and description before they can be used. Indeed, if my head-hunter had been smart enough to supply a PDF/A then this whole blog might never have happened. But the implications are worth reflecting on.

If data cannot be separated from applications then the obvious response is to wrap them together and preserve them both. There’s a boundary issue about just how much of the software stack we’d need to preserve, but that’s not so different from the other types of recursion in representation information, and the solution is surely to revisit some of our experimental registry services and figure out why they have not fully worked. Natasa Milic-Frayling and colleagues in Persist have sketched out some interesting proposals for that. To turn that into a slogan: software preservation increasingly looks like the future of data preservation.
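What ‘wrapping them together’ might mean in practice is, at minimum, recording the software environment alongside the data at the moment of deposit. The sketch below does just that; the manifest layout is my own invention rather than any standard packaging format, and it captures only a Python environment, but the principle generalises.

```python
# A minimal sketch of recording the software environment alongside a dataset.
# The manifest layout is hypothetical – not BagIt, not PREMIS – and captures
# only the Python packages visible to the running interpreter.
import json
import platform
import sys
from importlib.metadata import distributions

def environment_manifest():
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": sorted(
            f"{dist.metadata['Name']}=={dist.version}" for dist in distributions()
        ),
    }

if __name__ == "__main__":
    # Deposit this next to the data, so the dependency chain is at least documented.
    with open("environment-manifest.json", "w") as manifest:
        json.dump(environment_manifest(), manifest, indent=2)
```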

Software preservation also loops us back round to the interconnected topics of emulation and virtualisation. It seems to me that trends within IT are leading us to a place where file-based approaches to digital preservation need to be rethought. If we cannot isolate data from application, then we need not only to preserve the software but also to ensure access to robust and reproducible execution environments, which is a strong argument for investment in emulation. Again, to turn that into a slogan: the future of migration is likely to be emulation.

So-far-so-inward-looking. Now rehearse the conversation you’re going to have in 10 years’ time with the head of cyber-security. Imagine saying that you want to run an old piece of software; that for reasons of authenticity and reproducibility it has to be the old, unpatched version that doesn’t include all the vital upgrades that keep the network safe; that you are unsure about what sorts of calls it might make to external services and sources; that the licences have perhaps expired; that you are a little unsure what the underlying data is but it’s probably business critical because otherwise you wouldn’t have kept it. Good luck getting through that day.

I no longer know what data is, but I know that I cannot continue to think that there is a clear boundary between software and data.  Although I may wish for simpler times, data and software have always been mutually dependent.   We’ve spent a lot of time persuading people to preserve data: it’s about time we persuaded them that the software matters too.

