I was asked recently to sketch out some thoughts about archives and artificial intelligence. I am drawn to the topic as usual but with little real clue of where to start, so my point of departure is a comment on ethics. I have no real mandate to frame the ethical tone for what should be a very important debate, but if we don’t start here – if we put technology first – then there’s every possibility that we will end in the wrong place, either through sterile solutionizing, or worse by selling the whole farm to obscure, unaccountable and deeply unattractive corporate interests.

So, my challenge is this: can the methods and approaches of digital preservation help build an artificial intelligence that works for the people, with the people and perhaps even by the people?

It seems like an awfully remote prospect. The large technology companies have mostly escaped the social contract of an earlier generation of capitalists, with incredibly large client bases and incredibly large net worth built on remarkably small workforces. There are good reasons why the rest of us might feel estranged and apprehensive about the ways artificial intelligence could (and already does) constrain, shape and control our daily lives. Nor, for that matter, have the tech giants covered themselves in glory. Instead they have engineered dramatic and secretive incursions that all too often disregard the security and well-being of many thousands – at times many millions – of their users. Artificial intelligence seems to be applied in business and corporate systems to us, and about us, but seldom with us. That leads me to adopt the old anti-colonial slogan ‘Nothing about us without us’. It’s an old adage, but it’s hard to think of a more appropriate counterproposition to how AI is perceived.

So far, so ranty. This is a digital preservation blog, so a digital preservation theme has to emerge sooner or later. It seems to me that there are four ways in which we intersect with artificial intelligence:

  • We can use artificial intelligence to do digital preservation better. That’s coming into focus slowly but noticeably.
  • We can become better at capturing, receiving and arranging the inputs and outcomes of artificial intelligence, knowing and documenting the variables in play and the dependencies that exist between them.
  • We can take steps to preserve artificial intelligence at a systemic level. That’s something we’ve barely even begun to think about. We’re used to worrying about data and our usual complaint is that as the amount of data grows, so the job gets larger. We’ve largely forgotten that the quantities of data mean that ranking algorithms and personalization of view are more important to the shaping of public discourse than the content that they purvey (DPC 2019, 86).
  • And we can disrupt or take over artificial intelligence – creating, monitoring and maintaining AI and developing the kinds of services on which it depends.

All of these are possible, but it’s the last one that has my attention. Instead of fearing change, can we be the change?

I am going to make some lazy assumptions about artificial intelligence. For example, I am going to focus on an important developmental subset of AI which is increasingly ubiquitous: big data analytics (for a readable introduction see Mayer-Schönberger and Cukier 2013). This, for better or worse, is how most of us have used, and have been used by, artificial intelligence. What does our experience of big data analytics indicate about the operation of artificial intelligence and the role of digital preservation there?

There’s a clue in the name: big data analytics depends on the existence and maintenance of big data. How should we conceptualize this big data? How does it arise? It comes from the digitization of just about every aspect of life. This is one of the critical developments of the last few years and it’s easy to become desensitized to the repeated and ever-more-frenzied statements about data growth. Let me be clear that I am not talking about the 2D digitization that replicates paper records as digital surrogates. The creation of digital surrogates is slow and expensive, and my point is to underline how little digitized content there is compared to how much born-digital content exists and is created every day.

You can’t describe this digitization without acknowledging the incursion of technology into our daily lives and how that gets turned into data points of one form or another. This data is most often about us, but mostly it’s not for us: it mostly benefits those with corporate interests (Zuboff 2019). I am thinking about internet-enabled beds which will report how well you are sleeping; or vacuum cleaners that report the cleanliness of your house. I am thinking about phones that know how far you walked today (every day); and how that compares to the calorie count from your Uber Eats order; and which might be selling that data to your insurance broker so he or she can sell you a pension you won’t live long enough to earn. Privacy is the only luxury that most of us can no longer afford (Lovink 2019). It’s almost impossible to avoid being someone else’s behavioural surplus.

Underpinning this digitization is a process of recombinant innovation. The deep story of the digital age is not really about innovation: it’s about infinite recombinations of existing innovations with infinitely expanding data points: one small innovation is combined with another to create a third innovation, and so on ad infinitum (Brynjolfsson and McAfee 2014). The opportunities are vast and depend on being able to crunch ever greater quantities of data. The more data, the more combinations are possible. That’s because, alongside the growth of data and if anything even more astonishing, there is the seemingly limitless acceleration of processors.

So, big data analytics tells us that more data and faster processors are the building blocks for AI; that we cannot really escape being bought and sold in the marketplace of surveillance capitalism; and that because innovation is fundamentally combinative there is no practical limit to how it is likely to expand.

This is where I start to part company with the promise of big data analytics and by extension AI. I spy three problems.

Firstly, if your digital preservation klaxon hasn’t gone off already then you need to change its batteries. Data frequently goes dark faster than it can be used. We can barely get PDFs to work through two generations. We know that, but if someone could tell the Financial Times or the Wall Street Journal then I’d be very grateful. How then can we generate meaningful time series data at anything resembling the real world where change is slow and incremental? If AI is built on data, then it’s built on sand.

Secondly, and related, David Rosenthal has been telling us for many years now to beware the rising costs of data storage (Rosenthal 2012). We have grown up with an inter-generational price crash of data storage and may now be blind to the possibility that it will not be the historic norm. A point will come, in fact it has arrived already, where the price crash will end. If storage costs level out, then our consumption habits will have to change rapidly too. Otherwise the blunt trauma of economics will change them for us. One can already see chaotic processes of data loss in which the haphazard exigencies of business failure decide what digital legacies our children can enjoy.

Also, remember that miniaturization is finite. All the computing that’s ever impressed us is about electrons passing through transistors. By all means you can increase computing power by cramming ever larger numbers of ever smaller transistors onto silicon wafers. But there’s an end point to this and it’s already in sight. The transistor was invented in 1947; and just because we’ve only ever known computing power to have been increasing ever since, it doesn’t follow that this will be the historical norm. You cannot change the laws of physics: you cannot miniaturize the electron.

I promised I would relate this to archives. Here we go.

The proponents of big data analytics argue for more-better data and more-better computing: keep everything because everything has potential, everything will be useful, storage is cheap, and superabundance is an opportunity. Don’t bother cataloguing because we’ll get around to that with super-duper new technologies. Metadata shmetadata. Archivists demur: less is more, appraisal empowers use, storage is expensive, overabundance is a risk. Cataloguing is critical. Metadata matters. It seems to me that, for the reasons given a moment ago, archivists have history on their side. Don’t even get me started on global warming.

That would be a nice conclusion when speaking to a digital preservation audience. But it also leads me to direct two further challenges which may be altogether more uncomfortable.

Firstly, for digital preservation: where are the journal articles and conference papers, not to mention the techniques and tools which enable archival appraisal of digital materials at scale? Maybe I missed them. Reading the digital preservation literature, it’s hard to know what we can afford to keep and what we cannot afford to lose.

If my hunches about technology are right, then we need to get a move on. Currently preservation seems to be a binary state: either something is preserved, or it is not; a repository is trusted, or it is not. There is an urgent need for the digital preservation community to face up to the suffocating proliferation of digital materials and respond with much more subtlety. That means better capability to identify the digital objects that matter, and more nuanced intentions for everything else. If we can identify and prioritize high value collections, we can set equally high expectations about how much effort it’s going to take to preserve them. By extension that means we are also identifying those things which are not needed and which we would do well to delete. If we do not dispose of something, we will not be able to preserve anything. And somewhere between these two extremes, perhaps we can begin to envisage a middle ground, where less intervention and best endeavours enable planned, perhaps even graceful decay. ‘Just in case’ is not a bad argument; but we can’t let it consume all our resources.

Secondly, and this is where the heat-ray falls especially on archivists: appraisal in the digital age isn’t easy but it’s vital. In an era of post-truth obfuscation and sinister deletion, the ability to collect, retain and authenticate is suddenly a super-power. In an era of relentless proliferation, the confidence to select and consolidate, with implied permission to relegate and de-duplicate, is ubiquitously essential. In an era where data is the ‘new oil’ of the ‘information society’, we hold the keys not only to the past, but now also to the future.

One would have thought that this generation more than any other would be the age of the archivist, a continuing proof of common cause for the common good. I know it doesn’t feel that way. I wish it did, but if archival appraisal practice were more cleverly embedded in digital preservation tools; if together we can codify and express our expectations and assumptions about archival significance; if in our generation we can remake the tools and techniques that have served the cause of truth and authenticity for centuries, then perhaps there’s a chance.

On one hand, if you want to know how to control and influence artificial intelligence, then you could do a lot worse than use the tools of artificial intelligence. That will both secure and transform the archival profession. Let’s use the weight of the tree to ensure it falls in our favour.

And there’s no appraisal without values (Caswell 2019). We might pretend that our role is the humble and neutral functionary of an objective record, but I don’t buy that. If we get this right, then we’ll have an artificial intelligence that embeds the value judgements and the ethics that we bring to the table through our appraisal processes.

So, let’s get appraisal and selection to the top of our shared research agenda in digital preservation. In our sights, and in our hands: artificial intelligence by the people, for the people, with the people.

Nothing about us without us.


Brynjolfsson E and McAfee A 2014 The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies, Norton, New York

Caswell M 2019 Whose Digital Preservation? Locating our standpoints to reallocate resources, in M Ras and B Sierman (eds) iPRES2019, 16th International Conference on Digital Preservation, Amsterdam, online at: https://vimeo.com/362491244/b934a7afad

DPC 2019 The BitList 2019: The Global List of Digitally Endangered Species, Second Edition, online at: http://doi.org/10.7207/DPCBitList19-01

Lovink G 2019 Sad By Design: On Platform Nihilism, Pluto Press, London

Mayer-Schönberger V and Cukier K 2013 Big Data: A Revolution That Will Transform How We Live, Work and Think, John Murray, London

Rosenthal D 2012 Storage Will Be A Lot Less Free Than It Used To Be, in DSHR Blog, online at: https://blog.dshr.org/2012/10/storage-will-be-lot-less-free-than-it.html

Zuboff S 2019 The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power, Profile, London



I am grateful to Lise Jaillant who provoked me to consider the issues of archives and artificial intelligence and to Sarah Middleton, Jen Mitcham, Sharon McMeekin and Paul Wheatley who commented on an earlier draft of this paper.

