Unless otherwise stated, content is shared under CC-BY-NC Licence

Preserving born-digital design and construction records: the #DPClinic

Jenny Mitcham

Last updated on 27 June 2022

On Friday last week we held a #DPClinic session to highlight and celebrate the Technology Watch Report we made available to the wider community earlier this month. In this session, we invited Aliza Leventhal and Jody Thompson to discuss their report on Preserving Born-Digital Design and Construction Records with the community and answer any questions that emerged. We also wanted to use the opportunity to ask some questions of the attendees and find out their perspective on the topic.

Sharing the Load

Helen Dafter

Last updated on 21 June 2022

Helen Dafter is Archivist at The Postal Museum in the UK


Readers of this blog will be well aware of the three-legged stool of digital preservation. One key element of this stool is staff skills. For some time I have been concerned that digital preservation skills at The Postal Museum are concentrated in one member of staff (myself). This is undesirable in terms of both organisational and individual resilience.

Re:Use - it's about time

William Kilbride

Last updated on 14 June 2022

I was honoured to give a keynote lecture at the start of the International Digital Curation Conference in June 2022.  The text below is a slightly adapted version of the talk which was also recorded and will be made available in due course.

I am grateful for the opportunity to speak today and for Kevin’s words of welcome and introduction.  I’ve been tied up lately with an awful lot of detail.  I am fine with that, but I won’t pretend it’s where I am happiest.  So, I am grateful for the opportunity to step out away from the day to day and try to search a wider horizon. 

The theme of iDCC this year is Re-Use.  Time and effort and expertise are spent on digital curation and preservation because we anticipate and seek to enable the re-use of digital materials in the future.  We recognise that re-use cannot be taken for granted – we have too many stories of how and why it can be hard to re-use resources – and we want to protect and transmit the significant efforts and benefits and opportunities with which our digital objects are imbued. 

There is no re-use without preservation.

This is a good theme for me as director of the Digital Preservation Coalition.  Much of this presentation develops things I’ve learned from our members over the years.  I’ll come back to the DPC later, but you should also know that this really is a personal reflection. 

There are two other equally important sources for what is going to follow. 

As Kevin has mentioned I fell into digital preservation by accident: through an interest in archaeology and data generated in field work.   That’s going to emerge as a theme.

Also, this talk is a longing to be in company again.  The pandemic has taught me so much about myself.  I can do a huge amount of my job – almost all of it – remotely.  And while I find digital preservation to be a constantly engaging and changing theme at the DPC, it’s the ‘Coalition’ bits of the job that hold my attention. 

I am eager to be moving around and working together again, in real time and in real spaces: meeting colleagues and sharing their times and spaces beyond the virtual.

So this paper will be a bit of a ramble.  I mean that literally. Let’s walk together.  There’s not a dress code for IDCC, in person or virtually.  But I recommend a sturdy pair of shoes for this one.

It's almost midsummer in the Northern hemisphere. We’re reaching the point in these nordic regions when the daylight and the gloaming are almost continuous; when dogs and their owners don’t sleep; children and their parents even less so. 

And we’re out for a walk you and me, in the soft light pre-dawn. Thick wet grass beneath us and chalk below.   We’re standing in the floor of a dry valley – not much more than a hollow really – walking slowly towards the darker skies of the southwest.  The path is straight and climbs gradually. It’s just discernible as a small bank to the left and right.  We’ve a way to go and distance and the darkness and the stillness and the gradient all are against us.  The thick dew chills our shins and calves but we can’t wait.  We’ve an appointment to keep.

Now the sky is lightening behind us, gradually illuminating some stone blocks about half a mile in front. They puncture the horizon and there’s a trick of the eye as they seem to float without a clear connection to the ground.  There’s a distance and a rhythm and we fall into silent step.

The deceptive levitation falls to Earth as we approach: there’s a bank and ditch which has disrupted our line of sight.  They define a flat, grassy circle about 100m in diameter and it’s thick with giants. 

The stone blocks dominate now.  Cold, almost damp to the touch and smooth too. Their patterns get ever more complicated the closer in we get: two rough-hewn sentinels that flank the path; a large circuit of neatly worked pillars and lintels four metres high which create a portal and glimpse of a tightly defined semi-circuit of smaller standing stones. All are dominated by a U-shaped arrangement of pillars and lintels almost eight metres high. 

We arrive at the centre of the enclosure and look back over our path. As we do, the rising sun of midsummer’s morning traces the avenue and lights up the core.    

We’ve kept our appointment. We’ve made it around the sun once more. 

Just to remind you, this is a conference about digital curation and I’m here to speak to you about re-use.  You might well be wondering when I’m going to get onto that.  So here it comes:  we’ve just successfully re-used some very old technology which still seems to be doing what it was designed to do around four and a half thousand years ago.  Beat that with a stick.

There are many esoteric themes we could develop here – I’m not about to go all new age or publish my travels with the Grateful Dead.  Let’s stick with the things we can verify.  Three points.

Firstly, you might think Stonehenge is old, and you’d be right.  But the structure that you imagine as Stonehenge is only the most recent version. The ditch and bank were already 500 years old before any stones went up and the cursus at the end of the avenue was perhaps 1000 years older still.   The Stonehenge of our imagining is a modern interloper.  We have just re-used something that was born of re-use.

Secondly, perhaps more important, it’s built for re-use.  There are predictable moments when it will work as intended and these have been coming round for generations.  You can’t sequester it to the neolithic because it’s a feature of every landscape since: the Bronze Age, the Middle Ages, late stage capitalism, the Romans, maybe even the Druids.  I can’t pretend to describe the conditions or rituals or meanings they would have attached to it: perhaps they ignored it, perhaps they celebrated in knowledge or in ignorance.  But there’s a significant point here about always-already-re-use being implicit in the design. 

Put another way, if it stopped working, then something has gone badly wrong.

Thirdly, someone is going to ask about impact sooner or later.  What was the impact of this new technology?  I like the idea of some Neolithic committee reviewing the grant application and a bright spark disputing the outcome and outputs.  I grant you, there’s an ‘if you build it they will come’ naivety to what I am about to say, but I do think that it’s possible to overthink the re-use and impact and inadvertently miss the bigger opportunity.

You can’t always predict how technology might come to be re-used.

The key phrase to remember from the Blue Ribbon Task Force is that all our actions are path dependent and temporally dynamic: we have to make choices and those choices have the potential to tie reuse to a given use case which may or may not be the best one in the long term.  It’s okay to admit that we don’t always know what the next or best use case will be. 

But this is a puzzle: path dependency implies making some choices about how re-use will proceed and enabling those; but in the process not constraining other re-uses that we might not have predicted and may be more impactful 100 years from now.

There’s a deep dive here which I will avoid because I’ve spoken about it before.  I think the concept of the designated community as expressed in the digital preservation literature has a colonialist tinge.  For decades now curatorial practice in memory institutions has been trying to dismantle the gatekeeping and exclusions which have historically inhibited access. Taking the side of users has often meant taking the side of justice: and if you’re not for justice then you’re for something less appealing.  But here we come, the digital curators giving ourselves the power to admit or ignore and finding all sorts of reasons to be happy with the silences that arise.

Back at Stonehenge and you’re slowly dying inside, because I seem to be telling you that your next grant application should be built like one of the wonders of the ancient world. As if completing your PhD wasn’t enough, I seem to think you should be Ramesses the Great.

Let’s breathe for a moment: yes Stonehenge was a lot of work for very many people over many centuries.  But it’s not the scale of the effort that makes it robust, it’s the design. I might as well have taken you to Warren Field near Banchory in the Grampians.  It’s less famous.  It’s a line of twenty or so postholes marking a series of posts arranged to align with the midwinter solstice and the phases of the moon, reconciling the asynchronous lunar and solar calendars. 

It's less famous than Stonehenge, and much less impressive on the ground. It’s not much more complicated than an intricate washing line.  It was probably built by hunter gatherers more than 10,000 years ago and it’s arguably the oldest astronomical calendar in the world.  Twenty or so poles not hundreds of tonnes of rock.  And it’s still marking the days and seasons, and all being well it will continue marking the days and seasons when the Voyager probe emerges from the other side of the Oort Cloud, 40,000 years from now. How’s that for low-cost long-term impact?

Technology endures if you want it to.  It endures if you build that in from the start.  It’s not the scale of the thing, it’s the design.

I’m drifting towards a core historical claim here with numerous exceptions and counter examples: but as far as the archaeological record is concerned, re-use is the norm not the exception.

That could be pure economic conditioning: scarce resources are more cheaply repaired than replaced.  It could be more deliberate and scheming: appropriation rather than re-use; and it could be the hermeneutic of the record: that things which are re-used are more durable and therefore more likely to be discovered and talked about.

If I am right, and even if I am not, it puts this year’s IDCC, with its theme of Re-Use, into a sort of historical context.  The question arises, why do we live in a time when re-use is not just taken for granted?  Why does there even need to be a conference about it?

You could frame that more assertively: why is it, on the verge of an environmental and ecological catastrophe, that re-use and recycling of expensive and hard-won resources are not an imperative from which deviation would be considered a gross abuse and global scandal? 

Maybe to answer this I need to position digital preservation in a more recognisable time scape. 

My sense is that digital preservation in one form or another began in the early 1980s.  I know there are earlier honourable mentions, but they are outliers.  It was the late 1970s and 1980s that saw computing become mainstream, and the 1990s that saw the exponential rise of home computing and the Internet.  These are the decades without which the digital estate might have taken a radically different direction (if it existed at all).  

So what?  Well, the economic forces that shaped the 1980s and 1990s created the norms of the digital universe we occupy today and thus, digital preservation. 

One might suggest that, had these economic processes been less consumerist, less disposable, more resilient, and more sustainable, then the endemic challenges of technical obsolescence, resource discovery and short-termism which we address in our daily work might not have arisen in the way that they subsequently have. 

It takes a massive leap of historical imagination to consider a world where obsolescence and abandonment were not the norm.  Imagine a world where we took for granted that data and software, if not reproducible, accessible and preserved, were no data or software at all.  Imagine if the costs of proper documentation were funded.  Imagine a research assessment exercise where, instead of saying ‘I found something new and it has changed everyone’, the proudest boast was ‘I kept something going and it has served people well’. 

By some strange process that I cannot fully delineate, digital preservation is permanently cleaning up after the neo-liberal economics of the 1990s. In the arc of human history, it’s our generation that seems to be the aberration. And that’s why we need a conference about re-use.

This is also why digital preservation and curation make bleak reading.  You’ve asked me to talk about re-use but in my mind, without preservation and curation, re-use is either impossible or seriously curtailed. 

I have two messages for colleagues in research data management and your approaches to openness and re-use. 

On one hand the types of tools and services emerging within the ecology of research data management are exemplary when compared to parallel sectors and industries.   

A lot of the core thinking around digital preservation emerged from the research community in the late 1990s – I’m particularly thinking about the emergence of OAIS from space science, though not uniquely.  Many other tools and concepts like the CoreTrustSeal have been widely and quickly influential beyond their initial scope.  So, if you will permit some word play, the tools and services which emerge at conferences like iDCC are not simply about re-use but there is a significant opportunity for them to be re-used.

But we can’t be complacent. DPC was commissioned to study the emerging landscape of the European Open Science Cloud in 2021, to assess whether and to what extent EOSC was equipped to deal with the digital preservation challenge. 

We were able to shine the light on lots of good practice: data management planning, persistent identifiers, policy drivers, repositories and continuous quality improvement through certification.  What was striking, though, was that the good practice seemed to be fragmented and inconsistent. 

There are good exemplars – I am proud to be associated with the Archaeology Data Service which has an end-to-end approach to data which puts preservation on the agenda early and sees it through to preservation and reuse at the other end.    It’s one of a small set of examples which show what can be done.  But there are too few examples where this is the case: it’s the exception not the rule. 

Data management plans are too often created as a gatekeeping exercise at the start of research and not opened or updated as the research continues.  They may have a passing resemblance to what turns up at the end but that’s not guaranteed.  Accountabilities and responsibilities are unclear.  The result is that the relevant data simply never reaches a repository, or worse it arrives in a state which significantly constrains re-use. 

I am minded that metadata requirements seem to be at the root of the problem: either because they are too big or too little or both: others might want to pick that up.  The reality is that, without the supporting and integrated infrastructure of policy and accountability there’s a risk that repositories are being set up to fail: they are robust in their own terms but are not able to meet the wider policy requirements because the supply chains are broken.

It doesn’t have to be like this.  It’s not like this in other sectors. 

You might not have spotted the data management planning associated with the Crossrail Project.

For those not familiar with it, Crossrail has been the largest transport project in the UK for many years.  It has built and connected new railway lines across the centre of London, linking Reading and London Heathrow Airport in the west, via Paddington station and the West End, with Stratford in the east.  It has admittedly cost £18.8bn, but probably significantly more.  A project of this scale has involved long supply chains with multiple contractors and sub-contractors and sub-sub-contractors each generating and using data, all ultimately under the supervision of a partnership agency which was wound up on completion.

It’s far too early to know whether this has genuinely succeeded in making data available for re-use in the flexible and open manner they intended, let alone in the way that we envisage for research data.  But the point is about understanding the supply chains involved in data management over large scale and complex projects, and the moments when preservation intervention is needed.

Going back to EOSC, certainly when we published our report we thought quite hard about where research data sits in relation to the BitList, the Global List of Digitally Endangered Species.  This provides a very high-level classification scheme for the risks faced by digital resources.  It includes classifications of ‘Endangered’ and ‘Critically Endangered’:

  • Digital materials are listed Endangered when they face material technical challenges to preservation or responsibility for care is poorly understood, or where the responsible agencies are poorly equipped to meet preservation needs.
  • Digital materials are listed Critically Endangered when they face material technical challenges to preservation, there are no agencies responsible for them or those agencies are unwilling or unable to meet preservation needs.

Almost all research data falls somewhere between the two.  We hope that the newly established EOSC Long Term Data Preservation Task force will move things to safer ground. 

Let’s go back to the idea of resilience emerging from design before creation rather than as an intervention afterwards.

We know what the digital preservation challenges look like: there are lots of interconnected problems and they are also dynamic so it turns out to be an emergent issue.  And we can quickly and rightly get sucked into the details of how to solve these issues.  But there’s a paradox in digital preservation: any solution is subject to the same processes of obsolescence as it is designed to address. This is ready made for solutionism, where the drive to find solutions ends up creating new problems.

We ultimately have to recognise that this is a socio-technical challenge, so if all we bring is technology we are likely to miss something important.  It’s not just about how, it’s about why.  Let’s go back to the start.  Why do we do what we do?

Digital materials (images, documents etc) have value.  They create opportunities in the real world. But access depends on software, hardware and people, and we can rely on the idea that technology and people change.  Therefore we face constantly emerging barriers to reuse, and constantly emerging barriers to those opportunities in the real world. 

Digital preservation is not for the sake of the bits and the bytes.  It’s about people and opportunity. We can recognize two contradictory truths: five decades of checksum integrity is no small achievement, but nor is it much of a boast.  If all we seek is preservation then we lack ambition: if all we want to facilitate is re-use then we're not going far enough.  We want to make and to change the world.

Our common challenge: not about saving digital materials, it’s about expanding opportunity and creativity. Our common challenge: not about avoiding a digital dark age, it’s about coming good on the digital promise.

I have a proposal for you. Let’s walk this path together.

Analysing PDFs with the PyMuPDF library

Edith Halvarsson

Last updated on 7 June 2022

This blog post is by Sebastian Lange, Software Engineer with the Bodleian Digital Library Systems and Services (BDLSS) department and Edith Halvarsson, Digital Preservation Officer with Bodleian Libraries’ department of Open Scholarship Support.  


Like many heritage institutions Bodleian Libraries holds a vast collection of PDFs, created in various flavours and software over the past 20 years. These documents have come to the libraries from diverse sources – such as digitization suppliers, academic depositors, and born-digital personal archives. 

We wanted a quick and dirty way of scanning our PDF collections for particular features, tailoring these to the needs of the Libraries’ vast and diverse collections. Using the PyMuPDF library we created a small tool which helps us gather more information about the current state of our PDFs, especially but not exclusively, regarding their accessibility. While our PDF analysis tool is less detailed than validation tools (like veraPDF), using the PyMuPDF library can be a good first step for analysing PDFs and flagging potential high-level digital preservation risks.
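
As a rough illustration of the kind of high-level scan described above, a minimal sketch might look like the following. This is not the Bodleian tool: the function names `summarise_pdf` and `scan_pdf` and the choice of features are assumptions made purely for illustration. It relies only on PyMuPDF's `Document.metadata`, `page_count`, `is_encrypted` and `needs_pass`, and on `Page.get_text()` and `Page.get_images()`.

```python
from typing import Any, Dict


def summarise_pdf(doc: Any) -> Dict[str, Any]:
    """Collect a few high-level features from an opened PDF.

    `doc` is expected to behave like a PyMuPDF Document: iterable over
    pages, with `metadata`, `page_count`, `is_encrypted` and `needs_pass`.
    (Hypothetical helper, not part of PyMuPDF.)
    """
    metadata = doc.metadata or {}
    summary: Dict[str, Any] = {
        "pdf_version": metadata.get("format"),
        "has_title": bool(metadata.get("title")),
        "encrypted": bool(doc.is_encrypted),
        "page_count": doc.page_count,
        "pages_with_text": None,
        "pages_with_images": None,
        "likely_image_only": None,
    }
    # Pages cannot be read until a password is supplied, so stop here.
    if doc.needs_pass:
        return summary

    with_text = 0
    with_images = 0
    for page in doc:
        if page.get_text().strip():
            with_text += 1
        if page.get_images():
            with_images += 1

    summary["pages_with_text"] = with_text
    summary["pages_with_images"] = with_images
    # Images but no extractable text suggests scans without a text layer,
    # which is a potential accessibility risk worth flagging for review.
    summary["likely_image_only"] = with_images > 0 and with_text == 0
    return summary


def scan_pdf(path: str) -> Dict[str, Any]:
    """Open `path` with PyMuPDF and summarise it (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily so the module loads without it

    with fitz.open(path) as doc:
        return summarise_pdf(doc)
```

Calling `scan_pdf("example.pdf")` on a file of your own returns a small dictionary that can be written out to CSV and sorted to find candidates for closer inspection with a validator such as veraPDF.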

Digitisation of The Scotsman Collection: Digital Access and Preservation at Historic Environment Scotland

Christopher Viney

Last updated on 17 May 2022

Christopher Viney is Archive Digitisation Officer at Historic Environment Scotland


Over the past decade issues in access to archival collections have been thrown sharply into focus, demonstrating the value of digitising analogue archival material. This provides a greater level of access and mitigates potential barriers that can often shut people out of our institutions. Indeed, digitisation and digital preservation provide a key tool in the work of Historic Environment Scotland to deliver on its vision ‘to make sure Scotland’s heritage is cherished, understood, shared and enjoyed with pride by everyone’. After the fantastic work of the Archive Digital Project, showcased in the online exhibition “Beyond the Physical”, Historic Environment Scotland has continued to improve access to and preservation of our collections. One such recent project has been the digitisation of The Scotsman Collection.

No time to waste: what’s ticking at CLOCKSS

Alicia Wise

Last updated on 11 May 2022

Alicia Wise is Executive Director of CLOCKSS


Time is a thief of memory, even for formal publications, unless long-term digital preservation arrangements are in place. It takes a community to safeguard the scholarly record. It is too big a job for any single organisation, and too horrific for our species if done badly.

Reducing the pain of procurement

Michael Popham

Last updated on 6 May 2022

Michael Popham is Digital Preservation Analyst at the DPC and Jenny Mitcham is Head of Good Practice and Standards at the DPC.


Perhaps you’ve been given the go-ahead to procure a “digital preservation system”, or you’re trying to work out what differentiates such a system from the applications and infrastructure that you already have in place? How do you decide what you really need, especially in light of the rapidly evolving marketplace of commercial and open source preservation solutions? The DPC has recently launched a set of resources designed to help.

First steps to a guide for computational access to digital repositories

Leontien Talboom

Last updated on 4 May 2022

Leontien is a collaborative PhD student at The National Archives, UK and University College London; her research is about access to born-digital material. 


Within the digital preservation community, the term computational access is popping up more and more frequently. It is often linked to other terms such as artificial intelligence, data mining and deep neural networks. However, there is often little understanding of what these terms actually mean and how they relate to each other. 

A #DPClinic chat about persistent identifiers

Jenny Mitcham

Last updated on 3 May 2022

On Friday last week, our latest #DPClinic chat delved into the topic of persistent identifiers (PIDs). As I remarked at the start of the session, persistent identifiers are something that pop up as an example of accepted good practice in DPC RAM, our Rapid Assessment Model. They are mentioned at the managed level of the metadata section with the example “Persistent unique identifiers are assigned and maintained for digital content.”

Capturing and preserving practice based research

Holly Ranger

Last updated on 27 April 2022

Holly Ranger is Research Data Management Officer in the Research & Knowledge Exchange Office at the University of Westminster


Practice Research Voices (PR Voices) is an Arts and Humanities Research Council funded project led by the University of Westminster. The project is scoping the development of an Open Library of Practice Research for the dissemination and preservation of practice research, building on existing software and standards and guided by open research principles.

‘Practice research’ is ‘an umbrella term that describes all manners of research where practice is the significant method of research conveyed in a research output’ (Bulley and Sahin, 2021). Practice research outputs are typically multi-component portfolios or collections of non-text file formats which are disseminated and hosted in separate places such as personal websites, institutional repositories, archives, and commercial video-sharing platforms. These factors pose a significant challenge to the preservation and reuse of practice research and practice research data.
