Paul Wheatley

Last updated on 30 March 2023

Last Thursday I took part in an OPF panel debate "Do unacceptable file formats exist?" It was great fun to debate file format policies and preservation approaches with Valentijn Gilissen from DANS. And we had amazing moderation from Sam Alloing, who had to get to grips with a text chat that exploded out of control as 200 audience members all dived in with loads of fascinating insight. The exported text chat was 50KB long!

Sam and Valentijn had graciously asked me to join them for the debate after I posted my last blog post "File format recommendations - I wouldn’t say they are unacceptable, but I wouldn’t recommend them either". As we try to continually understand and solve the variety of challenges the digital preservation world throws at us, we of course won’t always agree on the approach we take. But coming together and discussing our differences has to be the best way forward for a collaborative community. So many thanks to Sam and Valentijn for making this happen.

One of the broader points I attempted to make in the debate was that file format obsolescence isn't the number one risk facing our digital content. This is something that others (particularly David Rosenthal) have talked about many times before, but I kind of got collared for a new blog post on this, so here it is....

But isn't this just Paul doing a bit of whataboutism?

Well yes, this was a debate about file format stuff. Guilty as charged. But my concern around these blanket approaches to premeditated migration of large numbers of file formats, without identified preservation risk, is not just about the risk of damaging data in the process. It’s the worry that we might be better spending our limited resources elsewhere. So that's what I'm going to dig into below...

The top 10 risks?

In a cavalier manner during the webinar (I hope those who watched it realised I was playing a bit of a provocative role to spice up the debate) I suggested that file format obsolescence wasn't near the top of, or possibly even in, the top ten risks facing our digital content. Of course, in reality the file format issue is a little more nuanced than that - in some ways that was my whole argument in the webinar. But anyway, this isn't a blog post about the top ten risks (what a cop out!) but it is a few thoughts about what some of the big risks might be.

Digital preservation risks

Cyber attack has to be near the top of the list. Ransomware is a big deal, frequently affecting all sorts of organizations that had thought they were well prepared on the cyber security front. Read any story about what it did to an organization, and it should rightly send a shiver down your spine. Mitigating it means getting your bitstream preservation right: your multiple copies, different technologies, routes to access those different copies, validating your mitigations over time and so on (I talk about this a little bit here). There are some really vital things to consider, and digital preservationists themselves need to be on top of this. This is not just a role for the IT department and their firewall! DPC Members who missed it: here are the recordings of our recent cyber security event. It’s an eye opener!

Closely related to this risk is human error in all its various guises. Reported incidents of serious data loss often seem to involve multiple issues. There was a problem with the usual scheduled backup. There was a power cut to some of the servers. And then someone did a data migration and pressed the wrong button, executing the right process in the wrong order... Boom! Digital preservationists need to be really worried about the potential for *stuff just going wrong*. They should be thinking about where those points of failure might be and what can be done to lessen the impact or catch the problem before it’s too late. During the SPRUCE Project we came up with the motto "Trust nothing, validate everything". Steady at the back there - I'm not talking about validating file formats! But I am talking about making your digital preservation processes methodical and verifiable. Double check processes that move or change data. If it’s a manual process, another person should check it - people make mistakes. If it’s an automated process, a different bit of software should validate it - all software has bugs. There’s something to be said for not just having mitigation in place for your various bitstream preservation / security risks, but for looking at how you monitor or verify that mitigation. Reporting and governance structures are key.
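To make "double check processes that move or change data" a little more concrete, here is a minimal sketch of one way it could look for a simple file copy. It assumes Python and SHA-256 checksums, and the function names (sha256_of, copy_and_verify, verify_copy) are made up for illustration, not taken from any particular preservation tool. It records a checksum before the copy and compares it with one computed from the copy afterwards.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source: Path, destination: Path) -> bool:
    """Return True if both files have the same SHA-256 digest."""
    return sha256_of(source) == sha256_of(destination)


def copy_and_verify(source: Path, destination: Path) -> str:
    """Copy a file, then re-read the copy and compare checksums.

    Hypothetical helper: raises OSError if the copy does not match
    the checksum taken from the source before copying.
    """
    expected = sha256_of(source)        # checksum taken before the move
    shutil.copy2(source, destination)   # the process being checked
    actual = sha256_of(destination)     # checksum of what actually landed
    if actual != expected:
        raise OSError(
            f"Checksum mismatch for {destination}: "
            f"expected {expected}, got {actual}"
        )
    return actual
```

This only catches one kind of failure, and because the copy and the check live in the same script it isn't truly independent. In the spirit of "a different bit of software should validate it", the recorded checksum could also be compared by a separate tool at a later date.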

Perhaps equally pressing is the risk that we simply don't preserve content that will otherwise be lost. Literally, we could be worrying about whether BMP is a preferred format or not (this is a reference to the webinar, in case you're wondering) whilst the Archive Team uses volunteer effort to rescue data from the latest social media site to be closed down because it doesn't fall under the collecting policy of institutions designed before the internet even existed. That's just one example, and yes I'm again being deliberately provocative here, but if we don't collect it, it won't be preserved.

And finally, here it is, the biggest risk facing our digital content: resourcing. If we don't have the money to do it, it won't get done. I'm yet to encounter an adequately resourced digital preservation outfit. So to my mind that makes prioritisation really important. We have to attempt to understand the range of complex and multi-faceted risks before us and then try as best as we can to prioritise our limited resources to mitigate those risks.

And that's one of the main reasons I worry about pre-emptive file format migration for mainstream, low-risk file formats. If we spend too much energy there, there's probably another important risk that isn't getting enough attention.
