Last week I joined a webinar entitled “A Comparison of Recommended File Formats and the New Dutch Method for File Format Assessment”. It’s great to see the outcomes of this collaborative work, and it’s clear that it has already played an important role in bringing out some key themes in the preservation approaches of various organizations. But I felt that a number of aspects give cause for concern. The collation of file format policies has highlighted some approaches that I believe should be challenged and discussed further. I also suggest that the communication of the work could be enhanced to avoid misleading readers and propagating potentially poor preservation practice. These are of course just my thoughts on this topic, but I’d love to continue the conversation and try to better understand the motivation around these approaches to addressing file formats.

The work and the presentation

There’s quite a rich history to this kind of work. For example, Rog and Van Wijk reported their efforts on ‘Evaluating file formats for long-term preservation’ in 2008, and Stephen Abrams published a chapter on File Formats in the Digital Curation Manual in 2007. This and other state of the art research was summarised in a Technology Watch Report by Malcolm Todd in 2009, who concluded that cost and user needs were too often overlooked in such analyses. In 2013, Johan van der Knijff responded to a series of activities to score file format risk with a blog post that questioned the largely unproven foundations of this approach. Tellingly, Johan also stated "It also makes me wonder why this idea keeps on being revisited over and over again."

The most recent discussion centred around a relatively new spreadsheet which collates details of the file format policies of 24 different institutions. Most were archives. The spreadsheet lists each organization’s policy for around 250 different file formats. The “policy” in each case is summarized with one of the following scores:

  • 2: Preferred

  • 1: Acceptable

  • 0: Acceptable for imminent transfer

  • -1: Unacceptable

  • Null

The spreadsheet can be viewed here.

Preferred, Acceptable or just plain Unacceptable?

I feel that “Preferred” is a relatively benign concept within these format policies. If it’s possible to influence the creators of data that will be preserved by an institution, then providing a list of “preferred” formats that might be easier to preserve could well be helpful. If a decision is made by the creator to produce their data in a “preservation friendly” format then that seems a useful outcome. Of course, users are prone to not following optional guidance. Whether it’s file format recommendations or records management guidelines, we can ask, but more often than not, many users will create data in a form of their choosing in a way that is convenient to them. But ultimately, flagging some file formats as “preferred” seems unlikely to cause any major difficulties.

Notions of “acceptable” or “unacceptable” formats are a much more concerning concept for me. If an organization refuses to accept data in a particular format, what will be the likely outcome? Will the creator/depositor of data in an “unacceptable” format take the time to go back to the drawing board and carefully recreate their data in an “acceptable” format? I suspect many will not. More than likely they will give up and their data will not be preserved, or they will attempt to migrate their data to an “acceptable” file format. What we know about file format migration suggests it is a complex process that is usually lossy. Asking a depositor to perform a migration, without any of the expected digital preservation best practice processes such as verifying the accuracy of the migration or documenting the process seems a risky and unhelpful approach. Even saving out a file from an application in (in theory) a more standard/preservable/open format, rather than the native/proprietary format may result in loss of data.

Why would an organization therefore restrict what file formats they accept? Having a short list of file formats to preserve certainly makes things easier and cheaper, but if the data you have acquired has already been damaged – and perhaps arguably is now of suspect authenticity (archivists grimace now) – then this seems a counter-productive strategy. I raised this issue in the webinar. Two of the speakers observed that their organizations did not restrict what formats they accepted. The third speaker noted they were in a difficult position as their organization had a restrictive format policy and they hinted they were not sure this was the best approach. This was not resounding support for much of the actual contents of the work they were presenting, and I was left a little uncertain as to why they were doing it at all. Although I was of course glad to hear they had reservations about restrictive format policies!

According to the spreadsheet, 9 of the 24 institutions described have considerable numbers of “unacceptable” file formats listed in their policies. That’s just over a third of the listed organizations. Maybe this approach really suits their particular situation. But I worry that this gives the impression to the user that this is a strategy that should be followed more widely, when in fact it appears not to correlate with good digital preservation practice. The first speaker in the webinar noted that the work as a whole was not to be considered a recommendation of what formats to preserve, but the title of the spreadsheet, “International Comparison of Recommended File Formats”, and the lack of associated guidance to explain what the presenter said appear to set this resource up to mislead. There are of course many organizations that do not have restrictive file format policies, and if they don’t have a “file format policy” they might not be listed in the spreadsheet at all. Again, this could easily create a false impression that restrictive format policies are typical.

Digging deeper into the spreadsheet

I won’t pull out specific examples from the spreadsheet as I don’t want to comment on the policies of particular institutions, but I don’t think I can write this blog post without observing that there are some rather bizarre details to take in. Since when has SVG been a preservation worry, or a range of audio formats been anything but simple to play back with open source software, or has CSV made preservationists shudder, or container formats been seen as an issue, or SIARD been a bad format for database preservation (maybe in specific cases, more on "it depends" later), or PDF championed for unusual and complex data types (ok maybe not that last one, we’ve certainly seen that before)? And I’m trying to be on my best behaviour by not mentioning *cough*TIFF and spreadsheets*cough*. If I had any hair I could get hold of, I think I would be pulling it out! Thank goodness we have an entry from the National Archives of New Zealand, an institution with a history of digging into complex preservation challenges and sharing their learning with the community. They list every single file format in the spreadsheet as “acceptable”.

Preferred, inferred or an unacceptable approach to file format policy?

More broadly, I’m afraid I have some further concerns with this work. It places a lot of importance on the notion of format, with little mention of other considerations when performing preservation activities. Whilst it’s obviously very useful to know what file format your data is in, my impression of the community is that we have largely moved away from the idea of taking preservation actions based on just this one criterion. Developments in the community, such as the latest version of the NDSA Levels, certainly back up this statement. This is in part because our data is becoming more complex. A digital object is often made up of many different file formats or is intrinsically linked with software. Considering file format in isolation is therefore often unhelpful if not meaningless in these cases. The other reason is that experience and well documented preservation good practice suggest that a whole array of contextual characteristics needs to be considered before taking preservation action. And in the weeks following the acrimonious purchase of Twitter by Elon Musk and the disruption that has followed, it should be clear that there’s much more to the issue than format obsolescence…

On the introductory tab the spreadsheet provides a short list of 5 potential preservation strategies. Whilst it briefly notes “emulation” and “museum” as options, the emphasis across this entire work is clearly on file format migration. Two approaches to migration are listed, including something that I’ve not seen articulated in this way before: “Early migration”. This is described as “Data are migrated to a designated portfolio of file formats before submission. The responsibility for the migration therefore lies with the data producer…” Again this appears to be giving authority to what I think is a questionable approach. It’s also at odds with growing support for notions of minimal effort ingest and parsimonious preservation. In short, we know that preservation actions like migration are usually damaging, so it might be best to only take action when it is really needed, if ever. We typically don’t have the resources to speculate and pre-emptively migrate, and almost certainly don’t have the resource to pre-emptively migrate in a dependable, well documented, accurate and verified manner. If we (the preservers) don’t have the resource, how can we expect the depositors to be able to afford it, never mind do a quality job, given that they will likely have little digipres expertise or tech? One of the 26 institutions listed in tab 2 of the spreadsheet has their digital preservation strategy described as “Minimal effort ingest…”. It’s a shame this isn’t listed in the range of different strategies on tab 1.

A final worry is the way the spreadsheet adds up the results of the file format policies into a single numerical figure which is labelled as a “Total Score”. One of the webinar presenters noted that this was not an indication of format preference that should be followed by others, but of course that is exactly the impression that the spreadsheet gives.
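To illustrate why a single summed figure can mislead, here is a minimal sketch. It assumes the “Total Score” is a plain sum of the numeric policy scores with Null entries skipped, which is my reading rather than a documented formula. The two formats and their scores are invented for illustration and do not come from the spreadsheet, and the function names are hypothetical.

```python
from __future__ import annotations

# Score scale as listed in the spreadsheet; None stands in for "Null".
PREFERRED, ACCEPTABLE, IMMINENT_TRANSFER, UNACCEPTABLE = 2, 1, 0, -1


def total_score(scores: list[int | None]) -> int:
    """Sum the stated scores, skipping Null entries (an assumption)."""
    return sum(s for s in scores if s is not None)


def spread(scores: list[int | None]) -> dict[str, int]:
    """Report how many institutions stated a policy and the range of scores."""
    stated = [s for s in scores if s is not None]
    return {"stated": len(stated), "min": min(stated), "max": max(stated)}


# Invented policies from six hypothetical institutions for two made-up formats.
contested_format = [PREFERRED, PREFERRED, UNACCEPTABLE, UNACCEPTABLE, None, None]
tolerated_format = [ACCEPTABLE, ACCEPTABLE, IMMINENT_TRANSFER, IMMINENT_TRANSFER, None, None]

for name, scores in [("contested", contested_format), ("tolerated", tolerated_format)]:
    print(name, total_score(scores), spread(scores))
```

Under these assumptions both formats end up with the same total, even though one is preferred by some institutions and rejected outright by others, while the other is merely tolerated everywhere. The sum hides exactly the disagreement and context that matter.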

Final musings

Are particular file formats “good” or “bad” for preservation? Should arbitrary format “sustainability” scores be used in isolation to decide when to transform data? Should file formats be normalized or migrated before they even reach a preservation system? I think the answer to these questions is a resounding no. A scoring system for format sustainability might give a quick impression of preservation challenges that lie within a repository full of different formats, but it really shouldn’t be the basis for changing that data for the purposes of preservation. A sensible preservation approach for a specific file format in one institutional setting cannot be robotically reproduced in a different setting with any confidence of success. Digital preservation is a complex and challenging task. Its decision points can’t be easily summarized or automated. Context is everything. Yes, I’m going to say it once again: *it depends!*

  • Who is the user of the data?

  • Why is it being preserved?

  • How will it be accessed?

  • What are the legal considerations?

  • What will it cost?

  • How effective and accurate are the preservation tools?

  • Have we tested the action and analysed the results?

If we don’t have the resource to navigate a complex decision well, I argue that we need more resource to make it happen, not revert to a crude decision process. And if we don’t have sufficient resource then let’s target what we do have more effectively. I don’t think pre-emptive file format migration is at the top of the priority list.

 

This blog post was edited to include a reference to Johan's very relevant 2013 blog post - thanks Johan.

Comments   

#1 Rachel Tropea 2022-11-24 23:45
Really enjoyed this article Paul (and almost any article that includes the word 'context'!), it's so important to not let crude policy dictate what is accepted into an archive - no disrespect intended. Policy can tend towards over-simplification / generalisation, and avoid complexity. Advising content creators about file formats at the point of creation, or ideally even earlier at the ideation phase of research or project or similar, is a way to raise awareness but regardless, file formats can't always dictate what we keep or destroy. And as for asking producers to emulate, it's unrealistic. Sometimes when we ask people to provide a list of their records (and I'm talking tens or hundreds of records - often half a day's work at most), we don't hear from them again!

"Taken out of context I must seem so strange" (Ani DiFranco)
#2 Paul Wheatley 2022-11-30 15:26
Thanks Rachel. That's a great celebration of "context" - love it! And perhaps only "Ransomware" or "Human Error" come anywhere near "Not getting the data to preserve in the first place" on the digipres risk scale...
#3 Stephen Abrams 2022-12-04 18:59
Paul: I agree with your overall argument. At Harvard, while we still have some references to preferred, acceptable, etc. formats on our website, this terminology is explicitly NOT referenced in our preservation policy. Instead, we say that we will accept any format, but with the important caveat that prospective long-term outcomes may differ, potentially significantly, depending on what that format is. If contributors have any discretion regarding format, we're happy to discuss the best possible choices in order to maximize opportunities for positive outcomes. In essence, we've changed the focus of consideration away from strict object Eligibility towards softer user Expectation. We think this better reflects the reality of long-term customer needs and aspirations as well as maximizing acquisition of Library and University content into a managed stewardship program, without which a negative long-term outcome is almost certainly inevitable.
#4 Paul Wheatley 2022-12-06 15:17
Thanks for that response Stephen! I love this summary "In essence, we've changed the focus of consideration away from strict object Eligibility towards softer user Expectation" as ever, far more eloquent than my ramblings *:-)
