Leslie Johnston is Director of Digital Preservation at the U.S. National Archives and Records Administration (NARA) in Washington, DC, USA.

In 2017 the DPC announced its “Bit List” of Digitally Endangered Species as a crowd-sourcing exercise to discover which digital materials our community thinks are most at risk, as well as those which are relatively safe thanks to digital preservation. The “Items of Concern” portion of the list included PDF, a format that warrants some discussion and a use case.

Let’s start with a number: the U.S. National Archives and Records Administration (NARA) has over 10.2 million PDFs in its holdings. In a collection of close to 1.5 billion files, that doesn’t seem like a significant file management challenge, until one digs deeper into the data.

During 2017, NARA created a Holdings Profile: see my blog post for World Digital Preservation Day 2017 for an introduction to that work. The outcome was an analysis of what formats NARA has in the holdings. It’s inevitably not perfect, as there are always levels of uncertainty involved in assessing file formats, especially when an organization has been accepting a wide variety of formats for 50 years. There are different levels of tooling in place for different portions of the holdings housed in different preservation systems, something that will be gradually rectified as we migrate records into the new ERA 2.0 system.

NARA has Transfer Guidance that it has issued to federal agencies to identify the preferred and acceptable file formats for the transfer of permanent records to its custody. It is just that--guidance--but it is clear that PDF versions 1.0-1.6, 1.7 (ISO 32000-1:2008), and PDF/E are acceptable, and that PDF/A-1 and PDF/A-2 are the preferred versions for different record types ranging from scanned text to textual documents, posters, presentations, etc.

Anecdotally we assumed a very high level of compliance in the holdings, and we were both right and wrong on that count. For the U.S. federal records holdings, of the 18 variant versions of PDF in the holdings, 11 were on the preferred and acceptable list. What was surprising was that 7 were not: they are all variants of PDF/X, which is not on the list. Federal records account for 2 million of the PDFs in NARA’s holdings, so the total of 23,713 PDF/X files accounts for only 1.2% of the PDF files from federal agencies. That’s 98.8% compliance. Where we were wrong is that large portions of the holdings that are not in the federal collections are not currently characterized at the format version level. These approximately 8.2 million files are known to be PDFs and are very likely close to 100% compliant, but we do not yet know which versions the files definitely represent, so we cannot verify compliance. In coming years, the planned migration of records into ERA 2.0, our consolidated preservation infrastructure, will let us measure compliance more completely.
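
As a rough illustration of the arithmetic above, the following Python sketch recomputes the compliance figures from the numbers quoted in this post. The constant and function names are illustrative only, and the 2 million and 10.2 million totals are the rounded figures given in the text, not exact counts.

```python
# Figures quoted in the post; both totals are rounded, so results are approximate.
ALL_PDF_TOTAL = 10_200_000      # "over 10.2 million PDFs" in the holdings
FEDERAL_PDF_TOTAL = 2_000_000   # PDFs in the federal records holdings
PDFX_COUNT = 23_713             # PDF/X files, which are not on the guidance list


def compliance_rate(total: int, non_compliant: int) -> float:
    """Return the percentage of files that are on the guidance list."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= non_compliant <= total:
        raise ValueError("non_compliant must be between 0 and total")
    return 100.0 * (total - non_compliant) / total


compliant_pct = compliance_rate(FEDERAL_PDF_TOTAL, PDFX_COUNT)
non_compliant_pct = 100.0 - compliant_pct

# Files known to be PDF but not yet characterized at the version level.
uncharacterized = ALL_PDF_TOTAL - FEDERAL_PDF_TOTAL

print(f"PDF/X share of federal PDFs: {non_compliant_pct:.1f}%")
print(f"Compliance in federal PDFs:  {compliant_pct:.1f}%")
print(f"PDFs without version data:   {uncharacterized:,}")
```

The percentage only describes the 2 million federal files; it says nothing yet about the roughly 8.2 million files that have no version-level characterization.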

But this led to a different conversation as NARA began to create File Format Preservation Action Plans for the holdings. There are 18 variant versions of PDF identified at different levels of certainty. The preferred version of the format is PDF/A. Should NARA migrate all of its PDFs to PDF/A? Is it even possible?

The answer to “Is it possible?” is “Maybe?” There are tools to make such conversions, but because PDF/A requires much more information than is included in a “standard” PDF (especially the markup of its structure), conversion must be accomplished on a one-by-one basis, not on 2 million files at a time. A conversion can be batched without mapping the structure, but files that do not comply with the PDF/A prerequisites will not be converted, and some of the files will self-identify as PDF/A in their header metadata but not pass a full smoke test against the specification. This process could reduce the number requiring bespoke conversion, but likely not by much. That’s not scalable for the millions of files for which NARA is not the original creator and cannot recreate the full creation context.
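
To make the gap between self-identification and real conformance concrete, here is a minimal Python sketch that reads only what a file claims about itself: the version in the `%PDF-` header and, if present, the `pdfaid:part` value that PDF/A files declare in their XMP metadata. The function name `sniff_pdf` is hypothetical, the byte-level search assumes the metadata packet is stored uncompressed and unencrypted, and this is not a description of NARA's tooling. A claim found this way still has to be checked by a full PDF/A validator.

```python
import re
from typing import Optional, Tuple

# "%PDF-1.4" style header, expected near the start of the file.
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")
# XMP can carry the PDF/A part as an element (<pdfaid:part>1</pdfaid:part>)
# or as an attribute (pdfaid:part="1"); accept either form.
PDFA_PART_RE = re.compile(rb"pdfaid:part(?:>|\s*=\s*[\"'])\s*(\d)")


def sniff_pdf(data: bytes) -> Tuple[Optional[str], Optional[str]]:
    """Return (header version, claimed PDF/A part); None where not found.

    This reports a self-declared claim only. It does not validate the file.
    """
    header = HEADER_RE.search(data[:1024])
    part = PDFA_PART_RE.search(data)
    version = header.group(1).decode("ascii") if header else None
    claimed_part = part.group(1).decode("ascii") if part else None
    return version, claimed_part


# Tiny synthetic byte strings standing in for real files.
element_form = b"%PDF-1.4\n<x:xmpmeta><pdfaid:part>1</pdfaid:part></x:xmpmeta>"
attribute_form = b"%PDF-1.7\n<rdf:Description pdfaid:part=\"2\" pdfaid:conformance=\"B\"/>"
plain_pdf = b"%PDF-1.3\n1 0 obj\n<< >>\nendobj\n"

for sample in (element_form, attribute_form, plain_pdf):
    print(sniff_pdf(sample))
```

A batch process could use a check like this to sort files into "claims PDF/A" and "makes no claim" before any validation or conversion is attempted, but the sorting itself proves nothing about conformance.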

The answer to “Should we?” is “Not now, but we’re going to discuss it.” All format migrations are accompanied by risks, even ones within a single family of formats. The analysis that led to NARA including PDF versions 1.0-1.7 on its acceptable list means that the formats are, by definition, currently at an acceptable level of preservation risk. As the ever-ongoing analysis of formats continues, this will obviously require future consideration because a new migration trigger, such as the loss of renderability of earlier versions, is possible. As Acrobat is one of the most error-tolerant rendering tools, a loss of renderability is not likely, but some degree of risk is inherent with a 25-year-old format. NARA is dedicating efforts to minimize the format variants that are managed now and will require migration in the future, so it’s a topic that will be both reviewed and acted upon in the near future.

