Duff Johnson

Last updated on 29 November 2017

Duff Johnson is Executive Director of the PDF Association and Project Leader for ISO 32000, based in Winchester, Massachusetts, USA.

What is a “document”? It’s a record of some (typically written) content - a publication, a contract, a statement, a painting - at a moment in time. Until the advent of computers (and scanners), the only media considered useable for such records were papyrus, vellum or paper pages.

PDF became the document format of choice for business, government and the general public because it delivers the key qualities of paper in a digital format. PDF is fixed, self-contained, readily shareable and relatively hard to change. It’s not just PDF’s innate characteristics that make it successful, but the fact that PDF interoperates smoothly with paper documents. The classic “PDF it, send it, print it, sign it and return it” type of workflow introduced new efficiencies when PDF surfaced into public consciousness in the mid-to-late 1990s. This approach used only the most basic of the format’s capabilities, but it was enough to enable the slow economy-wide transition to digital documents.

Before long, users were scanning the signature page and adding it to (or replacing) the original page in the PDF; the cycle back to a digital document was complete. This was, of course, an extremely crude approach to facilitating document approvals, but the capability made PDF very tolerant of variations in workflow and records-keeping practices in a way that’s unimaginable for databases and HTML.

PDF continues to evolve far beyond a simulacrum for good old paper. The technology includes a wide swath of features – tagging, XML-based metadata, attachments, 3D support, digital signatures and more – that support advanced document-handling and consuming workflows. PDF is so capable and so reliable that some wonder why anyone should bother with an archival subset at all.

Why indeed. PDF is good, but necessarily more general-purpose than the more restricted subset known as PDF/A, whose PDF files are designed to last forever. For all its deserved reputation for reliability, PDF allows developers to make files that use external resources and encryption; both capabilities are non-starters for the preservation community. If the world uses PDF – and it does – then preservationists need PDF/A.

Introduced in 2005 as ISO 19005, PDF/A is now required or best-practice technology in many workflows that result in valuable documents. Filing cabinets and storage boxes are disappearing as ECM systems, cloud storage and local capacity swallow the documents that used to exist only on paper. When new documents are shared, the common ground is PDF. When they are finalized for archival purposes, ideally, they are PDF/A.

Some think HTML will “beat” PDF because it’s more flexible and less static, but this misconstrues both formats’ respective purposes and fails to appreciate that browser developers can simply add support for PDF (and have already started to do so). Heedless of theory, PDF continues to gain in mind-share: over time, the number of searches for PDF documents relative to all other searches continues to go up.

PDF’s purpose is to be a document, with all that implies (see above). But that’s not the purpose of HTML. HTML isn’t a document, it’s an experience. HTML is about making and consuming; PDF is how you keep it, and PDF/A is how you keep it forever (preserving the file’s actual bytes, of course, is up to you).

Funded by the EU’s PREFORMA project, the open-source and industry-supported veraPDF PDF/A validator is the result of a collaboration between the Open Preservation Foundation and the PDF Association. veraPDF is a crucial tool to help preservationists everywhere understand their existing collections of PDF files, evaluate new additions, and advise their contributors on best practices. Following PREFORMA’s acceptance-testing period in the summer of 2017, Adobe has released updates to their own PDF/A validator to ensure it’s consistent with veraPDF.

This is not only the present, it’s also the future. PDF, an open, standardized, broadly capable digital document technology, has proven equal to the transition from paper to the electronic world. PDF’s advanced metadata, authentication, attachments and other features provide a proven framework for future development of the digital document. PDF has no competitors. Even in the world of SharePoint, Office 365 and Google Docs, PDF and PDF/A represent the only sufficiently capable technology for archiving the full gamut of digital document content.


6 years ago
Nice post. While I agree that PDF is the best format for records and archives, there are 2 of your premises that I look at a bit differently.

A document is not necessarily a record. Records have a requirement to be persistent. Documents do not have this requirement. Using this logic, a record is a document, but a document does not need to be a record.

HTML is definitely a document format. It lacks a good persistence model, which makes it a poor format for records. But to say that it is not a document format seems to me to be incorrect. It is a fine document format and is used as such by millions of people. It doesn’t seem correct to just summarily dismiss it.
