About
veraPDF is an open source conformance checker that validates all current parts and levels of the ISO 19005 (PDF/A) specifications. PDF/A is the archival document standard based on original PDF specifications. veraPDF is one of three conformance checkers developed with funding from the PREFORMA (PREservation FORMAts for culture information/e-archives) project.
The challenge
Organisations that have responsibilities for long term preservation and access to digital content face several challenges:
- developing canonical, unambiguous interpretations of complex format specifications;
- obtaining software that can validate files of particular formats according to the canonical specification; and
- understanding the technical properties of files to inform long term preservation and access decisions.
veraPDF helps organisations to address these challenges. The conformance checker can be used to validate PDF/A files at different stages of a digital preservation workflow: creation, ingest, digitisation and migration. The purpose of the software is to:
- verify that a file has been produced according to the specifications of a standard file format;
- verify that a file matches the acceptance criteria for long-term preservation;
- report properties that deviate from the standard specification and acceptance criteria in human and machine readable formats; and
- perform simple automated fixes for deviations in the metadata of the preservation file, leaving the original bitstream untouched.
Open source licensing
The PREFORMA project specified that the software must be made available under a dual licence (MPL v2+ / GPL v3+). The test datasets and documentation are licensed under CC BY 4.0. Using an open source approach means that anyone can download and use the software, modify it to meet their requirements, and provide feedback by reporting bugs or requesting new features. It also means that the code and documentation are publicly available and will remain sustainable beyond the life of the project or the involvement of any individual developer.
User community
One of the core goals was to establish an active user community. The draft functional and technical specifications for veraPDF were published openly for review by the community during the design phase. We make regular software releases and respond to user feedback. We also run webinars and host mailing lists to help encourage use of the software and answer questions.
Development
The veraPDF consortium developed several software components:
- a general purpose software library for format validation;
- a PDF/A validation model;
- a PDF/A parser and conformance checker; and
- command line, GUI and REST interfaces.
The validation software library is intended to provide a starting point for anyone who wants to create a validator for any file format. The veraPDF software takes a set of XML validation rules and applies them to a validation model. Both of these concepts, i.e. the validation rules and model, are format agnostic.
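The separation between rules and model can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the XML rule layout, the `ModelObject` class and the `validate` function are hypothetical and do not reproduce the actual veraPDF profile schema or library API. It only shows how rules held as data can be applied to a generic model without either side knowing about a particular file format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical rule format for illustration; NOT the veraPDF profile schema.
RULES_XML = """
<rules>
  <rule id="r1" object="Document" property="encrypted" expected="false"/>
  <rule id="r2" object="Font" property="embedded" expected="true"/>
</rules>
"""


@dataclass
class ModelObject:
    """One node of a format-agnostic validation model (hypothetical)."""
    type: str
    properties: dict


def load_rules(xml_text):
    """Parse every <rule> element into a plain dictionary of its attributes."""
    return [dict(rule.attrib) for rule in ET.fromstring(xml_text).iter("rule")]


def validate(rules, model):
    """Return (rule id, object index) pairs for every failed check."""
    failures = []
    for index, obj in enumerate(model):
        for rule in rules:
            if rule["object"] != obj.type:
                continue  # the rule targets a different object type
            actual = str(obj.properties.get(rule["property"])).lower()
            if actual != rule["expected"]:
                failures.append((rule["id"], index))
    return failures


# A tiny model: one document and two fonts, the second font not embedded.
model = [
    ModelObject("Document", {"encrypted": False}),
    ModelObject("Font", {"embedded": True}),
    ModelObject("Font", {"embedded": False}),
]
failures = validate(load_rules(RULES_XML), model)
print(failures)
```

Because the rules are plain data and the model is a list of typed objects, supporting another format in this sketch would mean supplying a different rule file and a different model builder, while `validate` stays unchanged.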
All veraPDF software is developed in Java so that it is cross-platform, meaning it can be installed on Windows, Mac, and Linux machines.
veraPDF offers full support for all PDF/A versions (1, 2, 3) and levels (A, B, U). It can also extract and report technical details from PDFs to support custom policy checks beyond the PDF/A specifications:
- It produces an XML report on all metadata, resources, embedded files, pages, annotations, document security, etc.
- It has an embedded facility to create and check against policies that can be developed using an XML schema.
One use case might be that an archive does not allow attachments in its PDF/A files. veraPDF can detect and analyse any embedded files, which the PDF/A-3 specification permits.
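A "no attachments" policy of this kind could be checked against an XML report roughly as follows. This is a sketch under stated assumptions: the report layout and the element names `embeddedFile` and `fileName` are hypothetical, chosen for illustration, and a real veraPDF report uses its own structure. The built-in policy facility would normally be used in practice; the sketch only shows the idea of a policy as a check over reported properties.

```python
import xml.etree.ElementTree as ET

# Hypothetical report layout for illustration; a real veraPDF report
# uses its own element names and structure.
REPORT_XML = """
<report>
  <embeddedFiles>
    <embeddedFile><fileName>invoice.xml</fileName></embeddedFile>
  </embeddedFiles>
</report>
"""


def embedded_file_names(report_xml):
    """List the names of all embedded files found in the report."""
    root = ET.fromstring(report_xml)
    return [
        element.findtext("fileName", default="(unnamed)")
        for element in root.iter("embeddedFile")
    ]


def check_no_attachments(report_xml):
    """Return (passed, offending file names) for a 'no attachments' policy."""
    names = embedded_file_names(report_xml)
    return (len(names) == 0, names)


passed, offending = check_no_attachments(REPORT_XML)
print("policy passed:", passed, "embedded files:", offending)
```

Returning the offending file names alongside the pass/fail flag lets an ingest workflow report why a file was rejected as well as the fact that it failed.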
Deliverables
- veraPDF 1.0 was released on 10 January 2017. The most recent release was veraPDF 1.12 on 9 May 2018, which had been downloaded over 4,300 times at the time of writing. The latest release is always available at: http://downloads.verapdf.org/rel/verapdf-installer.zip.
- All of the source code, test corpora and documentation are publicly available on the veraPDF GitHub organisation home page: https://github.com/veraPDF.
- Every software release is tested against the veraPDF test corpus and the results are published online: http://tests.verapdf.org/.
- A dedicated documentation site is available at: http://docs.verapdf.org/.
veraPDF test corpus
To fully understand the PDF/A format specification we developed the veraPDF test corpus, a substantial body of open test data for the PDF/A specifications (versions 1B, 1A, 2B, 2U, 2A, 3B, 3U, 3A) as well as a number of additional test files for ISO 32000-1. The test corpus complements the Isartor and Bavaria test suites and contains over 1,500 files.
Testing approach
We carefully examined each clause in the standards, and developed a formal grammar to describe the requirements in a machine-readable fashion. We then produced validation rules with an accompanying programmatic test for each requirement. PASS and FAIL corpus files were created to test the validator’s functionality. This process highlighted any misunderstandings on the veraPDF consortium’s part or revealed ambiguities in the standards. When there was an issue, we worked with the PDF Association’s PDF Validation Technical Working Group (TWG), analysing PDF validation issues as part of a transparent process.
Although the ISO specifications could not be revised, the ambiguities resolved during the development of veraPDF were recorded as a PDF Association Technical Note to aid consistency in interpretation. PDF Association Technical Notes have a good track record of adoption by the industry. The development of veraPDF has also directly influenced the standardisation process: several of the issues raised have led to enhancements in a forthcoming new part of PDF/A.
Consortium
Members of the veraPDF consortium are:
Open Preservation Foundation, PDF Association, Dual Lab, KEEP SOLUTIONS, Digital Preservation Coalition.
The veraPDF consortium’s unique partnership helps bridge a gap between cultural heritage organisations and industry with specialist expertise to create a product that meets the needs of both communities. veraPDF has been adopted by a wide range of memory institutions, digital preservation vendors, PDF industry vendors and beyond.
Post project
Funding from the PREFORMA project ended in August 2017. veraPDF is now sustained and maintained by the Open Preservation Foundation. Dual Lab provides active user support and carries out maintenance and bug fixes. The PDF Association’s PDF Validation Technical Working Group continues in its role of resolving ambiguities that arise and helping industry to adopt veraPDF standards.