Edith Halvarsson

Last updated on 7 June 2022

This blog post is by Sebastian Lange, Software Engineer with the Bodleian Digital Library Systems and Services (BDLSS) department, and Edith Halvarsson, Digital Preservation Officer with Bodleian Libraries’ department of Open Scholarship Support.


Analysing PDFs with the PyMuPDF library 

Like many heritage institutions, Bodleian Libraries holds a vast collection of PDFs, created in various flavours and with various software over the past 20 years. These documents have come to the libraries from diverse sources – such as digitization suppliers, academic depositors, and born-digital personal archives.

We wanted a quick and dirty way of scanning our PDF collections for particular features, with checks tailored to the needs of the Libraries’ vast and diverse collections. Using the PyMuPDF library, we created a small tool which helps us gather more information about the current state of our PDFs, especially, but not exclusively, regarding their accessibility. While our PDF analysis tool is less detailed than validation tools (like veraPDF), using the PyMuPDF library can be a good first step for analysing PDFs and flagging potential high-level digital preservation risks.

What the tool does

Our small PDF analysis tool identifies the different fonts used in a PDF, checks whether they are embedded, and looks for forms, bookmarks, and titles. In addition, the PDF analyser reports the number of words, pages, and images in a file, which helps us to spot other useful characteristics, for example whether a PDF is a converted image or a publication/text.
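
As a rough illustration of the checks listed above (not the tool's actual code, which is not shown in this post), the sketch below gathers the same features with PyMuPDF. The function names `summarise_fonts` and `analyse_pdf` are hypothetical, and treating a font whose file extension is reported as "n/a" as not embedded is an assumption that should be checked against your own files.

```python
from collections import namedtuple

# Field order follows the tuples returned by PyMuPDF's Page.get_fonts().
FontEntry = namedtuple("FontEntry", "xref ext type basefont name encoding")


def summarise_fonts(font_tuples):
    """Map each font name to True if it appears to be embedded."""
    summary = {}
    for raw in font_tuples:
        font = FontEntry(*raw[:6])
        # Assumption: an extension of "n/a" means no font file could be
        # extracted, which we treat here as "not embedded".
        embedded = font.ext != "n/a"
        summary[font.basefont] = summary.get(font.basefont, False) or embedded
    return summary


def analyse_pdf(path):
    """Collect high-level features for one PDF (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so summarise_fonts works without it

    fonts, words, images = [], 0, 0
    with fitz.open(path) as doc:
        for page in doc:
            fonts.extend(page.get_fonts())
            words += len(page.get_text("words"))
            # Counts image references per page, so an image reused on
            # several pages is counted once for each page.
            images += len(page.get_images())
        return {
            "pages": doc.page_count,
            "words": words,
            "images": images,
            "fonts": summarise_fonts(fonts),
            "has_form": bool(doc.is_form_pdf),
            "has_bookmarks": bool(doc.get_toc()),
            "title": (doc.metadata or {}).get("title") or None,
        }
```

Calling `analyse_pdf("example.pdf")` on a file of your own returns a dictionary that can be written to CSV or JSON for collection-level reporting.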

How it works

Our PDF analyser relies heavily on the fitz module of the PyMuPDF library.

For our purposes it was useful to first analyse our collections with Siegfried (a signature-based file format identification tool), which can output results to JSON. This gave us a starting point to feed into our PDF analysis tool. 
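
A minimal sketch of that hand-over step might look like the following. It assumes the report was produced with Siegfried's JSON output option and that each entry under "files" carries a "filename" and a list of "matches" with a "mime" value; those key names are taken from Siegfried's usual JSON layout and should be verified against the version you run. The helper names are hypothetical.

```python
import json


def load_siegfried_report(json_path):
    """Read a Siegfried JSON report from disk."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)


def pdf_paths_from_siegfried(report):
    """Return the filenames of all entries that Siegfried matched as PDF."""
    paths = []
    for entry in report.get("files", []):
        matches = entry.get("matches", [])
        # Assumption: PDF matches are reported with this MIME type.
        if any(m.get("mime") == "application/pdf" for m in matches):
            paths.append(entry["filename"])
    return paths


# Invented sample data in the assumed layout, for illustration only.
sample_report = {
    "files": [
        {"filename": "scans/a.pdf",
         "matches": [{"ns": "pronom", "id": "fmt/18", "mime": "application/pdf"}]},
        {"filename": "scans/b.tif",
         "matches": [{"ns": "pronom", "id": "fmt/353", "mime": "image/tiff"}]},
    ]
}
pdf_files = pdf_paths_from_siegfried(sample_report)
```

Filtering on the identification result rather than on the file extension means PDFs with missing or misleading extensions are still picked up.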

Our tool then iterates over every PDF file identified by Siegfried and uses the functions offered by the fitz module to extract the information we want. To identify digitized PDFs without a text layer, we rely on the assumption that a PDF which contains no text should be classified as an ‘image’. This could cause some inaccuracies, but extensive testing gave very good results, which we feel justifies the approach.
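
The classification rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the tool's own implementation: `count_words` and `classify_pdf` are hypothetical names, and "contains no text" is interpreted as PyMuPDF extracting zero words across all pages.

```python
def count_words(path):
    """Total number of words PyMuPDF can extract from a PDF."""
    import fitz  # PyMuPDF; imported here so classify_pdf works without it

    with fitz.open(path) as doc:
        return sum(len(page.get_text("words")) for page in doc)


def classify_pdf(total_words):
    """Apply the 'no text means image' assumption to a word count."""
    return "image" if total_words == 0 else "text"
```

For a file on disk, `classify_pdf(count_words("example.pdf"))` gives the label. Under this rule a scanned document that already has an OCR text layer is labelled "text", which matches the stated aim of finding digitized PDFs without a text layer.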

What's next

Using the findings from the PDF analysis tool, we will look at methods to enhance our PDFs and make them more accessible to users. We hope that other DPC members will find this approach useful and have a go at using PyMuPDF to tailor scans that meet the needs of their own unique PDF collections.

Comments   

#1 Egor Eremeev 2022-06-17 08:04
Thanks for sharing this interesting note. Parsing text from a number of PDFs of unknown quality is a well-known pain :-). Could you advise whether the source code of the PDF tool is open and available somewhere?