Paul Wheatley

Last updated on 15 February 2022

This blog post is in some ways a follow-up to my 2018 post "A valediction for validation?". It was prompted by a tweet from Micky Lindlar that got me thinking about this stuff again...

"Controversial statement of the day: the longer I work in digital presevation, the more pointless i find file format validation." It was a tweet that ellicited a significant number of replies that agreed with Micky's sentiment, suggesting perhaps a growing disenchantment with validation and thus my corresponding and continued interest in an alternative.

...and shortly after that by a fascinating and information-dense presentation from Tim Allison and colleagues for an OPF webinar, as well as a brilliant follow-up conversation on Twitter prompted by questions from Tyler Thorsted to Tim.

Neither a rushed question at the end of a presentation nor the limited character count of a tweet or two seemed an appropriate way for me to follow up. I wasn't even sure what I wanted to ask. So instead, this blog post is my attempt at (rather slowly) processing all this information, and at looking at whether it takes us any closer to using existing parsers (rather than purpose-built, file-format-specification-driven validators) to identify files with issues of concern for long-term preservation (or at least whether that idea might work) - as I proposed back in that Valediction for Validation blog post. Hopefully it prompts corrections, thoughts or further discussion.

So what might tools that parse files in order to render them or extract properties usefully tell us? Here are a few dimensions of interest:

1) Different tools produce different rendering results
If the output of different tools rendering the same file is significantly different, does this represent an area of concern for digipres? Different rendering or extraction results may point to ambiguities that have accidentally arisen due to poor file construction or have been intentionally exploited.

In a follow-up tweet, Tim notes that "On rendering testing, our team hasn't done anything yet. Another team on the program has taken on the rendering differential analysis, and their code isn't open yet (I don't think)." and "However, I think @Passwort12345 runs rendering comparison code for #ApachePDFBox as part of regression testing." and furthermore "And the multicompare tool that I have for text should be expanded to handle the output of a rendering comparison tool so that you could search for files where, for example, mutool and xpdf yield rendering differentials."
This could be a particularly interesting avenue to explore for digital preservationists. There are of course many different ways of comparing the results, from diffing the textual output as Tim mentions, to even some sort of visual comparison of a renderer's screen output.
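
Purely as an illustration (and nothing to do with Tim's actual tooling), here's a minimal sketch of the text-diffing end of that spectrum. It assumes Poppler's pdftotext and MuPDF's mutool are on the PATH (exact flags vary between versions), and it uses an arbitrary similarity threshold to flag candidates for closer inspection.

```python
#!/usr/bin/env python3
"""Minimal sketch: flag PDFs where two extractors disagree on the text.

Assumes `pdftotext` (Poppler) and `mutool` (MuPDF) are on the PATH; the
0.9 similarity threshold is an arbitrary illustrative value.
"""
import subprocess
import sys
from difflib import SequenceMatcher
from pathlib import Path


def extract_poppler(pdf: Path) -> str:
    # `pdftotext <file> -` writes the extracted text to stdout
    return subprocess.run(["pdftotext", str(pdf), "-"],
                          capture_output=True, text=True).stdout


def extract_mupdf(pdf: Path) -> str:
    # `mutool draw -F txt -o -` renders pages as plain text to stdout
    return subprocess.run(["mutool", "draw", "-F", "txt", "-o", "-", str(pdf)],
                          capture_output=True, text=True).stdout


if __name__ == "__main__":
    for pdf in sorted(Path(sys.argv[1]).glob("*.pdf")):
        ratio = SequenceMatcher(None, extract_poppler(pdf), extract_mupdf(pdf)).ratio()
        if ratio < 0.9:  # arbitrary: treat a big textual divergence as suspicious
            print(f"{pdf.name}: extractors disagree (similarity {ratio:.2f})")
```

A visual comparison could work along much the same lines, rasterising each page with two renderers and diffing the images rather than the extracted text.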

2) Parser crash
If a parser (or several different parsers) crashes while traversing a particular file, this might well indicate an issue of interest for digipres. Early results in the presentation showed a graph of the number of crashes for each parser when run against a sample of Common Crawl data. This could be of interest, but I suspect crashes are quite rare (I think that's what the graph showed) and I assume they are likely to be fixed before too long. If they're fixed, *maybe* the parsers would then report a warning or error when triggered by the same file format issue. Of course, maybe once fixed there isn't an issue to concern digipres at all.
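
Detecting that kind of hard crash from outside the parser is at least straightforward to automate. A sketch, assuming a command-line parser such as pdftotext and a directory of PDFs to sweep; the tool choice and timeout are illustrative only.

```python
#!/usr/bin/env python3
"""Sketch: sweep a directory with one parser and log hard crashes.

A negative returncode from subprocess means the child was killed by a
signal (e.g. a segfault), which is the kind of crash discussed above;
positive codes are the parser's own error exits. The choice of
`pdftotext` and the 60-second timeout are illustrative assumptions.
"""
import subprocess
import sys
from pathlib import Path


def check(pdf: Path) -> str:
    try:
        result = subprocess.run(["pdftotext", str(pdf), "/dev/null"],
                                capture_output=True, timeout=60)
    except subprocess.TimeoutExpired:
        return "timeout (possible hang)"
    if result.returncode < 0:
        return f"crashed (signal {-result.returncode})"
    if result.returncode > 0:
        return f"error exit ({result.returncode})"
    return "ok"


if __name__ == "__main__":
    for pdf in sorted(Path(sys.argv[1]).glob("*.pdf")):
        status = check(pdf)
        if status != "ok":
            print(f"{pdf.name}\t{status}")
```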

3) Error messages
In my Valediction blog post I speculated that error messages and warnings produced by parsers/renderers might be useful in identifying digipres issues. However, Tim's quick summary of what he's seen does not initially bode well (noting this is just looking at the complex world of PDF): "Every tool will complain about different things. Some times there are warnings that are meaningful from a rendering/text extraction/functionality standpoint, and some times it'll be something that... Kind of irked one developer here or there. In the observatory, we continue to be amazed at the diversity of errors/warnings/complaints we get out of different tools."
Which isn't hugely promising but perhaps not unexpected. Could it still be useful to trawl through these and identify useful reports? Tim continues "pdfcpu offers some great debugging information and you can turn it to lenient; we currently have it set to strict." and "In short, your mileage will vary substantially tool to tool and task to task. In the observatory, we're looking for correlations across tools."

Correlations would of course be very interesting, but I would also speculate that working backwards from a test corpus of problematic PDFs might be a worthwhile experiment for digipres, although one that would likely require a lot of manual digging to interpret the results and identify useful error messages that would then become the triggers in our automated file analysis. I *think* what Tim is summarising is that there isn't likely to be an easily automatable way of pulling out useful stuff from errors and warnings - and this is the broader approach the file observatory is trying to take. But maybe there would be specific value for digipres in the manual graft of sifting error reports and identifying the useful ones?
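
For anyone tempted by that manual graft, the collection step at least is easy to automate: run a handful of tools over a corpus, keep everything they write to stderr, and then sort and tally the messages by hand. A minimal sketch follows; the tool list is purely an assumption, to be swapped for whatever parsers are locally available.

```python
#!/usr/bin/env python3
"""Sketch: harvest stderr messages from several PDF tools for manual sifting.

The tool list is an assumption; swap in whichever parsers are available.
Output is a simple TSV of file, tool, message, ready for sorting and
counting to spot the messages (and cross-tool correlations) worth keeping.
"""
import subprocess
import sys
from pathlib import Path

# (label, command template); "{pdf}" is replaced with the file path
TOOLS = [
    ("pdftotext", ["pdftotext", "{pdf}", "/dev/null"]),
    ("mutool",    ["mutool", "info", "{pdf}"]),
    ("qpdf",      ["qpdf", "--check", "{pdf}"]),
]


def messages(pdf: Path):
    for label, template in TOOLS:
        cmd = [arg.replace("{pdf}", str(pdf)) for arg in template]
        result = subprocess.run(cmd, capture_output=True, text=True)
        for line in result.stderr.splitlines():
            if line.strip():
                yield label, line.strip()


if __name__ == "__main__":
    for pdf in sorted(Path(sys.argv[1]).glob("*.pdf")):
        for tool, message in messages(pdf):
            print(f"{pdf.name}\t{tool}\t{message}")
```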

Further thoughts

Parsers will have built-in tolerance for typical, but not necessarily format-specification-compliant, issues, and some of these might be of interest to digipres folks. The issues behind these tolerances might be related to ambiguous areas of the file format spec. Some might just be helpful tolerances that mean us digipres types don't need to worry - the sort of things a renderer will helpfully sidestep, displaying the file for the user without concern (e.g. the badly formed date/time fields that validators often bombard us with but which every renderer expects and tolerates). However, some may smooth the way for users whilst concealing a potentially important concern for digipres. I've certainly encountered cases where renderers address bad data by "doing the best they can": throwing up content they can display while throwing out content they can't, and at the same time providing no warning to the user that something catastrophic might have happened. So do the parsers provide warnings for these kinds of issues (hidden from someone rendering with a GUI), and if not, could we even devote some dev time to improving warning/error reports in ways that would be useful for us? Moving our precious digipres dev effort from developing and maintaining home-grown validation tools to improving error reporting in already well-maintained open source extraction tools or renderers feels like a much better long-term strategy for our community.

Quite a bit of what Tim has said is not that encouraging - and it's perhaps not surprising. The complexity of the whole PDF continuum means that naturally there aren't going to be any easy answers there. But I am hearing various points of interest for digipres, and I would be excited if we could follow this up. 

Exif via the Gif shop

Ok, so this is about PDFs not GIFs, but that would have been a great final section title if it had been! Before getting around to finishing this blog post, I discovered that a number of my questions had already begun to be answered by some great practical work. Maybe it even renders this post a bit redundant - which is great! Regardless, I'll carry on and will come full circle by referencing an excellent blog post by Micky's colleague Yvonne Tunnat: "PDF Validation with ExifTool – quick and not so dirty", which is the latest in a series of posts on a similar theme. I'll pick out what particularly interested me.

Yvonne discusses identifying bad PDFs by parsing them with Exiftool and then sifting the error messages. After ruling out a set of minor or inconsequential errors and warnings, Yvonne describes a subset of errors that are used to flag up any PDFs requiring further inspection. Interestingly, one of the follow-up checks at that point is to use pdfaPilot to migrate to PDF/A-2b, with (some) errors (again) flagging up possible issues. This highlights the possibility of utilising layers of automated assessment. Running lots of tools over every file might well be overly time-consuming (something Yvonne touches on), but after an initial pass, more tools could then be applied to confirm or clear the much smaller subset of potentially problematic files, perhaps narrowing the field before any manual intervention is required.
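
To make that layering concrete, here's a hedged sketch of the shape of such a pipeline: a cheap ExifTool pass over everything, then a second, stricter check only on the flagged subset. qpdf's --check is used purely as a stand-in for the pdfaPilot migration step, and the set of ignorable messages is left empty because, as Yvonne's post shows, it has to be built up by sifting real output.

```python
#!/usr/bin/env python3
"""Sketch of a layered assessment: ExifTool first, a stricter tool second.

`qpdf --check` stands in here for the pdfaPilot migration step; the set of
ignorable messages is deliberately empty and would be filled in after
manually sifting real ExifTool output, as Yvonne's post describes.
"""
import json
import subprocess
import sys
from pathlib import Path

# Messages judged harmless after manual sifting would be listed here.
IGNORABLE: set = set()


def exiftool_issues(pdf: Path) -> list:
    # ExifTool reports parse problems in its Error and Warning tags
    result = subprocess.run(["exiftool", "-json", "-Error", "-Warning", str(pdf)],
                            capture_output=True, text=True)
    record = json.loads(result.stdout)[0] if result.stdout.strip() else {}
    issues = [record[key] for key in ("Error", "Warning") if key in record]
    return [msg for msg in issues if msg not in IGNORABLE]


def second_opinion_objects(pdf: Path) -> bool:
    # Stand-in for the migration step: does a stricter tool also complain?
    return subprocess.run(["qpdf", "--check", str(pdf)],
                          capture_output=True).returncode != 0


if __name__ == "__main__":
    for pdf in sorted(Path(sys.argv[1]).glob("*.pdf")):
        issues = exiftool_issues(pdf)
        if issues and second_opinion_objects(pdf):
            print(f"{pdf.name}: flag for manual inspection ({'; '.join(issues)})")
```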

The blog post (and a few points made on Twitter in response) points to the key potential problems that appear most useful to target: confirming initial magic-based format identification, identifying files that don't parse/render, identifying incomplete files, identifying encrypted files, and identifying obvious corruption. However, I like the way Yvonne describes the issues in a more functional way, e.g. "This is an error which usually makes the migration to PDF/A-2b impossible". I've always found the concept of validation to be almost completely dependent on theoretical constructs around what operations we might want to conduct in the distant future. Yvonne's approach hits the other end of the spectrum by identifying obstacles to actual processes we want to perform now (or that at least seem like obvious contenders for performing in the near future).

Yvonne's blog post reads to me like a major part of the recipe for a wider "Parse-Evaluate" project that looks at applying this approach to a number of file formats, using existing parsing tools. It would be driven by working with real data sets, initially focusing on filtering existing parser errors to pick up on problematic files, and then going further to tweak/reinforce/add error messages to the parsers as required. Even better, to then combine with Tim's file observatory approach of comparing parsing and rendering between different tools and feeding the learning back into the digipres process.

Who's in, who's out?

 

