Richard Lehane is an archives and recordkeeping consultant with Recordkeeping Innovation, Sydney, Australia. Next year he'll be joining the IAEA's Archives team in Vienna, Austria.

When I find spare moments, I work on siegfried, a file format identification tool like DROID and fido. I've been tinkering on it now for over five years. Automation has been critical for me to sustain the project; otherwise I just wouldn’t be able to attend to all the things that need doing, besides improving the tool itself. So far I’ve automated:

  •  testing,
  •  building and publishing releases,
  •  updating signatures,
  •  profiling the codebase,
  •  and benchmarking.

Automating these processes isn't just about relieving me of manual work and freeing up time; it's also about putting in safety nets, so that I can dive in and make changes knowing that any serious errors or regressions will surface in the tests and benchmarks.

The first thing I automated was siegfried's test suite. This suite wraps Ross Spencer's skeleton suite and runs, via Travis CI, every time new code is pushed to GitHub. These tests pay dividends with every PRONOM release because they are comprehensive and cover a broad range of edge cases in PRONOM signatures. The skeleton suite has identified a bunch of siegfried bugs over the years (and has also yielded a number of bug reports to the TNA).
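A push-triggered test run like this boils down to a small CI configuration. The sketch below is illustrative only — the build tag and commands are assumptions, not siegfried's actual `.travis.yml`:

```yaml
# .travis.yml — illustrative sketch of a push-triggered test run
language: go

go:
  - stable

install:
  # fetch the package plus its test dependencies
  - go get -t ./...

script:
  # run the unit tests, then the (hypothetical) skeleton-suite
  # integration tests guarded behind a build tag
  - go test ./...
  - go test -tags=skeleton ./...
```

With a file like this in the repository root, every push and pull request gets the full suite run automatically, and a red build flags any regression before release.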

Travis CI is also responsible for building Debian packages; this was the next major thing I automated. Any tagged push to siegfried's master branch triggers a Debian build on Travis CI and a Windows build on AppVeyor. The executables are then automatically published to Bintray and back to GitHub. In other words, as soon as I'm happy that the code is in a fit state for a new release, I can simply commit it to the repository and new Linux and Windows releases are built and published automatically.
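In Travis CI terms, tag-triggered publishing is expressed as a `deploy` section. This is a sketch only — the artefact name and token variable are hypothetical, not siegfried's real configuration:

```yaml
# .travis.yml (excerpt) — illustrative tag-triggered release
deploy:
  provider: releases            # publish build artefacts back to GitHub
  api_key: $GITHUB_TOKEN        # token supplied as an encrypted CI variable
  file:
    - siegfried_linux64.deb     # hypothetical artefact name
  skip_cleanup: true            # keep the freshly built binaries
  on:
    tags: true                  # only fire on tagged pushes
    branch: master
```

The `on: tags: true` condition is what makes an ordinary commit build-and-test only, while a tagged commit additionally publishes a release.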

Last year I wrote builder – a script which (again… a theme is emerging here!) runs on Travis CI and harvests the PRONOM database following a PRONOM update, in order to build fresh versions of the skeleton test suite. Not only does this save me time, it also means I don't need all the Java and Python dependencies for the skeleton suite installed on my local development machines.

The most recent automation, added in July this year, is probably the biggest and most complex of them all: a custom continuous benchmarking service that runs large-scale benchmarks against siegfried (and also DROID and fido) whenever I push changes to GitHub. This service also does automated code profiling. I won't describe it in detail here but, if you're interested, read this post.

So, now that I’ve established I’m a fan of automation (and have included a bit of a shameless siegfried plug too), let’s generalise and discuss automation in digital preservation.

Automation is often promoted as a necessity in digital preservation work because of the problem of scale: the notion that, because so much digital content is being created today, if we can’t automate our work we’ll drown in a *digital deluge*. Although there is definitely truth in this argument, I’d caution that automation isn’t something that should be applied blindly or as a one-size-fits-all proposition.

Premature optimisation is the root of all evil

The next time you hear someone (or yourself) say, "that's great, but it won't scale", it's worth considering exactly what they mean by scale. We all have scaling problems, but they're not all necessarily the same ones. In thinking about scale and digital preservation there are a number of dimensions to consider:

  • our scale (usually small)
  • the size of our potential customer base or jurisdiction (often big)
  • the size of our *actual* customer base, or the portion of our jurisdiction looking to engage with us (often small)
  • the number of transfer requests (varies)
  • the size of those transfers (varies)
  • the complexity of those transfers (varies).

A solution designed to manage a small number of customers, with really big transfer volumes, across a small and well-defined set of content types might look quite different to a solution built for lots of customers with small but very diverse transfers.

I've recently left NSW State Archives, where I was part of the digital archives team. Several years ago, at the start of that project, we spent quite a bit of time designing (and partly building) a workflow engine and agency portal to enable us to deal with the large volume of agency transfers we expected would soon come through the door. We ultimately pivoted to a much more bespoke model (using document templates for agency interaction and scripts for a lot of the processing work) that proved a better fit for the type of scale we faced: small volumes of highly complex transfers that varied greatly in size.

In the machine learning community, they often say "there's no such thing as a free lunch": there isn't a single algorithm that suits all domains, so you need to experiment with different algorithms to identify the right approach for your particular problem. A similar message applies in digital preservation. Pick the right set of tools for the context in which you are working, and recognise that with any tool there are trade-offs. This is because automating anything involves applying constraints. File format normalisation is a perfect example: if you are willing to accept a possible cost in fidelity, then applying the constraint of only supporting a limited number of "preservation"-quality formats *might* be worth it. (Or not.)

Automation is important but do it tactically (like I did for siegfried): start small and manual, find out where *your* scaling needs are, and understand what trade-offs you're willing to make.

Enjoy World Digital Preservation Day. And try siegfried!
