Walker Sampson

Last updated on 4 November 2020

Author: Walker Sampson, University of Colorado Boulder; in collaboration with Keith Pendergrass, Harvard Business School, Boston, MA; Tessa Walsh, Artefactual Systems, BC, Canada; and Laura Alagna, Northwestern University Library, IL.

For the past several years, Keith Pendergrass, Tessa Walsh, Laura Alagna, and I have been exploring how to make digital preservation more environmentally sustainable. Recently, we have focused on building a global community of practice, striving to add environmental sustainability as a third, co-equal pillar of digital preservation practice along with digital object management and successful use. We began our work with a research article, then created a workshop protocol, and have been engaging in outreach and education efforts. We are honored to be a Digital Preservation Awards finalist for the Dutch Digital Heritage Network Award for Teaching and Communications.

We have written a couple of blog posts on our work previously, first discussing our workshop protocol and then providing details on implementing policy and workflow changes at Baker Library Special Collections. Here I would like to describe efforts at the University of Colorado Boulder to generate guidelines on the use of three storage tiers. In our article we recommend storage tiers as a way to accommodate retention needs for a range of content – some of which may not merit or immediately need top-tier storage. I suspect many institutions have multiple storage options available to them, each with different qualities – though here, we have only just begun the process of organizing these options into a unified strategy.

The storage tiers are:

  • Local redundant storage (1) – this tier stores content entirely on campus, with two tape redundancies in one facility. While transfers to this tier can be verified for integrity through the Globus transfer service, no fixity checks are run on the tier itself – though the media are regularly monitored. Content moving here is therefore packaged with BagIt first, so that checksums travel with it. This is the lowest-cost tier both financially and environmentally.

  • Remote redundant storage, no processing (2) – this tier stores content in two geographically dispersed locations in the United States, with multiple redundancies at each and annual fixity checks. Content in this tier can also be modeled to the Portland Common Data Model, which articulates collections, objects, and files and the relationships between them.

  • Remote redundant storage, processing (3) – same as above, but with processing available through an Archivematica instance, allowing format migrations and retention of the SIP. This is the highest tier in both cost and environmental impact.
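Because tier (1) performs no ongoing fixity checks, bagging content before transfer means a checksum manifest travels with it and can be re-verified whenever the content is retrieved. As a rough illustration only – in practice one would use the standard BagIt tooling rather than this hypothetical sketch – the core idea of a SHA-256 payload manifest looks something like this:

```python
import hashlib
from pathlib import Path

def make_manifest(payload_dir: str) -> dict:
    """Record a SHA-256 checksum for every file under payload_dir,
    much as BagIt's manifest-sha256.txt does for a bag's payload."""
    manifest = {}
    for path in sorted(Path(payload_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(payload_dir))] = digest
    return manifest

def verify_manifest(payload_dir: str, manifest: dict) -> list:
    """Return the names of files whose current checksum no longer
    matches the stored manifest (i.e., possible fixity failures)."""
    current = make_manifest(payload_dir)
    return [name for name, digest in manifest.items()
            if current.get(name) != digest]
```

A manifest generated at bagging time can then be checked after a transfer, or years later on retrieval, without the storage tier itself having to run scheduled fixity audits.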

Our discussions have covered considerable ground, but here I thought it would be most useful to examine a single issue that touches many axes we are trying to work on in formulating policy.

Today, we use the Archivematica-based processing of (3) for only a few collections. These are collections that feature a wide range of file types – usually personal or work documents and data. For these, it's valuable to have an automated process create normalized copies of formats where possible. For collections with only a handful of formats (which is most of them) we use (2), as there is no need for immediate generation of derivatives and no near-term risk of format obsolescence.

We also have derivatives created from digitized audio-visual content. These derivatives are labor-intensive: trained personnel make frame and frame rate adjustments, track syncs, color corrections, and numerous other modifications. The digitizing equipment cannot perform these operations automatically when generating the initial preservation copies. While the preservation content is always retained, these derivatives, which we term 'mezzanine' copies, are the 'real' digitized copies, as they constitute the workable and viewable content in the best quality. Downloadable or streamable access copies can be made from the mezzanines.

Our default stance is to store the preservation copies in (2) while the mezzanines remain in (1) – but a good argument exists for the reverse. The preservation copies are generated through an automated process, are about 3x the size of the mezzanines (which are themselves typically 100+ GB), and are not viewable. Meanwhile, the mezzanines embody the skill-based decision-making of our staff and are themselves the parent copy of the various access copies. In some sense, losing the mezzanines is more costly than losing the preservation copies – the latter can be reproduced by less-trained staff, and the physical media will be extant for some time, while the former would require well-trained staff to reproduce, and their loss would cut off the generation of future access copies. Taking into account both budget and environmental impact, it could be wiser to set the master-level preservation copies on local redundant storage (1) and the mezzanines on (2).
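One way to make that reasoning concrete is to compare, for each copy, the expected cost of losing it from a cheaper tier against the ongoing premium of keeping it on a higher tier. The toy sketch below uses entirely invented figures – none of these numbers, nor the function itself, come from our actual policy work – purely to show the shape of the trade-off:

```python
def place_copy(reproduction_cost: float, storage_premium: float,
               loss_probability: float) -> str:
    """Put a copy on the higher (remote) tier only when the expected
    annual cost of losing it from local storage exceeds the extra
    annual cost of remote storage. All inputs are illustrative."""
    expected_loss = reproduction_cost * loss_probability
    return "remote (2)" if expected_loss > storage_premium else "local (1)"

# Invented numbers: mezzanines are expensive to redo (skilled staff),
# so their expected-loss cost justifies the remote tier; preservation
# masters are cheap to regenerate while the physical media survive.
mezzanine = place_copy(reproduction_cost=20_000, storage_premium=500,
                       loss_probability=0.05)
preservation = place_copy(reproduction_cost=2_000, storage_premium=1_500,
                          loss_probability=0.05)
```

Under these made-up assumptions the mezzanines land on remote storage and the preservation masters stay local – the reverse of our default stance, which is exactly the counterintuitive outcome the trade-off can produce.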

Of course, if the physical media were in a poor state, this balance would change. So too if the size difference between preservation and mezzanine copies were to lessen. This issue has not yet been resolved, and we are still discussing the most prudent course. Regardless, I hope that focusing on this single issue has illustrated how tiered storage can open up new preservation strategies – perhaps counterintuitive ones.
