Leslie Johnston is Director of Digital Preservation for the National Archives and Records Administration in Washington, DC, USA.

One of the greatest challenges for any archive is the multiplicity of file formats. For the United States National Archives and Records Administration (NARA), with several decades of history accessioning and managing electronic records, this challenge is compounded. We received our first transfer of electronic records in 1970!

How do you plan a preservation strategy to account for decades of electronic files? I started by drawing a picture. Not a literal picture, of course, but I wanted to find a way to analyze and visualize what NARA has in its holdings.

NARA has recently completed a file format profile of its electronic records files. Why did we do this? Because we could not plan without first getting a better idea of what we really have. NARA operates under several different regulatory mandates, each with different restrictions on collection schedules and scope, as well as access controls. This led to the implementation of multiple systems--developed over more than 20 years with different technologies--which meant a real challenge in understanding the scope of the holdings.

I worked with the system owners and our IT operations to get the most granular reporting possible on each set: federal, legislative, and individual presidential administrations. The reporting didn’t always match in terms of granularity, given different tooling for the format analysis and report generation, but in the end I was able to compile a record of what we have, what formats we have, and counts. Could we identify every file format with complete certainty? No. Were there decisions in the past about format normalization that I had to take into account? Yes. Will it help me plan for preservation program and technology priorities? Absolutely.
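As a rough illustration of this kind of compilation (a minimal sketch, not NARA's actual tooling), per-system reports can be reduced to a single table of format counts, along with the share of files that could not be identified. The system names, format labels, and counts below are invented placeholders.

```python
from collections import Counter

# Hypothetical per-system reports: (format label, file count) rows.
# System names, labels, and counts are invented for illustration only.
reports = {
    "system_a": [("PDF", 120), ("TIFF", 45), ("unknown", 10)],
    "system_b": [("PDF", 30), ("XML", 75)],
    "system_c": [("TIFF", 5), ("unknown", 20)],
}


def compile_profile(reports):
    """Sum file counts per format across all system reports."""
    totals = Counter()
    for rows in reports.values():
        for fmt, count in rows:
            totals[fmt] += count
    return totals


def unidentified_share(totals, label="unknown"):
    """Fraction of all files whose format could not be identified."""
    total_files = sum(totals.values())
    return totals[label] / total_files if total_files else 0.0


profile = compile_profile(reports)
for fmt, count in profile.most_common():
    print(f"{fmt}: {count}")
print(f"unidentified: {unidentified_share(profile):.1%}")
```

Real reports would differ in granularity, as noted above, so a step that maps each system's labels onto a shared vocabulary would likely be needed before summing.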

[Figure: Visualization of File Format Frequency from Hundreds of Millions to a Single File]

So what did I learn when I could see everything in the aggregate? Most straightforwardly, I learned which file formats make up the bulk of our holdings. I learned what percentage we couldn’t characterize and map to documented formats with certainty. I learned that a surprising (to me) number of software companies have used the same file extensions over time. That matters because we haven’t run DROID or JHOVE on a subset of our holdings, and there can be six or more possible formats based on file extension alone. I learned just how many variations of certain formats we have acquired over the decades, PDF being the most diverse. I can now better see the risks in the holdings. I learned we had more of certain formats than I expected. Of course the processing archivists saw them coming in over time, but now we have a more complete view of just how much has been transferred. And I validated that having strong transfer format guidance is essential, but we of course take in records in the formats the agencies have, because those are the tools and formats they use to do their jobs, and there must always be exceptions.
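The extension problem can be sketched in a few lines. The example below is only an illustration under stated assumptions: the `CANDIDATES` table is a hypothetical lookup with placeholder format names, not a real format registry and not NARA's data. It flags files whose extension alone maps to more than one candidate format, which are the ones that would need signature-based identification.

```python
from collections import defaultdict
from pathlib import PurePath

# Hypothetical lookup from extension to formats that have used it.
# Entries are placeholders, not a real registry and not NARA's data.
CANDIDATES = {
    ".doc": ["word processor format A", "word processor format B", "plain text"],
    ".pdf": ["PDF"],
}


def candidates_for(filename):
    """Return the candidate formats for a file, based on extension only."""
    ext = PurePath(filename).suffix.lower()
    return CANDIDATES.get(ext, [])


def ambiguous_files(filenames):
    """Group files whose extension maps to more than one candidate format."""
    groups = defaultdict(list)
    for name in filenames:
        if len(candidates_for(name)) > 1:
            groups[PurePath(name).suffix.lower()].append(name)
    return dict(groups)


print(ambiguous_files(["a.doc", "b.pdf", "c.xyz", "D.DOC"]))
```

Files with an unrecognized extension return no candidates at all, so they would need to be counted separately from the ambiguous ones.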

We are preparing to put a major new release of our Electronic Records Archives (ERA) repository into production in 2018. Now I can create a more inclusive set of File Format Preservation Action Plans to document our format risks and plan for format migrations. This process has informed and changed the prioritization for acquiring tools to view, process, and migrate those types of records in the updated system framework. And I look forward to a future where this data compilation and reporting is easier as we consolidate holdings into the new environment.

This approach also helped in how we look at access. We have a roadmap for the evolution of our systems, including the National Archives Catalog; I can share the trends in our holdings to help the teams responsible for those systems plan for the formats they will need to display and deliver to the public.

I called this a picture of NARA’s holdings, and the analysis can in fact be used to create a literal picture, such as the one above. I plugged the data into an analytical tool that I can use to query the data but also to create visualizations. Now I can better tell the story of our records holdings--what we have and how we got it--in a variety of ways to make our work more concrete for the rest of the federal government, staff, and the public.
