Finding the Cutting Edge in Common Formats

Elizabeth Kata

Last updated on 11 November 2019

Elizabeth Kata is Digital Archives Assistant at the International Atomic Energy Agency (IAEA). She attended iPRES 2019 with support from the DPC's Career Development Fund, which is generously funded by DPC supporters.

Placing a session with the title “Common Formats” under the theme “Cutting Edge” seemed at first contradictory as I looked over the iPRES 2019 program, but the four papers presented in this session demonstrated cutting-edge work being done with and to preserve common formats, from data tape recovery to PDF/A analysis. Read on to see what upcoming actions this session inspired!

Recovering Data from Data Tapes: When the Common Format is No Longer Common

Johan van der Knijff (@bitsgalore) got the session off to an exhilarating start with his paper highlighting some results from the Web Archaeology project of KB | National Library of the Netherlands. In the first part of his presentation, Johan introduced some of the issues in recovering digital data from tape formats, a topic thus far not widely covered in forensics literature. Even though the IT department at the KB still had drives to read the tapes in question (DDS-1, DDS-3, and DLT-IV), issues arose around connecting the drives to the forensic workstation because of the unclear standards around SCSI and which connectors should be used to interconnect SCSI devices.

After resolving the hardware needs, Johan turned to the software needed to read the data from the tapes. He explained that this can be done using the tools dd and mt from the command line, but to make the process more user-friendly and to avoid the risk of accidental data loss when using dd, he developed tapeimgr, which reads data tapes through a simple graphical user interface. Using this tool, Johan was able to extract data from most of the tapes, with each session returning one bitstream file. He then needed to identify the format of the container files from the tapes and extract the content from the containers.
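To give a rough idea of the command-line route mentioned above, here is a minimal Python sketch that only builds the mt and dd commands for reading a number of sessions from a tape, without running them. It is not how tapeimgr works internally. The function name, the output file naming, the device path /dev/nst0 (a non-rewinding Linux tape device) and the default block size are all assumptions for illustration; in practice the block size has to match the one the tape was written with.

```python
from typing import List


def tape_read_commands(device: str, prefix: str, sessions: int,
                       block_size: int = 512) -> List[List[str]]:
    """Build (but do not run) mt/dd commands to read sessions from a tape.

    Hypothetical helper: names and defaults are illustrative only.
    """
    # Rewind once so reading starts at the first session on the tape.
    commands = [["mt", "-f", device, "rewind"]]
    for n in range(1, sessions + 1):
        out_file = f"{prefix}{n:04d}.dd"
        # The tape is always the input (if=) and a new file the output (of=);
        # swapping the two is the kind of mistake that can destroy data.
        commands.append(
            ["dd", f"if={device}", f"of={out_file}", f"bs={block_size}"]
        )
    return commands


if __name__ == "__main__":
    for cmd in tape_read_commands("/dev/nst0", "session", 3):
        print(" ".join(cmd))
```

Printing the commands first, rather than executing them, makes it possible to check the input and output arguments before anything touches the tape.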

The data in question came from xxLINK, a Dutch web development and hosting company that created and hosted websites for several Dutch institutions. Using the data collected from the data tapes, the Web Archaeology project of the KB was able to reconstruct an early version of the Schiphol airport website, among others, which predated the first snapshot of the website from the Internet Archive. Using truly cutting-edge methods, the project will be able to expand the documentation of the early Dutch internet landscape.

You Don’t Know .applejackpie! Learning From the 16000+ File Extensions in the Library of Congress Collections

Trevor Owens (@tjowens) presented, on behalf of a team from the Library of Congress (LoC), the results of a survey of file extensions found in their collections. The survey focused on content managed in their Content Transfer System (CTS), comprising about 681 million files taking up about 8 TB of data. (This notably excludes the digital content for the Motion Picture, Broadcasting and Recorded Sound Division.) Over 16,000 unique file extensions were represented in this set!

The top ten file extensions by count included expected formats like tif, jpg, txt, pdf, xml, and gz, among others, but the tenth most common file extension was no extension at all, accounting for over 5 million files and over 3 TB of data! The paper explains, “It is likely that most of this content is related to system functions, scripted operations, or datasets, but more advanced format analysis is required to determine if any of this content represents known file formats that should be managed as collection materials.”

Analysing the top file extensions by size showed that container files (gz) and tif accounted for about 80% of the content in the LoC’s CTS. About 95% of the content was covered by eight file extensions: .jp2, .tif, .jpg, .gif, .xml, .txt, .pdf, and .gz. Of the remaining file extensions, 2,761 appear only once, 14,064 appear fewer than 100 times, and “3,810 files representing 272 file extensions are 0 byte files, meaning there is no content to the file except for a filename.”
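A survey of this kind can be approximated on a much smaller scale with a few lines of Python. The sketch below is not the method used by the LoC team; it simply tallies extensions by file count and by total size from (filename, size) pairs. The function name, the "(none)" label for files without an extension, and the sample data are all made up for illustration.

```python
from collections import Counter
from pathlib import PurePath
from typing import Iterable, Tuple


def survey_extensions(files: Iterable[Tuple[str, int]]) -> Tuple[Counter, Counter]:
    """Tally file extensions by number of files and by total bytes."""
    counts: Counter = Counter()
    sizes: Counter = Counter()
    for name, size in files:
        # Lower-case the extension so "TIF" and "tif" are counted together;
        # files without an extension get an explicit label.
        ext = PurePath(name).suffix.lower().lstrip(".") or "(none)"
        counts[ext] += 1
        sizes[ext] += size
    return counts, sizes


if __name__ == "__main__":
    # Invented sample data: (filename, size in bytes).
    sample = [
        ("page1.tif", 100),
        ("page2.TIF", 150),
        ("notes.txt", 10),
        ("README", 5),
        ("empty.xyz", 0),
    ]
    counts, sizes = survey_extensions(sample)
    print(counts.most_common(3))
    print(sizes.most_common(3))
    # Extensions that occur only once, the "long tail" of the survey.
    print(sorted(ext for ext, n in counts.items() if n == 1))
```

Counting by size as well as by number of files matters because, as the survey showed, the two rankings can look very different.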

The analysis performed will serve as the basis for further policy development and resource planning. Among other conclusions, they noted the need for format characterization and validation tools to be able to perform at collection-level scale and the need for tools to analyse the contents of container files.

In the discussion that followed, the question was raised of whether these file extensions are present in the PRONOM registry. Leslie Johnston mentioned that the US National Archives and Records Administration had recently done a risk analysis with similar results (published on GitHub). Their results also showed the need to perform gap analysis and get missing formats into PRONOM, Siegfried, and DROID.

As a preservation risk management strategy, Trevor suggested assuming the most common formats will likely continue to be supported, making it worthwhile to focus on the common but endangered. For the truly rare, we may need to rely on digital archaeology.

Complex Audio Objects: Less Common Formats, Creative Solutions

Nick Krabbenhöft (@NKrabben, New York Public Library) presented strategies for storing complex audio objects. The New York Public Library (NYPL), like many other cultural heritage institutions, is digitizing its audio and video material to ensure that the recordings outlast their media. The specifications for digitizing audio are widely agreed upon (96 kHz, 24-bit resolution, BWF), but the storage of the resulting files has been less widely discussed, and there is no clear consensus. The NYPL encountered issues due to the 4 GB maximum file size of WAV files. This presented a challenge for organizing the streams, regions, and faces stored on magnetic media. Following the OAIS model, the relationships and hierarchies must be documented, which becomes even more imperative if the files need to be split in unexpected ways, and this must be clearly communicated if working with outside vendors.
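To see why the 4 GB limit bites in practice, a little arithmetic helps. The limit comes from the 32-bit size fields in the RIFF/WAV structure. The sketch below estimates how long an uncompressed recording can run before reaching it, ignoring the small header overhead. The function name is hypothetical, and the channel counts are assumptions chosen for illustration, not figures from the paper.

```python
# RIFF chunk sizes are stored as 32-bit unsigned integers, which caps a
# plain WAV file at just under 4 GiB.
WAV_MAX_BYTES = 2**32 - 1


def max_wav_seconds(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Approximate maximum duration of uncompressed PCM audio in one WAV file."""
    bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels
    return WAV_MAX_BYTES / bytes_per_second


if __name__ == "__main__":
    for channels in (1, 2):
        minutes = max_wav_seconds(96_000, 24, channels) / 60
        print(f"{channels} channel(s) at 96 kHz / 24-bit: about {minutes:.0f} minutes")
```

At 96 kHz and 24 bits, a stereo stream uses 576,000 bytes per second, so a single file tops out at roughly two hours. Longer transfers, or those with more channels, have to be split or stored in a format such as RF64.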

Nick discussed other strategies for handling these complex objects, including using formats like RF64, which support much larger file sizes but do not have universal support. Another strategy would be to use container formats like MXF and Matroska, which enjoy wide(r) support but might be met with resistance from current practitioners. These formats allow for sequential storage, but they do not have a metadata standard for encoding time, which is needed to assemble regions into the correct timeline, for example. Nick also mentioned using FLAC, a lossless audio compression format, with the potential for 30-50% storage savings. As he noted, “No one will congratulate you for saving on storage, but they will yell at you for having to buy more.” NYPL has experimented with using Matroska and found it easier to model a complex audio object that way than by storing the relational metadata in a sidecar file, but how this strategy holds up for access remains to be seen. There is no one-approach-fits-all strategy, but cultural heritage institutions would be wise to document and share their strategies for digitizing and storing complex audio objects in order to develop preferred strategies.

News We Can Use: PDF/A in the National Digital Newspaper Program

Anna Oates (@annaoates, Federal Reserve Bank of St. Louis) and William Schlaack (University of Illinois at Urbana-Champaign) finished the session with their analysis of the use of the PDF/A format in the National Digital Newspaper Program (NDNP), a U.S. project to enable access to and preservation of digitized historic newspapers. The project, which is a collaboration between the National Endowment for the Humanities and the Library of Congress, provides support for awardees to digitize local and regional historical newspapers while also providing technical guidelines for the submission, including a TIFF, JPEG2000, ALTO XML, and PDF of each newspaper page digitized. Before submission packages are accepted, they must be run through the Digital Viewer and Validator (DVV), a tool developed by LoC for the NDNP. The tool validates that the files meet certain project-specific criteria, beyond basic format validation.

While not mandatory, use of PDF/A for the PDF files is recommended where possible. The DVV tool does not validate specifically for PDF/A conformance, and the authors identified veraPDF as the preferred tool for validating PDF/A files, among other reasons because it allows users to create a profile for their own needs.

Using PDFs from the Chronicling America website, which hosts the digitized newspapers, Anna and William put together a test corpus of 382 PDF files. All of them failed PDF/A conformity. They analysed the results of the rule failures by grouping them into four overarching types: XMP Metadata, Embedded Images, Embedded Fonts, and Object Streams. For the rules on embedded images and embedded fonts in particular, it appeared that better guidance by the NDNP project and/or changes to the DVV validation tool could help improve PDF/A conformity.

Taking the Next Steps

After this session, I thought about what steps we as a community could take to mitigate some of the risks presented in this session. One of the possibilities I saw was to address gaps in PRONOM. After discussing with others, we decided to call for a PRONOM Research Week. During the week of 18-24 November, volunteers are encouraged to help with PRONOM’s research backlog. You can enhance documentation, supply sample files, or create a signature, among other things.

We’re posting the PRONOM research backlog in a GitHub repository, which will also be the central location to share sample files and submissions. We also kindly ask for people to sign up via this Google spreadsheet, so that people can coordinate. We’ve also posted some resources here, which include blog entries and webinars to help people learn more and get started. It’s a small step, but a way to keep the ball moving after iPRES and to support the sustainability of common formats.
