Euan Cochrane

Last updated on 23 November 2018

Euan Cochrane is Digital Preservation Manager at Yale University Library


In the EaaSI program of work we're developing the ability to click on a link to a digital object (for example in a library’s catalogue or an archival finding aid) and have it "automagically" open in a representative version of the “original” software, within your web browser, using an emulator.  For example, the gif below demonstrates clicking a link to automatically open a Microsoft Works file running in Windows 98 within a web browser[1].

[Animated GIF: a Microsoft Works file opening automatically in Windows 98 within a web browser]

This is important because, when files are opened in more modern software, their content can be distorted, changed or lost. Here is the same Works file opened within Microsoft Word from Office 365:

[Image: the same Works file opened in Microsoft Word from Office 365]

You may notice that the formatting has changed, the page count increased significantly, and new information has been introduced.

We're calling this product, which automagically opens digital objects in their original software, the Universal Virtual Interactor (UVI). We have a few reasons for that name:

  • Firstly, it is intended to be "universal" and (theoretically) work with any files/digital objects.
  • Secondly, it's called an "interactor", not a "viewer" or "renderer", because it's not just about "rendering" or "viewing". Rendering and viewing are primarily passive activities, but digital object experiences are not passive: they’re interactive!

We want to enable users to interact with their digital objects, presented as an experience that is as close to the "original" as possible. That interaction might include such things as turning "track changes" functionality on and off in a document, viewing embedded metadata through standard application menus, browsing and submitting queries through database interfaces, interrogating and temporarily changing spreadsheet formulae or embedded scripts, etc.

What follows is an exploration of how we're working to achieve the "magic" behind the UVI.

The “magic” behind the UVI

To achieve automagic interaction we have to match digital objects[2] to configured interaction computing environments. A configured computing environment is a little more complicated than a digital object: it consists of a set of computer hardware and software that can be used together for whatever purpose you might have. That hardware might be physical or emulated/virtualized, and the software may be installed and configured on a physical hard drive or in an ‘image’ file that an emulator or virtualizer can use. An example environment of this type might be a Windows 98 Second Edition based computer with Microsoft Office 97 installed, running on emulated hardware that simulates an Intel Pentium 1 computer with 64MB of RAM, a Sound Blaster 16 audio card and a Cirrus Logic video card.

The EaaSI program of work is committed to configuring at least 3000 emulated or virtualized configured computing environments during the first grant-funded phase of work. These environments will be fully documented with structured documentation, much of which is sourced from and contributed to https://wikidata.org. Many of the environments will be fairly simple variants of one another. For example, we are using Windows 98SE as a “base” environment from which we are creating many derivative environments where the only difference is the set of applications installed on them (e.g. Microsoft Office 97 vs Corel WordPerfect Office 2000). The conceptual simplicity of these environments belies both the complexity of the metadata required to make automatic use of them and the power that comes from having them available as uniquely addressable computing environments. The Emulation as a Service (EaaS) API that is part of EaaSI allows for programmatic access to these environments, meaning that, provided sufficient documentation is available, we should be able to match digital files to environments that include software that can interact with them, and enable that interaction to happen automatically.
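To make the idea of a "base" environment and its derivatives a little more concrete, here is a minimal Python sketch of how such records might be modelled. The class names, field names and identifiers (such as "win98se-base") are assumptions made for illustration only; they are not the EaaSI metadata model or the EaaS API.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Hardware:
    """Emulated (or physical) hardware the software runs on."""
    cpu: str
    ram_mb: int
    audio: str
    video: str


@dataclass(frozen=True)
class Environment:
    """A configured computing environment: hardware plus installed software."""
    env_id: str  # hypothetical identifier, not an EaaSI ID format
    operating_system: str
    hardware: Hardware
    applications: Tuple[str, ...] = ()

    def derive(self, env_id: str, *extra_apps: str) -> "Environment":
        """Return a copy that differs only in its ID and installed applications."""
        return replace(
            self, env_id=env_id, applications=self.applications + extra_apps
        )


base = Environment(
    env_id="win98se-base",
    operating_system="Windows 98 Second Edition",
    hardware=Hardware("Intel Pentium", 64, "Sound Blaster 16", "Cirrus Logic"),
)
office97 = base.derive("win98se-office97", "Microsoft Office 97")
wordperfect = base.derive("win98se-wpo2000", "Corel WordPerfect Office 2000")
```

Because the records are immutable, each derivative shares the base's operating system and hardware description and only the application list changes, which mirrors the "simple variants" described above.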

Matching files to configured interaction computing environments

Our goal with the UVI is to match digital objects to configured computing environments automatically. Moreover, our goal is to match them to the “best” environment, both from authenticity and functionality perspectives. “Best” is subjective though, so a secondary goal is to ensure that deciding which is “best” is flexible within the context in which the UVI is configured, to allow both administrators and users to have input in the selection process.

So, given that context, how do we match digital files to interaction computing environments?

When a file is sent to EaaSI it is identified by a suite of tools and matched to a file format identifier in wikidata.org, often via its PRONOM identifier. The dates on which the file was created and last edited are also extracted from the file where possible.
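As a rough illustration of this identification step, the Python sketch below pulls PRONOM identifiers out of a report shaped like the JSON that Siegfried produces (`sf -json <file>`) and builds a Wikidata SPARQL query for the matching format item. The sample report is made up, the function names are hypothetical, and the use of Wikidata property P2748 for PRONOM identifiers is an assumption that should be checked before relying on it. This is not EaaSI's actual code.

```python
import json
from typing import List

# Illustrative Siegfried-style report; the values are made up for this example.
SAMPLE_REPORT = """
{"files": [{"filename": "report.doc",
            "matches": [{"ns": "pronom", "id": "fmt/40",
                         "format": "Microsoft Word Document"}]}]}
"""


def pronom_ids(report_json: str) -> List[str]:
    """Collect PRONOM identifiers (PUIDs) from a Siegfried-style JSON report."""
    report = json.loads(report_json)
    return [
        match["id"]
        for entry in report.get("files", [])
        for match in entry.get("matches", [])
        if match.get("ns") == "pronom" and match.get("id") != "UNKNOWN"
    ]


def wikidata_format_query(puid: str) -> str:
    """Build a SPARQL query for the Wikidata item carrying this PUID.

    Assumes P2748 is the PRONOM identifier property on Wikidata.
    """
    if '"' in puid or "\\" in puid:
        raise ValueError("unexpected character in PUID")
    return f'SELECT ?format WHERE {{ ?format wdt:P2748 "{puid}" . }}'
```

The query string could then be sent to the Wikidata query service; that network call is left out so the sketch stays self-contained.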

EaaSI then looks up the list of software applications that can interact with that format, and narrows that list in a few ways, including:

  1. It looks at what applications used that format as their default "open" format
  2. It looks at what applications used that format as their default “save-as” format
  3. It looks at the age of the file, deciding whether either date is likely accurate (they often aren’t)
    1. if the dates are both close to right now and the file format standard hasn't been used for a long time then the dates are unlikely/less likely to be accurate
  4. It looks at the dates when the applications that used the identified format as the default save or open format were superseded by software that no longer used the format as a default save or open format
  5. It looks at information about which applications were most popular during which periods of time

By weighting these factors EaaSI can narrow the list to a set of applications that are most likely to have been "original" to the time when the file was in regular use, or at least contemporaneous with it.
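The weighting described above could be sketched along the following lines in Python. Everything here is an assumption made for illustration: the weights, the one-year and five-year thresholds in the date check, the field names and the application records are invented, and the real EaaSI algorithm may combine the factors quite differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Application:
    name: str
    default_open: bool          # format was the default "open" format
    default_save: bool          # format was the default "save-as" format
    released: date
    superseded: Optional[date]  # None if never superseded
    popularity: float           # 0.0 to 1.0, hypothetical measure


# Hypothetical weights; tuning these is the hard part.
WEIGHTS = {"default_open": 1.0, "default_save": 2.0, "in_use": 2.0, "popularity": 1.0}


def date_is_plausible(file_date: date, format_last_used: date, today: date) -> bool:
    """A very recent date on a long-unused format is probably not original."""
    is_recent = (today - file_date).days <= 365
    format_is_stale = (today - format_last_used).days > 5 * 365
    return not (is_recent and format_is_stale)


def score(app: Application, file_date: Optional[date]) -> float:
    """Weighted sum of the factors; file_date is None when no date is trusted."""
    total = WEIGHTS["default_open"] * app.default_open
    total += WEIGHTS["default_save"] * app.default_save
    total += WEIGHTS["popularity"] * app.popularity
    if file_date is not None and app.released <= file_date:
        if app.superseded is None or file_date <= app.superseded:
            total += WEIGHTS["in_use"]
    return total


def rank(apps: List[Application], file_date: Optional[date]) -> List[Tuple[str, float]]:
    """Return (name, score) pairs, highest score first."""
    scored = [(app.name, score(app, file_date)) for app in apps]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A caller would first run `date_is_plausible` on each extracted date and pass `None` to `rank` when neither date can be trusted, so that the remaining factors decide the ordering.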

EaaSI then looks at the configured computing environments that already exist to see which have the application(s) installed and configured on them, ready for use. There are then a number of options that the network can implement:

  1. If there is only one recommended application then use that application to deliver the interaction experience for the digital object
  2. If there is more than one recommended application then either:
    1. Default to the highest weighted option
    2. Provide the user a UI in which the user can choose which environment to use
    3. Randomly assign an environment from the recommended options
    4. Use the application that was most popular at the time (likely the highest weighted option)
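The options above amount to a small, configurable selection policy. A minimal Python sketch, assuming the candidates arrive as (environment, weight) pairs and using strategy names invented for this example, might look like this:

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Ranked = Sequence[Tuple[str, float]]  # (environment id, weight) pairs


def choose_environment(
    ranked: Ranked,
    strategy: str = "highest",
    ask_user: Optional[Callable[[List[str]], str]] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one environment from the recommended candidates."""
    if not ranked:
        raise LookupError("no configured environment has a recommended application")
    if len(ranked) == 1:
        # Only one recommendation: use it.
        return ranked[0][0]
    if strategy == "highest":
        # Default to the highest weighted option.
        return max(ranked, key=lambda pair: pair[1])[0]
    if strategy == "ask":
        # Let the user choose through whatever UI the caller supplies.
        if ask_user is None:
            raise ValueError("the 'ask' strategy needs an ask_user callback")
        return ask_user([env for env, _ in ranked])
    if strategy == "random":
        return (rng or random).choice([env for env, _ in ranked])
    raise ValueError(f"unknown strategy: {strategy}")
```

The "most popular at the time" option is not given its own branch here because, as noted above, it is likely to coincide with the highest weighted option; a deployment that weights popularity separately would add a strategy for it.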

All of the metadata required for this (metadata about the software applications, the open/import and save-as/export formats they support, their default formats for each, their dates of use, their popularity, etc.) is being captured or confirmed as part of the work of the EaaSI program and is considered a significant output of the program.

As we improve the algorithm behind the results, the improvements will be incorporated into the source code behind EaaSI, which will be published under an open license.

A better way

It would be much better if we didn't have to make informed guesses using an algorithm like this in order to match objects to interaction environments, and could instead be more deterministic in our matching. We wouldn't have to guess if we had tools that could identify the original intended interaction application for digital objects more deterministically. In most cases the original intended interaction application is the software that created the object. In the majority of the remaining cases, it can be easily identified if you know the original creating application. For example, for PDF files created by Microsoft Office the original intended interaction application is normally the contemporaneous version of Adobe Acrobat Reader. So, if we could identify the original creating application of a file, we would be a long way towards achieving our goal.

File format identification tools such as DROID and Siegfried match files to file format standards by looking within the files for the patterns that the file format requires to be there. Those patterns are called ‘signatures’. It ought to be possible to use the same tools to match files to their creating applications as well as to file format standards. Doing this assumes two things, though:

  1. That the creating applications do create unique signature patterns within the files they create
  2. That we can develop infrastructure to support identifying those patterns and matching them to specific application versions in the same way PRONOM supports matching patterns to file format standards.

The first of these is empirically testable, and we already know that some applications do create unique patterns within files that can be used to identify the creating application. For example, Microsoft Office 2007 aimed to adhere to the Open Document Format (ODF) file format standards, and Excel 2007 was able to write out Open Document Spreadsheet (.ods) files with formulae in them. By virtually all accounts the Excel implementation adhered to the letter of the standard in this particular case. Interestingly, however, OpenOffice.org, the leading open source alternative to Microsoft Office at the time and a leading proponent of the Open Document Format standards, created such files differently. There were heated online debates about which approach was more correct (see the links at http://fileformats.archiveteam.org/wiki/OpenDocument_Spreadsheet), but most importantly this difference in implementation allows tools to automatically identify which set of code was used to create the files.

The second of these assumptions is more challenging, due to a lack of existing infrastructure by way of standard identifiers and metadata for software application versions, and a lack of metadata and infrastructure for storing software application version signatures. We aim to use https://wikidata.org to meet this need in the future by capturing creation-application-version signatures in the application-version item pages in Wikidata.
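To show what matching files to creating applications might look like in practice, here is a small Python sketch that tests a file's bytes against a table of per-application patterns, in the same spirit as format signatures. The application names and byte patterns below are entirely hypothetical placeholders; real signatures would have to be researched and recorded (for example in Wikidata, as described above), and real tools like DROID and Siegfried use richer signature syntax than plain regular expressions.

```python
import re
from typing import List, NamedTuple, Sequence


class AppSignature(NamedTuple):
    application: str     # hypothetical application-version label
    pattern: re.Pattern  # compiled bytes pattern expected in files it creates


# Placeholder signatures, invented for illustration only.
SIGNATURES = [
    AppSignature("ExampleWriter 1.0", re.compile(rb"<generator>ExampleWriter/1\.0")),
    AppSignature("ExampleSheet 2.x", re.compile(rb"formula=\"xs:=")),
]


def identify_creators(
    data: bytes, signatures: Sequence[AppSignature] = SIGNATURES
) -> List[str]:
    """Return every application whose signature pattern occurs in the data."""
    return [sig.application for sig in signatures if sig.pattern.search(data)]
```

A file that matches more than one signature, or none, would fall back to the weighted guessing described earlier.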

Automatically opening the files for interaction

Once files are matched to configured interaction computing environments, EaaSI can serve the environment to the user with the file in an additional attached drive (e.g. another hard drive or a floppy or CD/DVD drive) and have that drive location opened automatically on boot and presented to the user for interaction. This is achieved by creating an empty image on demand and putting the file into it, then attaching the image to the emulated hardware and having the environment prepared with that location opening automatically on boot. However, that process does not actually automate the opening of the files themselves, which is what we are aiming for. More work is required to achieve that level of automation.
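One way to picture the "empty image on demand" step is with common Linux tools. The Python sketch below builds the commands to create a FAT-formatted image with mkfs.fat (from dosfstools) and copy a file into it with mcopy (from mtools). This is an assumption about how such a step could be done, not a description of how EaaSI does it; the function names, the 25% headroom and the 1440 KiB minimum are arbitrary choices for the example.

```python
import math
import subprocess
from pathlib import Path
from typing import List


def image_commands(
    file_size_bytes: int,
    source: str,
    image: str,
    min_kib: int = 1440,
    headroom: float = 1.25,
) -> List[List[str]]:
    """Build the commands that create a FAT image and copy the file into it."""
    # Leave some headroom for filesystem structures; never go below floppy size.
    size_kib = max(min_kib, math.ceil(file_size_bytes * headroom / 1024))
    return [
        ["mkfs.fat", "-C", image, str(size_kib)],  # -C creates the image file
        ["mcopy", "-i", image, source, "::/"],     # copy into the image's root
    ]


def build_image(source: str, image: str) -> None:
    """Run the commands; requires dosfstools and mtools to be installed."""
    size = Path(source).stat().st_size
    for command in image_commands(size, source, image):
        subprocess.run(command, check=True)
```

The resulting image file could then be attached to the emulated hardware as an extra drive. Very old guest operating systems may need a specific FAT variant or an ISO image instead, which this sketch does not handle.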

The process to automate the opening of a file within an emulated computing environment differs by operating system and application, and is not possible with all applications. However, there are a few options that we know we can implement and are working on:

  1. Upon request, EaaSI opens the disk image of the configured interaction computing environment and puts the file to be interacted with into the “startup” location for operating systems that support that (e.g. variants of Microsoft Windows and Apple Mac OS)
    1. For files that can’t fit within the free space of the environment’s disk image, EaaSI puts a link in the startup location to the file stored in another drive that was created on-demand with the file within it
    2. This requires the file’s MIME type to be associated with the appropriate interaction application within the operating system, such that the required application is configured as the one to automatically open that MIME type, and that information has to be available as metadata for EaaSI to use - all things we are implementing within EaaSI
  2. Create a script/program that opens the file using the methods made available by the operating system and the required application, and include it in the operating system’s startup process. For example, running LibreOffice Writer in Linux with the following bash command should open the file “file.odt”:
    libreoffice --writer file.odt
    Then use either the method above or, in the case of e.g. Linux operating systems or MS-DOS, edit the startup sequence to include the appropriate command.
  3. Use a pre-recorded GUI macro to automatically open the file using mouse and keyboard maneuvers. This is probably the most widely applicable option, but it may require restricting manual user input until the file is open, which could have various challenges associated with it (such as identifying when to re-enable user input).
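The choice between the options listed above could be driven by metadata about each environment. The Python sketch below is one possible decision procedure under stated assumptions: the `EnvironmentInfo` fields, the plan names and the order of preference (startup location first, then a startup command, then a GUI macro) are all invented for illustration and are not EaaSI's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class EnvironmentInfo:
    startup_location: Optional[str]    # None if the OS has no "startup" location
    free_bytes: int                    # free space in the environment's disk image
    type_associated: bool              # file type is associated with the application
    open_command: Optional[List[str]]  # argv template; "{file}" marks the file path


Plan = Tuple[str, Union[str, List[str], None]]


def plan_auto_open(
    env: EnvironmentInfo, file_name: str, file_size: int, attached_path: str
) -> Plan:
    """Decide how to open the file automatically when the environment boots.

    attached_path is where the file appears on the drive created on demand.
    """
    if env.startup_location and env.type_associated:
        if file_size <= env.free_bytes:
            # Option 1: put the file itself into the startup location.
            return ("copy_to_startup", f"{env.startup_location}/{file_name}")
        # Option 1.1: too big, so link to the copy on the attached drive.
        return ("link_in_startup", attached_path)
    if env.open_command:
        # Option 2: add a command to the startup sequence.
        argv = [attached_path if part == "{file}" else part for part in env.open_command]
        return ("startup_command", argv)
    # Option 3: fall back to a pre-recorded GUI macro.
    return ("gui_macro", None)
```

For a Linux environment whose metadata records the template `["libreoffice", "--writer", "{file}"]`, the plan would be a startup command equivalent to the bash example above, with the attached drive's path substituted for the file name.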

Most of these methods require additional metadata to be captured about the software applications as-configured within the configured software environments, e.g. where the executable files are, what the default open-as formats are, etc. These are all metadata items that we are recording as part of the EaaSI program of work.

A work in progress

This is a work in progress and we will be iterating on it over time to improve the performance of the UVI. In addition, the UVI will only ever be as useful as the set of available interaction environments and their metadata. We hope this set will improve over time and are working to ensure it will. We would love to hear from the digital preservation community about this approach and would appreciate any feedback or ideas you might like to contribute. Please get in contact, and #joinSPN!


[1] This example file was also used as an example in the Rendering Matters research undertaken at Archives New Zealand.

[2] The intent is for the UVI to be able to work with multi-file digital objects; however, multi-file objects require additional logic beyond that needed for single-file objects. That additional logic is mostly a more complex but not more complicated variant of the single-file case, so for simplicity only the single-file example will be used in this post.
