Handbook

Moving pictures and sound

 

Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark

Overview

 

This case study provides a brief novice to intermediate level overview summarised from the DPC Technology Watch Report on Preserving Moving Picture and Sound. Five "mini case studies" of UK collections that have run preservation and access projects for sound and moving image content are included. The report itself provides a "deep dive" discussing a wider range of issues and practice in greater depth with extensive further reading and advice (Wright, 2012). It is recommended to readers who need a more advanced level briefing on the topic and practice.

 

Introduction

 

The audiovisual domain is unique in that digitization is routinely critical to preservation. Audiovisual digitization for preservation is so pervasive that the two words have come to be used interchangeably. Audio and video need digitization for the very survival of their content, owing to the obsolescence of playback equipment and decay and damage of physical items, whether analogue or digital. The basic technology issue for collections of moving images and sound is the necessity to digitize all content currently sitting on shelves. Film on shelves can be conserved (unless it is already deteriorating), but still needs digitization to provide access.

A vital issue in preservation is access: motivation and funding for digitization purely for preservation purposes is difficult, if not impossible. There is great public, institutional and educational interest in the audiovisual record of the twentieth century. Creating access to that record is the key to obtaining the support needed for the digitization and preservation of the content.

The landscape for 'moving pictures and sound' is complicated: physically, there are large differences between audio, video and film recordings. The formats and record/playback equipment are completely separate; the digitization procedures are different; the digital files have different wrapper formats and metadata (with some overlaps); and the storage requirements differ, with video taking roughly 100 times as much storage per second of material as does audio, and high resolution digital film taking roughly 10 times more storage than video.
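As a rough worked example of these ratios, the sketch below starts from an assumed audio baseline (uncompressed 48 kHz, 24-bit stereo, chosen here purely for illustration and not taken from the report) and applies the approximate multipliers quoted above. The function and constant names are hypothetical.

```python
# Rough storage estimate using the approximate ratios quoted in the text.
# ASSUMPTION: the audio baseline (48 kHz, 24-bit, stereo, uncompressed) is
# an illustrative choice; real collections vary widely.

AUDIO_SAMPLE_RATE_HZ = 48_000
AUDIO_BYTES_PER_SAMPLE = 3      # 24-bit samples
AUDIO_CHANNELS = 2

VIDEO_TO_AUDIO_RATIO = 100      # "roughly 100 times" (from the text)
FILM_TO_VIDEO_RATIO = 10        # "roughly 10 times" (from the text)


def audio_bytes_per_second() -> int:
    """Data rate of the assumed uncompressed audio baseline."""
    return AUDIO_SAMPLE_RATE_HZ * AUDIO_BYTES_PER_SAMPLE * AUDIO_CHANNELS


def estimate_bytes(hours: float, kind: str) -> float:
    """Estimate storage for `hours` of 'audio', 'video' or 'film'."""
    audio = audio_bytes_per_second()
    per_second = {
        "audio": audio,
        "video": audio * VIDEO_TO_AUDIO_RATIO,
        "film": audio * VIDEO_TO_AUDIO_RATIO * FILM_TO_VIDEO_RATIO,
    }[kind]
    return hours * 3600 * per_second


if __name__ == "__main__":
    for kind in ("audio", "video", "film"):
        gigabytes = estimate_bytes(1, kind) / 1e9
        print(f"1 hour of {kind}: about {gigabytes:,.1f} GB")
```

Under these assumptions an hour of audio is about 1 GB, so the same hour of video is on the order of 100 GB and of high resolution film on the order of 1 TB. The absolute figures depend entirely on the chosen baseline; only the ratios come from the text.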

In addition, culturally and economically, there are significant preservation and curation differences between collections from:

  • commercial media industries – music, cinema and commercial broadcasting where preservation needs a commercial justification, a business case;
  • public bodies – public service broadcasting, academic collections and heritage institutions such as national museums, libraries and film institutes where preservation needs a cultural heritage justification, though increasingly this sector also needs a business case;
  • technical areas such as medicine, geology and surveillance, where recordings of images or of seismic events are raw data, kept as medical records or for reprocessing; and
  • other – a wide range of independent collections, ranging from individual efforts to material gathered by non-profit specialist institutions (for example, steam engine clubs or ethnological research) that do not fall into any of the above categories, though their material may eventually end up being donated to a public collection.

Within the landscape is a range of technologies including engineering, computing, Internet technology, archiving, media management, museum collections management, curation, preservation, access, knowledge management and resource discovery.

 

Technical challenges

 

Audiovisual recordings are surrogate reality. The technology allows the listener and viewer to get a sensation of what a situation sounded and looked like, but the technology actually only captures the sequence of light patterns or sound pressures acting on the recording instrument (camera, microphone). These patterns (for film) and signals (for video and audio) are more like data than like artefacts. The preservation requirement is not to keep the original recording media, but to keep the data, the information, recovered from that media.

A key technology issue is moving digital content from carriers (such as CD and DVD, digital videotape, DAT and minidisc) into files. This digital to digital 'ripping' of content is an area of digital preservation unique to the audiovisual world, and has unsolved problems of control of errors in the ripping and transfer process.

The final technology area is digital preservation of the content within the files that result from digitization or ripping, and the files that are born digital. While much of this preservation has problems and solutions in common with other content, there is a specific problem of preserving the quality of the digitized signal that is again unique to audiovisual content. Managing quality through cycles of lossy encoding, decoding and reformatting is one major digital preservation challenge for audiovisual files. The other issue is managing embedded metadata.

For three decades for audio, and for at least two decades for video, archives have been digitizing their analogue content for preservation and access. The problem areas are:

  • successful playback of the originals, in order to get an optimal signal to digitize;
  • standards: what compression level, encoding method and file format to use; and
  • efficiency: digitizing the existing analogue materials fast enough and economically enough to cope with the size and urgency of the problem.

 

Stages in sound and moving image digital preservation

 

For sound and moving image preservation, the following stages in the overall process need to be kept clear:

  • signal: the audio from a microphone, the video signal coming out of a video camera. These signals have physical properties (bandwidth; dynamic range) that can be defined and measured. The quality of a recording and the success or failure of any process of copying, digitization or preservation can be reduced (in large part) to how well that process maintains these two physical properties of the original signal;
  • recording of a signal onto a carrier (also called support, physical medium or recording format). For a century, the methods of capturing a signal were tied to the carrier of the signal: a wax cylinder, film reel or videotape. Digital technology produces recordings that are independent of carriers. Carrier independence is liberation: discs, tapes and films deteriorate or get damaged. Born digital recordings are liberated from these carrier-based problems, leading to a desire to liberate analogue recordings by digitization;
  • digitization: analogue recordings can be played back and recorded onto a new carrier, or digitized and so released from carrier dependence. Digitization has to ensure that the digital version has the same bandwidth and dynamic range as the original, to capture the original quality; and
  • digital preservation of the digital representation of a signal, meaning preserving the numbers, but also preserving the technology needed to decode (render) the numbers. Audiovisual content has a particular problem. The coding of the signal can be a compromise, not actually capturing the full signal, but instead losing some of it (lossy encoding) to get a more compact representation, thus reducing storage and transmission costs. Unfortunately coders/decoders (codecs) go out of use, and are replaced by newer technology. The file format holding the coded signal, the wrapper, is also subject to obsolescence. The failure and obsolescence of storage technology and the obsolescence of encode/decode methods and wrapper formats are major digital preservation problems for audiovisual content.
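The two signal properties named above, bandwidth and dynamic range, map onto the two basic digitization parameters, sample rate and bit depth. The sketch below uses two standard signal-processing results that are not taken from the report: the sampling rate must be at least twice the signal bandwidth, and an ideal N-bit quantiser gives roughly 6.02 N + 1.76 dB of dynamic range for a full-scale sine wave. The function names and the example figures (20 kHz, 90 dB) are illustrative assumptions.

```python
import math


def min_sample_rate_hz(bandwidth_hz: float) -> float:
    """Nyquist limit: sample at no less than twice the highest frequency."""
    return 2.0 * bandwidth_hz


def ideal_dynamic_range_db(bits: int) -> float:
    """Signal-to-noise ratio of an ideal N-bit quantiser (full-scale sine)."""
    return 6.02 * bits + 1.76


def min_bits_for_dynamic_range(target_db: float) -> int:
    """Smallest bit depth whose ideal dynamic range reaches target_db."""
    return math.ceil((target_db - 1.76) / 6.02)


if __name__ == "__main__":
    # ASSUMPTION: a source with 20 kHz bandwidth and 90 dB dynamic range.
    print(min_sample_rate_hz(20_000))        # theoretical lower bound in Hz
    print(min_bits_for_dynamic_range(90))    # theoretical lower bound in bits
```

These are theoretical lower bounds. In practice, published guidelines such as IASA TC-04 should be followed when choosing capture parameters, because real converters and filters need headroom beyond the minimum.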

 

Access and rights

 

Sound and moving picture content arising from cinema, broadcasting and the commercial music industry is constrained by rights issues. Music has copyright protection for the composer and for the physical object containing a performance (so-called mechanical copyright). Cinema productions are protected, and music used in a film retains its separate protections. Broadcasting is even more complicated, as all the parties involved in a production may have rights in future exploitation subsequent to the one or two transmissions that were specified in typical contracts. These rights are seen as protection by rights holders, but are also seen as restrictions on access. The situation for a public broadcaster is particularly difficult. The public invariably feel that any production by a public broadcaster has already been paid for by them, is already publicly owned and should be available for public access. Unfortunately that understandable feeling is not the same as the legal definition governing when a work enters the public domain (usually determined by expiry dates on copyright and other rights).

 

 Case study 1: The Open University (OU) Access to video assets project

 

This is an access and re-use project. The focus is to digitize (where necessary) audiovisual assets previously created by the OU, and place them in an asset management system so that current OU teaching and other activity can find and use these assets. Preservation is a by-product of the project rather than an end in itself. This project provides an important example of combining preservation of content with use of content, something of value to the institution in order to obtain a budget and deliver a benefit. The project was presented at the DPC Briefing Day 'Preserving Digital Sound and Vision'. The project digitized 1,200 videotapes and films, and placed the results in a Fedora digital repository. Also, 145,000 pages of documentation were digitized, providing the overall educational framework around the 1,200 items, giving them context and enhancing their ability to be re-used. The user interface provides granularity and time-based navigation. Overall this project is an outstanding example of best practice.

 

 Case study 2: British Library Archival sound recordings project

 

This is a JISC-supported preservation and educational access project that ran (in its initial phase) from 2004 to 2006. A second phase added further material. Nearly 50,000 recordings of speech, music and sounds of 'human and natural environments' were digitized and placed online. The online catalogue is open to all, and licensed UK further or higher education institutions can also listen to the audio. Anyone can listen to 2,000 of the items (or to any of them by attending the British Library reading room in London). The differences in access between educational institutions and the general public reflect the overall issue of rights as the one remaining constraint on open access to audiovisual materials in public institutions.

 

 Case Study 3: Imperial War Museum PSRE project

 

The Imperial War Museum has one of the UK's major film collections. It has been collecting film since its founding in 1919, beginning with footage from the Great War that led to the institution's founding. The Public Sector Research Exploitation (PSRE) fund made an award of nearly £1 million for cataloguing, digitization and online access (to the catalogue and the footage). The project ran from 2006 to 2009 and is of particular interest in that it is specifically aimed at commercial exploitation of a collection, and at sustainable business models around digitization and web access. The result is a website (http://film.iwmcollections.org.uk/) where anyone can view content in low quality; pull documents, stills and key frames into a lightbox; and fill a shopping basket to then purchase content.

 

 Case Study 4: British Universities Film & Video Council Newsfilm Online project

 

This is another project with JISC sponsorship. For four decades to 1960 newsreels shown in cinemas were the main way for the general public to see moving images of current events. The initial project ran from 2004 to 2008. The results are available through a website which, as for the BL Archival Sound Recordings project, has full functionality for registered universities and colleges. The general public can see the full catalogue and can see a single key frame for each item. Since the original phase of the project, the content has been augmented by ITN/Reuters news covering the events from decades after the decline of newsreels. Newsreel items are short: the initial project provided 3,000 hours of content, but that represented 60,000 items. In addition, as with the Open University project, documentation was also placed online for context and to support search and retrieval: 450,000 pages of bulletin scripts.

 

 Case Study 5: BFI and Regional Film Archives Screen Heritage UK (SHUK) project

 

SHUK is a large (£22.8 million) and complex project (involving 12 regional film archives in addition to the BFI). The project was complicated by changes in the structure and funding of the BFI, as well as a change of government and a raft of other issues. Nevertheless the project has produced major achievements:

  • conservation, not digitization: construction of a £6-million vault for film conservation;
  • digitization: film scanning and digital storage equipment for the regional film archives;
  • access: online catalogues of regional film archive content, available to the general public.
 

SHUK launched on 5 September 2011 with a BBC BFI joint production, The Reel History of Britain (SHUK, 2011).

 

Conclusions

 

The basic technology issue for collections of moving images and sound is the necessity for digitization of all content that is currently sitting on shelves. Audio and video need digitization for their very survival, owing to obsolescence and decay of physical items, whether analogue or digital. Film on shelves can be conserved (unless it is already deteriorating) but needs digitization for access.

Playback for preservation-quality digitization implies the need for optimal recovery of the original quality, which requires professional equipment and experience. The major technical obstacle is that, for many physical formats, the needed equipment is largely obsolete, meaning that parts and repairs and skilled operators are in increasingly short supply. The urgent recommendation is, do not wait! Audiovisual holdings need to be documented and made part of a preservation plan.

The situation for sound heritage is clear. The digitization standards, encoding, wrapper and metadata are all agreed and well documented in IASA TC-04 (IASA, 2009). Uncompressed audio in the Broadcast Wave Format (BWF) wrapper is widely used and well supported. There is no reason for the basic encoding to ever be changed, though the BWF wrapper may eventually become obsolete. The only significant problem is the failure of some standard audio applications to handle embedded BWF metadata correctly (ARSC, 2011). All archives need to be aware of the risk of loss of embedded metadata. The situation for video is complex, but there is a PrestoSpace roadmap for guiding choices on the digitization of various legacy formats. There is advice from the PrestoCentre and from JISC Digital Media on the digital preservation of the resultant files. A big challenge is a registry of applications that work properly on embedded video metadata, where the diversity is huge. There is no single agreed wrapper, metadata standard or even encoding standard, and the change from standard definition to high definition brings a new set of applications, wrappers and encodings.
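Because embedded metadata can be lost silently, it may help to check whether a file still carries its Broadcast Wave metadata after it has passed through an application. In BWF the broadcast extension data lives in a RIFF chunk with the identifier `bext`. The sketch below is a minimal, assumption-laden check that walks the top-level chunks of a RIFF/WAVE file using only the Python standard library. The function names are hypothetical, RF64 files are not handled, and the contents of the chunk are not validated.

```python
import struct


def list_riff_chunks(path):
    """Return [(chunk_id, size), ...] for the top-level chunks of a WAVE file."""
    chunks = []
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12:
            raise ValueError("file too short to be RIFF/WAVE")
        riff_id, _riff_size, form = struct.unpack("<4sI4s", header)
        if riff_id != b"RIFF" or form != b"WAVE":
            raise ValueError("not a RIFF/WAVE file (RF64 is not handled here)")
        while True:
            chunk_header = f.read(8)
            if len(chunk_header) < 8:
                break  # end of file
            chunk_id, size = struct.unpack("<4sI", chunk_header)
            chunks.append((chunk_id.decode("ascii", "replace"), size))
            # Chunk data is padded to an even number of bytes.
            f.seek(size + (size & 1), 1)
    return chunks


def has_bext_chunk(path):
    """True if the file contains a BWF broadcast extension ('bext') chunk."""
    return any(chunk_id == "bext" for chunk_id, _ in list_riff_chunks(path))
```

Running `has_bext_chunk` on a file before and after it passes through an editing or transfer tool gives a quick signal that the chunk has been dropped. It does not show whether the fields inside the chunk were altered, which needs a fuller comparison.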

There is emerging technology that can improve audio (capture of the bias tone and consequent removal of temporal variation) and video transfers (direct digitization of the RF signal from the read head), which could be useful in those cases where current technology fails. Even so, the recommendation is not to wait until such technology is further advanced and more widely available. If there are playback problems that cannot be resolved, the original audio or video format should be kept so that such advanced technology can be applied in the future.

Quality checking of the results of digitization remains an issue for video. There is a need for effective integration of signal processing technology with human checking in order to produce a really efficient method of quality control within a preservation factory approach. Quality checking is equally relevant to digital preservation – any changes or migrations due to digital obsolescence need to be checked for preservation of signal quality. Again, a purely manual approach does not scale (to the tens of millions of hours of audiovisual content in European collections), while purely algorithmic substitutes for 'looking and listening' have never been completely successful and remain an area where further research is needed.
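For the narrow case of a migration that is meant to be lossless (for example, rewrapping uncompressed audio), part of the check can be automated by confirming that the decoded samples are bit-identical before and after. The sketch below is one possible approach using Python's standard `wave` module, which reads uncompressed PCM WAV files only. The function names are hypothetical. It is not a substitute for 'looking and listening' where lossy encoding is involved, since any lossy transcode will change the samples.

```python
import hashlib
import wave


def pcm_fingerprint(path, block_frames=65536):
    """Return (channels, sample_width, frame_rate, n_frames, sha256_of_pcm)."""
    digest = hashlib.sha256()
    frames_seen = 0
    with wave.open(path, "rb") as w:
        fmt = (w.getnchannels(), w.getsampwidth(), w.getframerate())
        bytes_per_frame = fmt[0] * fmt[1]
        while True:
            block = w.readframes(block_frames)
            if not block:
                break
            digest.update(block)
            frames_seen += len(block) // bytes_per_frame
    return fmt + (frames_seen, digest.hexdigest())


def same_signal(original_path, migrated_path):
    """True if both files decode to identical PCM with identical parameters."""
    return pcm_fingerprint(original_path) == pcm_fingerprint(migrated_path)
```

Because the fingerprint covers only the audio payload and its basic parameters, it is a check on the signal and says nothing about whether embedded metadata survived the migration.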

 

Resources

Wright, R., 2012. Preserving Moving Pictures and Sound. DPC Technology Watch Report 12-01, March 2012.

http://dx.doi.org/10.7207/twr12-02

This report is for anyone with responsibility for collections of sound or moving image content and an interest in preservation of that content. New content is born digital, analogue audio and video need digitization to survive and film requires digitization for access. Consequently, digital preservation will be relevant over time to all these areas. The report concentrates on digitization, encoding, file formats and wrappers, use of compression, obsolescence and what to do about the particular digital preservation problems of sound and moving images (33 pages).

SHUK, 2011. Screen Heritage UK Marks new Era for Britain's Film Archives

http://www.bfi.org.uk/sites/bfi.org.uk/files/downloads/bfi-press-release-screen-heritage-uk-marks-a-new-era-for-britains-film-archives-2011-09-01.pdf

BFI Press release. 8 pages

IASA, 2009. IASA TC-04, Guidelines on the Production and Preservation of Digital Audio Objects (IASA-TC 04 Second edition 2009). Canberra, IASA.

http://www.iasa-web.org/audio-preservation-tc04

This is the standard guide to digitization of audio, and the sections on metadata and digital storage are of value to all forms of digital media.

Casey, M. and Gordon, B., 2007. Best Practices for Audio Preservation. Bloomington, Indiana University Bloomington.

http://www.dlib.indiana.edu/projects/sounddirections/papersPresent/

Another audio resource (that also includes a range of digitization software tools) comes from the Sound Directions project of Harvard and Indiana Universities: much is also relevant to video digitization. (160 pages)

Digital Preservation Coalition Briefing day on Preserving Digital Sound and Vision, April 2011

https://www.dpconline.org/events/past-events/preserving-digital-sound-and-vision-a-briefing

This DPC briefing day in April 2011 provided a forum to review and debate the latest developments in the preservation of digital sound and vision. Seven presentations (including the Open University) are linked from the programme and available to download.

ARSC Technical Committee, 2011. Study of Embedded Metadata Support in Audio Recording Software. Association of Recorded Sound Collections.

http://www.arsc-audio.org/pdf/ARSC_TC_MD_Study.pdf

A study of support for embedded metadata within and across a variety of audio recording software applications. The findings raise serious concerns, particularly for the archiving and preservation communities who rely on embedded metadata for interpretation and management of digital files representing preserved content into the future. (21 pages)

AVPreserve

http://www.avpreserve.com/

US-based media and information management consulting firm. Its website provides a range of resources for AV preservation.

BUFVC NewsFilm online Project

http://www.webarchive.org.uk/wayback/archive/20140614061518/http://www.jisc.ac.uk/whatwedo/programmes/digitisation/bufvc.aspx

British Film Institute

http://www.bfi.org.uk

The British Film Institute can advise on film and also on video: it holds a lot of video and has a Curator for Television. Its remit is collection and preservation of film and television, and technical advice.

British Library Sound Archive

https://www.bl.uk/subjects/sound#

General technical advice on audio preservation is available from the British Library Sound Archive. Its remit is collection and preservation of all forms of audio, and technical advice.

Film Archives UK

http://filmarchives.org.uk

Collection and preservation of general audiovisual content of regional significance in the UK

JISC Digital Media

https://www.jisc.ac.uk/

Advice and training on still images, moving images and sound. This includes their InfoKits for Digital File Formats, Digitisation funding and sustainability, and High Level Digitisation Guide for Audiovisual Resources.

PrestoCentre

http://www.prestocentre.eu

Website provides audiovisual information, resources and advice. Access has recently been extended so that all resources are now freely available to all.

Sustaining Consistent Video Presentation

http://www.tate.org.uk/research/publications/sustaining-consistent-video-presentation

This technical paper addresses approaches to identifying and mitigating risks associated with sustaining the consistent presentation of digital video files. Originating from two multi-partnered research projects – Pericles and Presto4U – the paper was commissioned by Tate Research and is intended for those who are actively engaged with the preservation of digital video.

JISC 2009 - Archival Sound Recordings Showreel

https://www.youtube.com/watch?v=KPy9ZqWEHog

Engaging short video on British Library archival sound recordings project published on 22 Jun 2009. (6 mins 11 secs).

 

Further case studies

Podcasts in the Archives: Archiving Podcasting Content at the University of Michigan

http://files.archivists.org/pubs/CampusCaseStudies/CASE12.pdf

In this Society of American Archivists campus case study, Alexis A. Antracoli, University of Michigan, examines the challenges involved in developing best practices and workflows for archiving and preserving podcasting content. One major issue involved establishing standards of practice for ingest, storage, and access, especially the generation and storage of appropriate descriptive, technical, and preservation metadata. Another challenge centered around developing the necessary technological infrastructure to support an Open Archival Information System (OAIS)-compliant system. 2010. (14 pages).

 

References

 

ARSC Technical Committee, 2011. Study of Embedded Metadata Support in Audio Recording Software. Association of Recorded Sound Collections. Available: http://www.arsc-audio.org/pdf/ARSC_TC_MD_Study.pdf

IASA, 2009. IASA TC-04, Guidelines on the Production and Preservation of Digital Audio Objects, IASA-TC 04 Second edition 2009, Canberra, IASA. Available: http://www.iasa-web.org/audio-preservation-tc04

SHUK, 2011. Screen Heritage UK Marks new Era for Britain's Film Archives. Available: http://www.bfi.org.uk/sites/bfi.org.uk/files/downloads/bfi-press-release-screen-heritage-uk-marks-a-new-era-for-britains-film-archives-2011-09-01.pdf

Wright, R., 2012. Preserving Moving Pictures and Sound. DPC Technology Watch Report 12-01, March 2012. Available: http://dx.doi.org/10.7207/twr12-02

 


Web-archiving

 

Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark

Overview

 

This case study provides a brief novice to intermediate level overview summarised from the DPC Technology Watch Report on Web-Archiving. Three "mini case studies" are included to illustrate the different operational contexts, drivers, and solutions that can be implemented. The report itself provides a "deep dive" discussing a wider range of issues and practice in greater depth with extensive further reading and advice (Pennock, 2013). It is recommended to readers who need a more advanced level briefing on the topic and practice.

 

Introduction

 

The World Wide Web is a unique information resource of massive scale, used globally. Much of its content will likely have value not just to the current generation but also to future generations. Yet the lasting legacy of the web is at risk, threatened in part by the very speed at which it has become a success. Content is lost at an alarming rate, risking not just our digital cultural memory but also organizational accountability. In recognition of this, a number of cultural heritage and academic institutions, non-profit organizations and private businesses have explored the issues involved and lead or contribute to development of technical solutions for web archiving.

 

Services and Solutions

 

Business needs and available resources are fundamental considerations when selecting appropriate web archiving tools and/or services. Other related issues must also be considered: organizations considering web archiving to meet regulatory requirements must, for example, consider associated issues such as authenticity and integrity, recordkeeping and quality assurance. All organizations will need to consider the issue of selection (i.e. which websites to archive), a seemingly straightforward task which is complicated by the complex inter-relationships shared by most websites that make it difficult to set boundaries. Other issues include managing malware, minimizing duplication of resources, temporal coherence of sites and long-term preservation or sustainability of resources. International collaboration is proving to be a game-changer in developing scalable solutions to support long-term preservation and ensure collections remain reliably accessible for future generations.
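To make the boundary-setting problem concrete, the sketch below shows one simple way a scope rule might be expressed: a discovered link is in scope only if it stays on the seed's host (optionally including subdomains) and under the seed's directory. This is an illustrative assumption, not the scoping logic of any tool named in this section; production crawlers offer far richer rules, and the function and parameter names here are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(url, seed, allow_subdomains=False):
    """Decide whether a discovered URL falls inside the boundary set by a seed."""
    u, s = urlsplit(url), urlsplit(seed)
    if u.scheme not in ("http", "https"):
        return False  # e.g. mailto: or javascript: links
    u_host = (u.hostname or "").lower()
    s_host = (s.hostname or "").lower()
    same_host = u_host == s_host
    sub_host = allow_subdomains and u_host.endswith("." + s_host)
    if not (same_host or sub_host):
        return False
    # Treat everything under the seed's directory as part of the site.
    seed_dir = s.path.rsplit("/", 1)[0] + "/"
    return (u.path or "/").startswith(seed_dir)


if __name__ == "__main__":
    seed = "http://example.org/projects/index.html"
    print(in_scope("http://example.org/projects/a/b.html", seed))
    print(in_scope("http://example.org/other/", seed))
    print(in_scope("http://cdn.example.org/projects/x.css", seed, True))
```

Even this small rule shows why selection is harder than it looks: stylesheets, images and embedded media are often served from other hosts, so a boundary that is too tight yields incomplete captures while one that is too loose pulls in unrelated sites.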

The web archiving process is not a one-off action. A suite of applications is typically deployed to support different stages of the process, though they may be integrated into a single end-to-end workflow. Much of the software is available as open source, allowing institutions free access to the source code for use and/or modification at no cost.

 

Integrated Systems for Web-archiving

 

A small number of integrated systems are available for those with sufficient technical staff to install, maintain and administer a system in-house. These typically offer integrated web archiving functionality across most of the life cycle, from selection and permissions management to crawling, quality assurance, and access. Three are featured here.

 

PANDAS

PANDAS (PANDORA Digital Archiving System) was one of the first available integrated web archiving systems. First implemented by the National Library of Australia (NLA) in 2001, PANDAS is a web application written in Java and Perl that provides a user-friendly interface to manage the web archiving workflow. It supports selection, permissions, scheduling, harvests, quality assurance, archiving, and access. PANDAS is not open source software, though it has been used by other institutions (most notably the UK Web Archiving Consortium from 2004 to 2008). It is used by the NLA for selective web archiving, whilst the Internet Archive supports their annual snapshots of the Australian domain.

Web Curator Tool (WCT)

The Web Curator Tool is an open source workflow tool for managing the selective web archiving process, developed collaboratively by the National Library of New Zealand and the British Library with Oakleigh Consulting. It supports selection, permissions, description, harvests, and quality assurance, with a separate access interface. WCT is written in Java within a flexible architecture and is publicly available for download from SourceForge under an Apache public licence. The WCT website is the hub for the developer community and there are active mailing lists for both users and developers. The highly modular nature of the system minimizes system dependencies.

NetarchiveSuite

NetarchiveSuite is a web archiving application written in Java for managing selective and broad domain web archiving, originally developed in 2004 by the two legal deposit libraries in Denmark (Det Kongelige Bibliotek and Statsbiblioteket). It became open source in 2007 and has received additional development input from the Bibliothèque nationale de France and the Österreichische Nationalbibliothek since 2008. It is freely available under the GNU Lesser General Public License (LGPL). The highly modular nature of the system enables flexible implementation solutions.

 

Third party and commercial services

 

Third party commercial web archiving services are increasingly used by organizations that prefer not to establish and maintain their own web archiving technical infrastructure. The reasons behind this can vary widely. Often it is not simply about the scale of the operation or the perceived complexity, but the business need and focus. Many organizations do not wish to invest in any skills or capital that is not core to their business. Others may use such a service to avoid capital investment. Moreover, organizations are increasingly moving their computing and IT operations into the cloud, or using a SaaS (Software as a Service) provider. Web archiving is no exception. From a legal and compliance perspective, third party services are sometimes preferred as they can provide not just the technology but also the skills and support required to meet business needs. This section introduces some of the third party services currently available; the list is not exhaustive, and inclusion here should not be taken as recommendation.

 

Archive-It

Archive-It is a subscription web archiving service provided by the Internet Archive. Customers use the service to establish specific collections, for example about the London 2012 Olympics, government websites, human rights, and course reading lists. A dedicated user interface is provided for customers to select and manage seeds, set the scope of a crawl and crawl frequency, monitor crawl progress and perform quality assurance, add metadata and create landing pages for their collections. Collections are made public by default via the Archive-It website, with private collections requiring special arrangement. The access interface supports both URL and full text searching. Over 200 partners use the service, mostly from the academic or cultural heritage sectors. The cost of the service depends on the requirements of the collecting institution.

Archivethe.Net

Archivethe.Net is a web-based web archiving service provided by the Internet Memory Foundation (IMF). It enables customers to manage the entire workflow via a web interface to three main modules: Administration (managing users), Collection (seed and crawl management), and Report (reports and metrics at different levels). The platform is available in both English and French. Alongside full text searching and collection of multimedia content, it also supports an automated redirection service for live sites. Automated QA tools are being developed though IMF can also provide manual quality assurance services, as well as direct collection management for institutions not wishing to use the online tool. Costs are dependent upon the requirements of the collecting institution. Collections can be made private or remain openly accessible, in which case they may be branded as required by the collecting institutions and appear in the IMF collection. The hosting fee in such cases is absorbed by IMF.

The University of California's Curation Centre (UC3)

UC3, as part of the California Digital Library, provides a fully hosted Web Archiving Service for selective web archive collections. University of California departments and organizations are charged only for storage. Fees are levied for other groups and consortia, comprising an annual service fee plus storage costs. Collections may be made publicly available or kept private. Around 20 partner organizations have made collections available to date. Full text search is provided and presentation of the collections can be branded as required by collecting institutions.

Private companies

Private companies offer web archiving services particularly tailored to business needs. Hanzo Archives, for example, provide a commercial website archiving service to meet commercial business needs around regulatory compliance, e-discovery and records management. Hanzo Archives emphasize their ability to collect rich media sites and content that may be difficult for a standard crawler to pick up, including dynamic content from SharePoint and wikis from private intranets, alongside public and private social media channels. (More details about the possibilities afforded by the Hanzo Archives service can be found in the Coca-Cola case study.) Similarly, Reed Archives provide a commercial web archiving service for organizational regulatory compliance, litigation protection, eDiscovery and records management. This includes an 'archive-on-demand' toolset for use when browsing the web. In each case, the cost of the service is tailored to the precise requirements of the customer. Other companies and services are also available and readers are encouraged to search online for further options should such a service be of interest.

 

 Case study 1: The UK Web Archive

 

The UK Web Archive (UKWA) was established in 2004 by the UK Web Archiving Consortium. It was originally a six-way partnership, led by the British Library in conjunction with the Wellcome Library, Jisc, the National Library of Wales, the National Library of Scotland and The National Archives (UK).

UKWA partners select and nominate websites using the features of the web archiving system hosted on the UK Web Archive infrastructure maintained by the British Library. The British Library works closely with a number of other institutions and individuals to select and nominate websites of interest. Selectively archived websites are revisited at regular intervals so that changes over time are captured.

The technical infrastructure underpinning the UK Web Archive is managed by the British Library. The Archive was originally established with the PANDAS software provided by the National Library of Australia, hosted by an external agency, but in 2008 the archive was moved in-house and migrated into the Web Curator Tool (WCT) system.

A customized version of the Wayback interface developed by the Internet Archive is used as the WCT front end and provides searchable access to all publicly available archived websites. Full text searching is enabled in addition to standard title and URL searches and a subject classification schema. The web archiving team at the library have recently released a number of visualization tools to aid researchers in understanding and finding content in the collection.

Special collections have been established on a broad range of topics. Many are subject based, for example the mental health and the Free Church collections. Others document the online response to a notable event in recent history, such as the UK General Elections, Queen Elizabeth II's Diamond Jubilee and the London 2012 Olympics.

Many more single sites, not associated with a given special collection, have been archived on the recommendation of subject specialists or members of the public. These are often no longer available on the live web, for example the website of UK Member of Parliament Robin Cook or Antony Gormley's One & Other public art project, acquired from Sky Arts.

 

 Case study 2: The Internet Memory Foundation

 

The Internet Memory Foundation (IMF) was established in 2004 as a non-profit organization to support web archiving initiatives and develop support for web preservation in Europe. Originally known as the European Archive Foundation, it changed its name in 2010. IMF provides customers with an outsourced fully fledged web archiving solution to manage the web archiving workflow without them having to deal with operational workflow issues.

IMF collaborates closely with Internet Memory Research (IMR) to operate a part of its technical workflows for web archiving. IMR was established in 2011 as a spin-off from the IMF. Both IMF and IMR are involved in research projects that support the growth and use of web archives.

IMR provides a customizable web archiving service, Archivethe.Net (AtN). AtN is a shared web-archiving platform with a web-based interface that helps institutions to easily and quickly start collecting websites including dynamic content and rich media. It can be tailored to the needs of clients, and institutions retain full control of their collection policy (ability to select sites, specify depth, gathering frequency, etc.). Quality control services can be provided on request. Most quality control is done manually in order to meet high institutional quality requirements, and IM has a dedicated team of QA assessors. IM has developed a methodology for visual comparison based on tools used for crawling and accessing data, and is also working on improving tools and methods to deliver a higher initial crawl quality.

Partner institutions with openly accessible collections for which IM provides a web archiving service include the UK National Archives and the UK Parliament.

Access to publicly available collections is provided via the IM website. IM provides a full text search facility for most of its online collections, in addition to URL-based search. Full text search results can be integrated on a third party website and collections can be branded by owners as necessary.

Following the architecture of the Web Continuity Service by The National Archives (The National Archives, 2010), IM implemented an 'automatic redirection service' to integrate web archives with the live web user experience. When navigating on the web, users are automatically redirected to the web archive if the resource requested is no longer available online. Within the web archive, the user is pointed to the most recent crawled instance of the requested resource. Once the resource is accessed, any link on the page will send the user back to the live version of the site. This service is considered to increase the life of a link, to improve users' experience, online visibility and ranking, and to reduce bounce rates.
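The redirection behaviour described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the implementation used by IM or The National Archives: the function name `resolve`, the in-memory `live_status` and `captures` tables, and the `http://archive.example/` URL pattern are all hypothetical.

```python
from typing import Optional

# Hypothetical data: HTTP status of live URLs and timestamps of archived captures.
live_status = {
    "http://dept.example/report": 404,  # removed from the live site
    "http://dept.example/home": 200,    # still online
}
captures = {
    "http://dept.example/report": ["20090114", "20100302", "20091120"],
}

ARCHIVE_PREFIX = "http://archive.example/"  # assumed archive URL pattern


def resolve(url: str) -> Optional[str]:
    """Return the URL the user should be sent to, or None if nothing is available."""
    if live_status.get(url) == 200:
        return url  # resource is still live: no redirection needed
    timestamps = captures.get(url)
    if not timestamps:
        return None  # not live and never archived
    latest = max(timestamps)  # YYYYMMDD strings sort chronologically
    return f"{ARCHIVE_PREFIX}{latest}/{url}"
```

In this sketch a request for the removed report resolves to its most recent capture, while a request for the home page is left on the live site.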

Web archiving collections are available for public browsing from the IM website, a combination of both domain and selective collections from its own and from partner institutions.

 

 Case study 3: The Coca-Cola web archive

 

The Coca-Cola Web Archive was established to capture and preserve corporate Coca-Cola websites and social media. It is part of the Coca-Cola Archive, which contains millions of both physical and digital artefacts, from papers and photographs to adverts, bottles, and promotional goods. Coca-Cola's online presence is vast, including not only several national Coca-Cola websites but also, for example, the Coca-Cola Facebook page and Twitter stream, and other Coca-Cola owned brands (500 in all). The first Coca-Cola website was published in 1995.

Since 2009, Coca-Cola has collaborated with Hanzo Archives and now utilizes their commercial web archiving service. Alongside the heritage benefits of the web archive, the service also provides litigation support where part or all of the website may be called upon as evidence in court and regulatory compliance for records management applications.

The Coca-Cola web archive is a special themed web archive that contains all corporate Coca-Cola sites and other specially selected sites associated with Coca-Cola. It is intended to be as comprehensive as possible, with integrity/functionality of captured sites of prime importance. This includes social media and video, whether live-streamed or embedded (including Flash). Artefacts are preserved in their original form wherever possible, a fundamental principle for all objects in the Coca-Cola Archive.

Hanzo Archives' crawls take place quarterly and are supplemented by occasional event-based collection crawls, such as the 125th anniversary of Coca-Cola, celebrated in 2011. Hanzo's web archiving solution is a custom-built application. Web content is collected in its native format by the Hanzo Archives web crawler, which is deployed to the scale necessary for the task in hand.

Quality assurance is carried out with a two-hop systematic sample check of crawl contents that forces use of the upper-level navigation options and focuses on the technical shape of the site.
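The case study gives no further detail of this check, so the following Python sketch is only one possible reading of a "two-hop systematic sample". It assumes the crawl is available as a simple link graph (the hypothetical `links` dictionary), gathers every page within two link clicks of the home page, and then picks every n-th page for review. The function names and the example site are illustrative, not part of Hanzo's tooling.

```python
from collections import deque


def pages_within_hops(links, start, max_hops=2):
    """Breadth-first walk: pages reachable from start in at most max_hops clicks."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return list(depth)  # insertion order is breadth-first order


def systematic_sample(pages, step):
    """Take every step-th page, starting with the first."""
    return pages[::step]


# Hypothetical site structure: page -> pages it links to.
links = {
    "/": ["/products", "/about"],
    "/products": ["/products/a", "/products/b"],
    "/products/a": ["/products/a/specs"],  # three hops from "/", so excluded
}
to_check = systematic_sample(pages_within_hops(links, "/"), step=2)
```

Starting from the home page means the upper-level navigation is always exercised, which matches the emphasis in the description above.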

The Archive is currently accessible only to Coca-Cola employees, on a limited number of machines. Remote access is provided by Hanzo using their own access interface. Proxy-based access ensures that all content is served directly from the archive and that no 'live-site leakage' is encountered. The archive may be made publicly accessible in the future inside The World of Coca-Cola, in Atlanta, Georgia, USA.

The Coca-Cola web archive collection contains over six million webpages and over 2TB of data. Prior to the collaboration with Hanzo, early attempts at archiving resulted in incomplete captures, so early sites are not as complete as the company would like. The collection also contains information about many national and international events for which Coca-Cola was a sponsor, including the London 2012 Olympics and Queen Elizabeth II's Diamond Jubilee.

 

Conclusions

 

Web archiving technology has significantly matured over the past decade, as has our understanding of the issues involved. Consequently we have a broad set of tools and services which enable us to archive and preserve aspects of our online cultural memory and comply with regulatory requirements for capturing and preserving online records. The work is ongoing: for as long as the Internet continues to evolve, web archiving technology must evolve to keep pace.

Alongside technical developments, the knowledge and experience gained through practical deployment and use of web archiving tools has led to a much better understanding of best practices in web archiving, operational strategies for embedding web archiving in an organizational context, business needs and benefits, use cases, and resourcing options. Organizations wishing to embark on a web archiving initiative must be very clear about their business needs before doing so. Business needs should be the fundamental driver behind any web archiving initiative and will significantly influence the detail of a resulting web archiving strategy and selection policy. The fact that commercial services and technologies have emerged is a sign of the maturity of web archiving as a business need, as well as a discipline.

 

Resources

Pennock, M., 2013. Web-Archiving, DPC Technology Watch Report 13-01 March 2013

http://dx.doi.org/10.7207/twr13-01

This report is intended for those with an interest in, or responsibility for, setting up a web archive. It introduces and discusses the key issues faced by organizations engaged in web archiving initiatives, whether they are contracting out to a third party service provider or managing the process in-house and provides a detailed overview of the main software applications and tools currently available.

ISO, 2012. ISO 28500:2009 Information and Documentation – the WARC file format

http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717

The WARC (Web ARChive) format is a container format for archived websites, also known as ISO 28500:2009. It is a revision of the Internet Archive's ARC File Format used to store web crawls harvested from the World Wide Web.

ISO, 2013. ISO/TR 14873:2013 Information and Documentation – Statistics and quality issues for web archiving

http://www.iso.org/iso/catalogue_detail.htm?csnumber=55211

This technical report defines statistics, terms and quality criteria for Web archiving. It considers the needs and practices across a wide range of organisations such as libraries, archives, museums, research centres and heritage foundations.

Meyer, E., 2010 (a). Researcher Engagement with Web Archives: State of the Art Report, JISC

http://ie-repository.jisc.ac.uk/544/

This report summarizes the state of the art of web archiving in relationship to researchers and research needs focussing primarily on individual researchers and institutions.

Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations

Zittrain, Jonathan, Albert, Kendra and Lessig, Lawrence, 2013. Harvard Public Law Working Paper No. 13-42 (October 1, 2013).

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2329161 or http://dx.doi.org/10.2139/ssrn.2329161

This article from the Perma project team documents a serious problem of reference rot: more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs found within United States Supreme Court opinions, do not link to the originally cited information. It proposes a solution for authors and editors of new scholarship that involves libraries undertaking the distributed, long-term preservation of link contents.

Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot

http://dx.doi.org/10.1371/journal.pone.0115253

This large-scale study looked into approximately 600K links extracted from over 3M scholarly papers published between 1997 and 2012. Those were links to so-called web-at-large resources, i.e. not links to other scholarly papers. It found one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten.

The National Archives, 2010. Government Web Archive: Redirection Technical Guidance for Government Departments, version 4.2, The National Archives (UK)

http://www.nationalarchives.gov.uk/documents/information-management/redirection-technical-guidance-for-departments-v4.2-web-version.pdf

This guidance describes an innovative service that provides URL rewriting and redirection functionality for UK Government web pages by setting up redirection to the UK Government web archive where a requested URL no longer exists on a departmental web site.

MEMENTO and the Time Travel Service

http://www.mementoweb.org/

Memento is a tool which allows users to see a version of a web resource as it existed at a certain point in the past. It is now used in several web archives. The Time Travel service based on Memento checks a range of servers including many web archives and tries to find a web page as it existed around the time of your choice.

Archive-It

http://www.archive-it.org/

Hanzo Archives

http://www.hanzoarchives.com/

Wayback

http://www.sourceforge.net/projects/archive-access/files/wayback/

Netarchive Suite

https://sbforge.org/display/NAS/NetarchiveSuite

PANDAS

http://pandora.nla.gov.au/pandas.html

UC3 Web Archiving Service

https://cdlib.org/services/uc3/about/

Web Curator Tool

http://webcurator.sourceforge.net/

International Internet Preservation Consortium

http://www.netpreserve.org

The IIPC is a membership organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. There are many valuable resources on the website including excellent short videos such as the example below.

Why Archive the Web?

https://www.youtube.com/watch?v=pU32rjTaMFE

 

A short video published on 18 Oct 2012 introducing the challenges of web-archiving and the IIPC. (2 mins 53 secs).

What is a Web Archive?

https://youtu.be/ubDHY-ynWi0

This short video explains 'Web Archiving' and why it is important that the UK Legal Deposit libraries support it. It was produced as part of the Arts and Humanities Research Council funded 'Big UK Domain Data for the Arts and Humanities' project. (2 mins 31 secs)

What do the UK Web Archive collect?

https://youtu.be/1QLMPIRwJEo

This video for users explains what they can expect to find and where they might go to access the three collections that the UK Web Archive hold. It was produced as part of the Arts and Humanities Research Council funded 'Big UK Domain Data for the Arts and Humanities' project. (2 mins 55 secs)

 

Further case studies

NDSA Website content case studies

The US National Digital Stewardship Alliance (NDSA) examines the value, opportunities and obstacles for selective preservation of the following specific web content types:

Science, Medicine, Mathematics, and Technology forums

http://www.digitalpreservation.gov/ndsa/working_groups/documents/ScienceForums_CaseStudy_public_v2.pdf

December 2013 (3 pages).

Science, Medicine, Mathematics, and Technology blogs

http://www.digitalpreservation.gov/ndsa/working_groups/documents/ScienceBlogs_CaseStudy_public_v2.pdf

December 2013 (3 pages).

Born‐Digital Community and Hyperlocal News

http://www.digitalpreservation.gov/ndsa/working_groups/documents/NDSA_CaseStudy_CommunityNews.pdf

February 2013 (3 pages).

Citizen Journalism

http://www.digitalpreservation.gov/ndsa/working_groups/documents/NDSA_CaseStudy_CitizenJournalism.pdf

February 2013 (3 pages).

 

On the Development of the University of Michigan Web Archives: Archival Principles and Strategies

http://files.archivists.org/pubs/CampusCaseStudies/Case13Final.pdf

Michael Shallcross, Bentley Historical Library, University of Michigan details the strategies and procedures the University Archives and Records Program (UARP) followed to develop its collection of archived websites, and how it initiated a large-scale website preservation project as part of a broader effort to proactively capture and maintain select electronic records of the University. 2011 (29 pages).

 

References

 

Pennock, M., 2013. Web-Archiving, DPC Technology Watch Report 13-01 March 2013. Available: http://dx.doi.org/10.7207/twr13-01

The National Archives, 2010. Government Web Archive: Redirection Technical Guidance for Government Departments, version 4.2, The National Archives (UK). Available: http://www.nationalarchives.gov.uk/documents/information-management/redirection-technical-guidance-for-departments-v4.2-web-version.pdf

 


Glossary

Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark

Introduction

Acronyms and initials are a feature of any specialised discipline. In an emerging discipline such as digital preservation, another major difficulty is the lack of a precise and definitive taxonomy of terms. Different communities use the same terms in different ways, which can make effective communication problematic. The following working set of definitions and acronyms is the one used throughout the Handbook and the DPC Technology Watch Reports and Website. It is intended to assist in the use of the Handbook as a practical tool.



 

 A

Access As defined in the Handbook, access is assumed to mean continued, ongoing usability of a digital resource, retaining all qualities of authenticity, accuracy and functionality deemed to be essential for the purposes the digital material was created and/or acquired for.

ADS Archaeology Data Service. A UK based service active in digital preservation. http://ads.ahds.ac.uk

AIP Archival Information Package. An Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS (OAIS term).

AMIA Association of Moving Image Archivists, an organisation active in the field of moving image archiving. http://www.amianet.org

ARC Container format for websites devised by the Internet Archive, superseded by WARC.

ASCII American Standard Code for Information Interchange, standard for electronic text. https://en.wikipedia.org/wiki/ASCII

Authentication A mechanism which attempts to establish the authenticity of digital materials at a particular point in time. For example, digital signatures.

Authenticity The digital material is what it purports to be. In the case of electronic records, it refers to the trustworthiness of the electronic record as a record. In the case of "born digital" and digitised materials, it refers to the fact that whatever is being cited is the same as it was when it was first created unless the accompanying metadata indicates any changes. Confidence in the authenticity of digital materials over time is particularly crucial owing to the ease with which alterations can be made.

 B

Bit A bit is the basic unit of information in computing. It can have only one of two values, commonly represented as either a 0 or 1. The two values can be interpreted as any two-valued attribute (yes/no, on/off, etc).

Bit Preservation A term used to denote a very basic level of preservation of a digital resource as it was submitted (literally, preservation of the bits forming a digital resource). It may include maintaining onsite and offsite backup copies, virus checking, fixity-checking, and periodic refreshment to new storage media. Bit preservation is not digital preservation but it does provide one building block for the more complete set of digital preservation practices and processes that ensure the survival of digital content and also its usability, display, context and interpretation over time.

Born-Digital Digital materials which are not intended to have an analogue equivalent, either as the originating source or as a result of conversion to analogue form. This term has been used in the Handbook to differentiate them from 1) digital materials which have been created as a result of converting analogue originals; and 2) digital materials, which may have originated from a digital source but have been printed to paper, e.g. some electronic records.

BWF Broadcast Wave Format, the European Broadcasting Union standard for a WAV file, with extra metadata. http://www.digitalpreservation.gov/formats/fdd/fdd000003.shtml

Byte (B) A unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable unit of memory in many computer architectures.

 C

CCSDS Consultative Committee for Space Data Systems, the body responsible for the OAIS Reference Model. http://public.ccsds.org/default.aspx

Chain of Custody A key concept in forensics whereby the custody and provenance of digital hardware, media and files are safeguarded through, for example, the appointment of evidence custodians. The purpose of the Digital Evidence Bag (DEB) is to hold digitally, along with the evidential digital objects, provenance metadata that can be updated as required: a concept that is familiar to digital preservation practitioners.

Checksum A unique numerical signature derived from a file. Used to compare copies.

CLIR Council on Library and Information Resources. US based organisation active in digital preservation. http://www.clir.org

CNI Coalition for Networked Information. US based organisation active in digital preservation. http://www.cni.org

Continuing Access refers to the right of a subscriber to an electronic publication and their users to have on-going permanent access to electronic materials which have already been leased and paid for by the subscriber from a publisher. It is a term used, along with its synonyms perpetual access and post-cancellation access, in the information industry to describe the ability to retain access to electronic materials by the subscriber/licensee after the contractual licensing agreement with the publisher/licensor for those materials has ended, whatever the reason for the cessation. It may also cover as appropriate arrangements for digital preservation needed to guarantee some elements of continuing access.

COPTR Community Owned digital Preservation Tool Registry hosted by The Open Preservation Foundation. http://coptr.digipres.org

Crawl The act of browsing the web automatically and methodically to index or download content and other data from the web. The software to do this is often called a web crawler.

 

 D

Dark Archive is an archive that cannot be accessed by any current users but may be accessible at future dates subject to the occurrence of specific pre-defined events ('trigger event'). Access to the data is either limited to a few set individuals or completely restricted to all.

DCC Digital Curation Centre. A UK based organisation active in digital preservation. http://www.dcc.ac.uk

DDI Data Documentation Initiative. A de facto international metadata standard for describing data from the social, behavioral, and economic sciences. http://www.icpsr.umich.edu/DDI

Designated Community an identified group of potential consumers who should be able to understand a particular set of information from an archive. These consumers may consist of multiple communities, are designated by the archive, and may change over time (OAIS term).

Digital Archiving This term is used very differently within sectors. The library and archiving communities often use it interchangeably with digital preservation. Computing professionals tend to use digital archiving to mean the process of backup and ongoing maintenance as opposed to strategies for long-term digital preservation. It is this latter richer definition, as defined under digital preservation which has been used throughout this Handbook.

Digital Forensics The application of scientific technical methods and tools toward the preservation, collection, validation, identification, analysis, interpretation, documentation and presentation of digital information derived after-the-fact from digital sources.

Dim Archive provides bit preservation for the content plus digital preservation planning and actions for long-term perpetual access, and also limited current access (perhaps limited to on-site users or previous subscribers post-cancellation, etc.).

DigCurV Digital Curator Vocational Education Europe. A project funded by the European Commission to establish a curriculum framework for vocational training in digital curation. http://www.digcurv.gla.ac.uk/

Digital Materials A broad term encompassing digital surrogates created as a result of converting analogue materials to digital form (digitisation), and "born digital" for which there has never been and is never intended to be an analogue equivalent, and digital records.

Digital Preservation Refers to the series of managed activities necessary to ensure continued access to digital materials for as long as necessary. Digital preservation is defined very broadly for the purposes of this study and refers to all of the actions required to maintain access to digital materials beyond the limits of media failure or technological and organisational change. Those materials may be records created during the day-to-day business of an organisation; "born-digital" materials created for a specific purpose (e.g. teaching resources); or the products of digitisation projects. This Handbook specifically excludes the potential use of digital technology to preserve the original artefacts through digitisation. See also Digitisation definition below.

  • Short-term preservation - Access to digital materials either for a defined period of time while use is predicted but which does not extend beyond the foreseeable future and/or until it becomes inaccessible because of changes in technology.
  • Medium-term preservation - Continued access to digital materials beyond changes in technology for a defined period of time but not indefinitely.
  • Long-term preservation - Continued access to digital materials, or at least to the information contained in them, indefinitely.

Digital Preservation Management Workshop and Tutorial An intensive training workshop and online tutorial developed and maintained by Cornell University Library, 2003-2006; extended and maintained by ICPSR, 2007-2012; and now extended and maintained by MIT Libraries, 2012-on. http://dpworkshop.org/

Digital Publications "Born digital" objects which have been released for public access and either made available or distributed free of charge or for a fee. They may consist of networked publications, available over a communications network or physical format publications which are distributed on formats such as floppy or optical disks. They may also be either static or dynamic.

Digital Records See Electronic Records

Digital Resources See Digital Materials

Digitisation The process of creating digital files by scanning or otherwise converting analogue materials. The resulting digital copy, or digital surrogate, would then be classed as digital material and then subject to the same broad challenges involved in preserving access to it, as "born digital" materials.

DIP Dissemination Information Package. An Information Package, derived from one or more Archival Information Packages (AIPs), and sent by Archives to the Consumer in response to a request to the OAIS (OAIS term).

DLF Digital Library Federation. A US based organisation active in digital preservation. http://www.diglib.org

Documentation The information provided by a creator and the repository which provides enough information to establish provenance, history and context and to enable its use by others. See also Metadata.

DOI Digital Object Identifier. A technical and organisational infrastructure for the registration and use of persistent identifiers widely used in digital publications and for research data. The DOI system was created by the International DOI Foundation and was adopted as International Standard ISO 26324 in 2012. http://www.doi.org

DPC Digital Preservation Coalition. A UK and Ireland based organisation active in digital preservation and responsible for the Digital Preservation Handbook. http://www.dpconline.org

DPTP Digital Preservation Training Programme, an intensive training course run by the University of London Computer Centre. https://dptp.london.ac.uk/

DRAMBORA Digital Repository Audit Methodology Based on Risk Assessment. A set of risk assessment tools developed by the Digital Curation Centre. http://www.dcc.ac.uk/resources/repository-audit-and-assessment/drambora

DROID A file profiling tool developed and distributed by TNA to identify file formats. Based on PRONOM. http://www.nationalarchives.gov.uk/information-management/manage-information/policy-process/digital-continuity/file-profiling-tool-droid/

 E

Electronic Records Records created digitally in the day-to-day business of the organisation and assigned formal status by the organisation. They may include for example, word processing documents, emails, databases, or intranet web pages.

Emulation A means of overcoming technological obsolescence of hardware and software by developing techniques for imitating obsolete systems on future generations of computers.

Escrow A widespread legal practice of the deposit of content or software source code with a third party. Escrow takes place in a contractual relationship, formalized in an escrow agreement, between at least three parties: the provider, the customer, and the third party providing the escrow service.

 F

FIAF International Federation of Film Archives, an association of the world's leading film archives. http://www.fiafnet.org

FIAT International Federation of Television Archives, a professional association for those engaged in the preservation and exploitation of broadcast archives. http://fiatifta.org

File Format A file format is a standard way that information is encoded for storage in a computer file. It tells the computer how to display, print, process, and save the information. It is dictated by the application program which created the file, and the operating system under which it was created and stored. Some file formats are designed for very particular types of data, others can act as a container for different types. A particular file format is often indicated by a file name extension containing three or four letters that identify the format. http://en.wikipedia.org/wiki/File_format

Fixity Check a method for ensuring the integrity of a file and verifying it has not been altered or corrupted. During transfer, an archive may run a fixity check to ensure a transmitted file has not been altered en route. Within the archive, fixity checking is used to ensure that digital files have not been altered or corrupted. It is most often accomplished by computing checksums such as MD5, SHA1 or SHA256 for a file and comparing them to a stored value. http://en.wikipedia.org/wiki/File_Fixity
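As an illustration of the Fixity Check entry above, the following Python sketch computes a checksum for a file and compares it with a previously stored value. It uses only the standard `hashlib` module. The function names `file_checksum` and `fixity_ok` are illustrative, and SHA-256 is assumed as the default algorithm; an archive would normally record the algorithm alongside the stored value.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Compute the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: Path, stored_value: str, algorithm: str = "sha256") -> bool:
    """Return True if the file's current checksum matches the stored value."""
    return file_checksum(path, algorithm) == stored_value.lower()
```

A typical use is to record `file_checksum(path)` at ingest and to call `fixity_ok(path, stored_value)` after each transfer and at regular intervals thereafter. A result of `False` indicates that the file has been altered or corrupted since the value was recorded.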

 G

GIF Graphics Interchange Format, an image format which uses lossless (LZW) compression but is limited to a palette of 256 colours per image. http://en.wikipedia.org/wiki/GIF

Gigabyte (GB) A unit of digital information often used to describe data or data storage size, equates to approximately 1,000 Megabytes (MB).

GIS Geographical Information System, a system that processes mapping and data together.

 H

HTML Hypertext Markup Language, a format used to present text and other information on the World Wide Web. Since 1996, versions of the HTML specification have been maintained by the World Wide Web Consortium (W3C). http://en.wikipedia.org/wiki/HTML

 I

IASA International Association of Sound and Audiovisual Archives, an association for archives that preserve recorded sound and audiovisual documents. http://www.iasa-web.org

IIPC The International Internet Preservation Consortium. http://www.netpreserve.org

Information Assurance An aspect of digital security, specifically directed at ensuring that the quality of the information is demonstrably safeguarded, that it has not been tampered with or accessed inappropriately.

Ingest the process of turning a Submission Information Package (SIP) into an Archival Information Package (AIP), i.e. putting data into a digital archive (OAIS term).

InterPARES project International Research on Permanent Authentic Records in Electronic Systems. http://www.interpares.org

ISO International Organization for Standardization. http://www.iso.org/iso/home.html

 J

JHove2 A characterisation tool for digital objects. Characterisation comprises four elements: identifying the object's format; validating that the object conforms to its format's technical norms; extracting technical metadata from the object; and assessing whether the object should be accepted into a repository, based on policies set by the curator. https://bitbucket.org/jhove2/main/wiki/Home

JPEG Joint Photographic Experts Group, a committee that oversees international standards for compression and processing of digital photographs. The majority of JPEG formats are lossy. http://www.jpeg.org/

JPEG 2000 a revision of the JPEG format which can use lossless compression.

 K

Kilobyte (KB) A unit of digital information often used to describe data or data storage size, equates to approximately 1,000 Bytes.

 L

Life-cycle Management Records management practices have established life-cycle management for many years, for both paper and electronic records. The major implication for life-cycle management of digital resources, whatever their form or function, is the need to manage the resource actively at each stage of its life-cycle, to recognise the inter-dependencies between each stage, and to commence preservation activities as early as practicable. This represents a major difference from most traditional preservation, where management is largely passive until detailed conservation work is required, typically many years after creation and rarely, if ever, involving the creator. There is an active and inter-linked life-cycle to digital resources which has prompted many to promote the term "continuum" to distinguish it from the more traditional and linear flow of the life-cycle for traditional analogue materials. We have used the term life-cycle to apply to this pro-active concept of preservation management for digital materials.

Lossless Compression A mechanism for reducing file sizes that retains all original data.

Lossy Compression A mechanism for reducing file sizes that typically discards data.
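
The difference between the two entries above can be shown with a small sketch. It is illustrative only: `zlib` (DEFLATE) stands in for lossless compression in general, and the `quantise` function is a hypothetical toy that mimics the data-discarding step of a lossy scheme rather than any real audio or video codec.

```python
import zlib

original = bytes(range(256)) * 64  # 16,384 bytes of repetitive sample data

# Lossless: the compressed form is smaller, and decompression
# restores every original byte exactly.
packed = zlib.compress(original)
restored = zlib.decompress(packed)
lossless_round_trip = restored == original


def quantise(data: bytes) -> bytes:
    """Toy lossy step: zero the low 4 bits of every byte."""
    return bytes(value & 0xF0 for value in data)


# Lossy: the discarded bits cannot be recovered from the result,
# so the original data can no longer be reproduced exactly.
reduced = quantise(original)
lossy_matches_original = reduced == original
```

Here `lossless_round_trip` is true and `lossy_matches_original` is false, which is the practical distinction that matters for preservation masters.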

LOTAR (LOng Term Archiving and Retrieval) a digital preservation standard for 3D CAD models and product data management information developed by LOTAR International, an industrial consortium of aerospace and defence companies from the US and Europe. http://www.lotar-international.org

 M

Megabyte (MB) A unit of digital information often used to describe data or data storage size, equates to approximately 1,000 Kilobytes (KB).

Metadata Information which describes significant aspects of a resource. Most discussion to date has tended to emphasise metadata for the purposes of resource discovery. The emphasis in this Handbook is on what metadata are required successfully to manage and preserve digital materials over time and which will assist in ensuring essential contextual, historical, and technical information are preserved along with the digital object. The PREMIS Data Dictionary for Preservation Metadata has become a key de facto standard in digital preservation.

METS Metadata Encoding and Transmission Standard, a standard for presenting metadata using XML. http://www.loc.gov/standards/mets/

Migration A means of overcoming technological obsolescence by transferring digital resources from one hardware/software generation to the next. The purpose of migration is to preserve the intellectual content of digital objects and to retain the ability for clients to retrieve, display, and otherwise use them in the face of constantly changing technology. Migration differs from the refreshing of storage media in that it is not always possible to make an exact digital copy or replicate original features and appearance and still maintain the compatibility of the resource with the new generation of technology.

MIME Multipurpose Internet Mail Extensions. A protocol for including non-ASCII information in email messages. Software typically includes interpreters that convert MIME content to and from its native format, as necessary. http://en.wikipedia.org/wiki/MIME

MPEG Moving Picture Experts Group. A committee responsible for the development of international standards for compression, decompression, processing, and coded representation of moving pictures, audio and their combination. https://mpeg.chiariglione.org/

 N

NCDD The Netherlands Coalition for Digital Preservation. http://www.ncdd.nl/en/

NDSA National Digital Stewardship Alliance, a US-based organisation active in digital preservation. http://www.digitalpreservation.gov/ndsa/

NESTOR The German competence network for digital preservation. http://www.langzeitarchivierung.de/Subsites/nestor/EN/Home/home_node.html/

 O

Open Archival Information System (OAIS) An Archive, consisting of an organization, which may be part of a larger organization, of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community. It meets a set of responsibilities, as defined in section 4 of the OAIS standard, that allows an OAIS Archive to be distinguished from other uses of the term ‘Archive’. The term ‘Open’ in OAIS is used to imply that the OAIS standards are developed in open forums, and it does not imply that access to the Archive is unrestricted. The OAIS abbreviation is also commonly used to refer to the Open Archival Information System reference model standard which defined the term. The standard is a conceptual framework describing the environment, functional components, and information objects associated with a system responsible for long-term preservation. As a reference model, its primary purpose is to provide a common set of concepts and definitions that can assist discussion across sectors and professional groups and facilitate the specification of archives and digital preservation systems. It has a very basic set of conformance requirements that should be seen as minimalist. OAIS was first approved as ISO Standard 14721 in 2002 and a 2nd edition was published in 2012. Although produced under the leadership of the Consultative Committee for Space Data Systems (CCSDS), it had major input from libraries and archives.

OPF Open Preservation Foundation, formerly the Open Planets Foundation. http://openpreservation.org

 P

PAIMAS Space Data and Information Transfer Systems - Producer-Archive Interface - Methodology Abstract Standard. This ISO 20652:2006 standard covers the first stages of the ingest process defined by OAIS reference model. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=39577

PDF Portable Document Format, a set of formats and open standards for producing and sharing electronic documents, originally developed by Adobe Systems. The original page description format has been elaborated over successive versions to enable the embedding of such complex objects as image, audio, and moving image files, hyperlinks, embedded XML metadata, and updatable forms. Specifications for various versions and profiles of the format are now maintained by the International Organization for Standardization (ISO). http://www.adobe.com/uk/products/acrobat/adobepdf.html

PDF/A Versions of the PDF standard intended for archival use. http://www.aiim.org/Research-and-Publications/Standards/Committees/PDFA

PDI Preservation Description Information. The information which is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, Fixity, Context, and Access Rights Information (OAIS term).

Perpetual Access see Continuing Access.

Petabyte (PB) A unit of digital information often used to describe data or data storage size, equates to approximately 1,000 Terabytes (TB).

PIN Pérennisation des Informations Numériques, the French national interest group for digital preservation. http://pin.association-aristote.fr/doku.php

Post-cancellation Access see Continuing Access.

PREMIS Preservation Metadata: Implementation Strategies. A de facto standard for digital preservation metadata. http://www.loc.gov/standards/premis/

PRONOM A database of file formats, software products and other technical components required to support long-term access to electronic records and other digital objects of cultural, historical or business value. Used with DROID. http://apps.nationalarchives.gov.uk/PRONOM/Default.aspx

PST Personal Storage Table is a file extension for local 'personal stores' written by the program Microsoft Outlook. PST files contain email messages and calendar entries using a proprietary but open format, and they may be found on local or networked drives of email end users. Several tools can read and migrate PST files to other formats. http://en.wikipedia.org/wiki/Personal_Storage_Table

 Q

 R

Reformatting Copying information content from one storage medium to a different storage medium (media reformatting) or converting from one file format to a different file format (file re-formatting).

Refreshing Copying information content from one storage medium to another storage medium of the same type.

 S

Sandbox Containment A secure computing environment for running novel, unattested or experimental code or changes in code, including potentially malicious code. The environment is self-contained with tightly controlled resources and is characteristically virtual.

SGML Standard Generalized Markup Language, an ISO standard for how to specify a document markup language or tag set. http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language

Significant properties Characteristics of digital and intellectual objects that must be preserved over time in order to ensure the continued accessibility, usability and meaning of the objects and their capacity to be accepted as (evidence of) what they purport to be. https://www.archives.gov/files/era/acera/pdf/significant-properties.pdf

SIP Submission Information Package. An Information Package that is delivered by the Producer to the OAIS for use in the construction or update of one or more Archival Information Packages (AIPs) and/or the associated Descriptive Information (OAIS term).

SMPTE Society of Motion Picture and Television Engineers, a professional organisation and technical standards body for television and motion picture. https://www.smpte.org

 T

TDR Trusted Digital Repository. A trusted digital repository has been defined as having “a mission to provide reliable, long-term access to managed digital resources to its designated community, now and into the future”. The TDR must include the following seven attributes: compliance with the reference model for an Open Archival Information System (OAIS), administrative responsibility, organizational viability, financial sustainability, technological and procedural suitability, system security, and procedural accountability. The concept has been an important one particularly in relation to certification of digital repositories.

Terabyte (TB) A unit of digital information often used to describe data or data storage size, equates to approximately 1,000 Gigabytes (GB).

Three-Legged Stool A conceptual approach to digital preservation that suggests a fully implemented and viable preservation programme addresses organisational issues, technological concerns, and funding questions, balancing them like a three-legged stool. Developed as part of the Digital Preservation Management Workshop and Tutorial.

TIFF Tagged Image File Format, a common format for images, typically lossless. http://en.wikipedia.org/wiki/Tagged_Image_File_Format

TRAC Trustworthy Repositories Audit and Certification, a toolkit for auditing a digital repository. http://www.crl.edu/sites/default/files/d6/attachments/pages/trac_0.pdf

Trigger Event This terminology is used when specific conditions relating to an electronic publication and its continued delivery to users are met. If the publication is no longer available to users from the publisher or any other source, for whatever reason, then a trigger event is said to have occurred. A trigger event can set in motion access for users via an archive where the electronic publication may be digitally preserved.

 

 U

UKWA UK Web Archive. http://www.webarchive.org.uk/ukwa/

 V

 W

WARC The WARC (Web ARChive) format is a container format for archived websites, also known as ISO 28500:2009. It is a revision of the Internet Archive's ARC File Format used to store web crawls harvested from the World Wide Web. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717

WAV the standard file wrapper for audio; see BWF (Broadcast Wave Format) for the professional variant. http://en.wikipedia.org/wiki/WAV

Writeblockers Tools that prevent an examination computer system from writing or altering a collection or subject hard drive or other digital media object. Hardware writeblockers are generally regarded as more reliable than software writeblockers.

 X

XML Extensible Markup Language, a widely used standard (derived from SGML) for representing structured information, including documents, data, configuration, books, and transactions. It is maintained by the World Wide Web Consortium (W3C). http://www.w3.org/XML/

 Y

 Z




Complex objects and software

 

This page is under construction.

 


Computer-aided design

 

This page is under construction.

 


Documents and PDF/A

 

This page is under construction.

 


 

Resources

 

Case studies


Newspaper e-prints

http://www.digitalpreservation.gov/ndsa/working_groups/documents/NDSA_CaseStudy_NewspaperEPrints.pdf

The US National Digital Stewardship Alliance (NDSA) examines the value, opportunities and obstacles for selective preservation of the PDF printmasters for newspaper e-prints. February 2013, 3 pages

 


eBooks

 

This page is under construction.

 


Email

 

 

This page is under construction.

 


 

Resources

 

Case studies


Society of American Archivists campus case studies

Partnering with IT to Identify a Commercial Tool for Capturing Archival E-mail of University Executives at the University of Michigan

http://files.archivists.org/pubs/CampusCaseStudies/CASE-14-FINAL.pdf

Aprille Cooke McKay, Bentley Historical Library, University of Michigan, examines the challenges and opportunities of partnering with IT to issue a Request for Proposal (RFP) for commercial e-mail archiving software. 2013. 53 pages

Will They Populate the Boxes? Piloting a Low-Tech Method for Capturing Executive E-mail and a Workflow for Preserving It at the University of Michigan

http://files.archivists.org/pubs/CampusCaseStudies/CASE-15-FINAL.pdf

Aprille Cooke McKay, Bentley Historical Library, University of Michigan. The first part of the paper describes a pilot study testing whether university executives and leaders would flag e-mail messages of long-term value to transfer to the archives. The second part describes the steps taken to move from an ad hoc approach to digital records transfer and processing to one much more routinized. 2013. 91 pages.


Geospatial data

 

This page is under construction.

 


Contents

This contents page provides an "at a glance" view of the major sections and all their component topics.

You can navigate the Handbook by clicking and expanding the "Explore the Handbook" navigation bar or by clicking links in this contents page.

The contents are listed hierarchically and indented to show major sections and sub-sections. Landing pages provide overviews and information for major sections with many sub-sections.

Maintenance and additions to the new Handbook will be ongoing. Any new sections agreed for the next DPC publications plan will be shown as "coming soon".

Status key: a tick marks a complete section (every section listed below); "Coming soon" marks a planned section.

Digital Preservation Handbook [landing page]
    Introduction
    How to use the Handbook
    Development and acknowledgements
Digital preservation briefing [landing page] (PDF of this section)
    Why digital preservation matters
    Preservation issues
Getting started (PDF of this section)
Institutional strategies [landing page] (PDF of this section)
    Institutional policies and strategies
    Collaboration
    Advocacy
    Procurement and third party services
    Audit and certification
    Legal compliance
    Risk and change management
    Staff training and development
    Standards and best practice
    Business cases, benefits, costs, and impact
Organisational activities [landing page] (PDF of this section)
    Creating digital materials
    Acquisition and appraisal
    Decision tree
    Retention and review
    Storage
    Legacy media
    Preservation planning
    Preservation action
    Access
    Metadata and documentation
Technical solutions and tools [landing page] (PDF of this section)
    Tools
    Fixity and checksums
    File formats and standards
    Information security
    Cloud services
    Digital forensics
    Persistent identifiers
Content-specific preservation [landing page] (PDF of this section)
    e-Journals
    Moving pictures and sound
    Web-archiving
Glossary

 
