DPA 2012: DPC Decennial Award - Finalists - Digital Preservation Coalition

The Decennial Award is a special award offered to mark the tenth anniversary of the DPC. It will be presented to the project, initiative or person that in the period 2002-2012, the judges assess to have made the most outstanding contribution to ensuring our digital memory is available tomorrow. Four finalists have been selected for this prestigious prize. Here the nominees describe their motivations, their projects and their impact.

Archaeology Data Service at the University of York

Archaeology is unusual in that the creation of knowledge results from the physical destruction of primary evidence, making access to data all the more critical in order to test, assess, and subsequently reanalyse and reinterpret both data and the hypotheses arising from them. Over the years, archaeologists have amassed a vast collection of fieldwork data archives, a significant proportion of which remain unpublished. Furthermore, much fieldwork data is increasingly born digital, making it all the more precious. Access to data, even those which are published, is often difficult or inconvenient at best.

The Archaeology Data Service (ADS) was established in September 1996, as one of five discipline-based service providers within the Arts and Humanities Data Service (AHDS). The ADS developed from a successful bid to the AHDS made by a consortium of university Departments of Archaeology and the Council for British Archaeology, led by the University of York. From an early stage the ADS also began to receive external funding from a variety of other organisations, such as English Heritage, reflecting the diverse nature of the archaeological sector. It developed a charging policy which is based on a one-off digital deposit charge, and provides a sustainable financial model for digital archiving. Its innovative approach to charging was widely welcomed and ADS has worked with Charles Beagrie Ltd on the Keeping Research Data Safe projects to share best practice.

The ADS works with national and local archaeological agencies and those research councils involved in the funding of archaeological research, to negotiate deposition and secure archiving of project data. This includes data derived from fieldwork as well as desk-based studies. The types of data involved include: text reports, databases (related to excavated contexts or artefacts, for example), images (including aerial photographs, remote sensing imagery, photographs of sites, features and artefacts), digitised maps and plans, numerical datasets related to topographical and sub-surface surveys and other locational data, as well as 3D reconstructions. In March 2011 the ADS was accredited with the Data Seal of Approval, an international ‘kite-mark’ for digital repositories, becoming only the second UK repository to gain this recognition, after the UK Data Archive at the University of Essex. In 2012 it was accredited by MEDIN as a Data Archive Centre for the Marine Environment Data and Information Network.

Over a decade before the current push towards Open Data, the ADS made the data it preserves available for re-use. On 15th September 1998 the ADS launched the first version of ArchSearch, its online catalogue. All data sets are freely available for download and re-use. In 2002 it launched HEIRPORT, the first interoperable gateway for the historic environment sector. In 2005 it completed ARENA, the first portal to European cultural heritage and digital archives. In 2011 it launched TAG, the first Transatlantic Archaeological Gateway. ADS is now the UK supplier of 3D heritage data to CARARE, a Europeana best practice network, and a lead member of ARIADNE, a new EU Research Infrastructure project. Between 2007 and 2009 ADS worked on Archaeotools, a JISC-EPSRC-AHRC eScience project to undertake Natural Language Processing of its collections and to implement a faceted browse interface, now available via ArchSearch. The ADS has also worked on Open Linked Data as another means of providing access to its online holdings, via the STAR and STELLAR projects.

ADS now preserves over 17,000 grey literature reports and over 500 data-rich digital archives, derived from archaeological research projects and primary fieldwork. The archives represent some of the most important sites in British Archaeology. The grey literature has become the basis of other research projects, most recently a Leverhulme grant awarded to Professor Mike Fulford to study rural settlement in Roman Britain, and an ERC senior investigator award to Professor Chris Gosden, who is looking at Regional Identities in Britain. The significance of the digital heritage preserved is immeasurable. In an ongoing JISC-commissioned survey on the Impact of ADS, 74% of users said that ADS was very or extremely important for their academic research, 64% said it was important for their private research, and 55% said it was important for their learning and skills development. 48% of depositors reported that not being able to provide data to ADS would have a severe or major impact on their work.

ADS has worked extensively on data standards. In 1999 ADS published the first Guides to Good Practice, and published 6 titles in its first 6 years. In 2006 it completed work on ‘Big Data’ for English Heritage, on the “Preservation and Management Strategies for Exceptionally Large Data Formats”. In 2011 it launched the new online second editions of the Guides, freely available and including new areas, such as standards for Underwater Archaeology, derived from the EU-funded VENUS project. Keith Kintigh, Past President of the Society for American Archaeology, writes that the ADS “is an enormous asset to the UK’s archaeological community - within and outside academic settings. It provides both secure preservation and rich access to the irreplaceable records of archaeological investigations, thereby allowing this information to be effectively used in archaeological research and cultural heritage management.

ADS has been a key player internationally in advancing initiatives concerned with the preservation and dissemination of cultural heritage information and has, indeed, been a model of a sustainable and productive digital archive of archaeological data and documents. In our own multi-institutional effort to develop a digital archive for archaeological data in the US, ADS has not only served as a valuable model, its staff have provided critical advice and assistance. Further, ADS has been a major driver of international efforts to establish interoperability of digital repositories facilitating the sharing of archaeological information. Looking beyond archaeology, with its long (for a digital repository) history of success, ADS has also been a widely cited exemplar of a successful disciplinary repository.”

The ADS has been awarded two British Archaeological Awards for Innovation, in 2008 and 2012, and it featured extensively in the successful nomination of the Department of Archaeology at the University of York for a Queen’s Anniversary Prize for Higher and Further Education in 2011.

Note - ADS (nominee) and University of York (Host) may not vote for ADS; JISC (initial funder) may not vote for ADS

PREMIS Data Dictionary for Preservation Metadata

Since winning the 2005 Digital Preservation Award, the PREMIS Data Dictionary has become the international standard for preservation metadata for digital materials. Developed by an international team of experts, PREMIS is implemented in digital preservation projects around the world, and support for PREMIS is incorporated into a number of commercial and open-source digital preservation tools and systems. PREMIS maintains XML schema to support implementation; engages in outreach around the world; and leads initiatives to develop guidelines and best practices for implementing the Data Dictionary. PREMIS is the international nexus for preservation metadata, and is recognized as a core standard for state-of-the-art digital preservation.

The PREMIS Data Dictionary for Preservation Metadata (http://www.loc.gov/standards/premis/v2/premis-2-2.pdf) is a comprehensive guide to core metadata to support the long-term preservation of digital materials. The Data Dictionary was produced through an international consensus-making working group, with representatives from many countries and domains. It includes detailed descriptions of the metadata needed to support the digital preservation process, along with guidelines for implementation and use. PREMIS also developed an XML schema to facilitate implementation of the Data Dictionary by institutions managing and exchanging PREMIS-conformant preservation metadata.

The PREMIS Working Group developed a data model that defines five key entities – Objects, Events, Rights, Agents, and Intellectual Entities – associated with the digital preservation process; this was used to organize and scope the Data Dictionary. For each entity, the Data Dictionary defines lists of “semantic units” – discrete pieces of information – where a semantic unit represents a property of the entity. The Data Dictionary provides guidelines for use and implementation notes for each semantic unit. An important feature of the preservation metadata defined in the Data Dictionary is the ability to support linking between the five entities, as a means of documenting important relationships within the digital preservation process. To promote applicability in as wide a range of contexts as possible, the Data Dictionary is neutral in terms of the strategies or encoding used for implementation; the PREMIS XML schema offers one alternative, but it is not required.
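The linking between entities described above can be pictured with a small sketch. This is an illustration only: the class names, field names and identifier values below are simplified assumptions chosen for readability, not the semantic units or the XML schema defined in the Data Dictionary, and only three of the five entities are shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    """A person, organisation or software involved in preservation."""
    identifier: str
    name: str


@dataclass
class Event:
    """An action affecting an object, linked to objects and agents by identifier."""
    identifier: str
    event_type: str
    date_time: str
    linked_object_ids: List[str] = field(default_factory=list)
    linked_agent_ids: List[str] = field(default_factory=list)


@dataclass
class PreservationObject:
    """A digital object held by the repository (hypothetical, simplified)."""
    identifier: str
    format_name: str
    linked_event_ids: List[str] = field(default_factory=list)


def record_event(obj: PreservationObject, event: Event, agent: Agent) -> None:
    """Link an event to the object it acted on and the agent that performed it."""
    event.linked_object_ids.append(obj.identifier)
    event.linked_agent_ids.append(agent.identifier)
    obj.linked_event_ids.append(event.identifier)


# Hypothetical example values, used only to show the links being made.
obj = PreservationObject("obj-001", "PDF")
agent = Agent("agent-01", "Repository ingest service")
event = Event("evt-001", "ingestion", "2012-08-01T12:00:00Z")
record_event(obj, event, agent)
```

Because the links are held as identifiers rather than as nested records, the same relationships could equally be stored in a database or serialised to XML, which is in keeping with the Data Dictionary's neutrality about encoding.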

After the release of version 1.0 of the Data Dictionary in 2005, the Library of Congress (LC) established a PREMIS Maintenance Activity, which includes LC as managing agency, the PREMIS Implementers’ Group (PIG) listserv for communication with PREMIS implementers, and an Editorial Committee to promote implementation and coordinate future revisions as implementation experience suggests. The PREMIS Editorial Committee has been receptive to the implementation community by revising the PREMIS Data Dictionary according to an established revision process as issues were encountered and change proposals submitted. The PREMIS Data Dictionary is currently in version 2.2, having been revised to provide more detailed information about preservation rights and a mechanism for extensibility, among other enhancements.

Currently, work is progressing on a PREMIS OWL ontology to enable the use of preservation metadata within a Linked Data model, which allows information to be more easily interconnected, especially between different repository databases. It will integrate with the W3C Provenance ontology (PROV) (http://www.w3.org/TR/prov-primer/) and the PREMIS controlled vocabularies available from the Library of Congress’ Linked Data Service (http://id.loc.gov/). Other activities have included the development of a conformance statement, registries of implementations and tools, and a best practice guide for implementing PREMIS with the Metadata Encoding and Transmission Standard (METS). The PREMIS Implementation Registry (http://www.loc.gov/standards/premis/registry/premis-fulllist.php) lists 47 projects as of Aug. 2012; the list is not comprehensive, since the Committee is dependent on having implementers submit the information, but it reflects the wide variety of uses across domains and countries.

As part of its community outreach, PREMIS has sponsored and conducted numerous tutorials around the world to educate implementers and potential implementers. In addition, it has organized and held two “implementation fairs”, both held in conjunction with the International Conference on the Preservation of Digital Objects (iPres), where implementers share information about projects, issues, solutions, implementation experiences, tools and services. A third implementation fair will take place at iPres 2012 in Toronto (http://www.loc.gov/standards/premis/premis-implementation-fair2012.html).

PREMIS has enjoyed considerable success in being accepted as the international preservation metadata standard regardless of domain, type of institution, type of resources being preserved, or geographic location. Some countries have mandated its use in preservation repositories in the cultural heritage sector. The Editorial Committee is planning to involve the International Organization for Standardization (ISO) in the near future to place the Data Dictionary in a more formal standards environment.

As PREMIS has matured, both open-source and commercial tools and services have been built to support it. The PREMIS-in-METS Toolbox (http://pim.fcla.edu) is an openly available resource that extracts preservation metadata from digital objects in the form of PREMIS XML, converts it to PREMIS in METS (or vice versa), and validates it according to the schema and the best practice guidelines. Some of the key digital preservation solutions that incorporate support for PREMIS include Archivematica, DAITSS and Ex Libris’ Rosetta.

PREMIS continually seeks to refine and enhance the Data Dictionary and its value to implementers. For example, the PREMIS Data Model provides a framework for implementing preservation metadata by defining the entities and relationships involved in the digital preservation process. As PREMIS gained wider adoption, implementation experience suggested a number of ways to update and adjust the Data Model to enhance its value to implementers. In response to this, the Editorial Committee is currently engaged in updating the Data Model for version 3.0 of the Data Dictionary: for example, making Intellectual Entities in scope for PREMIS metadata as another level of a preservation object; and making Environment (i.e. hardware and software) a separate entity within the Data Model. This work has been informed by a number of preservation initiatives, particularly the EU-funded Planets project.

In summary, the PREMIS Data Dictionary is a freely-available resource for the entire digital preservation community that has advanced the theory, practice, and understanding of digital preservation by standardizing the information that a repository must know in order to carry out its digital preservation processes. This resource is supported by a reliable apparatus, in the form of the Maintenance Activity and Editorial Committee, responsible for maintaining and enhancing it to ensure that it continues to meet the needs of implementers. Widespread adoption of PREMIS has led to the development of a variety of tools and services supporting its use. The original PREMIS Data Dictionary, published in 2005, has since emerged as the definitive international standard for preservation metadata, and is now part of the permanent infrastructure of standards and best practices supporting long-term digital preservation.

Note - Library of Congress (Host) may not vote for PREMIS

PRONOM and DROID from The National Archives

DROID supports batch processing of large numbers of files. It is freely available to download under an Open Source license, and is written in the platform-independent programming language Java. It provides both a graphical user interface and a command-line interface. DROID provides comprehensive reporting on collections of digital records, including formats, extensions, PUIDs, filepaths, and checksums, the latter offering a quick method of finding duplicate files even when the files may have different filenames. All reports can be saved and exported as spreadsheet files for detailed analysis.
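The use of checksums to find duplicates regardless of filename can be sketched in a few lines. This is not DROID's code or its report format: the function names are hypothetical, and the choice of SHA-256 is an assumption made for the example.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by checksum; keep groups with more than one file."""
    by_checksum = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_checksum[file_checksum(path)].append(path)
    return {c: paths for c, paths in by_checksum.items() if len(paths) > 1}


# Demonstration on a temporary folder with made-up file names and contents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "report.txt").write_bytes(b"site report")
    (root / "copy-of-report.txt").write_bytes(b"site report")
    (root / "plan.txt").write_bytes(b"site plan")
    duplicates = find_duplicates(root)
    duplicate_names = sorted(
        p.name for paths in duplicates.values() for p in paths
    )
```

In the demonstration, the two files with identical content fall into the same group even though their names differ, while the third file is left out.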

Since DROID connects via web services to the PRONOM registry, users always have access to the latest available file format signatures. Users may also develop and implement their own signature files and we have produced detailed and freely available information on how to achieve this, meaning that individuals and institutions are not tied to The National Archives’ research alone.

In addition to appointing a full-time File Format Signature Developer, we have invested in continuous development of both PRONOM and DROID. DROID 5 introduced scanning of archive formats, such as .zip, meaning that DROID now reports on the contents of these files without the need for manually opening each archive file. DROID 6 provides container signatures for the first time, enabling accurate identification of compound formats, such as OLE2 used by Microsoft. DROID 6.1, to be released in autumn 2012, provides further stability and a more efficient command-line identification option. DROID 7, the next major release, has an openly available wiki for interested parties to submit their own requirements.
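The general idea behind signature-based identification, and why container formats need a second look inside the file, can be sketched as follows. This is a simplified illustration rather than DROID's signature syntax or matching engine: the tables, labels and function names are assumptions for the example, and only a handful of well-known leading byte sequences are included.

```python
import io
import zipfile

# Leading byte sequences for a few well-known formats (illustrative subset).
MAGIC_SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file"),
    (b"PK\x03\x04", "ZIP archive"),
]

# Files expected inside a ZIP container for some compound formats.
CONTAINER_MEMBERS = [
    ("word/document.xml", "Word document (OOXML)"),
    ("xl/workbook.xml", "Excel workbook (OOXML)"),
]


def identify(data: bytes) -> str:
    """Identify by leading bytes, then refine ZIP containers by their contents."""
    label = "unknown"
    for magic, name in MAGIC_SIGNATURES:
        if data.startswith(magic):
            label = name
            break
    if label == "ZIP archive":
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                members = set(archive.namelist())
        except zipfile.BadZipFile:
            return label
        for member, name in CONTAINER_MEMBERS:
            if member in members:
                return name
    return label


def make_zip(member_name: str) -> bytes:
    """Build a tiny in-memory ZIP with one member, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_name, "<xml/>")
    return buffer.getvalue()
```

Calling `identify(make_zip("word/document.xml"))` reports a Word document rather than a generic ZIP archive, which is the kind of distinction a container-level check makes possible.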

The next development we are planning for PRONOM is to make the entire registry available following a Linked Data approach. We have already made available via our website a prototype of Linked Data PRONOM.

The achievements of PRONOM and DROID are clear: they have stimulated debate and further thought on the subject of digital preservation; PRONOM was the first publicly available technical registry for file format information, and both PRONOM and DROID have inspired a number of similar tools. PRONOM provided a significant amount of data to the Unified Digital Format Registry recently launched by the University of California Curation Center and the California Digital Library. DROID is embeddable within digital preservation workflows and systems conforming to the Open Archival Information System model; for example, it is fully embedded within Tessella’s Safety Deposit Box system, for which we won a Queen’s Award for Enterprise in 2011.

PRONOM and DROID are ten and seven years old respectively, and throughout their lifetimes they have contributed significantly to the field of digital preservation. The National Archives remains wholly committed to these tools, which have not only helped to drive the success of our own digital records infrastructure, but have also been recognised and adopted worldwide.

Note - The National Archives (nominee) may not vote for PRONOM and DROID

International Internet Preservation Consortium

The Internet has enabled an unprecedented era of knowledge sharing, creativity, innovation, and connection. It has also created new challenges for institutions whose mission is to document and preserve contemporary knowledge and culture. Libraries and archives have long collected information to help scholars and the general public to understand contemporary history, culture, science, technology, economics, and society. Much of today's information is published on the World Wide Web: blogs and Facebook pages are today's diaries, and websites have replaced hard copy newsletters. In many countries, government forms and documents are more readily accessible on the web than they are in paper form.

Hundreds of millions of people around the world use the web as their primary resource to acquire and exchange information, to establish personal and professional networks, to buy goods, to view or distribute films, videos and photos, to listen to music, and to research a candidate or set of issues that might influence their choices during an election. The availability of online resources is now taken for granted. But, one thing is certain about the Web; like the weather, it changes. One cannot assume that a resource discovered today will still be accessible in two hundred years or twenty weeks or even two days hence. An estimated 44 percent of Web sites that existed in 1998 vanished without a trace within one year of initial publication. The rate of change has dramatically accelerated as web sites have become more dynamic, interactive, and personalized.

By 2000, there was an urgent need for governments, institutions and the constituents they served to understand that the web is the essence of our society, who we “are”. It is our culture and social fabric, and we should not risk losing a record of the significant roles the Web plays in our societies. At the same time cultural heritage institutions were charged with helping to bridge the technological divide that was growing at an alarming rate—to bring the breadth and depth of digital information and resources to those without the means to access, study and use the Web of their own accord. These were the dilemmas facing cultural heritage institutions in 2003 – they needed greater control over web archiving projects, better methodologies, and tools. They needed dedicated budgets and opportunities to advance digital scholarship. And, each realized that they could not (and still cannot) address these immense needs alone.

The International Internet Preservation Consortium (IIPC) was formed to ensure knowledge and information from the Web is preserved and made accessible for future generations everywhere. Eleven national libraries and the Internet Archive established the IIPC to develop common tools and standards for Web archiving and to encourage and support libraries, archives, museums and cultural heritage institutions everywhere to address Internet content collecting and preservation. Since its inception in 2003 the IIPC has grown to include forty-one members, all willing to share best practices, develop tools and resources for the global cultural heritage community. The strengths of the organization remain the ability of its members to put aside political, social, cultural, geographic, language, and technological differences to engage in an ongoing collaboration for which there is no final or ultimate solution to be reached. Each member shares a long-term commitment to helping institutions around the globe to create and sustain successful web archiving strategies and programs.

The primary achievements of the IIPC over the last decade include:

Open Source Web Archiving Tools Development
One of the very first projects of the IIPC was to develop an open source tool to capture web content called Heritrix. All original member institutions contributed engineering resources, funding or other support to the effort. The tool, Heritrix, was initially released in December 2003 after 4 months of co-development. It remains the only Java-based, archival quality, open source web crawler freely available for download today.
In the second and third phases of the consortium, members sponsored development of a suite of open source software libraries and applications called WARC tools that are used to validate content collected from the Web and to view it from an archive, and then dedicated time and resources to test the solutions in individual institutional workflows.

Best Practice and Standards Development
In 2007, IIPC members recognized the need to define a clear standard for preserving content collected from the web. Over a period of two years, the IIPC proposed and developed an ISO standard file format for archiving web content known as WARC. In 2011, the IIPC developed a suite of educational outreach programs that included sponsored workshops on social media, collection development, and an international survey of legal deposit legislation. In parallel, sponsorship of a PhD student and the advancement of web archival research and publication became a central focus of the IIPC budget as the third phase of the consortium comes to a close and the fourth phase begins.

Content Collection and Preservation
Millions of web sites have been collected and preserved by individual institutions and in collaborative efforts such as international event-based archives for natural disasters (e.g. tsunamis, earthquakes), national elections and revolutions (e.g. European Union, Jasmine revolution/Arab Spring), and the 2010 and 2012 Olympics.

Sustained Collaboration
A leading accomplishment of the IIPC was the creation of a sustainable, membership-based, international consortium. The IIPC fosters ongoing collaboration by requiring active participation and direct contribution of its members through volunteer time, attendance at annual forums, and annual tiered dues based on an institution’s annual budget. In the second phase of the consortium, the IIPC also imposed a one-year limit on the term of the Steering Committee Chair. This enables institutions of all sizes to participate freely and openly in the consortium and to rotate through the highest level of leadership as Chair of the Steering Committee.

Spreading the burden helped each institution to accomplish more in the same time period than they ever could have accomplished on their own. In fact, the need to partner to address current and future challenges presented by innovative publishing models, as well as ever evolving technology, services, and popular trends is even more pronounced today than it was ten years ago. The IIPC is a model of collaborative action to preserve digital content from the Internet.

Note - Library of Congress (Host) may not vote for IIPC