DPC

Digital File Longevity (compiled for R&D in Digital Asset Preservation)

Resources compiled by Julian Jackson

This subject is relatively new and there are not many sources dealing with the longevity of digital image files. The sites below offer a variety of resources which will be helpful in finding out more. Howard Besser of UCLA, in particular, has written much on the subject and compiled links to other useful resources.

We know that photographic negatives, transparencies and prints last a long time. They are reliable forms of storing data. Recently the Royal Geographical Society reprinted Frank Hurley's pictures from the 1913 Antarctic Expedition - from his original glass negatives, nearly 100 years old. These negatives are an example of how robust the storage medium is - remember they had been kept in sub-zero conditions and transported across an ocean in a tiny lifeboat!

In the headlong rush to put photographic images into digital form, little thought has been given to the problem of the longevity of digital files. There is an assumption that they will last, but that is now in question.

"There is growing realisation that this investment and future access to digital resources, are threatened by technology obsolescence and to a lesser degree by the fragility of digital media. The rate of change in computing technologies is such that information can be rendered inaccessible within a decade. Preservation is therefore a more immediate issue for digital than for traditional resources. Digital resources will not survive or remain accessible by accident: pro-active preservation is needed." Joint Information Systems Committee: Why Digital Preservation?

The 1086 Domesday Book, instigated by William the Conqueror, is still intact and available to be read by qualified researchers in the Public Record Office. In 1986 the BBC created a new Domesday Book about the state of the nation, costing £2.5 million. It is now unreadable. It contained 25,000 maps, 50,000 pictures, 60 minutes of footage, and millions of words, but it was made on special disks which could only be read using a BBC Micro computer. There are only a few of these left in existence, and most of them don't work. This Domesday Book Mark 2 lasted less than 16 years.

Digital media have to be stored, and the physical media they are stored on - for instance a computer's hard disk drive or a CD-ROM - have finite lifespans. But the primary problem is obsolescence. Computer formats sink into oblivion very rapidly. Howard Besser, of the UCLA School of Education & Information Studies, says: "Fifteen years ago Wordstar had (by far) the largest market penetration of any word processing program. But few people today can read any of the many millions of Wordstar files, even when those have been transferred onto contemporary computer hard disks. Even today's popular word processing applications (such as Microsoft Word) typically cannot view files created any further back than two previous versions of the same application (and sometimes these still lose important formatting). Image and multimedia formats, lacking an underlying basis of ascii text, pose much greater obsolescence problems, as each format chooses to code image, sound, or control (synching) representation in a different way."

If an image has been generated on negative or transparency, then scanned and transformed into a digital file, then the original is safe. However if it has been digitally originated, such as much of today's news and sport photography, then vital parts of our cultural heritage may be lost forever. This problem will get worse as more photography becomes completely digital.

The two aspects of the problem

The longevity problem can be divided into two questions: the lifespan of the medium on which the file is stored, e.g. a CD-ROM, and the obsolescence of the format. Digital formats age quite rapidly because they are superseded by new ones, particularly if they are proprietary.

As the British Joint Information Systems Committee says: "Preservation is therefore a more immediate issue for digital than for traditional resources. Digital resources will not survive or remain accessible by accident: pro-active preservation is needed."

The key technical approaches for keeping digital information alive over time were first outlined in a 1996 report to the US Commission on Preservation and Access (Task Force 1996).

  • Refreshing involves periodically moving a file from one physical storage medium to another to avoid the physical decay or the obsolescence of that medium. Because physical storage devices (even CD-ROMs) decay, and because technological changes make older storage devices (such as 8-inch floppy drives) inaccessible to new computers, some ongoing form of refreshing is likely to be necessary for many years to come.
  • Migration is an approach that involves periodically moving files from one file-encoding format to another that is usable in a more modern computing environment. (An example would be moving a Wordstar file to WordPerfect, then to Word 3.0, then to Word 5.0, then to Word 97.) In a photographic environment, we come across older Photoshop files that are no longer readable and have to be updated into a newer format. Migration seeks to limit the problem of files encoded in a wide variety of file formats that have existed over time by gradually bringing all former formats into a limited number of contemporary formats.
  • Emulation seeks to solve a similar problem that migration addresses, but its approach is to focus on the applications software rather than on the files containing information. Emulation backers want to build software that mimics every type of application that has ever been written for every type of file format, and make them run on whatever the current computing environment is. (So, with the proper emulators, applications like Wordstar and Word 3.0 could effectively run on today's machines.)

Both a migration approach and an emulation approach require refreshing.
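Refreshing can be sketched concretely: copy the file onto the new medium, then verify the copy bit-for-bit before trusting it. The Python below is a minimal illustration, not a prescribed workflow; the function names and file layout are invented for the example:

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 fixity checksum of a file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def refresh(source: Path, destination_dir: Path) -> Path:
    """Copy a file onto a new storage medium and confirm the copy is
    bit-identical to the original before it is relied upon."""
    destination_dir.mkdir(parents=True, exist_ok=True)
    target = destination_dir / source.name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"refresh of {source} failed its fixity check")
    return target
```

In a real workflow the checksum of the original would be recorded once, when the file was known to be good, so that decay on the old medium can also be detected.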

This places a burden on individual photographers and small photo libraries, who already have enough to contend with in a rapidly changing environment. The thought that costly digital files might be unusable in a few years is a worrying one. While TIFFs and JPEGs - because of their wide acceptance - are likely to be more resistant to obsolescence, it will probably still happen eventually. Users of images need to be aware of this and have a plan to refresh the data onto new formats if necessary. This necessitates having good back-up copies to work from.

The role of meta-data

It has become clear to many institutions that there should be world-wide standards for data embedded in every file: who created it, when, in what format, plus captioning and copyright information, for example. This would make access easier and also help preservation in the future. Various institutions are working towards standards, though creating a single universal one will not be easy. A valuable project is the Dublin Core Metadata Initiative, which runs workshops and projects to create metadata standards.

http://dublincore.org/
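As an illustration of what such an embedded, standardised record might look like, the sketch below serialises a handful of the fifteen simple Dublin Core elements for a digital image. The element names come from the Dublin Core element set; the field values here are invented for the example:

```python
from xml.etree import ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

def dublin_core_record(fields: dict) -> str:
    """Serialise a dict of simple Dublin Core elements as XML."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for element, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")

# Illustrative values only - not a real catalogue record.
record = dublin_core_record({
    "title": "Glass negative, Antarctic expedition",
    "creator": "Frank Hurley",
    "date": "1913",
    "format": "image/tiff",
    "rights": "Royal Geographical Society",
})
```

Embedding even this small set of elements answers the questions posed above - who created the file, when, and in what format - in a form any future system can parse.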

The decay of physical media

Photographic materials tend to decay slowly over time, so you have enough warning to copy a treasured print, for example. Digital media tend either to function fully or not at all, and you have to open the file to find out. This adds another layer of uncertainty to the process. While this is the lesser of the two problems, it still has to be thought about. The lifespans of hard drives and CD-RWs are somewhat speculative. In the case of the latter, accelerated-ageing tests have been done, but we still have incomplete data, as the medium is quite new. It would seem wise to back up vital data on two different media, for instance a hard drive and a CD-RW, until more is known.
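Because digital files fail silently, a periodic fixity audit - comparing each stored copy against checksums recorded when the files were known to be good - turns silent decay into an early warning that a copy should be restored from the second medium. A minimal sketch, with illustrative file names and layout:

```python
import hashlib
import json
from pathlib import Path

def file_checksum(path: Path) -> str:
    """SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record_manifest(folder: Path, manifest: Path) -> None:
    """Record known-good checksums for every file in a folder."""
    sums = {p.name: file_checksum(p) for p in folder.iterdir() if p.is_file()}
    manifest.write_text(json.dumps(sums, indent=2))

def audit(folder: Path, manifest: Path) -> list:
    """Return names of files that are missing or no longer match their
    recorded checksum - candidates for restoration from the backup medium."""
    sums = json.loads(manifest.read_text())
    bad = []
    for name, digest in sums.items():
        target = folder / name
        if not target.exists() or file_checksum(target) != digest:
            bad.append(name)
    return bad
```

Running the audit on each copy at regular intervals means decay on one medium can be repaired from the other before both are affected.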

This is a complex problem. Howard Besser at UCLA seems to be one of the best sources for further information:

Information Longevity http://sunsite.berkeley.edu/Longevity

Besser, Howard. Digital Longevity.

Besser, Howard. Longevity of Electronic Art.

Task Force on the Archiving of Digital Information.

Other sources:

Journal of Electronic Publishing: http://www.press.umich.edu/jep/

Joint Information Systems Committee http://www.jisc.ac.uk

Sepia (European group investigating preservation of photos) http://www.knaw.nl

Resource Links:

Papers and Analysis of Problems


PRESERVATION MANAGEMENT OF DIGITAL MATERIALS: A HANDBOOK
by Maggie Jones and Neil Beagrie

Published by THE BRITISH LIBRARY October 2001
Price £15.00 Paperback, 145 pages, 297x210mm, ISBN 0 7123 0886 5

Julian Jackson is an internet consultant and writer, specialising in the photographic industry. His website is www.julianjackson.co.uk. He publishes two essential eBooks: Picture Research in a Digital Age 2 and Internet Marketing for Photographers, which are available from his website: http://www.julianjackson.co.uk


Future R&D for Digital Asset Preservation

DPC Forum with Industry
5th June 2002. Prospect House, York Road, London SE1 7AW

Since its inception the DPC has aimed to gain industry awareness of its key messages and of the future needs and opportunities that lie ahead. This forum is part of that process. During the day representatives from the private and public sector will be speaking. They will address long-term trends and the research and development issues involved in the implementation of continuing access and preservation strategies by industry and government. Issues covered will include preserving TV and broadcast archives and research and development in the public and private sector.

Meeting Report

Introducing the first Digital Preservation Coalition (DPC) Forum focussed on developing a dialogue with industry, DPC Secretary Neil Beagrie welcomed guests and members of the DPC. He then placed the question of digital preservation, and the opportunities for industry participants, firmly within an international context. The US National Digital Information Infrastructure and Preservation Program (NDIIPP), a $175m national programme; the National Science Foundation Cyberinfrastructure initiative; the Information Society Technologies programme of the EU 6th Framework; and, in the UK, the Research Grid and the work of the DPC will be central to bringing political, technical and organisational impetus to bear on the challenges of digital preservation. With the accelerating development of digital content, the issue of maintaining long-term access will be of concern to an increasing range of sectors and individuals.

Forging strategic alliances with industry in the context of preservation was a vital component of any initiative in this area, said Neil Beagrie. He outlined how the DPC's constitution was shaping its emerging links with industry. Two important principles governed the DPC's work: firstly, the DPC will support the development of standards and generic approaches to digital preservation, which can be implemented by a range of hardware, software and service vendors; in short, he continued, the goals of the DPC are vendor-neutral. The second principle is that the DPC is a coalition of not-for-profit organisations including industry associations; it is committed to promoting and disseminating information so that all can learn from the transferable lessons and outcomes. The DPC is actively interested in broadening its industry links including those to individual companies. There is a major potential role for industry, concluded Mr Beagrie; this event therefore marked an important step in realising that potential.

Philip Lord, formerly of GlaxoSmithKline and now a digital archiving consultant, spoke of his own experience in industry and of some of the major drivers for industry in the field of digital preservation. He particularly emphasised the importance of the US Food and Drug Administration (FDA) regulation 21 CFR Part 11 on the maintenance of good electronic records for the pharmaceutical industry. Legal and regulatory issues were, he said, clearly of major importance, as were the various voluntary drivers, such as contractual and IP obligations, operational-efficiency considerations and the need to preserve for future reuse. Rarely, he said, were records preserved for historical or sentimental reasons.

Mr Lord talked of the many challenges that faced industry in this area: the heterogeneity of data sources and systems, geographical dispersion (spanning legal and regulatory jurisdictions), the lack of suitable preservation systems and services, management issues, cost, and the lack of expertise in this area. Mr Lord reported that little progress had been made so far within the sector, although a few companies, he reported, were leading the way.

David Ryan from the Public Record Office reported that digital preservation was a core part of the PRO's e-business strategy. Research and development in this area was focussing, he said, on proprietary format migration, emulation and simulation, open-format export and migration, and product reviews. The importance to users of interacting with preserved digital information, rather than simply using that information as an historical record (as is more likely the case with printed data), was emphasised, and this gave an added importance to the work of the PRO in its e-preservation activities. Mr Ryan concluded by saying that there was an overriding need for the archives community to establish credibility with the key stakeholders in e-preservation, including industry, the media and the general public. The community as a whole needed to prioritise so that the most urgent tasks and challenges were tackled first.

The film and sound archives of the BBC contain some 1.75m film and videotape items and around 800,000 radio recordings from the late 1940s onwards. A ten-year preservation strategy had, Adrian Williams from the BBC reported, recently been approved and would cost around £60m. A key part of this strategy was a programme of digitisation for both access and preservation. He reported on the European Commission Presto project, which involved 10 partners, lasted 24 months and cost around 4.8 million Euros. Findings from this project suggested, for example, that digitisation and mass storage is about 50% more expensive, but is expected to double usage of an asset; and moreover, that the value of an item must be four times the preservation cost to be financially viable. He concluded by suggesting that Europe requires a "dedicated preservation factory" given the scale of the task facing national broadcast archives. There was substantial audience interest in the approaches to cost and business modelling described in the Presto project.

Julian Jackson noted that in the headlong rush to put photographic images into digital form, little thought seemed to have been given to the problem of the longevity of digital files. There is an assumption that they will last, but that is now under question. He addressed general issues surrounding preservation and obsolescence in digital images, surveyed the techniques of refreshing, migration and emulation, and emphasised the crucial role that metadata and metadata standards have to play in these preservation processes.

Paul Wheatley of the CAMiLEON project spoke of some of the practicalities of digital preservation and emphasised the need for long-term strategies. Existing methods have many drawbacks. Mr Wheatley described advanced techniques of data migration which can be used to support preservation more accurately and cost-effectively.

To ensure that preserved works can be rendered on computer systems over time, "traditional migration" has been used to convert data into current formats. As the existing format becomes obsolete, another conversion is performed, and so on. Traditional migration has many inherent problems, as errors introduced during one transformation propagate through all future transformations. Mr Wheatley described how the CAMiLEON project had developed new approaches to extending software longevity ("C--") which had been applied in experiments and demonstrated improvements over traditional migration. This new approach is named "Migration on Request".

Migration on Request shifts the burden of preservation onto a single tool, which is maintained over time. Always returning to the original format enables potential errors to be significantly reduced. Mr Wheatley also described how preservation-quality emulators were being produced and how strategies of migration on request and/or emulation were being applied.
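The difference between chained migration and Migration on Request can be shown with a toy example. The two "formats" below, and the characters they cannot represent, are invented purely for illustration; the point is that a loss introduced by an intermediate format is carried forward forever by chained migration, but never arises when conversion always starts from the original:

```python
def to_format_b(text: str) -> str:
    """Hypothetical intermediate format B, which cannot represent 'é'."""
    return text.replace("é", "?")

def to_format_c(text: str) -> str:
    """Hypothetical current format C, which cannot represent 'ü'
    but handles 'é' perfectly well."""
    return text.replace("ü", "?")

original = "café Zürich"

# Traditional migration: each generation is converted from the previous
# one, so the damage done by format B is inherited by format C.
chained = to_format_c(to_format_b(original))   # "caf? Z?rich"

# Migration on Request: the original is kept, and one maintained tool
# converts straight from it to the current format, so only format C's
# own limitation applies.
on_request = to_format_c(original)             # "café Z?rich"
```

The chained copy has lost both accented characters; the on-request copy has lost only the one the current format genuinely cannot hold.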

The need for public-private partnerships in the field of digital preservation is crucial, said David Bowen of Audata Ltd. He went on to outline what industry was currently doing in this field - e-mail, document and database migration, as well as promoting standards - while software suppliers are also improving backward compatibility (Word, WordPerfect, RTF) and increasingly adopting and promoting standards themselves. Mr Bowen called for R&D partnerships, like the Testbed Digitale Bewaring in the Netherlands, which is leading to the sharing of results and advice, and to sound record-creation and metadata practices. Particularly important, he concluded, was the need for software suppliers to be brought into the growing public-private partnerships that are developing.

The final session of the day was a discussion session. Key themes that emerged included:

  • the importance of archiving software and technical documentation. It was felt by participants from all sectors that this is a major gap and that there was an urgent need to develop appropriate repositories;
  • the need to develop case studies and tools for modelling costs. It was felt this is a major area that should be covered in a future DPC forum;
  • the necessity of developing national funding for the preservation of intangible heritage assets. It was noted that there is no "Superfund" or legislation which allows digital heritage to be gifted in lieu of tax to, or purchased by, the nation;
  • further work by the Digital Preservation Coalition to establish contacts with industry and to build on the dialogue commenced at the forum.

It was felt that it was important, as David Bowen had said, to include software and hardware suppliers in future developments, as their actions could be crucial, in particular in providing the tools and products for the end-to-end solutions which were needed in this area. Once again the importance of using both migration and emulation strategies was emphasised, as was the question of the criteria for choosing what should be preserved; we are not in a position to judge easily what will be in demand in the future. Therefore sampling could be of crucial importance as one strand in our overall strategy.

Some delegates from industry felt that there were gaps of responsibility between the organisations, and that it was therefore important for the DPC to coordinate and facilitate activities in this area.

The forum ended on a note of optimism that the first steps in the dialogue with industry had been taken, and with a number of concrete action points which Lynne Brindley, Chair of the DPC, promised would be followed up in the coming months.

End of Meeting Report

Programme and Presentations

10.30 - 11.00  Registration and coffee
   
11.00 - 11.10 Welcome and Introduction
11.10 - 11.35 Keynote Address - "Trends and Future Opportunities" (PDF 92KB) Neil Beagrie JISC
11.35 - 12.00 Preserving digital records in Industry (PDF 258KB) Philip Lord ex GlaxoSmithKline
12.00 - 12.30 Preserving digital records in Government (PDF 71KB) David Ryan Public Records Office
   
12.30 - 1.30 Lunch
   
1.30 - 1.55 Preserving TV and Broadcast Archives (PDF 449KB) Adrian Williams BBC
1.55 - 2.15 Preserving Digital and Historic Images Julian Jackson Internet Consultant and Writer Picture Research Association;
Digital File Longevity
2.15 - 2.35 The Camileon and Cedars Research projects (PDF 500KB) Paul Wheatley Leeds University
   
2.35 - 3.05 Coffee
   
3.05 - 3.30 Practical Experiences of Preservation: R&D partnerships in the private and public sector (PDF 397KB) David Bowen Audata Ltd
3.30 - 4.10 Discussion
4.10 - 4.30 Concluding Address - Lynne Brindley Chief Executive British Library

Web-archiving: managing and archiving online documents and records

Web sites are an increasingly important part of each institution's digital assets and of this country's information and cultural heritage. This event, organised by the Digital Preservation Coalition (DPC), brought together key organisations in the field of web archiving in order to assess the needs of organisations to archive their own and others' web sites, to highlight good practice, and to influence the wider debate about digital preservation.

Meeting Report

This meeting report provides a short summary of the DPC Members Forum on web archiving held on 25th March 2002. Individual PowerPoint presentations from each of the speakers are available below.

Web sites are an increasingly important part of each institution's digital assets and of this country's information and cultural heritage; as such, their management and archiving is an issue which UK organisations need to be increasingly aware of.

Neil Beagrie, Programme Director for digital preservation at JISC and Secretary of the DPC, began the day's proceedings by welcoming delegates to the event, the first event on web archiving to be organised by the DPC. He stressed the importance of the issue.

The first speaker, Catherine Redfern from the Public Record Office (PRO), provided a short general introduction to web-archiving. Web sites are records, and as such need to be managed and archived. However, selection was necessary too, said Ms Redfern. But what are the criteria to be employed in such a process of selection? And how important is capturing the 'experience' of using a web site, given that the look and feel of a site are an intrinsic part of the record? It was important, concluded Ms Redfern, to accept that perfect solutions do not exist, and that flexibility may mean different solutions for different web sites.

Brian Kelly of UKOLN followed and covered the size of the UK web domain and the nature of UK websites. He emphasised the sheer scale of the challenge by looking at definitions and measurements of the UK web space. A number of different approaches by organisations came up with different measurements, but a figure of 3 million public web servers which contained .uk within their URLs was given by Netcraft. In 2001 OCLC's Web Characterization Project suggested the UK accounted for 3% of the sites on the WWW. Searches using AltaVista further suggested that UK websites might contain around 25 million pages. Preserving web sites which we are unable to count will prove particularly difficult, he said, but perhaps the most important question was: at what rate is the UK web space growing?

Brian Kelly then went on to describe issues encountered during work on the UK webwatch and a pilot study to explore archiving issues for project websites funded under the JISC's eLib programme. He also described the Internet Archive (www.archive.org/) which is building historical collections of web content. He concluded that measuring the size of the UK web is difficult but the experiences of web harvesting robot developers and web indexers will provide valuable information for archiving the UK web.

Comparisons with other international situations are important in this context, and Julien Masanes from the Bibliotheque nationale de France (BnF), gave the French perspective on these questions. In France the Government is currently in the process of modifying the law regarding legal deposit of online content. Masanes explored the issue of archiving the "deep web" generated by databases and mechanisms for improving the current generation of web harvesters. The BnF is currently researching the best way to manage procedures of selection, transfer and preservation, which could be applied on a large scale within the framework of the proposed new law. Two large-scale projects are proposed as part of this ongoing research. The first one has begun and involves sites related to the presidential and parliamentary elections that will take place in Spring 2002 in France. More than 300 sites have already been selected and the BnF collects about 30 Gb per week. The second project will be a global harvesting of the '.fr' domain in June.

If the sheer scale of the amount to be archived presents a major challenge, it is one that the BBC, with a million pages on its web site, each regularly being updated, faces as a matter of course. Cathy Smith, New Media Archivist at the BBC, spoke about modernising the BBC archive to include its digital content and the huge logistical and legal problems that this can involve. The BBC's Charter responsibilities mean that it must archive its content, while its message boards, live chat forums, etc. mean that Data Protection becomes a serious issue in this context too. Multi-media content, often created through non-standard production processes, adds further problems, while proposals to extend the period within which the public can make formal complaints from one year to three years have important consequences for the amount that will need to be archived. Ms Smith talked of the need to change perceptions from archiving to media management and for more pre-production emphasis on generating metadata and considering future re-use. She also emphasised the fact that the archive needs to recreate the look and feel of the original record, since this is an important aspect of what the BBC does.

A number of short reports from DPC members followed in the afternoon focussing on current initiatives and pilot projects. Stephen Bury of the British Library spoke of the BL Domain UK pilot project, the capture of 100 websites, and some of the criteria used by the BL in its current archiving activities, given the lack of legal deposit requirements. These criteria include topicality, and reflecting a representative cross-section of subject areas. Metrics of sites captured were provided, for example only 10% were "Bobby" compliant. Future developments would include scaling up the project and international and national collaborations.

Stephen Bailey, Electronic Records Manager for the Joint Information Systems Committee (JISC), spoke of the JISC's efforts to implement its own recommendations in electronic records management and its current project of redesigning its own web site. The archive module of the new website will allow for identification and retention of key pages and documents and will also allow a greater degree of functionality for end users. Centralised control of the web records' lifecycle will allow for greater uniformity but will place demands on content providers. Future developments will include working in partnership on long-term preservation with the national archives and libraries and looking at preservation of the distributed JISC-funded project websites.

Steve Bordwell of the National Archives of Scotland asked whether we should even be attempting to preserve web sites in terms of look and feel, and whether we should rather be focussing on their content. He discussed their first work in the field, archiving snapshots of the Scottish Parliament website and the "Suckler Cow Premium scheme", a website based on an Oracle database with active server pages. They cannot preserve the whole application for the Suckler Cow site but will capture and preserve the dataset and use screencams to preserve the look and feel.

David Ryan of the PRO looked at the project to preserve the No. 10 web site from election day June 2001, and asked what an acceptable level of capture and functionality might be in terms of archiving and preservation procedures. Kevin Ashley of the University of London Computer Centre (ULCC), suggested that we need to think what the purpose of websites is precisely and what their significant properties are in order to formulate criteria for selection, capture, and preservation.

Robert Kiley spoke about the joint Wellcome Trust/JISC web archiving feasibility study and the specific part of this that is looking at archiving the medical Web. He emphasised what we are in danger of losing if no action is taken. Once again, the sheer volume of the medical Web presents significant problems for selection: quality would be one criterion, but how should we judge quality? In addition, many databases are published only electronically, while discussion lists and e-mail correspondence are also potentially of immense importance to future generations of researchers. Are the next generation of Watson and Crick already communicating electronically via a public email forum, and will this survive? He outlined key issues to be addressed by the consultancy, including copyright, costs and the maintenance of any medical web archive.

Concluding discussion on the future way forward for the UK emphasised the value of sharing current approaches and technical developments on web archiving both internationally and within the UK. There are still many technical challenges including the preservation of database driven sites and the need for better tools for harvesting and archiving webpages. It was recognised that the scale of the task in the UK was significant and would require careful selection of sites as well as collaboration between organisations, to address it effectively. The DPC would be setting up further individual meetings between members to advance discussions initiated at the forum and to develop plans for scaling up current pilot activities.

End of Meeting Report
Presentations

Session 1

Web-archiving: an introduction to the issues (PDF 17KB) Catherine Redfern PRO (based on MA research)

Developing a French web archive (PDF 115KB) Julien Masanes Bibliotheque Nationale de France

The UK domain and UK websites (PDF 292KB) Brian Kelly UKOLN

Archiving the BBC website (PDF 15KB) Cathy Smith BBC

Session 2

Members' contributions

Stephen Bury (British Library) (PDF 17KB)

Steve Bailey (Joint Information Systems Committee) (PDF 16KB)

David Ryan (Public Record Office) (PDF 201KB)

Kevin Ashley (University of London Computer Centre) (PDF 10KB)

Robert Kiley (Wellcome Trust Library) (PDF 127KB)

Steve Bordwell (National Archives of Scotland) (PDF 45KB)


Digital Preservation Coalition Launch at House of Commons

Added on 27 February 2002

Press Release Number Two - 27th February 2002

Coalition launches at House of Commons to secure the future of digital material


The Digital Preservation Coalition (DPC) announced an action plan to ensure that the digital information we are producing is not lost to current and future generations. The key messages were:


Memorandum of Understanding with National Library of Australia

Added on 1 December 2001

National Library of Australia and DPC sign Memorandum of Understanding

The DPC and NLA have signed a memorandum of understanding to work collaboratively on digital preservation activities and dissemination. PADI is a highly recommended international gateway for digital preservation developed by the National Library of Australia (NLA) with the guidance of an international advisory group.

The DPC and NLA jointly compile a quarterly 'What's New in Digital Preservation' electronic digest of selected new items added to PADI and to the JISCmail Digital-Preservation list.

In addition to pointing to padiforum-l, PADI's own discussion list for the exchange of news and ideas about digital preservation issues, PADI will also provide a direct link to the Digital-Preservation list archive from its News and Discussion area.

Read More

DPC Signs memorandum with the NPO

Added on 22 October 2001

DPC signs Memorandum with the National Preservation Office

The aim of the National Preservation Office is to work in the broadest possible partnership to ensure a planned approach to preservation management for long term access to collections throughout the United Kingdom and Ireland. To achieve this it works with preservation practitioners, collection managers, and in partnership with a broad range of organisations engaged in collection care.

The Digital Preservation Coalition and the National Preservation Office recognise their complementary roles and common interests in ensuring long term access to digital resources and collections. A Memorandum of Understanding has been drawn up in order to provide a cohesive and independent focus for their specific and joint activities.

Read More

Digital Curation: digital archives, libraries, and e-science

This was an invitational seminar sponsored by the Digital Preservation Coalition and the British National Space Centre. The seminar aimed to raise the profile of the Open Archival Information System Reference Model (OAIS) standard in the UK and share practical experience of digital curation in the digital library sector, archives, and e-sciences.

Presentations

Digital Curation: digital archives, libraries and e-science - report by Philip Pothen JISC

A report on an invitational seminar held on the 19th October at 75-79 York Rd, and sponsored by the Digital Preservation Coalition and the British National Space Centre

Neil Beagrie (JISC Digital Preservation Focus and Secretary, Digital Preservation Coalition) welcomed the guests to the seminar and outlined the main objectives of the event; it would attempt, he suggested, to

  • raise the profile of relevant UK and international standards and "hands-on" initiatives in the UK;
  • show their application in the sciences, libraries and archives, and in the educational sector and beyond;
  • illustrate their role in securing and promoting access to the digital outputs of research and other activities for current and future generations.

Three developments have been key to the timing and organisation of this event:

  • the imminent approval of the Open Archival Information Systems (OAIS) Reference Model as an ISO standard;
  • the launch of the Digital Preservation Coalition (DPC), a major cross-sectoral coalition of over 15 major organisations in the field;
  • and the development of the e-science programme to develop the research grid.

Session 1. The OAIS Reference Model and digital archive certification

Lou Reich of NASA spoke about how NASA and the Consultative Committee for Space Data Systems (CCSDS) had been central to the development of the OAIS Reference Model, but how they had ensured widespread consultation and cooperation with the archive community, both in the US and internationally. The resulting model had therefore been developed as an 'open' and public model and was already being widely adopted as a starting point in digital preservation efforts. Lou Reich explained that a new version of the OAIS Reference Model was delivered to the ISO and CCSDS Secretariats in July 2001 for a two-month public review period, and that a final standard should be produced in late 2001.

Lou Reich provided a summary of some of the main characteristics of the OAIS Reference Model: it can be applied to all digital archives, and their Producers and Consumers; it identifies a minimum set of responsibilities for an archive to claim it is an OAIS; it establishes common terms and concepts for comparing implementations, but does not specify an implementation; and it provides detailed models of both archival functions and archival information. Dr Reich concluded by outlining some of the use and implementation efforts of the OAIS Model by the science, library and archive communities, including the Networked European Deposit Library (NEDLIB), the National Library of Australia, CEDARS, NSSDC (National Space Science Data Center), the National Archives and Records Administration (NARA), and others.

Bruce Ambacher of the US National Archives and Records Administration spoke about the development of OAIS, particularly in the areas of Ingest, Identification and Certification. Through the October 1999 Archival Workshop on Identification, Ingest and Certification (AWIICS), Mr Ambacher was particularly involved in the area of Certification; the AWIICS Certification Working Group developed a preliminary checklist for certification that develops best practices and procedures for each aspect of the OAIS model, including legal issues, mission plans, compliance with relevant regulations, relationships with data providers, ingest procedures, data fidelity and life-cycle maintenance. The workshop also acknowledged that the full range of best practices was not yet in place.

Mr Ambacher summarised the current initiatives that are pointing the way towards developing a suite of approaches that will support certification, including the InterPARES reports, the RLG and OCLC report Attributes of a Trusted Digital Repository for Digital Materials, the Global Electronic Records Association, the JISC's standards and guidelines for the DNER, and many others. 

Robin Dale of RLG spoke on the RLG and OCLC report Attributes of a Trusted Digital Repository for Digital Materials. She re-emphasised the importance of certification as a key component of a trusted digital repository; self-assessment, she said, will not always be adequate. There is a need, therefore, for certification practices to be formalised and made explicit. The AWIICS draft report had suggested the need for an official certifying body, for identifying the attributes to be measured, and for defining the conditions under which certification could be revoked. But many questions still remained to be answered, including who would sit on such a body, who would set it up, and which stakeholders would be represented on it.

David Ryan of the UK Public Record Office spoke about some of the challenges facing the PRO in terms of preservation of public records, technical obsolescence and physical deterioration. Once again he emphasised the importance of collaborating over these issues with a range of interested parties - scientists, users, archivists and technical staff. Mr Ryan outlined the PRO mandate to store and make available comprehensive 'born digital' public records and how its activities in this area were a core part of the PRO e-business strategy.

Part of the PRO's efforts in "hands-on" e-preservation would be the development of a database, PRONOM, to support a technology watch function. This would aim to identify and warn of impending changes to hardware and software and of risks of obsolescence.

The point was made that users need to interact with preserved digital information, rather than simply use it as an historical record (as is more likely the case with printed data), and this gave an added importance to the work of the PRO in its e-preservation activities. The PRO is therefore collaborating not only with Government bodies, but is also a founder member of the DPC. Appraisal is crucial to preservation activities, said Mr Ryan, and this means talking to users ('talking to the cook and not the chef') in a meaningful and collaborative way.

An interesting discussion followed on the relative costs of printed and digital storage. There was vigorous debate over whether preservation of digital materials would actually be more expensive than that of original materials in other media. The duty of care and costs associated with traditional special collections and important archives were cited. Since the costs of computer storage are diminishing constantly, the costs of digital preservation would relate primarily to the staff effort required for long-term preservation. The degree of automation which could be implemented for future migrations and preservation efforts would therefore be critical to relative and absolute costs over the long term. It was argued that issues such as appraisal and migration represented ongoing costs. It would be easy to underestimate the costs of long-term digital preservation where it depended on human intervention and perhaps could not be scaled across collections. Other issues discussed at this stage were the importance of metadata to long-term preservation, and the question of selection, particularly pertinent to the PRO and its efforts to secure the preservation of the UK public record.

Session 2. Data Curation and the Grid

Professor Tony Hey, Director of the UK e-science programme, began by stating that e-science is about global collaboration in key areas of science and the next generation of infrastructure that will enable it. He quoted John Taylor, Director General of the Research Councils, who said that 'e-science will change the dynamic of the way science is undertaken.' NASA's Information Power Grid has promoted a revolution in how NASA addresses large-scale science and engineering problems by providing a persistent infrastructure for 'highly capable' computing and data management services. The Grid, by providing pervasive, dependable and inexpensive access to advanced computational capabilities, will provide the infrastructure for e-science, said Professor Hey.

The UK e-science initiative represents £120m worth of funding over the next three years to provide next generation IT infrastructure to support e-science and business, £75m of which is for Grid applications in all areas of science and engineering, £10m for the supercomputer and £35m for the Grid middleware. It uses SuperJANET and all e-science centres will donate resources to form a UK National Grid. Professor Hey outlined some of the projects being funded under this programme, including the Comb-e-Chem project which will integrate structure and property data sources within a knowledge environment to find new chemical compounds with desirable properties; My Grid, and the PPARC e-science projects, such as GridPP and AstroGrid.

Peter Dukes from the Medical Research Council outlined the overall scope of the MRC's programme as well as its current work to develop an archiving policy. The MRC has a number of data sets, including genome databases, genetic databases and population databases. It has made a considerable investment in its population studies databases, which has meant that a management policy governing both archiving and access has been crucial. Outlining some of the access issues, such as rights and ownership, consent and ethics, Peter Dukes detailed some of the central concerns in the next stage of the MRC's development of an archiving policy, such as the use of case studies to examine current practices, skills and costs, the potential for a specialised data centre, as well as the specific issues of metadata, ownership and the management of access. It was clear that the research Grid would provide tremendous opportunities for advancing science, and that work on research data policies and practice would also help unlock the potential of the Grid for collaborative scientific research.

David Boyd of the CLRC e-science Centre looked at how the Grid can help with some of the problems involved in scientific data curation: a rapidly increasing capability to generate data in many different formats in the physical and life sciences, the increasingly expensive facilities needed to generate this data, the irreplaceability of much of the data, and the increasing need for access on a global scale.

The problems of data curation in this environment are immense too: the question of who owns the data is often unclear, which can mean that responsibilities for curation are unclear, with the result that curation often does not get done; there is a lack of a culture of data curation, and a lack of recognition in some quarters that data is a significant long-term asset which requires investment. The Grid can help with some of these problems by offering a mechanism for control and access (through authentication and authorisation, etc.), by making the location and existence of data more visible, by providing easier access to data and by facilitating distributed collaborative working through shared data. Dr Boyd concluded by outlining some of the CLRC's current activities, including the development of a portal for accessing scientific data, the operation of a large-scale data archive, and participation in global standards activities.

Paul Jeffreys, Director of the Oxford e-Science Centre, spoke about its recent launch, and the management structures that will enable managerial, technical and user concerns to be integrated within the activities of the Centre. He spoke too about the Oxford-wide collaboration that the centre is involved in, such as the work with the Oxford Digital Library, the Oxford Text Archive and Humbul. Although, Dr Jeffreys said, global science is driving the initiative, the interest is much wider, and these areas of collaboration suggest that the centre will become a core part of the University's life.

A discussion followed in which it was asked whether, given the nature of the collaboration that Paul Jeffreys had outlined, there was a place for greater cooperation between the Grid and the Arts and Humanities community, in particular the Arts and Humanities Research Board. Researchers in the Arts and Humanities do not generate the vast amounts of data that those in the sciences do, but they have need of different data types (video, for example) and a need to overcome certain traditional cultural barriers to the use of digital information. There was therefore a need for their involvement in wider developments.

Another key issue that came up in this discussion session was the need to look at data policies, archival models, and how to incentivise the submission of primary research in digital form with appropriate metadata. Ideas put forward included: financial incentives (perhaps linking part of the research grant funding to archiving, as has already been done by some research councils); increasing and enhancing recognition of the value of digital resources in general among the research and scholarly community; and persuading funding councils, the RAE and publishers to take these matters more seriously and to build such considerations into their funding and reward processes. An interesting example was given of linkage between primary data and published articles, and how this had provided the incentive for researchers to complete and submit the primary research archive to a high standard. It was also recognised that project funding, while valuable in targeting research on current needs and tight deliverables, tended to ignore long-term data needs and the infrastructure to support them. This was something the research sector would need to address to ensure an appropriate balance.

Session 3. Curation of Digital Collections

Maggie Jones and Derek Sergeant from the CEDARS project, funded by JISC, explained how the project had so far delivered the CEDARS Demonstrator Archive as well as the CEDARS Preservation Metadata outline specification, both based on the OAIS Reference Model, which has been a significant influence on the project and to whose development the project had, in fact, contributed. Plans for the forthcoming extension year include a major redesign of the CEDARS web site, the production of a series of guidance documents and the hosting of an invitation-only workshop in early 2002, which would involve all of the organisations that CEDARS has been collaborating with. Some of the lessons of the project were outlined, including the centrality of metadata to the preservation of resources, and the increasing consensus that is emerging about standards.

Deborah Woodyard, Digital Preservation Coordinator at the British Library, outlined the BL's main activities in the area of digital curation and preservation. At the moment the BL's digital collecting was on the basis of voluntary deposit, along with purchases made by the BL, as well as digital resources created by the BL itself. Among its main priorities, Deborah Woodyard said, were ensuring improved coverage of the UK's National Published Archive, increasing the collection of digital materials and continuing to collaborate with other major players in the field. There was also a major consideration, in keeping with the BL's policy in other areas: to make the library's digital collections more accessible to users.

The BL's Digital Library Store aims to support the storage and long-term preservation of digital materials. It is based on the OAIS standard which, Deborah Woodyard said, had helped the understanding of the information objects that were being preserved, clarified functionality and had helped provide a common language for its curation and preservation activities. Once again, collaboration was seen to be central to these activities, with the BL being involved with a number of key organisations and projects in this field including the DPC, the National Library of the Netherlands, the OCLC-RLG Preservation Metadata and Attributes of a Trusted Digital Repository Working Groups, and the CEDARS project.

Kevin Ashley of the University of London Computing Centre (ULCC) and the National Digital Archive of Datasets (NDAD) spoke about ULCC's role under contract to the PRO and others for digital preservation, and their practical experience of running a digital preservation service. He also covered the history of mass storage of information, and of the different forms of archival resources. He noted that most discussion centres on digital forms of preservation metadata, but that it is important to recognise that some metadata exists in non-digital form - manuals, specifications, an individual's expertise - and that this too is an important consideration in the preservation of materials. Another important issue was the role of third parties such as data creators and departments, who have different agendas and priorities, and how this impacts on preservation.

Kevin Ashley also spoke about the OAIS model; its advantages were clear, he said, in that it eases procurement of hardware and software, interworking with compliant systems, and migration tasks, but there are question marks about interworking with traditional repositories, as well as its working with mixed-mode models, questions which will need to be looked at closely in the future.

A discussion followed on the potential value and limitations of the OAIS model. Its value in the early stages of system design and development was recognised, but at the same time it does not provide detailed implementations and practice. Documenting and sharing practical experience in this area will be vital.

The difficulties and importance for archives of working with disparate data creators and departments, in either industry or the public sector, were also discussed. It was recognised that this would often require a cultural change process and outreach to work with data creators. The increasing role of spatial data and Geographical Information Systems (GIS) in organisations was cited as one factor giving increasing prominence to corporate datasets, archiving and standards.

Session 4. The Way Forward

In the final session of the day, Neil Beagrie, David Giaretta and Tony Hey reflected on the seminar and the ways forward.

David Giaretta (Rutherford Appleton Laboratory/BNSC and chair of the CCSDS panel developing the OAIS model) noted that the next international CCSDS meeting, which would discuss OAIS and archive certification, was being held in Toulouse the following week. He would report on the UK seminar and its discussion. He felt the seminar had been exceptionally valuable and that it would be important to continue the momentum and progress it had achieved. It was also important to continue co-ordination across sectors, and the Coalition could be immensely valuable in achieving this. Tony Hey suggested collaborating with the e-science institute in Edinburgh to arrange further follow-on meetings.

Neil Beagrie highlighted a number of additional areas. He welcomed the presence at the seminar of his colleague Louise Edwards from JISC, who was working on primary research data in the JISC Collections Policy. Clearly the use of data by the sector, and closer involvement between JISC and the research councils to support users of the Grid and primary research data, would be important. The creation of a JISC research committee chaired by Tony Hey could clearly have an important role in this area. He also wished to thank everyone for their contribution, and felt sure that members of the Digital Preservation Coalition would be willing to initiate and participate in future seminars. He hoped to see close links with e-science and a growing membership of the DPC amongst the research councils and data centres.

In developing digital archive certification, it was suggested, we may need a two-track process: some rapid prototyping and implementations (e.g. the RLG/OCLC attributes work or the JISC information environment) alongside an evolving standards process - hopefully getting both practice and theory into a feedback loop.

The topic, touched on during the seminar, of linkage between "digital libraries", data curation and preservation research, and the Grid was reviewed. Tony Hey noted he was open to discussion of possible projects in data curation, or indeed other areas raised during the seminar, but it was important to note the current need for industry involvement and funding in such proposals.

Finally David Giaretta noted a meeting report was being prepared and would be made available on the Web shortly with the speakers' presentations. The seminar was concluded by all participants thanking the speakers and the DPC and BNSC for an extremely lively and stimulating day on a key topic of cross-sectoral interest.

Philip Pothen JISC. © JISC 2001.

End of Meeting Report

Participant Evaluation of Sessions (RTF 5KB)

Presentations

Welcome and Introduction (RTF 9KB) - Neil Beagrie (JISC)

Session 1: OAIS and Digital Archive Certification
The OAIS Reference Model Lou Reich (NASA) (PDF 140KB)
Panel Presentations
Bruce Ambacher (NARA) (PDF 86KB)
Robin Dale (Research Libraries Group) (PDF 23KB)
David Ryan (UK Public Record Office) (PDF 125KB)

Session 2: Data Curation and the Grid
E-science and the Research Grid (PDF 684KB) - Tony Hey (Office of Science and Technology/EPSRC)
Panel Presentations
Peter Dukes (Medical Research Council) (PDF 56KB)
David Boyd (Central Laboratory of the Research Councils) (PDF 28KB)
Paul Jeffreys (Oxford University) (PDF 819KB)

Session 3: Curation of Digital Collections
Derek Sergeant & Maggie Jones (CEDARS Project) (PDF 259KB)
Deb Woodyard (British Library) (PDF 163KB)
Kevin Ashley (University of London Computing Centre) (PDF 19KB)

Delegate List (RTF 11KB)

Read More

DPC Advent Calendar

Ho Ho Ho! 'Tis the start of the holiday season and this year, the DPC team has created a very special Advent Calendar ....

Read More

Introducing the Levels of Born-Digital Access

Shira Peltzman, Jessica Venlet & Brian Dietz

Last updated on 4 November 2020

By Brian Dietz (Digital Program Librarian for Special Collections, NC State University Libraries), Shira Peltzman (Digital Archivist, UCLA Library) and Jessica Venlet (Assistant University Archivist for Digital Records and Records Management, UNC at Chapel Hill University Libraries)


The decisions facing those who work with born-digital archival materials are myriad. While it has become increasingly easy to find technical processing workflows and lists of handy tools, documentation and guidance on exactly how to provide access to our born-digital collections has lagged behind in our collective conversations.

Over the last several years, a few like-minded efforts in the U.S. to tackle this common challenge coalesced into the DLF Born-Digital Access Working Group (BDAWG) in 2017. Among other things, the group set out to explore the questions: What if the Levels of Digital Preservation included access?  Does access have to be an all or nothing choice? What are some of the key considerations - technical or policy - that make access to born-digital materials possible?

Read More
