Open Source and Dynamic Databases


In keeping with the use of the Forums as a means of keeping participants up to date with the latest developments in digital preservation, the 6th DPC Forum focussed on Open Source Software (OSS) and Dynamic Databases, both of which have been the subject of debate and speculation. A stimulating and thought-provoking day began with four presentations on OSS.

Alan Robiette provided a comprehensive, historical overview of the development of OSS and its pros and cons. A key drawback (lack of support) was echoed in other presentations. The early stages of the MIT-Cambridge DSpace collaboration were described in Julie Walker and Anne Murray's presentation. William Nixon discussed the issues in building a network of institutional repositories as part of the DAEDALUS project at the University of Glasgow.

Jo Pettitt described the National Archives' trials and pilots programme, which addresses some of the practical issues they are facing. Open source software testing is a part of this programme. Demonstrations of OCLC's Digital Archive and DSpace were provided during the lunch break.

The afternoon session had three presentations on archiving dynamic databases. Peter Buneman's presentation was on archiving scientific data and referred to the tension between archiving scientific data frequently, which consumes space, and infrequently, which results in delays. Experimental work being undertaken on data structures offers some promise for affordable, persistent scientific archives.

Bryan Lawrence described the role of the British Atmospheric Data Centre, NERC's designated centre for atmospheric data, and referred to the influence of the OAIS model in recognising that data storage is part of a wider picture of consumers and producers, with data repositories acting as facilitators between the two. Finally, Cathy Smith provided a lively update on the evolution of the BBC website and the issues involved in archiving it.

9.30 - 10.00 

Registration and Coffee

10.00 - 10.10

Welcome and Introduction


Session 1 - Open Source and Digital Preservation


10.10 - 10.40 

Open Source & Commercial Software (PDF 194KB)
Alan Robiette, Programme Director, JISC

10.40 - 11.10

The DSpace at Cambridge Project (PDF 547KB)
Anne Murray, Cambridge University and Julie Walker, MIT

11.10 - 11.40

Experiences With E-prints and DSpace (PDF 943KB)
William Nixon, Glasgow University

11.40 - 12.10

The Open Source Evaluation Project at the National Archives (PDF 1.9MB)
Jo Pettitt, National Archives


12.10 - 12.30 



12.30 - 2.00

Lunch and Demonstrations
(DSpace and OCLC Digital Archive)


Session 2 - Approaches to Archiving Dynamic Databases


2.00 - 2.30

Archiving Dynamic Databases (PDF 436KB)
Professor Peter Buneman, Edinburgh University

2.30 - 3.00

Experiences with Archiving Databases in BADC (PDF 2.42MB)
Bryan Lawrence, British Atmospheric Data Centre


3.00 - 3.30



3.30 - 4.00

Ten Years on the Web: Archiving BBCi Online (PDF 632KB)
Cathy Smith, BBC


4.00 - 4.30

Concluding Discussion


Archives: adapting to the digital age

Report on the DPC Forum, Archives: adapting to the digital age

Held at the National Archives, Kew, Wednesday 24th September 2003

Around 40 participants attended the 7th DPC Forum, which was held at the National Archives (TNA), Kew. The Forum was timed to coincide with Archives Awareness month, so it was appropriately held at TNA and focussed on archives in the digital age. It also coincided with the anticipated autumn internet launch of TNA's PRONOM database (http://www.nationalarchives.gov.uk/preservation/) and the UK Central Government Web Archive. Demonstrations of both of these were provided to participants in the afternoon sessions.

Update 26 September 2007: the PRONOM link above is no longer active; information on PRONOM can be found at http://www.nationalarchives.gov.uk/aboutapps/pronom/

Update 03 October 2007: the UK Central Government Web Archive has moved to a new location.

David Thomas, Director, Government and Archive Services at TNA, chaired the Forum and, in his welcome and introduction, noted that TNA was in a process of change but now had "real stuff" to show, as opposed to abstract discussions. The following sessions, which preceded demonstrations of PRONOM and the Digital Archive and tours of TNA, were informative, thoughtful, and stimulated lively discussion.

Session 1: Electronic Records Management

Richard Blake, Head of the TNA's Records Management Advisory Service, placed ERM in a strategic framework. He stressed that this was an issue affecting any business, not just archives: if material is to be held for more than five years, preservation issues will inevitably arise. It is necessary to ensure that records are kept useable, whether they are being kept for twenty years or taken into an archive for permanent retention.

He referred to BS ISO 15489, which is the first international standard for records management. There were, however, some problems in applying this standard in practice, as it doesn't define in enough detail what each of the four key characteristics (authenticity; reliability; integrity; usability) is. The presentation looked at issues associated with each of these four characteristics, and the major theme running through them was the need to ensure that records management systems function within a strong intellectual framework that articulates such details as, for example, what additions and annotations are permissible. It needs to go beyond "just buying the technology" if the authenticity, reliability, and integrity of electronic records are not to be challenged in future. The TNA plays a guidance role, and most of its guidance documents are available from the TNA website. Finally, Richard noted that in such a new and rapidly evolving area, we have to accept that mistakes will be made but we should at least be able to understand why we made them.

Stuart Orr, Assistant Director in the Information and Workplace Strategies Directorate of DTI, provided a useful case study of how ERM was implemented in DTI. The problem was that there had been a practice of increasing devolvement in government departments during the previous government. This had led to difficulties in sharing information and to information being stored in non-standard ways. The Secretary of State for DTI, Patricia Hewitt, recognised the need for improved means of sharing and storing information so that a better service could be provided for those seeking information from DTI. The Matrix project was developed to address this problem and was rolled out across 22 sites within the U.K., with c. 5,000 users. The presentation described what Matrix will and will not provide: for example, it will be expected to support collaborative working but it will not introduce the paperless office. It will also not work unless there is investment of time and effort: people need to input quality information, and this is difficult to control in a devolved environment. Around 60 people worked on the Matrix project, including a full-time communications manager. A step-by-step approach was taken, beginning in May 2000 with plan and prototype, leading on to testing and trialling, and finally to rollout in May 2002. Bringing staff on board and providing training were seen as key elements, and Stuart said that they could have invested even more in training. In terms of long-term preservation, DTI still has a lot of questions. They are looking to TNA for advice and are conscious of the need for caution before investing in preservation infrastructure.

Session 2: Collecting and preserving digital materials

Kevin Schurer, Director of the UK Data Archive and the recently established Economic and Social Data Service, provided a fascinating historical overview of the first 35 years of the UK Data Archive. Kevin noted that the UKDA is not a legal repository in the sense that TNA is, but its service goes well beyond data delivery, so there are synergies between the two. Many changes have occurred in the 35 years since the UKDA was established. The material has diversified so that not only survey data but also, for example, sound recordings and pictures are now included in data collections. The number of users has increased greatly and has doubled over the past few years. Formats have changed, in particular since the mid-1990s. Until then, magnetic tape was the dominant input and dissemination medium. CD-ROMs and web delivery have now become much more prevalent. There were still older forms, such as punched cards, in the collection, and a new punch card reader was purchased recently (though it had been difficult to locate!). While preservation was not seen as an issue when the UKDA was being set up in the 1960s, it has become one because of the emphasis on providing research material to the academic sector, which means that preservation needs to be addressed in order to keep the material useable. The Data Exchange Initiative was seen as a potential bridge between the need to push material out in user-friendly formats and the need to retain data in XML, which makes preservation simpler to manage. Work on the XML schema will commence in early 2004 and will be a two-year project. While the UKDA was working with a limited sub-set of digital information, it must still deal with most of the challenges which legal archives need to address, because of its remit to provide access to research materials.
Kevin provided copies of the UKDA's preservation policy to participants [Note: this will be available from the members' pages on the DPC website in the near future].

David Ryan, Head of Archive Services at TNA, gave the final presentation of the morning. David described the TNA's work on web archiving. There were two broad types of approach to web archiving, selective and harvesting, and David outlined the pros and cons of each before describing the approach being taken by TNA. This was to evaluate a number of technical approaches, develop a selection policy for websites, work with government departments to develop guidance, and develop long-term preservation strategies. The Modernising Government white paper provided the impetus for using the web as a communication mechanism, with its target of all Government services being available online by 2005. The spin-off benefit of this is that preservation and presentation are brought together: people can see both the benefits and the limitations of what current technology can provide. Issues include the size of the domain, estimated at c. 2,500, though this is difficult to track as not all are called .gov.uk; increasingly dynamic (and therefore more complicated) content; copyright; and legal deposit.



9.30  Registration and Coffee
10.00  Introduction and welcome, David Thomas
  Session 1 - Electronic Records Management - Chair - David Thomas
10.15 Electronic Records Management - the role of TNA (PDF 191KB)
Richard Blake
10.45 Introducing ERM at DTI (PDF 1.6MB)
Stuart Orr
11.15 Short break
  Session 2 - Collecting and preserving digital materials - Chair - David Thomas
11.30 The UK Data Archive and the Experience of Digital Preservation (PDF 311KB)
Kevin Schurer
12.00 Collecting Government websites at TNA (PDF 1.3MB)
David Ryan
12.30 Lunch
13.30 Stream 1:  Demonstrations of Digital Archive and PRONOM
by Adrian Brown and Jo Pettitt
  Stream 2:  Tour of TNA led by Kelvin Smith
14.30 Short Break
14.40 Stream 1:  Tour of TNA led by Kelvin Smith
  Stream 2: Demonstrations of the Digital Archive and PRONOM
  By Adrian Brown and Jo Pettitt
15.40 Discussion and final wrap-up
16.00 Close

Digital Preservation: the global context

Report on the DPC Forum held at the British Library Conference Centre, Wednesday 23 June 2004.

The 8th DPC Forum attracted the biggest audience to date for a DPC Forum. Around 100 delegates were kept interested and informed by a very rich programme with presentations from several experts from the U.S and Europe. One key theme running throughout the day was the need for active collaboration at every level and across sectoral and geographic boundaries. Speaker after speaker illustrated how this collaboration was essential. Other consistent messages were the importance of trust (between partners and stakeholders in the emerging technical infrastructure), the need to find effective mechanisms to distribute responsibilities, developing standards and tools and above all, the need to develop and share practical experience.


Speakers from the 8th DPC Forum, Digital Preservation: the global context
L to R
Taylor Surface, OCLC; Robin Dale, RLG; Nancy McGovern, Cornell; Peter Burnhill, DCC; Seamus Ross, HATII;
Eileen Fenton, Ithaka; David Seaman, DLF; Vicky Reich, LOCKSS; Tony Hey, eSCP; Laura Campbell, Library of Congress

Delegates received a sense of the broad range of activities going on, the progress that has been made, and the increasingly compelling need to accelerate progress. Feedback from the Forum, both formal and informal, has been overwhelmingly positive and is indicative of the consistently high quality of the presentations and a stimulating and thought provoking programme.

On the evening before the Forum, there was the presentation of the annual Conservation Awards, which included the inaugural DPC Award for Digital Preservation. This was won by the National Archives, for their Digital Archive. The CAMiLEON project received a special commendation. This event was regarded as another stepping-stone on the way to raising the profile of digital preservation. The Forum was equally important, bringing together people from all over the world, recognising the need for international collaboration, and noting that no one can do this on their own.

Lynne Brindley, Director of the British Library and Chair of the DPC Board, chaired the day and noted in her welcome the importance the DPC placed on international links and the need to ensure that digital preservation issues are increasingly on the political and policy agendas. The DPC is committed to making practical progress and to sharing best practice through its membership.

Ms Brindley introduced the first speaker, Laura Campbell, Associate Director for Strategic Initiatives at the Library of Congress, who provided an up-to-date picture of what the Library of Congress was doing through their NDIIPP (PDF 772KB) (National Digital Information Infrastructure and Preservation Program) program. Ms Campbell described NDIIPP, which developed as a result of a report commissioned by the Library of Congress to assess whether it was prepared for the 21st century. Much experience in digital technology had come from building their Digital Library and they had learned the power of digital surrogates as well as their vulnerability to loss.

The $US175m plan consisted of $5m approved by Congress to produce a plan, $20m upon approval of the plan, and a further $75m contingent on obtaining matching funds of the same amount, which bring the total to $175m. Scenario planning helped show how a distributed effort might operate.

Key lessons and messages of NDIIPP to date include the belief that there will never be a single right way of doing things, so the architecture needs to be sufficiently modular and flexible to take account of this, the need for a distributed and decentralised approach and the need for new tools and technologies. NDIIPP needs to build partnerships and networks and then create a technical infrastructure to support the partners. Partnerships already forged included an alliance with the DPC, helping to establish the International Internet Preservation Consortium (IIPC), business model partnerships such as subscription and archiving services for e-journals, and technical partnerships, taking full advantage of the skilful technical talent which exists.

The next stage of NDIIPP would include testing architectures to support archive ingest and handling. In summing up, Ms Campbell indicated that during the next five years, the intention was to form a range of formal partnerships, encourage standards for digital preservation, establish a governing body, and make recommendations to Congress for funding.

Seamus Ross, Director of HATII (Humanities Advanced Technology and Information Institute), provided an introduction to ERPANET (PDF 1087KB) (Electronic Resource Preservation and Access Network), the European Commission funded project which has brought together partners from Italy, the Netherlands, the U.K. and Switzerland. ERPANET has created a number of resources, organised seminars on several key topics, carried out an analysis of relevant literature and developed other tools, such as business cases for digital preservation and off-the-shelf policy statements. It was stressed that a lot of expertise already exists but there is a pressing need to bring it together and to work together.

Lessons learned were that the digital preservation community needs practical case studies and reports of "real world" experience. Simple tools for costing digital preservation exist but much more work needs to be done here. Guidance on digital repository design is also needed. ERPA E-prints (a repository of digital preservation papers and reports) is growing very slowly and needed to be marketed better. ERPANET have negotiated with the Swiss National Archives to preserve material held in this repository in perpetuity. In summing up, Dr Ross emphasised the great need for knowledge sharing so ERPANET events and DPC Forums were extremely important in helping to raise the level of awareness and understanding.

Presentations from David Seaman (Director of the Digital Library Federation); Robin Dale (Program Officer for the Research Libraries Group); and Taylor Surface (OCLC), described the work their organisations are doing in developing practical, collaborative tools, all of which will play a role in increasing trust in the developing infrastructure for digital preservation.

David Seaman's presentation, 'Towards a Global File Format Registry' (PDF 67KB) described the developing global file format registry, which is responding to an immediate need. The importance and value of linking to other relevant work, such as the National Archives' PRONOM system and the DCC (Digital Curation Centre) in the UK, was also stressed.

The title of Robin Dale's presentation, 'The Devil's in the Details- working towards global consensus for digital repository certification' (PDF 77KB), aptly summarised the challenge of articulating and reaching broad consensus on what elements and what process can be put in place to certify digital repositories against a commonly understood standard.

Taylor Surface described the work of OCLC's Digital Collections and Preservation Services in 'the OCLC Registry of Digital Masters' (PDF 464KB), which arose from a DLF Steering Committee recommendation. Taylor described how the registry linked to OCLC's WorldCat service to provide enhanced discovery, encourage use of standards and limit duplication of effort in digitisation initiatives.

During the lunch break, Lynne Brindley and Laura Campbell signed an agreement between the DPC and the Library of Congress. A poster session on the Digital Curation Centre gave delegates the opportunities to ask specific questions before the afternoon presentations.

The afternoon session began with Tony Hey, Director of the e-Science Core programme. Tony's presentation was 'e-Science - preserving the data deluge' (PDF 543KB). The e-science grid (or cyberinfrastructure, as it's known in the U.S.) has the vision, enunciated by Licklider, of being able to bring together all material throughout the world and build a truly global, collaborative environment which enables researchers to work together regardless of geographic location. Describing the impetus for the development of the DCC, Dr Hey said that over the next 5 years, e-science will produce more scientific data than has been collected in the whole of human history. The goal is to bring together the digital library community with the scientific community so that each can learn from the other.

Peter Burnhill, Interim Director of the DCC, described 'The Digital Curation Centre' (PDF 146KB), which has received funding of £1.3m p.a. from JISC and the e-Science Core programme. The DCC was not a digital repository, he said, but would provide services and research for the community involved in digital preservation. It is still very early days, in that the DCC has only been operational for a few months, but progress has been made. A website has been launched, an e-journal is planned, and focus groups would help to articulate who the user community for the DCC is and what their needs are. It was anticipated that a permanent Director would be in post by the official launch, scheduled for early November.

Nancy McGovern spoke of 'The Cornell Digital Preservation Online Tutorial and Workshop' (PDF 419KB). This is yet another illustration of the pressing need to develop practical support for those already involved in, or about to embark on, digital preservation programmes. It was also another example of the strength of collaboration, as the curriculum had been developed collaboratively and Cornell looked forward to working closely with the DPC, who have been inspired by Cornell's work to develop a similar programme geared towards the U.K. Nancy described the five organisational stages of digital preservation, which are: Acknowledge; Act; Consolidate; Institutionalise; Externalise. Nancy noted that none of these stages can be skipped and that it was essential to realise that there is no on/off switch for digital preservation: it is something which needs to build over time. Cornell has now run four workshops, which have received very positive feedback from participants. All have been oversubscribed, which illustrates the need for intensive training which provides a toolkit to enable participants to adopt practical short-term strategies appropriate to their own institutional settings. A fifth workshop is planned for November 2004.

The final session of the day provided an opportunity to hear two very different approaches to preserving e-journals. Vicky Reich described the 'LOCKSS Program approach' (PDF 804KB), which is applicable to any content available over HTTP and which enables libraries to collect and preserve content in the same way as they do for print. Vicky stressed that LOCKSS preserves the content, not the services publishers provide (e.g. search buttons). LOCKSS has established contact with several publishers, and it is essential to have the cooperation of publishers to allow LOCKSS crawlers to gather their content. Trust was also an issue here - publishers needed to trust that libraries would gather content they have purchased under licence. Key advantages of LOCKSS are its inbuilt redundancy and the ease and low cost of installing it. Vicky stressed that some institutions needed to have large, central repositories as well but this need not preclude the use of LOCKSS.

The final speaker of the day was Eileen Fenton, on 'Preserving e-journals, the JSTOR model' (PDF 58KB). The Electronic Archiving Initiative has involved working with publishers and is focused on preserving the source files. Archiving e-journals requires a significant investment in the development of both organisational and technological infrastructure; it was not either/or. Eileen also described Ithaka, a not-for-profit company supported by Mellon, Hewlett and Niarchos funding. This has the goal of filling gaps not being supplied by the free market. Both Eileen and Vicky agreed that at this nascent stage of development, the community needs multiple approaches.

In closing the Forum, Lynne Brindley thanked all of the speakers for the significant contribution that they had made to the success of the Forum. The next DPC Forum will be a joint DPC/CURL event and will be held on Tuesday 19 October 2004. Further details will be available in the coming months.


Library of Congress and DPC sign agreement

Added on 23 June 2004

DPC signs Memorandum of Understanding with the Library of Congress


The National Digital Information Infrastructure and Preservation Program of the Library of Congress (NDIIPP) was established after Congress gave approval to the Library of Congress to develop the program in December 2000. In January 2004, Congress approved the Library of Congress's plan for NDIIPP, which will enable the Library of Congress to launch the first phase of building a national infrastructure for the collection and long-term preservation of digital content. Funds released will allow testing various technical models for capturing and preserving content.


UK Needs Assessment

A UK Needs Assessment was identified as a key priority in the DPC Business Plan for 2003-2006 in order to identify the volume and level of risk and to assign priorities for action. The first stage of this exercise was a DPC Members survey. This was carried out between August 2003 and March 2004, with a Workshop in November 2003 to discuss preliminary results and determine further action required. The survey form used is available below, as are the final report of the DPC Members survey and its annexes, by Duncan Simpson, who was commissioned to undertake the survey on behalf of the DPC. The report of the Workshop is also available below, and this PowerPoint slide, presented by Duncan Simpson at the Workshop, indicates the proposed timeframe for the UK Needs Assessment, assuming funding for key initiatives.

Other deliverables from the survey are the map of DPC members, which provides details of each DPC member and their interest in digital preservation and (where appropriate) what material they have undertaken responsibility for. The table of DPC Member projects was also derived from the survey, and will be periodically updated. A related follow up task was Scenarios of Risks of Data Loss, real-life examples where data was either lost or at risk, provided by some DPC members and collated by Duncan Simpson. This is available below.


Digital Preservation in Institutional Repositories

The 9th DPC Forum was a collaboration between CURL and the British Library. The theme of institutional repositories was proposed by CURL as being very timely, as the move from theory to practice is likely to accelerate, requiring more emphasis on sustainability and lessons learned from the practical experience of early adopters. Clifford Lynch's quote from a recent RLG DigiNews, 'An institutional repository needs to be a service with continuity behind it... Institutions need to recognize that they are making commitments for the long term.' (Clifford Lynch, 2004, http://www.rlg.org/en/page.php?Page_ID=19481#article0), was used in promoting the Forum, and several presenters used other pertinent Lynch quotes.

Themes emerging from the day were that there were many challenges, but it was important to continue to gain practical experience and build on experience and expertise. Some speakers also referred to the current need to provide mediation for depositors of content but noted that this was not scaleable. Ways and means of enhancing efficiency included shared tools and services, such as the PRONOM file format registry, and automating parts of the ingest processes.

In opening the Forum, Richard Ovenden, Keeper of Special Collections at the Bodleian Library, set the institutional repository scene as one in which there is a gradual progression from theory to practice but uptake has been slow (Introduction PDF 108KB). The purpose of this Forum would be to hear from the early adopters, and listen and learn from them. The role and commitment of CURL to institutional repositories and digital preservation was seen at task force level, in individual CURL institutions, and through consortial activity. The role of the DPC in setting the digital preservation agenda was now well known, and its work in training, information exchange and providing advice and guidance was a valuable asset.

Delegates were referred to the JISC press releases contained in their packs, which provided details of the successful proposals from the recent 4/04 Call on Digital Preservation and Institutional Asset management and also the forthcoming Repositories programme call, which will be the subject of two further calls in 2005 and indicated a major step forward and a major investment by JISC.

Session 1 was chaired by Paul Ayris, Director of Library Services, University College London, who introduced the first presentation by William Nixon, Deputy Head of IT Services at the University of Glasgow who presented a paper 'From ePrints to eSPIDA: Digital Preservation at the University of Glasgow' (PDF 822KB). A number of questions had been raised by the Glasgow experience, which had started as a pilot service in 2001. Digital preservation was not the primary focus as there was no content to preserve, but was becoming more of an issue and providing the greatest challenge. We need rigorous, robust preservation options if we are to move to the non-print world. William also suggested that this may well prove to be a selling point for academics in encouraging them to deposit their papers with the repository. In reviewing progress to date, Nixon said that there was a need to transition from project funding to embedding repositories into the bottom line of institutions so that they can make a stewardship commitment without dependence on project funding and move towards becoming a trusted digital repository.

John MacColl, Sub-Librarian, Digital Library, University of Edinburgh, and Jim Downing, Preservation Development Manager, DSpace@Cambridge, provided two perspectives on DSpace: that of a manager of a repository service, and that of a developer of the preservation aspects of DSpace. John MacColl drew attention to the services arising from project funding which could potentially fall into disrepute unless they are properly managed over time (DSpace MacColl Presentation PDF 655KB). Digital preservation could be regarded as a high cost for individual institutions to undertake and it might be necessary to make use of other facilities. Advice and guidance were needed by the library community, and Edinburgh would be looking to the DCC as a source of that technical and practical guidance.

Jim Downing described the DSpace at Cambridge repository, in which there are no mandates on type of material or file formats but they do actively provide advice on good practice (DSpace Downing Presentation PDF 166KB). Better preservation metadata was needed to support preservation planning. Tools such as PRONOM, which are already available, are proving valuable in helping to monitor technological obsolescence. Cambridge have been advised to retain human-readable action plans and to add automation wherever feasible and appropriate, but to retain human validation of automated steps. Currently DSpace at Cambridge records all item and metadata changes but this would not be scaleable. It would be necessary to refine policy and implementation.

The final session of the morning was a joint presentation on Storage Resource Broker (SRB) at the AHDS (SRB Presentation PDF 1.2MB). Hamish James provided an overview of what SRB is and its role at AHDS. The SRB software assists in managing digital objects scattered around multiple locations, a clear benefit for a distributed service such as AHDS, which was moving from a loose federation of repositories to a much more centralised preservation service, while still maintaining its distributed nature. The collection was expected to grow to 10 TB within the next two years, so any service must be scaleable. Andrew Speakman then outlined some of the practical issues involved in installing SRB. Andrew drew attention to a frequently recurring theme in any discussion of digital preservation, that of collaboration and the need to take advantage of related effort which has already occurred. He also went on to outline the pros and cons of SRB. Pros included the ability to handle large networked data volumes and high user acceptance. On the negative side, technical support is not well advanced, so there is a requirement for significant in-house expertise, as it is quite complex to install. In concluding, Andrew said that SRB has the potential to simplify both day-to-day operations and distributed management of data, and indicated that the AHDS was looking for partners using SRB.

The afternoon session was chaired by Richard Boulderstone, Director eStrategy, the British Library, and began with a presentation 'Preserving EPrints: Scaling the Preservation Mountain' (PDF 144KB) on the SHERPA project, presented by Sheila Anderson and co-authored with Stephen Pinfield. Sheila outlined the SHERPA project objectives and partners: Nottingham (lead), Edinburgh, Glasgow, Leeds, Oxford, Sheffield, York, the British Library, and AHDS. SHERPA is primarily concerned with e-prints, i.e. digital duplicates of academic research papers that are made available online as a means of improving access to the papers.

Differing views have been expressed on whether it is necessary to preserve these documents, but there is an opportunity here to move beyond saving and rescuing digital objects to building the infrastructure required to manage them from the start. A good start has been made in identifying properties of e-prints, looking at selection and retention criteria, preferred formats, rights issues etc., but none of these are 'doing' preservation. Using the OAIS model as a guide, a preservation storage layer and preservation planning (e.g. policies and procedures, risk assessment) need to be added, with preservation and administration metadata and preservation protocols and processes in place.

A new two-year project, known as SHERPA DP, which is being led by AHDS in partnership with Nottingham and 3-4 SHERPA partners and funded under the recent JISC 4/04 Call has recently been announced. The aim of SHERPA DP will be to develop a persistent preservation environment for SHERPA partners based on the OAIS model and to explore the use of METS for packaging and transferring metadata and content. A Digital Preservation User Guide would be another practical deliverable from this project. The preservation community would be looking to the DCC for support, particularly in functions which are most appropriately centralised, such as technology watch.

The final presentation was from David Ryan, Head of Archives Services and Digital Preservation at the National Archives, 'Delivering digital records: towards a seamless flow'. David described the development of the Digital Archive and key points needed for its success (TNA Presentation Part 1 PDF 96KB), which were a strong business case linked to core organisational aims, a good team, and the need to sell the fact that this is not an insuperable problem. It has taken three years for the Digital Archive to become a comprehensive service delivery and all business targets have been met, but it is critical to recognise that stewardship is a long-term evolving business. In recruiting staff it was essential to have the right technical skills, combined with the ability to sell the work to others within the organisation (TNA Presentation Part 2 PDF 90KB). The reality is that we must collect e-records. The Digital Archive should be scaleable to 100TB, which is way beyond current storage requirements, though these are growing rapidly (TNA Presentation Part 3 PDF 1MB). TNA works with government departments but the current procedures, which tend to be case-by-case and handcrafted, were not scaleable (Editor's note: a similar point was made by William Nixon in Glasgow's experience of building their repository). Preservation planning is a key feature of the Digital Archive, which must be able to accommodate changes in preservation management over time. The main thing is to ensure that the bitstream remains unharmed in case a different preservation strategy is adopted (the current strategy is migration). Other TNA digital preservation effort includes the PRONOM service (TNA Presentation Part 4 PDF 613KB), which is now on Version 4 and is designed to be the primary file format registry. PRONOM can be used to help decisions about migration planning because it can indicate when a file format is likely to become unsupported. The UK Central Government Web Archive has captured c. 60 web sites to date and is currently held separately from the Digital Archive but it was intended to bring the two together. An issue is the size of the government website domain. Finally the work of NDAD was described, and their role as contractor for TNA in preserving data sets. Next steps would include a comparison of the NDAD data model and the Digital Archive data model. In closing, David said that trusted digital repository certification was a key issue and there was a need for a process to allow a federated system of preservation and access.

A final panel session allowed delegates to put questions to all the speakers.

Report on the DPC Meeting on the large-scale archival storage of digital objects

The DPC Meeting on Mass Storage Systems was held in York on 22nd April. The meeting was open to DPC members only and was intended to be an informal discussion of mass storage systems, structured around the latest DPC Technology Watch report, Large Scale Archival Storage, authored by four members of the DOM team at the British Library. Richard Masters, Sean Martin, Jim Linden, and Roderic Parker led discussion of the decision-making and planning which led to development of their storage system. The PP slides (in PDF 433KB) for the meeting are available.

The presentation on the storage system included the importance of having a clear mission statement for the DOM Programme, and the pragmatic decision to adopt a generic, cost-effective, and incremental approach. Major drivers for the programme were discussed, including legal and voluntary deposit, and Richard Masters referred to the e-journal pilot being undertaken with volunteer publishers, to test how legally deposited e-journals will be delivered to the BL. Other categories of material include the BL's digitised collections, sound archive, web archiving, and Ordnance Survey material. Together these comprise both a large volume of digital material and a wide variety of formats.

While the decision was to purchase off-the-shelf products wherever possible, it had not been possible to purchase a storage system which met all of the BL's requirements. Principles which needed to be considered included the need for material to be invariant over time (which proved to be a fundamental difference with many commercial approaches); the need to assign an internal, unique identifier; the need to ensure that there would be no extended loss of service; and the need to ensure both integrity and authenticity. The latter needs to be more than simply checking that a file hasn't changed and the team had conducted a key generation ceremony to ensure this condition was met. This provides a trust model which ensures that a bit-stream remains unchanged after decades, despite changes of hardware during that timeframe.
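The distinction drawn above between integrity (the file has not changed) and authenticity (the file is the one the archive itself stored) can be illustrated with a minimal sketch. This is not the BL's implementation, which the report does not describe. The use of SHA-256 and of a keyed HMAC as a standard-library stand-in for a digital signature are assumptions made for illustration, as are all function names.

```python
import hashlib
import hmac


def fixity_digest(data: bytes) -> str:
    """Integrity only: detects change, but anyone can recompute a plain digest."""
    return hashlib.sha256(data).hexdigest()


def authenticity_tag(data: bytes, key: bytes) -> str:
    """Authenticity: a keyed tag can only be produced by a holder of the key."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, stored_digest: str, stored_tag: str) -> bool:
    """Return True only if both the digest and the keyed tag still match."""
    digest_ok = hmac.compare_digest(fixity_digest(data), stored_digest)
    tag_ok = hmac.compare_digest(authenticity_tag(data, key), stored_tag)
    return digest_ok and tag_ok
```

In this sketch, someone who replaces an object and recomputes its plain digest would still fail verification without the key, which is why the way the key is generated and guarded matters.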

Resilience of the system will be provided by having multiple sites (initially there will be one at Boston Spa, one at St Pancras), which can currently hold 12TB of storage, and a third "dark archive" to be held in another location. The multiple site design provides disaster tolerance by enabling the service to continue despite the loss of a storage site. The role of the dark archive is to provide the ability to recreate the DOM store in the extreme case that all sites are destroyed - this would be done by re-ingesting all objects from the dark archive into a new site.

The concept of total cost of ownership was outlined. Jim Linden led the meeting through elements of total cost, including initial purchase, the cost of operations (where staff costs are significant), data centre costs and application support and enhancement. It was decided that performance of commodity storage was adequate for preservation storage. It had been necessary to plan and decide on features that did not add value for the BL's needs. Even though several commercial vendors felt these features would provide benefits, it was necessary to articulate the BL's specific requirements, for which many of the added extras were not required. Issues still needing to be considered were emerging technologies, such as the MAID concept of power saving. There are also a number of placeholders for future work, for example the assumption that the same 80/20 rule for accessed material which holds true in the print world needs to be tested in the digital world.

It was a very informative and stimulating session and I'm grateful to the authors for taking the time to talk through their approach. One suggestion on the feedback forms for additional themes for similar meetings was preservation metadata and it may be of interest that the next Technology Watch report has recently been commissioned from Brian Lavoie of OCLC and Richard Gartner and Michael Popham of the University of Oxford and is on Preservation Metadata. This report should be ready for peer review in July 2005.

Report on IS & T Archiving 2005 Conference, Washington, 26 - 29 April 2005

Sarah Middleton

Last updated on 30 September 2016

By Hugh Campbell, PRONI

1. I attended the Imaging Science & Technology (IS&T) Archiving 2005 conference at the Washington Hilton. This is my report on the conference.

2. Washington is quite a long way away – home to hotel was about 20 hours with hotel to home about 18 hours. This needs to be borne in mind when planning travel to such a conference and return to work - the body needs time to recover.

3. The conference itself started on Tuesday, 26 April with a number of tutorials. I attended the Long-Term Archiving of Digital Images tutorial – see attached summary. The conference proper ran from Wednesday 27 April – Friday 29 April, kicking off at 0830 each morning (and finishing at 1700 on Wednesday and Thursday and 1500 on Friday). Wednesday featured a 40-minute keynote address and 15 20-minute sessions; Thursday featured a 40-minute keynote address, 10 20-minute sessions and approximately 20 90-second poster previews followed by the opportunity to visit the poster presentations. Friday featured a 40-minute keynote address and 10 20-minute sessions. I felt that there were too many sessions, cramming too much into a short space of time.

Report for the DCC/DPC Workshop on Cost Models for preserving digital assets

The DCC/DPC joint Workshop on Cost Models for preserving digital assets was held at the British Library on 26th July, and was the first joint workshop between the two organisations. Around seventy delegates from the UK, Europe, and the US were treated to a rich and stimulating source of information and discussion on costs and business models with a number of key themes emerging.

Maggie Jones gave the welcome and introduction, on behalf of Lynne Brindley, and emphasised the need, not just to discover how much it costs to preserve X digital objects over time, but the implications of inaction and the strategic drivers which would motivate institutions to invest in digital preservation and curation. Laurie Hunter provided the keynote address (PDF 16KB) and set the scene by placing digital preservation within a wider context of the business strategy of an organisation. The keynote stressed that there is a need to understand not just the costs but also the value of digital preservation and referred to the model scorecard as one tool which can be adapted for use in the digital preservation environment and which the eSPIDA project is investigating further.

James Currall referred to major obstacles to progress as including a very poor understanding of digital preservation issues among senior managers and creators and discussed some of the tools being developed by eSPIDA (PDF 182KB) to help counteract those obstacles. Once again, the strategic direction of the organisation was noted as being of critical importance. The eSPIDA approach to the model scorecard placed the information asset at the centre, with the other perspectives (customer, internal business process, innovation and development) tending to feed into the financial perspective. Currall noted that, while this was being applied within the University of Glasgow, the same principles can be applied anywhere.

Paul Ayris and James Watson gave a presentation describing the LIFE project (PDF 164KB), which, like eSPIDA, has been funded under the JISC 4/04 programme. The LIFE project is a collaboration between UCL and the British Library. Paul Ayris described the context for the project, and drivers, which for UCL are the management of e-journals and the strategic issue of moving from print to e-journals. The BL needed additional information to help them manage multiple digital collections, acquired through voluntary and legal deposit, or created by them, and to maintain them in perpetuity. James Watson described the work to date in developing a generic lifecycle model which can be applied to all digital objects. The project also hoped to identify cost reductions and potential efficiencies. The major findings of this one-year project would be announced at a conference at the BL, in association with LIBER, on 12 December 2005.

The next sessions focussed on practical case studies. Anne Kenney described the work at Cornell (PDF 198KB) on identifying the costs associated with taking on Paul Ginsparg's arXiv. A quote from a Victor Mature movie, "If we had some horses, we'd have a cavalry - if we had some men" seemed to appropriately sum up an attitude to digital preservation programmes, "we'd have a digital preservation programme, if we had some staff - if we had some content!". Kenney emphasised the importance of getting concrete cost figures since no senior management will be prepared to write a blank cheque. This reflects the recommendation Hunter made during his keynote address for digital preservation proponents to speak to senior management in concrete, economic terms. The presentation covered cost centers, which were principally staff costs, and also identified costs needed to support the work but which were often hidden. The arXiv.org archive is highly automated and is relatively cheap to maintain, with an estimated submission cost of between US$1-5. Expenses are minimised in this case by having a gatekeeper function at the beginning and having most cost of ingest borne by the depositor. Kenney also noted that the costs of the server had significantly reduced each year but cautioned that it was critical to ensure an ongoing annual budget, as it is not possible to skip a year in digital preservation.

The Cornell case study contrasted with the TNA case study (PDF 1.7MB), presented by Adrian Brown. In this case, TNA is a publicly funded body with a mandate for preserving selected digital records, so it must deal with a large number of formats. This illustrates the implications of organisational role and mission on potential costs. National libraries and archives will need to make different commitments from those of organisations that are more able to control the material they ingest. While TNA can influence creators, they cannot mandate that they will only accept certain formats. The TNA experience has shown that some elements of cost are well understood while for others there is little concrete knowledge at this stage. Brown used the OAIS model to illustrate costs. Ingest costs represent the most substantial portion of costs and have been roughly calculated as £18.76 per file. As developments in automation progress and standards are agreed with creators, these costs may well fall over time. The time and human effort involved in creating metadata records for deposited materials was cited as a potentially high-cost element. Current research into automated metadata extraction could prove extremely beneficial in helping to minimise these costs. Data storage is relatively straightforward to prepare costs for but it is very difficult to predict transfer volumes over the next two years, and therefore difficult to plan longer term. Preservation Planning is a major cost at the moment as it involves much R&D work. TNA also foresees opportunities to reduce costs through collaboration (not everyone needs to reinvent the wheel) and automation.
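As a rough illustration of how a per-file figure scales, the sketch below multiplies the reported £18.76 ingest cost by a number of files. The only figure taken from the presentation is the per-file cost. The function name, the file counts and the optional saving fraction (standing in for possible future reductions through automation) are hypothetical.

```python
from decimal import Decimal

# Rough per-file ingest cost reported in the TNA case study.
INGEST_COST_PER_FILE = Decimal("18.76")


def estimate_ingest_cost(n_files: int, saving: Decimal = Decimal("0")) -> Decimal:
    """Estimate total ingest cost in pounds.

    `saving` is a hypothetical fraction (0 to 1) by which automation might
    reduce the per-file cost; it is not a figure from the presentation.
    """
    if not Decimal("0") <= saving <= Decimal("1"):
        raise ValueError("saving must be between 0 and 1")
    return n_files * INGEST_COST_PER_FILE * (Decimal("1") - saving)
```

Even a modest assumed saving per file becomes significant at scale, which is consistent with the emphasis on automation in the case study.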

Erik Oltmans presented a model developed by the KB (PDF 350KB), in collaboration with the Delft Institute of Technology, which compares costs over time of two key digital preservation strategies, emulation and migration. This is based on the assumption that migration must apply to every single object in a collection, while emulation does not. The emulation approach seems to work best with collections with very few formats, for example a large digital repository of PDF files. However, it can become much more costly when there is a vast range of formats to be emulated. Oltmans conceded that the model may not be entirely realistic but provides a useful starting point. The KB experience indicates that volume is less of an issue for costs than the complexity of submissions.
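The assumption behind the comparison, that migration cost grows with the number of objects while emulation cost grows with the number of formats, can be made concrete with a toy calculation. This is not the KB model itself. The linear formulas, the parameter names and every figure below are hypothetical and serve only to show the shape of the trade-off.

```python
def migration_cost(n_objects: int, cost_per_object: int, n_cycles: int) -> int:
    """Every object is converted again in each obsolescence cycle."""
    return n_objects * cost_per_object * n_cycles


def emulation_cost(n_formats: int, cost_per_emulator: int, n_cycles: int) -> int:
    """One emulator per format is built or renewed in each cycle."""
    return n_formats * cost_per_emulator * n_cycles


def cheaper_strategy(n_objects: int, n_formats: int, cost_per_object: int,
                     cost_per_emulator: int, n_cycles: int) -> str:
    """Name the cheaper strategy under these simplified assumptions."""
    m = migration_cost(n_objects, cost_per_object, n_cycles)
    e = emulation_cost(n_formats, cost_per_emulator, n_cycles)
    return "migration" if m <= e else "emulation"


# Hypothetical collections: one large and uniform, one small and varied.
uniform = cheaper_strategy(1_000_000, 1, 1, 100_000, 3)
varied = cheaper_strategy(10_000, 50, 1, 100_000, 3)
```

With these invented figures the large single-format collection favours emulation and the small multi-format collection favours migration, matching the pattern described in the presentation.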

The afternoon session began with David Giaretta discussing science data characteristics (PDF 323KB) and how these dictate the most appropriate and cost-effective strategy. For example, emulation is almost certainly not enough for science data, which is increasingly processed "on the fly" so the archive keeps the raw data and processes on the fly. Issues such as bandwidth are critical (how do you get data into the system and then how do you get it out?). Other issues are migrating a file (relatively straightforward) and migrating a collection (much more complex). The costs of keeping information useable were those which would be the most difficult.

Matthew Addis and Ant Miller did a joint presentation on PrestoSpace (PrestoSpace Presentation One (PDF 1.8MB) and PrestoSpace Presentation Two (PDF 2.4MB)), an EU-funded project on audio-visual archives. The project began in February 2004 and will last for 40 months, and has 35 partner institutions. A key issue for a/v archives is that digital formats are rapidly becoming obsolete. Individual items on a shelf will cause huge logistical problems as they become obsolete. However once mass storage systems are developed, then it becomes imperative to have metadata in order to find and keep track of individual objects. The aim is to establish a framework for medium-large archives at this stage. Miller said that there is a need to "scare budget holders into action" but solid numbers are needed to back this up. Addis referred to the urgent need for planning as "whatever you put your stuff on will be obsolete at some stage." A workflow model was demonstrated, which enables decisions to be made on priorities for action. The next stage will be to test how well the model works against existing archives' plans. Some copies of the preliminary report were made available at the workshop for those interested in further information. The DCC and DPC will make the final version of this report available on their web sites when it is released later this year.

Andy Rothwell and Richard House provided the final presentation on costing EDRM programmes (PDF 605KB). Rothwell echoed earlier discussion in indicating that the pre-ingest stage is crucial in driving down costs. It was also necessary to look at the implications of the Government's Modernising Government white paper, which has been a key catalyst in moving from paper to electronic records. When looking at the whole information space, it needed to be understood that only c. 2% of records ultimately end up at TNA, so organisations need to manage the other 98%. The value lies not so much in putting material in but in being able to access it, so search and retrieval capabilities are key. The costs of implementation are not trivial, and it can take anywhere from 18 months to 2 years to implement the change in management and to provide the necessary training to staff. These costs are often not considered and can be significant. Another issue to be considered is the volatility of the marketplace. A practical example used was when EDRM product A is no longer supported and needs to be migrated to EDRM product B. Without tried and tested export facilities, this is not a trivial undertaking. Rothwell also noted that data migration costs are not currently being factored into EDRM programmes. House went on to make the point that the key issue is not replacing paper systems with electronic but rather the integration of paper and electronic records systems. In terms of costs, staff costs are substantial and classification system design is frequently underestimated.

The workshop concluded with a panel session of all speakers and was chaired by Chris Rusbridge (DCC Director). Questions raised during this session highlighted a range of issues that were explored during the workshop.

For instance, it will be essential to determine what level of fiscal responsibility content creators and end-users share for the long-term preservation of digital assets. End-users potentially stand to benefit most from the preservation of digital assets and, as such, should be made aware that they may have a role to play in bearing the costs of preservation. Related to this were questions regarding the costs of accessing and retrieving digital assets over time.

The issue of metadata and representation information was raised several times during the panel session. Many participants stressed that without quality contextual information being preserved with the digital asset, there is little to no value in preserving the object. For example, even if a statistical digital data set is preserved and accessible 100 years after its creation, unless key items are defined, such as table headings, the data will be unusable. Users could undertake archaeological processes to try and ascertain the meaning of table headings, but ultimately they would at best only be able to guess at their true meanings.

The extent to which digital repositories may dictate acceptable formats for deposit was also a topic discussed during this session. While it is widely acknowledged that most repositories will not have the capabilities to preserve every format, there was also concern about placing too many constraints on content creators and depositors. As noted during the TNA case study, some organisations will not have the luxury of selecting the formats they will accept due to the very nature of their organisations, though they may be able to influence creators. In other cases, user communities may influence the formats that are deposited within repositories. This was the case with arXiv, which did not originally impose restrictions but found that most depositors used the LaTeX standard. This illustrates that identifying preferred formats for deposit does not always come from the managerial level, but could indeed be user-driven. Ultimately, a compromise is needed between reducing constraints on creators and depositors and facilitating effective preservation activities over time. Where there are equally viable alternatives, it may be acceptable to suggest one choice of format over another.

Very few repositories will have the capacity to care for every format or will have staff with all the skills needed to carry out preservation activities. Many of the participants felt that sharing resources and skills across a wide range of repositories would be the most logical approach to ensuring long-term preservation. PrestoSpace has investigated the creation of a European market place in which repositories and service providers can benefit from a shared approach. Several participants thought that the DCC and DPC might be able to assist in facilitating such an approach in the UK.

Participants felt that determining the value of preservation itself rather than simply identifying the costs will be of paramount importance in securing funding for digital preservation activity. This reflects suggestions made by several of the speakers. For instance, Richard House argued that it will be crucial for organisations to identify potential benefits that are appreciated not only by senior management but also by their stakeholders. It was acknowledged by several participants that a given stakeholder community may change over time and, as such, identifying benefits could be quite a difficult task.

It is highly unlikely that repositories will be able to accept and care for everything that is offered to them. Accordingly, sound appraisal and selection processes must be established within organisations to determine exactly what they will and will not preserve. Again, an organisational mission statement can be very useful in selecting and appraising digital assets for preservation. Selection and appraisal policies may change over time as the organisation changes. As such, periodic review of these documents will be necessary. Indeed, such changes may result in holdings within the repository no longer fitting in with the overall organisational mission. Therefore, some type of de-accessioning or disposal policy must be taken into consideration.

Many of the questions highlighted that, as yet, we have very few concrete answers. As such, much more work must be done in determining useable cost models, in identifying practical benefits, and establishing the value of digital preservation. The DCC and the DPC are currently looking into making available the spreadsheets for the cost models presented at this event via our web sites. We will also endeavour to monitor the progress of current projects and to report major findings as they are released.

Unless otherwise stated, content is shared under CC-BY-NC Licence
