
Joint DPC/DCC Forum - Policies for Digital Curation and Preservation

The Digital Preservation Coalition (DPC) and Digital Curation Centre (DCC) delivered a two-day workshop to explore the range of policies required to manage, preserve, and reuse the information held within digital repositories over time. The event was co-sponsored by the Oxford Internet Institute (OII) and held at Wolfson College, University of Oxford, on 3 and 4 July 2006.

Developing and implementing a range of policies is vital for enabling the effective management, discovery, and re-usability of information held within digital repositories. This workshop provided concrete examples of the range and nature of the policies required and shared real-life experiences in implementing these policies through a series of case studies and panel discussions.

Monday July 3rd, 2006

Setting the Scene

Session One: This session explored issues including roles and responsibilities in developing policies, relationships with other institutional policies, workflow issues, key themes of specific policies, and problems encountered during development.

Tuesday July 4th, 2006

Session Two: Implementing Curation and Preservation Policies

Session Three: Evaluating and Reviewing Curation and Preservation Policies


DPC Forum on Web Archiving

The DPC held a one-day web archiving forum at the British Library. The first DPC web archiving forum was held in 2002 to promote the need to archive websites given their increasing importance as information and cultural resources.

Four years on, this event again brought together key web archiving initiatives and provided a chance to review progress in the field. The day provided an in-depth picture of the UK Web Archiving project as well as European initiatives. Technical solutions and legal issues were examined and the presentations encouraged much debate and discussion around different strategies and methodologies. The event made clear that the field has moved on tremendously from four years ago. The debate has broadened and so have the tools and methodologies.

The first presentation was from Philip Beresford, Web Archiving Project Manager at the British Library [BL]. He spoke about the BL's involvement with UKWAC, the tools the project had built, the challenges encountered with the PANDAS software, and the overall constraints of web archiving, especially as it is such a technology-dependent discipline. Philip also outlined the Web Curator Tool developed with the National Library of New Zealand, and the next version of PANDAS. UKWAC - the first two years [PDF 33 KB]

Adrian Brown, Head of Digital Preservation at the National Archives, followed Philip's talk by outlining the future of UKWAC and its recent evaluation report. Adrian described the collection methods at the National Archives, as well as database preservation and transactional archiving. He also touched on a rather overlooked aspect: the long-term preservation of the actual content. Collecting and preserving web content [PDF 401KB]

John Tuck spoke in the second session about the BL's legal deposit bill. He touched on issues regarding the collection, capture, preservation of, and access to non-print collections. Of particular interest is how the legal deposit bill translates to the e-environment and web archiving: should web archiving extend to UK-related sites, not just UK-domain sites, and are national boundaries less relevant now? He outlined the BL's two strategies: taking a twice-yearly snapshot of the entire UK web, and a more selective approach to sites deemed to be of national and cultural interest. He also stressed the lengthy permissions process that gathering each website entails. Collecting, selecting and legal deposit [PDF 42KB]

Andrew Charlesworth highlighted the complexity of the UK legal framework around web archiving. An emerging theme throughout the day was the debate about whether archives should ask for permission before or after they collect websites. Andrew stressed the importance of understanding the regulatory framework: the field has moved on in that we know more today about the risks and benefits of web archiving than we did a few years ago. Any web archiving project probably needs to carry out a risk analysis and to hold insurance, particularly with regard to defamation law, and should ensure that the archive holds nothing that could be used as legal evidence. Archiving of internet resources: the legal issues revisited [PDF 33KB]

Julien Masanes spoke about the European Web Archive. He presented an interesting approach to web archiving: the information architecture of the web is such that archiving should follow the web's natural structure. Julien reminded the audience that web content is already digital and readily processable, and that the web is cross-linkable and cross-compatible, a good foundation for an archive. He also stressed that web archiving requires functional collaboration: what is needed is a mutualisation of resources which combines competences and skills. Internet preservation: current situation and perspectives for the future [PDF 530KB]
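Masanes's point that archiving can follow the web's own structure is, in essence, what crawl-based harvesting does: start from seed URLs and let the links define the collection. As a rough illustration only (not the European Web Archive's or any project's actual harvester), a minimal breadth-first crawl might look like the Python sketch below; the seed URL and scope domain are placeholders.

```python
# Minimal breadth-first harvester sketch: follows the web's own link
# structure outwards from a seed. Illustrative only; production crawlers
# add politeness delays, robots.txt handling, archival container output.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def harvest(seed, scope_domain, limit=50):
    """Breadth-first crawl of pages whose host ends with scope_domain."""
    queue, seen, archive = deque([seed]), {seed}, {}
    while queue and len(archive) < limit:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # unreachable or malformed URLs are simply skipped here
        archive[url] = html  # a real archive would write container files
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)
            host = urlparse(absolute).hostname or ""
            if host.endswith(scope_domain) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive

# pages = harvest("http://www.example.ac.uk/", "example.ac.uk")
```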

Paul Bevan outlined the UKWAC project to archive the 2005 UK general election sites. He described how three national libraries collaborated on this web archive, touching on selection, the collection remit for each library, and the frequency of snapshots. Did the general election classify as an event or as a known instance? Paul stressed the difficulties involved in obtaining permissions to archive electoral websites and in identifying candidates' websites. On a technical level, the slowness of the gathering engine was also highlighted. Archiving the 2005 UK General Election [PDF 129KB]

Catherine Lupovici of the International Internet Preservation Consortium [IIPC] described the activities of the IIPC and the life-cycle tools the team are working on, such as ingest and access tools. She stressed the importance of collaboration in web archiving, and it is clear that both UKWAC and the IIPC do this successfully. IIPC activity: standards and tools for domain scale archiving [PDF 149KB]

The panel session was most productive. The panel leaders stressed that these are still the early days of web archiving: we can never be fully sure that the techniques employed are correct, but we have to make a start, and more research needs to be carried out into techniques for preserving the actual content. Access issues are also critical; searching a web archive will not employ the same search and retrieval tools as a traditional archive, and crucial access tools need to be developed for web archives to be used successfully.

On a technical note, we need to be aware of issues of browser compatibility in the future; there was a debate about whether obtaining the source code of browsers is an acceptable way to assist in rendering pages in years to come. It was also noted that unknown plugins could hinder the readability of web pages. The importance of the ingest stage was stressed, along with the transformation of the digital object that should occur at this point to ensure readability. However, there may be legal issues to consider in transforming from one format to another.

Web archiving is not an isolated activity - so many web formats are now available, as well as different content delivery mechanisms such as blogs and chat rooms, all of which make archiving even more challenging. There was a recognition that the community needs smarter tools to make web archiving scalable; there is a definite need to semi-automate quality assurance and selection. The question was raised whether we still need manual, selective archiving, which is time-consuming and costly compared to automatic sweeping of the web. The general consensus was that both methodologies should still be employed. The overall conclusion and recurring theme of the day was that collaboration is essential: no single organisation can carry out web archiving on its own. Projects such as UKWAC, the IIPC and the European Web Archive demonstrate that much can be achieved in terms of solutions and methodologies.


DPC Briefing on OAIS

The DPC held a briefing day on the OAIS model on 4th April 2006 at the York Innovation Centre. The purpose of the day was to examine the model and provide an informal but in-depth look at the practical application of the Open Archival Information System [OAIS] model in various UK institutions. OAIS is a high-level reference model which describes the functions of a digital archive and has been used as the basis for a number of digital archiving repositories. It is now a recognised and highly prominent international standard.

There were four presentations in total, all of which presented interesting case-studies and examples of OAIS implementation in a variety of institutions, giving a valuable overview of how it has been interpreted and applied.

Najla Semple, Digital Preservation Coalition, began the day with an overview of OAIS and her experience of implementing the model at Edinburgh University Library in 2002 (Overview of OAIS PDF 1MB). She gave a summary of the pilot project and how she used the model to digitally archive the online University Calendar. Each of the six OAIS processes was examined and used as part of the archival workflow. She also gave an overview of the detailed OAIS metadata scheme that was implemented.
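For readers new to the model, the six OAIS functional entities referred to above are easy to enumerate; the one-line summaries below are a generic gloss on the standard, not the Edinburgh pilot's specific workflow.

```python
# The six functional entities of the OAIS reference model. How each is
# realised is deliberately left open by the standard; these summaries are
# a generic gloss, not any particular institution's implementation.
OAIS_ENTITIES = {
    "Ingest":                "accept SIPs from producers and prepare AIPs",
    "Archival Storage":      "store, maintain, and retrieve AIPs",
    "Data Management":       "maintain descriptive metadata and databases",
    "Administration":        "day-to-day operation and policy",
    "Preservation Planning": "monitor the environment, recommend action",
    "Access":                "deliver DIPs to the Designated Community",
}

for entity, role in OAIS_ENTITIES.items():
    print(f"{entity:>22}: {role}")
```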

Jen Mitcham, Curatorial Officer at the Archaeology Data Service [ADS], presented next (Working with OAIS PDF 2.6MB). Her approach to OAIS differed from Edinburgh University Library's, as the ADS already has a digital archive up and running and has applied the model to it retrospectively, an interesting and practical way to approach the model. Her talk identified the areas of her organisation to which the model could be applied, illustrated with photographs of the actual staff involved in each of the OAIS processes. The issue of registration and access to online archives was debated.

Andrew Wilson, Preservation Services and Projects Manager at AHDS, spoke in the afternoon (Sherpa-DP and OAIS PDF 300KB) about the use of OAIS in the Sherpa DP project http://ahds.ac.uk/about/projects/sherpa-dp/. The project is using a disaggregated model to implement OAIS across university-based institutional repositories, which he indicated will share an AHDS preservation repository. He then posed the question 'What does OAIS compliance mean?', an interesting one for institutions setting up their own archives. He touched on the OAIS audit process developed by RLG and what this will mean for future implementation of the model. A certification process might lead to the assumption that the model has to be implemented in a certain prescriptive way, which perhaps goes against the 'open' spirit of OAIS: some of the processes are 'deliberately vague', so how one applies them should not be set in stone. This issue initiated much lively debate amongst the delegates.

The final presentation of the day was a joint effort by Hilary Beedham of the UK Data Archive and Matt Palmer of the National Archives (Mapping to the OAIS PDF 500KB). They gave an interesting insight into two archives that are both assessing their existing organisational structures against the OAIS model. Interestingly, both arrived at similar conclusions and found certain shortcomings in OAIS. Two areas they struggled with were management of the Dissemination Information Package and the metadata model, which they thought could be made more detailed to include access controls and IPR concerns.

Matt also pointed out that it is fairly easy to be compliant with OAIS, as most of the functions and processes are core to any digital archive. Both the TNA and UKDA Designated Communities are wide-ranging, and they suggested that the model may assume a homogeneous user community. This point was disputed by the audience: the Designated Community is a very important feature of OAIS, and establishing who you are preserving the information for is crucial. The Representation Information metadata field assumes that you will include an appropriately detailed technical description according to who will read the data in the future.

Hilary Beedham concentrated on their recently published report, 'Assessment of UKDA and TNA Compliance with OAIS and METS Standards' http://www.jisc.ac.uk/uploaded_documents/oaismets.pdf. The JISC-funded report was written partly to help regional county councils apply the model, and to simplify it.

The discussion at the end of the day proved very fruitful, and overall conclusions were as follows:

  • It was really useful to have real-life examples and case studies.
  • OAIS vocabulary and terminology is now recognised as valuable across a range of institutions, and OAIS-compliant repositories should enable a 'seamless transfer' of data between archives.
  • While the model may be vague in its prescriptions, it certainly indicates what to think about when setting out to create a digital archive.
  • One delegate suggested that the starting point should be to look at your own organisation first, analyse the processes involved, and apply the OAIS processes accordingly.
  • A practical guide to setting up an OAIS repository would be very useful, especially one which addressed different communities and organisation-specific interpretations. Such a guide could ideally take the form of an 'OAIS-lite'.

DPC Meeting on Preservation Metadata

The Digital Preservation Coalition has commissioned a series of Technology Watch reports on themes known to be of key interest to DPC members. The authors of the Technology Watch Report on Preservation Metadata (PDF 209KB), Brian Lavoie, OCLC, and Richard Gartner, University of Oxford, agreed to lead an informal meeting of DPC members, many of whom have an active interest in this area.

Attendance was open to DPC members only and there was no charge. Numbers were limited to a maximum of thirty, to allow scope for interaction.

An overview of the meeting is provided by Michael Day's presentation (PDF 121KB).

Presentations in the meeting:


Report for the DCC/DPC Workshop on Cost Models for preserving digital assets

The DCC/DPC joint Workshop on Cost Models for preserving digital assets was held at the British Library on 26th July 2005 and was the first joint workshop between the two organisations. Around seventy delegates from the UK, Europe, and the US were treated to a rich and stimulating day of information and discussion on costs and business models, with a number of key themes emerging.

Maggie Jones gave the welcome and introduction on behalf of Lynne Brindley, emphasising the need to discover not just how much it costs to preserve X digital objects over time, but also the implications of inaction and the strategic drivers which would motivate institutions to invest in digital preservation and curation. Laurie Hunter provided the keynote address (PDF 16KB) and set the scene by placing digital preservation within the wider context of an organisation's business strategy. The keynote stressed the need to understand not just the costs but also the value of digital preservation, and referred to the balanced scorecard as one tool which can be adapted for use in the digital preservation environment and which the eSPIDA project is investigating further.

James Currall identified major obstacles to progress, including a very poor understanding of digital preservation issues among senior managers and creators, and discussed some of the tools being developed by eSPIDA (PDF 182KB) to help counteract those obstacles. Once again, the strategic direction of the organisation was noted as being of critical importance. The eSPIDA approach to the balanced scorecard placed the information asset at the centre, with the other perspectives (customer, internal business process, innovation and development) tending to feed into the financial perspective. Currall noted that, while this was being applied within the University of Glasgow, the same principles can be applied anywhere.

Paul Ayris and James Watson gave a presentation describing the LIFE project (PDF 164KB), which, like eSPIDA, has been funded under the JISC 4/04 programme. The LIFE project is a collaboration between UCL and the British Library. Paul Ayris described the context and drivers for the project, which for UCL are the management of e-journals and the strategic issue of moving from print to e-journals. The BL needed additional information to help it manage multiple digital collections, acquired through voluntary and legal deposit or created in-house, and to maintain them in perpetuity. James Watson described the work to date in developing a generic lifecycle model which can be applied to all digital objects; the project also hopes to identify cost reductions and potential efficiencies. The major findings of this one-year project would be announced at a conference at the BL, in association with LIBER, on 12 December 2005.

The next sessions focussed on practical case studies. Anne Kenney described the work at Cornell (PDF 198KB) on identifying the costs associated with taking on Paul Ginsparg's arXiv. A quote from a Victor Mature movie, "If we had some horses, we'd have a cavalry - if we had some men", seemed to sum up a common attitude to digital preservation programmes: "we'd have a digital preservation programme, if we had some staff - if we had some content!" Kenney emphasised the importance of getting concrete cost figures, since no senior management will be prepared to write a blank cheque; this reflects Hunter's recommendation in his keynote address that digital preservation proponents speak to senior management in concrete, economic terms. The presentation covered cost centres, which were principally staff costs, and also identified costs needed to support the work but which are often hidden. The arXiv.org archive is highly automated and relatively cheap to maintain, with an estimated cost per submission of between US$1 and US$5. Expenses are minimised by having a gatekeeper function at the beginning and by having most of the cost of ingest borne by the depositor. Kenney also noted that the cost of the server had reduced significantly each year, but cautioned that it is critical to ensure an ongoing annual budget, as it is not possible to skip a year in digital preservation.

The Cornell case study contrasted with the TNA case study (PDF 1.7MB), presented by Adrian Brown. TNA is a publicly funded body with a mandate to preserve selected digital records, so it must deal with a large number of formats, illustrating the implications of organisational role and mission for potential costs. National libraries and archives will need to make different commitments from organisations that are more able to control the material they ingest: while TNA can influence creators, it cannot mandate that it will only accept certain formats. The TNA experience has shown that some elements of cost are well understood while others are known only vaguely at this stage. Brown used the OAIS model to illustrate costs. Ingest costs represent the most substantial portion and have been roughly calculated at £18.76 per file; as automation develops and standards are agreed with creators, these costs may well fall over time. The time and human effort involved in creating metadata records for deposited materials was cited as a potentially high-cost element, and current research into automated metadata extraction could prove extremely beneficial in minimising these costs. Data storage costs are relatively straightforward to prepare, but it is very difficult to predict transfer volumes over the next two years, and therefore difficult to plan longer term, so Preservation Planning is a major cost at the moment as it involves much R&D work. TNA also foresees opportunities to reduce costs through collaboration (not everyone needs to reinvent the wheel) and automation.

Erik Oltmans presented a model developed by the KB (PDF 350KB), in collaboration with the Delft University of Technology, which compares the costs over time of two key digital preservation strategies, emulation and migration. The model is based on the assumption that migration must be applied to every single object in a collection, while emulation does not. The emulation approach seems to work best for collections with very few formats - for example, a large repository of PDF files - but it can become much more costly when a vast range of formats must be emulated. Oltmans conceded that the model may not be entirely realistic, but it provides a useful starting point. The KB experience indicates that, where costs are concerned, volume is less of an issue than the complexity of submissions.
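The shape of the comparison can be sketched in a few lines under the assumptions stated above: migration is paid per object at each technology turnover, emulation per format. All figures below are invented for illustration and are not the KB's numbers.

```python
# Toy emulation-vs-migration comparison under the talk's assumptions:
# migration touches every object each cycle; emulation is paid per format
# each cycle. Every cost constant below is invented for illustration.
def migration_cost(n_objects, cost_per_object, n_cycles):
    return n_objects * cost_per_object * n_cycles

def emulation_cost(n_formats, cost_per_emulator, n_cycles):
    return n_formats * cost_per_emulator * n_cycles

objects, formats, cycles = 10_000_000, 3, 5   # e.g. a large PDF-heavy store
print("migration:", migration_cost(objects, 0.10, cycles))      # 5,000,000.0
print("emulation:", emulation_cost(formats, 50_000.0, cycles))  #   750,000.0

# With few formats, the per-format cost of emulation is quickly amortised;
# with hundreds of formats the comparison reverses, matching the point above.
```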

The afternoon session began with David Giaretta discussing the characteristics of science data (PDF 323KB) and how these dictate the most appropriate and cost-effective strategy. For example, emulation is almost certainly not enough for science data, which is increasingly processed "on the fly", so the archive keeps the raw data and processes it on demand. Issues such as bandwidth are critical (how do you get data into the system, and then how do you get it out?), as is the difference between migrating a file (relatively straightforward) and migrating a collection (much more complex). The costs of keeping information usable would be the most difficult to pin down.

Matthew Addis and Ant Miller gave a joint presentation on PrestoSpace (PrestoSpace Presentation One (PDF 1.8MB) and PrestoSpace Presentation Two (PDF 2.4MB)), an EU-funded project on audio-visual archives. The project began in February 2004, will last for 40 months, and has 35 partner institutions. A key issue for a/v archives is that digital formats are rapidly becoming obsolete, and individual items on a shelf will cause huge logistical problems as they do; once mass storage systems are developed, it becomes imperative to have metadata in order to find and keep track of individual objects. The aim at this stage is to establish a framework for medium-to-large archives. Miller said that there is a need to "scare budget holders into action", but solid numbers are needed to back this up. Addis referred to the urgent need for planning, as "whatever you put your stuff on will be obsolete at some stage." A workflow model which enables decisions to be made on priorities for action was demonstrated; the next stage will be to test how well the model works against existing archives' plans. Copies of the preliminary report were made available at the workshop for those interested in further information, and the DCC and DPC will make the final version available on their websites when it is released later this year.

Andy Rothwell and Richard House provided the final presentation, on costing EDRM programmes (PDF 605KB). Rothwell echoed earlier discussion in indicating that the pre-ingest stage is crucial in driving down costs. It was also necessary to consider the implications of the Government's Modernising Government white paper, which has been a key catalyst in the move from paper to electronic records. Looking at the whole information space, it needs to be understood that only c. 2% of records ultimately end up at TNA, so organisations need to manage the other 98%. The value lies not so much in putting material in as in being able to access it, so search and retrieval capabilities are key. The costs of implementation are not trivial: it can take anywhere from 18 months to 2 years to implement the change in management and to provide the necessary training to staff, costs which are often not considered and can be significant. Another issue is the volatility of the marketplace: a practical example is when EDRM product A is no longer supported and needs to be migrated to EDRM product B. Without tried and tested export facilities, this is not a trivial undertaking, and Rothwell noted that data migration costs are not currently being factored into EDRM programmes. House went on to make the point that the key issue is not replacing paper systems with electronic ones but rather integrating paper and electronic records systems. In terms of costs, staff costs are substantial and classification system design is frequently underestimated.

The workshop concluded with a panel session of all the speakers, chaired by Chris Rusbridge (DCC Director). Questions raised during this session highlighted a range of issues explored during the workshop.

For instance, it will be essential to determine what level of fiscal responsibility content creators and end-users share for the long-term preservation of digital assets. End-users potentially stand to benefit most from the preservation of digital assets and, as such, should be made aware that they may have a role to play in bearing the costs of preservation. Related to this were questions regarding the costs of accessing and retrieving digital assets over time.

The issue of metadata and representation information was raised several times during the panel session. Many participants stressed that without quality contextual information being preserved with the digital asset, there is little to no value in preserving the object. For example, even if a statistical data set is preserved and accessible 100 years after its creation, unless key items such as table headings are defined, the data will be unusable. Users could undertake archaeological processes to try to ascertain the meaning of the table headings, but ultimately they would at best only be able to guess at their true meanings.
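The table-headings example can be made concrete. In the hypothetical fragment below, the rows survive either way, but only the version preserved with its representation information is usable; all names and values are invented.

```python
# Why representation information matters: the same preserved rows, with and
# without their headings. Values and headings are invented for illustration.
rows = [(1971, 55928, 3.4),
        (1981, 56352, 2.9)]

# Preserved WITH representation information: the data is self-explanatory.
headings = ("census_year", "population_thousands", "unemployment_pct")
for row in rows:
    print(dict(zip(headings, row)))

# Preserved WITHOUT it, the bytes still render - but what does 3.4 mean?
for row in rows:
    print(row)
```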

The extent to which digital repositories may dictate acceptable formats for deposit was also discussed during this session. While it is widely acknowledged that most repositories will not have the capability to preserve every format, there was also concern about placing too many constraints on content creators and depositors. As noted during the TNA case study, some organisations will not have the luxury of selecting the formats they accept, due to the very nature of their organisations, though they may be able to influence creators. In other cases, user communities may influence the formats that are deposited: this was the case with arXiv, which did not originally impose restrictions but found that most depositors used LaTeX. This illustrates that identifying preferred formats for deposit does not always come from the managerial level but can be user-driven. Ultimately, a compromise is needed between reducing constraints on creators and depositors and facilitating effective preservation activities over time. Where there are equally viable alternatives, it may be acceptable to suggest one choice of format over another.

Very few repositories will have the capacity to care for every format or will have staff with all the skills needed to carry out preservation activities. Many of the participants felt that sharing resources and skills across a wide range of repositories would be the most logical approach to ensuring long-term preservation. PrestoSpace has investigated the creation of a European market place in which repositories and service providers can benefit from a shared approach. Several participants thought that the DCC and DPC might be able to assist in facilitating such an approach in the UK.

Participants felt that determining the value of preservation itself, rather than simply identifying the costs, will be of paramount importance in securing funding for digital preservation activity. This reflects suggestions made by several of the speakers. For instance, Richard House argued that it will be crucial for organisations to identify potential benefits that are appreciated not only by senior management but also by their stakeholders. It was acknowledged by several participants that a given stakeholder community may change over time and, as such, identifying benefits could be quite a difficult task.

It is highly unlikely that repositories will be able to accept and care for everything that is offered to them. Accordingly, sound appraisal and selection processes must be established within organisations to determine exactly what they will and will not preserve. Again, an organisational mission statement can be very useful in selecting and appraising digital assets for preservation. Selection and appraisal policies may change over time as the organisation changes. As such, periodic review of these documents will be necessary. Indeed, such changes may result in holdings within the repository no longer fitting in with the overall organisational mission. Therefore, some type of de-accessioning or disposal policy must be taken into consideration.

Many of the questions highlighted that, as yet, we have very few concrete answers. As such, much more work must be done in determining useable cost models, in identifying practical benefits, and establishing the value of digital preservation. The DCC and the DPC are currently looking into making available the spreadsheets for the cost models presented at this event via our web sites. We will also endeavour to monitor the progress of current projects and to report major findings as they are released.


Report on IS&T Archiving 2005 Conference, Washington, 26-29 April 2005


By Hugh Campbell, PRONI

1. I attended the Imaging Science & Technology (IS&T) Archiving 2005 conference at the Washington Hilton. This is my report on the conference.

2. Washington is quite a long way away – home to hotel took about 20 hours, and hotel to home about 18 hours. This needs to be borne in mind when planning travel to such a conference and the return to work - the body needs time to recover.

3. The conference itself started on Tuesday, 26 April with a number of tutorials. I attended the Long-Term Archiving of Digital Images tutorial – see attached summary. The conference proper ran from Wednesday 27 April to Friday 29 April, kicking off at 0830 each morning and finishing at 1700 on Wednesday and Thursday and 1500 on Friday. Wednesday featured a 40-minute keynote address and fifteen 20-minute sessions; Thursday featured a 40-minute keynote address, ten 20-minute sessions, and approximately twenty 90-second poster previews followed by the opportunity to visit the poster presentations. Friday featured a 40-minute keynote address and ten 20-minute sessions. I felt that there were too many sessions, cramming too much into a short space of time.


Report on the DPC Meeting on the large-scale archival storage of digital objects

The DPC Meeting on Mass Storage Systems was held in York on 22nd April. The meeting was open to DPC members only and was intended as an informal discussion of mass storage systems, structured around the latest DPC Technology Watch report, Large Scale Archival Storage, authored by four members of the DOM team at the British Library. Richard Masters, Sean Martin, Jim Linden, and Roderic Parker led discussion of the decision-making and planning which led to the development of their storage system. The presentation slides (PDF 433KB) for the meeting are available.

The presentation on the storage system covered the importance of having a clear mission statement for the DOM Programme, and the pragmatic decision to adopt a generic, cost-effective, and incremental approach. Major drivers for the programme were discussed, including legal and voluntary deposit; Richard Masters referred to the e-journal pilot being undertaken with volunteer publishers to test how legally deposited e-journals will be delivered to the BL. Other categories of material include the BL's digitised collections, sound archive, web archiving, and Ordnance Survey material, comprising both a large volume of digital material and a wide variety of formats.

While the decision was to purchase off-the-shelf products wherever possible, it had not been possible to purchase a storage system which met all of the BL's requirements. Principles which needed to be considered included the need for material to be invariant over time (which proved to be a fundamental difference with many commercial approaches); the need to assign an internal, unique identifier; the need to ensure that there would be no extended loss of service; and the need to ensure both integrity and authenticity. The latter needs to be more than simply checking that a file hasn't changed, and the team had conducted a key generation ceremony to meet this condition. This provides a trust model which ensures that a bit-stream remains unchanged after decades, despite changes of hardware during that timeframe.
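At its simplest, the invariance check described above reduces to recording a cryptographic digest at ingest and re-verifying it over the years; the key generation ceremony then anchors a chain of trust over those digests. The sketch below shows only the basic fixity comparison, not the DOM system's actual mechanism, and the paths and identifiers are placeholders.

```python
# Minimal fixity check: record a digest at ingest, verify it later. The
# BL's trust model goes further (keys generated in a witnessed ceremony so
# the recorded digests can themselves be trusted across decades); this
# sketch shows only the basic byte-level integrity comparison.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so arbitrarily large objects can be checked."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, recorded_digest):
    """True if the stored object is byte-identical to what was ingested."""
    return sha256_of(path) == recorded_digest

# At ingest:   manifest["obj-0001"] = sha256_of("store/obj-0001")
# Years later: assert verify("store/obj-0001", manifest["obj-0001"])
```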

Resilience of the system will be provided by having multiple sites (initially one at Boston Spa and one at St Pancras), which can currently hold 12TB of storage, plus a third "dark archive" to be held in another location. The multiple-site design provides disaster tolerance by enabling the service to continue despite the loss of a storage site. The role of the dark archive is to provide the ability to recreate the DOM store in the extreme case that all sites are destroyed; this would be done by re-ingesting all objects from the dark archive into a new site.

The concept of total cost of ownership was outlined: Jim Linden led the meeting through the elements of total cost, including initial purchase, the cost of operations (where staff costs are significant), data centre costs, and application support and enhancement. It was decided that the performance of commodity storage was adequate for preservation storage. It had also been necessary to identify features that did not add value for the BL's needs: even though several commercial vendors felt these would provide benefits, articulating the BL's specific requirements showed that many of the added extras were not required. Issues still to be considered include emerging technologies, such as the MAID concept of power saving. There are also a number of placeholders for future work; for example, the assumption that the 80/20 rule for accessed material which holds true in the print world also applies in the digital world needs to be tested.
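The total-cost-of-ownership breakdown lends itself to a simple worked example. Only the cost categories below come from the talk; every figure is an invented placeholder to show the arithmetic.

```python
# Total cost of ownership as the sum of the elements discussed in the talk
# (purchase, operations, data centre, application support). All figures are
# invented placeholders; staff-heavy operations dominating is the pattern
# described in the presentation.
annual_costs = {
    "initial purchase (amortised over 5 years)": 40_000,
    "operations (staff costs are significant)":  90_000,
    "data centre (power, space, cooling)":       25_000,
    "application support and enhancement":       30_000,
}
tco_per_year = sum(annual_costs.values())
print(f"TCO per year:   {tco_per_year}")            # 185000
print(f"per TB at 12TB: {tco_per_year / 12:,.0f}")  # 15,417
```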

It was a very informative and stimulating session and I'm grateful to the authors for taking the time to talk through their approach. One suggestion on the feedback forms for additional themes for similar meetings was preservation metadata, and it may be of interest that the next Technology Watch report, on Preservation Metadata, has recently been commissioned from Brian Lavoie of OCLC and Richard Gartner and Michael Popham of the University of Oxford. This report should be ready for peer review in July 2005.


Digital Preservation in Institutional Repositories

The 9th DPC Forum was a collaboration between CURL and the British Library. The theme of institutional repositories was proposed by CURL as very timely: the move from theory to practice is likely to accelerate, requiring more emphasis on sustainability and on lessons learned from the practical experience of early adopters. Clifford Lynch's observation in a recent RLG DigiNews, 'An institutional repository needs to be a service with continuity behind it... Institutions need to recognize that they are making commitments for the long term.' (Clifford Lynch, 2004, http://www.rlg.org/en/page.php?Page_ID=19481#article0), was used in promoting the Forum, and several presenters used other pertinent Lynch quotes.

Themes emerging from the day were that there are many challenges, but that it is important to continue to gain practical experience and to build on experience and expertise. Some speakers also referred to the current need to provide mediation for depositors of content, though this is not scaleable. Ways of enhancing efficiency include shared tools and services, such as the PRONOM file format registry, and automating parts of the ingest process.

In opening the Forum, Richard Ovenden, Keeper of Special Collections at the Bodleian Library, set the institutional repository scene as one of gradual progression from theory to practice, though uptake has been slow (Introduction PDF 108KB). The purpose of the Forum was to hear from the early adopters and to listen and learn from them. The role and commitment of CURL to institutional repositories and digital preservation was evident at task force level, in individual CURL institutions, and through consortial activity. The DPC's role in setting the digital preservation agenda was now well known, and its training, information exchange, and advice and guidance were valuable assets.

Delegates were referred to the JISC press releases contained in their packs, which provided details of the successful proposals from the recent 4/04 Call on Digital Preservation and Institutional Asset Management, as well as of the forthcoming Repositories programme, which will be the subject of two further calls in 2005 and represents a major step forward and a major investment by JISC.

Session 1 was chaired by Paul Ayris, Director of Library Services, University College London, who introduced the first presentation, 'From ePrints to eSPIDA: Digital Preservation at the University of Glasgow' (PDF 822KB), by William Nixon, Deputy Head of IT Services at the University of Glasgow. A number of questions had been raised by the Glasgow experience, which had started as a pilot service in 2001. Digital preservation was not the primary focus at the outset, as there was no content to preserve, but it was becoming more of an issue and providing the greatest challenge: we need rigorous, robust preservation options if we are to move to the non-print world. William also suggested that this may well prove to be a selling point in encouraging academics to deposit their papers in the repository. In reviewing progress to date, Nixon said that repositories need to move from project funding to being embedded in the bottom line of institutions, so that they can make a stewardship commitment without dependence on project funding and move towards becoming trusted digital repositories.

John MacColl, Sub-Librarian, Digital Library, University of Edinburgh, and Jim Downing, Preservation Development Manager, DSpace@Cambridge, provided two perspectives on DSpace: as the manager of a repository service, and as a developer of the preservation aspects of DSpace. John MacColl drew attention to services arising from project funding which could potentially fall into disrepute unless they are properly managed over time (DSpace MacColl Presentation PDF 655KB). Digital preservation could prove a high cost for individual institutions to undertake, and it might be necessary to make use of other facilities. Advice and guidance are needed by the library community, and Edinburgh would be looking to the DCC as a source of that technical and practical guidance.

Jim Downing described the DSpace@Cambridge repository, in which there are no mandates on type of material or file formats, though the team actively provide advice on good practice (DSpace Downing Presentation PDF 166KB). Better preservation metadata is needed to support preservation planning. Tools which are already available, such as PRONOM, are proving valuable in helping to monitor technological obsolescence. Cambridge have been advised to retain human-readable action plans and to add automation wherever feasible and appropriate, but to retain human validation of automated steps. Currently DSpace@Cambridge records all item and metadata changes, but this will not be scaleable, and it will be necessary to refine policy and implementation.

The final session of the morning was a joint presentation on the Storage Resource Broker (SRB) at the AHDS (SRB Presentation PDF 1.2MB). Hamish James provided an overview of what SRB is and its role at AHDS. The SRB software assists in managing digital objects scattered across multiple locations, a clear benefit for a distributed service such as AHDS, which is moving from a loose federation of repositories to a much more centralised preservation service while still maintaining its distributed nature. The collection is expected to grow to 10 TB within the next two years, so any service must be scaleable. Andrew Speakman then outlined some of the practical issues involved in installing SRB. Andrew drew attention to a frequently recurring theme in any discussion of digital preservation: collaboration, and the need to take advantage of related effort which has already occurred. He also outlined the pros and cons of SRB. The pros include the ability to handle large networked data volumes and high user acceptance; on the negative side, technical support is not well advanced, so significant in-house expertise is required, as SRB is quite complex to install. In concluding, Andrew said that SRB has the potential to simplify day-to-day operations and the distributed management of data, and indicated that the AHDS was looking for partners using SRB.

The afternoon session was chaired by Richard Boulderstone, Director of eStrategy at the British Library, and began with a presentation on the SHERPA project, 'Preserving EPrints: Scaling the Preservation Mountain' (PDF 144KB), presented by Sheila Anderson and co-authored with Stephen Pinfield. Sheila outlined the SHERPA project objectives and partners: Nottingham (lead), Edinburgh, Glasgow, Leeds, Oxford, Sheffield, York, the British Library, and AHDS. SHERPA is primarily concerned with e-prints, i.e. digital duplicates of academic research papers that are made available online as a means of improving access.

Differing views have been expressed on whether it is necessary to preserve these documents, but there is an opportunity here to move beyond saving and rescuing digital objects to building the infrastructure required to manage them from the start. A good start has been made in identifying the properties of e-prints and looking at selection and retention criteria, preferred formats, rights issues, and so on, but none of this is 'doing' preservation. Using the OAIS model as a guide, a preservation storage layer and preservation planning (e.g. policies and procedures, risk assessment) need to be added, with preservation and administration metadata and preservation protocols and processes in place.

A new two-year project, SHERPA DP, led by AHDS in partnership with Nottingham and three to four SHERPA partners and funded under the recent JISC 4/04 Call, has recently been announced. The aim of SHERPA DP is to develop a persistent preservation environment for SHERPA partners based on the OAIS model, and to explore the use of METS for packaging and transferring metadata and content. A Digital Preservation User Guide will be another practical deliverable from the project. The preservation community will be looking to the DCC for support, particularly in functions which are most appropriately centralised, such as technology watch.

The final presentation was from David Ryan, Head of Archives Services and Digital Preservation at the National Archives: 'Delivering digital records: towards a seamless flow'. David described the development of the Digital Archive and the key points needed for its success (TNA Presentation Part 1 PDF 96KB): a strong business case linked to core organisational aims, a good team, and the need to sell the fact that this is not an insuperable problem. It has taken three years for the Digital Archive to become a comprehensive service, and all business targets have been met, but it is critical to recognise that stewardship is a long-term, evolving business. In recruiting staff it was essential to have the right technical skills, combined with the ability to sell the work to others within the organisation (TNA Presentation Part 2 PDF 90KB). The reality is that we must collect e-records. The Digital Archive should be scaleable to 100TB, which is well beyond current storage requirements, though these are growing rapidly (TNA Presentation Part 3 PDF 1MB). TNA works with government departments, but the current procedures, which tend to be case-by-case and handcrafted, are not scaleable (Editor's note: a similar point was made by William Nixon about Glasgow's experience of building their repository). Preservation planning is a key feature of the Digital Archive, which must be able to accommodate changes in preservation management over time; the main thing is to ensure that the bitstream remains unharmed in case a different preservation strategy is adopted (the current strategy is migration).

Other TNA digital preservation effort includes the PRONOM service (TNA Presentation Part 4 PDF 613KB), now on version 4 and designed to be the primary file format registry. PRONOM can be used to inform decisions about migration planning because it can indicate when a file format is likely to become unsupported. The UK Central Government Web Archive has captured c. 60 websites to date and is currently held separately from the Digital Archive, though it is intended to bring the two together; one issue is the size of the government website domain. Finally, the work of NDAD was described, including its role as a contractor for TNA in preserving data sets; next steps include a comparison of the NDAD data model and the Digital Archive data model. In closing, David said that trusted digital repository certification is a key issue, and that a process is needed to allow a federated system of preservation and access.
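The migration-planning role described for PRONOM can be sketched as a registry lookup: flag any holdings whose format loses support within a planning horizon. The registry entries, field names, and identifiers below are all invented for illustration and do not reflect PRONOM's actual records; only the decision logic follows the text above.

```python
# Sketch of registry-driven migration planning: a format registry records
# when support for each format is expected to end, and holdings in those
# formats are flagged for migration. Entries and field names are invented.
from datetime import date

format_registry = {  # hypothetical identifiers and records
    "fmt-hypo-1": {"name": "WordProc 5.0", "supported_until": date(2006, 6, 1)},
    "fmt-hypo-2": {"name": "PageDoc 1.4",  "supported_until": date(2030, 1, 1)},
}

def migration_candidates(holdings, registry, horizon_years=2):
    """Return ids of objects whose format loses support within the horizon."""
    cutoff = date(date.today().year + horizon_years, 12, 31)
    return [obj for obj, fmt in holdings.items()
            if registry[fmt]["supported_until"] <= cutoff]

holdings = {"rec-0001": "fmt-hypo-1", "rec-0002": "fmt-hypo-2"}
print(migration_candidates(holdings, format_registry))  # ['rec-0001']
```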

A final panel session allowed delegates to put questions to all the speakers.


UK Needs Assessment

A UK Needs Assessment was identified as a key priority in the DPC Business Plan for 2003-2006, in order to identify the volume and level of risk and to assign priorities for action. The first stage of this exercise was a survey of DPC members, carried out between August 2003 and March 2004, with a Workshop in November 2003 to discuss preliminary results and determine further action required. The survey form is available below, as are the final report and annexes by Duncan Simpson, who was commissioned to undertake the survey on behalf of the DPC. The report of the Workshop is also available below, together with the PowerPoint slide presented by Duncan Simpson at the Workshop, which indicates the proposed timeframe for the UK Needs Assessment, assuming funding for key initiatives.

Other deliverables from the survey are the map of DPC members, which provides details of each DPC member, their interest in digital preservation, and (where appropriate) the material for which they have undertaken responsibility. The table of DPC member projects was also derived from the survey and will be periodically updated. A related follow-up task was Scenarios of Risks of Data Loss: real-life examples where data was either lost or at risk, provided by some DPC members and collated by Duncan Simpson. This is also available below.

