Blog

Unless otherwise stated, content is shared under CC-BY-NC Licence

Introducing the DPC RAM

Jenny Mitcham

Last updated on 20 September 2019

“If you can’t measure it, you can’t control it.”

Martin Robb, National Programme Manager, NDA

 

I’ve heard this phrase several times since starting work on a digital preservation project with the Nuclear Decommissioning Authority here in the UK. Colleagues at the NDA were very keen that, as part of our two-year project with them, we found an appropriate way of measuring where they are now in their digital preservation journey and establishing a clear direction of travel.

Maturity modelling was the obvious answer.

As mentioned in a previous blog post, we didn’t want to re-invent the wheel, so we did some research into the digital preservation maturity models that were available, hoping to find one that was suitable for use in the context of the NDA.

Integrated Preservation Suite (IPS): a scalable preservation planning toolset for diverse digital collections

Peter May

Last updated on 16 September 2019

Peter May is the British Library’s Digital Preservation Technical Architect


Preservation planning is a long-established function in digital preservation. Its purpose is to ensure that digital content can move forwards through time for future users without suffering unacceptable loss, either to intellectual content or functionality. Many different activities support preservation planning, and at the British Library this has included collection profiling, format sustainability assessments, defining digital preservation policy, content sampling, and preservation risk modelling. These activities have led to an excellent understanding of what is needed to preserve our digital content and the risks that are likely to manifest.

Missing from this picture, however, was the ability for us to put this knowledge into practice in an automated manner so that technical risks can be effectively and efficiently mitigated, at scale, and across all the collections. Our approach, formalised in our Integrated Preservation Suite (IPS) project, is our developing solution to this challenge.

How to correctly identify the file type of a text file from its contents?

Santhilata Kuppili Venkata

Last updated on 13 September 2019

Dr Santhilata Kuppili Venkata is Digital Preservation Specialist / Researcher at The National Archives, UK


Plain text file format identification is of interest in the digital preservation field. At The National Archives (TNA), we have started research with text file format identification as its main topic. The research centres on the question: 'How to correctly identify the file type of a text file from its contents?'

The motivation for this research and the dataset used are discussed in part 1, published earlier. In this part, we present a methodology that treats text file format identification as a classification problem. For now, we consider five formats: two kinds of programming source code (Python and Java), two data file types (.csv and .tsv) and one text file type (.txt).

Methodology - ML to the rescue

Artificial Intelligence (AI) and Machine Learning (ML) have become an integral part of our lives. Machine learning is a set of generic algorithms that can extract patterns from data and use them to predict an outcome. TNA deals with a huge variety of file types for digital archiving, so an iterative process model that includes file types gradually is appropriate. The methodology should be flexible enough to apply to more file types progressively. When a new file type is to be included, its features (specific characteristics) should be compared against the existing features and engineered to add to the list. Hyperparameters for the models should then be adjusted accordingly to get better performance. The flow graph in Figure 1 shows the methodology developed.
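To make the idea of engineered features concrete, here is a minimal sketch in Python. It is not the TNA prototype: the five features, the toy training strings, the nearest-centroid classifier and the names `extract_features`, `train_centroids` and `predict` are all assumptions chosen for illustration, and a real model would need a proper dataset and evaluation.

```python
from collections import defaultdict
import math


def extract_features(text):
    """Turn file contents into a small numeric vector (hypothetical features)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = max(len(lines), 1)  # avoid dividing by zero for empty files
    return [
        sum(ln.count(",") for ln in lines) / n,               # commas per line
        sum(ln.count("\t") for ln in lines) / n,              # tabs per line
        sum(ln.rstrip().endswith(";") for ln in lines) / n,   # lines ending in ';'
        sum(ln.rstrip().endswith(":") for ln in lines) / n,   # lines ending in ':'
        sum(ln.count("{") + ln.count("}") for ln in lines) / n,  # braces per line
    ]


def train_centroids(samples):
    """samples: list of (text, label). Returns label -> mean feature vector."""
    grouped = defaultdict(list)
    for text, label in samples:
        grouped[label].append(extract_features(text))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, text):
    """Return the label whose centroid is closest (Euclidean) to the file."""
    feats = extract_features(text)
    return min(centroids, key=lambda label: math.dist(feats, centroids[label]))


# Toy training data, invented for this sketch only.
TRAINING = [
    ("def add(a, b):\n    return a + b\n", "python"),
    ("for i in range(3):\n    print(i)\n", "python"),
    ("public class A {\n    int x = 1;\n}\n", "java"),
    ("class B {\n    void run() {\n        int y = 2;\n    }\n}\n", "java"),
    ("name,age,city\nann,30,leeds\nbob,25,york\n", "csv"),
    ("id,score\n1,0.5\n2,0.7\n", "csv"),
    ("name\tage\tcity\nann\t30\tleeds\n", "tsv"),
    ("id\tscore\n1\t0.5\n", "tsv"),
    ("Dear reader\nThis is a plain note\n", "txt"),
    ("Minutes of the meeting\nNothing to report\n", "txt"),
]

centroids = train_centroids(TRAINING)
print(predict(centroids, "x,y,z\n1,2,3\n"))
```

Adding a new file type in this sketch means adding labelled samples and, where the existing features cannot separate it from the others, appending a new feature to `extract_features`, which mirrors the iterative process described above.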

Curation and Preservation: Teaching, Practice and Inquiry

Kristen Schuster

Last updated on 10 September 2019

Kristen Schuster is Lecturer in Digital Curation at King’s College London


I had a rather interesting conversation with a colleague a few months back. It started with the simple question: what exactly do you do? Usually this sort of inquiry is a conversation stopper, but with this particular colleague, it was a genuine inquiry meant to start a conversation. It helped that we were sitting in a room full of undergraduate students visiting from the States who had only a vague inkling (at best) about the Digital Humanities.

They were in for a bit of a disappointment though, because I am not a digital humanist. I’m a librarian who works within the digital humanities. And I have a rather fantastic, if slightly cryptic, job title: Lecturer in Digital Curation.

Motivation to Undertake File Format Identification Research for Plain Text Files

Santhilata Kuppili Venkata

Last updated on 2 September 2019

Dr Santhilata Kuppili Venkata is Digital Preservation Specialist / Researcher at The National Archives, UK


The file format identification problem has been of interest for quite some time in the areas of digital archiving and digital forensics. Many researchers are working to find a solution to this problem. While most of this work aims to identify files with binary formats, little has been done to identify the file type of plain text files. In this digital era, files are often generated in an integrated development environment where each document generated is supported by multiple files. These include programming source code, data description files (such as XML), configuration files and so on. For digital preservation, it is important to understand each of these supporting files correctly.

The contents of the supporting files are often human-readable, i.e. they can be opened as plain text files using a simple text editor. But if the file extensions are missing or corrupted, it is hard to know how to put that file to use!

Some of the existing research work used Natural Language Processing (NLP) techniques such as pattern matching of n-gram contiguous sequence models. However, these techniques need the program to be written in a specific style, and they fail to differentiate files when the target file types have very similar structures (for example, Java and C). We need to generate file features and classification models in such a way that they describe file types distinctly. To address this problem, we kick-started the research project 'Text File Format Identification' at The National Archives (TNA). Our initial prototype makes use of machine learning algorithms and can identify five formats: Python, Java, .txt, .csv, .tsv.
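For readers unfamiliar with n-gram matching, the following is a rough sketch in Python of how such a comparison can work. It does not reproduce any of the cited research or the TNA prototype: the functions `char_ngrams` and `cosine_similarity` and the two sample snippets are hypothetical. It counts every contiguous run of n characters in a file and compares two such profiles. Snippets in structurally similar languages share many n-grams, which is one way to see why this approach alone can struggle to separate them.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count every contiguous run of n characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 to 1.0)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented snippets: the loop body is valid in both Java and C.
java_like = "int total = 0;\nfor (int i = 0; i < n; i++) { total += i; }\n"
c_like = "int sum = 0;\nfor (int i = 0; i < n; i++) { sum += i; }\n"

print(cosine_similarity(char_ngrams(java_like), char_ngrams(c_like)))
```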

Ten things I've learned about using OAIS from a sprawling twitter convo

Paul Wheatley

Last updated on 12 September 2019

Throughout the last couple of days I've blundered about the twitterspace, having a multidimensional debate about OAIS, prompted in the first instance by this tweet:

I've done some ranting, I've got the wrong end of the stick, I've discussed stuff, I've got completely confused, I've had a few chuckles and now and again I've even managed to say something remotely useful. I've also done quite a bit of reflection on what's been said and I reckon I've learned a lot. I wanted to attempt to write up a summary of what I've learned, and as it's Friday afternoon I've aimed for something reasonably light hearted. So here's ten things I've learnt about using OAIS from a sprawling twitter convo:

File format identification: A student project at the University of Sheffield Library

Chris Loftus

Last updated on 23 August 2019

This blog has been written by Peter Vickers, a postgraduate student in Speech and Language Processing hired by the University Library, as part of the University of Sheffield’s OnCampus programme, to look into file identification and archiving.


Forgotten Scripts

Below is an inscription written in Linear A, a Minoan script which has been found on thousands of historical objects across Greece. Because the language bears no close similarity to a language we understand, and we have no Rosetta Stone to decipher the language, linguists have had to use speculation and comparison to attempt to decode the script. Whilst over the past decades Linear A has been related to the proto-Greek Linear B, the Hittite Luwian script, Phoenician, and Indo-Iranian, none of these comparisons has either achieved widespread academic acceptance or allowed for the translation of much of the Linear A corpus. For now, at least, Linear A, and all of the debts, curses and tax returns it encodes, are indecipherable.

Given our cultural interest in lost languages and the knowledge they might encode, I wonder what researchers in 100 years will make of all the digital content we create. Linear A is 3,500 years old – old enough to be forgiven for having been forgotten. Meanwhile, last week I found myself unable to access the data on a five-inch floppy disk, a format that was still in use twenty years ago. Of course, the loss is not the same – I could use the library’s archival system to read the disk. However, the data on the disk might itself be in an obsolete file format. Comparing it to the Linear A problem: recovering the data might be compared to the legibility of the script, whilst opening the files might be compared to our ability to translate it.

Open repositories: or how I learned to start worrying and hate jingoism

Hrafn Malmquist

Last updated on 29 July 2019

Disclaimer: I must state that the following blog-post is written in a personal capacity, airing opinions that are my own and are not intended to endorse a particular piece of software. They should not be considered official on behalf of my current employer, The University of Edinburgh.

Last month, in June 2019, I attended the fourteenth Open Repositories (OR) conference held in Hamburg, organised by Hamburg University. Hamburg is a beautiful city, and the conference coincided with Hamburg University’s centenary.

It is one of the biggest conferences of its kind in the world and had a packed four-day schedule. It was the first OR I attended and I delivered a presentation: “Automating OAIS compliant digital preservation using Archivematica and DSpace”. A bit more about that later. I saw many interesting talks, from both an ideological and a technical perspective (I am a developer, although I do have a background in library and information science). I’ll now proceed to tell you a bit about my experience at the conference.

Enhancing Services to Preserve New Forms of Scholarship

Karen Hanson

Last updated on 22 July 2019

Karen Hanson is Senior Research Developer for Portico


The last decade or so has seen the emergence of a new kind of scholarly work - the enhanced digital monograph. While still recognizable as monographs, these resources include a variety of dynamic features that cannot be replicated in print format. These works represent a leap forward for scholarship, but their formats, use of dynamic features, and composite nature present complex preservation challenges. 

To help address these challenges, a new collaborative project funded by the Andrew W. Mellon Foundation partners preservation institutions, libraries, and university presses that are producing enhanced monographs. The goal is to examine what aspects of these works can be preserved at scale, and produce guidelines to improve their preservability that publishers and authors can use while creating these works.

Ten Years On – Some Myths Debunked About the Artist FKA The DPC Leadership Programme

Sharon McMeekin

Last updated on 19 July 2019

Our illustrious (!) leader William Kilbride started with the DPC in February 2009, and one of the first new initiatives he introduced was the DPC’s Leadership Programme. For ten years now the programme has been one of the core elements of our workforce development activities. It offers grants so that our members can attend training and development opportunities they may not otherwise be able to attend. The programme has also helped ensure that organizations who offer training can have some assurance of a return on their investment. In its lifetime the DPC Leadership Programme has provided well over 100 grants for members to attend training and development opportunities. This began back in May 2009 with 2 grants for individuals from the National Library of Wales and Cambridge University to attend the Digital Preservation Training Programme.
