Matthew Addis

Last updated on 3 November 2021

Matthew Addis is Chief Technology Officer at Arkivum


The theme of this year’s World Digital Preservation Day is ‘Breaking Down Barriers’.  A great example is recent work on how digital preservation can help ensure scientific data will remain accessible and usable for decades to come.   The European Open Science Cloud (EOSC) is bringing together the communities of scientific research and digital preservation to break down the barriers that currently jeopardise the long-term access to hugely important scientific datasets and results.  

This is happening both at a strategic level, for example through EOSC’s working group on sustainability that includes a task force on long-term digital preservation (LTDP), and on a very practical and technical level, for example through the ARCHIVER project that is developing new LTDP services for EOSC.  Arkivum has been privileged to work in ARCHIVER to find new ways to do cost-effective and environmentally friendly digital preservation at scale (more on that below).

FAIR Forever

It is now well recognised that there is substantial value in making research data open and accessible.  Benefits include science that is higher quality and more productive, faster development of new products and services, and increased impact of research when addressing societal challenges.  These benefits can only be fully realised if research data is Findable, Accessible, Interoperable and Reusable – otherwise known as FAIR.  It is not enough simply to put data online and hope for the best!   

FAIR encourages and supports high-quality research that follows good research practice and produces results that are repeatable, verifiable, and re-usable.  FAIR data needs to be more than just FAIR today or FAIR tomorrow; it needs to remain FAIR for as long as the data is useful, which for all practical purposes is forever.  This includes the need for certified Trusted Digital Repositories for FAIR data, which is something that the FAIRsFAIR project is addressing head on with practical tools and services for repository certification, for example to CoreTrustSeal.

The DPC recently released an excellent report as a result of its FAIR Forever study that highlights the need to better incorporate digital preservation policy, technology and capacity into the new infrastructures, such as EOSC, that are emerging for sharing and reuse of FAIR data.  The DPC report identified gaps and barriers, and most importantly it provides a series of recommendations on how to break these down.

The DPC report and the EOSC sustainability working group are taking a top-down approach to LTDP in EOSC, and rightly so, because preservation at the scale needed in EOSC is deeply connected to the need for sustainability, which includes issues around funding and economics, policies and mandates, and the need for awareness and changes in user behaviours.  That’s going to be no mean feat because EOSC aims to serve 1.7 million European researchers and 70 million professionals in science, technology, the humanities and social sciences.

ARCHIVER

The ARCHIVER project is taking a more bottom-up approach to LTDP in EOSC.  There are major technical challenges to overcome when doing practical digital preservation and archiving of huge scientific datasets (a petabyte of data is merely an hors d'oeuvre for many in this community).  ARCHIVER recognises the need for new commercial services for digital preservation that can be reliably and certifiably scaled to the “petabyte region and beyond”.  To that end, ARCHIVER is stimulating new R&D by service providers and has just finished the Prototype phase of the project.

Arkivum produced one of the ARCHIVER prototypes and demonstrated the ability to ingest, preserve, store and provide access to very large datasets at high data rates (over 100TB per day).  We recently presented this work at iPRES 2021.  This speed and scale is needed for organisations with large volumes of data such as the ARCHIVER end users (CERN, EMBL-EBI, PIC and DESY), but it also means that preservation services can be provided in a cost-effective way for smaller volumes of data or smaller organisations, which is sometimes called the long-tail of science.
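To put the 100TB per day figure in context, here is a rough back-of-the-envelope sketch.  It assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and a perfectly steady rate over 24 hours; the function names are purely illustrative and are not taken from the ARCHIVER prototype.

```python
# Back-of-the-envelope arithmetic for a daily ingest rate.
# Assumptions: decimal units and a constant rate across the whole day.
TB = 10**12            # bytes in a terabyte (decimal)
PB = 10**15            # bytes in a petabyte (decimal)
SECONDS_PER_DAY = 86_400


def sustained_rate_bytes_per_s(tb_per_day: float) -> float:
    """Average throughput in bytes per second implied by a daily volume."""
    return tb_per_day * TB / SECONDS_PER_DAY


def days_to_ingest(dataset_bytes: float, tb_per_day: float) -> float:
    """Days needed to ingest a dataset at a given daily volume."""
    return dataset_bytes / (tb_per_day * TB)


if __name__ == "__main__":
    rate = sustained_rate_bytes_per_s(100)
    print(f"100 TB/day is about {rate / 1e9:.2f} GB/s sustained")
    print(f"1 PB at 100 TB/day takes {days_to_ingest(PB, 100):.0f} days")
```

Under these assumptions, 100TB per day works out at roughly 1.16 GB/s sustained, and a petabyte takes about ten days to ingest.  Since the demonstrated rate was over 100TB per day, these are conservative figures.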

Environmental Sustainability

One of the key aspects of our approach in ARCHIVER was to make LTDP services as efficient as possible when using compute and storage infrastructures.  We did this by embracing serverless computing, Kubernetes (k8s), auto-scaling and an architecture of micro-services.  The details are in the iPRES paper for those interested, but one of the important outcomes was the minimisation of IT resource consumption when there isn’t work to do, as well as the ability to scale out massively when there is lots of data to process, which is otherwise known as auto-scaling.  This, along with cloud deployment into Google data-centres powered by renewable energy, is a step towards minimising the environmental impact of LTDP, at least from a technical standpoint.  Doing ‘more with less’ means less equipment (e.g. servers), less power, less cooling and as a result less carbon, both in the embodied and use phases of the carbon footprint of LTDP.
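For readers less familiar with auto-scaling, the idea can be sketched in a few lines.  This is only a minimal illustration of the general principle (no workers when the queue is empty, more workers as the backlog grows, up to a ceiling) and not the logic used in the ARCHIVER prototype; the function name, parameters and default values are all hypothetical.

```python
import math


def desired_workers(queued_items: int,
                    items_per_worker: int = 10,
                    max_workers: int = 100) -> int:
    """Return how many workers to run for the current backlog.

    Hypothetical policy for illustration only:
    - nothing queued -> zero workers, so idle time consumes no compute
    - otherwise one worker per `items_per_worker` queued items, rounded up
    - never more than `max_workers`
    """
    if queued_items <= 0:
        return 0  # scale to zero when there is no work to do
    needed = math.ceil(queued_items / items_per_worker)
    return min(max_workers, needed)


if __name__ == "__main__":
    for backlog in (0, 1, 25, 10_000):
        print(backlog, "queued ->", desired_workers(backlog), "workers")
```

In practice an orchestration platform would make this decision from live metrics, but the trade-off is the same: resources are consumed only while there is data to process.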

I’ve blogged before on the challenges and approaches to environmental sustainability of LTDP.  Environmental impact of LTDP services is an important factor in the overall long-term sustainability of research data, not least in EOSC, where the IT infrastructure needed to store, process and provide access to 100s of PB of data for decades is substantial to say the least!  However, long-term environmental sustainability of FAIR data has only recently started to get significant attention in EOSC.  And that’s despite the DPC doing some great work to raise awareness of environmental issues in LTDP, and solutions to them, through recent and forthcoming events.

There are still many barriers to be broken down in order to achieve environmental sustainability at the same time as fast, easy and open access to scientific datasets.  But the signs are good that the collaboration necessary to do this is starting to happen, with CERN, STFC and CSC all flying the environmental sustainability flag at the forthcoming DPC event “Environmentally sustainable digital preservation - moving from theory to practice”, which itself is very well timed to coincide with COP26.

