Jamie Shiers

Last updated on 12 July 2019

Jamie Shiers works in the Information Technology Department at CERN and is Manager of the Data Preservation for Long-Term Analysis in High Energy Physics (DPHEP) Collaboration.

Every 5 – 7 years, physicists from around the world get together to discuss their views on the priorities for Particle Physics – both in Europe and in collaboration with corresponding plans for other parts of the world[1]. The most recent of these symposia was held in Granada in May 2019, with the intent of forming a strategy that can be approved by the CERN Council in May 2020. There was notable enthusiasm for a new electron-positron collider (which might be linear or circular, and built in Europe or elsewhere). Should such a machine be hosted at CERN – for example, in a 100 km circular tunnel corresponding to one of the proposals – it would be unlikely to enter operation before the mid to late 2030s.

If such a machine were approved and constructed, one of its first goals would be to take data (albeit with some ten thousand times the statistics and greater precision) at the energies at which the former Large Electron Positron (LEP) collider operated from 1989 to 2000 (initially at the Z0 peak and subsequently producing W+/- pairs – see https://en.wikipedia.org/wiki/W_and_Z_bosons for clarifications).

The data from LEP is a “modest” 400TB (around 100TB from each of the 4 experiments ALEPH, DELPHI, L3 and OPAL) and there are 3 full copies of this data maintained at CERN, not to mention additional copies at various institutes around the world. The data is still available and used for scientific purposes, with papers published as recently as 2018 and 2019.

Having already preserved the data for almost two decades since the end of LEP, another two should be trivial, right?

Unfortunately not: whereas preserving the bits might be relatively straightforward (much of the data – but not all – is in a binary machine-independent format designed for 32-bit computers), discovery and access protocols have changed and will no doubt change again several times in the next one to two decades.
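To make the "relatively straightforward" part concrete: with several full copies of the data, bit-level integrity can be checked by comparing checksums across the replicas. The following is only a minimal sketch of that idea, not a description of how CERN actually verifies LEP data. The function names `sha256_of` and `check_replicas` are hypothetical, and it assumes each replica of a file is reachable as an ordinary local path.

```python
import hashlib
from collections import Counter


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_replicas(paths):
    """Compare replicas of one file (hypothetical helper).

    Returns the most common digest and the list of paths whose
    digest differs from it, i.e. the copies that need attention.
    """
    digests = {str(p): sha256_of(p) for p in paths}
    majority, _count = Counter(digests.values()).most_common(1)[0]
    suspects = [p for p, d in digests.items() if d != majority]
    return majority, suspects
```

A check like this only confirms that the copies still agree bit for bit. It says nothing about whether the files can still be found, opened or interpreted, which is where the harder problems described here begin.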

Furthermore, without complex software, largely written in “legacy” languages such as Fortran 77 but fortunately ported to 64-bit Linux on Intel processors, the data is largely useless. Maintaining this software, porting it as necessary and undertaking the necessary validation requires specialized knowledge that is decaying, as people age, retire and die[2].

In addition, documentation, websites, newsgroups, even e-mail threads are all essential to re-use of the data, as are various databases. All of these present their own preservation challenges – ones that are likely shared by many other disciplines.

What is clear, however, is that if we don’t start now to actively preserve this data, documentation, software and knowledge, it will soon be too late.

Who should oversee such an activity? The top management of CERN is typically replaced every 5 years, the data itself is migrated to new media every 3 years and services and technology are under continuous evolution.

One possible option would be to attach the activity to whichever of the candidate electron-positron collider projects is approved (possibly not for many years), but where should it be hosted in the interim? Can it be passed from Director-General to Director-General in the meantime? And who will ensure that this actually happens? A possible solution would be to include the active preservation of LEP data as part of the 2020 strategy – and hope that it is then carried forward through the next update at the end of the 2020s (around the time that a construction project could possibly be approved).

The bottom line: preserving data, software, documentation and much more requires considerable effort and foresight. Be careful: software that you write as a student may still be needed at the end of your career or even beyond!

[1] The interplay between theoretical and experimental physics is discussed in this recent Scientific American article: https://blogs.scientificamerican.com/observations/which-should-come-first-in-physics-theory-or-experiment.

[2] For (much) more on these issues, see the material prepared for a recent workshop on Sustainable Software Sustainability: https://indico.cern.ch/event/801649/.
