Erin Liu is Assistant Archivist at the University of the Arts London. She recently completed the Postgraduate Certificate in Applied Data Science at Birkbeck with support from the DPC Career Development Fund, which is funded by DPC Supporters.
With the support of a Member Self-Identified grant from the DPC Career Development Fund, I completed the Postgraduate Certificate in Applied Data Science at Birkbeck, University of London last autumn. I was motivated to pursue the course by my experience of working with digitised and born-digital material across various roles at the University of the Arts London (UAL) Archives and Special Collections Centre (ASCC). In both my substantive role as Assistant Archivist and on part-time secondment through 2024 as Digital Preservation and Access Manager, I’d come across tasks and workflows that could potentially be optimised if our service were afforded the opportunity to deepen its in-house computational knowledge. The UAL ASCC has always had a strong practical sense of what we require computational processes to do. However, we stood to benefit from further training in developing, testing and applying scripts to directly improve our workflows, particularly at the pre-ingest stage.
The PgCert Applied Data Science arose from the Computing for Cultural Heritage programme, which was co-developed by the British Library, The National Archives (UK) and Birkbeck. I first learned about the course from a fellow former Bridging the Digital Gap trainee, Jacob Bickford, who undertook it in 2021 during his time at the University of Westminster. You can read about his experience, also funded by the DPC Career Development Fund, in this reflective blog post. The PgCert is intended for those who do not have a formal qualification in computer science but have demonstrable experience using IT tools in a professional environment. It provides students with a broad knowledge of computing, as well as applicable skills in programming and data analysis using Python.
During the course, I undertook two taught modules covering the foundations of Python, as well as Python’s applications in machine learning and natural language processing. Whilst I’d come in with knowledge of interpreting and running Python scripts, these modules gave me additional background and fundamentals that I hadn’t previously appreciated. They also equipped me with the knowledge and skills to proceed with a practical work-based project that allowed me the time, space and supervision to apply this computational knowledge within the UAL context.

An example of UAL’s digital collections, which document histories of art, design and creative education. Digitised photograph of IA_074 Industrial Design and Engineering – Photograph, from the archive of the Camberwell Inner London Education Authority (ILEA) Collection.
UAL has collection strengths in filmmaking, sound arts, photography and student artworks, as well as histories of printing, publishing and art education. Many of these collections take digitised and born-digital formats. Elisabeth Thurlow has written extensively on this very blog about the digital preservation journey she has led UAL through as Digital Preservation and Access Manager. In responding to the challenges of caring for digital assets, a key point of contention among allied practitioners is whether (1) to maintain several discrete yet interrelated programmes/activities or (2) to acquire a third-party product that aspires to encapsulate these workflows as an all-in-one solution. In practice, however, even those who adopt the second approach acknowledge that complementary ‘microservices’ are still required to make use of the third-party solution, even at a basic level (Rice and Schweikert, 2019).
This potential microservices gap was precisely the context I was interested in exploring in my work-based project. Between 2017 and 2021, UAL procured and implemented a digital preservation system (DPS) to provide an active preservation environment for digital assets under its care. Since the DPS entered its business-as-usual phase, workflow bottlenecks have revealed themselves further up the chain, at the stage of preparing material for submission to the DPS, i.e. the ‘pre-ingest’ stage. Research undertaken as part of the project reassured us that challenges at the pre-ingest stage are not unique to UAL. Processes such as content review and structuring Submission Information Packages (SIPs) can require human intervention at several points. However, in the UAL context, these challenges are compounded by the diversity of cataloguing systems and standards used across UAL, which include those of libraries, museums and archives; the sheer range of practices resists simple standardisation of approaches. Due (in part) to this complexity, activities associated with the pre-ingest stage have historically been manual, tedious and at risk of human error.
My project explored how Python in particular might be deployed to address this potential microservices gap. As part of this project, I consulted key stakeholders at the UAL ASCC to sharpen project requirements. Based on these requirements, this project sought to automate three core pre-ingest tasks using Python tools and libraries:
1. To compare directories of digital content, separating duplicate material from unique material.
2. To securely copy material across storage areas, ensuring integrity of digital content throughout the process.
3. To correctly organise unstructured data and metadata into DPS-readable folder structures.
Whilst task (3) is more idiosyncratic to the UAL environment, tasks (1) and (2) are carried out in some form at most organisations, regardless of their scale or maturity. For tasks (1) and (2), this project drew on precedents that deploy Python for these purposes, including work from the Irish Film Institute, the National Library of New Zealand and MIT Libraries, and on UAL’s experience as part of the DPC/BitCurator Python Study Groups 2024, where our group shared an investment in developing a secure copy script (Gattuso, 2015; Gattuso, 2024; Hanson, 2019; O’Leary et al., 2025; Prater et al., 2024). For all tasks, this project drew upon the intimate knowledge UAL ASCC staff have both of their digital collections and of the challenges they face in caring for them, particularly at the pre-ingest stage.
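To give a flavour of what task (1) can involve, here is a minimal sketch of a checksum-based directory comparison. It is an illustration only and not the code published from this project: the function names `index_by_checksum` and `compare_directories` are hypothetical, and it assumes files are small enough to be read into memory in one go.

```python
import hashlib
from pathlib import Path


def index_by_checksum(directory):
    """Map each MD5 checksum to the relative paths of the files that share it."""
    root = Path(directory)
    index = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Assumption: files fit in memory; large files would need chunked reads.
            checksum = hashlib.md5(path.read_bytes()).hexdigest()
            index.setdefault(checksum, []).append(path.relative_to(root).as_posix())
    return index


def compare_directories(dir_a, dir_b):
    """Split the files in dir_a into duplicates of dir_b content and unique files."""
    index_a = index_by_checksum(dir_a)
    index_b = index_by_checksum(dir_b)
    duplicates, unique = [], []
    for checksum, paths in index_a.items():
        # A file counts as a duplicate if its content appears anywhere in dir_b,
        # whatever its file name or location there.
        if checksum in index_b:
            duplicates.extend(paths)
        else:
            unique.extend(paths)
    return sorted(duplicates), sorted(unique)
```

Calling `compare_directories("transfer", "existing_store")` with two hypothetical folder names would return two lists of relative paths from the first folder. Because the comparison is made on content rather than on file names, a renamed copy is still reported as a duplicate.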
In design and development, I sought to maintain the following key principles:
- Treat checksums as foundational to the process, using hashing and hash comparison as a basis for maintaining and reporting on file integrity throughout each script run. These scripts/programmes deploy the MD5 hash algorithm for internal use, but another algorithm could be implemented with some tweaks.
- Support knowledge continuity through generation of CSV logs that communicate script results. Logs can support ongoing collections management and problem solving.
- Facilitate a straightforward user experience, inclusive of users of varying technical backgrounds. Digital preservation is often considered additional to core duties for many colleagues at UAL, and many will not have prior coding experience; it was therefore essential that the programme interfaces were simple, communicated progress/errors and were accompanied by clear user instructions.
- Design with simplicity, legibility, modularity and extensibility in mind. It’s impossible to anticipate all changes that can occur at the organisational, financial and/or technical levels. However, designing with these principles was intended to prepare the project for future change and to support future custodians in adapting the code as needs arise.
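The first two principles can be illustrated with a short sketch that hashes every file in a folder and records the results in a CSV log. This is a hedged example rather than the project’s own implementation: the function names, the `algorithm` parameter and the column headings are assumptions made for illustration. Passing the algorithm name to `hashlib.new` is one way the choice of MD5 could be swapped for another algorithm.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in chunks so that large files are never loaded whole."""
    digest = hashlib.new(algorithm)  # e.g. "md5" or "sha256"
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_log(directory, log_path, algorithm="md5"):
    """Write one CSV row per file and return the number of files logged.

    Assumption: log_path sits outside directory, so the log is not hashed itself.
    """
    root = Path(directory)
    logged = 0
    with open(log_path, "w", newline="", encoding="utf-8") as log_file:
        writer = csv.writer(log_file)
        # Hypothetical column headings, chosen for this illustration.
        writer.writerow(
            ["relative_path", "size_bytes", "algorithm", "checksum", "logged_at_utc"]
        )
        for path in sorted(root.rglob("*")):
            if path.is_file():
                writer.writerow([
                    path.relative_to(root).as_posix(),
                    path.stat().st_size,
                    algorithm,
                    file_checksum(path, algorithm),
                    datetime.now(timezone.utc).isoformat(timespec="seconds"),
                ])
                logged += 1
    return logged
```

A log like this can be opened in any spreadsheet application, which suits colleagues without coding experience, and it can be compared against a later run to check whether any file has changed.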
The project resulted in three Python-based scripts/programmes, which are available on GitHub for anyone to reuse and adapt. The project intended not only to facilitate the efficient movement of digital assets onward in the digital preservation lifecycle, but also to increase knowledge of our digital collections and trust among our users and donors. Whilst challenges remain in terms of the project’s scalability, ideal user experience and preservation of creation-date metadata, the project achieved its overall aim of making day-to-day digital preservation processes easier for staff, all the while maintaining key archival standards as much as possible.

Screenshot of a completed run of safe_copy.py, which copies content from one directory to another and compares checksums between directories to ensure successful copy.
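The copy-then-verify pattern described in the caption can be sketched as follows. This is not the source of safe_copy.py, only a minimal illustration of the general technique; the names `checksum` and `safe_copy_tree` and the shape of the returned results are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def safe_copy_tree(source_dir, destination_dir):
    """Copy every file and report whether each copy matches its source.

    Note: this sketch copies files only, so empty folders are not recreated.
    """
    source_root = Path(source_dir)
    destination_root = Path(destination_dir)
    results = []
    for source in sorted(source_root.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(source_root)
        target = destination_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        source_hash = checksum(source)
        # copy2 attempts to carry over file metadata such as modification times;
        # exactly what survives depends on the platform and file system.
        shutil.copy2(source, target)
        copy_hash = checksum(target)
        results.append({
            "file": relative.as_posix(),
            "source_checksum": source_hash,
            "copy_checksum": copy_hash,
            "verified": source_hash == copy_hash,
        })
    return results
```

Any entry with `"verified": False` would flag a file whose copy does not match its source and should be investigated before the original is removed.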
Prior to rejoining UAL in 2021, I worked in environments where collections managers worked side-by-side with applications developers within one team. I experienced firsthand the tangible benefits that computational skills could bring to digital preservation activity, particularly when it came to collaboratively designing workflows for processes like data transformations and data cleaning. As information professionals worldwide face shrinking teams/resources and increased workload and organisational pressures, the option to bring computational knowledge and skills in-house can support teams to progress key technical tasks more autonomously. Having a stronger handle on programming and data analysis using tools like Python can support us in continuing to provide ready and meaningful access to researchers both within our organisations and beyond, in-person and online. However, if this PgCert project showed me anything, it’s that effectively automating workflows (and testing and maintaining them) requires significant staff time, organisational support, collaboration and knowledge-sharing – evidencing that we can do more with more, rather than more with less.
References
Gattuso, J. (2015) ‘Safe_mover/log_compare.py at master · NLNZDigitalPreservation/Safe_mover’. Available at: https://github.com/NLNZDigitalPreservation/Safe_mover/blob/master/log_compare.py (Accessed: 23 May 2025).
Gattuso, J. (2024) ‘NLNZDigitalPreservation/Safe_mover’. National Library of New Zealand. Available at: https://github.com/NLNZDigitalPreservation/Safe_mover (Accessed: 23 May 2025).
Hanson, E. (2019) ‘file-management-python-scripts/compareFilesInTwoDirectories.py at master · ehanson8/file-management-python-scripts’. Available at: https://github.com/ehanson8/file-management-python-scripts/blob/master/compareFilesInTwoDirectories.py (Accessed: 23 May 2025).
O’Leary, K.J. et al. (2025) ‘Irish-Film-Institute/IFIscripts’. Irish-Film-Institute. Available at: https://github.com/Irish-Film-Institute/IFIscripts (Accessed: 23 May 2025).
os.path — Common pathname manipulations (2025) Python documentation. Available at: https://docs.python.org/3/library/os.path.html (Accessed: 20 July 2025).
Prater, S. et al. (2024) ‘PythonStudyGroups/2024 Cohort 2/Group 21/secureCopyFile.py at main · Digital-Preservation-Coalition/PythonStudyGroups’. Available at: https://github.com/Digital-Preservation-Coalition/PythonStudyGroups/blob/main/2024%20Cohort%202/Group%2021/secureCopyFile.py (Accessed: 11 June 2025).
Rice, D. and Schweikert, A. (2019) ‘Microservices in Audiovisual Archives: An Exploration of Constructing Microservices for Processing Archival Audiovisual Information’, International Association of Sound and Audiovisual Archives (IASA) Journal, (50), pp. 53–75. Available at: https://doi.org/10.35320/ij.v0i50.70.
Acknowledgements
The Career Development Fund is sponsored by the DPC’s Supporters who recognize the benefit and seek to support a connected and trained digital preservation workforce. We gratefully acknowledge their financial support to this programme and ask applicants to acknowledge that support in any communications that result. At the time of writing, the Career Development Fund is supported by Arkivum, Artefactual Systems Inc., boxxe, Cerabyte, DAMsmart, Evolved Binary, Ex Libris, HoloMem, Iron Mountain, Libnova, Max Communications, Pictoscope, Preferred Media, Preservica and Simon P Wilson. A full list of supporters is online here.