The UK Web Archive (UKWA) is led by the British Library and works collaboratively across the six UK Legal Deposit Libraries: the National Library of Scotland; the National Library of Wales; the Bodleian Libraries, University of Oxford; Cambridge University Library; the Library of Trinity College Dublin; and the British Library. In May 2020, UKWA celebrated 15 years since making its first collections publicly available online.

Ian Cooke

UKWA was formed in response to a growing awareness of an urgent digital preservation need: to collect and preserve communication on the web. The preservation risks relating to websites were known to be: highly decentralised communication, with many small groups and individuals responsible for content; a rapidly changing technological environment; no regulatory framework for preservation; and a low awareness both of the value of communication on the web and of web archiving generally. These factors created an environment where content on the web could easily be lost or changed, with no organisation holding responsibility for its long-term preservation. The situation was characterised as a ‘digital black hole’.

The response of UKWA was therefore social, cultural, political and technological. Led by the British Library, UKWA formed as a partnership of institutions concerned with the preservation of digital content. This included the National Library of Scotland and National Library of Wales, alongside The National Archives, Wellcome Library and Joint Information Systems Committee (JISC). The British Library was also a founding member of the International Internet Preservation Consortium (IIPC), and used its membership to learn from peer institutions and also share knowledge and experience. Preservation of the web required, and continues to require, co-operation to build capacity and support technical infrastructures that can be shared. The British Library continues its close association with the IIPC, providing a location for the IIPC’s Programme and Communications Officer and co-Chairing the Collection Development Group.

UKWA recognised the importance of access to the archived web as a key part of preservation. This was both in terms of ensuring that user feedback could identify issues with the archived websites and guide preservation decisions, and in terms of making the case publicly for the value of preserving born digital content. Collecting is curator-led, with an emphasis from the start on selection of websites and building thematic and event-based collections. Early collections included the 2005 UK General Election, the July 2005 terrorist bombings in London, and the response to the Indian Ocean tsunami on Boxing Day 2004. The creation of collections provided a way to highlight the value of the information contained in websites, and to stimulate the use of the archived web in research.

Selective web archiving was also permissions-based, with curators contacting website owners to request permission to make archival copies and play these back through a public interface. The permissions-based approach was both a practical response to the lack of regulatory support, and a means of raising awareness and generating an evidence base to make the case for legal deposit regulations.

Web archiving was used as a means to demonstrate the risks to born digital content more widely and to show the weaknesses of a system based on permissions. The publicly accessible user interface could be used as a communications tool to explain the value and the fragility of born digital content. Web archiving became a part of wider advocacy activities that led to the introduction, in April 2013, of the UK regulations for Legal Deposit of ‘non-print works’. These regulations referred explicitly to web harvesting as a means of collection, and established the right of legal deposit libraries to collect digital-only publications (including websites) in the UK, without the need for gaining permission first.

The possibilities opened up by the regulations required a scaling-up of preservation activity from thousands to millions of websites. UKWA has run a UK domain crawl every year since 2013, and continues to run topic and event-based collections. When content is highly ephemeral, rapid collecting is an essential part of preservation. However, this depends on an infrastructure that can scale up while keeping preservation responsibly managed. The technology used by UKWA is the result of extensive research and development, and is continually monitored and regularly updated. UKWA uses, and has contributed to the development of, open source tools for large-scale web archiving (Heritrix) and file format identification; an international standard preservation format (WARC) for storing collected web content; and a server structure with high levels of redundancy (Hadoop Distributed File System) for storage and access.
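To make the role of the WARC format more concrete, the following is a minimal sketch of how a single captured resource can be wrapped in a WARC/1.0 record, and how its headers can be read back. It is an illustration only and not UKWA's pipeline: in practice a crawler such as Heritrix writes these files. The function names `build_warc_record` and `parse_warc_record` are hypothetical, and the sketch handles one uncompressed record rather than a full WARC file.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(target_uri, payload, content_type, captured_at=None):
    """Serialise one WARC/1.0 'resource' record as bytes (illustrative helper)."""
    # captured_at should be a timezone-aware datetime; default to "now" in UTC.
    captured_at = captured_at or datetime.now(timezone.utc)
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", stamp),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = CRLF.join(f"{name}: {value}".encode("utf-8") for name, value in headers)
    # Version line, named fields, blank line, content block, two closing CRLFs.
    return b"WARC/1.0" + CRLF + head + CRLF + CRLF + payload + CRLF + CRLF


def parse_warc_record(record):
    """Split a single record into (version, header fields, content block)."""
    head, _, rest = record.partition(CRLF + CRLF)
    lines = head.split(CRLF)
    version = lines[0].decode("utf-8")
    fields = dict(line.decode("utf-8").split(": ", 1) for line in lines[1:])
    # Content-Length says exactly how many bytes of the block belong to the record.
    block = rest[: int(fields["Content-Length"])]
    return version, fields, block


if __name__ == "__main__":
    record = build_warc_record(
        "http://example.org/",
        b"<html>hello</html>",
        "text/html",
        datetime(2013, 4, 6, 12, 0, tzinfo=timezone.utc),
    )
    version, fields, block = parse_warc_record(record)
    print(version, fields["WARC-Target-URI"], len(block))
```

Because each record carries its own capture date, target URI and length, the archived content remains self-describing even when it is separated from the crawler or index that produced it, which is part of why a standard container format matters for long-term preservation.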

UKWA has been pioneering in its use of technology across the collection management lifecycle. It has led on the use of Solr to create full-text indexes from the archived web, and in the development of a user interface to exploit the research potential of full-text indexing. The latter, ‘Shine’, was developed iteratively as part of a research collaboration (the ‘Big UK Domain Data for the Arts and Humanities’) to meet the needs of ‘real’ research projects. Shine has subsequently been adopted by other web archives including, most recently, the Croatian national web archive. UKWA actively shares learning and experience through its engagement with the IIPC and more widely, participating in and hosting conferences, workshops, and hackathons on a range of curatorial and preservation tools, methods and issues.
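As a rough illustration of what full-text indexing enables, the sketch below builds a phrase query against a Solr index using Solr's standard `select` request handler, optionally faceted by a field so that results can be counted per value (for example, per crawl year). It only constructs the request URL and does not contact a server. The host, the collection name `webarchive`, and the field names `content` and `crawl_year` are assumptions made for the example; they are not taken from UKWA's or Shine's actual schema.

```python
from urllib.parse import urlencode


def build_fulltext_query(base_url, collection, phrase, facet_field=None, rows=10):
    """Build a Solr select URL for a phrase search (illustrative helper)."""
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", f'content:"{escaped}"'),  # 'content' is a hypothetical text field
        ("rows", rows),
        ("wt", "json"),
    ]
    if facet_field:
        # Faceting asks Solr to return counts of matches per value of the field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    url = build_fulltext_query(
        "http://localhost:8983/solr",
        "webarchive",
        "general election",
        facet_field="crawl_year",
    )
    print(url)
```

Faceting by a date-like field is one way a research interface could show how often a phrase appears in the archive over time, rather than only listing matching pages.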

Effort is focused on collaborative and inclusive methods to ensure a more comprehensive and high-quality collection of the UK web domain. This includes working with a diverse range of partners to build collections, experimenting with evolving web archive technologies to enable collecting of more dynamic and complex sites, and developing a high-fidelity playback tool to improve access to the archived web.

Today, the UK Web Archive preserves in excess of 750 TB of data, representing billions of files and seven annual UK domain crawls. Thematic collections include a series on UK general elections, Caribbean communities in the UK, LGBTQ+ lives, sport, religion and coronavirus. The public user interface recorded 925,000 sessions between April 2019 and March 2020. UKWA continues to host research placements and participate in national and international research projects.
