Datasheets for Datasets: Describing Web Archives Collections

Added on 27 April 2023

When: Friday 26 May 2023

Time: 1:30-4:00pm

Where: British Library, Staff Entrance, Midland Road, NW1 2DB

Capacity: 16 people

Register here: https://forms.gle/HLMV6chmQhKG6kJ66 (Registration closes 17th May)

This in-person workshop explores how web archives collections can be described using the Datasheets for Datasets framework.

Significant work in web archives scholarship focuses on the description and provenance of collections and their data. Looking beyond the worlds of libraries, archives and cultural heritage can provide valuable alternative approaches to experiment with and use. Datasheets for Datasets is a method from the field of machine learning for describing large datasets, which uses a standard set of questions arranged by stages of the data lifecycle.

During this workshop participants will discuss how web archives collections can be described using the Datasheets for Datasets framework, specifically a datasheets template that is arranged into nine sections. This template asks questions about a dataset, focusing on the specific needs of machine learning researchers. More information on these questions can be found here: https://www.microsoft.com/en-us/research/project/datasheets-for-datasets/

Participants will consider how these questions can be adopted for the purposes of describing web archives datasets, assessing how each question might be adapted and applied to describe datasets from UK Web Archive curated collections.

After a description of the Datasheets for Datasets framework, there will be a group card-sorting exercise. Each group will evaluate a set of questions using the MoSCoW technique, sorting them into categories of Must, Should, Could, and Won’t have. Groups will report back on this task via a facilitated discussion about the priorities and resources available for generating descriptive metadata and documentation for public web archives datasets.

These workshops will be held in person only due to the format of the activity; they will not be recorded and can host at most 16 people.

Research participation: As part of this workshop we want to gather data for the project “Studying Description Needs for Web Archives Datasets”, in which we are studying how to prioritize description and documentation needs for web archive datasets. We’re hoping to learn how the existing framework of Datasheets for Datasets does or doesn’t align with web archives priorities, and what individuals’ experiences have been with documentation and description frameworks in general for web archives datasets. We are recruiting participants to take part in research including: (1) a pre-workshop online survey which will gather data about participants’ experience and roles related to web archives; (2) recording of findings from the workshop card-sorting activity.

Participants are eligible to take part in this study if they are:

- experienced with or interested in the creation and use of web archives datasets – our focus is web archives collections from the British Library;

- 18 years of age or older;

- comfortable communicating in English; and

- interested in discussing experiences during the workshop and with the research team.

Participation in this study is completely voluntary and confidential, and names will be anonymized to protect participants’ identities. You will be asked about the role or tasks you engage in related to web archives data, your training, time period of your employment, and experience with specific tools or collections, but no other personal information will be collected as part of this project.

About the Facilitators:

Emily Maemura is an Assistant Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. Her research focuses on data practices and the activities of curation, description, characterization, and re-use of archived web data.

Helena Byrne is the Curator of Web Archives at the British Library. She was the Lead Curator on the IIPC Content Development Group 2022, 2018 and 2016 Olympic and Paralympic collections.