Grant Hurley is Digital Preservation Librarian, Scholars Portal, and Meghan Goodchild is Research Data Management Systems Librarian, Scholars Portal/Queen’s University Library. They are based in Ontario, Canada.

As a service provider, Scholars Portal is building a suite of services and infrastructure to support research data management and preservation in Canada. But a key gap is the ability of our member institutions to make use of these services when there is a lack of policies, procedures, strategies and resources at the local level. This post outlines our work to support research data preservation workflows through an integration project between Dataverse and Archivematica. It also offers some observations on the challenges facing the uptake of these tools, challenges that mean the preservation of research data continues to be at risk.

The past few years have witnessed a remarkable growth of interest in ensuring that digital research data persist over time. Informed by broad trends, such as open data, open access, and the need for more robust study reproducibility, and paired with new and forthcoming requirements set by funders and publishers, the subject of research data management has become a field unto itself. Research data has also become a distinct genre of records that institutions are actively collecting. Among the many papers, guidelines, tools and standards that address the topic, the FAIR principles, released in 2016, have been adopted as an organizing catalyst for activity around research data management. Ensuring that data are findable, accessible, interoperable, and reusable (FAIR, for short) is a noble goal. But while the FAIR principles provide a set of requirements for stakeholders in the research data ecosystem to strive towards, it is notable that they do not address preservation: designing and implementing the preservation practices that would keep data FAIR into the future is left up to the institutions who take on custodial responsibilities for research data. The question of how these organizations should go about doing the things that support preservation remains an open one, and dependent on the complex mix of technological infrastructures and platforms, human and financial resources, and strategy, documentation, policies and procedures at their disposal that together create a preservation program. When one or more of these “legs” is missing, to cite the classic “three-legged stool” metaphor for digital preservation by Anne Kenney and Nancy McGovern, the risks to preserving the records of research remain unmitigated. At Scholars Portal, we have been working to solidify the technology “leg” for research data in collaboration with our members and funders. 
But we cannot see our services and work taken up and used at their full capacity if the other two legs remain stubbornly wobbly. This post outlines our work to support the preservation of research data through an integration project between Dataverse and Archivematica. It also offers some observations on the challenges facing the uptake of these tools: the need for policies, procedures, strategies and staff resources to guide this work at the institutional level.

Scholars Portal is the information technology service provider for members of the Ontario Council of University Libraries (OCUL), a 21-member consortium of academic libraries in the province of Ontario in Canada. Among our suite of services is a multi-institutional instance of Dataverse, a popular open source repository platform for uploading, curating, and accessing research data. As of the time of writing, there are 47 participating institutions with the Scholars Portal Dataverse service, which is growing beyond our consortium’s members via the Portage Network and agreements with regional library consortia to become a national scale service. A CANARIE-funded project to improve Dataverse’s data curation and storage capabilities is also in progress. We also host 11 instances of Archivematica, an open source preservation processing application, as part of the Permafrost service for OCUL members. The Permafrost service makes use of the Ontario Library Research Cloud for storage, an OpenStack Swift-based cloud network supported by Scholars Portal for OCUL members.

As part of our mandate to support our user community with technological solutions and infrastructure for preservation, Scholars Portal sponsored Artefactual Systems Inc. to develop an integration between Dataverse and Archivematica, which was made available as part of the Archivematica public release v. 1.8 in November 2018. While Dataverse supports some preservation-friendly functionality, such as establishing checksums, encouraging descriptive metadata, and creating non-proprietary, standardized tabular text files from submitted formats like SPSS and Excel, organizations may wish to process certain datasets further and store them independently from Dataverse for long-term preservation. This would enable them to perform additional preservation actions outside of Dataverse, and manage materials more effectively in an archival environment over time, even if the Dataverse or Archivematica platforms themselves no longer exist. At its most basic, the technical integration enables Archivematica users to configure a Dataverse instance as a transfer source location, and select and process datasets in Archivematica. The output is an Archival Information Package that contains the user-submitted original files, any normalized copies (including versions created through Dataverse’s processes) and a set of descriptive and preservation metadata files. Therefore, the integration would enable a workflow wherein a dataset is submitted to Dataverse, and a curator or archivist appraises and selects that dataset as suitable for long-term preservation, processes that dataset with Archivematica, and stores and maintains it in archival storage. Since Dataverse and Archivematica are both open source products, the integration is available to anyone with access to them with appropriate admin privileges. We were pleased to present a paper at the recent iPRES conference on the specifics of the integration. A final version will be published as part of the iPRES proceedings shortly. 
You can also read more about the integration in the Archivematica wiki. All readers are also invited to test out the integration using the sandbox we support that uses test datasets in our demo Dataverse. Please visit this page for more information.
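
As a rough illustration of the fixity checking that underpins a workflow like this, the sketch below verifies the files in a package against checksums recorded at deposit time. It does not use the Dataverse or Archivematica APIs: the `PackagedFile` record, the `verify_package` function, the "original" and "normalized" role labels, and the choice of MD5 are all assumptions made for the example, not a description of how either platform implements this.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class PackagedFile:
    """One entry in a hypothetical package manifest (not a real AIP schema)."""
    relative_path: str
    role: str  # "original" or "normalized"; labels are assumptions
    md5: str   # checksum recorded when the file was deposited


def md5_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(root: Path, manifest: List[PackagedFile]) -> List[str]:
    """Return the relative paths that are missing or whose checksum changed."""
    failures = []
    for entry in manifest:
        target = root / entry.relative_path
        if not target.is_file() or md5_of(target) != entry.md5:
            failures.append(entry.relative_path)
    return failures
```

A curator could run a check like `verify_package(package_root, manifest)` after moving a dataset into archival storage; an empty list means every file still matches its recorded checksum.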

In the course of promoting the testing and use of the integration, we have been asking for feedback and use cases to inform future developments. We’re interested in hearing to what extent the workflow as designed meets institutional needs and what could be improved. We have already outlined a list of ideas about improvements based on our testing. Though the integration has been out for nearly a year now, we have yet to hear any substantive feedback from the community. While there seems to be general interest, our sense is that institutions are still figuring out how to even begin linking Dataverse and related RDM support services with preservation. Although a few institutional libraries in Canada have been actively building RDM services, many of which developed out of established data services, activities around preservation of research data are largely in the early discussion or planning phases. Both rolling out RDM services and developing digital preservation programs are still largely in development at Canadian institutions, though the situation varies widely among them. In many cases, staff are assigned to RDM and preservation roles in addition to their existing duties, and it can be hard to prioritize the development of policies, procedures and strategy when you are struggling to find time to build capacity and engage in the heavy outreach and training work required to get RDM services rolling.

On the preservation side, it is clear from a recent survey of institutions in Canada by the CARL Digital Preservation Working Group that low levels of staffing and a lack of policies and procedures are the main challenges facing institutions. Fewer than 20% of the institutions surveyed have digital preservation policies. Fewer than half are using any one tool for preservation processing, and fewer still are using cloud or other replicated storage methods. On the staffing side, over 60% of respondents do not have the equivalent of one full-time position working on digital preservation, even when adding up the time of all staff with some responsibilities in this area. A recent RDM needs assessment of libraries within OCUL conducted by Meghan Goodchild (report forthcoming) also found that research data preservation is an area with major gaps. Although the majority of libraries (72%) see a role for themselves in preserving research data, they are currently grappling with defining this role, with budget for resources and long-term storage notably lacking. Across the board, libraries expressed a need for a shared approach to preservation, which would require partnerships at the regional and national levels.

There remains a series of unanswered questions to get curation and preservation workflows moving between Dataverse and Archivematica: given a particular dataset, what qualities and contextual information would make it suitable for long-term preservation? How much quality checking, verification and description should curators do when accepting datasets and who should be responsible for this work? What domain-specific expertise is needed to evaluate and curate the dataset and its metadata and documentation? When should a dataset be moved from the semi-active repository storage in a platform like Dataverse to the archival storage linked to via Archivematica and by whom? What preservation decisions like normalization might need to happen in Archivematica? How long should the access copy be maintained in Dataverse? While service providers like ourselves can assist via our knowledge of our technologies and abilities to develop tools further to meet needs, issues of policy, procedure and roles and responsibilities need resolution by individuals who know the data well and are empowered to make these kinds of decisions.

The model that developed into the now-active Data Curation Network in the United States provides an exciting approach for potentially dealing with the resource and expertise challenges that come with data curation in a collaborative, distributed way. But the model requires a local curator who corresponds with the researcher who submitted the data, and contributes expertise to the network. They also perform appraisal duties at the local level, and support preservation activities at their institution. It is not clear that the resources to do this work, and therefore to develop a workable network model for curation, are anywhere near a possibility in Canada at the moment. A bright spot is the effort coalescing at the national level in Canada via the Portage Network, which includes a variety of expert groups around data curation and preservation producing some excellent guides, training materials, and other documentation. A paper by the Portage Preservation Expert Group also proposes an interesting distributed model for a national preservation service. A recent Canadian Data Curation Forum, hosted by Portage and funded by the Canadian Social Sciences and Humanities Research Council (SSHRC), consisted of a training event with interactive skill-building workshops to develop data curation skills and a community-building forum involving discussions with key stakeholders. With an aim to develop a vision and roadmap for a national approach to data curation in Canada, these types of efforts are laying the groundwork for building national-level services and support, but will need coordination of diverse players and a sustainable funding model in order to move such a vision into a practical reality.

The genre “research data” can often seem very opaque, since it describes the context of its creation rather than its form or format. A dataset from a survey project will be quite different from a dataset involving the study of icicle morphology. It is only through the time and resources to test, ask questions, and manage and curate actual data through the research lifecycle that policies and procedures can be better defined. Researchers have the data to deposit. Service providers and collaborative networks can provide shared technologies, development and expertise about their use and preservation. In Canada, there remains a “missing middle” in the form of institutional capacity with policies and procedures to join the two together and ensure that research data can persist into the future.
