The purpose of the Crossrail Project archive is:

  • To support regulatory/contractual data retention requirements. This was the most obvious need and provided the must-have justification for an archive.
  • To provide context around the assets and final documentation being handed over. Although the final version of documentation and data is handed over to the operational railway, the lead-up to that decision (drafts and comments) is not, and this can be critical in supporting any subsequent enquiries/investigations.
  • To provide a data-centric learning legacy. Up to 10 years of data on health and safety, contract administration, planning, risks, vehicle plans, ground movements, documentation and more can be analyzed to identify trends and activity levels and so improve future major projects. This is the real opportunity for users of this archive.

As the Crossrail organization disbanded, the body of accumulated expertise required to navigate and interpret this information was at risk of being lost, with no enduring organization to retain corporate knowledge. An additional risk was that the information would be dispersed at the end of the project across a number of interested parties, each taking their own slice of the data, so losing the value of an integrated project dataset.

Crossrail Image 1

With no initial budget for creating or running an archive, and an initial perception that this was a relatively uninteresting compliance issue (there were even suggestions that the data could all be printed rather than stored electronically), the team had to find a sponsor and a minimal-cost strategy that would allow the data to be useful beyond the life of the Crossrail Project.

The TfL Corporate Archives Team was the natural sponsor for the archive, providing an effective set of challenges and rigor in identifying the data to be retained. With GDPR going live at the same time, it was imperative to incorporate this rigor into the archive process, and the TfL Team ensured that this was the case.

To ensure that the archive was useful beyond the life of Crossrail, the goal at the outset was to make it as easy to use as an online shopping portal (e.g. Amazon). The ability of customers to find what they are looking for quickly and easily is key, and the same logic was applied to the Crossrail archive with the following core requirements:

  • Free text search as the primary mechanism for identifying potential matches
  • Facets and filters to then allow users to restrict possible matches, in the same way as customers can restrict their product search to particular departments or price ranges
  • Map-based search as an additional mechanism for geospatial data
  • Basket to allow the user to collect items for subsequent download

From a compliance perspective each record in the archive needs to be:

  • Fully encrypted at all stages
  • Assigned a retention end date
  • Classified in terms of security access and level of personal data held
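As a minimal sketch of how the retention and classification attributes might travel with each record, the following Python dataclass is one possibility. The field names, the classification values and the whole-year retention rule are illustrative assumptions, not the project's actual schema. Encryption is assumed to be handled by the storage layer and is not shown.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class SecurityLevel(Enum):  # hypothetical security classifications
    OPEN = "open"
    RESTRICTED = "restricted"


class PersonalData(Enum):  # hypothetical personal-data levels
    NONE = "none"
    PERSONAL = "personal"
    SPECIAL_CATEGORY = "special_category"


@dataclass(frozen=True)
class ArchiveRecord:
    """Compliance metadata carried by every archived record (illustrative)."""
    record_id: str
    retention_end: date
    security: SecurityLevel
    personal_data: PersonalData

    def is_retention_expired(self, today: date) -> bool:
        # A record is expired once the retention end date has passed.
        return today > self.retention_end


def retention_end_date(closed_on: date, years: int) -> date:
    """Retention end = closure date plus a whole number of years."""
    try:
        return closed_on.replace(year=closed_on.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 28 Feb.
        return closed_on.replace(year=closed_on.year + years, day=28)


record = ArchiveRecord(
    record_id="EXAMPLE-0001",
    retention_end=retention_end_date(date(2019, 6, 30), years=7),
    security=SecurityLevel.RESTRICTED,
    personal_data=PersonalData.PERSONAL,
)
```

Keeping the metadata immutable (`frozen=True`) is a design choice in this sketch: a change of classification would then be an explicit replacement of the record's metadata rather than a silent edit.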

This then led to the primary technical challenge, which was to convert millions of electronic files and 25 different proprietary applications (each containing millions of unique transactions) into a unified archive that could be stored and accessed in a consistent manner.

Crossrail Image 2

The team developed the following approach:

  • Identify and confirm the reason for and duration of retention of application data
  • Identify the core business objects within the application and which screen(s) or report(s) were used by users to view these objects
  • Extract data from the underlying database to match the user screen view
  • Reconcile (tick) the screen view against the resulting data extract to confirm that they match
  • Output each business object to a JSON object
  • Assign additional metadata (classifications, retention period, security)
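The extraction steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `contract` and `variation` tables, their columns, the labels and the `_archive` metadata block are all hypothetical, and an in-memory SQLite database stands in for a proprietary application database.

```python
import json
import sqlite3


def demo_connection():
    """Create a tiny in-memory database standing in for a source application."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # allow access to columns by name
    conn.executescript("""
        CREATE TABLE contract (id INTEGER PRIMARY KEY, contract_no TEXT,
                               title TEXT, status TEXT);
        CREATE TABLE variation (contract_id INTEGER, raised_on TEXT,
                                description TEXT);
        INSERT INTO contract VALUES (1, 'C-001', 'Example station fit-out', 'Closed');
        INSERT INTO variation VALUES (1, '2016-03-01', 'Revised access dates');
        INSERT INTO variation VALUES (1, '2015-11-12', 'Additional survey');
    """)
    return conn


def build_business_object(conn, contract_id, metadata):
    """Join header and child rows into one JSON-ready business object."""
    header = conn.execute(
        "SELECT contract_no, title, status FROM contract WHERE id = ?",
        (contract_id,),
    ).fetchone()
    variations = conn.execute(
        "SELECT raised_on, description FROM variation "
        "WHERE contract_id = ? ORDER BY raised_on",
        (contract_id,),
    ).fetchall()
    return {
        # Labels follow the wording a user would have seen on screen.
        "Contract Number": header["contract_no"],
        "Title": header["title"],
        "Status": header["status"],
        # Child rows become an array inside the parent object.
        "Variations": [
            {"Raised On": row["raised_on"], "Description": row["description"]}
            for row in variations
        ],
        # Additional archive metadata assigned after extraction.
        "_archive": metadata,
    }


obj = build_business_object(
    demo_connection(),
    contract_id=1,
    metadata={"retention_end": "2030-12-31", "security": "restricted"},
)
print(json.dumps(obj, indent=2))
```

The point of the sketch is that one screen view, previously spread across several tables, ends up as one self-describing document that can be compared line by line with the original screen.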

The use of JSON to hold the archived business objects, which were previously stored across numerous tables in relational databases, was a critical decision. As per www.json.org: “JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate … an ideal data-interchange language”.

Each JSON object represents a single business object as a set of labels with associated values. By describing the labels in a way that matches the business view of the information, and using arrays to provide structure within the object, the team ended up with a human-readable document which can be searched and displayed.
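To illustrate why labelled values and arrays make such an object easy to display and search, the sketch below walks a nested JSON document and produces one "label path, value" pair per leaf. The sample risk record and its labels are invented for illustration, and `flatten` is a hypothetical helper, not part of the Crossrail solution.

```python
import json

# Hypothetical archived business object: labels match the business view,
# and an array gives structure to repeated child items.
RISK_JSON = """
{
  "Risk Reference": "R-0001",
  "Title": "Example ground movement risk",
  "Owner": "Example Owner",
  "Mitigations": [
    {"Action": "Install monitoring", "Due": "2014-05-01"},
    {"Action": "Weekly review", "Due": "2014-06-01"}
  ]
}
"""


def flatten(value, path=()):
    """Yield (label path, value) pairs for every leaf in a JSON-style value."""
    if isinstance(value, dict):
        for label, child in value.items():
            yield from flatten(child, path + (label,))
    elif isinstance(value, list):
        # Number array items from 1 so the path reads naturally to a user.
        for index, child in enumerate(value, start=1):
            yield from flatten(child, path + (f"#{index}",))
    else:
        yield " / ".join(path), value


rows = list(flatten(json.loads(RISK_JSON)))
for label, value in rows:
    print(f"{label}: {value}")
```

The same flattened pairs could feed either a display screen or a text index, which is one reason a single document format can serve both purposes.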

Crossrail Image 3

Once the applications had been converted into JSON files, a unified search could be applied across these and other file types to give users a mechanism for finding the data. This involved identifying fields which:

  • Should be searchable (those with a high text content and those containing user-recognizable references)
  • Could be used to facet (group) the results (those fields with a consistent set of values across the dataset)
  • Could be used to filter the results (typically date ranges)

For other files, the file content was the primary searchable value, and a JSON parent object was created to hold and manage security and metadata.
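The three field roles described above can be illustrated with a small in-memory search. This is a deliberately naive sketch: the field names, sample records and substring matching are assumptions made for the example, and a real search service would use proper text analysis and indexing.

```python
from collections import Counter

# Hypothetical field roles for one type of business object.
SEARCHABLE = ("Reference", "Title")    # text-rich fields and known references
FACETABLE = ("Status", "Discipline")   # consistent value sets for grouping
FILTER_FIELD = "Date"                  # ISO dates, which sort correctly as text

RECORDS = [
    {"Reference": "DOC-1", "Title": "Tunnel settlement report",
     "Status": "Approved", "Discipline": "Geotechnical", "Date": "2013-04-02"},
    {"Reference": "DOC-2", "Title": "Tunnel ventilation design",
     "Status": "Draft", "Discipline": "Mechanical", "Date": "2015-09-17"},
    {"Reference": "DOC-3", "Title": "Station signage plan",
     "Status": "Approved", "Discipline": "Architecture", "Date": "2016-01-20"},
]


def search(records, text, date_from=None, date_to=None, facets=None):
    """Return matching records plus facet counts over those matches."""
    needle = text.lower()
    selected = facets or {}
    hits = []
    for rec in records:
        haystack = " ".join(str(rec.get(f, "")) for f in SEARCHABLE).lower()
        if needle not in haystack:
            continue  # free-text match failed
        if date_from and rec[FILTER_FIELD] < date_from:
            continue  # outside the date-range filter
        if date_to and rec[FILTER_FIELD] > date_to:
            continue
        if any(rec.get(name) != value for name, value in selected.items()):
            continue  # excluded by a chosen facet value
        hits.append(rec)
    counts = {name: Counter(rec.get(name) for rec in hits) for name in FACETABLE}
    return hits, counts


hits, counts = search(RECORDS, "tunnel")
print([rec["Reference"] for rec in hits], dict(counts["Status"]))
```

A user would typically start with the free-text query, read the facet counts, and then narrow the results by passing a facet value or a date range on the next call.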

Microsoft Azure was then used to deliver the project with the following core components:

  • Blob Storage for files and original database backups
  • Cosmos DB for JSON objects
  • Azure Search to provide out-of-the-box search capabilities
  • Key Vault to enable encryption
  • App Service to allow the solution to be run with no infrastructure

As well as being available to the business as a live and accessible collection, the Crossrail Archive will be monitored for usage and potential file format obsolescence. Once a dataset has reached the end of its business retention period, is approaching the end of its file format viability, or has usage figures below a predetermined number, it will be extracted and transferred into the Corporate Archives' full digital preservation repository. During the intermediate period, usage figures and enhanced content knowledge can be used to make full re-appraisal, classification and public access decisions, so that these can be applied instantly.
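The three transfer triggers can be expressed as a simple rule. The sketch below is an assumption-laden illustration: the `DatasetStatus` fields, the yearly access count and the `min_accesses` threshold are hypothetical stand-ins for whatever monitoring figures the archive actually records.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetStatus:
    """Monitoring figures for one dataset (hypothetical fields)."""
    name: str
    retention_end: date
    format_at_risk: bool      # file format judged close to obsolescence
    accesses_last_year: int   # usage figure from archive monitoring


def transfer_reasons(dataset, today, min_accesses):
    """List every trigger that applies; an empty list means keep the dataset live."""
    reasons = []
    if today >= dataset.retention_end:
        reasons.append("business retention period ended")
    if dataset.format_at_risk:
        reasons.append("file format nearing end of viability")
    if dataset.accesses_last_year < min_accesses:
        reasons.append("usage below predetermined number")
    return reasons


example = DatasetStatus(
    name="Example risk register",
    retention_end=date(2030, 12, 31),
    format_at_risk=False,
    accesses_last_year=3,
)
print(transfer_reasons(example, today=date(2025, 1, 1), min_accesses=10))
```

Returning the list of reasons, rather than a single yes/no flag, keeps a record of why a dataset was moved, which may help with the later re-appraisal decisions.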

