Evanthia Samaras

Last updated on 4 November 2020

Evanthia Samaras is the VERS Senior Officer at Public Record Office Victoria.


Over the past few years, Public Record Office Victoria (PROV) has been working to develop and test solutions to appropriately manage and preserve Lotus Notes email accumulations.

This work has been a part of our wider Victorian Electronic Records Strategy (VERS), which is about ensuring the creation, capture and preservation of authentic, complete and meaningful digital records by the Victorian public sector.

This blog post shares some findings from our Stage 2 email appraisal, disposal and preservation project.

Background

The Victorian State Government in Australia has used IBM’s Lotus Notes application since the 1990s. The central government agencies have now nearly completed a move to Outlook and Microsoft 365. However, massive accumulations of Lotus Notes emails remain and require attention.

How can we identify the emails of value and remove the non-public records? How do we manage personal and sensitive emails in the accumulation? How can we preserve the emails with context and in a format that will remain accessible over time?

Stage 1 project

During 2017/18, PROV completed a Stage 1 project to test an eDiscovery tool on a sample set of 4.6 million Lotus Notes emails from a department. We found the tool performed deduplication very well and could quickly identify certain low value emails using keywords and domains. However, it lacked the features needed to undertake effective functional appraisal of emails and to explore email accumulations in the context of the originating organisation.

Over the past year, a small team at PROV explored and tested other approaches in a Stage 2 project.

Stage 2 project

For this project, we opted to use an accumulation of our own organisation’s email from 2016 to 2019, which amounted to about 1.2 million emails. We initially set up a secure server environment at work for testing software (including the eDiscovery tool used in Stage 1) and for storing and processing the emails. As the challenges of 2020 unfolded, we explored approaches that were practically achievable while working from home. We also delved further into IBM’s Lotus Notes format.

Below are some of our key learnings and outcomes from the project.

Discoveries about the Lotus Notes email format

Lotus Notes uses its own format called Notes Storage Format with the file extension of ‘.NSF’. These files are supported by an IBM Domino Server as well as the IBM Lotus Notes application. An NSF file is capable of holding databases including emails, contacts, design information, user data and appointments.

Once we started working from home, we discovered that the files have strict security and privacy features that prevent them being opened outside the creating environment. This is suitable for secure email communication, but inherently unsuitable for long-term preservation. We determined that other formats such as EML are much more suitable and that the NSF format should be avoided for long-term use, because it relies upon having a Lotus Notes client API to access the files (i.e. access is always mediated by the Lotus Notes software). In addition, very few email processing tools support NSF.

Identifying low value and non-public record emails without access to the email content

Given the high volume of email accumulations, a focus for us is to determine how non-work-related and low value emails can be identified and removed, reducing the overall accumulation size and the associated storage and archival management costs.

As we could not securely access the emails while working from home, we explored whether the email header was sufficient to identify these types of emails.

For a sample of about 9000 emails, we generated a CSV report export from the eDiscovery tool for analysis. Using the CSV, we looked for email subjects that were completely ephemeral (for example, ‘Out of Office’) and terms that could identify low value, non-permanent records (such as ‘Finance’ and ‘Maintenance’). We noted every repeated term until we had a list of over 500 terms.

We found that over 70% of the sample were low value or non-public records.
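The term-matching approach described above can be sketched in a few lines of Python. This is a minimal illustration only, not the scripts used in the project: the function name, the category labels and the short term lists are hypothetical (the project's own list held over 500 terms), and ‘automatic reply’ is an assumed example term that does not come from our analysis.

```python
import re

# Hypothetical term lists for illustration; the project's real list
# contained over 500 terms.
EPHEMERAL_TERMS = ["out of office", "automatic reply"]
LOW_VALUE_TERMS = ["finance", "maintenance"]


def _contains_term(text, terms):
    """Return True if any term appears in text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms)


def classify_subject(subject):
    """Label one subject line as 'ephemeral', 'low_value' or 'review'."""
    text = subject.lower()
    if _contains_term(text, EPHEMERAL_TERMS):
        return "ephemeral"
    if _contains_term(text, LOW_VALUE_TERMS):
        return "low_value"
    # Anything that matches no term is left for a person to appraise.
    return "review"
```

Matching on whole words avoids flagging a subject merely because a term appears inside a longer word. Subjects that match nothing are deliberately left for manual review rather than being treated as records of value by default.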

Email threading

Another area explored in the project was preserving email conversations as threads.

From a technical perspective, we found that threading is surprisingly challenging. There is no single threading mechanism, and not all email clients implement the standard mechanism correctly (including Lotus Notes).

Using the same sample of email header information, we initially developed scripts that applied a simple heuristic: if emails had the same subject, they were considered to be part of the same thread. Subjects were treated as identical even when prefixed by ‘Re:’ or ‘Fwd:’.
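As a rough illustration of this heuristic, the Python sketch below strips reply and forward prefixes and groups subjects that are then identical. It is an assumption-laden example rather than our actual scripts: the function names are hypothetical, and treating ‘FW:’ as a forward prefix and ignoring letter case are choices made for the sketch.

```python
import re
from collections import defaultdict

# One or more leading "Re:", "Fw:" or "Fwd:" prefixes, in any letter case.
_PREFIXES = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalise_subject(subject):
    """Strip reply/forward prefixes so related subjects compare as equal."""
    return _PREFIXES.sub("", subject).strip().lower()


def group_by_subject(subjects):
    """Group subject lines into candidate threads keyed by normalised subject."""
    threads = defaultdict(list)
    for subject in subjects:
        threads[normalise_subject(subject)].append(subject)
    return dict(threads)
```

A heuristic like this will merge unrelated conversations that happen to share a generic subject, and will split a conversation whose subject was edited part-way through.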

We found that managing threads rather than individual emails significantly reduces the volume of records, which aids access and simplifies and reduces the cost of archival management. This was a key precursor to the subject-based classification described in the previous section; examining the subjects of threads was feasible because there were far fewer threads than emails.

We conducted a second experiment by threading a different collection using the formal threading headers in the email. This was also successful, but it highlighted that email systems did not always implement threading according to the standards.
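Threading on the formal headers can be sketched as follows, assuming the emails are available as raw RFC 5322 text with ‘Message-ID’, ‘In-Reply-To’ and ‘References’ headers. This is a simplified example using Python's standard email package, not the tool we used; the function names are hypothetical, and each thread is represented simply as a list of message identifiers under its root message.

```python
from email import message_from_string


def parent_id(msg):
    """Return the Message-ID this email replies to, or None if it has none."""
    in_reply_to = (msg.get("In-Reply-To") or "").split()
    if in_reply_to:
        return in_reply_to[0]
    # Fall back to the last entry in References, which is the direct parent.
    references = (msg.get("References") or "").split()
    return references[-1] if references else None


def build_threads(raw_emails):
    """Map each thread's root Message-ID to the Message-IDs in that thread."""
    parents = {}
    for raw in raw_emails:
        msg = message_from_string(raw)
        message_id = (msg.get("Message-ID") or "").strip()
        if message_id:
            parents[message_id] = parent_id(msg)

    def root_of(message_id):
        seen = set()
        # Walk up while the parent is a known email; 'seen' guards against loops.
        while parents.get(message_id) in parents and message_id not in seen:
            seen.add(message_id)
            message_id = parents[message_id]
        return message_id

    threads = {}
    for message_id in parents:
        threads.setdefault(root_of(message_id), []).append(message_id)
    return threads
```

When a parent email is missing from the collection, the sketch treats the earliest email it can reach as the root. Emails with no ‘Message-ID’ are skipped, which is one place where non-conforming email systems would need extra handling.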

Overall, we found that threading email conversations improves understanding of organisational context, providing richer documentary evidence for management and discovery than an individual email removed from its communication context.

Encapsulating emails for preservation

Lastly, we explored how we can represent email records in our long-term preservation format, VERS encapsulated objects (VEOs). Using a sample collection from a single year, a proof of concept tool was written to convert emails from EML format into VEOs.

We explored several approaches to representing emails as records, such as creating one VEO per email. However, we determined that converting each email thread into a record would be the best approach, because the threaded version gives future users a structured, easy to understand representation of the sequence of interactions between the participants in the conversation.

Within the VEO, the tree structure of the thread was retained, and each email was broken up into its components (headers, email body, and attachments). In addition, data from the email files were extracted to form XML metadata packages.
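To illustrate the idea of breaking an email into components and extracting metadata, here is a minimal Python sketch that works on a single EML message. It is not the proof of concept tool and does not produce a VEO: the function names and the XML element names (‘email’, ‘header’, ‘attachment’) are hypothetical and do not reflect the VERS metadata schema.

```python
import xml.etree.ElementTree as ET
from email import message_from_string, policy

# Headers copied into the metadata; an illustrative selection only.
HEADER_NAMES = ("From", "To", "Date", "Subject")


def split_email(raw):
    """Split one EML message into headers, body text and attachment names."""
    msg = message_from_string(raw, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "headers": {n: str(msg[n]) for n in HEADER_NAMES if msg[n] is not None},
        "body": body_part.get_content() if body_part is not None else "",
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }


def metadata_xml(components):
    """Build a small XML metadata fragment from the split components."""
    root = ET.Element("email")  # hypothetical element names throughout
    for name, value in components["headers"].items():
        ET.SubElement(root, "header", name=name).text = value
    for filename in components["attachments"]:
        ET.SubElement(root, "attachment").text = filename or ""
    return ET.tostring(root, encoding="unicode")
```

A thread-level record would repeat this for every email in the thread and nest the results to mirror the reply tree.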

Overall, our VEO creation testing proved promising and produced valid VEOs (i.e. Submission Information Packages) that could be ingested into our new digital archive.

Conclusion

I would like to acknowledge my fellow members of the Stage 2 project team at PROV: Andrew Waugh, Julie McCormack, Christine Mitchell and Jenny Rout.

While our original plans for the project changed because of working from home requirements, we gained valuable learnings and outcomes for managing and preserving email accumulations in the Victorian Government.

Next year we hope to commence planning for a Stage 3 project.

Please see our project page to learn more and keep up to date with our work in this area.

