Tom Wilson

Last updated on 17 January 2022

Tom Wilson is Associate Archivist (Digital Preservation) for the United Nations High Commissioner for Refugees (UNHCR) in Switzerland.


Our recent change of web-crawl suppliers taught us that even the best-laid plans can be derailed by factors beyond one’s control.

UNHCR has been capturing content for its web-archive since 2015, working with Internet Memory Research (IMR) as our supplier to capture, store and display this content. In 2018, IMR informed us that they would be going bankrupt. The timing of this announcement was decidedly inconvenient, as our procurement process for a new supplier had not yet been completed. This left us with the need to download our data from IMR and store it at UNHCR until we knew who our new supplier would be. We therefore drew up a plan to transfer the data, store it and then transfer it to our new supplier, all the while checking that the data remained complete and uncorrupted by this moving and storing.
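For readers who want to picture the checking step, here is a minimal sketch of one way to confirm that files remain complete and uncorrupted between moves. It is an illustration rather than a description of our actual tooling: the choice of SHA-256 and the function names `sha256_of`, `build_manifest` and `compare_manifests` are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so that large crawl files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Report files that went missing, changed, or appeared between two checks."""
    shared = before.keys() & after.keys()
    return {
        "missing": sorted(before.keys() - after.keys()),
        "changed": sorted(name for name in shared if before[name] != after[name]),
        "unexpected": sorted(after.keys() - before.keys()),
    }
```

The idea is to build one manifest before a transfer and another after it (or after a period in storage) and compare the two. Three empty lists mean that nothing was lost, altered or added along the way.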

We put our plan into motion. As IMR sank forever into the depths of bankruptcy, we thought that, having successfully transferred all our crawl data from them, moved it into our storage and check-summed it repeatedly without issue, we were home and dry. All we then had to do was transfer this data to our next supplier and, hey presto, we could pick up where we left off. The road to achieving this turned out to have more twists and turns than expected.

From the start, we were aware that the data from IMR’s captures came to more than we were expecting: 5TB instead of 3TB. This didn’t ring any alarm bells; the internet is ever changing and websites are getting larger and more complex, so it didn’t worry us too much that we had more data than originally foreseen. There was a minor issue in that our procurement had required our new supplier to take and display 3TB of data from our initial web-archiving programme, so we had to have a series of discussions on how to deal with the extra volume. Mirrorweb, the successful bidder, couldn’t store and display an extra 2TB without incurring extra costs, and UNHCR didn’t currently have the budget to meet those costs, so we agreed to select 2TB of the data to keep in offline storage at UNHCR. Whilst not ideal, this was agreed to be the best way to get the web-archive back up and running whilst staying in budget. The extra data would be stored safely in our Digital Preservation System until we could get it back into the portal, and our capture frequency of the sites would mean that any gaps in the archived content would be minimal. So, with the help of Mirrorweb, we began to look for candidates to go into temporary storage.

We started uncovering issues when we began to dig into the transferred data in order to select files. At first, it was clear which files belonged to which site and which crawl date. Once we had sorted out the clearly labelled content, however, we were left with a number of files that we couldn’t match to the lists of crawls that we had given IMR and which gave little or no clue as to what they contained. We are still investigating whether they are superfluous data or connected to other crawls.

As we dug deeper, we made some more “interesting” discoveries. The most “interesting” was the set of patches for one Twitter crawl, made in the later stages of IMR’s contract. Generally, the captures for even the most prolific Twitter accounts came to at most 1GB in size, so when we found a set of patches for one Twitter account crawl, made in June 2017, that totalled 600GB, we suspected that the crawlers had got a little carried away a little too often, and that this was why we had 5TB rather than 3TB. Examination of other Twitter crawls and their sizes deepened this suspicion. In the end, our policy of crawling core sites and social media accounts more than once a year gave us space to manoeuvre and to select nearly 2TB for temporary offline storage, whilst keeping the files from each crawl date together.

Once we had selected and moved the “extra” data to storage, there was one last surprise in store. Upon loading our data into the new portal and trying to access it, we discovered that some of our earlier content, mainly Twitter accounts, was inaccessible. This is one final mystery for us to investigate before we can consider our work on the old data to be done!
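The size comparison that exposed the 600GB Twitter crawl can be sketched in a few lines. This is an illustration only, not the method we actually used: it assumes a hypothetical layout with one subdirectory per crawl, and the names `crawl_sizes` and `oversized` are invented for the example. The 1GB limit echoes the typical upper size of our Twitter captures mentioned above.

```python
from pathlib import Path

GB = 1000 ** 3  # decimal gigabyte, in bytes


def crawl_sizes(root: Path) -> dict[str, int]:
    """Total bytes per immediate subdirectory of root (assumed: one per crawl)."""
    return {
        crawl.name: sum(f.stat().st_size for f in crawl.rglob("*") if f.is_file())
        for crawl in sorted(root.iterdir())
        if crawl.is_dir()
    }


def oversized(sizes: dict[str, int], limit_bytes: int) -> list[tuple[str, int]]:
    """Return the crawls larger than limit_bytes, biggest first."""
    flagged = [(name, size) for name, size in sizes.items() if size > limit_bytes]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical location of the transferred crawl data.
    for name, size in oversized(crawl_sizes(Path("transferred_data")), 1 * GB):
        print(f"{name}: {size / GB:.1f} GB")
```

Sorting the flagged crawls by size puts the most suspicious ones at the top of the list, which is a reasonable place to start when deciding what to inspect first.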

As things currently stand, the vast majority of our later captures are accessible, and the damage is mostly limited to the first year of captures with IMR. The 2TB of extra data remains in temporary storage until we can return it to the portal, and we continue to investigate the issues with our earlier content. In the end, the lesson we learnt was that planning and adaptability can provide a buffer against many unforeseen mishaps but, even then, there are sometimes unexpected twists and turns in the road.

 


