Simple Checksum Utilities

I was at the Digital Preservation Roadshow in Edinburgh on 28 October 2009. I heard several speakers insist that if we create or accept digital files, part of the process should be running corruption checks with checksum or hash software.

I would like to know which free program to download to create a checksum for each file, how best to keep the information (printout or digital file), and how to make a comparison to spot corrupted bits three years in the future.

I wonder if you would be kind enough to recommend something not too complicated that we could use. I would be very grateful, because I understand it is important but can't find my way through all the programs that spring up when searching on the internet.

Many thanks for what help you can give me

Brenda Dreghorn

Senior Conservator

Cumbria Archive Service


Hi Brenda

I would strongly recommend Jacksum as a good starting point for checksum generation and comparison: http://www.jonelo.de/java/jacksum/

Jacksum covers all the options for checksums; it is also in line with the latest government guidelines on software choices, being open source. Further, it is entirely cross-platform, so when Cumbria considers its next desktop operating system you can be sure it will continue to work. It would also be easy to integrate into a full digital preservation workflow in the future.
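
To give a feel for what running it looks like - treating the exact flags as an assumption from memory of the Jacksum documentation, so do check them - generating SHA1 checksums for a whole directory tree might be as simple as:

    java -jar jacksum.jar -a sha1 -r . > checksums.txt

where -a selects the algorithm and -r recurses into subdirectories; the saved checksums.txt is then the list you compare against later.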

You will probably need to work with someone from the ICT team to get it configured for your use. In the first instance I would store the checksums in a spreadsheet, as this will be easiest to migrate into a more automated system in the future - maybe using columns something like "filename, file type, file source, date stored, date last checked", and then using conditional formatting to highlight files that are due to be checked, e.g. after 3 years.
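
To make that concrete, here is a minimal sketch of how such a spreadsheet might be seeded from the command line on a Unix-like system (the *.tif pattern and the reduced column set are illustrative assumptions; Jacksum's own output could equally be imported):

    # build a CSV of checksums ready for import into a spreadsheet
    echo "filename,checksum,date stored" > checksums.csv
    for f in *.tif; do
        printf '%s,%s,%s\n' "$f" "$(sha1sum "$f" | cut -d' ' -f1)" "$(date +%F)" >> checksums.csv
    done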

I think you will, after a while, need to move towards a more complete preservation system, but the above would be a good start.

Hope that helps.

Regards

Chris Puttick

CIO

Oxford Archaeology: Exploring the Human Journey

 


Good morning Brenda,

 

You've asked a very simple question to which there is a deceptively complex answer! In short, I don't have a software solution for you, but I think it's worth considering some of the complexities of what you are asking. Checksums, or hashes - message digests in PREMIS speak - are one way of checking whether digital material, typically a file, has changed, or whether it is still what it purports to be. Hashing can only tell us that something has changed; it can't tell us what has changed or why. So it's a technique that can work really well for digital material that seldom if ever changes, such as material in a digital library. Where material frequently changes it can be less useful.

The process sounds simple, but this hides complexities. Creating a hash value for an object, storing it, then periodically regenerating the hash value and comparing it to the one first taken - known as the validation process - raises a number of points:

1. When do you take a hash value? This can be important. If you have a digitisation workflow in which material is scanned, then the files cropped, de-skewed and maybe resized, you might want to create your hash value only after all of these changes have been made. Otherwise each change - made legitimately to the file - would result in a validation error. Deciding exactly when to create a hash value is key to your workflow: it needs to be after intended and legitimate changes have been made, but balanced against the danger of inadvertent or accidental change during handling or storage.

2. What material do you apply the hash value to? Do you hash the 'live' copy but rely on a backup in case something goes wrong? If so, what happens if the backup copy changes? How would you know? Hashing both live and backup copies creates potential synchronisation issues, and questions as to which copy is your definitive and authentic copy.

3. Hashing needs to be an automated process - do you really want to create all those hashes by hand? - and the subsequent checking definitely has to be automated too. More than just a bit of software, you need a process or workflow that can run autonomously over the body of material you want to manage (see the sketch after this list for the shape such a check might take). It isn't something that a human can do reliably or quickly enough. Even for small amounts of material, hashing can be time-consuming and prone to human error.

4. In order for the system to work you need to be able to securely store the hash values that are considered the definitive set. The larger the body of material you want to manage, the larger this 'store' needs to be, and the more sophisticated and secure it needs to be. Whilst hash values can be regenerated, the point of having them is to be able to compare a current value with one taken at some point in the past.

5. Having hash values doesn't remove the need for proper backup and the ability to restore that material. 

6. In order for hashing to be most effective, the process needs to be performed regularly. Checking each file on an annual basis probably isn't sufficient: anything could happen in the months between checks, and you'd not know of any problems for up to a year. The frequency of your checking depends upon the number of files you want to check; the more files, the bigger the system required. Many repositories and digital stores can run their own validation routines, but as the body of material held grows, the task of frequently checking each file becomes more burdensome, and systems can slow down as they spend increasing amounts of time running validation routines. Depending upon the size of the body of material you want to manage, you might need a stand-alone server whose only task is to run validation routines 24x7. This can be expensive and complex.

7. Why do you want to use hashes - what is your perceived risk or threat? Is there some particular problem or risk to your material that you've identified? If so, is there a better way to manage that risk or threat? Should you change your backup strategy? Is there a change to be made to your handling of important digital material? Is your storage adequate?

8. What happens when you receive a validation failure? Who receives that message? Hashing can only tell you that something has changed, not what has changed. You'll need to work out what to do when a file fails the validation test. Do you replace that file? If so, where does a reliable and authentic copy come from? How do you work out why a file has failed a validation test - is that even possible? This raises the horrible spectre of having to run validation routines not only over 'live' copies of material but also over backup copies.

9. Who manages your validation process? Do you want IT to run and own the process? They have great technical skills, but do they understand the material sufficiently well? Do you run and own the process? You understand the material well enough, but do you have sufficient technical skills? I'd suggest this is a case where close collaboration with your own IT is essential, in effect combining your understanding of the material with their technical skills.
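
As promised under point 3, here is a minimal sketch of what an automated, scheduled validation pass might look like on a Unix-like system with GNU coreutils (the paths and email address are hypothetical, and a real workflow would need reporting appropriate to your organisation):

    # re-validate archival masters against the definitive checksum list,
    # which is stored separately from the data itself
    cd /archive/masters || exit 1
    if ! sha1sum --check --quiet /secure/checksums.sha1 > /tmp/validation.log 2>&1; then
        # --quiet suppresses per-file "OK" lines, so the log holds only failures
        mail -s "Checksum validation failures" curator@example.org < /tmp/validation.log
    fi

Run from cron, something like this meets the requirement that both generation and checking happen without human intervention.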

We have a digital object repository that is capable of running validation routines over the material it holds. However, given our digitisation plans and the volumes of material we're planning and anticipating putting into our repository, we're considering 'outsourcing' validation to a subsystem. This subsystem would be separate from the repository, would operate at the file-store level, act directly on our archival master copies, and run 24x7. Even so, we still have questions of design and exactly how this would work.

Hashing won't stop you from driving off a cliff; it can only tell you that the cliff exists, or that you have driven over it. You need to establish the business need for a hashing strategy and be assured that this is the right approach to your needs. Whilst it seems a simple approach, hashing can be complex and raise more questions than first thought. Having come up with your business requirements, you might find that only certain software suits them, or that there are other ways of addressing the problem.

I hope this doesn't put you off! More than happy to continue the conversation - please just get in touch.

Kind regards, dnt

Dave Thompson, Digital Curator

Digital Services

Wellcome Library


Dear Brenda,

I've experimented with using the (free) OpenSSL software here at the BBC Archives. It seems a lot quicker at generating checksums than the Windows software that was demonstrated to us at the recent DPC event. Speed is important to us because of the very large files (video+audio) that we want to derive a digital signature for. We use OpenSSL on a Linux workstation (where it is supplied as part of the operating system), but according to Wikipedia there is a Windows version: http://en.wikipedia.org/wiki/Openssl

In addition, we have moved to using SHA1 digital signatures rather than MD5. There seem to be advantages in this - not least because you get a larger (40-character) digital signature checksum.

I have only used a command-line driven version of OpenSSL; I don't know if anyone has packaged it up into GUI-driven software. You can use OpenSSL both to generate the digital signatures and to verify them against files that you have already derived a signature for.
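
For illustration, a minimal sketch of that generate-then-verify cycle (the *.mxf pattern is an assumption; note that openssl dgst has no built-in check mode, so verification here is regenerate-and-compare using diff):

    # generate SHA1 digests and keep the list
    openssl dgst -sha1 *.mxf > checksums.sha1

    # later: regenerate and compare - any diff output means a file has changed
    openssl dgst -sha1 *.mxf | diff checksums.sha1 -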

Yes - certainly keep electronic versions of the keys you generate.  It will save a lot of awkward re-keying.

Best regards,

James Insell

BBC


The MoL are recommending a particular piece of free checksum software in their new guidelines - see below.

http://www.museumoflondonarchaeology.org.uk/NR/rdonlyres/0A02B0B8-A175-44F6-8B77-99BFC3BEAD98/0/Standards_27_Digital_data.pdf

Checksums are covered in section 2.7.2.4.

Nicky Scott

Oxford Archaeology


Brenda

William forwarded your question about checksum utilities to DPC members to see if we could help. My answers here assume you are using Microsoft Windows; if you aren't, similar utilities exist for other operating systems.

Your question had 3 parts: what software to use; how to keep the checksum information; and how to check it 3 years later.

I'll answer the first one about software last, as there are a number of possibilities. The other questions are easier to deal with!

All the programs that produce checksums tend to produce a single file containing a list of filenames and their checksums. This file is small, and to protect your files against malicious interference it is best to store a copy separately from the files themselves. Once I might have suggested doing this on a floppy disk; nowadays you might use a flash disk or a CD - or just another computer. If malicious interference isn't a concern, then there's no need to store the checksum file elsewhere.

When the time comes to check the files, put the checksum file back in the same directory as the files to check (if you've made a separate copy) and use the same program that created it, but in its checking mode. (How this is done is slightly different for each program.)
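
The general pattern is the same everywhere. As a sketch, with the Unix-style md5sum tool (the Windows programs below follow the same create-then-check shape, just with different commands or buttons):

    md5sum *.tif > checksums.md5    # create the checksum file
    md5sum -c checksums.md5         # later: check - prints OK or FAILED per file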

What software to use? There's a wide choice. Microsoft produce their own, which you can download here:

http://www.microsoft.com/downloads/details.aspx?FamilyID=B3C93558-31B7-47E2-A663-7365C1686C08&displaylang=en

There's md5summer, a standalone program with a nicer interface than the Microsoft one: http://www.md5summer.org/

There's also a range of programs that operate as Windows Shell extensions - which means that they integrate with Windows Explorer, allowing you to create and check checksums directly from Explorer, or the 'My Documents' view: http://code.kliu.org/hashcheck/

All of these examples use a checksum method called MD5. A few years ago, this was shown to be crackable - that is, a determined attacker could alter some files in a way which would not change the checksum - and as a result those with more serious security concerns tend to use a checksum called SHA1, or one of its relatives. There aren't as many easy-to-use SHA1 utilities for Windows - you need to be comfortable using the command line. The Microsoft utility I mentioned above can use SHA1, but it is a command-line utility. It is documented here: http://support.microsoft.com/kb/841290
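
For completeness, a sketch of how that Microsoft utility (fciv) is driven with SHA1 - the directory and database names here are illustrative, and the switches are worth checking against the KB article above:

    rem build the checksum database, recursing into subdirectories
    fciv.exe -add C:\Data -r -sha1 -xml hashes.xml

    rem later: verify the files against the stored database
    fciv.exe -v -sha1 -xml hashes.xml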

Hope this is helpful.

Kevin Ashley

Head of Digital Archives, ULCC


There's a list at: http://www.thefreecountry.com/utilities/free-md5-sum-tools.shtml which could easily be Googled by anyone.

I do not use checksum software very often, and I certainly don't create hash files, but I am a bit of an OS hobbyist, so occasionally need to check ISO files for integrity.

All best,

Mike Mertens

RLUK


Hi Brenda ~

I spoke with one of our more technical staff and he recommends that you look at the BagIt specification from the Library of Congress and John Kunze (http://sourceforge.net/projects/loc-xferutils/).  While BagIt was developed as a tool to transfer data, it might also be useful to you because it defines a way of storing the checksums and directory organization.
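
For a sense of what that looks like, a minimal bag is laid out along these lines (simplified from the BagIt specification; the filename under data/ is illustrative):

    mybag/
        bagit.txt            (bag declaration: BagIt version and encoding)
        manifest-md5.txt     (one line per payload file: "<checksum>  data/<path>")
        data/
            image001.tif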

There are many ways of generating checksums. Core Java has some facilities, and another Java-based checksum utility is Jacksum: http://www.jonelo.de/java/jacksum/

If your objective is to detect file corruption, an MD5 checksum should be adequate. At Portico, in addition to detecting file corruption, we use checksums to identify duplicate files, and for that we use a much larger SHA512 checksum.
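
As a sketch of the duplicate-detection idea on a Unix-like system (GNU coreutils assumed): identical file content always yields identical digests whatever the filenames, so sorting a digest list brings duplicates together:

    sha512sum * | sort | uniq --check-chars=128 --all-repeated=separate

A SHA512 digest is 128 hex characters, so --check-chars=128 tells uniq to compare only the digest and print each group of duplicate files.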

Let me know if you have other questions!

~ Amy Kirchhoff

Archive Service Product Manager

ITHAKA

 


I'd wager people were simply talking about SHA1 hashes (much like Heritrix/Wayback uses). The tools available rather depend on their setup. They mention they're going to "create or accept digital files" - are they using particular software to import these? If so, does it already have a hash calculator? I've just had a quick look, and Microsoft publish a "File Checksum Integrity Verifier" which they might be comfortable using: http://support.microsoft.com/kb/841290

Roger Coram

British Library

 


 

We use FastSum (http://www.fastsum.com/) - the Standard Edition - which is cheap, easy to use and reliable. It can be a little slow when dealing with huge files - about 4 minutes for 10 GB - but in general it's fit for purpose.

Best

Matthew Woollard    

UK Data Archive, University of Essex