Fixity and checksums

Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark

Fixity

“Fixity, in the preservation sense, means the assurance that a digital file has remained unchanged, i.e. fixed.” (Bailey, 2014). Fixity doesn’t just apply to files, but to any digital object that contains a series of bits where that ‘bitstream’ needs to be kept intact with the knowledge that it hasn’t changed. Fixity could be applied to images or video inside an audiovisual object, to individual files within a zip, to metadata inside an XML structure, to records in a database, or to objects in an object store. However, files are currently the most common way of storing digital materials, and the fixity of files can be established and monitored through the use of checksums.

Checksums

A checksum on a file is a ‘digital fingerprint’ whereby even the smallest change to the file will cause the checksum to change completely. Checksums are typically created using cryptographic techniques and can be generated using a range of readily available and open source tools. It is important to note that whilst checksums can be used to detect if the contents of a file have changed, they do not tell you where in the file the change has occurred.
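
The ‘fingerprint’ behaviour can be illustrated with a minimal sketch using Python's standard hashlib module. The function name checksum and the sample text are assumptions made for this illustration only; they do not come from any particular preservation tool.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal checksum of a byte string."""
    return hashlib.new(algorithm, data).hexdigest()


original = b"The quick brown fox jumps over the lazy dog"
altered = b"The quick brown fox jumps over the lazy cog"  # one character changed

print(checksum(original))
print(checksum(altered))
# The two values differ, which reveals that a change happened,
# but not which byte was changed.
print(checksum(original) == checksum(altered))
```

Comparing the two printed values shows only that the content differs; nothing in the checksums indicates which character was altered.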

Checksums have three main uses:

To know that a file has been correctly received from a content owner or source and then transferred successfully to preservation storage.
To know that file fixity has been maintained while that file is being stored.
To be given to users of the file in the future so they know that the file has been correctly retrieved from storage and delivered to them.

This allows a ‘chain of custody’ to be established between those who produce or supply the digital materials, those responsible for their ongoing storage, and those who need to use the digital material that has been stored. In the OAIS reference model (ISO, 2012) these are, respectively, the producers, the OAIS itself (the repository), and the consumers.
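
The first link in that chain, confirming that a file arrived as the producer sent it, might be sketched as follows. This is an illustration under assumptions: the function names file_checksum and verify_transfer are hypothetical, and the producer is assumed to supply a SHA-256 value as a hexadecimal string alongside the file.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute a file's checksum, reading in chunks so large files are not loaded whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(path, expected_checksum, algorithm="sha256"):
    """Return True if the file at `path` matches the checksum supplied with it."""
    # Normalise case and surrounding whitespace in the supplied value.
    return file_checksum(path, algorithm) == expected_checksum.strip().lower()
```

The same comparison can be repeated at each hand-over: after copying to preservation storage, and again when a copy is delivered to a user.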

Application in digital preservation

A short video explaining the basics of Integrity (Fixity) Checking in Digital Preservation

If an organisation has multiple copies of its files, for example as recommended in the Storage section, then checksums can be used to monitor the fixity of each copy of a file. If one of the copies has changed, one of the other copies can be used to create a known good replacement. The approach is to compute a new checksum for each copy of a file on a regular basis and compare this with the reference value that is known to be correct. If a deviation is found then that copy is known to have been corrupted in some way and will need replacing with a new good copy. This process is known as ‘data scrubbing’.
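
A minimal sketch of such a scrubbing pass is shown below. It assumes that the copies are ordinary files reachable by path and that the reference checksum is an MD5 value held elsewhere; the names file_checksum and scrub are hypothetical. A production system would also log each check and would normally keep the damaged copy for investigation rather than overwrite it immediately.

```python
import hashlib
import shutil


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Compute a file's checksum, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scrub(copies, reference_checksum, algorithm="md5"):
    """Check every copy against the reference value and repair damaged copies.

    Returns the list of paths that were repaired. Raises RuntimeError if
    no copy matches the reference, because no good source then exists.
    """
    good, bad = [], []
    for path in copies:
        try:
            intact = file_checksum(path, algorithm) == reference_checksum
        except OSError:
            intact = False  # a missing or unreadable copy counts as damaged
        (good if intact else bad).append(path)

    if not good:
        raise RuntimeError("no intact copy available; manual recovery needed")

    for path in bad:
        shutil.copyfile(good[0], path)
        # Confirm that the replacement itself matches the reference value.
        if file_checksum(path, algorithm) != reference_checksum:
            raise RuntimeError(f"repair of {path} failed verification")
    return bad
```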

Checksums are ideal for detecting if unwanted changes to digital materials have taken place. However, sometimes the digital materials will be changed deliberately, for example if a file format is migrated. This causes the checksum to change, so new checksums must be established after the migration. These then become the reference for checking the data integrity of the new file going forward.

Files should be checked against their checksums on a regular basis. How often to perform checks depends on many factors including the type of storage, how well it is maintained, and how often it is being used. As a general guideline, checking data tapes might be done annually and checking hard drive based systems might be done every six months. More frequent checks allow problems to be detected and fixed sooner, but at the expense of more load on the storage system and more processing resources.

Checksums can be stored in a variety of ways, for example within a PREMIS record, in a database, or within a ‘manifest’ that accompanies the files in a storage system.
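
As an illustration of the manifest option, the sketch below builds a simple text manifest with one line per file, each holding the checksum, two spaces, and the file's path relative to the directory. This layout mirrors the output of the md5sum tool; the function names file_checksum and build_manifest are assumptions for this example.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Compute a file's checksum, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, algorithm="md5"):
    """Return manifest text listing every file under `directory`."""
    root = Path(directory)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root).as_posix()
        lines.append(f"{file_checksum(path, algorithm)}  {relative}\n")
    return "".join(lines)
```

The function returns the manifest as text rather than writing it, so the caller can decide where to store it. Writing it inside the same directory would cause a later run to list the manifest itself.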

Tool support is good for checksum generation and use. As they are relatively simple functions, checksums are integrated into many other digital preservation tools. For example, tools may generate checksums as part of the ingest process and add this fixity information to the Archival Information Packages they generate, or allow manifests of checksums to be generated for multiple files and for the manifest and files to be bundled together for easy transport or storage. In addition, md5sum and md5deep are simple command line tools that operate across platforms to generate checksums on individual files or directories.
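
Where a manifest in the md5sum output layout already exists (checksum, a space, a mode character that is either a space or an asterisk, then the file name), checking files against it might look like the sketch below. The function names are hypothetical, and the sketch does not handle the escaped file names that md5sum writes for names containing backslashes or newlines.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Compute a file's checksum, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, root, algorithm="md5"):
    """Check each file listed in a manifest.

    Returns a dict mapping each listed name to "ok", "changed" or "missing".
    """
    results = {}
    root = Path(root)
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.rstrip("\r\n")
            if not line:
                continue
            expected, _, rest = line.partition(" ")
            # Drop the one-character mode marker (" " or "*") before the name.
            name = rest[1:] if rest[:1] in (" ", "*") else rest
            target = root / name
            if not target.is_file():
                results[name] = "missing"
            elif file_checksum(target, algorithm) == expected.lower():
                results[name] = "ok"
            else:
                results[name] = "changed"
    return results
```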

There are several different checksum algorithms, e.g. MD5 and SHA-256, that can be used to generate checksums of increasing strength. The ‘stronger’ the algorithm, the harder it is to deliberately change a file in a way that goes undetected. This can be important for applications where there is a need to demonstrate resistance to malicious corruption or alteration of digital materials, for example where evidential weight and legal admissibility are important. However, if checksums are being used to detect accidental loss or damage to files, for example due to a storage failure, then MD5 is sufficient and has the advantages of being well supported in tools and quick to calculate.
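
An organisation that wants both a fast MD5 value for routine checks and a stronger SHA-256 value for evidential purposes does not need to read each file twice. The sketch below, which uses the hypothetical function name multi_checksum, feeds every chunk of the file to several hashlib objects in a single pass.

```python
import hashlib


def multi_checksum(path, algorithms=("md5", "sha256"), chunk_size=1024 * 1024):
    """Compute several checksums for one file while reading it only once."""
    digests = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

The result is a dictionary keyed by algorithm name, so the algorithm used can be recorded alongside each stored value.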

The Handbook follows the National Digital Stewardship Alliance (NDSA) preservation levels (NDSA, 2013) in recommending four levels at which digital preservation can be supported through file fixity and data integrity techniques. Many of the benefits of fixity checking can only be achieved if there are multiple copies of the digital materials, for example allowing repair if integrity of one of the copies has been lost.

Level	Activity	Risks addressed and benefits achieved
1	Check file fixity on ingest if it has been provided with the content. Create fixity info if it wasn’t provided with the content.	Corrupted or incorrect digital materials are not knowingly stored. Authenticity of the digital materials can be asserted. Baseline fixity established so unwanted data changes have potential to be detected.
2	Check fixity on all ingests. Use write-blockers when working with original media. Virus-check high risk content.	No digital material of unconfirmed integrity can enter preservation storage. Evidential weight supported for authenticity. Assurance can be given to all content providers that their content has been safely received. Original media is protected. No malicious content can enter preservation storage.
3	Check fixity of content held on preservation storage systems at regular intervals. Maintain logs of fixity info and supply audit on demand. Ability to detect corrupt data. Virus-check all content.	Protection from wide range of data corruption and loss events. Problems with storage are detected earlier. Data corruption or loss does not go undetected due to ‘silent errors’ or ‘undetected failures’. Digital materials are not in a state of ‘unknown’ integrity. Ongoing evidential weight can be given that digital materials are intact and correct.
4	Check fixity of all content in response to specific events or activities. Ability to replace/repair corrupted data. Ensure no one person has write access to all copies.	Failure modes that threaten digital materials are proactively countered. All copies of digital materials are actively maintained. Assurance to users of the integrity and authenticity of digital materials being accessed. Effectiveness of preservation approach can be measured and demonstrated. Compliance with standards, e.g. ISO 16363 Audit and certification of trustworthy digital repositories.

Write-blocking

Note that the National Digital Stewardship Alliance (NDSA) recommends the use of write-blockers at level 2. This is to prevent write access to the media that digital materials might be on before they are copied to the preservation storage system. For example, if digital material is delivered to an organisation on a hard disc drive or USB key then a write-blocker would prevent accidental deletion of this digital material when the drive or key is read. Digital material might not be on physical media: it could be on a legacy storage server or delivered through a network transfer such as an FTP upload. In these cases write-blockers wouldn't apply and other measures would be used to make the digital material ‘read only’ on the source, and hence immutable, until it has been confirmed that the digital material has been successfully transferred to preservation storage. Write-blockers also don't exist for all types of media. If a write-blocker is applicable then the costs and skills required to use it should be balanced against the risk of damage to the original digital material or the need for rigorous data authenticity. Therefore, some organisations might consider use of write-blockers to be unnecessary, or to be a level 3 or level 4 step.
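
One of the ‘other measures’ for material held on an ordinary file system is to remove write permission from the source files until the transfer has been verified. The sketch below shows this with Python's os and stat modules; the function names are hypothetical. It is an illustration only and is not equivalent to a hardware write-blocker, since a privileged user or the file's owner can still restore write permission.

```python
import os
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(path):
    """Clear the write permission bits on a file and return the new mode."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    read_only = mode & ~WRITE_BITS
    os.chmod(path, read_only)
    return read_only


def is_read_only(path):
    """Return True if no write permission bit is set on the file."""
    return not os.stat(path).st_mode & WRITE_BITS
```

Permission bits guard against accidental changes by ordinary processes. Where evidential weight matters, this would need to be combined with stronger controls on the source system.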

Resources

Bailey, J., 2014, Protect Your Data: File Fixity and Data Integrity, The Signal, Library of Congress.

http://blogs.loc.gov/thesignal/2014/04/protect-your-data-file-fixity-and-data-integrity/

Checking Your Digital Content: What is Fixity and When Should I Be Checking It?

http://digitalpreservation.gov/ndsa/working_groups/documents/NDSA-Fixity-Guidance-Report-final100214.pdf?loclr=blogsig

Many in the preservation community know they should be checking the fixity of their content, but how, when and how often? This document published by NDSA in 2014 aims to help stewards answer these questions in a way that makes sense for their organization based on their needs and resources (7 pages).

The "Checksum" and the Digital Preservation of Oral History

A good short overview not limited to oral history, this video provides a brief introduction to the role of the checksum in digital preservation. It features Doug Boyd, Director of the Louie B. Nunn Center for Oral History at the University of Kentucky Libraries. (3 mins 25 secs)

References

Bailey, J., 2014. Protect Your Data: File Fixity and Data Integrity. The Signal. [blog]. Available: http://blogs.loc.gov/thesignal/2014/04/protect-your-data-file-fixity-and-data-integrity/

ISO, 2012. ISO 14721:2012 - Space Data and Information Transfer Systems – Open Archival Information System (OAIS) – Reference Model, 2nd edn. Geneva: International Organization for Standardization. Available: https://www.iso.org/standard/57284.html

NDSA, 2013. The NDSA Levels of Digital Preservation: An Explanation and Uses, version 1 2013. National Digital Stewardship Alliance. Available: http://www.digitalpreservation.gov/ndsa/working_groups/documents/NDSA_Levels_Archiving_2013.pdf
