
AN OAIS-COMPLIANT ARCHIVING PLATFORM WITH DNA CONNECTOR WITHIN OLOS

Pierre-Yves Burgi, Hugues Cazeaux, Dario Genga, Michaël El Kharroubi, Florient Serex, Jérôme Charmet


In accordance with the principles of OAIS-compliant repositories (ISO 14721), OLOS (olos.swiss) offers an open, modular, CoreTrustSeal-certified architecture for the long-term preservation of data in public administration, cultural heritage, and research. Once deployed in the cloud, its independent modules offer a range of services that enable users to prepare their archives for preservation: to submit them via a pre-ingest step followed by ingestion (submission information package, SIP), to store them physically (archival information package, AIP), to index their metadata (represented in standard formats such as a METS container with PREMIS and DataCite fields), and to access them (dissemination information package, DIP) according to specific rights and data sensitivity. This suite of services ensures both the implementation of best practices in the field (virus detection, format detection, checksum calculation, integrity verification, replication, and so on) and tight integration with other information systems and all types of storage media, including those based on biotechnology.

As part of the DNAMIC project (dnamic.org), which is funded by the European Pathfinder program, we have developed a DNA connector that interfaces OLOS with a micro-factory capable of processing DNA autonomously, from synthesis to sequencing, thereby enabling the automated storage of data in DNA. Thanks to its extremely high data density (hundreds of petabytes per gram), longevity (thousands of years with minimal degradation), and sustainability (very low energy consumption), DNA is set to revolutionize archival preservation. As long as there is life, humanity will retain mastery over DNA, so the medium, unlike magnetic and optical systems, carries no risk of technological obsolescence.

To archive documents in synthetic DNA, the AIPs are processed via the DNA connector once they have been imported into OLOS. Using an encoder/decoder (CODEC), the connector first encodes the binary AIP into a DNA representation, then sends it to the micro-factory, which distributes the genomic tasks among the relevant modular devices. DNA is composed of four bases, the nucleotides adenine, thymine, cytosine, and guanine, designated by the letters A, T, C, and G, respectively. Storing information in DNA therefore requires a coding step that converts the binary representation into a quaternary (base-4) one. A simple coding consists, for example, of mapping the bit pairs 00 to A, 01 to C, 10 to G, and 11 to T. Several types of encoding are possible, but they must respect biological constraints, such as avoiding successive repetitions of the same nucleotide ("homopolymers") and certain motifs, and keeping a balanced distribution between GC and AT nucleotides; the simple coding above does not satisfy these constraints.

Once assembled, the strands are passed to the micro-factory, which synthesizes DNA letter by letter to create the molecules. These molecules are then packaged in dehydrated form in vials. Since the chemical synthesis process has a relatively low error rate only for strands shorter than 300 nucleotides, the binary files are segmented to comply with this limit. To enable the files to be reconstructed after sequencing, each segment carries an index, encoded in its nucleotide sequence. Redundant error-correction information is also added. Finally, each strand belonging to a given AIP is flanked at both ends by two additional DNA sequences specific to that AIP. These sequences, called "primers," are used to uniquely identify the AIPs contained in the vials.
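The simple two-bits-per-base mapping, the two biological checks, and the indexed segmentation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the DNAMIC CODEC: all function names, the 2-byte index, and the 32-byte payload size are hypothetical choices, and error correction and primers are left out.

```python
# Illustrative sketch only; names and sizes are assumptions, not the DNAMIC CODEC.
BASES = "ACGT"  # 00 -> A, 01 -> C, 10 -> G, 11 -> T (the simple mapping from the text)


def encode_bytes(data: bytes) -> str:
    """Map each pair of bits to one nucleotide, most significant bits first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def decode_bases(strand: str) -> bytes:
    """Inverse of encode_bytes; expects a length that is a multiple of 4."""
    if len(strand) % 4:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def longest_homopolymer(strand: str) -> int:
    """Length of the longest run of one repeated nucleotide."""
    longest = run = 0
    previous = ""
    for base in strand:
        run = run + 1 if base == previous else 1
        previous = base
        longest = max(longest, run)
    return longest


def gc_fraction(strand: str) -> float:
    """Share of G and C nucleotides in the strand (0.0 for an empty strand)."""
    return sum(base in "GC" for base in strand) / len(strand) if strand else 0.0


def segment(data: bytes, payload_bytes: int = 32, index_bytes: int = 2) -> list[str]:
    """Split data into indexed strands laid out as [index | payload]."""
    strands = []
    for index, start in enumerate(range(0, len(data), payload_bytes)):
        chunk = index.to_bytes(index_bytes, "big") + data[start:start + payload_bytes]
        strands.append(encode_bytes(chunk))
    return strands
```

With these assumed sizes, a full segment is 34 bytes, or 136 nucleotides, well under the 300-nucleotide limit. A run of zero bytes shows why the simple mapping is not usable as is: `encode_bytes(b"\x00\x00")` yields `AAAAAAAA`, a homopolymer of length 8 with a GC fraction of 0.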

To read an AIP contained in a vial, the synthesizer first produces primers specific to that AIP. The micro-factory uses a micro-pipette to collect a DNA sample, which is then transferred to the amplification (or "PCR") unit. The strands amplified by the primers are then divided into two sets: one is sent back to the DNA storage unit for future use, and the other is sent on to sequencing. In the micro-factory as designed, sequencing relies on the latest nanopore technology, which is compact and affordable but has a relatively high read error rate (around 5% to 10% on average). The reads leaving the sequencer are then decoded by the CODEC. Several steps correct potential errors (nucleotide insertions, deletions, or substitutions) before the reads are reordered according to their index to reconstruct the binary file. The file is then sent back to OLOS to be distributed in DIP format.
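The final reordering step can be illustrated with a short Python sketch. It assumes the reads have already been decoded into hypothetical (index, payload) pairs, and it swaps in a plain per-index majority vote in place of the actual error-correction steps, which the text does not detail; a vote of this kind only helps with substitution-type disagreements between redundant reads, not with insertions or deletions.

```python
from collections import Counter, defaultdict


def reassemble(reads: list[tuple[int, bytes]], expected_segments: int) -> bytes:
    """Rebuild a file from (index, payload) reads by per-index majority vote."""
    by_index: dict[int, Counter] = defaultdict(Counter)
    for index, payload in reads:
        by_index[index][payload] += 1

    # Every segment must have been recovered at least once.
    missing = [i for i in range(expected_segments) if i not in by_index]
    if missing:
        raise ValueError(f"no reads recovered for segment(s): {missing}")

    # most_common(1) gives the payload seen most often for that index.
    return b"".join(
        by_index[i].most_common(1)[0][0] for i in range(expected_segments)
    )
```

Because amplification yields many copies of each strand, reads arrive unordered and duplicated; grouping them by index first makes the order of arrival irrelevant.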

To evaluate the entire OAIS cycle, from the SIP through the encoding of the AIP into the four DNA bases to the delivery of the reconstructed DIP, we prepared several test files with different redundancy profiles. In silico samples were generated using a sequencer noise model. For in vitro samples, we obtained DNA strands using various synthesis and sequencing technologies. Preliminary results from both in silico and in vitro analyses demonstrate the potential of our approach to handle high error rates (up to 20%) and to decode 1-megabyte files in less than 5 minutes. In these experiments, decoding is considered successful only when the AIP checksum is verified. While a single vial could store hundreds of petabytes, the prototype currently in production operates at the megabyte scale; however, optimizing the CODEC and other processes will allow us to expand this capacity far beyond that. We therefore aspire to create more scalable DNA data storage solutions and thereby contribute significantly to the future of digital archiving.
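The success criterion used in these experiments, that the decoded AIP must match its checksum, could be expressed as follows. This is a sketch only: the choice of SHA-256 and the function names are assumptions, since the text does not say which checksum algorithm OLOS applies.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex digest of the data; SHA-256 is an assumed algorithm choice."""
    return hashlib.sha256(data).hexdigest()


def decoding_succeeded(decoded: bytes, expected_checksum: str) -> bool:
    """True only if the decoded bytes reproduce the recorded checksum exactly."""
    return checksum(decoded) == expected_checksum.lower()
```

A single wrong byte in the reconstructed file changes the digest, so this test is all-or-nothing: a decode that is almost correct still counts as a failure.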

 



Votes must be cast online by 1200 (BST/UTC+1) on Monday 6th July.

