by Edinburgh iGEM 2016
In 2040, you will not be able to read this sentence. This isn’t because of some sci-fi apocalyptic event, or because you’ll forget how to read; rather, it will be because we are running out of one of the most essential resources in modern-day life: data storage space.
We are living in the age of big data. Our lives revolve around Twitter, Facebook and our smart devices. The last decade has seen an exponential increase in information usage, creating a demand that will soon outweigh supply. By 2040, global demand will reach 3×10^24 (3 million billion billion) bits.
Considering the amount of energy it takes to run data centres (about 2% of global energy consumption) and the limited supply of raw materials like silicon for manufacturing, it is clear that novel storage methods are of the utmost importance in meeting demand and providing a sustainable, long-term solution to the data storage problem. Leading researchers are working hard to address this issue but no practical solution has been offered… until now.
The University of Edinburgh 2016 undergraduate iGEM (International Genetically Engineered Machine) team kept these considerations in mind when we set out to create a new DNA-based storage system. We are now looking to repurpose DNA to store synthetic data, specifically in the form of nucleotide-encoded words.
Our project is based on a simple idea: develop a modular system whereby DNA fragments, each representing a word, can be strung together into any phrase the end user desires. The system will be secure, so the data will be safely retrievable, as well as cheap and quick to assemble. In essence, we will create a modern DNA typewriter.
DNA is life’s innate information storage mechanism. Established through millions of years of evolution, its properties are already perfect for the purpose. It is very stable, meaning it can survive for thousands of years. In comparison, a commonly used external hard drive lasts a few decades at best. It also allows for very efficient information storage. Storing DNA is less energy-consuming and much more eco-friendly than storing data digitally.
Our first stage is to encode words from Ogden’s Basic English into short DNA sequences. Basic English is a collection of 850-1,000 words that can be used to express most concepts in the English language (http://ogden.basic-english.org/), and has been used on Wikipedia to simplify many of its pages. We have already written the computer programme needed to encode words into sequences (think Google Translate for DNA).
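As a rough illustration of how such an encoder might work (a sketch only, not our actual programme): assume each word is numbered by its position in the word list, and that number is written in base 4 with A, C, G and T as the digits. Five bases are then enough for up to 1,024 words. The function names and the base-to-digit mapping here are assumptions made for the example.

```python
BASES = "ACGT"      # assumed mapping: A=0, C=1, G=2, T=3
WORD_LENGTH = 5     # 4**5 = 1024 codes, enough for a 1,000-word list


def encode_index(index, length=WORD_LENGTH):
    """Write a word's list position as a fixed-length base-4 DNA string."""
    if not 0 <= index < 4 ** length:
        raise ValueError("index does not fit in the chosen sequence length")
    digits = []
    for _ in range(length):
        index, remainder = divmod(index, 4)
        digits.append(BASES[remainder])
    return "".join(reversed(digits))


def decode_sequence(sequence):
    """Recover the list position from a base-4 DNA string."""
    index = 0
    for base in sequence:
        index = index * 4 + BASES.index(base)
    return index


def build_codebook(words):
    """Map every word in the list to its DNA sequence."""
    return {word: encode_index(i) for i, word in enumerate(words)}
```

A sentence would then be encoded by looking up each of its words in the codebook, and decoded by reversing the lookup.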
Each word fragment will be stored in a small digestible BioBrick; digestion will produce two sticky ends flanking the word that will allow fast and efficient assembly of different words. Alternating the types of sticky ends ensures that only one word is added per step. The directionality of the sentence is provided by a unique “anchor” segment that can bind a magnetic bead, and be melted off for easy retrieval. The novelty of our system lies in the ability to rapidly assemble complex sentences without the need to synthesise new oligonucleotides every time. This will be the first time that data has been encoded using words as a basic unit, rather than letters, meaning our system is more efficient than previously designed DNA text storage methods, allowing for very condensed information storage.
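The one-word-per-step rule can be sketched in a few lines. This is a simplified model under stated assumptions, not our lab protocol: there are two hypothetical sticky-end types, "a" and "b"; a fragment joins the chain only if its left end matches the chain's free end; and the anchor is assumed to expose an "a" end. All names are invented for the example.

```python
from dataclasses import dataclass

ANCHOR_END = "a"  # assumption: the anchor segment exposes an "a" end


@dataclass(frozen=True)
class Fragment:
    word: str
    left: str   # end type that joins the growing chain
    right: str  # end type left exposed after joining


def fragment_for(word, position):
    """Even positions use an a/b fragment, odd positions a b/a fragment."""
    if position % 2 == 0:
        return Fragment(word, "a", "b")
    return Fragment(word, "b", "a")


def ligate(chain, free_end, fragment):
    """Join one fragment, or fail if its left end does not match."""
    if fragment.left != free_end:
        raise ValueError("sticky ends are not compatible")
    return chain + [fragment.word], fragment.right


def assemble(words):
    """Build a sentence one word per step, starting from the anchor."""
    chain, free_end = [], ANCHOR_END
    for position, word in enumerate(words):
        chain, free_end = ligate(chain, free_end, fragment_for(word, position))
    return chain
```

In this model, once an a/b fragment has joined, the chain exposes a "b" end, so a second a/b fragment from the same pool cannot attach until a b/a fragment has been added.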
To bring our typewriter into the modern day, and ensure our DNA messages are properly interpreted, we will also encode spell-checking mechanisms into every word fragment. Our spell-check system will be made up of three parts:
1) an optimal rectangular code, which can detect and correct any single base mutation in the word coding region of the DNA with 100% accuracy,
2) a checksum that can detect any damage in a constructed sentence,
3) natural language processing mechanisms from the Python Natural Language Toolkit, which will ensure our DNA sentences are grammatically correct.
These mechanisms ensure high fidelity in the retrieval of information from DNA storage.
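To give a feel for the first two mechanisms, here is a minimal sketch under our own simplifying assumptions; it is not the exact code we use. The word's bases are laid out in a grid, and one check base is stored per row and per column (the sum of that row or column, modulo 4, with A=0, C=1, G=2, T=3). A single mutated base breaks exactly one row check and one column check, which pinpoints it and tells us how to repair it. The sentence checksum here uses CRC-32 from Python's zlib purely as a stand-in.

```python
import zlib

BASES = "ACGT"  # assumed mapping: A=0, C=1, G=2, T=3


def encode_block(word_seq, cols):
    """Append one check base per row and per column of the word grid."""
    values = [BASES.index(b) for b in word_seq]
    if len(values) % cols:
        raise ValueError("word length must be a multiple of cols")
    grid = [values[i:i + cols] for i in range(0, len(values), cols)]
    checks = [sum(row) % 4 for row in grid] + [sum(col) % 4 for col in zip(*grid)]
    return word_seq + "".join(BASES[v] for v in checks)


def decode_block(block, data_len, cols):
    """Return the word, repairing a single mutated base if there is one."""
    rows = data_len // cols
    if len(block) != data_len + rows + cols:
        raise ValueError("unexpected block length")
    values = [BASES.index(b) for b in block]
    grid = [values[i:i + cols] for i in range(0, data_len, cols)]
    row_checks = values[data_len:data_len + rows]
    col_checks = values[data_len + rows:]
    row_err = [(sum(r) - c) % 4 for r, c in zip(grid, row_checks)]
    col_err = [(sum(col) - c) % 4 for col, c in zip(zip(*grid), col_checks)]
    bad_rows = [i for i, e in enumerate(row_err) if e]
    bad_cols = [j for j, e in enumerate(col_err) if e]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        r, c = bad_rows[0], bad_cols[0]
        if row_err[r] != col_err[c]:
            raise ValueError("more than one base is damaged")
        grid[r][c] = (grid[r][c] - row_err[r]) % 4  # undo the mutation
    elif len(bad_rows) + len(bad_cols) > 1:
        raise ValueError("more than one base is damaged")
    # If exactly one check fails, the check base itself mutated; data is intact.
    return "".join(BASES[v] for row in grid for v in row)


def sentence_checksum(word_blocks):
    """CRC-32 of the whole sentence, written as 16 bases (stand-in checksum)."""
    crc = zlib.crc32("".join(word_blocks).encode("ascii"))
    return "".join(BASES[(crc >> shift) & 3] for shift in range(30, -2, -2))
```

For example, a six-base word arranged as two rows of three gains five check bases, and any single substitution anywhere in the resulting eleven-base block still decodes to the original word.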
One great advantage of our system is that it is much more economically accessible to the general public, as our users will construct text messages out of modular, pre-synthesised word fragments. Anyone will be able to construct their own message out of DNA by simply ordering the word blocks they want to use and following a simple protocol.
Security and Safety
The potential for encryption provides users with a safe option for data storage: the DNA has to be sequenced, and even then cannot be decoded unless a decryption key is available. In addition to giving our messages the option of privacy by applying a stream cipher, we are taking every possible step to make sure that our physical DNA sequences are safe. To avoid interfering with biological processes, we have made sure our typewriter only codes for text: we have put STOP codons in every reading frame of our DNA text fragments to prevent protein synthesis. You can rest easy. Our research isn’t far-fetched science fiction; it’s a sensible, functional, cheaper and greener way to meet data storage needs.
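Both safeguards can be illustrated with a short sketch. It makes assumptions beyond what is described above: the keystream is derived from SHA-256 purely for illustration (it is not a vetted cipher and not necessarily the one we use), each base is shifted by a keystream value modulo 4, and "every reading frame" is taken to mean all six frames, three on each strand.

```python
import hashlib

BASES = "ACGT"  # assumed mapping: A=0, C=1, G=2, T=3
STOP_CODONS = {"TAA", "TAG", "TGA"}


def keystream(key, n):
    """Toy keystream: n values in 0..3 taken from SHA-256(key + counter)."""
    out, counter = [], 0
    while len(out) < n:
        digest = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        for byte in digest:
            out.extend((byte >> shift) & 3 for shift in (6, 4, 2, 0))
        counter += 1
    return out[:n]


def encrypt(seq, key):
    """Shift every base forward by its keystream value (mod 4)."""
    ks = keystream(key, len(seq))
    return "".join(BASES[(BASES.index(b) + k) % 4] for b, k in zip(seq, ks))


def decrypt(seq, key):
    """Shift every base back by its keystream value (mod 4)."""
    ks = keystream(key, len(seq))
    return "".join(BASES[(BASES.index(b) - k) % 4] for b, k in zip(seq, ks))


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_stop(seq, frame):
    """True if the given reading frame (0, 1 or 2) contains a stop codon."""
    return any(seq[i:i + 3] in STOP_CODONS for i in range(frame, len(seq) - 2, 3))


def blocked_frames(seq):
    """Stop-codon status of the three forward and three reverse frames."""
    reverse = reverse_complement(seq)
    return [has_stop(seq, f) for f in range(3)] + [has_stop(reverse, f) for f in range(3)]
```

A fragment passes the safety check when every entry returned by blocked_frames is True, meaning no frame on either strand can be read through without hitting a stop.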
Leading researchers from Microsoft and Harvard University have already tried tackling this vital issue by developing DNA storage methods. However, our approach is unique in its sustainability, modularity and low cost. We want our technology to be accessible not only to large-scale companies but to anyone with long-term data storage needs.
We are very excited to have embarked on this venture, but our idea still needs funding to come to life! If you are interested in funding the Edinburgh undergraduate iGEM team, and supporting the development of a technology that will be essential in the (not so distant!) future, don’t hesitate to contact us at firstname.lastname@example.org.