In a new study, scientists at the Massachusetts Institute of Technology have developed a technique for tagging and retrieving DNA data files, a step that could help make DNA data storage practical.
There are currently about 10 trillion gigabytes (GB) of digital data on the planet, and every day humans produce another 2.5 million gigabytes in emails, photos, social media feeds and other digital files. Much of this data is stored in huge facilities called exabyte data centers (1 EB is 1 billion gigabytes), which can be the size of several football fields and cost about $1 billion to build and maintain.
Many scientists believe an alternative solution to the massive data storage problem lies in the biological macromolecule that contains our genetic information: deoxyribonucleic acid (DNA). Since the beginning of life on Earth, DNA has evolved to store huge amounts of information at extremely high density. Mark Bathe, a professor of biological engineering at the Massachusetts Institute of Technology, says a coffee cup filled with DNA could theoretically store all the world's data.
"We need new solutions to store the vast amount of data the world is accumulating, especially archival data," he says. "DNA is even 1,000 times denser than flash memory. Another interesting property is that once the DNA polymer is made, it doesn't consume any more energy. You can write data into DNA and store it forever."
Scientists have shown that images and text can be encoded into DNA, but a simple way to pick out a desired file from a mixture of many DNA fragments is still needed. In the new study, Mark Bathe and colleagues demonstrated a way to encapsulate each data file in a spherical silica (silicon dioxide) "capsule" six micrometers across, labeled with short DNA sequences that indicate the file's contents.
Using this method, the researchers accurately retrieved individual images stored as DNA sequences from a set of 20 images. Given the number of labels available, the method could scale up to 10^20 files.
Stable storage medium
Digital storage systems encode text, photos and other types of information as a series of zeros and ones, and the same information can be encoded in DNA using the four nucleotides that make up the genetic code: A, T, G and C (adenine, thymine, guanine and cytosine). For example, G and C can stand for 0, while A and T stand for 1.
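The mapping described above can be sketched in a few lines of Python. This is only an illustration of the "G and C for 0, A and T for 1" idea, not the encoding used in the study; the function names and the rule of alternating between the two candidate bases are assumptions made for the example.

```python
# Minimal sketch of a one-bit-per-nucleotide code: G or C means 0, A or T means 1.
# The alternation rule below is an illustrative assumption, not the study's scheme.
ZERO_BASES = "GC"  # either base stands for bit 0
ONE_BASES = "AT"   # either base stands for bit 1


def bytes_to_bits(data: bytes) -> str:
    """Return the bits of `data` as a string of '0' and '1' characters."""
    return "".join(f"{byte:08b}" for byte in data)


def encode(data: bytes) -> str:
    """Encode bytes as a DNA strand, one nucleotide per bit."""
    strand = []
    for position, bit in enumerate(bytes_to_bits(data)):
        candidates = ONE_BASES if bit == "1" else ZERO_BASES
        # Even positions use G/A and odd positions use C/T,
        # so the same base never appears twice in a row.
        strand.append(candidates[position % 2])
    return "".join(strand)


def decode(strand: str) -> bytes:
    """Recover the original bytes from a strand produced by `encode`."""
    bits = "".join("1" if base in ONE_BASES else "0" for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    strand = encode(b"cat")
    print(strand, decode(strand))
```

Because two bases are available for each bit value, this toy code stores only one bit per nucleotide; a denser scheme would map each of the four bases to a distinct two-bit value.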
DNA has several other features that make it attractive as a storage medium. For one thing, it is very stable, and relatively easy to synthesize and sequence (though currently expensive). Second, it has a very high storage density: each nucleotide, equivalent to up to two bits, occupies about one cubic nanometer. As a result, even very large volumes of data stored in the form of DNA could fit in the palm of a hand.
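A back-of-the-envelope calculation shows what this density implies. The sketch below uses only figures quoted in the article (two bits and about one cubic nanometer per nucleotide, 10 trillion gigabytes of data worldwide) and treats them as exact; it ignores packaging, redundancy and error correction, so the results are theoretical lower bounds rather than engineering estimates.

```python
# Theoretical DNA volume needed for a given amount of data,
# assuming 2 bits per nucleotide and 1 cubic nanometer per nucleotide.
BITS_PER_NUCLEOTIDE = 2
NM3_PER_NUCLEOTIDE = 1
NM3_PER_MM3 = 10**18  # 1 mm = 10^6 nm, so 1 mm^3 = 10^18 nm^3
MM3_PER_ML = 1000     # 1 milliliter = 1 cm^3 = 1000 mm^3

EXABYTE = 10**18                 # bytes
WORLD_DATA = 10**13 * 10**9      # 10 trillion gigabytes, in bytes


def dna_volume_mm3(num_bytes: int) -> float:
    """Return the idealized volume of DNA, in cubic millimeters."""
    nucleotides = num_bytes * 8 // BITS_PER_NUCLEOTIDE
    return nucleotides * NM3_PER_NUCLEOTIDE / NM3_PER_MM3


if __name__ == "__main__":
    print(dna_volume_mm3(EXABYTE), "mm^3 per exabyte")
    print(dna_volume_mm3(WORLD_DATA) / MM3_PER_ML, "mL for the world's data")
```

Under these assumptions an exabyte occupies about 4 cubic millimeters and the world's data about 40 milliliters, which is consistent with the coffee-cup comparison.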
This new way of storing data faces a number of obstacles, starting with the cost of synthesizing such large amounts of DNA. Currently, it would cost about $1 trillion to write one petabyte (1 million gigabytes) of data. To compete with magnetic tape, which is commonly used to store archival data, Bathe estimates that the cost of DNA synthesis needs to fall by about six orders of magnitude. He noted that this goal could be achieved within a decade or two, just as the cost of storing information on flash memory has fallen dramatically over the past few decades.
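The cost figures above can be put on a per-gigabyte basis with simple arithmetic. The sketch below takes the quoted numbers at face value ($1 trillion per petabyte, a drop of six orders of magnitude); the variable names are illustrative, and the result says nothing about actual tape prices.

```python
# Cost of DNA synthesis implied by the article's figures, using integer math.
COST_PER_PETABYTE_USD = 10**12    # about $1 trillion per petabyte today
GIGABYTES_PER_PETABYTE = 10**6    # 1 petabyte = 1 million gigabytes
ORDERS_OF_MAGNITUDE_NEEDED = 6    # estimated drop needed to compete with tape

cost_per_gb_now = COST_PER_PETABYTE_USD // GIGABYTES_PER_PETABYTE
reduction_factor = 10**ORDERS_OF_MAGNITUDE_NEEDED
target_cost_per_pb = COST_PER_PETABYTE_USD // reduction_factor
target_cost_per_gb = cost_per_gb_now // reduction_factor

if __name__ == "__main__":
    print(f"today:  ${cost_per_gb_now:,} per GB")
    print(f"target: ${target_cost_per_pb:,} per PB, ${target_cost_per_gb} per GB")
```

In other words, writing a gigabyte costs on the order of $1 million today, and the target is on the order of $1.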
In addition to the cost, another major bottleneck to using DNA to store data is the difficulty of picking out the file you want from all the others.
"What would happen if the technology for writing DNA were so advanced that it was cost-effective to write one exabyte or one zettabyte (ZB) into DNA? You'd have a whole bunch of DNA, which is tons of documents or images or movies and stuff, but you'd need to find a particular image or movie in it, "Barth said." It's like looking for a needle in a haystack."
Currently, DNA files are usually retrieved using PCR (polymerase chain reaction). Each DNA data file contains a sequence that binds to a specific PCR primer. To read a particular file, that primer is added to the sample to find and amplify the desired sequence. One disadvantage of this approach is that there can be crosstalk between the primer and DNA sequences other than the target, so unwanted files are pulled out as well. In addition, the PCR retrieval process uses enzymes and ends up consuming most of the DNA in the library.
"It's a bit like looking for a needle in a haystack, because all the other DNA isn't amplified, so basically it's thrown away." Barth said.
Solving the DNA file retrieval problem
The MIT team has developed a new retrieval technique that it hopes will replace the PCR method. They encapsulated each DNA file in a tiny silica capsule, each labeled with "barcodes" of single-stranded DNA that correspond to the file's contents. To demonstrate the approach in a cost-effective way, the researchers encoded 20 different images into DNA fragments about 3,000 nucleotides long, which is roughly equivalent to 100 bytes (their study also showed that the capsules could hold DNA files of up to 1 GB).
Each file in the study was labeled with barcodes corresponding to descriptive labels such as "cat" or "plane." For example, an image of a tiger might carry the labels "cat," "orange" and "wild," while an image of a domestic cat carries "cat," "orange" and "domestic." When researchers want to extract a specific image, they take a sample of the DNA and add primers corresponding to the labels they are looking for.
These primers are labeled with fluorescent or magnetic particles, which makes it easy to identify and pull out matching capsules from the sample. In this way, the researchers can remove the desired file and leave the rest of the DNA intact to be returned to storage. Their search process allows Boolean logic statements such as "president AND 18th century" to return "George Washington," much like a Google image search.
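As a software analogy for this kind of label-based selection, the sketch below models each capsule as a set of labels and a query as a Boolean combination of them. In the study the matching is done physically, by DNA hybridization and sorting; the file names, the `CAPSULES` table and the `select` function here are hypothetical and only illustrate the logic.

```python
# Software analogy for Boolean retrieval over labeled capsules.
# File names and labels are illustrative, based on the article's examples.
from typing import Dict, Iterable, List, Set

CAPSULES: Dict[str, Set[str]] = {
    "tiger": {"cat", "orange", "wild"},
    "house_cat": {"cat", "orange", "domestic"},
    "airplane": {"plane"},
    "george_washington": {"president", "18th century"},
}


def select(
    capsules: Dict[str, Set[str]],
    all_of: Iterable[str] = (),
    any_of: Iterable[str] = (),
    none_of: Iterable[str] = (),
) -> List[str]:
    """Return names of capsules whose labels satisfy the AND/OR/NOT query."""
    required, optional, excluded = set(all_of), set(any_of), set(none_of)
    hits = []
    for name, labels in capsules.items():
        if not required <= labels:               # AND: every required label present
            continue
        if optional and not (optional & labels):  # OR: at least one optional label
            continue
        if excluded & labels:                     # NOT: no excluded label present
            continue
        hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    print(select(CAPSULES, all_of=["president", "18th century"]))
    print(select(CAPSULES, all_of=["cat"], none_of=["wild"]))
```

A query such as `all_of=["cat", "orange"]` matches both cat images, while adding `none_of=["wild"]` narrows it to the domestic cat, mirroring how combining probes narrows the physical selection.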
"At the current proof-of-concept stage, our search speed is 1, 000 bytes (1KB) per second," said James Baner of MIT, another lead author of the paper. The speed of our filesystem searches is determined by the amount of data per capsule, which is currently limited by the high cost of writing 100 megabytes (megabytes) of data onto DNA and the number of classifiers that can be used in parallel. If DNA synthesis becomes cheap enough, we can use this method to maximize the amount of data stored per file."
The barcodes the researchers used, single-stranded DNA sequences, were taken from a library of 100,000 sequences developed by Stephen Elledge, a professor of genetics and medicine at Harvard Medical School. If you attach two of these labels to each file, you can uniquely label 10^10 (10 billion) different files; with four labels on each file, 10^20 files can be uniquely labeled.
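These capacity figures follow directly from the size of the barcode library. The sketch below reproduces them, assuming the article's numbers count every ordered choice of labels, with repeats allowed; it also shows the smaller count obtained if each file must carry distinct labels and their order does not matter. The function names are illustrative.

```python
# How many files can be uniquely labeled with k barcodes from a library of 100,000?
import math

LIBRARY_SIZE = 100_000  # barcode sequences in the library, per the article


def ordered_labelings(labels_per_file: int) -> int:
    """Upper bound: every ordered choice of labels, repeats allowed."""
    return LIBRARY_SIZE ** labels_per_file


def unordered_labelings(labels_per_file: int) -> int:
    """Stricter count: distinct labels per file, order ignored."""
    return math.comb(LIBRARY_SIZE, labels_per_file)


if __name__ == "__main__":
    for k in (2, 4):
        print(k, ordered_labelings(k), unordered_labelings(k))
```

With two labels the upper bound is 10^10 and with four it is 10^20, matching the figures quoted; the stricter count for two labels is about 5 billion, the same order of magnitude.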
George Church, a professor of genetics at Harvard Medical School who was not involved in the research, described the technology as a "giant leap forward in knowledge management and search technology."
"Rapid advances in writing, copying, reading and using DNA for low-energy archival data storage have made it extremely difficult to accurately retrieve data files from huge databases (1021-byte, zeta scale)," "What's striking about this new study is that it addresses this problem using a completely separate outer layer of DNA, extending the different properties of DNA (hybridization rather than sequencing), and using existing instruments and chemicals," Church said.
Bathe envisions this DNA encapsulation technology being used to store "cold" data, that is, data kept in archives but not often accessed. His lab has founded a startup called Cache DNA, which is developing technology for long-term storage of DNA, both for DNA data storage in the long term and for clinical and other existing DNA samples in the near term.
"While it may be some time before we can use DNA as a data storage medium, there is a pressing need for low-cost and large-scale storage solutions for DNA and RNA samples in COVID-19 testing, human genome sequencing, and other areas of genomics." "Said Bath.