DNA is crucial for life, and its organization has been a significant scientific challenge. GROVER, a model developed by BIOTEC, decodes DNA like text, promising advancements in genomics and personalized medicine.
DNA holds the essential information required to sustain life. Deciphering how this information is stored and organized has been one of the greatest scientific challenges of the past century. Now, with GROVER, a new large language model trained on human DNA, researchers can attempt to decode the intricate information concealed within our genome. Developed by a team at the Biotechnology Center (BIOTEC) of Dresden University of Technology, GROVER treats human DNA as text, learning its rules and context to extract functional information about DNA sequences. Published in Nature Machine Intelligence, this innovative tool has the potential to revolutionize genomics and accelerate personalized medicine.
Since the discovery of the double helix, scientists have sought to understand the information encoded in DNA. Seventy years later, it is clear that the information hidden in DNA is multilayered. Only 1-2% of the genome consists of genes, the sequences that code for proteins.
“DNA has many functions beyond coding for proteins. Some sequences regulate genes, others serve structural purposes, and most sequences serve multiple functions at once. Currently, we don’t understand the meaning of most of the DNA. When it comes to understanding the non-coding regions of the DNA, it seems that we have only started to scratch the surface. This is where AI and large language models can help,” says Dr. Anna Poetsch, research group leader at the BIOTEC.
DNA as a Language
Large language models, like GPT, have transformed our understanding of language. Trained exclusively on text, these models developed the ability to use language in many contexts.
“DNA is the code of life. Why not treat it like a language?” says Dr. Poetsch. The Poetsch team trained a large language model on a reference human genome. The resulting tool named GROVER, or “Genome Rules Obtained via Extracted Representations”, can be used to extract biological meaning from the DNA.
“GROVER learned the rules of DNA. In terms of language, we are talking about grammar, syntax, and semantics. For DNA this means learning the rules governing the sequences, the order of the nucleotides and sequences, and the meaning of the sequences. Like GPT models learning human languages, GROVER has basically learned how to ‘speak’ DNA,” explains Dr. Melissa Sanabria, the researcher behind the project.
The team showed that GROVER can not only accurately predict the DNA sequence that follows a given stretch but can also be used to extract contextual information that has biological meaning, e.g., to identify gene promoters or protein binding sites on DNA. GROVER also learns processes that are generally considered to be “epigenetic”, i.e., regulatory processes that happen on top of the DNA rather than being encoded in it.
“It is fascinating that by training GROVER with only the DNA sequence, without any annotations of functions, we are actually able to extract information on biological function. To us, it shows that the function, including some of the epigenetic information, is also encoded in the sequence,” says Dr. Sanabria.
The DNA Dictionary
“DNA resembles language. It has four letters that build sequences and the sequences carry a meaning. However, unlike a language, DNA has no defined words,” says Dr. Poetsch. DNA consists of four letters (A, T, G, and C), but there are no predefined “words”, that is, sequences of different lengths that combine to build genes or other meaningful sequences.
To train GROVER, the team had to first create a DNA dictionary. They used a trick from compression algorithms. “This step is crucial and sets our DNA language model apart from the previous attempts,” says Dr. Poetsch.
“We analyzed the whole genome and looked for combinations of letters that occur most often. We started with two letters and went over the DNA, again and again, to build it up to the most common multi-letter combinations. In this way, in about 600 cycles, we have fragmented the DNA into ‘words’ that let GROVER perform the best when it comes to predicting the next sequence,” explains Dr. Sanabria.
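The procedure Dr. Sanabria describes resembles byte-pair encoding, a technique that originated in data compression. The following is a minimal sketch of that idea on a toy sequence, assuming a plain byte-pair encoding scheme; it is not the team's actual code. The function names (`most_frequent_pair`, `merge_pair`, `build_vocabulary`) and the example sequence are hypothetical and chosen only for illustration.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_vocabulary(sequence, cycles):
    """Start from single letters and merge the most frequent pair once per cycle.

    Returns the tokenized sequence and the vocabulary of "words" built so far.
    """
    tokens = list(sequence)
    vocabulary = sorted(set(tokens))
    for _ in range(cycles):
        pair = most_frequent_pair(tokens)
        if pair is None:  # nothing left to merge
            break
        tokens = merge_pair(tokens, pair)
        vocabulary.append(pair[0] + pair[1])
    return tokens, vocabulary


# Toy example (hypothetical sequence, two merge cycles instead of about 600).
tokens, vocabulary = build_vocabulary("ATATATGC", cycles=2)
print(tokens)      # ['ATAT', 'AT', 'G', 'C']
print(vocabulary)  # ['A', 'C', 'G', 'T', 'AT', 'ATAT']
```

Each cycle adds exactly one new “word” to the vocabulary, so the number of cycles controls how large the dictionary becomes. According to the quote above, the team settled on about 600 cycles because that setting let GROVER perform best at predicting the next sequence.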
The Promise of AI in Genomics
GROVER promises to unlock the different layers of genetic code. DNA holds key information on what makes us human, our disease predispositions, and our responses to treatments.
“We believe that understanding the rules of DNA through a language model is going to help us uncover the depths of biological meaning hidden in the DNA, advancing both genomics and personalized medicine,” says Dr. Poetsch.
Reference: “DNA language model GROVER learns sequence context in the human genome” by Melissa Sanabria, Jonas Hirsch, Pierre M. Joubert and Anna R. Poetsch, 23 July 2024, Nature Machine Intelligence.
DOI: 10.1038/s42256-024-00872-0