Whether we are predisposed to specific diseases depends largely on the countless variants in our genome. However, especially for genetic variants that occur only rarely in the population, the influence on the presentation of certain pathological traits has so far been difficult to determine. Scientists from the German Cancer Research Center (DKFZ), the European Molecular Biology Laboratory (EMBL) and the Technical University of Munich have introduced an algorithm based on deep learning that can predict the effects of rare genetic variants. The method allows people at high risk of disease to be distinguished more precisely and facilitates the identification of genes involved in the development of diseases.
Each person’s genome differs from that of other people by millions of individual building blocks. These differences in the genome are known as variants. Many of these variants are associated with specific biological traits and diseases. Such correlations are usually determined using so-called genome-wide association studies.
However, the impact of rare variants, which occur at a frequency of just 0.1% or less in the population, is often statistically underestimated in association studies. “Rare variants in particular often have a much larger impact on the presentation of a biological trait or disease,” says Brian Clarke, one of the first authors of the study. “They can therefore help identify genes that play a role in disease development, and then they can point us in the direction of modern therapeutic approaches,” adds co-author Eva Holtkamp.
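To make the 0.1% threshold concrete, the following minimal sketch shows how a variant could be flagged as rare from genotype counts. It assumes diploid genotypes coded as the number of alternate alleles (0, 1 or 2) per person; the function names and the coding scheme are illustrative assumptions and are not taken from the study.

```python
# Illustrative sketch only: names and genotype coding are assumptions,
# not part of DeepRVAT or the published study.

RARE_FREQ_THRESHOLD = 0.001  # 0.1%, the cutoff quoted in the article


def allele_frequency(genotypes):
    """Return the alternate-allele frequency for one variant.

    genotypes: list with one entry per person, each 0, 1 or 2
    (count of alternate alleles), or None for a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    # Each person contributes two alleles (diploid assumption).
    return sum(called) / (2 * len(called))


def is_rare(genotypes, threshold=RARE_FREQ_THRESHOLD):
    """True if the variant is observed at all and its frequency is <= threshold."""
    freq = allele_frequency(genotypes)
    return 0.0 < freq <= threshold


# Usage: one heterozygous carrier among 1,000 people.
cohort = [1] + [0] * 999
print(allele_frequency(cohort), is_rare(cohort))
```

With so few carriers per variant, a test on any single rare variant has very little statistical power, which is one reason such variants are hard to assess in conventional association studies.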
To better predict the effects of rare variants, teams led by Oliver Stegle and Brian Clarke from the DKFZ and EMBL and Julien Gagneur from the Technical University of Munich have developed a machine learning-based risk assessment tool. “DeepRVAT” (rare variant association testing), as the researchers called the method, is the first to apply artificial intelligence (AI) to genomic association studies to decipher rare genetic variants.
The model was initially trained on sequence data (exome sequences) from 161,000 people from the UK Biobank. The researchers also fed in information about the genetically determined biological traits of each person, as well as the genes involved in the traits. The sequences used for training included some 13 million variants. For each of these, detailed “annotations” are available, providing quantitative information about the possible effects that a given variant might have on cellular processes or protein structure. These annotations were also a central part of the training.
Once trained, DeepRVAT can predict for each individual which genes are impaired in their function by rare variants. To do this, the algorithm uses individual variants and their annotations to calculate a numerical value that describes the degree of impairment of the gene and its potential impact on health.
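The idea of turning a person's variants and their annotations into one number per gene can be sketched as follows. This is a toy illustration under stated assumptions, not DeepRVAT's actual architecture or API: the weights are fixed by hand rather than learned, the annotation values are made up, and the pooling and squashing functions are chosen only for simplicity.

```python
import math

# Toy illustration only: weights, bias and annotation values are hypothetical,
# and the functions below are not part of the DeepRVAT software package.


def variant_embedding(annotations, weights, bias):
    """Map one variant's annotation vector to a non-negative scalar (ReLU)."""
    z = sum(w * a for w, a in zip(weights, annotations)) + bias
    return max(0.0, z)


def gene_impairment_score(variants, weights, bias):
    """Combine all rare variants a person carries in one gene into one score.

    variants: list of annotation vectors, one per variant carried.
    Summing the per-variant values makes the result independent of the
    order in which variants are listed; the final step maps it into [0, 1).
    """
    if not variants:
        return 0.0  # no rare variants in this gene: no predicted impairment
    pooled = sum(variant_embedding(v, weights, bias) for v in variants)
    return 1.0 - math.exp(-pooled)


# Usage with two hypothetical annotations per variant.
weights, bias = [1.0, 0.5], -0.2
damaging = [0.9, 0.8]   # high values on both hypothetical annotations
benign = [0.1, 0.1]     # low values on both
print(gene_impairment_score([damaging, benign], weights, bias))
```

In a trained model the mapping from annotations to scores would be learned from data, so that variants whose annotations indicate stronger effects on the protein or on cellular processes contribute more to the gene's score.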
The researchers validated DeepRVAT on genomic data from the UK Biobank. For the 34 traits tested, i.e. disease-relevant blood test results, the method found 352 associations with the genes involved, significantly outperforming all previous models. The results obtained with DeepRVAT proved to be very robust and replicated better in independent data than the results of alternative approaches.
Another important application of DeepRVAT is assessing genetic predisposition to certain diseases. The researchers combined DeepRVAT with a polygenic risk score based on more common genetic variants. This significantly improved the accuracy of predictions, especially for high-risk variants. In addition, DeepRVAT was found to identify genetic associations for a number of diseases (including various cardiovascular diseases, types of cancer, metabolic diseases, and neurological disorders) that were not found in existing tests.
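One simple way to picture such a combination is a logistic model that adds a rare-variant term to a polygenic risk score. The sketch below is an assumption for illustration, not the model used in the study: the gene name, effect sizes, intercept and function name are all hypothetical.

```python
import math

# Illustrative sketch only: coefficients and gene names are hypothetical,
# and this is not the risk model published with DeepRVAT.


def combined_risk(prs, gene_scores, gene_effects, prs_effect=0.5, intercept=-2.0):
    """Estimate disease probability from common and rare variant information.

    prs:          polygenic risk score from common variants (standardized)
    gene_scores:  dict gene -> impairment score from rare variants, in [0, 1)
    gene_effects: dict gene -> assumed effect of full impairment on the log-odds
    """
    rare_term = sum(
        gene_effects.get(gene, 0.0) * score for gene, score in gene_scores.items()
    )
    logit = intercept + prs_effect * prs + rare_term
    return 1.0 / (1.0 + math.exp(-logit))


# Usage: same polygenic score, with and without an impaired risk gene.
effects = {"GENE_A": 2.0}
print(combined_risk(1.0, {}, effects))
print(combined_risk(1.0, {"GENE_A": 0.8}, effects))
```

In this picture, two people with the same polygenic score can end up with different predicted risks if one of them carries rare variants that impair a relevant gene.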
“DeepRVAT has the potential to significantly advance personalized medicine. Our method works independently of trait type and can be flexibly combined with other testing methods.”
Oliver Stegle, physicist and data scientist
His team now wants to test the risk assessment tool in large-scale studies as soon as possible and put it into practice. The researchers are already in contact with the organizers of INFORM, for example. The aim of this study is to use genomic data to identify individually tailored treatments for children with cancer who have had a relapse. DeepRVAT could help uncover the genetic basis of some childhood cancers.
“I think the potential impact of DeepRVAT on rare disease applications is exciting. One of the major challenges in rare disease research is the lack of large-scale systematic data. Using the power of AI and the half a million exomes in the UK Biobank, we have objectively identified which genetic variants most significantly impair gene function,” says Julien Gagneur from the Technical University of Munich.
The next step is to integrate DeepRVAT into the infrastructure of the German Human Genome Phenome Archive (GHGA) to facilitate applications in diagnostics and basic research. Another advantage of DeepRVAT is that the method requires significantly less computing power than comparable models. DeepRVAT is available as a user-friendly software package that can be used with pre-trained risk assessment models or trained with researchers’ own data sets for specialized purposes.
Brian Clarke, Eva Holtkamp, Hakime Öztürk, Marcel Mück, Magnus Wahlberg, Kayla Meyer, Felix Munzlinger, Felix Brechtmann, Florian R. Hölzlwimmer, Jonas Lindner, Zhifen Chen, Julien Gagneur, Oliver Stegle: Integration of variant annotations using deep set networks boosts rare variant association testing.
Journal reference:
Clarke, B., et al. (2024). Integration of variant annotations using deep set networks boosts rare variant association testing. Nature Genetics. doi.org/10.1038/s41588-024-01919-z.