LOCAL ANCESTRY INFERENCE WITH MACHINE LEARNING MODELS

Researchers at Stanford have developed deep learning methods for high-resolution ancestry estimation along the human genome.

Although most sites in the human genome do not vary between individuals, roughly two percent do. These variants, called single nucleotide polymorphisms (SNPs), occur at discernibly different frequencies in populations originating from different continental and subcontinental regions. Local-ancestry inference (LAI) uses the pattern of variation observed at sites along an individual’s genome to estimate the ancestral origin of each segment of that individual’s DNA. LAI methods have broad biomedical applications.

Stage of Research

The inventors have developed computational methods that use multiple machine learning models to estimate the ancestral origins of segments of genetic variants in a DNA sequence. One model, LAI-Net, uses a neural network architecture composed of two subnetworks: a classification network and a smoothing layer. Another model, XGMix, is based on gradient boosted decision trees and also allows LAI to be approached as a regression problem, in which the model estimates the coordinates (latitude and longitude) of the geographic source location of ancestry for each SNP segment. Together, these machine learning LAI methods provide useful ancestry estimates, even for closely related populations.
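
The two-stage structure described above (a classifier applied to segments of SNPs, followed by smoothing across neighboring segments) can be illustrated with a minimal sketch. This is not LAI-Net or XGMix: the per-window classifier here is a simple allele-frequency likelihood standing in for the neural network or gradient boosted trees, and the smoother is a plain moving average standing in for a learned smoothing layer. The function names, the population labels "A" and "B", and the toy frequencies are all hypothetical.

```python
import math

def window_probabilities(haplotype, freqs, window_size):
    """Stage 1 (stand-in classifier): score each window of SNPs
    against each reference population.

    haplotype: list of 0/1 alleles along one chromosome copy.
    freqs: dict of population -> alternate-allele frequency per SNP
           (a hypothetical reference panel).
    Returns one dict of population probabilities per window.
    """
    result = []
    for start in range(0, len(haplotype), window_size):
        end = min(start + window_size, len(haplotype))
        scores = {}
        for pop, f in freqs.items():
            log_lik = 0.0
            for i in range(start, end):
                p = min(max(f[i], 1e-3), 1 - 1e-3)  # avoid log(0)
                log_lik += math.log(p if haplotype[i] == 1 else 1 - p)
            scores[pop] = log_lik
        # Softmax over populations turns log-likelihoods into probabilities.
        top = max(scores.values())
        expd = {pop: math.exp(s - top) for pop, s in scores.items()}
        total = sum(expd.values())
        result.append({pop: e / total for pop, e in expd.items()})
    return result

def smooth(probs, radius=1):
    """Stage 2 (stand-in smoother): average each window's
    probabilities with those of its neighbors."""
    smoothed = []
    for w in range(len(probs)):
        lo, hi = max(0, w - radius), min(len(probs), w + radius + 1)
        smoothed.append({
            pop: sum(p[pop] for p in probs[lo:hi]) / (hi - lo)
            for pop in probs[w]
        })
    return smoothed

def call_ancestry(probs):
    """Pick the most probable population for each window."""
    return [max(p, key=p.get) for p in probs]

# Toy example: 20 SNPs, 5 windows of 4 SNPs; the middle window is noisy.
freqs = {"A": [0.9] * 20, "B": [0.1] * 20}
haplotype = [1, 1, 1, 1,  1, 1, 1, 1,  0, 0, 1, 0,  1, 1, 1, 1,  1, 1, 1, 1]
raw = window_probabilities(haplotype, freqs, window_size=4)
print(call_ancestry(raw))          # per-window calls before smoothing
print(call_ancestry(smooth(raw)))  # per-window calls after smoothing
```

In this toy case the isolated middle window is called differently from its neighbors before smoothing and agrees with them afterwards, which is the kind of spurious ancestry switch a smoothing stage is meant to suppress. The regression variant, which predicts latitude and longitude per segment, is not shown.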

Applications

  • High-resolution (e.g., milliMorgan) and robust predictions of ancestral origins for segments of SNPs in a genome
  • Local-ancestry inference predictions can support various biomedical applications, such as predicting disease risk or informing medical treatment

Advantages

  • Models are sharable, simple to use, fast to train, and can be run on consumer-level computers
  • Models can process haploid or diploid DNA sequences obtained from a variety of genotyping arrays
  • Models can be trained on genomes of known ancestral origin or on simulated genomes of admixed descendants spanning a range of generations since admixture, increasing model robustness for populations and individuals with different admixture histories
  • Models are robust to missing data and phasing errors
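
Training on simulated admixed descendants, as mentioned in the list above, can be sketched as follows. This is an illustrative simplification rather than the inventors' simulation procedure: it assumes one crossover per meiosis at a uniformly random position (real recombination follows a genetic map), random mating, and a fixed population size. The names `recombine` and `simulate_admixture` and the labels "pop1" and "pop2" are hypothetical.

```python
import random

def recombine(parent_a, parent_b, rng):
    """Build a child haplotype from two parents with a single crossover.

    Each parent is a pair (alleles, labels), where labels[i] records the
    source population of the allele at SNP i. Requires at least 2 SNPs.
    """
    n = len(parent_a[0])
    cut = rng.randrange(1, n)  # crossover point, uniform over SNP indices
    alleles = parent_a[0][:cut] + parent_b[0][cut:]
    labels = parent_a[1][:cut] + parent_b[1][cut:]
    return alleles, labels

def simulate_admixture(founders, generations, size, seed=0):
    """Simulate `generations` rounds of random mating.

    founders: list of (alleles, population_name) for unadmixed haplotypes.
    Returns `size` haplotypes, each with a per-SNP ancestry label that
    can serve as ground truth when training an LAI model.
    """
    rng = random.Random(seed)
    population = [(list(a), [pop] * len(a)) for a, pop in founders]
    for _ in range(generations):
        population = [
            recombine(rng.choice(population), rng.choice(population), rng)
            for _ in range(size)
        ]
    return population

# Toy founders: population 1 carries only 0 alleles, population 2 only 1s,
# so each allele in a descendant reveals which label it should carry.
founders = [([0] * 10, "pop1"), ([1] * 10, "pop2")]
for gens in (2, 8):
    sample = simulate_admixture(founders, generations=gens, size=4, seed=1)
    print(gens, sample[0][1])
```

Varying `generations` changes how finely the ancestry segments are broken up, which is one way to expose a model to a range of admixture histories during training.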

Stage of Development

Research

Publications

Montserrat DM, Bustamante C, Ioannidis A. LAI-Net: Local-ancestry inference with neural networks. arXiv:2004.10377 [q-bio.GN]. 2020. doi: 10.1109/ICASSP40776.2020.9053662

Kumar A, Montserrat DM, Bustamante C, Ioannidis A. XGMix: Local-ancestry inference with stacked XGBoost. bioRxiv. 2020. doi: 10.1101/2020.04.21.053876

Related Web Links

https://bustamantelab.stanford.edu

Keywords

Machine learning, genomics, single nucleotide polymorphisms, SNP

Reference

CZB-155S, Stanford S20-173

For Information, Contact:
Maureen Sheehy
General Counsel
CZ Biohub
ip@czbiohub.org
Inventors:
Daniel Montserrat
Carlos Bustamante
Alexander Ioannidis
Arvind Kumar