SIMULATED GENOMIC SEQUENCING DATASET GENERATION

Researchers at Stanford have developed machine learning methods to generate simulated genomic sequencing datasets.

Local-ancestry inference (LAI) identifies the ancestry of every chromosomal segment in an individual. Many LAI techniques have been developed, but they require large training datasets of human genomic sequences of known ancestry. Such datasets are usually protected, proprietary, or otherwise not publicly available. Techniques that generate shareable training datasets resembling real human sequences from specific ancestries would therefore be useful for LAI methods that deconvolve the ancestry of admixed individuals.

Stage of Research

The inventors have developed machine learning methods to generate new human genomic sequences. Their model is composed of three neural sub-networks: an encoder, a decoder, and a discriminator. The encoder and decoder are configured as a class-conditional variational autoencoder (VAE): the encoder maps the input data into a lower-dimensional space, and the decoder tries to reconstruct the original input from it. The decoder and discriminator are configured as a class-conditional generative adversarial network (GAN), in which the decoder serves as the generator. This generative VAE-GAN model transforms input DNA sequences of discernible SNP variation and known ancestral origin into simulated output sequences having different SNP variants that are statistically related to the training set. These simulated genomic datasets could be made publicly available for various biomedical applications.
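
The data flow through the three sub-networks can be illustrated with a toy forward pass. This is only a minimal sketch under stated assumptions, not the inventors' implementation: every layer is a single untrained linear map, SNPs are coded as 0/1, and all sizes and function names (`N_SNPS`, `encode`, `decode`, `discriminate`, `simulate`) are hypothetical. It shows how an ancestry label can condition each sub-network by appending a one-hot class vector to its input.

```python
import math
import random

random.seed(0)

# Hypothetical toy sizes, chosen only for illustration.
N_SNPS, N_ANCESTRIES, LATENT_DIM = 16, 3, 4

def linear(n_in, n_out):
    """Randomly initialised (untrained) weights and biases for one layer."""
    w = [[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def forward(layer, x):
    """Apply one linear layer to the input vector x."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def one_hot(label):
    return [1.0 if i == label else 0.0 for i in range(N_ANCESTRIES)]

# Each sub-network sees its input concatenated with the ancestry label.
enc_mu = linear(N_SNPS + N_ANCESTRIES, LATENT_DIM)
enc_logvar = linear(N_SNPS + N_ANCESTRIES, LATENT_DIM)
dec = linear(LATENT_DIM + N_ANCESTRIES, N_SNPS)
disc = linear(N_SNPS + N_ANCESTRIES, 1)

def encode(snps, label):
    """Encoder: SNPs + label -> mean and log-variance of the latent code."""
    x = list(snps) + one_hot(label)
    return forward(enc_mu, x), forward(enc_logvar, x)

def sample_latent(mu, logvar):
    """Reparameterisation: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * random.gauss(0, 1)
            for m, lv in zip(mu, logvar)]

def decode(z, label):
    """Decoder (GAN generator): latent code + label -> per-SNP probabilities."""
    return [sigmoid(v) for v in forward(dec, list(z) + one_hot(label))]

def discriminate(snps, label):
    """Discriminator: score in (0, 1) that the sequence is real."""
    return sigmoid(forward(disc, list(snps) + one_hot(label))[0])

def simulate(label):
    """Draw a new sequence: sample a latent code from the prior, then decode."""
    z = [random.gauss(0, 1) for _ in range(LATENT_DIM)]
    return [1 if random.random() < p else 0 for p in decode(z, label)]

# Usage: reconstruct one placeholder sequence and simulate a new one.
real = [random.randint(0, 1) for _ in range(N_SNPS)]
mu, logvar = encode(real, label=1)
recon = decode(sample_latent(mu, logvar), label=1)
fake = simulate(label=2)
score = discriminate(fake, label=2)
```

In a trained model the encoder and decoder would be optimised for reconstruction while the decoder and discriminator are trained adversarially; the sketch omits training entirely and only traces the shapes and conditioning.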

Applications

  • Generation of large numbers of random yet statistically realistic simulated SNP segments of different ancestral origins
  • Simulation of admixed descendants over a series of generations
  • Simulated human genomic sequences can be used to train LAI algorithms or serve as control datasets in genome-wide association studies
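
The admixture application above can be sketched as a simple forward simulation. The code below is an illustrative assumption, not the inventors' procedure: founders are placeholder haplotypes with random 0/1 alleles (in practice they would be simulated single-ancestry sequences), each child is formed by one crossover at a uniformly random breakpoint, and the helper names (`make_founders`, `recombine`, `simulate_admixture`) are hypothetical. A realistic simulator would use a recombination map instead of a uniform breakpoint. Each haplotype carries a per-SNP ancestry label, which is the ground truth an LAI algorithm would be trained to recover.

```python
import random

def make_founders(n_per_ancestry, n_snps, n_ancestries, rng):
    """Placeholder founders: (alleles, ancestry labels), one ancestry each."""
    founders = []
    for anc in range(n_ancestries):
        for _ in range(n_per_ancestry):
            alleles = [rng.randint(0, 1) for _ in range(n_snps)]
            founders.append((alleles, [anc] * n_snps))
    return founders

def recombine(parent_a, parent_b, rng):
    """Single crossover: left part from parent_a, right part from parent_b."""
    n_snps = len(parent_a[0])
    cut = rng.randint(1, n_snps - 1)  # breakpoint strictly inside the segment
    alleles = parent_a[0][:cut] + parent_b[0][cut:]
    ancestry = parent_a[1][:cut] + parent_b[1][cut:]
    return alleles, ancestry

def simulate_admixture(founders, n_generations, pop_size, rng):
    """Replace the population each generation with recombined children."""
    population = list(founders)
    for _ in range(n_generations):
        children = []
        for _ in range(pop_size):
            parent_a, parent_b = rng.sample(population, 2)
            children.append(recombine(parent_a, parent_b, rng))
        population = children
    return population

# Usage: two ancestries, four generations of random mating.
rng = random.Random(0)
founders = make_founders(n_per_ancestry=5, n_snps=20, n_ancestries=2, rng=rng)
descendants = simulate_admixture(founders, n_generations=4, pop_size=10, rng=rng)
alleles, ancestry = descendants[0]
```

Because the ancestry labels are carried through every crossover, each descendant comes with a segment-by-segment record of its origin that can serve as a training target.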

Advantages

  • Simulated genomic sequences can be publicly shared
  • The model can process haploid or diploid DNA sequences obtained by a variety of genotyping arrays
  • The methods can be trained to fit different patterns of SNP variants of different ancestral origins to ensure that a simulated SNP is statistically related to the input SNP

Stage of Development

Research

Publications

Montserrat DM, Bustamante C, Ioannidis A. Class-conditional VAE-GAN for local-ancestry simulation. arXiv:1911.13220v1 [q-bio.GN]. 2019.

Related Web Links

https://bustamantelab.stanford.edu

Keywords

Machine learning, genomics, single nucleotide polymorphisms, SNP

Reference

CZB-156S, Stanford S20-177

For Information, Contact:
CZBiohub Admin
CZ Biohub
ip@czbiohub.org
Inventors:
Daniel Montserrat
Carlos Bustamante
Alexander Ioannidis