snps¶

tools for reading, writing, generating, merging, and remapping SNPs 🧬

snps strives to be an easy-to-use and accessible open-source library for working with genotype data.

Features¶

Input / Output¶

Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing sources with a SNPs object
Read and write VCF files (e.g., convert 23andMe to VCF)
Merge raw data files from different DNA tests, identifying discrepant SNPs in the process
Read data in a variety of formats (e.g., files, bytes, compressed with gzip or zip)
Handle several variations of file types, historically validated using data from openSNP
Generate synthetic genotype data for testing and examples
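
To illustrate what reading "files, bytes, compressed with gzip or zip" can involve, here is a minimal standard-library sketch. It is not the snps implementation; the helper name read_text_lines and the choice to read only the first zip member are assumptions made for this example.

```python
import gzip
import io
import zipfile


def read_text_lines(source):
    """Return decoded lines from a path or bytes (plain, gzip, or zip)."""
    # Accept either in-memory bytes or a filesystem path.
    if isinstance(source, (bytes, bytearray)):
        data = bytes(source)
    else:
        with open(source, "rb") as f:
            data = f.read()

    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    elif data[:4] == b"PK\x03\x04":  # zip local file header
        with zipfile.ZipFile(io.BytesIO(data)) as z:
            # Assumption: the first archive member holds the genotype data.
            data = z.read(z.namelist()[0])

    return data.decode("utf-8").splitlines()


raw = b"# rsid\tchromosome\tposition\tgenotype\nrs1\t1\t100\tAA\n"
plain_lines = read_text_lines(raw)
gzip_lines = read_text_lines(gzip.compress(raw))
```

Sniffing the leading bytes rather than trusting the file extension lets the same code path handle uploads and in-memory buffers that carry no filename.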

Build / Assembly Detection and Remapping¶

Detect the build / assembly of SNPs (supports builds 36, 37, and 38)
Remap SNPs between builds / assemblies
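
One plausible way to detect a build is to compare the positions of a few marker SNPs against their known coordinates in each build. The sketch below shows that idea only; the marker ids and coordinates in MARKER_POSITIONS are made up for illustration, and the function is not the library's own.

```python
# Hypothetical lookup table: marker id -> position in each build.
# These ids and coordinates are invented for this example.
MARKER_POSITIONS = {
    "rs_example_1": {36: 1000, 37: 1500, 38: 1800},
    "rs_example_2": {36: 5000, 37: 5200, 38: 5900},
}


def detect_build(positions):
    """Return the first build whose marker position matches, else None.

    positions maps marker id -> position observed in the loaded file.
    """
    for marker, by_build in MARKER_POSITIONS.items():
        observed = positions.get(marker)
        if observed is None:
            continue  # marker not genotyped in this file
        for build, expected in by_build.items():
            if observed == expected:
                return build
    return None


detected = detect_build({"rs_example_2": 5200})
```

Using several markers makes the check robust to files that do not include every SNP.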

Data Cleaning¶

Perform quality control (QC) / filter low quality SNPs based on chip clusters
Fix several common issues when loading SNPs
Sort SNPs based on chromosome and position
Deduplicate RSIDs
Deduplicate alleles in the non-PAR regions of the X and Y chromosomes for males
Deduplicate alleles on MT
Assign PAR SNPs to the X or Y chromosome
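
As a rough illustration of two of the cleaning steps above (deduplicating RSIDs and sorting by chromosome and position), the following sketch works on plain tuples. The function name and the keep-the-first-occurrence rule are assumptions for this example, not a description of how snps handles duplicates.

```python
CHROM_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]
CHROM_RANK = {chrom: i for i, chrom in enumerate(CHROM_ORDER)}


def sort_and_deduplicate(records):
    """Clean an iterable of (rsid, chrom, pos, genotype) tuples."""
    seen = set()
    unique = []
    for rec in records:
        if rec[0] in seen:
            continue  # assumption: keep the first occurrence of an rsid
        seen.add(rec[0])
        unique.append(rec)
    # Natural chromosome order, then position; unknown chromosomes sort last.
    return sorted(
        unique,
        key=lambda r: (CHROM_RANK.get(r[1], len(CHROM_ORDER)), r[2]),
    )


records = [
    ("rs3", "X", 50, "AG"),
    ("rs2", "10", 300, "CC"),
    ("rs1", "2", 900, "TT"),
    ("rs2", "10", 300, "CT"),  # duplicate rsid
    ("rs4", "2", 100, "GG"),
]
cleaned = sort_and_deduplicate(records)
```

The explicit rank table matters because a plain string sort would place chromosome "10" before chromosome "2".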

Analysis¶

Derive sex from SNPs
Detect deduced genotype / chip array and chip version based on chip clusters
Predict ancestry from SNPs (when installed with ezancestry)
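
Sex can be derived from genotypes because a single X chromosome should show almost no heterozygous calls. The sketch below illustrates that reasoning; the function name and the 0.03 cutoff are assumptions chosen for this example rather than documented library behavior.

```python
def estimate_sex(x_genotypes, heterozygous_threshold=0.03):
    """Return "Male", "Female", or "" when no X genotypes are called.

    heterozygous_threshold is an assumed cutoff for this sketch.
    """
    called = [g for g in x_genotypes if g]  # drop missing calls
    if not called:
        return ""
    # A call is heterozygous when it has two different alleles.
    heterozygous = sum(1 for g in called if len(g) == 2 and g[0] != g[1])
    fraction = heterozygous / len(called)
    return "Female" if fraction > heterozygous_threshold else "Male"


male_like = ["AA", "CC", "GG", "TT"] * 25    # no heterozygous calls
female_like = ["AG", "CC", "CT", "TT"] * 25  # half heterozygous
```

A small non-zero threshold leaves room for genotyping errors and for SNPs in the pseudoautosomal regions.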

Supported Genotype Files¶

snps supports VCF files and genotype files from the following DNA testing sources:

Additionally, snps can read a variety of “generic” CSV and TSV files.

Dependencies¶

snps requires Python 3.9+ and the following Python packages:

Installation¶

snps is available on the Python Package Index. Install snps (and its required Python dependencies) via pip:

$ pip install snps

For ancestry prediction capability, snps can be installed with ezancestry:

$ pip install "snps[ezancestry]"

Examples¶

To try these examples, first generate some sample data:

>>> from snps.resources import Resources
>>> paths = Resources().create_example_datasets()

Load a Raw Data File¶

Load a raw data file exported from a DNA testing source (e.g., 23andMe, AncestryDNA, Family Tree DNA):

>>> from snps import SNPs
>>> s = SNPs("resources/sample1.23andme.txt.gz")

snps automatically detects the source format and normalizes the data:

>>> s.source
'23andMe'
>>> s.count
991767
>>> s.build
37
>>> s.assembly
'GRCh37'

The SNPs are available as a pandas.DataFrame:

>>> df = s.snps
>>> df.columns.tolist()
['chrom', 'pos', 'genotype']
>>> len(df)
991767

Merge Raw Data Files¶

Combine SNPs from multiple files (e.g., combine data from different testing companies):

>>> results = s.merge([SNPs("resources/sample2.ftdna.csv.gz")])
>>> s.count
1006949

SNPs are compared during the merge. Position and genotype discrepancies are identified and can be inspected via properties of the SNPs object:

>>> len(s.discrepant_merge_positions)
27
>>> len(s.discrepant_merge_genotypes)
156
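
Conceptually, the comparison looks at SNPs present in both datasets and flags those whose coordinates or genotypes disagree. The following is a simplified sketch on plain dictionaries, not the merge logic of snps; in particular, treating "AG" and "GA" as equal is an assumption of this example.

```python
def find_discrepancies(base, other):
    """Compare two {rsid: (chrom, pos, genotype)} mappings.

    Returns (position_discrepancies, genotype_discrepancies) as rsid lists.
    """
    position_discrepancies = []
    genotype_discrepancies = []
    for rsid in sorted(base.keys() & other.keys()):
        chrom_a, pos_a, gt_a = base[rsid]
        chrom_b, pos_b, gt_b = other[rsid]
        if (chrom_a, pos_a) != (chrom_b, pos_b):
            position_discrepancies.append(rsid)
        # Assumption: allele order is not meaningful, so "AG" equals "GA".
        if sorted(gt_a) != sorted(gt_b):
            genotype_discrepancies.append(rsid)
    return position_discrepancies, genotype_discrepancies


base = {"rs1": ("1", 100, "AG"), "rs2": ("1", 200, "CC"), "rs3": ("2", 300, "TT")}
other = {
    "rs1": ("1", 100, "GA"),
    "rs2": ("1", 250, "CC"),
    "rs3": ("2", 300, "CT"),
    "rs4": ("3", 400, "GG"),  # only in the second dataset, so never compared
}
pos_diff, gt_diff = find_discrepancies(base, other)
```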

Remap SNPs¶

Convert SNPs between genome assemblies (Build 36/NCBI36, Build 37/GRCh37, Build 38/GRCh38):

>>> chromosomes_remapped, chromosomes_not_remapped = s.remap(38)
>>> s.assembly
'GRCh38'
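
Remapping amounts to translating each position through a set of segment mappings between two assemblies. The sketch below shows the arithmetic for a single SNP; the SEGMENTS values are hypothetical, and complementing alleles on reverse-strand segments is an assumption of this example rather than a statement about snps internals.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical mapping segments for one chromosome:
# (source_start, source_end, target_start, target_end, strand)
SEGMENTS = [
    (1, 1000, 101, 1100, 1),       # shifted by 100 on the forward strand
    (1001, 2000, 3000, 3999, -1),  # inverted onto the reverse strand
]


def remap_snp(pos, genotype, segments=SEGMENTS):
    """Return (new_pos, new_genotype), or None if no segment covers pos."""
    for src_start, src_end, dst_start, dst_end, strand in segments:
        if src_start <= pos <= src_end:
            offset = pos - src_start
            if strand == 1:
                return dst_start + offset, genotype
            # Reverse strand: count back from the segment end and
            # complement the alleles.
            return dst_end - offset, genotype.translate(COMPLEMENT)
    return None


forward = remap_snp(500, "AG")
reverse = remap_snp(1001, "AC")
```

Positions that fall outside every segment cannot be translated, which is why a remap can report chromosomes that were not remapped.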

Save SNPs¶

Save SNPs to common file formats:

>>> _ = s.to_tsv("output.txt")
>>> _ = s.to_csv("output.csv")

To save as VCF, snps automatically downloads the required reference sequences for the assembly. This ensures the REF alleles in the VCF are accurate:

>>> _ = s.to_vcf("output.vcf")

All output files are saved to the output directory (by default, a directory named output; configurable through the output_dir argument of SNPs).
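
The reference sequence matters for VCF output because each VCF record expresses a genotype relative to the REF allele. As a minimal sketch of that encoding (not the writer used by snps; the function name is hypothetical), a single diploid SNP could be turned into a VCF data line like this:

```python
def vcf_record(rsid, chrom, pos, genotype, ref):
    """Build one tab-separated VCF data line for a diploid genotype."""
    # ALT alleles are the distinct genotype alleles that differ from REF.
    alts = []
    for allele in genotype:
        if allele != ref and allele not in alts:
            alts.append(allele)
    alleles = [ref] + alts
    # GT encodes each allele as an index into [REF, ALT1, ALT2, ...].
    gt = "/".join(str(alleles.index(a)) for a in genotype)
    alt_field = ",".join(alts) if alts else "."
    # Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
    return "\t".join(
        [chrom, str(pos), rsid, ref, alt_field, ".", ".", ".", "GT", gt]
    )


line = vcf_record("rs1", "1", 100, "AG", "A")
```

With the wrong REF allele, the same genotype would be written with different ALT and GT values, so downstream tools would misread it.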

Generate Synthetic Data¶

Generate synthetic genotype data for testing, examples, or demonstrations:

>>> from snps.io import SyntheticSNPGenerator
>>> gen = SyntheticSNPGenerator(build=37, seed=123)
>>> gen.save_as_23andme("synthetic_23andme.txt.gz", num_snps=10000)
'synthetic_23andme.txt.gz'

The generator supports multiple output formats (23andMe, AncestryDNA, FTDNA) and automatically injects build-specific marker SNPs to ensure accurate build detection.
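
The seed argument is what makes synthetic data reproducible. The sketch below demonstrates the general idea with the standard library only; it is unrelated to the internals of SyntheticSNPGenerator, and every value it produces is made up.

```python
import random


def synthetic_snps(num_snps, seed):
    """Return a list of (rsid, chrom, pos, genotype) with made-up values."""
    # A local generator keeps the global random state untouched.
    rng = random.Random(seed)
    rows = []
    for i in range(num_snps):
        chrom = str(rng.randint(1, 22))
        pos = rng.randint(1, 1_000_000)
        genotype = rng.choice("ACGT") + rng.choice("ACGT")
        rows.append((f"rs{i + 1}", chrom, pos, genotype))
    return rows


first = synthetic_snps(5, seed=123)
second = synthetic_snps(5, seed=123)
```

Calling the function twice with the same seed yields identical rows, which keeps tests and documentation examples stable.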

Documentation¶

Documentation is available here.

Acknowledgements¶

Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, Open Humans, and Sano Genetics. This project was historically validated using data from openSNP.

snps incorporates code and concepts generated with the assistance of various generative AI tools (including but not limited to ChatGPT, Grok, and Claude). ✨

License¶

snps is licensed under the BSD 3-Clause License.