API Reference

to_tsv(filename='', atomic=True, **kwargs)[source]¶

Output SNPs as tab-separated values.

Note that this results in the same default output as save.

Parameters:

filename (str or buffer) – filename for file to save or buffer to write to
atomic (bool) – atomically write output to a file on local filesystem
**kwargs – additional parameters to pandas.DataFrame.to_csv

Returns:

path to file in output directory if SNPs were saved, else empty str

Return type:

str
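The atomic flag is described only as writing the output atomically on a local filesystem. One common way to get that behavior, shown here as a minimal sketch and not as the library's actual implementation, is to write to a temporary file in the destination directory and then rename it over the target. The helper name atomic_write_text is hypothetical.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path so readers never observe a half-written file (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            f.write(text)
        # os.replace swaps the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return path
```

Writing the temporary file into the same directory matters because a rename across filesystems is generally not atomic.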

to_vcf(filename='', atomic=True, alt_unavailable='.', chrom_prefix='', qc_only=False, qc_filter=False, **kwargs)[source]¶

Output SNPs as Variant Call Format.

Parameters:

filename (str or buffer) – filename for file to save or buffer to write to
atomic (bool) – atomically write output to a file on local filesystem
alt_unavailable (str) – representation of ALT allele when ALT is not able to be determined
chrom_prefix (str) – prefix for chromosomes in VCF CHROM column
qc_only (bool) – output only SNPs that pass quality control
qc_filter (bool) – populate FILTER column based on quality control results
**kwargs – additional parameters to pandas.DataFrame.to_csv

Returns:

path to file in output directory if SNPs were saved, else empty str

Return type:

str

Notes

Parameters qc_only and qc_filter, if true, will identify low quality SNPs per identify_low_quality_snps(), if not done already. Moreover, these parameters have no effect if this SNPs object does not map to a cluster per compute_cluster_overlap().

References

The Variant Call Format (VCF) Version 4.3 Specification, 27 Nov 2022, https://samtools.github.io/hts-specs/VCFv4.3.pdf
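To make the chrom_prefix and alt_unavailable parameters more concrete, the following sketch builds a single VCF data line from a diploid genotype. It is illustrative only: vcf_record is a hypothetical function, the QUAL, FILTER, and INFO columns are left as missing values, and the assumption that ALT falls back to alt_unavailable when the genotype contains only the REF allele is mine, not something the text above states.

```python
def vcf_record(rsid, chrom, pos, ref, genotype, chrom_prefix="", alt_unavailable="."):
    """Build one tab-separated VCF data line for a diploid genotype (illustrative only)."""
    # Alleles in the genotype that differ from REF become ALT alleles, in order of appearance.
    alts = []
    for allele in genotype:
        if allele != ref and allele not in alts:
            alts.append(allele)
    alleles = [ref] + alts
    # GT holds the index of each sample allele in the REF + ALT list.
    gt = "/".join(str(alleles.index(a)) for a in genotype)
    # With no non-REF allele observed, ALT cannot be determined from the genotype alone.
    alt = ",".join(alts) if alts else alt_unavailable
    fields = [f"{chrom_prefix}{chrom}", str(pos), rsid, ref, alt, ".", ".", ".", "GT", gt]
    return "\t".join(fields)


print(vcf_record("rs1", "1", 100, "A", "AG", chrom_prefix="chr"))
```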

detect_build()[source]¶

Detect build of SNPs.

Use the coordinates of common SNPs to identify the build / assembly of a genotype file that is being loaded.

Notes

rs3094315 : plus strand in 36, 37, and 38
rs11928389 : plus strand in 36, minus strand in 37 and 38
rs2500347 : plus strand in 36 and 37, minus strand in 38
rs964481 : plus strand in 36, 37, and 38
rs2341354 : plus strand in 36, 37, and 38
rs3850290 : plus strand in 36, 37, and 38
rs1329546 : plus strand in 36, 37, and 38

Returns:: detected build of SNPs, else 0
Return type:: int

References

Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1;29(1):308-11.
Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. dbSNP accession: rs3094315, rs11928389, rs2500347, rs964481, rs2341354, rs3850290, and rs1329546 (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/
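The idea behind build detection, comparing the observed position of a well-known SNP against its known coordinate in each assembly, can be sketched as a table lookup. The rsids and positions below are placeholders invented for this example, not real coordinates, and detect_build_sketch is a hypothetical name.

```python
# Placeholder coordinates invented for this sketch; real values come from dbSNP / Ensembl.
KNOWN_POSITIONS = {
    "rsA": {36: 1000, 37: 1100, 38: 1200},
    "rsB": {36: 5000, 37: 5500, 38: 5900},
}


def detect_build_sketch(observed):
    """Return the build whose known position matches an observed SNP position, else 0.

    observed maps rsid -> position as found in the genotype file.
    """
    for rsid, by_build in KNOWN_POSITIONS.items():
        pos = observed.get(rsid)
        if pos is None:
            continue  # this marker SNP is not present in the file
        for build, known_pos in by_build.items():
            if pos == known_pos:
                return build
    return 0


print(detect_build_sketch({"rsB": 5500}))
```

Checking several marker SNPs makes the lookup tolerant of files that omit some of them.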

get_count(chrom='')[source]¶

Count of SNPs.

Parameters:: chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
Return type:: int

determine_sex(heterozygous_x_snps_threshold=0.03, y_snps_not_null_threshold=0.3, chrom='X')[source]¶

Determine sex from SNPs using thresholds.

Parameters:

heterozygous_x_snps_threshold (float) – fraction of X SNPs that are heterozygous; above this threshold, Female is determined
y_snps_not_null_threshold (float) – fraction of Y SNPs that are not null; above this threshold, Male is determined
chrom ({"X", "Y"}) – use X or Y chromosome SNPs to determine sex

Returns:

‘Male’ or ‘Female’ if detected, else empty str

Return type:

str
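The threshold logic described by the parameters can be sketched in plain Python. This is a simplified illustration under stated assumptions: genotypes is a plain list of genotype strings (or None for a null call) for the chosen chromosome, and determine_sex_sketch is a hypothetical name, not the library's implementation.

```python
def determine_sex_sketch(
    genotypes,
    chrom="X",
    heterozygous_x_snps_threshold=0.03,
    y_snps_not_null_threshold=0.3,
):
    """Return 'Male', 'Female', or '' from the genotypes of one sex chromosome."""
    if not genotypes:
        return ""  # nothing to base a decision on
    total = len(genotypes)
    if chrom == "X":
        # Two X chromosomes are needed for a heterozygous call, so many such calls suggest Female.
        heterozygous = sum(
            1 for g in genotypes if g is not None and len(g) == 2 and g[0] != g[1]
        )
        return "Female" if heterozygous / total > heterozygous_x_snps_threshold else "Male"
    if chrom == "Y":
        # A high share of non-null Y calls suggests a Y chromosome is present.
        not_null = sum(1 for g in genotypes if g is not None)
        return "Male" if not_null / total > y_snps_not_null_threshold else "Female"
    return ""


print(determine_sex_sketch(["AG", "AA", "CT", "GG"], chrom="X"))
```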

static get_par_regions(build)[source]¶

Get PAR regions for the X and Y chromosomes.

Parameters:: build (int) – build of SNPs
Returns:: PAR regions for the given build
Return type:: pandas.DataFrame

References

Genome Reference Consortium, https://www.ncbi.nlm.nih.gov/grc/human
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

sort()[source]¶: Sort SNPs based on ordered chromosome list and position.

remap(target_assembly, complement_bases=True)[source]¶

Remap SNP coordinates from one assembly to another.

This method uses the assembly map endpoint of the Ensembl REST API service (via Resources’s EnsemblRestClient) to convert SNP coordinates / positions from one assembly to another. After remapping, the coordinates / positions for the SNPs will be that of the target assembly.

If the SNPs are already mapped relative to the target assembly, remapping will not be performed.

Parameters:

target_assembly ({'NCBI36', 'GRCh37', 'GRCh38', 36, 37, 38}) – assembly to remap to
complement_bases (bool) – complement bases when remapping SNPs to the minus strand

Returns:

chromosomes_remapped (list of str) – chromosomes remapped
chromosomes_not_remapped (list of str) – chromosomes not remapped

Notes

An assembly is also known as a “build.” For example:

Assembly NCBI36 = Build 36
Assembly GRCh37 = Build 37
Assembly GRCh38 = Build 38

See https://www.ncbi.nlm.nih.gov/assembly for more information about assemblies and remapping.
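Two small pieces of the remapping description lend themselves to a sketch: accepting either an assembly name or a build number for target_assembly, and complementing bases when a SNP lands on the minus strand. The helper names below are hypothetical, and raising ValueError for an unknown assembly is an assumption of this sketch, not documented behavior.

```python
ASSEMBLY_TO_BUILD = {"NCBI36": 36, "GRCh37": 37, "GRCh38": 38}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalize_build(target_assembly):
    """Map an assembly name or build number to an integer build."""
    if target_assembly in ASSEMBLY_TO_BUILD:
        return ASSEMBLY_TO_BUILD[target_assembly]
    if target_assembly in ASSEMBLY_TO_BUILD.values():
        return target_assembly
    raise ValueError(f"unknown assembly: {target_assembly!r}")


def complement_genotype(genotype):
    """Complement each base; characters other than A, C, G, T pass through unchanged."""
    return genotype.translate(COMPLEMENT)


print(normalize_build("GRCh38"), complement_genotype("AG"))
```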

References

Ensembl, Assembly Map Endpoint, http://rest.ensembl.org/documentation/info/assembly_map
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

merge(snps_objects=(), discrepant_positions_threshold=100, discrepant_genotypes_threshold=500, remap=True, chrom='')[source]¶

Merge other SNPs objects into this SNPs object.

Parameters:

snps_objects (list or tuple of SNPs) – other SNPs objects to merge into this SNPs object
discrepant_positions_threshold (int) – threshold for discrepant SNP positions between existing data and data to be loaded; a large value could indicate mismatched genome assemblies
discrepant_genotypes_threshold (int) – threshold for discrepant genotype data between existing data and data to be loaded; a large value could indicate mismatched individuals
remap (bool) – if necessary, remap other SNPs objects to have the same build as this SNPs object before merging
chrom (str, optional) – chromosome to merge (e.g., “1”, “Y”, “MT”)

Returns:

for each SNPs object to merge, a dict with the following items:

merged (bool): whether SNPs object was merged
common_rsids (pandas.Index): SNPs in common
discrepant_position_rsids (pandas.Index): SNPs with discrepant positions
discrepant_genotype_rsids (pandas.Index): SNPs with discrepant genotypes

Return type:

list of dict
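The role of the two thresholds can be illustrated with a simplified comparison of two SNP sets. In this sketch each set is a plain dict mapping rsid to (chrom, pos, genotype), the result uses lists where the real method returns pandas.Index objects, and treating "AG" and "GA" as the same genotype is an assumption of the example. compare_snps is a hypothetical name.

```python
def compare_snps(
    existing,
    incoming,
    discrepant_positions_threshold=100,
    discrepant_genotypes_threshold=500,
):
    """Compare two rsid -> (chrom, pos, genotype) mappings before a merge."""
    common = sorted(existing.keys() & incoming.keys())
    # A position is discrepant when chromosome or coordinate differs.
    discrepant_positions = [r for r in common if existing[r][:2] != incoming[r][:2]]
    # Genotypes are compared only where positions agree and both calls are present.
    discrepant_genotypes = [
        r
        for r in common
        if r not in discrepant_positions
        and existing[r][2] is not None
        and incoming[r][2] is not None
        and sorted(existing[r][2]) != sorted(incoming[r][2])
    ]
    # Too many discrepancies suggest mismatched assemblies or individuals, so do not merge.
    merged = (
        len(discrepant_positions) <= discrepant_positions_threshold
        and len(discrepant_genotypes) <= discrepant_genotypes_threshold
    )
    return {
        "merged": merged,
        "common_rsids": common,
        "discrepant_position_rsids": discrepant_positions,
        "discrepant_genotype_rsids": discrepant_genotypes,
    }
```

Lowering either threshold makes the merge stricter; setting both to 0 rejects any data set with a single discrepancy.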

References

Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.

predict_ancestry(output_directory=None, write_predictions=False, models_directory=None, aisnps_directory=None, aisnps_set=None)[source]¶

Predict genetic ancestry for SNPs.

Predictions by ezancestry.

Notes

Populations below are described here.

Parameters:

various (optional) –

See the available settings for predict at ezancestry.

Returns:

dict with the following keys:

population_code (str)

max predicted population for the sample

population_percent (float)

predicted probability for the max predicted population

superpopulation_code (str)

max predicted super population (continental) for the sample

superpopulation_percent (float)

predicted probability for the max predicted super population

ezancestry_df (pandas.DataFrame)

pandas.DataFrame with the following columns:

component1, component2, component3: The coordinates of the sample in the dimensionality-reduced component space. Can be used as (x, y, z) coordinates for plotting in a 3D scatter plot.
predicted_ancestry_population: The max predicted population for the sample.
ACB, ASW, BEB, CDX, CEU, CHB, CHS, CLM, ESN, FIN, GBR, GIH, GWD, IBS, ITU, JPT, KHV, LWK, MSL, MXL, PEL, PJL, PUR, STU, TSI, YRI: Predicted probabilities for each of the populations. These sum to 1.0.
predicted_ancestry_superpopulation: The max predicted super population (continental) for the sample.
AFR, AMR, EAS, EUR, SAS: Predicted probabilities for each of the super populations. These sum to 1.0.

Return type:

dict
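The relationship between the per-population probabilities and the four summary keys can be sketched as taking the maximum of each probability table. This is an illustration only; summarize_prediction is a hypothetical helper and the probabilities below are made up.

```python
def summarize_prediction(population_probs, superpopulation_probs):
    """Pick the most probable population and super population from two probability dicts."""
    population = max(population_probs, key=population_probs.get)
    superpopulation = max(superpopulation_probs, key=superpopulation_probs.get)
    return {
        "population_code": population,
        "population_percent": population_probs[population],
        "superpopulation_code": superpopulation,
        "superpopulation_percent": superpopulation_probs[superpopulation],
    }


# Made-up probabilities for illustration.
summary = summarize_prediction(
    {"CEU": 0.7, "GBR": 0.2, "FIN": 0.1},
    {"EUR": 0.95, "AFR": 0.05},
)
print(summary["population_code"], summary["superpopulation_code"])
```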

compute_cluster_overlap(cluster_overlap_threshold=0.95)[source]¶

Compute overlap with chip clusters.

Chip clusters, which are defined in [1], are associated with deduced genotype / chip arrays and DTC companies.

This method also sets the values returned by the cluster, chip, and chip_version properties, based on max overlap, if the specified threshold is satisfied.

Parameters:

cluster_overlap_threshold (float) – threshold for cluster to overlap this SNPs object, and vice versa, to set values returned by the cluster, chip, and chip_version properties

Returns:

pandas.DataFrame with the following columns:

company_composition: DTC company composition of associated cluster from [1]
chip_base_deduced: deduced genotype / chip array of associated cluster from [1]
snps_in_cluster: count of SNPs in cluster
snps_in_common: count of SNPs in common with cluster (inner merge with cluster)
overlap_with_cluster: percentage overlap of snps_in_common with cluster
overlap_with_self: percentage overlap of snps_in_common with this SNPs object

Return type:

pandas.DataFrame
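The two overlap columns can be understood as the size of the intersection divided by the size of each side. The sketch below uses Python sets in place of dataframes; overlap_sketch is a hypothetical name, and requiring both overlaps to strictly exceed the threshold is an assumption about how "satisfied" is evaluated.

```python
def overlap_sketch(self_rsids, cluster_rsids, cluster_overlap_threshold=0.95):
    """Compute overlap statistics between this object's SNPs and one chip cluster."""
    self_set, cluster_set = set(self_rsids), set(cluster_rsids)
    common = self_set & cluster_set  # equivalent to an inner merge on rsid
    overlap_with_cluster = len(common) / len(cluster_set) if cluster_set else 0.0
    overlap_with_self = len(common) / len(self_set) if self_set else 0.0
    return {
        "snps_in_cluster": len(cluster_set),
        "snps_in_common": len(common),
        "overlap_with_cluster": overlap_with_cluster,
        "overlap_with_self": overlap_with_self,
        # Both directions must clear the threshold for the cluster to be assigned.
        "matches": overlap_with_cluster > cluster_overlap_threshold
        and overlap_with_self > cluster_overlap_threshold,
    }
```

Checking the overlap in both directions avoids assigning a cluster when one SNP set is merely a small subset of the other.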

References

identify_low_quality_snps()[source]¶

Identify low quality SNPs based on chip clusters.

Any low quality SNPs are removed from the snps_qc dataframe and are made available as low_quality.

Notes

Chip clusters, which are defined in [1], are associated with low quality SNPs. As such, low quality SNPs will only be identified when this SNPs object corresponds to a cluster per compute_cluster_overlap().

I/O Operations

Modules for reading, writing, and generating SNP data files.

snps.io

Classes for reading, writing, and generating SNPs.

class snps.io.Reader(file='', only_detect_source=False, resources=None, rsids=())[source]¶

Bases: object

Class for reading and parsing raw data / genotype files.

__init__(file='', only_detect_source=False, resources=None, rsids=())[source]¶

Initialize a Reader.

Parameters:

file (str or bytes) – path to file to load or bytes to load
only_detect_source (bool) – only detect the source of the data
resources (Resources) – instance of Resources
rsids (tuple, optional) – rsids to extract if loading a VCF file

read()[source]¶

Read and parse a raw data / genotype file.

Returns:

dict with the following items:

snps (pandas.DataFrame): dataframe of parsed SNPs
source (str): detected source of SNPs
phased (bool): flag indicating if SNPs are phased

Return type:

dict

classmethod read_file(file, only_detect_source, resources, rsids)[source]¶

static is_zip(bytes_data)[source]¶: Check whether or not a bytes_data file is a valid Zip file.

static is_gzip(bytes_data)[source]¶: Check whether or not a bytes_data file is a valid gzip file.
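Checks like these are usually done by inspecting the leading bytes of the data. The following sketch shows one way to do it with the standard library; the function names are hypothetical and this is not necessarily how the Reader implements its checks.

```python
import io
import zipfile


def looks_like_zip(bytes_data):
    """Return True if the bytes form a readable Zip archive."""
    return zipfile.is_zipfile(io.BytesIO(bytes_data))


def looks_like_gzip(bytes_data):
    """Return True if the bytes start with the two-byte gzip magic number."""
    return bytes_data[:2] == b"\x1f\x8b"
```

Detecting the container from the bytes, instead of from the file extension, lets a reader handle uploads and in-memory buffers that have no filename.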

read_helper(source, parser)[source]¶

Generic method to help read files.

Parameters:

source (str) – name of data source
parser (func) –
parsing function, which returns a tuple with the following items:

0 (pandas.DataFrame)
dataframe of parsed SNPs (empty if only detecting source)

1 (bool), optional
flag indicating if SNPs are phased

2 (int), optional
detected build of SNPs

Returns:

dict with the following items:

snps (pandas.DataFrame): dataframe of parsed SNPs
source (str): detected source of SNPs
phased (bool): flag indicating if SNPs are phased
build (int): detected build of SNPs

Return type:

dict
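The contract between read_helper and a parser, where the parser returns a tuple whose second and third items are optional, can be sketched as follows. This is an assumption-laden illustration: read_helper_sketch is a hypothetical name, the parser is called without arguments, a list of rows stands in for the dataframe, and the defaults of False and 0 for missing items are assumed.

```python
def read_helper_sketch(source, parser):
    """Call a parser and normalize its 1- to 3-item tuple into a result dict."""
    phased = False  # assumed default when the parser does not report phasing
    build = 0       # assumed default when the parser does not report a build
    result = parser()
    snps = result[0]
    if len(result) > 1:
        phased = result[1]
    if len(result) > 2:
        build = result[2]
    return {"snps": snps, "source": source, "phased": phased, "build": build}


# A toy parser that only returns the parsed rows.
print(read_helper_sketch("generic", lambda: ([("rs1", "1", 1, "AA")],)))
```

Keeping the optional items positional lets simple parsers return a one-item tuple while richer formats, such as VCF, can also report phasing and build.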

References

Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.

read_23andme(file, compression, joined=True)[source]¶

Read and parse 23andMe file.

https://www.23andme.com

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_ftdna(file, compression)[source]¶

Read and parse Family Tree DNA (FTDNA) file.

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_ftdna_famfinder(file, compression)[source]¶

Read and parse Family Tree DNA (FTDNA) “famfinder” file.

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_ancestry(file, compression)[source]¶

Read and parse Ancestry.com file.

http://www.ancestry.com

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_myheritage(file, compression)[source]¶

Read and parse MyHeritage file.

https://www.myheritage.com

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_livingdna(file, compression)[source]¶

Read and parse LivingDNA file.

https://livingdna.com/

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_mapmygenome(file, compression, header, comments)[source]¶

Read and parse Mapmygenome file.

https://mapmygenome.in

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_genes_for_good(file, compression)[source]¶

Read and parse Genes For Good file.

https://genesforgood.sph.umich.edu/readme/readme1.2.txt

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_tellmegen(file, compression)[source]¶

Read and parse tellmeGen files.

https://www.tellmegen.com/

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_gsa(data_or_filename, compresion, comments)[source]¶

Read and parse Illumina Global Screening Array files.

Parameters:: data_or_filename (str or bytes) – either the filename to read from or the bytes data itself
Returns:: result of read_helper
Return type:: dict

read_dnaland(file, compression)[source]¶

Read and parse DNA.land files.

https://dna.land/

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_circledna(file, compression)[source]¶

Read and parse CircleDNA file.

https://circledna.com/

Notes

This method attempts to read and parse a whole exome file, optionally compressed with gzip or zip. Some assumptions are made throughout this process:

SNPs that are not annotated with an RSID are skipped

Insertions and deletions are skipped

Parameters:: file (str or bytes) – path to file or bytes to load
Returns:: result of read_helper
Return type:: dict

read_sano_dtc(file, compression)[source]¶

Read and parse Sano Genetics DTC file.

https://sanogenetics.com

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_selfdecode(file, compression)[source]¶

Read and parse SelfDecode file.

https://selfdecode.com/

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_23Mofang(file, compression)[source]¶

Read and parse 23Mofang file.

https://www.23mofang.com/

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_plink(file, compression)[source]¶

Read and parse plink file.

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_snps_csv(file, comments, compression)[source]¶

Read and parse CSV file generated by snps.

https://pypi.org/project/snps/

Parameters:

file (str or buffer) – path to file or buffer to read
comments (str) – comments at beginning of file

Returns:

result of read_helper

Return type:

dict

read_generic(file, compression, skip=1)[source]¶

Read and parse generic CSV or TSV file.

Notes

Assumes columns are ‘rsid’, ‘chrom’ / ‘chromosome’, ‘pos’ / ‘position’, and ‘genotype’; values are comma separated; unreported genotypes are indicated by ‘--’; and one header row precedes data. For example:

rsid,chromosome,position,genotype
rs1,1,1,AA
rs2,1,2,CC
rs3,1,3,--

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict
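The assumptions listed in the notes for a generic file can be exercised with the standard csv module. This sketch assumes the unreported-genotype marker is two hyphens, accepts either spelling of the chromosome and position columns, and returns plain dicts where the real reader returns a dataframe. parse_generic_csv is a hypothetical name.

```python
import csv
import io


def parse_generic_csv(text):
    """Parse comma-separated SNP rows with one header row into a list of dicts."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        chrom = row.get("chrom") or row.get("chromosome")
        pos = row.get("pos") or row.get("position")
        genotype = row["genotype"]
        rows.append(
            {
                "rsid": row["rsid"],
                "chrom": chrom,
                "pos": int(pos),
                # Two hyphens mark an unreported genotype; store it as None.
                "genotype": None if genotype == "--" else genotype,
            }
        )
    return rows


sample = "rsid,chromosome,position,genotype\nrs1,1,1,AA\nrs2,1,2,CC\nrs3,1,3,--\n"
print(parse_generic_csv(sample))
```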

read_vcf(file, compression, provider, rsids=(), comments='')[source]¶

Read and parse VCF file.

Notes

This method attempts to read and parse a VCF file or buffer, optionally compressed with gzip. Some assumptions are made throughout this process:

SNPs that are not annotated with an RSID are skipped

If the VCF contains multiple samples, only the first sample is used to lookup the genotype

Precise insertions and deletions are skipped

If a sample allele is not specified, the genotype is reported as NaN

If a sample allele refers to a REF or ALT allele that is not specified, the genotype is reported as NaN

Parameters:

file (str or bytes) – path to file or bytes to load
rsids (tuple, optional) – rsids to extract if loading a VCF file

Returns:

result of read_helper

Return type:

dict
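The last two assumptions in the notes, about unspecified sample alleles and unspecified REF or ALT alleles, can be illustrated by decoding a single GT value. In this sketch None stands in for NaN, genotype_from_gt is a hypothetical name, and skipping of indels and of records without an rsid is left out.

```python
import re


def genotype_from_gt(ref, alt, gt):
    """Translate a VCF GT value such as '0/1' into a genotype string, or None."""
    alleles = [ref] + alt.split(",")
    genotype = ""
    # GT separates allele indexes with '/' (unphased) or '|' (phased).
    for index in re.split(r"[/|]", gt):
        if index == ".":
            return None  # sample allele is not specified
        i = int(index)
        allele = alleles[i] if i < len(alleles) else "."
        if allele == ".":
            return None  # index refers to a REF or ALT allele that is not specified
        genotype += allele
    return genotype


print(genotype_from_gt("A", "G", "0/1"))
```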

class snps.io.Writer(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]¶

Bases: object

Class for writing SNPs to files.

__init__(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]¶

Initialize a Writer.

Parameters:

snps (SNPs) – SNPs to save to file or write to buffer
filename (str or buffer) – filename for file to save or buffer to write to
vcf (bool) – flag to save file as VCF
atomic (bool) – atomically write output to a file on local filesystem
vcf_alt_unavailable (str) – representation of VCF ALT allele when ALT is not able to be determined
vcf_chrom_prefix (str) – prefix for chromosomes in VCF CHROM column
vcf_qc_only (bool) – for VCF, output only SNPs that pass quality control
vcf_qc_filter (bool) – for VCF, populate VCF FILTER column based on quality control results
**kwargs – additional parameters to pandas.DataFrame.to_csv

write()[source]¶

Write SNPs to file or buffer.

Returns:

str – path to file in output directory if SNPs were saved, else empty str
discrepant_vcf_position (pd.DataFrame) – SNPs with discrepant positions discovered while saving VCF

classmethod write_file(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]¶

class snps.io.SyntheticSNPGenerator(build=37, seed=None)[source]¶

Bases: object

Generate realistic synthetic genotype data.

This class generates synthetic SNP data that mimics real genotype files from various DNA testing companies. The generated data is suitable for testing, examples, and documentation.

Parameters:

build (int) – Genome build (36, 37, or 38), default is 37
seed (int, optional) – Random seed for reproducibility

Examples

>>> gen = SyntheticSNPGenerator(build=37, seed=123)
>>> gen.save_as_23andme("output.txt", num_snps=10000)
'output.txt'

__init__(build=37, seed=None)[source]¶

Parameters:

build (int)
seed (int | None)

Return type:

None

generate_snps(num_snps=10000, chromosomes=None, missing_rate=0.01, inject_build_markers=True)[source]¶

Generate a DataFrame of synthetic SNPs.

Parameters:

num_snps (int) – Approximate number of SNPs to generate
chromosomes (list of str, optional) – Chromosomes to include (default: all autosomes plus X, Y, MT)
missing_rate (float) – Proportion of SNPs with missing genotypes (default: 0.01)
inject_build_markers (bool) – Inject known marker SNPs for build detection (default: True)

Returns:

DataFrame with columns: rsid (index), chrom, pos, genotype

Return type:

pd.DataFrame
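The effect of the seed and missing_rate parameters can be sketched with the standard random module. This simplified example returns a list of tuples where the real method returns a DataFrame, generates exactly num_snps rows, draws positions uniformly, and does not inject build markers. generate_snps_sketch is a hypothetical name.

```python
import random


def generate_snps_sketch(num_snps=10, chromosomes=("1", "2", "X"), missing_rate=0.01, seed=None):
    """Generate (rsid, chrom, pos, genotype) tuples; genotype is None when missing."""
    rng = random.Random(seed)  # a private generator keeps results reproducible per seed
    rows = []
    for i in range(num_snps):
        chrom = rng.choice(chromosomes)
        pos = rng.randint(1, 1_000_000)
        if rng.random() < missing_rate:
            genotype = None  # simulate a no-call
        else:
            genotype = rng.choice("ACGT") + rng.choice("ACGT")
        rows.append((f"rs{i + 1}", chrom, pos, genotype))
    return rows
```

Passing the same seed twice yields identical rows, which is what makes generated files usable as stable test fixtures.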

create_example_dataset_pair(output_dir='.')[source]¶

Create a pair of realistic example datasets suitable for merging.

Generates two correlated genotype files that share a large number of common SNPs, with some discrepancies to demonstrate merge functionality.

Parameters:: output_dir (str) – Directory for output files
Returns:: Paths to (file1_23andme, file2_ftdna)
Return type:: tuple of (str, str)

save_as_23andme(output_path, num_snps=991786, **kwargs)[source]¶

Save SNPs in 23andMe format.

Parameters:

output_path (str)
num_snps (int)
kwargs (Any)

Return type:

str

save_as_ancestry(output_path, num_snps=700000, **kwargs)[source]¶

Save SNPs in AncestryDNA format.

Parameters:

output_path (str)
num_snps (int)
kwargs (Any)

Return type:

save_as_ftdna(output_path, num_snps=715194, **kwargs)[source]¶

Save SNPs in Family Tree DNA (FTDNA) format.

Parameters:

output_path (str)
num_snps (int)
kwargs (Any)

Return type:

save_as_generic(output_path, format='csv', num_snps=10000, **kwargs)[source]¶

Save SNPs in generic CSV or TSV format.

Parameters:

output_path (str)
format (str)
num_snps (int)
kwargs (Any)

Return type:

snps.io.get_empty_snps_dataframe()[source]¶

Get empty dataframe normalized for usage with snps.

Return type:: pd.DataFrame

snps.io.reader

File format readers for various genotype data sources.

Class for reading SNPs.

snps.io.reader.get_empty_snps_dataframe()[source]¶

Get empty dataframe normalized for usage with snps.

Return type:: pd.DataFrame

class snps.io.reader.Reader(file='', only_detect_source=False, resources=None, rsids=())[source]¶

Bases: object

Class for reading and parsing raw data / genotype files.

__init__(file='', only_detect_source=False, resources=None, rsids=())[source]¶

Initialize a Reader.

Parameters:

file (str or bytes) – path to file to load or bytes to load
only_detect_source (bool) – only detect the source of the data
resources (Resources) – instance of Resources
rsids (tuple, optional) – rsids to extract if loading a VCF file

read()[source]¶

Read and parse a raw data / genotype file.

Returns:

dict with the following items:

snps (pandas.DataFrame): dataframe of parsed SNPs
source (str): detected source of SNPs
phased (bool): flag indicating if SNPs are phased

Return type:

dict

static is_zip(bytes_data)[source]¶: Check whether or not a bytes_data file is a valid Zip file.

static is_gzip(bytes_data)[source]¶: Check whether or not a bytes_data file is a valid gzip file.

read_helper(source, parser)[source]¶

Generic method to help read files.

Parameters:

source (str) – name of data source
parser (func) –
parsing function, which returns a tuple with the following items:

0 (pandas.DataFrame)
dataframe of parsed SNPs (empty if only detecting source)

1 (bool), optional
flag indicating if SNPs are phased

2 (int), optional
detected build of SNPs

Returns:

dict with the following items:

snps (pandas.DataFrame): dataframe of parsed SNPs
source (str): detected source of SNPs
phased (bool): flag indicating if SNPs are phased
build (int): detected build of SNPs

Return type:

dict

References

Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.

read_23andme(file, compression, joined=True)[source]¶

Read and parse 23andMe file.

https://www.23andme.com

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_ftdna(file, compression)[source]¶

Read and parse Family Tree DNA (FTDNA) file.

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_ftdna_famfinder(file, compression)[source]¶

Read and parse Family Tree DNA (FTDNA) “famfinder” file.

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_ancestry(file, compression)[source]¶

Read and parse Ancestry.com file.

http://www.ancestry.com

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_myheritage(file, compression)[source]¶

Read and parse MyHeritage file.

https://www.myheritage.com

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_livingdna(file, compression)[source]¶

Read and parse LivingDNA file.

https://livingdna.com/

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_mapmygenome(file, compression, header, comments)[source]¶

Read and parse Mapmygenome file.

https://mapmygenome.in

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_genes_for_good(file, compression)[source]¶

Read and parse Genes For Good file.

https://genesforgood.sph.umich.edu/readme/readme1.2.txt

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_tellmegen(file, compression)[source]¶

Read and parse tellmeGen files.

https://www.tellmegen.com/

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_gsa(data_or_filename, compresion, comments)[source]¶

Read and parse Illumina Global Screening Array files.

Parameters:: data_or_filename (str or bytes) – either the filename to read from or the bytes data itself
Returns:: result of read_helper
Return type:: dict

read_dnaland(file, compression)[source]¶

Read and parse DNA.land files.

https://dna.land/

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_circledna(file, compression)[source]¶

Read and parse CircleDNA file.

https://circledna.com/

Notes

This method attempts to read and parse a whole exome file, optionally compressed with gzip or zip. Some assumptions are made throughout this process:

SNPs that are not annotated with an RSID are skipped

Insertions and deletions are skipped

Parameters:: file (str or bytes) – path to file or bytes to load
Returns:: result of read_helper
Return type:: dict

read_sano_dtc(file, compression)[source]¶

Read and parse Sano Genetics DTC file.

https://sanogenetics.com

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_selfdecode(file, compression)[source]¶

Read and parse SelfDecode file.

https://selfdecode.com/

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_23Mofang(file, compression)[source]¶

Read and parse 23Mofang file.

https://www.23mofang.com/

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_plink(file, compression)[source]¶

Read and parse plink file.

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_snps_csv(file, comments, compression)[source]¶

Read and parse CSV file generated by snps.

https://pypi.org/project/snps/

Parameters:

file (str or buffer) – path to file or buffer to read
comments (str) – comments at beginning of file

Returns:

result of read_helper

Return type:

dict

read_generic(file, compression, skip=1)[source]¶

Read and parse generic CSV or TSV file.

Notes

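Assumes columns are ‘rsid’, ‘chrom’ / ‘chromosome’, ‘pos’ / ‘position’, and ‘genotype’; values are comma separated; unreported genotypes are indicated by ‘--’; and one header row precedes data. For example:

rsid,chromosome,position,genotype
rs1,1,1,AA
rs2,1,2,CC
rs3,1,3,--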

Parameters:: file (str) – path to file
Returns:: result of read_helper
Return type:: dict

read_vcf(file, compression, provider, rsids=(), comments='')[source]¶

Read and parse VCF file.

Notes

This method attempts to read and parse a VCF file or buffer, optionally compressed with gzip. Some assumptions are made throughout this process:

SNPs that are not annotated with an RSID are skipped

If the VCF contains multiple samples, only the first sample is used to lookup the genotype

Precise insertions and deletions are skipped

If a sample allele is not specified, the genotype is reported as NaN

If a sample allele refers to a REF or ALT allele that is not specified, the genotype is reported as NaN

Parameters:

file (str or bytes) – path to file or bytes to load
rsids (tuple, optional) – rsids to extract if loading a VCF file

Returns:

result of read_helper

Return type:

dict

snps.io.writer

File format writers for exporting genotype data.

Class for writing SNPs.

class snps.io.writer.Writer(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]¶

Bases: object

Class for writing SNPs to files.

__init__(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]¶

Initialize a Writer.

Parameters:

snps (SNPs) – SNPs to save to file or write to buffer
filename (str or buffer) – filename for file to save or buffer to write to
vcf (bool) – flag to save file as VCF
atomic (bool) – atomically write output to a file on local filesystem
vcf_alt_unavailable (str) – representation of VCF ALT allele when ALT is not able to be determined
vcf_chrom_prefix (str) – prefix for chromosomes in VCF CHROM column
vcf_qc_only (bool) – for VCF, output only SNPs that pass quality control
vcf_qc_filter (bool) – for VCF, populate VCF FILTER column based on quality control results
**kwargs – additional parameters to pandas.DataFrame.to_csv

write()[source]¶

Write SNPs to file or buffer.

Returns:

str – path to file in output directory if SNPs were saved, else empty str
discrepant_vcf_position (pd.DataFrame) – SNPs with discrepant positions discovered while saving VCF

snps.io.generator

Synthetic SNP data generation utilities.

Generate synthetic genotype data for testing and examples.

class snps.io.generator.SyntheticSNPGenerator(build=37, seed=None)[source]¶

Bases: object

Generate realistic synthetic genotype data.

This class generates synthetic SNP data that mimics real genotype files from various DNA testing companies. The generated data is suitable for testing, examples, and documentation.

Parameters:

build (int) – Genome build (36, 37, or 38), default is 37
seed (int, optional) – Random seed for reproducibility

Examples

>>> gen = SyntheticSNPGenerator(build=37, seed=123)
>>> gen.save_as_23andme("output.txt", num_snps=10000)
'output.txt'

__init__(build=37, seed=None)[source]¶

Parameters:

build (int)
seed (int | None)

Return type:

None

generate_snps(num_snps=10000, chromosomes=None, missing_rate=0.01, inject_build_markers=True)[source]¶

Generate a DataFrame of synthetic SNPs.

Parameters:

num_snps (int) – Approximate number of SNPs to generate
chromosomes (list of str, optional) – Chromosomes to include (default: all autosomes plus X, Y, MT)
missing_rate (float) – Proportion of SNPs with missing genotypes (default: 0.01)
inject_build_markers (bool) – Inject known marker SNPs for build detection (default: True)

Returns:

DataFrame with columns: rsid (index), chrom, pos, genotype

Return type:

pd.DataFrame

create_example_dataset_pair(output_dir='.')[source]¶

Create a pair of realistic example datasets suitable for merging.

Generates two correlated genotype files that share a large number of common SNPs, with some discrepancies to demonstrate merge functionality.

Parameters:: output_dir (str) – Directory for output files
Returns:: Paths to (file1_23andme, file2_ftdna)
Return type:: tuple of (str, str)

save_as_23andme(output_path, num_snps=991786, **kwargs)[source]¶

Save SNPs in 23andMe format.

Parameters:

output_path (str)
num_snps (int)
kwargs (Any)

Return type:

str

save_as_ancestry(output_path, num_snps=700000, **kwargs)[source]¶

Save SNPs in AncestryDNA format.

Parameters:

output_path (str)
num_snps (int)
kwargs (Any)

Return type:

save_as_ftdna(output_path, num_snps=715194, **kwargs)[source]¶

Save SNPs in Family Tree DNA (FTDNA) format.

Parameters:

output_path (str)
num_snps (int)
kwargs (Any)

Return type:

save_as_generic(output_path, format='csv', num_snps=10000, **kwargs)[source]¶

Save SNPs in generic CSV or TSV format.

Parameters:

output_path (str)
format (str)
num_snps (int)
kwargs (Any)

Return type:

Data Resources

snps.ensembl

Interface to Ensembl REST API for genomic data.

Ensembl REST client.

Notes

Modified from https://github.com/Ensembl/ensembl-rest/wiki/Example-Python-Client.

References

Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

class snps.ensembl.EnsemblRestClient(server='https://rest.ensembl.org', reqs_per_sec=15)[source]¶

Bases: object

__init__(server='https://rest.ensembl.org', reqs_per_sec=15)[source]¶

perform_rest_action(endpoint, hdrs=None, params=None)[source]¶

snps.resources

Resource management for reference data and assembly mappings.

Class for downloading and loading required external resources.

References

International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. http://dx.doi.org/10.1038/35057062
hg19 (GRCh37): Hiram Clawson, Brooke Rhead, Pauline Fujita, Ann Zweig, Katrina Learned, Donna Karolchik and Robert Kuhn, https://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

class snps.resources.Resources(*args, **kwargs)[source]¶

Bases: object

Object used to manage resources required by snps.

__init__(resources_dir='resources')[source]¶

Initialize a Resources object.

Parameters:: resources_dir (str) – name / path of resources directory

get_reference_sequences(assembly='GRCh37', chroms=('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT'))[source]¶

Get Homo sapiens reference sequences for chroms of assembly.

Notes

This function can download over 800 MB of data for each assembly.

Parameters:

assembly ({'NCBI36', 'GRCh37', 'GRCh38'}) – reference sequence assembly
chroms (list of str) – reference sequence chromosomes

Returns:

dict of ReferenceSequence, else {}

Return type:

dict

get_assembly_mapping_data(source_assembly, target_assembly)[source]¶

Get assembly mapping data.

Parameters:

source_assembly ({'NCBI36', 'GRCh37', 'GRCh38'}) – assembly to remap from
target_assembly ({'NCBI36', 'GRCh37', 'GRCh38'}) – assembly to remap to

Returns:

dict of json assembly mapping data if loading was successful, else {}

Return type:

dict

create_example_datasets(output_dir=None)[source]¶

Create synthetic example datasets for demonstrations.

Generates two correlated genotype files in different formats and builds, suitable for demonstrating merging and remapping functionality. The files share ~700K common SNPs with intentional discrepancies to demonstrate merge conflict detection.

Parameters:: output_dir (str, optional) – Directory for output files (default: resources directory)
Returns:: paths – Paths to created example datasets
Return type:: list of str

Examples

>>> from snps.resources import Resources
>>> r = Resources()
>>> paths = r.create_example_datasets()
Creating resources/sample1.23andme.txt.gz
Creating resources/sample2.ftdna.csv.gz

get_all_resources()[source]¶

Get / download all resources used throughout snps.

Notes

This function does not download reference sequences due to their large sizes.

Returns:: dict of resources
Return type:: dict

get_all_reference_sequences(**kwargs)[source]¶

Get Homo sapiens reference sequences for Builds 36, 37, and 38 from Ensembl.

Notes

This function can download over 2.5 GB of data.

Returns:: dict of ReferenceSequence, else {}
Return type:: dict

get_gsa_resources()[source]¶

Get resources for reading Global Screening Array files.

Return type:: dict

get_chip_clusters()[source]¶

Get resource for identifying deduced genotype / chip array based on chip clusters.

Return type:: pandas.DataFrame

References

Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.
Lu, Tzovaras, & Gough. (2021). OpenSNP data-freeze of 5,393 (19.10.2020) [Data set]. In Computational and Structural Biotechnology Journal. Zenodo. https://doi.org/10.1016/j.csbj.2021.06.040

get_low_quality_snps()[source]¶

Get listing of low quality SNPs for quality control based on chip clusters.

Return type:: pandas.DataFrame

References

Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.
Lu, Tzovaras, & Gough. (2021). OpenSNP data-freeze of 5,393 (19.10.2020) [Data set]. In Computational and Structural Biotechnology Journal. Zenodo. https://doi.org/10.1016/j.csbj.2021.06.040

get_dbsnp_151_37_reverse()[source]¶

Get and load RSIDs that are on the reference reverse (-) strand in dbSNP 151 and lower.

Return type:: pandas.DataFrame

References

Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1; 29(1):308-11.
Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/

get_gsa_rsid()[source]¶

Get and load GSA RSID map.

https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html

Return type:: pandas.DataFrame

get_gsa_chrpos()[source]¶

Get and load GSA chromosome position map.

https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html

Return type:: pandas.DataFrame

class snps.resources.ReferenceSequence(ID='', url='', path='', assembly='', species='', taxonomy='')[source]¶

Bases: object

Object used to represent and interact with a reference sequence.

__init__(ID='', url='', path='', assembly='', species='', taxonomy='')[source]¶

Initialize a ReferenceSequence object.

Parameters:

ID (str) – reference sequence chromosome
url (str) – url to Ensembl reference sequence
path (str) – path to local reference sequence
assembly (str) – reference sequence assembly (e.g., “GRCh37”)
species (str) – reference sequence species
taxonomy (str) – reference sequence taxonomy

References

The Variant Call Format (VCF) Version 4.3 Specification, 27 Nov 2022, https://samtools.github.io/hts-specs/VCFv4.3.pdf

property ID¶

Get reference sequence chromosome.

Return type:: str

property chrom¶

Get reference sequence chromosome.

Return type:: str

property url¶

Get URL to Ensembl reference sequence.

Return type:: str

property path¶

Get path to local reference sequence.

Return type:: str

property assembly¶

Get reference sequence assembly.

Return type:: str

property build¶

Get reference sequence build.

Returns:: e.g., “B37”
Return type:: str
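
The example value “B37” next to the assembly example “GRCh37” suggests that the build label is the letter B followed by the assembly's two-digit number. Assuming that convention holds, the mapping could be sketched as follows. assembly_to_build is a hypothetical helper, not part of the documented API.

```python
def assembly_to_build(assembly):
    """Derive a build label such as 'B37' from an assembly name such as 'GRCh37'.

    Assumes the assembly name ends in a two-digit build number.
    """
    return "B" + assembly[-2:]


builds = [assembly_to_build(a) for a in ("NCBI36", "GRCh37", "GRCh38")]
```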

property species¶

Get reference sequence species.

Return type:: str

property taxonomy¶

Get reference sequence taxonomy.

Return type:: str

property sequence¶

Get reference sequence.

Return type:: np.array(dtype=np.uint8)

property md5¶

Get reference sequence MD5 hash.

Return type:: str

property start¶

Get reference sequence start position (1-based).

Return type:: int

property end¶

Get reference sequence end position (1-based).

Return type:: int

property length¶

Get reference sequence length.

Return type:: int
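
Because start and end are 1-based, they need converting before they can index a Python sequence. The sketch below assumes the interval is inclusive at both ends, which this reference does not state explicitly. to_slice is a hypothetical helper.

```python
def to_slice(start, end):
    """Convert a 1-based inclusive interval to a 0-based half-open Python slice."""
    return slice(start - 1, end)


sequence = b"ACGTACGT"
sub = sequence[to_slice(3, 5)]  # bases 3 through 5, counting from 1
length = 5 - 3 + 1              # an inclusive interval spans end - start + 1 bases
```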

clear()[source]¶

Clear reference sequence.

Utilities¶

snps.utils¶

Helper functions and utilities.

Utility classes and functions.

class snps.utils.Parallelizer(parallelize=False, processes=2)[source]¶

Bases: object

__init__(parallelize=False, processes=2)[source]¶

Initialize a Parallelizer.

Parameters:

parallelize (bool) – utilize multiprocessing to speed up calculations
processes (int) – processes to launch if multiprocessing

__call__(f, tasks)[source]¶

Optionally parallelize execution of a function.

Parameters:

f (func) – function to execute
tasks (list of dict) – tasks to pass to f

Returns:

results of each call to f

Return type:

list
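
The calling convention described above (a function f applied to each dict in tasks, optionally across worker processes) can be sketched with the standard library. This is an illustration under stated assumptions rather than the class's actual code. run_tasks and add are hypothetical names.

```python
from multiprocessing import Pool


def run_tasks(f, tasks, parallelize=False, processes=2):
    """Call f(task) for each task dict, optionally across worker processes."""
    if parallelize:
        # f must be picklable (defined at module top level) for multiprocessing.
        with Pool(processes) as pool:
            return pool.map(f, tasks)
    return [f(task) for task in tasks]


def add(task):
    # Each task is a dict of arguments, matching the `tasks` parameter above.
    return task["a"] + task["b"]


results = run_tasks(add, [{"a": 1, "b": 2}, {"a": 3, "b": 4}])
```

Both branches return results in task order, so callers can switch parallelize on or off without changing how they consume the output.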

class snps.utils.Singleton[source]¶

Bases: type

snps.utils.create_dir(path)[source]¶

Create directory specified by path if it doesn’t already exist.

Parameters:: path (str) – path to directory
Returns:: True if path exists
Return type:: bool
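
A minimal sketch of the behavior described above, assuming the function reports failure by returning False rather than raising. ensure_dir is a hypothetical stand-in, not the library's implementation.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` (and any parents) if missing; return True if it exists afterwards."""
    try:
        os.makedirs(path, exist_ok=True)
    except OSError:
        return False
    return os.path.isdir(path)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "output", "nested")
    first = ensure_dir(target)   # creates the directory
    second = ensure_dir(target)  # already exists, still succeeds
```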

snps.utils.get_utc_now()[source]¶

Get current UTC time.

Return type:: datetime.datetime
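
This reference does not say whether the returned datetime is timezone-aware. One way to obtain the current UTC time with the standard library, shown here as a sketch with the hypothetical name utc_now, is:

```python
import datetime


def utc_now():
    """Return the current time as a timezone-aware datetime in UTC."""
    return datetime.datetime.now(datetime.timezone.utc)


now = utc_now()
```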

snps.utils.save_df_as_csv(df, path, filename, comment='', prepend_info=True, atomic=True, **kwargs)[source]¶

Save dataframe to a CSV file.

Parameters:

df (pandas.DataFrame) – dataframe to save
path (str) – path to directory where to save CSV file
filename (str or buffer) – filename for file to save or buffer to write to
comment (str) – header comment(s); one or more lines starting with ‘#’
prepend_info (bool) – prepend file generation information as comments
atomic (bool) – atomically write output to a file on local filesystem
**kwargs – additional parameters to pandas.DataFrame.to_csv

Returns:

path to saved file or buffer (empty str if error)

Return type:

str or buffer
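
The atomic option means that a reader should never observe a partially written file. How the library achieves this is not described here. A common technique, sketched below under that caveat, is to write to a temporary file in the destination directory and then rename it over the target. atomic_write_text is a hypothetical helper that writes plain text rather than a dataframe.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as fh:
            fh.write(text)
        os.replace(tmp_path, path)  # replaces the target in a single step
    except BaseException:
        os.remove(tmp_path)  # do not leave a stray temp file behind
        raise
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = atomic_write_text(
        os.path.join(tmp, "snps.csv"), "rsid,chrom,pos,genotype\n"
    )
    with open(target, newline="") as fh:
        written = fh.read()
    leftovers = [name for name in os.listdir(tmp) if name.endswith(".tmp")]
```

Keeping the temporary file beside the target matters because a rename across filesystems is not atomic, which is consistent with the “local filesystem” wording of the atomic parameter.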

snps.utils.clean_str(s)[source]¶

Clean a string so that it can be used as a Python variable name.

Parameters:: s (str) – string to clean
Returns:: string that can be used as a Python variable name
Return type:: str
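
One way to obtain such a string, shown as a sketch rather than the function's confirmed implementation, is to replace every non-word character with an underscore and to prefix an underscore when the string starts with a digit. to_identifier is a hypothetical name.

```python
import re


def to_identifier(s):
    """Replace non-word characters with '_' and prefix '_' if s starts with a digit."""
    return re.sub(r"\W|^(?=\d)", "_", s)


cleaned = to_identifier("23andMe file-1")
```

Here the leading digit gains an underscore prefix, and the space and hyphen each become an underscore. This sketch does not handle an empty string, which is not a valid identifier.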

snps.utils.zip_file(src, dest, arcname)[source]¶

Zip a file.

Parameters:

src (str) – path to file to zip
dest (str) – path to output zip file
arcname (str) – name of file in zip archive

Returns:

path to zipped file

Return type:

str
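
The three parameters map directly onto the standard zipfile module. The sketch below assumes a new archive containing a single file and is not the library's code. zip_one is a hypothetical name.

```python
import os
import tempfile
import zipfile


def zip_one(src, dest, arcname):
    """Write src into a new zip archive at dest under the name arcname; return dest."""
    with zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(src, arcname=arcname)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "snps.csv")
    with open(src, "w") as fh:
        fh.write("rsid,chrom,pos,genotype\n")
    archive = zip_one(src, os.path.join(tmp, "snps.zip"), "snps.csv")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
```

Passing arcname keeps the source file's directory path out of the archive, so the entry is stored under the given name only.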

snps.utils.gzip_file(src, dest)[source]¶

Gzip a file.

Parameters:

src (str) – path to file to gzip
dest (str) – path to output gzip file

Returns:

path to gzipped file

Return type:

str
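
A comparable sketch for gzip, again using only the standard library and not claiming to match the library's implementation. gzip_one is a hypothetical name.

```python
import gzip
import os
import shutil
import tempfile


def gzip_one(src, dest):
    """Compress the file at src into a gzip file at dest; return dest."""
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        # Stream in chunks so large files are not loaded into memory at once.
        shutil.copyfileobj(f_in, f_out)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "snps.csv")
    with open(src, "w") as fh:
        fh.write("rsid,chrom,pos,genotype\n")
    compressed = gzip_one(src, src + ".gz")
    with gzip.open(compressed, "rt") as fh:
        round_trip = fh.read()
```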