API Reference
This section documents the complete API for the snps library.
Core Classes
SNPs
The main class for reading, writing, and analyzing genotype data.
SNPs reads, writes, merges, and remaps genotype / raw data files.
- class snps.snps.SNPs(file='', only_detect_source=False, assign_par_snps=False, output_dir='output', resources_dir='resources', deduplicate=True, deduplicate_XY_chrom=True, deduplicate_MT_chrom=True, parallelize=False, processes=2, rsids=())
Bases: object
- __init__(file='', only_detect_source=False, assign_par_snps=False, output_dir='output', resources_dir='resources', deduplicate=True, deduplicate_XY_chrom=True, deduplicate_MT_chrom=True, parallelize=False, processes=2, rsids=())
Object used to read, write, and remap genotype / raw data files.
- Parameters:
only_detect_source (bool) – only detect the source of the data
assign_par_snps (bool) – assign PAR SNPs to the X and Y chromosomes
output_dir (str) – path to output directory
resources_dir (str) – name / path of resources directory
deduplicate (bool) – deduplicate RSIDs and make SNPs available as SNPs.duplicate
deduplicate_MT_chrom (bool) – deduplicate alleles on MT; see SNPs.heterozygous_MT
deduplicate_XY_chrom (bool or str) – deduplicate alleles in the non-PAR regions of X and Y for males; see SNPs.discrepant_XY. If a str, this is the sex determination method to use: X, Y, or XY
parallelize (bool) – utilize multiprocessing to speed up calculations
processes (int) – processes to launch if multiprocessing
rsids (tuple, optional) – rsids to extract if loading a VCF file
- property source
Summary of the SNP data source(s).
- Returns:
Data source(s) for this SNPs object, separated by “, ”.
- Return type:
str
- property snps
Normalized SNPs.
Notes
Throughout snps, the “normalized snps dataframe” is defined as follows:
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property snps_qc
Normalized SNPs, after quality control.
Any low quality SNPs, identified per identify_low_quality_snps(), are not included in the result.
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property duplicate
Duplicate SNPs.
A duplicate SNP has the same RSID as another SNP. The first occurrence of the RSID is not considered a duplicate SNP.
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
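The "first occurrence is not a duplicate" rule for the duplicate property can be illustrated with a small, library-independent sketch. The helper name `split_duplicates` and the row layout are hypothetical; this is not how snps itself stores or deduplicates SNPs.

```python
def split_duplicates(rows):
    """Split (rsid, genotype) rows into first occurrences and duplicates.

    Hypothetical helper for illustration: the first row seen for an RSID
    is kept as the SNP; every later row with that RSID is a duplicate.
    """
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        rsid = row[0]
        if rsid in seen:
            duplicates.append(row)
        else:
            seen.add(rsid)
            unique.append(row)
    return unique, duplicates


rows = [("rs1", "AA"), ("rs2", "CC"), ("rs1", "AG")]
unique, duplicates = split_duplicates(rows)
print(unique)
print(duplicates)
```

Here only the second `rs1` row lands in `duplicates`; the first one stays with the regular SNPs.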
- property discrepant_XY
Discrepant XY SNPs.
A discrepant XY SNP is a heterozygous SNP in the non-PAR region of the X or Y chromosome found during deduplication for a detected male genotype.
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property heterozygous_MT
Heterozygous SNPs on the MT chromosome found during deduplication.
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property discrepant_vcf_position
SNPs with discrepant positions discovered while saving VCF.
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property low_quality
SNPs identified as low quality, if any, per identify_low_quality_snps().
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property discrepant_merge_positions
SNPs with discrepant positions discovered while merging SNPs.
Notes
Definitions of columns in this dataframe are as follows:
Column            Description
rsid              SNP ID
chrom             Chromosome of existing SNP
pos               Position of existing SNP
genotype          Genotype of existing SNP
chrom_added       Chromosome of added SNP
pos_added         Position of added SNP (discrepant with pos)
genotype_added    Genotype of added SNP
- Return type:
- property discrepant_merge_genotypes
SNPs with discrepant genotypes discovered while merging SNPs.
Notes
Definitions of columns in this dataframe are as follows:
Column            Description
rsid              SNP ID
chrom             Chromosome of existing SNP
pos               Position of existing SNP
genotype          Genotype of existing SNP
chrom_added       Chromosome of added SNP
pos_added         Position of added SNP
genotype_added    Genotype of added SNP (discrepant with genotype)
- Return type:
- property discrepant_merge_positions_genotypes
SNPs with discrepant positions and / or genotypes discovered while merging SNPs.
Notes
Definitions of columns in this dataframe are as follows:
Column            Description
rsid              SNP ID
chrom             Chromosome of existing SNP
pos               Position of existing SNP
genotype          Genotype of existing SNP
chrom_added       Chromosome of added SNP
pos_added         Position of added SNP (possibly discrepant with pos)
genotype_added    Genotype of added SNP (possibly discrepant with genotype)
- Return type:
- property chromosomes
Chromosomes of SNPs.
- Returns:
list of str chromosomes (e.g., [‘1’, ‘2’, ‘3’, ‘MT’]), empty list if no chromosomes
- Return type:
list
- property chromosomes_summary
Summary of the chromosomes of SNPs.
- Returns:
human-readable listing of chromosomes (e.g., ‘1-3, MT’), empty str if no chromosomes
- Return type:
str
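To show how a chromosome list such as [‘1’, ‘2’, ‘3’, ‘MT’] could collapse into a listing like ‘1-3, MT’, here is a minimal sketch. `summarize_chromosomes` is a hypothetical helper written for this illustration and is not taken from the snps source; the library's own formatting rules may differ in edge cases.

```python
def summarize_chromosomes(chromosomes):
    """Collapse consecutive numeric chromosomes into ranges.

    Hypothetical helper: numeric chromosomes are sorted and grouped into
    runs; non-numeric ones (e.g. X, Y, MT) are appended in input order.
    """
    numeric = sorted(int(c) for c in chromosomes if c.isdigit())
    other = [c for c in chromosomes if not c.isdigit()]

    parts = []
    start = prev = None
    for n in numeric:
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n  # extend the current run
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = n
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")

    return ", ".join(parts + other)


print(summarize_chromosomes(["1", "2", "3", "MT"]))
```

An empty input yields an empty string, matching the documented "empty str if no chromosomes" behaviour.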
- property sex
Sex derived from SNPs.
- Returns:
‘Male’ or ‘Female’ if detected, else empty str
- Return type:
str
- property cluster
Detected chip cluster, if any, per compute_cluster_overlap.
Notes
Refer to compute_cluster_overlap for more details about chip clusters.
- Returns:
detected chip cluster, e.g., ‘c1’, else empty str
- Return type:
str
- property chip
Detected deduced genotype / chip array, if any, per compute_cluster_overlap.
- Returns:
detected chip array, else empty str
- Return type:
str
- property chip_version
Detected genotype / chip array version, if any, per compute_cluster_overlap.
Notes
Chip array version is only applicable to 23andMe (v3, v4, v5) and AncestryDNA (v1, v2) files.
- Returns:
detected chip array version, e.g., ‘v4’, else empty str
- Return type:
str
- heterozygous(chrom='')
Get heterozygous SNPs.
- Parameters:
chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- homozygous(chrom='')
Get homozygous SNPs.
- Parameters:
chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- notnull(chrom='')
Get not null genotype SNPs.
- Parameters:
chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property summary
Summary of SNPs.
- Returns:
summary info if SNPs is valid, else {}
- Return type:
dict
- property valid
Determine if SNPs is valid.
SNPs is valid when the input file has been successfully parsed.
- Returns:
True if SNPs is valid
- Return type:
bool
- to_csv(filename='', atomic=True, **kwargs)
Output SNPs as comma-separated values.
- Parameters:
- Returns:
path to file in output directory if SNPs were saved, else empty str
- Return type:
str
- to_tsv(filename='', atomic=True, **kwargs)
Output SNPs as tab-separated values.
Note that this results in the same default output as save.
- Parameters:
- Returns:
path to file in output directory if SNPs were saved, else empty str
- Return type:
str
- to_vcf(filename='', atomic=True, alt_unavailable='.', chrom_prefix='', qc_only=False, qc_filter=False, **kwargs)
Output SNPs as Variant Call Format.
- Parameters:
filename (str or buffer) – filename for file to save or buffer to write to
atomic (bool) – atomically write output to a file on local filesystem
alt_unavailable (str) – representation of ALT allele when ALT is not able to be determined
chrom_prefix (str) – prefix for chromosomes in VCF CHROM column
qc_only (bool) – output only SNPs that pass quality control
qc_filter (bool) – populate FILTER column based on quality control results
**kwargs – additional parameters to pandas.DataFrame.to_csv
- Returns:
path to file in output directory if SNPs were saved, else empty str
- Return type:
str
Notes
Parameters qc_only and qc_filter, if true, will identify low quality SNPs per identify_low_quality_snps(), if not done already. Moreover, these parameters have no effect if this SNPs object does not map to a cluster per compute_cluster_overlap().
References
The Variant Call Format (VCF) Version 4.3 Specification, 27 Nov 2022, https://samtools.github.io/hts-specs/VCFv4.3.pdf
- detect_build()
Detect build of SNPs.
Use the coordinates of common SNPs to identify the build / assembly of a genotype file that is being loaded.
Notes
rs3094315 : plus strand in 36, 37, and 38
rs11928389 : plus strand in 36, minus strand in 37 and 38
rs2500347 : plus strand in 36 and 37, minus strand in 38
rs964481 : plus strand in 36, 37, and 38
rs2341354 : plus strand in 36, 37, and 38
rs3850290 : plus strand in 36, 37, and 38
rs1329546 : plus strand in 36, 37, and 38
- Returns:
detected build of SNPs, else 0
- Return type:
int
References
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi.org/10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1;29(1):308-11.
Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. dbSNP accession: rs3094315, rs11928389, rs2500347, rs964481, rs2341354, rs3850290, and rs1329546 (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/
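The idea behind detect_build, matching the observed position of a well-known SNP against its known position in each build, can be sketched as a table lookup. The marker names and coordinates below are hypothetical placeholders, not the values used by snps, and `detect_build_from_markers` is an illustrative helper rather than part of the library.

```python
# Hypothetical marker table: rsid -> {build: position}.
# These coordinates are placeholders for illustration only.
MARKERS = {
    "rs_example_1": {36: 1000, 37: 1100, 38: 1200},
    "rs_example_2": {36: 5000, 37: 5050, 38: 5075},
}


def detect_build_from_markers(positions, markers=MARKERS):
    """Return the build whose marker position matches, else 0.

    positions: dict mapping rsid -> position observed in the loaded file.
    """
    for rsid, by_build in markers.items():
        observed = positions.get(rsid)
        if observed is None:
            continue  # this marker is not present in the file
        for build, expected in by_build.items():
            if observed == expected:
                return build
    return 0  # mirrors the documented "else 0" when nothing matches


print(detect_build_from_markers({"rs_example_2": 5050}))
print(detect_build_from_markers({"rs_other": 42}))
```

A lookup like this only works when each marker has a distinct position in every build, which is presumably why a fixed set of well-characterised SNPs is listed in the Notes.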
- determine_sex(heterozygous_x_snps_threshold=0.03, y_snps_not_null_threshold=0.3, chrom='X')
Determine sex from SNPs using thresholds.
- Parameters:
- Returns:
‘Male’ or ‘Female’ if detected, else empty str
- Return type:
str
- static get_par_regions(build)
Get PAR regions for the X and Y chromosomes.
- Parameters:
build (int) – build of SNPs
- Returns:
PAR regions for the given build
- Return type:
References
Genome Reference Consortium, https://www.ncbi.nlm.nih.gov/grc/human
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi.org/10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
- remap(target_assembly, complement_bases=True)
Remap SNP coordinates from one assembly to another.
This method uses the assembly map endpoint of the Ensembl REST API service (via Resources’s EnsemblRestClient) to convert SNP coordinates / positions from one assembly to another. After remapping, the coordinates / positions for the SNPs will be those of the target assembly.
If the SNPs are already mapped relative to the target assembly, remapping will not be performed.
- Parameters:
target_assembly ({'NCBI36', 'GRCh37', 'GRCh38', 36, 37, 38}) – assembly to remap to
complement_bases (bool) – complement bases when remapping SNPs to the minus strand
- Returns:
Notes
An assembly is also known as a “build.” For example:
Assembly NCBI36 = Build 36
Assembly GRCh37 = Build 37
Assembly GRCh38 = Build 38
See https://www.ncbi.nlm.nih.gov/assembly for more information about assemblies and remapping.
References
Ensembl, Assembly Map Endpoint, http://rest.ensembl.org/documentation/info/assembly_map
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi.org/10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
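Because target_assembly accepts either an assembly name or a build number, a caller may want to normalize the value before comparing it with the current build. The sketch below shows one way to do that; `normalize_target` and `needs_remap` are hypothetical helpers, and the only facts taken from this reference are the three assembly / build pairs and the rule that no remapping happens when the builds already match.

```python
# Assembly names and the build numbers they correspond to (see Notes).
ASSEMBLY_TO_BUILD = {"NCBI36": 36, "GRCh37": 37, "GRCh38": 38}


def normalize_target(target_assembly):
    """Return the build number for an assembly name or build number.

    Hypothetical helper; raises ValueError for unsupported targets.
    """
    if target_assembly in ASSEMBLY_TO_BUILD:
        return ASSEMBLY_TO_BUILD[target_assembly]
    if target_assembly in ASSEMBLY_TO_BUILD.values():
        return target_assembly
    raise ValueError(f"unsupported assembly: {target_assembly!r}")


def needs_remap(current_build, target_assembly):
    """Remapping is only needed when the current and target builds differ."""
    return current_build != normalize_target(target_assembly)


print(needs_remap(37, "GRCh37"))
print(needs_remap(37, 38))
```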
- merge(snps_objects=(), discrepant_positions_threshold=100, discrepant_genotypes_threshold=500, remap=True, chrom='')
Merge other SNPs objects into this SNPs object.
- Parameters:
snps_objects (list or tuple of SNPs) – other SNPs objects to merge into this SNPs object
discrepant_positions_threshold (int) – threshold for discrepant SNP positions between existing data and data to be loaded; a large value could indicate mismatched genome assemblies
discrepant_genotypes_threshold (int) – threshold for discrepant genotype data between existing data and data to be loaded; a large value could indicate mismatched individuals
remap (bool) – if necessary, remap other SNPs objects to have the same build as this SNPs object before merging
chrom (str, optional) – chromosome to merge (e.g., “1”, “Y”, “MT”)
- Returns:
for each SNPs object to merge, a dict with the following items:
- merged (bool)
whether SNPs object was merged
- common_rsids (pandas.Index)
SNPs in common
- discrepant_position_rsids (pandas.Index)
SNPs with discrepant positions
- discrepant_genotype_rsids (pandas.Index)
SNPs with discrepant genotypes
- Return type:
References
Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.
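The quantities reported by merge (common RSIDs, discrepant positions, discrepant genotypes) can be illustrated without the library. This is a sketch under stated assumptions: SNPs are plain dicts of rsid -> (chrom, pos, genotype), unphased genotypes such as "AG" and "GA" are treated as equal, and a merge is refused once a count exceeds its threshold. The helper names are hypothetical and the exact comparison rules in snps may differ.

```python
def compare_snps(existing, added):
    """Compare two rsid -> (chrom, pos, genotype) mappings.

    Returns (common, discrepant_positions, discrepant_genotypes) as
    sorted lists of rsids. Genotypes are compared only when both are set.
    """
    common = sorted(existing.keys() & added.keys())
    discrepant_positions = [
        rsid for rsid in common
        if existing[rsid][:2] != added[rsid][:2]
    ]
    discrepant_genotypes = [
        rsid for rsid in common
        if existing[rsid][2] is not None
        and added[rsid][2] is not None
        # Assumption: allele order does not matter for unphased data.
        and sorted(existing[rsid][2]) != sorted(added[rsid][2])
    ]
    return common, discrepant_positions, discrepant_genotypes


def should_merge(n_positions, n_genotypes,
                 discrepant_positions_threshold=100,
                 discrepant_genotypes_threshold=500):
    """Assumed policy: refuse the merge once either count exceeds its threshold."""
    return (n_positions <= discrepant_positions_threshold
            and n_genotypes <= discrepant_genotypes_threshold)


existing = {"rs1": ("1", 100, "AA"), "rs2": ("1", 200, "AG"),
            "rs3": ("2", 300, None), "rs5": ("4", 5, "CC")}
added = {"rs1": ("1", 101, "AA"), "rs2": ("1", 200, "GA"),
         "rs3": ("2", 300, "CC"), "rs4": ("3", 1, "TT"),
         "rs5": ("4", 5, "CT")}

common, bad_pos, bad_gt = compare_snps(existing, added)
print(common, bad_pos, bad_gt)
print(should_merge(len(bad_pos), len(bad_gt)))
```

Many discrepant positions hint at mismatched assemblies and many discrepant genotypes at mismatched individuals, which is why the two counts are checked against separate thresholds.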
- predict_ancestry(output_directory=None, write_predictions=False, models_directory=None, aisnps_directory=None, aisnps_set=None)
Predict genetic ancestry for SNPs.
Predictions by ezancestry.
Notes
Populations below are described here.
- Parameters:
various (optional) –
See the available settings for predict at ezancestry.
- Returns:
dict with the following keys:
- population_code (str)
max predicted population for the sample
- population_percent (float)
predicted probability for the max predicted population
- superpopulation_code (str)
max predicted super population (continental) for the sample
- superpopulation_percent (float)
predicted probability for the max predicted super population
- ezancestry_df (pandas.DataFrame)
pandas.DataFrame with the following columns:
- component1, component2, component3
The coordinates of the sample in the dimensionality-reduced component space. Can be used as (x, y, z,) coordinates for plotting in a 3d scatter plot.
- predicted_ancestry_population
The max predicted population for the sample.
- ACB, ASW, BEB, CDX, CEU, CHB, CHS, CLM, ESN, FIN, GBR, GIH, GWD, IBS, ITU, JPT, KHV, LWK, MSL, MXL, PEL, PJL, PUR, STU, TSI, YRI
Predicted probabilities for each of the populations. These sum to 1.0.
- predicted_ancestry_superpopulation
The max predicted super population (continental) for the sample.
- AFR, AMR, EAS, EUR, SAS
Predicted probabilities for each of the super populations. These sum to 1.0.
- Return type:
- compute_cluster_overlap(cluster_overlap_threshold=0.95)
Compute overlap with chip clusters.
Chip clusters, which are defined in [1], are associated with deduced genotype / chip arrays and DTC companies.
This method also sets the values returned by the cluster, chip, and chip_version properties, based on max overlap, if the specified threshold is satisfied.
- Parameters:
cluster_overlap_threshold (float) – threshold for cluster to overlap this SNPs object, and vice versa, to set values returned by the cluster, chip, and chip_version properties
- Returns:
pandas.DataFrame with the following columns:
- company_composition
DTC company composition of associated cluster from [1]
- chip_base_deduced
deduced genotype / chip array of associated cluster from [1]
- snps_in_cluster
count of SNPs in cluster
- snps_in_common
count of SNPs in common with cluster (inner merge with cluster)
- overlap_with_cluster
percentage overlap of snps_in_common with cluster
- overlap_with_self
percentage overlap of snps_in_common with this SNPs object
- Return type:
References
[1] Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.
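The two overlap columns and the "and vice versa" threshold check can be made concrete with sets. This is a minimal sketch, assuming SNPs are identified by any hashable key and that overlaps are expressed as fractions between 0 and 1 so they can be compared with the 0.95 default. The helper names, and the choice of `>=` for the comparison, are assumptions rather than the library's implementation.

```python
def overlap_metrics(self_snps, cluster_snps):
    """Compute overlap counts and fractions between two SNP collections."""
    self_snps = set(self_snps)
    cluster_snps = set(cluster_snps)
    in_common = self_snps & cluster_snps
    return {
        "snps_in_cluster": len(cluster_snps),
        "snps_in_common": len(in_common),
        # Fraction of the cluster that is covered by this object.
        "overlap_with_cluster": (
            len(in_common) / len(cluster_snps) if cluster_snps else 0.0
        ),
        # Fraction of this object that is covered by the cluster.
        "overlap_with_self": (
            len(in_common) / len(self_snps) if self_snps else 0.0
        ),
    }


def satisfies_threshold(metrics, cluster_overlap_threshold=0.95):
    """Both directions must reach the threshold (assumed to be inclusive)."""
    return (metrics["overlap_with_cluster"] >= cluster_overlap_threshold
            and metrics["overlap_with_self"] >= cluster_overlap_threshold)


metrics = overlap_metrics({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(metrics)
print(satisfies_threshold(metrics))
```

Requiring both directions avoids matching a small file to a large cluster (high overlap_with_self only) or a large file to a small cluster (high overlap_with_cluster only).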
- identify_low_quality_snps()
Identify low quality SNPs based on chip clusters.
Any low quality SNPs are removed from the snps_qc dataframe and are made available as low_quality.
Notes
Chip clusters, which are defined in [1], are associated with low quality SNPs. As such, low quality SNPs will only be identified when this SNPs object corresponds to a cluster per compute_cluster_overlap().
I/O Operations
Modules for reading, writing, and generating SNP data files.
snps.io
Classes for reading, writing, and generating SNPs.
- class snps.io.Reader(file='', only_detect_source=False, resources=None, rsids=())
Bases: object
Class for reading and parsing raw data / genotype files.
- read()
Read and parse a raw data / genotype file.
- Returns:
dict with the following items:
- snps (pandas.DataFrame)
dataframe of parsed SNPs
- source (str)
detected source of SNPs
- phased (bool)
flag indicating if SNPs are phased
- Return type:
- read_helper(source, parser)
Generic method to help read files.
- Parameters:
source (str) – name of data source
parser (func) – parsing function, which returns a tuple with the following items:
- 0 (pandas.DataFrame)
dataframe of parsed SNPs (empty if only detecting source)
- 1 (bool), optional
flag indicating if SNPs are phased
- 2 (int), optional
detected build of SNPs
- Returns:
dict with the following items:
- snps (pandas.DataFrame)
dataframe of parsed SNPs
- source (str)
detected source of SNPs
- phased (bool)
flag indicating if SNPs are phased
- build (int)
detected build of SNPs
- Return type:
References
Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.
- read_ftdna_famfinder(file, compression)
Read and parse Family Tree DNA (FTDNA) “famfinder” file.
- read_gsa(data_or_filename, compresion, comments)
Read and parse Illumina Global Screening Array files.
- read_circledna(file, compression)
Read and parse CircleDNA file.
Notes
This method attempts to read and parse a whole exome file, optionally compressed with gzip or zip. Some assumptions are made throughout this process:
SNPs that are not annotated with an RSID are skipped
Insertions and deletions are skipped
- read_generic(file, compression, skip=1)
Read and parse generic CSV or TSV file.
Notes
Assumes columns are ‘rsid’, ‘chrom’ / ‘chromosome’, ‘pos’ / ‘position’, and ‘genotype’; values are comma separated; unreported genotypes are indicated by ‘--’; and one header row precedes data. For example:
rsid,chromosome,position,genotype
rs1,1,1,AA
rs2,1,2,CC
rs3,1,3,--
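The generic layout described in the Notes can be parsed with the standard library alone. The sketch below is not the Reader implementation: `parse_generic` is a hypothetical helper, it assumes the unreported-genotype marker is two hyphens (`--`), and it returns a plain dict instead of a dataframe.

```python
import csv
import io


def parse_generic(text):
    """Parse generic comma-separated genotype text into a dict.

    Accepts either the short or long column names described above and
    maps unreported genotypes ('--') to None.
    """
    snps = {}
    for row in csv.DictReader(io.StringIO(text)):
        chrom = row.get("chrom") or row.get("chromosome")
        pos = row.get("pos") or row.get("position")
        genotype = row["genotype"]
        snps[row["rsid"]] = (
            chrom,
            int(pos),
            None if genotype == "--" else genotype,
        )
    return snps


sample = (
    "rsid,chromosome,position,genotype\n"
    "rs1,1,1,AA\n"
    "rs2,1,2,CC\n"
    "rs3,1,3,--\n"
)
print(parse_generic(sample))
```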
- read_vcf(file, compression, provider, rsids=(), comments='')
Read and parse VCF file.
Notes
This method attempts to read and parse a VCF file or buffer, optionally compressed with gzip. Some assumptions are made throughout this process:
SNPs that are not annotated with an RSID are skipped
If the VCF contains multiple samples, only the first sample is used to lookup the genotype
Precise insertions and deletions are skipped
If a sample allele is not specified, the genotype is reported as NaN
If a sample allele refers to a REF or ALT allele that is not specified, the genotype is reported as NaN
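The last two assumptions describe how a sample's GT field is resolved against the REF and ALT columns. A minimal sketch of that lookup follows; `genotype_from_vcf_fields` is a hypothetical helper, it returns None where the documentation says NaN, and it does not cover the RSID, multi-sample, or indel rules listed above.

```python
def genotype_from_vcf_fields(ref, alt, sample_gt):
    """Resolve a GT value such as '0/1' or '1|1' into allele letters.

    ref: REF column value; alt: ALT column value (comma-separated, or '.').
    Returns None when an allele is unspecified or refers to a REF / ALT
    allele that is not available.
    """
    alleles = [ref] + (alt.split(",") if alt != "." else [])
    bases = []
    for index in sample_gt.replace("|", "/").split("/"):
        if index == ".":
            return None  # sample allele not specified
        i = int(index)
        if i >= len(alleles) or alleles[i] in (".", ""):
            return None  # refers to an allele that is not specified
        bases.append(alleles[i])
    return "".join(bases)


print(genotype_from_vcf_fields("A", "G", "0/1"))
print(genotype_from_vcf_fields("A", "G,T", "2|1"))
print(genotype_from_vcf_fields("A", ".", "0/1"))
```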
- class snps.io.Writer(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)
Bases: object
Class for writing SNPs to files.
- __init__(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)
Initialize a Writer.
- Parameters:
snps (SNPs) – SNPs to save to file or write to buffer
filename (str or buffer) – filename for file to save or buffer to write to
vcf (bool) – flag to save file as VCF
atomic (bool) – atomically write output to a file on local filesystem
vcf_alt_unavailable (str) – representation of VCF ALT allele when ALT is not able to be determined
vcf_chrom_prefix (str) – prefix for chromosomes in VCF CHROM column
vcf_qc_only (bool) – for VCF, output only SNPs that pass quality control
vcf_qc_filter (bool) – for VCF, populate VCF FILTER column based on quality control results
**kwargs – additional parameters to pandas.DataFrame.to_csv
- class snps.io.SyntheticSNPGenerator(build=37, seed=None)
Bases: object
Generate realistic synthetic genotype data.
This class generates synthetic SNP data that mimics real genotype files from various DNA testing companies. The generated data is suitable for testing, examples, and documentation.
- Parameters:
Examples
>>> gen = SyntheticSNPGenerator(build=37, seed=123)
>>> gen.save_as_23andme("output.txt", num_snps=10000)
'output.txt'
- generate_snps(num_snps=10000, chromosomes=None, missing_rate=0.01, inject_build_markers=True)
Generate a DataFrame of synthetic SNPs.
- Parameters:
num_snps (int) – Approximate number of SNPs to generate
chromosomes (list of str, optional) – Chromosomes to include (default: all autosomes plus X, Y, MT)
missing_rate (float) – Proportion of SNPs with missing genotypes (default: 0.01)
inject_build_markers (bool) – Inject known marker SNPs for build detection (default: True)
- Returns:
DataFrame with columns: rsid (index), chrom, pos, genotype
- Return type:
pd.DataFrame
- create_example_dataset_pair(output_dir='.')
Create a pair of realistic example datasets suitable for merging.
Generates two correlated genotype files that share a large number of common SNPs, with some discrepancies to demonstrate merge functionality.
snps.io.reader
File format readers for various genotype data sources.
Class for reading SNPs.
- snps.io.reader.get_empty_snps_dataframe()
Get empty dataframe normalized for usage with snps.
- Return type:
pd.DataFrame
- class snps.io.reader.Reader(file='', only_detect_source=False, resources=None, rsids=())
Bases: object
Class for reading and parsing raw data / genotype files.
- read()
Read and parse a raw data / genotype file.
- Returns:
dict with the following items:
- snps (pandas.DataFrame)
dataframe of parsed SNPs
- source (str)
detected source of SNPs
- phased (bool)
flag indicating if SNPs are phased
- Return type:
- read_helper(source, parser)
Generic method to help read files.
- Parameters:
source (str) – name of data source
parser (func) – parsing function, which returns a tuple with the following items:
- 0 (pandas.DataFrame)
dataframe of parsed SNPs (empty if only detecting source)
- 1 (bool), optional
flag indicating if SNPs are phased
- 2 (int), optional
detected build of SNPs
- Returns:
dict with the following items:
- snps (pandas.DataFrame)
dataframe of parsed SNPs
- source (str)
detected source of SNPs
- phased (bool)
flag indicating if SNPs are phased
- build (int)
detected build of SNPs
- Return type:
References
Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.
- read_ftdna_famfinder(file, compression)
Read and parse Family Tree DNA (FTDNA) “famfinder” file.
- read_gsa(data_or_filename, compresion, comments)
Read and parse Illumina Global Screening Array files.
- read_circledna(file, compression)
Read and parse CircleDNA file.
Notes
This method attempts to read and parse a whole exome file, optionally compressed with gzip or zip. Some assumptions are made throughout this process:
SNPs that are not annotated with an RSID are skipped
Insertions and deletions are skipped
- read_generic(file, compression, skip=1)
Read and parse generic CSV or TSV file.
Notes
Assumes columns are ‘rsid’, ‘chrom’ / ‘chromosome’, ‘pos’ / ‘position’, and ‘genotype’; values are comma separated; unreported genotypes are indicated by ‘--’; and one header row precedes data. For example:
rsid,chromosome,position,genotype
rs1,1,1,AA
rs2,1,2,CC
rs3,1,3,--
- read_vcf(file, compression, provider, rsids=(), comments='')
Read and parse VCF file.
Notes
This method attempts to read and parse a VCF file or buffer, optionally compressed with gzip. Some assumptions are made throughout this process:
SNPs that are not annotated with an RSID are skipped
If the VCF contains multiple samples, only the first sample is used to lookup the genotype
Precise insertions and deletions are skipped
If a sample allele is not specified, the genotype is reported as NaN
If a sample allele refers to a REF or ALT allele that is not specified, the genotype is reported as NaN
snps.io.writer
File format writers for exporting genotype data.
Class for writing SNPs.
- class snps.io.writer.Writer(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)
Bases: object
Class for writing SNPs to files.
- __init__(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)
Initialize a Writer.
- Parameters:
snps (SNPs) – SNPs to save to file or write to buffer
filename (str or buffer) – filename for file to save or buffer to write to
vcf (bool) – flag to save file as VCF
atomic (bool) – atomically write output to a file on local filesystem
vcf_alt_unavailable (str) – representation of VCF ALT allele when ALT is not able to be determined
vcf_chrom_prefix (str) – prefix for chromosomes in VCF CHROM column
vcf_qc_only (bool) – for VCF, output only SNPs that pass quality control
vcf_qc_filter (bool) – for VCF, populate VCF FILTER column based on quality control results
**kwargs – additional parameters to pandas.DataFrame.to_csv
snps.io.generator
Synthetic SNP data generation utilities.
Generate synthetic genotype data for testing and examples.
- class snps.io.generator.SyntheticSNPGenerator(build=37, seed=None)
Bases: object
Generate realistic synthetic genotype data.
This class generates synthetic SNP data that mimics real genotype files from various DNA testing companies. The generated data is suitable for testing, examples, and documentation.
- Parameters:
Examples
>>> gen = SyntheticSNPGenerator(build=37, seed=123)
>>> gen.save_as_23andme("output.txt", num_snps=10000)
'output.txt'
- generate_snps(num_snps=10000, chromosomes=None, missing_rate=0.01, inject_build_markers=True)
Generate a DataFrame of synthetic SNPs.
- Parameters:
num_snps (int) – Approximate number of SNPs to generate
chromosomes (list of str, optional) – Chromosomes to include (default: all autosomes plus X, Y, MT)
missing_rate (float) – Proportion of SNPs with missing genotypes (default: 0.01)
inject_build_markers (bool) – Inject known marker SNPs for build detection (default: True)
- Returns:
DataFrame with columns: rsid (index), chrom, pos, genotype
- Return type:
pd.DataFrame
- create_example_dataset_pair(output_dir='.')
Create a pair of realistic example datasets suitable for merging.
Generates two correlated genotype files that share a large number of common SNPs, with some discrepancies to demonstrate merge functionality.
Data Resources
snps.ensembl
Interface to Ensembl REST API for genomic data.
Ensembl REST client.
Notes
Modified from https://github.com/Ensembl/ensembl-rest/wiki/Example-Python-Client.
References
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi.org/10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
snps.resources
Resource management for reference data and assembly mappings.
Class for downloading and loading required external resources.
References
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. http://dx.doi.org/10.1038/35057062
hg19 (GRCh37): Hiram Clawson, Brooke Rhead, Pauline Fujita, Ann Zweig, Katrina Learned, Donna Karolchik and Robert Kuhn, https://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi.org/10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
- class snps.resources.Resources(*args, **kwargs)
Bases: object
Object used to manage resources required by snps.
- __init__(resources_dir='resources')
Initialize a Resources object.
- Parameters:
resources_dir (str) – name / path of resources directory
- get_reference_sequences(assembly='GRCh37', chroms=('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT'))
Get Homo sapiens reference sequences for chroms of assembly.
Notes
This function can download over 800 MB of data for each assembly.
- get_assembly_mapping_data(source_assembly, target_assembly)
Get assembly mapping data.
- Parameters:
source_assembly ({'NCBI36', 'GRCh37', 'GRCh38'}) – assembly to remap from
target_assembly ({'NCBI36', 'GRCh37', 'GRCh38'}) – assembly to remap to
- Returns:
dict of json assembly mapping data if loading was successful, else {}
- Return type:
dict
- create_example_datasets(output_dir=None)
Create synthetic example datasets for demonstrations.
Generates two correlated genotype files in different formats and builds, suitable for demonstrating merging and remapping functionality. The files share ~700K common SNPs with intentional discrepancies to demonstrate merge conflict detection.
- Parameters:
output_dir (str, optional) – Directory for output files (default: resources directory)
- Returns:
paths – Paths to created example datasets
- Return type:
Examples
>>> from snps.resources import Resources
>>> r = Resources()
>>> paths = r.create_example_datasets()
Creating resources/sample1.23andme.txt.gz
Creating resources/sample2.ftdna.csv.gz
- get_all_resources()
Get / download all resources used throughout snps.
Notes
This function does not download reference sequences due to their large sizes.
- Returns:
dict of resources
- Return type:
- get_all_reference_sequences(**kwargs)
Get Homo sapiens reference sequences for Builds 36, 37, and 38 from Ensembl.
Notes
This function can download over 2.5GB of data.
- Returns:
dict of ReferenceSequence, else {}
- Return type:
- get_gsa_resources()
Get resources for reading Global Screening Array files.
https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html
- Return type:
- get_chip_clusters()
Get resource for identifying deduced genotype / chip array based on chip clusters.
- Return type:
References
Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.
Lu, Tzovaras, & Gough. (2021). OpenSNP data-freeze of 5,393 (19.10.2020) [Data set]. In Computational and Structural Biotechnology Journal. Zenodo. https://doi.org/10.1016/j.csbj.2021.06.040
- get_low_quality_snps()
Get listing of low quality SNPs for quality control based on chip clusters.
- Return type:
References
Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.
Lu, Tzovaras, & Gough. (2021). OpenSNP data-freeze of 5,393 (19.10.2020) [Data set]. In Computational and Structural Biotechnology Journal. Zenodo. https://doi.org/10.1016/j.csbj.2021.06.040
- get_dbsnp_151_37_reverse()
Get and load RSIDs that are on the reference reverse (-) strand in dbSNP 151 and lower.
- Return type:
References
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1; 29(1):308-11.
Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/
- get_gsa_rsid()
Get and load GSA RSID map.
https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html
- Return type:
- get_gsa_chrpos()
Get and load GSA chromosome position map.
https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html
- Return type:
- class snps.resources.ReferenceSequence(ID='', url='', path='', assembly='', species='', taxonomy='')
Bases: object
Object used to represent and interact with a reference sequence.
- __init__(ID='', url='', path='', assembly='', species='', taxonomy='')
Initialize a ReferenceSequence object.
- Parameters:
References
The Variant Call Format (VCF) Version 4.3 Specification, 27 Nov 2022, https://samtools.github.io/hts-specs/VCFv4.3.pdf
- property sequence¶
Get reference sequence.
- Return type:
np.array(dtype=np.uint8)
Utilities¶
snps.utils¶
Helper functions and utilities.
Utility classes and functions.
- class snps.utils.Parallelizer(parallelize=False, processes=2)[source]¶
Bases:
object
- snps.utils.create_dir(path)[source]¶
Create directory specified by path if it doesn’t already exist.
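The behavior described above can be approximated with the standard library. The following is a minimal sketch, not the library's implementation; the name `ensure_dir` and its boolean return value are assumptions made for illustration.

```python
import os
import tempfile


def ensure_dir(path):
    """Create ``path`` (and any missing parents) if it doesn't already exist.

    Hypothetical helper; returns True if the directory exists afterwards.
    """
    try:
        # exist_ok=True makes the call a no-op when the directory is already there
        os.makedirs(path, exist_ok=True)
    except OSError:
        # e.g. permission denied, or ``path`` exists but is a regular file
        return False
    return os.path.isdir(path)


# Usage: calling twice is safe because the second call is a no-op
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "output", "nested")
    first = ensure_dir(target)
    second = ensure_dir(target)
```

Making the operation idempotent means callers do not need to check for the directory before each write.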
- snps.utils.save_df_as_csv(df, path, filename, comment='', prepend_info=True, atomic=True, **kwargs)[source]¶
Save dataframe to a CSV file.
- Parameters:
df (pandas.DataFrame) – dataframe to save
path (str) – path to directory where to save CSV file
filename (str or buffer) – filename for file to save or buffer to write to
comment (str) – header comment(s); one or more lines starting with ‘#’
prepend_info (bool) – prepend file generation information as comments
atomic (bool) – atomically write output to a file on local filesystem
**kwargs – additional parameters to pandas.DataFrame.to_csv
- Returns:
path to saved file or buffer (empty str if error)
- Return type:
str or buffer
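One common way to write a file atomically is to write to a temporary file in the same directory and then rename it over the destination. The sketch below illustrates that general technique using only the standard library. It is an assumption about what the `atomic` option refers to, not the library's actual code, and `write_csv_atomically` is a hypothetical name.

```python
import csv
import os
import tempfile


def write_csv_atomically(rows, header, path, comment=""):
    """Write rows to ``path`` so readers never see a partially written file.

    Hypothetical helper illustrating the write-then-rename technique.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the destination
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            if comment:
                f.write(comment.rstrip("\n") + "\n")  # header comment line(s)
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Clean up the temporary file if anything went wrong
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return path


with tempfile.TemporaryDirectory() as tmp:
    out = write_csv_atomically(
        rows=[("rs1", "1", 101, "AA"), ("rs2", "1", 102, "CC")],
        header=("rsid", "chrom", "pos", "genotype"),
        path=os.path.join(tmp, "snps.csv"),
        comment="# Generated by an example script",
    )
    with open(out, newline="") as f:
        lines = f.read().splitlines()
```

If the process is interrupted before the rename, the destination either keeps its previous contents or does not exist, so it never holds a truncated file.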
snps.testing¶
Shared test utilities for snps.
- snps.testing.complement_genotype(genotype)[source]¶
Get the complement of a genotype (both alleles).
- snps.testing.complement_one_allele(genotype)[source]¶
Get the complement of only the first allele of a genotype.
The second allele is preserved unchanged. This is useful for simulating partial strand complementation in test data.
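The two complement operations above can be sketched as follows, assuming genotypes are short strings of A/C/G/T bases and that any other character is passed through unchanged. These are illustrative reimplementations with hypothetical names (`complement_both`, `complement_first`), not the library's code; how the library treats null genotypes is not shown here.

```python
# Base pairing used for complementation: A<->T, C<->G
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement_both(genotype):
    """Complement every allele in the genotype (order is preserved)."""
    return genotype.translate(_COMPLEMENT)


def complement_first(genotype):
    """Complement only the first allele; the rest is left unchanged."""
    return genotype[:1].translate(_COMPLEMENT) + genotype[1:]


both = complement_both("AG")    # both alleles flipped
first = complement_first("AG")  # only the leading "A" flipped
```

Note that this is a complement, not a reverse complement: allele order is kept as given.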
- snps.testing.create_simulated_snp_df(chrom='1', pos_start=1, pos_max=248140902, pos_step=100, pos_dtype=<class 'numpy.uint32'>, genotype='AA', insert_nulls=True, null_snp_step=101, complement_genotype_one_allele=False, complement_genotype_two_alleles=False, complement_snp_step=50)[source]¶
Create a simulated SNP DataFrame for testing.
This is the core logic for creating simulated SNP data. Each project can wrap this to assign to their specific object types.
- Parameters:
chrom (str) – Chromosome value for all SNPs (default: “1”)
pos_start (int) – Starting position (default: 1)
pos_max (int) – Maximum position (default: 248140902)
pos_step (int) – Step between positions (default: 100)
pos_dtype (type) – Numpy dtype for positions (default: np.uint32)
genotype (str) – Default genotype for all SNPs (default: “AA”)
insert_nulls (bool) – Whether to insert null genotypes (default: True)
null_snp_step (int) – Insert null every N SNPs (default: 101)
complement_genotype_one_allele (bool) – Complement first allele at intervals (default: False)
complement_genotype_two_alleles (bool) – Complement both alleles at intervals (default: False)
complement_snp_step (int) – Apply complement every N SNPs (default: 50)
- Returns:
DataFrame with rsid index and chrom, pos, genotype columns
- Return type:
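To make the interaction of the step parameters concrete, here is a small pure-Python sketch that produces rows rather than a DataFrame. It is not the library's implementation. It assumes that positions stop before `pos_max`, that rsids are numbered from rs1, and that "every N SNPs" means rows 0, N, 2N, and so on. The function name `simulate_snps` and its small defaults are hypothetical and chosen only to keep the output short.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def simulate_snps(pos_start=1, pos_max=1000, pos_step=100, chrom="1",
                  genotype="AA", null_snp_step=4, complement_snp_step=3,
                  complement_two_alleles=False):
    """Return a list of (rsid, chrom, pos, genotype) tuples.

    Assumptions (not confirmed by the reference above): positions stop
    before pos_max, rsids start at rs1, and step parameters select rows
    0, N, 2N, ...
    """
    rows = []
    for i, pos in enumerate(range(pos_start, pos_max, pos_step)):
        g = genotype
        if i % null_snp_step == 0:
            g = None  # null genotype takes precedence in this sketch
        elif complement_two_alleles and i % complement_snp_step == 0:
            g = g.translate(_COMPLEMENT)  # complement both alleles
        rows.append((f"rs{i + 1}", chrom, pos, g))
    return rows


rows = simulate_snps(complement_two_alleles=True)
genotypes = [g for _, _, _, g in rows]
```

With these defaults the sketch yields ten rows, with nulls at rows 0, 4, and 8 and complemented genotypes at rows 3, 6, and 9.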
- snps.testing.assert_series_equal_with_string_dtype(left, right, test_case=None, **kwargs)[source]¶
Assert Series are equal, accepting both object and StringDtype for string data.
In Python 3.14+, pandas infers StringDtype for string data instead of object. This function compares Series without strict dtype matching for string data.
- snps.testing.assert_frame_equal_with_string_index(left, right, test_case=None, **kwargs)[source]¶
Assert DataFrames are equal, accepting both object and StringDtype for string columns.
In Python 3.14+, pandas infers StringDtype for string columns/indices instead of object. This function validates that string columns have string types, then compares the DataFrames without strict dtype matching for object/string columns.
- Parameters:
- Return type:
None
- class snps.testing.SNPsTestMixin[source]¶
Bases:
object
Mixin class providing common test assertions and utilities for SNP DataFrames.
This mixin can be combined with unittest.TestCase to add convenient assertion methods for comparing SNP DataFrames with flexible string dtype handling, plus common test utilities like creating test DataFrames.
Example
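>>> from unittest import TestCase
>>> from snps.testing import SNPsTestMixin
>>> class MyTestCase(SNPsTestMixin, TestCase):
...     def test_something(self):
...         df = self.generic_snps()
...         # expected_df is the DataFrame your test expects
...         self.assert_frame_equal_with_string_index(df, expected_df)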
- property downloads_enabled: bool¶
Check if external downloads are enabled for tests.
Only download from external resources when an environment variable named “DOWNLOADS_ENABLED” is set to “true”.
- Return type:
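The environment check described above can be sketched with the standard library. This is an illustrative stand-in, not the library's code: it assumes an exact, case-sensitive comparison against "true", and the optional `environ` argument is a hypothetical addition that keeps the example independent of the real environment.

```python
import os


def downloads_enabled(environ=None):
    """Return True only when DOWNLOADS_ENABLED is set to "true"."""
    env = os.environ if environ is None else environ
    return env.get("DOWNLOADS_ENABLED") == "true"


# Passing a mapping avoids modifying the real process environment
enabled = downloads_enabled({"DOWNLOADS_ENABLED": "true"})
disabled = downloads_enabled({})
```

Gating network access behind an opt-in variable keeps the default test run offline and deterministic.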
- static get_complement(base)[source]¶
Get the complement of a DNA base.
See get_complement() for details.
- complement_genotype(genotype)[source]¶
Get the complement of a genotype (both alleles).
See complement_genotype() for details.
- complement_one_allele(genotype)[source]¶
Get the complement of only the first allele of a genotype.
See complement_one_allele() for details.
- static create_snp_df(rsid, chrom, pos, genotype)[source]¶
Create a normalized SNP DataFrame.
See create_snp_df() for details.
- generic_snps()[source]¶
Create a generic SNP DataFrame for testing.
- Returns:
DataFrame with 8 SNPs (rs1-rs8) on chromosome 1
- Return type:
- assert_series_equal_with_string_dtype(left, right, **kwargs)[source]¶
Assert Series are equal, accepting both object and StringDtype for string data.
See assert_series_equal_with_string_dtype() for details.
- assert_frame_equal_with_string_index(left, right, **kwargs)[source]¶
Assert DataFrames are equal, accepting both object and StringDtype for string columns.
See assert_frame_equal_with_string_index() for details.