Code Documentation

SNPs

SNPs reads, writes, merges, and remaps genotype / raw data files.

class snps.snps.SNPs(file='', only_detect_source=False, assign_par_snps=False, output_dir='output', resources_dir='resources', deduplicate=True, deduplicate_XY_chrom=True, deduplicate_MT_chrom=True, parallelize=False, processes=2, rsids=())[source]

Bases: object

__init__(file='', only_detect_source=False, assign_par_snps=False, output_dir='output', resources_dir='resources', deduplicate=True, deduplicate_XY_chrom=True, deduplicate_MT_chrom=True, parallelize=False, processes=2, rsids=())[source]

Object used to read, write, and remap genotype / raw data files.

Parameters
  • file (str or bytes) – path to file to load or bytes to load

  • only_detect_source (bool) – only detect the source of the data

  • assign_par_snps (bool) – assign PAR SNPs to the X and Y chromosomes

  • output_dir (str) – path to output directory

  • resources_dir (str) – name / path of resources directory

  • deduplicate (bool) – deduplicate RSIDs and make the duplicate SNPs available as SNPs.duplicate

  • deduplicate_MT_chrom (bool) – deduplicate alleles on MT; see SNPs.heterozygous_MT

  • deduplicate_XY_chrom (bool or str) – deduplicate alleles in the non-PAR regions of X and Y for males; see SNPs.discrepant_XY. If a str, this is the sex determination method to use: X, Y, or XY

  • parallelize (bool) – utilize multiprocessing to speed up calculations

  • processes (int) – processes to launch if multiprocessing

  • rsids (tuple, optional) – rsids to extract if loading a VCF file

property assembly

Assembly of SNPs.

Returns

Return type

str

property build

Build of SNPs.

Returns

Return type

int

property build_detected

Status indicating if build of SNPs was detected.

Returns

Return type

bool

property chip

Detected deduced genotype / chip array, if any, per compute_cluster_overlap.

Returns

detected chip array, else empty str

Return type

str

property chip_version

Detected genotype / chip array version, if any, per compute_cluster_overlap.

Notes

Chip array version is only applicable to 23andMe (v3, v4, v5) and AncestryDNA (v1, v2) files.

Returns

detected chip array version, e.g., ‘v4’, else empty str

Return type

str

property chromosomes

Chromosomes of SNPs.

Returns

list of str chromosomes (e.g., [‘1’, ‘2’, ‘3’, ‘MT’]), empty list if no chromosomes

Return type

list

property chromosomes_summary

Summary of the chromosomes of SNPs.

Returns

human-readable listing of chromosomes (e.g., ‘1-3, MT’), empty str if no chromosomes

Return type

str
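The human-readable listing can be pictured as a run-length collapse of the chromosome list. The following is a minimal sketch of that idea, not the library's implementation; the function name summarize_chromosomes is hypothetical, and it assumes the input is already ordered with numeric chromosomes first.

```python
def summarize_chromosomes(chromosomes):
    """Collapse an ordered list of chromosome names into a compact string.

    Consecutive numeric chromosomes are merged into ranges (e.g., "1-3");
    non-numeric chromosomes (e.g., "X", "MT") are appended unchanged.
    """
    if not chromosomes:
        return ""

    numeric = [int(c) for c in chromosomes if c.isdigit()]
    other = [c for c in chromosomes if not c.isdigit()]

    parts = []
    start = prev = None
    for n in numeric:
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n  # extend the current run
        else:
            # close the current run and start a new one
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = n
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")

    return ", ".join(parts + other)


summary = summarize_chromosomes(["1", "2", "3", "MT"])
```

With the example list from the chromosomes property, the sketch yields the same style of string as documented above ('1-3, MT').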

property cluster

Detected chip cluster, if any, per compute_cluster_overlap.

Notes

Refer to compute_cluster_overlap for more details about chip clusters.

Returns

detected chip cluster, e.g., ‘c1’, else empty str

Return type

str

compute_cluster_overlap(cluster_overlap_threshold=0.95)[source]

Compute overlap with chip clusters.

Chip clusters, which are defined in [1], are associated with deduced genotype / chip arrays and DTC companies.

This method also sets the values returned by the cluster, chip, and chip_version properties, based on max overlap, if the specified threshold is satisfied.

Parameters

cluster_overlap_threshold (float) – threshold for cluster to overlap this SNPs object, and vice versa, to set values returned by the cluster, chip, and chip_version properties

Returns

pandas.DataFrame with the following columns:

  • company_composition – DTC company composition of associated cluster from [1]

  • chip_base_deduced – deduced genotype / chip array of associated cluster from [1]

  • snps_in_cluster – count of SNPs in cluster

  • snps_in_common – count of SNPs in common with cluster (inner merge with cluster)

  • overlap_with_cluster – percentage overlap of snps_in_common with cluster

  • overlap_with_self – percentage overlap of snps_in_common with this SNPs object

Return type

pandas.DataFrame

References

[1]

Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.
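To make the two overlap columns concrete, here is a minimal sketch of the arithmetic for a single cluster. It is not the library's code: the function name cluster_overlap and the representation of SNPs as sets of (chrom, pos) tuples are assumptions, as is the use of a strict "greater than" comparison against the threshold.

```python
def cluster_overlap(sample_positions, cluster_positions, threshold=0.95):
    """Compute overlap fractions between a sample and one chip cluster.

    Both arguments are sets of (chrom, pos) tuples (an assumption made for
    this sketch). A cluster "matches" only when the overlap exceeds the
    threshold in both directions.
    """
    if not sample_positions or not cluster_positions:
        return {
            "snps_in_cluster": len(cluster_positions),
            "snps_in_common": 0,
            "overlap_with_cluster": 0.0,
            "overlap_with_self": 0.0,
            "matches": False,
        }

    common = sample_positions & cluster_positions
    overlap_with_cluster = len(common) / len(cluster_positions)
    overlap_with_self = len(common) / len(sample_positions)

    return {
        "snps_in_cluster": len(cluster_positions),
        "snps_in_common": len(common),
        "overlap_with_cluster": overlap_with_cluster,
        "overlap_with_self": overlap_with_self,
        "matches": overlap_with_cluster > threshold
        and overlap_with_self > threshold,
    }


# Hypothetical data: 100 sample SNPs, 98 of which are in the cluster.
sample = {("1", i) for i in range(100)}
cluster = {("1", i) for i in range(2, 100)}
result = cluster_overlap(sample, cluster)
```

Requiring the threshold in both directions reflects the "and vice versa" wording of cluster_overlap_threshold: a small file that is fully contained in a large cluster would otherwise be reported as a match.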

property count

Count of SNPs.

Returns

Return type

int

detect_build()[source]

Detect build of SNPs.

Use the coordinates of common SNPs to identify the build / assembly of a genotype file that is being loaded.

Notes

  • rs3094315 : plus strand in 36, 37, and 38

  • rs11928389 : plus strand in 36, minus strand in 37 and 38

  • rs2500347 : plus strand in 36 and 37, minus strand in 38

  • rs964481 : plus strand in 36, 37, and 38

  • rs2341354 : plus strand in 36, 37, and 38

  • rs3850290 : plus strand in 36, 37, and 38

  • rs1329546 : plus strand in 36, 37, and 38

Returns

detected build of SNPs, else 0

Return type

int

References

  1. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613

  2. Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

  3. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1;29(1):308-11.

  4. Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. dbSNP accession: rs3094315, rs11928389, rs2500347, rs964481, rs2341354, rs3850290, and rs1329546 (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/
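The coordinate lookup described above can be sketched as a table search. This is only an illustration of the idea: the function name is hypothetical, and the rsID and positions in the example table are placeholders, not real dbSNP coordinates.

```python
def detect_build_from_positions(observed, known_positions):
    """Return the build whose known coordinate matches an observed SNP, else 0.

    observed: dict mapping rsid -> position found in the genotype file
    known_positions: dict mapping rsid -> {build: position}
    """
    for rsid, builds in known_positions.items():
        pos = observed.get(rsid)
        if pos is None:
            continue  # this reference SNP is not present in the file
        for build, known_pos in builds.items():
            if pos == known_pos:
                return build
    return 0


# Placeholder table (hypothetical rsID and positions, for illustration only).
KNOWN = {"rs_example": {36: 100, 37: 150, 38: 175}}

build = detect_build_from_positions({"rs_example": 150}, KNOWN)
```

Several reference SNPs are useful because any single one may be absent from a given file; the first one that matches decides the build.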

determine_sex(heterozygous_x_snps_threshold=0.03, y_snps_not_null_threshold=0.3, chrom='X')[source]

Determine sex from SNPs using thresholds.

Parameters
  • heterozygous_x_snps_threshold (float) – percentage heterozygous X SNPs; above this threshold, Female is determined

  • y_snps_not_null_threshold (float) – percentage Y SNPs that are not null; above this threshold, Male is determined

  • chrom ({“X”, “Y”}) – use X or Y chromosome SNPs to determine sex

Returns

‘Male’ or ‘Female’ if detected, else empty str

Return type

str
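The two thresholds can be read as simple ratio tests. The sketch below follows the parameter descriptions above; it is not the library's implementation, the function name is hypothetical, and representing genotypes as a list of strings (None for no-calls) is an assumption.

```python
def determine_sex_sketch(
    genotypes,
    chrom="X",
    heterozygous_x_snps_threshold=0.03,
    y_snps_not_null_threshold=0.3,
):
    """Guess sex from the genotypes observed on one sex chromosome.

    genotypes: list of genotype strings for `chrom`; None marks a no-call.
    Returns "Male", "Female", or "" when nothing can be determined.
    """
    if not genotypes:
        return ""

    if chrom == "X":
        # Males carry one X, so heterozygous X calls should be rare.
        heterozygous = sum(
            1 for g in genotypes if g is not None and len(g) == 2 and g[0] != g[1]
        )
        ratio = heterozygous / len(genotypes)
        return "Female" if ratio > heterozygous_x_snps_threshold else "Male"

    if chrom == "Y":
        # Females carry no Y, so most Y SNPs should be null.
        not_null = sum(1 for g in genotypes if g is not None)
        ratio = not_null / len(genotypes)
        return "Male" if ratio > y_snps_not_null_threshold else "Female"

    return ""


x_calls = ["AA"] * 90 + ["AG"] * 10  # 10% heterozygous X SNPs
sex = determine_sex_sketch(x_calls, chrom="X")
```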

property discrepant_XY

Discrepant XY SNPs.

A discrepant XY SNP is a heterozygous SNP in the non-PAR region of the X or Y chromosome found during deduplication for a detected male genotype.

Returns

normalized snps dataframe

Return type

pandas.DataFrame

property discrepant_merge_genotypes

SNPs with discrepant genotypes discovered while merging SNPs.

Notes

Definitions of columns in this dataframe are as follows:

Column            Description
rsid              SNP ID
chrom             Chromosome of existing SNP
pos               Position of existing SNP
genotype          Genotype of existing SNP
chrom_added       Chromosome of added SNP
pos_added         Position of added SNP
genotype_added    Genotype of added SNP (discrepant with genotype)

Returns

Return type

pandas.DataFrame

property discrepant_merge_positions

SNPs with discrepant positions discovered while merging SNPs.

Notes

Definitions of columns in this dataframe are as follows:

Column            Description
rsid              SNP ID
chrom             Chromosome of existing SNP
pos               Position of existing SNP
genotype          Genotype of existing SNP
chrom_added       Chromosome of added SNP
pos_added         Position of added SNP (discrepant with pos)
genotype_added    Genotype of added SNP

Returns

Return type

pandas.DataFrame

property discrepant_merge_positions_genotypes

SNPs with discrepant positions and / or genotypes discovered while merging SNPs.

Notes

Definitions of columns in this dataframe are as follows:

Column            Description
rsid              SNP ID
chrom             Chromosome of existing SNP
pos               Position of existing SNP
genotype          Genotype of existing SNP
chrom_added       Chromosome of added SNP
pos_added         Position of added SNP (possibly discrepant with pos)
genotype_added    Genotype of added SNP (possibly discrepant with genotype)

Returns

Return type

pandas.DataFrame

property discrepant_vcf_position

SNPs with discrepant positions discovered while saving VCF.

Returns

normalized snps dataframe

Return type

pandas.DataFrame

property duplicate

Duplicate SNPs.

A duplicate SNP has the same RSID as another SNP. The first occurrence of the RSID is not considered a duplicate SNP.

Returns

normalized snps dataframe

Return type

pandas.DataFrame
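The "first occurrence is not a duplicate" rule can be illustrated with a short sketch. It uses plain tuples instead of a dataframe and is not the library's code; the function name split_duplicates is hypothetical.

```python
def split_duplicates(rows):
    """Separate rows into first occurrences and later duplicates by RSID.

    rows: iterable of (rsid, chrom, pos, genotype) tuples in file order.
    Returns (unique_rows, duplicate_rows).
    """
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        rsid = row[0]
        if rsid in seen:
            duplicates.append(row)  # a later occurrence of a known RSID
        else:
            seen.add(rsid)
            unique.append(row)  # the first occurrence is kept
    return unique, duplicates


rows = [
    ("rs1", "1", 100, "AA"),
    ("rs2", "1", 200, "CT"),
    ("rs1", "1", 100, "AG"),
]
unique, duplicates = split_duplicates(rows)
```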

get_count(chrom='')[source]

Count of SNPs.

Parameters

chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)

Returns

Return type

int

static get_par_regions(build)[source]

Get PAR regions for the X and Y chromosomes.

Parameters

build (int) – build of SNPs

Returns

PAR regions for the given build

Return type

pandas.DataFrame

References

  1. Genome Reference Consortium, https://www.ncbi.nlm.nih.gov/grc/human

  2. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613

  3. Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

heterozygous(chrom='')[source]

Get heterozygous SNPs.

Parameters

chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)

Returns

normalized snps dataframe

Return type

pandas.DataFrame

property heterozygous_MT

Heterozygous SNPs on the MT chromosome found during deduplication.

Returns

normalized snps dataframe

Return type

pandas.DataFrame

homozygous(chrom='')[source]

Get homozygous SNPs.

Parameters

chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)

Returns

normalized snps dataframe

Return type

pandas.DataFrame

identify_low_quality_snps()[source]

Identify low quality SNPs based on chip clusters.

Any low quality SNPs are removed from the snps_qc dataframe and are made available as low_quality.

Notes

Chip clusters, which are defined in [1] (Lu et al., 2021; see the References of compute_cluster_overlap), are associated with low quality SNPs. As such, low quality SNPs will only be identified when this SNPs object corresponds to a cluster per compute_cluster_overlap().

property low_quality

SNPs identified as low quality, if any, per identify_low_quality_snps().

Returns

normalized snps dataframe

Return type

pandas.DataFrame

merge(snps_objects=(), discrepant_positions_threshold=100, discrepant_genotypes_threshold=500, remap=True, chrom='')[source]

Merge other SNPs objects into this SNPs object.

Parameters
  • snps_objects (list or tuple of SNPs) – other SNPs objects to merge into this SNPs object

  • discrepant_positions_threshold (int) – threshold for discrepant SNP positions between existing data and data to be loaded; a large value could indicate mismatched genome assemblies

  • discrepant_genotypes_threshold (int) – threshold for discrepant genotype data between existing data and data to be loaded; a large value could indicate mismatched individuals

  • remap (bool) – if necessary, remap other SNPs objects to have the same build as this SNPs object before merging

  • chrom (str, optional) – chromosome to merge (e.g., “1”, “Y”, “MT”)

Returns

for each SNPs object to merge, a dict with the following items:

  • merged (bool) – whether SNPs object was merged

  • common_rsids (pandas.Index) – SNPs in common

  • discrepant_position_rsids (pandas.Index) – SNPs with discrepant positions

  • discrepant_genotype_rsids (pandas.Index) – SNPs with discrepant genotypes

Return type

list of dict

References

  1. Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.
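The items in each returned dict can be understood through a small comparison of two SNP tables. The sketch below is not the library's merge logic: the function name compare_snps and the dict-based representation are hypothetical, and it assumes that allele order is insignificant and that a null genotype on either side is not a discrepancy.

```python
def compare_snps(existing, added):
    """Find common RSIDs and discrepancies between two SNP tables.

    existing, added: dicts mapping rsid -> (chrom, pos, genotype), where
    genotype is a string or None.
    """

    def same_genotype(a, b):
        if a is None or b is None:
            return True  # assumption: nothing to compare against
        return sorted(a) == sorted(b)  # assumption: "AG" equals "GA"

    common = sorted(existing.keys() & added.keys())
    discrepant_positions = [
        rsid for rsid in common if existing[rsid][:2] != added[rsid][:2]
    ]
    discrepant_genotypes = [
        rsid for rsid in common
        if not same_genotype(existing[rsid][2], added[rsid][2])
    ]
    return {
        "common_rsids": common,
        "discrepant_position_rsids": discrepant_positions,
        "discrepant_genotype_rsids": discrepant_genotypes,
    }


existing = {
    "rs1": ("1", 100, "AG"),
    "rs2": ("1", 200, "CC"),
    "rs3": ("2", 300, None),
}
added = {
    "rs1": ("1", 100, "GA"),
    "rs2": ("1", 250, "CT"),
    "rs3": ("2", 300, "TT"),
    "rs4": ("3", 400, "GG"),
}
report = compare_snps(existing, added)
```

Comparing the sizes of the two discrepancy lists with discrepant_positions_threshold and discrepant_genotypes_threshold is how a caller could decide whether two files plausibly describe the same individual on the same assembly.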

notnull(chrom='')[source]

Get not null genotype SNPs.

Parameters

chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)

Returns

normalized snps dataframe

Return type

pandas.DataFrame

property phased

Indicates if genotype is phased.

Returns

Return type

bool

predict_ancestry(output_directory=None, write_predictions=False, models_directory=None, aisnps_directory=None, n_components=None, k=None, thousand_genomes_directory=None, samples_directory=None, algorithm=None, aisnps_set=None)[source]

Predict genetic ancestry for SNPs.

Predictions by ezancestry.

Notes

Populations below are described here.

Parameters

various (optional) – See the available settings for predict at ezancestry.

Returns

dict with the following keys:

  • population_code (str) – max predicted population for the sample

  • population_description (str) – descriptive name of the population

  • population_percent (float) – predicted probability for the max predicted population

  • superpopulation_code (str) – max predicted super population (continental) for the sample

  • superpopulation_description (str) – descriptive name of the super population

  • superpopulation_percent (float) – predicted probability for the max predicted super population

  • ezancestry_df (pandas.DataFrame) – pandas.DataFrame with the following columns:

      • component1, component2, component3 – The coordinates of the sample in the dimensionality-reduced component space. Can be used as (x, y, z,) coordinates for plotting in a 3d scatter plot.

      • predicted_population_population – The max predicted population for the sample.

      • ACB, ASW, BEB, CDX, CEU, CHB, CHS, CLM, ESN, FIN, GBR, GIH, GWD, IBS, ITU, JPT, KHV, LWK, MSL, MXL, PEL, PJL, PUR, STU, TSI, YRI – Predicted probabilities for each of the populations. These sum to 1.0.

      • predicted_population_superpopulation – The max predicted super population (continental) for the sample.

      • AFR, AMR, EAS, EUR, SAS – Predicted probabilities for each of the super populations. These sum to 1.0.

      • population_description, superpopulation_name – Descriptive names of the population and super population.

Return type

dict

remap(target_assembly, complement_bases=True)[source]

Remap SNP coordinates from one assembly to another.

This method uses the assembly map endpoint of the Ensembl REST API service (via Resources’s EnsemblRestClient) to convert SNP coordinates / positions from one assembly to another. After remapping, the coordinates / positions for the SNPs will be that of the target assembly.

If the SNPs are already mapped relative to the target assembly, remapping will not be performed.

Parameters
  • target_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’, 36, 37, 38}) – assembly to remap to

  • complement_bases (bool) – complement bases when remapping SNPs to the minus strand

Returns

  • chromosomes_remapped (list of str) – chromosomes remapped

  • chromosomes_not_remapped (list of str) – chromosomes not remapped

Notes

An assembly is also known as a “build.” For example:

  • Assembly NCBI36 = Build 36

  • Assembly GRCh37 = Build 37

  • Assembly GRCh38 = Build 38

See https://www.ncbi.nlm.nih.gov/assembly for more information about assemblies and remapping.

References

  1. Ensembl, Assembly Map Endpoint, http://rest.ensembl.org/documentation/info/assembly_map

  2. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613

  3. Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
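Two small pieces of the remapping description lend themselves to a sketch: accepting either an assembly name or a build number for target_assembly, and complementing bases for SNPs that land on the minus strand. The helper names below are hypothetical and the code is an illustration under those assumptions, not the library's implementation.

```python
# Build number -> assembly name, per the Notes above.
ASSEMBLY_NAMES = {36: "NCBI36", 37: "GRCh37", 38: "GRCh38"}


def normalize_assembly(target_assembly):
    """Return the assembly name for a name or build number."""
    if target_assembly in ASSEMBLY_NAMES:
        return ASSEMBLY_NAMES[target_assembly]
    if target_assembly in ASSEMBLY_NAMES.values():
        return target_assembly
    raise ValueError(f"unsupported assembly: {target_assembly!r}")


# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement_genotype(genotype):
    """Complement each base of a genotype string (e.g., "AG" -> "TC")."""
    return genotype.translate(_COMPLEMENT)


assembly = normalize_assembly(37)
flipped = complement_genotype("AG")
```

Characters other than A, C, G, and T pass through complement_genotype unchanged, which keeps insertion / deletion codes or other markers intact in this sketch.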

property sex

Sex derived from SNPs.

Returns

‘Male’ or ‘Female’ if detected, else empty str

Return type

str

property snps

Normalized SNPs.

Notes

Throughout snps, the “normalized snps dataframe” is defined as follows:

Column      Description                            pandas dtype
rsid *      SNP ID                                 object (string)
chrom       Chromosome of SNP                      object (string)
pos         Position of SNP (relative to build)    uint32
genotype    Genotype of SNP                        object (string)

* Dataframe index

Genotype can be null, length 1, or length 2. Specifically, genotype is null if not called or unavailable. Otherwise, for autosomal chromosomes, genotype is two alleles. For the X and Y chromosomes, male genotypes are one allele in the non-PAR regions (assuming deduplicate_XY_chrom). For the MT chromosome, genotypes are one allele (assuming deduplicate_MT_chrom).

Returns

normalized snps dataframe

Return type

pandas.DataFrame
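The genotype rules in the Notes above (null, one allele, or two alleles) can be summarized with a small classifier. This is an illustrative sketch only; the function name classify_genotype and the use of None for a null genotype are assumptions.

```python
def classify_genotype(genotype):
    """Classify a normalized genotype value.

    Returns "null", "single allele", "homozygous", or "heterozygous".
    """
    if genotype is None:
        return "null"  # not called or unavailable
    if len(genotype) == 1:
        return "single allele"  # e.g., MT, or non-PAR X / Y for males
    if len(genotype) == 2:
        return "homozygous" if genotype[0] == genotype[1] else "heterozygous"
    raise ValueError(f"unexpected genotype: {genotype!r}")


labels = [classify_genotype(g) for g in (None, "A", "CC", "AG")]
```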

property snps_qc

Normalized SNPs, after quality control.

Any low quality SNPs, identified per identify_low_quality_snps(), are not included in the result.

Returns

normalized snps dataframe

Return type

pandas.DataFrame

sort()[source]

Sort SNPs based on ordered chromosome list and position.

property source

Summary of the SNP data source(s).

Returns

Data source(s) for this SNPs object, separated by “, “.

Return type

str

property summary

Summary of SNPs.

Returns

summary info if SNPs is valid, else {}

Return type

dict

to_csv(filename='', atomic=True, **kwargs)[source]

Output SNPs as comma-separated values.

Parameters
  • filename (str or buffer) – filename for file to save or buffer to write to

  • atomic (bool) – atomically write output to a file on local filesystem

  • **kwargs – additional parameters to pandas.DataFrame.to_csv

Returns

path to file in output directory if SNPs were saved, else empty str

Return type

str
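An atomic write is commonly achieved by writing to a temporary file in the destination directory and then renaming it over the target. The sketch below shows that general pattern with the standard library; it is an assumption about what "atomically write" means here, not a description of the mechanism snps actually uses, and the function name is hypothetical.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            f.write(text)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # do not leave a stray temporary file
        raise
    return path
```

For example, atomic_write_text("output/example.csv", "rsid,chrom,pos,genotype\n") would either leave the complete file in place or leave any previous file untouched (the output directory must already exist).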

to_tsv(filename='', atomic=True, **kwargs)[source]

Output SNPs as tab-separated values.

Note that this results in the same default output as save.

Parameters
  • filename (str or buffer) – filename for file to save or buffer to write to

  • atomic (bool) – atomically write output to a file on local filesystem

  • **kwargs – additional parameters to pandas.DataFrame.to_csv

Returns

path to file in output directory if SNPs were saved, else empty str

Return type

str

to_vcf(filename='', atomic=True, alt_unavailable='.', chrom_prefix='', qc_only=False, qc_filter=False, **kwargs)[source]

Output SNPs as Variant Call Format.

Parameters
  • filename (str or buffer) – filename for file to save or buffer to write to

  • atomic (bool) – atomically write output to a file on local filesystem

  • alt_unavailable (str) – representation of ALT allele when ALT is not able to be determined

  • chrom_prefix (str) – prefix for chromosomes in VCF CHROM column

  • qc_only (bool) – output only SNPs that pass quality control

  • qc_filter (bool) – populate FILTER column based on quality control results

  • **kwargs – additional parameters to pandas.DataFrame.to_csv

Returns

path to file in output directory if SNPs were saved, else empty str

Return type

str

Notes

Parameters qc_only and qc_filter, if true, will identify low quality SNPs per identify_low_quality_snps(), if not done already. Moreover, these parameters have no effect if this SNPs object does not map to a cluster per compute_cluster_overlap().

References

  1. The Variant Call Format (VCF) Version 4.2 Specification, 8 Mar 2019, https://samtools.github.io/hts-specs/VCFv4.2.pdf

property unannotated_vcf

Indicates if VCF file is unannotated.

Returns

Return type

bool

property valid

Determine if SNPs is valid.

SNPs is valid when the input file has been successfully parsed.

Returns

True if SNPs is valid

Return type

bool

snps.ensembl

Ensembl REST client.

Notes

Modified from https://github.com/Ensembl/ensembl-rest/wiki/Example-Python-Client.

References

  1. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613

  2. Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

class snps.ensembl.EnsemblRestClient(server='https://rest.ensembl.org', reqs_per_sec=15)[source]

Bases: object

__init__(server='https://rest.ensembl.org', reqs_per_sec=15)[source]
perform_rest_action(endpoint, hdrs=None, params=None)[source]

snps.io

Classes for reading and writing SNPs.

snps.io.reader

Class for reading SNPs.

class snps.io.reader.Reader(file='', only_detect_source=False, resources=None, rsids=())[source]

Bases: object

Class for reading and parsing raw data / genotype files.

__init__(file='', only_detect_source=False, resources=None, rsids=())[source]

Initialize a Reader.

Parameters
  • file (str or bytes) – path to file to load or bytes to load

  • only_detect_source (bool) – only detect the source of the data

  • resources (Resources) – instance of Resources

  • rsids (tuple, optional) – rsids to extract if loading a VCF file

static is_gzip(bytes_data)[source]

Check whether or not a bytes_data file is a valid gzip file.

static is_zip(bytes_data)[source]

Check whether or not a bytes_data file is a valid Zip file.
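Checks like is_gzip and is_zip are typically done by inspecting the leading "magic" bytes of the data. The sketch below shows that approach with the standard signatures for gzip and Zip local file headers; whether the library's methods do exactly this is an assumption, and the function names are hypothetical.

```python
import gzip
import io
import zipfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream
ZIP_MAGIC = b"PK\x03\x04"  # signature of a Zip local file header


def looks_like_gzip(bytes_data):
    """Return True if the data starts with the gzip signature."""
    return bytes_data[:2] == GZIP_MAGIC


def looks_like_zip(bytes_data):
    """Return True if the data starts with a Zip local file header."""
    return bytes_data[:4] == ZIP_MAGIC


# Build small in-memory samples to exercise the checks.
gz_bytes = gzip.compress(b"rsid,chrom,pos,genotype\n")

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as zf:
    zf.writestr("genotypes.csv", "rsid,chrom,pos,genotype\n")
zip_bytes = zip_buffer.getvalue()
```

A signature check is cheap and works on bytes already in memory, but it only identifies the container format; it does not prove that the archive is complete or uncorrupted.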

read()[source]

Read and parse a raw data / genotype file.

Returns

dict with the following items:

  • snps (pandas.DataFrame) – dataframe of parsed SNPs

  • source (str) – detected source of SNPs

  • phased (bool) – flag indicating if SNPs are phased

Return type

dict

read_23andme(file, compression, joined=True)[source]

Read and parse 23andMe file.

https://www.23andme.com

Parameters

file (str) – path to file

Returns

result of read_helper

Return type

dict

read_ancestry(file, compression)[source]

Read and parse Ancestry.com file.

http://www.ancestry.com

Parameters

file (str) – path to file

Returns

result of read_helper

Return type

dict

read_circledna(file, compression)[source]

Read and parse CircleDNA file.

https://circledna.com/

Notes

This method attempts to read and parse a whole exome file, optionally compressed with gzip or zip. Some assumptions are made throughout this process:

  • SNPs that are not annotated with an RSID are skipped

  • Insertions and deletions are skipped

Parameters

file (str or bytes) – path to file or bytes to load

Returns

result of read_helper

Return type

dict

read_dnaland(file, compression)[source]

Read and parse DNA.land files.

https://dna.land/

Parameters

file (str) – path to file

Returns

result of read_helper

Return type

dict

read_ftdna(file, compression)[source]

Read and parse Family Tree DNA (FTDNA) file.

https://www.familytreedna.com

Parameters

file (str) – path to file

Returns

result of read_helper

Return type

dict

read_ftdna_famfinder(file, compression)[source]

Read and parse Family Tree DNA (FTDNA) “famfinder” file.

https://www.familytreedna.com

Parameters

file (str) – path to file

Returns

result of read_helper

Return type

dict

read_generic(file, compression, skip=1)[source]

Read and parse generic CSV or TSV file.

Notes

Assumes columns are ‘rsid’, ‘chrom’ / ‘chromosome’, ‘pos’ / ‘position’, and ‘genotype’; values are comma separated; unreported genotypes are indicated by ‘--’; and one header row precedes data. For example:

    rsid,chromosome,position,genotype
    rs1,1,1,AA
    rs2,1,2,CC
    rs3,1,3,--

Parameters

file (str) – path to file

Returns

result of read_helper

Return type

dict
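The assumptions listed in the Notes for read_generic can be exercised with a small standalone parser. This sketch uses the standard csv module rather than the library's reader; the function name parse_generic_csv is hypothetical, and it handles only the comma-separated case with '--' marking an unreported genotype.

```python
import csv
import io


def parse_generic_csv(text):
    """Parse generic genotype CSV text into a list of normalized rows."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        # Accept either spelling of the chromosome and position columns.
        chrom = record.get("chrom") or record.get("chromosome")
        pos = record.get("pos") or record.get("position")
        genotype = record["genotype"]
        rows.append(
            {
                "rsid": record["rsid"],
                "chrom": chrom,
                "pos": int(pos),
                # '--' marks an unreported genotype
                "genotype": None if genotype == "--" else genotype,
            }
        )
    return rows


example = "rsid,chromosome,position,genotype\nrs1,1,1,AA\nrs2,1,2,CC\nrs3,1,3,--\n"
parsed = parse_generic_csv(example)
```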

read_genes_for_good(file, compression)[source]

Read and parse Genes For Good file.

https://genesforgood.sph.umich.edu/readme/readme1.2.txt

Parameters

file (str) – path to file

Returns

result of read_helper

Return type

dict

read_gsa(data_or_filename, compresion, comments)[source]

Read and parse Illumina Global Screening Array files.

Parameters

data_or_filename (str or bytes) – either the filename to read from or the bytes data itself

Returns

result of read_helper

Return type

dict

read_helper(source, parser)[source]

Generic method to help read files.

Parameters
  • source (str) – name of data source

  • parser (func) – parsing function, which returns a tuple with the following items:

    0 (pandas.DataFrame)

    dataframe of parsed SNPs (empty if only detecting source)

    1 (bool), optional

    flag indicating if SNPs are phased

    2 (int), optional

    detected build of SNPs

Returns

dict with the following items:

  • snps (pandas.DataFrame) – dataframe of parsed SNPs

  • source (str) – detected source of SNPs

  • phased (bool) – flag indicating if SNPs are phased

  • build (int) – detected build of SNPs

Return type

dict

References

  1. Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.

read_livingdna(file, compression)[source]

Read and parse LivingDNA file.

https://livingdna.com/

Parameters

file (str) – path to file

Returns

result of read_helper

Return type

dict

read_mapmygenome(file, compression, header)[source]

Read and parse Mapmygenome file.

https://mapmygenome.in

Parameters

file (str) – path to file

Returns

result of read_helper

Return type

dict

read_myheritage(file, compression)[source]

Read and parse MyHeritage file.

https://www.myheritage.com

Parameters

file (str) – path to file

Returns

result of read_helper

Return type

dict

read_snps_csv(file, comments, compression)[source]

Read and parse CSV file generated by snps.

https://pypi.org/project/snps/

Parameters
  • file (str or buffer) – path to file or buffer to read

  • comments (str) – comments at beginning of file

Returns

result of read_helper

Return type

dict

read_tellmegen(file, compression)[source]

Read and parse tellmeGen files.

https://www.tellmegen.com/

Parameters

file (str) – path to file

Returns

result of read_helper

Return type

dict

read_vcf(file, compression, provider, rsids=())[source]

Read and parse VCF file.

Notes

This method attempts to read and parse a VCF file or buffer, optionally compressed with gzip. Some assumptions are made throughout this process:

  • SNPs that are not annotated with an RSID are skipped

  • If the VCF contains multiple samples, only the first sample is used to look up the genotype

  • Insertions and deletions are skipped

  • If a sample allele is not specified, the genotype is reported as NaN

  • If a sample allele refers to a REF or ALT allele that is not specified, the genotype is reported as NaN

Parameters
  • file (str or bytes) – path to file or bytes to load

  • rsids (tuple, optional) – rsids to extract if loading a VCF file

Returns

result of read_helper

Return type

dict
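The assumptions listed in the Notes for read_vcf can be sketched for a single VCF data line. This is an illustration, not the library's parser: the function name is hypothetical, None stands in for NaN, and treating a '|' separator in the GT field as "phased" follows the VCF specification rather than anything stated above.

```python
def genotype_from_vcf_line(line):
    """Derive a genotype from one tab-separated VCF data line.

    Returns (rsid, chrom, pos, genotype, phased), or None if the record is
    skipped. genotype is None when it cannot be determined.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref, alt = fields[:5]

    if not rsid.startswith("rs"):
        return None  # not annotated with an RSID

    alt_alleles = alt.split(",")
    if len(ref) != 1 or any(len(a) != 1 for a in alt_alleles):
        return None  # insertion or deletion

    alleles = [ref] + alt_alleles
    format_keys = fields[8].split(":")
    first_sample = fields[9].split(":")  # only the first sample is used
    gt = first_sample[format_keys.index("GT")]

    phased = "|" in gt
    genotype = ""
    for call in gt.replace("|", "/").split("/"):
        if call == ".":
            return rsid, chrom, int(pos), None, phased  # allele not specified
        index = int(call)
        if index >= len(alleles) or alleles[index] == ".":
            return rsid, chrom, int(pos), None, phased  # allele not defined
        genotype += alleles[index]

    return rsid, chrom, int(pos), genotype, phased


record = genotype_from_vcf_line("1\t100\trs1\tA\tG\t.\t.\t.\tGT\t0/1")
```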

snps.io.reader.get_empty_snps_dataframe()[source]

Get empty dataframe normalized for usage with snps.

Returns

Return type

pd.DataFrame

snps.io.writer

Class for writing SNPs.

class snps.io.writer.Writer(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]

Bases: object

Class for writing SNPs to files.

__init__(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]

Initialize a Writer.

Parameters
  • snps (SNPs) – SNPs to save to file or write to buffer

  • filename (str or buffer) – filename for file to save or buffer to write to

  • vcf (bool) – flag to save file as VCF

  • atomic (bool) – atomically write output to a file on local filesystem

  • vcf_alt_unavailable (str) – representation of VCF ALT allele when ALT is not able to be determined

  • vcf_chrom_prefix (str) – prefix for chromosomes in VCF CHROM column

  • vcf_qc_only (bool) – for VCF, output only SNPs that pass quality control

  • vcf_qc_filter (bool) – for VCF, populate VCF FILTER column based on quality control results

  • **kwargs – additional parameters to pandas.DataFrame.to_csv

write()[source]

Write SNPs to file or buffer.

Returns

  • str – path to file in output directory if SNPs were saved, else empty str

  • discrepant_vcf_position (pd.DataFrame) – SNPs with discrepant positions discovered while saving VCF

snps.resources

Class for downloading and loading required external resources.

References

  1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. http://dx.doi.org/10.1038/35057062

  2. hg19 (GRCh37): Hiram Clawson, Brooke Rhead, Pauline Fujita, Ann Zweig, Katrina Learned, Donna Karolchik and Robert Kuhn, https://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19

  3. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613

  4. Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

class snps.resources.ReferenceSequence(ID='', url='', path='', assembly='', species='', taxonomy='')[source]

Bases: object

Object used to represent and interact with a reference sequence.

property ID

Get reference sequence chromosome.

Returns

Return type

str

__init__(ID='', url='', path='', assembly='', species='', taxonomy='')[source]

Initialize a ReferenceSequence object.

Parameters
  • ID (str) – reference sequence chromosome

  • url (str) – url to Ensembl reference sequence

  • path (str) – path to local reference sequence

  • assembly (str) – reference sequence assembly (e.g., “GRCh37”)

  • species (str) – reference sequence species

  • taxonomy (str) – reference sequence taxonomy

References

  1. The Variant Call Format (VCF) Version 4.2 Specification, 8 Mar 2019, https://samtools.github.io/hts-specs/VCFv4.2.pdf

property assembly

Get reference sequence assembly.

Returns

Return type

str

property build

Get reference sequence build.

Returns

e.g., “B37”

Return type

str

property chrom

Get reference sequence chromosome.

Returns

Return type

str

clear()[source]

Clear reference sequence.

property end

Get reference sequence end position (1-based).

Returns

Return type

int

property length

Get reference sequence length.

Returns

Return type

int

property md5

Get reference sequence MD5 hash.

Returns

Return type

str

property path

Get path to local reference sequence.

Returns

Return type

str

property sequence

Get reference sequence.

Returns

Return type

np.array(dtype=np.uint8)

property species

Get reference sequence species.

Returns

Return type

str

property start

Get reference sequence start position (1-based).

Returns

Return type

int

property taxonomy

Get reference sequence taxonomy.

Returns

Return type

str

property url

Get URL to Ensembl reference sequence.

Returns

Return type

str

class snps.resources.Resources(*args, **kwargs)[source]

Bases: object

Object used to manage resources required by snps.

__init__(resources_dir='resources')[source]

Initialize a Resources object.

Parameters

resources_dir (str) – name / path of resources directory

download_example_datasets()[source]

Download example datasets from openSNP.

Per openSNP, “the data is donated into the public domain using CC0 1.0.”

Returns

paths – paths to example datasets

Return type

list of str or empty str

References

  1. Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204

get_all_reference_sequences(**kwargs)[source]

Get Homo sapiens reference sequences for Builds 36, 37, and 38 from Ensembl.

Notes

This function can download over 2.5GB of data.

Returns

dict of ReferenceSequence, else {}

Return type

dict

get_all_resources()[source]

Get / download all resources used throughout snps.

Notes

This function does not download reference sequences and the openSNP datadump, due to their large sizes.

Returns

dict of resources

Return type

dict

get_assembly_mapping_data(source_assembly, target_assembly)[source]

Get assembly mapping data.

Parameters
  • source_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – assembly to remap from

  • target_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – assembly to remap to

Returns

dict of json assembly mapping data if loading was successful, else {}

Return type

dict

get_chip_clusters()[source]

Get resource for identifying deduced genotype / chip array based on chip clusters.

Returns

Return type

pandas.DataFrame

References

  1. Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.

get_dbsnp_151_37_reverse()[source]

Get and load RSIDs that are on the reference reverse (-) strand in dbSNP 151 and lower.

Returns

Return type

pandas.DataFrame

References

  1. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1; 29(1):308-11.

  2. Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/

get_gsa_chrpos()[source]

Get and load GSA chromosome position map.

https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html

Returns

Return type

pandas.DataFrame

get_gsa_resources()[source]

Get resources for reading Global Screening Array files.

https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html

Returns

Return type

dict

get_gsa_rsid()[source]

Get and load GSA RSID map.

https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html

Returns

Return type

pandas.DataFrame

get_low_quality_snps()[source]

Get listing of low quality SNPs for quality control based on chip clusters.

Returns

Return type

pandas.DataFrame

References

  1. Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.

get_opensnp_datadump_filenames()[source]

Get filenames internal to the openSNP datadump zip.

Per openSNP, “the data is donated into the public domain using CC0 1.0.”

Notes

This function can download over 27GB of data. If the download is not successful, try using a different tool like wget or curl to download the file and move it to the resources directory (see _get_path_opensnp_datadump).

Returns

filenames – filenames internal to the openSNP datadump

Return type

list of str

References

  1. Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204

get_reference_sequences(assembly='GRCh37', chroms=('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT'))[source]

Get Homo sapiens reference sequences for chroms of assembly.

Notes

This function can download over 800MB of data for each assembly.

Parameters
  • assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – reference sequence assembly

  • chroms (list of str) – reference sequence chromosomes

Returns

dict of ReferenceSequence, else {}

Return type

dict

load_opensnp_datadump_file(filename)[source]

Load the specified file from the openSNP datadump.

Per openSNP, “the data is donated into the public domain using CC0 1.0.”

Parameters

filename (str) – filename internal to the openSNP datadump

Returns

content of specified file internal to the openSNP datadump

Return type

bytes

References

  1. Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204
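The two datadump functions above list the filenames inside a zip archive and return the bytes of one of them. As a rough illustration of that pattern only (not the library's implementation), the sketch below uses the standard-library zipfile module on a small in-memory archive. The helper names list_filenames and load_file, and the filenames in the archive, are hypothetical.

```python
import io
import zipfile

# Build a tiny in-memory zip as a stand-in for the datadump.
# The member names and contents here are made up for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("user1_file1.23andme.txt", "rs1\t1\t100\tAA\n")
    archive.writestr("user2_file2.ancestry.txt", "rs2\t1\t200\tCT\n")


def list_filenames(zip_source):
    """Return the names of all members of the zip (hypothetical helper)."""
    with zipfile.ZipFile(zip_source) as archive:
        return archive.namelist()


def load_file(zip_source, filename):
    """Return the raw bytes of one member of the zip (hypothetical helper)."""
    with zipfile.ZipFile(zip_source) as archive:
        return archive.read(filename)


filenames = list_filenames(buffer)
content = load_file(buffer, filenames[0])
```

Returning bytes rather than text matches the documented return type and leaves decoding to the caller.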

snps.utils

Utility classes and functions.

class snps.utils.Parallelizer(parallelize=False, processes=2)[source]

Bases: object

__init__(parallelize=False, processes=2)[source]

Initialize a Parallelizer.

Parameters
  • parallelize (bool) – use multiprocessing to speed up calculations

  • processes (int) – processes to launch if multiprocessing

class snps.utils.Singleton[source]

Bases: type
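The documentation states only that Singleton derives from type, which suggests it is meant to be used as a metaclass. Assuming it follows the common singleton-metaclass pattern, a minimal sketch of that pattern looks like the following. SingletonSketch and ResourceCache are hypothetical names, and this is not taken from the library's source.

```python
class SingletonSketch(type):
    """Metaclass that hands back one cached instance per class (sketch)."""

    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Create the instance on first use, then reuse it on later calls.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class ResourceCache(metaclass=SingletonSketch):  # hypothetical example class
    def __init__(self):
        self.loaded = {}


first = ResourceCache()
second = ResourceCache()
first.loaded["chip_clusters"] = "cached"
```

Because both names refer to the same object, state stored through one is visible through the other.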

snps.utils.clean_str(s)[source]

Clean a string so that it can be used as a Python variable name.

Parameters

s (str) – string to clean

Returns

string that can be used as a Python variable name

Return type

str
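One way to obtain a string that is usable as a Python variable name is to replace every non-word character with an underscore and to prefix an underscore when the string starts with a digit. The sketch below shows that approach under the name clean_str_sketch, which is hypothetical; the library's own rules may differ.

```python
import re


def clean_str_sketch(s):
    """Replace non-word characters and guard a leading digit (sketch)."""
    # \W matches any character that is not a letter, digit, or underscore.
    # ^(?=\d) matches the empty position before a leading digit.
    return re.sub(r"\W|^(?=\d)", "_", s)


cleaned = clean_str_sketch("23andMe file-1")
```

The result can be checked with str.isidentifier(). Note that this sketch does not handle an empty string or Python keywords.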

snps.utils.create_dir(path)[source]

Create directory specified by path if it doesn’t already exist.

Parameters

path (str) – path to directory

Returns

True if path exists

Return type

bool
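The behavior described above (create the directory only if it is missing, then report whether it exists) can be approximated with os.makedirs. This is a sketch under the name create_dir_sketch, which is hypothetical; returning False when creation fails is an assumption, since the documentation only specifies the True case.

```python
import os
import tempfile


def create_dir_sketch(path):
    """Create path if needed and report whether it exists (sketch)."""
    try:
        # exist_ok=True makes repeated calls harmless.
        os.makedirs(path, exist_ok=True)
    except OSError:
        # Assumption: signal failure with False rather than raising.
        return False
    return os.path.isdir(path)


with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "output", "nested")
    first_call = create_dir_sketch(target)   # creates the directory
    second_call = create_dir_sketch(target)  # directory already exists
```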

snps.utils.gzip_file(src, dest)[source]

Gzip a file.

Parameters
  • src (str) – path to file to gzip

  • dest (str) – path to output gzip file

Returns

path to gzipped file

Return type

str
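For orientation, gzipping a file from src to dest and returning the output path can be done with the standard-library gzip and shutil modules. The sketch below is an illustration of that idea, not the library's code; gzip_file_sketch and the sample filename are hypothetical.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file_sketch(src, dest):
    """Compress src into dest with gzip and return dest (sketch)."""
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        # Stream the data in chunks so large files are not read into memory.
        shutil.copyfileobj(f_in, f_out)
    return dest


with tempfile.TemporaryDirectory() as work_dir:
    src = os.path.join(work_dir, "genotypes.txt")
    with open(src, "wb") as f:
        f.write(b"rs1\t1\t100\tAA\n")

    gz_path = gzip_file_sketch(src, src + ".gz")

    # Read the compressed file back to confirm the round trip.
    with gzip.open(gz_path, "rb") as f:
        round_trip = f.read()
```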

snps.utils.save_df_as_csv(df, path, filename, comment='', prepend_info=True, atomic=True, **kwargs)[source]

Save dataframe to a CSV file.

Parameters
  • df (pandas.DataFrame) – dataframe to save

  • path (str) – path to the directory in which to save the CSV file

  • filename (str or buffer) – filename for file to save or buffer to write to

  • comment (str) – header comment(s); one or more lines starting with ‘#’

  • prepend_info (bool) – prepend file generation information as comments

  • atomic (bool) – atomically write output to a file on local filesystem

  • **kwargs – additional parameters to pandas.DataFrame.to_csv

Returns

path to saved file or buffer (empty str if error)

Return type

str or buffer
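The atomic option refers to writing the output so that readers never see a partially written file. A common way to achieve this on a local filesystem is to write to a temporary file in the destination directory and then rename it into place. The sketch below illustrates that technique with plain text instead of a dataframe; write_text_atomically, the header comment, and the sample rows are hypothetical, and the library may implement atomic writes differently.

```python
import os
import tempfile


def write_text_atomically(dest_path, text):
    """Write text to dest_path via a temp file and a rename (sketch)."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # The temp file lives in the destination directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as tmp_file:
            tmp_file.write(text)
        # os.replace swaps the finished file into place in one step.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path


with tempfile.TemporaryDirectory() as work_dir:
    out_path = os.path.join(work_dir, "snps.csv")
    header = "# Generated by a hypothetical tool\n"  # comment lines start with '#'
    body = "rsid,chrom,pos,genotype\nrs1,1,100,AA\n"
    saved_path = write_text_atomically(out_path, header + body)
    with open(saved_path, newline="") as f:
        saved_text = f.read()
```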

snps.utils.zip_file(src, dest, arcname)[source]

Zip a file.

Parameters
  • src (str) – path to file to zip

  • dest (str) – path to output zip file

  • arcname (str) – name of file in zip archive

Returns

path to zipped file

Return type

str
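The three parameters map naturally onto the standard-library zipfile module: src is the file to add, dest is the archive to create, and arcname is the name the file receives inside the archive. The following sketch shows that mapping; zip_file_sketch and the sample filenames are hypothetical, and the choice of DEFLATE compression is an assumption.

```python
import os
import tempfile
import zipfile


def zip_file_sketch(src, dest, arcname):
    """Write src into a new zip at dest under arcname; return dest (sketch)."""
    # Assumption: use DEFLATE compression for the archive member.
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(src, arcname=arcname)
    return dest


with tempfile.TemporaryDirectory() as work_dir:
    src = os.path.join(work_dir, "raw_export.txt")
    with open(src, "wb") as f:
        f.write(b"rs1\t1\t100\tAA\n")

    zip_path = zip_file_sketch(src, os.path.join(work_dir, "out.zip"), "genotypes.txt")

    # Inspect the archive to confirm the member name and contents.
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        payload = archive.read("genotypes.txt")
```

Passing arcname keeps the local directory structure of src out of the archive.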