Code Documentation

SNPs

SNPs reads, writes, merges, and remaps genotype / raw data files.

class snps.snps.SNPs(file='', only_detect_source=False, assign_par_snps=False, output_dir='output', resources_dir='resources', deduplicate=True, deduplicate_XY_chrom=True, deduplicate_MT_chrom=True, parallelize=False, processes=2, rsids=())[source]

Bases: object

__init__(file='', only_detect_source=False, assign_par_snps=False, output_dir='output', resources_dir='resources', deduplicate=True, deduplicate_XY_chrom=True, deduplicate_MT_chrom=True, parallelize=False, processes=2, rsids=())[source]

Object used to read, write, and remap genotype / raw data files.

Parameters:
  • file (str or bytes) – path to file to load or bytes to load

  • only_detect_source (bool) – only detect the source of the data

  • assign_par_snps (bool) – assign PAR SNPs to the X and Y chromosomes

  • output_dir (str) – path to output directory

  • resources_dir (str) – name / path of resources directory

  • deduplicate (bool) – deduplicate RSIDs and make SNPs available as SNPs.duplicate

  • deduplicate_MT_chrom (bool) – deduplicate alleles on MT; see SNPs.heterozygous_MT

  • deduplicate_XY_chrom (bool or str) – deduplicate alleles in the non-PAR regions of X and Y for males; see SNPs.discrepant_XY. If a str, this is the sex determination method to use: X, Y, or XY

  • parallelize (bool) – utilize multiprocessing to speed up calculations

  • processes (int) – processes to launch if multiprocessing

  • rsids (tuple, optional) – rsids to extract if loading a VCF file

property assembly

Assembly of SNPs.

Return type:

str

property build

Build of SNPs.

Return type:

int

property build_detected

Status indicating if build of SNPs was detected.

Return type:

bool

property chip

Detected deduced genotype / chip array, if any, per compute_cluster_overlap.

Returns:

detected chip array, else empty str

Return type:

str

property chip_version

Detected genotype / chip array version, if any, per compute_cluster_overlap.

Notes

Chip array version is only applicable to 23andMe (v3, v4, v5) and AncestryDNA (v1, v2) files.

Returns:

detected chip array version, e.g., ‘v4’, else empty str

Return type:

str

property chromosomes

Chromosomes of SNPs.

Returns:

list of str chromosomes (e.g., [‘1’, ‘2’, ‘3’, ‘MT’]), empty list if no chromosomes

Return type:

list

property chromosomes_summary

Summary of the chromosomes of SNPs.

Returns:

human-readable listing of chromosomes (e.g., ‘1-3, MT’), empty str if no chromosomes

Return type:

str
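The summary format described above can be illustrated with a minimal sketch. It assumes that runs of consecutive numeric chromosomes are collapsed into ranges and that non-numeric chromosomes are appended unchanged; summarize_chromosomes is a hypothetical helper, not the library's implementation.

```python
def summarize_chromosomes(chromosomes):
    """Collapse consecutive numeric chromosomes into ranges (hypothetical helper)."""
    numeric = sorted(int(c) for c in chromosomes if c.isdigit())
    other = [c for c in chromosomes if not c.isdigit()]

    parts = []
    i = 0
    while i < len(numeric):
        start = end = numeric[i]
        # Extend the current run while the next chromosome is consecutive.
        while i + 1 < len(numeric) and numeric[i + 1] == end + 1:
            i += 1
            end = numeric[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1

    # Non-numeric chromosomes (e.g., X, Y, MT) keep their input order.
    return ", ".join(parts + other)


print(summarize_chromosomes(["1", "2", "3", "MT"]))
```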

property cluster

Detected chip cluster, if any, per compute_cluster_overlap.

Notes

Refer to compute_cluster_overlap for more details about chip clusters.

Returns:

detected chip cluster, e.g., ‘c1’, else empty str

Return type:

str

compute_cluster_overlap(cluster_overlap_threshold=0.95)[source]

Compute overlap with chip clusters.

Chip clusters, which are defined in [1], are associated with deduced genotype / chip arrays and DTC companies.

This method also sets the values returned by the cluster, chip, and chip_version properties, based on max overlap, if the specified threshold is satisfied.

Parameters:

cluster_overlap_threshold (float) – threshold for cluster to overlap this SNPs object, and vice versa, to set values returned by the cluster, chip, and chip_version properties

Returns:

pandas.DataFrame with the following columns:

  • company_composition – DTC company composition of associated cluster from [1]

  • chip_base_deduced – deduced genotype / chip array of associated cluster from [1]

  • snps_in_cluster – count of SNPs in cluster

  • snps_in_common – count of SNPs in common with cluster (inner merge with cluster)

  • overlap_with_cluster – percentage overlap of snps_in_common with cluster

  • overlap_with_self – percentage overlap of snps_in_common with this SNPs object

Return type:

pandas.DataFrame
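The overlap columns can be sketched with plain sets. This is only an illustration under stated assumptions: SNPs are identified by (chrom, pos) tuples, overlaps are expressed as fractions, and a cluster matches when both overlaps strictly exceed the threshold. The function cluster_overlap and its "matches" key are hypothetical.

```python
def cluster_overlap(own_snps, cluster_snps, cluster_overlap_threshold=0.95):
    """Compute overlap statistics between this object's SNPs and one cluster."""
    own, cluster = set(own_snps), set(cluster_snps)
    in_common = own & cluster  # analogous to an inner merge

    overlap_with_cluster = len(in_common) / len(cluster) if cluster else 0.0
    overlap_with_self = len(in_common) / len(own) if own else 0.0

    return {
        "snps_in_cluster": len(cluster),
        "snps_in_common": len(in_common),
        "overlap_with_cluster": overlap_with_cluster,
        "overlap_with_self": overlap_with_self,
        # Assumption: both directions must exceed the threshold.
        "matches": overlap_with_cluster > cluster_overlap_threshold
        and overlap_with_self > cluster_overlap_threshold,
    }


own = [("1", pos) for pos in range(100)]
cluster = [("1", pos) for pos in range(98)]
print(cluster_overlap(own, cluster))
```

Requiring the overlap in both directions keeps a small file from matching a large cluster merely because it is a subset of it.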

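References

  1. Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.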

property count

Count of SNPs.

Return type:

int

detect_build()[source]

Detect build of SNPs.

Use the coordinates of common SNPs to identify the build / assembly of a genotype file that is being loaded.

Notes

  • rs3094315 : plus strand in 36, 37, and 38

  • rs11928389 : plus strand in 36, minus strand in 37 and 38

  • rs2500347 : plus strand in 36 and 37, minus strand in 38

  • rs964481 : plus strand in 36, 37, and 38

  • rs2341354 : plus strand in 36, 37, and 38

  • rs3850290 : plus strand in 36, 37, and 38

  • rs1329546 : plus strand in 36, 37, and 38

Returns:

detected build of SNPs, else 0

Return type:

int
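The coordinate lookup described above can be sketched as follows. The rsIDs and positions in HYPOTHETICAL_POSITIONS are placeholders invented for illustration, not real coordinates, and detect_build here is a standalone function rather than the library method.

```python
# Placeholder table: rsid -> {build: position}. Values are NOT real coordinates.
HYPOTHETICAL_POSITIONS = {
    "rsA": {36: 100, 37: 200, 38: 300},
    "rsB": {36: 110, 37: 210, 38: 310},
}


def detect_build(snp_positions, known_positions=HYPOTHETICAL_POSITIONS):
    """Return the build whose known coordinate matches a loaded SNP, else 0."""
    for rsid, builds in known_positions.items():
        pos = snp_positions.get(rsid)
        if pos is None:
            continue  # this common SNP is absent from the file
        for build, known_pos in builds.items():
            if pos == known_pos:
                return build
    return 0


print(detect_build({"rsB": 210}))
```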

References

  1. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613

  2. Zerbino et al. (doi.org/10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

  3. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1;29(1):308-11.

  4. Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. dbSNP accession: rs3094315, rs11928389, rs2500347, rs964481, rs2341354, rs3850290, and rs1329546 (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/

determine_sex(heterozygous_x_snps_threshold=0.03, y_snps_not_null_threshold=0.3, chrom='X')[source]

Determine sex from SNPs using thresholds.

Parameters:
  • heterozygous_x_snps_threshold (float) – percentage heterozygous X SNPs; above this threshold, Female is determined

  • y_snps_not_null_threshold (float) – percentage Y SNPs that are not null; above this threshold, Male is determined

  • chrom ({“X”, “Y”}) – use X or Y chromosome SNPs to determine sex

Returns:

‘Male’ or ‘Female’ if detected, else empty str

Return type:

str
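The threshold logic can be sketched on plain genotype lists. The sketch assumes that a null genotype is represented by None, that the ratios are computed over all SNPs on the chosen chromosome, and that the opposite sex is reported when the threshold is not exceeded; these details are assumptions, and determine_sex below is a standalone illustration.

```python
def determine_sex(
    x_genotypes=(),
    y_genotypes=(),
    heterozygous_x_snps_threshold=0.03,
    y_snps_not_null_threshold=0.3,
    chrom="X",
):
    """Return 'Male', 'Female', or '' using the thresholds described above."""
    if chrom == "X":
        if not x_genotypes:
            return ""
        heterozygous = sum(
            1 for g in x_genotypes if g is not None and len(g) == 2 and g[0] != g[1]
        )
        ratio = heterozygous / len(x_genotypes)
        return "Female" if ratio > heterozygous_x_snps_threshold else "Male"

    if chrom == "Y":
        if not y_genotypes:
            return ""
        not_null = sum(1 for g in y_genotypes if g is not None)
        ratio = not_null / len(y_genotypes)
        return "Male" if ratio > y_snps_not_null_threshold else "Female"

    return ""


print(determine_sex(x_genotypes=["AG"] * 10 + ["AA"] * 90))
```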

property discrepant_XY

Discrepant XY SNPs.

A discrepant XY SNP is a heterozygous SNP in the non-PAR region of the X or Y chromosome found during deduplication for a detected male genotype.

Returns:

normalized snps dataframe

Return type:

pandas.DataFrame

property discrepant_merge_genotypes

SNPs with discrepant genotypes discovered while merging SNPs.

Notes

Definitions of columns in this dataframe are as follows:

Column          Description
rsid            SNP ID
chrom           Chromosome of existing SNP
pos             Position of existing SNP
genotype        Genotype of existing SNP
chrom_added     Chromosome of added SNP
pos_added       Position of added SNP
genotype_added  Genotype of added SNP (discrepant with genotype)

Return type:

pandas.DataFrame

property discrepant_merge_positions

SNPs with discrepant positions discovered while merging SNPs.

Notes

Definitions of columns in this dataframe are as follows:

Column          Description
rsid            SNP ID
chrom           Chromosome of existing SNP
pos             Position of existing SNP
genotype        Genotype of existing SNP
chrom_added     Chromosome of added SNP
pos_added       Position of added SNP (discrepant with pos)
genotype_added  Genotype of added SNP

Return type:

pandas.DataFrame

property discrepant_merge_positions_genotypes

SNPs with discrepant positions and / or genotypes discovered while merging SNPs.

Notes

Definitions of columns in this dataframe are as follows:

Column          Description
rsid            SNP ID
chrom           Chromosome of existing SNP
pos             Position of existing SNP
genotype        Genotype of existing SNP
chrom_added     Chromosome of added SNP
pos_added       Position of added SNP (possibly discrepant with pos)
genotype_added  Genotype of added SNP (possibly discrepant with genotype)

Return type:

pandas.DataFrame

property discrepant_vcf_position

SNPs with discrepant positions discovered while saving VCF.

Returns:

normalized snps dataframe

Return type:

pandas.DataFrame

property duplicate

Duplicate SNPs.

A duplicate SNP has the same RSID as another SNP. The first occurrence of the RSID is not considered a duplicate SNP.

Returns:

normalized snps dataframe

Return type:

pandas.DataFrame
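The rule that the first occurrence of an RSID is not a duplicate can be sketched as below. Rows are modeled as tuples whose first element is the RSID; split_duplicates is a hypothetical helper.

```python
def split_duplicates(rows):
    """Separate rows into (kept, duplicates), keeping the first row per RSID."""
    seen = set()
    kept, duplicates = [], []
    for row in rows:
        rsid = row[0]
        if rsid in seen:
            duplicates.append(row)  # a later occurrence of a known RSID
        else:
            seen.add(rsid)
            kept.append(row)
    return kept, duplicates


rows = [("rs1", "1", 100, "AA"), ("rs2", "1", 200, "CT"), ("rs1", "1", 100, "AG")]
kept, duplicates = split_duplicates(rows)
print(duplicates)
```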

get_count(chrom='')[source]

Count of SNPs.

Parameters:

chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)

Return type:

int

static get_par_regions(build)[source]

Get PAR regions for the X and Y chromosomes.

Parameters:

build (int) – build of SNPs

Returns:

PAR regions for the given build

Return type:

pandas.DataFrame

References

  1. Genome Reference Consortium, https://www.ncbi.nlm.nih.gov/grc/human

  2. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613

  3. Zerbino et al. (doi.org/10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

heterozygous(chrom='')[source]

Get heterozygous SNPs.

Parameters:

chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)

Returns:

normalized snps dataframe

Return type:

pandas.DataFrame

property heterozygous_MT

Heterozygous SNPs on the MT chromosome found during deduplication.

Returns:

normalized snps dataframe

Return type:

pandas.DataFrame

homozygous(chrom='')[source]

Get homozygous SNPs.

Parameters:

chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)

Returns:

normalized snps dataframe

Return type:

pandas.DataFrame
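The distinction between heterozygous and homozygous SNPs can be sketched on genotype strings. The sketch assumes a null genotype is None and that only two-allele genotypes are classified; the helper names are hypothetical.

```python
def is_heterozygous(genotype):
    """True when the genotype has two alleles that differ."""
    return genotype is not None and len(genotype) == 2 and genotype[0] != genotype[1]


def is_homozygous(genotype):
    """True when the genotype has two identical alleles."""
    return genotype is not None and len(genotype) == 2 and genotype[0] == genotype[1]


def select(rows, predicate, chrom=""):
    """Filter (rsid, chrom, pos, genotype) rows, optionally by chromosome."""
    return [
        row for row in rows
        if predicate(row[3]) and (not chrom or row[1] == chrom)
    ]


rows = [("rs1", "1", 100, "AG"), ("rs2", "X", 200, "CC"), ("rs3", "X", 300, None)]
print(select(rows, is_homozygous, chrom="X"))
```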

identify_low_quality_snps()[source]

Identify low quality SNPs based on chip clusters.

Any low quality SNPs are removed from the snps_qc dataframe and are made available as low_quality.

Notes

Chip clusters, which are defined in [1], are associated with low quality SNPs. As such, low quality SNPs will only be identified when this SNPs object corresponds to a cluster per compute_cluster_overlap().

property low_quality

SNPs identified as low quality, if any, per identify_low_quality_snps().

Returns:

normalized snps dataframe

Return type:

pandas.DataFrame

merge(snps_objects=(), discrepant_positions_threshold=100, discrepant_genotypes_threshold=500, remap=True, chrom='')[source]

Merge other SNPs objects into this SNPs object.

Parameters:
  • snps_objects (list or tuple of SNPs) – other SNPs objects to merge into this SNPs object

  • discrepant_positions_threshold (int) – threshold for discrepant SNP positions between existing data and data to be loaded; a large value could indicate mismatched genome assemblies

  • discrepant_genotypes_threshold (int) – threshold for discrepant genotype data between existing data and data to be loaded; a large value could indicate mismatched individuals

  • remap (bool) – if necessary, remap other SNPs objects to have the same build as this SNPs object before merging

  • chrom (str, optional) – chromosome to merge (e.g., “1”, “Y”, “MT”)

Returns:

for each SNPs object to merge, a dict with the following items:

  • merged (bool) – whether SNPs object was merged

  • common_rsids (pandas.Index) – SNPs in common

  • discrepant_position_rsids (pandas.Index) – SNPs with discrepant positions

  • discrepant_genotype_rsids (pandas.Index) – SNPs with discrepant genotypes

Return type:

list of dict
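How the two thresholds interact with the returned items can be sketched with dictionaries. This is an illustration only: it assumes genotypes are compared without regard to allele order, that null genotypes are never discrepant, and that the merge is rejected once a discrepancy count reaches its threshold. compare_snps is a hypothetical helper, and it returns plain lists rather than pandas.Index objects.

```python
def compare_snps(existing, added, discrepant_positions_threshold=100,
                 discrepant_genotypes_threshold=500):
    """Compare two {rsid: (chrom, pos, genotype)} mappings before merging."""

    def normalize(genotype):
        # Assumption: allele order is not significant ("AG" equals "GA").
        return None if genotype is None else "".join(sorted(genotype))

    common = sorted(existing.keys() & added.keys())
    discrepant_positions = [
        rsid for rsid in common if existing[rsid][:2] != added[rsid][:2]
    ]
    discrepant_genotypes = [
        rsid for rsid in common
        if existing[rsid][2] is not None
        and added[rsid][2] is not None
        and normalize(existing[rsid][2]) != normalize(added[rsid][2])
    ]
    merged = (
        len(discrepant_positions) < discrepant_positions_threshold
        and len(discrepant_genotypes) < discrepant_genotypes_threshold
    )
    return {
        "merged": merged,
        "common_rsids": common,
        "discrepant_position_rsids": discrepant_positions,
        "discrepant_genotype_rsids": discrepant_genotypes,
    }


existing = {"rs1": ("1", 100, "AG"), "rs2": ("1", 200, "CC"), "rs3": ("2", 300, None)}
added = {"rs1": ("1", 100, "GA"), "rs2": ("1", 250, "CT"), "rs3": ("2", 300, "TT")}
print(compare_snps(existing, added))
```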

References

  1. Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.

notnull(chrom='')[source]

Get not null genotype SNPs.

Parameters:

chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)

Returns:

normalized snps dataframe

Return type:

pandas.DataFrame

property phased

Indicates if genotype is phased.

Return type:

bool

predict_ancestry(output_directory=None, write_predictions=False, models_directory=None, aisnps_directory=None, aisnps_set=None)[source]

Predict genetic ancestry for SNPs.

Predictions by ezancestry.

Notes

Populations below are described here.

Parameters:

various (optional) – See the available settings for predict at ezancestry.

Returns:

dict with the following keys:

  • population_code (str) – max predicted population for the sample

  • population_percent (float) – predicted probability for the max predicted population

  • superpopulation_code (str) – max predicted super population (continental) for the sample

  • superpopulation_percent (float) – predicted probability for the max predicted super population

  • ezancestry_df (pandas.DataFrame) – pandas.DataFrame with the following columns:

    • component1, component2, component3 – The coordinates of the sample in the dimensionality-reduced component space. Can be used as (x, y, z) coordinates for plotting in a 3D scatter plot.

    • predicted_ancestry_population – The max predicted population for the sample.

    • ACB, ASW, BEB, CDX, CEU, CHB, CHS, CLM, ESN, FIN, GBR, GIH, GWD, IBS, ITU, JPT, KHV, LWK, MSL, MXL, PEL, PJL, PUR, STU, TSI, YRI – Predicted probabilities for each of the populations. These sum to 1.0.

    • predicted_ancestry_superpopulation – The max predicted super population (continental) for the sample.

    • AFR, AMR, EAS, EUR, SAS – Predicted probabilities for each of the super populations. These sum to 1.0.

Return type:

dict

remap(target_assembly, complement_bases=True)[source]

Remap SNP coordinates from one assembly to another.

This method uses the assembly map endpoint of the Ensembl REST API service (via Resources’s EnsemblRestClient) to convert SNP coordinates / positions from one assembly to another. After remapping, the coordinates / positions for the SNPs will be that of the target assembly.

If the SNPs are already mapped relative to the target assembly, remapping will not be performed.

Parameters:
  • target_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’, 36, 37, 38}) – assembly to remap to

  • complement_bases (bool) – complement bases when remapping SNPs to the minus strand

Returns:

  • chromosomes_remapped (list of str) – chromosomes remapped

  • chromosomes_not_remapped (list of str) – chromosomes not remapped

Notes

An assembly is also known as a “build.” For example:

Assembly NCBI36 = Build 36
Assembly GRCh37 = Build 37
Assembly GRCh38 = Build 38

See https://www.ncbi.nlm.nih.gov/assembly for more information about assemblies and remapping.
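The coordinate conversion can be sketched for a single mapping segment. The sketch assumes a segment shaped like one entry of the assembly map response, with "original" and "mapped" regions that each carry start, end, and strand, and assumes both regions have equal length. normalize_assembly and remap_snp are hypothetical helpers, not the library's internals.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
ASSEMBLY_NAMES = {36: "NCBI36", 37: "GRCh37", 38: "GRCh38"}


def normalize_assembly(target_assembly):
    """Accept either a build number or an assembly name; return the name."""
    if target_assembly in ASSEMBLY_NAMES:
        return ASSEMBLY_NAMES[target_assembly]
    if target_assembly in ASSEMBLY_NAMES.values():
        return target_assembly
    raise ValueError(f"unknown assembly: {target_assembly!r}")


def remap_snp(pos, genotype, mapping, complement_bases=True):
    """Remap one SNP through one mapping segment; None if outside the segment."""
    original, mapped = mapping["original"], mapping["mapped"]
    if not original["start"] <= pos <= original["end"]:
        return None

    offset = pos - original["start"]
    if original["strand"] == mapped["strand"]:
        return mapped["start"] + offset, genotype

    # Strand flipped: count the offset back from the end of the mapped region.
    new_pos = mapped["end"] - offset
    if complement_bases and genotype is not None:
        genotype = genotype.translate(COMPLEMENT)
    return new_pos, genotype


segment = {
    "original": {"start": 100, "end": 200, "strand": 1},
    "mapped": {"start": 1100, "end": 1200, "strand": -1},
}
print(normalize_assembly(38), remap_snp(110, "AG", segment))
```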

References

  1. Ensembl, Assembly Map Endpoint, http://rest.ensembl.org/documentation/info/assembly_map

  2. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613

  3. Zerbino et al. (doi.org/10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

property sex

Sex derived from SNPs.

Returns:

‘Male’ or ‘Female’ if detected, else empty str

Return type:

str

property snps

Normalized SNPs.

Notes

Throughout snps, the “normalized snps dataframe” is defined as follows:

Column       Description                          pandas dtype
rsid [*]     SNP ID                               object (string)
chrom        Chromosome of SNP                    object (string)
pos          Position of SNP (relative to build)  uint32
genotype []  Genotype of SNP                      object (string)

Returns:

normalized snps dataframe

Return type:

pandas.DataFrame

property snps_qc

Normalized SNPs, after quality control.

Any low quality SNPs, identified per identify_low_quality_snps(), are not included in the result.

Returns:

normalized snps dataframe

Return type:

pandas.DataFrame

sort()[source]

Sort SNPs based on ordered chromosome list and position.

property source

Summary of the SNP data source(s).

Returns:

Data source(s) for this SNPs object, separated by “, ”.

Return type:

str

property summary

Summary of SNPs.

Returns:

summary info if SNPs is valid, else {}

Return type:

dict

to_csv(filename='', atomic=True, **kwargs)[source]

Output SNPs as comma-separated values.

Parameters:
  • filename (str or buffer) – filename for file to save or buffer to write to

  • atomic (bool) – atomically write output to a file on local filesystem

  • **kwargs – additional parameters to pandas.DataFrame.to_csv

Returns:

path to file in output directory if SNPs were saved, else empty str

Return type:

str
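What an atomic write means in practice can be sketched with the standard library. This is an assumption about the general technique (write to a temporary file in the same directory, then rename it over the destination), not a description of how the library implements the atomic parameter; atomic_write_text is a hypothetical helper.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text so that readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the destination directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            f.write(text)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return path


out_dir = tempfile.mkdtemp()
print(atomic_write_text(os.path.join(out_dir, "out.csv"), "rsid,genotype\nrs1,AA\n"))
```

If the process is interrupted mid-write, only the temporary file is affected and any existing output file is left intact.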

to_tsv(filename='', atomic=True, **kwargs)[source]

Output SNPs as tab-separated values.

Note that this results in the same default output as save.

Parameters:
  • filename (str or buffer) – filename for file to save or buffer to write to

  • atomic (bool) – atomically write output to a file on local filesystem

  • **kwargs – additional parameters to pandas.DataFrame.to_csv

Returns:

path to file in output directory if SNPs were saved, else empty str

Return type:

str

to_vcf(filename='', atomic=True, alt_unavailable='.', chrom_prefix='', qc_only=False, qc_filter=False, **kwargs)[source]

Output SNPs as Variant Call Format.

Parameters:
  • filename (str or buffer) – filename for file to save or buffer to write to

  • atomic (bool) – atomically write output to a file on local filesystem

  • alt_unavailable (str) – representation of ALT allele when ALT is not able to be determined

  • chrom_prefix (str) – prefix for chromosomes in VCF CHROM column

  • qc_only (bool) – output only SNPs that pass quality control

  • qc_filter (bool) – populate FILTER column based on quality control results

  • **kwargs – additional parameters to pandas.DataFrame.to_csv

Returns:

path to file in output directory if SNPs were saved, else empty str

Return type:

str

Notes

Parameters qc_only and qc_filter, if true, will identify low quality SNPs per identify_low_quality_snps(), if not done already. Moreover, these parameters have no effect if this SNPs object does not map to a cluster per compute_cluster_overlap().
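The effect of chrom_prefix and alt_unavailable on an output record can be sketched as below. The sketch assumes the reference allele is already known, leaves QUAL, FILTER, and INFO as missing values, and writes unphased genotypes; vcf_record is a hypothetical helper and does not reproduce the library's writer.

```python
def vcf_record(rsid, chrom, pos, ref, genotype, alt_unavailable=".", chrom_prefix=""):
    """Build one tab-separated VCF data line for a single-sample genotype."""
    alleles = list(genotype) if genotype else []

    # ALT lists each non-reference allele once, in order of appearance.
    alts = []
    for allele in alleles:
        if allele != ref and allele not in alts:
            alts.append(allele)
    alt = ",".join(alts) if alts else alt_unavailable

    # GT uses 0 for REF and 1..n for the ALT alleles; "./." marks a missing call.
    index = {ref: "0"}
    index.update({allele: str(i + 1) for i, allele in enumerate(alts)})
    gt = "/".join(index[allele] for allele in alleles) if alleles else "./."

    return "\t".join(
        [f"{chrom_prefix}{chrom}", str(pos), rsid, ref, alt, ".", ".", ".", "GT", gt]
    )


print(vcf_record("rs1", "1", 100, "A", "AG", chrom_prefix="chr"))
```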

References

  1. The Variant Call Format (VCF) Version 4.2 Specification, 8 Mar 2019, https://samtools.github.io/hts-specs/VCFv4.2.pdf

property unannotated_vcf

Indicates if VCF file is unannotated.

Return type:

bool

property valid

Determine if SNPs is valid.

SNPs is valid when the input file has been successfully parsed.

Returns:

True if SNPs is valid

Return type:

bool

snps.ensembl

Ensembl REST client.

Notes

Modified from https://github.com/Ensembl/ensembl-rest/wiki/Example-Python-Client.

References

  1. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613

  2. Zerbino et al. (doi.org/10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

class snps.ensembl.EnsemblRestClient(server='https://rest.ensembl.org', reqs_per_sec=15)[source]

Bases: object

__init__(server='https://rest.ensembl.org', reqs_per_sec=15)[source]

perform_rest_action(endpoint, hdrs=None, params=None)[source]

snps.io

Classes for reading and writing SNPs.

snps.io.reader

Class for reading SNPs.

class snps.io.reader.Reader(file='', only_detect_source=False, resources=None, rsids=())[source]

Bases: object

Class for reading and parsing raw data / genotype files.

__init__(file='', only_detect_source=False, resources=None, rsids=())[source]

Initialize a Reader.

Parameters:
  • file (str or bytes) – path to file to load or bytes to load

  • only_detect_source (bool) – only detect the source of the data

  • resources (Resources) – instance of Resources

  • rsids (tuple, optional) – rsids to extract if loading a VCF file

static is_gzip(bytes_data)[source]

Check whether or not a bytes_data file is a valid gzip file.

static is_zip(bytes_data)[source]

Check whether or not a bytes_data file is a valid Zip file.
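One common way to make such checks is to inspect the leading magic bytes. The sketch below assumes that approach; looks_like_gzip and looks_like_zip are hypothetical stand-ins and may differ from what is_gzip and is_zip actually do.

```python
import gzip
import io
import zipfile


def looks_like_gzip(bytes_data):
    """gzip streams begin with the two bytes 0x1f 0x8b."""
    return bytes_data[:2] == b"\x1f\x8b"


def looks_like_zip(bytes_data):
    """Non-empty Zip archives begin with the local file header 'PK\\x03\\x04'."""
    return bytes_data[:4] == b"PK\x03\x04"


# Build small in-memory payloads to exercise both checks.
gzip_payload = gzip.compress(b"rsid,chromosome,position,genotype\n")

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("genome.csv", "rsid,chromosome,position,genotype\n")
zip_payload = zip_buffer.getvalue()

print(looks_like_gzip(gzip_payload), looks_like_zip(zip_payload))
```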

read()[source]

Read and parse a raw data / genotype file.

Returns:

dict with the following items:

  • snps (pandas.DataFrame) – dataframe of parsed SNPs

  • source (str) – detected source of SNPs

  • phased (bool) – flag indicating if SNPs are phased

Return type:

dict

read_23andme(file, compression, joined=True)[source]

Read and parse 23andMe file.

https://www.23andme.com

Parameters:

file (str) – path to file

Returns:

result of read_helper

Return type:

dict

read_ancestry(file, compression)[source]

Read and parse Ancestry.com file.

http://www.ancestry.com

Parameters:

file (str) – path to file

Returns:

result of read_helper

Return type:

dict

read_circledna(file, compression)[source]

Read and parse CircleDNA file.

https://circledna.com/

Notes

This method attempts to read and parse a whole exome file, optionally compressed with gzip or zip. Some assumptions are made throughout this process:

  • SNPs that are not annotated with an RSID are skipped

  • Insertions and deletions are skipped

Parameters:

file (str or bytes) – path to file or bytes to load

Returns:

result of read_helper

Return type:

dict

read_dnaland(file, compression)[source]

Read and parse DNA.land files.

https://dna.land/

Parameters:

file (str) – path to file

Returns:

result of read_helper

Return type:

dict

read_ftdna(file, compression)[source]

Read and parse Family Tree DNA (FTDNA) file.

https://www.familytreedna.com

Parameters:

file (str) – path to file

Returns:

result of read_helper

Return type:

dict

read_ftdna_famfinder(file, compression)[source]

Read and parse Family Tree DNA (FTDNA) “famfinder” file.

https://www.familytreedna.com

Parameters:

file (str) – path to file

Returns:

result of read_helper

Return type:

dict

read_generic(file, compression, skip=1)[source]

Read and parse generic CSV or TSV file.

Notes

Assumes columns are ‘rsid’, ‘chrom’ / ‘chromosome’, ‘pos’ / ‘position’, and ‘genotype’; values are comma separated; unreported genotypes are indicated by ‘--’; and one header row precedes data. For example:

    rsid,chromosome,position,genotype
    rs1,1,1,AA
    rs2,1,2,CC
    rs3,1,3,--

Parameters:

file (str) – path to file

Returns:

result of read_helper

Return type:

dict
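The generic layout described in the notes can be parsed with the standard csv module, as in this minimal sketch. It assumes the unreported-genotype marker is two ASCII hyphens and represents it as None; parse_generic is a hypothetical helper that returns tuples instead of a dataframe.

```python
import csv
import io


def parse_generic(text, skip=1):
    """Parse generic comma-separated SNP text into (rsid, chrom, pos, genotype)."""
    reader = csv.reader(io.StringIO(text))
    for _ in range(skip):
        next(reader, None)  # skip header row(s)

    rows = []
    for record in reader:
        if not record:
            continue  # ignore blank lines
        rsid, chrom, pos, genotype = record
        rows.append((rsid, chrom, int(pos), None if genotype == "--" else genotype))
    return rows


sample = "rsid,chromosome,position,genotype\nrs1,1,1,AA\nrs2,1,2,CC\nrs3,1,3,--\n"
print(parse_generic(sample))
```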

read_genes_for_good(file, compression)[source]

Read and parse Genes For Good file.

https://genesforgood.sph.umich.edu/readme/readme1.2.txt

Parameters:

file (str) – path to file

Returns:

result of read_helper

Return type:

dict

read_gsa(data_or_filename, compresion, comments)[source]

Read and parse Illumina Global Screening Array files.

Parameters:

data_or_filename (str or bytes) – either the filename to read from or the bytes data itself

Returns:

result of read_helper

Return type:

dict

read_helper(source, parser)[source]

Generic method to help read files.

Parameters:
  • source (str) – name of data source

  • parser (func) – parsing function, which returns a tuple with the following items:

    • 0 (pandas.DataFrame) – dataframe of parsed SNPs (empty if only detecting source)

    • 1 (bool), optional – flag indicating if SNPs are phased

    • 2 (int), optional – detected build of SNPs

Returns:

dict with the following items:

  • snps (pandas.DataFrame) – dataframe of parsed SNPs

  • source (str) – detected source of SNPs

  • phased (bool) – flag indicating if SNPs are phased

  • build (int) – detected build of SNPs

Return type:

dict
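The contract between the parser's tuple and the returned dict can be sketched with extended unpacking. The defaults used here (phased False, build 0, empty source when parsing fails) are assumptions, and a list stands in for the dataframe; this standalone read_helper is an illustration, not the method itself.

```python
def read_helper(source, parser):
    """Call parser() and normalize its 1- to 3-item tuple into a result dict."""
    result = {"snps": [], "source": "", "phased": False, "build": 0}
    try:
        # The first item is required; phased and build are optional extras.
        snps, *extra = parser()
    except Exception:
        return result  # assumption: a failed parse yields the empty defaults

    result["snps"] = snps
    result["source"] = source
    if len(extra) >= 1:
        result["phased"] = extra[0]
    if len(extra) >= 2:
        result["build"] = extra[1]
    return result


def example_parser():
    # Hypothetical parser returning (snps, phased); build is omitted.
    return [("rs1", "1", 100, "AA")], True


print(read_helper("ExampleSource", example_parser))
```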

References

  1. Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.

read_livingdna(file, compression)[source]

Read and parse LivingDNA file.

https://livingdna.com/

Parameters:

file (str) – path to file

Returns:

result of read_helper

Return type:

dict

read_mapmygenome(file, compression, header)[source]

Read and parse Mapmygenome file.

https://mapmygenome.in

Parameters:

file (str) – path to file

Returns:

result of read_helper

Return type:

dict

read_myheritage(file, compression)[source]

Read and parse MyHeritage file.

https://www.myheritage.com

Parameters:

file (str) – path to file

Returns:

result of read_helper

Return type:

dict

read_snps_csv(file, comments, compression)[source]

Read and parse CSV file generated by snps.

https://pypi.org/project/snps/

Parameters:
  • file (str or buffer) – path to file or buffer to read

  • comments (str) – comments at beginning of file

Returns:

result of read_helper

Return type:

dict

read_tellmegen(file, compression)[source]

Read and parse tellmeGen files.

https://www.tellmegen.com/

Parameters:

file (str) – path to file

Returns:

result of read_helper

Return type:

dict

read_vcf(file, compression, provider, rsids=())[source]

Read and parse VCF file.

Notes

This method attempts to read and parse a VCF file or buffer, optionally compressed with gzip. Some assumptions are made throughout this process:

  • SNPs that are not annotated with an RSID are skipped

  • If the VCF contains multiple samples, only the first sample is used to look up the genotype

  • Insertions and deletions are skipped

  • If a sample allele is not specified, the genotype is reported as NaN

  • If a sample allele refers to a REF or ALT allele that is not specified, the genotype is reported as NaN

Parameters:
  • file (str or bytes) – path to file or bytes to load

  • rsids (tuple, optional) – rsids to extract if loading a VCF file

Returns:

result of read_helper

Return type:

dict
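The assumptions listed in the notes can be sketched for a single VCF data line. The sketch represents NaN as None, treats any multi-character REF or ALT allele as an insertion or deletion, and assumes RSIDs start with "rs"; parse_vcf_record is a hypothetical helper.

```python
import re


def parse_vcf_record(line):
    """Return (rsid, chrom, pos, genotype, phased), or None if the record is skipped."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref, alt = fields[:5]

    if not rsid.startswith("rs"):
        return None  # not annotated with an RSID

    alleles = [ref] + alt.split(",")
    if any(len(allele) > 1 for allele in alleles):
        return None  # insertion or deletion

    # Only the first sample column (index 9) is used.
    format_keys = fields[8].split(":")
    gt = fields[9].split(":")[format_keys.index("GT")]
    phased = "|" in gt

    genotype = ""
    for allele_index in re.split(r"[/|]", gt):
        if allele_index == "." or alleles[int(allele_index)] == ".":
            genotype = None  # unspecified sample, REF, or ALT allele
            break
        genotype += alleles[int(allele_index)]

    return rsid, chrom, int(pos), genotype, phased


print(parse_vcf_record("1\t100\trs1\tA\tG\t.\t.\t.\tGT\t0/1\t1/1"))
```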

snps.io.reader.get_empty_snps_dataframe()[source]

Get empty dataframe normalized for usage with snps.

Return type:

pd.DataFrame

snps.io.writer

Class for writing SNPs.

class snps.io.writer.Writer(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]

Bases: object

Class for writing SNPs to files.

__init__(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]

Initialize a Writer.

Parameters:
  • snps (SNPs) – SNPs to save to file or write to buffer

  • filename (str or buffer) – filename for file to save or buffer to write to

  • vcf (bool) – flag to save file as VCF

  • atomic (bool) – atomically write output to a file on local filesystem

  • vcf_alt_unavailable (str) – representation of VCF ALT allele when ALT is not able to be determined

  • vcf_chrom_prefix (str) – prefix for chromosomes in VCF CHROM column

  • vcf_qc_only (bool) – for VCF, output only SNPs that pass quality control

  • vcf_qc_filter (bool) – for VCF, populate VCF FILTER column based on quality control results

  • **kwargs – additional parameters to pandas.DataFrame.to_csv

write()[source]

Write SNPs to file or buffer.

Returns:

  • str – path to file in output directory if SNPs were saved, else empty str

  • discrepant_vcf_position (pd.DataFrame) – SNPs with discrepant positions discovered while saving VCF

snps.resources

Class for downloading and loading required external resources.

References

  1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. http://dx.doi.org/10.1038/35057062

  2. hg19 (GRCh37): Hiram Clawson, Brooke Rhead, Pauline Fujita, Ann Zweig, Katrina Learned, Donna Karolchik and Robert Kuhn, https://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19

  3. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613

  4. Zerbino et al. (doi.org/10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098

class snps.resources.ReferenceSequence(ID='', url='', path='', assembly='', species='', taxonomy='')[source]

Bases: object

Object used to represent and interact with a reference sequence.

property ID

Get reference sequence chromosome.

Return type:

str

__init__(ID='', url='', path='', assembly='', species='', taxonomy='')[source]

Initialize a ReferenceSequence object.

Parameters:
  • ID (str) – reference sequence chromosome

  • url (str) – url to Ensembl reference sequence

  • path (str) – path to local reference sequence

  • assembly (str) – reference sequence assembly (e.g., “GRCh37”)

  • species (str) – reference sequence species

  • taxonomy (str) – reference sequence taxonomy

References

  1. The Variant Call Format (VCF) Version 4.2 Specification, 8 Mar 2019, https://samtools.github.io/hts-specs/VCFv4.2.pdf

property assembly

Get reference sequence assembly.

Return type:

str

property build

Get reference sequence build.

Returns:

e.g., “B37”

Return type:

str

property chrom

Get reference sequence chromosome.

Return type:

str

clear()[source]

Clear reference sequence.

property end

Get reference sequence end position (1-based).

Return type:

int

property length

Get reference sequence length.

Return type:

int

property md5

Get reference sequence MD5 hash.

Return type:

str

property path

Get path to local reference sequence.

Return type:

str

property sequence

Get reference sequence.

Return type:

np.array(dtype=np.uint8)

property species

Get reference sequence species.

Return type:

str

property start

Get reference sequence start position (1-based).

Return type:

int

property taxonomy

Get reference sequence taxonomy.

Return type:

str

property url

Get URL to Ensembl reference sequence.

Return type:

str

class snps.resources.Resources(*args, **kwargs)[source]

Bases: object

Object used to manage resources required by snps.

__init__(resources_dir='resources')[source]

Initialize a Resources object.

Parameters:

resources_dir (str) – name / path of resources directory

download_example_datasets()[source]

Download example datasets from openSNP.

Per openSNP, “the data is donated into the public domain using CC0 1.0.”

Returns:

paths – paths to example datasets

Return type:

list of str or empty str

References

  1. Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204

get_all_reference_sequences(**kwargs)[source]

Get Homo sapiens reference sequences for Builds 36, 37, and 38 from Ensembl.

Notes

This function can download over 2.5GB of data.

Returns:

dict of ReferenceSequence, else {}

Return type:

dict

get_all_resources()[source]

Get / download all resources used throughout snps.

Notes

This function does not download reference sequences or the openSNP datadump, due to their large sizes.

Returns:

dict of resources

Return type:

dict

get_assembly_mapping_data(source_assembly, target_assembly)[source]

Get assembly mapping data.

Parameters:
  • source_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – assembly to remap from

  • target_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – assembly to remap to

Returns:

dict of json assembly mapping data if loading was successful, else {}

Return type:

dict

get_chip_clusters()[source]

Get resource for identifying deduced genotype / chip array based on chip clusters.

Return type:

pandas.DataFrame

References

  1. Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.

  2. Lu, Tzovaras, & Gough. (2021). OpenSNP data-freeze of 5,393 (19.10.2020) [Data set]. In Computational and Structural Biotechnology Journal. Zenodo. https://doi.org/10.1016/j.csbj.2021.06.040

get_dbsnp_151_37_reverse()[source]

Get and load RSIDs that are on the reference reverse (-) strand in dbSNP 151 and lower.

Return type:

pandas.DataFrame

References

  1. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1; 29(1):308-11.

  2. Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/

get_gsa_chrpos()[source]

Get and load GSA chromosome position map.

https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html

Return type:

pandas.DataFrame

get_gsa_resources()[source]

Get resources for reading Global Screening Array files.

https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html

Return type:

dict

get_gsa_rsid()[source]

Get and load GSA RSID map.

https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html

Return type:

pandas.DataFrame

get_low_quality_snps()[source]

Get listing of low quality SNPs for quality control based on chip clusters.

Return type:

pandas.DataFrame

References

  1. Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.

  2. Lu, Tzovaras, & Gough. (2021). OpenSNP data-freeze of 5,393 (19.10.2020) [Data set]. In Computational and Structural Biotechnology Journal. Zenodo. https://doi.org/10.1016/j.csbj.2021.06.040

get_opensnp_datadump_filenames()[source]

Get filenames internal to the openSNP datadump zip.

Per openSNP, “the data is donated into the public domain using CC0 1.0.”

Notes

This function can download over 27 GB of data. If the download fails, download the file with another tool such as wget or curl and move it to the resources directory (see _get_path_opensnp_datadump).

Returns:

filenames – filenames internal to the openSNP datadump

Return type:

list of str

References

  1. Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204

get_reference_sequences(assembly='GRCh37', chroms=('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT'))[source]

Get Homo sapiens reference sequences for chroms of assembly.

Notes

This function can download over 800 MB of data for each assembly.

Parameters:
  • assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – reference sequence assembly

  • chroms (list or tuple of str) – reference sequence chromosomes

Returns:

dict of ReferenceSequence, else {}

Return type:

dict

load_opensnp_datadump_file(filename)[source]

Load the specified file from the openSNP datadump.

Per openSNP, “the data is donated into the public domain using CC0 1.0.”

Parameters:

filename (str) – filename internal to the openSNP datadump

Returns:

content of specified file internal to the openSNP datadump

Return type:

bytes

References

  1. Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204
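The two datadump functions above list the members of a zip archive and return one member's content as bytes. The following is a minimal sketch of that pattern using only the standard library. It is not the library's implementation: the helper names `get_filenames` and `load_file`, the member names, and the tiny in-memory archive are hypothetical stand-ins for the real datadump.

```python
import io
import zipfile


def get_filenames(archive):
    """Return the names of all members in a zip archive (hypothetical helper)."""
    with zipfile.ZipFile(archive) as zf:
        return zf.namelist()


def load_file(archive, filename):
    """Return the raw bytes of one archive member (hypothetical helper)."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(filename)


# Build a small in-memory archive so the sketch runs without any download.
demo_archive = io.BytesIO()
with zipfile.ZipFile(demo_archive, "w") as zf:
    zf.writestr("sample_a.txt", "rs1\t1\t100\tAA\n")
    zf.writestr("sample_b.txt", "rs2\t1\t200\tCT\n")

names = get_filenames(demo_archive)
content = load_file(demo_archive, names[0])
print(names, len(content))
```

Reading a single member this way avoids extracting the whole archive to disk, which matters for an archive of this size.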

snps.utils

Utility classes and functions.

class snps.utils.Parallelizer(parallelize=False, processes=2)[source]

Bases: object

__init__(parallelize=False, processes=2)[source]

Initialize a Parallelizer.

Parameters:
  • parallelize (bool) – utilize multiprocessing to speed up calculations

  • processes (int) – processes to launch if multiprocessing
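The two parameters suggest a wrapper that either maps a function over tasks in a process pool or falls back to a plain loop. Below is a minimal sketch of that idea, assuming a callable interface of the form `(f, tasks)`. The class name `SimpleParallelizer` and the `square` task are hypothetical, and the real class may behave differently.

```python
from multiprocessing import Pool


class SimpleParallelizer:
    """Hypothetical stand-in: run tasks serially or in a process pool."""

    def __init__(self, parallelize=False, processes=2):
        self._parallelize = parallelize
        self._processes = processes

    def __call__(self, f, tasks):
        if self._parallelize:
            # f must be picklable (e.g. a module-level function)
            with Pool(self._processes) as pool:
                return pool.map(f, tasks)
        # Serial fallback keeps results in the same order as tasks
        return [f(task) for task in tasks]


def square(x):
    return x * x


if __name__ == "__main__":
    runner = SimpleParallelizer(parallelize=False)
    print(runner(square, [1, 2, 3]))
```

Keeping the serial path behind the same interface lets calling code stay identical whether or not multiprocessing is enabled.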

class snps.utils.Singleton[source]

Bases: type
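Because Singleton derives from type, it is presumably meant to be used as a metaclass. The sketch below shows one common way such a metaclass can work, assuming instances are cached per class. `SingletonMeta` and `Config` are hypothetical names, and the library's own implementation may differ.

```python
class SingletonMeta(type):
    """Hypothetical metaclass that creates at most one instance per class."""

    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Only the first call constructs an instance; later calls reuse it
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class Config(metaclass=SingletonMeta):
    def __init__(self, name="default"):
        self.name = name


first = Config("first")
second = Config("second")  # arguments are ignored; the cached instance is returned
print(first is second, second.name)
```

Note that with this approach constructor arguments passed after the first call have no effect.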

snps.utils.clean_str(s)[source]

Clean a string so that it can be used as a Python variable name.

Parameters:

s (str) – string to clean

Returns:

string that can be used as a Python variable name

Return type:

str
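One way to perform this kind of cleaning is a single regular-expression substitution, sketched below under the assumption that invalid characters are replaced with underscores. The function name `to_identifier` is hypothetical, and the library may apply different rules.

```python
import re


def to_identifier(s):
    """Hypothetical helper: make a string usable as a Python variable name.

    Replaces every non-word character with "_" and inserts "_" before a
    leading digit, since identifiers cannot start with a digit.
    """
    return re.sub(r"\W|^(?=\d)", "_", s)


print(to_identifier("my file-1"))
print(to_identifier("1abc"))
```

This sketch does not guard against Python keywords or the empty string; `str.isidentifier` and `keyword.iskeyword` can be used for stricter checks.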

snps.utils.create_dir(path)[source]

Create directory specified by path if it doesn’t already exist.

Parameters:

path (str) – path to directory

Returns:

True if path exists

Return type:

bool
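A comparable helper can be sketched with `os.makedirs`, assuming that failure to create the directory should be reported as False rather than raised. The name `ensure_dir` is hypothetical, and the library's error handling is not described here.

```python
import os
import tempfile


def ensure_dir(path):
    """Hypothetical helper: create path if needed; return True if it exists."""
    try:
        # exist_ok=True makes the call a no-op when the directory is present
        os.makedirs(path, exist_ok=True)
    except OSError:
        return False
    return os.path.isdir(path)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "output", "nested")
    print(ensure_dir(target))  # creates both levels
    print(ensure_dir(target))  # already exists
```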

snps.utils.get_utc_now()[source]

Get current UTC time.

Return type:

datetime.datetime
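For reference, the standard library can produce the current UTC time as a timezone-aware value, as sketched below. Whether this function returns an aware or a naive datetime is not stated above, so the aware form shown here is an assumption, and `utc_now` is a hypothetical name.

```python
from datetime import datetime, timezone


def utc_now():
    """Hypothetical helper: current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


now = utc_now()
print(now.isoformat())
```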

snps.utils.gzip_file(src, dest)[source]

Gzip a file.

Parameters:
  • src (str) – path to file to gzip

  • dest (str) – path to output gzip file

Returns:

path to gzipped file

Return type:

str
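The same src / dest contract can be sketched with the standard `gzip` and `shutil` modules. This is an illustration rather than the library's code, and `gzip_copy` is a hypothetical name.

```python
import gzip
import os
import shutil
import tempfile


def gzip_copy(src, dest):
    """Hypothetical helper: compress src into dest and return dest."""
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        # Stream in chunks so large files are not loaded into memory
        shutil.copyfileobj(f_in, f_out)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "snps.csv")
    with open(src, "wb") as f:
        f.write(b"rsid,genotype\nrs1,AA\n")
    out = gzip_copy(src, src + ".gz")
    with gzip.open(out, "rb") as f:
        print(f.read())
```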

snps.utils.save_df_as_csv(df, path, filename, comment='', prepend_info=True, atomic=True, **kwargs)[source]

Save dataframe to a CSV file.

Parameters:
  • df (pandas.DataFrame) – dataframe to save

  • path (str) – path to directory in which to save the CSV file

  • filename (str or buffer) – filename for file to save or buffer to write to

  • comment (str) – header comment(s); one or more lines starting with ‘#’

  • prepend_info (bool) – prepend file generation information as comments

  • atomic (bool) – atomically write output to a file on the local filesystem

  • **kwargs – additional parameters to pandas.DataFrame.to_csv

Returns:

path to saved file or buffer (empty str if error)

Return type:

str or buffer
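The atomic option describes a general technique: write to a temporary file in the destination directory, then rename it over the target so readers never observe a partial file. The sketch below illustrates that technique with the standard library only. It is not the library's implementation, and `atomic_write_text` is a hypothetical name.

```python
import os
import tempfile


def atomic_write_text(text, path):
    """Hypothetical helper: write text to path via a temp file and rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            f.write(text)
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return path


with tempfile.TemporaryDirectory() as tmp:
    csv_text = "# example header comment\nrsid,genotype\nrs1,AA\n"
    saved = atomic_write_text(csv_text, os.path.join(tmp, "snps.csv"))
    print(os.path.basename(saved))
```

Here the leading line starting with '#' plays the role of the comment parameter described above.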

snps.utils.zip_file(src, dest, arcname)[source]

Zip a file.

Parameters:
  • src (str) – path to file to zip

  • dest (str) – path to output zip file

  • arcname (str) – name of file in zip archive

Returns:

path to zipped file

Return type:

str
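The role of arcname is easiest to see in a small sketch: the file is read from src but stored in the archive under arcname. The helper below uses the standard `zipfile` module, is not the library's code, and `zip_copy` is a hypothetical name. The choice of deflate compression is also an assumption.

```python
import os
import tempfile
import zipfile


def zip_copy(src, dest, arcname):
    """Hypothetical helper: store src in a new zip at dest under arcname."""
    with zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(src, arcname=arcname)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "tmp_output.csv")
    with open(src, "w") as f:
        f.write("rsid,genotype\nrs1,AA\n")
    out = zip_copy(src, os.path.join(tmp, "snps.zip"), "snps.csv")
    with zipfile.ZipFile(out) as zf:
        print(zf.namelist())
```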