Code Documentation

SNPs

SNPs reads, writes, merges, and remaps genotype / raw data files.

class snps.snps.SNPs(file='', only_detect_source=False, assign_par_snps=False, output_dir='output', resources_dir='resources', deduplicate=True, deduplicate_XY_chrom=True, deduplicate_MT_chrom=True, parallelize=False, processes=2, rsids=())[source]

Bases: object

__init__(file='', only_detect_source=False, assign_par_snps=False, output_dir='output', resources_dir='resources', deduplicate=True, deduplicate_XY_chrom=True, deduplicate_MT_chrom=True, parallelize=False, processes=2, rsids=())[source]

Object used to read, write, and remap genotype / raw data files.

Parameters:
  • file (str or bytes) – path to file to load or bytes to load
  • only_detect_source (bool) – only detect the source of the data
  • assign_par_snps (bool) – assign PAR SNPs to the X and Y chromosomes
  • output_dir (str) – path to output directory
  • resources_dir (str) – name / path of resources directory
  • deduplicate (bool) – deduplicate RSIDs and make SNPs available as SNPs.duplicate
  • deduplicate_XY_chrom (bool) – deduplicate alleles in the non-PAR regions of X and Y for males; see SNPs.discrepant_XY
  • deduplicate_MT_chrom (bool) – deduplicate alleles on MT; see SNPs.heterozygous_MT
  • parallelize (bool) – utilize multiprocessing to speed up calculations
  • processes (int) – processes to launch if multiprocessing
  • rsids (tuple, optional) – rsids to extract if loading a VCF file
assembly

Assembly of SNPs.

Returns:
Return type:str
build

Build of SNPs.

Returns:
Return type:int
build_detected

Status indicating if build of SNPs was detected.

Returns:
Return type:bool
chromosomes

Chromosomes of SNPs.

Returns:list of str chromosomes (e.g., [‘1’, ‘2’, ‘3’, ‘MT’]), empty list if no chromosomes
Return type:list
chromosomes_summary

Summary of the chromosomes of SNPs.

Returns:human-readable listing of chromosomes (e.g., ‘1-3, MT’), empty str if no chromosomes
Return type:str
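
The human-readable listing described for chromosomes_summary can be sketched as follows. This is a minimal illustration, not the library's implementation; the function name summarize_chromosomes is hypothetical, and it assumes that only consecutive numeric chromosomes are collapsed into ranges while non-numeric names (e.g., X, Y, MT) are appended in their given order.

```python
def summarize_chromosomes(chromosomes):
    """Collapse a list such as ['1', '2', '3', 'MT'] into '1-3, MT'."""
    # Numeric chromosomes are de-duplicated and sorted; others keep their order.
    numeric = sorted({int(c) for c in chromosomes if c.isdigit()})
    other = [c for c in chromosomes if not c.isdigit()]

    # Build (start, end) runs of consecutive chromosome numbers.
    runs = []
    for n in numeric:
        if runs and n == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], n)
        else:
            runs.append((n, n))

    parts = [str(a) if a == b else f"{a}-{b}" for a, b in runs]
    return ", ".join(parts + other)
```

An empty input list yields an empty str, matching the documented behavior.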
count

Count of SNPs.

Returns:
Return type:int
detect_build()[source]

Detect build of SNPs.

Use the coordinates of common SNPs to identify the build / assembly of a genotype file that is being loaded.

Notes

  • rs3094315 : plus strand in 36, 37, and 38
  • rs11928389 : plus strand in 36, minus strand in 37 and 38
  • rs2500347 : plus strand in 36 and 37, minus strand in 38
  • rs964481 : plus strand in 36, 37, and 38
  • rs2341354 : plus strand in 36, 37, and 38
  • rs3850290 : plus strand in 36, 37, and 38
  • rs1329546 : plus strand in 36, 37, and 38
Returns:detected build of SNPs, else 0
Return type:int

References

  1. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
  2. Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
  3. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1;29(1):308-11.
  4. Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. dbSNP accession: rs3094315, rs11928389, rs2500347, rs964481, rs2341354, rs3850290, and rs1329546 (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/
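
The idea behind detect_build, comparing the observed positions of a few well-known SNPs against their known positions in each build, can be sketched as below. This is an illustration under stated assumptions rather than the library's code: the table KNOWN_POSITIONS, its rsids ("rsA", "rsB"), and all coordinates are made-up placeholders, and the function name detect_build_from_positions is hypothetical.

```python
# Hypothetical lookup table: rsid -> {build: position}.
# The rsids and coordinates are placeholders, not real dbSNP data.
KNOWN_POSITIONS = {
    "rsA": {36: 1000, 37: 1100, 38: 1200},
    "rsB": {36: 5000, 37: 5050, 38: 5075},
}


def detect_build_from_positions(observed):
    """observed: dict mapping rsid -> position found in the loaded file.

    Returns the first build whose known position matches, else 0.
    """
    for rsid, by_build in KNOWN_POSITIONS.items():
        pos = observed.get(rsid)
        if pos is None:
            continue  # this reference SNP is absent from the file
        for build, expected in by_build.items():
            if pos == expected:
                return build
    return 0
```

Checking several reference SNPs in turn means a file that lacks one of them can still be classified by another.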
determine_sex(heterozygous_x_snps_threshold=0.03, y_snps_not_null_threshold=0.3, chrom='X')[source]

Determine sex from SNPs using thresholds.

Parameters:
  • heterozygous_x_snps_threshold (float) – percentage heterozygous X SNPs; above this threshold, Female is determined
  • y_snps_not_null_threshold (float) – percentage Y SNPs that are not null; above this threshold, Male is determined
  • chrom ({“X”, “Y”}) – use X or Y chromosome SNPs to determine sex
Returns:

‘Male’ or ‘Female’ if detected, else empty str

Return type:

str
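
The threshold logic described by the parameters above can be sketched as follows. This is a simplified illustration, not the library's implementation. It assumes that genotypes are passed as a plain list of str (None for null), that each fraction is taken over all SNPs on the chosen chromosome, and that a value not above the threshold yields the opposite sex; the helper names are hypothetical.

```python
def is_heterozygous(genotype):
    """True for a two-allele genotype whose alleles differ (e.g., 'AG')."""
    return bool(genotype) and len(genotype) == 2 and genotype[0] != genotype[1]


def sex_from_genotypes(genotypes, chrom="X",
                       heterozygous_x_snps_threshold=0.03,
                       y_snps_not_null_threshold=0.3):
    """genotypes: list of genotype strings (or None) for the given chromosome."""
    if not genotypes:
        return ""  # nothing to base a decision on
    total = len(genotypes)
    if chrom == "X":
        het = sum(1 for g in genotypes if is_heterozygous(g))
        return "Female" if het / total > heterozygous_x_snps_threshold else "Male"
    if chrom == "Y":
        called = sum(1 for g in genotypes if g)
        return "Male" if called / total > y_snps_not_null_threshold else "Female"
    return ""
```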

discrepant_XY

Discrepant XY SNPs.

A discrepant XY SNP is a heterozygous SNP in the non-PAR region of the X or Y chromosome found during deduplication for a detected male genotype.

Returns:normalized snps dataframe
Return type:pandas.DataFrame
discrepant_merge_genotypes

SNPs with discrepant genotypes discovered while merging SNPs.

Notes

Definitions of columns in this dataframe are as follows:

Column          Description
rsid            SNP ID
chrom           Chromosome of existing SNP
pos             Position of existing SNP
genotype        Genotype of existing SNP
chrom_added     Chromosome of added SNP
pos_added       Position of added SNP
genotype_added  Genotype of added SNP (discrepant with genotype)
Returns:
Return type:pandas.DataFrame
discrepant_merge_positions

SNPs with discrepant positions discovered while merging SNPs.

Notes

Definitions of columns in this dataframe are as follows:

Column          Description
rsid            SNP ID
chrom           Chromosome of existing SNP
pos             Position of existing SNP
genotype        Genotype of existing SNP
chrom_added     Chromosome of added SNP
pos_added       Position of added SNP (discrepant with pos)
genotype_added  Genotype of added SNP
Returns:
Return type:pandas.DataFrame
discrepant_merge_positions_genotypes

SNPs with discrepant positions and / or genotypes discovered while merging SNPs.

Notes

Definitions of columns in this dataframe are as follows:

Column          Description
rsid            SNP ID
chrom           Chromosome of existing SNP
pos             Position of existing SNP
genotype        Genotype of existing SNP
chrom_added     Chromosome of added SNP
pos_added       Position of added SNP (possibly discrepant with pos)
genotype_added  Genotype of added SNP (possibly discrepant with genotype)
Returns:
Return type:pandas.DataFrame
discrepant_vcf_position

SNPs with discrepant positions discovered while saving VCF.

Returns:normalized snps dataframe
Return type:pandas.DataFrame
duplicate

Duplicate SNPs.

A duplicate SNP has the same RSID as another SNP. The first occurrence of the RSID is not considered a duplicate SNP.

Returns:normalized snps dataframe
Return type:pandas.DataFrame
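
The "first occurrence is not a duplicate" rule can be illustrated with plain tuples. This is a sketch of the rule only; the library operates on dataframes, and the function name split_duplicates is hypothetical.

```python
def split_duplicates(records):
    """records: iterable of (rsid, chrom, pos, genotype) tuples.

    Returns (kept, duplicates): the first record for each RSID is kept,
    and every later record with the same RSID is reported as a duplicate.
    """
    seen = set()
    kept, duplicates = [], []
    for record in records:
        rsid = record[0]
        if rsid in seen:
            duplicates.append(record)
        else:
            seen.add(rsid)
            kept.append(record)
    return kept, duplicates
```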
get_count(chrom='')[source]

Count of SNPs.

Parameters:chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
Returns:
Return type:int
static get_par_regions(build)[source]

Get PAR regions for the X and Y chromosomes.

Parameters:build (int) – build of SNPs
Returns:PAR regions for the given build
Return type:pandas.DataFrame

References

  1. Genome Reference Consortium, https://www.ncbi.nlm.nih.gov/grc/human
  2. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
  3. Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
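
Once PAR regions are known for a build, deciding whether an X or Y SNP lies inside one is an interval check. The sketch below assumes a row layout of (region, chrom, start, stop) with inclusive bounds; that layout, the function name in_par, and the coordinates used in any example are assumptions for illustration, not values returned by get_par_regions.

```python
def in_par(chrom, pos, par_regions):
    """Return True if (chrom, pos) falls inside any PAR region.

    par_regions: iterable of (region, chrom, start, stop) tuples,
    with start and stop treated as inclusive.
    """
    return any(
        region_chrom == chrom and start <= pos <= stop
        for _region, region_chrom, start, stop in par_regions
    )
```

SNPs for which this check is False are the "non-PAR" SNPs referred to by deduplicate_XY_chrom and discrepant_XY.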
heterozygous(chrom='')[source]

Get heterozygous SNPs.

Parameters:chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
Returns:normalized snps dataframe
Return type:pandas.DataFrame
heterozygous_MT

Heterozygous SNPs on the MT chromosome found during deduplication.

Returns:normalized snps dataframe
Return type:pandas.DataFrame
homozygous(chrom='')[source]

Get homozygous SNPs.

Parameters:chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
Returns:normalized snps dataframe
Return type:pandas.DataFrame
merge(snps_objects=(), discrepant_positions_threshold=100, discrepant_genotypes_threshold=500, remap=True, chrom='')[source]

Merge other SNPs objects into this SNPs object.

Parameters:
  • snps_objects (list or tuple of SNPs) – other SNPs objects to merge into this SNPs object
  • discrepant_positions_threshold (int) – threshold for discrepant SNP positions between existing data and data to be loaded; a large value could indicate mismatched genome assemblies
  • discrepant_genotypes_threshold (int) – threshold for discrepant genotype data between existing data and data to be loaded; a large value could indicate mismatched individuals
  • remap (bool) – if necessary, remap other SNPs objects to have the same build as this SNPs object before merging
  • chrom (str, optional) – chromosome to merge (e.g., “1”, “Y”, “MT”)
Returns:

for each SNPs object to merge, a dict with the following items:

merged (bool)

whether SNPs object was merged

common_rsids (pandas.Index)

SNPs in common

discrepant_position_rsids (pandas.Index)

SNPs with discrepant positions

discrepant_genotype_rsids (pandas.Index)

SNPs with discrepant genotypes

Return type:

list of dict

References

  1. Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.
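
The bookkeeping that merge reports (common RSIDs, discrepant positions, discrepant genotypes, and whether the merge went ahead) can be sketched with dictionaries. This is an illustration, not the library's algorithm. It assumes that null genotypes are not counted as discrepant, that allele order is ignored ("AG" equals "GA"), and that a merge is refused once a discrepancy count reaches its threshold; the function name compare_snps is hypothetical.

```python
def compare_snps(existing, added,
                 discrepant_positions_threshold=100,
                 discrepant_genotypes_threshold=500):
    """existing / added: dict mapping rsid -> (chrom, pos, genotype)."""
    common = sorted(existing.keys() & added.keys())

    # Same RSID but different chromosome or position.
    discrepant_positions = [
        r for r in common if existing[r][:2] != added[r][:2]
    ]

    # Same RSID, both genotypes called, but different alleles.
    discrepant_genotypes = [
        r for r in common
        if existing[r][2] and added[r][2]
        and sorted(existing[r][2]) != sorted(added[r][2])
    ]

    merged = (
        len(discrepant_positions) < discrepant_positions_threshold
        and len(discrepant_genotypes) < discrepant_genotypes_threshold
    )
    return {
        "merged": merged,
        "common_rsids": common,
        "discrepant_position_rsids": discrepant_positions,
        "discrepant_genotype_rsids": discrepant_genotypes,
    }
```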
notnull(chrom='')[source]

Get not null genotype SNPs.

Parameters:chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
Returns:normalized snps dataframe
Return type:pandas.DataFrame
phased

Indicates if genotype is phased.

Returns:
Return type:bool
remap(target_assembly, complement_bases=True)[source]

Remap SNP coordinates from one assembly to another.

This method uses the assembly map endpoint of the Ensembl REST API service (via Resources’s EnsemblRestClient) to convert SNP coordinates / positions from one assembly to another. After remapping, the coordinates / positions for the SNPs will be that of the target assembly.

If the SNPs are already mapped relative to the target assembly, remapping will not be performed.

Parameters:
  • target_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’, 36, 37, 38}) – assembly to remap to
  • complement_bases (bool) – complement bases when remapping SNPs to the minus strand
Returns:

  • chromosomes_remapped (list of str) – chromosomes remapped
  • chromosomes_not_remapped (list of str) – chromosomes not remapped

Notes

An assembly is also known as a “build.” For example:

Assembly NCBI36 = Build 36
Assembly GRCh37 = Build 37
Assembly GRCh38 = Build 38

See https://www.ncbi.nlm.nih.gov/assembly for more information about assemblies and remapping.

References

  1. Ensembl, Assembly Map Endpoint, http://rest.ensembl.org/documentation/info/assembly_map
  2. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
  3. Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
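
Two details of remap lend themselves to a small sketch: target_assembly accepts either an assembly name or a build number, and complement_bases complements alleles for SNPs that land on the minus strand. The code below is illustrative only; the helper names, the strand encoding (-1 for minus), and the restriction to the bases A, C, G, and T are assumptions.

```python
ASSEMBLY_BY_BUILD = {36: "NCBI36", 37: "GRCh37", 38: "GRCh38"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalize_assembly(target_assembly):
    """Accept a build number or an assembly name and return the assembly name."""
    if target_assembly in ASSEMBLY_BY_BUILD:
        return ASSEMBLY_BY_BUILD[target_assembly]
    if target_assembly in ASSEMBLY_BY_BUILD.values():
        return target_assembly
    raise ValueError(f"unknown assembly: {target_assembly!r}")


def complement_genotype(genotype, strand):
    """Complement each allele when the remapped SNP is on the minus strand."""
    if not genotype or strand != -1:
        return genotype
    return genotype.translate(_COMPLEMENT)
```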
save(filename='', vcf=False, atomic=True, **kwargs)[source]

Save SNPs to file.

Parameters:
  • filename (str or buffer) – filename for file to save or buffer to write to
  • vcf (bool) – flag to save file as VCF
  • atomic (bool) – atomically write output to a file on local filesystem
  • **kwargs – additional parameters to pandas.DataFrame.to_csv
Returns:

path to file in output directory if SNPs were saved, else empty str

Return type:

str

References

  1. Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.
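
The atomic option means a reader never sees a half-written output file. One common way to get that effect, shown here as a stand-in technique and not necessarily what save does internally, is to write to a temporary file in the same directory and then rename it over the destination. The function name atomic_write_text is hypothetical.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path so that path is either absent/old or fully written."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the destination directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            f.write(text)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return path


# Usage: write a small CSV and confirm no temporary file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    out = atomic_write_text(os.path.join(tmp, "snps.csv"), "rsid,chrom,pos,genotype\n")
    with open(out, newline="") as f:
        content = f.read()
    leftovers = os.listdir(tmp)
```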
sex

Sex derived from SNPs.

Returns:‘Male’ or ‘Female’ if detected, else empty str
Return type:str
snps

Normalized SNPs.

Notes

Throughout snps, the “normalized snps dataframe” is defined as follows:

Column        Description                          pandas dtype
rsid [*]      SNP ID                               object (string)
chrom         Chromosome of SNP                    object (string)
pos           Position of SNP (relative to build)  uint32
genotype [†]  Genotype of SNP                      object (string)

[*] Dataframe index
[†] Genotype can be null, length 1, or length 2. Specifically, genotype is null if not called or unavailable. Otherwise, for autosomal chromosomes, genotype is two alleles. For the X and Y chromosomes, male genotypes are one allele in the non-PAR regions (assuming deduplicate_XY_chrom). For the MT chromosome, genotypes are one allele (assuming deduplicate_MT_chrom).
Returns:normalized snps dataframe
Return type:pandas.DataFrame
sort()[source]

Sort SNPs based on ordered chromosome list and position.
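
Sorting by an ordered chromosome list rather than alphabetically keeps "2" ahead of "10" and places X, Y, and MT last. A minimal sketch, assuming the order 1 to 22, X, Y, MT and that unrecognized chromosome names sort after the known ones (the names CHROM_ORDER and sort_snps are hypothetical):

```python
CHROM_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]
_RANK = {chrom: i for i, chrom in enumerate(CHROM_ORDER)}


def sort_snps(records):
    """records: iterable of (rsid, chrom, pos) tuples.

    Sort by chromosome rank first, then by position; unknown chromosomes
    go last, ordered by name.
    """
    return sorted(
        records,
        key=lambda r: (_RANK.get(r[1], len(_RANK)), r[1], r[2]),
    )
```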

source

Summary of the SNP data source(s).

Returns:Data source(s) for this SNPs object, separated by “, “.
Return type:str
summary

Summary of SNPs.

Returns:summary info if SNPs is valid, else {}
Return type:dict
unannotated_vcf

Indicates if VCF file is unannotated.

Returns:
Return type:bool
valid

Determine if SNPs is valid.

SNPs is valid when the input file has been successfully parsed.

Returns:True if SNPs is valid
Return type:bool

snps.ensembl

Ensembl REST client.

Notes

Modified from https://github.com/Ensembl/ensembl-rest/wiki/Example-Python-Client.

References

  1. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
  2. Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
class snps.ensembl.EnsemblRestClient(server='https://rest.ensembl.org', reqs_per_sec=15)[source]

Bases: object

__init__(server='https://rest.ensembl.org', reqs_per_sec=15)[source]

Initialize self. See help(type(self)) for accurate signature.

perform_rest_action(endpoint, hdrs=None, params=None)[source]

snps.io

Classes for reading and writing SNPs.

snps.io.reader

Class for reading SNPs.

class snps.io.reader.Reader(file='', only_detect_source=False, resources=None, rsids=())[source]

Bases: object

Class for reading and parsing raw data / genotype files.

__init__(file='', only_detect_source=False, resources=None, rsids=())[source]

Initialize a Reader.

Parameters:
  • file (str or bytes) – path to file to load or bytes to load
  • only_detect_source (bool) – only detect the source of the data
  • resources (Resources) – instance of Resources
  • rsids (tuple, optional) – rsids to extract if loading a VCF file
static is_gzip(bytes_data)[source]

Check whether or not a bytes_data file is a valid gzip file.

static is_zip(bytes_data)[source]

Check whether or not a bytes_data file is a valid Zip file.
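
Both checks can be made by inspecting the leading "magic" bytes of the data. The sketch below is an illustration of that approach and is not claimed to match the library's code; note that it recognizes only zip data that begins with a local file header, so an empty archive would not match.

```python
import gzip
import io
import zipfile

GZIP_MAGIC = b"\x1f\x8b"    # first two bytes of a gzip stream
ZIP_MAGIC = b"PK\x03\x04"   # local file header signature of a zip archive


def is_gzip(bytes_data):
    return bytes_data[:2] == GZIP_MAGIC


def is_zip(bytes_data):
    return bytes_data[:4] == ZIP_MAGIC


# Usage: build small in-memory samples of each format.
gz_bytes = gzip.compress(b"rsid,chrom,pos,genotype\n")

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("snps.csv", "rsid,chrom,pos,genotype\n")
zip_bytes = buffer.getvalue()
```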

read()[source]

Read and parse a raw data / genotype file.

Returns:dict with the following items:
snps (pandas.DataFrame)
dataframe of parsed SNPs
source (str)
detected source of SNPs
phased (bool)
flag indicating if SNPs are phased
Return type:dict
read_23andme(file, compression)[source]

Read and parse 23andMe file.

https://www.23andme.com

Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
read_ancestry(file, compression)[source]

Read and parse Ancestry.com file.

http://www.ancestry.com

Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
read_codigo46(file)[source]

Read and parse Codigo46 files.

https://codigo46.com.mx

Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
read_dnaland(file, compression)[source]

Read and parse DNA.land files.

https://dna.land/

Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
classmethod read_file(file, only_detect_source, resources, rsids)[source]

Read file.

Parameters:
  • file (str or bytes) – path to file to load or bytes to load
  • only_detect_source (bool) – only detect the source of the data
  • resources (Resources) – instance of Resources
  • rsids (tuple) – rsids to extract if loading a VCF file
Returns:

dict with the following items:

snps (pandas.DataFrame)

dataframe of parsed SNPs

source (str)

detected source of SNPs

phased (bool)

flag indicating if SNPs are phased

Return type:

dict

read_ftdna(file, compression)[source]

Read and parse Family Tree DNA (FTDNA) file.

https://www.familytreedna.com

Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
read_ftdna_famfinder(file, compression)[source]

Read and parse Family Tree DNA (FTDNA) “famfinder” file.

https://www.familytreedna.com

Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
read_generic(file, compression, skip=1)[source]

Read and parse generic CSV or TSV file.

Notes

Assumes columns are ‘rsid’, ‘chrom’ / ‘chromosome’, ‘pos’ / ‘position’, and ‘genotype’; values are comma separated; unreported genotypes are indicated by ‘--’; and one header row precedes data. For example:

rsid,chromosome,position,genotype
rs1,1,1,AA
rs2,1,2,CC
rs3,1,3,--
Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
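
A file in the generic layout described in the notes can be parsed with the standard csv module, as sketched below. This is not read_generic itself: it handles only comma-separated text, assumes the unreported-genotype marker is two hyphens, and returns a plain dict instead of a dataframe. The function name parse_generic is hypothetical; the skip argument mirrors the header-row count in the signature above.

```python
import csv
import io


def parse_generic(text, skip=1):
    """Parse 'rsid,chromosome,position,genotype' rows into a dict.

    Returns {rsid: (chrom, pos, genotype)}, with genotype None when unreported.
    """
    reader = csv.reader(io.StringIO(text))
    for _ in range(skip):
        next(reader, None)  # discard header row(s)

    snps = {}
    for row in reader:
        if len(row) < 4:
            continue  # ignore blank or malformed lines
        rsid, chrom, pos, genotype = (field.strip() for field in row[:4])
        snps[rsid] = (chrom, int(pos), None if genotype == "--" else genotype)
    return snps
```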
read_genes_for_good(file, compression)[source]

Read and parse Genes For Good file.

https://genesforgood.sph.umich.edu/readme/readme1.2.txt

Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
read_helper(source, parser)[source]

Generic method to help read files.

Parameters:
  • source (str) – name of data source

  • parser (func) – parsing function, which returns a tuple with the following items:

    0 (pandas.DataFrame)

    dataframe of parsed SNPs (empty if only detecting source)

    1 (bool), optional

    flag indicating if SNPs are phased

    2 (int), optional

    detected build of SNPs

Returns:

dict with the following items:

snps (pandas.DataFrame)

dataframe of parsed SNPs

source (str)

detected source of SNPs

phased (bool)

flag indicating if SNPs are phased

build (int)

detected build of SNPs

Return type:

dict

References

  1. Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.
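
The contract between read_helper and a parser, a tuple whose second and third items are optional, can be sketched as below. The defaults used when an item is missing (phased False, build 0) are assumptions, a list stands in for the dataframe, and the function name read_helper_sketch is hypothetical.

```python
def read_helper_sketch(source, parser):
    """Call parser() and normalize its 1- to 3-item tuple into a dict."""
    snps, *extra = parser()
    phased = extra[0] if len(extra) > 0 else False  # assumed default
    build = extra[1] if len(extra) > 1 else 0       # assumed default
    return {"snps": snps, "source": source, "phased": phased, "build": build}
```

Passing the parser as a function lets each file format supply its own parsing logic while the surrounding bookkeeping stays in one place.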
read_livingdna(file, compression)[source]

Read and parse LivingDNA file.

https://livingdna.com/

Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
read_mapmygenome(file, compression, header)[source]

Read and parse Mapmygenome file.

https://mapmygenome.in

Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
read_myheritage(file, compression)[source]

Read and parse MyHeritage file.

https://www.myheritage.com

Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
read_sano(file)[source]

Read and parse Sano Genetics files.

https://sanogenetics.com

Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
read_snps_csv(file, comments, compression)[source]

Read and parse CSV file generated by snps.

https://pypi.org/project/snps/

Parameters:
  • file (str or buffer) – path to file or buffer to read
  • comments (str) – comments at beginning of file
Returns:

result of read_helper

Return type:

dict

read_tellmegen(file, compression)[source]

Read and parse tellmeGen files.

https://www.tellmegen.com/

Parameters:file (str) – path to file
Returns:result of read_helper
Return type:dict
read_vcf(file, compression, provider, rsids=())[source]

Read and parse VCF file.

Notes

This method attempts to read and parse a VCF file or buffer, optionally compressed with gzip. Some assumptions are made throughout this process:

  • SNPs that are not annotated with an RSID are skipped
  • If the VCF contains multiple samples, only the first sample is used to look up the genotype
  • Insertions and deletions are skipped
  • If a sample allele is not specified, the genotype is reported as NaN
  • If a sample allele refers to a REF or ALT allele that is not specified, the genotype is reported as NaN
Parameters:
  • file (str or bytes) – path to file or bytes to load
  • rsids (tuple, optional) – rsids to extract if loading a VCF file
Returns:

result of read_helper

Return type:

dict
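
The assumptions listed in the notes can be illustrated for a single VCF record. The sketch below takes the REF, ALT, and first-sample GT values as strings and is not the library's parser; the function names are hypothetical, and None stands in for NaN.

```python
def is_indel(ref, alt):
    """True if REF or any ALT allele is not a single base (record is skipped)."""
    return any(len(allele) != 1 for allele in [ref] + alt.split(","))


def resolve_genotype(ref, alt, gt):
    """Translate a GT value such as '0/1' or '1|1' into allele letters.

    Returns (genotype, phased); genotype is None when an allele is missing
    or refers to a REF / ALT allele that is not specified.
    """
    alleles = [ref] + alt.split(",")
    phased = "|" in gt
    genotype = ""
    for index in gt.replace("|", "/").split("/"):
        if index == ".":
            return None, phased  # sample allele not specified
        i = int(index)
        if i >= len(alleles) or alleles[i] == ".":
            return None, phased  # refers to an unspecified allele
        genotype += alleles[i]
    return genotype, phased
```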

snps.io.reader.get_empty_snps_dataframe()[source]

Get empty dataframe normalized for usage with snps.

Returns:
Return type:pd.DataFrame

snps.io.writer

Class for writing SNPs.

class snps.io.writer.Writer(snps=None, filename='', vcf=False, atomic=True, **kwargs)[source]

Bases: object

Class for writing SNPs to files.

__init__(snps=None, filename='', vcf=False, atomic=True, **kwargs)[source]

Initialize a Writer.

Parameters:
  • snps (SNPs) – SNPs to save to file or write to buffer
  • filename (str or buffer) – filename for file to save or buffer to write to
  • vcf (bool) – flag to save file as VCF
  • atomic (bool) – atomically write output to a file on local filesystem
  • **kwargs – additional parameters to pandas.DataFrame.to_csv
write()[source]
classmethod write_file(snps=None, filename='', vcf=False, atomic=True, **kwargs)[source]

Save SNPs to file.

Parameters:
  • snps (SNPs) – SNPs to save to file or write to buffer
  • filename (str or buffer) – filename for file to save or buffer to write to
  • vcf (bool) – flag to save file as VCF
  • atomic (bool) – atomically write output to a file on local filesystem
  • **kwargs – additional parameters to pandas.DataFrame.to_csv
Returns:

  • str – path to file in output directory if SNPs were saved, else empty str
  • discrepant_vcf_position (pd.DataFrame) – SNPs with discrepant positions discovered while saving VCF

snps.resources

Class for downloading and loading required external resources.

References

  1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. http://dx.doi.org/10.1038/35057062
  2. hg19 (GRCh37): Hiram Clawson, Brooke Rhead, Pauline Fujita, Ann Zweig, Katrina Learned, Donna Karolchik and Robert Kuhn, https://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19
  3. Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
  4. Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
class snps.resources.ReferenceSequence(ID='', url='', path='', assembly='', species='', taxonomy='')[source]

Bases: object

Object used to represent and interact with a reference sequence.

ID

Get reference sequence chromosome.

Returns:
Return type:str
__init__(ID='', url='', path='', assembly='', species='', taxonomy='')[source]

Initialize a ReferenceSequence object.

Parameters:
  • ID (str) – reference sequence chromosome
  • url (str) – url to Ensembl reference sequence
  • path (str) – path to local reference sequence
  • assembly (str) – reference sequence assembly (e.g., “GRCh37”)
  • species (str) – reference sequence species
  • taxonomy (str) – reference sequence taxonomy

References

  1. The Variant Call Format (VCF) Version 4.2 Specification, 8 Mar 2019, https://samtools.github.io/hts-specs/VCFv4.2.pdf
assembly

Get reference sequence assembly.

Returns:
Return type:str
build

Get reference sequence build.

Returns:e.g., “B37”
Return type:str
chrom

Get reference sequence chromosome.

Returns:
Return type:str
clear()[source]

Clear reference sequence.

end

Get reference sequence end position (1-based).

Returns:
Return type:int
length

Get reference sequence length.

Returns:
Return type:int
md5

Get reference sequence MD5 hash.

Returns:
Return type:str
path

Get path to local reference sequence.

Returns:
Return type:str
sequence

Get reference sequence.

Returns:
Return type:np.array(dtype=np.uint8)
species

Get reference sequence species.

Returns:
Return type:str
start

Get reference sequence start position (1-based).

Returns:
Return type:int
taxonomy

Get reference sequence taxonomy.

Returns:
Return type:str
url

Get URL to Ensembl reference sequence.

Returns:
Return type:str
class snps.resources.Resources(resources_dir='resources')[source]

Bases: object

Object used to manage resources required by snps.

__init__(resources_dir='resources')[source]

Initialize a Resources object.

Parameters:resources_dir (str) – name / path of resources directory
download_example_datasets()[source]

Download example datasets from openSNP.

Per openSNP, “the data is donated into the public domain using CC0 1.0.”

Returns:paths – paths to example datasets
Return type:list of str or empty str

References

  1. Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204
get_all_reference_sequences(**kwargs)[source]

Get Homo sapiens reference sequences for Builds 36, 37, and 38 from Ensembl.

Notes

This function can download over 2.5GB of data.

Returns:dict of ReferenceSequence, else {}
Return type:dict
get_all_resources()[source]

Get / download all resources used throughout snps.

Notes

This function does not download reference sequences and the openSNP datadump, due to their large sizes.

Returns:dict of resources
Return type:dict
get_assembly_mapping_data(source_assembly, target_assembly)[source]

Get assembly mapping data.

Parameters:
  • source_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – assembly to remap from
  • target_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – assembly to remap to
Returns:

dict of json assembly mapping data if loading was successful, else {}

Return type:

dict

get_gsa_resources()[source]

Get resources for reading Global Screening Array files.

https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html

Returns:
Return type:dict
get_opensnp_datadump_filenames()[source]

Get filenames internal to the openSNP datadump zip.

Per openSNP, “the data is donated into the public domain using CC0 1.0.”

Notes

This function can download over 27GB of data. If the download is not successful, try using a different tool like wget or curl to download the file and move it to the resources directory (see _get_path_opensnp_datadump).

Returns:filenames – filenames internal to the openSNP datadump
Return type:list of str

References

  1. Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204
get_reference_sequences(assembly='GRCh37', chroms=('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT'))[source]

Get Homo sapiens reference sequences for chroms of assembly.

Notes

This function can download over 800MB of data for each assembly.

Parameters:
  • assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – reference sequence assembly
  • chroms (list of str) – reference sequence chromosomes
Returns:

dict of ReferenceSequence, else {}

Return type:

dict

load_opensnp_datadump_file(filename)[source]

Load the specified file from the openSNP datadump.

Per openSNP, “the data is donated into the public domain using CC0 1.0.”

Parameters:filename (str) – filename internal to the openSNP datadump
Returns:content of specified file internal to the openSNP datadump
Return type:bytes

References

  1. Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204

snps.utils

Utility classes and functions.

class snps.utils.Parallelizer(parallelize=False, processes=2)[source]

Bases: object

__init__(parallelize=False, processes=2)[source]

Initialize a Parallelizer.

Parameters:
  • parallelize (bool) – utilize multiprocessing to speed up calculations
  • processes (int) – processes to launch if multiprocessing
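
A helper with these two parameters might dispatch work either serially or through a process pool. The sketch below shows one way to do that with the standard library; the class name ParallelizerSketch and the call signature (a function plus a list of tasks) are assumptions, since the documentation above does not describe how the object is invoked.

```python
from multiprocessing import Pool


class ParallelizerSketch:
    def __init__(self, parallelize=False, processes=2):
        self._parallelize = parallelize
        self._processes = processes

    def __call__(self, f, tasks):
        """Apply f to each task and return the results in order."""
        if self._parallelize:
            # f and the tasks must be picklable (e.g., a module-level function).
            with Pool(self._processes) as pool:
                return pool.map(f, tasks)
        return list(map(f, tasks))
```

With parallelize left at False the call reduces to an ordinary map, so the same calling code works with or without multiprocessing.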
class snps.utils.Singleton[source]

Bases: type

snps.utils.clean_str(s)[source]

Clean a string so that it can be used as a Python variable name.

Parameters:s (str) – string to clean
Returns:string that can be used as a Python variable name
Return type:str
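
One way to produce a string usable as a variable name is to replace every non-word character with an underscore and to prefix an underscore when the string starts with a digit. The regular expression below is a sketch of that approach and is not claimed to be the library's exact pattern; the function name clean_str_sketch is hypothetical.

```python
import re


def clean_str_sketch(s):
    """Replace non-word characters, and guard a leading digit, with '_'."""
    return re.sub(r"\W|^(?=\d)", "_", s)
```

The result does not account for Python keywords or empty input, which a stricter version would need to handle.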
snps.utils.create_dir(path)[source]

Create directory specified by path if it doesn’t already exist.

Parameters:path (str) – path to directory
Returns:True if path exists
Return type:bool
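
A minimal sketch of the documented behavior using os.makedirs; returning False on an OSError is an assumption, as the documentation only states the True case, and the function name create_dir_sketch is hypothetical.

```python
import os
import tempfile


def create_dir_sketch(path):
    """Create path (including parents) if needed; True if it exists afterwards."""
    try:
        os.makedirs(path, exist_ok=True)
    except OSError:
        return False
    return os.path.isdir(path)


# Usage: creating the same nested directory twice succeeds both times.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "output", "nested")
    created = create_dir_sketch(target)
    created_again = create_dir_sketch(target)
```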
snps.utils.gzip_file(src, dest)[source]

Gzip a file.

Parameters:
  • src (str) – path to file to gzip
  • dest (str) – path to output gzip file
Returns:

path to gzipped file

Return type:

str
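
Gzipping a file with the standard library can be sketched as below; this streams the source in chunks instead of loading it into memory. It illustrates the documented inputs and return value and is not necessarily how gzip_file is implemented; the function name gzip_file_sketch is hypothetical.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file_sketch(src, dest):
    """Compress src into dest with gzip and return dest."""
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest


# Usage: compress a small file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "snps.csv")
    with open(src, "wb") as f:
        f.write(b"rsid,chrom,pos,genotype\n")
    gz_path = gzip_file_sketch(src, src + ".gz")
    with gzip.open(gz_path, "rb") as f:
        roundtrip = f.read()
```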

snps.utils.save_df_as_csv(df, path, filename, comment='', prepend_info=True, atomic=True, **kwargs)[source]

Save dataframe to a CSV file.

Parameters:
  • df (pandas.DataFrame) – dataframe to save
  • path (str) – path to directory where to save CSV file
  • filename (str or buffer) – filename for file to save or buffer to write to
  • comment (str) – header comment(s); one or more lines starting with ‘#’
  • prepend_info (bool) – prepend file generation information as comments
  • atomic (bool) – atomically write output to a file on local filesystem
  • **kwargs – additional parameters to pandas.DataFrame.to_csv
Returns:

path to saved file or buffer (empty str if error)

Return type:

str or buffer

snps.utils.zip_file(src, dest, arcname)[source]

Zip a file.

Parameters:
  • src (str) – path to file to zip
  • dest (str) – path to output zip file
  • arcname (str) – name of file in zip archive
Returns:

path to zipped file

Return type:

str
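
The role of arcname, the name the file takes inside the archive independent of its path on disk, can be seen in a short sketch built on the standard zipfile module. The choice of ZIP_DEFLATED compression and the function name zip_file_sketch are assumptions.

```python
import os
import tempfile
import zipfile


def zip_file_sketch(src, dest, arcname):
    """Store src in the zip archive dest under the name arcname; return dest."""
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as z:
        z.write(src, arcname=arcname)
    return dest


# Usage: the archive lists the file under arcname, not its original path.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "snps.csv")
    with open(src, "wb") as f:
        f.write(b"rsid,chrom,pos,genotype\n")
    zip_path = zip_file_sketch(src, os.path.join(tmp, "snps.zip"), "data/snps.csv")
    with zipfile.ZipFile(zip_path) as z:
        names = z.namelist()
        payload = z.read("data/snps.csv")
```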