Code Documentation
SNPs
SNPs reads, writes, merges, and remaps genotype / raw data files.
- class snps.snps.SNPs(file='', only_detect_source=False, assign_par_snps=False, output_dir='output', resources_dir='resources', deduplicate=True, deduplicate_XY_chrom=True, deduplicate_MT_chrom=True, parallelize=False, processes=2, rsids=())[source]
Bases: object
- __init__(file='', only_detect_source=False, assign_par_snps=False, output_dir='output', resources_dir='resources', deduplicate=True, deduplicate_XY_chrom=True, deduplicate_MT_chrom=True, parallelize=False, processes=2, rsids=())[source]
Object used to read, write, and remap genotype / raw data files.
- Parameters
file (str or bytes) – path to file to load or bytes to load
only_detect_source (bool) – only detect the source of the data
assign_par_snps (bool) – assign PAR SNPs to the X and Y chromosomes
output_dir (str) – path to output directory
resources_dir (str) – name / path of resources directory
deduplicate (bool) – deduplicate RSIDs and make SNPs available as SNPs.duplicate
deduplicate_MT_chrom (bool) – deduplicate alleles on MT; see SNPs.heterozygous_MT
deduplicate_XY_chrom (bool or str) – deduplicate alleles in the non-PAR regions of X and Y for males; see SNPs.discrepant_XY. If a str, this is the sex determination method to use: “X”, “Y”, or “XY”
parallelize (bool) – utilize multiprocessing to speed up calculations
processes (int) – processes to launch if multiprocessing
rsids (tuple, optional) – rsids to extract if loading a VCF file
- property assembly
Assembly of SNPs.
- Returns
- Return type
str
- property build
Build of SNPs.
- Returns
- Return type
int
- property build_detected
Status indicating if build of SNPs was detected.
- Returns
- Return type
bool
- property chip
Detected deduced genotype / chip array, if any, per compute_cluster_overlap.
- Returns
detected chip array, else empty str
- Return type
str
- property chip_version
Detected genotype / chip array version, if any, per compute_cluster_overlap.
Notes
Chip array version is only applicable to 23andMe (v3, v4, v5) and AncestryDNA (v1, v2) files.
- Returns
detected chip array version, e.g., ‘v4’, else empty str
- Return type
str
- property chromosomes
Chromosomes of SNPs.
- Returns
list of str chromosomes (e.g., [‘1’, ‘2’, ‘3’, ‘MT’]), empty list if no chromosomes
- Return type
list
- property chromosomes_summary
Summary of the chromosomes of SNPs.
- Returns
human-readable listing of chromosomes (e.g., ‘1-3, MT’), empty str if no chromosomes
- Return type
str
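The relationship between the chromosomes list and the chromosomes_summary string can be illustrated with a small stand-alone sketch. The helper name summarize_chromosomes is hypothetical and this is not the library's implementation; it only assumes that consecutive numeric chromosomes collapse into ranges and that non-numeric chromosomes follow in their given order.

```python
def summarize_chromosomes(chromosomes):
    """Collapse chromosome names into a compact, human-readable string.

    Hypothetical helper: consecutive numeric chromosomes become ranges
    (e.g., "1-3"); non-numeric chromosomes (e.g., "X", "MT") are appended
    in the order given.
    """
    numeric = sorted({int(c) for c in chromosomes if c.isdigit()})
    other = [c for c in chromosomes if not c.isdigit()]

    parts = []
    start = prev = None
    for n in numeric:
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n  # extend the current run
        else:
            # close the current run and start a new one
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = n
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")

    return ", ".join(parts + other)


print(summarize_chromosomes(["1", "2", "3", "MT"]))  # 1-3, MT
```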
- property cluster
Detected chip cluster, if any, per compute_cluster_overlap.
Notes
Refer to compute_cluster_overlap for more details about chip clusters.
- Returns
detected chip cluster, e.g., ‘c1’, else empty str
- Return type
str
- compute_cluster_overlap(cluster_overlap_threshold=0.95)[source]
Compute overlap with chip clusters.
Chip clusters, which are defined in [1], are associated with deduced genotype / chip arrays and DTC companies.
This method also sets the values returned by the cluster, chip, and chip_version properties, based on max overlap, if the specified threshold is satisfied.
- Parameters
cluster_overlap_threshold (float) – threshold for cluster to overlap this SNPs object, and vice versa, to set values returned by the cluster, chip, and chip_version properties
- Returns
pandas.DataFrame with the following columns:
- company_composition
DTC company composition of associated cluster from [1]
- chip_base_deduced
deduced genotype / chip array of associated cluster from [1]
- snps_in_cluster
count of SNPs in cluster
- snps_in_common
count of SNPs in common with cluster (inner merge with cluster)
- overlap_with_cluster
percentage overlap of snps_in_common with cluster
- overlap_with_self
percentage overlap of snps_in_common with this SNPs object
- Return type
pandas.DataFrame
References
- [1]
Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.
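The two overlap columns can be read as a pair of ratios over the same intersection. The sketch below is a simplified illustration using plain sets of (chromosome, position) tuples rather than dataframes. The function name cluster_overlap, the "matches" key, and the strict greater-than comparison against the threshold are assumptions made for illustration.

```python
def cluster_overlap(own_positions, cluster_positions, threshold=0.95):
    """Compute overlap metrics between one sample and one chip cluster.

    Both inputs are iterables of (chrom, pos) tuples. This is a hypothetical,
    simplified stand-in for the dataframe-based computation.
    """
    own = set(own_positions)
    cluster = set(cluster_positions)
    common = own & cluster  # analogous to an inner merge on (chrom, pos)

    overlap_with_cluster = len(common) / len(cluster) if cluster else 0.0
    overlap_with_self = len(common) / len(own) if own else 0.0

    return {
        "snps_in_cluster": len(cluster),
        "snps_in_common": len(common),
        "overlap_with_cluster": overlap_with_cluster,
        "overlap_with_self": overlap_with_self,
        # Assumption: both directions must exceed the threshold.
        "matches": overlap_with_cluster > threshold
        and overlap_with_self > threshold,
    }


own = [("1", 100), ("1", 200), ("1", 300), ("1", 400)]
cluster = [("1", 100), ("1", 200), ("1", 300), ("2", 50)]
print(cluster_overlap(own, cluster))
```

Requiring the threshold in both directions reflects the "and vice versa" wording of cluster_overlap_threshold: a small file fully contained in a large cluster would not count as a match.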
- property count
Count of SNPs.
- Returns
- Return type
int
- detect_build()[source]
Detect build of SNPs.
Use the coordinates of common SNPs to identify the build / assembly of a genotype file that is being loaded.
Notes
rs3094315 : plus strand in 36, 37, and 38
rs11928389 : plus strand in 36, minus strand in 37 and 38
rs2500347 : plus strand in 36 and 37, minus strand in 38
rs964481 : plus strand in 36, 37, and 38
rs2341354 : plus strand in 36, 37, and 38
rs3850290 : plus strand in 36, 37, and 38
rs1329546 : plus strand in 36, 37, and 38
- Returns
detected build of SNPs, else 0
- Return type
int
References
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1;29(1):308-11.
Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. dbSNP accession: rs3094315, rs11928389, rs2500347, rs964481, rs2341354, rs3850290, and rs1329546 (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/
- determine_sex(heterozygous_x_snps_threshold=0.03, y_snps_not_null_threshold=0.3, chrom='X')[source]
Determine sex from SNPs using thresholds.
- Parameters
heterozygous_x_snps_threshold (float) – percentage heterozygous X SNPs; above this threshold, Female is determined
y_snps_not_null_threshold (float) – percentage Y SNPs that are not null; above this threshold, Male is determined
chrom ({“X”, “Y”}) – use X or Y chromosome SNPs to determine sex
- Returns
‘Male’ or ‘Female’ if detected, else empty str
- Return type
str
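The threshold logic can be sketched on plain lists of genotype strings. The function names below are hypothetical, None stands in for a null genotype, and two details are assumptions not stated above: the fraction is taken over all SNPs on the chromosome, and a fraction at or below the threshold yields the opposite sex rather than an empty string.

```python
def sex_from_x(x_genotypes, heterozygous_x_snps_threshold=0.03):
    """Infer sex from the fraction of heterozygous X SNPs (sketch)."""
    if not x_genotypes:
        return ""  # no X SNPs, so nothing can be determined
    heterozygous = sum(
        1 for g in x_genotypes if g is not None and len(g) == 2 and g[0] != g[1]
    )
    if heterozygous / len(x_genotypes) > heterozygous_x_snps_threshold:
        return "Female"
    return "Male"  # assumption: below threshold implies Male


def sex_from_y(y_genotypes, y_snps_not_null_threshold=0.3):
    """Infer sex from the fraction of non-null Y SNPs (sketch)."""
    if not y_genotypes:
        return ""
    not_null = sum(1 for g in y_genotypes if g is not None)
    if not_null / len(y_genotypes) > y_snps_not_null_threshold:
        return "Male"
    return "Female"  # assumption: below threshold implies Female


print(sex_from_x(["AG", "AA", "CT", "GG"]))  # Female
print(sex_from_y(["A", "G", None]))  # Male
```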
- property discrepant_XY
Discrepant XY SNPs.
A discrepant XY SNP is a heterozygous SNP in the non-PAR region of the X or Y chromosome found during deduplication for a detected male genotype.
- Returns
normalized snps dataframe
- Return type
pandas.DataFrame
- property discrepant_merge_genotypes
SNPs with discrepant genotypes discovered while merging SNPs.
Notes
Definitions of columns in this dataframe are as follows:
Column – Description
rsid – SNP ID
chrom – Chromosome of existing SNP
pos – Position of existing SNP
genotype – Genotype of existing SNP
chrom_added – Chromosome of added SNP
pos_added – Position of added SNP
genotype_added – Genotype of added SNP (discrepant with genotype)
- Returns
- Return type
pandas.DataFrame
- property discrepant_merge_positions
SNPs with discrepant positions discovered while merging SNPs.
Notes
Definitions of columns in this dataframe are as follows:
Column – Description
rsid – SNP ID
chrom – Chromosome of existing SNP
pos – Position of existing SNP
genotype – Genotype of existing SNP
chrom_added – Chromosome of added SNP
pos_added – Position of added SNP (discrepant with pos)
genotype_added – Genotype of added SNP
- Returns
- Return type
pandas.DataFrame
- property discrepant_merge_positions_genotypes
SNPs with discrepant positions and / or genotypes discovered while merging SNPs.
Notes
Definitions of columns in this dataframe are as follows:
Column – Description
rsid – SNP ID
chrom – Chromosome of existing SNP
pos – Position of existing SNP
genotype – Genotype of existing SNP
chrom_added – Chromosome of added SNP
pos_added – Position of added SNP (possibly discrepant with pos)
genotype_added – Genotype of added SNP (possibly discrepant with genotype)
- Returns
- Return type
pandas.DataFrame
- property discrepant_vcf_position
SNPs with discrepant positions discovered while saving VCF.
- Returns
normalized snps dataframe
- Return type
pandas.DataFrame
- property duplicate
Duplicate SNPs.
A duplicate SNP has the same RSID as another SNP. The first occurrence of the RSID is not considered a duplicate SNP.
- Returns
normalized snps dataframe
- Return type
pandas.DataFrame
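The "first occurrence is not a duplicate" rule can be shown with a short stand-alone sketch over (rsid, chrom, pos, genotype) tuples. The helper name split_duplicates is hypothetical; the library works on dataframes, so this only illustrates the rule itself.

```python
def split_duplicates(rows):
    """Split rows into first occurrences and later repeats of an RSID.

    Each row is a tuple whose first element is the RSID. Hypothetical helper
    illustrating the rule only.
    """
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        rsid = row[0]
        if rsid in seen:
            duplicates.append(row)  # second and later occurrences
        else:
            seen.add(rsid)
            unique.append(row)  # first occurrence is kept
    return unique, duplicates


rows = [
    ("rs1", "1", 100, "AA"),
    ("rs2", "1", 200, "CT"),
    ("rs1", "1", 100, "AG"),
]
unique, duplicates = split_duplicates(rows)
print(unique)
print(duplicates)
```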
- get_count(chrom='')[source]
Count of SNPs.
- Parameters
chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
- Returns
- Return type
int
- static get_par_regions(build)[source]
Get PAR regions for the X and Y chromosomes.
- Parameters
build (int) – build of SNPs
- Returns
PAR regions for the given build
- Return type
pandas.DataFrame
References
Genome Reference Consortium, https://www.ncbi.nlm.nih.gov/grc/human
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
- heterozygous(chrom='')[source]
Get heterozygous SNPs.
- Parameters
chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
- Returns
normalized snps dataframe
- Return type
pandas.DataFrame
- property heterozygous_MT
Heterozygous SNPs on the MT chromosome found during deduplication.
- Returns
normalized snps dataframe
- Return type
pandas.DataFrame
- homozygous(chrom='')[source]
Get homozygous SNPs.
- Parameters
chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
- Returns
normalized snps dataframe
- Return type
pandas.DataFrame
- identify_low_quality_snps()[source]
Identify low quality SNPs based on chip clusters.
Any low quality SNPs are removed from the snps_qc dataframe and are made available as low_quality.
Notes
Chip clusters, which are defined in [1], are associated with low quality SNPs. As such, low quality SNPs will only be identified when this SNPs object corresponds to a cluster per compute_cluster_overlap().
- property low_quality
SNPs identified as low quality, if any, per identify_low_quality_snps().
- Returns
normalized snps dataframe
- Return type
pandas.DataFrame
- merge(snps_objects=(), discrepant_positions_threshold=100, discrepant_genotypes_threshold=500, remap=True, chrom='')[source]
Merge other SNPs objects into this SNPs object.
- Parameters
snps_objects (list or tuple of SNPs) – other SNPs objects to merge into this SNPs object
discrepant_positions_threshold (int) – threshold for discrepant SNP positions between existing data and data to be loaded; a large value could indicate mismatched genome assemblies
discrepant_genotypes_threshold (int) – threshold for discrepant genotype data between existing data and data to be loaded; a large value could indicate mismatched individuals
remap (bool) – if necessary, remap other SNPs objects to have the same build as this SNPs object before merging
chrom (str, optional) – chromosome to merge (e.g., “1”, “Y”, “MT”)
- Returns
for each SNPs object to merge, a dict with the following items:
- merged (bool)
whether SNPs object was merged
- common_rsids (pandas.Index)
SNPs in common
- discrepant_position_rsids (pandas.Index)
SNPs with discrepant positions
- discrepant_genotype_rsids (pandas.Index)
SNPs with discrepant genotypes
- Return type
list of dict
References
Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.
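The per-object result of merge can be pictured with a simplified comparison of two rsid-keyed mappings. Everything in this sketch is an assumption made for illustration: the name compare_for_merge, the use of plain dicts and lists instead of dataframes and pandas.Index, the order-insensitive genotype comparison, the skipping of null genotypes, and the rule that exceeding either threshold blocks the merge.

```python
def compare_for_merge(
    existing,
    added,
    discrepant_positions_threshold=100,
    discrepant_genotypes_threshold=500,
):
    """Compare two {rsid: (chrom, pos, genotype)} mappings (sketch)."""
    common = sorted(set(existing) & set(added))

    # Same RSID but different chromosome or position.
    discrepant_positions = [
        r for r in common if existing[r][:2] != added[r][:2]
    ]

    # Same RSID, both genotypes called, but different alleles.
    # Assumption: "AG" and "GA" are treated as the same genotype.
    discrepant_genotypes = [
        r
        for r in common
        if existing[r][2] is not None
        and added[r][2] is not None
        and sorted(existing[r][2]) != sorted(added[r][2])
    ]

    # Assumption: too many discrepancies of either kind blocks the merge.
    merged = (
        len(discrepant_positions) <= discrepant_positions_threshold
        and len(discrepant_genotypes) <= discrepant_genotypes_threshold
    )

    return {
        "merged": merged,
        "common_rsids": common,
        "discrepant_position_rsids": discrepant_positions,
        "discrepant_genotype_rsids": discrepant_genotypes,
    }


existing = {"rs1": ("1", 100, "AG"), "rs2": ("1", 200, "CC"), "rs3": ("2", 50, None)}
added = {"rs1": ("1", 100, "GA"), "rs2": ("1", 250, "CT"), "rs4": ("3", 10, "TT")}
print(compare_for_merge(existing, added))
```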
- notnull(chrom='')[source]
Get not null genotype SNPs.
- Parameters
chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
- Returns
normalized snps dataframe
- Return type
pandas.DataFrame
- property phased
Indicates if genotype is phased.
- Returns
- Return type
bool
- predict_ancestry(output_directory=None, write_predictions=False, models_directory=None, aisnps_directory=None, n_components=None, k=None, thousand_genomes_directory=None, samples_directory=None, algorithm=None, aisnps_set=None)[source]
Predict genetic ancestry for SNPs.
Predictions by ezancestry.
Notes
Populations below are described here.
- Parameters
various (optional) – See the available settings for predict at ezancestry.
- Returns
dict with the following keys:
- population_code (str)
max predicted population for the sample
- population_description (str)
descriptive name of the population
- population_percent (float)
predicted probability for the max predicted population
- superpopulation_code (str)
max predicted super population (continental) for the sample
- superpopulation_description (str)
descriptive name of the super population
- superpopulation_percent (float)
predicted probability for the max predicted super population
- ezancestry_df (pandas.DataFrame)
pandas.DataFrame with the following columns:
- component1, component2, component3
The coordinates of the sample in the dimensionality-reduced component space. Can be used as (x, y, z) coordinates for plotting in a 3D scatter plot.
- predicted_population_population
The max predicted population for the sample.
- ACB, ASW, BEB, CDX, CEU, CHB, CHS, CLM, ESN, FIN, GBR, GIH, GWD, IBS, ITU, JPT, KHV, LWK, MSL, MXL, PEL, PJL, PUR, STU, TSI, YRI
Predicted probabilities for each of the populations. These sum to 1.0.
- predicted_population_superpopulation
The max predicted super population (continental) for the sample.
- AFR, AMR, EAS, EUR, SAS
Predicted probabilities for each of the super populations. These sum to 1.0.
- population_description, superpopulation_name
Descriptive names of the population and super population.
- Return type
dict
- remap(target_assembly, complement_bases=True)[source]
Remap SNP coordinates from one assembly to another.
This method uses the assembly map endpoint of the Ensembl REST API service (via Resources’s EnsemblRestClient) to convert SNP coordinates / positions from one assembly to another. After remapping, the coordinates / positions for the SNPs will be those of the target assembly.
If the SNPs are already mapped relative to the target assembly, remapping will not be performed.
- Parameters
target_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’, 36, 37, 38}) – assembly to remap to
complement_bases (bool) – complement bases when remapping SNPs to the minus strand
- Returns
chromosomes_remapped (list of str) – chromosomes remapped
chromosomes_not_remapped (list of str) – chromosomes not remapped
Notes
An assembly is also known as a “build.” For example:
Assembly NCBI36 = Build 36
Assembly GRCh37 = Build 37
Assembly GRCh38 = Build 38
See https://www.ncbi.nlm.nih.gov/assembly for more information about assemblies and remapping.
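Because target_assembly accepts either an assembly name or a build number, a caller may want to normalize the value before comparing it with the build property. The following is a minimal sketch of that mapping; the names ASSEMBLY_TO_BUILD, BUILD_TO_ASSEMBLY, and normalize_target_assembly are hypothetical and not part of the library.

```python
# Assembly name <-> build number, per the correspondence listed above.
ASSEMBLY_TO_BUILD = {"NCBI36": 36, "GRCh37": 37, "GRCh38": 38}
BUILD_TO_ASSEMBLY = {build: name for name, build in ASSEMBLY_TO_BUILD.items()}


def normalize_target_assembly(target_assembly):
    """Return (assembly_name, build_number), or None if unrecognized."""
    if target_assembly in ASSEMBLY_TO_BUILD:
        return target_assembly, ASSEMBLY_TO_BUILD[target_assembly]
    if target_assembly in BUILD_TO_ASSEMBLY:
        return BUILD_TO_ASSEMBLY[target_assembly], target_assembly
    return None


print(normalize_target_assembly("GRCh38"))  # ('GRCh38', 38)
print(normalize_target_assembly(36))  # ('NCBI36', 36)
```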
References
Ensembl, Assembly Map Endpoint, http://rest.ensembl.org/documentation/info/assembly_map
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
- property sex
Sex derived from SNPs.
- Returns
‘Male’ or ‘Female’ if detected, else empty str
- Return type
str
- property snps
Normalized SNPs.
Notes
Throughout snps, the “normalized snps dataframe” is defined as follows:
Column – Description – pandas dtype
rsid * – SNP ID – object (string)
chrom – Chromosome of SNP – object (string)
pos – Position of SNP (relative to build) – uint32
genotype † – Genotype of SNP – object (string)
* Dataframe index
† Genotype can be null, length 1, or length 2. Specifically, genotype is null if not called or unavailable. Otherwise, for autosomal chromosomes, genotype is two alleles. For the X and Y chromosomes, male genotypes are one allele in the non-PAR regions (assuming deduplicate_XY_chrom). For the MT chromosome, genotypes are one allele (assuming deduplicate_MT_chrom).
- Returns
normalized snps dataframe
- Return type
pandas.DataFrame
- property snps_qc
Normalized SNPs, after quality control.
Any low quality SNPs, identified per identify_low_quality_snps(), are not included in the result.
- Returns
normalized snps dataframe
- Return type
pandas.DataFrame
- property source
Summary of the SNP data source(s).
- Returns
Data source(s) for this SNPs object, separated by “, ”.
- Return type
str
- property summary
Summary of SNPs.
- Returns
summary info if SNPs is valid, else {}
- Return type
dict
- to_csv(filename='', atomic=True, **kwargs)[source]
Output SNPs as comma-separated values.
- Parameters
filename (str or buffer) – filename for file to save or buffer to write to
atomic (bool) – atomically write output to a file on local filesystem
**kwargs – additional parameters to pandas.DataFrame.to_csv
- Returns
path to file in output directory if SNPs were saved, else empty str
- Return type
str
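The atomic option refers to writing output so that readers never observe a partially written file. The source does not describe how the library achieves this, so the sketch below shows one common, generic technique only: write to a temporary file in the destination directory, then rename it over the target. The name atomic_write_text is hypothetical.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path so that the file appears all at once (sketch).

    The temporary file is created in the same directory as the target so
    that os.replace is a same-filesystem rename.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            f.write(text)
        os.replace(tmp_path, path)  # atomic rename over the target
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return path


with tempfile.TemporaryDirectory() as output_dir:
    saved = atomic_write_text(
        os.path.join(output_dir, "example.csv"), "rsid,genotype\nrs1,AA\n"
    )
    print(os.path.basename(saved))  # example.csv
```

This approach only makes sense for files on a local filesystem, which matches the wording of the atomic parameter; it does not apply when writing to an in-memory buffer.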
- to_tsv(filename='', atomic=True, **kwargs)[source]
Output SNPs as tab-separated values.
Note that this results in the same default output as save.
- Parameters
filename (str or buffer) – filename for file to save or buffer to write to
atomic (bool) – atomically write output to a file on local filesystem
**kwargs – additional parameters to pandas.DataFrame.to_csv
- Returns
path to file in output directory if SNPs were saved, else empty str
- Return type
str
- to_vcf(filename='', atomic=True, alt_unavailable='.', chrom_prefix='', qc_only=False, qc_filter=False, **kwargs)[source]
Output SNPs as Variant Call Format.
- Parameters
filename (str or buffer) – filename for file to save or buffer to write to
atomic (bool) – atomically write output to a file on local filesystem
alt_unavailable (str) – representation of ALT allele when ALT is not able to be determined
chrom_prefix (str) – prefix for chromosomes in VCF CHROM column
qc_only (bool) – output only SNPs that pass quality control
qc_filter (bool) – populate FILTER column based on quality control results
**kwargs – additional parameters to pandas.DataFrame.to_csv
- Returns
path to file in output directory if SNPs were saved, else empty str
- Return type
str
Notes
Parameters qc_only and qc_filter, if true, will identify low quality SNPs per identify_low_quality_snps(), if not done already. Moreover, these parameters have no effect if this SNPs object does not map to a cluster per compute_cluster_overlap().
References
The Variant Call Format (VCF) Version 4.2 Specification, 8 Mar 2019, https://samtools.github.io/hts-specs/VCFv4.2.pdf
- property unannotated_vcf
Indicates if VCF file is unannotated.
- Returns
- Return type
bool
- property valid
Determine if SNPs is valid.
SNPs is valid when the input file has been successfully parsed.
- Returns
True if SNPs is valid
- Return type
bool
snps.ensembl
Ensembl REST client.
Notes
Modified from https://github.com/Ensembl/ensembl-rest/wiki/Example-Python-Client.
References
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
snps.io
Classes for reading and writing SNPs.
snps.io.reader
Class for reading SNPs.
- class snps.io.reader.Reader(file='', only_detect_source=False, resources=None, rsids=())[source]
Bases: object
Class for reading and parsing raw data / genotype files.
- __init__(file='', only_detect_source=False, resources=None, rsids=())[source]
Initialize a Reader.
- Parameters
file (str or bytes) – path to file to load or bytes to load
only_detect_source (bool) – only detect the source of the data
resources (Resources) – instance of Resources
rsids (tuple, optional) – rsids to extract if loading a VCF file
- read()[source]
Read and parse a raw data / genotype file.
- Returns
dict with the following items:
- snps (pandas.DataFrame)
dataframe of parsed SNPs
- source (str)
detected source of SNPs
- phased (bool)
flag indicating if SNPs are phased
- Return type
dict
- read_23andme(file, compression, joined=True)[source]
Read and parse 23andMe file.
- Parameters
file (str) – path to file
- Returns
result of read_helper
- Return type
dict
- read_ancestry(file, compression)[source]
Read and parse Ancestry.com file.
- Parameters
file (str) – path to file
- Returns
result of read_helper
- Return type
dict
- read_circledna(file, compression)[source]
Read and parse CircleDNA file.
Notes
This method attempts to read and parse a whole exome file, optionally compressed with gzip or zip. Some assumptions are made throughout this process:
SNPs that are not annotated with an RSID are skipped
Insertions and deletions are skipped
- Parameters
file (str or bytes) – path to file or bytes to load
- Returns
result of read_helper
- Return type
dict
- read_dnaland(file, compression)[source]
Read and parse DNA.land files.
- Parameters
file (str) – path to file
- Returns
result of read_helper
- Return type
dict
- read_ftdna(file, compression)[source]
Read and parse Family Tree DNA (FTDNA) file.
- Parameters
file (str) – path to file
- Returns
result of read_helper
- Return type
dict
- read_ftdna_famfinder(file, compression)[source]
Read and parse Family Tree DNA (FTDNA) “famfinder” file.
- Parameters
file (str) – path to file
- Returns
result of read_helper
- Return type
dict
- read_generic(file, compression, skip=1)[source]
Read and parse generic CSV or TSV file.
Notes
Assumes columns are ‘rsid’, ‘chrom’ / ‘chromosome’, ‘pos’ / ‘position’, and ‘genotype’; values are comma separated; unreported genotypes are indicated by ‘--’; and one header row precedes data. For example:
rsid,chromosome,position,genotype
rs1,1,1,AA
rs2,1,2,CC
rs3,1,3,--
- Parameters
file (str) – path to file
- Returns
result of read_helper
- Return type
dict
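The generic format can be illustrated with a small stand-alone parser built on the standard csv module. This is a sketch, not the library's reader: the name parse_generic_csv is hypothetical, the input is assumed to be comma separated with exactly one header row, and unreported genotypes are assumed to be written as two hyphens and mapped to None.

```python
import csv
import io


def parse_generic_csv(text):
    """Parse generic genotype CSV text into {rsid: (chrom, pos, genotype)}."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the single header row
    snps = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        rsid, chrom, pos, genotype = row
        # Two hyphens mark an unreported genotype; store it as None.
        snps[rsid] = (chrom, int(pos), None if genotype == "--" else genotype)
    return snps


example = (
    "rsid,chromosome,position,genotype\n"
    "rs1,1,1,AA\n"
    "rs2,1,2,CC\n"
    "rs3,1,3,--\n"
)
print(parse_generic_csv(example))
```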
- read_genes_for_good(file, compression)[source]
Read and parse Genes For Good file.
https://genesforgood.sph.umich.edu/readme/readme1.2.txt
- Parameters
file (str) – path to file
- Returns
result of read_helper
- Return type
dict
- read_gsa(data_or_filename, compresion, comments)[source]
Read and parse Illumina Global Screening Array files.
- Parameters
data_or_filename (str or bytes) – either the filename to read from or the bytes data itself
- Returns
result of read_helper
- Return type
dict
- read_helper(source, parser)[source]
Generic method to help read files.
- Parameters
source (str) – name of data source
parser (func) – parsing function, which returns a tuple with the following items:
- 0 (pandas.DataFrame)
dataframe of parsed SNPs (empty if only detecting source)
- 1 (bool), optional
flag indicating if SNPs are phased
- 2 (int), optional
detected build of SNPs
- Returns
dict with the following items:
- snps (pandas.DataFrame)
dataframe of parsed SNPs
- source (str)
detected source of SNPs
- phased (bool)
flag indicating if SNPs are phased
- build (int)
detected build of SNPs
- Return type
dict
References
Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.
- read_livingdna(file, compression)[source]
Read and parse LivingDNA file.
- Parameters
file (str) – path to file
- Returns
result of read_helper
- Return type
dict
- read_mapmygenome(file, compression, header)[source]
Read and parse Mapmygenome file.
- Parameters
file (str) – path to file
- Returns
result of read_helper
- Return type
dict
- read_myheritage(file, compression)[source]
Read and parse MyHeritage file.
- Parameters
file (str) – path to file
- Returns
result of read_helper
- Return type
dict
- read_snps_csv(file, comments, compression)[source]
Read and parse CSV file generated by snps.
https://pypi.org/project/snps/
- Parameters
file (str or buffer) – path to file or buffer to read
comments (str) – comments at beginning of file
- Returns
result of read_helper
- Return type
dict
- read_tellmegen(file, compression)[source]
Read and parse tellmeGen files.
- Parameters
file (str) – path to file
- Returns
result of read_helper
- Return type
dict
- read_vcf(file, compression, provider, rsids=())[source]
Read and parse VCF file.
Notes
This method attempts to read and parse a VCF file or buffer, optionally compressed with gzip. Some assumptions are made throughout this process:
SNPs that are not annotated with an RSID are skipped
If the VCF contains multiple samples, only the first sample is used to lookup the genotype
Insertions and deletions are skipped
If a sample allele is not specified, the genotype is reported as NaN
If a sample allele refers to a REF or ALT allele that is not specified, the genotype is reported as NaN
- Parameters
file (str or bytes) – path to file or bytes to load
rsids (tuple, optional) – rsids to extract if loading a VCF file
- Returns
result of read_helper
- Return type
dict
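The assumptions listed in the notes above can be expressed as a per-record rule. The sketch below handles a single VCF data line that has already been split into its ID, REF, ALT, and first-sample GT fields. The name parse_vcf_record is hypothetical, None stands in for NaN, and treating any ID that does not start with "rs" as unannotated is an assumption.

```python
def parse_vcf_record(rsid, ref, alt, gt):
    """Resolve one VCF record to (rsid, genotype), or None if skipped.

    gt is the GT value of the first sample (e.g., "0/1" or "1|1").
    A genotype of None represents NaN.
    """
    if not rsid.startswith("rs"):
        return None  # not annotated with an RSID: skip

    alts = alt.split(",")
    if len(ref) != 1 or any(len(a) != 1 for a in alts):
        return None  # insertion or deletion: skip

    alleles = [ref] + alts  # index 0 is REF, 1.. are ALT alleles
    genotype = ""
    for index in gt.replace("|", "/").split("/"):
        if index == ".":
            return rsid, None  # sample allele not specified
        i = int(index)
        if i >= len(alleles) or alleles[i] == ".":
            return rsid, None  # refers to an unspecified REF / ALT allele
        genotype += alleles[i]
    return rsid, genotype


print(parse_vcf_record("rs1", "A", "G", "0/1"))  # ('rs1', 'AG')
print(parse_vcf_record("rs2", "A", "G", "./."))  # ('rs2', None)
```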
snps.io.writer
Class for writing SNPs.
- class snps.io.writer.Writer(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]
Bases: object
Class for writing SNPs to files.
- __init__(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]
Initialize a Writer.
- Parameters
snps (SNPs) – SNPs to save to file or write to buffer
filename (str or buffer) – filename for file to save or buffer to write to
vcf (bool) – flag to save file as VCF
atomic (bool) – atomically write output to a file on local filesystem
vcf_alt_unavailable (str) – representation of VCF ALT allele when ALT is not able to be determined
vcf_chrom_prefix (str) – prefix for chromosomes in VCF CHROM column
vcf_qc_only (bool) – for VCF, output only SNPs that pass quality control
vcf_qc_filter (bool) – for VCF, populate VCF FILTER column based on quality control results
**kwargs – additional parameters to pandas.DataFrame.to_csv
snps.resources
Class for downloading and loading required external resources.
References
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. http://dx.doi.org/10.1038/35057062
hg19 (GRCh37): Hiram Clawson, Brooke Rhead, Pauline Fujita, Ann Zweig, Katrina Learned, Donna Karolchik and Robert Kuhn, https://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
- class snps.resources.ReferenceSequence(ID='', url='', path='', assembly='', species='', taxonomy='')[source]
Bases: object
Object used to represent and interact with a reference sequence.
- property ID
Get reference sequence chromosome.
- Returns
- Return type
str
- __init__(ID='', url='', path='', assembly='', species='', taxonomy='')[source]
Initialize a ReferenceSequence object.
- Parameters
ID (str) – reference sequence chromosome
url (str) – url to Ensembl reference sequence
path (str) – path to local reference sequence
assembly (str) – reference sequence assembly (e.g., “GRCh37”)
species (str) – reference sequence species
taxonomy (str) – reference sequence taxonomy
References
The Variant Call Format (VCF) Version 4.2 Specification, 8 Mar 2019, https://samtools.github.io/hts-specs/VCFv4.2.pdf
- property assembly
Get reference sequence assembly.
- Returns
- Return type
str
- property build
Get reference sequence build.
- Returns
e.g., “B37”
- Return type
str
- property chrom
Get reference sequence chromosome.
- Returns
- Return type
str
- property end
Get reference sequence end position (1-based).
- Returns
- Return type
int
- property length
Get reference sequence length.
- Returns
- Return type
int
- property md5
Get reference sequence MD5 hash.
- Returns
- Return type
str
- property path
Get path to local reference sequence.
- Returns
- Return type
str
- property sequence
Get reference sequence.
- Returns
- Return type
np.array(dtype=np.uint8)
- property species
Get reference sequence species.
- Returns
- Return type
str
- property start
Get reference sequence start position (1-based).
- Returns
- Return type
int
- property taxonomy
Get reference sequence taxonomy.
- Returns
- Return type
str
- property url
Get URL to Ensembl reference sequence.
- Returns
- Return type
str
- class snps.resources.Resources(*args, **kwargs)[source]
Bases: object
Object used to manage resources required by snps.
- __init__(resources_dir='resources')[source]
Initialize a Resources object.
- Parameters
resources_dir (str) – name / path of resources directory
- download_example_datasets()[source]
Download example datasets from openSNP.
Per openSNP, “the data is donated into the public domain using CC0 1.0.”
- Returns
paths – paths to example datasets
- Return type
list of str or empty str
References
Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204
- get_all_reference_sequences(**kwargs)[source]
Get Homo sapiens reference sequences for Builds 36, 37, and 38 from Ensembl.
Notes
This function can download over 2.5GB of data.
- Returns
dict of ReferenceSequence, else {}
- Return type
dict
- get_all_resources()[source]
Get / download all resources used throughout snps.
Notes
This function does not download reference sequences or the openSNP datadump, due to their large sizes.
- Returns
dict of resources
- Return type
dict
- get_assembly_mapping_data(source_assembly, target_assembly)[source]
Get assembly mapping data.
- Parameters
source_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – assembly to remap from
target_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – assembly to remap to
- Returns
dict of json assembly mapping data if loading was successful, else {}
- Return type
dict
- get_chip_clusters()[source]
Get resource for identifying deduced genotype / chip array based on chip clusters.
- Returns
- Return type
pandas.DataFrame
References
Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.
- get_dbsnp_151_37_reverse()[source]
Get and load RSIDs that are on the reference reverse (-) strand in dbSNP 151 and lower.
- Returns
- Return type
pandas.DataFrame
References
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1; 29(1):308-11.
Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/
- get_gsa_chrpos()[source]
Get and load GSA chromosome position map.
https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html
- Returns
- Return type
pandas.DataFrame
- get_gsa_resources()[source]
Get resources for reading Global Screening Array files.
https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html
- Returns
- Return type
dict
- get_gsa_rsid()[source]
Get and load GSA RSID map.
https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html
- Returns
- Return type
pandas.DataFrame
- get_low_quality_snps()[source]
Get listing of low quality SNPs for quality control based on chip clusters.
- Returns
- Return type
pandas.DataFrame
References
Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.
- get_opensnp_datadump_filenames()[source]
Get filenames internal to the openSNP datadump zip.
Per openSNP, “the data is donated into the public domain using CC0 1.0.”
Notes
This function can download over 27GB of data. If the download is not successful, try using a different tool like wget or curl to download the file and move it to the resources directory (see _get_path_opensnp_datadump).
- Returns
filenames – filenames internal to the openSNP datadump
- Return type
list of str
References
Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204
- get_reference_sequences(assembly='GRCh37', chroms=('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT'))[source]
Get Homo sapiens reference sequences for chroms of assembly.
Notes
This function can download over 800MB of data for each assembly.
- Parameters
assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – reference sequence assembly
chroms (list of str) – reference sequence chromosomes
- Returns
dict of ReferenceSequence, else {}
- Return type
dict
- load_opensnp_datadump_file(filename)[source]
Load the specified file from the openSNP datadump.
Per openSNP, “the data is donated into the public domain using CC0 1.0.”
- Parameters
filename (str) – filename internal to the openSNP datadump
- Returns
content of specified file internal to the openSNP datadump
- Return type
bytes
References
Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204
snps.utils
Utility classes and functions.
- snps.utils.clean_str(s)[source]
Clean a string so that it can be used as a Python variable name.
- Parameters
s (str) – string to clean
- Returns
string that can be used as a Python variable name
- Return type
str
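One plausible way to clean a string into a variable-name-like form is with two regular-expression passes. The source does not show how clean_str is implemented, so the function below, named clean_str_sketch to make that clear, is an assumption: it drops characters that are not letters, digits, or underscores, then strips leading characters until a letter or underscore is reached.

```python
import re


def clean_str_sketch(s):
    """Reduce a string to characters valid in a Python identifier (sketch)."""
    # Remove every character that is not a letter, digit, or underscore.
    s = re.sub(r"[^0-9a-zA-Z_]", "", s)
    # Remove leading characters until a letter or underscore is found,
    # because an identifier cannot start with a digit.
    s = re.sub(r"^[^a-zA-Z_]+", "", s)
    return s


print(clean_str_sketch("23andMe file!"))  # andMefile
print(clean_str_sketch("my-var 1"))  # myvar1
```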
- snps.utils.create_dir(path)[source]
Create directory specified by path if it doesn’t already exist.
- Parameters
path (str) – path to directory
- Returns
True if path exists
- Return type
bool
- snps.utils.gzip_file(src, dest)[source]
Gzip a file.
- Parameters
src (str) – path to file to gzip
dest (str) – path to output gzip file
- Returns
path to gzipped file
- Return type
str
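Gzipping a file with the standard library can be done by streaming the source into a gzip.open handle. The sketch below mirrors the documented parameters and return value, but it is named gzip_file_sketch because the library's own implementation is not shown in the source.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file_sketch(src, dest):
    """Compress src into dest using gzip and return dest (sketch)."""
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # stream in chunks, not all at once
    return dest


with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, "snps.csv")
    with open(src, "w") as f:
        f.write("rsid,genotype\nrs1,AA\n")
    dest = gzip_file_sketch(src, src + ".gz")
    with gzip.open(dest, "rt") as f:
        print(f.read())
```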
- snps.utils.save_df_as_csv(df, path, filename, comment='', prepend_info=True, atomic=True, **kwargs)[source]
Save dataframe to a CSV file.
- Parameters
df (pandas.DataFrame) – dataframe to save
path (str) – path to directory where to save CSV file
filename (str or buffer) – filename for file to save or buffer to write to
comment (str) – header comment(s); one or more lines starting with ‘#’
prepend_info (bool) – prepend file generation information as comments
atomic (bool) – atomically write output to a file on local filesystem
**kwargs – additional parameters to pandas.DataFrame.to_csv
- Returns
path to saved file or buffer (empty str if error)
- Return type
str or buffer