Code Documentation
SNPs
SNPs reads, writes, merges, and remaps genotype / raw data files.
- class snps.snps.SNPs(file='', only_detect_source=False, assign_par_snps=False, output_dir='output', resources_dir='resources', deduplicate=True, deduplicate_XY_chrom=True, deduplicate_MT_chrom=True, parallelize=False, processes=2, rsids=())[source]
Bases: object
- __init__(file='', only_detect_source=False, assign_par_snps=False, output_dir='output', resources_dir='resources', deduplicate=True, deduplicate_XY_chrom=True, deduplicate_MT_chrom=True, parallelize=False, processes=2, rsids=())[source]
Object used to read, write, and remap genotype / raw data files.
- Parameters:
file (str or bytes) – path to file to load or bytes to load
only_detect_source (bool) – only detect the source of the data
assign_par_snps (bool) – assign PAR SNPs to the X and Y chromosomes
output_dir (str) – path to output directory
resources_dir (str) – name / path of resources directory
deduplicate (bool) – deduplicate RSIDs and make SNPs available as SNPs.duplicate
deduplicate_MT_chrom (bool) – deduplicate alleles on MT; see SNPs.heterozygous_MT
deduplicate_XY_chrom (bool or str) – deduplicate alleles in the non-PAR regions of X and Y for males; see SNPs.discrepant_XY; if a str, then this is the sex determination method to use: X, Y, or XY
parallelize (bool) – utilize multiprocessing to speedup calculations
processes (int) – processes to launch if multiprocessing
rsids (tuple, optional) – rsids to extract if loading a VCF file
- property assembly
Assembly of SNPs.
- Return type:
str
- property build
Build of SNPs.
- Return type:
int
- property build_detected
Status indicating if build of SNPs was detected.
- Return type:
bool
- property build_original
Original build of SNPs, before any remapping.
- Return type:
int
- property chip
Detected deduced genotype / chip array, if any, per compute_cluster_overlap.
- Returns:
detected chip array, else empty str
- Return type:
str
- property chip_version
Detected genotype / chip array version, if any, per compute_cluster_overlap.
Notes
Chip array version is only applicable to 23andMe (v3, v4, v5) and AncestryDNA (v1, v2) files.
- Returns:
detected chip array version, e.g., ‘v4’, else empty str
- Return type:
str
- property chromosomes
Chromosomes of SNPs.
- Returns:
list of str chromosomes (e.g., [‘1’, ‘2’, ‘3’, ‘MT’]), empty list if no chromosomes
- Return type:
list
- property chromosomes_summary
Summary of the chromosomes of SNPs.
- Returns:
human-readable listing of chromosomes (e.g., ‘1-3, MT’), empty str if no chromosomes
- Return type:
str
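As an illustration of the human-readable listing described above, the following minimal sketch collapses consecutive numeric chromosomes into ranges and appends the named ones. The function name `summarize_chromosomes` is hypothetical, and this is not necessarily how the library builds the summary.

```python
def summarize_chromosomes(chromosomes):
    """Collapse chromosome names into a short summary such as '1-3, MT'."""
    # Numeric chromosomes can form ranges; named ones (X, Y, MT) are listed as-is.
    numeric = sorted(int(c) for c in chromosomes if c.isdigit())
    named = [c for c in chromosomes if not c.isdigit()]

    parts = []
    start = prev = None
    for n in numeric:
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n  # extend the current run
        else:
            # close the current run and start a new one
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = n
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")

    return ", ".join(parts + named)
```

For example, `summarize_chromosomes(["1", "2", "3", "MT"])` yields the string shown in the description, and an empty list yields an empty str.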
- property cluster
Detected chip cluster, if any, per compute_cluster_overlap.
Notes
Refer to compute_cluster_overlap for more details about chip clusters.
- Returns:
detected chip cluster, e.g., ‘c1’, else empty str
- Return type:
str
- compute_cluster_overlap(cluster_overlap_threshold=0.95)[source]
Compute overlap with chip clusters.
Chip clusters, which are defined in [1], are associated with deduced genotype / chip arrays and DTC companies.
This method also sets the values returned by the cluster, chip, and chip_version properties, based on max overlap, if the specified threshold is satisfied.
- Parameters:
cluster_overlap_threshold (float) – threshold for cluster to overlap this SNPs object, and vice versa, to set values returned by the cluster, chip, and chip_version properties
- Returns:
pandas.DataFrame with the following columns:
- company_composition
DTC company composition of associated cluster from [1]
- chip_base_deduced
deduced genotype / chip array of associated cluster from [1]
- snps_in_cluster
count of SNPs in cluster
- snps_in_common
count of SNPs in common with cluster (inner merge with cluster)
- overlap_with_cluster
percentage overlap of snps_in_common with cluster
- overlap_with_self
percentage overlap of snps_in_common with this SNPs object
- Return type:
pandas.DataFrame
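The overlap columns described above can be sketched with plain sets. This is only an illustration under stated assumptions: SNPs are compared by an opaque key (the library may compare on other fields), overlaps are expressed as fractions between 0 and 1 to match the default threshold of 0.95, and a strict "greater than" comparison is assumed. The function name `cluster_overlap` is hypothetical.

```python
def cluster_overlap(self_keys, cluster_keys, threshold=0.95):
    """Compute overlap statistics between this object's SNPs and one cluster."""
    self_set, cluster_set = set(self_keys), set(cluster_keys)
    common = self_set & cluster_set  # analogous to an inner merge

    overlap_with_cluster = len(common) / len(cluster_set) if cluster_set else 0.0
    overlap_with_self = len(common) / len(self_set) if self_set else 0.0

    return {
        "snps_in_cluster": len(cluster_set),
        "snps_in_common": len(common),
        "overlap_with_cluster": overlap_with_cluster,
        "overlap_with_self": overlap_with_self,
        # Assumption: the overlap must exceed the threshold in both directions.
        "satisfies_threshold": overlap_with_cluster > threshold
        and overlap_with_self > threshold,
    }
```

Requiring the overlap in both directions means a small file that happens to sit entirely inside a large cluster does not count as a match.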
References
- property count
Count of SNPs.
- Return type:
int
- detect_build()[source]
Detect build of SNPs.
Use the coordinates of common SNPs to identify the build / assembly of a genotype file that is being loaded.
Notes
rs3094315 : plus strand in 36, 37, and 38
rs11928389 : plus strand in 36, minus strand in 37 and 38
rs2500347 : plus strand in 36 and 37, minus strand in 38
rs964481 : plus strand in 36, 37, and 38
rs2341354 : plus strand in 36, 37, and 38
rs3850290 : plus strand in 36, 37, and 38
rs1329546 : plus strand in 36, 37, and 38
- Returns:
detected build of SNPs, else 0
- Return type:
int
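The idea of identifying a build from the coordinates of common SNPs can be sketched as a table lookup. The rsids and positions below are made up for illustration (they are not real dbSNP coordinates), and `detect_build_sketch` is a hypothetical name rather than the library's implementation.

```python
# HYPOTHETICAL reference table: rsid -> {build: position}. Values are invented.
HYPOTHETICAL_REFERENCE = {
    "rs_example_1": {36: 1000, 37: 1100, 38: 1200},
    "rs_example_2": {36: 2000, 37: 2050, 38: 2075},
}


def detect_build_sketch(snp_positions, reference_positions):
    """Return the first build whose known position matches a loaded SNP, else 0."""
    for rsid, by_build in reference_positions.items():
        pos = snp_positions.get(rsid)
        if pos is None:
            continue  # this reference SNP is not in the file
        for build, ref_pos in by_build.items():
            if pos == ref_pos:
                return build
    return 0  # build could not be detected
```

Checking several reference SNPs makes detection robust to files that lack any single one of them.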
References
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1;29(1):308-11.
Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. dbSNP accession: rs3094315, rs11928389, rs2500347, rs964481, rs2341354, rs3850290, and rs1329546 (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/
- determine_sex(heterozygous_x_snps_threshold=0.03, y_snps_not_null_threshold=0.3, chrom='X')[source]
Determine sex from SNPs using thresholds.
- Parameters:
heterozygous_x_snps_threshold (float) – percentage heterozygous X SNPs; above this threshold, Female is determined
y_snps_not_null_threshold (float) – percentage Y SNPs that are not null; above this threshold, Male is determined
chrom ({“X”, “Y”}) – use X or Y chromosome SNPs to determine sex
- Returns:
‘Male’ or ‘Female’ if detected, else empty str
- Return type:
str
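The two thresholds above can be read as follows; this is a minimal sketch, assuming genotypes are given as strings (or None when not called) and that the comparisons are strict. The function names are hypothetical and the library's handling of edge cases may differ.

```python
def determine_sex_from_x(x_genotypes, heterozygous_x_snps_threshold=0.03):
    """Female if the fraction of heterozygous X SNPs exceeds the threshold."""
    if not x_genotypes:
        return ""  # nothing to base a determination on
    heterozygous = sum(
        1 for g in x_genotypes if g is not None and len(g) == 2 and g[0] != g[1]
    )
    if heterozygous / len(x_genotypes) > heterozygous_x_snps_threshold:
        return "Female"
    return "Male"


def determine_sex_from_y(y_genotypes, y_snps_not_null_threshold=0.3):
    """Male if the fraction of non-null Y SNPs exceeds the threshold."""
    if not y_genotypes:
        return ""
    not_null = sum(1 for g in y_genotypes if g is not None)
    if not_null / len(y_genotypes) > y_snps_not_null_threshold:
        return "Male"
    return "Female"
```

The X-based rule relies on males carrying a single X chromosome, so heterozygous calls outside the PAR regions should be rare for them.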
- property discrepant_XY
Discrepant XY SNPs.
A discrepant XY SNP is a heterozygous SNP in the non-PAR region of the X or Y chromosome found during deduplication for a detected male genotype.
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property discrepant_merge_genotypes
SNPs with discrepant genotypes discovered while merging SNPs.
Notes
Definitions of columns in this dataframe are as follows:
Column          Description
rsid            SNP ID
chrom           Chromosome of existing SNP
pos             Position of existing SNP
genotype        Genotype of existing SNP
chrom_added     Chromosome of added SNP
pos_added       Position of added SNP
genotype_added  Genotype of added SNP (discrepant with genotype)
- Return type:
pandas.DataFrame
- property discrepant_merge_positions
SNPs with discrepant positions discovered while merging SNPs.
Notes
Definitions of columns in this dataframe are as follows:
Column          Description
rsid            SNP ID
chrom           Chromosome of existing SNP
pos             Position of existing SNP
genotype        Genotype of existing SNP
chrom_added     Chromosome of added SNP
pos_added       Position of added SNP (discrepant with pos)
genotype_added  Genotype of added SNP
- Return type:
pandas.DataFrame
- property discrepant_merge_positions_genotypes
SNPs with discrepant positions and / or genotypes discovered while merging SNPs.
Notes
Definitions of columns in this dataframe are as follows:
Column          Description
rsid            SNP ID
chrom           Chromosome of existing SNP
pos             Position of existing SNP
genotype        Genotype of existing SNP
chrom_added     Chromosome of added SNP
pos_added       Position of added SNP (possibly discrepant with pos)
genotype_added  Genotype of added SNP (possibly discrepant with genotype)
- Return type:
pandas.DataFrame
- property discrepant_vcf_position
SNPs with discrepant positions discovered while saving VCF.
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property duplicate
Duplicate SNPs.
A duplicate SNP has the same RSID as another SNP. The first occurrence of the RSID is not considered a duplicate SNP.
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
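The definition of a duplicate SNP given above (every occurrence of an RSID after the first) can be sketched as a single pass over the rows. The row layout `(rsid, chrom, pos, genotype)` and the name `split_duplicates` are assumptions made for this illustration.

```python
def split_duplicates(rows):
    """Split rows into (unique, duplicates), keeping the first row per RSID."""
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        rsid = row[0]
        if rsid in seen:
            duplicates.append(row)  # a later occurrence of a known RSID
        else:
            seen.add(rsid)
            unique.append(row)  # first occurrence is never a duplicate
    return unique, duplicates
```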
- get_count(chrom='')[source]
Count of SNPs.
- Parameters:
chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
- Return type:
int
- static get_par_regions(build)[source]
Get PAR regions for the X and Y chromosomes.
- Parameters:
build (int) – build of SNPs
- Returns:
PAR regions for the given build
- Return type:
pandas.DataFrame
References
Genome Reference Consortium, https://www.ncbi.nlm.nih.gov/grc/human
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
- heterozygous(chrom='')[source]
Get heterozygous SNPs.
- Parameters:
chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property heterozygous_MT
Heterozygous SNPs on the MT chromosome found during deduplication.
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- homozygous(chrom='')[source]
Get homozygous SNPs.
- Parameters:
chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
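The distinction between heterozygous and homozygous SNPs can be sketched on plain genotype strings. This sketch assumes that only two-allele genotypes are classified, so null and single-allele genotypes fall into neither group; the function names are hypothetical.

```python
def is_heterozygous(genotype):
    """True for a two-allele genotype whose alleles differ, e.g. 'AG'."""
    return genotype is not None and len(genotype) == 2 and genotype[0] != genotype[1]


def is_homozygous(genotype):
    """True for a two-allele genotype whose alleles match, e.g. 'AA'."""
    return genotype is not None and len(genotype) == 2 and genotype[0] == genotype[1]


def filter_snps(snps, predicate, chrom=""):
    """Select rsids whose genotype satisfies predicate, optionally on one chromosome.

    snps maps rsid -> (chrom, pos, genotype).
    """
    return [
        rsid
        for rsid, (c, _pos, genotype) in snps.items()
        if (not chrom or c == chrom) and predicate(genotype)
    ]
```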
- identify_low_quality_snps()[source]
Identify low quality SNPs based on chip clusters.
Any low quality SNPs are removed from the snps_qc dataframe and are made available as low_quality.
Notes
Chip clusters, which are defined in [1], are associated with low quality SNPs. As such, low quality SNPs will only be identified when this SNPs object corresponds to a cluster per compute_cluster_overlap().
- property low_quality
SNPs identified as low quality, if any, per identify_low_quality_snps().
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- merge(snps_objects=(), discrepant_positions_threshold=100, discrepant_genotypes_threshold=500, remap=True, chrom='')[source]
Merge other SNPs objects into this SNPs object.
- Parameters:
snps_objects (list or tuple of SNPs) – other SNPs objects to merge into this SNPs object
discrepant_positions_threshold (int) – threshold for discrepant SNP positions between existing data and data to be loaded; a large value could indicate mismatched genome assemblies
discrepant_genotypes_threshold (int) – threshold for discrepant genotype data between existing data and data to be loaded; a large value could indicate mismatched individuals
remap (bool) – if necessary, remap other SNPs objects to have the same build as this SNPs object before merging
chrom (str, optional) – chromosome to merge (e.g., “1”, “Y”, “MT”)
- Returns:
for each SNPs object to merge, a dict with the following items:
- merged (bool)
whether SNPs object was merged
- common_rsids (pandas.Index)
SNPs in common
- discrepant_position_rsids (pandas.Index)
SNPs with discrepant positions
- discrepant_genotype_rsids (pandas.Index)
SNPs with discrepant genotypes
- Return type:
list of dict
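How discrepancies might be detected and compared against the two thresholds is sketched below. Several details are assumptions made for illustration: SNPs are held in dicts mapping rsid to `(chrom, pos, genotype)`, genotypes are compared without regard to allele order, null genotypes are never counted as discrepant, and a merge proceeds only while each count stays below its threshold. The function names are hypothetical.

```python
def find_discrepancies(existing, added):
    """Return (common, discrepant_positions, discrepant_genotypes) as rsid lists."""
    common = sorted(existing.keys() & added.keys())

    # Position discrepancy: chromosome or position differs for the same rsid.
    discrepant_positions = [r for r in common if existing[r][:2] != added[r][:2]]

    # Genotype discrepancy: both genotypes are called and differ,
    # ignoring allele order (assumption: 'AG' and 'GA' are equivalent).
    discrepant_genotypes = [
        r
        for r in common
        if existing[r][2] is not None
        and added[r][2] is not None
        and sorted(existing[r][2]) != sorted(added[r][2])
    ]
    return common, discrepant_positions, discrepant_genotypes


def should_merge(
    discrepant_positions,
    discrepant_genotypes,
    discrepant_positions_threshold=100,
    discrepant_genotypes_threshold=500,
):
    """Assumption: merging is allowed while both counts are below their thresholds."""
    return (
        len(discrepant_positions) < discrepant_positions_threshold
        and len(discrepant_genotypes) < discrepant_genotypes_threshold
    )
```

Many discrepant positions suggest the two files use different genome assemblies, while many discrepant genotypes suggest they come from different individuals.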
References
Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.
- notnull(chrom='')[source]
Get not null genotype SNPs.
- Parameters:
chrom (str, optional) – chromosome (e.g., “1”, “X”, “MT”)
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property phased
Indicates if genotype is phased.
- Return type:
bool
- predict_ancestry(output_directory=None, write_predictions=False, models_directory=None, aisnps_directory=None, aisnps_set=None)[source]
Predict genetic ancestry for SNPs.
Predictions by ezancestry.
Notes
Populations below are described here.
- Parameters:
various (optional) – See the available settings for predict at ezancestry.
- Returns:
dict with the following keys:
- population_code (str)
max predicted population for the sample
- population_percent (float)
predicted probability for the max predicted population
- superpopulation_code (str)
max predicted super population (continental) for the sample
- superpopulation_percent (float)
predicted probability for the max predicted super population
- ezancestry_df (pandas.DataFrame)
pandas.DataFrame with the following columns:
- component1, component2, component3
The coordinates of the sample in the dimensionality-reduced component space. Can be used as (x, y, z,) coordinates for plotting in a 3d scatter plot.
- predicted_ancestry_population
The max predicted population for the sample.
- ACB, ASW, BEB, CDX, CEU, CHB, CHS, CLM, ESN, FIN, GBR, GIH, GWD, IBS, ITU, JPT, KHV, LWK, MSL, MXL, PEL, PJL, PUR, STU, TSI, YRI
Predicted probabilities for each of the populations. These sum to 1.0.
- predicted_ancestry_superpopulation
The max predicted super population (continental) for the sample.
- AFR, AMR, EAS, EUR, SAS
Predicted probabilities for each of the super populations. These sum to 1.0.
- Return type:
dict
- remap(target_assembly, complement_bases=True)[source]
Remap SNP coordinates from one assembly to another.
This method uses the assembly map endpoint of the Ensembl REST API service (via Resources’s EnsemblRestClient) to convert SNP coordinates / positions from one assembly to another. After remapping, the coordinates / positions for the SNPs will be that of the target assembly.
If the SNPs are already mapped relative to the target assembly, remapping will not be performed.
- Parameters:
target_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’, 36, 37, 38}) – assembly to remap to
complement_bases (bool) – complement bases when remapping SNPs to the minus strand
- Returns:
chromosomes_remapped (list of str) – chromosomes remapped
chromosomes_not_remapped (list of str) – chromosomes not remapped
Notes
An assembly is also known as a “build.” For example:
Assembly NCBI36 = Build 36
Assembly GRCh37 = Build 37
Assembly GRCh38 = Build 38
See https://www.ncbi.nlm.nih.gov/assembly for more information about assemblies and remapping.
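Two pieces of the remapping description lend themselves to a short sketch: accepting either an assembly name or a build number for the target, and complementing bases for SNPs that land on the minus strand. The names below are hypothetical, and raising `ValueError` for an unknown target is a choice made for this sketch rather than documented library behavior.

```python
ASSEMBLY_TO_BUILD = {"NCBI36": 36, "GRCh37": 37, "GRCh38": 38}


def normalize_target_assembly(target_assembly):
    """Accept 'GRCh37' or 37 and return the assembly name."""
    if target_assembly in ASSEMBLY_TO_BUILD:
        return target_assembly
    for name, build in ASSEMBLY_TO_BUILD.items():
        if target_assembly == build:
            return name
    raise ValueError(f"unknown assembly: {target_assembly!r}")


# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement_genotype(genotype):
    """Complement each base of a genotype; null genotypes stay null."""
    if genotype is None:
        return None
    return genotype.translate(COMPLEMENT)
```

Complementing matters because a genotype reported relative to the plus strand in one assembly reads as its complement if the SNP maps to the minus strand in another.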
References
Ensembl, Assembly Map Endpoint, http://rest.ensembl.org/documentation/info/assembly_map
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
- property sex
Sex derived from SNPs.
- Returns:
‘Male’ or ‘Female’ if detected, else empty str
- Return type:
str
- property snps
Normalized SNPs.
Notes
Throughout snps, the “normalized snps dataframe” is defined as follows:
Column        Description                          pandas dtype
rsid [*]      SNP ID                               object (string)
chrom         Chromosome of SNP                    object (string)
pos           Position of SNP (relative to build)  uint32
genotype [†]  Genotype of SNP                      object (string)
[*] Dataframe index
[†] Genotype can be null, length 1, or length 2. Specifically, genotype is null if not called or unavailable. Otherwise, for autosomal chromosomes, genotype is two alleles. For the X and Y chromosomes, male genotypes are one allele in the non-PAR regions (assuming deduplicate_XY_chrom). For the MT chromosome, genotypes are one allele (assuming deduplicate_MT_chrom).
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property snps_qc
Normalized SNPs, after quality control.
Any low quality SNPs, identified per identify_low_quality_snps(), are not included in the result.
- Returns:
normalized snps dataframe
- Return type:
pandas.DataFrame
- property source
Summary of the SNP data source(s).
- Returns:
Data source(s) for this SNPs object, separated by “, ”.
- Return type:
str
- property summary
Summary of SNPs.
- Returns:
summary info if SNPs is valid, else {}
- Return type:
dict
- to_csv(filename='', atomic=True, **kwargs)[source]
Output SNPs as comma-separated values.
- Parameters:
filename (str or buffer) – filename for file to save or buffer to write to
atomic (bool) – atomically write output to a file on local filesystem
**kwargs – additional parameters to pandas.DataFrame.to_csv
- Returns:
path to file in output directory if SNPs were saved, else empty str
- Return type:
str
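The atomic option can be understood through a common pattern: write to a temporary file in the destination directory, then rename it over the target so readers never observe a partially written file. The sketch below shows that pattern with the standard library only; it is not necessarily how the library implements atomic writes, and `atomic_write_text` is a hypothetical name.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path so that the file appears all at once."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # do not leave a stray temporary file behind
        raise
    return path
```

The rename is only atomic within a single filesystem, which fits the parameter description's restriction to the local filesystem.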
- to_tsv(filename='', atomic=True, **kwargs)[source]
Output SNPs as tab-separated values.
Note that this results in the same default output as save.
- Parameters:
filename (str or buffer) – filename for file to save or buffer to write to
atomic (bool) – atomically write output to a file on local filesystem
**kwargs – additional parameters to pandas.DataFrame.to_csv
- Returns:
path to file in output directory if SNPs were saved, else empty str
- Return type:
str
- to_vcf(filename='', atomic=True, alt_unavailable='.', chrom_prefix='', qc_only=False, qc_filter=False, **kwargs)[source]
Output SNPs as Variant Call Format.
- Parameters:
filename (str or buffer) – filename for file to save or buffer to write to
atomic (bool) – atomically write output to a file on local filesystem
alt_unavailable (str) – representation of ALT allele when ALT is not able to be determined
chrom_prefix (str) – prefix for chromosomes in VCF CHROM column
qc_only (bool) – output only SNPs that pass quality control
qc_filter (bool) – populate FILTER column based on quality control results
**kwargs – additional parameters to pandas.DataFrame.to_csv
- Returns:
path to file in output directory if SNPs were saved, else empty str
- Return type:
str
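To make the roles of alt_unavailable and chrom_prefix concrete, the sketch below builds one VCF data line from a reference allele and an unphased genotype. It is an illustration only: the function names are hypothetical, the QUAL, FILTER, and INFO columns are left as missing values, and the library's actual output may differ.

```python
def vcf_alt_and_gt(ref, genotype, alt_unavailable="."):
    """Return (ALT, GT) for a genotype such as 'AG' given the REF allele."""
    if genotype is None:
        return alt_unavailable, "./."  # no call: ALT cannot be determined
    alts = []
    for allele in genotype:
        if allele != ref and allele not in alts:
            alts.append(allele)
    alleles = [ref] + alts  # index 0 is REF, 1.. are ALT alleles
    gt = "/".join(str(alleles.index(a)) for a in genotype)
    alt = ",".join(alts) if alts else alt_unavailable
    return alt, gt


def vcf_record(chrom, pos, rsid, ref, genotype, alt_unavailable=".", chrom_prefix=""):
    """Build one tab-separated VCF data line for a single sample."""
    alt, gt = vcf_alt_and_gt(ref, genotype, alt_unavailable)
    # Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
    fields = [f"{chrom_prefix}{chrom}", str(pos), rsid, ref, alt, ".", ".", ".", "GT", gt]
    return "\t".join(fields)
```

A homozygous-reference genotype such as 'AA' with REF 'A' carries no information about the alternate allele, which is the case alt_unavailable covers.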
Notes
Parameters qc_only and qc_filter, if true, will identify low quality SNPs per identify_low_quality_snps(), if not done already. Moreover, these parameters have no effect if this SNPs object does not map to a cluster per compute_cluster_overlap().
References
The Variant Call Format (VCF) Version 4.3 Specification, 27 Nov 2022, https://samtools.github.io/hts-specs/VCFv4.3.pdf
- property unannotated_vcf
Indicates if VCF file is unannotated.
- Return type:
bool
- property valid
Determine if SNPs is valid.
SNPs is valid when the input file has been successfully parsed.
- Returns:
True if SNPs is valid
- Return type:
bool
snps.ensembl
Ensembl REST client.
Notes
Modified from https://github.com/Ensembl/ensembl-rest/wiki/Example-Python-Client.
References
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
snps.io
Classes for reading and writing SNPs.
snps.io.reader
Class for reading SNPs.
- class snps.io.reader.Reader(file='', only_detect_source=False, resources=None, rsids=())[source]
Bases: object
Class for reading and parsing raw data / genotype files.
- __init__(file='', only_detect_source=False, resources=None, rsids=())[source]
Initialize a Reader.
- Parameters:
file (str or bytes) – path to file to load or bytes to load
only_detect_source (bool) – only detect the source of the data
resources (Resources) – instance of Resources
rsids (tuple, optional) – rsids to extract if loading a VCF file
- read()[source]
Read and parse a raw data / genotype file.
- Returns:
dict with the following items:
- snps (pandas.DataFrame)
dataframe of parsed SNPs
- source (str)
detected source of SNPs
- phased (bool)
flag indicating if SNPs are phased
- Return type:
dict
- read_23andme(file, compression, joined=True)[source]
Read and parse 23andMe file.
- Parameters:
file (str) – path to file
- Returns:
result of read_helper
- Return type:
dict
- read_ancestry(file, compression)[source]
Read and parse Ancestry.com file.
- Parameters:
file (str) – path to file
- Returns:
result of read_helper
- Return type:
dict
- read_circledna(file, compression)[source]
Read and parse CircleDNA file.
Notes
This method attempts to read and parse a whole exome file, optionally compressed with gzip or zip. Some assumptions are made throughout this process:
SNPs that are not annotated with an RSID are skipped
Insertions and deletions are skipped
- Parameters:
file (str or bytes) – path to file or bytes to load
- Returns:
result of read_helper
- Return type:
dict
- read_dnaland(file, compression)[source]
Read and parse DNA.land files.
- Parameters:
data (str) – data string
- Returns:
result of read_helper
- Return type:
dict
- read_ftdna(file, compression)[source]
Read and parse Family Tree DNA (FTDNA) file.
- Parameters:
file (str) – path to file
- Returns:
result of read_helper
- Return type:
dict
- read_ftdna_famfinder(file, compression)[source]
Read and parse Family Tree DNA (FTDNA) “famfinder” file.
- Parameters:
file (str) – path to file
- Returns:
result of read_helper
- Return type:
dict
- read_generic(file, compression, skip=1)[source]
Read and parse generic CSV or TSV file.
Notes
Assumes columns are ‘rsid’, ‘chrom’ / ‘chromosome’, ‘pos’ / ‘position’, and ‘genotype’; values are comma separated; unreported genotypes are indicated by ‘--’; and one header row precedes data. For example:
rsid,chromosome,position,genotype
rs1,1,1,AA
rs2,1,2,CC
rs3,1,3,--
- Parameters:
file (str) – path to file
- Returns:
result of read_helper
- Return type:
dict
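The generic layout described in the notes above can be parsed with the standard csv module, as in this minimal sketch. It assumes that the unreported-genotype marker is two hyphens and that the columns appear in the order rsid, chromosome, position, genotype; `parse_generic_csv` is a hypothetical name and returns a plain dict rather than a dataframe.

```python
import csv
import io


def parse_generic_csv(text):
    """Parse generic comma-separated SNP data into rsid -> (chrom, pos, genotype)."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the single header row
    snps = {}
    for row in reader:
        if len(row) < 4:
            continue  # ignore blank or malformed lines
        rsid, chrom, pos, genotype = row[:4]
        # Assumption: '--' marks an unreported genotype, stored here as None.
        snps[rsid] = (chrom, int(pos), None if genotype == "--" else genotype)
    return snps
```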
- read_genes_for_good(file, compression)[source]
Read and parse Genes For Good file.
https://genesforgood.sph.umich.edu/readme/readme1.2.txt
- Parameters:
file (str) – path to file
- Returns:
result of read_helper
- Return type:
dict
- read_gsa(data_or_filename, compresion, comments)[source]
Read and parse Illumina Global Screening Array files.
- Parameters:
data_or_filename (str or bytes) – either the filename to read from or the bytes data itself
- Returns:
result of read_helper
- Return type:
dict
- read_helper(source, parser)[source]
Generic method to help read files.
- Parameters:
source (str) – name of data source
parser (func) – parsing function, which returns a tuple with the following items:
- 0 (pandas.DataFrame)
dataframe of parsed SNPs (empty if only detecting source)
- 1 (bool), optional
flag indicating if SNPs are phased
- 2 (int), optional
detected build of SNPs
- Returns:
dict with the following items:
- snps (pandas.DataFrame)
dataframe of parsed SNPs
- source (str)
detected source of SNPs
- phased (bool)
flag indicating if SNPs are phased
- build (int)
detected build of SNPs
- Return type:
dict
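The contract between a parser and the helper (a tuple whose second and third items are optional) can be sketched as follows. The defaults used when the optional items are absent (`False` for phased, `0` for build) are assumptions, and `read_helper_sketch` is a hypothetical stand-in, not the library's method.

```python
def read_helper_sketch(source, parser):
    """Call parser and normalize its 1- to 3-item tuple into a result dict."""
    result = parser()
    snps = result[0]
    # Optional items fall back to assumed defaults when the parser omits them.
    phased = result[1] if len(result) > 1 else False
    build = result[2] if len(result) > 2 else 0
    return {"snps": snps, "source": source, "phased": phased, "build": build}
```

Keeping the optional items at the end of the tuple lets simple parsers return only the parsed SNPs while richer formats also report phasing and build.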
References
Fluent Python by Luciano Ramalho (O’Reilly). Copyright 2015 Luciano Ramalho, 978-1-491-94600-8.
- read_livingdna(file, compression)[source]
Read and parse LivingDNA file.
- Parameters:
file (str) – path to file
- Returns:
result of read_helper
- Return type:
dict
- read_mapmygenome(file, compression, header)[source]
Read and parse Mapmygenome file.
- Parameters:
file (str) – path to file
- Returns:
result of read_helper
- Return type:
dict
- read_myheritage(file, compression)[source]
Read and parse MyHeritage file.
- Parameters:
file (str) – path to file
- Returns:
result of read_helper
- Return type:
dict
- read_plink(file, compression)[source]
Read and parse plink file.
- Parameters:
file (str) – path to file
- Returns:
result of read_helper
- Return type:
dict
- read_sano_dtc(file, compression)[source]
Read and parse Sano Genetics DTC file.
- Parameters:
file (str) – path to file
- Returns:
result of read_helper
- Return type:
dict
- read_snps_csv(file, comments, compression)[source]
Read and parse CSV file generated by snps.
https://pypi.org/project/snps/
- Parameters:
file (str or buffer) – path to file or buffer to read
comments (str) – comments at beginning of file
- Returns:
result of read_helper
- Return type:
dict
- read_tellmegen(file, compression)[source]
Read and parse tellmeGen files.
- Parameters:
data (str) – data string
- Returns:
result of read_helper
- Return type:
dict
- read_vcf(file, compression, provider, rsids=(), comments='')[source]
Read and parse VCF file.
Notes
This method attempts to read and parse a VCF file or buffer, optionally compressed with gzip. Some assumptions are made throughout this process:
SNPs that are not annotated with an RSID are skipped
If the VCF contains multiple samples, only the first sample is used to lookup the genotype
Precise insertions and deletions are skipped
If a sample allele is not specified, the genotype is reported as NaN
If a sample allele refers to a REF or ALT allele that is not specified, the genotype is reported as NaN
- Parameters:
file (str or bytes) – path to file or bytes to load
rsids (tuple, optional) – rsids to extract if loading a VCF file
- Returns:
result of read_helper
- Return type:
dict
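The assumptions listed in the notes above can be illustrated on a single VCF data line. This is a simplified sketch, not the library's parser: it uses None where the notes say NaN, treats any allele longer than one base as an insertion or deletion, and requires a GT entry in the FORMAT column. The name `parse_vcf_line` is hypothetical.

```python
def parse_vcf_line(line, rsids=()):
    """Return (rsid, chrom, pos, genotype, phased) or None if the line is skipped."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return None  # no FORMAT / sample columns to read a genotype from
    chrom, pos, rsid, ref, alt = fields[:5]
    if not rsid.startswith("rs"):
        return None  # SNPs without an RSID are skipped
    if rsids and rsid not in rsids:
        return None  # only extract the requested rsids

    alts = [] if alt == "." else alt.split(",")
    alleles = [ref] + alts
    if any(len(a) != 1 for a in alleles):
        return None  # insertions and deletions are skipped

    fmt = fields[8].split(":")
    if "GT" not in fmt:
        return None
    gt = fields[9].split(":")[fmt.index("GT")]  # first sample only
    phased = "|" in gt

    genotype = ""
    for idx in gt.replace("|", "/").split("/"):
        # Unspecified allele, or an index with no matching REF / ALT allele.
        if not idx.isdigit() or int(idx) >= len(alleles):
            genotype = None
            break
        genotype += alleles[int(idx)]
    return rsid, chrom, int(pos), genotype, phased
```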
snps.io.writer
Class for writing SNPs.
- class snps.io.writer.Writer(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]
Bases: object
Class for writing SNPs to files.
- __init__(snps=None, filename='', vcf=False, atomic=True, vcf_alt_unavailable='.', vcf_chrom_prefix='', vcf_qc_only=False, vcf_qc_filter=False, **kwargs)[source]
Initialize a Writer.
- Parameters:
snps (SNPs) – SNPs to save to file or write to buffer
filename (str or buffer) – filename for file to save or buffer to write to
vcf (bool) – flag to save file as VCF
atomic (bool) – atomically write output to a file on local filesystem
vcf_alt_unavailable (str) – representation of VCF ALT allele when ALT is not able to be determined
vcf_chrom_prefix (str) – prefix for chromosomes in VCF CHROM column
vcf_qc_only (bool) – for VCF, output only SNPs that pass quality control
vcf_qc_filter (bool) – for VCF, populate VCF FILTER column based on quality control results
**kwargs – additional parameters to pandas.DataFrame.to_csv
snps.resources
Class for downloading and loading required external resources.
References
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. http://dx.doi.org/10.1038/35057062
hg19 (GRCh37): Hiram Clawson, Brooke Rhead, Pauline Fujita, Ann Zweig, Katrina Learned, Donna Karolchik and Robert Kuhn, https://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19
Yates et al. (doi:10.1093/bioinformatics/btu613), http://europepmc.org/search/?query=DOI:10.1093/bioinformatics/btu613
Zerbino et al. (doi:10.1093/nar/gkx1098), https://doi.org/10.1093/nar/gkx1098
- class snps.resources.ReferenceSequence(ID='', url='', path='', assembly='', species='', taxonomy='')[source]
Bases: object
Object used to represent and interact with a reference sequence.
- property ID
Get reference sequence chromosome.
- Return type:
str
- __init__(ID='', url='', path='', assembly='', species='', taxonomy='')[source]
Initialize a ReferenceSequence object.
- Parameters:
ID (str) – reference sequence chromosome
url (str) – url to Ensembl reference sequence
path (str) – path to local reference sequence
assembly (str) – reference sequence assembly (e.g., “GRCh37”)
species (str) – reference sequence species
taxonomy (str) – reference sequence taxonomy
References
The Variant Call Format (VCF) Version 4.3 Specification, 27 Nov 2022, https://samtools.github.io/hts-specs/VCFv4.3.pdf
- property assembly
Get reference sequence assembly.
- Return type:
str
- property build
Get reference sequence build.
- Returns:
e.g., “B37”
- Return type:
str
- property chrom
Get reference sequence chromosome.
- Return type:
str
- property end
Get reference sequence end position (1-based).
- Return type:
int
- property length
Get reference sequence length.
- Return type:
int
- property md5
Get reference sequence MD5 hash.
- Return type:
str
- property path
Get path to local reference sequence.
- Return type:
str
- property sequence
Get reference sequence.
- Return type:
np.array(dtype=np.uint8)
- property species
Get reference sequence species.
- Return type:
str
- property start
Get reference sequence start position (1-based).
- Return type:
int
- property taxonomy
Get reference sequence taxonomy.
- Return type:
str
- property url
Get URL to Ensembl reference sequence.
- Return type:
str
- class snps.resources.Resources(*args, **kwargs)[source]
Bases: object
Object used to manage resources required by snps.
- __init__(resources_dir='resources')[source]
Initialize a Resources object.
- Parameters:
resources_dir (str) – name / path of resources directory
- download_example_datasets()[source]
Download example datasets from openSNP.
Per openSNP, “the data is donated into the public domain using CC0 1.0.”
- Returns:
paths – paths to example datasets
- Return type:
list of str or empty str
References
Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204
- get_all_reference_sequences(**kwargs)[source]
Get Homo sapiens reference sequences for Builds 36, 37, and 38 from Ensembl.
Notes
This function can download over 2.5GB of data.
- Returns:
dict of ReferenceSequence, else {}
- Return type:
dict
- get_all_resources()[source]
Get / download all resources used throughout snps.
Notes
This function does not download reference sequences and the openSNP datadump, due to their large sizes.
- Returns:
dict of resources
- Return type:
dict
- get_assembly_mapping_data(source_assembly, target_assembly)[source]
Get assembly mapping data.
- Parameters:
source_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – assembly to remap from
target_assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – assembly to remap to
- Returns:
dict of json assembly mapping data if loading was successful, else {}
- Return type:
dict
- get_chip_clusters()[source]
Get resource for identifying deduced genotype / chip array based on chip clusters.
- Return type:
pandas.DataFrame
References
Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.
Lu, Tzovaras, & Gough. (2021). OpenSNP data-freeze of 5,393 (19.10.2020) [Data set]. In Computational and Structural Biotechnology Journal. Zenodo. https://doi.org/10.1016/j.csbj.2021.06.040
- get_dbsnp_151_37_reverse()[source]
Get and load RSIDs that are on the reference reverse (-) strand in dbSNP 151 and lower.
- Return type:
pandas.DataFrame
References
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1; 29(1):308-11.
Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for Biotechnology Information, National Library of Medicine. (dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/
- get_gsa_chrpos()[source]
Get and load GSA chromosome position map.
https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html
- Return type:
pandas.DataFrame
- get_gsa_resources()[source]
Get resources for reading Global Screening Array files.
https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html
- Return type:
dict
- get_gsa_rsid()[source]
Get and load GSA RSID map.
https://support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html
- Return type:
pandas.DataFrame
- get_low_quality_snps()[source]
Get listing of low quality SNPs for quality control based on chip clusters.
- Return type:
pandas.DataFrame
References
Chang Lu, Bastian Greshake Tzovaras, Julian Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal, Volume 19, 2021, Pages 3747-3754, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2021.06.040.
Lu, Tzovaras, & Gough. (2021). OpenSNP data-freeze of 5,393 (19.10.2020) [Data set]. In Computational and Structural Biotechnology Journal. Zenodo. https://doi.org/10.1016/j.csbj.2021.06.040
- get_opensnp_datadump_filenames()[source]
Get filenames internal to the openSNP datadump zip.
Per openSNP, “the data is donated into the public domain using CC0 1.0.”
Notes
This function can download over 27GB of data. If the download is not successful, try using a different tool like wget or curl to download the file and move it to the resources directory (see _get_path_opensnp_datadump).
- Returns:
filenames – filenames internal to the openSNP datadump
- Return type:
list of str
References
Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204
- get_reference_sequences(assembly='GRCh37', chroms=('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT'))[source]
Get Homo sapiens reference sequences for chroms of assembly.
Notes
This function can download over 800MB of data for each assembly.
- Parameters:
assembly ({‘NCBI36’, ‘GRCh37’, ‘GRCh38’}) – reference sequence assembly
chroms (list of str) – reference sequence chromosomes
- Returns:
dict of ReferenceSequence, else {}
- Return type:
dict
- load_opensnp_datadump_file(filename)[source]
Load the specified file from the openSNP datadump.
Per openSNP, “the data is donated into the public domain using CC0 1.0.”
- Parameters:
filename (str) – filename internal to the openSNP datadump
- Returns:
content of specified file internal to the openSNP datadump
- Return type:
bytes
References
Greshake B, Bayer PE, Rausch H, Reda J (2014), “openSNP-A Crowdsourced Web Resource for Personal Genomics,” PLOS ONE, 9(3): e89204, https://doi.org/10.1371/journal.pone.0089204
snps.utils
Utility classes and functions.
- snps.utils.clean_str(s)[source]
Clean a string so that it can be used as a Python variable name.
- Parameters:
s (str) – string to clean
- Returns:
string that can be used as a Python variable name
- Return type:
str
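One common way to obtain a string usable as a Python variable name is a single regular-expression substitution, sketched below. Whether the library uses exactly this pattern is an assumption; `clean_str_sketch` is a hypothetical name.

```python
import re


def clean_str_sketch(s):
    """Replace non-word characters with '_' and prefix '_' if s starts with a digit."""
    # \W matches any character that is not a letter, digit, or underscore;
    # ^(?=\d) matches the empty position before a leading digit.
    return re.sub(r"\W|^(?=\d)", "_", s)
```

For instance, a leading digit gains an underscore prefix and spaces become underscores, so the result can be used as an identifier.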
- snps.utils.create_dir(path)[source]
Create directory specified by path if it doesn’t already exist.
- Parameters:
path (str) – path to directory
- Returns:
True if path exists
- Return type:
bool
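The behavior described above (create the directory if needed and report whether it exists) can be sketched with the standard library. Returning False on failure is an assumption of this sketch, and `create_dir_sketch` is a hypothetical name.

```python
import os


def create_dir_sketch(path):
    """Create path (including parents) if missing; return True if it exists afterward."""
    try:
        os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    except OSError:
        return False  # e.g. a regular file is in the way or permission was denied
    return os.path.isdir(path)
```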
- snps.utils.gzip_file(src, dest)[source]
Gzip a file.
- Parameters:
src (str) – path to file to gzip
dest (str) – path to output gzip file
- Returns:
path to gzipped file
- Return type:
str
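Gzipping a file from src to dest can be sketched by streaming the bytes through the standard gzip module. This is an illustration of the described behavior rather than the library's code, and `gzip_file_sketch` is a hypothetical name.

```python
import gzip
import shutil


def gzip_file_sketch(src, dest):
    """Compress the file at src into a gzip file at dest and return dest."""
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        # Stream in chunks so large files are not loaded into memory at once.
        shutil.copyfileobj(f_in, f_out)
    return dest
```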
- snps.utils.save_df_as_csv(df, path, filename, comment='', prepend_info=True, atomic=True, **kwargs)[source]
Save dataframe to a CSV file.
- Parameters:
df (pandas.DataFrame) – dataframe to save
path (str) – path to directory where to save CSV file
filename (str or buffer) – filename for file to save or buffer to write to
comment (str) – header comment(s); one or more lines starting with ‘#’
prepend_info (bool) – prepend file generation information as comments
atomic (bool) – atomically write output to a file on local filesystem
**kwargs – additional parameters to pandas.DataFrame.to_csv
- Returns:
path to saved file or buffer (empty str if error)
- Return type:
str or buffer