The SNPs class accepts a path to a file or a bytes object. A Reader class attempts to
infer the data source and load the SNPs. The loaded SNPs are
normalized and
available via a pandas.DataFrame:
The dataset consists of raw data files from two different DNA testing sources - letβs combine
these files. Specifically, weβll update the SNPs object with SNPs from a
Family Tree DNA file.
>>> merge_results=s.merge([SNPs("resources/662.ftdna-illumina.341.csv.gz")])Merging SNPs('662.ftdna-illumina.341.csv.gz')SNPs('662.ftdna-illumina.341.csv.gz') has Build 36; remapping to Build 37Downloading resources/NCBI36_GRCh37.tar.gz27 SNP positions were discrepant; keeping original positions151 SNP genotypes were discrepant; marking those as null>>> s.source'23andMe, FTDNA'>>> s.count1006960>>> s.build37>>> s.build_detectedTrue
If the SNPs being merged have a build that differs from the destination build, the SNPs to merge
will be remapped automatically. After this example merge, the build is still detected, since the
build was detected for all SNPs objects that were merged.
As the data gets added, itβs compared to the existing data, and SNP position and genotype
discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.) These
discrepant SNPs are available for inspection after the merge via properties of the SNPs object.
>>> len(s.discrepant_merge_genotypes)151
Additionally, any non-called / null genotypes will be updated during the merge, if the file
being merged has a called genotype for the SNP.
Moreover, merge takes a chrom parameter - this enables merging of only SNPs associated
with the specified chromosome (e.g., βYβ or βMTβ).
Finally, merge returns a list of dict, where each dict has information corresponding
to the results of each merge (e.g., SNPs in common).
Ok, so far weβve merged the SNPs from two files (ensuring the same build in the process and
identifying discrepancies along the way). Then, we remapped the SNPs to Build 38. Now, letβs save
the merged and remapped dataset consisting of 1M+ SNPs to a tab-separated values (TSV) file:
Moreover, letβs get the reference sequences for this assembly and save the SNPs as a VCF file:
>>> saved_snps=s.to_vcf("out.vcf")Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.2.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.3.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.4.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.5.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.6.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.7.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.8.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.9.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.10.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.14.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.15.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.16.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.17.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.18.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.X.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gzDownloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gzSaving output/out.vcf1 SNP positions were found to be discrepant when saving VCF
When saving a VCF, if any SNPs have positions outside of the reference sequence, they are marked
as discrepant and are available via a property of the SNPs object.
All output files are saved to the
output directory.
The various output files produced by snps are detailed below. Output files are saved in the
output directory, which is defined at the instantiation of a SNPs object.
SNPs can be saved with SNPs.save. By default, one tab-separated
.txt or .vcf file (vcf=True) is output when SNPs are saved. If comma is specified as
the separator (sep=","), the default extension is .csv.
The content of non-VCF files (after comment lines, which start with #) is as follows:
Column
Description
rsid
SNP ID
chromosome
Chromosome of SNP
position
Position of SNP
genotype
Genotype of SNP
When filename is not specified, default filenames are used as described below.
The instructions below provide the steps to install snps on a
Raspberry Pi (tested with
βRaspberry Pi OS (32-bit) Liteβ,
release date 2020-08-20). For more details about Python on the Raspberry Pi, see
here.
Note
Text after a prompt (e.g., $) is the command to type at the command line. The
instructions assume a fresh install of Raspberry Pi OS and that after logging in as
the pi user, the current working directory is /home/pi.
Install pip for Python 3:
pi@raspberrypi:~ $ sudo apt install python3-pip
Press βyβ followed by βenterβ to continue. This enables us to install packages from the
Python Package Index.
Install the venv module:
pi@raspberrypi:~ $ sudo apt install python3-venv
Press βyβ followed by βenterβ to continue. This enables us to create a
virtual environment to isolate the snps
installation from other system Python packages.
Press βyβ followed by βenterβ to continue. This is required for NumPy, a
dependency of snps.
Create a directory for snps and change working directory:
pi@raspberrypi:~ $ mkdir snps
pi@raspberrypi:~ $ cd snps
Create a virtual environment for snps:
pi@raspberrypi:~/snps $ python3 -m venv .venv
The virtual environment is located at /home/pi/snps/.venv.
Activate the virtual environment:
pi@raspberrypi:~/snps $ source .venv/bin/activate
Now when you invoke Python or pip, the virtual environmentβs version will be used (as
indicated by the (.venv) before the prompt). This can be verified as follows:
(.venv) pi@raspberrypi:~/snps $ which python
/home/pi/snps/.venv/bin/python
Install snps:
(.venv) pi@raspberrypi:~/snps $ pip install snps
Start Python:
(.venv) pi@raspberrypi:~/snps $ python
Python 3.7.3 (default, Jul 25 2020, 13:03:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
Use snps; examples shown in the README should now work.
At completion of usage, the virtual environment can be deactivated:
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI
database of genetic variation. Nucleic Acids Res. 2001 Jan 1; 29(1):308-11.
Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center for
Biotechnology Information, National Library of Medicine. dbSNP accession: <listed above> (dbSNP
Build ID: 142). Available from: http://www.ncbi.nlm.nih.gov/SNP/
snps could always use more documentation, whether as part of the official snps docs, in
docstrings, or even on the web in blog posts, articles, and such. See below for info on how to
generate documentation.
When youβre done making changes, run all the tests with:
$ pipenv run pytest --cov-report=html --cov=snps tests
Note
Downloads during tests are disabled by default. To enable downloads, set the
environment variable DOWNLOADS_ENABLED=true.
Note
If you receive errors when running the tests, you may need to specify the temporary
directory with an environment variable, e.g., TMPDIR="/path/to/tmp/dir".
Note
After running the tests, a coverage report can be viewed by opening
htmlcov/index.html in a browser.
Check code formatting:
$ pipenv run black --check --diff .
Commit your changes and push your branch to GitHub:
$ git add .
$ git commit -m "Your detailed description of your changes."
$ git push origin name-of-your-bugfix-or-feature
Object used to read, write, and remap genotype / raw data files.
Parameters:
file (str or bytes) β path to file to load or bytes to load
only_detect_source (bool) β only detect the source of the data
assign_par_snps (bool) β assign PAR SNPs to the X and Y chromosomes
output_dir (str) β path to output directory
resources_dir (str) β name / path of resources directory
deduplicate (bool) β deduplicate RSIDs and make SNPs available as SNPs.duplicate
deduplicate_MT_chrom (bool) β deduplicate alleles on MT; see SNPs.heterozygous_MT
deduplicate_XY_chrom (bool or str) β deduplicate alleles in the non-PAR regions of X and Y for males; see SNPs.discrepant_XY
if a str then this is the sex determination method to use X Y or XY
parallelize (bool) β utilize multiprocessing to speedup calculations
processes (int) β processes to launch if multiprocessing
rsids (tuple, optional) β rsids to extract if loading a VCF file
Chip clusters, which are defined in [1], are associated with deduced genotype /
chip arrays and DTC companies.
This method also sets the values returned by the cluster, chip, and
chip_version properties, based on max overlap, if the specified threshold is
satisfied.
Parameters:
cluster_overlap_threshold (float) β threshold for cluster to overlap this SNPs object, and vice versa, to set
values returned by the cluster, chip, and chip_version properties
Returns:
pandas.DataFrame with the following columns:
company_composition
DTC company composition of associated cluster from [1]
chip_base_deduced
deduced genotype / chip array of associated cluster from [1]
snps_in_cluster
count of SNPs in cluster
snps_in_common
count of SNPs in common with cluster (inner merge with cluster)
overlap_with_cluster
percentage overlap of snps_in_common with cluster
overlap_with_self
percentage overlap of snps_in_common with this SNPs object
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K.
dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001
Jan 1;29(1):308-11.
Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center
for Biotechnology Information, National Library of Medicine. dbSNP accession: rs3094315,
rs11928389, rs2500347, rs964481, rs2341354, rs3850290, and rs1329546
(dbSNP Build ID: 151). Available from: http://www.ncbi.nlm.nih.gov/SNP/
Any low quality SNPs are removed from the
snps_qc dataframe and are made
available as low_quality.
Notes
Chip clusters, which are defined in [1], are associated with low quality SNPs.
As such, low quality SNPs will only be identified when this SNPs object corresponds
to a cluster per
compute_cluster_overlap().
snps_objects (list or tuple of SNPs) β other SNPs objects to merge into this SNPs object
discrepant_positions_threshold (int) β threshold for discrepant SNP positions between existing data and data to be loaded;
a large value could indicate mismatched genome assemblies
discrepant_genotypes_threshold (int) β threshold for discrepant genotype data between existing data and data to be loaded;
a large value could indicated mismatched individuals
remap (bool) β if necessary, remap other SNPs objects to have the same build as this SNPs object
before merging
Remap SNP coordinates from one assembly to another.
This method uses the assembly map endpoint of the Ensembl REST API service (via
Resourcesβs EnsemblRestClient) to convert SNP coordinates / positions from one
assembly to another. After remapping, the coordinates / positions for the
SNPs will be that of the target assembly.
If the SNPs are already mapped relative to the target assembly, remapping will not be
performed.
Parameters:
target_assembly ({βNCBI36β, βGRCh37β, βGRCh38β, 36, 37, 38}) β assembly to remap to
complement_bases (bool) β complement bases when remapping SNPs to the minus strand
Returns:
chromosomes_remapped (list of str) β chromosomes remapped
chromosomes_not_remapped (list of str) β chromosomes not remapped
Notes
An assembly is also know as a βbuild.β For example:
filename (str or buffer) β filename for file to save or buffer to write to
atomic (bool) β atomically write output to a file on local filesystem
alt_unavailable (str) β representation of ALT allele when ALT is not able to be determined
chrom_prefix (str) β prefix for chromosomes in VCF CHROM column
qc_only (bool) β output only SNPs that pass quality control
qc_filter (bool) β populate FILTER column based on quality control results
**kwargs β additional parameters to pandas.DataFrame.to_csv
Returns:
path to file in output directory if SNPs were saved, else empty str
Return type:
str
Notes
Parameters qc_only and qc_filter, if true, will identify low quality SNPs per
identify_low_quality_snps(),
if not done already. Moreover, these parameters have no effect if this SNPs
object does not map to a cluster per
compute_cluster_overlap().
Assumes columns are βrsidβ, βchromβ / βchromosomeβ, βposβ / βpositionβ, and βgenotypeβ;
values are comma separated; unreported genotypes are indicated by βββ; and one header row
precedes data. For example:
Class for downloading and loading required external resources.
References
International Human Genome Sequencing Consortium. Initial sequencing and
analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921.
http://dx.doi.org/10.1038/35057062
Get and load RSIDs that are on the reference reverse (-) strand in dbSNP 151 and lower.
Return type:
pandas.DataFrame
References
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K.
dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1;
29(1):308-11.
Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD): National Center
for Biotechnology Information, National Library of Medicine. (dbSNP Build ID: 151).
Available from: http://www.ncbi.nlm.nih.gov/SNP/
Get filenames internal to the openSNP datadump zip.
Per openSNP, βthe data is donated into the public domain using CC0 1.0.β
Notes
This function can download over 27GB of data. If the download is not successful,
try using a different tool like wget or curl to download the file and move it
to the resources directory (see _get_path_opensnp_datadump).
Returns:
filenames β filenames internal to the openSNP datadump
Return type:
list of str
References
Greshake B, Bayer PE, Rausch H, Reda J (2014), βopenSNP-A Crowdsourced Web Resource
for Personal Genomics,β PLOS ONE, 9(3): e89204,
https://doi.org/10.1371/journal.pone.0089204