SnpSift dbNSFP

dbNSFP is an integrated database of functional predictions from multiple algorithms (SIFT, Polyphen2, LRT, MutationTaster, PhyloP, GERP++, etc.).

Typical usage

The main advantage is that a single command annotates your variants with the results of multiple prediction tools, which makes annotation faster. See the dbNSFP database website for more details.

Database: In order to annotate using dbNSFP, you need to download the dbNSFP database and the index file. dbNSFP is large (several GB) so it might take a while to download it. The database is compressed (block-gzip) and tabix-indexed, so two files are required (the data .gz file and the .gz.tbi index file).
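
Since both files must be present, a quick check before annotating can save a failed run. The following is a minimal sketch; the function name check_dbnsfp_files and its messages are hypothetical and not part of SnpSift.

```shell
# Hypothetical helper: confirm that the database and its tabix index both exist.
# Assumes the index sits next to the database with a ".tbi" suffix.
check_dbnsfp_files() {
    db="$1"    # path to the block-gzipped database, e.g. dbNSFP.txt.gz
    if [ ! -s "$db" ]; then
        echo "Missing or empty database file: $db" >&2
        return 1
    fi
    if [ ! -s "$db.tbi" ]; then
        echo "Missing or empty index file: $db.tbi" >&2
        return 1
    fi
    echo "OK: $db and $db.tbi found"
}
```

For example, `check_dbnsfp_files path/to/dbNSFP.txt.gz` returns a non-zero status when either file is missing.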

Warning

dbNSFP only contains data for SNPs (single nucleotide polymorphisms). Indels and other variant types are silently skipped during annotation.
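
Because skipped variants produce no message, it can help to know beforehand how many records are SNPs. The sketch below counts them in an uncompressed VCF. It assumes a record counts as a SNP when REF and every ALT allele are single bases; how SnpSift itself classifies multi-allelic records is not stated here, and the function name count_snps is hypothetical.

```shell
# Count SNP vs. non-SNP records in an uncompressed VCF file.
# Assumption: a SNP has a one-base REF and only one-base ALT alleles.
count_snps() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        {
            snp = (length($4) == 1)            # REF must be one base
            n = split($5, alts, ",")           # ALT may list several alleles
            for (i = 1; i <= n; i++)
                if (alts[i] !~ /^[ACGTNacgtn]$/) snp = 0
            if (snp) snps++; else other++
        }
        END { printf "snps=%d other=%d\n", snps, other }
    ' "$1"
}
```

Running `count_snps myFile.vcf` prints a single line with both counts.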

Warning

The input VCF file must be sorted by chromosome and position. The command will fail with an error if unsorted entries are detected.
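
To find the offending record before running the annotation, something like the following sketch can be used on an uncompressed VCF. It assumes "sorted" means that each chromosome forms one contiguous block and that positions never decrease within a block; the exact check SnpSift performs may differ, and the function name check_vcf_sorted is hypothetical.

```shell
# Report the first record that breaks chromosome/position ordering.
# Exit status is 0 when the file is sorted and 1 otherwise.
check_vcf_sorted() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        $1 != chrom {                          # a new chromosome block starts
            if ($1 in seen) {
                print "unsorted: chromosome " $1 " reappears at line " NR
                bad = 1; exit
            }
            seen[$1] = 1; chrom = $1; pos = 0
        }
        $2 + 0 < pos { print "unsorted: " $1 ":" $2 " at line " NR; bad = 1; exit }
        { pos = $2 + 0 }                       # remember the last position seen
        END { if (!bad) print "sorted"; exit (bad + 0) }
    ' "$1"
}
```

Usage: `check_vcf_sorted myFile.vcf && echo "safe to annotate"`.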

Downloading

You can download the files from SnpEff's site.

WARNING: Remember that you need both the database file and the index file.

DbNSFP Version 4.5:

* GRCh37 / hg19:
    * Database. Save file as dbNSFP.txt.gz
    * Index. Save file as dbNSFP.txt.gz.tbi
* GRCh38 / hg38:
    * Database. Save file as dbNSFP.txt.gz
    * Index. Save file as dbNSFP.txt.gz.tbi

DbNSFP Version 4.1:

* GRCh37 / hg19 (dbNSFP Academic):
    * Database. Save file as dbNSFP.txt.gz
    * Index. Save file as dbNSFP.txt.gz.tbi
* GRCh38 / hg38 (dbNSFP Academic):
    * Database. Save file as dbNSFP.txt.gz
    * Index. Save file as dbNSFP.txt.gz.tbi

Command Options

-f <field_list> : Comma-separated list of dbNSFP field names to annotate. Default fields are shown when running the command without arguments.
-n : Invert field selection. Use all fields EXCEPT the ones specified with -f.
-db <file> : Path to dbNSFP database file (bgzip + tabix).
-a : Annotate fields even if the database has an empty value (uses '.' for missing). By default, empty fields are skipped.
-m : Annotate fields even when there is no database entry for a variant (uses '.' for all fields). By default, variants not found in dbNSFP are left unchanged.
-collapse : Collapse (deduplicate) repeated values when multiple dbNSFP entries match a variant (e.g., multiple transcripts). Values are comma-separated.
-nocollapse : Disable collapsing of repeated values.
-g <genome> : Genome version (used to locate the database in the config file).

Output fields

All annotated fields are added to the VCF INFO column with the prefix dbNSFP_. For example, the dbNSFP field SIFT_pred becomes dbNSFP_SIFT_pred in the output VCF.
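
Because of the common prefix, the annotations added to a file are easy to list with standard tools. The following sketch prints the distinct dbNSFP_ keys found in the INFO column of an uncompressed, annotated VCF; the function name list_dbnsfp_keys is hypothetical.

```shell
# List the distinct dbNSFP_ keys present in the INFO column (column 8).
list_dbnsfp_keys() {
    grep -v '^#' "$1" \
        | cut -f 8 \
        | tr ';' '\n' \
        | grep '^dbNSFP_' \
        | cut -d '=' -f 1 \
        | sort -u
}
```

Usage: `list_dbnsfp_keys myFile.annotated.vcf`.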

Special characters in field names are sanitized for VCF compatibility (e.g., GERP++_RS becomes dbNSFP_GERP___RS because + is not valid in VCF INFO keys).
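
When scripting against the output, you need the sanitized name rather than the original dbNSFP column name. The sketch below approximates the mapping. It assumes every character other than a letter, digit, or underscore is replaced with an underscore; only the + case is confirmed above, so treat the rule for other characters as a guess. The function name dbnsfp_info_key is hypothetical.

```shell
# Approximate the INFO key that a dbNSFP field name maps to.
# Assumption: any character outside [A-Za-z0-9_] becomes "_".
dbnsfp_info_key() {
    printf 'dbNSFP_%s\n' "$(printf '%s' "$1" | sed 's/[^A-Za-z0-9_]/_/g')"
}

dbnsfp_info_key 'GERP++_RS'    # prints dbNSFP_GERP___RS
dbnsfp_info_key 'SIFT_pred'    # prints dbNSFP_SIFT_pred
```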

When a variant matches multiple dbNSFP entries (common for genes with multiple transcripts), the values are comma-separated. Use -collapse to deduplicate repeated values across entries.
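
To make the idea of collapsing concrete, the sketch below deduplicates a comma-separated value list, keeping the first occurrence of each value. It only illustrates the concept on a single string; it is not SnpSift's implementation, and whether SnpSift preserves the original order is an assumption. The function name collapse_values is hypothetical.

```shell
# Deduplicate a comma-separated list, keeping first occurrences in order.
collapse_values() {
    printf '%s\n' "$1" | tr ',' '\n' | awk '!seen[$0]++' | paste -sd, -
}

collapse_values 'D,D,T,D'    # prints D,T
```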

Annotation examples

Annotate using default fields:

java -jar SnpSift.jar dbnsfp -v myFile.vcf > myFile.annotated.vcf

Annotate specific fields only:

java -jar SnpSift.jar dbnsfp -f SIFT_pred,Polyphen2_HDIV_pred,CADD_phred myFile.vcf > myFile.annotated.vcf

Annotate all fields EXCEPT specific ones:

java -jar SnpSift.jar dbnsfp -n -f Interpro_domain,Uniprot_acc myFile.vcf > myFile.annotated.vcf

Specify a custom database path:

java -jar SnpSift.jar dbnsfp -db path/to/dbNSFP4.5c.txt.gz myFile.vcf > myFile.annotated.vcf

Annotate even when values are missing in the database:

java -jar SnpSift.jar dbnsfp -a -m myFile.vcf > myFile.annotated.vcf

Building dbNSFP (for developers)

Info

Users do NOT need to do this, since a pre-indexed database can be downloaded from SnpSift's site (see previous sub-section). These instructions are mostly for developers.

You can also create the dbNSFP files yourself by downloading the source files from the dbNSFP site. SnpSift requires two files:

* A block-gzipped database file
* The corresponding tabix index for the database file

Creating a file that SnpSift can use takes only a few steps:

# Download dbNSFP database (adjust version as needed)
wget http://dbnsfp.houstonbioinformatics.org/dbNSFPzip/dbNSFP4.5c.zip

# Uncompress
unzip dbNSFP4.5c.zip

# Merge the per-chromosome files into one file, keeping a single header line
(head -n 1 dbNSFP4.5c_variant.chr1 ; cat dbNSFP4.5c_variant.chr* | grep -v "^#" ) > dbNSFP4.5c.txt

# Compress using block-gzip algorithm
bgzip dbNSFP4.5c.txt

# Create tabix index
tabix -s 1 -b 2 -e 2 dbNSFP4.5c.txt.gz
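
The merge step above is meant to leave exactly one header line at the top of the file. A small check such as the following sketch can confirm that; it reads the table from standard input, and the function name check_single_header is hypothetical.

```shell
# Verify that the first line is a header ("#...") and that no later line is.
# Exit status is 0 when the layout is as expected and 1 otherwise.
check_single_header() {
    awk '
        NR == 1 && !/^#/ { print "first line is not a header"; bad = 1; exit }
        NR > 1 && /^#/   { print "extra header at line " NR; bad = 1; exit }
        END {
            if (NR == 0) { print "empty input"; exit 1 }
            if (!bad) print "ok: " (NR - 1) " data lines"
            exit (bad + 0)
        }
    '
}
```

Run it on the merged file before compressing (`check_single_header < dbNSFP4.5c.txt`), or on the compressed file with `gzip -dc dbNSFP4.5c.txt.gz | check_single_header`, since bgzip output can be read by gzip.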

Building dbNSFP for hg19/GRCh37 using dbNSFP 4.X:

Recent dbNSFP versions are based on GRCh38/hg38 genomic coordinates. To use them with a GRCh37/hg19 genome, you need to create a new dbNSFP file with GRCh37/hg19 coordinates. Fortunately, dbNSFP also provides GRCh37/hg19 coordinates, so you only need to swap the coordinate columns and sort by genomic position. The dbNSFP_sort.pl script does this; the commands below assume it is located in snpEff's scripts_build directory:

# Set to your downloaded dbNSFP version
version="4.5c"

# Replace coordinates by columns 7 and 8 (hg19 coordinates) and sort by those coordinates
cat dbNSFP${version}_variant.chr* \
    | $HOME/snpEff/scripts_build/dbNSFP_sort.pl 7 8 \
    > dbNSFP${version}_hg19.txt

# Compress and index
bgzip dbNSFP${version}_hg19.txt
tabix -s 1 -b 2 -e 2 dbNSFP${version}_hg19.txt.gz