SnpEff

SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of genetic variants (such as amino acid changes).

Download & Install

Downloading and installing SnpEff is straightforward; take a look at the download page.

Building from source

Take a look at the "Source code" section.

SnpEff Summary

A typical SnpEff use case would be:

Input: The inputs are predicted variants (SNPs, insertions, deletions and MNPs). The input file is usually obtained as a result of a sequencing experiment, and it is usually in Variant Call Format (VCF).
Output: SnpEff analyzes the input variants. It annotates the variants and calculates the effects they produce on known genes (e.g. amino acid changes). A list of effects and annotations that SnpEff can calculate can be found here.
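
To make the input side concrete, here is a minimal sketch of reading one variant record from a VCF file. It relies only on the standard VCF layout (tab-separated columns starting with CHROM, POS, ID, REF, ALT). The `Variant` type and the `parse_vcf_line` function are hypothetical names used for illustration; they are not part of SnpEff.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical container for the core fields of one VCF record."""
    chrom: str
    pos: int
    ref: str
    alts: list


def parse_vcf_line(line):
    """Return a Variant for a VCF data line, or None for header/blank lines."""
    if line.startswith("#") or not line.strip():
        return None
    # VCF data lines are tab-separated: CHROM, POS, ID, REF, ALT, ...
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # ALT may hold several comma-separated alleles.
    return Variant(chrom, int(pos), ref, alt.split(","))
```

For example, `parse_vcf_line("1\t889455\t.\tG\tA\t100.0\tPASS\tAF=0.0005")` yields a record on chromosome `1`, position `889455`, with reference `G` and a single alternative allele `A`.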

Variants

By genetic variant we mean a difference between a genome and a "reference" genome. As an example, imagine we are sequencing a "sample". Here "sample" can mean anything that you are interested in studying, from a cell culture to a mouse or a cancer patient.

It is standard procedure to compare your sample sequences against the corresponding "reference genome". For instance, you may compare the cancer patient's genome against the "reference genome".

In a typical sequencing experiment, you will find many places in the genome where your sample differs from the reference genome. These are called "genomic variants" or just "variants".

Typically, variants are categorized as follows:

Type	What it means	Example
SNP	Single-Nucleotide Polymorphism	Reference = 'A', Sample = 'C'
Ins	Insertion	Reference = 'A', Sample = 'AGT'
Del	Deletion	Reference = 'AC', Sample = 'C'
MNP	Multiple-nucleotide polymorphism	Reference = 'ATA', Sample = 'GTC'
MIXED	Multiple-nucleotide and an InDel	Reference = 'ATA', Sample = 'GTCAGT'

This is not a comprehensive list; it is only meant to give you an idea.
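
The categories in the table can be sketched as a small function that compares the reference and sample alleles. This is only an illustration of the table, assuming simple string comparison of the two alleles; it is not SnpEff's internal logic, and `classify_variant` is a hypothetical name.

```python
def classify_variant(ref, alt):
    """Classify a variant using the categories from the table above."""
    if ref == alt:
        raise ValueError("reference and sample alleles are identical")
    if len(ref) == len(alt):
        # Same length: one changed base is a SNP, several are an MNP.
        return "SNP" if len(ref) == 1 else "MNP"
    if len(alt) > len(ref) and (alt.startswith(ref) or alt.endswith(ref)):
        # The sample contains the whole reference plus extra bases.
        return "Ins"
    if len(ref) > len(alt) and (ref.startswith(alt) or ref.endswith(alt)):
        # The reference contains the whole sample plus extra bases.
        return "Del"
    # Length differs and bases were also substituted.
    return "MIXED"
```

Applied to the rows of the table, `classify_variant("A", "C")` gives `"SNP"`, `classify_variant("AC", "C")` gives `"Del"`, and `classify_variant("ATA", "GTCAGT")` gives `"MIXED"`.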

Annotations

So, you have a huge file describing all the differences between your sample and the reference genome. But you want to know more about these variants than just their genomic coordinates. For example: Are they in a gene? In an exon? Do they change protein coding? Do they cause premature stop codons?

SnpEff can help you answer all these questions. The process of adding this information about the variants is called "Annotation".

SnpEff provides several degrees of annotation, from simple (e.g. which gene each variant affects) to extremely complex (e.g. will this non-coding variant affect the expression of a gene?). Note that the more complex the annotation, the more it relies on computational predictions. Such predictions can be incorrect, so results from SnpEff (or any prediction algorithm) cannot be trusted blindly; they must be analyzed and independently validated by corresponding wet-lab experiments.

Citing

If you are using SnpEff or SnpSift, please cite our work as shown here. Thank you!

SnpEff Features

The following table shows the main SnpEff features:

Feature	Comment
Local install	SnpEff can be installed on your local computer or servers. Local installations are preferred for processing genomic data. Compared with remote web-based services, running a program locally has several advantages: there is no need to upload huge genomic datasets; processing does not depend on the availability or processing capacity of remote servers; service continuity is not a concern, since you need not worry whether a remote service will be maintained in the future; there are no security or confidentiality issues from uploading data to third-party servers; and you avoid the legal problems of processing clinical data on "outside" servers.
Multi platform	SnpEff is written in Java. It runs on Unix / Linux, macOS and Windows.
Simple installation	Installation is as simple as downloading a ZIP file and double-clicking on it.
Genomes	The human genome and all model organisms are supported. Over 20,000 genomes are supported, including most mammalian, plant, bacterial and fungal genomes with published genomic data.
Speed	SnpEff is really fast. It can annotate up to 1,000,000 variants per minute.
GATK integration	SnpEff can be easily integrated with GATK pipelines (see Integration).
Input and Output formats	SnpEff accepts input files in the following formats: `VCF`, which is the de facto standard for sequencing variants, and `BED`, which is used to annotate enrichment experiments (e.g. ChIP-Seq peaks) or other genomic data.
Variants supported	SnpEff can annotate SNPs, MNPs, insertions and deletions. Support for mixed variants and structural variants is available (although sometimes limited).
Effects supported	Many effects are calculated using Sequence Ontology terms, such as synonymous_variant, missense_variant, frameshift_variant and stop_gained.
Variant impact	SnpEff provides a simple assessment of the putative impact of the variant (e.g. HIGH, MODERATE or LOW impact).
Cancer tissue analysis	Somatic vs. germline mutations can be calculated on the fly. This is very useful for the cancer research community.
Loss of Function (LOF) assessment	SnpEff can estimate whether a variant is likely to cause loss of function of the protein.
Nonsense-mediated decay (NMD) assessment	Some mutations may cause the mRNA to be degraded and thus not translated into a protein. NMD analysis marks mutations that are estimated to trigger nonsense-mediated decay.
HGVS notation	SnpEff provides HGVS notation by default, which is quite popular in clinical and translational research environments.
User annotations	A user can provide custom annotations (by means of BED files).
Public databases	SnpEff can annotate using publicly available data from well-known databases, for instance: regulation tracks from ENCODE and Epigenome Roadmap can be used with SnpEff (loaded as BED or GFF regulation files); transcription factor binding site (TFBS) predictions can be annotated, using motif data generated by the `Jaspar` and `ENSEMBL` projects; the NextProt database can be used to annotate protein domains as well as important functional sites in a protein (e.g. phosphorylation sites).
Common variants (dbSNP)	Annotating "common" variants from dbSNP and 1000 Genomes can be easily done (see `SnpSift annotate`).
GWAS catalog	Support for GWAS catalog annotations (see `SnpSift gwasCat`).
Conservation scores	Support for PhastCons conservation score annotations (see `SnpSift phastCons`).
dbNSFP	A comprehensive database providing many annotations and scores, such as `SIFT`, `Polyphen2`, `GERP++`, `PhyloP`, `MutationTaster`, `SiPhy`, `Interpro`, `Haploinsufficiency`, etc. (via SnpSift). See `SnpSift dbnsfp` for details.
Non-coding annotations	Regulatory and non-coding annotations are supported for different tissues and cell lines. Supported annotations include PolII, H3K27ac, H3K4me2, H3K4me3, H3K27me3, CTCF and H3K36me3, to name a few.
Gene Sets annotations	Gene sets (MSigDb, GO, BioCarta, KEGG, Reactome, etc.) can be used to annotate via the `SnpSift geneSets` command.

Databases

In order to produce the annotations, SnpEff requires a database. We build these databases using information from trusted resources.

Info

By default, SnpEff downloads and installs databases automatically (since version 4.0).

Currently, there are pre-built databases for over 20,000 reference genomes. This means that most cases are covered.

On rare occasions, people need to build a database for an organism that is not currently supported (e.g. the genome is not publicly available). In most cases this can be done, and there is a section of this manual that explains how to build your own SnpEff database.

Which databases are supported? You can find out all the supported databases by running the databases command:

java -jar snpEff.jar databases | less

This command shows the genome version, organism name, status (whether the database is locally installed), bundle name, and download link. Keep in mind that ENSEMBL reference genomes are often used, so the name would be GRCh37 instead of hg19, or GRCm38 instead of mm10, and so on.

Example: Finding a database: Let's say you want to find out the name of the latest mouse (Mus musculus) database. You can run something like this:

java -jar snpEff.jar databases | grep -i musculus

129S1_SvImJ_v1.99                                               Mus_musculus_129s1svimj                                                                                     https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_129S1_SvImJ_v1.99.zip
AKR_J_v1.99                                                     Mus_musculus_akrj                                                                                           https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_AKR_J_v1.99.zip
A_J_v1.99                                                       Mus_musculus_aj                                                                                             https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_A_J_v1.99.zip
BALB_cJ_v1.99                                                   Mus_musculus_balbcj                                                                                         https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_BALB_cJ_v1.99.zip
C3H_HeJ_v1.99                                                   Mus_musculus_c3hhej                                                                                         https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_C3H_HeJ_v1.99.zip
C57BL_6NJ_v1.99                                                 Mus_musculus_c57bl6nj                                                                                       https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_C57BL_6NJ_v1.99.zip
CAST_EiJ_v1.99                                                  Mus_musculus_casteij                                                                                        https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_CAST_EiJ_v1.99.zip
CBA_J_v1.99                                                     Mus_musculus_cbaj                                                                                           https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_CBA_J_v1.99.zip
DBA_2J_v1.99                                                    Mus_musculus_dba2j                                                                                          https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_DBA_2J_v1.99.zip
FVB_NJ_v1.99                                                    Mus_musculus_fvbnj                                                                                          https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_FVB_NJ_v1.99.zip
GRCm38.75                                                       Mus_musculus                                                                                                https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_GRCm38.75.zip
GRCm38.99                                                       Mus_musculus                                                                                                https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_GRCm38.99.zip
LP_J_v1.99                                                      Mus_musculus_lpj                                                                                            https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_LP_J_v1.99.zip
NOD_ShiLtJ_v1.99                                                Mus_musculus_nodshiltj                                                                                      https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_NOD_ShiLtJ_v1.99.zip
NZO_HlLtJ_v1.99                                                 Mus_musculus_nzohlltj                                                                                       https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_NZO_HlLtJ_v1.99.zip
PWK_PhJ_v1.99                                                   Mus_musculus_pwkphj                                                                                         https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_PWK_PhJ_v1.99.zip
WSB_EiJ_v1.99                                                   Mus_musculus_wsbeij                                                                                         https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_WSB_EiJ_v1.99.zip
testMm37.61                                                     Mus_musculus                                                                                                https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_testMm37.61.zip

The output shows several options including strain-specific databases and general GRCm38 databases. Since it is generally better to use the latest release, you should probably pick GRCm38.99 in this example. Keep in mind that available versions will change over time, so always check the latest output.
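
If you want to script this choice, the following is a minimal sketch that scans a captured `databases` listing and returns the entry with the highest release number for a given assembly. It assumes that database names have the form `<assembly>.<release>` (as in `GRCm38.99`) and that the genome name is the first whitespace-separated token on each line; `latest_release` and `SAMPLE_LISTING` are hypothetical names, and the sample lines are copied from the output above.

```python
import re

# A few lines copied from the example listing above.
SAMPLE_LISTING = """\
GRCm38.75   Mus_musculus   https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_GRCm38.75.zip
GRCm38.99   Mus_musculus   https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_GRCm38.99.zip
testMm37.61 Mus_musculus   https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_testMm37.61.zip
"""


def latest_release(listing, assembly):
    """Return the '<assembly>.<release>' name with the highest release, or None."""
    pattern = re.compile(re.escape(assembly) + r"\.(\d+)")
    best = None  # (release number, database name)
    for line in listing.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # Assumption: the genome name is the first column of the listing.
        match = pattern.fullmatch(tokens[0])
        if match is None:
            continue
        release = int(match.group(1))
        if best is None or release > best[0]:
            best = (release, tokens[0])
    return None if best is None else best[1]


print(latest_release(SAMPLE_LISTING, "GRCm38"))
```

In practice the listing would be the text produced by `java -jar snpEff.jar databases`. Strain-specific names such as `A_J_v1.99` do not match the assumed pattern and are ignored by this sketch.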

Unsupported reference genomes: If your reference genome of interest is not supported yet (i.e. there is no database available), you can build a database yourself (see Building databases). If you have problems adding your own organism, open an issue in the SnpEff repository and I'll do my best to help you out.