Building databases

SnpEff needs a database to perform genomic annotations. There are pre-built databases for thousands of genomes, so chances are that your organism of choice already has a SnpEff database available. In the (unlikely?) event that you need to build one yourself, here we describe how to it.

Info

You can know which genomes are supported by running the following command:

java -jar snpEff.jar databases

Warning

Most people do NOT need to build a database, and can safely use a pre-built one. So unless you are working with a rare genome you most likely don't need to do it either.

Managing SnpEff databases manually

SnpEff databases for the most popular genomes are already pre-built and available for you to download. So, chances are that you don't need to build a database yourself (this will save you a LOT of work).

Warning

By default SnpEff automatically downloads and installs the database for you, so you don't need to do it manually. The following instructions are for people that want to pre-install databases manually (again, most people don't need to do this).

The easiest way to download and install a pre-built SnpEff database manually, is using the download command. E.g. if you want to install the SnpEff database for the human genome, you can run the following command:

java -jar snpEff.jar download -v GRCh37.75

Backwards compatible databases

In many cases the databases from previous SnpEff versions might be compatible with newer versions. So, SnpEff has a fallback mechanism for "backwards-compatible database formats". If database format version 5.1 is backward compatible with "5.0", then we can use the "old" (5.0) databases in SnpEff version 5.1

For example, let's say GRCh38.99 was built for "SnpEff version 5.0". When I released version 5.1 I don't need to rebuild GRCh38.99 (or any of the other thousands of databases) again for 5.1, we can just use the "old" 5.0 versions.

How does it work? SnpEff checks the database in "5.1" path, if it doesn't work it fallback to "5.0" path. If you take a look a SnpEff's download output using the -v command, you'll see that it tries two times:

$ java -jar snpEff.jar download -v GRCm38.99

00:00:00 SnpEff version SnpEff 5.1d (build 2022-04-19 15:49), by Pablo Cingolani
00:00:00 Command: 'download'
00:00:00 Reading configuration file 'snpEff.config'. Genome: 'GRCm38.99'
00:00:00 Reading config file: /Users/pcingola/snpEff/snpEff.config
00:00:00 done
00:00:00 Downloading database for 'GRCm38.99'
00:00:00 Downloading from 'https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_GRCm38.99.zip' to local file '/var/folders/s9/y0bgs3l55rj_jkkkxr2drz4157r1dz/T//snpEff_v5_1_GRCm38.99.zip'
00:00:00 Connecting to https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_GRCm38.99.zip
00:00:01 Connecting to https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_GRCm38.99.zip, using proxy: false
00:00:01 ERROR while connecting to https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_GRCm38.99.zip
00:00:01 Downloading from 'https://snpeff.blob.core.windows.net/databases/v5_0/snpEff_v5_0_GRCm38.99.zip' to local file '/var/folders/s9/y0bgs3l55rj_jkkkxr2drz4157r1dz/T//snpEff_v5_0_GRCm38.99.zip'
00:00:01 Connecting to **https://snpeff.blob.core.windows.net/databases/v5_0/snpEff_v5_0_GRCm38.99.zip**
00:00:01 Connecting to https://snpeff.blob.core.windows.net/databases/v5_0/snpEff_v5_0_GRCm38.99.zip, using proxy: false
00:00:02 Local file name: '/var/folders/s9/y0bgs3l55rj_jkkkxr2drz4157r1dz/T//snpEff_v5_0_GRCm38.99.zip'
......

So, it starts searching at https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_GRCm38.99.zip but it doesn't find it there, so it proceeds with the databse5.0pathhttps://snpeff.blob.core.windows.net/databases/v5_0/snpEff_v5_0_GRCm38.99.zip`, where the blob is found, thus proceeds to download and install it from there.

Info

If you are running SnpEff from a directory different than the one it was installed, you will have to specify where the config file is. This is done using the -c command line option:

java -Xmx8g -jar snpEff.jar download -c path/to/snpEff/snpEff.config -v GRCh37.75

Building a database

In order to build a database for a new genome, you need to:

Warning

Most people do NOT need to build a database, and can safely use a pre-built one. So unless you are working with a rare genome you most likely don't need to do it either.

Step 1: Configure a new genome in SnpEff's config file snpEff.config.
1. Add genome entry to snpEff's configuration
2. If the genome uses a non-standard codon table: Add codon table parameter
Step 2: Build using gene annotations and reference sequences
1. Option 1: Building a database from GTF files (recommended for large genomes)
2. Option 2: Building a database from GenBank files (recommended for small genomes)
3. Option 3: Building a database from GFF files
4. Option 4: Building a database from RefSeq table from UCSC
Step 3: Checking the database: SnpEff will check the database by comparing predicted protein sequences and CDS sequences with ones provided by the user.
1. Checking CDS sequences
2. Checking Protein sequences

Info

All files can be compressed using gzip E.g. the reference file 'hg19.fa' can be compressed to 'hg19.fa.gz', snpEff will automatically decompress the file.

Warning

Some files claimed to be compressed using GZIP are actually not or even use a block compression variant not supported by Java's gzip library. If you notice that your build process finishes abruptly for no apparent reason, try uncompressing the files.

Step 1: Configure a new genome

In order to tell SnpEff that there is a new genome available, you must update SnpEff's configuration file snpEff.config.

You must add a new genome entry to snpEff.config.

If your genome, or a chromosome, uses non-standard codon tables you must update snpEff.config accordingly. A typical case is when you use mitochondrial DNA. Then you specify that chromosome 'MT' uses codon.Invertebrate_Mitochondrial codon table. Another common case is when you are adding a bacterial genome, then you specify that the codon table is Bacterial_and_Plant_Plastid.

Add a genome to the configuration file

This example shows how to add a new genome to the config files. For this example we'll use the mouse genome (mm37.61):

Edit the config file to create the new genome:
```
vi snpEff.config
```
Add the following lines (you are editing snpEff.config):
```
# Mouse genome, version mm37.61
mm37.61.genome : Mouse
```
Warning

You may need to add codon table information for the genome or some parts of it (e.g. mitochondrial "chromosome"). See next section for details.

Optional: Add genome to Galaxy's menu:

cd /path/to/galaxy
cd tools/snpEffect/
vi snpEffect.xml

Add the following lines to the file:

<param name="genomeVersion" type="select" label="Genome">
    <option value="hg37">Human (hg37)<option>
    <option value="mm37.61">Mouse (mm37.61)<option>
<param>

Configuring codon tables (not always required)

Codon tables are provided in the snpEff.config configuration file under the section codon.Name_of_your_codon_table. The format is a comma separated list of CODON/AMINO_ACID.

E.g.:

codon.Invertebrate_Mitochondrial:  TTT/F, TTC/F, TAC/Y, TAA/*, ATG/M+, ATG/M+, ACT/T, ...

Note that codons marked with '*' are STOP codons and codons marked with a '+' are START codons.

In order for you to use them, you have to specify that a given "chromosome" uses one of the tables (otherwise the default codon table is used).

E.g. Here we say the chromosome 'M' from fly genome (dm3) uses Invertebrate_Mitochondrial codon table:

dm3.M.codonTable : Invertebrate_Mitochondrial

...of course, chromosome 'M' is not a real chromosome, it is just a way to mark the sequence as mitochondrial DNA in the reference genome.

Step 2: Build using gene annotations and reference sequences

To build a database, SnpEff needs:

The reference genome sequence: This is the sequence of all chromosomes in the genome, typically in a FASTA file
Gene annotations files: This is the information on where the genes, transcripts and exons are in the genome. Typically, from files in GTF, GeneBank, GFF, or RefSeq formats
Sequences of CDS or Proteins from the genome. These are used to check that SnpEff's database is consistent and doesn't have errors (see sections Checking CDS sequences and Checking Protein sequences)

Step 2, Option 1: Building a database from GTF files

GTF 2.2 files are supported by SnpEff (e.g. ENSEMBL releases genome annotations in this format).

See a more detailed explanation about SnpEff's requirements for GTF files here

Get the genome and uncompress it:

# Create directory for this new genome
cd /path/to/snpEff/data/
mkdir mm37.61
cd mm37.61

# Get annotation files
wget ftp://ftp.ensembl.org/pub/current/gtf/mus_musculus/Mus_musculus.NCBIM37.61.gtf.gz
mv Mus_musculus.NCBIM37.61.gtf.gz genes.gtf.gz

# Get the genome reference sequence file
cd /path/to/snpEff/data/genomes
wget ftp://ftp.ensembl.org/pub/current/fasta/mus_musculus/dna/Mus_musculus.NCBIM37.61.dna.toplevel.fa.gz
mv Mus_musculus.NCBIM37.61.dna.toplevel.fa.gz mm37.61.fa.gz

Note: The FASTA file can be either in
/path/to/snpEff/data/genomes/mm37.61.fa
or in
/path/to/snpEff/data/mm37.61/sequences.fa
Add the new genome to the config file (see Add a new genome to the configuration file for details)

Create database:

cd /path/to/snpEff
java -jar snpEff.jar build -gtf22 -v mm37.61

Step 2, Option 2: Building a database from GenBank files

In this section, we show how to build a genome database using GeneBank files.

Info

Using GeneBank files is the recommended way for databases for small genomes, such as viruses or bacteria

Info

For NCBI genomes, you can use the script buildDbNcbi.sh that downloads and builds the database. See details in this FAQ

Example: This example shows how to create a database for a new genome.

Info

GenBank files usually include genome sequence as well as CDS sequences. If these sequences are NOT in the GeneBank file, you should provide separate FASTA files for genome reference and CDS (or Protein) sequences.

For this example we'll use "Staphylococcus aureus":

Go to NIH page for CP000730
Download the features in geneBank format, by clicking as shown in the following images (red arrows):

genBank0

Make sure you click the "Update" button!

Then you go to the "Send" menu:

genBank1

and then:

genBank2

Save the GenBank data to /path/to/snpEff/data/CP000730/genes.gbk

Note: If there are more than one genbank file for an organism (e.g. multiple chromosomes), then you can download each file and concatenate them.

E.g.: Vibrio Cholerae has two chromosomes with GenBank accessions: NC_002505.1 and NC_002506.1. You can download both files and save them as snpEff/data/vibrio/NC_002505.1.gbk and snpEff/data/vibrio/NC_002506.1.gbk respectively, and then concatenate both files:

    cat NC_002505.1.gbk NC_002506.1.gbk > genes.gbk

Add the following entries in the config file:

    # Vibrio Cholerae
    vibrio.genome : Vibrio Cholerae
            vibrio.chromosomes : NC_002505.1, NC_002506.1
            vibrio.NC_002505.1.codonTable : Bacterial_and_Plant_Plastid
            vibrio.NC_002506.1.codonTable : Bacterial_and_Plant_Plastid

Create database (note the -genbank flag):

cd /path/to/snpEff
java -jar snpEff.jar build -genbank -v CP000730

Step 2, Option 3: Building a database from GFF files

Warning

Using GFF is highly discouraged, we recommend you use GTF files instead (whenever possible).

See a more detailed explanation about SnpEff's requirements for GFF files here

This example shows how to create a database for a new genome using GFF file (e.g. FlyBase, WormBase, BeeBase release GFF files). For this example we'll use the Drosophila melanogaster genome (dm5.31):

Get a GFF file to path/to/snpEff/data/dm5.31/genes.gff:

mkdir path/to/snpEff/data/dm5.31
cd path/to/snpEff/data/dm5.31
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.31_FB2010_08/gff/dmel-all-r5.31.gff.gz
mv dmel-all-r5.31.gff.gz genes.gff.gz

Note: GFF3 files can include the reference sequence in the same file. This is done by dumping the fasta file after a '##FASTA' line. You can also add the sequence fasta file to the 'data/genomes/' directory, like it is done in when using GTF format.
Add the new genome to the config file (see Add a new genome to the configuration file for details)

Create database (note the "-gff3" flag):

cd /path/to/snpEff
java -jar snpEff.jar build -gff3 -v dm5.31

Step 2, Option 4: Building a database from RefSeq table from UCSC

This example shows how to create a database for a new genome using UCSC's genome tables. For this example we'll use the Human genome (hg19).

Warning

Using UCSC genome tables is highly discouraged, we recommend you use ENSEMBL versions instead.

Warning

UCSC tables sometimes change for different species. This means that even if these instructions work for human genome, it might not work for other genomes. Obviously creating a new parser for each genome is impractical, so working with UCSC genomes is highly discouraged. We recommend to use ENSEMBL genomes instead.

Warning

UCSC genomes provide only major release version, but NOT sub-versions. E.g. UCSC's "hg19" has major version 19 but there is no "sub-version", whereas ENSEMBL's GRCh37.70 clearly has major version 37 and minor version 70. Not providing a minor version means that they might change the database and two "hg19" genomes are actually be different. This creates all sorts of consistency problems (e.g. the annotations may not be the same that you see in the UCSC genome browser, even though both of them are 'hg19' version). Using UCSC genome tables is highly discouraged, we recommend you use ENSEMBL versions instead.

In order to build a genome using UCSC tables, you can follow these instructions:

Go to UCSC genome browser
Click on "Table" menu
Select parameters as shown here:
Click on "get output" and save the data to file "/path/to/snpEff/data/hg19/genes.refseq".
Add the fasta reference genome. The FASTA file can be either in:

/path/to/snpEff/data/genomes/hg19.fa

or in:

/path/to/snpEff/data/hg19/sequences.fa
Add the new genome to the config file (see Add a new genome to the configuration file for details)

Create database (note the -refSeq flag):

cd /path/to/snpEff
java -jar snpEff.jar build -refSeq -v hg19

Step 3: Checking the database

SnpEff makes sure that the database is built correctly by comparing the internal representation with CDS and Protein sequences. The CDS and Protein sequences are obtained from the same source as the genome reference and gene annotation files.

IMPORTANT: When building databases, SnpEff will attempt to check the database against a CDS FASTA file and a Protein FASTA file. If neither a CDS FASTA file nor a Protein FASTA file is provided, this check will fail and SnpEff will refuse to save the database, by showing error message like this one:

ERROR: CDS check file './data/MY_GENOME/cds.fa' not found.
ERROR: Protein check file './data/MY_GENOME/protein.fa' not found.
ERROR: Database check failed.

Warning

If neither a "Protein" nor a "CDS" sequences are provided, SnpEff will not be able to check the database. By default, SnpEff will refuse to save any unchecked database.

You can override this behaviour by using the build command with these command line options:

-noCheckCds: Skip CDS sequences check.
-noCheckProtein: Skip Protein sequences check.
-noCheckCds -noCheckProtein: When using BOTH command line options, SnpEff to save the new database without checking it (i.e. neither CDS nor protein sequences are checked).

Checking CDS sequences

When building a database, SnpEff will try to check CDS sequences for all transcripts in the database when

building via GFT/GFF/RefSeq: A CDS sequences FASTA file is available.
building via GenBank file: CDS sequences are available within the GenBank file

Info

You can disable this check unsing command line option -noCheckCds

FASTA cds file format:

The file name should be cds.fa (or cds.fa.gz if compressed)
Each transcript should have one CDS sequence
Each FASTA header has the transcript ID either:
- The header contains only the transcript ID
- The header contains Transcript ID and maybe other IDs separated by either spaces, commas, colon, dots, equal sign, or some combination of these
- Some example sequences with valid header examples:
```
>ENST00000448914
ACTGGGGGATACG

>chromosome:GRCh38:14:22449113:22449125:1 transcript:ENST00000448914.1 cds gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3
ACTGGGGGATACG
```

CSD checking output. When run using the -v (verbose) command line option, for each transcript in the FASTA file, SnpEff will output one character

+: OK, the CDS sequence matches the one predicted by SnpEff
.: Missing transcript. SnpEff could not find the transcript ID from the FASTA file. This might indicate a problem parsing the FASTA file header to find the
*: Error. The CDS sequence from inferred from SnpEff's database and the one provided in the CDS file do not match.

After these line a "Summary statistics" line shows the total number of FASTA entries checked, as well as the number of errors (and a percentage), e.g.:

CDS check:  GRCh38.86   OK: 94384   Warnings: 22766 Not found: 103618   Errors: 0   Error percentage: 0.0%

Warning

As a "rule of the thumb", you should not get more than 2% or 3% of errors.

Info

Debugging: You can run SnpEff using -d (debug) command line option to get detailed messages for each CDS sequence comparison. The message shows the transcript ID, CDS sequence inferred by SnpEff's, and the CDS sequence from the FASTA file, as well as the places where they differ.

Checking Protein sequences

This is very similar to the CDS checking in the previous sub-section. When building a database, SnpEff will also try to check Protein sequences for all transcripts when

building via GFT/GFF/RefSeq: A protein sequences FASTA file is available.
building via GenBank file: protein sequences are available within the GenBank file

Info

You can disable this check unsing command line option -noCheckProtein

FASTA protein file:

The file name should be protein.fa (or protein.fa.gz if compressed)
Each transcript should have one protein sequence

Each FASTA header has the transcript ID either:

The header contains only the transcript ID
The header contains Transcript ID and maybe other IDs separated by either spaces, commas, colon, dots, equal sign, or some combination of these

Some example sequences with valid header examples (sequences have been cut for readability):

>ENST00000382044
MPGEQMDPTGSQLDSDFSQQDTPCLIIEDSQPESQVLEDDSGSHFSMLSRHLPNLQTHKE
NPVLDVVSNPEQTAGEERGDGNSGFNEHLKENKVADPVDSSNLDTCGSISQVIEQLPQPN
RTSSVLGMSVESAPAVEEEKGEELEQKEKEKEEDTSGNTTHSLGAEDTASSQLGFGVLEL
...

>ENSP00000371475 pep chromosome:GRCh38:15:43403061:43493171:-1 gene:ENSG00000067369 transcript:ENST00000382044 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:TP53BP1 description:tumor protein p53 binding protein
MPGEQMDPTGSQLDSDFSQQDTPCLIIEDSQPESQVLEDDSGSHFSMLSRHLPNLQTHKE
NPVLDVVSNPEQTAGEERGDGNSGFNEHLKENKVADPVDSSNLDTCGSISQVIEQLPQPN
RTSSVLGMSVESAPAVEEEKGEELEQKEKEKEEDTSGNTTHSLGAEDTASSQLGFGVLEL
...

Protein checking output:

When run using the -v (verbose) command line option, for each transcript in the FASTA file, SnpEff will output one character:

+: OK, the protein sequence matches the one predicted by SnpEff
.: Missing transcript. SnpEff could not find the transcript ID from the FASTA file. This might indicate a problem parsing the FASTA file header to find the
*: Error. The Protein sequence from inferred from SnpEff's database and the one provided in the protein file do not match.

After these line a "Summary statistics" line shows the total number of FASTA entries checked, as well as the number of errors (and a percentage), e.g.:

Protein check:  GRCh38.86       OK: 94371       Not found: 0    Errors: 13      Error percentage: 0.01377352093575182%

Warning

As a "rule of the thumb", the errors should be below 2% or 3%.

How exactly protein sequences are compared The rules used for protein sequence comparison are:

Comparison is case-insensitive
Trailing STOP codon ('*') is removed
Trailing incomplete codon ('?') is removed
Leading incomplete codons ('?') are removed

If these comparisons fails, further attempts are made:

Replace "unknown" codon characters: Codons using old 'X' characters are replaced by newer '?' characters
If any of the sequences only differ by the first codon, they are considered equal (the start codon is translates as 'Met' even when the codon code translates to another Amino acid)
Replace rare amino acids, which often tranlate as stop codons in the middle of the sequence: E.g. replace '*' by 'U'
Try replacing unknown aminco acids ('?') by the ones at the same position in the protein sequence from the FASTA file

If after all these attempts the protein sequence still do not match, they are considered "not equal".

Info

Debugging: You can run SnpEff using -d (debug) command line option to get detailed messages for each protein sequence comparison. The message shows the transcript ID, protein sequence inferred by SnpEff's, and the protein sequence from the FASTA file, as well as the places where they differ.

Example: Building the Human Genome database

This is a full example on how to build the human genome database (using GTF file from ENSEBML), it includes support for regulatory features, sanity check, rare amino acids, etc..

# Go to SnpEff's install dir
cd ~/snpeff

# Create database dir
mkdir data/GRCh37.70
cd data/GRCh37.70

# Download annotated genes
wget ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens/Homo_sapiens.GRCh37.70.gtf.gz
mv Homo_sapiens.GRCh37.70.gtf.gz genes.gtf.gz

# Download proteins
# This is used for:
#   - "Rare Amino Acid" annotations
#   - Sanity check (checking protein predicted from DNA sequences match 'real' proteins)
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/pep/Homo_sapiens.GRCh37.70.pep.all.fa.gz
mv Homo_sapiens.GRCh37.70.pep.all.fa.gz protein.fa.gz

# Download CDSs
# Note: This is used as "sanity check" (checking that CDSs predicted from gene sequences match 'real' CDSs)
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.70.cdna.all.fa.gz
mv Homo_sapiens.GRCh37.70.cdna.all.fa.gz cds.fa.gz

# Download regulatory annotations
wget ftp://ftp.ensembl.org/pub/release-70/regulation/homo_sapiens/AnnotatedFeatures.gff.gz
mv AnnotatedFeatures.gff.gz regulation.gff.gz

# Uncompress
gunzip *.gz

# Download genome
cd ../genomes/
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.70.dna.toplevel.fa.gz
mv Homo_sapiens.GRCh37.70.dna.toplevel.fa.gz GRCh37.70.fa.gz

# Uncompress:
# Why do we need to uncompress?
# Because ENSEMBL compresses files using a block compress gzip which is not compatible with Java's library Gunzip
gunzip GRCh37.70.fa.gz

# Edit snpEff.config file
#
# WARNING! You must do this yourself. Just copying and pasting this into a terminal won't work.
#
# Add lines:
#       GRCh37.70.genome : Homo_sapiens
#       GRCh37.70.reference : ftp://ftp.ensembl.org/pub/release-70/gtf/

# Now we are ready to build the database
cd ~/snpeff
java -Xmx20g -jar snpEff.jar build -v GRCh37.70 2>&1 | tee GRCh37.70.build

Troubleshooting Database builds

Warning

By far the most common problem is that the FASTA file chromosome names are different than the GFF chromosome names. Make sure chromosome names are consistent in all the files you use.

When I build the database using GFF 3 SnpEff reports that Exons don't have sequences

GFF3 files can have sequence information either in the same file or in a separate fasta file.

In order to add sequence information in the GFF file, you can do this:

cat annotations.gff > genes.gff
echo "###"  >> genes.gff
echo "##FASTA"  >> genes.gff
cat sequence.fa  >> genes.gff

When building a database, I get zero protein coding genes

When building a database, snpEff tries to find which transcripts are protein coding. This is done using the 'bioType' information.

The bioType information is not a standard GFF or GTF feature. So I follow ENSEMBL's convention of using the second column ('source') for bioType, as well as the gene_biotype attribute.

If your file was not produced by ENSEMBL, it probably doesn't have this information. This means that snpEff doesn't know which genes are protein coding and which ones are not.

Having no information, snpEff will treat all genes as protein coding (assuming you have -treatAllAsProteinCoding Auto option in the command line, which is the default).

So you will get effects as if all genes were protein coding, then you can filter out the irrelevant genes. Unfortunately, this is the best I can do if there is no 'bioType' information

When building a database, I get too many warnings

There are plenty of GFF and GTF files that, unfortunately, do not follow the specification. SnpEff usually complains about this, but tries hard to correct the problems. So the database may be OK even after you see many warnings.

You can check the database to see if the features (genes, exons, UTRs) have been correctly incorporated, by taking a look at the database:

java -jar snpEff.jar dump myGenome | less