Commands and utilities
SnpEff provides several commands and utilities for genomic data analysis.
The main commands ann/eff (variant annotation) and build (database building) are described in the Commands & command line options and Building databases pages respectively.
Here we describe all the other commands and some scripts provided, that are useful for genomic data analysis.
SnpEff buildNextProt
Build a NextProt database from NextProt XML files. NextProt provides proteomic annotations for human proteins that can be used to annotate variants with functional information (e.g. active sites, binding sites, post-translational modifications).
Usage: snpEff buildNextProt genome_version nextProt_XML_dir
The nextProt_XML_dir argument should point to a directory containing NextProt XML files. The resulting database is stored as nextProt.bin in the genome's data directory and is automatically loaded when annotating variants with the -nextProt flag.
SnpEff closest
Annotates using the closest genomic region (e.g. exon, transcript ID, gene name) and distance in bases.
Usage: snpEff closest [options] genome_version file.vcf
Options:
-bed : Input format is BED. Default: VCF
-tss : Measure distance from TSS (transcription start site)
Example:
$ java -Xmx8g -jar snpEff.jar closest GRCh37.66 test.vcf
##INFO=<ID=CLOSEST,Number=4,Type=String,Description="Closest exon: Distance (bases), exons Id, transcript Id, gene name">
1 12078 . G A 25.69 PASS AC=2;AF=0.048;CLOSEST=0,exon_1_11869_12227,ENST00000456328,DDX11L1
1 16097 . T G 42.42 PASS AC=9;AF=0.0113;CLOSEST=150,exon_1_15796_15947,ENST00000423562,WASH7P
1 40261 . C A 366.26 PASS AC=30;AF=0.484;CLOSEST=4180,exon_1_35721_36081,ENST00000417324,FAM138A
1 63880 . C T 82.13 PASS AC=10;AF=0.0400;CLOSEST=0,exon_1_62948_63887,ENST00000492842,OR4G11P
For instance, in the third line (1:16097 T G), it added the tag CLOSEST=150,exon_1_15796_15947,ENST00000423562,WASH7P
which means that the variant is 150 bases away from exon "exon_1_15796_15947".
The exon belongs to transcript "ENST00000423562" of gene "WASH7P".
Info
If multiple markers are available (at the same distance) the one belonging to the longest mRNA transcript is shown.
The input can also be a BED file, the output file has the same information as CLOSEST info field, added to the fourth column of the output BED file:
$ snpeff closest -bed GRCh37.66 test.bed
1 12077 12078 line_1;0,exon_1_11869_12227,ENST00000456328,DDX11L1
1 16096 16097 line_2;150,exon_1_15796_15947,ENST00000423562,WASH7P
1 40260 40261 line_3;4180,exon_1_35721_36081,ENST00000417324,FAM138A
1 63879 63880 line_4;0,exon_1_62948_63887,ENST00000492842,OR4G11P
SnpEff count
As the name suggests, snpEff count command counts how many reads and bases from a BAM file hit a gene, transcript, exon, intron, etc.
Input files can be in BAM, SAM, VCF, BED or BigBed formats.
A summary HTML file with charts is generated. Here are some examples:


If you need to count how many reads (and bases) from a BAM file hit each genomic region, you can use 'count' utility.
The command line is quite simple. E.g. in order to count how many reads (from N BAM files) hit regions of the human genome, you simply run:
java -Xmx8g -jar snpEff.jar count GRCh37.68 readsFile_1.bam readsFile_2.bam ... readsFile_N.bam > countReads.txt
Options:
-n <name> : Output file base name.
-p : Calculate probability model (binomial).
-i <file> : Add intervals from a BED file. Can be used multiple times.
The output is a TXT (tab-separated) file, that looks like this:
chr start end type IDs Reads:readsFile_1.bam Bases:readsFile_1.bam Reads:readsFile_2.bam Bases:readsFile_2.bam ...
1 1 11873 Intergenic DDX11L1 130 6631 50 2544
1 1 249250621 Chromosome 1 2527754 251120400 2969569 328173439
1 6874 11873 Upstream NR_046018;DDX11L1 130 6631 50 2544
1 9362 14361 Downstream NR_024540;WASH7P 243 13702 182 9279
1 11874 12227 Exon exon_1;NR_046018;DDX11L1 4 116 2 102
1 11874 14408 Gene DDX11L1 114 7121 135 6792
1 11874 14408 Transcript NR_046018;DDX11L1 114 7121 135 6792
1 12228 12229 SpliceSiteDonor exon_1;NR_046018;DDX11L1 3 6 0 0
1 12228 12612 Intron intron_1;NR_046018;DDX11L1 13 649 0 0
1 12611 12612 SpliceSiteAcceptor exon_2;NR_046018;DDX11L1 0 0 0 0
1 12613 12721 Exon exon_2;NR_046018;DDX11L1 3 24 1 51
1 12722 12723 SpliceSiteDonor exon_2;NR_046018;DDX11L1 3 6 0 0
1 12722 13220 Intron intron_2;NR_046018;DDX11L1 22 2110 20 987
1 13219 13220 SpliceSiteAcceptor exon_3;NR_046018;DDX11L1 5 10 1 2
1 13221 14408 Exon exon_3;NR_046018;DDX11L1 82 4222 113 5652
1 14362 14829 Exon exon_11;NR_024540;WASH7P 37 1830 7 357
1 14362 29370 Transcript NR_024540;WASH7P 704 37262 524 34377
1 14362 29370 Gene WASH7P 704 37262 524 34377
1 14409 19408 Downstream NR_046018;DDX11L1 122 7633 39 4254
- Column 1: Chromosome name
- Column 2: Genomic region start
- Column 3: Genomic region end
- Column 4: Genomic region type (e.g. Exon, Gene, SpliceSiteDonor, etc.)
- Column 5: ID (e.g. exon ID ; transcript ID; gene ID)
- Column 6: Count of reads (in file readsFile_1.bam) intersecting genomic region.
- Column 7: Count of bases (in file readsFile_1.bam) intersecting genomic region, i.e. each read is intersected and the resulting number of bases added.
- Column ...: (repeat count reads and bases for each BAM file provided)
Totals and Binomial model
Using command line option -p, you can calculate p-values based on a Binomial model.
For example (output edited for the sake of brevity):
$ java -Xmx8g -jar snpEff.jar count -v BDGP5.69 fly.bam > countReads.txt
00:00:00.000 Reading configuration file 'snpEff.config'
...
00:00:12.148 Calculating probability model for read length 50
...
type p.binomial reads.fly expected.fly pvalue.fly
Chromosome 1.0 205215 205215 1.0
Downstream 0.29531659795589793 59082 60603 1.0
Exon 0.2030262729897713 53461 41664 0.0
Gene 0.49282883664487515 110475 101136 0.0
Intergenic 0.33995644860241336 54081 69764 0.9999999963234701
Intron 0.3431415343615103 72308 70418 9.06236369003514E-19
RareAminoAcid 9.245222303207472E-7 3 0 9.879186871519377E-4
SpliceSiteAcceptor 0.014623209601955131 3142 3001 0.005099810118785825
SpliceSiteDonor 0.015279075154423956 2998 3135 0.9937690786738507
Transcript 0.49282883664487515 110475 101136 0.0
Upstream 0.31499087549896493 64181 64641 0.9856950416729887
Utr3prime 0.03495370828296416 8850 7173 1.1734134297889064E-84
Utr5prime 0.02765432673262785 8146 5675 7.908406840800345E-215
The columns in for this table are (in the previous example the input file was 'fly.bam' so fileName is 'fly'):
- type : Type of interval
- p.binomial : Probability that a random read hits this 'type' of interval (in binomial model)
- reads.fileName : Total number of reads in 'fileName' (user provided BAM/SAM file)
- expected.fileName : Expected number of reads hitting this 'type' of interval (for user provided BAM/SAM file)
- pvalue.fileName : p-value that 'reads.fileName' reads or more hit this 'type' of interval (for user provided BAM/SAM file)
- Column ...: (repeat last three column for each BAM/SAM file provided by the user)
User defined intervals
You can add user defined intervals using -i file.bed command line option.
The option can be used multiple times, thus allowing multiple BED files to be added.
Example : You want to know how many reads intersect each peak from a peak detection algorithm:
java -Xmx8g -jar snpEff.jar count -i peaks.bed GRCh37.68 reads.bam
SnpEff databases
This command provides a list of configured databases, i.e. available in snpEff.config file.
Usage: snpEff databases [galaxy|html]
The output format can be selected: plain text (default), galaxy (Galaxy menu format), or html (HTML table format).
Example:
$ java -jar snpEff.jar databases
Genome Organism Status Bundle Database download link
------ -------- ------ ------ ----------------------
129S1_SvImJ_v1.99 Mus_musculus_129s1svimj https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_129S1_SvImJ_v1.99.zip
AIIM_Pcri_1.0.99 Pavo_cristatus https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_AIIM_Pcri_1.0.99.zip
AKR_J_v1.99 Mus_musculus_akrj https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_AKR_J_v1.99.zip
AP006557.1 SARS coronavirus TWH genomic RNA, complete genome. https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_AP006557.1.zip
AP006558.1 SARS coronavirus TWJ genomic RNA, complete genome. https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_AP006558.1.zip
AP006559.1 SARS coronavirus TWK genomic RNA, complete genome. https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_AP006559.1.zip
AP006560.1 SARS coronavirus TWS genomic RNA, complete genome. https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_AP006560.1.zip
AP006561.1 SARS coronavirus TWY genomic RNA, complete genome. https://snpeff-public.s3.amazonaws.com/databases/v5_0/snpEff_v5_0_AP006561.1.zip
...
SnpEff download
This command downloads and installs a database.
Usage: snpEff download [options] {snpeff | genome_version}
You can use snpEff download snpeff to download/update SnpEff itself, or specify a genome version to download a pre-built database.
Warning
Note that the database must be configured in snpEff.config and available at the download site.
Example: Download and install C.Elegans genome:
$ java -jar snpEff.jar download -v WBcel215.69
SnpEff dump
Dump the contents of a database to a text file, a BED file or a tab separated TXT file (that can be loaded into R).
Usage: snpEff dump [options] genome_version
Options:
-bed : Dump in BED format
-chr <string> : Prepend 'string' to chromosome name
-txt : Dump as a TXT table
-0 : Output zero-based coordinates
-1 : Output one-based coordinates
BED file example:
$ java -jar snpEff.jar download -v GRCh37.70
$ java -Xmx8g -jar snpEff.jar dump -v -bed GRCh37.70 > GRCh37.70.bed
00:00:00.000 Reading database for genome 'GRCh37.70' (this might take a while)
00:00:32.476 done
00:00:32.477 Building interval forest
00:00:45.928 Done.
The output file looks like a typical BED file (chr \t start \t end \t name).
Warning
Keep in mind that BED file coordinates are zero based, semi-open intervals.
So a 2 base interval at (one-based) positions 100 and 101 is expressed as a BED interval [99 - 101].
$ head GRCh37.70.bed
1 0 249250621 Chromosome_1
1 111833483 111863188 Gene_ENSG00000134216
1 111853089 111863002 Transcript_ENST00000489524
1 111861741 111861861 Cds_CDS_1_111861742_111861861
1 111861948 111862090 Cds_CDS_1_111861949_111862090
1 111860607 111860731 Cds_CDS_1_111860608_111860731
1 111861114 111861300 Cds_CDS_1_111861115_111861300
1 111860305 111860427 Cds_CDS_1_111860306_111860427
1 111862834 111863002 Cds_CDS_1_111862835_111863002
1 111853089 111853114 Utr5prime_exon_1_111853090_111853114
TXT file example:
$ java -Xmx8g -jar snpEff.jar dump -v -txt GRCh37.70 > GRCh37.70.txt
00:00:00.000 Reading database for genome 'GRCh37.70' (this might take a while)
00:00:31.961 done
00:00:31.962 Building interval forest
00:00:45.467 Done.
$ head GRCh37.70.txt
chr start end strand type id geneName geneId numberOfTranscripts canonicalTranscriptLength transcriptId cdsLength numerOfExons exonRank exonSpliceType
1 1 249250622 +1 Chromosome 1
1 111833484 111863189 +1 Gene ENSG00000134216 CHIA ENSG00000134216 10 1431
1 111853090 111863003 +1 Transcript ENST00000489524 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9
1 111861742 111861862 +1 Cds CDS_1_111861742_111861861 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9
1 111861949 111862091 +1 Cds CDS_1_111861949_111862090 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9
1 111853090 111853115 +1 Utr5prime exon_1_111853090_111853114 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9 1 ALTTENATIVE_3SS
1 111854311 111854341 +1 Utr5prime exon_1_111854311_111854340 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9 2 SKIPPED
1 111860608 111860732 +1 Exon exon_1_111860608_111860731 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9 5 RETAINED
1 111853090 111853115 +1 Exon exon_1_111853090_111853114 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9 1 ALTTENATIVE_3SS
1 111861742 111861862 +1 Exon exon_1_111861742_111861861 CHIA ENSG00000134216 10 1431 ENST00000489524 862 9 7 RETAINED
The format is:
| Column | Meaning |
|---|---|
| chr | Chromosome name |
| start | Marker start (one-based coordinate) |
| end | Marker end (one-based coordinate) |
| strand | Strand (positive or negative) |
| type | Type of marker (e.g. exon, transcript, etc.) |
| id | ID. E.g. if it's a Gene, then it may be ENSEMBL's gene ID |
| geneName | Gene name, if marker is within a gene (exon, transcript, UTR, etc.), empty otherwise (e.g. intergenic) |
| geneId | Gene ID, if marker is within a gene |
| numberOfTranscripts | Number of transcripts in the gene |
| canonicalTranscriptLength | CDS length of canonical transcript |
| transcriptId | Transcript ID, if marker is within a transcript |
| cdsLength | CDS length of the transcript |
| numerOfExons | Number of exons in this transcript |
| exonRank | Exon rank, if marker is an exon |
| exonSpliceType | Exon splice type, if marker is an exon |
SnpEff genes2bed
Dumps a selected set of genes (or all genes) as BED intervals. By default it outputs gene-level coordinates, but can also output exons, CDS regions, introns, or transcripts.
Usage: snpEff genes2bed genomeVer [-f genes.txt | geneList]
Options:
-cds : Show coding exons (no UTRs).
-e : Show exons for every transcript.
-f <file.txt> : A TXT file having one gene ID (or name) per line.
-i : Show introns for every transcript.
-pc : Use only protein coding genes.
-tr : Show transcript coordinates.
-ud <num> : Expand gene interval upstream and downstream by 'num' bases.
geneList : A list of gene IDs or names as command line arguments.
Info
Options -cds, -e, and -tr are mutually exclusive. If no gene list is provided (neither via -f nor as arguments), all genes in the genome are used.
Example:
$ java -Xmx8g -jar snpEff.jar genes2bed GRCh37.66 DDX11L1 WASH7P
#chr start end geneName;geneId;strand
1 11868 14411 DDX11L1;ENSG00000223972;+
1 14362 29805 WASH7P;ENSG00000227232;-
Example showing exons:
$ java -Xmx8g -jar snpEff.jar genes2bed -e GRCh37.66 DDX11L1
#chr start end geneName;geneId;transcriptId;exonRank;strand
SnpEff cds
Performs a database sanity check by calculating CDS sequences from a SnpEff database and comparing them to a FASTA file containing the "correct" sequences.
Usage: snpEff cds [options] genome_version cds_file
This command is invoked automatically when building databases (snpEff build), so there is usually no need to invoke it manually.
SnpEff protein
Performs a database sanity check by calculating protein sequences from a SnpEff database and comparing them to a FASTA file containing the "correct" sequences.
Usage: snpEff protein [options] genome_version protein_file
Options:
-codonTables : Try all codon tables on each chromosome and calculate error rates.
This command is invoked automatically when building databases (snpEff build), so there is usually no need to invoke it manually.
The -codonTables option is useful for debugging genomes that may use non-standard codon tables.
SnpEff len
Calculates the genomic length of every type of marker (Gene, Exon, Utr, etc.). Length is calculated by overlapping all markers and counting the number of bases (e.g. a base is counted as 'Exon' if any exon falls onto that base). This command also reports the probability of a Binomial model.
Usage: snpEff len [options] genome_version
Options:
-r <num> : Assume a read size of 'num' bases.
-iter <num> : Perform 'num' iterations of random sampling.
-reads <num> : Each random sampling iteration has 'num' reads.
Info
Parameter -r num adjusts the model for a read length of 'num' bases. That is, if two markers of the same type are closer than 'num' bases, it joins them by including the bases separating them.
E.g.:
$ java -Xmx1g -jar snpEff.jar len -r 100 BDGP5.69
marker size count raw_size raw_count binomial_p
Cds 22767006 56955 45406378 122117 0.13492635563570918
Chromosome 168736537 15 168736537 15 1.0
Downstream 49570138 5373 254095562 50830 0.29377240330587084
Exon 31275946 61419 63230008 138474 0.18535372691689175
Gene 82599213 11659 87017182 15222 0.4895158717166277
Intergenic 56792611 11637 56792611 11650 0.3365756581812509
Intron 55813748 42701 168836797 113059 0.33077452573297744
SpliceSiteAcceptor 97977 48983 226118 113059 5.806507691929223E-4
SpliceSiteDonor 101996 50981 226118 113059 6.044689657225808E-4
Transcript 82599213 11659 232066805 25415 0.4895158717166277
Upstream 52874082 5658 254044876 50830 0.3133528928592389
Utr3prime 5264120 13087 10828991 24324 0.031197274126824114
Utr5prime 3729197 19324 6368070 33755 0.02210070839607192
- marker : Type of marker interval
- size : Size of all intervals in the genome, after overlap and join.
- count : Number of intervals in the genome, after overlap and join.
- raw_size : Size of all intervals in the genome. Note that this could be larger than the genome.
- raw_count : Number of intervals in the genome.
SnpEff pdb
Build interaction database based on PDB (Protein Data Bank) or AlphaFold structure data. This command analyzes protein structures to identify amino acid pairs that are in close physical proximity and adds this interaction information to the SnpEff database for variant annotation.
Usage: snpEff pdb [options] genome_version
Options:
-aaSep <number> : Minimum number of AA separation within sequence.
-idMap <file> : ID map file (PDB ID to transcript ID mapping).
-maxDist <number> : Maximum distance in Angstrom for atom pairs.
-maxErr <number> : Maximum AA sequence difference between PDB and genome.
-org <name> : Organism common name.
-orgScientific <name> : Organism scientific name.
-pdbDir <path> : Path to PDB files.
-res <number> : Maximum PDB file resolution.
For detailed instructions on obtaining PDB/AlphaFold data and building interaction databases, see Building databases: PDB and AlphaFold.
SnpEff seq
Translates DNA sequences to protein using the genome's codon table. This is a simple utility for quick sequence translations from the command line.
Usage: snpEff seq [-r] genome seq_1 seq_2 ... seq_N
Options:
-r : Reverse Watson-Crick complement before translating.
The command shows the protein translation in three formats: 3-letter amino acid code, 1-letter code with spacing, and 1-letter code.
Example:
$ java -jar snpEff.jar seq GRCh38.105 ATGCGAGCT
Sequence : ATGCGAGCT
Protein (3-Letter) : Met-Arg-Ala
Protein (1-Letter-space) : M R A
Protein (1-Letter) : MRA
SnpEff show
Show a text representation of genes or transcripts including coordinates, DNA sequence and protein sequence. Useful for visual inspection and debugging of gene annotations.
Usage: snpEff show genome_version gene_1 ... gene_N ... trId_1 ... trId_N
The command accepts both gene IDs and transcript IDs. It displays an ASCII-art representation of the transcript structure including exons, introns, coding regions, UTRs, and the corresponding DNA and protein sequences. Coordinates are zero-based.
SnpEff translocReport
Create a translocation report with SVG visualizations from a VCF file containing structural variants (BND, DUP, DEL). The report shows gene fusions and translocations with transcript-level detail.
Usage: snpEff translocReport [options] genome_version input.vcf
Options:
-onlyOneTr : Report only one transcript pair per translocation (used for debugging).
-outPath <dir> : Create individual output SVG files for each translocation in 'dir'.
-report <file> : Output report file name. Default: translocations_report.html
The command generates an HTML report (by default translocations_report.html) containing SVG visualizations of each translocation. If -outPath is specified, individual SVG files are also saved to that directory.
Scripts
Warning
The Perl utility scripts previously distributed with SnpEff (sam2fastq.pl, fasta2tab.pl, fastaSplit.pl, vcfEffOnePerLine.pl, etc.) have been deprecated and are no longer included in the main scripts/ directory.