Output summary files

SnpEff creates an additional output file showing overall statistics. This "stats" file is an HTML file which can be opened using a web browser. It can also be created as a CSV file, for easier parsing or manipulation. You can find an example of a 'stats' file here.

Summary file options

By default, SnpEff calculates some statistics and saves them to snpEff_summary.html in the directory where SnpEff is executed. You can view the file by opening it in a web browser. SnpEff also creates a "Genes Statistics" file (snpEff_genes.txt) with gene-level statistics in tab-separated format.

There are some command line options related to the statistics:

-stats <file>: You can change the default name and location of the HTML file. This also changes the name and location of the "genes" (TXT) file.
-noStats: Do not calculate statistics or create stats (summary) files.
-csvStats <file>: Create the statistics file in CSV format with the specified name.
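
As a small illustration of how these options fit into a command line, the sketch below assembles an argument list in Python. The helper name `snpeff_stats_args`, the jar path, the genome name and the input file name are placeholders chosen for this example; only the three option names come from the list above.

```python
from typing import List, Optional


def snpeff_stats_args(stats_html: Optional[str] = None,
                      csv_stats: Optional[str] = None,
                      no_stats: bool = False) -> List[str]:
    """Return the statistics-related options as a list of arguments.

    Hypothetical helper for illustration; it only knows the three
    options documented above.
    """
    if no_stats:
        # -noStats disables statistics, so file options would be pointless.
        return ["-noStats"]
    args: List[str] = []
    if stats_html is not None:
        args += ["-stats", stats_html]
    if csv_stats is not None:
        args += ["-csvStats", csv_stats]
    return args


# Placeholder invocation: jar path, genome name and input file are assumptions.
command = (["java", "-jar", "snpEff.jar"]
           + snpeff_stats_args(stats_html="results/run1.html")
           + ["GRCh37.75", "input.vcf"])
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run`, redirecting standard output to the annotated VCF file.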

Info

By default, the "Genes statistics" file is called snpEff_genes.txt and is written to the directory where SnpEff is executed. If you change the summary file name or path using the -stats or -csvStats command line options, the "Genes statistics" file is written to the same directory as the summary file, and its name is the summary file's "base name" plus ".genes.txt".
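
The naming rule can be sketched in a few lines of Python. This is an illustration rather than SnpEff's own code: the function name `genes_stats_path` is made up, and it assumes that "base name" means the summary file name with a trailing ".html" or ".csv" removed.

```python
from pathlib import PurePosixPath


def genes_stats_path(summary_path: str = "") -> str:
    """Guess where the "Genes statistics" file ends up.

    Assumption: the "base name" is the summary file name without its
    ".html" or ".csv" extension.
    """
    if not summary_path:
        # Default: written to the directory where SnpEff is executed.
        return "snpEff_genes.txt"
    path = PurePosixPath(summary_path)
    base = path.name
    for extension in (".html", ".csv"):
        if base.endswith(extension):
            base = base[:-len(extension)]
            break
    # Same directory as the summary file, base name plus ".genes.txt".
    return str(path.parent / (base + ".genes.txt"))


print(genes_stats_path())
print(genes_stats_path("results/run1.html"))
print(genes_stats_path("run1.csv"))
```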

Summary report

The summary report consists of several sections:

Main summary table
Variant rate by chromosome: A table of the number of variants and the variant rate per chromosome
Variants by type
Number of variants by impact: A count and percentage of variants by impact
Number of variants by functional class: A count and percentage of variants by functional class
Number of variants by annotation
Quality histogram: Variant Quality histogram, i.e. a histogram of the VCF's QUAL field
InDel length histogram: Length of INS and DEL variants
Base variant table: Table of SNV base changes
Transition vs transversions (ts/tv): Transition vs transversions table, split into "known" and all variants (known variants are the ones with non-empty ID field).
Allele frequency: Allele frequency histogram
Allele Count: Allele count histogram
Codon change table: A table of counts for all codon changes
Amino acid change table: A table of counts for all amino acid changes
Chromosome variants plots: Plots of number of variants for each chromosome location
Details by gene: Link to "Genes Statistics" file

Main summary table

The main summary table contains basic information about the SnpEff run and some overall statistics:

Table Entry	Note
Genome	Genome name and version, as specified in the command line
Date	Date and time when the analysis was performed
SnpEff version	SnpEff version
Command line arguments	Command line arguments and options used to annotate
Warnings	Number of WARNING annotation messages (i.e. `WARNING` messages in the ANN field)
Errors	Number of ERROR annotation messages (i.e. `ERROR` messages in the ANN field)
Number of lines (input file)	Number of lines in the input file, excluding comment / header lines
Number of variants (before filter)	Number of variants in the input file. Note that this can differ from the number of lines; e.g. VCF allows for multiple variants per line and IUPAC expansion.
Number of not variants	Number of non-variants, e.g. if the `REF` and `ALT` fields are the same
Number of variants processed	Number of variants processed. This can differ from the number of variants because of filtering and non-variant entries.
Number of known variants	Variants that have a non-empty `ID` field.
Number of multi-allelic VCF entries	Variants that have more than two alleles. Most variants have only two alleles: `REF` and one `ALT`. Multi-allelic variants have multiple `ALT` entries.
Number of annotations	Total number of variant annotations. Note that this is typically higher than the number of variants.
Genome total length	Total genome length (in bases)
Genome effective length	Total length of the chromosomes (in bases). This only counts chromosomes that had variants
Variant rate	Number of variants per genomic length: `Number of variants` / `Genome effective length`
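
To make the last three rows concrete, here is a minimal sketch of the arithmetic as it is described in the table. The chromosome lengths and variant counts are invented numbers, and the function name `variant_rate` is only illustrative.

```python
def variant_rate(num_variants: int, effective_length: int) -> float:
    """Variants per base: number of variants / genome effective length."""
    if effective_length <= 0:
        raise ValueError("effective length must be positive")
    return num_variants / effective_length


# Invented example data, not taken from a real SnpEff run.
chrom_lengths = {"chr1": 1_000_000, "chr2": 500_000, "chr3": 250_000}
variants_per_chrom = {"chr1": 200, "chr2": 50}  # chr3 has no variants

total_length = sum(chrom_lengths.values())
# The effective length only counts chromosomes that had variants.
effective_length = sum(
    length for chrom, length in chrom_lengths.items()
    if variants_per_chrom.get(chrom, 0) > 0
)
rate = variant_rate(sum(variants_per_chrom.values()), effective_length)

print(total_length, effective_length, rate)
```

With these numbers, the total length is 1,750,000 bases, but the effective length is 1,500,000 bases because chr3 carries no variants.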

Warning

The number of input lines, the number of variants, and the number of annotations are different counts and are typically not equal; see details in this FAQ.

Variants by type

This table contains a list of the number of variants, grouped by variant type:

Type	Note
`SNP`	SNP / SNV is a single nucleotide variant, e.g. 'A -> G'
`MNP`	MNP / MNV is a multiple nucleotide variant, e.g. 'AC -> GT'
`INS`	Insertion, e.g. 'A -> AT'
`DEL`	Deletion, e.g. 'AT -> A'
`MIXED`	A mixed variant is a combination of SNP / MNP / INS / DEL, for example '
`INV`	An inversion of reference sequence
`DUP`	A duplication is a region of elevated copy number relative to the reference
`BND`	An arbitrary rearrangement
`INTERVAL`	An interval marker, e.g. an interval from a BED file
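
The sequence-based types in this table can be told apart from the `REF` and `ALT` strings alone. The function below is a deliberately simplified sketch and not SnpEff's actual classification logic: it assumes plain nucleotide alleles, ignores symbolic alleles such as inversions, duplications and breakends, and the name `classify_variant` is made up for this example.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a REF/ALT pair as SNP, MNP, INS, DEL or MIXED (simplified)."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        # Identical REF and ALT is a non-variant, which this sketch rejects.
        raise ValueError("REF and ALT are identical: not a variant")
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "INS"  # ALT is REF plus inserted bases, e.g. 'A -> AT'
    if len(ref) > len(alt) and ref.startswith(alt):
        return "DEL"  # REF is ALT plus deleted bases, e.g. 'AT -> A'
    return "MIXED"    # anything else combines several change types


for ref, alt in [("A", "G"), ("AC", "GT"), ("A", "AT"), ("AT", "A")]:
    print(ref, "->", alt, classify_variant(ref, alt))
```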

Histograms

E.g.: In the stats file, you can see coverage histogram plots like this one:

Annotations & Region

SnpEff annotates variants using "functional annotations", e.g. NON_SYNONYMOUS_CODING, STOP_GAINED, etc. These variants affect regions of the genome (e.g. EXON, INTRON). The two tables count how many effects exist for each type and for each region.

E.g.: In an EXON region, you can have all the following effect types: NON_SYNONYMOUS_CODING, SYNONYMOUS_CODING, FRAME_SHIFT, STOP_GAINED, etc.

The complicated part is that some annotations affect a region that has the same name (yes, I know, this is confusing).

E.g.: In a UTR_5_PRIME region you can have the UTR_5_PRIME and START_GAINED effect types.

This means that the numbers in the two tables are not exactly the same, because the labels don't mean the same thing. See the next figure as an example:

type_vs_region

So the number of effects that affect a UTR_5_PRIME region is 206. Of those, 57 are of effect type START_GAINED and 149 are of effect type UTR_5_PRIME.

How exactly are effect type and effect region related? See the following table:

Effect Type	Region
`NONE`, `CHROMOSOME`, `CUSTOM`, `CDS`	`NONE`
`INTERGENIC`, `INTERGENIC_CONSERVED`	`INTERGENIC`
`UPSTREAM`	`UPSTREAM`
`UTR_5_PRIME`, `UTR_5_DELETED`, `START_GAINED`	`UTR_5_PRIME`
`SPLICE_SITE_ACCEPTOR`	`SPLICE_SITE_ACCEPTOR`
`SPLICE_SITE_DONOR`	`SPLICE_SITE_DONOR`
`SPLICE_SITE_REGION`	`SPLICE_SITE_REGION`
`INTRAGENIC`, `START_LOST`, `SYNONYMOUS_START`, `NON_SYNONYMOUS_START`, `GENE`, `TRANSCRIPT`	`EXON` or `NONE`
`EXON`, `EXON_DELETED`, `NON_SYNONYMOUS_CODING`, `SYNONYMOUS_CODING`, `FRAME_SHIFT`, `CODON_CHANGE`, `CODON_INSERTION`, `CODON_CHANGE_PLUS_CODON_INSERTION`, `CODON_DELETION`, `CODON_CHANGE_PLUS_CODON_DELETION`, `STOP_GAINED`, `SYNONYMOUS_STOP`, `STOP_LOST`, `RARE_AMINO_ACID`	`EXON`
`INTRON`, `INTRON_CONSERVED`	`INTRON`
`UTR_3_PRIME`, `UTR_3_DELETED`	`UTR_3_PRIME`
`DOWNSTREAM`	`DOWNSTREAM`
`REGULATION`	`REGULATION`
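
The relation between the two tables can be sketched as a lookup from effect type to region, followed by a sum per region. This is an illustration built from the table above, not SnpEff's internal code; the names `REGION_TO_EFFECTS`, `EFFECT_TO_REGION` and `region_counts` are made up. The row that maps to `EXON` or `NONE` is left out because its region cannot be decided from the effect type alone.

```python
from collections import Counter
from typing import Dict

# Region -> effect types, copied from the unambiguous rows of the table.
REGION_TO_EFFECTS = {
    "NONE": ["NONE", "CHROMOSOME", "CUSTOM", "CDS"],
    "INTERGENIC": ["INTERGENIC", "INTERGENIC_CONSERVED"],
    "UPSTREAM": ["UPSTREAM"],
    "UTR_5_PRIME": ["UTR_5_PRIME", "UTR_5_DELETED", "START_GAINED"],
    "SPLICE_SITE_ACCEPTOR": ["SPLICE_SITE_ACCEPTOR"],
    "SPLICE_SITE_DONOR": ["SPLICE_SITE_DONOR"],
    "SPLICE_SITE_REGION": ["SPLICE_SITE_REGION"],
    "EXON": [
        "EXON", "EXON_DELETED", "NON_SYNONYMOUS_CODING", "SYNONYMOUS_CODING",
        "FRAME_SHIFT", "CODON_CHANGE", "CODON_INSERTION",
        "CODON_CHANGE_PLUS_CODON_INSERTION", "CODON_DELETION",
        "CODON_CHANGE_PLUS_CODON_DELETION", "STOP_GAINED", "SYNONYMOUS_STOP",
        "STOP_LOST", "RARE_AMINO_ACID",
    ],
    "INTRON": ["INTRON", "INTRON_CONSERVED"],
    "UTR_3_PRIME": ["UTR_3_PRIME", "UTR_3_DELETED"],
    "DOWNSTREAM": ["DOWNSTREAM"],
    "REGULATION": ["REGULATION"],
}

# Inverted lookup: effect type -> region.
EFFECT_TO_REGION = {
    effect: region
    for region, effects in REGION_TO_EFFECTS.items()
    for effect in effects
}


def region_counts(effect_counts: Dict[str, int]) -> Dict[str, int]:
    """Sum per-effect-type counts into per-region counts."""
    totals: Counter = Counter()
    for effect, count in effect_counts.items():
        totals[EFFECT_TO_REGION[effect]] += count
    return dict(totals)


# The UTR_5_PRIME example from the text: 149 + 57 = 206.
print(region_counts({"UTR_5_PRIME": 149, "START_GAINED": 57}))
```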

Gene statistics

SnpEff also generates a tab-separated TXT file with counts of the number of variants affecting each transcript and gene. By default, the file name is snpEff_genes.txt, but it can be changed using the -stats command line option.

Here is an example of this file:

$ head snpEff_genes.txt
# The following table is formatted as tab separated values.
#GeneName   GeneId  TranscriptId    BioType variants_impact_HIGH    variants_impact_LOW variants_impact_MODERATE    variants_impact_MODIFIER    variants_effect_3_prime_UTR_variant variants_effect_5_prime_UTR_premature_start_codon_gain_variant  variants_effect_5_prime_UTR_variant variants_effect_downstream_gene_variant variants_effect_intron_variant  variants_effect_missense_variant    variants_effect_non_coding_exon_variant variants_effect_splice_acceptor_variant variants_effect_splice_donor_variant    variants_effect_splice_region_variant   variants_effect_start_lost  variants_effect_stop_gained variants_effect_stop_lost   variants_effect_synonymous_variant  variants_effect_upstream_gene_variant   bases_affected_DOWNSTREAM   total_score_DOWNSTREAM  length_DOWNSTREAM   bases_affected_EXON total_score_EXON    length_EXON bases_affected_INTRON   total_score_INTRON  length_INTRON   bases_affected_SPLICE_SITE_ACCEPTOR total_score_SPLICE_SITE_ACCEPTOR    length_SPLICE_SITE_ACCEPTOR bases_affected_SPLICE_SITE_DONOR    total_score_SPLICE_SITE_DONOR   length_SPLICE_SITE_DONOR    bases_affected_SPLICE_SITE_REGION   total_score_SPLICE_SITE_REGION  length_SPLICE_SITE_REGION   bases_affected_TRANSCRIPT   total_score_TRANSCRIPT  length_TRANSCRIPT   bases_affected_UPSTREAM total_score_UPSTREAM    length_UPSTREAM bases_affected_UTR_3_PRIME  total_score_UTR_3_PRIME length_UTR_3_PRIME  bases_affected_UTR_5_PRIME  total_score_UTR_5_PRIME length_UTR_5_PRIME
AC000029.1  ENSG00000221069 ENST00000408142 miRNA   0   0   0   2   0   0   0   2   0   0   0   0   0   0   0   0   5000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
AC000068.5  ENSG00000185065 ENST00000431090 antisense   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   5000    0   0   0   0   0   0
AC000081.2  ENSG00000230194 ENST00000433141 processed_pseudogene    0   0   0   8   0   0   0   3   0   0   0   0   0   0   5000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5   0   5000    0   0
AC000089.3  ENSG00000235776 ENST00000424559 processed_pseudogene    0   0   0   1   0   0   0   0   0   0   0   0   0   0   5000    0   0   0   0   0   0
AC002472.1  ENSG00000269103 ENST00000547793 protein_coding  0   0   0   6   0   0   0   5   0   0   0   0   0   0   0   5000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   5000    0   0
AC002472.11 ENSG00000226872 ENST00000450652 antisense   0   0   0   13  0   0   0   5   2   0   0   0   0   0   0   5000    0   0   0   2   0   11199   0   0   0   0   0   0   0   0   0   0   0   0   6   0   5000    0   0
AC002472.13 ENSG00000187905 ENST00000342608 protein_coding  0   1   6   1   0   0   0   0   1   6   0   0   0   1   0   116 1   0   934 0   0   0   0   0   0   1   0   3   0   0   0   0   0   0   0   0   0   0   0
AC002472.13 ENSG00000187905 ENST00000442047 protein_coding  0   1   6   1   0   0   0   0   1   6   0   0   0   1   0   116 1   0   934 0   0   0   0   0   0   1   0   3   0   0   0   0   0   0   0   0   0   0   0

The columns in this table are:

Column name	Meaning
GeneName	Gene name (usually HUGO)
GeneId	Gene's ID
TranscriptId	Transcript's ID
BioType	Transcript's bio-type (if available)
	The following column is repeated for each impact {HIGH, MODERATE, LOW, MODIFIER}
variants_impact_*	Number of variants for each impact category
	The following column is repeated for each annotated effect (e.g. missense_variant, synonymous_variant, stop_lost, etc.)
variants_effect_*	Number of variants for each effect type
	The following columns are repeated for several genomic regions (DOWNSTREAM, EXON, INTRON, UPSTREAM, etc.)
bases_affected_*	Number of bases in the genomic region that are overlapped by variants
total_score_*	Sum of scores overlapping this genomic region. Note: Scores are only available when input files are of type 'BED' (e.g. when annotating ChipSeq experiments)
length_*	Genomic region length
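
Because the file is plain tab-separated text, it can be read with a few lines of Python. The sketch below is one possible reader, assuming the layout shown in the example above: a free-text comment line, then a header line starting with `#GeneName`, then one row per transcript. The function name `read_genes_stats` is made up, and the sample data is a shortened copy (first eight columns only) of two rows from the example.

```python
import csv
import io
from typing import Dict, Iterator, List, Optional, TextIO


def read_genes_stats(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per transcript row, keyed by column name."""
    header: Optional[List[str]] = None
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip empty lines
        if row[0].startswith("#"):
            if row[0] == "#GeneName":
                # Header line: drop the leading '#' from the first column name.
                header = ["GeneName"] + row[1:]
            continue  # any other '#' line is a comment
        if header is None:
            raise ValueError("no '#GeneName' header line before the data rows")
        yield dict(zip(header, row))


# Shortened sample: only the first eight columns of two rows shown above.
sample = (
    "# The following table is formatted as tab separated values.\n"
    "#GeneName\tGeneId\tTranscriptId\tBioType\tvariants_impact_HIGH\t"
    "variants_impact_LOW\tvariants_impact_MODERATE\tvariants_impact_MODIFIER\n"
    "AC000029.1\tENSG00000221069\tENST00000408142\tmiRNA\t0\t0\t0\t2\n"
    "AC002472.13\tENSG00000187905\tENST00000342608\tprotein_coding\t0\t1\t6\t1\n"
)

rows = list(read_genes_stats(io.StringIO(sample)))
for row in rows:
    print(row["GeneName"], row["TranscriptId"], row["variants_impact_MODERATE"])
```

For a real run, replace the `io.StringIO` object with `open("snpEff_genes.txt", newline="")`. All values are returned as strings, so convert the count columns with `int()` before summing them.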