Features, versions and roadmap
1. Roadmap
Major features planned:
- SnpEff: Improvements in loss of function analysis
- SnpSift: Switch to ANTLR 4.X, handle arbitrary expressions.
2. Features
Features by version
Version: 4.3 (2016-09).
- Improved support for gene fusions
- Annotation of large structural variants
Version: 4.1 (2015-01).
- Standard annotation format: 'ANN' INFO field
- A better / more robust HGVS implementation
- Variants are re-aligned to the most 3' position (in agreement with HGVS).
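
As an illustration of the 'ANN' INFO field listed above, the following is a minimal sketch of how a downstream script might split it into named subfields. The subfield names follow the published ANN format specification as recalled here and should be checked against the header of your own VCF file; the function name `parse_ann` and the sample INFO string are hypothetical and made up for this example.

```python
# Subfield names as given in the ANN format specification (verify against
# the "##INFO=<ID=ANN,...>" header line of your own VCF file).
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank",
    "HGVS.c", "HGVS.p", "cDNA.pos/cDNA.length", "CDS.pos/CDS.length",
    "AA.pos/AA.length", "Distance", "ERRORS/WARNINGS/INFO",
]


def parse_ann(info):
    """Return one dict per annotation found in the ANN entry of a VCF INFO string.

    Hypothetical helper: entries in INFO are separated by ';', annotations
    inside ANN by ',', and subfields inside an annotation by '|'.
    """
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            annotations = entry[len("ANN="):].split(",")
            return [dict(zip(ANN_FIELDS, a.split("|"))) for a in annotations]
    return []  # no ANN entry present


# Made-up INFO column, for illustration only
info = ("DP=35;ANN=A|missense_variant|MODERATE|GENE1|ID1|transcript|TR1|"
        "protein_coding|2/5|c.10G>A|p.Val4Ile|10/900|10/600|4/199||")

for ann in parse_ann(info):
    print(ann["Gene_Name"], ann["Annotation"], ann["HGVS.p"])
```

A dedicated VCF library is usually a better choice for production pipelines; the sketch only shows how the field is laid out.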
Version: 4.0 (2014-11).
- Consistent 'help' screen when using command line option
- Effects are sorted with canonical transcripts first (for the same level of effect / impact)
- Corrected problem on LOF annotations for gene names having spaces.
Version: 4.0 (2014-07).
- HGVS notations (now the default)
- Sequence Ontology terms (now the default)
- SnpEff downloads databases automatically
- Automatic third party databases downloads
- Support for new genome versions (such as GRCh38 / hg38)
- NextProt, Loss of function (LOF) and Nonsense mediated decay (NMD) annotations by default
- Improved protein coding transcript detection (when building databases)
- Full support for MIXED variants: e.g. some variants may be a combination of Insertions, Deletions, SNPs or MNPs.
- Major code refactoring
- SnpSift: Improved annotate support.
- SnpSift: Several improvements on annotation methods.
- Added support for gVCF files
Version: 3.6 (2013-05-23).
- Improved support for MIXED variants: e.g. some variants may be a combination of Insertions, Deletions, SNPs or MNPs.
- Improved HGVS notation
- SnpSift: Calculate concordance statistics between two VCF files (e.g. a sequencing and a chip-genotyping experiment)
- SnpSift: New command that checks a VCF file for several "common" problems
- Moved to Java 7, mostly due to several problems in Java 6 libraries when reading bgzip files.
Version: 3.5 (2013-03-23).
- Improvements in cancer sample annotations
- Added annotation
- SnpSift: Annotate if a variant is "private" to a family (or cohort)
- SnpSift: Case / control summary statistics (of annotated files).
- SnpSift: Added support for tabix indexed files, with automatic detection.
Version: 3.4 (2013-11-23).
- Automatic database download ("-download" option)
- Cancer samples: can be defined using a TXT file instead of VCF header.
- Improved GenBank
- Extended configuration options
- Better frame handling for GTF/GFF files
- Improvements in HGVS notation
- Galaxy support: Improvements and bug fixes
- SnpSift: Better support for dbNSFP (v2.1)
Version: 3.3 (2013-06-12).
- Over 8,500 genomes supported.
- All ENSEMBL (version 18) : Bacteria, Fungi, Metazoa, Plants and Protist genomes added.
- NextProt annotations added
- Motif annotations support added
- SnpSift: GeneSet annotations
- SnpEff count: Genomic region statistics counting reads, variants, intervals, etc.
Version: 3.2 (2013-01-14).
- Cancer variants analysis
- GATK compatible (-o gatk)
- HGVS notations support
Version: 3.1 (2012-11-02).
- All NCBI bacterial genomes added: Over 2,500 genomes added!
- Loss of function effect and tag added (experimental command line option '-lof')
- Nonsense-mediated decay effect and tag added (experimental command line option '-lof')
- ENSEMBL version 68 genomes added
- SnpEff 'countReads': counts the number of reads and bases (from a BAM file) on each gene, transcript, exon, intron, etc.
- SnpEff Intron and Intergenic annotations improved.
Version: 3.0, revision 'f' (2012-08-23).
- GATK output format compatibility option: '-o gatk'
- Fixed problem when parsing comment after GFF headers.
- Added GENCODE tags for GTF parsing
- Splice site analysis tools
- Analysis of U12 branch sites.
- Minor problems caused by empty VCF headers solved.
- Fixed bug in calculation of degenerate sites.
- Fixed problem in canonical transcripts.
- Plasmodium falciparum hand curated versions (by Daniel Park, Broad): Pf3D7v72 and Pf3D7v90
- Maven project, created by Louis Letourneau.
- Project source code changed to SVN (Louis Letourneau).
- Databases will be 'backwards compatible' from now on.
- New format for VCF files: added CDS length in amino acid (AA_LEN field).
- Canonical transcript filter (command line option "-canon").
- Improved GenBank parsing.
- SnpSift 'dbnsfp': Annotate using dbNSFP (Louis Letourneau).
- SnpSift 'gwasCat': Added GWA catalog annotations.
- SnpSift 'extractFields': extract fields to TXT files (tab separated)
- SnpSift 'sift': Annotate using SIFT database.
- SnpSift 'Annotate' and 'AnnMem': Now support adding all fields in a VCF file as annotations.
Version: 2.1b (2012-04-26).
- Revision "2.1c" : Maven project (by Louis Letourneau)
- Revision "2.1c" : Improved Galaxy wrappers (by Peter Briggs)
- Revision "2.1b" : Improved RefSeq parsing
- Revision "2.1a" : Multi-thread race condition solved.
- Note: If you are using hg19, it is recommended to download the latest database (due to improved RefSeq parsing in 2.1b).
- Added multi-threaded support (command line option '-t').
- GenBank support for building databases. See details here.
- Config file simplified
- E.Coli database added
- Galaxy download database option added.
- Added all ENSEMBL version 66 genomes
- Database 'download' issue solved. SourceForge servers were apparently choking on URLs that contained double slashes; a workaround was implemented.
- SnpSift GWAS catalog: Annotate using GWAS Catalog.
- SnpSift: Added 'varType' to annotate variant type (SNP/MNP/INS/DEL), as well as HOM/HET if possible.
- Faster VCF processing.
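
To make the 'varType' annotation listed above more concrete, here is a minimal sketch of how a variant type (SNP/MNP/INS/DEL) and zygosity (HOM/HET) could be derived from the REF, ALT and genotype fields of a VCF record. This is not SnpSift's actual implementation: the function names `variant_type` and `zygosity` are hypothetical, and the rules are deliberately simplified (for instance, MIXED variants are not handled).

```python
def variant_type(ref, alt):
    """Classify a single REF/ALT pair by comparing allele lengths (simplified)."""
    if len(ref) == len(alt):
        # Same length: one base is a SNP, several bases an MNP
        return "SNP" if len(ref) == 1 else "MNP"
    # Different lengths: a longer ALT is an insertion, a shorter one a deletion
    return "INS" if len(ref) < len(alt) else "DEL"


def zygosity(gt):
    """Return 'HOM', 'HET' or None for a diploid VCF genotype string.

    Accepts both phased ('|') and unphased ('/') separators. In this sketch
    a homozygous-reference genotype such as '0/0' also counts as 'HOM'.
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None  # missing or non-diploid genotype: cannot decide
    return "HOM" if alleles[0] == alleles[1] else "HET"


print(variant_type("A", "G"), zygosity("0/1"))
print(variant_type("AT", "A"), zygosity("1|1"))
```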
Version: 2.0.5 (2011-11-25).
- Support for RARE amino acids (see details here)
- Database for Soybean (Glycine max) added
Version: 2.0.5 (2011-11-25).
- Database download command, e.g. "java -jar snpEff.jar download GRCH37.64"
- Added all ENSEMBL version 65 genomes
- RefSeq annotations support added.
- Rogue transcript filter: By default SnpEff filters out some suspicious transcripts from annotation databases. This should reduce the false positive rate.
- Amino acid changes in HGVS style (VCF output)
- Optimized parsing for VCF files with large number of samples (genotypes).
- Option to suppress summary calculation ('-noStats'), can speed up processing considerably in cases where VCF files have hundreds or thousands of genotype fields.
- Option '-onlyCoding' is set to 'auto' to reduce number of false positives (see next).
- Option '-onlyCoding' can be assigned a value: If the value is 'true', report only 'protein_coding' transcripts as protein coding changes. If 'false', report all transcripts as if they were coding. Default: 'auto', i.e. if any transcripts are marked as 'protein_coding' then set it to 'true'; if no transcripts are marked as 'protein_coding' then set it to 'false'.
- Added BED output format. This is useful to annotate the output of a ChIP-Seq experiment (e.g. after performing peak calling with MACS, you want to know where the peaks hit).
- Added BED Annotation output format. This is useful to get all annotation intervals that intersect a set of variants (or genomic regions).
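
The 'auto' behaviour of '-onlyCoding' described above can be summarised in a few lines. The following is a sketch of the decision rule as stated in the text, not SnpEff's source code; the function name `resolve_only_coding` and the representation of transcripts as a list of biotype strings are assumptions made for illustration.

```python
def resolve_only_coding(option, transcript_biotypes):
    """Resolve the effective '-onlyCoding' setting to a boolean.

    option: 'true', 'false' or 'auto' (the default described in the text).
    transcript_biotypes: biotype label of each transcript in the database,
    or None when a transcript carries no biotype (hypothetical representation).
    """
    if option == "true":
        return True
    if option == "false":
        return False
    if option != "auto":
        raise ValueError("expected 'true', 'false' or 'auto', got %r" % option)
    # 'auto': true only if the database marks at least one transcript
    # as protein_coding, false if no transcript is marked that way.
    return any(b == "protein_coding" for b in transcript_biotypes)


print(resolve_only_coding("auto", ["protein_coding", "lincRNA"]))
print(resolve_only_coding("auto", [None, None]))
```

With this rule, databases that carry no biotype information at all fall back to treating every transcript as coding, which matches the stated intent of reducing false positives only where the information exists.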
SnpSift filter:
- Added generic index ('*') for variables, genotypes and effects. E.g.: ( GEN[*].GT = '1|1' )
- Added support for 'EFF' and subfields (from SnpEff processed files). E.g.: ( EFF[*].EFFECT = 'NON_SYNONYMOUS_CODING' )
SnpSift intidx: Designed to extract a small number of intervals from huge VCF files. Added indexing using memory mapped I/O files for retrieving intervals from huge VCF files. Works really fast!
Version: 2.0.3 (2011-10-08)
- Functional classes added in VCF output (i.e. NONE, SILENT, MISSENSE, NONSENSE)
- Added MODIFIER effect 'impact'.
- Rice genome added.
- Added all ENSEMBL version 64 genomes.
- Several minor issues solved.
- Report usage statistics to server (can be disabled using the '-noLog' option).
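
The functional classes listed above (NONE, SILENT, MISSENSE, NONSENSE) can be illustrated with a small sketch that compares the reference and altered amino acid of a coding change. This is a simplified assumption about how the classes relate to amino acid changes, not SnpEff's implementation; the function name `functional_class` and the use of '*' for a stop codon are choices made for this example.

```python
def functional_class(ref_aa, alt_aa):
    """Assign a functional class from one-letter amino acid codes.

    ref_aa / alt_aa: amino acid before and after the change, '*' for a stop
    codon, or None when the variant does not affect a coding sequence.
    """
    if ref_aa is None or alt_aa is None:
        return "NONE"      # not a coding change
    if ref_aa == alt_aa:
        return "SILENT"    # synonymous: the amino acid is unchanged
    if alt_aa == "*":
        return "NONSENSE"  # the change introduces a stop codon
    return "MISSENSE"      # the amino acid is replaced by a different one


for change in [("V", "V"), ("V", "I"), ("Q", "*"), (None, None)]:
    print(change, functional_class(*change))
```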
Version: 2.0.2 (2011-09-09)
- VCF output format
- GATK integration. Now you can use SnpEff from GATK's VariantAnnotator.
- Default input file is STDIN, i.e. the inputFile parameter can now be omitted.
- Gene list outputs to a TXT file (tab separated) instead of the summary (HTML) file.
- Command line format changed for various options
- Option '-sort' deprecated.
Version: 1.9.6 (2011-08-08)
- Ensembl genomes v63 added.
Warning! Genome names changed to agree with Ensembl naming convention, here are the names:
Full name                      Short name
Ailuropoda_melanoleuca         ailMel1.63
Anolis_carolinensis            AnoCar2.0.63
Bos_taurus                     Btau_4.0.63
Caenorhabditis_elegans         WS220.63
Callithrix_jacchus             C_jacchus3.2.1.63
Canis_familiaris               BROADD2.63
Cavia_porcellus                cavPor3.63
Choloepus_hoffmanni            choHof1.63
Ciona_intestinalis             JGI2.63
Ciona_savignyi                 CSAV2.0.63
Danio_rerio                    Zv9.63
Dasypus_novemcinctus           dasNov2.63
Dipodomys_ordii                dipOrd1.63
Drosophila_melanogaster        BDGP5.25.63
Echinops_telfairi              TENREC.63
Equus_caballus                 EquCab2.63
Erinaceus_europaeus            HEDGEHOG.63
Felis_catus                    CAT.63
Gallus_gallus                  WASHUC2.63
Gasterosteus_aculeatus         BROADS1.63
Gorilla_gorilla                gorGor3.63
Homo_sapiens                   GRCh37.63
Loxodonta_africana             loxAfr3.63
Macaca_mulatta                 MMUL_1.63
Macropus_eugenii               Meug_1.0.63
Meleagris_gallopavo            UMD2.63
Microcebus_murinus             micMur1.63
Monodelphis_domestica          BROADO5.63
Mus_musculus                   NCBIM37.63
Myotis_lucifugus               Myoluc2.0.63
Nomascus_leucogenys            Nleu1.0.63
Ochotona_princeps              pika.63
Ornithorhynchus_anatinus       OANA5.63
Oryctolagus_cuniculus          oryCun2.63
Oryzias_latipes                MEDAKA1.63
Otolemur_garnettii             BUSHBABY1.63
Pan_troglodytes                CHIMP2.1.63
Pongo_abelii                   PPYG2.63
Procavia_capensis              proCap1.63
Pteropus_vampyrus              pteVam1.63
Rattus_norvegicus              RGSC3.4.63
Saccharomyces_cerevisiae       EF3.63
Sorex_araneus                  COMMON_SHREW1.63
Spermophilus_tridecemlineatus  SQUIRREL.63
Sus_scrofa                     Sscrofa9.63
Taeniopygia_guttata            taeGut3.2.4.63
Takifugu_rubripes              FUGU4.63
Tarsius_syrichta               tarSyr1.63
Tetraodon_nigroviridis         TETRAODON8.63
Tupaia_belangeri               TREESHREW.63
Tursiops_truncatus             turTru1.63
Vicugna_pacos                  vicPac1.63
Xenopus_tropicalis             JGI_4.2.63
- Problems with VCF heterozygous: Fixed
- Problems parsing some InDels: Fixed
- Error conditions on deletion at the border between UTR and Exon: Fixed
- Problems reporting some CDS relative positions: Fixed
- Some issues related to distance calculation on Downstream genes on negative strands: Fixed
Version: 1.9.5 (2011-03-10)
- Variants per gene table.
- Improvements in summary report.
- Improved GFF3 parsing.
- Several genomes added.
Version: 1.9 (2011-03-10) Features recently added:
- Improved command line
- Genomes added (Arabidopsis) : alyrata107, athaliana130
- Genomes added (all ENSEMBL version 61):
ailmel1.61, anoCar2.0.61, btau4.0.61, bushBaby1.61, calJac3.2.1.61, canFam2.61, cat1.61, cavPor3.61, ce.WS220.61, chimp2.1.61, choHof1.61, cInt2.61, cSav2.0.61, danRer9.61, dasNov2.61, dipOrd1.61, dm5.25.61, equCab2.61, eriEur1.61, fugu4.61, gacu1.61, ggallus2.61, gorGor3.61, hg37.61, loxAfr3.61, medaka1.61, meug1.0.61, micMur1.61, mm37.61, mmul1.61, monDom5.61, myoLuc1.61, oana5.61, ochPri2.61, oryCun2.61, ppyg2.61, proCap1.61, pteVam1.61, rat3.4.61, sacCer2.61, sorAra1.61, speTri1.61, sScrofa9.61, taeGut3.2.4.61, tarSyr1.61, tenrec1.61, tetraodon8.61, tupBel1.61, turkey.UMD2.61, turTru1.61, vicPac1.61, xtrop4.1.61
- Genomes added (Flybase): dm5.34
- Genomes added (legacy hg18): hg36.54
- Improved summary and statistics
- Supports BED format: if you just need to check where an interval hits (e.g. exon, intron, genes, etc.)
- Added support for GTF 2.2 format
- Improved robustness of GFF3 and GFF2 parsing
- Improved splice site detection: SPLICE_SITE_DONOR and SPLICE_SITE_ACCEPTOR
- Improved support for large deletions: EXON_DELETED and UTR_DELETED
- Added support for INTRON_CONSERVED and INTERGENIC_CONSERVED intervals (available in GTF 2.2 files)
- Added support for ambiguous sequences in exons (e.g. sequences that have "N")
Database dump support:
java -jar snpEff.jar dump genome_version
CDS testing support:
java -jar snpEff.jar cds genome_version cds.fasta
Older features
- Show DNA and amino acid sequence before and after change: option "-a, --around" (e.g. "-a 5" shows 5 codons around the sequence change)
- WARNING: Since version 1.7 snpEff assumes one-based coordinates (i.e. option "-1" is the default instead of "-0")
- WARNING: Since version 1.7 snpEff does not sort sequence changes. You should use option "-sort" if you want that.
- Genomes added (Pseudomonas): Pseudomonas aeruginosa (paeru.PA01 and paeru.PA14) and Pseudomonas fluorescens (pfluo.SBW25.NC_009444 and pfluo.SBW25.NC_012660)
- Genomes supported (all ENSEMBL version 60):
ailMel1.60, amel2, anoCar1.0.60, btau4.0.59, btau4.0.60, bushBaby1.60, calJac3.2.1.60, canFam2.59, canFam2.60, cat1.60, cavPor3.60, ce6, ce.WS210.60, chimp2.1.59, chimp2.1.60, choHof1.60, cInt2.60, cSav2.0.60, danRer6, danRer8.59, danRer9.60, dasNov2.60, dipOrd1.60, dm3, dm5.12, dm5.22, dm5.25.59, dm5.25.60, dm5.30, dm5.31, equCab2.60, eriEur1.60, fugu4.60, gacu1.60, ggallus2.59, ggallus2.60, gorGor3.60, hg37, hg37.59, hg37.60, loxAfr3.60, medaka1.60, meug1.0.60, micMur1.60, mm37, mm37.59, mm37.60, mmul1.60, monDom5.60, myoLuc1.60, oana5.60, ochPri2.60, oryCun2.60, ppyg2.60, proCap1.60, pteVam1.60, rat3.4.59, rat3.4.60, sacCer2, sacCer2.59, sacCer2.60, SIVmac239, sorAra1.60, speTri1.60, sScrofa9.60, taeGut3.2.4.60, tarSyr1.60, tenrec1.60, testCase, tetraodon8.60, tupBel1.60, turTru1.60, vicPac1.60, xtrop4.1.60
- VCF4 input format is now supported
- Support for new genome Apis mellifera
- Statistics and plots
- Filter intervals (only analyze selected intervals)
- One-based and zero-based positions for input and output (as well as arbitrary offsets)
- Support for heterozygous SNPs (e.g. A/W)
- Predicts insertions and deletions (FRAME_SHIFT)
- Supports GFF format when building databases.
- Added: Multiple nucleotide polymorphisms (MNPs)
- New format shows SNP quality and coverage.
- Can filter SNPs, InDels and MNPs based on quality, coverage and zygosity (Hom/Het).
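
Since one-based versus zero-based positions (and arbitrary offsets) are a common source of off-by-one errors, the following sketch shows the arithmetic involved in converting a position between two coordinate conventions. It only illustrates the idea behind the offset options mentioned above; the function name `convert_position` and its parameters are hypothetical and do not correspond to SnpEff's internals.

```python
def convert_position(pos, in_offset, out_offset):
    """Convert a position between coordinate conventions.

    in_offset / out_offset: the number assigned to the first base of a
    sequence in the input and output conventions (0 for zero-based,
    1 for one-based, or any other arbitrary offset).
    """
    # Normalise to a zero-based index, then shift to the output convention
    return pos - in_offset + out_offset


# The 100th base of a chromosome is position 100 in one-based coordinates
# and position 99 in zero-based coordinates.
print(convert_position(100, in_offset=1, out_offset=0))
print(convert_position(99, in_offset=0, out_offset=1))
```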