Skip to content

SnpSift GeneSets

Annotating GeneSets, such as Gene Ontology (GO), KEGG, Reactome, etc.; can be quite useful to find significant variants.

Gene set annotations can be added to a SnpEff annotated file using SnpSift geneSets command. The VCF file must be annotated using SnpEff before performing Gene Sets annotations, because we must know which gene the variant affects.

Usage

java -jar SnpSift.jar geneSets [-v] msigdb.gmt file.vcf > output.vcf

The command takes two positional arguments: the GMT file and the VCF file. Unlike most SnpSift commands, STDIN is not supported for the VCF input.

Info

You can download MSigDb from Broad Institute

Input format

The gene sets file must be in GMT (Gene Matrix Transposed) format: a tab-separated file where each line defines one gene set. Each line has the format: gene_set_name <TAB> description <TAB> gene1 <TAB> gene2 <TAB> ...

MSigDb GMT files follow this format and can be used directly.

How it works

For each variant, the command looks at the gene names from SnpEff annotations (the GENE field in ANN). For each gene, it finds all gene sets that contain that gene and adds them to the MSigDb INFO field (comma-separated, sorted alphabetically).

Warning

Gene names in the GMT file must exactly match the gene names used by SnpEff. MSigDb typically uses HGNC gene symbols (e.g. BRCA1). If your SnpEff database uses different gene identifiers (e.g. Ensembl gene IDs), the lookup will fail silently. Make sure your SnpEff annotations and GMT file use the same gene naming convention.

Example

$ java -jar SnpSift.jar geneSets -v db/msigDb/msigdb.v5.1.symbols.gmt test.ann.vcf > test.geneSets.vcf
00:00:00.000    Reading MSigDb from file: 'db/msigDb/msigdb.v5.1.symbols.gmt'
00:00:01.168    Done. Total:
        8513 gene sets
        31847 genes
00:00:01.168    Annotating variants from: 'test.ann.vcf'
00:00:01.298    Done.
# Summary
#       gene_set    gene_set_size   variants
#       ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN   940 8
#       CHR1P36 504 281
#       KEGG_OLFACTORY_TRANSDUCTION 389 8
#       REACTOME_GPCR_DOWNSTREAM_SIGNALING  805 8
#       REACTOME_OLFACTORY_SIGNALING_PATHWAY    328 8
...
#       REACTOME_SIGNALING_BY_GPCR  920 8

$ cat test.geneSets.vcf
## INFO=<ID=MSigDb,Number=.,Type=String,Description="Gene set from MSigDB database (GSEA)">
1   69849   .   G   A   454.73  PASS    AC=33;ANN=...;MSigDb=ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN,CHR1P36,KEGG_OLFACTORY_TRANSDUCTION,REACTOME_GPCR_DOWNSTREAM_SIGNALING,REACTOME_OLFACTORY_SIGNALING_PATHWAY,REACTOME_SIGNALING_BY_GPCR

The summary (gene set name, size, and number of annotated variants) is printed to STDERR when using -v.