Integration: GATK and Galaxy

SnpEff integrates with other tools commonly used in sequencing data analysis pipelines. Most notably, Galaxy and the Broad Institute's Genome Analysis Toolkit (GATK) support SnpEff. Because SnpEff uses standard formats such as VCF, it is easy to combine with other programs.

Integration: GATK

To make sure SnpEff and GATK understand each other, you must activate GATK compatibility in SnpEff with the '-o gatk' command line option. Both GATK and SnpEff use the VCF format, but SnpEff has updated the EFF sub-field format while GATK still expects the original version. Without '-o gatk', this mismatch can cause problems.

Warning

GATK only picks one effect. The GATK team decided to report only the effect with the highest impact. This was done intentionally for the sake of brevity, in a 'less is more' spirit. You can get the full list of effects by running SnpEff independently, instead of within the GATK framework.
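To get a feeling for how much is dropped when only one effect is kept, you can count the effects SnpEff reports for each variant in its own output file. The following is a minimal sketch, not part of the SnpEff distribution. The function name count_effects is hypothetical, and the sketch assumes an uncompressed VCF file in which the EFF sub-field of the INFO column holds one comma-separated entry per effect, with no commas inside an entry.

```shell
#!/bin/sh
# count_effects VCF   (hypothetical helper)
# For each variant, print CHROM, POS and the number of comma-separated
# entries found in the EFF sub-field of INFO (0 when EFF is absent).
count_effects() {
    awk -F '\t' '
        /^#/ { next }                       # Skip VCF header lines
        {
            n = 0
            m = split($8, info, ";")        # INFO is column 8; sub-fields are separated by ";"
            for (i = 1; i <= m; i++) {
                if (info[i] ~ /^EFF=/) {
                    # Drop the leading "EFF=" and count the entries
                    n = split(substr(info[i], 5), effs, ",")
                }
            }
            print $1 "\t" $2 "\t" n
        }
    ' "$1"
}
```

Run on a file annotated by SnpEff alone, any variant with a count above 1 has effects that the GATK-annotated file would not show.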

Script example: This example combines SnpEff and GATK's VariantAnnotator (the script is in the snpEff/scripts/ directory of the distribution).

#!/bin/sh

#-------------------------------------------------------------------------------
# Files
#-------------------------------------------------------------------------------

in=$1                                                   # Input VCF file
eff=`dirname $in`/`basename $in .vcf`.ann.vcf           # SnpEff annotated VCF file
out=`dirname $in`/`basename $in .vcf`.gatk.vcf          # Output VCF file (annotated by GATK)

ref=$HOME/snpEff/data/genomes/hg19.fa                   # Reference genome file
dict=`dirname $ref`/`basename $ref .fa`.dict            # Reference genome: Dictionary file

#-------------------------------------------------------------------------------
# Path to programs and libraries
#-------------------------------------------------------------------------------

gatk=$HOME/tools/gatk/GenomeAnalysisTK.jar
picard=$HOME/tools/picard/
snpeff=$HOME/snpEff/snpEff.jar

#-------------------------------------------------------------------------------
# Main
#-------------------------------------------------------------------------------

# Create genome index file
echo
echo "Indexing Genome reference FASTA file: $ref"
samtools faidx $ref

# Create dictionary
echo
echo "Creating Genome reference dictionary file: $dict"
java -jar $picard/CreateSequenceDictionary.jar R= $ref O= $dict

# Annotate
echo
echo "Annotate using SnpEff"
echo "    Input file  : $in"
echo "    Output file : $eff"
java -Xmx8g -jar $snpeff -c $HOME/snpEff/snpEff.config -v -o gatk hg19 $in > $eff

# Use GATK
echo
echo "Annotating using GATK's VariantAnnotator:"
echo "    Input file  : $in"
echo "    Output file : $out"
java -Xmx8g -jar $gatk \
    -T VariantAnnotator \
    -R $ref \
    -A SnpEff \
    --variant $in \
    --snpEffFile $eff \
    -L $in \
    -o $out

Warning

Important: For this to work, GATK requires the chromosomes in the genome reference file to be in karyotypic order (largest to smallest chromosomes, followed by X, Y, and MT). Your VCF file must follow the same order.
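One way to catch ordering problems before launching GATK is to compare the contig order of the VCF file against the reference index. The following is a minimal sketch, not part of the SnpEff distribution. The function name check_vcf_contig_order is hypothetical, and the sketch assumes an uncompressed VCF file and the .fai index written by 'samtools faidx', whose first column lists the contigs in reference order. It only checks that the VCF follows the order of the reference; it does not check that the reference itself is in karyotypic order.

```shell
#!/bin/sh
# check_vcf_contig_order FAI VCF   (hypothetical helper)
# Return 0 if the contigs in VCF appear in the same relative order as in FAI.
# Otherwise print the first offending contig and return non-zero.
check_vcf_contig_order() {
    fai=$1
    vcf=$2
    awk -F '\t' '
        BEGIN { bad = 0 }
        NR == FNR { rank[$1] = FNR; next }   # First file (.fai): remember the position of each contig
        /^#/ { next }                        # Skip VCF header lines
        $1 != prev {                         # Only look at lines where the contig changes
            if (!($1 in rank)) { print "Contig not in reference: " $1; bad = 1; exit }
            if (rank[$1] < last) { print "Contig out of order: " $1; bad = 1; exit }
            last = rank[$1]
            prev = $1
        }
        END { exit bad }
    ' "$fai" "$vcf"
}
```

With the variables defined in the script above, the check would be called as check_vcf_contig_order $ref.fai $in, after the 'samtools faidx' step has created the index.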

Now we can use the script:

$ ~/snpEff/scripts/gatk.sh zzz.vcf

Indexing Genome reference FASTA file: /home/pcingola/snpEff/data/genomes/hg19.fa

Creating Genome reference dictionary file: /home/pcingola/snpEff/data/genomes/hg19.dict
[Fri Apr 12 11:23:12 EDT 2013] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/home/pcingola/snpEff/data/genomes/hg19.fa OUTPUT=/home/pcingola/snpEff/data/genomes/hg19.dict    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Fri Apr 12 11:23:12 EDT 2013] Executing as pcingola@localhost.localdomain on Linux 3.6.11-4.fc16.x86_64 amd64; OpenJDK 64-Bit Server VM 1.6.0_24-b24; Picard version: 1.89(1408)
[Fri Apr 12 11:23:12 EDT 2013] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=141164544
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" net.sf.picard.PicardException: /home/pcingola/snpEff/data/genomes/hg19.dict already exists.  Delete this file and try again, or specify a different output file.
        at net.sf.picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:114)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
        at net.sf.picard.sam.CreateSequenceDictionary.main(CreateSequenceDictionary.java:93)

Annotate using SnpEff
    Input file  : zzz.vcf
    Output file : ./zzz.ann.vcf
00:00:00.000    Reading configuration file '/home/pcingola/snpEff/snpEff.config'
00:00:00.173    done
00:00:00.173    Reading database for genome version 'hg19' from file '/home/pcingola//snpEff/data/hg19/snpEffectPredictor.bin' (this might take a while)
00:00:11.860    done
00:00:11.885    Building interval forest
00:00:17.755    done.
00:00:18.391    Genome stats :
# Genome name                : 'Homo_sapiens (USCS)'
# Genome version             : 'hg19'
# Has protein coding info    : true
# Genes                      : 25933
# Protein coding genes       : 20652
# Transcripts                : 44253
# Avg. transcripts per gene  : 1.71
# Protein coding transcripts : 36332
# Cds                        : 365442
# Exons                      : 429543
# Exons with sequence        : 409789
# Exons without sequence     : 19754
# Avg. exons per transcript  : 9.71
# Number of chromosomes      : 50
# Chromosomes names [sizes]  : '1' [249250621]  '2' [243199373] '3' [198022430] '4' [191154276] '5' [180915260] '6' [171115067] '7' [159138663] 'X' [155270560] '8' [146364022] '9' [141213431] '10' [135534747]    '11' [135006516]    '12' [133851895]    '13' [115169878]    '14' [107349540]    '15' [102531392]    '16' [90354753] '17' [81195210] '18' [78077248] '20' [63025520] 'Y' [59373566]  '19' [59128983] '22' [51304566] '21' [48129895] '6_ssto_hap7' [4905564] '6_mcf_hap5' [4764535]  '6_cox_hap2' [4734611]  '6_mann_hap4' [4679971] '6_qbl_hap6' [4609904]  '6_dbb_hap3' [4572120]  '6_apd_hap1' [4383650]  '17_ctg5_hap1' [1574839]    '4_ctg9_hap1' [582546]  'Un_gl000220' [156152]  '19_gl000209_random' [145745]   'Un_gl000213' [139339]  '17_gl000205_random' [119732]   'Un_gl000223' [119730]  '4_gl000194_random' [115071]    'Un_gl000228' [114676]  'Un_gl000219' [99642]   'Un_gl000218' [97454]   'Un_gl000211' [93165]   'Un_gl000222' [89310]   '4_gl000193_random' [88375] '7_gl000195_random' [86719] '1_gl000192_random' [79327] 'Un_gl000212' [60768]   '1_gl000191_random' [50281] 'M' [16571] 
00:00:18.391    Predicting variants
00:00:20.267    Creating summary file: snpEff_summary.html
00:00:20.847    Creating genes file: snpEff_genes.txt
00:00:25.026    done.
00:00:25.036    Logging
00:00:26.037    Checking for updates...

Annotating using GATK's VariantAnnotator:
    Input file  : zzz.vcf
    Output file : ./zzz.gatk.vcf
INFO  11:23:41,316 ArgumentTypeDescriptor - Dynamically determined type of zzz.vcf to be VCF 
INFO  11:23:41,343 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  11:23:41,344 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-9-g532efad, Compiled 2013/03/19 07:35:36 
INFO  11:23:41,344 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  11:23:41,344 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  11:23:41,347 HelpFormatter - Program Args: -T VariantAnnotator -R /home/pcingola/snpEff/data/genomes/hg19.fa -A SnpEff --variant zzz.vcf --snpEffFile ./zzz.ann.vcf -L zzz.vcf -o ./zzz.gatk.vcf 
INFO  11:23:41,347 HelpFormatter - Date/Time: 2013/04/12 11:23:41 
INFO  11:23:41,348 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  11:23:41,348 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  11:23:41,353 ArgumentTypeDescriptor - Dynamically determined type of zzz.vcf to be VCF 
INFO  11:23:41,356 ArgumentTypeDescriptor - Dynamically determined type of ./zzz.ann.vcf to be VCF 
INFO  11:23:41,399 GenomeAnalysisEngine - Strictness is SILENT 
INFO  11:23:41,466 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  11:23:41,480 RMDTrackBuilder - Loading Tribble index from disk for file zzz.vcf 
INFO  11:23:41,503 RMDTrackBuilder - Loading Tribble index from disk for file ./zzz.ann.vcf 
WARN  11:23:41,505 RMDTrackBuilder - Index file /data/pcingola/Documents/projects/snpEff/gatk_test/./zzz.ann.vcf.idx is out of date (index older than input file), deleting and updating the index file 
INFO  11:23:41,506 RMDTrackBuilder - Creating Tribble index in memory for file ./zzz.ann.vcf 
INFO  11:23:41,914 RMDTrackBuilder - Writing Tribble index to disk for file /data/pcingola/Documents/projects/snpEff/gatk_test/./zzz.ann.vcf.idx 
INFO  11:23:42,076 IntervalUtils - Processing 33411 bp from intervals 
INFO  11:23:42,125 GenomeAnalysisEngine - Creating shard strategy for 0 BAM files 
INFO  11:23:42,134 GenomeAnalysisEngine - Done creating shard strategy 
INFO  11:23:42,134 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  11:23:42,135 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  11:23:49,268 VariantAnnotator - Processed 9966 loci.

INFO  11:23:49,280 ProgressMeter -            done        3.34e+04    7.0 s        3.6 m    100.0%         7.0 s     0.0 s 
INFO  11:23:49,280 ProgressMeter - Total runtime 7.15 secs, 0.12 min, 0.00 hours 
INFO  11:23:49,953 GATKRunReport - Uploaded run statistics report to AWS S3
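The PicardException in the output above appears because hg19.dict already existed from an earlier run. The script carries on regardless, so the annotation still completes, but the stack trace is easy to mistake for a real failure. A possible way to avoid it is to skip the indexing steps when their output files are already present. The following is a minimal sketch under that assumption; the helper name unless_exists is hypothetical and not part of the distributed script.

```shell
#!/bin/sh
# unless_exists FILE COMMAND [ARGS...]   (hypothetical helper)
# Run COMMAND only when FILE does not exist yet.
unless_exists() {
    target=$1
    shift
    if [ -e "$target" ]; then
        echo "Skipping, already present: $target"
    else
        "$@"
    fi
}

# Intended use in the script above (variables as defined there):
#   unless_exists "$ref.fai" samtools faidx "$ref"
#   unless_exists "$dict" java -jar "$picard/CreateSequenceDictionary.jar" R="$ref" O="$dict"
```

The check is based only on the existence of the file, so a stale index or dictionary has to be deleted by hand after the reference FASTA file changes.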

Integration: Galaxy

To install SnpEff in your own Galaxy server, use the galaxy/*.xml files provided in the main distribution.

Installing SnpEff in a Galaxy server:

# Set variable to snpEff install dir (we only use it for this install script)
export snpEffDir="$HOME/snpEff"

# Go to your galaxy 'tools' dir
cd galaxy-dist/tools

# Create a directory and copy the XML config files from SnpEff's distribution
mkdir snpEff
cd snpEff/
cp $snpEffDir/galaxy/* .

# Create links to JAR files
ln -s $snpEffDir/snpEff.jar
ln -s $snpEffDir/SnpSift.jar

# Link to config file
ln -s $snpEffDir/snpEff.config

# Make the scripts executable
chmod a+x *.{pl,sh}

# Copy genomes information
cd ../..
cp $snpEffDir/galaxy/tool-data/snpEff_genomes.loc tool-data/

# Edit Galaxy's tool_conf.xml and add all the tools
vi tool_conf.xml

-------------------- Begin: Edit tool_conf.xml --------------------
<!-- 
    Add this section to tool_conf.xml file in your galaxy distribution

    Note: The following lines should be added at the end of the 
          file, right before "</toolbox>" line
-->
<section name="snpEff tools" id="snpEff_tools">
    <tool file="snpEff/snpEff.xml" />
    <tool file="snpEff/snpEff_download.xml" />
    <tool file="snpEff/snpSift_annotate.xml" />
    <tool file="snpEff/snpSift_caseControl.xml" />
    <tool file="snpEff/snpSift_filter.xml" />
    <tool file="snpEff/snpSift_int.xml" />
</section>
-------------------- End: Edit tool_conf.xml --------------------

# Run galaxy and check that the new menus appear
./run.sh
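If the new menus do not appear, a common cause is a tool entry in tool_conf.xml that points to a file which was never copied. The following is a minimal sketch for checking this, not part of the SnpEff distribution. The function name check_snpeff_tools is hypothetical, and the sketch assumes the layout used in the steps above: each tool sits on its own line in tool_conf.xml as a file="snpEff/..." attribute, and the paths are relative to Galaxy's 'tools' directory.

```shell
#!/bin/sh
# check_snpeff_tools TOOL_CONF TOOLS_DIR   (hypothetical helper)
# Print every snpEff tool file referenced in TOOL_CONF that is missing from
# TOOLS_DIR. Return non-zero if a file is missing or no snpEff tool is registered.
check_snpeff_tools() {
    conf=$1
    tools_dir=$2
    status=0
    # Extract the value of each file="snpEff/..." attribute, one per line
    files=$(sed -n 's|.*<tool file="\(snpEff/[^"]*\)".*|\1|p' "$conf")
    if [ -z "$files" ]; then
        echo "No snpEff tools registered in $conf"
        return 1
    fi
    for f in $files; do
        if [ ! -e "$tools_dir/$f" ]; then
            echo "Missing: $tools_dir/$f"
            status=1
        fi
    done
    return $status
}
```

From the Galaxy root directory (where run.sh lives), the check would be called as check_snpeff_tools tool_conf.xml tools.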