SnpSift Extract Fields
Extract fields from a VCF file to a TXT, tab separated format, that you can easily load in R, XLS, etc.
Command Options
java -jar SnpSift.jar extractFields [options] file.vcf fieldName1 fieldName2 ... fieldNameN > tabFile.txt
-s <separator>: Separator for multiple values within the same field when using[*]wildcards. Default:\t(tab).-e <string>: String to output for empty/missing fields. Default:''(empty string).
Use - as the file name to read from STDIN.
Warning
The default -s separator is tab, which is also the column separator. This means that when using [*] wildcards, multiple values appear as extra columns in the output, making it hard to parse programmatically. Use -s "," to get comma-separated values within each column instead.
Field expressions
Standard VCF fields
CHROMPOSIDREFALTQUALFILTER
INFO fields
Any INFO field available in the VCF header can be extracted by name (e.g., AF, AC, DP, MQ).
For comma-separated INFO values (e.g., per-allele fields), you can use indexing:
AF: Returns the raw comma-separated string (e.g.,0.1,0.2)AF[0]: Returns only the first value (e.g.,0.1)AF[*]: Iterates over all values, separating them with the-sseparator
Genotype fields
GEN[0].GT: GT sub-field of the first sample (0-based index)GEN[0].AD[0]: First value of the AD sub-field of the first sample (ref allele depth)GEN[0].AD[1]: Second value of the AD sub-field (alt allele depth)GEN[0].AD: The full AD field as a raw comma-separated stringGEN[HG00096].DS: Genotype sub-field by sample name (instead of numeric index)GEN[*].GT: GT from ALL samples (iterated with the-sseparator)
SnpEff 'ANN' fields
ANN[*].ALLELE(alias GENOTYPE)ANN[*].EFFECT(alias ANNOTATION): Effect in Sequence ontology terms (e.g. 'missense_variant', 'synonymous_variant', 'stop_gained', etc.)ANN[*].IMPACT:{ HIGH, MODERATE, LOW, MODIFIER }ANN[*].GENE:Gene name (e.g. 'PSD3')ANN[*].GENEID:Gene IDANN[*].FEATUREANN[*].FEATUREID(alias TRID: Transcript ID)ANN[*].BIOTYPE:Biotype, as described by the annotations (e.g. 'protein_coding')ANN[*].RANK:Exon or Intron rank (i.e. exon number in a transcript)ANN[*].HGVS_C(alias HGVS_DNA, CODON): Variant in HGVS (DNA) notationANN[*].HGVS_P(alias HGVS, HGVS_PROT, AA): Variant in HGVS (protein) notationANN[*].CDNA_POS(alias POS_CDNA)ANN[*].CDNA_LEN(alias LEN_CDNA)ANN[*].CDS_POS(alias POS_CDS)ANN[*].CDS_LEN(alias LEN_CDS)ANN[*].AA_POS(alias POS_AA)ANN[*].AA_LEN(alias LEN_AA)ANN[*].DISTANCEANN[*].ERRORS(alias WARNING, INFOS)
SnpEff 'EFF' fields (older format, new versions use 'ANN')
EFF[*].EFFECTEFF[*].IMPACTEFF[*].FUNCLASSEFF[*].CODONEFF[*].AAEFF[*].AA_LENEFF[*].GENEEFF[*].BIOTYPEEFF[*].CODINGEFF[*].TRIDEFF[*].RANK
SnpEff 'LOF' and 'NMD' fields
LOF[*].GENE,LOF[*].GENEID,LOF[*].NUMTR,LOF[*].PERCNMD[*].GENE,NMD[*].GENEID,NMD[*].NUMTR,NMD[*].PERC
Wildcard indexing
[*] (or [ANY]) iterates over all values of a multi-valued field. When multiple [*] wildcards appear in different fields, the iteration produces a cartesian product: for each value of the outer iterator, all values of the inner iterator are enumerated.
The iteration priority order (inner to outer) is: genotype sub-values > genotypes > NMD > LOF > effects > variants.
Warning
When using wildcards, remember to quote the expression on the command line (e.g. "ANN[*].EFFECT") to prevent shell glob expansion.
Warning
Field names containing non-alphanumeric characters (like + in dbNSFP_GERP++_RS) cannot be used directly because the expression parser interprets them as operators.
As a workaround, rename the fields in the VCF file first (e.g., using sed).
Examples
Example 1: Extracting chromosome, position, ID and allele frequency
$ java -jar SnpSift.jar extractFields s.vcf CHROM POS ID AF | head
#CHROM POS ID AF
1 69134 0.086
1 69496 rs150690004 0.001
1 69511 rs75062661 0.983
1 69569 0.538
1 721559 0.001
1 721757 0.011
1 846854 rs111957712 0.003
1 865584 rs148711625 0.001
1 865625 rs146327803 0.001
Example 2: Extracting genotype fields
$ java -jar SnpSift.jar extractFields file.vcf "CHROM" "POS" "ID" "THETA" "GEN[0].GL[1]" "GEN[1].GL" "GEN[3].GL[*]" "GEN[*].GT"
This means to extract:
CHROMPOSID: regular fields (as in the previous example)THETA: This one is from INFOGEN[0].GL[1]: Second likelihood from first genotypeGEN[1].GL: The whole GL field as a raw comma-separated string (not split)GEN[3].GL[*]: All likelihoods from genotype 3, separated by the-sseparator (default: tab)GEN[*].GT: Genotype subfields (GT) from ALL samples, separated by the-sseparator
The result will look something like:
#CHROM POS ID THETA GEN[0].GL[1] GEN[1].GL GEN[3].GL[*] GEN[*].GT
1 10583 rs58108140 0.0046 -0.47 -0.24,-0.44,-1.16 -0.48 -0.48 -0.48 0|0 0|0 0|0 0|1 0|0 0|1 0|0 0|0 0|1
1 10611 rs189107123 0.0077 -0.48 -0.24,-0.44,-1.16 -0.48 -0.48 -0.48 0|0 0|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0
1 13302 rs180734498 0.0048 -0.58 -2.45,-0.00,-5.00 -0.48 -0.48 -0.48 0|0 0|1 0|0 0|0 0|0 1|0 0|0 0|1 0|0
Note that GEN[1].GL returns the raw comma-separated string -0.24,-0.44,-1.16, while GEN[3].GL[*] splits the values and separates them with tabs.
Example 3: Using a custom separator for multi-valued fields
Using -s "," to get comma-separated values within each column, and -e "." for empty fields:
$ java -jar SnpSift.jar extractFields -s "," -e "." examples/test.chr22.ann.vcf CHROM POS REF ALT "ANN[*].EFFECT" "ANN[*].HGVS_P"
#CHROM POS REF ALT ANN[*].EFFECT ANN[*].HGVS_P
22 17071756 T C 3_prime_UTR_variant,downstream_gene_variant .,.
22 17072035 C T missense_variant,downstream_gene_variant p.Gly469Glu,.
22 17072258 C A missense_variant,downstream_gene_variant p.Gly395Cys,.
22 17072674 G A missense_variant,downstream_gene_variant p.Pro256Leu,.
22 17072747 T C missense_variant,downstream_gene_variant p.Met232Val,.
22 17072781 C T synonymous_variant,downstream_gene_variant p.Pro220Pro,.
22 17073043 C T missense_variant,downstream_gene_variant p.Arg133Gln,.
22 17073066 A G synonymous_variant,downstream_gene_variant p.Ala125Ala,.
22 17073119 C T missense_variant,downstream_gene_variant p.Val108Met,.
Notice how the effects are comma-separated within one column, making the output easy to parse.
Example 4: Extracting effects with default separator vs one-per-line
Without -s, each effect value becomes a separate tab-separated column (since the default separator is tab):
$ java -jar SnpSift.jar extractFields examples/test.chr22.ann.vcf CHROM POS REF ALT "ANN[*].EFFECT"
#CHROM POS REF ALT ANN[*].EFFECT
22 17071756 T C 3_prime_UTR_variant downstream_gene_variant
22 17072035 C T missense_variant downstream_gene_variant
If you prefer one effect per line, use the vcfEffOnePerLine.pl script provided with SnpEff:
$ cat examples/test.chr22.ann.vcf \
| ./scripts/vcfEffOnePerLine.pl \
| java -jar SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].EFFECT"
#CHROM POS REF ALT ANN[*].EFFECT
22 17071756 T C 3_prime_UTR_variant
22 17071756 T C downstream_gene_variant
22 17072035 C T missense_variant
22 17072035 C T downstream_gene_variant
Info
- as input file name denotes STDIN, enabling piping.
Example 5: Extracting genotype using sample name instead of index
You can use sample names (from the VCF header) instead of numeric indices:
$ java -jar SnpSift.jar extractFields examples/1kg.head_chr1.vcf.gz CHROM POS REF ALT "GEN[HG00096].DS" "GEN[HG00097].DS"
#CHROM POS REF ALT GEN[HG00096].DS GEN[HG00097].DS
1 10583 G A 0.2 0.15
1 10611 C G 0.05 0.75
1 13302 C T 0.05 1.0
1 13327 G C 0.0 0.95
1 13957 TC T 0.05 0.65
1 13980 T C 0.05 0.6
1 30923 G T 1.75 0.35
1 46402 C CTGT 0.05 0.15
1 47190 G GA 0.15 0.0