SnpSift Extract Fields
Extract fields from a VCF file to a TXT, tab separated format, that you can easily load in R, XLS, etc.
Typical usage
You can also use sub-fields and genotype fields / sub-fields such as:
- Standard VCF fields:
CHROM
POS
ID
REF
ALT
FILTER
- INFO fields:
AF
AC
DP
MQ
- etc. (any info field available)
- SnpEff 'ANN' fields:
ANN[*].ALLELE
(alias GENOTYPE)ANN[*].EFFECT
(alias ANNOTATION): Effect in Sequence ontology terms (e.g. 'missense_variant', 'synonymous_variant', 'stop_gained', etc.)ANN[*].IMPACT:
{ HIGH, MODERATE, LOW, MODIFIER }ANN[*].GENE:
Gene name (e.g. 'PSD3')ANN[*].GENEID:
Gene IDANN[*].FEATURE
ANN[*].FEATUREID
(alias TRID: Transcript ID)ANN[*].BIOTYPE:
Biotype, as described by the annotations (e.g. 'protein_coding')ANN[*].RANK:
Exon or Intron rank (i.e. exon number in a transcript)ANN[*].HGVS_C
(alias HGVS_DNA, CODON): Variant in HGVS (DNA) notationANN[*].HGVS_P
(alias HGVS, HGVS_PROT, AA): Variant in HGVS (protein) notationANN[*].CDNA_POS
(alias POS_CDNA)ANN[*].CDNA_LEN
(alias LEN_CDNA)ANN[*].CDS_POS
(alias POS_CDS)ANN[*].CDS_LEN
(alias LEN_CDS)ANN[*].AA_POS
(alias POS_AA)ANN[*].AA_LEN
(alias LEN_AA)ANN[*].DISTANCE
ANN[*].ERRORS
(alias WARNING, INFOS)
-
SnpEff 'EFF' fields (this is for older SnpEff/SnpSift versions, new version use 'ANN' field):
EFF[*].EFFECT
EFF[*].IMPACT
EFF[*].FUNCLASS
EFF[*].CODON
EFF[*].AA
EFF[*].AA_LEN
EFF[*].GENE
EFF[*].BIOTYPE
EFF[*].CODING
EFF[*].TRID
EFF[*].RANK
Info
You can combine
vcfEffOnePerLine.pl
script withSnpSift extractFields
if you want to have each effect in a separate line. -
SnpEff 'LOF' fields:
LOF[*].GENE
LOF[*].GENEID
LOF[*].NUMTR
LOF[*].PERC
- SnpEff' NMD' fields:
NMD[*].GENE
NMD[*].GENEID
NMD[*].NUMTR
NMD[*].PERC
Warning
When using multiple indexes you must remember to use quotes and escape the character in the command line (e.g. "ANN[*].EFFECT"
).
Otherwise, the shell would parse the asterisk changing the expression and producing unexpected results.
Info
You can use command line option -s
to specify multiple field separator and -e
to specify how to represent empty fields.
Example 1: Extracting chromosome, position, ID and allele frequency
$ java -jar SnpSift.jar extractFields s.vcf CHROM POS ID AF | head
#CHROM POS ID AF
1 69134 0.086
1 69496 rs150690004 0.001
1 69511 rs75062661 0.983
1 69569 0.538
1 721559 0.001
1 721757 0.011
1 846854 rs111957712 0.003
1 865584 rs148711625 0.001
1 865625 rs146327803 0.001
Example 2: Extracting genotype fields
$ java -jar SnpSift.jar extractFields file.vcf "CHROM" "POS" "ID" "THETA" "GEN[0].GL[1]" "GEN[1].GL" "GEN[3].GL[*]" "GEN[*].GT"
This means to extract:
CHROM
POS
ID
: regular fields (as in the previous example)THETA
: This one is from INFOGEN[0].GL[1]
: Second likelihood from first genotypeGEN[1].GL
: The whole GL fields (all entries without separating them)GEN[3].GL[*]
: All likelihoods form genotype 3 (this time they will be tab separated, as opposed to the previous one).GEN[*].GT
: Genotype subfields (GT) from ALL samples (tab separated).
The result will look something like:
#CHROM POS ID THETA GEN[0].GL[1] GEN[1].GL GEN[3].GL[*] GEN[*].GT
1 10583 rs58108140 0.0046 -0.47 -0.24,-0.44,-1.16 -0.48 -0.48 -0.48 0|0 0|0 0|0 0|1 0|0 0|1 0|0 0|0 0|1
1 10611 rs189107123 0.0077 -0.48 -0.24,-0.44,-1.16 -0.48 -0.48 -0.48 0|0 0|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0
1 13302 rs180734498 0.0048 -0.58 -2.45,-0.00,-5.00 -0.48 -0.48 -0.48 0|0 0|1 0|0 0|0 0|0 1|0 0|0 0|1 0|0
1 13327 rs144762171 0.0204 -1.11 -1.97,-0.01,-2.51 -0.48 -0.48 -0.48 0|0 0|1 0|0 0|0 0|0 1|0 0|0 0|0 0|0
1 13957 rs201747181 0.0100 0 0,0,0 0 0 0 0|0 0|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0
1 13980 rs151276478 0.0139 -0.48 -0.48,-0.48,-0.48 -0.48 -0.48 -0.48 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
1 30923 rs140337953 0.0162 -0.61 -0.10,-0.69,-2.81 -0.48 -0.48 -0.48 1|1 0|0 0|0 1|1 1|0 0|0 1|1 1|0 1|1
1 46402 rs199681827 0.0121 0 0,0,0 0 0 0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
1 47190 rs200430748 0.0153 0 0,0,0 0 0 0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
Example 3: Extracting fields with multiple values in a friendlier format
You can use command line option -s
to specify multiple field separator and -e
to specify how to represent empty fields.
$ java -jar SnpSift.jar extractFields -s "," -e "." test.chr22.ann.vcf CHROM POS REF ALT "EFF[*].EFFECT" "EFF[*].AA"
Notice how we separate same fields using "," instead of the default tab using the option -s ","
, and we use "." for empty fields (option -e "."
).
The results is:
$ java -jar SnpSift.jar extractFields -s "," -e "." examples/test.chr22.ann.vcf CHROM POS REF ALT "ANN[*].EFFECT" "ANN[*].HGVS_P"
#CHROM POS REF ALT ANN[*].EFFECT ANN[*].HGVS_P
22 17071756 T C 3_prime_UTR_variant,downstream_gene_variant .,.
22 17072035 C T missense_variant,downstream_gene_variant p.Gly469Glu,.
22 17072258 C A missense_variant,downstream_gene_variant p.Gly395Cys,.
22 17072674 G A missense_variant,downstream_gene_variant p.Pro256Leu,.
22 17072747 T C missense_variant,downstream_gene_variant p.Met232Val,.
22 17072781 C T synonymous_variant,downstream_gene_variant p.Pro220Pro,.
22 17073043 C T missense_variant,downstream_gene_variant p.Arg133Gln,.
22 17073066 A G synonymous_variant,downstream_gene_variant p.Ala125Ala,.
22 17073119 C T missense_variant,downstream_gene_variant p.Val108Met,.
Example 4: Extracting effects, one per line
In order to extract effects, you can simply do something like this (notice that there are multiple columns per line because there are multiple effects per variant):
$ java -jar SnpSift.jar extractFields examples/test.chr22.ann.vcf CHROM POS REF ALT "ANN[*].EFFECT"
#CHROM POS REF ALT ANN[*].EFFECT
22 17071756 T C 3_prime_UTR_variant downstream_gene_variant
22 17072035 C T missense_variant downstream_gene_variant
22 17072258 C A missense_variant downstream_gene_variant
22 17072674 G A missense_variant downstream_gene_variant
22 17072747 T C missense_variant downstream_gene_variant
22 17072781 C T synonymous_variant downstream_gene_variant
22 17073043 C T missense_variant downstream_gene_variant
22 17073066 A G synonymous_variant downstream_gene_variant
22 17073119 C T missense_variant downstream_gene_variant
If we prefer to have one effect per line, then we can use the vcfEffOnePerLine.pl
provided with SnpEff distribution
$ cat examples/test.chr22.ann.vcf \
| ./scripts/vcfEffOnePerLine.pl \
| java -jar SnpSift.jar extractFields - CHROM POS REF ALT "ANN[*].EFFECT" \
#CHROM POS REF ALT ANN[*].EFFECT
22 17071756 T C 3_prime_UTR_variant
22 17071756 T C downstream_gene_variant
22 17072035 C T missense_variant
22 17072035 C T downstream_gene_variant
22 17072258 C A missense_variant
22 17072258 C A downstream_gene_variant
22 17072674 G A missense_variant
22 17072674 G A downstream_gene_variant
22 17072747 T C missense_variant
Info
Note that in SnpSift, we used -
as input file name, which denotes STDIN.
Example 5: Extracting genotype using genotype name instead of genotype number
As of SnpSift version 4.1A, you can use the genotype name in expressions:
$ java -jar SnpSift.jar extractFields examples/1kg.head_chr1.vcf.gz CHROM POS REF ALT "GEN[HG00096].DS" "GEN[HG00097].DS"
#CHROM POS REF ALT GEN[HG00096].DS GEN[HG00097].DS
1 10583 G A 0.2 0.15
1 10611 C G 0.05 0.75
1 13302 C T 0.05 1.0
1 13327 G C 0.0 0.95
1 13957 TC T 0.05 0.65
1 13980 T C 0.05 0.6
1 30923 G T 1.75 0.35
1 46402 C CTGT 0.05 0.15
1 47190 G GA 0.15 0.0
Example 6: Extracting non alphanumeric field names
Warning
SnpSift extractFields
can get confused if the VCF field has non-alphanumeric charaters in the name (e.g. dbNSFP_GERP++_RS
has two "+" signs).
A quick fix, it so is to change the field names in the VCF file.
Here is an example:
# Change field names in VCF
$ cat kath.vcf | sed "s/dbNSFP_GERP++/dbNSFP_GERP/g" > kath.gerp.vcf
# Use new names to extract fields
$ java -jar SnpSift.jar extractFields kath.gerp.vcf CHROM POS REF ALT dbNSFP_GERP_RS dbNSFP_GERP_NS
#CHROM POS REF ALT dbNSFP_GERP_RS
1 142827044 G A
2 132914561 G A
7 151933217 C A
7 151933251 T C
7 151933302 T C
7 151945101 G C -0.892
7 151945167 G T
7 151962176 T A
7 151970672 A T
7 151970856 T A 3.71
18 14183638 G C
18 14183710 A G
18 14542909 G A
18 14543039 T C -0.942