Cancer samples
Here we describe details about annotating cancer samples.
Single multi-sample VCF vs. multiple VCF files.
It is common practice to have all samples in a single "multi-sample VCF file" (having two or more separate VCF files is highly discouraged). This is also the "gold standard" in cancer analysis, so all samples (both somatic and germline) should be in one VCF file.
SnpEff follows this gold standard practice and therefore requires a single multi-sample VCF (it is not possible to run cancer analysis using multiple VCF files).
Running in cancer analysis mode
Using the -cancer command line option, you can compare somatic vs germline samples.
So an example command line would be:
$ java -Xmx8g -jar snpEff.jar -v -cancer GRCh37.75 cancer.vcf > cancer.ann.vcf
Info
Cancer analysis only triggers for multiallelic variants (i.e. VCF entries with multiple ALT alleles) or "back to reference" mutations (where the somatic sample reverts to the reference allele). Additionally, cancer comparisons are only computed when at least one standard annotation has a non-MODIFIER impact.
Representing cancer data
In a typical cancer sequencing experiment, we want to measure and annotate differences between germline (healthy) and somatic (cancer) tissue samples from the same patient. The complication is that germline is not always the same as the reference genome, so a typical annotation does not work.
For instance, let's assume that at a given genomic position (e.g. chr1:69511), reference genome is 'A', germline is homozygous 'C/C' and somatic is homozygous 'G/G'. This should be represented in a VCF file as:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Patient_01_Germline Patient_01_Somatic
1 69511 . A C,G . PASS AC=1 GT 1/1 2/2
Warning
Some people tend to represent this by replacing the REF base 'A' with the germline base 'C'. This is a mistake: REF must always represent the reference genome, not one of your samples.
Under normal conditions, SnpEff would provide the effects of the changes "A -> C" and "A -> G". But in the case of cancer samples, we are actually interested in the difference between somatic and germline, so we'd like to calculate the effect of a "C -> G" mutation. Calculating this effect is not trivial, since we have to build a new "reference" by calculating the effect of the first mutation ("A -> C") and then calculate the effect of the second one ("C -> G") on our "new reference".
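To make the example record concrete, the following Python sketch decodes the GT values into bases. It only illustrates how GT indices map to REF and ALT; it is not part of SnpEff, and the function name decode_genotype is hypothetical.

```python
import re

def decode_genotype(ref, alt, gt):
    """Translate a VCF GT string (e.g. '1/1' or '2|1') into a list of bases.

    Index 0 is REF; index n >= 1 is the n-th comma-separated ALT allele.
    Missing genotypes ('.') are not handled in this sketch.
    """
    alleles = [ref] + alt.split(",")
    return [alleles[int(i)] for i in re.split(r"[/|]", gt)]

# Record from the example: REF='A', ALT='C,G'
germline = decode_genotype("A", "C,G", "1/1")  # Patient_01_Germline
somatic = decode_genotype("A", "C,G", "2/2")   # Patient_01_Somatic
print("germline:", germline, "somatic:", somatic)
```

Here germline decodes to C/C and somatic to G/G, so the change of interest is "C -> G" rather than either change relative to REF.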
Info
To activate cancer analysis, you must use the -cancer command line option.
Defining cancer samples
As we already mentioned, cancer data is represented in a VCF file using multiple ALTs (the REF field is always the reference genome). In order to specify which samples are somatic and which ones are germline, there are two options:
- Use a TXT file, specified with the -cancerSamples command line option.
- Use the PEDIGREE meta information in your VCF file's header. This is the default, but some people might find it hard to edit information in VCF file headers.
Warning
If you do not provide either PEDIGREE meta information or a TXT samples file, SnpEff will not know which somatic samples derive from which germline samples.
Thus it will be unable to perform cancer effect analysis.
TXT file
This is quite easy. All you have to do is to create a tab-separated TXT file having two columns: the first column has the germline sample names and the second column has the somatic sample names. Make sure that sample names match exactly the ones in the VCF file.
E.g.: Create a TXT file named 'samples_cancer.txt'
Patient_01_Germline Patient_01_Somatic
Patient_02_Germline Patient_02_Somatic
Patient_03_Germline Patient_03_Somatic
Patient_04_Germline Patient_04_Somatic
Then pass this file to SnpEff using the -cancerSamples command line option.
E.g.: In the example below, the file is 'examples/samples_cancer_one.txt' (a single germline / somatic pair), so the command line would look like this:
$ cat examples/samples_cancer_one.txt
Patient_01_Germline Patient_01_Somatic
$ java -Xmx8g -jar snpEff.jar -v \
-cancer \
-cancerSamples examples/samples_cancer_one.txt \
GRCh37.75 \
examples/cancer.vcf \
> cancer.ann.vcf
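Since the sample names must match the VCF exactly, it can help to check them before running SnpEff. The sketch below is an illustration under stated assumptions: it is not part of SnpEff, the helper names format_samples_tsv and unknown_samples are hypothetical, and the '#CHROM' line is assumed to be tab-separated as in a regular VCF file.

```python
def format_samples_tsv(pairs):
    """Return the samples file content: one 'germline<TAB>somatic' line per pair."""
    return "".join(f"{germline}\t{somatic}\n" for germline, somatic in pairs)

def unknown_samples(chrom_header_line, pairs):
    """Return the names used in `pairs` that are absent from the '#CHROM' line."""
    columns = chrom_header_line.rstrip("\n").split("\t")
    vcf_samples = set(columns[9:])  # sample names start at the 10th column
    names = [name for pair in pairs for name in pair]
    return [name for name in names if name not in vcf_samples]

header = "\t".join([
    "#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
    "Patient_01_Germline", "Patient_01_Somatic",
])
pairs = [("Patient_01_Germline", "Patient_01_Somatic")]

print(format_samples_tsv(pairs), end="")
print("names not found in VCF:", unknown_samples(header, pairs))
```

An empty list from unknown_samples means every name in the TXT file also appears as a sample column in the VCF.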
VCF header
Using the PEDIGREE meta information in the VCF header is the default method. Its main advantage is that you don't have to carry information in a separate TXT file (all the information is within your VCF file).
You have to add the PEDIGREE header with the appropriate information to your VCF file, which requires you to edit your VCF file's header.
Warning
How to edit VCF headers is beyond the scope of this manual (we recommend using vcf-annotate from VCFtools). But if you find adding PEDIGREE information to your VCF file difficult, just use the TXT file method described in the previous sub-section.
E.g.: Pedigree information in a VCF file would look like this:
$ cat examples/cancer_pedigree.vcf
##PEDIGREE=<Derived=Patient_01_Somatic,Original=Patient_01_Germline>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Patient_01_Germline Patient_01_Somatic
1 69091 . A C,G . PASS AF=0.1122 GT 1/0 2/1
1 69849 . G A,C . PASS AF=0.1122 GT 1/0 2/1
1 69511 . A C,G . PASS AF=0.3580 GT 1/1 2/2
$ java -Xmx8g -jar snpEff.jar -v -cancer GRCh37.75 examples/cancer_pedigree.vcf > examples/cancer_pedigree.ann.vcf
Here the PEDIGREE header states that Patient_01_Somatic is derived from the sample called Patient_01_Germline.
In this context, this means that the cancer sample is derived from the healthy tissue.
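To see which germline / somatic pairs a VCF header declares, the PEDIGREE lines can be read with a few lines of Python. This is a minimal sketch, not SnpEff's own parser: the function name parse_pedigree is hypothetical, and it only handles the Derived / Original keys shown in the example above.

```python
import re

PEDIGREE_RE = re.compile(r"^##PEDIGREE=<(.*)>\s*$")

def parse_pedigree(header_lines):
    """Return (original, derived) sample pairs from '##PEDIGREE' header lines."""
    pairs = []
    for line in header_lines:
        match = PEDIGREE_RE.match(line)
        if not match:
            continue  # not a PEDIGREE line
        fields = dict(
            item.split("=", 1) for item in match.group(1).split(",") if "=" in item
        )
        if "Original" in fields and "Derived" in fields:
            pairs.append((fields["Original"], fields["Derived"]))
    return pairs

header_lines = [
    "##PEDIGREE=<Derived=Patient_01_Somatic,Original=Patient_01_Germline>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tPatient_01_Germline\tPatient_01_Somatic\n",
]
print(parse_pedigree(header_lines))
```

Each returned pair lists the germline (Original) sample first and the somatic (Derived) sample second, the same column order as the TXT file.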
Interpreting Cancer annotations
Interpretation of the ANN field for cancer samples relies on the 'Allele' sub-field.
Just as a reminder, the ANN field has the following format:
ANN = Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS_WARNINGS_INFO
The Allele field tells you which effect relates to which genotype and, more importantly, which effects describe the genotype difference between somatic and germline.
Example: when there are multiple ALTs (e.g. REF='A' ALT='C,G'), the Allele field can be:
- Allele = "C": the effect is related to the first ALT ('C')
- Allele = "G": the effect is related to the second ALT ('G')
- Allele = "G-C": this is the effect of having the second ALT as variant while using the first ALT as reference ("C -> G"). It is important that you understand the meaning of the last one, because you'll use it often for your cancer analysis.
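The three cases above can be told apart programmatically. The following sketch is an illustration only: the function name parse_allele is hypothetical, and it assumes that allele strings themselves never contain a '-' character.

```python
def parse_allele(allele):
    """Split an ANN 'Allele' value into (variant, reference).

    'G-C' -> ('G', 'C'): variant 'G' evaluated against 'C' as the reference.
    'C'   -> ('C', None): a standard annotation, evaluated against REF.
    """
    variant, separator, reference = allele.partition("-")
    return (variant, reference) if separator else (variant, None)

for value in ("C", "G", "G-C"):
    print(value, "->", parse_allele(value))
```

A non-empty second element marks a cancer (somatic vs germline) annotation.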
Example: For the previously mentioned VCF example, at position 69511 we get (edited for readability):
$ java -Xmx8g -jar snpEff.jar -v -cancer GRCh37.75 examples/cancer_pedigree.vcf > examples/cancer_pedigree.ann.vcf
1 69511 . A C,G . PASS AF=0.3580;
ANN=C|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.421A>C|p.Thr141Pro|421/918|421/918|141/305||
,G|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.421A>G|p.Thr141Ala|421/918|421/918|141/305||
,G-C|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.421C>G|p.Pro141Ala|421/918|421/918|141/305||
GT 1/1 2/2
- In this case, we have two ALTs = 'C' and 'G'.
- Germline sample is homozygous 'C/C' (GT = '1/1')
- Somatic tissue is homozygous 'G/G' (GT = '2/2')
- Changes A -> C and A -> G are always calculated by SnpEff (this is the "default mode").
- A -> C produces this effect:
  C|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.421A>C|p.Thr141Pro|421/918|421/918|141/305||
  Note that the Allele field is 'C' indicating this is produced by the first ALT.
- A -> G produces this effect:
  G|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.421A>G|p.Thr141Ala|421/918|421/918|141/305||
  Note that the Allele field is 'G' indicating this is produced by the second ALT.
Finally, this is what you were looking for: the cancer comparisons. Germline is homozygous C/C (GT='1/1') and somatic is homozygous G/G (GT='2/2'), so SnpEff compares each somatic allele against each germline allele. There are 2x2=4 possible comparisons:
- G vs C : Somatic vs Germline. SnpEff reports this one.
- G vs C : Duplicate of the above, so SnpEff skips it.
- G vs C : Duplicate again, skipped.
- G vs C : Duplicate again, skipped.
Since all four comparisons reduce to the same G vs C, only one cancer annotation is produced. The 'C -> G' somatic mutation produces the following effect:
G-C|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.421C>G|p.Pro141Ala|421/918|421/918|141/305||
Warning
Notice the Allele field is "G-C", meaning we produce a new reference on the fly using ALT 1 ('C') and then use ALT 2 ('G') as the variant. So we compare 'G' (ALT) to 'C' (REF). The HGVS notations c.421C>G and p.Pro141Ala reflect this somatic change relative to the germline.
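When post-processing annotated files, the cancer comparisons can be picked out of the ANN field by looking at the Allele sub-field. The sketch below is a hedged illustration rather than an official tool: the names parse_ann and cancer_annotations are hypothetical, the sub-field names are taken from the format reminder above, and the INFO value is assumed to be a single string without the line breaks added here for readability.

```python
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank", "HGVS.c",
    "HGVS.p", "cDNA.pos / cDNA.length", "CDS.pos / CDS.length",
    "AA.pos / AA.length", "Distance", "ERRORS_WARNINGS_INFO",
]

def parse_ann(info):
    """Return one dict per ANN entry found in a VCF INFO string."""
    for item in info.split(";"):
        if item.startswith("ANN="):
            return [dict(zip(ANN_FIELDS, entry.split("|")))
                    for entry in item[len("ANN="):].split(",")]
    return []

def cancer_annotations(info):
    """Keep only entries whose Allele has the 'variant-reference' form."""
    return [ann for ann in parse_ann(info) if "-" in ann["Allele"]]

# INFO value of the example record at position 69511
EXAMPLE_INFO = (
    "AF=0.3580;ANN="
    "C|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|ENST00000335137"
    "|protein_coding|1/1|c.421A>C|p.Thr141Pro|421/918|421/918|141/305||,"
    "G|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|ENST00000335137"
    "|protein_coding|1/1|c.421A>G|p.Thr141Ala|421/918|421/918|141/305||,"
    "G-C|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|ENST00000335137"
    "|protein_coding|1/1|c.421C>G|p.Pro141Ala|421/918|421/918|141/305||"
)

for ann in cancer_annotations(EXAMPLE_INFO):
    print(ann["Allele"], ann["Annotation"], ann["HGVS.c"], ann["HGVS.p"])
```

For the example record, only the "G-C" entry is selected, which is the somatic change relative to the germline.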
Info
For unphased genotypes (using / separator, e.g. 1/0), SnpEff reports ALL possible somatic-vs-germline allele comparisons (Cartesian product of somatic alleles x germline alleles), after removing duplicates and comparisons already covered by standard annotations. For phased genotypes (using | separator, e.g. 1|0), only position-matched comparisons are made (first somatic allele vs first germline allele, second vs second).
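The difference between unphased and phased genotypes can be sketched in Python as follows. This is one reading of the rule above, not SnpEff's actual implementation: the function name cancer_comparisons is hypothetical, a comparison is treated as "already covered by standard annotations" when the germline allele is REF (index 0), identical alleles are skipped as "no change", and genotypes are treated as phased only when both samples use the | separator. All three choices are assumptions.

```python
import re
from itertools import product

def cancer_comparisons(germline_gt, somatic_gt):
    """List (somatic, germline) allele-index pairs to compare."""
    phased = "|" in germline_gt and "|" in somatic_gt  # assumption: both phased
    germline = [int(i) for i in re.split(r"[/|]", germline_gt)]
    somatic = [int(i) for i in re.split(r"[/|]", somatic_gt)]

    if phased:
        candidates = zip(somatic, germline)      # first vs first, second vs second
    else:
        candidates = product(somatic, germline)  # Cartesian product

    result = []
    for som, germ in candidates:
        if som == germ:  # assumption: same allele, no somatic change
            continue
        if germ == 0:    # assumption: somatic vs REF is a standard annotation
            continue
        if (som, germ) not in result:  # remove duplicates
            result.append((som, germ))
    return result

print(cancer_comparisons("1/1", "2/2"))  # example record at position 69511
print(cancer_comparisons("0/1", "2/0"))  # unphased
print(cancer_comparisons("0|1", "2|0"))  # phased
```

Under these assumptions, the example record yields a single comparison, ALT 2 vs ALT 1 (reported as "G-C"), while the phased variant of the second genotype pair yields fewer comparisons than the unphased one.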
Cancer annotations using 'EFF' field (deprecated):
Warning
The EFF field is the legacy annotation format, used with the -classic command line option. The current standard format is ANN. The following section is kept for reference only.
Interpretation of the EFF field for cancer samples relies on the 'Genotype' sub-field.
Just as a reminder, the EFF field has the following format:
EFF = Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_Length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank | Genotype_Number [ | ERRORS | WARNINGS ] )
For the previous example, we get (edited for readability):
$ java -Xmx8g -jar snpEff.jar -v -classic -cancer GRCh37.75 examples/cancer_pedigree.vcf > examples/cancer_pedigree.eff.vcf
1 69511 . A C,G . PASS AF=0.3580;
EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Cca|T141P|305|OR4F5|protein_coding|CODING|ENST00000335137|1|C)
,NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|1|G)
,NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Cca/Gca|P141A|305|OR4F5|protein_coding|CODING|ENST00000335137|1|G-C)
The GenotypeNum field tells you which effect relates to which genotype and, more importantly, which effects describe the genotype difference between somatic and germline.
Example: when there are multiple ALTs (e.g. REF='A' ALT='C,G'), the genotype field can be:
- GenotypeNum = "1": the effect is related to the first ALT ('C')
- GenotypeNum = "2": the effect is related to the second ALT ('G')
- GenotypeNum = "2-1": this is the effect of having the second ALT as variant while using the first ALT as reference ("C -> G").
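For legacy files, a single EFF entry can be split into its sub-fields in a similar way. This is a minimal sketch with hypothetical names (parse_eff_entry, EFF_FIELDS); the sub-field names come from the format reminder above, and the optional ERRORS / WARNINGS sub-fields are ignored. Note that in the example output above, the last sub-field of the cancer entry reads 'G-C'.

```python
import re

EFF_FIELDS = [
    "Effect_Impact", "Functional_Class", "Codon_Change", "Amino_Acid_Change",
    "Amino_Acid_Length", "Gene_Name", "Transcript_BioType", "Gene_Coding",
    "Transcript_ID", "Exon_Rank", "Genotype_Number",
]
EFF_RE = re.compile(r"^([^()]+)\((.*)\)$")

def parse_eff_entry(entry):
    """Split one 'Effect(sub|fields|...)' entry into a dict."""
    match = EFF_RE.match(entry.strip())
    if not match:
        raise ValueError(f"not an EFF entry: {entry!r}")
    # zip() stops at the last named field, so trailing ERRORS / WARNINGS are dropped
    record = dict(zip(EFF_FIELDS, match.group(2).split("|")))
    record["Effect"] = match.group(1)
    return record

entry = ("NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Cca/Gca|P141A|305|OR4F5"
         "|protein_coding|CODING|ENST00000335137|1|G-C)")
eff = parse_eff_entry(entry)
print(eff["Effect"], eff["Codon_Change"], eff["Amino_Acid_Change"], eff["Genotype_Number"])
```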