Cancer samples
Here we describe details about annotating cancer samples.
Single multi-sample VCF vs. multiple VCF files.
It is common practice to keep all samples in a single multi-sample VCF file; splitting them across two or more separate VCF files is highly discouraged. This is also the "gold standard" in cancer analysis, so all samples (both somatic and germline) should be in one VCF file.
SnpEff follows this gold standard and therefore requires a single multi-sample VCF: it is not possible to run cancer analysis using multiple VCF files.
Running in cancer analysis mode
Using the -cancer command line option, you can compare somatic vs germline samples.
An example command line would be:
$ java -Xmx8g -jar snpEff.jar -v -cancer GRCh37.75 cancer.vcf > cancer.ann.vcf
Representing cancer data
In a typical cancer sequencing experiment, we want to measure and annotate differences between germline (healthy) and somatic (cancer) tissue samples from the same patient. The complication is that germline is not always the same as the reference genome, so a typical annotation does not work.
For instance, let's assume that at a given genomic position (e.g. chr1:69091), reference genome is 'A', germline is 'C' and somatic is 'G'. This should be represented in a VCF file as:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Patient_01_Germline Patient_01_Somatic
1 69091 . A C,G . PASS AC=1 GT 1/0 2/1
Warning
Some people tend to represent this by replacing the REF base 'A' with the germline base 'C'. This is a mistake: REF must always represent the reference genome, not one of your samples.
Under normal conditions, SnpEff would provide the effects of the changes "A -> C" and "A -> G". But in the case of cancer samples, we are actually interested in the difference between somatic and germline, so we'd like to calculate the effect of a "C -> G" mutation. Calculating this effect is not trivial: we have to build a new "reference" by applying the first mutation ("A -> C") and then calculate the effect of the second one ("C -> G") on that "new reference".
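To make the "C -> G" comparison concrete, the sketch below decodes the GT indices of a VCF line into bases. It is only an illustration: decode_gt is a hypothetical helper name (not part of SnpEff), and it assumes unphased '/'-separated genotypes and a FORMAT column that contains only GT.

```shell
# decode_gt: hypothetical helper (not part of SnpEff). Reads one VCF data line
# on stdin and prints the bases behind each sample's GT indices.
# Assumptions: FORMAT is just "GT" and alleles are separated by "/".
decode_gt() {
  awk '{
    # alleles[0] is REF, alleles[1..n] are the ALTs in order
    alleles[0] = $4
    n = split($5, alts, ",")
    for (i = 1; i <= n; i++) alleles[i] = alts[i]
    # Sample columns start at field 10
    for (s = 10; s <= NF; s++) {
      m = split($s, gt, "/")
      out = ""
      for (j = 1; j <= m; j++) out = out (j > 1 ? "/" : "") alleles[gt[j]]
      printf "%s%s", (s > 10 ? " " : ""), out
    }
    printf "\n"
  }'
}

# Germline 1/0 and somatic 2/1 from the example line above
echo "1 69091 . A C,G . PASS AC=1 GT 1/0 2/1" | decode_gt   # prints: C/A G/C
```

The germline sample carries 'C' and the somatic sample carries 'G' on top of it, which is why the comparison of interest is "C -> G" rather than "A -> G".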
Info
In order to activate cancer analysis, you must use the -cancer command line option.
Defining cancer samples
As already mentioned, cancer data is represented in a VCF file using multiple ALTs (the REF field is always the reference genome). There are two ways to specify which samples are somatic and which ones are germline:
- Use a TXT file, passed with the -cancerSamples command line option.
- Use the PEDIGREE meta information in your VCF file's header. This is the default, but some people might find it hard to edit or change information in a VCF file's header.
Warning
If you do not provide either PEDIGREE meta information or a TXT samples file, SnpEff will not know which somatic samples derive from which germline samples, so it will be unable to perform cancer effect analysis.
TXT file
This is quite easy. All you have to do is to create a tab-separated TXT file having two columns: the first column has the germline sample names and the second column has the somatic sample names. Make sure that sample names match exactly the ones in the VCF file.
E.g.: Create a TXT file named 'samples_cancer.txt'
Patient_01_Germline Patient_01_Somatic
Patient_02_Germline Patient_02_Somatic
Patient_03_Germline Patient_03_Somatic
Patient_04_Germline Patient_04_Somatic
Then pass the file to SnpEff using the -cancerSamples command line option.
E.g.: The command line below uses a one-patient samples file named 'examples/samples_cancer_one.txt':
$ cat examples/samples_cancer_one.txt
Patient_01_Germline Patient_01_Somatic
$ java -Xmx8g -jar snpEff.jar -v \
-cancer \
-cancerSamples examples/samples_cancer_one.txt \
GRCh37.75 \
examples/cancer.vcf \
> cancer.ann.vcf
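Since the sample names must match the VCF exactly, it can help to generate the TXT file with explicit tabs and to check it against the VCF header before running SnpEff. The following is a minimal sketch: check_samples is a hypothetical helper (not part of SnpEff), and it assumes an uncompressed, tab-separated VCF.

```shell
# Write the two-column, tab-separated samples file (germline first, somatic second).
printf '%s\t%s\n' Patient_01_Germline Patient_01_Somatic > samples_cancer_one.txt

# check_samples: hypothetical helper (not part of SnpEff). Prints every name
# from the samples file ($1) that is absent from the #CHROM line of the VCF ($2).
check_samples() {
  awk -F '\t' '
    # First file: remember both sample names of every pair
    NR == FNR { wanted[$1] = 1; wanted[$2] = 1; next }
    # Second file: sample names are columns 10 and up of the #CHROM line
    /^#CHROM/ { for (i = 10; i <= NF; i++) delete wanted[$i]; exit }
    END { for (name in wanted) print "missing in VCF: " name }
  ' "$1" "$2"
}
```

Running check_samples samples_cancer_one.txt examples/cancer.vcf prints nothing when every name in the TXT file is present in the VCF header.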
Using the PEDIGREE meta information in the VCF header is the default method. Its main advantage is that you don't have to carry information in a separate TXT file (all the information is within your VCF file).
You have to add the PEDIGREE header with the appropriate information to your VCF file, which requires editing your VCF file's header.
Warning
How to edit VCF headers is beyond the scope of this manual (we recommend using vcf-annotate from VCFtools). But if you find adding PEDIGREE information to your VCF file difficult, just use the TXT file method described in the previous sub-section.
E.g.: Pedigree information in a VCF file would look like this:
$ cat examples/cancer_pedigree.vcf
##PEDIGREE=<Derived=Patient_01_Somatic,Original=Patient_01_Germline>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Patient_01_Germline Patient_01_Somatic
1 69091 . A C,G . PASS AF=0.1122 GT 1/0 2/1
1 69849 . G A,C . PASS AF=0.1122 GT 1/0 2/1
1 69511 . A C,G . PASS AF=0.3580 GT 1/1 2/2
$ java -Xmx8g -jar snpEff.jar -v -cancer GRCh37.75 examples/cancer_pedigree.vcf > examples/cancer_pedigree.ann.vcf
The PEDIGREE line states that the sample called Patient_01_Somatic is derived from the sample called Patient_01_Germline.
In this context, this means that the cancer sample is derived from the healthy tissue.
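If you already have a germline/somatic pairs file, the PEDIGREE lines can be generated from it. The sketch below is one possible approach with plain awk, not the vcf-annotate method recommended above: add_pedigree is a hypothetical helper name, and it assumes an uncompressed VCF and a tab-separated pairs file with germline in the first column.

```shell
# add_pedigree: hypothetical helper (not part of SnpEff or VCFtools).
# $1 = tab-separated pairs file (germline, somatic), $2 = uncompressed VCF.
# Writes the VCF to stdout with one ##PEDIGREE line per pair inserted
# immediately before the #CHROM header line.
add_pedigree() {
  awk -F '\t' '
    # First file: build one PEDIGREE line per germline/somatic pair
    NR == FNR { ped[++n] = "##PEDIGREE=<Derived=" $2 ",Original=" $1 ">"; next }
    # Second file: emit the PEDIGREE lines just before the column header
    /^#CHROM/ { for (i = 1; i <= n; i++) print ped[i] }
    { print }
  ' "$1" "$2"
}
```

For example, add_pedigree samples_cancer_one.txt examples/cancer.vcf > examples/cancer_pedigree.vcf would write a copy of the VCF with the header line shown above; the file names here are only illustrative.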
Interpreting Cancer annotations
Interpretation of the ANN field for cancer samples relies on the 'Allele' sub-field.
Just as a reminder, the ANN field has the following format:
ANN = Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS_WARNINGS_INFO
The Allele field tells you which effect relates to which genotype and, more importantly, which effects describe a genotype difference between somatic and germline.
Example: when there are multiple ALTs (e.g. REF='A' ALT='C,G') and the Allele field says:
- Allele = "C": the effect is related to the first ALT ('C')
- Allele = "G": the effect is related to the second ALT ('G')
- Allele = "G-C": this is the effect of having the second ALT as variant while using the first ALT as reference ("C -> G"). It is important that you understand the meaning of this last one, because you'll use it often in your cancer analysis.
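The Allele values described above can be used to pull the somatic vs germline entries out of an annotated file. The following sketch is an assumption-laden illustration: cancer_ann is a hypothetical helper (not part of SnpEff), it expects a tab-separated VCF on stdin, and it treats any Allele containing '-' as a cancer comparison.

```shell
# cancer_ann: hypothetical helper (not part of SnpEff). Reads annotated VCF
# lines on stdin and prints "CHROM POS Allele Annotation Impact" for every
# ANN entry whose Allele has the "ALT-ALT" form (e.g. "G-C").
cancer_ann() {
  awk -F '\t' '
    /^#/ { next }
    {
      # INFO is column 8; its entries are separated by semicolons
      n = split($8, info, ";")
      for (i = 1; i <= n; i++) {
        if (info[i] !~ /^ANN=/) continue
        # Drop the "ANN=" prefix, then split the comma-separated entries
        m = split(substr(info[i], 5), entries, ",")
        for (j = 1; j <= m; j++) {
          split(entries[j], f, "[|]")
          if (f[1] ~ /-/) print $1, $2, f[1], f[2], f[3]
        }
      }
    }'
}
```

A typical call would be cancer_ann < cancer.ann.vcf, which lists only the comparisons between a somatic allele and a germline allele.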
Example: For the first line of the previously mentioned VCF example, we get (the output has been edited for readability):
$ java -Xmx8g -jar snpEff.jar -v -cancer -cancerSamples examples/samples_cancer_one.txt GRCh37.75 examples/cancer.vcf > examples/cancer.eff.vcf
1 69091 . A C,G . PASS AF=0.1122;
ANN=G|start_lost|HIGH|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.1A>G|p.Met1?|1/918|1/918|1/305||
,G-C|start_lost|HIGH|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.1A>G|p.Leu1?|1/918|1/918|1/305||
,C|initiator_codon_variant|LOW|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.1A>C|p.Met1?|1/918|1/918|1/305||
GT 1/0 2/1
- In this case, we have two ALTs = 'C' and 'G'.
- Germline sample is heterozygous 'C/A' (GT = '1/0')
- Somatic tissue is heterozygous 'G/C' (GT = '2/1')
- The changes A -> C and A -> G are always calculated by SnpEff (this is the "default mode").
- A -> C produces this effect:
C|initiator_codon_variant|LOW|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.1A>C|p.Met1?|1/918|1/918|1/305||
Note that the first field (Allele) is 'C', indicating this is produced by the first ALT.
- A -> G produces this effect:
G|start_lost|HIGH|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.1A>G|p.Met1?|1/918|1/918|1/305||
Note that the first field (Allele) is 'G', indicating this is produced by the second ALT.
- Finally, the part you were waiting for: the cancer comparisons. Since both germline and somatic are heterozygous (GT are '1/0' and '2/1'), there are 4 possible comparisons to make:
    - G vs C: This is the somatic vs germline comparison we are interested in. SnpEff reports this one.
    - G vs A: This compares ALT to REF, so it was already reported in "default mode". SnpEff doesn't report this one again.
    - C vs C: This is not a variant, since both of them are '1'. SnpEff skips this one.
    - C vs A: This compares ALT to REF, so it was already reported in "default mode". SnpEff doesn't report this one again.
This may look confusing, but the bottom line is that only the first comparison makes sense, and it is the one SnpEff reports. So 'C -> G' produces the following effect:
G-C|start_lost|HIGH|OR4F5|ENSG00000186092|transcript|ENST00000335137|protein_coding|1/1|c.1A>G|p.Leu1?|1/918|1/918|1/305||
Warning
Notice that the Allele field is "G-C", meaning that we produce a new reference on the fly using ALT 1 ('C') and then use ALT 2 ('G') as the variant. So we compare 'G' (ALT) to 'C' (REF).
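The four comparisons discussed in this example can be enumerated mechanically from the two GT values. The sketch below only mirrors the reasoning in the text; it is not SnpEff's implementation. list_comparisons is a hypothetical helper that works on ALT indices (so 'G' vs 'C' appears as 2 vs 1), and the case where the somatic allele is REF is not described in the text, so it is only flagged.

```shell
# list_comparisons: hypothetical helper (not SnpEff code).
# $1 = germline GT, $2 = somatic GT, e.g. "1/0" and "2/1".
# Prints one line per somatic-vs-germline allele pair with a verdict.
list_comparisons() {
  awk -v germ="$1" -v som="$2" 'BEGIN {
    ng = split(germ, g, "/")
    ns = split(som, s, "/")
    for (i = 1; i <= ns; i++)
      for (j = 1; j <= ng; j++) {
        if (s[i] == g[j])   verdict = "same allele, not a variant"
        else if (g[j] == 0) verdict = "ALT vs REF, already reported in default mode"
        else if (s[i] == 0) verdict = "somatic allele is REF, not covered by this sketch"
        else                verdict = "somatic vs germline, reported as " s[i] "-" g[j]
        print s[i] " vs " g[j] ": " verdict
      }
  }'
}

list_comparisons 1/0 2/1
```

For the example genotypes this lists the same four pairs as above, in the same order, with only the first one (2 vs 1, i.e. 'G' vs 'C') marked as a reported cancer comparison.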
Cancer annotations using 'EFF' field:
Interpretation of the EFF field for cancer samples relies on the 'Genotype' sub-field.
Just as a reminder, the EFF field has the following format:
EFF = Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_Length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank | Genotype_Number [ | ERRORS | WARNINGS ] )
For the previous example, we get (edited for readability):
$ java -Xmx8g -jar snpEff.jar -v -classic -cancer -cancerSamples examples/samples_cancer_one.txt GRCh37.75 examples/cancer.vcf > examples/cancer.eff.vcf
1 69091 . A C,G . PASS AC=1;
EFF=START_LOST(HIGH|MISSENSE|Atg/Gtg|M1V|305|OR4F5|protein_coding|CODING|ENST00000335137|1|G)
,START_LOST(HIGH|MISSENSE|Ctg/Gtg|L1V|305|OR4F5|protein_coding|CODING|ENST00000335137|1|G-C)
,NON_SYNONYMOUS_START(LOW|MISSENSE|Atg/Ctg|M1L|305|OR4F5|protein_coding|CODING|ENST00000335137|1|C)
The GenotypeNum field (listed as Genotype_Number in the format above) tells you which effect relates to which genotype and, more importantly, which effects describe a genotype difference between somatic and germline.
Example: when there are multiple ALTs (e.g. REF='A' ALT='C,G') and the genotype field says:
- GenotypeNum = "1": the effect is related to the first ALT ('C')
- GenotypeNum = "2": the effect is related to the second ALT ('G')
- GenotypeNum = "2-1": this is the effect of having the second ALT as variant while using the first ALT as reference ("C -> G").
Note that the sample output above writes this sub-field with the alleles themselves ('C', 'G' and 'G-C') rather than with ALT numbers.
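To read the genotype sub-field out of EFF entries, one option is to split each entry on '|' and take the 11th sub-field, following the format line above. This is a sketch under that assumption: eff_genotypes is a hypothetical helper (not part of SnpEff), and it expects the bare EFF value (without the "EFF=" prefix) on stdin.

```shell
# eff_genotypes: hypothetical helper (not part of SnpEff). Reads the value of
# an EFF field (comma-separated entries) on stdin and prints, for each entry,
# the effect name followed by the 11th sub-field (the genotype sub-field).
eff_genotypes() {
  awk '{
    n = split($0, entries, ",")
    for (i = 1; i <= n; i++) {
      lp = index(entries[i], "(")
      effect = substr(entries[i], 1, lp - 1)
      # Keep only the text between "(" and the trailing ")"
      body = substr(entries[i], lp + 1, length(entries[i]) - lp - 1)
      split(body, f, "[|]")
      print effect, f[11]
    }
  }'
}

# Second entry of the example output above
echo 'START_LOST(HIGH|MISSENSE|Ctg/Gtg|L1V|305|OR4F5|protein_coding|CODING|ENST00000335137|1|G-C)' | eff_genotypes   # prints: START_LOST G-C
```

Entries whose genotype sub-field contains '-' are the somatic vs germline comparisons.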