SnpSift CaseControl
Allows you to count how many samples are in 'case' and 'control' groups.
Typical usage
This command counts the number of 'homozygous', 'heterozygous' and 'total' variants in the case and control groups and computes basic p-values using the Fisher exact test and the Cochran-Armitage test.
Case and Control groups can be defined either by a command line string or a TFAM file (see PLINK's documentation).
The case/control command line string is built from the symbols {'+', '-', '0'}, one per sample, where '+' is case, '-' is control and '0' is neutral (ignored).
E.g. We have ten samples, which means ten genotype columns in the VCF file. The first four are 'cases', the fifth one is 'neutral', and the last five are 'control'. So the description string would be "++++0-----" (note that the following output has been edited, only counts are shown, no pValues):
$ java -jar SnpSift.jar caseControl "++++0-----" cc.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_01 Sample_02 Sample_03 Sample_04 Sample_05 Sample_06 Sample_07 Sample_08 Sample_09 Sample_10
1 69496 . G A . PASS AF=0.01;Cases=1,2,4;Controls=2,2,6 GT 0/1 1/1 1/0 0/0 0/0 0/1 1/1 1/1 1/0 0/0
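Case genotypes are samples 1 to 4: 0/1, 1/1, 1/0 and 0/0. So there is 1 homozygous, 2 heterozygous, and a total of 4 variants (1 * 2 + 2 * 1 = 4).
Thus the annotation is Cases=1,2,4
Control genotypes are samples 6 to 10: 0/1, 1/1, 1/1, 1/0 and 0/0. So there are 2 homozygous, 2 heterozygous, and a total of 6 variants (2 * 2 + 2 * 1 = 6).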
Thus the annotation is Controls=2,2,6
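The counting rule above can be reproduced with a short Python sketch. This is only an illustration of the arithmetic, not SnpSift's implementation: the function name `count_genotypes` is hypothetical, and the sketch assumes diploid, biallelic genotypes (any allele other than `0` or `.` is counted as a variant).

```python
def count_genotypes(groups, genotypes):
    """Return {'+': (hom, het, total), '-': (hom, het, total)}.

    groups    : case/control string, one symbol per sample ('+', '-' or '0')
    genotypes : GT values in the same order, e.g. ["0/1", "1/1", ...]
    """
    counts = {"+": [0, 0, 0], "-": [0, 0, 0]}
    for group, gt in zip(groups, genotypes):
        if group not in counts:
            continue  # '0' means neutral: the sample is ignored
        # Number of non-reference alleles in a call such as "0/1" or "1|1"
        alleles = gt.replace("|", "/").split("/")
        alt = sum(1 for allele in alleles if allele not in ("0", "."))
        if alt == 2:
            counts[group][0] += 1  # homozygous non-reference
        elif alt == 1:
            counts[group][1] += 1  # heterozygous
        counts[group][2] += alt    # total variant alleles
    return {group: tuple(c) for group, c in counts.items()}


# Genotypes taken from the example VCF line above
genotypes = "0/1 1/1 1/0 0/0 0/0 0/1 1/1 1/1 1/0 0/0".split()
result = count_genotypes("++++0-----", genotypes)
print("Cases=%d,%d,%d" % result["+"])      # Cases=1,2,4
print("Controls=%d,%d,%d" % result["-"])   # Controls=2,2,6
```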
Info
You can use the -tfam command line option to specify a TFAM file.
Case and control status are read from the phenotype field of the TFAM file (6th column).
Phenotype order in the TFAM file does not need to match VCF sample order (sample IDs are used).
Phenotype column should be coded as {0,1,2} meaning {Missing, Control, Case} respectively.
See PLINK's reference for details about TFAM file format.
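To make the {0,1,2} coding concrete, the sketch below converts TFAM lines into the equivalent '+', '-', '0' string in VCF sample order. It is an illustration under assumptions, not SnpSift's code: the function name `groups_from_tfam` and the TFAM rows are hypothetical, samples are matched on the individual ID (2nd column), and samples missing from the TFAM are treated as neutral.

```python
def groups_from_tfam(tfam_lines, vcf_samples):
    """Map each VCF sample to '+', '-' or '0' using the TFAM phenotype column."""
    code = {"2": "+", "1": "-"}  # 2 = case, 1 = control, anything else = missing
    phenotype = {}
    for line in tfam_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # Column 2 is the individual ID, column 6 is the phenotype
        phenotype[fields[1]] = fields[5]
    # Build the string in VCF column order, not TFAM order
    return "".join(code.get(phenotype.get(s, "0"), "0") for s in vcf_samples)


# Hypothetical TFAM content; note the order differs from the VCF
tfam = """\
FAM1 Sample_03 0 0 1 2
FAM1 Sample_01 0 0 2 2
FAM2 Sample_02 0 0 1 1
FAM2 Sample_04 0 0 2 0
""".splitlines()

samples = ["Sample_01", "Sample_02", "Sample_03", "Sample_04", "Sample_05"]
group_string = groups_from_tfam(tfam, samples)
print(group_string)  # +-+00
```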
Info
You can use the -name nameString command line option to add a name to the INFO tags.
This can be used to count different case/control groups in the same dataset (e.g. multiple phenotypes).
$ java -jar SnpSift.jar caseControl -name "_MY_GROUP" "++++0-----" cc.vcf \
| java -jar SnpSift.jar caseControl -name "_ANOTHER_GROUP" "+-+-+-+-+-" -
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_01 Sample_02 Sample_03 Sample_04 Sample_05 Sample_06 Sample_07 Sample_08 Sample_09 Sample_10
1 69496 . G A . PASS AF=0.01;Cases_MY_GROUP=1,2,4;Controls_MY_GROUP=2,2,6;Cases_ANOTHER_GROUP=1,3,5;Controls_ANOTHER_GROUP=2,1,5 GT 0/1 1/1 1/0 0/0 0/0 0/1 1/1 1/1 1/0 0/0
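When several named groups are annotated, downstream scripts need to pull each tag back out of the INFO column. A minimal sketch, assuming the tags always follow the `Cases<name>` / `Controls<name>` pattern with three comma-separated integers as in the output above (the helper name `case_control_counts` is hypothetical):

```python
def case_control_counts(info):
    """Extract Cases*/Controls* tags from a VCF INFO string as integer tuples."""
    counts = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if sep and key.startswith(("Cases", "Controls")):
            # Each value is "homozygous,heterozygous,total"
            counts[key] = tuple(int(x) for x in value.split(","))
    return counts


# INFO field copied from the example output above
info = ("AF=0.01;Cases_MY_GROUP=1,2,4;Controls_MY_GROUP=2,2,6;"
        "Cases_ANOTHER_GROUP=1,3,5;Controls_ANOTHER_GROUP=2,1,5")
counts = case_control_counts(info)
for tag, (hom, het, total) in counts.items():
    print(tag, "hom:", hom, "het:", het, "total:", total)
```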
p-values
SnpSift caseControl calculates p-values using different models: dominant, recessive, allelic, genotypic (co-dominant) and Cochran-Armitage trend.
Info
When we say we use the Fisher exact test, we mean the exact calculation, not an approximation (such as the Chi-Square approximation). The p-values should therefore be correct even when counts in the contingency table are low. Approximations tend to be unreliable when any count in a contingency table is below 5; that problem should not arise here.
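For readers who want to see what "exact" means, here is a minimal Python sketch of a Fisher exact test on a 2 by 2 table, computed directly from the hypergeometric distribution. It is not SnpSift's code. The function name is hypothetical, and the two-sided convention used here (sum the probabilities of all tables no more likely than the observed one) is an assumption; this page does not state whether SnpSift reports a one-sided or two-sided value.

```python
from math import comb


def fisher_exact_two_sided(n11, n12, n21, n22):
    """Two-sided Fisher exact p-value for the table [[n11, n12], [n21, n22]]."""
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(n11)
    # Sum every table that is at most as likely as the observed one
    # (small relative tolerance guards against floating point ties)
    total = sum(p for p in (prob(k) for k in range(lowest, highest + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical low-count table, where approximations are unreliable
p_small = fisher_exact_two_sided(3, 1, 1, 3)
print(p_small)

# Carriers vs non-carriers from the ten-sample example:
# cases 3 with a variant, 1 without; controls 4 with, 1 without
p_example = fisher_exact_two_sided(3, 1, 4, 1)
print(p_example)
```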
Models:
- Dominant model (CC_DOM): A 2 by 2 contingency table is created:

                Alt (A/a + a/a)    Ref (A/A)
    Cases       N11                N12
    Controls    N21                N22

  The first column is the number of samples that have ANY non-reference allele: either 1 (heterozygous) or 2 (homozygous). Fisher exact test is used to calculate the p-value.
- Recessive model (CC_REC): A 2 by 2 contingency table is created:

                Alt (a/a)    Ref + Het (A/A + A/a)
    Cases       N11          N12
    Controls    N21          N22

  The first column is the number of samples in which both chromosomes are non-reference: homozygous ALT. Fisher exact test is used to calculate the p-value.
- Allelic model (CC_ALL): A 2 by 2 contingency table is created:

                Variants    References
    Cases       N11         N12
    Controls    N21         N22

  The first column is the number of non-reference alleles. For instance, homozygous reference samples count as 0, heterozygous count as 1 and homozygous non-reference count as 2. Fisher exact test is used to calculate the p-value.
- Genotypic / Codominant model (CC_GENO): A 2 by 3 contingency table is created:

                A/A    a/A    a/a
    Cases       N11    N12    N13
    Controls    N21    N22    N23

  The first column is the number of homozygous reference genotypes, the second column is the number of heterozygous genotypes, and the third column is the number of homozygous non-reference genotypes.
  A Chi-Square distribution with two degrees of freedom is used to calculate the p-value.
- Cochran-Armitage trend model (CC_TREND): A 2 by 3 contingency table is created:

                A/A    a/A    a/a
    Cases       N11    N12    N13
    Controls    N21    N22    N23
    Weight      0.0    1.0    2.0

  The first column is the number of homozygous reference genotypes, the second column is the number of heterozygous genotypes, and the third column is the number of homozygous non-reference genotypes.
  Cochran-Armitage test is used to calculate the p-value, using the weights shown in the last row.
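The tables above can be built from per-group genotype counts, as in the following Python sketch. It is an illustration, not SnpSift's implementation: all function names are hypothetical, the "References" column of the allelic table is assumed to hold reference allele counts, the Chi-Square statistic is the standard Pearson statistic, and the trend p-value is assumed to be two-sided with a normal approximation.

```python
from math import erfc, exp, sqrt


def model_tables(cases, controls):
    """Build each model's table from (hom_ref, het, hom_alt) sample counts."""
    def dominant(g):
        return [g[1] + g[2], g[0]]            # any ALT allele vs. A/A
    def recessive(g):
        return [g[2], g[0] + g[1]]            # a/a vs. A/A + A/a
    def allelic(g):
        return [g[1] + 2 * g[2], 2 * g[0] + g[1]]  # ALT alleles vs. REF alleles
    groups = (cases, controls)
    return {
        "CC_DOM": [dominant(g) for g in groups],
        "CC_REC": [recessive(g) for g in groups],
        "CC_ALL": [allelic(g) for g in groups],
        "CC_GENO": [list(g) for g in groups],  # also used by CC_TREND
    }


def chi_square_2df(table):
    """Pearson Chi-Square statistic of a 2x3 table and its p-value (2 df)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    # Survival function of Chi-Square with 2 degrees of freedom is exp(-x/2)
    return stat, exp(-stat / 2)


def cochran_armitage(table, weights=(0.0, 1.0, 2.0)):
    """Cochran-Armitage trend z-score and two-sided p-value for a 2x3 table."""
    r1, r2 = sum(table[0]), sum(table[1])
    cols = [a + b for a, b in zip(*table)]
    n = r1 + r2
    t = sum(w * (a * r2 - b * r1) for w, a, b in zip(weights, *table))
    var = sum(w * w * c * (n - c) for w, c in zip(weights, cols))
    var -= 2 * sum(weights[i] * weights[j] * cols[i] * cols[j]
                   for i in range(3) for j in range(i + 1, 3))
    var *= r1 * r2 / n
    if var <= 0:
        return 0.0, 1.0  # no variation: the trend cannot be tested
    z = t / sqrt(var)
    return z, erfc(abs(z) / sqrt(2))


# Counts from the ten-sample example: (hom_ref, het, hom_alt)
cases, controls = (1, 2, 1), (1, 2, 2)
tables = model_tables(cases, controls)
chi2, p_geno = chi_square_2df(tables["CC_GENO"])
z, p_trend = cochran_armitage(tables["CC_GENO"])
for name, table in tables.items():
    print(name, table)
print("CC_GENO chi2:", chi2, "p:", p_geno)
print("CC_TREND z:", z, "p:", p_trend)
```

With counts this small, the Chi-Square and trend p-values are themselves approximations, which is the situation the note on exact tests warns about.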