Skip to content

SnpSift GT

Compress genotype calls, reducing the overall size of the VCF file.

This is intended for compressing very large VCF in very large sequencing projects (e.g. thousands of samples).

Info

For instance, we've reduced 1Tb (1,000 Gb) VCF file to roughly 1Gb in a project that has over 10,000 samples.

Usage

java -jar SnpSift.jar gt [options] [file.vcf] > file.gt.vcf
Options:
    -u   : Uncompress (restore genotype fields).
Default input is STDIN.

How it works

In large re-sequencing projects most of the variants are singletons. This means that most variants are present in only one of the samples. For those variants, you have thousands of samples that are homozygous reference (i.e. genotype entry is "0/0") and one that is a variant (e.g. '0/1' or '1/1').

A trivial way to compress these VCF entries is just to state which sample has non-reference information. Intuitively, this is similar to the way used to represent sparse matrices (only store non-zero elements).

SnpSift gt creates three INFO fields. These three fields are composed of comma-separated 0-based sample indexes having:

  • HO: Indicates homozygous variant samples (i.e. '1/1').
  • HE: Indicates heterozygous variant samples (i.e. '0/1').
  • NA: Indicates samples with missing genotype data (i.e. './.').

All genotype columns are removed from compressed entries. You can use -u command line option to uncompress.

Limitations

Multi-allelic variants (entries with more than one ALT allele) are not compressed. They are output as-is with their full genotype fields. This means the output can contain a mix of compressed and uncompressed entries.

Warning

This is lossy compression:

  • Only the GT sub-field is preserved. All other genotype sub-fields (DP, GQ, PL, AD, etc.) are permanently lost.
  • Phasing information is lost: phased genotypes (e.g. 0|1, 1|0) are restored as unphased (0/1) on decompression.
  • After a compress/decompress round-trip, the FORMAT field is reduced to just GT.

Example

$ cat test.vcf
#CHROM  POS     ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  Sample_6  Sample_7  Sample_8  Sample_9  Sample_10  Sample_11  Sample_12  Sample_13  Sample_14  Sample_15
1       861276  .   A    G    .     PASS    AC=1  GT      0/0       1/1       0/0       0/0       0/0       0/0       0/0       0/0       0/0       0/0        0/0        0/0        0/0        0/0        0/0

#---
# Compress genotypes
#---
$ java -jar SnpSift.jar gt test.vcf | tee test.gt.vcf
##INFO=<ID=HO,Number=.,Type=Integer,Description="List of sample indexes having homozygous ALT genotypes">
##INFO=<ID=HE,Number=.,Type=Integer,Description="List of sample indexes having heterozygous ALT genotypes">
##INFO=<ID=NA,Number=.,Type=Integer,Description="List of sample indexes having missing genotypes">
#CHROM  POS     ID  REF  ALT  QUAL  FILTER  INFO        FORMAT  Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  Sample_6  Sample_7  Sample_8  Sample_9  Sample_10  Sample_11  Sample_12  Sample_13  Sample_14  Sample_15
1       861276  .   A    G    .     PASS    AC=1;HO=1

#---
# Uncompress genotypes (command line option '-u')
#---
$ java -jar SnpSift.jar gt -u test.gt.vcf
#CHROM  POS     ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  Sample_6  Sample_7  Sample_8  Sample_9  Sample_10  Sample_11  Sample_12  Sample_13  Sample_14  Sample_15
1       861276  .   A    G    .     PASS    AC=1  GT      0/0       1/1       0/0       0/0       0/0       0/0       0/0       0/0       0/0       0/0        0/0        0/0        0/0        0/0        0/0

Note that Sample_2 is the second sample (0-based index 1), so it appears as HO=1 in the compressed output.