Building databases: Regulatory and Non-coding
SnpEff supports regulatory and non-coding annotations. In this section we show how to build those databases. As in the previous section, most likely you will never have to do it yourself and can just use available pre-built databases.
There are two ways to add support for regulatory annotations (these are not mutually exclusive, you can use both at the same time):
- GFF regulation file (from ENSEMBL).
- BED files.
Warning
Adding regulation support and analyzing data using regulation tracks can require much more memory. For instance, for the human genome I use 10 GB to 20 GB of RAM.
Warning
It is assumed that the genome is already installed; only regulatory tracks are added.
Option 1: Using a GFF file
This example shows how to create a regulation database for human (GRCh37.65):
- Get the GFF regulatory annotations (into path/to/snpEff/data/GRCh37.65/regulation.gff):

    cd path/to/snpEff/data/GRCh37.65
    wget ftp://ftp.ensembl.org/pub/release-65/regulation/homo_sapiens/AnnotatedFeatures.gff.gz
    mv AnnotatedFeatures.gff.gz regulation.gff.gz
- Create the databases. Note that we use the -onlyReg flag because we are only creating regulatory databases. If you omit it, both the "normal" and the regulatory databases are created:

    cd /path/to/snpEff
    java -Xmx20G -jar snpEff.jar build -v -onlyReg GRCh37.65
The output looks like this:
    Reading regulation elements (GFF)
        Chromosome '11' line: 226964
        Chromosome '12' line: 493780
        ...
        Chromosome '9' line: 4832434
        Chromosome 'X' line: 5054301
        Chromosome 'Y' line: 5166958
    Done
        Total lines : 5176289
        Total annotation count : 3961432
        Percent : 76.5%
        Total annotated length : 3648200193
        Number of cell/annotations : 266
    Saving database 'HeLa-S3' in file '/path/to/snpEff/data/GRCh37.65/regulation_HeLa-S3.bin'
    Saving database 'HepG2' in file '/path/to/snpEff/data/GRCh37.65/regulation_HepG2.bin'
    Saving database 'NHEK' in file '/path/to/snpEff/data/GRCh37.65/regulation_NHEK.bin'
    Saving database 'GM12878' in file '/path/to/snpEff/data/GRCh37.65/regulation_GM12878.bin'
    Saving database 'HUVEC' in file '/path/to/snpEff/data/GRCh37.65/regulation_HUVEC.bin'
    Saving database 'H1ESC' in file '/path/to/snpEff/data/GRCh37.65/regulation_H1ESC.bin'
    Saving database 'CD4' in file '/path/to/snpEff/data/GRCh37.65/regulation_CD4.bin'
    Saving database 'GM06990' in file '/path/to/snpEff/data/GRCh37.65/regulation_GM06990.bin'
    Saving database 'IMR90' in file '/path/to/snpEff/data/GRCh37.65/regulation_IMR90.bin'
    Saving database 'K562' in file '/path/to/snpEff/data/GRCh37.65/regulation_K562.bin'
    Done.
As you can see, annotations for each cell type are saved in different files. This makes it easier to load annotations only for the desired cell types when analyzing data.
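To see which cell types ended up with a database after a build, one option is to list the generated files and strip the prefix and extension. The following is a minimal sketch, assuming only the `regulation_CELLTYPE.bin` naming visible in the output above; the function name `list_regulation_cell_types` is hypothetical and not part of SnpEff.

```shell
# Print one cell type per line for every regulation_*.bin file in a
# genome data directory (naming taken from the build output above).
# The function name is illustrative, not part of SnpEff.
list_regulation_cell_types() {
    data_dir="$1"
    for f in "$data_dir"/regulation_*.bin; do
        [ -e "$f" ] || continue        # glob matched nothing: skip
        name=$(basename "$f" .bin)     # e.g. regulation_HeLa-S3
        echo "${name#regulation_}"     # e.g. HeLa-S3
    done
}
```

For example, `list_regulation_cell_types path/to/snpEff/data/GRCh37.65` would print the cell type names for the databases built in this section.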
Option 2: Using a BED file
This example shows how to create a regulation database for human (GRCh37.65).
We assume we have a file called my_regulation.bed, which has information for H3K9me3 in Pancreatic Islets (for instance, as the result of a ChIP-Seq experiment and peak enrichment analysis).
- Add all your BED files to the path/to/snpEff/data/GRCh37.65/regulation.bed/ directory:

    cd path/to/snpEff/data/GRCh37.65
    mkdir regulation.bed
    cd regulation.bed
    mv where/ever/your/bed/file/is/my_regulation.bed ./regulation.Pancreatic_Islets.H3K9me3.bed
  Note: The name of the file must be regulation.CELL_TYPE.ANNOTATION_TYPE.bed. In this case, CELL_TYPE=Pancreatic_Islets and ANNOTATION_TYPE=H3K9me3.
- Create the databases (note the -onlyReg flag):

    cd /path/to/snpEff
    java -Xmx20G -jar snpEff.jar build -v -onlyReg GRCh37.65
The output looks like this:
    Building database for 'GRCh37.65'
    Reading regulation elements (GFF)
    Cannot read regulation elements form file '/path/to/snpEff/data/GRCh37.65/regulation.gff'
    Directory has 1 bed files and 1 cell types
    Creating consensus for cellType 'Pancreatic_Islets', files: [/path/to/snpEff/data/GRCh37.65/regulation.bed/regulation.Pancreatic_Islets.H3K9me3.bed]
    Reading file '/path/to/snpEff/data/GRCh37.65/regulation.bed/regulation.Pancreatic_Islets.H3K9me3.bed'
        Chromosome '10' line: 5143
        Chromosome '11' line: 8521
        ...
        Chromosome 'X' line: 52481
        Chromosome 'Y' line: 53340
    Done
        Total lines : 53551
        Total annotation count : 53573
        Percent : 100.0%
        Total annotated length : 75489402
        Number of cell/annotations : 1
    Creating consensus for cell type: Pancreatic_Islets
        Sorting: Pancreatic_Islets , size: 53573
        Adding to final consensus
        Final consensus for cell type: Pancreatic_Islets , size: 53549
    Saving database 'Pancreatic_Islets' in file '/path/to/snpEff/data/GRCh37.65/regulation_Pancreatic_Islets.bin'
    Done
    Finishing up
Note: If there are many annotations, they are saved in one binary file per cell type (i.e. several BED files for the same cell type are collapsed into a single consensus, as the "Creating consensus" lines in the output show). This makes it easier to load annotations only for the desired cell types when analyzing data.
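Because the build step relies on the regulation.CELL_TYPE.ANNOTATION_TYPE.bed naming convention, it can help to check file names before running it. The sketch below splits a file name into its two parts. It assumes the cell type itself contains no dots (everything after the first dot is treated as the annotation type), which is an assumption made here, not something the documentation states; the function name `parse_regulation_bed_name` is hypothetical.

```shell
# Split a BED file name of the form regulation.CELL_TYPE.ANNOTATION_TYPE.bed.
# Prints "CELL_TYPE ANNOTATION_TYPE" on success; returns 1 if the name does
# not follow the convention. Assumption: CELL_TYPE contains no dots.
parse_regulation_bed_name() {
    base=$(basename "$1")
    case "$base" in
        regulation.*.*.bed) ;;         # looks like the expected pattern
        *) return 1 ;;
    esac
    rest=${base#regulation.}           # Pancreatic_Islets.H3K9me3.bed
    rest=${rest%.bed}                  # Pancreatic_Islets.H3K9me3
    cell_type=${rest%%.*}              # text before the first dot
    annotation=${rest#*.}              # text after the first dot
    [ -n "$cell_type" ] && [ -n "$annotation" ] || return 1
    echo "$cell_type $annotation"
}
```

With the file from this example, `parse_regulation_bed_name regulation.Pancreatic_Islets.H3K9me3.bed` prints `Pancreatic_Islets H3K9me3`, while a name such as `my_regulation.bed` is rejected with a non-zero exit status.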