Building databases: Regulatory and Non-coding

SnpEff supports regulatory and non-coding annotations. In this section we show how to build those databases. As in the previous section, most likely you will never have to do it yourself and can just use available pre-built databases.

There are two ways to add support for regulatory annotations (these are not mutually exclusive; you can use both at the same time):

GFF regulation file (from ENSEMBL).
BED files.

Warning

Adding regulation support and analyzing data using regulation tracks can require much more memory. For instance, for the human genome I use 10 GB to 20 GB of RAM.

Warning

It is assumed that the genome is already installed; only regulatory tracks are added.

Option 1: Using a GFF file

This example shows how to create a regulation database for human (GRCh37.65):

Get the GFF regulatory annotations (into path/to/snpEff/data/GRCh37.65/regulation.gff.gz):

cd path/to/snpEff/data/GRCh37.65
wget ftp://ftp.ensembl.org/pub/release-65/regulation/homo_sapiens/AnnotatedFeatures.gff.gz
mv AnnotatedFeatures.gff.gz regulation.gff.gz
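
Before building, it can help to confirm that the download landed where expected and is not truncated. The following is a minimal sketch, not part of SnpEff: `check_regulation_gff` is a hypothetical helper name, and it assumes either a compressed `regulation.gff.gz` or a plain `regulation.gff` is acceptable in the genome's data directory.

```shell
# Hypothetical helper (not part of SnpEff): confirm the regulation GFF is in place.
# Assumption: either regulation.gff.gz or regulation.gff may be present.
check_regulation_gff() {
    dir=$1
    if [ -f "$dir/regulation.gff.gz" ]; then
        # gzip -t tests archive integrity; it fails on a truncated or corrupt file
        gzip -t "$dir/regulation.gff.gz" || return 1
        echo "ok: $dir/regulation.gff.gz"
    elif [ -f "$dir/regulation.gff" ]; then
        echo "ok: $dir/regulation.gff"
    else
        echo "missing: no regulation.gff or regulation.gff.gz in $dir" >&2
        return 1
    fi
}

# Example call; the fallback message is printed when the check fails
check_regulation_gff path/to/snpEff/data/GRCh37.65 || echo "fix the download before building"
```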

Create databases. Note that we use the -onlyReg flag because we are only creating regulatory databases. If you omit it, both the "normal" and the regulatory databases are created:

cd /path/to/snpEff
java -Xmx20G -jar snpEff.jar build -v -onlyReg GRCh37.65

The output looks like this:

Reading regulation elements (GFF)
    Chromosome '11' line: 226964
    Chromosome '12' line: 493780
    ...
    Chromosome '9'  line: 4832434
    Chromosome 'X'  line: 5054301
    Chromosome 'Y'  line: 5166958
Done
    Total lines                 : 5176289
    Total annotation count      : 3961432
    Percent                     : 76.5%
    Total annotated length      : 3648200193
    Number of cell/annotations  : 266
Saving database 'HeLa-S3' in file '/path/to/snpEff/data/GRCh37.65/regulation_HeLa-S3.bin'
Saving database 'HepG2' in file '/path/to/snpEff/data/GRCh37.65/regulation_HepG2.bin'
Saving database 'NHEK' in file '/path/to/snpEff/data/GRCh37.65/regulation_NHEK.bin'
Saving database 'GM12878' in file '/path/to/snpEff/data/GRCh37.65/regulation_GM12878.bin'
Saving database 'HUVEC' in file '/path/to/snpEff/data/GRCh37.65/regulation_HUVEC.bin'
Saving database 'H1ESC' in file '/path/to/snpEff/data/GRCh37.65/regulation_H1ESC.bin'
Saving database 'CD4' in file '/path/to/snpEff/data/GRCh37.65/regulation_CD4.bin'
Saving database 'GM06990' in file '/path/to/snpEff/data/GRCh37.65/regulation_GM06990.bin'
Saving database 'IMR90' in file '/path/to/snpEff/data/GRCh37.65/regulation_IMR90.bin'
Saving database 'K562' in file '/path/to/snpEff/data/GRCh37.65/regulation_K562.bin'
Done.

As you can see, annotations for each cell type are saved in different files. This makes it easier to load annotations only for the desired cell types when analyzing data.
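
Since each cell type ends up in a file named regulation_CELL_TYPE.bin (as in the log above), the available cell types can be recovered from the file names. This is a small sketch under that assumption; `list_regulation_cell_types` is a hypothetical helper, not a SnpEff command.

```shell
# Hypothetical helper: print the cell type encoded in each regulation_*.bin file name.
list_regulation_cell_types() {
    for f in "$1"/regulation_*.bin; do
        [ -e "$f" ] || continue        # skip the unexpanded pattern when nothing matches
        name=$(basename "$f" .bin)     # e.g. regulation_HeLa-S3
        echo "${name#regulation_}"     # e.g. HeLa-S3
    done
}

# Prints one cell type per line (nothing if no databases have been built yet)
list_regulation_cell_types path/to/snpEff/data/GRCh37.65
```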

Option 2: Using a BED file

This example shows how to create a regulation database for human (GRCh37.65). We assume we have a file called my_regulation.bed which has information for H3K9me3 in Pancreatic Islets (for instance, as a result of a ChIP-Seq experiment and peak enrichment analysis).

Add all your BED files to the path/to/snpEff/data/GRCh37.65/regulation.bed/ directory:
```
cd path/to/snpEff/data/GRCh37.65
mkdir regulation.bed
cd regulation.bed
mv where/ever/your/bed/file/is/my_regulation.bed ./regulation.Pancreatic_Islets.H3K9me3.bed
```
Note: The name of the file must follow the pattern regulation.CELL_TYPE.ANNOTATION_TYPE.bed. In this case, CELL_TYPE=Pancreatic_Islets and ANNOTATION_TYPE=H3K9me3.
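
A quick way to catch naming mistakes before building is to split each file name into its two fields. The sketch below is illustrative only: `parse_regulation_bed_name` is a hypothetical helper, and it assumes the cell type is everything up to the first dot after the `regulation.` prefix (how SnpEff itself treats extra dots is not covered here).

```shell
# Hypothetical helper (not part of SnpEff): split a BED file name of the form
# regulation.CELL_TYPE.ANNOTATION_TYPE.bed into its two fields.
parse_regulation_bed_name() {
    name=$(basename "$1")
    case "$name" in
        regulation.*.*.bed) ;;
        *) echo "Unexpected name: $name" >&2; return 1 ;;
    esac
    rest=${name#regulation.}       # strip the leading "regulation."
    rest=${rest%.bed}              # strip the trailing ".bed"
    cell_type=${rest%%.*}          # assumption: text before the first dot
    annotation_type=${rest#*.}     # assumption: text after the first dot
    if [ -z "$cell_type" ] || [ -z "$annotation_type" ]; then
        echo "Empty field in name: $name" >&2
        return 1
    fi
    echo "CELL_TYPE=$cell_type ANNOTATION_TYPE=$annotation_type"
}

parse_regulation_bed_name regulation.Pancreatic_Islets.H3K9me3.bed
# prints: CELL_TYPE=Pancreatic_Islets ANNOTATION_TYPE=H3K9me3
```

A name such as my_regulation.bed is rejected with a non-zero exit status, which makes the function usable in a loop over the regulation.bed directory.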

Create databases (note the -onlyReg flag):

cd /path/to/snpEff
java -Xmx20G -jar snpEff.jar build -v -onlyReg GRCh37.65

The output looks like this:

Building database for 'GRCh37.65'
Reading regulation elements (GFF)
Cannot read regulation elements form file '/path/to/snpEff/data/GRCh37.65/regulation.gff'
Directory has 1 bed files and 1 cell types
Creating consensus for cellType 'Pancreatic_Islets', files: [/path/to/snpEff/data/GRCh37.65/regulation.bed/regulation.Pancreatic_Islets.H3K9me3.bed]
Reading file '/path/to/snpEff/data/GRCh37.65/regulation.bed/regulation.Pancreatic_Islets.H3K9me3.bed'
    Chromosome '10' line: 5143
    Chromosome '11' line: 8521
    ...
    Chromosome 'X'  line: 52481
    Chromosome 'Y'  line: 53340
Done
    Total lines                 : 53551
    Total annotation count      : 53573
    Percent                     : 100.0%
    Total annotated length      : 75489402
    Number of cell/annotations  : 1
Creating consensus for cell type: Pancreatic_Islets
Sorting: Pancreatic_Islets  , size: 53573
Adding to final consensus
Final consensus for cell type: Pancreatic_Islets    , size: 53549
Saving database 'Pancreatic_Islets' in file '/path/to/snpEff/data/GRCh37.65/regulation_Pancreatic_Islets.bin'
Done
Finishing up

Note: If there are many annotations, they are saved in one binary file per cell type (i.e. several BED files for the same cell type are collapsed together). This makes it easier to load annotations only for the desired cell types when analyzing data.
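
To see in advance which binary files a set of BED files should produce, the file names can be grouped by cell type. This is a rough sketch, assuming the regulation.CELL_TYPE.ANNOTATION_TYPE.bed naming rule and the regulation_CELL_TYPE.bin output names seen in the log above; `expected_regulation_bins` is a hypothetical helper, not a SnpEff command.

```shell
# Hypothetical helper: list the regulation_CELL_TYPE.bin files expected from a
# directory of regulation.CELL_TYPE.ANNOTATION_TYPE.bed files (one per cell type).
expected_regulation_bins() {
    for f in "$1"/regulation.*.*.bed; do
        [ -e "$f" ] || continue            # skip the unexpanded pattern when nothing matches
        name=$(basename "$f" .bed)         # e.g. regulation.Pancreatic_Islets.H3K9me3
        rest=${name#regulation.}           # e.g. Pancreatic_Islets.H3K9me3
        echo "regulation_${rest%%.*}.bin"  # keep only the cell type
    done | sort -u                         # several BED files of one cell type yield one line
}

expected_regulation_bins path/to/snpEff/data/GRCh37.65/regulation.bed
```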