Genomic Feature Store
The feature_store
module provides efficient storage and retrieval of genomic features. It offers multiple implementations with different performance characteristics to suit various use cases.
Overview
Genomic feature stores are one of the core solutions in PyGnome, providing specialized data structures for efficient storage, indexing, and querying of genomic features based on their genomic coordinates. They solve the fundamental bioinformatics challenge of quickly locating genomic elements within large genomes.
What is a Genomic Feature Store?
A genomic feature store is a data structure that:
- Stores genomic features (genes, transcripts, exons, variants, etc.) organized by their chromosomal locations
- Indexes these features using efficient spatial data structures (interval trees, binning, etc.)
- Provides fast query operations to find features based on genomic coordinates
- Optimizes memory usage through various implementation strategies
The feature store is designed to efficiently store and query genomic features based on their genomic coordinates. It supports several types of queries:
- Position queries: Find all features at a specific position
- Interval queries: Find all features that overlap with a given range
- Nearest feature queries: Find the nearest feature to a specific position
Core Components
GenomicFeatureStoreProtocol
This protocol defines the interface that all feature store implementations must follow. It includes methods for adding features, querying features, and managing the store.
Methods
Method | Description |
---|---|
add(feature: GenomicFeature) -> None |
Add a genomic feature to the store |
get_by_position(chrom: str, position: int) -> list[GenomicFeature] |
Get all features at a specific position |
get_by_interval(chrom: str, start: int, end: int) -> list[GenomicFeature] |
Get all features that overlap with the given range |
get_nearest(chrom: str, position: int, max_distance: int = MAX_DISTANCE) -> GenomicFeature \| None |
Get the nearest feature to the given position |
__getitem__(chrom: str) -> ChromosomeFeatureStore |
Get a chromosome store by name |
__iterator__() |
Iterate over all chromosome stores |
trim() -> None |
Trim internal data structures to reduce memory usage |
GenomicFeatureStore
The main implementation of the genomic feature store. It delegates storage and queries to chromosome-specific stores based on the chosen store type.
Constructor
store_type
: Type of store to use (default:StoreType.INTERVAL_TREE
)bin_size
: Size of bins for the binned store (default: 100000)
Methods
Method | Description |
---|---|
add(feature: GenomicFeature) -> None |
Add a genomic feature to the store |
add_features(features: list[GenomicFeature]) -> None |
Add multiple genomic features to the store |
get_by_position(chrom: str, position: int) -> list[GenomicFeature] |
Get all features at a specific position |
get_by_interval(chrom: str, start: int, end: int) -> list[GenomicFeature] |
Get all features that overlap with the given range |
get_nearest(chrom: str, position: int, max_distance: int = MAX_DISTANCE) -> GenomicFeature \| None |
Get the nearest feature to the given position |
get_chromosomes() -> list[str] |
Get all chromosome names in the store |
get_feature_count(chrom: str \| None = None) -> int |
Get the number of features in the store |
trim() -> None |
Trim internal data structures to reduce memory usage |
save(filepath: Path) -> None |
Save the genomic feature store to a file using pickle |
load(filepath: Path) -> GenomicFeatureStore |
Load a genomic feature store from a file |
Context Manager
The GenomicFeatureStore
class implements the context manager protocol (__enter__
and __exit__
methods), which should be used when adding features to ensure proper indexing:
StoreType
An enumeration of the available feature store types.
Value | Description |
---|---|
INTERVAL_TREE |
Uses interval trees for efficient range queries |
BINNED |
Uses binning for memory-efficient storage |
BRUTE_FORCE |
Simple implementation for testing |
MSI |
Specialized for microsatellite instability sites |
ChromosomeFeatureStore
Abstract base class for chromosome-specific genomic feature storage. It has a list of features and provides methods to add and query them.
Constructor
chromosome
: Name of the chromosome
Methods
Method | Description |
---|---|
add(feature: GenomicFeature) -> None |
Add a feature to this chromosome's store |
get_by_position(position: int) -> list[GenomicFeature] |
Get all features at a specific position |
get_by_interval(start: int, end: int) -> list[GenomicFeature] |
Get all features that overlap with the given range |
get_features() -> list[GenomicFeature] |
Get all features |
get_nearest(position: int, max_distance: int = MAX_DISTANCE) -> GenomicFeature \| None |
Get the nearest feature to the given position |
index_build_start() -> None |
Start building the index |
index_build_end() -> None |
Finish building the index |
trim() -> None |
Trim internal data structures to reduce memory usage |
Store Implementations
IntervalTreeStore
Store genomic features using an efficient interval tree. This is the default implementation and provides a good balance between memory usage and query speed.
Performance Characteristics
- Memory Usage: Medium
- Query Speed: Fast
- Best For: General purpose, balanced performance
BinnedGenomicStore
Store genomic features using a memory-efficient binning approach. Features are grouped into bins based on their genomic coordinates, which allows for efficient range queries while using less memory than interval trees.
Constructor
chromosome
: Name of the chromosomebin_size
: Size of each bin in base pairs (default: 100000)
Performance Characteristics
- Memory Usage: Low
- Query Speed: Medium
- Best For: Large genomes, memory-constrained environments
BruteForceFeatureStore
A naive brute-force implementation for genomic feature storage. This is not memory efficient and is not recommended for large datasets. It is primarily for testing purposes.
Performance Characteristics
- Memory Usage: Very Low
- Query Speed: Slow
- Best For: Testing, very small datasets
MsiChromosomeStore
Efficient storage for millions of MSI (Microsatellite Instability) sites in a chromosome. Uses NumPy arrays and DnaStringArray for memory efficiency. Implements binary search for efficient querying.
Constructor
def __init__(self, chrom: str, feature_count: int = DEFAULT_FEATURE_COUNT, max_lengths_by_bin: dict[int, int] | None = None, bin_size: int = DEFAULT_BIN_SIZE):
chrom
: Name of the chromosomefeature_count
: Number of features to allocate space for (default: 1024)max_lengths_by_bin
: Dictionary mapping bin IDs to maximum feature length in that binbin_size
: Size of each bin in base pairs (default: 100000)
Performance Characteristics
- Memory Usage: Very Low
- Query Speed: Fast
- Best For: Specialized for microsatellite sites
Usage Examples
Basic Usage
from pygnome.feature_store.genomic_feature_store import GenomicFeatureStore
from pygnome.genomics.gene import Gene
from pygnome.genomics.strand import Strand
# Create a feature store
store = GenomicFeatureStore()
# Create some features
gene1 = Gene(id="GENE001", chrom="chr1", start=1000, end=5000, strand=Strand.POSITIVE)
gene2 = Gene(id="GENE002", chrom="chr1", start=7000, end=9000, strand=Strand.NEGATIVE)
gene3 = Gene(id="GENE003", chrom="chr2", start=2000, end=6000, strand=Strand.POSITIVE)
# Add features to the store
with store: # Use context manager to ensure proper indexing
store.add(gene1)
store.add(gene2)
store.add(gene3)
# Query features
features_at_position = store.get_by_position("chr1", 1500)
print(f"Features at position chr1:1500: {features_at_position}")
features_in_range = store.get_by_interval("chr1", 4000, 8000)
print(f"Features in range chr1:4000-8000: {features_in_range}")
nearest_feature = store.get_nearest("chr1", 6000)
print(f"Nearest feature to chr1:6000: {nearest_feature}")
Choosing a Store Type
from pygnome.feature_store.genomic_feature_store import GenomicFeatureStore, StoreType
# Create a feature store with interval trees (default)
default_store = GenomicFeatureStore()
# Create a feature store with binning
binned_store = GenomicFeatureStore(store_type=StoreType.BINNED, bin_size=100000)
# Create a feature store with brute force
brute_force_store = GenomicFeatureStore(store_type=StoreType.BRUTE_FORCE)
# Create a feature store for MSI sites
msi_store = GenomicFeatureStore(store_type=StoreType.MSI)
Saving and Loading
from pathlib import Path
from pygnome.feature_store.genomic_feature_store import GenomicFeatureStore
# Create and populate a feature store
store = GenomicFeatureStore()
# ... add features ...
# Trim the store to reduce memory usage
store.trim()
# Save the store
store.save(Path("path/to/store.pkl"))
# Load the store
loaded_store = GenomicFeatureStore.load(Path("path/to/store.pkl"))
Performance Considerations
Memory Usage
The memory usage of the feature store depends on the implementation:
IntervalTreeStore
: Uses more memory but provides faster queriesBinnedGenomicStore
: Uses less memory but queries may be slightly slowerBruteForceFeatureStore
: Uses minimal memory but queries are slowMsiChromosomeStore
: Specialized for MSI sites, very memory efficient
Query Speed
The query speed depends on the implementation and the number of features:
IntervalTreeStore
: O(log n + k) for finding k intervals that overlap with a given point or rangeBinnedGenomicStore
: O(b + k) where b is the number of bins that overlap with the query rangeBruteForceFeatureStore
: O(n) where n is the total number of featuresMsiChromosomeStore
: O(log n + k) using binary search
Best Practices
- Use the context manager pattern when adding features to ensure proper indexing
- Choose the appropriate store type based on your specific use case
- For large genomes, consider saving the populated store to disk with
store.save()
for faster loading in future sessions
Build Time vs. Load Time
Building genomic feature stores with large datasets can be time-consuming, especially when creating indexes for efficient querying. The build time depends on:
- The number of features being added
- The complexity of the indexing structure (interval trees, binning, etc.)
- The performance characteristics of the chosen store implementation
However, once built, these stores can be serialized to disk using Python's pickle format. This allows you to quickly load pre-built stores in future sessions, avoiding the need to rebuild them each time:
# Building a store can be time-consuming
store = GenomicFeatureStore()
with store:
# Adding thousands or millions of features...
for gene in genome.genes.values():
store.add(gene)
for transcript in gene.transcripts:
store.add(transcript)
# ...and so on
# Save the built store to avoid rebuilding it next time
# Note: trimming is done automatically during save
store.save(Path("path/to/store.pkl"))
# In future sessions, quickly load the pre-built store
loaded_store = GenomicFeatureStore.load(Path("path/to/store.pkl"))
# Ready to use immediately without rebuilding indexes
This approach is particularly valuable in production environments or when working with large reference genomes where the feature set doesn't change frequently.