Advanced Usage
This guide covers advanced usage patterns and optimization techniques for PyGnome.
Genomic Feature Stores
Genomic feature stores are one of the core solutions in PyGnome, providing specialized data structures for efficient storage, indexing, and querying of genomic features based on their genomic coordinates. They solve the fundamental bioinformatics challenge of quickly locating genomic elements within large genomes.
Genomic Feature Store Selection
PyGnome offers multiple feature store implementations with different performance characteristics:
Store Type | Memory Usage | Query Speed | Best For |
---|---|---|---|
IntervalTreeStore |
Medium | Fast | General purpose, balanced performance |
BinnedGenomicStore |
Medium | Medium | Large genomes, memory-constrained environments |
BruteForceFeatureStore |
Medium | Slow | Testing, very small datasets |
MsiChromosomeStore |
Very Low | Fast | Specialized for microsatellite sites |
Choose the appropriate store type based on your specific use case:
from pygnome.feature_store.genomic_feature_store import GenomicFeatureStore, StoreType
# For general use
default_store = GenomicFeatureStore() # Uses IntervalTreeStore
# For memory-constrained environments
binned_store = GenomicFeatureStore(store_type=StoreType.BINNED, bin_size=100000)
# For testing with small datasets
brute_force_store = GenomicFeatureStore(store_type=StoreType.BRUTE_FORCE)
# For microsatellite sites
msi_store = GenomicFeatureStore(store_type=StoreType.MSI)
Best practices for Genomic Feature Stores
When working with large genomes, use these techniques to optimize performance:
1. Use the Context Manager Pattern
Always use the context manager pattern when adding features to ensure proper indexing:
2. Save and Load Feature Stores to Avoid Rebuilding
Building genomic feature stores with large datasets can be time-consuming, especially when creating indexes for efficient querying. Once built, save them to disk to avoid rebuilding them in future sessions:
# Building a store with millions of features can take time
store = GenomicFeatureStore()
with store:
# Adding features from a large genome...
for gene in genome.genes.values():
store.add(gene)
# Add transcripts, exons, etc.
# Save the built store (trimming is done automatically)
store.save(Path("path/to/store.pkl"))
# In future sessions, quickly load the pre-built store
loaded_store = GenomicFeatureStore.load(Path("path/to/store.pkl"))
# Ready to use immediately without rebuilding indexes
Note: The save()
method automatically calls trim()
to reduce memory usage before serialization, so you don't need to call it manually.
Working with Large Genomes
When working with large genomes, consider these approaches:
Chunked Processing
Process large files in chunks to reduce memory usage:
from pathlib import Path
from pygnome.parsers.fasta.fasta_parser import FastaParser
# Process a large FASTA file in chunks
parser = FastaParser(Path("path/to/large_genome.fa"))
for record in parser:
# Process one sequence at a time
process_sequence(record.identifier, record.sequence)
VCF files and Variants
Reading Variants from VCF Files
from pathlib import Path
from pygnome.parsers.vcf.vcf_reader import VcfReader
# Open a VCF file
with VcfReader(Path("path/to/variants.vcf")) as reader:
for vcf_record in reader:
# A single VCF record (i.e. vcf line) can have multiple variants
for variant in vcf_record:
print(f"Variant: {variant}")
Annotating Variants with Genomic Feature stores
from pathlib import Path
from pygnome.parsers.vcf.vcf_reader import VcfReader
from pygnome.feature_store.genomic_feature_store import GenomicFeatureStore
from pygnome.parsers.genome_loader import GenomeLoader
# Load the genome
loader = GenomeLoader(genome_name="GRCh38", species="Homo sapiens")
genome = loader.load(
annotation_file=Path("path/to/annotations.gtf"),
sequence_file=Path("path/to/genome.fa.gz")
)
# Create a feature store with all genes
store = GenomicFeatureStore()
with store:
for gene in genome.genes.values():
store.add(gene)
for transcript in gene.transcripts:
store.add(transcript)
for exon in transcript.exons:
store.add(exon)
# Open a VCF file
with VcfReader(Path("path/to/variants.vcf")) as reader:
for record in reader:
# A single VCF record (i.e. vcf line) can have multiple variants
for variant in vcf_record:
# Get the variant position
chrom = variant.get_chrom()
pos = variant.get_pos()
# Find features at this position
features = store.get_by_position(chrom, pos)
# Show matches
for feature in features:
print(f"Variant at {chrom}:{pos} overlaps {feature.id} ({feature.__class__.__name__})")
Custom Feature Filtering
Combine feature store queries with custom filtering:
from pygnome.feature_store.genomic_feature_store import GenomicFeatureStore
from pygnome.genomics.gene import Gene
# Create and populate a feature store
store = GenomicFeatureStore()
# ... add features ...
# Get all features in a range
features = store.get_by_interval("chr1", 1000000, 2000000)
# Filter for specific feature types
genes = [f for f in features if isinstance(f, Gene)]
# Filter by additional criteria
protein_coding_genes = [g for g in genes if g.biotype == "protein_coding"]
Custom Genomic Feature Types
You can create custom genomic feature types by extending the GenomicFeature
class:
from dataclasses import dataclass
from pygnome.genomics.genomic_feature import GenomicFeature
from pygnome.genomics.strand import Strand
@dataclass
class EnhancerRegion(GenomicFeature):
"""A genomic enhancer region."""
target_genes: list[str] = None
activity_score: float = 0.0
def __post_init__(self):
super().__post_init__()
if self.target_genes is None:
self.target_genes = []
def is_active(self, threshold: float = 0.5) -> bool:
"""Check if the enhancer is active based on its activity score."""
return self.activity_score >= threshold
# Create an enhancer
enhancer = EnhancerRegion(
id="ENH001",
chrom="chr1",
start=1000000,
end=1001000,
strand=Strand.POSITIVE,
target_genes=["GENE001", "GENE002"],
activity_score=0.75
)
# Add to a feature store
store = GenomicFeatureStore()
with store:
store.add(enhancer)
# Query enhancers
enhancers = [f for f in store.get_by_interval("chr1", 900000, 1100000)
if isinstance(f, EnhancerRegion)]
active_enhancers = [e for e in enhancers if e.is_active()]