API Reference
This section provides detailed documentation for the PyGnome API. PyGnome is organized into several core modules, each focusing on a specific aspect of genomic data handling.
Core Modules
Genomics
The genomics module provides the core data models for representing genomic features:
GenomicFeature: Base class for all genomic featuresGenome: Top-level container for chromosomes and genesChromosome: Contains genes and sequence dataGene: Represents a gene with transcriptsTranscript: Represents a transcript with exons, introns, UTRs, and CDSExon,Intron,UTR,CDS: Represent specific genomic featuresStrand: Enumeration for strand orientation (positive/negative)Biotype: Enumeration for gene/transcript biotypesPhase: Enumeration for CDS phase
Feature Store
The feature_store module provides efficient storage and retrieval of genomic features:
GenomicFeatureStore: Main interface for storing and querying featuresIntervalTreeStore: Uses interval trees for efficient range queriesBinnedGenomicStore: Uses binning for memory-efficient storageBruteForceFeatureStore: Simple implementation for testingMsiChromosomeStore: Specialized for microsatellite instability sites
Sequences
The sequences module provides memory-efficient representations of DNA and RNA sequences:
BaseSequence: Abstract base class for nucleotide sequencesDnaString: Memory-efficient 2-bit representation of DNA sequencesRnaString: RNA sequence representationDnaStringArray: Efficient storage for multiple DNA sequences
Parsers
The parsers module includes parsers for common genomic file formats:
GenomeLoader: Loads genomes from annotation and sequence filesFastaParser: Parses FASTA filesFastqParser: Parses FASTQ filesGffParser,Gff3Parser,GtfParser: Parse GFF/GTF annotation filesVcfReader: Parses VCF variant filesMsiSitesReader: Parses microsatellite instability sites
Module Relationships
The modules in PyGnome are designed to work together:
- Parsers read genomic data from files
- Genomics models represent the parsed data as objects
- Feature Stores provide efficient storage and retrieval of genomic features
- Sequences provide memory-efficient representation of DNA/RNA sequences
Common Usage Patterns
Loading and Querying a Genome
from pathlib import Path
from pygnome.parsers.genome_loader import GenomeLoader
from pygnome.feature_store.genomic_feature_store import GenomicFeatureStore
# Load a genome
loader = GenomeLoader(genome_name="GRCh38", species="Homo sapiens")
genome = loader.load(
annotation_file=Path("path/to/annotations.gtf"),
sequence_file=Path("path/to/genome.fa.gz")
)
# Create a feature store for efficient querying
store = GenomicFeatureStore()
with store:
for gene in genome.genes.values():
store.add(gene)
for transcript in gene.transcripts:
store.add(transcript)
for exon in transcript.exons:
store.add(exon)
# Query features
features = store.get_by_interval("chr1", 1000000, 2000000)
Working with Sequences
from pygnome.sequences.dna_string import DnaString
from pygnome.sequences.rna_string import RnaString
# Create a DNA sequence
dna = DnaString("ATGCATGCATGC")
# Get a subsequence
subseq = dna[3:9]
# Complement and reverse complement
comp = dna.complement()
rev_comp = dna.reverse_complement()
# Transcribe DNA to RNA
rna = dna.transcribe()
# Translate RNA to protein
protein = rna.translate()
Error Handling
PyGnome uses Python's built-in exception handling. Common exceptions include:
ValueError: Raised when a function receives an argument of the correct type but an inappropriate valueTypeError: Raised when an operation or function is applied to an object of inappropriate typeIndexError: Raised when a sequence subscript is out of rangeFileNotFoundError: Raised when a file or directory is requested but doesn't exist
Example of error handling:
```python from pathlib import Path from pygnome.parsers.fasta.fasta_parser import FastaParser
try: parser = FastaParser(Path("path/to/nonexistent_file.fa")) records = parser.load() except FileNotFoundError as e: print(f"Error: {e}") # Handle the error appropriately