API Reference

This section provides detailed documentation for the PyGnome API. PyGnome is organized into several core modules, each focusing on a specific aspect of genomic data handling.

Core Modules

Genomics

The genomics module provides the core data models for representing genomic features:

GenomicFeature: Base class for all genomic features
Genome: Top-level container for chromosomes and genes
Chromosome: Contains genes and sequence data
Gene: Represents a gene with transcripts
Transcript: Represents a transcript with exons, introns, UTRs, and CDS
Exon, Intron, UTR, CDS: Represent specific genomic features
Strand: Enumeration for strand orientation (positive/negative)
Biotype: Enumeration for gene/transcript biotypes
Phase: Enumeration for CDS phase
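
The containment hierarchy listed above (gene, transcript, exon, plus a strand enumeration) can be pictured with a few plain dataclasses. This is only a sketch of the idea, not PyGnome's classes: every name here (`ToyStrand`, `ToyExon`, `ToyTranscript`, `ToyGene`, `spliced_length`) and the half-open coordinate convention are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ToyStrand(Enum):
    """Hypothetical stand-in for a strand enumeration."""
    POSITIVE = "+"
    NEGATIVE = "-"


@dataclass
class ToyExon:
    start: int  # assumed 0-based, half-open [start, end)
    end: int


@dataclass
class ToyTranscript:
    transcript_id: str
    strand: ToyStrand
    exons: List[ToyExon] = field(default_factory=list)

    def spliced_length(self) -> int:
        """Total exon length, i.e. the transcript length after splicing."""
        return sum(exon.end - exon.start for exon in self.exons)


@dataclass
class ToyGene:
    gene_id: str
    transcripts: List[ToyTranscript] = field(default_factory=list)


tx = ToyTranscript("tx1", ToyStrand.POSITIVE, [ToyExon(100, 200), ToyExon(300, 450)])
gene = ToyGene("gene1", [tx])
print(gene.transcripts[0].spliced_length())  # 250
```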

Feature Store

The feature_store module provides efficient storage and retrieval of genomic features:

GenomicFeatureStore: Main interface for storing and querying features
IntervalTreeStore: Uses interval trees for efficient range queries
BinnedGenomicStore: Uses binning for memory-efficient storage
BruteForceFeatureStore: Simple implementation for testing
MsiChromosomeStore: Specialized for microsatellite instability sites
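
To make the trade-off between a brute-force store and a binned store concrete, here is a minimal, self-contained sketch of both approaches. It does not use PyGnome and is not its implementation; `Feature`, `brute_force_query`, `BinnedIndex`, the default bin size, and the half-open coordinates are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical feature with half-open coordinates [start, end)."""
    chrom: str
    start: int
    end: int
    name: str


def overlaps(f: Feature, chrom: str, start: int, end: int) -> bool:
    """True if the feature overlaps the half-open query interval."""
    return f.chrom == chrom and f.start < end and start < f.end


def brute_force_query(features, chrom, start, end):
    """Scan every feature: O(n) per query, but trivially correct."""
    return [f for f in features if overlaps(f, chrom, start, end)]


class BinnedIndex:
    """Fixed-width bins: each feature is registered in every bin it touches."""

    def __init__(self, bin_size: int = 100_000):
        self.bin_size = bin_size
        self._bins = defaultdict(list)

    def add(self, f: Feature) -> None:
        first = f.start // self.bin_size
        last = (f.end - 1) // self.bin_size
        for b in range(first, last + 1):
            self._bins[(f.chrom, b)].append(f)

    def query(self, chrom, start, end):
        """Check only the bins the query touches, de-duplicating hits."""
        seen = set()
        hits = []
        first = start // self.bin_size
        last = (end - 1) // self.bin_size
        for b in range(first, last + 1):
            for f in self._bins.get((chrom, b), []):
                if f not in seen and overlaps(f, chrom, start, end):
                    seen.add(f)
                    hits.append(f)
        return hits


features = [
    Feature("chr1", 100, 200, "a"),
    Feature("chr1", 150_000, 250_000, "b"),
    Feature("chr2", 100, 200, "c"),
]
index = BinnedIndex()
for feature in features:
    index.add(feature)
```

The brute-force version is useful as a reference when testing, because any faster index should return the same set of features for the same query.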

Sequences

The sequences module provides memory-efficient representations of DNA and RNA sequences:

BaseSequence: Abstract base class for nucleotide sequences
DnaString: Memory-efficient 2-bit representation of DNA sequences
RnaString: RNA sequence representation
DnaStringArray: Efficient storage for multiple DNA sequences
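
The idea behind a 2-bit DNA representation is that four bases fit in one byte. The sketch below shows one possible packing scheme in plain Python; it is not PyGnome's `DnaString` implementation, and the `pack`/`unpack` helpers, the bit assignment, and the byte layout are assumptions. Ambiguous bases such as `N` are deliberately not handled here.

```python
# Assumed bit assignment; a real library may choose a different one.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"


def pack(seq: str) -> bytearray:
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        packed[i // 4] |= ENCODE[base] << (2 * (i % 4))
    return packed


def unpack(packed: bytearray, length: int) -> str:
    """Recover the first `length` bases from a packed buffer."""
    return "".join(
        DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


seq = "ATGCATGCATGC"
packed = pack(seq)
print(len(seq), "bases stored in", len(packed), "bytes")
print(unpack(packed, len(seq)) == seq)
```

The sequence length has to be stored separately, because the last byte may be only partly filled.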

Parsers

The parsers module includes parsers for common genomic file formats:

GenomeLoader: Loads genomes from annotation and sequence files
FastaParser: Parses FASTA files
FastqParser: Parses FASTQ files
GffParser, Gff3Parser, GtfParser: Parse GFF/GTF annotation files
VcfReader: Parses VCF variant files
MsiSitesReader: Parses microsatellite instability sites
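
As a rough picture of what a FASTA parser has to do, the following standalone sketch reads header/sequence pairs from a text stream. It is not PyGnome's `FastaParser` and makes no claim about that class's interface; `read_fasta` is a hypothetical helper written only for this illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(">seq1 example\nATGC\nATGC\n\n>seq2\nGG\n")
records = list(read_fasta(example))
for name, sequence in records:
    print(name, len(sequence))
```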

Module Relationships

The modules in PyGnome are designed to work together:

Parsers read genomic data from files
Genomics models represent the parsed data as objects
Feature Stores provide efficient storage and retrieval of genomic features
Sequences provide memory-efficient representation of DNA/RNA sequences

Common Usage Patterns

Loading and Querying a Genome

from pathlib import Path
from pygnome.parsers.genome_loader import GenomeLoader
from pygnome.feature_store.genomic_feature_store import GenomicFeatureStore

# Load a genome
loader = GenomeLoader(genome_name="GRCh38", species="Homo sapiens")
genome = loader.load(
    annotation_file=Path("path/to/annotations.gtf"),
    sequence_file=Path("path/to/genome.fa.gz")
)

# Create a feature store for efficient querying
store = GenomicFeatureStore()
with store:
    for gene in genome.genes.values():
        store.add(gene)
        for transcript in gene.transcripts:
            store.add(transcript)
            for exon in transcript.exons:
                store.add(exon)

# Query features
features = store.get_by_interval("chr1", 1000000, 2000000)

Working with Sequences

from pygnome.sequences.dna_string import DnaString
from pygnome.sequences.rna_string import RnaString

# Create a DNA sequence
dna = DnaString("ATGCATGCATGC")

# Get a subsequence
subseq = dna[3:9]

# Complement and reverse complement
comp = dna.complement()
rev_comp = dna.reverse_complement()

# Transcribe DNA to RNA
rna = dna.transcribe()

# Translate RNA to protein
protein = rna.translate()
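
For reference, the same operations can be reproduced on plain Python strings. This sketch does not use PyGnome and makes no claim about what the `DnaString` methods return; it only shows the conventional meaning of slicing, complement, reverse complement, and transcription for the same example sequence. The helper functions are hypothetical.

```python
# Plain-string illustration; not PyGnome's DnaString implementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Swap each base for its Watson-Crick partner."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement the sequence, then reverse it."""
    return complement(seq)[::-1]


def transcribe(seq: str) -> str:
    """Coding-strand convention: replace T with U."""
    return seq.replace("T", "U")


dna = "ATGCATGCATGC"
print(dna[3:9])                 # CATGCA
print(complement(dna))          # TACGTACGTACG
print(reverse_complement(dna))  # GCATGCATGCAT
print(transcribe(dna))          # AUGCAUGCAUGC
```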

Error Handling

PyGnome signals errors with Python's built-in exceptions. Common ones include:

ValueError: Raised when a function receives an argument of the correct type but an inappropriate value
TypeError: Raised when an operation or function is applied to an object of inappropriate type
IndexError: Raised when a sequence subscript is out of range
FileNotFoundError: Raised when a file or directory is requested but doesn't exist

Example of error handling:

```python
from pathlib import Path
from pygnome.parsers.fasta.fasta_parser import FastaParser

try:
    parser = FastaParser(Path("path/to/nonexistent_file.fa"))
    records = parser.load()
except FileNotFoundError as e:
    print(f"Error: {e}")
    # Handle the error appropriately
```