Getting Started with PyGnome
PyGnome is a Python library for working with genomic annotations and sequences. It provides efficient data structures and parsers for common genomic file formats, making it easy to work with genomic data in Python.
Installation
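PyGnome can be installed using pip, for example with the command pip install pygnome (this assumes the package is published on PyPI under the name pygnome).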
Quick Start
Here's a simple example to get you started with PyGnome:
    from pathlib import Path

    from pygnome.parsers.genome_loader import GenomeLoader

    # Load a genome from GTF and FASTA files
    loader = GenomeLoader(genome_name="GRCh38", species="Homo sapiens")
    genome = loader.load(
        annotation_file=Path("path/to/annotations.gtf"),
        sequence_file=Path("path/to/genome.fa.gz"),
    )

    # Access genomic features: walk every gene and list its transcripts
    for gene in genome.genes.values():
        print(f"Gene: {gene.id} ({gene.name}) - {gene.chrom}:{gene.start}-{gene.end}")
        for transcript in gene.transcripts:
            print(f"  Transcript: {transcript.id} - Exons: {len(transcript.exons)}")
Core Components
PyGnome consists of several core components:
Genomic Models
The genomic models provide a comprehensive object-oriented representation of genomic features:
- Genome: Top-level container for chromosomes and genes
- Chromosome: Contains genes and sequence data
- Gene: Represents a gene with transcripts
- Transcript: Represents a transcript with exons, introns, UTRs, and CDS
- Exon, Intron, UTR, CDS: Represent specific genomic features
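
To make the containment hierarchy concrete, here is a minimal sketch of how a gene, its transcripts, and their exons can nest. These dataclasses are illustrative stand-ins written for this guide, not PyGnome's actual classes; the half-open coordinate convention, the introns() helper, and the example identifiers are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Exon:
    start: int  # 0-based, inclusive (assumed convention for this sketch)
    end: int    # exclusive


@dataclass
class Transcript:
    id: str
    exons: list[Exon] = field(default_factory=list)

    def introns(self) -> list[tuple[int, int]]:
        """Derive introns as the gaps between consecutive exons."""
        ordered = sorted(self.exons, key=lambda e: e.start)
        return [(a.end, b.start) for a, b in zip(ordered, ordered[1:]) if a.end < b.start]


@dataclass
class Gene:
    id: str
    name: str
    chrom: str
    transcripts: list[Transcript] = field(default_factory=list)

    @property
    def start(self) -> int:
        # Gene span is taken from the outermost exon boundaries.
        return min(e.start for t in self.transcripts for e in t.exons)

    @property
    def end(self) -> int:
        return max(e.end for t in self.transcripts for e in t.exons)


# Hypothetical example data
tx = Transcript("tx1", [Exon(100, 200), Exon(300, 400), Exon(500, 650)])
gene = Gene(id="g1", name="DEMO", chrom="chr1", transcripts=[tx])
print(gene.chrom, gene.start, gene.end, tx.introns())
```

Deriving introns from the exon list, rather than storing them separately, keeps the two from drifting out of sync; whether PyGnome does the same is not stated here.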
Feature Stores
Feature stores provide efficient storage and retrieval of genomic features:
- GenomicFeatureStore: Main interface for storing and querying features
- Multiple implementations with different performance characteristics:
  - IntervalTreeStore: Uses interval trees for efficient range queries
  - BinnedGenomicStore: Uses binning for memory-efficient storage
  - BruteForceFeatureStore: Simple implementation for testing
  - MsiChromosomeStore: Specialized for microsatellite instability sites
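
The trade-off between these implementations is easier to see with a toy version. The sketch below contrasts a fixed-size binning index with a brute-force scan for overlap queries on a single chromosome. It is a simplified illustration of the general techniques, not PyGnome's implementation: the class name ToyBinnedStore, the bin size, and the half-open interval convention are assumptions.

```python
from collections import defaultdict


class ToyBinnedStore:
    """Index half-open intervals [start, end) by fixed-size bins."""

    def __init__(self, bin_size: int = 1000):
        self.bin_size = bin_size
        self._bins = defaultdict(list)

    def _bin_range(self, start: int, end: int) -> range:
        # All bins touched by [start, end); end is exclusive, hence end - 1.
        return range(start // self.bin_size, (end - 1) // self.bin_size + 1)

    def add(self, start: int, end: int, name: str) -> None:
        for b in self._bin_range(start, end):
            self._bins[b].append((start, end, name))

    def query(self, start: int, end: int) -> list[tuple[int, int, str]]:
        hits = set()  # a feature spanning several bins must be reported once
        for b in self._bin_range(start, end):
            for s, e, name in self._bins.get(b, ()):
                if s < end and start < e:  # half-open overlap test
                    hits.add((s, e, name))
        return sorted(hits)


def brute_force_query(features, start, end):
    """Reference implementation: check every feature."""
    return sorted(f for f in features if f[0] < end and start < f[1])


# Hypothetical features: (start, end, name)
features = [(100, 2500, "geneA"), (900, 1100, "geneC"), (3000, 3200, "geneB")]
store = ToyBinnedStore()
for f in features:
    store.add(*f)

overlapping = store.query(1000, 1050)
print(overlapping)
```

A brute-force scan like brute_force_query is useful mainly as a correctness oracle in tests, which matches the role the list above gives to the brute-force store.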
Sequence Handling
PyGnome provides memory-efficient representations of DNA and RNA sequences:
- DnaString: Memory-efficient 2-bit representation of DNA sequences
- RnaString: RNA sequence representation
- DnaStringArray: Efficient storage for multiple DNA sequences
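
A 2-bit representation works because DNA has four bases, so each base fits in two bits and four bases fit in one byte. The following sketch shows the general idea only; the pack and unpack function names and the bit layout are assumptions, and it rejects ambiguous bases such as N, which a real library has to handle in some way not described here.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack(seq: str) -> tuple[bytes, int]:
    """Pack a DNA string into 2 bits per base; returns (data, length)."""
    buf = bytearray((len(seq) + 3) // 4)  # four bases per byte, rounded up
    for i, base in enumerate(seq.upper()):
        # Raises KeyError on anything other than A, C, G, T.
        buf[i // 4] |= _CODE[base] << (2 * (i % 4))
    return bytes(buf), len(seq)


def unpack(data: bytes, length: int) -> str:
    """Recover the original string; length is needed because the
    last byte may be only partially filled."""
    return "".join(
        _BASE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


packed, n = pack("ACGTTGCA")
print(len(packed), unpack(packed, n))
```

Storing the length alongside the packed bytes matters: without it, padding bits in the final byte would decode as extra A bases.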
Parsers
PyGnome includes parsers for common genomic file formats:
- FastaParser: Parses FASTA files
- FastqParser: Parses FASTQ files
- GffParser, Gff3Parser, GtfParser: Parse GFF/GTF annotation files
- VcfReader: Parses VCF variant files
- MsiSitesReader: Parses microsatellite instability sites
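
For readers unfamiliar with the formats, the sketch below shows what parsing FASTA involves at its simplest: a header line starting with > followed by one or more sequence lines. It uses only the standard library and does not use or reflect the FastaParser API; the read_fasta function name is hypothetical, and the choice to keep only the first word of the header as the identifier is an assumption.

```python
import io
from typing import Iterable, Iterator


def read_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: list[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            parts = line[1:].split()
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may wrap across many lines
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


# Hypothetical in-memory input; a real file handle works the same way.
sample = io.StringIO(">chr1 test sequence\nACGT\nTTGA\n>chr2\nGGCC\n")
records = list(read_fasta(sample))
print(records)
```

Because the function is a generator over lines, it can stream a large file without loading it fully into memory, for instance by passing a handle opened with gzip.open(path, "rt") for compressed input.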
Next Steps
- Check out the User Guide for more detailed usage examples
- Explore the API Reference for detailed documentation of all classes and methods