Parsers

The parsers module provides parsers for common genomic file formats: FASTA/FASTQ, GFF/GTF, VCF, and MSI sites files.

Overview

The parsers module is designed to efficiently read and parse genomic data from various file formats. It provides a consistent interface for working with different file formats and converts the parsed data into PyGnome's object model.

Key features:

Support for common genomic file formats (FASTA, FASTQ, GFF, GTF, VCF, MSI sites)
Memory-efficient parsing of large files
Conversion of parsed data into PyGnome's object model
Support for compressed files (gzip, bgzip)
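
The compressed-file and memory-efficiency features can be pictured with a minimal sketch. It is not PyGnome's implementation: the helper names `open_text` and `iter_lines` are hypothetical, and the sketch only relies on the fact that bgzip output is a valid gzip stream, so Python's standard `gzip` module can read both.

```python
import gzip
from pathlib import Path
from typing import Iterator, TextIO


def open_text(filepath: Path) -> TextIO:
    """Open a plain or gzip/bgzip-compressed file in text mode (hypothetical helper)."""
    with open(filepath, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic bytes, also used by bgzip
        return gzip.open(filepath, "rt")
    return open(filepath, "rt")


def iter_lines(filepath: Path) -> Iterator[str]:
    """Yield lines one at a time so a large file is never fully held in memory."""
    with open_text(filepath) as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Detecting compression from the magic bytes rather than the file extension keeps the sketch working for files such as `genome.fa.gz` and for compressed files with unusual names.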

FASTA/FASTQ Parsers

FastaParser

class FastaParser:

Parser for FASTA files. FASTA is a text-based format for representing nucleotide or peptide sequences.

Constructor

def __init__(self, filepath: Path):

filepath: Path to the FASTA file

Methods

Method	Description
`load() -> list[FastaRecord]`	Load all records from the FASTA file
`load_as_dict() -> dict[str, FastaRecord]`	Load all records from the FASTA file as a dictionary
`parse_as_dna_strings(filepath: Path) -> dict[str, DnaString]`	Parse a FASTA file and return a dictionary of DnaString objects
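
To illustrate what loading a FASTA file involves, here is a small standalone sketch that does not use PyGnome. `SimpleFastaRecord` and `parse_fasta` are hypothetical names, and the record fields are assumptions chosen for illustration rather than the real `FastaRecord` layout.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class SimpleFastaRecord:
    """Hypothetical stand-in for a FASTA record (not PyGnome's FastaRecord)."""
    identifier: str
    description: str
    sequence: str


def _make_record(header: str, chunks: list[str]) -> SimpleFastaRecord:
    # The identifier is the first space-delimited token of the header line.
    identifier, _, description = header.partition(" ")
    return SimpleFastaRecord(identifier, description.strip(), "".join(chunks))


def parse_fasta(lines: Iterable[str]) -> Iterator[SimpleFastaRecord]:
    """Yield one record per '>' header, joining wrapped sequence lines."""
    header = None
    chunks: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


records = list(parse_fasta([">chr1 test sequence", "ACGT", "TTGA", ">chr2", "GGCC"]))
# Dictionary keyed by identifier, analogous in spirit to load_as_dict()
by_id = {record.identifier: record for record in records}
```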

FastqParser

class FastqParser:

Parser for FASTQ files. FASTQ is a text-based format for storing both a nucleotide sequence and its corresponding quality scores.

Constructor

def __init__(self, filepath: Path):

filepath: Path to the FASTQ file

Methods

Method	Description
`load() -> list[FastqRecord]`	Load all records from the FASTQ file
`load_as_dict() -> dict[str, FastqRecord]`	Load all records from the FASTQ file as a dictionary
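
The pairing of sequence and quality scores can be shown with a short standalone sketch. It assumes unwrapped four-line records and Phred+33 quality encoding; `parse_fastq` is a hypothetical function, not part of PyGnome.

```python
from typing import Iterable, Iterator


def parse_fastq(lines: Iterable[str]) -> Iterator[tuple[str, str, list[int]]]:
    """Yield (identifier, sequence, phred_scores) for each four-line record."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if quality is None:
            raise ValueError(f"Truncated FASTQ record: {header!r}")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch: {header!r}")
        # Phred+33: each quality character encodes score = ord(char) - 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:].partition(" ")[0], sequence, scores


reads = list(parse_fastq(["@read1 sample", "ACGT", "+", "II5!"]))
```

Each base in the sequence lines up with one quality character, which is why the sketch rejects records whose two strings differ in length.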

GFF/GTF Parsers

GffParser

class GffParser:

Base class for GFF/GTF parsers. GFF (General Feature Format) and GTF (Gene Transfer Format) are tab-delimited text formats for describing genes and other features of DNA, RNA, and protein sequences.

Constructor

def __init__(self, filepath: Path):

filepath: Path to the GFF/GTF file

Methods

Method	Description
`__iter__() -> Iterator[GffRecord]`	Iterate over records in the GFF/GTF file
`parse_attributes(attributes_str: str) -> dict[str, str]`	Parse the attributes field of a GFF/GTF record
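
The attributes column is where GFF3 and GTF differ most, which is one reason to have separate subclasses. The sketch below shows the two conventions side by side. The function names are hypothetical, and the GTF version assumes that quoted values contain no semicolons.

```python
from urllib.parse import unquote


def parse_gff3_attributes(attributes_str: str) -> dict[str, str]:
    """Parse GFF3-style 'key=value;key=value' attributes."""
    attributes: dict[str, str] = {}
    for item in attributes_str.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        # GFF3 percent-encodes reserved characters such as ';' and '='.
        attributes[key] = unquote(value)
    return attributes


def parse_gtf_attributes(attributes_str: str) -> dict[str, str]:
    """Parse GTF-style 'key "value"; key "value";' attributes."""
    attributes: dict[str, str] = {}
    for item in attributes_str.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


gff3_attrs = parse_gff3_attributes("ID=gene1;Name=BRCA2;Note=a%3Bb")
gtf_attrs = parse_gtf_attributes('gene_id "G001"; gene_name "BRCA2";')
```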

Gff3Parser

class Gff3Parser(GffParser):

Parser for GFF3 files. GFF3 is the latest version of the General Feature Format.

Constructor

def __init__(self, filepath: Path):

filepath: Path to the GFF3 file

GtfParser

class GtfParser(GffParser):

Parser for GTF files. GTF (Gene Transfer Format) is a refinement of GFF that is used to describe genes and other features of DNA, RNA, and protein sequences.

Constructor

def __init__(self, filepath: Path):

filepath: Path to the GTF file

VCF Parsers

VcfReader

class VcfReader:

Parser for VCF files. VCF (Variant Call Format) is a text file format for storing gene sequence variations.

Constructor

def __init__(self, filepath: Path):

filepath: Path to the VCF file

Methods

Method	Description
`__iter__() -> Iterator[VcfRecord]`	Iterate over records in the VCF file
`get_samples() -> list[str]`	Get the sample names from the VCF file
`get_header() -> VcfHeader`	Get the header from the VCF file
`fetch(chrom: str, start: int, end: int) -> Iterator[VcfRecord]`	Fetch records in a specific region
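
As background for `get_samples()`, the sample names in a VCF file come from the `#CHROM` column header line: eight fixed columns, then `FORMAT`, then one column per sample. The following standalone sketch illustrates that layout; `sample_names` and `split_record` are hypothetical helpers, not PyGnome functions.

```python
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def sample_names(header_line: str) -> list[str]:
    """Return sample names from the '#CHROM' column header line."""
    columns = header_line.rstrip("\n").lstrip("#").split("\t")
    if columns[:8] != VCF_FIXED_COLUMNS:
        raise ValueError("Not a VCF column header line")
    # Column 9 is FORMAT; every column after it names one sample.
    return columns[9:]


def split_record(line: str) -> dict[str, str]:
    """Map the eight fixed VCF columns to their raw string values."""
    fields = line.rstrip("\n").split("\t")
    return dict(zip(VCF_FIXED_COLUMNS, fields[:8]))


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA001\tNA002"
samples = sample_names(header)
fixed = split_record("chr1\t100\t.\tA\tG\t50\tPASS\tDP=30\tGT\t0/1\t1/1")
```

A sites-only VCF has no `FORMAT` or sample columns, in which case the sketch returns an empty list.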

VcfRecord

class VcfRecord:

Represents a single record (line) in a VCF file. Provides methods for accessing and modifying VCF fields.

Methods

Method	Description
`get_info(field_id: str) -> Any`	Get the value of an INFO field
`info[field_id]`	Dictionary-like access to INFO fields
`has_info(field_id: str) -> bool`	Check if an INFO field is present
`set_info(field_id: str, value: Any) -> None`	Set the value of an INFO field
`get_genotypes() -> list[Genotype]`	Get the genotypes for all samples
`get_genotype_value(field_id: str, sample_idx: int = 0) -> Any`	Get the value of a genotype field for a specific sample
`variant_annotations()`	Get a variant annotations parser for this record
`__iter__() -> Iterator[Variant]`	Iterate over the variants represented in this VCF record
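
Genotype fields are stored per sample as colon-separated values whose order is given by the record's FORMAT column. A minimal standalone sketch of that lookup follows; the helper names are hypothetical and the returned values are left as raw strings.

```python
from typing import Optional


def genotype_value(format_str: str, sample_str: str, field_id: str) -> Optional[str]:
    """Look up one genotype field for a sample, e.g. DP in 'GT:DP' / '0/1:30'."""
    keys = format_str.split(":")
    values = sample_str.split(":")
    # zip() stops at the shorter list, which covers dropped trailing fields.
    return dict(zip(keys, values)).get(field_id)


def split_alleles(gt: str) -> tuple[list[Optional[int]], bool]:
    """Split a GT string into allele indices and a phased flag."""
    phased = "|" in gt
    alleles = [None if a == "." else int(a) for a in gt.replace("|", "/").split("/")]
    return alleles, phased


depth = genotype_value("GT:DP:GQ", "0/1:30:99", "DP")
alleles, phased = split_alleles("0|1")
```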

INFO Field Access

VcfRecord provides two ways to read a single INFO field, and the info object can also be iterated to visit every field:

# Method-based access
depth = record.get_info("DP")

# Dictionary-like access (more Pythonic)
depth = record.info["DP"]

# Iterating through all INFO fields
for field_id, field_value in record.info:
    print(f"{field_id} = {field_value}")
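
For context, the raw INFO column is a semicolon-separated list of `key=value` entries and bare flags. The sketch below parses that text into a plain dictionary. It is a simplified illustration with a hypothetical `parse_info` function: values stay as strings because converting them to numbers would require the type declarations from the VCF header.

```python
from typing import Any


def parse_info(info_str: str) -> dict[str, Any]:
    """Parse a raw INFO column such as 'DP=30;AF=0.5,0.1;DB' into a dict."""
    info: dict[str, Any] = {}
    if info_str in ("", "."):  # '.' marks an empty INFO column
        return info
    for entry in info_str.split(";"):
        key, has_value, value = entry.partition("=")
        if not has_value:
            info[key] = True  # flag fields carry no value
        elif "," in value:
            info[key] = value.split(",")  # multi-valued field
        else:
            info[key] = value
    return info


info = parse_info("DP=30;AF=0.5,0.1;DB")
for field_id, field_value in info.items():
    print(f"{field_id} = {field_value}")
```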

Variant Annotations

VcfRecord provides a convenient method to access variant annotations:

# Get variant annotations directly from the record
for vann in record.variant_annotations():
    print(f"Variant annotation: {vann}")

VcfInfo

class VcfInfo:

Class for handling the INFO fields of a VCF record. Provides methods for parsing, accessing, and modifying them.

Methods

Method	Description
`get(field_id: str) -> Any`	Get the value of an INFO field
`__getitem__(field_id: str) -> Any`	Dictionary-like access to INFO fields
`has(field_id: str) -> bool`	Check if an INFO field is present
`set(field_id: str, value: Any) -> None`	Set the value of an INFO field
`remove(field_id: str) -> None`	Remove an INFO field
`field_ids() -> list[str]`	Get the names of all INFO fields
`__iter__()`	Iterate over the INFO field names and values

Usage Examples

# Access INFO fields using dictionary-like syntax
depth = vcf_record.info["DP"]

# Iterate through all INFO fields
for field_id, field_value in vcf_record.info:
    print(f"{field_id} = {field_value}")

# Set an INFO field with the set() method listed above
vcf_record.info.set("DP", 30)

AnnParser

class AnnParser:

Parser for the ANN field in VCF records. The ANN field contains variant annotation information according to the VCF annotation format specification.

Constructor

def __init__(self, record: VcfRecord):

record: A VcfRecord object containing the ANN field

Methods

Method	Description
`__iter__() -> Iterator[VariantAnnotation]`	Iterate over annotations in the ANN field
`parse() -> None`	Parse the ANN field from the record
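
In the VCF annotation format, the ANN value holds one comma-separated entry per annotation, and each entry holds 16 pipe-separated subfields. The standalone sketch below splits such a value into dictionaries. The function names and dictionary keys are hypothetical, and the example entry is made up for illustration.

```python
from typing import Optional

# Subfield order of one ANN entry; key names here are illustrative.
ANN_FIELDS = [
    "allele", "annotation", "putative_impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank_total",
    "hgvs_c", "hgvs_p", "cdna_pos_length", "cds_pos_length",
    "protein_pos_length", "distance", "messages",
]


def parse_ann(ann_value: str) -> list[dict[str, str]]:
    """Split a raw ANN value into one dict per annotation entry."""
    entries = []
    for raw_entry in ann_value.split(","):
        parts = raw_entry.split("|")
        entries.append(dict(zip(ANN_FIELDS, parts)))
    return entries


def split_pair(value: str) -> tuple[Optional[int], Optional[int]]:
    """Parse 'position/length' style subfields such as rank/total."""
    if not value:
        return None, None
    first, _, second = value.partition("/")
    return int(first), (int(second) if second else None)


# Made-up example entry, not taken from a real annotation
example = ("A|missense_variant|MODERATE|GENE1|GENE1_ID|transcript|TX1|"
           "protein_coding|2/5|c.100G>A|p.Val34Ile|150/900|100/600|34/200||")
entries = parse_ann(example + ",A|intron_variant|MODIFIER|GENE2|GENE2_ID|transcript|TX2|protein_coding|1/3|c.10+5G>A||||||")
rank, total = split_pair(entries[0]["rank_total"])
```

Paired subfields such as rank/total correspond to separate properties like `rank` and `total` on a parsed annotation.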

EffectType

class EffectType(str, Enum):

Enumeration of effect types for variant annotations. These types represent the specific effects a variant can have on genomic features. Each effect type has an associated impact level and can be mapped to a Sequence Ontology term.

Methods

Method	Description
`get_impact() -> AnnotationImpact`	Get the impact level for the effect type
`to_sequence_ontology() -> str`	Convert the effect type to a Sequence Ontology term
`from_sequence_ontology(so_term: str) -> EffectType | None`	Convert a Sequence Ontology term to an EffectType

Impact Levels

Effect types are categorized into four impact levels:

HIGH: Disruptive variants with high impact (e.g., frameshift, stop gained)
MODERATE: Variants that might change protein effectiveness (e.g., missense)
LOW: Variants unlikely to change protein behavior (e.g., synonymous)
MODIFIER: Non-coding variants or variants affecting non-coding genes (e.g., intron)

Examples

# Get the impact level of an effect type
effect = EffectType.FRAME_SHIFT
impact = effect.get_impact()  # Returns AnnotationImpact.HIGH

# Convert between effect types and Sequence Ontology terms
so_term = EffectType.NON_SYNONYMOUS_CODING.to_sequence_ontology()  # Returns "missense_variant"
effect = EffectType.from_sequence_ontology("missense_variant")  # Returns EffectType.NON_SYNONYMOUS_CODING
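
One way such an enumeration could be organized is with lookup tables keyed by enum member. The sketch below is an assumption about structure, not PyGnome's code: `SketchEffect` and `SketchImpact` are hypothetical names, and only three members are included.

```python
from enum import Enum
from typing import Optional


class SketchImpact(str, Enum):
    HIGH = "HIGH"
    MODERATE = "MODERATE"
    LOW = "LOW"
    MODIFIER = "MODIFIER"


class SketchEffect(str, Enum):
    """Hypothetical three-member subset used only for illustration."""
    FRAME_SHIFT = "FRAME_SHIFT"
    NON_SYNONYMOUS_CODING = "NON_SYNONYMOUS_CODING"
    INTRON = "INTRON"

    def get_impact(self) -> SketchImpact:
        return _IMPACT[self]

    def to_sequence_ontology(self) -> str:
        return _SO_TERM[self]

    @classmethod
    def from_sequence_ontology(cls, so_term: str) -> Optional["SketchEffect"]:
        # Reverse lookup; returns None for unknown terms.
        for effect, term in _SO_TERM.items():
            if term == so_term:
                return effect
        return None


# Lookup tables live outside the class so they are not treated as enum members.
_IMPACT = {
    SketchEffect.FRAME_SHIFT: SketchImpact.HIGH,
    SketchEffect.NON_SYNONYMOUS_CODING: SketchImpact.MODERATE,
    SketchEffect.INTRON: SketchImpact.MODIFIER,
}
_SO_TERM = {
    SketchEffect.FRAME_SHIFT: "frameshift_variant",
    SketchEffect.NON_SYNONYMOUS_CODING: "missense_variant",
    SketchEffect.INTRON: "intron_variant",
}
```

Because the enum also derives from `str`, each member compares equal to its plain string value, which is convenient when writing values back to text formats.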

VariantAnnotation

class VariantAnnotation:

Represents a single annotation entry from the ANN field in a VCF record.

Constructor

def __init__(self, allele: str, annotation: str, effect: EffectType, putative_impact: AnnotationImpact):

allele: The allele being annotated
annotation: The original annotation string from the VCF (e.g., "missense_variant")
effect: The parsed effect type (e.g., EffectType.NON_SYNONYMOUS_CODING)
putative_impact: The putative impact of the variant (HIGH, MODERATE, LOW, MODIFIER)

Properties

Property	Type	Description
`allele`	str	The allele being annotated
`annotation`	str	The original annotation string from the VCF
`effect`	EffectType	The parsed effect type
`putative_impact`	AnnotationImpact	The putative impact of the variant
`gene_name`	str	The gene name
`gene_id`	str	The gene ID
`feature_type`	FeatureType	The type of feature (e.g., transcript, motif)
`feature_id`	str	The ID of the feature
`transcript_biotype`	BiotypeCoding	The biotype of the transcript (Coding, Noncoding)
`rank`	int	The rank of the exon or intron
`total`	int	The total number of exons or introns
`hgvs_c`	str	The HGVS notation at the DNA level
`hgvs_p`	str	The HGVS notation at the protein level
`cdna_pos`	int	The position in the cDNA
`cdna_length`	int	The length of the cDNA
`cds_pos`	int	The position in the CDS
`cds_length`	int	The length of the CDS
`protein_pos`	int	The position in the protein
`protein_length`	int	The length of the protein
`distance`	int	The distance to the feature
`messages`	list[ErrorWarningType]	Error, warning, or information messages

MSI Parsers

MsiSitesReader

class MsiSitesReader:

Parser for MSI (Microsatellite Instability) sites files. MSI sites are regions of the genome with repetitive DNA sequences.

Constructor

def __init__(self, filepath: Path):

filepath: Path to the MSI sites file

Methods

Method	Description
`read_all() -> list[MsiSiteRecord]`	Read all MSI sites from the file
`read_by_chromosome(chrom: str) -> list[MsiSiteRecord]`	Read MSI sites for a specific chromosome

Genome Loader

GenomeLoader

class GenomeLoader:

Class for loading complete genomes from annotation and sequence files. This class combines sequence data from FASTA files with annotation data from GFF/GTF files to build a complete Genome object with chromosomes, genes, transcripts, exons, and other genomic features.

Constructor

def __init__(self, annotation_file: Path, sequence_file: Path, genome_name: str = "genome", species: str | None = None, verbose: bool = False, error_handling: ErrorHandling = ErrorHandling.WARN):

annotation_file: Path to the GFF/GTF annotation file
sequence_file: Path to the FASTA sequence file
genome_name: Name of the genome
species: Species name
verbose: Whether to print progress information during loading
error_handling: How to handle consistency errors (throw, warn, or ignore)

Methods

Method	Description
`load() -> Genome`	Load a genome from the annotation and sequence files
`load_sequences(sequence_file: Path) -> dict[str, Chromosome]`	Load chromosome sequences from a FASTA file
`load_features(gff_file: Path) -> None`	Load genomic features from a GFF/GTF file
`check_consistency() -> None`	Check for consistency errors in the loaded genome
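
The throw/warn/ignore choice described for `error_handling` can be sketched as a small policy function. Everything here is hypothetical: `SketchErrorHandling`, `report_inconsistency`, and the bounds check are illustrative assumptions, and the actual consistency checks in PyGnome may differ.

```python
import warnings
from enum import Enum


class SketchErrorHandling(Enum):
    THROW = "throw"
    WARN = "warn"
    IGNORE = "ignore"


def report_inconsistency(message: str, policy: SketchErrorHandling) -> None:
    """Raise, warn, or stay silent depending on the chosen policy."""
    if policy is SketchErrorHandling.THROW:
        raise ValueError(message)
    if policy is SketchErrorHandling.WARN:
        warnings.warn(message)
    # IGNORE: nothing to do


def check_feature_bounds(
    features: list[tuple[str, int, int]],
    chrom_length: int,
    policy: SketchErrorHandling,
) -> int:
    """Count features that do not fit in the chromosome, reporting each one."""
    problems = 0
    for name, start, end in features:
        if start > end or end > chrom_length:
            problems += 1
            report_inconsistency(
                f"{name} ({start}-{end}) does not fit in a chromosome of length {chrom_length}",
                policy,
            )
    return problems


features = [("exon1", 100, 200), ("exon2", 900, 1200)]
problem_count = check_feature_bounds(features, 1000, SketchErrorHandling.IGNORE)
```

Keeping the policy decision in one function means each individual check only has to describe the problem it found.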

Usage Examples

Parsing FASTA Files

from pathlib import Path
from pygnome.parsers.fasta.fasta_parser import FastaParser

# Parse a FASTA file
parser = FastaParser(Path("path/to/sequences.fa"))
records = parser.load()

# Access sequences
for record in records:
    print(f"Sequence: {record.identifier}")
    print(f"Length: {len(record.sequence)}")

    # Convert to string if needed
    seq_str = str(record.sequence)
    print(f"First 10 bases: {seq_str[:10]}")

# Load as dictionary for quick access by identifier
sequences = FastaParser(Path("path/to/sequences.fa")).load_as_dict()
my_seq = sequences["chr1"].sequence

Parsing GFF/GTF Files

from pathlib import Path
from pygnome.parsers.gff.gff3_parser import Gff3Parser
from pygnome.parsers.gff.gtf_parser import GtfParser

# Parse a GFF3 file
gff_parser = Gff3Parser(Path("path/to/annotations.gff3"))
for record in gff_parser:
    print(f"{record.type}: {record.chrom}:{record.start}-{record.end}")
    print(f"Attributes: {record.attributes}")

# Parse a GTF file
gtf_parser = GtfParser(Path("path/to/annotations.gtf"))
for record in gtf_parser:
    if record.type == "gene":
        gene_id = record.attributes.get("gene_id")
        gene_name = record.attributes.get("gene_name")
        print(f"Gene: {gene_id} ({gene_name}) - {record.chrom}:{record.start}-{record.end}")

Parsing VCF Files

from pathlib import Path
from pygnome.parsers.vcf.vcf_reader import VcfReader

# Open a VCF file
with VcfReader(Path("path/to/variants.vcf")) as reader:
    # Get sample names
    samples = reader.get_samples()
    print(f"Samples: {samples}")

    # Iterate through records
    for record in reader:
        print(f"Record: {record.get_chrom()}:{record.get_pos()} {record.get_ref()}>{','.join(record.get_alt())}")

        # Access INFO fields directly using dictionary-like syntax
        if record.has_info("DP"):
            depth = record.info["DP"]
            print(f"Read depth: {depth}")

        # Iterate through INFO fields
        for field_id, field_value in record.info:
            print(f"{field_id} = {field_value}")

        # Create variant objects from the record
        for variant in record:  # Uses VariantFactory internally
            print(f"Variant: {variant}")

        # Access genotypes
        genotypes = record.get_genotypes()
        for i, genotype in enumerate(genotypes):
            print(f"  {samples[i]}: {genotype}")

    # Query a specific region
    for record in reader.fetch("chr1", 1000000, 2000000):
        for variant in record:
            print(f"Region variant: {variant}")

Parsing VCF Annotations (ANN Field)

from pathlib import Path
from pygnome.parsers.vcf.vcf_reader import VcfReader

# Open a VCF file
with VcfReader(Path("path/to/variants.vcf")) as reader:
    # Iterate through records
    for record in reader:
        # Get variant annotations directly from the record
        for vann in record.variant_annotations():
            print(f"Variant annotation: {vann.allele} - {vann.annotation}")
            print(f"  Impact: {vann.putative_impact}")

            if vann.gene_name:
                print(f"  Gene: {vann.gene_name}")

            if vann.feature_type and vann.feature_id:
                print(f"  Feature: {vann.feature_type.value} {vann.feature_id}")

            if vann.hgvs_c:
                print(f"  HGVS.c: {vann.hgvs_c}")

            if vann.hgvs_p:
                print(f"  HGVS.p: {vann.hgvs_p}")

Loading a Complete Genome

from pathlib import Path
from pygnome.parsers.genome_loader import GenomeLoader

# Create a genome loader
loader = GenomeLoader(
    annotation_file=Path("path/to/annotations.gtf"),
    sequence_file=Path("path/to/genome.fa.gz"),
    genome_name="GRCh38",
    species="Homo sapiens",
    verbose=True  # Print progress information
)

# Load genome structure and sequence
genome = loader.load()

# Access genome components
print(f"Genome: {genome.name} ({genome.species})")
print(f"Chromosomes: {len(genome.chromosomes)}")
print(f"Genes: {len(genome.genes)}")

# Get a specific chromosome
chr1 = genome.chromosomes.get("chr1")
if chr1:
    print(f"Chromosome: {chr1.name}, Length: {chr1.length}")
    print(f"Genes on chr1: {len(chr1.genes)}")