Sequences

The sequences module provides memory-efficient representations of DNA and RNA sequences. It includes classes for representing individual sequences and collections of sequences.

Overview

The sequences module is designed to efficiently store and manipulate nucleotide sequences. It encodes each DNA or RNA nucleotide in 2 bits, which takes roughly a quarter of the storage needed by a one-byte-per-character string.

Key features:

Memory-efficient 2-bit representation (A=00, C=01, G=10, T/U=11)
Support for common sequence operations (complement, reverse complement, transcription, translation)
Efficient storage of multiple sequences in a single array
Support for slicing and indexing
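
As a rough illustration of the 2-bit encoding listed above, the sketch below packs a nucleotide string into 32-bit words and decodes it again. It is not the module's implementation: the helper names `pack` and `unpack` are hypothetical, and the bit order within each word (first nucleotide in the lowest bits) is an assumption.

```python
# Hypothetical helpers; only the code table (A=00, C=01, G=10, T/U=11) comes from the text.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11, "U": 0b11}
DNA_LETTERS = "ACGT"
NT_PER_WORD = 16  # 16 nucleotides x 2 bits = one 32-bit word


def pack(sequence: str) -> list[int]:
    """Pack a nucleotide string into a list of 32-bit integers."""
    words = [0] * ((len(sequence) + NT_PER_WORD - 1) // NT_PER_WORD)
    for i, nt in enumerate(sequence.upper()):
        word, slot = divmod(i, NT_PER_WORD)
        # Assumption: nucleotide i occupies bits 2*slot and 2*slot+1 of its word.
        words[word] |= CODE[nt] << (2 * slot)
    return words


def unpack(words: list[int], length: int) -> str:
    """Decode the first `length` nucleotides from the packed words as DNA letters."""
    out = []
    for i in range(length):
        word, slot = divmod(i, NT_PER_WORD)
        out.append(DNA_LETTERS[(words[word] >> (2 * slot)) & 0b11])
    return "".join(out)


packed = pack("ATGCATGCATGC")
print(len(packed))         # 1 (twelve nucleotides fit in one word)
print(unpack(packed, 12))  # ATGCATGCATGC
```

The sequence length has to be stored next to the packed words, because trailing zero bits are indistinguishable from a run of A nucleotides.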

Core Classes

BaseSequence

class BaseSequence(ABC):

Abstract base class for efficient 2-bit representation of nucleotide sequences. This class provides common functionality for storing nucleotide sequences using 2 bits per nucleotide, allowing 16 nucleotides to be packed into a single 32-bit integer.

Constructor

def __init__(self, sequence: str):

sequence: A string containing nucleotides

Methods

Method	Description
`__len__() -> int`	Return the length of the sequence
`__str__() -> str`	Return the sequence as a string
`__getitem__(key) -> str`	Get a nucleotide or subsequence
`to_string() -> str`	Convert the entire sequence to a string
`substring(start: int, length: int | None = None) -> str`	Extract a substring from the sequence
`__eq__(other) -> bool`	Check if two sequence objects are equal

DnaString

class DnaString(BaseSequence):

Efficient 2-bit representation of DNA sequences. This class stores DNA sequences (A, C, G, T) using 2 bits per nucleotide, allowing 16 nucleotides to be packed into a single 32-bit integer.

Constructor

def __init__(self, sequence: str):

sequence: A string containing DNA nucleotides (A, C, G, T)

Methods

Method	Description
`complement() -> DnaString`	Return the complement of the sequence
`reverse_complement() -> DnaString`	Return the reverse complement of the sequence
`transcribe() -> RnaString`	Transcribe DNA to RNA
`gc_content() -> float`	Calculate the GC content of the sequence
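
To make the semantics of these operations concrete, here is a minimal sketch on plain Python strings. The function names mirror the methods above but are stand-alone illustrations, not the DnaString implementation. Returning GC content as a fraction between 0 and 1 is an assumption, as is the XOR trick, which relies only on the code table A=00, C=01, G=10, T=11.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Swap A<->T and C<->G."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement the sequence, then read it backwards."""
    return complement(seq)[::-1]


def transcribe(seq: str) -> str:
    """DNA to RNA: replace T with U."""
    return seq.replace("T", "U")


def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides (assumed to be in the range 0 to 1)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# With the 2-bit codes, complementing is a bit flip: code XOR 0b11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
LETTER = {code: nt for nt, code in CODE.items()}


def complement_via_bits(seq: str) -> str:
    return "".join(LETTER[CODE[nt] ^ 0b11] for nt in seq)


dna = "ATGCATGCATGC"
print(complement(dna))           # TACGTACGTACG
print(reverse_complement(dna))   # GCATGCATGCAT
print(transcribe(dna))           # AUGCAUGCAUGC
print(gc_content(dna))           # 0.5
print(complement_via_bits(dna))  # TACGTACGTACG
```

Because T and U share the code 11, transcription would not need to touch the packed bits at all under this encoding; only the alphabet used for decoding changes.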

RnaString

class RnaString(BaseSequence):

Efficient 2-bit representation of RNA sequences. This class stores RNA sequences (A, C, G, U) using 2 bits per nucleotide, allowing 16 nucleotides to be packed into a single 32-bit integer.

Constructor

def __init__(self, sequence: str):

sequence: A string containing RNA nucleotides (A, C, G, U)

Methods

Method	Description
`complement() -> RnaString`	Return the complement of the sequence
`reverse_complement() -> RnaString`	Return the reverse complement of the sequence
`translate() -> str`	Translate RNA to protein
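
Translation reads the RNA three nucleotides at a time and maps each codon to an amino acid. The sketch below uses the standard genetic code and is an illustration only: the function is a stand-alone helper, and both stopping at the first stop codon and ignoring a trailing partial codon are assumptions about behavior that the text above does not specify.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str, stop_at_stop_codon: bool = True) -> str:
    """Translate complete codons; a trailing partial codon is ignored."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*" and stop_at_stop_codon:
            break  # Assumption: translation ends at the first stop codon.
        protein.append(aa)
    return "".join(protein)


print(translate("AUGCAUGCAUGC"))  # MHAC
print(translate("AUGUAAGGG"))     # M
```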

DnaStringArray

class DnaStringArray:

Efficient storage for millions of small DNA strings in a single NumPy array. This class stores multiple DNA sequences using the same 2-bit encoding as DnaString, but packs all sequences into a single contiguous array for improved memory efficiency when dealing with large numbers of sequences.

Constructor

def __init__(self, initial_data_bytes: int = DEFAULT_CAPACITY, initial_strings: int = DEFAULT_NUMBER_OF_STRINGS):

initial_data_bytes: Initial capacity in bytes for the data array (default: 1 MB)
initial_strings: Initial number of strings the array can hold (default: 100,000)

Methods

Method	Description
`add(sequence: str) -> int`	Add a DNA sequence to the array
`add_multiple(sequences: list[str]) -> list[int]`	Add multiple DNA sequences to the array
`get(idx: int) -> str`	Get a sequence by its index
`get_subsequence(idx: int, start: int, length: int | None = None) -> str`	Extract a subsequence from a sequence in the array
`get_length(idx: int) -> int`	Get the length of a sequence
`__getitem__(idx: int) -> str`	Get a sequence by its index
`__len__() -> int`	Return the number of sequences in the array
`trim() -> None`	Trim the internal arrays to their actual used size
`get_stats() -> tuple[int, int, float]`	Get statistics about memory usage
`to_dna_string(idx: int) -> DnaString`	Convert a sequence in the array to a DnaString object

Usage Examples

Working with DNA Sequences

from pygnome.sequences.dna_string import DnaString

# Create a DNA sequence
dna = DnaString("ATGCATGCATGC")
print(f"Length: {len(dna)}")

# Get a subsequence
subseq = dna[3:9]  # Positions 3 to 8; __getitem__ is documented above as returning str
print(f"Subsequence: {subseq}")

# Complement and reverse complement
comp = dna.complement()
rev_comp = dna.reverse_complement()
print(f"Complement: {comp}")
print(f"Reverse complement: {rev_comp}")

# Transcribe DNA to RNA
rna = dna.transcribe()  # Returns an RnaString
print(f"RNA: {rna}")

Working with RNA Sequences

from pygnome.sequences.rna_string import RnaString

# Create an RNA sequence
rna = RnaString("AUGCAUGCAUGC")
print(f"Length: {len(rna)}")

# Get a subsequence
subseq = rna[3:9]  # Positions 3 to 8; __getitem__ is documented above as returning str
print(f"Subsequence: {subseq}")

# Complement and reverse complement
comp = rna.complement()
rev_comp = rna.reverse_complement()
print(f"Complement: {comp}")
print(f"Reverse complement: {rev_comp}")

# Translate RNA to protein
protein = rna.translate()
print(f"Protein: {protein}")

Working with Multiple Sequences

from pygnome.sequences.dna_string_array import DnaStringArray

# Create a DNA string array
array = DnaStringArray()

# Add sequences
idx1 = array.add("ATGCATGC")
idx2 = array.add("GCTAGCTA")

# Add multiple sequences at once
indices = array.add_multiple(["AAAAAA", "CCCCCC", "GGGGGG"])

# Access sequences
seq1 = array[idx1]
print(f"Sequence 1: {seq1}")

# Get subsequences
subseq = array.get_subsequence(idx2, 2, 4)
print(f"Subsequence: {subseq}")

# Get statistics
count, total_nt, bits_per_nt = array.get_stats()
print(f"Sequences: {count}, Total nucleotides: {total_nt}, Bits per nucleotide: {bits_per_nt:.2f}")

# Trim to reduce memory usage
array.trim()

Memory Efficiency

The 2-bit encoding used by the sequences module provides significant memory savings compared to string-based representations:

A Python string of ASCII characters uses 1 byte (8 bits) per character, plus a fixed per-object overhead
DnaString and RnaString use 2 bits per nucleotide
This results in a roughly 4x reduction in memory for the sequence data

For example, a 1 million base pair chromosome would use:

String representation: ~1 MB
DnaString representation: ~250 KB
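
The arithmetic behind these figures can be checked with a few lines of Python. This sketch measures a real string with `sys.getsizeof` (CPython-specific, and the result includes the object header) and computes the packed size from the 2-bits-per-nucleotide rule; it does not measure an actual DnaString object.

```python
import sys

n = 1_000_000                            # nucleotides
as_string = "ACGT" * (n // 4)

string_bytes = sys.getsizeof(as_string)  # about 1 byte per ASCII character plus a header
packed_bytes = (2 * n + 7) // 8          # 2 bits per nucleotide, rounded up to whole bytes

print(f"str:   {string_bytes:,} bytes")
print(f"2-bit: {packed_bytes:,} bytes")  # 2-bit: 250,000 bytes
print(f"ratio: {string_bytes / packed_bytes:.2f}x")
```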

The DnaStringArray class provides even greater memory efficiency when storing multiple sequences by:

Minimizing memory overhead compared to individual DnaString objects
Improving cache locality for faster access patterns
Reducing memory fragmentation

Performance Considerations

Use DnaStringArray instead of multiple DnaString objects when working with many small sequences
Call trim() on DnaStringArray before serialization so that unused preallocated capacity is released and not written out
For very large sequences (e.g., entire chromosomes), consider using memory-mapped files or chunked processing
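
Chunked processing, as suggested in the last point, can be sketched as follows for a GC-content calculation. The function name and chunk size are hypothetical, and the input is any text file handle; characters other than A, C, G, and T (newlines, N, and so on) are skipped, which is an assumption about how such input should be treated.

```python
import io
from typing import TextIO


def gc_content_chunked(handle: TextIO, chunk_size: int = 1 << 16) -> float:
    """Compute the GC fraction while holding only one chunk in memory."""
    gc = total = 0
    while chunk := handle.read(chunk_size):
        chunk = chunk.upper()
        gc += chunk.count("G") + chunk.count("C")
        total += sum(chunk.count(nt) for nt in "ACGT")
    return gc / total if total else 0.0


# A StringIO stands in for a large sequence file opened with open(path).
handle = io.StringIO("ATGCATGC\nATGCATGC\n")
print(gc_content_chunked(handle, chunk_size=5))  # 0.5
```

Counting works per chunk because the result does not depend on chunk boundaries; operations that span positions, such as codon translation, would need to carry leftover nucleotides from one chunk to the next.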