A Python library for blazing-fast, memory-efficient genomics data operations using DataFrames.
polars-bio is a Python library for genomics that provides a DataFrame API for large-scale genomic interval datasets. It is built on Polars, Apache Arrow, and Apache DataFusion to deliver high performance, parallel processing, and out-of-core capabilities, making it suitable for computationally intensive bioinformatic analyses.
polars-bio is aimed at bioinformaticians and computational biologists who work with large genomic interval datasets and need efficient, scalable data manipulation in Python, particularly when the data is too large for memory or the workload demands high-speed operations.
Developers choose polars-bio because it aims to be the most efficient single-node library for genomic interval DataFrames in Python. It offers significant speedups over alternatives such as Bioframe, along with out-of-core processing, cloud storage support, and compatibility with both Pandas and Polars DataFrames.
Blazing-Fast Bioinformatic Operations on Python DataFrames
Benchmarks show up to 38x speedup in count_overlaps queries compared to Bioframe, thanks to an optimized Rust backend and COITrees for interval operations.
Supports streaming and federated reading from cloud storage like S3 and GCS, enabling analysis of datasets too large for memory without full materialization.
Leverages Apache DataFusion to provide SQL-powered data manipulation, allowing bioinformaticians to use familiar SQL syntax for complex genomic queries.
Integrates with libraries like noodles to handle common bioinformatics formats such as BED and GFF, facilitating seamless data ingestion from various sources.
Relies on Polars, Apache Arrow, and DataFusion, which can lead to installation challenges, version conflicts, and a steeper setup curve compared to lighter libraries.
Primarily designed for genomic interval operations; it lacks built-in tools for other bioinformatics tasks like sequence alignment or variant calling, requiring additional libraries.
Requires familiarity with SQL and DataFusion's query engine, which may be a barrier for Python-centric bioinformaticians accustomed to DataFrame APIs alone.
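For readers unfamiliar with the count_overlaps operation named in the benchmark claim above, the following is a minimal pure-Python sketch of what such a query computes: for each query interval, the number of target intervals on the same chromosome that overlap it. It is not polars-bio's implementation or API. It swaps in sorted arrays and binary search in place of COITrees, assumes half-open `[start, end)` coordinates with non-empty intervals, and the function name and tuple layout are illustrative choices only.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def count_overlaps(queries, targets):
    """Count, for each query, the targets that overlap it.

    Both arguments are lists of (chrom, start, end) tuples using
    half-open [start, end) coordinates (an assumption of this sketch).
    """
    # Group target starts and ends per chromosome, then sort each list once.
    starts = defaultdict(list)
    ends = defaultdict(list)
    for chrom, start, end in targets:
        starts[chrom].append(start)
        ends[chrom].append(end)
    for chrom in starts:
        starts[chrom].sort()
        ends[chrom].sort()

    counts = []
    for chrom, q_start, q_end in queries:
        # Targets that begin strictly before the query ends.
        began = bisect_left(starts.get(chrom, []), q_end)
        # Targets that finish at or before the query begins.
        finished = bisect_right(ends.get(chrom, []), q_start)
        # Every target is either finished, overlapping, or not yet begun.
        counts.append(began - finished)
    return counts


targets = [
    ("chr1", 100, 200),
    ("chr1", 150, 300),
    ("chr1", 400, 500),
    ("chr2", 10, 20),
]
queries = [
    ("chr1", 180, 420),  # touches all three chr1 targets
    ("chr1", 300, 400),  # falls in the gap between targets
    ("chr2", 5, 15),     # overlaps the single chr2 target
    ("chr3", 1, 5),      # chromosome absent from targets
]
print(count_overlaps(queries, targets))
```

Each query costs two binary searches, so the sketch runs in O((n + m) log n) for n targets and m queries. An interval tree such as COITrees is needed when the overlapping intervals themselves must be returned rather than just counted.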