A collection of robust and fast Python tools for parsing, extracting, and analyzing web archive data, including a high-performance WARC parser.
ChatNoir Resiliparse is a Python toolkit for robust and fast processing of web archive data. It provides utilities for parsing, extracting, and cleaning web content, along with a high-performance WARC parser called FastWARC, designed to handle large-scale analytics efficiently.
Resiliparse is aimed at researchers, data scientists, and developers who work with web archives, Common Crawl data, or large-scale web analytics and need reliable, performant tools for parsing and extracting information from noisy web data.
Developers choose Resiliparse for its optimized performance, its resilience against malformed data, and a toolset tailored specifically to web archive processing. That toolset includes FastWARC, a WARC parser built for higher throughput than alternatives.
A robust web archive analytics toolkit
Tools such as encoding detection and HTML parsing are highly optimized. FastWARC is written in C++/Cython for speed and supports WARC/1.0 and WARC/1.1 records with GZip or LZ4 compression.
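To make the record layout that a WARC parser has to handle more concrete, here is a minimal standard-library sketch. It is not FastWARC's API or implementation; the function names `iter_warc_records` and `build_record` are hypothetical, and the toy records omit fields a real WARC file requires (such as WARC-Record-ID and WARC-Date). It only illustrates the version line, header block, and Content-Length-delimited body of a WARC/1.x record inside a multi-member GZip stream.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (version, headers, body) for each record in a decompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        version = line.strip().decode("ascii")
        if version not in ("WARC/1.0", "WARC/1.1"):
            raise ValueError(f"unsupported version line: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The body is exactly Content-Length bytes long.
        body = stream.read(int(headers["Content-Length"]))
        yield version, headers, body


def build_record(uri, payload):
    """Build a toy record for the demo (hypothetical helper, incomplete headers)."""
    head = (
        "WARC/1.1\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# Each record is compressed as its own GZip member, as is common for .warc.gz files.
raw = gzip.compress(build_record("https://example.com/a", b"hello")) + gzip.compress(
    build_record("https://example.com/b", b"world!")
)
records = list(iter_warc_records(gzip.GzipFile(fileobj=io.BytesIO(raw))))
for version, headers, body in records:
    print(version, headers["WARC-Target-URI"], len(body))
```

A pure-Python loop like this is easy to follow but slow on large crawls, which is the gap a compiled parser is meant to close.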
Process guards use decorators and context managers to enforce execution time and memory limits, keeping large-scale jobs resilient by interrupting tasks that exceed their limits.
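The guard idea can be sketched with the standard library alone. The example below is not Resiliparse's implementation or API; `time_limit`, `TimeLimitExceeded`, and `run_guarded` are hypothetical names, and the sketch assumes a Unix system and the main thread, because it relies on `signal.setitimer`. It shows one object that works both as a decorator and as a context manager and interrupts code that runs past its time budget.

```python
import signal
import time
from contextlib import ContextDecorator


class TimeLimitExceeded(Exception):
    """Raised inside the guarded code when it runs longer than allowed."""


class time_limit(ContextDecorator):
    """Usable as `with time_limit(1.0):` or as `@time_limit(1.0)` (Unix, main thread)."""

    def __init__(self, seconds):
        self.seconds = seconds

    def _on_alarm(self, signum, frame):
        raise TimeLimitExceeded(f"exceeded {self.seconds}s")

    def __enter__(self):
        # Install our handler and start a one-shot real-time timer.
        self._previous = signal.signal(signal.SIGALRM, self._on_alarm)
        signal.setitimer(signal.ITIMER_REAL, self.seconds)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Cancel the timer and restore whatever handler was there before.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, self._previous)
        return False  # never swallow exceptions


@time_limit(0.2)
def slow_task():
    time.sleep(2)  # stands in for a document that takes too long to process
    return "done"


def run_guarded(task):
    """Run a task and return None if it was interrupted by the guard."""
    try:
        return task()
    except TimeLimitExceeded:
        return None


print(run_guarded(slow_task))
with time_limit(1.0):
    quick = sum(range(1000))
print(quick)
```

In a batch job, the caller would typically log the skipped item and move on to the next one rather than let a single pathological input stall the whole run.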
Extraction utilities are performance-optimized for main content extraction and boilerplate removal, tailored for handling noisy web data at scale.
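As a rough illustration of what boilerplate removal means, the following standard-library sketch drops text inside elements that usually hold navigation or page chrome and keeps the rest. It is a deliberately naive heuristic, not Resiliparse's extraction algorithm; `MainTextExtractor`, `extract_main_text`, and the `SKIP_TAGS` set are assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements whose text is treated as boilerplate in this toy heuristic.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect text nodes that are not nested inside a skipped element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><body><nav>Home | About</nav>"
    "<article><h1>Title</h1><p>Body text.</p></article>"
    "<footer>Copyright</footer></body></html>"
)
main_text = extract_main_text(sample)
print(main_text)
```

A tag-based filter like this breaks down on malformed markup, for example an unclosed `nav` element that hides everything after it, which is the kind of noisy input a purpose-built extractor has to tolerate.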
Detailed documentation covers parsing, extraction, process guards, and FastWARC, with hosted guides on Read the Docs for easy reference.
Building from source requires vcpkg and manual configuration of dependencies, with additional steps like auditwheel to fix wheels for redistribution, increasing deployment overhead.
FastWARC is inspired by WARCIO but not a drop-in replacement, which can complicate migration for projects reliant on WARCIO's API without adjustments.
Helper modules such as Resiliparse's Itertools are designed for use within the Resiliparse ecosystem, potentially limiting flexibility when integrating with external libraries or frameworks.