A Python tutorial demonstrating how to access and process Common Crawl's web archive datasets (WARC, WET, WAT) using tools like warcio, cdx_toolkit, and DuckDB.
A whirlwind tour of Common Crawl's data using Python
See also this related blog post on 'Asking questions with web archives'
community HTML version of the official specification and hub for new proposals