A Go tool and library for downloading URLs and files from Common Crawl and Wayback Machine web archives.
A collection of Jupyter notebooks for analyzing Common Crawl web archive data using columnar indexes and webgraph datasets.
A Python tutorial demonstrating how to access and process Common Crawl's web archive datasets (WARC, WET, WAT) using tools like warcio, cdx_toolkit, and DuckDB.