Common Crawl

3 projects

Go Get Crawl (Go)

A Go tool and library for downloading URLs and files from Common Crawl and Wayback Machine web archives.

#crawler #go-library #wayback-machine

Stars: 180

Forks: 17

Last commit: 1 year ago

Common Crawl Jupyter notebooks (Jupyter Notebook)

A collection of Jupyter notebooks for analyzing Common Crawl web archive data using columnar indexes and webgraph datasets.

#warc-files #data-science #web-archive-analysis

Stars: 66

Forks: 11

A Whirlwind Tour of Common Crawl's Datasets using Python (Python)

A Python tutorial demonstrating how to access and process Common Crawl's web archive datasets (WARC, WET, WAT) using tools like warcio, cdx_toolkit, and DuckDB.

#data-indexing #parquet #cdx-index

Stars: 45

Forks: 9
