A Whirlwind Tour of Common Crawl's Datasets using Python

Apache-2.0 · Python

A Python tutorial demonstrating how to access and process Common Crawl's web archive datasets (WARC, WET, WAT) using tools like warcio, cdx_toolkit, and DuckDB.
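To give a feel for what a WARC file contains, here is a minimal sketch that uses only the Python standard library. It is not the tutorial's own code, which relies on warcio and related tools. The sample record, the example.com URL, and the helper names `build_sample_warc` and `iter_warc_records` are hypothetical and exist only for illustration. The sketch assumes a simple record layout: a version line, header lines, a blank line, then a body of `Content-Length` bytes.

```python
import gzip
import io


def build_sample_warc() -> bytes:
    """Build a tiny gzipped WARC file holding one made-up 'response' record."""
    payload = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"\r\n"
        b"<html>hello</html>"
    )
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        "WARC-Target-URI: https://example.com/\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record ends with two CRLF pairs after its content block.
    return gzip.compress(header + payload + b"\r\n\r\n")


def iter_warc_records(stream):
    """Yield (headers, body) for each record in a decompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        headers = {"version": line.strip().decode("utf-8")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


sample = build_sample_warc()
with gzip.open(io.BytesIO(sample), "rb") as stream:
    records = list(iter_warc_records(stream))

for headers, body in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(body))
```

For real crawl data, a dedicated reader such as warcio is the safer choice, since this sketch skips details like multi-line header values and payload digests.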

45 stars · 9 forks · 0 contributors

Overview

A whirlwind tour of Common Crawl's data using Python

Quick Stats

Stars: 45

Forks: 9

Contributors: 0

Open Issues: 0

Last commit: 2 months ago

Created: 2024

Tags

#data-indexing #parquet #warc #python #python-tutorial #archive #web-archiving #data-processing #data-analysis #common-crawl #tutorial

Included in

Web Archiving · 2.5k

Related Projects

GLAM Workbench: Web Archives

See also this related blog post on 'Asking questions with web archives'

IIPC and DPC Training materials: module for beginners (8 sessions)

Tutorial for Humanities researchers about how to explore Arquivo.pt

warc-specifications

community HTML version of the official specification and hub for new proposals
