Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.


A Whirlwind Tour of Common Crawl's Datasets using Python

Apache-2.0 · Python

A Python tutorial demonstrating how to access and process Common Crawl's web archive datasets (WARC, WET, WAT) using tools like warcio, cdx_toolkit, and DuckDB.

GitHub: 45 stars · 9 forks · 0 contributors

Overview

A whirlwind tour of Common Crawl's data using Python

Quick Stats

Stars: 45
Forks: 9
Contributors: 0
Open Issues: 0
Last commit: 2 months ago
Created: since 2024

Tags

#data-indexing #parquet #warc #python #python-tutorial #archive #web-archiving #data-processing #data-analysis #common-crawl #tutorial

Built With

Make
Python
DuckDB

Included in

Web Archiving · 2.5k
Auto-fetched 1 day ago

Related Projects

GLAM Workbench: Web Archives

See also this related blog post on 'Asking questions with web archives'

IIPC and DPC Training materials: module for beginners (8 sessions)

Tutorial for Humanities researchers about how to explore Arquivo.pt

warc-specifications

Community HTML version of the official specification and hub for new proposals