A research-driven web crawler for building and analyzing curated web corpora as networks of web entities.
Hyphe is a web crawler and corpus curation tool designed for researchers in the social sciences and digital humanities. Users build curated collections of web pages (corpora) by defining 'web entities' (custom units such as whole websites or groups of pages) and crawling the links between them to generate analyzable networks. It supports structured, reproducible web studies that go beyond simple list-based scraping.
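The idea of aggregating page-level links into a network of web entities can be sketched as follows. This is a minimal illustration only, not Hyphe's actual API or data model: the entity names, the URL prefixes, and the helper functions `normalize`, `entity_for`, and `entity_network` are all hypothetical, and the sketch assumes that each entity is defined by one or more URL prefixes and that the longest matching prefix wins.

```python
from collections import Counter

# Hypothetical web entities: name -> URL prefixes that belong to it.
# (Illustrative assumption, not Hyphe's real storage format.)
ENTITIES = {
    "Example Blog": ["example.org/blog"],
    "Example Site": ["example.org"],
    "Other Site": ["other.net"],
}


def normalize(url):
    """Drop the scheme, a leading 'www.', and trailing slashes."""
    url = url.split("://", 1)[-1]
    if url.startswith("www."):
        url = url[4:]
    return url.rstrip("/")


def entity_for(url):
    """Return the entity with the longest matching prefix, or None."""
    page = normalize(url)
    best, best_len = None, -1
    for name, prefixes in ENTITIES.items():
        for prefix in prefixes:
            matches = page == prefix or page.startswith(prefix + "/")
            if matches and len(prefix) > best_len:
                best, best_len = name, len(prefix)
    return best


def entity_network(page_links):
    """Collapse (source_page, target_page) links into weighted entity edges."""
    edges = Counter()
    for src, dst in page_links:
        a, b = entity_for(src), entity_for(dst)
        # Skip pages outside the corpus and links internal to one entity.
        if a and b and a != b:
            edges[(a, b)] += 1
    return edges


links = [
    ("https://example.org/blog/post-1", "http://other.net/about"),
    ("https://www.example.org/blog/post-2", "https://other.net/"),
    ("https://example.org/contact", "https://example.org/blog/"),
    ("https://unknown.io/", "https://other.net"),
]
network = entity_network(links)
```

Here the two blog posts linking to other.net collapse into a single weighted edge between two entities, while the page from an undefined site is ignored. Curation, in this simplified picture, amounts to editing the prefix table and recomputing the network.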
Researchers, academics, and analysts in social sciences, digital humanities, and media studies who need to systematically collect and analyze web data as networks for projects like controversy mapping, community detection, or hyperlink analysis.
Developers and researchers choose Hyphe for its curation-first approach, its built-in web interface for iterative exploration, and its focus on generating meaningful network data from the web rather than just collecting raw pages. It bridges the gap between automated crawling and manual, methodologically driven corpus building.
Website crawler with a built-in web interface for exploration and control
Designed specifically for social science and digital humanities research, with web entity curation and controversy mapping central to its stated philosophy and key features.
Offers fine-grained control over crawl scope, depth, and inclusion/exclusion rules, which supports methodologically sound corpus building.
Provides a built-in interface for exploration, visualization, and iterative refinement of web corpora, reducing the need for external tools.
Backed by Sciences Po médialab, with extensive tutorials and documentation; the README lists many peer-reviewed publications that have used it.
Can consume significant disk space (e.g., up to 50 GB for a corpus with a few hundred crawls at depth 2), so it needs substantial storage.
Upgrading between versions often requires reinstallation and rebuilding of corpora, as migration from older versions is not guaranteed and can break data.
The easy install relies heavily on Docker, and manual installation is complex, limited to Linux, and poorly documented for other platforms.