A Python library and CLI tool for web crawling, scraping, and extracting main text, metadata, and comments from web pages.
Trafilatura is a Python package and command-line tool for gathering text and metadata from the web. It performs web crawling, scraping, and extraction of main texts, comments, and structured data, outputting results in formats like CSV, JSON, HTML, and Markdown. It solves the problem of turning raw HTML into clean, meaningful content by focusing on actual text and avoiding page noise.
It is aimed at developers, researchers, and data scientists who need to extract and process web content at scale, for example for text analysis, data acquisition, or dataset building.
Developers choose Trafilatura for its robust, configurable extraction that balances precision and recall, its modular design with no database dependency, and the benchmark results reported in its README, where it consistently outperforms other open-source libraries at text extraction.
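As a rough illustration of what configurable extraction can look like from Python, here is a minimal sketch that assumes the `trafilatura` package is installed. `trafilatura.extract` and its `output_format`, `include_comments`, `favor_precision`, and `favor_recall` parameters come from the library; the wrapper name `extract_page`, its `mode` argument, and the sample HTML are hypothetical and only serve the example.

```python
from typing import Optional

# Illustrative sample document (an assumption for this sketch); any HTML string works.
SAMPLE_HTML = """<html><head><title>Example post</title></head><body>
<nav>Home | About | Contact</nav>
<article><h1>Example post</h1>
<p>Trafilatura keeps the main text of a page and discards navigation,
footers and other boilerplate that surrounds it.</p>
<p>This second paragraph gives the extractor a little more content to work with,
since very short documents may be rejected as empty.</p></article>
<footer>Copyright notice</footer></body></html>"""


def extract_page(html: str, mode: str = "balanced", output_format: str = "txt") -> Optional[str]:
    """Hypothetical wrapper: extract main content, trading precision against recall."""
    # Imported lazily so the module loads even where the dependency is missing.
    import trafilatura  # third-party: pip install trafilatura

    return trafilatura.extract(
        html,
        output_format=output_format,            # e.g. "txt", "json", "xml"
        include_comments=False,                 # leave out comment sections
        favor_precision=(mode == "precision"),  # stricter: less noise, may drop text
        favor_recall=(mode == "recall"),        # looser: more text, may keep noise
    )
```

Calling `extract_page(SAMPLE_HTML, mode="precision", output_format="json")` should yield a string, or `None` when no usable main content is detected, so callers are advised to handle both cases.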
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
According to its README, it consistently outperforms other open-source libraries in text extraction benchmarks, balancing precision and recall to produce clean text output.
Efficiently handles live URLs and downloaded files in parallel, making it ideal for batch processing large web corpora without a database dependency.
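To make the batch-processing idea concrete, the following sketch runs an extraction function over a set of already-downloaded HTML files using only the standard library. It is not Trafilatura's internal implementation: `extract_files` and its parameters are hypothetical names, and the `extractor` argument is a placeholder for any callable that maps an HTML string to text (for instance `trafilatura.extract`, if the package is installed).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def extract_files(
    paths: Iterable[Path],
    extractor: Callable[[str], Optional[str]],
    max_workers: int = 4,
) -> Dict[Path, Optional[str]]:
    """Read each HTML file and run `extractor` on its content in a thread pool."""

    def process(path: Path) -> Optional[str]:
        # Replace undecodable bytes instead of failing on a single bad file.
        html = path.read_text(encoding="utf-8", errors="replace")
        return extractor(html)

    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with path_list.
        results = list(pool.map(process, path_list))
    return dict(zip(path_list, results))
```

No database is involved: inputs are plain files and the output is an in-memory mapping. Threads mainly help with I/O-bound work; for CPU-heavy extraction over large corpora, a `ProcessPoolExecutor` may be a better fit.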
Supports multiple formats including JSON, CSV, Markdown, and XML-TEI, providing versatility for data integration and research purposes.
Includes smart URL management with sitemap and feed support, streamlining the discovery process for structured web scraping.
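Sitemap-based discovery boils down to reading the `<loc>` entries of a sitemap file. The sketch below shows that step with the standard library only, on a made-up sample sitemap. It is an illustration of the idea, not the library's code; Trafilatura ships its own helpers for this (such as `sitemap_search` in `trafilatura.sitemaps` and `find_feed_urls` in `trafilatura.feeds`), and the function name `urls_from_sitemap` here is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Made-up sample sitemap used only for this example.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/blog/post-1</loc></url>
  <url><loc> https://example.org/blog/post-2 </loc></url>
</urlset>"""


def urls_from_sitemap(xml_text: str) -> List[str]:
    """Return every <loc> URL found in a sitemap document, in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # iter() walks the whole tree, so this also covers sitemap index files.
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


if __name__ == "__main__":
    for url in urls_from_sitemap(SAMPLE_SITEMAP):
        print(url)
```

The resulting URL list can then be fed to a downloader and extractor one page at a time.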
Primarily targets static HTML extraction, so it may fail on modern sites requiring client-side rendering without external tools like headless browsers.
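One way to cope with this limitation is to detect pages that are probably rendered client-side and route only those to a headless browser. The heuristic below is an assumption made for illustration, not part of Trafilatura: it flags a page when the raw HTML contains scripts but almost no visible text. The function name `probably_needs_rendering` and the `min_text_chars` threshold are hypothetical and would need tuning on real data.

```python
from html.parser import HTMLParser


class _TextAndScriptCounter(HTMLParser):
    """Count <script> tags and visible text characters in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.scripts = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside script/style blocks counts as visible.
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def probably_needs_rendering(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: scripts present but hardly any visible text in the raw HTML."""
    counter = _TextAndScriptCounter()
    counter.feed(html)
    return counter.scripts > 0 and counter.text_chars < min_text_chars
```

Pages flagged this way could be fetched again with an external rendering tool, and the rendered HTML passed to the extractor as an ordinary string.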
It exposes numerous extraction options and parameters, so tuning them for optimal results can be complex and time-consuming.
Requires a Python environment and familiarity with its ecosystem, which may deter teams using other languages or seeking plug-and-play solutions.