A Go web scraping framework that extracts structured data from websites using CSS selectors, including JavaScript-rendered pages.
Dataflow kit is a web scraping framework for Go that extracts structured data from websites using CSS selectors. It solves the problem of programmatically collecting data from both static and dynamic JavaScript-driven web pages, providing a full pipeline from fetching to parsing and encoding.
It is aimed at Go developers and data engineers who need to build scalable, efficient web scrapers for data mining, processing, or archiving tasks.
Developers choose Dataflow kit for its speed, ability to handle JavaScript-rendered content, and modular design that supports large-scale scraping with flexible storage and output options.
Extracts structured data from websites through web scraping.
Uses a Chrome fetcher to scrape dynamic, JavaScript-generated pages, as demonstrated by the project's example that scrapes a persons page, which keeps it compatible with modern web apps.
Modular design handles large volumes efficiently, with reported tests parsing 4 million pages in about 7 hours (roughly 160 pages per second on average), making it suitable for data mining and archiving.
Supports Diskv or MongoDB for intermediate data and encodes to CSV, Excel, JSON, JSON Lines, or XML, providing versatility for different workflows.
Handles websites behind login forms by managing cookies and sessions, and processes both paginated and infinite-scroll pages.
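To make the output options above more concrete, here is a minimal sketch of encoding the same scraped records as JSON Lines and as CSV, using only the Go standard library. It does not use Dataflow kit's own encoders; the `Record` type, its field names, and the `encodeJSONLines` and `encodeCSV` helpers are hypothetical and exist only for illustration.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"encoding/json"
	"fmt"
)

// Record is a hypothetical scraped item; the fields are illustrative only
// and are not taken from Dataflow kit.
type Record struct {
	Name  string `json:"name"`
	Price string `json:"price"`
}

// encodeJSONLines writes one JSON object per line.
func encodeJSONLines(records []Record) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends a newline after each value
	for _, r := range records {
		if err := enc.Encode(r); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

// encodeCSV writes a header row followed by one row per record.
func encodeCSV(records []Record) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write([]string{"name", "price"}); err != nil {
		return "", err
	}
	for _, r := range records {
		if err := w.Write([]string{r.Name, r.Price}); err != nil {
			return "", err
		}
	}
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	records := []Record{
		{Name: "Widget", Price: "9.99"},
		{Name: "Gadget, large", Price: "19.50"},
	}
	jsonLines, err := encodeJSONLines(records)
	if err != nil {
		panic(err)
	}
	csvText, err := encodeCSV(records)
	if err != nil {
		panic(err)
	}
	fmt.Print(jsonLines)
	fmt.Print(csvText)
}
```

JSON Lines suits streaming large result sets because each record is a complete line, while CSV needs quoting for values that contain commas, which `encoding/csv` handles automatically.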
Requires Docker and running multiple services (fetch.d and parse.d), adding infrastructure overhead compared to single-binary scraping tools.
The Chrome fetcher needed for dynamic content is slower than the Base fetcher, as the README acknowledges, so JavaScript-heavy sites are scraped more slowly.
Scraping rules rely on detailed JSON files with CSS selectors and extractors, which can be cumbersome for simple or rapid prototyping tasks.
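To give a feel for what a JSON rule file with CSS selectors and extractors involves, the sketch below parses and validates a simplified rule in Go. The schema is an assumption made for illustration and is not Dataflow kit's actual payload format: the `ScrapeRule` and `FieldRule` types, their JSON keys, and the `parseRule` helper are all hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// FieldRule is a hypothetical description of one extracted field.
// It is NOT Dataflow kit's real schema.
type FieldRule struct {
	Name      string `json:"name"`      // output column or key
	Selector  string `json:"selector"`  // CSS selector locating the element
	Extractor string `json:"extractor"` // e.g. "text" or "href" (assumed values)
}

// ScrapeRule is a hypothetical top-level rule: one URL, several fields.
type ScrapeRule struct {
	URL    string      `json:"url"`
	Format string      `json:"format"`
	Fields []FieldRule `json:"fields"`
}

// parseRule decodes a rule and rejects obviously incomplete ones.
func parseRule(data []byte) (ScrapeRule, error) {
	var rule ScrapeRule
	if err := json.Unmarshal(data, &rule); err != nil {
		return ScrapeRule{}, err
	}
	if rule.URL == "" {
		return ScrapeRule{}, errors.New("rule has no url")
	}
	if len(rule.Fields) == 0 {
		return ScrapeRule{}, errors.New("rule has no fields")
	}
	for i, f := range rule.Fields {
		if f.Name == "" || f.Selector == "" {
			return ScrapeRule{}, fmt.Errorf("field %d needs a name and a selector", i)
		}
	}
	return rule, nil
}

func main() {
	raw := []byte(`{
		"url": "https://example.com/products",
		"format": "json",
		"fields": [
			{"name": "title", "selector": "h1.title", "extractor": "text"},
			{"name": "link", "selector": "a.more", "extractor": "href"}
		]
	}`)
	rule, err := parseRule(raw)
	if err != nil {
		panic(err)
	}
	for _, f := range rule.Fields {
		fmt.Printf("%s <- %s (%s)\n", f.Name, f.Selector, f.Extractor)
	}
}
```

Even in this reduced form, every field needs a name, a selector, and an extractor, which shows why declarative rule files can feel heavy for a quick one-off scrape.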