A Node.js library to automatically scrape and extract readable article content from any web page, supporting both English and Chinese.
read-art is a Node.js library that automatically scrapes and extracts the main article content from any web page, making it readable and structured. It solves the problem of parsing messy HTML to retrieve clean titles and body text, which is essential for web crawlers and content aggregation systems. The tool is optimized for performance and supports both English and Chinese websites.
read-art is aimed at developers building web crawlers, content aggregators, or data processing pipelines that need automated extraction of article content from diverse websites.
Developers choose read-art for its high performance in large‑scale crawling scenarios, its built‑in support for Chinese and English content, and its customizable extraction rules that allow fine‑tuning for improved accuracy.
Scrape or crawl articles from any site automatically. Make any web page readable, whether it is in Chinese or English.
Capable of processing up to about 1.5 million documents per day, at an average speed of roughly 1,000 pages per minute, according to the project's reported real-world spider benchmarks.
Effectively handles both English and Chinese websites, accommodating different text structures without additional plugins or setup.
Allows fine-tuning via score rules and selectors to improve accuracy for specific sites, addressing the 10% of cases where default extraction fails.
Includes mechanisms to preserve images within articles, ensuring media content is retained during extraction when possible.
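The two throughput figures quoted above (pages per minute and documents per day) can be checked against each other with simple arithmetic. The snippet below is only a sanity check; the variable names are illustrative and do not come from read-art.

```javascript
// Sanity check: does ~1,000 pages per minute match ~1.5 million documents per day?
// All names here are illustrative, not part of read-art.
const pagesPerMinute = 1000;
const minutesPerDay = 60 * 24; // 1,440 minutes in a day
const pagesPerDay = pagesPerMinute * minutesPerDay;

console.log(pagesPerDay); // 1440000, i.e. close to the quoted 1.5 million per day
```

Sustained over a full day, the per-minute average gives about 1.44 million pages, so the two figures are consistent if the daily number is read as a rounded upper bound.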
Default accuracy is only around 90%, and improving it for specific sites necessitates time-consuming configuration of custom score rules or selectors.
Key features and guides are hosted on a separate wiki, which makes them harder to find than documentation kept inline in the README.
Limited to Node.js environments, so it cannot be used in browser-side applications or with other server-side languages without significant adaptation.
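To illustrate the general idea behind custom score rules mentioned above, here is a minimal, self-contained sketch of score-based content selection. It is not read-art's implementation or API: `baseScore`, `pickMainBlock`, and the plain-object "blocks" are hypothetical stand-ins for DOM nodes, and the heuristic (text length plus English and Chinese punctuation counts) is an assumption chosen for brevity.

```javascript
// Hypothetical sketch of score-based extraction; none of these names come from read-art.
// Each "block" is a plain object standing in for a candidate DOM node.

// Default heuristic: longer text and more sentence punctuation
// (English or Chinese) suggest the block is article body.
function baseScore(block) {
  const punctuation = (block.text.match(/[,.!?，。！？]/g) || []).length;
  return Math.floor(block.text.length / 100) + punctuation;
}

// Pick the highest-scoring block. An optional scoreRule can add a
// site-specific bonus or penalty; returning nothing counts as 0.
function pickMainBlock(blocks, scoreRule) {
  let best = null;
  let bestScore = -Infinity;
  for (const block of blocks) {
    const adjustment = scoreRule ? (scoreRule(block) || 0) : 0;
    const score = baseScore(block) + adjustment;
    if (score > bestScore) {
      best = block;
      bestScore = score;
    }
  }
  return best;
}

const blocks = [
  { className: 'sidebar', text: 'Related links, more links, even more links, tags, archives, and so on.' },
  { className: 'post-body', text: 'Short article body.' },
];

// The default heuristic is fooled by the comma-heavy sidebar.
const byDefault = pickMainBlock(blocks);

// A site-specific rule boosts the known article container.
const withRule = pickMainBlock(blocks, (b) => (b.className === 'post-body' ? 300 : 0));

console.log(byDefault.className); // sidebar
console.log(withRule.className);  // post-body
```

The example shows why per-site tuning can be worthwhile: a generic heuristic misjudges an unusual layout, and a one-line rule corrects it, at the cost of having to write and maintain such rules for each problematic site.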