Write web scrapers in Ruby using a clean, AI-assisted DSL that caches selectors for fast, LLM-free extraction.
Kimurai is a Ruby web scraping framework that uses AI to automatically generate and cache selectors for data extraction. It combines traditional scraping capabilities with LLM-powered intelligence, allowing developers to describe what data they want rather than writing complex XPath/CSS selectors manually. The framework supports multiple browsers and provides a clean DSL for building robust, maintainable scrapers.
Ruby developers who need to build web scrapers for data collection, particularly those working with JavaScript-rendered websites or seeking to reduce selector maintenance overhead. It's ideal for data engineers, researchers, and developers building data pipelines.
Kimurai uniquely combines AI-powered selector generation with traditional scraping tools, offering the intelligence of LLMs without the per-request costs. Its caching mechanism means you get AI accuracy during development but pure Ruby performance in production, making it both powerful and cost-effective.
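The cache-then-extract flow described above can be sketched in plain Ruby. Everything below, class and method names included, illustrates the general pattern, not Kimurai's actual API:

```ruby
# "AI once, pure Ruby afterwards": an expensive selector-generation
# step (the block, standing in for an LLM call) runs only on a cache
# miss; every later extraction reuses the cached selector for free.
class SelectorCache
  attr_reader :llm_calls

  def initialize(&generator)
    @generator = generator
    @cache = {}
    @llm_calls = 0
  end

  def selector_for(field)
    @cache[field] ||= begin
      @llm_calls += 1
      @generator.call(field)   # pretend this call costs tokens
    end
  end
end

cache = SelectorCache.new { |field| "//span[@class='#{field}']" }

3.times { cache.selector_for("price") }
puts cache.selector_for("price")  # => //span[@class='price']
puts cache.llm_calls              # => 1
```

In a real scraper the cache would be persisted to disk, so redeployed production runs never touch the LLM at all.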
Automatically generates and caches XPath/CSS selectors with an LLM based on your data schema, eliminating manual selector writing and maintenance.
Supports headless Chrome, Firefox, and Mechanize engines, allowing adaptation to both JavaScript-heavy and static websites without code changes.
Integrates Capybara for full browser control, enabling complex interactions like form submissions, clicks, and scrolling for dynamic content.
Includes thread-safe parallel crawling with the in_parallel method for high-performance data extraction from multiple pages simultaneously.
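Engine swapping of the kind described above typically comes down to a config-driven lookup, so spider code never names a concrete browser. A minimal plain-Ruby sketch (the engine symbols mirror Kimurai's engine names; the lambdas are stand-ins for real drivers):

```ruby
# Engine selection as a config-driven lookup: callers use fetch(url)
# and switching from a static fetcher to a headless browser is a
# one-symbol change. The lambdas stand in for real driver sessions.
ENGINES = {
  selenium_chrome:  ->(url) { "chrome rendered #{url}" },
  selenium_firefox: ->(url) { "firefox rendered #{url}" },
  mechanize:        ->(url) { "mechanize fetched #{url}" }
}.freeze

def fetch(url, engine: :mechanize)
  ENGINES.fetch(engine).call(url)
end

puts fetch("https://example.com")
# => mechanize fetched https://example.com
puts fetch("https://example.com", engine: :selenium_chrome)
# => chrome rendered https://example.com
```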
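The parallel-crawling pattern behind a method like in_parallel can be sketched with plain Ruby threads, a work queue, and a Mutex-guarded result list. This is a generic illustration under those assumptions, not Kimurai's implementation (which also gives each thread its own browser session):

```ruby
# Thread-per-worker parallel crawl: workers pop URLs from a shared
# Queue until it is empty, and append results under a Mutex so the
# list stays consistent. The string build stands in for fetch/parse.
def crawl(urls, threads: 3)
  queue = Queue.new
  urls.each { |u| queue << u }
  results = []
  mutex = Mutex.new

  workers = threads.times.map do
    Thread.new do
      loop do
        url = begin
          queue.pop(true)       # non-blocking pop; raises when empty
        rescue ThreadError
          break                 # queue drained, worker exits
        end
        page = "scraped:#{url}" # stand-in for a real fetch and parse
        mutex.synchronize { results << page }
      end
    end
  end
  workers.each(&:join)
  results
end

pages = crawl(%w[/a /b /c /d], threads: 2)
puts pages.sort.inspect
# => ["scraped:/a", "scraped:/b", "scraped:/c", "scraped:/d"]
```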
Requires Ruby >= 3.2.0, browser installations, and driver dependencies such as Selenium, making onboarding more involved than with lightweight scrapers.
Initial AI extraction depends on external LLM APIs (e.g., OpenAI, Gemini) with token costs and key management, adding complexity and potential expenses.
As a Ruby framework, it may not integrate well with projects in other languages, and the scraping ecosystem is smaller compared to Python alternatives like Scrapy.
Mechanize is a Ruby library that makes automated web interaction easy.
A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
Ruby gem for web scraping purposes. It scrapes a given URL and returns its title, meta description, meta keywords, links, images...