Frequently Asked Questions

How do I handle JavaScript-rendered pages with Crawly?

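Crawly can render JavaScript-heavy pages through an external service such as Splash or headless Chrome. Select the rendering fetcher with the fetcher setting in the Crawly configuration (it is configured separately from the middleware list), and refer to the browser rendering documentation for setup details and examples.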
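As a rough sketch of what that configuration can look like, the fragment below points Crawly at a Splash instance. The module name Crawly.Fetchers.Splash and the base_url option are given from memory of the Crawly documentation, and the local address is an assumption (Splash's usual default port), so check both against the browser rendering guide for the version in use.

```elixir
# config/config.exs (assumed location)
import Config

config :crawly,
  # Assumption: a Splash instance is already running locally on port 8050.
  # Requests are sent to Splash, which returns the rendered HTML.
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
```

The spider code itself does not change; only the way pages are fetched does.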

Crawly vs. Scrapy: which is better for Elixir projects?

Crawly is the natural choice for Elixir teams: it is written in Elixir, so spiders live in the same codebase as the rest of the application and rely on BEAM processes for concurrency. Scrapy, written in Python, has the larger ecosystem. For projects already in Elixir, Crawly avoids adding a second language and toolchain.

How do I deploy Crawly spiders with Docker?

Use the standalone Docker mode by mounting spider YAML files or modules into the container. The standalone documentation provides steps for running crawls without a full Elixir project setup.

How do I configure rate limiting in Crawly?

Set concurrent_requests_per_domain in the Crawly config. It caps how many requests run against a single domain at the same time, which is the main way to avoid overloading a server. For example, config :crawly, concurrent_requests_per_domain: 8 allows at most 8 simultaneous requests per domain.
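Written out as a config file, the setting might look like the sketch below. The file path is the usual Elixir convention and is an assumption here; the value 8 is the one used in the answer above.

```elixir
# config/config.exs (assumed location; any config file the project loads works)
import Config

config :crawly,
  # Upper bound on simultaneous requests to one domain.
  concurrent_requests_per_domain: 8
```

Lowering the value makes the crawl gentler on the target site at the cost of throughput.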

Can Crawly handle authentication for scraping?

Yes, through custom middlewares. The README references an article on extracting data behind authentication, which guides implementing login flows or session handling in spiders.
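One simple form of session handling is a middleware that attaches an existing session cookie to every outgoing request. The sketch below is illustrative only and is not necessarily the approach the referenced article takes: the module name MyApp.Middlewares.SessionCookie and the :cookie option are hypothetical, and it assumes the request carries its headers as a list of {name, value} tuples, as Crawly's request struct does.

```elixir
defmodule MyApp.Middlewares.SessionCookie do
  @moduledoc """
  Hypothetical middleware: sets a fixed Cookie header on every request.
  """
  # Crawly middlewares implement the same behaviour as pipelines.
  @behaviour Crawly.Pipeline

  @impl true
  def run(request, state, opts \\ []) do
    # The cookie value is passed in from the middleware list in the config.
    cookie = Keyword.fetch!(opts, :cookie)

    # Drop any Cookie header already present, then add ours.
    headers = List.keydelete(request.headers, "Cookie", 0)

    {%{request | headers: [{"Cookie", cookie} | headers]}, state}
  end
end
```

It would be enabled by adding a tuple such as {MyApp.Middlewares.SessionCookie, cookie: "session=..."} to the middlewares list. A real login flow would obtain the cookie by posting credentials first, which this sketch leaves out.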

How do I save scraped data to a database in Crawly?

Write a custom pipeline: a module that receives each scraped item and inserts it into a database such as PostgreSQL. Pipelines listed in the pipelines section of the config run in order, so the storage step can be placed after validation and filtering.
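A minimal sketch of such a pipeline follows. It assumes the project already has an Ecto repo and a table whose columns match the item's keys; the module name MyApp.Pipelines.StoreItem, the :repo and :table options, and the default table name are all hypothetical. The only Crawly-specific part is the run/3 contract: return {item, state} to pass the item on, or {false, state} to drop it.

```elixir
defmodule MyApp.Pipelines.StoreItem do
  @moduledoc """
  Hypothetical pipeline: inserts each item into a database table via Ecto.
  """
  @behaviour Crawly.Pipeline
  require Logger

  @impl true
  def run(item, state, opts \\ []) do
    # The repo module (e.g. an Ecto repo) is supplied in the pipeline config.
    repo = Keyword.fetch!(opts, :repo)
    table = Keyword.get(opts, :table, "scraped_items")

    # Schemaless insert: the item's keys must match the table's columns.
    case repo.insert_all(table, [item]) do
      {1, _} ->
        # Stored; hand the item to the next pipeline.
        {item, state}

      other ->
        Logger.warning("Item not stored: #{inspect(other)}")
        # Drop the item so later pipelines do not see it.
        {false, state}
    end
  end
end
```

In the config it would appear as {MyApp.Pipelines.StoreItem, repo: MyApp.Repo} in the pipelines list, placed before any step that encodes the item to a string (for example a JSON encoder), since the insert expects a map.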

Where can I find Crawly spider examples for e-commerce sites?

Check the example projects on GitHub, such as products-advisor for e-commerce. These demonstrate parsing product details, handling pagination, and exporting structured data.

Crawly — Elixir Web Crawling Framework

What is Crawly?

Crawly is an application framework for crawling websites and extracting structured data, built with Elixir. It provides a robust, configurable system for building scalable web scrapers to handle data mining, information processing, and historical archival. The framework uses a spider-based architecture with middleware and pipelines for customization.

Target Audience

Elixir developers who need to build scalable, maintainable web scrapers for data extraction tasks, such as data engineers or backend developers working on data aggregation projects.

Value Proposition

Developers choose Crawly for its high-level abstraction that balances power with ease of use, offering features like browser rendering for JavaScript-heavy sites, a management UI for monitoring, and standalone Docker deployment. Its extensible middleware and pipeline system allows fine-tuned control over crawling behavior.

Crawly, a high-level web crawling & scraping framework for Elixir.

Use Cases

Best For

Building scalable web scrapers in Elixir for data mining and information processing.
Extracting structured data from dynamic websites that require JavaScript rendering.
Creating standalone crawling applications deployable via Docker with spiders defined in YAML or modules.
Monitoring and managing web scraping jobs through a built-in web interface for starting, stopping, and viewing items.
Implementing concurrent, rate-limited crawls with configurable request handling per domain.
Rapid spider development using mix tasks for code generation and configuration templates.

Not Ideal For

Projects requiring integration with non-Elixir ecosystems or languages for downstream data processing pipelines.
Quick, one-off scraping tasks where the overhead of setting up an Elixir project and framework is unjustified.
Teams needing visual, point-and-click scraping tools so that non-developers can configure crawls without code.
High-scale distributed scraping across multiple geolocations, since Crawly has no built-in proxy rotation or advanced anti-bot bypass features.

Pros & Cons

Pros

Spider-based Architecture

Uses a familiar callback model similar to Scrapy, making it intuitive for Elixir developers to define crawls with URL generation and parsing logic, as shown in the quickstart example.
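To make the callback model concrete, here is a sketch of a small spider. The three callbacks (base_url, init, parse_item) are the Crawly.Spider behaviour; the module name, the target site, and the CSS selectors are illustrative assumptions, and the example assumes Floki is available for HTML parsing.

```elixir
defmodule BooksSpider do
  use Crawly.Spider

  # Requests outside this base URL can be filtered by the DomainFilter middleware.
  @impl Crawly.Spider
  def base_url, do: "https://books.toscrape.com/"

  # URL generation: where the crawl starts.
  @impl Crawly.Spider
  def init, do: [start_urls: ["https://books.toscrape.com/"]]

  # Parsing logic: turn a response into items plus follow-up requests.
  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    items =
      document
      |> Floki.find(".product_pod")
      |> Enum.map(fn product ->
        %{
          title: product |> Floki.find("h3 a") |> Floki.attribute("title") |> List.first(),
          price: product |> Floki.find(".price_color") |> Floki.text(),
          url: response.request_url
        }
      end)

    # Pagination: follow the "next" link, resolved against the current page.
    requests =
      document
      |> Floki.find(".next a")
      |> Floki.attribute("href")
      |> Enum.map(fn href ->
        href
        |> Crawly.Utils.build_absolute_url(response.request_url)
        |> Crawly.Utils.request_from_url()
      end)

    %Crawly.ParsedItem{items: items, requests: requests}
  end
end
```

With the application running, such a spider would be started with Crawly.Engine.start_spider(BooksSpider).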

Browser Rendering Support

Configurable to fetch pages with JavaScript rendering via tools like Splash or Chrome, essential for scraping dynamic content from modern websites, as documented in the browser rendering guide.

Extensible Middleware and Pipelines

Offers pluggable components for customizing request handling and item processing, demonstrated in config examples with DomainFilter, UniqueRequest, and WriteToFile pipelines.
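A configuration along these lines shows how the pieces chain together. DomainFilter, UniqueRequest, and WriteToFile are the components named above; the other modules (UserAgent, Validate, DuplicatesFilter, JSONEncoder) and all option values are recalled from the Crawly documentation and chosen for illustration, so treat them as assumptions to verify.

```elixir
# config/config.exs (assumed location)
import Config

config :crawly,
  # Middlewares run on every outgoing request, in order.
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
  ],
  # Pipelines run on every scraped item, in order.
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url, :title]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]
```

Order matters: an item dropped by validation or duplicate filtering never reaches the encoder or the file writer.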

Standalone Docker Deployment

Enables running spiders via Docker with YAML or module definitions, simplifying deployment without full Elixir project setup, as covered in the standalone documentation.

Simple Management UI

Provides a built-in web interface on localhost:4001 for starting/stopping spiders and viewing items, with options to disable or integrate as a plug in existing apps.

Cons

Basic Management UI

The default UI is minimalistic, and the more advanced Phoenix-based UI (CrawlyUI) is deprecated, limiting out-of-the-box monitoring and development features for complex workflows.

Complex Browser Rendering Setup

Enabling JavaScript rendering requires external services like Splash or Chrome, adding deployment and maintenance overhead beyond the core framework.

Elixir Ecosystem Dependency

Tightly coupled to Elixir and BEAM, making it less suitable for teams not already using this stack or needing interoperability with other language ecosystems.

Evolving API with Breaking Changes

As a version 0.x project, frequent updates like those in 0.15.0 may introduce breaking changes, requiring ongoing maintenance for production deployments.
