Configure the web crawler for automated content extraction and embedding
The OpenRails web crawler uses browser automation to navigate and extract content from websites. Crawled content can be automatically embedded into data lakes for RAG retrieval. The crawler supports configurable depth, concurrency limits, and URL filtering.
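The depth-limited crawl described above can be sketched as a breadth-first traversal with a visited set. This is a simplified illustration, not OpenRails' actual implementation: `get_links` stands in for the browser-automation fetch step, and concurrency is omitted for clarity.

```python
from collections import deque

def crawl(start_url, get_links, max_depth=2):
    """Breadth-first crawl up to max_depth, visiting each URL once.

    get_links is a stand-in for the browser-automation step: given a
    URL, it returns the URLs linked from that page.
    """
    visited = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: extract this page but follow no links
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Pages at the depth limit are still extracted; only their outgoing links are ignored, which is why deeper pages show up under Pages Skipped rather than Errors.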
From the sidebar, go to Tools > Web Crawlers.
Click New Crawler to open the configuration form.
Fill in the basic configuration:
Set the crawl parameters:
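The exact form fields depend on your OpenRails version. As a rough sketch, the configuration covers a name, a start URL, a target data lake, and the limits described above; all field names below are illustrative, not the product's actual schema.

```python
# Illustrative crawler configuration (hypothetical field names)
crawler_config = {
    # Basic configuration
    "name": "docs-crawler",
    "start_url": "https://docs.example.com/",
    "data_lake": "product-docs",   # target for auto-embedded content
    # Crawl parameters
    "max_depth": 3,                # how many links deep to follow
    "max_concurrency": 5,          # parallel browser sessions
    "include_patterns": ["https://docs.example.com/*"],
    "exclude_patterns": ["*/login*", "*/admin/*"],
}
```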
Toggle Auto-Embed to automatically process crawled content through the ingestion pipeline. Content is chunked, embedded, and indexed in the target data lake.
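The chunking step of the ingestion pipeline can be illustrated with a minimal fixed-size, overlapping splitter. This is a sketch only; OpenRails' actual chunking strategy and parameters may differ.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks for embedding.

    Overlap preserves context across chunk boundaries, so a sentence
    cut in two remains retrievable from either side.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Each chunk would then be embedded and indexed in the target data lake.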
Click Save to store the configuration. Click Run Now to start the crawler immediately, or set up a schedule.
Track crawl progress from the crawler's run history:
| Metric | Description |
|---|---|
| Pages Crawled | Total pages successfully extracted |
| Pages Skipped | Pages excluded by URL patterns or depth limits |
| Errors | Pages that failed to load or extract (404s, timeouts, etc.) |
| Duration | Total crawl time |
| Content Size | Total extracted text size |
Use include and exclude patterns to control crawl scope:
| Pattern type | Example |
|---|---|
| Include | `https://docs.example.com/*` |
| Exclude | `*/login*`, `*/admin/*` |

Patterns support wildcards (`*`) for flexible matching.
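The include/exclude filtering can be sketched with Python's `fnmatch`, using the example patterns above. This is a hypothetical filter for illustration; the crawler's actual matcher may differ in detail.

```python
from fnmatch import fnmatch

def url_allowed(url, include_patterns, exclude_patterns):
    """Return True if url matches an include pattern and no exclude pattern.

    fnmatch's '*' matches any run of characters, including '/'.
    """
    if not any(fnmatch(url, p) for p in include_patterns):
        return False
    return not any(fnmatch(url, p) for p in exclude_patterns)
```

A URL must pass both checks: it is crawled only if it matches at least one include pattern and no exclude pattern.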
The crawler respects robots.txt policies.
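A robots.txt check can be sketched with Python's standard-library `urllib.robotparser`. In practice the crawler fetches robots.txt from the target site; here an example policy is parsed directly for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt policy disallowing the admin area for all agents.
robots_txt = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("*", "https://docs.example.com/guide")
blocked = parser.can_fetch("*", "https://docs.example.com/admin/users")
```

Pages disallowed by robots.txt are not fetched and are not counted as errors.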