Web Crawler Setup

Configure the web crawler for automated content extraction and embedding

Overview

The OpenRails web crawler uses browser automation to navigate and extract content from websites. Crawled content can be automatically embedded into data lakes for RAG retrieval. The crawler supports configurable depth, concurrency limits, and URL filtering.

Configure a Web Crawler

Navigate to Web Crawlers

From the sidebar, go to Tools > Web Crawlers.

Click "New Crawler"

Click New Crawler to open the configuration form.

Enter Crawler Details

Fill in the basic configuration:

  • Name — Descriptive name for the crawler
  • Start URL — The URL where crawling begins
  • Target Data Lake — Where crawled content will be stored

Configure Crawl Settings

Set the crawl parameters:

  • Max Depth — How many link levels deep to crawl from the start URL
  • Max Pages — Maximum number of pages to crawl in a single run
  • Concurrency — Number of pages to fetch in parallel within a single run (separate from the system-wide limit of 10 concurrent crawler runs)
  • URL Patterns — Include/exclude patterns to control which URLs are crawled
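The crawl settings above interact in a predictable way: Max Depth bounds how far the crawler follows links from the start URL, and Max Pages caps the total regardless of depth. OpenRails does not document its internal crawl algorithm, so the following is a minimal sketch of how these limits typically combine in a breadth-first crawl. The `SITE` dictionary stands in for real network fetches and is purely illustrative.

```python
from collections import deque

# Hypothetical in-memory site: maps each URL to the links found on its page.
SITE = {
    "https://example.com/": ["https://example.com/docs", "https://example.com/blog"],
    "https://example.com/docs": ["https://example.com/docs/api"],
    "https://example.com/blog": [],
    "https://example.com/docs/api": [],
}

def crawl(start_url, max_depth=2, max_pages=100):
    """Breadth-first crawl bounded by link depth and total page count."""
    visited = set()
    queue = deque([(start_url, 0)])  # (url, depth from the start URL)
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue  # already crawled, or beyond Max Depth
        visited.add(url)
        for link in SITE.get(url, []):
            queue.append((link, depth + 1))
    return visited

# With max_depth=1, the crawler reaches /docs and /blog but not /docs/api.
pages = crawl("https://example.com/", max_depth=1)
```

This is why the tip below recommends starting with a low depth: each extra level can multiply the number of pages reached before Max Pages cuts the run off.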

Enable Auto-Embed

Toggle Auto-Embed to automatically process crawled content through the ingestion pipeline. Content is chunked, embedded, and indexed in the target data lake.
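The exact chunking strategy used by the ingestion pipeline is not specified here; as an illustration, the sketch below shows one common approach, fixed-size chunks with a small overlap so that sentences spanning a boundary appear in both neighboring chunks. The chunk size and overlap values are assumptions, not documented defaults.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split extracted page text into overlapping chunks for embedding."""
    chunks = []
    step = chunk_size - overlap  # advance less than chunk_size to create overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# A 1,000-character page yields three chunks: 500 + 500 + 100 characters,
# with each boundary covered twice by the 50-character overlap.
chunks = chunk_text("a" * 1000)
```

Each chunk would then be embedded and indexed in the target data lake for RAG retrieval.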

Save and Run

Click Save to store the configuration. Click Run Now to start the crawler immediately, or set up a schedule.

Monitoring Crawl Runs

Track crawl progress from the crawler's run history:

  • Pages Crawled — Total pages successfully extracted
  • Pages Skipped — Pages excluded by URL patterns or depth limits
  • Errors — Pages that failed to load or extract (404s, timeouts, etc.)
  • Duration — Total crawl time
  • Content Size — Total extracted text size
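To make the metrics concrete, here is a small sketch of how per-page results might roll up into a run summary. The result records and status names are hypothetical, not the crawler's actual internal representation.

```python
from collections import Counter

# Hypothetical per-page results from a single crawl run.
page_results = [
    {"url": "https://example.com/", "status": "crawled", "bytes": 4200},
    {"url": "https://example.com/old", "status": "error"},      # 404 or timeout
    {"url": "https://example.com/admin", "status": "skipped"},  # excluded by pattern
    {"url": "https://example.com/docs", "status": "crawled", "bytes": 9100},
]

counts = Counter(r["status"] for r in page_results)          # Pages Crawled/Skipped/Errors
content_size = sum(r.get("bytes", 0) for r in page_results)  # Content Size

print(counts["crawled"], counts["skipped"], counts["error"], content_size)
# → 2 1 1 13300
```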

URL Pattern Filtering

Use include and exclude patterns to control crawl scope. Patterns support wildcards (*) for flexible matching.
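The exact matching semantics are not documented here; a reasonable sketch, assuming include patterns are checked before exclude patterns, uses shell-style wildcard matching (Python's `fnmatch`). The example URLs and the include-before-exclude ordering are assumptions.

```python
from fnmatch import fnmatch

def url_allowed(url, include=None, exclude=()):
    """Assumed semantics: a URL must match an include pattern (if any are
    given) and must not match any exclude pattern."""
    if include and not any(fnmatch(url, p) for p in include):
        return False
    return not any(fnmatch(url, p) for p in exclude)

# Scope the crawl to the docs section, but skip anything under /private/.
include = ["https://example.com/docs/*"]
exclude = ["*/private/*"]

url_allowed("https://example.com/docs/intro", include, exclude)      # allowed
url_allowed("https://example.com/blog/post", include, exclude)       # not in include
url_allowed("https://example.com/docs/private/x", include, exclude)  # excluded
```

Narrow include patterns are the most effective way to keep Pages Skipped high and crawl time low on large sites.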

Tip: Start with a low max depth (1-2) and limited page count to test your crawler configuration. Review the results before increasing depth for a full crawl.
Important: The system enforces a maximum of 10 concurrent crawler runs across all projects. Schedule large crawls during off-peak hours to avoid impacting other users. Always respect the target website's robots.txt policies.

Next Steps