Web Crawler Setup

Configure the web crawler for automated content extraction and embedding

Overview

The OpenRails web crawler uses browser automation to navigate and extract content from websites. Crawled content can be automatically embedded into data lakes for RAG retrieval. The crawler supports configurable depth, concurrency limits, and URL filtering.

Configure a Web Crawler

Navigate to Web Crawlers

From the sidebar, go to Tools > Web Crawlers.

Click "New Crawler"

Click New Crawler to open the configuration form.

Enter Crawler Details

Fill in the basic configuration:

  • Name — Descriptive name for the crawler
  • Start URL — The URL where crawling begins
  • Target Data Lake — Where crawled content will be stored

Configure Crawl Settings

Set the crawl parameters:

  • Max Depth — How many link levels deep to crawl from the start URL
  • Max Pages — Maximum number of pages to crawl in a single run
  • Concurrency — Number of pages to fetch in parallel within a single run (separate from the system-wide limit of 10 concurrent crawler runs)
  • URL Patterns — Include/exclude patterns to control which URLs are crawled
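The crawl settings above interact in a predictable way: Max Depth bounds how far the crawler follows links from the start URL, and Max Pages caps the total regardless of depth. OpenRails does not document its internal crawl algorithm, so the following is a minimal sketch of how these limits typically combine in a breadth-first crawl. The `SITE` dictionary stands in for real network fetches and is purely illustrative.

```python
from collections import deque

# Hypothetical in-memory site: maps each URL to the links found on its page.
SITE = {
    "https://example.com/": ["https://example.com/docs", "https://example.com/blog"],
    "https://example.com/docs": ["https://example.com/docs/api"],
    "https://example.com/blog": [],
    "https://example.com/docs/api": [],
}

def crawl(start_url, max_depth=2, max_pages=100):
    """Breadth-first crawl bounded by link depth and total page count."""
    visited = set()
    queue = deque([(start_url, 0)])  # (url, depth from the start URL)
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue  # already crawled, or beyond Max Depth
        visited.add(url)
        for link in SITE.get(url, []):
            queue.append((link, depth + 1))
    return visited

# With max_depth=1, the crawler reaches /docs and /blog but not /docs/api.
pages = crawl("https://example.com/", max_depth=1)
```

This is why the tip below recommends starting with a low depth: each extra level can multiply the number of pages reached before Max Pages cuts the run off.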

Enable Auto-Embed

Toggle Auto-Embed to automatically process crawled content through the ingestion pipeline. Content is chunked, embedded, and indexed in the target data lake.
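The exact chunking strategy used by the ingestion pipeline is not specified here; as an illustration, the sketch below shows one common approach, fixed-size chunks with a small overlap so that sentences spanning a boundary appear in both neighboring chunks. The chunk size and overlap values are assumptions, not documented defaults.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split extracted page text into overlapping chunks for embedding."""
    chunks = []
    step = chunk_size - overlap  # advance less than chunk_size to create overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# A 1,000-character page yields three chunks: 500 + 500 + 100 characters,
# with each boundary covered twice by the 50-character overlap.
chunks = chunk_text("a" * 1000)
```

Each chunk would then be embedded and indexed in the target data lake for RAG retrieval.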

Save and Run

Click Save to store the configuration. Click Run Now to start the crawler immediately, or set up a schedule.

Monitoring Crawl Runs

Track crawl progress from the crawler's run history:

  • Pages Crawled — Total pages successfully extracted
  • Pages Skipped — Pages excluded by URL patterns or depth limits
  • Errors — Pages that failed to load or extract (404s, timeouts, etc.)
  • Duration — Total crawl time
  • Content Size — Total extracted text size
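To make the metrics concrete, here is a small sketch of how per-page results might roll up into a run summary. The result records and status names are hypothetical, not the crawler's actual internal representation.

```python
from collections import Counter

# Hypothetical per-page results from a single crawl run.
page_results = [
    {"url": "https://example.com/", "status": "crawled", "bytes": 4200},
    {"url": "https://example.com/old", "status": "error"},      # 404 or timeout
    {"url": "https://example.com/admin", "status": "skipped"},  # excluded by pattern
    {"url": "https://example.com/docs", "status": "crawled", "bytes": 9100},
]

counts = Counter(r["status"] for r in page_results)          # Pages Crawled/Skipped/Errors
content_size = sum(r.get("bytes", 0) for r in page_results)  # Content Size

print(counts["crawled"], counts["skipped"], counts["error"], content_size)
# → 2 1 1 13300
```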

URL Pattern Filtering

Use include and exclude patterns to control crawl scope. Patterns support wildcards (*) for flexible matching.
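The exact matching semantics are not documented here; a reasonable sketch, assuming include patterns are checked before exclude patterns, uses shell-style wildcard matching (Python's `fnmatch`). The example URLs and the include-before-exclude ordering are assumptions.

```python
from fnmatch import fnmatch

def url_allowed(url, include=None, exclude=()):
    """Assumed semantics: a URL must match an include pattern (if any are
    given) and must not match any exclude pattern."""
    if include and not any(fnmatch(url, p) for p in include):
        return False
    return not any(fnmatch(url, p) for p in exclude)

# Scope the crawl to the docs section, but skip anything under /private/.
include = ["https://example.com/docs/*"]
exclude = ["*/private/*"]

url_allowed("https://example.com/docs/intro", include, exclude)      # allowed
url_allowed("https://example.com/blog/post", include, exclude)       # not in include
url_allowed("https://example.com/docs/private/x", include, exclude)  # excluded
```

Narrow include patterns are the most effective way to keep Pages Skipped high and crawl time low on large sites.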

Tip: Start with a low max depth (1-2) and limited page count to test your crawler configuration. Review the results before increasing depth for a full crawl.
Important: The system enforces a maximum of 10 concurrent crawler runs across all projects. Schedule large crawls during off-peak hours to avoid impacting other users. Always respect the target website's robots.txt policies.

Next Steps