Web Crawler

Browser automation-powered web content extraction, configurable depth and concurrency, and auto-embedding


Overview

The OpenRails Web Crawler extends the document ingestion pipeline to the web. Using headless browser automation, it can crawl and extract content from JavaScript-rendered websites, single-page applications, and dynamic content that traditional HTTP scrapers miss. Crawled content is automatically processed through the same ingestion pipeline as uploaded documents — chunked, embedded, and indexed for RAG retrieval.

Key Value: Keep your AI knowledge base current with web content. Crawl competitor sites, documentation portals, regulatory updates, or any public information source — and make it instantly searchable and queryable through AI chat.


Key Capabilities

Headless Browser Automation

Full headless browser rendering means the crawler sees exactly what a user sees. JavaScript-rendered content, SPAs, lazy-loaded elements, and dynamically injected text are all captured accurately.

Configurable Depth & Concurrency

Set maximum crawl depth (how many links deep to follow), page limits, and concurrency (how many pages to process simultaneously). Balance thoroughness with resource usage and politeness.
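The interplay of these limits can be sketched as a breadth-first traversal. The sketch below is illustrative, not the product's implementation: `CrawlConfig`, its field names, and the `fetch_links` callback are all assumptions standing in for the real render-and-extract step.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class CrawlConfig:
    max_depth: int = 2      # how many links deep to follow from the seed
    max_pages: int = 100    # hard cap on pages fetched per job
    concurrency: int = 4    # simultaneous browser pages (politeness knob)

def crawl(seed: str, config: CrawlConfig, fetch_links):
    """Breadth-first crawl honoring depth and page limits.

    `fetch_links(url)` stands in for the browser render + link
    discovery step; it returns the links found on the page.
    """
    queue = deque([(seed, 0)])
    seen = {seed}
    visited = []
    while queue and len(visited) < config.max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= config.max_depth:
            continue  # don't follow links beyond the depth limit
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Breadth-first order means the page limit is spent on pages closest to the seed, which is usually what a depth cap intends. Concurrency is shown only as a config field here; a real crawler would fan the queue out across that many browser instances.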

Content Extraction

Intelligent content extraction strips navigation, headers, footers, ads, and boilerplate to capture the meaningful content of each page. CSS selectors can be configured for targeted extraction from specific page regions.
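One common approach to boilerplate stripping is to skip text inside structural tags like `<nav>` and `<footer>`. This is a minimal stdlib sketch of that idea, not the product's extractor; the `SKIP_TAGS` set is an assumed, illustrative tag list.

```python
from html.parser import HTMLParser

# Assumed boilerplate regions; a real extractor would be far more nuanced
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}

class ContentExtractor(HTMLParser):
    """Collects visible text while skipping boilerplate regions."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 means we're inside a boilerplate region
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

Tracking nesting depth (rather than a boolean) handles boilerplate regions nested inside each other, e.g. a `<nav>` inside a `<header>`.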

Auto-Embed

Crawled content automatically flows through the ingestion pipeline: chunking, embedding generation, and indexing in the vector database and knowledge graph. No manual processing step required.
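The first stage of that pipeline, chunking, typically splits extracted text into overlapping windows so no sentence is cut off from its context at a chunk boundary. A simple character-based sketch, with illustrative (not product) defaults:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50):
    """Split text into overlapping character windows for embedding.

    Each chunk shares `overlap` characters with its predecessor so
    context straddling a boundary appears in both chunks.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Each resulting chunk would then be passed to an embedding model and written to the vector index; those steps depend on the deployment and are omitted here.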

URL Patterns & Filters

Include or exclude URLs based on regex patterns. Restrict crawling to specific subdomains, paths, or content types. Respect robots.txt directives and implement custom crawl delay policies.
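A candidate URL typically has to pass all three gates: match an include pattern, match no exclude pattern, and be permitted by robots.txt. A stdlib sketch of that check, where the function name, argument order, and the `"OpenRailsCrawler"` user-agent string are assumptions for illustration:

```python
import re
from urllib.robotparser import RobotFileParser

def allowed(url: str, include: list, exclude: list, robots: RobotFileParser) -> bool:
    """Apply include/exclude regex patterns, then defer to robots.txt."""
    if include and not any(re.search(p, url) for p in include):
        return False  # doesn't match any include pattern
    if any(re.search(p, url) for p in exclude):
        return False  # matches an exclude pattern
    return robots.can_fetch("OpenRailsCrawler", url)

# robots.txt rules can be parsed from raw lines, no network fetch needed
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
```

Crawl-delay policies would sit alongside this check as a per-host rate limiter; they are omitted for brevity.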

Scheduled Recrawling

Set up recurring crawl jobs to keep web content fresh. Configure recrawl intervals per URL pattern — daily for news sites, weekly for documentation, monthly for static resources.
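Per-pattern intervals like these can be represented as an ordered table where the first matching pattern wins. The table contents and function below are illustrative assumptions, mirroring the daily/weekly/monthly example above:

```python
import re
from datetime import datetime, timedelta

# Illustrative interval table: first matching pattern wins
RECRAWL_INTERVALS = [
    (r"news\.", timedelta(days=1)),    # news sites: daily
    (r"/docs/", timedelta(weeks=1)),   # documentation: weekly
    (r".*",     timedelta(days=30)),   # everything else: monthly
]

def next_crawl(url: str, last_crawled: datetime) -> datetime:
    """Return the next scheduled crawl time for a URL."""
    for pattern, interval in RECRAWL_INTERVALS:
        if re.search(pattern, url):
            return last_crawled + interval
    return last_crawled + timedelta(days=30)
```

Ordering matters: the catch-all `.*` entry must come last, or it would shadow the more specific patterns.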


Configurable Crawling

Every crawl job is fully configurable to match your needs. Control how deep the crawler follows links, how many pages to collect, crawl speed, and whether content is automatically fed into your knowledge base. All settings are adjustable per job from the dashboard.

Hands-off ingestion: Enable auto-embed to have crawled content automatically processed and added to your knowledge base — no manual upload step required.

Crawl Pipeline

Seed URL → Browser Render → Content Extraction → Link Discovery (filter + dedupe) → Queue New URLs → Ingestion Pipeline (chunk + embed + index)

Parallel processing across concurrent browser instances | Deduplication prevents re-processing unchanged pages
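One way to detect "unchanged" pages on a recrawl is to compare a hash of the extracted content against the hash stored from the previous visit. This sketch assumes an in-memory store; the class and method names are illustrative, not the product's API.

```python
import hashlib

class Deduplicator:
    """Skip re-ingestion when extracted content is unchanged since last crawl."""

    def __init__(self):
        self._hashes = {}  # url -> sha256 hex digest of last-seen content

    def is_changed(self, url: str, content: str) -> bool:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self._hashes.get(url) == digest:
            return False  # unchanged: skip chunking/embedding/indexing
        self._hashes[url] = digest
        return True
```

Hashing the *extracted* text rather than the raw HTML makes the check robust to cosmetic markup changes (rotating ads, timestamps in footers) that don't affect the indexed content.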


Use Cases


Documentation Indexing

Crawl product documentation sites and make them searchable through AI chat


Competitive Intelligence

Monitor competitor websites for product changes, pricing updates, and content shifts


Regulatory Monitoring

Crawl government and regulatory sites to stay current on policy changes and requirements

Related Feature Sheets