Browser automation-powered web content extraction, configurable depth and concurrency, and auto-embedding
The OpenRails Web Crawler extends the document ingestion pipeline to the web. Using headless browser automation, it can crawl and extract content from JavaScript-rendered websites, single-page applications, and dynamic content that traditional HTTP scrapers miss. Crawled content is automatically processed through the same ingestion pipeline as uploaded documents — chunked, embedded, and indexed for RAG retrieval.
Key Value: Keep your AI knowledge base current with web content. Crawl competitor sites, documentation portals, regulatory updates, or any public information source — and make it instantly searchable and queryable through AI chat.
Full headless browser rendering means the crawler sees exactly what a user sees. JavaScript-rendered content, SPAs, lazy-loaded elements, and dynamically injected text are all captured accurately.
Set maximum crawl depth (how many links deep to follow), page limits, and concurrency (how many pages to process simultaneously). Balance thoroughness with resource usage and politeness.
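These settings can be pictured as a small job configuration object. A minimal sketch (the field names here are hypothetical, not the actual OpenRails dashboard settings):

```python
from dataclasses import dataclass

@dataclass
class CrawlConfig:
    # Hypothetical field names illustrating the per-job settings described above.
    max_depth: int = 3          # how many links deep to follow from the seed URL
    max_pages: int = 500        # hard cap on pages fetched per job
    concurrency: int = 4        # pages processed simultaneously
    delay_seconds: float = 1.0  # politeness delay between requests to one host

# A thorough but polite job: deeper crawl, lower concurrency.
cfg = CrawlConfig(max_depth=5, concurrency=2, delay_seconds=2.0)
```

Raising `concurrency` finishes jobs faster but increases load on both the crawler host and the target site; the delay setting is the usual lever for politeness.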
Intelligent content extraction strips navigation, headers, footers, ads, and boilerplate to capture the meaningful content of each page. CSS selectors can be configured for targeted extraction from specific page regions.
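The boilerplate-stripping idea can be sketched with the standard library alone: walk the HTML, skip everything inside structural chrome elements, and keep the remaining text. This is a simplified stand-in for the crawler's actual extractor, not its implementation:

```python
from html.parser import HTMLParser

# Elements treated as boilerplate in this sketch; a real extractor would
# also honor configurable CSS selectors for targeted regions.
BOILERPLATE = {"nav", "header", "footer", "aside", "script", "style"}

class ContentExtractor(HTMLParser):
    """Collects visible text while skipping boilerplate subtrees."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE or self.skip_depth:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

For example, `extract_text` on a page with a `<nav>` menu and a `<footer>` keeps only the `<main>` body text.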
Crawled content automatically flows through the ingestion pipeline: chunking, embedding generation, and indexing in the vector database and knowledge graph. No manual processing step required.
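The chunking step of that pipeline can be illustrated with a simple overlapping-window splitter. This is a generic sketch of the technique, not the pipeline's actual chunker (which may split on tokens or sentences rather than characters):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split extracted text into overlapping character windows.

    Overlap preserves context across chunk boundaries so that a sentence
    cut at the edge of one chunk is still intact in the next.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```

Each chunk would then be passed to the embedding model and written to the vector index.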
Include or exclude URLs based on regex patterns. Restrict crawling to specific subdomains, paths, or content types. Respect robots.txt directives and implement custom crawl delay policies.
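Combining pattern filters with robots.txt can be sketched with `re` and the standard library's `urllib.robotparser`. The patterns and the `docs.example.com` host below are illustrative, not part of OpenRails:

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical job configuration: crawl only the docs subdomain,
# skip binary assets and search result pages.
INCLUDE = [re.compile(r"^https://docs\.example\.com/")]
EXCLUDE = [re.compile(r"\.(png|jpg|pdf)$"), re.compile(r"/search\?")]

# In practice robots.txt is fetched from the site; parsed inline here.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

def should_crawl(url: str) -> bool:
    """Include patterns first, then exclude patterns, then robots.txt."""
    if not any(p.search(url) for p in INCLUDE):
        return False
    if any(p.search(url) for p in EXCLUDE):
        return False
    return robots.can_fetch("*", url)
```

Order matters: a URL must match an include pattern, avoid every exclude pattern, and still be permitted by robots.txt before it enters the crawl queue.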
Set up recurring crawl jobs to keep web content fresh. Configure recrawl intervals per URL pattern — daily for news sites, weekly for documentation, monthly for static resources.
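A per-pattern recrawl policy like the one described can be expressed as an ordered list of (pattern, interval) pairs where the first match wins. A minimal sketch with made-up patterns:

```python
import re
from datetime import timedelta

# Hypothetical policy: first matching pattern determines the interval.
RECRAWL_POLICY = [
    (re.compile(r"^https://news\."), timedelta(days=1)),    # news sites: daily
    (re.compile(r"/docs/"), timedelta(weeks=1)),            # documentation: weekly
]
DEFAULT_INTERVAL = timedelta(days=30)                        # static resources: monthly

def recrawl_interval(url: str) -> timedelta:
    for pattern, interval in RECRAWL_POLICY:
        if pattern.search(url):
            return interval
    return DEFAULT_INTERVAL
```

The scheduler would compare each page's last-crawl timestamp against this interval to decide when it is due again.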
Every crawl job is fully configurable to match your needs. Control how deep the crawler follows links, how many pages to collect, crawl speed, and whether content is automatically fed into your knowledge base. All settings are adjustable per job from the dashboard.
Seed URL → Browser Render → Content Extraction → Link Discovery (filter + dedupe) → Queue New URLs → Ingestion Pipeline (chunk + embed + index)
Parallel processing across concurrent browser instances | Deduplication prevents re-processing unchanged pages
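The seed-to-queue loop above, including depth limiting and content-hash deduplication, can be sketched as a breadth-first crawl. The `fetch` and `extract_links` callables stand in for the browser-render and link-discovery stages; this is a single-threaded illustration, not the concurrent production crawler:

```python
import hashlib
from collections import deque

def crawl(seed, fetch, extract_links, max_depth=2, max_pages=100):
    """Breadth-first crawl: fetch, hash-dedupe content, queue new links."""
    seen_urls = {seed}
    seen_hashes = set()       # content digests of already-processed pages
    results = []
    frontier = deque([(seed, 0)])
    while frontier and len(results) < max_pages:
        url, depth = frontier.popleft()
        page = fetch(url)     # stands in for headless browser rendering
        digest = hashlib.sha256(page.encode()).hexdigest()
        if digest in seen_hashes:
            continue          # unchanged/duplicate content: skip re-processing
        seen_hashes.add(digest)
        results.append((url, page))
        if depth < max_depth:
            for link in extract_links(url):
                if link not in seen_urls:   # URL-level dedupe for the queue
                    seen_urls.add(link)
                    frontier.append((link, depth + 1))
    return results            # pages handed to the ingestion pipeline
```

Pages whose content hash has already been seen are dropped before chunking and embedding, which is what keeps recrawls cheap when most of a site is unchanged.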
Crawl product documentation sites and make them searchable through AI chat
Monitor competitor websites for product changes, pricing updates, and content shifts
Crawl government and regulatory sites to stay current on policy changes and requirements