Multi-format document processing with OCR, video/audio transcription, semantic chunking, and collection management
OpenRails Document Ingestion is the entry point for all knowledge that powers the platform's AI capabilities. It accepts all major document formats, automatically extracts and structures content, applies semantic chunking, generates embeddings, and indexes everything for instant retrieval. Whether you are uploading a single PDF or batch-processing thousands of files, the ingestion pipeline handles it with background processing, progress tracking, and error recovery.
Key Value: Most platforms only handle text documents. OpenRails ingests video, audio, images, and presentations alongside traditional documents — giving your AI a complete picture of organizational knowledge regardless of format.
| Format | Extensions | Extracted Content |
|---|---|---|
| PDF | .pdf | Text, tables, images, metadata |
| Word | .docx, .doc | Text, headings, tables, styles |
| PowerPoint | .pptx, .ppt | Text, speaker notes, slide structure |
| Excel | .xlsx, .csv | Cell data, sheet names, formulas (values) |
| Plain Text | .txt, .md, .rst | Full text content |
| HTML | .html, .htm | Text content, structure, links |
| Images | .png, .jpg, .tiff | Extracted text, alt descriptions |
| Video | .mp4, .webm, .avi | Transcription, key frames, metadata |
| Audio | .mp3, .wav, .m4a | Full transcription, speaker detection |
| Email | .eml, .msg | Body, headers, attachments |
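The format table above can be driven by simple extension-based detection. Below is a minimal sketch of that routing step; the `EXTRACTORS` mapping and category names are illustrative, not OpenRails APIs.

```python
from pathlib import Path

# Hypothetical extension-to-extractor mapping mirroring the supported-format table.
EXTRACTORS = {
    ".pdf": "pdf", ".docx": "word", ".doc": "word",
    ".pptx": "powerpoint", ".ppt": "powerpoint",
    ".xlsx": "excel", ".csv": "excel",
    ".txt": "text", ".md": "text", ".rst": "text",
    ".html": "html", ".htm": "html",
    ".png": "image", ".jpg": "image", ".tiff": "image",
    ".mp4": "video", ".webm": "video", ".avi": "video",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio",
    ".eml": "email", ".msg": "email",
}

def detect_format(filename: str) -> str:
    """Return the extractor category for a file, raising for unsupported types."""
    ext = Path(filename).suffix.lower()
    try:
        return EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"Unsupported format: {ext!r}")
```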
Scanned documents, photographs of whiteboards, and image-based PDFs are automatically processed with optical character recognition (OCR). Multiple languages are supported, and mixed-content documents (native text plus scanned pages) are handled.
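One way mixed-content handling can work is to keep native text where it exists and route only text-poor pages to OCR. The sketch below shows that decision logic; `run_ocr` is an injected placeholder for a real OCR engine (e.g. a Tesseract wrapper), not an OpenRails function.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic: a page with little or no extractable text is likely a scan."""
    return len(page_text.strip()) < min_chars

def extract_pages(pages, run_ocr):
    """Extract text from a mixed-content document.

    `pages` is a list of (native_text, page_image) pairs; native text is kept
    as-is, and only scanned pages are sent through the OCR callable.
    """
    out = []
    for text, image in pages:
        out.append(run_ocr(image) if needs_ocr(text) else text)
    return out
```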
Meeting recordings, training videos, podcasts, and voice memos are transcribed using speech-to-text. Video files also undergo key frame extraction for visual context. Timestamps are preserved for reference.
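Preserved timestamps make transcript segments citable. A small sketch of how segment start times (as a speech-to-text engine typically emits them, in seconds) might be rendered for reference; the function names are illustrative.

```python
def format_timestamp(seconds: float) -> str:
    """Render a start time in seconds as H:MM:SS for transcript references."""
    s = int(seconds)
    return f"{s // 3600}:{(s % 3600) // 60:02d}:{s % 60:02d}"

def annotate_segments(segments):
    """Prefix each (start_seconds, text) transcript segment with its timestamp."""
    return [f"[{format_timestamp(start)}] {text}" for start, text in segments]
```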
Documents are intelligently split into chunks that preserve meaning. The chunking engine respects section boundaries, paragraph structure, and semantic coherence. Configurable chunk sizes with overlap for continuity.
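A simplified stand-in for the chunking engine: it never splits a paragraph, packs paragraphs up to a size budget, and carries the last paragraph of each chunk into the next as overlap. The real engine also weighs semantic coherence; this sketch only shows the boundary-respecting and overlap mechanics.

```python
def chunk_paragraphs(text: str, max_chars: int = 500, overlap: int = 1):
    """Split text into chunks along paragraph boundaries.

    Paragraphs are never split mid-way; `overlap` carries the trailing N
    paragraphs of each chunk into the next chunk for continuity.
    """
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for p in paras:
        if current and sum(len(q) for q in current) + len(p) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap else []
        current.append(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```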
Organize documents into collections (data lakes) for scoped access. Control which collections are available to specific projects, agents, or users. Supports nested collections and tagging.
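Nested collections and tagging can be modeled as a simple tree where scoped access resolves a collection plus everything beneath it. The class below is an illustrative data-structure sketch, not the platform's schema.

```python
class Collection:
    """Minimal model of a nested document collection (data lake) with tags."""

    def __init__(self, name, parent=None, tags=()):
        self.name, self.parent, self.tags = name, parent, set(tags)
        self.documents, self.children = [], []
        if parent:
            parent.children.append(self)

    def all_documents(self):
        """Documents in this collection and, recursively, all nested children."""
        docs = list(self.documents)
        for child in self.children:
            docs.extend(child.all_documents())
        return docs
```

Scoping a project or agent to `hr` below exposes only HR documents, while scoping to the root exposes everything.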
Upload hundreds or thousands of files at once. Background processing queue with progress tracking, error reporting, and automatic retry for failed items. API endpoint for programmatic ingestion.
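The retry behavior can be sketched as a loop that reattempts each failed item a bounded number of times before marking it failed. A real worker pool would run asynchronously and persist progress; this toy version only shows the retry and reporting logic.

```python
def process_batch(items, handler, max_retries=3):
    """Process each item, retrying failures up to `max_retries` times.

    Returns a per-item status map so callers can surface progress and
    error reports, as the ingestion queue does via the UI and API.
    """
    results = {}
    for item in items:
        for attempt in range(1, max_retries + 1):
            try:
                handler(item)
                results[item] = "ok"
                break
            except Exception as exc:
                results[item] = f"failed: {exc}"
    return results
```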
Re-ingest updated versions of documents while maintaining history. The system tracks document lineage and can show how content has changed across versions.
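Version tracking amounts to keeping every ingested revision addressable so lineage and change queries stay possible. The sketch below uses a naive whole-document comparison; a production system would diff at the chunk level. Names are illustrative.

```python
class DocumentHistory:
    """Track ingested versions of one document so lineage stays queryable."""

    def __init__(self, doc_id):
        self.doc_id, self.versions = doc_id, []

    def ingest(self, content):
        """Store a new version; returns the 1-based version number."""
        self.versions.append(content)
        return len(self.versions)

    def latest(self):
        return self.versions[-1]

    def changed_between(self, v1, v2):
        """Naive change check between two version numbers."""
        return self.versions[v1 - 1] != self.versions[v2 - 1]
```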
Upload → Format Detection → Content Extraction (OCR/Transcription if needed) → Semantic Chunking → Embedding Generation → Vector Indexing + Knowledge Graph Extraction → Available for RAG
All steps execute asynchronously via background workers; progress is available via both the UI and the API.
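The pipeline above can be sketched as an ordered sequence of stages, each transforming the output of the previous one while a trace records progress. The stage functions here are trivial stand-ins for the real extraction, chunking, embedding, and indexing workers.

```python
def ingest(filename, content, steps):
    """Run a document through ingestion stages in order.

    `steps` is an ordered list of (name, fn) pairs; each fn transforms the
    payload, and the trace of completed stage names stands in for the
    progress reporting surfaced by the UI and API.
    """
    payload, trace = content, []
    for name, fn in steps:
        payload = fn(payload)
        trace.append(name)
    return payload, trace
```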
Ingest HR policies, SOPs, training materials, and internal wikis for company-wide AI access
Process contracts, court filings, and regulatory documents including scanned PDFs via OCR
Transcribe recorded meetings and make discussions searchable and queryable via AI chat