Document Ingestion

Multi-format document processing with OCR, video/audio transcription, semantic chunking, and collection management

Overview

OpenRails Document Ingestion is the entry point for all knowledge that powers the platform's AI capabilities. It accepts all major document formats, automatically extracts and structures content, applies semantic chunking, generates embeddings, and indexes the results for retrieval. Whether you are uploading a single PDF or batch-processing thousands of files, the ingestion pipeline runs the work in the background with progress tracking and error recovery.

Key Value: Most platforms handle only text documents. OpenRails ingests video, audio, images, and presentations alongside traditional documents, giving your AI a complete picture of organizational knowledge regardless of format.

Supported Formats

Format | Extensions | Extracted Content
PDF | .pdf | Text, tables, images, metadata
Word | .docx, .doc | Text, headings, tables, styles
PowerPoint | .pptx, .ppt | Text, speaker notes, slide structure
Excel | .xlsx, .csv | Cell data, sheet names, formulas (values)
Plain Text | .txt, .md, .rst | Full text content
HTML | .html, .htm | Text content, structure, links
Images | .png, .jpg, .tiff | Extracted text, alt descriptions
Video | .mp4, .webm, .avi | Transcription, key frames, metadata
Audio | .mp3, .wav, .m4a | Full transcription, speaker detection
Email | .eml, .msg | Body, headers, attachments
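
As an illustration of how the extensions above could map to format families, here is a minimal Python sketch. The names EXTENSION_TO_FORMAT and detect_format are hypothetical and not part of the OpenRails API; the actual format detection step may also inspect file contents rather than relying on the extension alone.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup built from the Supported Formats table.
# Assumption: detection by extension only; the real detector may do more.
EXTENSION_TO_FORMAT = {
    ".pdf": "PDF",
    ".docx": "Word", ".doc": "Word",
    ".pptx": "PowerPoint", ".ppt": "PowerPoint",
    ".xlsx": "Excel", ".csv": "Excel",
    ".txt": "Plain Text", ".md": "Plain Text", ".rst": "Plain Text",
    ".html": "HTML", ".htm": "HTML",
    ".png": "Images", ".jpg": "Images", ".tiff": "Images",
    ".mp4": "Video", ".webm": "Video", ".avi": "Video",
    ".mp3": "Audio", ".wav": "Audio", ".m4a": "Audio",
    ".eml": "Email", ".msg": "Email",
}


def detect_format(filename: str) -> Optional[str]:
    """Return the format family for a filename, or None if it is not listed."""
    # Lower-case the suffix so "Report.PDF" and "report.pdf" behave the same.
    return EXTENSION_TO_FORMAT.get(Path(filename).suffix.lower())
```

A file with an unlisted extension returns None, which a caller could treat as "reject or route to manual review".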

Key Capabilities

OCR Processing

Scanned documents, photographs of whiteboards, and image-based PDFs are automatically processed with optical character recognition (OCR). OCR supports multiple languages and handles mixed-content documents that combine digital text with scanned pages.

Video & Audio Transcription

Meeting recordings, training videos, podcasts, and voice memos are transcribed using speech-to-text. Video files also undergo key frame extraction for visual context. Timestamps are preserved for reference.
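
To show what preserved timestamps might look like downstream, the following Python sketch renders transcript segments with an HH:MM:SS prefix. The Segment type and both function names are hypothetical; the source does not describe the actual transcript data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical transcript segment: start offset in seconds plus text."""
    start_seconds: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Render a second offset as HH:MM:SS, dropping fractional seconds."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(segments: List[Segment]) -> List[str]:
    """Prefix each segment with its timestamp so readers can jump to the source."""
    return [f"[{format_timestamp(s.start_seconds)}] {s.text}" for s in segments]
```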

Semantic Chunking

Documents are split into chunks that preserve meaning. The chunking engine respects section boundaries, paragraph structure, and semantic coherence. Chunk sizes are configurable, and adjacent chunks can overlap for continuity.
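
The idea of size-limited chunks with overlap can be sketched in Python as follows. This is a simplified stand-in, not the OpenRails chunking engine: it treats blank-line-separated paragraphs as the only boundary signal, and the function name and parameters (max_chars, overlap) are assumptions chosen for illustration.

```python
from typing import List


def chunk_paragraphs(text: str, max_chars: int = 500, overlap: int = 1) -> List[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    The last `overlap` paragraphs of a finished chunk are repeated at the
    start of the next one when they fit. A single paragraph longer than
    max_chars is kept whole rather than split mid-paragraph.
    """
    def joined(parts: List[str]) -> str:
        return "\n\n".join(parts)

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        if current and len(joined(current + [para])) > max_chars:
            chunks.append(joined(current))
            tail = current[-overlap:] if overlap > 0 else []
            # Keep the overlap only when it still leaves room for the new paragraph.
            current = tail if len(joined(tail + [para])) <= max_chars else []
        current.append(para)
    if current:
        chunks.append(joined(current))
    return chunks
```

Splitting on paragraph boundaries rather than at a fixed character offset keeps sentences intact, and the repeated paragraph gives a retriever some context on both sides of a chunk boundary.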

Collection Management

Organize documents into collections (data lakes) for scoped access. Control which collections are available to specific projects, agents, or users. Collections can be nested and tagged.
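
One way to picture scoped access over nested collections is a path-prefix check, sketched below in Python. The slash-separated path convention and the rule that a grant on a parent covers its children are assumptions made for this example; the source does not specify how OpenRails resolves permissions.

```python
from typing import Set


def can_access(granted: Set[str], collection_path: str) -> bool:
    """Return True if the path or any of its ancestors has been granted.

    Assumption: collections are addressed as slash-separated paths such as
    "legal/contracts", and access to a parent implies access to its children.
    """
    parts = collection_path.split("/")
    # Check "legal", then "legal/contracts", and so on down to the full path.
    return any("/".join(parts[:i]) in granted for i in range(1, len(parts) + 1))
```

Under these assumptions, an agent granted "legal" can read "legal/contracts", while an agent granted only "legal/contracts" cannot read the parent "legal" collection.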

Bulk Upload

Upload hundreds or thousands of files at once. A background processing queue provides progress tracking, error reporting, and automatic retry for failed items. An API endpoint is available for programmatic ingestion.
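
The retry-and-report behavior described here can be sketched as a simple loop in Python. This is an illustrative model only: process_batch, the ingest callback, and the max_attempts parameter are hypothetical names, and the real queue runs in background workers rather than in a single synchronous loop.

```python
from typing import Callable, Dict, Iterable, List


def process_batch(
    paths: Iterable[str],
    ingest: Callable[[str], None],
    max_attempts: int = 3,
) -> Dict[str, List[str]]:
    """Try each file up to max_attempts times and report the outcome.

    `ingest` is a caller-supplied function (hypothetical here) that raises
    an exception when processing a file fails.
    """
    report: Dict[str, List[str]] = {"succeeded": [], "failed": []}
    for path in paths:
        for attempt in range(1, max_attempts + 1):
            try:
                ingest(path)
                report["succeeded"].append(path)
                break
            except Exception:
                # Give up on this file only after the final attempt.
                if attempt == max_attempts:
                    report["failed"].append(path)
    return report
```

A failure in one file never stops the batch; it only ends up in the "failed" list once its attempts are exhausted.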

Version Tracking

Re-ingest updated versions of documents while maintaining history. The system tracks document lineage and can show how content has changed across versions.
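
As a rough illustration of comparing two ingested versions, the Python sketch below uses the standard library's hashlib and difflib modules. It is not the OpenRails lineage mechanism, which the source does not describe; the function names and the "v1"/"v2" labels are placeholders.

```python
import difflib
import hashlib
from typing import List


def content_hash(text: str) -> str:
    """Fingerprint a version's text so unchanged re-uploads can be detected."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_versions(old: str, new: str) -> List[str]:
    """Return a unified diff between two versions, or an empty list if identical."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="v1",  # placeholder version labels
            tofile="v2",
            lineterm="",
        )
    )
```

Comparing hashes first is a cheap way to skip the diff entirely when a re-ingested file has not changed.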

Ingestion Pipeline

Upload → Format Detection → Content Extraction (OCR/Transcription if needed) → Semantic Chunking → Embedding Generation → Vector Indexing + Knowledge Graph Extraction → Available for RAG

All steps execute asynchronously via background workers | Progress available via UI and API
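
The stage sequence and progress reporting can be modeled with asyncio, as in the Python sketch below. The stage names follow the pipeline above, but run_stage is a stub and the progress dictionary is a hypothetical stand-in for whatever the UI and API actually expose.

```python
import asyncio
from typing import Dict, List, Tuple

# Stage names taken from the pipeline description; the work itself is stubbed.
STAGES = [
    "format_detection",
    "content_extraction",
    "semantic_chunking",
    "embedding_generation",
    "indexing",
]

Progress = Dict[str, Tuple[str, float]]


async def run_stage(stage: str, doc_id: str) -> None:
    """Placeholder for real work; yields control so documents interleave."""
    await asyncio.sleep(0)


async def ingest(doc_id: str, progress: Progress) -> None:
    """Run every stage in order and record (last stage, fraction complete)."""
    for index, stage in enumerate(STAGES, start=1):
        await run_stage(stage, doc_id)
        progress[doc_id] = (stage, index / len(STAGES))


async def ingest_all(doc_ids: List[str]) -> Progress:
    """Process documents concurrently and return their final progress."""
    progress: Progress = {}
    await asyncio.gather(*(ingest(doc_id, progress) for doc_id in doc_ids))
    return progress
```

Calling asyncio.run(ingest_all(["handbook.pdf", "standup.mp4"])) processes both documents concurrently while keeping the stages of each document in order.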

Use Cases

Corporate Knowledge

Ingest HR policies, SOPs, training materials, and internal wikis for company-wide AI access

Legal Document Review

Process contracts, court filings, and regulatory documents including scanned PDFs via OCR

Meeting Intelligence

Transcribe recorded meetings and make discussions searchable and queryable via AI chat
