Multi-format document processing with OCR, video/audio transcription, semantic chunking, and collection management
OpenRails Document Ingestion is the entry point for all knowledge that powers the platform's AI capabilities. It accepts all major document formats, automatically extracts and structures content, applies semantic chunking, generates embeddings, and indexes everything for instant retrieval. Whether you are uploading a single PDF or batch-processing thousands of files, the ingestion pipeline handles it with background processing, progress tracking, and error recovery.
Key Value: Most platforms only handle text documents. OpenRails ingests video, audio, images, and presentations alongside traditional documents — giving your AI a complete picture of organizational knowledge regardless of format.
| Format | Extensions | Extracted Content |
|---|---|---|
| PDF | .pdf | Text, tables, images, metadata |
| Word | .docx, .doc | Text, headings, tables, styles |
| PowerPoint | .pptx, .ppt | Text, speaker notes, slide structure |
| Excel | .xlsx, .csv | Cell data, sheet names, formulas (values) |
| Plain Text | .txt, .md, .rst | Full text content |
| HTML | .html, .htm | Text content, structure, links |
| Images | .png, .jpg, .tiff | Extracted text, alt descriptions |
| Video | .mp4, .webm, .avi | Transcription, key frames, metadata |
| Audio | .mp3, .wav, .m4a | Full transcription, speaker detection |
| Email | .eml, .msg | Body, headers, attachments |
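The format table above can be driven by simple extension-based detection. Below is a minimal sketch of that routing step; the `EXTRACTORS` mapping and category names are illustrative, not OpenRails APIs.

```python
from pathlib import Path

# Hypothetical extension-to-extractor mapping mirroring the supported-format table.
EXTRACTORS = {
    ".pdf": "pdf", ".docx": "word", ".doc": "word",
    ".pptx": "powerpoint", ".ppt": "powerpoint",
    ".xlsx": "excel", ".csv": "excel",
    ".txt": "text", ".md": "text", ".rst": "text",
    ".html": "html", ".htm": "html",
    ".png": "image", ".jpg": "image", ".tiff": "image",
    ".mp4": "video", ".webm": "video", ".avi": "video",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio",
    ".eml": "email", ".msg": "email",
}

def detect_format(filename: str) -> str:
    """Return the extractor category for a file, raising for unsupported types."""
    ext = Path(filename).suffix.lower()
    try:
        return EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"Unsupported format: {ext!r}")
```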
Scanned documents, photographs of whiteboards, and image-based PDFs are automatically processed with optical character recognition (OCR). Multiple languages are supported, and mixed-content documents (native text plus scanned pages) are handled.
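One way mixed-content handling can work is to keep native text where it exists and route only text-poor pages to OCR. The sketch below shows that decision logic; `run_ocr` is an injected placeholder for a real OCR engine (e.g. a Tesseract wrapper), not an OpenRails function.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic: a page with little or no extractable text is likely a scan."""
    return len(page_text.strip()) < min_chars

def extract_pages(pages, run_ocr):
    """Extract text from a mixed-content document.

    `pages` is a list of (native_text, page_image) pairs; native text is kept
    as-is, and only scanned pages are sent through the OCR callable.
    """
    out = []
    for text, image in pages:
        out.append(run_ocr(image) if needs_ocr(text) else text)
    return out
```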
Meeting recordings, training videos, podcasts, and voice memos are transcribed using speech-to-text. Video files also undergo key frame extraction for visual context. Timestamps are preserved for reference.
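Preserved timestamps make transcript segments citable. A small sketch of how segment start times (as a speech-to-text engine typically emits them, in seconds) might be rendered for reference; the function names are illustrative.

```python
def format_timestamp(seconds: float) -> str:
    """Render a start time in seconds as H:MM:SS for transcript references."""
    s = int(seconds)
    return f"{s // 3600}:{(s % 3600) // 60:02d}:{s % 60:02d}"

def annotate_segments(segments):
    """Prefix each (start_seconds, text) transcript segment with its timestamp."""
    return [f"[{format_timestamp(start)}] {text}" for start, text in segments]
```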
Documents are intelligently split into chunks that preserve meaning. The chunking engine respects section boundaries, paragraph structure, and semantic coherence. Configurable chunk sizes with overlap for continuity.
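A simplified stand-in for the chunking engine: it never splits a paragraph, packs paragraphs up to a size budget, and carries the last paragraph of each chunk into the next as overlap. The real engine also weighs semantic coherence; this sketch only shows the boundary-respecting and overlap mechanics.

```python
def chunk_paragraphs(text: str, max_chars: int = 500, overlap: int = 1):
    """Split text into chunks along paragraph boundaries.

    Paragraphs are never split mid-way; `overlap` carries the trailing N
    paragraphs of each chunk into the next chunk for continuity.
    """
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for p in paras:
        if current and sum(len(q) for q in current) + len(p) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap else []
        current.append(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```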
Organize documents into collections (data lakes) for scoped access. Control which collections are available to specific projects, agents, or users. Supports nested collections and tagging.
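Nested collections and tagging can be modeled as a simple tree where scoped access resolves a collection plus everything beneath it. The class below is an illustrative data-structure sketch, not the platform's schema.

```python
class Collection:
    """Minimal model of a nested document collection (data lake) with tags."""

    def __init__(self, name, parent=None, tags=()):
        self.name, self.parent, self.tags = name, parent, set(tags)
        self.documents, self.children = [], []
        if parent:
            parent.children.append(self)

    def all_documents(self):
        """Documents in this collection and, recursively, all nested children."""
        docs = list(self.documents)
        for child in self.children:
            docs.extend(child.all_documents())
        return docs
```

Scoping a project or agent to `hr` below exposes only HR documents, while scoping to the root exposes everything.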
Upload hundreds or thousands of files at once. Background processing queue with progress tracking, error reporting, and automatic retry for failed items. API endpoint for programmatic ingestion.
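The retry behavior can be sketched as a loop that reattempts each failed item a bounded number of times before marking it failed. A real worker pool would run asynchronously and persist progress; this toy version only shows the retry and reporting logic.

```python
def process_batch(items, handler, max_retries=3):
    """Process each item, retrying failures up to `max_retries` times.

    Returns a per-item status map so callers can surface progress and
    error reports, as the ingestion queue does via the UI and API.
    """
    results = {}
    for item in items:
        for attempt in range(1, max_retries + 1):
            try:
                handler(item)
                results[item] = "ok"
                break
            except Exception as exc:
                results[item] = f"failed: {exc}"
    return results
```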
Re-ingest updated versions of documents while maintaining history. The system tracks document lineage and can show how content has changed across versions.
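Version tracking amounts to keeping every ingested revision addressable so lineage and change queries stay possible. The sketch below uses a naive whole-document comparison; a production system would diff at the chunk level. Names are illustrative.

```python
class DocumentHistory:
    """Track ingested versions of one document so lineage stays queryable."""

    def __init__(self, doc_id):
        self.doc_id, self.versions = doc_id, []

    def ingest(self, content):
        """Store a new version; returns the 1-based version number."""
        self.versions.append(content)
        return len(self.versions)

    def latest(self):
        return self.versions[-1]

    def changed_between(self, v1, v2):
        """Naive change check between two version numbers."""
        return self.versions[v1 - 1] != self.versions[v2 - 1]
```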
Upload → Format Detection → Content Extraction (OCR/Transcription if needed) → Semantic Chunking → Embedding Generation → Vector Indexing + Knowledge Graph Extraction → Available for RAG
All steps execute asynchronously via background workers; progress is available via both the UI and the API.
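The pipeline above can be sketched as an ordered sequence of stages, each transforming the output of the previous one while a trace records progress. The stage functions here are trivial stand-ins for the real extraction, chunking, embedding, and indexing workers.

```python
def ingest(filename, content, steps):
    """Run a document through ingestion stages in order.

    `steps` is an ordered list of (name, fn) pairs; each fn transforms the
    payload, and the trace of completed stage names stands in for the
    progress reporting surfaced by the UI and API.
    """
    payload, trace = content, []
    for name, fn in steps:
        payload = fn(payload)
        trace.append(name)
    return payload, trace
```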
Ingest HR policies, SOPs, training materials, and internal wikis for company-wide AI access
Process contracts, court filings, and regulatory documents including scanned PDFs via OCR
Transcribe recorded meetings and make discussions searchable and queryable via AI chat