Document Ingestion Pipeline

How OpenRails processes documents through parsing, chunking, embedding, and indexing

Overview

The document ingestion pipeline is the automated workflow that transforms raw uploaded files into embedded, indexed content that RAG retrieval can search. Every document passes through four stages in order: Parse, Chunk, Embed, and Index.

Pipeline Stages

Parse

Raw files are converted to plain text. The parsing method depends on the file type:

PDF/DOCX/PPTX — Native text extraction with layout preservation
Images — OCR to extract text from scanned documents and photos
Video/Audio — Speech-to-text transcription to convert audio to text
TXT/CSV/JSON/HTML/Markdown — Direct text extraction
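
The routing above can be pictured as a lookup from file extension to parsing method. The following is a minimal sketch, not OpenRails code: the function name `parse_method`, the returned labels, and the exact extension sets are assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical extension groups mirroring the file types listed above.
# The concrete extensions are illustrative, not an official support list.
NATIVE_EXTRACTION = {".pdf", ".docx", ".pptx"}
OCR = {".png", ".jpg", ".jpeg", ".tiff"}
TRANSCRIPTION = {".mp3", ".wav", ".mp4", ".mov"}
DIRECT_TEXT = {".txt", ".csv", ".json", ".html", ".md"}


def parse_method(filename: str) -> str:
    """Return an illustrative label for the parsing method a file would use."""
    ext = Path(filename).suffix.lower()
    if ext in NATIVE_EXTRACTION:
        return "native_extraction"
    if ext in OCR:
        return "ocr"
    if ext in TRANSCRIPTION:
        return "speech_to_text"
    if ext in DIRECT_TEXT:
        return "direct_text"
    raise ValueError(f"unsupported file type: {ext or filename}")
```

Matching on the lowercased extension keeps the lookup case-insensitive, so `report.PDF` and `report.pdf` are routed the same way.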

Chunk

Extracted text is split into smaller segments called chunks. Chunking parameters are configurable per data lake:

Chunk Size — Configurable number of characters per chunk
Chunk Overlap — Configurable number of overlapping characters between consecutive chunks

Overlap repeats the end of each chunk at the start of the next, so context that spans a chunk boundary is preserved, which improves retrieval quality.
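
To make the two parameters concrete, here is a minimal sketch of character-based chunking with overlap. It assumes a simple sliding window; the function name `chunk_text` is hypothetical, and the actual OpenRails chunker may split differently (for example, on sentence boundaries).

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text, so the tail is
        # not emitted again as a shorter duplicate chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

In the example, each chunk repeats the last two characters of the previous one, which is the overlap at work.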

Embed

Each chunk is converted into a vector embedding using the configured embedding model. These dense vector representations capture the semantic meaning of the text.
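
The shape of this step (text in, fixed-length vector out) can be illustrated with a toy stand-in. The sketch below is not a real embedding model and does not capture semantic meaning: it hashes words into a small number of buckets purely to show that every chunk maps to a vector of the same dimension. The name `toy_embed` and the dimension of 8 are assumptions for illustration.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Map text to a unit-length vector of `dim` floats (toy stand-in only)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # A stable hash keeps the bucket for each word the same across runs.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # Normalize so vectors can be compared by cosine similarity;
    # an empty text stays the zero vector.
    return [x / norm for x in vec] if norm else vec


chunks = ["Invoices are due in 30 days.", "Refunds take 5 business days."]
vectors = [toy_embed(chunk) for chunk in chunks]
```

A real embedding model replaces the hashing step with a learned mapping, but the contract is the same: one vector per chunk, all with equal length.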

Index

Embeddings are stored in a vector collection, and optionally in a knowledge graph for Graph RAG. The index enables fast similarity search during retrieval.
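
As a rough picture of what similarity search over stored embeddings involves, the sketch below keeps vectors in memory and ranks them by cosine similarity with a linear scan. The class `InMemoryVectorIndex` and its methods are hypothetical; production vector stores typically use approximate nearest-neighbor structures rather than scanning every vector.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorIndex:
    """Toy vector collection: stores (chunk_id, vector) pairs in a list."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._items.append((chunk_id, vector))

    def search(self, query: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        """Return up to top_k (chunk_id, score) pairs, most similar first."""
        scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


index = InMemoryVectorIndex()
index.add("a", [1.0, 0.0])
index.add("b", [0.0, 1.0])
index.add("c", [1.0, 1.0])
results = index.search([1.0, 0.0], top_k=2)  # "a" ranks first, then "c"
```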

Monitoring Ingestion

Track the status of document ingestion from the data lake's Documents tab:

Status	Description
Queued	File uploaded and waiting to be processed
Parsing	Text extraction in progress
Chunking	Text being split into chunks
Embedding	Chunks being converted to vector embeddings
Indexing	Vectors being stored in the vector collection
Complete	Document fully processed and available for RAG
Error	Processing failed; check error details for more information
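
If you track these statuses in a script, an enumeration keeps the values consistent. The sketch below models the statuses from the table and summarizes a batch of documents; `IngestionStatus`, `is_terminal`, and `summarize` are hypothetical names, and how the status strings are obtained from OpenRails is left out because it is not specified here.

```python
from collections import Counter
from enum import Enum


class IngestionStatus(Enum):
    QUEUED = "Queued"
    PARSING = "Parsing"
    CHUNKING = "Chunking"
    EMBEDDING = "Embedding"
    INDEXING = "Indexing"
    COMPLETE = "Complete"
    ERROR = "Error"


# Statuses after which a document will not change state on its own.
TERMINAL = {IngestionStatus.COMPLETE, IngestionStatus.ERROR}


def is_terminal(status: IngestionStatus) -> bool:
    """True once processing has finished, successfully or not."""
    return status in TERMINAL


def summarize(status_labels: list[str]) -> dict[str, int]:
    """Count documents per status label, rejecting unknown labels."""
    counts = Counter(IngestionStatus(label) for label in status_labels)
    return {status.value: n for status, n in counts.items()}


print(summarize(["Queued", "Complete", "Error", "Complete"]))
# prints {'Queued': 1, 'Complete': 2, 'Error': 1}
```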

Tip: Adjust chunk size based on your use case. Smaller chunks work well for precise Q&A, while larger chunks preserve more context for summarization tasks. Experiment with the configurable chunk size and overlap settings to find the right balance for your content.
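
When comparing settings, it can help to estimate how many chunks a document will produce, since more chunks mean more embeddings to compute and store. The sketch below assumes a simple sliding-window, character-based chunker; `estimate_chunk_count` is a hypothetical helper, and the real count depends on how OpenRails actually splits text.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the chunk count for a sliding window over text_length characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    # After the first full chunk, each further chunk advances by this many characters.
    step = chunk_size - chunk_overlap
    return math.ceil((text_length - chunk_size) / step) + 1


# A 10,000-character document under two illustrative settings:
print(estimate_chunk_count(10_000, chunk_size=500, chunk_overlap=50))    # prints 23
print(estimate_chunk_count(10_000, chunk_size=2_000, chunk_overlap=200))  # prints 6
```

Under these assumptions, the smaller setting yields roughly four times as many chunks for the same document, which is the cost side of the precision it buys.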

Important: OCR and speech-to-text transcription run during the Parse stage and can be resource-intensive. If you are processing many image, audio, or video files simultaneously, monitor your server resources to avoid bottlenecks.

Next Steps

Uploading Documents — Learn how to upload files for ingestion
Managing Data Lakes — Configure data lake settings including chunk size
Configuring RAG Collections — Set up vector collections and Graph RAG