Document Ingestion Pipeline

How OpenRails processes documents through parsing, chunking, embedding, and indexing

Overview

The document ingestion pipeline is the automated workflow that transforms raw uploaded files into searchable content for retrieval-augmented generation (RAG). Every document passes through four stages in order: Parse, Chunk, Embed, and Index.

Pipeline Stages

Parse

Raw files are converted to plain text. The parsing method depends on the file type:

  • PDF/DOCX/PPTX — Native text extraction with layout preservation
  • Images — OCR to extract text from scanned documents and photos
  • Video/Audio — Speech-to-text transcription to convert audio to text
  • TXT/CSV/JSON/HTML/Markdown — Direct text extraction
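
The routing by file type described above can be sketched as a simple extension lookup. This is a minimal illustration, not OpenRails' implementation: the method names, the extension lists (especially the image, audio, and video extensions), and the choose_parse_method function are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical mapping from parsing method to file extensions.
# The extension sets are illustrative; the product's supported set may differ.
PARSE_METHODS = {
    "native_extraction": {".pdf", ".docx", ".pptx"},
    "ocr": {".png", ".jpg", ".jpeg", ".tiff"},
    "speech_to_text": {".mp3", ".wav", ".mp4", ".mov"},
    "direct_text": {".txt", ".csv", ".json", ".html", ".md"},
}


def choose_parse_method(filename: str) -> str:
    """Return the name of the parsing method for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    for method, extensions in PARSE_METHODS.items():
        if suffix in extensions:
            return method
    raise ValueError(f"Unsupported file type: {suffix or filename}")
```

For example, choose_parse_method("scan.JPG") selects the OCR path, while an unlisted extension raises an error instead of silently falling back to another method.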

Chunk

Extracted text is split into smaller segments called chunks. Chunking parameters are configurable per data lake:

  • Chunk Size — Configurable number of characters per chunk
  • Chunk Overlap — Configurable number of overlapping characters between consecutive chunks

Overlap repeats the end of one chunk at the start of the next, so text near a chunk boundary keeps some of its surrounding context. This improves retrieval quality.
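
To make the interaction between the two settings concrete, here is a minimal sketch of character-based chunking with overlap. It assumes a plain sliding window; the chunk_text function and its parameter names are hypothetical, and the actual chunker may split differently (for example, on sentence boundaries).

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window advances by chunk_size minus chunk_overlap characters, so a larger overlap produces more chunks for the same text and therefore more embeddings to compute and store.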

Embed

Each chunk is converted into a vector embedding using the configured embedding model. These dense vector representations capture the semantic meaning of the text.
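
The shape of this step, text in and a fixed-length vector out, can be illustrated with a toy stand-in. The toy_embed function below is purely hypothetical: it hashes words into buckets, so it only reflects word overlap and does not capture semantic meaning the way a real embedding model does. It is shown only to make the input and output of the stage concrete.

```python
import hashlib
import math


def toy_embed(text: str, dimensions: int = 64) -> list[float]:
    """Toy stand-in for an embedding model: hash each word into a bucket
    of a fixed-size vector, then scale the vector to unit length."""
    vector = [0.0] * dimensions
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dimensions
        vector[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Whatever model is configured, every chunk in a collection ends up as a vector of the same length, which is what allows chunks and queries to be compared by similarity later.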

Index

Embeddings are stored in a vector collection, and optionally in a knowledge graph for Graph RAG. The index enables fast similarity search during retrieval.
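
As a rough illustration of what similarity search over stored embeddings means, the sketch below keeps vectors in memory and ranks them by cosine similarity with a linear scan. The InMemoryVectorIndex class is a hypothetical teaching aid, not the product's vector store; production vector databases typically use approximate nearest-neighbor structures to stay fast at scale.

```python
import math


class InMemoryVectorIndex:
    """Minimal brute-force vector index: stores unit vectors and
    ranks them against a query by cosine similarity."""

    def __init__(self):
        self._entries = []  # list of (chunk_id, unit_vector)

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(v * v for v in vector))
        if norm == 0:
            raise ValueError("cannot index or query with a zero vector")
        return [v / norm for v in vector]

    def add(self, chunk_id, vector):
        self._entries.append((chunk_id, self._normalize(vector)))

    def search(self, query_vector, top_k=3):
        """Return up to top_k (chunk_id, score) pairs, best match first."""
        query = self._normalize(query_vector)
        scored = [
            (chunk_id, sum(q * v for q, v in zip(query, vector)))
            for chunk_id, vector in self._entries
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```

At retrieval time the user's query is embedded with the same model as the chunks, and the highest-scoring chunks are passed to the language model as context.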

Monitoring Ingestion

Track the status of document ingestion from the data lake's Documents tab:

Status      Description
Queued      File uploaded and waiting to be processed
Parsing     Text extraction in progress
Chunking    Text being split into chunks
Embedding   Chunks being converted to vector embeddings
Indexing    Vectors being stored in the vector database
Complete    Document fully processed and available for RAG
Error       Processing failed; check error details for more information
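
If you record these statuses for a batch of documents (for example, by copying them from the Documents tab), the sketch below shows one way to model them and summarize progress. The IngestionStatus enum and the summarize helper are hypothetical names for illustration; only the status labels themselves come from the table above.

```python
from collections import Counter
from enum import Enum


class IngestionStatus(Enum):
    QUEUED = "Queued"
    PARSING = "Parsing"
    CHUNKING = "Chunking"
    EMBEDDING = "Embedding"
    INDEXING = "Indexing"
    COMPLETE = "Complete"
    ERROR = "Error"


# Statuses after which a document will not change state on its own.
TERMINAL_STATUSES = {IngestionStatus.COMPLETE, IngestionStatus.ERROR}


def summarize(status_labels):
    """Count documents that are complete, failed, or still in progress."""
    counts = Counter(IngestionStatus(label) for label in status_labels)
    in_progress = sum(
        n for status, n in counts.items() if status not in TERMINAL_STATUSES
    )
    return {
        "complete": counts[IngestionStatus.COMPLETE],
        "error": counts[IngestionStatus.ERROR],
        "in_progress": in_progress,
    }
```

A document is ready for RAG only once it reaches Complete; anything in Error needs attention before it becomes searchable.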

Tip: Adjust chunk size based on your use case. Smaller chunks work well for precise Q&A, while larger chunks preserve more context for summarization tasks. Experiment with the chunk size and overlap settings to find the right balance for your content.

Important: OCR and audio transcription can be resource-intensive. If you are processing many image, audio, or video files simultaneously, monitor your server resources to avoid bottlenecks.

Next Steps