How OpenRails processes documents through parsing, chunking, embedding, and indexing
The document ingestion pipeline is the automated workflow that transforms raw uploaded files into embedded, searchable content for RAG. Every document passes through four stages in order: Parse, Chunk, Embed, and Index.
In the Parse stage, raw files are converted to plain text. The parsing method depends on the file type:
In the Chunk stage, the extracted text is split into smaller segments called chunks. Chunking parameters are configurable per data lake:
Overlap repeats a portion of the text at the end of one chunk at the start of the next. This preserves context that spans a chunk boundary, which improves retrieval quality.
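To make the role of overlap concrete, here is a minimal sketch of sliding-window chunking. It is an illustration only: the function name `chunk_text`, the `chunk_size` and `overlap` parameter names, and the choice to measure size in characters are assumptions, and the actual OpenRails chunker may use different units (for example tokens) and different defaults.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` characters.

    Hypothetical helper: sizes are in characters, which may not match
    how the real pipeline measures chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Each chunk starts with the last 2 characters of the previous one.
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A sentence that straddles the boundary between two chunks appears at least partly in both, so a query matching that sentence can still retrieve a chunk containing it. A larger overlap preserves more context but produces more chunks to embed and store.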
In the Embed stage, each chunk is converted into a vector embedding using the configured embedding model. These dense vectors capture the semantic meaning of the text, so chunks with similar meaning map to nearby vectors.
In the Index stage, embeddings are stored in a vector collection, and optionally in a knowledge graph for Graph RAG. The index enables fast similarity search during retrieval.
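The embed and index stages can be illustrated with a small, self-contained sketch. Everything here is a stand-in: `toy_embed` is a hashed bag-of-words function, not a real embedding model, and `InMemoryIndex` is a hypothetical brute-force list rather than the vector collection OpenRails uses. The sketch only shows the general shape of the idea: text becomes a vector, vectors are stored alongside their chunks, and retrieval ranks chunks by cosine similarity to the query vector.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in for an embedding model: hashed word counts, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryIndex:
    """Hypothetical brute-force index; real vector stores use faster search."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, list[float]]] = []

    def add(self, chunk: str) -> None:
        self._entries.append((chunk, toy_embed(chunk)))

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
        q = toy_embed(query)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(q, vec)), chunk)
            for chunk, vec in self._entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


index = InMemoryIndex()
index.add("the cat sat on the mat")
index.add("quarterly revenue grew")
best_score, best_chunk = index.search("the cat sat on the mat", top_k=1)[0]
print(best_chunk, round(best_score, 3))
```

Unlike this toy function, a real embedding model places paraphrases close together even when they share no words, which is what makes semantic retrieval possible.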
Track the status of document ingestion from the data lake's Documents tab:
| Status | Description |
|---|---|
| Queued | File uploaded and waiting to be processed |
| Parsing | Text extraction in progress |
| Chunking | Text being split into chunks |
| Embedding | Chunks being converted to vector embeddings |
| Indexing | Vectors being stored in the vector database |
| Complete | Document fully processed and available for RAG |
| Error | Processing failed; check error details for more information |
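The statuses in the table can be modeled as a simple enumeration with two terminal states, Complete and Error. The sketch below is an assumption-laden illustration of how a client might wait for ingestion to finish: `fetch_status` is a hypothetical caller-supplied function that returns the current status string, since this page does not describe a programmatic status API.

```python
import time
from enum import Enum
from typing import Callable


class IngestionStatus(Enum):
    QUEUED = "Queued"
    PARSING = "Parsing"
    CHUNKING = "Chunking"
    EMBEDDING = "Embedding"
    INDEXING = "Indexing"
    COMPLETE = "Complete"
    ERROR = "Error"


# Once a document reaches one of these, its status no longer changes.
TERMINAL = {IngestionStatus.COMPLETE, IngestionStatus.ERROR}


def wait_for_ingestion(
    fetch_status: Callable[[], str],  # hypothetical: returns e.g. "Parsing"
    poll_interval: float = 2.0,
    timeout: float = 300.0,
) -> IngestionStatus:
    """Poll until the document reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = IngestionStatus(fetch_status())
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"ingestion still at {status.value!r} after {timeout}s")
        time.sleep(poll_interval)


# Simulated status source standing in for a real lookup.
simulated = iter(["Queued", "Parsing", "Chunking", "Embedding", "Indexing", "Complete"])
print(wait_for_ingestion(lambda: next(simulated), poll_interval=0))
```

A caller would treat `IngestionStatus.ERROR` as a signal to inspect the document's error details rather than retry blindly.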