Document Q&A¶

Paperless NGX Dedupe can answer natural language questions about your documents using Retrieval-Augmented Generation (RAG). Ask questions like "What was my electricity bill last quarter?" or "Find contracts mentioning penalty clauses" and get answers with citations back to the source documents.

How It Works¶

The Q&A system combines vector search (semantic similarity) with full-text search (exact keyword matching) to find the most relevant document passages, then sends them to a large language model to generate a grounded answer.

flowchart LR
    A[Your Question] --> B[Generate Embedding]
    B --> C[Vector Search]
    A --> D[Full-Text Search]
    C --> E[Rank Fusion]
    D --> E
    E --> F[Build Context]
    F --> G[Stream LLM Answer]
    G --> H[Answer + Citations]
    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#e8eaf6,stroke:#3f51b5
    style D fill:#e8eaf6,stroke:#3f51b5
    style E fill:#e8eaf6,stroke:#3f51b5
    style F fill:#e8eaf6,stroke:#3f51b5
    style G fill:#e8eaf6,stroke:#3f51b5
    style H fill:#e8eaf6,stroke:#3f51b5

The RAG Pipeline¶

Indexing — Your documents' OCR text is split into overlapping chunks (~400 tokens each, configurable). Each chunk is converted to a vector using OpenAI's embedding model (default: text-embedding-3-small at 1536 dimensions) and stored alongside a full-text search index. This happens once per document and is incremental — only new or modified documents are re-indexed.
Retrieval — When you ask a question, it's converted to a vector and compared against all stored chunks using cosine similarity (via sqlite-vec). Simultaneously, the same query runs against the SQLite FTS5 full-text index. Results from both searches are combined using Reciprocal Rank Fusion (RRF), which produces a single ranked list without needing to calibrate score scales.
Generation — The top-ranked chunks are assembled into a context prompt and sent to your configured OpenAI model. The model generates an answer grounded in the retrieved context, and the response is streamed token-by-token to the UI.
Citations — Each answer includes source citations showing which documents contributed to the response, with relevance scores and text excerpts.

Why Hybrid Search?¶

Vector search excels at understanding meaning ("electricity costs" matches "energy bill") but can miss exact terms. Full-text search is precise for specific identifiers (invoice numbers, names) but misses paraphrases. Combining both gives the best of both worlds — especially important for OCR text which may contain noise.

Zero Extra Infrastructure¶

Unlike most RAG systems that require a separate vector database (Qdrant, Pinecone, etc.), Document Q&A uses sqlite-vec — a SQLite extension that stores vectors directly in your existing database file. No additional containers, no separate backups, no extra configuration.

Setup¶

Document Q&A is disabled by default. Enable it with environment variables:

Variable	Required	Default	Notes
`RAG_ENABLED`	No	`false`	Master switch for Document Q&A
`AI_OPENAI_API_KEY`	Yes (when enabled)	-	Required for generating embeddings and answers

OpenAI key required

Both embeddings and answer generation use OpenAI models. The OpenAI key is mandatory when RAG_ENABLED=true.

RAG_ENABLED is independent of AI_ENABLED — you can use Document Q&A without enabling AI classification, or vice versa.

Configuration¶

After enabling, configure Q&A behavior in Settings > Document Q&A or via PUT /api/v1/rag/config. All settings are stored in the database and take effect immediately.

Embedding Settings¶

Setting	Default	Range	Description
`embeddingModel`	`text-embedding-3-small`	see below	OpenAI embedding model
`embeddingDimensions`	`1536`	256–3072	Vector dimensions (lower = less storage, slightly less accuracy)

Available embedding models:

Model	Dimensions	Cost	Notes
`text-embedding-3-small`	1536	$0.02/1M tokens	Best cost/performance ratio (recommended)
`text-embedding-3-large`	3072	$0.13/1M tokens	Higher quality, 6.5× more expensive

Chunking Settings¶

Setting	Default	Range	Description
`chunkSize`	`400`	100–2000	Target tokens per chunk
`chunkOverlap`	`40`	0–500	Overlap tokens between consecutive chunks

Retrieval Settings¶

Setting	Default	Range	Description
`topK`	`20`	1–100	Number of chunks retrieved per query
`maxContextTokens`	`8000`	500–100,000	Max tokens of retrieved context sent to the answer model

Answer Model¶

The model that generates answers from retrieved context is configured independently from AI Processing:

Setting	Default	Range	Description
`answerProvider`	`openai`	`openai`	LLM provider for answers
`answerModel`	`gpt-5.4-mini`	see AI Processing	Model identifier
`systemPrompt`	built-in	string	System instructions for the answer model
`autoIndex`	`false`	boolean	Auto-run indexing after document sync
`concurrentBatches`	`5`	1–20	Number of concurrent embedding API batches during indexing

Indexing Documents¶

Before you can ask questions, your documents must be indexed. Indexing converts document text into vector embeddings and builds the full-text search index.

Starting Indexing¶

There are three ways to index:

Manual — Click "Index Now" on the /ask page or "Rebuild Index" in Settings
Auto-index — Enable autoIndex in settings to automatically index after each sync
API — POST /api/v1/rag/index

Incremental vs. Full Rebuild¶

By default, indexing is incremental — only documents that are new or whose content has changed since last indexing are processed. To force a full re-index (useful after changing the embedding model or dimensions), use the "Rebuild Index" button in Settings or pass { "rebuild": true } to the API.

Changing embedding model requires a rebuild

If you change embeddingModel or embeddingDimensions, you must rebuild the index. Old embeddings are incompatible with the new model and search results will be poor until the index is rebuilt.

Cost Estimation¶

Embedding costs depend on the model and total document text:

text-embedding-3-small: ~$0.02 per 1M tokens
A typical 500-word document is ~625 tokens
1,000 documents ≈ 625K tokens ≈ $0.01
10,000 documents ≈ 6.25M tokens ≈ $0.13

Incremental indexing only processes new or changed documents, so recurring costs are minimal.

Using the Q&A Interface¶

Navigate to Ask Documents in the sidebar to open the chat interface.

Asking Questions¶

Type your question in the input bar and press Enter or click the send button. The answer streams in token-by-token with a typing indicator.

Tips for effective questions:

Be specific — "What was my British Gas electricity bill for Q1 2024?" works better than "electricity bills"
Reference document types — "Find all invoices from Amazon" uses both semantic and keyword matching
Ask about content — "What are the penalty clauses in my lease agreement?" searches inside documents, not just titles

Source Citations¶

Each answer includes expandable source citations showing:

Document title — which document the information came from
Excerpt — the relevant text passage
Match score — how relevant the passage was to your question

Conversations¶

Conversations are persisted in the database. The sidebar shows past conversations ordered by most recent, with the ability to:

Resume any past conversation
Start a new conversation
Delete conversations you no longer need

Multi-turn conversations maintain context — follow-up questions can reference earlier parts of the conversation.

Tips¶

Best practices

Index before asking — The Q&A feature requires indexed documents. If you see zero chunks, run indexing first.
Start with the default settings — The defaults work well for most document libraries. Tune only if retrieval quality is poor.
Increase topK for broad questions — Questions spanning many documents benefit from more retrieved chunks (15–20). Narrow questions work fine with fewer (5–10).
Increase maxContextTokens for detailed answers — If answers are too brief or miss details, increase the context budget. Be aware this increases per-query token costs.
Use a powerful answer model — Unlike batch classification where cost matters, Q&A is interactive and benefits from stronger models. Consider gpt-5.4 for best results.
Enable auto-index — If you sync documents regularly, enabling auto-index keeps the search index up to date automatically.

Hybrid search for OCR documents

OCR text often contains errors (misread characters, merged words). The hybrid search approach is particularly valuable here — vector search understands meaning despite typos, while full-text search catches exact terms the vector search might miss.

Technical Details¶

Storage¶

All RAG data is stored in the same SQLite database file:

document_chunk — Chunk text, metadata, and content hash (Drizzle-managed table)
document_chunk_vec — Vector embeddings (sqlite-vec virtual table)
document_chunk_fts — Full-text search index (SQLite FTS5 virtual table)
rag_conversation — Conversation sessions
rag_message — Chat messages with source citations

Scale Considerations¶

sqlite-vec uses linear scan (no approximate nearest neighbor indexes). Performance by collection size:

Documents	Approximate Chunks	Query Latency
< 1,000	< 5K	< 10ms
1,000–10,000	5K–50K	< 50ms
10,000–50,000	50K–250K	< 200ms
50,000–100,000	250K–500K	< 500ms

For most Paperless-NGX installations (< 50K documents), performance is excellent.

Search Fusion Algorithm¶

Reciprocal Rank Fusion (RRF) combines results from vector and full-text search:

\[\text{score}(d) = \sum_{i} \frac{1}{k + \text{rank}_i(d)}\]

Where $k = 60$ (standard constant) and $\text{rank}_i(d)$ is the rank of document $d$ in result set $i$. Documents appearing in both result sets receive higher fused scores.