Paperless NGX Dedupe
Intelligent document deduplication, AI metadata extraction, and document Q&A for Paperless-NGX
Features
Intelligent Duplicate Detection
MinHash signatures combined with Locality-Sensitive Hashing provide efficient O(n log n) candidate discovery — no need to compare every document against every other.
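As a rough illustration of how MinHash and LSH banding avoid all-pairs comparison, here is a minimal sketch using only the Python standard library. The function names, the 3-word shingle size, the 64 hash functions, and the 16 bands are assumptions chosen for the example; they are not the project's actual implementation or defaults.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles in the text (k is an assumed value)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "little"
            )
            for s in items
        ))
    return signature


def lsh_candidates(signatures: dict[str, list[int]], bands: int = 16) -> set[tuple[str, str]]:
    """Split each signature into bands; documents sharing any band become candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        # Only documents that landed in the same bucket are ever paired up
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only the pairs returned by `lsh_candidates` need to go on to full scoring. Documents with no shared band are never compared directly.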
Multi-Dimensional Scoring
Two weighted dimensions — Jaccard text overlap and fuzzy text matching — are combined into a base score, then a discriminative penalty down-scores pairs that share only boilerplate text. All weights and penalty strength are configurable.
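To make the scoring idea concrete, the sketch below combines a Jaccard score and a fuzzy score into a weighted base score, then scales it down when the overlap comes mostly from boilerplate. Everything specific here is an assumption for illustration: the weights, the `penalty_strength` value, the use of `difflib.SequenceMatcher` as the fuzzy matcher, the caller-supplied `boilerplate` token set, and the penalty formula itself. The project's real formula may differ.

```python
from difflib import SequenceMatcher


def jaccard(a: set[str], b: set[str]) -> float:
    """Intersection size over union size (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_pair(
    text_a: str,
    text_b: str,
    boilerplate: set[str],          # hypothetical: tokens common across the library
    w_jaccard: float = 0.6,         # assumed weight
    w_fuzzy: float = 0.4,           # assumed weight
    penalty_strength: float = 0.5,  # assumed strength, 0.0 disables the penalty
) -> float:
    a, b = text_a.lower(), text_b.lower()
    tokens_a, tokens_b = set(a.split()), set(b.split())

    # Base score: weighted average of the two dimensions
    fuzzy = SequenceMatcher(None, a, b).ratio()
    base = (w_jaccard * jaccard(tokens_a, tokens_b) + w_fuzzy * fuzzy) / (w_jaccard + w_fuzzy)

    # Discriminative overlap: Jaccard over tokens that are not boilerplate
    distinct = jaccard(tokens_a - boilerplate, tokens_b - boilerplate)

    # The less the distinctive content overlaps, the larger the penalty
    return base * (1.0 - penalty_strength * (1.0 - distinct))
```

With this shape, two statements from the same bank that share only the letterhead keep a non-zero base score but are pushed down, while true duplicates are left essentially untouched.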
AI Metadata Extraction
Automatically extract correspondents, document types, and tags from document text using OpenAI models. Each suggestion includes a confidence score and evidence snippet, so you can review and apply results individually or in bulk.
RAG Document Q&A
Ask natural language questions about your document library. Hybrid search combines vector embeddings with full-text search via Reciprocal Rank Fusion, with multi-turn conversations and source citations for every answer.
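Reciprocal Rank Fusion merges ranked lists by summing `1 / (k + rank)` for each document across the lists, so it needs only rank positions and not comparable scores. The sketch below is a generic version. The constant `k = 60` is the commonly used default and the document IDs are made up; neither is confirmed as what this project uses.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings into one list ordered by RRF score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); missing documents contribute nothing
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["d1", "d2", "d3"]    # hypothetical embedding search results
fulltext_hits = ["d3", "d1", "d4"]  # hypothetical full-text search results
fused = reciprocal_rank_fusion([vector_hits, fulltext_hits])
# fused == ["d1", "d3", "d2", "d4"]
```

Documents that appear in both lists (`d1`, `d3`) rise above those found by only one retriever.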
Real-Time Processing
Background worker threads handle sync, analysis, AI extraction, and document indexing with real-time progress streamed via Server-Sent Events.
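For readers unfamiliar with Server-Sent Events, each message is plain text: an optional `event:` line, one or more `data:` lines, and a blank line to end the message. The sketch below formats a progress update this way. The event name `progress` and the payload fields are hypothetical, not the application's actual event schema.

```python
import json


def format_sse(event: str, payload: dict) -> str:
    """Serialize one Server-Sent Events message with a JSON body."""
    # json.dumps emits no raw newlines by default, so a single data: line is enough
    data = json.dumps(payload)
    # A blank line terminates the message
    return f"event: {event}\ndata: {data}\n\n"


message = format_sse("progress", {"job": "sync", "percent": 40})
# message == 'event: progress\ndata: {"job": "sync", "percent": 40}\n\n'
```

A browser client would typically receive these through `EventSource` and update a progress bar as each message arrives.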
Observability
Optional OpenTelemetry integration provides traces, metrics, and structured logs. A built-in Prometheus scrape endpoint and Paperless-NGX system metrics collector mean no extra exporter containers are needed.
Single Container
Deploy with Docker Compose using an embedded SQLite database. No Redis, no Postgres, no external dependencies beyond Paperless-NGX itself.
Quick Start
# 1. Create your configuration
cp .env.example .env
# Edit .env — set PAPERLESS_URL and PAPERLESS_API_TOKEN
# 2. Start the application
docker compose up -d
# 3. Open the web UI
# http://localhost:3000
# 4. Sync → Analyze → Review duplicates
See the Getting Started Guide for a full walkthrough.
Explore the Documentation
- Getting Started: First-run walkthrough covering document sync, analysis, and duplicate review
- Configuration: Environment variables, authentication methods, and algorithm tuning parameters
- API Reference: Complete REST API documentation with curl examples for every endpoint
- How It Works: The deduplication pipeline, from shingling, MinHash, and LSH through scoring and clustering
Community & Support
- GitHub: rknightion/paperless-ngx-dedupe
- Issues: Report bugs or request features
- Discussions: Community discussions
- Paperless-NGX: Official Paperless-NGX project