How It Works¶
Paperless NGX Dedupe uses a multi-stage pipeline to identify duplicate documents efficiently. This page explains each stage in plain terms.
Overview¶
The pipeline works in three broad phases:
- Sync documents from Paperless-NGX and prepare their text
- Index documents using probabilistic data structures (MinHash + LSH) to find candidate pairs without comparing every document to every other
- Score and group candidates using Jaccard and fuzzy text similarity, then cluster them for review
flowchart LR
A[Sync] --> B[Shingle]
B --> C[MinHash]
C --> D[LSH]
D --> E[Score]
E --> F[Cluster]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#e8eaf6,stroke:#3f51b5
style D fill:#e8eaf6,stroke:#3f51b5
style E fill:#e8eaf6,stroke:#3f51b5
style F fill:#e8eaf6,stroke:#3f51b5
Step 1: Document Sync¶
When you trigger a sync, Paperless NGX Dedupe fetches documents from your Paperless-NGX instance via its REST API.
For each document, the sync process:
- Stores metadata (title, correspondent, document type, tags, dates)
- Extracts the full OCR text content
- Normalizes the text: lowercases, strips punctuation, collapses whitespace
- Computes a fingerprint (hash of the normalized text) so future syncs can skip unchanged documents
After the first full sync, incremental syncs only fetch documents that have changed.
Step 2: Shingling¶
Before comparing documents, the normalized text is split into overlapping word groups called shingles (also known as n-grams).
With the default ngramSize of 3, the sentence "the quick brown fox jumps" produces these shingles:
- "the quick brown"
- "quick brown fox"
- "brown fox jumps"
Each document becomes a set of shingles. Two documents that share many shingles have similar content, measured by the Jaccard similarity:
The shingle set is the foundation for all subsequent steps.
Documents with fewer words than minWords (default: 20) are skipped because short documents produce too few shingles for reliable comparison.
Step 3: MinHash Signatures¶
Comparing shingle sets directly is expensive -- each set can contain thousands of entries, and comparing every pair of documents would be O(n^2).
MinHash compresses each shingle set into a compact fixed-size signature (a list of 256 numbers by default, controlled by numPermutations). The key property is:
The probability that two MinHash signatures agree at any position equals the Jaccard similarity of the original shingle sets.
This means we can estimate how similar two documents are by comparing their short signatures instead of their full shingle sets. The more permutations, the more accurate the estimate, but at the cost of more memory and CPU.
Step 4: Locality-Sensitive Hashing (LSH)¶
Even with compact signatures, comparing every pair of documents is still O(n^2). LSH solves this by only comparing documents that are likely to be similar.
The technique works by dividing each signature into bands (default: 32). Each band is a short segment of the signature. Documents are placed into hash buckets based on each band. Two documents that land in the same bucket for any band become a candidate pair.
The band/row structure creates an S-curve probability threshold:
- Document pairs with high similarity are almost certain to be candidates
- Pairs with low similarity are almost certain to be filtered out
- The transition region depends on the number of bands and rows per band
With 256 permutations and 32 bands, the effective candidate funnel is tuned for moderate recall before the final similarityThreshold filter (default: 0.75) removes weaker matches.
Step 5: Similarity Scoring¶
Each candidate pair from LSH is scored across two dimensions:
-
Jaccard Similarity (default weight: 55%) -- Set overlap of shingle sets, estimated from MinHash signatures. Measures how much text content the two documents share.
-
Fuzzy Text Similarity (default weight: 45%) -- Edit-distance-based ratio computed on a character sample of the normalized text (controlled by
fuzzySampleSize). Catches cases where documents have similar content but different word order or minor variations.
The overall confidence score is the weighted combination of both dimensions. Pairs scoring below similarityThreshold (default: 0.75) are discarded.
Step 6: Clustering¶
Scored pairs are grouped into clusters using a union-find (disjoint-set) data structure. If document A is similar to B, and B is similar to C, all three end up in the same group -- even if A and C were not directly compared.
Each cluster becomes a duplicate group in the database with:
- A confidence score (derived from the strongest pair in the group)
- Individual similarity dimension scores
- Member documents with a designated primary (the one to keep)
Groups are presented in the web UI for review, sorted by confidence.
Tuning Guide¶
Too many false positives (unrelated documents grouped together)¶
- Raise
similarityThreshold(e.g., 0.85 or 0.90) to require stronger matches - Adjust confidence weights to shift emphasis between Jaccard and fuzzy text matching
- Reduce
numBandsto narrow the LSH candidate funnel
Missing obvious duplicates¶
- Lower
similarityThreshold(e.g., 0.60) to allow weaker matches through - Increase
numBandsto widen the candidate funnel - Lower
minWordsif short documents are being skipped - Check
ngramSize-- a smaller value (2) is more sensitive to small differences
Slow analysis on large libraries¶
- Reduce
numPermutations(e.g., 128) for faster signature generation at the cost of some accuracy - Reduce
fuzzySampleSize(e.g., 2000) to speed up fuzzy text comparison - Increase
similarityThresholdto reduce the number of pairs that need scoring
Understanding the relationship between bands and permutations¶
numPermutations must be evenly divisible by numBands. The number of rows per band is numPermutations / numBands. More rows per band means a higher effective LSH threshold (fewer candidates, higher precision). More bands means a lower effective threshold (more candidates, higher recall).
| Permutations | Bands | Rows/Band | Effect |
|---|---|---|---|
| 256 | 32 | 8 | Default balance of recall/precision |
| 256 | 16 | 16 | Fewer candidates, higher precision |
| 128 | 16 | 8 | Faster, with less stable estimates |
| 512 | 32 | 16 | More stable estimates, higher CPU |
See Also¶
- Architecture -- monorepo structure, data flow diagrams, and database schema
- Configuration -- all algorithm parameters with ranges and defaults