Configuration Reference
Paperless NGX Dedupe uses:
- Environment variables for server/runtime behavior
- Dedup settings stored in the app database and editable at runtime
Environment Variables
Core Runtime
| Variable | Required | Default | Notes |
|---|---|---|---|
| PAPERLESS_URL | Yes | - | Full Paperless-NGX URL (for example http://paperless:8000) |
| PAPERLESS_API_TOKEN | Yes* | - | Preferred auth method |
| PAPERLESS_USERNAME | No | - | Use with PAPERLESS_PASSWORD when not using token |
| PAPERLESS_PASSWORD | No | - | Use with PAPERLESS_USERNAME |
| DATABASE_URL | No | ./data/paperless-ngx-dedupe.db | SQLite file path |
| PORT | No | 3000 | Web/API listen port |
| LOG_LEVEL | No | info | debug, info, warn, error |
| CORS_ALLOW_ORIGIN | No | empty | Empty = same-origin only; * = allow all |
| AUTO_MIGRATE | No | true | Auto-run DB schema migration on startup |
* Provide either PAPERLESS_API_TOKEN or both PAPERLESS_USERNAME + PAPERLESS_PASSWORD.
If both token and username/password are set, token is used first.
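The precedence described above can be sketched as a small preflight check to run before starting the server. The helper name `auth_mode` is hypothetical and not part of the application; it only mirrors the documented rule that the token is tried first and that username and password must be supplied together.

```shell
# Hypothetical preflight helper (not part of Paperless NGX Dedupe).
# Prints which auth method the documented precedence would pick.
auth_mode() {
  if [ -n "${PAPERLESS_API_TOKEN:-}" ]; then
    echo "token"   # token is used first, even if username/password are also set
  elif [ -n "${PAPERLESS_USERNAME:-}" ] && [ -n "${PAPERLESS_PASSWORD:-}" ]; then
    echo "basic"   # both halves of the username/password pair are present
  else
    echo "none"    # neither method is fully configured
    return 1
  fi
}
```

Calling `auth_mode` in a startup script and aborting on a non-zero status would catch a half-configured username/password pair early.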
Container Runtime
| Variable | Required | Default | Notes |
|---|---|---|---|
| PUID | No | 1000 | UID used inside the container |
| PGID | No | 1000 | GID used inside the container |
SvelteKit / Proxy
| Variable | Required | Default | Notes |
|---|---|---|---|
| ORIGIN | Usually no | - | Set when running behind reverse proxies or non-localhost hostnames to satisfy origin checks |
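As a sketch, a deployment reached through a reverse proxy might set ORIGIN to the externally visible URL. The hostname `dedupe.example.com` is a placeholder, not a value taken from this project.

```shell
# Placeholder hostname; use the scheme and host that browsers actually see.
export ORIGIN=https://dedupe.example.com
```

The same key/value pair can go in a .env file or a container environment block instead of a shell export.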
AI Processing (Optional)
| Variable | Required | Default | Notes |
|---|---|---|---|
| AI_ENABLED | No | false | Enable AI-powered document classification |
| AI_OPENAI_API_KEY | When AI enabled | - | OpenAI API key |
The API key is required when AI_ENABLED=true. Runtime settings (model, prompt, etc.) are configured in the Settings page or via API. See AI Processing for full details.
Document Q&A / RAG (Optional)
| Variable | Required | Default | Notes |
|---|---|---|---|
| RAG_ENABLED | No | false | Enable natural language Q&A across your documents |
| AI_OPENAI_API_KEY | When RAG enabled | - | Required for generating embeddings and answers |
RAG_ENABLED is independent of AI_ENABLED — you can use Q&A without AI classification, or both. The OpenAI key is always required when RAG is enabled. Runtime settings (embedding model, chunk size, answer model, etc.) are configured in the Settings page or via API. See Document Q&A for full details.
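The key requirement for the two independent flags can be sketched as a shell check. The function names are hypothetical, and how the application itself reacts to a missing key is not covered here.

```shell
# Hypothetical helpers (not part of the application).
# True when either optional feature that needs OpenAI is switched on.
openai_key_required() {
  [ "${AI_ENABLED:-false}" = "true" ] || [ "${RAG_ENABLED:-false}" = "true" ]
}

# Fails with a message if a feature is enabled but the key is missing.
check_openai_key() {
  if openai_key_required && [ -z "${AI_OPENAI_API_KEY:-}" ]; then
    echo "AI_OPENAI_API_KEY must be set when AI_ENABLED or RAG_ENABLED is true" >&2
    return 1
  fi
}
```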
Observability (Optional)
OpenTelemetry is off unless OTEL_ENABLED=true. Common vars:
- OTEL_ENABLED
- OTEL_SERVICE_NAME
- OTEL_EXPORTER_OTLP_ENDPOINT (or per-signal endpoints)
- OTEL_TRACES_EXPORTER, OTEL_METRICS_EXPORTER, OTEL_LOGS_EXPORTER
See .env.example for the full list.
| Variable | Required | Default | Notes |
|---|---|---|---|
| OTEL_SERVICE_NAMESPACE | No | paperless-dedupe | Groups frontend and backend as one app in Grafana Cloud App Observability |
| OTEL_EXPORTER_OTLP_COMPRESSION | No | (none) | Set to gzip for Grafana Cloud (recommended) |
| OTEL_SEMCONV_STABILITY_OPT_IN | No | (none) | Set to database to use stable DB semantic conventions |
Continuous Profiling (Optional)
| Variable | Required | Default | Notes |
|---|---|---|---|
| PYROSCOPE_ENABLED | No | false | Enable wall-time and heap profiling |
| PYROSCOPE_SERVER_ADDRESS | When Pyroscope enabled | - | Grafana Cloud Pyroscope endpoint or self-hosted URL |
| PYROSCOPE_BASIC_AUTH_USER | For Grafana Cloud | - | Grafana Cloud instance ID |
| PYROSCOPE_BASIC_AUTH_PASSWORD | For Grafana Cloud | - | Grafana Cloud API key |
Profiles are labeled by operation (sync, analysis, ai_batch, worker) for flame graph filtering.
Prometheus Scrape Endpoint (Optional)
| Variable | Required | Default | Notes |
|---|---|---|---|
| OTEL_PROMETHEUS_ENABLED | No | false | Expose a Prometheus scrape endpoint at /api/v1/metrics |
When enabled, all application metrics (sync, analysis, jobs, AI, observable gauges) are available in Prometheus exposition format at GET /api/v1/metrics. This can be used standalone (without OTEL_ENABLED) or alongside full OTEL for both push and pull metrics.
When both are active, the Prometheus endpoint exposes the same metrics as the OTLP pipeline.
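A quick way to confirm the endpoint is live is to fetch it by hand. The sketch below assumes the default port 3000 and no extra authentication in front of the endpoint. `DEDUPE_URL` and `scrape_metrics` are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: fetch the scrape endpoint and drop comment lines.
# In the Prometheus exposition format, HELP/TYPE lines start with '#'.
# DEDUPE_URL is an assumed variable for the app's base URL.
scrape_metrics() {
  curl -fsS "${DEDUPE_URL:-http://localhost:3000}/api/v1/metrics" | grep -v '^#'
}
```

For example, `scrape_metrics | grep '^paperless_'` would narrow the output to metrics whose names start with `paperless_`, if any are being exported.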
Paperless-NGX System Metrics (Optional)
When enabled, Paperless NGX Dedupe collects system-level metrics from your Paperless-NGX instance — storage, document counts, tags, correspondents, and more. This provides the same observability as running a separate prometheus-paperless-exporter container, but delivered through whichever metrics pipeline you have active (OTLP, Prometheus, or both) — one fewer container to manage.
Metric names match the Prometheus exporter exactly (e.g. paperless_status_storage_total_bytes, paperless_statistics_documents_total) for Grafana dashboard compatibility.
Note: this feature is opt-in separately from OTEL_ENABLED / OTEL_PROMETHEUS_ENABLED because the collectors poll the Paperless-NGX API every export interval (~60s), which adds load to your Paperless-NGX instance. If that is a concern, enable only the collectors you need.
| Variable | Required | Default | Notes |
|---|---|---|---|
| PAPERLESS_METRICS_ENABLED | No | false | Enable Paperless system metrics collection. Requires OTEL_ENABLED=true or OTEL_PROMETHEUS_ENABLED=true. |
| PAPERLESS_METRICS_COLLECTORS | No | all | Comma-separated list of collectors to enable |
Available collectors:
| Collector | API Calls | Description |
|---|---|---|
| status | 1 | Storage, database, Redis, Celery, index, classifier, and sanity check status |
| statistics | 1 + paginated | Document totals, inbox count, file type breakdown, character count, metadata counts |
| document | 1 | Total document count |
| tag | paginated | Per-tag info, document counts, inbox flag |
| correspondent | paginated | Per-correspondent info, document counts, last correspondence timestamp |
| document_type | paginated | Per-document-type info and document counts |
| storage_path | paginated | Per-storage-path info and document counts |
| task | 1 | Background task info, status, timestamps |
| group | 1 | User group count |
| user | 1 | User count |
| remote_version | 1 | Update availability check (causes Paperless-NGX to make an outbound network call) |
All collectors are enabled by default. To enable only specific collectors, set PAPERLESS_METRICS_COLLECTORS to a comma-separated list of their names.
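For instance, a setup that wants only the cheaper, unlabeled collectors might look like the sketch below. The choice of collectors is illustrative. The values are shown as shell exports, but the same key/value pairs can go in a .env file or a container environment block.

```shell
# One metrics pipeline must be active for Paperless system metrics
# (OTEL_ENABLED=true would work here as well).
export OTEL_PROMETHEUS_ENABLED=true
export PAPERLESS_METRICS_ENABLED=true
# Illustrative selection: skips the per-tag, per-correspondent,
# per-document-type, and per-storage-path collectors.
export PAPERLESS_METRICS_COLLECTORS=status,statistics,document
```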
Metrics are collected on the same interval as OTEL metric exports (controlled by OTEL_METRIC_EXPORT_INTERVAL, default 60s). Instances with many tags, correspondents, or document types will produce proportionally more time series from the labeled collectors (tag, correspondent, document_type, storage_path). Disable these if cardinality is a concern.
Credit: metric definitions and collector design inspired by prometheus-paperless-exporter by hansmi.
Deduplication Settings
Change these in Settings or via PUT /api/v1/config/dedup.
Algorithm Parameters
| Setting | Default | Range | Notes |
|---|---|---|---|
| numPermutations | 256 | 16-1024 | MinHash signature length |
| numBands | 32 | 1-100 | LSH bands; should divide numPermutations evenly |
| ngramSize | 3 | 1-10 | Word shingle size |
| minWords | 20 | 1-1000 | Skip very short docs below this |
| similarityThreshold | 0.75 | 0-1 | Minimum overall similarity to keep a pair |
| fuzzySampleSize | 10000 | 100-100000 | Character sample size for fuzzy compare |
| autoAnalyze | true | boolean | Auto-run analysis after sync |
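The note that numBands should divide numPermutations evenly can be checked before sending an update. In standard LSH banding, the signature is split into bands of equal length, so the quotient is the number of rows per band. The helper `rows_per_band` is hypothetical, and whether the server rejects non-divisible values is not stated here.

```shell
# Hypothetical helper: prints rows per band, or fails if the
# band count does not divide the signature length evenly.
rows_per_band() {
  if [ $(( $1 % $2 )) -ne 0 ]; then
    echo "numBands ($2) does not divide numPermutations ($1) evenly" >&2
    return 1
  fi
  echo $(( $1 / $2 ))
}

rows_per_band 256 32   # defaults from the table; prints 8
```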
Confidence Weights
The confidence model uses a 2-weight base score plus a discriminative penalty.
Base weights are integers 0-100 and must sum to 100:
| Setting | Default | Notes |
|---|---|---|
| confidenceWeightJaccard | 60 | Weight for Jaccard (set overlap) similarity |
| confidenceWeightFuzzy | 40 | Weight for fuzzy (edit distance) similarity |
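A client-side sanity check for the weight constraint could look like the following sketch. `weights_valid` is a hypothetical helper; the server's own validation and error format are not described here.

```shell
# Hypothetical helper: succeeds only if both weights are integers
# in 0-100 and together sum to 100.
weights_valid() {
  [ "$1" -ge 0 ] && [ "$1" -le 100 ] \
    && [ "$2" -ge 0 ] && [ "$2" -le 100 ] \
    && [ $(( $1 + $2 )) -eq 100 ]
}
```

For example, `weights_valid 70 30` succeeds, while `weights_valid 70 40` fails because the weights sum to 110.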
Discriminative penalty reduces confidence when template-based documents have different structured data (dates, amounts, invoice numbers, routes):
| Setting | Default | Range | Notes |
|---|---|---|---|
| discriminativePenaltyStrength | 70 | 0-100 | How aggressively differing structured data reduces confidence (0 = disabled) |
The final confidence formula is:
base = (jaccard × J_weight + fuzzy × F_weight) / (J_weight + F_weight)
final = base × (1 - penalty_strength/100 × (1 - discriminative_score))
When the discriminative score is high (documents share the same dates, amounts, and references), the penalty has little effect. When it is low (documents have different dates, amounts, invoice numbers, or routes despite sharing a template), the penalty reduces the confidence score.
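To make the formula concrete, the sketch below evaluates it with awk for the default weights (60/40) and default penalty strength (70). It assumes the similarity and discriminative scores are on a 0-1 scale. The input values 0.9, 0.8, and 0.5 are invented for illustration, and `confidence` is a hypothetical helper rather than the application's code.

```shell
# Arguments: jaccard fuzzy J_weight F_weight penalty_strength discriminative_score
confidence() {
  awk -v j="$1" -v f="$2" -v jw="$3" -v fw="$4" -v s="$5" -v d="$6" 'BEGIN {
    base  = (j * jw + f * fw) / (jw + fw)      # weighted average of the two scores
    final = base * (1 - s / 100 * (1 - d))     # penalty grows as d falls
    printf "%.3f\n", final
  }'
}

confidence 0.9 0.8 60 40 70 0.5   # base 0.86, penalized by 35%: prints 0.559
confidence 0.9 0.8 60 40 70 1.0   # identical structured data, no penalty: prints 0.860
```

With the same text similarity, halving the discriminative score drops the result from 0.860 to 0.559, which is enough to fall below the default similarityThreshold of 0.75 if the same cutoff is applied to the final score.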
Strength guidelines:
- Low (0-30%): Minimal impact. Monthly invoices or train tickets with different dates may still appear as duplicates.
- Medium (40-70%): Recommended for most libraries. Catches template-based false positives while keeping true duplicates intact.
- High (80-100%): Aggressive. Best for libraries with many monthly invoices, bank statements, or train/flight tickets. May over-penalize minor OCR differences in dates or amounts.
When any weight or penalty strength changes, existing group confidence scores are recalculated automatically.
Example API Updates
# Update threshold
curl -X PUT http://localhost:3000/api/v1/config/dedup \
-H 'Content-Type: application/json' \
-d '{"similarityThreshold":0.8}'
# Rebalance weights (must sum to 100)
curl -X PUT http://localhost:3000/api/v1/config/dedup \
-H 'Content-Type: application/json' \
-d '{
"confidenceWeightJaccard":70,
"confidenceWeightFuzzy":30
}'
# Adjust discriminative penalty strength (0 = disabled, 100 = maximum)
curl -X PUT http://localhost:3000/api/v1/config/dedup \
-H 'Content-Type: application/json' \
-d '{"discriminativePenaltyStrength":75}'