Configuration Reference

Paperless NGX Dedupe reads its configuration from two sources:

Environment variables for server/runtime behavior
Dedup settings stored in the app database and editable at runtime

Environment Variables

Core Runtime

Variable	Required	Default	Notes
`PAPERLESS_URL`	Yes	-	Full Paperless-NGX URL (for example `http://paperless:8000`)
`PAPERLESS_API_TOKEN`	Yes*	-	Preferred auth method
`PAPERLESS_USERNAME`	No	-	Use with `PAPERLESS_PASSWORD` when not using token
`PAPERLESS_PASSWORD`	No	-	Use with `PAPERLESS_USERNAME`
`DATABASE_URL`	No	`./data/paperless-ngx-dedupe.db`	SQLite file path
`PORT`	No	`3000`	Web/API listen port
`LOG_LEVEL`	No	`info`	`debug`, `info`, `warn`, `error`
`CORS_ALLOW_ORIGIN`	No	empty	Empty = same-origin only; `*` = allow all
`AUTO_MIGRATE`	No	`true`	Auto-run DB schema migration on startup

* Provide either PAPERLESS_API_TOKEN or both PAPERLESS_USERNAME and PAPERLESS_PASSWORD.

If both a token and a username/password pair are set, the token takes precedence.
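The defaults and the authentication rule above can be sketched in a few lines of Python. This is only an illustration: `resolve_core_config` and `CORE_DEFAULTS` are hypothetical names, not part of the application, and treating an empty value as unset is an assumption. Whether the app falls back to username/password when a token is rejected is not stated here, so the sketch does not model that.

```python
from typing import Mapping

# Defaults copied from the Core Runtime table.
CORE_DEFAULTS = {
    "DATABASE_URL": "./data/paperless-ngx-dedupe.db",
    "PORT": "3000",
    "LOG_LEVEL": "info",
    "CORS_ALLOW_ORIGIN": "",
    "AUTO_MIGRATE": "true",
}


def resolve_core_config(env: Mapping[str, str]) -> dict:
    """Hypothetical helper: apply defaults and pick an auth method."""
    if not env.get("PAPERLESS_URL"):
        raise ValueError("PAPERLESS_URL is required")

    # Assumption: an empty value is treated the same as an unset one.
    config = {name: env.get(name) or default
              for name, default in CORE_DEFAULTS.items()}
    config["PAPERLESS_URL"] = env["PAPERLESS_URL"]

    token = env.get("PAPERLESS_API_TOKEN")
    username = env.get("PAPERLESS_USERNAME")
    password = env.get("PAPERLESS_PASSWORD")

    if token:
        # The token wins even when username/password are also present.
        config["auth"] = "token"
    elif username and password:
        config["auth"] = "basic"
    else:
        raise ValueError(
            "Set PAPERLESS_API_TOKEN, or both PAPERLESS_USERNAME "
            "and PAPERLESS_PASSWORD"
        )
    return config
```

Passing `os.environ` as `env` would apply the same checks to the real process environment.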

Container Runtime

Variable	Required	Default	Notes
`PUID`	No	`1000`	UID used inside the container
`PGID`	No	`1000`	GID used inside the container

SvelteKit / Proxy

Variable	Required	Default	Notes
`ORIGIN`	Usually no	-	Set when running behind reverse proxies or non-localhost hostnames to satisfy origin checks

AI Processing (Optional)

Variable	Required	Default	Notes
`AI_ENABLED`	No	`false`	Enable AI-powered document classification
`AI_OPENAI_API_KEY`	When AI enabled	-	OpenAI API key

The API key is required when AI_ENABLED=true. Runtime settings (model, prompt, etc.) are configured in the Settings page or via API. See AI Processing for full details.

Document Q&A / RAG (Optional)

Variable	Required	Default	Notes
`RAG_ENABLED`	No	`false`	Enable natural language Q&A across your documents
`AI_OPENAI_API_KEY`	When RAG enabled	-	Required for generating embeddings and answers

RAG_ENABLED is independent of AI_ENABLED: you can use Q&A without AI classification, or enable both. The OpenAI key is always required when RAG is enabled. Runtime settings (embedding model, chunk size, answer model, etc.) are configured in the Settings page or via API. See Document Q&A for full details.

Observability (Optional)

OpenTelemetry is disabled unless OTEL_ENABLED=true. Common variables:

OTEL_ENABLED
OTEL_SERVICE_NAME
OTEL_EXPORTER_OTLP_ENDPOINT (or per-signal endpoints)
OTEL_TRACES_EXPORTER, OTEL_METRICS_EXPORTER, OTEL_LOGS_EXPORTER

See .env.example for the full list.

Continuous Profiling (Optional)

Variable	Required	Default	Notes
`PYROSCOPE_ENABLED`	No	`false`	Enable wall-time and heap profiling
`PYROSCOPE_SERVER_ADDRESS`	When Pyroscope enabled	-	Grafana Cloud Pyroscope endpoint or self-hosted URL
`PYROSCOPE_BASIC_AUTH_USER`	For Grafana Cloud	-	Grafana Cloud instance ID
`PYROSCOPE_BASIC_AUTH_PASSWORD`	For Grafana Cloud	-	Grafana Cloud API key

Profiles are labeled by operation (sync, analysis, ai_batch, worker) for flame graph filtering.

Prometheus Scrape Endpoint (Optional)

Variable	Required	Default	Notes
`OTEL_PROMETHEUS_ENABLED`	No	`false`	Expose a Prometheus scrape endpoint at `/api/v1/metrics`

When enabled, all application metrics (sync, analysis, jobs, AI, observable gauges) are available in Prometheus exposition format at GET /api/v1/metrics. This can be used standalone (without OTEL_ENABLED) or alongside full OTEL for both push and pull metrics.

When both are active, the Prometheus endpoint exposes the same metrics as the OTLP pipeline.
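To check the endpoint by hand, a small Python sketch like the following can fetch and parse the exposition text. The helper names `parse_metrics` and `fetch_metrics` are hypothetical, the base URL assumes the default `PORT`, and the parser assumes samples carry no trailing timestamp; it is not a full implementation of the Prometheus text format.

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict:
    """Map each sample (name plus any label set) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no timestamp, so the last token is the value.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


def fetch_metrics(base_url: str = "http://localhost:3000") -> dict:
    """Scrape GET /api/v1/metrics once and return the parsed samples."""
    with urlopen(f"{base_url}/api/v1/metrics") as response:
        return parse_metrics(response.read().decode("utf-8"))
```

For anything beyond a quick smoke test, a real Prometheus server or client library is the better tool, since it handles timestamps, escaping, and histograms.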

Paperless-NGX System Metrics (Optional)

When enabled, Paperless NGX Dedupe collects system-level metrics from your Paperless-NGX instance, including storage, document counts, tags, and correspondents. This provides the same observability as running a separate prometheus-paperless-exporter container, but delivered through whichever metrics pipeline you have active (OTLP, Prometheus, or both), so there is one fewer container to manage.

Metric names match the Prometheus exporter exactly (e.g. paperless_status_storage_total_bytes, paperless_statistics_documents_total) for Grafana dashboard compatibility.

Separately opt-in

This is opt-in independently of OTEL_ENABLED / OTEL_PROMETHEUS_ENABLED because collectors poll the Paperless-NGX API every export interval (~60s), adding load to your Paperless instance. Enable only the collectors you need if this is a concern.

Variable	Required	Default	Notes
`PAPERLESS_METRICS_ENABLED`	No	`false`	Enable Paperless system metrics collection. Requires `OTEL_ENABLED=true` or `OTEL_PROMETHEUS_ENABLED=true`.
`PAPERLESS_METRICS_COLLECTORS`	No	all	Comma-separated list of collectors to enable

Available collectors:

Collector	API Calls	Description
`status`	1	Storage, database, Redis, Celery, index, classifier, and sanity check status
`statistics`	1 + paginated	Document totals, inbox count, file type breakdown, character count, metadata counts
`document`	1	Total document count
`tag`	paginated	Per-tag info, document counts, inbox flag
`correspondent`	paginated	Per-correspondent info, document counts, last correspondence timestamp
`document_type`	paginated	Per-document-type info and document counts
`storage_path`	paginated	Per-storage-path info and document counts
`task`	1	Background task info, status, timestamps
`group`	1	User group count
`user`	1	User count
`remote_version`	1	Update availability check (causes Paperless-NGX to make an outbound network call)

All collectors are enabled by default. To enable only specific collectors:

PAPERLESS_METRICS_COLLECTORS=status,statistics,document

Metrics are collected on the same interval as OTEL metric exports (controlled by OTEL_METRIC_EXPORT_INTERVAL, default 60s). Instances with many tags, correspondents, or document types will produce proportionally more time series from the labeled collectors (tag, correspondent, document_type, storage_path). Disable these if cardinality is a concern.
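The selection rules in this section can be sketched as follows. `enabled_collectors` is a hypothetical helper, and raising on unknown names or on a missing metrics pipeline is an assumption made for illustration; the application may log a warning or ignore such values instead.

```python
from typing import Mapping

# Collector names copied from the table of available collectors.
KNOWN_COLLECTORS = (
    "status", "statistics", "document", "tag", "correspondent",
    "document_type", "storage_path", "task", "group", "user",
    "remote_version",
)


def _is_true(value) -> bool:
    return (value or "").strip().lower() == "true"


def enabled_collectors(env: Mapping[str, str]) -> list:
    """Hypothetical helper: resolve which collectors would run."""
    if not _is_true(env.get("PAPERLESS_METRICS_ENABLED")):
        return []

    # Collection needs at least one active metrics pipeline.
    if not (_is_true(env.get("OTEL_ENABLED"))
            or _is_true(env.get("OTEL_PROMETHEUS_ENABLED"))):
        raise ValueError(
            "PAPERLESS_METRICS_ENABLED requires OTEL_ENABLED=true "
            "or OTEL_PROMETHEUS_ENABLED=true"
        )

    raw = env.get("PAPERLESS_METRICS_COLLECTORS", "")
    if not raw.strip():
        return list(KNOWN_COLLECTORS)  # default: all collectors

    names = [name.strip() for name in raw.split(",") if name.strip()]
    unknown = [name for name in names if name not in KNOWN_COLLECTORS]
    if unknown:
        raise ValueError(f"Unknown collectors: {', '.join(unknown)}")
    return names
```

Leaving the four labeled collectors out of the list is the simplest way to keep time series counts low on large instances.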

Credit: metric definitions and collector design inspired by prometheus-paperless-exporter by hansmi.

Deduplication Settings

Change these in Settings or via PUT /api/v1/config/dedup.

Algorithm Parameters

Setting	Default	Range	Notes
`numPermutations`	`256`	16-1024	MinHash signature length
`numBands`	`32`	1-100	LSH bands; should divide `numPermutations` evenly
`ngramSize`	`3`	1-10	Word shingle size
`minWords`	`20`	1-1000	Documents with fewer words than this are skipped
`similarityThreshold`	`0.75`	0-1	Minimum overall similarity to keep a pair
`fuzzySampleSize`	`10000`	100-100000	Character sample size for fuzzy compare
`autoAnalyze`	`true`	boolean	Auto-run analysis after sync
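To see how `numPermutations` and `numBands` interact, the sketch below uses the textbook MinHash LSH banding scheme, where the signature is split into equal bands and two documents become a candidate pair if any band matches. That this application follows the textbook scheme exactly is an assumption, the function names are hypothetical, and the sketch treats an uneven split as a hard error purely for illustration.

```python
def rows_per_band(num_permutations: int = 256, num_bands: int = 32) -> int:
    """Rows in each band; numBands should divide numPermutations evenly."""
    if num_permutations % num_bands != 0:
        raise ValueError("numBands should divide numPermutations evenly")
    return num_permutations // num_bands


def candidate_probability(similarity: float,
                          num_permutations: int = 256,
                          num_bands: int = 32) -> float:
    """Chance that a pair with this Jaccard similarity shares a band."""
    rows = rows_per_band(num_permutations, num_bands)
    return 1.0 - (1.0 - similarity ** rows) ** num_bands


def approximate_lsh_threshold(num_permutations: int = 256,
                              num_bands: int = 32) -> float:
    """Similarity at which the candidate probability rises most steeply."""
    rows = rows_per_band(num_permutations, num_bands)
    return (1.0 / num_bands) ** (1.0 / rows)
```

With the defaults, each band has 8 rows and the approximate LSH threshold is about 0.65, below the default `similarityThreshold` of 0.75. Under this model, more bands make candidate generation more permissive, while fewer bands make it stricter.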

Confidence Weights

The confidence model combines a two-weight base score with a discriminative penalty.

Base weights are integers 0-100 and must sum to 100:

Setting	Default	Notes
`confidenceWeightJaccard`	`60`	Weight for Jaccard (set overlap) similarity
`confidenceWeightFuzzy`	`40`	Weight for fuzzy (edit distance) similarity

Discriminative penalty reduces confidence when template-based documents have different structured data (dates, amounts, invoice numbers, routes):

Setting	Default	Range	Notes
`discriminativePenaltyStrength`	`70`	0-100	How aggressively differing structured data reduces confidence (0 = disabled)

The final confidence formula is:

base  = (jaccard × J_weight + fuzzy × F_weight) / (J_weight + F_weight)
final = base × (1 - penalty_strength/100 × (1 - discriminative_score))

When the discriminative score is high (documents share the same dates, amounts, and references), the penalty has little effect. When it is low (documents have different dates, amounts, invoice numbers, or routes despite sharing a template), the penalty reduces the confidence score.
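The formula can be worked through with a short Python sketch. The function name `confidence` and the input values are made up for illustration, and the sketch assumes all three scores lie between 0 and 1.

```python
def confidence(jaccard: float,
               fuzzy: float,
               discriminative_score: float,
               jaccard_weight: int = 60,
               fuzzy_weight: int = 40,
               penalty_strength: int = 70) -> float:
    """Apply the documented formula with the default weights and strength."""
    base = (jaccard * jaccard_weight + fuzzy * fuzzy_weight) / (
        jaccard_weight + fuzzy_weight
    )
    penalty = penalty_strength / 100 * (1 - discriminative_score)
    return base * (1 - penalty)


# Same template, same structured data: the penalty has no effect.
same_data = confidence(0.9, 0.8, discriminative_score=1.0)   # about 0.86

# Same template, half of the structured data differs.
mixed_data = confidence(0.9, 0.8, discriminative_score=0.5)  # about 0.559

# Same template, all structured data differs.
other_data = confidence(0.9, 0.8, discriminative_score=0.0)  # about 0.258
```

In this example the base score is 0.86 in all three cases; only the discriminative score moves the result.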

Strength guidelines:

Low (0-30%): Minimal impact. Monthly invoices or train tickets with different dates may still appear as duplicates.
Medium (40-70%): Recommended for most libraries. Catches template-based false positives while keeping true duplicates intact.
High (80-100%): Aggressive. Best for libraries with many monthly invoices, bank statements, or train/flight tickets. May over-penalize minor OCR differences in dates or amounts.

When any weight or penalty strength changes, existing group confidence scores are recalculated automatically.

Example API Updates

# Update threshold
curl -X PUT http://localhost:3000/api/v1/config/dedup \
  -H 'Content-Type: application/json' \
  -d '{"similarityThreshold":0.8}'

# Rebalance weights (must sum to 100)
curl -X PUT http://localhost:3000/api/v1/config/dedup \
  -H 'Content-Type: application/json' \
  -d '{
    "confidenceWeightJaccard":70,
    "confidenceWeightFuzzy":30
  }'

# Adjust discriminative penalty strength (0 = disabled, 100 = maximum)
curl -X PUT http://localhost:3000/api/v1/config/dedup \
  -H 'Content-Type: application/json' \
  -d '{"discriminativePenaltyStrength":75}'
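The same updates can be sent from a script. The Python sketch below builds the PUT request with the standard library and checks the documented constraints before sending; `build_dedup_update` is a hypothetical helper, and the client-side checks are a convenience, not a description of how the server validates input. The weight check only runs when both weights are present, since it is not stated here whether the API accepts a single weight on its own.

```python
import json
from urllib.request import Request


def build_dedup_update(settings: dict,
                       base_url: str = "http://localhost:3000") -> Request:
    """Validate a partial settings dict and build the PUT request."""
    weights = [settings.get("confidenceWeightJaccard"),
               settings.get("confidenceWeightFuzzy")]
    if all(w is not None for w in weights) and sum(weights) != 100:
        raise ValueError("confidence weights must sum to 100")

    strength = settings.get("discriminativePenaltyStrength")
    if strength is not None and not 0 <= strength <= 100:
        raise ValueError("discriminativePenaltyStrength must be 0-100")

    threshold = settings.get("similarityThreshold")
    if threshold is not None and not 0 <= threshold <= 1:
        raise ValueError("similarityThreshold must be 0-1")

    return Request(
        f"{base_url}/api/v1/config/dedup",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the returned request to `urllib.request.urlopen` sends it, for example `urlopen(build_dedup_update({"similarityThreshold": 0.8}))`.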