Configuration Reference

Paperless NGX Dedupe reads configuration from two sources:

  • Environment variables for server/runtime behavior
  • Dedup settings stored in the app database and editable at runtime

Environment Variables

Core Runtime

| Variable | Required | Default | Notes |
| --- | --- | --- | --- |
| PAPERLESS_URL | Yes | - | Full Paperless-NGX URL (for example http://paperless:8000) |
| PAPERLESS_API_TOKEN | Yes* | - | Preferred auth method |
| PAPERLESS_USERNAME | No | - | Use with PAPERLESS_PASSWORD when not using token |
| PAPERLESS_PASSWORD | No | - | Use with PAPERLESS_USERNAME |
| DATABASE_URL | No | ./data/paperless-ngx-dedupe.db | SQLite file path |
| PORT | No | 3000 | Web/API listen port |
| LOG_LEVEL | No | info | debug, info, warn, error |
| CORS_ALLOW_ORIGIN | No | empty | Empty = same-origin only; * = allow all |
| AUTO_MIGRATE | No | true | Auto-run DB schema migration on startup |

* Provide either PAPERLESS_API_TOKEN or both PAPERLESS_USERNAME + PAPERLESS_PASSWORD.

If both token and username/password are set, token is used first.
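The precedence rule above can be sketched in a few lines of Python. This is an illustration only, not the application's code: the helper name `resolve_auth` and the tuple it returns are hypothetical, and only the three variable names come from the table.

```python
from typing import Mapping, Optional, Tuple


def resolve_auth(env: Mapping[str, str]) -> Tuple[str, Optional[str], Optional[str]]:
    """Pick an auth method from environment-style settings.

    Hypothetical helper: returns ("token", token, None) or
    ("basic", username, password). The token wins when both are set.
    """
    token = env.get("PAPERLESS_API_TOKEN")
    username = env.get("PAPERLESS_USERNAME")
    password = env.get("PAPERLESS_PASSWORD")

    if token:
        # Token is checked first, matching the documented precedence.
        return ("token", token, None)
    if username and password:
        # Username and password are only usable as a pair.
        return ("basic", username, password)
    raise ValueError(
        "Set PAPERLESS_API_TOKEN, or both PAPERLESS_USERNAME and PAPERLESS_PASSWORD"
    )
```

Calling `resolve_auth(os.environ)` before starting the container is one way to catch a half-configured username/password pair early.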

Container Runtime

| Variable | Required | Default | Notes |
| --- | --- | --- | --- |
| PUID | No | 1000 | UID used inside the container |
| PGID | No | 1000 | GID used inside the container |

SvelteKit / Proxy

| Variable | Required | Default | Notes |
| --- | --- | --- | --- |
| ORIGIN | Usually no | - | Set when running behind reverse proxies or non-localhost hostnames to satisfy origin checks |

AI Processing (Optional)

| Variable | Required | Default | Notes |
| --- | --- | --- | --- |
| AI_ENABLED | No | false | Enable AI-powered document classification |
| AI_OPENAI_API_KEY | When AI enabled | - | OpenAI API key |

The API key is required when AI_ENABLED=true. Runtime settings (model, prompt, etc.) are configured in the Settings page or via API. See AI Processing for full details.

Document Q&A / RAG (Optional)

| Variable | Required | Default | Notes |
| --- | --- | --- | --- |
| RAG_ENABLED | No | false | Enable natural language Q&A across your documents |
| AI_OPENAI_API_KEY | When RAG enabled | - | Required for generating embeddings and answers |

RAG_ENABLED is independent of AI_ENABLED: you can use Q&A without AI classification, or enable both. The OpenAI key is always required when RAG is enabled. Runtime settings (embedding model, chunk size, answer model, etc.) are configured in the Settings page or via API. See Document Q&A for full details.

Observability (Optional)

OpenTelemetry is off unless OTEL_ENABLED=true. Common variables:

  • OTEL_ENABLED
  • OTEL_SERVICE_NAME
  • OTEL_EXPORTER_OTLP_ENDPOINT (or per-signal endpoints)
  • OTEL_TRACES_EXPORTER, OTEL_METRICS_EXPORTER, OTEL_LOGS_EXPORTER

See .env.example for the full list.

| Variable | Required | Default | Notes |
| --- | --- | --- | --- |
| OTEL_SERVICE_NAMESPACE | No | paperless-dedupe | Groups frontend and backend as one app in Grafana Cloud App Observability |
| OTEL_EXPORTER_OTLP_COMPRESSION | No | (none) | Set to gzip for Grafana Cloud (recommended) |
| OTEL_SEMCONV_STABILITY_OPT_IN | No | (none) | Set to database to use stable DB semantic conventions |

Continuous Profiling (Optional)

| Variable | Required | Default | Notes |
| --- | --- | --- | --- |
| PYROSCOPE_ENABLED | No | false | Enable wall-time and heap profiling |
| PYROSCOPE_SERVER_ADDRESS | When Pyroscope enabled | - | Grafana Cloud Pyroscope endpoint or self-hosted URL |
| PYROSCOPE_BASIC_AUTH_USER | For Grafana Cloud | - | Grafana Cloud instance ID |
| PYROSCOPE_BASIC_AUTH_PASSWORD | For Grafana Cloud | - | Grafana Cloud API key |

Profiles are labeled by operation (sync, analysis, ai_batch, worker) for flame graph filtering.

Prometheus Scrape Endpoint (Optional)

| Variable | Required | Default | Notes |
| --- | --- | --- | --- |
| OTEL_PROMETHEUS_ENABLED | No | false | Expose a Prometheus scrape endpoint at /api/v1/metrics |

When enabled, all application metrics (sync, analysis, jobs, AI, observable gauges) are available in Prometheus exposition format at GET /api/v1/metrics. This can be used standalone (without OTEL_ENABLED) or alongside full OTEL for both push and pull metrics.

When both are active, the Prometheus endpoint exposes the same metrics as the OTLP pipeline.

Paperless-NGX System Metrics (Optional)

When PAPERLESS_METRICS_ENABLED=true, Paperless NGX Dedupe collects system-level metrics from your Paperless-NGX instance, including storage, document counts, tags, and correspondents. This provides the same observability as running a separate prometheus-paperless-exporter container, but delivers it through whichever metrics pipeline you have active (OTLP, Prometheus, or both), so there is one fewer container to manage.

Metric names match the Prometheus exporter exactly (e.g. paperless_status_storage_total_bytes, paperless_statistics_documents_total) for Grafana dashboard compatibility.

Separately opt-in

This feature is opt-in independently of OTEL_ENABLED and OTEL_PROMETHEUS_ENABLED because its collectors poll the Paperless-NGX API on every export interval (about 60 seconds by default), which adds load to your Paperless-NGX instance. If that load is a concern, enable only the collectors you need.

| Variable | Required | Default | Notes |
| --- | --- | --- | --- |
| PAPERLESS_METRICS_ENABLED | No | false | Enable Paperless system metrics collection. Requires OTEL_ENABLED=true or OTEL_PROMETHEUS_ENABLED=true. |
| PAPERLESS_METRICS_COLLECTORS | No | all | Comma-separated list of collectors to enable |

Available collectors:

| Collector | API Calls | Description |
| --- | --- | --- |
| status | 1 | Storage, database, Redis, Celery, index, classifier, and sanity check status |
| statistics | 1 + paginated | Document totals, inbox count, file type breakdown, character count, metadata counts |
| document | 1 | Total document count |
| tag | paginated | Per-tag info, document counts, inbox flag |
| correspondent | paginated | Per-correspondent info, document counts, last correspondence timestamp |
| document_type | paginated | Per-document-type info and document counts |
| storage_path | paginated | Per-storage-path info and document counts |
| task | 1 | Background task info, status, timestamps |
| group | 1 | User group count |
| user | 1 | User count |
| remote_version | 1 | Update availability check (causes Paperless-NGX to make an outbound network call) |

All collectors are enabled by default. To enable only specific collectors:

PAPERLESS_METRICS_COLLECTORS=status,statistics,document

Metrics are collected on the same interval as OTEL metric exports (controlled by OTEL_METRIC_EXPORT_INTERVAL, default 60s). Instances with many tags, correspondents, or document types will produce proportionally more time series from the labeled collectors (tag, correspondent, document_type, storage_path). Disable these if cardinality is a concern.
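If cardinality is a concern, one way to build a reduced collector list is sketched below. The collector names are taken from the table above; the helper `collectors_value` and the two constants are hypothetical and only assemble the comma-separated string, they are not part of the application.

```python
# Collector names in the order they appear in the table above.
ALL_COLLECTORS = [
    "status", "statistics", "document", "tag", "correspondent",
    "document_type", "storage_path", "task", "group", "user",
    "remote_version",
]

# The labeled collectors that produce one time series per tag,
# correspondent, document type, or storage path.
HIGH_CARDINALITY = frozenset({"tag", "correspondent", "document_type", "storage_path"})


def collectors_value(exclude=frozenset()):
    """Return a comma-separated collector list without the excluded names."""
    return ",".join(name for name in ALL_COLLECTORS if name not in exclude)


print("PAPERLESS_METRICS_COLLECTORS=" + collectors_value(HIGH_CARDINALITY))
# PAPERLESS_METRICS_COLLECTORS=status,statistics,document,task,group,user,remote_version
```

Adding `remote_version` to the excluded set also avoids the outbound update check noted in the table.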

Credit: metric definitions and collector design inspired by prometheus-paperless-exporter by hansmi.

Deduplication Settings

Change these in Settings or via PUT /api/v1/config/dedup.

Algorithm Parameters

| Setting | Default | Range | Notes |
| --- | --- | --- | --- |
| numPermutations | 256 | 16-1024 | MinHash signature length |
| numBands | 32 | 1-100 | LSH bands; should divide numPermutations evenly |
| ngramSize | 3 | 1-10 | Word shingle size |
| minWords | 20 | 1-1000 | Skip very short docs below this |
| similarityThreshold | 0.75 | 0-1 | Minimum overall similarity to keep a pair |
| fuzzySampleSize | 10000 | 100-100000 | Character sample size for fuzzy compare |
| autoAnalyze | true | boolean | Auto-run analysis after sync |
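The relationship between numPermutations and numBands can be checked before saving settings. The sketch below is hypothetical client-side code: `rows_per_band` enforces the ranges from the table and treats the "should divide evenly" recommendation as a hard error, which is stricter than the documentation states. `candidate_probability` uses the textbook LSH banding estimate, 1 - (1 - s^r)^b; whether the application follows exactly this scheme is an assumption.

```python
def rows_per_band(num_permutations: int, num_bands: int) -> int:
    """Return the number of MinHash values per LSH band.

    Ranges come from the settings table. Rejecting uneven splits is a
    choice made for this sketch, not documented application behavior.
    """
    if not 16 <= num_permutations <= 1024:
        raise ValueError("numPermutations must be between 16 and 1024")
    if not 1 <= num_bands <= 100:
        raise ValueError("numBands must be between 1 and 100")
    if num_permutations % num_bands != 0:
        raise ValueError("numBands should divide numPermutations evenly")
    return num_permutations // num_bands


def candidate_probability(similarity: float, num_bands: int, rows: int) -> float:
    """Textbook estimate that two documents collide in at least one band."""
    return 1 - (1 - similarity ** rows) ** num_bands


rows = rows_per_band(256, 32)
print(rows)  # 8
print(candidate_probability(0.75, 32, rows))
```

With the defaults, each band holds 8 signature values. Under the textbook model, more bands with fewer rows each make it more likely that moderately similar pairs become candidates, at the cost of more comparisons.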

Confidence Weights

The confidence model combines a two-weight base score with a discriminative penalty.

Base weights are integers 0-100 and must sum to 100:

| Setting | Default | Notes |
| --- | --- | --- |
| confidenceWeightJaccard | 60 | Weight for Jaccard (set overlap) similarity |
| confidenceWeightFuzzy | 40 | Weight for fuzzy (edit distance) similarity |

The discriminative penalty reduces confidence when template-based documents have different structured data (dates, amounts, invoice numbers, routes):

| Setting | Default | Range | Notes |
| --- | --- | --- | --- |
| discriminativePenaltyStrength | 70 | 0-100 | How aggressively differing structured data reduces confidence (0 = disabled) |

The final confidence formula is:

base  = (jaccard × J_weight + fuzzy × F_weight) / (J_weight + F_weight)
final = base × (1 - penalty_strength/100 × (1 - discriminative_score))

When the discriminative score is high (documents share the same dates, amounts, and references), the penalty has little effect. When it is low (documents have different dates, amounts, invoice numbers, or routes despite sharing a template), the penalty reduces the confidence score.
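To make the formula concrete, here is a small worked example in Python. It transcribes the two lines of the formula directly; the function name `confidence` is hypothetical, and the sketch assumes that the Jaccard, fuzzy, and discriminative scores are all expressed on a 0-1 scale.

```python
def confidence(jaccard, fuzzy, discriminative_score,
               j_weight=60, f_weight=40, penalty_strength=70):
    """Apply the documented formula with the default weights and strength.

    Assumes all three input scores are in the range 0-1.
    """
    base = (jaccard * j_weight + fuzzy * f_weight) / (j_weight + f_weight)
    return base * (1 - penalty_strength / 100 * (1 - discriminative_score))


# Same dates, amounts, and references: the penalty has no effect.
print(round(confidence(0.9, 0.8, 1.0), 4))  # 0.86

# Shared template but mostly different structured data: strong reduction.
print(round(confidence(0.9, 0.8, 0.2), 4))  # 0.3784
```

In the second call the base score is still 0.86, but the penalty factor is 1 - 0.7 × 0.8 = 0.44, so the final confidence drops to 0.3784.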

Strength guidelines:

  • Low (0-30%): Minimal impact. Monthly invoices or train tickets with different dates may still appear as duplicates.
  • Medium (40-70%): Recommended for most libraries. Catches template-based false positives while keeping true duplicates intact.
  • High (80-100%): Aggressive. Best for libraries with many monthly invoices, bank statements, or train/flight tickets. May over-penalize minor OCR differences in dates or amounts.

When any weight or penalty strength changes, existing group confidence scores are recalculated automatically.

Example API Updates

# Update threshold
curl -X PUT http://localhost:3000/api/v1/config/dedup \
  -H 'Content-Type: application/json' \
  -d '{"similarityThreshold":0.8}'

# Rebalance weights (must sum to 100)
curl -X PUT http://localhost:3000/api/v1/config/dedup \
  -H 'Content-Type: application/json' \
  -d '{
    "confidenceWeightJaccard":70,
    "confidenceWeightFuzzy":30
  }'

# Adjust discriminative penalty strength (0 = disabled, 100 = maximum)
curl -X PUT http://localhost:3000/api/v1/config/dedup \
  -H 'Content-Type: application/json' \
  -d '{"discriminativePenaltyStrength":75}'
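
The same updates can be prepared from a script. The following Python sketch builds the PUT request with the standard library and checks the weight sum first; the helper `build_dedup_update` is hypothetical, the client-side check is an extra convenience rather than documented behavior, and the base URL assumes the default port on localhost as in the curl examples.

```python
import json
import urllib.request


def build_dedup_update(settings, base_url="http://localhost:3000"):
    """Build (but do not send) a PUT request for /api/v1/config/dedup."""
    j = settings.get("confidenceWeightJaccard")
    f = settings.get("confidenceWeightFuzzy")
    # Only check the sum when both weights are supplied together.
    if j is not None and f is not None and j + f != 100:
        raise ValueError("confidence weights must sum to 100")

    return urllib.request.Request(
        base_url + "/api/v1/config/dedup",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_dedup_update({"confidenceWeightJaccard": 70, "confidenceWeightFuzzy": 30})
print(req.get_method(), req.full_url)
# PUT http://localhost:3000/api/v1/config/dedup
```

Passing the returned request to `urllib.request.urlopen` sends it to a running instance.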